// experiments

Real benchmarks.
Honest limitations.

Every post here is backed by actual tests on open-source models — structured output failures, adversarial guardrails, context position bias, RAG compliance. Run locally. Published with full scope.

10 experiments
7 models tested
2,000+ test cases

// archive

02
Fine-Tuning

Fine-Tuning Gemma 4 E2B: Notes from a Weekend

Spent a week doing LoRA fine-tuning on Gemma 4 E2B (~5.1B total params, ~2B active in text decoder) for a narrow Python code-generation task. Bad outputs went from ~5% to 0% (greedy) and 1.5% (sampled) across 134 tests. The fixes weren't more data or compute. They were three uncomfortable lessons about what LLMs actually are.

Bad outputs from 5% → 0% (greedy) and 1.5% (sampled) across 134 tests. The fixes weren't more data or compute — they were three lessons about what LLMs actually are.

Apr 2026 · 8 min
03
Benchmarks · Part 2 · Gemma 4 Benchmarks

Gemma 4 E2B vs the Gemma Family: The 2B Underdog That Punches Above Its Weight

After last month's Gemma 4 E4B benchmark, the obvious follow-up: can the 2B variant deliver real generational improvement at constant parameter count? It can. Multi-turn doubled. RAG grounding jumped 17 points. The E2B scored 80.4% overall — 0.4 points behind a model with twice its parameters.

Gemma 4 E2B scored 80.4% overall — beating Gemma 2 2B by 3 points at the same parameter count. Multi-turn at 70% is the highest in the entire family.

Apr 2026 · 13 min
04
Benchmarks · Part 1 · Gemma 4 Benchmarks

Gemma 4 E4B vs the Gemma Family: Enterprise Benchmark Showdown

We ran Gemma 4 E4B through 8 enterprise test suites — function calling, RAG grounding, classification, code generation, summarization, information extraction, multilingual, and multi-turn — and compared it head-to-head against three other Gemma models. The 4B model scored 83.6% overall, beating even the 12B.

Gemma 4 E4B (4B params) scored 83.6% overall — beating the 3x larger Gemma 3 12B (82.3%) across 8 enterprise suites.

Apr 2026 · 12 min
05
Benchmarks

Structured JSON Output from Small LLMs

You know that feeling when you ask an AI to return data in a specific structure, and everything looks clean — but the actual content is quietly wrong? I ran 1,500+ tests across 7 small open-source models.

A well-instructed 2B model jumped from 30% to 90% compliance — outperforming models 3–4x its size.

Feb 2026 · 5 min
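The gap between "looks clean" and "quietly wrong" in the structured-output post can be sketched as three separate checks: does the output parse, does it have the expected fields and types, and are the values themselves acceptable. This is a minimal illustration, not the post's actual test harness; the field names, allowed values, and the `check_output` helper are all hypothetical.

```python
import json

# Hypothetical schema for illustration only (not taken from the post).
REQUIRED_FIELDS = {"sentiment": str, "score": float}
ALLOWED_SENTIMENTS = {"positive", "negative", "neutral"}


def check_output(raw):
    """Grade a model response at three levels: parse, schema, content."""
    report = {"parses": False, "schema_ok": False, "content_ok": False}

    # Level 1: is it valid JSON at all?
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return report
    report["parses"] = True

    # Level 2: is it an object with the required fields and types?
    if not isinstance(data, dict):
        return report
    report["schema_ok"] = all(
        isinstance(data.get(key), expected)
        for key, expected in REQUIRED_FIELDS.items()
    )

    # Level 3: are the values within the allowed set and range?
    if report["schema_ok"]:
        report["content_ok"] = (
            data["sentiment"] in ALLOWED_SENTIMENTS
            and 0.0 <= data["score"] <= 1.0
        )
    return report
```

A response such as `{"sentiment": "great", "score": 0.9}` passes the first two levels and fails only the third, which is the kind of failure a parse-only check would miss.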
06
RAG · Experiments

Context Position Bias in Small LLMs

The "Lost in the Middle" paper showed that large models perform worst when important information is buried in the middle of long contexts. I tested whether small 2–4B models behave the same way. They don't.

Each architecture fails differently. Gemma-2B has strong recency bias (p=0.023). Llama-3B is completely flat (p=1.0).

Feb 2026 · 4 min
07
LLM Security · Part 4 · RAG Compliance

RAG Compliance Week 4: 100% Recall

Week 1: 80% F1. Week 2: Llama Guard hit 53% F1. Week 3: Prompt injection testing. NeMo hit 55% recall. Enforcement engine hit 93%. 4 attacks still got through. Today: 100% recall. 0 missed.

v2 accuracy dropped from 68% to 65%: it blocks 7 more benign queries in exchange for eliminating the final 4 missed attacks.

Feb 2026 · 3 min
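The trade-off in the Week 4 post (recall goes up while accuracy goes down) follows directly from how the metrics are defined. The sketch below uses the standard confusion-matrix formulas; the counts are made up for illustration and are not the post's data, and `guardrail_metrics` is a hypothetical helper name.

```python
def guardrail_metrics(tp, fp, tn, fn):
    """Standard metrics for a block/allow classifier.

    tp: attacks blocked      fn: attacks missed
    fp: benign blocked       tn: benign allowed
    """
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    accuracy = (tp + tn) / total if total else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
    }


# Illustrative counts only: 20 attacks and 30 benign queries.
v1 = guardrail_metrics(tp=16, fp=10, tn=20, fn=4)  # misses 4 attacks
v2 = guardrail_metrics(tp=20, fp=17, tn=13, fn=0)  # misses none, blocks 7 more benign
```

With these assumed counts, the stricter configuration catches every attack but blocks more benign queries, so recall rises while accuracy falls. Whether that trade is worth it depends on how costly a missed attack is compared with a false block.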
08
LLM Security · Part 3 · RAG Compliance

NeMo Guardrails vs Prompt Injections

Week 3 of the RAG compliance series. I ran two separate tests: 17 high-risk compliance queries and 85 prompt injection attacks. The head-to-head results were eye-opening.

NeMo: 55% recall. Llama Guard: 58% recall. Enforcement Engine: 93% recall on prompt injections.

Jan 2026 · 4 min
09
LLM Security · Part 2 · RAG Compliance

Llama Guard vs Enforcement Engine

I ran a head-to-head benchmark using the same 17 adversarial queries and 82 compliance rules. Llama Guard 3: 53% F1. Enforcement Engine: 80% F1. The gap comes down to what the model is 'looking' for.

Llama Guard asks: 'Is this text harmful?' Enforcement Engine asks: 'Does this violate compliance rule #42?'

Jan 2026 · 4 min
10
LLM Security · Part 1 · RAG Compliance

RAG Compliance Enforcement Engine

Two posts convinced me that RAG alone isn't enough for compliance. So I tested it. Baseline RAG blocked 15–23% of violations. With the enforcement layer: 85%. Architecture mattered more than model size.

Baseline RAG: 15–23% block rate. With tiered enforcement: 85%. Architecture dominated over model size.

Jan 2026 · 6 min