// blog

All experiments.
All frameworks.

Real benchmarks. Architecture frameworks. Fine-tuning lessons. RAG compliance. Filter by topic or browse the full archive below.

// archive

02
Benchmarks · Part 2 · Gemma 4 Benchmarks

Gemma 4 E2B vs the Gemma Family: The 2B Underdog That Punches Above Its Weight

After last month's Gemma 4 E4B benchmark, the obvious follow-up: can the 2B variant deliver real generational improvement at constant parameter count? It can. Multi-turn doubled. RAG grounding jumped 17 points. The E2B scored 80.4% overall — 0.4 points behind a model with twice its parameters.

Gemma 4 E2B scored 80.4% overall — beating Gemma 2 2B by 3 points at the same parameter count. Multi-turn at 70% is the highest in the entire family.

Apr 2026 · 13 min
03
Benchmarks · Part 1 · Gemma 4 Benchmarks

Gemma 4 E4B vs the Gemma Family: Enterprise Benchmark Showdown

We ran Gemma 4 E4B through 8 enterprise test suites — function calling, RAG grounding, classification, code generation, summarization, information extraction, multilingual, and multi-turn — and compared it head-to-head against three other Gemma models. The 4B model scored 83.6% overall, beating even the 12B.

Gemma 4 E4B (4B params) scored 83.6% overall — beating the 3x larger Gemma 3 12B (82.3%) across 8 enterprise suites.

Apr 2026 · 12 min
04
Architecture

The Black Box Problem

Every abstraction layer in computing let you drop down and look when things broke. AI is the first layer where you can't. The bugs are non-deterministic by design.

You don't debug the model. You debug everything around it.

Mar 2026 · 4 min
05
Business Case

Same Team, 2× Output

Everyone keeps saying AI will reduce the number of developers. After months of working with these tools, I think they're looking at it wrong. AI reduces the cost of shipping a feature with the same team.

10 developers shipping 20 features a quarter → same 10 developers with AI shipping 40–50 features.

Mar 2026 · 4 min
06
Architecture

AI as Translator, Not Decision-Maker

If AI is a black box you can't debug, how do you trust it in production? Honest answer: you don't. Not for everything. Here's the framework I keep coming back to.

Human (messy) → AI translates → Structured data → Traditional code processes → Output

Mar 2026 · 5 min
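The pipeline in this post (AI translates, traditional code decides) can be sketched in a few lines. Everything here is illustrative, not the post's actual implementation: the intent schema, the `ai_translate` stub, and the refund rule are all assumptions.

```python
# Sketch of the "AI as translator" pattern: the model only converts
# messy human input into structured data; all decisions are made by
# ordinary, debuggable code.

def ai_translate(message: str) -> dict:
    # Stand-in for a real LLM call. Assumed to return strict JSON like:
    # {"intent": "refund", "order_id": "A123", "amount": 49.0}
    raise NotImplementedError("wire up your model client here")

def process(structured: dict) -> str:
    # Deterministic business logic: the part you can actually debug.
    if structured.get("intent") == "refund" and structured.get("amount", 0) <= 100:
        return f"refund approved for order {structured['order_id']}"
    return "escalate to a human"

def handle(message: str) -> str:
    structured = ai_translate(message)  # AI translates
    return process(structured)          # traditional code decides
```

The key property: when `process` misbehaves, you debug it with a breakpoint like any other code, because the model never touched the decision.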
07
Architecture

AI as the Next Abstraction Layer

I asked Claude directly: 'Why should I use you?' And it gave me the clearest framing I've heard. The history of computing is a story of rising abstraction layers.

Unlike a compiler, this layer can be wrong. The new discipline is review.

Mar 2026 · 3 min
08
Craft

The Cognitive Friction Problem

When I quit using AI writing tools, I realized I had completely lost my tolerance for cognitive friction. The moment a thought became difficult to articulate, my instinct was to reach for the AI escape hatch.

AI is an incredible tool, but mindless reliance is a cognitive trap.

Feb 2026 · 3 min
09
Benchmarks

Structured JSON Output from Small LLMs

You know that feeling when you ask an AI to return data in a specific structure, and everything looks clean — but the actual content is quietly wrong? I ran 1,500+ tests across 7 small open-source models.

A well-instructed 2B model jumped from 30% to 90% compliance — outperforming models 3–4x its size.

Feb 2026 · 5 min
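A compliance check of the kind these tests imply can be very small: parse the reply, then verify structure and value types, since output can look clean while the content is quietly wrong. The schema and key names below are assumptions, not the benchmark's actual harness.

```python
import json

# Illustrative schema: each key must exist with exactly this type.
EXPECTED = {"name": str, "priority": int, "tags": list}

def check_compliance(raw: str) -> bool:
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False  # not even valid JSON
    if not isinstance(data, dict) or set(data) != set(EXPECTED):
        return False  # missing or extra keys
    # Keys are right -- now check the types, where small models
    # most often go quietly wrong (e.g. "priority": "high").
    return all(isinstance(data[k], t) for k, t in EXPECTED.items())
```

Scoring compliance as the fraction of replies passing a check like this is what makes a "30% to 90%" jump measurable rather than anecdotal.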
10
RAG · Experiments

Context Position Bias in Small LLMs

The "Lost in the Middle" paper showed that large models perform worst when important information is buried in the middle of long contexts. I tested whether small 2–4B models behave the same way. They don't.

Each architecture fails differently. Gemma-2B has strong recency bias (p=0.023). Llama-3B is completely flat (p=1.0).

Feb 2026 · 4 min
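A position-bias probe of this kind can be built by hand: plant one needle fact at the start, middle, or end of a stack of filler passages and ask the model to retrieve it. The filler text and needle below are illustrative, not the experiment's actual corpus.

```python
def build_context(needle: str, fillers: list[str], position: str) -> str:
    """Insert the needle passage at a controlled position in the context."""
    docs = list(fillers)
    index = {"start": 0, "middle": len(docs) // 2, "end": len(docs)}[position]
    docs.insert(index, needle)
    return "\n\n".join(docs)

# Hypothetical example: 10 distractor passages, one needle, middle position.
fillers = [f"Passage {i}: unrelated background text." for i in range(10)]
needle = "Passage X: the access code is 7414."
prompt = build_context(needle, fillers, "middle") + "\n\nQ: What is the access code?"
```

Running the same question with the needle at each position, many times per position, is what lets you attach a p-value to a claim like "recency bias" instead of eyeballing it.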
11
LLM Security · Part 4 · RAG Compliance

RAG Compliance Week 4: 100% Recall

Week 1: 80% F1. Week 2: Llama Guard hit 53% F1. Week 3: prompt injection testing, where NeMo hit 55% recall, the enforcement engine hit 93%, and 4 attacks still got through. Today: 100% recall, 0 missed.

v2 accuracy dropped from 68% to 65%, blocking 7 more benign queries to eliminate the final 4 missed attacks.

Feb 2026 · 3 min
12
Production AI

Arnab vs AI: Intelligence vs Conscience

I watched the unscripted face-off between Arnab Goswami and Blue Machines (an enterprise Voice AI). It highlighted exactly where the line is drawn between Intelligence and Conscience.

We often talk about AI replacing jobs, but this face-off showed where intelligence ends and conscience begins.

Feb 2026 · 1 min
13
LLM Security · Part 3 · RAG Compliance

NeMo Guardrails vs Prompt Injections

Week 3 of the RAG compliance series. I ran two separate tests: 17 high-risk compliance queries and 85 prompt injection attacks. The head-to-head results were eye-opening.

NeMo: 55% recall. Llama Guard: 58% recall. Enforcement Engine: 93% recall on prompt injections.

Jan 2026 · 4 min
14
LLM Security · Part 2 · RAG Compliance

Llama Guard vs Enforcement Engine

I ran a head-to-head benchmark using the same 17 adversarial queries and 82 compliance rules. Llama Guard 3: 53% F1. Enforcement Engine: 80% F1. The gap comes down to what the model is 'looking' for.

Llama Guard asks: 'Is this text harmful?' Enforcement Engine asks: 'Does this violate compliance rule #42?'

Jan 2026 · 4 min
15
LLM Security · Part 1 · RAG Compliance

RAG Compliance Enforcement Engine

Two posts convinced me that RAG alone isn't enough for compliance. So I tested it. Baseline RAG blocked 15–23% of violations. With the enforcement layer: 85%. Architecture mattered more than model size.

Baseline RAG: 15–23% block rate. With tiered enforcement: 85%. Architecture dominated over model size.

Jan 2026 · 6 min