// blog

All experiments.
All frameworks.

Real benchmarks. Architecture frameworks. Fine-tuning lessons. RAG compliance. Filter by topic or browse the full archive below.

// archive

02
Benchmarks · Part 2 · Gemma 4 Benchmarks

Gemma 4 E2B vs the Gemma Family: The 2B Underdog That Punches Above Its Weight

After last month's Gemma 4 E4B benchmark, the obvious follow-up: can the 2B variant deliver real generational improvement at constant parameter count? It can. Multi-turn doubled. RAG grounding jumped 17 points. The E2B scored 80.4% overall — 0.4 points behind a model with twice its parameters.

Gemma 4 E2B scored 80.4% overall — beating Gemma 2 2B by 3 points at the same parameter count. Multi-turn at 70% is the highest in the entire family.

Apr 2026 · 13 min
03
Benchmarks · Part 1 · Gemma 4 Benchmarks

Gemma 4 E4B vs the Gemma Family: Enterprise Benchmark Showdown

We ran Gemma 4 E4B through 8 enterprise test suites — function calling, RAG grounding, classification, code generation, summarization, information extraction, multilingual, and multi-turn — and compared it head-to-head against three other Gemma models. The 4B model scored 83.6% overall, beating even the 12B.

Gemma 4 E4B (4B params) scored 83.6% overall — beating the 3x larger Gemma 3 12B (82.3%) across 8 enterprise suites.

Apr 2026 · 12 min
04
Architecture

The Black Box Problem

Every abstraction layer in computing let you drop down and look when things broke. AI is the first layer where you can't. The bugs are non-deterministic by design.

You don't debug the model. You debug everything around it.

Mar 2026 · 4 min
05
Business Case

Same Team, 2× Output

Everyone keeps saying AI will reduce the number of developers. After months of working with these tools, I think they're looking at it wrong. AI reduces the cost of shipping a feature with the same team.

10 developers shipping 20 features a quarter → same 10 developers with AI shipping 40–50 features.

Mar 2026 · 4 min
06
Architecture

AI as Translator, Not Decision-Maker

If AI is a black box you can't debug, how do you trust it in production? Honest answer: you don't. Not for everything. Here's the framework I keep coming back to.

Human (messy) → AI translates → Structured data → Traditional code processes → Output

Mar 2026 · 5 min
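The pipeline in this post (AI translates, traditional code decides) can be sketched in a few lines. Everything here is illustrative, not the post's actual implementation: the intent schema, the `ai_translate` stub, and the refund rule are all assumptions.

```python
# Sketch of the "AI as translator" pattern: the model only converts
# messy human input into structured data; all decisions are made by
# ordinary, debuggable code.

def ai_translate(message: str) -> dict:
    # Stand-in for a real LLM call. Assumed to return strict JSON like:
    # {"intent": "refund", "order_id": "A123", "amount": 49.0}
    raise NotImplementedError("wire up your model client here")

def process(structured: dict) -> str:
    # Deterministic business logic: the part you can actually debug.
    if structured.get("intent") == "refund" and structured.get("amount", 0) <= 100:
        return f"refund approved for order {structured['order_id']}"
    return "escalate to a human"

def handle(message: str) -> str:
    structured = ai_translate(message)  # AI translates
    return process(structured)          # traditional code decides
```

The key property: when `process` misbehaves, you debug it with a breakpoint like any other code, because the model never touched the decision.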
07
Architecture

AI as the Next Abstraction Layer

I asked Claude directly: 'Why should I use you?' And it gave me the clearest framing I've heard. The history of computing is a story of rising abstraction layers.

Unlike a compiler, this layer can be wrong. The new discipline is review.

Mar 2026 · 3 min
08
Craft

The Cognitive Friction Problem

When I quit using AI writing tools, I realized I had completely lost my tolerance for cognitive friction. The moment a thought became difficult to articulate, my instinct was to reach for the AI escape hatch.

AI is an incredible tool, but mindless reliance is a cognitive trap.

Feb 2026 · 3 min
09
Benchmarks

Structured JSON Output from Small LLMs

You know that feeling when you ask an AI to return data in a specific structure, and everything looks clean — but the actual content is quietly wrong? I ran 1,500+ tests across 7 small open-source models.

A well-instructed 2B model jumped from 30% to 90% compliance — outperforming models 3–4x its size.

Feb 2026 · 5 min
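A compliance check of the kind these tests imply can be very small: parse the reply, then verify structure and value types, since output can look clean while the content is quietly wrong. The schema and key names below are assumptions, not the benchmark's actual harness.

```python
import json

# Illustrative schema: each key must exist with exactly this type.
EXPECTED = {"name": str, "priority": int, "tags": list}

def check_compliance(raw: str) -> bool:
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False  # not even valid JSON
    if not isinstance(data, dict) or set(data) != set(EXPECTED):
        return False  # missing or extra keys
    # Keys are right -- now check the types, where small models
    # most often go quietly wrong (e.g. "priority": "high").
    return all(isinstance(data[k], t) for k, t in EXPECTED.items())
```

Scoring compliance as the fraction of replies passing a check like this is what makes a "30% to 90%" jump measurable rather than anecdotal.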
10
RAG · Experiments

Context Position Bias in Small LLMs

The "Lost in the Middle" paper showed that large models perform worst when important information is buried in the middle of long contexts. I tested whether small 2–4B models behave the same way. They don't.

Each architecture fails differently. Gemma-2B has strong recency bias (p=0.023). Llama-3B is completely flat (p=1.0).

Feb 2026 · 4 min
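A position-bias probe of this kind can be built by hand: plant one needle fact at the start, middle, or end of a stack of filler passages and ask the model to retrieve it. The filler text and needle below are illustrative, not the experiment's actual corpus.

```python
def build_context(needle: str, fillers: list[str], position: str) -> str:
    """Insert the needle passage at a controlled position in the context."""
    docs = list(fillers)
    index = {"start": 0, "middle": len(docs) // 2, "end": len(docs)}[position]
    docs.insert(index, needle)
    return "\n\n".join(docs)

# Hypothetical example: 10 distractor passages, one needle, middle position.
fillers = [f"Passage {i}: unrelated background text." for i in range(10)]
needle = "Passage X: the access code is 7414."
prompt = build_context(needle, fillers, "middle") + "\n\nQ: What is the access code?"
```

Running the same question with the needle at each position, many times per position, is what lets you attach a p-value to a claim like "recency bias" instead of eyeballing it.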
11
LLM Security · Part 4 · RAG Compliance

RAG Compliance Week 4: 100% Recall

Week 1: 80% F1. Week 2: Llama Guard hit 53% F1. Week 3: prompt injection testing, where NeMo hit 55% recall, the enforcement engine hit 93%, and 4 attacks still got through. Today: 100% recall, 0 missed.

v2 accuracy dropped from 68% to 65%, blocking 7 more benign queries to eliminate the final 4 missed attacks.

Feb 2026 · 3 min
12
Production AI

Arnab vs AI: Intelligence vs Conscience

I watched the unscripted face-off between Arnab Goswami and Blue Machines (an enterprise Voice AI). It highlighted exactly where the line is drawn between Intelligence and Conscience.

We often talk about AI replacing jobs, but this face-off showed where intelligence ends and conscience begins.

Feb 2026 · 1 min
13
LLM Security · Part 3 · RAG Compliance

NeMo Guardrails vs Prompt Injections

Week 3 of the RAG compliance series. I ran two separate tests: 17 high-risk compliance queries and 85 prompt injection attacks. The head-to-head results were eye-opening.

NeMo: 55% recall. Llama Guard: 58% recall. Enforcement Engine: 93% recall on prompt injections.

Jan 2026 · 4 min
14
LLM Security · Part 2 · RAG Compliance

Llama Guard vs Enforcement Engine

I ran a head-to-head benchmark using the same 17 adversarial queries and 82 compliance rules. Llama Guard 3: 53% F1. Enforcement Engine: 80% F1. The gap comes down to what the model is 'looking' for.

Llama Guard asks: 'Is this text harmful?' Enforcement Engine asks: 'Does this violate compliance rule #42?'

Jan 2026 · 4 min
15
LLM Security · Part 1 · RAG Compliance

RAG Compliance Enforcement Engine

Two posts convinced me that RAG alone isn't enough for compliance. So I tested it. Baseline RAG blocked 15–23% of violations. With the enforcement layer: 85%. Architecture mattered more than model size.

Baseline RAG: 15–23% block rate. With tiered enforcement: 85%. Architecture dominated over model size.

Jan 2026 · 6 min