Writing & Research

Experiments with real numbers. Agentic systems, strategy, and engineering craft.


Jun 2026 · 8 min · Strategy

Why Every AI Integration Fails Without a Governed API Layer

95% of enterprise AI pilots fail. Everyone blames the model. The model is fine.

MIT studied enterprise AI pilots. 95% delivered no measurable business impact. The cause wasn't model capability — it was integration gaps and governance gaps. Here's the architecture that fixes it.

Read post →

Jun 2026 · 8 min · Strategy

Microsoft Built 7 AI Models From Scratch. Here's What That Actually Means.

At Build 2026, Microsoft announced 7 AI models built entirely in-house. Here's what they actually built.

At Build 2026, the Microsoft AI team led by Mustafa Suleyman announced the MAI model family: 7 models covering reasoning, coding, images, voice, and transcription — every one built entirely in-house.

Read post →

Jun 2026 · 9 min · Craft

I Gave Claude Access to My Notes. Here's What I Had to Build to Make It Safe.

A knowledge graph as an MCP server — Rust engine, Python connectors, and a security gate that makes nodes structurally invisible rather than just filtered.

Dumping your notes into an LLM context is easy. Giving an AI agent structured, safe, scoped access to a living knowledge graph is not. Here's the architecture I built.

Read post →
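One way to read "structurally invisible rather than just filtered" is that the agent is handed a graph that never contained the hidden nodes, instead of a full graph whose results are trimmed afterwards. The sketch below illustrates that distinction only; the post describes a Rust engine, and the names `filtered_neighbors` and `scoped_view` and the sample notes are hypothetical.

```python
def filtered_neighbors(graph, node, allowed):
    """Post-filtering: the full graph is still reachable; hidden nodes are dropped per query."""
    return [n for n in graph.get(node, []) if n in allowed]


def scoped_view(graph, allowed):
    """Structural scoping: build a graph with no hidden nodes and no edges pointing to them."""
    return {
        node: [n for n in edges if n in allowed]
        for node, edges in graph.items()
        if node in allowed
    }


# Hypothetical notes graph: "diary" should never be visible to the agent.
notes = {
    "work": ["diary", "project"],
    "diary": ["work"],
    "project": ["work"],
}
allowed = {"work", "project"}

view = scoped_view(notes, allowed)  # the agent only ever receives `view`
```

With the scoped view, a lookup for `"diary"` fails the same way a lookup for a nonexistent note would, so there is no filter step for a query to bypass.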

Jun 2026 · 8 min · Experiment

I Built an AI Agent That Works With No Internet. Here's How.

Gemma 4 running fully offline in Rust. Nine specialized agents, two model tiers, model encryption, and a load balancer — all from a single binary.

I wanted AI that runs with no internet and keeps everything private. No API keys, no cloud, no data leaving the machine. So I built it in Rust using llama.cpp. Here's what that actually looks like.

Read post →

Jun 2026 · 7 min · Experiment

I Connected Two AI Models With One Layer. Here Are the Real Numbers.

What happens when you train a single linear layer to bridge a 1B model and a 4B model? I ran the experiment. The loss dropped 99.9%.

Gemma 3 1B has a hidden size of 1152. Gemma 3 4B has a hidden size of 2560. Their internal representations have different widths, so I trained a single layer to translate between them and logged every number.

Read post →
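As a rough illustration of the experiment described above, a single linear layer that maps a 1152-wide vector to a 2560-wide vector is one weight matrix plus one bias vector. The sketch below is pure Python and shows only the shapes and the parameter count. It is not the author's training code: the class name `LinearBridge`, the initialization, and the presence of a bias term are all assumptions.

```python
import random

SRC_DIM = 1152  # Gemma 3 1B hidden size, per the post
DST_DIM = 2560  # Gemma 3 4B hidden size, per the post


class LinearBridge:
    """Hypothetical single linear layer: y = W x + b."""

    def __init__(self, src_dim, dst_dim, seed=0):
        rng = random.Random(seed)
        scale = 1.0 / (src_dim ** 0.5)
        # One row of weights per output dimension.
        self.weight = [
            [rng.uniform(-scale, scale) for _ in range(src_dim)]
            for _ in range(dst_dim)
        ]
        self.bias = [0.0] * dst_dim

    def forward(self, x):
        """Project one source-width vector into the destination width."""
        return [
            sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(self.weight, self.bias)
        ]

    def param_count(self):
        return len(self.weight) * len(self.weight[0]) + len(self.bias)


bridge = LinearBridge(SRC_DIM, DST_DIM)
hidden_1b = [0.01] * SRC_DIM           # stand-in for a 1B hidden state
hidden_4b = bridge.forward(hidden_1b)  # same vector, now 2560 wide
```

Under these assumptions the bridge has 1152 × 2560 weights plus 2560 biases, which is small next to either model.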

May 2026 · 3 min · Agentic AI

When My AI Changed Persona and Refused My Instructions

Nine links in, Claude Code stopped summarizing and called out the avoidance. Notes on what happened, and what training keeps this behavior intact.

Read post →

May 2026 · 13 min · Benchmarks

Gemma 4 E2B vs the Gemma Family: The 2B Underdog That Punches Above Its Weight

Google's newest 2B model tested across 10 enterprise task suites against Gemma 2 2B, Gemma 3 4B, Gemma 4 E4B, and Gemma 3 12B. Run locally on Apple Silicon.

Gemma 4 E2B scored 80.4% across 9 enterprise suites — 0.4 points behind the previous-gen 4B model. But the real surprise: 70% multi-turn, the highest score in the entire Gemma family.

Read post →

Apr 2026 · 12 min · Benchmarks

Gemma 4 E4B vs the Gemma Family: Enterprise Benchmark Showdown

Google's newest 4B model tested across 8 enterprise task suites against Gemma 2 2B, Gemma 3 4B, and Gemma 3 12B. Run locally on Apple Silicon.

We ran Gemma 4 E4B through 8 enterprise test suites — function calling, RAG grounding, classification, code generation, summarization, information extraction, multilingual, and multi-turn — and compared it head-to-head against three other Gemma models. The 4B model scored 83.6% overall, beating even the 12B.

Read post →

Mar 2026 · 2 min read · LinkedIn Post

Is Your Small Open-Source AI Giving You Wrong Answers in the Right Format?

I ran 1,500+ tests across 7 small open-source models. Forcing JSON Mode can backfire. A 2B model beat one 4× its size with the right guidance.

Read post →

Mar 2026 · 4 min · Architecture

The Black Box Problem

AI is the first abstraction layer you can't open the hood on. Here's how to build on it anyway.

Every abstraction layer in computing let you drop down and look when things broke. AI is the first layer where you can't. The bugs are non-deterministic by design.

Read post →

Mar 2026 · 4 min · Business Case

Same Team, 2× Output

AI doesn't reduce the number of developers. It reduces the cost of shipping a feature.

Everyone keeps saying AI will reduce the number of developers. After months of working with these tools, I think they're looking at it wrong. AI reduces the cost of shipping a feature with the same team.

Read post →

Mar 2026 · 5 min · Architecture

AI as Translator, Not Decision-Maker

Use AI where the output is inspectable. Be deliberate about where the black box runs live.

If AI is a black box you can't debug, how do you trust it in production? Honest answer: you don't. Not for everything. Here's the framework I keep coming back to.

Read post →

Mar 2026 · 3 min · Architecture

AI as the Next Abstraction Layer

The value isn't in what AI knows. It's in what it frees you to focus on.

I asked Claude directly: 'Why should I use you?' And it gave me the clearest framing I've heard. The history of computing is a story of rising abstraction layers.

Read post →

Feb 2026 · 3 min · Craft

The Cognitive Friction Problem

Like AI models, our biological neural networks require constant fine-tuning.

When I quit using AI writing tools, I realized I had completely lost my tolerance for cognitive friction. The moment a thought became difficult to articulate, my instinct was to reach for the AI escape hatch.

Read post →

Feb 2026 · 5 min · Benchmarks

Structured JSON Output from Small LLMs

1,500+ tests across 7 models. Forcing JSON Mode degraded 2 of 3 models. A 2B model beat a 7B on defaults.

You know that feeling when you ask an AI to return data in a specific structure, and everything looks clean — but the actual content is quietly wrong? I ran 1,500+ tests across 7 small open-source models.

Read post →
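To make the "right format, wrong content" failure concrete, here is a minimal sketch that separates a structural check from a content check. The reply text, field names, ground-truth value, and the helpers `check_structure` and `check_content` are hypothetical and are not taken from the benchmark.

```python
import json

# Hypothetical ground truth and a model reply that parses cleanly
# but carries the wrong total.
SOURCE_TOTAL = 1250.00
model_reply = '{"vendor": "Acme", "total": 1520.00, "currency": "USD"}'

REQUIRED_FIELDS = {"vendor": str, "total": float, "currency": str}


def check_structure(raw):
    """Parse the reply and confirm every required field has the expected type."""
    data = json.loads(raw)
    for name, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(name), expected_type):
            raise ValueError(f"field {name!r} is missing or has the wrong type")
    return data


def check_content(data, expected_total):
    """Compare the extracted value against the ground truth."""
    return abs(data["total"] - expected_total) < 0.005


parsed = check_structure(model_reply)             # passes: the format is clean
content_ok = check_content(parsed, SOURCE_TOTAL)  # fails: the value is wrong
```

A test suite that stops at `check_structure` would count this reply as a success, which is why structure and content are worth scoring separately.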

Feb 2026 · 4 min · RAG · Experiments

Context Position Bias in Small LLMs

Stanford showed GPT-3.5 loses middle-context info. Do small open-source models behave the same way?

The "Lost in the Middle" paper showed that large models perform worst when important information is buried in the middle of long contexts. I tested whether small 2–4B models behave the same way. They don't.

Read post →
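A position-bias test of this kind can be sketched as placing one key fact at the start, middle, or end of a run of filler passages and asking the same question each time. The helper name `build_context`, the filler text, and the fact itself are hypothetical. The post's actual harness and prompts are not shown here.

```python
def build_context(fact, fillers, position):
    """Insert `fact` among `fillers` at 'start', 'middle', or 'end'."""
    slots = {"start": 0, "middle": len(fillers) // 2, "end": len(fillers)}
    if position not in slots:
        raise ValueError(f"unknown position: {position}")
    index = slots[position]
    passages = fillers[:index] + [fact] + fillers[index:]
    return "\n\n".join(passages)


# Hypothetical filler passages and target fact.
fillers = [f"Filler passage {i} with unrelated detail." for i in range(1, 9)]
fact = "The access code for the archive room is 7314."
question = "What is the access code for the archive room?"

# One prompt per position; only the location of the fact changes.
prompts = {
    pos: f"{build_context(fact, fillers, pos)}\n\nQuestion: {question}"
    for pos in ("start", "middle", "end")
}
```

Holding the filler and the question fixed means any accuracy difference between the three prompts can be attributed to position alone.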

Feb 2026 · 3 min · LLM Security

RAG Compliance Week 4: 100% Recall

4 attacks still got through. 4 too many. Today: 100% recall. 0 missed. 490 test cases.

Week 1: 80% F1. Week 2: Llama Guard hit 53% F1. Week 3: Prompt injection testing. NeMo hit 55% recall. Enforcement engine hit 93%. 4 attacks still got through. Today: 100% recall. 0 missed.

Read post →
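For readers comparing the recall and F1 figures quoted across this series, the standard definitions are sketched below. The function name `detection_metrics` and the counts in the example are illustrative only and are not the series' results.

```python
def detection_metrics(true_pos, false_pos, false_neg):
    """Return (precision, recall, f1) for an attack detector."""
    flagged = true_pos + false_pos    # everything the detector blocked
    attacks = true_pos + false_neg    # everything that should have been blocked
    precision = true_pos / flagged if flagged else 0.0
    recall = true_pos / attacks if attacks else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative counts: 40 attacks caught, 10 benign queries flagged, 0 attacks missed.
precision, recall, f1 = detection_metrics(true_pos=40, false_pos=10, false_neg=0)
```

Recall of 100% means no attack was missed. It says nothing about benign queries that were blocked by mistake, which is what precision and F1 capture.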

Feb 2026 · 1 min · Production AI

Arnab vs AI: Intelligence vs Conscience

A live, unscripted face-off between a journalist and enterprise Voice AI.

I watched the unscripted face-off between Arnab Goswami and Blue Machines (an enterprise Voice AI). It highlighted exactly where the line is drawn between Intelligence and Conscience.

Read post →

Jan 2026 · 4 min · LLM Security

NeMo Guardrails vs Prompt Injections

I tested NVIDIA NeMo Guardrails against prompt injections and compliance queries. Here is the data.

Week 3 of the RAG compliance series. I ran two separate tests: 17 high-risk compliance queries and 85 prompt injection attacks. The head-to-head results were eye-opening.

Read post →

Jan 2026 · 4 min · LLM Security

Llama Guard vs Enforcement Engine

The most common question: 'Why build this? Why not just use Llama Guard?' So I put it to the test.

I ran a head-to-head benchmark using the same 17 adversarial queries and 82 compliance rules. Llama Guard 3: 53% F1. Enforcement Engine: 80% F1. The gap comes down to what the model is 'looking' for.

Read post →

Jan 2026 · 6 min · LLM Security

RAG Compliance Enforcement Engine

Baseline RAG retrieved the rule correctly — and still advised breaking it. So I built an enforcement layer.

Two posts convinced me that RAG alone isn't enough for compliance. So I tested it. Baseline RAG blocked 15–23% of violations. With the enforcement layer: 85%. Architecture mattered more than model size.

Read post →

2024 · 5 min · Craft

I Deployed a PDF Summarizer Using Gemini on Google Cloud. Here's What the Setup Looks Like.

Google's Vertex AI quickstart gets you to a working app fast. Getting it deployed with auth on Cloud Run takes a bit more. Here's the full path.

I started with Google's Gemini quickstart and got a PDF summarizer running. Then I had to actually deploy it. Here's what the full setup looks like — Flask, Cloud Run, IAP, and the token limit gotcha.

Read post →