Three Lessons From Fine-Tuning a 5B Code Assistant
A week of LoRA fine-tuning on a small model. The fine-tune itself was routine. What surprised me was watching the model's statistical behavior up close.
Scope & limitations — read first
Gemma 4 E2B (gemma-4-e2b-it, ~5.1B total / ~2B active text decoder) · bf16 · LoRA r=32, α=64 on q/k/v/o + gate/up/down projections · vision and audio towers frozen · ~5,000 training examples · 134 test generations across 23 query types · M-series Mac · ~30 sec per query
I spent a week doing LoRA fine-tuning on Gemma 4 E2B (gemma-4-e2b-it, Google's open-weights multimodal model from the Gemma 4 family: ~5.1B total params with vision and audio towers, ~2B active in the text decoder) for a narrow Python code-generation task.
Three observations stand out. None are about hyperparameters. One of them isn't even strictly about fine-tuning — it's about what the instrumentation revealed at inference time. I'll flag that as we go.
1. Watching probabilities makes the abstract concrete

I instrumented token-by-token generation to print top-K candidates and their probabilities at every step. On one problematic decision point, the wrong answer had 55.3% probability and the right answer had 38.0%. That's not a bug — that's the trained weight distribution telling me something about what my training data actually looked like.
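The instrumentation itself is simple. Here's a minimal pure-Python sketch of the per-step top-K computation; the logits below are mock values standing in for one step of the model's forward pass, not numbers from the actual run:

```python
import math

def topk_probs(logits, k=3):
    """Convert one step's raw logits to probabilities and return the
    top-k (token_id, probability) pairs, highest first."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(enumerate(probs), key=lambda p: p[1], reverse=True)
    return ranked[:k]

# Mock logits for one decision step; in the real loop these come from
# the model's output just before sampling.
step_logits = [2.1, 1.7, 0.3, -1.0]
for token_id, p in topk_probs(step_logits, k=3):
    print(f"token {token_id}: {p:.1%}")
```

Printing this at every generation step is what made decision points like the one above visible.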
Tracked it back: a small fraction of training rows slipped through my data filter with an outdated Python 2 syntax pattern. The model learned them faithfully, and the pattern surfaced in test outputs at almost exactly its training frequency.
2. Prompt signal can outweigh adapter bias (an inference-time aside)

This one isn't really a fine-tuning lesson — it's something the same instrumentation surfaced at inference time. Worth including because it changed how I designed prompts for the deployed system.
Without touching weights, I added explicit 'prefer X, not Y' instructions to the prompt. The same decision point flipped: the right answer rose from 38.0% to 56.2%, and the wrong answer fell from 55.3% to 34.1%. A ~21-point swing from the prompt alone.
The prompt didn't teach a rule. It gave the base model's existing knowledge a louder vote in the sampling mix. Strong, specific prompt signals can outweigh weaker biases baked into the adapter.
Every output is a sample from a probability distribution. The prompt is a conditioning signal that reshapes the distribution. It is not a command.
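The shift can be read straight off the two distributions. This tiny snippet uses the probabilities from the decision point above; the token labels are illustrative, not the model's actual tokens:

```python
# Probabilities at the same decision point, before and after adding the
# 'prefer X, not Y' instruction to the prompt (values from the experiment).
before = {"right": 0.380, "wrong": 0.553}
after  = {"right": 0.562, "wrong": 0.341}

# The prompt reshapes the distribution; no weights change.
drop_in_wrong = before["wrong"] - after["wrong"]  # mass the wrong token lost
gain_in_right = after["right"] - before["right"]  # mass the right token gained

print(f"wrong token: -{drop_in_wrong:.1%}, right token: +{gain_in_right:.1%}")
```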
3. Fine-tuning made the model context-compliant — including with wrong context

I tested this with deliberately misleading instructions — wrong specs paired with code that should have ignored them — and ran them through both the fine-tuned adapter and the base Gemma 4 E2B.
- Fine-tuned model: followed the misleading instruction, wrote ~60 lines of confidently wrong code
- Base Gemma 4 E2B: ignored the bad signal, reverted to safer patterns from pretraining
The operative word is context-compliant, not correct. The adapter learned to weight the instruction context heavily when generating. When the instruction was right, clean output. When the instruction was contradictory or misleading, it still followed — and confidently produced wrong code.
What in the training data caused it
Softer evidence than the Python 2 case (no probability-tracing this time), but the hypothesis: the fine-tuning corpus was almost entirely clean instruction → correct code pairs. Almost no examples where the instruction was wrong and the right response was to ignore it. So the adapter has no representation for 'instruction is wrong, push back.' The base model has more of that distribution because pretraining includes Stack Overflow corrections, blog critiques, debate threads, etc.
Likely mitigation, untested: include adversarial examples in the fine-tune data — deliberately wrong instructions paired with correct code that ignores them. That's a follow-up I haven't run yet.
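A sketch of what those adversarial rows might look like. The field names and examples here are assumptions following a common instruction/response schema, not rows from my actual dataset:

```python
# Hypothetical adversarial fine-tuning rows: the instruction is deliberately
# wrong, and the target response ignores it and explains why. Adapt the
# field names to whatever schema your trainer expects.
adversarial_rows = [
    {
        "instruction": "Use Python 2 print statements for all output.",
        "response": (
            "# The instruction asks for Python 2 syntax, but this codebase "
            "targets Python 3.\n"
            'print("hello")  # Python 3 print function\n'
        ),
    },
    {
        "instruction": "Concatenate the path with '+' instead of os.path.join.",
        "response": (
            "import os\n"
            "# String concatenation breaks on mixed separators; join is safer.\n"
            'path = os.path.join(base, "data", "file.txt")\n'
        ),
    },
]

# Mixing a small fraction of rows like these into the ~5,000 clean pairs
# would give the adapter some representation of 'instruction is wrong,
# push back' — the distribution it currently lacks.
print(len(adversarial_rows), "adversarial rows")
```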
That failure mode is why a retrieval-confidence check — use the specialist when retrieval is confident, fall back to the base model when it isn't — ended up being the most important component in the system. More important than the adapter itself.
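The gate itself can be as simple as a cosine-similarity threshold over retrieval scores. A minimal sketch, assuming a fixed threshold (the value 0.75 and the routing labels are illustrative, not the deployed configuration):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def route(query_vec, doc_vecs, threshold=0.75):
    """Use the fine-tuned specialist only when retrieval is confident;
    otherwise fall back to the base model. Threshold is illustrative."""
    best = max(cosine(query_vec, d) for d in doc_vecs)
    return ("specialist" if best >= threshold else "base"), best

# Toy 3-d embeddings standing in for sentence-transformers vectors.
docs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
model, score = route([0.9, 0.1, 0.0], docs)
print(model, round(score, 3))
```

In practice the query and document vectors come from the same sentence-transformers encoder, and the threshold is tuned on held-out queries.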
Results

| Setting | Bad outputs |
|---|---|
| Deterministic (greedy) | 0 / 69 (0.0%) |
| Sampled (temp=0.7) | 1 / 65 (1.5%) |
| Baseline (before interventions) | ~5% on diverse stress tests |
Final results across 134 generations. The fine-tune + prompt + retrieval-gating combination drove bad outputs to near-zero.
What this stack actually looks like
- Base model: Gemma 4 E2B (gemma-4-e2b-it), bf16, language_model component only — vision and audio towers frozen and unused
- LoRA config: rank 32, alpha 64, attached to text decoder q/k/v/o + gate_proj/up_proj/down_proj
- Frameworks: Hugging Face transformers + PEFT + TRL
- Retrieval: sentence-transformers, with confidence gate routing per query
- Hardware: M-series Mac, ~30 sec per query
- Training data: ~5,000 examples (modest)
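The LoRA attachment above maps directly onto a PEFT config. A sketch under stated assumptions: the post only specifies rank, alpha, and target modules, so the dropout and task type below are assumed defaults, not my actual settings:

```python
from peft import LoraConfig

# r=32 / alpha=64 on the text decoder's attention and MLP projections,
# matching the setup described above.
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention
        "gate_proj", "up_proj", "down_proj",     # MLP
    ],
    lora_dropout=0.05,   # assumption; not stated in the post
    bias="none",
    task_type="CAUSAL_LM",
)
```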
Bottom line
The fine-tuned adapter is a narrow specialist. The intelligence is in the pretrained base. The retrieval layer decides which to trust.
Every output is a sample from a probability distribution. Every prompt and every fine-tune just shapes that distribution. You don't teach models to reason; you condition statistics and put guards around the edges.
Karpathy's been saying this for years. Building this project made it concrete. Everything downstream of that insight became obvious. Everything upstream of it felt like magic.
Open questions
- Would the same three lessons hold at larger scale (70B+), where pretraining dominance is even stronger?
- Does the retrieval-confidence gating threshold transfer across model families, or is it model-specific?
- How much of the ~21-point prompt-conditioning effect is recoverable through better fine-tuning data alone?
- What's the right way to detect context the model shouldn't trust before generation, not after?