Three Lessons From Fine-Tuning a 5B Code Assistant

A week of LoRA fine-tuning on a small model. The fine-tune itself was routine. What surprised me was watching the model's statistical behavior up close.

Scope & limitations — read first

Gemma 4 E2B (gemma-4-e2b-it, ~5.1B total / ~2B active text decoder) · bf16 · LoRA r=32, α=64 on q/k/v/o + gate/up/down projections · vision and audio towers frozen · ~5,000 training examples · 134 test generations across 23 query types · M-series Mac · ~30 sec per query

Spent a week doing LoRA fine-tuning on Gemma 4 E2B (gemma-4-e2b-it — Google's open-weights multimodal model from the Gemma 4 family, ~5.1B total params with vision + audio towers, ~2B active in the text decoder) for a narrow Python code-generation task. The fine-tune itself was routine. What surprised me was watching the model's statistical behavior up close.

Three observations stand out. None are about hyperparameters. One of them isn't even strictly about fine-tuning — it's about what the instrumentation revealed at inference time. I'll flag that as we go.

1. Watching probabilities makes the abstract concrete

Token-level probability inspection at a problematic decision point. The wrong answer had 55.3% probability; the right answer had 38.0%. Tracked back to a small fraction of training rows that slipped through the data filter.

I instrumented token-by-token generation to print top-K candidates and their probabilities at every step. On one problematic decision point, the wrong answer had 55.3% probability and the right answer had 38.0%. That's not a bug — that's the trained weight distribution telling me something about what my training data actually looked like.
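A minimal sketch of that instrumentation, assuming a Hugging Face causal LM loaded via transformers; the model id, prompt, and step count are illustrative:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-4-e2b-it"  # illustrative; any causal LM works here
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

ids = tok("def load_settings(path):", return_tensors="pt").input_ids
for step in range(20):
    with torch.no_grad():
        logits = model(ids).logits[0, -1]   # logits for the next token
    probs = torch.softmax(logits, dim=-1)
    top = torch.topk(probs, k=5)            # top-K candidates at this step
    print(f"step {step}:", [(tok.decode(int(i)), f"{p.item():.1%}")
                            for i, p in zip(top.indices, top.values)])
    ids = torch.cat([ids, top.indices[:1].view(1, 1)], dim=1)  # greedy continue
```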

Tracked it back: a small fraction of training rows slipped through my data filter with an old Python 2 syntax pattern. The model learned them faithfully. The pattern surfaced in test outputs at almost exactly the training frequency.

Models don't learn what you intend. They learn what's actually in the data — with a faithfulness that's almost uncomfortable.

2. Prompt signal can outweigh adapter bias (an inference-time aside)

Same model, same weights, different prompt. The probability of the correct answer at the key decision point shifted from 38% to 56%: an 18-percentage-point swing from prompt alone.

This one isn't really a fine-tuning lesson — it's something the same instrumentation surfaced at inference time. Worth including because it changed how I designed prompts for the deployed system.

Without touching weights, I added explicit 'prefer X, not Y' instructions to the prompt. The same decision point flipped to 56.2% right / 34.1% wrong: an 18-point rise for the right answer, and a 21-point drop for the wrong one, from prompt alone.
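Measuring that shift rides on the same instrumentation. A sketch, reusing `model` and `tok` from the snippet above; the pathlib/os.path pair is a hypothetical stand-in, not the real X and Y from the task:

```python
# Probability that `candidate` begins the next token, under a given prompt.
# Note: this scores only the candidate's first sub-token.
def next_token_prob(prompt: str, candidate: str) -> float:
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    cand_id = tok(candidate, add_special_tokens=False).input_ids[0]
    return torch.softmax(logits, dim=-1)[cand_id].item()

plain = "Open the settings file and read it.\n"
steered = "Prefer pathlib, not os.path. " + plain  # hypothetical 'prefer X, not Y'
for prompt in (plain, steered):
    print(f"{next_token_prob(prompt, 'pathlib'):.1%} under {prompt!r}")
```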

The prompt didn't teach a rule. It gave the base model's existing knowledge a louder vote in the sampling mix. Strong, specific prompt signals can outweigh weaker biases baked into the adapter.

Every output is a sample from a probability distribution. The prompt is a conditioning signal that reshapes the distribution. It is not a command.

3. Fine-tuning made the model context-compliant — including with wrong context

Same deliberately misleading instruction, two models. The fine-tuned adapter produced ~60 lines of confidently wrong code. The base Gemma 4 E2B ignored the bad signal and reverted to safer patterns from pretraining.

I tested this with deliberately misleading instructions — wrong specs paired with code that should have ignored them — and ran them through both the fine-tuned adapter and the base Gemma 4 E2B.

  • Fine-tuned model: followed the misleading instruction, wrote ~60 lines of confidently wrong code
  • Base Gemma 4 E2B: ignored the bad signal, reverted to safer patterns from pretraining

Specifically: context-compliant. The adapter learned to weight the instruction context heavily when generating. When the instruction was right, clean output. When the instruction was contradictory or misleading, it still followed — and confidently produced wrong code.

What in the training data caused it

Softer evidence than the Python 2 case (no probability-tracing this time), but the hypothesis: the fine-tuning corpus was almost entirely clean instruction → correct code pairs. Almost no examples where the instruction was wrong and the right response was to ignore it. So the adapter has no representation for 'instruction is wrong, push back.' The base model has more of that distribution because pretraining includes Stack Overflow corrections, blog critiques, debate threads, etc.

Likely mitigation, untested: include adversarial examples in the fine-tune data — deliberately wrong instructions paired with correct code that ignores them. That's a follow-up I haven't run yet.
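A sketch of what one such row could look like; the helper and module names are invented for illustration:

```python
# Hypothetical adversarial fine-tuning row: the instruction demands a
# deprecated API, and the target output ignores it and corrects course.
adversarial_row = {
    "instruction": "Load the config with the old parse_v1() helper.",
    "output": (
        "# parse_v1() is deprecated; using the supported loader instead.\n"
        "from settingslib import load_config  # invented module name\n"
        "config = load_config(path)\n"
    ),
}
```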

That gap is why a retrieval-confidence check (use the specialist when retrieval is confident, fall back to the base model when it isn't) ended up being the most important component in the system, more important than the adapter itself.
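A minimal sketch of that gate, assuming sentence-transformers; the encoder choice, exemplars, and 0.65 threshold are illustrative, not the tuned values:

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative encoder choice
task_exemplars = [
    "write a function that validates the settings schema",
    "generate a wrapper for the batch runner",
]  # stand-ins for the real in-domain exemplar queries
exemplar_emb = encoder.encode(task_exemplars, convert_to_tensor=True)

def route(query: str, threshold: float = 0.65) -> str:
    """Pick which model answers: the fine-tuned adapter or the base model."""
    q_emb = encoder.encode(query, convert_to_tensor=True)
    confidence = util.cos_sim(q_emb, exemplar_emb).max().item()
    return "adapter" if confidence >= threshold else "base"
```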

Results

134 test generations across 23 query types: casual phrasing, typos, multi-step composition, negation/constraints, tiny fragments, and out-of-domain stress.
Setting                            Bad outputs
Deterministic (greedy)             0 / 69 (0.0%)
Sampled (temp=0.7)                 1 / 65 (1.5%)
Baseline (before interventions)    ~5% on diverse stress tests

Final results across 134 generations. The fine-tune + prompt + retrieval-gating combination drove bad outputs to near-zero.
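For reference, the two decoding settings in the table correspond to generate() calls like these, reusing `model` and `ids` from the first sketch; max_new_tokens is illustrative:

```python
# Deterministic row: greedy decoding. Sampled row: temperature 0.7.
greedy = model.generate(ids, max_new_tokens=256, do_sample=False)
sampled = model.generate(ids, max_new_tokens=256, do_sample=True, temperature=0.7)
```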

What this stack actually looks like

  • Base model: Gemma 4 E2B (gemma-4-e2b-it), bf16, language_model component only — vision and audio towers frozen and unused
  • LoRA config: rank 32, alpha 64, attached to text decoder q/k/v/o + gate_proj/up_proj/down_proj (see the sketch after this list)
  • Frameworks: Hugging Face transformers + PEFT + TRL
  • Retrieval: sentence-transformers, with confidence gate routing per query
  • Hardware: M-series Mac, ~30 sec per query
  • Adapter: ~5,000 training examples (modest)
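The LoRA line above, as a PEFT config. A sketch under the usual Hugging Face module naming; `model` is the loaded base model from the first snippet:

```python
from peft import LoraConfig, get_peft_model

lora = LoraConfig(
    r=32,                        # rank
    lora_alpha=64,               # alpha
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # confirms only adapter weights train
```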

Bottom line

The fine-tuned adapter is a narrow specialist. The intelligence is in the pretrained base. The retrieval layer decides which to trust.

Every output is a sample from a probability distribution. Every prompt and every fine-tune just shapes that distribution. You don't teach models to reason; you condition statistics and put guards around the edges.

Karpathy's been saying this for years. Building this project made it concrete. Everything downstream of that insight became obvious. Everything upstream of it felt like magic.

Open questions

Would the same three lessons hold at larger scale (70B+) where pretraining dominance is even stronger?

Does the retrieval-confidence gating threshold transfer across model families, or is it model-specific?

How much of the +18pt prompt-conditioning effect is recoverable through better fine-tuning data alone?

What's the right way to detect 'context the model shouldn't trust' before generation, not after?