LoRA on the text decoder, ~5,000 training pairs. Three observations about probability distributions, prompts, and what fine-tuning quietly erased.
Spent a week doing LoRA fine-tuning on Gemma 3n E2B (~5.1B total params, ~2B active in the text decoder) for a narrow Python code-generation task. The bad-output rate fell from ~5% to 0% with greedy decoding and 1.5% with sampling, measured across 134 test cases. The fixes weren't more data or compute. They were three uncomfortable lessons about what LLMs actually are.
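For concreteness, here's a minimal sketch of what a setup like this looks like with Hugging Face Transformers and PEFT. The checkpoint id, rank, alpha, and target modules are illustrative assumptions, not the exact configuration used in these experiments.

```python
# Minimal LoRA fine-tuning setup sketch (Hugging Face Transformers + PEFT).
# The checkpoint id and every hyperparameter below are illustrative,
# not the exact values used in the experiments described in this post.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "google/gemma-3n-E2B-it"  # assumed checkpoint; swap in your own

tokenizer = AutoTokenizer.from_pretrained(model_id)
# For multimodal checkpoints the text-only loader class can differ;
# adjust if AutoModelForCausalLM doesn't resolve for your checkpoint.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

lora_config = LoraConfig(
    r=16,                 # adapter rank (illustrative)
    lora_alpha=32,        # scaling factor (illustrative)
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    # Attach adapters only to the text decoder's attention projections.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # adapters are a tiny fraction of total params
```

Training then runs over the ~5,000 prompt/completion pairs with a standard causal-LM objective (e.g., TRL's SFTTrainer); the details don't matter for the lessons below.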