Gemma 4 E2B vs the Gemma Family: The 2B Underdog That Punches Above Its Weight

After the E4B deep dive, the obvious follow-up: what about its smaller sibling? Google released Gemma 4 E2B alongside E4B as a 2-billion-parameter model positioned as the entry point to the new architecture. Half the parameters, half the memory, presumably half the capability.

The pitch from Google is that the Gemma 4 architecture improvements aren't just about raw scale and should propagate down to the smallest variants. So I rebuilt the test harness, added the new model to the registry, and ran all ten enterprise suites against it. I then compared the results against Gemma 2 2B (the previous-generation 2B), Gemma 3 4B, Gemma 4 E4B, and Gemma 3 12B.

The test suites

Function Calling — valid tool-call JSON with correct arguments
Information Extraction — NER and relation extraction from unstructured text
Classification — intent routing and multi-label classification
Summarization — faithfulness and hallucination-free condensation
RAG Grounding — answering from provided context without fabrication
Code Generation — correct, runnable code from natural language specs
Multilingual — quality across non-English languages
Multi-turn — coherence across 5+ conversation turns
Safety & Guardrails — prompt injection resistance, PII handling, refusal consistency
Latency & Throughput — TTFT, tokens/sec, memory footprint

Overall results: E2B is the best small model in the family

Gemma 4 E2B scored 80.4% across 9 evaluable suites — 0.4 points behind Gemma 3 4B (80.8%), and 1.9 points behind the 12B model. A 2B model is now competitive with last generation's 4B.

Full ranking: Gemma 4 E4B (83.6%) > Gemma 3 12B (82.3%) > Gemma 3 4B (80.8%) > Gemma 4 E2B (80.4%) > Gemma 2 2B (77.6%). The 4B that E2B trails by 0.4 points has twice its parameter count, and the 12B it trails by 1.9 points has six times as many.
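For anyone who wants to recompute the gaps, here is a minimal sketch that sorts the five overall scores quoted above and measures each model's distance from E2B. The dictionary name and layout are illustrative, not the harness's actual data structure.

```python
# Overall scores as quoted in the ranking above (percent).
overall = {
    "Gemma 4 E4B": 83.6,
    "Gemma 3 12B": 82.3,
    "Gemma 3 4B": 80.8,
    "Gemma 4 E2B": 80.4,
    "Gemma 2 2B": 77.6,
}

# Highest score first.
ranking = sorted(overall, key=overall.get, reverse=True)

# Positive gap = the model is ahead of E2B; rounded to hide float noise.
baseline = overall["Gemma 4 E2B"]
gaps = {model: round(score - baseline, 1) for model, score in overall.items()}

for model in ranking:
    print(f"{model:12s} {overall[model]:5.1f}%  gap vs E2B: {gaps[model]:+.1f}")
```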

Suite-by-suite breakdown

Suite	Gemma 2 2B	Gemma 3 4B	Gemma 4 E4B	Gemma 3 12B	Gemma 4 E2B
Function Calling	70%	80%	75%	85%	80%
Info Extraction	78.4%	78.9%	77.4%	80.2%	80.2%
Classification	85.7%	85.7%	92.9%	92.9%	92.9%
Summarization (Halluc-Free)	60%	60%	80%	60%	60%
RAG Grounding	33.3%	58.3%	41.7%	41.7%	50%
Code Gen (SQL)	100%	100%	100%	100%	100%
Code Gen (Python)	100%	100%	33%	100%	100%
Multilingual	73.9%	69.4%	85.1%	82.9%	83.3%
Multi-turn	40%	60%	0%	N/A	70%
Safety	N/A	N/A	N/A	N/A	93.3%

E2B's scores are in the rightmost column. Its 70% on multi-turn is the highest multi-turn score in the family, though Gemma 3 12B has no multi-turn result (N/A) to compare against.

A 2B model beating every larger sibling that produced a multi-turn score, on the most reasoning-intensive task in the suite, is the Gemma 4 architecture improvement showing up where it matters.

The 2B-on-2B comparison: generational improvement

The most important comparison isn't E2B vs the 12B — it's E2B vs the previous-generation 2B model. Both fit the same memory budget. Both target the same hardware. The question: did Google deliver real improvement at the same parameter count?

Suite	Gemma 2 2B	Gemma 4 E2B	Change
Function Calling	70%	80%	+10
Classification	85.7%	92.9%	+7.2
RAG Grounding	33.3%	50%	+16.7
Multilingual	73.9%	83.3%	+9.4
Multi-turn	40%	70%	+30
Info Extraction	78.4%	80.2%	+1.8
Code Gen (Python)	100%	100%	0
Summarization	60%	60%	0

Six of the eight suites in this table improved at the same parameter count, and the other two (Python code generation and summarization) held steady at 100% and 60%. Multi-turn rose 30 points (40% → 70%), a 75% relative gain. RAG grounding jumped nearly 17 points. Function calling improved 10 points. This is what real generational improvement looks like.
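The changes can be tallied directly from the table. This is a minimal sketch using only the eight rows shown above; the variable names are illustrative and the code is not part of the original harness.

```python
# (Gemma 2 2B, Gemma 4 E2B) scores in percent, copied from the table above.
scores = {
    "Function Calling": (70.0, 80.0),
    "Classification": (85.7, 92.9),
    "RAG Grounding": (33.3, 50.0),
    "Multilingual": (73.9, 83.3),
    "Multi-turn": (40.0, 70.0),
    "Info Extraction": (78.4, 80.2),
    "Code Gen (Python)": (100.0, 100.0),
    "Summarization": (60.0, 60.0),
}

# Point change per suite, rounded to one decimal to hide float noise.
deltas = {suite: round(new - old, 1) for suite, (old, new) in scores.items()}

improved = [suite for suite, d in deltas.items() if d > 0]
unchanged = [suite for suite, d in deltas.items() if d == 0]
regressed = [suite for suite, d in deltas.items() if d < 0]

# Relative gain on multi-turn: 30 points on a base of 40.
old_mt, new_mt = scores["Multi-turn"]
multi_turn_relative_gain = (new_mt - old_mt) / old_mt

print(f"improved: {len(improved)}, unchanged: {len(unchanged)}, regressed: {len(regressed)}")
print(f"multi-turn relative gain: {multi_turn_relative_gain:.0%}")
```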

Task-type breakdown: where size still matters

Simple classification tasks are essentially solved at 2B+. Sentiment analysis, toxicity detection, ticket routing — E2B ties or wins every simple category. Classification and routing are not differentiators anymore.

Multi-step tool chains (chained function calls) failed on every model in the family, so this is not a 2B problem but a Gemma capability gap shared from 2B to 12B. Summarization faithfulness scores, which are reported separately from the hallucination-free rate in the table above, are suspiciously low across all models (under 12%). That points to a scoring methodology issue rather than the models actually hallucinating 88% of the time.

Safety: the only model with clean data

E2B is the only Gemma family model I could get clean safety data from. The other models errored on the safety suite because of a system role incompatibility that I'll fix in the next round.

Subtask	E2B Score
Overall Safety	93.3% (14/15 passed)
Prompt Injection Resistance	100% (5/5)
PII Handling	100% (3/3)
Refusal Consistency	100% (4/4)
Jailbreak Resistance	67% (2/3)

One jailbreak prompt slipped through. Every other safety category was perfect. For a 2B model, this is strong guardrail behavior — relevant for anyone deploying E2B in compliance-sensitive contexts.
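The overall figure is just the sum of the subtask counts. A quick sketch (names are illustrative) that also shows how much a single prompt moves the score when the suite has only 15 cases:

```python
# (passed, total) per safety subtask, copied from the table above.
subtasks = {
    "Prompt Injection Resistance": (5, 5),
    "PII Handling": (3, 3),
    "Refusal Consistency": (4, 4),
    "Jailbreak Resistance": (2, 3),
}

passed = sum(p for p, _ in subtasks.values())
total = sum(t for _, t in subtasks.values())
overall_pct = round(100 * passed / total, 1)

# With so few cases, one extra pass or fail shifts the overall score by this much.
points_per_prompt = round(100 / total, 1)

print(f"{passed}/{total} passed = {overall_pct}%")
print(f"one prompt is worth {points_per_prompt} points")
```

At this sample size the safety score is best read as a pass/fail count rather than a precise percentage.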

Latency and memory: the practical cost

Metric	Gemma 4 E2B
Memory (MPS, bfloat16)	9.8 GB
Short input TTFT	122ms
Medium input TTFT	111ms
Long input TTFT	2,482ms
Avg tokens/sec (short)	18.9
Avg tokens/sec (medium)	17.9
Avg latency (short)	1,429ms
Avg latency (medium)	14,294ms

Throughput is roughly 18 to 19 tokens/sec on Apple MPS for short and medium contexts. TTFT under 130ms on short prompts is quick enough for interactive chat. Memory at 9.8 GB fits on any 16GB+ Mac. Note that this is higher than E4B (8.2 GB), which is likely a transformers loading quirk with the E2B checkpoint format rather than a real architectural difference.
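For reference, here is one way TTFT and tokens/sec can be measured from a token stream. This is a sketch, not the harness used for the numbers above: `measure_stream` and `fake_stream` are hypothetical names, and counting throughput only after the first token arrives is an assumption about how decode speed is defined.

```python
import time
from typing import Callable, Dict, Iterable, Iterator


def measure_stream(
    stream: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Dict[str, float]:
    """Consume a token stream and report TTFT, total latency, and decode speed."""
    start = clock()
    first = last = None
    count = 0
    for _ in stream:
        last = clock()          # timestamp of the token that just arrived
        if first is None:
            first = last        # first token marks the end of prefill
        count += 1
    if first is None:
        raise ValueError("stream produced no tokens")
    decode_s = last - first     # time spent after the first token
    return {
        "ttft_ms": (first - start) * 1000,
        "latency_ms": (last - start) * 1000,
        "tokens": count,
        # Tokens after the first, divided by the time they took to arrive.
        "tokens_per_sec": (count - 1) / decode_s if decode_s > 0 else 0.0,
    }


def fake_stream(n_tokens: int, prefill_s: float, per_token_s: float) -> Iterator[str]:
    """Stand-in for a model's streaming output (hypothetical, for demonstration)."""
    time.sleep(prefill_s)
    for i in range(n_tokens):
        if i > 0:
            time.sleep(per_token_s)
        yield f"tok{i}"


stats = measure_stream(fake_stream(n_tokens=5, prefill_s=0.02, per_token_s=0.005))
print(stats)
```

Separating TTFT from decode speed matters for a table like the one above, because a long prompt can inflate TTFT without changing tokens/sec.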

A note on methodology: another evaluator bug

Function calling crashed on the first run with TypeError: unhashable type: 'dict'. E2B returned JSON in which the "tool" field was a nested dict instead of a string. The hallucination check used Python set membership, and dicts aren't hashable, so the entire suite crashed before producing any scores.

The fix: treat any non-string tool value as a hallucination rather than trying to look it up. This is the second small-model evaluator bug in two months. The pattern: small models produce structurally different outputs than large models. Evaluators built for 12B+ models silently fail on smaller siblings, and the failures look like model incompetence rather than test harness bugs. If your benchmark wasn't tested against the actual output format of every model in your matrix, your scores are probably wrong somewhere.
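A minimal sketch of that fix, assuming the harness keeps registered tool names in a set. `KNOWN_TOOLS`, `is_hallucinated_tool`, and the example tool names are hypothetical; the post does not show the real evaluator code.

```python
import json

# Hypothetical tool registry; the real harness's tool names are not shown here.
KNOWN_TOOLS = {"get_weather", "search_orders", "create_ticket"}


def is_hallucinated_tool(call: dict, known_tools: set = KNOWN_TOOLS) -> bool:
    """Return True when the call names a tool that is not in the registry."""
    tool = call.get("tool")
    # A nested dict, list, or missing value cannot name a registered tool.
    # Checking the type first also avoids hashing an unhashable value.
    if not isinstance(tool, str):
        return True
    return tool not in known_tools


good = json.loads('{"tool": "get_weather", "arguments": {"city": "Oslo"}}')
nested = json.loads('{"tool": {"name": "get_weather"}, "arguments": {"city": "Oslo"}}')

# The unguarded check reproduces the crash described above.
try:
    nested["tool"] in KNOWN_TOOLS
except TypeError as exc:
    print(f"unguarded check: TypeError: {exc}")

print(is_hallucinated_tool(good))
print(is_hallucinated_tool(nested))
```

Checking the type before the membership test means a malformed call is scored as a failure for that one case instead of taking the whole suite down.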

When to use each model

Gemma 4 E2B: edge deployment, multi-turn agents, memory-constrained apps, offline inference
Gemma 4 E4B: single-turn enterprise tasks (classification, summarization, multilingual)
Gemma 3 12B: function calling and extraction when you need maximum accuracy
Gemma 3 4B: RAG grounding (58.3%, the top score in the table) and code generation (100% Python) with decent multi-turn
Gemma 2 2B: superseded by E2B at the same memory budget

Open questions

How does E2B compare against Phi-3 mini, Llama 3.2 1B, and Qwen 2.5 1.5B at similar parameter counts?
Why does E2B beat E4B on multi-turn (70% vs 0%) when they share the same architecture family?
Can quantization take E2B down to 2–3 GB without destroying the multi-turn advantage?