How we measure
Methodology
Every score on this site comes from the same pipeline: an open eval harness scoring the same models, across the same use cases, in the same Ollama environment, on three different GPUs.
Download raw data
Everything on this site is open. No login. No rate limit. JSON by default, CSV via ?format=csv.
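A client can build the export URL itself. The sketch below is illustrative only: the function name and the base URL are hypothetical, not part of the site's API; the only documented behavior is that JSON is the default and `?format=csv` switches to CSV.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit

def export_url(base: str, fmt: str = "json") -> str:
    """Return the download URL for the requested format.

    JSON is the server default, so no query parameter is needed for it;
    only append ?format=csv (or merge it into an existing query) for CSV.
    """
    if fmt == "json":
        return base
    scheme, netloc, path, query, frag = urlsplit(base)
    query = (query + "&" if query else "") + urlencode({"format": fmt})
    return urlunsplit((scheme, netloc, path, query, frag))
```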
Six model tiers
We group models into six tiers by parameter count so the leaderboard is readable at a glance. Tier is informational — scores are absolute, not tier-relative.
| Tier | Range | Typical fit |
|---|---|---|
| Micro | ≤ 1B params | Edge / on-device candidates |
| Small | ~2–4B params | Laptop-class inference |
| Medium | ~7–8B params | Single-GPU production floor |
| Large | ~13–14B params | Mid-range reasoning |
| XL | ~30–40B params | Dense, multi-GPU capable |
| Flagship | ≥ 70B params | State-of-the-art open weights |
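The tier boundaries above can be sketched as a lookup. Note the cutoffs for the gaps the table leaves open (e.g. 5–6B, 9–12B) are assumptions in this sketch, not published thresholds:

```python
def tier(params_b: float) -> str:
    """Map a parameter count (in billions) to a leaderboard tier.

    Boundaries inside the table's gaps (4-7B, 8-13B, 14-30B, 40-70B)
    are illustrative assumptions; the published ranges are approximate.
    """
    if params_b <= 1:
        return "Micro"
    if params_b <= 4:
        return "Small"
    if params_b <= 8:
        return "Medium"
    if params_b <= 14:
        return "Large"
    if params_b <= 40:
        return "XL"
    return "Flagship"
```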
Eight eval use cases
These eight tasks were chosen because they are the real shapes an agent encounters in production — not trivia, not academic.
- Chunking: split long documents into coherent, retrieval-friendly units.
- Search query: turn a user intent into a precise retrieval query.
- Delta summarization: summarize only what changed between two documents.
- Memory extraction: pull durable facts out of a session transcript.
- Context synthesis: weave N retrieved snippets into one faithful brief.
- Adapter extraction: fill a strict schema from unstructured prose.
- Classification: assign a single label from a closed set, with confidence.
- Embedding enrichment: rewrite a chunk to be more embeddable without losing meaning.
Four scoring dimensions
Every output is scored along four dimensions, then combined into a single composite.
| Dimension | Weight | What it asks |
|---|---|---|
| Correctness | 0.40 | Does it say the true thing? |
| Completeness | 0.25 | Does it say all of the true things it was asked for? |
| Format quality | 0.20 | JSON parses, schema matches, boundaries respected. |
| Conciseness | 0.15 | Extra tokens are a tax. No padding, no hedging. |
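As a concrete example of the format-quality dimension, a check like the one below gates on "JSON parses, schema matches." This is a minimal all-or-nothing sketch with an assumed schema for the classification task; the real harness may award partial credit and uses its own schemas.

```python
import json

# Illustrative schema for the classification use case (an assumption,
# not the harness's actual schema).
REQUIRED_KEYS = {"label", "confidence"}

def format_quality(raw: str) -> float:
    """Score 1.0 if the output parses as JSON and matches the schema exactly,
    else 0.0. The production harness may be more granular."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(obj, dict) or set(obj) != REQUIRED_KEYS:
        return 0.0
    return 1.0
```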
Composite score
composite =
0.40 * correctness
+ 0.25 * completeness
+ 0.20 * format_quality
+ 0.15 * conciseness
All dimensions are normalized to [0, 1], so the composite is also in [0, 1].
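In code, the composite is a plain weighted sum. This sketch assumes the dimension names used on this page; the harness's internal field names may differ:

```python
# Weights as published: correctness dominates, conciseness is the smallest term.
WEIGHTS = {
    "correctness": 0.40,
    "completeness": 0.25,
    "format_quality": 0.20,
    "conciseness": 0.15,
}

def composite(scores: dict) -> float:
    """Combine four per-dimension scores (each in [0, 1]) into one composite."""
    assert all(0.0 <= scores[k] <= 1.0 for k in WEIGHTS), "scores must be normalized"
    return sum(w * scores[k] for k, w in WEIGHTS.items())
```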
The Ollama environment
Every instance serves the model under an identical Ollama config, so the only meaningful variable is the silicon.
OLLAMA_CONTEXT_LENGTH=8192 OLLAMA_KV_CACHE_TYPE=q4_0 OLLAMA_FLASH_ATTENTION=1
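For example, the server can be launched with this config pinned for a single run (rather than exported into the shell profile), so an eval session can't inherit stale settings:

```shell
# One-shot launch with the eval environment; nothing persists after exit.
OLLAMA_CONTEXT_LENGTH=8192 \
OLLAMA_KV_CACHE_TYPE=q4_0 \
OLLAMA_FLASH_ATTENTION=1 \
ollama serve
```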
How we count savings
The strip at the top of the site is a live readout of the self-cheapening loop: every activity event used to be narrated by hosted Claude, and we're migrating that narration to a local model running on the A6000. The percentage is the share of the last 24 hours of events authored by local inference (or zero-cost system signals like webhooks), computed over a rolling window against the events table.
The dollar number is a conservative floor. We assume each hosted Claude event cost CLAUDE_EVENT_COST_USD (default $0.004, configurable in .env.example) and compare to the counterfactual where every event had been hosted. Real Claude API costs vary with prompt size and model; our estimate is deliberately low so the published number only ever rounds down. Events authored by humans are excluded from the ratio — they're a signal, not an inference cost.
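The ratio and the dollar floor described above can be sketched as follows. The event shape here is hypothetical (a `source` field with values like `"local"`, `"claude"`, `"system"`, `"human"`); the real events table schema may differ, and the window query is elided:

```python
# Conservative per-event floor, matching the documented default in .env.example.
CLAUDE_EVENT_COST_USD = 0.004

def savings(events):
    """Compute (local_share, dollars_saved) over a 24h window of events.

    Assumes each event is a dict with a 'source' field -- an illustrative
    schema, not the real table. Human-authored events are excluded from
    the ratio entirely; local inference and zero-cost system signals
    both count toward the local share.
    """
    countable = [e for e in events if e["source"] != "human"]
    local = [e for e in countable if e["source"] in ("local", "system")]
    share = len(local) / len(countable) if countable else 0.0
    # Counterfactual floor: every local/system event would otherwise
    # have been a hosted Claude call at the (deliberately low) flat rate.
    dollars = len(local) * CLAUDE_EVENT_COST_USD
    return share, dollars
```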
Source
The harness that produces every row on this site lives in the open. Fork it. File issues. Propose new use cases or dimensions.
What we don't do
- No private data in prompts. Every prompt is open-source. If you can't read it in the repo, it isn't in the eval.
- No vendor-supplied benchmark results. Every number is produced by our harness on hardware we rent and pay for. Model card claims are not reproduced here.
- No cloud API models. Only open-weights served through local inference. Closed-API models are a different measurement problem: latency is dominated by someone else's datacenter, not the chip.