How we measure

Methodology

Every score on this site comes from the same pipeline — an open eval harness scoring the same model, across the same use cases, in the same Ollama environment, on three different GPUs.

Download raw data

Everything on this site is open. No login. No rate limit. JSON by default, CSV via ?format=csv.
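As a small illustration of the `?format=csv` switch, the sketch below adds that parameter to a URL while keeping any other query parameters. The endpoint path and the `tier` parameter are hypothetical placeholders, not documented routes on this site.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_csv_format(url: str) -> str:
    """Return `url` with format=csv set, preserving other query parameters."""
    parts = urlsplit(url)
    # Drop any existing format parameter so we never end up with two.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "format"
    ]
    query.append(("format", "csv"))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Hypothetical endpoint, for illustration only.
example = with_csv_format("https://example.com/api/runs?tier=small")
```

The resulting URL can be handed to any HTTP client; the helper itself makes no network calls.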

Six model tiers

We group models into six tiers by parameter count so the leaderboard is readable at a glance. Tier is informational — scores are absolute, not tier-relative.

Tier	Range	Typical fit
Micro	≤ 1B params	Edge / on-device candidates
Small	~2–4B params	Laptop-class inference
Medium	~7–8B params	Single-GPU production floor
Large	~13–14B params	Mid-range reasoning
XL	~30–40B params	Dense, multi-GPU capable
Flagship	≥ 70B params	State-of-the-art open weights

Eight eval use cases

These eight tasks were chosen because they are the real shapes an agent encounters in production, not trivia or academic exercises.

Chunking: Split long documents into coherent, retrieval-friendly units.
Search query: Turn a user intent into a precise retrieval query.
Delta summarization: Summarize only what changed between two documents.
Memory extraction: Pull durable facts out of a session transcript.
Context synthesis: Weave N retrieved snippets into one faithful brief.
Adapter extraction: Fill a strict schema from unstructured prose.
Classification: Assign a single label from a closed set, with confidence.
Embedding enrichment: Rewrite a chunk to be more embeddable without losing meaning.

Four scoring dimensions

Every output is scored along four dimensions, then combined into a single composite.

Dimension	Weight	What it asks
Correctness	0.40	Does it say the true thing?
Completeness	0.25	Does it say all of the true things it was asked for?
Format quality	0.20	JSON parses, schema matches, boundaries respected.
Conciseness	0.15	Extra tokens are a tax. No padding, no hedging.

Composite score

composite =
    0.40 * correctness
  + 0.25 * completeness
  + 0.20 * format_quality
  + 0.15 * conciseness

All dimensions are normalized to [0, 1]. Composite is in [0, 1].
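A minimal sketch of the composite formula in Python. The weights are taken from the table above; the function name, the dictionary layout, and the range check are illustrative assumptions rather than the harness's actual code.

```python
# Weights from the "Four scoring dimensions" table; they sum to 1.0.
WEIGHTS = {
    "correctness": 0.40,
    "completeness": 0.25,
    "format_quality": 0.20,
    "conciseness": 0.15,
}


def composite(scores: dict[str, float]) -> float:
    """Weighted sum of the four dimension scores, each expected in [0, 1]."""
    for name in WEIGHTS:
        value = scores[name]
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Illustrative scores, not real results.
example = composite(
    {
        "correctness": 0.9,
        "completeness": 0.8,
        "format_quality": 1.0,
        "conciseness": 0.6,
    }
)
```

Because the weights sum to 1 and every input lies in [0, 1], the result stays in [0, 1] as well. For the illustrative scores above it works out to 0.36 + 0.20 + 0.20 + 0.09 = 0.85.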

The Ollama environment

Every instance serves the model under an identical Ollama config, so the only meaningful variable is the silicon.

OLLAMA_CONTEXT_LENGTH=8192
OLLAMA_KV_CACHE_TYPE=q4_0
OLLAMA_FLASH_ATTENTION=1
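One way to apply these settings from a launcher script is sketched below. The three variables and their values are the ones listed above; the helper names and the assumption that an `ollama` binary is on PATH are illustrative, not a description of how the harness starts its servers.

```python
import os
import subprocess

# The three settings listed above, as strings suitable for an environment.
OLLAMA_SETTINGS = {
    "OLLAMA_CONTEXT_LENGTH": "8192",
    "OLLAMA_KV_CACHE_TYPE": "q4_0",
    "OLLAMA_FLASH_ATTENTION": "1",
}


def ollama_env(base=None):
    """Return a copy of `base` (default: the current environment) with the settings applied."""
    env = dict(os.environ if base is None else base)
    env.update(OLLAMA_SETTINGS)
    return env


def start_ollama():
    """Start `ollama serve` with the shared settings (assumes `ollama` is on PATH)."""
    return subprocess.Popen(["ollama", "serve"], env=ollama_env())
```

As far as we understand Ollama's documentation, a quantized KV cache only takes effect when flash attention is enabled, which is why the two settings travel together.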

How we count savings

The strip at the top of the site is a live readout of the self-cheapening loop: every activity event used to be narrated by hosted Claude, and we're migrating that narration to a local model running on the A6000. The percentage is the share of the last 24 hours of events authored by local inference (or zero-cost system signals like webhooks), computed over a rolling window against the events table.

The dollar number is a conservative floor. We assume each hosted Claude event cost CLAUDE_EVENT_COST_USD (default $0.004, configurable in .env.example) and compare to the counterfactual where every event had been hosted. Real Claude API costs vary with prompt size and model; our estimate is deliberately low so the published number only ever rounds down. Events authored by humans are excluded from the ratio — they're a signal, not an inference cost.
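The two numbers can be sketched as follows. Only `CLAUDE_EVENT_COST_USD` and its default come from the text; the `Event` shape, the author labels ("hosted", "local", "system", "human"), and treating local inference as zero marginal cost are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

CLAUDE_EVENT_COST_USD = 0.004  # default quoted above


@dataclass
class Event:
    created_at: datetime
    author: str  # hypothetical labels: "hosted", "local", "system", "human"


def savings(events, now, cost=CLAUDE_EVENT_COST_USD):
    """Return (share, dollars) for the 24 hours ending at `now`."""
    cutoff = now - timedelta(hours=24)
    # Human-authored events are excluded from the ratio entirely.
    counted = [e for e in events if e.created_at >= cutoff and e.author != "human"]
    if not counted:
        return 0.0, 0.0
    non_hosted = sum(1 for e in counted if e.author in ("local", "system"))
    share = non_hosted / len(counted)
    # Counterfactual (every counted event hosted) minus actual hosted spend.
    dollars = non_hosted * cost
    return share, dollars


now = datetime(2025, 1, 2, tzinfo=timezone.utc)  # arbitrary illustrative timestamp
events = [
    Event(now - timedelta(hours=1), "hosted"),
    Event(now - timedelta(hours=2), "local"),
    Event(now - timedelta(hours=3), "local"),
    Event(now - timedelta(hours=4), "system"),
    Event(now - timedelta(hours=5), "human"),   # excluded from the ratio
    Event(now - timedelta(hours=30), "local"),  # outside the 24-hour window
]
share, dollars = savings(events, now)
```

With these six illustrative events, four count toward the ratio and three of them are non-hosted, so the share is 0.75 and the dollar figure is 3 × $0.004.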

Source

The harness that produces every row on this site lives in the open. Fork it. File issues. Propose new use cases or dimensions.

github.com/eidos-agi/eidos-server-llm-testing-01

What we don't do

—No private data in prompts. Every prompt is open-source. If you can't read it in the repo, it isn't in the eval.
—No vendor-supplied benchmark results. Every number is produced by our harness on hardware we rent and pay for. Model card claims are not reproduced here.
—No cloud API models. Only open-weights served through local inference. Closed-API models are a different measurement problem — latency is dominated by someone else's datacenter, not the chip.

Benchmarks — opinions and dead ends

Most published LLM benchmarks measure whatever is easy to measure. That's why the leaderboards don't change your mind. Here are the arguments behind our harness — what we measure, what we refuse to measure, and what we're still unsure of.

caveat · measurement conditions

The numbers on the homepage and the leaderboard come from GPUs rented on Thunder Compute, running Ollama through a virtualization layer. They are measurements of a rental tier, not a hardware ceiling.

Specifically: our H100 @ $2.49/hr is Thunder's production tier, running unshared. Our A100 @ $0.78/hr and A6000 @ $0.35/hr are the prototyping tier, which virtualizes the GPU and shares it across tenants. We've observed A100 at 13–15 tok/s on llama3.1:8b in that harness, far below the native-hardware ceiling and mostly a story about cloud plumbing, not silicon.

The $/M-tokens story survives that caveat — prototyping tiers are what a small team can actually rent, so "cheapest hourly is most expensive per token" is a real user experience, not a theoretical one. The raw throughput comparison does not; treat it as a lower bound on what the silicon can do.
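The arithmetic behind a $/M-tokens figure is short enough to sketch. The example plugs in the A100 hourly price and the 13–15 tok/s range quoted above; the function name is illustrative, and the calculation assumes the GPU generates tokens for the whole billed hour, which is a simplification.

```python
def usd_per_million_tokens(hourly_usd: float, tokens_per_second: float) -> float:
    """Cost of one million generated tokens at a steady throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return hourly_usd / tokens_per_hour * 1_000_000


# A100 prototyping tier at $0.78/hr, at both ends of the observed 13-15 tok/s range.
a100_slow = usd_per_million_tokens(0.78, 13)
a100_fast = usd_per_million_tokens(0.78, 15)
```

At those throughputs the A100 lands at roughly $14 to $17 per million tokens. The hourly price is a small number, but the division by a low throughput is what drives the per-token cost up.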

Opinions we hold

◆Tokens per second is not the benchmark. It's a denominator. The real benchmark is dollars per million tokens at a usable quality floor. A model that generates 300 tok/s of slop is worse than one that generates 30 tok/s of correct prose.
◆Prototyping-tier cloud GPUs are a mirage. Virtualized GPU at a low hourly price hides the fact that throughput collapses under contention. The same llama3.1:8b ran at 126.6 tok/s on H100 production and 4.3 tok/s on A6000 prototyping in our runs. Dollars per token: inverted.
◆Quality rubrics must be adversarial. Asking a model to score itself is performance theater. Every eval here uses an external judge with a rubric the model can't see, and every dimension is scored independently. If two dimensions correlate perfectly over a week, one of them is leaking.
◆Latency matters, but throughput matters more. First-token latency sells demos; steady-state throughput pays the bill. We publish both, and we sort the leaderboard on the money metric by default.
◆KV-cache quantization is free money for most use cases. We run with OLLAMA_KV_CACHE_TYPE=q4_0 and have yet to find a workload where it hurts composite score at a detectable level. We'd love to be proven wrong: file a counter-example.
◆Public benchmarks rot. Every number on this site starts losing trustworthiness the moment a new release of the underlying model ships. We re-run the full suite every week on the same hardware, and only the latest run drives the dashboard. Historical runs are in /runs.

Dead ends we abandoned

—CPU-only inference benchmarks. Ran them on a 128-thread EPYC. At our model tiers, CPU is ~15× slower than even a virtualized GPU. The measurement is correct and the answer is boring. We stopped publishing.
—Single-prompt quality judgments. Asking "did the model answer this one prompt correctly?" is a coin toss. We moved to rubric-scored batches of at least 20 prompts per use case before a score is published.
—Composite score as the only published number. We tried it. Everyone asks "which model is best?" and a single number gives them an answer they feel confident about, and that answer is wrong. The four dimensions (correctness, completeness, format, conciseness) stay separate in every published row.
—Speed-tests against vendor-cloud endpoints. Latency tests against hosted APIs measure the datacenter, not the model. We care about the chip.
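The 20-prompt minimum from the single-prompt item above can be expressed as a simple gate. The function name and the choice of an arithmetic mean as the aggregate are assumptions for illustration; only the threshold of 20 comes from the text.

```python
from statistics import mean

MIN_PROMPTS = 20  # minimum batch size per use case, as stated above


def publishable_score(scores):
    """Return the batch mean, or None when the batch is too small to publish."""
    if len(scores) < MIN_PROMPTS:
        return None
    return mean(scores)


too_small = publishable_score([0.5] * 19)
large_enough = publishable_score([0.5] * 20)
```

Returning `None` instead of a provisional number keeps an under-sampled use case off the leaderboard rather than showing a score that might flip on the next prompt.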

Socratic prompts — help us invent better benchmarks

Open questions we'd love contributions on. If you have a sharper answer, open a PR against the repo or drop it in the chat sidebar.

What's the smallest prompt set that can reliably discriminate between a 7B and a 14B model on memory extraction? If 20 is enough, why are we running 200? If 200 is needed, how would we prove it without measuring?
How do we measure "does this model know what it doesn't know"? Every current rubric scores the output — none score the model's own confidence relative to ground truth. Calibration is a first-class property, not a footnote.
Tokens-per-second is a straight line to a dollar. Quality-per-token is not. What's the function? Log? Sigmoid? A step at some parameter count? If we knew the shape, we'd stop interpolating.
Context window is a capacity, not a benchmark. Models degrade gracefully at different points. What's the sharpest test for "needle in a haystack at 100k tokens" that isn't gamed by positional priors?
We quantize KV cache to q4 and have seen no composite hit. Is there a use case where q4 KV nukes accuracy? We suspect long-horizon chained reasoning, but we haven't found a clean minimal test.
If a local 14B model scores within 5% of a hosted frontier on our rubrics, how much of the remaining 5% is actually rubric noise versus real capability gap? A delta smaller than your noise floor isn't a signal — it's a trap.
What's the correct benchmark for agentic work? Most of what we actually want these models for is multi-step tool use, not single-turn completion. The rubric-scored single-prompt eval is the wrong test; we don't yet know the right one.
Are the cheapest-per-token numbers stable across GPU utilization? We measure mostly-idle GPUs. A production workload sharing a GPU with other tenants changes every number on the leaderboard.
The H100 is 5× cheaper per token than an A100 here for the same model. Is that because the H100 is 5× better, or because the A100 instance we rented is throttled in ways we don't see? Distinguishing hardware from cloud-plumbing is hard.
How do you benchmark a model that's still learning from your queries? A lot of modern serving layers adapt. Our rubric scores a moment in time; we don't yet capture drift.
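For the noise-floor question above, one possible starting point (not an answer) is a paired bootstrap over per-prompt scores: if the confidence interval for the mean difference between two models includes zero, the gap is not distinguishable from rubric noise on that prompt set. The function name, the resample count, and the 95% level are illustrative assumptions.

```python
import random


def bootstrap_delta_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(scores_a - scores_b) over paired prompts."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)  # fixed seed so repeated runs give the same interval
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    # Resample the per-prompt differences with replacement and record each mean.
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


# Illustrative data: model A beats B by 0.1 on every prompt, so the interval sits at 0.1.
steady = bootstrap_delta_ci([0.6] * 20, [0.5] * 20)

# Illustrative data: the models trade wins, so the interval should straddle zero.
noisy = bootstrap_delta_ci([0.7, 0.5] * 10, [0.5, 0.7] * 10)
```

The same resampling can be repeated at different prompt-set sizes to see how quickly the interval narrows, which bears on the "20 versus 200 prompts" question as well.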

If any of these hook you, the repo is at eidos-agi/live-eidosagi-com. The harness is in the sibling eidos-server-llm-testing-01 project. Evidence-graded findings from ongoing work live in /research/why-local-matters.