
How we measure

Methodology

Every score on this site comes from the same pipeline — an open eval harness scoring the same model, across the same use cases, in the same Ollama environment, on three different GPUs.

Download raw data

Everything on this site is open. No login. No rate limit. JSON by default, CSV via ?format=csv.
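As a minimal sketch of how a consumer might hit the export endpoints (the base URL and the `/api/results` path here are placeholders, not the site's actual routes; only the `?format=csv` query parameter comes from the text above):

```python
import urllib.parse
import urllib.request

BASE_URL = "https://example.eidos.dev"  # placeholder host -- substitute the real site


def export_url(path: str, fmt: str = "json") -> str:
    """Build an export URL: JSON is the default, CSV via ?format=csv."""
    url = BASE_URL + path
    if fmt != "json":
        url += "?" + urllib.parse.urlencode({"format": fmt})
    return url


# No login, no rate limit -- a plain GET works:
# urllib.request.urlopen(export_url("/api/results", "csv"))
```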

Six model tiers

We group models into six tiers by parameter count so the leaderboard is readable at a glance. Tier is informational — scores are absolute, not tier-relative.

| Tier | Range | Typical fit |
| --- | --- | --- |
| Micro | ≤ 1B params | Edge / on-device candidates |
| Small | ~2–4B params | Laptop-class inference |
| Medium | ~7–8B params | Single-GPU production floor |
| Large | ~13–14B params | Mid-range reasoning |
| XL | ~30–40B params | Dense, multi-GPU capable |
| Flagship | ≥ 70B params | State-of-the-art open weights |
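The tier mapping can be sketched as a simple threshold function. The published ranges leave gaps (e.g. 5–6B, 9–12B), so the exact cutoffs chosen below for in-between sizes are illustrative assumptions, not part of the methodology:

```python
def tier(params_b: float) -> str:
    """Map a parameter count (in billions) to a leaderboard tier.

    Boundaries between published ranges are illustrative choices.
    """
    if params_b <= 1:
        return "Micro"
    if params_b <= 4:
        return "Small"
    if params_b <= 8:
        return "Medium"
    if params_b <= 14:
        return "Large"
    if params_b <= 40:
        return "XL"
    return "Flagship"
```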

Eight eval use cases

These eight tasks were chosen because they match the workloads an agent actually encounters in production: not trivia, not academic benchmarks.

Chunking
Split long documents into coherent, retrieval-friendly units.
Search query
Turn a user intent into a precise retrieval query.
Delta summarization
Summarize only what changed between two documents.
Memory extraction
Pull durable facts out of a session transcript.
Context synthesis
Weave N retrieved snippets into one faithful brief.
Adapter extraction
Fill a strict schema from unstructured prose.
Classification
Assign a single label from a closed set, with confidence.
Embedding enrichment
Rewrite a chunk to be more embeddable without losing meaning.

Four scoring dimensions

Every output is scored along four dimensions, then combined into a single composite.

| Dimension | Weight | What it asks |
| --- | --- | --- |
| Correctness | 0.40 | Does it say the true thing? |
| Completeness | 0.25 | Does it say all of the true things it was asked for? |
| Format quality | 0.20 | JSON parses, schema matches, boundaries respected. |
| Conciseness | 0.15 | Extra tokens are a tax. No padding, no hedging. |

Composite score

composite =
    0.40 * correctness
  + 0.25 * completeness
  + 0.20 * format_quality
  + 0.15 * conciseness

All dimensions are normalized to [0, 1]. Composite is in [0, 1].
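The composite above is a straight weighted sum, which can be written directly from the formula (the dictionary-based signature is an illustrative choice, not the harness's actual API):

```python
WEIGHTS = {
    "correctness": 0.40,
    "completeness": 0.25,
    "format_quality": 0.20,
    "conciseness": 0.15,
}


def composite(scores: dict[str, float]) -> float:
    """Weighted sum of the four dimensions; inputs and output in [0, 1]."""
    assert all(0.0 <= v <= 1.0 for v in scores.values()), "scores must be normalized"
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)
```

Because the weights sum to 1.0 and every dimension is normalized, the composite is guaranteed to stay in [0, 1].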

The Ollama environment

Every instance serves the model under an identical Ollama config, so the only meaningful variable is the silicon.

OLLAMA_CONTEXT_LENGTH=8192
OLLAMA_KV_CACHE_TYPE=q4_0
OLLAMA_FLASH_ATTENTION=1

How we count savings

The strip at the top of the site is a live readout of the self-cheapening loop: every activity event used to be narrated by hosted Claude, and we're migrating that narration to a local model running on the A6000. The percentage is the share of the last 24 hours of events authored by local inference (or zero-cost system signals like webhooks), computed over a rolling window against the events table.

The dollar number is a conservative floor. We assume each hosted Claude event cost CLAUDE_EVENT_COST_USD (default $0.004, configurable in .env.example) and compare to the counterfactual where every event had been hosted. Real Claude API costs vary with prompt size and model; our estimate is deliberately low so the published number only ever rounds down. Events authored by humans are excluded from the ratio — they're a signal, not an inference cost.
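The counting rule above can be sketched as follows (the `author` field and its values are illustrative assumptions about the events table schema, not its real columns):

```python
CLAUDE_EVENT_COST_USD = 0.004  # conservative per-event floor, per .env.example


def savings_readout(events: list[dict], cost: float = CLAUDE_EVENT_COST_USD):
    """Local-authorship share and dollar floor over a window of events.

    Each event is assumed to carry an "author" field in
    {"local", "hosted", "system", "human"}. Human-authored events are
    excluded from the ratio entirely -- a signal, not an inference cost.
    """
    counted = [e for e in events if e["author"] != "human"]
    local = [e for e in counted if e["author"] in ("local", "system")]
    share = len(local) / len(counted) if counted else 0.0
    # Counterfactual: every counted event billed at the hosted floor.
    # The saving is what the locally-authored ones would have cost.
    saved_usd = len(local) * cost
    return share, saved_usd
```

Because `cost` is a deliberate floor, the published dollar number only ever rounds down relative to real Claude API pricing.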

Source

The harness that produces every row on this site lives in the open. Fork it. File issues. Propose new use cases or dimensions.

github.com/eidos-agi/eidos-server-llm-testing-01

What we don't do

  • No private data in prompts. Every prompt is open-source. If you can't read it in the repo, it isn't in the eval.
  • No vendor-supplied benchmark results. Every number is produced by our harness on hardware we rent and pay for. Model card claims are not reproduced here.
  • No cloud API models. Only open-weights served through local inference. Closed-API models are a different measurement problem — latency is dominated by someone else's datacenter, not the chip.