Plan: Ground OpenRouter import scores in real benchmarks
Date: 2026-05-06
Branch: feature/import-grounding (off master)
Status: Phase 0 in progress
Problem
estimateModelScores (src/app/admin/actions.ts:172) sends Haiku
the OpenRouter metadata for a new model and accepts whatever JSON it
returns for tier, task_fitness, speed_score, privacy_score,
transparency.*, and sustainability.*. None of these are grounded
in published benchmarks. Imported models ship with hallucinated
scores that can steer recommendations toward the wrong models.
We already ingest LMArena and LiveBench into benchmark_snapshots
(see src/lib/benchmarks.ts) and blend them into task_fitness at
recommendation time, but this pipeline is not consulted during
import. Artificial Analysis is not yet a source.
Goal
At import time, derive every score that has an objective public source from that source. Use Haiku only for fields with no published signal. Surface provenance so admins can see which numbers are evidence-based.
Sources
- LMArena (already ingested).
- LiveBench (already ingested).
- Artificial Analysis (new) —
https://artificialanalysis.ai/api/v2/data/llms/models with an x-api-key header. Docs: https://artificialanalysis.ai/api-reference#models-endpoint. 513 models with per-model evaluations (intelligence, coding, math indices plus mmlu_pro, gpqa, hle, livecodebench, scicode, tau2, ifbench, aime_25, terminalbench_hard, lcr, math_500), median_output_tokens_per_second, median_time_to_first_token_seconds, pricing, slug.
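A minimal sketch of how we expect to consume that endpoint, assuming the field names listed above; the response envelope, pricing shape, and the fetchAAModels helper are assumptions to verify against the API reference.

```typescript
// Sketch of the AA models payload as we plan to read it. Only the fields named
// above come from the docs; the envelope and pricing shape are assumed.
interface AAModel {
  slug: string;
  name: string;
  evaluations: Record<string, number | null>; // mmlu_pro, gpqa, hle, livecodebench, ...
  median_output_tokens_per_second: number | null;
  median_time_to_first_token_seconds: number | null;
  pricing?: Record<string, number>; // assumed shape, not confirmed
}

// Hypothetical fetch helper; the x-api-key header is per the docs above.
async function fetchAAModels(apiKey: string): Promise<AAModel[]> {
  const res = await fetch("https://artificialanalysis.ai/api/v2/data/llms/models", {
    headers: { "x-api-key": apiKey },
  });
  if (!res.ok) throw new Error(`Artificial Analysis request failed: ${res.status}`);
  const body = await res.json();
  return (body.data ?? body) as AAModel[]; // assumed { data: [...] } envelope
}
```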
Decisions made (Phase 0 prep)
- No automatic tier derivation. AA intelligence vs curated tier
  overlaps too heavily across the 29 registry models —
  gpt-5.4-nano (budget, intel 44) outscores mistral-medium-3 (balanced, 18.8).
  Tier reflects provider positioning, not absolute capability. Tier remains
  an admin dropdown and defaults to balanced on import.
- Many-to-one alias model — one bearing slug can map to multiple
  AA variants (e.g. Claude Haiku 4.5 → both claude-4-5-haiku and
  claude-4-5-haiku-reasoning). The existing schema supports this (the unique
  key is on source_model_name, not bearing_slug).
- Admin confirms candidate aliases. The matcher returns ranked suggestions
  with reasons; the admin checks the right ones. False suggestions are cheap;
  false rejections are expensive.
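For concreteness, a sketch of the many-to-one case as alias rows; source_model_name and bearing_slug come from the schema note above, while the literal slug values are hypothetical.

```typescript
// Two AA variants mapped to one bearing model. The unique key is on
// source_model_name (not bearing_slug), so both rows coexist. Values illustrative.
const haikuAliases = [
  { source: "artificialanalysis", source_model_name: "claude-4-5-haiku", bearing_slug: "claude-haiku-4.5" },
  { source: "artificialanalysis", source_model_name: "claude-4-5-haiku-reasoning", bearing_slug: "claude-haiku-4.5" },
];
```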
Phases
Phase 0 — AA matcher and alias suggester ⬅ now
Build src/lib/import-grounding.ts exposing:
suggestBenchmarkAliases(
bearing: { slug: string; name: string; provider: string },
source: BenchmarkSource,
candidates: { name: string; slug?: string }[],
): { name: string; slug?: string; score: number; flags: string[] }[]
Matching:
- Strip parenthetical suffixes — but treat their content separately
so (Reasoning), (Non-reasoning), (Sep '25), effort markers
don't pollute the main token bag.
- Outside parens, strip product-suffix noise (Distill, Instruct,
Chat, Terminus, Speciale, Thinking, Preview, dates).
- Hyphens/underscores → spaces.
- Compress family + space + version-digit so Qwen 3 235B matches
qwen3 235b.
- Score with |intersection| / min(|q|, |aa|), threshold 0.85.
- Don't reject candidates with extra "size disambiguators" (mini,
nano, vl, distill, etc.) — flag them in the reason and let
the admin pick.
- Hard-skip models on a "no AA coverage" skip list (greenpt-*, mistral-ocr,
  codestral-*, devstral, ibm-granite-*).
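A minimal sketch of the normalisation and scoring rules above. Helper names, the noise/disambiguator token lists, and the regexes are placeholders; only the overlap formula and the 0.85 threshold come from the plan.

```typescript
// Sketch of the token normalisation + overlap scoring described above.
// Token lists are partial and the helpers are placeholders, not the final API.
const SUFFIX_NOISE = new Set(["distill", "instruct", "chat", "terminus", "speciale", "thinking", "preview"]);
const DISAMBIGUATORS = new Set(["mini", "nano", "vl", "distill"]);
const MATCH_THRESHOLD = 0.85;

function normalizeTokens(name: string): { main: string[]; paren: string[] } {
  const paren: string[] = [];
  // Pull parenthetical suffixes out so "(Reasoning)", "(Sep '25)" etc. stay out
  // of the main token bag but remain available for flags.
  const withoutParens = name.replace(/\(([^)]*)\)/g, (_, inner: string) => {
    paren.push(...inner.toLowerCase().split(/\s+/).filter(Boolean));
    return " ";
  });
  const main = withoutParens
    .toLowerCase()
    .replace(/[-_]/g, " ")               // hyphens/underscores -> spaces
    .replace(/\b([a-z]+) (\d)/g, "$1$2") // "qwen 3 235b" -> "qwen3 235b"
    .split(/\s+/)
    .filter((t) => t && !SUFFIX_NOISE.has(t));
  return { main, paren };
}

// |intersection| / min(|q|, |aa|); candidates at or above MATCH_THRESHOLD are suggested.
function overlapScore(q: string[], aa: string[]): number {
  const qSet = new Set(q);
  const aaSet = new Set(aa);
  if (!qSet.size || !aaSet.size) return 0;
  let hit = 0;
  for (const t of aaSet) if (qSet.has(t)) hit++;
  return hit / Math.min(qSet.size, aaSet.size);
}

// Extra size/variant tokens on the AA side don't reject a candidate; they are
// surfaced as flags for the admin to weigh.
function disambiguatorFlags(q: string[], aa: string[]): string[] {
  const qSet = new Set(q);
  return aa.filter((t) => !qSet.has(t) && DISAMBIGUATORS.has(t));
}
```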
Tests in src/lib/__tests__/import-grounding.test.ts against a
fixture AA payload.
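A sketch of one such fixture test, assuming Vitest and a JSON fixture of AA models; the fixture path, bearing slug, and expected AA slugs are illustrative.

```typescript
// Sketch of a fixture-driven matcher test. Assumes Vitest and resolveJsonModule;
// names and paths below are illustrative only.
import { describe, expect, it } from "vitest";
import { suggestBenchmarkAliases } from "../import-grounding";
import aaModels from "./fixtures/aa-models.json"; // [{ name, slug }, ...]

describe("suggestBenchmarkAliases", () => {
  it("suggests both AA Haiku variants for one bearing slug", () => {
    const suggestions = suggestBenchmarkAliases(
      { slug: "claude-haiku-4.5", name: "Claude Haiku 4.5", provider: "anthropic" },
      "artificialanalysis", // source value lands in BenchmarkSource in Phase 1
      aaModels,
    );
    expect(suggestions.map((s) => s.slug)).toEqual(
      expect.arrayContaining(["claude-4-5-haiku", "claude-4-5-haiku-reasoning"]),
    );
  });
});
```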
Phase 1 — AA ingester
- scripts/ingest-artificialanalysis.ts fetches the endpoint and writes one
  snapshot per (model, eval-key) plus aa_speed / aa_ttft rows.
- Add 'artificialanalysis' to BenchmarkSource. Extend CATEGORY_TO_TASKS.
- 010-aa-signals.sql adds a signal_type enum column to benchmark_snapshots
  (default 'task') so speed/TTFT can ride the same table.
- scripts/seed-aa-aliases.ts seeds initial aliases for the 29 registry models
  using the Phase 0 matcher.
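A hedged sketch of the snapshot fan-out: one row per evaluation key plus aa_speed/aa_ttft rows carrying the new signal_type. The row shape and field names here are illustrative, not the migration's final form.

```typescript
// Sketch: fan one AA model record out into benchmark_snapshots rows. The row
// shape and the "speed"/"ttft" signal_type values are assumptions for Phase 1.
type SnapshotRow = {
  source: "artificialanalysis";
  source_model_name: string;
  category: string; // e.g. "mmlu_pro", "gpqa", or the synthetic "aa_speed"/"aa_ttft"
  score: number;
  signal_type: "task" | "speed" | "ttft";
};

function toSnapshotRows(model: {
  slug: string;
  evaluations: Record<string, number | null>;
  median_output_tokens_per_second: number | null;
  median_time_to_first_token_seconds: number | null;
}): SnapshotRow[] {
  const base = { source: "artificialanalysis" as const, source_model_name: model.slug };
  const rows: SnapshotRow[] = [];
  for (const [key, value] of Object.entries(model.evaluations ?? {})) {
    if (typeof value === "number") {
      rows.push({ ...base, category: key, score: value, signal_type: "task" });
    }
  }
  if (model.median_output_tokens_per_second != null) {
    rows.push({ ...base, category: "aa_speed", score: model.median_output_tokens_per_second, signal_type: "speed" });
  }
  if (model.median_time_to_first_token_seconds != null) {
    rows.push({ ...base, category: "aa_ttft", score: model.median_time_to_first_token_seconds, signal_type: "ttft" });
  }
  return rows;
}
```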
Phase 2 — Suggestions in import modal
- discover-tab.tsx gains a "Benchmark matches" panel above Generate Estimates.
  For each source, show the top 5 candidates with checkboxes; the admin
  confirms. Saving writes to benchmark_aliases.
- Show flag badges on each candidate (distill, vl, etc.).
Phase 3 — Grounded estimation
Refactor estimateModelScores:
1. Read benchmark scores via getLatestBenchmarkScores() plus
getLatestPerformanceSignals(slug) (new) for AA speed/TTFT.
2. Deterministic mapping for grounded fields:
- task_fitness[task] ← weighted mean of mapped categories.
- speed_score ← AA median_output_tokens_per_second cohort-
normalised.
- privacy_score ← static provider table.
- tier stays admin-set (default balanced).
3. Haiku fills gaps with benchmark numbers in the prompt; explicitly
forbidden from overriding grounded fields.
4. Return provenance per field.
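A minimal sketch of step 2's deterministic mapping and the per-field provenance from step 4, assuming hypothetical helper names; the category→task weights and the min-max cohort normalisation are placeholders until this phase pins them down.

```typescript
// Sketch of the deterministic merge. Weights, normalisation scheme, and helper
// names are placeholders; only the mapping structure comes from the plan.
type Provenance = "benchmark" | "static" | "admin" | "haiku";
type GroundedScore = { value: number; provenance: Provenance };

// task_fitness[task] <- weighted mean of the benchmark categories mapped to it.
function groundedTaskFitness(
  categoryScores: Record<string, number>, // normalised 0..1 per benchmark category
  categoryToTasks: Record<string, { task: string; weight: number }[]>,
): Record<string, GroundedScore> {
  const acc: Record<string, { num: number; den: number }> = {};
  for (const [category, score] of Object.entries(categoryScores)) {
    for (const { task, weight } of categoryToTasks[category] ?? []) {
      const a = (acc[task] ??= { num: 0, den: 0 });
      a.num += score * weight;
      a.den += weight;
    }
  }
  return Object.fromEntries(
    Object.entries(acc).map(([task, { num, den }]) => [
      task,
      { value: num / den, provenance: "benchmark" as const },
    ]),
  );
}

// speed_score <- AA tokens/sec normalised against the cohort of models with an
// AA speed signal (simple min-max here; the real scheme is still to be chosen).
function cohortNormalisedSpeed(tokensPerSec: number, cohort: number[]): number {
  const min = Math.min(...cohort);
  const max = Math.max(...cohort);
  return max === min ? 0.5 : (tokensPerSec - min) / (max - min);
}
```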
Phase 4 — UI provenance + re-grounding
- Source badges on each slider in the import modal and edit page.
- "Refresh from benchmarks" button on edit page.
- Warning on import if a flagship-priced model has zero benchmark coverage.
Phase 5 — Tests + re-grounding the existing 29
- Unit tests for the matcher and the grounded merge.
- Integration test for end-to-end import.
- Manual: re-run grounded estimation against the existing 29 models with an admin review pass.
Files
New: src/lib/import-grounding.ts,
src/lib/__tests__/import-grounding.test.ts,
scripts/ingest-artificialanalysis.ts, scripts/seed-aa-aliases.ts,
src/db/migrations/010-aa-signals.sql.
Modified: src/lib/benchmarks.ts, src/app/admin/actions.ts,
src/app/admin/discover-tab.tsx,
src/app/admin/edit/[slug]/page.tsx, src/prompts/estimate-model.md.
Out of scope
- Replacing getLatestBenchmarkScores() blend logic.
- New task types beyond the existing 8 + agentic.
- Public benchmark page.