Plan — Benchmark re-ingest from the admin UI
Date: 2026-06-30
Status: ✅ Built (2026-06-23) on feat/benchmark-reingest-admin (stacked on
fix/review-findings-embedding-autoroute). All 5 phases done; tsc/lint/tests/build green.
Branch: feat/benchmark-reingest-admin
Implementation note: the EcoLogits script (
scripts/ingest-ecologits.ts) was left as-is — its dry-run preview mode is a distinct dev tool that the always-storesingestEcoLogits()lib function doesn't replicate. Only the cron route was lifted onto the shared core, per the plan's wording.
Goal
Let an admin trigger re-ingestion from the live benchmark sources from the
admin → Benchmarks tab, instead of only re-reading the DB. Today the "Refresh"
button calls fetchBenchmarksData(), which just re-reads benchmark_snapshots;
fresh data only lands when a developer runs scripts/ingest-*.ts from a laptop.
Current state
| Source | Origin | Server-side fetchable? | Secrets | How it runs today |
|---|---|---|---|---|
| ecologits | api.ecologits.ai (live) |
✅ yes | none | scripts/ingest-ecologits.ts + live route /api/admin/ecologits-refresh (CRON_SECRET, weekly cron Mon 03:00) |
| lmarena | HF Datasets Server (live) | ✅ yes | optional HF_TOKEN |
scripts/ingest-lmarena.ts |
| artificialanalysis | artificialanalysis.ai/api/v2 (live) |
✅ yes | ARTIFICIAL_ANALYSIS_API_KEY |
scripts/ingest-artificialanalysis.ts |
| mteb | hardcoded seed list (no live source; HF dataset broken) | n/a — manual curation | none | scripts/ingest-mteb.ts |
| livebench | not implemented (licence pending) | — | — | — |
Key facts from investigation:
- ingestSnapshot() (src/lib/benchmarks.ts:158) is idempotent — unique key
(source, source_category, source_model_name, snapshot_date) with ON CONFLICT
DO UPDATE. Safe to re-run; same day overwrites, new day adds a snapshot.
- The 3 live sources have no local-file dependency → all runnable inside a
Vercel serverless function.
- EcoLogits already proves the pattern: fetchEcoLogitsScore(slug, provider,
{storeInDb:true}) is a lib function; the route just loops over models.
- The other ingest logic lives only in the scripts — fetch + parse + ingest +
console.log are mixed together. It must be extracted to src/lib/ to be
callable from a server action/route.
Scope decision
In scope: wire the 3 live sources (ecologits, lmarena, artificialanalysis) to an admin-triggered re-ingest. mteb surfaces as "manual / seed — no live refresh"; livebench as "not available (licence pending)". Both are disabled in the UI with explanatory text rather than hidden.
Proposed architecture
1. Extract ingest cores into src/lib/ingest/ (the real work)
Create one module per live source exposing a pure-ish async function that
fetches → parses → calls ingestSnapshot, returning a structured result (no
console.log, no process.exit):
src/lib/ingest/types.ts // IngestResult { source, inserted, unmatched, fetched, error? }
src/lib/ingest/lmarena.ts // ingestLmArena(opts): Promise<IngestResult>
src/lib/ingest/artificialanalysis.ts // ingestArtificialAnalysis(opts)
src/lib/ingest/ecologits.ts // ingestEcoLogits(opts) — wraps the existing loop in the route
Then refactor the scripts to thin CLI wrappers that import these and keep the
--apply/dry-run + console.log behaviour. This keeps the CLI working and gives
us one source of truth. (EcoLogits: lift the loop out of the route into
ingestEcoLogits, have both the route and the new action call it.)
2. Trigger: admin server actions (not new public routes)
Add to src/app/admin/actions.ts, each guarded by requireAdmin() (matches
addBenchmarkAlias etc.):
export async function reingestSource(source: 'lmarena' | 'artificialanalysis' | 'ecologits')
: Promise<IngestResult>
Rationale: the existing /api/admin/ecologits-refresh route stays as the
cron entry point (CRON_SECRET). Admin-UI triggering goes through a server
action protected by the admin session — no second auth path to reason about.
reingestSource('ecologits') and the cron route both call ingestEcoLogits.
3. UI: src/app/admin/benchmarks-tab.tsx
- Keep the existing Refresh button but relabel to "Reload view" (it only re-reads the DB — current wording is what caused the confusion).
- In the per-source summary table, add a "Re-fetch" action per row:
- live sources → button calls
reingestSource(source), shows spinner, then a result banner (inserted/unmatchedcounts) and reloads the view. mteb→ disabled, tooltip "Seed data — re-curate via script".livebench→ disabled, tooltip "Licence pending".- Reuse the existing
feedbackbanner +isPendingpattern. Add error handling (already fixed for the plain Refresh in4571dda).
4. Config / secrets
- Vercel env: add
ARTIFICIAL_ANALYSIS_API_KEY(required for AA), optionallyHF_TOKEN(LMArena rate limits). Document in HANDOFF "before deploying". vercel.jsoncron unchanged.
Build sequence (phases)
- Extract —
src/lib/ingest/*+ refactor 3 scripts to wrappers. Verify each script still runs identically (dry-run diff). No behaviour change. ✅ tsc/tests. - EcoLogits route refactor — route delegates to
ingestEcoLogits; confirm cron path unchanged. - Server actions —
reingestSourcein admin actions,requireAdmin-guarded. - UI — per-source Re-fetch buttons + relabel Refresh + result banners.
- Docs — changelog, HANDOFF (new env vars), user-guide admin section.
Decisions (locked 2026-06-30)
- Granularity: ✅ Per-source "Re-fetch" buttons. One per source row; AA key failure or a single source error won't block the others; clearer feedback.
- Confirmation: ✅ Require a confirm dialog before any live re-fetch ("Re-fetch lmarena now?") — it writes to the shared prod DB.
- Timeouts: LMArena paginates 3 subsets (text/webdev/vision) — could exceed
the default serverless duration. Plan: (a) bump
maxDuration, measure; fall back to (b) splitting LMArena into 3 per-subset calls if needed. Main technical risk. (No input needed — implementation detail.) - MTEB: leave as script-only (manual seed re-curation); UI shows it disabled with an explanatory tooltip. No "re-apply seed" button.
Status: HELD AT PLAN — awaiting review before any code is written.
Risks / notes
- Serverless duration (decision 2) is the one thing most likely to bite.
- AA API key must be in Vercel before its button works — otherwise the action returns a clean "ARTIFICIAL_ANALYSIS_API_KEY not set" error (handle gracefully).
- Cohort scaling: LMArena/AA rows are cohort-normalised at ingest time per
(source, category, snapshot_date). Re-fetching a full source is fine; ingesting a single model alone would skew its cohort — but these actions always ingest the whole source, so no regression. (This is the same bug class the EcoLogits absolute curve avoided; don't add single-model live ingest for cohort-scaled sources.) - Idempotency means re-running is safe; no dedup work needed.
Out of scope
- LiveBench ingestion (licence pending).
- Live MTEB (upstream HF dataset broken).
- Scheduled crons for lmarena/AA (could add later, mirroring the ecologits cron).