Task-type review
Date: 2026-05-19 Status: Approved — pending v0.7.0 merge before execution Owner: Tom
Approved decisions (2026-05-19)
- Remove
visionas a task type. It stays only as a capability (requires_capabilities: ["vision"]). - Remove
"other"entirely. Classifier must pick a real type or setclarification_needed: truewhen confidence is low. - Keep
is_agenticas a separate flag (do not introduce anagenttask type). Add UI surfacing whenis_agentic: trueso users know the recommendation assumes a tool harness. - Version bump to 0.8.0 — breaking registry schema change.
- Land v0.7.0 first on a clean branch from master before starting this work.
Why now
The current canonical task types (summarise, generate, extract, code, analyse, translate, conversation, vision, other) have two visible problems:
-
"other"is a silent failure mode. When the classifier picks it (e.g. for the Grafana "open a browser and analyse" prompt), downstream code readstask_fitness["other"]and getsundefined. Inlocal-inference.ts:111this defaults to0and filters every model out — the "Run it locally" section silently disappears. Inscoring.ts:188it defaults to0.5and recommendations come back generic. Both consumers swallow the failure rather than warning. -
The types are too few and too coarse. Mathematical reasoning, multi-step logic puzzles, factual Q&A, and research with citations are all currently routed through
analyse. Short business comms (email, slack reply) are routed throughgenerate. The benchmark grounding work (v0.7.0) already exposes this —livebench.reasoning,livebench.mathematics,aa_math,aime_25,gpqa,hleall collapse intoanalyseinbenchmarks.ts:30-74, so we lose meaningful signal.
A short prompt fix (Option B from the earlier discussion, already shipped) tells the classifier to pick by intent not mechanism — that stops the Grafana-style "other" misfires. This plan is the deeper Option C: a properly-scoped task-type expansion.
Inventory of current state
Definition
src/lib/registry.ts:5 — single source of truth for TaskType. Currently nine values including other.
Where task_type is consumed
| Site | Behaviour on unknown / "other" |
|---|---|
src/lib/scoring.ts:188 |
?? 0.5 — defaults silently |
src/lib/local-inference.ts:111 |
?? 0 — filters model out silently (this is the bug) |
src/lib/benchmarks.ts:30 |
Reverse map (category → task) — adding a type requires updating the benchmark-source mappings |
src/app/actions.ts:112 |
Persisted to DB row |
src/app/admin/discover-tab.tsx:27 |
Duplicated ALL_TASK_TYPES literal — admins toggle fitness values |
src/app/admin/models/[slug]/page.tsx:12 |
Duplicated ALL_TASK_TYPES literal — same |
src/app/recommend/[taskId]/results/page.tsx:39 |
Displays raw string to user |
src/app/validate/[taskId]/results/page.tsx:177 |
Displays raw string to user |
src/app/models/[slug]/page.tsx:190 |
Iterates task_fitness for model detail page |
src/lib/db.ts:286,362,374 |
Stored as JSONB; type-agnostic |
Coverage in registry data
All 29 models in src/data/bearing-registry.json have exactly the eight task-keys (code, vision, analyse, extract, generate, summarise, translate, conversation). No other key present, which is what causes the silent-zero. No drift across models — they're uniform.
Duplication risk
ALL_TASK_TYPES is defined twice in admin pages with no other. The registry data also doesn't include other. So other only exists in the classifier's enum and the TaskType union — everywhere it matters it's missing. This is consistent with treating it as an escape valve rather than a real type.
Proposed type list
Twelve types, organised by what the user wants out. Existing types preserved where the meaning stays clean; narrowed where overloaded.
| Type | Definition | Notes / migration |
|---|---|---|
summarise |
Condense longer input into shorter output | unchanged |
extract |
Pull structured data from unstructured input (incl. OCR, transcription, table extraction) | absorbs current "extract from PDF" cases that misfire as vision |
generate |
Create new long-form prose: reports, articles, proposals, creative writing | narrowed — short business comms move to comms |
comms |
New. Short business communication: emails, Slack replies, customer responses, meeting recaps | currently misclassified as generate |
code |
Write, review, debug, or explain code | unchanged |
math |
New. Numerical computation, calculation, proofs, formal problem-solving | currently misclassified as analyse |
reasoning |
New. Multi-step logic, planning under constraints, structured decisions where the process matters more than the facts | currently misclassified as analyse |
analyse |
Interpret and explain qualitative information; produce judgments backed by evidence | narrowed — pure logic moves to reasoning, pure maths moves to math |
research |
New. Investigative Q&A that benefits from tool use, retrieval, or citing sources | currently splits between analyse and conversation |
qa |
New. Short factual question-answer: definitions, lookups, factual recall | currently misclassified as conversation |
translate |
Convert text between human languages | unchanged |
conversation |
Ongoing multi-turn dialogue: chatbots, tutoring, brainstorming, dialogue companions | narrowed — one-shot Q&A moves to qa |
Notes on what is not in the list
visionis removed as a task type. It was already overloaded (also a capability) and the use cases it covers all reduce to one of the above (vision-input extraction →extract; vision-input analysis →analyse; etc.). Image input is expressed viarequires_capabilities: ["vision"].otheris removed. With twelve types, every reasonable request maps somewhere. If a request genuinely can't be classified, the classifier setsclarification_needed: trueand returnsconfidence < 0.5instead of routing through a silently-broken type.
Distinguishing rubrics (hardest cases)
These are the boundaries I expect the classifier to mis-call most often. Worth nailing down in the prompt:
-
reasoningvsanalyse—reasoningis process-led (the user wants the steps, the logic, the plan);analyseis interpretation-led (the user wants the meaning, the judgement, the assessment). Test: "Plan the optimal order to visit 6 cities given these constraints" → reasoning. "Why is our churn rate trending up?" → analyse. -
mathvsreasoning—mathis numeric/formal;reasoningis symbolic/strategic. Test: "Solve this integral" → math. "Which is the cheapest hardware plan that meets all constraints?" → reasoning. -
qavsresearch—qais short and self-contained;researchbenefits from tools or citations. Test: "What's the capital of Bulgaria?" → qa. "What's the current state of FDA approval for GLP-1 drugs?" → research. -
qavsconversation—qais one-shot,conversationis multi-turn. Test: A single factual question → qa. A back-and-forth chatbot → conversation. -
commsvsgenerate—commsis short and addressed to a specific audience;generateis long-form. Test: "Reply to this customer email apologetically" → comms. "Write a 5-page proposal for the board" → generate.
Migration plan
Step 1 — Data: fill in task_fitness for the new types
The 29 existing models have eight keys; the new schema needs four added (comms, math, reasoning, research, qa) and one removed (vision).
Three options:
A. Derive from benchmarks (preferred — matches v0.7.0 grounding philosophy)
math←livebench.mathematics,aa_math,aime_25,math_500reasoning←livebench.reasoning,gpqa,hle,mmlu_proresearch← (limited direct coverage — fall back to a curated derivation, e.g.0.6 * analyse + 0.4 * tools-capable)qa←aa_intelligenceas a weak proxy; otherwise curatedcomms← no direct benchmark — derive as0.85 * generateinitially
This means updating CATEGORY_TO_TASKS in src/lib/benchmarks.ts to map benchmark categories onto the new types directly, then re-running the grounding job over the snapshots already in benchmark_snapshots.
B. Manual curation — fast but regresses the v0.7.0 work. Not recommended.
C. Sensible defaults + iterate — start every model at 0.5 for the new types and lean on benchmark grounding to refine. Safer than A as a first step but produces uniformly mediocre recommendations until grounding runs.
Recommendation: A with C as fallback for types with no benchmark coverage (research, comms, qa).
Step 2 — Code: update the type definition and consumers
src/lib/registry.ts:5— updateTaskTypeunion. Removevisionandother. Addmath,reasoning,research,qa,comms.src/lib/scoring.ts:188— review the?? 0.5fallback. Withothergone, missing keys would indicate a stale registry row; consider erroring loudly instead.src/lib/local-inference.ts:111— same review. The?? 0default should never fire under the new schema; consider erroring or warning.src/lib/benchmarks.ts:30-74— extensive update. Reroute math/reasoning categories away fromanalyse. Add explicitresearch/qa/commsmappings where benchmarks exist.src/app/admin/discover-tab.tsx:27andsrc/app/admin/models/[slug]/page.tsx:12— update bothALL_TASK_TYPESliterals. Also dedupe — move to a shared constant exported fromregistry.ts.
Step 3 — Prompt: rewrite src/prompts/classify.md
- Update the task_type enum in the schema (line 8 and line 36 — the pipeline stage variant).
- Replace the "Task type definitions" section with the rubric in this doc.
- Keep the "classify by intent, not mechanism" guidance added in Option B.
- Add the four hardest-case rubrics (math vs reasoning, qa vs research, qa vs conversation, comms vs generate) as worked examples.
- Add 4-6 new few-shot-style examples covering the new types.
Step 4 — UI
recommend/[taskId]/results/page.tsx:39andvalidate/[taskId]/results/page.tsx:177display the raw task_type string. Add a display-name map ('qa' → 'question answering','comms' → 'business communication', etc.). Probably belongs inregistry.tsalongside the type.models/[slug]/page.tsx:190displays thetask_fitnesskeys — same label treatment.- Agentic badge. When
is_agentic: trueon the classification, render a small notice near the top of the results page: "This task involves agent-style steps (browser automation, tool calls, code execution). The recommendation assumes your deployment provides a tool harness — e.g. Playwright/MCP for browsers, code-execution sandboxes, file I/O. Without that, the model can analyse but cannot act." This is non-blocking — the recommendation still renders — but tells the user honestly that capability ≠ execution.
Step 5 — Tests
src/lib/__tests__/scoring.test.ts:160,190,211,228— currently uses'code'and'analyse'. Add tests covering at least one each of the new types and confirming that fallbacks behave correctly.src/lib/__tests__/classification.test.ts— add a "classify by intent" test (e.g. the Grafana prompt) confirming the right type is chosen.- Add a registry-integrity test: every model must have all twelve task-fitness keys, no extras.
- Add a benchmark-mapping test: every key in
CATEGORY_TO_TASKSmust point to a validTaskType.
Step 6 — Database
task_fitness is stored as JSONB so the schema doesn't need migrating, but the contents of existing rows need updating to match the new keys.
- Write a one-off migration script (or admin-side action) that:
- Reads each row's current
task_fitness. - Drops the
visionkey. - Computes new keys per Step 1.A.
- Writes back.
- Run it once against prod after merge. Keep the previous JSONB blob in a backup column or git-tracked snapshot for rollback.
Step 7 — Docs
- Update
docs/scoring/task-fitness-rubric.md(if it exists; if not, create) with the new twelve-type rubric and worked examples. - Update
docs/model-ratings.mdwith the new task labels. - Update
docs/user-guide.mdif it references task types by name. - Update
docs/changelog.mdwith a v0.8.0 entry. - Bump
meta.versioninbearing-registry.jsonto0.8.0.
Scope and ordering
This is a single coherent change. Estimated diff: 500–800 lines including data migration, code, prompt rewrite, and tests. I'd suggest a separate branch (feat/task-types-v2) with the work ordered:
- Update
TaskType+ sharedALL_TASK_TYPESconstant (compiler will tell us everywhere that breaks). - Update benchmarks mapping + run grounding to populate new fitness values.
- Migrate registry JSON + DB rows.
- Update prompt + add classifier tests.
- Update UI labels + admin pages.
- Update docs + bump version.
- Manual QA against the three test prompts (and the Grafana case).
Open questions
All five resolved 2026-05-19 — see "Approved decisions" at the top.