Task Fitness Rubric

task_fitness (TF) is a curated 0–1 score per (model, task_type) pair stored in src/data/bearing-registry.json. It feeds the "quality" factor in our recommendation scoring, so it directly controls whether frontier models surface for hard tasks. Historically, TF values clustered in a narrow 0.7–0.9 band, which flattened the dynamic range and let cheap models edge out frontier ones on complex work. This rubric exists so that re-grades are consistent across the registry and so future model additions have an anchor.
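The shape this rubric grades can be sketched as follows. This is a minimal sketch: the task-type names come from the sections below, but the TypeScript types and the entry field names other than task_fitness are assumptions for illustration, not copied from src/lib/scoring.ts or the registry schema.

```typescript
// Hypothetical shape of one entry in src/data/bearing-registry.json.
// Only task_fitness is confirmed by this doc; the rest is illustrative.
type TaskType =
  | "code" | "analyse" | "generate" | "summarise"
  | "extract" | "translate" | "conversation" | "vision";

interface ModelEntry {
  id: string;
  // Curated 0–1 score per task type. Missing coverage should fall back
  // to unknown_default (see "Rules of thumb"), not be treated as zero.
  task_fitness: Partial<Record<TaskType, number>>;
}

const example: ModelEntry = {
  id: "claude-sonnet-4.6",
  task_fitness: { code: 0.93, analyse: 0.91, summarise: 0.90 },
};
```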

Score anchors

Range        Meaning
0.95–1.00    Best-in-class on this task type, definitively SoTA
0.88–0.94    Strong frontier (very close to SoTA, but not the leader)
0.78–0.87    Capable balanced/mid-tier, solid for medium complexity
0.65–0.77    Budget tier: fine for simple tasks, struggles on complex
0.50–0.64    Weak / specialist mismatch
<0.50        Don't recommend for this task type
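When grading, it can help to sanity-check a candidate score against the bands above before committing it. A minimal sketch, with shortened band labels; this helper is hypothetical and not part of the registry or scoring code:

```typescript
// Map a task_fitness score onto the rubric bands above.
// Thresholds mirror the anchor table; labels are abbreviated.
function band(tf: number): string {
  if (tf >= 0.95) return "best-in-class";
  if (tf >= 0.88) return "strong frontier";
  if (tf >= 0.78) return "capable mid-tier";
  if (tf >= 0.65) return "budget tier";
  if (tf >= 0.5) return "weak / mismatch";
  return "do not recommend";
}
```

Useful as a quick check that a proposed score lands in the band you intended, e.g. a 0.93 grade should come back as "strong frontier", not "best-in-class".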

Per-task anchors

These are illustrative, not exhaustive — 3–4 reference points per task type. When grading a new model, find the closest anchor and adjust.

code

  • 0.96: Claude Opus 4.6, GPT-5.4 (Aider/SWE-Bench leaders)
  • 0.93: Claude Sonnet 4.6
  • 0.91: Gemini 3.1 Pro
  • 0.82: Llama 4 Maverick
  • 0.78: Gemini 3 Flash
  • 0.70: IBM Granite, Mistral Small

analyse

  • 0.97: Claude Opus 4.6 (top-of-range — long-context synthesis, multi-step reasoning)
  • 0.95: Gemini 3.1 Pro
  • 0.93: GPT-5.4
  • 0.91: Claude Sonnet 4.6
  • 0.80: GPT-5.4 mini
  • 0.74: Gemini 3 Flash
  • 0.68: Mistral Small (24B)

generate

  • 0.94: Claude Opus 4.6, GPT-5.4 (long-form prose, creative writing)
  • 0.89: Claude Sonnet 4.6
  • 0.80: Gemini 3 Flash, Llama 4 Maverick
  • 0.70: Mistral Small, Granite

summarise

  • 0.93: Claude Opus 4.6 (faithful, concise, low hallucination)
  • 0.92: Gemini 3.1 Pro
  • 0.90: Claude Sonnet 4.6
  • 0.88: GPT-5.4, Claude Haiku 4.5
  • 0.80: Gemini 3 Flash, GPT-5.4 mini
  • 0.70: Llama 4 Maverick

extract

  • 0.93: GPT-5.4, Claude Sonnet 4.6 (structured output, schema adherence)
  • 0.88: Gemini 3.1 Pro, Claude Haiku 4.5
  • 0.80: Gemini 3 Flash, GPT-5.4 mini
  • 0.68: Mistral Small

translate

  • 0.92: GPT-5.4, Gemini 3.1 Pro (broad language coverage, idiom handling)
  • 0.87: Claude Sonnet 4.6, Gemini 3 Flash
  • 0.78: Llama 4 Maverick
  • 0.65: Granite, smaller specialists

conversation

  • 0.93: Claude Sonnet 4.6, GPT-5.4 (tone, steerability, multi-turn coherence)
  • 0.88: Claude Haiku 4.5, Gemini 3.1 Pro
  • 0.80: Gemini 3 Flash, Llama 4 Maverick
  • 0.70: Mistral Small

vision

  • 0.93: Gemini 3.1 Pro (chart/diagram reasoning, OCR, spatial)
  • 0.88: Claude Opus 4.6, Claude Sonnet 4.6, GPT-5.4
  • 0.78: Gemini 3 Flash
  • 0.50: text-only models (use unknown_default if unsure)

Vision anchors are noisier than text — re-check against current benchmark coverage when grading.

Rules of thumb

  • The same model gets DIFFERENT TF scores per task — Codestral might be 0.92 on code and 0.45 on generate. Don't average a model into one number.
  • When in doubt, anchor against an existing model in the same band rather than inventing a number.
  • Keep tier-to-tier gaps meaningful: a flagship should usually beat a balanced model by ≥0.10 on its strong tasks. If the gap is <0.05, the rubric isn't doing its job.
  • If a model lacks coverage for a task type, set the field to its unknown_default (the registry already has 0.5 as a fallback) — DON'T leave it absent.
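The fallback rule in the last bullet can be sketched as a lookup helper. The 0.5 fallback value is from the registry per this doc; the helper itself, its name, and the UNKNOWN_DEFAULT constant are illustrative assumptions, not the actual scoring.ts code:

```typescript
// Registry fallback for task types a model has no coverage for.
const UNKNOWN_DEFAULT = 0.5;

// Hypothetical lookup: read a model's TF for a task, falling back to
// unknown_default rather than treating missing coverage as zero.
function taskFitness(
  entry: { task_fitness: Record<string, number | undefined> },
  task: string,
): number {
  return entry.task_fitness[task] ?? UNKNOWN_DEFAULT;
}
```

The `??` fallback only fires on a genuinely absent field; a curated low score like 0.45 is preserved, which is the behaviour the rule asks for.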

Pointers

See src/lib/scoring.ts for how this is used in scoring; see docs/plans/2026-05-05-recommendation-tuning.md Phase 1 for the re-grading work this rubric supports.