# Task Fitness Rubric
`task_fitness` is a curated 0–1 score per (model, task_type) pair stored in `src/data/bearing-registry.json`. It feeds the "quality" factor in our recommendation scoring, so it directly controls whether frontier models surface for hard tasks. Historically, TF values clustered in a narrow 0.7–0.9 band, which flattened the dynamic range and let cheap models edge out frontier ones on complex work. This rubric exists so that re-grades are consistent across the registry and future model additions have an anchor.
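For orientation, here is a minimal sketch of how a registry entry and the quality factor might fit together. Everything besides `task_fitness` and the 0.5 fallback is an assumption for illustration; the real schema lives in `src/data/bearing-registry.json` and the real weighting in `src/lib/scoring.ts`.

```ts
// Sketch only: field names besides task_fitness are assumptions, not the
// actual bearing-registry.json schema.
interface RegistryEntry {
  model: string;
  task_fitness: Record<string, number>; // e.g. { code: 0.93, analyse: 0.91, ... }
  unknown_default?: number;             // registry fallback, currently 0.5
}

// Assumed reading of the "quality" factor: the per-task TF score is used
// directly, so a compressed 0.7–0.9 band flattens the whole quality dimension.
function qualityFactor(entry: RegistryEntry, taskType: string): number {
  return entry.task_fitness[taskType] ?? entry.unknown_default ?? 0.5;
}
```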
## Score anchors
| Range | Meaning |
|---|---|
| 0.95–1.00 | Best-in-class on this task type, definitively SoTA |
| 0.88–0.94 | Strong frontier — very close to SoTA, but not the leader |
| 0.78–0.87 | Capable balanced/mid-tier — solid for medium complexity |
| 0.65–0.77 | Budget tier — fine for simple tasks, struggles on complex |
| 0.50–0.64 | Weak / specialist mismatch |
| <0.50 | Don't recommend for this task type |
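If it helps to sanity-check a proposed score against the table, a small lookup like the one below keeps the band labels in one place. This is purely illustrative and not part of the codebase; the table above remains the source of truth.

```ts
// Illustrative lookup from a task_fitness score to its rubric band label.
const BANDS: Array<[min: number, label: string]> = [
  [0.95, "best-in-class"],
  [0.88, "strong frontier"],
  [0.78, "capable balanced/mid-tier"],
  [0.65, "budget tier"],
  [0.5, "weak / specialist mismatch"],
];

function bandLabel(score: number): string {
  for (const [min, label] of BANDS) {
    if (score >= min) return label;
  }
  return "don't recommend for this task type";
}
```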
## Per-task anchors
These are illustrative, not exhaustive — 3–4 reference points per task type. When grading a new model, find the closest anchor and adjust.
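One way to make "find the closest anchor" mechanical is a small helper like the hypothetical sketch below; the anchor scores would have to be transcribed by hand from the lists that follow, and nothing like this exists in the repo.

```ts
// Hypothetical grading aid: compare a proposed score with the nearest anchor
// for the same task type (anchor scores transcribed from the lists below).
function closestAnchor(
  proposed: number,
  anchors: Record<string, number>, // e.g. { "Claude Sonnet 4.6": 0.93, ... }
): [model: string, score: number] {
  const entries = Object.entries(anchors);
  if (entries.length === 0) throw new Error("no anchors for this task type");
  let best = entries[0];
  for (const entry of entries) {
    if (Math.abs(entry[1] - proposed) < Math.abs(best[1] - proposed)) best = entry;
  }
  return best;
}
```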
### code
- 0.96: Claude Opus 4.6, GPT-5.4 (Aider/SWE-Bench leaders)
- 0.93: Claude Sonnet 4.6
- 0.91: Gemini 3.1 Pro
- 0.82: Llama 4 Maverick
- 0.78: Gemini 3 Flash
- 0.70: IBM Granite, Mistral Small
### analyse
- 0.97: Claude Opus 4.6 (top-of-range — long-context synthesis, multi-step reasoning)
- 0.95: Gemini 3.1 Pro
- 0.93: GPT-5.4
- 0.91: Claude Sonnet 4.6
- 0.80: GPT-5.4 mini
- 0.74: Gemini 3 Flash
- 0.68: Mistral Small (24B)
### generate
- 0.94: Claude Opus 4.6, GPT-5.4 (long-form prose, creative writing)
- 0.89: Claude Sonnet 4.6
- 0.80: Gemini 3 Flash, Llama 4 Maverick
- 0.70: Mistral Small, Granite
### summarise
- 0.93: Claude Opus 4.6 (faithful, concise, low hallucination)
- 0.92: Gemini 3.1 Pro
- 0.90: Claude Sonnet 4.6
- 0.88: GPT-5.4, Claude Haiku 4.5
- 0.80: Gemini 3 Flash, GPT-5.4 mini
- 0.70: Llama 4 Maverick
### extract
- 0.93: GPT-5.4, Claude Sonnet 4.6 (structured output, schema adherence)
- 0.88: Gemini 3.1 Pro, Claude Haiku 4.5
- 0.80: Gemini 3 Flash, GPT-5.4 mini
- 0.68: Mistral Small
### translate
- 0.92: GPT-5.4, Gemini 3.1 Pro (broad language coverage, idiom handling)
- 0.87: Claude Sonnet 4.6, Gemini 3 Flash
- 0.78: Llama 4 Maverick
- 0.65: Granite, smaller specialists
### conversation
- 0.93: Claude Sonnet 4.6, GPT-5.4 (tone, steerability, multi-turn coherence)
- 0.88: Claude Haiku 4.5, Gemini 3.1 Pro
- 0.80: Gemini 3 Flash, Llama 4 Maverick
- 0.70: Mistral Small
### vision
- 0.93: Gemini 3.1 Pro (chart/diagram reasoning, OCR, spatial)
- 0.88: Claude Opus 4.6, Claude Sonnet 4.6, GPT-5.4
- 0.78: Gemini 3 Flash
- 0.50: text-only models (use `unknown_default` if unsure)
Vision anchors are noisier than text — re-check against current benchmark coverage when grading.
## Rules of thumb
- The same model gets DIFFERENT TF scores per task: Codestral might be 0.92 on `code` and 0.45 on `generate`. Don't average a model into one number (see the sketch after this list).
- When in doubt, anchor against an existing model in the same band rather than inventing a number.
- Keep tier-to-tier gaps meaningful: a flagship should usually beat a balanced model by ≥0.10 on its strong tasks. If the gap is <0.05, the rubric isn't doing its job.
- If a model lacks coverage for a task type, set the field to its `unknown_default` (the registry already has 0.5 as a fallback); DON'T leave it absent.
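As a concrete illustration of the per-task and fallback rules above, a graded entry could look like the sketch below. The `code`/`generate` numbers are the Codestral example from the first bullet, not an actual grade, and the `vision` value just shows the fallback convention.

```ts
// Illustrative entry showing the per-task rule and the fallback rule together.
const UNKNOWN_DEFAULT = 0.5; // the registry's existing fallback

const codestralExample = {
  model: "Codestral",
  task_fitness: {
    code: 0.92,              // strong on its specialty
    generate: 0.45,          // weak elsewhere; never average the two
    vision: UNKNOWN_DEFAULT, // no coverage: set the fallback, don't omit the field
  },
};
```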
## Pointers
See `src/lib/scoring.ts` for how `task_fitness` is used in scoring, and `docs/plans/2026-05-05-recommendation-tuning.md` (Phase 1) for the re-grading work this rubric supports.