Recommendation Tuning Implementation Plan
For Claude: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
Goal: Make Bearing's recommendations responsive to task type, complexity, and user priorities — so frontier models surface for hard work, cheap models surface for cheap work, and constraints like on-prem-only or realtime are honoured.
Architecture: Tightens three layers without restructuring them. (1) Registry: re-grade task_fitness so quality has real dynamic range. (2) Scoring: priority-aware weight compression, complexity-tier floors, hard filters for on-prem / latency / context. (3) Classification: new dimensions (data_sensitivity, latency_target, volume, needs_reasoning, etc.) and a tighter pipeline rule. The existing test battery (scripts/test-recommendations.ts) becomes the regression baseline between phases.
Tech Stack: TypeScript, Next.js server actions, Anthropic SDK (Haiku 4.5 classifier), Neon Postgres, Vitest.
Phase plan
| Phase | Theme | Risk | Expected impact |
|---|---|---|---|
| 0 | Lock the regression baseline | none | Enables before/after diffing |
| 1 | Registry quality recalibration | medium (data only) | Frontier models become recommendable |
| 2 | Cost curve + priority-aware weight compression | medium | Top-3 actually shifts with priority order |
| 3 | Complexity → tier floor + reasoning multiplier | low | Complex tasks stop getting budget models |
| 4 | New classification dimensions (additive) | medium | Privacy / latency / volume / reasoning surface |
| 5 | Hard filters wired in scoring | low | On-prem, realtime, long-context honoured |
| 6 | Pipeline detection tightened | low | Chatbots and refactors stop getting 4-stage pipelines |
| 7 | Classifier robustness (JSON parse, output_length, subtype) | low | No more silent crashes on $/ms/etc |
| 8 | Final regression + write-up | none | Documents what changed |
Commit at the end of every task. Re-run npx tsx scripts/test-recommendations.ts at the end of every phase and compare against the Phase-0 baseline.
Phase 0 — Lock the baseline
Task 0.1: Promote the test script to a versioned baseline
Files:
- Modify: scripts/test-recommendations.ts
- Create: scripts/baselines/2026-05-05-baseline.json
Step 1: Run the script once with current code:
npx tsx scripts/test-recommendations.ts
Step 2: Copy test-recommendations-output.json → scripts/baselines/2026-05-05-baseline.json. This is the "before" snapshot.
Step 3: The script already writes an expected field per prompt; keep that, and add a --diff <baseline-path> flag in test-recommendations.ts that compares the current run's top-3 model slugs per prompt against the baseline and prints a colour-coded diff (added / removed / reordered). Keep it simple — JSON.parse + Set diff is enough.
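A minimal sketch of the diff logic, assuming the snapshot is keyed by prompt id with a top-3 slug array (adjust to the real shape of test-recommendations-output.json):

```ts
import { readFileSync } from 'node:fs'

type Snapshot = Record<string, { top3: string[] }>

function diffAgainstBaseline(baselinePath: string, current: Snapshot): number {
  const baseline: Snapshot = JSON.parse(readFileSync(baselinePath, 'utf8'))
  let diffs = 0
  for (const [promptId, { top3 }] of Object.entries(current)) {
    const before = baseline[promptId]?.top3 ?? []
    const beforeSet = new Set(before)
    const afterSet = new Set(top3)
    const added = top3.filter((s) => !beforeSet.has(s))
    const removed = before.filter((s) => !afterSet.has(s))
    const reordered =
      added.length === 0 && removed.length === 0 && before.join('|') !== top3.join('|')
    if (added.length || removed.length || reordered) {
      diffs += 1
      console.log(`\x1b[33m${promptId}\x1b[0m`)
      added.forEach((s) => console.log(`  \x1b[32m+ ${s}\x1b[0m`)) // green: new in top 3
      removed.forEach((s) => console.log(`  \x1b[31m- ${s}\x1b[0m`)) // red: dropped out
      if (reordered) console.log(`  ~ reordered: ${before.join(' > ')} vs ${top3.join(' > ')}`)
    }
  }
  return diffs
}
```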
Step 4: Commit:
git add scripts/test-recommendations.ts scripts/baselines/
git commit -m "test: lock recommendation baseline for tuning work"
Acceptance: npx tsx scripts/test-recommendations.ts --diff scripts/baselines/2026-05-05-baseline.json reports zero diffs.
Task 0.2: Add a Vitest unit-test for the most surprising current behaviour
Files:
- Create: src/lib/__tests__/scoring-tuning.test.ts
These are regression-pinning tests — they assert today's (wrong) behaviour, not the desired behaviour. Phase 2 will flip them.
```ts
import { describe, it, expect } from 'vitest'
import { scoreModels } from '../scoring'

describe('scoring tuning baseline (will flip in Phase 2)', () => {
  it('currently does NOT recommend Claude Opus for complex code with quality-first priorities', () => {
    const result = scoreModels({
      taskType: 'code',
      complexity: 'complex',
      inputLength: 'long',
      needsVision: false, needsTools: true, needsCode: true,
      priorityOrder: ['quality','capability','cost','transparency','privacy','sustainability','speed'],
    })
    const top3 = result.slice(0, 3).map((m) => m.slug)
    expect(top3).not.toContain('claude-opus-4.6') // baseline behaviour
  })

  it('currently recommends a cloud model even when default priorities apply to a privacy-sensitive task', () => {
    // Documents the bug: privacy is rank 5 by default, so even a moderate cloud
    // model wins. Phase 4 introduces data_sensitivity to fix this.
    const result = scoreModels({
      taskType: 'analyse',
      complexity: 'complex',
      inputLength: 'long',
      needsVision: false, needsTools: false, needsCode: false,
      priorityOrder: ['quality','capability','cost','transparency','privacy','sustainability','speed'],
    })
    expect(result[0].provider).not.toBe('on-prem-only') // tautology, just locks the shape
  })
})
```
Step 1: Run npx vitest run src/lib/__tests__/scoring-tuning.test.ts. Expected: PASS (locking baseline).
Step 2: Commit:
git commit -am "test: pin baseline scoring regressions (to be flipped in later phases)"
Phase 1 — Registry quality recalibration (data only)
Hypothesis: widening the spread of task_fitness between flagship/balanced/budget tiers will let frontier models out-score budget models on complex tasks.
Task 1.1: Document the target rubric
Files:
- Create: docs/scoring/task-fitness-rubric.md
Write the rubric so future grading is consistent. Suggested anchors:
0.95–1.00 best-in-class on this task type, definitively SoTA
0.88–0.94 strong frontier, very close to SoTA but not the leader
0.78–0.87 capable balanced/mid-tier — solid for medium complexity
0.65–0.77 budget tier — fine for simple tasks, struggles on complex
0.50–0.64 weak / specialist mismatch
< 0.50 don't recommend for this task type
Step 1: Commit the rubric.
Task 1.2: Re-grade flagship-tier models
Files:
- Modify: src/data/bearing-registry.json
Targets (apply only on this model's strong tasks; leave weak tasks alone):
| Model | code | analyse | generate | summarise | extract |
|---|---|---|---|---|---|
| claude-opus-4.6 | 0.96 | 0.97 | 0.96 | 0.93 | 0.90 |
| claude-sonnet-4.6 | 0.93 | 0.91 | 0.92 | 0.90 | 0.88 |
| gpt-5.4 | 0.96 | 0.93 | 0.91 | 0.88 | 0.88 |
| gemini-3.1-pro | 0.91 | 0.95 | 0.88 | 0.92 | 0.90 |
| grok-4 | 0.92 | 0.94 | 0.87 | 0.85 | 0.83 |
Step 1: Edit JSON. Step 2: Run npx tsc --noEmit (registry is statically typed). Step 3: Run unit tests. Step 4: Commit chore(registry): widen flagship task_fitness range.
Task 1.3: Re-grade balanced and budget tiers down
Files:
- Modify: src/data/bearing-registry.json
| Model | Change |
|---|---|
| gemini-3-flash | code 0.88→0.82, analyse 0.82→0.78 (it's balanced, not flagship) |
| gpt-5.4-mini | hold |
| claude-haiku-4.5 | hold (already at 0.80 / 0.72) |
| llama-4-maverick | code 0.84→0.81, analyse 0.78→0.76 |
| ibm-granite-3.3 | code 0.78→0.70, analyse 0.72→0.65 |
| greenpt-greenl | code 0.72→0.66, analyse 0.65→0.60 |
| greenpt-greenr | code 0.78→0.72, analyse 0.75→0.70 |
| qwen3.5-397b | analyse 0.82→0.85 (strong reasoning; bump) |
| deepseek-r1 | analyse 0.88→0.93 (reasoning specialist; bump) |
| codestral-25.01 | code 0.95→0.92 (still leader for pure code, but not over Opus/GPT-5.4) |
Step 1: Edit JSON. Step 2: Re-run regression: npx tsx scripts/test-recommendations.ts --diff scripts/baselines/2026-05-05-baseline.json. Expected: Opus / GPT-5.4 / Sonnet appear in top 3 for at least the code-hard, analyse-legal, analyse-strat, gen-creative prompts. Step 3: Commit.
Task 1.4: Re-run benchmark blend sanity check
LMArena snapshots blend in at 30%. After the curated change, check that benchmark coverage isn't fighting the new curated grades. If a blended slug (e.g. claude-opus-4.6::code) now has a curated 0.96 but a normalised LMArena score of 0.78, the blend pulls it back to 0.91.
Step 1: Run node -e to print the blend delta for each (slug, task) pair where a benchmark row exists. Step 2: If any flagship is dragged below balanced-tier curated scores, decide: lower the BENCHMARK_BLEND default for now (0.3 → 0.2), or skip blending for slugs where curated > benchmark by > 0.10. Pick the simpler one; document choice in src/lib/scoring.ts comment. Step 3: Commit.
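A sketch of the delta check; the row shape is illustrative and should be wired to however the registry and the LMArena snapshot actually load:

```ts
// Illustrative only: replace `rows` with the real (slug, task) pairs that
// have a benchmark row.
const BENCHMARK_BLEND = 0.3

const blend = (curated: number, benchmark: number): number =>
  curated * (1 - BENCHMARK_BLEND) + benchmark * BENCHMARK_BLEND

// Example row from the text: curated 0.96, normalised LMArena 0.78.
const rows = [{ slug: 'claude-opus-4.6', task: 'code', curated: 0.96, benchmark: 0.78 }]

for (const { slug, task, curated, benchmark } of rows) {
  const b = blend(curated, benchmark)
  console.log(`${slug}::${task} curated=${curated} blended=${b.toFixed(2)} delta=${(b - curated).toFixed(2)}`)
  // prints: claude-opus-4.6::code curated=0.96 blended=0.91 delta=-0.05
}
```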
Phase 2 — Cost curve + priority-aware weight compression
Task 2.1: Make cost-score steepness depend on cost weight
Files:
- Modify: src/lib/scoring.ts (costScore function)
- Modify: src/lib/__tests__/scoring.test.ts
Current: log-scale, floor 0.05, weight applied externally. The penalty for an expensive model is fixed regardless of where the user ranked cost.
New: pass the user's cost weight (or rank) into costScore. When cost is low-priority (rank 4+), compress the curve so an expensive model bottoms out around 0.35–0.40 instead of 0.05.
```ts
function costScore(model, allModels, inputLength, costWeightHint = 0.18): number {
  // ... existing log calculation produces baseScore in [0,1]
  // Compression: when costWeightHint is small, pull baseScore towards 0.5.
  const compression = Math.max(0, 1 - costWeightHint / 0.30) // 0 when weight ≥ 0.30, up to 1 when weight ≈ 0
  // With weight ≈ 0 this lifts a floored 0.05 baseScore to ≈ 0.41, and with
  // weight 0.05 to 0.35 — which is what the floors in the tests below require.
  return baseScore + (0.5 - baseScore) * compression * 0.8
}
```
Step 1: Write 2 unit tests: (a) when cost weight is 0.30, expensive flagship still scores ≤ 0.10; (b) when cost weight is 0.05 (last priority), expensive flagship scores ≥ 0.30. Step 2: Implement. Step 3: Update call sites in scoreModels. Step 4: Run npx vitest run. Step 5: Commit.
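A sketch of the two tests, assuming costScore is exported with the new signature and a fixtures module (hypothetical) supplies the model list and the priciest flagship:

```ts
import { describe, it, expect } from 'vitest'
import { costScore } from '../scoring'
import { allModels, expensiveFlagship } from './fixtures' // hypothetical fixture module

describe('costScore compression', () => {
  it('keeps the expensive flagship punished when cost weight is high', () => {
    expect(costScore(expensiveFlagship, allModels, 'long', 0.30)).toBeLessThanOrEqual(0.10)
  })
  it('lifts the floor when cost is the last priority', () => {
    expect(costScore(expensiveFlagship, allModels, 'long', 0.05)).toBeGreaterThanOrEqual(0.30)
  })
})
```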
Task 2.2: Compress non-priority factor weights
Files:
- Modify: src/lib/weights.ts
When a factor is rank 5+ in the user's priority order, multiply its raw weight by 0.4 before normalisation. This stops transparency/sustainability/privacy from quietly dominating when the user clearly didn't care about them.
```ts
const LOW_PRIORITY_DAMP = 0.4

for (let i = 4; i < priorityOrder.length; i++) {
  raw[priorityOrder[i]] *= LOW_PRIORITY_DAMP
}
```
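In context, a sketch of where the damp slots in relative to normalisation (computeWeights and the raw map are stand-ins for whatever weights.ts actually exports):

```ts
const LOW_PRIORITY_DAMP = 0.4

function computeWeights(priorityOrder: string[], base: Record<string, number>): Record<string, number> {
  const raw: Record<string, number> = { ...base }
  // Damp everything the user ranked 5th or lower (index 4+).
  for (let i = 4; i < priorityOrder.length; i++) {
    raw[priorityOrder[i]] *= LOW_PRIORITY_DAMP
  }
  // Re-normalise so the weights still sum to 1 after damping.
  const total = Object.values(raw).reduce((a, b) => a + b, 0)
  return Object.fromEntries(Object.entries(raw).map(([k, v]) => [k, v / total]))
}
```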
Step 1: Write a test: with priorities ['quality','capability','cost','speed','transparency','sustainability','privacy'], transparency weight after normalisation < 0.05. Step 2: Implement. Step 3: Run regression battery. Expect Opus / GPT-5.4 to now top quality-first prompts. Step 4: Flip the Phase-0 pinning tests: assertion changes from not.toContain('claude-opus-4.6') → toContain('claude-opus-4.6'). Step 5: Commit.
Task 2.3: Re-run regression + update baseline
Step 1: Compare new output to 2026-05-05-baseline.json. Step 2: Save fresh snapshot as scripts/baselines/2026-05-05-phase2.json. Step 3: Commit baseline.
Phase 3 — Complexity → tier floor + reasoning multiplier
Task 3.1: Tier floor for complex tasks
Files:
- Modify: src/lib/scoring.ts
Rule: when complexity === 'complex', multiply factorScores.quality by 0.85 for any model with tier ∈ {'budget','sustainable_balanced','enterprise_transparent'}, unless the user ranked sustainability or transparency in their top 3. (Respect the user's ethical preferences.)
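A sketch of the rule; tier values come from the rule above, other field names are assumptions:

```ts
const SOFT_TIERS = new Set(['budget', 'sustainable_balanced', 'enterprise_transparent'])

function applyComplexityFloor(
  factorScores: { quality: number },
  model: { tier: string },
  complexity: string,
  priorityOrder: string[],
): void {
  // Skip the penalty when the user explicitly ranked ethics in their top 3.
  const ethicsInTop3 = priorityOrder
    .slice(0, 3)
    .some((p) => p === 'sustainability' || p === 'transparency')
  if (complexity === 'complex' && SOFT_TIERS.has(model.tier) && !ethicsInTop3) {
    factorScores.quality *= 0.85
  }
}
```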
Step 1: Test: complex code task with default priorities — Granite Micro / GreenPT GreenL drop out of top 5. Step 2: Implement. Step 3: Run battery. Step 4: Commit.
Task 3.2: Reasoning multiplier
Files:
- Modify: src/lib/scoring.ts
- Modify: src/prompts/classify.md (add needs_reasoning to schema)
- Modify: src/lib/classification.ts (add field to interface)
- Modify: src/app/actions.ts (persist field)
- Migration: src/db/migrations/010_needs_reasoning.sql — ALTER TABLE tasks ADD COLUMN needs_reasoning BOOLEAN DEFAULT FALSE
When needsReasoning && model.capabilities.includes('extended_thinking'), multiply quality by 1.20.
Step 1: Write migration; apply to Neon (use psql $NEON_DATABASE_URL -f …). Step 2: Update classifier prompt with examples (math symbolic, multi-step strategy, legal risk). Step 3: Update Classification interface and DB plumbing. Step 4: Update scoring + tests. Step 5: Run battery; expect DeepSeek R1 / Opus to surface for prompt #17 (PDEs) and #18 (German market expansion). Step 6: Commit.
Phase 4 — New classification dimensions
These all follow the same pattern: prompt update → interface field → DB column → scoring use. One commit per dimension keeps reverts easy.
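For orientation, a sketch of where Phase 4 lands in the Classification interface; the existing fields and casing are assumptions, so mirror what classification.ts already has:

```ts
export interface Classification {
  task_type: string
  complexity: 'simple' | 'medium' | 'complex'
  input_length: 'short' | 'medium' | 'long'
  // Phase 4 additions — all optional so historical rows still deserialise:
  data_sensitivity?: 'none' | 'pii' | 'regulated_health' | 'regulated_finance' | 'on_prem_required'
  latency_target?: 'realtime' | 'interactive' | 'batch'
  volume?: 'one_off' | 'hundreds_per_day' | 'thousands_per_day' | 'millions_per_day'
  needs_long_context?: boolean
  needs_multilingual?: boolean
  is_agentic?: boolean
  output_length?: 'short' | 'medium' | 'long'
}
```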
Task 4.1: data_sensitivity
Schema: 'none' | 'pii' | 'regulated_health' | 'regulated_finance' | 'on_prem_required'
- Migration: 011_data_sensitivity.sql
- Classifier prompt: examples for "patient records", "credit-card data", "must run on-prem"
- Scoring: on_prem_required → hard filter to models with local_info != null; regulated_* → bump privacy weight by 1.5×; pii → bump privacy weight by 1.2×
Acceptance: prompt #30 (medical, on-prem) puts a Llama / Granite / Mistral model with local_info at the top, not Gemini.
Task 4.2: latency_target
Schema: 'realtime' | 'interactive' | 'batch' (default 'interactive')
- Migration: 012_latency_target.sql
- Classifier prompt: "voice assistant under 200ms" → realtime
- Scoring: realtime → hard filter speed_score >= 0.85; batch → cost weight × 1.3
Acceptance: prompt #34 returns only fast tier; prompt #20 (bulk translate) drops cost-heavy models.
Task 4.3: volume
Schema: 'one_off' | 'hundreds_per_day' | 'thousands_per_day' | 'millions_per_day'
- Migration: 013_volume.sql
- Scoring: thousands_per_day → cost weight floor 0.30; millions_per_day → floor 0.45 (partially overrides the priority order — document this trade-off in code)
Acceptance: prompt #33 (1M tweets/day under $50/month) recommends cheapest-viable tier.
Task 4.4: needs_long_context
Schema: boolean
- Migration: 014_needs_long_context.sql
- Scoring: hard filter context_window >= 100_000
Acceptance: prompt #6 (200-page board report) excludes any 8k-context model from results.
Task 4.5: needs_multilingual and is_agentic
Lighter-touch fields. Multilingual → multiplier 1.10 if multilingual ∈ capabilities. Agentic → multiplier 1.15 if tools ∈ capabilities and extended_thinking ∈ capabilities.
Task 4.6: output_length separate from input_length
- Migration: 015_output_length.sql
- Cost estimation already uses both implicitly via TOKEN_ESTIMATES, but only one input string. Refactor estimateCost to take (inputLength, outputLength) separately.
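A sketch of the split signature; the TOKEN_ESTIMATES values and price field names here are illustrative, not the real registry numbers:

```ts
type LengthBucket = 'short' | 'medium' | 'long'

// Illustrative token counts — use the real TOKEN_ESTIMATES values.
const TOKEN_ESTIMATES: Record<LengthBucket, number> = { short: 500, medium: 3000, long: 20000 }

function estimateCost(
  model: { inputPricePerMTok: number; outputPricePerMTok: number }, // assumed price fields
  inputLength: LengthBucket,
  outputLength: LengthBucket,
): number {
  const inTokens = TOKEN_ESTIMATES[inputLength]
  const outTokens = TOKEN_ESTIMATES[outputLength]
  return (inTokens * model.inputPricePerMTok + outTokens * model.outputPricePerMTok) / 1_000_000
}
```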
Phase 5 — Wire hard filters cleanly
Task 5.1: Centralise hard filters in scoring
Files:
- Modify: src/lib/scoring.ts
Currently capabilityScore returns null to drop a model. Generalise: a single hardFilter(model, input): { ok: boolean; reason?: string } so it's testable in isolation. Reasons are returned through the API so the UI can show "5 models excluded because they require cloud hosting".
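A sketch of the generalised filter; field names mirror the registry JSON and the Phase-4 schemas, but the actual model type may differ:

```ts
interface FilterInput {
  data_sensitivity?: string
  latency_target?: string
  needs_long_context?: boolean
}

interface RegistryModel {
  local_info?: unknown
  speed_score: number
  context_window: number
}

function hardFilter(model: RegistryModel, input: FilterInput): { ok: boolean; reason?: string } {
  if (input.data_sensitivity === 'on_prem_required' && model.local_info == null) {
    return { ok: false, reason: 'requires cloud hosting' }
  }
  if (input.latency_target === 'realtime' && model.speed_score < 0.85) {
    return { ok: false, reason: 'too slow for realtime use' }
  }
  if (input.needs_long_context && model.context_window < 100_000) {
    return { ok: false, reason: 'context window under 100k tokens' }
  }
  return { ok: true }
}
```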
Step 1: Write tests for each filter. Step 2: Refactor. Step 3: Update getResults action to surface filter reasons in the response. Step 4: Render in recommend/[id]/results page (one-line summary). Step 5: Commit.
Phase 6 — Pipeline detection tightening
Task 6.1: Tighten the pipeline rule in the prompt
Files:
- Modify: src/prompts/classify.md
Replace the current "2+ distinct operations" rule with:
A pipeline requires ≥2 operations that:
1. Have different task_type values, AND
2. Cannot share a single model efficiently — i.e., the operations
genuinely differ in modality (vision → text), language (translate → analyse),
or specialty (OCR → reasoning).
NEVER recommend a pipeline for:
- A single chat / conversation / chatbot use case (chatbots are not pipelines)
- Code that involves writing + testing + refactoring (one job)
- A single document being summarised
- Anything where the same general-purpose model could do all stages
Add 4 negative examples to the prompt: GCSE tutor chatbot, customer-support chatbot, code refactor with tests, multilingual chatbot.
Step 1: Update prompt. Step 2: Re-run battery. Expect ≤ 5 of 32 prompts to classify as pipelines (currently 13). Step 3: Add a test in src/lib/__tests__/classification.test.ts that asserts a chatbot prompt does not produce a pipeline (uses buildClassificationMessages + a recorded fixture; do not call the live API in tests). Step 4: Commit.
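A sketch of that test; parseClassifierResponse and the fixture path are stand-ins for however classification.ts exposes its parsing today:

```ts
import { describe, it, expect } from 'vitest'
import { buildClassificationMessages, parseClassifierResponse } from '../classification'
import recordedReply from './fixtures/gcse-tutor-chatbot.json' // hypothetical recorded response

describe('pipeline detection', () => {
  it('does not produce a pipeline for a chatbot prompt', () => {
    // Exercise the real prompt builder so schema drift also breaks this test.
    const messages = buildClassificationMessages('Build a GCSE maths tutor chatbot')
    expect(messages.length).toBeGreaterThan(0)
    // Parse the recorded classifier reply instead of calling the live API.
    const classification = parseClassifierResponse(recordedReply)
    expect(classification.is_pipeline).toBe(false)
  })
})
```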
Phase 7 — Classifier robustness + minor cleanup
Task 7.1: Use Anthropic structured-output instead of raw-JSON parse
Files:
- Modify: src/lib/classification.ts
Currently JSON.parse(rawText.replace(/```/g,'')). Two prompts crashed on this. Use the SDK's tool-use to force a structured response — define classify_task as a tool with the schema, and read the tool input. Eliminates the regex-cleaning fragility.
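A sketch of the tool-use call with @anthropic-ai/sdk; the schema is abridged (mirror the full classify.md schema) and the model id is an assumption, so reuse whatever classification.ts already configures:

```ts
import Anthropic from '@anthropic-ai/sdk'

const client = new Anthropic()
const CLASSIFIER_MODEL = 'claude-haiku-4-5' // assumption: the id the app already uses

export async function classify(description: string) {
  const response = await client.messages.create({
    model: CLASSIFIER_MODEL,
    max_tokens: 1024,
    tools: [{
      name: 'classify_task',
      description: 'Classify a user task for model recommendation.',
      input_schema: {
        type: 'object',
        properties: {
          task_type: { type: 'string' },
          complexity: { type: 'string', enum: ['simple', 'medium', 'complex'] },
          // ...remaining fields from classify.md
        },
        required: ['task_type', 'complexity'],
      },
    }],
    // Force the model to answer via the tool: no prose wrapper to strip.
    tool_choice: { type: 'tool', name: 'classify_task' },
    messages: [{ role: 'user', content: description }],
  })
  const block = response.content.find((b) => b.type === 'tool_use')
  if (!block || block.type !== 'tool_use') throw new Error('classifier returned no tool_use block')
  return block.input // already structured: no JSON.parse, no backtick stripping
}
```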
Step 1: Write a test using buildClassificationMessages + a stubbed Anthropic response containing prose around JSON; assert it still parses (records the bug). Step 2: Refactor to tool-use. Step 3: Run live against the 2 previously-failing prompts (#33, #34). Step 4: Commit.
Task 7.2: Drop task_subtype OR map it to weights
Two options:
(a) Drop task_subtype from the prompt entirely. Saves tokens, removes unused field. Keep the column nullable for historical rows.
(b) Map subtype strings → strength tags via a small lookup table; bump models whose strengths array contains the matching tag.
I'd suggest (a) — subtype was nice in theory but never wired to scoring, and the strings are inconsistent ("creative_fiction" vs "creative writing"). Document the choice in code.
Task 7.3: Calibrate confidence or remove
Currently every prompt scores 0.85 / 0.92 / 0.95 — the field is decorative. Pick one:
(a) Drop confidence; use only the boolean clarification_needed.
(b) Add 4 explicit calibration anchors to the prompt:
- 0.95+ : description specifies task verb, domain, and an output format
- 0.80–0.94 : task verb + domain clear, output format ambiguous
- 0.50–0.79 : task verb clear but ambiguous between 2+ task types
- < 0.50 : core verb unclear; ask
Suggest (a) for simplicity; clarification_needed already does the job.
Task 7.4: Enforce output_length distinct from input_length
Already in 4.6 — verify gen-creative (1500-word story, short input) classifies as input_length=short, output_length=medium and estimateCost uses both.
Phase 8 — Final regression + write-up
Task 8.1: Run the full regression battery
Step 1: npx tsx scripts/test-recommendations.ts --diff scripts/baselines/2026-05-05-baseline.json. Step 2: Spot-check that:
- prompts 1, 3, 4, 14, 16, 17, 18, 32 (complex / reasoning) → top 3 contains at least one of claude-opus-4.6 / claude-sonnet-4.6 / gpt-5.4 / gemini-3.1-pro / deepseek-r1
- prompts 2, 5, 13, 22, 23 (simple) → top 3 still contains a budget-tier model
- prompt 30 (medical on-prem) → all top-3 models have local_info
- prompt 34 (realtime) → all top-3 models have speed_score >= 0.85
- prompt 33 (high volume budget) → top-3 estimated cost ≤ $0.001 per call
- prompts 21/22/23 (chatbots) → no pipeline recommended
- prompts 26/27 (genuine multi-modal pipelines) → still pipeline-recommended
Step 3: Save fresh snapshot scripts/baselines/2026-05-05-final.json. Step 4: Commit.
Task 8.2: Update HANDOFF.md and the public methodology
Files:
- Modify: HANDOFF.md
- Modify: src/data/bearing-registry.json → bump meta.version 0.6.0 → 0.7.0; add scoring_methodology.notes describing the new dimensions
- Modify: src/app/about/page.tsx (or wherever methodology is publicly described) — add the new classification fields and how they flow into scoring
Step 1: Write changelog entries. Step 2: Commit.
Task 8.3: Open PR
git checkout -b tuning/recommendation-overhaul
git push -u origin tuning/recommendation-overhaul
gh pr create --title "tuning: recommendation overhaul (Phases 1-7)" --body "$(cat <<'EOF'
## Summary
- Widen task_fitness range so frontier models surface for complex tasks
- Priority-aware weight compression (rank-5+ factors damped 0.4×)
- Cost curve responds to cost weight; floor lifts when cost is low priority
- 8 new classification dimensions: data_sensitivity, latency_target, volume, needs_reasoning, needs_long_context, needs_multilingual, is_agentic, output_length
- Hard filters: on-prem, realtime, long-context
- Pipeline detection tightened (chatbots/refactors no longer pipelined)
- Classifier uses Anthropic structured-output (no more JSON parse crashes)
## Test plan
- [ ] `npx vitest run` passes
- [ ] `npx tsc --noEmit` passes
- [ ] `npx tsx scripts/test-recommendations.ts --diff scripts/baselines/2026-05-05-baseline.json` shows the expected shifts (see Task 8.1)
- [ ] Smoke-test the live UI for 3 prompts spanning simple / complex / on-prem
EOF
)"
Risks and rollbacks
- Phase 1 (registry) is the only purely-data change — easiest to revert (single JSON commit). Do this first so we have a clean checkpoint.
- Phase 2 (cost / weight compression) could over-correct if LOW_PRIORITY_DAMP = 0.4 is too aggressive. The regression battery tells us. If transparency-first users now see only Anthropic, dial the damp up to 0.6.
- Phase 4 (new classification fields) changes the DB schema. Each migration is additive and nullable, so revert is safe but requires an explicit ALTER TABLE … DROP COLUMN.
- Phase 6 (pipeline rule) could under-detect genuine pipelines. Watch prompts #26/#27 specifically.
- Phase 7.1 (structured output) depends on Anthropic SDK feature support — if Haiku 4.5 doesn't support tool-use cleanly, fall back to a stricter regex but keep the test fixtures.
What we're explicitly NOT doing
- Trained routing model (mentioned in HANDOFF) — still v1.5+ work.
- Community scoring — separate plan in docs/plans/2026-04-13-community-scoring-design.md.
- Adding new models to the registry. Tuning the existing 29 first.
- Re-grading sustainability / transparency / privacy scores — those are the registry maintainer's editorial call, not a tuning issue.