Embedding Model Rubric
task_fitness.embedding is a curated 0–1 score per embedding model stored in src/data/bearing-registry.json. It feeds the "quality" factor for task_type = 'embedding' recommendations.
How embedding quality is scored
Unlike chat models, embedding model quality is anchored to a single external leaderboard: MTEB (Massive Text Embedding Benchmark) — a standardised suite of 56+ retrieval, STS, classification, and clustering tasks. We use the MTEB Overall average as a proxy for general-purpose embedding quality.
Score anchors
| Range | Meaning | Example |
|---|---|---|
| 0.90–1.00 | Best-in-class MTEB; flagship precision retrieval | voyage-3-large (MTEB ~73) |
| 0.80–0.89 | Strong; excellent for most production retrieval | openai-embed-3-large, cohere-embed-v4 |
| 0.70–0.79 | Balanced; good retrieval with lower cost or latency | openai-embed-3-small, bge-m3 |
| 0.60–0.69 | Budget / local; fine for semantic similarity | voyage-3-lite, nomic-embed-v2 |
| <0.60 | Niche or outdated; avoid for new builds | — |
Normalisation
Raw MTEB Overall scores are normalised linearly within the embedding-model cohort using min-max scaling. The current cohort min is ~62 and max is ~73, so a small absolute difference translates to a larger normalised score difference.
The BENCHMARK_BLEND environment variable (default 0) controls how much weight the ingest score gets vs the curated value. A blend of 0.3–0.5 is reasonable once the cohort grows to 20+ models.
When to pick embedding vs similar task types
| Situation | Pick |
|---|---|
| You need to store text as vectors for semantic search or retrieval | embedding |
| You need to extract structured fields from a document | extract — produces structured text, not vectors |
| You need to answer a question using retrieved context | research or qa — the retrieval layer uses embeddings, but the LLM stage is research/qa |
| Your pipeline has an "index documents" stage | embedding — even if the downstream stage is summarise or qa |
Model class routing
The model_class field (values: "chat", "embedding") gates which models appear in results:
- Tasks with
task_type = 'embedding'hard-reject allmodel_class = 'chat'models (reason = 'wrong_class'). - All other task types hard-reject
model_class = 'embedding'models. - Class routing runs before capability gates (vision, tools, etc.), so the rejection reason for a cross-class pair is always
wrong_class.
Local / open-weight embedding models
The three open-weight embedding models in the registry (bge-m3, nomic-embed-v2-moe, gte-qwen2-7b) carry local_info entries listing recommended hardware tiers. They can be served with Ollama, llama.cpp, or the sentence-transformers library.
Benchmark source
Muennighoff, N., Tazi, N., Magne, L., & Reimers, N. (2023). MTEB: Massive Text Embedding Benchmark. arXiv:2210.07316. MIT Licence.
MTEB results are published as mteb/results on Hugging Face Hub. Scores cited per-model are taken from the model card or official leaderboard submission. Ingest script: scripts/ingest-mteb.ts.