Today's news matters to anyone building search or recommendation on top of models: Hugging Face and the MTEB team are launching RTEB, a benchmark designed to measure how well embeddings actually work in real scenarios. Why is it different from the usual? Because it sets out to catch models that only pass exams they have already seen. (huggingface.co)
Why current benchmarks don't cut it
Have you ever wondered why a model can score high on benchmarks yet stumble in your product? It happens because many public test sets circulate so widely that models end up "learning" the test instead of learning to generalize.
The consequence is clear: inflated metrics that don't reflect performance on new data. (huggingface.co)
Also, many evaluation sets come from academic or QA tasks that weren't designed to evaluate information retrieval the way companies need it. That can reward lexical matches over true semantic understanding. (huggingface.co)
What RTEB is and what it offers
RTEB stands for ReTrieval Embedding Benchmark: an initiative to create a more honest, practical standard that measures how accurate embedding models are at real search and retrieval tasks.
It's launching in beta and aims to involve the community to grow and improve. (huggingface.co)
The hybrid strategy: transparency and control
The core idea of RTEB is to combine two types of data:
- Open sets: corpus, queries and annotations are public so anyone can reproduce results.
- Private sets: closed data evaluated by the MTEB maintainers to measure how much real generalization a model has.
If a model scores well on the open sets but drops a lot on the private ones, that's a strong signal of overfitting to known benchmarks. Wouldn't you rather know that before putting a model into production? (huggingface.co)
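To make that signal concrete, here's a minimal sketch in Python (the scores are invented for illustration, not real RTEB numbers) of the kind of gap check the open/private split enables:

```python
# Hypothetical example: flag a suspicious gap between open and private
# retrieval scores (e.g. NDCG@10). All numbers below are made up.

def overfit_gap(open_score: float, private_score: float, threshold: float = 0.10) -> bool:
    """Return True if the drop from open to private sets exceeds the threshold."""
    return (open_score - private_score) > threshold

# Invented scores for two candidate models
models = {
    "model_a": {"open": 0.72, "private": 0.70},  # small gap: likely generalizes
    "model_b": {"open": 0.78, "private": 0.55},  # large gap: likely overfit to public sets
}

for name, scores in models.items():
    flagged = overfit_gap(scores["open"], scores["private"])
    verdict = "possible overfitting" if flagged else "looks consistent"
    print(f"{name}: open={scores['open']:.2f} private={scores['private']:.2f} -> {verdict}")
```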
Designed for real-world cases
RTEB prioritizes enterprise use cases: it includes domains like law, health, finance and code, and covers 20 languages, from the most common to some less frequent ones.
It targets datasets of a reasonable size (at least 1k documents and 50 queries) and uses NDCG@10 as the default metric to evaluate ranking quality. (huggingface.co)
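If NDCG@10 sounds abstract: it rewards rankings that place the most relevant documents near the top, discounting gains by position. A self-contained sketch of the computation (the relevance labels are made up, not taken from any RTEB dataset):

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k ranked relevance labels."""
    # Some implementations use (2**rel - 1) as the gain; linear gain is also common.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k=10):
    """NDCG@k: DCG of the system's ranking divided by the DCG of the ideal ranking."""
    ideal = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Relevance labels of the documents a model returned, in ranked order (invented).
ranking = [3, 2, 0, 1, 0, 0, 2, 0, 0, 0]
print(f"NDCG@10 = {ndcg_at_k(ranking):.3f}")
```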
What this means for developers and product teams
If you work with RAG, agents or recommendation systems, RTEB gives you a more realistic way to compare embedding models before integrating them. In practice you can spot models that only "learn the exam" and pick those that actually generalize to new data.
For small teams this can save hours of debugging and avoid trust issues in production. For companies, it's a way to reduce risk when selecting providers or models. (huggingface.co)
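A quick sanity check on your own data is often the cheapest first filter before trusting any leaderboard. Here's a minimal sketch using sentence-transformers; the model names and the toy corpus are placeholders you'd swap for your real documents and queries:

```python
# Hypothetical comparison of two embedding models on a tiny in-house retrieval set.
from sentence_transformers import SentenceTransformer, util

corpus = [
    "Refund policy: customers may return items within 30 days.",
    "Our API rate limit is 100 requests per minute per key.",
    "The warranty covers manufacturing defects for two years.",
]
queries = ["how long do I have to return a product?", "what is the API rate limit?"]
expected = [0, 1]  # index of the document each query should retrieve

for model_name in ["sentence-transformers/all-MiniLM-L6-v2",
                   "sentence-transformers/all-mpnet-base-v2"]:
    model = SentenceTransformer(model_name)
    doc_emb = model.encode(corpus, convert_to_tensor=True)
    query_emb = model.encode(queries, convert_to_tensor=True)
    # For each query, pick the document with the highest cosine similarity.
    hits = util.cos_sim(query_emb, doc_emb).argmax(dim=1).tolist()
    accuracy = sum(h == e for h, e in zip(hits, expected)) / len(expected)
    print(f"{model_name}: top-1 accuracy = {accuracy:.2f}")
```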
Limitations and future
RTEB starts with plain text; multimodal tasks (text-image) are left for future versions. They also acknowledge that around 50% of current sets come from reused QA resources, which can favor word matches over deep understanding.
They're working on expanding languages and data types, and they ask the community for suggestions and participation. (huggingface.co)
How to participate or test your model
RTEB arrives in beta and the dashboard is already available on Hugging Face as part of the Retrieval section in the MTEB leaderboard. If you want to share feedback, add datasets or run evaluations, you can follow the official post or open issues on the MTEB repo. (huggingface.co)
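If you want to try a model locally against existing MTEB retrieval tasks while RTEB's private sets remain with the maintainers, the mteb Python package is the usual entry point. A minimal sketch, assuming the classic MTEB API (the task and model names are just examples, and the exact interface may differ between mteb versions):

```python
# Sketch of running a public retrieval task with the mteb package.
# Task and model names are illustrative; check the MTEB repo for the
# current API and for RTEB task availability as the beta rolls out.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
evaluation = MTEB(tasks=["NFCorpus"])  # an existing public retrieval task
evaluation.run(model, output_folder="results/all-MiniLM-L6-v2")
```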
Final reflection
Benchmarks matter because they guide technical and business decisions. RTEB isn't a silver bullet, but it aims for a more honest approach: transparency in the public sets and rigorous controls in the private ones to measure real generalization.
Wouldn't you prefer to choose models knowing the cases where they actually work?