Ettin Reranker: new family of efficient rerankers | Keryc
Today Hugging Face publishes six CrossEncoder rerankers based on the Ettin ModernBERT encoders. They are distilled models, optimized for the reranking step of retrieve-then-rerank pipelines, and range from 17M to 1B parameters with support for inputs of up to 8K tokens. Here I’ll explain why they matter, how you can use them, and the recipe used to train them.
What is a reranker and why pair it with an embedder
A reranker (or cross-encoder) takes a (query, document) pair and returns a relevance score. Unlike an embedder, which encodes query and document separately and compares the resulting vectors, the reranker processes both together, so query and document tokens can attend to each other at every layer. That usually gives you more accuracy but costs more compute: you run the model once for every pair.
That’s why the common pattern is retrieve-then-rerank: a fast embedder first retrieves the top K candidates, then the reranker reorders only those K. You keep costs under control and get close to what you’d have if you applied the cross-encoder to the whole corpus.
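The two-stage pattern can be sketched in a few lines. This is a minimal illustration, not the pipeline used in the benchmarks: the function `retrieve_then_rerank` and the two score tables are hypothetical stand-ins, chosen so the example runs without downloading any model.

```python
from typing import Callable, List, Tuple


def retrieve_then_rerank(
    query: str,
    corpus: List[str],
    cheap_score: Callable[[str, str], float],
    expensive_score: Callable[[str, str], float],
    k: int,
) -> List[Tuple[str, float]]:
    """Keep the top-k documents by cheap_score, then reorder them by expensive_score."""
    # Stage 1: score every document with the cheap function and keep the best k.
    candidates = sorted(corpus, key=lambda doc: cheap_score(query, doc), reverse=True)[:k]
    # Stage 2: run the expensive scorer only on those k candidates.
    reranked = [(doc, expensive_score(query, doc)) for doc in candidates]
    return sorted(reranked, key=lambda pair: pair[1], reverse=True)


# Hypothetical scores standing in for an embedder and a cross-encoder.
corpus = ["doc a", "doc b", "doc c", "doc d"]
embedder_scores = {"doc a": 0.9, "doc b": 0.8, "doc c": 0.1, "doc d": 0.7}
reranker_scores = {"doc a": 0.2, "doc b": 0.95, "doc c": 0.99, "doc d": 0.5}

result = retrieve_then_rerank(
    query="example query",
    corpus=corpus,
    cheap_score=lambda q, d: embedder_scores[d],
    expensive_score=lambda q, d: reranker_scores[d],
    k=3,
)
print(result)
```

Note that "doc c" has the highest reranker score but never reaches stage 2, because the cheap scorer ranked it last. The reranker can only reorder what the retriever hands it, which is why the choice of K and of the embedder still matters. In a real pipeline, `cheap_score` would typically be an embedding similarity and `expensive_score` a call into the cross-encoder.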
This pattern fits internal search, assistants, and QA over long documents. And because these models support up to 8192 tokens, they work for long documents too.
Released models and license
Hugging Face released six sizes, all under Apache 2.0 license:
cross-encoder/ettin-reranker-17m-v1
cross-encoder/ettin-reranker-32m-v1
cross-encoder/ettin-reranker-68m-v1
cross-encoder/ettin-reranker-150m-v1
cross-encoder/ettin-reranker-400m-v1
cross-encoder/ettin-reranker-1b-v1
All use the Ettin encoders (ModernBERT-style) and share the same classification head; only the backbone changes.
Quick usage
A few lines are enough to get started with sentence-transformers:
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ettin-reranker-32m-v1")
# Each input is a (query, document) pair; predict returns one relevance score per pair.
scores = model.predict([
    ("Where was Apple founded?", "Apple Inc. was founded in Cupertino, California in 1976."),
])
print(scores)
If you prefer the rank function to sort documents:
# Reuses the model loaded above; rank scores every document against the query and sorts them.
ranked = model.rank(query="Which planet is known as the Red Planet?",
                    documents=[...], top_k=4, return_documents=True)
for r in ranked:
    print(f"({r['score']:.2f}): {r['text']}")
Performance tip: load the model with model_kwargs={"dtype": "bfloat16", "attn_implementation": "flash_attention_2"} and, if you can, install the kernels package to get a precompiled FA2 build. That increases throughput by between 1.7x and 8.3x, depending on model size and sequence length.
Architecture and why they are fast
Backbone: Ettin ModernBERT (RoPE, GeGLU, attention without padding, pretraining on 2T tokens).
Head: 4 modules that replicate ModernBertForSequenceClassification, but built with Sentence Transformers modules and an AutoModel without a head, which lets inputs flow without padding and take advantage of FA2 without wasted compute.
Head stack:
Transformer (FA2)
Pooling (CLS)
Dense(H, H, bias=False, GELU)
LayerNorm
Dense(H, 1)
Fun fact: in ablations, CLS pooling outperformed mean pooling, probably because the few layers with global attention carry enough signal to the CLS token.
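To make the stack above concrete, here is a didactic forward pass for the layers that follow the Transformer, written in plain Python. It is only a sketch under several assumptions: the helper names are hypothetical, the weights are made up, the exact (erf-based) GELU is assumed, and the learned LayerNorm scale and shift are omitted. The released models build this head from Sentence Transformers modules instead.

```python
import math
from typing import List, Optional

Vector = List[float]
Matrix = List[List[float]]


def gelu(x: float) -> float:
    """Exact GELU: x times the standard normal CDF at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def dense(x: Vector, weight: Matrix, bias: Optional[Vector] = None) -> Vector:
    """Compute W x (+ b); weight holds one row per output unit."""
    out = [sum(w * v for w, v in zip(row, x)) for row in weight]
    if bias is not None:
        out = [o + b for o, b in zip(out, bias)]
    return out


def layer_norm(x: Vector, eps: float = 1e-5) -> Vector:
    """Normalize to zero mean and unit variance (learned scale/shift omitted)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def rerank_head(token_states: Matrix, w_hidden: Matrix, w_out: Matrix, b_out: Vector) -> float:
    """Map the Transformer's per-token states to a single relevance logit."""
    cls = token_states[0]                              # Pooling (CLS): first token only
    hidden = [gelu(v) for v in dense(cls, w_hidden)]   # Dense(H, H, bias=False, GELU)
    hidden = layer_norm(hidden)                        # LayerNorm
    return dense(hidden, w_out, b_out)[0]              # Dense(H, 1)


# Toy example with hidden size H = 2 and two tokens.
states = [[1.0, -1.0], [5.0, 5.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
score = rerank_head(states, identity, w_out=[[1.0, 0.0]], b_out=[0.0])
print(score)
```

With CLS pooling, only the first token's state feeds the head, so changing the other token states leaves the score unchanged. The rest of the sequence influences the score only through what attention has already written into the CLS state.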
Results and benchmarks (technical summary)
Main benchmark: MTEB(eng, v2) Retrieval (10 tasks, top-100 reranked) evaluated in a two-stage flow. The six rerankers were paired with six embedders covering different quality/latency points.
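For readers unfamiliar with the metric, NDCG@10 rewards rankings that place relevant documents near the top of the first ten results. The sketch below assumes the common linear-gain formulation (relevance divided by log2 of rank plus one); the function names are illustrative and this is not the MTEB implementation.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the first k positions (ranks start at 1)."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances[:k], start=1))


def ndcg_at_k(ranked_relevances: Sequence[float], all_relevances: Sequence[float], k: int = 10) -> float:
    """DCG of the produced ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(all_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0


# One relevant document, returned at rank 2 instead of rank 1.
judged = [1, 0, 0]
print(ndcg_at_k([0, 1, 0], judged))  # 1 / log2(3), roughly 0.63
print(ndcg_at_k([1, 0, 0], judged))  # perfect ordering gives 1.0
```

A reranker improves this number by moving relevant documents up within the candidates it receives. It cannot add relevant documents that the first stage failed to retrieve.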
Key points:
The smallest model (17M) is the fastest in the set and outperforms larger MiniLM variants on NDCG@10.
The 17M reaches 7517 pairs/s on an H100 and 9008 pairs/s on an RTX 3090, while being more accurate than comparable MiniLMs.
The 150M is the best reranker under 600M parameters on MTEB, beating architectural peers of around 150M parameters.
The 1B nearly ties its 1.54B teacher on MTEB (difference ~0.0001) and runs 2.4x faster than the teacher on H100.
Practical takeaway: if you use MiniLM in production, switching to ettin-reranker-17m or 32m can give you better quality and latency with minimal changes.
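As a rough back-of-the-envelope check, the reported throughput can be turned into a per-query reranking cost. This assumes the benchmark's throughput carries over to your batch sizes and document lengths, which it may not; the helper name is hypothetical.

```python
def rerank_latency_ms(num_candidates: int, pairs_per_second: float) -> float:
    """Time to score num_candidates (query, document) pairs at a given throughput."""
    return 1000.0 * num_candidates / pairs_per_second


# 17M model on an H100 (7517 pairs/s, from the list above), reranking the top 100.
latency_17m_h100 = rerank_latency_ms(100, 7517)
print(f"{latency_17m_h100:.1f} ms")  # about 13.3 ms
```

Treat the result as an order of magnitude only, and measure on your own hardware and data before sizing a deployment.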
Latency and how to get it
Recommended configuration for max throughput:
Enable bf16 and flash_attention_2 together.
Use the modular Transformer version (these models ship that way), which removes padding before passing through FA2.
Ablation observations:
bf16 provides the largest single improvement by allowing larger batches.
FA2 with padded inputs can be slower than bf16+SDPA. The key is combining FA2 with unpadded inputs.
If you can’t use FA2 (compatibility reasons), the models remain competitive, but you’ll lose some of the GPU speedups seen on modern hardware.
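One way to apply these recommendations is to choose the attention backend at load time. The sketch below is an assumption-laden example, not an official recipe: `build_model_kwargs` and `load_reranker` are hypothetical helpers, the check only looks for the `flash_attn` Python package (it does not detect FA2 provided through the kernels package), and falling back to `"sdpa"` is my choice for the non-FA2 case.

```python
import importlib.util
from typing import Any, Dict


def build_model_kwargs(prefer_fa2: bool = True) -> Dict[str, Any]:
    """Return model_kwargs with bf16 and the best available attention backend."""
    # Assumption: FA2 availability is approximated by the flash_attn package being importable.
    fa2_installed = importlib.util.find_spec("flash_attn") is not None
    use_fa2 = prefer_fa2 and fa2_installed
    return {
        "dtype": "bfloat16",
        "attn_implementation": "flash_attention_2" if use_fa2 else "sdpa",
    }


def load_reranker(model_name: str = "cross-encoder/ettin-reranker-32m-v1"):
    """Load a reranker with the kwargs above (needs sentence-transformers and network access)."""
    from sentence_transformers import CrossEncoder  # imported lazily so the helper above runs standalone

    return CrossEncoder(model_name, model_kwargs=build_model_kwargs())


print(build_model_kwargs())
```

Calling `load_reranker()` downloads the model, so it is left out of the example run. Whichever backend you end up with, benchmark it on your own batch sizes, since the ablations above show that the gains depend on combining FA2 with unpadded inputs.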
Training recipe (technical)
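Approach: pointwise distillation with MSELoss over the teacher’s raw logits. Each (query, document) pair is treated independently, and the student is trained to reproduce the teacher’s score for that pair.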
Data: ~143M triplets (query, document, teacher_score) published as cross-encoder/ettin-reranker-v1-data (39 splits for traceability).
Sources: a mix of LightOn pretraining data and a reranked subset of LightOn fine-tuning data.
Important notes: they did a single pass over the data (num_train_epochs=1) and prioritized more data over more epochs.
Hyperparameters by size:
LR and global_batch_size vary by model. For example: 17m LR=2.4e-4, global_batch=1024; 1b LR=3e-6, global_batch=512.
Evaluation during training: NanoBEIR mean NDCG@10 every 5% of steps, and the best checkpoint by that metric was selected for final MTEB evaluation.
The author released the training script (~150 lines) and the dataset so anyone can reproduce or improve the recipe.
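The objective itself is small enough to write out by hand. The following is a plain-Python sketch of a mean squared error between student logits and teacher scores, together with its gradient; it is not the released training script, and the function names are illustrative.

```python
from typing import List, Sequence


def mse_distillation_loss(student_logits: Sequence[float], teacher_scores: Sequence[float]) -> float:
    """Mean squared error between the student's logits and the teacher's raw scores."""
    if len(student_logits) != len(teacher_scores):
        raise ValueError("student and teacher must score the same pairs")
    squared_errors = [(s - t) ** 2 for s, t in zip(student_logits, teacher_scores)]
    return sum(squared_errors) / len(squared_errors)


def mse_gradient(student_logits: Sequence[float], teacher_scores: Sequence[float]) -> List[float]:
    """Gradient of the loss with respect to each student logit: 2 * (s - t) / n."""
    n = len(student_logits)
    return [2.0 * (s - t) / n for s, t in zip(student_logits, teacher_scores)]


# Two (query, document) pairs: the student is off by 2 on the first and exact on the second.
student = [1.0, -2.0]
teacher = [3.0, -2.0]
print(mse_distillation_loss(student, teacher))  # 2.0
print(mse_gradient(student, teacher))           # [-2.0, 0.0]
```

Because the target is the teacher's raw logit rather than a binary label, every pair carries a graded signal, including pairs the teacher considers only partially relevant.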
Practical implications and recommendations
If you want a drop-in improvement for your retrieve-then-rerank pipeline and you are running on CPU or a small GPU, try ettin-reranker-17m or 32m: compared with MiniLM rerankers, they offer better quality at lower latency.
If your workload can handle larger models and you want to close the gap with huge rerankers, ettin-reranker-150m and 400m offer a great quality/latency balance.
For deployment on modern GPUs: enable bfloat16 and flash_attention_2, and make sure you load the modular Transformer version (the way these models ship), which strips padding before the inputs reach FA2.
The recipe is intentionally simple: distill from a strong teacher over broad, retrieval-specific data. If you train with an even stronger teacher, the same recipe scales.
Conclusion
These six rerankers are a practical bet: better numbers than MiniLM on many tasks, competitive speeds thanks to padding-free design and FA2, and a reproducible recipe with public data. If you work on search, QA, or retrieval systems, it’s worth trying them and measuring the impact in your pipeline.