NVIDIA publishes a recipe to fine-tune embeddings in a day | Keryc
You can turn a general embeddings model into one that truly understands your domain with a single GPU and less than a day of training. Sounds like magic? It isn’t: it’s a practical recipe that combines synthetic data generation, hard-negative mining, contrastive training and optimized deployment.
What this NVIDIA recipe offers
NVIDIA publishes a full pipeline (NeMo Data Designer, NeMo Automodel and Nemotron) that goes from raw documents to a production-ready embeddings service. Key points:
Automatic generation of (question, document) pairs using an LLM to create high-quality synthetic data.
Hard-negative mining to teach the model to distinguish confusing passages.
Support for multi-hop questions (1 to 3 hops) and unrolling for contrastive training.
Export to ONNX/TensorRT and deployment on NVIDIA NIM with an OpenAI-compatible API.
The result? In their tests they saw double-digit improvements in metrics like Recall@10 and nDCG@10. Atlassian applied the recipe to their Jira dataset and took Recall@60 from 0.751 to 0.951 using a single A100 80GB GPU.
Requirements and tools
You need:
NeMo Data Designer for synthetic generation.
NeMo Automodel and Nemotron to train embeddings.
BEIR for information retrieval evaluation.
NeMo Export-Deploy to convert to ONNX/TensorRT.
NVIDIA NIM to serve in production.
A directory with your domain documents (.txt, .md, etc.).
An NVIDIA API key (free at build.nvidia.com).
An NVIDIA Ampere or newer GPU with at least 80 GB VRAM (A100/H100 80GB tested).
If you don’t have everything, the pipeline is modular: you can start at the stage that suits you.
Each stage can take minutes or hours; the whole flow is designed to fit in a day with the right setup.
How synthetic generation (SDG) works
Instead of manually labeling thousands of query-document pairs, the pipeline uses an LLM (for example nvidia/nemotron-3-nano-30b-a3b) to read your documents and generate QA pairs. The process produces different question types: contextual queries, factual questions and multi-hop questions.
Example document chunk:
The TDP of the H100 GPU in SXM form is 700W. The cooling solution must keep junction temperature below 83°C. For dense deployments above 4 GPUs per node, liquid cooling is recommended.
Generated pair examples include simple lookup questions and multi-hop questions that require connecting information across sections. Each pair receives quality scores (relevance, accuracy, contextual support and clarity) and only pairs that pass thresholds enter training.
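The generate-then-filter loop can be sketched roughly like this. The prompt text, the `question_type` labels, and the pluggable `scorer` judge are illustrative stand-ins; the real NeMo Data Designer templates and quality judges differ:

```python
import json

def build_sdg_prompt(chunk: str, n_pairs: int = 3) -> str:
    # Hypothetical prompt; the actual NeMo Data Designer templates differ.
    return (
        "Read the passage and write synthetic retrieval questions as JSON.\n"
        f"Return a JSON list of {n_pairs} objects with keys "
        '"question" and "question_type" '
        '(one of "contextual", "factual", "multi_hop").\n\n'
        f"Passage:\n{chunk}"
    )

def parse_pairs(llm_output: str, chunk: str, min_score: float, scorer) -> list[dict]:
    """Keep only (query, document) pairs whose quality score passes the threshold."""
    pairs = json.loads(llm_output)
    kept = []
    for p in pairs:
        # scorer stands in for the LLM judge (relevance, accuracy, clarity, ...)
        score = scorer(p["question"], chunk)
        if score >= min_score:
            kept.append({"query": p["question"], "pos_doc": chunk})
    return kept
```

In the real pipeline the prompt is sent to the LLM (e.g. via build.nvidia.com) and the judge is itself an LLM call; here both are injected so the filtering logic is visible.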
Hard-negative mining: why it matters
If you train only with positive pairs, the model learns to separate very different passages but won’t learn to reject passages that look relevant but aren’t. Hard-negative mining finds those “almost correct” examples so the model learns fine-grained distinctions.
The automated process:
Embed all queries and passages with the base model.
Compute similarities and mask out the true positives.
Apply a margin filter: remove candidates that are above 95% of the minimum positive score (avoids false negatives).
Select the top-k as hard negatives (default 5 per query).
Why the 95% ceiling? A candidate that scores nearly as high as the true positive may itself be relevant; labeling it a negative would accidentally teach the model to reject relevant passages.
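The four steps above can be sketched in a few lines of NumPy, assuming embeddings are already L2-normalized (function and argument names are mine, not the pipeline's):

```python
import numpy as np

def mine_hard_negatives(q_emb, p_emb, pos_idx, k=5, margin=0.95):
    """Mine hard negatives: similarity, positive masking, margin filter, top-k.

    q_emb: (n_queries, dim), p_emb: (n_passages, dim), both L2-normalized.
    pos_idx: list of lists of true-positive passage indices per query.
    """
    sims = q_emb @ p_emb.T                      # cosine similarity matrix
    negatives = []
    for qi, positives in enumerate(pos_idx):
        row = sims[qi].copy()
        # 95% of the lowest positive score is the ceiling for negatives
        ceiling = margin * min(row[j] for j in positives)
        row[list(positives)] = -np.inf          # mask out the true positives
        row[row > ceiling] = -np.inf            # drop likely false negatives
        top = np.argsort(row)[::-1][:k]         # highest remaining similarities
        negatives.append([int(j) for j in top if row[j] > -np.inf])
    return negatives
```

Note that the candidate sitting just below the positive is exactly what survives: too similar gets filtered, too dissimilar never makes the top-k.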
Multi-hop and unrolling
Multi-hop questions can have multiple positive documents. Unrolling turns a multi-hop question into multiple examples (query, each positive document) so the contrastive loss sees each positive separately. That way the model learns that multiple passages can be relevant to a single composite query.
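Unrolling is mechanically simple; a minimal sketch (the dict keys are my own naming, not the pipeline's schema):

```python
def unroll(example):
    """Turn one multi-hop example with several positive documents into
    one training example per (query, positive) pair, so the contrastive
    loss sees each positive separately."""
    return [
        {"query": example["query"], "pos_doc": doc,
         "neg_docs": example.get("neg_docs", [])}
        for doc in example["pos_docs"]
    ]
```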
Fine-tuning: architecture and recommended parameters
The recipe fine-tunes a biencoder (two encoders: one for queries and one for documents) using a contrastive loss. Useful default parameters:
| Parameter | Default value | Notes |
| --- | --- | --- |
| Epochs | 3 | For large datasets drop to 1 or 2 |
| Learning rate | 1e-5 | Try 5e-6 or 2e-5 if needed |
| Warmup steps | 5 | 5-10% of total steps works well |
| Global batch size | 128 | Scales automatically if your dataset is small |
| Passages per query | 5 | 1 positive + 4 hard negatives |
| Temperature | 0.02 | Low temperature = very sharp distribution |
The aggressive temperature (0.02) works because hard negatives produce strong, precise gradients.
If you have fewer than 2000 examples, the pipeline adapts batch size, checkpoint frequency and validation so training remains stable.
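To make the temperature's role concrete, here is an InfoNCE-style contrastive loss sketched in NumPy (the training recipe uses its own implementation; this just shows the mechanics):

```python
import numpy as np

def contrastive_loss(q_emb, doc_emb, temperature=0.02):
    """InfoNCE-style loss: for each query, document 0 of its group is the
    positive and the rest are (hard) negatives. A low temperature sharpens
    the softmax, which is why mined hard negatives give strong gradients.

    q_emb: (n_queries, dim); doc_emb: (n_queries, passages_per_query, dim).
    """
    logits = np.einsum("qd,qpd->qp", q_emb, doc_emb) / temperature
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_probs[:, 0].mean())         # positive is at index 0
```

With temperature 0.02, even a small similarity gap between the positive and a hard negative becomes a huge logit gap, driving the loss toward zero only when the model ranks the positive clearly first.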
Evaluation with BEIR and expected results
The standard evaluation uses BEIR and computes nDCG@k, Recall@k, Precision@k and MAP@k for k = 1, 5, 10, 100. In their tests with the synthetic dataset Retrieval Synthetic NVDocs, the results were:
nDCG:
nDCG@1: 0.55178 -> 0.60796 (+10.2%)
nDCG@5: 0.51894 -> 0.57689 (+11.2%)
nDCG@10: 0.55506 -> 0.61559 (+10.9%)
nDCG@100: 0.60617 -> 0.66567 (+9.8%)
Recall:
Recall@1: 0.28478 -> 0.31547 (+10.8%)
Recall@5: 0.54486 -> 0.60288 (+10.6%)
Recall@10: 0.62979 -> 0.69296 (+10.0%)
Recall@100: 0.81421 -> 0.87020 (+6.9%)
A good fine-tune typically delivers ~15% improvement in nDCG@10 and Recall@10 in under a day, although real numbers depend on corpus quality and dataset size.
Atlassian reported a real case: Recall@60 went from 0.751 to 0.951 (a 26.7% gain) on their Jira dataset with a single A100 80GB.
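If you want to sanity-check these metrics yourself, both are straightforward to compute for binary relevance (BEIR's implementation handles graded relevance and edge cases; this is a minimal sketch):

```python
import math

def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)

def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance nDCG@k: DCG of the ranking over the ideal DCG."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, doc in enumerate(ranked[:k]) if doc in relevant)
    ideal = sum(1.0 / math.log2(i + 2)
                for i in range(min(len(relevant), k)))
    return dcg / ideal
```

Recall@k only asks whether relevant documents made the cut; nDCG@k also rewards putting them near the top, which is why both are worth tracking.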
Export, quantization and deployment
For production it’s advisable to export to ONNX and (optionally) build a TensorRT engine for the lowest latency and highest throughput. The pipeline supports:
Export to ONNX (opset 17).
TensorRT compilation with optimization profiles for batch and sequence length.
FP8 quantization to speed things up further.
Packaging into a NIM container that exposes a /v1/embeddings endpoint compatible with OpenAI-style APIs.
There’s also a precision check that compares BEIR metrics on the NIM endpoint and flags deviations beyond a tolerance as alerts.
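Once the NIM container is up, any OpenAI-style client can call it. A stdlib-only sketch of the request and response handling (the URL, port, and model name are assumptions for a local deployment):

```python
import json
from urllib.request import Request, urlopen

NIM_URL = "http://localhost:8000/v1/embeddings"   # assumed local NIM endpoint

def parse_embeddings(body):
    """Pull vectors out of an OpenAI-style embeddings response, in input order.
    body["data"] is a list of {"index": i, "embedding": [...]} objects."""
    return [item["embedding"]
            for item in sorted(body["data"], key=lambda d: d["index"])]

def embed(texts, model="my-finetuned-embedder", url=NIM_URL):
    """POST texts to the /v1/embeddings endpoint; returns one vector per text."""
    payload = {"model": model, "input": texts}
    req = Request(url, data=json.dumps(payload).encode(),
                  headers={"Content-Type": "application/json"})
    with urlopen(req) as resp:
        return parse_embeddings(json.load(resp))
```

Because the endpoint is OpenAI-compatible, existing RAG stacks can usually point at it by changing only the base URL and model name.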
Practical tips and common issues
Messy data = mediocre results. Clean and format your documents before SDG.
If you see overfitting, lower epochs or raise the SDG quality threshold.
If you lack data, add more documents or try a stronger LLM for SDG.
Tune learning rate (examples: 5e-6 for large datasets, 2e-5 for very small ones).
Start with 50-100 documents for a quick POC; the pipeline scales well.
Final thought
You don’t need a data center or a huge team to get embeddings that actually understand your domain. With NVIDIA’s recipe you can generate training data without manual labeling, teach your model to distinguish confusing cases and deploy to production in less than a day. Got domain documents and a GPU with enough VRAM? Then you have everything to experiment and improve the relevance of your search or RAG systems.