NVIDIA announces an open training-data strategy aimed at reducing friction and scaling reliable AI systems. Why should you care? Because improving models today doesn’t depend only on architectures and GPUs: it depends on the quality, diversity and transparency of the data.
What NVIDIA announced
The company published more than 2 petabytes of AI-ready data, organized into over 180 datasets and accompanied by 650+ open models, training recipes and evaluation frameworks. The goal is to offer a shared reference layer so developers and organizations can speed up building and evaluating models, especially more autonomous agents.
Open data isn’t charity: it’s reproducible infrastructure. When sources are visible, evaluating, replicating and improving models is more straightforward.
Why open data changes the equation
Many projects spend millions and months gathering and annotating data before training a single model. Public datasets and permissive licenses lower that barrier to entry and enable fast iteration: evaluate, fix and retrain in weeks instead of years.
For agentic systems (those that make decisions and operate with tools), the selection and structure of the data determine what the agent knows, how it reasons and how far it can operate safely. That is why NVIDIA publishes not only the data, but also the recipes and frameworks used to train models on it.
Technical examples and their impact
NVIDIA shares datasets across multiple domains: robotics, autonomous vehicles, sovereign AI, biology and evaluation benchmarks. Here are the ones that provide interesting technical signals.
Robotics and GR00T
- Dataset with 500K+ trajectories, 57M grasps and 15TB multimodal data (vision, sensors, gripper configurations).
- Data used to train GR00T, the vision-language-action reasoning model.
- Downloads: more than 10 million, with practical adoption by Runway and Lightwheel.
Why does it matter? Because robotics requires structured, consistent data across sensors and actions for learning robust, transferable policies.
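To make "structured, consistent data across sensors and actions" concrete, here is a minimal sketch of what one multimodal trajectory record could look like. The fields below are illustrative assumptions, not the actual GR00T dataset schema.

```python
from dataclasses import dataclass, field

@dataclass
class Frame:
    """One timestep of a robot trajectory (fields are hypothetical)."""
    timestamp_s: float
    rgb_path: str                 # camera observation
    joint_positions: list[float]  # proprioception
    gripper_open: float           # 0.0 closed .. 1.0 open

@dataclass
class Trajectory:
    """A sequence of frames tied to one natural-language task."""
    task: str                     # e.g. "pick up the red mug"
    frames: list[Frame] = field(default_factory=list)

    def duration_s(self) -> float:
        if len(self.frames) < 2:
            return 0.0
        return self.frames[-1].timestamp_s - self.frames[0].timestamp_s

traj = Trajectory(task="pick up the red mug")
traj.frames.append(Frame(0.0, "f0.png", [0.0, 0.1], 1.0))
traj.frames.append(Frame(0.5, "f1.png", [0.2, 0.1], 0.2))
print(traj.task, traj.duration_s())
```

The point of a schema like this is that vision, proprioception and actuator state share one time axis, which is what makes policies transferable across episodes.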
Autonomous vehicles (AV)
- 1,700+ hours of multi-sensor data with setups of 7 cameras, LiDAR and radar, covering 25 countries and 2,500 cities.
- This geographic diversity enables perception benchmarking in real commercial environments, complementing academic datasets.
Nemotron People (synthetic and demographically informed)
- Synthetic personas aligned to real per-country distributions: US 6M, Japan 6M, India 21M, Brazil 6M, Singapore 888K.
- Real-world use: CrowdStrike improved an NL→CQL task from 50.7% to 90.4% using 2M personas; NTT Data and APTO improved legal QA from 15.3% to 79.3%.
These figures show how well-designed synthetic data can bootstrap systems in domains where you have little native data.
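The bootstrapping idea can be sketched in a few lines: when native data is scarce, you generate (natural language, query) pairs from templates and train on those. Everything below — the field names, values and query syntax — is a hypothetical illustration, not CrowdStrike's or NVIDIA's actual pipeline.

```python
import random

# Illustrative fields/values for a natural-language -> query-language task.
FIELDS = ["process_name", "user_name", "remote_ip"]
VALUES = {
    "process_name": ["powershell.exe", "curl"],
    "user_name": ["alice", "bob"],
    "remote_ip": ["10.0.0.5", "192.168.1.9"],
}

def make_pair(field: str, value: str) -> dict:
    """Return one (natural language, query) training example."""
    nl = f"show events where {field.replace('_', ' ')} is {value}"
    query = f"event_search | filter {field} == '{value}'"
    return {"input": nl, "target": query}

def synth_dataset(n: int, seed: int = 0) -> list[dict]:
    """Sample n synthetic pairs across all field/value combinations."""
    rng = random.Random(seed)
    combos = [(f, v) for f in FIELDS for v in VALUES[f]]
    return [make_pair(*rng.choice(combos)) for _ in range(n)]

for pair in synth_dataset(4):
    print(pair["input"], "->", pair["target"])
```

Real persona datasets add demographic distributions and far richer templates, but the mechanism is the same: cheap, labeled pairs where no native corpus exists.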
The Protein (atomistic synthetic proteins)
- 455,000 structures with a 73% increase in structural diversity compared to previous baselines.
- Designed for molecular modeling and drug discovery without PII or licensing restrictions.
SPEED-Bench (speculative decoding benchmark)
- Two splits: Qualitative (11 textual categories) and Throughput (buckets 1K–32K tokens).
- Allows plotting real throughput Pareto curves based on prompt complexity and context length.
- Adopted internally to measure Nemotron MTP performance.
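A throughput-by-context-length harness in the spirit of SPEED-Bench's Throughput split can be sketched as follows. The bucket edges and the stub `generate()` function are assumptions for illustration, not the benchmark's actual API.

```python
import time
from collections import defaultdict

BUCKETS = [1_000, 2_000, 4_000, 8_000, 16_000, 32_000]  # max tokens per bucket

def bucket_for(n_tokens: int) -> int:
    """Smallest bucket edge that holds the prompt, else the largest."""
    for edge in BUCKETS:
        if n_tokens <= edge:
            return edge
    return BUCKETS[-1]

def measure(prompt_lengths, generate):
    """Return mean generated tokens/sec per context-length bucket."""
    totals = defaultdict(lambda: [0, 0.0])  # bucket -> [tokens, seconds]
    for n in prompt_lengths:
        start = time.perf_counter()
        out_tokens = generate(n)              # model call (stubbed below)
        elapsed = time.perf_counter() - start
        b = bucket_for(n)
        totals[b][0] += out_tokens
        totals[b][1] += elapsed
    return {b: tok / sec for b, (tok, sec) in totals.items()}

def fake_generate(prompt_tokens: int) -> int:
    """Stub: pretend decoding slows down as the context grows."""
    time.sleep(0.001 * (prompt_tokens / 1_000))
    return 128  # tokens generated

stats = measure([800, 3_000, 12_000, 30_000], fake_generate)
for bucket in sorted(stats):
    print(f"<= {bucket:>6} tokens: {stats[bucket]:.0f} tok/s")
```

Plotting tokens/sec per bucket against bucket size gives exactly the kind of throughput-vs-context curve the benchmark is designed to expose.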
Synthetic dataset for retrieval and RAG
- 110,000 triplets (query, passage, answer) generated from 15,000 NVIDIA public documentation files.
- Fast turnaround: dataset generation in 3–4 days; fine-tuning in ~2 hours on 8×A100.
- Result: fine-tuning nvidia/llama-nemotron-embed-1b-v2 produced a +11% gain in NDCG@10.
This shows how effective a well-designed retrieval dataset is at boosting ranking and retrieval metrics.
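For reference, NDCG@10 — the retrieval metric cited above — compares the discounted gain of the system's ranking against the ideal ranking of the same passages. A minimal implementation on toy data (not the NVIDIA evaluation itself):

```python
import math

def dcg(rels):
    """Discounted cumulative gain of a ranked relevance list."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(rels))

def ndcg_at_k(ranked_rels, k=10):
    """NDCG@k: DCG of the system ranking over DCG of the ideal ranking."""
    ideal = sorted(ranked_rels, reverse=True)
    denom = dcg(ideal[:k])
    return dcg(ranked_rels[:k]) / denom if denom > 0 else 0.0

# Graded relevance of the passages a retriever returned, in ranked order.
system_ranking = [3, 2, 0, 1, 0, 0, 2, 0, 0, 0]
print(f"NDCG@10 = {ndcg_at_k(system_ranking):.3f}")
```

A perfect ranking scores 1.0, so a +11% gain means relevant passages moved meaningfully higher in the top-10.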
ClimbMix and pre-training
- ClimbMix is a 400B-token corpus built with the CLIMB algorithm: clustering documents by embeddings and iterating to refine mixtures of high-signal data.
- Impact: reduced compute time on H100 by ~33% vs the previous recipe and improved results on Time-to-GPT-2 leaderboards.
- License: CC-BY-NC-4.0
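The "iterate to refine mixtures" step can be illustrated with a toy sketch: given document clusters and a proxy quality score per cluster (a stand-in for proxy-model validation signal), shift sampling weights toward higher-signal clusters. The scores and the multiplicative update rule below are illustrative, not CLIMB's exact method.

```python
def refine_mixture(weights, scores, lr=0.5, steps=5):
    """Multiplicatively upweight clusters scoring above the mixture mean."""
    w = list(weights)
    for _ in range(steps):
        avg = sum(wi * si for wi, si in zip(w, scores))  # mixture-weighted mean
        w = [wi * (1 + lr * (si - avg)) for wi, si in zip(w, scores)]
        total = sum(w)
        w = [wi / total for wi in w]  # renormalize to a distribution
    return w

clusters = ["web-math", "forums", "code", "boilerplate"]
proxy_scores = [0.9, 0.4, 0.8, 0.1]   # higher = more training signal (toy values)
start = [0.25, 0.25, 0.25, 0.25]

mix = refine_mixture(start, proxy_scores)
for name, w in zip(clusters, mix):
    print(f"{name:12s} {w:.3f}")
```

After a few steps the mixture concentrates on the math and code clusters, which is the intuition behind training the same token budget on a better mix instead of more data.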
The Nemotron stack: pre-training and post-training
NVIDIA documents the evolution of its datasets for the Nemotron ecosystem.
- Pre-training: datasets like Nemotron-CC, Nemotron-CC-Math and Nemotron-CC-Code, plus specialized collections that preserve LaTeX and code formatting to increase signal for mathematical reasoning and programming.
- Post-training: structured supervision with Nemotron-Instruction-Following-Chat, Nemotron-Science, Nemotron-Math-Proofs, Nemotron-Agentic and Nemotron-SWE to improve reasoning traces, multi-step planning and software engineering tasks.
These stacks let you move from general learning to behaviors guided by specialized supervision, which explains why models fine-tuned with these data outperform alternatives on concrete tasks.
Safety, RL and datasets for agents
- Nemotron-Agentic-Safety: 11K labeled telemetry traces from tool-use workflows.
- Nemotron-RL: 900K tasks (math, code, tools, puzzles) that act as a training "gym" for models.
Publishing these data helps reproducibility in safety research and enables more robust evaluations of models that interact with tools.
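To show what a "gym" of verifiable tasks means in practice, here is a hypothetical gym-style interface for one tool-use task. The class and reward scheme are illustrative assumptions, not Nemotron-RL's actual format.

```python
class ToolMathTask:
    """Tiny episodic task: the policy earns reward 1.0 only by calling the
    right tool with the right arguments (toy sketch of an RL task suite)."""

    def __init__(self, a: int, b: int):
        self.a, self.b = a, b
        self.done = False

    def reset(self) -> str:
        """Start an episode and return the task prompt."""
        self.done = False
        return f"Compute {self.a} + {self.b} using the add tool."

    def step(self, tool: str, args: tuple) -> tuple[float, bool]:
        """Apply one tool call; return (reward, done)."""
        self.done = True
        correct = tool == "add" and args == (self.a, self.b)
        return (1.0 if correct else 0.0), self.done

task = ToolMathTask(2, 3)
prompt = task.reset()
reward, done = task.step("add", (2, 3))
print(prompt, "->", reward)
```

Because each task carries an automatic reward check, hundreds of thousands of them can drive RL without human grading in the loop.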
Technical practices and considerations for teams
- Evaluation: use SPEED-Bench to measure real throughput across contexts, and NDCG@10 for retrieval.
- Rapid iteration: dataset generation in days and fine-tuning in hours (for example, ~2 hours on 8×A100) enables short development cycles.
- Licenses: check restrictions like CC-BY-NC-4.0; useful for research and development but limiting for commercial use.
- Co-design at every level: NVIDIA treats data design as an engineering discipline on par with hardware and software, integrating data strategists, researchers, infrastructure engineers and policy experts.
If you work on a small team, these practices help you prioritize: invest in clean datasets and consistent evaluations before scaling architecture or infrastructure.
What this means for developers and companies
- Lower barrier to entry: less cost and time to prototype competitive models.
- More reproducibility: public recipes and benchmarks make fair comparisons across techniques easier.
- Practical adoption: companies are already using the datasets to improve NL→CQL, legal QA, multilingual models and more.
If you’re an entrepreneur, this lowers the initial risk for your MVP; if you’re a researcher, it lets you focus on methods because data and evaluations are available.
Final reflection
NVIDIA highlights something many try to forget: architecture and GPUs matter, but data is the decisive infrastructure. Publishing datasets, licenses and tools speeds up building more capable and evaluable agents. Interested in trying them? Fast dataset generation and manageable fine-tuning times mean you can iterate your idea in weeks.
