Task-seeded SDG improves Nemotron in reasoning and code | Keryc
NVIDIA publishes a practical recipe: it’s not enough to feed the model lots of raw data, the data needs structured learning signals. What did they do? They took public training splits as seeds, generated synthetic question-and-answer pairs aligned to tasks, enriched answers with reasoning and relevant knowledge, and filtered everything into a curated corpus to continue pretraining Nemotron. The result: measurable gains in reasoning, code and scientific QA in a 100B-token experiment on Nemotron-3 Nano.
What is "task-seeded" SDG?
Task-seeded SDG (synthetic data generation) is a generation flow designed to add compact, structured examples to the pretraining mix. Instead of producing random plain text, the idea is to use public training splits (taken from lm-eval-harness) as capability seeds and create examples that:
preserve the frame of the task (selection, generation, classification, explanation),
respect the response structure (multiple-choice, short answer, restricted format),
include relevant domain and context (science, code, math, multilingual),
and, crucially, add traces of reasoning or knowledge that connect the evidence to the answer.
The pipeline is compact and repeatable: collect seeds, normalize records into a unified schema (JSONL), generate similar questions, solve and enrich the answers, and filter and package the resulting data.
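The normalization step can be sketched in a few lines of Python. This is a minimal illustration, not NVIDIA's implementation: the field aliases (`query`, `goal`, `options`, `label`), the target schema and the task names are all assumptions made for the example.

```python
import json

def normalize_record(raw: dict, task: str) -> dict:
    """Map one heterogeneous seed record onto a hypothetical unified schema."""
    # Datasets name the same field differently; these aliases are assumptions.
    question = raw.get("question") or raw.get("query") or raw.get("goal") or ""
    choices = raw.get("choices") or raw.get("options") or []
    # Use a default lookup so that a valid answer index of 0 is not lost.
    answer = raw.get("answer", raw.get("label"))
    return {
        "task": task,
        "question": question.strip(),
        "choices": [str(c).strip() for c in choices],
        "answer": answer,
    }

def to_jsonl(records: list) -> str:
    """Serialize records as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)

# Two illustrative seeds with different source layouts.
seeds = [
    normalize_record(
        {"query": " Which gas do plants absorb? ",
         "options": ["Oxygen", "Carbon dioxide"], "label": 1},
        task="science_qa",
    ),
    normalize_record(
        {"goal": "Remove a stripped screw",
         "choices": ["Use a rubber band for grip", "Pour water on it"], "answer": 0},
        task="physical_commonsense",
    ),
]
print(to_jsonl(seeds))
```

Keeping every seed in one schema means the later generation and filtering stages only need to understand a single record shape.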
Key stages (summary)
Seed collection: enumerate tasks from lm-eval-harness and keep only suitable training splits.
Normalization: convert heterogeneous YAML formats to a common schema for generation.
Generation: create new prompts that preserve the capability the seed is meant to teach.
Enrichment: attach the final answer plus reasoning and contextual knowledge.
Filtering: apply schema/format checks, deduplication and task-specific validation (for example, majority-vote verification in multiple-choice).
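The filtering stage above could look roughly like the following sketch. It assumes each generated sample carries a list of independently produced answers under a hypothetical `votes` key, and it uses an agreement threshold of 0.6 that is chosen for illustration only; the article does not state the actual threshold or deduplication method.

```python
from collections import Counter

def normalize_text(text: str) -> str:
    """Lowercase and collapse whitespace so near-identical questions match."""
    return " ".join(text.lower().split())

def deduplicate(samples: list) -> list:
    """Keep the first sample for each normalized question text."""
    seen = set()
    unique = []
    for sample in samples:
        key = normalize_text(sample["question"])
        if key not in seen:
            seen.add(key)
            unique.append(sample)
    return unique

def majority_answer(votes: list, min_agreement: float = 0.6):
    """Return the most common answer if enough votes agree, else None."""
    if not votes:
        return None
    answer, count = Counter(votes).most_common(1)[0]
    return answer if count / len(votes) >= min_agreement else None

def filter_samples(samples: list) -> list:
    """Deduplicate, then keep only samples whose answer passes the vote."""
    kept = []
    for sample in deduplicate(samples):
        voted = majority_answer(sample["votes"])
        if voted is not None:
            kept.append({"question": sample["question"], "answer": voted})
    return kept

candidates = [
    {"question": "Which planet is closest to the Sun?",
     "votes": ["Mercury", "Mercury", "Venus"]},
    {"question": "which planet is  closest to the sun?",  # duplicate
     "votes": ["Mercury", "Mercury", "Mercury"]},
    {"question": "What is the boiling point of water at sea level?",
     "votes": ["100 C", "212 C", "90 C"]},  # no agreement, dropped
]
print(filter_samples(candidates))
```

Exact-match deduplication on normalized text is the simplest possible choice here; a real pipeline would likely need fuzzy or embedding-based matching as well.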
Why it improves models like Nemotron
Isn’t this redundant with all the raw text LLMs already see? Not exactly. Raw data provides coverage, but it often lacks explicit signals on how to solve concrete tasks. By adding synthetic examples that show how to get to an answer (both the route and why certain options are discarded), the model learns reusable behaviors: identify needed information, apply relevant knowledge, compare plausible alternatives and produce responses within format constraints.
This is transfer learning across task families: a science seed can help commonsense physical reasoning; a logic seed can improve comparison of alternatives; code or math seeds strengthen step-by-step planning.
Data, coverage and verification
Scale of the experiment: ~70 tasks and ~700 subtasks extracted from lm-eval-harness.
Types of generated outputs: similar questions, samples with enriched answers and traces of reasoning/context.
Validation: schema/format checks, deduplication and, when possible, majority-vote answer checking. Multiple-choice is easier to verify; generative tasks require answer extraction and task-specific filters.
A practical detail: storing the semantic answer (for example, 'dirt trapped under the fingernails') is preferable to saving only a label like B. Small format choices change the training signal.
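As a small illustration of that detail, the sketch below resolves a choice label to its text before the sample is stored. The helper name `semantic_answer` and the distractor options are hypothetical; only the answer text 'dirt trapped under the fingernails' and the label B come from the example above.

```python
import string
from typing import Union

def semantic_answer(choices: list, label: Union[str, int]) -> str:
    """Resolve a choice label ('B') or zero-based index (1) to the answer text."""
    if isinstance(label, int):
        index = label
    else:
        letter = label.strip().upper()
        if len(letter) != 1 or letter not in string.ascii_uppercase:
            raise ValueError(f"unrecognized label: {label!r}")
        index = ord(letter) - ord("A")
    if not 0 <= index < len(choices):
        raise ValueError(f"label {label!r} is out of range for {len(choices)} choices")
    return choices[index]

# Options A and C are invented for illustration.
choices = ["soap residue", "dirt trapped under the fingernails", "dry skin"]

record = {
    "answer_label": "B",
    "answer_text": semantic_answer(choices, "B"),
}
print(record["answer_text"])
```

Storing both the label and the text keeps the sample easy to verify while giving the model the meaningful string to learn from.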
Results in the 100B-token experiment (Nemotron-3 Nano)
In a 100B-token continuation, mixing task-seeded SDG in later stages produced improvements across several capability groups:
| Group | Before | After | Change |
| --- | --- | --- | --- |
| MMLU-Pro | 64.8 | 66.6 | +1.8 |
| Average code | 73.2 | 75.1 | +1.9 |
| Average math | 87.6 | 87.9 | +0.3 |
| Commonsense understanding | 72.9 | 74.5 | +1.6 |
| GPQA | 30.8 | 41.9 | +11.1 |
Also, in an ablation with and without context in the enriched answers, the variant with context showed gains on several metrics (selected examples):
| Evaluation | No context | With context | Change |
| --- | --- | --- | --- |
| PIQA | 82.86 | 84.44 | +1.58 |
| AGIEval-en CoT | 63.16 | 69.32 | +6.16 |
| GPQA-Diamond CoT n-shot | 34.85 | 45.96 | +11.11 |
Quick interpretation: the largest jumps (for example GPQA) suggest that enriched examples with knowledge and reasoning steps help handle harder scientific questions. Improvements aren’t limited to the seed task but appear across multiple groups, supporting the idea of transfer between task families.
Practical findings and recommendations
Broad coverage of seeds reduces overfitting to a single evaluation style.
Context and traces of reasoning help more than the answer alone, especially on reasoning and science tasks.
Storing semantic text in the answer is better than cryptic labels.
Multiple-choice is easy to check; generative tasks require extraction and task-specific validation pipelines.
Mixture design matters: without controls, large tasks dominate the mix. You need sampling adjustments to preserve important families.
Verify improvements with broad metrics: a gain on MMLU-Pro or GPQA is more credible when other capabilities (math, code, general knowledge) remain stable.
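The mixture-design point above can be made concrete with exponent-smoothed sampling, a common way to stop large tasks from dominating a mix. The article does not say which adjustment NVIDIA used, so the technique, the `alpha` value and the task sizes below are all assumptions for illustration.

```python
def mixture_weights(task_sizes: dict, alpha: float = 0.5) -> dict:
    """Sampling weight per task, proportional to size ** alpha.

    alpha = 1.0 reproduces size-proportional sampling;
    smaller values flatten the mix toward uniform.
    """
    scaled = {task: size ** alpha for task, size in task_sizes.items()}
    total = sum(scaled.values())
    return {task: value / total for task, value in scaled.items()}

# Hypothetical task families with very different sample counts.
sizes = {"big_qa": 1_000_000, "science": 10_000, "logic": 10_000}

proportional = mixture_weights(sizes, alpha=1.0)
smoothed = mixture_weights(sizes, alpha=0.5)

for task in sizes:
    print(f"{task}: proportional={proportional[task]:.3f} smoothed={smoothed[task]:.3f}")
```

With these numbers, the large task falls from about 98% of the mix under proportional sampling to about 83% with `alpha=0.5`, and each small family rises from about 1% to about 8%.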
Limits and considerations
Do not use test/holdout splits to generate examples: the pipeline only takes public training splits as seeds.
Generation can reproduce the seeds' biases; verification and source diversity help.
For commercial training runs (Nemotron Ultra/Super), a license-compatible subset was filtered and selected.
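The first point in this list, using only training splits as seeds, is easy to enforce with an explicit allow-list. The sketch below is a hypothetical guard: the split names, the task names and the `select_seed_splits` helper are assumptions, not part of lm-eval-harness or of NVIDIA's pipeline.

```python
# Only splits named here may be used as seeds (assumed naming convention).
ALLOWED_SPLITS = {"train"}

def select_seed_splits(available: dict) -> dict:
    """Map each task to its training split, skipping tasks that have none."""
    selected = {}
    for task, splits in available.items():
        train_splits = [s for s in splits if s.lower() in ALLOWED_SPLITS]
        if train_splits:
            selected[task] = train_splits[0]
    return selected

# Hypothetical tasks: the second one only ships a test split, so it is skipped.
available = {
    "task_a": ["train", "validation", "test"],
    "task_b": ["test"],
}
print(select_seed_splits(available))
```

An allow-list fails closed: a task with an unexpected split name is dropped rather than accidentally seeding generation from evaluation data.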
The main lesson is clear: generating more data isn’t enough. Generating data with the right structure, explanation signals and enough metadata for mixing decisions offers a practical, scalable lever to improve reasoning and QA skills in late pretraining stages.