Nemotron-Personas-Brazil is an open collection of 6 million synthetic personas in Brazilian Portuguese, designed to train and evaluate models that genuinely understand Brazil's cultural and demographic diversity. It is aimed at developers and researchers who need locally grounded data, is commercially usable (CC BY 4.0), and contains no data about real individuals.
What is Nemotron-Personas-Brazil?
It's a synthetic, structured dataset: 1 million base records, each expanded into 6 distinct personas, for a total of 6 million personas. The texts are in natural Brazilian Portuguese, and each persona includes cultural background, skills, goals, hobbies and interests.
- Size: ~1.4 billion tokens in total, of which ~450 million are persona tokens.
- Fields: 20 per record (6 persona fields + 14 contextual fields with statistical anchoring).
- Geographic coverage: all 26 Brazilian states plus the Federal District, anchored down to municipality level.
- Unique names: ~457k Portuguese names.
- Occupations: 1,500+ categories reflecting the real workforce, including micro-entrepreneurs and regional trades.
- Persona types: professional, sports, arts, travel and others.
Statistically grounded data: each person is aligned with official distributions from the Brazilian Institute of Geography and Statistics (IBGE), but does not represent any real individual.
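As a rough illustration of that shape, a record pairs statistically anchored contextual attributes with free-text persona fields. The field names and values below are hypothetical stand-ins, not the dataset's actual schema:

```python
# Hypothetical sketch of one record's shape; real field names may differ.
record = {
    # Contextual fields, anchored to IBGE distributions (illustrative subset)
    "state": "Bahia",
    "municipality": "Feira de Santana",
    "age": 34,
    "occupation": "microempreendedora",
    # Free-text persona fields in Brazilian Portuguese (illustrative subset)
    "professional_persona": "Dona de uma pequena loja de cosméticos no centro.",
    "hobbies_and_interests": "Forró, futebol de várzea e culinária regional.",
}

# Minimal sanity checks one might run over real records:
assert isinstance(record["age"], int) and 0 <= record["age"] <= 120
assert record["state"]  # every record is anchored to a state
```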
How it was generated (technical overview)
Generation combines probabilistic models and large language models inside NVIDIA's NeMo Data Designer platform. The pipeline includes structured generation, validation and retry mechanisms to scale while keeping population-level coherence.
Key components:
- A Probabilistic Graphical Model (Apache-2.0) to ensure attribute combinations (age, location, occupation, life stage) follow real distributions.
- GPT-OSS-120B (Apache-2.0) for narrative generation in Brazilian Portuguese, producing natural and culturally faithful texts.
- Automated validation flows that detect inconsistencies and re-generate when needed.
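The validate-and-retry pattern behind that last component can be sketched generically. The generator and validator below are stand-ins, not NeMo Data Designer APIs:

```python
import random

def generate_persona(rng):
    # Stand-in for an LLM call: occasionally yields an inconsistent record.
    return {
        "age": rng.randint(10, 80),
        "occupation": "estudante" if rng.random() < 0.5 else "engenheira",
    }

def is_consistent(persona):
    # Example consistency rule: flag a formal profession at an implausible age.
    return not (persona["occupation"] == "engenheira" and persona["age"] < 18)

def generate_with_retries(rng, max_attempts=50):
    # Re-generate until the record passes validation, within a fixed budget.
    for _ in range(max_attempts):
        persona = generate_persona(rng)
        if is_consistent(persona):
            return persona
    raise RuntimeError("no consistent persona within retry budget")

persona = generate_with_retries(random.Random(42))
assert is_consistent(persona)
```

The real pipeline layers this pattern on top of statistically sampled attributes, so retries repair narrative inconsistencies without distorting the underlying distributions.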
An extended version will also be available inside NeMo Data Designer, so you can generate, refine and extend personas as part of your own synthetic pipelines.
Why that combination?
The probabilistic model keeps demographic statistics faithful to IBGE, while the LLM adds narrative richness and linguistic coherence. A synthetic persona that is both representative and usable for natural language tasks needs both: statistical grounding and natural-language quality.
What it contains and how to use it
- Format: ready to load from Hugging Face Datasets.
Example to get started:
from datasets import load_dataset
dataset = load_dataset("nvidia/nemotron-personas-brazil")
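Once loaded, records can be grouped for stratified inspection. The sketch below uses a small in-memory stand-in for a few records (with the real dataset they would come from the loaded split, and field names are illustrative):

```python
from collections import defaultdict

# Stand-in records; in practice, iterate over the loaded dataset split.
records = [
    {"state": "SP", "occupation": "professora"},
    {"state": "SP", "occupation": "motorista"},
    {"state": "BA", "occupation": "pescador"},
]

# Group records by state for per-region inspection.
by_state = defaultdict(list)
for rec in records:
    by_state[rec["state"]].append(rec)

print({state: len(recs) for state, recs in by_state.items()})
# → {'SP': 2, 'BA': 1}
```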
Practical uses:
- Multi-turn conversations: seeds to generate authentic dialogues in Brazilian Portuguese.
- Training local assistants: fine-tuning to improve cultural understanding and regional references.
- Bias and fairness testing: evaluate performance across rural and urban areas, age groups and education levels.
- Domain data generation: create annotated datasets from personas for regulated or government sectors.
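For the multi-turn seeding use case, one simple approach is to render a persona record into a system prompt. The template and field names here are illustrative, not a prescribed format:

```python
def persona_to_system_prompt(persona):
    # Render a persona into a Portuguese system prompt for dialogue seeding.
    return (
        f"Você é {persona['name']}, de {persona['municipality']} ({persona['state']}). "
        f"Profissão: {persona['occupation']}. Interesses: {persona['interests']}. "
        "Responda de forma natural e coerente com esse perfil."
    )

# Illustrative persona record (hypothetical field names).
persona = {
    "name": "Marina",
    "municipality": "Recife",
    "state": "PE",
    "occupation": "enfermeira",
    "interests": "frevo e literatura de cordel",
}
prompt = persona_to_system_prompt(persona)
assert "Marina" in prompt and "Recife" in prompt
```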
Technical and evaluation considerations
- Tokenization and costs: with ~1.4B tokens, plan storage and compute budget according to your model's tokenization (e.g., subword BPE/Unigram).
- Fine-tuning: you can use these personas for further pretraining or fine-tuning; consider separating personas by macro-region for generalization analysis.
- Distribution validation: compare statistics (e.g., age histograms, state distribution, occupation classes) between the dataset and IBGE public tables before training.
- Fairness metrics: use disparity and fairness metrics (accuracy per subgroup, calibration per subgroup, AUC by segment) to detect gaps.
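Distribution validation can start with something as simple as total variation distance between the dataset's state frequencies and a reference table. The reference proportions below are placeholders, not IBGE figures:

```python
from collections import Counter

def total_variation_distance(observed, reference):
    # 0.5 * sum over categories of |p_obs - p_ref|; 0 means identical.
    total = sum(observed.values())
    keys = set(observed) | set(reference)
    return 0.5 * sum(
        abs(observed.get(k, 0) / total - reference.get(k, 0.0)) for k in keys
    )

# Placeholder reference proportions (use real IBGE tables in practice).
reference = {"SP": 0.22, "MG": 0.10, "BA": 0.07, "other": 0.61}
sample_states = ["SP"] * 22 + ["MG"] * 10 + ["BA"] * 7 + ["other"] * 61
tvd = total_variation_distance(Counter(sample_states), reference)
print(round(tvd, 3))  # → 0.0 for this perfectly matched sample
```

A non-trivial threshold (and the same check per occupation class or age bucket) would flag drift before any training run.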
Ethics, privacy and limits
- No PII: the dataset is designed not to represent real people. Names and attribute combinations follow real distributions but are fully synthetic.
- Bias risk: synthetic data does not remove biases present in the statistical sources or the generative models; it still requires auditing and robustness testing.
- License: CC BY 4.0 allows commercial use with attribution; that makes adoption easier for startups and public entities.
Synthetic doesn't mean infallible. You need technical and social evaluation before deploying models to production.
Impact for sovereign AI in Brazil
Nemotron-Personas-Brazil lowers a technical and legal barrier: it offers Brazilian Portuguese data with national coverage and an open license. For governments, SMEs and research teams in Brazil, this means you can train and audit models that better understand local contexts without relying solely on English-language datasets or closed providers.
It also serves as a resource for governance and AI regulation initiatives: it enables comparable, reproducible tests on equity and performance across local populations.
Quick recommendations for engineering teams
- Before training: inspect distributions and sample by state and occupation.
- During training: experiment with fine-tuning and subgroup-calibrated classification.
- During validation: measure performance on real tasks (e.g., dialogue responses, occupational classification) and evaluate disparities.
- For deployment: combine synthetic data with small amounts of audited, labeled real data to improve adaptation and safety.
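The "sample by state and occupation" step in the first recommendation can be a capped stratified pass over the records. Field names are illustrative:

```python
from collections import defaultdict

def stratified_sample(records, cap_per_stratum=2):
    # Keep at most `cap_per_stratum` records per (state, occupation) pair.
    buckets = defaultdict(list)
    for rec in records:
        key = (rec["state"], rec["occupation"])
        if len(buckets[key]) < cap_per_stratum:
            buckets[key].append(rec)
    return [rec for bucket in buckets.values() for rec in bucket]

# Illustrative stand-in records.
records = [
    {"state": "SP", "occupation": "professora"},
    {"state": "SP", "occupation": "professora"},
    {"state": "SP", "occupation": "professora"},
    {"state": "RS", "occupation": "agricultor"},
]
sample = stratified_sample(records, cap_per_stratum=2)
print(len(sample))  # → 3 (two SP teachers kept, one RS farmer)
```

For a quick manual review, a cap of a handful of records per stratum already surfaces regional and occupational quirks before committing to a full training run.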
Nemotron-Personas-Brazil is a practical bet: it puts representative data in the hands of those building AI in Brazil. It doesn't solve every data challenge, but it is a powerful resource that lets local developers build, evaluate and justify models grounded in Brazilian realities.
