NVIDIA and AI Singapore have launched Nemotron-Personas-Singapore, a synthetic persona dataset designed for anyone building AI with local and regulatory requirements in mind. Why does this matter now? Because AI sovereignty isn't just about having models; it's about having data, evaluations and metrics that reflect social reality without exposing real people.
What is Nemotron-Personas-Singapore?
It's a collection of synthetic personas meant for training and evaluating AI systems focused on Singapore. It was co-created with AI Singapore (AISG) and released under a CC BY 4.0 license, which makes it easier for you to use in commercial and public-sector projects without relying on personally identifiable information.
The dataset aims to be culturally contextualized and statistically grounded: there are no real individuals, no PII, and the risk of reidentification is minimized by basing generation on public statistics, including the 2024 census and other official sources.
Key data and structure
- 888,000 synthetic personas (148,000 records × 6 personas per record).
- ~118 million tokens in total, with ~48 million belonging to the persona descriptions.
- 38 fields per record: 7 persona fields + 31 contextual fields aligned to official statistics.
- Full geographic coverage: all 55 planning areas of Singapore.
- Names: 148k unique names (8,992 given names, 4,182 middle names, 4,894 surnames) sampled according to local distributions.
- Varied persona types: professional, sports, arts and travel, among others.
The contextual attributes include education at finer granularity than the census, occupations aligned with Singapore's services-heavy workforce, life stage (employment, retirement, household composition), preferred language, religion, ethnicity and digital familiarity by age cohort.
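To make that structure concrete, here is a rough sketch of how a single record might look; the field names and values below are illustrative guesses, not the dataset's actual schema:

# Illustrative sketch only: field names and values are hypothetical,
# not the dataset's published schema.
example_record = {
    # persona fields (free-text descriptions)
    "professional_persona": "A mid-career logistics planner at a port operator...",
    "travel_persona": "Prefers short trips around Southeast Asia...",
    # contextual fields aligned to official statistics
    "age_group": "35-39",
    "ethnicity": "Chinese",
    "religion": "Buddhism",
    "preferred_language": "English",
    "education_level": "Polytechnic diploma",
    "occupation": "Logistics and supply chain planner",
    "planning_area": "Tampines",
}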
How it was generated (technical overview)
Generation used NeMo Data Designer, NVIDIA's microservice for synthetic data in enterprise settings. The pipeline combined:
- A Probabilistic Graphical Model (Apache-2.0) for statistical anchoring to public distributions.
- GPT-OSS-120B (Apache-2.0) for narrative generation of the person descriptions.
The idea is to separate the statistical structure (which probabilities each attribute must respect) from the narrative layer (how each persona's story reads). That separation gives you reproducible, auditable and inspectable records for model evaluation.
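A minimal sketch of that two-stage separation, with placeholder distributions and a commented-out LLM call standing in for the actual NeMo Data Designer pipeline:

import random

# Stage 1: statistical anchoring -- draw attributes from distributions
# derived from public statistics (the weights below are placeholders).
ethnicity_weights = {"Chinese": 0.74, "Malay": 0.14, "Indian": 0.09, "Other": 0.03}

def sample_attributes(rng):
    ethnicity = rng.choices(
        list(ethnicity_weights), weights=list(ethnicity_weights.values())
    )[0]
    age_group = rng.choice(["20-29", "30-39", "40-49", "50-59", "60+"])
    return {"ethnicity": ethnicity, "age_group": age_group}

def build_prompt(attrs):
    # Stage 2: narrative generation -- an LLM turns the sampled attributes into a
    # persona description; the real pipeline uses GPT-OSS-120B, not reproduced here.
    return (
        "Write a short, realistic persona description for a Singapore resident "
        f"with these attributes: {attrs}. Do not include identifying details."
    )

rng = random.Random(42)
prompt = build_prompt(sample_attributes(rng))
# persona_text = your_llm_client.generate(prompt)  # placeholder for any LLM call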
Purposes and use cases
Nemotron-Personas-Singapore is mainly targeted at teams building "sovereign" AI in Singapore, but it's also useful for global developers who need to improve performance and adoption in Singaporean contexts. Curious how you might use it?
Practical uses:
- Financial services: bias testing, suitability checks and stress tests without using sensitive customer data (a sketch of this kind of bias testing follows this list).
- Health: safe evaluation of clinical assistants, patient chatbots and medical translation across literacy levels.
- Consumer safety: detecting hallucinations, tone failures and demographic-specific risks.
- Benchmarking: standardized, model-agnostic inputs for reproducible comparisons between models and institutions.
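As an illustration of the bias-testing use case above, the sketch below compares a simple metric across demographic groups; model_answer is a placeholder for whatever model you evaluate, and the field names are assumptions, not the dataset's actual schema:

from collections import defaultdict

def refusal_rate_by_group(personas, model_answer, group_field="ethnicity"):
    # Build one loan-eligibility prompt per persona, collect the answers,
    # and compare refusal rates across groups defined by group_field.
    counts, refusals = defaultdict(int), defaultdict(int)
    for persona in personas:
        prompt = (
            "You are a retail banking assistant. Customer profile: "
            f"{persona['professional_persona']}. "  # assumed field name
            "The customer asks whether they can apply for a personal loan."
        )
        answer = model_answer(prompt)
        counts[persona[group_field]] += 1
        refusals[persona[group_field]] += int("cannot" in answer.lower())
    return {group: refusals[group] / counts[group] for group in counts}

Large gaps between groups in a metric like this are a signal to investigate before deployment, not proof of bias on their own.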
Regulatory alignment and governance
The generation process is designed to reduce regulatory friction, supporting compliance with Singapore's Personal Data Protection Act (PDPA) and emerging AI governance frameworks. The personas are fully synthetic and the methodology is documented to facilitate audits, reviews and evidence-based oversight.
Integration and extensibility
Nemotron-Personas-Singapore integrates with Nemotron models and other open LLMs for fine-tuning and evaluation. There's also an extended version that will be available directly inside NeMo Data Designer so you can generate, refine and expand specific personas as part of your own synthetic pipelines.
Quick example to load the dataset with Hugging Face:
from datasets import load_dataset
dataset = load_dataset("nvidia/nemotron-personas-singapore")
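To see what you actually downloaded, you can inspect the splits and a sample record (the "train" split name is an assumption; check the dataset card for the actual configuration):

print(dataset)              # available splits and row counts
print(dataset["train"][0])  # one record as a Python dict; "train" is assumed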
Risks and limitations
No synthetic dataset removes all risks: you should validate that the distributions and statistical biases match your use case. The statistical approach reduces legal and privacy risks, but it still requires care if you use the data for high-impact decisions (for example, credit decisions or medical diagnosis).
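One simple starting point for that validation, assuming a pandas view of the data and an illustrative column name and reference distribution:

import pandas as pd
from datasets import load_dataset

dataset = load_dataset("nvidia/nemotron-personas-singapore")  # as in the example above
df = dataset["train"].to_pandas()  # "train" split assumed

# Compare the synthetic marginal distribution of one attribute against the
# official proportions you trust; the column name and reference values are
# placeholders, not the dataset's actual schema or census figures.
synthetic = df["ethnicity"].value_counts(normalize=True)
reference = pd.Series({"Chinese": 0.74, "Malay": 0.14, "Indian": 0.09, "Other": 0.03})
print((synthetic - reference).abs().sort_values(ascending=False))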
It's also not a magic fix for fairness: synthetic personas reproduce design choices (which attributes you include, how you model occupations, education levels, and so on). That's why traceability and pipeline documentation are essential.
Who is this for and why does it matter to you?
If you work on models for Singapore, this speeds up creating locally relevant, auditable benchmarks. If you're responsible for compliance or oversight, it gives you a common baseline for evaluations across teams and institutions. And if you're a researcher or engineer, it provides a playground to experiment with synthetic personas without exposing PII.
In the end, AI sovereignty becomes a practice: locally relevant data, transparency in design and tools that enable collaboration between public and private sectors.
Original source
https://huggingface.co/blog/nvidia/nemotron-personas-singapore
