Lack of useful, localized data is a real barrier to AI adoption in Japan. Can you design a practical strategy that respects privacy and speeds up development? NTT DATA's answer is yes: synthetic personas and auditable pipelines.
What NTT DATA did and why it matters
NTT DATA used the Nemotron-Personas-Japan dataset, generated with NVIDIA NeMo Data Designer, to expand small local data collections and train domain models with striking results. The dataset contains 6 million synthetic personas grounded in Japanese demographics, occupations and regional distributions, and it's available under CC BY 4.0.
In a toy experiment on a fictional legal classification task, starting from a few hundred real examples and tens of thousands of synthetic samples, NTT DATA raised accuracy from 15.3% to 79.3%. That's an improvement of 64 percentage points without exposing sensitive data in the training pipeline.
Why does this matter for you as a developer or product lead? Because it shows that, with a few real seed examples and a reproducible synthesis strategy, you can build useful task-specific models without relying on massive proprietary corpora.
How synthetic personas work (technical but clear)
A 'synthetic persona' is a generated profile that combines demographic, occupational and behavioral traits (for example, age, profession, location, interaction scenarios). From those profiles you generate texts, conversations or documents that reflect real-world patterns without containing PII.
Technically, the common flow is as follows (a minimal sketch appears after the list):
Define or sample a set of personas (e.g. 500 profiles with real demographic distribution).
Create templates and controlled prompts to generate textual samples aligned with tasks (legal documents, support queries, forms).
Validate and clean synthetic samples, ensuring diversity and absence of PII.
Use SFT (supervised fine-tuning) with a mix of real and synthetic data to adapt base models.
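To make that flow concrete, here is a minimal sketch in Python for the first two steps. The persona fields, the occupation and scenario lists, and the template wording are all illustrative assumptions, not the actual NeMo Data Designer API; in practice you would plug the generated prompts into your own text generator or the Data Designer pipeline.

```python
import random
from dataclasses import dataclass

@dataclass
class Persona:
    """Illustrative synthetic persona: no real PII, only plausible traits."""
    age: int
    occupation: str
    prefecture: str
    scenario: str

# Step 1: sample personas following a target demographic distribution (toy values).
OCCUPATIONS = ["paralegal", "small-business owner", "municipal clerk"]
PREFECTURES = ["Tokyo", "Osaka", "Hokkaido"]
SCENARIOS = ["contract dispute", "employment question", "consumer complaint"]

def sample_personas(n: int, seed: int = 42) -> list[Persona]:
    rng = random.Random(seed)  # fixed seed keeps the pipeline reproducible
    return [
        Persona(
            age=rng.randint(22, 70),
            occupation=rng.choice(OCCUPATIONS),
            prefecture=rng.choice(PREFECTURES),
            scenario=rng.choice(SCENARIOS),
        )
        for _ in range(n)
    ]

# Step 2: controlled prompt templates aligned with the target task.
TEMPLATES = [
    "Write a short inquiry from a {age}-year-old {occupation} in {prefecture} about a {scenario}.",
    "Draft the summary of a {scenario} filed by a {occupation} living in {prefecture}.",
]

def build_prompts(personas: list[Persona]) -> list[str]:
    return [
        template.format(age=p.age, occupation=p.occupation,
                        prefecture=p.prefecture, scenario=p.scenario)
        for p in personas
        for template in TEMPLATES
    ]

if __name__ == "__main__":
    prompts = build_prompts(sample_personas(500))
    print(len(prompts), "prompts ready to send to your text generator")
```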
NTT DATA shows that, with enough volume and quality of synthetic data, the CPT (continual pretraining) stage may not be necessary. That cuts GPU usage, time and operational costs.
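As an illustration of the SFT-only route, the sketch below mixes real seed examples with synthetic samples and fine-tunes a small base model using Hugging Face TRL. The model name, file names and epoch count are placeholders, and SFTTrainer's arguments have changed between TRL releases, so treat this as a starting point rather than a drop-in recipe.

```python
# Minimal SFT sketch: mix real seeds with synthetic samples, skip CPT entirely.
# Assumes real.jsonl and synthetic.jsonl each expose a "text" column.
from datasets import load_dataset, concatenate_datasets
from trl import SFTConfig, SFTTrainer

real = load_dataset("json", data_files="real.jsonl", split="train")
synthetic = load_dataset("json", data_files="synthetic.jsonl", split="train")

# Keep every real example, add the synthetic bulk, shuffle so batches mix both.
train_ds = concatenate_datasets([real, synthetic]).shuffle(seed=42)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",  # placeholder base model
    train_dataset=train_ds,
    args=SFTConfig(output_dir="sft-legal-classifier", num_train_epochs=3),
)
trainer.train()
```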
Auditability and governance
Synthesis can and should be reproducible. Pipelines built on NeMo Data Designer can record logs, random seeds and template versions, which makes auditing, traceability and compliance with laws like Japan's Act on the Protection of Personal Information (APPI) easier.
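One lightweight way to get that traceability, whatever tooling you use, is to write a manifest next to every synthetic batch. The fields below are an assumption about what an auditor would want to see, not a NeMo Data Designer feature.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

def write_synthesis_manifest(batch_path: str, template_file: str,
                             seed: int, generator: str) -> Path:
    """Record everything needed to reproduce and audit one synthetic batch."""
    template_text = Path(template_file).read_text(encoding="utf-8")
    manifest = {
        "created_at": datetime.now(timezone.utc).isoformat(),
        "generator": generator,           # model or tool version used
        "random_seed": seed,              # rerun with this seed to reproduce
        "template_file": template_file,
        "template_sha256": hashlib.sha256(template_text.encode()).hexdigest(),
        "output_sha256": hashlib.sha256(Path(batch_path).read_bytes()).hexdigest(),
    }
    out = Path(batch_path).with_suffix(".manifest.json")
    out.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return out

# Example (hypothetical paths):
# write_synthesis_manifest("batch_001.jsonl", "templates/legal_v3.txt",
#                          seed=42, generator="nemo-data-designer")
```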
Key results and practical lessons
Base dataset: Nemotron-Personas-Japan (6,000,000 synthetic personas).
Expansion experiment: with ~450 seed examples and 500 persona profiles, ~138,000 synthetic samples were generated.
Accuracy improvement: from 15.3% (no training) to 79.3% (SFT with synthetic data).
Positive side effect: fewer model hallucinations on the legal classification task.
| Configuration | Seed data | Synthetic data | Accuracy |
| --- | --- | --- | --- |
| Baseline (no training) | — | — | 15.3% |
| SFT with synthetic data | 240–450 | 138,000 | 79.3% |
Practical lessons for your team:
Identify 200 to 500 representative seed examples from the domain.
Design personas that cover demographics, roles and real market scenarios.
Generate multiple templates and control randomness to avoid repetitive biases.
Prioritize human validation by sampling to catch conceptual errors (see the sketch after this list).
Keep versioned records for audit and compliance.
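For the validation step, a routine like the one below pulls a random sample for human review and flags obvious leaks. It is only a sketch: the regexes cover phone numbers and emails and are nowhere near a complete PII check for Japanese text.

```python
import json
import random
import re

# Illustrative patterns only. Real PII screening for Japanese text needs proper
# tooling (names, addresses, My Number, etc.), not two regexes.
PII_PATTERNS = [
    re.compile(r"\b0\d{1,4}-\d{1,4}-\d{3,4}\b"),  # Japanese-style phone number
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),        # email address
]

def sample_for_review(path: str, k: int = 50, seed: int = 7) -> list[dict]:
    """Draw k random synthetic samples for human review and flag obvious PII."""
    with open(path, encoding="utf-8") as f:
        rows = [json.loads(line) for line in f]
    picked = random.Random(seed).sample(rows, min(k, len(rows)))
    for row in picked:
        row["pii_flag"] = any(p.search(row["text"]) for p in PII_PATTERNS)
    return picked

# reviewers_batch = sample_for_review("synthetic.jsonl", k=100)
```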
Impact on data governance and organizational strategy
Data synthesis is not just a lab trick. It's a Privacy Enhancing Technology (PET) that lets companies share trends and patterns without exposing PII. Together with approaches like data spaces and federated learning, organizations can collaborate under common governance frameworks while preserving data sovereignty.
For teams in regulated environments this means shifting from a defensive stance to a collaborative one: exchange insights based on reproducible, auditable synthetic data and accelerate local innovation without relying solely on large models trained outside the region.
What's next and how to get started today?
Want to try it on a real project? Concrete steps:
Try NeMo Data Designer to generate datasets from templates and seed examples.
Use the Nemotron-Personas-Japan dataset as a starting point for population sampling (a loading sketch follows this list).
Plan iterative SFT instead of massive CPT: faster experiments, lower pretraining cost.
Integrate privacy controls and audit pipelines from the very first phase.
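To start from the public personas, something like the sketch below works, assuming the dataset is published on Hugging Face under an ID like nvidia/Nemotron-Personas-Japan and exposes demographic columns such as occupation; verify the actual ID and schema on the dataset card before relying on them.

```python
from collections import Counter
from datasets import load_dataset

# Dataset ID and column names are assumptions: check the dataset card first.
personas = load_dataset("nvidia/Nemotron-Personas-Japan", split="train")

print(personas.column_names)   # inspect the actual schema
print(personas[0])             # look at one synthetic persona

# Quick look at how occupations are distributed before sampling.
if "occupation" in personas.column_names:
    print(Counter(personas["occupation"]).most_common(10))

# Reproducible subsample to seed your own template-driven generation.
subset = personas.shuffle(seed=42).select(range(500))
```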
If you manage product or lead AI at a company, this approach can cut costs, speed up deployments and improve model quality in domains where real data is scarce.
The data scarcity barrier isn't insurmountable. With open tools, well-designed personas and reproducible pipelines, you can build AI that understands Japanese culture and language without exposing sensitive information. Ready to get started?