Synthetic personas power Japanese AI: scale and privacy
Cultural data scarcity is the barrier holding back many AI projects in Japan. Sound familiar? NTT DATA shows that a handful of your own examples, combined with an open set of synthetic personas, can take you from prototype to production without exposing sensitive data or paying the high cost of massive manual collection.
What NTT DATA did with Nemotron-Personas-Japan
NTT DATA used Nemotron-Personas-Japan, an open set of 6 million synthetic personas generated with NeMo Data Designer, to expand proprietary seed data and train models that understand Japanese language and context. The experiment was deliberately controlled: fictional legal documents forced the model to learn genuinely new terminology rather than fall back on patterns memorized during pretraining.
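The persona set is openly published, so you can inspect it yourself. A minimal sketch, assuming the dataset is hosted on the Hugging Face Hub under nvidia/Nemotron-Personas-Japan (verify the exact repo id, split names, and fields on the hub):

```python
# Sketch: peek at the synthetic persona set without downloading all 6M rows.
# Assumes the dataset lives on the Hugging Face Hub as
# "nvidia/Nemotron-Personas-Japan" -- check the hub for the exact id and schema.
from datasets import load_dataset

personas = load_dataset("nvidia/Nemotron-Personas-Japan", split="train", streaming=True)

# Stream the first few records to see the persona fields
# (occupation, region, backstory, etc. -- exact names may differ).
for i, persona in enumerate(personas):
    print(persona)
    if i >= 2:
        break
```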
Key results:
Persona dataset: 6 million synthetic personas across 1,500+ occupational categories, with regional distributions based on official statistics.
Test protocol: they started with 450 seed samples and used 500 synthetic personas to expand them into 138,000 examples, roughly 300x the size of the seed set (see the quick check after this list).
Accuracy improvement: from 15.3% at baseline to 79.3% after supervised fine-tuning with synthetic data.
Effect on hallucinations: the trained version stopped inventing plausible-but-wrong legal classifications and began to extract precise terminology.
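A quick sanity check on the expansion factor, using only the figures reported above:

```python
# Quick check on the reported expansion factor.
seeds = 450          # proprietary seed samples
expanded = 138_000   # examples after persona-conditioned expansion

print(expanded / seeds)  # ~306.7, i.e. roughly 300x the seed set
```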
Expanding a small proprietary set with synthetic personas lets you build task-specific models while preserving privacy and reducing dependence on continued pre-training (CPT).
Why this matters for engineering and governance
From a technical point of view, the findings show that well-designed synthetic generation can replace some of the heavy lifting in training: in many cases CPT becomes optional if you do enough supervised fine-tuning (SFT) on high-quality synthetic data. That means fewer GPU hours, faster experiment cycles, and more iterative pipelines.
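To make the SFT-only path concrete, here is a minimal sketch using Hugging Face TRL. The base model, file name, and hyperparameters are placeholders, not NTT DATA's actual configuration, and the TRL API varies slightly between versions:

```python
# Minimal SFT sketch with Hugging Face TRL.
# Assumptions: a recent TRL version, an expanded synthetic dataset in JSONL
# with a "text" field, and a small placeholder base model.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# "synthetic_expanded.jsonl" is a hypothetical output of the expansion step.
train_ds = load_dataset("json", data_files="synthetic_expanded.jsonl", split="train")

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",   # placeholder base model, swap in your own
    train_dataset=train_ds,       # SFTTrainer uses the "text" column by default
    args=SFTConfig(
        output_dir="sft-legal-ja",
        num_train_epochs=3,
        per_device_train_batch_size=4,
    ),
)
trainer.train()
```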
On governance and compliance, Japan has strong frameworks such as the Act on the Protection of Personal Information (APPI) and the AI governance guidelines published in September 2025. Controlled synthesis lets you minimize PII exposure, keep data transformations traceable, and produce auditable artifacts for compliance teams.
How to replicate the recipe (practical steps)
1. Select a small set of real, domain-representative seeds (for example 400–500 examples).
2. Define relevant cultural/occupational profiles and use NeMo Data Designer to condition generation on each persona.
3. Generate multiple variations per seed: templates, paraphrases, context shifts and class balancing (see the sketch after this list).
4. Apply automated filters and human checks to control quality and remove residual PII.
5. Train with SFT on the expanded synthetic set; evaluate with robust metrics: accuracy, F1, hallucination rate and calibration.
6. If SFT reaches the target performance, consider skipping CPT to save time and cost; always validate on held-out real data.
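NeMo Data Designer handles steps 2–4 in NTT DATA's pipeline; its exact API isn't reproduced here, so the sketch below is a schematic stand-in for the core idea. `call_llm` is a stub for whatever generation backend you use, and the persona fields ("occupation", "region") are hypothetical:

```python
# Schematic stand-in for steps 2-4: persona-conditioned expansion plus basic
# quality filters. NeMo Data Designer implements this for real; call_llm is a
# stub and the persona field names are assumptions.
import hashlib
import re

def call_llm(prompt: str) -> str:
    """Stub: plug in your generation backend (NeMo, a hosted API, a local model)."""
    raise NotImplementedError

# Crude residual-PII patterns; a production pipeline needs proper PII detection.
PII_PATTERNS = [
    re.compile(r"\b\d{2,4}-\d{2,4}-\d{4}\b"),  # phone-like numbers
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),   # email addresses
]

def expand_seed(seed_text: str, personas: list[dict]) -> list[str]:
    """Generate persona-conditioned variants of one seed, filtered and deduplicated."""
    variants, seen = [], set()
    for persona in personas:
        for style in ("paraphrase it", "shift the scenario", "change the register"):
            prompt = (
                f"You are a {persona['occupation']} from {persona['region']}.\n"
                f"Rewrite the following legal text in your own words; {style}, "
                f"but preserve every legal term exactly.\n\n{seed_text}"
            )
            text = call_llm(prompt)
            if any(p.search(text) for p in PII_PATTERNS):
                continue  # drop anything that looks like residual PII
            digest = hashlib.sha256(text.encode()).hexdigest()
            if digest not in seen:  # drop exact duplicates
                seen.add(digest)
                variants.append(text)
    return variants
```

In a real pipeline the seed's class label travels with each variant, so the class balancing in step 3 can be enforced when assembling the final training set.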
Good technical practices
Measure hallucinations with adversarial tests and out-of-distribution examples.
Keep traceability of how each synthetic example was generated, for audits (see the provenance sketch after this list).
Balance synthetic classes to avoid template-induced biases.
Use human validation in the first iteration, then automate QA with evaluation models.
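One lightweight way to keep that traceability is a provenance record per synthetic example, appended as JSONL so audits can trace any training row back to its seed, persona, and generator. The schema below is illustrative, not a standard:

```python
# Illustrative provenance record per synthetic example (not a standard schema).
import hashlib
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone

@dataclass
class Provenance:
    example_id: str   # short hash of the generated text
    seed_id: str      # which proprietary seed it came from
    persona_id: str   # which synthetic persona conditioned it
    template: str     # generation template / instruction used
    generator: str    # model or tool version that produced it
    created_at: str   # ISO timestamp for audit trails

def record(text: str, seed_id: str, persona_id: str,
           template: str, generator: str) -> Provenance:
    return Provenance(
        example_id=hashlib.sha256(text.encode()).hexdigest()[:16],
        seed_id=seed_id,
        persona_id=persona_id,
        template=template,
        generator=generator,
        created_at=datetime.now(timezone.utc).isoformat(),
    )

# Append one JSONL line per generated example.
with open("provenance.jsonl", "a", encoding="utf-8") as f:
    row = record("...", "seed-042", "persona-1138", "paraphrase-v2", "designer-0.1")
    f.write(json.dumps(asdict(row)) + "\n")
```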
Strategic implications: sovereignty, collaboration and the data economy
Synthetic personas act as a data primitive: they let local models learn culturally specific behavior and terminology without relying on large Western corpora. They also open the door to collaborative data spaces where organizations share synthetic representations instead of real data, enabling federation and end-to-end encryption with lower risk.
For Japanese companies aiming for data sovereignty, this means you can build interoperable AI aligned with local rules, stay compliant, and reduce leakage into global, non-auditable models.
Quick technical recommendations
If you have a domain with few samples, try a synthetic expansion pipeline before investing in massive collection.
Prioritize seed quality and diversity of conditioning personas.
Implement hallucination metrics and traceability pipelines from the start (a minimal hallucination check is sketched below).
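For a closed-taxonomy task like the legal-classification experiment, one cheap hallucination proxy is the share of predictions that fall outside the known label set, i.e. invented classifications. A minimal sketch; the label set and example predictions are made up:

```python
# Minimal hallucination-rate proxy for a closed label taxonomy: any predicted
# class not in the known set counts as an invented label.
VALID_LABELS = {"契約解除", "損害賠償", "守秘義務"}  # example legal categories

def hallucination_rate(predictions: list[str]) -> float:
    if not predictions:
        return 0.0
    invented = sum(1 for p in predictions if p not in VALID_LABELS)
    return invented / len(predictions)

# "精神的契約違反" is a plausible-sounding but invented category.
print(hallucination_rate(["契約解除", "精神的契約違反", "守秘義務"]))  # ~0.33
```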
The data barrier isn't an immovable wall; it's a method problem. The combination of synthetic personas, open-source infrastructure like NeMo Data Designer, and reproducible validation practices lets you scale local, responsible models today, not in some hypothetical future.