Korean agents with Nemotron-Personas and synthetic data | Keryc
Do you want your agent to speak, think and act like a real Korean professional without touching sensitive personal data? Nemotron-Personas-Korea makes that possible: a bank of demographically accurate synthetic people that helps anchor agents to local contexts in South Korea.
What is Nemotron-Personas-Korea
Nemotron-Personas-Korea is a synthetic people dataset that integrates official statistics and seed data from Korean sources: KOSIS, the Supreme Court of Korea, the National Health Insurance Service and the Korea Rural Economic Institute. NAVER Cloud contributed seed data and domain expertise during the design.
The promise: each persona is demographically correct but contains no personally identifiable information (PII). It was designed with Korea’s Personal Information Protection Act (PIPA) in mind and follows Korea’s official synthetic data generation guidelines.
Total personas: 7 million (1 million records multiplied by 7 persona variants each)
Fields per persona: 26 (7 person fields, 6 persona attributes, 12 demographic and geographic fields, 1 identifier)
Geographic coverage: 17 provinces and 25 districts
Names: ~209K unique names (118 surnames, ~21.4K given names)
Occupations: 2000+ roles, including technology, manufacturing, and public sector
Persona types: professional, family, sports, arts, travel, culinary, concise
Life stages: student, military service, employed, unemployed, retired
Language: natural Korean
License: CC BY 4.0
How it was generated and technical architecture
Generation used NeMo Data Designer, NVIDIA’s system for synthetic data. The pipeline combines a Probabilistic Graphical Model (Apache-2.0) to guarantee statistical anchoring with Gemma-4-31B to generate Korean narratives.
Population sources: KOSIS (releases 2020–2026). Name distributions come from the Supreme Court via namechart.kr. The result is a collection that can be used as a seed to train, fine-tune or condition agents without exposing real PII.
If you work with multilingual agents, the Nemotron-Personas collection includes versions for the US, Japan, India, Singapore, Brazil and France, letting you combine personas across countries in the same workflow.
Practical example: from dataset to a Korean agent in ~20 minutes
Want a public health agent that responds with local confidence? You can filter and mount a persona so the agent uses 존댓말 and references Korean policies.
Loading the dataset (Python, Hugging Face Datasets):
Filter by health occupations and select a persona:
health_personas = dataset["train"].filter(
lambda x: "보건" in x["occupation"] or "간호" in x["occupation"] or "의료" in x["occupation"]
)
print(f"Found {len(health_personas)} health personas")
persona = health_personas[0]
print(persona)
Build a system prompt from the structured fields and the persona narrative to anchor behavior:
system_prompt = f"""당신은 한국의 공중보건 상담 AI 에이전트입니다.
[신원]
- 이름: {persona['name']}
- 지역: {persona['region']}
- 직업: {persona['occupation']}
- 전문분야: {persona['skills']}
[행동 지침]
- 한국어 존댓말을 사용하여 응답하세요.
- 지역 보건소 및 공공 의료 체계에 대한 안내를 제공하세요.
- 한국 공중보건 정책과 절차를 기반으로 정확한 정보를 제공하세요.
"""
Connect the prompt to a model for inference. Options:
NVIDIA API (quick testing)
NVIDIA NIM for self-hosted inference
NemoClaw for always-on agents deployed on OpenShell or NVIDIA infrastructures
Example using NVIDIA’s OpenAI-compatible interface:
That flow takes you from synthetic data to contextual Korean responses, with local references and an appropriate tone.
Governance, evaluation and risk considerations
Does this remove all risks? No. Well-made synthesis reduces PII risk and improves demographic grounding, but you still need controls.
Privacy: Nemotron-Personas-Korea declares zero PII and was designed with PIPA in mind. Still, audit your pipelines in case re-identifiable data emerges after careless combinations.
Governance: follow Korea’s official synthetic data guidance when you use population samples or group by sensitive subgroups.
Bias and distribution: a synthetic persona can replicate statistical biases. Evaluate with slices by region, age and occupation.
Security and alignment: guard against prompt injection and define clear scopes in the system prompt. Log sensitive queries and review outputs in staging environments.
Measurement: use utility metrics (accuracy on local questions), user trust (surveys) and performance (latency and inference cost).
Deployment options and best practices
Nemotron-Personas-Korea is framework-agnostic. Quick recommendations:
For rapid prototyping: call the NVIDIA API with the persona prompt.
For production: NIM for private inference or NemoClaw for always-on agents.
Persona versioning: keep a record of the persona slice version used for each model/prompt version.
A/B testing: compare agents with and without persona grounding on local tasks to measure gains in specificity and trust.
Final thought
Nemotron-Personas-Korea isn’t magic; it’s a tool to ground agents in real contexts without exposing PII. If you ask the right questions, filter carefully and apply governance, you can build agents that don’t just translate words but understand norms, schedules and local expectations. Ready to try it and see how it changes your Korean users’ experience?