Nemotron 3.5: customizable multimodal safety for enterprises | Keryc
Nemotron 3.5 arrives as a practical evolution for teams that need fast, explainable moderation across languages and formats. What changes compared to previous models? Here I explain it clearly and with examples so you can decide how to integrate it into a real product.
What's new in Nemotron 3.5
Nemotron 3.5 deepens multimodal integration: the model now receives in a single context window the user's prompt, an optional image, and the assistant's possible response, and returns a coherent safety verdict for the whole interaction. Why does this matter? Because many violations only appear when you combine text and image or the request and the reply.
The most important novelty: support for custom policies at inference time. Instead of relying only on a fixed taxonomy, you can pass a policy specification in natural language and the model reasons about it before issuing its verdict. This is key for regulated environments or use cases with different risk rules (health, finance, apps for kids).
It also includes an optional THINK mode that returns auditable reasoning traces before the final verdict, and three output modes to balance latency and explainability.
Architecture and technical design (practical summary)
Base: Gemma 3 4B IT with a 128K token context window. That gives long context and improved vision-language reasoning.
Fine-tune: NVIDIA uses a LoRA adapter to install the safety-classification behavior while keeping the model compact (4B) and suitable for GPUs with 8GB+ VRAM.
Output: supports three modes: 1) low-latency binary verdict, 2) verdict + categories, 3) THINK mode with step-by-step reasoning.
Integration: compatible with transformers, vLLM, SGLang and available as NIM-optimized on build.nvidia.com. It's also on inference platforms (Baseten, Eigen AI, DeepInfra, OpenRouter, Vultr).
How THINK mode works and why it helps
THINK mode produces concise reasoning traces (usually under 3 sentences). NVIDIA generated these traces in two steps using very large models like Qwen 397B and then distilled them with Qwen 80B to ensure clarity and token efficiency.
What do you get from this? Three practical things:
Auditability: traces that justify why something was marked unsafe.
Faster human review: the moderator sees the logic, not just the label.
Policy iteration: you learn how the model interprets ambiguous phrases and can adjust the policy.
If latency is critical, you can disable THINK and keep latency equal to Nemotron 3 in binary mode.
Custom policies: examples and controls
Think of an IDE for developers: the phrase "terminate a process" should not trigger a violence category. With Nemotron 3.5 you can apply Category Suppression to avoid false positives. Another example: a bank can inject proprietary categories related to financial fraud.
The mechanics are simple: you send the policy specification as text along with the input and the model reasons about it. This makes it easy to:
Disable irrelevant categories.
Add internal risk categories.
Keep traceability in logs.
Dataset and transparency
NVIDIA releases the training dataset associated with Nemotron 3.5 under the NVIDIA Open Model License. The dataset is multimodal and multilingual, with reasoning traces used to teach the model. Key points:
99% of training images are real photos (not just generated images), which improves robustness to real-world content.
Translations to 12 languages with explicit coverage and strong zero-shot generalization to ~140 languages thanks to the Gemma 3 base.
About 10% of the data is synthetic to cover rare patterns and jailbreaks.
Releasing the dataset matters because many OSS models don't provide the training/evaluation sets, especially on the multimodal side.
Performance and relevant metrics
Average on the multimodal benchmark suite: around 85% accuracy in harmful-content classification.
In latency comparisons, Nemotron 3.5 achieves up to 3x lower end-to-end latency versus a cited multimodal alternative.
When THINK is active, the model generates up to 50% fewer tokens in reasoning compared to another reasoning model, which reduces costs and latency.
Remember: accuracy alone isn't enough. Efficiency and the ability to run repeated checks in production are what make global, continuous moderation possible.
Practical integration into production pipelines
Suggested minimal architecture to run it today:
Synchronous, low-latency: use binary mode (mode 1) on the critical path to block or allow content in real time.
Asynchronous, audit trail: run THINK mode in parallel to store traces in an audit system or to feed human reviewers.
Dynamic policy: keep policy templates per product (for example, child education, finance, health) and pass them on each inference.
Specialized moderator: train small moderator models using the traces as a signal to automate parts of human review.
Need to operate on limited GPUs? The 4B version with LoRA makes deployment possible on infra with 8GB+ VRAM.
Evaluation and known limits
NVIDIA evaluated Nemotron 3.5 across many benchmarks (VLGuard, MM-SafetyBench, PolyGuard, RTP-LX, Aegis and others). Even so, the multimodal field faces challenges:
Many benchmarks are text-only; you can't infer multimodal performance from them.
Much of the public multimodal datasets use AI-generated images; that underestimates the complexity of real content.
Photo licensing limits redistribution, so public sets remain partial.
Nemotron 3.5 advances these issues by training with real photos and publishing a subset of the dataset, but the community still needs better multimodal benchmarks with open licenses.
Practical tips before you deploy
Define policies per product and keep them in natural language: the model consumes them directly.
Use Category Suppression to reduce false positives in technical domains.
Plan THINK mode as an audit component rather than a real-time blocker if latency is critical.
Use the traces to build internal datasets of edge cases and iteratively improve policy.
Final reflection
Nemotron 3.5 isn't just another safety model; it's a package designed for companies that need multimodal, multilingual, customizable moderation with traceability. If your product operates globally or has regulatory requirements, here's an efficient option that combines performance and explainability without demanding huge infra.