CyberSecQwen-4B: local model for cyber defense
CyberSecQwen-4B was born from a practical question: can cyber defense use models that run locally, preserve sensitive evidence, and still perform like larger models? The team's answer was yes, with caveats. Here I explain why it matters, how they trained it, and what it means for your SOC, your vulnerability researchers, or your critical-infrastructure team.
Why local models matter in cyber defense
Can you imagine pasting a dump of credentials or a suspicious binary into a public API? Don’t do it. In defense, the data is the vulnerability. Sending evidence to an external service can be exactly the leak you’re trying to avoid.
Also, cost per call and air-gapped environments are real constraints. A mid-size SOC processes thousands of alerts daily: outsourcing CVE explanations or mappings to CWE becomes expensive and, sometimes, impossible from isolated networks.
Finally, adversaries automate everything: from generating phishing in dozens of languages to chaining agentic tools. If defense is going to compete, it needs models you can run on your hardware, without sending secrets out.
Local isn’t just running on your laptop. It’s being able to deploy on laptops, on-prem GPUs, and partially connected environments, without sacrificing quality on the tasks that matter.
What CyberSecQwen-4B is and what it shows
CyberSecQwen-4B is a specialized 4B-parameter fine-tune of the Qwen3-4B-Instruct-2507 checkpoint. The bet: a carefully tuned 4B model can match or beat 8B models on concrete CTI (Cyber Threat Intelligence) tasks while still fitting on a consumer GPU with 12 GB of VRAM.
Key results on CTI-Bench (n=5, temp 0.3):
| Metric (CTI-Bench) | CyberSecQwen-4B | Foundation-Sec-Instruct-8B | Δ |
| --- | --- | --- | --- |
| CTI-MCQ (2,500 items) | 0.5868 ± 0.0029 | 0.4996 | +8.7 pp |
| CTI-RCM (1,000 CVE→CWE items) | 0.6664 ± 0.0023 | 0.6850 | −1.9 pp |
| Parameters | 4 B | 8 B | half the size |
In short: it keeps 97.3% of the 8B’s RCM accuracy and strongly outperforms on MCQ. For a defender choosing what to deploy, that resource/performance trade-off is what matters.
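If you want to sanity-check those headline figures yourself, the arithmetic falls straight out of the table:

```python
# Quick sanity check of the headline claims, using the numbers from the table above.
rcm_4b, rcm_8b = 0.6664, 0.6850
mcq_4b, mcq_8b = 0.5868, 0.4996

print(f"RCM retention: {rcm_4b / rcm_8b:.1%}")           # -> 97.3% of the 8B's accuracy
print(f"MCQ delta: {(mcq_4b - mcq_8b) * 100:+.1f} pp")    # -> +8.7 percentage points
```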
How it was trained (technical ingredients)
The work ran entirely on a single AMD Instinct MI300X with 192 GB HBM3 using ROCm 7 and the vLLM stack. That setup enabled full bf16 training, FlashAttention-2 forward+backward, and a 4096 sequence length without quantization or sharding tricks.
Main components and versions:
Hardware: AMD Instinct MI300X 192 GB · gfx942
ROCm: 7.0
Docker: vllm/vllm-openai-rocm:latest
PyTorch: 2.6.0 (ROCm)
flash-attn: 2.8.3
vLLM: 0.10.1
Hyperparameters and recipe details:
Base: Qwen3-4B-Instruct-2507 (fine-tuned from an instruction-tuned (IT) checkpoint, not a pure pretrain)
LoRA r = 64, alpha = 64, dropout = 0.05
LR = 5e-5 with cosine schedule, warmup ratio 0.03
Epochs = 10
Precision = bf16
Attention = FlashAttention-2 (forward + backward)
Max seq len = 4096
Batch = 4 (no accumulation)
Optimizer = paged_adamw_8bit
Step time stabilized at ~7.85 s/step on the MI300X with this recipe. FlashAttention-2 fits well because Qwen's head_dim (128) falls within the MI300X shared-memory budget.
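If you want to try the recipe on your own data, here is a minimal sketch of that configuration with peft and transformers. It is not the team's actual training script: the target modules, output directory, and dataset handling are assumptions on my part; only the hyperparameters mirror the list above.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model

base_id = "Qwen/Qwen3-4B-Instruct-2507"
tok = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(
    base_id,
    torch_dtype=torch.bfloat16,                 # full bf16, no quantization
    attn_implementation="flash_attention_2",    # FA2 forward + backward
)

# LoRA adapter as described in the recipe
lora_cfg = LoraConfig(
    r=64,
    lora_alpha=64,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption, not confirmed by the team
)
model = get_peft_model(model, lora_cfg)

args = TrainingArguments(
    output_dir="cybersecqwen-4b-lora",          # hypothetical path
    per_device_train_batch_size=4,              # batch 4, no gradient accumulation
    num_train_epochs=10,
    learning_rate=5e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    bf16=True,
    optim="paged_adamw_8bit",
    logging_steps=10,
)

# Tokenize your corpus to a max length of 4096 and hand `model`, `args`, and the
# dataset to transformers.Trainer (or trl's SFTTrainer) to run the fine-tune.
```

Point base_id at a Gemma checkpoint instead and you essentially have the portability experiment described below.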
Corpus, licenses and data cleaning
The fine-tune used two Apache-2.0 clean corpora:
CVE → CWE mappings 2021 from MITRE / NVD, deduplicated against CTI-Bench to avoid evaluation contamination.
Synthetic Q&A with defensive context, generated by a more powerful teacher and released under Apache-2.0.
Deduplicating the training set against the benchmark was key so the numbers stay honest and out-of-distribution.
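The dedup step itself is conceptually simple. Here is a hypothetical sketch; the file names and the cve_id field are assumptions, not the project's actual data layout:

```python
import json

def load_ids(path: str) -> set[str]:
    """Collect the CVE IDs present in a JSONL file."""
    with open(path) as f:
        return {json.loads(line)["cve_id"] for line in f}

# CVE IDs that appear in the evaluation benchmark must never appear in training data
bench_ids = load_ids("cti_bench_rcm.jsonl")

with open("cve_cwe_2021.jsonl") as src, open("train_dedup.jsonl", "w") as dst:
    for line in src:
        row = json.loads(line)
        if row["cve_id"] not in bench_ids:   # keep only CVEs the benchmark never sees
            dst.write(line)
```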
Portability and verifying the recipe
Is the improvement from the MI300X or the recipe? To check, they trained a sibling: Gemma4Defense-2B using the exact same recipe and corpus but based on Gemma-4-E2B-it. Results:
| Model | CTI-RCM (mean ± std) | CTI-MCQ (mean ± std) |
| --- | --- | --- |
| CyberSecQwen-4B (Qwen base) | 0.6664 ± 0.0023 | 0.5868 ± 0.0029 |
| Gemma4Defense-2B (Gemma base) | 0.6754 ± 0.0035 | 0.6042 ± 0.0090 |
Conclusion: the recipe travels. What matters is how you fine-tune the IT checkpoint, not only the model family. The choice between Qwen and Gemma may come down to licensing or deployment budget (2B vs 4B).
Quick inference example
Minimal usage to run on any 12 GB+ GPU:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "lablab-ai-amd-developer-hackathon/CyberSecQwen-4B"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

messages = [
    {"role": "system", "content": "You are a defensive cybersecurity assistant. Answer with the canonical CWE-ID first, then 1-3 sentences of justification."},
    {"role": "user", "content": "Path traversal in a Java web app where user-controlled input concatenates into a File() path. What's the CWE?"},
]

prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
out = model.generate(**tok(prompt, return_tensors="pt").to(model.device), max_new_tokens=256, do_sample=True, temperature=0.3)
print(tok.decode(out[0], skip_special_tokens=True))
```
For high-performance serving, vLLM works on MI300X with the vllm/vllm-openai-rocm image. There are pinned commands and config in the repo if you want to deploy it.
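Once the server is running, any OpenAI-compatible client can talk to it. A quick sketch, assuming a local deployment on port 8000 (the base_url and prompt are mine, not from the repo):

```python
from openai import OpenAI

# vLLM exposes an OpenAI-compatible endpoint; the URL below assumes a local server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="lablab-ai-amd-developer-hackathon/CyberSecQwen-4B",
    messages=[
        {"role": "system", "content": "You are a defensive cybersecurity assistant."},
        {"role": "user", "content": "Map CVE-2021-44228 to its canonical CWE and justify briefly."},
    ],
    temperature=0.3,
    max_tokens=256,
)
print(resp.choices[0].message.content)
```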
Limitations and responsible use
CyberSecQwen-4B is designed for defensive tasks: CWE mapping, structured CTI Q&A, and triage assistance. It’s not meant to generate exploits, automate critical decisions without human review, or provide legal or medical advice.
The team plans further work on robustness to adversarial examples and continuous evaluation as the NVD grows. A specialist is only as good as its worst input; hardening against prompt injection is a priority.
Problems encountered and practical fixes
| Problem | Fix |
| --- | --- |
| FA2 fails on Gemma-4 with head_dim=512 | Fall back to sdpa for global attention; local attention keeps using FA2. Result: ~1.6x slower vs Qwen with FA2. |
| AITER conflict when serving CyberPal-2.0-20B | Set VLLM_ROCM_USE_AITER=0 for that particular evaluation. |
| bitsandbytes not officially supported on ROCm | Not needed thanks to 192 GB HBM; used paged_adamw_8bit as the optimizer path. |
| Demo on HF Spaces with ZeroGPU quota | The demo uses HF OAuth so each visitor consumes their own free quota. |
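For the AITER workaround in the table, the variable just needs to be set before vLLM starts; a minimal sketch if you launch evaluations from Python (setting it in the launch shell works just as well):

```python
import os

# Must be set before vLLM is imported or the server is launched.
os.environ["VLLM_ROCM_USE_AITER"] = "0"
```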
What’s next and how you can participate
Team priorities:
1B variant for laptops and consumer-class deployment.
Quantized GGUF release (Q4_K_M, Q5_K_M) for edge and mobile.
Continuous evaluation against new CVE entries.
Adversarial hardening pass against prompt-injection.
If any of those points interests you, open an issue in the project repo: that is what moves priorities.
The lesson is clear: the conversation about AI and defense needs to shift from raw size to fit. You don't always need the biggest model; you need the one that best matches your operational constraints: privacy, cost, and on-prem execution. A specialist 4B that matches an 8B on the tasks that matter, runs on affordable hardware, and doesn't leak evidence outside your network is a practical win for any security team.