MedQA: fine-tune clinical AI on AMD ROCm without CUDA | Keryc
MedQA shows that you can train and deploy a clinical AI that answers exam-style questions with clinical explanations using only AMD hardware and ROCm. With a GPU like the MI300X, you need neither CUDA nor quantization tricks.
What is MedQA and why it matters
MedQA is a LoRA adapter fine-tuned on Qwen3-1.7B to answer multiple-choice medical questions and, importantly, to justify each answer with clinical reasoning. The goal isn't to replace a medical diagnosis, but to provide answers with explanations that are more useful and verifiable than a single letter with no context.
Here are three key reasons this matters for technical and clinical teams:
The output includes both the correct letter and a clinical explanation, which helps auditing and verification.
The adapter was trained and exported entirely on AMD hardware using ROCm, with zero CUDA dependencies.
Using LoRA keeps tuning efficient: only ~2.2 million parameters are trainable, versus ~1.5B in the base model.
Hardware: why the AMD Instinct MI300X changes the game
The MI300X offers 192 GB of HBM3 on a single card. For LLM fine-tuning, memory is often the limiting factor: it determines batch size, sequence length, and whether you need to quantize.
With 192 GB you don't need 4-bit or 8-bit quantization. That translates to a cleaner pipeline and less risk of quantization artifacts. In this project we trained Qwen3-1.7B in fp16 with LoRA and it took roughly 5 minutes on the MI300X for 2,000 examples.
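To make the memory argument concrete, here is a back-of-the-envelope sketch of how much space the model weights alone would occupy at different precisions. The nominal parameter count and the bytes-per-parameter table are illustrative assumptions, and the estimate ignores activations, gradients, and optimizer state, so treat it as a rough lower bound rather than a measurement from the project.

```python
# Rough weight-memory estimate. Parameter count and byte sizes are
# illustrative assumptions; activations and optimizer state are ignored.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}

BASE_PARAMS = 1.7e9  # nominal size of Qwen3-1.7B
HBM_GB = 192         # MI300X memory quoted in the article


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Return the memory needed for the weights alone, in decimal GB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in ("fp32", "fp16", "int4"):
    gb = weight_memory_gb(BASE_PARAMS, dtype)
    print(f"{dtype}: {gb:.2f} GB of weights ({gb / HBM_GB:.1%} of {HBM_GB} GB)")
```

Even at full fp32 precision the weights use only a few percent of the card, which is why quantization buys little for a model of this size on this hardware.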
To replicate this on your own ROCm machine, three environment variables (ROCR_VISIBLE_DEVICES, HIP_VISIBLE_DEVICES, and HSA_OVERRIDE_GFX_VERSION) were enough to make the same code that runs on CUDA work on ROCm.
No code changes, custom kernels, or compatibility shims were required.
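The article's lessons-learned section names the three variables as ROCR_VISIBLE_DEVICES, HIP_VISIBLE_DEVICES, and HSA_OVERRIDE_GFX_VERSION, but the values used are not given. The sketch below sets them from Python before any GPU library is imported. The values shown are assumptions for a single-GPU MI300X host (the gfx version string in particular), and `apply_rocm_env` is a hypothetical helper, not part of the project.

```python
import os

# Assumed values for a single-GPU host; the source does not state them.
ROCM_ENV = {
    "ROCR_VISIBLE_DEVICES": "0",
    "HIP_VISIBLE_DEVICES": "0",
    "HSA_OVERRIDE_GFX_VERSION": "9.4.2",  # assumption: MI300X reports gfx942
}


def apply_rocm_env(env=ROCM_ENV):
    """Set each variable unless it is already exported, then return the result."""
    for name, value in env.items():
        os.environ.setdefault(name, value)
    return {name: os.environ[name] for name in env}


if __name__ == "__main__":
    # Call this before importing torch so the runtime sees the settings.
    print(apply_rocm_env())
```

On ROCm builds of PyTorch the GPU is still addressed through the `torch.cuda` namespace, which is why existing CUDA-targeted training scripts can run unchanged once the device is visible.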
Technical pipeline: base model, LoRA, and training parameters
Quick stack summary:
Base model: Qwen3-1.7B (capable and relatively compact at 1.7B parameters).
Adaptation: LoRA via PEFT to inject low-rank matrices into attention layers.
Frameworks: Transformers, PEFT, TRL, Accelerate on top of PyTorch + ROCm 6.1.
LoRA configuration example (conceptual):
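from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

# Load the base model that the adapter wraps (assumed Hub id).
model = AutoModelForCausalLM.from_pretrained('Qwen/Qwen3-1.7B')

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=16,                        # scaling factor (alpha / r = 2)
    lora_dropout=0.05,
    target_modules=['q_proj', 'v_proj'],  # attention query and value projections
    bias='none',                          # leave bias terms frozen
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # the article reports ~2.2M of ~1.5B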
Relevant training parameters:
fp16=True (bfloat16 produced NaNs in initial tests)
gradient_checkpointing=True to save memory
per_device_train_batch_size=4 with gradient_accumulation_steps=4 => effective batch 16
optim='adamw_torch', learning_rate=2e-4, lr_scheduler_type='cosine' with warmup_ratio=0.05
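The arithmetic behind these settings can be sketched in plain Python. The snippet derives the effective batch size, the number of optimizer steps for the 2,000-example slice, and a cosine schedule with linear warmup. It assumes a single GPU and a single epoch (the source states neither), and `lr_at` is an illustrative reimplementation of the schedule shape, not the Trainer's own code.

```python
import math

NUM_EXAMPLES = 2000
PER_DEVICE_BATCH = 4
GRAD_ACCUM_STEPS = 4
PEAK_LR = 2e-4
WARMUP_RATIO = 0.05
NUM_EPOCHS = 1  # assumption: the epoch count is not given in the source

# One optimizer step consumes batch_size * accumulation_steps examples.
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS  # 16
total_steps = math.ceil(NUM_EXAMPLES / effective_batch) * NUM_EPOCHS
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)


def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to zero."""
    if step < warmup_steps:
        return PEAK_LR * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(f"effective batch={effective_batch}, steps={total_steps}, warmup={warmup_steps}")
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(f"step {step:3d}: lr={lr_at(step):.2e}")
```

Under these assumptions the whole run is only 125 optimizer steps, which is consistent with a training time measured in minutes.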
The dataset used was a slice of MedMCQA: 2,000 examples (question, options A-D, correct label and optional explanation). The idea was to show that a small slice can produce practical, explainable improvements in minutes.
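One way to turn such a record into a training pair is sketched below. The column names (`question`, `opa` to `opd`, `cop`, `exp`) are assumed to follow the public MedMCQA layout, the prompt template is a guess rather than the authors' actual format, and the sample row is made up for illustration instead of being taken from the dataset.

```python
OPTION_KEYS = ("opa", "opb", "opc", "opd")  # assumed MedMCQA column names
LETTERS = "ABCD"


def format_example(row: dict) -> dict:
    """Build a prompt/completion pair from one multiple-choice record."""
    lines = [f"Question: {row['question']}"]
    for letter, key in zip(LETTERS, OPTION_KEYS):
        lines.append(f"{letter}. {row[key]}")
    prompt = "\n".join(lines) + "\nAnswer:"

    # 'cop' is assumed to be the 0-based index of the correct option.
    completion = f" {LETTERS[row['cop']]}"
    explanation = (row.get("exp") or "").strip()
    if explanation:  # the explanation is optional
        completion += f"\nExplanation: {explanation}"
    return {"prompt": prompt, "completion": completion}


# Hypothetical row, written for this example only.
sample = {
    "question": "Which vitamin deficiency causes scurvy?",
    "opa": "Vitamin A",
    "opb": "Vitamin B12",
    "opc": "Vitamin C",
    "opd": "Vitamin D",
    "cop": 2,
    "exp": "Scurvy results from a lack of vitamin C, which is needed for collagen synthesis.",
}

pair = format_example(sample)
print(pair["prompt"] + pair["completion"])
```

Keeping the letter first and the explanation second in the completion mirrors the stated goal of returning both the answer and its clinical justification.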
Inference and deployment
Inference flow in brief:
Load tokenizer and base model.
Attach the LoRA adapter with PeftModel.from_pretrained.
Generate with greedy decoding and repetition_penalty to avoid loops.
You can download the adapter from the HuggingFace Hub and, if you want a single standalone checkpoint, merge it into the base model. The adapter itself takes up a few megabytes, not gigabytes; a merged checkpoint is the size of the base model.
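The inference steps above can be sketched as follows. The base model id is an assumed Hub identifier, `ADAPTER_ID` is a hypothetical placeholder (the article does not give the adapter's repository name), and the `max_new_tokens` and `repetition_penalty` values are guesses. Heavy imports live inside the functions so the module can be imported without a GPU stack installed.

```python
BASE_MODEL_ID = "Qwen/Qwen3-1.7B"   # assumed Hub id for the base model
ADAPTER_ID = "your-org/medqa-lora"  # hypothetical placeholder, not the real repo

GENERATION_KWARGS = {
    "max_new_tokens": 256,      # assumed value
    "do_sample": False,         # greedy decoding
    "repetition_penalty": 1.1,  # assumed value; discourages loops
}


def load_model(merge: bool = False):
    """Load tokenizer and base model, attach the adapter, optionally merge it."""
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_ID)
    base = AutoModelForCausalLM.from_pretrained(BASE_MODEL_ID, torch_dtype=torch.float16)
    model = PeftModel.from_pretrained(base, ADAPTER_ID)
    if merge:
        # Fold the LoRA weights into the base weights for a standalone model.
        model = model.merge_and_unload()
    # ROCm builds of PyTorch expose the GPU through the "cuda" device name.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    return tokenizer, model.to(device).eval()


def answer(tokenizer, model, prompt: str) -> str:
    """Generate a completion and return only the newly produced text."""
    import torch

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(**inputs, **GENERATION_KWARGS)
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

A typical call would be `tokenizer, model = load_model()` followed by `answer(tokenizer, model, prompt)`; passing `merge=True` trades the small adapter file for a single full-size model.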
Results, metrics and lessons learned
Trainable params: ~2.2M (0.15% of the total).
Training time on MI300X: ~5 minutes for 2,000 examples.
Lessons learned (problem, cause, fix):
Running CUDA-targeted code on ROCm. Fix: set ROCR_VISIBLE_DEVICES, HIP_VISIBLE_DEVICES, and HSA_OVERRIDE_GFX_VERSION.
bitsandbytes doesn't work. Cause: no ROCm build. Fix: avoid quantization and rely on the MI300X's memory.
Garbage output at inference. Cause: misconfigured padding. Fix: set pad_token = eos_token and fix padding_side.
Trainer errors. Cause: mismatched Transformers versions. Fix: require transformers>=4.40.0.
Note: the lack of bitsandbytes support on ROCm is real, but with 192 GB of HBM3 this wasn't a problem for this experiment. That simplifies the pipeline.
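The padding fix from the list above can be sketched as a small helper. `configure_padding` is a hypothetical function that works on any object with `pad_token`, `eos_token`, and `padding_side` attributes, as Hugging Face tokenizers have. Choosing right padding for training and left padding for batched generation is a common convention for decoder-only models and is an assumption here, since the source does not say which side it used.

```python
from types import SimpleNamespace


def configure_padding(tokenizer, for_generation: bool = False):
    """Reuse EOS as the pad token when none is set and pick a padding side."""
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    # Assumed convention: pad on the right while training, on the left when
    # generating in batches so new tokens directly follow the prompt.
    tokenizer.padding_side = "left" if for_generation else "right"
    return tokenizer


# Stand-in object for demonstration; a real tokenizer would be passed instead.
stub = SimpleNamespace(pad_token=None, eos_token="</s>", padding_side="right")
configure_padding(stub, for_generation=True)
print(stub.pad_token, stub.padding_side)
```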
What's next: scaling and robustness
The authors suggest natural next steps to take this further:
Train on the full MedMCQA corpus (~180k questions) and add PubMedQA.
Add confidence calibration to report certainty estimates alongside the answer.
Integrate RAG to ground answers in medical literature in real time.
Build an evaluation harness with real test splits to measure out-of-sample gains.
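As a starting point for the evaluation harness, the sketch below scores generated answers against gold letters. It is a minimal illustration, not the authors' tooling: `extract_letter` uses a naive rule (the first standalone A to D in the output), which would misfire on text that opens with a word such as "A patient", so a real harness would need a stricter output format.

```python
import re

# Naive assumption: the first standalone capital A-D is the chosen option.
ANSWER_PATTERN = re.compile(r"\b([ABCD])\b")


def extract_letter(generated: str):
    """Return the predicted option letter, or None if none is found."""
    match = ANSWER_PATTERN.search(generated)
    return match.group(1) if match else None


def accuracy(predictions, gold_letters) -> float:
    """Fraction of outputs whose extracted letter matches the gold label."""
    if not gold_letters:
        raise ValueError("gold_letters must not be empty")
    if len(predictions) != len(gold_letters):
        raise ValueError("predictions and gold_letters differ in length")
    correct = sum(extract_letter(p) == g for p, g in zip(predictions, gold_letters))
    return correct / len(gold_letters)


# Hypothetical model outputs, written for this example only.
outputs = [" C\nExplanation: needed for collagen synthesis.", "Answer: D", "unsure"]
print(accuracy(outputs, ["C", "A", "C"]))
```

Running the same function on a held-out test split, before and after attaching the adapter, would give the out-of-sample comparison the list calls for.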
Final reflection
MedQA shows that the 'CUDA-only' barrier can be broken. If you have AMD ROCm hardware, the HuggingFace ecosystem works with few adaptations, and the large memory of the MI300X removes many engineering trade-offs. For medical projects, putting explanations ahead of bare labels is a useful reminder that transparency matters as much as accuracy.
Summary: MedQA is a LoRA adapter for Qwen3-1.7B, fine-tuned on an AMD Instinct MI300X with ROCm, without CUDA dependencies or quantization. The project shows you can achieve efficient, explainable clinical fine-tuning in minutes by leveraging 192 GB of HBM3 and HuggingFace's ROCm compatibility.