MedQA shows you can train and deploy a clinical AI that answers exam-style questions with clinical explanations using only AMD hardware and ROCm. The surprise: you don't need CUDA or magical quantization tricks when you have a GPU like the MI300X.
What is MedQA and why it matters
MedQA is a LoRA adapter fine-tuned on Qwen3-1.7B to answer multiple-choice medical questions and, importantly, to justify each answer with clinical reasoning. The goal isn't to replace medical diagnosis, but to provide answers with explanations that are more useful and verifiable than a single letter with no context.
Here are three key reasons this matters for technical and clinical teams:
- The output includes both the correct letter and a clinical explanation, which helps auditing and verification.
- The adapter was trained and exported entirely on AMD hardware using ROCm, with zero CUDA dependencies.
- Using LoRA keeps tuning efficient: only ~2.2 million trainable parameters versus the base model's ~1.5B.
Hardware: why the AMD Instinct MI300X changes the game
The MI300X offers 192 GB of HBM3 on a single card. For LLM fine-tuning, memory is often the limiting factor: it determines batch size, sequence length, and whether you need to quantize.
With 192 GB you don't need 4-bit or 8-bit quantization. That translates to a cleaner pipeline and less risk of quantization artifacts. In this project we trained Qwen3-1.7B in fp16 with LoRA and it took roughly 5 minutes on the MI300X for 2,000 examples.
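As a rough, back-of-envelope illustration of why 192 GB is comfortable here (my own arithmetic, not figures from the post): the fp16 weights of a 1.7B-parameter model fit in a few gigabytes, and with LoRA the optimizer state shrinks to almost nothing:

# Rough memory math in GB; ignores activations, gradients and framework overhead.
params = 1.7e9                               # base model parameters
fp16_weights = params * 2 / 1e9              # ~3.4 GB to hold the frozen model in fp16
lora_params = 2.2e6                          # trainable LoRA parameters
adamw_states = lora_params * 2 * 4 / 1e9     # two fp32 moments per trainable param: ~0.02 GB
print(f"weights ~{fp16_weights:.1f} GB, AdamW states ~{adamw_states:.3f} GB")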
If you want to replicate this on your own ROCm machine, these three environment variables were enough to make the same code that runs on CUDA work on ROCm:
import os

# Point both the ROCr and HIP runtimes at the first GPU
os.environ['ROCR_VISIBLE_DEVICES'] = '0'
os.environ['HIP_VISIBLE_DEVICES'] = '0'
# MI300X is gfx942, hence the 9.4.2 override string
os.environ['HSA_OVERRIDE_GFX_VERSION'] = '9.4.2'
No code changes, custom kernels, or compatibility shims were required.
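A quick sanity check that the ROCm build of PyTorch actually sees the card (ROCm reuses the torch.cuda API, so no special calls are needed):

import torch

print(torch.cuda.is_available())       # True on a working ROCm install
print(torch.cuda.get_device_name(0))   # should report the MI300X
print(torch.version.hip)               # HIP/ROCm version string; None on CUDA builds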
Pipeline technicals: base model, LoRA and training params
Quick stack summary:
- Base model: Qwen3-1.7B (capable and relatively compact at 1.7B parameters).
- Adaptation: LoRA via PEFT to inject low-rank matrices into attention layers.
- Frameworks: Transformers, PEFT, TRL, Accelerate on top of PyTorch + ROCm 6.1.
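For orientation, a minimal sketch of loading the base model and tokenizer in fp16 on this stack; the Hub id Qwen/Qwen3-1.7B is an assumption on my part, and note there is no quantization config anywhere:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = 'Qwen/Qwen3-1.7B'  # assumed Hub id for the base model
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(
    base_id,
    torch_dtype=torch.float16,  # plain fp16, no 4-/8-bit quantization needed
    device_map='auto',
)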
LoRA configuration example (conceptual):
from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=['q_proj', 'v_proj'],  # low-rank matrices injected into attention projections
    bias='none',
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # trainable params ~2.2M of ~1.5B
Relevant training parameters:
- fp16=True (bfloat16 produced NaNs in initial tests)
- gradient_checkpointing=True to save memory
- per_device_train_batch_size=4 with gradient_accumulation_steps=4 => effective batch 16
- optim='adamw_torch', lr=2e-4, cosine scheduler with warmup_ratio=0.05
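As a hedged sketch of how those settings map onto transformers.TrainingArguments (the actual run drove training through TRL on top of this; output_dir, epoch count and logging cadence are my assumptions):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir='medqa-lora',             # hypothetical output path
    fp16=True,                           # bfloat16 produced NaNs in early tests
    gradient_checkpointing=True,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,       # effective batch size 16
    optim='adamw_torch',
    learning_rate=2e-4,
    lr_scheduler_type='cosine',
    warmup_ratio=0.05,
    num_train_epochs=1,                  # assumption, not stated in the post
    logging_steps=10,                    # assumption
)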
The dataset used was a slice of MedMCQA: 2,000 examples (question, options A-D, correct label and optional explanation). The idea was to show that a small slice can produce practical, explainable improvements in minutes.
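To make the data concrete, here is one plausible way to turn MedMCQA rows into training prompts; the field names (question, opa-opd, cop, exp) follow the Hub copy of the dataset, and the prompt template itself is my assumption, not the project's:

from datasets import load_dataset

# Take a 2,000-example slice of MedMCQA from the Hub
ds = load_dataset('openlifescienceai/medmcqa', split='train[:2000]')

def to_prompt(ex):
    letters = ['A', 'B', 'C', 'D']
    options = '\n'.join(f'{l}. {ex[k]}' for l, k in zip(letters, ['opa', 'opb', 'opc', 'opd']))
    answer = letters[ex['cop']]       # cop is the index of the correct option
    explanation = ex['exp'] or ''     # explanation is optional in the dataset
    return {'text': f"Question: {ex['question']}\n{options}\nAnswer: {answer}. {explanation}"}

train_ds = ds.map(to_prompt)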
Inference and deployment
Inference flow in brief:
- Load the tokenizer and base model.
- Attach the LoRA adapter with PeftModel.from_pretrained.
- Generate with greedy decoding and repetition_penalty to avoid loops.
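A minimal sketch of the first two steps, assuming hypothetical Hub ids (swap in the real adapter repo); the pad_token line anticipates the padding issue listed in the table below:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = 'Qwen/Qwen3-1.7B'                 # assumed base model id
adapter_id = 'your-org/medqa-lora-adapter'  # hypothetical adapter repo

tokenizer = AutoTokenizer.from_pretrained(base_id)
tokenizer.pad_token = tokenizer.eos_token
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.float16, device_map='auto')
model = PeftModel.from_pretrained(base, adapter_id)
model.eval()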
Conceptual generation example:
inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=200,
        do_sample=False,
        repetition_penalty=1.1,
        eos_token_id=tokenizer.eos_token_id,
    )
print(tokenizer.decode(output[0], skip_special_tokens=True))
You can download the adapter from the HuggingFace Hub and merge it with the base model if you want a single lightweight checkpoint. The adapter takes up a few megabytes, not gigabytes.
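Merging is a one-liner with PEFT; a brief sketch (the output directory name is hypothetical):

# Fold the LoRA weights into the base model and save a standalone checkpoint
merged = model.merge_and_unload()
merged.save_pretrained('medqa-merged')
tokenizer.save_pretrained('medqa-merged')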
Results, metrics and lessons learned
- Trainable params: ~2.2M (0.15% of the total).
- Training time on MI300X: ~5 minutes for 2,000 examples.
- Dataset used: 2,000 examples from MedMCQA.
- Baseline MedMCQA accuracy reported: ~45% (dataset reference).
Common issues and how they were solved:
| Problem | Cause | Solution |
|---|---|---|
| NaN loss | Mixed precision instability | Switch from bfloat16 to fp16 |
| GPU not detected | Missing ROCm environment variables | Set ROCR_VISIBLE_DEVICES, HIP_VISIBLE_DEVICES, HSA_OVERRIDE_GFX_VERSION |
| bitsandbytes doesn't work | No ROCm build | Avoid quantization, use MI300X memory |
| Garbage output at inference | Padding misconfigured | pad_token = eos_token and fix padding_side |
| Trainer errors | Mismatched Transformers versions | Pin transformers>=4.40.0 |
Note: the lack of bitsandbytes support on ROCm is real, but with 192 GB of HBM3 this wasn't a problem for this experiment. That simplifies the pipeline.
What's next: scaling and robustness
The authors suggest natural next steps to take this further:
- Train on the full MedMCQA corpus (~180k questions) and add PubMedQA.
- Add confidence calibration to report certainty estimates alongside the answer.
- Integrate RAG to ground answers in medical literature in real time.
- Build an evaluation harness with real test splits to measure out-of-sample gains.
Final reflection
MedQA shows the technical barrier of 'CUDA-only' can be broken. If you have AMD ROCm hardware, the HuggingFace ecosystem works with few adaptations, and the large memory of the MI300X removes many engineering trade-offs. For medical projects, prioritizing explanations over just labels is a helpful reminder: transparency matters as much as accuracy.
Original source
https://huggingface.co/blog/lablab-ai-amd-developer-hackathon/medqa
Summary: MedQA is a LoRA adapter fine-tuned on Qwen3-1.7B, trained on an AMD Instinct MI300X with ROCm, without CUDA dependencies or quantization. The project shows you can achieve efficient, explainable clinical fine-tuning in minutes by leveraging 192 GB of HBM3 and HuggingFace's ROCm compatibility.
