MedQA shows you can train and deploy a clinical AI that answers exam-style questions with clinical explanations using only AMD hardware and ROCm. The surprise: you don't need CUDA or magical quantization tricks when you have a GPU like the MI300X.
What is MedQA and why it matters
MedQA is a LoRA adapter fine-tuned on Qwen3-1.7B to answer multiple-choice medical questions and, importantly, to justify each answer with clinical reasoning. The goal isn't to replace medical diagnosis, but to provide answers with explanations that are more useful and verifiable than a single letter with no context.
Here are three key reasons this matters for technical and clinical teams:
- The output includes both the correct letter and a clinical explanation, which helps auditing and verification.
- The adapter was trained and exported entirely on AMD hardware using ROCm, with zero CUDA dependencies.
- Using LoRA keeps tuning efficient: only ~2.2 million trainable parameters versus the base model's ~1.5B.
Hardware: why the AMD Instinct MI300X changes the game
The MI300X offers 192 GB of HBM3 on a single card. For LLM fine-tuning, memory is often the limiting factor: it determines batch size, sequence length, and whether you need to quantize.
With 192 GB you don't need 4-bit or 8-bit quantization. That translates to a cleaner pipeline and less risk of quantization artifacts. In this project we trained Qwen3-1.7B in fp16 with LoRA and it took roughly 5 minutes on the MI300X for 2,000 examples.
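As a rough, back-of-envelope illustration of why 192 GB is comfortable here (my own arithmetic, not figures from the post): the fp16 weights of a 1.7B-parameter model fit in a few gigabytes, and with LoRA the optimizer state shrinks to almost nothing:

# Rough memory math in GB; ignores activations, gradients and framework overhead.
params = 1.7e9                               # base model parameters
fp16_weights = params * 2 / 1e9              # ~3.4 GB to hold the frozen model in fp16
lora_params = 2.2e6                          # trainable LoRA parameters
adamw_states = lora_params * 2 * 4 / 1e9     # two fp32 moments per trainable param: ~0.02 GB
print(f"weights ~{fp16_weights:.1f} GB, AdamW states ~{adamw_states:.3f} GB")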
If you want to replicate this on your own ROCm machine, these three environment variables were enough to make the same code that runs on CUDA work on ROCm:
import os

# Point both the ROCr and HIP runtimes at the first GPU
os.environ['ROCR_VISIBLE_DEVICES'] = '0'
os.environ['HIP_VISIBLE_DEVICES'] = '0'
# MI300X is gfx942, hence the 9.4.2 override string
os.environ['HSA_OVERRIDE_GFX_VERSION'] = '9.4.2'
No code changes, custom kernels, or compatibility shims were required.
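A quick sanity check that the ROCm build of PyTorch actually sees the card (ROCm reuses the torch.cuda API, so no special calls are needed):

import torch

print(torch.cuda.is_available())       # True on a working ROCm install
print(torch.cuda.get_device_name(0))   # should report the MI300X
print(torch.version.hip)               # HIP/ROCm version string; None on CUDA builds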
Pipeline technicals: base model, LoRA and training params
Quick stack summary:
- Base model: Qwen3-1.7B (capable and relatively compact at 1.7B parameters).
- Adaptation: LoRA via PEFT to inject low-rank matrices into attention layers.
- Frameworks: Transformers, PEFT, TRL, Accelerate on top of PyTorch + ROCm 6.1.
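For orientation, a minimal sketch of loading the base model and tokenizer in fp16 on this stack; the Hub id Qwen/Qwen3-1.7B is an assumption on my part, and note there is no quantization config anywhere:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = 'Qwen/Qwen3-1.7B'  # assumed Hub id for the base model
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(
    base_id,
    torch_dtype=torch.float16,  # plain fp16, no 4-/8-bit quantization needed
    device_map='auto',
)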
LoRA configuration example (conceptual):
from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=['q_proj', 'v_proj'],  # low-rank matrices injected into attention projections
    bias='none',
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # trainable params ~2.2M of ~1.5B
Relevant training parameters:
- fp16=True (bfloat16 produced NaNs in initial tests)
- gradient_checkpointing=True to save memory
- per_device_train_batch_size=4 with gradient_accumulation_steps=4 => effective batch 16
- optim='adamw_torch', lr=2e-4, cosine scheduler with warmup_ratio=0.05
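As a hedged sketch of how those settings map onto transformers.TrainingArguments (the actual run drove training through TRL on top of this; output_dir, epoch count and logging cadence are my assumptions):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir='medqa-lora',             # hypothetical output path
    fp16=True,                           # bfloat16 produced NaNs in early tests
    gradient_checkpointing=True,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,       # effective batch size 16
    optim='adamw_torch',
    learning_rate=2e-4,
    lr_scheduler_type='cosine',
    warmup_ratio=0.05,
    num_train_epochs=1,                  # assumption, not stated in the post
    logging_steps=10,                    # assumption
)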
The dataset used was a slice of MedMCQA: 2,000 examples (question, options A-D, correct label and optional explanation). The idea was to show that a small slice can produce practical, explainable improvements in minutes.
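To make the data concrete, here is one plausible way to turn MedMCQA rows into training prompts; the field names (question, opa-opd, cop, exp) follow the Hub copy of the dataset, and the prompt template itself is my assumption, not the project's:

from datasets import load_dataset

# Take a 2,000-example slice of MedMCQA from the Hub
ds = load_dataset('openlifescienceai/medmcqa', split='train[:2000]')

def to_prompt(ex):
    letters = ['A', 'B', 'C', 'D']
    options = '\n'.join(f'{l}. {ex[k]}' for l, k in zip(letters, ['opa', 'opb', 'opc', 'opd']))
    answer = letters[ex['cop']]       # cop is the index of the correct option
    explanation = ex['exp'] or ''     # explanation is optional in the dataset
    return {'text': f"Question: {ex['question']}\n{options}\nAnswer: {answer}. {explanation}"}

train_ds = ds.map(to_prompt)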
Inference and deployment
Inference flow in brief:
- Load the tokenizer and base model.
- Attach the LoRA adapter with PeftModel.from_pretrained.
- Generate with greedy decoding and repetition_penalty to avoid loops.
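A minimal sketch of the first two steps, assuming hypothetical Hub ids (swap in the real adapter repo); the pad_token line anticipates the padding issue listed in the table below:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = 'Qwen/Qwen3-1.7B'                 # assumed base model id
adapter_id = 'your-org/medqa-lora-adapter'  # hypothetical adapter repo

tokenizer = AutoTokenizer.from_pretrained(base_id)
tokenizer.pad_token = tokenizer.eos_token
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.float16, device_map='auto')
model = PeftModel.from_pretrained(base, adapter_id)
model.eval()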
Conceptual generation example:
inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=200,
        do_sample=False,
        repetition_penalty=1.1,
        eos_token_id=tokenizer.eos_token_id,
    )
print(tokenizer.decode(output[0], skip_special_tokens=True))
You can download the adapter from the HuggingFace Hub and merge it with the base model if you want a single lightweight checkpoint. The adapter takes up a few megabytes, not gigabytes.
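Merging is a one-liner with PEFT; a brief sketch (the output directory name is hypothetical):

# Fold the LoRA weights into the base model and save a standalone checkpoint
merged = model.merge_and_unload()
merged.save_pretrained('medqa-merged')
tokenizer.save_pretrained('medqa-merged')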
Results, metrics and lessons learned
- Trainable params: ~2.2M (0.15% of the total).
- Training time on MI300X: ~5 minutes for 2,000 examples.
- Dataset used: 2,000 examples from MedMCQA.
- Baseline MedMCQA accuracy reported: ~45% (dataset reference).
Common issues and how they were solved:
| Problem | Cause | Solution |
|---|---|---|
| NaN loss | Mixed precision instability | Switch from bfloat16 to fp16 |
| GPU not detected | Missing ROCm environment variables | Set ROCR_VISIBLE_DEVICES, HIP_VISIBLE_DEVICES, HSA_OVERRIDE_GFX_VERSION |
| bitsandbytes doesn't work | No ROCm build | Avoid quantization, use MI300X memory |
| Garbage output at inference | Padding misconfigured | pad_token = eos_token and fix padding_side |
| Trainer errors | Mismatched Transformers versions | Pin transformers>=4.40.0 |
Note: the lack of bitsandbytes support on ROCm is real, but with 192 GB of HBM3 this wasn't a problem for this experiment. That simplifies the pipeline.
What's next: scaling and robustness
The authors suggest natural next steps to take this further:
- Train on the full MedMCQA corpus (~180k questions) and add PubMedQA.
- Add confidence calibration to report certainty estimates alongside the answer.
- Integrate RAG to ground answers in medical literature in real time.
- Build an evaluation harness with real test splits to measure out-of-sample gains.
Final reflection
MedQA shows the technical barrier of 'CUDA-only' can be broken. If you have AMD ROCm hardware, the HuggingFace ecosystem works with few adaptations, and the large memory of the MI300X removes many engineering trade-offs. For medical projects, prioritizing explanations over just labels is a helpful reminder: transparency matters as much as accuracy.
Original source
https://huggingface.co/blog/lablab-ai-amd-developer-hackathon/medqa
Summary: MedQA is a LoRA adapter fine-tuned on Qwen3-1.7B, trained on an AMD Instinct MI300X with ROCm, without CUDA dependencies or quantization. The project shows you can achieve efficient, explainable clinical fine-tuning in minutes by leveraging 192 GB of HBM3 and HuggingFace's ROCm compatibility.
