AiPhreaks

MedQA: Fine-Tuning a Clinical AI on AMD ROCm — No CUDA Required

By Jakub Antkiewicz

2026-05-08T09:20:08Z

A new project dubbed MedQA demonstrates a complete fine-tuning workflow for a clinical language model using AMD hardware and its ROCm software stack, operating entirely without NVIDIA's CUDA platform. The effort, undertaken for the AMD Developer Hackathon, shows that the mainstream HuggingFace AI development ecosystem can run on AMD accelerators with minimal configuration changes. This serves as a critical proof point for developers seeking viable alternatives to NVIDIA's dominant hardware and software ecosystem for common AI tasks like LoRA fine-tuning.

Technical Breakdown: Training on the MI300X

The team fine-tuned Alibaba's Qwen3-1.7B model on the MedMCQA dataset using an AMD Instinct MI300X accelerator. A key advantage highlighted by the project was the GPU's 192 GB of HBM3 memory, which allowed the team to train the model in full fp16 precision without resorting to the memory-saving quantization techniques often required on consumer and enterprise NVIDIA GPUs. The training code, originally written for CUDA, ran on ROCm without modification; only three environment variables had to be set for hardware detection.
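The article reports that only three environment variables were needed for hardware detection but does not name them. As an illustration only, a minimal setup might look like the sketch below; the specific variables and values are assumptions drawn from common ROCm and HuggingFace settings, not from the project's code:

```python
import os

# Hypothetical setup: these are common ROCm/HuggingFace settings and are
# NOT confirmed as the three variables the project actually used.
os.environ["HIP_VISIBLE_DEVICES"] = "0"  # expose one MI300X (ROCm analogue of CUDA_VISIBLE_DEVICES)
os.environ["PYTORCH_HIP_ALLOC_CONF"] = "expandable_segments:True"  # allocator tuning on ROCm builds of PyTorch
os.environ["TOKENIZERS_PARALLELISM"] = "false"  # silence HuggingFace tokenizers fork warnings

print(os.environ["HIP_VISIBLE_DEVICES"])
```

Because ROCm builds of PyTorch expose AMD GPUs through the standard `torch.cuda` interface, existing HuggingFace scripts that request `device = "cuda"` resolve to the MI300X without edits, which is why CUDA-targeted training code can run unchanged.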

  • Hardware: AMD Instinct MI300X with 192 GB HBM3 VRAM.
  • Technique: LoRA fine-tuning with the HuggingFace PEFT library, training only ~2.2M parameters.
  • Compatibility: Standard HuggingFace libraries (Transformers, TRL, Accelerate) ran without code changes.
  • Performance: Training on a 2,000-sample dataset was completed in approximately 5 minutes.
  • Key Challenge Overcome: The project sidestepped the `bitsandbytes` quantization dependency, a library that currently lacks a robust ROCm build, by leveraging the MI300X's extensive memory.
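The ~2.2M trainable-parameter figure follows from LoRA's structure: each adapted weight matrix of shape d_out x d_in gains two small factors, A (r x d_in) and B (d_out x r), contributing r * (d_in + d_out) trainable parameters. A back-of-the-envelope check, using assumed Qwen3-1.7B-like dimensions and an assumed rank (the article states neither):

```python
def lora_trainable_params(num_layers: int, targets: list[tuple[int, int]], r: int) -> int:
    """Count LoRA trainable parameters.

    Each adapted weight of shape (d_out, d_in) adds A (r x d_in) and
    B (d_out x r), i.e. r * (d_in + d_out) parameters per matrix.
    """
    return num_layers * sum(r * (d_in + d_out) for d_in, d_out in targets)

# Assumed shapes for illustration: 28 layers, hidden size 2048, adapting
# a 2048->2048 projection and a 2048->1024 projection at rank 8.
targets = [(2048, 2048), (2048, 1024)]
print(lora_trainable_params(28, targets, r=8))  # → 1605632
```

The exact count depends on which projections are targeted and on the rank, but under these assumptions it lands in the low millions, the same order as the reported ~2.2M, which is roughly 0.13% of the 1.7B base parameters.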

Ecosystem Implications

This project's success is a significant indicator of AMD ROCm's maturing software support. By demonstrating that a standard PyTorch and HuggingFace workflow can be ported from NVIDIA hardware without friction, AMD lowers the barrier to entry for developers and institutions exploring non-CUDA options. Skipping quantization not only simplifies the engineering pipeline but also avoids the accuracy degradation quantization can introduce, presenting a tangible hardware advantage. As ROCm continues to improve its compatibility with the open-source AI toolkit, CUDA's lock-in effect may gradually erode, particularly for well-defined workflows like instruction fine-tuning.

The MedQA project's real significance isn't a new state-of-the-art model, but its demonstration of 'boring' compatibility. It signals that AMD's ROCm is approaching an 'it just works' level of maturity for the mainstream HuggingFace stack, threatening NVIDIA's CUDA moat not with a killer feature, but with the quiet reliability of a viable alternative.