Training Large Vision Models (LVMs): Benchmarking AMD vs. NVIDIA GPUs

Daniel Bibireata
April 03, 2024

Landing AI’s workloads for training Large Vision Models (LVMs) are computationally intensive. Looking for efficient model training options, we benchmarked the AMD Instinct MI250 against NVIDIA’s A100, using comparable server configurations with 1-4 GPUs: (i) MI250 128GB, (ii) A100 SXM 80GB. We found their performance comparable, with AMD offering a slightly better price-performance tradeoff. Furthermore, our LVM training code, which we developed in PyTorch, required no modifications to run on either AMD or NVIDIA hardware, using AMD’s open-source ROCm and NVIDIA’s CUDA frameworks, respectively, to execute on the GPU.
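To illustrate that portability concretely, here is a minimal sketch of ours (not the benchmark code itself): PyTorch’s ROCm build exposes AMD GPUs through the same torch.cuda interface that the CUDA build uses for NVIDIA GPUs, so vendor-agnostic device selection runs unchanged on both.

```python
import torch

# PyTorch's ROCm build reuses the torch.cuda namespace for AMD GPUs, so the
# usual vendor-agnostic device selection works unchanged on MI250 and A100.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

if device.type == "cuda":
    # Prints e.g. "AMD Instinct MI250" under ROCm, "NVIDIA A100-SXM4-80GB" under CUDA.
    print(torch.cuda.get_device_name(device))
```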

Benchmark results for two practical LVM workloads, in our preferred BF16 representation, showing total training time (less is better):

[Chart: total training time, MI250 vs. A100, BF16, 1-4 GPUs; the underlying numbers appear in the tables under Benchmark results below]

Although FP32 is not our preferred representation for efficient training, we also found that AMD’s MI250 was 2-2.5x faster than the A100 when training with FP32 precision.

The benchmarking for the AMD Instinct MI250 was done on the AMD Accelerator Cloud (AAC), where Landing AI has tested multiple ML vision workloads, and we plan to include more benchmark results in a future report.

Technical details: 

Large Vision Models

Landing AI trains domain-specific Large Vision Models (LVMs) for customers with hundreds of thousands or sometimes millions of unlabeled images in specific domains. LVMs are the image modality equivalent of Large Language Models (LLMs). For images, we have found the best option is to train models in a specific domain, like manufacturing, rather than use the entire diversity of images on the internet. The resulting LVMs are highly efficient in solving vision tasks in their domain.

We benchmarked training of two practical LVMs, in the histopathology and semiconductor domains. We used the Vision Transformer (ViT) architecture, with the following setup (a model-construction sketch follows the table):

                         Histopathology LVM            Semiconductor LVM
Model                    ViT-S (22M param)             ViT-B (86M param)
Images / Tokens          100K images / 19.6M tokens    500K images / 98M tokens
Training Time (4 GPUs)   8 hours                       3 days
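The post names only the architectures, so the following is a hypothetical sketch of how one could instantiate ViT variants with matching parameter counts. The use of the timm library, the specific model names, and num_classes=0 (a headless backbone, as one would use for pretraining on unlabeled images) are our assumptions for illustration.

```python
import timm  # assumption: timm is one common source of ViT implementations

# ViT-S (~22M params) and ViT-B (~86M params), matching the table above.
# Patch size and input resolution are illustrative defaults; the post does
# not specify them. num_classes=0 yields a headless feature backbone.
vit_s = timm.create_model("vit_small_patch16_224", pretrained=False, num_classes=0)
vit_b = timm.create_model("vit_base_patch16_224", pretrained=False, num_classes=0)

for name, model in [("ViT-S", vit_s), ("ViT-B", vit_b)]:
    n_params = sum(p.numel() for p in model.parameters()) / 1e6
    print(f"{name}: {n_params:.1f}M parameters")
```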

Given the importance of data security for Landing AI’s customers, we offer training on a variety of cloud service providers as well as on-premises, so it is important to give customers flexibility in their choice of compute environment. In this blog, we benchmark LVM training on A100 and MI250 compute environments.

Hardware configuration

For MI250, we used a server configuration with 4 MI250 GPUs. For the A100 configuration, we picked the comparable A100 SXM 80GB, in a server configuration with 8 GPUs (but we benchmarked with up to 4 GPUs).

             A100 SXM 80GB   MI250
ML hardware  Tensor Core     Matrix Core
FP32         19.5 TFLOPS     47.9 TFLOPS
FP16/BF16    312 TFLOPS      383 TFLOPS
Memory       80GB            128GB

Benchmark results 

For both LVMs, we ran 1-, 2-, and 4-GPU configurations, using the largest batch size (BS) that fit in GPU memory on both MI250 and A100. We used three precisions: (i) FP32, (ii) FP16, (iii) BF16. For FP16/BF16, we used mixed precision (MP), accumulating results in FP32. Our preferred representation for training is BF16, which gives a significant speedup in training time without loss in model quality. We report times for complete training runs of 300 epochs; some of the longer runs reported below were cut short after a few epochs and the results extrapolated to complete runs. A sketch of the mixed-precision setup follows.
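Our training code is not reproduced here, so the block below is only a minimal sketch of the BF16 mixed-precision pattern described above, assuming a standard PyTorch loop: the forward pass runs under torch.autocast in bfloat16 while parameters, gradients, and optimizer state stay in FP32. The stand-in model, batch size, and step count are illustrative.

```python
import torch

device = torch.device("cuda")                    # same code path on ROCm and CUDA builds
model = torch.nn.Linear(1024, 1024).to(device)   # stand-in for the ViT model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = torch.nn.MSELoss()

for step in range(100):                          # stand-in for the 300-epoch runs
    x = torch.randn(256, 1024, device=device)    # BS 256, as in the 1-GPU ViT-S runs
    target = torch.randn(256, 1024, device=device)
    optimizer.zero_grad(set_to_none=True)
    # Matmuls run in bfloat16 inside autocast; accumulation stays in FP32.
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = loss_fn(model(x), target)
    loss.backward()                              # weights and gradients remain FP32
    optimizer.step()
```

Unlike FP16 mixed precision, BF16 keeps FP32’s exponent range, so no GradScaler is needed. For the 2- and 4-GPU runs, one would additionally wrap the model in torch.nn.parallel.DistributedDataParallel; the mixed-precision pattern itself is unchanged.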

Histopathology LVM:

ViT-S        BS 256                BS 512                BS 1024
             1 MI250    1 A100     2 MI250    2 A100     4 MI250    4 A100
FP32         35h        77h        19h        41h        10h        23h
MP – FP16    24h        33h        13h        18h        8h         12h
MP – BF16    25h        30h        14h        18h        8h         12h

Semiconductor LVM:

ViT-B        BS 128                BS 256                BS 512
             1 MI250    1 A100     2 MI250    2 A100     4 MI250    4 A100
FP32         19d        48d        10d        28d        5d         12d
MP – FP16    10d        10d        5d         5d         3d         3d
MP – BF16    11d        10d        6d         6d         3d         3d

For the smaller histopathology LVM, MI250 is 1.5x faster, training the model in 8 hours compared to 12 hours for A100 (BF16, 4 GPU configuration). For the larger semiconductor LVM, MI250 and A100 training times are comparable, taking about 3 days to train.

When running with FP32 precision, MI250 is 2-2.5x faster than A100. We use FP32 precision during the initial exploration of new LVM training methods and architectures.

Conclusion

We believe domain-specific LVMs will become indispensable in a wide range of real-world scenarios, revolutionizing industries and enhancing the efficiency and capabilities of various systems.

If your organization has a large set of images (100K or more) and you want to extract significant value from your data using domain-specific Large Vision Models, submit a request to Start Your LVM Journey.
