Training Large Vision Models (LVMs): Benchmarking AMD vs. NVIDIA GPUs

Daniel Bibireata
April 03, 2024

Landing AI’s workloads for training Large Vision Models (LVMs) are computationally intensive. Looking for efficient model training options, we benchmarked the AMD Instinct MI250 against NVIDIA’s A100, using comparable server configurations with 1-4 GPUs: (i) MI250 128GB, (ii) A100 SXM 80GB. We found their performance comparable, with AMD offering a slightly better price-performance tradeoff. Furthermore, our LVM training code, which we developed in PyTorch, required no modifications to run on either AMD or NVIDIA hardware, using AMD’s open-source ROCm and NVIDIA’s CUDA frameworks, respectively, to execute on the GPU.
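To illustrate that portability concretely, here is a minimal sketch of ours (not the benchmark code itself): PyTorch’s ROCm build exposes AMD GPUs through the same torch.cuda interface that the CUDA build uses for NVIDIA GPUs, so vendor-agnostic device selection runs unchanged on both.

```python
import torch

# PyTorch's ROCm build reuses the torch.cuda namespace for AMD GPUs, so the
# usual vendor-agnostic device selection works unchanged on MI250 and A100.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

if device.type == "cuda":
    # Prints e.g. "AMD Instinct MI250" under ROCm, "NVIDIA A100-SXM4-80GB" under CUDA.
    print(torch.cuda.get_device_name(device))
```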

Benchmark results for two practical LVM workloads, in our preferred BF16 representation, showing total training time (less is better):

[Chart: total training time, MI250 vs. A100, BF16, 1-4 GPUs; the underlying numbers appear in the tables under Benchmark results below]

Although FP32 is not our preferred representation for efficient training, we also found that AMD’s MI250 was 2-2.5x faster than the A100 when training with FP32 precision.

The benchmarking for the AMD Instinct MI250 was done on the AMD Accelerator Cloud (AAC), where Landing AI has tested multiple ML vision workloads, and we plan to include more benchmark results in a future report.

Technical details: 

Large Vision Models

Landing AI trains domain-specific Large Vision Models (LVMs) for customers with hundreds of thousands or sometimes millions of unlabeled images in specific domains. LVMs are the image modality equivalent of Large Language Models (LLMs). For images, we have found the best option is to train models in a specific domain, like manufacturing, rather than use the entire diversity of images on the internet. The resulting LVMs are highly efficient in solving vision tasks in their domain.

We benchmarked training of two practical LVMs, in the histopathology and semiconductor domains. We used the Vision Transformer (ViT) architecture, with the following setup (a model-construction sketch follows the table):

                         Histopathology LVM            Semiconductor LVM
Model                    ViT-S (22M param)             ViT-B (86M param)
Images / Tokens          100K images / 19.6M tokens    500K images / 98M tokens
Training Time (4 GPUs)   8 hours                       3 days
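The post names only the architectures, so the following is a hypothetical sketch of how one could instantiate ViT variants with matching parameter counts. The use of the timm library, the specific model names, and num_classes=0 (a headless backbone, as one would use for pretraining on unlabeled images) are our assumptions for illustration.

```python
import timm  # assumption: timm is one common source of ViT implementations

# ViT-S (~22M params) and ViT-B (~86M params), matching the table above.
# Patch size and input resolution are illustrative defaults; the post does
# not specify them. num_classes=0 yields a headless feature backbone.
vit_s = timm.create_model("vit_small_patch16_224", pretrained=False, num_classes=0)
vit_b = timm.create_model("vit_base_patch16_224", pretrained=False, num_classes=0)

for name, model in [("ViT-S", vit_s), ("ViT-B", vit_b)]:
    n_params = sum(p.numel() for p in model.parameters()) / 1e6
    print(f"{name}: {n_params:.1f}M parameters")
```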

Given the importance of data security for Landing AI’s customers, we offer training on a variety of cloud service providers as well as on-premises, so it is important to give customers flexibility in their choice of compute environment. In this blog, we benchmark LVM training on A100 and MI250 compute environments.

Hardware configuration

For MI250, we used a server configuration with 4 MI250 GPUs. For the A100 configuration, we picked the comparable A100 SXM 80GB, in a server configuration with 8 GPUs (but we benchmarked with up to 4 GPUs).

             A100 SXM 80GB   MI250
ML hardware  Tensor Core     Matrix Core
FP32         19.5 TFLOPS     47.9 TFLOPS
FP16/BF16    312 TFLOPS      383 TFLOPS
Memory       80GB            128GB

Benchmark results 

For both LVMs, we ran 1-, 2-, and 4-GPU configurations, using the largest batch size (BS) that fit in GPU memory on both MI250 and A100. We used three precisions: (i) FP32, (ii) FP16, (iii) BF16. For FP16/BF16, we used mixed precision (MP), accumulating results in FP32. Our preferred representation for training is BF16, which gives a significant speedup in training time without loss in model quality. We report times for complete training runs of 300 epochs; some of the longer runs reported below were cut short after a few epochs and the results extrapolated to complete runs. A sketch of the mixed-precision setup follows.
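Our training code is not reproduced here, so the block below is only a minimal sketch of the BF16 mixed-precision pattern described above, assuming a standard PyTorch loop: the forward pass runs under torch.autocast in bfloat16 while parameters, gradients, and optimizer state stay in FP32. The stand-in model, batch size, and step count are illustrative.

```python
import torch

device = torch.device("cuda")                    # same code path on ROCm and CUDA builds
model = torch.nn.Linear(1024, 1024).to(device)   # stand-in for the ViT model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = torch.nn.MSELoss()

for step in range(100):                          # stand-in for the 300-epoch runs
    x = torch.randn(256, 1024, device=device)    # BS 256, as in the 1-GPU ViT-S runs
    target = torch.randn(256, 1024, device=device)
    optimizer.zero_grad(set_to_none=True)
    # Matmuls run in bfloat16 inside autocast; accumulation stays in FP32.
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = loss_fn(model(x), target)
    loss.backward()                              # weights and gradients remain FP32
    optimizer.step()
```

Unlike FP16 mixed precision, BF16 keeps FP32’s exponent range, so no GradScaler is needed. For the 2- and 4-GPU runs, one would additionally wrap the model in torch.nn.parallel.DistributedDataParallel; the mixed-precision pattern itself is unchanged.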

Histopathology LVM:

ViT-S        BS 256                BS 512                BS 1024
             1 MI250    1 A100     2 MI250    2 A100     4 MI250    4 A100
FP32         35h        77h        19h        41h        10h        23h
MP – FP16    24h        33h        13h        18h        8h         12h
MP – BF16    25h        30h        14h        18h        8h         12h

Semiconductor LVM:

ViT-B        BS 128                BS 256                BS 512
             1 MI250    1 A100     2 MI250    2 A100     4 MI250    4 A100
FP32         19d        48d        10d        28d        5d         12d
MP – FP16    10d        10d        5d         5d         3d         3d
MP – BF16    11d        10d        6d         6d         3d         3d

For the smaller histopathology LVM, MI250 is 1.5x faster, training the model in 8 hours compared to 12 hours for A100 (BF16, 4 GPU configuration). For the larger semiconductor LVM, MI250 and A100 training times are comparable, taking about 3 days to train.

When running with FP32 precision, MI250 is 2-2.5x faster than A100. We use FP32 precision during the initial exploration of new LVM training methods and architectures.

Conclusion

We believe domain-specific LVMs will become indispensable in a wide range of real-world scenarios, revolutionizing industries and enhancing the efficiency and capabilities of various systems.

If your organization has a large set of images (100K or more) and you want to extract significant value from your data using domain-specific Large Vision Models, submit a request to Start Your LVM Journey.
