LandingAI’s workloads for training Large Vision Models (LVMs) are computationally intensive. Looking for efficient model training options, we benchmarked the AMD Instinct MI250 (128GB) against the NVIDIA A100 SXM (80GB) on comparable server configurations with 1-4 GPUs. We found their performance comparable, with AMD offering a slightly better price-performance tradeoff. Furthermore, our LVM training code, developed in PyTorch, required no modifications to run on either AMD or NVIDIA hardware, using AMD’s open-source ROCm and NVIDIA’s CUDA frameworks, respectively, to execute on the GPU.
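To illustrate that portability: PyTorch’s ROCm builds expose the same `torch.cuda` namespace as its CUDA builds, so device-agnostic code like the following minimal sketch (a placeholder, not our actual training code) runs unchanged on both vendors’ GPUs:

```python
import torch

# On ROCm builds of PyTorch, torch.cuda.is_available() reports AMD GPUs,
# so the same device-selection logic covers MI250 and A100 alike.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = torch.nn.Linear(1024, 1024).to(device)  # placeholder model
x = torch.randn(32, 1024, device=device)
y = model(x)  # dispatched to ROCm or CUDA kernels under the hood

if device.type == "cuda":
    print("Running on:", torch.cuda.get_device_name(0))
```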
Benchmark results for two practical LVM workloads, using our preferred BF16 representation, show total training time (less is better); the full tables appear in the Benchmark results section below.
Although FP32 is not our preferred representation for efficient training, we also found that AMD’s MI250 was about 2x faster than the A100 when training with FP32 precision.
The benchmarking for the AMD Instinct MI250 was done on the AMD Accelerator Cloud (AAC), where LandingAI has tested multiple ML vision workloads, and we plan to include more benchmark results in a future report.
Technical details
Large Vision Models
LandingAI trains domain-specific Large Vision Models (LVMs) for customers with hundreds of thousands or sometimes millions of unlabeled images in specific domains. LVMs are the image modality equivalent of Large Language Models (LLMs). For images, we have found the best option is to train models in a specific domain, like manufacturing, rather than use the entire diversity of images on the internet. The resulting LVMs are highly efficient in solving vision tasks in their domain.
We benchmarked training of two practical LVMs, in the histopathology and semiconductor domains. We used the Vision Transformer architecture (ViT), with the following setup:
| | Histopathology LVM | Semiconductor LVM |
| --- | --- | --- |
| Model | ViT-S (22M params) | ViT-B (86M params) |
| Images / Tokens | 100K images / 19.6M tokens | 500K images / 98M tokens |
| Training Time (4 GPUs) | 8 hours | 3 days |
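For illustration, both backbones can be instantiated with the timm library (an assumption for this sketch; our internal training stack is not shown here). The token counts above are consistent with 16x16 patches on 224x224 inputs, since each image then contributes (224/16)² = 196 tokens (100K × 196 ≈ 19.6M; 500K × 196 = 98M):

```python
import timm

# ViT-S (~22M params) and ViT-B (~86M params); with 16x16 patches on
# 224x224 inputs, each image yields (224/16)**2 = 196 tokens.
vit_s = timm.create_model("vit_small_patch16_224", num_classes=0)
vit_b = timm.create_model("vit_base_patch16_224", num_classes=0)

for name, m in [("ViT-S", vit_s), ("ViT-B", vit_b)]:
    n_params = sum(p.numel() for p in m.parameters())
    print(f"{name}: {n_params / 1e6:.0f}M parameters")
```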
Given the importance of data security to LandingAI’s customers, we offer training on a variety of cloud service providers as well as on-premises, so it is important to give customers a choice of compute environments. In this blog, we benchmark LVM training in A100 and MI250 compute environments.
Hardware configuration
For the MI250, we used a server configuration with 4 MI250 GPUs. For the A100, we picked the comparable A100 SXM 80GB in a server configuration with 8 GPUs (but we benchmarked with up to 4 GPUs).
| | A100 SXM 80GB | MI250 |
| --- | --- | --- |
| ML hardware | Tensor Core | Matrix Core |
| FP32 | 19.5 TFLOPS | 47.9 TFLOPS |
| FP16/BF16 | 312 TFLOPS | 383 TFLOPS |
| Memory | 80GB | 128GB |
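The multi-GPU runs below use standard data parallelism. A minimal DistributedDataParallel sketch (assuming a torchrun launch, e.g. `torchrun --nproc_per_node=4 train.py`, where `train.py` is a hypothetical script name) again needs no vendor-specific changes, since ROCm ships RCCL behind PyTorch’s "nccl" backend name:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# torchrun sets RANK/LOCAL_RANK/WORLD_SIZE; the "nccl" backend maps to
# NCCL on NVIDIA GPUs and to RCCL on AMD ROCm builds of PyTorch.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(1024, 1024).cuda(local_rank)  # placeholder model
model = DDP(model, device_ids=[local_rank])

x = torch.randn(32, 1024, device=f"cuda:{local_rank}")
loss = model(x).sum()
loss.backward()  # DDP all-reduces gradients across the 1-4 GPUs

dist.destroy_process_group()
```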
Benchmark results
For both LVMs, we ran 1-, 2-, and 4-GPU configurations, using the largest batch size (BS) that fit in GPU memory on both the MI250 and the A100. We used three precisions: (i) FP32, (ii) FP16, (iii) BF16. For FP16/BF16, we used mixed precision (MP), accumulating results in FP32. Our preferred representation for training is BF16, which gives a significant speedup in training time without any loss in training performance. We report times for complete 300-epoch training runs; some of the longer runs reported below were cut after a few epochs, with results extrapolated to complete runs.
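A minimal sketch of that mixed-precision setup (placeholder model and loss; with PyTorch’s autocast, parameters and optimizer state stay in FP32 while the forward-pass matmuls run in BF16):

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()  # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for step in range(100):  # stand-in for the full 300-epoch runs
    x = torch.randn(256, 1024, device="cuda")  # BS 256, as in the 1-GPU ViT-S runs
    optimizer.zero_grad(set_to_none=True)
    # BF16 compute inside autocast; FP32 parameters/accumulation outside.
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = model(x).float().pow(2).mean()  # dummy loss, cast back to FP32
    # BF16 shares FP32's exponent range, so no loss scaling is needed;
    # FP16 mixed precision would additionally use torch.cuda.amp.GradScaler.
    loss.backward()
    optimizer.step()
```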
Histopathology LVM:
| ViT-S | 1 MI250 (BS 256) | 1 A100 (BS 256) | 2 MI250 (BS 512) | 2 A100 (BS 512) | 4 MI250 (BS 1024) | 4 A100 (BS 1024) |
| --- | --- | --- | --- | --- | --- | --- |
| FP32 | 35h | 77h | 19h | 41h | 10h | 23h |
| MP – FP16 | 24h | 33h | 13h | 18h | 8h | 12h |
| MP – BF16 | 25h | 30h | 14h | 18h | 8h | 12h |
Semiconductor LVM:
| ViT-B | 1 MI250 (BS 128) | 1 A100 (BS 128) | 2 MI250 (BS 256) | 2 A100 (BS 256) | 4 MI250 (BS 512) | 4 A100 (BS 512) |
| --- | --- | --- | --- | --- | --- | --- |
| FP32 | 19d | 48d | 10d | 28d | 5d | 12d |
| MP – FP16 | 10d | 10d | 5d | 5d | 3d | 3d |
| MP – BF16 | 11d | 10d | 6d | 6d | 3d | 3d |
For the smaller histopathology LVM, MI250 is 1.5x faster, training the model in 8 hours compared to 12 hours for A100 (BF16, 4 GPU configuration). For the larger semiconductor LVM, MI250 and A100 training times are comparable, taking about 3 days to train.
When running with FP32 precision, the MI250 is 2-2.5x faster than the A100. We use FP32 precision during the initial exploration of new LVM training methods and architectures.
Conclusion
We believe domain-specific LVMs will become indispensable in a wide range of real-world scenarios, revolutionizing industries and enhancing the efficiency and capabilities of various systems.
If your organization has a large set of images (100K or more) and you want to extract significant value from your data using domain-specific Large Vision Models, submit a request to Start Your LVM Journey.