| title | excerpt | date | toc | tags |
|---|---|---|---|---|
| The Fastest and Most Efficient BERT Training through Optimized Transformer Kernels | | 2020-05-19 00:00:00 | false | training English |
We introduce new technology to accelerate single-GPU performance via kernel optimizations. These optimizations not only create a strong foundation for scaling out large models, but also improve the single-GPU performance of highly tuned and moderately sized models like BERT by more than 30%, reaching a staggering performance of 66 teraflops per V100 GPU, which is 52% of the hardware peak. Using optimized transformer kernels as the building block, DeepSpeed achieves the fastest BERT training record: 44 minutes on 1,024 NVIDIA V100 GPUs, compared with the best published result of 67 minutes on the same number and generation of GPUs.
- For a brief overview, see our press release.
- For a detailed technology deep dive, see our blog post.
- For a tutorial on how to reproduce our results, see our BERT pre-training tutorial.
- The source code for our transformer kernels is available in the DeepSpeed repo, and the BERT pre-training code is available in the DeepSpeedExamples repo.
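As a rough illustration of how these kernels are consumed in user code, the sketch below builds a single BERT-large-style encoder layer with DeepSpeed's transformer kernel API. The class names `DeepSpeedTransformerConfig` and `DeepSpeedTransformerLayer` come from the DeepSpeed ops package, but the specific constructor arguments, shape conventions, and values shown here are illustrative and may differ across DeepSpeed versions; the BERT pre-training tutorial is the authoritative reference.

```python
# Minimal sketch (not the tutorial code): one fused transformer encoder layer.
# Argument names and defaults may vary across DeepSpeed versions.
import torch
from deepspeed.ops.transformer import (DeepSpeedTransformerConfig,
                                       DeepSpeedTransformerLayer)

config = DeepSpeedTransformerConfig(
    batch_size=16,            # illustrative micro-batch size per GPU
    hidden_size=1024,         # BERT-large hidden dimension
    intermediate_size=4096,   # feed-forward inner dimension
    heads=16,                 # attention heads
    attn_dropout_ratio=0.1,
    hidden_dropout_ratio=0.1,
    num_hidden_layers=24,
    initializer_range=0.02,
    local_rank=-1,            # no distributed launcher in this sketch
    seed=1234,
    fp16=False,               # the published V100 numbers use mixed precision
    pre_layer_norm=True,      # pre-LayerNorm transformer variant
)

# A full BERT encoder stacks num_hidden_layers of these fused layers.
layer = DeepSpeedTransformerLayer(config).cuda()

hidden_states = torch.randn(16, 128, 1024, device="cuda")   # [batch, seq, hidden]
attention_mask = torch.zeros(16, 1, 1, 128, device="cuda")  # additive attention mask
output = layer(hidden_states, attention_mask)
```

In the full pre-training recipe, fused layers like this stand in for the stock BERT encoder layers, and the rest of the training loop is driven by the usual DeepSpeed engine; that combination is what produces the per-GPU throughput reported above.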