---
title: "The Fastest and Most Efficient BERT Training through Optimized Transformer Kernels"
excerpt: ""
date: 2020-05-19 00:00:00
toc: false
tags: training English
---

We introduce new technology to accelerate single-GPU performance via kernel optimizations. These optimizations not only create a strong foundation for scaling out large models, but also improve the single-GPU performance of highly tuned and moderately sized models like BERT by more than 30%, reaching a staggering 66 teraflops per V100 GPU, which is 52% of the hardware peak. Using these optimized transformer kernels as the building block, DeepSpeed achieves the fastest BERT training record: 44 minutes on 1,024 NVIDIA V100 GPUs, compared with the best previously published result of 67 minutes on the same number and generation of GPUs.
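
To make this concrete, the sketch below shows one way a model's encoder layers could be swapped for DeepSpeed's fused transformer kernel via `DeepSpeedTransformerConfig` and `DeepSpeedTransformerLayer`. This is a minimal, illustrative example, not the record-setting BERT recipe: the shapes and hyperparameters are assumptions chosen to resemble BERT-large, and exact argument names can vary across DeepSpeed versions.

```python
# Minimal sketch (assumptions noted below): build one fused DeepSpeed
# transformer layer with a BERT-large-like shape and run a forward pass.
import torch
from deepspeed.ops.transformer import (
    DeepSpeedTransformerConfig,
    DeepSpeedTransformerLayer,
)

config = DeepSpeedTransformerConfig(
    batch_size=64,            # per-GPU micro-batch size (assumed, not the record setting)
    hidden_size=1024,         # BERT-large hidden dimension
    intermediate_size=4096,   # feed-forward dimension (4x hidden)
    heads=16,                 # attention heads for BERT-large
    attn_dropout_ratio=0.1,
    hidden_dropout_ratio=0.1,
    num_hidden_layers=24,
    initializer_range=0.02,
    fp16=True,                # mixed precision, as used in the fused kernels
    pre_layer_norm=True,      # pre-LayerNorm transformer variant
)

# One fused kernel layer; a full BERT encoder would stack num_hidden_layers of these.
layer = DeepSpeedTransformerLayer(config).cuda().half()

seq_len = 128
hidden_states = torch.randn(64, seq_len, 1024, dtype=torch.half, device="cuda")
attention_mask = torch.zeros(64, 1, 1, seq_len, dtype=torch.half, device="cuda")
output = layer(hidden_states, attention_mask)
```

In practice, a pre-trained BERT implementation would copy its existing attention and feed-forward weights into each fused layer before training, so the kernel substitution changes only the execution path, not the model it computes.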