---
title: "Progressive Layer Dropping"
excerpt: ""
date: 2020-10-29 00:00:00
tags: training English
toc: false
---

We introduce a new technology called progressive layer dropping (PLD) to speed up the pre-training of Transformer-based networks through efficient and robust compressed training. The pre-training step of Transformer networks often suffers from unbearable overall computational expenses. We analyze the training dynamics and stability of Transformer networks and propose PLD to sparsely update Transformer blocks following a progressive dropping schedule, which smoothly increases the layer dropping rate for each mini-batch as training evolves along both the temporal and the model depth dimensions. PLD allows pre-training to reach similar accuracy on downstream tasks 2.5x faster, and makes training 24% faster when training on the same number of samples, without requiring excessive hardware resources.
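
To make the idea of a progressive dropping schedule concrete, the sketch below is a minimal illustration, not DeepSpeed's implementation: the schedule form, the constants `theta_bar` and `gamma`, and the function names are assumptions chosen for exposition. It shows a keep probability that decays smoothly from 1.0 as training steps accumulate and that shrinks with layer depth, so deeper blocks are skipped more often as training progresses.

```python
import math
import random


def keep_probability(step, layer_idx, num_layers,
                     theta_bar=0.5, gamma=0.001):
    """Illustrative keep probability for one Transformer block.

    The global keep ratio decays smoothly from 1.0 toward theta_bar
    as training progresses, and deeper blocks are kept less often
    than shallower ones. (Schedule and constants are assumptions.)
    """
    # Temporal schedule: starts at 1.0, decays toward theta_bar.
    theta_t = (1.0 - theta_bar) * math.exp(-gamma * step) + theta_bar
    # Depth scaling: the first block is kept almost always, the last
    # block is kept with probability theta_t.
    return 1.0 - (layer_idx + 1) / num_layers * (1.0 - theta_t)


def forward_with_pld(blocks, x, step):
    """Run a stack of residual Transformer blocks, stochastically
    skipping blocks according to the progressive schedule above."""
    num_layers = len(blocks)
    for i, block in enumerate(blocks):
        if random.random() < keep_probability(step, i, num_layers):
            # Residual connection: the block output is added to its
            # input, so a skipped block simply passes x through.
            x = x + block(x)
    return x
```

Because each Transformer block sits on a residual connection, skipping a block leaves the activation path intact, which is what keeps training stable even as whole layers are dropped more aggressively over time.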