DeepSpeed supports mixed precision training, but its behavior differs from `torch.autocast`. DeepSpeed maintains parameters and gradients in both FP32 and a lower precision (FP16/BF16), in the style of NVIDIA Apex AMP, and computes all modules in the lower precision, while `torch.autocast` keeps parameters in FP32 and computes only certain operators in the lower precision. This leads to differences in:

- performance: `torch.autocast` needs to downcast in forward/backward
- memory usage: DeepSpeed needs more memory to keep lower-precision copies of parameters and gradients
- accuracy: `torch.autocast` has a list of modules that can safely be computed in lower precision; precision-sensitive operators (e.g. softmax) are computed in FP32

To align DeepSpeed's behavior with `torch.autocast` when necessary, this PR adds `torch.autocast` integration with ZeRO. Here is an example of the configuration:

```json
"torch_autocast": {
  "enabled": true,
  "dtype": "bfloat16",
  "lower_precision_safe_modules": ["torch.nn.Linear", "torch.nn.Conv2d"]
}
```

Each setting works as follows:

- `enabled`: Enables the `torch.autocast` integration when set to `true`. You don't need to call `torch.autocast` in your code; the grad scaler is also applied inside the DeepSpeed optimizer.
- `dtype`: The lower-precision dtype passed to `torch.autocast`. Gradients for allreduce (reduce-scatter) and parameters for allgather (ZeRO3 only) of `lower_precision_safe_modules` are also downcast to this dtype.
- `lower_precision_safe_modules`: Downcasting for allreduce (reduce-scatter) and allgather (ZeRO3) is applied only to modules specified in this list. (The precision of PyTorch operators in forward/backward follows `torch.autocast`'s policy, not this list.) Specify class names including their packages. If this item is unset, DeepSpeed uses the default list: `[torch.nn.Linear, torch.nn.Conv1d, torch.nn.Conv2d, torch.nn.Conv3d]`.

Note that only FP32 parameters are maintained when this feature is enabled. For consistency, you cannot enable `fp16` or `bf16` in the DeepSpeed config.
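For reference, here is a minimal sketch of how the config might be used end to end with `deepspeed.initialize`. The model, optimizer settings, and batch sizes below are illustrative assumptions, not part of this PR; only the `torch_autocast` block mirrors the configuration described above.

```python
# Minimal sketch (illustrative only): the model, optimizer settings, and
# batch sizes are assumptions, not part of this PR. Run with the deepspeed
# launcher, e.g. `deepspeed train.py`.
import torch
import deepspeed

ds_config = {
    "train_batch_size": 8,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
    "zero_optimization": {"stage": 3},
    # The new option: no torch.autocast context manager is needed in user code.
    "torch_autocast": {
        "enabled": True,
        "dtype": "bfloat16",
        "lower_precision_safe_modules": ["torch.nn.Linear", "torch.nn.Conv2d"],
    },
    # Do NOT also set "fp16" or "bf16": this feature keeps FP32 parameters
    # and is mutually exclusive with those modes.
}

model = torch.nn.Sequential(
    torch.nn.Linear(128, 128),
    torch.nn.GELU(),
    torch.nn.Linear(128, 10),
)

engine, _, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)

for _ in range(10):
    x = torch.randn(8, 128, device=engine.device)
    y = torch.randint(0, 10, (8,), device=engine.device)
    # Forward/backward run under torch.autocast managed by the engine;
    # grad scaling (where needed for the chosen dtype) is applied internally.
    loss = torch.nn.functional.cross_entropy(engine(x), y)
    engine.backward(loss)
    engine.step()
```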
---------

Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com>
Signed-off-by: Fabien Dupont <fdupont@redhat.com>
Signed-off-by: Olatunji Ruwase <olruwase@microsoft.com>
Signed-off-by: Logan Adams <loadams@microsoft.com>
Signed-off-by: inkcherry <mingzhi.liu@intel.com>
Signed-off-by: Omar Elayan <oelayan@habana.ai>
Signed-off-by: Roman Fitzjalen <romaactor@gmail.com>
Signed-off-by: Hongwei <hongweichen@microsoft.com>
Signed-off-by: shaomin <wukon1992@gmail.com>
Signed-off-by: Stas Bekman <stas@stason.org>
Signed-off-by: siqi <siqi@tecorigin.com>
Signed-off-by: Wei Wu <wuwei211x@gmail.com>
Signed-off-by: ShellyNR <shelly.nahir@live.biu.ac.il>
Signed-off-by: Lai, Yejing <yejing.lai@intel.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Fabien Dupont <fabiendupont@fabiendupont.fr>
Co-authored-by: Liangliang Ma <1906710196@qq.com>
Co-authored-by: inkcherry <mingzhi.liu@intel.com>
Co-authored-by: Omar Elayan <142979319+oelayan7@users.noreply.github.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: Roman Fitzjalen <romaactor@gmail.com>
Co-authored-by: Ramya Ramineni <62723901+rraminen@users.noreply.github.com>
Co-authored-by: Guanhua Wang <alexwgh333@gmail.com>
Co-authored-by: root <root@ftqtmec25000000.taxzvufipdhelhupulxcbvr15f.ux.internal.cloudapp.net>
Co-authored-by: Hongwei Chen <33092912+hwchen2017@users.noreply.github.com>
Co-authored-by: Joe Mayer <114769929+jomayeri@users.noreply.github.com>
Co-authored-by: wukong1992 <wukong1992@users.noreply.github.com>
Co-authored-by: shaomin <wukon1992@gmail.com>
Co-authored-by: loadams <loadams@users.noreply.github.com>
Co-authored-by: siqi654321 <siqi202311@163.com>
Co-authored-by: siqi <siqi@tecorigin.com>
Co-authored-by: Wei Wu <45323446+U-rara@users.noreply.github.com>
Co-authored-by: Shelly Nahir <73890534+ShellyNR@users.noreply.github.com>
Co-authored-by: snahir <snahir@habana.ai>
Co-authored-by: Yejing-Lai <yejing.lai@intel.com>
Co-authored-by: Siddharth Singh <siddharth9820@gmail.com>
Co-authored-by: Olatunji Ruwase <tjruwase@gmail.com>