This PR introduces *DeepCompile*, a new feature that efficiently
integrates compiler optimizations with other DeepSpeed features.
DeepCompile uses PyTorch Dynamo to capture the computation graph and
rewrites it to seamlessly incorporate DeepSpeed's optimizations.
Currently, DeepCompile supports ZeRO-1 and ZeRO-3, with enhancements
such as proactive prefetching and selective unsharding to improve
performance.
(More details will be added later.)
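As a rough illustration, enabling DeepCompile might look like the sketch
below; the `"compile"` config section and the `engine.compile()` call are
assumptions inferred from this PR's description, not a settled public API.
```python
# Hypothetical sketch of enabling DeepCompile with ZeRO-3; the "compile"
# config key and engine.compile() are assumptions, not a confirmed API.
import torch
import deepspeed

ds_config = {
    "train_batch_size": 8,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "zero_optimization": {"stage": 3},  # DeepCompile supports ZeRO-1 and ZeRO-3
    "compile": {"deepcompile": True},   # assumed enablement flag
}

model = torch.nn.Linear(1024, 1024)
engine, _, _, _ = deepspeed.initialize(model=model,
                                       model_parameters=model.parameters(),
                                       config=ds_config)
engine.compile()  # capture the graph with Dynamo and apply DeepSpeed passes
```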
---------
Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com>
Signed-off-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: zafarsadiq <zafarsadiq120@gmail.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Use accelerator APIs to select the device in setup.py and set the visible
devices env var in runner.py.
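Roughly, the accelerator-based selection could look like this sketch;
`get_accelerator()` is DeepSpeed's abstraction entry point, but treating
`device_name()` and `visible_devices_envs()` as the exact hooks used here
is an assumption.
```python
# Sketch of accelerator-agnostic device selection; which accelerator
# methods setup.py and runner.py actually call is assumed here.
import os
from deepspeed.accelerator import get_accelerator

accel = get_accelerator()
print(accel.device_name(0))  # e.g. "cuda:0", "xpu:0", or "cpu"

# Instead of hard-coding CUDA_VISIBLE_DEVICES, the runner can ask the
# accelerator which env var controls device visibility:
for env in accel.visible_devices_envs():  # e.g. ["CUDA_VISIBLE_DEVICES"]
    os.environ[env] = "0,1"
```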
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Fixed the Windows build.
Fixes applied:
- Remove some more ops that don't build on Windows.
- Remove the use of symlinks that didn't work correctly and replace them
with `shutil.copytree()` (a sketch follows below).
- Small fixes to make the C++ code compile.
Tested with Python 3.9 and CUDA 12.1.
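A minimal sketch of the symlink-to-copy change, with illustrative paths
(the `csrc` path appears in a later fix in this log; the actual setup.py
code differs):
```python
# Illustrative only: the real setup.py paths and surrounding logic differ.
import shutil

# Before (broken on Windows without admin/developer mode):
#   os.symlink("csrc", r"deepspeed\ops\csrc", target_is_directory=True)
# After: copy the tree instead of linking it.
shutil.copytree("csrc", r"deepspeed\ops\csrc", dirs_exist_ok=True)
```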
---------
Co-authored-by: Costin Eseanu <costineseanu@gmail.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
The order of parameters in the `create_dir_symlink` method looks wrong.
Because of this, we get the error "PermissionError: [WinError 5] Denied
access: '.\\deepspeed\\ops\\csrc'" when installing deepspeed >= 0.4.0 in a
Windows environment.
Please check this out @eltonzheng and @jeffra.
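For context, `os.symlink(src, dst)` creates the link at `dst` pointing to
`src`; swapping the arguments makes the call try to create a link at the
already-existing source directory, which surfaces on Windows as WinError 5.
A hypothetical illustration (the actual `create_dir_symlink` code differs):
```python
# Hypothetical illustration of the argument-order bug; paths are taken
# from the error message above.
import os

src = "csrc"                   # existing source directory
dst = r".\deepspeed\ops\csrc"  # link that should be created

# Wrong order: tries to create the link *at* the existing directory,
# which fails with "PermissionError: [WinError 5] ...".
# os.symlink(dst, src, target_is_directory=True)

# Correct order: create dst as a link pointing to src.
os.symlink(src, dst, target_is_directory=True)
```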
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
This is a WIP PR that makes op builder detection adapt to accelerator
changes. It is a follow-up to
https://github.com/microsoft/DeepSpeed/issues/5173
Currently, DeepSpeed generates `installed_ops` and `compatible_ops` at
setup time. If the system changes to a different accelerator at DeepSpeed
launch time, these two lists contain incorrect information.
This PR intends to solve this problem with more flexible op detection.
* For `installed_ops`, DeepSpeed should disable all installed ops if the
accelerator detected at setup time differs from the one detected at
launch time.
* For `compatible_ops`, DeepSpeed should refresh the list on each launch
to avoid the impact of an accelerator change.
As a first step, the nv-inference workflow is temporarily changed to
emulate the scenario in which the system is set up with CPU_Accelerator
and then launched with CUDA_Accelerator, and CPU_Accelerator is modified
so that Intel Extension for PyTorch and the oneCCL binding for PyTorch
are no longer mandatory. Starting from here, we can reconstruct
`installed_ops` and `compatible_ops` to follow the design above.
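A sketch of that design (names such as `SETUP_TIME_ACCELERATOR` and the
helper functions are hypothetical; `is_compatible()` and `NAME` mirror
DeepSpeed's op builder interface):
```python
# Hypothetical sketch of the detection design above; these names are
# illustrative, not DeepSpeed's actual code.
from deepspeed.accelerator import get_accelerator

SETUP_TIME_ACCELERATOR = "cpu"  # recorded when setup.py ran

def effective_installed_ops(installed_ops):
    # Disable every precompiled op if the accelerator changed since setup.
    if get_accelerator().device_name() != SETUP_TIME_ACCELERATOR:
        return {op: False for op in installed_ops}
    return installed_ops

def refreshed_compatible_ops(op_builders):
    # Recompute compatibility on every launch instead of trusting the
    # list frozen at setup time.
    return {b.NAME: b().is_compatible() for b in op_builders}
```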
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
#5192 reports an issue with the latest DeepSpeed release (0.13.3)
related to pre-compilation and the recently re-enabled `ninja` support
in #5088. This reverts to disabling `ninja` by default, but users can
still enable it with `DS_ENABLE_NINJA=1` until we can debug further and
understand the problem.
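The opt-in gate might look like this sketch (where exactly the guard
lives in DeepSpeed's build path is an assumption;
`BuildExtension.with_options(use_ninja=...)` is PyTorch's real knob):
```python
# Sketch of the opt-in gate: ninja stays off unless DS_ENABLE_NINJA=1.
import os
from torch.utils.cpp_extension import BuildExtension

use_ninja = os.environ.get("DS_ENABLE_NINJA", "0") == "1"
cmdclass = {"build_ext": BuildExtension.with_options(use_ninja=use_ninja)}
```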
* skip cpu support unimplemented error and update cpu inference workflow
* add torch.bfloat16 to cuda_accelerator
* remove UtilsBuilder skip
* fused adam can build
* use cpu adam to implement fused adam
* enable zero stage 1 and 2 for synchronized accelerator (a.k.a. CPU)
* remove unused parameters
* remove skip FusedAdamBuilder; add suported_dtypes
* fix format
* Revert "fix format"
Revert "remove skip FusedAdamBuilder; add suported_dtypes"
Revert "remove unused parameters"
Revert "enable zero stage 1 and 2 for synchronized accelerator (a.k.a. CPU)"
Revert "use cpu adam to implement fused adam"
Revert "fused adam can build"
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Ma, Guokai <guokai.ma@intel.com>
* Abstract accelerator (step 2)
* more flex op_builder path for both installation and runtime
* add SpatialInferenceBuilder into cuda_accelerator.py
* use reflection to make cuda_accelerator adapt to CUDA op builder changes automatically (see the sketch after this list)
* clean up deepspeed/__init__.py
* add comments in cuda_accelerator for no torch path
* Update deepspeed/env_report.py
Change env_report.py according to suggestion
Co-authored-by: Michael Wyatt <mrwyattii@gmail.com>
* reduce the range of try...except for better code clarity
* Add porting for deepspeed/ops/random_ltd/dropping_utils.py
* move accelerator to top directory and create symlink under deepspeed
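The reflection bullet above might translate to something like this
sketch; the `op_builder` module path and the `Builder`-suffix convention
match DeepSpeed's layout, but the discovery code itself is illustrative:
```python
# Illustrative sketch of reflection-based op builder discovery; the
# filtering convention here is assumed, not DeepSpeed's exact code.
import importlib
import inspect

def discover_op_builders(pkg_name="op_builder"):
    pkg = importlib.import_module(pkg_name)
    builders = {}
    for name, obj in inspect.getmembers(pkg, inspect.isclass):
        # Pick up anything the package exports that follows the
        # FooBuilder naming convention.
        if name.endswith("Builder") and obj.__module__.startswith(pkg_name):
            builders[name] = obj
    return builders

# With this, adding a new builder to op_builder requires no edit to
# cuda_accelerator.py -- it is picked up automatically.
```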
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Michael Wyatt <mrwyattii@gmail.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
* remove any cupy install when setting up environments
* revert previous changes to run on cu111 runners
* fix for when no cupy is installed
* remove cupy uninstall for workflows not using latest torch version
* update to cu116 for inference tests
* fix pip uninstall line
* move python environment list to after DS install
* remove cupy uninstall
* re-add --forked
* fix how we get cupy version (should be based on nvcc version)
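Deriving the wheel name from `nvcc` might look like the sketch below; the
regex and the `cupy-cudaXYZ` naming (e.g. `cupy-cuda116` for CUDA 11.6)
follow cupy's historical wheel scheme:
```python
# Sketch: pick the cupy wheel from nvcc's reported CUDA release rather
# than torch's CUDA version; regex and naming scheme assumed from
# cupy's historical cupy-cudaXYZ wheels.
import re
import subprocess

out = subprocess.check_output(["nvcc", "--version"], text=True)
release = re.search(r"release (\d+)\.(\d+)", out)
major, minor = release.groups()
print(f"cupy-cuda{major}{minor}")  # e.g. "cupy-cuda116"
```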