In the process of adding one-bit optimizer support for XPU devices, we noticed that across accelerators the main difference in the implementation of `compressed_allreduce` lies in `packbits` and `unpackbits`: CUDA uses cupy and NPU uses torch_npu. Instead of replacing these with XPU-only functions, we provide a CompressedBackend that performs the `compressed_allreduce` work and lets users plug in their own packbits/unpackbits kernels, giving a general path for all kinds of accelerators. In this PR, we:
1. Add CompressedBackend for onebitAdam, onebitLamb and zerooneAdam
2. Add an XPU implementation of packbits/unpackbits with SYCL, built in PackbitsBuilder
3. Add tests for onebit with CompressedBackend

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
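For context, the packbits/unpackbits pair is just a bit-level conversion between a sign tensor and a Byte tensor. The following is a minimal PyTorch reference of those semantics (an illustration only, not the SYCL kernels; the function names here are ours):

```python
import torch

def packbits_reference(sign: torch.Tensor) -> torch.Tensor:
    """Pack a {-1, +1} float tensor into uint8, 8 signs per byte (MSB first)."""
    bits = (sign > 0).to(torch.uint8).reshape(-1, 8)  # numel must be a multiple of 8
    weights = torch.tensor([128, 64, 32, 16, 8, 4, 2, 1], dtype=torch.uint8)
    return (bits * weights).sum(dim=1, dtype=torch.uint8)

def unpackbits_reference(packed: torch.Tensor) -> torch.Tensor:
    """Invert packbits_reference, recovering the {-1, +1} sign tensor."""
    shifts = torch.arange(7, -1, -1)           # MSB-first bit positions
    bits = (packed.unsqueeze(1).int() >> shifts) & 1
    return bits.reshape(-1).float() * 2 - 1
```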
One-Bit tests
In this folder, you can test the functionality and performance of different backends for compressed allreduce, which is the core algorithm in one-bit optimizers such as One-Bit Adam, One-Bit Lamb and Zero-One Adam.
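At its core, compressed allreduce replaces an exact gradient allreduce with a 1-bit sign exchange plus a scalar scale, using error feedback so that compression error is carried into the next step instead of being lost. A minimal sketch of the per-step compression (illustrative, not DeepSpeed's exact code):

```python
import torch

def one_bit_compress(grad: torch.Tensor, error: torch.Tensor):
    """Error-compensated 1-bit compression as used by one-bit optimizers."""
    corrected = grad + error               # add residual from the previous step
    scale = corrected.abs().mean()         # single scalar sent alongside the signs
    sign = corrected.sign()                # the 1-bit payload (packed on the wire)
    new_error = corrected - scale * sign   # residual fed back at the next step
    return scale, sign, new_error
```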
How to run
NCCL and MPI backend
These tests require the relevant communication backend to be installed in your environment: either the NCCL backend of PyTorch distributed, or a Message Passing Interface (MPI) implementation such as MVAPICH2-GDR or OpenMPI. See the Detailed Pre-requisites.
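A quick way to sanity-check the environment (a small sketch; it assumes the MPI tests rely on mpi4py, which is typical for DeepSpeed's MPI path):

```python
import torch

print("NCCL available:", torch.distributed.is_nccl_available())
try:
    from mpi4py import MPI  # MPI backend dependency
    print("MPI world size:", MPI.COMM_WORLD.Get_size())
except ImportError:
    print("mpi4py not installed; MPI backend tests will not run")
```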
To test the accuracy and performance of the NCCL backend:
python test_nccl_backend.py
python test_nccl_perf.py
Similarly, for the MPI backend:
python test_mpi_backend.py
python test_mpi_perf.py
Compressed backend
This backend abstracts the generic part of one-bit optimizers and implements the accelerator-dependent part with a DeepSpeed custom op builder. To use and test this CompressedBackend, make sure that your current accelerator supports PackbitsBuilder, so that it can be loaded to perform high-performance packing and unpacking between float and Byte datatypes. An example can be found in Deepspeed/op_builder/xpu/packbits.py.
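As a rough sketch, the op can be looked up through DeepSpeed's accelerator abstraction and JIT-built on first use. The lookup name and call pattern below follow DeepSpeed's general op-builder convention and are assumptions, not a documented API:

```python
from deepspeed.accelerator import get_accelerator

# Assumption: the current accelerator registers the op as "PackbitsBuilder".
packer = get_accelerator().create_op_builder("PackbitsBuilder").load()

# The builder is expected to expose packbits/unpackbits kernels converting
# between float signs and a Byte buffer. Exact signatures are accelerator-
# specific, so the calls below are illustrative only:
# packed = packer.packbits(sign_tensor, ...)
# signs  = packer.unpackbits(packed, ...)
```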
The test usage is the same as for the other backends:
python test_compressed_backend.py
python test_compressed_perf.py