91d63e0228
update formatter version and style settings ( #3098 )
2023-03-27 07:55:19 -04:00
b3ec1c9712
Move cuda check into utils ( #3074 )
...
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2023-03-24 04:22:48 +00:00
090d49e79f
pre-commit check for torch.cuda in code ( #2981 )
...
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2023-03-23 20:29:54 -07:00
e80ae08886
Empty ZeRO3 partition cache ( #3060 )
...
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2023-03-23 17:15:34 -07:00
5cdf35935d
Goodbye Torch 1.8 ( #3082 )
...
* bump torch18 -> torch19
* fix gptj
---------
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2023-03-23 11:43:28 -07:00
5c2a81c2c1
allow list ( #3042 )
...
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2023-03-23 09:50:01 -07:00
a78d6b89e0
Fix Nebula issue in save_16bit_model ( #3023 )
...
Co-authored-by: Qinghuan Rao <qinghuanrao@microsoft.com>
2023-03-23 09:45:42 -07:00
1286e374ab
Softmax Scheduling Cleanup ( #3046 )
...
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2023-03-22 08:45:06 -07:00
27e1b02deb
Remove bf16 from inference config dtype enum ( #3010 )
...
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2023-03-22 08:44:09 -07:00
871c8a3f5d
fix return of prev key and value, added strides to from_blob ( #2828 )
...
Co-authored-by: Mor Zusman <morz@ai21.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2023-03-22 08:43:35 -07:00
36677588b6
[CI] follow-up fixes ( #3072 )
2023-03-21 15:38:21 -07:00
9ea0fdc2ce
Assert mp_size is a factor of model dimensions ( #2891 )
...
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2023-03-21 14:50:43 -07:00
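The check behind #2891 can be sketched as follows: the model-parallel size must evenly divide the model's dimensions, or tensor slicing would produce unequal shards. Function and parameter names here are illustrative assumptions, not DeepSpeed's actual code.

```python
# Minimal sketch of the mp_size divisibility assertion (names assumed).
def check_mp_size(hidden_size: int, num_heads: int, mp_size: int) -> None:
    # Each model-parallel rank must receive an equally sized slice.
    assert hidden_size % mp_size == 0, \
        f"mp_size ({mp_size}) must be a factor of hidden_size ({hidden_size})"
    assert num_heads % mp_size == 0, \
        f"mp_size ({mp_size}) must be a factor of num_heads ({num_heads})"

check_mp_size(4096, 32, 4)  # divides evenly, passes
```

A non-divisor such as `mp_size=3` against `hidden_size=4096` would trip the first assertion with a descriptive message.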
4e0686233a
Several fixes to unblock CI ( #3047 )
...
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2023-03-21 13:45:25 -07:00
b38b3036dd
[docs] add MCR-DL paper to readme/docs ( #3066 )
2023-03-21 10:19:16 -07:00
f1e4fb0b87
Fix Broken Links ( #3048 )
v0.8.3
2023-03-17 11:30:24 -07:00
bbfd0a6a3e
update email info
2023-03-15 14:16:26 -07:00
ac2c9ffae4
Improve loss overflow logs ( #3008 )
...
* Improve overflow logs
* Trigger CI
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2023-03-15 09:33:15 -04:00
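The idea of #3008 is to make overflow messages actionable: when the loss goes non-finite, report the step and the offending value rather than a bare "overflow" notice. The helper name and message format below are assumptions for illustration, not DeepSpeed's exact code.

```python
import math

# Sketch of clearer overflow reporting (names and format assumed).
def check_loss_overflow(loss: float, step: int):
    """Return a descriptive message if the loss overflowed, else None."""
    if math.isinf(loss) or math.isnan(loss):
        return (f"[step {step}] loss overflow detected (loss={loss}); "
                f"skipping optimizer step")
    return None

msg = check_loss_overflow(float("inf"), 42)
```

A finite loss returns `None`, so the caller logs only when something actually went wrong.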
94f7da26b6
Convert model parameters from generator to list. ( #3017 )
...
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2023-03-15 07:45:20 -04:00
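Why #3017 matters: a generator of parameters can be iterated only once, so any second pass silently sees nothing, while a list can be re-iterated safely. A toy illustration (not DeepSpeed code):

```python
# A generator is exhausted after one pass; a list is not.
def parameters():
    yield from [1, 2, 3]

gen = parameters()
first = sum(gen)       # consumes the generator
second = sum(gen)      # generator already exhausted, sums to 0

params = list(parameters())  # the fix: materialize once, reuse freely
```

Here `first` is 6 but `second` is 0, which is exactly the kind of silent bug converting to a list avoids.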
50a49e42fb
[logger] implement warning_once ( #3021 )
...
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2023-03-14 22:29:00 -07:00
d7c925e4e8
adding attribute checks for bf16 optimizer with ZeRO ( #3022 )
...
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2023-03-14 20:13:08 -07:00
e355863b83
Update torch_checkpoint_engine.py ( #3019 )
2023-03-14 18:41:39 -04:00
4292e8c59a
[docs] add new paper to readme/docs ( #3018 )
...
Co-authored-by: Zhewei Yao <zheweiyao@gmail.com>
2023-03-14 12:46:48 -07:00
b528f50e3d
Fix buffer size for pipeline parallel and communication schedule ( #2862 )
...
* fix buffer size for pipeline parallel (#2800 )
* improve explanation of buffer size for pipeline parallelism
Co-authored-by: Jae-Won Chung <jwnchung@umich.edu>
* fix format of comment
---------
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Jae-Won Chung <jwnchung@umich.edu>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2023-03-13 10:48:10 -07:00
43d58d99eb
ckpt: create directories in checkpoint_engine ( #2988 )
...
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2023-03-13 10:52:48 -04:00
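The fix in #2988 boils down to ensuring the checkpoint's parent directory exists before writing, so saving into a fresh path does not fail. A hedged sketch with an illustrative function name:

```python
import os

# Sketch: create parent directories before saving (name assumed).
def save_checkpoint(state_bytes: bytes, path: str) -> None:
    parent = os.path.dirname(path)
    if parent:
        # exist_ok=True makes this a no-op when the directory is present.
        os.makedirs(parent, exist_ok=True)
    with open(path, "wb") as f:
        f.write(state_bytes)
```

Without the `makedirs` call, writing to `checkpoints/step-100/model.pt` in a new run directory would raise `FileNotFoundError`.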
3798e60519
Fix Meta Tensor checkpoint load for OPT models ( #2990 )
...
This PR fixes Meta Tensor checkpoint loading for OPT models where the state dict (SD) keys start with `model.`.
2023-03-10 11:45:36 -08:00
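The key-remapping idea behind #2990 can be sketched as stripping the `model.` prefix from state dict keys before matching them against module names. The helper name below is an assumption for illustration:

```python
# Sketch: normalize OPT state dict keys by stripping a "model." prefix.
def strip_model_prefix(sd: dict, prefix: str = "model.") -> dict:
    return {
        (k[len(prefix):] if k.startswith(prefix) else k): v
        for k, v in sd.items()
    }

sd = {"model.decoder.weight": 1, "lm_head.weight": 2}
remapped = strip_model_prefix(sd)
```

Keys without the prefix (like `lm_head.weight`) pass through unchanged.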
457850dc5a
[zero] prevent poor configs from running with zero-offload ( #2971 )
2023-03-09 09:57:51 -08:00
58a4a4d4c1
Fix issue between our abstract accelerator and colossalai's version of op_builder ( #2963 )
...
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2023-03-08 09:55:41 -08:00
6379defaef
bug fix for skipping mbs ( #2171 )
...
Co-authored-by: Rajhans Samdani <rajhans@gmail.com>
2023-03-08 05:27:56 +00:00
d58b4df39f
bump to 0.8.3
2023-03-07 10:16:32 -08:00
db15ef578a
deepspeed.init_distributed() support for TCP protocols ( #2905 )
...
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
v0.8.2
2023-03-07 09:42:22 -08:00
0acf7e9c48
[RFC] add device abstraction to allow devices other than CUDA to be used ( #2221 )
...
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2023-03-07 09:40:17 -08:00
80d8fcbdb3
Improve overflow handling ( #2944 )
...
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2023-03-06 18:24:25 -08:00
87eaf8f99a
Check for local CUDA graphs when enable_cuda_graph=True ( #2941 )
2023-03-06 17:38:50 -08:00
2ede0d942a
AutoTP Assert Kernel Injection Support ( #2939 )
...
* check kernel injection supported models
* Clarify why user should use kernel injection
2023-03-06 14:23:55 -08:00
4ae3a3da0d
TP unsupported models and assertions ( #2810 )
...
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2023-03-01 14:18:13 -08:00
8d53ac0cd3
Add MPICH Multinode Runner ( #2839 )
...
* MPICH support
* MPICH changes
* MPICH changes
* MPICH changes
* MPICH changes
* accelerator runtime modifications
* Accelerator runtime changes
* Accelerator runtime modifications
* Remove redundant print from single node
* Move hostfile to tmp
* Code cleanup for MPICH class
* Code cleanup, rm whitespace
* Removing mpiexec environment check details
* Tmp hostfile not needed as it is passed directly
* Remove debugging comments
* rm print statement
* Revert comm changes as WA not needed
* Use MPICHRunner name for class
* Use MPICHRunner as class name
* No need to use args.force_multi and args.launcher.
This should be set in deepspeedexamples gpt-3.6b.sh script as:
$launcher=MPICH
run_cmd="deepspeed --hostfile=${hostfile_ds} --num_nodes ${NUM_WORKERS} --num_gpus ${NUM_GPUS_PER_WORKER} --launcher=${launcher} --force_multi pretrain_gpt2.py $@ ${gpt_options}"
* Adhere to code pattern
* Rm empty lines in MPICHRunner class
* Uncomment check for num nodes and workers when using hostfile_deepspeed in gpt-3.6b.sh
* pass MPICH hostfile through launcher_args in gpt-3.6b.sh
* Clean code and remove args hostfile
* fix merge
* fix merge
---------
Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
* clean up and fix format
* add ut
---------
Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2023-03-01 12:04:07 -05:00
91d7090e47
Fixes AttributeError in #2853 ( #2854 )
...
Updates `deepspeed/monitor/monitor.py`
to instantiate objects with correct configs
Relevant issue:
https://github.com/microsoft/DeepSpeed/issues/2853
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2023-03-01 06:06:19 -05:00
17fa0876ad
Always convert input mask to half ( #2851 )
2023-02-28 21:36:21 +00:00
9886d6d9e0
Fix CPUAdam for when vendor_id_raw is not provided ( #2836 )
...
* #1213: Fix CPUAdam for when `vendor_id_raw` is not provided
* formatting (yapf) fix
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2023-02-28 18:41:49 +00:00
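The defensive pattern behind #2836 can be sketched as reading the CPU vendor from parsed CPU info but falling back to a default when the `vendor_id_raw` field is absent (as on some non-x86 systems). The function name and fallback value are assumptions for this sketch:

```python
# Sketch: tolerate a missing "vendor_id_raw" field in parsed CPU info.
def get_cpu_vendor(cpuinfo: dict) -> str:
    # dict.get with a default avoids a KeyError when the field is missing.
    return cpuinfo.get("vendor_id_raw", "unknown")

vendor = get_cpu_vendor({"vendor_id_raw": "GenuineIntel"})
fallback = get_cpu_vendor({})
```

A direct `cpuinfo["vendor_id_raw"]` lookup is exactly the kind of access that crashes when the field is not provided.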
dc01cee5ca
using container when loading inference checkpoints ( #2875 )
...
This PR updates the replace_fn function when loading inference checkpoints. The container is now passed to load_model_with_checkpoint() so load_params() can be called from there; load_params() is also updated to access the variables in the policy.
2023-02-28 14:59:23 +00:00
f1d2a15b50
better eval sampler ( #2907 )
...
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2023-02-28 06:10:19 +00:00
541e423ae6
Enable tensor fragments for zero 2 & 3 ( #2727 )
...
* Enable tensor fragments for zero 2
* Update deepspeed/utils/tensor_fragment.py
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
* Update deepspeed/utils/tensor_fragment.py
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
* Support offload
* Support multi-gpu
* Cleanup
* WIP
* Update deepspeed/runtime/zero/stage3.py
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
* Support padding
* Update deepspeed/runtime/zero/stage3.py
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
* z3 optimizer state support; aligned api
* Support frozen z3 params
* Unit tests
* Check NVMe offload capability
* Formatting
* Docs
* More docs
* More docs
* Update docs/code-docs/source/zero3.rst
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
* More docs
* Update docs/code-docs/source/zero3.rst
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
* More docs
* More docs
* Update docs/code-docs/source/zero3.rst
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
* Update deepspeed/utils/tensor_fragment.py
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
* More docs
* Support unsharded fp32 grad
* Remove debug prints
* Fix off-by-one detection of empty grads
* Update deepspeed/utils/tensor_fragment.py
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
* Update deepspeed/utils/tensor_fragment.py
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
* Update deepspeed/utils/tensor_fragment.py
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
* Update deepspeed/runtime/zero/stage3.py
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
* Fix off-by-one error
* Skip ranks with no gradient data
* Formatting
* Add license
* Fix license
---------
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
2023-02-27 23:40:49 -05:00
da84e60d98
add missing license info to top of all source code ( #2889 )
...
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
Co-authored-by: Conglong Li <conglong.li@gmail.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2023-02-27 11:20:41 -08:00
8710f0514e
Reduce I/O size ( #2814 )
2023-02-24 10:59:44 -05:00
d3de737550
Remove deprecated torch._six imports ( #2863 )
...
* Remove deprecated `torch._six` imports
Closes #2845 .
* Support older versions of PyTorch as well.
---------
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2023-02-23 18:51:02 -08:00
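The compatibility pattern for #2863 can be sketched as a guarded import: newer PyTorch removed `torch._six`, so code tries the old location and falls back to the standard library. The sketch below uses `math.inf` as the fallback; the exact fallback in the PR may differ:

```python
# Sketch: version-tolerant import replacing the removed torch._six module.
try:
    from torch._six import inf  # older PyTorch releases
except ImportError:
    # PyTorch >= 1.13 removed torch._six; math.inf is equivalent here.
    from math import inf
```

Either branch binds the same IEEE infinity value, so downstream comparisons like gradient-norm clipping are unaffected.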
b47c592ae2
AutoTP tutorial web formatting and news ( #2883 )
...
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2023-02-23 13:19:14 -08:00
7e77cf710a
Check device count before running dist tests ( #2799 )
...
* Check device count before running dist tests
* fixing format for "Check device count before running dist tests"
* Check device count against max world size
* Check GPU count before launching dist tests
* double-check GPU actually exists
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
2023-02-23 12:04:49 -08:00
859d7c92ab
Automatic Tensor Parallelism Blog Links ( #2877 )
...
* Modify table for compatible web format
* Add tutorial links to navigation
* Add news bit to main readme
* Update docs/_tutorials/automatic-tensor-parallelism.md
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
---------
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
2023-02-23 04:41:15 -08:00
81b4d5db06
Make z3 respect comm dtype ( #2807 )
...
* Make z3 respect comm dtype
* Support fp32 comm dtype
* Remove obsolete assert
* Code cleanup
2023-02-22 06:50:57 -05:00
7c99def0f0
Data efficiency library update ( #2866 )
...
* data efficiency library update
* data efficiency library update
* data efficiency update
* data efficiency update
2023-02-21 14:43:29 -08:00