Commit Graph

1396 Commits

Author SHA1 Message Date
91d63e0228 update formatter version and style settings (#3098) 2023-03-27 07:55:19 -04:00
b3ec1c9712 Move cuda check into utils (#3074)
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2023-03-24 04:22:48 +00:00
090d49e79f pre-commit check for torch.cuda in code (#2981)
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2023-03-23 20:29:54 -07:00
e80ae08886 Empty ZeRO3 partition cache (#3060)
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2023-03-23 17:15:34 -07:00
5cdf35935d Goodbye Torch 1.8 (#3082)
* bump torch18 -> torch19
* fix gptj

---------

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2023-03-23 11:43:28 -07:00
5c2a81c2c1 allow list (#3042)
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2023-03-23 09:50:01 -07:00
a78d6b89e0 Fix nebula in save_16bit_model issue (#3023)
Co-authored-by: Qinghuan Rao <qinghuanrao@microsoft.com>
2023-03-23 09:45:42 -07:00
1286e374ab Softmax Scheduling Cleanup (#3046)
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2023-03-22 08:45:06 -07:00
27e1b02deb Remove bf16 from inference config dtype enum (#3010)
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2023-03-22 08:44:09 -07:00
871c8a3f5d fix return prev key and value, added strides to from_blob (#2828)
Co-authored-by: Mor Zusman <morz@ai21.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2023-03-22 08:43:35 -07:00
36677588b6 [CI] follow-up fixes (#3072) 2023-03-21 15:38:21 -07:00
9ea0fdc2ce Assert mp_size is factor of model dimensions (#2891)
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2023-03-21 14:50:43 -07:00
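The assertion added in #2891 guards against invalid tensor-parallel configurations. A minimal sketch of the idea (the function name and signature are illustrative, not DeepSpeed's actual API): every dimension that gets sharded across ranks must be evenly divisible by `mp_size`.

```python
def assert_mp_divisible(hidden_size, num_heads, mp_size):
    # Tensor parallelism shards weight matrices and attention heads across
    # mp_size ranks, so each sharded dimension must divide evenly.
    for name, dim in (("hidden_size", hidden_size),
                      ("num_attention_heads", num_heads)):
        if dim % mp_size != 0:
            raise ValueError(
                f"mp_size={mp_size} must be a factor of {name}={dim}")

assert_mp_divisible(4096, 32, 8)   # 4096/8 = 512, 32/8 = 4: valid sharding
```

Failing fast with a clear message beats the silent shape mismatch a user would otherwise hit deep inside a matmul.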
4e0686233a Several fixes to unblock CI (#3047)
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2023-03-21 13:45:25 -07:00
b38b3036dd [docs] add MCR-DL paper to readme/docs (#3066) 2023-03-21 10:19:16 -07:00
f1e4fb0b87 Fix Broken Links (#3048) v0.8.3 2023-03-17 11:30:24 -07:00
bbfd0a6a3e update email info 2023-03-15 14:16:26 -07:00
ac2c9ffae4 Improve loss overflow logs (#3008)
* Improve overflow logs

* Trigger CI

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2023-03-15 09:33:15 -04:00
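The overflow-logging commit above improves what gets reported when mixed-precision training hits a gradient overflow. A hedged sketch of the underlying check (helper names are hypothetical; DeepSpeed's real loss scaler is more involved):

```python
import math

def has_overflow(grad_values):
    # Overflow in mixed-precision training shows up as NaN or inf gradients;
    # detecting it triggers a loss-scale reduction instead of a bad update.
    return any(math.isnan(g) or math.isinf(g) for g in grad_values)

def overflow_message(step, cur_scale, new_scale):
    # Reporting both the old and new loss scale makes the log actionable.
    return (f"[step {step}] gradient overflow detected, "
            f"reducing loss scale from {cur_scale} to {new_scale}")
```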
94f7da26b6 Convert model parameters from generator to list. (#3017)
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2023-03-15 07:45:20 -04:00
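The generator-to-list conversion in #3017 addresses a classic Python pitfall: a generator can only be consumed once, so any code that iterates parameters twice silently sees nothing the second time. A self-contained illustration (the stand-in function is hypothetical):

```python
def parameters():
    # Stand-in for torch.nn.Module.parameters(), which returns a generator.
    yield from [1.0, 2.0, 3.0]

gen = parameters()
first_pass = sum(1 for _ in gen)    # sees all 3 parameters
second_pass = sum(1 for _ in gen)   # sees 0 -- the generator is exhausted

params = list(parameters())         # materialize once, iterate as often as needed
```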
50a49e42fb [logger] implement warning_once (#3021)
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2023-03-14 22:29:00 -07:00
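`warning_once` (#3021) is a common logging pattern: emit each distinct warning a single time regardless of how often the code path runs. One minimal way to get the deduplication, sketched here with `functools.lru_cache` (this is the general pattern, not necessarily DeepSpeed's implementation; the list-capturing handler exists only to make the behavior observable):

```python
import functools
import logging

logger = logging.getLogger("demo")
emitted = []

class ListHandler(logging.Handler):
    def emit(self, record):
        emitted.append(record.getMessage())

logger.addHandler(ListHandler())
logger.setLevel(logging.WARNING)

@functools.lru_cache(maxsize=None)
def warning_once(message):
    # lru_cache memoizes on the message text, so each distinct warning
    # is emitted exactly once no matter how often it is called.
    logger.warning(message)

warning_once("flag X is deprecated")
warning_once("flag X is deprecated")   # suppressed by the cache
warning_once("flag Y is deprecated")
```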
d7c925e4e8 adding attribute checks for bf opt with zero (#3022)
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2023-03-14 20:13:08 -07:00
e355863b83 Update torch_checkpoint_engine.py (#3019) 2023-03-14 18:41:39 -04:00
4292e8c59a [docs] add new paper to readme/docs (#3018)
Co-authored-by: Zhewei Yao <zheweiyao@gmail.com>
2023-03-14 12:46:48 -07:00
b528f50e3d Fix buffer size for pipeline parallel and communication schedule (#2862)
* fix buffer size for pipeline parallel (#2800)

* improve explanation of buffer size for pipeline parallelism

Co-authored-by: Jae-Won Chung <jwnchung@umich.edu>

* fix format of comment

---------

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Jae-Won Chung <jwnchung@umich.edu>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2023-03-13 10:48:10 -07:00
43d58d99eb ckpt: create directories in checkpoint_engine (#2988)
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2023-03-13 10:52:48 -04:00
3798e60519 Fix Meta Tensor checkpoint load for OPT models (#2990)
This PR fixes Meta Tensor checkpoint loading for OPT models where the SD keys start with `model.`.
2023-03-10 11:45:36 -08:00
457850dc5a [zero] prevent poor configs from running w. zero-offload (#2971) 2023-03-09 09:57:51 -08:00
58a4a4d4c1 Fix issue between our abstract accelerator and colossalai's version of op_builder (#2963)
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2023-03-08 09:55:41 -08:00
6379defaef bug fix for skipping mbs (#2171)
Co-authored-by: Rajhans Samdani <rajhans@gmail.com>
2023-03-08 05:27:56 +00:00
d58b4df39f bump to 0.8.3 2023-03-07 10:16:32 -08:00
db15ef578a deepspeed.init_distributed() support for TCP protocols (#2905)
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
v0.8.2
2023-03-07 09:42:22 -08:00
0acf7e9c48 [RFC] add device abstraction to allow other device than CUDA be used (#2221)
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2023-03-07 09:40:17 -08:00
80d8fcbdb3 Improve overflow handling (#2944)
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2023-03-06 18:24:25 -08:00
87eaf8f99a Check for local CUDA graphs when enable_cuda_graph=True (#2941) 2023-03-06 17:38:50 -08:00
2ede0d942a AutoTP Assert Kernel Injection Support (#2939)
* check kernel injection supported models

* Clarify why user should use kernel injection
2023-03-06 14:23:55 -08:00
4ae3a3da0d TP unsupported models and assertions (#2810)
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2023-03-01 14:18:13 -08:00
8d53ac0cd3 Add MPICH Multinode Runner (#2839)
* MPICH support

* MPICH changes

* MPICH changes

* MPICH changes

* MPICH changes

* accelerator runtime modifications

* Accelerator runtime changes

* Accelerator runtime modifications

* Remove redundant print from single node

* Move hostfile to tmp

* Code cleanup for MPICH class

* Code cleanup, rm whitespace

* Removing mpiexec environment check details

* Not needed tmp hostfile as pass directly

* Remove debugging comments

* rm print statement

* Revert comm changes as WA not needed

* Use MPICHRunner name for class

* Use MPICHRunner as class name

* No need to use args.force_multi and args.launcher .

This should be set in deepspeedexamples gpt-3.6b .sh script as:
$launcher=MPICH
run_cmd=" deepspeed  --hostfile=${hostfile_ds}  --num_nodes ${NUM_WORKERS} --num_gpus ${NUM_GPUS_PER_WORKER} --launcher=${launcher} --force_multi pretrain_gpt2.py $@ ${gpt_options}"

* Adhere to code pattern

* Rm empty lines in MPICHRunner class

* Uncomment check for num nodes and workers when used hostfile_deepspeed in gpt-3.6b.sh

* pass MPICH hostfile through launcher_args in gpt-3.6b.sh

* Clean code and remove args hostfile

* fix merge

* fix merge

---------

Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>

* clean up and fix format

* add ut

---------

Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2023-03-01 12:04:07 -05:00
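The MPICH runner added in #2839 plugs a new launcher into the `deepspeed` CLI. At its core, such a runner assembles an `mpiexec` command line from the node/process counts and hostfile. A hypothetical sketch (function name and argument handling are illustrative, not the actual `MPICHRunner` code):

```python
def build_mpich_cmd(num_nodes, procs_per_node, hostfile, script, script_args):
    # MPICH's Hydra launcher: -n gives the total rank count, -ppn the
    # processes per node, -hostfile the list of participating machines.
    cmd = ["mpiexec", "-n", str(num_nodes * procs_per_node),
           "-ppn", str(procs_per_node)]
    if hostfile:
        cmd += ["-hostfile", hostfile]
    return cmd + ["python", script] + list(script_args)

cmd = build_mpich_cmd(2, 4, "/tmp/hostfile", "pretrain_gpt2.py", ["--lr", "1e-4"])
```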
91d7090e47 Fixes AttributeError in #2853 (#2854)
Updates `deepspeed/monitor/monitor.py`
to instantiate objects with correct configs

Relevant issue:
https://github.com/microsoft/DeepSpeed/issues/2853

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2023-03-01 06:06:19 -05:00
17fa0876ad Always convert input mask to half (#2851) 2023-02-28 21:36:21 +00:00
9886d6d9e0 Fix CPUAdam for when vendor_id_raw is not provided (#2836)
* #1213: Fix CPUAdam for when `vendor_id_raw` is not provided

* formatting (yapf) fix

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2023-02-28 18:41:49 +00:00
dc01cee5ca using container when loading inference checkpoints (#2875)
This PR updates the replace_fn function when loading inference checkpoints. The container will now be passed to the load_model_with_checkpoint() so we can call load_params() from there. load_params() is also updated to access the variables in the policy.
2023-02-28 14:59:23 +00:00
f1d2a15b50 better eval sampler (#2907)
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2023-02-28 06:10:19 +00:00
541e423ae6 Enable tensor fragments for zero 2 & 3 (#2727)
* Enable tensor fragments for zero 2

* Update deepspeed/utils/tensor_fragment.py

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>

* Update deepspeed/utils/tensor_fragment.py

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>

* Support offload

* Support multi-gpu

* Cleanup

* WIP

* Update deepspeed/runtime/zero/stage3.py

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>

* Support padding

* Update deepspeed/runtime/zero/stage3.py

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>

* z3 optimizer state support; aligned api

* Support frozen z3 params

* Unit tests

* Check NVMe offload capability

* Formatting

* Docs

* More docs

* More docs

* Update docs/code-docs/source/zero3.rst

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>

* More docs

* Update docs/code-docs/source/zero3.rst

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>

* More docs

* More docs

* Update docs/code-docs/source/zero3.rst

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>

* Update deepspeed/utils/tensor_fragment.py

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>

* More docs

* Support unsharded fp32 grad

* Remove debug prints

* Fix off-by-one detection of empty grads

* Update deepspeed/utils/tensor_fragment.py

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>

* Update deepspeed/utils/tensor_fragment.py

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>

* Update deepspeed/utils/tensor_fragment.py

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>

* Update deepspeed/runtime/zero/stage3.py

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>

* Fix off-by-one error

* Skip ranks with no gradient data

* Formatting

* Add license

* Fix license

---------

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
2023-02-27 23:40:49 -05:00
da84e60d98 add missing license info to top of all source code (#2889)
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
Co-authored-by: Conglong Li <conglong.li@gmail.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2023-02-27 11:20:41 -08:00
8710f0514e Reduce I/O size (#2814) 2023-02-24 10:59:44 -05:00
d3de737550 Remove deprecated torch._six imports (#2863)
* Remove deprecated `torch._six` imports

Closes #2845.

* Support older versions of PyTorch as well.

---------

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2023-02-23 18:51:02 -08:00
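The `torch._six` removal in #2863 had to keep older PyTorch versions working too ("Support older versions of PyTorch as well"). The usual pattern for that is a guarded import with fallbacks, sketched here (the exact version cutoffs and fallback chain are assumptions for illustration):

```python
try:
    from torch._six import inf      # private module, removed in newer PyTorch
except ImportError:
    try:
        from torch import inf       # newer releases expose inf on torch itself
    except ImportError:
        from math import inf        # no torch at all: plain Python infinity
```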
b47c592ae2 AutoTP tutorial web formatting and news (#2883)
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2023-02-23 13:19:14 -08:00
7e77cf710a Check device count before running dist tests (#2799)
* Check device count before running dist tests

* fixing format for "Check device count before running dist tests"

* Check device count against max world size

* Check GPU count before launching dist tests

* double-check GPU actually exists

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
2023-02-23 12:04:49 -08:00
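The device-count checks in #2799 keep distributed tests from launching with more ranks than the machine has accelerators, which would otherwise hang or fail in CI. A minimal sketch of the decision (the helper is hypothetical; the real change wires this into the pytest fixtures):

```python
def skip_reason(available_devices, required_world_size):
    # A distributed test asking for more ranks than there are accelerators
    # cannot run; returning a reason string lets the test harness skip it.
    if available_devices < required_world_size:
        return (f"requires {required_world_size} devices, "
                f"only {available_devices} available")
    return None
```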
859d7c92ab Automatic Tensor Parallelism Blog Links (#2877)
* Modify table for compatible web format

* Add tutorial links to navigation

* Add news bit to main readme

* Update docs/_tutorials/automatic-tensor-parallelism.md

Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>

---------

Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
2023-02-23 04:41:15 -08:00
81b4d5db06 Make z3 respect comm dtype (#2807)
* Make z3 respect comm dtype

* Support fp32 comm dtype

* Remove obsolete assert

* Code cleanup
2023-02-22 06:50:57 -05:00
7c99def0f0 Data efficiency library update (#2866)
* data efficiency library update

* data efficiency library update

* data efficiency update

* data efficiency update
2023-02-21 14:43:29 -08:00