Commit Graph

1396 Commits

Author SHA1 Message Date
91d63e0228 update formatter version and style settings (#3098) 2023-03-27 07:55:19 -04:00
b3ec1c9712 Move cuda check into utils (#3074)
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2023-03-24 04:22:48 +00:00
090d49e79f pre-commit check for torch.cuda in code (#2981)
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2023-03-23 20:29:54 -07:00
e80ae08886 Empty ZeRO3 partition cache (#3060)
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2023-03-23 17:15:34 -07:00
5cdf35935d Goodbye Torch 1.8 (#3082)
* bump torch18 -> torch19
* fix gptj

---------

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2023-03-23 11:43:28 -07:00
5c2a81c2c1 allow list (#3042)
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2023-03-23 09:50:01 -07:00
a78d6b89e0 Fix nebula in save_16bit_model issue (#3023)
Co-authored-by: Qinghuan Rao <qinghuanrao@microsoft.com>
2023-03-23 09:45:42 -07:00
1286e374ab Softmax Scheduling Cleanup (#3046)
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2023-03-22 08:45:06 -07:00
27e1b02deb Remove bf16 from inference config dtype enum (#3010)
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2023-03-22 08:44:09 -07:00
871c8a3f5d fix return prev key and value, added strides to from_blob (#2828)
Co-authored-by: Mor Zusman <morz@ai21.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2023-03-22 08:43:35 -07:00
36677588b6 [CI] follow-up fixes (#3072) 2023-03-21 15:38:21 -07:00
9ea0fdc2ce Assert mp_size is factor of model dimensions (#2891)
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2023-03-21 14:50:43 -07:00
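The assertion added in #2891 guards against invalid tensor-parallel configurations. A minimal sketch of the idea (the function name and signature are illustrative, not DeepSpeed's actual API): every dimension that gets sharded across ranks must be evenly divisible by `mp_size`.

```python
def assert_mp_divisible(hidden_size, num_heads, mp_size):
    # Tensor parallelism shards weight matrices and attention heads across
    # mp_size ranks, so each sharded dimension must divide evenly.
    for name, dim in (("hidden_size", hidden_size),
                      ("num_attention_heads", num_heads)):
        if dim % mp_size != 0:
            raise ValueError(
                f"mp_size={mp_size} must be a factor of {name}={dim}")

assert_mp_divisible(4096, 32, 8)   # 4096/8 = 512, 32/8 = 4: valid sharding
```

Failing fast with a clear message beats the silent shape mismatch a user would otherwise hit deep inside a matmul.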
4e0686233a Several fixes to unblock CI (#3047)
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2023-03-21 13:45:25 -07:00
b38b3036dd [docs] add MCR-DL paper to readme/docs (#3066) 2023-03-21 10:19:16 -07:00
f1e4fb0b87 Fix Broken Links (#3048) v0.8.3 2023-03-17 11:30:24 -07:00
bbfd0a6a3e update email info 2023-03-15 14:16:26 -07:00
ac2c9ffae4 Improve loss overflow logs (#3008)
* Improve overflow logs

* Trigger CI

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2023-03-15 09:33:15 -04:00
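The overflow-logging commit above improves what gets reported when mixed-precision training hits a gradient overflow. A hedged sketch of the underlying check (helper names are hypothetical; DeepSpeed's real loss scaler is more involved):

```python
import math

def has_overflow(grad_values):
    # Overflow in mixed-precision training shows up as NaN or inf gradients;
    # detecting it triggers a loss-scale reduction instead of a bad update.
    return any(math.isnan(g) or math.isinf(g) for g in grad_values)

def overflow_message(step, cur_scale, new_scale):
    # Reporting both the old and new loss scale makes the log actionable.
    return (f"[step {step}] gradient overflow detected, "
            f"reducing loss scale from {cur_scale} to {new_scale}")
```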
94f7da26b6 Convert model parameters from generator to list. (#3017)
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2023-03-15 07:45:20 -04:00
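The generator-to-list conversion in #3017 addresses a classic Python pitfall: a generator can only be consumed once, so any code that iterates parameters twice silently sees nothing the second time. A self-contained illustration (the stand-in function is hypothetical):

```python
def parameters():
    # Stand-in for torch.nn.Module.parameters(), which returns a generator.
    yield from [1.0, 2.0, 3.0]

gen = parameters()
first_pass = sum(1 for _ in gen)    # sees all 3 parameters
second_pass = sum(1 for _ in gen)   # sees 0 -- the generator is exhausted

params = list(parameters())         # materialize once, iterate as often as needed
```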
50a49e42fb [logger] implement warning_once (#3021)
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2023-03-14 22:29:00 -07:00
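`warning_once` (#3021) is a common logging pattern: emit each distinct warning a single time regardless of how often the code path runs. One minimal way to get the deduplication, sketched here with `functools.lru_cache` (this is the general pattern, not necessarily DeepSpeed's implementation; the list-capturing handler exists only to make the behavior observable):

```python
import functools
import logging

logger = logging.getLogger("demo")
emitted = []

class ListHandler(logging.Handler):
    def emit(self, record):
        emitted.append(record.getMessage())

logger.addHandler(ListHandler())
logger.setLevel(logging.WARNING)

@functools.lru_cache(maxsize=None)
def warning_once(message):
    # lru_cache memoizes on the message text, so each distinct warning
    # is emitted exactly once no matter how often it is called.
    logger.warning(message)

warning_once("flag X is deprecated")
warning_once("flag X is deprecated")   # suppressed by the cache
warning_once("flag Y is deprecated")
```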
d7c925e4e8 adding attribute checks for bf opt with zero (#3022)
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2023-03-14 20:13:08 -07:00
e355863b83 Update torch_checkpoint_engine.py (#3019) 2023-03-14 18:41:39 -04:00
4292e8c59a [docs] add new paper to readme/docs (#3018)
Co-authored-by: Zhewei Yao <zheweiyao@gmail.com>
2023-03-14 12:46:48 -07:00
b528f50e3d Fix buffer size for pipeline parallel and communication schedule (#2862)
* fix buffer size for pipeline parallel (#2800)

* improve explanation of buffer size for pipeline parallelism

Co-authored-by: Jae-Won Chung <jwnchung@umich.edu>

* fix format of comment

---------

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Jae-Won Chung <jwnchung@umich.edu>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2023-03-13 10:48:10 -07:00
43d58d99eb ckpt: create directories in checkpoint_engine (#2988)
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2023-03-13 10:52:48 -04:00
3798e60519 Fix Meta Tensor checkpoint load for OPT models (#2990)
This PR fixes Meta Tensor checkpoint loading for OPT models where the SD keys start with `model.`.
2023-03-10 11:45:36 -08:00
457850dc5a [zero] prevent poor configs from running w. zero-offload (#2971) 2023-03-09 09:57:51 -08:00
58a4a4d4c1 Fix issue between our abstract accelerator and colossalai's version of op_builder (#2963)
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2023-03-08 09:55:41 -08:00
6379defaef bug fix for skipping mbs (#2171)
Co-authored-by: Rajhans Samdani <rajhans@gmail.com>
2023-03-08 05:27:56 +00:00
d58b4df39f bump to 0.8.3 2023-03-07 10:16:32 -08:00
db15ef578a deepspeed.init_distributed() support for TCP protocols (#2905)
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
v0.8.2
2023-03-07 09:42:22 -08:00
0acf7e9c48 [RFC] add device abstraction to allow other device than CUDA be used (#2221)
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2023-03-07 09:40:17 -08:00
80d8fcbdb3 Improve overflow handling (#2944)
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2023-03-06 18:24:25 -08:00
87eaf8f99a Check for local CUDA graphs when enable_cuda_graph=True (#2941) 2023-03-06 17:38:50 -08:00
2ede0d942a AutoTP Assert Kernel Injection Support (#2939)
* check kernel injection supported models

* Clarify why user should use kernel injection
2023-03-06 14:23:55 -08:00
4ae3a3da0d TP unsupported models and assertions (#2810)
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2023-03-01 14:18:13 -08:00
8d53ac0cd3 Add MPICH Multinode Runner (#2839)
* MPICH support

* MPICH changes

* MPICH changes

* MPICH changes

* MPICH changes

* accelerator runtime modifications

* Accelerator runtime changes

* Accelerator runtime modifications

* Remove redundant print from single node

* Move hostfile to tmp

* Code cleanup for MPICH class

* Code cleanup, rm whitespace

* Removing mpiexec environment check details

* Not needed tmp hostfile as pass directly

* Remove debugging comments

* rm print statement

* Revert comm changes as WA not needed

* Use MPICHRunner name for class

* Use MPICHRunner as class name

* No need to use args.force_multi and args.launcher .

This should be set in deepspeedexamples gpt-3.6b .sh script as:
$launcher=MPICH
run_cmd=" deepspeed  --hostfile=${hostfile_ds}  --num_nodes ${NUM_WORKERS} --num_gpus ${NUM_GPUS_PER_WORKER} --launcher=${launcher} --force_multi pretrain_gpt2.py $@ ${gpt_options}"

* Adhere to code pattern

* Rm empty lines in MPICHRunner class

* Uncomment check for num nodes and workers when used hostfile_deepspeed in gpt-3.6b.sh

* pass MPICH hostfile through launcher_args in gpt-3.6b.sh

* Clean code and remove args hostfile

* fix merge

* fix merge

---------

Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>

* clean up and fix format

* add ut

---------

Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2023-03-01 12:04:07 -05:00
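The MPICH runner added in #2839 plugs a new launcher into the `deepspeed` CLI. At its core, such a runner assembles an `mpiexec` command line from the node/process counts and hostfile. A hypothetical sketch (function name and argument handling are illustrative, not the actual `MPICHRunner` code):

```python
def build_mpich_cmd(num_nodes, procs_per_node, hostfile, script, script_args):
    # MPICH's Hydra launcher: -n gives the total rank count, -ppn the
    # processes per node, -hostfile the list of participating machines.
    cmd = ["mpiexec", "-n", str(num_nodes * procs_per_node),
           "-ppn", str(procs_per_node)]
    if hostfile:
        cmd += ["-hostfile", hostfile]
    return cmd + ["python", script] + list(script_args)

cmd = build_mpich_cmd(2, 4, "/tmp/hostfile", "pretrain_gpt2.py", ["--lr", "1e-4"])
```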
91d7090e47 Fixes AttributeError in #2853 (#2854)
Updates `deepspeed/monitor/monitor.py`
to instantiate objects with correct configs

Relevant issue:
https://github.com/microsoft/DeepSpeed/issues/2853

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2023-03-01 06:06:19 -05:00
17fa0876ad Always convert input mask to half (#2851) 2023-02-28 21:36:21 +00:00
9886d6d9e0 Fix CPUAdam for when vendor_id_raw is not provided (#2836)
* #1213: Fix CPUAdam for when `vendor_id_raw` is not provided

* formatting (yapf) fix

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2023-02-28 18:41:49 +00:00
dc01cee5ca using container when loading inference checkpoints (#2875)
This PR updates the replace_fn function when loading inference checkpoints. The container will now be passed to the load_model_with_checkpoint() so we can call load_params() from there. load_params() is also updated to access the variables in the policy.
2023-02-28 14:59:23 +00:00
f1d2a15b50 better eval sampler (#2907)
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2023-02-28 06:10:19 +00:00
541e423ae6 Enable tensor fragments for zero 2 & 3 (#2727)
* Enable tensor fragments for zero 2

* Update deepspeed/utils/tensor_fragment.py

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>

* Update deepspeed/utils/tensor_fragment.py

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>

* Support offload

* Support multi-gpu

* Cleanup

* WIP

* Update deepspeed/runtime/zero/stage3.py

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>

* Support padding

* Update deepspeed/runtime/zero/stage3.py

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>

* z3 optimizer state support; aligned api

* Support frozen z3 params

* Unit tests

* Check NVMe offload capability

* Formatting

* Docs

* More docs

* More docs

* Update docs/code-docs/source/zero3.rst

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>

* More docs

* Update docs/code-docs/source/zero3.rst

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>

* More docs

* More docs

* Update docs/code-docs/source/zero3.rst

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>

* Update deepspeed/utils/tensor_fragment.py

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>

* More docs

* Support unsharded fp32 grad

* Remove debug prints

* Fix off-by-one detection of empty grads

* Update deepspeed/utils/tensor_fragment.py

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>

* Update deepspeed/utils/tensor_fragment.py

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>

* Update deepspeed/utils/tensor_fragment.py

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>

* Update deepspeed/runtime/zero/stage3.py

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>

* Fix off-by-one error

* Skip ranks with no gradient data

* Formatting

* Add license

* Fix license

---------

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
2023-02-27 23:40:49 -05:00
da84e60d98 add missing license info to top of all source code (#2889)
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
Co-authored-by: Conglong Li <conglong.li@gmail.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2023-02-27 11:20:41 -08:00
8710f0514e Reduce I/O size (#2814) 2023-02-24 10:59:44 -05:00
d3de737550 Remove deprecated torch._six imports (#2863)
* Remove deprecated `torch._six` imports

Closes #2845.

* Support older versions of PyTorch as well.

---------

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2023-02-23 18:51:02 -08:00
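The `torch._six` removal in #2863 had to keep older PyTorch versions working too ("Support older versions of PyTorch as well"). The usual pattern for that is a guarded import with fallbacks, sketched here (the exact version cutoffs and fallback chain are assumptions for illustration):

```python
try:
    from torch._six import inf      # private module, removed in newer PyTorch
except ImportError:
    try:
        from torch import inf       # newer releases expose inf on torch itself
    except ImportError:
        from math import inf        # no torch at all: plain Python infinity
```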
b47c592ae2 AutoTP tutorial web formatting and news (#2883)
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2023-02-23 13:19:14 -08:00
7e77cf710a Check device count before running dist tests (#2799)
* Check device count before running dist tests

* fixing format for "Check device count before running dist tests"

* Check device count against max world size

* Check GPU count before launching dist tests

* double-check GPU actually exists

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
2023-02-23 12:04:49 -08:00
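The device-count checks in #2799 keep distributed tests from launching with more ranks than the machine has accelerators, which would otherwise hang or fail in CI. A minimal sketch of the decision (the helper is hypothetical; the real change wires this into the pytest fixtures):

```python
def skip_reason(available_devices, required_world_size):
    # A distributed test asking for more ranks than there are accelerators
    # cannot run; returning a reason string lets the test harness skip it.
    if available_devices < required_world_size:
        return (f"requires {required_world_size} devices, "
                f"only {available_devices} available")
    return None
```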
859d7c92ab Automatic Tensor Parallelism Blog Links (#2877)
* Modify table for compatible web format

* Add tutorial links to navigation

* Add news bit to main readme

* Update docs/_tutorials/automatic-tensor-parallelism.md

Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>

---------

Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
2023-02-23 04:41:15 -08:00
81b4d5db06 Make z3 respect comm dtype (#2807)
* Make z3 respect comm dtype

* Support fp32 comm dtype

* Remove obsolete assert

* Code cleanup
2023-02-22 06:50:57 -05:00
7c99def0f0 Data efficiency library update (#2866)
* data efficiency library update

* data efficiency library update

* data efficiency update

* data efficiency update
2023-02-21 14:43:29 -08:00