**Auto-generated PR to update version.txt after a DeepSpeed release**
Released version - 0.16.3
Author - @loadams
Co-authored-by: loadams <loadams@users.noreply.github.com>
Following the discussion in
[PR-6670](https://github.com/microsoft/DeepSpeed/pull/6670), an explicit
upcast is much more efficient than an implicit one, so this PR replaces
the implicit upcast with an explicit one (a minimal sketch of the
difference follows the table below).
The results on a 3B model are shown below:
| Option | BWD (ms) | Speedup |
|----------------|----------|------|
| Before PR-6670 | 25603.30 | 1x |
| After PR-6670 | 1174.31 | 21.8x |
| After this PR | 309.2 | 82.8x |
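As a rough, hypothetical illustration of the two patterns (tensor names are invented; this is not the code touched by the PR), the implicit form relies on dtype conversion inside the op, while the explicit form converts up front:

```python
import torch

lp_grad = torch.randn(1024, dtype=torch.bfloat16)  # low-precision gradient
hp_acc = torch.zeros(1024, dtype=torch.float32)    # high-precision accumulator

# Implicit upcast: copy_ converts bf16 -> fp32 inside the op.
hp_acc.copy_(lp_grad)

# Explicit upcast: convert first, then copy fp32 -> fp32
# (the measurements above are for the explicit form).
hp_acc.copy_(lp_grad.to(torch.float32))
```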
### Description
This pull request removes the redundant installation of `pandas` from
the `Dockerfile`.
It was previously declared twice, and this update eliminates the
duplicate entry, improving the clarity and maintainability of the
`Dockerfile`.
See `docker/Dockerfile` at commit `018ece5af2`, lines L124 and L135.
### Changes
Removed the duplicate pandas installation line from the `RUN pip
install` command.
Using the `keep_module_on_host` config var lets us control whether the
checkpoint weights loaded into model parameters are moved to the device
or stay on the host.
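A minimal sketch of how the option might be passed (assuming it is exposed as a keyword argument of `deepspeed.init_inference()`; everything other than `keep_module_on_host` is illustrative):

```python
import torch
import deepspeed

# model = ...  # a torch.nn.Module whose checkpoint weights have been loaded

engine = deepspeed.init_inference(
    model,
    dtype=torch.bfloat16,
    tensor_parallel={"tp_size": 2},
    keep_module_on_host=True,  # keep the loaded weights on the host instead of moving them to the device
)
```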
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
In some scenarios, some of the optimization flags for the ops compiler
for HPU can cause a significant performance degradation. Remove the
flags until the issue is resolved.
This PR updates the DeepSpeed `OPTEmbedding` forward function to include
a new `positions_ids` argument.
---------
Co-authored-by: Logan Adams <loadams@microsoft.com>
This adds tensor parallelism (TP) support for Deepseek, including
Multi-Head Latent Attention (MLA) and MoE.
For MLA TP, we need to skip two low-rank layers ("q_a_proj" and
"kv_a_proj_with_mqa").
For Deepseek MoE, tp_parse sees the MoE layer name as
`layer_idx.down_proj`, which makes it hard to add a policy, so we add the
`down_proj` layer to `all_reduce_linears` by default.
This fixes some errors when installing DeepSpeed on Windows in the
presence of Triton.
I think we can assume we don't need the warning about NFS on Windows for
now. I did not look into how to detect an NFS path on Windows, but we
could detect a UNC path starting with `\\` if needed.
`os.rename` does not allow overwriting an existing file on Windows, while
`os.replace` is more cross-platform.
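As a quick illustration of the difference (standard library behavior, not DeepSpeed-specific code):

```python
import os

# Write to a temp file, then atomically move it into place.
with open("cache.lock.tmp", "w") as f:
    f.write("pid=1234\n")

# os.rename("cache.lock.tmp", "cache.lock") raises FileExistsError on Windows
# if the destination already exists; os.replace overwrites it on all platforms.
os.replace("cache.lock.tmp", "cache.lock")
```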
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
This PR checks that the `transformers` version is `<= 4.43.4` in the
BLOOM container for inference v1, due to breaking changes in
`transformers > 4.43.4`.
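A hypothetical sketch of such a version guard (the actual check may live elsewhere in the container/test setup):

```python
import transformers
from packaging import version

if version.parse(transformers.__version__) > version.parse("4.43.4"):
    raise RuntimeError("BLOOM inference v1 requires transformers <= 4.43.4")
```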
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Depends on https://github.com/microsoft/DeepSpeed/pull/6649.
When performing fetch/release operations on Z3 leaf modules, the loop
time is excessively long for fine-grained modules. Compared to non-leaf
modules, Z3 leaf modules may include a much larger number of parameters.
Although each loop iteration does not take much time, the overall loop
length can be significant.

**The fetch time is impacted by:**
- Post-allgather operations (narrow, slice, cat; difficult to avoid)
- Memory pressure (record_stream / fetch event creation & sync)

**The release time is impacted by:**
- slice
- Free-parameter record_stream
Since each parameter in a fine-grained leaf module is relatively small,
we can treat the parameters within each leaf module as a single entity
when handling memory pressure (see the sketch below). This approach can
approximately halve the CPU time required for fetch/release operations.
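A rough, hypothetical illustration of the idea (function and variable names are invented; this is not the actual DeepSpeed code): record the stream once on the leaf module's coalesced buffer instead of once per parameter.

```python
import torch

def release_leaf_module(params, coalesced_buffer, stream):
    """Release the parameters of a Z3 leaf module as one unit.

    Instead of calling record_stream() for every small parameter:
        for p in params:
            p.data.record_stream(stream)
    record the stream once on the coalesced buffer the parameter
    views were sliced from.
    """
    coalesced_buffer.record_stream(stream)
    for p in params:
        # Drop the view into the coalesced buffer; the buffer itself is
        # freed safely once the recorded stream finishes using it.
        p.data = torch.empty(0, dtype=p.dtype, device=p.device)
```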
---------
Co-authored-by: Ma, Guokai <guokai.ma@gmail.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
- Removed the try/except from the `__init__` file in fp_quantizer and added a
single entry point instead
- Renamed the file fp8_gemm to fp8_gemm_triton, and the function matmul_fp8
to matmul_fp8_triton
- Added a new entry point fp8_gemm containing matmul_fp8; if the system
supports Triton it calls the Triton implementation, otherwise it calls
the fallback (see the sketch below)
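A hypothetical sketch of the dispatch pattern described above (module paths, the fallback name, and the argument list are assumptions, not the actual DeepSpeed code):

```python
def matmul_fp8(inp, weight, scale, quantization_group_size):
    """Single entry point: use the Triton kernel when available, else a fallback."""
    try:
        # Assumed import paths; the real package layout may differ.
        from fp8_gemm_triton import matmul_fp8_triton
        return matmul_fp8_triton(inp, weight, scale, quantization_group_size)
    except ImportError:
        from fp8_gemm_fallback import matmul_fp8_fallback
        return matmul_fp8_fallback(inp, weight, scale, quantization_group_size)
```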
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Hi, I found an error when using DeepSpeed with ROCm torch:
```
torch_cuda_version = ".".join(torch.version.cuda.split('.')[:2])
```
This raises an `AttributeError` when `torch.version.cuda` is `None`. It
occurs because the CUDA version in rocm-torch's `version.py` is always
set to `None`, leading to potential runtime errors in environments where
ROCm is being used.
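A minimal sketch of the kind of guard that avoids the crash (illustrative; the actual fix in the PR may differ):

```python
import torch

# torch.version.cuda is None on ROCm builds (torch.version.hip is set instead),
# so guard before splitting the version string.
if torch.version.cuda is not None:
    torch_cuda_version = ".".join(torch.version.cuda.split('.')[:2])
else:
    torch_cuda_version = "0.0"
```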
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
**Problem**
There's an edge-case in DeepSpeed, where if all three of the following
are true:
1. Deepspeed activation checkpointing is applied
2. The user passes `checkpointable_layers` (e.g.
`megatron/model/gpt2_model.py#L175` at commit `f532580567`)
3. The user's model class contains `GPT2ModelPipe` or `GPTModelPipe`
Then the `checkpointable_layers` will not be activation checkpointed.
**Reason**
This is because in the current logic, `_is_checkpointable` will
short-circuit to just return layers matching
`ParallelTransformerLayerPipe` in the case of `self.__class__.__name__
in ('GPTModelPipe', 'GPT2ModelPipe')`. See
`deepspeed/runtime/pipe/module.py#L653` at commit `da771ed42e`.
**Proposed Fixes**
I think that `checkpointable_layers` should always be checked for, and I
added logic to this effect (see the sketch below). I also found the
documentation for `checkpointable_layers` confusing and contradictory,
so I updated the docstring. Lastly, I added a unit test for
`checkpointable_layers`.
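An illustrative sketch of the proposed check order (not necessarily the exact code added by the PR):

```python
import torch

def _is_checkpointable(self, funcs):
    # An explicitly provided checkpointable_layers list should always win.
    if self.checkpointable_layers is not None:
        return all(f.__class__.__name__ in self.checkpointable_layers for f in funcs)

    # Only fall back to the GPT-specific heuristic when no list was given.
    if self.__class__.__name__ in ('GPTModelPipe', 'GPT2ModelPipe'):
        return all('ParallelTransformerLayerPipe' in f.__class__.__name__ for f in funcs)

    # Otherwise, checkpoint any stage whose layers own parameters.
    params = [f.parameters() for f in funcs if isinstance(f, torch.nn.Module)]
    return any(len(list(p)) > 0 for p in params)
```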
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
* This commit addresses DeepSpeed issue
[#6718](https://github.com/microsoft/DeepSpeed/issues/6718)
* The existing code has been using the grad_acc node hook to reduce
param grads.
Constructs such as `param.data = replicated_tensor.data` used in
`allgather_params(..)`
are compiled into `param.set()`, which prevents the hook assigned to the
grad_acc node from being called.
* Starting from PyTorch 2.1 there is a new and robust hook API on the
param itself: `param.register_post_accumulate_grad_hook(..)`
* This commit makes use of the proper API depending on the PyTorch
version (see the sketch below)
* It also disables compile for PyTorch versions < 2.1
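A hypothetical sketch of the version-dependent registration (the helper name and overall structure are invented, not the exact DeepSpeed code):

```python
import torch
from packaging import version

def register_grad_ready_hook(param: torch.nn.Parameter, reduce_fn):
    if version.parse(torch.__version__) >= version.parse("2.1"):
        # Hook attached to the parameter itself; fires after the gradient
        # has been accumulated into param.grad. Robust under torch.compile.
        param.register_post_accumulate_grad_hook(lambda p: reduce_fn(p))
    else:
        # Legacy approach: hook on the grad accumulator node of the autograd
        # graph, obtained via a temporary expanded view of the parameter.
        param_tmp = param.expand_as(param)
        grad_acc = param_tmp.grad_fn.next_functions[0][0]
        grad_acc.register_hook(lambda *_: reduce_fn(param))
```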
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
We have encountered an issue with torch.compile and the pipeline
module.
Modifying a member of the module (`micro_offset`) during the forward
function causes torch.compile to restart its analysis and treat the
module as dynamic.
To bypass this issue without significantly changing the way the
pipeline module works, we propose to compile only the layers in the
pipeline module instead of the forward function of the pipeline module
(see the sketch below). This should still give most of the benefit of
torch.compiling the pipeline module while avoiding the issue.
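An illustrative sketch of the idea (attribute names are assumptions about `PipelineModule` internals, not a quote of the actual change):

```python
import torch

def compile_pipeline_layers(pipeline_module):
    # Compile each layer individually instead of pipeline_module.forward,
    # so mutating self.micro_offset inside forward() does not invalidate
    # the compiled graph or mark the module as dynamic.
    for idx, layer in enumerate(pipeline_module.forward_funcs):
        if isinstance(layer, torch.nn.Module):
            pipeline_module.forward_funcs[idx] = torch.compile(layer)
```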
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Fix #6851.
Initialize the communication backend to fix an error caused by an
all_reduce call in the Domino transformer layer (see the sketch below).
Verified correctness in a local test.
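A minimal sketch of the kind of initialization involved (placement is an assumption; the actual PR may initialize it elsewhere):

```python
import deepspeed.comm as dist

# Ensure the communication backend is up before the Domino transformer
# layer issues its first all_reduce.
if not dist.is_initialized():
    dist.init_distributed()
```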
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Inside reduce_independent_p_g_buckets_and_remove_grads and
reduce_ipg_grads, which are executed during the BWD hook in ZeRO-2,
the model param is stored inside params_in_ipg_bucket.
torch.compile has a hard time tracing parameters.
By using the param's static index inside the group, the same logic can be
maintained with less complexity (see the sketch below).
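An illustrative toy sketch of the bookkeeping change (class and attribute names are invented; not the actual ZeRO-2 code):

```python
class IpgBucketSketch:
    """Toy stand-in showing index-based bookkeeping instead of storing Parameters."""

    def __init__(self, param_groups):
        self.param_groups = param_groups   # e.g. the bit16 parameter groups in ZeRO-2
        self.params_in_ipg_bucket = []     # holds (group_idx, param_idx) pairs

    def add(self, group_idx, param_idx):
        # Store plain ints, which torch.compile traces without trouble.
        self.params_in_ipg_bucket.append((group_idx, param_idx))

    def params(self):
        # Recover the actual parameters only when they are needed.
        for g, p in self.params_in_ipg_bucket:
            yield self.param_groups[g][p]
```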
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
**Auto-generated PR to update version.txt after a DeepSpeed release**
Released version - 0.16.2
Author - @loadams
Co-authored-by: loadams <loadams@users.noreply.github.com>
As title says.
Default behavior of the Arctic model produces shape issues with AutoTP due
to the MLP layer performing `w2 * act(w1*w3)`. However, the method provided
to fix Mixtral-7x8b in #5257 does not work, since the MLP for Arctic is
also used within a ModuleList for the MoE. This results in MLP weights
hiding behind individual experts as layers `#.w#`, which is not caught
by the fix in #5257. This adds the check directly within replace, where
it can check the actual layer names for the `w2` key in the model to
patch with `all_reduce`.
---------
Signed-off-by: Daniel Huang <daniel1.huang@intel.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
This fixes a bug introduced in #6845, which breaks the `no-torch`
workflow that we require in order to do releases, where we do not require
torch to be in the environment when building an sdist. This adds the
same logic to the CPU accelerator that the CUDA accelerator already had,
where we don't require torch to be installed to build the whl.
This PR aims to add an MLP/lm_head tp size granularity setting to the
deepspeed.init_inference() API, making it more flexible to set the
MLP/lm_head sharding grain size.
DNN libraries favor tensor sizes with power-of-2 granularity, so we pick
64 as the default size.
We aim to be able to set the MLP/lm_head tp grain size flexibly. This is
a preliminary solution; if there is a better solution, we can discuss it
together. Thanks~
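A hypothetical sketch of how such a setting might be passed (the keyword name `tp_grain_size` and its placement are assumptions for illustration only):

```python
import torch
import deepspeed

# model = ...  # the model to shard

engine = deepspeed.init_inference(
    model,
    dtype=torch.float16,
    tensor_parallel={
        "tp_size": 4,
        "tp_grain_size": 64,  # shard MLP/lm_head in multiples of 64 columns
    },
)
```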
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Changes from https://github.com/huggingface/transformers/pull/34966
caused the `nv-torch-latest-v100` tests to fail with the following
error:
```
File "/tmp/azureml/cr/j/e4bfd57a509846d6bbc4914639ad248d/exe/wd/actions-runner/_work/DeepSpeed/DeepSpeed/unit-test-venv/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3941, in from_pretrained
raise EnvironmentError(
OSError: Can't load the model for 'hf-internal-testing/tiny-random-VisionEncoderDecoderModel-vit-gpt2'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'hf-internal-testing/tiny-random-VisionEncoderDecoderModel-vit-gpt2' is the correct path to a directory containing a file named pytorch_model.bin, tf_model.h5, model.ckpt or flax_model.msgpack.
```
Sample failure here:
https://github.com/microsoft/DeepSpeed/actions/runs/12169422174/job/33942348835?pr=6794#step:8:3506
This was resolved on the Transformers side here:
https://github.com/huggingface/transformers/pull/35236
### Comment out or delete `accelerate_name="cpu"` when `xpu` is not detected.
When `xpu` is not detected, the code simply passes at lines 68 to 74 if
`DS_ACCELERATOR` is set. However, when `DS_ACCELERATOR` is not set, `cpu`
is assigned to `accelerate_name` if it cannot import
`intel_extension_for_pytorch` or find `xpu`, namely at lines 125 to 133.
I found this problem yesterday and spent a whole afternoon figuring it
out. I had `intel_extension_for_pytorch` installed as a dependency of
another package that I do not actually use and was not aware of. I then
found that `cpu` is assigned to `accelerate_name` directly if `xpu`
cannot be found, which interferes with `cuda` detection (see the sketch
below). In fact, `cpu` will be assigned in the end anyway if `cuda` is
not detected either, at lines 170 to 177.
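An illustrative sketch of the detection order being argued for (heavily simplified; not the actual `real_accelerator.py` code):

```python
import importlib.util
import torch

def detect_accelerator_name():
    # If intel_extension_for_pytorch is importable but no XPU device is
    # actually present, do NOT settle for "cpu" here...
    if importlib.util.find_spec("intel_extension_for_pytorch") is not None:
        import intel_extension_for_pytorch  # noqa: F401  (registers torch.xpu)
        if hasattr(torch, "xpu") and torch.xpu.is_available():
            return "xpu"
    # ...so that the CUDA check still gets a chance to run.
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"  # final fallback only after all device checks fail
```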
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>