pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-10-20 21:14:14 +08:00

Author	SHA1	Message	Date
Yuanyuan Chen	3255e7872b	Enable all flake8-logging-format rules (#164655 ) These rules are enabled by removing existing suppressions. Pull Request resolved: https://github.com/pytorch/pytorch/pull/164655 Approved by: https://github.com/janeyx99, https://github.com/mlazos	2025-10-19 00:59:28 +00:00
Yuanyuan Chen	e595136187	Enable PLC1802 on ruff (#165813 ) This PR enables ruff check `PLC1802`, which detects len calls on sequences in a boolean test context. Pull Request resolved: https://github.com/pytorch/pytorch/pull/165813 Approved by: https://github.com/ezyang	2025-10-18 05:44:14 +00:00
Howard Huang	bc1f2108d7	[PP] Update backward_counter and fsdp util to schedule class (#165513 ) Fixed one issue with FSDP last reshard not being called. Rest is mostly refactoring, changing some variables to be class variables so they can be used in https://github.com/pytorch/torchtitan/pull/1721 Pull Request resolved: https://github.com/pytorch/pytorch/pull/165513 Approved by: https://github.com/fegin	2025-10-15 21:58:16 +00:00
Timm Ruland	ffe3cb226a	In pipeline parallelism: Use same dtype for receive and send tensor when initializing p2p communication. (#165539 ) When initializing the p2p communication for pipeline parallelism, currently different default dtypes are used for the send and receive tensor here: `5c583e2573/torch/distributed/pipelining/stage.py (L935-L936)` This caused hard to trace issues when training on multiple nodes. Multiple stages on one node seem to work for some reason which probably caused the unit tests not to catch this. Fixes #165143 Pull Request resolved: https://github.com/pytorch/pytorch/pull/165539 Approved by: https://github.com/H-Huang	2025-10-15 15:05:55 +00:00
Yuanyuan Chen	b11593c31b	[8/N] Apply ruff UP035 rule (#165214 ) This is follow-up of #164653 to continue applying `UP035` fixes. The purpose is to finally enable this rule. Pull Request resolved: https://github.com/pytorch/pytorch/pull/165214 Approved by: https://github.com/ezyang	2025-10-15 03:18:57 +00:00
Howard Huang	ca65023b90	[PP] Fix edge case with FSDP when stages_per_rank > 3 (#165467 ) There is an edge case with FSDP + PP when we add UNSHARD + RESHARD, we at max have 3 stages unsharded, `3f83e8915e/torch/distributed/pipelining/schedules.py (L1029-L1031)` This change is need to be able to unshard and reshard a stage multiple times. Pull Request resolved: https://github.com/pytorch/pytorch/pull/165467 Approved by: https://github.com/wwwjn	2025-10-15 01:53:04 +00:00
Yuanyuan Chen	fbe0d20a17	[2/N] More ruff SIM fixes (#165031 ) This is follow-up of #164695 to apply ruff SIM rules to more files. Most changes are about simplifying dict.get because None is already the default value. Pull Request resolved: https://github.com/pytorch/pytorch/pull/165031 Approved by: https://github.com/mlazos	2025-10-14 14:22:54 +00:00
Chien-Chin Huang	6bda3bb286	[PP] Fix split_args_kwargs_into_chunks issues (#165306 ) 1. https://github.com/pytorch/pytorch/pull/164111/ adds the support of splitting BlockMask. But BlockMask actually has B=1 case that the BlockMask will be broadcast. This PR adds the support of B=1 case. 2. The original split_args_kwargs_into_chunks doesn't initialize the default specs correctly. Since we now use tree_flatten and tree_unflatten to do split, we should also use tree_map to initialize the default spec. This will actually support the case when the values are not torch.Tensor, which were only supported if users explicitly provide the shard spec. Pull Request resolved: https://github.com/pytorch/pytorch/pull/165306 Approved by: https://github.com/H-Huang	2025-10-13 15:52:39 +00:00
Howard Huang	2beead7523	[PP] move FSDP reduce scatters to end of step (#165106 ) Move FSDP reduce scatters to the end of the PP step. The reduce scatter compute stream sync blocks the other stages from executing their backwards leading to bubbles. There should be a way to execute these RS earlier, but doing this for now as a quick fix. <img width="1056" height="463" alt="image" src="https://github.com/user-attachments/assets/b945dd55-8ab1-4acc-b862-c6e2e476b834" /> Pull Request resolved: https://github.com/pytorch/pytorch/pull/165106 Approved by: https://github.com/weifengpy ghstack dependencies: #164976	2025-10-12 13:28:02 +00:00
Howard Huang	a3eb275d3c	Add torch compile check for ZeroBubble (#162511 ) Fix https://github.com/pytorch/pytorch/issues/161904 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162511 Approved by: https://github.com/fegin	2025-10-10 18:49:45 +00:00
Yuanyuan Chen	fb64da0791	[2/N] Use "is" in python type comparison (#165142 ) This is follow-up of #165037. It generally recommended to use `is/is not` to compare types. Therefore this series of changes apply this suggestion in the code base, and it aims to finally enabling related linter checks. Pull Request resolved: https://github.com/pytorch/pytorch/pull/165142 Approved by: https://github.com/albanD	2025-10-10 15:36:44 +00:00
PyTorch MergeBot	b8be796a57	Revert "[2/N] More ruff SIM fixes (#165031 )" This reverts commit 38095fbd1323ee4a9541fbcbb9b28bd20f2cd956. Reverted https://github.com/pytorch/pytorch/pull/165031 on behalf of https://github.com/albanD due to One of the changed line started to fail on trunk ([comment](https://github.com/pytorch/pytorch/pull/165031#issuecomment-3390190870))	2025-10-10 13:42:14 +00:00
Howard Huang	238dd5517d	[PP] Move profiler record_function in schedule (#164976 ) Better engineering to move the `record_function` call to also encompass the custom callback, this line is the only change: https://github.com/pytorch/pytorch/pull/164976/files#diff-1d3d91f53db88fb886901fb178d69e47776e71b8103f85688fa9ca64cc55d068R2147, the rest is just formatting. Pull Request resolved: https://github.com/pytorch/pytorch/pull/164976 Approved by: https://github.com/fegin ghstack dependencies: #162016, #164962	2025-10-10 13:09:23 +00:00
Yuanyuan Chen	38095fbd13	[2/N] More ruff SIM fixes (#165031 ) This is follow-up of #164695 to apply ruff SIM rules to more files. Most changes are about simplifying dict.get because None is already the default value. Pull Request resolved: https://github.com/pytorch/pytorch/pull/165031 Approved by: https://github.com/mlazos	2025-10-10 05:37:46 +00:00
Maggie Moss	7457d139c5	Add pyrefly suppressions to torch/distributed (7/n) (#165002 ) Adds suppressions to pyrefly will typecheck clean: https://github.com/pytorch/pytorch/issues/163283 One more PR after this one. Test plan: dmypy restart && python3 scripts/lintrunner.py -a pyrefly check step 1: delete lines in the pyrefly.toml file from the project-excludes field step 2: run pyrefly check step 3: add suppressions, clean up unused suppressions before: https://gist.github.com/maggiemoss/4b3bf2037014e116bc00706a16aef199 after: INFO 0 errors (6,884 ignored) Pull Request resolved: https://github.com/pytorch/pytorch/pull/165002 Approved by: https://github.com/oulgen	2025-10-09 04:08:25 +00:00
Howard Huang	005c3d449e	Support custom callback functions in schedule (#162016 ) This is going to be used in https://github.com/pytorch/torchtitan/issues/1682 Add a `register_custom_function` to the `_PipelineScheduleRuntime` which allows users to implement any custom function to replace the runtime operation dynamically. The signature of the callback should look like: ```python class _CustomFunctionProtocol(Protocol): def __call__(self, action: _Action, ctx: _PipelineContext) -> None: ... ``` `_PipelineContext` contains a reference to the schedule which is executing the operations. ### Testing Added a test which adds custom methods for `FORWARD` and `OVERLAP_F_B` which are just the same implementations as those used in the default schedule runtime. Check that the schedule can still run, numerics are correct, and the callbacks are executed the correct number of times. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162016 Approved by: https://github.com/fegin	2025-10-08 20:43:26 +00:00
mingyuan.wang	0a3e4e894c	[PP]: Optimize memory by early releasing stage inputs' gradients (#164329 ) Seems that we can release input activations' gradients early in `stage_backward()` in PP, which helps to reduce the peak memory. I tested this using `1F1B` and `Interleaved1F1B` PP strategy (for simplicity, I use 4 decoder layers of llama3, set PP size to 2 and set num_microbatches to 128) based on torchtitan run command using torchtitan: ```bash CUDA_VISIBLE_DEVICES=4,5 LOG_RANK=0,1 NGPU=2 CONFIG_FILE=./torchtitan/models/llama3/train_configs/llama3_8b.toml ./run_train.sh --metrics.log_freq 1 --training.seq_len 8192 --training.steps 10 --parallelism.data_parallel_shard_degree 1 --activation_checkpoint.mode full --model.tokenizer_path /workspace/torchtitan-v0.1.0/torchtitan/torchtitan/datasets/tokenizer/original/tokenizer.model --tr aining.dataset wikipedia --parallelism.pipeline_parallel_degree 2 --training.local_batch_size 128 --parallelism.pipeline_parallel_microbatch_size 1 --training.dataset_path /workspace/wikipedia_subset --training.seed 42 --parallelism.pipeline_parallel_schedule 1F1B ``` ## 1F1B torchtitan train results ### before fix <img width="1526" height="606" alt="b8e281cce1dac15e827c216e7d83f402" src="https://github.com/user-attachments/assets/545c0a80-6276-40c0-893f-fd2df0a53b8d" /> ### after fix <img width="1526" height="594" alt="70d5ceba311a8398d041189bf8897cfc" src="https://github.com/user-attachments/assets/0d606e08-238a-4115-a1c0-b40df101d867" /> after fix, the memory usage on rank1, i.e., non first stages saving 6.9GB compare to before fix. the memory usage on rank0 remains unchanged (rank0 represents stage0) ## Interleaved1F1B torchtitan train results ### before fix <img width="1514" height="601" alt="a28b7f9704b9234870619c43194e8a72" src="https://github.com/user-attachments/assets/2c28565f-ffff-4747-a8f5-722b5c65dc7e" /> ### after fix <img width="1526" height="621" alt="2d8d6d956b72885186f8c7059146c41a" src="https://github.com/user-attachments/assets/8c4a4ff2-336b-4e0b-8ac4-014ae22c2ed1" /> after fix, the memory usage on rank1 saving 14.57GB (rank1 holds layer1 and layer3) and rank0 saving 7.5GB (rank0 holds layer0 and layer2) ## Memory snapshot results also, I have dumped the memory snapshot to observe the memory under the 1F1B PP strategy. ### before fix <img width="1906" height="918" alt="6fd4e4ba82b8bacf9ca6edee4f3d5581" src="https://github.com/user-attachments/assets/d1b9245c-b09f-43c5-87ce-87ba48533a70" /> we can see the memory is increasing as pp step_microbatches running. (the lifetime of input activation's gradient, i.e., the output of `FusedRMSNormBackward` lasts too long) ### after fix <img width="1903" height="918" alt="2e415f25af6750d06e5e647683b212b9" src="https://github.com/user-attachments/assets/b657c8f6-5a56-46bd-8743-f3b8375c81b0" /> after fix, we got more steady memory usage during training. (the input activation's gradient will be released or return allocator soon) Pull Request resolved: https://github.com/pytorch/pytorch/pull/164329 Approved by: https://github.com/H-Huang	2025-10-08 16:12:00 +00:00
Howard Huang	c813617c53	[PP] Migrate other schedules to use PipelineScheduleRuntime (#164777 ) Second fix for https://github.com/pytorch/pytorch/issues/164756 This has been a TODO to make the all schedules execute using the same runtime. Now after this change, schedules will use the same logic for `_PipelineScheduleRuntime` where it adds `UNSHARD` and `RESHARD` operations to the schedules which fixes the issue mentioned above. <img width="920" height="406" alt="image" src="https://github.com/user-attachments/assets/a4d5bcd0-7dac-43cd-96f9-8ca33cfd8b91" /> A test is failing after the conversion: - Fixed a gradient scaling issue for dWeight Pull Request resolved: https://github.com/pytorch/pytorch/pull/164777 Approved by: https://github.com/fegin ghstack dependencies: #164775	2025-10-08 01:45:57 +00:00
Howard Huang	e659661ffa	[PP] Fix FSDP unshard/reshard (#164775 ) First fix for https://github.com/pytorch/pytorch/issues/164756 In the pipeline IR we call `UNSHARD` and `RESHARD`, but there is a bug because when we call `module.unshard()` these do not recursively call the FSDP modules, hence leading to sometime call allgather before the module forward. Since we want the pipeline IR to explicitly handle this, we can call `group.unshard` instead which ensures that all the modules are unsharded. Pull Request resolved: https://github.com/pytorch/pytorch/pull/164775 Approved by: https://github.com/weifengpy	2025-10-08 01:45:57 +00:00
Chien-Chin Huang	e3ae80fc03	[PP] Let PP split BlockMask into micro-BlockMask (#164111 ) BlockMask has batch dimension information. So PP has to split it as well just like all other tensors. All the tensors in BlockMask have the batch dimension, so we can just split it without too many issues. However, `mask_mod` requires the batch index as the input, which the value is going to be changed after the split. So we have to wrap it inside a closure to modify the batch index. Pull Request resolved: https://github.com/pytorch/pytorch/pull/164111 Approved by: https://github.com/H-Huang	2025-10-07 23:25:34 +00:00
Howard Huang	65f10becdf	Support OVERLAP_F_B in schedule (#161072 ) Previously, we converted the overlap_f_b into separate forward and backward operations in the plan. This is a small change that includes it in the plan and handles it in the runtime Pull Request resolved: https://github.com/pytorch/pytorch/pull/161072 Approved by: https://github.com/fegin, https://github.com/wconstab	2025-10-07 19:55:10 +00:00
PyTorch MergeBot	5d7360bb03	Revert "Enable all SIM rules except disabled ones (#164645 )" This reverts commit 321e6026925f6b6e8a36e3a8b7c0295cd7541911. Reverted https://github.com/pytorch/pytorch/pull/164645 on behalf of https://github.com/izaitsevfb due to causes lint failures ([comment](https://github.com/pytorch/pytorch/pull/164645#issuecomment-3369274351))	2025-10-05 19:32:21 +00:00
Yuanyuan Chen	321e602692	Enable all SIM rules except disabled ones (#164645 ) `SIM` rules are useful for simplifying boolean expressions and enhances code readability. Pull Request resolved: https://github.com/pytorch/pytorch/pull/164645 Approved by: https://github.com/ezyang	2025-10-05 07:38:25 +00:00
Yuanyuan Chen	35c4130fd1	[2/N] Fix ruff warnings (#164460 ) Apply ruff `SIM` rules. Pull Request resolved: https://github.com/pytorch/pytorch/pull/164460 Approved by: https://github.com/ezyang	2025-10-04 03:40:32 +00:00
Anshul Sinha	3ffaab3bc8	[Replicate][Pipeline Parallelism] integration of new replicate function with pipeline parallelism (#164031 ) Summary: In order to test numerics for replicate + pp, stage.py needs to be able to call replicate's backward manually as pipeline parallelism doesn't have this feature. Test Case 1. pytest test/distributed/_composable/test_composability/test_pp_composability.py -k test_replicate_pp Pull Request resolved: https://github.com/pytorch/pytorch/pull/164031 Approved by: https://github.com/weifengpy, https://github.com/H-Huang ghstack dependencies: #163897	2025-10-01 18:01:16 +00:00
Ke Wen	e419dc6d08	[PP] Customize pipeline's submod name (#164037 ) Changing PP submodules' name from `submod_i` to `submod_pp_i` to distinguish from the submodule created by HOP. Pull Request resolved: https://github.com/pytorch/pytorch/pull/164037 Approved by: https://github.com/H-Huang ghstack dependencies: #164045, #164035	2025-10-01 16:29:19 +00:00
PyTorch MergeBot	36a37b81cd	Revert "[PP] Customize pipeline's submod name (#164037 )" This reverts commit 704cd771f6a63abf9498934aeb7f3079ab9e2232. Reverted https://github.com/pytorch/pytorch/pull/164037 on behalf of https://github.com/yangw-dev due to internal build failed Buck build failed for this target, and is likely caused by your changes. ([comment](https://github.com/pytorch/pytorch/pull/164035#issuecomment-3357113348))	2025-10-01 16:09:50 +00:00
Yuanyuan Chen	a293206bd5	Fix invalid f-strings (#164112 ) Fixes invalid f-strings detected by `ruff`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/164112 Approved by: https://github.com/Skylion007, https://github.com/mlazos	2025-09-30 04:17:13 +00:00
Yuanyuan Chen	85012fe167	Remove unnecessary list comprehensions (#164103 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/164103 Approved by: https://github.com/Lucaskabela, https://github.com/mlazos	2025-09-30 03:56:54 +00:00
Yuanyuan Chen	da003d7b95	[3/N] Import Callable from collections.abc in torch/distributed (#164104 ) This is the result of applying the ruff `UP035` check. `Callable` is imported from `collections.abc` instead of `typing`. This PR is the follow-up of #164054. Pull Request resolved: https://github.com/pytorch/pytorch/pull/164104 Approved by: https://github.com/Skylion007	2025-09-30 00:28:53 +00:00
Ke Wen	704cd771f6	[PP] Customize pipeline's submod name (#164037 ) Changing PP submodules' name from `submod_i` to `submod_pp_i` to distinguish from the submodule created by HOP. Pull Request resolved: https://github.com/pytorch/pytorch/pull/164037 Approved by: https://github.com/H-Huang ghstack dependencies: #164045, #164035	2025-09-29 23:29:52 +00:00
Ke Wen	5ddad22196	[PP] Use default export mode (non-strict) (#164045 ) export's default mode has switched from strict to non-strict. We just follow suit in PP. Pull Request resolved: https://github.com/pytorch/pytorch/pull/164045 Approved by: https://github.com/H-Huang	2025-09-29 06:31:06 +00:00
can-gaa-hou	7c7ae86991	[Fix] Adding missing `f` prefixes to formatted strings [2/N] (#164066 ) As stated in the title. * #164068 * #164067 * __->__ #164066 * #164065 Pull Request resolved: https://github.com/pytorch/pytorch/pull/164066 Approved by: https://github.com/Skylion007	2025-09-29 04:40:44 +00:00
Timm Ruland	5fcde74aed	Fix pipeline parallelism not correctly initializing backwards stages when evaluating before training. (#162823 ) Previously, an eval() call before a training step() would not correctly initialize the backward pass of the pipeline stages, leading to errors during the subsequent training step. This PR ensures that the backward stages can still be initialized after an eval() call. Fixes #162822 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162823 Approved by: https://github.com/dcci, https://github.com/H-Huang	2025-09-25 15:13:19 +00:00
Howard Huang	9de22bc5da	Inspect schedule IR comms (#162996 ) Small change to util to allow us to see comms (e.g. `SEND`, `RECV`, etc.) in the schedule IR Pull Request resolved: https://github.com/pytorch/pytorch/pull/162996 Approved by: https://github.com/fegin	2025-09-16 16:59:06 +00:00
Tugsbayasgalan (Tugsuu) Manlaibaatar	de05dbc39c	Replace export_for_training with export (#162396 ) Summary: replace export_for_training with epxort Test Plan: CI Rollback Plan: Differential Revision: D81935792 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162396 Approved by: https://github.com/angelayi, https://github.com/jerryzh168	2025-09-10 14:19:34 +00:00
Howard Huang	e1be887870	[PP] Add spacing to visualizer (#160474 ) When visualizing the schedules using `_PipelineScheduleExecution`, we don't provide any spacing between dependencies, so when visualizing `DualPipeV` it looks like this: <img width="3168" height="486" alt="image" src="https://github.com/user-attachments/assets/d2c881ad-4ee0-46b6-ac03-13e5600b5a55" /> While it has the correct order of operations, it does not show the dependencies correctly. As shown in the original implementation, it should look something like this: <img width="3542" height="384" alt="image" src="https://github.com/user-attachments/assets/c930fa98-848e-4951-a58b-c81f41092d14" /> This allows an option to add spacing to the visualizer, so it is easier to see dependencies. After change: <img width="3633" height="486" alt="image" src="https://github.com/user-attachments/assets/7708367e-bdb4-46e8-a7c4-f19e18047f59" /> Pull Request resolved: https://github.com/pytorch/pytorch/pull/160474 Approved by: https://github.com/fegin	2025-09-09 17:52:52 +00:00
Avik Chaudhuri	711c8c821e	shape guards (#161178 ) Summary: This PR introduces shape guards to export. Previously only value ranges, equalities, and specializations would be tracked for symbolic expressions, and we had a forward hook to check them. Instead now we create a function to check shape guards and call it in the exported program. Test Plan: updated several tests Rollback Plan: Differential Revision: D80713603 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161178 Approved by: https://github.com/tugsbayasgalan	2025-09-08 22:44:09 +00:00
Howard Huang	abc447174c	[PP] Add profiling to schedule execution (#160753 ) Profiling title will be `str(action)` <img width="1545" height="694" alt="image" src="https://github.com/user-attachments/assets/60b3506b-b8d6-4ae0-8b32-0d51d45fa2f0" /> Pull Request resolved: https://github.com/pytorch/pytorch/pull/160753 Approved by: https://github.com/wconstab	2025-09-03 21:31:50 +00:00
Howard Huang	82d2d23e85	Add batch option for send/recv_object_list (#160342 ) `send_object_list` and `recv_object_list` use regular `send`/`recv` P2P ops which means that they will create 2-rank NCCL communicators between ranks if the communicators have not been initialized. This adds an option `use_batch` which will call the send/recv with `batch_isend_irecv` which will re-use the communicators already initialized for collectives in the group. --- BatchP2P ops, creates (or use existing) communicator keyed by device index Regular P2P Ops, creates (or use existing) dedicated 2-rank communicators keyed by “rank1:rank2” See: `c8205cb354/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp (L3980-L4008)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/160342 Approved by: https://github.com/wconstab	2025-08-30 03:29:09 +00:00
Howard Huang	2ff7c1c774	[PP] Rename _load_actions and validate (#160558 ) Rename method and add validation Pull Request resolved: https://github.com/pytorch/pytorch/pull/160558 Approved by: https://github.com/wconstab ghstack dependencies: #159591	2025-08-14 17:41:58 +00:00
Howard Huang	198b5fd2d4	[PP] Add DualPipeV schedule (#159591 ) Added the DualPipeV schedule according to http://github.com/deepseek-ai/DualPipe/blob/main/dualpipe/dualpipev.py#L11 <img width="3633" height="486" alt="image" src="https://github.com/user-attachments/assets/4e843bb9-87cd-4d11-936c-7dfe8ee12f16" /> This schedule doesn't perform the actual "overlap" during execution, but provides the scaffolding and schedule definition we need to run it E2E in torchtitan. Supporting the overlapped operation will be worked on in following PRs. Tests: ```sh python test/distributed/pipelining/test_schedule_multiproc.py -k test_v_shape_schedules python test/distributed/pipelining/test_schedule.py -k test_pipeline_order_for_v_schedules ``` Also tested in TorchTitan and is running. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159591 Approved by: https://github.com/wconstab	2025-08-14 14:58:35 +00:00
Howard Huang	e63c2b21c1	[PP] Initialize P2P communicators on first step (#160210 ) Was hitting hangs in multi-node settings and initializing the NCCL communicators needed for batch p2p ops ahead of time fixes this. This change adds extra communication since it communicates a dummy tensor to next and previous stage ranks. However, this is only paid on the first step so it is negligible. Debug history: https://docs.google.com/document/d/1EKVJYmW2hj_VsvDvnSggXhZzJyvMu9dA0iDJWOZAtjY/edit?tab=t.0 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160210 Approved by: https://github.com/wconstab	2025-08-11 23:46:58 +00:00
Howard Huang	5e8b95605f	[PP] Support OVERLAP_F_B computation type (#158978 ) Some changes to validation code and visualizer to support a new computation type that will be used in DualPipeV (see https://github.com/pytorch/pytorch/pull/159591) The IR looks like: ``` [0F0, 0F1, 0F2, 0F3, 0F4, 0F5, 0F6, 7F0, 7I0, 7W0, 7F1, 7I1, 7W1, 7F2, 7I2, 7W2, 7F3, (0F7;7B3)OVERLAP_F_B, (7F4;0B0)OVERLAP_F_B, (0F8;7B4)OVERLAP_F_B, (7F5;0B1)OVERLAP_F_B, (0F9;7B5)OVERLAP_F_B, (7F6;0B2)OVERLAP_F_B, 7B6, (7F7;0B3)OVERLAP_F_B, 7B7, (7F8;0B4)OVERLAP_F_B, 7B8, (7F9;0B5)OVERLAP_F_B, 7B9, 0I6, 0W6, 0I7, 0W7, 0I8, 0W8, 0I9, 0W9] [1F0, 1F1, 1F2, 1F3, 1F4, 6F0, 1F5, 6F1, 6I0, 6W0, 6F2, 6I1, 6W1, 6F3, (1F6;6B2)OVERLAP_F_B, (6F4;1B0)OVERLAP_F_B, (1F7;6B3)OVERLAP_F_B, (6F5;1B1)OVERLAP_F_B, (1F8;6B4)OVERLAP_F_B, (6F6;1B2)OVERLAP_F_B, (1F9;6B5)OVERLAP_F_B, (6F7;1B3)OVERLAP_F_B, 6B6, (6F8;1B4)OVERLAP_F_B, 6B7, (6F9;1B5)OVERLAP_F_B, 6B8, 1B6, 6I9, 1I7, 6W9, 1I8, 1W7, 1I9, 1W8, 1W9] [2F0, 2F1, 2F2, 5F0, 2F3, 5F1, 2F4, 5F2, 5I0, 5W0, 5F3, (2F5;5B1)OVERLAP_F_B, (5F4;2B0)OVERLAP_F_B, (2F6;5B2)OVERLAP_F_B, (5F5;2B1)OVERLAP_F_B, (2F7;5B3)OVERLAP_F_B, (5F6;2B2)OVERLAP_F_B, (2F8;5B4)OVERLAP_F_B, (5F7;2B3)OVERLAP_F_B, (2F9;5B5)OVERLAP_F_B, (5F8;2B4)OVERLAP_F_B, 5B6, (5F9;2B5)OVERLAP_F_B, 5B7, 2B6, 5B8, 2I7, 5I9, 2I8, 2W7, 2I9, 5W9, 2W8, 2W9] [3F0, 4F0, 3F1, 4F1, 3F2, 4F2, 3F3, 4F3, 3F4, 4B0, (4F4;3B0)OVERLAP_F_B, (3F5;4B1)OVERLAP_F_B, (4F5;3B1)OVERLAP_F_B, (3F6;4B2)OVERLAP_F_B, (4F6;3B2)OVERLAP_F_B, (3F7;4B3)OVERLAP_F_B, (4F7;3B3)OVERLAP_F_B, (3F8;4B4)OVERLAP_F_B, (4F8;3B4)OVERLAP_F_B, (3F9;4B5)OVERLAP_F_B, (4F9;3B5)OVERLAP_F_B, 4B6, 3B6, 4B7, 3B7, 4I8, 3I8, 4I9, 3I9, 4W8, 3W8, 4W9, 3W9] ``` In this PR, the schedule execution will just treat the OVERLAP_F_B as two separate operations of F and B (so there is no actual overlap). The next step is to allow users to create a custom function to plug in what this operation does. `814629043a/torch/distributed/pipelining/schedules.py (L1205-L1216)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/158978 Approved by: https://github.com/wconstab	2025-08-01 20:22:30 +00:00
Howard Huang	f74842d57f	[PP] Fix zero bubble schedules for eval() (#159475 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159475 Approved by: https://github.com/tianyu-l, https://github.com/Skylion007	2025-07-30 19:46:10 +00:00
Howard Huang	8d00833fdb	[PP] Fix eval step under no_grad() (#159293 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159293 Approved by: https://github.com/tianyu-l, https://github.com/wconstab	2025-07-29 14:42:33 +00:00
Howard Huang	ede6186c86	[PP] Allow intermediate nodes in ZB to have multiple grads (#159084 ) Fixes a ZB regression (https://github.com/pytorch/torchtitan/actions/runs/16478292562/job/46585646792) Previously we only allowed an intermediate node to have 1 gradient. Recently a torchtitan ZB test started failing and I tracked to back to FusedRMSNorm grad_fn having two values `(grad, None)` (see https://github.com/pytorch/pytorch/pull/153666) and it started breaking our ZB tests. This PR allows `stage_backward_weight` intermediate nodes to have multiple grads (it sums them together or if the grad value is None, then ignores it). Here is an example where the backward would have two grad values (gI1, gI2): ```python class Func(torch.autograd.Function): @staticmethod def forward(ctx, x): return x, 2 @staticmethod def backward(ctx, gI1, gI2): assert gI2 is None return gI1 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/159084 Approved by: https://github.com/tianyu-l	2025-07-27 19:16:51 +00:00
Howard Huang	1d58476162	[PP] Add eval() API to schedule (#157795 ) These change add an `eval()` API to PP schedules ## Context Currently, you can run "Forward only" for a schedule in two ways: 1. Use a custom schedule `_ScheduleForwardOnly` 2. Do not pass in `loss_fn` in schedule constructor, and no backward computations will be executed. However, this is still limiting because we may want to run forward through the pipeline / calculate the loss, but without backward, e.g. during validation. These changes allow for this. ```python if self.rank == 0: schedule.eval(x) elif self.rank == self.world_size - 1: losses = [] schedule.eval(target=target, losses=losses) else: schedule.eval() ``` TODO: - in later PRs, we will deprecate the `_ScheduleForwardOnly` Pull Request resolved: https://github.com/pytorch/pytorch/pull/157795 Approved by: https://github.com/wconstab	2025-07-16 23:48:45 +00:00
Howard Huang	f4406689b8	fix MPCT destroy_pg call (#157952 ) I was seeing hangs / exceptions not raising in some cases. Only call `c10d.destroy_process_group()` for `MultiProcessContinuousTest` in the clean exit case. Pull Request resolved: https://github.com/pytorch/pytorch/pull/157952 Approved by: https://github.com/fduwjj ghstack dependencies: #157589	2025-07-12 00:46:19 +00:00
Xuehai Pan	4ccc0381de	[BE][5/16] fix typos in torch/ (torch/distributed/) (#156315 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/156315 Approved by: https://github.com/Skylion007, https://github.com/albanD ghstack dependencies: #156313, #156314	2025-06-23 02:57:28 +00:00

1 2 3 4 5

219 Commits