pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-10-20 21:14:14 +08:00

Author	SHA1	Message	Date
soulitzer	71aefd5595	[reland] Allow setting grad_dtype on leaf tensors (#164751 ) ghstack-source-id: e44b3941530be83a630ec93f1478eec741ffca2e Pull-Request-resolved: https://github.com/pytorch/pytorch/pull/162815 Fixes #ISSUE_NUMBER Relanding due to internal weirdness. Separate PR to codev w/o ghstack. Pull Request resolved: https://github.com/pytorch/pytorch/pull/164751 Approved by: https://github.com/albanD	2025-10-08 20:23:13 +00:00
eellison	001e1d2637	Add memory estimator (#164738 ) Original work by @ShatianWang, with lints applied. I am going to make a few changes and add tests in subsequent PRs, but I want to preserve the original commit first. Pull Request resolved: https://github.com/pytorch/pytorch/pull/164738 Approved by: https://github.com/IvanKobzarev viable/strict/1759974894	2025-10-08 20:04:33 +00:00
Aleksandar Samardžić	e0cb1848d0	Use TMA loads always for Triton grouped MM kernel (#164256 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/164256 Approved by: https://github.com/ngimel	2025-10-08 19:40:06 +00:00
Lakshay Garg	a4110fedcf	Use insert_or_assign instead of erase+emplace (#164868 ) insert_or_assign does effectively the same thing as erase+emplace but more efficiently since the search does not need to be repeated Pull Request resolved: https://github.com/pytorch/pytorch/pull/164868 Approved by: https://github.com/eqy	2025-10-08 19:13:49 +00:00
Natalia Gimelshein	37c6087334	Add split-K control to cuBLAS reduced-precision settings (#164766 ) ## Summary - add a CuBLASReductionOption enum so the CUDA context can track reduced-precision and split-K options - extend the Python bindings, backend helpers, and docs to accept an optional allow_splitk argument for fp16/bf16 matmul controls - update cuBLAS/cuBLASLt call sites plus dynamo guards and tests to respect the new combinations ## Testing - python test/test_cuda.py TestCuda.test_cublas_allow_fp16_reduced_precision_reduction_get_set -v (fails: ModuleNotFoundError: No module named 'psutil') ------ https://chatgpt.com/codex/tasks/task_e_68e404623178832f8a3e1d34e1e175da Pull Request resolved: https://github.com/pytorch/pytorch/pull/164766 Approved by: https://github.com/malfet, https://github.com/albanD	2025-10-08 18:48:45 +00:00
Laith Sakka	0b85236477	Fix refine_ranges corner case (#164075 ) (#164846 ) Summary: Addresses https://github.com/pytorch/pytorch/issues/161360. The guard u0>0 should update the range of u0 to start from 1, i.e. [1, ..], but it was not doing that. This fixes it. Test Plan: contbuild & OSS CI, see `27234792ad` D84038721 Pull Request resolved: https://github.com/pytorch/pytorch/pull/164846 Approved by: https://github.com/izaitsevfb, https://github.com/ezyang viable/strict/1759970213	2025-10-08 18:42:37 +00:00
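The corner case in 0b85236477 can be illustrated with a minimal sketch. This is not the `refine_ranges` implementation; `IntRange` and `refine_greater_than` are hypothetical names used only to show why a strict `u0 > 0` guard on an integer symbol should move the lower bound to 1.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class IntRange:
    """Hypothetical closed range [lower, upper] for an integer symbol."""
    lower: float
    upper: float


def refine_greater_than(r: IntRange, bound: int) -> IntRange:
    # For integers, `u > bound` is equivalent to `u >= bound + 1`,
    # so the lower bound can be tightened to bound + 1.
    return IntRange(max(r.lower, bound + 1), r.upper)


# An unbacked symbol u0 that is only known to be non-negative.
u0 = IntRange(0, math.inf)

# After observing the guard `u0 > 0`, the range should start from 1.
refined = refine_greater_than(u0, 0)
print(refined)
```

The upper bound is left untouched; only the lower bound is tightened by the guard.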
Janani Sriram	4c0fec3e4d	[Max Autotune][B200] Skip carveout tests (#164435 ) Summary: Skip sm `carveout` tests on B200, as carveout is currently unsupported. Test Plan: ``` buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:max_autotune -c fbcode.nvcc_arch=b200a -c fbcode.enable_gpu_sections=true -c fbcode.platform010_cuda_version=12.8 -c fbcode.re_gpu_tests=False -- test_honor_sm_carveout_with_triton_tma ``` Differential Revision: D83395610 Pull Request resolved: https://github.com/pytorch/pytorch/pull/164435 Approved by: https://github.com/eellison viable/strict/1759965837	2025-10-08 18:39:43 +00:00
cyy	fdc622b513	[CMake] Remove LLVM link code (#134940 ) This handling is not needed on recent LLVM APIs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134940 Approved by: https://github.com/ezyang, https://github.com/malfet	2025-10-08 18:39:16 +00:00
bobrenjc93	91b9484264	[ez] fix small doc error (#164915 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/164915 Approved by: https://github.com/svekars viable/strict/1759962298	2025-10-08 18:27:44 +00:00
Ke Wen	5c827a4133	[SymmMem] Multi-root tile reduction (#164757 ) Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0) (oldest at bottom): Perform multiple tile reductions concurrently, with each tile reduced to a separate root. - The number of concurrent reductions can be smaller than world size, i.e. roots can be a subset of all ranks. But all ranks are still required to call into this API. - Currently supports NVLink SHARP scope only. Pull Request resolved: https://github.com/pytorch/pytorch/pull/164757 Approved by: https://github.com/weifengpy, https://github.com/fegin ghstack dependencies: #162243	2025-10-08 17:28:00 +00:00
Boyuan Feng	83458197d1	[Benchmark] remove old timm models from benchmark (#164805 ) Prune models from TorchInductor dashboard to reduce ci cost. This PR prunes for timm models according to the [doc](https://docs.google.com/document/d/1nLPNNAU-_M9Clx9FMrJ1ycdPxe-xRA54olPnsFzdpoU/edit?tab=t.0), which reduces from 60 to 14 models. Pull Request resolved: https://github.com/pytorch/pytorch/pull/164805 Approved by: https://github.com/anijain2305, https://github.com/seemethere, https://github.com/huydhn, https://github.com/malfet	2025-10-08 17:14:58 +00:00
Gheorghe-Teodor Bercea	0b01ff4de0	[ROCm] Improve non stride-one backwards indexing for small index sets (#164409 ) This patch fixes a performance problem which occurs when a small set of indices is used and there are practically no duplicates. Pull Request resolved: https://github.com/pytorch/pytorch/pull/164409 Approved by: https://github.com/jerrymannil, https://github.com/jeffdaily	2025-10-08 17:04:52 +00:00
Nikita Shulga	01f3a43462	[MPS] Update OS version in error message (#164946 ) Followup after https://github.com/pytorch/pytorch/pull/159912 Fixes https://github.com/pytorch/pytorch/issues/164943 Pull Request resolved: https://github.com/pytorch/pytorch/pull/164946 Approved by: https://github.com/Camyll	2025-10-08 16:43:50 +00:00
Sean McGovern	f332017294	C++ API handle optimizer defaults (#161825 ) Fixes #141884 This fixes the issue for all optimizers and parameter options. A member function `overwrite_from` is added to the optimizer base class. Each optimizer then implements this function for comparing their accepted parameters to defaults. A SFINAE approach to handle the different optimizer parameters generically (in optimizer.h only) was evaluated, but I think this is easier to review and maintain. This mirrors the Python API up to one edge case. An example of the edge case is provided below. Python can distinguish between 1) Key not present in dict = "not specified" and 2) Key present in dict = "explicitly set". The C++ implementation cannot. The issue hinges on whether or not to track if a particular parameter was set by the user explicitly or not (discrepancy in the case when the constructor default is explicitly passed in). To track this seems like it will take more intervention than would be worth it (modify TORCH_ARG to keep track, use std::optional for the parameter types, use bitset tracking) and was not pursued in the current PR. I'm happy to alter the design if appropriate. ### Example of edge case hinging on CONSTRUCTOR DEFAULTS vs OPTIMIZER DEFAULTS 1. CONSTRUCTOR DEFAULTS: These are the values you get when calling AdamOptions() AdamOptions().lr() = 0.001 AdamOptions().weight_decay() = 0 AdamOptions().eps() = 1e-08 2. OPTIMIZER DEFAULTS: These are the values the user chose when creating the optimizer User's optimizer defaults: optimizer.lr() = 0.005 optimizer.weight_decay() = 0.1 optimizer.eps() = 1e-07 3. THE PROBLEM SCENARIO: User wants to add a parameter group with explicit weight_decay=0.0 User sets: weight_decay(0) 4. THE CONFUSION: Constructor default weight_decay: 0 User's explicit weight_decay: 0 Are they equal? YES Since they're equal, our overwrite_from() logic thinks: "User didn't set weight_decay explicitly, use optimizer default" 5. CURRENT BEHAVIOR: Final weight_decay: 0.1 User expected: 0 Match? ❌ NO === KEY INSIGHT === Constructor defaults are built into the C++ class definition. Optimizer defaults are chosen by the user at runtime. We want to respect the user intention. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161825 Approved by: https://github.com/janeyx99	2025-10-08 16:40:45 +00:00
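The edge case described in f332017294 can be modelled in a few lines. The sketch below is a Python model of the two merging strategies, not the C++ `overwrite_from` code. The function names `merge_by_value_comparison` and `merge_by_key_presence` are hypothetical, and the numbers are the ones quoted in the commit message.

```python
# Values a freshly constructed options object holds (AdamOptions() in the commit).
CONSTRUCTOR_DEFAULTS = {"lr": 0.001, "weight_decay": 0.0, "eps": 1e-08}


def merge_by_value_comparison(group, optimizer_defaults):
    """Model of the C++ behaviour: a field equal to its constructor default
    is treated as 'not set' and replaced by the optimizer default."""
    merged = {}
    for name, ctor_default in CONSTRUCTOR_DEFAULTS.items():
        # A C++ options object always holds some value for every field.
        value = group.get(name, ctor_default)
        merged[name] = optimizer_defaults[name] if value == ctor_default else value
    return merged


def merge_by_key_presence(group, optimizer_defaults):
    """Model of the Python behaviour: only keys absent from the group dict
    fall back to the optimizer default."""
    return {**optimizer_defaults, **group}


optimizer_defaults = {"lr": 0.005, "weight_decay": 0.1, "eps": 1e-07}
group = {"weight_decay": 0.0}  # user explicitly asks for no weight decay

by_value = merge_by_value_comparison(group, optimizer_defaults)
by_key = merge_by_key_presence(group, optimizer_defaults)
print(by_value["weight_decay"], by_key["weight_decay"])
```

In this model the value-comparison merge cannot tell an explicit `0.0` from an untouched default, which is the discrepancy the commit message documents and chooses not to resolve.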
mingyuan.wang	0a3e4e894c	[PP]: Optimize memory by early releasing stage inputs' gradients (#164329 ) Seems that we can release input activations' gradients early in `stage_backward()` in PP, which helps to reduce the peak memory. I tested this using the `1F1B` and `Interleaved1F1B` PP strategies (for simplicity, I use 4 decoder layers of llama3, set PP size to 2 and set num_microbatches to 128) based on torchtitan. Run command using torchtitan: ```bash CUDA_VISIBLE_DEVICES=4,5 LOG_RANK=0,1 NGPU=2 CONFIG_FILE=./torchtitan/models/llama3/train_configs/llama3_8b.toml ./run_train.sh --metrics.log_freq 1 --training.seq_len 8192 --training.steps 10 --parallelism.data_parallel_shard_degree 1 --activation_checkpoint.mode full --model.tokenizer_path /workspace/torchtitan-v0.1.0/torchtitan/torchtitan/datasets/tokenizer/original/tokenizer.model --training.dataset wikipedia --parallelism.pipeline_parallel_degree 2 --training.local_batch_size 128 --parallelism.pipeline_parallel_microbatch_size 1 --training.dataset_path /workspace/wikipedia_subset --training.seed 42 --parallelism.pipeline_parallel_schedule 1F1B ``` ## 1F1B torchtitan train results ### before fix <img width="1526" height="606" alt="b8e281cce1dac15e827c216e7d83f402" src="https://github.com/user-attachments/assets/545c0a80-6276-40c0-893f-fd2df0a53b8d" /> ### after fix <img width="1526" height="594" alt="70d5ceba311a8398d041189bf8897cfc" src="https://github.com/user-attachments/assets/0d606e08-238a-4115-a1c0-b40df101d867" /> After the fix, the memory usage on rank1 (i.e., the non-first stages) is 6.9GB lower than before the fix. The memory usage on rank0 remains unchanged (rank0 represents stage0). ## Interleaved1F1B torchtitan train results ### before fix <img width="1514" height="601" alt="a28b7f9704b9234870619c43194e8a72" src="https://github.com/user-attachments/assets/2c28565f-ffff-4747-a8f5-722b5c65dc7e" /> ### after fix <img width="1526" height="621" alt="2d8d6d956b72885186f8c7059146c41a" src="https://github.com/user-attachments/assets/8c4a4ff2-336b-4e0b-8ac4-014ae22c2ed1" /> After the fix, rank1 saves 14.57GB (rank1 holds layer1 and layer3) and rank0 saves 7.5GB (rank0 holds layer0 and layer2). ## Memory snapshot results Also, I have dumped the memory snapshot to observe the memory under the 1F1B PP strategy. ### before fix <img width="1906" height="918" alt="6fd4e4ba82b8bacf9ca6edee4f3d5581" src="https://github.com/user-attachments/assets/d1b9245c-b09f-43c5-87ce-87ba48533a70" /> We can see the memory increasing as the PP step_microbatches run (the lifetime of the input activation's gradient, i.e., the output of `FusedRMSNormBackward`, lasts too long). ### after fix <img width="1903" height="918" alt="2e415f25af6750d06e5e647683b212b9" src="https://github.com/user-attachments/assets/b657c8f6-5a56-46bd-8743-f3b8375c81b0" /> After the fix, memory usage during training is steadier (the input activation's gradient is released, or returned to the allocator, soon). Pull Request resolved: https://github.com/pytorch/pytorch/pull/164329 Approved by: https://github.com/H-Huang	2025-10-08 16:12:00 +00:00
Adnan Akhundov	73adac05d1	Triton 3.5.x pin update to 7416ffc (#164587 ) Updates triton pin to latest: https://github.com/triton-lang/triton/commits/release/3.5.x/ This update contains 1 cherry-pick to fix flex_attention_fwd regression on B200: - https://github.com/triton-lang/triton/pull/8366 Pull Request resolved: https://github.com/pytorch/pytorch/pull/164587 Approved by: https://github.com/atalman	2025-10-08 16:07:18 +00:00
eqy	0d39ecb2ce	[cuDNN][RNN] cuDNN RNN supports BFloat16 inputs since 9.13 (#164411 ) seems to work Pull Request resolved: https://github.com/pytorch/pytorch/pull/164411 Approved by: https://github.com/Skylion007	2025-10-08 15:26:50 +00:00
Nikita Shulga	90c0825e2d	[GHF] Allow reverts from pytorch-auto-revert app (#164911 ) This is a bit weird, but author_login is not a unique field, but author_url is. Explicitly allow https://github.com/apps/pytorch-auto-revert to issue revert commands Update mocks by running ``` sed -i -e s/8e262b0495bd934d39dda198d4c09144311c5ddd6cca6a227194bd48dbfe7201/47860a8f57a214a426d1150c29893cbc2aa49507f12b731483b1a1254bca3428/ gql_mocks.json ``` Test plan: Run ```python from trymerge import GitHubPR pr=GitHubPR("pytorch", "pytorch", 164660) print(pr.get_last_comment().author_url, pr.get_comment_by_id(3375785595).author_url) ``` that should produce ``` https://github.com/pytorch-auto-revert https://github.com/apps/pytorch-auto-revert ``` Plus added a regression test that checks two particular comments for revert validity `pytorch-auto-revert` user is my alter ego :) Pull Request resolved: https://github.com/pytorch/pytorch/pull/164911 Approved by: https://github.com/jeanschmidt viable/strict/1759955121	2025-10-08 15:15:45 +00:00
PyTorch MergeBot	fd4bde430a	Revert "list_stored_sd_metadata API. (#160610 )" This reverts commit da903b6a8be422529d47649e89c0d50bb95c37ca. Reverted https://github.com/pytorch/pytorch/pull/160610 on behalf of https://github.com/jeffdaily due to broke ROCm CI, but flaky also on CUDA CI https://hud.pytorch.org/failure?name=periodic%20%2F%20linux-jammy-rocm-py3.10%20%2F%20test%20(distributed%2C%202%2C%203%2C%20linux.rocm.gpu.mi250.4%2C%20module%3Arocm%2C%20oncall%3Adistributed)&jobName=undefined&failureCaptures=distributed%2Fcheckpoint%2Ftest_list_stored_state_dict.py%3A%3ATestListStateDict%3A%3Atest_list_stored_sd_metadata ([comment](https://github.com/pytorch/pytorch/pull/160610#issuecomment-3382023022)) viable/strict/1759952983	2025-10-08 15:10:38 +00:00
PyTorch MergeBot	b5e93ffdcf	Revert "Limit path search within range (#164581 )" This reverts commit 415e641572473479fc9d9eaea12762e1a223a9e0. Reverted https://github.com/pytorch/pytorch/pull/164581 on behalf of https://github.com/eellison due to merge sets makes this trickier ([comment](https://github.com/pytorch/pytorch/pull/164581#issuecomment-3381955240))	2025-10-08 14:56:21 +00:00
PyTorch MergeBot	f8d0d65ddc	Revert "Add memory estimator (#164738 )" This reverts commit ab01a0d7d352e7fd07989b8d6bf035bf82aea74e. Reverted https://github.com/pytorch/pytorch/pull/164738 on behalf of https://github.com/eellison due to merge sets makes this trickier ([comment](https://github.com/pytorch/pytorch/pull/164581#issuecomment-3381955240))	2025-10-08 14:56:21 +00:00
Jeff Daily	f46ddb1e65	[ROCm][CI] add gfx1150 gfx1151 to docker images for binary builds (#164854 ) Fixes #164346. Pull Request resolved: https://github.com/pytorch/pytorch/pull/164854 Approved by: https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-10-08 14:34:22 +00:00
PyTorch MergeBot	20082d7136	Revert "fix flex attention eager bwd: more rounding (#164317 )" This reverts commit 41808b2ba9a61ab2f4c7af394c1668d09a4a0331. Reverted https://github.com/pytorch/pytorch/pull/164317 on behalf of https://github.com/jeffdaily due to inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_builtin_score_mods_seqlen_lt_custom_sparse_block_size_score_mod4_cuda_float16 [GH job link](https://github.com/pytorch/pytorch/actions/runs/18330774537/job/52207370954) [HUD commit link](`41808b2ba9`) ([comment](https://github.com/pytorch/pytorch/pull/164317#issuecomment-3381812090))	2025-10-08 14:29:10 +00:00
Laith Sakka	7158aa22e8	remove more (#164753 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/164753 Approved by: https://github.com/aorenste, https://github.com/mlazos ghstack dependencies: #164664, #164665, #164667, #164668	2025-10-08 14:23:38 +00:00
Laith Sakka	2035f6b2e6	use check_size instead of check_is_size in ops.py (#164668 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/164668 Approved by: https://github.com/angelayi ghstack dependencies: #164664, #164665, #164667	2025-10-08 14:23:38 +00:00
Mwiza Kunda	2b58adc3bd	[inductor][templates] Distinguish between kernel input nodes and codegen input nodes (#163752 ) If there is a single autotuner choice, the wrong type of input node is used to instantiate `TritonTemplateBuffer` through `TritonTemplateCaller.output_node`. This PR distinguishes the input nodes used in `AlgorithmSelectorCache.__call__` between the actual inputs passed to the kernel at runtime, vs the possibly viewed inputs that influence scheduling behaviour (e.g. `MemoryDeps`) and codegen. See the added unit test for more detail. Pull Request resolved: https://github.com/pytorch/pytorch/pull/163752 Approved by: https://github.com/eellison viable/strict/1759948122	2025-10-08 14:12:14 +00:00
angelayi	322091d8d8	[opaque_obj] Add make_fx tracing support (#163278 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/163278 Approved by: https://github.com/zou3519 ghstack dependencies: #163279, #163277 viable/strict/1759930024	2025-10-08 09:09:16 +00:00
angelayi	2bb4e6876c	[opaque obj] Error for torch.library.custom_op infer_schema (#163277 ) Unsure how we can get infer_schema to infer the scriptObject type from just the type annotation, so for now will just error clearly and ask users to specify a schema. Pull Request resolved: https://github.com/pytorch/pytorch/pull/163277 Approved by: https://github.com/zou3519 ghstack dependencies: #163279	2025-10-08 09:09:16 +00:00
angelayi	56ef7743fc	[opaque_obj] Add __eq__ and __deepcopy__ (#163279 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/163279 Approved by: https://github.com/zou3519	2025-10-08 09:09:16 +00:00
Yuanyuan Chen	64108bdbed	[BC-Breaking] Remove long-deprecated casting functions from native_functions.yaml (#164641 ) This PR removes `torch._cast_XXX` from generated OPs. They were deprecated in PyTorch 1 Pull Request resolved: https://github.com/pytorch/pytorch/pull/164641 Approved by: https://github.com/albanD, https://github.com/justinchuby	2025-10-08 08:27:58 +00:00
Maggie Moss	c855f8632e	Pyrefly suppressions 7/n (#164913 ) Adds suppressions so pyrefly will typecheck clean: https://github.com/pytorch/pytorch/issues/163283 Almost there! Test plan: dmypy restart && python3 scripts/lintrunner.py -a pyrefly check step 1: delete lines in the pyrefly.toml file from the project-excludes field step 2: run pyrefly check step 3: add suppressions, clean up unused suppressions before: https://gist.github.com/maggiemoss/4b3bf2037014e116bc00706a16aef199 after: INFO 0 errors (6,884 ignored) Pull Request resolved: https://github.com/pytorch/pytorch/pull/164913 Approved by: https://github.com/oulgen	2025-10-08 07:27:17 +00:00
morrison-turnansky	12d2ef557f	Update round size with 1 division behavior (#162203 ) have round size return nearest power of 2 greater than or equal to size with 1 division Fixes #161139 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162203 Approved by: https://github.com/ezyang	2025-10-08 06:41:46 +00:00
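The rounding rule stated in 12d2ef557f, "nearest power of 2 greater than or equal to size", can be sketched as follows. This illustrates only that arithmetic and is not the allocator's code; `next_power_of_two` is a hypothetical helper, and any minimum block size or other allocator policy is deliberately left out.

```python
def next_power_of_two(size: int) -> int:
    """Smallest power of two that is >= size (size must be positive)."""
    if size < 1:
        raise ValueError("size must be a positive integer")
    # (size - 1).bit_length() is the number of bits needed to represent
    # size - 1, so shifting 1 by that amount yields the next power of two.
    # Exact powers of two are returned unchanged.
    return 1 << (size - 1).bit_length()


for requested in (1, 5, 8, 9, 1000):
    print(requested, "->", next_power_of_two(requested))
```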
Edward Yang	65aa62d50d	Use codegen for the boxed interpreters (#164573 ) Authored with claude code. The arg parsing is kind of horrible, open to more suggestions. Signed-off-by: Edward Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/164573 Approved by: https://github.com/albanD, https://github.com/jansel	2025-10-08 06:27:44 +00:00
Jane Xu	6a09f9306c	Fix #164742 , all header-impl'd userfacing functions should be inline (#164871 ) It is as @mxmpl pointed out; we are missing an inline. Pull Request resolved: https://github.com/pytorch/pytorch/pull/164871 Approved by: https://github.com/mikaylagawarecki	2025-10-08 05:57:19 +00:00
Ke Wen	19bf67be32	multimem reduce (#164517 ) Modified `multimem_one_shot_all_reduce_out` function to accept a `root` argument, making it a `multimem_reduce` op. The original `multimem_one_shot_all_reduce` op becomes a caller of the `multimem_reduce`, with each rank providing its own rank id as root. Pull Request resolved: https://github.com/pytorch/pytorch/pull/164517 Approved by: https://github.com/ngimel viable/strict/1759916978	2025-10-08 05:25:16 +00:00
PyTorch MergeBot	1927783aa3	Revert "Reland vision pinned commit hash update (#164492 )" This reverts commit 6861a270624b44954826688f8dad668eb0154452. Reverted https://github.com/pytorch/pytorch/pull/164492 on behalf of https://github.com/izaitsevfb due to see autorevert msg above, inductor breakage is legit ([comment](https://github.com/pytorch/pytorch/pull/164492#issuecomment-3379537888))	2025-10-08 04:38:26 +00:00
Nicolas Macchioni	184817c7a8	locks + unit tests (#164636 ) Test Plan: ``` buck test fbcode//mode/opt caffe2/test/inductor:caching ``` Reviewed By: aorenste D83714690 Pull Request resolved: https://github.com/pytorch/pytorch/pull/164636 Approved by: https://github.com/aorenste	2025-10-08 04:34:22 +00:00
Pradeep Fernando	da903b6a8b	list_stored_sd_metadata API. (#160610 ) Summary: 1. Certain checkpoint load use cases are not aware of the properties of the data/tensors they want to load. 2. These use cases include data loader checkpoints and reading data for post processing (when the original model definition is not available). 3. There, we have to use the saved checkpoint (metadata) as our source of truth. 4. This RFC proposal exposes the checkpoint metadata using a public API. In this proposal we expose the stored state-dict metadata (minus associated storage/chunk metadata). Chunk/storage details should not be exposed to the users and are an implementation detail of the storage writer/reader. Test Plan: UT. Rollback Plan: Differential Revision: D80231457 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160610 Approved by: https://github.com/saumishr viable/strict/1759915520	2025-10-08 04:33:51 +00:00
Boyuan Feng	f76fdcaaf8	[Benchmark] cleanup huggingface models (#164815 ) Prune models from TorchInductor dashboard to reduce ci cost. This PR prunes for hugging face models according to the [doc](https://docs.google.com/document/d/1nLPNNAU-_M9Clx9FMrJ1ycdPxe-xRA54olPnsFzdpoU/edit?tab=t.0), which reduces from 46 to 27 models. Pull Request resolved: https://github.com/pytorch/pytorch/pull/164815 Approved by: https://github.com/anijain2305, https://github.com/seemethere, https://github.com/huydhn, https://github.com/malfet	2025-10-08 03:21:04 +00:00
Sam Larsen	608792153f	[inductor][codecache] Print bytes in codecache debug output (#164898 ) Summary: We have an internal request to help understand why the hash of `post_grad_custom_post_pass` is changing between attempts. We don't get useful info from the debug output, because we just print "<bytes>". Instead, attempt to print at least _some_ of the value in case it contains readable characters. Test Plan: Registered a dummy post_grad_custom_pass and printed codecache debug output `TORCH_LOGS=+torch._inductor.codecache python ~/foo.py` Yields something like: ``` V1007 16:41:19.024000 3546009 /data/users/slarsen/pytorch-3.10_4/torch/_inductor/codecache.py:989] [0/0] [law2ujt2wzjb5tyiu6jh64r2lxpvl62yvxcsmdouhg3qyelhhdv] post_grad_custom_post_pass: HelloWorld!��... ``` Differential Revision: [D84108770](https://our.internmc.facebook.com/intern/diff/D84108770) Pull Request resolved: https://github.com/pytorch/pytorch/pull/164898 Approved by: https://github.com/oulgen viable/strict/1759908300	2025-10-08 02:45:20 +00:00
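The idea in 608792153f, printing some of a bytes value instead of a fixed `<bytes>` placeholder, can be sketched like this. It is an assumption about the general technique and not the code added to `codecache.py`; `preview_bytes` and its `limit` parameter are hypothetical.

```python
def preview_bytes(value: bytes, limit: int = 32) -> str:
    """Return a short, printable preview of a bytes value for debug logs."""
    # Decode only a prefix; undecodable bytes become U+FFFD instead of raising.
    text = value[:limit].decode("utf-8", errors="replace")
    # Mark truncation so the reader knows the value continues.
    return text + ("..." if len(value) > limit else "")


print(preview_bytes(b"HelloWorld!"))
print(preview_bytes(b"Hello\xff\xfe"))
print(preview_bytes(b"a" * 40, limit=4))
```

Decoding with `errors="replace"` keeps any readable characters visible even when the rest of the payload is binary, which is what makes the preview useful when comparing hashes between attempts.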
Maggie Moss	086dec3235	Pyrefly suppressions 6/n (#164877 ) Adds suppressions so pyrefly will typecheck clean: https://github.com/pytorch/pytorch/issues/163283 Almost there! Test plan: dmypy restart && python3 scripts/lintrunner.py -a pyrefly check step 1: delete lines in the pyrefly.toml file from the project-excludes field step 2: run pyrefly check step 3: add suppressions, clean up unused suppressions before: https://gist.github.com/maggiemoss/4b3bf2037014e116bc00706a16aef199 after: INFO 0 errors (5,064 ignored) Only four directories left to enable Pull Request resolved: https://github.com/pytorch/pytorch/pull/164877 Approved by: https://github.com/oulgen	2025-10-08 02:30:57 +00:00
Aaron Orenstein	ad7b2bebc6	Use tuples to have a deterministic ordering. (#164851 ) When debugging I noticed some non-deterministic behavior and tracked it down to this literal set. Changed to be a tuple for determinism. Changed two other small literal sets also because using a set for a small lookup like that is slow. Pull Request resolved: https://github.com/pytorch/pytorch/pull/164851 Approved by: https://github.com/bobrenjc93, https://github.com/bdhirsh viable/strict/1759904803	2025-10-08 02:12:03 +00:00
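The determinism point in ad7b2bebc6 can be shown with a small sketch. The literal names below are made up for illustration and are not the ones changed in the commit. In CPython, string hashes are randomized per process unless `PYTHONHASHSEED` is fixed, so iterating a set of strings may yield a different order from run to run, while a tuple always iterates in the order it was written.

```python
# Hypothetical operator names; the real literals in the commit are not shown here.
NAMES_SET = {"add", "mul", "sub", "div"}
NAMES_TUPLE = ("add", "mul", "sub", "div")

# Iteration order of the tuple is fixed by the source code.
ordered = list(NAMES_TUPLE)

# Membership tests work on both containers; only iteration order differs.
same_members = set(NAMES_TUPLE) == NAMES_SET

print("tuple order:", ordered)
print("set order (may vary between runs):", list(NAMES_SET))
```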
Ke Wen	d444384003	[SymmMem] Tiled reduce (#162243 ) Added op: `tile_reduce(Tensor input, Tensor(a!) out, int root, str group_name)` For now supports only: - NVSHMEM backed symmetric tensor; - 2D tensor and tile; - torch.float. Testing on the bottom-right quadrant: ``` rank 0: tensor([[0., 0., 0., 0., 0., 0., 0., 0.], [0., 0., 0., 0., 0., 0., 0., 0.], [0., 0., 0., 0., 0., 0., 0., 0.], [0., 0., 0., 0., 0., 0., 0., 0.], [0., 0., 0., 0., 1., 1., 1., 1.], [0., 0., 0., 0., 1., 1., 1., 1.], [0., 0., 0., 0., 1., 1., 1., 1.], [0., 0., 0., 0., 1., 1., 1., 1.]], device='cuda:0') PASSED ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/162243 Approved by: https://github.com/ngimel	2025-10-08 02:03:04 +00:00
PyTorch MergeBot	3040a5d294	Revert "[dynamo] Support torch.fx.traceback.annotate (#164678 )" This reverts commit 801e282f39e9ef4424dfd3ecfd2b550a44595229. Reverted https://github.com/pytorch/pytorch/pull/164678 on behalf of https://github.com/izaitsevfb due to breaks executorch internally, see [D84068062](https://www.internalfb.com/diff/D84068062?entry_point=16) ([comment](https://github.com/pytorch/pytorch/pull/164678#issuecomment-3379281844))	2025-10-08 01:49:34 +00:00
PyTorch MergeBot	97463d4cf3	Revert "Fix double dispatch to Python for detach (#163671 )" This reverts commit c32118dc3e50505fd285e6e448a90883fce11535. Reverted https://github.com/pytorch/pytorch/pull/163671 on behalf of https://github.com/izaitsevfb due to breaks export tests ([comment](https://github.com/pytorch/pytorch/pull/163671#issuecomment-3379281422))	2025-10-08 01:46:45 +00:00
Howard Huang	c813617c53	[PP] Migrate other schedules to use PipelineScheduleRuntime (#164777 ) Second fix for https://github.com/pytorch/pytorch/issues/164756 It has been a TODO to make all the schedules execute using the same runtime. After this change, schedules will use the same logic as `_PipelineScheduleRuntime`, which adds `UNSHARD` and `RESHARD` operations to the schedules and fixes the issue mentioned above. <img width="920" height="406" alt="image" src="https://github.com/user-attachments/assets/a4d5bcd0-7dac-43cd-96f9-8ca33cfd8b91" /> A test is failing after the conversion: - Fixed a gradient scaling issue for dWeight Pull Request resolved: https://github.com/pytorch/pytorch/pull/164777 Approved by: https://github.com/fegin ghstack dependencies: #164775	2025-10-08 01:45:57 +00:00
Howard Huang	e659661ffa	[PP] Fix FSDP unshard/reshard (#164775 ) First fix for https://github.com/pytorch/pytorch/issues/164756 In the pipeline IR we call `UNSHARD` and `RESHARD`, but there is a bug: when we call `module.unshard()`, the call does not recurse into the nested FSDP modules, which sometimes leads to an allgather being called before the module forward. Since we want the pipeline IR to explicitly handle this, we can call `group.unshard` instead, which ensures that all the modules are unsharded. Pull Request resolved: https://github.com/pytorch/pytorch/pull/164775 Approved by: https://github.com/weifengpy	2025-10-08 01:45:57 +00:00
Markus Hoehnerbach	41808b2ba9	fix flex attention eager bwd: more rounding (#164317 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/164317 Approved by: https://github.com/drisspg ghstack dependencies: #163986	2025-10-08 01:17:45 +00:00
Xilun Wu	c0510dc447	[ContextParallel] add `_LoadBalancer` classes, and load-balance interface to Context Parallel APIs (#161062 ) Summary This PR provides an interface for users to specify how to load-balance the attention input. The load-balance is essentially a rearrangement of the input tensor(s) over the seq_dim before sharding and can be specified via an index tensor `rearrange` such that Q[rearrange] is the balanced Q users want (i.e. `rearrange[i] == j` where `i` is the new index of `Q[j]` in the balanced Q). An example is the `_generate_round_robin_indices()` added in https://github.com/pytorch/pytorch/pull/155442. New `_LoadBalancer` classes New `_LoadBalancer` class (defined in `torch/distributed/tensor/experimental/_load_balancer.py`) provides one interface for defining load-balance behavior: `_generate_indices(self, restore: bool = False)`. When `restore == False`, this method should output an index Tensor (namely `rearrange_idx`) such that QKV will be transformed into Q' K' V' in a way that `Q'[i] == Q[rearrange_idx[i]]` (same applies to K and V). When `restore == True`, this method outputs an index Tensor (namely `restore_idx`) such that `Q'[restore_idx] == Q` (same applies to K and V). Impact 2 public CP APIs and 1 private CP API are modified. This PR should be backward-compatible by: - For uses w/ SDPA, existing users must be using the `context_parallel()` API which does not take in the extra `load_balancer` argument and solely determines from the global var `_cp_options.enable_load_balance`. - For new users including those who want to try `flex_attention()`, we require using the new API `_context_parallel_buffers` to explicitly shard the QKV input instead of using `context_parallel()` because we no longer rely on TorchDispatchMode nor TorchFunctionMode for op replacement. And we also require users to explicitly pass in a `load_balancer` argument if load-balancing is demanded. Load-Balance Behavior `context_parallel_unshard()`, and `create_cp_block_mask()` APIs now take an extra optional argument `load_balancer`. This argument is optional because of backward compatibility but we require new users to explicitly pass in a `load_balancer` if load-balancing is demanded: - if `load_balancer == None` and `_cp_options.enable_load_balance == False`, CP performs no load-balancing on input Tensors. - if `load_balancer == None` and `_cp_options.enable_load_balance == True`, CP performs head-tail load-balancing (e.g. split a Tensor into 2N chunks, where the first N are called head and the rest are called tail. Place the first head chunk and the last tail chunk on rank 0, the second head chunk along with the second-to-last tail chunk on rank 1, and so on). `_context_parallel_buffers()` also takes the extra optional argument `load_balancer`, but the behavior is slightly different from the other 2 APIs -- it doesn't branch on `_cp_options.enable_load_balance` : - if `load_balancer == None`, no load-balancing will be performed - otherwise, apply load-balancing using `load_balancer._generate_indices()` before sharding. Changes This PR moves the index Tensor generation logic into a set of LoadBalancer classes and makes LoadBalancer the common interface for Context Parallel APIs that leverage load-balancing: * _context_parallel_buffers * context_parallel_unshard * create_cp_block_mask The `_LoadBalancer` classes added are: - `_LoadBalancer`: the abstract base class that provides the `_generate_indices` interface for index Tensor generation. - `_HeadTailLoadBalancer`: Implements head-tail balancing logic. - `_PerDocumentHeadTailLoadBalancer`: Supports per-document head-tail balancing for batched sequences. Test `pytest test/distributed/tensor/test_attention.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/161062 Approved by: https://github.com/fegin	2025-10-08 01:09:14 +00:00
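The head-tail scheme and the two index tensors described in c0510dc447 can be sketched with plain Python lists. This is not the `_HeadTailLoadBalancer` implementation: `head_tail_indices` and `restore_indices` are hypothetical helpers, lists stand in for tensors, and the sequence length is assumed to be divisible by twice the world size.

```python
def head_tail_indices(seq_len: int, world_size: int) -> list:
    """Rearrangement index: rank r receives chunk r (head) followed by
    chunk 2N-1-r (tail), where the sequence is split into 2N chunks."""
    num_chunks = 2 * world_size
    if seq_len % num_chunks != 0:
        raise ValueError("seq_len must be divisible by 2 * world_size")
    chunk = seq_len // num_chunks
    rearrange = []
    for rank in range(world_size):
        for c in (rank, num_chunks - 1 - rank):
            rearrange.extend(range(c * chunk, (c + 1) * chunk))
    return rearrange


def restore_indices(rearrange: list) -> list:
    """Inverse permutation, so that balanced[restore[j]] == original[j]."""
    restore = [0] * len(rearrange)
    for new_pos, old_pos in enumerate(rearrange):
        restore[old_pos] = new_pos
    return restore


q = list("abcdefgh")  # stand-in for Q along the sequence dimension
rearrange_idx = head_tail_indices(len(q), world_size=2)
q_balanced = [q[i] for i in rearrange_idx]         # Q'[i] == Q[rearrange_idx[i]]
restore_idx = restore_indices(rearrange_idx)
q_restored = [q_balanced[i] for i in restore_idx]  # Q'[restore_idx] == Q

print(rearrange_idx, q_balanced, restore_idx, q_restored)
```

With 8 positions and 2 ranks, the first half of `q_balanced` is what rank 0 would hold (first head chunk plus last tail chunk) and the second half is what rank 1 would hold.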
Nicolas Macchioni	9ec10dc26a	utils + unit tests (#164551 ) Test Plan: ``` buck test fbcode//mode/opt caffe2/test/inductor:caching ``` Reviewed By: aorenste Differential Revision: D83714691 Pull Request resolved: https://github.com/pytorch/pytorch/pull/164551 Approved by: https://github.com/aorenste	2025-10-08 01:05:45 +00:00

94133 Commits