pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-10-20 12:54:11 +08:00

Author	SHA1	Message	Date
Xuehai Pan	4dce5b71a0	[build] modernize build-frontend: `python setup.py develop/install` -> `[uv ]pip install --no-build-isolation [-e ].` (#156027 ) Modernize the development installation: ```bash # python setup.py develop python -m pip install --no-build-isolation -e . # python setup.py install python -m pip install --no-build-isolation . ``` Now, the `python setup.py develop` is a wrapper around `python -m pip install -e .` since `setuptools>=80.0`: - pypa/setuptools#4955 `python setup.py install` is deprecated and will emit a warning during run. The warning will become an error on October 31, 2025. - `9c4d383631/setuptools/command/install.py (L58-L67)` > ```python > SetuptoolsDeprecationWarning.emit( > "setup.py install is deprecated.", > """ > Please avoid running ``setup.py`` directly. > Instead, use pypa/build, pypa/installer or other > standards-based tools. > """, > see_url="https://blog.ganssle.io/articles/2021/10/setup-py-deprecated.html", > due_date=(2025, 10, 31), > ) > ``` - pypa/setuptools#3849 Additional Resource: - [Why you shouldn't invoke setup.py directly](https://blog.ganssle.io/articles/2021/10/setup-py-deprecated.html) Pull Request resolved: https://github.com/pytorch/pytorch/pull/156027 Approved by: https://github.com/ezyang	2025-07-09 11:24:27 +00:00
Xuehai Pan	fc0376e8b1	[BE][2/6] fix typos in test/ (test/test_*.py) (#157636 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/157636 Approved by: https://github.com/yewentao256, https://github.com/mlazos ghstack dependencies: #156311, #156609	2025-07-09 11:02:23 +00:00
Xuehai Pan	ffe11b2bf2	[BE] fix typo in torch/distributed/tensor/: childs -> children (#156609 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/156609 Approved by: https://github.com/wanchaol, https://github.com/cyyever ghstack dependencies: #156311	2025-07-09 11:02:23 +00:00
Xuehai Pan	4cc8b60d1b	[BE][1/16] fix typos in torch/ (#156311 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/156311 Approved by: https://github.com/albanD	2025-07-09 11:02:22 +00:00
Syed Tousif Ahmed	f5bbaa2253	Fixes typo in nccl_window_registration test (#157293 ) As mentioned here: https://github.com/pytorch/pytorch/pull/155134#discussion_r2175605192 Pull Request resolved: https://github.com/pytorch/pytorch/pull/157293 Approved by: https://github.com/Skylion007	2025-07-09 11:01:18 +00:00
Xuehai Pan	924fc52e18	[BE] add a linter to check consistency for cmake minimum version in requirements (#156961 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/156961 Approved by: https://github.com/ezyang, https://github.com/malfet	2025-07-09 10:44:17 +00:00
PyTorch MergeBot	b83d8827bc	Revert "Deprecate DataLoader pin_memory_device param (#146821 )" This reverts commit ab655816b8f76f511fb2262d45276d8d1b13d59c. Reverted https://github.com/pytorch/pytorch/pull/146821 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/146821#issuecomment-3052093902))	2025-07-09 10:29:31 +00:00
thenumberouscode	6f23f53599	[inductor] fix tensor.to(uint8) error when tensor src type is float (#157267 ) The cpu inductor processes .to(torch.uint8) incorrectly, leading to numerical inconsistencies. The convert_float_to_int8 function may return incorrect results for negative inputs, such as -2.xx, when the data type is uint8_t, producing 0 instead of 255. This issue stems from the clamping logic; we should avoid converting min_val to uint8_t too early Fixes https://github.com/pytorch/pytorch/issues/156788 @leslie-fang-intel Pull Request resolved: https://github.com/pytorch/pytorch/pull/157267 Approved by: https://github.com/leslie-fang-intel	2025-07-09 07:03:38 +00:00
Menglu Yu	e3f2597b45	[Optimus] Fix normalization pass in the aten IR (#157857 ) Summary: We found there's a special case in recent APS model where the input tensor has smaller size compared to the split size. It will be automatically truncated in split.Tensor thus we add extra condition check for split_with_sizes when do the normalization. Test Plan: ### unit ``` buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:split_cat_fx_aten_passes -- test_split_aten_normalization ``` Buck UI: https://www.internalfb.com/buck2/2ecd1ef8-8efe-4245-b4c8-282c23645b3c Test UI: https://www.internalfb.com/intern/testinfra/testrun/7599824648585787 Network: Up: 3.9GiB Down: 9.2GiB (reSessionID-1396c91e-0dd2-457b-a49b-a6ab1f2a7d8f) Loading targets. Remaining 0/5344 99617 dirs read, 1074949 targets declared Analyzing targets. Remaining 0/123279 4988547 actions, 5966764 artifacts declared Executing actions. Remaining 0/728058 209:52:59.9s exec time total Command: test. Finished 12466 local, 209448 remote, 1226 cache (1% hit) 42:10.5s exec time cached (0%) Time elapsed: 26:07.6s Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0 ### E2E before fix: aps-afoc_apop_pt2_v0-db2fe0449a after fix: aps-afoc_apop_pt2_v0-755ad0cdc6 Rollback Plan: Differential Revision: D77961394 Pull Request resolved: https://github.com/pytorch/pytorch/pull/157857 Approved by: https://github.com/anijain2305	2025-07-09 05:38:15 +00:00
Shangdi Yu	effe376db0	Adding aoti_standalone config (#157731 ) Summary: When `compile_standalone` is True, we set `package_cpp_only` to True as well. We raise an error if `package_cpp_only` is explicitly set to False in config. Test Plan: ``` buck2 run mode/dev-nosan fbcode//caffe2/test/inductor:test_aot_inductor -- -r TestAOTInductorConfig ``` Rollback Plan: Differential Revision: D77889754 Pull Request resolved: https://github.com/pytorch/pytorch/pull/157731 Approved by: https://github.com/desertfire	2025-07-09 04:30:04 +00:00
Xu Han	fcbf7c749a	[Windows][Inductor] normalize_path_separator compiler path (#157835 ) Fixes #157673 For the call trace: ``` ...... File "D:\Programs\Python\virtualenvs\torch_code-afvE469o\lib\site-packages\torch\_inductor\codegen\common.py", line 2569, in reduction return self.kernel.reduction(dtype, src_dtype, reduction_type, value) File "D:\Programs\Python\virtualenvs\torch_code-afvE469o\lib\site-packages\torch\_inductor\codegen\cpp.py", line 2155, in reduction self._gen_parallel_reduction_buffers(acc, acc_type, reduction_type, init_dtype) File "D:\Programs\Python\virtualenvs\torch_code-afvE469o\lib\site-packages\torch\_inductor\codegen\cpp.py", line 1942, in _gen_parallel_reduction_buffers reduction_prefix_array( File "D:\Programs\Python\virtualenvs\torch_code-afvE469o\lib\site-packages\torch\_inductor\codegen\cpp.py", line 335, in reduction_prefix_array if cpp_builder.is_msvc_cl() File "D:\Programs\Python\virtualenvs\torch_code-afvE469o\lib\site-packages\torch\_inductor\cpp_builder.py", line 317, in is_msvc_cl return _is_msvc_cl(get_cpp_compiler()) File "D:\Programs\Python\virtualenvs\torch_code-afvE469o\lib\site-packages\torch\_inductor\cpp_builder.py", line 240, in _is_msvc_cl subprocess.check_output([cpp_compiler, "/help"], stderr=subprocess.STDOUT) torch._inductor.exc.InductorError: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd3 in position 0: invalid continuation byte ``` On non-English language pack msvc environment, compiler path has raised `utf-8` issue. I add the `normalize_path_separator` to normalize the compiler path and avoid the issue. Pull Request resolved: https://github.com/pytorch/pytorch/pull/157835 Approved by: https://github.com/jansel	2025-07-09 04:02:20 +00:00
soulitzer	8bda95228f	[autograd] Avoid creating and recording event when unnecessary (#157503 ) Today, we always create and record an events in two places: 1) Upon seeing the first producer, we record an event on the producer, and we wait for this event in two places: (1) when the engine goes to run the consumer, the consumer stream waits for this event. (2) prior to doing accumulation, the accumulation stream waits for this event. 2) After doing accumulation, we record an event on the accumulation stream and wait for this event in a single place: when the engine goes to run the consumer. We do not actually need to record the event in the cases where the 1st producer stream is the same as the consumer and as the accumulation stream, and where the accumulation stream is the same as the consumer stream. Removing this unnecessary create + record event should save a few us for each instance avoided. Fixes https://github.com/pytorch/pytorch/issues/157407 ---- Manual test plan: - [x] @eqy to confirm perf is restored - [x] Running the repro originally reported before/after the patch Pull Request resolved: https://github.com/pytorch/pytorch/pull/157503 Approved by: https://github.com/eqy ghstack dependencies: #155715	2025-07-09 03:36:14 +00:00
florian	8d070187e3	fix type hints for interpolation functions (#157202 ) Fixes #129053 Previously interpolate had a bad signature and not correct type hints. This fixes this issue. Pull Request resolved: https://github.com/pytorch/pytorch/pull/157202 Approved by: https://github.com/ezyang, https://github.com/albanD	2025-07-09 03:11:37 +00:00
Jing Xu	c515385b0a	Add Intel GPU info collection to the collect env script (#157351 ) https://github.com/pytorch/pytorch/pull/137846 was mistakenly closed. Reopen a PR to land the PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/157351 Approved by: https://github.com/guangyey, https://github.com/malfet	2025-07-09 03:01:41 +00:00
Nikita Shulga	d6237721c0	[Build] Make PyTorch compilable with gcc-14 on ARM (#157867 ) Fixes numerous ICEs in vreg allocations for SVE+BF16 ``` /pytorch/aten/src/ATen/ParallelOpenMP.h:25:9: error: unrecognizable insn: 25 \| #pragma omp parallel \| ^~~ (insn 257 256 258 30 (set (reg:VNx8BF 449 [ bf16_vec1_217 ]) (unspec:VNx8BF [ (reg:VNx8BF 455) (reg:VNx8BF 456) ] UNSPEC_IORF)) "/pytorch/aten/src/ATen/cpu/vec/sve/vec_bfloat16.h":228:31 discrim 1 -1 (nil)) during RTL pass: vregs /pytorch/aten/src/ATen/ParallelOpenMP.h:25:9: internal compiler error: in extract_insn, at recog.cc:2812 0xd73c33 internal_error(char const, ...) ???:0 0xd73d1f fancy_abort(char const, int, char const) ???:0 0x890053 _fatal_insn(char const, rtx_def const, char const, int, char const) ???:0 0x890087 _fatal_insn_not_found(rtx_def const, char const, int, char const) ???:0 0x1379093 extract_insn(rtx_insn) ???:0 ``` And one in RTL-expand pass while compiling Activation.cpp ``` during RTL pass: expand In file included from /pytorch/aten/src/ATen/native/cpu/Activation.cpp:12, from /pytorch/build/aten/src/ATen/native/cpu/Activation.cpp.DEFAULT.cpp:1: /pytorch/aten/src/ATen/native/cpu/Activation.cpp: In lambda function: /pytorch/aten/src/ATen/native/cpu/Activation.cpp:94:7: internal compiler error: Segmentation fault 94 \| }); \| ^ /pytorch/aten/src/ATen/Dispatch.h:201:7: note: in definition of macro 'AT_DISPATCH_SWITCH' 201 \| __VA_ARGS__ \ \| ^~~~~~~~~~~ /pytorch/aten/src/ATen/Dispatch.h:72:3: note: in expansion of macro 'AT_PRIVATE_CASE_TYPE_USING_HINT' 72 \| AT_PRIVATE_CASE_TYPE_USING_HINT(enum_type, scalar_t, __VA_ARGS__) \| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ /pytorch/aten/src/ATen/Dispatch.h:214:3: note: in expansion of macro 'AT_DISPATCH_CASE' 214 \| AT_DISPATCH_CASE(at::ScalarType::Double, __VA_ARGS__) \ \| ^~~~~~~~~~~~~~~~ /pytorch/aten/src/ATen/Dispatch.h:218:34: note: in expansion of macro 'AT_DISPATCH_CASE_FLOATING_TYPES' 218 \| AT_DISPATCH_SWITCH(TYPE, NAME, AT_DISPATCH_CASE_FLOATING_TYPES(__VA_ARGS__)) \| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ /pytorch/aten/src/ATen/native/cpu/Activation.cpp:70:5: note: in expansion of macro 'AT_DISPATCH_FLOATING_TYPES' 70 \| AT_DISPATCH_FLOATING_TYPES(input.scalar_type(), "log_sigmoid_cpu", [&] { \| ^~~~~~~~~~~~~~~~~~~~~~~~~~ 0xd73c33 internal_error(char const, ...) ???:0 0x134f987 rebuild_jump_labels(rtx_insn) ???:0 ``` Interestingly enough, attempt to compile `Unfold2d.cpp` for `-march=armv8-a+sve` (i.e. without sve+bf16) support also causes ICE ``` /pytorch/aten/src/ATen/native/cpu/Unfold2d.cpp:221:1: error: unrecognizable insn: 221 \| } \| ^ (insn 2918 2917 2919 296 (set (reg:VNx8BI 5917) (unspec:VNx16BI [ (reg:VNx8BI 5920) (reg:VNx8BI 5922) (const_vector:VNx4BI [ (const_int 0 [0]) repeated x8 ]) ] UNSPEC_TRN1_CONV)) "/usr/include/aarch64-linux-gnu/bits/string_fortified.h":29:33 discrim 1 -1 (expr_list:REG_EQUAL (const_vector:VNx8BI [ (const_int 1 [0x1]) repeated x9 (const_int 0 [0]) (const_int 1 [0x1]) repeated x2 (const_int 0 [0]) repeated x4 ]) (nil))) during RTL pass: vregs ``` Which could be worked around by adding ```patch diff --git a/aten/src/ATen/native/cpu/Unfold2d.cpp b/aten/src/ATen/native/cpu/Unfold2d.cpp index 8ef0741e77af0a..59c76505dd6246 100644 --- a/aten/src/ATen/native/cpu/Unfold2d.cpp +++ b/aten/src/ATen/native/cpu/Unfold2d.cpp @@ -169,6 +169,10 @@ static void unfolded2d_acc_channels_last( / note: due to write issues, this one cannot be parallelized as well as * unfolded2d_copy / +#if defined(__GNUC__) && __GNUC__ == 14 && defined(__ARM_FEATURE_SVE) +// Workaround for gcc-14.2.0 ICE during RTL pass: vregs when compiling for SVE +__attribute__((optimize("no-tree-vectorize"))) +#endif void unfolded2d_acc_kernel( ScalarType dtype, void finput_data, ``` Fixes https://github.com/pytorch/pytorch/issues/157842 Pull Request resolved: https://github.com/pytorch/pytorch/pull/157867 Approved by: https://github.com/atalman, https://github.com/Skylion007	2025-07-09 02:59:08 +00:00
Hao Zhang(张浩)	ab8874bd26	Suppress warning when using native arch for jit loading cuda extensions. (#156923 ) Previeusly, if users want to let pytorch determine the cuda arch when jit loading cuda extensions, they should left environment variable `TORCH_CUDA_ARCH_LIST` empty, but which will raise an warning. This commit add an option to set `TORCH_CUDA_ARCH_LIST=native`, to tell pytorch users want to use native cuda arch intentionally. Pull Request resolved: https://github.com/pytorch/pytorch/pull/156923 Approved by: https://github.com/ezyang	2025-07-09 02:51:20 +00:00
fduwjj	bc6e0661a6	Fix more H100 CI (#157829 ) Follow @d4l3k 's fix in https://github.com/pytorch/pytorch/pull/157826/files. Two more fixes might be needed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/157829 Approved by: https://github.com/davidberard98, https://github.com/d4l3k	2025-07-09 01:28:05 +00:00
Bin Bao	e5edd013ab	[AOTI] Skip test_simple_multi_arch_embed_kernel_binary_True_cuda (#157301 ) Summary: For https://github.com/pytorch/pytorch/issues/156930, still no clue on what went wrong as it is not reproducible locally, but somehow the problem seems only exists when embed_kernel_binary is True. Let's skip it for now. Pull Request resolved: https://github.com/pytorch/pytorch/pull/157301 Approved by: https://github.com/yushangdi	2025-07-09 01:18:36 +00:00
xinan.lin	75f489d37f	[Break XPU][Inductor UT] Align tolerance of newly added case with cuda. (#157702 ) Align tolerance with cuda for the newly added case `test_comprehensive_logcumsumexp_xpu_float16` in #157512. Fixes #157697 Pull Request resolved: https://github.com/pytorch/pytorch/pull/157702 Approved by: https://github.com/jansel	2025-07-09 00:55:01 +00:00
Tristan Rice	3eb7084f7a	[ci] fix h100-distributed (#157826 ) This was broken by https://github.com/pytorch/pytorch/pull/157341 This should resolve the permission issue Pull Request resolved: https://github.com/pytorch/pytorch/pull/157826 Approved by: https://github.com/fduwjj, https://github.com/Skylion007, https://github.com/huydhn	2025-07-09 00:27:55 +00:00
PyTorch MergeBot	86251eff40	Revert "Introduce AcceleratorAllocatorConfig as the common class (#149601 )" This reverts commit 55108074c0795be3b617d3b13b06794f63e1f8ca. Reverted https://github.com/pytorch/pytorch/pull/149601 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/149601#issuecomment-3050628047))	2025-07-09 00:07:31 +00:00
Tristan Rice	1b3d69b59f	Work: block_current_stream API (#156883 ) This implements a new `wait_stream` API in Work that matches how `wait` works for ProcessGroupNCCL for CPU based backends such as Gloo. The idea is to support Gloo communication overlap in FSDPv2/HSDP with minimal changes to FSDP. There was a previous attempt to make FSDPv2 use Work.wait but given the extensive stream semantics used it doesn't play nicely. https://github.com/pytorch/pytorch/pull/148780 This uses a "Baton" CUDA kernel which spinlocks on a pinned CPU tensor waiting for it to be set. Test plan: ``` pytest test/distributed/test_c10d_gloo.py -v -k wait_stream pytest test/distributed/test_c10d_nccl.py -v -k wait_stream ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/156883 Approved by: https://github.com/kwen2501, https://github.com/fduwjj	2025-07-08 23:55:46 +00:00
Blaine Burton Rister	92f41ccc26	[Inductor] Support precomputed size args in the FX backend. (#157758 ) # Feature If a Triton kernel has a complicated indexing expression, Inductor may decide to precompute it on the host and pass it to the kernel as an argument. This happens in situations like broadcasts with dynamic shapes. This PR adds support for this feature to Inductor's FX IR backend. We generate FX IR for precomputed size args in 3 steps: 1. In `PythonWrapperCodegen`, this PR refactors the relevant code to use a `SymbolicCallArgLine` instead of raw Python strings. This stores a (symbol, expr) pair. (Prior to this PR, it was (str, expr), but changing this to a symbol makes it easier to do substitutions later on.) 2. In `WrapperFxCodegen`, keep a dict of {symbol: expr} arg defs which gets updated whenever we see a `SymbolicCallArgLine`. 3. When the FX backend sees a `KernelCallLine`, it uses this dict to replace symbolic call args with their definitions. In the longer run, it might be desirable to emit FX nodes defining these symbolic call args. That way, we could reuse the size computation when the same kernel is called multiple times. However, I wasn't sure if there was an existing way to generate FX nodes from a sympy expression, and implementing that seemed like overkill for the present purposes. # Test plan Added a new CI test exercising this feature. Pull Request resolved: https://github.com/pytorch/pytorch/pull/157758 Approved by: https://github.com/jansel	2025-07-08 23:22:17 +00:00
Simon Fan	95bc3da9f8	[c10d] support dynamic shapes for all_to_all_single_autograd (#157521 ) `all_to_all_single_autograd` is not an op, all the code executed until the `all_to_all_single` dispatch is visible to the compiler. This means the `all_to_all_single_autograd` wrapper code must support symints in order to be traceable with dynamic shapes. FIXES https://github.com/pytorch/pytorch/issues/157479 Pull Request resolved: https://github.com/pytorch/pytorch/pull/157521 Approved by: https://github.com/wconstab	2025-07-08 23:19:59 +00:00
Sidharth	9f18482d41	[dynamo] removing string literals for weblink generation (#157820 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/157820 Approved by: https://github.com/williamwen42	2025-07-08 23:08:06 +00:00
Nikita Shulga	c5b46b5408	[BE] Standardize CPU capabilities name (#157809 ) It's weird to call default x86 CPU capability `NO AVX`, when in reality it's something different. Also it's a bit strange to have it assigned different names on different platforms Fixes https://github.com/pytorch/pytorch/issues/157538 Pull Request resolved: https://github.com/pytorch/pytorch/pull/157809 Approved by: https://github.com/Skylion007	2025-07-08 23:06:09 +00:00
Andrey Talman	179dcc10e4	Add sm_70 arch for linux cuda 12.8 and 12.9 builds (#157558 ) Please see: https://github.com/pytorch/pytorch/issues/157517 We would like to keep Volta architectures by default for release 2.8 Pull Request resolved: https://github.com/pytorch/pytorch/pull/157558 Approved by: https://github.com/Skylion007, https://github.com/Camyll, https://github.com/seemethere, https://github.com/malfet	2025-07-08 23:02:10 +00:00
Sam Larsen	7a41f20794	[inductor] Quiesce Triton compile worker pool after each dynamo compile (#156187 ) For internal usages, keeping the Triton compile worker pool active for the lifetime of the process has caused some challenges, e.g., it slows down and muddies profiling due to the huge number of threads on a box: N threads = 8 ranks * 32 subprocs * M threads started by torch. Also, each subproc can use more than 1GB each. This PR adds the functionality to shutdown worker subprocs after each dynamo compile when using the SubprocPool implementation. The idea is to leave the main sidecar process running, but signal it to tear down its internal ProcessPoolExecutor when compile is finished. Restarting the ProcessPoolExecutor is relatively fast, e.g., 500ms because the ProcessPoolExecutor forks from the sidecar. Changes: * Do not start the ProcessPoolExecutor automatically when compile_fx is imported. Instead, start the sidecar process only. The sidecar process imports torch, so is still slow to start. * Introduce wakeup() and quiesce() calls to the implementation to start and stop the ProcessPoolExecutor. * Add a context manager to automatically quiesce() at the end of dynamo compilation. * Signal a wakeup() in compile_fx only when we have cuda devices. * Add a killswitch so we can turn of quiescing. Testing: For correctness, the stacked change at https://github.com/pytorch/pytorch/pull/156534 enables the feature for OSS so it's exercised in CI. For performance, because of recent compile-time variance (see https://github.com/pytorch/pytorch/issues/152566), it's pretty hard to glean whether there's a regression.... * Training: https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Tue%2C%2017%20Jun%202025%2021%3A32%3A04%20GMT&stopTime=Tue%2C%2024%20Jun%202025%2021%3A32%3A04%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(h100)&lBranch=gh/masnesral/210/head&lCommit=1b7315031c3bfad66a1a01700167a9ca1a2ae5f1&rBranch=main&rCommit=eab45643f22e58ee12d95d8b0162d51ca0a50801 * Inference: https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Tue%2C%2017%20Jun%202025%2021%3A32%3A04%20GMT&stopTime=Tue%2C%2024%20Jun%202025%2021%3A32%3A04%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(h100)&lBranch=gh/masnesral/210/head&lCommit=1b7315031c3bfad66a1a01700167a9ca1a2ae5f1&rBranch=main&rCommit=eab45643f22e58ee12d95d8b0162d51ca0a50801 The wins (mostly for inference) don't make sense, but I'm also skeptical of the losses (mostly for training). I can't repro any of the slowdowns locally. Furthermore, check out the benchmarking results for the stacked diff, which actually enables the quiescing functionality for OSS. That should only slow down compile since there can only be overhead to stop and start the workers. But the results are somehow better: * Training: https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Tue%2C%2017%20Jun%202025%2021%3A32%3A04%20GMT&stopTime=Tue%2C%2024%20Jun%202025%2021%3A32%3A04%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(h100)&lBranch=gh/masnesral/214/head&lCommit=41943253882a019b8ceafcd2bf4cd6acbe0cbca9&rBranch=main&rCommit=eab45643f22e58ee12d95d8b0162d51ca0a50801 * Inference: https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Tue%2C%2017%20Jun%202025%2021%3A32%3A04%20GMT&stopTime=Tue%2C%2024%20Jun%202025%2021%3A32%3A04%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(h100)&lBranch=gh/masnesral/214/head&lCommit=41943253882a019b8ceafcd2bf4cd6acbe0cbca9&rBranch=main&rCommit=eab45643f22e58ee12d95d8b0162d51ca0a50801 Pull Request resolved: https://github.com/pytorch/pytorch/pull/156187 Approved by: https://github.com/aorenste, https://github.com/jansel	2025-07-08 22:53:13 +00:00
Animesh Jain	178fe7aa98	[dynamo][fsdp] Consistent behavior of int attributes (#157262 ) Reimpl of https://github.com/pytorch/pytorch/pull/150954 Pull Request resolved: https://github.com/pytorch/pytorch/pull/157262 Approved by: https://github.com/bdhirsh	2025-07-08 22:11:33 +00:00
PyTorch MergeBot	2e14069081	Revert "[DTensor][FSDP2] necessary changes to FSDP and TP to unblock EP (#157216 )" This reverts commit 777eca9f16aeecd7c362a235cf25e6b8e6eda57f. Reverted https://github.com/pytorch/pytorch/pull/157216 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to fail a distributed test in trunk ([comment](https://github.com/pytorch/pytorch/pull/157216#issuecomment-3050258896))	2025-07-08 20:48:51 +00:00
angelayi	391473cca0	[export] Fix lift constants bug (#157719 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/157719 Approved by: https://github.com/yushangdi	2025-07-08 20:33:53 +00:00
lucasb-eyer	b9dc2fa4f7	Add legacy note to autograd.profiler doc. (#157459 ) Via google search I got to `torch.autograd.profiler` and implemented my code with it. Only to be taken by surprise finding `torch.profile.profiler`, which has a note saying the autograd one is legacy. This just adds such note to `autograd.profiler` to avoid this confusion and waste of time to future people in my situation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/157459 Approved by: https://github.com/sraikund16	2025-07-08 20:33:23 +00:00
zpcore	a73d9e0aec	Fix einsum strategy shard dim > ndim (#157593 ) Previously we didn't constrain Shard dim to be <= the tensor's ndim. This cause an invalid strategy like `(RR, RS(2)) -> RS(2),` for einsum `bmk,kn->bmn` on the 2d mesh. Pull Request resolved: https://github.com/pytorch/pytorch/pull/157593 Approved by: https://github.com/wconstab, https://github.com/wanchaol	2025-07-08 20:27:17 +00:00
Huy Do	06b3265cb1	Increase nightly C++ docs build timeout to 6h (#157759 ) This job has been timing out since May `261897734a/1`, maybe it's time to figure out if this makes sense. Issues https://github.com/pytorch/pytorch/issues/157763 Pull Request resolved: https://github.com/pytorch/pytorch/pull/157759 Approved by: https://github.com/malfet	2025-07-08 19:28:48 +00:00
Ankita George	dea4864ce0	HF loads dcp - don't do a full deserialize on every file (#157715 ) Summary: These changes in D76442012 got reverted after the PR landed due to aps_models/ads/launchers/pearl/tests/ne/e2e_deterministic_tests:pearl_e2e_ne_tests failing with `Config not loaded due to no timely response from configerator. Likely configerator_proxy or falcon_proxy are not healthy`, but that test failing is definitely transient and unrelated to my changes, so re-creating the diff Test Plan: ensure tests pass Rollback Plan: Differential Revision: D77871099 Pull Request resolved: https://github.com/pytorch/pytorch/pull/157715 Approved by: https://github.com/meetv18	2025-07-08 18:13:27 +00:00
Zeina Migeed	4f5be56612	[Pyrefly][Refactor] Replace dict() calls with literal dict syntax for improved readability (#157735 ) There are 31 places that I spotted which construct literal dictionaries. This PR refactors dictionary construction by replacing` dict(...) `calls with `literal {...}` syntax where applicable. Pull Request resolved: https://github.com/pytorch/pytorch/pull/157735 Approved by: https://github.com/ezyang, https://github.com/Skylion007	2025-07-08 18:10:33 +00:00
Howard Huang	0f31445139	Add stack trace of exception to MultiProcContinousTest (#157589 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/157589 Approved by: https://github.com/Skylion007	2025-07-08 17:54:35 +00:00
Shangdi Yu	5b4e0255d7	Check FakeScriptObject in _resolve_name_collision (#157736 ) Summary: Fix https://github.com/pytorch/pytorch/issues/157401 torch.equal cannot handle FakeScriptObject inputs. Test Plan: ``` buck run fbcode//mode/dev-nosan //caffe2/test/inductor:torchbind -- -r test_aoti_torchbind_name_collision ``` Rollback Plan: Differential Revision: D77894081 Pull Request resolved: https://github.com/pytorch/pytorch/pull/157736 Approved by: https://github.com/angelayi	2025-07-08 17:51:46 +00:00
Weishi.Deng	44d0800d60	[Intel GPU] Set higher tolerance for squeezenet1_1 with bf16 (#156920 ) We need to increase the tolerance slightly to ensure that certain models pass the accuracy check on the XPU device. This pull request preserves the original tolerance threshold for CUDA/CPU devices and introduces a new key, higher_bf16_xpu, which only affects the XPU device. Pull Request resolved: https://github.com/pytorch/pytorch/pull/156920 Approved by: https://github.com/soulitzer	2025-07-08 17:49:54 +00:00
Nikita Shulga	a5c61eb78d	[MPS][BE] Delete `as_strided_tensorimpl_mps` (#157772 ) Because it's just copy-n-paste of `as_strided_tensorimpl` with call to `updateTensorBaseShape`, which is not called/used anywhere else. Fixes https://github.com/pytorch/pytorch/issues/152701 Pull Request resolved: https://github.com/pytorch/pytorch/pull/157772 Approved by: https://github.com/Skylion007	2025-07-08 17:02:36 +00:00
henrylhtsang	bbe681ed51	[cutlass backend][BE][ez] Make matmul layouts be row x column (#156656 ) Differential Revision: [D77184232](https://our.internmc.facebook.com/intern/diff/D77184232/) Motivation: * This is the case we care the most. * We are caching the kernels for this row x column layout. So testing on them can potentially make ci run faster. Pull Request resolved: https://github.com/pytorch/pytorch/pull/156656 Approved by: https://github.com/ColinPeppler	2025-07-08 16:57:33 +00:00
Tianyu Liu	ed911747c2	[dtensor] add support for fused optimizer with parameters across multiple meshes (#157682 ) We are seeing more and more use cases where parameters in a model (under the same optimizer group) are put on different meshes. E.g. - when FSDP and TP are both applied, some parameters are sharded only on the FSDP mesh but not TP mesh (see https://github.com/pytorch/pytorch/pull/153268). - in [dp2ep Expert Parallel](https://github.com/pytorch/torchtitan/pull/1324), the routed experts are sharded on the (global FSDP \ EP) mesh for smaller FSDP and on the EP mesh for EP, whereas other params are sharded on the global FSDP mesh for FSDP. This PR is, in some sense, a continuation of https://github.com/pytorch/pytorch/pull/147869 to tackle the problem when fused optimizers are used. In such cases, the [`fused_adam`](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/native_functions.yaml#L15786) / `fused_adamw` has a scalar tensor arg `state_steps` which gets automatically cast to DTensor on the default [`compute_mesh`](https://github.com/pytorch/pytorch/blob/main/torch/distributed/tensor/_dispatch.py#L350) (one of the multiple meshes), even though the it could correspond to different meshes. To avoid hitting the cross-mesh propagation exception in `common_pointwise_strategy` and followup redistribute problems, we manually set the target mesh and placements to be the same as input mesh and placements, so that no redistribute will be triggered. This also helps bypass the situation where [`generate_redistribute_costs`](https://github.com/pytorch/pytorch/pull/157682/files#diff-eea32a36dd2d4e58307bc5229402e48048b2ecaef64a7c085495fba1ee10ac89R597) returns infinite cost due to cross mesh redistribute. Moreover, this PR has minimal scope (restricted to the `fused_ops`) and doesn't need to modify other files such as `_sharding_prop.py`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/157682 Approved by: https://github.com/wanchaol	2025-07-08 15:58:30 +00:00
Tianyu Liu	777eca9f16	[DTensor][FSDP2] necessary changes to FSDP and TP to unblock EP (#157216 ) This is to unblock "dp2ep" Expert Parallel + TP integration in torchtitan https://github.com/pytorch/torchtitan/pull/1324. It does two things: 1. Slightly modifies the glue code for FSDP/HSDP + TP to work with FSDP/HSDP + EP and FSDP/HSDP + EP + TP. I kept the name `FSDPParam._tp_spec` to make the change minimal. We can consider renaming it in the future if it confuses people, but I heard @wanchaol has a plan to rewrite DTensor strided sharding entirely. 2. Lifts the check of `_validate_tp_mesh_dim` for `torch.distributed.tensor.parallel.parallelize_module`, as in EP or EP+TP this check is too strict. In particular it assumes a DeviceMesh must have `mesh_dim_names` which is not always true. I'm also removing the file `torch/distributed/tensor/parallel/_utils.py` it belongs entirely, as the other check `_deprecate_warnings`, added two years ago, is not used any more. Pull Request resolved: https://github.com/pytorch/pytorch/pull/157216 Approved by: https://github.com/wanchaol, https://github.com/weifengpy	2025-07-08 15:57:37 +00:00
Aaron Gokaslan	476874b37f	[BE]: Update NCCL to 2.27.5 (#157108 ) Update NCCL to 2.27.5. Minor version, improves Blackwell, Symmem FP8 support, and fixes a bug with MNVVL. Pull Request resolved: https://github.com/pytorch/pytorch/pull/157108 Approved by: https://github.com/atalman	2025-07-08 15:40:54 +00:00
Steven Troxler	5dc75f72d4	Simplify the base classes of `_PyFutureMeta` (#157757 ) Summary: I'm fairly sure the use of a custom metaclass is a holdover from pre-3.7 where Generic used a custom metaclass so we had to use multiple inheritance to avoid import-time failures. At this point, `type(Generic)` is just `type` so it isn't needed, and we will get the least metaclass from our base classes, which means the `type(torch._C.Future)` isn't needed either, it will happen automatically just by inheritance. Test Plan: I'm fairly confident from local testing that this should be a no-op. But also, Pytorch CI should give us pretty strong signal that this change doesn't break anything in case there's some edge case I missed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/157757 Approved by: https://github.com/ezyang, https://github.com/Skylion007	2025-07-08 15:39:56 +00:00
Nikita Shulga	f88d7a7a34	[BE] Do not add `.` after troubleshooting_url (#157753 ) As it gets included into auto-hrefed URLs in say github logs to point to non existing location For example from https://github.com/pytorch/pytorch/actions/runs/16130448756/job/45517004735?pr=157749#step:18:27 > W0708 00:23:20.150000 67082 torch/_dynamo/convert_frame.py:1047] [0/8] To diagnose recompilation issues, see [https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.](https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.) Pull Request resolved: https://github.com/pytorch/pytorch/pull/157753 Approved by: https://github.com/zou3519, https://github.com/jansel	2025-07-08 15:38:24 +00:00
Nikita Shulga	98bb0c0e78	[CI][MacOS] Add `VENV_PATH` to search path (#157749 ) When building/testing PyTorch on MacOS Shoudl prevent some flakiness when conda environment overtakes CI/CD Pull Request resolved: https://github.com/pytorch/pytorch/pull/157749 Approved by: https://github.com/atalman, https://github.com/huydhn	2025-07-08 15:37:45 +00:00
PyTorch MergeBot	76fe88fa56	Revert "Cleanup leftover miniconda brew installation (#156898 )" This reverts commit 214e2959dcdbf91a999d5c0a5d40c91e4442e8c5. Reverted https://github.com/pytorch/pytorch/pull/156898 on behalf of https://github.com/malfet due to Breaks TorchVision builds ([comment](https://github.com/pytorch/pytorch/pull/156898#issuecomment-3049281232))	2025-07-08 14:54:42 +00:00
Xuan Zhang	86670b39fa	[PT2][memory] mutation size correctness (#157562 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/157562 Approved by: https://github.com/yf225	2025-07-08 14:02:20 +00:00
Wang, Chuanqi	c78bbdf410	[BE] Update xpu driver repo for CD used almalinux 8.10 (#157356 ) XPU CD docker image built on `quay.io/pypa/manylinux_2_28_x86_64`, which based on almalinux 8.10 Pull Request resolved: https://github.com/pytorch/pytorch/pull/157356 Approved by: https://github.com/EikanWang, https://github.com/malfet	2025-07-08 13:59:46 +00:00

1 2 3 4 5 ...

90134 Commits