Summary: As above; also updates a number of the build files to improve them
Test Plan:
internal and external CI
Ran `buck2 build fbcode//caffe2:torch` and it succeeded
Rollback Plan:
Reviewed By: swolchok
Differential Revision: D78016591
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158035
Approved by: https://github.com/swolchok
The general context for the upcoming stack of commits is that I am attempting
to "pipeline" AOTAutograd. Instead of having function f call function g, the
next "stage" of compilation, f should return its outputs, which are then piped
to g for the next stage. This will make it easier to implement an early exit /
resume pipeline without forcing a callback structure, which is good for
export-style use cases. It also reduces the size of our stack traces, which
makes tools like Perfetto happy.
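As a toy sketch of the control-flow change being described (made-up stage functions, not the actual AOTAutograd code):
```py
# Before: each stage calls the next, so the whole pipeline sits on one call stack.
def f_nested(x):
    y = x + 1            # stand-in for f's stage of compilation
    return g_nested(y)   # f stays on the stack until g finishes

def g_nested(y):
    return y * 2         # stand-in for the next stage

# After: each stage just returns its outputs and a small driver pipes them to
# the next stage, so we can exit early or resume without callbacks, and stack
# traces stay shallow.
def f_piped(x):
    return x + 1

def g_piped(y):
    return y * 2

def run_pipeline(x, stages=(f_piped, g_piped)):
    for stage in stages:
        x = stage(x)     # the driver could stop here and resume later
    return x

assert f_nested(3) == run_pipeline(3) == 8
```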
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158149
Approved by: https://github.com/jamesjwu
This PR disables the `strict-aliasing` GCC C++ optimization flag on all AArch64 CPUs for GCC versions 12 and above.
Pull Request #152825 upgraded the GCC version from 11 to 13 in manywheel, which caused several segmentation faults in unit tests (not visible in CI workflows because the jammy GCC version has not been updated yet).
We identified that the problem also exists in GCC 12, hence the `__GNUC__ >= 12` check.
Fixes #157626
Fixes these test failures when PyTorch is built with GCC 12 and above:
```
test_ops.py::TestCommonCPU::test_noncontiguous_samples_grid_sampler_2d_cpu_float32 Fatal Python error: Segmentation fault
test_ops.py::TestCommonCPU::test_dtypes_grid_sampler_2d_cpu Fatal Python error: Segmentation fault
test_ops.py::TestMathBitsCPU::test_neg_view_nn_functional_grid_sample_cpu_float64 free(): invalid next size (fast)
test_ops.py::TestCompositeComplianceCPU::test_backward_grid_sampler_2d_cpu_float32 Fatal Python error: Segmentation fault
test_ops.py::TestCommonCPU::test_dtypes_nn_functional_grid_sample_cpu Fatal Python error: Segmentation fault
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158117
Approved by: https://github.com/malfet
Fixes#124435
This updates the torch.histogramdd documentation to correctly state that bins are inclusive of their left edges, not exclusive as currently written. A previous PR addressed this but was closed due to inactivity; this picks that work up and applies the fix.
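For example (a small illustration of the documented semantics; values chosen for clarity):
```py
import torch

# Each bin includes its left edge; the rightmost bin also includes its right edge.
samples = torch.tensor([[0.0], [1.0], [2.0]])  # N=3 points in D=1
hist, edges = torch.histogramdd(samples, bins=[2], range=[0.0, 2.0])
print(edges[0])  # tensor([0., 1., 2.])
print(hist)      # tensor([1., 2.]): 0.0 falls in [0, 1); 1.0 and 2.0 fall in [1, 2]
```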
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158275
Approved by: https://github.com/albanD
Summary: Add a TORCHINDUCTOR_CPP_FORCE_INLINE_KERNEL flag that force-inlines the kernel function when set to 1. It's disabled by default because force inlining may increase build time.
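For example, one way to opt in (the variable can equally be exported in the shell before running):
```py
import os

# Disabled by default because force-inlining may increase build time; set it
# before Inductor compiles anything (e.g. before running torch.compile'd code).
os.environ["TORCHINDUCTOR_CPP_FORCE_INLINE_KERNEL"] = "1"
```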
Differential Revision: D77915987
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157949
Approved by: https://github.com/desertfire
# Motivation
Refactor `CUDAAllocatorConfig` to reuse `AcceleratorAllocatorConfig` and `ConfigTokenizer`. We will deprecate the options that overlap with `AcceleratorAllocatorConfig` in a follow-up PR and keep them only for BC.
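For context, a typical option string these config classes tokenize and apply looks like the following (standard, documented allocator options; not new in this PR):
```py
import os

# Comma-separated key:value options for the CUDA caching allocator; must be set
# before the first CUDA allocation to take effect.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128,expandable_segments:True"
```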
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150312
Approved by: https://github.com/albanD
----
# Refactor and Improve the OpenReg Module
## Background
Since PrivateUse1 has become the main path for integrating new devices with PyTorch, there have been some feature requests related to PrivateUse1 regarding interfaces, documentation, reference examples, etc., such as the following:
- https://github.com/pytorch/pytorch/issues/155864
- https://github.com/pytorch/pytorch/issues/144955
- https://github.com/pytorch/pytorch/issues/144845
Taking these requests into consideration, and given OpenReg's current role as the test backend for PrivateUse1, I'm planning to make the following optimizations:
- Optimize the implementation of OpenReg to make it align with the standard specifications for real backend (C++) access, serving as a reference for new device integration code.
- Add comprehensive documentation to the [developer notes](https://docs.pytorch.org/docs/main/notes.html) to guide new accelerator integration, functioning as a reference manual.
## Design Principles:
- Minimization Principle: Keep the code small and clear; only implement the minimum set of code required for verification and as an integration reference.
- Authenticity Principle: Integrate OpenReg in the same way that real accelerators access PyTorch (see the sketch below).
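As a minimal sketch of the Python-side entry point a PrivateUse1 backend typically uses (the "openreg" name is only illustrative here; the real device guard, allocator, and kernels are registered from C++ as the README below describes):
```py
import torch

# Rename the PrivateUse1 dispatch key to the backend's device name and generate
# the matching Tensor/Module convenience methods.
torch.utils.rename_privateuse1_backend("openreg")
torch.utils.generate_methods_for_privateuse1_backend()

# Once the C++ side registers its hooks, users can target the device directly:
#   x = torch.empty(2, 2, device="openreg")
```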
## More Infos:
Please refer to [this](6b8020f1ab/test/cpp_extensions/open_registration_extension/torch_openreg/README.md) for more information about `OpenReg`.
## Current Progress:
- Refer to the implementation of [torch_xla](https://github.com/pytorch/xla) to refactor all of OpenReg's code, making it easier to understand.
- Ensure all tests in [test/test_openreg.py](https://github.com/FFFrog/pytorch/blob/openreg/test/test_openreg.py) pass after refactoring.
## Next Steps:
- Add more features to cover all integration points.
- Gradually add user guides and documentation to the [developer notes](https://docs.pytorch.org/docs/main/notes.html).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158090
Approved by: https://github.com/seemethere, https://github.com/albanD
The `test_triton_wait_until` test was hanging due to an NCCL synchronization issue stemming from mismatched NVSHMEM operations. Specifically, the flag variable was updated using `nvshmemx_signal_op` (a signaling operation), but waited on with `nvshmem_wait_until` (intended for put/get updates). Per NVSHMEM documentation (see documentation reference section below), signal-updated variables require `nvshmem_signal_wait_until` for proper completion guarantees, so the mismatch caused a deadlock and NCCL hang.
**Fix:**
- A simple fix was to replace the flag update with a regular `nvshmem_putmem_block` (via `put_kernel`) to match `nvshmem_wait_until`. I also added a fence (`nvshmem_fence`) between data and flag puts on the sender (Rank 1) for ordered delivery.
- In a follow-up PR I will add a kernel/test to demonstrate usage of `nvshmemx_signal_op`
**Testing:**
- I ran `python test/distributed/test_nvshmem_triton.py` and `python test/distributed/test_nvshmem_triton.py -k test_triton_wait_until`
- I also verified with debug prints (Sender completes puts/fence before receiver's wait returns, and assertions confirm correct state). Multiple runs show no hangs or failures.
**Documentation Referenced:**
- [NVSHMEM Point-To-Point Synchronization](https://docs.nvidia.com/nvshmem/api/gen/api/sync.html) explicitly states: *"the sig_addr object at the calling PE is expected only to be updated as a signal, through the signaling operations available in Section NVSHMEM_PUT_SIGNAL and Section NVSHMEM_PUT_SIGNAL_NBI"*
- [NVIDIA's Official Ring Broadcast Example](https://docs.nvidia.com/nvshmem/api/examples.html) demonstrates the correct pairing: `nvshmemx_signal_op` with `nvshmem_signal_wait_until` (not `nvshmem_wait_until`)
- [NVSHMEM Signaling Operations](https://docs.nvidia.com/nvshmem/api/gen/api/signal.html) documents that signal operations work on special "signal data objects" with specific atomicity guarantees distinct from regular RMA operations
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158167
Approved by: https://github.com/Skylion007, https://github.com/fduwjj
Beginning of process for 3.14 bringup.
State of things from this PR:
- Nothing too scary looking from the Dynamo CPython side, nothing we heavily rely on seems to be missing @williamwen42
- The existing check that makes torch.compile() nicely fail is working as expected. So all these empty functions shouldn't cause any weirdness.
- The `__module__` update changes look suspicious, we should investigate what is the reason and impact of that, in particular for our public API checking @jbschlosser
- Leaving the weakref.py thread safety change as a follow up to keep this a bit simpler. I vendored the whole struct in the meantime FYI @ezyang
EDIT: The `__module__` change is even more cursed than I thought due to changes to the Union and Optional types, where the `__module__` field cannot be changed anymore. See https://github.com/python/cpython/issues/132139 for details.
For now, I'm just skipping the `__module__` setting for 3.14 which will trip the public API checks. Will revisit once I have a final answer on the cpython issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158184
Approved by: https://github.com/msaroufim
**Summary**
`split_strategy` used `TupleStrategy` as its return type because DTensor sharding
propagation's `OpStrategy` support for multiple returns only applies to `Tuple`.
However, `TupleStrategy` is not a good fit for the `split` op: it was initially
introduced to handle the sharding strategy of `foreach_*` ops, where the input
args can be split into independent subsets regarding sharding decisions, as can
the outputs.
To address the misuse, this PR adds `OpStrategy` propagation for `List[Tensor]`
(note that this support is INCOMPLETE because it only checks that the return type
is `torch.ListType`). Nevertheless, the logic for `Tuple` returns makes a similar
assumption, so I think it's fine to unblock things this way.
Besides adding `OpStrategy` support to ops having `List[Tensor]` return type,
this PR also changes `split_strategy`'s return from `TupleStrategy` to `OpStrategy`.
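As a usage sketch of the op this affects (single-rank CPU/gloo setup purely for illustration; the real test below runs across ranks and exercises the Partial case):
```py
import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Shard, distribute_tensor

dist.init_process_group("gloo", init_method="tcp://127.0.0.1:29500",
                        rank=0, world_size=1)
mesh = init_device_mesh("cpu", (1,))
dt = distribute_tensor(torch.randn(8, 4), mesh, [Shard(0)])

# split returns List[Tensor]; its sharding now propagates via a single OpStrategy.
chunks = torch.split(dt, 2, dim=1)
print([type(c).__name__ for c in chunks])  # ['DTensor', 'DTensor']

dist.destroy_process_group()
```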
**Test**
`pytest test/distributed/tensor/test_tensor_ops.py -s -k test_split_on_partial`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158051
Approved by: https://github.com/wconstab, https://github.com/zpcore
The local_tensor input to grouped_mm has a stride requirement (see
`_meta_grouped_mm_common` in meta_registrations.py or
`check_valid_strides_and_return_transposed` in native/cuda/Blas.cpp).
Don't allow sharding a tensor if its shape would result in an
incompatible local_tensor stride.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158245
Approved by: https://github.com/zpcore, https://github.com/XilunWu
This PR allows for symints in `gen_slice_strategy` which is the strategy for `aten.slice.Tensor`. Previously, using dynamic shapes with slicing would result in
```
File ".../pytorch/torch/distributed/tensor/_ops/_tensor_ops.py", line 348, in gen_slice_strategy
assert isinstance(end, int)
torch._dynamo.exc.TorchRuntimeError: Dynamo failed to run FX node with fake tensors: call_function <built-in function getitem>(*(DTensor(local_tensor=FakeTensor(..., device='cuda:0', size=(s3, 2)), device_mesh=DeviceMesh('cuda', [0, 1]), placements=(Shard(dim=0),)), slice(None, (s77//2), None)), **{}): got AssertionError()
```
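For reference, the failing pattern boils down to slicing a DTensor by a length that becomes a SymInt under dynamic shapes, roughly (sketch only; `dt` stands for a sharded DTensor as in the trace above):
```py
import torch

@torch.compile(dynamic=True)
def take_first_half(dt):
    n = dt.shape[0] // 2  # a SymInt once dynamic shapes kick in
    return dt[:n]         # lowers to aten.slice.Tensor with a SymInt end
```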
Questions before merge:
1. `dim` is still asserted to be int. Is this fine, or is this potentially dynamic as well?
2. I'm using an argtype ignore for `normalize_dim`. Should I instead change the types for `normalize_dim` and its downstream dependencies to be `IntLike` as well?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157953
Approved by: https://github.com/wconstab
When loading a package and calling package.install(backends), we create a new frame and compile id for each package load, so that tlparse and chromium events still show compile times on warm start.
There is an argument for not doing this in AOT precompile, as no "compile" occurs. So for now, we put it in `package.install`, which hopefully won't be a thing for AOT precompile.
## Recompiles
Recompiles get saved to the same frame and code entry, so on warm start, each recompile will get collapsed into the same entry. Therefore, dynamo compiles that have recompiles on cold start (0/0, 0/1, 0/2, etc) will all get collapsed into a single compile id (0/0), as warm start will load all of the entries properly.
## Graph breaks
Graph breaks get their own compile id, and therefore their own code entry. These are replicated on warm start, so if cold start you had 4 different graphs (and therefore 4 compile ids), you'll have 4 compile ids on warm start as well.
## Test plan
Added a frame counter check to existing unit tests for automatic dynamic, showing that the frame counter is the same between the old and new loads.
This is the chromium event for test_automatic_dynamo_graph_breaks_device_cuda:
```
python test/dynamo/test_package.py -k test_automatic_dynamo_graph_breaks_device_cuda
```
![Chromium event for test_automatic_dynamo_graph_breaks_device_cuda](https://github.com/user-attachments/assets/f604ed33-5c31-464b-9320-d67b2e6f57a1)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158028
Approved by: https://github.com/oulgen
This is intended to make it easier for users to provide backend-specific "hints" about certain options.
```py
import torch.distributed._dist2 as dist2
pg = dist2.new_group(backend="my_custom_backend", device=..., timeout=..., foo=1234, bar="1234")
pg.allreduce(...)
```
Test plan:
```
pytest test/distributed/test_dist2.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158147
Approved by: https://github.com/fduwjj