pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-10-20 21:14:14 +08:00

Author	SHA1	Message	Date
PyTorch MergeBot	96ef26f71a	Revert "[ROCm] Integrate AITER Fav3 fwd kernels (#160105 )" This reverts commit d2393c2d7da03a1523a12e6f80edb6bd7b464ec5. Reverted https://github.com/pytorch/pytorch/pull/160105 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing internal ROCm build ([comment](https://github.com/pytorch/pytorch/pull/160105#issuecomment-3273297183))	2025-09-10 04:42:28 +00:00
Rob Timpe	5ac112b569	[dynamo] Graph break on on user-defined class in compiled region (#161670 ) Currently, user-defined classes inside of a compiled frame will cause the whole frame to be skipped by dynamo. This change defers the Unsupported exception until the __build_class__ builtin is actually called, which allows a graph break to be inserted. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161670 Approved by: https://github.com/williamwen42, https://github.com/guilhermeleobas	2025-09-10 04:39:20 +00:00
Edward Yang	dda071587f	Revert "Make distributed modules importable even when backend not built (#159889 )" (#162568 ) This reverts commit a0d026688cd69583d5a4e0c6f3e5fda141a7f4a9. Revert "Always build USE_DISTRIBUTED. (#160449)" This reverts commit d80297a6846f1f2c36fd4f19e22919f2abe8fcea. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162568 Approved by: https://github.com/huydhn	2025-09-10 04:29:42 +00:00
PyTorch UpdateBot	11acfed3ce	[audio hash update] update the pinned audio hash (#162552 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned audio hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162552 Approved by: https://github.com/pytorchbot	2025-09-10 04:24:39 +00:00
Nikita Shulga	5f40a8a9a3	[BE] Fix `'_WIN32' is not defined` warning (#162516 ) Summary: As indeed it is defined on neither Linux nor macOS platforms Test Plan: CI Rollback Plan: Differential Revision: D82044853 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162516 Approved by: https://github.com/Skylion007	2025-09-10 04:21:38 +00:00
Huy Do	e64965300a	Repackage vLLM nightlies (#162371 ) I suspected that I would need to repack vLLM wheels from https://github.com/pytorch/pytorch/pull/162000 because I renamed the wheel, and it turns out to be true. The error is as follows: ``` $ uv pip install --pre xformers --index-url https://download.pytorch.org/whl/nightly/cu129 Using Python 3.12.11+meta environment at: venv/py3.12 Resolved 28 packages in 759ms error: Failed to install: xformers-0.0.33.dev20250901+cu129-cp39-abi3-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (xformers==0.0.33.dev20250901+cu129) Caused by: Wheel version does not match filename: 0.0.33+5d4b92a5.d20250907 != 0.0.33.dev20250901+cu129 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/162371 Approved by: https://github.com/atalman	2025-09-10 04:02:34 +00:00
Huy Do	00985970e3	Put torchao (0.13.0) back to benchmark workflow (#162227 ) 0.13.0 was released on Sep 3rd https://pypi.org/project/torchao/#history, which should have fixed the crashing issue on transformers now Pull Request resolved: https://github.com/pytorch/pytorch/pull/162227 Approved by: https://github.com/malfet	2025-09-10 03:56:25 +00:00
angelayi	484c4093a8	test fixing benchmarks (#162503 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/162503 Approved by: https://github.com/huydhn ghstack dependencies: #160741	2025-09-10 03:15:49 +00:00
Boyuan Feng	760c478a14	[FlexAttn][Minor] Update FlexConfig doc (#162533 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/162533 Approved by: https://github.com/drisspg	2025-09-10 02:03:48 +00:00
Yu Guo	dc4f97e9c1	[triton] enable int64 indexing in convolution and mm template (#162506 ) Summary: hitting illegal memory access issue when compiling conv and addmm kernels with the change in https://github.com/pytorch/pytorch/pull/157767 Differential Revision: D81995664 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162506 Approved by: https://github.com/iseeyuan	2025-09-10 01:53:26 +00:00
Justin Chu	c66e58b7d0	[ONNX] Expose the testing module (#162495 ) * Created a new module `torch/onnx/testing.py` that exposes the `assert_onnx_program` function for testing exported ONNX models. * Updated the ONNX documentation (`docs/source/onnx.md`) to include `onnx_testing` in the list of relevant modules. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162495 Approved by: https://github.com/titaiwangms, https://github.com/xadupre	2025-09-10 01:40:24 +00:00
Tristan Rice	878f59ef75	DeviceMesh: support _rank for use with non-global PGs (#162439 ) Summary: This adds a `_rank` field to DeviceMesh init that allows for instantiating a DeviceMesh without depending on `dist.get_rank()` which requires a global PG to be instantiated. Test Plan: ``` buck2 test mode/opt -c fbcode.enable_gpu_sections=true //caffe2/test/distributed:device_mesh -- init_backend ``` Rollback Plan: Differential Revision: D81981777 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162439 Approved by: https://github.com/kwen2501, https://github.com/fduwjj	2025-09-10 01:18:28 +00:00
Tianyu Liu	e60ad4f628	[DTensor] fix copy_ strategy to support linearity (#162460 ) Fixing issue introduced in https://github.com/pytorch/pytorch/pull/158538 where `aten.copy_.default` is registered as a pointwise op, but without linearity. In particular, when both `src` and `dst` tensors have same `Partial` placements, direct copy should happen without redistribute, instead of redistributing both to `Replicate` before making the copy. This was discovered from silent incorrect results e.g. on `torch.einsum` backward. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162460 Approved by: https://github.com/zpcore	2025-09-10 00:47:14 +00:00
PyTorch MergeBot	2281d009e5	Revert "[ROCm] Add specific compile options for CK SDPA (#161759 )" This reverts commit d22d916719eb7daff8455a01d216d65f81899a9e. Reverted https://github.com/pytorch/pytorch/pull/161759 on behalf of https://github.com/huydhn due to Sorry for reverting your change but this seems to break internal ROCm jobs ([comment](https://github.com/pytorch/pytorch/pull/161759#issuecomment-3272807726))	2025-09-10 00:44:30 +00:00
Saurabh Mishra	33589374b6	[DCP] Avoid multiple storage writer resets in async save (#159448 ) Summary: Avoid multiple storage writer resets in async save. Currently the reset gets called by the async_save method and then again in the save method. In the async path, async_save should only do the staging and the reset should only happen in the synchronous save path. Test Plan: ``` buck test 'fbcode//mode/opt' //aiplatform/modelstore/experimental/DCP/tests:checkpoint_dist_client_test ``` https://www.internalfb.com/intern/testinfra/testrun/15199648841705052 Rollback Plan: Differential Revision: D79230339 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159448 Approved by: https://github.com/meetv18	2025-09-10 00:43:03 +00:00
Animesh Jain	5539916fe1	[dynamo][refactor] Move get_framelocals_idx to a helper (#162519 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/162519 Approved by: https://github.com/williamwen42	2025-09-10 00:35:09 +00:00
Laith Sakka	e4174b1fd7	remove gso from collapse_view_helper (#162212 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/162212 Approved by: https://github.com/aorenste Co-authored-by: Aaron Orenstein <aorenste@fb.com>	2025-09-10 00:17:15 +00:00
Scott Wolchok	0e7ccc09db	[easy] Don't force copy result of getAllOperatorsFor in init.cpp (#162218 ) It returns a const reference to a vector. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162218 Approved by: https://github.com/Skylion007 ghstack dependencies: #161591, #161595, #161633, #161634, #161692, #162219, #162220	2025-09-10 00:08:15 +00:00
Thomas Bohnstingl	87cc126457	[associative_scan] partial gradient support (#162388 ) This PR tests the partial gradient support of the `associative_scan` operation. It replaces https://github.com/bohnstingl/pytorch/pull/6 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162388 Approved by: https://github.com/ydwu4	2025-09-09 23:52:29 +00:00
PyTorch MergeBot	a3e26d1727	Revert "[dynamo] Graph break on on user-defined class in compiled region (#161670 )" This reverts commit e2545487de3dbbe663e3f0adb699547a14da0f6a. Reverted https://github.com/pytorch/pytorch/pull/161670 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing a trunk test ([comment](https://github.com/pytorch/pytorch/pull/161670#issuecomment-3272626391))	2025-09-09 23:40:26 +00:00
Andy Lugo	d2393c2d7d	[ROCm] Integrate AITER Fav3 fwd kernels (#160105 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/160105 Approved by: https://github.com/jeffdaily	2025-09-09 22:30:12 +00:00
SandishKumarHN	b498299953	154849 Add support to handle SIGUSR1 and SIGUSR2 in multiprocessing (#160690 ) Fixes #154849 This change addresses the request to add support for SIGUSR1 and SIGUSR2 signals in torchrun for SLURM environments. The change supports these signals through the configurable `TORCHELASTIC_SIGNALS_TO_HANDLE` environment variable and the signals_to_handle parameter from the launcher API. Tests: For validation purposes: test_signal_handling.py, simple_test_api_signal_handling.py. Unit tests: for launcher changes: launcher/test_api.py; for API changes: multiprocessing/test_api.py. E2E: test_run.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/160690 Approved by: https://github.com/fduwjj	2025-09-09 22:23:06 +00:00
Howard Huang	4d66a3b894	fix Dtensor doc link (#162494 ) Small fix for https://docs.pytorch.org/docs/main/distributed.tensor.parallel.html <img width="890" height="274" alt="image" src="https://github.com/user-attachments/assets/6ee7fc7c-e0fe-4f5e-ab7e-a895bb3fa79f" /> now it is: <img width="909" height="320" alt="image" src="https://github.com/user-attachments/assets/8b2c41ef-1684-4597-8dae-144b49723796" /> Pull Request resolved: https://github.com/pytorch/pytorch/pull/162494 Approved by: https://github.com/XilunWu	2025-09-09 22:10:37 +00:00
Rob Timpe	e2545487de	[dynamo] Graph break on on user-defined class in compiled region (#161670 ) Currently, user-defined classes inside of a compiled frame will cause the whole frame to be skipped by dynamo. This change defers the Unsupported exception until the __build_class__ builtin is actually called, which allows a graph break to be inserted. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161670 Approved by: https://github.com/williamwen42, https://github.com/guilhermeleobas	2025-09-09 21:07:49 +00:00
Ke Wen	8922bbcaab	Use same NVSHMEM version across CUDA builds (#162206 ) #161321 bumped NVSHMEM version to 3.3.24 for CUDA 13, leaving CUDA 12 with 3.3.20. This PR bumps the NVSHMEM version to 3.3.24 for CUDA 12 as well. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162206 Approved by: https://github.com/tinglvv, https://github.com/Skylion007	2025-09-09 20:59:50 +00:00
atalman	14744e1ab2	[Release 2.9] Add compatibility matrix, Version Bump (#162526 ) Release 2.9 1. Add release compatibility matrix 2. Add version bump for 2.10 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162526 Approved by: https://github.com/malfet	2025-09-09 20:38:15 +00:00
Jeff Daily	b477fb106f	[ROCm] enable grouped gemm fallback (#162419 ) Enables bf16 group gemm alternative path as described in #161366 Fast path will be enabled in future through CK integration. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162419 Approved by: https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-09-09 20:04:56 +00:00
Andy Lugo	d22d916719	[ROCm] Add specific compile options for CK SDPA (#161759 ) Updates CK version and adds CK specific compilation options Pull Request resolved: https://github.com/pytorch/pytorch/pull/161759 Approved by: https://github.com/jeffdaily	2025-09-09 20:04:19 +00:00
morrison-turnansky	86d34a43f5	NamedTuple: Allow side effects for dynamic attributes (#161645 ) I confirmed that the tracing was correct i.e. NamedTupleVariable had the correct dynamic attribute added to it. The problem was that NamedTupleVariable was always marked as immutable. This does not reflect the behavior of namedtuple. Subclasses of namedtuple may be mutable, so when a NamedTupleVariable is derived from a subclass that is mutable, I made NamedTupleVariable mutable as well. Then side_effects correctly updates the returned object. Fixes #161610 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161645 Approved by: https://github.com/anijain2305, https://github.com/StrongerXi	2025-09-09 19:42:02 +00:00
Huy Do	8508651477	Fix flaky AOTFxirTestCase (#162472 ) Fixes https://github.com/pytorch/pytorch/issues/162357 Fixes https://github.com/pytorch/pytorch/issues/160970 Fixes https://github.com/pytorch/pytorch/issues/161038 Fixes https://github.com/pytorch/pytorch/issues/160951 Fixes https://github.com/pytorch/pytorch/issues/161698 These tests were introduced in https://github.com/pytorch/pytorch/pull/160765 and they are all flaky when `torch._inductor.aot_compile` uses multiple threads (the default option). The issue could be reproduced by running them locally multiple times. For example, ``` pytest --flake-runs 10 --flake-finder -v inductor/test_fxir_backend.py -k test_aoti_fx_add (output logs at P1938386961) ... ----- Captured stdout call ----- inductor [('async_compile_cache_miss', 1)] graph_break [] ----- Captured stdout call ----- inductor [('async_compile_cache_miss', 1)] graph_break [] ----- Captured stdout call ----- inductor [('async_compile_cache_miss', 1)] graph_break [] ----- Captured stdout call ----- inductor [('async_compile_cache_miss', 1)] graph_break [] ----- Captured stdout call ----- inductor [('async_compile_cache_miss', 1)] graph_break [] ----- Captured stdout call ----- inductor [('async_compile_cache_miss', 1)] graph_break [] ----- Captured stdout call ----- inductor [('async_compile_cache_miss', 1)] graph_break [] ----- Captured stdout call ----- inductor [('async_compile_cache_miss', 2), ('async_compile_cache_hit', 1)] graph_break [] ----- Captured stdout call ----- inductor [('async_compile_cache_miss', 2), ('async_compile_cache_hit', 1)] graph_break [] ----- Captured stdout call ----- inductor [('async_compile_cache_miss', 2), ('async_compile_cache_hit', 1)] graph_break [] ===== short test summary info ===== FAILED [0.4834s] inductor/test_fxir_backend.py::AOTFxirTestCase::test_aoti_fx_add - AttributeError: 'NoneType' object has no attribute '__code__' FAILED [0.4576s] inductor/test_fxir_backend.py::AOTFxirTestCase::test_aoti_fx_add - AttributeError: 'NoneType' object has no attribute '__code__' FAILED [0.4613s] inductor/test_fxir_backend.py::AOTFxirTestCase::test_aoti_fx_add - AttributeError: 'NoneType' object has no attribute '__code__' ===== 3 failed, 7 passed in 12.89s ===== ``` Setting `compile_threads` to 1 will get rid of the test flakiness, but there might be underlying issues from https://github.com/pytorch/pytorch/pull/160765. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162472 Approved by: https://github.com/angelayi, https://github.com/Skylion007	2025-09-09 19:39:24 +00:00
rzou	723c27ed78	[standalone_compile] binary format write should be atomic (#162432 ) We update it to call write_atomic instead of file.write Pull Request resolved: https://github.com/pytorch/pytorch/pull/162432 Approved by: https://github.com/oulgen	2025-09-09 18:43:13 +00:00
Benjamin Glass	bdbe931d58	[build] Add LeakSanitizer option to CMake (#158686 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/158686 Approved by: https://github.com/eellison	2025-09-09 18:41:20 +00:00
jainapurva	af60398c3a	Update the operator benchmarking, to benchmark using torch.compile (#161394 ) This pull request enhances the PyTorch operator benchmarking suite by introducing support for benchmarking with `torch.compile` mode, in addition to existing Eager and JIT. It also adds peak memory measurement (fwd/bwd pass), improves the JSON output format used by the dashboard for reporting, and introduces some more CLI options. The new CLI flags introduced are: - Added `--use-compile` CLI argument and corresponding logic to run benchmarks using `torch.compile`, including mutual exclusivity with `--use-jit` - Added `--benchmark-name` argument for customizing the benchmark name in output - Updated default value for `--output-json-for-dashboard` to `benchmark-results.json` for more predictable output file name Sample command to run a single operator: `python -m pt.mm_test --use-compile` Pull Request resolved: https://github.com/pytorch/pytorch/pull/161394 Approved by: https://github.com/jbschlosser	2025-09-09 18:17:37 +00:00
PyTorch MergeBot	82f1eb9b03	Revert "[MPS] mps sparse mul op implementation (#162349 )" This reverts commit 3ea686804925f1291de57ffdb3394da0b46deb54. Reverted https://github.com/pytorch/pytorch/pull/162349 on behalf of https://github.com/malfet due to Fails trunk tests, with uint8 sum ([comment](https://github.com/pytorch/pytorch/pull/162349#issuecomment-3271783442))	2025-09-09 18:14:16 +00:00
Brian Hirsh	4b2d297eec	python fastpath for DTensor detach(), confirm that aliasing DTensorSpec is ok (#160580 ) My goal right now is to try to make the "vanilla" AccumulateGrad path for DTensor (that just calls detach) fast. I'm doing this in two steps: (1) [this PR]: hardcode aten.detach in DTensor to re-use the input tensor's DTensorSpec, instead of running "real" sharding prop. (2) [assuming success of 1]: move the detach() call into C++, try adding a DTensor dispatch key, and avoid dispatching back to python entirely (except for some code that probably needs to allocate a pyobject for the output DTensor, from C++). I'm pushing this PR first to confirm that I don't break anything with my detach fastpath. I did some manual local testing to confirm that for normal usages of detach, the input and output DTensor have equal DTensorSpec objects. Technically, we previously would allocate a fresh DTensorSpec, and with this change we are just re-using the input tensor's DTensorSpec. So I'm mostly hoping that DTensorSpecs don't generally get mutated. This by itself does seem to speed up `alias` by quite a bit (roughly 2.5x speedup, from ~336us -> 133us): aten.detach(plain_tensor) ``` <torch.utils.benchmark.utils.common.Measurement object at 0x7f8da2921790> _ = x.detach() 4.80 us 1 measurement, 100000 runs , 1 thread ``` aten.detach(DTensor) [before this PR] ``` <torch.utils.benchmark.utils.common.Measurement object at 0x7f47cd68e750> _ = x_dt.detach() 336.40 us 1 measurement, 1000 runs , 1 thread ``` aten.detach(DTensor) [after this PR] ``` <torch.utils.benchmark.utils.common.Measurement object at 0x7f0a34c05520> _ = x_dt.detach() Median: 133.45 us 2 measurements, 1000 runs per measurement, 1 thread ``` benchmark script: ``` import torch import torch.distributed as dist from torch.distributed.tensor import DeviceMesh, DTensor, Partial, Replicate, Shard from torch.testing._internal.distributed.fake_pg import FakeStore import torch.utils.benchmark as benchmark fake_store = FakeStore() dist.init_process_group("fake", store=fake_store, rank=0, world_size=2) mesh = torch.distributed.device_mesh.init_device_mesh('cuda', (2,)) x = torch.randn(4, 4, requires_grad=True) x_dt = DTensor.from_local(x, mesh, [Shard(0)], run_check=False) t0 = benchmark.Timer( stmt='_ = x_dt.detach()', globals={'x_dt': x_dt}, ) print(t0.blocked_autorange()) dist.destroy_process_group() ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/160580 Approved by: https://github.com/ezyang	2025-09-09 18:04:56 +00:00
Jane Xu	0ec723acd0	Update docs for quantile to be clearer for nearest (#162423 ) Correct the rounding scheme for nearest in quantile. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162423 Approved by: https://github.com/soulitzer	2025-09-09 18:04:12 +00:00
Howard Huang	e1be887870	[PP] Add spacing to visualizer (#160474 ) When visualizing the schedules using `_PipelineScheduleExecution`, we don't provide any spacing between dependencies, so when visualizing `DualPipeV` it looks like this: <img width="3168" height="486" alt="image" src="https://github.com/user-attachments/assets/d2c881ad-4ee0-46b6-ac03-13e5600b5a55" /> While it has the correct order of operations, it does not show the dependencies correctly. As shown in the original implementation, it should look something like this: <img width="3542" height="384" alt="image" src="https://github.com/user-attachments/assets/c930fa98-848e-4951-a58b-c81f41092d14" /> This allows an option to add spacing to the visualizer, so it is easier to see dependencies. After change: <img width="3633" height="486" alt="image" src="https://github.com/user-attachments/assets/7708367e-bdb4-46e8-a7c4-f19e18047f59" /> Pull Request resolved: https://github.com/pytorch/pytorch/pull/160474 Approved by: https://github.com/fegin	2025-09-09 17:52:52 +00:00
Ruben Rodriguez Buchillon	d91eecc9a5	[inductor][template heuristics] don't take layout to generate choices (#162238 ) # why - unnecessary as we only ever need to know the dtype and maybe the device - we already take in the kernel inputs which have the device - enable us to specify the layout after finding all the configs but before generating the ChoiceCallers # what - replace all calls in template_heuristics that used to take Layout with now just taking out_dtype # testing ci Differential Revision: [D81820115](https://our.internmc.facebook.com/intern/diff/D81820115) Pull Request resolved: https://github.com/pytorch/pytorch/pull/162238 Approved by: https://github.com/eellison ghstack dependencies: #161347, #161348, #161349	2025-09-09 17:17:04 +00:00
Ruben Rodriguez Buchillon	24a4dae85b	[inductor] V.choices.get_mm_configs override point (#161349 ) # why - enable us to override the default configs, or fall back to them through subclassing InductorChoices # what - override (private) function - default implementation takes the kernel template choice (ktc) generator for every template and just executes the generator - future overrides can decide to replace those generators, or filter out choices - the 2nd expensive step (maybe_append_choices, choice_or_none) is handled outside this function, in the main V.choices.get_mm_configs; this means that any override benefits from not generating expensive templates that aren't going to be used # testing ``` python3 -bb -m pytest test/inductor/test_max_autotune.py -v ``` Differential Revision: [D81520570](https://our.internmc.facebook.com/intern/diff/D81520570) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161349 Approved by: https://github.com/eellison ghstack dependencies: #161347, #161348	2025-09-09 17:17:04 +00:00
Ruben Rodriguez Buchillon	d3c4cf838e	[inductor][ez] V.choices.get_mm_configs returns list of ChoiceCallers (#161348 ) # why - every callsite just executes the generator on the spot - previous pr adds the ability to add an override before expensive generators are executed, so we don't need this generator anymore # what - rather than yielding the ChoiceCaller, just return the list of all valid ChoiceCallers # testing ``` python3 -bb -m pytest test/inductor/test_max_autotune.py -v ``` Differential Revision: [D81520574](https://our.internmc.facebook.com/intern/diff/D81520574) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161348 Approved by: https://github.com/eellison ghstack dependencies: #161347	2025-09-09 17:16:57 +00:00
Ruben Rodriguez Buchillon	b1e99c8c7a	[inductor] add kernel template choice (ktc) (#161347 ) # why - gather everything up to make choices, without running potentially expensive generators - enables overrides where we toss the entire list of configs from inductor, without having to enumerate it (expensive) # what - add a holding class that just gets all the components necessary to generate a ChoiceCaller - use that class to generate ChoiceCallers - this does not (yet) add the override function, but just prepares the scene ``` python3 -bb -m pytest test/inductor/test_max_autotune.py -v ``` Differential Revision: [D81520569](https://our.internmc.facebook.com/intern/diff/D81520569) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161347 Approved by: https://github.com/eellison	2025-09-09 17:16:50 +00:00
Eddie Yan	5eb35d2ab8	[CUDA][float8][TF32] Disable tf32 for vs. emulated rowwise comparison (#162387 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/162387 Approved by: https://github.com/Skylion007	2025-09-09 17:04:06 +00:00
Jeff Daily	f03d635dc6	[ROCm][CI] skip test_max_autotune until resolved (#162496 ) many tests taking >30 min and causing timeouts Pull Request resolved: https://github.com/pytorch/pytorch/pull/162496 Approved by: https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-09-09 16:34:01 +00:00
Hashem Hashemi	1f0b01d4b6	[ROCm] OffsetCalc Unroll Optimization (#161700 ) Our compiler is generating inefficient code for the offsetCalc in certain situations. The root-cause for this needs to be identified. For now specialized unrolling based on 'dims' notably helps perf. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161700 Approved by: https://github.com/jeffdaily	2025-09-09 16:11:48 +00:00
Prachi Gupta	c0142f5c06	[ROCm] Enabling several UTs (#161715 ) All these UTs are working as is, just removing the skip - test_p2p_ipc - test_repros.py: working, added fp8 support - test_activation_checkpointing.py - test_content_store.py - test_cuda_multigpu.py - test_compute_comm_reordering.py - test_segment_reductions.py - test_dataloader.py - test_math_ops.py - test_loop_ordering.py - test_control_flow.py - distributed_test.py - test_mem_tracker.py - test_fsdp_optim_state.py - test_fully_shard_mixed_precision.py: skipped for < ROCm7.0 - test_aot_inductor_custom_ops.py - test_c10d_ops_nccl.py - test_eager_transforms.py - test_sparse_csr.py - test_inductor_collectives.py - test_fake_tensor.py - test_cupy_as_tensor.py - test_cuda.py: enable UTs that are working - test_matmul_cuda.py: enable UTs that are working Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/161715 Approved by: https://github.com/msaroufim Co-authored-by: Mark Saroufim <marksaroufim@fb.com>	2025-09-09 15:49:21 +00:00
Isalia20	3ea6868049	[MPS] mps sparse mul op implementation (#162349 ) Implements mps sparse mul operation as well as enables other operations such as: 1. copy_ 2. div 3. sum 4. floor 5. power 6. sub 7. floor_divide Pull Request resolved: https://github.com/pytorch/pytorch/pull/162349 Approved by: https://github.com/pearu, https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-09-09 15:45:37 +00:00
Jack Taylor	be3b8d2ec9	[ROCm][CI] update fbgemm nightly benchmark hash (#162385 ) fbgemm_gpu was failing to clone due to missing submodule commit. ``` + pushd fbgemm/fbgemm_gpu ~/pytorch/fbgemm/fbgemm_gpu ~/pytorch + git checkout 7f1de94a4c2d14f59ad4ca84538c36084ea6b2c8 --recurse-submodules fatal: failed to unpack tree object b1281b8b08d973a7064f864f47eeb30f3e2596e9 error: Submodule 'external/composable_kernel' could not be updated. error: Cannot update submodule: external/composable_kernel ``` Log File [inductor-periodic · pytorch/pytorch@5babb4d](https://github.com/pytorch/pytorch/actions/runs/17536630806/job/49802458834) Pull Request resolved: https://github.com/pytorch/pytorch/pull/162385 Approved by: https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-09-09 15:44:39 +00:00
PyTorch MergeBot	5ccf3ca3ec	Revert "Use same NVSHMEM version across CUDA builds (#162206 )" This reverts commit 0d9c95cd7ee299e2e8c09df26d395be8775b506b. Reverted https://github.com/pytorch/pytorch/pull/162206 on behalf of https://github.com/malfet due to Broke lint, see `4dd73e659a/1` ([comment](https://github.com/pytorch/pytorch/pull/162206#issuecomment-3271040521))	2025-09-09 14:40:45 +00:00
atalman	e38e953432	CUDA 13.0 Windows Nvidia Driver Update to 580.88 (#162425 ) Related to https://github.com/pytorch/pytorch/issues/162333 https://github.com/pytorch/pytorch/issues/159779 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162425 Approved by: https://github.com/tinglvv, https://github.com/malfet	2025-09-09 14:40:34 +00:00
PyTorch MergeBot	4dd73e659a	Revert "fix torch.sparse.log_softmax on CPU (#161959 )" This reverts commit 002e59440afe8711019e68df500f5e18b9a43f3c. Reverted https://github.com/pytorch/pytorch/pull/161959 on behalf of https://github.com/davidberard98 due to test failure: test_sparse.py::TestSparseMPS::test_log_softmax_float_mps_float32 [GH job link](https://github.com/pytorch/pytorch/actions/runs/17573794461/job/49915138287) [HUD commit link](`002e59440a`) ([comment](https://github.com/pytorch/pytorch/pull/161959#issuecomment-3270509418))	2025-09-09 12:33:25 +00:00
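
Commit 723c27ed78 in the list above switches the standalone_compile binary format from `file.write` to `write_atomic`. As a rough illustration of what an atomic write generally involves, here is a minimal sketch. It is not PyTorch's actual `write_atomic`: the helper name `write_atomic_sketch` and the `.tmp-` prefix are hypothetical, and it uses the common approach of writing to a temporary file in the destination directory and then renaming it over the target, so readers see either the old file or the complete new one.

```python
import os
import tempfile


def write_atomic_sketch(path: str, data: bytes) -> None:
    """Write `data` to `path` so readers never observe a partial file.

    Hypothetical helper for illustration only; not PyTorch's write_atomic.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file next to the destination so the final rename
    # stays on the same filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before renaming
        # os.replace swaps the new file into place in a single step.
        os.replace(tmp_path, path)
    except BaseException:
        # On any failure, drop the temp file and leave the target untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

With a plain `file.write`, a crash or a concurrent reader can observe a truncated artifact; with the rename pattern, an interrupted write leaves the previous file in place.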

92807 Commits