Compare commits


2929 Commits

Author SHA1 Message Date
0f49e915a9 rebase 2025-05-30 14:30:12 -07:00
2f1217f944 benchmarking 2025-05-30 14:27:37 -07:00
e0bf01e87b Script for consolidation of sharded safetensor files
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154743

Script to consolidate sharded safetensors files saved with DCP into full tensors. This relies on file system operations to read and copy bytes directly, instead of the traditional approach of loading, re-sharding, and saving again, because users will have models that are larger than the allotted memory.
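
A minimal sketch of the byte-copy idea, assuming the standard safetensors layout (an 8-byte little-endian header length followed by a JSON header with per-tensor data offsets). The helper names are illustrative; the actual DCP script also handles placement into the consolidated tensors, which is omitted here:

```
import json
import struct

def read_safetensors_header(path):
    # safetensors layout: u64 header length, then a JSON header mapping
    # tensor name -> {"dtype", "shape", "data_offsets"}.
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    return header, 8 + header_len

def stream_shard_bytes(shard_path, out_file, chunk_size=16 * 1024 * 1024):
    # Copy raw tensor bytes in chunks so nothing larger than chunk_size
    # is ever resident in memory, i.e. no tensors are materialized.
    header, data_start = read_safetensors_header(shard_path)
    with open(shard_path, "rb") as src:
        for name, meta in header.items():
            if name == "__metadata__":
                continue
            begin, end = meta["data_offsets"]
            src.seek(data_start + begin)
            remaining = end - begin
            while remaining > 0:
                buf = src.read(min(chunk_size, remaining))
                out_file.write(buf)
                remaining -= len(buf)
```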

Differential Revision: [D75536985](https://our.internmc.facebook.com/intern/diff/D75536985/)
ghstack-source-id: 287291639
2025-05-30 14:18:51 -07:00
3b5ae0e9fc Support re-sharding for safetensors checkpoints
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154519

This change adds the ability to support re-sharding for HF safetensors checkpoints.
This is done by adding more metadata when saving each file. The metadata captures the size and offset of the saved shard, which can be used to re-shard on load by creating the chunks belonging to the TensorStorageMetadata class.
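
A minimal sketch of the kind of per-shard metadata described above; the field and helper names are illustrative, not the actual DCP/TensorStorageMetadata attributes:

```
# Extra metadata saved with each shard: where it sits in the full tensor
# and how big the saved shard is.
shard_meta = {
    "model.layers.0.weight": {
        "global_shape": [4096, 4096],
        "shard_offset": [2048, 0],
        "shard_shape": [2048, 4096],
    }
}

def chunk_for(fqn: str) -> dict:
    # On load, this is enough information to rebuild a chunk entry for re-sharding.
    m = shard_meta[fqn]
    return {"offsets": m["shard_offset"], "sizes": m["shard_shape"]}
```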

Differential Revision: [D75226344](https://our.internmc.facebook.com/intern/diff/D75226344/)
ghstack-source-id: 286572125
2025-05-30 10:40:32 -07:00
5f5f654a3e Updates to HFStorageReader to use TensorStorageMetadata instead of BytesStorageMetadata
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154518

As we prepare to support re-sharding, the current approach of using BytesStorageMetadata to read safetensors won't work anymore. Before, we didn't need to read the metadata of the safetensors file from its header, because we were just loading the contents of the file directly into tensors with safetensors.load(), which handled the metadata and deserialization. But now, in preparation for handling re-sharding, we need to read the metadata directly from the header of the safetensors file and store it in TensorStorageMetadata objects so that we can perform re-sharding. Re-sharding won't currently work, as we need extra metadata to be stored on each save; that will be added in a subsequent PR.
This PR also adds an integration test in addition to the unit tests.
It also removes the HfFileSystem import because it's only needed if users are using HfFileSystem, but we want to support any backend.
ghstack-source-id: 286649070
@exported-using-ghexport

Differential Revision: [D74891998](https://our.internmc.facebook.com/intern/diff/D74891998/)
2025-05-30 10:40:30 -07:00
21931cbbc6 Changes to HFStorageWriter to support saving shards of tensors
As we move toward supporting saving partial tensors natively with HFStorageWriter, some simple changes are needed to make this happen.
- The current approach for distributed writes is that every rank has full tensors, but we split the writing of these full tensors across all available ranks. We're removing this logic that was in the HFSavePlanner and instead assuming that every rank has a shard and saving every rank's local state.
    - As a result we can probably remove the HFSavePlanner, but we're keeping it as a placeholder for now.

- The current file naming doesn't support shards, as it's in the format "model-00001-of-00004.safetensors"; if every rank writes the same file names, they will overwrite each other. So this adds a "shard-00001" prefix so that the per-rank files don't overwrite each other (see the sketch after this list).
- Don't save the metadata file models.safetensors.index.json if sharding is enabled. This file expects a 1:1 mapping between tensor and file name, which doesn't make sense in the sharded saving approach, so we can just drop this file.
- Make the "fqn_to_file_index" map optional. This map describes which file each tensor is saved in, but if users don't provide it, we can just save all the tensors to one file. If they run into issues, they can choose how to split up their tensors to be more friendly with the 5GB HF remote storage file size soft limit.

Differential Revision: [D75099862](https://our.internmc.facebook.com/intern/diff/D75099862/)

ghstack-source-id: 286648122
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154742
2025-05-30 10:40:28 -07:00
ef4d57329b [CAG] Support for call_module at copy paste aot bwd graph (#153827)
Adds support for `call_module` in `copy_paste_aot_backward_graph`, which was added recently with PT 2.7.

The problem is observed with the HPU backend in the example repro below, due to the creation of fused modules.

```
import torch

device = 'cpu' #'hpu'
backend = 'inductor' #'hpu_backend'

def fn(t1):
    t1 = t1 * 1
    t1_grad = torch.ones_like(t1, device=device)
    t1.backward(t1_grad, retain_graph=True)
    return t1

t1 = torch.ones(1, requires_grad=True, device=device) #.squeeze()
compiled_fn = torch.compile(fn, backend=backend)
result = compiled_fn(t1)

with torch._dynamo.compiled_autograd._enable(torch.compile(backend=backend)):
    result_grad = torch.ones_like(result, device=device)
    result.backward(result_grad)

print(f'{result_grad=}')
print(f'{t1.grad=}')
```

With this change I'm getting the same results as on CPU; however, I'm facing the problem below when running with a scalar (the t1 tensor after squeeze):
`torch._dynamo.exc.TorchRuntimeError: Dynamo failed to run FX node with fake tensors: call_function <built-in function getitem>(*(FakeTensor(..., device='hpu:0', size=()), 0), **{}): got IndexError('invalid index of a 0-dim tensor. Use `tensor.item()` in Python or `tensor.item<T>()` in C++ to convert a 0-dim tensor to a number')`

While on CPU, the following warning is printed and None is returned:
`repro.py:23: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more informations. (Triggered internally at pytorch/build/aten/src/ATen/core/TensorBody.h:489.)
  print(f'{t1.grad=}')
t1.grad=None`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153827
Approved by: https://github.com/xmfan
2025-05-28 22:52:40 +00:00
d62a33c002 [ez] add docblock for _expandsums (#154397)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154397
Approved by: https://github.com/laithsakka
ghstack dependencies: #154400, #154398, #154396, #154399
2025-05-28 22:43:26 +00:00
0c00e32632 [ez] add docblock for _eval_is_non_overlapping_and_dense (#154399)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154399
Approved by: https://github.com/laithsakka
ghstack dependencies: #154400, #154398, #154396
2025-05-28 22:40:03 +00:00
0f56318152 [precompile] Add Exception type PackageError for unsupported precompile features. (#154430)
Summary:
Today when guard serialization fails, dynamo will raise an internal error like:

```
torch._dynamo.exc.InternalTorchDynamoError: RuntimeError: CLOSURE_MATCH guard cannot be serialized.
```

Adding a dedicated PackageError type to surface the error more clearly.
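
A minimal sketch of the pattern; the guard check shown is hypothetical, and only the exception type name comes from the PR:

```
class PackageError(Exception):
    """Raised when a precompile/guard-serialization feature is unsupported."""

def serialize_guard(guard):
    # Hypothetical check: surface unsupported guards as a clear, dedicated
    # error instead of an InternalTorchDynamoError.
    if getattr(guard, "kind", None) == "CLOSURE_MATCH":
        raise PackageError(f"{guard.kind} guard cannot be serialized.")
    return repr(guard)
```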

Test Plan: CI

Differential Revision: D75452124

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154430
Approved by: https://github.com/jamesjwu, https://github.com/jansel
2025-05-28 22:34:51 +00:00
11129d9317 Add new ops in fallback ops (#154251)
Fixes #ISSUE_NUMBER

## Background

Task: [T222738229](https://www.internalfb.com/intern/tasks/?t=222738229)

It's the first starter task on the project **_Enabling TorchNative Standalone on Whisper_**.  We are using cshim to create a layer of abstraction between _**libtorch**_ and **_AOTInductor generated artifacts_**.

So we needed to add an entry in the cshim for every API surface in libtorch that we care about, i.e. only operators that AOTInductor does not handle. For this task, we only wanted to add entries for the following ops.

## What I've done

4 new fallback ops that show up in the Whisper model have been added (in torchgen/aoti/fallback_ops.py):

- aten.permute (default)
- aten.squeeze (dim)
- aten.abs (default)
- aten.hann_window (default)

Then I ran the command below to generate the new C shim header files, as described [here](7e86a7c015/torchgen/gen.py (L2424-L2436%20for%20details))
`python torchgen/gen.py --update-aoti-c-shim`

Then, `python setup.py develop` to rebuild PyTorch

## Testing

4 new tests have also been added in test/inductor/test_aot_inductor.py:

- test_proxy_executor_permute
- test_proxy_executor_abs
- test_proxy_executor_squeeze
- test_proxy_executor_hann

I ran these commands to test it (inside local pytorch root folder):

`python test/inductor/test_aot_inductor.py -k test_proxy_executor_permute`
`python test/inductor/test_aot_inductor.py -k test_proxy_executor_abs`
`python test/inductor/test_aot_inductor.py -k test_proxy_executor_squeeze`
`python test/inductor/test_aot_inductor.py -k test_proxy_executor_hann`

## NOTE:
I didn't see any particular ordering of the tests inside _test/inductor/test_aot_inductor.py_. That's why I added the new tests right after the test given in the example.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154251
Approved by: https://github.com/angelayi
2025-05-28 22:11:07 +00:00
d2f506cae8 [ca] disable ca for functorch grad and run all HOO tests (#154147)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154147
Approved by: https://github.com/zou3519
ghstack dependencies: #154133
2025-05-28 22:06:13 +00:00
857f21631d [ca] fix hop_db tests (#154133)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154133
Approved by: https://github.com/zou3519
2025-05-28 22:06:13 +00:00
ed348e7026 Add docblock for TrackedFake (#154396)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154396
Approved by: https://github.com/laithsakka
ghstack dependencies: #154400, #154398
2025-05-28 21:19:49 +00:00
d311b79c12 add docblock for _fast_expand (#154398)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154398
Approved by: https://github.com/laithsakka
ghstack dependencies: #154400
2025-05-28 21:16:47 +00:00
e7318b863d [ez] add docblock to cast_symbool_to_symint_guardless (#154400)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154400
Approved by: https://github.com/laithsakka
2025-05-28 21:11:53 +00:00
f6dcc45c44 [Kineto x Insight] Add device to activity type map in pytorch (#154253)
Summary: Update the device-to-ActivityType map in PyTorch. Needs to be exported to GitHub.

Test Plan:
Run the ondemand e2e test and insight profiler is triggered during profiling
P1819539581: https://www.internalfb.com/intern/paste/P1819539581/
{F1978519960}

Insight profiler is not enabled when mtia_insight is not specified in the config
{F1978527200}

Reviewed By: fenypatel99

Differential Revision: D75246621

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154253
Approved by: https://github.com/Skylion007
2025-05-28 20:36:19 +00:00
e25074d462 [c10d][CI] Change expected return code in Sandcastle for Nan tests (#154441)
Fixing internal error caused by #153167.

`skip_but_pass_in_sandcastle_if` returns exit code 0. But `test_nan_assert` expects exit code -6.
So we need to set the expected return code conditionally on `IS_SANDCASTLE`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154441
Approved by: https://github.com/fduwjj, https://github.com/nWEIdia
ghstack dependencies: #153167
2025-05-28 20:35:52 +00:00
c381103fd7 Fix the logic of set_cpu_affinity (#154503)
While investigating https://github.com/pytorch/pytorch/issues/152566, I found two issues with how the CPU affinity is set in the benchmark job:

* The current logic doesn't work with cgroup slices, the mechanism behind the multi-tenant runner:
    * Using `lscpu` returns all CPUs and not the ones available from cgroups.  On the other hand, `nproc` works correctly.  For example, on H100, `lscpu` returns 192 CPUs while `nproc` returns 24 (192 / 8)
    * Setting `taskset -c 0-N` blindly is wrong because CPU 0 is only available to the first tenant, aka alice.  For example, running `taskset -c 0 ls` on any other tenant will fail. To fix this, the IDs of the available CPUs can be fetched by calling `os.sched_getaffinity(0)` (see the sketch after this list).
* The last bug is that `taskset` works with logical CPUs https://www.man7.org/linux/man-pages/man1/taskset.1.html, so using the result from `test_inductor_get_core_number` is also wrong because that function returns the number of physical CPUs.
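
A minimal sketch of the cgroup-aware pinning described above (an illustrative Linux-only script, not the actual benchmark harness):

```
import os
import subprocess

def run_pinned(cmd):
    # os.sched_getaffinity(0) returns the logical CPUs this process may
    # actually use, which respects cgroup slices, unlike parsing `lscpu`.
    cpus = ",".join(str(c) for c in sorted(os.sched_getaffinity(0)))
    # taskset operates on logical CPU ids, so an explicit list is safe.
    subprocess.run(["taskset", "-c", cpus] + cmd, check=True)

run_pinned(["python", "-c", "print('pinned')"])
```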

### Testing

CPU benchmark jobs look ok

* [aarch64 torch.compile benchmark](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Wed%2C%2021%20May%202025%2016%3A40%3A28%20GMT&stopTime=Wed%2C%2028%20May%202025%2016%3A40%3A28%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cpu%20(aarch64)&lBranch=fix-cpu-affinity-cgroups&lCommit=9a6288e083d650c470623f5fe136b1060824021c&rBranch=main&rCommit=dec5ab8d984b8a608140911351d877b9ddb141c2)
* [x86 micro benchmark](https://hud.pytorch.org/benchmark/llms?startTime=Wed%2C%2021%20May%202025%2016%3A41%3A26%20GMT&stopTime=Wed%2C%2028%20May%202025%2016%3A41%3A26%20GMT&granularity=day&lBranch=main&lCommit=c1b7dbc52aaa49f4cd147bbe5935110a4a10e3e3&rBranch=refs/tags/ciflow/inductor-micro-benchmark-cpu-x86/154503&rCommit=9a6288e083d650c470623f5fe136b1060824021c&repoName=pytorch%2Fpytorch&benchmarkName=&modelName=All%20Models&backendName=All%20Backends&modeName=All%20Modes&dtypeName=All%20DType&deviceName=cpu%20(x86_64)&archName=All%20Platforms)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154503
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-05-28 19:38:20 +00:00
66f53889d5 [nativert] port semaphore to c10 util (#153504)
Summary:
nativert RFC: https://github.com/zhxchen17/rfcs/blob/master/RFC-0043-torch-native-runtime.md

To land the runtime into PyTorch core, we will gradually land logical parts of the code on GitHub and get each piece properly reviewed.

This diff adds a simple semaphore interface to c10 until C++20, where we get counting_semaphore.

Going to need an OSS build export to take a look at this...

Test Plan: CI

Differential Revision: D73882656

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153504
Approved by: https://github.com/zhxchen17
2025-05-28 19:17:30 +00:00
24980d2641 [ROCm][CI] Update build-environment for mi300 workflows (#153134)
so their test times are tracked separately in https://raw.githubusercontent.com/pytorch/test-infra/generated-stats/stats/test-times.json. Currently, both MI200 and MI300 test times get combined into the same key `linux-focal-rocm-py3.10`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153134
Approved by: https://github.com/huydhn
2025-05-28 19:04:53 +00:00
d4ab8e74f3 Revert "Fix the Problems About Defining Static Variable in Inline Function (#147095)"
This reverts commit c6fc11af760d4ad1f01cc699a3c6488ab5f41770.

Reverted https://github.com/pytorch/pytorch/pull/147095 on behalf of https://github.com/izaitsevfb due to still fails to link internally at meta ([comment](https://github.com/pytorch/pytorch/pull/147095#issuecomment-2917221575))
2025-05-28 18:22:39 +00:00
1c7a70b483 [AOTI][cutlass backend] Do not remove the cutlass kernel .o file after packaging (#154155)
Differential Revision: [D75253009](https://our.internmc.facebook.com/intern/diff/D75253009/)

In general, we want to cache the cutlass kernels.

We also saw an error saying the .o file was not found.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154155
Approved by: https://github.com/chenyang78
2025-05-28 17:35:19 +00:00
66ac724b56 pyfmt lint torch/_export/passes/replace_view_ops_with_view_copy_ops_pass.py (#154488)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154488
Approved by: https://github.com/Skylion007
ghstack dependencies: #154483, #154484, #154485, #154487
2025-05-28 17:07:15 +00:00
dfe0f48123 pyfmt lint torch/_export/serde/schema.py (#154487)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154487
Approved by: https://github.com/Skylion007
ghstack dependencies: #154483, #154484, #154485
2025-05-28 17:07:15 +00:00
92cebed1bd pyfmt lint torch/_export/serde/serialize.py (#154485)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154485
Approved by: https://github.com/Skylion007
ghstack dependencies: #154483, #154484
2025-05-28 17:07:07 +00:00
b4fe5ca58a pymft lint torch/utils/weak.py (#154484)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154484
Approved by: https://github.com/Skylion007
ghstack dependencies: #154483
2025-05-28 17:06:58 +00:00
4de1b25df7 Remove empty files from execlude lint rule (#154483)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154483
Approved by: https://github.com/Skylion007
2025-05-28 17:06:50 +00:00
70539308ac [dynamo] updating gb_type names for uniqueness (#154452)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154452
Approved by: https://github.com/williamwen42
2025-05-28 16:54:10 +00:00
e313152a33 SDPA fix memory efficient attention for large batch dim (#154029)
Fixes #146704

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154029
Approved by: https://github.com/ngimel
2025-05-28 16:53:53 +00:00
3b38989b5f Remove MemPoolContext (#154042)
Removes MemPoolContext from custom user mempools. The ground truth for which pool should be used is the active pool in graph_pools, and MemPoolContext just introduced an opportunity for the pool pointed to by MemPoolContext and the active pool in graph_pools to go out of sync (see all the asserts in the code trying to ensure they stay in sync, and yet it could still happen in a multithreaded scenario; see my recent PR #153990).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154042
Approved by: https://github.com/albanD, https://github.com/syed-ahmed
2025-05-28 16:35:48 +00:00
d23aa7e182 Add deprecation warning for torch.ao.quantization (#153892)
Summary:
att

Test Plan:
(ao) $ PYTHONWARNINGS='default' python
Python 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:45:18) [GCC 12.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from torch.ao.quantization.quantizer.xnnpack_quantizer import XNNPACKQuantizer
printing warning
*/anaconda3/envs/ao/lib/python3.10/site-packages/torch/ao/quantization/__init__.py:36: DeprecationWarning: torch.ao.quantization is deprecated. Plan is to
1. Remove eager mode quantization (torch.ao.quantization.quantize, torch.ao.quantization.quantize_dynamic), please migrate to use torchao eager mode quantize_ API instead
2. Remove fx graph mode quantization (torch.ao.quantization.quantize_fx.prepare_fx, torch.ao.quantization.quantize_fx.convert_fx, please migrate to use torchao pt2e quantization API instead (prepare_pt2e, convert_pt2e)
3. pt2e quantization has been migrated to torchao (https://github.com/pytorch/ao/tree/main/torchao/quantization/pt2e)
see https://dev-discuss.pytorch.org/t/torch-ao-quantization-migration-plan/2810 for more details
  warnings.warn(
>>> a = XNNPACKQuantizer()
*/anaconda3/envs/ao/lib/python3.10/site-packages/torch/ao/quantization/quantizer/xnnpack_quantizer.py:281: DeprecationWarning: XNNPACKQuantizer is deprecated! Please use xnnpack quantizer in ExecuTorch (https://github.com/pytorch/executorch/tree/main/backends/xnnpack/quantizer) instead
  warnings.warn(f"{self.__class__.__name__} is deprecated! Please use xnnpack quantizer in ExecuTorch (https://github.com/pytorch/executorch/tree/main/backends/xnnpack/quantizer) instead", DeprecationWarning)
>>>

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153892
Approved by: https://github.com/Skylion007
2025-05-28 16:25:30 +00:00
5bf74753f6 [precompile] Prune local scope variables for guard serialization. (#154431)
Summary: Prune unused local objects from the serialized local scope if they are not used in guard reconstruction. This is helpful when a user program takes things like local callable functions as inputs, or when a function call is recursive.

Test Plan:
test/dynamo/test_guard_serialization.py -k test_function_locals

Before pruning locals:
```
state = GuardsState(output_graph=OutputGraphGuardsState(local_scope={'x': tensor([ 0.0461,  0.4024, -1.0115]), 'g': <function ...aints=None, _guards=<torch._guards.GuardsSet object at 0x7fbccc7e9fc0>, _aotautograd_guards=[]), shape_code_parts=None)

    def pickle_guards_state(state: GuardsState) -> bytes:
        buf = io.BytesIO()
        pickler = GuardsStatePickler(buf)
        try:
            pickler.dump(state)
        except AttributeError as e:
>           raise torch._dynamo.exc.PackageError(str(e)) from e
E           torch._dynamo.exc.PackageError: Can't pickle local object 'TestGuardSerialization.test_function_locals.<locals>.foo'
```
After the diff
```
Tests finished: Pass 1. Fail 0. Fatal 0. Skip 0. Build failure 0
```

Differential Revision: D75452123

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154431
Approved by: https://github.com/jansel
2025-05-28 16:03:02 +00:00
9db7bcb3fe [Dynamo] Introduce hook receiving list of traced code objects (#153622)
This PR:
* Expands `Hooks` with a new, optional `frame_traced_fn` field. It should be a callable receiving the list of traced code objects
* Maintains a list of `traced_code` objects in the `TracingContext` of an `OutputGraph`
    *  Whenever an `inline_call()` is encountered, the corresponding code object is added to this set
    * `OutputGraph`'s associated `f_code` is added to the list just before the hook is called

I believe use of this hook should enable the source code hashing that vLLM does in a better way than monkey-patching `inline_call()`.
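
A sketch of what such a callback could look like, assuming it is handed the list of traced code objects; the actual registration goes through the `Hooks` object:

```
import hashlib
import inspect

def frame_traced_fn(traced_codes):
    # Hash the source of everything Dynamo traced, e.g. for a compile cache key.
    digest = hashlib.sha256()
    for code in traced_codes:
        try:
            digest.update(inspect.getsource(code).encode())
        except (OSError, TypeError):
            pass  # source unavailable (builtins, generated code, ...)
    return digest.hexdigest()
```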

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153622
Approved by: https://github.com/jansel
2025-05-28 15:40:09 +00:00
476e0a643a [ez] add docblock for ShapeGuardPythonPrinter (#154403)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154403
Approved by: https://github.com/jingsh
ghstack dependencies: #154374, #154375, #154376, #154386, #154401, #154404, #154405, #154377, #154378, #154379, #154380, #154381, #154383, #154384, #154385, #154402
2025-05-28 14:17:17 +00:00
473a93eb58 [ez] add docblock for _ShapeGuardPrinter (#154402)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154402
Approved by: https://github.com/jingsh
ghstack dependencies: #154374, #154375, #154376, #154386, #154401, #154404, #154405, #154377, #154378, #154379, #154380, #154381, #154383, #154384, #154385
2025-05-28 14:13:22 +00:00
35a473e364 [ez] add docblock for guard_scalar (#154385)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154385
Approved by: https://github.com/jingsh
ghstack dependencies: #154374, #154375, #154376, #154386, #154401, #154404, #154405, #154377, #154378, #154379, #154380, #154381, #154383, #154384
2025-05-28 14:10:07 +00:00
ee4f433963 [ez] add docblock for _guard_or (#154384)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154384
Approved by: https://github.com/pianpwk
ghstack dependencies: #154374, #154375, #154376, #154386, #154401, #154404, #154405, #154377, #154378, #154379, #154380, #154381, #154383
2025-05-28 14:06:29 +00:00
e9b97d19b1 [ez] Make SymNodeImpl comments less misleading (#154480)
As discussed in DS workchat, it's easy for users to get confused by guarding behavior in these supposedly non-guarding methods. The TL;DR is that in the case of non-Pythonic compilers like XLA, we actually do guard. I've updated the comments accordingly to reduce confusion.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154480
Approved by: https://github.com/pianpwk, https://github.com/Skylion007
2025-05-28 14:04:32 +00:00
a75e3a02be Revert "[dynamo, nested graph breaks] small fixes to resume function generation (#151056)"
This reverts commit 28e7aa21c522e92ea01a62dfdc5e3b74e398d8f0.

Reverted https://github.com/pytorch/pytorch/pull/151056 on behalf of https://github.com/malfet due to Not sure which one, but it broke test_error_messages, see 203b0efd63/1 ([comment](https://github.com/pytorch/pytorch/pull/151056#issuecomment-2916437433))
2025-05-28 13:53:50 +00:00
9603d6382d Revert "[dynamo, nested graph breaks] refactor codegen to minimize NULL codegen'ing (#153510)"
This reverts commit 1fe98429222a8ba5e16dd9381f50a8fb90edcf0e.

Reverted https://github.com/pytorch/pytorch/pull/153510 on behalf of https://github.com/malfet due to Not sure which one, but it broke test_error_messages, see 203b0efd63/1 ([comment](https://github.com/pytorch/pytorch/pull/151056#issuecomment-2916437433))
2025-05-28 13:53:50 +00:00
5fd7004dc9 Revert "[dynamo, nested graph breaks] remove block stack graph break in output_graph (#153772)"
This reverts commit 9a66c30bdc563c62375e5030c4103b67515b8dac.

Reverted https://github.com/pytorch/pytorch/pull/153772 on behalf of https://github.com/malfet due to Not sure which one, but it broke test_error_messages, see 203b0efd63/1 ([comment](https://github.com/pytorch/pytorch/pull/151056#issuecomment-2916437433))
2025-05-28 13:53:50 +00:00
e86439ed5b Revert "[dynamo, nested graph breaks] add skip_frame debugging function (#153773)"
This reverts commit aadf9eae63c4793e1107a3b21ede30e5289eeaca.

Reverted https://github.com/pytorch/pytorch/pull/153773 on behalf of https://github.com/malfet due to Not sure which one, but it broke test_error_messages, see 203b0efd63/1 ([comment](https://github.com/pytorch/pytorch/pull/151056#issuecomment-2916437433))
2025-05-28 13:53:50 +00:00
203b0efd63 [PP] Allow unused kwargs in ZB path (#153498)
This fixes the case where an unused kwarg is passed to the PP stage forward: we would try to call `torch.autograd.grad()` and update its gradients even though it shouldn't have gradients, leading to this error:

```
[rank3]:[rank3]: File "/data/users/howardhuang/pytorch/torch/distributed/pipelining/stage.py", line 613, in
[rank3]:[rank3]: return lambda: stage_backward_input(
[rank3]:[rank3]: File "/data/users/howardhuang/pytorch/torch/distributed/pipelining/_backward.py", line 199, in stage_backward_input
[rank3]:[rank3]: dinputs = torch.autograd.grad(
[rank3]:[rank3]: File "/data/users/howardhuang/pytorch/torch/autograd/init.py", line 503, in grad
[rank3]:[rank3]: result = _engine_run_backward(
[rank3]:[rank3]: File "/data/users/howardhuang/pytorch/torch/autograd/graph.py", line 824, in _engine_run_backward
[rank3]:[rank3]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[rank3]:[rank3]: RuntimeError: One of the differentiated Tensors does not require grad
```

related issues: https://github.com/pytorch/torchtitan/issues/1188

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153498
Approved by: https://github.com/kwen2501
2025-05-28 13:34:04 +00:00
cf7451f279 Fix signature of torch.sparse_coo_tensor() (#152681)
Fixes #145371

@pearu I searched all of it and found this code; I'm wondering whether it is the root cause of the issue. Could you have a review? Thanks a lot!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152681
Approved by: https://github.com/Skylion007, https://github.com/pearu, https://github.com/nikitaved
2025-05-28 13:16:41 +00:00
f58143b945 [Typing] Refactor torch.types.Device in torch/cuda/__init__.py (#153447)
Part of: #152952
Follow up: #153027

Here is the definition of `torch.types.Device`:

ab997d9ff5/torch/types.py (L74)

So `Optional[Union[Device, int]]` is equivalent to `torch.types.Device`.
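A small illustration of the simplification, with made-up signatures standing in for the torch.cuda functions being annotated:

```
from typing import Optional, Union

from torch.types import Device

# Before: the widened form spelled out by hand.
def synchronize_old(device: Optional[Union[Device, int]] = None) -> None: ...

# After: torch.types.Device already includes int and None, so this is enough.
def synchronize_new(device: Device = None) -> None: ...
```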

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153447
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2025-05-28 10:09:31 +00:00
fdc339003b Revert "[AOTI] Support multi-arch when using package_cpp_only (#154414)"
This reverts commit a84d8c4a1cc515db274366537afd0b1492800c2d.

Reverted https://github.com/pytorch/pytorch/pull/154414 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing ROCm trunk job ([comment](https://github.com/pytorch/pytorch/pull/154414#issuecomment-2915597821))
2025-05-28 09:23:31 +00:00
853958f82c Fix: Replacements can cause runtime assertions to disappear and can cause invalid inductor code. (#153661)
Let's first explore a couple of problems related to replacements and runtime assertions.

#### example problem 1
If we have a runtime assertion that u0==s0, where u0 is an input coming from mark_unbacked, a replacement u0=s0 will be added and the function f(u0, s0) will become f(s0, s0). This leads to the assert not being inserted during insert_deferred_runtime_asserts.
The reason is that the insert_deferred_runtime_asserts logic inserts each assertion once all of its inputs are seen, but u0 will never be seen. The same thing can happen when we defer an assertion on backed symbols, e.g. s0==s2, etc.

#### example problem 2
Consider u0==s0, where u0 comes from a call to .item(). Imagine that later on s0 gets specialized to 2. In that case s0 won't be seen as an input during insert_deferred_runtime_asserts and the assertion won't be inserted in the graph. Worse, Inductor will generate code in the cpp wrapper that refers to s0 while it does not exist, causing a failure.
internal xref: https://fb.workplace.com/groups/1075192433118967/permalink/1669766396994898/
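
A minimal sketch of the example-problem-2 pattern, assuming the usual data-dependent setup (not taken from the PR's tests):

```
import torch

class M(torch.nn.Module):
    def forward(self, x, n):
        u0 = n.item()                    # unbacked symbol u0
        torch._check(u0 == x.shape[0])   # deferred runtime assertion u0 == s0
        return x[:u0] * 2
```

If s0 later specializes, the assertion still needs to reference the graph input rather than the replaced value, which is what the change below ensures.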

## The solution:
The runtime assertion insertion loops depend on detecting that the symbols used in the runtime assertions have been seen; note that those symbols are either graph inputs or generated in the graph by data-dependent ops like .item().

The issues above happen when the symbols are graph inputs. In order to force the symbols to exist in the graph and to be seen by the runtime assertions, we do not do replacements on placeholder expressions during codegen or during runtime assertion insertion.

This should not have a performance overhead: we already optimized the graph with replacements, and the only effect is no longer mistakenly dropping graph inputs that are used in runtime assertions.
I added extended testing. One unrelated follow-up I noticed: we might want to rename unbacked symbols in runtime assertions when we do unbacked renaming, but that's a different issue.

Other approaches that did not work:
#### Ban replacements on unbacked symbols
1. Does not work when we defer runtime assertions on backed symbols, e.g. s0==s1. We could also ban such replacements, but then problem 2 becomes more problematic.
2. It degrades the quality of reasoning, in a bad way.

#### Apply specialization on runtime assertions before codegen
1. Can fix some issues, but may also lead to runtime assertions becoming no-ops.
2. Does not fix the case where runtime assertions are not inserted during insert_deferred_runtime_asserts because the input is not detected.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153661
Approved by: https://github.com/jansel
2025-05-28 09:08:05 +00:00
aadf9eae63 [dynamo, nested graph breaks] add skip_frame debugging function (#153773)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153773
Approved by: https://github.com/jansel
ghstack dependencies: #151056, #153510, #153772
2025-05-28 08:54:09 +00:00
9a66c30bdc [dynamo, nested graph breaks] remove block stack graph break in output_graph (#153772)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153772
Approved by: https://github.com/jansel
ghstack dependencies: #151056, #153510
2025-05-28 08:54:09 +00:00
1fe9842922 [dynamo, nested graph breaks] refactor codegen to minimize NULL codegen'ing (#153510)
Stop codegening NULLs that we need to pop later. Some output_graph.py changes to prepare for nested graph break support.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153510
Approved by: https://github.com/jansel
ghstack dependencies: #151056
2025-05-28 08:54:09 +00:00
28e7aa21c5 [dynamo, nested graph breaks] small fixes to resume function generation (#151056)
Old: ~pack resume function stack + locals into a list: we need to be able to pass frame stack+locals in lists to hand off to nested functions in the future, so we implement this part first.~

We are no longer doing this right now since GraphModule/guard variable naming gets messed up. Going forward, our approach will be to keep the top frame unpacked, but pack the rest of the contents of other frames in a list.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151056
Approved by: https://github.com/jansel
2025-05-28 08:54:09 +00:00
9d04c0f352 Remove outdated CUDA 11 conditions (#154313)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154313
Approved by: https://github.com/eqy
2025-05-28 08:44:58 +00:00
1d9b7dd2d1 [PGO] suggest dynamic whitelist for recompilations (#154189)
Suggests `TORCH_COMPILE_DYNAMIC_SOURCES` based on tensor size changes in the PGO code state, including parameters.

Closing #153442 which took the dynamo guards approach.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154189
Approved by: https://github.com/bobrenjc93
2025-05-28 07:11:43 +00:00
fe760b6636 [ez] add docblock for _free_unbacked_symbols_with_path (#154383)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154383
Approved by: https://github.com/pianpwk
ghstack dependencies: #154374, #154375, #154376, #154386, #154401, #154404, #154405, #154377, #154378, #154379, #154380, #154381
2025-05-28 05:53:50 +00:00
8e25ba6963 [ez] add docblock for find_symbol_binding_fx_nodes (#154381)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154381
Approved by: https://github.com/pianpwk
ghstack dependencies: #154374, #154375, #154376, #154386, #154401, #154404, #154405, #154377, #154378, #154379, #154380
2025-05-28 05:44:26 +00:00
08c29deb5f [ez] add docblock to is_symbol_binding_fx_node (#154380)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154380
Approved by: https://github.com/pianpwk
ghstack dependencies: #154374, #154375, #154376, #154386, #154401, #154404, #154405, #154377, #154378, #154379
2025-05-28 05:41:19 +00:00
07405a6cff [ez] add docblock for free_unbacked_symbols (#154379)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154379
Approved by: https://github.com/pianpwk
ghstack dependencies: #154374, #154375, #154376, #154386, #154401, #154404, #154405, #154377, #154378
2025-05-28 05:37:25 +00:00
dcdaef5206 [ez] add docblock for free_symbols (#154378)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154378
Approved by: https://github.com/pianpwk
ghstack dependencies: #154374, #154375, #154376, #154386, #154401, #154404, #154405, #154377
2025-05-28 05:34:25 +00:00
abc3fdc7ac [ez] add docblock for _iterate_exprs (#154377)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154377
Approved by: https://github.com/pianpwk
ghstack dependencies: #154374, #154375, #154376, #154386, #154401, #154404, #154405
2025-05-28 05:28:58 +00:00
ab6cb85cb0 [ez] add docblock for _remove_effect_token_unbacked_bindings (#154405)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154405
Approved by: https://github.com/Skylion007, https://github.com/pianpwk
ghstack dependencies: #154374, #154375, #154376, #154386, #154401, #154404
2025-05-28 05:16:14 +00:00
fde8f6a8b8 [ez] add docblock for _suggest_torch_checks (#154404)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154404
Approved by: https://github.com/Skylion007
ghstack dependencies: #154374, #154375, #154376, #154386, #154401
2025-05-28 04:45:55 +00:00
b82fb57b67 [ez] add docblock for RuntimeAssert (#154401)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154401
Approved by: https://github.com/Skylion007
ghstack dependencies: #154374, #154375, #154376, #154386
2025-05-28 04:43:22 +00:00
d64b4a91dd [ez] remove unused function _constrain_symbol_range (#154386)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154386
Approved by: https://github.com/Skylion007
ghstack dependencies: #154374, #154375, #154376
2025-05-28 04:41:00 +00:00
ef90cc18d7 use definitely_contiguous for _prim_elementwise_meta short circuit (#153441)
This verifies that the check short circuit is not material. https://github.com/pytorch/pytorch/pull/153431
```
import torch
from torch.export import Dim, export
class MyModel(torch.nn.Module):
    def forward(self, x, ranks):
        first_k = ranks.max().item()
        torch._check_is_size(first_k)
        narrow = x.narrow(dim = 1, start = 0, length = first_k)
        lt = narrow < narrow.size(1)
        return lt
inps = (
    torch.randn((8, 16), device="cuda"),
    torch.arange(8, device="cuda", dtype=torch.int8)
)
spec = {
    "x": (Dim.AUTO, Dim.AUTO),
    "ranks": (Dim.AUTO,),
}
traced = export(MyModel(), inps, dynamic_shapes=spec, strict=True).run_decompositions({})

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153441
Approved by: https://github.com/jansel
ghstack dependencies: #153432
2025-05-28 03:41:26 +00:00
39df901b2a introduce definitely_contiguous and use it for reshape and tensor meta data computation. (#153432)
When a tensor has unbacked symbols, it can be general enough to represent both contiguous and non-contiguous tensors;
in that case we can't really evaluate is_contiguous. In many places in the code base we check is_contiguous to take a fast path, but the general path usually works for both contiguous and non-contiguous tensors, so in those cases we probably want
to use the definitely_contiguous API.

This is applied to reshape in this PR and also to tensor metadata computation: the metadata will now have an attribute that says it is contiguous when it is always contiguous. We store that only when definitely_contiguous is true.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153432
Approved by: https://github.com/bobrenjc93
2025-05-28 03:41:26 +00:00
54f1f29fed [dynamo] dynamic gb_type -> static gb_type (#154435)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154435
Approved by: https://github.com/williamwen42
2025-05-28 03:14:26 +00:00
f12ce4e36b [Intel GPU] convolution fusion at XPU backend (#154202)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154202
Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/etaf
ghstack dependencies: #140365
2025-05-28 03:14:18 +00:00
c6fc11af76 Fix the Problems About Defining Static Variable in Inline Function (#147095)
Refer to https://github.com/pytorch/pytorch/issues/125465 for more informations

- Remove unused header files
- Move the inline function that defines the static variable to .cc

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147095
Approved by: https://github.com/cyyever, https://github.com/albanD
2025-05-28 02:47:16 +00:00
855eff8e8e Don't CSE unbacked nodes (#154387)
* #154440
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154387
Approved by: https://github.com/TroyGarden
ghstack dependencies: #154440
2025-05-28 02:21:56 +00:00
919a1a17e3 [ez] Replace misleading implementations with NYI (#154440)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154440
Approved by: https://github.com/Skylion007, https://github.com/pianpwk
2025-05-28 02:21:56 +00:00
a84d8c4a1c [AOTI] Support multi-arch when using package_cpp_only (#154414)
Summary: Add support for multi_arch_kernel_binary in the package_cpp_only mode. More specifically, generate cmake targets that compile .ptx to .fatbin and embed them in the final shared library or binary.

Differential Revision: [D75452096](https://our.internmc.facebook.com/intern/diff/D75452096)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154414
Approved by: https://github.com/angelayi
ghstack dependencies: #154412, #154413
2025-05-28 01:20:38 +00:00
cde82d25b7 [AOTI] Add a multi_arch_kernel_binary option (#154413)
Summary: CUDA can support multi-arch with the fatbin format. Add this multi_arch_kernel_binary option, so the compiled model binary can run across different GPU archs.

Differential Revision: [D75452094](https://our.internmc.facebook.com/intern/diff/D75452094)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154413
Approved by: https://github.com/angelayi
ghstack dependencies: #154412
2025-05-28 01:20:38 +00:00
4d8f3d537a [AOTI][refactor] Rename embed_cubin to embed_kernel_binary (#154412)
Summary: Rename as it is not CUDA specific.

Differential Revision: [D75452095](https://our.internmc.facebook.com/intern/diff/D75452095)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154412
Approved by: https://github.com/angelayi
2025-05-28 01:20:28 +00:00
e79790e14b [ez] add docblock for _sympy_from_args (#154376)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154376
Approved by: https://github.com/Skylion007
ghstack dependencies: #154374, #154375
2025-05-27 23:43:13 +00:00
fe082c5ffe Move inductor workflows focal (ubuntu 20.04) -> jammy (ubuntu 22.04) (#154153)
Trying to fix: https://github.com/pytorch/pytorch/issues/154157

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154153
Approved by: https://github.com/Skylion007, https://github.com/huydhn, https://github.com/nike4949, https://github.com/cyyever
2025-05-27 23:16:21 +00:00
3f10c9d8af Fixed an issue with XPU skip so the test_decompose_mem_bound_mm.py suite can be run correctly (#153245)
Fixes #153239

Replaced the custom decorator with the common one. Although the better way to skip the whole suite would be to add it to the skip list in run_test.py.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153245
Approved by: https://github.com/jeffdaily
2025-05-27 23:10:25 +00:00
4b39832412 [CI] Update torchbench pin (#154453)
Related to https://github.com/pytorch/pytorch/issues/154446
Pins the torchbench repo to https://github.com/pytorch/benchmark/pull/2620, which pins opacus to version ``1.5.3``

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154453
Approved by: https://github.com/wdvr, https://github.com/malfet
2025-05-27 23:08:42 +00:00
247ea229ba Create issue template: Release highlight for proposed Feature (#154125)
Authors: @anitakat @atalman

This is related to: https://github.com/pytorch/pytorch/issues/152134 . Adding RFC template for feature submissions

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154125
Approved by: https://github.com/anitakat, https://github.com/ZainRizvi, https://github.com/albanD
2025-05-27 22:45:21 +00:00
53affa273b [MTIA Aten Backend][1.3/n] Migrate remaining view ops, which all need explicit register in native_functions.yaml (#154337)
See context in D75266206.

This diff/PR migrates all the remaining view ops, which all need changes in `native_functions.yaml` and thus need to be exported to a PR.

Ops covered by this diff:
- _reshape_alias
- unfold

internal: Also delete the entire aten_mtia_view_ops.cpp file, and update corresponding build config.

Differential Revision: [D75385411](https://our.internmc.facebook.com/intern/diff/D75385411/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154337
Approved by: https://github.com/nautsimon
ghstack dependencies: #154336
2025-05-27 22:18:12 +00:00
eaf355cb11 [BE] Clean up unused parameter input in AOTIModel (#154276)
Summary: As title

Test Plan: CI

Differential Revision: D74691763

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154276
Approved by: https://github.com/Skylion007
2025-05-27 22:17:32 +00:00
241f8dc84d Revert "Remove outdated CUDA 11 conditions (#154313)"
This reverts commit 3936e6141c09dab94f21e4fdab7bea4bddf62ac2.

Reverted https://github.com/pytorch/pytorch/pull/154313 on behalf of https://github.com/izaitsevfb due to breaks internal builds ([comment](https://github.com/pytorch/pytorch/pull/154313#issuecomment-2914230005))
2025-05-27 21:54:41 +00:00
6be829535f [ROCm] Improve vectorized elementwise kernel performance in MI300X (#153634)
* Use non-temporal loads to improve the vectorized elementwise kernel performance on MI300
* Use thread_work_size of 8 or 16 for vectorized elementwise kernel

Co-author: @amd-hhashemi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153634
Approved by: https://github.com/jeffdaily
2025-05-27 20:49:32 +00:00
555fc05868 Revert "[Inductor] Improve typing, and prepare for ABI-compatible AOTI C-shim dispatching (#154371)"
This reverts commit 6169ca0b65bcb382faa1a2287278b3717c18f127.

Reverted https://github.com/pytorch/pytorch/pull/154371 on behalf of https://github.com/benjaminglass1 due to Appears to have broken main ([comment](https://github.com/pytorch/pytorch/pull/154371#issuecomment-2913975736))
2025-05-27 20:39:09 +00:00
7359705232 Add CPython tests for unittest (#150788)
Tests:
* test_assertions.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150788
Approved by: https://github.com/williamwen42
2025-05-27 20:26:17 +00:00
12fc06d267 Add CPython complex tests (#152015)
Tests:
* test_complex.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152015
Approved by: https://github.com/williamwen42
2025-05-27 20:24:28 +00:00
3b218e56dc Add CPython tests for iter/sort (#150797)
Tests:
* test_iter.py
* test_sort.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150797
Approved by: https://github.com/williamwen42
2025-05-27 20:22:34 +00:00
4fd8a54a41 [ez] add docblock for is_accessor_node (#154375)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154375
Approved by: https://github.com/Skylion007, https://github.com/pianpwk
ghstack dependencies: #154374
2025-05-27 19:47:32 +00:00
b367e5f6a6 [ROCm][Windows] Fix building torch 2.8 wheel with ROCm (added hipblasLt and rocblas directories) (#153144)
Since rocblas.dll and hipblaslt.dll are copied to torch/lib, the rocblas and hipblaslt directories need to be stored there too (otherwise, after wheel installation, we get an error when searching for files in rocblas/library and hipblaslt/library, which don't exist). This PR fixes that issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153144
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-05-27 19:40:28 +00:00
fa6ca59079 Revert "Move inductor workflows focal (ubuntu 20.04) -> jammy (ubuntu 22.04) (#154153)"
This reverts commit 2bd95f3a1f07132aa00f5c438c5228866d7dd1f8.

Reverted https://github.com/pytorch/pytorch/pull/154153 on behalf of https://github.com/malfet due to Broke inductor tests, see b8452e55bc/1 ([comment](https://github.com/pytorch/pytorch/pull/154153#issuecomment-2913738047))
2025-05-27 19:23:28 +00:00
6169ca0b65 [Inductor] Improve typing, and prepare for ABI-compatible AOTI C-shim dispatching (#154371)
Prepares for the next PR in the stack by tightening up typing on a `cpp_wrapper` interface that's only used in one (well-typed) place, and by handling the downstream effects of that change. In particular, this enabled:

1. removing a number of now clearly unnecessary asserts
2. adding a few more targeted asserts to validate the code's current assumptions
3. removing some unneeded control flow in several functions

As far as I can tell, this PR should be functionally neutral. One argument was removed from a `cpp_wrapper` public API, but that argument was unused, and only had a single callsite.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154371
Approved by: https://github.com/desertfire
2025-05-27 19:17:41 +00:00
75bbd4989c [dynamo] Support using symint from dispatcher-style tensor subclass (#154130)
Fixes #146932.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154130
Approved by: https://github.com/laithsakka
2025-05-27 19:05:46 +00:00
8c0f07f944 Revert "[ROCm] Improve vectorized elementwise kernel performance in MI300X (#153634)"
This reverts commit 0d4de7872ac019abbd6e87b3391b2276d9d05bd4.

Reverted https://github.com/pytorch/pytorch/pull/153634 on behalf of https://github.com/malfet due to Broke inductor jobs, see b8452e55bc/1 ([comment](https://github.com/pytorch/pytorch/pull/153634#issuecomment-2913619071))
2025-05-27 19:02:59 +00:00
b8452e55bc [Kineto x Insight] Update Kineto submodule (#154426)
Summary: We add a new ActivityType::MTIA_INSIGHT in 20f652846f

Test Plan: CI

Differential Revision: D75454945

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154426
Approved by: https://github.com/Skylion007
2025-05-27 18:29:29 +00:00
5075df6fee Make torch importable if compiled without TensorPipe (#154382)
By delaying the import and hiding it behind the `torch.distributed.rpc.is_tensorpipe_avaiable()` check.
Fixes https://github.com/pytorch/pytorch/issues/154300

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154382
Approved by: https://github.com/Skylion007
ghstack dependencies: #154325
2025-05-27 18:13:38 +00:00
f472ea63bb [BE] Fix typos in SyntaxError description (#154436)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154436
Approved by: https://github.com/seemethere, https://github.com/wdvr, https://github.com/ZainRizvi
2025-05-27 18:08:58 +00:00
cfbd99fdfd [Pytorch] Add option to CPU Blas GEMM to avoid output downcast (#154012)
Summary:
Dot product for a single output element consists of 3 steps (both input vectors have elements of type scalar_t):
1. elementwise vector multiply (scalar_t x scalar_t -> opmath_t)
2. vector reduction to a scalar value (opmath_t -> opmath_t)
3. optional downcast if opmath_t != out_t

The current blas kernel performs steps 1 and 2 correctly, but for step 3, it will always downcast to scalar_t even when opmath_t == output_t (and then do an upcast back to output_t), which results in precision loss. This diff fixes the precision loss in the BlasKernel
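
A small illustration (not the BlasKernel code) of why the extra downcast loses precision when opmath_t == out_t == float32 and the inputs are bfloat16:

```
import torch

a = torch.randn(4096, dtype=torch.bfloat16)
b = torch.randn(4096, dtype=torch.bfloat16)

# Steps 1+2: elementwise multiply and reduce in float32 (opmath_t).
acc = (a.float() * b.float()).sum()

# Step 3 done wrong: downcast to scalar_t (bfloat16), then back to float32.
wrong = acc.to(torch.bfloat16).float()

print((acc - wrong).abs().item())  # typically nonzero: precision was lost
```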

Test Plan: Attention CI passes

Differential Revision: D75023858

topic: not user facing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154012
Approved by: https://github.com/Valentine233, https://github.com/aditew01, https://github.com/CaoE, https://github.com/drisspg
2025-05-27 17:43:21 +00:00
1ca082d9a1 [ez] Rewrite comment to be more friendly to non haskellers (#151421)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151421
Approved by: https://github.com/aorenste
2025-05-27 17:32:34 +00:00
70fbd5e08c [ez] Add docblock for resolve_unbacked_bindings (#154374)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154374
Approved by: https://github.com/Skylion007, https://github.com/pianpwk
2025-05-27 17:05:49 +00:00
2560c1f3f0 add sticky cache pgo (#154418)
It's a reland of https://github.com/pytorch/pytorch/pull/154394 that hit some mergebot bug

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154418
Approved by: https://github.com/malfet
2025-05-27 16:40:18 +00:00
514409d032 update torchvision pin (#154255)
Fixes #153985

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154255
Approved by: https://github.com/desertfire
2025-05-27 16:15:25 +00:00
0ddfd1ed43 [Intel GPU] Enable mkdnn._linear_pointwise at XPU backend (#140365)
# Motivation

This PR is intended to add post-op fusion support for Linear. The linear-pointwise fusion is expected to be used in graph mode, e.g. torch.compile. The FusionUtils.cpp file defines utility APIs for generating primitive attributes. These APIs would also be used for the conv-pointwise fusion in #140372.

# Validation
```bash
   python test/xpu/test_fusion.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140365
Approved by: https://github.com/etaf, https://github.com/guangyey, https://github.com/EikanWang
2025-05-27 15:57:15 +00:00
0d4de7872a [ROCm] Improve vectorized elementwise kernel performance in MI300X (#153634)
* Use non-temporal loads to improve the vectorized elementwise kernel performance on MI300
* Use thread_work_size of 8 or 16 for vectorized elementwise kernel

Co-author: @amd-hhashemi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153634
Approved by: https://github.com/jeffdaily
2025-05-27 15:38:43 +00:00
7ae204c3b6 [BE][CI][Easy] Run lintrunner on generated .pyi stub files (#150732)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150732
Approved by: https://github.com/malfet, https://github.com/cyyever, https://github.com/aorenste
2025-05-27 14:58:02 +00:00
0a7eef140b Add torch.Tensor._make_wrapper_subclass to torch/_C/__init__.pyi (#154022)
Fixes #153790
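
A minimal wrapper-subclass sketch that exercises the method the stub now covers (simplified; real subclasses typically also implement `__torch_dispatch__`):

```
import torch

class Wrapper(torch.Tensor):
    @staticmethod
    def __new__(cls, inner: torch.Tensor):
        # Creates a subclass instance with the metadata of `inner` but no
        # real storage of its own.
        return torch.Tensor._make_wrapper_subclass(
            cls, inner.size(), dtype=inner.dtype, device=inner.device
        )

    def __init__(self, inner: torch.Tensor):
        self.inner = inner

w = Wrapper(torch.randn(2, 3))
print(type(w), w.shape)
```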

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154022
Approved by: https://github.com/Skylion007
2025-05-27 14:10:00 +00:00
d88699308f [CI][MacOS] Move more dependencies to pypi (#154309)
Hopefully the last step before all Mac builds/tests can be switched away from conda
- Update the cmake version from 3.22 to 3.25, as 3.22 from PyPI seems to be unusable with Python 3.12
- Add `--plat-name macosx_11_0_arm64` to the setup.py command
- Remove the `codesign` workaround for cmake (that was probably never really necessary)
- Install `libpng` and `jpeg-turbo` when building torchbench, and build torchaudio without OpenMP (to be fixed)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154309
Approved by: https://github.com/Skylion007, https://github.com/cyyever
2025-05-27 13:49:40 +00:00
11a51a11af Revert "introduce definitely_contiguous and use it for reshape and tensor meta data computation. (#153432)"
This reverts commit 5c6d7caaaa08f134c3b17ce032cb014527b53417.

Reverted https://github.com/pytorch/pytorch/pull/153432 on behalf of https://github.com/malfet due to Looks like it broke flex attention tests, see https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=g6.4xlarge&mergeEphemeralLF=true ([comment](https://github.com/pytorch/pytorch/pull/153432#issuecomment-2912562570))
2025-05-27 13:42:34 +00:00
c52a002a22 Add getDeviceProperties api to torch mtia device (#153577)
topic: not user facing

Test Plan: Internal benchmark.

Differential Revision: D74256550

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153577
Approved by: https://github.com/nautsimon
2025-05-27 11:55:58 +00:00
2bd95f3a1f Move inductor workflows focal (ubuntu 20.04) -> jammy (ubuntu 22.04) (#154153)
Trying to fix: https://github.com/pytorch/pytorch/issues/154157

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154153
Approved by: https://github.com/Skylion007, https://github.com/huydhn, https://github.com/nike4949, https://github.com/cyyever
2025-05-27 11:53:47 +00:00
6f86c1ce1d Add pyrefly.toml (#154144)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154144
Approved by: https://github.com/Skylion007
2025-05-27 10:16:30 +00:00
5c6d7caaaa introduce definitely_contiguous and use it for reshape and tensor meta data computation. (#153432)
When a tensor has unbacked symbols, it can be general enough to represent both contiguous and non-contiguous tensors;
in that case we can't really evaluate is_contiguous. In many places in the code base we check is_contiguous to take a fast path, but the general path usually works for both contiguous and non-contiguous tensors, so in those cases we probably want
to use the definitely_contiguous API.

This is applied to reshape in this PR and also to tensor metadata computation: the metadata will now have an attribute that says it is contiguous when it is always contiguous. We store that only when definitely_contiguous is true.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153432
Approved by: https://github.com/bobrenjc93
2025-05-27 08:54:31 +00:00
dec5ab8d98 [MTIA Aten Backend][1.2/n] Migrate as_strided to in-tree, and add unit tests (#154336)
See context in PR https://github.com/pytorch/pytorch/pull/153670

This diff migrates as_strided to in-tree. I found it's not covered by `test_kernel_eager_ci`, so I'm also adding unit tests.

Differential Revision: [D75385404](https://our.internmc.facebook.com/intern/diff/D75385404/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154336
Approved by: https://github.com/nautsimon
2025-05-27 06:32:38 +00:00
ef6306e1c6 Revert "[executorch hash update] update the pinned executorch hash (#153436)"
This reverts commit 8d6139b8d8a75aab5ead4262ff59d48615ebee31.

Reverted https://github.com/pytorch/pytorch/pull/153436 on behalf of https://github.com/malfet due to Broke ET sanity ([comment](https://github.com/pytorch/pytorch/pull/153436#issuecomment-2911206795))
2025-05-27 06:02:14 +00:00
870133b2a0 Use get_device_context in aoti runtime for XPU directly (#154360)
# Motivation
Reuse [c10::xpu::get_device_context](1bebe0424e/c10/xpu/XPUFunctions.h (L27)) directly to reduce overhead, as it returns a cached `sycl::context` managed by PyTorch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154360
Approved by: https://github.com/EikanWang
2025-05-27 05:55:59 +00:00
8d89cdceb6 fix a compilation issue when TORCH_XPU_ARCH_LIST is an empty string (#153604)
When `XPU_ARCH_FLAGS` is an empty string, compilation will fail on `C10_STRINGIZE(XPU_ARCH_FLAGS)` in file `torch/csrc/xpu/Module.cpp` on Windows.
This PR fixes this issue by setting `TORCH_XPU_ARCH_LIST` to `""` to avoid an empty string conversion in `C10_STRINGIZE()` when compiling without an AOT.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153604
Approved by: https://github.com/guangyey, https://github.com/EikanWang

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
2025-05-27 05:26:46 +00:00
8d6139b8d8 [executorch hash update] update the pinned executorch hash (#153436)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153436
Approved by: https://github.com/pytorchbot
2025-05-27 04:54:46 +00:00
912af9b2c2 update torchbench pin (#154256)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154256
Approved by: https://github.com/huydhn
2025-05-27 04:40:54 +00:00
8d319607a7 [CPU][Brgemm] add s8s8 GEMM microkernel API (#154358)
As the title says. `u8s8` and `u8u8` are already supported.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154358
Approved by: https://github.com/leslie-fang-intel, https://github.com/Skylion007, https://github.com/Valentine233
2025-05-27 03:47:56 +00:00
f8010e7b93 [nativert] Move file_util to pytorch core (#153162)
Summary: fbcode//sigmoid/core/common -> fbcode//caffe2/torch/nativert/common

Test Plan: Github CI

Differential Revision: D74328089

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153162
Approved by: https://github.com/zhxchen17
2025-05-27 03:42:47 +00:00
70d12ccc3f [Torch] Fix error message formatting in fp8 comparison logic (#153647)
Summary: Using `\` includes all the tabs from the next line in the error message.

Test Plan: Nothing, simply error message fixing

Reviewed By: exclamaforte

Differential Revision: D74539234

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153647
Approved by: https://github.com/exclamaforte
2025-05-27 02:51:05 +00:00
100ec0b34a [Inductor] Allow passing in custom lowering dict to register_lowering() (#154344)
This PR adds support for passing a custom lowering dict to `register_lowering()`, which allows systems (e.g. Helion, https://github.com/pytorch-labs/helion/pull/80) that use Inductor to maintain their own lowering dict instead of using the Inductor global `lowerings` dict.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154344
Approved by: https://github.com/jansel
2025-05-27 01:35:26 +00:00
3936e6141c Remove outdated CUDA 11 conditions (#154313)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154313
Approved by: https://github.com/eqy
2025-05-27 00:30:14 +00:00
6006352ed3 [BE] Refactor manywheel build scripts (#154372)
1. Remove `CentOS Linux` cases, since it's deprecated
2. Remove logic for old CUDA versions
3. Remove logic for `CUDA_VERSION=12.4` since we deprecated CUDA 12.4 support
4. Simplify setting `USE_CUFILE=1` - only supported on CUDA 12.6 and 12.8 builds
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154372
Approved by: https://github.com/malfet, https://github.com/huydhn
2025-05-26 23:17:23 +00:00
b643076e4e Revert "[executorch hash update] update the pinned executorch hash (#153436)"
This reverts commit b6868f290e4882f9c895b1c9476327974288eaba.

Reverted https://github.com/pytorch/pytorch/pull/153436 on behalf of https://github.com/malfet due to Broke ET sanity ([comment](https://github.com/pytorch/pytorch/pull/153436#issuecomment-2910692163))
2025-05-26 22:09:16 +00:00
aaf5cc13d9 [EASY] use guard_or_false instead of gso in Meta converter (#154234)
This was added in https://github.com/pytorch/pytorch/pull/141659; the current change keeps the same intention:
"I do not want to fail here if I can't tell whether the size is zero or not."
I am not familiar enough with the code to know whether we need a runtime check here, but looking at the current
implementation, guard_or_false seems appropriate: it matches the current behaviour and has the same effect as guard_size_oblivious here.
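A small illustration of that intent (assuming `guard_or_false` is importable from torch.fx.experimental.symbolic_shapes, as in recent builds; this is not the PR diff):
```python
from torch.fx.experimental.symbolic_shapes import guard_or_false

def size_is_definitely_zero(size) -> bool:
    # For a plain int this is just `size == 0`; for a SymInt backed by an
    # unbacked symbol it returns False instead of raising when undecidable.
    return guard_or_false(size == 0)
```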
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154234
Approved by: https://github.com/bobrenjc93
ghstack dependencies: #154154, #154164, #154167, #154172
2025-05-26 21:59:52 +00:00
e33feddb72 used guard_or_false instead of guard_size_oblivious inside maybe_reduce (#154172)
This was added in https://github.com/pytorch/pytorch/pull/119562
the idea in this loop seems to be the following.
```
    if (TORCH_GUARD_SIZE_OBLIVIOUS(size.sym_eq(1))) {
      // NB: we could short circuit this once needs_reduce is true but there's
      // no point since the reduction function will guard on this anyway
      if (!c10::guard_or_false(size.sym_eq(target), __FILE__, __LINE__)) {
        needs_reduce = true;
      }
    } else {
      if (!size.sym_eq(target).expect_true(__FILE__, __LINE__)) {
        fail();
      }
    }
  ```
  1. if we know size == 1:
       1.1: if we know for sure size == target --> no reduce needed.
       1.2: if we know for sure size != target --> we do the reduction.
       1.3: if we cannot tell whether size == target --> we do the reduction.
  2. if we do not know whether size == 1:
     we add a runtime assertion that size == target, and we fail at runtime if size is not equal to target.

We could have simplified 1.1 and always done the reduction under case 1, since doing 1.3 without runtime checks implies
that doing so is safe, but I suspect the reason for keeping the fast path is perf.

Anyway, using TORCH_GUARD_OR_FALSE instead of TORCH_GUARD_SIZE_OBLIVIOUS here is appropriate:
there is really no clear reason for size-oblivious reasoning, or for this logic not to apply when size is not size-like;
size is always >= 0 anyway, but weak reasoning can leave us unable to infer that even though we know it is true here.

 python test/dynamo/test_misc.py -k test_validate_outputs_unbacked

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154172
Approved by: https://github.com/bobrenjc93
ghstack dependencies: #154154, #154164, #154167
2025-05-26 21:59:52 +00:00
ab5137b048 used guard_or_false instead of guard_size_oblivious in is_int_or_symint (#154167)
This is a short circuit that we should not fail on. Before this PR we would not fail on u0 or u0+u1,
but only if they are size-like; we would still fail on u0-u1, etc., for no good reason.
guard_or_false seems appropriate for that reason.

This was added in https://github.com/pytorch/pytorch/pull/122145. There were no unit tests for me to verify
why it was added; I could not repro it using the associated issue, and the example there does not work.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154167
Approved by: https://github.com/bobrenjc93
ghstack dependencies: #154154, #154164
2025-05-26 21:59:45 +00:00
1da2cc52bc [EASY] remove guard_size_oblivious from is_nonzero proxy call check (#154164)
This was added in https://github.com/pytorch/pytorch/pull/149637;
torch._check can handle unbacked symbols, so there is no need for size-oblivious reasoning here.

Note this does not make is_nonzero unbacked-friendly, but that is a different story.
I ran the test added in https://github.com/pytorch/pytorch/pull/149637 for verification.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154164
Approved by: https://github.com/aorenste, https://github.com/bobrenjc93
ghstack dependencies: #154154
2025-05-26 21:59:29 +00:00
f8a2998832 [EASY] used guard_or_false instead of guard_sizes_oblivious in pointless_view (#154154)
The change is direct and clear: the optimization removes pointless_view iff all sizes are the same; if not, we want to return false. There is no need for size-oblivious reasoning.

This was added in https://github.com/pytorch/pytorch/pull/139136; run the existing tests that were added in that PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154154
Approved by: https://github.com/bobrenjc93
2025-05-26 21:59:21 +00:00
e89ee1e217 Pin almalinux version to 8.10-20250519 (#154367)
This PR pins Almalinux version to latest supported 8.10

This is related to: https://github.com/pytorch/pytorch/pull/154364
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154367
Approved by: https://github.com/jeanschmidt, https://github.com/wdvr, https://github.com/malfet, https://github.com/huydhn
2025-05-26 20:08:20 +00:00
839c9c6156 Use property instead of ClassVar for Uniform.arg_constraints and Wishart.arg_constraints (#154361)
Fixes #154355

For these two distributions, the constraints depend on the actual values, and so `arg_constraints` cannot be a `ClassVar`.
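A minimal sketch of the property-based pattern (illustrative class, not the actual torch.distributions code):
```python
from torch.distributions import constraints

class BoundedExample:
    def __init__(self, low, high):
        self.low, self.high = low, high

    @property
    def arg_constraints(self):
        # Built from instance values, so it cannot be a class-level ClassVar.
        return {
            "low": constraints.less_than(self.high),
            "high": constraints.greater_than(self.low),
        }
```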

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154361
Approved by: https://github.com/Skylion007
2025-05-26 17:48:28 +00:00
3f64502c98 Revert "Re-enable FakeTensor caching for SymInts (#152662)"
This reverts commit 7d11c61c26c596076613aa0111892f7cbccae32e.

Reverted https://github.com/pytorch/pytorch/pull/152662 on behalf of https://github.com/malfet due to Looks like it broke bunch of inductor tests, see 187d38185e/1 ([comment](https://github.com/pytorch/pytorch/pull/152662#issuecomment-2910293593))
2025-05-26 17:13:22 +00:00
187d38185e [cutlass backend] Do not raise hard error when re worker has cuda compilation error (#154173)
fbcode specific

Differential Revision: D75262641

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154173
Approved by: https://github.com/bertmaher
2025-05-26 17:10:36 +00:00
f55f2f42a7 Add missing docstring for sym_ite (#154201)
`sym_ite` is listed in [the reference page](https://docs.pytorch.org/docs/stable/torch.html) but had no docstring.
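A tiny usage sketch to go with the new docstring; the plain-bool fallback shown here is my reading of the helper, and with a SymBool the selection stays symbolic instead of guarding:
```python
import torch

# Picks the second argument when the condition holds, the third otherwise;
# both branches must share the same type.
print(torch.sym_ite(True, 1, 2))   # 1
print(torch.sym_ite(False, 1, 2))  # 2
```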

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154201
Approved by: https://github.com/Skylion007
2025-05-26 15:59:21 +00:00
02445ec8f0 Almalinux image, install glibc-langpack-en (#154364)
After update to: https://hub.docker.com/layers/amd64/almalinux/8/images/sha256-4f63eb966695df3c993deeacec7c73d87728e2ea66d3b48fed4b40cb547fa7c2

Started seeing warning: bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
and random Segfaults when using python like:
https://github.com/pytorch/test-infra/actions/runs/15216565225/job/42901732536
```
+++ python -c 'import torch'
./check_binary.sh: line 258:  2276 Segmentation fault      (core dumped) python -c 'import torch'
```

Installing langpack does  resolve these issues: https://github.com/pytorch/test-infra/actions/runs/15256338815/job/42904808826#step:15:2311

Almalinux Docker build without setlocale warning:
https://github.com/pytorch/pytorch/actions/runs/15030284546/job/42240978131

Almalinux Docker build with setlocale warning:
https://github.com/pytorch/pytorch/actions/runs/15246391200/job/42873875745#step:3:7180
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154364
Approved by: https://github.com/Skylion007, https://github.com/jeanschmidt
2025-05-26 15:56:42 +00:00
4b0ee3f4f2 [BE] Do not templetize unnnecessarily (#154305)
`${{ os.runner }}` would always evaluate to macOS for those files,
and the architecture is always ARM64.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154305
Approved by: https://github.com/atalman
2025-05-26 15:00:48 +00:00
7ab4fae62a Fix s390x vectorization compilation in inductor (#153946)
Fix s390x vectorization compilation in inductor.

One of the failing tests is
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpu::test_add_complex_cpu,
but it is still disabled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153946
Approved by: https://github.com/malfet, https://github.com/jgong5
2025-05-26 12:54:25 +00:00
1bebe0424e Fix platform detection in MKLDNN CMake file (#142067)
When building PyTorch with `USE_XPU=True` and Clang,
the user sees misleading errors related to incorrect platform
detection that assumes that all users that are not using the GNU
compilers are on Windows. We can fix this by simply using CMake's
builtin platform detection variables.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142067
Approved by: https://github.com/EikanWang, https://github.com/min-jean-cho, https://github.com/guangyey
2025-05-26 06:09:37 +00:00
21e42c5d62 More descriptive error message for torch.nanmean() with complex dtypes (#153252)
Fixes #153132

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153252
Approved by: https://github.com/colesbury
2025-05-26 05:42:57 +00:00
b6868f290e [executorch hash update] update the pinned executorch hash (#153436)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153436
Approved by: https://github.com/pytorchbot
2025-05-26 04:43:10 +00:00
7d11c61c26 Re-enable FakeTensor caching for SymInts (#152662)
Summary:

This backs out D60320595 which itself turned off FakeTensor caching when a SymInt was present.

There has been a lot of dynamic shape fixes done this year and tests pass so I'm assuming some of that work fixed what was breaking previously.

Test Plan: Reran the tests listed in T196779132 and they pass.

## Perf
### Instruction Counter Benchmark:
- 26% win on add_loop_eager_dynamic
- 13% win on add_loop_inductor_dynamic_gpu
### Perf Dashboard
Compilation Latency wins across the board but especially strong on the dynamic tests (like cudagraphs_dynamic) - for example MobileBertForMaskedLM went from 66s -> 50s.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152662
Approved by: https://github.com/anijain2305
2025-05-26 04:17:56 +00:00
062387fb53 [SymmMem] Speed up tests (#153677)
Use `MultiProcContinousTest` to avoid re-create ProcessGroup in each test instance.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153677
Approved by: https://github.com/fegin, https://github.com/Skylion007, https://github.com/ngimel
ghstack dependencies: #153653
2025-05-26 03:39:11 +00:00
8c16d0e404 [c10d] Add support for testing SIGABRT return (#153167)
`SIGABRT` is a common return by *negative* distributed tests, which checks for effectiveness of NaN assert, watchdog throw, etc.

These errors are not detectable by traditional statements like `with self.assertRaises(RuntimeError)`.

Instead, we'd need to check for the process's return code, e.g. `SIGABRT(6)` would have a return code of -6.
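A minimal, self-contained sketch of the return-code check (not the test-framework code itself):
```python
import signal
import subprocess
import sys

# Run a child process that aborts; on POSIX, a return code of -N means
# "terminated by signal N", so SIGABRT shows up as -6.
proc = subprocess.run([sys.executable, "-c", "import os; os.abort()"])
assert proc.returncode == -signal.SIGABRT
```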

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153167
Approved by: https://github.com/fduwjj
2025-05-26 00:56:05 +00:00
b04852e404 Fix deterministic indexing with broadcast (#154296)
Fixes #79987, now for real.
Also removed thrust sort path that was needed for cuda <=11.2 because we no longer support it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154296
Approved by: https://github.com/soumith
2025-05-25 21:14:50 +00:00
c3100067ae [ONNX] Update onnx to 1.18 (#153746)
Update onnx python package to 1.18.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153746
Approved by: https://github.com/titaiwangms, https://github.com/cyyever, https://github.com/malfet
2025-05-25 20:58:47 +00:00
43b2716e89 PYFMT lint grandfathered files 1 (#154261)
lint:
-  test/test_fake_tensor.py
-  test/test_flop_counter.py
- torch/_export/verifier.py

with the same rules as other files. It was a nightmare for me to update tests in one of the skipped files
without being able to lint them locally like other files with lintrunner -a.
Note that those files see active development; they are not old, untouched files.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154261
Approved by: https://github.com/angelayi, https://github.com/Skylion007
2025-05-25 17:36:14 +00:00
5677ab9aab [BE] Correctly pass exceptions raised from rpc_init to CPython (#154325)
By decorating function body with `HANDLE_TH_ERRORS`

Partially addresses https://github.com/pytorch/pytorch/issues/154300

I.e. after that change, importing torch no longer crashes but returns a readable (and actionable exception)
```
>>> import torch
Traceback (most recent call last):
  File "<python-input-0>", line 1, in <module>
    import torch
  File "/Users/malfet/git/pytorch/pytorch/torch/__init__.py", line 2134, in <module>
    from torch import _VF as _VF, functional as functional  # usort: skip
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/malfet/git/pytorch/pytorch/torch/functional.py", line 8, in <module>
    import torch.nn.functional as F
  File "/Users/malfet/git/pytorch/pytorch/torch/nn/__init__.py", line 8, in <module>
    from torch.nn.modules import *  # usort: skip # noqa: F403
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/malfet/git/pytorch/pytorch/torch/nn/modules/__init__.py", line 2, in <module>
    from .linear import Bilinear, Identity, LazyLinear, Linear  # usort: skip
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/malfet/git/pytorch/pytorch/torch/nn/modules/linear.py", line 7, in <module>
    from torch.nn import functional as F, init
  File "/Users/malfet/git/pytorch/pytorch/torch/nn/functional.py", line 11, in <module>
    from torch._jit_internal import (
    ...<5 lines>...
    )
  File "/Users/malfet/git/pytorch/pytorch/torch/_jit_internal.py", line 42, in <module>
    import torch.distributed.rpc
  File "/Users/malfet/git/pytorch/pytorch/torch/distributed/rpc/__init__.py", line 37, in <module>
    from torch._C._distributed_rpc import (  # noqa: F401
    ...<33 lines>...
    )
ImportError: cannot import name '_DEFAULT_NUM_WORKER_THREADS' from 'torch._C._distributed_rpc' (unknown location)

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154325
Approved by: https://github.com/Skylion007
2025-05-25 17:01:45 +00:00
31ae07b5e7 [CI] Do not install libuv on MacOS (#154307)
It's a tensorpipe submodule and is built from source.
Same for `dataclasses`, as it's needed only for python-3.6.
And get rid of `nvidia-ml-py`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154307
Approved by: https://github.com/cyyever, https://github.com/Skylion007
ghstack dependencies: #154304
2025-05-25 15:30:38 +00:00
6968386385 [BE] Sort requirements files alphabetically (#154304)
Using `sort` tool
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154304
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2025-05-25 15:30:38 +00:00
ed27ee8355 Bump setuptools from 70.0.0 to 78.1.1 in /tools/build/bazel (#154075)
Bumps [setuptools](https://github.com/pypa/setuptools) from 70.0.0 to 78.1.1.
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a href="https://github.com/pypa/setuptools/blob/main/NEWS.rst">setuptools's changelog</a>.</em></p>
<blockquote>
<h1>v78.1.1</h1>
<h2>Bugfixes</h2>
<ul>
<li>More fully sanitized the filename in PackageIndex._download. (<a href="https://redirect.github.com/pypa/setuptools/issues/4946">#4946</a>)</li>
</ul>
<h1>v78.1.0</h1>
<h2>Features</h2>
<ul>
<li>Restore access to _get_vc_env with a warning. (<a href="https://redirect.github.com/pypa/setuptools/issues/4874">#4874</a>)</li>
</ul>
<h1>v78.0.2</h1>
<h2>Bugfixes</h2>
<ul>
<li>Postponed removals of deprecated dash-separated and uppercase fields in <code>setup.cfg</code>.
All packages with deprecated configurations are advised to move before 2026. (<a href="https://redirect.github.com/pypa/setuptools/issues/4911">#4911</a>)</li>
</ul>
<h1>v78.0.1</h1>
<h2>Misc</h2>
<ul>
<li><a href="https://redirect.github.com/pypa/setuptools/issues/4909">#4909</a></li>
</ul>
<h1>v78.0.0</h1>
<h2>Bugfixes</h2>
<ul>
<li>Reverted distutils changes that broke the monkey patching of command classes. (<a href="https://redirect.github.com/pypa/setuptools/issues/4902">#4902</a>)</li>
</ul>
<h2>Deprecations and Removals</h2>
<ul>
<li>Setuptools no longer accepts options containing uppercase or dash characters in <code>setup.cfg</code>.</li>
</ul>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a href="8e4868a036"><code>8e4868a</code></a> Bump version: 78.1.0 → 78.1.1</li>
<li><a href="100e9a61ad"><code>100e9a6</code></a> Merge pull request <a href="https://redirect.github.com/pypa/setuptools/issues/4951">#4951</a></li>
<li><a href="8faf1d7e0c"><code>8faf1d7</code></a> Add news fragment.</li>
<li><a href="2ca4a9fe47"><code>2ca4a9f</code></a> Rely on re.sub to perform the decision in one expression.</li>
<li><a href="e409e80029"><code>e409e80</code></a> Extract _sanitize method for sanitizing the filename.</li>
<li><a href="250a6d1797"><code>250a6d1</code></a> Add a check to ensure the name resolves relative to the tmpdir.</li>
<li><a href="d8390feaa9"><code>d8390fe</code></a> Extract _resolve_download_filename with test.</li>
<li><a href="4e1e89392d"><code>4e1e893</code></a> Merge <a href="https://github.com/jaraco/skeleton">https://github.com/jaraco/skeleton</a></li>
<li><a href="3a3144f0d2"><code>3a3144f</code></a> Fix typo: <code>pyproject.license</code> -&gt; <code>project.license</code> (<a href="https://redirect.github.com/pypa/setuptools/issues/4931">#4931</a>)</li>
<li><a href="d751068fd2"><code>d751068</code></a> Fix typo: pyproject.license -&gt; project.license</li>
<li>Additional commits viewable in <a href="https://github.com/pypa/setuptools/compare/v70.0.0...v78.1.1">compare view</a></li>
</ul>
</details>
<br />

[![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=setuptools&package-manager=pip&previous-version=70.0.0&new-version=78.1.1)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`.

[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)

---

<details>
<summary>Dependabot commands and options</summary>
<br />

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
You can disable automated security fix PRs for this repo from the [Security Alerts page](https://github.com/pytorch/pytorch/network/alerts).

</details>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154075
Approved by: https://github.com/Skylion007

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-05-25 15:13:03 +00:00
c113cf5a8f [BE] Remove unused conda-env-Linux-X64 (#154303)
According to https://github.com/search?type=code&q=conda-env-++repo%3Apytorch%2Fpytorch it's not referenced anywhere and has been replaced with `conda-env-ci` a while ago
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154303
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2025-05-25 14:24:28 +00:00
d8aed0703e [BE][Ez]: Enable ruff rule PLW1507. os.environ is not copied (#154120)
Enables a RUFF rule that checks against copying os.environ, since it's actually a proxy object, not a dict, so a shallow copy will be a no-op, which is rarely the desired behavior.
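A small illustration of why the rule exists (plain standard-library behaviour, not PyTorch code):
```python
import copy
import os

shallow = copy.copy(os.environ)          # flagged by PLW1507
shallow["PLW1507_DEMO"] = "1"
print(os.environ.get("PLW1507_DEMO"))    # "1": the real environment changed

snapshot = os.environ.copy()             # independent dict snapshot
snapshot["PLW1507_DEMO2"] = "1"
print(os.environ.get("PLW1507_DEMO2"))   # None
```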
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154120
Approved by: https://github.com/malfet
2025-05-25 14:22:57 +00:00
54932d865e Revert "[c10d] Add support for testing SIGABRT return (#153167)"
This reverts commit 03e102dbe8cbffc2e42a3122b262d02f03571de7.

Reverted https://github.com/pytorch/pytorch/pull/153167 on behalf of https://github.com/malfet due to It broke lint ([comment](https://github.com/pytorch/pytorch/pull/153167#issuecomment-2907820789))
2025-05-25 13:17:27 +00:00
c4ef4090c5 Fix segfault on exit in CachingHostAllocator by signaling background thread to exit (#154117)
Fixes #152008

This PR fixes a segmentation fault that occurred when exiting the program due to improper background thread management in CachingHostAllocator.

Previously, the background thread continued running and called process_events() even after the allocator object was destroyed, leading to a crash on exit.

f12d8d60b1/aten/src/ATen/core/CachingHostAllocator.h (L218)

```cpp
// Launch the background thread and process events in a loop.
static bool background_thread_flag [[maybe_unused]] = [this] {
  getBackgroundThreadPool()->run([&]() {
    while (true) {
      process_events();  // <-- This line may cause segfault on exit
      std::this_thread::sleep_for(std::chrono::microseconds(100));
    }
  });
  return true;
}();
```

The fix adds a mechanism to signal the background thread to exit before the object is destructed, ensuring the thread stops safely.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154117
Approved by: https://github.com/ngimel, https://github.com/cyyever
2025-05-25 07:46:12 +00:00
9d922b55ef [Distributed][CI] Rework continuous TestCase (#153653)
1. Reworked `MultiProcContinousTest` to spawn processes during `setUpClass` instead of `main` (so that we can support multiple TestClass'es in one file).

2. The child processes now run an infinite loop, monitoring test IDs passed from the main process via a task queue. Reciprocally, the child processes inform the main process of the completion of a test via a completion queue (see the sketch after this list).

3. Added a test template.
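A minimal, self-contained sketch of the task-queue / completion-queue loop from item 2 (illustrative only, not the actual test harness):
```python
import multiprocessing as mp

def worker(rank, task_q, done_q):
    while True:
        test_id = task_q.get()
        if test_id is None:              # sentinel: shut the worker down
            break
        # ... run the test identified by test_id on this rank ...
        done_q.put((rank, test_id))      # report completion to the main process

if __name__ == "__main__":
    world_size = 2
    done_q = mp.Queue()
    task_qs = [mp.Queue() for _ in range(world_size)]
    procs = [mp.Process(target=worker, args=(r, task_qs[r], done_q))
             for r in range(world_size)]
    for p in procs:
        p.start()
    for q in task_qs:                    # dispatch one test id to every rank
        q.put("test_example")
    for _ in range(world_size):          # wait until every rank reports back
        done_q.get()
    for q in task_qs:                    # tell the workers to exit
        q.put(None)
    for p in procs:
        p.join()
```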

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153653
Approved by: https://github.com/d4l3k, https://github.com/fegin, https://github.com/fduwjj
2025-05-25 03:49:29 +00:00
03e102dbe8 [c10d] Add support for testing SIGABRT return (#153167)
`SIGABRT` is a common return by *negative* distributed tests, which checks for effectiveness of NaN assert, watchdog throw, etc.

These errors are not detectable by traditional statements like `with self.assertRaises(RuntimeError)`.

Instead, we'd need to check for the process's return code, e.g. `SIGABRT(6)` would have a return code of -6.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153167
Approved by: https://github.com/fduwjj
2025-05-25 03:48:34 +00:00
10c51b11ff Bump protobuf version and refactor tensorboard tests (#154244)
In preparation for https://github.com/pytorch/pytorch/pull/153746, I am bumping protobuf to 5.29.4 and fixing the tensorboard tests first.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154244
Approved by: https://github.com/malfet, https://github.com/cyyever
2025-05-25 00:50:07 +00:00
53ecb8159a Introduce statically_known_false (#154291)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154291
Approved by: https://github.com/mengluy0125
2025-05-24 14:23:55 +00:00
2dfc0e3327 [Inductor UT] Reuse test_fused_attention.py for Intel GPU. (#154110)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154110
Approved by: https://github.com/eellison, https://github.com/jansel, https://github.com/EikanWang
2025-05-24 09:51:33 +00:00
8fe7ec6721 Add /Zc:preprocessor for torch libraries in MSVC builds (#147825)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147825
Approved by: https://github.com/janeyx99
2025-05-24 06:57:46 +00:00
6503b4a96e Update to using mypy 1.15 (#154054)
The BC break isn't real - mypy decided to start complaining about the way we were typing that function.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154054
Approved by: https://github.com/Skylion007
2025-05-24 04:30:57 +00:00
76ed9db468 [cuBLAS][cuBLASLt] Use cuBLAS default workspace size in Lt (#153556)
Also enables unified workspaces by default for non-FBCODE use cases.
The default Lt workspace size is also updated to match the cuBLAS default logic, including for Blackwell (SM 10.0) and GeForce Blackwell (SM 12.0).

Recommended defaults are documented here:
https://docs.nvidia.com/cuda/cublas/#cublassetworkspace

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153556
Approved by: https://github.com/Skylion007, https://github.com/ngimel
2025-05-24 03:43:35 +00:00
1ab2993345 Add a link to transformer_building_blocks tutorial (#154281)
Cross-link to https://docs.pytorch.org/tutorials/intermediate/transformer_building_blocks.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154281
Approved by: https://github.com/mikaylagawarecki
2025-05-24 02:50:24 +00:00
e904d01c16 Make inductor UT to be generic (#154196)
# Motivation
https://github.com/pytorch/pytorch/pull/151773 introduces UT `test_triton_template_generated_code_caching` failed on XPU;
https://github.com/pytorch/pytorch/pull/153895 introduces UT `test_mutation_rename` failed on XPU;

fix https://github.com/pytorch/pytorch/issues/154218

# Additional Context
With this PR, both failed UTs pass on a local machine.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154196
Approved by: https://github.com/jansel
2025-05-24 02:47:46 +00:00
a19f2cdf29 [draft export] skip when no LOC found (#154190)
Couldn't repro error, but verified fix with @ColinPeppler

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154190
Approved by: https://github.com/ColinPeppler
2025-05-24 02:29:34 +00:00
975bbc63db [MPS][BE] Move fmod/remainder to Metal ops (#154280)
This accomplishes the following:
 - Fixes a correctness problem with large integer types (it probably makes the op slower, but this could not be avoided if one wants to compute an accurate answer)
 - Makes the op faster for floating-point types (a Metal kernel invocation is faster than creating an MPSGraph)
 - Eliminates the need for several correctness workarounds

Fixes https://github.com/pytorch/pytorch/issues/154171
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154280
Approved by: https://github.com/dcci
ghstack dependencies: #154275, #154290
2025-05-24 01:45:33 +00:00
8f08bdb7f2 [MPS][BE] Code dedup (#154290)
Eliminate some copy-pasta by introducing `REGISTER_FLOAT_BINARY_OP` and `REGISTER_INTEGER_BINARY_OP` macros
Use `_METAL_310_PLUS` to guard bfloat dtype use
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154290
Approved by: https://github.com/yangw-dev, https://github.com/wdvr
ghstack dependencies: #154275
2025-05-24 01:41:31 +00:00
e5f63f4f66 [CI] Move Mac testing to 3.12 (#154177)
Prep step to completely move away from Conda during the builds.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154177
Approved by: https://github.com/huydhn, https://github.com/cyyever, https://github.com/atalman
ghstack dependencies: #154237, #154268, #154271, #154269, #154270
2025-05-24 01:41:20 +00:00
11a490f32f [CI] Reuse old whl on more workflows (#154285)
Still only on main branch, not PRs, so that we can monitor
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154285
Approved by: https://github.com/malfet
2025-05-24 01:25:35 +00:00
308beeeb56 [dynamo] Use UUID for compiled function variable names. (#154148)
Summary:
We previously assign each compiled function variable a name based on in-process global counter. This works fine within the same process but when we're trying to serialize the states with precompile, we need a way to load back these compiled functions without causing collision to the existing global scope.

Changing the counter to a true global uuid seems to resolve this issue.

For example, the new variable name will look like:
```
__compiled_fn_0_7ce7d872_4fe8_4174_b8fd_2496b09b8b43
```

Test Plan: CI

Differential Revision: D75244901

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154148
Approved by: https://github.com/jansel
2025-05-24 01:08:42 +00:00
7ba6fb69e6 [Inductor][CPP] Enable vectorized fp8 E5M2 quant dequant (#153365)
**Summary**
This PR enables the vectorization codegen with Inductor CPP backend for `FP8_E5M2` `quant` from `float32` and `dequant` to `float32`.

**Test Plan**
```
python test/inductor/test_cpu_repro.py -k test_dequant_quant_lowering_fp8_e5m2
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153365
Approved by: https://github.com/jansel, https://github.com/jgong5
ghstack dependencies: #152417, #152418, #153364
2025-05-23 23:20:02 +00:00
84b657d0b5 Add Vectorized FP8 E5M2 (#153364)
**Summary**
This PR mainly adding the `Vectorized<Float8_e5m2>` class to support the vectorization of `FP8 E5M2` with methods:

- Convert to/from `Vectorized<float>`
- Common vectorized methods like: `mul`, `abs`, `eq` and etc.

**Test Plan**
```
./build/bin/vec_test_all_types_AVX512 --gtest_filter=FP8E5M2Test.*
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153364
Approved by: https://github.com/jgong5, https://github.com/CaoE, https://github.com/vkuzo
ghstack dependencies: #152417, #152418
2025-05-23 23:11:25 +00:00
b77a6504fa [Inductor][CPP] Enable vectorized fp8 quant dequant (#152418)
**Summary**
This PR enables the vectorization codegen with Inductor CPP backend for `FP8_E4M3` `quant` from `float32` and `dequant` to `float32`.

**Test Plan**
```
python test/inductor/test_cpu_repro.py -k test_dequant_quant_lowering_fp8_e4m3
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152418
Approved by: https://github.com/jansel, https://github.com/jgong5, https://github.com/CaoE
ghstack dependencies: #152417
2025-05-23 23:05:17 +00:00
080b74ce67 Add Vectorized FP8 E4M3 (#152417)
**Summary**
This PR mainly adding the `Vectorized<Float8_e4m3fn>` class to support the vectorization of `FP8 E4M3` with methods:

- Convert to/from `Vectorized<float>`
- Common vectorized methods like: `mul`, `abs`, `eq` and etc.

**Test Plan**
```
./build/bin/vec_test_all_types_AVX512 --gtest_filter=FP8E4M3Test.*
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152417
Approved by: https://github.com/mingfeima, https://github.com/CaoE, https://github.com/yanbing-j, https://github.com/jgong5, https://github.com/vkuzo
2025-05-23 22:56:56 +00:00
bab59d3c28 Upgrade to CUDA 12.8.1 for nightly binaries (#152923)
Upgrade current CUDA 12.8 builds to 12.8.1

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152923
Approved by: https://github.com/atalman
2025-05-23 22:37:05 +00:00
f0b2706914 remove sleef_arm target (#154166)
Summary:
X-link: https://github.com/pytorch/executorch/pull/11082

We shouldn't need an ARM-specific variant; we have select() where we should need it.

Test Plan: CI

Reviewed By: nlutsenko

Differential Revision: D74356413

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154166
Approved by: https://github.com/kimishpatel, https://github.com/malfet, https://github.com/Skylion007
2025-05-23 22:16:01 +00:00
86a160353e [BE] Don't run windows builds in pull.yml (#154264)
We already run windows builds and tests [during trunk.yml](c13eeaa718/.github/workflows/trunk.yml (L115-L130)).

Spot checking failures of this job in pull.yml shows that most of the time this job fails, the failure correlates with other build jobs failing as well, so it's not offering much unique signal.

Given that we'll run this job before merging the PR as part of trunk.yml anyway, the trade-off of getting a Windows build signal a little earlier doesn't seem worth the infra investment.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154264
Approved by: https://github.com/malfet
2025-05-23 22:03:19 +00:00
65f0cf3df5 [mergebot] Do not block on autoformat workflow (#154236)
Helps with https://github.com/pytorch/pytorch/issues/154084

Merge sometimes fails due to autoformat failing.  I believe it's because the author doesn't have write perms/workflow-running perms -> needs approval for workflows.  On merge, the bot adds the merge label -> triggers the autoformat workflow -> needs approval (even though it will end up getting skipped because the label doesn't match) -> merge sees this and fails

So I put an ugly exception for the workflow in mergebot

Some restrictions to keep in mind:
* Need to checkout the PRs code changes to run lint/format on them -> possible security issue if someone modifies a linter/formatter
* The (third party) reusable action used in the autoformat workflow requires the trigger to be pull_request

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154236
Approved by: https://github.com/malfet
2025-05-23 22:00:34 +00:00
bb17f9c98b [AOTAutogradCache] Fix CHROMIUM_EVENT_LOG being none (#154258)
It turns out that if you import a name that is None at import time in Python, and the value is later updated, the name you imported stays None:

```
import torch
from torch._dynamo.utils import CHROMIUM_EVENT_LOG
class Foo:
  pass
torch._dynamo.utils.CHROMIUM_EVENT_LOG =  Foo()

print(CHROMIUM_EVENT_LOG) # None
```

This fixes the bug so we get AOTAutogradCache instant events again.
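One common fix pattern for this class of bug, shown as an assumed sketch (not necessarily the exact diff): look the name up on the module at use time instead of binding it once at import time.
```python
import torch._dynamo.utils as dynamo_utils

def log_instant_event(name: str) -> None:
    chromium_log = dynamo_utils.CHROMIUM_EVENT_LOG  # re-read the current value
    if chromium_log is not None:
        ...  # record the instant event on the (now non-None) logger
```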

Differential Revision: [D75305770](https://our.internmc.facebook.com/intern/diff/D75305770/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154258
Approved by: https://github.com/oulgen
2025-05-23 21:53:31 +00:00
0e4f1b8a06 [CI] Update MacOS conda requirmenets (#154270)
Pick package versions which are compatible with both 3.9 and 3.12
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154270
Approved by: https://github.com/clee2000, https://github.com/atalman
ghstack dependencies: #154237, #154268, #154271, #154269
2025-05-23 21:44:50 +00:00
5db1503846 [CI] Update MacOS numba and scipy versions (#154269)
Pick versions that supported by both 3.9 and 3.12
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154269
Approved by: https://github.com/clee2000, https://github.com/atalman
ghstack dependencies: #154237, #154268, #154271
2025-05-23 21:44:49 +00:00
aa3eab2ce6 Fix tcp init when using port 0 (#154156)
I hit this in tests when calling `init_process_group(init_method="tcp://localhost:0", ...)`. You can't use port 0 due to a bug in the conditional, and you get the error `ValueError: Error initializing torch.distributed using tcp:// rendezvous: port number missing`.
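A small illustration of the bug pattern (assumed shape of the check, not the exact rendezvous code): port 0 is falsy, so a truthiness test wrongly reports it as missing.
```python
from urllib.parse import urlparse

def port_missing_buggy(url: str) -> bool:
    return not urlparse(url).port        # 0 is falsy, so port 0 looks "missing"

def port_missing_fixed(url: str) -> bool:
    return urlparse(url).port is None    # only a truly absent port is missing

print(port_missing_buggy("tcp://localhost:0"))  # True  (wrong)
print(port_missing_fixed("tcp://localhost:0"))  # False (correct)
```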

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154156
Approved by: https://github.com/d4l3k, https://github.com/Skylion007
2025-05-23 21:41:58 +00:00
3c0b93afc5 Re-enable link linter (#153280)
And make URL linter always succeed for now.
I'll monitor the logs manually and experiment with it further.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153280
Approved by: https://github.com/albanD
2025-05-23 20:56:25 +00:00
6f34d141ab [MPS][BE] Delete complex_div (#154275)
An absolute no-op: delete `complex_div` from `UnaryKernel.metal` and use the identical one from `c10/metal/utils.h`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154275
Approved by: https://github.com/dcci
2025-05-23 20:53:50 +00:00
dec6a47996 [BE] Delete unused pip-requirements-iOS.txt (#154271)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154271
Approved by: https://github.com/clee2000
ghstack dependencies: #154237, #154268
2025-05-23 20:08:19 +00:00
acd0873d3b [CI] Fix TestDynamoTimed.test_ir_count for 3.12 (#154268)
Python-3.12 emits the same bytecode as 3.13 for code in question
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154268
Approved by: https://github.com/clee2000, https://github.com/atalman
ghstack dependencies: #154237
2025-05-23 20:08:19 +00:00
28af44285b Revert "[c10d] Add support for testing SIGABRT return (#153167)"
This reverts commit 499a76b844bbcbc5465cb76c617b3076c1b0fd65.

Reverted https://github.com/pytorch/pytorch/pull/153167 on behalf of https://github.com/malfet due to Broke lint, see fe784c5a2c/1 ([comment](https://github.com/pytorch/pytorch/pull/153167#issuecomment-2905623868))
2025-05-23 19:44:08 +00:00
fe784c5a2c Fix torchbind path in AOTI package loader (#154265)
Summary: as title, fix the path in package loader and fix the test to take the additional dir into consideration.

Test Plan:
```
buck run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:torchbind
```

Reviewed By: angelayi

Differential Revision: D75308904

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154265
Approved by: https://github.com/clee2000, https://github.com/malfet
2025-05-23 19:32:53 +00:00
90855835ff Revert "[AOTI][cutlass backend] Do not remove the cutlass kernel .o file after packaging (#154155)"
This reverts commit 269fa8028f68b29176e21886108634f48b1eced7.

Reverted https://github.com/pytorch/pytorch/pull/154155 on behalf of https://github.com/henrylhtsang due to mistake in PR ([comment](https://github.com/pytorch/pytorch/pull/154155#issuecomment-2905514934))
2025-05-23 19:08:40 +00:00
3b21d79225 [export] Move PT2ArchiveWriter/Reader to torch/export (#153795)
Summary:
Before:
`from sigmoid.core.package.pt2_archive import PT2ArchiveWriter, PT2ArchiveReader, is_sigmoid_package`
After:
`from torch.export.pt2_archive import PT2ArchiveWriter, PT2ArchiveReader, is_pt2_package`

By merging the two PT2ArchiveReader/Writers into the native PytorchFileReader/Writer, the open-source PT2 archive also changed to have an additional folder. However, this PR still maintains support for loading an old PT2 archive which does not have the additional folder.

Before:
```
├── archive_format
├── byteorder
├── .data
│   ├── serialization_id
│   └── version
├── data
│   ├── aotinductor

```
After:
```
├── tmp
│   ├── archive_format
│   ├── byteorder
│   ├── .data
│   │   ├── serialization_id
│   │   └── version
│   ├── data
│   │   ├── aotinductor
```

Test Plan:
`buck2 test //sigmoid/...`
https://www.internalfb.com/intern/testinfra/testrun/5348024839248187

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153795
Approved by: https://github.com/zhxchen17
2025-05-23 19:04:36 +00:00
499a76b844 [c10d] Add support for testing SIGABRT return (#153167)
`SIGABRT` is a common return by *negative* distributed tests, which checks for effectiveness of NaN assert, watchdog throw, etc.

These errors are not detectable by traditional statements like `with self.assertRaises(RuntimeError)`.

Instead, we'd need to check for the process's return code, e.g. `SIGABRT(6)` would have a return code of -6.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153167
Approved by: https://github.com/fduwjj
2025-05-23 19:04:28 +00:00
561a11aa68 Revert "Patch the _is_conv_node function (#153749)"
This reverts commit c985cec5b2545d46af682d486b18866eee5dffd5.

Reverted https://github.com/pytorch/pytorch/pull/153749 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/153749#issuecomment-2905504697))
2025-05-23 19:04:20 +00:00
4ff19ecf66 Revert "[export] Move PT2ArchiveWriter/Reader to torch/export (#153795)"
This reverts commit 7e80f23516a86e18ae5bc5579d3005c1e7610102.

Reverted https://github.com/pytorch/pytorch/pull/153795 on behalf of https://github.com/malfet due to Looks like it broke lots of tests, see ec368a1903/1 ([comment](https://github.com/pytorch/pytorch/pull/153795#issuecomment-2905415496))
2025-05-23 18:29:08 +00:00
ec368a1903 Add sitemap (#154158)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154158
Approved by: https://github.com/albanD
2025-05-23 18:01:00 +00:00
0d62fd5c3c [MTIA Aten Backend][2/n] Migrate clamp ops(clamp.out/clamp_min.out/clamp_max.out) from out-of-tree to in-tree (#154015)
Summary:
# Context

See the first PR https://github.com/pytorch/pytorch/pull/153670

# This PR
1. Migrate 3 clamp ops from out-of-tree to in-tree (had to migrate the 3 ops altogether, because clamp.out calls all 3 stubs, which are also called by the other 2 ops):
- clamp.out
- clamp_min.out
- clamp_max.out
2. Also enabled structured kernel codegen for MTIA, which is needed by clamp
3. Also introduced the `--mtia` flag to torchgen to prevent OSS builds from codegen'ing MTIA code. (Otherwise we get a link error such as `lib/libtorch_cpu.so: undefined reference to at::detail::empty_mtia`.)

Differential Revision: D74674418

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154015
Approved by: https://github.com/albanD, https://github.com/nautsimon
2025-05-23 17:59:47 +00:00
bcb2125f0a [BE][CI] Update expecttest version to 0.3.0 (#154237)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154237
Approved by: https://github.com/Skylion007, https://github.com/albanD, https://github.com/atalman
2025-05-23 17:27:41 +00:00
cae25ef4e5 [c10d] Enhance Error Logging in new_subgroups() for Non-Divisible World Sizes (#154124)
Summary: The error caused by the world size not being divisible by `group_size` is a common issue encountered by end-users when utilizing applications built on top of `new_subgroups()`. However, these applications may employ different variable names, such as `num_trainers_per_group`, which can make the current error messages less effective despite being correct. To address this, we have improved the error messages to display the actual numbers involved, thereby enhancing their clarity and usefulness.
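A minimal sketch of the kind of message involved (assumed wording, not the exact PR text):
```python
def check_group_size(world_size: int, group_size: int) -> None:
    if world_size % group_size != 0:
        # Surface the actual numbers so users of wrappers that rename the
        # arguments (e.g. num_trainers_per_group) can still see what is wrong.
        raise ValueError(
            f"world_size ({world_size}) must be divisible by "
            f"group_size ({group_size})"
        )

check_group_size(8, 4)   # ok
check_group_size(8, 3)   # raises with both numbers in the message
```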

Test Plan: contbuild & OSS CI

Differential Revision: D75226925

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154124
Approved by: https://github.com/wz337
2025-05-23 17:12:43 +00:00
e927ba6dbd [inductor][cutlass backend] Add 2 stage autotuning aka prescreening (#153335)
Motivation:
By default, we are tuning the cutlass backend kernels on 3 swizzles. These are runtime params, so they share the same underlying kernel, which saves a lot of compilation time. However, autotuning all combinations of {configs} x {swizzles} is still expensive.

Observations:
Winner of the {configs} x {swizzles} autotuning is the same as if we do a greedy search: first find the top X winners of {configs} with swizzle 2 (hardcoded), then autotune on the {top X winner configs} x {swizzles}. In other words, we can use a Greedy algorithm to reduce autotuning time.
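A minimal sketch of the two-stage (greedy) search described above; `benchmark` and `top_x` are placeholders, not the real inductor code:
```python
def prescreen_autotune(configs, swizzles, benchmark, top_x=10):
    # Stage 1: rank every config with a fixed swizzle (2) and keep the top X.
    survivors = sorted(configs, key=lambda c: benchmark(c, swizzle=2))[:top_x]
    # Stage 2: exhaustively tune only the survivors across all swizzles.
    candidates = [(c, s) for c in survivors for s in swizzles]
    return min(candidates, key=lambda cs: benchmark(cs[0], swizzle=cs[1]))
```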

I attach the logs below. This somewhat depends on what X is, but a number like 5-10 works pretty well from empirical observations.

Logs:
Baseline:
https://gist.github.com/henrylhtsang/9a604f150a270dc19524f72a5d4dfac2
```
AUTOTUNE mm(2048x2048, 2048x2048)
strides: [2048, 1], [1, 2048]
dtypes: torch.bfloat16, torch.bfloat16
  cuda_cutlass_gemm_1776 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1777 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1778 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1800 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1801 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1802 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_9012 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_9013 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_9014 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8940 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8941 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8942 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8934 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8935 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8936 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_2001 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_2002 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_2003 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1848 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1849 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1850 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8964 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8965 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8966 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8958 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8959 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8960 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1929 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1930 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1931 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1770 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1771 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1772 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1953 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1954 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1955 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1995 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1996 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1997 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1794 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1795 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1796 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1842 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1843 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1844 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_9006 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_9007 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_9008 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1923 0.0306 ms 95.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
```

with prescreening:
```
AUTOTUNE mm(147456x6144, 6144x2048)
strides: [6144, 1], [2048, 1]
dtypes: torch.bfloat16, torch.bfloat16
  cutlass_1a5e81af 4.5469 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_aa6f899c 4.6328 ms 98.1% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_aa6f899c 4.6836 ms 97.1% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_161b8b81 4.7224 ms 96.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_161b8b81 4.7234 ms 96.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_161b8b81 4.7274 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_853b6347 4.7369 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_aa6f899c 4.7404 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_161b8b81 4.7711 ms 95.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_8bc6fbda 4.8148 ms 94.4% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_8bc6fbda 4.8159 ms 94.4% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_8bc6fbda 4.8214 ms 94.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_8bc6fbda 4.8302 ms 94.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_0a1c55af 4.8487 ms 93.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_0a1c55af 4.8527 ms 93.7% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_02780d72 4.8617 ms 93.5% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_0a1c55af 4.8737 ms 93.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_0a1c55af 4.8738 ms 93.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_02780d72 4.9348 ms 92.1% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_02780d72 4.9763 ms 91.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_853b6347 4.9805 ms 91.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_1a5e81af 5.0225 ms 90.5% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_853b6347 5.0271 ms 90.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_02780d72 5.0595 ms 89.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_853b6347 5.1434 ms 88.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_c1ffa14b 5.1574 ms 88.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_1a5e81af 5.1916 ms 87.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_c1ffa14b 5.2018 ms 87.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_c1ffa14b 5.2019 ms 87.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_c1ffa14b 5.2037 ms 87.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_1a5e81af 5.5329 ms 82.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_aa6f899c 11.5046 ms 39.5% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
SingleProcess AUTOTUNE benchmarking takes 1.9526 seconds and 0.0352 seconds precompiling for 32 choices
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153335
Approved by: https://github.com/eellison
2025-05-23 17:12:25 +00:00
04a6fe7914 Update provenance tracking doc (#154062)
Summary: Update the doc to reflect the changes in https://github.com/pytorch/pytorch/pull/153584/files#diff-e0cdb58c0f84f56f20c5433339b6d83c470dcde47847e2328effea6bedd4cd27 and https://github.com/pytorch/tlparse/pull/110

Test Plan: CI

Differential Revision: D75155981

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154062
Approved by: https://github.com/svekars, https://github.com/desertfire
2025-05-23 17:09:52 +00:00
7d8ea5db69 Disable cache and utilization stats uploading steps on s390x (#150297)
There are no AWS credentials available on s390x runners. These steps are failing anyway due to that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150297
Approved by: https://github.com/seemethere
2025-05-23 16:49:38 +00:00
7e80f23516 [export] Move PT2ArchiveWriter/Reader to torch/export (#153795)
Summary:
Before:
`from sigmoid.core.package.pt2_archive import PT2ArchiveWriter, PT2ArchiveReader, is_sigmoid_package`
After:
`from torch.export.pt2_archive import PT2ArchiveWriter, PT2ArchiveReader, is_pt2_package`

By merging the two PT2ArchiveReader/Writer implementations into the native PytorchFileReader/Writer, the open-source PT2 archive format also changed to include an additional folder. However, this PR still maintains support for loading an old PT2 archive that does not have the additional folder.

Before:
```
├── archive_format
├── byteorder
├── .data
│   ├── serialization_id
│   └── version
├── data
│   ├── aotinductor

```
After:
```
├── tmp
│   ├── archive_format
│   ├── byteorder
│   ├── .data
│   │   ├── serialization_id
│   │   └── version
│   ├── data
│   │   ├── aotinductor
```

Test Plan:
`buck2 test //sigmoid/...`
https://www.internalfb.com/intern/testinfra/testrun/5348024839248187

Differential Revision: D74616598

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153795
Approved by: https://github.com/zhxchen17
2025-05-23 15:40:25 +00:00
214e4cef9f Fix RMSNorm doc rendering (#154205)
By removing the `::func::` decorator, which adds unneeded parentheses

Test plan: Check https://docs-preview.pytorch.org/pytorch/pytorch/154205/generated/torch.nn.RMSNorm.html#rmsnorm
that now renders as
<img width="704" alt="image" src="https://github.com/user-attachments/assets/443f605d-75a6-41ef-8971-21e7dc8ef9f6" />

Fixes https://github.com/pytorch/pytorch/issues/154184

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154205
Approved by: https://github.com/mikaylagawarecki
2025-05-23 15:39:29 +00:00
9e089bb5b6 change guard_or impl for better perf and simplicity (#153674)
PR time benchmarks have been showing regressions as we move to guard_or_false; the reason is that the previous implementation did not cache.
This new approach propagates the fallback value to eval and returns it, allowing eval to cache and reducing log spam and complexity.
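A purely conceptual sketch of why pushing the fallback into the cached evaluation helps (illustrative only; `evaluate_or_fallback` and Python's `eval` are stand-ins for ShapeEnv's symbolic evaluation, not the PyTorch API):
```python
from functools import lru_cache

@lru_cache(maxsize=None)
def evaluate_or_fallback(expr: str, fallback: bool):
    try:
        return eval(expr, {}, {})  # stands in for "the expression could be decided"
    except NameError:              # unknown (unbacked) symbol in the expression
        return fallback            # returned from inside the call, so it is cached too

print(evaluate_or_fallback("1 + 1", False))   # 2
print(evaluate_or_fallback("u0 > 0", False))  # False (fallback), now cached
```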

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153674
Approved by: https://github.com/bobrenjc93
2025-05-23 15:24:28 +00:00
4b7abce6a4 Fix fake tensor caching when output has unbacked (#153034)
We handle fake tensor caching in two ways:
1. If the inputs have no symbols (SymInt, etc) then we cache on the FakeTensorMode.
2. If the inputs have symbols then we cache on the ShapeEnv.

This way the symbols in the inputs and outputs are associated with the guards in place at the time of the call.

However - it's possible to have an op where there are no symbols in the inputs but there is an unbacked symbol in the output.  In this case we shouldn't cache at all because what would that really mean?

So this PR changes the caching behavior so that if there's a symbol in the output which doesn't come in some way from the input then we refuse to cache that op.

Added a test which checks for this case.
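A minimal conceptual sketch of that rule (an illustration only, not the FakeTensorMode code; symbols are represented as plain strings):
```python
def may_cache(input_symbols: set, output_symbols: set) -> bool:
    # Refuse to cache when the output carries a symbol that cannot be traced
    # back to the inputs (e.g. a fresh unbacked symbol produced by the op).
    return output_symbols <= input_symbols

assert may_cache({"s0"}, {"s0"})       # symbolic output, but derived from the input
assert not may_cache(set(), {"u0"})    # unbacked symbol appears only in the output
```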

While in there I also did a couple other related changes:
1. Added negative caching - if we see that an (op, args) failed to cache previously we don't even bother trying to cache it again.
2. Reworked the inner behavior of _cached_dispatch_impl a little to make it more clear which bits we expect to be able to throw _BypassDispatchCache and add some comments.

The latest version of this also:
1. Addresses the problem that caused #153891.
    The issue was that with caching, ops are required to support `__eq__`.  Unfortunately _RecordFunction is minimalistic and doesn't support that - so in the off-chance that two keys hash to the same value the `__eq__` check would raise an exception.

    Apparently this was much more common on MacOS where memory patterns end up with more reuse (so the object IDs are the same and give you the same hash value for objects that use pointer hash).

    Tested locally on MacOS where running
```
python test/inductor/test_torchinductor.py GPUTests
```
was pretty much guaranteed to fail (at least for me) somewhere around test 100-200 and passed all 800 tests after this change.

Another way to test this is to run the inductor tests with `torch._subclasses.fake_tensor._DispatchCacheKey.__hash__` monkey-patched to return a constant (causing all values to hash-collide) but this can't really be checked-in since it causes the cache lookup to turn into an O(n) lookup which takes a crazy long time to run through all the tests...

2. Folds in #153780 to ensure that exceptions raised from the op don't include the context from the cache key bypass.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153034
Approved by: https://github.com/masnesral, https://github.com/tugsbayasgalan
2025-05-23 15:03:31 +00:00
866142ff16 Revert "Update the heuristic for AArch64 bmm/baddbmm (#149122)"
This reverts commit d759a517af3e6b2337bf8f8e0d1734e64e470f1b.

Reverted https://github.com/pytorch/pytorch/pull/149122 on behalf of https://github.com/jeanschmidt due to breaking internal models, @malfet may you help merge this? ([comment](https://github.com/pytorch/pytorch/pull/149122#issuecomment-2904703075))
2025-05-23 14:54:54 +00:00
5859582ee4 [BE][MPS] Delete unused complex_mul_out (#154175)
It's no longer called, after `mul` has been migrated to binary op
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154175
Approved by: https://github.com/dcci, https://github.com/Skylion007
2025-05-23 13:44:24 +00:00
2225231a14 Enable AArch64 CI scripts to be used for local dev (#143190)
- Allow user to specify custom ComputeLibrary directory, which is then built rather than checking out a clean copy
- Remove `setup.py clean` in build. The CI environment should be clean already, removing this enables incremental rebuilds
- Use all cores for building ComputeLibrary

Mostly a port of https://github.com/pytorch/builder/pull/2028 with the conda part removed, because aarch64_ci_setup.sh has changed and can now handle being called twice.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143190
Approved by: https://github.com/aditew01, https://github.com/fadara01, https://github.com/malfet

Co-authored-by: David Svantesson-Yeung <David.Svantesson-Yeung@arm.com>
2025-05-23 12:09:59 +00:00
25149cd173 [c10d] Add more tests to prevent extra context (#154174)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

Loop a bunch of sync ops and see if any of them creates extra context.
Requires nvml to check number of processes resident on a device.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154174
Approved by: https://github.com/atalman
2025-05-23 09:54:01 +00:00
ba5d45d22e Add assertion to align with cuda (#153233)
Fixes #153137

Aligned batch_norm_cpu_out assertion to [batch_norm_cuda_out](a7ea115494/aten/src/ATen/native/cuda/Normalization.cu (L436)).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153233
Approved by: https://github.com/malfet
2025-05-23 07:32:43 +00:00
5623d30228 [Minimizer] Gracefully exit when there is no discrepancy in block mode (#154076)
Summary:
Previously, when there was no discrepancy in results for block mode, net_min_base would throw an out-of-bounds (OOB) error.

This occurred because _block_traverse_impl returned an out-of-bounds index after exhausting subgraphs all the way down to a single node.

There was also an issue where we could get an unsound subgraph (i.e. mark an earlier node as the "end" even if the correct end is later). This was due to an incorrect check (start_idx == mid), where there can be two values left before the program prematurely returns.

Test Plan:
Buck UI: https://www.internalfb.com/buck2/52524c26-ace5-4593-8a4b-843a54eb206a
Test UI: https://www.internalfb.com/intern/testinfra/testrun/3096224973363310
Network: Up: 0B  Down: 15MiB  (reSessionID-cd404e97-395f-49fc-8381-373e90a1378f)
Executing actions. Remaining     0/1
Command: test.
Time elapsed: 53.7s
Tests finished: Pass 7. Fail 0. Fatal 0. Skip 0. Build failure 0

Differential Revision: D75143242

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154076
Approved by: https://github.com/jfix71
2025-05-23 06:42:07 +00:00
8342b9371e [ROCm] Prefer hipblaslt for gfx1200, gfx1201 (#153610)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153610
Approved by: https://github.com/jeffdaily, https://github.com/atalman
2025-05-23 06:01:53 +00:00
26471fc203 [aoti] Initial Metal support (#153959)
An example generated file: P1816629015

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153959
Approved by: https://github.com/malfet, https://github.com/desertfire
ghstack dependencies: #153964
2025-05-23 05:45:35 +00:00
b33b7d5c8c [aoti] Add MPS runner and shim (#153964)
Added AOTIModelContainerRunnerMps and a shim for mps fallback ops.
I also added a mps-specific shim which contains one operator, which will be used to set arguments being passed to the Metal kernel:

```
AOTI_TORCH_EXPORT AOTITorchError aoti_torch_mps_set_arg(
    AOTIMetalKernelFunctionHandle func,
    unsigned idx,
    AtenTensorHandle tensor);
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153964
Approved by: https://github.com/malfet, https://github.com/desertfire
2025-05-23 05:45:35 +00:00
269fa8028f [AOTI][cutlass backend] Do not remove the cutlass kernel .o file after packaging (#154155)
Differential Revision: [D75253009](https://our.internmc.facebook.com/intern/diff/D75253009/)

In general, we want to cache the cutlass kernels.

We also saw an error saying the .o file was not found.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154155
Approved by: https://github.com/chenyang78
2025-05-23 04:51:36 +00:00
5bb156a7fd [dynamo] raise observed exception for module attribute errors (#153659)
Fixes https://github.com/pytorch/pytorch/issues/153605

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153659
Approved by: https://github.com/StrongerXi
2025-05-23 03:56:26 +00:00
db1f33147b [audio hash update] update the pinned audio hash (#154001)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154001
Approved by: https://github.com/pytorchbot
2025-05-23 03:51:21 +00:00
c1055f41a6 Data dependent free reshape. (#153198)
#### change 1: if compute_strides fails for reshape, just clone.

Let's consider the most general case: if torch.compile is asked to reshape a tensor with shape [u0, u1] and strides [u3, u4] to [u5, u6], what should it do?
The shape is general enough to represent both contiguous and non-contiguous tensors: tensors where a clone-free reshape can happen and others where it can't. The current algorithm will fail due to data-dependent errors.

The general idea is that if it's impossible to tell whether the reshape can happen in place (because for some concrete inputs it can and for others it can't), then it's ok to take the general path and clone, instead of failing or asking the user to give hints.
**Because the user wants a single graph (a single compilation)** and this is the only way it can be done.
Had this been a view, the user would be explicitly asking for a copy-free reshape, and we would fail, asking for more information (hints in the form of torch checks).

With this change, reshape works as follows (see the sketch after the side note below):
1. if we know the input is contiguous, we convert the reshape to a view.
2. if compute_strides succeeds, we use a view. (compute_strides was changed to not fail when unbacked symbols are present; instead it just returns nullptr if it can't compute the strides, meaning we should use a clone.)
3. if neither 1 nor 2 works, clone and then use a view.

Side note: having a view does not mean that inductor will not clone; inductor has a pass that converts all views back to reshapes and its own logic for dealing with those.
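A minimal eager-level sketch of that decision order (an illustration only; the compile-time path reasons about unbacked symbols via compute_strides rather than catching exceptions):
```python
import torch

def reshape_sketch(x: torch.Tensor, shape) -> torch.Tensor:
    if x.is_contiguous():        # case 1: provably contiguous -> view
        return x.view(shape)
    try:                         # case 2: a clone-free reshape is possible
        return x.view(shape)
    except RuntimeError:         # case 3: general path -> clone, then view
        return x.clone().view(shape)

y = reshape_sketch(torch.randn(4, 6).t(), (3, 8))  # non-contiguous input hits case 3
print(y.shape)  # torch.Size([3, 8])
```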

#### change 2: skip _reshape_view_helper and fall back to simpler logic if it fails.
We trace _reshape_view_helper when doing fake tensor tracing, but not during proxy tracing; hence such tracing won't affect the graph (it only computes output shapes of several operations). We should not fail there, because it should always be possible for us to pass it in the reshape case.

i.e. when reshape_symint was called, we would have either cloned or compute_strides succeeded, so the view should pass. What I did is the following: we run _reshape_view_helper; if we fail due to unbacked symbols, we call _view_simple, which will always succeed for reshapes (it might fail for views when it's impossible to do the view, in which case we throw the DDE that was thrown by the original algorithm).

Ideally I would want to register _view_simple as the meta for view and avoid calling _reshape_view_helper completely, but I am running into some issues with the dispatcher and subclasses and I do not have time to debug it. Namely, one test would end up calling some C++ view function that does not support symints during meta dispatch when I register a Python meta decomposition:
```python test/dynamo/test_subclasses.py SubclassTests.test_subclass_views_dynamic_True ```
https://github.com/pytorch/pytorch/issues/153303. I will follow up with that change in a separate PR. cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @bdhirsh

Two other alternatives to registering _view_simple as the meta and to the try/catch approach in this PR are:
1. call _view_simple if any input is dynamic, see #153521
2. if we make is_compiling work for framework code tracing (it does not work right now), we can call _view_simple only if is_compiling.

#### Note:
Reshape can still fail when is_contiguous is called; the next PR will handle that by calling is_known_contiguous.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153198
Approved by: https://github.com/etaf, https://github.com/bobrenjc93
2025-05-23 01:45:16 +00:00
f74842d665 [DTensor] enable SimpleFSDP's composability with Tensor Parallel (#152286)
This PR adds support for SimpleFSDP's composability with Tensor Parallel + torch.compile.

`_StridedShard` is used in SimpleFSDP/FSDP2 to support correct distributed checkpointing when FSDP+TP is applied. Previously, `_StridedShard` is not guarded by torch.compile. This PR adds `_StridedShard` as an additional placement type to be guarded by torch.compile.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152286
Approved by: https://github.com/bdhirsh
2025-05-23 01:40:38 +00:00
7509b150af Don't upload compiler benchmark debug info to the benchmark database (#153769)
During our debug session, @wdvr and I found out that the benchmark database is growing much faster than we expected.  After taking a closer look, the majority of the records come from the TorchInductor benchmark, and the top 3 are all debug information not used by any dashboard at the moment.  Over a period of 7 days, there are close to 6 million records ([query](https://paste.sh/GUVCBa0v#UzszFCZaWQxh7oSVsZtfZdVE))

```
Benchmark,Metric,Count
"TorchInductor","user_stack","1926014"
"TorchInductor","reason","1926014"
"TorchInductor","model","1926014"
```

Let's skip uploading them to avoid bloating the database.
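A conceptual sketch of the skip logic (the record layout and function name are assumptions for illustration; only the three metric names come from the table above):
```python
SKIP_METRICS = {"user_stack", "reason", "model"}  # debug-only TorchInductor metrics

def drop_debug_metrics(records):
    # Keep everything except the debug-only metrics before uploading.
    return [r for r in records if r.get("metric") not in SKIP_METRICS]

print(drop_debug_metrics([{"metric": "user_stack"}, {"metric": "speedup"}]))
# [{'metric': 'speedup'}]
```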

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153769
Approved by: https://github.com/malfet
2025-05-23 01:18:26 +00:00
768cb734ec cpp_wrapper: build non-performance-sensitive code at O1 (#148773)
Builds on #148212, applying the same improvements to `cpp_wrapper` mode.

Benchmark results:

* [A100 Benchmarks](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Wed%2C%2014%20May%202025%2015%3A10%3A05%20GMT&stopTime=Wed%2C%2021%20May%202025%2015%3A10%3A05%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(a100)&lBranch=gh/benjaminglass1/77/orig&lCommit=ca7d0a3f16e3c511534d2cd03d695be8524570d3&rBranch=main&rCommit=1075bb37d34e483763a09c7810790d5491441e13)
* [x86 Benchmarks](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Wed%2C%2014%20May%202025%2015%3A10%3A05%20GMT&stopTime=Wed%2C%2021%20May%202025%2015%3A10%3A05%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cpu%20(x86)&lBranch=gh/benjaminglass1/77/orig&lCommit=ca7d0a3f16e3c511534d2cd03d695be8524570d3&rBranch=main&rCommit=1075bb37d34e483763a09c7810790d5491441e13)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148773
Approved by: https://github.com/desertfire
2025-05-23 00:51:20 +00:00
3c0cbf4b44 Update GH action to use the correct label (#154126)
Update GH action to use the correct label for the docathon

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154126
Approved by: https://github.com/AlannaBurke, https://github.com/clee2000
2025-05-23 00:29:43 +00:00
31f3ee0966 [BE][Ez]: Enable PT014 check for duplicate parameterize test cases (#154118)
Ruff rule which checks for an error [PT014](https://docs.astral.sh/ruff/rules/pytest-duplicate-parametrize-test-cases/) where a user might specify two duplicate test cases in pytest.parameterize, which is likely an error since it tests the same thing twice.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154118
Approved by: https://github.com/malfet
2025-05-23 00:00:53 +00:00
7b25ff7cf2 [Inductor] Add attention pattern for model DistilBert in transformers==4.44.2. (#154091)
This PR adds an attention fusion pattern that matches the attention of
DistilBert in transformers==4.44.2 at
953196a43d/src/transformers/models/distilbert/modeling_distilbert.py (L212)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154091
Approved by: https://github.com/jansel, https://github.com/eellison
2025-05-22 23:37:03 +00:00
59c5fff2aa Revert "[DDP] rebuilt bucket order when find_unused_parameters=true (#153404)"
This reverts commit a79e621c1c11bcef5f816b9770b751237b84f620.

Reverted https://github.com/pytorch/pytorch/pull/153404 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/153404#issuecomment-2902741300))
2025-05-22 22:26:59 +00:00
f2cce45657 [libc++ readiness][caffe2] No reason to check for "ext/stdio_filebuf.h" (#154080)
Summary: There should be no reason to check for existence of this GNU C++ header here in this file. It doesn't include it. Removing this condition to make it build under libc++.

Differential Revision: D75179136

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154080
Approved by: https://github.com/soumith
2025-05-22 22:23:39 +00:00
c985cec5b2 Patch the _is_conv_node function (#153749)
Summary: torch.ops.aten.conv2d.padding is also conv2d node

Differential Revision: D74898941

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153749
Approved by: https://github.com/andrewor14, https://github.com/Skylion007
2025-05-22 22:17:02 +00:00
413664b3c5 catch CSE recursion depth errors (#154039)
Fixes #153777

CSE is an optimization and shouldn't block a compile if it hits recursion depth limits. Unfortunately we can't write this iteratively due to a dependency on `ast.unparse`, which necessarily needs to do recursion. This PR opts out of CSE when we hit recursion depth errors.
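A minimal, self-contained sketch of the pattern (a toy recursive helper stands in for the `ast.unparse`-based CSE pass; this is not Dynamo's actual code):
```python
def count_down(n: int) -> int:
    # Toy stand-in for a deeply recursive pass.
    return 0 if n == 0 else 1 + count_down(n - 1)

def optimize_or_fallback(n: int) -> int:
    try:
        return count_down(n)     # the "optimized" result
    except RecursionError:
        return n                 # give up on the optimization and keep compiling

print(optimize_or_fallback(10))      # optimized path
print(optimize_or_fallback(10**6))   # falls back instead of failing the compile
```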

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154039
Approved by: https://github.com/Microve
2025-05-22 20:17:19 +00:00
cad0727fe1 Rename the provenance tracing artifact name for kernel <-> post_grad nodes mapping (#154046)
Summary:
Context:

Recently we've added a couple more kernel types support other than inductor generated triton kernels,

such as cpu cpp kernels, extern kernels.

The name appeared in tlparse chrome link can be confusing to users.

Rename from

`inductor_triton_kernel_to_post_grad_nodes.json`

to `inductor_generated_kernel_to_post_grad_nodes.json`

Test Plan: CI

Differential Revision: D75159042

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154046
Approved by: https://github.com/yushangdi
2025-05-22 19:20:56 +00:00
4277907d02 [binary builds] Linux aarch64 CUDA builds. Make sure tag is set correctly (#154045)
1. This should set the Manylinux 2.28 tag correctly for CUDA Aarch builds.
I believe we used to have something similar in the old script:
https://github.com/pytorch/pytorch/blob/main/.ci/aarch64_linux/build_aarch64_wheel.py#L811

``Tag: cp311-cp311-linux_aarch64 ``-> ``Tag: cp311-cp311-manylinux_2_28_aarch64``

2. Remove section for CUDA 12.6, since we no longer building CUDA 12.6 aarch64 builds

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154045
Approved by: https://github.com/Camyll, https://github.com/malfet
2025-05-22 18:36:13 +00:00
788d9cb2d7 [3/n][Optimus][Auto-AC][reland] Support any fp8 quantization type and set scaling as the default" (#154057)
Summary:
This is a reland of D74910193.
We change the dtype to torch.float8_e5m2 in unit test since it is not supported.

Test Plan:
```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:quantization
```

Differential Revision: D75169792

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154057
Approved by: https://github.com/Mingming-Ding
2025-05-22 18:26:34 +00:00
c2660d29a5 [ROCm] Added unit test to test the cuda_pluggable allocator (#154041)
Added a unit test to include the cuda_pluggable allocator and replicate the apex setup.py to build the nccl_allocator extension.

This test checks whether this commit https://github.com/pytorch/pytorch/pull/152179 helps to build the cuda pluggable allocator in ROCm/Apex

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154041
Approved by: https://github.com/atalman, https://github.com/jeffdaily

Co-authored-by: Jithun Nair <jithun.nair@amd.com>
2025-05-22 18:22:15 +00:00
5b8f422561 [PT2][Optimus] Fix a typo in decompose_mm (#154048)
Summary: As titled

Differential Revision: D75160513

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154048
Approved by: https://github.com/Mingming-Ding
2025-05-22 18:11:40 +00:00
633ed01145 [MPS] Add support for two more isin variants (#154010)
`isin_Tensor_Scalar_out` is just a redispatch to eq/neq
`isin_Scalar_Tensor_out` redispatches back to generic `isin` op, but needs a small tweak to handle float scalars
Make sure that `out` is resized to an expected value in `isin_Tensor_Tensor_out_mps`

Add unittests to validate that, but skip them on MacOS-13, where MPS op just returns garbage

Before this change both of those failed
```python
>>> import torch
>>> t = torch.tensor([0, 1, 2], device='mps')
>>> torch.isin(t, 1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NotImplementedError: The operator 'aten::isin.Tensor_Scalar_out' is not currently implemented for the MPS device. If you want this op to be considered for addition please comment on https://github.com/pytorch/pytorch/issues/141287 and mention use-case, that resulted in missing op as well as commit hash 3b875c25ea6d8802a0c53af9eb961ddf2f058188. As a temporary fix, you can set the environment variable `PYTORCH_ENABLE_MPS_FALLBACK=1` to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS.
>>> torch.isin(1, t)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NotImplementedError: The operator 'aten::isin.Scalar_Tensor_out' is not currently implemented for the MPS device. If you want this op to be considered for addition please comment on https://github.com/pytorch/pytorch/issues/141287 and mention use-case, that resulted in missing op as well as commit hash 3b875c25ea6d8802a0c53af9eb961ddf2f058188. As a temporary fix, you can set the environment variable `PYTORCH_ENABLE_MPS_FALLBACK=1` to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154010
Approved by: https://github.com/Skylion007, https://github.com/dcci, https://github.com/manuelcandales
ghstack dependencies: #153970, #153971, #153997
2025-05-22 17:59:35 +00:00
7421c21b5e remove unused code. (#153979)
Remove the unused cmake code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153979
Approved by: https://github.com/albanD
2025-05-22 17:50:11 +00:00
fc859077a0 [export][cond] support merging constant ints as unbacked symint (#152742)
@pianpwk points out that this will be helpful to address several data dependent issues in huggingface [models](e23705e557/src/diffusers/schedulers/scheduling_euler_ancestral_discrete.py (L332)) with the following pattern:
```python
idx = 0 if u0 else 1
return x[idx]
```
We could preserve the conditional with a cond.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152742
Approved by: https://github.com/zou3519
2025-05-22 17:25:38 +00:00
025c5cc048 Revert "[inductor][cutlass backend] Add 2 stage autotuning aka prescreening (#153335)"
This reverts commit d23762974eae105aad837188d5d2254ea9783b37.

Reverted https://github.com/pytorch/pytorch/pull/153335 on behalf of https://github.com/yangw-dev due to sorry the pr is failed internally [D75155648](https://www.internalfb.com/diff/D75155648) ([comment](https://github.com/pytorch/pytorch/pull/153335#issuecomment-2901916364))
2025-05-22 16:52:04 +00:00
7d3dab6b90 Revert "[BE]: Type previously untyped decorators (#153726)"
This reverts commit b7d08defe9cfe1595ff680f845b39f5e03a89555.

Reverted https://github.com/pytorch/pytorch/pull/153726 on behalf of https://github.com/yangw-dev due to sorry, it seems like your pr failed typecheck error internally, [D75155486](https://www.internalfb.com/diff/D75155486) ([comment](https://github.com/pytorch/pytorch/pull/153726#issuecomment-2901911114))
2025-05-22 16:49:08 +00:00
a15550b776 [Cutlass] Use env var for EVT flag (#154099)
Swaps out hard flag for environment variable in inductor config.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154099
Approved by: https://github.com/eellison
2025-05-22 16:36:57 +00:00
a82c8891d5 Revert "[aoti] Add MPS runner and shim (#153964)"
This reverts commit 918ae5d36188f419a47f3b1315f9fb373035ed66.

Reverted https://github.com/pytorch/pytorch/pull/153964 on behalf of https://github.com/angelayi due to broke frl build ([comment](https://github.com/pytorch/pytorch/pull/153964#issuecomment-2901876832))
2025-05-22 16:35:59 +00:00
47a01f3efb Revert "[aoti] Initial Metal support (#153959)"
This reverts commit 28bcd9eb30336b370298dbe9677b95019882f2a8.

Reverted https://github.com/pytorch/pytorch/pull/153959 on behalf of https://github.com/angelayi due to previous PR broke frl build ([comment](https://github.com/pytorch/pytorch/pull/153959#issuecomment-2901825315))
2025-05-22 16:17:07 +00:00
f419373dd3 [inductor] lowering for fractional_max_pool3d (#148630)
Also adds a lowering with a reduction for large window sizes for fractional_max_pool2d.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148630
Approved by: https://github.com/eellison
2025-05-22 16:06:29 +00:00
9a8c42ff94 Get rid of unused code in linters (#154043)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154043
Approved by: https://github.com/XuehaiPan, https://github.com/Skylion007
2025-05-22 15:24:54 +00:00
35ddad284d update mutation renames (#153895)
Thanks to @PaulZhang12 for the original find. When we finalize a multi-template buffer, we need to reflect mutation renaming in dependencies.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153895
Approved by: https://github.com/PaulZhang12
2025-05-22 14:54:39 +00:00
6cd9d66b7f Allow higher fp16 tolerance for phlippe_resnet on CUDA 12.8 (#154109)
After https://github.com/pytorch/pytorch/pull/154004, one of the models, `phlippe_resnet`, needs a higher tolerance for fp16 on CUDA 12.8.  I can reproduce it locally with:

```
python benchmarks/dynamo/torchbench.py --accuracy --timing --explain --print-compilation-time --inductor --device cuda --training --amp --only phlippe_resnet

E0522 02:47:12.392000 2130213 site-packages/torch/_dynamo/utils.py:2949] RMSE (res-fp64): 0.00144, (ref-fp64): 0.00036 and shape=torch.Size([]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000, use_larger_multiplier_for_smaller_tensor: 0
```

I'm not sure what exactly happens behind the scene, but this should help fix the CI failure.

Also remove some left over expected accuracy results for CUDA 12.4 which we are not using anymore on CI for benchmark jobs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154109
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-05-22 14:25:12 +00:00
4439255148 [aotd] Support saved tensors hooks in aot_autograd (#150032)
https://github.com/pytorch/pytorch/issues/148222

Goal:

At the moment autograd saved tensors hooks are run in eager after compiled forward.
They are executed at the same time for all saved tensors.
Hooks can be used to reduce the amount of memory used for saved tensors, e.g. by doing quantization or offloading to CPU.
This is suboptimal for optimization of peak memory.
Better solution will be to put the hooks in the graph, as close as possible to the last usage of the tensor.

To get user specified autograd saved tensors hooks in the graph.

Logic:

UX:
If user specifies with torch.autograd.graph.saved_tensors_hooks(pack_gm, unpack_gm).
Where pack_gm and unpack_gm are torch.fx.GraphModule.
Then AotAutograd will retrace those graph modules, doing decompositions and functionalization in aot_autograd, inlining the result graphs in forward epilogue and backward prologue.

User may want to use control logic in the hooks, for example applying quantization only for specific dtypes and sizes.

This is also possible, user can put it into torch.fx.wrap function and use symbolic trace to make a GraphModule.

In that case, AotAutograd caching will work only when the user explicitly sets the "user_cache_hash" metadata on the torch.fx.wrap call_function node.

If this metadata set - then aot_autograd cache can use saved cache artifact.
If metadata is not set - then cache is bypassed.
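A hedged eager-mode sketch of that UX (illustrative dtypes and hook bodies, not the PR's test; the PR's contribution is inlining such GraphModules into the compiled forward epilogue and backward prologue):
```python
import torch

def pack(x):
    return x.to(torch.bfloat16)    # e.g. store saved activations at lower precision

def unpack(x):
    return x.to(torch.float32)

pack_gm = torch.fx.symbolic_trace(pack)
unpack_gm = torch.fx.symbolic_trace(unpack)

x = torch.randn(4, requires_grad=True)
with torch.autograd.graph.saved_tensors_hooks(pack_gm, unpack_gm):
    (x * x).sum().backward()       # saved tensors go through pack/unpack
print(x.grad.shape)                # torch.Size([4])
```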

Dynamo:
Dynamo traces pack and unpack hooks and installs them as subgraph and explicitly adds to the output_graph. (As those subgraphs are not used and will not be copied in the result by default).

The complexity here is that at this moment we do not have example inputs for the hooks.
We trace pack_hook with some Tensor from the inputs.
The result subgraphs are added to the hashing of AotAutograd Cache.

In AotAutograd we retrace the graph with the true saved tensors coming from partitioner.

Backwards Compatibility:
As current hooks are executed in eager mode and not all of them will be traceable, we only try to put in the graph those hooks explicitly marked by the user with the annotation (@_inlineable_saved_tensors_hooks).
For other hooks or if compiled autograd is enabled - keep the same logic.

Recompilations:
Hooks are guarded with lambda guard matching function id to cause recompilation if user reruns compiled function.

Aot_autograd:
After partitioner prepared forward and backward module - we trace prepared at Dynamo graphs for pack and unpack hooks and inline them in epilogue of forward and prologue of backward. Forward outputs and backward inputs are changed, transparently for user.

We do not try to put it close to the last usage etc., relying on inductor to do this optimization.

```
INFO: TRACED GRAPH
 ===== Forward graph pre saved_tensors_hooks inlining 3 =====
 /data/users/ivankobzarev/a/pytorch/torch/fx/_lazy_graph_module.py class GraphModule(torch.nn.Module):
    def forward(self, primals_1: "Sym(s0)", primals_2: "Sym(s1)", primals_3: "f32[s0, s1][s1, 1]cuda:0"):
         # File: /data/users/ivankobzarev/a/pytorch/test/functorch/test_aotdispatch.py:6660 in simple_fn, code: x = x + 1
        add: "f32[s0, s1][s1, 1]cuda:0" = torch.ops.aten.add.Tensor(primals_3, 1);  primals_3 = None

         # File: /data/users/ivankobzarev/a/pytorch/test/functorch/test_aotdispatch.py:6661 in simple_fn, code: x = SAF.apply(x)
        view: "f32[s0, s1][s1, 1]cuda:0" = torch.ops.aten.view.default(add, [primals_1, primals_2])
        return (view, add, primals_1, primals_2)

INFO: TRACED GRAPH
 ===== Backward graph pre saved_tensors_hooks inlining 3 =====
 /data/users/ivankobzarev/a/pytorch/torch/fx/_lazy_graph_module.py class GraphModule(torch.nn.Module):
    def forward(self, primals_1: "Sym(s0)", primals_2: "Sym(s1)", primals_3: "f32[s0, s1][s1, 1]cuda:0"):
         # File: /data/users/ivankobzarev/a/pytorch/test/functorch/test_aotdispatch.py:6660 in simple_fn, code: x = x + 1
        add: "f32[s0, s1][s1, 1]cuda:0" = torch.ops.aten.add.Tensor(primals_3, 1);  primals_3 = None

         # File: /data/users/ivankobzarev/a/pytorch/test/functorch/test_aotdispatch.py:6661 in simple_fn, code: x = SAF.apply(x)
        view: "f32[s0, s1][s1, 1]cuda:0" = torch.ops.aten.view.default(add, [primals_1, primals_2])
        return (view, add, primals_1, primals_2)

INFO: TRACED GRAPH
 ===== saved_tensors_pack_hook add 3 =====
 /data/users/ivankobzarev/a/pytorch/torch/fx/_lazy_graph_module.py class pack_float8(torch.nn.Module):
    def forward(self, x_1: "f32[s0, s1][s1, 1]cuda:0"):
        # No stacktrace found for following nodes
        _to_copy: "f8e4m3fn[s0, s1][s1, 1]cuda:0" = torch.ops.aten._to_copy.default(x_1, dtype = torch.float8_e4m3fn);  x_1 = None
        return (torch.float32, _to_copy)

INFO: TRACED GRAPH
 ===== saved_tensors_unpack_hook add 3 =====
 <eval_with_key>.22 from /data/users/ivankobzarev/a/pytorch/torch/fx/experimental/proxy_tensor.py:1225 in wrapped class pack_float8(torch.nn.Module):
    def forward(self, x_1: "f32[s0, s1][s1, 1]cuda:0"):
        # No stacktrace found for following nodes
        _to_copy: "f8e4m3fn[s0, s1][s1, 1]cuda:0" = torch.ops.aten._to_copy.default(x_1, dtype = torch.float8_e4m3fn);  x_1 = None
        return (torch.float32, _to_copy)

INFO: TRACED GRAPH
 ===== Forward graph 3 =====
 /data/users/ivankobzarev/a/pytorch/torch/fx/_lazy_graph_module.py class GraphModule(torch.nn.Module):
    def forward(self, primals_1: "Sym(s0)", primals_2: "Sym(s1)", primals_3: "f32[s0, s1][s1, 1]cuda:0"):
         # File: /data/users/ivankobzarev/a/pytorch/test/functorch/test_aotdispatch.py:6660 in simple_fn, code: x = x + 1
        add: "f32[s0, s1][s1, 1]cuda:0" = torch.ops.aten.add.Tensor(primals_3, 1);  primals_3 = None

        # No stacktrace found for following nodes
        _to_copy: "f8e4m3fn[s0, s1][s1, 1]cuda:0" = torch.ops.aten._to_copy.default(add, dtype = torch.float8_e4m3fn)

         # File: /data/users/ivankobzarev/a/pytorch/test/functorch/test_aotdispatch.py:6661 in simple_fn, code: x = SAF.apply(x)
        view: "f32[s0, s1][s1, 1]cuda:0" = torch.ops.aten.view.default(add, [primals_1, primals_2]);  add = None
        return (view, _to_copy, primals_1, primals_2)

INFO: TRACED GRAPH
 ===== Backward graph 3 =====
 <eval_with_key>.21 class GraphModule(torch.nn.Module):
    def forward(self, primals_1: "Sym(s0)", primals_2: "Sym(s1)", add_packed_2: "f8e4m3fn[s0, s1][s1, 1]cuda:0", tangents_1: "f32[s0, s1][s1, 1]cuda:0"):
        # No stacktrace found for following nodes
        _to_copy: "f32[s0, s1][s1, 1]cuda:0" = torch.ops.aten._to_copy.default(add_packed_2, dtype = torch.float32);  add_packed_2 = None

         # File: /data/users/ivankobzarev/a/pytorch/test/functorch/test_aotdispatch.py:6661 in simple_fn, code: x = SAF.apply(x)
        add_7: "f32[s0, s1][s1, 1]cuda:0" = torch.ops.aten.add.Tensor(tangents_1, _to_copy);  tangents_1 = _to_copy = None
        return (None, None, add_7)

```

Differential Revision: [D72187044](https://our.internmc.facebook.com/intern/diff/D72187044)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150032
Approved by: https://github.com/bdhirsh
2025-05-22 14:09:38 +00:00
f12d8d60b1 Add hint message when parameters is empty in clip_grad_norm_ (#151529)
Fixes #148259

## Changes

- Print a warning message when the `parameters` generator is exhausted

## Test Result
### print warning
```python

import torch
import torch.nn as nn
import torch.optim as optim

class SimpleModel(nn.Module):
    def __init__(self):
        super(SimpleModel, self).__init__()
        self.fc = nn.Linear(10, 1)

    def forward(self, x):
        return self.fc(x)

model = SimpleModel()
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

inputs = torch.randn(16, 10)
targets = torch.randn(16, 1)

outputs = model(inputs)
loss = criterion(outputs, targets)
optimizer.zero_grad()
loss.backward()

params_to_clip = model.parameters()

for p in params_to_clip:
    print(p.shape)

max_norm = 1.0
norm_type = 2.0
total_norm = nn.utils.clip_grad_norm_(params_to_clip, max_norm, norm_type)
print(f"total_norm: {total_norm}")
```

```bash
/home/zong/code/pytorch/torch/nn/utils/clip_grad.py:222: UserWarning: `parameters` is an empty generator, no gradient clipping will occur.
  warnings.warn(
total_norm: 0.0
```

### UT

```bash
pytest test/test_nn.py -k test_clip_grad_norm
```

![image](https://github.com/user-attachments/assets/0aa0f06c-e0a5-43cf-9a97-d7c2747c9180)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151529
Approved by: https://github.com/jbschlosser
2025-05-22 11:23:39 +00:00
40e6ca24ef Update CPU Inductor merge rules by adding more CPP Template (#152086)
**Summary**
Add more CPP Template into the CPU Inductor merge rules.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152086
Approved by: https://github.com/atalman
2025-05-22 09:46:26 +00:00
2f57ee579d S390x update docker image (#153619)
Add ninja-build for pytorch tests.
Switch to gcc 14 due to fix for precompiled headers and s390x vectorization interaction.
Disable -Werror when building onnxruntime.
Pin onnx version.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153619
Approved by: https://github.com/huydhn
2025-05-22 09:34:46 +00:00
d7a83ab67b Fix lr_scheduler unexpectedly calls step() when init argument last_epoch is larger than -1 (#149312)
Fixes #102261

## Changes

- Use flag `_is_initial` to replace `self.last_epoch == 0` condition to judge whether `lr` should be initial value
- Add test for `ExponentialLR` checkpoint usecase

## Test Result

```python
pytest -s test/optim/test_lrscheduler.py  -vv
```

![image](https://github.com/user-attachments/assets/6fd32bcc-b4fb-4421-b891-620bd4900dc1)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149312
Approved by: https://github.com/janeyx99

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2025-05-22 08:42:37 +00:00
423fc671e9 [Cutlass] Support float8_e4m3fn GEMM (#153890)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153890
Approved by: https://github.com/drisspg, https://github.com/eellison
2025-05-22 08:37:33 +00:00
c1b7dbc52a [dynamo] unimplemented -> unimplemented_v2 in variables/dict.py (#154040)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154040
Approved by: https://github.com/williamwen42, https://github.com/StrongerXi
2025-05-22 06:46:10 +00:00
a664cfdf95 Add C10_NODEPRECATED check for xpu (#153935)
# Motivation
Add a `C10_NODEPRECATED` check for XPU. This disallows the XPU codebase from using `c10::optional`.

What's the change in the torch-xpu-ops commit update?
Deprecate `c10::optional`, `c10::nullopt`, and `c10::make_optional`; use their std counterparts instead.

# Additional Context
This PR depends on
https://github.com/intel/torch-xpu-ops/pull/1683
https://github.com/intel/torch-xpu-ops/pull/1690

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153935
Approved by: https://github.com/Skylion007, https://github.com/cyyever
2025-05-22 06:44:04 +00:00
482e5b6660 [inductor] Added precompilation_timeout_seconds into a config instead of hardcoded (#153788)
Fixes #153392

- Updated config.py to add the timeout as a config var to be tuned dynamically (default is 3600s).
- Passed the var as a kwarg when calling the instance.
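A hedged sketch of tuning the new knob (the exact attribute location is an assumption based on the PR description, hence the `hasattr` guard):
```python
import torch._inductor.config as inductor_config

# Bump the precompilation timeout from the 3600s default; skip silently if this
# build predates the knob.
if hasattr(inductor_config, "precompilation_timeout_seconds"):
    inductor_config.precompilation_timeout_seconds = 7200
```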

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153788
Approved by: https://github.com/henrylhtsang
2025-05-22 06:44:02 +00:00
7128b50a65 [CI][CUDA][Distributed] Move cuda 11.8 distributed pull jobs to cuda 12.6 (#151594)
This PR moves distributed cuda CI job from cuda 11.8 to cuda 12.6.
In doing so, a few unit test failures were exposed, some if not all of which would take a while to root-cause and fix, so they are temporarily skipped after creating the issues.

https://github.com/pytorch/pytorch/issues/153479 test_nan_assert tricky behavior (e.g. skip_but_pass_in_sandcastle, ubuntu 20.04 does not work, ubuntu 22.04 works, Amazon Linux 2023 skip - what is Sandcastle OS?)
https://github.com/pytorch/pytorch/issues/153122 CUDA context related
https://github.com/pytorch/pytorch/issues/153517  NCCL regression, future NCCL may fix it
https://github.com/pytorch/pytorch/issues/154073 skip test_symmetric_memory for cuda 12.6 before it is fixed

See: https://github.com/pytorch/pytorch/issues/147383

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151594
Approved by: https://github.com/eqy, https://github.com/atalman, https://github.com/cyyever, https://github.com/huydhn, https://github.com/kwen2501
2025-05-22 06:33:29 +00:00
4bcff4af99 Move prologue_supported_inputs computations to def_kernal (#150869)
This avoids replaying load_input on a cache hit in the generate_code_cache.
The idea is that if a template has prologue_loads_all_inputs = True, it means that all inputs are loaded and hence there is no need to replay.
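A conceptual sketch of that shortcut (assumed attribute and function names for illustration, not Inductor's API):
```python
from types import SimpleNamespace

def codegen_on_cache_hit(template, cached_code, replay_load_input):
    # If the template prologue already loads every input, the cached code can be
    # reused as-is; otherwise replay load_input as before.
    if getattr(template, "prologue_loads_all_inputs", False):
        return cached_code
    return replay_load_input(cached_code)

tmpl = SimpleNamespace(prologue_loads_all_inputs=True)
print(codegen_on_cache_hit(tmpl, "<cached kernel>", lambda code: code + " + input loads"))
```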

Effect on the current benchmark on a local run on dev server.
18549985383 -> 15072230073
25697270062 -> 20738613297

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150869
Approved by: https://github.com/eellison
2025-05-22 06:24:44 +00:00
4421aee558 torch.compile: Supress stdout / stderr output from subprocesses when local (#153837)
Summary:
This output is extremely noisy - e.g. on a 96-core machine with 8 ranks, you
can get ~700 duplicate sets of logs, one from each worker.

Differential Revision: D74907920

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153837
Approved by: https://github.com/aorenste, https://github.com/masnesral
2025-05-22 05:49:43 +00:00
f2af30fee5 Add a HOP to bypass tracing of a wrapper function while tracing the wrapped function (#153487)
Usage:
```python
from torch._higher_order_ops.wrap import dynamo_bypassing_wrapper

# Your ordinary function wrapper
def my_hop_fn_impl(fn, *args, k=1, **kwargs):
    def wrapper(*args, **kwargs):
        out = fn(*args, **kwargs)
        if isinstance(out, tuple):
            return (out[0] + k,)
        return out + k

    return wrapper

# Calling `my_hop_fn` instead of the impl directly captures a HOP into the dynamo graph
def my_hop_fn(fn, *args, k=1, **kwargs):
    return dynamo_bypassing_wrapper(
        functools.partial(my_hop_fn_impl, k=k), fn, *args, **kwargs
    )
```

Notes:
- The dynamo captured graph now stashes arbitrary callable objects (the wrapper_fn) - this is equivalent to what SAC does today with policy_fn.
- The `wrapper_fn` passed to `dynamo_bypassing_wrapper ` should have signature `Callable -> Callable`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153487
Approved by: https://github.com/ydwu4
2025-05-22 04:24:38 +00:00
669b176d4c [Graph Partition] support removed arguments, NoneLayout, and mutation (#153899)
Graph partition relies on `read_writes` to collect partition inputs and outputs. There are three edge cases:

1. `NoneLayout` is not allocated so it cannot become a partition input or output.
2. Codegen may decide that a buffer is internal to a kernel (e.g., a triton kernel). One example is buffers internal to a FusedSchedulerNode. These buffers are never actually allocated as a `buf_id`.
3. We should use mutation_real_name for graph partition inputs and outputs to match the behavior of other codegen.

This PR supports these 3 cases.
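A conceptual sketch of how the three cases interact when collecting partition inputs and outputs (assumed data structures and names, not Inductor's classes):
```python
def partition_io(nodes, mutation_real_name, allocated_buffers):
    reads, writes = set(), set()
    for node in nodes:
        # Case 3: always look through mutation renames to the real buffer name.
        reads |= {mutation_real_name.get(b, b) for b in node["reads"]}
        writes |= {mutation_real_name.get(b, b) for b in node["writes"]}
    # Cases 1 and 2: buffers that are never allocated (NoneLayout, or internal to a
    # fused kernel) cannot become partition inputs or outputs.
    inputs = {b for b in reads - writes if b in allocated_buffers}
    outputs = {b for b in writes if b in allocated_buffers}
    return inputs, outputs

nodes = [{"reads": {"arg0"}, "writes": {"buf0"}},
         {"reads": {"buf0"}, "writes": {"buf1_mut"}}]
print(partition_io(nodes, {"buf1_mut": "buf1"}, {"arg0", "buf0", "buf1"}))
```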

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153899
Approved by: https://github.com/eellison
2025-05-22 04:24:31 +00:00
d1fe198df6 [cond] support output the same unbacked symbol from two branches (#148206)
Previously, we didn't track the unbacked symbols leaked out of true_branch and false_branch if they had the same shape expr. This caused the fake output of the cond operator itself to not set up its unbacked_bindings meta properly (because they were ignored).

In this PR, we also check whether there are leaked unbacked symbols, create new unbacked symbols for them, and track them as outputs of cond.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148206
Approved by: https://github.com/zou3519
2025-05-22 03:39:43 +00:00
fe285b9560 [aoti] fix corner case in unbacked replacements for atomically_apply_size_hint (#153768)
## PR
There are a few cases that my previous PR (#153220) didn't cover.
1. The LHS/RHS matters. Today, if you do `torch._check(lhs == rhs)` then it will show up as a deferred runtime assert with `Eq(lhs, rhs)`.
2. There can be transitive replacements. For example, expr1 -> expr2 -> u0. `test_size_with_unbacked_add_expr_transitive` tests for this.
3. An unbacked symint expr may not have a replacement that's purely a symbol, for instance, it could be another expression. `test_size_with_unbacked_add_and_mul_expr` tests for this.

## Device assertion msg

```
/tmp/tmp07mu50tx/6y/c6ym2jzadwfigu3yexredb7qofviusz3p7ozcdjywvayhxgcqxkp.py:40: unknown: block: [8681,0,0], thread: [4,0,0] Assertion `index out of bounds: 0 <= tl.broadcast_to(tmp13, [XBLOCK]) < ks0` failed.
...
/tmp/tmp07mu50tx/6y/c6ym2jzadwfigu3yexredb7qofviusz3p7ozcdjywvayhxgcqxkp.py:40: unknown: block: [8681,0,0], thread: [6,0,0] Assertion `index out of bounds: 0 <= tl.broadcast_to(tmp13, [XBLOCK]) < ks0` failed.
```

## Autotuning code setup
This is the autotuning code for a concat kernel which takes input tensors (`in_buf`) and writes them to the (`out_buf`).

It's important to note that the sizes of `in_buf0` and `in_buf1` don't match along dim=0. This is bad because all concat inputs must share the same size for each dim except the concat dim (here that's dim=1).
```
in_buf0 = generate_example_value(size=(u1 + s0, 256))   # concrete size is (17900, 256)
in_buf1 = generate_example_value(size=(u0, 10))         # concrete size is (8192, 10)
...
out_buf = generate_example_value(size=(u1 + s0, 266))   # concrete size is (17900, 256+10)
triton_poi_fused_cat_1.run(in_buf0, in_buf1, ..., out_buf, xnumel=(u1 + s0) * 266 ...)
```

If we look into the kernel code, you'll see that `tmp9` loads `in_buf1` (our incorrectly shaped input tensor). There is also a mask to prevent OOB loads.
- `tmp6`  makes sure we're only loading with the `xindex` from 256 to 264.
- `xmask` makes sure we're only loading with the `xindex` within `xnumel`.
- `tmp6 & xmask` together is essentially checking `0 ≤ x0 < u1 + s0` and `256 ≤ x1 < 264`.

The mask logic is correct, however, `in_buf1` has the shape `[8192, 10]` this means any load where `8192 ≤ x0 < u1 + s0` will be an OOB load.
```
def triton_poi_fused_cat_1(in_buf0, in_buf1, ... out_buf, xnumel, XBLOCK):
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)
    xmask = xindex < xnumel
    x0 = (xindex % 264)
    x1 = xindex // 264
    ...
    tmp6 = x0 >= tl.full([1], value=256)
    tmp9 = tl.load(in_buf1 + (x1), tmp6 & xmask)
    # device assertion is thrown here
    tl.device_assert(((0 <= tl.broadcast_to(tmp13, [XBLOCK])) & (tl.broadcast_to(tmp13, [XBLOCK]) < ks0)) | ~(xmask & tmp6), "index out of bounds: 0 <= tl.broadcast_to(tmp13, [XBLOCK]) < ks0")
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153768
Approved by: https://github.com/jingsh
2025-05-22 02:05:37 +00:00
a264af8c71 Support fp8 output of _scaled_mm for CPU (#153600)
This PR adds support for fp8 output of torch._scaled_mm for CPU and creates related UTs with fp8 and bf16/fp16/fp32 outputs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153600
Approved by: https://github.com/leslie-fang-intel, https://github.com/mingfeima, https://github.com/jansel
2025-05-22 01:15:39 +00:00
254293b777 Add flag _metrics_log_runtime to disable runtime metric logging by default (#153506)
https://github.com/pytorch/pytorch/pull/152708 expanded support of `get_estimated_runtime` to many more types of `SchedulerNodes`. This caused an increase in compile time because we're always calling `get_estimated_runtime` to populate the metrics table. This PR adds a flag for this logging, which reduces the instruction count by 8%. Long term, we should probably merge metrics.py with TORCH_LOGS/tlparse (suggestion from @xmfan).

Update: added support for TORCH_LOGS for the metrics logging.

Test Plan:
mm_loop.py and many existing tests cover.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153506
Approved by: https://github.com/eellison
2025-05-22 01:02:11 +00:00
261897734a Revert "cpp_wrapper: build non-performance-sensitive code at O1 (#148773)"
This reverts commit 3c89cfd46075e62c1725b43557612901a9cbb6fa.

Reverted https://github.com/pytorch/pytorch/pull/148773 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems that pr_time_benchmark is regressed after this land ([comment](https://github.com/pytorch/pytorch/pull/148773#issuecomment-2899545140))
2025-05-22 00:11:14 +00:00
7ef2c62fd3 [ROCm][Inductor][CK] Add ck-tile based universal gemm kernels to torch.mm autotune choices (#152341)
This PR adds code generation for CK-tile based universal gemm kernels to the CK backend for Inductor, and adds these kernels to autotune choices.

Unlike legacy-CK based kernels (which are generated by parsing the CK instances from CK library), we generate the set of instances by manually specifying the tuning parameters.

This PR introduces a new template for code generation, and compilation/autotuning is handled by the existing infrastructure.

Points of discussion:

* For simplicity and reduced coupling with CK, the instance filter checks only data type and layout, and doesn't check the alignment requirement - meaning that more instances will be compiled than necessary - while keeping the code generation independent from internal CK logic which checks the alignment validity at runtime
* CK-tile instances are enabled whenever legacy-CK instances are enabled. A config knob could be introduced to differentiate between the instance types if that's needed
* Whether gemm problem size K is ever dynamic, since whenever it's not a compile-time constant, we need to perform a runtime dispatch between several kernels

**Testing**

Use the existing tests in `test/inductor/test_ck_backend.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152341
Approved by: https://github.com/chenyang78
2025-05-21 23:59:16 +00:00
87fc5af1f6 [c10d] Turn off default non-blocking API mode to work around hang in NCCL 2.26 (#154055)
Work around issues like #153960, #152623

NCCL 2.26 seems to introduce random hangs in non-blocking API mode. This PR opts out of non-blocking mode to work around them. Previously, torch turned it on by default in eager init (i.e. when `device_id` is passed) to avoid init overhead.
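For reference, a minimal sketch of the eager-init path mentioned above (the usual rank/world-size environment setup is assumed and omitted):
```python
import torch
import torch.distributed as dist

# Eager init: passing device_id binds the process group to a device and
# initializes NCCL communicators up front. Non-blocking mode is no longer
# enabled by default on this path.
dist.init_process_group("nccl", device_id=torch.device("cuda", 0))
```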

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154055
Approved by: https://github.com/atalman
2025-05-21 23:46:52 +00:00
fae6f6c9ca [aot] fix deepcopying of aot bwd containing real tensors (#153999)
Previously, when we lower the backward graph ahead of time (due to symints), the post-grad passes would leave the bw_module in a non-runnable state. This caused issues when compiled autograd tried to trace at runtime, so we had Inductor operate on a deepcopy of bw_module.

But with https://github.com/pytorch/pytorch/issues/153993, we see that deepcopying real tensors can fail under fake mode due to the device-type mismatch between the fake tensors ("meta" device) and the real tensors. By disabling fake mode around the deepcopy, we avoid these errors. This change is a strict improvement over the current behavior, but it does reveal that this deepcopy can theoretically cause OOMs.

FIXES https://github.com/pytorch/pytorch/issues/153993

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153999
Approved by: https://github.com/jamesjwu, https://github.com/bdhirsh
2025-05-21 23:30:02 +00:00
67f9feeee7 remove TestCustomOp.test_impl_device_cpu from dynamo expected failures (#154049)
Fixes https://github.com/pytorch/pytorch/issues/153763 maybe?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154049
Approved by: https://github.com/StrongerXi
2025-05-21 23:20:30 +00:00
5ee1242310 Follow up to #152209, remove compat patch after docker image rename (#152958)
Remove the compat patch that lets PRs that haven't rebased past #152209 still have docker images.

Merge ~2 weeks after the above PR was merged.  ~80% of PRs have a merge base that is <2 weeks old

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152958
Approved by: https://github.com/huydhn
2025-05-21 23:11:29 +00:00
d82610c2af docs: fix "should not to be" typo in register_buffer docstring (#153817)
Corrects a small grammatical error in `register_buffer` docstring, from "...  should not to be ..." to "...  should not be ...". Docs-only change, so no runtime behavior, tests, or APIs are affected.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153817
Approved by: https://github.com/mikaylagawarecki
2025-05-21 22:46:50 +00:00
b967b7b11e Update rnn.py, fix torch.nn.RNN document error (#153620)
I found the same issue as #147490 (@jibril-b-coulibaly).

There's an equivalent in the [doc-string](https://docs.pytorch.org/docs/stable/generated/torch.nn.RNN.html#rnn) of `torch.nn.RNN`:

```python
# Efficient implementation equivalent to the following with bidirectional=False
def forward(x, hx=None):
    if batch_first:
        x = x.transpose(0, 1)
    seq_len, batch_size, _ = x.size()
    if hx is None:
        hx = torch.zeros(num_layers, batch_size, hidden_size)
    h_t_minus_1 = hx
    h_t = hx
    output = []
    for t in range(seq_len):
        for layer in range(num_layers):
            h_t[layer] = torch.tanh(
                x[t] @ weight_ih[layer].T
                + bias_ih[layer]
                + h_t_minus_1[layer] @ weight_hh[layer].T
                + bias_hh[layer]
            )
        output.append(h_t[-1])
        h_t_minus_1 = h_t
    output = torch.stack(output)
    if batch_first:
        output = output.transpose(0, 1)
    return output, h_t

```

However, there are a few things wrong.

1. As mentioned in #147490, line 499 is wrong

fb55bac3de/torch/nn/modules/rnn.py (L499)

The **input for RNNCell should be different** for different layers.

2. The code contains several hidden **reference-related issues** that may result in unintended modifications to tensors. For example, line 504 causes all elements in the final output list to point to the same tensor.

fb55bac3de/torch/nn/modules/rnn.py (L504)

3. Some variables are not **defined**. Although this is a relatively minor issue in a doc-string, it can lead to significant confusion for readers who are new to the concept. For example, `weight_ih` in line 499

fb55bac3de/torch/nn/modules/rnn.py (L499)

So, I wrote a runnable version to make it clearer:

```python
# Efficient implementation equivalent to the following with bidirectional=False
rnn = nn.RNN(input_size, hidden_size, num_layers)
params = dict(rnn.named_parameters())
def forward(x, hx=None, batch_first=False):
    if batch_first:
        x = x.transpose(0, 1)
    seq_len, batch_size, _ = x.size()
    if hx is None:
        hx = torch.zeros(rnn.num_layers, batch_size, rnn.hidden_size)
    h_t_minus_1 = hx.clone()
    h_t = hx.clone()
    output = []
    for t in range(seq_len):
        for layer in range(rnn.num_layers):
            input_t = x[t] if layer == 0 else h_t[layer - 1]
            h_t[layer] = torch.tanh(
                input_t @ params[f"weight_ih_l{layer}"].T
                + h_t_minus_1[layer] @ params[f"weight_hh_l{layer}"].T
                + params[f"bias_hh_l{layer}"]
                + params[f"bias_ih_l{layer}"]
            )
        output.append(h_t[-1].clone())
        h_t_minus_1 = h_t.clone()
    output = torch.stack(output)
    if batch_first:
        output = output.transpose(0, 1)
    return output, h_t
```

This code can reproduce the computation of torch.nn.RNN.

For example:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
input_size, hidden_size, num_layers = 3, 5, 2
rnn = nn.RNN(input_size, hidden_size, num_layers)
params = dict(rnn.named_parameters())
x = torch.randn(10, 4, 3)

official_imp = rnn(x)
my_imp = forward(x)

assert torch.allclose(official_imp[0], my_imp[0])
assert torch.allclose(official_imp[1], my_imp[1])
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153620
Approved by: https://github.com/mikaylagawarecki
2025-05-21 22:45:28 +00:00
5b6e551c0f [AOTI][refactor] Fix an anonymous namespace issue (#154033)
Summary: Remove anonymous namespace in model_container.h to fix the following compiler warning,
```
warning: ‘torch::aot_inductor::AOTInductorModelContainer’ has a field ‘torch::aot_inductor::AOTInductorModelContainer::constant_folded_’ whose type uses the anonymous namespace [-Wsubobject-linkage]
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154033
Approved by: https://github.com/chenyang78
2025-05-21 22:29:09 +00:00
d356ca2466 [map] add inductor support by lowering to while_loop (#150971)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150971
Approved by: https://github.com/zou3519
ghstack dependencies: #151034
2025-05-21 22:19:47 +00:00
cf1b38a017 [map] make proxy mode re-dispatch to fake key (#151034)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151034
Approved by: https://github.com/zou3519
2025-05-21 22:19:47 +00:00
c13eeaa718 Move inductor benchmark jobs to CUDA 12.8 (#154004)
For benchmark jobs, we usually want to run with the latest supported CUDA version to get the best performance. This is a request coming from NVIDIA, where they are running inductor benchmarks on Blackwell with CUDA 12.8 (the minimum supported version) and looking for an apples-to-apples comparison.

This also cleans up references to CUDA 12.4, which has been sunset in PyTorch CI.

### Testing

- H100 benchmark https://github.com/pytorch/pytorch/actions/runs/15151424588
- Micro benchmark https://github.com/pytorch/pytorch/actions/runs/15151445957 (I just realized that this is still running on A100. @yanboliang Do you want to run on H100 now that we have capacity there? It would also solve the problem of GPU memory)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154004
Approved by: https://github.com/atalman, https://github.com/ZainRizvi, https://github.com/seemethere
2025-05-21 22:17:10 +00:00
053ca7439a [cutlass backend] Add serializer for cutlass ops (#153894)
Differential Revision: [D74524786](https://our.internmc.facebook.com/intern/diff/D74524786/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153894
Approved by: https://github.com/ColinPeppler, https://github.com/mlazos
2025-05-21 22:01:40 +00:00
401fa87ace make only current thread allocate to pool in NcclPG (#153990)
follow up to #153356 that fixes nccl allocation to pool

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153990
Approved by: https://github.com/kwen2501
2025-05-21 21:57:37 +00:00
28bcd9eb30 [aoti] Initial Metal support (#153959)
An example generated file: P1816629015

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153959
Approved by: https://github.com/malfet, https://github.com/desertfire
ghstack dependencies: #153964
2025-05-21 21:55:59 +00:00
918ae5d361 [aoti] Add MPS runner and shim (#153964)
Added AOTIModelContainerRunnerMps and a shim for mps fallback ops.
I also added an MPS-specific shim which contains one operator that will be used to set arguments passed to the Metal kernel:

```
AOTI_TORCH_EXPORT AOTITorchError aoti_torch_mps_set_arg(
    AOTIMetalKernelFunctionHandle func,
    unsigned idx,
    AtenTensorHandle tensor);
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153964
Approved by: https://github.com/malfet, https://github.com/desertfire
2025-05-21 21:55:59 +00:00
0b79a8c1a9 [dynamo] renamed _fn for more clarity and put a comment of user compiler user (#154026)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154026
Approved by: https://github.com/williamwen42, https://github.com/StrongerXi
2025-05-21 21:12:51 +00:00
0e5f2339d0 [ROCm][Windows] Run hipcc with compatibility flags. (#153986)
See also https://github.com/ROCm/TheRock/issues/590. Including the `-Wno-ignored-attributes` flag here avoids 700MB of warning spam in the logs while compiling, and the `-fms-extensions` flag seems beneficial to include: https://clang.llvm.org/docs/MSVCCompatibility.html.

Co-authored-by: Aaryaman Vasishta <jem456.vasishta@gmail.com>
Co-authored-by: Scott Todd <scott.todd0@gmail.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153986
Approved by: https://github.com/Skylion007, https://github.com/jeffdaily

Co-authored-by: Aaryaman Vasishta <jem456.vasishta@gmail.com>
2025-05-21 20:26:52 +00:00
3c89cfd460 cpp_wrapper: build non-performance-sensitive code at O1 (#148773)
Builds on #148212, applying the same improvements to `cpp_wrapper` mode.

Benchmark results:

* [A100 Benchmarks](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Wed%2C%2014%20May%202025%2015%3A10%3A05%20GMT&stopTime=Wed%2C%2021%20May%202025%2015%3A10%3A05%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(a100)&lBranch=gh/benjaminglass1/77/orig&lCommit=ca7d0a3f16e3c511534d2cd03d695be8524570d3&rBranch=main&rCommit=1075bb37d34e483763a09c7810790d5491441e13)
* [x86 Benchmarks](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Wed%2C%2014%20May%202025%2015%3A10%3A05%20GMT&stopTime=Wed%2C%2021%20May%202025%2015%3A10%3A05%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cpu%20(x86)&lBranch=gh/benjaminglass1/77/orig&lCommit=ca7d0a3f16e3c511534d2cd03d695be8524570d3&rBranch=main&rCommit=1075bb37d34e483763a09c7810790d5491441e13)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148773
Approved by: https://github.com/desertfire
2025-05-21 20:23:04 +00:00
4c6f0fe22f [dynamo] Properly handle torch.script.jit under @staticmethod (#153984)
Fixes #153607.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153984
Approved by: https://github.com/williamwen42
2025-05-21 19:45:06 +00:00
b184e3da9c [easy] Fix internal only test (#154035)
Internally static cuda launcher isn't enabled, so we need to always enable it

Differential Revision: [D75146584](https://our.internmc.facebook.com/intern/diff/D75146584/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154035
Approved by: https://github.com/Skylion007
ghstack dependencies: #153565
2025-05-21 19:00:55 +00:00
8e6e79fc1b [hop_schema] support gen_schema for invoke_subgraph (#152984)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152984
Approved by: https://github.com/zou3519
ghstack dependencies: #151067, #152974
2025-05-21 18:55:46 +00:00
9c33899196 [hop_schema] add HopSchemaGenerator to make it easier to create hop schema (#152974)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152974
Approved by: https://github.com/zou3519
ghstack dependencies: #151067
2025-05-21 18:55:46 +00:00
1e0f19e173 auto functionalize base_hop (#151067)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151067
Approved by: https://github.com/zou3519
2025-05-21 18:55:46 +00:00
11c0ffefcd Cache code generation during triton template expansion and enable it for mm_template. (#151773)
In one model, we see ~40% of the compile time spent in mm/addmm tuning. The model has 2000 mms,
many of which receive the same input shapes.

With autotune enabled, this becomes expensive. While we already cache autotuning results, we
did not use to cache the generation of the Python code and its loading for each config that we autotune on.

This diff handles the code generation part (template expansions); a previous diff handled the loading part.
This is expected to save 20% on the model I am working on.

How do we do the caching?
For a given config and input layout, the generated code is always the same. One caveat is that
some other information collected during code generation is input dependent (namely, it depends on the input
names and the symbol names in the inputs), not just on the layout.
To handle those we use a record-and-replay approach, where we record the functions that are called during
code generation that affect those outputs, and replay them on a cache hit.
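A toy sketch of the record-and-replay idea; the names and structure are illustrative, not the actual Inductor code:
```python
class CachedTemplate:
    def __init__(self, code, recorded_calls):
        self.code = code
        self.recorded_calls = recorded_calls  # side-effecting calls to replay


template_cache = {}


def expand_template(key, generate, side_effects):
    """generate(record) returns the kernel source; side_effects maps names of
    input-dependent callbacks (e.g. registering input/symbol names) to functions."""
    if key in template_cache:
        # Cache hit: skip template expansion entirely, but replay the recorded
        # side-effecting calls so the surrounding state is updated as on a miss.
        entry = template_cache[key]
        for name, args in entry.recorded_calls:
            side_effects[name](*args)
        return entry.code

    recorded = []

    def record(name, *args):
        recorded.append((name, args))
        return side_effects[name](*args)

    code = generate(record)  # the expensive template expansion happens only here
    template_cache[key] = CachedTemplate(code, recorded)
    return code
```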

Effect on the current benchmark (local run on a dev server):
mm_loop: 24115830838 -> 18362098019
mm_loop_dynamic: 30506097176 -> 25697270062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151773
Approved by: https://github.com/eellison
2025-05-21 18:55:41 +00:00
aec7cc60d7 add graph_code_verbose_log artifact for fx passes (#153775)
Fixes #153646

This PR refactors the logging behavior in the FX pass insert_deferred_runtime_asserts and runtime_assert.py to separate verbose/intermediate graph logs from the final output graph log. All verbose logs generated during the FX pass are now routed to a new artifact logger, graph_code_verbose, while only the final output graph remains logged to the original graph_code artifact.

Changes

- Added a new artifact logger: [graph_code_log = torch._logging.getArtifactLogger(__name__, "graph_code_verbose")]
- Updated all verbose/intermediate FX pass logs in [insert_deferred_runtime_asserts] to use the new graph_code_verbose artifact.
- Ensured that only the final output graph is logged to the original graph_code artifact.
- No changes to the FX pass logic or output—only logging behavior is affected.

Notes
This change is backward-compatible and does not affect the functional behavior of FX passes.
No changes to user-facing APIs.
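A usage sketch, assuming the new artifact is registered like the existing graph_code artifact (names taken from the description above):
```python
import torch._logging

# Enable the verbose FX-pass graphs in addition to the final output graph.
torch._logging.set_logs(graph_code=True, graph_code_verbose=True)
```
The equivalent environment-variable form would be something like `TORCH_LOGS="graph_code,graph_code_verbose"`.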

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153775
Approved by: https://github.com/williamwen42
2025-05-21 18:31:59 +00:00
d23f4ae7b5 s390x: use qemu issue workaround for runner registration too (#154030)
s390x: use qemu issue workaround for runner registration too.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154030
Approved by: https://github.com/seemethere
2025-05-21 18:30:25 +00:00
bb7e30c165 [MegaCache] Make MegaCache generic to allow external plugins registration (#152977)
Implements #152976

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152977
Approved by: https://github.com/oulgen
2025-05-21 18:18:47 +00:00
c31e239910 [precompile] Add BundledAOTAutogradCacheEntry (#152840)
Finally, this PR adds BundledAOTAutogradCacheEntry. A BundledAOTAutogradCacheEntry is an AOTAutogradCacheEntry that saves the entire CompiledFxGraph directly in the entry.

This has some advantages:
- No more dependency on FxGraphCache at all
- Clearing FxGraphCache does not result in AOTAutogradCache miss
- Simpler logic, as BundledAOTAutogradCacheEntry has everything you need to load a full compiled python wrapper from a dynamo output

We plan to use BundledAOTAutogradCacheEntry for precompile. There's also a question of whether we want to use it for regular caching — the main disadvantage of this is having to save the same CompiledFxGraph twice, once in Inductor cache and once for AOTAutogradCache. With MegaCaching, this *could* be a regression in total cache size (as well as a minor cold start regression, as you have to save the same graph twice). I will import this and measure the mega cache space complexity, and if it looks good I'll enable it by default for caching as well.

On warm start, if AOTAutogradCache hits, you won't have to load inductor at all, so warm start overhead should be unaffected.

Differential Revision: [D74593304](https://our.internmc.facebook.com/intern/diff/D74593304)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152840
Approved by: https://github.com/zhxchen17
2025-05-21 18:08:42 +00:00
3eb8fa081a Revert "[3/n][Optimus][Auto-AC] Support float8_e4m3fn quantization type and set scaling as the default (#153802)"
This reverts commit 32b1baa981fe53d13d77acbee509c51087abf107.

Reverted https://github.com/pytorch/pytorch/pull/153802 on behalf of https://github.com/malfet due to It breaks ROCM testing, see d23762974e/1 ([comment](https://github.com/pytorch/pytorch/pull/153802#issuecomment-2898695702))
2025-05-21 17:20:31 +00:00
d23762974e [inductor][cutlass backend] Add 2 stage autotuning aka prescreening (#153335)
Motivation:
By default, we are tuning the cutlass backend kernels on 3 swizzles. Swizzle is a runtime param, so the variants share the same underlying kernel, which saves a lot of compilation time. However, autotuning all combinations of {configs} x {swizzles} is still expensive.

Observations:
The winner of the {configs} x {swizzles} autotuning is the same as if we do a greedy search: first find the top X winners of {configs} with swizzle 2 (hardcoded), then autotune on the {top X winner configs} x {swizzles}. In other words, we can use a greedy algorithm to reduce autotuning time.
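A minimal sketch of the two-stage (prescreen + full) autotune idea described above; the function and parameter names are placeholders, not Inductor APIs:
```python
def two_stage_autotune(configs, swizzles, benchmark_ms, top_x=10, prescreen_swizzle=2):
    # Stage 1 (prescreen): rank every config with one hardcoded swizzle.
    prescreened = sorted(configs, key=lambda cfg: benchmark_ms(cfg, prescreen_swizzle))[:top_x]
    # Stage 2: full {top-X configs} x {swizzles} sweep, which is much smaller
    # than the original {configs} x {swizzles} search space.
    candidates = [(cfg, sw) for cfg in prescreened for sw in swizzles]
    return min(candidates, key=lambda cs: benchmark_ms(*cs))
```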

I attach the logs below. This somewhat depends on what X is, but a number like 5-10 works pretty well from empirical observations.

Logs:
Baseline:
https://gist.github.com/henrylhtsang/9a604f150a270dc19524f72a5d4dfac2
```
AUTOTUNE mm(2048x2048, 2048x2048)
strides: [2048, 1], [1, 2048]
dtypes: torch.bfloat16, torch.bfloat16
  cuda_cutlass_gemm_1776 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1777 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1778 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1800 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1801 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1802 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_9012 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_9013 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_9014 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8940 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8941 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8942 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8934 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8935 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8936 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_2001 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_2002 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_2003 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1848 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1849 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1850 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8964 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8965 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8966 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8958 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8959 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8960 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1929 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1930 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1931 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1770 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1771 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1772 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1953 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1954 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1955 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1995 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1996 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1997 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1794 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1795 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1796 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1842 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1843 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1844 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_9006 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_9007 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_9008 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1923 0.0306 ms 95.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
```

with prescreening:
```
AUTOTUNE mm(147456x6144, 6144x2048)
strides: [6144, 1], [2048, 1]
dtypes: torch.bfloat16, torch.bfloat16
  cutlass_1a5e81af 4.5469 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_aa6f899c 4.6328 ms 98.1% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_aa6f899c 4.6836 ms 97.1% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_161b8b81 4.7224 ms 96.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_161b8b81 4.7234 ms 96.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_161b8b81 4.7274 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_853b6347 4.7369 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_aa6f899c 4.7404 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_161b8b81 4.7711 ms 95.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_8bc6fbda 4.8148 ms 94.4% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_8bc6fbda 4.8159 ms 94.4% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_8bc6fbda 4.8214 ms 94.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_8bc6fbda 4.8302 ms 94.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_0a1c55af 4.8487 ms 93.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_0a1c55af 4.8527 ms 93.7% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_02780d72 4.8617 ms 93.5% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_0a1c55af 4.8737 ms 93.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_0a1c55af 4.8738 ms 93.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_02780d72 4.9348 ms 92.1% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_02780d72 4.9763 ms 91.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_853b6347 4.9805 ms 91.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_1a5e81af 5.0225 ms 90.5% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_853b6347 5.0271 ms 90.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_02780d72 5.0595 ms 89.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_853b6347 5.1434 ms 88.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_c1ffa14b 5.1574 ms 88.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_1a5e81af 5.1916 ms 87.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_c1ffa14b 5.2018 ms 87.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_c1ffa14b 5.2019 ms 87.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_c1ffa14b 5.2037 ms 87.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_1a5e81af 5.5329 ms 82.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_aa6f899c 11.5046 ms 39.5% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
SingleProcess AUTOTUNE benchmarking takes 1.9526 seconds and 0.0352 seconds precompiling for 32 choices
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153335
Approved by: https://github.com/eellison
2025-05-21 17:12:05 +00:00
72a3c8dfa8 [AOTI][reland] Add an option to specify custom op C shim (#153968)
Summary: Reland https://github.com/pytorch/pytorch/pull/153851 after fixing a fuzzer test issue.

Add an option to tell AOTInductor codegen to generate C shim functions for certain custom ops instead of relying on ProxyExecutor. The lib that defines the custom ops needs to implement the corresponding C shim functions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153968
Approved by: https://github.com/hl475
2025-05-21 15:57:57 +00:00
b7d08defe9 [BE]: Type previously untyped decorators (#153726)
This fixes decorator typing which unmasks a lot of typing issues in the codebase

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153726
Approved by: https://github.com/albanD
2025-05-21 15:56:19 +00:00
6c2c527cd6 [BE] Remove extra semicolons from SymmetricMemory.hpp (#154034)
Fixes
```
In file included from /Users/malfet/git/pytorch/pytorch/torch/csrc/distributed/c10d/SymmetricMemory.cpp:1:
/Users/malfet/git/pytorch/pytorch/torch/csrc/distributed/c10d/SymmetricMemory.hpp:77:4: warning: extra ';' after member function definition [-Wextra-semi]
   77 |   };
      |    ^
/Users/malfet/git/pytorch/pytorch/torch/csrc/distributed/c10d/SymmetricMemory.hpp:81:4: warning: extra ';' after member function definition [-Wextra-semi]
   81 |   };
      |    ^
2 warnings generated.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154034
Approved by: https://github.com/Skylion007
2025-05-21 14:33:30 +00:00
33767eb391 Add option to statically launch user defined triton kernels (#153725)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153725
Approved by: https://github.com/oulgen, https://github.com/Mingming-Ding, https://github.com/jansel
ghstack dependencies: #153565
2025-05-21 14:33:15 +00:00
b73d77900e [CI] Run limited h100 tests every 6 hours (#153900)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153900
Approved by: https://github.com/Skylion007
2025-05-21 13:40:03 +00:00
11f8511455 Update torch-xpu-ops commit pin (#153902)
Update the torch-xpu-ops commit pin to defce46ae7, which includes:

- Resolve the aten::gamma accuracy gap compared to scipy
- Optimize layernom_vectorized_impl by using adaptive wg selection for small shapes
- [Intro async flag and use current stream avoid stream sync](https://github.com/intel/torch-xpu-ops/pull/1546)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153902
Approved by: https://github.com/Skylion007, https://github.com/EikanWang
2025-05-21 13:29:41 +00:00
616e20b4f1 [Intel GPU] scalar tensor case handling in addmm, baddmm (#153051)
# Motivation
This PR adds scalar-tensor (`t.numel() == 1 and t.sizes().empty() == true and t.dim() == 0`) handling in addmm and baddbmm. The issue was found during the oneDNN upgrade: the new version of oneDNN requires the post-op binary (`self` in addmm) to have the same number of dimensions as the output tensor, so we now need to explicitly expand the shape of the `self` tensor. The former oneDNN version handled the broadcasting internally.

This PR could fix issues in https://github.com/intel/torch-xpu-ops/issues/1612 and CI error in https://github.com/pytorch/pytorch/pull/151767.

# Implementation
We treat the scalar tensor as a normal tensor by `unsqueeze`-ing it into a 1-D tensor. Combined with the existing shape-handling logic, it is then further `unsqueeze`d to a 2-D or 3-D shape.
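An illustrative repro of the scalar-tensor case (assumes an XPU build and device; the shapes are arbitrary):
```python
import torch

mat1 = torch.randn(4, 8, device="xpu")
mat2 = torch.randn(8, 16, device="xpu")
bias = torch.tensor(2.0, device="xpu")  # 0-dim scalar tensor passed as `self`

# `self` is broadcast against the (4, 16) output; per the description above it
# is unsqueezed to match the output rank before being handed to oneDNN as a
# post-op binary.
out = torch.addmm(bias, mat1, mat2)
print(out.shape)  # torch.Size([4, 16])
```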

# UT testing
```
python test/inductor/test_torchinductor_opinfo.py TestInductorOpInfoXPU.test_comprehensive_addmm_xpu_float32
python test/inductor/test_torchinductor_opinfo.py TestInductorOpInfoXPU.test_comprehensive_addmv_xpu_float32
python test/inductor/test_torchinductor_opinfo.py TestInductorOpInfoXPU.test_comprehensive_baddbmm_xpu_float16
python test/inductor/test_torchinductor_opinfo.py TestInductorOpInfoXPU.test_comprehensive_baddbmm_xpu_float32
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153051
Approved by: https://github.com/EikanWang, https://github.com/guangyey
2025-05-21 12:24:37 +00:00
afd7a13bca Migrate to new Windows Arm64 runners (#152099)
This PR moves the Windows Arm64 nightly jobs to the new runner image, see [arm-windows-11-image](https://github.com/actions/partner-runner-images/blob/main/images/arm-windows-11-image.md )

Fixes #151671

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152099
Approved by: https://github.com/seemethere
2025-05-21 09:13:15 +00:00
ffd49d538e [BE][Ez]: Improve typing in torch/modules/container.py (#153728)
Adds some missing type annotations

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153728
Approved by: https://github.com/albanD
2025-05-21 07:15:00 +00:00
a636a92ee9 [MTIA ATen Backend] Migrate "_unsafe_view" and "view" ops from out-of-tree to pytorch in-tree (#153670)
Summary:
# Context
The MTIA New Aten Backend work is essentially to move MTIA operators from PyTorch out-of-tree to in-tree, with the following benefits:
1. Avoid duplicate code copied from pytorch, e.g. view ops implementation, util functions.
2. Utilize TensorIterator and structured kernel codegen, avoid manual implementation of broadcasting, dtype casting, asserting, etc.
3. Eliminate MTIA's own codegen flow, which is unnecessary complexity.
4. Overall make MTIA's aten backend more pytorch native.

Differential Revision: D74672464

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153670
Approved by: https://github.com/albanD, https://github.com/nautsimon
2025-05-21 05:20:45 +00:00
dcb3edd30d [AOTI][XPU] Refactor AOTInductor runtime API for Intel GPU. (#153929)
Simplify and improve code format for sycl_runtime_wrappers.h

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153929
Approved by: https://github.com/desertfire
ghstack dependencies: #153924
2025-05-21 03:52:54 +00:00
531d8f5fb6 Revert "[cuBLAS][cuBLASLt] Use cuBLAS default workspace size in Lt (#153556)"
This reverts commit 2b43d635d31f1338743885efd1a259f43bd2ee65.

Reverted https://github.com/pytorch/pytorch/pull/153556 on behalf of https://github.com/eqy due to reverting, will add disable for reduced precision reduction ([comment](https://github.com/pytorch/pytorch/pull/153556#issuecomment-2896257521))
2025-05-21 02:09:11 +00:00
1478d0185c Revert "[CI][CUDA] Move cu118 distributed pull jobs to cu126, move cu124-sm75 to cu126-sm75 (#151594)"
This reverts commit 8cabd23b3d357ec38a400978bb5423efcb433f2a.

Reverted https://github.com/pytorch/pytorch/pull/151594 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to fail a distributed test in trunk ([comment](https://github.com/pytorch/pytorch/pull/151594#issuecomment-2896230131))
2025-05-21 01:45:20 +00:00
daa68e7a93 Update USE_XCCL option if USE_XPU is OFF (#153936)
# Motivation
Disable `USE_XCCL` when `USE_XPU` is turned `OFF` to ensure configuration consistency. This is required because XCCL depends on XPU functionality.
In particular, ensure that `USE_XCCL` is correctly set to `OFF` when [caffe2_update_option(USE_XPU OFF)](1075bb37d3/cmake/Dependencies.cmake (L97)) is invoked.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153936
Approved by: https://github.com/Skylion007
2025-05-21 01:32:41 +00:00
cf6e5d1881 Revert "[cuBLASLt] relax addmm cuBLASLt constraint (#153675)"
This reverts commit f9bb7cf72a43bec17b5fc2ccbe865aa130e760be.

Reverted https://github.com/pytorch/pytorch/pull/153675 on behalf of https://github.com/eqy due to incorrect, cuBLASLt doesnt handle beta != 1.0 but this appears untested ([comment](https://github.com/pytorch/pytorch/pull/153675#issuecomment-2896188784))
2025-05-21 01:20:10 +00:00
fe49b11e09 Add memory reporting for XPU to Memory Profiler (#152842)
Adds support for XPU `profile_memory` in the PyTorch Profiler.

Currently, when `profile_memory=True` is passed to `torch.profiler.profile`, there is no XPU memory reported. For example, the profiling table printed by the code below is missing any `XPU Mem` columns:

<details><summary>profiling.py</summary>
<p>

```python
import torch
import torch.nn as nn
import torch.optim as optim

from torch.profiler import profile, ProfilerActivity

class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.conv1 = nn.Conv1d(20,20,15,padding="same")
        self.flatten = nn.Flatten()
        self.net1 = nn.Linear(2048, 4096)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(4096, 5)

    def forward(self, x):
        res = self.conv1(x)
        res = self.flatten(res)
        res = self.net1(res)
        return self.net2(self.relu(res))

def demo_basic():
    model = ToyModel().to("xpu")
    loss_fn = nn.MSELoss().to("xpu")
    optimizer = optim.SGD(model.parameters(), lr=0.001)

    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.XPU], profile_memory=True) as prof:
        for epoch in range(10):
            optimizer.zero_grad()
            outputs = model(torch.randn(20, 2048).to("xpu"))
            labels = torch.randn(20, 5).to("xpu")
            loss_fn(outputs, labels).backward()
            optimizer.step()
    print(prof.key_averages().table(max_name_column_width=100, sort_by="xpu_time_total", row_limit=100))

if __name__ == "__main__":
    demo_basic()
```
</p>
</details>

```
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg      Self XPU    Self XPU %     XPU total  XPU time avg       CPU Mem  Self CPU Mem    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                            gemm_kernel         0.00%       0.000us         0.00%       0.000us       0.000us       1.501ms        44.73%       1.501ms      25.024us           0 b           0 b            60
    autograd::engine::evaluate_function: AddmmBackward0         0.12%       1.067ms        30.47%     260.929ms      13.046ms       0.000us         0.00%       1.009ms      50.448us           0 b           0 b            20
                                         AddmmBackward0         0.09%     744.983us        15.99%     136.944ms       6.847ms       0.000us         0.00%     784.640us      39.232us           0 b           0 b            20
                                               aten::mm        15.41%     131.956ms        15.79%     135.167ms       3.379ms     784.640us        23.37%     784.640us      19.616us           0 b           0 b            40
                                           aten::linear         0.02%     156.361us        20.58%     176.187ms       8.809ms       0.000us         0.00%     741.760us      37.088us           0 b           0 b            20
                                            aten::addmm        20.25%     173.371ms        20.52%     175.723ms       8.786ms     741.760us        22.10%     741.760us      37.088us           0 b           0 b            20
                                Optimizer.step#SGD.step         0.40%       3.429ms         5.55%      47.509ms       4.751ms       0.000us         0.00%     488.960us      48.896us           0 b           0 b            10
                                    aten::_foreach_add_         4.81%      41.162ms         5.15%      44.080ms       4.408ms     488.960us        14.57%     488.960us      48.896us           0 b           0 b            10
at::native::xpu::MultiTensorApplyKernelFunctor<at::n...         0.00%       0.000us         0.00%       0.000us       0.000us     422.880us        12.60%     422.880us      42.288us           0 b           0 b            10
autograd::engine::evaluate_function: ConvolutionBack...         0.03%     280.041us         4.36%      37.328ms       3.733ms       0.000us         0.00%     356.320us      35.632us           0 b           0 b            10
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 856.227ms
Self XPU time total: 3.357ms
```

This PR updates the XPUCachingAllocator.cpp to report allocation events to the Profiler, and causes these to be printed in the table:
```
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg      Self XPU    Self XPU %     XPU total  XPU time avg       CPU Mem  Self CPU Mem       XPU Mem  Self XPU Mem    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                            gemm_kernel         0.00%       0.000us         0.00%       0.000us       0.000us       1.436ms        43.64%       1.436ms      23.939us           0 b           0 b           0 b           0 b            60
    autograd::engine::evaluate_function: AddmmBackward0         0.13%       1.186ms        29.92%     262.875ms      13.144ms       0.000us         0.00%       1.005ms      50.272us           0 b           0 b     320.94 Mb      -4.69 Mb            20
                                         AddmmBackward0         0.09%     815.288us        16.48%     144.802ms       7.240ms       0.000us         0.00%     790.720us      39.536us           0 b           0 b     325.47 Mb           0 b            20
                                               aten::mm        15.86%     139.342ms        16.26%     142.875ms       3.572ms     790.720us        24.03%     790.720us      19.768us           0 b           0 b     325.47 Mb     325.47 Mb            40
                                           aten::linear         0.02%     182.856us        20.46%     179.775ms       8.989ms       0.000us         0.00%     669.440us      33.472us           0 b           0 b       3.13 Mb           0 b            20
                                            aten::addmm        20.10%     176.607ms        20.40%     179.210ms       8.961ms     669.440us        20.34%     669.440us      33.472us           0 b           0 b       3.13 Mb       3.13 Mb            20
                                Optimizer.step#SGD.step         0.42%       3.692ms         5.61%      49.267ms       4.927ms       0.000us         0.00%     486.640us      48.664us           0 b           0 b           0 b           0 b            10
                                    aten::_foreach_add_         4.83%      42.439ms         5.19%      45.574ms       4.557ms     486.640us        14.79%     486.640us      48.664us           0 b           0 b           0 b     -20.00 Kb            10
at::native::xpu::MultiTensorApplyKernelFunctor<at::n...         0.00%       0.000us         0.00%       0.000us       0.000us     420.960us        12.79%     420.960us      42.096us           0 b           0 b           0 b           0 b            10
autograd::engine::evaluate_function: ConvolutionBack...         0.04%     310.719us         4.47%      39.279ms       3.928ms       0.000us         0.00%     339.520us      33.952us           0 b           0 b      -2.89 Mb      -3.12 Mb            10
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 878.627ms
Self XPU time total: 3.291ms
```

These XPU memory numbers match the same profiling results on CUDA.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152842
Approved by: https://github.com/guangyey, https://github.com/sraikund16
2025-05-21 01:19:19 +00:00
8817e5ac80 Render Example: and not Example:: in docs (#153978)
Everything here is a grep except the changes in tools/autograd/load_derivatives.py which I manually corrected.

The correct notation is:
```
Example::

    >>> ...
```

It is common and wrong to have:
```
Example::
    >>> ...
```

In the wrong example, we get these pesky double colons:
![image](https://github.com/user-attachments/assets/20ffd349-68bb-4552-966c-e23923350476)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153978
Approved by: https://github.com/soulitzer, https://github.com/malfet
2025-05-21 01:03:26 +00:00
0959869683 Bump triton pin for the release 3.3.1 of triton (#153951)
Triton is pointing to the latest triton pin: https://github.com/triton-lang/triton/tree/release/3.3.x
XPU is pointing to the latest XPU pin: https://github.com/intel/intel-xpu-backend-for-triton/commits/release/3.3.x/

This version contains the fix for: Compilation Issue for RTX 5090 GPUs with Compute Capability = 120. https://github.com/triton-lang/triton/pull/6771
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153951
Approved by: https://github.com/davidberard98
2025-05-21 00:27:39 +00:00
32b1baa981 [3/n][Optimus][Auto-AC] Support float8_e4m3fn quantization type and set scaling as the default (#153802)
Summary:
1. Customers can now test with float8_e4m3fn.
2. To play it safe, we set the scaling version as the default.
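A hedged sketch of the float8_e4m3fn dtype mechanics with a per-tensor scale (illustrative only; this is not the Optimus Auto-AC implementation or its scaling recipe):
```python
import torch

x = torch.randn(64, 64)
# Per-tensor scale so the largest magnitude maps to the fp8 max (~448 for e4m3fn).
scale = x.abs().max() / torch.finfo(torch.float8_e4m3fn).max
x_fp8 = (x / scale).to(torch.float8_e4m3fn)
x_hat = x_fp8.to(torch.float32) * scale
print((x - x_hat).abs().max())  # quantization error
```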

Test Plan:
### unit test
```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:quantization
```

Buck UI: https://www.internalfb.com/buck2/f679f362-8bf4-454c-87df-a85cbc2ab2a8
Test UI: https://www.internalfb.com/intern/testinfra/testrun/5066549861047443
Network: Up: 16KiB  Down: 3.9MiB  (reSessionID-98badbfd-76f7-487f-ab1c-1ec4f850614d)
Analyzing targets. Remaining     0/281
Executing actions. Remaining     0/5957                                                                                                   7.3s exec time total
Command: test.     Finished 3 local, 1 remote
Time elapsed: 1:29.7s
Tests finished: Pass 3. Fail 0. Fatal 0. Skip 0. Build failure 0

Differential Revision: D74910193

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153802
Approved by: https://github.com/nareshrajkumar866, https://github.com/Hahu803, https://github.com/Mingming-Ding
2025-05-21 00:21:54 +00:00
58dc80dff6 [MPSInductor] Fix indexing calculation (#153997)
By using the `c10::metal::floor_divide` primitive.

This fixes the `test_flip_cat_mps` test, and makes `doctr_reco_predictor` and `doctr_det_predictor` pass accuracy checks (at least locally; scheduled a workflow dispatch to validate it in CI).

Before this change following script generated different compile and eager results
```python
import torch

def foo(unsqueeze, unsqueeze_1):
    cat_1 = torch.ops.aten.cat.default([unsqueeze, unsqueeze_1], 1)
    view = torch.ops.aten.view.default(cat_1, [4])
    slice_5 = torch.ops.aten.slice.Tensor(view, 0, 0, 3)
    rev_1 = torch.ops.aten.flip.default(slice_5, [0])
    return rev_1

if __name__ == "__main__":
    x = torch.arange(1.0, 3.0, device='mps').reshape(2, 1)
    y = torch.arange(5.0, 7.0, device='mps').reshape(2, 1)

    rc, (kernel,) = torch._inductor.utils.run_and_get_kernels(torch.compile(foo), x, y)
    print(kernel)
    print("Compile: ", rc)
    print("Eager: ", foo(x, y))
```
After this change
```
'''
    #include <c10/metal/utils.h>
    kernel void generated_kernel(
        device float* out_ptr0,
        constant float* in_ptr0,
        constant float* in_ptr1,
        uint xindex [[thread_position_in_grid]]
    ) {
        int x0 = xindex;
        auto tmp6 = in_ptr0[1 + (c10::metal::floor_divide((-1)*x0, 2))];
        auto tmp11 = in_ptr1[1 + (c10::metal::floor_divide((-1)*x0, 2))];
        auto tmp0 = (2 + ((-1)*x0)) % (2);
        auto tmp1 = static_cast<long>(tmp0);
        auto tmp2 = 0;
        auto tmp3 = tmp1 >= tmp2;
        auto tmp4 = 1;
        auto tmp5 = tmp1 < tmp4;
        auto tmp7 = tmp5 ? tmp6 : 0.0;
        auto tmp8 = tmp1 >= tmp4;
        auto tmp9 = 2;
        auto tmp10 = tmp1 < tmp9;
        auto tmp12 = tmp8 ? tmp11 : 0.0;
        auto tmp13 = tmp5 ? tmp7 : tmp12;
        out_ptr0[x0] = static_cast<float>(tmp13);
    }
'''
Compile:  tensor([2., 5., 1.], device='mps:0')
Eager:  tensor([2., 5., 1.], device='mps:0')
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153997
Approved by: https://github.com/dcci
ghstack dependencies: #153970, #153971
2025-05-21 00:03:46 +00:00
fc33da410f Add torch/header_only_apis.txt and enforce they're tested (#153635)
This PR adds enforcement of testing header only APIs.

The benefit of torch/header_only_apis.txt is twofold:
1) this gives us a clear view of what we expect to be header only
2) this allows us to enforce testing

The enforcement added in this PR is very basic--we literally string match that a symbol in `torch/header_only_apis.txt` is in a cpp test. This is meant to be a first step in verifying our APIs are properly tested and can get fancier over time. For now, I've added myself as a codeowner to learn what to look out for in terms of proper tests. Over time, I anticipate we can automate more steps, but right now let's just get something out the door.
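A toy sketch of what such a string-match check could look like (the paths, file globs, and txt format here are assumptions, not the actual lint script):
```python
from pathlib import Path

symbols = [
    line.strip()
    for line in Path("torch/header_only_apis.txt").read_text().splitlines()
    if line.strip() and not line.startswith("#")
]
test_sources = "\n".join(p.read_text() for p in Path("test/cpp").rglob("*.cpp"))

untested = [sym for sym in symbols if sym not in test_sources]
assert not untested, f"header-only APIs without a cpp test mention: {untested}"
```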

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153635
Approved by: https://github.com/albanD
ghstack dependencies: #153965
2025-05-20 23:42:24 +00:00
41a9aa6564 Remove janky (though at times useful) dlclose test (#153975)
This test was never the shining star in class but it helped check that we can properly delete a stable library. But now that we are running it in CI this is not a good test to annoy people with as dlclose + parallelism is likely not the move. I will miss it locally though.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153975
Approved by: https://github.com/jbschlosser
2025-05-20 23:26:42 +00:00
7b7604fdb4 Revert "[inductor][cutlass backend] Add 2 stage autotuning aka prescreening (#153335)"
This reverts commit 0c04492e3b142854fad8356a2a4d74f12e2c6c5d.

Reverted https://github.com/pytorch/pytorch/pull/153335 on behalf of https://github.com/malfet due to Breaks lint, see 3742b7fb3a/1 ([comment](https://github.com/pytorch/pytorch/pull/153335#issuecomment-2896031661))
2025-05-20 23:12:11 +00:00
3742b7fb3a Treat dim=[] same as dim=None (#153570)
Fixes https://github.com/pytorch/pytorch/issues/153568

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153570
Approved by: https://github.com/ngimel
2025-05-20 22:44:29 +00:00
f7b8eadd9d Add codeowner for merge rules (#152354)
To ensure changes to merge rights are properly reviewed
Also make the codeowner file valid by removing invalid users
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152354
Approved by: https://github.com/malfet
2025-05-20 22:24:23 +00:00
0c04492e3b [inductor][cutlass backend] Add 2 stage autotuning aka prescreening (#153335)
Motivation:
By default, we tune the cutlass backend kernels on 3 swizzles. Swizzles are runtime params, so the variants share the same underlying kernel, which saves a lot of compilation time. However, autotuning all combinations of {configs} x {swizzles} is still expensive.

Observations:
The winner of the full {configs} x {swizzles} autotuning is the same as the winner of a greedy search: first find the top X winners of {configs} with swizzle 2 (hardcoded), then autotune on {top X winner configs} x {swizzles}. In other words, we can use a greedy algorithm to reduce autotuning time.
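
To make the greedy idea concrete, here is a minimal sketch under stated assumptions: `benchmark`, the config list, and the swizzle list are stand-ins for the real cutlass backend machinery, not the actual implementation.
```python
# Illustrative two-stage (greedy) autotune: prescreen all configs at one
# hardcoded swizzle, then fully autotune only the top-X configs.
def two_stage_autotune(configs, swizzles, benchmark, top_x=8, prescreen_swizzle=2):
    # benchmark(config, swizzle) -> measured runtime in ms (stand-in for real timing)
    prescreened = sorted(configs, key=lambda cfg: benchmark(cfg, prescreen_swizzle))
    finalists = prescreened[:top_x]
    candidates = [(cfg, sw) for cfg in finalists for sw in swizzles]
    return min(candidates, key=lambda pair: benchmark(*pair))
```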

Logs are attached below. The outcome somewhat depends on what X is, but empirically a value of 5-10 works well.

Logs:
Baseline:
https://gist.github.com/henrylhtsang/9a604f150a270dc19524f72a5d4dfac2
```
AUTOTUNE mm(2048x2048, 2048x2048)
strides: [2048, 1], [1, 2048]
dtypes: torch.bfloat16, torch.bfloat16
  cuda_cutlass_gemm_1776 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1777 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1778 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1800 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1801 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1802 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_9012 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_9013 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_9014 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8940 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8941 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8942 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8934 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8935 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8936 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_2001 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_2002 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_2003 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1848 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1849 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1850 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8964 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8965 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8966 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8958 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8959 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8960 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1929 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1930 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1931 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1770 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1771 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1772 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1953 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1954 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1955 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1995 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1996 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1997 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1794 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1795 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1796 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1842 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1843 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1844 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_9006 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_9007 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_9008 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1923 0.0306 ms 95.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
```

with prescreening:
```
AUTOTUNE mm(147456x6144, 6144x2048)
strides: [6144, 1], [2048, 1]
dtypes: torch.bfloat16, torch.bfloat16
  cutlass_1a5e81af 4.5469 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_aa6f899c 4.6328 ms 98.1% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_aa6f899c 4.6836 ms 97.1% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_161b8b81 4.7224 ms 96.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_161b8b81 4.7234 ms 96.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_161b8b81 4.7274 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_853b6347 4.7369 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_aa6f899c 4.7404 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_161b8b81 4.7711 ms 95.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_8bc6fbda 4.8148 ms 94.4% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_8bc6fbda 4.8159 ms 94.4% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_8bc6fbda 4.8214 ms 94.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_8bc6fbda 4.8302 ms 94.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_0a1c55af 4.8487 ms 93.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_0a1c55af 4.8527 ms 93.7% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_02780d72 4.8617 ms 93.5% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_0a1c55af 4.8737 ms 93.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_0a1c55af 4.8738 ms 93.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_02780d72 4.9348 ms 92.1% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_02780d72 4.9763 ms 91.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_853b6347 4.9805 ms 91.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_1a5e81af 5.0225 ms 90.5% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_853b6347 5.0271 ms 90.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_02780d72 5.0595 ms 89.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_853b6347 5.1434 ms 88.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_c1ffa14b 5.1574 ms 88.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_1a5e81af 5.1916 ms 87.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_c1ffa14b 5.2018 ms 87.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_c1ffa14b 5.2019 ms 87.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_c1ffa14b 5.2037 ms 87.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_1a5e81af 5.5329 ms 82.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_aa6f899c 11.5046 ms 39.5% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
SingleProcess AUTOTUNE benchmarking takes 1.9526 seconds and 0.0352 seconds precompiling for 32 choices
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153335
Approved by: https://github.com/eellison
2025-05-20 22:19:02 +00:00
2c2524f74b [AOTI] Generate unique cubin file names when package_cpp_only (#153948)
Summary:
* When package_cpp_only is specified, generate kernel file names with unique kernel names to make the final packaged files more readable. Assert on unique_kernel_names in case it was somehow explicitly set to False.
* Fix a rocm test skip, see https://github.com/pytorch/pytorch/pull/153828

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153948
Approved by: https://github.com/angelayi, https://github.com/yushangdi
2025-05-20 22:07:53 +00:00
8cabd23b3d [CI][CUDA] Move cu118 distributed pull jobs to cu126, move cu124-sm75 to cu126-sm75 (#151594)
This PR moves distributed cuda CI job from cuda 11.8 to cuda 12.6.
In doing so, a few unit test failures were exposed; some if not all of them would take a while to root-cause and fix, so they are temporarily skipped after filing the issues below.

https://github.com/pytorch/pytorch/issues/153479 test_nan_assert tricky behavior (e.g. skip_but_pass_in_sandcastle, ubuntu 20.04 does not work, ubuntu 22.04 works, Amazon Linux 2023 skip - what is Sandcastle OS?)
https://github.com/pytorch/pytorch/issues/153122 CUDA context related
https://github.com/pytorch/pytorch/issues/153517  NCCL regression, future NCCL may fix it

See: https://github.com/pytorch/pytorch/issues/147383

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151594
Approved by: https://github.com/eqy, https://github.com/atalman, https://github.com/cyyever
2025-05-20 21:56:47 +00:00
2b43d635d3 [cuBLAS][cuBLASLt] Use cuBLAS default workspace size in Lt (#153556)
Also enables unified workspaces by default for non-FBCODE use cases.
Default Lt workspace size is also updated to match cuBLAS logic for default, including for Blackwell (SM 10.0) and GeForce Blackwell (SM 12.0).

Recommended defaults are documented here:
https://docs.nvidia.com/cuda/cublas/#cublassetworkspace

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153556
Approved by: https://github.com/Skylion007, https://github.com/ngimel
2025-05-20 21:51:49 +00:00
aeb734f519 [nativert] Move GraphSignature to pytorch core (#152969)
Summary:
Torch Native Runtime RFC: https://github.com/pytorch/rfcs/pull/72

Added an in-memory representation for input and output specs of a graph. The GraphSignature class models the input and output specs of an exported graph produced by torch.export, which holds the graph information deserialized from the pt2 archive package. Runtime relies on the GraphSignature for weight name lookup and weight loading.

The serialization schema is defined in torch/_export/serde/schema.py
See more at: https://docs.pytorch.org/docs/stable/export.html#torch.export.ExportGraphSignature
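
For context, the Python side already exposes these specs via torch.export; a small, self-contained example (the module and shapes are made up for illustration):
```python
import torch

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(4, 2)

    def forward(self, x):
        return self.linear(x)

ep = torch.export.export(M(), (torch.randn(3, 4),))
# ExportGraphSignature describes the graph's inputs (parameters, buffers, user
# inputs) and outputs; the C++ GraphSignature mirrors this for the native runtime.
print(ep.graph_signature)
```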

Test Plan: Added tests under `test/cpp/nativert/test_graph_signature.cpp`

Differential Revision: D73895378

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152969
Approved by: https://github.com/swolchok
2025-05-20 21:49:56 +00:00
8f943046f8 [BE] light cleanups to linter logic (#153965)
Some BE cleanup on other lint things I saw while working on the top of this stack.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153965
Approved by: https://github.com/soulitzer
2025-05-20 21:28:48 +00:00
deaf6c2f2f Address the ignored warnings for -Wmissing-field-initializers in the file fbcode/caffe2/aten/src/ATen/native/cuda/RowwiseScaledMM.cu (#153958)
Summary:
the error message  https://www.internalfb.com/sandcastle/workflow/698057942249983018/artifact/actionlog.698057942382778255.stderr.1?selectedLines=66-66-70-148 from D74892646

When switching the host compiler to Clang, maybe we should only silence these warnings in this file.

Test Plan: sandcastle_green

Differential Revision: D75029051

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153958
Approved by: https://github.com/Skylion007, https://github.com/eqy
2025-05-20 21:25:56 +00:00
6cb7e4b5a5 [EZ] Update mps xfail reason (#153971)
cummin is not implemented

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153971
Approved by: https://github.com/dcci, https://github.com/jansel
ghstack dependencies: #153970
2025-05-20 21:15:14 +00:00
03859242ce [Testing] Fix test_deterministic_... on MPS (#153970)
By decorating emitted kernels with `'''` rather than `"""`

To match the regex in `torch._inductor.utils.run_and_get_kernels`.
This fixes `test_deterministic_codegen_mps`, `test_deterministic_codegen_on_graph_break_mps` and `test_deterministic_codegen_with_suffix_mps`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153970
Approved by: https://github.com/dcci, https://github.com/jansel
2025-05-20 21:15:14 +00:00
3aa95b252a Fix test_side_stream_backward_overlap flakiness (#153963)
Fixes https://github.com/pytorch/pytorch/issues/153927

Although the autograd backward should always execute SideBw before MainBw, there is still a small chance the recorded events won't be in that order.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153963
Approved by: https://github.com/janeyx99, https://github.com/Skylion007
ghstack dependencies: #151079, #153412
2025-05-20 21:02:56 +00:00
500a710422 Revert "Fixed an issue with XPU skip so the test_decompose_mem_bound_mm.py suite can be ran correctly (#153245)"
This reverts commit 2e56ce097a201ff3c69610cea953a9efce17d1b1.

Reverted https://github.com/pytorch/pytorch/pull/153245 on behalf of https://github.com/yangw-dev due to tests failed internally [D75078034](https://www.internalfb.com/diff/D75078034) ([comment](https://github.com/pytorch/pytorch/pull/153245#issuecomment-2895785642))
2025-05-20 20:45:55 +00:00
179e7d8624 Fix vs2022 caused AVX512 illegal instruction issue. (#153480)
Fixes #145702

Add `/d2implyavx512upperregs-` to disable an over-aggressive compiler optimization that caused AVX512 registers to be used on AVX2 machines.

Reference to: https://github.com/pytorch/pytorch/issues/145702#issuecomment-2874029459

Local test passed:
[Local test screenshot: https://github.com/user-attachments/assets/26f4cb91-6bb5-416f-aa35-c899eb1489b2]

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153480
Approved by: https://github.com/Blackhex, https://github.com/cyyever, https://github.com/atalman
2025-05-20 20:37:00 +00:00
996c4d803d Removing conda references from PyTorch Docs (#152702)
Addresses #148339

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152702
Approved by: https://github.com/svekars, https://github.com/albanD, https://github.com/atalman
2025-05-20 20:33:28 +00:00
05bc78e64f [submodule] Update fbgemm pinned version (#153950)
Summary:
Update the fbgemm pinned version in PyTorch.
Related update in fbgemm: D74434751

Included changes:
- Update fbgemm external dependencies directory in setup.py
- Add DISABLE_FBGEMM_AUTOVEC flag to disable fbgemm's autovec

Test Plan: PyTorch OSS CI

Differential Revision: D75073516

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153950
Approved by: https://github.com/Skylion007, https://github.com/ngimel
2025-05-20 20:24:27 +00:00
eqy
823a35807c [CUDA][CUDNN] Dispatch to cuDNN for non-batch-splittable 64-bit NCHW convolutions (#153101)
For #152816

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153101
Approved by: https://github.com/Skylion007
2025-05-20 20:19:03 +00:00
e8f8baf71f set CUDA_MODULE_LOADING for older drivers only (#152695)
`CUDA_MODULE_LOADING=LAZY` is the default for all drivers shipped with CUDA >=12.2 and we should check the driver version before setting the env variable.

(the `LOG(WARNING)` has to be removed before merging)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152695
Approved by: https://github.com/malfet, https://github.com/atalman, https://github.com/nWEIdia
2025-05-20 19:34:40 +00:00
7587350458 Make python_agnostic cpp extension tests standalone (#153274)
Related: #148920

This PR:
* Introduces a new file `test/cpp_extensions/python_agnostic_extension/test/test_python_agnostic.py` with testing that follows the usual python testing patterns
    * This replaces the testing for python_agnostic in `test/test_cpp_extensions_aot.py`

After this PR, it is now possible to run:
```
python test/cpp_extensions/python_agnostic_extension/test/test_python_agnostic.py
```

and the test will build the prerequisite wheel before running the tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153274
Approved by: https://github.com/janeyx99, https://github.com/cyyever
ghstack dependencies: #153264
2025-05-20 19:18:09 +00:00
3ecd444004 Support independent builds for cpp extension tests + apply to libtorch_agnostic tests (#153264)
Related: #148920

This PR:
* Provides a helper `install_cpp_extension(extension_root)` for building C++ extensions. This is intended to be used in `TestMyCppExtension.setUpClass()`
    * Updates libtorch_agnostic tests to use this
* Deletes preexisting libtorch_agnostic tests from `test/test_cpp_extensions_aot.py`
    * Fixes `run_test.py` to actually run tests in `test/cpp_extensions/libtorch_agnostic_extension/test/test_libtorch_agnostic.py` to avoid losing coverage. This wasn't being run due to logic excluding tests that start with "cpp"; this is fixed now

After this PR, it is now possible to run:
```
python test/cpp_extensions/libtorch_agnostic_extension/test/test_libtorch_agnostic.py
```

and the test will build the `libtorch_agnostic` extension before running the tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153264
Approved by: https://github.com/janeyx99
2025-05-20 19:18:09 +00:00
f1f54c197d [c10d] Simplify new_subgroups() by using new_subgroups_by_enumeration() (#153843)
Summary: The code changes in each file of the diff include removing the `subgroups` and `cur_subgroup` variables, and replacing the while loop with a call to `new_subgroups_by_enumeration()`.
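
For reference, a hedged sketch of the two APIs involved (it assumes an already-initialized default process group with world_size=4; shown for API shape only, not as the new implementation):
```python
import torch.distributed as dist

def make_pairs():
    # Requires an initialized default process group (e.g. via init_process_group).
    # Before this change, new_subgroups() built each subgroup in a manual while
    # loop; now it delegates to new_subgroups_by_enumeration() with the same
    # rank layout. With world_size=4 and group_size=2, these are equivalent:
    cur_subgroup, subgroups = dist.new_subgroups(group_size=2)
    # cur_subgroup, subgroups = dist.new_subgroups_by_enumeration([[0, 1], [2, 3]])
    return cur_subgroup, subgroups
```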

Test Plan: contbuild & OSS CI

Differential Revision: D75007368

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153843
Approved by: https://github.com/Skylion007, https://github.com/wz337
2025-05-20 19:15:20 +00:00
2d20106922 [inductor] Support cutlass backend with remote execution (#153844)
Meta-internal builds need to use RE to build with nvcc, since the
trainers do not have nvcc (and its attendant build toolchain) installed.

This diff enables building using an RE service (via the same code path used for
Triton)

Differential Revision: [D74907192](https://our.internmc.facebook.com/intern/diff/D74907192/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153844
Approved by: https://github.com/henrylhtsang
2025-05-20 19:05:23 +00:00
e0f8174001 [triton][fb] Move build_paths into triton_utils (#153652)
Summary: TSA, this is just a small cleanup

Test Plan: CI

Differential Revision: D74835506

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153652
Approved by: https://github.com/Skylion007
2025-05-20 18:59:50 +00:00
f9bb7cf72a [cuBLASLt] relax addmm cuBLASLt constraint (#153675)
`beta == 1.0` doesn't seem to be required anymore

https://github.com/pytorch/pytorch/issues/153590

`self.dim() == 1` restriction seems to still hold but not sure if that's due to a lack of handling on the PyTorch side or the cuBLASLt side, will investigate

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153675
Approved by: https://github.com/Skylion007
2025-05-20 18:43:38 +00:00
7c9d94e9bb Redirect mobile_optimizer.rst to executorch (#153664)
Redirect mobile_optimizer.rst to executorch

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153664
Approved by: https://github.com/byjlw, https://github.com/malfet
2025-05-20 18:13:45 +00:00
0087f5f0af [AOTI][XPU] Embed SPIR-V files into .so (#153924)
Following the design of #150739, this PR supports embedding kernel SPIR-V files so AOTI is one step closer to generating a single binary.
Fixes #153829
Fixes #153830

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153924
Approved by: https://github.com/desertfire
2025-05-20 17:38:53 +00:00
335c89c6f1 [Monitoring] enable local logs and add mac test monitoring (#153454)
Enable running the upload utilization logic using a local pointer instead of reading from S3; this could be useful for ROCm too.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153454
Approved by: https://github.com/huydhn
2025-05-20 17:14:40 +00:00
b910d37ec6 [cutlass backend] Reduce log level for cutlass runtime error (#153457)
Want to make sure we always call self.cleanup_run_fn() even if we crash.

I think this is the reason why sometimes we get
```
in _dlclose
TypeError: 'NoneType' object is not callable
```

Differential Revision: [D74629230](https://our.internmc.facebook.com/intern/diff/D74629230/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153457
Approved by: https://github.com/ColinPeppler
2025-05-20 17:03:17 +00:00
6b5b69a468 [Memory Snapshot] Fix RecordFunction Callback Handling (#153839)
Fixes #153571
Summary:
1. Set annotation callback to global to include all threads
2. Only init callbacks when enable == true and callbacks are empty under mutex
3. When enable == false, check if callbacks are present and if so remove them and set handle to 0 under mutex

We don't expect memory snapshots to be called from several different threads (they are almost always called just from main), but we make sure to add thread safety in the off chance that users do want to call them from different points of entry.
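
For context, a typical snapshot flow that exercises these callbacks might look like the sketch below (it uses the private `_record_memory_history`/`_dump_snapshot` APIs and requires CUDA; the workload is arbitrary):
```python
import torch

if torch.cuda.is_available():
    # Start recording allocation history; this registers the annotation
    # callbacks whose registration/removal this change guards with a mutex.
    torch.cuda.memory._record_memory_history()
    x = torch.randn(1024, 1024, device="cuda") @ torch.randn(1024, 1024, device="cuda")
    torch.cuda.memory._dump_snapshot("snapshot.pickle")
    # Stop recording and remove the callbacks.
    torch.cuda.memory._record_memory_history(enabled=None)
```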

Test Plan: Ran basic snapshot and saw that the callbacks were registered properly

Reviewed By: ngimel

Differential Revision: D74771491

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153839
Approved by: https://github.com/ngimel, https://github.com/Skylion007
2025-05-20 17:01:00 +00:00
ddfaab3b56 [aoti] Reset expr when generating cpp code (#153898)
Maybe fixes https://github.com/pytorch/pytorch/issues/153896

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153898
Approved by: https://github.com/desertfire
2025-05-20 16:31:25 +00:00
5163bf0069 [CUDA][cuBLAS][cuBLASLt] avoid polluting prefer cuBLAS/Lt setting across tests (#153655)
Some tests may not set the preferred backend, which leads to unexpected behavior when multiple tests are run vs. standalone

Tests that should exercise both backends should explicitly parametrize this setting

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153655
Approved by: https://github.com/ngimel
2025-05-20 16:18:35 +00:00
a7c01d7f13 [Inductor] Subgraph check output strides (#153755)
Make sure the output strides of the subgraph are consistent with the original gm. Without checking strides, it was possible for the subgraph to produce NaNs via a reinterpret_tensor applied to the subgraph output, which itself was not contiguous.

Differential Revision: [D74691119](https://our.internmc.facebook.com/intern/diff/D74691119/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153755
Approved by: https://github.com/eellison
ghstack dependencies: #153754
2025-05-20 16:07:18 +00:00
63e5d46478 [Inductor] Subgraph support dynamic input expressions (#153754)
Support subgraph choices taking inputs that have dynamic dimensions. Tested with the decomposeK subgraph decomposition.

Differential Revision: [D74484741](https://our.internmc.facebook.com/intern/diff/D74484741/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153754
Approved by: https://github.com/eellison
2025-05-20 16:07:18 +00:00
2e56ce097a Fixed an issue with XPU skip so the test_decompose_mem_bound_mm.py suite can be ran correctly (#153245)
Fixes #153239

Replaced the custom decorator with the common one, although a better way to skip the whole suite would be to add it to the skip list in run_test.py.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153245
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/jeffdaily
2025-05-20 15:46:21 +00:00
ef958fa152 [cuDNN][cuDNN frontend] upgrade cuDNN frontend submodule to 1.12 (#153888)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153888
Approved by: https://github.com/Skylion007
2025-05-20 15:08:37 +00:00
3102ae6798 Revert "[AOTI] Add an option to specify custom op C shim (#153851)"
This reverts commit 365ac49840105918c604a6b1c7e81c1ca59e37fb.

Reverted https://github.com/pytorch/pytorch/pull/153851 on behalf of https://github.com/malfet due to Looks like it broke fuzzer test, but I could be wrong, see c4d1ff02f8/1 ([comment](https://github.com/pytorch/pytorch/pull/153851#issuecomment-2894619773))
2025-05-20 14:23:50 +00:00
c4d1ff02f8 [Lint] Update clang-format to 19.1.4 (#153889)
All changes other than the one to `tools/linter/adapters/s3_init_config.json` are generated by newer clang-format
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153889
Approved by: https://github.com/cyyever, https://github.com/atalman
2025-05-20 14:12:46 +00:00
d869ea11e0 [BE]: Update fmtlib submodule to 11.2.0 (#153853)
Update fmtlib to 11.2.0 with a lot of miscellaneous fixes for various compilers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153853
Approved by: https://github.com/malfet
2025-05-20 14:11:18 +00:00
4b759d98f8 Recheck autotune cache on static cuda launcher load (#153565)
When loading statically launchable triton kernels from FxGraphCache, since we don't instantiate a CachingAutotuner like we do normally, we need to recheck the autotune cache based on the existing compile results. If we get a hit, we take the compile result whose config matches the best config.

Sometimes, the best config will have come from coordinate descent tuning. In this case, FxGraphCache today does not cache the resulting triton kernel, with or without the static cuda launcher, because coordinate descent tuning happens at runtime and the best config may not be one of the precompiled configs.
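
A rough sketch of the reload path described above (all names here are illustrative stand-ins, not the actual FxGraphCache internals):
```python
# Hypothetical sketch: on a cache hit, re-consult the autotune cache and pick
# the precompiled result whose config matches the remembered best config.
def pick_static_launcher_result(compile_results, autotune_cache_entry):
    if autotune_cache_entry is None:
        return None  # no autotune cache hit; fall back to re-autotuning
    best_config = autotune_cache_entry.best_config
    for result in compile_results:
        if result.config == best_config:
            return result
    # The best config came from coordinate-descent tuning and was never
    # precompiled, so there is nothing cached to reuse.
    return None
```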

Test Plan:
New unit test that failed before

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153565
Approved by: https://github.com/aorenste
2025-05-20 14:00:43 +00:00
d68d4d31f4 [Cutlass] EVT tests update (#153926)
Fixes internal EVT tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153926
Approved by: https://github.com/williamwen42
2025-05-20 10:03:10 +00:00
d44074f01a [Dynamo] Fix einops regression (#153925)
Fixes https://github.com/pytorch/pytorch/issues/153476

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153925
Approved by: https://github.com/williamwen42
2025-05-20 09:52:42 +00:00
44f19c7179 Record the XPU and XCCL build settings in the compiled binary (#147161)
Fixes #ISSUE_NUMBER

Currently the XPU and XCCL build settings are not recorded in the compiled binary and are not shown by `torch.__config__.show()`, which is a quick way to check whether the binary has been built with such support.

Below is the output after adding them (see the end of the last line):

```
Python 3.12.8 | packaged by conda-forge | (main, Dec  5 2024, 14:24:40) [GCC 13.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> print(torch.__config__.show())
PyTorch built with:
  - GCC 13.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2025.1-Product Build 20250203 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.5.3 (Git Hash 66f0cb9eb66affd2da3bf5f8d897376f04aae6af)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - CPU capability usage: AVX512
XPU backend  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=RelWithDebInfo, COMMIT_SHA=43eb39d7c832b5560f7bfa8d29cc7919ac21c0ca, CXX_COMPILER=/home/pkourdis/compilers/gcc-13.3.0/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=1 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DUSE_KINETO -DLIBKINETO_NOCUPTI -DLIBKINETO_NOROCTRACER -DLIBKINETO_NOXPUPTI=OFF -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=range-loop-construct -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-unknown-pragmas -Wno-unused-parameter -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-dangling-reference -Wno-error=dangling-reference -Wno-error=redundant-move -DUSE_XPU -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, TORCH_VERSION=2.7.0, USE_CUDA=0, USE_CUDNN=OFF, USE_CUSPARSELT=OFF, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_GLOO=ON, USE_MKL=ON, USE_MKLDNN=1, USE_MPI=0, USE_NCCL=OFF, USE_NNPACK=0, USE_OPENMP=ON, USE_ROCM=0, USE_ROCM_KERNEL_ASSERT=OFF, USE_XCCL=1, USE_XPU=1,
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147161
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/albanD

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
2025-05-20 09:21:39 +00:00
1075bb37d3 Revert "Fix fake tensor caching when output has unbacked (#153034)"
This reverts commit cb5f31a4a164a4fa1eaa627f9b15cdc18aa95ef1.

Reverted https://github.com/pytorch/pytorch/pull/153034 on behalf of https://github.com/malfet due to Seems to have introduced flakiness in MacOS inductor tests, see https://github.com/pytorch/pytorch/issues/153891 ([comment](https://github.com/pytorch/pytorch/pull/153034#issuecomment-2893059329))
2025-05-20 06:02:38 +00:00
9849c79fa2 Revert "FakeTensorMode dispatch shouldn't include bypass in exception context (#153780)"
This reverts commit aa84c037f0f473c91a79f48a5f278b7243f64b0e.

Reverted https://github.com/pytorch/pytorch/pull/153780 on behalf of https://github.com/malfet due to Reverting to clearly revert https://github.com/pytorch/pytorch/pull/153034, that seems to have introduced flakiness in MacOS inductor tests, see https://github.com/pytorch/pytorch/issues/153891 ([comment](https://github.com/pytorch/pytorch/pull/153780#issuecomment-2893053304))
2025-05-20 05:59:42 +00:00
365ac49840 [AOTI] Add an option to specify custom op C shim (#153851)
Summary: Add an option to tell AOTInductor codegen to generate C shim functions for certain custom ops instead of relying on ProxyExecutor. The lib that defines custom ops need to implement corresponding C shim functions.

Differential Revision: [D75014177](https://our.internmc.facebook.com/intern/diff/D75014177)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153851
Approved by: https://github.com/hl475
2025-05-20 05:12:09 +00:00
89ebd29fdc [Dynamo] added warning message for tracing lru_cache wrapped functions (#153744)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153744
Approved by: https://github.com/williamwen42
2025-05-20 04:08:29 +00:00
5ef90e14a3 [export] Remove unused constants (#153800)
An internal test case ran into a weird issue when exporting, where the model imported a file which creates tensor constants upon importing [(code ptr)](https://fburl.com/code/xwmhxm7n). This causes the tracer to create some tensor constants even though they are not used in the model code. This PR updates the lift_constant_tensors pass to remove constant nodes that are not being used instead of lifting them as tensor constants.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153800
Approved by: https://github.com/dolpm, https://github.com/pianpwk
2025-05-20 03:15:27 +00:00
a79e621c1c [DDP] rebuilt bucket order when find_unused_parameters=true (#153404)
Differential Revision: D72437251

Enable to rebuild bucket order when find_unused_parameters=true.

It should always be better than not rebuilding the bucket order when find_unused_parameters=True:

1. for cases where bucket order in the first iteration is the same as the parameter order, rebuilding bucket order will not change anything

2. for cases where bucket order in the first iteration is not the same as the parameter order, there could be two cases:
    a. the bucket order does not change after the 1st iteration even though the graph is dynamic and there are unused parameters; in this case, rebuilding the bucket order gives a performance gain
    b. the bucket order changes after the 1st iteration due to the dynamic graph; in this case, both the parameter order and the 1st-iteration bucket order are not ideal, so rebuilding the bucket order or not does not matter

Enabling bucket-order rebuilding when find_unused_parameters=true therefore helps case 2.a, and it does not hurt cases 1 and 2.b (a usage sketch follows below).
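
For reference, the flag in question is the standard DDP constructor option; a minimal sketch, assuming a process group has already been initialized:
```python
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def wrap(model: nn.Module) -> DDP:
    # With this change, DDP rebuilds its gradient buckets after the first
    # iteration even when find_unused_parameters=True, so later iterations
    # use buckets ordered by the observed gradient-ready order.
    return DDP(model.cuda(), find_unused_parameters=True)
```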

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153404
Approved by: https://github.com/rohan-varma, https://github.com/fegin
2025-05-20 02:45:01 +00:00
8b94d30b26 [Testing] Benchmark more tests for MPSInductor (#153897)
And report HF tests as HF tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153897
Approved by: https://github.com/dcci
2025-05-20 02:41:38 +00:00
1627951f24 [3.13] Remove all profiler related skips (#153857)
As the underlying issues were fixed by https://github.com/pytorch/pytorch/pull/153848

Fixes https://github.com/pytorch/pytorch/issues/142166
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153857
Approved by: https://github.com/williamwen42
ghstack dependencies: #153848
2025-05-20 01:19:52 +00:00
b15720118a Revert "Cache code generation during triton template expansion and enable it for mm_template. (#151773)"
This reverts commit 9180bb187c0e4c3ab3654e765fe33ad4c75a2b1a.

Reverted https://github.com/pytorch/pytorch/pull/151773 on behalf of https://github.com/malfet due to It broke ROCm, see f9aa3bae8c/1 ([comment](https://github.com/pytorch/pytorch/pull/151773#issuecomment-2892587039))
2025-05-20 00:42:53 +00:00
f9aa3bae8c [Inductor][XPU] Fallback bmm to mm when batch == 1, align with cuda. (#153770)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153770
Approved by: https://github.com/NikhilAPatel, https://github.com/EikanWang, https://github.com/jansel
2025-05-19 23:56:20 +00:00
d81217be2e Revert "Improve torch.ops typing (#153558)"
This reverts commit c5cba39d469151895cd0ecf7673b98e5072b69c2.

Reverted https://github.com/pytorch/pytorch/pull/153558 on behalf of https://github.com/yangw-dev due to Your diff will not be landed to fbcode since we suspect it caused the following breakage in an internal test:[D75007157](https://www.internalfb.com/diff/D75007157) for instance: tests_gpu/lookup_gpu_index_test.py:232:8 Undefined attribute [16]: torch._ops._OpNamespace has no attribute simple_index_mm_batch ([comment](https://github.com/pytorch/pytorch/pull/153558#issuecomment-2892506789))
2025-05-19 23:32:36 +00:00
701e22112d [PT2][Optimus][Observability] Refactor the logging to avoid excessive tlparse log (#153584)
Summary: context: https://fb.workplace.com/groups/943185660584207/permalink/1215335930035844/

Test Plan:
before: aps-aps-ig_v4_2t_2_make_baseline_30batch-735703723-f735706162

tlparse: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/aps-aps-ig_v4_2t_2_make_baseline_30batch-735703723-f735706162/attempt_0/version_0/rank_0/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000&fbclid=IwZXh0bgNhZW0CMTEAAR575JfJZUtE7kQCqzIZVCYomv1q03JzuMFVok8qDA_FuGC8oZ6rhhb2EziSQA_aem_abITQJZQP45t51_r-J-cFw

Differential Revision: D74776025

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153584
Approved by: https://github.com/jamesjwu
2025-05-19 22:57:29 +00:00
c3e14ecdcd [CachingHostAllocator] guard accesses to use_host_register by mutex (#153845)
Per title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153845
Approved by: https://github.com/mradmila, https://github.com/jeffdaily
2025-05-19 22:39:13 +00:00
41564803c2 [Docs] Mention version.txt change for patch releases (#153860)
Part of https://github.com/pytorch/pytorch/issues/151425

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153860
Approved by: https://github.com/Skylion007, https://github.com/seemethere
2025-05-19 22:35:33 +00:00
08e716fc70 [BE] Fix -Wextra-semi warning (#153887)
Introduced by https://github.com/pytorch/pytorch/pull/153645

Semicolon is not needed after closing curly bracket defining a class method.

Not sure why CI did not catch it, but my local builds are now erroring out with
```
[19/97] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/torch/csrc/jit/passes/dead_code_elimination.cpp.o
In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/jit/passes/dead_code_elimination.cpp:4:
/Users/nshulga/git/pytorch/pytorch/torch/csrc/jit/ir/alias_analysis.h:356:64: warning: extra ';' after member function definition [-Wextra-semi]
  356 |   ValueAndMemoryLocationSet(const AliasDb* db) : aliasDb_(db){};
      |                                                                ^
```

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153887
Approved by: https://github.com/wdvr, https://github.com/davidberard98
2025-05-19 22:25:03 +00:00
f419067e50 [ROCm] improve sparse addmm, enable complex (#153262)
PR to:
- enable complex data types for sparse matmul on ROCm
- fix sparse addmm/baddbmm on ROCm
- fix sparse hipification for ROCm
- fix/enable sparse tests on ROCm (~40 tests total):
```
test_sparse_csr.py::TestSparseCSRCUDA::test_bmm_cuda_*
test_sparse.py::TestSparseCUDA::test_sparse_matmul_cuda_*
test_sparse_csr.py::TestSparseCSRCUDA::test_mm_cuda_float64
test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_all_sparse_csr_SparseCS*
test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_*
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153262
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony
2025-05-19 22:23:18 +00:00
91cc93deae [CI] Reuse old whl (#153838)
~50% of commits on main only touch python files unrelated to the object files in the whl, meaning that we could reuse old whls and put the current commit's python files into the whl.  This PR does that in CI by identifying a previous job whose artifact and whls binaries can be reused.  See https://docs.google.com/document/d/1nQ1FNJqnJuSFRiM2HvQ27zg6Vm-77n7LECp30zYfTDk/edit?tab=t.icom2lesr6es for more details?

To reuse:
* the changed files between the whl's commit and the current commit can only be python files in test/ or torch/ and not in torch/csrc
* not on main branch or release branch
* ci-force-rebuild not on PR
* special abort issue is closed
* artifact should exist

Pros:
* build time -> 6 min whenever this can be done

Cons:
* not sure if I have the right files
* version + whl name still remains the same

Testing:
Unfortunately this PR's changed files are not on the list of acceptable changed files for reusing the whl, so I've been mangling it on other PRs to get things like https://github.com/pytorch/pytorch/actions/runs/15119214901/job/42497650394?pr=147470 (It is enabled on linux-focal-cuda12.6-py3.10-gcc11 / build and there are changes in common_utils.py to make sure the copying of python takes effect)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153838
Approved by: https://github.com/malfet
2025-05-19 21:47:33 +00:00
c0343b1539 Fix profiler on cpython-3.13 (#153848)
Per [PEP 667](https://peps.python.org/pep-0667/) `PyFrame_GetLocals` no longer returns dict, but rather instance of `PyFrameLocalsProxy_Type`, so calling `PyDict_GetItemString` is no longer valid(it will always return None) and must be replaced with `PyMapping_GetItemString`

Tested by partially reverting https://github.com/pytorch/pytorch/pull/141674 full revert will be done in the followup PR

Fixes https://github.com/pytorch/pytorch/issues/148273
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153848
Approved by: https://github.com/Skylion007
2025-05-19 21:20:53 +00:00
8c40c9ffcb Revert "[CI] Reuse old whl (#153838)"
This reverts commit cc48550e6f6fa8888b7d90d030b36c4e6d6581ab.

Reverted https://github.com/pytorch/pytorch/pull/153838 on behalf of https://github.com/clee2000 due to testing on main is hard ([comment](https://github.com/pytorch/pytorch/pull/153838#issuecomment-2892272494))
2025-05-19 21:13:27 +00:00
a237831bc2 [JIT] Optimize DCE by storing a MemoryLocations for an entire set<Value*> (#153645)
Summary:
**TL;DR**: make DCE faster by replacing a Set<Value*> with a MemoryLocations sparse bitset (representing all the memory locations stored by the collection of all values in the set).

**Details**
The goal of this PR is to optimize this function from AliasDb:

```
bool AliasDb::writesToAlias(Node* n, const ValueSet& vs) const {
  const auto writtenTo = getWrites(n);
  if (writtenTo.empty()) {
    return false;
  }

  MemoryLocations locs;
  for (const auto v : vs) {
    auto it = elementMap_.find(v);
    if (it != elementMap_.end()) {
      const auto& vlocs = memoryDAG_->getMemoryLocations(it->second);
      if (writtenTo.intersects(vlocs)) {
        return true;
      }
    }
  }

  return false;
}
```

In the DCE use case, we have a ValueSet of live values, into which we insert `Value*`s; and sometimes need to check whether a node mutates any of the live values using `writesToAlias`.

Looping through all the values in the ValueSet and indexing into the elementMap_ is slow; so if we can pre-compute the MemoryLocations set, this speeds up the function. In some large model examples, I see ~15-25x speedups from this change.

**Implementation**: To avoid exposing too many details of AliasDb, I introduce a friend class `ValueAndMemoryLocationSet`, which is an insert-only set of Values, which also maintains the corresponding MemoryLocations.

Then in AliasDb, I use `ValueAndMemoryLocationSet` if we're using AliasDb for analysis, and otherwise use a `Set<Value*>` if we don't have AliasDb.

Test Plan: Rely on unit tests.

Differential Revision: D74827086

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153645
Approved by: https://github.com/eellison
2025-05-19 21:04:59 +00:00
cc48550e6f [CI] Reuse old whl (#153838)
~50% of commits on main only touch python files unrelated to the object files in the whl, meaning that we could reuse old whls and put the current commit's python files into the whl.  This PR does that in CI by identifying a previous job whose artifact and whls binaries can be reused.  See https://docs.google.com/document/d/1nQ1FNJqnJuSFRiM2HvQ27zg6Vm-77n7LECp30zYfTDk/edit?tab=t.icom2lesr6es for more details?

To reuse:
* the changed files between the whl's commit and the current commit can only be python files in test/ or torch/ and not in torch/csrc
* not on main branch or release branch
* ci-force-rebuild not on PR
* special abort issue is closed
* artifact should exist

Pros:
* build time -> 6 min whenever this can be done

Cons:
* not sure if I have the right files
* version + whl name still remains the same

Testing:
Unfortunately this PR's changed files are not on the list of acceptable changed files for reusing the whl, so I've been mangling it on other PRs to get things like https://github.com/pytorch/pytorch/actions/runs/15119214901/job/42497650394?pr=147470 (It is enabled on linux-focal-cuda12.6-py3.10-gcc11 / build and there are changes in common_utils.py to make sure the copying of python takes effect)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153838
Approved by: https://github.com/malfet
2025-05-19 20:56:44 +00:00
9180bb187c Cache code generation during triton template expansion and enable it for mm_template. (#151773)
In a model, we see ~40% of the time spent in mm/addmm tuning. The model has 2000 mms,
many of which receive the same input shapes.

With autotune enabled, this becomes expensive. While we already cache autotuning results, we
did not previously cache the generation of the Python code and the loading for each config that we autotune on.

This diff handles the code generation part (template expansions); a previous diff handled the loading part.
This is expected to save 20% of the model I am working on.

How do we do the caching?
For a given configuration and input layout, the generated code is always the same. One caveat is that
some other information collected during code generation is input dependent (namely, it depends on input
names and symbol names in the inputs), not just the layout.
To handle those we use a record-and-replay approach, where we record the functions that are called during
code generation that affect those outputs and replay them on a cache hit.
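
A toy illustration of the record-and-replay idea (not the inductor implementation; the `env` callback object and `expensive_codegen` are stand-ins):
```python
# Cache the expensive, layout-determined code generation once, but record the
# input-dependent side calls so they can be replayed on a cache hit.
cache = {}

def generate(layout_key, env, expensive_codegen):
    if layout_key in cache:
        code, recorded_calls = cache[layout_key]
        for name, args in recorded_calls:       # replay input-dependent effects
            getattr(env, name)(*args)
        return code

    recorded_calls = []

    def record(name, *args):                    # record side calls during codegen
        recorded_calls.append((name, args))
        return getattr(env, name)(*args)

    code = expensive_codegen(record)
    cache[layout_key] = (code, recorded_calls)
    return code
```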

Effect on the current benchmark on a local run on dev server.
mm_loop: 24115830838 -> 18362098019
mm_loop_dynamic: 30506097176 -> 25697270062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151773
Approved by: https://github.com/eellison
2025-05-19 20:38:04 +00:00
1ccacc028d Revert "[CI] Reuse old whl (#153838)"
This reverts commit 0716acff3a3692daaf31d97f91ae5aee70f10f24.

Reverted https://github.com/pytorch/pytorch/pull/153838 on behalf of https://github.com/clee2000 due to forgot to comment some stuff out ([comment](https://github.com/pytorch/pytorch/pull/153838#issuecomment-2892195387))
2025-05-19 20:33:14 +00:00
6383ddcfa4 Update serialization docs (#153631)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153631
Approved by: https://github.com/albanD
2025-05-19 20:22:07 +00:00
2fcbb903cb [BE][EZ] Delete unused conda-env-IOS.txt (#153849)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153849
Approved by: https://github.com/janeyx99, https://github.com/seemethere, https://github.com/Skylion007, https://github.com/ZainRizvi
2025-05-19 20:06:38 +00:00
eqy
6ae0c42278 [CUDA][cuBLASLt] Respect allow[FP16/BF16]ReductionCuBLAS in cuBLASLt (#153095)
cuBLASLt matmuls have been silently allowing all reduction types, which meant that e.g., `allow_fp16_reduced_precision_reduction = False` had no effect.

In practice split-K with reduced precision reductions were unlikely to happen as the default `CUBLASLT_WORKSPACE_SIZE` of 1MiB tends to prevent this.

However this isn't guaranteed and we are on the path to increasing the default workspace size following #151163

This setting is effectively already tested in e.g., `test_cublas_addmm_size_100_cuda_float16` and `test_cublas_addmm_size_100_cuda_bfloat16` but the backend selection is not deterministic. Running the full `test_matmul_cuda.py` seems to exercise the Lt interface, but running a standalone test does not (apparently due to spurious alignment differences).
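
For reference, a minimal snippet toggling the setting that the Lt path now respects (the shapes are arbitrary and CUDA is assumed):
```python
import torch

if torch.cuda.is_available():
    # Disallow reduced-precision (fp16) accumulation in split-K reductions;
    # with this fix the cuBLASLt path honors the flag as well.
    torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = False
    a = torch.randn(100, 100, device="cuda", dtype=torch.half)
    b = torch.randn(100, 100, device="cuda", dtype=torch.half)
    c = a @ b
```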

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153095
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2025-05-19 20:05:37 +00:00
e581e1c0f4 [BE][Ez]: Propagate some nodiscard in RNN (#153836)
Follow-up to @cyyever's #153805, propagating [[nodiscard]] from the empty() method call.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153836
Approved by: https://github.com/eqy
2025-05-19 19:55:45 +00:00
0716acff3a [CI] Reuse old whl (#153838)
~50% of commits on main only touch python files unrelated to the object files in the whl, meaning that we could reuse old whls and put the current commit's python files into the whl.  This PR does that in CI by identifying a previous job whose artifact and whls binaries can be reused.  See https://docs.google.com/document/d/1nQ1FNJqnJuSFRiM2HvQ27zg6Vm-77n7LECp30zYfTDk/edit?tab=t.icom2lesr6es for more details?

To reuse:
* the changed files between the whl's commit and the current commit can only be python files in test/ or torch/ and not in torch/csrc
* not on main branch or release branch
* ci-force-rebuild not on PR
* special abort issue is closed
* artifact should exist

Pros:
* build time -> 6 min whenever this can be done

Cons:
* not sure if I have the right files
* version + whl name still remains the same

Testing:
Unfortunately this PR's changed files are not on the list of acceptable changed files for reusing the whl, so I've been mangling it on other PRs to get things like https://github.com/pytorch/pytorch/actions/runs/15119214901/job/42497650394?pr=147470 (It is enabled on linux-focal-cuda12.6-py3.10-gcc11 / build and there are changes in common_utils.py to make sure the copying of python takes effect)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153838
Approved by: https://github.com/malfet
2025-05-19 19:26:08 +00:00
674a85cf26 Revert "[Distributed][CI] Rework continuous TestCase (#153653)"
This reverts commit 0d5c628a6e96e0a960af39d1d0de4bf04df69c39.

Reverted https://github.com/pytorch/pytorch/pull/153653 on behalf of https://github.com/kwen2501 due to More fixes needed ([comment](https://github.com/pytorch/pytorch/pull/153653#issuecomment-2891931028))
2025-05-19 18:29:27 +00:00
0d5c628a6e [Distributed][CI] Rework continuous TestCase (#153653)
1. Reworked `MultiProcContinousTest` to spawn processes during `setUpClass` instead of `main` (so that we can support multiple test classes in one file).

2. The child processes now run an infinite loop, monitoring test IDs passed from the main process via a task queue. Reciprocally, the child processes inform the main process of a test's completion via a completion queue.

3. Added a test template.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153653
Approved by: https://github.com/d4l3k, https://github.com/fegin, https://github.com/fduwjj
2025-05-19 18:20:42 +00:00
c54b9f2969 [Monitoring] Add util for linux build (#153456)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153456
Approved by: https://github.com/huydhn
2025-05-19 17:28:17 +00:00
be36bacdaa [pytorch] Delete TorchScript based Android demo app and point user to ExecuTorch (#153767)
Summary: A retry of #153656. This time start from co-dev to make sure we capture internal signals.

Test Plan: Rely on CI jobs.

Differential Revision: D74911818

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153767
Approved by: https://github.com/kirklandsign, https://github.com/cyyever, https://github.com/Skylion007
2025-05-19 17:20:36 +00:00
6487ea30b3 [c10d] Fix new_subgroups(group=) bug (#153798)
Summary: The bug, introduced in https://github.com/pytorch/pytorch/pull/152765, came from passing the `group` parameter to the `get_rank()` call, which made the function return the caller's rank within that group instead of its global rank. The fix removes the `group` parameter from the `get_rank()` call.
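
A small illustration of the distinction the fix relies on (hedged: assumes an initialized process group; `pg` is a placeholder subgroup):

```python
import torch.distributed as dist

# dist.get_rank() returns the global rank of the current process, while
# dist.get_rank(group=pg) returns the rank *within* pg. new_subgroups() needs
# the global rank when carving ranks into subgroups, hence the fix.
global_rank = dist.get_rank()
# rank_in_pg = dist.get_rank(group=pg)  # not what new_subgroups() wants
```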

Test Plan: contbuild & OSS CI

Differential Revision: D74964213

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153798
Approved by: https://github.com/Skylion007
2025-05-19 17:01:10 +00:00
b0e5402377 Revert "Recheck autotune cache on static cuda launcher load (#153565)"
This reverts commit 02af4e88e4e76309672dbc9b5970ae630df525c7.

Reverted https://github.com/pytorch/pytorch/pull/153565 on behalf of https://github.com/malfet due to Looks like it broke ROCM, see ee72c53c88/1 ([comment](https://github.com/pytorch/pytorch/pull/153565#issuecomment-2891673913))
2025-05-19 16:52:48 +00:00
ee72c53c88 Enable ruff check for all ipynb files (#153820)
Fixes #146411, following #148654

After testing, it seems this can be enabled for all ipynb files.

```bash
lintrunner --take RUFF --all-files
Warning: Could not find a lintrunner config at: '.lintrunner.private.toml'. Continuing without using configuration file.
ok No lint issues.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153820
Approved by: https://github.com/Skylion007
2025-05-19 16:45:26 +00:00
ed5f4a4fa8 Replace size() checks with empty() (#153805)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153805
Approved by: https://github.com/nareshrajkumar866, https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-05-19 16:20:57 +00:00
0ec8fe46d7 cleanup, refactor and add missing self._dde_suppressed checks (#152657)
Two changes besides cleanups and refactoring (see the sketch below):
1) Do not use propagate_real_tensors to resolve evaluation under guard_or_true/guard_or_false.
2) Do not guard on dimensions of type DimDynamic.OBLIVIOUS_SIZE under guard_or_true/guard_or_false.
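
A minimal sketch of the guard_or_false pattern these changes harden (hedged: `maybe_pad` is an illustrative function, not code from this PR):

```python
from torch.fx.experimental.symbolic_shapes import guard_or_false

def maybe_pad(x):
    # If the condition cannot be decided symbolically, take the False branch
    # instead of installing a guard (or consulting real-tensor propagation).
    if guard_or_false(x.shape[0] == 0):
        return x.new_zeros(1, *x.shape[1:])
    return x
```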

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152657
Approved by: https://github.com/pianpwk
2025-05-19 16:15:14 +00:00
dccd19c2ef [Inductor] Construct subgraph with benchmarking args not example_inputs (#153753)
If the inputs to a subgraph have FlexibleLayout, the subgraph does not currently freeze the layouts here. Therefore, the `example_inputs` generated might not be consistent in layout with the `args` passed in for benchmarking.

Differential Revision: [D74900879](https://our.internmc.facebook.com/intern/diff/D74900879/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153753
Approved by: https://github.com/eellison
2025-05-19 15:58:40 +00:00
7a46f4bde0 Enable accelerator to perform streaming backward (#153412)
Also see https://github.com/pytorch/pytorch/pull/142097
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153412
Approved by: https://github.com/albanD
ghstack dependencies: #151079
2025-05-19 15:52:42 +00:00
c5cba39d46 Improve torch.ops typing (#153558)
Fixes longstanding issue where direct references to aten operations are seen as untyped by type checkers. This is accomplished by setting attributes on several classes more consistently, so that `__getattr__` can return a single type in all other cases.

Decisions made along the way:

1. `torch.ops.higher_order` is now implemented by a single-purpose class. This was effectively true before, but the class implementing it attempted to be generalized unnecessarily. Fixing this simplified typing for the `_Ops` class.
2. `__getattr__` is only called when all other lookup methods have failed, so several constant special-cases in the function could be implemented as class variables.

The remainder of this PR is fixing up all the bugs exposed by the updated typing, as well as all the nitpicky typing issues.

Test plan: CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153558
Approved by: https://github.com/rec, https://github.com/Skylion007, https://github.com/cyyever
2025-05-19 14:52:32 +00:00
3cd5b3b1e7 [AOTI] Skip a rocm test (#153828)
Summary: Skip test_aot_inductor_package.test_compile_after_package. https://github.com/pytorch/pytorch/pull/150739 added an opt-in feature which doesn't work for rocm yet.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153828
Approved by: https://github.com/malfet
2025-05-19 14:13:19 +00:00
02af4e88e4 Recheck autotune cache on static cuda launcher load (#153565)
When loading statically launchable triton kernels from FxGraphCache, since we don't instantiate a CachingAutotuner like we do normally, we need to recheck the autotune cache based on the existing compile results. If we get a hit, we take the compile result whose config matches the best config.

Sometimes, the best config will have come from coordinate descent tuning. In this case, FxGraphCache today does not cache the resulting triton kernel, with or without the static cuda launcher. This is because coordinate descent tuning happens at runtime, and the best config may not be one of the precompiled configs.

Test Plan:
New unit test that failed before

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153565
Approved by: https://github.com/aorenste
2025-05-19 12:50:22 +00:00
c45515c2ed Update slow tests (#153815)
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml).
Update the list of slow tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153815
Approved by: https://github.com/pytorchbot
2025-05-19 11:15:25 +00:00
4f1a52fba4 [xla hash update] update the pinned xla hash (#153816)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153816
Approved by: https://github.com/pytorchbot
2025-05-19 11:05:51 +00:00
f3daedb263 [BE]: Remove redundant copy (#153629)
Add typing and remove redundant copy
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153629
Approved by: https://github.com/cyyever, https://github.com/albanD
2025-05-19 08:25:20 +00:00
5506baa4ed Refactoring FSDP2 (_composable/fsdp) test cases to be device agnostic (#149848)
The motivation for this PR is to refactor the existing test cases in the folder test/distributed/_composable/fsdp/, i.e. fsdp2 (as referred to in torchtitan), to be device agnostic so that any accelerator type is supported (e.g. CUDA, HPU, XPU, etc.).

The changes are in line with previously merged changes for fsdp (present in the folder test/distributed/fsdp/ ) test cases: https://github.com/pytorch/pytorch/pull/139184/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149848
Approved by: https://github.com/kwen2501, https://github.com/guangyey
2025-05-19 05:46:51 +00:00
6f835a4769 [amd] fix tunableop gemm (#153764)
Summary: TunableOp on AMD has had a perf regression for a while. It turns out that the TunableOp code path first runs the tuned GEMM and then runs the heuristic GEMM (so it runs two GEMMs).

Test Plan:
```
CUDA_VISIBLE_DEVICES=0 buck test @//mode/opt-amd-gpu -c fbcode.rocm_arch=mi300 -c fbcode.rocm_ck_rtz=true fbcode//accelerators/workloads/microbench/RE:test_emu_v1p4 -- --exact 'accelerators/workloads/microbench/RE:test_emu_v1p4 - test_gemm (accelerators.workloads.microbench.RE.test_emu_v1p4.EMUv1p4PerfTest)' --run-disabled
```

Before the diff
```
  File "/data/users/mxz/fbsource/buck-out/v2/gen/fbcode/ecc11ed52295855f/accelerators/workloads/microbench/RE/__test_emu_v1p4__/test_emu_v1p4#link-tree/accelerators/workloads/microbench/RE/test_emu_v1p4.py", line 47, in test_gemm
    self.assertTrue(result < AMD_GEMM_BASELINE * AMD_GEMM_THRESHOLD)

Buck UI: https://www.internalfb.com/buck2/b4b8dfca-0301-4c5d-83d6-d866d840c42d
Test UI: https://www.internalfb.com/intern/testinfra/testrun/14355223896396807
Network: Up: 10MiB  Down: 1.9GiB  (reSessionID-23b213fe-a460-4788-86c6-a52343ff10f4)
Loading targets.   Remaining      0/5144                                      93161 dirs read, 753263 targets declared
Analyzing targets. Remaining      0/70523                                     2837379 actions, 3262810 artifacts declared
Executing actions. Remaining      0/472286                                    217:26:58.1s exec time total
Command: test.     Finished 122 local, 522 remote, 199785 cache (99% hit)     211:26:30.5s exec time cached (97%)
Time elapsed: 12:50.2s
Test execution completed but the tests failed
Tests finished: Pass 0. Fail 1. Fatal 0. Skip 0. Build failure 0
1 TESTS FAILED
  ✗ accelerators/workloads/microbench/RE:test_emu_v1p4 - test_gemm (accelerators.workloads.microbench.RE.test_emu_v1p4.EMUv1p4PerfTest)

Run $ fdb buck test <args> to debug accelerators/workloads/microbench/RE:test_emu_v1p4 - test_gemm (accelerators.workloads.microbench.RE.test_emu_v1p4.EMUv1p4PerfTest)
      ^^^ just prefix your previous command! ($ fdb !!)
Learn more at https://fburl.com/fdb
```

After the diff
```
Tests finished: Pass 1. Fail 0. Fatal 0. Skip 0. Build failure 0
```

Reviewed By: henryoier, henryhu6

Differential Revision: D74910115

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153764
Approved by: https://github.com/yangsiyu007, https://github.com/xw285cornell
2025-05-19 04:07:48 +00:00
2ade886412 [XPU] [Windows] Auto turn on kineto XPU build when the compiler version supports it. (#153681)
Since SYCL compiler 20250101, the dependency on the Level Zero header has been removed, so we can turn on Kineto XPU by default.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153681
Approved by: https://github.com/chuanqi129, https://github.com/cyyever, https://github.com/EikanWang
2025-05-19 03:07:14 +00:00
1bc5762495 [Intel GPU][Inductor] Fallback embedding_dense_backward on XPU (#151637)
Reopens #146888; the modification now only affects the XPU device. We do not want to decompose embedding_dense_backward for torch.compile: current XPU devices have hardware limitations on atomic ops, so we fall back to eager, where sort can be used to implement this op. hf_T5 amp bf16 training in torchbench gets a 2x improvement on Max 1550. ~~I also align with cuda on gelu decomposition in _addmm_activation~~

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151637
Approved by: https://github.com/guangyey, https://github.com/etaf, https://github.com/jansel, https://github.com/EikanWang
2025-05-19 02:19:37 +00:00
74d0300804 Change unsafe_marked_cacheable_functions to a dictionary, so that you can specify a static cache key (#152486)
Fixes https://github.com/pytorch/pytorch/issues/152434

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152486
Approved by: https://github.com/oulgen
2025-05-19 02:16:33 +00:00
694748dd9d [MPSInductor] Fix conv_transpose channels last (#153787)
Regardless of the input layout, transposed convolution always returns a contiguous tensor on MPS.
Adds a test to validate that.
This fixes torch.compile for the SegmentAnything network.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153787
Approved by: https://github.com/cyyever, https://github.com/Skylion007, https://github.com/jansel, https://github.com/dcci
ghstack dependencies: #153786
2025-05-19 02:01:48 +00:00
6fe5d9215f [EZ][MPS] Enable rsub op (#153786)
Nothing really to enable; just add it to native functions, and the TensorIterator abstraction takes care of the rest.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153786
Approved by: https://github.com/cyyever, https://github.com/Skylion007, https://github.com/dcci
2025-05-19 02:01:48 +00:00
a2d0ef242d [AOTI] Embed cubin files into .so (#150739)
Summary: Embed cubin files so AOTI is one step closer to generating a single binary. Controlled by a flag, off by default.

Differential Revision: [D72535357](https://our.internmc.facebook.com/intern/diff/D72535357)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150739
Approved by: https://github.com/angelayi
2025-05-19 01:11:46 +00:00
cyy
a8986963da Fix some CMake issues (#153686)
These issues were discovered when trying CMake 3.27:
1. set C++ language on HIP sources.
2. add missing link to gtest_main.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153686
Approved by: https://github.com/Skylion007
2025-05-19 00:31:34 +00:00
75eb2f3ff6 Revert "[Dynamo] added warning message for tracing lru_cache wrapped functions (#153744)"
This reverts commit aac30ef50366b03f0ef2d1e770f45a3465f6ea66.

Reverted https://github.com/pytorch/pytorch/pull/153744 on behalf of https://github.com/jeanschmidt due to Need to revert as it is breaking internal signals: [D74935585](https://www.internalfb.com/diff/D74935585) ([comment](https://github.com/pytorch/pytorch/pull/153744#issuecomment-2889187038))
2025-05-18 20:13:00 +00:00
cb57b19c3a [ATen-CPU] Use math.h for GeLU as well as cmath (#153742)
Summary:
## Context

See https://github.com/pytorch/pytorch/pull/149164 for more context.

Originally, this fix worked but more recently including `cmath` by itself no longer provides access to math constants on Windows platforms. I found that including `math.h` resolves this.

I'm not sure exactly what changed, but this PR updates the header to use both includes to fix the symbols not being found. It might be a bug introduced by a recent Windows update.

Test Plan:
CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153742
Approved by: https://github.com/swolchok, https://github.com/Skylion007
2025-05-18 19:06:45 +00:00
aa84c037f0 FakeTensorMode dispatch shouldn't include bypass in exception context (#153780)
In the FakeTensor cache, when we get a bypass exception while computing the cache key (call this exc_1), we need to dispatch to the original operation.

It's possible for the dispatch to the original operation to get its own exception which we want to bubble up to the caller (call this exc_2).

If we directly dispatch from within the handler for exc_1 then exc_2 will have a `__context__` of exc_1 - which can cause deviations between cached and non-cached behavior - so we need to be a bit careful when we call the dispatch.
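
A self-contained sketch of the control flow described above (hedged: the exception and helper names are illustrative stand-ins, not the exact FakeTensor internals):

```python
class _BypassDispatchCache(Exception):
    """Stand-in for the real cache-bypass exception (name assumed)."""

def cached_dispatch(op, args, make_key, cache):
    key = None
    try:
        key = make_key(op, args)      # may raise the bypass exception (exc_1)
    except _BypassDispatchCache:
        pass                          # do NOT dispatch inside this handler
    if key is None:
        # Any exception raised here (exc_2) has no __context__ chaining back to
        # exc_1, keeping cached and non-cached behavior consistent.
        return op(*args)
    if key not in cache:
        cache[key] = op(*args)
    return cache[key]
```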

Testing:
test_aotdispatch.py::TestAOTExport::test_aot_export_predispatch_outdtype fails before this change and passes after.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153780
Approved by: https://github.com/oulgen
2025-05-18 17:21:46 +00:00
68034198e5 [HOP] Mutation and alias rework (#146658)
This PR reworks the way the input mutations and various aliases are checked

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146658
Approved by: https://github.com/ydwu4
2025-05-18 08:05:22 +00:00
0e805aad7f [ONNX] Support float4 (#151069)
- Support exporting float4 models (note: currently we use IR version 10 universally in the exporter, which does not include float4 support. Eventually, when ONNX Runtime and the ecosystem move to support the new IR version 11, we should bump our version to 11 in the exporter as well)
- The shape of the type is set according to https://github.com/pytorch/pytorch/pull/148791#discussion_r2038704986 (added last dim with size 2)
- Use ml_dtypes types when converting to numpy for consistency with ONNX IR

Fix https://github.com/pytorch/pytorch/issues/150202

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151069
Approved by: https://github.com/titaiwangms
2025-05-18 03:19:35 +00:00
8568dbce1d [inductor] Clean typing in codegen/common.py and codecache.py (#150767)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150767
Approved by: https://github.com/aorenste
2025-05-17 13:56:50 +00:00
27f7b65a69 [BE] Ensure generated stub files by gen_pyi are properly formatted (#150730)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150730
Approved by: https://github.com/aorenste
2025-05-17 12:30:40 +00:00
7ebea09986 [Cutlass] Enable fusion with FusedSchedulerNodes (#153588)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153588
Approved by: https://github.com/eellison
ghstack dependencies: #152815
2025-05-17 12:29:10 +00:00
f604732e2e [Cutlass] E2E Tests for EVT (#152815)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152815
Approved by: https://github.com/henrylhtsang, https://github.com/eellison
2025-05-17 12:29:10 +00:00
b4fb801b2d [export] Move PT2 constants to torch::_export (#153206)
Test Plan:
`buck2 test //sigmoid/...`
https://www.internalfb.com/intern/testinfra/testrun/1970325119807758

Differential Revision: D74417085

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153206
Approved by: https://github.com/zhxchen17, https://github.com/dolpm
2025-05-17 08:21:59 +00:00
40339c1e99 Revert "[CUDA][cuBLAS][cuBLASLt] avoid polluting prefer cuBLAS/Lt setting across tests (#153655)"
This reverts commit 3bde364996d53571a9fb799f5951a203a352ed18.

Reverted https://github.com/pytorch/pytorch/pull/153655 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to fail a test in trunk ([comment](https://github.com/pytorch/pytorch/pull/153655#issuecomment-2888212597))
2025-05-17 08:11:54 +00:00
9b2a45ac7d Refactor torch/utils/data/datapipes/gen_pyi.py with torchgen (#150626)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150626
Approved by: https://github.com/aorenste
2025-05-17 06:21:41 +00:00
eqy
e802b29ed4 [SDPA][EZ] Abate narrowing conversion warning spam in flash_api.cpp (#153643)
for messages like
```/workspace/pytorch/aten/src/ATen/native/transformers/cuda/flash_attn/flash_api.cpp:1396:38: warning: narrowing conversion of ‘(char)(& q)->at::Tensor::<anonymous>.at::TensorBase::get_device()’ from ‘char’ to ‘c10::DeviceIndex’ {aka ‘signed ```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153643
Approved by: https://github.com/Skylion007
2025-05-17 02:07:35 +00:00
aac30ef503 [Dynamo] added warning message for tracing lru_cache wrapped functions (#153744)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153744
Approved by: https://github.com/williamwen42
2025-05-17 00:43:18 +00:00
e88c4db302 [BE]: Update ruff linter to 0.11.10 (#153625)
Fixes a bug with #153543 where I forgot to add pyproject.toml to the list of files RUF can scan and also updates it to the latest version (which is just minor bugfixes).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153625
Approved by: https://github.com/cyyever, https://github.com/atalman
2025-05-17 00:39:47 +00:00
clr
a952f42bdb dynamo: Log if we're using dynamic shapes via set_feature_usage (#153490)
This makes it extremely clear when a specific model didn't use dynamic shapes but
should have (unless it had a bad config option).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153490
Approved by: https://github.com/jansel
2025-05-16 23:59:00 +00:00
1e9666b32d Add cudaLaunchKernel to cuda_to_hip_mappings (#153690)
Summary: as $title

Test Plan:
Used in D74789639

Rollback Plan:

Reviewed By: cenzhaometa

Differential Revision: D74789639

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153690
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-05-16 23:37:11 +00:00
cyy
7ae7324ac4 [submodule] Update google benchmark to v1.9.3 (#153676)
And remove `include_directories`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153676
Approved by: https://github.com/Skylion007
2025-05-16 23:31:53 +00:00
59c3463653 [Inductor] Fallback bmm to mm when batch == 1 (#153572)
Summary:
This change introduces a fallback path from `bmm` to `mm` when the batch dimension is `1`.
The motivation is to unlock specialized `mm` kernel paths (e.g., `decomposeK`, `persistent+TMA`, etc.) which often don't have `bmm` equivalents.
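
A quick illustration of why the rewrite is safe (a sketch of the equivalence, not the Inductor lowering itself): a batch-1 `bmm` is numerically a plain `mm` on the squeezed operands.

```python
import torch

a = torch.randn(1, 64, 4096)
b = torch.randn(1, 4096, 64)

out_bmm = torch.bmm(a, b)
out_mm = torch.mm(a.squeeze(0), b.squeeze(0)).unsqueeze(0)

# The mm form can then hit mm-only kernel paths (decomposeK, persistent+TMA, ...).
torch.testing.assert_close(out_bmm, out_mm)
```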

### Rationale

- **No regression:** On shapes where the fallback triggers, we see no performance loss.

- **Performance wins:** On select shapes (especially with large `K`), we observe measurable speedups by triggering `mm`-specific optimizations.
  For example, on `bmm` shapes of the form `(1, H, K, H)` where `H ∈ {16, 32, 48, 64}` and `K ∈ {4096 ... 32768}`, we see an **average speedup of 10%**.

- **Prevalence in prod:** Internal workloads frequently emit `bmm` ops with `batch=1`, making this fallback broadly useful in practice.

Test Plan:
contbuild & OSS CI

Tests in test/inductor/test_torchinductor.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153572
Approved by: https://github.com/PaulZhang12, https://github.com/eellison
2025-05-16 22:35:03 +00:00
76f182f8e0 [cutlass backend] Reduce log level for cutlass compilation error (#153397)
Differential Revision: [D74596410](https://our.internmc.facebook.com/intern/diff/D74596410/)

This change should only affect the cutlass backend. We know we are going to hit CUDA compilation errors, and we already do a good job of handling and caching them, so reduce the logging level there.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153397
Approved by: https://github.com/ColinPeppler, https://github.com/Skylion007
2025-05-16 21:46:14 +00:00
3bde364996 [CUDA][cuBLAS][cuBLASLt] avoid polluting prefer cuBLAS/Lt setting across tests (#153655)
Some tests may not set the preferred backend, which leads to unexpected behavior when multiple tests are run vs. standalone

Tests that should exercise both backends should explicitly parametrize this setting

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153655
Approved by: https://github.com/ngimel
2025-05-16 21:31:13 +00:00
084c4aa614 Revert "Reapply "Delete TorchScript based Android demo app and point to ExecuTorch (#153633)" (#153656)"
This reverts commit 7ed377f5776578aec4a6a9bc4eeef221a6b80a77.

Reverted https://github.com/pytorch/pytorch/pull/153656 on behalf of https://github.com/larryliu0820 due to Still being used internally so can't remove ([comment](https://github.com/pytorch/pytorch/pull/153656#issuecomment-2887665403))
2025-05-16 21:00:11 +00:00
e4a636df80 [dynamo] Make OptimizedModule more robust in attribute reads and writes (#153637)
Fixes #138157.

Differential Revision: [D74834872](https://our.internmc.facebook.com/intern/diff/D74834872)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153637
Approved by: https://github.com/williamwen42
2025-05-16 20:29:19 +00:00
1748fa529a Revert "cleanup, refactor and add missing self._dde_suppressed checks (#152657)"
This reverts commit f7fb2f66e3b60b6e3d8b3ac78aa435b76f49bc11.

Reverted https://github.com/pytorch/pytorch/pull/152657 on behalf of https://github.com/malfet due to Broke lint ([comment](https://github.com/pytorch/pytorch/pull/152657#issuecomment-2887539146))
2025-05-16 19:42:20 +00:00
62d8e3cb40 [BE][MPS] Cleanup log ops migration (#153727)
Introduced by https://github.com/pytorch/pytorch/pull/153398

Workaround internal compiler error on MacOS-13 by providing boolean specialization

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153727
Approved by: https://github.com/Skylion007
2025-05-16 19:32:17 +00:00
cf226cb4d4 [BE]: Enable misc RUF rules and fix pyproject.toml indent (#153624)
Enables a variety of misc ruff rules and fixes some incorrect indentation in the file. Now that we recently updated ruff, we can enable these rule lints. I had already applied most of these lints; now that they are out of preview, they can be applied as stable lints.

Including:
* Do not bother typing a union with Never, as it gets cancelled out
* Simplify nested Literal into a single Literal
* Properly use packaging to parse versions instead of `map(int, ...)`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153624
Approved by: https://github.com/atalman, https://github.com/malfet
2025-05-16 19:29:16 +00:00
f7fb2f66e3 cleanup, refactor and add missing self._dde_suppressed checks (#152657)
Two changes besides cleanups and refactoring:
1) Do not use propagate_real_tensors to resolve evaluation under guard_or_true/guard_or_false.
2) Do not guard on dimensions of type DimDynamic.OBLIVIOUS_SIZE under guard_or_true/guard_or_false.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152657
Approved by: https://github.com/pianpwk
2025-05-16 19:10:04 +00:00
c2dda47bc5 Revert "[dynamo] Make OptimizedModule more robust in attribute reads and writes (#153637)"
This reverts commit 2ce0b66db8b6a22e90b430a73b8914c2d73512e9.

Reverted https://github.com/pytorch/pytorch/pull/153637 on behalf of https://github.com/malfet due to Looks like it broke slow tests, see cda572b053/1 ([comment](https://github.com/pytorch/pytorch/pull/153637#issuecomment-2887449037))
2025-05-16 18:49:57 +00:00
cda572b053 codecache: Remove cpp_prefix.h duplication per build, then precompile it (#144293)
Prior to this PR, `_inductor/codegen/cpp_prefix.h` was copied into a new temporary directory on every inductor run utilizing the CPP backend (i.e. CPU-only), then included in the output source code. Instead, this PR puts it in an appropriate place in the torch includes, and includes it from there. This allows us to precompile it in cpp_wrapper and AOT inductor mode, saving significant compilation time.

Due to difficulties getting this to work in FBCode, the precompilation itself is only enabled in OSS PyTorch.

Differential Revision: [D69420620](https://our.internmc.facebook.com/intern/diff/D69420620)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144293
Approved by: https://github.com/desertfire
2025-05-16 17:41:36 +00:00
befb5bd52a [dynamic shapes] simplify int(x / y) pattern (#153477)
Fixes #138853

Summary: Converts `TruncToInt(IntTrueDiv(x / y))` to `x // y` when x is divisible by y, which helps detect symint specializations that we previously missed.
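
The equivalence behind the simplification, in plain Python terms (a sketch of the reasoning, not the symbolic-shapes code):

```python
import math

x, y = 12, 4                         # y divides x
assert math.trunc(x / y) == x // y   # TruncToInt(IntTrueDiv(x, y)) == FloorDiv(x, y)
```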

Differential Revision: D74664734

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153477
Approved by: https://github.com/bobrenjc93
2025-05-16 17:32:15 +00:00
3aa84775e7 [hipify] Replace cuda error cudaErrorContextIsDestroyed (#153576)
Summary: The cuda symbol cudaErrorContextIsDestroyed is not converted to hipErrorContextIsDestroyed. Add this conversion.

Test Plan: CI

Differential Revision: D74542735

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153576
Approved by: https://github.com/xw285cornell, https://github.com/cyyever
2025-05-16 16:19:42 +00:00
a060f3d272 Rewrite autograd producer consumer stream sync logic (#151079)
Also see previous work https://github.com/pytorch/pytorch/pull/142097

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151079
Approved by: https://github.com/albanD
2025-05-16 15:42:22 +00:00
2ce0b66db8 [dynamo] Make OptimizedModule more robust in attribute reads and writes (#153637)
Fixes #138157.

Differential Revision: [D74834872](https://our.internmc.facebook.com/intern/diff/D74834872)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153637
Approved by: https://github.com/williamwen42
2025-05-16 15:17:07 +00:00
f66a159db5 [Set] Raise TypeError if set is called with the wrong number of arguments (#152990)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152990
Approved by: https://github.com/anijain2305
ghstack dependencies: #150792, #152987, #152988, #152904, #152901, #152902, #152903, #152905, #152906, #152989, #152907, #152908
2025-05-16 14:28:32 +00:00
5a0ca65555 [Set] Add correct set/frozenset __init__ behavior (#152908)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152908
Approved by: https://github.com/anijain2305
ghstack dependencies: #150792, #152987, #152988, #152904, #152901, #152902, #152903, #152905, #152906, #152989, #152907
2025-05-16 14:28:32 +00:00
053025494f [Set] Raise KeyError on empty set.pop() (#152907)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152907
Approved by: https://github.com/anijain2305
ghstack dependencies: #150792, #152987, #152988, #152904, #152901, #152902, #152903, #152905, #152906, #152989
2025-05-16 14:28:32 +00:00
5964cb5eb1 [Set] Update set.union and set.update to support *args (#152989)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152989
Approved by: https://github.com/anijain2305
ghstack dependencies: #150792, #152987, #152988, #152904, #152901, #152902, #152903, #152905, #152906
2025-05-16 14:28:32 +00:00
4759922c5e [Set] Add set.intersection(_update) (#152906)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152906
Approved by: https://github.com/anijain2305
ghstack dependencies: #150792, #152987, #152988, #152904, #152901, #152902, #152903, #152905
2025-05-16 14:28:32 +00:00
ca96d55322 [Set] Add set.difference(_update) (#152905)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152905
Approved by: https://github.com/anijain2305
ghstack dependencies: #150792, #152987, #152988, #152904, #152901, #152902, #152903
2025-05-16 14:28:32 +00:00
5c6830ced0 [Set] Raise KeyError if elem not contained in the set (#152903)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152903
Approved by: https://github.com/anijain2305
ghstack dependencies: #150792, #152987, #152988, #152904, #152901, #152902
2025-05-16 14:28:32 +00:00
574f4c507a [Set] Add set.issubset and set.issuperset (#152902)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152902
Approved by: https://github.com/anijain2305
ghstack dependencies: #150792, #152987, #152988, #152904, #152901
2025-05-16 14:28:32 +00:00
5926b7a38f [Set] Add set.symmetric_difference(_update) (#152901)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152901
Approved by: https://github.com/anijain2305
ghstack dependencies: #150792, #152987, #152988, #152904
2025-05-16 14:28:32 +00:00
fe51ce62ca [Set] Raise TypeError if number of arguments mismatch (#152904)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152904
Approved by: https://github.com/anijain2305
ghstack dependencies: #150792, #152987, #152988
2025-05-16 14:28:32 +00:00
481c345f49 [Set] Raise TypeError if argument is unhashable (#152988)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152988
Approved by: https://github.com/anijain2305
ghstack dependencies: #150792, #152987
2025-05-16 14:28:32 +00:00
cf7021a0ee [Set] Handle exception in ConstantVariable operation (#152987)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152987
Approved by: https://github.com/williamwen42, https://github.com/anijain2305
ghstack dependencies: #150792
2025-05-16 14:28:32 +00:00
477f13c3fb [Set] Add CPython set tests (#150792)
Tests:
* test_set.py

This PR adds test_set.py from the CPython 3.13 branch and ~400 files to test/dynamo_expected_failures. Most of these are expected to be fixed in upcoming PRs. Only minimal changes were made to test_set.py to enable compilation with Dynamo using the PYTORCH_TEST_WITH_DYNAMO=1 environment variable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150792
Approved by: https://github.com/anijain2305
2025-05-16 14:28:32 +00:00
6592086ac3 Add metal kernel for log ops (#153398)
Move unary log ops to metal kernels
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153398
Approved by: https://github.com/kulinseth, https://github.com/malfet
2025-05-16 14:25:28 +00:00
8ca985b365 [Break XPU] Skip newly added test case on XPU that failed because torch._C._scatter not implemented. (#153685)
Fixes #153608
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153685
Approved by: https://github.com/malfet
2025-05-16 14:15:50 +00:00
9ccd601a14 [easy] Fix endif comments in functional_base.h (#153696)
The first one of these confused me on #152388. Happened to notice the second.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153696
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-05-16 14:08:41 +00:00
3443627e07 Revert "[BE]: Enable RUFF TRY400 rule - log.exception (#153473)"
This reverts commit 4f4ecc583e0f48ad2d062a53bf91c61ab40b4948.

Reverted https://github.com/pytorch/pytorch/pull/153473 on behalf of https://github.com/jeanschmidt due to seems to have broken internal signals, @albanD may I count on you to help the author merge his PR? D74837988 ([comment](https://github.com/pytorch/pytorch/pull/153473#issuecomment-2886017075))
2025-05-16 08:29:26 +00:00
86c6f71ddb Revert "[Ez][BE]: Remove accidental classvar (#153540)"
This reverts commit e0dece510b703376d50a5d6536be6c601ca67d9e.

Reverted https://github.com/pytorch/pytorch/pull/153540 on behalf of https://github.com/jeanschmidt due to Broken internal tests, @albanD may you help the author get his PR merged? D74804063 ([comment](https://github.com/pytorch/pytorch/pull/153540#issuecomment-2886011101))
2025-05-16 08:26:37 +00:00
4d073af58c Revert "[inductor][dynamo] Include operator name in size/stride/alignment assertion (#152353)"
This reverts commit 725bbb6b5fffa2f2d219a0692ed27e376c9dd48a.

Reverted https://github.com/pytorch/pytorch/pull/152353 on behalf of https://github.com/jeanschmidt due to seems to have broken a few internal tests, @jansel may you help the author get his PR merged? ([comment](https://github.com/pytorch/pytorch/pull/152353#issuecomment-2885997862))
2025-05-16 08:20:39 +00:00
741539a790 Split out second pass of LayerNorm for profiler attribution reasons (#153578)
Summary:
Split out second pass of LayerNorm so it's more likely to show up in
profiler output. In my testing with perf, the samples from the lambda in the
current implementation are attributed somewhat haphazardly.

Differential Revision: D74181627

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153578
Approved by: https://github.com/hl475
2025-05-16 08:07:13 +00:00
a9adc9a9b6 [Linter] Add linter to detect device-bias hard code in test cases. (#152948)
Since XPU does not gate community pull requests, we’ve observed that contributors often hardcode "cuda" in functions decorated with @requires_gpu() when adding new test cases. This causes the tests to fail on XPU and breaks XPU CI.
This PR adds a linter to detect such issues automatically. An example is shown below.

```
  Error (TEST_DEVICE_BIAS) [device-bias]
    `@requires_gpu` function should not hardcode device='cuda'

        11670  |                .contiguous()
        11671  |            )
        11672  |
    >>> 11673  |        inp = torch.rand((64, 64), device="cuda") * 2 - 1
        11674  |        boundaries = torch.tensor([-0.9, -0.8, 0.1, 0.2, 0.5, 0.9])
        11675  |
        11676  |        self.common(fn, (inp, boundaries), check_lowp=False)

  Error (TEST_DEVICE_BIAS) [device-bias]
    `@requires_gpu` function should not hardcode .cuda() call

        11700  |            self.assertEqual(ref, res)
        11701  |
        11702  |            for offset2 in (0, 1, 2, 3, 4):
    >>> 11703  |                base2 = torch.randn(64 * 64 + 64, dtype=torch.float32).cuda()
        11704  |                inp2 = torch.as_strided(base2, (64, 64), (64, 1), offset2)
        11705  |                ref2 = fn(inp2)
        11706  |                res2 = fn_c(inp2)

  Error (TEST_DEVICE_BIAS) [device-bias]
    `@requires_gpu` function should not hardcode torch.device('cuda:0')

        11723  |            return x.sin() + x.cos()
        11724  |
        11725  |        base = torch.randn(
    >>> 11726  |            64 * 64 + 64, dtype=torch.float32, device=torch.device("cuda:0")
        11727  |        )
        11728  |
        11729  |        inp1 = torch.as_strided(base, (32, 32), (32, 1), 4)

  Error (TEST_DEVICE_BIAS) [device-bias]
    `@requires_gpu` function should not hardcode .to('cuda') call

        11771  |            torch.manual_seed(42)
        11772  |            base = torch.randn(64 * 64 + 64, dtype=torch.float32, device=self.device)
        11773  |            torch.manual_seed(42)
    >>> 11774  |            base_ref = torch.randn(64 * 64 + 64, dtype=torch.float32).to("cuda")
        11775  |
        11776  |            inp = torch.as_strided(base, size, stride, offset)
        11777  |            inp_ref = torch.as_strided(base_ref, size, stride, offset)
```
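
A hedged sketch of the device-agnostic pattern the linter pushes test authors toward (the `GPU_TYPE` import path is an assumption based on common inductor test utilities, and a GPU backend is assumed to be available):

```python
import torch
from torch.testing._internal.inductor_utils import GPU_TYPE  # "cuda" or "xpu"

# Instead of hardcoding device="cuda" inside an @requires_gpu() test:
inp = torch.rand((64, 64), device=GPU_TYPE) * 2 - 1
base = torch.randn(64 * 64 + 64, dtype=torch.float32, device=GPU_TYPE)
```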

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152948
Approved by: https://github.com/EikanWang, https://github.com/cyyever, https://github.com/malfet, https://github.com/jansel
2025-05-16 08:03:54 +00:00
658d17dfb5 [ONNX] Add test for decomp_table update (#153671)
Added a test to strengthen the case for cherry-picking #153168. The original PR didn’t include this test since the fix for decomp_table and the registry was already covered by existing tests. However, it's reasonable to include a dedicated test for the specific issue (https://github.com/pytorch/pytorch/issues/150367 ) when considering the cherry-pick.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153671
Approved by: https://github.com/justinchuby
2025-05-16 08:00:16 +00:00
3fe42d4d5d [export] Dynamo symint support (#152677)
Basically adds native _IntWrapper support to dynamo. Here's my process of trying to make symint input support work on dynamo, and how I ended up with this approach [(doc)](https://docs.google.com/document/d/1GvNRQd8BnxlMay_hrEVgEta6VUeUW_hcFeRuB7q1nDY/edit?tab=t.0).

What I did was, before passing inputs to dynamo.export, first wrap them with a class, `_IntWrapper`. When processing dynamic shapes, I then add the corresponding dynamic shape specification to the `dynamism` field stored on the `_IntWrapper`. If there is no dynamism specified, this gets unwrapped back to an integer. During dynamo tracing, when we encounter an `_IntWrapper`, we convert it to a symint if the dynamism was specified as `Dim.DYNAMIC/AUTO`. Dynamo will then trace a graph that contains symint inputs, which gets passed to AOTAutograd and so on.
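
A minimal sketch of the wrapper idea described above (hedged: the field names and the unwrap helper are illustrative, not the exact internal API):

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class _IntWrapper:
    val: int
    dynamism: Optional[Any] = None  # e.g. Dim.DYNAMIC / Dim.AUTO, filled in later

def unwrap(w: _IntWrapper):
    # With no dynamism recorded, the wrapper degrades back to a plain int;
    # otherwise dynamo turns it into a symint input during tracing.
    return w.val if w.dynamism is None else w
```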

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152677
Approved by: https://github.com/pianpwk
2025-05-16 07:51:50 +00:00
d965fa2c4b [CUDA][cuBLAS] Remove IS_ARM64 skip in test_matmul_cuda.py (#153660)
Original skip seems stale and the test appears to run fine on Grace + Hopper and Grace + Blackwell

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153660
Approved by: https://github.com/Skylion007
2025-05-16 07:31:16 +00:00
1503b3f897 [DSD] Don't pop tensors if they are on Meta device (#153185)
DSD currently pops tensors if they are on the Meta device. This forbids the use case where users would like to let DCP directly initialize the tensors when loading.

This PR also removes test/distributed/checkpoint/e2e/test_pipeline.py which is based on the above feature that is not realistic and is not used anywhere.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153185
Approved by: https://github.com/mori360
2025-05-16 07:18:39 +00:00
1a722f62c2 [Quant][X86] add an op to compute uint8 batch norm 2d (#152811)
**Summary**
This PR adds a new op, `onednn.qbatch_norm2d`, which accepts uint8 inputs on CPU device (instead of QuantizedCPU).
The new op is implemented with AVX512 instructions and provides performance similar to its QuantizedCPU counterpart, `quantized.batch_norm2d`.
The new op supports output dtypes other than uint8 (fp32, fp16 and bf16 are supported).

**Test plan**
```
pytest test/quantization/core/test_quantized_op.py -k test_int8_batch_norm_onednn
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152811
Approved by: https://github.com/leslie-fang-intel, https://github.com/jerryzh168, https://github.com/jgong5
ghstack dependencies: #152411
2025-05-16 06:13:40 +00:00
7e16cb99b6 [FlexAttention] Enforce Q,K,V memory layouts for fp8 flex attention to avoid perf degradation (#153357)
Fixes #147336

## Context

NCU analysis of the fp8 flex attention perf issue in #147336 showed an unexpected increase in shared memory access bank conflicts when loading the V tensor from HBM to SRAM.

Bringing this to the attention of Triton developer @davidberard98, he identified the memory layout of the tensor in HBM as the cause of non-pipelined loads into SRAM, which causes the slowdown.

To summarize:

In flex attention when performing the FP8 GEMM `softmax_scores @ V` the right operand V must be in column-major memory layout. However, the `tl.load` of V blocks from HBM to SRAM cannot be pipelined if the V tensor isn't column-major in HBM already, leading to substantial performance degradation.

This is because triton does not perform async copies with the `cp.async` PTX instruction if the number of contiguous bytes is less than 4 (see [here](81f93f2c8e/lib/Dialect/TritonGPU/Transforms/Pipeliner/PipeliningUtility.cpp (L403))).

i.e., when loading 4 bytes of contiguous data from a tensor stored in row-major in HBM, we have to perform 4 separate non-contiguous writes to SRAM to place those bytes in their new location in the col-major layout in SRAM. Thus the load is not a candidate for pipelining w/ cp.async and just moves data to registers then performs a series of single byte stores.

## Fix summary
- To fix this, we should enforce memory layouts for Q, K, V in FlexAttention when fp8 is being used, to ensure they each exist in HBM in the necessary memory layout to facilitate pipelined loads into SRAM ahead of the FP8 GEMMs
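
A minimal sketch of what "column-major V" means here (hedged: shapes are illustrative, and a CUDA device with fp8 support is assumed):

```python
import torch

B, H, S, D = 4, 8, 1024, 64
v = torch.randn(B, H, S, D, device="cuda").to(torch.float8_e4m3fn)

# Keep the logical (B, H, S, D) shape but store the last two dims column-major,
# so the fp8 GEMM's right operand can be loaded with pipelined contiguous copies.
v_col_major = v.transpose(-2, -1).contiguous().transpose(-2, -1)
assert v_col_major.stride(-2) == 1
```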

## Benchmarks
Rerunning the repro we see fp8 runtime is reduced from 120% of bf16 to 76% of bf16 runtime.

Before fix:

```
(flex) [danvm@devgpu007.eag6 ~/ml-perf-tools/flex_attention (main)]$ rm -rf /tmp/torchinductor_${USER}; python profile_flex.py --bf16 --fp8
2025-05-11 19:07:33,402 - flex_bench - INFO - Running benchmark: bf16
2025-05-11 19:07:35,885 - flex_bench - INFO - bf16: 424.87228804347734 us
2025-05-11 19:07:35,893 - flex_bench - INFO - Running benchmark: fp8e4m3
2025-05-11 19:07:37,319 - flex_bench - INFO - fp8e4m3: 515.714000000001 us
```

After fix:
```
(flex) [danvm@devgpu007.eag6 ~/ml-perf-tools/flex_attention (main)]$ rm -rf /tmp/torchinductor_${USER}; python profile_flex.py --bf16 --fp8
2025-05-11 17:34:38,223 - flex_bench - INFO - Running benchmark: bf16
2025-05-11 17:34:41,157 - flex_bench - INFO - bf16: 423.4662032967036 us
2025-05-11 17:34:41,167 - flex_bench - INFO - Running benchmark: fp8e4m3
2025-05-11 17:34:42,917 - flex_bench - INFO - fp8e4m3: 326.3694803493453 us
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153357
Approved by: https://github.com/ngimel, https://github.com/davidberard98
2025-05-16 04:56:50 +00:00
459ce6c12a [export] Flatten frame local logs (#153627)
Summary:
Some new errors have been showing up on the PT2 dashboard with
```
Invalid type for lengths: Expected BlobReference or torch.Tensor, got: Tensor(shape: torch.Size([10]), stride: (1,), storage_offset: 0)
```
This is caused by [this piece of code](https://fburl.com/code/5nbi9on7) which maps over a set of nodes (in this case type `IDListFeatureListField`) and turns the results into strings to be displayed later. However during pytree.tree_map we call pytree.tree_unflatten which will call the class's init function, which calls `assert_blob` (https://fburl.com/code/h3ainrn9). Because we've mapped over the values and converted them to strings, the assert_blob fails.

I initially thought to disable the assert_blob while tracing (D74684309), but I think we should actually flatten the list first, because tlparse expects just a string of outputs instead of the actual structure.
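
A hedged sketch of the flatten-first approach (the `frame_locals` value is a placeholder for the structure being logged):

```python
import torch.utils._pytree as pytree

frame_locals = {"lengths": [1, 2, 3], "scores": (4.0, 5.0)}

# Stringify the leaves without ever unflattening, so no container __init__
# (and hence no assert_blob-style validation) re-runs on the stringified values.
leaves, spec = pytree.tree_flatten(frame_locals)
printable = [str(leaf) for leaf in leaves]
```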

Test Plan: `buck2 run mode/opt sigmoid/inference/ts_migration:pt2i_readiness_main -- --test_suite ads_all --mode test_full_model --model_id 542947220` fails with something else 😅

Differential Revision: D74744326

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153627
Approved by: https://github.com/yiming0416
2025-05-16 04:45:09 +00:00
7ed377f577 Reapply "Delete TorchScript based Android demo app and point to ExecuTorch (#153633)" (#153656)
This reverts commit ae0e8f0c7316addab3f415dc767a9d34f58b0dae.

Keep android/libs/fbjni because it's being used by other components of
PyTorch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153656
Approved by: https://github.com/malfet
2025-05-16 04:35:42 +00:00
56e1c236bf [Dynamo] Catch unserialisable NN modules (#153503)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153503
Approved by: https://github.com/c00w, https://github.com/jansel
2025-05-16 02:55:28 +00:00
d1f1ff8610 [ddp] propagate use_python_reducer to C++ reducer (#152735)
The C++ reducer is silently incorrect under CA: its implementation no-ops the collective. I'm guessing it was no-op'd because, in DDP + python reducer, the C++ reducer is still being initialized.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152735
Approved by: https://github.com/fegin
ghstack dependencies: #153300, #152689
2025-05-16 01:38:03 +00:00
1b4749f748 [ca][dtensor] run real PG dtensor tests under CA (#152689)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152689
Approved by: https://github.com/bdhirsh
ghstack dependencies: #153300
2025-05-16 01:38:03 +00:00
5aea57d653 [ca][dynamo] always run eager checkpoint region's recomputation in eager (#153300)
I slap `disable` on the recomputation hook; otherwise the partitioner may save fewer or more activations and mismatch the expected eager count in checkpoint. See the code comment `Note: [compiled autograd and checkpoint unpack hook]`.

This fixes all non-nested checkpointing tests. I also wrap nested checkpointing tests, and a few of them still fail.

This also seems to fix all PYTORCH_TEST_WITH_DYNAMO checkpointing tests except for `TestAutograd.test_checkpointing_without_reentrant_custom_function_works`. For those tests, it looks like we fail to HOPify the checkpointed region, and when the backward executes the unpack hooks, dynamo tries to trace them. This messes up the internal state tracking of checkpointing, with some tests raising _StopRecomputationError and others raising the same count mismatch error as CA.
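
A minimal sketch of the "disable" idea (hedged: the hook wiring is simplified and the wrapper name is an assumption; only `torch._dynamo.disable` is the real API):

```python
import torch._dynamo

def wrap_recompute_hook(recompute_fn):
    # Force the checkpoint recomputation to run in eager, so the partitioner's
    # activation count matches what eager checkpointing expects.
    return torch._dynamo.disable(recompute_fn)
```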

FIXES https://github.com/pytorch/pytorch/issues/127115

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153300
Approved by: https://github.com/jansel
2025-05-16 01:37:48 +00:00
cyy
9d3b6ee4c1 [submodule] Update gtest to v1.17.0 (#153618)
And remove some outdated CMake code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153618
Approved by: https://github.com/malfet
2025-05-16 01:24:19 +00:00
d1dd2c1fc8 gloo: cuda (#153406)
This enables Gloo CUDA when used with a backend that supports GPUDirect which currently is only the IBVERBS backend.

This requires some changes to Gloo which are in https://github.com/pytorch/gloo/pull/441

Since we're now depending on gloo_cuda we need to split ProcessGroupGloo into two pieces, one with the CPU bits (libtorch_cpu) and one with CUDA kernels in libtorch_cuda. This unfortunately requires some major refactoring as some CPU code is shared across both.

The gloo submodule is updated to depend on the new Gloo changes

Test plan:

```py
import os
import time

transport = "TCP"
#transport = "IBVERBS"

os.environ["GLOO_DEVICE_TRANSPORT"] = transport
rank = int(os.environ["RANK"])
os.environ["CUDA_VISIBLE_DEVICES"] = str(rank)

ibv = "mlx5_0:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_9:1,mlx5_10:1,mlx5_11:1".split(",")[rank]
ibv_name, ibv_port = ibv.split(":")
os.environ["TORCH_GLOO_IBV_NAME"] = ibv_name
os.environ["TORCH_GLOO_IBV_PORT"] = ibv_port
os.environ["TORCH_GLOO_IBV_INDEX"] = "3"

import torch
import torch.distributed as dist

dist.init_process_group("gloo")

rank = dist.get_rank()

# initial sanity check
#device = "cpu"
#t = torch.zeros(10, device=device)
#dist.all_reduce(t)
#print("sanity complete")

device = "cpu"

iters = 10
warmup_iters = 2

for nelem in [10, 100, 1000, 10000, 100000, 1000000, 10000000, 100000000]:
    t = torch.zeros(nelem, device=device)

    torch.cuda.current_stream().synchronize()
    for i in range(warmup_iters):
        dist.all_reduce(t)

    torch.cuda.current_stream().synchronize()

    start = time.perf_counter()

    for i in range(iters):
        dist.all_reduce(t)

    torch.cuda.current_stream().synchronize()

    dur = (time.perf_counter() - start)
    qps = iters/dur

    bandwidth_gb = t.nbytes * iters / dur / 1e9

    gb = t.nbytes / 1e9

    if rank == 0:
        print(f"{transport=} {device=} {iters=} {nelem=} {qps=} {gb=} {bandwidth_gb=}\n", end="")
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153406
Approved by: https://github.com/fduwjj
2025-05-16 01:13:13 +00:00
ab757dcddc [MPS][Testing] Add GoogleFnet, YituTechConvBert and Super_SloMo to benchmarks (#153658)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153658
Approved by: https://github.com/atalman, https://github.com/ZainRizvi, https://github.com/cyyever
ghstack dependencies: #153657
2025-05-16 01:09:31 +00:00
754b758ea1 [BE] Extend empty_gpu_cache to mps (#153657)
And replace `if: elif:` with `getattr()`
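
A hedged sketch of the `getattr()` refactor mentioned above (the helper name is illustrative):

```python
import torch

def empty_gpu_cache(device: str) -> None:
    # e.g. torch.cuda, torch.xpu, or now torch.mps, instead of an if/elif chain
    backend = getattr(torch, device, None)
    if backend is not None and hasattr(backend, "empty_cache"):
        backend.empty_cache()
```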

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153657
Approved by: https://github.com/atalman, https://github.com/wdvr, https://github.com/ZainRizvi
2025-05-16 01:08:54 +00:00
2489b6470b [c10d] Allow split_group to work with non nccl backends (#152175)
Summary:
Currently things are hardcoded to only work with nccl backend. Extend it
to allow NCCL + custom plugin backend.

The split-specific methods/attributes have not been added to the base
Backend and Options as some of them are specific to backend implementations.
Instead, explicit checks have been added to the split_group method for the
expected methods and attributes.

I am open to making them part of the base Backend if folks prefer.

Test Plan:
CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152175
Approved by: https://github.com/shuqiangzhang, https://github.com/kwen2501
2025-05-16 00:15:29 +00:00
cb5f31a4a1 Fix fake tensor caching when output has unbacked (#153034)
We handle fake tensor caching in two ways:
1. If the inputs have no symbols (SymInt, etc) then we cache on the FakeTensorMode.
2. If the inputs have symbols then we cache on the ShapeEnv.

This way the symbols in the inputs and outputs are associated with the guards in place at the time of the call.

However - it's possible to have an op where there are no symbols in the inputs but there is an unbacked symbol in the output.  In this case we shouldn't cache at all because what would that really mean?

So this PR changes the caching behavior so that if there's a symbol in the output which doesn't come in some way from the input then we refuse to cache that op.
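
A self-contained sketch of that admission rule (hedged: the helper is illustrative; only `free_unbacked_symbols` is the real utility, and sizes are inspected only when they are SymInts):

```python
import torch
from torch.fx.experimental.symbolic_shapes import free_unbacked_symbols

def _unbacked(sizes):
    syms = set()
    for s in sizes:
        if isinstance(s, torch.SymInt):
            syms.update(free_unbacked_symbols(s))
    return syms

def ok_to_cache(input_sizes, output_sizes) -> bool:
    # Refuse to cache if the output carries an unbacked symbol that does not
    # appear anywhere in the inputs.
    return _unbacked(output_sizes) <= _unbacked(input_sizes)
```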

Added a test which checks for this case.

While in there I also did a couple other related changes:
1. Added negative caching - if we see that an (op, args) failed to cache previously we don't even bother trying to cache it again.
2. Reworked the inner behavior of _cached_dispatch_impl a little to make it more clear which bits we expect to be able to throw _BypassDispatchCache and add some comments.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153034
Approved by: https://github.com/masnesral, https://github.com/tugsbayasgalan
2025-05-15 23:18:52 +00:00
e7a40fb301 [Async TP] Fix dim swapping before reduction in fused_scaled_matmul_reduce_scatter (#153595)
## Summary
- The unit test `pytest test/distributed/test_symmetric_memory.py -k test_fused_scaled_matmul_reduce_scatter_scatter` was not running for some reason when #149247 was merged, giving false green CI signals. When it was run manually recently, the test failed, highlighting a bug causing incorrect numerics when `scatter_dim=1`.
- This PR fixes the bug, which was related to how we swap dims 0<=>scatter_dim at the beginning of the custom op (for more efficient cross-device data movement I believe), then swap it back prior to reduction.

## Test plan
- I confirmed the unit test `pytest test/distributed/test_symmetric_memory.py -k test_fused_scaled_matmul_reduce_scatter_scatter` is now passing.
- I confirmed e2e training w/ torchtitan looks good ([logs](https://www.internalfb.com/phabricator/paste/view/P1812054188))
- I analyzed the tlparse to verify the fused_all_gather_matmul and fused_scaled_matmul_reduce_scatter both appear at least once in the post grad graphs ([tlparse](https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpVbUsdG/dedicated_log_torch_trace_65oh3qj_.log/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000))

## Next steps
1. I think for async TP `fused_scaled_matmul_reduce_scatter` we may only need `scatter_dim_after_maybe_reshape` and not `orig_scatter_dim` after all. I can confirm this and refactor if it is the case.
2. This op is specifically designed for async TP, and many of the arguments don't make sense for a user trying to use this as a standalone op. IMO we should have separate standalone custom op without all the extra function args and internal logic that doesn't apply to non-async TP cases.
3. In a follow up PR I want to add shape annotations to each line (e.g. `# (B, T, H)` etc) to make this easier to debug in the future.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153595
Approved by: https://github.com/fegin
2025-05-15 21:44:57 +00:00
ea17cd067d Add vec_reduce_all specialization for std::plus on AArch64 (#152388)
AArch64 has an instruction for this.

Differential Revision: [D73817183](https://our.internmc.facebook.com/intern/diff/D73817183/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152388
Approved by: https://github.com/Skylion007
ghstack dependencies: #152365, #152366
2025-05-15 21:26:18 +00:00
b972435158 vec::map: directly process reduced-precision floats when reasonable (#152366)
The immediate motivation is to make map support match
ExecuTorch so we can delete ExecuTorch-specific mapping functions, but
this should also straightforwardly improve performance.

Testing: there is existing coverage for this in
vec_test_all_types.cpp. Verified that it really does cover the newly
enabled "don't convert through float" paths by temporarily adding a
TORCH_INTERNAL_ASSERT(false).

Differential Revision: [D73802126](https://our.internmc.facebook.com/intern/diff/D73802126/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152366
Approved by: https://github.com/malfet
ghstack dependencies: #152365
2025-05-15 21:26:18 +00:00
e4adf5df39 [ROCm] cpp_extension allow user to override default flags (#152432)
We need -fgpu-rdc for projects such as DeepEP + rocSHMEM. The default of -fno-gpu-rdc doesn't work for such cases.

As per https://github.com/pytorch/pytorch/pull/152432#issuecomment-2840899088:
"rocshmem shares the same global variable in different files, as deepEP uses CUDAExtention to build the project 65e2a700f0/setup.py (L51) and depends on rocshmem, this -fgpu-rdc is needed. The current logic in Pytorch prevents users from overriding this flag."

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152432
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-05-15 21:06:18 +00:00
b8fad785d5 Change trigger for autoformat, use --all-files (#153289)
Change the trigger for autoformat to pull_request because the reusable action used gets the PR number from the pull_request event context, but only run it if ciflow/autoformat is attached to the PR.  Tested this on a different PR, and it seems to be working.

Changed the tag name because ciflow-prefixed labels have special handling.

Also changed it to run on all files so it mimics the normal CI lintrunner call, and because lintrunner, either by itself or using -m mergebase, can miss some things. I don't know if it would miss anything for format, but it does for checking lint. Format seems to take less time than normal lint. I don't know if the comment about making suggestions on non-edited file changes is a concern; I didn't really test that part.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153289
Approved by: https://github.com/atalman, https://github.com/malfet
2025-05-15 20:38:33 +00:00
90deff6d59 Refactor tests in test_max_autotune into a few separate test cases. (#153486)
Summary: To support running a subset of these tests with the remote autotuning utilities, I've split out some of the tests into separate classes so that I can derive from the "main" TestMaxAutotune class when creating new tests for remote. I'm not 100% sure what some of these tests do, so please suggest if another grouping / naming might make more sense. The remaining tests in TestMaxAutotune all smelled relevant to me.

Test Plan: existing unit tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153486
Approved by: https://github.com/eellison
2025-05-15 20:35:22 +00:00
a2e2f908fd add is_vec_specialized_for (#152365)
Let people detect at compile time whether Vectorized is specialized for a given type. See vec_base.h.

Differential Revision: [D73802129](https://our.internmc.facebook.com/intern/diff/D73802129/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152365
Approved by: https://github.com/jgong5, https://github.com/malfet
2025-05-15 20:21:48 +00:00
ae0e8f0c73 Revert "Delete TorchScript based Android demo app and point to ExecuTorch (#153633)"
This reverts commit b22f01fcb9d69bb7d77e08d69004c7265ef7fa4a.

Reverted https://github.com/pytorch/pytorch/pull/153633 on behalf of https://github.com/malfet due to But libtorch build regressions are real, fbjni is still used for C++ builds ([comment](https://github.com/pytorch/pytorch/pull/153633#issuecomment-2884951805))
2025-05-15 20:16:05 +00:00
b03e4f53d2 [Monitoring] enable windows monitoring test (#153453)
Enable utilization monitoring for Windows tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153453
Approved by: https://github.com/huydhn
2025-05-15 20:03:07 +00:00
f7ecc091a0 c10d/TCPStore: better logs on remote shutdown (#153586)
This makes it more obvious what's going on when TCPStore shuts down while waiting on a remote key and also shows the remote address.

Test plan:

```
[W514 18:33:36.536327028 TCPStore.cpp:138] [c10d] recvValueWithTimeout failed on SocketImpl(fd=3, addr=[localhost]:34658, remote=[localhost]:1234): Failed to recv, got 0 bytes. Connection was likely closed. Did the remote server shutdown or crash?
```

```py
import os
rank = int(os.environ["RANK"])

import time
from torch import distributed as dist

store = dist.TCPStore(
    host_name="localhost",
    port=1234,
    is_master=(rank == 0),
    wait_for_workers=False,
)

time.sleep(1)

print("starting")

if rank != 0:
    store.get("foo")
else:
    time.sleep(1)

print("done")
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153586
Approved by: https://github.com/XilunWu
2025-05-15 20:02:51 +00:00
064f4c18f9 [Monitoring] Enable perf tests (#153452)
Enable monitoring for more perf tests. Currently, for perf, we collect usage data every 4 seconds and aggregate every 15 seconds.

We can reduce these numbers further if the monitoring does not affect the perf tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153452
Approved by: https://github.com/Skylion007, https://github.com/huydhn
2025-05-15 19:19:19 +00:00
a4c828199e [BE] Add __all__ to torch/nn/functional.pyi and torch/return_types.pyi (#150729)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150729
Approved by: https://github.com/aorenste
2025-05-15 19:01:57 +00:00
b22f01fcb9 Delete TorchScript based Android demo app and point to ExecuTorch (#153633)
Delete TorchScript demo app and point people to ExecuTorch demo app.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153633
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/atalman, https://github.com/janeyx99, https://github.com/seemethere
2025-05-15 18:43:59 +00:00
00e5cb3db3 [ez][trymerge] Edit revert message for reverted ghstack PRs (#153573)
Change comment about successful revert so it also contains info about the original PR that got the comment (if it is a ghstacked PR)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153573
Approved by: https://github.com/atalman, https://github.com/malfet
2025-05-15 18:23:20 +00:00
480ae2dab8 Add needs_contiguous_strides to more collective ops (#153523)
Differential Revision: D74705770

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153523
Approved by: https://github.com/fmassa
2025-05-15 17:27:37 +00:00
cfee9046b6 cpu: enable gemm-bf16f32 for SDPA BF16 (#140159)
This PR enables gemm-bf16f32 for SDPA BF16 on aarch64, providing faster inference for models with attention layers in autocast (bf16) mode.

Benchmark results from  [PyTorch CI HUD - branch](https://hud.pytorch.org/benchmark/huggingface/inductor_no_cudagraphs?dashboard=torchinductor&startTime=Fri%2C%2028%20Mar%202025%2021%3A26%3A20%20GMT&stopTime=Fri%2C%2004%20Apr%202025%2020%3A26%3A20%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cpu%20(aarch64)&lBranch=adi/gemm_bf16f32&lCommit=d5aeab452e4b1f0580a4636b15a604c77a02c57b&rBranch=main&rCommit=bc72420bcb37390af3fced885e019903e6e425bd)
Overall geometric mean speedup in the HUD dashboard: Huggingface `[0.48x → 0.58x]`, Blueberries `[0.88x → 1.13x]`.

Benchmark numbers for `torch.nn.functional.scaled_dot_product_attention`on Neoverse™ V1.

`batch_size = 1, num_attention_heads = 64, sequence_length = 512, attention_head_size = 128`
 `threads=16`
<img width="319" alt="Screenshot 2024-12-20 at 16 23 22" src="https://github.com/user-attachments/assets/c863f97d-0761-4fb8-aa6c-fc67b22ac3f9" />

Script to benchmark & profile SDPA:

    import torch
    import torch.nn as nn
    import time
    import numpy as np
    from torch.profiler import profile, record_function, ProfilerActivity
    class SimpleAttentionModel(nn.Module):
        def __init__(self, query, key, value):
            super(SimpleAttentionModel, self).__init__()
            self.query = query
            self.key = key
            self.value = value

        def forward(self, attn_mask=None):
            torch.nn.functional.scaled_dot_product_attention(
                        self.query,
                        self.key,
                        self.value,
                        attn_mask=attn_mask)

    #batch_size = 1, num_attention_heads = 64, sequence_length = 512, hidden_size = 128
    def bench_sdpa(batch_size = 1, num_attention_heads = 64, sequence_length = 512, query_sequence_length = 128 , hidden_size=128, precision=torch.float32):
        with torch.no_grad():
            attention_head_size = int(hidden_size / num_attention_heads)
            query = torch.rand(size=(batch_size, num_attention_heads, query_sequence_length, attention_head_size), dtype=precision)
            key = torch.rand(size=(batch_size, num_attention_heads, sequence_length, attention_head_size), dtype=precision)
            value = torch.rand(size=(batch_size, num_attention_heads, sequence_length, attention_head_size), dtype=precision)

            model = SimpleAttentionModel(query, key, value)
            model.eval()
            for _ in range(10):
                model()
            times = []
            n_iters = 100
            for _ in range(n_iters):
                s = time.time_ns()
                model()
                times.append((time.time_ns() - s) / 1e3)
            min_times = np.min(times)
            mean_times = np.mean(times)
            print(f"Min Times = {min_times} us")
            print(f"Mean Times = {mean_times} us")
            print("Times = ", times)

    print("BF16 mode:")
    with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
        with record_function("model_inference"):
            bench_sdpa(precision=torch.bfloat16)
    profile_data = prof.key_averages(group_by_input_shape=True).table(sort_by="cpu_time_total")
    print(profile_data)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140159
Approved by: https://github.com/jgong5, https://github.com/malfet, https://github.com/nikhil-arm, https://github.com/leslie-fang-intel, https://github.com/CaoE, https://github.com/cfRod, https://github.com/fadara01
2025-05-15 17:21:18 +00:00
236b08cbf8 Revert "[ca][dynamo] always run eager checkpoint region's recomputation in eager (#153300)"
This reverts commit 4863e5c843722eb2a34fb0ca1d518a33431a38c0.

Reverted https://github.com/pytorch/pytorch/pull/153300 on behalf of https://github.com/malfet due to Looks like it breaks rocm, see fa8543454a/1 ([comment](https://github.com/pytorch/pytorch/pull/153300#issuecomment-2884489459))
2025-05-15 16:58:52 +00:00
2327c9eedc Revert "[ca][dtensor] run real PG dtensor tests under CA (#152689)"
This reverts commit b297e01f4b1f43ffd1769313f077a2a68928f012.

Reverted https://github.com/pytorch/pytorch/pull/152689 on behalf of https://github.com/malfet due to Looks like it breaks rocm, see fa8543454a/1 ([comment](https://github.com/pytorch/pytorch/pull/153300#issuecomment-2884489459))
2025-05-15 16:58:51 +00:00
db26aeaec2 [MPSInductor] Support numpy scalars handling (#153598)
By default, numpy computes results in float64 format, but when passed as an argument to an MPS function they must be implicitly converted to float32. This pattern occurs naturally in some networks, for example speech_transformer.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153598
Approved by: https://github.com/cyyever, https://github.com/dcci
ghstack dependencies: #153582
2025-05-15 16:48:25 +00:00
0cb48633d9 [ez][CI] Add linux aarch64 to upload test stats, change format of trigger for upload test stats (#153505)
Change from an inline list to a YAML list.
Add linux aarch64 to the list of triggering workflows.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153505
Approved by: https://github.com/Skylion007
2025-05-15 15:33:59 +00:00
fa8543454a [dynamo][torch-function] Prevent unnecessary __torch_function__ tracing (#153551)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153551
Approved by: https://github.com/mlazos
2025-05-15 14:06:17 +00:00
4f4ecc583e [BE]: Enable RUFF TRY400 rule - log.exception (#153473)
Change logging.error to logging.exception to log additional information when relevant. A few logging.error calls have slipped into try/except blocks since I last did a cleanup here, and the rule is now stabilized, so I am enabling it codebase-wide. I have NOQA'd much of our custom exception stack-trace handling for RPC calls and distributed, and tried to fix a few errors based on whether we immediately re-raised or didn't print any exception information where it could be useful.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153473
Approved by: https://github.com/albanD, https://github.com/cyyever
2025-05-15 13:36:59 +00:00
7482eb217c [Inductor-CPU] Faster int8 WoQ GEMM for small M with explicit prefetching and different outer loops (#149373)
### Summary

Fixes #148494

Explicitly prefetch the cache lines of the next `B` block to accelerate int8 WoQ (BF16 activation, int8 statically quantized weights) GEMM for small `M` dimension.

Some of this code (outer loops of the GEMM) is being ported over from Intel Extension for PyTorch. The macro-kernel* and the micro-kernel* are essentially the same, but optionally prefetch a block of B. Templatization is being used to prevent branching causing a slowdown due to unnecessary prefetching.

\* - in [BLIS](https://dl.acm.org/doi/10.1145/2764454) parlance

### Performance data with BS 1

Machine: 32 cores of one socket of a Intel Xeon SP Gen 5 machine

| Model | input tokens | output tokens | next-token latency before this PR | Next-token latency after this change | Speedup |
|-----------|-------------|-----------------|--------------------------------------|------------------------------------------|-----------|
|GPT-J | 128 | 128 | 42 ms | 38 ms | 9.52 % |
| GPT-J | 1024 | 1024 | 48 ms | 45 ms | 6.25 % |
|LLaMA 3.1 8B Instruct | 128 | 128 | 52 ms | 47 ms|  9.61% |
|LLaMA 3.1 8B Instruct | 1024 | 1024 | 57 ms | 53 ms|  7.01% |

While the input shapes of the GEMMs corresponding to linear layers for next-token computation remain the same regardless of the number of input & output tokens, the difference in next-token latency is due to attention in those cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149373
Approved by: https://github.com/leslie-fang-intel, https://github.com/Xia-Weiwen

Co-authored-by: Xia Weiwen <xia.weiwen@hotmail.com>
2025-05-15 11:55:58 +00:00
cyy
e5e06d9cab [submodule] Update kleidiai to v1.8.0 (#153592)
And cleans up some CMake instructions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153592
Approved by: https://github.com/malfet
2025-05-15 10:14:05 +00:00
22b124335e [BE] Update .pyi stub template to use Generic TypeAlias (PEP 585) and Union Type (PEP 604) (#150728)
https://github.com/pytorch/pytorch/pull/129001#discussion_r1645126801 is the motivation for the whole stack of PRs. In `torch/__init__.py`, `torch._C.Type` shadows `from typing import Type`, and there is no type stub for `torch._C.Type` in `torch/_C/__init__.pyi`. So we need to use `from typing import Type as _Type`. After enabling [Generic TypeAlias (PEP 585)](https://peps.python.org/pep-0585) in the `.pyi` type stub files, we can use `type` instead of `typing.Type` or `from typing import Type as _Type`.

------

- [Generic TypeAlias (PEP 585)](https://peps.python.org/pep-0585): e.g. `typing.List[T] -> list[T]`, `typing.Dict[KT, VT] -> dict[KT, VT]`, `typing.Type[T] -> type[T]`.
- [Union Type (PEP 604)](https://peps.python.org/pep-0604): e.g. `Union[X, Y] -> X | Y`, `Optional[X] -> X | None`, `Optional[Union[X, Y]] -> X | Y | None`.
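
As a quick illustration, a hypothetical stub fragment showing the before/after shape of these rewrites (the function and parameter names are invented, not actual torch stubs):

```python
# Hypothetical .pyi fragment; the names are examples only.
from typing import Dict, List, Optional

from torch import Tensor

# Before: typing-module generics and Optional
def pad_old(x: Optional[Tensor], sizes: List[int]) -> Dict[str, Tensor]: ...

# After: PEP 585 builtin generics + PEP 604 union syntax
def pad_new(x: Tensor | None, sizes: list[int]) -> dict[str, Tensor]: ...
```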

Note that in `.pyi` stub files, we do not need `from __future__ import annotations`. So this PR does not violate issue #117449:

- #117449

------

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150728
Approved by: https://github.com/cyyever, https://github.com/aorenste
ghstack dependencies: #150726, #150727
2025-05-15 09:36:42 +00:00
f7a5aa1d8d [torchgen] Refactor and simplify gen_pyi.py to use Generic TypeAlias (PEP 585) and Union Type (PEP 604) (#150727)
https://github.com/pytorch/pytorch/pull/129001#discussion_r1645126801 is the motivation for the whole stack of PRs. In `torch/__init__.py`, `torch._C.Type` shadows `from typing import Type`, and there is no type stub for `torch._C.Type` in `torch/_C/__init__.pyi`. So we need to use `from typing import Type as _Type`. After enabling [Generic TypeAlias (PEP 585)](https://peps.python.org/pep-0585) in the `.pyi` type stub files, we can use `type` instead of `typing.Type` or `from typing import Type as _Type`.

------

- [Generic TypeAlias (PEP 585)](https://peps.python.org/pep-0585): e.g. `typing.List[T] -> list[T]`, `typing.Dict[KT, VT] -> dict[KT, VT]`, `typing.Type[T] -> type[T]`.
- [Union Type (PEP 604)](https://peps.python.org/pep-0604): e.g. `Union[X, Y] -> X | Y`, `Optional[X] -> X | None`, `Optional[Union[X, Y]] -> X | Y | None`.

Note that in `.pyi` stub files, we do not need `from __future__ import annotations`. So this PR does not violate issue #117449:

- #117449

------

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150727
Approved by: https://github.com/aorenste
ghstack dependencies: #150726
2025-05-15 09:36:42 +00:00
129a2976a8 [ROCm] Improvements to non-vectorized elementwise kernels (#153184)
* Unroll loops manually to hide memory access latency

Co-authors: @akadutta @amd-hhashemi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153184
Approved by: https://github.com/jeffdaily
2025-05-15 09:14:43 +00:00
6e107899da [Torch] Fix crash when comparing fp8 tensors that have more than 1 dimension (#153508)
Summary: `torch.nonzero` returns one coordinate per dimension for each non-zero element (a row of `ndim` indices), so we shouldn't expect the indices to be a single element.
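
A small standalone illustration of why a single index per mismatch can't be assumed:

```python
# torch.nonzero returns one row of ndim coordinates per non-zero element.
import torch

mismatch = torch.tensor([[0, 1, 0],
                         [1, 0, 0]])
idx = torch.nonzero(mismatch)
print(idx)        # tensor([[0, 1], [1, 0]]) -- each row is (dim0, dim1)
print(idx.shape)  # torch.Size([2, 2]) == (num_nonzero, ndim)
```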

Test Plan: CI

Differential Revision: D74539233

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153508
Approved by: https://github.com/exclamaforte
2025-05-15 08:41:46 +00:00
b297e01f4b [ca][dtensor] run real PG dtensor tests under CA (#152689)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152689
Approved by: https://github.com/bdhirsh
ghstack dependencies: #153300
2025-05-15 08:10:35 +00:00
4863e5c843 [ca][dynamo] always run eager checkpoint region's recomputation in eager (#153300)
I slap disable on the recomputation hook, otherwise the partitioner may save less/more activations and mismatch with the expected eager count in checkpoint. See code comment `Note: [compiled autograd and checkpoint unpack hook]`.

This fixes all non-nested checkpointing tests. I also wrap nested checkpointing tests, and a few of them still fail.

This also seems to fix all PYTORCH_TEST_WITH_DYNAMO checkpointing tests except for `TestAutograd.test_checkpointing_without_reentrant_custom_function_works`. For those tests, it looks like we fail to HOPify the checkpointed region and when the backward executes the unpack hooks, dynamo tried to trace them. This messed up the internal state tracking of checkpointing, some raising the _StopRecomputationError and others raising the same count mismatch error as CA.

FIXES https://github.com/pytorch/pytorch/issues/127115

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153300
Approved by: https://github.com/jansel
2025-05-15 08:10:35 +00:00
71027b13b2 Revert "[FlexAttention] Enforce Q,K,V memory layouts for fp8 flex attention to avoid perf degradation (#153357)"
This reverts commit 881a598a1e38ef06d4f51d1e3fd8e359fed0c3a0.

Reverted https://github.com/pytorch/pytorch/pull/153357 on behalf of https://github.com/jeanschmidt due to Might have introduced regressions in rocm testing for main: https://github.com/pytorch/pytorch/actions/runs/15035410497/job/42257000513 feel free to re-merge if this was a mistake ([comment](https://github.com/pytorch/pytorch/pull/153357#issuecomment-2882915691))
2025-05-15 07:58:27 +00:00
004dad48f7 Allow to set custom PYTHONPATH for torch.inductor (#152832)
When using Bazel, it’s common to encounter issues like [this](https://github.com/bazelbuild/bazel/issues/14640) and [this](https://github.com/bazel-contrib/rules_python/issues/792) where the `PYTHONPATH` environment variable becomes too long and results in an error such as: `OSError: [Errno 7] Argument list too long` . To work around this, users often resort to custom logic to manipulate PYTHONPATH.

Currently, PyTorch Inductor constructs the PYTHONPATH for a subprocess using sys.path, which can lead to this issue in certain environments.

This PR introduces support for a new environment variable, `TORCH_CUSTOM_PYTHONPATH`, allowing users to override the default `PYTHONPATH` passed to the subprocess. This provides a clean way to avoid an exception when using PyTorch in Bazel.
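
A minimal usage sketch (the path below is made up) of how a Bazel-driven job could use the new override before compiling:

```python
# Sketch only: TORCH_CUSTOM_PYTHONPATH replaces the sys.path-derived PYTHONPATH
# that Inductor would otherwise pass to its compile subprocesses.
import os

os.environ["TORCH_CUSTOM_PYTHONPATH"] = "/opt/runfiles/short_site_packages"  # hypothetical short path

import torch

@torch.compile
def f(x):
    return x * 2 + 1

print(f(torch.arange(4.0)))
```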

Please let me know if I need to add some documentation to support this PR. I haven't found an open issue specific to this change but I'm confident that this change (or a similar one) would be appreciated by few.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152832
Approved by: https://github.com/masnesral
2025-05-15 06:35:41 +00:00
55784be01b [Quant][X86] add ops to compute uint8 pointwise add/add_relu (#152411)
**Summary**
This PR adds two new ops, `onednn.qadd.tensor` and `onednn.qadd_relu.tensor`, for int8 elementwise add, which accepts inputs on CPU device (instead of QuantizedCPU).
The new ops are implemented with AVX512 instructions and it provides similar or better performance, depending on shape, than its counterpart for QuantizedCPU device `quantized.add` and `quantized.add_relu`.
The new op supports output dtypes other than uint8 (fp32, fp16 and bf16 are supported).

**Test plan**
```
pytest test/quantization/core/test_quantized_op.py -k test_int8_add_onednn
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152411
Approved by: https://github.com/leslie-fang-intel, https://github.com/jerryzh168
2025-05-15 06:23:01 +00:00
a762dd1f67 [Memento] On-demand mode using without torch api (#153171)
Summary:
CUDA Post: https://fb.workplace.com/groups/ai.efficiency.tools.users/permalink/2020094788475989/

# Context
In this diff, we want to enable the on-demand mode of memory snapshot to allow user to trace any remote process via dyno command line.

# Design decision

**How do we send on-demand signal to remote process**
We leverage the dyno-Kineto approach.
Since dyno is running on all machine in Meta, it can send a request to the remote machine to start the Kineto.
Kineto will start another thread for memoryProfiler (https://fburl.com/code/dxsmmrok)

**why we use different approach as CUDA**

On the CUDA side, we are using pybind to load the torch module and invoke the Python API to start/stop the profiling. However, this requires us to compile the whole torch binary into the predictor, which is not recommended by runtime (andruwang).

Thus, we decided to use the C++ API directly to avoid the unnecessary dependency.

**why the snapshot is saved as json string directly instead of pickle**
Pickle is primarily designed for use with Python and doesn't have good support in C++. Also, it is hard for users to download the snapshot file and open it locally.
Due to the dependency issue, it is hard to import the gzip/pickle library to decode the data. Thus, let's use JSON for now. I will work on the visualizer to speed up rendering and support other formats later.

**Plan**:
* For now, we will encode the file into gzip for MTIA on-demand only and update the visualizer to support both types.
* Update auto-trace and the CUDA side to encode in gzip as well
* Fully remove pickle dependency.

Test Plan:
# Remote cogwheel test
Servicelab: https://fburl.com/servicelab/pckux7a3
snapshot file manifold: https://fburl.com/manifold/fnotk18c
snapshot file in pastry: P1805522232

Visualization on D74399684
 {F1977786422}

# Local Predictor Test
url: https://fburl.com/pytorch_memory_visualizer/y06kskkm

 {F1977787329}

Differential Revision: D74179606

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153171
Approved by: https://github.com/sraikund16
2025-05-15 06:07:04 +00:00
181bfabb9e fix set_logs for a single child log file (#153580)
Tested via

```
+        import logging
+        torch._logging.set_logs(modules={"torch._functorch._aot_autograd.autograd_cache": logging.DEBUG})
```

```
python test/dynamo/test_aot_autograd_cache.py -k test_multi_graph_specialization
```
and verifying logs are printed
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153580
Approved by: https://github.com/ColinPeppler
2025-05-15 05:58:45 +00:00
9839ec1383 [dynamo][compile-time] Cache method on load builtin (#153524)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153524
Approved by: https://github.com/StrongerXi, https://github.com/jansel
ghstack dependencies: #153522
2025-05-15 05:54:15 +00:00
b47be23461 [dynamo][compile-time] Faster inspect getattr_static for torch.Tensor (#153522)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153522
Approved by: https://github.com/StrongerXi, https://github.com/jansel
2025-05-15 05:54:15 +00:00
910d2f96af [cutlass backend] forward fix cutlass backend A100 test (#153428)
Forward fix of https://github.com/pytorch/pytorch/pull/153006, which broke a test.

In the long run, we should get rid of CUDATemplateCaller.category.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153428
Approved by: https://github.com/ColinPeppler
2025-05-15 05:45:38 +00:00
0ca91af6b8 Define USE_C10D_XCCL and USE_XCCL in pytorch (#147593)
### Motivation:

Add `USE_XCCL` and `USE_C10D_XCCL` to enable support of XCCL backend building in stock PyTorch, similar to `USE_NCCL` and `USE_C10D_NCCL`.
 By default, `USE_XCCL` is OFF and it can be explicitly set to ON.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147593
Approved by: https://github.com/guangyey, https://github.com/malfet, https://github.com/albanD, https://github.com/cyyever
2025-05-15 05:39:00 +00:00
ebd3268538 Removed duplicate patterns from gitignore (#153515)
Removed duplicate patterns from gitignore. These patterns are duplicated verbatim on lines 148-169.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153515
Approved by: https://github.com/soulitzer
2025-05-15 05:38:42 +00:00
b992a665d1 Fix AsyncMM not compiled with SM90a issue (#153519)
The CMakeLists.txt is wrong and doesn't enable SM90a for AsyncMM.cu
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153519
Approved by: https://github.com/drisspg, https://github.com/ngimel, https://github.com/cyyever
2025-05-15 05:23:29 +00:00
d5ddc5ab20 [MPS] Fix float64 scalar tensor handling (#153582)
The current implementation causes a silent correctness problem with torch.compile when someone tries to `torch.compile` a function where one of the arguments is, say, `np.exp(.3)`, which will be represented as a torch.float64 scalar tensor.

Add a regression test for this behavior.
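
A repro-style sketch of the case described above (requires a machine with an MPS device; the function here is invented for illustration):

```python
# np.exp(0.3) is a NumPy float64 scalar and must be silently
# down-cast to float32 when it reaches the MPS backend.
import numpy as np
import torch

@torch.compile
def scale(x, s):
    return x * s

x = torch.ones(4, device="mps")
print(scale(x, np.exp(0.3)))  # scalar arrives as a float64 value
```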
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153582
Approved by: https://github.com/dcci
2025-05-15 05:15:14 +00:00
3e8bda4ad5 [pytorch][triton] flex attention fwd kernel with TMA loads (#151923) (#152460)
Summary:

Device side TMA for flex_attention fwd kernel, Q K V tensors

Test Plan:
Unit test:
```
buck test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:flex_attention -- test_tma_with_customer_kernel_options
```
https://www.internalfb.com/intern/testinfra/testrun/14355223891618726

Differential Revision: D71082691

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152460
Approved by: https://github.com/drisspg
2025-05-15 04:49:32 +00:00
756fd80734 [BE] Improve the typing related to model input argument of torch.compile() (#153559)
Summary: Match the `overload` typing with the original typing in function definition and adjust the corresponding comments.

Test Plan: contbuild & OSS CI

Differential Revision: D74746243

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153559
Approved by: https://github.com/Skylion007
2025-05-15 04:49:26 +00:00
d2f6c6df1d unbreak fb:operator_benchmark_test (#152049)
Summary: unbreak fb:operator_benchmark_test

Test Plan: works on my machine

Differential Revision: D73540912

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152049
Approved by: https://github.com/hl475
2025-05-15 03:38:48 +00:00
014726d9d3 [torchgen] Refactor torchgen.utils.FileManager to accept pathlib.Path (#150726)
This PR allows `FileManager` to accept `pathlib.Path` as arguments while keeping the original `str` path support.

This allows us to simplify the code such as:

1. `os.path.join(..., ...)` with the `Path` `/` operator (`Path.__truediv__(..., ...)`); see the sketch after this list.

95a5958db4/torchgen/utils.py (L155)

95a5958db4/torchgen/utils.py (L176)

2. `os.path.basename(...)` with `Path(...).name`.
 95a5958db4/torchgen/utils.py (L161)

3. Manual file extension split with `Path(...).with_stem(new_stem)`

95a5958db4/torchgen/utils.py (L241-L256)
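
A plain standard-library sketch of the three equivalences above (POSIX-style paths assumed; this is not the torchgen FileManager code itself):

```python
import os.path
from pathlib import Path

root, name = "torch/_C", "__init__.pyi"

# 1. os.path.join(...)  ->  the Path "/" operator (Path.__truediv__)
print(os.path.join(root, name), "==", Path(root) / name)

# 2. os.path.basename(...)  ->  Path(...).name
print(os.path.basename(os.path.join(root, name)), "==", (Path(root) / name).name)

# 3. manual extension split  ->  Path(...).with_stem(new_stem)  (Python 3.9+)
stem, ext = os.path.splitext(name)
print(f"{stem}_v2{ext}", "==", Path(name).with_stem(f"{stem}_v2"))
```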

------

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150726
Approved by: https://github.com/aorenste
2025-05-15 02:52:24 +00:00
881a598a1e [FlexAttention] Enforce Q,K,V memory layouts for fp8 flex attention to avoid perf degradation (#153357)
Fixes #147336

## Context

NCU analysis of the fp8 flex attention perf issue in #147336 showed an unexpected increase in shared memory access bank conflicts when loading the V tensor from HBM to SRAM.

Bringing this to the attention of triton developer @davidberard98 he identified the memory layout of the tensor in HBM to be causing non-pipelined loads into SRAM, causing the slowdown.

To summarize:

In flex attention when performing the FP8 GEMM `softmax_scores @ V` the right operand V must be in column-major memory layout. However, the `tl.load` of V blocks from HBM to SRAM cannot be pipelined if the V tensor isn't column-major in HBM already, leading to substantial performance degradation.

This is because triton does not perform async copies with the `cp.async` PTX instruction if the number of contiguous bytes is less than 4 (see [here](81f93f2c8e/lib/Dialect/TritonGPU/Transforms/Pipeliner/PipeliningUtility.cpp (L403))).

i.e., when loading 4 bytes of contiguous data from a tensor stored in row-major in HBM, we have to perform 4 separate non-contiguous writes to SRAM to place those bytes in their new location in the col-major layout in SRAM. Thus the load is not a candidate for pipelining w/ cp.async and just moves data to registers then performs a series of single byte stores.

## Fix summary
- To fix this, we should enforce memory layouts for Q, K, V in FlexAttention when fp8 is being used, to ensure they each exist in HBM in the necessary memory layout to facilitate pipelined loads into SRAM ahead of the FP8 GEMMs (see the layout sketch below).
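
As a rough illustration of what "column-major in HBM" means here, a minimal layout sketch (not the actual FlexAttention change):

```python
# Make the (S, D) matrices of V column-major in memory so the fp8 GEMM's right
# operand can be loaded with pipelined, contiguous copies.
import torch

B, H, S, D = 2, 4, 128, 64
v = torch.randn(B, H, S, D, dtype=torch.bfloat16)          # row-major: innermost stride 1 along D
v_cm = v.transpose(-2, -1).contiguous().transpose(-2, -1)  # same logical shape, column-major (S, D)

print(v_cm.shape)    # torch.Size([2, 4, 128, 64])
print(v_cm.stride()) # (..., 1, 128): elements along the S dimension are now contiguous
```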

## Benchmarks
Rerunning the repro we see fp8 runtime is reduced from 120% of bf16 to 76% of bf16 runtime.

Before fix:

```
(flex) [danvm@devgpu007.eag6 ~/ml-perf-tools/flex_attention (main)]$ rm -rf /tmp/torchinductor_${USER}; python profile_flex.py --bf16 --fp8
2025-05-11 19:07:33,402 - flex_bench - INFO - Running benchmark: bf16
2025-05-11 19:07:35,885 - flex_bench - INFO - bf16: 424.87228804347734 us
2025-05-11 19:07:35,893 - flex_bench - INFO - Running benchmark: fp8e4m3
2025-05-11 19:07:37,319 - flex_bench - INFO - fp8e4m3: 515.714000000001 us
```

After fix:
```
(flex) [danvm@devgpu007.eag6 ~/ml-perf-tools/flex_attention (main)]$ rm -rf /tmp/torchinductor_${USER}; python profile_flex.py --bf16 --fp8
2025-05-11 17:34:38,223 - flex_bench - INFO - Running benchmark: bf16
2025-05-11 17:34:41,157 - flex_bench - INFO - bf16: 423.4662032967036 us
2025-05-11 17:34:41,167 - flex_bench - INFO - Running benchmark: fp8e4m3
2025-05-11 17:34:42,917 - flex_bench - INFO - fp8e4m3: 326.3694803493453 us
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153357
Approved by: https://github.com/ngimel, https://github.com/davidberard98
2025-05-15 02:41:38 +00:00
eaf2dee10e don't run triton mm for k<32 (#153550)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153550
Approved by: https://github.com/suo

Co-authored-by: Natalia Gimelshein <ngimel@meta.com>
2025-05-15 02:36:44 +00:00
725bbb6b5f [inductor][dynamo] Include operator name in size/stride/alignment assertion (#152353)
Fixes #151930

This PR updates the `assert_size_stride` and `assert_alignment` functions in [guards.cpp](https://github.com/pytorch/pytorch/blob/main/torch/csrc/dynamo/guards.cpp) to accept an optional `op_name` argument and includes it in the error messages.

The corresponding type stubs in [guards.pyi](https://github.com/pytorch/pytorch/blob/main/torch/_C/_dynamo/guards.pyi) are updated to match the new function arg.

In [inductor/ir.py](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/ir.py), the operator name is extracted from the FX graph and passed into the `codegen_size_asserts` and `codegen_alignment_asserts` functions, so that generated assertions in Triton code include the op name for better debugging.

Added unit tests inside [test_torchinductor.py](https://github.com/pytorch/pytorch/blob/main/test/inductor/test_torchinductor.py).
- Verified both successful and failing assertion cases include the operator name.
- Verified that generated Triton code contains the op name inside the asserts.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152353
Approved by: https://github.com/jansel
2025-05-15 02:33:57 +00:00
f5e0806f34 [cutlass backend] Add back descriptive names for epilogue fusion (#153405)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153405
Approved by: https://github.com/mlazos
2025-05-15 01:47:52 +00:00
82dc3457e0 Add load_state_dict hint doc about invoke order work with lr_scheduler (#149942)
Fixes #119168

## Test Result

![image](https://github.com/user-attachments/assets/edb8124c-f103-475a-b903-20fbc71fdea6)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149942
Approved by: https://github.com/janeyx99

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2025-05-15 01:07:36 +00:00
cyy
781ba0ac9d Update CMake to 3.27 in Windows CI (#153380)
This is a prerequisite for enabling newer CMake features.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153380
Approved by: https://github.com/albanD
2025-05-15 00:19:32 +00:00
c2bc7e2827 API change for new enum in cusparseltsplitkmode-t for cusparseLT 0.7.0+ (#150536)
Change the bool to an int to express split_k_mode. Before 0.7.0 there were only 2 cusparseLtSplitKMode_t enum values, ONE_KERNEL and TWO_KERNELS, so a boolean was enough, but since 0.7.0 there are more.

For Blackwell, there has to be a minor change to the parameter split_k_one_kernel (https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/sparse/cuda/cuSPARSELtOps.cpp#L103), since new values were introduced to the enum [cusparseLtSplitKMode_t](https://docs.nvidia.com/cuda/cusparselt/types.html#cusparseltsplitkmode-t) and a bool type is not enough for it (it has to be replaced with an integer).

Error we see without the change
```
RuntimeError: CUDA error: invalid value when calling `cusparseLtMatmulAlgSetAttribute( &handle, &alg_sel, CUSPARSELT_MATMUL_SPLIT_K_MODE, &splitKMode, sizeof(splitKMode))`

To execute this test, run the following from the base repo dir:
    python test/test_sparse_semi_structured.py TestSparseSemiStructuredCUSPARSELTCUDA.test_csrc_cslt_sparse_mm_search_cuda_int8
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150536
Approved by: https://github.com/jcaip, https://github.com/atalman
2025-05-14 23:36:53 +00:00
72fee137dd [ROCm] Maxpool forward NHWC Perf Improvement targeting Resnet scenarios (#151727)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151727
Approved by: https://github.com/seemethere

Co-authored-by: Eli Uriegas <1700823+seemethere@users.noreply.github.com>
2025-05-14 22:34:55 +00:00
e0dece510b [Ez][BE]: Remove accidental classvar (#153540)
Un-annotated class-level variables behave like ClassVars in dataclasses; this type alias should just be a type alias, so there is no need for it to be a ClassVar.
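
A generic illustration of the distinction (standard-library behavior, not the PyTorch code in question):

```python
# In a dataclass, a class-level assignment without an annotation is not treated
# as a field; it behaves like a class attribute, much like an explicit ClassVar.
from dataclasses import dataclass, fields
from typing import ClassVar

@dataclass
class Config:
    x: int = 0                      # real dataclass field
    Alias = dict[str, int]          # un-annotated: ignored by the dataclass machinery
    tag: ClassVar[str] = "shared"   # explicit class variable, also not a field

print([f.name for f in fields(Config)])  # ['x']
```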

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153540
Approved by: https://github.com/albanD, https://github.com/aorenste
2025-05-14 21:55:56 +00:00
7412b33e91 [inductor] Use get to avoid possible keyerror at the end of precompilation (#153417)
Shameful admission: I have encountered this error 1-2 times, but don't have a repro.

```
torch/_inductor/select_algorithm.py", line 2022, in wait_on_futures
    elapsed_times[future],
    ~~~~~~~~~~~~~^^^^^^^^
torch._inductor.exc.InductorError: KeyError: <Future at 0x7fc4e394fb90 state=finished returned tuple>
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153417
Approved by: https://github.com/Skylion007, https://github.com/ColinPeppler
2025-05-14 21:49:43 +00:00
f2e8e41855 [Easy][Inductor] Adds safety checks in get_estimated_runtime (#152821)
This PR adds checks on `gpu_memory_bandwidth` and `gpu_flops` in `get_estimated_runtime`. This will prevent division by zero and other potential incorrect values:
9210a98b92/torch/_inductor/scheduler.py (L864-L865)

9210a98b92/torch/_inductor/scheduler.py (L874)
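
A generic illustration of the kind of guard added (not the actual scheduler code; names and fallback value are assumptions):

```python
def estimated_runtime_s(bytes_moved: float, flops: float,
                        gpu_memory_bandwidth: float, gpu_flops: float) -> float:
    # Guard against zero/invalid hardware estimates instead of dividing blindly.
    if gpu_memory_bandwidth <= 0 or gpu_flops <= 0:
        return 0.0  # fall back rather than raise ZeroDivisionError
    return max(bytes_moved / gpu_memory_bandwidth, flops / gpu_flops)
```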

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152821
Approved by: https://github.com/eellison, https://github.com/jansel
2025-05-14 21:46:59 +00:00
f887bfffda Fix typo (#153561)
Fix typo from #153386

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153561
Approved by: https://github.com/albanD
2025-05-14 21:38:51 +00:00
03d01860fd [dynamo][compile-time] Compute logging related flags once (#153426)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153426
Approved by: https://github.com/jansel
2025-05-14 21:19:06 +00:00
1bd6bc7190 [BE]: Enable ruff YTT linter for Python version checks (#153547)
Adds ruff YTT checks to help future proof version checks and follow best practices here. Also makes it easier for static linters like mypy to detect python version branching.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153547
Approved by: https://github.com/albanD
2025-05-14 21:09:16 +00:00
f363a3f51a Revert "[cuDNN][SDPA] cuDNN SDPA refactor/cleanup, nested tensor backward, test priority bump for sm90, sm100 (#149282)"
This reverts commit 9386701b51aadce951bf38daf497b0257a3f2211.

Reverted https://github.com/pytorch/pytorch/pull/149282 on behalf of https://github.com/jeanschmidt due to Breaking internal builds, see [D74729259](https://www.internalfb.com/diff/D74729259). @drisspg may you help out the author have their PR merged? ([comment](https://github.com/pytorch/pytorch/pull/149282#issuecomment-2881546951))
2025-05-14 20:53:49 +00:00
c92ea3bc98 [BE] Upgrade XPU support package to 2025.1 in CICD (#151899)
Address #151097. Including below changes,

- Add XPU support package 2025.1 build and test in CI for both Linux and Windows
- Keep XPU support package 2025.0 build in CI to ensure no break issue until PyTorch 2.8 release
- Upgrade XPU support package from 2025.0 to 2025.1 in CD for both Linux and Windows
- Enable XCCL in Linux CD wheel and oneMKL integration in both both Linux and Windows
- Update XPU runtime pypi packages of CD wheels
- Remove deprecated support package version docker image build
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151899
Approved by: https://github.com/EikanWang, https://github.com/atalman
2025-05-14 20:21:09 +00:00
5e6e52e7c9 [JIT] add GRAPH_DEBUG for setGraphExecutorOptimize (#153549)
Summary: Optionally log when setGraphExecutorOptimize is called, so we can get insight into the GraphExecutor behavior.

Differential Revision: D74692508

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153549
Approved by: https://github.com/PaulZhang12, https://github.com/SamGinzburg
2025-05-14 20:07:25 +00:00
dda2c7c8fc Pass inductor config for static cuda launcher to workers (#153382)
Async compile workers generally don't respect inductor configs that get changed in the middle of execution, because the workers warm up early. StaticCudaLauncher is especially susceptible to this because it affects triton compilation without being part of the inductor meta. So we'll pass it in via extra configs on each worker run.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153382
Approved by: https://github.com/masnesral, https://github.com/jansel
2025-05-14 20:01:32 +00:00
6a28cc826f Add TEST_HPU flag to set device type (#153461)
MOTIVATION
This PR includes a minor change to check for TEST_HPU flag as well before falling back to CPU. Without this flag, some tests were falling back to CPU causing them to fail.
Please refer to this RFC as well: https://github.com/pytorch/rfcs/pull/66

CHANGES
add TEST_HPU flag to some of the conditions checking the environment
use the DEVICE_COUNT variable instead of the torch.accelerator.device_count() API since the latter is not supported on out-of-tree devices like Intel Gaudi.
@ankurneog , @EikanWang , @cyyever , @guangyey

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153461
Approved by: https://github.com/EikanWang, https://github.com/cyyever, https://github.com/albanD
2025-05-14 19:31:40 +00:00
a54bf43baa Fix support of MixtureSameFamily [bugfix]. (#151317)
Fixes https://github.com/pyro-ppl/pyro/issues/3419 which is actually a `torch` bug that can be replicated by the below code:

```
from torch import rand
from torch.distributions import MixtureSameFamily, Categorical, Binomial

max_count = 20
probs = rand(10, 5)
binom_probs = rand(10, 5)

d = MixtureSameFamily(Categorical(probs=probs), Binomial(max_count, binom_probs))
d.log_prob(d.sample())
```

which results in:

```
Traceback (most recent call last):
  File "test.py", line 11, in <module>
    d.log_prob(d.sample())
  File "pytorch\torch\distributions\mixture_same_family.py", line 168, in log_prob
    self._validate_sample(x)
  File "pytorch\torch\distributions\distribution.py", line 315, in _validate_sample
    valid = support.check(value)
            ^^^^^^^^^^^^^^^^^^^^
  File "pytorch\torch\distributions\constraints.py", line 307, in check
    (value % 1 == 0) & (self.lower_bound <= value) & (value <= self.upper_bound)
                                                      ^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: The size of tensor a (10) must match the size of tensor b (5) at non-singleton dimension 1
```

### Fix explanation (only for cases when the component distribution contains parameters with batch dimensions)

- The failure is due to sample validation taking place before padding in `MixtureSameFamily.log_prob`, and hence the fix is to pad before doing sample validation.
- The fix itself does not alter the calculations at all. It only affects the sample validation process.
- The failure does not occur when the component distribution is the `Normal` distribution, because its support does not depend on batch-shaped parameters (even though the check itself is applied elementwise).
- I've split the `test_mixture_same_family_log_prob` test into two tests based on the `Normal` and `Binomial` distributions.
- Initially, the `Binomial` version of the test did not fail, but this was due to the component distribution having equal batch dimensions of (5, 5) so I changed it to (10, 5).

### Updated fix explanation (for all cases)

- The previous fix caused a bug in sample shape validation (which is done correctly) due to the padding taking place before the sample validation.
- The updated fix corrects the support to reflect the fact that the support of `MixtureSameFamily` is equal to the support of its components distribution with the first event dimension removed.
- This issue was already anticipated in the [code](331423e5c2/torch/distributions/mixture_same_family.py (L127)).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151317
Approved by: https://github.com/albanD, https://github.com/fritzo
2025-05-14 19:24:36 +00:00
clr
534b66fe30 torch.compile: Remove reference to the unused dynamo_config.dynamic_shapes from tests (#153297)

This config option is not set anywhere, and does nothing, so this should cause
no changes to tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153297
Approved by: https://github.com/Skylion007
2025-05-14 19:02:51 +00:00
bf0fe4f828 Revert "[CUDA][CUDNN] Dispatch to cuDNN for non-batch-splittable 64-bit NCHW convolutions (#153101)"
This reverts commit ced90d23d3dfff42379fa032fe6a83b764d12e9f.

Reverted https://github.com/pytorch/pytorch/pull/153101 on behalf of https://github.com/jeanschmidt due to Seems to have introduced breakages on main, tentative revert: https://github.com/pytorch/pytorch/actions/runs/15024667248/job/42224521705 ([comment](https://github.com/pytorch/pytorch/pull/153101#issuecomment-2881208171))
2025-05-14 18:52:07 +00:00
8749fe8439 [CI][MPS] Speedup test_large_bmm (#153562)
By computing matmuls of only one random non-zero batch on CPU

This reduces test runtime from 11 minutes to 14 sec
```
 % python3 test/test_mps.py -v -k test_large_bmm_
test_large_bmm_bfloat16 (__main__.TestMPS.test_large_bmm_bfloat16) ... ok
test_large_bmm_float16 (__main__.TestMPS.test_large_bmm_float16) ... ok

----------------------------------------------------------------------
Ran 2 tests in 27.495s

```

TODO: Compute it over two slices when https://github.com/pytorch/pytorch/issues/153560 is fixed
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153562
Approved by: https://github.com/Skylion007, https://github.com/clee2000
2025-05-14 18:49:42 +00:00
47d6feff7c [export] Support no inputs in unflattened module (#153474)
Encountered in this diff D74589491
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153474
Approved by: https://github.com/avikchaudhuri
2025-05-14 18:45:47 +00:00
6ef1cbc191 Revert "[ROCm] Maxpool forward NHWC Perf Improvement targeting Resnet scenarios (#151727)"
This reverts commit e6a90672601ad3d636145dd8a68952281a6d1199.

Reverted https://github.com/pytorch/pytorch/pull/151727 on behalf of https://github.com/jeanschmidt due to Seems to be breaking internal builds, @seemethere may you help the author? [D74729252](https://www.internalfb.com/diff/D74729252) ([comment](https://github.com/pytorch/pytorch/pull/151727#issuecomment-2881122917))
2025-05-14 18:18:17 +00:00
533fc58453 [BE]: Fix typing None override other optimizers (#153386)
Follow up to #153367 to fix other instances of it throughout the codebase

Also fully type NamedOptimizer since we were so close

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153386
Approved by: https://github.com/tsunghsienlee, https://github.com/janeyx99, https://github.com/jansel, https://github.com/cyyever
2025-05-14 17:48:47 +00:00
2362bd4a4c [Torch][NT] Fix NestedTensor contiguous check condition. (#153237) (#153529)
Fixes #153237

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153529
Approved by: https://github.com/jbschlosser
2025-05-14 17:15:48 +00:00
8bb67700a3 [dynamo] Support delattr on result of torch.compile(module) (#152741)
This is essentially a follow-up on #122098, where we added support of
`getattr` and `setattr` on result of `torch.compile(module)`, but didn't
add support for `delattr`.

Fixes #150711.
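
A minimal sketch of the behavior this enables, assuming `delattr` forwards to the wrapped module in the same way as the existing `getattr`/`setattr` support:

```python
import torch

class M(torch.nn.Module):
    def forward(self, x):
        return x + 1

m = M()
m.scratch = torch.zeros(1)   # plain attribute on the eager module

cm = torch.compile(m)        # OptimizedModule wrapping m
print(cm.scratch)            # getattr forwards to the wrapped module
delattr(cm, "scratch")       # now supported
print(hasattr(m, "scratch")) # False: the attribute is gone from the wrapped module
```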

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152741
Approved by: https://github.com/anijain2305
ghstack dependencies: #152740
2025-05-14 17:03:59 +00:00
6765df052c [dynamo] Emit warning on global module hooks when calling using output of torch.compile(module) (#152740)
When we do `torch.compile(module)`, we eventually end up returning a new
`OptimizedModule` instance, whose `forward` method is the result of
`torch.compile(mod.__call__)`, meaning it already captures all the extra
logic (e.g., hook firing) for the compiled module.

`OptimizedModule` also inherits `nn.module.__call__`, and thus
has its own hook logic. This is useful for torchao, which injects module
forward hooks to run in eager for quantization purposes.

However, this might create unexpected behavior for global module hooks,
because `torch.compile(module)` causes the hook to fire one extra time
for `OptimizedModule`, when compared to eager.

To preserve BC, we simply emit a warning for this behavior, and let
users decide what to do. This is reasonable because the global module
hooks are documented to be used for debugging/profiling purposes only.

Fixes #149502

Differential Revision: [D74611716](https://our.internmc.facebook.com/intern/diff/D74611716)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152740
Approved by: https://github.com/anijain2305, https://github.com/zou3519
2025-05-14 17:03:59 +00:00
b3dea0c0dd Change aoti cpp tests to run serially within file (#152960)
Fixes #152674
https://github.com/pytorch/pytorch/issues/152889
https://github.com/pytorch/pytorch/issues/152888
https://github.com/pytorch/pytorch/issues/152891

`--dist=loadfile` ensures all tests in the same source file run in the same worker.

Tests like `FreeInactiveConstantBufferRuntimeConstantFoldingCuda` expect exclusive access to memory during test time to compute diffs (e.g., initMemory - updateMemory2 == DATASIZE).

With `-n 3`, tests run in separate processes, but CUDA device memory is shared — and cudaMemGetInfo() reads device-wide global state.

```
 python test/run_test.py --cpp --verbose -i cpp/test_aoti_inference -dist=loadfile
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152960
Approved by: https://github.com/desertfire, https://github.com/cyyever
2025-05-14 17:02:39 +00:00
ba70876407 Update lint_urls.sh (#153246)
Treat 403, 429 and 503 http errors as success.
Ignore non-verbal hostnames.
Kill child jobs immediately.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153246
Approved by: https://github.com/malfet
2025-05-14 16:54:49 +00:00
b6b0080419 [DCP] Use multiprocess Pipes instead of Queues to improve communication contract with checkpointer process (#153488)
Summary:
### Diff Context
- PR introduces Pipes for multiprocess comms with checkpointer process.
- Pipes allow easier comms contract management due to the close() API and a catch-all failure signal when the background process is dead (e.g. seg faults); see the generic sketch below.
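
A generic illustration (plain Python multiprocessing, not the DCP code) of why a Pipe makes peer failure easier to detect than a Queue:

```python
# A Pipe end raises EOFError once every handle on the other end is closed,
# instead of blocking forever the way an idle Queue would.
import multiprocessing as mp

def worker(conn):
    conn.send("ready")
    conn.close()  # simulate the background process going away

if __name__ == "__main__":
    parent, child = mp.Pipe()
    p = mp.Process(target=worker, args=(child,))
    p.start()
    print(parent.recv())  # "ready"
    p.join()
    child.close()         # drop our own handle to the child end as well
    try:
        parent.recv()     # no senders left -> EOFError instead of hanging
    except EOFError:
        print("peer closed the pipe")
```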

Test Plan: CI

Differential Revision: D74668559

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153488
Approved by: https://github.com/saumishr
2025-05-14 16:47:43 +00:00
8799bffc34 [BE][Ez]: RUF200 - validate pyproject.toml metadata (#153543)
Since we have pyproject.toml metadata for [project] and [build-requires], let's turn on the linter rules which validate this optional metadata to make sure it's properly formatted and follows the correct schema for standard Python build tools.

Right now, incorrect metadata could silently error with how our CI is invoked or only provide warnings for invalid metadata. This check will help surface those errors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153543
Approved by: https://github.com/albanD
2025-05-14 16:42:22 +00:00
7d39e73c57 Fix more URLs (#153277)
Or ignore them.
Found by running the lint_urls.sh script locally with https://github.com/pytorch/pytorch/pull/153246

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153277
Approved by: https://github.com/malfet
2025-05-14 16:23:50 +00:00
de92296bbb [Intel GPU] undo broadcast on zero stride tensor for SDPA (#151976)
Fix https://github.com/pytorch/pytorch/issues/152290.

The model **hubert** uses aten::expand to build attention mask by broadcasting. Pytorch uses strides[d]=0 to represent broadcast, which is not supported by oneDNN.  This PR handles this scenario.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151976
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/drisspg
2025-05-14 16:09:03 +00:00
1f48bab377 Update torch-xpu-ops commit pin (#153445)
Update the torch-xpu-ops commit to [207105038963e5f9f012f1a0cfd3b9f57b2ab5b0](2071050389), which includes:

- Improve the accuracy of `upsample_bilinear2d_backward`
- Enhance the performance of `avg_pool2d`
- Update the implementation of scatter-gather and indexing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153445
Approved by: https://github.com/guangyey, https://github.com/EikanWang
2025-05-14 15:34:47 +00:00
2e440e39a6 [nativert] Move Placement to pytorch core (#152953)
Summary:
Move Placement to pytorch core.

Using `torch::nativert::isSameDevice` explicitly in code to avoid confusion with the `isSameDevice` in torch namespace.

Test Plan:
```
buck run fbcode//mode/dev-nosan  //caffe2/test/cpp/nativert:placement_test

./bin/test_nativert
```

OSS and internal CI

Differential Revision: D74190745

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152953
Approved by: https://github.com/Skylion007, https://github.com/swolchok, https://github.com/zhxchen17, https://github.com/cyyever
2025-05-14 15:26:54 +00:00
eqy
ced90d23d3 [CUDA][CUDNN] Dispatch to cuDNN for non-batch-splittable 64-bit NCHW convolutions (#153101)
For #152816

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153101
Approved by: https://github.com/Skylion007
2025-05-14 15:22:47 +00:00
0ce941f994 [audio hash update] update the pinned audio hash (#153507)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153507
Approved by: https://github.com/pytorchbot
2025-05-14 15:16:35 +00:00
cd119ddd7c Add matching against hypothetical (new) ghstack pull-request trailer (#153528)
I would like to change ghstack to use a new trailer
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153528
Approved by: https://github.com/malfet
2025-05-14 14:07:01 +00:00
8f3d7972ad [dynamo][compile-time] Cache the function signature to speedup inlining (#153396)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153396
Approved by: https://github.com/jansel, https://github.com/StrongerXi
ghstack dependencies: #153333
2025-05-14 14:01:46 +00:00
2344eca5eb Revert "Fix skipIfXpu and skipIfHpu disables tests when used on class (#151315)"
This reverts commit ee096b89f63394b2c18826288783eef241f3959c.

Reverted https://github.com/pytorch/pytorch/pull/151315 on behalf of https://github.com/jeanschmidt due to Seems to have introduced internal regressions, see [D74668899](https://www.internalfb.com/diff/D74668899). @malfet may you help the author get this PR merged? ([comment](https://github.com/pytorch/pytorch/pull/151315#issuecomment-2880203323))
2025-05-14 13:15:03 +00:00
2c1912452d Revert "Rewrite autograd producer consumer stream sync logic (#151079)"
This reverts commit f78e4529a9d446deb77c6ac38184582f6ab9167a.

Reverted https://github.com/pytorch/pytorch/pull/151079 on behalf of https://github.com/jeanschmidt due to Seems to have introduced regressions in internal signals, see [D74648937](https://www.internalfb.com/diff/D74648937) ([comment](https://github.com/pytorch/pytorch/pull/151079#issuecomment-2880176879))
2025-05-14 13:07:12 +00:00
a628efd1e8 Revert "Enable accelerator to perform streaming backward (#153412)"
This reverts commit d5d26ce43641a19c3e36a751b59b7fa3825cea83.

Reverted https://github.com/pytorch/pytorch/pull/153412 on behalf of https://github.com/jeanschmidt due to Need to revert in order to revert https://github.com/pytorch/pytorch/pull/151079 ([comment](https://github.com/pytorch/pytorch/pull/153412#issuecomment-2880169739))
2025-05-14 13:04:27 +00:00
e8f7a97e2e [Refactor] Explicilty spell out the namespace for device() function (#153248)
Summary: To prepare for the coming up header-only file change. The same files have been using a mixed style of using at::device() and device(). Given these .cpp files are not in the at namespace, it makes sense to spell them out explicitly.

Differential Revision: [D74577412](https://our.internmc.facebook.com/intern/diff/D74577412)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153248
Approved by: https://github.com/cyyever, https://github.com/albanD, https://github.com/janeyx99
2025-05-14 12:00:47 +00:00
0ef5ba43a6 Fix negative dim issue in for parallel loss context manager (#152785)
Facing a similar issue as in #152016, and added the fix as per @tianyu-l's solution.
Fixes #152016

 Tagging @tianyu-l @atalman  for review.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152785
Approved by: https://github.com/tianyu-l
2025-05-14 10:43:27 +00:00
864a5f4434 [dynamo][compile-time] Cache the cleaned insturctions while inlining (#153333)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153333
Approved by: https://github.com/StrongerXi, https://github.com/jansel, https://github.com/williamwen42
2025-05-14 09:26:26 +00:00
0139ce9303 Add skip_dtype_check_in_meta_registrations config to torch/fx/experimental/_config (#153513)
Helion relies on torch/fx/experimental's fake_tensor tracing but does its own dtype checking, which conflicts with some meta kernels' existing dtype checking. This PR adds a config so that we skip those dtype checks in meta kernels and rely on the calling system to do the dtype checking.

Currently it only applies to `baddbmm`, but I expect that similar changes will need to be done to other meta kernels in the future.
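
A usage sketch of how a caller like Helion might flip the new config before tracing (the attribute path follows the PR title; exact call sites are assumptions):

```python
import torch.fx.experimental._config as fx_config

# Skip dtype validation in meta kernels (e.g. baddbmm) and leave it to the caller.
fx_config.skip_dtype_check_in_meta_registrations = True

# ... run fake-tensor tracing here; dtype checking is now the caller's responsibility.
```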

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153513
Approved by: https://github.com/jansel
2025-05-14 09:14:11 +00:00
4015166e5d [ROCm] Maxpool backward NHWC Perf Improvement targeting Resnet scenarios (#152267)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152267
Approved by: https://github.com/jeffdaily
2025-05-14 06:59:29 +00:00
4c5cf18ee0 [device_mesh] improve device selection logic (#150897)
As titled, this PR improves the device selection logic when the user did not
set the device before calling the DeviceMesh constructor. As a device
manager, DeviceMesh should try to set the device for users in a sensible
way.

The set_device behavior before:

* If the user calls init_process_group to init a world process group, we assume the user already called set_device and we don't set the device for them.
* If the user does not init a world process group themselves, we init one for them and follow a heuristic to set the device.
This is OK, but sometimes that heuristic wouldn't work well (e.g. if the user uses TORCH_CUDA_VISIBLE_DEVICES).

So this PR improves the device selection logic to:

* If the default CUDA context is already initialized by the time we init DeviceMesh, then we assume the user must have run some CUDA operation before and therefore must have selected the device themselves.
* If not the above, then we check whether the env vars "LOCAL_RANK" and "WORLD_SIZE" from the launcher (i.e. torchrun) are set; if so, we use "LOCAL_RANK" to set the device for the current process, which is standard practice. (This solves the TORCH_CUDA_VISIBLE_DEVICES issue.)
* If neither of the above, we warn users about the situation and fall back to the old heuristic. A minimal sketch of this selection order is shown below.
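
A minimal sketch of that selection order (not the actual DeviceMesh code; the helper name and the final fallback are illustrative only):

```python
import os
import warnings

import torch

def _pick_cuda_device(fallback_device: int = 0) -> int:
    # 1. CUDA context already initialized: assume the user already picked a device.
    if torch.cuda.is_initialized():
        return torch.cuda.current_device()
    # 2. Launcher env vars present (e.g. torchrun): use LOCAL_RANK.
    if "LOCAL_RANK" in os.environ and "WORLD_SIZE" in os.environ:
        device = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(device)
        return device
    # 3. Otherwise warn and fall back to a heuristic (placeholder here).
    warnings.warn("Could not infer the device; falling back to a heuristic.")
    torch.cuda.set_device(fallback_device)
    return fallback_device
```
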
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150897
Approved by: https://github.com/tianyu-l
ghstack dependencies: #150898
2025-05-14 06:29:16 +00:00
0f891cad5a Enable ruff check for torch/utils/data/*.ipynb (#148654)
Fixes part of #146411

Enable ruff check for `torch/utils/data/*.ipynb` files

## Test Result

```bash
lintrunner -a --take RUFF torch/utils/data/*.ipynb
```

![image](https://github.com/user-attachments/assets/88fddc91-3f19-4704-9aef-2cabd2cdc96e)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148654
Approved by: https://github.com/Skylion007
2025-05-14 06:21:47 +00:00
f7798d8645 Checks kv pair indexing in OrderedPreservingDictTest.test_range_insert (#148136)
`OrderedPreservingDictTest.test_range_insert` has an [unused loop variable `j`](https://github.com/pytorch/pytorch/blob/main/c10/test/util/ordered_preserving_dict_test.cpp#L186), I think taken from the [inspired project](https://github.com/pytorch/pytorch/blob/main/c10/test/util/ordered_preserving_dict_test.cpp#L165) testcase for range inserts, where it [checks kv pair indexing/order](https://github.com/Tessil/ordered-map/blob/master/tests/ordered_map_tests.cpp#L136) for the ordered dict.

This just adds in that functionality to the test case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148136
Approved by: https://github.com/eellison
2025-05-14 06:05:23 +00:00
11c64b7cf8 [dynamo][compile-time] Cache whether a function is inlineable (#153192)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153192
Approved by: https://github.com/StrongerXi, https://github.com/jansel, https://github.com/williamwen42
ghstack dependencies: #153458
2025-05-14 05:40:25 +00:00
e2ce17c6ef [SymmMem][a2av] Use more CTAs for intra-node case (#153509)
Previously, we launched the a2av kernel with at most 8 blocks for intra-node cases, which turned out to saturate only 57 GB/s of bandwidth.

This PR adds more blocks for intra-node, up to 8 per peer, pumping up data parallelism. The kernel now achieves 350 GB/s SOL for Hopper. See figure.

It also uses a simple input-size-based tuning to avoid jumping to 8 CTAs directly (i.e. 1, 2, 4, then 8).

For inter-node, we cap at 8 blocks, since 57 GB/s is already above regular NIC bandwidths (400 Gb/s). A rough sketch of the tuning is shown below.
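
A rough sketch of such size-based tuning; the byte thresholds below are hypothetical, not the kernel's actual cut-offs:

```python
def blocks_per_peer(nbytes: int, intra_node: bool) -> int:
    # Inter-node stays capped; intra-node steps through 1 -> 2 -> 4 -> 8 blocks per peer.
    if not intra_node:
        return 1
    for blocks, threshold in ((1, 64 * 1024), (2, 256 * 1024), (4, 1024 * 1024)):
        if nbytes <= threshold:
            return blocks
    return 8
```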

![all_to_all_vdev Performance on 8xH100](https://github.com/user-attachments/assets/d4b841e6-4c42-4a2e-aa9f-2bc116ba9d25)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153509
Approved by: https://github.com/ngimel
ghstack dependencies: #153483
2025-05-14 04:24:32 +00:00
20dbe644c7 [CD] Fix the libgomp twice load issue (#150084)
Fixes #149422

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150084
Approved by: https://github.com/malfet, https://github.com/leslie-fang-intel, https://github.com/atalman

Co-authored-by: LifengWang <lifeng.a.wang@intel.com>
2025-05-14 04:06:18 +00:00
316c15297c [MemoryZ] Show the current and max entries rendered (#153446)
Summary: as title

Test Plan: {F1977904091}

Differential Revision: D74626081

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153446
Approved by: https://github.com/sraikund16
2025-05-14 03:16:12 +00:00
c797f1285c [dynamo][copmile-time] Handle builtins first in LOAD_GLOBAL (#153458)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153458
Approved by: https://github.com/jansel
2025-05-14 03:04:38 +00:00
33a5179269 [AOTI][reland2] Remove typedef for half and bfloat16 (#153467)
Summary:
Reland https://github.com/pytorch/pytorch/pull/151109 after fixing cutlass AOTI build issues.

typedef is prone to name collision. Explicitly spell out the actual aten types, needed for the standalone AOTI codegen.

Differential Revision: D74398762

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153467
Approved by: https://github.com/jingsh, https://github.com/henrylhtsang, https://github.com/cyyever
2025-05-14 02:37:18 +00:00
9ad9a04ca7 Add TensorLR variant for fused Adagrad on CPU (#153078)
This PR adds a tensor LR variant for the CPU Adagrad(fused=True).

I copied the behavior from the tensor LR variant of CPU Adam(fused=True), where `lr.item()` is cast to a double and passed to the default function.
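
A hedged usage sketch of what this enables (a 0-dim tensor lr with the fused CPU path):

```python
import torch

model = torch.nn.Linear(8, 8)
opt = torch.optim.Adagrad(model.parameters(), lr=torch.tensor(1e-2), fused=True)

loss = model(torch.randn(4, 8)).sum()
loss.backward()
opt.step()  # internally the tensor lr is read via lr.item() for the fused kernel
```
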
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153078
Approved by: https://github.com/janeyx99
2025-05-14 02:23:33 +00:00
d51bc27378 [export] Make draft_export public (#153219)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153219
Approved by: https://github.com/pianpwk
2025-05-14 02:18:36 +00:00
b15b870903 [BE] remove outdated torch/README.md (#153500)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153500
Approved by: https://github.com/albanD, https://github.com/cyyever
2025-05-14 02:10:30 +00:00
d759a517af Update the heuristic for AArch64 bmm/baddbmm (#149122)
Updates heuristic for bmm/baddbmm and consolidates all heuristic logic in a single location

 - The goal of the consolidation is to improve maintainability and readability of the heuristic logic. Instead of different parts scattered across two files, this patch centralizes everything inside `Matmul.cpp`, where there already exists heuristic-based selection for mkldnn.
 - The logic of the check itself doesn't change (existing code is reused where possible), but a separate heuristic threshold for bmm/baddbmm is introduced based on newer benchmarking data. Use the script below to see the performance improvement for bmm from the new heuristic:
 ```
import torch
import time

# Set below to True to use cases selected by only one of the heuristics.
USE_ONLY_DIVERGENT_TEST_CASES = True
BATCH_SIZES = [1, 8, 32, 64, 128, 256]
M_DIMS = [4, 8, 16, 32, 64, 256, 512]
N_DIMS = [4, 8, 16, 32, 64, 256, 512]
K_DIMS = [4, 8, 16, 32, 64, 256, 512]
ITERS = 50

def old_heuristic(m, n, k):
    is_above_min_dims = m > 8 and n > 8 and k > 8
    is_above_min_size = m * n * k > 8_192
    return is_above_min_dims and is_above_min_size

def new_heuristic(b, m, n, k):
    return b * b * m * n * k >= 4_194_304

def generate_test_cases():
    test_cases = []
    for b in BATCH_SIZES:
        for m in M_DIMS:
            for n in N_DIMS:
                for k in K_DIMS:
                    if USE_ONLY_DIVERGENT_TEST_CASES:
                        if old_heuristic(m, n, k) != new_heuristic(b, m, n, k):
                            test_cases.append([b, m, n, k])
                    else:
                        test_cases.append([b, m, n, k])
    return test_cases

def test(x, y):
    for _ in range(5):
        torch.bmm(x, y)
    perf = 0.0
    for _ in range(ITERS):
        start = time.time()
        torch.bmm(x, y)
        end = time.time()
        perf += (end - start) / ITERS
    return perf

def main():
    print(f"{'b':<10}{'m':<10}{'n':<10}{'k':<10}{'time (s)':10}")
    cumulative_mean_time = 0.0
    for b, m, n, k in generate_test_cases():
        mean_time = test(torch.rand(b, m, n), torch.rand(b, n, k))
        cumulative_mean_time += mean_time
        print(f"{b:<10}{m:<10}{n:<10}{k:<10}{mean_time:10.3e}")
    print(f"Cumulative mean time = {cumulative_mean_time:.4f} s")

if __name__ == "__main__":
    main()
```

From the script we see that cumulative mean time from all test cases (at 16 threads) is:
 - 1.6195 s for the old heuristic
 - 0.7012 s for the new heuristic

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149122
Approved by: https://github.com/fadara01, https://github.com/aditew01, https://github.com/malfet
2025-05-14 02:03:50 +00:00
e8662e836a Remove std::is_arithmetic specialization from c10/util/strong_type.h (#153424)
Specializing std::is_arithmetic has undefined behavior (and breaks builds with -Winvalid-specialization). Should fix #150901

Differential Revision: [D74614724](https://our.internmc.facebook.com/intern/diff/D74614724/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153424
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2025-05-14 02:01:32 +00:00
clr
85f97b5a8c compile_fx: make a compile event that corresponds to the fx_compile waitcounter (#152983)
This is a pretty minor change, but by having exact correspondence, we can
easily confirm data differences between perfetto and wait counters

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152983
Approved by: https://github.com/jansel, https://github.com/masnesral
2025-05-14 01:54:42 +00:00
90001554bf [SymmMem][a2av] Fix TODO: change stride unit (#153483)
The previous kernel implementation assumed the float type. This PR makes it general by passing strides in units of bytes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153483
Approved by: https://github.com/fegin, https://github.com/ngimel
2025-05-14 01:47:54 +00:00
eqy
9386701b51 [cuDNN][SDPA] cuDNN SDPA refactor/cleanup, nested tensor backward, test priority bump for sm90, sm100 (#149282)
cleanup tuple/tensor boilerplate in cuDNN SDPA, preparation for nested/ragged tensor backward
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149282
Approved by: https://github.com/drisspg
2025-05-14 01:39:24 +00:00
8521a690f7 [dynamo] fix potential circular import error in decorators.py (#153217)
Differential Revision: [D74442043](https://our.internmc.facebook.com/intern/diff/D74442043)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153217
Approved by: https://github.com/jansel
2025-05-14 01:01:57 +00:00
e6a9067260 [ROCm] Maxpool forward NHWC Perf Improvement targeting Resnet scenarios (#151727)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151727
Approved by: https://github.com/jeffdaily

Co-authored-by: Eli Uriegas <1700823+seemethere@users.noreply.github.com>
2025-05-14 00:58:00 +00:00
7f79222992 Upgrade to NCCL 2.26.5 for CUDA 12 (#152810)
Upgrade NCCL to latest 2.26.5

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152810
Approved by: https://github.com/eqy, https://github.com/albanD, https://github.com/nWEIdia, https://github.com/atalman, https://github.com/cyyever
2025-05-14 00:52:50 +00:00
8739a8c288 elastic: do not shutdown rendezvous on leaving workers (#152525)
In #117066, shutting down the rendezvous was added when a worker shuts down. This is incorrect, because the rendezvous is actually shut down in [this file](fa6f9eb2be/torch/distributed/launcher/api.py (L290)) but should not be shut down if a signal is received. See also [this pull request](https://github.com/pytorch/pytorch/pull/67749).

#124819 then tried to remediate the situation by fixing the faulty shutdown for the restart case. But that fix is only triggered if the agent restarts the training, not if the rendezvous was already shut down before.

Removing both of these changes restores the original behavior. The rendezvous should only be shut down when a run completes or fails, not when a single worker leaves.

Fixes #150916
Fixes #147064

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152525
Approved by: https://github.com/kiukchung
2025-05-14 00:44:10 +00:00
8ac82c3e72 [export] support functools.partial forward (non-strict) (#153408)
Fixes #153086

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153408
Approved by: https://github.com/tugsbayasgalan
2025-05-13 23:30:13 +00:00
40b719c97d [nativert] move executor config to torch (#153087)
Summary:
nativert RFC: https://github.com/zhxchen17/rfcs/blob/master/RFC-0043-torch-native-runtime.md

To land the runtime into PyTorch core, we will gradually land logical parts of the code into the Github issue and get each piece properly reviewed.

This diff moves the executor config to torch. since it's header-only this requires some changes to the libtorch build configs

Test Plan: CI

Differential Revision: D74278789

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153087
Approved by: https://github.com/zhxchen17
2025-05-13 23:26:00 +00:00
3498201e57 GPU lowering uses aoti_call_delegate (#153282)
Summary: Skip custom objects when serializing the weight nodes of `aoti_call_delegate` hop as they are not consumed by the runtime.

Test Plan: CI

Reviewed By: SherlockNoMad

Differential Revision: D73704385

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153282
Approved by: https://github.com/dolpm, https://github.com/SherlockNoMad
2025-05-13 23:23:27 +00:00
81719ebde3 [caffe2] Make c10::str works with scoped enum (#152705) (#152714)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/152705

Test Plan:
```
buck2 test fbcode//caffe2/c10/test:util_base_tests --fail-fast
```

Differential Revision: D74087796

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152714
Approved by: https://github.com/Skylion007
2025-05-13 21:05:36 +00:00
e8596c291b Fix misleadingly high AOT Inductor dashboard performance (#153060)
Fixes misleadingly high AOTInductor performance benchmark numbers in scenarios where a model updates internal parameters during `torch.export.export`. Since `FakeTensorMode` is enabled during export, all such parameters become `FakeTensor`s, slowing down future eager-mode runs using that model substantially. This, in turn, causes misleading performance stats, where the slowness of eager mode makes `AOTInductor` look _very_ good.
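
A hypothetical diagnostic for the issue described above; the helper below is not part of the fix, just one way to spot `FakeTensor` parameters left behind by export:

```python
import torch
from torch._subclasses.fake_tensor import FakeTensor

def has_fake_params(model: torch.nn.Module) -> bool:
    # True if any parameter or buffer is a FakeTensor, which would make
    # subsequent eager-mode benchmark runs artificially slow.
    return any(
        isinstance(t, FakeTensor)
        for t in list(model.parameters()) + list(model.buffers())
    )
```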

An [example benchmark](https://hud.pytorch.org/benchmark/timm_models/inductor_aot_inductor?dashboard=torchinductor&startTime=Wed%2C%2030%20Apr%202025%2015%3A54%3A04%20GMT&stopTime=Wed%2C%2007%20May%202025%2015%3A54%3A04%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(h100)&lBranch=main&lCommit=1dd36ad2d440a4f3faf724b3a8e13925e3180c24&rBranch=main&rCommit=cc7346bf19c019255dcb4484694a75850ed74d5a&model=convit_base) with this issue. The equivalent `cpp_wrapper` benchmark run shows a 2x performance gain, not 20x.

Only two benchmarks we regularly run are affected by this, both in the TIMM set.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153060
Approved by: https://github.com/desertfire
2025-05-13 20:59:59 +00:00
a13c8f2ecb [EZ/Profiler] Replace manual GIL calls with pybind GIL calls (#153415)
Summary: Use pybind11::gil_scoped_acquire instead of the old implementation, as it automatically takes care of error handling. In the original implementation we missed releasing the GIL on each possible error path, which could put the program in a deadlock.

Test Plan: Induced error manually and saw that GIL was released

Differential Revision: D74593564

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153415
Approved by: https://github.com/Skylion007, https://github.com/cyyever
2025-05-13 20:47:52 +00:00
5ff2cb8587 Add justknobs for static cuda launcher (#153400)
Summary:
This diff adds a justknobs check for static cuda launcher. In particular, it supports a fractional rollout where each mast job/version can be consistently enrolled in the config on or off.

It also adds a set_feature_use so we can track whether static cuda launcher is enabled on a given dynamo compile.

Test Plan: Existing unit tests. The justknobs in question are set to be disabled right now, so this diff does not launch the feature yet.

Differential Revision: D74599203

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153400
Approved by: https://github.com/oulgen
2025-05-13 20:10:13 +00:00
clr
20ba8fe7e6 inductor: Log a pt2 compile event + waitcounter for node fusing. (#153270)
This appears to be slow in production (potentially a quadratic explosion), and
logging this explicitly in pt2_compile_events and wait_counters makes it a lot easier to see how
bad an issue this is.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153270
Approved by: https://github.com/masnesral
2025-05-13 19:02:36 +00:00
8ac82a1d20 [dynamo] Add test to ensure we don't print fx graph upon data dependent graph break (#153416)
This adds a regression test for #149831, also as part of getting it
cherry-picked into 2.7.1.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153416
Approved by: https://github.com/atalman
2025-05-13 18:28:02 +00:00
9df9d9ded0 [device_mesh] replace dim_group_info with group_name (#150898)
As titled, there's no need to maintain a dim_group_info anymore; we can
simply maintain a list of group_names instead. This will simplify the
logic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150898
Approved by: https://github.com/tianyu-l, https://github.com/fegin
2025-05-13 17:16:45 +00:00
9c3cef437c gloo: support ibverbs in cmake (#153425)
This updates the gloo submodule in PyTorch to a version that supports the new ibverbs backend that can be used with PyTorch.

Test plan:

```
sudo dnf install rdma-core-devel
USE_GLOO_IBVERBS=ON python setup.py develop
torchrun --nproc_per_node 2 ~/scripts/gloo_ibverbs_test.py
```

```py
"""
run with:

torchrun --nproc_per_node 2 ~/scripts/gloo_ibverbs_test.py
"""

import os

os.environ["GLOO_DEVICE_TRANSPORT"] = "IBVERBS"

import torch
import torch.distributed as dist

dist.init_process_group("gloo")

rank = dist.get_rank()

if rank == 0:
    device = "cpu"
else:
    device = "cuda"

print(device)

t = torch.full((10, 100), fill_value=(rank+1), device=device)
target = torch.full((10, 100), fill_value=3, device=device)

dist.all_reduce(t)

torch.testing.assert_close(t, target)

t = torch.full((10, 100), fill_value=(rank+1), device=device)

if rank == 0:
    dist.send(t, dst=1)
else:
    dist.recv(t, src=0)
    torch.testing.assert_close(t, torch.full_like(t, 1))
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153425
Approved by: https://github.com/fduwjj
2025-05-13 17:09:00 +00:00
dde705864a Fix test broken by D73809989 (#153413)
Summary: I forgot to remove this unused field in D73809989.

Test Plan: `buck test 'fbcode//mode/opt' fbcode//caffe2/test:fbonly -- --exact 'caffe2/test:fbonly - test_compilation_metrics_logger_in_sync (caffe2.test.fb.test_fb.TestFBOnly)'`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153413
Approved by: https://github.com/c00w
2025-05-13 16:44:30 +00:00
216e28f7e9 [ca] run xfails up until their last passing backend (#153279)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153279
Approved by: https://github.com/jansel
ghstack dependencies: #153193, #153222
2025-05-13 16:42:10 +00:00
a80eb84a5f [ca] support higher order gradients (create_graph=True) (#153222)
Adds create_graph support if you don't compile or compile only with torch.compile(backend="eager").

Using a backend that uses AOTDispatch produces a post-dispatch AOT backward, where its double backward will be silently incorrect if the forward trace involved any ops that are not composite implicit.
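
A minimal sketch of the now-supported pattern (eager backend only, per the note above):

```python
import torch

torch._dynamo.config.compiled_autograd = True

@torch.compile(backend="eager")
def f(x):
    return (x ** 3).sum()

x = torch.randn(4, requires_grad=True)
(g,) = torch.autograd.grad(f(x), x, create_graph=True)  # first-order grad keeps a graph
(g2,) = torch.autograd.grad(g.sum(), x)                  # second-order gradient
```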

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153222
Approved by: https://github.com/jansel
ghstack dependencies: #153193
2025-05-13 16:42:09 +00:00
37efaf4af9 [ca][api] config api shouldn't error with optimize_assert (#153193)
Toggling on `torch._dynamo.config.compiled_autograd = True` was causing export to error (optimize_assert didn't have `rebuild_ctx` defined). Separately, add a way to `rebuild_ctx` for `optimize_assert`, since it is a public API.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153193
Approved by: https://github.com/jansel
2025-05-13 16:42:02 +00:00
a4459cd4e3 Remove property from python_type function (#152900)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152900
Approved by: https://github.com/amjames, https://github.com/anijain2305
ghstack dependencies: #153070
2025-05-13 16:26:25 +00:00
f67eb6f8c5 Fix path matching in CPythonTestCase/setUpClass (#153070)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153070
Approved by: https://github.com/amjames, https://github.com/anijain2305, https://github.com/Skylion007
2025-05-13 16:26:25 +00:00
c5ebc12f7f [ROCm] unskip test_non_standard_bool except for failing ops (#152956)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152956
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily
2025-05-13 15:55:42 +00:00
445d8fd77d [MemoryZ] Sync changes to internal page (#153166)
Summary:
For MTIA on-demand mode, since we are not using the torch module, the data upload happens in C++ and doesn't support pickle.
Thus, we store the data as JSON at the end and need to update the visualizer to support it.

Test Plan: Check Test plan in D74179606

Differential Revision: D74406209

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153166
Approved by: https://github.com/sraikund16
2025-05-13 15:35:10 +00:00
ea3eaf68bf Fix AOTI cpp tests (#153423)
The `Error in dlopen: /lib/x86_64-linux-gnu/libstdc++.so.6: version GLIBCXX_3.4.30 not found` error was caused by the cmake migration (the conda build probably has some extra link rules), while the `C++ exception with description "CUDA error: no kernel image is available for execution on the device` errors were caused by the fact that the tests were built for Maxwell but run on SM_86.

The remaining test was failing before, but was probably disabled.
TODOs:
 - Move build to the build step

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153423
Approved by: https://github.com/huydhn, https://github.com/cyyever
2025-05-13 15:25:03 +00:00
6b02e60838 [Intel GPU] Use user-friendly err msg in mm (#151655)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151655
Approved by: https://github.com/EikanWang
2025-05-13 15:13:21 +00:00
7fdd754136 [compile-time traces] Profile large missing gaps in compile time (#151256)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151256
Approved by: https://github.com/bdhirsh, https://github.com/masnesral, https://github.com/zou3519, https://github.com/jansel
2025-05-13 14:44:51 +00:00
ee096b89f6 Fix skipIfXpu and skipIfHpu disables tests when used on class (#151315)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151315
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-05-13 14:44:17 +00:00
d9ef1012db [PP] Optimize memory usage by releasing output memory earlier (#153383)
Considering `output_chunks` is only used for the last stage, we should not keep the outputs of each stage in memory; this will allow memory to be freed earlier.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153383
Approved by: https://github.com/Skylion007, https://github.com/kwen2501
2025-05-13 14:42:38 +00:00
f1de3f9f07 Rename "output_tensor" -> "out" in autotune_process.py (#153169)
Summary: This change is to support remote autotuning. I want to use all the same benchmarking utilities in select_algorithm.py. For remote autotuning, I'll reuse the TritonBenchmarkRequest class used for subprocess autotuning because it's already serializable. That class is also used in standard, in-process autotuning, but via TritonTemplateCaller.benchmark(), which sets the output_tensor param when calling the underlying TritonBenchmarkRequest. For remote, I'll be using the TritonBenchmarkRequest directly, so I want the parameter to be named 'out' to avoid "got an unexpected keyword argument 'out'".

Test Plan: Existing unit tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153169
Approved by: https://github.com/aorenste, https://github.com/eellison
2025-05-13 14:18:29 +00:00
9f98e37eb4 [Intel GPU] add tf32 support for matmul on XPU (#144240)
Support XPU TF32 matmul using torch.backends.mkldnn.allow_tf32; we will discuss in the future whether we need a new API to control matmul only. A hedged usage sketch is shown below.
~~Support xpu tf32 matmul using torch.set_float32_matmul_precision. For conv, check https://github.com/pytorch/pytorch/pull/137570
We decide not following torch.backends.cuda.matmul.allow_tf32 because this API actually calls setAllowTF32CuBLAS to set matmul_precison to high. We also avoid other related tf32 changes (i.e. in inductor) by not introducing new API.~~
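
A hedged usage sketch based on the flag named above (requires an XPU build and device; the flag path follows the commit text):

```python
import torch

torch.backends.mkldnn.allow_tf32 = True  # knob named in this PR

a = torch.randn(1024, 1024, device="xpu")
b = torch.randn(1024, 1024, device="xpu")
c = a @ b  # may now run in TF32 on supported Intel GPUs
```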

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144240
Approved by: https://github.com/EikanWang
2025-05-13 14:03:01 +00:00
ff039d39ec [Dynamo] Optimize dedupe region ancestor tracking (#152589)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152589
Approved by: https://github.com/anijain2305
ghstack dependencies: #152389, #152505, #152410, #152506, #152570, #152572
2025-05-13 12:17:59 +00:00
d0faa9985d [Dynamo] Fix typing in graph_deduplication.py (#152572)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152572
Approved by: https://github.com/Skylion007, https://github.com/anijain2305
ghstack dependencies: #152389, #152505, #152410, #152506, #152570
2025-05-13 12:17:59 +00:00
a415c9831f [Hierarchical Compile] Replace tracing alias and mutation check with dynamo impl (#152570)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152570
Approved by: https://github.com/anijain2305
ghstack dependencies: #152389, #152505, #152410, #152506
2025-05-13 12:17:59 +00:00
57dafb90ef [Hierarchical Compile] Take into account mutation deps in cycle detection (#152506)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152506
Approved by: https://github.com/anijain2305
ghstack dependencies: #152389, #152505, #152410
2025-05-13 12:17:59 +00:00
118192011e [Hierarchical Compile] Add mutation dependencies to topological sorting (#152410)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152410
Approved by: https://github.com/anijain2305
ghstack dependencies: #152389, #152505
2025-05-13 12:17:59 +00:00
3592cb52d9 [Hierarchical Compilation] Use universal flatten APIs (#152505)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152505
Approved by: https://github.com/anijain2305
ghstack dependencies: #152389
2025-05-13 12:17:59 +00:00
023a3dc69f [Hierarchical Compilation] Track node mutations (#152389)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152389
Approved by: https://github.com/anijain2305
2025-05-13 12:17:59 +00:00
edc2d539d1 torch.tensordot: performance improvements when contracting to a scalar. (#145936)
As per title.
Fixes https://github.com/pytorch/pytorch/issues/145731

Touches only compute. The CPU overhead can potentially be further reduced.

Before:
```python
In [3]: n = 512

In [4]: A = torch.rand(n, n)

In [5]: B = torch.rand(n, n)

In [6]: %timeit torch.tensordot(A, B, [[0, 1], [0, 1]])
2.04 ms ± 70 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [7]: %timeit torch.tensordot(A, B, [[0, 1], [1, 0]])
2.85 ms ± 191 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [8]: %timeit torch.tensordot(A, B, [[1, 0], [0, 1]])
2.9 ms ± 133 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [9]: %timeit torch.tensordot(A, B, [[1, 0], [1, 0]])
4.07 ms ± 262 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

After
```python
In [2]: n = 512

In [3]: A = torch.rand(n, n)

In [4]: B = torch.rand(n, n)

In [5]: %timeit torch.tensordot(A, B, [[0, 1], [0, 1]])
30.7 µs ± 2.51 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [6]: %timeit torch.tensordot(A, B, [[0, 1], [1, 0]])
141 µs ± 6.52 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [7]: %timeit torch.tensordot(A, B, [[1, 0], [0, 1]])
142 µs ± 4.03 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [8]: %timeit torch.tensordot(A, B, [[1, 0], [1, 0]])
62.8 µs ± 4.31 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145936
Approved by: https://github.com/albanD, https://github.com/ngimel
2025-05-13 10:57:30 +00:00
8d7dec6e92 Revert "[DSD] Don't pop tensors if they are on Meta device (#153185)"
This reverts commit 7243c69421cd0b868f3fa3b552c17e9c8b3023a1.

Reverted https://github.com/pytorch/pytorch/pull/153185 on behalf of https://github.com/jeanschmidt due to Seems to break internal signals, see [D74577069](https://www.internalfb.com/diff/D74577069) ([comment](https://github.com/pytorch/pytorch/pull/153185#issuecomment-2875662357))
2025-05-13 09:13:27 +00:00
cyy
9785b32189 Remove unused typing-extensions BUCK target (#153229)
This target is unused.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153229
Approved by: https://github.com/colesbury
2025-05-13 04:29:59 +00:00
cyy
15e08f9571 [submodule] Update ONNX to 1.18 (#152200)
Update ONNX to 1.18.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152200
Approved by: https://github.com/justinchuby, https://github.com/malfet
2025-05-13 04:18:45 +00:00
c4fb0b6f33 refresh expected results (#150166)
@huydhn when do you think we will have the APIs to access results on oss storage available so we do not
have to worry about this racing again?
Also is there a way to accelerate unstability in this after we land it?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150166
Approved by: https://github.com/bobrenjc93, https://github.com/eellison, https://github.com/anijain2305
2025-05-13 04:04:42 +00:00
483bbb639a [CI] Collect accuracy for MPS inductor benchmarks (#153443)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153443
Approved by: https://github.com/atalman
2025-05-13 03:49:28 +00:00
36722c287f [cutlass backend] make compile name independent of command (#153388)
Differential Revision: D74291603

The goal is to reuse the kernels as much as possible.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153388
Approved by: https://github.com/ColinPeppler
2025-05-13 03:49:24 +00:00
29c8ae825f [OpenReg] Move SDPA to OpenReg from open_registration_extension.cpp (#153309)
As the title stated.

**Next Chages**:
- Migrate remaining functionality to OpenReg
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153309
Approved by: https://github.com/albanD
2025-05-13 03:49:19 +00:00
a6c5b59067 [MPSInductor] Fix multistage reduction suffixes (#153362)
By invalidating all variables created during the loop (except the contents of iterator_cache, since stores can happen inside the reduction loop) and clearing the `IteratorRangeEntry` codegen cache.

This results in the following kernel for `x / x.sum()` when x has 2048 elements and the max thread group size is 1024:
```metal
[[max_total_threads_per_threadgroup(1024)]]
kernel void generated_kernel(
    device half* out_ptr1,
    constant half* in_ptr0,
    uint2 thread_pos [[thread_position_in_grid]],
    uint2 group_pos [[thread_position_in_threadgroup]]
) {
    auto xindex = thread_pos.x;
    auto r0_index = thread_pos.y;
    threadgroup float tmp_acc_0[32];
    float tmp_acc_1 = 0;
    for(auto r0_0_cnt = 0; r0_0_cnt < 2; ++r0_0_cnt) {
        int r0_0 = 2 * r0_index + r0_0_cnt;
        auto tmp0 = static_cast<float>(in_ptr0[r0_0]);
        tmp_acc_1 += tmp0;
    }
    auto tmp1 = c10::metal::threadgroup_sum(tmp_acc_0, tmp_acc_1, r0_index * 1, 1024);
    for(auto r0_0_cnt = 0; r0_0_cnt < 2; ++r0_0_cnt) {
        int r0_0 = 2 * r0_index + r0_0_cnt;
        auto tmp2 = static_cast<float>(in_ptr0[r0_0]);
        auto tmp3 = tmp2 / tmp1;
        out_ptr1[r0_0] = static_cast<half>(tmp3);
    }
}
```

Fixes the compilation errors reported while running `GPUTests.test_pattern_matcher_multi_user_mps` and `GPUTests.test_weight_norm_bwd_mps`.

Fixes https://github.com/pytorch/pytorch/issues/152155

Inductor tests are still failing, though; the variable invalidation needs further refinement.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153362
Approved by: https://github.com/manuelcandales, https://github.com/dcci, https://github.com/jansel
2025-05-13 03:07:53 +00:00
27e9d9b103 [c10d][fr] Add try catch to update entry due to cuda error (#153414)
During the FR dump, for unknown reasons, we see CUDA errors when querying events, and this fails the entire FR dump (when trying to get entries). So we use a try-catch instead of letting it fail the whole process.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153414
Approved by: https://github.com/d4l3k
2025-05-13 01:10:00 +00:00
8b507a9809 convert guard_size_oblivious to runtime check in infer_size_impl (#148872)
It's OK to check the requirement `numel == newsize` at runtime in the unbacked case, instead of checking at compile time and assuming it's true.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148872
Approved by: https://github.com/bobrenjc93
2025-05-13 00:32:28 +00:00
0cf61ca7e4 make use_mem_pool threadlocal (#153356)
Partial fix for #152861: makes allocation to a pool thread-local, but doesn't touch the second bug where multiple threads allocating to multiple pools error out.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153356
Approved by: https://github.com/Skylion007, https://github.com/eellison
2025-05-13 00:16:07 +00:00
d5d26ce436 Enable accelerator to perform streaming backward (#153412)
Also see https://github.com/pytorch/pytorch/pull/142097
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153412
Approved by: https://github.com/albanD
ghstack dependencies: #151079
2025-05-13 00:02:24 +00:00
71c8231742 fix bug with TORCHINDUCTOR_DUMP_LAUNCH_PARAMS (#153066)
Summary:
https://fb.workplace.com/groups/1028545332188949/posts/9503194033132340/?comment_id=9504669536318123&reply_comment_id=9506405459477864&notif_id=1746154132646897&notif_t=work_group_comment_mention

Aligns the arguments for the triton inputs

Differential Revision: D74085173

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153066
Approved by: https://github.com/jansel
2025-05-12 23:56:49 +00:00
641e4bee67 Revert "[export][cond] support merging constant ints as unbacked symint (#152742)"
This reverts commit a805911d15f0da0b3b07203d5cb727e84ef40cf0.

Reverted https://github.com/pytorch/pytorch/pull/152742 on behalf of https://github.com/ydwu4 due to breaking trunk ([comment](https://github.com/pytorch/pytorch/pull/152742#issuecomment-2874410372))
2025-05-12 23:06:33 +00:00
a87e810980 add needs_contiguous_strides tag (#153399)
Summary:
The padding operations could lead to non-contiguous tensors, which will fail the test in `reduce_scatter_tensor`: https://fburl.com/code/5wt5xkig

The `needs_contiguous_strides` tag tells inductor that `reduce_scatter_tensor` needs contiguous inputs, so it will not execute padding operations.

Test Plan:
W/o the tag, job failed on the check:
https://www.internalfb.com/mlhub/pipelines/runs/mast/aps-rebase_sanity_check_256bs_8t-fc398c39d3?job_attempt=0&version=0&tab=summary&env=PRODUCTION

With this tag, previously failed job succeeded:
https://www.internalfb.com/mlhub/pipelines/runs/mast/aps-rebase_sanity_128bs_8t_i10_tag-2ed5b05276?job_attempt=11&version=0&tab=summary&env=PRODUCTION

Differential Revision: D74598810

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153399
Approved by: https://github.com/fmassa
2025-05-12 23:03:56 +00:00
f05b38aa26 [BE]: Improve decorator typing for Optimizer subclasses (#153374)
Improves typing so that all the optimizer subclasses (all of which override step) do not erase their type signature when this decorator is used. Now **kwargs values and return types will propagate.

This complements @tsunghsienlee's PR #153367, as the type signature of step() was being erased on all the optimizer subclasses by this untyped decorator. A sketch of the typing pattern is shown below.
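
A sketch of the general typing pattern (Python >= 3.10 for `ParamSpec`); this is illustrative, not the actual decorator in torch.optim:

```python
import functools
from typing import Callable, ParamSpec, TypeVar

P = ParamSpec("P")
R = TypeVar("R")

def wraps_step(fn: Callable[P, R]) -> Callable[P, R]:
    # The wrapper keeps the wrapped step()'s parameters and return type
    # instead of erasing them to Any.
    @functools.wraps(fn)
    def wrapper(*args: P.args, **kwargs: P.kwargs) -> R:
        return fn(*args, **kwargs)
    return wrapper
```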

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153374
Approved by: https://github.com/janeyx99, https://github.com/tsunghsienlee
2025-05-12 22:55:25 +00:00
b0f2891e43 [AOTInductor] Fix clang-tidy warnings in wrapper (#153197)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153197
Approved by: https://github.com/desertfire
2025-05-12 22:35:59 +00:00
3ff22fe2df [BE]: Use shutil.which in inductor codegen (#153377)
Use shutil.which instead of subprocess: it is more secure, has better error handling, and is more cross-platform.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153377
Approved by: https://github.com/albanD
2025-05-12 22:11:26 +00:00
dbb4444ce3 [Memento] Add PT2 to Memory Snapshot (#152707)
Summary:
To add PT2 information to memory snapshot we piggyback off of the Kineto implementation using record_function similar to adding the user annotations. To do this we add the following:

1. Stack implementation that we instantiate to keep track of which compile context stack we are currently in (top element of the stack). The stack will be per device and thread-local since different threads of a process can be in different compile contexts at a given time. For this reason, we do not need to add mutexes to our stack impl since no two threads will touch a given stack
2. RecordFunction hooks to properly pipe the correct events to the compile context stack. These hooks are similar to the annotation ones in the fact that we just register them lazily and DO NOT unregister them. This is done out of convenience. In the future, we should save the handles and unregister them to minimize overhead after profiling is finished. As of now, we are registering this at the FUNCTION scope which is wide; however, we treat any function that does not start with "Torch-Compiled Region" as a no-op so we anticipate the difference in performance to be negligible during and after profiling. We also hide this feature behind a flag set to off on default so existing jobs will be unaffected
3. Piping for compile context to pickle output

Test Plan:
In D74039793, we add CompileContext to the visualizer and we see the following {F1977654658}

Differential Revision: D74028214

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152707
Approved by: https://github.com/eqy
2025-05-12 21:12:51 +00:00
f78e4529a9 Rewrite autograd producer consumer stream sync logic (#151079)
Also see previous work https://github.com/pytorch/pytorch/pull/142097

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151079
Approved by: https://github.com/albanD
2025-05-12 21:07:16 +00:00
f136046919 Clean up right nav (#153090)
- Move community and language binding links to the horizontal bar
- Add an intro to the community page.
- Fix the link in the ogp_image
- Fix the link in the version switcher
- Clean up unneeded links

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153090
Approved by: https://github.com/albanD
2025-05-12 21:00:45 +00:00
a805911d15 [export][cond] support merging constant ints as unbacked symint (#152742)
@pianpwk points out that this will be helpful to address several data dependent issues in huggingface [models](e23705e557/src/diffusers/schedulers/scheduling_euler_ancestral_discrete.py (L332)) with the following pattern:
```python
idx = 0 if u0 else 1
return x[idx]
```
We could preserve the conditional with a cond.
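
A hedged sketch of that shape, using the public `torch.cond` higher-order op (`u0` is assumed to be a boolean tensor here):

```python
import torch

def true_fn(x):
    return x[0]

def false_fn(x):
    return x[1]

def pick(x, u0):
    # Keeps the data-dependent choice as a cond instead of indexing with an unbacked value.
    return torch.cond(u0, true_fn, false_fn, (x,))

out = pick(torch.randn(2, 3), torch.tensor(True))
```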

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152742
Approved by: https://github.com/zou3519
2025-05-12 20:26:31 +00:00
88a068f33b [2/n][Optimus][Auto-AC] Support activation quantization with scaling (#151770)
Summary:
Previously, we only supported non-scaling quantization, which may lead to overflow; here we support scaling quantization and set it as the default version.

Here, we quantize activation nodes based on size_in_mb; the default value is 100, i.e., as long as the node is at least 100 MB in size, we will quantize it.

Test Plan:
### how to enable

```
    torch._inductor.config.post_grad_fusion_options = {
        "activation_quantization_aten_pass": {
            "quant_type": "torch.float8_e5m2", -> default is this type to quantize, you can change the type
            "use_scaling": False,  -> default is False, if you want to use scaling verison, set it to True
            "size_in_mb": 0.0,  -> default is 100, you can tune the value.
             "exclude_primals": False, -> whether want to exclude quantize parameters, default is False
              "allowed_dtypes": "torch.float16;torch.bfloat16;torch.float32", -> dtype you consider to quant, use ";" to separate, default is torch.bfloat16
        },
    }
```

### toy model

```
buck2 run mode/opt //scripts/qyz/autoac:quantization
```

```
Epoch [80/200], Loss: 19227.2109
Epoch [100/200], Loss: 1353.5272
Epoch [120/200], Loss: 38630.6758
Epoch [140/200], Loss: 6239.9155
Epoch [160/200], Loss: 6039.1567
Epoch [180/200], Loss: 3994.3569
Epoch [200/200], Loss: 146.3966
```

Differential Revision: D73015996

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151770
Approved by: https://github.com/Mingming-Ding
2025-05-12 19:43:18 +00:00
45df18dcd0 [BE]: Enable ruff rule TC007 (#153394)
Enables [TC007](https://docs.astral.sh/ruff/rules/unquoted-type-alias/#unquoted-type-alias-tc007), which finds type aliases that should be quoted if they have to interact with `if TYPE_CHECKING` blocks.

We disabled it when we updated ruff, but really should only have disabled TC006, as that is the one that is going to cause some codebase-wide changes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153394
Approved by: https://github.com/albanD
2025-05-12 19:18:29 +00:00
fb85ebd710 [BE]: Use undocumented temp shim to restore setuptools compat (#153052)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153052
Approved by: https://github.com/albanD
2025-05-12 18:33:41 +00:00
3555ebb63d [BE]: Update ruff to 0.11.8 (#153249)
Fixes a ton of false negatives throughout the codebase. RUFF also properly validates NOQA comments now and most of the changes are fixing typos there or removing filewide flake8 suppressions that were also silencing ruff issues.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153249
Approved by: https://github.com/cyyever, https://github.com/albanD, https://github.com/seemethere
2025-05-12 18:30:52 +00:00
5c3fddb9cc Revert "[Hierarchical Compilation] Track node mutations (#152389)"
This reverts commit c2936ebfd58be7a6519f51d165dfac8407020140.

Reverted https://github.com/pytorch/pytorch/pull/152389 on behalf of https://github.com/jeanschmidt due to Humm, interesting, there seems to be a bug in stack PRs, as it should be part of the stack and be reverted with the other ones ([comment](https://github.com/pytorch/pytorch/pull/152389#issuecomment-2873540451))
2025-05-12 18:18:44 +00:00
e1d03fa251 [Inductor] Optimize grid calculation by using // instead of FloorDiv (#153230)
https://github.com/pytorch/pytorch/pull/146942 introduced an 8.3% regression on the `benchmark_torchbench_run_bert_pytorch_training:defaults-speedup-x1000` perf metric. This was flagged by internal CI testing (task T223596372).

The root cause seems to be that `FloorDiv` is now used to calculate the launch grid in certain scenarios, which is slower than the previously-used `//`. Since launch grid calculations happen at runtime, they can have a significant performance impact on some models.

The reason for switching to `FloorDiv` in https://github.com/pytorch/pytorch/pull/146942 was to allow the FX backend to generate runnable Python code. `FloorDiv(x, y)` maps to `x // y` in Python, whereas `sympy.floor(sympy.Rational(x,y))` maps to `floor(x/y)`, which crashes as FX doesn't know what `floor` is.

To get the best of both worlds, this PR reverts to using `//` to calculate launch grids, but then post-processes the resulting sympy expressions in the FX converter, converting `floor(x / y)` to `FloorDiv(x, y)`. Since this sympy manipulation happens at compile time, the perf impact should minimal, and should only affect the FX backend. This is similar to the approach previously explored in https://github.com/pytorch/pytorch/pull/151144, but the implementation is more minimal and self-contained.
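
A minimal sketch of that post-processing step, assuming PyTorch's `FloorDiv` helper from `torch.utils._sympy.functions` (this is not the converter's actual code):

```python
import sympy
from torch.utils._sympy.functions import FloorDiv

def floor_to_floordiv(expr: sympy.Expr) -> sympy.Expr:
    # Rewrite floor(p/q) into FloorDiv(p, q) so generated Python uses `//`.
    def repl(arg):
        p, q = arg.as_numer_denom()
        return FloorDiv(p, q) if q != 1 else sympy.floor(arg)
    return expr.replace(sympy.floor, repl)

x, y = sympy.symbols("x y", positive=True, integer=True)
grid = sympy.floor((x + 31) / y)
print(floor_to_floordiv(grid))  # floor((x + 31)/y) becomes FloorDiv(x + 31, y)
```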

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153230
Approved by: https://github.com/jansel
2025-05-12 18:08:52 +00:00
498f364518 Fix test_fused_scaled_matmul_reduce_scatter when scatter_dim is 0 (#153286)
The function signature of fused_scaled_matmul_reduce_scatter was changed. This PR fixes the function signature. However when scatter_dim is 1, the two outputs are not close. We need a followup on this.

Another followup is to change fused_scaled_matmul_reduce_scatter to make those newly added arguments optional. Users shouldn't need these arguments if they don't flatten the inputs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153286
Approved by: https://github.com/kwen2501
2025-05-12 17:38:49 +00:00
7e1790d86b [xla hash update] update the pinned xla hash (#153368)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153368
Approved by: https://github.com/pytorchbot
2025-05-12 17:11:23 +00:00
dc47295dc5 [Inductor UT][Break XPU] Generalize newly added device-bias code in Inductor UT. (#153355)
Fixes #153123

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153355
Approved by: https://github.com/desertfire, https://github.com/Skylion007
2025-05-12 15:53:05 +00:00
ea4b65ab60 Fix the type hint of step() with default value (#153367)
Summary: Because the default value of `closure` is `None`, this fixes the situation where `step()` is called with no arguments. The previous typing (https://github.com/pytorch/pytorch/pull/102593) could only be used as `step(closure=None)` and `step(None)`.
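
A minimal sketch of the corrected annotation (illustrative class, not torch.optim code):

```python
from typing import Callable, Optional

class _Optim:
    # `closure` defaults to None, so both `opt.step()` and `opt.step(closure)` type-check.
    def step(self, closure: Optional[Callable[[], float]] = None) -> Optional[float]:
        return None if closure is None else closure()
```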

Test Plan: contbuild & OSS CI

Differential Revision: D74560785

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153367
Approved by: https://github.com/cyyever, https://github.com/Skylion007, https://github.com/janeyx99
2025-05-12 15:52:59 +00:00
de5c5f4fb7 Opt-out LF runners from of inductor jobs (#153151)
Opt-out of inductor jobs for the lf experiment configuration.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153151
Approved by: https://github.com/seemethere
2025-05-12 15:52:53 +00:00
89aa6eb19b Stop codegen-ing post_grad_custom_pass in repros (#153243)
When codegen'ed, it looks like:
```py
post_grad_custom_pass = <object at 0x12345678>
```
Which is not runnable at all. Some logic is also trying to deepcopy the
object, and not all of these objects are deepcopy-able.

This PR skips codegenning of these passes.
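
For context, a hedged sketch of what such a pass object looks like when registered (the pass body is a no-op placeholder):

```python
import torch
from torch._inductor import config as inductor_config

def my_post_grad_pass(graph: torch.fx.Graph) -> None:
    # Placeholder: walk the post-grad FX graph and (optionally) rewrite nodes.
    for node in graph.nodes:
        pass

# Registering it stores a bare Python object on the config, which is what repro
# codegen previously tried (and failed) to emit and deepcopy.
inductor_config.post_grad_custom_post_pass = my_post_grad_pass
```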

Test Plan:
- new test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153243
Approved by: https://github.com/houseroad
2025-05-12 15:21:11 +00:00
7657d80a58 [aoti] when generating example input shapes, use unbacked replacements (#153220)
## Context
Suppose we have this graph like this :
```
a: "[s1 + u2, 200]"
b: "[u0, 32]"
cat: "[s1 + u2, 232]" = torch.cat([a, b], dim=1)
```

NOTE: torch.cat assumes "all tensors must either have the same shape (except in the concatenating dimension) or be a 1-D empty tensor with size (0,)."

So, we would expect u0 = s1 + u2 which is guarded on today except it's a deferred runtime assertion since unbacked symints aren't replaced today as Pian.

Notice how a  has a different symbolic shape than both b and cat. Today, this will create an unexpected shape mismatch when AOTI autotunes. Here's a rough illustration where 8192 is the unbacked symint fallback value.

```
# s1 is an arbitrary integer
a = generate_example_value(size=(s1 + 8192, 200))
b = generate_example_value(size=(8192, 32))
out = generate_example_value(size=(s1 + 8192, 232))
triton_cat.run(a, b, out ...)
```

## Error
```
wrapper.py:1484: <module>: block: [443,0,0], thread: [53,0,0] Assertion `index out of bounds: 0 <= tl.broadcast_to(tmp13, [XBLOCK]) < ks0` failed.
...
wrapper.py:1484: <module>: block: [443,0,0], thread: [55,0,0] Assertion `index out of bounds: 0 <= tl.broadcast_to(tmp13, [XBLOCK]) < ks0` failed.

RuntimeError: CUDA error: device-side assert triggered
```

Differential Revision: [D74485962](https://our.internmc.facebook.com/intern/diff/D74485962)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153220
Approved by: https://github.com/desertfire
2025-05-12 15:20:57 +00:00
1c659b5bc0 [BE]: Use more portable shutil.which call for cpp_builder (#153325)
We should be using shutil.which instead of calling some binary subprocess here for portability and security.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153325
Approved by: https://github.com/xuhancn, https://github.com/cyyever, https://github.com/albanD
2025-05-12 15:15:21 +00:00
78d752e96a Revert "[Hierarchical Compilation] Use universal flatten APIs (#152505)"
This reverts commit f9e3a9058e80fde310e5f0919d3a21e28cd024a8.

Reverted https://github.com/pytorch/pytorch/pull/152505 on behalf of https://github.com/jeanschmidt due to [TENTATIVE] reverting to check if reverting this stack partially caused the introduction of https://github.com/pytorch/pytorch/actions/runs/14966121510/job/42049638969#step:22:875 ([comment](https://github.com/pytorch/pytorch/pull/152505#issuecomment-2872869990))
2025-05-12 14:48:08 +00:00
cb35a2b15d Add missing in-place on view check to custom autograd.Function (#153094)
Fixes https://github.com/pytorch/pytorch/issues/152773

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153094
Approved by: https://github.com/albanD
ghstack dependencies: #153005
2025-05-12 14:42:46 +00:00
a67dd2083c [dynamo] Guard serialization for SHAPE_ENV (#153258)
Differential Revision: [D74483150](https://our.internmc.facebook.com/intern/diff/D74483150/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153258
Approved by: https://github.com/jansel
ghstack dependencies: #153255, #153256, #153257
2025-05-12 14:42:01 +00:00
e2f6870c98 [dynamo] Guard serialization for DEFAULT_DEVICE (#153257)
Differential Revision: [D74483147](https://our.internmc.facebook.com/intern/diff/D74483147/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153257
Approved by: https://github.com/jansel
ghstack dependencies: #153255, #153256
2025-05-12 14:42:00 +00:00
ef1dcc21ee [dynamo] Guard serialization for global state guards (GRAD_MODE, DETERMINISTIC_ALGORITHMS, TORCH_FUNCTION_STATE, FSDP_TRAINING_STATE) (#153256)
serialization for global state guards.

Differential Revision: [D74483149](https://our.internmc.facebook.com/intern/diff/D74483149/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153256
Approved by: https://github.com/jansel
ghstack dependencies: #153255
2025-05-12 14:41:53 +00:00
0210986cc4 [dynamo] Guard serialization for EMPTY_NN_MODULE_HOOKS_DICT (#153255)
EMPTY_NN_MODULE_HOOKS_DICT

Differential Revision: [D74483148](https://our.internmc.facebook.com/intern/diff/D74483148/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153255
Approved by: https://github.com/jansel
2025-05-12 14:41:44 +00:00
daca611465 Revert "[ROCm] Maxpool forward NHWC Perf Improvement targeting Resnet scenarios (#151727)"
This reverts commit 5683965f02c4091a864484917f74e3a42c9c56ae.

Reverted https://github.com/pytorch/pytorch/pull/151727 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/151727#issuecomment-2872361816))
2025-05-12 12:29:28 +00:00
8511d21081 Revert "Forward fix #151727 (#153306)"
This reverts commit 64518ca7420271562c4920c13c44221c54e534df.

Reverted https://github.com/pytorch/pytorch/pull/153306 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/153306#issuecomment-2872339570))
2025-05-12 12:22:13 +00:00
23ecd35a96 Update slow tests (#151207)
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml).
Update the list of slow tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151207
Approved by: https://github.com/pytorchbot
2025-05-12 12:05:58 +00:00
47df195065 Revert "[Hierarchical Compile] Add mutation dependencies to topological sorting (#152410)"
This reverts commit bc8b305eb816106de31602f8b7fd80d4113e6ee8.

Reverted https://github.com/pytorch/pytorch/pull/152410 on behalf of https://github.com/jeanschmidt due to Breaking internal signal citadel-fbcode-test-mode-opt-for-pt2_stack_for_internal-linux-0 please see diff [D74531503](https://www.internalfb.com/diff/D74531503) for more details ([comment](https://github.com/pytorch/pytorch/pull/152410#issuecomment-2871168679))
2025-05-12 07:15:09 +00:00
0e36887209 Revert "[Hierarchical Compile] Take into account mutation deps in cycle detection (#152506)"
This reverts commit 779e647999645d19eebf01fa686fb792176f8940.

Reverted https://github.com/pytorch/pytorch/pull/152506 on behalf of https://github.com/jeanschmidt due to Breaking internal signal citadel-fbcode-test-mode-opt-for-pt2_stack_for_internal-linux-0 please see diff [D74531503](https://www.internalfb.com/diff/D74531503) for more details ([comment](https://github.com/pytorch/pytorch/pull/152410#issuecomment-2871168679))
2025-05-12 07:15:09 +00:00
53ebcabb52 Revert "[Hierarchical Compile] Replace tracing alias and mutation check with dynamo impl (#152570)"
This reverts commit 50df08eb5e4d9276b72929fd859ad892880bab0f.

Reverted https://github.com/pytorch/pytorch/pull/152570 on behalf of https://github.com/jeanschmidt due to Breaking internal signal citadel-fbcode-test-mode-opt-for-pt2_stack_for_internal-linux-0 please see diff [D74531503](https://www.internalfb.com/diff/D74531503) for more details ([comment](https://github.com/pytorch/pytorch/pull/152410#issuecomment-2871168679))
2025-05-12 07:15:09 +00:00
0071fdab9e Revert "[Dynamo] Fix typing in graph_deduplication.py (#152572)"
This reverts commit 15166be691454f8a0e626b54b6be0bea51938f86.

Reverted https://github.com/pytorch/pytorch/pull/152572 on behalf of https://github.com/jeanschmidt due to Breaking internal signal citadel-fbcode-test-mode-opt-for-pt2_stack_for_internal-linux-0 please see diff [D74531503](https://www.internalfb.com/diff/D74531503) for more details ([comment](https://github.com/pytorch/pytorch/pull/152410#issuecomment-2871168679))
2025-05-12 07:15:09 +00:00
aa7fe6af41 Revert "[Dynamo] Optimize dedupe region ancestor tracking (#152589)"
This reverts commit b5f1345f72ec6d1b004b05284e9553e65ee03abc.

Reverted https://github.com/pytorch/pytorch/pull/152589 on behalf of https://github.com/jeanschmidt due to Breaking internal signal citadel-fbcode-test-mode-opt-for-pt2_stack_for_internal-linux-0 please see diff [D74531503](https://www.internalfb.com/diff/D74531503) for more details ([comment](https://github.com/pytorch/pytorch/pull/152410#issuecomment-2871168679))
2025-05-12 07:15:09 +00:00
7243c69421 [DSD] Don't pop tensors if they are on Meta device (#153185)
DSD currently pops tensors if they are on the Meta device. This forbids the use case where users would like to let DCP directly initialize the tensors when loading.

This PR also removes test/distributed/checkpoint/e2e/test_pipeline.py which is based on the above feature that is not realistic and is not used anywhere.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153185
Approved by: https://github.com/mori360
2025-05-12 07:04:59 +00:00
032ef48725 [BE]: Add PEP621 project section to pyproject.toml (#153055)
Follow up to @ezyang's PR #153020 , but better uses PEP621 to reduce redundant fields and pass through metadata better to uv, setuptools, poetry and other tooling.

* Enables modern tooling like uv sync and better support for tools like poetry.
* Also allows us to set project-wide settings that are respected by linters and IDEs (in this example we are able to centralize the minimum supported Python version).
* Currently most of the values are dynamically fetched from setuptools; eventually we can migrate all the statically defined values to pyproject.toml, and they will be auto-populated in the setuptools arguments.
* This controls what additional metadata shows up on PyPI. Special URL names are listed here for rendering on PyPI: https://packaging.python.org/en/latest/specifications/well-known-project-urls/#well-known-labels

This also clearly shows us what fields will need to be migrated from setup.py to pyproject.toml over time per #152276. Static fields will be fairly easy to migrate; the dynamically built ones like requirements are a bit more challenging.

Without this, `uv sync` complains:
```
error: No `project` table found in: `pytorch/pyproject.toml`
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153055
Approved by: https://github.com/ezyang
2025-05-12 02:16:07 +00:00
ceb009baee [map] always turn on dynamo for map (#152041)
Summary:
X-link: https://github.com/pytorch/executorch/pull/10409

Reland D72896450

Make map consistent with other control flow ops. After the change, map is able to support accessing closures in the map fn.

Test Plan: See existing tests.

Reviewed By: zou3519

Differential Revision: D73138427

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152041
Approved by: https://github.com/zou3519
2025-05-12 02:10:08 +00:00
c5b4dc9898 [executorch hash update] update the pinned executorch hash (#152238)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152238
Approved by: https://github.com/pytorchbot, https://github.com/huydhn

Co-authored-by: Huy Do <huydhn@gmail.com>
2025-05-12 01:50:12 +00:00
930de01861 [Typing] Apply torch.types.Device in torch/cuda/memory.py (#153027)
Part of: #152952

Here is the definition of `torch.types.Device`:

ab997d9ff5/torch/types.py (L74)

It contains `int`, so the `int` in `Union[Device, int]` is redundant.
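An illustrative before/after of the kind of annotation this PR simplifies (the function name here is only an example, not necessarily one touched by the PR):

```python
from typing import Union

from torch.types import Device  # Union[torch.device, str, int, None]

# Before: the extra int is redundant because Device already includes it.
def peak_memory_before(device: Union[Device, int] = None) -> None: ...

# After: the simplified annotation accepts exactly the same types.
def peak_memory_after(device: Device = None) -> None: ...
```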

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153027
Approved by: https://github.com/Skylion007
2025-05-11 23:32:59 +00:00
0104ac0f6f [Ez][BE]: Fix click ImportError in torch/csrc/jit (#153323)
Fixes an unnecessary import for TorchScript. Unblocks #153020, which appears to fix the circular-import linter so that it imports every Python file under torch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153323
Approved by: https://github.com/ngimel, https://github.com/cyyever
2025-05-11 19:16:01 +00:00
c51bdf5acf [export] Exporter API prototype. (#153205)
Summary: see inline code comments for documentation

Test Plan:
CI

buck2 test --flagfile fbcode//mode/opt fbcode//caffe2/test:test_export -- -r TestPackage

Differential Revision: D74426900

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153205
Approved by: https://github.com/tugsbayasgalan
2025-05-11 14:20:09 +00:00
909ec495b8 [audio hash update] update the pinned audio hash (#153301)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153301
Approved by: https://github.com/pytorchbot
2025-05-11 03:47:56 +00:00
1f5cf19f56 [cutlass backend] Use src code to generate cutlass gemm name (#153006)
This shaves off 40s for at least small cases, since we don't have to recompile the kernel again.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153006
Approved by: https://github.com/mlazos
2025-05-11 00:57:03 +00:00
64518ca742 Forward fix #151727 (#153306)
#151727 is failing internally with the following error `error: suggest braces around initialization of subobject [-Werror,-Wmissing-braces]`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153306
Approved by: https://github.com/eqy, https://github.com/cyyever, https://github.com/wdvr
2025-05-11 00:39:59 +00:00
fdc387ec7c Revert "refine fp32 precision api (#125888)"
This reverts commit 4c11b26158691cfd9ad48338ddebd1ca9bded788.

Reverted https://github.com/pytorch/pytorch/pull/125888 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to cause some failures on ROCm ([comment](https://github.com/pytorch/pytorch/pull/125888#issuecomment-2869274791))
2025-05-11 00:35:46 +00:00
e4f22822cb Revert "Cleanup VS 2019 refs in pytorch (#145863)" (#152613)
This reverts commit b45e6fa707ced2adb68eaf1a2c1ccb389a6283d7.

revert PRs:
https://github.com/pytorch/pytorch/pull/145863
https://github.com/pytorch/pytorch/pull/145319

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152613
Approved by: https://github.com/atalman, https://github.com/malfet
2025-05-10 19:33:26 +00:00
4f068598c4 [BE] Delete now unused mac-mps.yml (#153263)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153263
Approved by: https://github.com/Skylion007, https://github.com/cyyever
ghstack dependencies: #153013, #153057, #152719
2025-05-10 19:10:41 +00:00
d22c40373f [Ez][BE]: Fix KeyError LOGNAME (#153324)
Unblocks #153020, which incidentally improves the CircularImportLinter to check all Python files. This script doesn't set LOGNAME, so it errors; another FSDP script already defaults LOGNAME to '' if it is not specified, and this does the same.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153324
Approved by: https://github.com/awgu
2025-05-10 18:23:38 +00:00
6a84fe65ec Fix code portability when looking for Dot (#153259)
When trying to plot a trace graph, Inductor checks if "dot" is installed. Currently, the code runs a "which dot" command.

By default, Windows doesn't have the "which" command. This patch replaces it with a more portable alternative.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153259
Approved by: https://github.com/Skylion007
2025-05-10 16:12:44 +00:00
01cbf5a30a [AOTInductor] Add wrapper and kernel code to debug code logging (#153181)
This is a simple PR to make the AOTInductor wrapper and kernel code get output by `TORCH_COMPILE_DEBUG=1`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153181
Approved by: https://github.com/desertfire
2025-05-10 15:31:18 +00:00
01bb249978 Revert "has_triton: Use the device interface for detecting Triton availability (#139171)"
This reverts commit 48bfe9afc70a98addd5aa738bf501c029e4a9285.

Reverted https://github.com/pytorch/pytorch/pull/139171 on behalf of https://github.com/masnesral due to Performance regression for huggingface ([comment](https://github.com/pytorch/pytorch/pull/139171#issuecomment-2868939790))
2025-05-10 14:46:23 +00:00
70c8047c2d include user stacks with constraint violation error message (#152924)
Fixes #152918

Before:

```
File "/data/users/bobren/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 5588, in produce_guards_verbose
    raise ConstraintViolationError(
torch.fx.experimental.symbolic_shapes.ConstraintViolationError: Constraints violated (L['x'].size()[0])! For more information, run with TORCH_LOGS="+dynamic".
  - You marked L['x'].size()[0] as dynamic but your code specialized it to be a constant (5). Either remove the mark_dynamic or use a less strict API such as maybe_mark_dynamic or Dim.AUTO.
```

After:

```
File "/data/users/bobren/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 5588, in produce_guards_verbose
    raise ConstraintViolationError(
torch.fx.experimental.symbolic_shapes.ConstraintViolationError: Constraints violated (L['x'].size()[0])! For more information, run with TORCH_LOGS="+dynamic".
  - You marked L['x'].size()[0] as dynamic but your code specialized it to be a constant (5). Either remove the mark_dynamic or use a less strict API such as maybe_mark_dynamic or Dim.AUTO.

User stack:
  File "/home/bobren/local/a/pytorch/error.py", line 5, in foo
    return torch.randn(5) * x
```
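A minimal sketch of the kind of program that triggers this error, reconstructed from the user stack above (the surrounding file layout is an assumption):

```python
import torch

def foo(x):
    # Multiplying by a fixed-size tensor specializes x's first dimension to 5,
    # which conflicts with mark_dynamic below.
    return torch.randn(5) * x

x = torch.randn(5)
torch._dynamo.mark_dynamic(x, 0)  # mark dim 0 as dynamic
torch.compile(foo)(x)  # expected to raise ConstraintViolationError with the user stack
```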

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152924
Approved by: https://github.com/pianpwk
2025-05-10 13:36:47 +00:00
4c11b26158 refine fp32 precision api (#125888)
Based on the [conversation](https://github.com/pytorch/pytorch/issues/121791), we plan to drop "highest, high, medium" as the way to represent fp32 internal computation data types. Instead, we will directly use the algorithm name to represent it.

### Design Choice: Directly use algorithm names like "TF32", "BF16".
#### Pros
 - The names are more informative: 'tf32' conveys more than a generic "high".
 - Easier to extend with new algorithms like `tf32x3`.
#### Cons
 - "HIGHEST, HIGH, MEDIUM" indicated the relative precision between different algorithms. However, we can have more documents to discuss them.

### We provide a layered structure for backends/operators.
('f32' is short for 'fp32_precision')
![image](https://github.com/user-attachments/assets/f89143e5-d6a1-4865-9351-9a50439f5067)

### The following fp32 compute precisions can be set:
 - **"ieee"**: Not allowed to use any other internal computation data types.
 - **"tf32"**: Allowed to use tf32 as the internal computation data type.
 - **"bf16"**: Allowed to use bf16 as the internal computation data type.
 - **"none"**: Precision is not set and can be overridden by its parent node.

### Overriding Precision Settings
A child node is overridden by its parent node if it is set to the default.
For current default settings:
```
backend = generic, op = all, precision setting = none
    backend = cuda, op = all, precision setting = none
        backend = cuda, op = conv, precision setting = tf32
        backend = cuda, op = rnn, precision setting = tf32
        backend = cuda, op = matmul, precision setting = none
    backend = matmul, op = all, precision setting = none
        backend = matmul, op = conv, precision setting = none
        backend = matmul, op = rnn, precision setting = none
        backend = matmul, op = matmul, precision setting = none
```
 - If the user sets `torch.backends.mkldnn.fp32_precision="bf16"`, its child nodes `torch.backends.mkldnn.matmul.fp32_precision` / `torch.backends.mkldnn.conv.fp32_precision` / `torch.backends.mkldnn.rnn.fp32_precision` will also be overridden to "bf16".
 - If the user sets `torch.backends.fp32_precision="bf16"`, `torch.backends.mkldnn.fp32_precision` and its child nodes will also be overridden to "bf16" (a short sketch follows this list).
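A hedged sketch of the override behavior, using the attribute names proposed in this PR (the API was introduced here and may change):

```python
import torch

# Assumption: the fp32_precision attributes exist as described in this PR.
# Set the parent precision; children left at "none" are expected to inherit it.
torch.backends.mkldnn.fp32_precision = "bf16"

# Per the override rules above, these are expected to report "bf16":
print(torch.backends.mkldnn.matmul.fp32_precision)
print(torch.backends.mkldnn.conv.fp32_precision)
print(torch.backends.mkldnn.rnn.fp32_precision)
```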

### Backward Compatibility
Since the new API allows users more fine-grained control, there will be some conflicts. For example, the previous `torch.backends.cudnn.allow_tf32` is not enough to represent a state such as `torch.backends.cudnn.rnn.fp32_precision="ieee"` combined with `torch.backends.cudnn.conv.fp32_precision="tf32"`. Therefore, our goals for backward compatibility are:
 - If the user only uses the previous APIs, they will work as before.
 - If the user uses the **new** API to change the state to one that is **un-representable** by the old API and then tries to access the state via the **old** API, we will raise a RuntimeError and point the user to the documentation.

### Test Plan
```
python test/test_cuda.py -k test_fp32_precision_with_tf32
python test/test_cuda.py -k test_fp32_precision_with_float32_matmul_precision
python test/test_cuda.py -k test_invalid_status_for_legacy_api
python test/test_mkldnn.py -k test_mlkdnn_get_set
python test/test_mkldnn.py -k test_generic_precision
python test/test_mkldnn.py -k test_invalid
python test/test_mkldnn.py -k test_default_use_parent
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125888
Approved by: https://github.com/jgong5, https://github.com/albanD

Co-authored-by: Jiang, Yanbing <yanbing.jiang@intel.com>
2025-05-10 11:13:04 +00:00
b5f1345f72 [Dynamo] Optimize dedupe region ancestor tracking (#152589)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152589
Approved by: https://github.com/anijain2305
ghstack dependencies: #152389, #152505, #152410, #152506, #152570, #152572
2025-05-10 08:27:56 +00:00
15166be691 [Dynamo] Fix typing in graph_deduplication.py (#152572)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152572
Approved by: https://github.com/Skylion007, https://github.com/anijain2305
ghstack dependencies: #152389, #152505, #152410, #152506, #152570
2025-05-10 08:27:56 +00:00
50df08eb5e [Hierarchical Compile] Replace tracing alias and mutation check with dynamo impl (#152570)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152570
Approved by: https://github.com/anijain2305
ghstack dependencies: #152389, #152505, #152410, #152506
2025-05-10 08:27:45 +00:00
779e647999 [Hierarchical Compile] Take into account mutation deps in cycle detection (#152506)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152506
Approved by: https://github.com/anijain2305
ghstack dependencies: #152389, #152505, #152410
2025-05-10 08:27:31 +00:00
bc8b305eb8 [Hierarchical Compile] Add mutation dependencies to topological sorting (#152410)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152410
Approved by: https://github.com/anijain2305
ghstack dependencies: #152389, #152505
2025-05-10 08:27:19 +00:00
f9e3a9058e [Hierarchical Compilation] Use universal flatten APIs (#152505)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152505
Approved by: https://github.com/anijain2305
ghstack dependencies: #152389
2025-05-10 08:27:07 +00:00
c2936ebfd5 [Hierarchical Compilation] Track node mutations (#152389)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152389
Approved by: https://github.com/anijain2305
2025-05-10 08:27:01 +00:00
bc4cf1c13a [BE] fix failing test_dp_state_dict_save_load on ROCm CI where world_size=7 (#153283)
**Summary**
I saw an unrelated CI failure `distributed/_composable/fsdp/test_fully_shard_state_dict.py::TestFullyShardStateDictMultiProcess::test_dp_state_dict_save_load` in one of my PR: https://hud.pytorch.org/pr/pytorch/pytorch/153225#41930032096

This is caused by triggering uneven sharding in FSDP2 at cbb03e6971/torch/distributed/fsdp/_fully_shard/_fsdp_param.py (L353-L361)

This didn't show up before because the CUDA CI has an even number of GPUs (e.g. 2/4/8), but that's not true on ROCm CI. For the failing CI case, the device count is 7.

**Solution**
Skip the test if `self.world_size` does not evenly divide `mlp_dim` (i.e. 16).
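A small sketch of the skip condition (illustrative helper, not the exact test code):

```python
def should_skip(world_size: int, mlp_dim: int = 16) -> bool:
    # Skip when the world size does not evenly divide mlp_dim, which would
    # otherwise trigger FSDP2's uneven-sharding assertion.
    return mlp_dim % world_size != 0

print(should_skip(8))  # False: 16 shards evenly across 8 ranks
print(should_skip(7))  # True: the ROCm CI case with 7 devices
```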

**Test**
CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153283
Approved by: https://github.com/fegin, https://github.com/weifengpy
2025-05-10 04:46:32 +00:00
fc7d8c6808 [Pipelining] Fix _batch_p2p bug for non-NCCL backends (#132644) (#152938)
Fixes #132644

`_batch_p2p` incorrectly assumes that `dist.batch_isend_irecv` returns a single-element list of `dist.Work`, likely due to NCCL's coalescing behaviour.

For non-NCCL backends like Gloo, multiple `dist.Work` objects are returned, causing the code to discard some operations via `.pop()`. This leads to deadlocks during pipeline parallelism.
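A hedged sketch of the fix (function names follow the PR description; the signatures here are illustrative):

```python
import torch.distributed as dist

def _batch_p2p(p2p_ops: list) -> list:
    # batch_isend_irecv may return one coalesced Work (NCCL) or several
    # (e.g. Gloo), so return the whole list instead of popping one element.
    if not p2p_ops:
        return []
    return dist.batch_isend_irecv(p2p_ops)

def _wait_batch_p2p(works: list) -> None:
    # Wait on every returned Work so no send/recv is silently dropped.
    for work in works:
        work.wait()
```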

## Changes:

* Modified `_batch_p2p` to return `list[dist.Work]` instead of popping a single element.
* Added `_wait_batch_p2p` to call `wait()` on multiple `dist.Work` objects, consuming the result of `_batch_p2p`.
* Updated references from `dist.Work` to `list[dist.Work]`.

## Testing:

* `pippy_bert.py` from #132644 now works with gloo.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152938
Approved by: https://github.com/kwen2501, https://github.com/H-Huang
2025-05-10 04:19:38 +00:00
b86d46ff21 [torch][ao] Properly strip tracking stats in _fold_conv_bn_qat for 1D (#152982)
Summary: _fold_conv_bn_qat has logic to remove the tracking stats. Currently, this relies on a check that covers only torch.nn.modules.batchnorm.BatchNorm2d. As a result, the tracking stats are not properly removed when the 1D variant is used. This diff fixes that.
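An illustrative sketch of the kind of type check being extended (not the exact PR code):

```python
import torch

# Cover both the 1D and 2D batch norm variants when deciding whether to strip
# tracking stats such as num_batches_tracked.
_BN_TYPES = (
    torch.nn.modules.batchnorm.BatchNorm1d,
    torch.nn.modules.batchnorm.BatchNorm2d,
)

def is_foldable_bn(module: torch.nn.Module) -> bool:
    return isinstance(module, _BN_TYPES)

print(is_foldable_bn(torch.nn.BatchNorm1d(8)))  # True once 1D is covered
```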

Test Plan:
Run N7113483 without this fix.

{F1977726982}

```
bento kernel build sensorml
```

Re-run with local version of kernel, containing this diff:

{F1977727151}

Notice that now, num_batches is removed.

Differential Revision: D74269649

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152982
Approved by: https://github.com/andrewor14, https://github.com/yushangdi
2025-05-10 01:20:18 +00:00
9c99ea2991 error out on negative offs or on K=0 in group gemm (#153226)
Error out if K=0 in one of the grouped gemms to avoid hangs in #152668
Also, adds meta function for _scaled_grouped_mm (TODO: do the same for _grouped_mm, unless it's done already)

One weird thing I'm seeing, when running all grouped_gemm tests, I'm erroring out with
```
  File "/data/users/ngimel/pytorch/torch/_inductor/graph.py", line 1246, in call_function
    out = lowerings[target](*args, **kwargs)  # type: ignore[index]
  File "/data/users/ngimel/pytorch/torch/_inductor/lowering.py", line 445, in wrapped
    out = decomp_fn(*args, **kwargs)
  File "/data/users/ngimel/pytorch/torch/_inductor/kernel/mm_scaled_grouped.py", line 444, in tuned_scaled_grouped_mm
    if is_nonzero and can_use_triton_kernel(mat_a, mat_b, offs, bias):
  File "/data/users/ngimel/pytorch/torch/_inductor/kernel/mm_scaled_grouped.py", line 375, in can_use_triton_kernel
    offs is not None
  File "/home/ngimel/.conda/envs/pytorch_monarch/lib/python3.10/site-packages/sympy/core/relational.py", line 516, in __bool__
    raise TypeError("cannot determine truth value of Relational")
torch._inductor.exc.InductorError: LoweringException: TypeError: cannot determine truth value of Relational
```
which is weird: there's no relational that sympy has to evaluate in `offs is not None`, and when running this test separately (`test_scaled_grouped_gemm_2d_3d_fast_accum_True_strided_False_use_torch_compile_True_cuda`) it passes. I suspect some autotuning cache has to be reset between runs, but I don't know what to look for.
Edit: that error is "fixed" by setting `dynamic=False`; now, with the correct meta function, something's wrong with dynamic shapes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153226
Approved by: https://github.com/kwen2501
2025-05-10 01:13:18 +00:00
639793c17e [pytorch] Expose c10_retrieve_device_side_assertion_info() for use by external code (#153211)
Summary: - Expose `c10_retrieve_device_side_assertion_info()` for use by external code.  The motivating use case is FBGEMM kernel launcher utilities, which add FBGEMM-specific context to the errors coming out of Torch DSA

Test Plan: OSS CI

Differential Revision: D74432771

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153211
Approved by: https://github.com/Skylion007
2025-05-10 01:08:45 +00:00
658aea980c [inductor] Rename knobs > triton_knobs in static_cuda_launcher (#153189)
Summary: A follow up from https://github.com/pytorch/pytorch/pull/152457 since I didn't address the comment then

Test Plan: CI

Differential Revision: D74421432

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153189
Approved by: https://github.com/jamesjwu
2025-05-10 00:26:21 +00:00
fbb6412fdb Stop uploading sccache stats to benchmark database (#153285)
This is not used for anything atm and potentially bloats up the size of the database.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153285
Approved by: https://github.com/clee2000, https://github.com/malfet
2025-05-10 00:17:38 +00:00
e6dccb036e Revert "Fix fake tensor caching when output has unbacked (#153034)"
This reverts commit 4f425a0397eb0c63b8864bb9b168a519dcfbebbe.

Reverted https://github.com/pytorch/pytorch/pull/153034 on behalf of https://github.com/malfet due to Broke pr_time_benchmarks, see d07fbd41e3/1 ([comment](https://github.com/pytorch/pytorch/pull/153034#issuecomment-2868100487))
2025-05-09 23:43:56 +00:00
4e24ee7283 Move mps_linear forward to use MPS kernels directly instead of MPSGraph (#152210)
This PR moves `mps_linear` to use MPSNDArrays and call into the MPS kernel directly instead of going through MPSGraph. It also adds a caching mechanism for reusing MPS kernels as there is also a small overhead attached to creating the kernel object.

The impact of the improvement is relatively more significant for small input kernels, where the MPSGraph overhead represents a larger portion of the overall execution time of the operation, but the speedup shows up for both small and large input sizes, as expected.

`mps_linear` before the changes:
```
input shapes: f32:[1,1,20], f32:[1,20]
torch.linear time: <torch.utils.benchmark.utils.common.Measurement object at 0x109d67110>
func(*args, **kwargs)
  Median: 199.29 us
  IQR:    9.56 us (196.71 to 206.27)
  979 measurements, 1 runs per measurement, 1 thread

input shapes: f32:[1,1,5120], f32:[13284,5120]
torch.linear time: <torch.utils.benchmark.utils.common.Measurement object at 0x1063b4510>
func(*args, **kwargs)
  Median: 979.29 us
  IQR:    25.29 us (964.83 to 990.13)
  205 measurements, 1 runs per measurement, 1 thread
```

`mps_linear` after the changes:
```
input shapes: f32:[1,1,20], f32:[1,20]
torch.linear time: <torch.utils.benchmark.utils.common.Measurement object at 0x10693a190>
func(*args, **kwargs)
  Median: 176.08 us
  IQR:    15.02 us (172.42 to 187.44)
  1103 measurements, 1 runs per measurement, 1 thread

input shapes: f32:[1,1,5120], f32:[13284,5120]
torch.linear time: <torch.utils.benchmark.utils.common.Measurement object at 0x10d524dd0>
func(*args, **kwargs)
  Median: 952.56 us
  IQR:    15.63 us (945.47 to 961.10)
  210 measurements, 1 runs per measurement, 1 thread
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152210
Approved by: https://github.com/kulinseth, https://github.com/malfet

Co-authored-by: Nikita Shulga <nshulga@meta.com>
2025-05-09 23:41:23 +00:00
d07fbd41e3 [BE][MPS] Use squeeze/unsqueeze in Linear (#153288)
Instead of views, to reshape weight to 2D tensor if necessary

Already tested by `test_linear_1d_weight`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153288
Approved by: https://github.com/wdvr
2025-05-09 23:34:54 +00:00
ee326137a9 [reland] Add graph module runtime asserts to AOTI (#153182)
Summary:
Solves https://github.com/pytorch/pytorch/issues/151925

A reland of https://github.com/pytorch/pytorch/pull/152125.

added a try-except around the justknob internally. Also added more documentation

Currently, AOTI only generate runtime asserts for unbacked symints. We should generate asserts for all `_assert_scalar` calls in the input graph.

Also factored out the run time assertion logic to a separate function.

We need to generate runtime asserts directly in Inductor instead of just re-using the asserts from the input graphs because we reuse the same ShapeEnv as before. In particular, on subsequent graph passes we would immediately turn all of these assertions into no-ops, because when we evaluated their expressions we would see that, since we had a deferred runtime assert in the ShapeEnv, we already know "oh, of course this expression is True".

One example is below:
```
        class Model(torch.nn.Module):
            def forward(self, a, b, c):
                nz = torch.nonzero(a)
                ones = a.new_ones([nz.size(0), b.size(0)])
                torch._check(ones.size(0) >= 1)
                equals = torch.add(ones, c)
                return equals
        torch._dynamo.mark_dynamic(c, 0)
```
When we reuse the ShapeEnv in Inductor lowering, the check that `a` and the nonzero result have the same shape would be evaluated to True after we resolve unbacked bindings using the ShapeEnv.
See `test_unbacked_equals_input_size_runtime_assertion` in test_aot_inductor.

In addition to the Inductor generated runtime asserts, we also need the runtime asserts from the input graph, because some derived runtime asserts are not generated in Inductor. One example is below:
```
        class Model(torch.nn.Module):
            def forward(self, x):
                y = x.reshape(100, -1).clone()
                y = y + 1
                return y

        dynamic_shapes = {
            "x": {0: torch.export.Dim.DYNAMIC},
        }
        x.shape[0] needs to be a multiple of 100.
```
See `test_aoti_runtime_asserts_backed_symint` in test_aot_inductor.

Example:

```
    def forward(self):
        arg0_1: "f32[s35]";

        arg0_1, = fx_pytree.tree_flatten_spec([], self._in_spec)
         # File: /data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/73a672eb896e7996/scripts/shangdiy/__pt__/pt#link-tree/scripts/shangdiy/pt.py:11 in forward, code: y = x.reshape(100, -1).clone()
        sym_size_int: "Sym(s35)" = torch.ops.aten.sym_size.int(arg0_1, 0)

         #
        mod: "Sym(Mod(s35, 100))" = sym_size_int % 100;  sym_size_int = None
        eq_2: "Sym(Eq(Mod(s35, 100), 0))" = mod == 0;  mod = None
        _assert_scalar = torch.ops.aten._assert_scalar.default(eq_2, "Runtime assertion failed for expression Eq(Mod(s35, 100), 0) on node 'eq'");  eq_2 = _assert_scalar = None

         # File: /data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/73a672eb896e7996/scripts/shangdiy/__pt__/pt#link-tree/scripts/shangdiy/pt.py:11 in forward, code: y = x.reshape(100, -1).clone()
        view: "f32[100, (s35//100)]" = torch.ops.aten.reshape.default(arg0_1, [100, -1]);  arg0_1 = None
        clone: "f32[100, (s35//100)]" = torch.ops.aten.clone.default(view);  view = None

         # File: /data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/73a672eb896e7996/scripts/shangdiy/__pt__/pt#link-tree/scripts/shangdiy/pt.py:12 in forward, code: y = y + 1
        add_6: "f32[100, 1]" = torch.ops.aten.add.Tensor(clone, 1);  clone = None
        return (add_6,)
```

Generated cpp code:

```
    auto inputs = steal_from_raw_handles_to_raii_handles(input_handles, 1);
    auto arg0_1 = std::move(inputs[0]);
    auto arg0_1_size = arg0_1.sizes();
    int64_t s35 = arg0_1_size[0];
    inputs.clear();
    auto& kernels = static_cast<AOTInductorModelKernels&>(*this->kernels_.get());
    if (!((s35 % 100L) == 0L)) { throw std::runtime_error("Expected Eq(Mod(s35, 100), 0) to be True but received " + std::to_string(s35)); }
```

Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:test_aot_inductor -- -r aoti_runtime_asserts_backed_symint
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:torchinductor_dynamic_shapes -- -r test_unbacked_floordiv_simplify
TORCHINDUCTOR_SCALAR_ASSERTS_FULL=1 buck run fbcode//mode/dev-nosan //caffe2/test/inductor:test_aot_inductor -- -r test_sym_i64_input_codegen_cuda
TORCHINDUCTOR_SCALAR_ASSERTS_FULL=1  buck run fbcode//mode/dev-nosan //caffe2/test/inductor:test_aot_inductor -- -r  test_unbacked_equals_input_size
```

Differential Revision: D74361799

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153182
Approved by: https://github.com/henrylhtsang
2025-05-09 22:56:19 +00:00
298b43792b [RFC][inductor] Refactor AlgorithmSelectorCache to spit out make_precompile_fn (#153212)
Motivation is that `AlgorithmSelectorCache.__call__` is getting very long and hard to work with. There are nested layers of local functions in it. For example, we pass `precompile_fn`, a local variable, to `do_autotuning`, a local function, which already has a pointer to `choices`, a local variable, and then have `do_autotuning` call `choices` in `self.lookup`.

When I was trying to make changes to do_autotuning, I would get `UnboundLocalError: cannot access local variable 'choices' where it is not associated with a value`. But no idea why it was even working in the first place.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153212
Approved by: https://github.com/eellison
2025-05-09 22:35:10 +00:00
37f92bbe0a [ROCm][CI] fix nightly build after rocm 6.4 upgrade (#153253)
rocm-smi adds an inclusion of drm.h, and the libdrm-devel package was missing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153253
Approved by: https://github.com/jeffdaily, https://github.com/atalman

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-05-09 22:08:15 +00:00
9ae722cdb4 allocate cuMem memory with rdma flag (#153261)
to be able to register memory with ibverbs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153261
Approved by: https://github.com/kwen2501, https://github.com/eqy, https://github.com/Skylion007
2025-05-09 21:48:48 +00:00
f11d7a5978 [ROCm] Update spack includes (#152569)
* Cleans up code in `caffe2/CMakeLists.txt` to remove individual ROCm library include paths and use `ROCM_INCLUDE_DIRS` CMake var instead
* `ROCM_INCLUDE_DIRS` CMake var is set in `cmake/public/LoadHIP.cmake` by adding all the ROCm packages that PyTorch depends on
* `rocm_version.h` is provided by the `rocm-core` package, so use the include directory for that component to be compliant with Spack
* Move `find_package_and_print_version(hip REQUIRED CONFIG)` earlier so that `hip_version.h` can be located in the hip package include dir for Spack
* `list(REMOVE_DUPLICATES ROCM_INCLUDE_DIRS)` to remove duplicate `/opt/rocm/include` entries in the non-Spack case
* Remove user-provided env var `ROCM_INCLUDE_DIRS` since `ROCM_PATH` already exists as a user-provided env var, which should be sufficient to locate the include directories for ROCm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152569
Approved by: https://github.com/renjithravindrankannath, https://github.com/jeffdaily

Co-authored-by: Renjith Ravindran <Renjith.RavindranKannath@amd.com>
2025-05-09 21:36:38 +00:00
4f425a0397 Fix fake tensor caching when output has unbacked (#153034)
We handle fake tensor caching in two ways:
1. If the inputs have no symbols (SymInt, etc) then we cache on the FakeTensorMode.
2. If the inputs have symbols then we cache on the ShapeEnv.

This way the symbols in the inputs and outputs are associated with the guards in place at the time of the call.

However, it's possible to have an op where there are no symbols in the inputs but there is an unbacked symbol in the output. In this case we shouldn't cache at all, because what would that really mean?

So this PR changes the caching behavior so that if there's a symbol in the output which doesn't come in some way from the input then we refuse to cache that op.
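A hedged pseudocode-style sketch of the resulting policy (names are illustrative, not the actual FakeTensorMode internals):

```python
def should_cache(input_symbols: set, output_symbols: set) -> bool:
    # Cache only when every symbol in the output can be traced back to the
    # inputs; a fresh unbacked symbol in the output means we bypass the cache.
    return output_symbols <= input_symbols

print(should_cache({"s0", "s1"}, {"s0"}))  # True: output symbols come from inputs
print(should_cache(set(), {"u0"}))         # False: unbacked symbol appears only in output
```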

Added a test which checks for this case.

While in there I also did a couple other related changes:
1. Added negative caching - if we see that an (op, args) failed to cache previously we don't even bother trying to cache it again.
2. Reworked the inner behavior of _cached_dispatch_impl a little to make it more clear which bits we expect to be able to throw _BypassDispatchCache and add some comments.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153034
Approved by: https://github.com/masnesral, https://github.com/tugsbayasgalan
2025-05-09 21:17:54 +00:00
cbb03e6971 [BE][DTensor] move torch.distributed._tensor import to torch.distributed.tensor in test files (#153225)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153225
Approved by: https://github.com/kwen2501, https://github.com/fegin
2025-05-09 20:40:54 +00:00
3976e52264 Fix torch.isin decomposition for scalar inputs (#153216)
This patch fixes a corner case of the `torch.isin` decomposition when both
inputs are scalars. This pattern showed up in #141196.

Fixes #141196.
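A minimal repro sketch reconstructed from the traced graph in the error below:

```python
import torch

@torch.compile
def f():
    x = torch.tensor(0)          # 0-d tensor, i.e. a scalar
    return torch.isin(x, x)      # both elements and test_elements are scalars

print(f())  # previously failed in the inductor decomposition; expected: tensor(True)
```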

Error stack before this patch:
```
  File "/home/ryanguo99/repos/pytorch/test/dynamo/test_misc.py", line 12503, in test_scalar_isin_decomposition
    res = opt_f()
          ^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_dynamo/eval_frame.py", line 691, in _fn
    raise e.remove_dynamo_frames() from None  # see TORCHDYNAMO_VERBOSE=1
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_dynamo/output_graph.py", line 1618, in _call_user_compiler
    raise BackendCompilerFailed(
  File "/home/ryanguo99/repos/pytorch/torch/_dynamo/output_graph.py", line 1593, in _call_user_compiler
    compiled_fn = compiler_fn(gm, self.example_inputs())
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_dynamo/repro/after_dynamo.py", line 150, in __call__
    compiled_gm = compiler_fn(gm, example_inputs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/__init__.py", line 2365, in __call__
    return compile_fx(model_, inputs_, config_patches=self.config)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_inductor/compile_fx.py", line 2317, in compile_fx
    return aot_autograd(
           ^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_dynamo/backends/common.py", line 106, in __call__
    cg = aot_module_simplified(gm, example_inputs, **self.kwargs)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_functorch/aot_autograd.py", line 1179, in aot_module_simplified
    compiled_fn = AOTAutogradCache.load(
                  ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_functorch/_aot_autograd/autograd_cache.py", line 923, in load
    compiled_fn = dispatch_and_compile()
                  ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_functorch/aot_autograd.py", line 1164, in dispatch_and_compile
    compiled_fn, _ = create_aot_dispatcher_function(
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_functorch/aot_autograd.py", line 576, in create_aot_dispatcher_function
    return _create_aot_dispatcher_function(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_functorch/aot_autograd.py", line 826, in _create_aot_dispatcher_function
    compiled_fn, fw_metadata = compiler_fn(
                               ^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py", line 180, in aot_dispatch_base
    fw_module, updated_flat_args, maybe_subclass_meta = aot_dispatch_base_graph(  # type: ignore[misc]

           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/fx/experimental/proxy_tensor.py", line 2199, in _trace_inner
    t = dispatch_trace(
        ^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_compile.py", line 51, in inner
    return disable_fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_dynamo/eval_frame.py", line 872, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/fx/experimental/proxy_tensor.py", line 1223, in dispatch_trace
    graph = tracer.trace(root, concrete_args)  # type: ignore[arg-type]
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_dynamo/eval_frame.py", line 872, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/fx/_symbolic_trace.py", line 850, in trace
    (self.create_arg(fn(*args)),),
                     ^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/fx/experimental/proxy_tensor.py", line 1278, in wrapped
    out = f(*tensors)  # type:ignore[call-arg]
          ^^^^^^^^^^^
  File "<string>", line 1, in <lambda>
  File "/home/ryanguo99/repos/pytorch/torch/_functorch/_aot_autograd/traced_function_transforms.py", line 720, in inner_fn
    outs = fn(*args)
           ^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_functorch/_aot_autograd/traced_function_transforms.py", line 419, in _functionalized_f_helper
    f_outs = fn(*f_args)
             ^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_functorch/_aot_autograd/traced_function_transforms.py", line 81, in inner_fn
    outs = fn(*args)
           ^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_functorch/_aot_autograd/traced_function_transforms.py", line 902, in functional_call
    out = PropagateUnbackedSymInts(mod).run(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/fx/interpreter.py", line 171, in run
    self.env[node] = self.run_node(node)
                     ^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/fx/experimental/symbolic_shapes.py", line 7387, in run_node
    result = super().run_node(n)
             ^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/fx/interpreter.py", line 240, in run_node
    return getattr(self, n.op)(n.target, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/fx/interpreter.py", line 320, in call_function
    return target(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/fx/experimental/proxy_tensor.py", line 1326, in __torch_function__
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_subclasses/functional_tensor.py", line 511, in __torch_dispatch__
    outs_unwrapped = func._op_dk(
                     ^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/utils/_stats.py", line 27, in wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/fx/experimental/proxy_tensor.py", line 1428, in __torch_dispatch__
    return proxy_call(self, func, self.pre_dispatch, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/fx/experimental/proxy_tensor.py", line 797, in proxy_call
    r = maybe_handle_decomp(proxy_mode, func, args, kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/fx/experimental/proxy_tensor.py", line 2358, in maybe_handle_decomp
    out = CURRENT_DECOMPOSITION_TABLE[op](*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_prims_common/wrappers.py", line 309, in _fn
    result = fn(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_decomp/decompositions.py", line 5108, in isin
    return isin_default(elements, test_elements, invert=invert)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_decomp/decompositions.py", line 5137, in isin_default
    x = elements.view(*elements.shape, *((1,) * test_elements.ndim))
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
TypeError: view() received an invalid combination of arguments - got (), but expected one of:
 * (torch.dtype dtype)
 * (tuple of ints size)

While executing %isin : [num_users=1] = call_function[target=torch.isin](args = (%x, %x), kwargs = {})
GraphModule: class GraphModule(torch.nn.Module):
    def forward(self):
         # File: /home/ryanguo99/repos/pytorch/test/dynamo/test_misc.py:12498 in f, code: x = torch.tensor(0)
        x: "i64[][]" = torch.tensor(0)

         # File: /home/ryanguo99/repos/pytorch/test/dynamo/test_misc.py:12499 in f, code: return torch.isin(x, x)
        isin: "b8[][]" = torch.isin(x, x);  x = None
        return (isin,)

Original traceback:
  File "/home/ryanguo99/repos/pytorch/test/dynamo/test_misc.py", line 12499, in f
    return torch.isin(x, x)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153216
Approved by: https://github.com/williamwen42, https://github.com/peterbell10
2025-05-09 20:26:25 +00:00
180cbf46f2 Fix 'TensorBox' object has no attribute 'is_input_buffer' (#152980)
Summary: Fix for https://fb.workplace.com/groups/1075192433118967/permalink/1664491270855744/

Test Plan: Used reproducer from D74262030

Differential Revision: D74270090

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152980
Approved by: https://github.com/Skylion007, https://github.com/eellison
2025-05-09 19:58:32 +00:00
d808a3e203 [dynamic shapes] guard_or_false for computeStorageNbytes (#150483)
Removes the fast path for computing storage and fixes some adjacent tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150483
Approved by: https://github.com/laithsakka
2025-05-09 19:31:19 +00:00
fe11d300ac [nativert] Improve MPMCQueue tests. (#153154)
Summary:
- Use std::this_thread::yield and stop busy waiting.
- Sort test file orders.

Following up @swolchok's comment from https://github.com/pytorch/pytorch/pull/152837
Test Plan: CI

Differential Revision: D74402536

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153154
Approved by: https://github.com/Skylion007, https://github.com/cyyever
2025-05-09 19:25:42 +00:00
287b1ca30c [Ez][BE]: Ensure matplotlib remains optional dependency via fake_quantize (#153244)
Unblocks #153055 and ensures that matplotlib remains an optional dependency in PyTorch.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153244
Approved by: https://github.com/albanD
2025-05-09 19:19:30 +00:00
90fde0dc09 [ONNX] Support sym_float (#153200)
Fixes #153115

Note: torch.sym_int is not supported in this PR because it doesn't appear in the exported program; instead, it shows up as `torch.ops.aten.sym_size.int()`.

```
ExportedProgram:
    class GraphModule(torch.nn.Module):
        def forward(self, x: "f32[s35, s16]"):
             #
            sym_size_int_1: "Sym(s35)" = torch.ops.aten.sym_size.int(x, 0);  x = None
            return (sym_size_int_1,)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153200
Approved by: https://github.com/justinchuby

Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
2025-05-09 19:10:17 +00:00
da0b89bcbf Scheduler Flops refactor (#152708)
This refactors `estimate_flops` and `get_estimated_runtime` on scheduler nodes:
1. New function on BaseSchedulerNode: `estimate_flops`. Works with all types of ir nodes now, not just `ExternalKernels`.
1. Extends `get_estimated_runtime` to work with non-`ExternalKernels`.

Prelude to: https://github.com/pytorch/pytorch/pull/149697

Testing:
New unit tests cover functionality.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152708
Approved by: https://github.com/xmfan, https://github.com/eellison
2025-05-09 19:01:43 +00:00
073b0257ba [Graph Partition] Maintain relative order within partition during reordering (#153111)
PR #151968 adds `reorder_for_minimizing_partition` for the minimal number of partitions. If reordering two nodes cannot reduce the number of partitions, `reorder_for_minimizing_partition` should maintain the relative order of these two nodes and rely on other reorder passes for some nice features, such as shorter liveness duration or less peak memory. In an extreme case, when all nodes are on gpu and can be cudagraphed, `reorder_for_minimizing_partition` should not reorder any nodes.

This PR improves `reorder_for_minimizing_partition` to uphold the invariant that the relative order of nodes within the same graph partition is maintained. To do so, we record the index of each node in the input `nodes: list[BaseSchedulerNode]` and use a heap to pop the node with the smallest index. So we always schedule the node with the smallest index within the same graph partition, which respects the invariant. The previous implementation tried to use a queue to achieve this but failed, because node_N at the end may rely on node_1 at the start, such that node_N is added to the queue as soon as node_1 is scheduled.
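A small sketch of the heap idea (illustrative only; the real pass also has to track dependencies and partitions):

```python
import heapq

nodes = ["node_1", "node_2", "node_3"]
# Pair each schedulable node with its index in the original order; the heap
# always yields the smallest index, so relative order is preserved.
ready = [(2, nodes[2]), (0, nodes[0])]
heapq.heapify(ready)
while ready:
    _, node = heapq.heappop(ready)
    print("schedule", node)  # node_1 is scheduled before node_3
```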

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153111
Approved by: https://github.com/eellison
2025-05-09 18:49:53 +00:00
ec24f8f58a Format all headers under ATen/cpu/vec, not just top-level (#152364)
Not formatting these seems like an oversight. I had to add a few clang-format suppressions to keep includes in the same order to avoid breaking builds.

This PR was generated using `lintrunner --paths-cmd "rg --files -g '*.h' aten/src/ATen/cpu/vec/" format`

Differential Revision: [D73802128](https://our.internmc.facebook.com/intern/diff/D73802128/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152364
Approved by: https://github.com/malfet, https://github.com/Skylion007, https://github.com/CaoE
2025-05-09 18:46:07 +00:00
76e34e3850 [Kineto] Upgrade the kineto commit to fb36cce (#152007)
XPU intends to upgrade the oneAPI version (https://github.com/pytorch/pytorch/issues/151097) to support torch Distributed. However, the PTI shipped with the oneAPI version to be upgraded introduces breaking changes; it changed the signatures of the following APIs.
- ptiViewEnableRuntimeApi
- ptiViewGetApiIdName

To avoid breakage due to PTI's upcoming non-backward-compatible changes, we refined the XPU PTI integration in Kineto: we check the PTI version and then invoke the PTI API accordingly. This means the Kineto pinned by this PR can handle the non-backward-compatible changes coming with oneAPI 2025.1.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152007
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/sraikund16, https://github.com/malfet
2025-05-09 18:38:41 +00:00
192f7140d1 [fbgemm_gpu] Replace C10_CUDA_KERNEL_LAUNCH_CHECK() in the KernelLauncher (#153178)
Summary:
- Replace `C10_CUDA_KERNEL_LAUNCH_CHECK()` in the `KernelLauncher`, as the
  latter does not print __FILE__ and __LINE__

The existing `C10_CUDA_KERNEL_LAUNCH_CHECK()` implementation does not print the source file and line number when a CUDA kernel launch throws an error, leaving users confused with a context-less message like `CUDA error: invalid arguments`.  This new check is a slimmed re-implementation of the macro with extra context information added to the error (beyond just file and line number) so that we can at least locate the FBGEMM source file or template where the error first surfaces.

Test Plan:
```
buck2 run 'fbcode//mode/opt' fbcode//deeplearning/fbgemm/fbgemm_gpu/test/utils:kernel_launcher

buck2 run 'fbcode//mode/opt-amd-gpu' fbcode//deeplearning/fbgemm/fbgemm_gpu/test/utils:kernel_launcher
```

Reviewed By: sryap

Differential Revision: D74364031

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153178
Approved by: https://github.com/atalman, https://github.com/huydhn
2025-05-09 17:43:16 +00:00
595e21a9dd [cutlass-3] Add cutlass key for fbcode and OSS (#153081)
Differential Revision: [D74337959](https://our.internmc.facebook.com/intern/diff/D74337959/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153081
Approved by: https://github.com/drisspg
2025-05-09 17:38:31 +00:00
ffda46e3be [Graph Partition] remove weak dep from partition_input_names (#152863)
Graph partition analyzes read_writes to get partition input names. However, a weak dep is a fake dependency and is not actually read or written, so we should not include weak deps in graph partition input names.

The following test failure is fixed by removing weak dependency from partition_input_names:
`PYTORCH_TEST_WITH_INDUCTOR=1 python test/test_torch.py TestTorchDeviceTypeCUDA.test_params_invalidated_with_grads_invalidated_between_unscale_and_step_Adam_cuda_float32`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152863
Approved by: https://github.com/eellison
2025-05-09 17:20:04 +00:00
286de0d601 [CI] Enable XCCL in XPU CI build (#150927)
As XCCL has been enabled for torch xpu, enable it in CI build.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150927
Approved by: https://github.com/EikanWang, https://github.com/cyyever, https://github.com/atalman
2025-05-09 17:12:34 +00:00
e73a4c3643 [BE][CI] Merge regular and MPS test config shards (#152719)
Unsure why they were separate to begin with.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152719
Approved by: https://github.com/seemethere, https://github.com/atalman
ghstack dependencies: #153013, #153057
2025-05-09 17:01:35 +00:00
309ecb2277 [CI] Add opt-in h100 tests (#153170)
So far only run:
 - inductor/test_fp8.py
 - test_matmul_cuda.py
 - inductor/test_max_autotune.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153170
Approved by: https://github.com/drisspg, https://github.com/eellison
2025-05-09 17:01:05 +00:00
8ea95d2e73 [inductor] dtype promotion error in cat decomp (#152995)
Cloning a single tensor wasn't following dtype promotion rules; for the SAM model: https://github.com/pytorch/pytorch/issues/152606

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152995
Approved by: https://github.com/yushangdi, https://github.com/eellison
2025-05-09 16:58:58 +00:00
e21ff9c3be Add logging for guard miss failure (#153125)
Differential Revision: [D74371381](https://our.internmc.facebook.com/intern/diff/D74371381/)

This PR adds some logging for guard misses to tlparse, so that we know when AOTAutogradCache and FxGraphCache miss due to guards.

Example tlparse result:
https://gist.github.com/jamesjwu/afa19335c0aee85b24546b13c1cf6427

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153125
Approved by: https://github.com/oulgen, https://github.com/jingsh
2025-05-09 16:51:04 +00:00
9d00f2b375 [autograd][docs] Add more details on why save_for_backward is important in extending autograd note (#153005)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153005
Approved by: https://github.com/albanD
2025-05-09 16:36:57 +00:00
50657120a0 Allow workflows to opt-out of experiments (#153085)
This change adds support to allow workflows to opt-out of experiments.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153085
Approved by: https://github.com/ZainRizvi

Co-authored-by: Zain Rizvi <ZainRizvi@users.noreply.github.com>
2025-05-09 16:34:46 +00:00
18e13a67ce [dynamo] Harden torch function dispatchability check for attributes and methods access (#153082)
See more details in
https://github.com/pytorch/pytorch/issues/151771#issuecomment-2836372110.

Fixes #151771.

Differential Revision: [D74342291](https://our.internmc.facebook.com/intern/diff/D74342291)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153082
Approved by: https://github.com/mlazos
2025-05-09 16:14:23 +00:00
c227865720 [AOTInductor] Fix state of ConstantFolding (#153152)
Summary:
Bug fix for constant folding states. We were not setting the correct state for each update.
One race condition would be:
(1) All threads obtain the model_exec_lock from the main run.
(2) In the second round of updating the constant buffer, we should have set secondary as INITIALIZED, but primary is mistakenly set instead.
(3) run_const_fold gets called and a model_exec_lock is requested, waiting for it to become available at this time.
(4) The main run enters INITIALIZED, waiting for a unique_lock (while a shared_lock is still held by (3) at this moment).

Test Plan:
TBD

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153152
Approved by: https://github.com/jingsh, https://github.com/chenyang78
2025-05-09 16:03:05 +00:00
f2ea63658f Refactor nested benchmarking functions in select_algorithm.py (#153084)
Summary: I'll need some of the benchmark-related functions surfaced so I can use them for remote autotuning. This PR just lifts the main in-process benchmarking helpers to classmethods. It wasn't strictly necessary to also move the sub-process benchmarking helper, but I think it improves readability. Also added some missing types.

Test Plan: Existing unit tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153084
Approved by: https://github.com/aorenste, https://github.com/eellison
2025-05-09 15:09:51 +00:00
916f6bafe7 Fix HF loading when there's no metadata file to work with fsspec (#152856)
Summary: HF loading when there is no metadata is an edge case for some users. We were previously calling safe_open(filename) to get the keys in the safetensors file, but this doesn't work with fsspec when models have a different backend than the local fs (i.e. hf, s3, etc.). This diff updates the code to open the file with fsspec.open() and then use safetensors.deserialize() to get the keys.
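A hedged sketch of the described approach (the path is a placeholder; the `deserialize` usage follows the PR description and should be checked against the installed safetensors version):

```python
import fsspec
from safetensors import deserialize  # per the PR description

def get_safetensors_keys(path: str) -> list:
    # Works with any fsspec-backed filesystem (hf://, s3://, local, ...).
    with fsspec.open(path, "rb") as f:
        data = f.read()
    # deserialize() parses the safetensors payload; keep only the tensor names.
    return [name for name, _ in deserialize(data)]

# Example (placeholder URL):
# print(get_safetensors_keys("hf://model-repo/model.safetensors"))
```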

Test Plan: unit test and e2e test reading from hf

Differential Revision: D74181513

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152856
Approved by: https://github.com/joecummings
2025-05-09 13:32:01 +00:00
e06a08059a Add device guard for xpu conv on multi device (#153067)
# Motivation
fixes https://github.com/pytorch/pytorch/issues/153022
The root cause is that the XPU backend registers the convolution op using `m.impl`, which bypasses the device guard logic typically added by the code generation system. This can lead to unexpected behavior if the current device isn't explicitly set.

# Additional Context
run the following script
```python
import torch
import torchvision.models as models

torch.manual_seed(0)

model = models.resnet50(weights="ResNet50_Weights.DEFAULT")
model.eval()
data = torch.rand(1, 3, 224, 224)

device = torch.device('xpu:1')  # 'xpu:0'
model = model.to(device=device, dtype=torch.float16)
data = data.to(device, dtype=torch.float16)

with torch.no_grad():
    ret = model(data)
    print(ret)

print("Execution finished")
```
The output is
```bash
         -9.2102e-02, -7.7588e-01, -1.4111e+00, -9.2383e-01,  6.4551e-01,
         -6.0730e-03, -7.8271e-01, -1.1904e+00, -4.1602e-01,  3.2715e-02,
         -4.9854e-01, -6.3623e-01, -8.5107e-01, -6.8555e-01, -9.4434e-01,
         -8.8672e-01, -6.7969e-01, -6.9824e-01, -2.8882e-01,  2.0312e+00]],
       device='xpu:1', dtype=torch.float16)
Execution finished

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153067
Approved by: https://github.com/albanD, https://github.com/EikanWang
2025-05-09 09:41:51 +00:00
aca2c99a65 xpu: get xpu arch flags at runtime in cpp_extensions (#152192)
This commit moves the query for XPU arch flags to runtime when building SYCL extensions, which allows adjusting `TORCH_XPU_ARCH_LIST` at the Python script level. That's handy, for example, in CI tests that try a few variants of the list.

CC: @malfet, @jingxu10, @EikanWang, @guangyey

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152192
Approved by: https://github.com/guangyey, https://github.com/gujinghui, https://github.com/albanD
2025-05-09 05:43:50 +00:00
9fa07340fd [Cutlass] Implement memory planning for EVT (#153177)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153177
Approved by: https://github.com/henrylhtsang
ghstack dependencies: #153196, #150907
2025-05-09 05:39:05 +00:00
a3154ca34a [Cutlass] Changes to gemm template for EVT (#150907)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150907
Approved by: https://github.com/henrylhtsang, https://github.com/eellison
ghstack dependencies: #153196
2025-05-09 05:39:05 +00:00
c54aa0da01 [Cutlass] Fix tests (#153196)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153196
Approved by: https://github.com/BoyuanFeng
2025-05-09 05:39:05 +00:00
34196301d5 Revert "[CI] Add opt-in h100 tests (#153170)"
This reverts commit f87a0fe2cae5be82ffd845fa7e6053396c8222d1.

Reverted https://github.com/pytorch/pytorch/pull/153170 on behalf of https://github.com/clee2000 due to workflow doesnt have right concurrency group? ([comment](https://github.com/pytorch/pytorch/pull/153170#issuecomment-2864951319))
2025-05-09 03:04:50 +00:00
b30d276abc [CUDA][cuBLASLt] Fix scale setting for allowFP16AccumulationCuBLAS true case (#153083)
Also add some missing `@onlyCUDA` / support check decorators in `test_matmul_cuda.py`
Should help resolve #151890

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153083
Approved by: https://github.com/janeyx99
2025-05-09 02:27:17 +00:00
10234ccefe xpu: rely on sycl/sycl.hpp to include bfloat16.hpp (#152562)
Fixes: https://github.com/intel/torch-xpu-ops/issues/1503

The `sycl/ext/oneapi/bfloat16.hpp` header file is a DPC++ compiler internal header. It's not documented for usage (see the extension specification linked below) and is not guaranteed to exist. Instead, the documented usage of the extension suggests relying on including `sycl/sycl.hpp`, which in turn includes the `bfloat16.hpp` header (an implementation detail).

We ran into issues by explicitly including the `bfloat16.hpp` SYCL header within a user-facing production environment where the `intel-sycl-rt` wheel is installed (which is a dependency of the `torch` wheel package built and publicly available for XPU). The compiler includes this file from `intel-sycl-rt`, and due to the `#pragma once` usage its content is included as well, giving redefinitions of symbols in this file (the previous inclusion comes from `sycl/sycl.hpp`):
```
In file included from /workspace/lib/python3.12/site-packages/torch/include/c10/util/BFloat16.h:23:
/opt/intel/oneapi/compiler/2025.0/bin/compiler/../../include/sycl/ext/oneapi/bfloat16.hpp:60:23: error: redefinition of 'BF16VecToFloatVec'
   60 | template <int N> void BF16VecToFloatVec(const bfloat16 src[N], float dst[N]) {
      |                       ^
/workspace/include/sycl/ext/oneapi/bfloat16.hpp:60:23: note: previous definition is here
   60 | template <int N> void BF16VecToFloatVec(const bfloat16 src[N], float dst[N]) {
      |
```
While the SYCL header files themselves can be improved (`#pragma once` dropped), we still must correct the usage of the SYCL `bfloat16.hpp` header in pytorch, i.e. drop it. Fortunately this addresses the reported redefinition issue, though a follow-up on the compiler side is still required.

Also, using `SYCL_EXT_ONEAPI_BFLOAT16_MATH_FUNCTIONS` to guard the inclusion of `sycl/sycl.hpp` does not make sense, since the macro is defined in that very header. We should use `SYCL_LANGUAGE_VERSION` instead, which is defined at the compiler level.

See: f958dce280/sycl/doc/extensions/experimental/sycl_ext_oneapi_bfloat16_math_functions.asciidoc

CC: @EikanWang, @guangyey, @gujinghui

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152562
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/albanD
2025-05-09 02:25:44 +00:00
faff387bfd Mini tutorial for provenance tracking (#152211)
as title
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152211
Approved by: https://github.com/svekars, https://github.com/eellison, https://github.com/desertfire
2025-05-09 01:41:04 +00:00
f87a0fe2ca [CI] Add opt-in h100 tests (#153170)
So far only run:
 - inductor/test_fp8.py
 - test_matmul_cuda.py
 - inductor/test_max_autotune.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153170
Approved by: https://github.com/drisspg
2025-05-09 01:03:12 +00:00
ab829ec629 [dynamo][pr_time_benchmark] Add dynamo benchmark to stress test inlining (#153159)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153159
Approved by: https://github.com/laithsakka
ghstack dependencies: #152883, #153105
2025-05-09 00:09:19 +00:00
cbcb57d09d [CI] Use sccache installed in docker image in xla build (#153002)
The edited comment should have the info. The code change looks large, but it's copied from the install_cache script that our docker images use: 6a8006472e/.ci/docker/common/install_cache.sh (L42)

Sccache stopped working on xla at some point near Dec 17, 2023. I am not sure which commit caused it; I think it was having trouble writing to the cache.

Either way, there is an sccache already installed in the docker image, so we should use that instead of a binary from s3 that we are probably no longer sure where it came from or what commit it was built from.

The one in the docker image is installed here 69d438ee65/.github/upstream/Dockerfile (L61) and is also very old, so I have https://github.com/pytorch/xla/pull/9102 to update it.

sccache is still not writing properly and I will investigate, but the xla build is currently broken after the above xla PR, and this should fix it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153002
Approved by: https://github.com/malfet
2025-05-08 23:22:20 +00:00
0203f89cc1 Revert "[BE]: Add PEP621 project section to pyproject.toml (#153055)"
This reverts commit 5976419c6939207834492a1f5fba4a62f2c91b0d.

Reverted https://github.com/pytorch/pytorch/pull/153055 on behalf of https://github.com/malfet due to And failures seems related to this change, but I don't know how, see for example 7cb5c751c3/1 ([comment](https://github.com/pytorch/pytorch/pull/153055#issuecomment-2864664725))
2025-05-08 23:17:58 +00:00
7cb5c751c3 Fix the basic description of torch.min(), torch.max(), torch.all(), torch.any() (#152658)
Fixes #152176

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152658
Approved by: https://github.com/malfet
2025-05-08 22:59:14 +00:00
5683965f02 [ROCm] Maxpool forward NHWC Perf Improvement targeting Resnet scenarios (#151727)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151727
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony, https://github.com/eqy
2025-05-08 22:38:23 +00:00
5dd746b4b5 [c10d] Reduce test verbosity (#153116)
We have been seeing a lot of `Starting event listener thread for rank` in test print-outs recently. Moving these to `logger.debug`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153116
Approved by: https://github.com/fduwjj
2025-05-08 22:22:22 +00:00
5a8c9c3ab0 [FSDP2][Doc] add pointer to torchtitan (#153079)
<img width="838" alt="Screenshot 2025-05-08 at 10 51 05 AM" src="https://github.com/user-attachments/assets/4cf43a16-3801-424b-a74f-ede1d41ff052" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153079
Approved by: https://github.com/mori360
2025-05-08 22:22:07 +00:00
88b56774bd At least one of ROCM_HOME or CUDA_HOME must be None (#152236)
Copied description by @hj-wei from
https://github.com/ROCm/pytorch/pull/1809

> Hi all, I manually generated nvcc to bypass NVIDIA component checks (Megatron-LM), see
2da43ef4c1/megatron/legacy/fused_kernels/__init__.py (L57)

> but it can lead to incorrect CUDA_HOME configurations. This can cause initialization anomalies in downstream libraries like DeepSpeed.
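
A hedged illustration of the constraint in the title (not the actual torch check; the environment handling is simplified):

```python
import os

# If a hand-crafted nvcc makes CUDA_HOME resolve on a ROCm machine, both
# "homes" appear to be set and downstream libraries can mis-detect the toolkit.
rocm_home = os.environ.get("ROCM_HOME")
cuda_home = os.environ.get("CUDA_HOME")
if rocm_home is not None and cuda_home is not None:
    raise RuntimeError(
        "Both ROCM_HOME and CUDA_HOME are set; at least one must be None "
        "so that a single toolkit is selected."
    )
```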

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152236
Approved by: https://github.com/jeffdaily
2025-05-08 22:20:25 +00:00
4064062e18 [c10d] Test multiple CUDA Graph captures (#150040)
1. Do multiple captures
2. Perform multiple collectives in one capture
3. Multiple replays (existing)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150040
Approved by: https://github.com/fduwjj
2025-05-08 22:14:03 +00:00
d9dc6b56ec Support using SymInt shapes for torch.baddbmm no-broadcast case (#153112)
A typical `bmm` kernel in Helion needs to pass in symint shapes to `torch.baddbmm`. Currently `self.expand((dim1, dim2, dim3))` in baddbmm runs unconditionally and it doesn't work with symint shapes (it raises the following error):
```
Traceback (most recent call last):
  File "/home/willfeng/local/helion_yf225/helion/_compiler/type_propagation.py", line 699, in propagate_call
    CheckForIndexCalls.retry_call(self.value, proxy_args, proxy_kwargs),
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/willfeng/local/helion_yf225/helion/_compiler/tile_index_proxy.py", line 104, in retry_call
    return fn(*proxy_args, **proxy_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/willfeng/local/pytorch/torch/utils/_stats.py", line 27, in wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/willfeng/local/pytorch/torch/_subclasses/fake_tensor.py", line 1338, in __torch_dispatch__
    return self.dispatch(func, types, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/willfeng/local/pytorch/torch/_subclasses/fake_tensor.py", line 1986, in dispatch
    return self._cached_dispatch_impl(func, types, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/willfeng/local/pytorch/torch/_subclasses/fake_tensor.py", line 1450, in _cached_dispatch_impl
    output = self._dispatch_impl(func, types, args, kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/willfeng/local/pytorch/torch/_subclasses/fake_tensor.py", line 2645, in _dispatch_impl
    r = func(*args, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^
  File "/home/willfeng/local/pytorch/torch/_ops.py", line 806, in __call__
    return self._op(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/willfeng/local/pytorch/torch/_prims_common/wrappers.py", line 309, in _fn
    result = fn(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^
  File "/home/willfeng/local/pytorch/torch/_meta_registrations.py", line 2172, in meta_baddbmm
    self = self.expand((dim1, dim2, dim3))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: /home/willfeng/local/pytorch/build/aten/src/ATen/RegisterCompositeExplicitAutograd_0.cpp:5025: SymIntArrayRef expected to contain only concrete integers
```
This PR changes it so that we don't run `expand()` when not necessary, which makes the Helion use case (i.e. no broadcasting) work.
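
A minimal sketch of the no-broadcast case this PR targets (shapes chosen arbitrarily for illustration):

```python
import torch

B, M, K, N = 4, 8, 32, 16
inp = torch.randn(B, M, N)        # already (B, M, N): no expand()/broadcast needed
batch1 = torch.randn(B, M, K)
batch2 = torch.randn(B, K, N)
out = torch.baddbmm(inp, batch1, batch2)
assert out.shape == (B, M, N)
```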

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153112
Approved by: https://github.com/jansel
2025-05-08 21:34:24 +00:00
4166373908 [dynamic shapes] guard_or_false for infer_size (#152146)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152146
Approved by: https://github.com/laithsakka
2025-05-08 21:27:22 +00:00
5976419c69 [BE]: Add PEP621 project section to pyproject.toml (#153055)
Follow-up to @ezyang's PR #153020, but this better uses PEP 621 to reduce redundant fields and pass metadata through to uv, setuptools, poetry, and other tooling.

* Enables modern tooling like uv sync and better support for tools like poetry.
* Also allows us to set project-wide settings that are respected by linters and IDEs (in this example we are able to centralize the minimum supported Python version).
* Currently most of the values are dynamically fetched from setuptools; eventually we can migrate all the statically defined values to pyproject.toml and they will be autopopulated in the setuptools arguments.
* This controls what additional metadata shows up on PyPI. Special URL names are listed here for rendering on PyPI: https://packaging.python.org/en/latest/specifications/well-known-project-urls/#well-known-labels

This also clearly shows us which fields will need to be migrated from setup.py to pyproject.toml over time per #152276. Static fields will be fairly easy to migrate; the dynamically built ones like requirements are a bit more challenging.

Without this, `uv sync` complains:
```
error: No `project` table found in: `pytorch/pyproject.toml`
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153055
Approved by: https://github.com/ezyang
2025-05-08 21:27:19 +00:00
9608e7fee9 [nativert] Address tooling setup for torch/nativert/ (#153164)
Summary:
As discussed with @malfet , we're porting nativert code to torch/nativert/.
Following up some concerns over the new directory, I'm trying to setup the tooling on OSS so various things (like linters) can run on torch/nativert/ properly.

Test Plan: CI

Differential Revision: D74407808

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153164
Approved by: https://github.com/dolpm, https://github.com/Skylion007
2025-05-08 21:11:33 +00:00
e820b05cab [inductor] Generate synthetic offsets appropriately for autotuning _scaled_grouped_mm (#152968)
Summary: The autotuner is using zero-filled tensors to autotune
_scaled_grouped_mm and that's not appropriate for the offsets tensor, since it
essentially corresponds to "no input" and thus yields invalid perf results.

We can't really use the actual input tensors, since we might be compiling this
op in the context of an entire graph.

So instead, I decided to create a synthetic offsets tensor assuming that each
group is (roughly) the same size.  I don't have data but I'd guess this
approach is OK for MoE since we're generally hoping to load-balance the
experts; I'm not sure how well it applies to other scenarios that might be more
heavy-tailed.
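
A hedged sketch of the idea (hypothetical helper; the exact offsets convention expected by `_scaled_grouped_mm`, cumulative end offsets vs. per-group sizes, is an assumption here):

```python
import torch

def synthetic_offsets(total_rows: int, num_groups: int) -> torch.Tensor:
    # Split total_rows into num_groups groups of roughly equal size,
    # instead of autotuning against a zero-filled (i.e. "no input") tensor.
    step = total_rows / num_groups
    ends = [round(step * (g + 1)) for g in range(num_groups)]
    return torch.tensor(ends, dtype=torch.int32)

print(synthetic_offsets(1024, 8))  # tensor([128, 256, ..., 1024], dtype=torch.int32)
```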

Test Plan:
```
pytest test_matmul_cuda.py -k test_scaled_grouped_gemm_
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152968
Approved by: https://github.com/ngimel
2025-05-08 21:07:04 +00:00
590965f92f [Graph Partition][Flex Attention] analyze symints from subgraph inputs and outputs (#152878)
Flex Attention may have symints in subgraph inputs and outputs. The existing code implicitly captures these symints but does not explicitly store them in TritonTemplateBuffer. This leads to errors when analyzing the symints used by Flex Attention as a TritonTemplateBuffer. This PR fixes the issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152878
Approved by: https://github.com/drisspg
2025-05-08 20:25:35 +00:00
6ae7730eeb Use gcc13 in Manylinux 2.28 images (#152825)
Related to: https://github.com/pytorch/pytorch/issues/152426
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152825
Approved by: https://github.com/malfet
2025-05-08 20:04:48 +00:00
8b8051f6ed [Minimizer] Fix the path naming (#153130)
Summary:
Added some logging and captured the indexing. See below image.
{F1977773416}

This is why the saved module path is called `/tmp/jimwan/minimizer_a_acc.pt`

Now the updated module paths are `/tmp/jimwan/minimizer_addmm_default_103_acc.pt`.

Test Plan:
```
MTIAC_USE_DIST_REF_KERNELS=all  buck2 run @//mode/opt mtia/accuracy/minimizer:mtia_minimizer_runner --  --mode sequential  --compare_fn allclose  --pt_save_dir  /tmp/debug3  --atol 1e-4 --rtol 1e-4 --all_outputs --start_idx native_layer_norm_default_80 --end_idx getitem_272 2>&1 | tee ~/test.log
```
{F1977773610}

Reviewed By: qcyuan

Differential Revision: D74369107

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153130
Approved by: https://github.com/Skylion007
2025-05-08 19:59:52 +00:00
086e2c2399 [TEST][ATen][CUDA] Skip row-wise scaled matrix multiplication tests on sm_120+ (#152814)
The float8 row-wise scaled matmuls are not supported on Blackwell yet. This PR adds skips to those tests to decrease the noise on `sm_120+` machines.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152814
Approved by: https://github.com/eqy, https://github.com/Skylion007
2025-05-08 19:34:20 +00:00
4b8b7c7fb9 [CI] Use cmake from pip instead of conda in CI docker images (#152537)
As in title

I don't know how the install_cmake script is used; I see it being called with 3.18, but when I look at the build jobs some say 3.18 and others 3.31.

Just make everything install cmake via requirements-ci.txt. I don't know if the comment at 5d36485b4a/.ci/docker/common/install_conda.sh (L78) still holds, but pretty much every build has CONDA_CMAKE set to true, so I'm just defaulting to installing through pip.

Also defaulting to 4.0.0 everywhere except the executorch docker build, because executorch reinstalls 3.31.something.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152537
Approved by: https://github.com/cyyever, https://github.com/atalman, https://github.com/malfet
2025-05-08 18:58:10 +00:00
b3524080dc [AOTInductor] Generate kernels separately for const graph and main graph (#153040)
Summary:
We should generate the kernel for const graph and main graph separately.
The reason is that when we run autotuning, we would create separate
kernel calls and we should make sure that main graph also contains the
runner.

Test Plan:
python test/inductor/test_aot_inductor.py -k test_autotune_with_constant_folding

Reviewers:

Subscribers:

Tasks:

Tags:

Differential Revision: [D74347765](https://our.internmc.facebook.com/intern/diff/D74347765)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153040
Approved by: https://github.com/angelayi
2025-05-08 18:45:45 +00:00
e5f869999c [inductor] Fix ModularIndexing assumptions (#152993)
Fixes https://github.com/pytorch/pytorch/issues/151198.

Since the result of ModularIndexing can be zero due to the modulo
operation, we should not make any assumption about ModularIndexing
being positive
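
A plain-Python model of the point (assuming ModularIndexing(x, div, mod) behaves like `(x // div) % mod`):

```python
def modular_indexing(x: int, div: int, mod: int) -> int:
    # The modulo means the result can be exactly zero, so the expression is
    # non-negative but must not be assumed to be strictly positive.
    return (x // div) % mod

print(modular_indexing(8, 2, 4))  # 0
```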

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152993
Approved by: https://github.com/yf225
2025-05-08 18:26:45 +00:00
d900c68ea6 c10d/gloo: add ibverbs backend (#153015)
Summary:
X-link: https://github.com/pytorch/gloo/pull/437

This provides a new "UnboundBuffer" implementation for Gloo ibverbs backend so it can be used with PyTorch.

This currently is passing basic tests such as `reduce_test` and `send_recv_test` but there are a number of failures. Putting this up for review so the follow up fixes are less of a mega PR and also so we can start doing some initial testing with this E2E with PyTorch.

Known issues:

* support recv from any is not supported
* AllreduceBcubeBase2 is failing

Test Plan:
```
buck2 run mode/dbgo //gloo/test:send_recv_test_ibverbs
buck2 test //gloo/test:

GLOO_DEVICE_TRANSPORT=IBVERBS buck2 run @//mode/opt //caffe2/test/distributed:c10d -- -r '.*gloo.*' -f
```

We can't run any of the gloo tests in CI since none of our CI machines have ibverbs so they're disabled by default and need to be manually run.

Differential Revision: D73291471

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153015
Approved by: https://github.com/fduwjj
2025-05-08 18:26:29 +00:00
7cdf5048ea Fix evaluate_expr to include suppress_guards_tls in cache key (#152661)
ShapeEnv.evaluate_expr() behaves differently based on the (tls) global "suppress_guards" - so its cache key needs to include that value.

This came up because #152662 triggered it in the test `test/dynamo/test_exc.py::ExcTests::test_trigger_bisect_on_error` - fixing this caused that test to work again.
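
A generic sketch of the caching pattern being fixed (hypothetical names, not the actual ShapeEnv code): any thread-local state that changes the result must be part of the memo key.

```python
import threading

_tls = threading.local()  # stand-in for the suppress_guards TLS flag

def suppress_guards_tls() -> bool:
    return getattr(_tls, "suppress_guards", False)

def _evaluate_expr_uncached(expr):
    return bool(expr)  # placeholder for the real evaluation

_memo: dict = {}

def evaluate_expr_cached(expr):
    key = (expr, suppress_guards_tls())  # include the TLS flag in the key
    if key not in _memo:
        _memo[key] = _evaluate_expr_uncached(expr)
    return _memo[key]
```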

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152661
Approved by: https://github.com/laithsakka
2025-05-08 18:25:34 +00:00
30a3c5d970 Skip lintchecks for now (#153156)
Devs have been complaining that it's failing. Completely remove these checks from lint.yml, as https://github.com/pytorch/pytorch/pull/153157 moved them to nightly.

See https://github.com/pytorch/pytorch/issues/152439  as well as https://github.com/pytorch/pytorch/issues/152884 and https://github.com/pytorch/pytorch/issues/152489 for more details

Was introduced in https://github.com/pytorch/pytorch/pull/152377
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153156
Approved by: https://github.com/albanD, https://github.com/ZainRizvi
2025-05-08 17:58:05 +00:00
e86b6b2a19 Add tests to check pretty print when padding is a string in C++ API (#153126)
Currently there are no tests to verify the behaviour of pretty print when padding is `torch::kSame` or `torch::kValid`. This PR adds these tests to check for future regressions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153126
Approved by: https://github.com/Skylion007
2025-05-08 17:55:25 +00:00
d36261d2e6 Revert "[dynamo] Avoid running torch.nn.Module.__call__ twice under torch.compile(mod) (#152740)"
This reverts commit 0886d402f155e0b34760a2906f4bd71c878fd98f.

Reverted https://github.com/pytorch/pytorch/pull/152740 on behalf of https://github.com/huydhn due to Discuss with the author to revert and reland this ([comment](https://github.com/pytorch/pytorch/pull/152740#issuecomment-2863779028))
2025-05-08 17:31:21 +00:00
34d424d813 Revert "[dynamo] Support delattr on result of torch.compile(module) (#152741)"
This reverts commit 6c025b5a8270e456405eccc26db1344ddd016d7b.

Reverted https://github.com/pytorch/pytorch/pull/152741 on behalf of https://github.com/huydhn due to Discuss with the author to revert and reland this ([comment](https://github.com/pytorch/pytorch/pull/152740#issuecomment-2863779028))
2025-05-08 17:31:21 +00:00
6a8006472e Fix doc cosineannealinglr 152081 (#152936)
## Summary

This PR updates the docstring for `CosineAnnealingLR` to accurately reflect its recursive learning rate schedule. The previous docstring displayed only the SGDR closed-form expression, which doesn't match the actual recursive implementation in code.

Changes:

- Added the recursive update formula used in `get_lr()`
- Retained the original closed-form SGDR expression for reference
- Clarified that warm restarts are not implemented in this scheduler

This addresses confusion raised in issue #152081.
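
For reference, a sketch of the two formulations in standard SGDR notation (eta is the learning rate, T_cur the number of epochs since the last restart, T_max the annealing period; this is based on the standard formulation, not a quote of the new docstring). Recursive update:

$$\eta_{t+1} = \eta_{\min} + (\eta_t - \eta_{\min}) \cdot \frac{1 + \cos\left(\frac{(T_{cur}+1)\pi}{T_{\max}}\right)}{1 + \cos\left(\frac{T_{cur}\,\pi}{T_{\max}}\right)}$$

Closed-form SGDR expression:

$$\eta_t = \eta_{\min} + \tfrac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\left(\frac{T_{cur}}{T_{\max}}\pi\right)\right)$$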

## Related issue

[#152081](https://github.com/pytorch/pytorch/issues/152081)

## Testing

Doc-only change. Ran pre-commit to verify formatting.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152936
Approved by: https://github.com/janeyx99
2025-05-08 17:25:30 +00:00
3cd69350ed [export] Unflatten None (#153000)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153000
Approved by: https://github.com/pianpwk
2025-05-08 16:40:13 +00:00
7b806a8cb1 Revert "[inductor][dynamo] Include operator name in size/stride/alignment assertion (#152353)"
This reverts commit 93576351270383ca37deaec6b2417a33dc045a93.

Reverted https://github.com/pytorch/pytorch/pull/152353 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to fail an inductor test in trunk ([comment](https://github.com/pytorch/pytorch/pull/152353#issuecomment-2863657185))
2025-05-08 16:39:28 +00:00
cyy
d291fa8ecc Avoid std::chrono::system_clock (#153135)
This PR replaces most uses of `std::chrono::system_clock` with `std::chrono::steady_clock` when the duration is used in condition variables. Ideally, system clocks should be used only to log wall-clock times.

Some `high_resolution_clock` uses are also changed to `steady_clock` because its resolution is not required in those contexts.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153135
Approved by: https://github.com/albanD, https://github.com/Skylion007, https://github.com/malfet
2025-05-08 16:30:29 +00:00
fe8ebacee4 [ROCm] Upgrade ROCm CI to ROCm6.4 (#151368)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151368
Approved by: https://github.com/jeffdaily, https://github.com/malfet

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-05-08 16:12:16 +00:00
05326b7e49 Revert "Add runtime asserts to AOTI (#152125)"
This reverts commit 834bc5e4148538b7544aafdf5b090d007600fbd6.

Reverted https://github.com/pytorch/pytorch/pull/152125 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/152125#issuecomment-2863554139))
2025-05-08 15:58:18 +00:00
1d3e8f326a [CI] Increase shards number for XPU ci UT tests (#149113)
The XPU CI tests hit a timeout issue (see https://github.com/pytorch/pytorch/actions/runs/14897047392/job/41842336828); this PR will reduce the CI time cost.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149113
Approved by: https://github.com/etaf, https://github.com/EikanWang
2025-05-08 15:42:33 +00:00
8141b146ca Run URL linter on nightly only (#153157)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153157
Approved by: https://github.com/malfet
2025-05-08 15:32:42 +00:00
efa07df257 [c10d] Remove unordered PG destroy test (#153110)
torch.distributed does not support unordered ProcessGroup destroy. Removing the test.

Resolves #137507

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153110
Approved by: https://github.com/fduwjj, https://github.com/fegin
2025-05-08 15:29:44 +00:00
500cbeee4e [dynamo][ca] support dynamic annotations on tensors in ListVariables/TupleVariables (#152119)
Together with https://github.com/pytorch/pytorch/pull/151962, FIXES https://github.com/pytorch/pytorch/issues/133575

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152119
Approved by: https://github.com/jansel
ghstack dependencies: #151731, #151962
2025-05-08 15:12:16 +00:00
6dea8ef555 [ca] hide unused scalar int sizes from dynamo (#151962)
together with https://github.com/pytorch/pytorch/pull/151731, FIXES https://github.com/pytorch/pytorch/issues/113129 https://github.com/pytorch/pytorch/issues/146168

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151962
Approved by: https://github.com/jansel
ghstack dependencies: #151731
2025-05-08 15:12:16 +00:00
8f380b239f [ca] mark scalar int sizes as dynamic via tensor wrapping (#151731)
This is the only way to support dynamic shapes on scalars right now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151731
Approved by: https://github.com/jansel
2025-05-08 15:12:08 +00:00
a7ea115494 Revert "[CI] Use cmake from pip instead of conda in CI docker images (#152537)"
This reverts commit 941062894a1accfd472d0acd2716493e1f173bd7.

Reverted https://github.com/pytorch/pytorch/pull/152537 on behalf of https://github.com/malfet due to Sorry to revert this PR, but it broke doc builds, see 4976b1a3a8/1 ([comment](https://github.com/pytorch/pytorch/pull/152537#issuecomment-2863337268))
2025-05-08 14:53:34 +00:00
4976b1a3a8 Keep raw cubin file around in case it gets deleted underneath us (#153064)
This diff hardens StaticCudaLauncher in the event a cubin file gets deleted under us. We store the raw cubin on the static cuda launcher, and reload it as needed. On cold start, this can happen if the cubin file is created by triton, and gets deleted before we can load the kernel on the parent process.

We don't want to store the entire cubin both in file format and in memory for caching purposes, so we delete it before caching the data. In the unfortunate/unlikely event where we can't load/find the necessary file on warm start, skip the stored triton launcher, falling back to regular triton.

This comes at a cost to worker memory, but it's not more memory than regular triton workers already take, so it should be okay.

Tests:
- Make test_static_cuda_launcher always delete the cubin path and reload it

Fixes #153030

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153064
Approved by: https://github.com/oulgen, https://github.com/jansel
2025-05-08 14:29:19 +00:00
13bdfe6577 get right function declaration on windows inductor (#152939)
Fixes #152251

`get_export_declaration` introduced one extra ')' on the Windows platform, which caused this function-declaration pattern to differ from Linux.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152939
Approved by: https://github.com/xuhancn, https://github.com/jansel
2025-05-08 14:28:33 +00:00
0f9821d0e3 [BE][lint] fix PYFMT for PT-D code under torch.testing._internal, add them to the lint list (#153114)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153114
Approved by: https://github.com/cyyever, https://github.com/fegin, https://github.com/H-Huang, https://github.com/Skylion007
2025-05-08 14:01:49 +00:00
2926dd4d8e Stop proxy-ing autograd.Function.ctx into the graph (#152621)
The reason we did this before is that it is how our older
autograd.Function x Dynamo interaction worked, but we've since adopted
newer designs that don't actually need the autograd.Function.ctx proxied
into the graph.

We still need an fx.Proxy for the autograd.Function.ctx object, so
whenever we do, I create one via discard_graph_changes.

Test Plan:
- existing tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152621
Approved by: https://github.com/oulgen
2025-05-08 13:32:54 +00:00
22c31046d4 Fixed rerr computation in lobpcg (#152789)
Fixes #101075

This PR fixes an issue with the computation of residuals in the LOBPCG algorithm.

**Bug**: [Line 788](8f54e56e62/torch/_lobpcg.py (L788)) is supposed to compute the denominator in Equation 9 of [Duersch et al., 2018](https://arxiv.org/abs/1704.07458), as also suggested in [line 776](8f54e56e62/torch/_lobpcg.py (L776)), but it uses the raw eigenvalue-estimates instead of their absolute values.

**Consequence**: This made the algorithm's success sensitive to initialization of eigenvectors.

**Tests**:
- I have tested @jtorde's [script](https://github.com/pytorch/pytorch/issues/101075#issuecomment-1545349559), and I did NOT run into any assertion errors for a few minutes (as opposed to the original implementation, which fails after a few seconds).
- I have also tried @pearu's specific [test case](https://github.com/pytorch/pytorch/issues/101075#issuecomment-1548483685), which also executes successfully - the residuals remain positive, and the final output is the same as one returned by SciPy (with and without enforcing the use of LOBPCG).
- I extracted the relevant test cases from [test/test_autograd.py](https://github.com/pytorch/pytorch/blob/main/test/test_autograd.py) and [test/test_linalg.py](https://github.com/pytorch/pytorch/blob/main/test/test_linalg.py), and they ran successfully.

Let me know if further test cases or benchmarks are needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152789
Approved by: https://github.com/pearu, https://github.com/lezcano
2025-05-08 12:22:31 +00:00
34d4363e6d [dynamo] Fix super and classmethod binding of cls object (#153105)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153105
Approved by: https://github.com/jansel
ghstack dependencies: #152883
2025-05-08 12:07:08 +00:00
941062894a [CI] Use cmake from pip instead of conda in CI docker images (#152537)
As in title

I don't know how the install_cmake script is used; I see it being called with 3.18, but when I look at the build jobs some say 3.18 and others 3.31.

Just make everything install cmake via requirements-ci.txt. I don't know if the comment at 5d36485b4a/.ci/docker/common/install_conda.sh (L78) still holds, but pretty much every build has CONDA_CMAKE set to true, so I'm just defaulting to installing through pip.

Also defaulting to 4.0.0 everywhere except the executorch docker build, because executorch reinstalls 3.31.something.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152537
Approved by: https://github.com/cyyever, https://github.com/atalman, https://github.com/malfet
2025-05-08 10:10:27 +00:00
bfc0920d95 [C10D] Move getNcclDataType into NCCLUtils (#153113)
Differential Revision: D74365214

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153113
Approved by: https://github.com/ngimel
2025-05-08 08:54:05 +00:00
dfb91a627f Clean up of CUTLASS_VERSION (#152947)
Fixes #152847

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152947
Approved by: https://github.com/eqy, https://github.com/cyyever
2025-05-08 08:32:34 +00:00
9357635127 [inductor][dynamo] Include operator name in size/stride/alignment assertion (#152353)
Fixes #151930

This PR updates the `assert_size_stride` and `assert_alignment` functions in [guards.cpp](https://github.com/pytorch/pytorch/blob/main/torch/csrc/dynamo/guards.cpp) to accept an optional `op_name` argument and includes it in the error messages.

The corresponding type stubs in [guards.pyi](https://github.com/pytorch/pytorch/blob/main/torch/_C/_dynamo/guards.pyi) are updated to match the new function arg.

In [inductor/ir.py](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/ir.py), the operator name is extracted from the FX graph and passed into the `codegen_size_asserts` and `codegen_alignment_asserts` functions, so that generated assertions in Triton code include the op name for better debugging.

Added unit tests inside [test_torchinductor.py](https://github.com/pytorch/pytorch/blob/main/test/inductor/test_torchinductor.py).
- Verified both successful and failing assertion cases include the operator name.
- Verified that generated Triton code contains the op name inside the asserts.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152353
Approved by: https://github.com/jansel, https://github.com/shunting314
2025-05-08 08:28:05 +00:00
4f9dd3c3e5 [cutlass backend] Fix EVT test for fbcode post cutlass 3.9.2 upgrade (#153106)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153106
Approved by: https://github.com/mlazos
2025-05-08 08:20:40 +00:00
f9df09da08 [mm sampling] extract more triton information (#153099)
Summary:
# Why

capture more triton config information that was not being captured

# What

capture and extract

- group_m
- allow_tf32
- acc_type
- matrix_instr_nonkdim
- waves_per_eu
- kpack

to achieve this, add

- matrix_instr_nonkdim
- waves_per_eu
- kpack

to the info_dict of the TritonTemplateCaller

Test Plan:
with D74342290

```
buck2 run -c fbcode.rocm_arch=mi300 -m rocm621 mode/opt-amd-gpu  fbcode//deeplearning/aot_inductor/benchmark/sampling:test_gemm_autotune_benchmark_AMD_block_0 2>&1 | tee /tmp/tmp.52Igj8lthj/15.txt
```

(edited for clarity and brevity)

```
AutotuneMetrics03LogEntry(
    backend='Triton',
    exectime_ms=0.007449999917298555,
    perf_model_name='scripts.vandrei.pytorch_experiments.matmul_estimator_lib.estimate_matmul_time_new',
    perf_model_exectime_ms=0.009558684365573179,
    config_triton_block_m=16,
    config_triton_block_n=256,
    config_triton_block_k=128,
    config_triton_num_stages=2,
    config_triton_num_warps=8,
    config_triton_group_m=16,
    config_triton_allow_tf32='False',
    config_triton_acc_type='tl.float32',
    config_triton_matrix_instr_nonkdim=16,
    config_triton_waves_per_eu=1,
    config_triton_kpack=2,
    x_batch_dim=0,
    x_row_dim=8,
    x_col_dim=96,
    x_batch_stride=0,
    x_row_stride=96,
    x_col_stride=1,
    x_dtype='torch.float16',
    x_dtype_size=16,
    w_batch_dim=0,
    w_row_dim=96,
    w_col_dim=512,
    w_batch_stride=0,
    w_row_stride=512,
    w_col_stride=1,
    w_dtype='torch.float16',
    w_dtype_size=16,
    vendor='AMD',
    model='gfx942:sramecc+:xnack-',
    major=9,
    minor=4,
    sms=304,
    l2_cache=4194304,
    warp_size=64,
    regs_per_sm=65536,
    max_threads_per_sm=2048,
    total_mem=206141652992,
    hip_version='6.2.41134',
    triton_upstream_hash='3889f3f3b97b817741e308c173409927b7c4536f',
    environment='experiment-xzy-default',
    session_id='8a7001bd-652c-440c-bc56-4cb1e25146ea',
    [...]
)
```

Reviewed By: exclamaforte

Differential Revision: D74342286

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153099
Approved by: https://github.com/exclamaforte, https://github.com/eellison
2025-05-08 07:24:28 +00:00
3c87529d23 Make device check error message more descriptive (#150750)
Fixes #122757

## Test Result

```python
import torch

model_output = torch.randn(10, 5).cuda()
labels = torch.randint(0, 5, (10,)).cuda()
weights = torch.randn(5)

loss_fn = torch.nn.CrossEntropyLoss(weight=weights)
loss = loss_fn(input=model_output, target=labels)
print(loss)

Traceback (most recent call last):
  File "/home/zong/code/pytorch/../loss2.py", line 17, in <module>
    loss = loss_fn(input=model_output, target=labels)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zong/code/pytorch/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zong/code/pytorch/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zong/code/pytorch/torch/nn/modules/loss.py", line 1297, in forward
    return F.cross_entropy(
           ^^^^^^^^^^^^^^^^
  File "/home/zong/code/pytorch/torch/nn/functional.py", line 3494, in cross_entropy
    return torch._C._nn.cross_entropy_loss(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Expected all tensors to be on the same device, but got weight is on cpu, different from other tensors on cuda:0 (when checking argument in method wrapper_CUDA_nll_loss_forward)

```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150750
Approved by: https://github.com/malfet
2025-05-08 06:19:44 +00:00
c73bd990cf fix shard tensor gather when a local tensor on certain ranks has zero elements (#150914)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150914
Approved by: https://github.com/fduwjj
2025-05-08 05:06:22 +00:00
94ca3a4666 Add torch._C.Tag.needs_contiguous_strides (#152859)
This tag makes Inductor force the inputs to be contiguous.

Test Plan:
- new test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152859
Approved by: https://github.com/eellison
2025-05-08 04:49:59 +00:00
2d25e4d478 [1/n][Optimus][Auto-AC] Support activation quantization without scaling (#148380)
Summary: We enable the activation quantization in the forward pass, and users can customize the dtype they want to quantize.

Test Plan:
# unit test

```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:quantization -- test_activation_quantization_aten
```

Buck UI: https://www.internalfb.com/buck2/776d3911-bb86-4ac8-a527-540cf1510b9d
Test UI: https://www.internalfb.com/intern/testinfra/testrun/4785074873051017
Network: Up: 4.3MiB  Down: 42MiB  (reSessionID-fef7e727-68b1-4645-a519-5652854df38d)
Executing actions. Remaining     0/4                                                                                 6.7s exec time total
Command: test.     Finished 2 local
Time elapsed: 3:11.5s
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0

# E2E

### how to enable (you can override the dtype; if nothing is given, the default is fp8)

```
post_grad_fusion_options={
            "activation_quantization_aten_pass": {"quant_type": "torch.float8_e5m2"}
        },
```

Differential Revision: D70522237

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148380
Approved by: https://github.com/Mingming-Ding, https://github.com/Hahu803
2025-05-08 04:44:15 +00:00
6f6fac6a41 [dynamo] Fix bug in hasattr(tensor, "size") (#152883)
Fixes https://github.com/pytorch/pytorch/issues/135696

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152883
Approved by: https://github.com/StrongerXi
2025-05-08 01:16:01 +00:00
834bc5e414 Add runtime asserts to AOTI (#152125)
Summary:
Solves https://github.com/pytorch/pytorch/issues/151925

Currently, AOTI only generate runtime asserts for unbacked symints. We should generate asserts for all `_assert_scalar` calls in the input graph.

Also factored out the run time assertion logic to a separate function.

        We need to generate runtime asserts directly in Inductor instead
        of just re-using the asserts from input graphs because we reuse the
        same ShapeEnv as before. In particular, on subsequent graph passes,
        we would immediately turn all of these assertions into noops,
        because when we evaluated their expressions, we would see that
        because we had a deferred runtime assert in the ShapeEnv, we
        know "oh, of course this expression is True" already.
        One example is below:
```
        class Model(torch.nn.Module):
            def forward(self, a, b, c):
                nz = torch.nonzero(a)
                ones = a.new_ones([nz.size(0), b.size(0)])
                torch._check(ones.size(0) >= 1)
                equals = torch.add(ones, c)
                return equals
        torch._dynamo.mark_dynamic(c, 0)
```
        When we re-use the ShapeEnv in Inductor lowering, the check that checks
        a and nonzero have the same shape would be evaluated to True after we resolve
        unbacked bindings using the ShapeEnv.
        See test_unbacked_equals_input_size_runtime_assertion in test_aot_inductor.

        In addition to the Inductor generated runtime asserts, we also
        need the runtime asserts from the input graph, because some derived
        runtime asserts are not generated in Inductor. One example is
        below:
```
        class Model(torch.nn.Module):
            def forward(self, x):
                y = x.reshape(100, -1).clone()
                y = y + 1
                return y

        dynamic_shapes = {
            "x": {0: torch.export.Dim.DYNAMIC},
        }
        x.shape[0] needs to be a multiple of 100.
```
        See test_aoti_runtime_asserts_backed_symint in test_aot_inductor.

Example:

```
    def forward(self):
        arg0_1: "f32[s35]";

        arg0_1, = fx_pytree.tree_flatten_spec([], self._in_spec)
         # File: /data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/73a672eb896e7996/scripts/shangdiy/__pt__/pt#link-tree/scripts/shangdiy/pt.py:11 in forward, code: y = x.reshape(100, -1).clone()
        sym_size_int: "Sym(s35)" = torch.ops.aten.sym_size.int(arg0_1, 0)

         #
        mod: "Sym(Mod(s35, 100))" = sym_size_int % 100;  sym_size_int = None
        eq_2: "Sym(Eq(Mod(s35, 100), 0))" = mod == 0;  mod = None
        _assert_scalar = torch.ops.aten._assert_scalar.default(eq_2, "Runtime assertion failed for expression Eq(Mod(s35, 100), 0) on node 'eq'");  eq_2 = _assert_scalar = None

         # File: /data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/73a672eb896e7996/scripts/shangdiy/__pt__/pt#link-tree/scripts/shangdiy/pt.py:11 in forward, code: y = x.reshape(100, -1).clone()
        view: "f32[100, (s35//100)]" = torch.ops.aten.reshape.default(arg0_1, [100, -1]);  arg0_1 = None
        clone: "f32[100, (s35//100)]" = torch.ops.aten.clone.default(view);  view = None

         # File: /data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/73a672eb896e7996/scripts/shangdiy/__pt__/pt#link-tree/scripts/shangdiy/pt.py:12 in forward, code: y = y + 1
        add_6: "f32[100, 1]" = torch.ops.aten.add.Tensor(clone, 1);  clone = None
        return (add_6,)
```

Generated cpp code:

```
    auto inputs = steal_from_raw_handles_to_raii_handles(input_handles, 1);
    auto arg0_1 = std::move(inputs[0]);
    auto arg0_1_size = arg0_1.sizes();
    int64_t s35 = arg0_1_size[0];
    inputs.clear();
    auto& kernels = static_cast<AOTInductorModelKernels&>(*this->kernels_.get());
    if (!((s35 % 100L) == 0L)) { throw std::runtime_error("Expected Eq(Mod(s35, 100), 0) to be True but received " + std::to_string(s35)); }
```

Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:test_aot_inductor -- -r aoti_runtime_asserts_backed_symint
```

Differential Revision: D73596786

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152125
Approved by: https://github.com/henrylhtsang, https://github.com/jingsh
2025-05-08 00:27:24 +00:00
20e2ca3e29 [Dynamo] Allow inlining into AO quantization modules (#152934)
This adds dynamo inlining into `torch.ao.quantization.fake_quantize`.

This is needed for QAT compatibility with an RL training model.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152934
Approved by: https://github.com/williamwen42
2025-05-07 23:58:11 +00:00
5bf0c3518c Detect NVSHMEM location (#153010)
### Changes
- Detect NVSHMEM install location via `sysconfig.get_path("purelib")`, which typically resolves to `<conda_env>/lib/python/site-packages`; the NVSHMEM include and lib directories live under `nvidia/nvshmem` (see the sketch after this list)
- Added link dir via `target_link_directories`
- Removed direct dependency on mlx5
- Added preload rule (following other NVIDIA libs)
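
A minimal sketch of the detection described in the first bullet (the `nvidia/nvshmem` layout under site-packages is taken from this description):

```python
import os
import sysconfig

purelib = sysconfig.get_path("purelib")  # e.g. <env>/lib/python3.x/site-packages
nvshmem_root = os.path.join(purelib, "nvidia", "nvshmem")
nvshmem_include = os.path.join(nvshmem_root, "include")
nvshmem_lib = os.path.join(nvshmem_root, "lib")
have_nvshmem = os.path.isdir(nvshmem_include) and os.path.isdir(nvshmem_lib)
print(have_nvshmem, nvshmem_include, nvshmem_lib)
```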

### Plan of Record
1. End user experience: link against NVSHMEM dynamically (the NVSHMEM lib is about 100 MB, similar to NCCL, so we'd rather have users `pip install nvshmem` than have torch carry the bits)
2. Developer experience: at compile time, prefer a wheel dependency over a Git submodule
General rule: use a submodule for a small lib that torch can statically link with
If a user would pip install a lib, our CI build process should do the same, rather than building from a Git submodule (just for its headers, for example)
3. Keep `USE_NVSHMEM` to gate non-Linux platforms, like Windows, Mac
4. At configuration time, we should be able to detect whether nvshmem is available, if not, we don't build `NVSHMEMSymmetricMemory` at all.

For now, we have symbol dependency on two particular libs from NVSHMEM:
- libnvshmem_host.so: contains host side APIs;
- libnvshmem_device.a: contains device-side global variables AND device function impls.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153010
Approved by: https://github.com/ngimel, https://github.com/fduwjj, https://github.com/Skylion007
2025-05-07 23:35:04 +00:00
df1ec045b5 [Cutlass] Add epilogue inputs/outputs to def_kernel (#151406)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151406
Approved by: https://github.com/eellison
ghstack dependencies: #152733, #150906
2025-05-07 23:09:02 +00:00
d483aefafa [Cutlass] Integrate EVT into CUDACPPScheduling (#150906)
Previously merged:
* #151713
* #151405
* #150905
* #152306
* #152305

Allow epilogue nodes in cuda combined scheduling

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150906
Approved by: https://github.com/eellison
ghstack dependencies: #152733
2025-05-07 23:09:02 +00:00
6b9d741e1c [Cutlass] Handle broadcasting in EVT python codegen (#152733)
Previously merged:
* #151713
* #151405
* #150905
* #152306
* #152305

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152733
Approved by: https://github.com/eellison
2025-05-07 23:09:02 +00:00
4270517cbf Fix test/test_optim.py error message. (#153076)
Fixes an error message in test/test_optim.py

Current behavior: If running the test with Adagrad, the error message reads: "SGD does not currently support capturable".

Fix: The error message now says correctly: "Adagrad does not currently support capturable".
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153076
Approved by: https://github.com/janeyx99
2025-05-07 22:46:05 +00:00
7706074ece Fix TORCH_CHECK error message in FusedSgdKernel (#153074)
This fixes an issue in the TORCH_CHECK error message in the FusedSgdKernel.

Current behavior: If the LR tensor is not on the same device as the parameters, the error message reads: "found_inf must be on the same GPU device as the params".

Fix: The error message now correctly points out "lr must be on the same GPU device as the params".
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153074
Approved by: https://github.com/Skylion007, https://github.com/janeyx99
2025-05-07 22:10:09 +00:00
cecfc7dc53 [CUDA][cuDNN] Fix handling of CPU side input and target length tensors in CTCLoss (#152745)
https://github.com/pytorch/pytorch/pull/128271 migrated to cuDNN V8 CTCLoss which expects input and target length tensors to be on `CUDA` rather than `CPU` without adding the logic to account for the edge case of them being on `CPU`

see also #152421
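
A hedged repro-style sketch of the edge case (requires a CUDA build; shapes are arbitrary):

```python
import torch

ctc = torch.nn.CTCLoss()
log_probs = torch.randn(50, 4, 20, device="cuda").log_softmax(2)
targets = torch.randint(1, 20, (4, 10), dtype=torch.long, device="cuda")
# The edge case: length tensors left on CPU while the data lives on CUDA.
input_lengths = torch.full((4,), 50, dtype=torch.long)
target_lengths = torch.full((4,), 10, dtype=torch.long)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
```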

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152745
Approved by: https://github.com/Skylion007
2025-05-07 22:01:18 +00:00
773a91c775 [ONNX] dynamic_shapes uses DYNAMIC (#153065)
Although Dim.AUTO covers the case where a user marks more axes as dynamic than the model actually needs, it silently falls back to STATIC when DYNAMIC fails, which makes debugging harder.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153065
Approved by: https://github.com/justinchuby
2025-05-07 21:48:41 +00:00
a2891cba2f [cutlass backend] Skip cuda lib path if it is torch/lib (#153003)
Differential Revision: [D74284808](https://our.internmc.facebook.com/intern/diff/D74284808/)

This is a bit risky for cutlass backend, so decided to separate it out. Tested offline.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153003
Approved by: https://github.com/chenyang78
2025-05-07 21:28:15 +00:00
5bb154e6fd [nativert] Move MPMCQueue to torch/nativert. (#152837)
Summary:
Torch Native Runtime RFC: https://github.com/zhxchen17/rfcs/blob/master/RFC-0043-torch-native-runtime.md

To land the runtime into PyTorch core, we will gradually land logical parts of the code into the Github issue and get each piece properly reviewed.

This diff adds a small library implementing a multi-producer multi-consumer queue which will be used to synchronize tasks for Torch Native Runtime.

Differential Revision: D74184245

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152837
Approved by: https://github.com/albanD, https://github.com/dolpm, https://github.com/swolchok
2025-05-07 21:17:42 +00:00
d2ee606e9b [Inductor] Set correct baseline for decomposek test (#152897)
Differential Revision: D74218923

Running on A100 seems to result in precision loss from decompose_k. This was root caused to the fp16/bf16 reduction setting, which establishes a less precise baseline than decompose_k, as decompose_k uses the bmm.dtype overload for fp32 output.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152897
Approved by: https://github.com/eellison
2025-05-07 21:02:47 +00:00
1ff3c223d2 [c10d][fr] Make FR vendor neutral so that other backends can use it (#152563)
The current FR code is built with `USE_C10D_NCCL`; we should remove that to make it generic. We keep the existing API used by NCCL for backward compatibility, since most existing use cases are FR with NCCL. The generic version with c10::Event can then be used for other backends like Gloo, etc.

The current Unit test should cover the change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152563
Approved by: https://github.com/kwen2501, https://github.com/d4l3k
ghstack dependencies: #152585
2025-05-07 20:37:40 +00:00
642e9305eb Fixes detection of ArmPL on Linux platform (#150031)
On Linux, detection failed to notice that there is a bin directory because it wasn't looking for armpl-info, which is the only file in that directory on Linux. This also adds a link to the math library, as it is required to link against when checking for LAPACK functions.

Fixes #149610

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150031
Approved by: https://github.com/fadara01, https://github.com/malfet
2025-05-07 19:47:21 +00:00
f5f8f637a5 [Typing] Improve device typing for torch.set_default_device() (#153028)
Part of: #152952

Here is the definition of `torch.types.Device`:

ab997d9ff5/torch/types.py (L74)

So `_Optional[_Union["torch.device", str, builtins.int]]` is equivalent to it.
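
The annotation above corresponds to call forms like the following (a small sketch; the `None` reset follows from the `Optional`):

```python
import torch

torch.set_default_device("cuda")                    # str
torch.set_default_device(torch.device("cuda", 0))   # torch.device
torch.set_default_device(0)                         # builtins.int (device index)
torch.set_default_device(None)                      # Optional: clear the override
```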

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153028
Approved by: https://github.com/Skylion007
2025-05-07 19:31:43 +00:00
dd7d231ed3 [cutlass backend][test] re-enable test_cuda_compile_command for fbcode (#153001)
Differential Revision: [D74284047](https://our.internmc.facebook.com/intern/diff/D74284047/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153001
Approved by: https://github.com/ColinPeppler
2025-05-07 19:06:24 +00:00
62b7ef06cc [Dynamo] Remove unused guard PYMODULE_MATCH (#152961)
Not used anywhere: https://www.internalfb.com/code/search?q=repo%3Afbcode%20PYMODULE_MATCH

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152961
Approved by: https://github.com/jansel
ghstack dependencies: #152725, #152727, #152728, #152730, #152865, #152872
2025-05-07 18:58:18 +00:00
d9b8473b59 [Dynamo] Guard serialization for RANGE_ITERATOR_MATCH (#152872)
Tests serialization for RANGE_ITERATOR_MATCH; includes no non-test changes.

This PR handles iterator exhaustion issues by utilizing the janky solution from #152865; it passes a function to generate kwargs and `frame_state.f_locals` is updated with fresh iterators through a second kwarg generation pass after initial tracing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152872
Approved by: https://github.com/jansel
ghstack dependencies: #152725, #152727, #152728, #152730, #152865
2025-05-07 18:58:18 +00:00
52f7106c00 [Dynamo] Guard serialization for TUPLE_ITERATOR_LEN (#152865)
Tests serialization for TUPLE_ITERATOR_LEN; includes no non-test changes.

Passing a tuple iterator as input results in the iterator being exhausted during testing. I threw together a super janky workaround via accepting a func for kwarg generation and replacing `frame_state.f_locals` with newly-generated kwargs to get fresh iterators, but insights into a better approach are welcome!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152865
Approved by: https://github.com/jansel
ghstack dependencies: #152725, #152727, #152728, #152730
2025-05-07 18:58:18 +00:00
fb500d0b1c [Dynamo] Guard serialization for SEQUENCE_LENGTH (#152730)
Tests only; no other changes needed. Test logic uses a tuple function input to trigger installation of a SEQUENCE_LENGTH guard.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152730
Approved by: https://github.com/jansel
ghstack dependencies: #152725, #152727, #152728
2025-05-07 18:58:18 +00:00
42954ab28e [Dynamo] Guard serialization for CLOSURE_MATCH (#152728)
Unsupported because it uses unsupported FUNCTION_MATCH.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152728
Approved by: https://github.com/jansel
ghstack dependencies: #152725, #152727
2025-05-07 18:58:18 +00:00
a9186ec723 [Dynamo] Guard serialization for FUNCTION_MATCH (#152727)
Unsupported because it uses unsupported ID_MATCH.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152727
Approved by: https://github.com/jansel
ghstack dependencies: #152725
2025-05-07 18:58:18 +00:00
a6f51be2fd [Dynamo] Guard serialization for NN_MODULE (#152725)
Throws an error when attempting to serialize an NN_MODULE guard. It is not supported because it uses the unsupported ID_MATCH guard (#152330):

a6dd1c2208/torch/_dynamo/guards.py (L1738-L1739)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152725
Approved by: https://github.com/jansel
2025-05-07 18:58:17 +00:00
2cf7fd0d2b Update docs of saved_tensors_hooks to avoid ref cycle (#153049)
Fixes #115255

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153049
Approved by: https://github.com/Skylion007, https://github.com/soulitzer
2025-05-07 18:54:56 +00:00
7cf8049d63 [BE] Update ruamel to 0.18.10 (#153057)
To address the feedback from https://github.com/pytorch/pytorch/pull/153013
Previously it was pinned to 0.17.4, which was released in 2021.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153057
Approved by: https://github.com/Skylion007
ghstack dependencies: #153013
2025-05-07 18:11:14 +00:00
d042ec856b Use gather in index_select (#151715)
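A small sketch of the equivalence that motivates this change (dim 0 case, assumed for illustration):

```python
import torch

x = torch.randn(5, 3)
idx = torch.tensor([0, 2, 4])
a = torch.index_select(x, 0, idx)
b = torch.gather(x, 0, idx.unsqueeze(1).expand(-1, x.size(1)))
assert torch.equal(a, b)
```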
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151715
Approved by: https://github.com/ngimel
2025-05-07 17:55:34 +00:00
eqy
172e641529 [CUDA] Reset peak memory stats before running test_set_per_process_memory_fraction (#152540)
Otherwise previous tests can cause `application = int(total_memory * 0.499) - torch.cuda.max_memory_reserved()` to go negative

Hopefully abates current flakiness (see also https://github.com/pytorch/pytorch/issues/135115#:~:text=TestCuda.test_set_per_process_memory_fraction)
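
A minimal sketch of the mitigation (the 0.499 budget expression is taken from the comment above):

```python
import torch

torch.cuda.empty_cache()
torch.cuda.reset_peak_memory_stats()  # clear peaks left over from earlier tests

total_memory = torch.cuda.get_device_properties(0).total_memory
application = int(total_memory * 0.499) - torch.cuda.max_memory_reserved()
assert application >= 0
```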

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152540
Approved by: https://github.com/Skylion007
2025-05-07 17:02:39 +00:00
8b9c9a327f [cutlass backend] cache filtered ops based on layouts (#152580)
Differential Revision: [D73972687](https://our.internmc.facebook.com/intern/diff/D73972687/)

Add cache to store the list of filtered ops for a specific shape + layout + dtype (aka hash on input_nodes).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152580
Approved by: https://github.com/eellison
2025-05-07 16:38:22 +00:00
61dd2a0cc3 Revert "[BE] Update numba versions (#152557)"
This reverts commit 80d2116405367e1dd11648ab4225d4207d5e6132.

Reverted https://github.com/pytorch/pytorch/pull/152557 on behalf of https://github.com/malfet due to This time it breaks torchbench tests, see 9c114934f7/1(inductor_torc&mergeEphemeralLF=true ([comment](https://github.com/pytorch/pytorch/pull/152557#issuecomment-2858945427))
2025-05-07 15:03:41 +00:00
9c114934f7 [Lint] Add install command for GHA step (#153013)
Otherwise, it fails to run the script
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153013
Approved by: https://github.com/wdvr, https://github.com/cyyever
2025-05-07 14:55:00 +00:00
42b3e560ee Thread through options so GraphPickler can allow all ops (#152801)
Fixes #151904

In #151904 we discussed the feasibility of including all ops in the GraphPickler. This PR changes it so we can filter which ops are allowed and which are blocked.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152801
Approved by: https://github.com/masnesral
2025-05-07 14:36:50 +00:00
f393ee5ab5 Use torch.types.Device in device_interface.py (#152935)
This is just a clean-up change that I noticed was possible; it removes the duplicate `_device_t` type which had the same semantics.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152935
Approved by: https://github.com/Skylion007
2025-05-07 13:20:10 +00:00
cyy
2f09e79142 Fix Codegen.cmake warning (#153023)
Fix
```
CMake Warning (dev) in cmake/Codegen.cmake:
  A logical block opening on the line

    /var/lib/jenkins/workspace/cmake/Codegen.cmake:393 (if)

  closes on the line

    /var/lib/jenkins/workspace/cmake/Codegen.cmake:401 (endif)

  with mis-matching arguments.
```
by removing the condition in `endif`.

We could instead fix it; however, that is not best practice. For example, cmake_lint warns about it, and CMake says
```
The optional <condition> argument is supported for backward compatibility only.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153023
Approved by: https://github.com/aditew01, https://github.com/Skylion007
2025-05-07 12:45:20 +00:00
48bfe9afc7 has_triton: Use the device interface for detecting Triton availability (#139171)
This PR replaces the `has_triton()` global method which was previously used for this task.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139171
Approved by: https://github.com/jansel, https://github.com/shink
2025-05-07 12:23:10 +00:00
56879f64a8 [Break XPU] Fix XPU UT failures introduced by community. (#152945)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152945
Approved by: https://github.com/Skylion007, https://github.com/EikanWang
2025-05-07 08:01:31 +00:00
5c878d4b04 [c10d][fr] Decouple the core logic of FR with the entry and event type (#152585)
We want to make FR generic enough, so the first step is to turn FR into a template struct so that most of the common code logic can be reused. The reason is that CudaEvent does not inherit from c10::Event, and we just want to swap out the event part: for NCCL we use CudaEvent, and for the rest of the backends we use c10::Event.

Differential Revision: [D74262695](https://our.internmc.facebook.com/intern/diff/D74262695)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152585
Approved by: https://github.com/kwen2501, https://github.com/d4l3k
2025-05-07 06:21:33 +00:00
93a0a7a0bf Fix bug visualizing 1D Tensor using rich (#152871)
Fixes https://github.com/pytorch/pytorch/issues/152848

I didn't fix the bug earlier because the example script didn't exhaustively present all combinations of 1D/2D tensor, 1D/2D mesh, and all possible sharding specs. Therefore, in this PR, I enriched the example script to cover all possible combinations.

<img width="1008" alt="f" src="https://github.com/user-attachments/assets/1745a804-a004-4f98-8332-d7498453f397" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152871
Approved by: https://github.com/wanchaol
2025-05-07 06:04:22 +00:00
bb9fbb294a [Testing] Add logic for running MPS tests (#153012)
Prep change for getting rid of `_mac-test-mps.yml`
A complete no-op for now, but it will be used by the PR above in the stack; the two should be landed a few days apart to avoid forcing lots of people to rebase their PRs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153012
Approved by: https://github.com/wdvr
2025-05-07 04:27:31 +00:00
ae1e51b6ad Add infra to run CPython tests under Dynamo (#150787)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150787
Approved by: https://github.com/zou3519
2025-05-07 04:03:14 +00:00
13fbf21a76 [nativert] Port string join and split to c10/util (#152873)
Summary:
Torch Native Runtime RFC: https://github.com/pytorch/rfcs/pull/72
Port string utils functions join and split to c10/util

Test Plan:
Added tests in `string_util_test.cpp`
buck2 run mode/opt caffe2/c10/test:util_base_tests

Differential Revision: D74202473

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152873
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2025-05-07 03:58:11 +00:00
5796212d48 [Dynamo] Replace unimplemented with unimplemented_v2 in torch/_dynamo/variables/misc.py [1/2] (#152274)
Part of #147913

Replace `unimplemented` with`unimplemented_v2` in `torch/_dynamo/variables/misc.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152274
Approved by: https://github.com/williamwen42

Co-authored-by: William Wen <william.wen42@gmail.com>
2025-05-07 03:37:24 +00:00
cyy
ab997d9ff5 Pass UNINSTALL_DILL to docker build (#152792)
`UNINSTALL_DILL` was not really passed to docker before.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152792
Approved by: https://github.com/wdvr
2025-05-07 03:17:45 +00:00
dfcfad2112 [c10d] Fix unused group input argument in new_subgroups() (#152765)
Summary: This diff fixes an unused input argument [`group`](8faa225695/torch/distributed/distributed_c10d.py (L5341)) in the `new_subgroups()` function.
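
A hedged usage sketch (assumes the default process group is already initialized, e.g. via torchrun); after this fix the `group` argument participates in the split instead of being silently ignored:

```
import torch.distributed as dist

# Build a group covering all ranks, then split it into subgroups of 2,
# passing it explicitly via the now-honored `group` argument.
world = dist.new_group(ranks=list(range(dist.get_world_size())))
cur_subgroup, subgroups = dist.new_subgroups(group_size=2, group=world)
```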

Test Plan: contbuild & OSS CI, see

Differential Revision: D74132537

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152765
Approved by: https://github.com/wz337
2025-05-07 02:37:51 +00:00
ecd74c953f [dynamo] Recursively realize the stack_values (#152853)
Might also fix - https://github.com/pytorch/pytorch/issues/135696

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152853
Approved by: https://github.com/Lucaskabela, https://github.com/mlazos, https://github.com/jansel
2025-05-07 02:36:44 +00:00
1965a2ca1e [dynamo][ez] Remove unused guard OBJECT_MUTATION. (#152855)
Summary: seems not used anywhere https://www.internalfb.com/code/search?q=case%3Ayes%20filepath%3Acaffe2%20OBJECT_MUTATION

Test Plan: CI

Differential Revision: D74196559

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152855
Approved by: https://github.com/jansel, https://github.com/jbschlosser
2025-05-07 02:32:32 +00:00
81b6920c68 [aoti] skip input symbol codegen for sympy expr w/ many symbols (#152579)
The issue was that
- symbol-ids appeared out-of-order w.r.t. the order of the forward inputs
```
def forward(arg0 # [(s3 - 1) + s4, 32], arg1 #[(s3 - 1)] ..)
```
- this causes codegen to fail because it expects all the base symbols `s4,s3` to have been codegen-ed already.
- we can skip codegen-ing sympy exprs with many symbols, e.g. `(s3 - 1) + s4`, because `s3` and `s4` will be codegen-ed by other inputs.

```
# for example
s3 = arg1.size(0) + 1
s4 = argN.size(0)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152579
Approved by: https://github.com/jingsh, https://github.com/desertfire
2025-05-07 01:18:09 +00:00
60ecc560af [export] Add draft-export docs (#152637)
Sample page: https://docs-preview.pytorch.org/pytorch/pytorch/152637/draft_export.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152637
Approved by: https://github.com/zou3519, https://github.com/svekars
2025-05-07 01:12:45 +00:00
a28dcdba2c Revert "[aot][ca] save bw_module in AOTAutogradCache (#151860)"
This reverts commit 613bd462721f3246888030de0a3f6932d52f515a.

Reverted https://github.com/pytorch/pytorch/pull/151860 on behalf of https://github.com/huydhn due to Chatting with @xmfan and decide to revert and reland this instead ([comment](https://github.com/pytorch/pytorch/pull/151860#issuecomment-2856709646))
2025-05-07 00:56:54 +00:00
f6db749e60 Revert "[ca] mark scalar int sizes as dynamic via tensor wrapping (#151731)"
This reverts commit 18229a5300a61b2d76ca95bee8ae8d4f4d5fa938.

Reverted https://github.com/pytorch/pytorch/pull/151731 on behalf of https://github.com/huydhn due to Chatting with @xmfan and decide to revert and reland this instead ([comment](https://github.com/pytorch/pytorch/pull/151860#issuecomment-2856709646))
2025-05-07 00:56:54 +00:00
8f208dc75a Revert "[ca] hide unused scalar int sizes from dynamo (#151962)"
This reverts commit 4555ed8c83b47c450e31f1192e1f0fc4147d435f.

Reverted https://github.com/pytorch/pytorch/pull/151962 on behalf of https://github.com/huydhn due to Chatting with @xmfan and decide to revert and reland this instead ([comment](https://github.com/pytorch/pytorch/pull/151860#issuecomment-2856709646))
2025-05-07 00:56:53 +00:00
64bbf58fb4 Revert "[dynamo][ca] support dynamic annotations on tensors in ListVariables/TupleVariables (#152119)"
This reverts commit 7aebb127bf309658770be93b264d4009c20a7f40.

Reverted https://github.com/pytorch/pytorch/pull/152119 on behalf of https://github.com/huydhn due to Chatting with @xmfan and decide to revert and reland this instead ([comment](https://github.com/pytorch/pytorch/pull/151860#issuecomment-2856709646))
2025-05-07 00:56:53 +00:00
56492bfcb9 [MPS] SDPA specialized kernels (#152781)
Partially fixes #139668 and #152550

Still work in progress. Following needs to be addressed:
- [x] Some tests are failing and need to check why and bugfix
- [x] Benchmark the new kernels and add results to this PR for varying sequence lengths and head dimensions (the ones that get dispatched to the kernels)
- [x] Add tests to cover the specialized paths(if applicable)
- [x] Code cleanup

**Tested on Macbook M1 Pro**
### Vector Fast Path (q_len=1, k_len=256)
- Old: 0.378 ms
- New: 0.260 ms
- **31.2% speed improvement**

### Vector 2-pass (q_len=1, k_len=4096)
- Old: 0.627 ms
- New: 0.370 ms
- **41.0% speed improvement**

### Vector Fast Path (q_len=8, k_len=256)
- Old: 0.545 ms
- New: 0.322 ms
- **40.9% speed improvement**

### Vector 2-pass (q_len=8, k_len=4096)
- Old: 1.318 ms
- New: 1.057 ms
- **19.8% speed improvement**

Script to get perf:
```
import torch
import time

def benchmark_sdpa(config, iterations=100):
    device = config.get("device", "cpu")
    batch = config["batch"]
    heads = config["heads"]
    q_len = config["q_len"]
    k_len = config["k_len"]
    head_dim = config["head_dim"]

    q = torch.randn(batch, heads, q_len, head_dim, device=device, dtype=torch.float32)
    k = torch.randn(batch, heads, k_len, head_dim, device=device, dtype=torch.float32)
    v = torch.randn(batch, heads, k_len, head_dim, device=device, dtype=torch.float32)

    # warm-up iterations before timing
    for _ in range(5):
        _ = torch.nn.functional.scaled_dot_product_attention(q, k, v)
        if device == "mps":
            torch.mps.synchronize()

    total_time = 0.0
    for i in range(iterations):
        start = time.perf_counter()
        _ = torch.nn.functional.scaled_dot_product_attention(q, k, v)
        if device == "mps":
            torch.mps.synchronize()
        end = time.perf_counter()
        total_time += end - start

    avg_time = total_time / iterations
    print(f"[{config['name']}] Avg time per run: {avg_time * 1000:.3f} ms over {iterations} iterations")
    return avg_time

def main():
    device = "mps" if torch.backends.mps.is_available() else "cpu"
    print(f"Running benchmarks on device: {device}")

    benchmarks = [
        {
            "name": "Vector Fast - Small q_len & moderate k_len",
            "batch": 1,
            "heads": 8,
            "q_len": 1,      # small query sequence length triggers vector fast path
            "k_len": 256,    # moderate key length
            "head_dim": 64,
            "device": device,
        },
        {
            "name": "Vector 2-pass - Small q_len & long k_len",
            "batch": 1,
            "heads": 8,
            "q_len": 1,      # small query sequence length
            "k_len": 4096,   # long key length triggers the 2-pass variant
            "head_dim": 64,
            "device": device,
        },
        # {
        #     "name": "Full Attention - Moderate q_len/k_len",
        #     "batch": 1,
        #     "heads": 8,
        #     "q_len": 128,    # longer query sequence length
        #     "k_len": 8192,    # matching key length for full attention paths
        #     "head_dim": 64,
        #     "device": device,
        # },
        # {
        #     "name": "Full Attention - Longer q_len/k_len",
        #     "batch": 1,
        #     "heads": 8,
        #     "q_len": 128,    # very long sequence length
        #     "k_len": 8192,
        #     "head_dim": 64,
        #     "device": device,
        # },
    ]

    iterations = 100
    for config in benchmarks:
        benchmark_sdpa(config, iterations=iterations)

if __name__ == "__main__":
    main()

```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152781
Approved by: https://github.com/malfet
2025-05-07 00:40:11 +00:00
2b2b790908 [Dynamo] Guard serialization for CONSTANT_MATCH (#152724)
This PR adds testing only; no non-test changes were needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152724
Approved by: https://github.com/jansel
ghstack dependencies: #152704
2025-05-07 00:36:39 +00:00
d2935a9f85 [CI] Upgrade sccache to 0.10.0 (#152957)
The newest release handles CUDA better, and I think this fixes the cases I saw where some CUDA-related builds weren't being cached correctly
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152957
Approved by: https://github.com/malfet
2025-05-07 00:33:43 +00:00
6d1e8994d3 [Dynamo] Guard serialization for EQUALS_MATCH (#152704)
This PR:
* Makes no changes to non-test code to support serialization for EQUALS_MATCH
* Adds test logic involving a custom-defined constant type to trigger the guard installation here:

72337bdcf2/torch/_dynamo/variables/user_defined.py (L792)

Q: Is there a better way to trigger installation of this guard or is this sufficient?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152704
Approved by: https://github.com/jansel
2025-05-07 00:28:31 +00:00
9919d6b872 [Testing] Add copysign from scalar regression test (#152997)
But instead of adding it just for MPS backend, add it to OpInfo

Fixes https://github.com/pytorch/pytorch/issues/152582
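
For reference, a tiny illustration of the op the new OpInfo-based test exercises (a Python scalar as the second argument):

```
import torch

x = torch.tensor([1.0, -2.0, 3.0])
# copysign with a scalar: every element takes the sign of -1.0
print(torch.copysign(x, -1.0))  # tensor([-1., -2., -3.])
```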
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152997
Approved by: https://github.com/wdvr
2025-05-07 00:19:42 +00:00
327d1b6ef0 Move additional MPS Unary ops to Iterator (#152876)
Noticed some of these ops were contributing to a big chunk of the runtime for OpenLLama as well as a few other benchmarks

At the op level, moving to a TensorIterator-based Metal kernel gives a 20x speedup. Will migrate the inverse trigonometric functions & log ops in a follow-up PR, as this one is already a bit large
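
A rough timing sketch of the kind used to measure this; `exp` is only an example unary op here, since the exact migrated ops aren't listed in this message:

```
import time
import torch

x = torch.randn(1_000_000, device="mps")
torch.mps.synchronize()
start = time.perf_counter()
for _ in range(100):
    torch.exp(x)
torch.mps.synchronize()
print(f"{(time.perf_counter() - start) / 100 * 1000:.3f} ms per call")
```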
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152876
Approved by: https://github.com/malfet
2025-05-07 00:06:54 +00:00
61aa77e216 [cutlass backend][BE][clean-up] refactor to remove use of autotune_fallback_to_aten=True in cutlass backend tests (#152850)
Differential Revision: [D74192001](https://our.internmc.facebook.com/intern/diff/D74192001/)

Motivation: clean up post https://github.com/pytorch/pytorch/issues/147479. I plan to leave the rest of the clean-up as a first-time issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152850
Approved by: https://github.com/chenyang78
2025-05-06 23:48:57 +00:00
5fa5017479 [ONNX] Suggest users setting dynamo=True when exporting (#152478)
Fixes #152025

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152478
Approved by: https://github.com/justinchuby
2025-05-06 23:18:11 +00:00
80d2116405 [BE] Update numba versions (#152557)
Let's see if PyTorch is compatible with the latest numba versions.
`test_unary_funcs` are no longer failing thanks to https://github.com/pytorch/pytorch/pull/148024
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152557
Approved by: https://github.com/Skylion007
2025-05-06 23:15:21 +00:00
911b838aae [Memory Viz] Add Compile Context to Visualizer (#152862)
Summary: Adds PT2 compile context info to the visualizer. Also makes sure we handle the case where the compile context is not in the pickle file.

Test Plan: {F1977637362}

Differential Revision: D74202811

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152862
Approved by: https://github.com/aaronenyeshi
2025-05-06 23:09:59 +00:00
6c025b5a82 [dynamo] Support delattr on result of torch.compile(module) (#152741)
This is essentially a follow-up on #122098, where we added support of
`getattr` and `setattr` on result of `torch.compile(module)`, but didn't
add support for `delattr`.

Fixes #150711.
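
A minimal usage sketch, assuming attribute access forwards to the wrapped module as described:

```
import torch

class M(torch.nn.Module):
    def forward(self, x):
        return x + 1

m = M()
opt = torch.compile(m)
opt.foo = 1                   # setattr forwards to the wrapped module (#122098)
assert m.foo == 1
del opt.foo                   # delattr now forwards as well (this PR)
assert not hasattr(m, "foo")
```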

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152741
Approved by: https://github.com/anijain2305
ghstack dependencies: #152740
2025-05-06 22:30:37 +00:00
0886d402f1 [dynamo] Avoid running torch.nn.Module.__call__ twice under torch.compile(mod) (#152740)
When we do `torch.compile(mod)`, we eventually end up returning a new
module instance, whose `forward` method is the result of
`torch.compile(mod.__call__)`, meaning it already captures all the extra
logic (e.g., hook firing) from the default `torch.nn.Module.__call__`.
As a result we can't reuse the inherited default `__call__` as is,
because we'd end up running the logic twice.

This patch makes the returned `OptimizedModule` override the default
`__call__`, and directly calls into its compiled `forward` method.
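
An illustrative check of the intended behavior (the hook fires once per call), assuming an eager backend:

```
import torch

calls = []

class M(torch.nn.Module):
    def forward(self, x):
        return x + 1

m = M()
m.register_forward_hook(lambda mod, inp, out: calls.append("hook"))

opt = torch.compile(m, backend="eager")
opt(torch.randn(2))
# With this patch the hook fires once per call, not twice.
print(len(calls))
```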

Fixes #149502

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152740
Approved by: https://github.com/anijain2305
2025-05-06 22:30:37 +00:00
1c30862d8f Partially revert https://github.com/pytorch/pytorch/pull/152288 (#152909)
Summary: The reverted change results in build failures for some internal targets that are stuck on an older compiler. The platform update is tracked in [T223408150](https://www.internalfb.com/tasks?t=223408150)

Test Plan: CI

Differential Revision: D74220384

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152909
Approved by: https://github.com/cyyever, https://github.com/wdvr
2025-05-06 22:02:42 +00:00
5fe58ab5bd Devcontainer: Optimize apt-get commands to reduce Docker image size (#152882)
## Summary
- Added --no-install-recommends flag to all apt-get install commands to reduce unnecessary dependencies
- Added apt-get clean after package installations to remove package cache and reduce image size
- Combined multiple apt commands into single instructions to reduce Docker image layers

## Test plan
Test by building the devcontainer and verifying functionality while ensuring reduced image size
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152882
Approved by: https://github.com/cyyever, https://github.com/atalman, https://github.com/Skylion007
2025-05-06 20:33:02 +00:00
ed63cb20ec [ROCm] Fix SymmetricMemory build error on NAVI arch (#152838)
The NAVI arch doesn't support `__builtin_amdgcn_s_memtime()`, so use `clock64()` instead, which works for both NAVI and MI archs.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152838
Approved by: https://github.com/jeffdaily
2025-05-06 19:37:58 +00:00
8faa0b18c3 [ROCm] opportunistic fastatomics - fix build error with newer compilers (#152841)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152841
Approved by: https://github.com/jeffdaily
2025-05-06 19:37:48 +00:00
1f4f4a61c2 Devcontainer: Replace conda with apt-based setup (#152881)
## Summary
- Replaced miniconda base image with base Ubuntu 22.04 image
- Installed Python and required dependencies using apt
- Replaced conda-based CUDA installation with apt-based version
- Updated paths in install-dev-tools.sh to reflect the new non-conda environment
- Removed conda-specific files and added requirements.txt for Python dependencies

## Test plan
Test by building and running the devcontainer in VS Code with both CPU and CUDA configurations
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152881
Approved by: https://github.com/atalman
2025-05-06 19:23:58 +00:00
200df50c05 Devcontainer: Fix context path and workspace mount (#152880)
## Summary
- Changed the devcontainer context path from '../..' to './' for both CPU and CUDA configurations
- Added workspace mount configuration to properly mount the repository in the container
- Added containerEnv to disable implicit --user pip flag

## Test plan
Test by building and running the devcontainer in VS Code
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152880
Approved by: https://github.com/atalman
2025-05-06 19:22:29 +00:00
08f5371571 [float16]: Fix the accumulation type for dot and gemv (#152676)
Fixes #147860

Also, partially address: https://github.com/pytorch/pytorch/issues/125438

Use float32 for accumulation with float16 and bfloat16 types.
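
An illustrative check of the accumulation behavior (not the PR's actual test):

```
import torch

a = torch.randn(4096, dtype=torch.float16)
b = torch.randn(4096, dtype=torch.float16)
ref = torch.dot(a.float(), b.float())   # explicit float32 accumulation
out = torch.dot(a, b).float()           # fp16 dot, now fp32-accumulated internally
print((out - ref).abs().item())         # should stay small
```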

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152676
Approved by: https://github.com/malfet
2025-05-06 18:10:08 +00:00
7a0781eaad Improve cache key graph printing performance (#151928)
Teach the graph printer how to allow overriding printing SymTypes (`SymInt`, `SymFloat`, `SymBool`) and then use that to reuse the fast SymNode printing from `torch._inductor.utils.sympy_str()` to make computing the cache key faster.

On my computer the repro from #151823 goes from 480s -> 80s (still terrible... but better).

Fixes #151823

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151928
Approved by: https://github.com/laithsakka
2025-05-06 17:39:53 +00:00
7dd9d514d2 [Graph Partition] remove PRECOMPUTED_SIZE from partition symbol inputs (#152864)
PRECOMPUTED_SIZE is computed during runtime and should not be included in graph_partition_inputs. See the following example for a PRECOMPUTED_SIZE `ps0`.

![image](https://github.com/user-attachments/assets/5aa949a9-b8e0-4b77-8702-95b96b58694e)

full output code: [P1803820480](https://www.internalfb.com/phabricator/paste/view/P1803820480)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152864
Approved by: https://github.com/eellison
2025-05-06 17:35:29 +00:00
5d36485b4a Log aot and idx waitcounters. (#152444)
Summary:
Added for create_aot_dispatcher_function and compile_fx_inner.

Note:
Log wait counters flag is already set for:
1. async_compile.precompile
2. remote_fx_graph_cache_get
3. remote_fx_graph_cache_put

Test Plan: contbuild

Differential Revision: D73866124

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152444
Approved by: https://github.com/ppanchalia, https://github.com/masnesral
2025-05-06 16:21:58 +00:00
07a29dbe81 [BE]: Update cutlass submodule to 3.9.2 (#152779)
A lot of last-minute bugfixes for CUTLASS Blackwell that we should upstream. It's a header-only library and a minor release, so this should strictly improve compiler support and fix some bugs. Needed to update some instruction numbers in torch.compile baselines for the new kernels.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152779
Approved by: https://github.com/henrylhtsang
2025-05-06 16:08:24 +00:00
f56bcd2408 [precompile] [easy] Refactor FxGraphCache to add cache_hit_post_compile function (#152839)
This PR refactors CompiledFxGraph by adding a new post_compile step that only runs on cache hit. This refactors a bunch of code in _lookup_graph to its own function so that we can use it in BundledAOTAutogradCacheEntry. No difference in behavior here.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152839
Approved by: https://github.com/oulgen
ghstack dependencies: #152836
2025-05-06 15:33:24 +00:00
a8f727c439 [c10d] Fix extra CUDA context created by barrier (#149144)
Fixes #149119.

In ProcessGroup.hpp, we create a dummy tensor for dispatching. This
requires a correct device index. This PR uses `device_id` given by user
when calling `init_process_group`.

This PR also uses `torch._C._get_accelerator()` to determine the device
type.
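
A hedged usage sketch (assumes launch via torchrun, which provides the rendezvous env vars):

```
import os

import torch
import torch.distributed as dist

# Passing device_id lets the dummy dispatch tensor land on the right device,
# avoiding the extra CUDA context described in #149119.
local_rank = int(os.environ["LOCAL_RANK"])
dist.init_process_group(
    backend="nccl",
    device_id=torch.device(f"cuda:{local_rank}"),
)
dist.barrier()
dist.destroy_process_group()
```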

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149144
Approved by: https://github.com/XilunWu, https://github.com/fduwjj, https://github.com/cyyever
2025-05-06 15:27:30 +00:00
12a8b70247 [precompile] Refactor AOTAutogradCacheEntry to be generic (#152836)
The purpose of this stack is to create a new BundledAOTAutogradCacheEntry, which is an AOTAutogradCacheEntry that is self contained, i.e. it contains all of the CompiledFxGraph directly in the entry, instead of relying on FxGraphCache._lookup_graph.

Because this would balloon the size of the actual cache entry, our goal is not to use BundledAOTAutogradCacheEntry in cache scenarios, only for precompile use cases. Thus, it's important we make this whole setup generic, to be able to support these two workflows clearly.

This PR genericizes AOTAutogradCacheEntry considerably, so that it can take in different types of Forwards and Backwards.

Each GenericAOTAutogradCacheEntry is composed of two parts, a TForward and a TBackward. The forward and backward can be loaded in multiple ways, either via FxGraphCache._lookup_graph, or by saving the entire CompiledFxGraph.

For simplicity, this PR only implements the generic code refactors needed, but does not fully implement BundledAOTAutogradCacheEntry, which is an AOTAutogradCacheEntry that takes a full CompiledForward. We'll handle and implement BundledAOTAutogradCacheEntry in the PR above this, for easier review.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152836
Approved by: https://github.com/oulgen
2025-05-06 15:19:17 +00:00
fcd5e49138 Revert "[dynamo] Recursively realize the stack_values (#152853)"
This reverts commit 460888f908ea4b634ecc863a6da6b2132108bc79.

Reverted https://github.com/pytorch/pytorch/pull/152853 on behalf of https://github.com/malfet due to Looks like it broke inductor tests ([comment](https://github.com/pytorch/pytorch/pull/152853#issuecomment-2854897485))
2025-05-06 15:02:57 +00:00
f47bf38e30 [float16]: Fast path for torch.dot with float16/bfloat16 (#152799)
Fixes #152798

Add the fast path for dot with contiguous tensors for float16/bfloat16 types.
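
A rough micro-benchmark of the contiguous fp16 path (numbers will vary; the issue linked above has the PR's own benchmark):

```
import timeit

import torch

a = torch.randn(1_000_000, dtype=torch.float16)
b = torch.randn(1_000_000, dtype=torch.float16)
t = timeit.timeit(lambda: torch.dot(a, b), number=100) / 100
print(f"torch.dot fp16 (contiguous): {t * 1e6:.1f} us")
```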

Performance with patch (see issue for benchmark and current performance):

![Improved dot performance](https://github.com/user-attachments/assets/57f64e90-8191-4710-adb0-f430644827de)

**We see up to 10x+ improvement in performance.**

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152799
Approved by: https://github.com/malfet
2025-05-06 14:59:27 +00:00
b06cbd49f1 [Dynamo] Guard serialization for TENSOR_SUBCLASS_METADATA_MATCH (#152626)
This PR updates `GuardsStatePickler.reducer_override()` in `torch/_dynamo/guards.py` to handle reconstruction of traceable wrapper subclasses. It's intended to work recursively and handle any level of subclass instance nesting (e.g. subclass instances that contain subclass instances, etc.)

This PR tests the guard on several traceable wrapper tensor subclasses:
* `LocalSubclass`: used to ensure the correct error message is thrown when the subclass is not defined globally
* `torch.testing._internal.two_tensor.TwoTensor`: defines None for its extra metadata
* `SubclassWithMeta`: stores non-trivial extra metadata
* `SubclassWithCustomMetadataGuard`: stores non-trivial extra metadata and defines a custom `__metadata_guard__` classmethod
* `SubclassWithSubclassInnerTensors`: used to test recursiveness; this subclass contains subclass inner tensor components

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152626
Approved by: https://github.com/jansel
2025-05-06 14:06:36 +00:00
199d5a408a [partitioner] Fix argument to _broadcast_on_rank0 (#152846)
Summary:
There was a bug when I refactored my original implementation.

This should fix it

Test Plan: Run on some internal workloads

Differential Revision: D74190485

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152846
Approved by: https://github.com/danthe3rd
2025-05-06 13:45:59 +00:00
bc11afd41f [Inductor] FX backend via Wrapper IR (#146942)
# Sub-PRs

These PRs contain refactors from the main one. They should be reviewed and merged first.

- https://github.com/pytorch/pytorch/pull/150458
- https://github.com/pytorch/pytorch/pull/152391
- https://github.com/pytorch/pytorch/pull/152587

# Feature

The goals of this PR are twofold.

## Goal 1: Introduce Wrapper IR as an intermediate step in wrapper codegen.

In addition to Triton/C++/Halide kernels, Inductor also generates "wrapper" code which allocates memory and calls the kernels. Originally, this wrapper code was fairly standard Python which resembled a user-written PyTorch program. Over time, various wrapper code generators have been added to accommodate things like AOTInductor, which prefers C++ code for static compilation. This complexity has bled into other parts of the codebase, as we now need if/else statements to choose between Python and C++ macros. (See an example [here](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/ir.py#L5515-L5522).) Since most of these code generation steps are conceptually identical across target languages, it seems reasonable to refactor them into some kind of intermediate representation which can be shared between the various backends. This might also make it easier to develop out-of-tree backends which cannot put their own macros in core Inductor components.

This PR takes some initial steps to formalize Inductor's wrapper codegen by generalizing the existing Memory Planning IR into a fully fledged Wrapper IR. This is pretty much identical to the existing Memory Planning IR, but it supports a richer set of ops for things like kernel definitions and calls. This refactor could help encapsulate wrapper codegen. Ideally, we don't need to worry about direct Python/C++ codegen in the main compiler files such as `ir.py`, and can instead defer these to classes like `PythonWrapperCodegen` and `CppWrapperCpu`, which operate on the Wrapper IR.

## Goal 2: Convert Wrapper IR into FX IR.

One of the main benefits of Wrapper IR is to enable more diverse Inductor backends. This PR introduces a converter from Wrapper IR into [FX IR](https://pytorch.org/docs/stable/fx.html), which is the intermediate representation most commonly used in PyTorch graph compilers. The purpose of this is to enable out-of-tree backends to consume Inductor's output in FX IR, which would hopefully make Inductor easier to leverage in novel compilers, hardware accelerators, etc.

It's not trivial to generate Python or C++ code which Inductor can compile and run, and doing so may require changes to other core Inductor files, for the reasons outlined in the previous section. The goal of supporting FX output is to enable something like `torch.compile`'s [custom backend](https://pytorch.org/docs/stable/torch.compiler_custom_backends.html) system, in which an out-of-tree backend can receive an optimized FX graph from Inductor, and compile and run it however it likes.

The typical users of this feature would likely not be part of PyTorch, and may or may not support running a kernel in eager mode. However, they can understand what `torch.empty_strided` means, compile and run Triton kernels, etc. So we just need to present them with an FX graph saying what code Inductor wants to run, which should be easier to analyze and transform in a third party system than Python or C++ source.

Since FX IR is fairly stable, this mechanism should hopefully isolate third-party backends, hardware accelerators, etc. from the implementation details of Inductor, and vice versa.

# Current status

Things that seem to work:

- Converted a lot of the most common Python codegen lines to Wrapper IR lines.
     - Handled the following cases, in addition to what was already in the Memory Planning IR:
         - Comments
         - Triton kernels
         - Extern/fallback kernels
         - Freeing tensors (`del buf0`)
         - MultiOutput
         - Graph outputs
         - ReinterpretView / StorageBox, for both call args and outputs.
     - FX conversion asserts that the program only contains Wrapper IR lines, and not strings of Python/C++ code.
- Prototype FX converter which can handle some of the most common use cases.
   - Defining Triton kernels, and putting them in a side table using TorchDynamo's existing [utilities](https://dev-discuss.pytorch.org/t/higher-order-operators-2023-10/1565).
   - Calling wrapped Triton kernels.
   - Calling extern kernels and certain types of fallback kernels.
       - Support both `extern_kernels.*` and `aten.*`.
       - Support multi-output kernels like `torch.topk`.
   - Graphs with multiple inputs/outputs.
   - Training i.e. calling `Tensor.backward()` in a compiled function.
   - Graph breaks (training).
- Run the `torch.fx.GraphModule` on GPU using the standard `__call__` method. This makes it easy to test the correctness of FX codegen.

Things that don't work:
- Both Wrapper IR and Wrapper -> FX coverage are currently best effort. There are still features which aren't captured as Wrapper IR lines, and fall back to plain strings. This representation is functionally correct but probably not rich enough to achieve the goals outlined in the previous sections.
         - Fallback kernels seem like the most difficult thing to fully cover, since they each define their own Python/C++ macros that would need to be converted to FX.
         - Size/alignment asserts are currently disabled via the config file. It's possible to generate FX IR for these, but it seems reasonable to defer these sanity checks to a later PR.
         - CommBuffer's and distributed communication are not yet supported. An earlier version of this PR attempted to implement this by calling `empty_strided_p2p`. However, building and testing distributed support seems non-trivial, so it's probably better to defer this.

# Out-of-tree compilers

With this PR, out of tree backends will be able to do further compilation on the FX graphs by subclassing `WrapperFxCodegen` and overriding the `compile_graph` function. This follows the same API as torch.compile's [custom backends](https://pytorch.org/docs/stable/torch.compiler_custom_backends.html), where the user simply returns a callable running the graph. The callable need not be a method of `GraphModule` or any other PyTorch class. See an example below.

```
from torch._inductor.codegen.wrapper_fxir import WrapperFxCodegen

class MyCustomBackend(WrapperFxCodegen):
     def compile_graph(self, gm):
         # Add 1 to the graph's outputs
         def compiled_fn(*args):
             return [x + 1 for x in gm.graph.forward(*args)]
         return compiled_fn
```

# Example FX graphs

This section contains some example FX graphs generated by Inductor. The correctness of these graphs was verified against eager mode by calling the corresponding `GraphModule`.

Here's an FX graph calling a basic Triton kernel. Notice how outputs are allocated with `torch.empty_strided`, and the Triton kernel is called by reference to Dynamo's triton side table.
```
graph():
    %arg0_1 : [num_users=1] = placeholder[target=arg0_1]
    %arg1_1 : [num_users=1] = placeholder[target=arg1_1]
    %buf0 : [num_users=2] = call_function[target=torch.empty_strided](args = ((8,), (1,)), kwargs = {dtype: torch.float32, device: cuda:0})
    %triton_kernel_wrapper_mutation : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 0, constant_args_idx: 0, grid: [(8,)], tma_descriptor_metadata: {}, kwargs: {in_ptr0: %arg1_1, in_ptr1: %arg0_1, out_ptr0: %buf0, xnumel: 8, XBLOCK: 8}})
    return (buf0,)
```

Here's a more complicated graph that calls a `torch.addmm` extern kernel.

```
graph():
    %arg0_1 : [num_users=1] = placeholder[target=arg0_1]
    %arg1_1 : [num_users=2] = placeholder[target=arg1_1]
    %buf0 : [num_users=3] = call_function[target=torch.empty_strided](args = ((), ()), kwargs = {dtype: torch.float32, device: cuda:0})
    %triton_kernel_wrapper_mutation : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 0, constant_args_idx: 0, grid: [(1,)], tma_descriptor_metadata: {}, kwargs: {in_ptr0: %arg1_1, out_ptr0: %buf0, xnumel: 1, r0_numel: 129, XBLOCK: 1}})
    %buf2 : [num_users=2] = call_function[target=torch.empty_strided](args = ((129, 1), (1, 1)), kwargs = {dtype: torch.float32, device: cuda:0})
    %addmm : [num_users=0] = call_function[target=torch.addmm](args = (%buf0, %arg0_1, %arg1_1), kwargs = {alpha: 1, beta: 1, out: %buf2})
    %delete : [num_users=0] = call_function[target=torch._inductor.codegen.wrapper_fxir.delete](args = (%buf0,), kwargs = {})
    return (buf2,)
```

Here's a graph which indexes into a tuple using `operator.getitem`. This is necessary to use the output of the `torch.topk` operation.

```
graph():
    %arg0_1 : [num_users=1] = placeholder[target=arg0_1]
    %buf0 : [num_users=3] = call_function[target=torch.ops.aten.topk.default](args = (%arg0_1, 2), kwargs = {})
    %buf1 : [num_users=2] = call_function[target=operator.getitem](args = (%buf0, 0), kwargs = {})
    %buf2 : [num_users=2] = call_function[target=operator.getitem](args = (%buf0, 1), kwargs = {})
    %delete : [num_users=0] = call_function[target=torch._inductor.codegen.wrapper_fxir.delete](args = (%buf0,), kwargs = {})
    %triton_kernel_wrapper_mutation : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 0, constant_args_idx: 0, grid: [(2,)], tma_descriptor_metadata: {}, kwargs: {in_out_ptr0: %buf1, xnumel: 2, XBLOCK: 2}})
    %triton_kernel_wrapper_mutation_1 : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 1, constant_args_idx: 1, grid: [(2,)], tma_descriptor_metadata: {}, kwargs: {in_out_ptr0: %buf2, xnumel: 2, XBLOCK: 2}})
    return (buf1, buf2)
```

Here's a graph that reinterprets an output tensor using `torch.as_strided`. This is one way to handle Inductor's `ReinterpretView` op.

```
graph():
    %arg0_1 : [num_users=1] = placeholder[target=arg0_1]
    %arg1_1 : [num_users=1] = placeholder[target=arg1_1]
    %buf0 : [num_users=2] = call_function[target=torch.empty_strided](args = ((2, 4), (4, 1)), kwargs = {dtype: torch.float32, device: cuda:0})
    %triton_kernel_wrapper_mutation : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 0, constant_args_idx: 0, grid: [(8,)], tma_descriptor_metadata: {}, kwargs: {in_ptr0: %arg0_1, in_ptr1: %arg1_1, out_ptr0: %buf0, xnumel: 8, XBLOCK: 8}})
    %buf0_view_buf0_0 : [num_users=1] = call_function[target=torch.as_strided](args = (%buf0, (8,), (1,), 0), kwargs = {})
    return (buf0_view_buf0_0,)
```

Here's a graph with dynamic shapes. This one is a little bit funky. Inductor provides a graph input for each shape symbol, which we map to a placeholder, in this example `s6`. Then, shape expressions in the generated code can refer to the symbol `s6`. The size hint for `s6` is stored in `node.meta["val"]` where `node` is the placeholder defining it. This works out in the generated python code because the placeholder defines a Python variable with the name `s6`.
```
graph():
    %s6 : [num_users=0] = placeholder[target=s6]
    %arg1_1 : [num_users=1] = placeholder[target=arg1_1]
    %arg2_1 : [num_users=1] = placeholder[target=arg2_1]
    %buf0 : [num_users=2] = call_function[target=torch.empty_strided](args = ((s6,), (1,)), kwargs = {dtype: torch.float32, device: cuda:0})
    %triton_kernel_wrapper_mutation : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 0, constant_args_idx: 0, grid: [[-(((-s6)//8)), 1, 1]], tma_descriptor_metadata: {}, kwargs: {in_ptr0: %arg2_1, in_ptr1: %arg1_1, out_ptr0: %buf0, xnumel: s6, XBLOCK: 8}})
    return buf0
```

Here's another graph, this time with dynamic shapes and strides. The grid expression is more complex since the numel is a product of dimensions.
```
graph():
    %s10 : [num_users=0] = placeholder[target=s10]
    %arg1_1 : [num_users=1] = placeholder[target=arg1_1]
    %arg2_1 : [num_users=1] = placeholder[target=arg2_1]
    %buf0 : [num_users=2] = call_function[target=torch.empty_strided](args = ([s10, s10], [s10, 1]), kwargs = {dtype: torch.float32, device: cuda:0})
    %triton_kernel_wrapper_mutation : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 0, constant_args_idx: 0, grid: [[-(((s10**2)//(-64))), 1, 1]], tma_descriptor_metadata: {}, kwargs: {in_ptr0: %arg2_1, in_ptr1: %arg1_1, out_ptr0: %buf0, xnumel: s10**2, XBLOCK: 64}})
    return buf0
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146942
Approved by: https://github.com/jansel
2025-05-06 10:06:39 +00:00
e32a16a9da Correct torch.xpu.is_bf16_supported return False if no XPU detected (#152317)
# Motivation
Fix https://github.com/pytorch/pytorch/issues/152301
When XPU is not available, calling `torch.xpu.is_bf16_supported()` still returns `True`, which is inconsistent with the expected behavior (should be False).

# Solution
Align with other backends by adding `including_emulation` to `torch.xpu.is_bf16_supported` (see the sketch below) and:
- return `False` if XPU is not available
- return `True` if `including_emulation` is True
- return `torch.xpu.get_device_properties().has_bfloat16_conversions` if `including_emulation` is False, i.e. whether the device can generate SPIR-V code for bf16.
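
A short sketch of the intended semantics; the `including_emulation` parameter name is taken from this PR's description:

```
import torch

if not torch.xpu.is_available():
    print(torch.xpu.is_bf16_supported())                           # False after this fix
else:
    print(torch.xpu.is_bf16_supported())                           # True: emulation counted
    print(torch.xpu.is_bf16_supported(including_emulation=False))  # hardware bf16 only
```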

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152317
Approved by: https://github.com/EikanWang
2025-05-06 10:03:17 +00:00
8904ba6387 Forward fix D74196435 (#152926)
Summary: Forward fix a misplaced declaration from D74196435

Test Plan: Random check with a failed build `buck2 build --config fbcode.enable_gpu_sections=true --flagfile fbcode//mode/opt fbcode//accelerators/workloads/models/emu_flash/tests:test_compile_eager`

Reviewed By: wdvr

Differential Revision: D74225582

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152926
Approved by: https://github.com/cyyever, https://github.com/wdvr
2025-05-06 07:33:38 +00:00
689e14ae00 [NFC] [inductor] [compile async] Warn exception if pickler failed (#152401)
An NFC change to help us find issues

See https://github.com/pytorch/pytorch/issues/151904

CC @aorenste

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152401
Approved by: https://github.com/Skylion007
2025-05-06 07:12:35 +00:00
1dd36ad2d4 Fix conditional git diff in _link_check.yml (#152919)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152919
Approved by: https://github.com/huydhn
2025-05-06 07:01:45 +00:00
0e2b948256 Revert "cleanup, refactor and add missing self._dde_suppressed checks (#152657)"
This reverts commit 784c666cae00f85ecf675298ddb056bebaf32f55.

Reverted https://github.com/pytorch/pytorch/pull/152657 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to cause a test to fail in trunk ([comment](https://github.com/pytorch/pytorch/pull/152657#issuecomment-2853442594))
2025-05-06 06:45:07 +00:00
451d652873 Revert "Make device check error message more descriptive (#150750)"
This reverts commit 8253970a1f90a5b0b1fe0d4febd949470f6fa265.

Reverted https://github.com/pytorch/pytorch/pull/150750 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to cause a test to fail in trunk ([comment](https://github.com/pytorch/pytorch/pull/150750#issuecomment-2853438985))
2025-05-06 06:42:08 +00:00
460888f908 [dynamo] Recursively realize the stack_values (#152853)
Might also fix - https://github.com/pytorch/pytorch/issues/135696

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152853
Approved by: https://github.com/Lucaskabela, https://github.com/mlazos, https://github.com/jansel
2025-05-06 06:30:31 +00:00
dd766e1dc5 [audio hash update] update the pinned audio hash (#152885)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152885
Approved by: https://github.com/pytorchbot
2025-05-06 05:29:25 +00:00
784c666cae cleanup, refactor and add missing self._dde_suppressed checks (#152657)
so two things other than cleanups and refactoring
1) do not use propagate_real_tensors to resolve eval under guard_or_true/guard_or_false .
2) do not guard for dimensions of type  DimDynamic.OBLIVIOUS_SIZE under guard_or_true/guard_or_false .

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152657
Approved by: https://github.com/pianpwk
2025-05-06 05:24:09 +00:00
e2eb845313 [ez] fix a bunch of typos in dynamo (#152886)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152886
Approved by: https://github.com/williamwen42
2025-05-06 05:13:56 +00:00
37c71820f3 Fix nn.LazyModuleMixin examples (#150596)
Fixes #150404
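
For context, a small example of the lazy-initialization pattern these docs describe (illustrative, not the fixed docstring itself):

```
import torch

lazy = torch.nn.LazyLinear(out_features=4)   # in_features not known yet
out = lazy(torch.randn(2, 8))                # inferred as 8 on first call
print(lazy.weight.shape)                     # torch.Size([4, 8])
```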

## Test Result

![image](https://github.com/user-attachments/assets/e546339f-c1cb-47db-ab0e-276a42c167b8)

![image](https://github.com/user-attachments/assets/298db7ad-6512-4b17-9453-170ff843c4fd)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150596
Approved by: https://github.com/mikaylagawarecki
2025-05-06 05:11:22 +00:00
337895eaaf Run url and xref linters independently (#152899)
Also introduce `skip-xref-lint` label

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152899
Approved by: https://github.com/huydhn
2025-05-06 05:02:32 +00:00
ee0cd1d8b5 Only do shallow clone when checkout nccl (#152826)
Note: `--depth` implies `--single-branch` since git 2.7.6

```sh
git clone https://github.com/NVIDIA/nccl.git
Cloning into 'nccl'...
remote: Enumerating objects: 4205, done.
remote: Counting objects: 100% (238/238), done.
remote: Compressing objects: 100% (122/122), done.
remote: Total 4205 (delta 144), reused 126 (delta 116), pack-reused 3967 (from 3)
Receiving objects: 100% (4205/4205), 4.22 MiB | 7.01 MiB/s, done.
Resolving deltas: 100% (2858/2858), done.
```
```sh
git clone --depth 1 --branch v2.25.1-1 https://github.com/NVIDIA/nccl.git
Cloning into 'nccl'...
remote: Enumerating objects: 249, done.
remote: Counting objects: 100% (249/249), done.
remote: Compressing objects: 100% (227/227), done.
remote: Total 249 (delta 31), reused 111 (delta 15), pack-reused 0 (from 0)
Receiving objects: 100% (249/249), 657.44 KiB | 2.14 MiB/s, done.
Resolving deltas: 100% (31/31), done.
Note: switching to '80f6bda4378b99d99e82b4d76a633791cc45fef0'.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152826
Approved by: https://github.com/albanD
2025-05-06 04:56:19 +00:00
97dfd8dd53 [invoke_subgraph] Run missing graph passes recursively (#152675)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152675
Approved by: https://github.com/bdhirsh, https://github.com/zou3519
ghstack dependencies: #152772, #152770
2025-05-06 02:55:34 +00:00
cc254eaa7c [inductor][refactor] Refactor the fetching of subgraph names (#152770)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152770
Approved by: https://github.com/jansel, https://github.com/zou3519
ghstack dependencies: #152772
2025-05-06 02:55:34 +00:00
b1d34acac5 [fx] Recursive DCE on subgraphs (#152772)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152772
Approved by: https://github.com/bdhirsh, https://github.com/zou3519
2025-05-06 02:55:34 +00:00
35c727e7ff Fix typo on test_multi_device_context_manager for XPU (#152812)
# Motivation
Align https://github.com/pytorch/pytorch/pull/152474, fix the typo on UT for XPU introduced by https://github.com/pytorch/pytorch/issues/148864
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152812
Approved by: https://github.com/EikanWang, https://github.com/Skylion007
2025-05-06 02:51:19 +00:00
470cd3a995 [aotinductor] Don't alloc weights if they don't exist (#152692)
Fixes https://github.com/pytorch/pytorch/issues/152356

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152692
Approved by: https://github.com/henrylhtsang
2025-05-06 02:50:21 +00:00
8253970a1f Make device check error message more descriptive (#150750)
Fixes #122757

## Test Result

```python
import torch

model_output = torch.randn(10, 5).cuda()
labels = torch.randint(0, 5, (10,)).cuda()
weights = torch.randn(5)

loss_fn = torch.nn.CrossEntropyLoss(weight=weights)
loss = loss_fn(input=model_output, target=labels)
print(loss)

Traceback (most recent call last):
  File "/home/zong/code/pytorch/../loss2.py", line 17, in <module>
    loss = loss_fn(input=model_output, target=labels)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zong/code/pytorch/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zong/code/pytorch/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zong/code/pytorch/torch/nn/modules/loss.py", line 1297, in forward
    return F.cross_entropy(
           ^^^^^^^^^^^^^^^^
  File "/home/zong/code/pytorch/torch/nn/functional.py", line 3494, in cross_entropy
    return torch._C._nn.cross_entropy_loss(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Expected all tensors to be on the same device, but got weight is on cpu, different from other tensors on cuda:0 (when checking argument in method wrapper_CUDA_nll_loss_forward)

```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150750
Approved by: https://github.com/mikaylagawarecki
2025-05-06 02:33:20 +00:00
1d7728056b [nativert] Move TensorMeta to pytorch core (#152475)
Summary:
Torch Native Runtime RFC: https://github.com/pytorch/rfcs/pull/72

This diff moves `TensorMeta.cpp` and `TensorMeta.h` to PyTorch core under `torch/nativert/graph/`

Existing `torch::_export::TensorMeta` in `torch/csrc/utils/generated_serialization_types.h` is auto-generated from the export serde schema and therefore only containing the most basic serializable types. We need the newly added `TensorMeta.cpp` to deserialize the metadata into a in-memory class with c10 types so that it can be consumed by the runtime later.

Test Plan:

Added test under `test/cpp/nativert/test_tensor_meta.cpp`

Differential Revision: D73820548

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152475
Approved by: https://github.com/albanD
2025-05-06 01:50:46 +00:00
1798b0db25 Use three-dot diffs in URL and xref lint workflows (#152895)
Only run on the files actually modified in a PR, not every file touched on main since the branch point

Fixes #152884

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152895
Approved by: https://github.com/huydhn
2025-05-06 01:33:52 +00:00
f097e83369 [inductor][retry] Realize bucketize/searchsorted output (#152858)
**Context**:
bucketize is relatively expensive, computationally. So it's not always profitable to fuse it if it means doing extra computation. For example, this repro:

https://gist.github.com/davidberard98/7fd6af7e6291787c246c705945a25554

shows a slowdown from 56us (eager) to ~100us (torch.compile-d): instead of computing 2\*\*15 binary searches, the fused version does 2\*\*15 * 384 of them, one for each of the broadcasted outputs.

**Solution**:
Realize the output of bucketize (and searchsorted, which also uses inductor's ops.bucketize). If there's an opportunity to do non-broadcasted fusions, the scheduler can still apply such fusions later on.

After this PR, instead of a slowdown, we see an improvement from 56us (eager) to 33us (compiled).
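
An illustrative pattern (not the linked gist): a bucketize whose output is broadcast against a larger tensor, which previously fused into per-element binary searches:

```
import torch

boundaries = torch.linspace(0, 1, 384, device="cuda")
values = torch.rand(2**15, 1, device="cuda")
other = torch.rand(2**15, 384, device="cuda")

@torch.compile
def f(v, o):
    idx = torch.bucketize(v, boundaries)  # output realized after this PR
    return idx.float() + o                # broadcasted consumer

out = f(values, other)
```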

**Retry**
Original PR (https://github.com/pytorch/pytorch/pull/152644) was reverted due to internal bisect blaming this change, but the bisect was a false positive (and is marked as such)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152858
Approved by: https://github.com/aakhundov
2025-05-06 01:32:26 +00:00
14f8066910 Ensure mxfp8 scaled_mm works w/ max-autotune (#152744)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152744
Approved by: https://github.com/Skylion007
2025-05-06 01:16:57 +00:00
cyy
ac792a0dca [submodule] Bump ITTAPI to 3.25.5 (#150263)
It hasn't been updated for 3 years. And also to remove CMake 4 workaround.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150263
Approved by: https://github.com/sraikund16
2025-05-06 01:02:18 +00:00
721fdfa32d [ez] Fsspec Filesystem ls details should be false (#152693)
Summary: The default for ls on the local filesystem is details=False, but this isn't the case for all filesystems (e.g. huggingface). Setting details=False explicitly ensures that the return type of ls is a list of strings, not the list of dictionaries it would be with details=True.
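
For reference, a small illustration of the two return types; note that fsspec's public kwarg is `detail` (singular):

```
import fsspec

fs = fsspec.filesystem("file")
print(fs.ls("/tmp", detail=False)[:3])  # list of path strings
print(fs.ls("/tmp", detail=True)[:1])   # list of dicts with metadata
```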

Test Plan: tested in notebook

Differential Revision: D74080572

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152693
Approved by: https://github.com/joecummings
2025-05-06 01:02:13 +00:00
4979ca5ffa Synchronize in foreach tests after profiling (#152857)
After the CI change from 12.4 -> 12.6 around mid-March, the foreach tests have been flaky and hard to repro due to nondeterminism. Per @davidberard98's suggestion, let's try to add a synchronize before checking profiler results to see whether this fixes the flake! The hope is that the 48 currently open foreach flaky issues will close from this change.
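
A minimal sketch of the pattern (not the actual test): synchronize before reading profiler results.

```
import torch
from torch.profiler import ProfilerActivity, profile

tensors = [torch.randn(8, device="cuda") for _ in range(4)]
with profile(activities=[ProfilerActivity.CUDA]) as prof:
    torch._foreach_add_(tensors, 1.0)
torch.cuda.synchronize()  # sync before inspecting events, per this change
print(prof.key_averages().table(row_limit=5))
```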

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152857
Approved by: https://github.com/davidberard98
2025-05-06 00:56:48 +00:00
13dcf80a53 [dynamic shapes] use try-catch instead of guard_or_true for reshape_view_helper (#152638)
Test Plan: test_export

Differential Revision: D74033649

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152638
Approved by: https://github.com/laithsakka
2025-05-06 00:54:24 +00:00
d197228d43 Revert "[CI] Use cmake from pip instead of conda in CI docker images (#152537)"
This reverts commit 3196a3aca0f16792820158cfd451cb977f99ac7e.

Reverted https://github.com/pytorch/pytorch/pull/152537 on behalf of https://github.com/huydhn due to We need signals from inductor, cmake version from pip is too old? ([comment](https://github.com/pytorch/pytorch/pull/152537#issuecomment-2852820175))
2025-05-06 00:22:23 +00:00
103fe856e1 Revert "Add infra to run CPython tests under Dynamo (#150787)"
This reverts commit 7c96dd8f0c9a7e17f598612405f002441c7f07ae.

Reverted https://github.com/pytorch/pytorch/pull/150787 on behalf of https://github.com/huydhn due to Sorry for reverting your change but a failed test is showing up in trunk ([comment](https://github.com/pytorch/pytorch/pull/150787#issuecomment-2852818113))
2025-05-06 00:20:02 +00:00
0e9874849f [BE]: Update torch core lazy helpers with micropts (#152778)
Some minor nits I noticed. Use reserve when possible
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152778
Approved by: https://github.com/cyyever, https://github.com/albanD
2025-05-06 00:03:51 +00:00
fd57c16285 Avoid triggering ignored requires_grad warning in our code (#152686)
This one is ok to silence as we're just doing formatting

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152686
Approved by: https://github.com/Skylion007
2025-05-05 23:56:40 +00:00
125a3eee5c [ez] Use pip instead of conda in run_tests.sh (#152860)
Part 1 of https://github.com/pytorch/pytorch/issues/148336.  The rest depends on https://github.com/pytorch/pytorch/issues/148335 to remove conda from Docker build process.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152860
Approved by: https://github.com/atalman
2025-05-05 23:06:55 +00:00
e3064bf0e3 [inductor] Allow num_program specification for TMA workspace (#152844)
Summary:
Allow TMA workspace creation to accept a `num_programs` specification, which defaults to `num_sms` when not specified.

We need a total of `num_programs * num_tma_descriptors` descriptors for a kernel.

Test Plan: CI.

Differential Revision: D74189599

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152844
Approved by: https://github.com/drisspg
2025-05-05 23:02:55 +00:00
cc954848d4 Revert "[c10d] Fix extra CUDA context created by barrier (#149144)"
This reverts commit 457fa820ad538c7aeadb68f0ec418d63972ba1ee.

Reverted https://github.com/pytorch/pytorch/pull/149144 on behalf of https://github.com/huydhn due to Internal failure looks legit ([comment](https://github.com/pytorch/pytorch/pull/149144#issuecomment-2852564660))
2025-05-05 22:56:50 +00:00
2ce6d169fc [IR] Input Adapter refactor prototype (#152459) (#152575)
Summary:

1. Adding `input` field to `_adapt_flat_args` function
2. In `process_forward_inputs`, `reorder_kwargs` will now do nothing if no kwargs are provided (previously would error)
3. Pass `args` as input to `_adapt_flat_args`

These changes are made to update the InputAdapter

see more context in D73811508

Test Plan: see D73811508

Differential Revision: D73945419

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152575
Approved by: https://github.com/angelayi
2025-05-05 22:51:58 +00:00
a2ccda3c60 [pytorch][PR][inductor] Fix one instance of launch_enter_hook (#152831)
Summary: One usage seems missed in https://github.com/pytorch/pytorch/pull/152457

Test Plan: EMS local benchmark

Differential Revision: D74159749

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152831
Approved by: https://github.com/danzimm
2025-05-05 22:15:47 +00:00
2b4fe9fa14 [Autotune Cache] Fix the bug of using the wrong key for recording artifacts in CacheArtifactManager (#152678)
Summary: Replace the key (path) from `<hash>.best_config` to `<parent_dir>/<hash>.best_config` to ensure that Autotune artifacts in MegaCache are loaded to the correct location locally.

Test Plan: NA

Differential Revision: D74052400

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152678
Approved by: https://github.com/oulgen
2025-05-05 21:03:10 +00:00
d547c7e10d [fbgemm] Implement __obj_flatten__ for LinearPackedParamsBase (#152619)
Differential Revision: D73991241

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152619
Approved by: https://github.com/jerryzh168, https://github.com/houseroad
2025-05-05 20:58:25 +00:00
22d1359bc6 Move warning from item to specific number conversions (#152709)
Follow-up to https://github.com/pytorch/pytorch/pull/143261 so that a plain .item() call does not warn.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152709
Approved by: https://github.com/malfet, https://github.com/ngimel
2025-05-05 20:46:05 +00:00
3bc69cc08d Document that dampening is skipped in SGD momentum first step (#152833)
Pointed out by https://x.com/hi_tysam/status/1917318692276174977/photo/2.

It would be BC-breaking to change this behavior 7 years after it was decided, so at the very least we are documenting it first.
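
A small numeric sketch of the documented behavior:

```
import torch

grad = torch.tensor([1.0, -2.0])
momentum, dampening = 0.9, 0.5

buf = grad.clone()                             # first step: dampening skipped
buf = momentum * buf + (1 - dampening) * grad  # later steps: dampening applied
print(buf)
```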

<img width="642" alt="image" src="https://github.com/user-attachments/assets/3febcb07-e0ed-44a1-bd3b-a8e685711cb4" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152833
Approved by: https://github.com/albanD
2025-05-05 20:07:23 +00:00
99dac7005f Revert "[Inductor] FX backend via Wrapper IR (#146942)"
This reverts commit a7691140a0fed33a838dda11e28ff7da393d9180.

Reverted https://github.com/pytorch/pytorch/pull/146942 on behalf of https://github.com/malfet due to Looks like it indeed breaks lint, see a7691140a0/1 ([comment](https://github.com/pytorch/pytorch/pull/146942#issuecomment-2852192778))
2025-05-05 20:01:29 +00:00
a7691140a0 [Inductor] FX backend via Wrapper IR (#146942)
# Sub-PRs

These PRs contain refactors from the main one. They should be reviewed and merged first.

- https://github.com/pytorch/pytorch/pull/150458
- https://github.com/pytorch/pytorch/pull/152391
- https://github.com/pytorch/pytorch/pull/152587

# Feature

The goals of this PR are twofold.

## Goal 1: Introduce Wrapper IR as an intermediate step in wrapper codegen.

In addition to Triton/C++/Halide kernels, Inductor also generates "wrapper" code which allocates memory and calls the kernels. Originally, this wrapper code was fairly standard Python which resembled a user-written PyTorch program. Over time, various wrapper code generators have been added to accommodate things like AOTInductor, which prefers C++ code for static compilation. This complexity has bled into other parts of the codebase, as we now need if/else statements to choose between Python and C++ macros. (See an example [here](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/ir.py#L5515-L5522).) Since most of these code generation steps are conceptually identical across target languages, it seems reasonable to refactor them into some kind of intermediate representation which can be shared between the various backends. This might also make it easier to develop out-of-tree backends which cannot put their own macros in core Inductor components.

This PR takes some initial steps to formalize Inductor's wrapper codegen by generalizing the existing Memory Planning IR into a fully fledged Wrapper IR. This is pretty much identical to the existing Memory Planning IR, but it supports a richer set of ops for things like kernel definitions and calls. This refactor could help encapsulate wrapper codegen. Ideally, we don't need to worry about direct Python/C++ codegen in the main compiler files such as `ir.py`, and can instead defer these to classes like `PythonWrapperCodegen` and `CppWrapperCpu`, which operate on the Wrapper IR.

## Goal 2: Convert Wrapper IR into FX IR.

One of the main benefits of Wrapper IR is to enable more diverse Inductor backends. This PR introduces a converter from Wrapper IR into [FX IR](https://pytorch.org/docs/stable/fx.html), which is the intermediate representation most commonly used in PyTorch graph compilers. The purpose of this is to enable out-of-tree backends to consume Inductor's output in FX IR, which would hopefully make Inductor easier to leverage in novel compilers, hardware accelerators, etc.

It's not trivial to generate Python or C++ code which Inductor can compile and run, and doing so may require changes to other core Inductor files, for the reasons outlined in the previous section. The goal of supporting FX output is to enable something like `torch.compile`'s [custom backend](https://pytorch.org/docs/stable/torch.compiler_custom_backends.html) system, in which an out-of-tree backend can receive an optimized FX graph from Inductor, and compile and run it however it likes.

The typical users of this feature would likely not be part of PyTorch, and may or may not support running a kernel in eager mode. However, they can understand what `torch.empty_strided` means, compile and run Triton kernels, etc. So we just need to present them with an FX graph saying what code Inductor wants to run, which should be easier to analyze and transform in a third party system than Python or C++ source.

Since FX IR is fairly stable, this mechanism should hopefully isolate third-party backends, hardware accelerators, etc. from the implementation details of Inductor, and vice versa.

# Current status

Things that seem to work:

- Converted a lot of the most common Python codegen lines to Wrapper IR lines.
     - Handled the following cases, in addition to what was already in the Memory Planning IR:
         - Comments
         - Triton kernels
         - Extern/fallback kernels
         - Freeing tensors (`del buf0`)
         - MultiOutput
         - Graph outputs
         - ReinterpretView / StorageBox, for both call args and outputs.
     - FX conversion asserts that the program only contains Wrapper IR lines, and not strings of Python/C++ code.
- Prototype FX converter which can handle some of the most common use cases.
   - Defining Triton kernels, and putting them in a side table using TorchDynamo's existing [utilities](https://dev-discuss.pytorch.org/t/higher-order-operators-2023-10/1565).
   - Calling wrapped Triton kernels.
   - Calling extern kernels and certain types of fallback kernels.
       - Support both `extern_kernels.*` and `aten.*`.
       - Support multi-output kernels like `torch.topk`.
   - Graphs with multiple inputs/outputs.
   - Training i.e. calling `Tensor.backward()` in a compiled function.
   - Graph breaks (training).
- Run the `torch.fx.GraphModule` on GPU using the standard `__call__` method. This makes it easy to test the correctness of FX codegen.

Things that don't work:
- Both Wrapper IR and Wrapper -> FX coverage are currently best effort. There are still features which aren't captured as Wrapper IR lines, and fall back to plain strings. This representation is functionally correct but probably not rich enough to achieve the goals outlined in the previous sections.
         - Fallback kernels seem like the most difficult thing to fully cover, since they each define their own Python/C++ macros that would need to be converted to FX.
         - Size/alignment asserts are currently disabled via the config file. It's possible to generate FX IR for these, but it seems reasonable to defer these sanity checks to a later PR.
         - CommBuffer's and distributed communication are not yet supported. An earlier version of this PR attempted to implement this by calling `empty_strided_p2p`. However, building and testing distributed support seems non-trivial, so it's probably better to defer this.

# Out-of-tree compilers

With this PR, out of tree backends will be able to do further compilation on the FX graphs by subclassing `WrapperFxCodegen` and overriding the `compile_graph` function. This follows the same API as torch.compile's [custom backends](https://pytorch.org/docs/stable/torch.compiler_custom_backends.html), where the user simply returns a callable running the graph. The callable need not be a method of `GraphModule` or any other PyTorch class. See an example below.

```
from torch._inductor.codegen.wrapper_fxir import WrapperFxCodegen

class MyCustomBackend(WrapperFxCodegen):
    def compile_graph(self, gm):
        # gm is a torch.fx.GraphModule; add 1 to each of the graph's outputs
        def compiled_fn(*args):
            return [x + 1 for x in gm(*args)]

        return compiled_fn
```

# Example FX graphs

This section contains some example FX graphs generated by Inductor. The correctness of these graphs was verified against eager mode by calling the corresponding `GraphModule`.

Here's an FX graph calling a basic Triton kernel. Notice how outputs are allocated with `torch.empty_strided`, and the Triton kernel is called by reference to Dynamo's triton side table.
```
graph():
    %arg0_1 : [num_users=1] = placeholder[target=arg0_1]
    %arg1_1 : [num_users=1] = placeholder[target=arg1_1]
    %buf0 : [num_users=2] = call_function[target=torch.empty_strided](args = ((8,), (1,)), kwargs = {dtype: torch.float32, device: cuda:0})
    %triton_kernel_wrapper_mutation : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 0, constant_args_idx: 0, grid: [(8,)], tma_descriptor_metadata: {}, kwargs: {in_ptr0: %arg1_1, in_ptr1: %arg0_1, out_ptr0: %buf0, xnumel: 8, XBLOCK: 8}})
    return (buf0,)
```

Here's a more complicated graph that calls a `torch.addmm` extern kernel.

```
graph():
    %arg0_1 : [num_users=1] = placeholder[target=arg0_1]
    %arg1_1 : [num_users=2] = placeholder[target=arg1_1]
    %buf0 : [num_users=3] = call_function[target=torch.empty_strided](args = ((), ()), kwargs = {dtype: torch.float32, device: cuda:0})
    %triton_kernel_wrapper_mutation : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 0, constant_args_idx: 0, grid: [(1,)], tma_descriptor_metadata: {}, kwargs: {in_ptr0: %arg1_1, out_ptr0: %buf0, xnumel: 1, r0_numel: 129, XBLOCK: 1}})
    %buf2 : [num_users=2] = call_function[target=torch.empty_strided](args = ((129, 1), (1, 1)), kwargs = {dtype: torch.float32, device: cuda:0})
    %addmm : [num_users=0] = call_function[target=torch.addmm](args = (%buf0, %arg0_1, %arg1_1), kwargs = {alpha: 1, beta: 1, out: %buf2})
    %delete : [num_users=0] = call_function[target=torch._inductor.codegen.wrapper_fxir.delete](args = (%buf0,), kwargs = {})
    return (buf2,)
```

Here's a graph which indexes into a tuple using `operator.getitem`. This is necessary to use the output of the `torch.topk` operation.

```
graph():
    %arg0_1 : [num_users=1] = placeholder[target=arg0_1]
    %buf0 : [num_users=3] = call_function[target=torch.ops.aten.topk.default](args = (%arg0_1, 2), kwargs = {})
    %buf1 : [num_users=2] = call_function[target=operator.getitem](args = (%buf0, 0), kwargs = {})
    %buf2 : [num_users=2] = call_function[target=operator.getitem](args = (%buf0, 1), kwargs = {})
    %delete : [num_users=0] = call_function[target=torch._inductor.codegen.wrapper_fxir.delete](args = (%buf0,), kwargs = {})
    %triton_kernel_wrapper_mutation : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 0, constant_args_idx: 0, grid: [(2,)], tma_descriptor_metadata: {}, kwargs: {in_out_ptr0: %buf1, xnumel: 2, XBLOCK: 2}})
    %triton_kernel_wrapper_mutation_1 : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 1, constant_args_idx: 1, grid: [(2,)], tma_descriptor_metadata: {}, kwargs: {in_out_ptr0: %buf2, xnumel: 2, XBLOCK: 2}})
    return (buf1, buf2)
```

Here's a graph that reinterprets an output tensor using `torch.as_strided`. This is one way to handle Inductor's `ReinterpretView` op.

```
graph():
    %arg0_1 : [num_users=1] = placeholder[target=arg0_1]
    %arg1_1 : [num_users=1] = placeholder[target=arg1_1]
    %buf0 : [num_users=2] = call_function[target=torch.empty_strided](args = ((2, 4), (4, 1)), kwargs = {dtype: torch.float32, device: cuda:0})
    %triton_kernel_wrapper_mutation : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 0, constant_args_idx: 0, grid: [(8,)], tma_descriptor_metadata: {}, kwargs: {in_ptr0: %arg0_1, in_ptr1: %arg1_1, out_ptr0: %buf0, xnumel: 8, XBLOCK: 8}})
    %buf0_view_buf0_0 : [num_users=1] = call_function[target=torch.as_strided](args = (%buf0, (8,), (1,), 0), kwargs = {})
    return (buf0_view_buf0_0,)
```

Here's a graph with dynamic shapes. This one is a little bit funky. Inductor provides a graph input for each shape symbol, which we map to a placeholder, in this example `s6`. Then, shape expressions in the generated code can refer to the symbol `s6`. The size hint for `s6` is stored in `node.meta["val"]` where `node` is the placeholder defining it. This works out in the generated python code because the placeholder defines a Python variable with the name `s6`.
```
graph():
    %s6 : [num_users=0] = placeholder[target=s6]
    %arg1_1 : [num_users=1] = placeholder[target=arg1_1]
    %arg2_1 : [num_users=1] = placeholder[target=arg2_1]
    %buf0 : [num_users=2] = call_function[target=torch.empty_strided](args = ((s6,), (1,)), kwargs = {dtype: torch.float32, device: cuda:0})
    %triton_kernel_wrapper_mutation : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 0, constant_args_idx: 0, grid: [[-(((-s6)//8)), 1, 1]], tma_descriptor_metadata: {}, kwargs: {in_ptr0: %arg2_1, in_ptr1: %arg1_1, out_ptr0: %buf0, xnumel: s6, XBLOCK: 8}})
    return buf0
```

Here's another graph, this time with dynamic shapes and strides. The grid expression is more complex since the numel is a product of dimensions.
```
graph():
    %s10 : [num_users=0] = placeholder[target=s10]
    %arg1_1 : [num_users=1] = placeholder[target=arg1_1]
    %arg2_1 : [num_users=1] = placeholder[target=arg2_1]
    %buf0 : [num_users=2] = call_function[target=torch.empty_strided](args = ([s10, s10], [s10, 1]), kwargs = {dtype: torch.float32, device: cuda:0})
    %triton_kernel_wrapper_mutation : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 0, constant_args_idx: 0, grid: [[-(((s10**2)//(-64))), 1, 1]], tma_descriptor_metadata: {}, kwargs: {in_ptr0: %arg2_1, in_ptr1: %arg1_1, out_ptr0: %buf0, xnumel: s10**2, XBLOCK: 64}})
    return buf0
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146942
Approved by: https://github.com/jansel
2025-05-05 19:34:49 +00:00
fdadda21b6 Revert "[float16]: Fast path for torch.dot with float16/bfloat16 (#152799)"
This reverts commit d57bf53225004a684952222722a4f7322a21a596.

Reverted https://github.com/pytorch/pytorch/pull/152799 on behalf of https://github.com/malfet due to This broke C10_MOBILE builds, not sure why it was not surfaced on pull, see a766c1d117/1 ([comment](https://github.com/pytorch/pytorch/pull/152799#issuecomment-2852084433))
2025-05-05 19:17:59 +00:00
a766c1d117 [nativert] move intrusive list to c10/util (#152754)
Summary:
nativert RFC: https://github.com/zhxchen17/rfcs/blob/master/RFC-0043-torch-native-runtime.md

To land the runtime into PyTorch core, we will gradually land logical parts of the code into the Github issue and get each piece properly reviewed.

This diff moves intrusive list to c10/util

Test Plan: CI

Differential Revision: D74104595

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152754
Approved by: https://github.com/Skylion007, https://github.com/cyyever
2025-05-05 18:49:56 +00:00
51e77f3b30 [dynamo] replace unimplemented with unimplemented_v2 in variables/torch_functions.py (#151278)
This addresses part of #147913.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151278
Approved by: https://github.com/Skylion007, https://github.com/williamwen42
ghstack dependencies: #151277
2025-05-05 18:45:40 +00:00
9e24f9b523 [dynamo] replace unimplemented with unimplemented_v2 in variables/functions.py (#151277)
This addresses part of #147913.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151277
Approved by: https://github.com/Skylion007, https://github.com/williamwen42
2025-05-05 18:45:40 +00:00
d57bf53225 [float16]: Fast path for torch.dot with float16/bfloat16 (#152799)
Fixes #152798

Add the fast path for dot with contiguous tensors for float16/bfloat16 types.

Performance with patch (see issue for benchmark and current performance):

![Improved dot performance](https://github.com/user-attachments/assets/57f64e90-8191-4710-adb0-f430644827de)

**We see up to 10x+ improvement in performance.**
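
A small timing sketch of the case this fast path targets (sizes and iteration count are arbitrary):

```
import timeit
import torch

# Contiguous float16 vectors: the case covered by the new fast path.
a = torch.randn(1_000_000, dtype=torch.float16)
b = torch.randn(1_000_000, dtype=torch.float16)

t = timeit.timeit(lambda: torch.dot(a, b), number=100)
print(f"torch.dot float16: {t / 100 * 1e6:.1f} us/call")
```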

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152799
Approved by: https://github.com/malfet
2025-05-05 18:29:39 +00:00
172a7c942e Revert "Log aot and idx waitcounters. (#152444)"
This reverts commit ea9ea029595a5f628fdd368a6e1dd76e95707161.

Reverted https://github.com/pytorch/pytorch/pull/152444 on behalf of https://github.com/jovianjaison due to needs a fix ([comment](https://github.com/pytorch/pytorch/pull/152444#issuecomment-2851905261))
2025-05-05 18:11:37 +00:00
136ee4c81b Make assertion about pass callable print the bad pass (#152654)
If you passed an invalid string now you can easily see what it is

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152654
Approved by: https://github.com/eellison
2025-05-05 18:07:43 +00:00
fd6d4a6a24 [dynamo] Guard serialization for DICT_KEYS_MATCH (#152723)
DICT_KEYS_MATCH

Differential Revision: [D74091886](https://our.internmc.facebook.com/intern/diff/D74091886/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152723
Approved by: https://github.com/jansel
ghstack dependencies: #152615, #152616, #152687, #152716, #152721
2025-05-05 18:05:56 +00:00
2da9ab4b1c [dynamo] Guard serialization for MAPPING_KEYS_CHECK (#152721)
MappingProxyType

Differential Revision: [D74091363](https://our.internmc.facebook.com/intern/diff/D74091363/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152721
Approved by: https://github.com/jansel
ghstack dependencies: #152615, #152616, #152687, #152716
2025-05-05 18:05:56 +00:00
24e1666b3a [dynamo] Guard serialization for WEAKREF_ALIVE (#152716)
Punt on WEAKREF_ALIVE as weakrefs won't live across the process and users might need to drop them upfront.

Differential Revision: [D74088735](https://our.internmc.facebook.com/intern/diff/D74088735/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152716
Approved by: https://github.com/jansel
ghstack dependencies: #152615, #152616, #152687
2025-05-05 18:05:56 +00:00
2cb16df6e2 [dynamo] Guard serialization for DUPLICATE_INPUT. (#152687)
Seems this guard is not very active. Adding a test to detect error handling at least.

Differential Revision: [D74074837](https://our.internmc.facebook.com/intern/diff/D74074837/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152687
Approved by: https://github.com/jansel
ghstack dependencies: #152615, #152616
2025-05-05 18:05:56 +00:00
ffd58293f7 [dynamo] Guard serialization for FUNCTORCH_STACK_MATCH (#152616)
Make Functorch interpreters serializable most of the time, so that we can save the guards on functorch states.

## Test Cases:

0. torch.compile() without functorch layers present. Guard should fail with any layer being pushed.
1. torch.compile() nested in vmap.
2. torch.compile() nested in grad.
3. torch.compile() nested in jvp + vmap
4. torch.compile() nested functionalize
5. torch.compile() nested in vmap + grad
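
A minimal sketch of case 1 from the list above (torch.compile nested in vmap); the exact guard behavior is what the tests above exercise:

```
import torch

@torch.compile
def inner(x):
    return x.sin() + 1

# Case 1: torch.compile() nested inside a vmap layer; the functorch stack at
# compile time is what the FUNCTORCH_STACK_MATCH guard records.
out = torch.vmap(inner)(torch.randn(4, 3))
```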

Differential Revision: [D74008787](https://our.internmc.facebook.com/intern/diff/D74008787/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152616
Approved by: https://github.com/zou3519
ghstack dependencies: #152615
2025-05-05 18:05:56 +00:00
1d1cbcd8a3 [dynamo] Guard serialization for DUAL LEVEL. (#152615)
Seems the dual level counter should be stored in OutputGraph so that the value can be preserved through roundtripping.

Differential Revision: [D74008786](https://our.internmc.facebook.com/intern/diff/D74008786/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152615
Approved by: https://github.com/jansel, https://github.com/zou3519
2025-05-05 18:05:56 +00:00
0145f9e29e [CI] docker images use tags instead of image name (#152209)
Change CI docker images to be `ci-image:<image name>-<folder sha>` instead of `<image name>:<folder sha>` so we never have to make a new ecr repo ever again

Pros:
never have to make a new ecr repo ever again
Cons:
if it aint broken, dont fix it?

Don't need to change linux-test images since they use the "full name" of the image with the docker registry and the tag

In order to prevent others from needing to rebase past this PR, also push the image to the "old name". This can be removed after this PR has been in main for a while.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152209
Approved by: https://github.com/seemethere, https://github.com/atalman
2025-05-05 18:02:29 +00:00
cyy
45efa1aaa8 [3/N] Use internal linkage in C++ files (#151297)
Follows #151070.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151297
Approved by: https://github.com/Skylion007
2025-05-05 17:48:39 +00:00
99287b170b Generate test reports for pytest when option is given (#152170)
The argument needs to be appended when test reports should be generated. IS_CI is not necessarily set, so check TEST_SAVE_XML instead, as in other places where test reports are conditionally enabled.

See also https://github.com/pytorch/pytorch/issues/126523
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152170
Approved by: https://github.com/Skylion007
2025-05-05 17:46:40 +00:00
kyo
a21090a38c Fix incorrect citation of authors in documentation (#145209)
This PR corrects the citation of Adafactor authors "Noam Shazeer" and "Mitchell Stern" in the documentation.
The current text incorrectly lists them as "Shazeer, Noam, and Mitchell Stern," which seems to be a result of a data parsing issue of some reference manager(s) [as you can find many papers with the same issue](https://www.google.com/search?q=%22Shazeer%2C+Noam%2C+and+Mitchell+Stern%22).
The updated citation follows standard conventions for author names.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145209
Approved by: https://github.com/janeyx99
2025-05-05 17:45:05 +00:00
ea9ea02959 Log aot and idx waitcounters. (#152444)
Summary:
Added for create_aot_dispatcher_function and compile_fx_inner.

Note:
Log wait counters flag is already set for:
1. async_compile.precompile
2. remote_fx_graph_cache_get
3. remote_fx_graph_cache_put

Test Plan: contbuild

Differential Revision: D73866124

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152444
Approved by: https://github.com/ppanchalia, https://github.com/masnesral
2025-05-05 17:35:29 +00:00
35475a3e07 Disable SLEEF implementation of vec::maximum in vec128_float_neon.h | Accelerate aten::hardtanh_ by 21x (#152538)
The `has_inf_nan` implementation in `vec::maximum` is scalar, and it slows down certain activations like `tanh` by almost 20 times. Additionally, the `vec::minimum` function simply uses NEON intrinsics and not SLEEF. This PR makes the two fns similar in implementation.

Besides, the SLEEF function `Sleef_fmaxf4` ultimately invokes the `vmaxq_f32` NEON intrinsic through [vmax_vf_vf_vf](d28232a309/src/arch/helperadvsimd.h (L253)).

From a single threaded profile of mobilenet on an Arm Neoverse-V2 machine (code below), the `aten::hardtanh_` takes **5.653ms** per function call while using the current PyTorch 2.7 wheel, whereas it takes **266.096us** per function call while simply using `vmaxq_f32` - a 21x speedup, and overall inference is 1.8x faster.
___

Run the below script: `OMP_NUM_THREADS=1 python profile_mobilenet.py --iterations 10`
<details >
<summary>profile_mobilenet.py</summary>

```
import torch
import torchvision.models as models
from torch.profiler import profile, record_function, ProfilerActivity
import argparse

torch.manual_seed(42)

def load_mobilenet():
    model = models.mobilenet_v2(pretrained=True)
    model.eval()
    return model

def generate_sample_input(batch_size=8):
    return torch.randn(batch_size, 3, 224, 224)

def warmup(model, sample_input, num_warmup=10):
    with torch.inference_mode():
        for _ in range(num_warmup):
            _ = model(sample_input)

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('--batch_size', type=int, default=8)
    parser.add_argument('--iterations', type=int, default=100)
    return parser.parse_args()

def main():
    args = parse_args()
    model = load_mobilenet()

    sample_input = generate_sample_input(args.batch_size)
    print("Warming up...")
    warmup(model, sample_input)
    print("Warmup complete.")
    with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
        with torch.inference_mode():
            for i in range(args.iterations):
                with record_function("model_inference"):
                    outputs = model(sample_input)

    print(prof.key_averages().table(sort_by="cpu_time_total"))
    print(f"Throughput: {(args.iterations * args.batch_size / (prof.profiler.self_cpu_time_total / 1e6)):.3f} images/s")

if __name__ == "__main__":
    main()
```

</details>

<details>
<summary>Profiler output using the current Pytorch 2.7 wheel </summary>

```
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                            Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                 model_inference         2.39%     101.839ms       100.00%        4.254s     425.437ms            10
                 aten::hardtanh_         0.02%     905.454us        46.50%        1.978s       5.653ms           350
                  aten::hardtanh         0.03%       1.239ms        46.48%        1.977s       5.650ms           350
                     aten::clamp        46.45%        1.976s        46.45%        1.976s       5.646ms           350
                    aten::conv2d         0.06%       2.468ms        43.89%        1.867s       3.591ms           520
               aten::convolution         0.06%       2.491ms        43.83%        1.865s       3.586ms           520
              aten::_convolution         0.13%       5.546ms        43.77%        1.862s       3.581ms           520
               aten::thnn_conv2d         0.04%       1.658ms        24.13%        1.027s       3.019ms           340
      aten::_slow_conv2d_forward        23.99%        1.021s        24.09%        1.025s       3.014ms           340
        aten::mkldnn_convolution        14.42%     613.285ms        19.51%     829.885ms       4.610ms           180
                aten::batch_norm         0.06%       2.368ms         6.89%     292.928ms     563.323us           520
    aten::_batch_norm_impl_index         0.11%       4.600ms         6.83%     290.560ms     558.769us           520
         aten::native_batch_norm         6.60%     280.762ms         6.69%     284.567ms     547.244us           520
                aten::contiguous         0.01%     623.099us         5.01%     213.152ms       1.184ms           180
                     aten::clone         0.02%     988.729us         5.00%     212.529ms       1.181ms           180
                     aten::copy_         4.94%     210.315ms         4.94%     210.315ms       1.052ms           200
                    aten::linear         0.00%      58.347us         0.18%       7.659ms     765.905us            10
                     aten::addmm         0.17%       7.373ms         0.18%       7.483ms     748.309us            10
                     aten::empty         0.17%       7.161ms         0.17%       7.161ms       1.790us          4000
                       aten::add         0.11%       4.742ms         0.11%       4.742ms      47.419us           100
                aten::empty_like         0.03%       1.315ms         0.09%       3.890ms       5.557us           700
                      aten::view         0.05%       1.933ms         0.05%       1.933ms       2.801us           690
               aten::as_strided_         0.04%       1.599ms         0.04%       1.599ms       8.885us           180
                   aten::resize_         0.04%       1.493ms         0.04%       1.493ms       2.871us           520
       aten::adaptive_avg_pool2d         0.00%      55.360us         0.04%       1.491ms     149.051us            10
                      aten::mean         0.00%     116.997us         0.03%       1.435ms     143.515us            10
                       aten::sum         0.02%     935.980us         0.02%     992.121us      99.212us            10
                    aten::detach         0.02%     707.217us         0.02%     707.217us       2.080us           340
                      aten::div_         0.00%     161.473us         0.01%     326.035us      32.604us            10
                        aten::to         0.00%     178.193us         0.01%     321.253us       0.892us           360
         aten::_nnpack_available         0.01%     302.835us         0.01%     302.835us       0.891us           340
                  aten::_to_copy         0.00%      63.170us         0.00%     143.060us      14.306us            10
                         aten::t         0.00%      49.759us         0.00%     117.621us      11.762us            10
                 aten::transpose         0.00%      40.637us         0.00%      67.862us       6.786us            10
                   aten::flatten         0.00%      42.634us         0.00%      58.867us       5.887us            10
                     aten::fill_         0.00%      56.141us         0.00%      56.141us       5.614us            10
                    aten::expand         0.00%      42.687us         0.00%      48.930us       4.893us            10
             aten::empty_strided         0.00%      40.589us         0.00%      40.589us       4.059us            10
                aten::as_strided         0.00%      33.468us         0.00%      33.468us       1.673us            20
              aten::resolve_conj         0.00%       9.066us         0.00%       9.066us       0.453us            20
                   aten::dropout         0.00%       5.782us         0.00%       5.782us       0.578us            10
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 4.254s

Throughput: 18.804 images/s
```

</details>

<details>
<summary>Profiler output after this PR's changes </summary>

```
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                            Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                 model_inference         4.43%     104.484ms       100.00%        2.359s     235.883ms            10
                    aten::conv2d         0.10%       2.313ms        79.19%        1.868s       3.592ms           520
               aten::convolution         0.10%       2.293ms        79.09%        1.866s       3.588ms           520
              aten::_convolution         0.23%       5.436ms        78.99%        1.863s       3.583ms           520
               aten::thnn_conv2d         0.08%       1.799ms        44.29%        1.045s       3.072ms           340
      aten::_slow_conv2d_forward        44.03%        1.039s        44.21%        1.043s       3.067ms           340
        aten::mkldnn_convolution        24.91%     587.584ms        34.47%     812.992ms       4.517ms           180
                aten::batch_norm         0.10%       2.350ms        11.83%     279.113ms     536.757us           520
    aten::_batch_norm_impl_index         0.20%       4.788ms        11.73%     276.764ms     532.238us           520
         aten::native_batch_norm        11.30%     266.660ms        11.46%     270.420ms     520.038us           520
                aten::contiguous         0.02%     575.723us         9.41%     222.080ms       1.234ms           180
                     aten::clone         0.04%       1.061ms         9.39%     221.504ms       1.231ms           180
                     aten::copy_         9.29%     219.131ms         9.29%     219.131ms       1.096ms           200
                 aten::hardtanh_         0.04%     917.669us         3.95%      93.133ms     266.096us           350
                  aten::hardtanh         0.05%       1.130ms         3.91%      92.216ms     263.474us           350
                     aten::clamp         3.85%      90.894ms         3.86%      91.086ms     260.246us           350
                    aten::linear         0.00%      68.681us         0.33%       7.899ms     789.945us            10
                     aten::addmm         0.32%       7.598ms         0.33%       7.707ms     770.673us            10
                     aten::empty         0.30%       7.176ms         0.30%       7.176ms       1.794us          4000
                       aten::add         0.20%       4.627ms         0.20%       4.627ms      46.268us           100
                aten::empty_like         0.06%       1.316ms         0.17%       3.973ms       5.676us           700
                      aten::view         0.08%       2.001ms         0.08%       2.001ms       2.899us           690
       aten::adaptive_avg_pool2d         0.00%      53.745us         0.07%       1.548ms     154.791us            10
                   aten::resize_         0.06%       1.533ms         0.06%       1.533ms       2.948us           520
               aten::as_strided_         0.06%       1.521ms         0.06%       1.521ms       8.450us           180
                      aten::mean         0.00%     117.637us         0.06%       1.494ms     149.417us            10
                       aten::sum         0.04%     973.291us         0.04%       1.013ms     101.342us            10
                    aten::detach         0.03%     652.224us         0.03%     652.224us       1.918us           340
                      aten::div_         0.01%     195.077us         0.02%     363.103us      36.310us            10
                        aten::to         0.01%     212.758us         0.02%     359.655us       0.999us           360
         aten::_nnpack_available         0.01%     295.235us         0.01%     295.235us       0.868us           340
                  aten::_to_copy         0.00%      68.726us         0.01%     146.897us      14.690us            10
                         aten::t         0.00%      53.873us         0.01%     124.033us      12.403us            10
                 aten::transpose         0.00%      42.512us         0.00%      70.160us       7.016us            10
                   aten::flatten         0.00%      44.040us         0.00%      66.631us       6.663us            10
                    aten::expand         0.00%      44.632us         0.00%      51.177us       5.118us            10
                     aten::fill_         0.00%      40.134us         0.00%      40.134us       4.013us            10
             aten::empty_strided         0.00%      35.291us         0.00%      35.291us       3.529us            10
                aten::as_strided         0.00%      34.193us         0.00%      34.193us       1.710us            20
              aten::resolve_conj         0.00%       8.594us         0.00%       8.594us       0.430us            20
                   aten::dropout         0.00%       6.758us         0.00%       6.758us       0.676us            10
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 2.359s

Throughput: 33.915 images/s
```

</details>

___

Using torchbench, the models `mobilenet_v2` and `mobilenet_v3_large` showed improvements as expected too.

Before -> After (latency in ms)
```
"mobilenet_v3_large-eval_latency": 1207.212 -> 844.902
"mobilenet_v2-eval_latency": 1029.834 -> 662.476
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152538
Approved by: https://github.com/Skylion007
2025-05-05 17:21:11 +00:00
131da0a982 Add a test for AsyncCollectiveTensor handling for maybe-view ops (#152688)
We never added a proper test for the fix from https://github.com/pytorch/pytorch/pull/134661

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152688
Approved by: https://github.com/kwen2501
ghstack dependencies: #152195
2025-05-05 17:21:00 +00:00
5abe74857a SAC: fix recompute tag propagation for ops with list[tensor] inputs (#152195)
There's an "are we compiling" check in SAC, which we rely on to know when to propagate recompute tags during tracing.

This check was a bit brittle, and missed cases where input ops accept list of tensors - I updated it to check if a `FunctionalTensorMode` is active, which should be a 100% reliable way to know if AOTDispatcher is in the middle of running.

There is a long-standing followup here around unifying `torch.compiler.is_compiling()` to work in all cases. We should probably just update it to always check if FakeMode/FunctionalMode are active and use it there. This has a bit of BC risk though so I opted for the more local fix to SAC.
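
The check boils down to asking whether a `FunctionalTensorMode` is currently active, which can be probed from the dispatch mode stack; a sketch of the idea (the helper name is hypothetical and the actual code in the PR may differ):

```
from torch._subclasses.functional_tensor import FunctionalTensorMode
from torch.utils._python_dispatch import _get_current_dispatch_mode_stack

def aot_dispatcher_is_running():
    # AOTDispatcher traces under a FunctionalTensorMode, so its presence is a
    # reliable "are we compiling" signal, even for ops taking list[Tensor] inputs.
    return any(
        isinstance(mode, FunctionalTensorMode)
        for mode in _get_current_dispatch_mode_stack()
    )
```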

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152195
Approved by: https://github.com/soulitzer
2025-05-05 17:21:00 +00:00
7c96dd8f0c Add infra to run CPython tests under Dynamo (#150787)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150787
Approved by: https://github.com/zou3519
2025-05-05 17:20:14 +00:00
50fe1b2349 Implement async manifold cache write (#152452)
Summary: This diff implements an AsyncManifoldCache class that performs cache write and update-TTL operations in an async manner. Essentially we are OK with the fire-and-forget approach where we don't guarantee that we can observe our writes; this gives us better runtime latency.
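
The pattern is roughly the following (a generic sketch, not the actual AsyncManifoldCache code):

```
from concurrent.futures import ThreadPoolExecutor

class AsyncCacheSketch:
    """Fire-and-forget writes: submit and return without waiting on the result."""

    def __init__(self, backing_cache):
        self._cache = backing_cache
        self._pool = ThreadPoolExecutor(max_workers=4)

    def put(self, key, value):
        # No .result() call: we accept that we may not observe our own write,
        # in exchange for lower latency on the caller's path.
        self._pool.submit(self._cache.put, key, value)

    def update_ttl(self, key, ttl_s):
        self._pool.submit(self._cache.update_ttl, key, ttl_s)
```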

Test Plan: added new unit test

Reviewed By: jamesjwu

Differential Revision: D73867797

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152452
Approved by: https://github.com/jamesjwu
2025-05-05 16:45:48 +00:00
3196a3aca0 [CI] Use cmake from pip instead of conda in CI docker images (#152537)
As in title
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152537
Approved by: https://github.com/cyyever, https://github.com/atalman
2025-05-05 16:32:40 +00:00
d119481717 [cutlass backend] Minor lru_cache to slightly speed up filtering ops (#152577)
For the default level, filtering went from 0.11332 seconds to 0.10064 seconds.

You can't really apply lru_cache too aggressively. For example, hashing a cutlass op takes a long time.

Removing a log further brings it down to 0.07202 seconds.

Differential Revision: [D73971021](https://our.internmc.facebook.com/intern/diff/D73971021/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152577
Approved by: https://github.com/chenyang78
2025-05-05 16:27:16 +00:00
9a9cc48c65 Update SGD documentation to match implementation (#149884)
Fixes #149476

This PR updates the pseudocode description of the SGD optimizer to better match the implementation.

Updated pseudocode:

![image](https://github.com/user-attachments/assets/2d7bc618-0408-4909-b835-af6465736918)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149884
Approved by: https://github.com/janeyx99
2025-05-05 16:06:17 +00:00
7a2df6a00b [PGNCCL] Add FP8 support (#152706)
NCCL added support for `Float8e4m3` and `Float8e5m2` in 2.24.

NVIDIA GPUs do not seem to support the following "no negative zero" versions: `Float8_e4m3fnuz` and `Float8_e5m2fnuz`, see https://onnx.ai/onnx/technical/float8.html. So we continue to error out for these two upon a reduction op.
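
A minimal sketch of the newly supported path, assuming NCCL >= 2.24 and an already-initialized NCCL process group:

```
import torch
import torch.distributed as dist

# Assumes dist.init_process_group("nccl", ...) has already been called.
x = torch.randn(1024, device="cuda").to(torch.float8_e4m3fn)
dist.all_reduce(x)  # now supported natively by NCCL 2.24+

# The "fnuz" variants (float8_e4m3fnuz / float8_e5m2fnuz) still error out
# for reduction ops, as described above.
```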

Test plan:
- test_allreduce_float8
- test_reduce_scatter_float8

Resolves #148344

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152706
Approved by: https://github.com/d4l3k, https://github.com/eqy, https://github.com/fduwjj, https://github.com/cyyever
2025-05-05 16:02:27 +00:00
a1516d9e6e Add "#pragma once" to CachingHostAllocator.h (#152800)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152800
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2025-05-05 15:21:14 +00:00
fe36d7dc44 [MPSInductor] Fix truncdiv implementation (#152788)
For integral dtypes it should be just an alias for division

Fixes `GPUTests.test_div7_mps`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152788
Approved by: https://github.com/dcci, https://github.com/jansel
ghstack dependencies: #152663, #152515, #152737, #152743, #152758
2025-05-05 13:31:51 +00:00
87f2bd2439 Remove conda usage in windows binary builds (#151035)
This is related to : https://github.com/pytorch/pytorch/issues/146048
Removing conda from windows binary builds. At this point we are only removing conda and replacing it with python builds. Not rewriting all batch files as python or bash.

Additionally cleanup unused files:
```
.ci/pytorch/windows/internal/static_lib_test.bat
.ci/pytorch/windows/internal/env_fix.bat
.ci/pytorch/windows/internal/vs_install.bat
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151035
Approved by: https://github.com/cyyever, https://github.com/clee2000, https://github.com/malfet
2025-05-05 13:09:05 +00:00
0a470dc7c1 [inductor] fix lowering for cummin, cummax for one element tensors (#151931)
Fixes https://github.com/pytorch/pytorch/issues/151738
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151931
Approved by: https://github.com/eellison
2025-05-05 13:05:59 +00:00
2825a28bf1 Exempt overriding methods from docstring_linter (fix #151692) (#151906)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151906
Approved by: https://github.com/Skylion007
2025-05-05 12:39:42 +00:00
9210a98b92 [xla hash update] update the pinned xla hash (#152809)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152809
Approved by: https://github.com/pytorchbot
2025-05-05 11:21:11 +00:00
ac9fcd6346 [Inductor][CPU] bug fix for int8 GEMM compensation epilogue (#152408)
Fixes #152398

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152408
Approved by: https://github.com/leslie-fang-intel
2025-05-05 08:26:47 +00:00
7e637de9cb [Flight Recorder] Added logging after FR dump completed (#152648)
Summary: TSIA

Test Plan: eyes

Differential Revision: D74041147

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152648
Approved by: https://github.com/fduwjj, https://github.com/wdvr
2025-05-05 06:17:47 +00:00
0ffd31dc8a [MPS] Migrate div roudning modes (#152758)
By implementing `div_floor` and `div_trunc`. Do not mark `div_trunc` as OPMATH, to align the following output with CPU (if the division were performed in fp32, the result would be truncated to 25 instead of 24):
```
import torch
print(torch.tensor([[-7.4688, -3.1289]], dtype=torch.float16,device="cpu").div(torch.tensor([-0.2988, -0.8789], dtype=torch.bfloat16,device="cpu"), rounding_mode="trunc"))
tensor([[24.,  3.]])
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152758
Approved by: https://github.com/dcci
ghstack dependencies: #152663, #152515, #152737, #152743
2025-05-05 03:02:29 +00:00
93d8f6ee32 [reland] Detailed triton kernel logging (#152694)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152694
Approved by: https://github.com/Skylion007
2025-05-05 02:46:57 +00:00
a78eec88b8 Implement util function compute_global_tensor_shape for 1D device mesh (#152751)
### Summary

Recreating #151990 to mitigate easyCLA failure

The compute_global_tensor_shape util function takes in a local tensor shape, a device mesh,
and placements. We all-gather the shapes from the shards and, according to the placement
type, construct the global shape.

Note: currently only implemented for placement types Shard and Replicate; TODO for StridedShared
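
Conceptually, the reconstruction for a 1D mesh looks like the following simplified sketch (not the actual util; it only shows how the gathered local shapes combine per placement):

```
from torch.distributed.tensor import Replicate, Shard

def combine_global_shape(gathered_shapes, placement):
    # gathered_shapes: one local shape per rank, as produced by the all-gather.
    global_shape = list(gathered_shapes[0])
    if isinstance(placement, Shard):
        # The sharded dim is the sum of every rank's local size on that dim.
        global_shape[placement.dim] = sum(s[placement.dim] for s in gathered_shapes)
    # Replicate: every rank holds the full tensor, so the local shape is global.
    return tuple(global_shape)

print(combine_global_shape([(2, 4)] * 4, Shard(0)))      # (8, 4)
print(combine_global_shape([(2, 4)] * 4, Replicate()))   # (2, 4)
```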

### Test

`pytest test/distributed/tensor/test_utils.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152751
Approved by: https://github.com/XilunWu
2025-05-05 02:44:31 +00:00
30453d60dd Add methods for checking Triton availability to the device interface (#152529)
Adds the `is_triton_capable` and `raise_if_triton_unavailable` class methods to the device interface, to allow device types to run their own checks for Triton _capability_ (which means a device can actually support Triton in the first place) and _availability_ (if the correct backend of Triton is installed and is functional for the device).

Using the device interface allows us to do these checks in a device-agnostic way, allow external backends to attest their Triton support by simply implementing those methods. The intention is for this to back things like the `has_triton` utility method.
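
For an out-of-tree device, implementing these checks might look roughly like the sketch below (the method signatures and bodies are assumptions based on the description above):

```
from torch._dynamo.device_interface import DeviceInterface

class MyAcceleratorInterface(DeviceInterface):
    @classmethod
    def is_triton_capable(cls) -> bool:
        # Capability: can this device type support Triton at all
        # (e.g. check the hardware generation here)?
        return True

    @classmethod
    def raise_if_triton_unavailable(cls) -> None:
        # Availability: is a working Triton backend installed for this device?
        try:
            import triton  # noqa: F401
        except ImportError as e:
            raise RuntimeError("Triton is not installed for 'my_accelerator'") from e
```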

This has been split from #139171.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152529
Approved by: https://github.com/jansel
2025-05-05 00:55:53 +00:00
8dbe1ff34b Revert "Avoid triggering ignored requires_grad warning in our code (#152686)"
This reverts commit f51bee137518cde82e88ec655988e7eb1b94a3f3.

Reverted https://github.com/pytorch/pytorch/pull/152686 on behalf of https://github.com/wdvr due to failinginternal test, discussed with author ([comment](https://github.com/pytorch/pytorch/pull/152686#issuecomment-2849497208))
2025-05-04 23:34:34 +00:00
49b9efdf1f [BE]: Cleanup traceutils with fmtlib (#152265)
Simplify code and make it faster.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152265
Approved by: https://github.com/albanD, https://github.com/cyyever
2025-05-04 22:27:19 +00:00
82cb202de7 [Inductor][NCU] Add kernel name filtering, and allow custom metrics (#150872)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150872
Approved by: https://github.com/FindHao

Co-authored-by: Yueming Hao <yhao@meta.com>
2025-05-04 20:49:19 +00:00
b117a6c47b Fix two error messages involving Tensor.dense() (#152631)
Two error messages in the codebase instruct the user to use `Tensor.dense()`. This method doesn't exist, but `Tensor.to_dense()` does, and this is what the user should be using instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152631
Approved by: https://github.com/jansel
2025-05-04 20:44:08 +00:00
220870ce9e [caffe2] Support building for armv8.1 (#152766)
Summary:
- Remove explicit `-march=` compiler flags, as they're already implied by
   the toolchain:
https://www.internalfb.com/code/fbsource/[7f85b0565073]/fbcode/tools/build/buck/wrappers/defs.bzl?lines=819
- Gate non-8.1 compliant opcodes with `__ARM_FEATURE_*`.

Test Plan: CI

Reviewed By: rahulg

Differential Revision: D74023601

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152766
Approved by: https://github.com/Skylion007
2025-05-04 19:09:21 +00:00
a69da90a9f Add pad limit of avg_poolnd and AvgPoolnd (#152680)
Fixes #152156

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152680
Approved by: https://github.com/mikaylagawarecki
2025-05-04 17:25:22 +00:00
cyy
370e23388d Set CMake 3.5 as minimum version in pytorch_android (#152769)
I saw a pytorch_android failure in docker image builds. This fix attempts to bypass CMake 4 limitations.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152769
Approved by: https://github.com/Skylion007
2025-05-04 16:57:22 +00:00
8f54e56e62 Add optional device index to AOTIModelPackageLoader (#152093)
This is my suggestion for resolving #152087

This PR extends the constructor of `AOTIModelPackageLoader` with an (optional) device index. The device type is still determined by `metadata_["AOTI_DEVICE_KEY"]`, but the `device_index` argument can be used to move an AOTI model package to different devices like `cuda:0`, `cuda:1`, ... in a convenient way. AFAIK, this is not possible so far using `AOTIModelPackageLoader` alone. The default case (no device index specified) with `metadata_["AOTI_DEVICE_KEY"] == "cuda"` would lead to the current behavior, i.e., the model is loaded to device `cuda`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152093
Approved by: https://github.com/desertfire
2025-05-04 11:40:12 +00:00
fd8fd01d25 [OpenReg] Add _lazy_init and rng_state support for OpenReg (#151914)
As the title stated.

**Changes**:
- Add get_rng_state & set_rng_state support for OpenReg
- Add _lazy_init support for OpenReg
- Remove redundant code for cuda/Module.cpp
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151914
Approved by: https://github.com/albanD
2025-05-04 09:42:08 +00:00
c8bac51ec1 Remove the unnecessary cuda/Tensor.cpp (#152522)
As the title stated.

**Question:**

I have carefully looked through all the .h files included in Tensor.cpp, and from my perspective this file does not make sense. Does anyone know the background for having it?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152522
Approved by: https://github.com/Skylion007, https://github.com/albanD, https://github.com/eqy
ghstack dependencies: #152512, #152513, #152521
2025-05-04 07:15:11 +00:00
8562457cba Make torch/csrc/utils.h to be device-agnostic (#152521)
`torch/csrc/utils.h` should be device-independent. Currently, it contains CUDA-related implementations, which indirectly causes the [failure of ROCm testing](https://github.com/pytorch/pytorch/pull/151914#issuecomment-2839691038) (The reason is that the ROCm test environment shouldn't expose HIP-related header files, which causes the JIT compilation to fail during testing)

Therefore, move CUDA-related implementations to `torch/csrc/cuda/utils.h`.

**Question:**
This change may introduce a BC break.
I searched for this function globally on GitHub and I think the impact is very small.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152521
Approved by: https://github.com/Skylion007, https://github.com/albanD
ghstack dependencies: #152512, #152513
2025-05-04 07:15:11 +00:00
e889937850 [MPS] Migrate div to Metal (#152743)
TODOs:
 - Verify accuracy of  `metal::dot` vs `x.x*x.x + y.y*y.y`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152743
Approved by: https://github.com/dcci, https://github.com/Skylion007
ghstack dependencies: #152663, #152515, #152737
2025-05-04 00:56:19 +00:00
8faa225695 Revert "[inductor] Realize bucketize/searchsorted output (#152644)"
This reverts commit 9ae4906b21cbd186a493a9564e22a42da2184e3a.

Reverted https://github.com/pytorch/pytorch/pull/152644 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/152644#issuecomment-2848743442))
2025-05-03 18:16:39 +00:00
6ae690f8f0 add support for 0 size shardedTensor and recalculate metadata from all_gather (#152583)
Summary:
change set
1. A ShardedTensor could have 0 size initially; the current check won't pass if the size is 0, so that case is handled here.
2. When we call ShardedTensor._init_from_local_shards, it will assume all the metadata is correct and all_gather to double-check. In the new case, the metadata could be all 0 size while the tensor has an actual size, so we need to provide the capability to recalculate the local/global metadata from the local tensor by all_gathering the information.

Test Plan: I don't see an associated UT; I have tested this with the diff stack D73274786.

Differential Revision: D73903933

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152583
Approved by: https://github.com/q10, https://github.com/fduwjj
2025-05-03 17:26:29 +00:00
762844355e Make DispatchKeySet serializable; add __eq__ (#152732)
These seem like reasonable things to add. Also fixes a bug in vLLM for
me.
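
A minimal sketch of what this enables, assuming the serialization and `__eq__` land as described:

```
import pickle
import torch

ks = torch._C.DispatchKeySet(torch._C.DispatchKey.CPU)

# Serialization round-trip plus the new __eq__.
restored = pickle.loads(pickle.dumps(ks))
assert restored == ks
```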

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152732
Approved by: https://github.com/bdhirsh
2025-05-03 14:40:06 +00:00
792736f9ac [BE][MPS] Pass alpha by reference (#152737)
As it's always a scalar
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152737
Approved by: https://github.com/dcci
ghstack dependencies: #152663, #152515
2025-05-03 08:31:45 +00:00
cc28b43950 Revert "[ROCm] Upgrade ROCm CI to ROCm6.4 (#151368)"
This reverts commit 844842dfbf937c43b41c528e461d3f3931bca6e9.

Reverted https://github.com/pytorch/pytorch/pull/151368 on behalf of https://github.com/malfet due to This broke inductor cpp wrapper ([comment](https://github.com/pytorch/pytorch/pull/151368#issuecomment-2848519706))
2025-05-03 08:31:31 +00:00
457fa820ad [c10d] Fix extra CUDA context created by barrier (#149144)
Fixes #149119.

In ProcessGroup.hpp, we create a dummy tensor for dispatching. This
requires a correct device index. This PR uses `device_id` given by user
when calling `init_process_group`.

This PR also uses `torch._C._get_accelerator()` to determine the device
type.
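
On the user side, this is the pattern that now avoids the extra context (a usage sketch):

```
import os
import torch
import torch.distributed as dist

local_rank = int(os.environ["LOCAL_RANK"])
# Giving init_process_group the device lets barrier() create its dummy tensor
# on the right GPU instead of spawning an extra CUDA context on cuda:0.
dist.init_process_group("nccl", device_id=torch.device("cuda", local_rank))
dist.barrier()
```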

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149144
Approved by: https://github.com/XilunWu, https://github.com/fduwjj, https://github.com/cyyever
2025-05-03 03:13:34 +00:00
34e9f0b5c6 [MPS] Migrate mul to TensorIterator (#152515)
What was initially supposed to be a very straightforward change resulted in a small refactor of binary op tensor generators when invoked for mixed dtypes, which surfaced via a `test_output_grad_match_sinc_mps_float16` test failure.

If operands are of different dtypes (in particular a float16 tensor and a float32 scalar), one must perform the operation with `opmath_t` (or `TensorIterator::common_dtype()`) precision, rather than casting both operands to the output dtype and then performing it, as demonstrated by the following example:
```
>>> torch.tensor([-1.8633, 6.2031, -2.2500, -3.3926,  8.5938,  5.9766], dtype=torch.half).mul(torch.pi)
tensor([ -5.8555,  19.4844,  -7.0703, -10.6562,  27.0000,  18.7812],
       dtype=torch.float16)
>>> torch.tensor([-1.8633, 6.2031, -2.2500, -3.3926,  8.5938,  5.9766], dtype=torch.half).mul(torch.tensor(torch.pi, dtype=torch.float16))
tensor([ -5.8516,  19.4844,  -7.0664, -10.6562,  26.9844,  18.7656],
       dtype=torch.float16)
```

Solve this problem for now by introducing `REGISTER_OPMATH_BINARY_OP`, which indicates that operands must be cast to opmath_t before performing the computation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152515
Approved by: https://github.com/Skylion007, https://github.com/kulinseth, https://github.com/dcci
ghstack dependencies: #152663
2025-05-03 02:35:03 +00:00
1cd68c59dd Remove incorrect assertion (#152653)
It's only aspirational that the 'improvement' value is positive. In fact
the pass could make a collective more exposed and we shouldn't assert
here in that case

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152653
Approved by: https://github.com/eellison
ghstack dependencies: #152565
2025-05-03 02:33:58 +00:00
84aa0985fb [Inductor] Add decomposeK as an autotuning choice for mm (#150654)
As a result of adding subgraph as a choice to inductor https://github.com/pytorch/pytorch/pull/149761 and enabling FP32 output from PyTorch GEMMs from FP16/BF16 inputs: https://github.com/pytorch/pytorch/pull/150812, this PR enables decompose_k as an autotuning choice for Inductor in generating the fastest matmuls with Triton. DecomposeK is currently only enabled for `torch.compile`.
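
Usage is the normal compile path; decompose_k simply becomes one more autotuning candidate for split-K-shaped matmuls (a sketch, assuming a CUDA device and a max-autotune run):

```
import torch

@torch.compile(mode="max-autotune")
def mm(a, b):
    return a @ b

# A split-K-friendly shape: small M/N, large K, where decomposing over K
# can beat a plain Triton or aten mm.
a = torch.randn(64, 32768, device="cuda", dtype=torch.bfloat16)
b = torch.randn(32768, 64, device="cuda", dtype=torch.bfloat16)
out = mm(a, b)
```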

Followups:
* decompose_k does not currently support epilogue fusion, which will take some work to enable
* Enable autotuning the bmm with Triton Templates as well without requiring tons of more compile time, async compilation. Anecdotal evidence shows that Triton BMM performs better usually than aten BMM
* Add for addmm
* Enable for Inference and AOTI

Below are the results of running TritonBench for Split-K shapes, comparing the aten performance versus pt2_triton, which now autotunes on decompose_k, seeing >10% speedup compared to aten on average, and for some shapes over 3x the performance of the best Triton mm previously:

<img width="929" alt="Screenshot 2025-04-28 at 9 15 39 PM" src="https://github.com/user-attachments/assets/27d85bbc-4f3a-43a6-a8fa-d4a5bbb8c999" />

TorchInductor Benchmark Dashboard:
<img width="1727" alt="Screenshot 2025-04-30 at 2 02 53 PM" src="https://github.com/user-attachments/assets/4acd7ffc-407f-4cfd-98bb-2e3d8b1f00b3" />

We see speedups across all runs for training. Compile time increased as expected, with more `mm` options to tune over.

Differential Revision: [D73820115](https://our.internmc.facebook.com/intern/diff/D73820115)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150654
Approved by: https://github.com/eellison
2025-05-03 02:23:54 +00:00
5e9682719f [Inductor UT] Generalize device-bias code in test_flex_attention.py (#151937)
@EikanWang @etaf @guangyey please take a look

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151937
Approved by: https://github.com/drisspg
2025-05-03 01:12:49 +00:00
73b6b1ded4 [inductor][invoke_subgraph] Free the buffers before the subgraph call (#152494)
Before
![image](https://github.com/user-attachments/assets/62b24c14-69e6-40fb-94e3-223930132ef6)

After
![image](https://github.com/user-attachments/assets/9f340d4e-80a9-45aa-9400-626fff5b5ecd)

tlparse - https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmph5dwWt/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152494
Approved by: https://github.com/Skylion007, https://github.com/eellison
2025-05-03 00:38:08 +00:00
36140e01fd Rename "startup-tracing-compile" to "compile-time" in label_to_label.yml (#152711)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152711
Approved by: https://github.com/oulgen
2025-05-03 00:35:05 +00:00
3d777bae10 Inductor respects exact strides on custom ops by default (#150511)
If a tag is not specified on a custom operator, then inductor will
assume that it needs exact strides.
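
For illustration, a minimal custom-op sketch (the op name and shapes are hypothetical); with no layout tag on the op, inductor now defaults to passing inputs with their exact strides:
```
import torch

@torch.library.custom_op("mylib::scale", mutates_args=())
def scale(x: torch.Tensor, alpha: float) -> torch.Tensor:
    return x * alpha

@scale.register_fake
def _(x, alpha):
    return torch.empty_like(x)

@torch.compile
def f(x):
    # x.t() is non-contiguous; under the new default, its exact strides are
    # preserved when it is passed to the custom op.
    return scale(x.t(), 2.0)

f(torch.randn(4, 8))
```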

Test Plan:
- tests + CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150511
Approved by: https://github.com/eellison, https://github.com/shunting314
ghstack dependencies: #148104
2025-05-03 00:02:24 +00:00
2b37a726e0 Refactor layout constraint selection logic (#148104)
This PR:

- cleans up some existing comments that don't make sense anymore
- hooks up the "custom_op_default_layout_constraint" back (that seems to
have broken)
- cleans up the "lazy registration path" which seems to never get hit
anymore
- adds dislike_padding to nodes that require exact strides

Test Plan:
- tests + CI

disable padding

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148104
Approved by: https://github.com/shunting314, https://github.com/eellison
2025-05-03 00:02:24 +00:00
0e59b594ee [SymmMem] Use cub's BlockScan instead of in-house impl for offset calculation (#151993)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151993
Approved by: https://github.com/ngimel
ghstack dependencies: #151261, #151498, #151819
2025-05-02 23:40:47 +00:00
2107d87dc9 [BE] remove outdated warning about TORCH_CUDA_ARCH_LIST (#152715)
I saw this warning when compiling a 3rd-party lib and did not agree with it. I'm not sure of the original reason why we would want to force people to pass TORCH_CUDA_ARCH_LIST to cmake vs. setting it as an env var. As a developer, it's much easier to set it as an env var or have it be autodetected. I also realized this warning was from before 2018!!! 7 years ago! And there are no plans to actually enforce this (nor should there be), so let's remove this misleading warning.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152715
Approved by: https://github.com/malfet, https://github.com/zou3519
2025-05-02 23:00:51 +00:00
a6ea63a841 [FlexAttention] explicilty create grad_q w/ strides (#152641)
Fixes: #147463

There is a mismatch between inductor's lowering for empty_like and the behavior of eager: the strides do not respect preserve_format.

https://github.com/pytorch/pytorch/issues/144699

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152641
Approved by: https://github.com/xmfan
2025-05-02 22:57:26 +00:00
54f29b04d6 Improve error wording in _link_check.yml (#152726)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152726
Approved by: https://github.com/huydhn
2025-05-02 22:43:05 +00:00
730a077d48 [ROCm] Unskipped test_rnn_dropout_state for ROCm (#152339)
Unskipping the test, should work fine now.

Related PR: https://github.com/pytorch/pytorch/pull/144572

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152339
Approved by: https://github.com/jeffdaily
2025-05-02 22:02:30 +00:00
ea12a38668 [associative_scan] Refactoring of input checking and dynamo invocation (#148657)
This PR is the counterpart of https://github.com/pytorch/pytorch/pull/142125 for the associative_scan operation. The input checks are refactored so that the combine_fn is no longer invoked in the frontend to check the output trees; instead, dynamo is used for that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148657
Approved by: https://github.com/ydwu4
2025-05-02 21:39:28 +00:00
8afe40bc5e [Inductor] Fix kernel argument ordering when using dynamic shapes with workspace (#152660)
Summary:
This PR fixes a bug in the Triton kernel invocation path where the `workspace_tensor` was inserted before the unpacked `extra_args` list in the final kernel argument list. This broke the expected ordering of arguments when dynamic shape size hints are emitted.

When dynamic shapes are used, `extra_args` contains both size hint arguments and grid arguments. The kernel expects the argument list to follow the order: **size hints → workspace tensor → grid args**. But previously, the `workspace_tensor` was inserted before unpacking `extra_args`, resulting in: **workspace tensor → size hints → grid args**, which is incorrect.

This fix constructs the workspace tensor earlier, allowing it to be slotted in after the size hints and before the grid arguments, restoring the expected argument layout.
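
A tiny illustrative sketch of the required ordering (the helper and argument names are hypothetical, not the PR's code):
```
# Final Triton call args with dynamic shapes enabled must be ordered:
# size hints -> workspace tensor -> grid args.
def build_call_args(tensor_args, size_hint_args, workspace_tensor, grid_args):
    args = [*tensor_args, *size_hint_args]
    if workspace_tensor is not None:
        # Slot the workspace in after the size hints and before the grid args.
        args.append(workspace_tensor)
    args.extend(grid_args)
    return args
```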

Test Plan:
contbuild and OSS CI

Reviewers: paulzhan

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152660
Approved by: https://github.com/PaulZhang12, https://github.com/drisspg
2025-05-02 21:32:07 +00:00
add4702ebc [Inductor] Introduce Wrapper IR line for symbolic call args (#152587)
Preparatory refactor for https://github.com/pytorch/pytorch/pull/146942.

This PR introduces a new wrapper IR line to represent symbolic call args. This deletes a little bit of duplicated code between the Python and C++ backends. In the main PR, having a Wrapper IR line for this also tells the FX backend what this part of the wrapper code is doing. Before this PR, symbolic call args generated raw Python lines, which confuse the FX converter.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152587
Approved by: https://github.com/jansel
2025-05-02 20:37:00 +00:00
9ae4906b21 [inductor] Realize bucketize/searchsorted output (#152644)
**Context**:
bucketize is relatively expensive, computationally. So it's not always profitable to fuse it if it means doing extra computation. For example, this repro:

https://gist.github.com/davidberard98/7fd6af7e6291787c246c705945a25554

shows a slowdown from 56us (eager) to ~100us (torch.compile-d): instead of computing 2\*\*15 binary searches, the fused version does 2\*\*15 * 384 - one for each of the broadcasted outputs.

**Solution**:
Realize the output of bucketize (and searchsorted, which also uses inductor's ops.bucketize). If there's an opportunity to do non-broadcasted fusions, the scheduler can still apply such fusions later on.

After this PR, instead of a slowdown, we see an improvement from 56us (eager) to 33us (compiled).
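
A rough sketch of the kind of pattern described (shapes follow the numbers in the message; the exact repro lives in the linked gist):
```
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
boundaries = torch.linspace(0, 1, 1024, device=device)
values = torch.rand(2**15, device=device)
other = torch.randn(2**15, 384, device=device)

def f(values, boundaries, other):
    idx = torch.bucketize(values, boundaries)  # expensive binary search, shape (2**15,)
    # Broadcasting against `other`: if bucketize were fused here, it would be
    # recomputed once per broadcasted element; realizing its output avoids that.
    return idx.unsqueeze(-1) + other

out = torch.compile(f)(values, boundaries, other)
```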

Differential Revision: [D74036850](https://our.internmc.facebook.com/intern/diff/D74036850)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152644
Approved by: https://github.com/benjaminglass1, https://github.com/eellison
2025-05-02 20:31:17 +00:00
74b496e54c Cleanup DeviceInterface in triton test (#152409)
- Remove inherited functions
- Return valid device_count (1 device: idx=0)
- Remove unused function `triton_supported`

Followup to #144399

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152409
Approved by: https://github.com/jansel
2025-05-02 20:25:32 +00:00
44f29a3669 Add parameters for monitor (#152541)
Add the log interval and log-data-collect interval to all test yml files.

Add an upload step to all test yml files.

Next step:
enable the perf test with utilization

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152541
Approved by: https://github.com/huydhn
2025-05-02 20:24:11 +00:00
ec68d082a1 [CUDA][TF32] Account for TF32 in test_conv2d_same_padding (#152618)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152618
Approved by: https://github.com/msaroufim, https://github.com/Skylion007
2025-05-02 20:19:00 +00:00
39c0b01970 [ez] Disable failing test in periodic no gpu no avx (#152698)
Failing on periodic after it was added in #152542.
Example:
inductor/test_cpu_repro.py::CPUReproTests::test_tanh_atan2_use_decompose_tanh [GH job link](https://github.com/pytorch/pytorch/actions/runs/14775755628/job/41485185829) [HUD commit link](6f6acb4128)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152698
Approved by: https://github.com/huydhn, https://github.com/hl475
2025-05-02 20:02:48 +00:00
a6dd1c2208 [DCP] Add 30min timeout for IPC communications in async checkpointing (#152629)
Summary:
### Diff Context
- Sometimes the background process can get stuck processing an async checkpoint request, and trainer shutdown can occur before the background process completes.
- Fix: time out the thread while reading the IPC queue for a response from the background process (a rough sketch follows below).
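
A minimal sketch of the timeout pattern, assuming a plain `queue.Queue`-style IPC handle (the names and exact queue type are hypothetical):
```
import queue

IPC_TIMEOUT_S = 30 * 60  # 30 minutes, per this change

def wait_for_background_response(ipc_queue):
    # Bound the wait on the IPC response so a stuck background checkpoint
    # process cannot hang trainer shutdown indefinitely.
    try:
        return ipc_queue.get(timeout=IPC_TIMEOUT_S)
    except queue.Empty:
        raise TimeoutError("timed out waiting for async checkpoint background process")
```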

Differential Revision: D74017700

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152629
Approved by: https://github.com/saumishr
2025-05-02 19:36:22 +00:00
5d860c1e54 [ROCm][CI] Enabled fp8 distributed tests in test_micro_pipeline_tp.py for MI300 (#151977)
This PR enables fp8 distributed tests on MI300.
To test the added feature, the distributed.tensor.parallel.test_micro_pipeline_tp test was run; all the tests passed successfully and no tests were skipped.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151977
Approved by: https://github.com/jeffdaily
2025-05-02 19:22:18 +00:00
d457b4492d Optimize Sequential methods description (#147304)
Fixes #146892

Add methods description and examples for [`Sequential` document](https://pytorch.org/docs/stable/generated/torch.nn.Sequential.html)

## Test Result

### Before

![image](https://github.com/user-attachments/assets/3121a06f-02ed-4362-ad0a-f055bb43d469)

### After

![image](https://github.com/user-attachments/assets/66f6bb55-5298-4062-8f7f-7a7f4c1e16d9)
![image](https://github.com/user-attachments/assets/a5275a4c-4214-4518-b7a2-dff21954f368)
![image](https://github.com/user-attachments/assets/9c40d1fb-114a-4d14-a3c4-1143a131660e)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147304
Approved by: https://github.com/mikaylagawarecki
2025-05-02 19:18:58 +00:00
eqy
216d81da81 [CUDA][complex] skip test_reference_numerics_large_jiterator_unary_cuda_complex64 on CUDA (#148024)
This test is already skipped on ROCm for a similar reason: recent numpy versions changed the convention from `nan+infj` to `-inf+infj`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148024
Approved by: https://github.com/nWEIdia, https://github.com/atalman, https://github.com/malfet
2025-05-02 19:11:11 +00:00
16153a0f27 [AOTAutogradCache][Easy] Move "einops.einops.rearrange" to SAFE_NON_TORCH_FUNCTIONS (#152640)
As title.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152640
Approved by: https://github.com/oulgen, https://github.com/zou3519, https://github.com/bdhirsh
2025-05-02 19:09:30 +00:00
0488883d6e [cuDNN][SDPA] Fix head-dim 256 condition for SM 10.0 (#152076)
turns out the backward is not supported yet, whoops

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152076
Approved by: https://github.com/drisspg
2025-05-02 18:43:33 +00:00
07290bdcdc Skip search for MKL on ARM cpus (#145850)
It will not find it anyway, and this makes it a bit easier to parse through the CMake log on non-x86 systems.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145850
Approved by: https://github.com/atalman
2025-05-02 18:39:49 +00:00
1ea2731e26 [ROCm] Add support for SymmetricMemory (#150580)
This is an attempt to re-land the initial PR https://github.com/pytorch/pytorch/pull/134817 with recent design changes from upstream.

**NOTE:**
ROCm currently does NOT have multicast/multimem hardware support, so those features are disabled in symmetric memory for ROCm. This also means that we currently do not have a way of lowering add + all_reduce + wait_tensor into a one_shot_all_reduce op in inductor, as that depends on multicast buffer support.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150580
Approved by: https://github.com/jeffdaily, https://github.com/kwen2501, https://github.com/yoyoyocmu

Co-authored-by: Xiaodong Wang <xdwang@fb.com>
2025-05-02 18:35:14 +00:00
376529c78b consolidate guard_or_x and definitely_x (#152463)
definitely_true is almost the same as guard_or_false; the potential differences are not meaningful enough to justify the
existence of both. The same goes for definitely_false, which can be expressed with guard_or_true and guard_or_false.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152463
Approved by: https://github.com/bobrenjc93
2025-05-02 18:08:11 +00:00
72337bdcf2 [ATen][CUDA] Optimize 128 bit vectorization (#148320)
Fixes #147376.
As per request: https://github.com/pytorch/pytorch/pull/145746#pullrequestreview-2642118301
This PR omits vec8 kernels on sm80 and older due to long compilation times and large binary size.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148320
Approved by: https://github.com/eqy, https://github.com/malfet, https://github.com/atalman
2025-05-02 17:35:44 +00:00
3baa85cfad [StaticCudaLauncher] Ensure cuda context exists before launching kernels (#152667)
Triton does this already due to  https://github.com/triton-lang/triton/pull/3731/files, in order to fix https://github.com/pytorch/pytorch/issues/124565. We need to do the same thing as triton here, so that in cases with no compilation we still have a cuda context in the backward autograd thread.
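
A rough sketch of the intent in Python terms (the actual change lives in the launcher; the initialization calls here are illustrative):
```
import torch

def ensure_cuda_context(device_index: int = 0):
    # Make sure a CUDA context exists on the current thread (e.g. the autograd
    # backward thread) before launching a cached kernel without compilation.
    if not torch.cuda.is_initialized():
        torch.cuda.init()
    torch.cuda.set_device(device_index)
```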

Fixes https://github.com/pytorch/pytorch/issues/152639

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152667
Approved by: https://github.com/oulgen
2025-05-02 17:29:57 +00:00
f51bee1375 Avoid triggering ignored requires_grad warning in our code (#152686)
This one is ok to silence as we're just doing formatting

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152686
Approved by: https://github.com/Skylion007
2025-05-02 17:27:47 +00:00
844842dfbf [ROCm] Upgrade ROCm CI to ROCm6.4 (#151368)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151368
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-05-02 17:21:18 +00:00
f65fb0a23d Make PGO code state not sensitive to file path by hashing file content when the file is available. (#152628)
In some internal frameworks, on second attempts the actual code is copied to a different path than on previous attempts,
but it is still the same code. PGO would not work in those cases because, before this PR, state entries were identified by (filepath, function name, line number).

After this PR they are identified by (hash of the file content when available, function name, line number). This way PGO will work for those jobs on future attempts, and re-compilations of static versions will be avoided.

Sometimes we do not have access to the source code (the file does not exist).
This seems to happen mostly when we re-trace a compiled function, but in general it can happen.
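
A minimal illustration of the keying change under those assumptions (the helper name and exact key layout are hypothetical):
```
import hashlib
import os

def code_state_key(filepath: str, func_name: str, lineno: int):
    # Key PGO code state by file *content* when the file is readable, so the key
    # survives the source being copied to a different path between attempts;
    # fall back to the raw path when the file is not available.
    if os.path.exists(filepath):
        with open(filepath, "rb") as f:
            file_id = hashlib.sha256(f.read()).hexdigest()
    else:
        file_id = filepath
    return (file_id, func_name, lineno)
```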

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152628
Approved by: https://github.com/oulgen
2025-05-02 17:11:21 +00:00
ea4b7e0e1d [invoke_subgraph] Simplify output code for subgraph output node (#152490)
Before - [manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmppQg3F8/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000](https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmppQg3F8/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000)
![image](https://github.com/user-attachments/assets/8fecdc23-eb78-4e15-9d03-c4bae4b49434)

After fix - https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmp9a5EM0/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000
![image](https://github.com/user-attachments/assets/8e98120c-d82e-42dc-bc50-a6bfd4f9923c)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152490
Approved by: https://github.com/eellison
ghstack dependencies: #152383
2025-05-02 16:31:25 +00:00
5c0f474dac Do not check out nccl when not building it (#152533)
Add additional conditions to `build_pytorch_libs.py` to avoid fetching NCCL when `USE_CUDA` or `USE_NCCL` are disabled. While at it, adjust the existing condition for `USE_SYSTEM_NCCL` to use the utility function.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152533
Approved by: https://github.com/albanD
2025-05-02 16:31:03 +00:00
f6761f2968 [inductor][subgraph] Simplify the resulting output code for subgraph (#152383)
Check out output code

Before this PR -  - https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmp3iXDVs/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000
![image](https://github.com/user-attachments/assets/ef86eb8f-e8b9-47dd-8609-f90481f018b8)

After this PR - https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpRgUJvq/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

![image](https://github.com/user-attachments/assets/10e22c60-7fb9-4519-9d54-019beff5333b)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152383
Approved by: https://github.com/eellison
2025-05-02 15:52:34 +00:00
cb0cf7e5c7 [MPS][BE] Do not dispatch empty kernels (#152663)
If `iter.numel()` is zero, there is no need to dispatch a kernel.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152663
Approved by: https://github.com/kulinseth
2025-05-02 14:34:53 +00:00
50d4698ac8 Revert "[cutlass backend] Minor lru_cache to slightly speed up filtering ops (#152577)"
This reverts commit 1fef3cdabc3f79fd0cbf9273052057ef6122710f.

Reverted https://github.com/pytorch/pytorch/pull/152577 on behalf of https://github.com/wdvr due to failing test_unary_ufuncs.py::TestUnaryUfuncsCUDA::test_reference_numerics_large_jiterator_unary_cuda_complex64 [GH job link](https://github.com/pytorch/pytorch/actions/runs/14787347116/job/41519095088) [HUD commit link](1fef3cdabc) ([comment](https://github.com/pytorch/pytorch/pull/152577#issuecomment-2846544603))
2025-05-02 07:25:25 +00:00
cyy
e9e1aacef8 Enable -Wunused on torch targets (#150077)
For GCC, ``-Wunused`` contains:
```
-Wunused-function
Warn whenever a static function is declared but not defined or a non\-inline static function is unused.

-Wunused-label
Warn whenever a label is declared but not used.
To suppress this warning use the unused attribute.

-Wunused-parameter
Warn whenever a function parameter is unused aside from its declaration.
To suppress this warning use the unused attribute.

-Wunused-variable
Warn whenever a local variable or non-constant static variable is unused aside from its declaration
To suppress this warning use the unused attribute.
```
For Clang, some of the diagnostics controlled by ``-Wunused`` are enabled by default:
```
Controls [-Wunused-argument](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-argument),
[-Wunused-but-set-variable](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-but-set-variable),
[-Wunused-function](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-function),
[-Wunused-label](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-label), [-Wunused-lambda-capture](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-lambda-capture),
[-Wunused-local-typedef](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-local-typedef),
[-Wunused-private-field](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-private-field),
[-Wunused-property-ivar](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-property-ivar),
[-Wunused-value](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-value), [-Wunused-variable](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-variable).
```
These checks are all useful. This PR aims to enable ``-Wunused`` without breaking code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150077
Approved by: https://github.com/zou3519, https://github.com/wdvr
2025-05-02 07:14:19 +00:00
38a9a8b7f7 Fix: Consider input defined unbacked during inductor codegen for runtime asserts (#152231)
When we use mark_unbacked, the graph will have an unbacked input SymInt. Right now,
deferred runtime assertions that use those symbols are never generated.

This PR changes that, such that in the forward graph we consider those inputs and generate the corresponding
runtime assertions for them. We still ignore them for backward, which is not ideal.

The way we generate a runtime assertion is by emitting it once all the defined unbacked symbols used
in it have been seen.

We previously skipped placeholders, because for backward we have a wacky approach where we
ignore input-defined unbacked symbols, assume that assertions using them were already emitted
in forward, and try to emit all other runtime assertions again. See [Note [Backwards runtime asserts]].

Doing that, we end up only emitting the runtime assertions that depend on things defined solely in backward, but we could miss checks that span inputs defined in both backward and forward (i.e. one symbol defined in forward and passed as an input to backward, and another defined in backward). This is not ideal; an ideal approach could be something like https://github.com/pytorch/pytorch/pull/151919, but it requires more work.
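
For context, a hedged sketch of how an unbacked input SymInt arises from the user side (the decorator location and the `torch._check` condition are illustrative):
```
import torch

def f(x):
    # With size(0) unbacked, conditions like this become deferred runtime asserts
    # in the generated forward rather than compile-time guards.
    torch._check(x.size(0) % 2 == 0)
    return x.reshape(2, -1)

x = torch.randn(8, 4)
torch._dynamo.mark_unbacked(x, 0)  # treat size(0) as an unbacked SymInt input
torch.compile(f, fullgraph=True)(x)
```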

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152231
Approved by: https://github.com/aorenste
2025-05-02 07:01:48 +00:00
829752ba37 [SymmMem] Add all_to_all_vdev (#151819)
- Merge in/out splits into one tensor
- Multi-block
- Use sync instead of barrier
- Use nvshmemx_collective_launch
- Rotate blocks among peers
- Write back input splits
- Parallel scan works
- Use scan for output offsets
- Use at most 16 blocks
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151819
Approved by: https://github.com/ngimel, https://github.com/fduwjj
ghstack dependencies: #151261, #151498
2025-05-02 06:59:21 +00:00
6dadfc4457 Revert "Enable -Wunused on torch targets (#150077)"
This reverts commit 688adc9941f855e78dd4d595682eea16317b7f54.

Reverted https://github.com/pytorch/pytorch/pull/150077 on behalf of https://github.com/wdvr due to failing internally with use of undeclared identifier ([comment](https://github.com/pytorch/pytorch/pull/150077#issuecomment-2846499828))
2025-05-02 06:53:20 +00:00
3731b70b40 [inductor][invoke_subgraph] Remove assertion checks for outputs of invoke_subgraph (#152384)
For invoke_subgraph, input assertions are good. We don't need output assertions. This is the tlparse

Before
![image](https://github.com/user-attachments/assets/4ae14530-3314-4dfa-9297-58f9e3ee4b9c)

After
![image](https://github.com/user-attachments/assets/c1457687-2396-49a7-986b-ef6145fcbf46)

https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmppQg3F8/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152384
Approved by: https://github.com/eellison, https://github.com/zou3519
ghstack dependencies: #152547, #152581
2025-05-02 06:46:05 +00:00
9e3fc41060 [invoke_subgraph] rename identifiers to prevent python mangling (#152581)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152581
Approved by: https://github.com/BoyuanFeng, https://github.com/zou3519
ghstack dependencies: #152547
2025-05-02 06:46:05 +00:00
4f9f1abd6d Revert "Use swap_tensors path in nn.Module.to for all subclasses that override __torch_dispatch__ (#152539)"
This reverts commit 037343657edceb345001e4c0ff226a34ca4c6063.

Reverted https://github.com/pytorch/pytorch/pull/152539 on behalf of https://github.com/wdvr due to failing internal tests - discussed with author ([comment](https://github.com/pytorch/pytorch/pull/152539#issuecomment-2846484924))
2025-05-02 06:43:35 +00:00
d7961a1086 [SymmMem] Add all-to-all (#151498)
Add an all-to-all impl based on NVSHMEM's on-stream API `nvshmemx_alltoallmem_on_stream`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151498
Approved by: https://github.com/fegin, https://github.com/fduwjj
ghstack dependencies: #151261
2025-05-02 06:40:43 +00:00
7c3e679ddd Revert "[Inductor] Add decomposeK as an autotuning choice for mm (#150654)"
This reverts commit fdcfc6a61a2146c7c961073e029ead633113eb9a.

Reverted https://github.com/pytorch/pytorch/pull/150654 on behalf of https://github.com/wdvr due to Failing ROCM tests: inductor/test_subgraph_choice.py::TestSubgraphChoice::test_subgraph_decompose_k [GH job link](https://github.com/pytorch/pytorch/actions/runs/14786111108/job/41515742446) [HUD commit link](3c54e0c216) ([comment](https://github.com/pytorch/pytorch/pull/150654#issuecomment-2846470409))
2025-05-02 06:31:38 +00:00
4649fd17b0 [invoke_subgraph] Unpacked operands (#152547)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152547
Approved by: https://github.com/ydwu4, https://github.com/zou3519
2025-05-02 05:44:46 +00:00
e6989ceea9 Revert "[BE] Update numba versions (#152557)"
This reverts commit b5995cb67f8543f148b9216e140980e6844aadff.

Reverted https://github.com/pytorch/pytorch/pull/152557 on behalf of https://github.com/clee2000 due to test_unary_funcs failure seems real? [GH job link](https://github.com/pytorch/pytorch/actions/runs/14787082066/job/41518415014) [HUD commit link](b5995cb67f) ([comment](https://github.com/pytorch/pytorch/pull/152557#issuecomment-2846336004))
2025-05-02 05:22:17 +00:00
ac5de6d55a Remove unnecessary __STDC_FORMAT_MACROS macro (#152513)
As the title stated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152513
Approved by: https://github.com/cyyever, https://github.com/albanD
ghstack dependencies: #152512
2025-05-02 05:06:44 +00:00
d969e2ec33 [CUDAGraph Trees] support memory allocation on side stream (#152472)
I tried `beginAllocateToPool` instead of `_cuda_beginAllocateCurrentStreamToPool` and the error in #151199 no longer happens.

However, this approach is unsafe for multithreading. When multiple run_eager calls happen concurrently, we expect memory allocations to go to different mem_pools. Since beginAllocateToPool does not check the stream, these memory allocations may end up in the same mem_pool.

So, I use `_cuda_beginAllocateCurrentThreadToPool` to direct all memory allocation on the same thread to a given mem_pool. In particular, `_cuda_beginAllocateCurrentThreadToPool` records the launching thread id and, during runtime, checks whether the current thread id matches the launching thread id.

Fixes #151199

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152472
Approved by: https://github.com/eellison, https://github.com/ngimel
2025-05-02 04:26:35 +00:00
1f898657e6 [ez] fix grammar mistakes in StatefulSymbolicContext comment (#152598)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152598
Approved by: https://github.com/malfet
ghstack dependencies: #151407
2025-05-02 04:21:16 +00:00
36e5ff6bc4 [CP] Fix the offsets to KV in backward (#152625)
This is more semantically correct, even though we currently assume K and V have the same lengths.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152625
Approved by: https://github.com/XilunWu
2025-05-02 03:30:11 +00:00
1fef3cdabc [cutlass backend] Minor lru_cache to slightly speed up filtering ops (#152577)
For the default level, filtering went from 0.11332 seconds to 0.10064 seconds.

You can't really apply lru_cache too aggressively. For example, hashing a cutlass op takes a long time.

Removing a log further brings it down to 0.07202 seconds.

Differential Revision: [D73971021](https://our.internmc.facebook.com/intern/diff/D73971021/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152577
Approved by: https://github.com/chenyang78
2025-05-02 02:17:50 +00:00
5b5938929f [refactor] refactor dense implementation of auto_functionalized_v2 for better clarity (#152248)
Abstracts away two helper functions (get_mutable_args_from_schema and _generate_new_op_kwargs_from_bases) to make the code better organized and more re-usable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152248
Approved by: https://github.com/zou3519
ghstack dependencies: #152072, #152073, #152244, #152245, #152246, #152247
2025-05-02 02:08:06 +00:00
380327c663 [hop] make materialize_as_graph's include and exclude dispatch key set optional (#152247)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152247
Approved by: https://github.com/zou3519
ghstack dependencies: #152072, #152073, #152244, #152245, #152246
2025-05-02 02:08:06 +00:00
a776a566db [hop][schema] allow adding kw_only info to schema argument (#152246)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152246
Approved by: https://github.com/zou3519
ghstack dependencies: #152072, #152073, #152244, #152245
2025-05-02 02:08:06 +00:00
7e7b9ca18f [hop][be] make check_input_alias_and_mutation_return_ouputs create new fake mode (#152245)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152245
Approved by: https://github.com/zou3519
ghstack dependencies: #152072, #152073, #152244
2025-05-02 02:08:06 +00:00
b5995cb67f [BE] Update numba versions (#152557)
Let's see if PyTorch is compatible with the latest versions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152557
Approved by: https://github.com/Skylion007
2025-05-02 01:51:30 +00:00
cyy
ce94b212c7 [Environment Variable][Rebase] Use thread-safe getenv functions (#140200)
Use our thread-safe getenv wrappers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140200
Approved by: https://github.com/kwen2501, https://github.com/eqy
2025-05-02 00:41:49 +00:00
a5dd7011a0 [ONNX] Delete JitTraceConvertStrategy (#152556)
Fixes #151703

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152556
Approved by: https://github.com/justinchuby
2025-05-02 00:26:43 +00:00
3c54e0c216 [inductor] if unbacked symint in old-size or new-size skip mark_reuse check (#152379)
We could probably make the `mark_reuse` check work with unbacked sizes under certain conditions,
e.g. `x.repeat(u0, 2).repeat(2, u0)`.

But I think cases like those are rare, so the check is skipped for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152379
Approved by: https://github.com/jansel, https://github.com/eellison, https://github.com/jingsh
2025-05-02 00:24:58 +00:00
fdcfc6a61a [Inductor] Add decomposeK as an autotuning choice for mm (#150654)
As a result of adding subgraph as a choice to inductor https://github.com/pytorch/pytorch/pull/149761 and enabling FP32 output from PyTorch GEMMs from FP16/BF16 inputs: https://github.com/pytorch/pytorch/pull/150812, this PR enables decompose_k as an autotuning choice for Inductor in generating the fastest matmuls with Triton. DecomposeK is currently only enabled for `torch.compile`.

Followups:
* decompose_k does not currently support epilogue fusion, which will take some work to enable
* Enable autotuning the bmm with Triton templates as well, without requiring tons more compile time, via async compilation. Anecdotal evidence shows that Triton BMM usually performs better than aten BMM
* Add for addmm
* Enable for Inference and AOTI

Below are the results of running TritonBench for Split-K shapes, comparing the aten performance versus pt2_triton, which now autotunes on decompose_k, seeing >10% speedup compared to aten on average, and for some shapes over 3x the performance of the best Triton mm previously:

<img width="929" alt="Screenshot 2025-04-28 at 9 15 39 PM" src="https://github.com/user-attachments/assets/27d85bbc-4f3a-43a6-a8fa-d4a5bbb8c999" />

TorchInductor Benchmark Dashboard:
<img width="1727" alt="Screenshot 2025-04-30 at 2 02 53 PM" src="https://github.com/user-attachments/assets/4acd7ffc-407f-4cfd-98bb-2e3d8b1f00b3" />

We see speedups across all runs for training. Compile time increased as expected, with more `mm` options to tune over.

Differential Revision: [D73820115](https://our.internmc.facebook.com/intern/diff/D73820115)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150654
Approved by: https://github.com/eellison
2025-05-01 23:01:30 +00:00
64957db6c9 Fix some inductor periodic benchmarks (#152605)
Some were reporting "pass" consistently on https://hud.pytorch.org/
Those are fine to flip.

I filed a separate issue for the now-regressions for AOTI:
https://github.com/pytorch/pytorch/issues/152606. These should be looked
at.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152605
Approved by: https://github.com/eellison, https://github.com/huydhn
2025-05-01 22:18:30 +00:00
7aebb127bf [dynamo][ca] support dynamic annotations on tensors in ListVariables/TupleVariables (#152119)
Together with https://github.com/pytorch/pytorch/pull/151962, FIXES https://github.com/pytorch/pytorch/issues/133575

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152119
Approved by: https://github.com/jansel
ghstack dependencies: #149707, #151860, #151731, #151962
2025-05-01 21:59:55 +00:00
4555ed8c83 [ca] hide unused scalar int sizes from dynamo (#151962)
together with https://github.com/pytorch/pytorch/pull/151731, FIXES https://github.com/pytorch/pytorch/issues/113129 https://github.com/pytorch/pytorch/issues/146168

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151962
Approved by: https://github.com/jansel
ghstack dependencies: #149707, #151860, #151731
2025-05-01 21:59:55 +00:00
18229a5300 [ca] mark scalar int sizes as dynamic via tensor wrapping (#151731)
This is the only way to support dynamic shapes on scalars right now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151731
Approved by: https://github.com/jansel
ghstack dependencies: #149707, #151860
2025-05-01 21:59:49 +00:00
613bd46272 [aot][ca] save bw_module in AOTAutogradCache (#151860)
Compiled Autograd retraces AOT's bw_module at backward runtime into a larger graph, and today this runs into an issue on warm cache runs because the bw_module is not restored. This PR adds it to the cache, first stripping it of unserializable metadata. I also intentionally differentiate the cached and non-cached versions to avoid accidental attempts at AOT compilation with a restored bw_module (which would probably crash).

Note that since the cache entry may be used by runs that use compiled autograd and runs that do not, we need to cache both the lowered backward and the bw_module.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151860
Approved by: https://github.com/jamesjwu
ghstack dependencies: #149707
2025-05-01 21:59:43 +00:00
c461ba6522 [aot] mark dynamic activations as maybe dynamic (#149707)
Today, we mark graph outputs as maybe dynamic; this lets a compilation communicate to future compilations whether certain graph inputs are dynamic. Similarly, we can do this for saved activations, which may be used in future compilations as well. This is especially prevalent in compiled autograd, where tensor activations will always become graph inputs.

Changes to the tests were mainly cosmetic, with the exception of tests that relied on duck shaping. By annotating tensor dims, we prevent them from reusing pre-existing symbols, so this change will make graphs use duck shapes less than before, which affects some of the caching tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149707
Approved by: https://github.com/bdhirsh
2025-05-01 21:59:36 +00:00
b6c5886d09 BE: Swap functorch --> torch._higher_order_ops (#152620)
Summary: Discovered when attempting to resolve arvr builds; this should resolve issues around using functorch through export.

Test Plan:
```
buck2 test arvr/mode/linux/opt //arvr/libraries/xrrp/ml/python/test:convert_to_etvk_test
```

Differential Revision: D74013898

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152620
Approved by: https://github.com/zou3519
2025-05-01 21:53:23 +00:00
1c04ea4e59 Revert "[torchgen] Refactor torchgen.utils.FileManager to accept pathlib.Path (#150726)"
This reverts commit 4b5b1adb21f5d7d66945d78a1f89d2f9d86f15bb.

Reverted https://github.com/pytorch/pytorch/pull/150726 on behalf of https://github.com/malfet due to This breaks Windows builds, see a765e2ddda/1 ([comment](https://github.com/pytorch/pytorch/pull/150726#issuecomment-2845858846))
2025-05-01 21:52:35 +00:00
a765e2ddda [nativert] port enumerate from folly to c10::utill (#152481)
Summary:
nativert RFC: https://github.com/zhxchen17/rfcs/blob/master/RFC-0043-torch-native-runtime.md

To land the runtime into PyTorch core, we will gradually land logical parts of the code into the Github issue and get each piece properly reviewed.

This diff ports an enumeration util from folly into c10.

Test Plan: CI

Differential Revision: D73881042

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152481
Approved by: https://github.com/Skylion007, https://github.com/zhxchen17, https://github.com/cyyever
2025-05-01 21:41:05 +00:00
24b315676d [MPS][BE] Migrate lerp.Scalar.out to tensor iterator (#152514)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152514
Approved by: https://github.com/kulinseth, https://github.com/Skylion007, https://github.com/dcci
2025-05-01 20:11:55 +00:00
f1d636f85b [BE] detect CXX pytree requirement with TorchVersion (#151102)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151102
Approved by: https://github.com/zou3519
2025-05-01 18:55:57 +00:00
8cb6957e01 [export] Ignore None buffers (#152571)
Fixes https://github.com/pytorch/pytorch/issues/152467
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152571
Approved by: https://github.com/yiming0416, https://github.com/yushangdi
2025-05-01 18:18:16 +00:00
037343657e Use swap_tensors path in nn.Module.to for all subclasses that override __torch_dispatch__ (#152539)
Fixes https://github.com/pytorch/pytorch/issues/148977

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152539
Approved by: https://github.com/albanD
2025-05-01 18:04:33 +00:00
4b5b1adb21 [torchgen] Refactor torchgen.utils.FileManager to accept pathlib.Path (#150726)
This PR allows `FileManager` to accept `pathlib.Path` as arguments while keeping the original `str` path support.

This allows us to simplify the code such as:

1. `os.path.join(..., ...)` with `Path.__floordiv__(..., ...)`.

95a5958db4/torchgen/utils.py (L155)

95a5958db4/torchgen/utils.py (L176)

2. `os.path.basename(...)` with `Path(...).name`.
 95a5958db4/torchgen/utils.py (L161)

3. Manual file extension split with `Path(...).with_stem(new_stem)`

95a5958db4/torchgen/utils.py (L241-L256)

------

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150726
Approved by: https://github.com/zou3519
2025-05-01 17:43:16 +00:00
83acb688bb Fix constant folding cloning constants (#152273)
Summary:
Bug fix for #135060
Simple review:
https://github.com/pytorch/pytorch/pull/135060/files#diff-f23386709ff7e1235b15e18f835a48e5124e0ddd596aeb33c201daad1abbedd7R357
We mistakenly typed get_attr as getattr.

This causes constants to never get untagged, and forces all constants to be
cloned twice, which greatly increases memory consumption.
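
For reference, a small sketch of the distinction (FX attribute nodes are matched by the op *string*, not the Python builtin); the helper here is hypothetical:
```
import torch

def constant_attr_nodes(gm: torch.fx.GraphModule):
    # Constants appear as nodes whose op string is "get_attr"; comparing node.op
    # against the builtin `getattr` (the reported typo) never matches, so the
    # constants were never untagged and ended up being cloned.
    return [n for n in gm.graph.nodes if n.op == "get_attr"]
```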

Test Plan:
python test/inductor/test_aot_inductor.py -k test_empty_constant_folding

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152273
Approved by: https://github.com/trieuat, https://github.com/zhxchen17
2025-05-01 17:34:39 +00:00
563a91b144 [cutlass backend] Move cutlass compiled cache to cache_dir (#151825)
Moved "compiled_cache.db" to cache folder.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151825
Approved by: https://github.com/mlazos
2025-05-01 17:26:01 +00:00
1845df05c6 [inductor][BE] Add more debug logs for why fx graph cache doesn't happen (#152487)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152487
Approved by: https://github.com/Skylion007, https://github.com/eellison
2025-05-01 17:25:28 +00:00
f0c9b3385d Support more dtypes for input, indices in gather (#151822)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151822
Approved by: https://github.com/ngimel
2025-05-01 16:35:23 +00:00
4c8dee7986 Revert "[inductor][invoke_subgraph] Remove assertion checks for outputs of invoke_subgraph (#152384)"
This reverts commit c87c823de43b7815c523160778b682973e151794.

Reverted https://github.com/pytorch/pytorch/pull/152384 on behalf of https://github.com/malfet due to Broke CI, see 52cbcac640/1 ([comment](https://github.com/pytorch/pytorch/pull/152384#issuecomment-2845099985))
2025-05-01 15:46:08 +00:00
f7b60456cc Revert "[inductor][subgraph] Simplify the resulting output code for subgraph (#152383)"
This reverts commit 98eb7c8cb1abafaff4e28b07ed91cababc2ce54a.

Reverted https://github.com/pytorch/pytorch/pull/152383 on behalf of https://github.com/malfet due to Broke CI, see 52cbcac640/1 ([comment](https://github.com/pytorch/pytorch/pull/152384#issuecomment-2845099985))
2025-05-01 15:46:08 +00:00
2f1800bc3d Revert "[invoke_subgraph] Simplify output code for subgraph output node (#152490)"
This reverts commit 5fe335810af0df48f473387b6f9efcd5dbff4d4a.

Reverted https://github.com/pytorch/pytorch/pull/152490 on behalf of https://github.com/malfet due to Broke CI, see 52cbcac640/1 ([comment](https://github.com/pytorch/pytorch/pull/152384#issuecomment-2845099985))
2025-05-01 15:46:07 +00:00
2fa39e60ed Revert "[inductor][invoke_subgraph] Free the buffers before the subgraph call (#152494)"
This reverts commit 5236a8506c4f2fcce6d8a7f945808d84e6c46784.

Reverted https://github.com/pytorch/pytorch/pull/152494 on behalf of https://github.com/malfet due to Broke CI, see 52cbcac640/1 ([comment](https://github.com/pytorch/pytorch/pull/152384#issuecomment-2845099985))
2025-05-01 15:46:07 +00:00
52cbcac640 [BE] Migrate all add/sub ops to Metal kernels (#152510)
As the typecasting harness should take care of all permutations.
Fix a bug in `exec_binary_kernel` where it was not properly downcasting CPU double/complexDouble scalars to floats.

Fixes https://github.com/pytorch/pytorch/issues/152582
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152510
Approved by: https://github.com/Skylion007, https://github.com/dcci, https://github.com/cyyever
ghstack dependencies: #152443, #152466, #152479, #152504, #152485
2025-05-01 15:35:57 +00:00
e82dc0769c Respect checkpointed boundaries when using knapsack formulation in the partitioner (#141684)
When multiple checkpoint regions are back-to-back with no operations in-between, we enforce the operation at the boundary to be force-saved, see 7ea0da2d57/torch/_functorch/partitioners.py (L772-L807)

When using the `memory_budget` formulation on a graph which already has AC inside, we should respect the boundaries of the AC decision (which is set to `MUST_SAVE`), and thus ban those nodes from possible recomputation.

Adding tests would be nice, but I'm not sure what the best way to test this is right now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141684
Approved by: https://github.com/bdhirsh
2025-05-01 15:28:41 +00:00
41de0f2eaf removing short-perf-test-cpu.sh and short-perf-test-gpu.sh (#152551)
When working on #148342 I realised that there are no references to those files, so they seem to be stale and can be safely removed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152551
Approved by: https://github.com/atalman, https://github.com/xuzhao9
2025-05-01 15:09:55 +00:00
6f6acb4128 [AOTI][CPU] Introduce config.cpp.use_decompose_tanh (#152542)
Summary: Previously D70489427 changed the tanh impl to `.tanh()`, and this is causing a perf regression in some Meta-internal workloads. This diff introduces a config so we can set it based on need.
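
A quick illustration of how such an inductor config would be toggled (the attribute path is taken from the title; treat it as version-dependent):
```
import torch._inductor.config as inductor_config

# Opt back into the decomposed tanh implementation for CPU/AOTI codegen.
inductor_config.cpp.use_decompose_tanh = True
```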

Differential Revision: D73909371

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152542
Approved by: https://github.com/desertfire
2025-05-01 10:25:31 +00:00
7c63ddd817 [Inductor] Wrapper code refactors to prepare for FX codegen (#152391)
This PR contains some refactors from https://github.com/pytorch/pytorch/pull/146942, which help to enable Wrapper FX codegen:
1. Remove `OutputLine`, which is unused.
2. Add an attribute to the backend classes specifying whether they support caching.
3. Before compiling a graph, query the registered backends and check whether caching is supported.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152391
Approved by: https://github.com/jansel
2025-05-01 09:14:55 +00:00
701c0848b8 [dynamic shapes] aten.constant_pad_nd meta impl (#152129)
We know the output shape, and we know this always produces a clone. Avoids data-dependent errors from the decomposition.
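
A rough sketch of what a shape-only meta implementation looks like (not the actual registration; pad semantics follow `constant_pad_nd`, which pads starting from the last dimension):
```
import torch

def constant_pad_nd_meta(x, pad, value=0.0):
    # The output shape depends only on the input shape and the pad amounts, so
    # no data access (and no data-dependent guard) is needed.
    out_shape = list(x.shape)
    for i in range(len(pad) // 2):
        dim = x.dim() - 1 - i
        out_shape[dim] += pad[2 * i] + pad[2 * i + 1]
    return x.new_empty(out_shape)
```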

along with https://github.com/pytorch/pytorch/pull/150483, should fix https://github.com/pytorch/pytorch/issues/123855
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152129
Approved by: https://github.com/laithsakka
2025-05-01 08:32:10 +00:00
53bf174626 Fix assertion in reorder_communication_preserving_peak_memory (#152565)
>=0 is practically correct because we do model the runtime of some ops as 0.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152565
Approved by: https://github.com/eellison
2025-05-01 06:40:04 +00:00
47972f9092 [export] warn when Dim.AUTO 0/1 specializes (#151827)
Fixes #151582

example warning for Dim.AUTO:
```
torch/_export/non_strict_utils.py:499] dimension inputs['x'].shape[1] 0/1 specialized; Dim.AUTO was specified along with a sample input with hint = 1.
```

example error when Dim.DYNAMIC specializes:
```
- Received user-specified dim hint Dim.DYNAMIC(min=None, max=None), but export 0/1 specialized due to hint of 0 for dimension inputs['x'].shape[0].
```
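
A small sketch of the situation that triggers the warning (the module and shapes are illustrative):
```
import torch
from torch.export import Dim, export

class M(torch.nn.Module):
    def forward(self, x):
        return x + 1

# dim 1 has a sample hint of 1, so Dim.AUTO will 0/1-specialize it and warn.
ep = export(M(), (torch.randn(4, 1),),
            dynamic_shapes={"x": {0: Dim.AUTO, 1: Dim.AUTO}})
```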
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151827
Approved by: https://github.com/angelayi
2025-05-01 06:00:51 +00:00
a7f1ddc184 [SymmMem] Experimental NVSHMEM integration (#151261)
Adding NVSHMEM as a backend for `SymmetricMemory`, implementation of which is in `NVSHMEMSymmetricMemory.cu`.

Moving some helper functions in `CUDASymmetricMemory.cu` to `CUDASymmetricMemoryUtils.cpp`, so that they can be shared by `NVSHMEMSymmetricMemory`. These functions are mostly side-band exchange helpers (`store_all_gather`, `IpcChannel`, etc).

Adding `TORCH_SYMMEM` to control which implementation to use for CUDA tensors; currently supported: `CUDA` (in-house impl) and `NVSHMEM`.

The NVSHMEM feature is gated by build-time flag: `USE_NVSHMEM=1`. And `NVSHMEM_HOME` setting is required (TODO).

Ported most code from #146593.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151261
Approved by: https://github.com/fegin, https://github.com/fduwjj
2025-05-01 05:24:50 +00:00
13add553b2 [HOP][be] make supports_input_mutation and aliasisng a class field (#152244)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152244
Approved by: https://github.com/zou3519
ghstack dependencies: #152072, #152073
2025-05-01 05:22:02 +00:00
447f8241f5 [export][function schema] support exporting hop with function schema argument (#152073)
We need to make function schemas proxyable to trace the auto_functionalized hop that takes a function schema as input. The implementation basically follows how we support torchbind objects:

1. Upon seeing an untracked function schema arg, we create a constant get_attr node.
2. We track the function schema argument in export to support lift/unlift.
3. We need to support serde for function schemas. We'll add support for this in follow-up PRs.

However, compared with torchbind objects:
1. We don't need a dynamo implementation, because the function schema is added as an argument of auto_functionalized when we auto_functionalize a hop. One potential use case is a user re-tracing an exported program with strict mode; since non-strict is the default now, we don't see a use case yet.
2. We don't need an inductor implementation, because the function schema will go away after the auto_functionalized re-inplacing pass.

Edit: this greatly simplifies (and generalizes) the implementation, following @zou3519's suggestion of using pytree.register_constant.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152073
Approved by: https://github.com/zou3519
ghstack dependencies: #152072
2025-05-01 05:22:02 +00:00
500bf50129 [export][be] better type annotation for lift_constants_pass (#152072)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152072
Approved by: https://github.com/zou3519
2025-05-01 05:22:02 +00:00
d96193f622 [Inductor] Fix int check again (#152576)
Made an oss change to a diff train diff

@diff-train-skip-merge

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152576
Approved by: https://github.com/wdvr
2025-05-01 05:19:40 +00:00
18588fe2fc Fix GuardOnDataDependentSymNode in the normalize operator (#152039)
Test Plan:
Dumped the local net torch.package to local

Ran
```
buck2 run scripts/shengqin:test_model_export -- /tmp/mtia_local_torch_package {\"local\":null}
```
succeeded

Reviewed By: hongyang-zhao

Differential Revision: D73405271

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152039
Approved by: https://github.com/houseroad
2025-05-01 04:34:49 +00:00
cyy
688adc9941 Enable -Wunused on torch targets (#150077)
For GCC, ``-Wunused`` contains:
```
-Wunused-function
Warn whenever a static function is declared but not defined or a non\-inline static function is unused.

-Wunused-label
Warn whenever a label is declared but not used.
To suppress this warning use the unused attribute.

-Wunused-parameter
Warn whenever a function parameter is unused aside from its declaration.
To suppress this warning use the unused attribute.

-Wunused-variable
Warn whenever a local variable or non-constant static variable is unused aside from its declaration
To suppress this warning use the unused attribute.
```
For Clang, some of the diagnostics controlled by ``-Wunused`` are enabled by default:
```
Controls [-Wunused-argument](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-argument),
[-Wunused-but-set-variable](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-but-set-variable),
[-Wunused-function](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-function),
[-Wunused-label](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-label), [-Wunused-lambda-capture](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-lambda-capture),
[-Wunused-local-typedef](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-local-typedef),
[-Wunused-private-field](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-private-field),
[-Wunused-property-ivar](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-property-ivar),
[-Wunused-value](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-value), [-Wunused-variable](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-variable).
```
These checks are all useful. This PR aims to enable ``-Wunused`` without breaking code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150077
Approved by: https://github.com/zou3519
2025-05-01 04:09:06 +00:00
15a3f58f91 Return ConstantVariable(None) from WithExitFunctionVariable.exit to prevent NoneType crash inside autocast exception path (#152503)
Copy of #152013 with PR time benchmarks updated (regressions seem unrelated)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152503
Approved by: https://github.com/anijain2305, https://github.com/Skylion007

Co-authored-by: Witold Dziurdz <wdziurdz@habana.ai>
2025-05-01 04:01:24 +00:00
632b89af43 [dynamic shapes] support SymInt inputs for kthvalue (#152151)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152151
Approved by: https://github.com/tugsbayasgalan, https://github.com/malfet
2025-05-01 03:47:23 +00:00
56d6d4dafe [PT2] Port replace_lce_with_matmul / replace_first_lce_with_fused_matmul_lce to PT2 pre_grad passes (#152450) (#152536)
Summary:

Same as D71358949, but removing the newly added log to avoid test failures.

Port over replace_lce_with_matmul and replace_first_lce_with_fused_matmul_lce to PT2 pre_grad pass.
Original dper pass diffs: D67884534, D68123479, D68384238

Test Plan:
Test 1. Covers replace_lce_with_matmul and case 1 of replace_first_lce_with_fused_matmul_lce
```
CUDA_VISIBLE_DEVICES=6 TORCH_LOGS=+inductor,aot TORCH_COMPILE_DEBUG=1 TORCHINDUCTOR_MAX_AUTOTUNE=1 buck2 run mode/opt-split-dwarf   mode/inplace -c fbcode.platform010_cuda_version=12 -c fbcode.nvcc_arch=h100 caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark -- --model-path=manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/669809193/0/gpu_lowering/input.predictor.disagg.gpu.merge  --lower-backend="AOT_INDUCTOR" --add_passes="use_matmul_fuse_lce_replace_first_LCE,use_contiguous_linear_reduction_replace_linear_reduction" --batch-size=3072 --gpu-trace --disable_acc_tracer=true 2>&1 | tee ~/logs/disable_acc_tracer/aoti_cmf_ctr_triton_669809193_0_diable_acc.log
```
Log: P1798246938

Test 2. Covers replace_lce_with_matmul and case 2 of replace_first_lce_with_fused_matmul_lce
```
CUDA_VISIBLE_DEVICES=7 TORCH_LOGS=+inductor,aot TORCH_COMPILE_DEBUG=1 TORCHINDUCTOR_MAX_AUTOTUNE=1 buck2 run mode/opt-split-dwarf   mode/inplace -c fbcode.platform010_cuda_version=12 -c fbcode.nvcc_arch=h100 caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark -- --model-path=manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/677734158/9/gpu_lowering/input.predictor.disagg.gpu.merge  --lower-backend="AOT_INDUCTOR" --add_passes="use_matmul_fuse_lce_replace_first_LCE,use_matmul_lce_replace_normal_LCE" --batch-size=3072 --gpu-trace --disable_acc_tracer=true 2>&1 | tee ~/logs/disable_acc_tracer/aoti_cmf_ctr_triton_677734158_9_diable_acc.log
```
Log: P1798246675

Seeing logs like
`[Pre grad(predispatch IR)] Apply use_matmul_fuse_lce_replace_first_LCE pass, save before/after graph to /tmp/tmp8lyzoh79, graph before/after are the same = False`

Differential Revision: D73934142

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152536
Approved by: https://github.com/wdvr
2025-05-01 03:14:04 +00:00
5236a8506c [inductor][invoke_subgraph] Free the buffers before the subgraph call (#152494)
Before
![image](https://github.com/user-attachments/assets/62b24c14-69e6-40fb-94e3-223930132ef6)

After
![image](https://github.com/user-attachments/assets/9f340d4e-80a9-45aa-9400-626fff5b5ecd)

tlparse - https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmph5dwWt/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152494
Approved by: https://github.com/Skylion007, https://github.com/eellison
ghstack dependencies: #152357, #152384, #152383, #152490
2025-05-01 02:04:10 +00:00
5fe335810a [invoke_subgraph] Simplify output code for subgraph output node (#152490)
Before - [manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmppQg3F8/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000](https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmppQg3F8/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000)
![image](https://github.com/user-attachments/assets/8fecdc23-eb78-4e15-9d03-c4bae4b49434)

After fix - https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmp9a5EM0/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000
![image](https://github.com/user-attachments/assets/8e98120c-d82e-42dc-bc50-a6bfd4f9923c)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152490
Approved by: https://github.com/eellison
ghstack dependencies: #152357, #152384, #152383
2025-05-01 02:04:10 +00:00
98eb7c8cb1 [inductor][subgraph] Simplify the resulting output code for subgraph (#152383)
Check out output code

Before this PR -  - https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmp3iXDVs/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000
![image](https://github.com/user-attachments/assets/ef86eb8f-e8b9-47dd-8609-f90481f018b8)

After this PR - https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpRgUJvq/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

![image](https://github.com/user-attachments/assets/10e22c60-7fb9-4519-9d54-019beff5333b)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152383
Approved by: https://github.com/eellison
ghstack dependencies: #152357, #152384
2025-05-01 02:04:10 +00:00
c87c823de4 [inductor][invoke_subgraph] Remove assertion checks for outputs of invoke_subgraph (#152384)
For invoke_subgraph, input assertions are good. We don't need output assertions. This is the tlparse

Before
![image](https://github.com/user-attachments/assets/4ae14530-3314-4dfa-9297-58f9e3ee4b9c)

After
![image](https://github.com/user-attachments/assets/c1457687-2396-49a7-986b-ef6145fcbf46)

https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmppQg3F8/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152384
Approved by: https://github.com/eellison, https://github.com/zou3519
ghstack dependencies: #152357
2025-05-01 02:04:10 +00:00
3849fd13de 🐛 Add ciflow/pull🦋 (#152567)
To make it easier to work around GitHub reliability issues, where it sometimes fails to schedule `on: pull_request` workflows.

See https://github.com/pytorch/pytorch/issues/151322

But alas, it does not fix the problem at hand...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152567
Approved by: https://github.com/clee2000, https://github.com/huydhn, https://github.com/ZainRizvi, https://github.com/Camyll, https://github.com/atalman
2025-05-01 02:00:51 +00:00
0b8822e70b [export] set is_exporting() for strict (#151833)
Helpful for upcoming work in figuring when to use stack trace in prettifying dynamic shapes errors

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151833
Approved by: https://github.com/angelayi
2025-05-01 02:00:19 +00:00
f2cc07d202 [cutlass backend] Add addmm dynamic support (#152498)
Differential Revision: [D73893133](https://our.internmc.facebook.com/intern/diff/D73893133/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152498
Approved by: https://github.com/ColinPeppler
2025-05-01 01:40:08 +00:00
fe1deeb701 [BE] Replace func_name with __func__ (#152553)
Summary: Not sure why one needs to preserve the name by hand

Test Plan: CI

Differential Revision: D73941209

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152553
Approved by: https://github.com/wdvr
2025-05-01 01:26:49 +00:00
0d2746092b [ez][export] suggest torch._checks only for booleans (#152499)
We were doing this when the error was coming from int/float casts, suggesting fixes like `torch._check(zuf0), torch._check(~zuf0)`
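
For reference, a minimal sketch (my own illustration, not from the PR) of the boolean runtime assert that `torch._check` suggestions are actually meant for:

```python
import torch

def slice_mean(x):
    # torch._check takes a boolean condition; that is the only situation
    # where suggesting it as a fix makes sense.
    torch._check(x.shape[0] >= 2, lambda: "need at least two rows")
    return x[1:].mean()

print(slice_mean(torch.randn(4, 3)))   # passes the check
try:
    slice_mean(torch.randn(1, 3))      # fails the check eagerly
except RuntimeError as err:
    print("rejected:", err)
```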

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152499
Approved by: https://github.com/angelayi
2025-05-01 01:24:46 +00:00
be1adcae32 add split sizes info dump for uneven all2all bw calculation (#151438)
Add split sizes info to the dumped execution trace and kineto trace for bandwidth calculation of uneven all2all.

Take the input data in the case below as an example: although we know the input size of Rank-0 is 50 elements, the actual data size that Rank-0 sends out is (12+13+14)=39 elements. Rank-0 doesn't send the 1st chunk of 11 elements to peers. But we don't know this information today, because the "in split size" field is empty.
![image](https://github.com/user-attachments/assets/7240f334-2081-409b-bbe0-a8396ffa2d30)
![image](https://github.com/user-attachments/assets/679fc49f-e34f-4a74-bad0-fb6fa9d18239)
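
To make the missing metadata concrete, here is a minimal sketch (mine, with assumed rank counts and backend choice) of an uneven all-to-all where the input/output split sizes are exactly the information the traces now record:

```python
import torch
import torch.distributed as dist

# Minimal sketch: assumes a torchrun launch so a default process group can
# be created (NCCL on GPUs; all_to_all support on other backends varies).
use_cuda = torch.cuda.is_available()
dist.init_process_group(backend="nccl" if use_cuda else "gloo")
rank, world = dist.get_rank(), dist.get_world_size()
device = torch.device("cuda", rank % torch.cuda.device_count()) if use_cuda else torch.device("cpu")

input_splits = [rank + 1] * world               # elements this rank sends to each peer
output_splits = [r + 1 for r in range(world)]   # elements received from each peer

inp = torch.arange(sum(input_splits), dtype=torch.float32, device=device)
out = torch.empty(sum(output_splits), dtype=torch.float32, device=device)

# These split sizes are the "in/out split size" fields the dumped
# execution/kineto traces need in order to compute the real bytes moved.
dist.all_to_all_single(out, inp,
                       output_split_sizes=output_splits,
                       input_split_sizes=input_splits)
dist.destroy_process_group()
```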

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151438
Approved by: https://github.com/shengfukevin, https://github.com/kwen2501
2025-05-01 01:19:20 +00:00
eqy
7abca8ceba Decorate test_host_memory_stats with @serialTest (#152454)
Seems to need it, as the test expects only its own allocation behavior to be visible; addresses #152422
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152454
Approved by: https://github.com/Skylion007
2025-05-01 00:53:20 +00:00
5521e6b671 [export] support SymInt minlength for torch.bincount() (#152497)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152497
Approved by: https://github.com/angelayi
2025-05-01 00:45:58 +00:00
ad9e209ea3 Change test/inductor/test_standalone_compile to test/inductor/test_compile (#152103)
These are the tests for torch._inductor.compile, so I renamed the file
test_compile. This is to avoid confusion with
torch._inductor.standalone_compile, which is now a lot more standalone
than torch._inductor.compile.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152103
Approved by: https://github.com/oulgen
2025-05-01 00:44:02 +00:00
8136e0d3b7 Expose NCCL communicator from ProcessGroupNCCL via an unsafe API (#152496)
Differential Revision: D73892691

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152496
Approved by: https://github.com/ngimel
2025-04-30 23:51:34 +00:00
f2a89b802d [invoke_subgraph] Cache on tangent metadata and retrace if needed (#152357)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152357
Approved by: https://github.com/zou3519, https://github.com/bdhirsh
2025-04-30 23:49:17 +00:00
b6f8209f54 Remove redundant line in partitioner (#152517)
Summary: This is a cleanup from https://github.com/pytorch/pytorch/pull/152264, which contained a line which was a vestige from a previous implementation.

Test Plan: Let CI run

Differential Revision: D73904636

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152517
Approved by: https://github.com/Skylion007, https://github.com/bdhirsh
2025-04-30 23:17:30 +00:00
56039b5778 Revert "[CUDAGraph Trees] support memory allocation on side stream (#152472)"
This reverts commit c620763ec2be83e37f9b31ad6663c6e82d6c0ab0.

Reverted https://github.com/pytorch/pytorch/pull/152472 on behalf of https://github.com/BoyuanFeng due to should use tid instead pid ([comment](https://github.com/pytorch/pytorch/pull/152472#issuecomment-2843491656))
2025-04-30 22:18:10 +00:00
361bf056a7 [nativert] Add moodycamel/concurrentqueue as third-party dependency (#152033)
nativert RFC:  https://github.com/zhxchen17/rfcs/blob/master/RFC-0043-torch-native-runtime.md

moodycamel/concurrentqueue is a high-performance MPMC queue implementation and is single-header only. We want to add it to third_party to be used with the upcoming Torch Native Runtime.

The source code is imported from commit hash 2f09da73d22a47dc8a89cdd4fc4c3bfae07f4284 from https://github.com/cameron314/concurrentqueue

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152033
Approved by: https://github.com/seemethere, https://github.com/malfet
2025-04-30 21:37:20 +00:00
49a72011cc Revert "[inductor][BE] Add more debug logs for why fx graph cache doesn't happen (#152487)"
This reverts commit 76331657d21e4bebd8f3c00ceed5369ae8b64112.

Reverted https://github.com/pytorch/pytorch/pull/152487 on behalf of https://github.com/malfet due to And it broke those tests, not sure why signal was ignored ([comment](https://github.com/pytorch/pytorch/pull/152487#issuecomment-2843333471))
2025-04-30 21:35:17 +00:00
3f10091d3c Clean up conda usage in benchmark scripts (#152552)
Fixes https://github.com/pytorch/pytorch/issues/152123.

* Switch `benchmarks/dynamo/Makefile` to use uv.  Note that these scripts are only used locally, so it's kind of ok to keep conda here IMO.  But switching to uv is probably nicer to most folks.
* Delete some files that are outdated and not used anymore

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152552
Approved by: https://github.com/atalman, https://github.com/albanD
2025-04-30 21:27:29 +00:00
5a66c1d921 [nativert] Add utility function to convert strings into numbers. (#151467)
Summary:

nativert RFC: https://github.com/zhxchen17/rfcs/blob/master/RFC-0043-torch-native-runtime.md

To land the runtime into PyTorch core, we will gradually land logical parts of the code, tracked in the GitHub issue, and get each piece properly reviewed.

This diff adds a small library to convert strings into numbers which will later be used for parsing graph IR.

Differential Revision: D73133034

## Test Plan

c10 unittests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151467
Approved by: https://github.com/cyyever, https://github.com/albanD
2025-04-30 21:20:52 +00:00
22ecaeb145 [standalone_compile] fix dynamic shapes with config_patches (#152462)
compile_fx with config_patches goes down another path where we need to
propagate the kwarg...

Test Plan:
- updated test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152462
Approved by: https://github.com/oulgen
2025-04-30 21:02:14 +00:00
eqy
ce317cd5a8 [CUDA][SDPA] bump fudge factor in test_sdpa in test_nestedtensor (#152235)
Small mismatches on e.g., 4090, A6000/A40

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152235
Approved by: https://github.com/drisspg, https://github.com/Skylion007, https://github.com/jbschlosser
2025-04-30 20:24:49 +00:00
55c539428f [inductor][BE] cleanup and improve precompilation loggings (#152483)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152483
Approved by: https://github.com/chenyang78, https://github.com/jingsh
2025-04-30 20:21:55 +00:00
76331657d2 [inductor][BE] Add more debug logs for why fx graph cache doesn't happen (#152487)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152487
Approved by: https://github.com/Skylion007, https://github.com/eellison
2025-04-30 20:05:21 +00:00
adebb8b112 set thread_work_size to 4 for unrolled kernel (#152396)
Previous PRs enabling 8-vectorization inadvertently regressed unrolled kernel perf.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152396
Approved by: https://github.com/BoyuanFeng, https://github.com/msaroufim, https://github.com/malfet, https://github.com/Aidyn-A, https://github.com/atalman
2025-04-30 19:53:58 +00:00
c4a0b31c1d Update CODEOWNERS (torch/utils/data/) (#152482)
Updating codeowners for dataloading

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152482
Approved by: https://github.com/ramanishsingh, https://github.com/janeyx99
2025-04-30 19:24:56 +00:00
eqy
1bb13a16bb [CUDA][SDPA] Bump python fused_attention_vs_math_ref_grads fudge_factor for sm120 (#152491)
🍦

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152491
Approved by: https://github.com/Skylion007
2025-04-30 19:22:21 +00:00
7a3cae4b20 Configurable logging for cpp_extensions.py (#152260)
Today `cpp_extensions` makes heavy use of printing to stderr. This makes our life harder in KernelBot, where we typically rely on stderr to only surface real errors, yet cpp_extensions uses stderr for updates that would better be qualified as INFO, WARNING, or ERROR.

Now instead we'll recommend users of our cpp extension system to do something like

```python
import logging
cpp_ext_logger = logging.getLogger("torch.utils.cpp_extension")
cpp_ext_logger.setLevel(logging.WARNING)
```

While this dramatically reduces log spew, it can be viewed as a BC-breaking change if people were relying on certain strings being present in stdout or stderr.

Considering different teams might want to silence errors differently, this PR proposes replacing all `print()` statements with `logging` statements, following the same heuristics that the python logging module recommends:
1. DEBUG: For things like detailed compilation steps or reading filepaths - by default gets logged on stdout
2. INFO: Build progress - by default gets logged on stdout
3. WARNING: Surfacing issues that might cause bad performance or slow compilation times - by default gets logged on stdout
4. ERROR: Problems that prevent proper functioning - by default gets logged on stdout

Note that warnings.warn comes from a different library and is not hooked up to the python logging module by default.

So the goal of this PR is to make it possible for teams to set the logging level that is most appropriate for them. One annoying thing is that ruff flags logger calls that use f-strings or .format, so we have to use old-school %s formatting.
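
As a concrete illustration (a sketch on my part, using only standard-library logging and the logger name from above), a team could route the full build logs to a file while keeping only warnings and errors on stderr:

```python
import logging

# Route the new logging-based output: full build logs to a file, only
# WARNING and above on stderr.
cpp_ext_logger = logging.getLogger("torch.utils.cpp_extension")
cpp_ext_logger.setLevel(logging.DEBUG)

file_handler = logging.FileHandler("cpp_extension_build.log")
file_handler.setLevel(logging.DEBUG)
file_handler.setFormatter(logging.Formatter("%(levelname)s %(name)s: %(message)s"))
cpp_ext_logger.addHandler(file_handler)

stderr_handler = logging.StreamHandler()
stderr_handler.setLevel(logging.WARNING)
cpp_ext_logger.addHandler(stderr_handler)

# Inside cpp_extension.py the messages themselves use lazy %s formatting,
# e.g. logger.info("Emitting ninja build file %s...", path), to keep ruff happy.
```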

An unrelated improvement I'd be happy to push to a separate PR is adding support for "native" in `TORCH_CUDA_ARCH_LIST`, which would just pick the arch of the current device.

An example of what's in stderr today

```
Using /root/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py311_cu124/grayscale/build.ninja...
/usr/local/lib/python3.11/site-packages/torch/utils/cpp_extension.py:2059: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
Building extension module grayscale...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Loading extension module grayscale...
/usr/local/lib/python3.11/site-packages/torch/_dynamo/variables/functions.py:679: UserWarning: Graph break due to unsupported builtin grayscale.PyCapsule.grayscale. This function is either a Python builtin (e.g. _warnings.warn) or a third-party C/C++ Python extension (perhaps created with pybind). If it is a Python builtin, please file an issue on GitHub so the PyTorch team can add support for it and see the next case for a workaround. If it is a third-party C/C++ Python extension, please either wrap it into a PyTorch-understood custom operator (see https://pytorch.org/tutorials/advanced/custom_ops_landing_page.html for more details) or, if it is traceable, use torch.compiler.allow_in_graph.
  torch._dynamo.utils.warn_once(msg)
```

Whereas after this PR users can do

`python benchmark_load_inline.py > >(tee stdout.txt) 2> >(tee stderr.txt >&2)`

```python
import os
import sys
from pathlib import Path
import shutil
import tempfile

import torch
from torch.utils.cpp_extension import load_inline

import logging
cpp_ext_logger = logging.getLogger("torch.utils.cpp_extension")
cpp_ext_logger.setLevel(logging.WARNING)

os.environ["TORCH_CUDA_ARCH_LIST"] = "native"

cpp_code = """
torch::Tensor to_gray(torch::Tensor input);
"""

cuda_kernel_code = """
torch::Tensor to_gray(torch::Tensor input) {
  auto output = torch::epty({input.size(0), input.size(1)}, input.options());
  return output ;
}
"""

# Avoid caching results
with tempfile.TemporaryDirectory() as build_dir:
    cuda_module = load_inline(
        name="to_gray_cuda",
        cpp_sources=cpp_code,
        cuda_sources=cuda_kernel_code,
        functions=["to_gray"],
        with_cuda=True,
        verbose=True,
        extra_cflags=["-std=c++17"], # "-ftime-report", "-H"],
        extra_cuda_cflags=["-arch=sm_89"],
        build_directory=build_dir,
    )

```

## New logs

### On failure

Which gives a much more reasonable stdout

```
[1/3] /usr/local/cuda-12.8/bin/nvcc --generate-dependencies-with-compile --dependency-output cuda.cuda.o.d -DTORCH_EXTENSION_NAME=to_gray_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -isystem /home/marksaroufim/pytorch/torch/include -isystem /home/marksaroufim/pytorch/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -isystem /usr/local/cuda/targets/x86_64-linux/include -isystem /home/marksaroufim/.conda/envs/nv/include/python3.10 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -arch=sm_89 -std=c++17 -c /tmp/tmpbg_xzv0r/cuda.cu -o cuda.cuda.o
FAILED: cuda.cuda.o
/usr/local/cuda-12.8/bin/nvcc --generate-dependencies-with-compile --dependency-output cuda.cuda.o.d -DTORCH_EXTENSION_NAME=to_gray_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -isystem /home/marksaroufim/pytorch/torch/include -isystem /home/marksaroufim/pytorch/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -isystem /usr/local/cuda/targets/x86_64-linux/include -isystem /home/marksaroufim/.conda/envs/nv/include/python3.10 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -arch=sm_89 -std=c++17 -c /tmp/tmpbg_xzv0r/cuda.cu -o cuda.cuda.o
/tmp/tmpbg_xzv0r/cuda.cu(6): error: namespace "torch" has no member "epty"
    auto output = torch::epty({input.size(0), input.size(1)}, input.options());
                         ^

1 error detected in the compilation of "/tmp/tmpbg_xzv0r/cuda.cu".
[2/3] c++ -MMD -MF main.o.d -DTORCH_EXTENSION_NAME=to_gray_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -isystem /home/marksaroufim/pytorch/torch/include -isystem /home/marksaroufim/pytorch/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -isystem /usr/local/cuda/targets/x86_64-linux/include -isystem /home/marksaroufim/.conda/envs/nv/include/python3.10 -fPIC -std=c++17 -std=c++17 -c /tmp/tmpbg_xzv0r/main.cpp -o main.o
ninja: build stopped: subcommand failed.

```

And stderr

```
Traceback (most recent call last):
  File "/home/marksaroufim/pytorch/torch/utils/cpp_extension.py", line 2874, in _run_ninja_build
    subprocess.run(
  File "/home/marksaroufim/.conda/envs/nv/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/marksaroufim/load_inline_slow/benchmark_load_inline.py", line 30, in <module>
    cuda_module = load_inline(
  File "/home/marksaroufim/pytorch/torch/utils/cpp_extension.py", line 2261, in load_inline
    return _jit_compile(
  File "/home/marksaroufim/pytorch/torch/utils/cpp_extension.py", line 2367, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/home/marksaroufim/pytorch/torch/utils/cpp_extension.py", line 2528, in _write_ninja_file_and_build_library
    _run_ninja_build(
  File "/home/marksaroufim/pytorch/torch/utils/cpp_extension.py", line 2892, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension 'to_gray_cuda'

```

### On success

stdout

```
[1/3] c++ -MMD -MF main.o.d -DTORCH_EXTENSION_NAME=to_gray_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -isystem /home/marksaroufim/pytorch/torch/include -isystem /home/marksaroufim/pytorch/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -isystem /usr/local/cuda/targets/x86_64-linux/include -isystem /home/marksaroufim/.conda/envs/nv/include/python3.10 -fPIC -std=c++17 -std=c++17 -c /tmp/tmpxv_ovlrf/main.cpp -o main.o
[2/3] /usr/local/cuda-12.8/bin/nvcc --generate-dependencies-with-compile --dependency-output cuda.cuda.o.d -DTORCH_EXTENSION_NAME=to_gray_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -isystem /home/marksaroufim/pytorch/torch/include -isystem /home/marksaroufim/pytorch/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -isystem /usr/local/cuda/targets/x86_64-linux/include -isystem /home/marksaroufim/.conda/envs/nv/include/python3.10 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -arch=sm_89 -std=c++17 -c /tmp/tmpxv_ovlrf/cuda.cu -o cuda.cuda.o
[3/3] c++ main.o cuda.cuda.o -shared -L/home/marksaroufim/pytorch/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda -ltorch -ltorch_python -L/usr/local/cuda-12.8/lib64 -lcudart -o to_gray_cuda.so

```

And an empty stderr as expected
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152260
Approved by: https://github.com/albanD
2025-04-30 18:30:28 +00:00
05933e08ca [ATen][CUDA][SDPA] Enable SDPA on sm_121 (#152314)
This PR adds support for `sm_121` of the DGX Spark. The `sm_121` is binary compatible with `sm_120` (just like `sm_89` and `sm_86`), therefore a compilation targeting `sm_121` is not required.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152314
Approved by: https://github.com/eqy
2025-04-30 18:04:50 +00:00
b027cb8f9e [Docs] Add Description of validate_args for torch.distributions (#152173)
Fixes #152165
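
For readers landing here, a small hedged example of what the newly documented flag controls (standard `torch.distributions` API):

```python
import torch
from torch.distributions import Normal

# With validation on, out-of-support values are rejected eagerly.
strict = Normal(loc=0.0, scale=1.0, validate_args=True)
try:
    strict.log_prob(torch.tensor(float("nan")))
except ValueError as err:
    print("rejected:", err)

# With validation off, the same call silently returns NaN: faster, but
# easier to misuse. This trade-off is what the new docs describe.
loose = Normal(loc=0.0, scale=1.0, validate_args=False)
print(loose.log_prob(torch.tensor(float("nan"))))
```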

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152173
Approved by: https://github.com/soulitzer
2025-04-30 18:01:20 +00:00
cyy
256c96332c [1/N] Use std::filesystem (#152288)
Maybe it is time to use std::filesystem because CXX11 ABI is now the default. The changes are for jit and distributed code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152288
Approved by: https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-04-30 17:54:16 +00:00
62ab6a5bb1 [ROCm] Use almalinux docker files for building Magma (#152488)
Fixes #151707 for ROCm Magma builds.  See also #152358.  Depends on #152492.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152488
Approved by: https://github.com/atalman
2025-04-30 17:53:30 +00:00
c620763ec2 [CUDAGraph Trees] support memory allocation on side stream (#152472)
I tried `beginAllocateToPool` instead of `_cuda_beginAllocateCurrentStreamToPool` and the error in #151199 does not happen any more.

However, this approach is unsafe for multithreading. When multiple run_eager calls happen concurrently, we expect memory allocations to go to different mem_pools. Since beginAllocateToPool does not check the stream, these allocations may land in the same mem_pool.

So, I use `_cuda_beginAllocateCurrentThreadToPool` to direct all memory allocations on the same thread to a given mem_pool. In particular, `_cuda_beginAllocateCurrentThreadToPool` records the launching thread id and, during runtime, checks whether the current thread id matches the launching thread id.

Fixes #151199

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152472
Approved by: https://github.com/eellison
2025-04-30 17:45:07 +00:00
0904a182c2 [dynamo] Relax guard introduced when tracing __call__ on user defined object (#152395)
This relaxes the guard introduced in #100444, which aggressively guarded
on the object id even though Dynamo is just tracing its `__call__` method.

This allows users to bypass the high compilation time issue in #150706
by compiling transformer blocks only. Without this patch, we'd get lots
of unnecessary recompilation, as the blocks have different attention
processor instances.

Compiling blocks only _significantly_ speeds up the compilation process
(from ~310s to ~32s), and even speeds up e2e performance for some reason
(7.83s to 7.67s).
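
A minimal sketch of that block-wise compilation workaround (the `transformer_blocks` attribute is a hypothetical name; adjust to the model at hand):

```python
import torch

def compile_blocks(model: torch.nn.Module) -> torch.nn.Module:
    # `transformer_blocks` is a hypothetical attribute name; diffusion
    # transformers usually expose their blocks as an nn.ModuleList.
    for block in model.transformer_blocks:
        # Compile each block in place instead of the whole model. With the
        # relaxed guard, blocks that differ only by their attention
        # processor instance can reuse the same compiled code.
        block.compile()
    return model
```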

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152395
Approved by: https://github.com/anijain2305
ghstack dependencies: #152369
2025-04-30 17:34:21 +00:00
e4994e2f73 [AOTAutogradCache] Allow torch.Tensor and a non-torch op from einops (#152369)
This addresses part of #150706.

Specifically, it reduces the warm start `torch.compile` overhead by
40~50% for GGUF models on
1. HuggingFace diffusers: [tlparse before, 224s](https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpqgbdva/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000) v.s. [tlparse after, 126s](https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmp950PFy/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000)
2. ComfyUI: [tlparse before, 93s](https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmp7SeJb4/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000) v.s. [tlparse after, 51s](https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpRwGNqA/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000)

The improvements should generalize to all other GGUF models on these
platforms, because the cache miss was induced by framework code, which
will be hit by every GGUF model.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152369
Approved by: https://github.com/jamesjwu
2025-04-30 17:34:21 +00:00
ce2cf31623 Remove dead binary_ios_build, test, upload scripts (#152461)
Can't find any mentions of them in the codebase, presumably no longer used?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152461
Approved by: https://github.com/seemethere, https://github.com/janeyx99, https://github.com/malfet
2025-04-30 17:10:27 +00:00
702264dad4 Revert "Change test/inductor/test_standalone_compile to test/inductor/test_compile (#152103)"
This reverts commit ff1099562d261315ac7bbf43f3795872099a1c31.

Reverted https://github.com/pytorch/pytorch/pull/152103 on behalf of https://github.com/clee2000 due to failure is real but log classifier is pointing at an unrelated line, actual failure is just that the old name is mentioned somewhere and needs to be changed, see the bottom of the test step of the job https://github.com/pytorch/pytorch/actions/runs/14740884246/job/41379127184#step:22:705 [GH job link](https://github.com/pytorch/pytorch/actions/runs/14758321324/job/41434697413) [HUD commit link](ff1099562d) ([comment](https://github.com/pytorch/pytorch/pull/152103#issuecomment-2842638551))
2025-04-30 16:57:58 +00:00
8aa65780f4 [CUDA] Fix test_multi_device_context_manager on CUDA (#152474)
Seems there was a typo where `set_device` was called when the intent was to use `current_device`

As-is the test will fail on multigpu systems with

`TypeError: set_device() missing 1 required positional argument: 'device'`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152474
Approved by: https://github.com/Skylion007
2025-04-30 16:53:10 +00:00
1e4bcd3ba3 Remove unnecessary condition compilation macro (#152512)
As the title stated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152512
Approved by: https://github.com/albanD, https://github.com/Skylion007
2025-04-30 16:48:25 +00:00
3b105ccc04 [AOTI] Fix a memory leak in model_package_loader (#152334)
Summary: There was a char array allocated but never freed. It was found by valgrind and verified fixed with this PR, although it's not easy to write a unit test for it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152334
Approved by: https://github.com/angelayi, https://github.com/Skylion007
2025-04-30 16:21:50 +00:00
c7484805ca Add two missing JIT tests to CMake (#152440)
Looks like I forgot to add these.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152440
Approved by: https://github.com/Skylion007
2025-04-30 16:18:55 +00:00
ff1099562d Change test/inductor/test_standalone_compile to test/inductor/test_compile (#152103)
These are the tests for torch._inductor.compile, so I renamed the file
test_compile. This is to avoid confusion with
torch._inductor.standalone_compile, which is now a lot more standalone
than torch._inductor.compile.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152103
Approved by: https://github.com/oulgen
2025-04-30 15:27:44 +00:00
3c2bf24786 [ROCm] add almalinux images (#152492)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152492
Approved by: https://github.com/atalman
2025-04-30 15:14:01 +00:00
d88e0ceb64 Cast to unsigned char to avoid UB (#152360)
The standard requires that the argument to functions like `isdigit`, `isalpha`, and similar must be either `EOF` or an `unsigned char`; otherwise, the behavior is undefined (UB).
To avoid out-of-bounds reads, modern implementations of some libraries (such as glibc) deliberately pad their internal tables to guarantee valid memory access even for negative values. However, this is implementation-specific, and other libraries may not do this.

Properly casting the argument to `unsigned char` is good practice to avoid potential issues on some platforms.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152360
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2025-04-30 15:09:13 +00:00
4408701fed [CI][CD] Unify install_cuda and install_cuda_aarch64 scripts (#152140)
Generalize install_cuda so it can also handle aarch64
Remove install_cuda_aarch64 since install_cuda can now handle it
Make install_cuda and install_cudnn functions in the install_cuda script because most of the code is the same

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152140
Approved by: https://github.com/huydhn, https://github.com/atalman
2025-04-30 15:09:06 +00:00
371999782a Revert "Fix flaky test in test_custom_ops (#152484)"
This reverts commit 5a52e050248c71dd6e84f51d25cbd17a88555800.

Reverted https://github.com/pytorch/pytorch/pull/152484 on behalf of https://github.com/malfet due to It broke test_save to file with TypeError: get_sample_op_profile() missing 1 required argument ([comment](https://github.com/pytorch/pytorch/pull/152484#issuecomment-2842254907))
2025-04-30 14:53:15 +00:00
d620fefb2c [invoke_subgraph] Use backward identifier for min-cut parititioning (#152207)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152207
Approved by: https://github.com/zou3519, https://github.com/bdhirsh
2025-04-30 14:34:56 +00:00
cf894b3f1f [MPS][BE] Remove exec_binary_alpha_kernel (#152485)
Which was almost a complete copy-n-paste from exec_binary_kernel anyway
Just add `Scalar` as an optional argument and figure out kernel name during the invocation rather than in executor
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152485
Approved by: https://github.com/Skylion007
ghstack dependencies: #152443, #152466, #152479, #152504
2025-04-30 14:09:14 +00:00
c90e23eb73 [inductor] Fix usage of launch_enter_hook/launch_exit_hook (#152457)
In https://github.com/triton-lang/triton/pull/6467 I moved where `launch_enter_hook`/`launch_exit_hook` are specified (from the kernel class to a config). This PR updates the usages to use the config module if it exists to support tip of main triton.

In https://github.com/triton-lang/triton/pull/6641 I renamed `triton.config` to `triton.knobs`, hence the second commit in this PR.

Test Plan: Setup OSS PT with tip of main triton (namely including https://github.com/triton-lang/triton/pull/6641) and run `python test/inductor/test_pad_mm.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152457
Approved by: https://github.com/jamesjwu
2025-04-30 13:22:16 +00:00
36acaaae3f [CUDA] Add new architectures (#152414)
CUDA 12.9 will introduce a couple of new architectures, `sm_103` and `sm_121`. We do not need to build for them, because they are going to be compatible with `sm_100` and `sm_120` respectively (similar to `sm_86` and `sm_89`), but PyTorch must be "aware" of them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152414
Approved by: https://github.com/Skylion007, https://github.com/eqy, https://github.com/malfet
2025-04-30 09:55:27 +00:00
ece1658418 [ROCm][TunableOp] Fix ScaledGEMM rowwise (#152403)
Fixes TunableOp ScaledGEMM regression for rowwise scaling caused by this https://github.com/pytorch/pytorch/pull/147548

Credit goes to @mawong-amd for fix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152403
Approved by: https://github.com/jeffdaily
2025-04-30 08:33:03 +00:00
7a9d0d2451 Revert "[PT2] Port replace_lce_with_matmul / replace_first_lce_with_fused_matmul_lce to PT2 pre_grad passes (#152450)"
This reverts commit c8f48eb18531e4e348fcfa718b2e52d3c2497197.

Reverted https://github.com/pytorch/pytorch/pull/152450 on behalf of https://github.com/wdvr due to still failing after https://github.com/pytorch/pytorch/pull/152493 - needs further investigation ([comment](https://github.com/pytorch/pytorch/pull/152450#issuecomment-2841212970))
2025-04-30 08:30:57 +00:00
424e21ae82 Revert "fix tests broken after #152450 (#152493)"
This reverts commit d8fe6fa280c3e5bd21b3e84b3e25d9204ccdedf7.

Reverted https://github.com/pytorch/pytorch/pull/152493 on behalf of https://github.com/wdvr due to still failing ([comment](https://github.com/pytorch/pytorch/pull/152493#issuecomment-2841207942))
2025-04-30 08:27:58 +00:00
fa6f9eb2be [CUDA][TF32] Account for TF32 in compile_kernel_advanced (#152468)
Also cleanup some uses of `assert_close` in favor of `self.assertEqual`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152468
Approved by: https://github.com/msaroufim
2025-04-30 07:54:38 +00:00
d8fe6fa280 fix tests broken after #152450 (#152493)
Updating test expected value after #152450

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152493
Approved by: https://github.com/huydhn, https://github.com/malfet

Co-authored-by: Huy Do <huydhn@gmail.com>
2025-04-30 07:16:10 +00:00
5a52e05024 Fix flaky test in test_custom_ops (#152484)
Hopefully fixes https://github.com/pytorch/pytorch/issues/151301, https://github.com/pytorch/pytorch/issues/151281 by making the ops have different names
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152484
Approved by: https://github.com/zou3519
2025-04-30 07:07:27 +00:00
cc7346bf19 Revert "fix tests broken after #152450 (#152493)"
This reverts commit 4df97a883949564aa4ed20b6912c3eb664d2624c.

Reverted https://github.com/pytorch/pytorch/pull/152493 on behalf of https://github.com/huydhn due to Another tweak is needed https://github.com/pytorch/pytorch/actions/runs/14748144909/job/41399954902, seem easier to revert and reland this ([comment](https://github.com/pytorch/pytorch/pull/152493#issuecomment-2841010528))
2025-04-30 07:05:58 +00:00
59a8aa1489 Fix instantiate_device_type_tests() for 3rd-party devices (#152177)
For 3rd-party devices, calling ``instantiate_device_type_tests()`` while explicitly passing a ``str`` (rather than a ``List[str]``/``Tuple[str]``) to the ``only_for`` or ``except_for`` argument causes unexpected results.

For example, if calling ``instantiate_device_type_tests(TestXXX, globals(), only_for="cpu")``, then it goes into [filter_desired_device_types()](f38dae76ee/torch/testing/_internal/common_device_type.py (L729)) and results in ``only_for=['c', 'p', 'u']``, because the ``only_for`` we passed is the string "cpu".

This PR fixes the above unexpected behavior for the ``str`` case.
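
For illustration, a small sketch (using PyTorch's internal test utilities, so treat the import paths as an assumption) where passing a list is unambiguous on all versions and, with this fix, a bare string behaves the same way:

```python
import torch
from torch.testing._internal.common_device_type import instantiate_device_type_tests
from torch.testing._internal.common_utils import TestCase, run_tests

class TestXXX(TestCase):
    def test_add(self, device):
        x = torch.ones(2, device=device)
        self.assertEqual(x + x, torch.full((2,), 2.0, device=device))

# Passing a list is unambiguous on every version; with this fix, a bare
# "cpu" string is treated the same way instead of as ['c', 'p', 'u'].
instantiate_device_type_tests(TestXXX, globals(), only_for=["cpu"])

if __name__ == "__main__":
    run_tests()
```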

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152177
Approved by: https://github.com/albanD
2025-04-30 06:25:59 +00:00
a2c553cac6 [Metal] Extend typecasted op support to complex dtypes (#152504)
First of all, extend `c10::metal::cast_to` to work correctly with complex dtypes by introducing two more specializations: one that casts complex to scalar, and another that casts scalar to complex (as the default Metal typecast will turn `float x` into `float2(x, x)`).

Add ComplexHalf and ComplexFloat enum values to `c10::metal::ScalarTypes` and handle them in `val_at_offs(ptr, offs, type)`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152504
Approved by: https://github.com/dcci
ghstack dependencies: #152443, #152466, #152479
2025-04-30 05:32:07 +00:00
4df97a8839 fix tests broken after #152450 (#152493)
Updating test expected value after #152450

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152493
Approved by: https://github.com/huydhn, https://github.com/malfet
2025-04-30 04:55:55 +00:00
fcfa6e36c9 [MPS] Fix lerp for complex numbers (#152479)
As well as `.add`/`.sub` with complex alpha

Before this change `python3 -c "import torch;print(torch.rand(10, device='mps', dtype=torch.complex64).add(torch.rand(10, device='mps', dtype=torch.complex64), alpha=.5j))"` used to fail with
```
RuntimeError: value cannot be converted to type double without overflow
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152479
Approved by: https://github.com/dcci
ghstack dependencies: #152443, #152466
2025-04-30 04:46:19 +00:00
9bfdf57572 [MPS][BE] Introduce c10::metal::mul (#152466)
Which multiplies two arguments of either scalar or complex data types.

This allows one to get rid of a bunch of complex specializations in BinaryOps.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152466
Approved by: https://github.com/dcci
ghstack dependencies: #152443
2025-04-30 04:45:47 +00:00
ee2d104c05 [cutlass backend] Add (limited) bmm dynamic shape support (#152393)
Differential Revision: D73626732

In this PR, we add support for bmm dynamic shape, provided that the batch stride is the biggest in the stride for A, B, and D. For example, for A of size `(B, M, K)`, we support stride `(M*K, K, 1)` and `(M*K, 1, M)`. With this assumption, we can infer the batch stride from existing arguments.

The reason is we don't want to add 2-3 more runtime params. The concerns are complexity and possible perf regression, though we didn't verify the latter.

We can revisit this if there is a need for that.

We also remove `B = 1` for normal mm and addmm. We tested it and didn't see perf regression. But open to revisiting this as well.
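
To see the layout assumption concretely, here is a small plain-PyTorch illustration of the two supported stride patterns, where the batch stride equals M*K and is the largest stride, so it can be recovered without extra runtime arguments:

```python
import torch

B, M, K = 8, 64, 32

# Row-major (B, M, K): strides (M*K, K, 1); the batch stride M*K is largest.
a = torch.randn(B, M, K)
print(a.stride())        # (2048, 32, 1)

# Column-major within each batch: strides (M*K, 1, M); the batch stride is
# still M*K and still the largest, so it can be inferred from M and K
# without passing extra runtime arguments to the kernel.
a_cm = a.transpose(1, 2).contiguous().transpose(1, 2)
print(a_cm.stride())     # (2048, 1, 64)
```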

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152393
Approved by: https://github.com/ColinPeppler
2025-04-30 04:36:24 +00:00
e5ea7911ea [ez] Make relaxed constraint error message more user friendly (#151407)
Fixes #151356

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151407
Approved by: https://github.com/Skylion007
2025-04-30 03:51:50 +00:00
c01bcc5efb [MPS][BE] Delete unused lerp functors (#152443)
For `lerp.Scalar_out` weight (aka alpha) is not an optional argument, so no point in having those specializations.
But handle `alpha=1.0` ahead of dispatching to Metal shaders, as a plain tensor copy should still be faster: a1a4fee3b8/aten/src/ATen/native/mps/operations/BinaryOps.mm (L285-L290)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152443
Approved by: https://github.com/Skylion007
2025-04-30 03:32:52 +00:00
4a63cab624 [cudagraphs] Fix issue in collecting static_input_idxs (#152287)
related to https://github.com/pytorch/pytorch/issues/152275

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152287
Approved by: https://github.com/bdhirsh, https://github.com/eellison

Co-authored-by: Brian Hirsh <hirsheybar@fb.com>
2025-04-30 03:24:05 +00:00
bce7f0a216 Fix additional inputs to error on inconsistent constants (#151970)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151970
Approved by: https://github.com/pianpwk
2025-04-30 01:38:17 +00:00
4bead7b85e use cutlass native BroadcastPtrArray in scaled group gemm (#152404)
After cutlass update to 3.9 we can use BroadcastPtrArray instead of a local copy with small changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152404
Approved by: https://github.com/Skylion007, https://github.com/eqy
2025-04-30 01:17:28 +00:00
eqy
cc072af74a [CUDA][MXFP8] bump tolerances for test_blockwise_mxfp8_nvfp4_numerics (#151811)
got a slightly lower sqnr on a smaller GPU

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151811
Approved by: https://github.com/albanD
2025-04-30 01:12:51 +00:00
bea7d428bc [export] Preserve custom metadata for tensor constants (#152241)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/151476
The `custom_meta` collected from `mod` has keys that follow the node names in `mod`, which are inconsistent with the node names after the naming pass. For example, a constant `b` will become `c_b`.

Test Plan: buck2 run caffe2/test:test_export -- -r test_run_decompositions_keep_tensor_constant_metadata

Differential Revision: D73703068

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152241
Approved by: https://github.com/angelayi
2025-04-30 00:30:35 +00:00
d36b09ca58 [aten] Enable vectorized 8byte copy for fp16/bf16 for index select kernel (#152380)
## Summary

Enable aligned vector loading for 2 bytes data types for index select. Specifically:

- **4 element fp16/bf16 packing**: added 8-byte vector load/store to move 4 half values at once.
- **warp-wide predicate (__all_sync)**: decide fast vs fallback path per warp, eliminating lane-level divergence
- **alignment guard**: the fast/vectorized path only executes when src and dst are 8-byte aligned, preventing misaligned address faults.
- **Safe for-loop fallback**: for misaligned, stride > 1, or tail elements we recompute offsets per element to avoid memory corruption.
- **Bound checks**: the fast/vectorized path is skipped when fewer than 4 elements remain, guaranteeing bounded access.
- **Stride remapping**: Redirect calls to inner contiguous dim which has stride = 1 so copies occur along memory coalesced axes.
- **AMD support**: Ensured portability and correctness across CUDA and HIP platforms.

## Perf testing
We note a 2.5x improvement in memory bandwidth after this change when the tensor dim is a multiple of 4 for 2-byte data types (fp16/bf16).

<img width="625" alt="image" src="https://github.com/user-attachments/assets/909b04a3-98f2-4c30-8c29-c36e1beeea0f" />

With input tensor dimension not being a multiple of 4, we see a smaller improvement (~1.2x) due to warp divergence.
<img width="624" alt="image" src="https://github.com/user-attachments/assets/f3ed16f4-b091-48bd-9889-093f6a90688d" />

## Perf testing code
```
# pyre-strict
from typing import List, Optional, Tuple

import click
import pandas as pd

import torch

# @manual=//triton:triton
import triton

@click.command()
@click.option("--data-type", type=str, default="bf16")
@click.option("--return-result", type=bool, default=False)
def main(
    data_type: str,
    return_result: bool,
) -> Optional[Tuple[List[triton.testing.Benchmark], List[pd.DataFrame]]]:
    torch.backends.cudnn.allow_tf32 = True
    torch.backends.cuda.matmul.allow_tf32 = True
    data_types = {"fp32", "fp16", "bf16"}
    if data_type not in data_types:
        raise ValueError(f"Unsupported data type: {data_type}.")

    dtype = {
        "fp32": torch.float32,
        "fp16": torch.float16,
        "bf16": torch.bfloat16
    }[data_type]

    D1 = 192
    D2 = 156
    configs: List[triton.testing.Benchmark] = [
        triton.testing.Benchmark(
            x_names=["B"],
            x_vals=[24],
            line_arg="provider",
            line_vals=[
                "repeat_interleave",
                "repeat_interleave_int32",
            ],
            line_names=["repeat_interleave", "repeat_interleave_int32"],
            styles=[("red", "-"), ("purple", "-")],
            ylabel="ms",
            plot_name=f"torch-repeat_interleave-D1-{D1}-D2-{D2}-dtype-{dtype}",
            args={
                "D1": D1,
                "D2": D2,
                "dtype": dtype,
            },
        )
    ]

    @triton.testing.perf_report(configs)
    def bench_repeat_interleave(
        B: int,
        D1: int,
        D2: int,
        dtype: torch.dtype,
        provider: str,
    ) -> float:
        warmup = 20
        rep = 100
        torch.manual_seed(42)
        torch.cuda.manual_seed(42)

        a = torch.randn(24, D1, D2)
        a = a.to(dtype).to("cuda")

        input_bytes = a.numel() * a.element_size()

        repeats = torch.randint(low=100, high=1600, size=(24,), device="cuda")
        output_bytes = (
            repeats.sum() * a.shape[1] * a.shape[2] * repeats.element_size()
        )
        total_bytes = input_bytes + output_bytes

        def torch_repeat_interleave(
            input_tensor: torch.Tensor, repeats: torch.Tensor
        ) -> torch.Tensor:
            res = input_tensor.repeat_interleave(repeats, dim=0)
            return res

        def torch_repeat_interleave_int32(
            input_tensor: torch.Tensor, repeats: torch.Tensor
        ) -> torch.Tensor:
            dim = 0
            if torch.is_tensor(repeats):
                idx64 = torch.repeat_interleave(
                    torch.arange(
                        0,
                        input_tensor.shape[dim or 0],
                        device=input_tensor.device,
                    ),
                    repeats,
                    dim=0,
                )
            else:
                idx64 = (
                    torch.arange(
                        input_tensor.shape[dim or 0] * repeats,
                        device=input_tensor.device,
                    )
                    .reshape(-1, repeats)
                    .flatten()
                )

            idx32 = idx64.to(torch.int32)
            res = torch.index_select(input_tensor, 0, idx32)
            return res

        def expand_flatten(input_tensor: torch.Tensor) -> torch.Tensor:
            return input_tensor[:, None].expand(-1, 4, -1).flatten(0, 1)

        if provider == "repeat_interleave":
            fn = lambda: torch_repeat_interleave(a, repeats)  # noqa E731
            ms = triton.testing.do_bench(fn, warmup=warmup, rep=rep)
            bw = total_bytes / (ms * 1e6)
            # print("Bandwidth[GB/s]: ", total_bytes / (ms * 1e6))
            return bw.item()
        if provider == "repeat_interleave_int32":
            fn = lambda: torch_repeat_interleave_int32(a, repeats)
            ms = triton.testing.do_bench(fn, warmup=warmup, rep=rep)
            bw = total_bytes / (ms * 1e6)
            # print("Bandwidth[GB/s]: ", total_bytes / (ms * 1e6))
            return bw.item()
        elif provider == "expand_flatten":
            fn = lambda: expand_flatten(a)
            ms = triton.testing.do_bench(fn, warmup=warmup, rep=rep)
            bw = total_bytes / (ms * 1e6)
            # print("Bandwidth[GB/s]: ", total_bytes / (ms * 1e6))
            return bw.item()
        else:
            raise ValueError(f"unsupported provider: {provider}")

    df = bench_repeat_interleave.run(print_data=True, return_df=True)

    if return_result:
        return configs, df

if __name__ == "__main__":
    main()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152380
Approved by: https://github.com/Skylion007, https://github.com/ngimel
2025-04-29 23:54:52 +00:00
c6d3b8f861 add xfail for distributed tests on Jetson (#152224)
We are hitting distributed import failures on Jetson in test/export/test_export.py tests in NVIDIA internal testing with the recent additions of https://github.com/pytorch/pytorch/pull/146050 and https://github.com/pytorch/pytorch/pull/147417. Instead of simply skipping these tests for Jetson, we are introducing an xfailIfDistributedNotSupported to get better signaling for this kind of failure in the long run.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152224
Approved by: https://github.com/nWEIdia, https://github.com/eqy
2025-04-29 23:48:40 +00:00
6f8023a35f [PowerPC] Fix vec256 for complex float and double in Power system (#152402)
The Power system build is failing with the error below.

It started failing after this commit:
912102b4ec

Fix the build error along with the test cases that are failing for the complex double and complex float data types.

Build Failure Logs:
```
vec_base.h:790:6: error: use of deleted function ‘at::vec::DEFAULT::ComplexDbl& at::vec::DEFAULT::Vectorized<c10::complex<double>>::operator’
790 | c[i] = a[i] * b[i];
| ~^
error: use of deleted function ‘at::vec::DEFAULT::ComplexDbl& at::vec::DEFAULT::Vectorized<c10::complex<double>>::operator’
802 | c[i] = a[i] / b[i];
| ~^

error: use of deleted function ‘at::vec::DEFAULT::ComplexFlt& at::vec::DEFAULT::Vectorized<c10::complex<float>>::operator’
790 | c[i] = a[i] * b[i];
| ~^

error: use of deleted function ‘at::vec::DEFAULT::ComplexFlt& at::vec::DEFAULT::Vectorized<c10::complex<float>>::operator’
802 | c[i] = a[i] / b[i];
| ~^
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152402
Approved by: https://github.com/malfet
2025-04-29 23:45:49 +00:00
c8f48eb185 [PT2] Port replace_lce_with_matmul / replace_first_lce_with_fused_matmul_lce to PT2 pre_grad passes (#152450)
Summary:
Port over replace_lce_with_matmul and replace_first_lce_with_fused_matmul_lce to PT2 pre_grad pass.
Original dper pass diffs: D67884534, D68123479, D68384238

Test Plan:
Test 1. Covers replace_lce_with_matmul and case 1 of replace_first_lce_with_fused_matmul_lce
```
CUDA_VISIBLE_DEVICES=6 TORCH_LOGS=+inductor,aot TORCH_COMPILE_DEBUG=1 TORCHINDUCTOR_MAX_AUTOTUNE=1 buck2 run mode/opt-split-dwarf   mode/inplace -c fbcode.platform010_cuda_version=12 -c fbcode.nvcc_arch=h100 caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark -- --model-path=manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/669809193/0/gpu_lowering/input.predictor.disagg.gpu.merge  --lower-backend="AOT_INDUCTOR" --add_passes="use_matmul_fuse_lce_replace_first_LCE,use_contiguous_linear_reduction_replace_linear_reduction" --batch-size=3072 --gpu-trace --disable_acc_tracer=true 2>&1 | tee ~/logs/disable_acc_tracer/aoti_cmf_ctr_triton_669809193_0_diable_acc.log
```
Log: P1798246938

Test 2. Covers replace_lce_with_matmul and case 2 of replace_first_lce_with_fused_matmul_lce
```
CUDA_VISIBLE_DEVICES=7 TORCH_LOGS=+inductor,aot TORCH_COMPILE_DEBUG=1 TORCHINDUCTOR_MAX_AUTOTUNE=1 buck2 run mode/opt-split-dwarf   mode/inplace -c fbcode.platform010_cuda_version=12 -c fbcode.nvcc_arch=h100 caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark -- --model-path=manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/677734158/9/gpu_lowering/input.predictor.disagg.gpu.merge  --lower-backend="AOT_INDUCTOR" --add_passes="use_matmul_fuse_lce_replace_first_LCE,use_matmul_lce_replace_normal_LCE" --batch-size=3072 --gpu-trace --disable_acc_tracer=true 2>&1 | tee ~/logs/disable_acc_tracer/aoti_cmf_ctr_triton_677734158_9_diable_acc.log
```
Log: P1798246675

Seeing logs like
`[Pre grad(predispatch IR)] Apply use_matmul_fuse_lce_replace_first_LCE pass, save before/after graph to /tmp/tmp8lyzoh79, graph before/after are the same = False`

Reviewed By: huxintong

Differential Revision: D71358949

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152450
Approved by: https://github.com/huxintong
2025-04-29 23:45:20 +00:00
e872bf8f88 Avoid linking multiple OMP runtimes in libtorch_cpu.so if BLAS used is OpenBLAS. (#147725)
When PyTorch is built with OpenBLAS support and libopenblas is directly linked with libgomp.so, libtorch_cpu.so ends up with multiple OpenMP runtimes linked against it. This may result in unexpected runtime behaviour/regressions. This patch fixes that by avoiding linking against libomp.so if OpenBLAS is already linked against libgomp.so.

Fixes #146603

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147725
Approved by: https://github.com/albanD
2025-04-29 23:39:48 +00:00
a1a4fee3b8 Native channel shuffle floating point exception (#144010)
Fixes #142453

Added TORCH_CHECKs to prevent users from calling the native_channel_shuffle function incorrectly and getting a "Floating point exception (core dumped)".
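
For context, a small sketch using the related public op (`torch.nn.functional.channel_shuffle`) showing the precondition these checks enforce; the exact error message is not taken from the PR:

```python
import torch
import torch.nn.functional as F

x = torch.arange(1 * 6 * 2 * 2, dtype=torch.float32).reshape(1, 6, 2, 2)

# Valid: 6 channels split into 3 groups of 2 and interleaved.
print(F.channel_shuffle(x, groups=3).shape)   # torch.Size([1, 6, 2, 2])

# Invalid inputs (groups that do not divide the channel count, groups <= 0)
# should fail with a readable error instead of crashing the process.
try:
    F.channel_shuffle(x, groups=4)
except RuntimeError as err:
    print("rejected:", err)
```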

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144010
Approved by: https://github.com/albanD
2025-04-29 23:38:54 +00:00
8f420a500a Save/load op profiles (#151817)
Add ability to save/load op profiles into a yaml file:
```python
op_profile = self.get_sample_op_profile()

# Save
save_op_profiles(op_profile, "op_profile.yaml")
# Load
loaded = load_op_profiles("op_profile.yaml")

assert op_profile == loaded
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151817
Approved by: https://github.com/zou3519
2025-04-29 23:11:32 +00:00
8358eca2ce [Cutlass] Only run EVT tests on sm90 (#151713)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151713
Approved by: https://github.com/masnesral
ghstack dependencies: #152305, #152306, #150905, #151405
2025-04-29 23:06:01 +00:00
a1f6d85b36 [Cutlass] Fixes for e2e compilation in arg rendering (#151405)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151405
Approved by: https://github.com/eellison
ghstack dependencies: #152305, #152306, #150905
2025-04-29 23:06:01 +00:00
a0ce5ce6e4 [Cutlass] Implement cutlass epilogue visitor python codegen (#150905)
This PR implements the second codegen task of CUTLASS EVT: translating inductor epilogue nodes into python code that will be traced by the EVT infra.

Details:
The implementation uses a simple ops wrapper which only supports add and mul pointwise ops today (to be extended in the future). This ops wrapper generates python code from inner_fn of the epilogue nodes in the format EVT expects. The main caveat is that one of the outputs needs to be named "D" and the accumulator input needs to be named "acc". Reads/writes are named according to the inductor buffer names otherwise.
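
As a rough illustration of the idea (a toy sketch of my own, not the actual inductor classes or their APIs):

```python
# Toy sketch of the "ops wrapper" idea; the real implementation lives in
# inductor's CUTLASS EVT codegen and supports more than this.
class _EVTOps:
    def add(self, a, b):
        return f"({a} + {b})"

    def mul(self, a, b):
        return f"({a} * {b})"

def codegen_epilogue(inner_fn, arg_names):
    ops = _EVTOps()
    # Convention described above: the accumulator is "acc" and the
    # produced output must be named "D".
    expr = inner_fn(ops, *arg_names)
    return f"def epilogue({', '.join(arg_names)}):\n    D = {expr}\n    return D\n"

# Example epilogue: D = acc * scale + bias
print(codegen_epilogue(
    lambda ops, acc, scale, bias: ops.add(ops.mul(acc, scale), bias),
    ["acc", "scale", "bias"],
))
```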

Previously merged:
* #150904
* #150903
* #150346
* #150345
* #150344

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150905
Approved by: https://github.com/eellison
ghstack dependencies: #152305, #152306
2025-04-29 23:05:55 +00:00
72273bef9e [Cutlass] Fix int check in example tensor creation (#152306)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152306
Approved by: https://github.com/Skylion007, https://github.com/eellison
ghstack dependencies: #152305
2025-04-29 23:05:47 +00:00
4293a6095d [Cutlass] Remove unused dtype conversion map (#152305)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152305
Approved by: https://github.com/Skylion007
2025-04-29 23:05:41 +00:00
a4a771648a [pt2d] Add reorder_comms_preserving_peak_memory pass (#146562)
This is a new pass to replace the pre-existing passes.  It has the same
basic goal, to achieve communication overlap (latency hiding), but also
constrains the solution to not increase peak memory.

The principles of operation are detailed in code comments, but
summarized here:
- never reorder collectives relative to each other (TBD if we should
  relax this later)
- before performing reordering, push all comm and wait nodes as late as possible, respecting data dependencies
- estimate peak memory and current memory at each scheduler node
- move collective nodes forward one position at a time, if the move does
  not increase current memory beyond peak memory (see the illustrative sketch after this list)
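
To make the greedy step concrete, a toy sketch (illustrative only, not the inductor pass or its data structures):

```python
# Each node is a dict with "name", "deps" (names it reads from) and
# "mem_delta" (bytes allocated minus bytes freed) -- all assumed fields.
def peak_memory(schedule):
    cur = peak = 0
    for node in schedule:
        cur += node["mem_delta"]
        peak = max(peak, cur)
    return peak

def move_collective_forward(schedule, idx, baseline_peak):
    # Move the collective at `idx` one slot earlier at a time, stopping at a
    # data dependency or as soon as the move would raise peak memory.
    while idx > 0:
        prev, node = schedule[idx - 1], schedule[idx]
        if prev["name"] in node["deps"]:
            break
        candidate = schedule[:idx - 1] + [node, prev] + schedule[idx + 1:]
        if peak_memory(candidate) > baseline_peak:
            break
        schedule[:] = candidate
        idx -= 1
    return idx
```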

The pass logs a summary table for each graph to TORCH_LOGS=overlap.

e.g. (exact format may have been tweaked but this shows the idea).

```
rank0]:[rank0]:I0210 17:24:28.494000 2711253 torch/_inductor/comms.py:195] [0/0] [__overlap] Collective node                                                                                                                                                initial exposed    final exposed    improvement  limiting factor        moves
[rank0]:[rank0]:I0210 17:24:28.494000 2711253 torch/_inductor/comms.py:195] [0/0] [__overlap] -----------------------------------------------------------------------------------------------------------------------------------------------------------  -----------------  ---------------  -------------  -------------------  -------
[rank0]:[rank0]:I0210 17:24:28.494000 2711253 torch/_inductor/comms.py:195] [0/0] [__overlap] ExternKernelSchedulerNode(name='op2')  (torch.ops._c10d_functional.all_gather_into_tensor.default) (size=[2256, 256], stride=[256, 1]) (buf2) (12142 ns)               12141.6          6514.53       5627.08   prefetch limit            75
[rank0]:[rank0]:I0210 17:24:28.494000 2711253 torch/_inductor/comms.py:195] [0/0] [__overlap] ExternKernelSchedulerNode(name='op6')  (torch.ops._c10d_functional.reduce_scatter_tensor.default) (size=[282, 256], stride=[256, 1]) (buf7) (32266 ns)                 32265.8         28429.2        3836.61   data dependency           78
[rank0]:[rank0]:I0210 17:24:28.494000 2711253 torch/_inductor/comms.py:195] [0/0] [__overlap] ExternKernelSchedulerNode(name='op9')  (torch.ops._c10d_functional.all_gather_into_tensor.default) (size=[256], stride=[1]) (buf11) (10801 ns)                         10800.6         10732.3          68.254  peak memory                1
[rank0]:[rank0]:I0210 17:24:28.494000 2711253 torch/_inductor/comms.py:195] [0/0] [__overlap] ExternKernelSchedulerNode(name='op14')  (torch.ops._c10d_functional.reduce_scatter_tensor.default) (size=[32], stride=[1]) (buf17) (10810 ns)                          10809.5         10809.5           0      data dependency            4
[rank
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146562
Approved by: https://github.com/eellison
ghstack dependencies: #152060, #146561
2025-04-29 22:51:31 +00:00
e35e31697e Revert "[MPS][BE] Delete unused lerp functors (#152443)"
This reverts commit 0a2d3206a82c4a5c923938cf0a0ebc0f47aa17dd.

Reverted https://github.com/pytorch/pytorch/pull/152443 on behalf of https://github.com/wdvr due to failing MPS test: test/test_optim.py::TestOptimRenewedMPS::test_can_load_from_to_named_state_dict_is_named_optim0_False_is_named_optim1_False_Adafactor_mps_float32 ([comment](https://github.com/pytorch/pytorch/pull/152443#issuecomment-2840405966))
2025-04-29 22:50:23 +00:00
fecaa60c3c Revert "Add detailed triton kernel logging to tlparse (#152197)"
This reverts commit 8303860de779da840316dd95ce3051e0a4119174.

Reverted https://github.com/pytorch/pytorch/pull/152197 on behalf of https://github.com/wdvr due to failing     python test/dynamo/test_structured_trace.py StructuredTraceTest.test_cudagraphs on trunk ([comment](https://github.com/pytorch/pytorch/pull/152197#issuecomment-2840400839))
2025-04-29 22:47:48 +00:00
471025c489 Revert "[AOTI][reland] Remove typedef for half and bfloat16 (#151109)"
This reverts commit a0d440a26a555c34e87b90bef3bff960b34bb180.

Reverted https://github.com/pytorch/pytorch/pull/151109 on behalf of https://github.com/wdvr due to causing AOTI test failures - discussed with author ([comment](https://github.com/pytorch/pytorch/pull/151109#issuecomment-2840386483))
2025-04-29 22:37:16 +00:00
accffef504 Run link checks on modified files on push too (#152464)
https://github.com/pytorch/pytorch/issues/152439
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152464
Approved by: https://github.com/huydhn
2025-04-29 22:08:40 +00:00
89c0c3ca80 Add private config to broadcast rank0 decision from the partitioner to all ranks (#152264)
Summary: This PR adds a private configuration to the partitioner that ensures that the decision taken is the same across all ranks. This is a temporary workaround, as when size_hints are also taken into account in compiler collectives this workaround will not be needed anymore.

Test Plan:
This has been tested on some internal models, but I haven't added any tests in PyTorch (yet?)

Differential Revision: D73666017

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152264
Approved by: https://github.com/bdhirsh
2025-04-29 21:27:57 +00:00
28efeb1522 Remove unused Manylinux2014 Docker files and builds (#152428)
Related to Manylinux 2.28 migration: https://github.com/pytorch/pytorch/issues/123649
Cleanup old Docker files and `manylinuxaarch64-builder:cpu-aarch64` image which has been replaced by `manylinux2_28_aarch64-builder:cpu-aarch64`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152428
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-04-29 20:57:29 +00:00
c039cb1a06 submodules: point gloo to new home in pytorch/ (#152438)
Gloo moved to the PyTorch GitHub org. This updates PyTorch to point to the new location.

https://github.com/pytorch/gloo

Test plan:

CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152438
Approved by: https://github.com/fduwjj
2025-04-29 20:42:24 +00:00
0a2d3206a8 [MPS][BE] Delete unused lerp functors (#152443)
For `lerp.Scalar_out`, weight (aka alpha) is not an optional argument, so there is no point in having those specializations.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152443
Approved by: https://github.com/Skylion007
2025-04-29 20:42:21 +00:00
1d8cdf373b [dynamo] Guard serialization for NAME_MATCH (#152332)
Differential Revision: [D73780430](https://our.internmc.facebook.com/intern/diff/D73780430/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152332
Approved by: https://github.com/jansel
ghstack dependencies: #152325, #152326, #152327, #152328, #152329, #152330, #152331
2025-04-29 20:16:00 +00:00
5c297b2846 [dynamo] Guard serialization for DISPATCH_KEY_SET_MATCH (#152331)
Differential Revision: [D73780433](https://our.internmc.facebook.com/intern/diff/D73780433/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152331
Approved by: https://github.com/jansel
ghstack dependencies: #152325, #152326, #152327, #152328, #152329, #152330
2025-04-29 20:16:00 +00:00
4cb75d7afc [dynamo] Guard serialization for ID_MATCH (#152330)
Differential Revision: [D73780431](https://our.internmc.facebook.com/intern/diff/D73780431/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152330
Approved by: https://github.com/jansel
ghstack dependencies: #152325, #152326, #152327, #152328, #152329
2025-04-29 20:16:00 +00:00
0b39124ea3 [dynamo] Guard serialization for NONE_MATCH. (#152329)
Differential Revision: [D73780435](https://our.internmc.facebook.com/intern/diff/D73780435/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152329
Approved by: https://github.com/jansel
ghstack dependencies: #152325, #152326, #152327, #152328
2025-04-29 20:16:00 +00:00
ab4091a9fa [dynamo] Guard serialization for BOOL_MATCH. (#152328)
Differential Revision: [D73780434](https://our.internmc.facebook.com/intern/diff/D73780434/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152328
Approved by: https://github.com/jansel
ghstack dependencies: #152325, #152326, #152327
2025-04-29 20:16:00 +00:00
c521c45a8a [dynamo] Guard serialization for DICT_CONTAINS (#152327)
Adding serialization for DICT_CONTAINS

Differential Revision: [D73780432](https://our.internmc.facebook.com/intern/diff/D73780432/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152327
Approved by: https://github.com/jansel
ghstack dependencies: #152325, #152326
2025-04-29 20:16:00 +00:00
52202525b9 [dynamo] Guard serialization for DICT_VERSION (#152326)
I think we shouldn't support DICT_VERSION for 2 reasons:
1. dict version is not well defined across processes
2. they are pretty rare (only with pytree calls)

Differential Revision: [D73780437](https://our.internmc.facebook.com/intern/diff/D73780437/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152326
Approved by: https://github.com/jansel
ghstack dependencies: #152325
2025-04-29 20:16:00 +00:00
df663b9e72 [dynamo] Guard serialization for TYPE_MATCH (#152325)
Adding guard serialization for TYPE_MATCH

Differential Revision: [D73780438](https://our.internmc.facebook.com/intern/diff/D73780438/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152325
Approved by: https://github.com/jansel
2025-04-29 20:16:00 +00:00
a04f4622e1 [conda] Remove conda from lint-autoformat.yml (#152433)
Installs setuptools since I get
https://github.com/pytorch/pytorch/actions/runs/14736804186/job/41364832984#step:5:60
```
+ python3 -m tools.generate_torch_version --is_debug=false
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/ec2-user/actions-runner/_work/pytorch/pytorch/tools/generate_torch_version.py", line 9, in <module>
    from setuptools import distutils  # type: ignore[import]
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ModuleNotFoundError: No module named 'setuptools'
```
It should be a no-op in the normal lint workflow since setuptools is in the docker image.

Switched from using python3.10 to the system python, which should be python3.9.

Use venv to put deps not in the base?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152433
Approved by: https://github.com/huydhn
2025-04-29 20:14:21 +00:00
2cfc1faa27 [PT2]: fix add_passes and remove_passes naming issue (#152386)
Summary:
When defining pre_grad passes, they are initially defined as empty functions, then overriden in [customized_triton_kernel_passes.py](https://www.internalfb.com/code/fbsource/[b4eea3dcd7f22421e68a3c1533fd09a4281bc291]/fbcode/caffe2/torch/_inductor/fx_passes/fb/customized_triton_kernel_passes.py?lines=71-73). This causes issues for add_passes and remove_passes because `p.__name__` now may be prefixed by _.

This diff removes the leading _ to match the pass name.
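
An illustrative sketch of the matching logic (hypothetical helpers, not the real Inductor code), comparing pass names with any leading underscore stripped:

```python
def pass_name(p):
    # overridden passes may be named e.g. "_fuse_foo"; strip the prefix for matching
    return p.__name__.lstrip("_")

def remove_passes(passes, names_to_remove):
    remove = set(names_to_remove)
    return [p for p in passes if pass_name(p) not in remove]

def _fuse_foo(graph):   # stands in for an overridden pre_grad pass
    return graph

print([pass_name(p) for p in remove_passes([_fuse_foo], ["bar"])])  # ['fuse_foo']
```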

Test Plan: Tested together with the next diff in the stack.

Reviewed By: oniononion36

Differential Revision: D73809937

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152386
Approved by: https://github.com/huxintong
2025-04-29 20:07:15 +00:00
e58c73be44 Add latex settings (#152350)
- Fixes #147027
- Only lualatex can build our 3K-page PDF with reasonable quality; xelatex runs out of memory and pdflatex just fails.
- Move notes under the same toctree as python-api which is needed for the PDF but doesn't change how the HTML is generated.

This is the produced PDF:
[pytorch.pdf](https://github.com/user-attachments/files/19945450/pytorch.pdf)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152350
Approved by: https://github.com/albanD
2025-04-29 19:28:43 +00:00
e6e1ca1996 [easy] Fix test_dynamo_timed (#152387)
Summary: I'm just trying to fix the test again. It's out of date because it's disabled and some dynamo_timed-related fields are gone now.

Test Plan: `python test/dynamo/test_utils.py -k dynamo_timed`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152387
Approved by: https://github.com/anijain2305
2025-04-29 19:22:56 +00:00
8e2e06b7ea Fix shadow local variables (#152429)
Summary: Fixing shadow local variables error: P1798875650

Test Plan: CI

Differential Revision: D73853605

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152429
Approved by: https://github.com/Skylion007, https://github.com/eqy
2025-04-29 18:50:18 +00:00
a3123dd3ab Run link linters on modified files only or on everything when scheduled (#152377)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152377
Approved by: https://github.com/huydhn
2025-04-29 18:30:40 +00:00
8303860de7 Add detailed triton kernel logging to tlparse (#152197)
This PR adds detailed logging of each triton kernel we compile with triton, along with its autotune result. We add these results to a global variable that we then clear after each triton kernel compile.

We can't keep these objects around after compile time, so we unfortunately can't record the autotune cache save or coordinate descent tuning, but we can log at least:
- The duration of compilation
- Whether or not autotune cache hit
- The best autotuning config, if there's only one.

Example triton kernel info: https://gist.github.com/jamesjwu/493bdd0f36b0b7e3ca327f87bd6c2c75

See the internal diff for an example log from an internal model.
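
A hedged sketch of the kind of per-kernel record described above; the field names, the global list, and the clearing discipline are assumptions about the shape of the data, not the actual tlparse schema:

```python
import time

kernel_info_log = []   # illustrative global, cleared after each kernel compile

def record_kernel_compile(kernel_name, compile_fn):
    start = time.time()
    compiled, autotune_cache_hit, best_config = compile_fn()
    kernel_info_log.append({
        "kernel": kernel_name,
        "compile_time_s": time.time() - start,
        "autotune_cache_hit": autotune_cache_hit,
        "best_config": best_config,
    })
    return compiled

# usage with a stand-in compile function
print(record_kernel_compile("triton_poi_fused_add_0",
                            lambda: ("<kernel>", True, {"XBLOCK": 1024})))
```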

Differential Revision: [D73674443](https://our.internmc.facebook.com/intern/diff/D73674443)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152197
Approved by: https://github.com/oulgen, https://github.com/eellison
2025-04-29 18:16:56 +00:00
d35e900c74 [MPSInductor] Make sure sizevars are computed (#152436)
Before calling the kernel

This fixes `GPUTests.test_float_repr_dynamic_shapes_mps`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152436
Approved by: https://github.com/dcci
ghstack dependencies: #152363, #152430
2025-04-29 17:53:29 +00:00
835f95490f [MPSInductor] Fix type promotion in _print_Max (#152430)
Ran into this problem while re-enabling `test_float_repr_dynamic_shapes`, where `_print_Max` was called with an integer and a long argument, which resulted in the following compilation error:
```
error: call to 'max' is ambiguous
        out_ptr0[x0 + x1*metal::max(1, ks0)] = static_cast<float>(tmp26);
                         ^~~~~~~~~~
/System/Library/PrivateFrameworks/GPUCompiler.framework/Versions/32023/Libraries/lib/clang/32023.619/include/metal/metal_integer:2477:16: note: candidate function
METAL_FUNC int max(int x, int y)
               ^
/System/Library/PrivateFrameworks/GPUCompiler.framework/Versions/32023/Libraries/lib/clang/32023.619/include/metal/metal_integer:3686:17: note: candidate function
METAL_FUNC long max(long x, long y)
```
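
A hedged sketch of the kind of fix the type promotion implies; `print_max` below is an illustrative printer helper, not the actual MPSInductor `_print_Max`:

```python
def print_max(args, arg_dtypes):
    # if the arguments have mismatched integer types (e.g. int vs. long),
    # cast both to the wider type so the metal::max overload is unambiguous
    if len(set(arg_dtypes)) > 1:
        args = [f"static_cast<long>({a})" for a in args]
    return f"metal::max({', '.join(args)})"

print(print_max(["1", "ks0"], ["int", "long"]))
# metal::max(static_cast<long>(1), static_cast<long>(ks0))
```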

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152430
Approved by: https://github.com/dcci
ghstack dependencies: #152363
2025-04-29 17:53:29 +00:00
cce8b5d8d7 Refactor TritonTemplate.generate and move codgen part to generate_and_load (#151764)
Splitting https://github.com/pytorch/pytorch/pull/149267/.
This first PR just refactors the code without adding any caching functionality.
The logic of generating the code and loading it is moved to generate_and_load(), plus some typing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151764
Approved by: https://github.com/drisspg, https://github.com/eellison
2025-04-29 17:44:46 +00:00
3962b8f1e0 Revert "[OpenReg] Add _lazy_init and rng_state support for OpenReg (#151914)"
This reverts commit 64a55b531f4f4ae2b35175ab5d9a30a856b0d6ef.

Reverted https://github.com/pytorch/pytorch/pull/151914 on behalf of https://github.com/malfet due to Looks like breaks number of ROCM jobs, see 797768cd90/1 ([comment](https://github.com/pytorch/pytorch/pull/151914#issuecomment-2839691038))
2025-04-29 17:36:12 +00:00
797768cd90 [Graph Partition] reorder for minimal number of partitions (#151968)
This PR adds an optimal reordering for minimizing #partitions.

## Optimal reordering for minimizing #partitions

A bfs could minimize #partitions (ignore peak memory for now):
1. For each node, compute node_to_indegree: dict[node, int].
2. Maintain 2 queues: cudagraphable_nodes, and non_cudagraphable_nodes. Iterate through all nodes and add nodes to one of these 2 queues if node_to_indegree[node] == 0.
3. While non_cudagraphable_nodes is not empty: Pop 1 node, schedule it, update the indegree of all its successors, and add its successor nodes to one of the queues if node_to_indegree[successor] == 0.
4. While cudagraphable_nodes is not empty: Pop 1 node, schedule it, update the indegree of all its successors, and add its successor nodes to one of the queues if node_to_indegree[successor] == 0.
5. Repeat step 3 & 4 until all nodes have been scheduled.

We call this strategy `reorder_for_minimizing_partition`.
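
A self-contained sketch of this two-queue BFS, using plain dicts in place of the scheduler's graph and node types (an illustrative assumption, not the actual implementation):

```python
from collections import deque

def reorder_for_minimizing_partition(nodes, successors, cudagraphable):
    """nodes: list of hashable ids; successors[n]: list of ids; cudagraphable[n]: bool.
    Groups non-cudagraphable nodes together whenever dependencies allow it."""
    indegree = {n: 0 for n in nodes}
    for n in nodes:
        for s in successors[n]:
            indegree[s] += 1
    cg, non_cg = deque(), deque()
    for n in nodes:
        if indegree[n] == 0:
            (cg if cudagraphable[n] else non_cg).append(n)
    schedule = []

    def drain(queue):
        while queue:
            n = queue.popleft()
            schedule.append(n)
            for s in successors[n]:
                indegree[s] -= 1
                if indegree[s] == 0:
                    (cg if cudagraphable[s] else non_cg).append(s)

    while cg or non_cg:
        drain(non_cg)   # step 3: exhaust ready non-cudagraphable nodes first
        drain(cg)       # step 4: then exhaust ready cudagraphable nodes
    return schedule

# usage on the 3-node counterexample discussed below
succ = {"nc1": ["cg2", "nc3"], "cg2": [], "nc3": []}
print(reorder_for_minimizing_partition(["nc1", "cg2", "nc3"], succ,
                                       {"nc1": False, "cg2": True, "nc3": False}))
# ['nc1', 'nc3', 'cg2']  -> a single non-cudagraphable region
```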

**Q: Why is this optimal?**

Suppose this is not optimal; then we have a counterexample with 2 non_cudagraphable regions:

```
[non_cudagraphable1, cudagraphable2, non_cudagraphable3]
```

where we can reorder to only 1 non_cudagraphable region:

```
[non_cudagraphable1, non_cudagraphable3, cudagraphable2]
```

This reorder means non_cudagraphable3 does not depend on cudagraphable2. So after we scheduled non_cudagraphable1, both non_cudagraphable3 and cudagraphable2 have in-degree 0. If this is true, step 3 would have already scheduled non_cudagraphable3 before cudagraphable2, so the counterexample cannot exist.

This shows we cannot find such a counterexample, and the BFS is optimal at minimizing #partitions.

## Minimize peak memory

`reorder_for_peak_memory` currently uses topological_sort_dfs, topological_sort_lpmf, and topological_sort_bfs, where the later 2 are bfs. ILP brings small benefits and it can hardly scale to more than 100 nodes, according to @xuanzhang816. So ILP is not used for peak memory reorder in the inductor.

Heuristics strategy:
- Conduct reorder_for_peak_memory as the default order
- Conduct reorder_for_minimal_partitions and get results as list[tuple[partition, bool]], where partition: list[BaseSchedulerNode] and bool for cudagraphable.
- If the reorder increases peak memory too much, we use the default order.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151968
Approved by: https://github.com/eellison
2025-04-29 17:17:16 +00:00
a77a44761b [BE] Remove dangling # in contributing.md (#152259)
I frequently come to CONTRIBUTING.md to copy-paste the snippet below to rebuild pytorch, which in zsh gives this error because zsh interprets # as a command. These comments add nothing, so just removing them.

```
error: pathspec 'sync' did not match any file(s) known to git
error: pathspec 'the' did not match any file(s) known to git
error: pathspec 'submodules' did not match any file(s) known to git
Building wheel torch-2.8.0a0+git9c01c87
invalid command name '#'
```

```
git submodule update --init --recursive # very important to sync the submodules
python setup.py develop                 # then try running the command again
git submodule update --init --recursive
python setup.py develop
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152259
Approved by: https://github.com/janeyx99
2025-04-29 17:07:19 +00:00
de20d76622 [conda] Remove conda usage from upload test stats while running workflow (#152431)
The original uses python 3.10 and the base is 3.9 but I think that's ok
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152431
Approved by: https://github.com/atalman
2025-04-29 16:16:54 +00:00
f84062f78d [conda] Remove conda usage from TD llm retriever job (#152338)
Remove conda usage from TD llm retriever job

python3 in the base is python3.9 right now.  I'm not sure what the best way to deal with a potentially different python version would be, dnf install?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152338
Approved by: https://github.com/huydhn
2025-04-29 15:17:50 +00:00
663bcb68ba Implement metal kernel for basic MPS arithmetic ops using TensorIterator (#147644)
Add metal kernels for add, subtract, & lerp ops using TensorIterator. Should help resolve: https://github.com/pytorch/pytorch/issues/143874
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147644
Approved by: https://github.com/malfet
2025-04-29 14:24:49 +00:00
2fb62f8288 [Dynamo][Typing] Enable typing hints for tx in misc.py (#152412)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152412
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2025-04-29 13:54:35 +00:00
49cbe0ffe9 [AOTInductor] Propagate ConstantType for main graph. (#152272)
Summary:
We need to make sure all named_parameters and named_buffers are
propagated if we use runtime constant folding.

Test Plan:
python test/inductor/test_aot_inductor.py -k test_constant_type_propagation

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152272
Approved by: https://github.com/22quinn

Co-authored-by: 22quinn <33176974+22quinn@users.noreply.github.com>
Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-04-29 12:42:17 +00:00
64a55b531f [OpenReg] Add _lazy_init and rng_state support for OpenReg (#151914)
As the title stated.

**Changes**:
- Add get_rng_state & set_rng_state support for OpenReg
- Add _lazy_init support for OpenReg
- Remove redundant code for cuda/Module.cpp
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151914
Approved by: https://github.com/albanD
2025-04-29 11:18:12 +00:00
5c01302cc8 Remove 3.13 hack when installing TIMM (#152399)
A Docker build failure showed up at this step, triggered by the landing of https://github.com/pytorch/pytorch/pull/152362.  Here are the example logs from https://github.com/pytorch/pytorch/actions/runs/14718029881/job/41305891896:

```
#37 29.72 + as_jenkins conda run -n py_3.13 pip install --progress-bar off --pre torch torchvision --index-url https://download.pytorch.org/whl/nightly/cu124
#37 29.72 + sudo -E -H -u jenkins env -u SUDO_UID -u SUDO_GID -u SUDO_COMMAND -u SUDO_USER env PATH=/opt/conda/envs/py_3.13/bin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64 conda run -n py_3.13 pip install --progress-bar off --pre torch torchvision --index-url https://download.pytorch.org/whl/nightly/cu124
#37 49.50 ERROR: Cannot install torch and torchvision==0.22.0.dev20250226+cu124 because these package versions have conflicting dependencies.
```

This happens because we stopped building 12.4 nightlies some time ago.  This hack doesn't apply anymore, so let's just remove it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152399
Approved by: https://github.com/cyyever, https://github.com/wdvr, https://github.com/malfet
2025-04-29 08:22:37 +00:00
eb69f4e609 Add lr_lambda type check in MultiplicativeLR (#151973)
Fixes #81554

## TestResult

### Before

```python
In [3]: import torch
   ...: class SimpleLinearModel(torch.nn.Module):
   ...:     def __init__(self):
   ...:         super(SimpleLinearModel, self).__init__()
   ...:         self.linear = torch.nn.Linear(10, 1)
   ...:
   ...:     def forward(self, x):
   ...:         return self.linear(x)
   ...:
   ...: net = SimpleLinearModel()
   ...: optimizer = torch.optim.Adam(net.parameters(), lr=0.01)
   ...: scheduler = torch.optim.lr_scheduler.MultiplicativeLR(optimizer, 0.95)
   ...: for i in range(10):
   ...:     print(i, scheduler.get_last_lr())
   ...:     scheduler.step()
TypeError: 'float' object is not callable
```

### After

```python
   ...: scheduler = torch.optim.lr_scheduler.MultiplicativeLR(optimizer, 0.95)
TypeError: lr_lambda should be a function, but got float
```
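
A plausible sketch of the added validation, assuming the check simply rejects non-callable entries early (not the exact torch.optim.lr_scheduler code):

```python
def check_lr_lambda(lr_lambda, num_param_groups):
    # accept either a single lambda or one lambda per parameter group
    fns = lr_lambda if isinstance(lr_lambda, (list, tuple)) else [lr_lambda] * num_param_groups
    for fn in fns:
        if not callable(fn):
            raise TypeError(f"lr_lambda should be a function, but got {type(fn).__name__}")
    return list(fns)

check_lr_lambda(lambda epoch: 0.95, 1)   # ok
try:
    check_lr_lambda(0.95, 1)
except TypeError as e:
    print(e)   # lr_lambda should be a function, but got float
```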

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151973
Approved by: https://github.com/janeyx99
2025-04-29 08:21:41 +00:00
dcd9a444b3 Add pack support and use micro gemm for Half flex attention on CPU (#151530)
Add pack support and use micro gemm for the second gemm to improve the performance for Half flex attention on CPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151530
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
2025-04-29 07:24:00 +00:00
cyy
41bd0c900a [1/N] Deprecate c10::string_view and at::string (#151972)
The calls of `c10::string_view` in the code base are replaced by `std::string_view`. The calls of `at::string` are replaced by `std::string`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151972
Approved by: https://github.com/malfet
2025-04-29 07:23:52 +00:00
a6d19fcfac Revert "[cudagraphs] Fix issue in collecting static_input_idxs (#152287)"
This reverts commit 75a564608ab289edd5ba0e30a3acf544b90b5769.

Reverted https://github.com/pytorch/pytorch/pull/152287 on behalf of https://github.com/wdvr due to causing ao failures - discussed with author ([comment](https://github.com/pytorch/pytorch/pull/152287#issuecomment-2837686127))
2025-04-29 06:57:06 +00:00
62f1d0ea78 Log information about suppressed data dependent errors (#151041)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151041
Approved by: https://github.com/bobrenjc93
2025-04-29 06:08:07 +00:00
520366e102 Fix StringCoordView::substr after D73379178 / #151810 (#152304)
Received complaint that we broke something. After a bunch of debugging, landed on this test + fix.

Differential Revision: [D73754877](https://our.internmc.facebook.com/intern/diff/D73754877/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D73754877/)!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152304
Approved by: https://github.com/Skylion007
2025-04-29 06:00:38 +00:00
ad11d6378c Don't run NCCL/gloo distributed test without GPUs (#150764)
If there aren't any GPUs, the WORLD_SIZE would be zero, which does not work.
So skip those backends completely in that case.

Fix after https://github.com/pytorch/pytorch/pull/137161

It might make sense to still run the CPU part of the tests by using something like `world_size = max(3, gpu_count)` or `num_gpus if num_gpus else 3` instead of skipping them all.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150764
Approved by: https://github.com/kwen2501
2025-04-29 05:27:23 +00:00
99c42722f6 [MPS] fix memory leak in sdpa float32 (#152371)
Fixes #152344

The leak seems to be on the MPSGraph side: even though there is an identity tensor, it is apparently no longer enough to bypass the SDPA sequence, which seems to leak memory.

Even adding 0.0f seems to be optimized away and still takes the SDPA sequence (that's the reason for adding 1e-20).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152371
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-04-29 04:51:10 +00:00
46419c7899 Revert "[Relandx2] Rewrite the guts of torch::jit::Lexer to speed it up (#152372)"
This reverts commit 7ce6f632142b65849fa33f325c90a24bace2c130.

Reverted https://github.com/pytorch/pytorch/pull/152372 on behalf of https://github.com/malfet due to Looks like it broke distributed this time around, see f05d3e5019/1 ([comment](https://github.com/pytorch/pytorch/pull/152372#issuecomment-2837426497))
2025-04-29 04:37:40 +00:00
f05d3e5019 [torch-xpu-ops] Update torch-xpu-ops commit pin. (#152321)
Update the torch-xpu-ops commit to [655fa9bc7f88ab5bd3766b5f2fd5b43989c2caca](655fa9bc7f), including:

- Fixes batch_norm numeric error by adding additional boundary check
- Enable two operators: fft & jagged_to_padded_dense
- XCCL relevant changes:
  - Cache cclStream to improve performance.
  - Add support for complex datatypes in allgather and broadcast.
  - Support coalescing operations and batch_isend_irecv.
  - Introduce additional logging; use export TORCH_CPP_LOG_LEVEL=INFO.
- Fix #152296
- Fix #152020

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152321
Approved by: https://github.com/EikanWang, https://github.com/Skylion007
2025-04-29 04:00:09 +00:00
119cdcc926 Add rich support to torch.distributed.tensor.debug.visualize_sharding (#152027)
Fixes https://github.com/pytorch/pytorch/issues/151857

Please verify this PR by running the following command on a computer with at least 4 GPUs.

```shell
torchrun --nproc_per_node=4 /w/pytorch/torch/distributed/tensor/examples/visualize_sharding_example.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152027
Approved by: https://github.com/wanchaol, https://github.com/wconstab
2025-04-29 03:51:32 +00:00
9c7b902cb2 [MPSInductor][BE] Make all reductions cacheable (#152363)
By moving the actual implementation to `_reduction_nocache` and making `reduction` a caching wrapper.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152363
Approved by: https://github.com/dcci
2025-04-29 02:49:22 +00:00
5a9868b78c Do not log exception when recording is disabled or already recording (#151038)
I am not sure why we log all exceptions here and re-raise them, but at least when recording is disabled this should be
transparent; in particular, logging DDEs could be spamming.

before:
<img width="995" alt="Screenshot 2025-04-10 at 12 47 31 PM" src="https://github.com/user-attachments/assets/f90d4557-d958-4558-a917-0d687366cad1" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151038
Approved by: https://github.com/bobrenjc93
2025-04-29 02:48:20 +00:00
b22fda9e1c Remove conda refs in tools (#152368)
Fixes #152126

Did not find references in the two .ipynb files

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152368
Approved by: https://github.com/atalman
2025-04-29 02:45:47 +00:00
c8b4a39d73 Add precedence to the infix printing done by sympy_str. (#151920)
Add precedence to the infix printing done by sympy_str.

Without this change sympy_str will print the same string for both `a+b*(c+d)` and `(a+b)*(c+d)`.

While there I also cleaned up the printing for `-a` and `a - b`.

Added some tests.
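
A minimal sketch of precedence-aware infix printing over a tiny tuple-encoded expression tree; the encoding and precedence table are assumptions for illustration, not the sympy_str implementation:

```python
PREC = {"+": 1, "-": 1, "*": 2}

def infix(expr):
    """expr is either a string atom or a tuple (op, left, right)."""
    if isinstance(expr, str):
        return expr, 3   # atoms bind tightest
    op, lhs, rhs = expr
    ls, lp = infix(lhs)
    rs, rp = infix(rhs)
    if lp < PREC[op]:
        ls = f"({ls})"
    if rp <= PREC[op]:   # parenthesize the right side on ties as well
        rs = f"({rs})"
    return f"{ls} {op} {rs}", PREC[op]

# without precedence tracking both of these would print as "a + b * c + d"
print(infix(("+", "a", ("*", "b", ("+", "c", "d"))))[0])   # a + b * (c + d)
print(infix(("*", ("+", "a", "b"), ("+", "c", "d")))[0])   # (a + b) * (c + d)
```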

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151920
Approved by: https://github.com/jansel
2025-04-29 00:58:58 +00:00
4b61564252 Include CollectiveKernel in inductor debug visualization (#146561)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146561
Approved by: https://github.com/eellison
ghstack dependencies: #152060
2025-04-29 00:53:38 +00:00
22f179d77d Use almalinux docker files for building Magma (#152358)
Resolves https://github.com/pytorch/pytorch/issues/151707 for CUDA Nvidia Magma builds.
Removes deprecated cuda 12.4 build.

Using the `pytorch/manylinux2_28-builder` image for the magma build creates a circular dependency.

For a while we used the `conda-builder` image for magma builds, since it does not have a circular dependency:
https://github.com/pytorch/builder/blob/release/2.4/magma/Makefile#L13
However, during the migration to pytorch/pytorch (https://github.com/pytorch/pytorch/pull/139888) we introduced a circular dependency by using the Manylinux 2.28 docker image.

Hence we use the almalinux image, which is supposed to be a general-purpose image.

Please note: Magma builds use Docker build (https://github.com/pytorch/pytorch/blob/main/.ci/magma/README.md); we can look into migrating them to Docker images as a follow-up BE change if needed.

TODO: Make the same change for rocm builds. I believe some more work is required for rocm, since magma-rocm requires rocm dev, utils and lib to be installed: https://github.com/pytorch/pytorch/blob/main/.ci/docker/common/install_rocm.sh

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152358
Approved by: https://github.com/nWEIdia, https://github.com/huydhn
2025-04-29 00:45:01 +00:00
7ce6f63214 [Relandx2] Rewrite the guts of torch::jit::Lexer to speed it up (#152372)
Reapplying with fix for linux-manylinux-2_28-py3-cpu-s390x / build
failure
(https://github.com/pytorch/pytorch/actions/runs/14716285820/job/41300304223#logs),
which is to just update a pair of static_assert constants I got wrong.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152372
Approved by: https://github.com/wdvr, https://github.com/malfet
2025-04-28 23:55:48 +00:00
e5f4356a25 [inductor][fix] enable dtype promotion for bucketize (#150634)
Summary:
bucketization involves comparing an input with border values. Without careful consideration of dtypes, this can cause dangerous implicit casting.

aten.bucketize resolves this via dtype promotion. We enable dtype promotion for the inductor bucketization pass so as to maintain alignment with the aten op.
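
A small plain-PyTorch illustration of the promotion semantics the pass has to match; this uses the public `torch.bucketize`, not the inductor decomposition:

```python
import torch

boundaries = torch.tensor([0.5, 1.5, 2.5])           # float32 borders
values = torch.tensor([1, 2], dtype=torch.int64)     # integer inputs
# aten.bucketize compares after promoting to a common dtype, so the integer
# inputs are compared against the float borders without truncation
print(torch.bucketize(values, boundaries))            # tensor([1, 2])
```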

Test Plan:
```
python3 test/inductor/test_torchinductor.py -k "bucketize"
```

Fixes #145929

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150634
Approved by: https://github.com/davidberard98, https://github.com/eellison
2025-04-28 23:44:26 +00:00
119f64d0eb Add 'step' counter to visualize_overlap log (#152060)
Example of log after the change:

```
[rank0]:V0227 15:07:20.704000 1594243 torch/_inductor/comms.py:621] [0/0] [__overlap] ==== Visualize overlap after reordering pass <function group_copy_collective at 0x7f41c1922050> (ran in 0.026380538940429688 sec)====
[rank0]:V0227 15:07:20.705000 1594243 torch/_inductor/comms.py:569] [0/0] [__overlap]      0: GroupedSchedulerNode(name='op6_op7')  (size=[512], stride=[1]), (size=[4096], stride=[1]) () (0 ns)
[rank0]:V0227 15:07:20.705000 1594243 torch/_inductor/comms.py:569] [0/0] [__overlap]      1: GroupedSchedulerNode(name='op55_op56')  (size=[512], stride=[1]), (size=[4096], stride=[1]) () (0 ns)
[rank0]:V0227 15:07:20.705000 1594243 torch/_inductor/comms.py:569] [0/0] [__overlap]      2: GroupedSchedulerNode(name='op75_op76')  (size=[512], stride=[1]), (size=[4096], stride=[1]) () (0 ns)
[rank0]:V0227 15:07:20.706000 1594243 torch/_inductor/comms.py:569] [0/0] [__overlap]      3: GroupedSchedulerNode(name='op121_op122')  (size=[512], stride=[1]), (size=[4096], stride=[1]) () (0 ns)
[rank0]:V0227 15:07:20.706000 1594243 torch/_inductor/comms.py:569] [0/0] [__overlap]      4: GroupedSchedulerNode(name='op141_op142')  (size=[512], stride=[1]), (size=[4096], stride=[1]) () (0 ns)
[rank0]:V0227 15:07:20.706000 1594243 torch/_inductor/comms.py:569] [0/0] [__overlap]      5: GroupedSchedulerNode(name='op187_op188')  (size=[512], stride=[1]), (size=[4096], stride=[1]) () (0 ns)
[rank0]:V0227 15:07:20.706000 1594243 torch/_inductor/comms.py:569] [0/0] [__overlap]      6: GroupedSchedulerNode(name='op207_op208')  (size=[512], stride=[1]), (size=[4096], stride=[1]) () (0 ns)
[rank0]:V0227 15:07:20.707000 1594243 torch/_inductor/comms.py:569] [0/0] [__overlap]      7: GroupedSchedulerNode(name='op253_op254')  (size=[512], stride=[1]), (size=[4096], stride=[1]) () (0 ns)
[rank0]:V0227 15:07:20.707000 1594243 torch/_inductor/comms.py:569] [0/0] [__overlap]      8: GroupedSchedulerNode(name='op273_op274')  (size=[512], stride=[1]), (size=[4096], stride=[1]) () (0 ns)
[rank0]:V0227 15:07:20.707000 1594243 torch/_inductor/comms.py:569] [0/0] [__overlap]      9: GroupedSchedulerNode(name='op319_op320')  (size=[512], stride=[1]), (size=[4096], stride=[1]) () (0 ns)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152060
Approved by: https://github.com/eellison
2025-04-28 23:23:21 +00:00
a6d38051ee [CUDA][CUTLASS] CUTLASS 3.9 submodule upgrade (#151253)
Originally authored by Jack Kosaian, likely needs #ifdefs if we want to preserve compat with 3.8

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151253
Approved by: https://github.com/Skylion007, https://github.com/henrylhtsang

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-04-28 23:10:14 +00:00
75a564608a [cudagraphs] Fix issue in collecting static_input_idxs (#152287)
related to https://github.com/pytorch/pytorch/issues/152275

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152287
Approved by: https://github.com/bdhirsh, https://github.com/eellison
2025-04-28 23:07:52 +00:00
63790a0c43 Speed-up time spent in generating shaped str keys (#152202)
Replaces the janky approach of using the IntArrayRef to create an NSArray and asking it to provide its contents in string format with a stringstream.

This speeds up getting the key string used for caching (or reading from the cache) for shaped inputs by ~5x. While the actual wall time, depending on the number of input tensors, is only a few microseconds, it represents a non-negligible chunk of the overall time spent preparing to dispatch work to the GPU. And since this function gets called every time a (cacheable) MPS operation is used, it should be a small but broadly impactful time saver.

Using mps_linear as an example. Note this is before PR https://github.com/pytorch/pytorch/pull/152199 so it only captures the CPU time spent in the op call:

Before the change:
```
torch.linear time: <torch.utils.benchmark.utils.common.Measurement object at 0x1108f07d0>
func(*args, **kwargs)
  Median: 22.75 us
  IQR:    0.87 us (22.50 to 23.38)
  8361 measurements, 1 runs per measurement, 1 thread
```

After the change:
```
torch.linear time: <torch.utils.benchmark.utils.common.Measurement object at 0x108875350>
func(*args, **kwargs)
  Median: 18.67 us
  IQR:    0.46 us (18.50 to 18.96)
  10342 measurements, 1 runs per measurement, 1 thread
```

This aligns with the observed change of getTensorStringKeys() taking ~1us instead of ~5us in the mps_linear op, which I got from a point measurement sandwiching the function call with `std::chrono::high_resolution_clock`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152202
Approved by: https://github.com/Skylion007
2025-04-28 23:06:10 +00:00
c81d8c231c Fix CosineAnnealingWarmRestarts reset T_cur (#151289)
Fixes #88791

## Test Result

```python
pytest test/optim/test_lrscheduler.py -k test_CosineAnnealingWarmRestarts
```

![image](https://github.com/user-attachments/assets/75ad238c-f319-47dc-bf2d-da05b0879b84)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151289
Approved by: https://github.com/janeyx99
2025-04-28 23:02:55 +00:00
0d99b4e9e2 ROCm: Enable tf32 testing on test_nn (#148945)
Add tf32 support for ROCm tests.
test command: python test/test_nn.py -v

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148945
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-04-28 23:01:04 +00:00
f3ef46e5fa [Dynamo] Replace unimplemented with unimplemented_v2 in torch/_dynamo/variables/iter.py (#151789)
Part of #147913

Replace `unimplemented` with`unimplemented_v2` in `torch/_dynamo/variables/iter.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151789
Approved by: https://github.com/Skylion007, https://github.com/williamwen42
2025-04-28 22:56:39 +00:00
d79e06723d Provide list of files to link linters if desired (#152352)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152352
Approved by: https://github.com/huydhn
2025-04-28 22:48:34 +00:00
c8540984a2 [inductor] set correct precompile start time (#152284)
Fixes #148777

With num_worker set to 1, ran script in #148777

before:
```
Precompiling benchmark choice TritonTemplateCaller took 0.19s
Precompiling benchmark choice TritonTemplateCaller took 0.38s
Precompiling benchmark choice TritonTemplateCaller took 0.53s
Precompiling benchmark choice TritonTemplateCaller took 0.90s
Precompiling benchmark choice TritonTemplateCaller took 1.29s
Precompiling benchmark choice TritonTemplateCaller took 20.78s
Precompiling benchmark choice TritonTemplateCaller took 25.42s
Precompiling benchmark choice TritonTemplateCaller took 25.92s
Precompiling benchmark choice TritonTemplateCaller took 27.21s
Precompiling benchmark choice TritonTemplateCaller took 48.76s
Precompiling benchmark choice TritonTemplateCaller took 53.66s
Precompiling benchmark choice TritonTemplateCaller took 63.12s
Precompiling benchmark choice TritonTemplateCaller took 69.53s
Precompiling benchmark choice TritonTemplateCaller took 71.24s
Precompiling benchmark choice TritonTemplateCaller took 75.57s
Precompiling benchmark choice TritonTemplateCaller took 97.58s
Precompiling benchmark choice TritonTemplateCaller took 107.71s
Precompiling benchmark choice TritonTemplateCaller took 117.27s
Precompiling benchmark choice TritonTemplateCaller took 126.30s
FX codegen and compilation took 133.733s
```

after:
```
Precompiling benchmark choice TritonTemplateCaller took 0.18s
Precompiling benchmark choice TritonTemplateCaller took 0.18s
Precompiling benchmark choice TritonTemplateCaller took 0.14s
Precompiling benchmark choice TritonTemplateCaller took 0.35s
Precompiling benchmark choice TritonTemplateCaller took 0.39s
Precompiling benchmark choice TritonTemplateCaller took 19.54s
Precompiling benchmark choice TritonTemplateCaller took 4.69s
Precompiling benchmark choice TritonTemplateCaller took 0.52s
Precompiling benchmark choice TritonTemplateCaller took 1.28s
Precompiling benchmark choice TritonTemplateCaller took 20.96s
Precompiling benchmark choice TritonTemplateCaller took 4.81s
Precompiling benchmark choice TritonTemplateCaller took 9.40s
Precompiling benchmark choice TritonTemplateCaller took 6.34s
Precompiling benchmark choice TritonTemplateCaller took 1.93s
Precompiling benchmark choice TritonTemplateCaller took 4.39s
Precompiling benchmark choice TritonTemplateCaller took 21.91s
Precompiling benchmark choice TritonTemplateCaller took 10.10s
Precompiling benchmark choice TritonTemplateCaller took 9.55s
Precompiling benchmark choice TritonTemplateCaller took 9.15s
FX codegen and compilation took 133.246s
```

Also tested async triton compile path by setting num_workers > 1

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152284
Approved by: https://github.com/Skylion007, https://github.com/henrylhtsang
2025-04-28 22:30:35 +00:00
e7c19f4f69 Revert "Reapply "Rewrite the guts of torch::jit::Lexer to speed it up (#151850)" (#152250)"
This reverts commit e407ea1e5e22a41d14ce141295bf391cd46f2677.

Reverted https://github.com/pytorch/pytorch/pull/152250 on behalf of https://github.com/malfet due to Breaks s390, may be time to move build back to opt-in 2667cb69d9/1 ([comment](https://github.com/pytorch/pytorch/pull/152250#issuecomment-2836833030))
2025-04-28 22:05:12 +00:00
2667cb69d9 [inductor] align replicationpad on processing bool dtype with eager (#147666)
Fixes #143779

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147666
Approved by: https://github.com/jansel
2025-04-28 21:54:31 +00:00
86b0271b00 Add CUDA 12.8 almalinux image, remove CUDA 12.4 almalinux (#152362)
This is a general-purpose image located at: https://hub.docker.com/r/pytorch/almalinux-builder
Updating it to match our supported CUDA matrix.

Adding this build to use as a general-purpose image and for the Magma build.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152362
Approved by: https://github.com/malfet
2025-04-28 21:15:05 +00:00
eqy
34b0de50a3 [TF32][CUDA] account for TF32 in test_linear_autograd (#152216)
Abate some more noise seen on blackwell

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152216
Approved by: https://github.com/Skylion007
2025-04-28 21:00:17 +00:00
ddff3d4f6b [inductor][invoke_subgraph] Run joint graph passes for inference (#152062)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152062
Approved by: https://github.com/eellison
ghstack dependencies: #151409, #151633, #151477, #151957, #151961
2025-04-28 20:42:55 +00:00
99b6c426a9 [Graph Partition] fix extra reference in runner.partitions to cudagraphify functions (#152066)
When CompiledFxGraph is deallocated, its cudagraphified fn (i.e., `current_callable`) is expected to also be deallocated.
Without graph partition, this is true since the cudagraphified fn is only referred to by compiled_fx_graph.current_callable.

However, with graph partition, runner.partitions hold the cudagraphified fns while compiled_fx_graph.current_callable holds runner.call. Thus the cudagraphified fn may not be deallocated when CompiledFxGraph is deallocated. This leads to errors in several unit tests (e.g., test_unaligned_static_input_no_cudagraphs and test_unaligned_static_input_non_trees).

In this PR, we also clean up runner.partitions when CompiledFxGraph is deallocated. This fixes the issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152066
Approved by: https://github.com/eellison
2025-04-28 20:38:26 +00:00
728a6dd51c [Graph Partition] support ForeachKernelSchedulerNode (#152148)
ForeachKernelSchedulerNode misses outputs_by_name when created with previous nodes. This PR fixes the issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152148
Approved by: https://github.com/eellison
2025-04-28 20:38:22 +00:00
8e65310d49 [caffe2/c10/util/TypeIndex] Add '__CUDA_ARCH_LIST__' check (#152030)
Summary:
We suspect that switching the NVCC host compiler from GCC to Clang, while targeting multiple architectures, is causing issues because only `__CUDA_ARCH_LIST__` is being passed, without `__CUDA_ARCH__`.

To resolve this c10 compilation error, we should first fix the problem and then switch the NVCC host compiler from GCC to Clang. Once this is done, the errors no longer occur.

Test Plan: CI

Reviewed By: zhuhan0

Differential Revision: D73383236

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152030
Approved by: https://github.com/cyyever, https://github.com/ZainRizvi
2025-04-28 20:31:23 +00:00
fcebaedebc Add a label to skip URL lint if needed (#152340)
Some URLs may be down due to server side issues we can't control
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152340
Approved by: https://github.com/huydhn, https://github.com/malfet
2025-04-28 20:29:40 +00:00
33766de2d3 [Security] Advise against loading untrusted TorchScripts (#152336)
As a torchscripted model is a Turing-complete program.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152336
Approved by: https://github.com/albanD

Co-authored-by: albanD <desmaison.alban@gmail.com>
2025-04-28 20:18:56 +00:00
00ebbbb701 [cutlass backend] add addmm and bmm for cutlass backend benchmark (#152163)
Copying what @kadeng did.

```
FINAL results...

Experiment group: bmm (BS: 8, 1024x1024, 1024x1024) torch.float16
+-----------------------+--------------------+----------------------+---------------------+
|         name          | forward_time (us)  | compilation_time (s) | perf_over_aten (%)  |
+-----------------------+--------------------+----------------------+---------------------+
|         aten          | 44.454172253608704 |  3.0991086587309837  |         NA          |
|        triton         | 44.06978189945221  | 0.07496077567338943  | -0.8646890374284049 |
| triton_persistent_tma | 43.598245829343796 | 0.06154991965740919  | -1.9254130284597197 |
|  cutlass_lvl_default  | 39.91834074258804  | 0.056073310784995556 | -10.20338762612423  |
+-----------------------+--------------------+----------------------+---------------------+

Experiment group: bmm (BS: 8, 1024x1024, 1024x1024) torch.bfloat16
+-----------------------+-------------------+----------------------+---------------------+
|         name          | forward_time (us) | compilation_time (s) | perf_over_aten (%)  |
+-----------------------+-------------------+----------------------+---------------------+
|         aten          | 49.05610531568527 |  0.160279156640172   |         NA          |
|        triton         | 43.97720843553543 |  0.0660805031657219  | -10.353241145961718 |
| triton_persistent_tma | 43.94153505563736 | 0.061738294549286366 | -10.425960697724962 |
|  cutlass_lvl_default  | 40.2066633105278  | 0.034127906896173954 | -18.039430460713596 |
+-----------------------+-------------------+----------------------+---------------------+

Average edge over aten (max(-edge, 0), higher is better):
triton: 5.608965091695062 (from 2 valid values)
triton_persistent_tma: 6.175686863092341 (from 2 valid values)
cutlass_lvl_default: 14.121409043418913 (from 2 valid values)
```

Differential Revision: [D73625766](https://our.internmc.facebook.com/intern/diff/D73625766/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152163
Approved by: https://github.com/jingsh
2025-04-28 20:16:17 +00:00
5f4c8e4c89 [inductor][tests] don't test for cpu if you want to use triton backend (#152227)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152227
Approved by: https://github.com/clee2000
2025-04-28 19:43:56 +00:00
e407ea1e5e Reapply "Rewrite the guts of torch::jit::Lexer to speed it up (#151850)" (#152250)
Almost-exact reapply of #151850 (adding minor reviewer nits). AFAICT it was reverted unnecessarily.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152250
Approved by: https://github.com/Skylion007, https://github.com/cyyever
2025-04-28 19:33:40 +00:00
6b1acfa41b Fix redistribute new_local_tensor be None case (#152303)
As titled, we can just set new_local_tensor to be the local tensor and
remove the None check, since there are cases where no transformation is
needed (i.e. src_placements and dst_placements are the same,
and we still want to return the original local_tensor).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152303
Approved by: https://github.com/awgu
2025-04-28 19:00:17 +00:00
d3f8aa4378 [ez] Don't always pass HF token to fsspec (#151464)
Summary: The HF storage reader/writer component can work with any back-end in theory, so we shouldn't force the token to be passed into FsspecReader/Writer, because the specific fsspec implementation may not handle tokens. Specifically, manifold doesn't accept a token arg, but we're always passing one in, which throws.

Test Plan: signals

Differential Revision: D73130679

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151464
Approved by: https://github.com/Skylion007
2025-04-28 18:52:20 +00:00
41a0c23c7c Skip test requiring MKL (#152322)
`test_reproduce_121253_issue_addmm_fusion_check` checks for "mkl._mkl_linear" being found in the generated source which cannot be there when MKL isn't available.
Add skip marker similar to other tests in this file.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152322
Approved by: https://github.com/Skylion007
2025-04-28 18:29:24 +00:00
686dff0098 Fix an incorrect link markup (#152239)
Remove extra whitespace so the link works correctly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152239
Approved by: https://github.com/soulitzer
2025-04-28 18:28:08 +00:00
fcbbb03d48 Extend vec backend with BF16 SVE intrinsics (#143666)
- Following the work in https://github.com/pytorch/pytorch/pull/119571, BF16 SVE intrinsics are added to the Vectorized class, providing ~1.7x speedup on `silu` and `softmax`.
- Added bf16 detection in CMake
- Added a guard for native NEON code to prevent compilation errors

@aditew01 @maajidkhann please have a look

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143666
Approved by: https://github.com/malfet, https://github.com/aditew01, https://github.com/nikhil-arm

Co-authored-by: Aditya Tewari <aditya.tewari@arm.com>
2025-04-28 18:25:44 +00:00
0c52ee1b35 [DTensor] Error on illegal view op during sharding prop (#149764)
Adds explicit error checking during sharding propagation for view ops
rather than relying on runtime errors during local op execution.

Before:
An error is thrown by aten.view op called by DTensor dispatch, because
the local shard size is incompatible with the (incorrectly calculated)
args to the view op.

`RuntimeError: shape '[384]' is invalid for input of size 512`

After:
We raise more specific errors for cases of incompatible view operations
during sharding propagation, before getting to runtime dispatch.

`RuntimeError: Attempted to flatten an unevenly sharded dimension, which would require resharding the input. Please explicitly redistribute the tensor instead.`

Change Summary:

- add 'strict_view' kwarg to the helper methods that implement
  view/reshape op shard prop rules, so it can be decided op-by-op whether
  to raise these new errors
- enabled errors just for the 'view' op in this PR
- added two specific checks/errors that can occur during view ops.

Details:

- View ops are never allowed to flatten a dimension that is unevenly
  sharded, since that would likely change the size/content of the
  local_tensor and require a redistribute
- View ops are also never allowed to flatten two dims if the rightmost
  dim is a Shard() placement, because it would cause contiguity errors
  without redistribution (see the sketch below)
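
A hedged sketch of those two checks; the `Shard` class and shard-count bookkeeping below are illustrative assumptions, not the actual DTensor sharding-propagation code:

```python
from dataclasses import dataclass

@dataclass
class Shard:
    dim: int            # tensor dim this mesh dim shards
    num_shards: int     # number of shards along that dim (assumed known here)

def check_flatten_allowed(dim_sizes, placements, dims_to_flatten):
    """dims_to_flatten: contiguous tensor dims merged by the view op."""
    for placement in placements:
        if not isinstance(placement, Shard) or placement.dim not in dims_to_flatten:
            continue
        if dim_sizes[placement.dim] % placement.num_shards != 0:
            raise RuntimeError(
                "Attempted to flatten an unevenly sharded dimension, which "
                "would require resharding the input.")
        if placement.dim == dims_to_flatten[-1]:
            raise RuntimeError(
                "Attempted to flatten dims whose rightmost dim is sharded, "
                "which would cause contiguity errors without redistribution.")

check_flatten_allowed([4, 8], [Shard(dim=0, num_shards=2)], [0, 1])   # ok
try:
    check_flatten_allowed([4, 6], [Shard(dim=1, num_shards=4)], [0, 1])
except RuntimeError as e:
    print(e)   # unevenly sharded dimension error
```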

Notes:

- Disables support for several ops in test_dtensor_ops.py test, which
  decompose to an illegal view that only works by performing a
  redistribution: cartesian_prod, flatten, ravel, reshape, reshape_as, view, view_as, take_along_dim, kron

Follow Ups:
- triage other view-like ops (besides aten::view) for using strict_view
- look for other gaps where view-like ops could still perform
  redistribution (ban them all, and document this)

Fixes #143372

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149764
Approved by: https://github.com/wanchaol, https://github.com/XilunWu
ghstack dependencies: #152045
2025-04-28 18:21:49 +00:00
efeed720a6 [DTensor] make test_dtensor_ops report dtensor_args (#152045)
Before:
Does not report DTensor args, and you can't tell which combination of
sharding/replication is used for that particular iteration

```
RuntimeError: failed to run: torch.flatten, with (*[tensor([[[-6.1074e-01,  1.1260e+00,  1.7686e+00, -7.8216e+
         [ 8.8558e-01, -3.0949e+00, -5.4584e+00, -8.5322e+00],
         [-2.9770e-01, -3.2814e+00, -7.5875e+00, -8.1269e+00],
         [-6.0136e+00, -5.1712e+00, -4.2667e+00, -4.2142e+00]],
        [[-7.5171e+00,  5.3900e+00, -7.9208e+00,  6.1000e+00],
         [-1.7350e+00, -3.6188e-03, -7.1592e+00,  9.2951e-02],
         [ 5.7143e+00, -3.0805e+00,  7.6227e+00, -7.4862e+00],
         [ 4.3167e-01, -4.9678e+00, -1.2441e+00, -2.3042e+00]],
        [[-7.4280e+00, -2.7754e+00, -5.2989e+00, -6.1920e+00],
         [-2.5225e+00, -5.2520e+00,  6.5686e+00, -6.0350e+00],
         [-5.1740e+00, -1.6405e+00, -4.4463e+00, -5.1884e+00],
         [ 3.9581e+00, -6.3151e-01, -3.3223e+00,  4.0546e+00]],
        [[-2.8112e+00,  3.8742e+00, -4.4612e+00, -5.0016e+00],
         [ 7.0568e+00, -2.0951e-01, -8.0049e+00, -4.1438e+00],
         [ 3.1207e+00, -7.6518e+00,  7.1084e+00, -1.0500e+00],
         [ 8.8823e+00, -1.1178e+00,  4.8485e+00, -8.8593e+00]]],
       requires_grad=True)], **{})
```

After:
You can see the particular DTensor spec that failed

```
RuntimeError: failed to run: torch.flatten, with (*[DTensor(local_tensor=tensor([[[-6.0136, -5.1712, -4.2667,
        [[ 0.4317, -4.9678, -1.2441, -2.3042]],
        [[ 3.9581, -0.6315, -3.3223,  4.0546]],
        [[ 8.8823, -1.1178,  4.8485, -8.8593]]], requires_grad=True),
        device_mesh=DeviceMesh('cpu', [0, 1, 2,3]), placements=(Shard(dim=1),))], **{})
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152045
Approved by: https://github.com/XilunWu
2025-04-28 18:21:48 +00:00
bb90f66e70 [CUDA][conv3d] bump tolerances for test_variant_consistency_eager conv3d complex64 (#152203)
~1/1000 1.5e-5 mismatch on A100

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152203
Approved by: https://github.com/Skylion007, https://github.com/soulitzer
2025-04-28 17:59:37 +00:00
79e8dc7d53 Pin to SHA for actions outside of PyTorch (#152110)
Pin actions from repos external to the PyTorch project to their shasums for security. This is a best practice as Git tags are not immutable.

https://openssf.org/blog/2024/08/12/mitigating-attack-vectors-in-github-workflows/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152110
Approved by: https://github.com/seemethere, https://github.com/ZainRizvi
2025-04-28 17:57:32 +00:00
2246cb6e14 Fix common_distributed.py to NOT set root logger (#152319)
Using `logging.basicConfig` to set the root logger's level is not good behavior. Fix common_distributed.py to set the level for the current logger only, because the root-level setting affects downstream 3rd-party testing plugins.
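
A minimal standard-library illustration of the change (the level and logger names are assumptions, not the actual common_distributed.py code):

```python
import logging

# before: logging.basicConfig(level=logging.INFO) mutates the root logger,
# which leaks into downstream third-party test plugins

# after: scope the level to the current module's logger only
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)
```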

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152319
Approved by: https://github.com/Skylion007
2025-04-28 17:51:32 +00:00
8ce3d4a541 test(Conv3d): use correct class for test_Conv3d_module_same_padding (#152187)
The test for the class `Conv3d` is calling `Conv2d`. This PR just ensures that we are testing the correct module.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152187
Approved by: https://github.com/Skylion007
2025-04-28 16:59:12 +00:00
c869862875 Remove cuda dependencies from non cuda buids (#152333)
These dependencies were added to fix a poetry issue on PyPI. However, including these dependencies creates an issue with poetry on download.pytorch.org, because poetry reads the first available wheel on the index for METADATA requirements. Hence the metadata requirements for CPU wheels can't list any cuda dependencies.

Injecting these dependencies via prep for pypi will need to be done via:
https://github.com/pytorch/test-infra/blob/main/release/pypi/prep_binary_for_pypi.sh

Ref: https://github.com/pytorch/pytorch/issues/152121

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152333
Approved by: https://github.com/jeanschmidt, https://github.com/malfet
2025-04-28 16:46:44 +00:00
cbf8e0fb1a use statically known true instead of guard size oblivious in bmm and mm inductor decompositions . (#148893)
This was discussed with @eellison and he recommended using statically_known_true here. The intuition: we already have 0/1 specializations in place; if we reach those checks with dynamic shapes that are not already specialized,
then we do not want to specialize them ("a recompilation here is not justified").
Those are all non-semantics-changing optimizations.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148893
Approved by: https://github.com/eellison
2025-04-28 16:44:25 +00:00
6e5e9dc321 [benchmarking] Inc aarch64 bench shards to 15 (#152324)
As it is frequently timing out with 12, and the shards also feel somewhat unbalanced.
E.g., if one looks at https://github.com/pytorch/pytorch/actions/runs/14696840776/job/41239776679,
shard 12 takes 3.6 hours, while shard 11 is only 40 min.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152324
Approved by: https://github.com/janeyx99, https://github.com/atalman
2025-04-28 16:08:39 +00:00
4bdecd94ea [modefile free][long tail] selectify fbcode/caffe2/defs.bzl (#148925)
Summary:
replace read_config with select

For more info, please refer to the [doc](https://docs.google.com/document/d/1e0Hvht8WEHhcRvlCAodq_R9xnAtKBrAhdyvxcAqQjCw/edit?tab=t.hl8j18gza0cv)

Test Plan: CI

Reviewed By: malfet

Differential Revision: D70267850

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148925
Approved by: https://github.com/malfet
2025-04-28 16:04:28 +00:00
9c864f9b0f Revert "[Inductor UT] Generalize device-bias code in test_flex_attention.py (#151937)"
This reverts commit 443840080265ce6133121c91d258b619eae151bb.

Reverted https://github.com/pytorch/pytorch/pull/151937 on behalf of https://github.com/malfet due to Broke ASAN tests, probably by enabling too many tests https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=asan&mergeEphemeralLF=true ([comment](https://github.com/pytorch/pytorch/pull/151937#issuecomment-2835151532))
2025-04-28 12:56:49 +00:00
0b6ea0b959 [xla hash update] update the pinned xla hash (#151210)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151210
Approved by: https://github.com/pytorchbot
2025-04-28 11:45:09 +00:00
7cae7902a2 Add scripts to check xrefs and urls (#151844)
Traverses the docs and code to find any broken links
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151844
Approved by: https://github.com/huydhn
2025-04-28 09:30:07 +00:00
7e8b9b3f51 ReducedPrecisionFloatGemvFastPathKernel: Correctly type parallel_for lambda arguments as int64_t (#152233)
This plus the previous irangeification PR seem like a better fix for #150637 than #150949 to me -- should make sure we are using 64-bit math for indexing everywhere.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152233
Approved by: https://github.com/Skylion007, https://github.com/cyyever
ghstack dependencies: #152232
2025-04-28 07:19:26 +00:00
3b7d6bbe8b irangeify ReducedPrecisionFloatGemvKernel.cpp (#152232)
We should be using irange, especially because we had 32-bit overflow issues in this file recently.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152232
Approved by: https://github.com/Skylion007
2025-04-28 07:19:26 +00:00
ce00ec7ecf Enable max autotune for AOTInductor benchmark (#149309)
With this PR, AOTInductor can choose to run in max-autotune mode when benchmarking.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149309
Approved by: https://github.com/desertfire

Co-authored-by: Gabriel Ferns <gabeferns@meta.com>
2025-04-28 06:54:26 +00:00
13966d0bf5 [BE] Migrate dtype_abbrs into one location (#152229)
Namely `torch.utils._dtype_abbrs.dtype_abbrs`

Before that it was defined in various forms of completeness in
c02edba863/torch/fx/graph.py (L215),
c02edba863/torch/testing/_internal/common_utils.py (L5226)
 and c02edba863/torch/testing/_internal/logging_tensor.py (L17)
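
For reference, a small hedged usage sketch of the consolidated mapping (the exact abbreviation strings are an assumption):

```python
import torch
from torch.utils._dtype_abbrs import dtype_abbrs

# dtype_abbrs maps torch dtypes to the short names used when printing FX
# graphs, e.g. roughly torch.bfloat16 -> "bf16" (exact strings assumed).
for dt in (torch.float32, torch.bfloat16, torch.int64):
    print(dt, "->", dtype_abbrs[dt])
```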

TODO:
 - Add a linter ensuring the `torch.testing._internal` module is not referenced from any public-facing APIs, as it can have extra dependencies such as `expect_test`

Fixes https://github.com/pytorch/pytorch/issues/152225

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152229
Approved by: https://github.com/clee2000, https://github.com/Skylion007
2025-04-28 03:52:47 +00:00
899eec665c [MPS] col2im kernel implementation (#152282)
Fixes #151820
Also requested in #141287

Mainly based on the cuda kernel implementations

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152282
Approved by: https://github.com/malfet
2025-04-28 03:48:41 +00:00
2503843673 Add check for 2-dim mask to COO mask computation (#151940)
Follow up on discussion on https://github.com/pytorch/pytorch/pull/151794 Related to all fixes for https://github.com/pytorch/pytorch/issues/151351

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151940
Approved by: https://github.com/Skylion007
2025-04-28 03:40:46 +00:00
4438400802 [Inductor UT] Generalize device-bias code in test_flex_attention.py (#151937)
@EikanWang @etaf @guangyey please take a look

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151937
Approved by: https://github.com/liangan1, https://github.com/drisspg
2025-04-28 03:07:23 +00:00
98bd2bd1ab Do not generate long log messages for suppressed data dependent errors. (#151023)
TORCH_LOGS="all" python test/test_dynamic_shapes.py -k test_guard_or_true

 before:
<img width="1065" alt="Screenshot 2025-04-10 at 9 55 27 AM" src="https://github.com/user-attachments/assets/3ee20de0-2902-4eb1-8ab0-80f1b974fb78" />

after:
<img width="1124" alt="Screenshot 2025-04-10 at 9 54 35 AM" src="https://github.com/user-attachments/assets/4e7e1f0c-856c-417f-8763-bfe183e2450d" />

Note: we actually do not expect to see a log at all; this is an orthogonal issue in recording, where it logs each error seen
even when recording is not enabled. I will follow up with a PR for that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151023
Approved by: https://github.com/bobrenjc93
2025-04-28 00:39:52 +00:00
cyy
70d7638b0d Fix clang-tidy suppression in torch/csrc/jit (#152271)
Remove some clang-tidy suppression in torch/csrc/jit by applying fixes or refactoring.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152271
Approved by: https://github.com/Skylion007, https://github.com/malfet

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-04-27 21:18:39 +00:00
c02edba863 Revert "Update OpenBLAS commit (#151547)"
This reverts commit c4b085475062270946eeec854aa54d0739c7a0c9.

Reverted https://github.com/pytorch/pytorch/pull/151547 on behalf of https://github.com/malfet due to It breaks all aarch64 tests ([comment](https://github.com/pytorch/pytorch/pull/151547#issuecomment-2833593427))
2025-04-27 18:58:35 +00:00
cyy
b34146a093 Fix initGdsBindings declaration (#152277)
Move initGdsBindings into the correct namespace.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152277
Approved by: https://github.com/Skylion007
2025-04-27 17:04:56 +00:00
861945100e [Kineto] Enable OOM observer (#152160)
Summary:
# Context:
When a memory leak happens, it usually triggers an OOM in later iterations. A snapshot of the full iteration will be huge and hard to interpret.
On the CUDA side, there is an OOM observer which generates a snapshot with the latest 1,500,000 entries when the OOM happens, for debugging.

In this diff, we want to implement the same feature on the MTIA side.

Test Plan:
Run this test with last diff in the stack.
```
buck run @//mode/opt  kineto/libkineto/fb/mtia/integration_tests:mtia_memory_auto_trace_test
```

As shown, the memory_snapshot is generated when the OOM happens.
Log: P1794792326
Snapshot: https://fburl.com/pytorch_memory_visualizer/lx73y6s3 {F1977402355}

Differential Revision: D71993315

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152160
Approved by: https://github.com/sraikund16
2025-04-27 15:56:44 +00:00
c4b0854750 Update OpenBLAS commit (#151547)
Motivation: Update OpenBLAS and change the build script to enable SBGEMM kernels. Update PyTorch `jammy` builds for aarch64 to use `install_openblas.sh` instead of `conda_install`.

Link to full [TorchInductor Performance Dashboard AArch64](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Wed%2C%2016%20Apr%202025%2009%3A35%3A26%20GMT&stopTime=Thu%2C%2017%20Apr%202025%2009%3A35%3A26%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cpu%20(aarch64)&lBranch=adi/update_openblas&lCommit=90701ab81bf61fd864d31e0aa7e88d97a1a8676c&rBranch=main&rCommit=40ce4fb24a536d175348df876f61956d4945778e)

1. This shows a promising speedup across most of the HF models in the benchmark, specifically giving a significant boost to SDPA layers.
2. Overall torch-bench pass-rate increased `[87%, 65/75 → 96%, 72/75]`
<img width="676" alt="Screenshot 2025-04-17 at 10 32 10" src="https://github.com/user-attachments/assets/a92dce0c-ecee-4466-8175-065df664dd71" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151547
Approved by: https://github.com/malfet
2025-04-27 15:55:42 +00:00
bb680b5a87 [MPSInductor] Fix masked_fill decomp (#152268)
By adding `mps` to the list of accelerators that can work with CPU scalars

Fixes `GPUTests.test_masked_fill_promotion_mps`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152268
Approved by: https://github.com/kulinseth, https://github.com/dcci, https://github.com/Skylion007
ghstack dependencies: #152266
2025-04-27 15:50:46 +00:00
cbcf677223 [Dynamo] Replace unimplemented with unimplemented_v2 in torch/_dynamo/variables/lists.py (#151873)
Part of #147913

Replace `unimplemented` with `unimplemented_v2` in `torch/_dynamo/variables/lists.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151873
Approved by: https://github.com/williamwen42

Co-authored-by: William Wen <william.wen42@gmail.com>
2025-04-27 11:59:45 +00:00
0423a7b322 [Dynamo] Replace unimplemented with unimplemented_v2 in torch/_dynamo/variables/nn_module.py (#151895)
Part of #147913

Replace `unimplemented` with `unimplemented_v2` in `torch/_dynamo/variables/nn_module.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151895
Approved by: https://github.com/williamwen42

Co-authored-by: William Wen <william.wen42@gmail.com>
2025-04-27 11:54:42 +00:00
e2f9759bd0 Fix broken URLs (#152237)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152237
Approved by: https://github.com/huydhn, https://github.com/malfet
2025-04-27 09:56:42 +00:00
cbcc03c2ad [MPSInductor][BE] Only include headers when needed (#152266)
Store headers used by the shader in `MetalKernel.headers`
Add headers when a function depending on them gets invoked
Generate the majority of special ops from a template
Delete two unused functors, `entr` and `xlog1py`, as they are decomposed by inductor anyway

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152266
Approved by: https://github.com/Skylion007, https://github.com/jansel, https://github.com/dcci, https://github.com/cyyever
2025-04-27 05:09:50 +00:00
a0d440a26a [AOTI][reland] Remove typedef for half and bfloat16 (#151109)
Summary: Reland https://github.com/pytorch/pytorch/pull/150657

typedef is prone to name collision. Explicitly spell out the actual aten types, needed for the libtorch-free codegen.

Differential Revision: [D72878456](https://our.internmc.facebook.com/intern/diff/D72878456)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151109
Approved by: https://github.com/angelayi
2025-04-26 23:17:35 +00:00
225742838b Add an additional check to trigger graph break for sparse tensor (#151897)
Fixes #151522

This PR fixes an issue where Dynamo fails to trigger a graph break for sparse tensors in certain code paths. I added an additional check to handle this case, and it resolves the original problem.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151897
Approved by: https://github.com/jansel
2025-04-26 21:02:32 +00:00
e4a1a16bef Check integrity of bytes in AppendingByteSerializer (#152139)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152139
Approved by: https://github.com/zou3519
2025-04-26 18:10:58 +00:00
9480ed4cd3 Fix typos in multiple files (#152254)
Fix typos in multiple files

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152254
Approved by: https://github.com/Skylion007
2025-04-26 17:18:39 +00:00
6a62356857 [BE][Easy]: Change typing to DimsType in dim_reduction (#151677)
Use prims_common DimsType to reduce duplication of DType

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151677
Approved by: https://github.com/albanD
2025-04-26 16:59:32 +00:00
203201255f [dynamo] remove dead code for DATA_PTR_MATCH (#152206)
Summary: It seems this guard is not created anywhere.

Test Plan: CI

Differential Revision: D73682084

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152206
Approved by: https://github.com/anijain2305, https://github.com/jansel
2025-04-26 15:25:01 +00:00
ee8166e94f Correctly handle duplicated arguments when merging input views. (#146275)
Fix: #135099

This PR changes how we map the original inputs into the new set of
inputs that take in the tensor input's base instead of their aliases.

**Problem:** in order to create this mapping, we had a dictionary that
mapped the hashed arguments to their respective indices. However, if
there's a group of equal arguments, we will have only one mapping for
such an argument. This breaks the assumption that there will be one
mapping for each argument.

**Solution:** map the hashed arguments to a list of indices. Then, we
will be able to correctly reconstruct the parameters for the new calling
convention.
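
As an illustration of the fix, a generic sketch (not the actual AOTAutograd code; names are made up):

```python
from collections import defaultdict

def build_index_map(hashed_args):
    # Map each hashed argument to *all* indices where it appears, so a
    # group of equal arguments no longer collapses into a single entry.
    index_map = defaultdict(list)
    for idx, h in enumerate(hashed_args):
        index_map[h].append(idx)
    return index_map

# The same aliased input passed twice keeps both positions:
print(build_index_map(["h_base", "h_base", "h_other"]))
# defaultdict(<class 'list'>, {'h_base': [0, 1], 'h_other': [2]})
```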
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146275
Approved by: https://github.com/bdhirsh
2025-04-26 14:50:16 +00:00
580913290c [Easy] The event_id of torch.cuda.Event and torch.xpu.Event always is 0 (#151226)
Although torch.cuda.Event and torch.xpu.Event have cuda_event and sycl_event fields respectively, the event_id exposed from the base class torch.Event is always 0, which can confuse users.

The memory of torch.Event is not useful to torch.cuda.Event and torch.xpu.Event, but we still need to inherit from torch.Event because CPython will check it.

Repro with cuda:
```
>>> import torch
>>> event = torch.cuda.Event()
>>> event.cuda_event
0
>>> event.event_id
0
>>> event.record()
>>> event.cuda_event
127982096
>>> event.event_id
0
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151226
Approved by: https://github.com/albanD, https://github.com/guangyey
ghstack dependencies: #151404, #151221, #151411
2025-04-26 14:18:22 +00:00
2ce9d2e9aa [MPS/inductor] Adjust test_to_dtype_mps so that it works on the backend. (#152230)
float64 isn't supported on MPS, but we can still test the functionality with another type.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152230
Approved by: https://github.com/malfet, https://github.com/jansel
2025-04-26 13:54:53 +00:00
0f9b02c839 [Easy][torch.Event] Fix and improve the docs of torch.Event (#151411)
**Changes:**
- add detailed function or class signature
- fix the wrong display of torch.Event.wait and torch.Event.record
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151411
Approved by: https://github.com/albanD
ghstack dependencies: #151404, #151221
2025-04-26 13:52:38 +00:00
bd7dc1b17d [Easy] Fix the function signature of torch.Event (#151221)
As the title states.

There is a difference between the declaration and the implementation.
Declaration:
d5a19e4525/torch/_C/__init__.pyi.in (L157-L162)

Implementation:
d5a19e4525/torch/csrc/Event.cpp (L30-L32)

**Question**: Which one should we choose?
- Change enable_timing to False to be consistent with torch.cuda.Event
- Change enable_timing to True to avoid BC-break
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151221
Approved by: https://github.com/albanD
ghstack dependencies: #151404
2025-04-26 13:51:56 +00:00
4a46ee96d2 [Inductor Remote Cache] Raise an exception if the redis module is required but not available (#151779)
If we need redis but redis is not available, it is better to tell the user to install redis instead of continuing silently.
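
A minimal sketch of the behavior (names are illustrative, not the actual inductor code):

```python
def _require_redis():
    try:
        import redis  # noqa: F401
    except ImportError as e:
        raise RuntimeError(
            "Remote caching is configured to use redis, but the redis "
            "package is not installed. Please `pip install redis`."
        ) from e
    return redis
```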

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151779
Approved by: https://github.com/aorenste
2025-04-26 11:21:54 +00:00
8d427e9e76 [AOTInductor] Inherit Buffer if not being updated (#152092)
Summary: Inherit buffer from original constants buffer if it's not being updated.

Test Plan: TBD

Differential Revision: D73571260

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152092
Approved by: https://github.com/kflu, https://github.com/jingsh
2025-04-26 04:28:23 +00:00
d22c4cc353 Add option to use mempool on OOM (#151487)
MemPool is a separate pool of memory handled by the caching allocator. This PR adds an option to let the caching allocator try to use this pool as a last resort instead of OOMing, by associating a use_on_oom bool with each MemPool.

Usage:
Users can optionally specify a ``use_on_oom`` bool (which is False by default) during MemPool creation. If true, then the CUDACachingAllocator will be able to use memory in this pool as a last resort instead of OOMing.

```
pool = torch.cuda.MemPool(allocator, use_on_oom=True)
with torch.cuda.use_mem_pool(pool):
    a = torch.randn(40 * 1024 * 1024, dtype=torch.uint8, device="cuda")
del a
# at the memory limit, this will succeed by using pool's memory in order to avoid the oom
b = torch.randn(40 * 1024 * 1024, dtype=torch.uint8, device="cuda")
```

Testing:
```
python test/test_cuda.py -k test_mempool_limited_memory_with_allocator
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151487
Approved by: https://github.com/eqy, https://github.com/syed-ahmed, https://github.com/ngimel
2025-04-26 04:04:57 +00:00
cyy
65b845f82b Remove useless options for third-party ONNX build (#147616)
Treat ONNX CMake targets properly and remove unneeded options.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147616
Approved by: https://github.com/malfet
2025-04-26 02:34:08 +00:00
d9d306e8e9 Fix inductor test_linear_with_in_out_buffer (#151548)
Without MKL there is only 1 epilogue, not 2, because `addmm` is used instead of `packed_linear/_mkl_linear`.
This first fails at `TestSelectAlgorithmCPU.test_linear_with_in_out_buffer_batch_size_8_in_features_3_in_features2_192_image_size_224_out_features_64_bias_True_cpu_float32`

Instead of skipping the whole test just adjust the count for the single check.

Final numbers of `test/inductor/test_cpu_select_algorithm.py` without MKL:
```
Ran 1337 tests
OK (skipped=1211)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151548
Approved by: https://github.com/jansel
2025-04-26 01:53:34 +00:00
0e015ef116 [ROCm][Windows] Fix HIP Caffe2 Tests (#152014)
Solves the following problems when building the Caffe2 HIP tests on Windows:
1. HIP tests now use `hip_add_executable` so they are built with a custom command invoking the HIP compiler, due to the lack of CMake support for HIP in 3.18 (currently used).
2. Fixes "Command line too long" failures, which resulted from `hip_add_executable` adding the same flags over and over on top of `HIP_HIPCC_FLAGS` with every test added.
3. Disables the `HasSameArgTypes` test on Windows, as `at::native::modern::detail` is nowhere to be found in the codebase (I think it must be a legacy thing). Perhaps the whole test should be removed/rewritten?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152014
Approved by: https://github.com/jeffdaily
2025-04-26 01:35:46 +00:00
3ef6d6924a [BE] Switch TestConsistency to MPS device (#147893)
Which will eventually allow moving decorators away from `common_mps.py`.

Adjust tolerances accordingly. XFAIL a bunch of tests on MacOS-13, which is going to be deprecated anyway

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147893
Approved by: https://github.com/atalman
ghstack dependencies: #152204
2025-04-26 01:19:21 +00:00
73f11e3365 [BE] Do not allow PyTorch codebase to use c10::optional (#150464)
Extensions can still rely on it, and we should decorate it with deprecated, but it is a C++20 feature.
XPU still uses it, so exclude XPU builds  until https://github.com/intel/torch-xpu-ops/pull/1615 is merged

Test plan:
 - 0def9b4acc should fail MPS builds
 ```
/Users/ec2-user/runner/_work/pytorch/pytorch/aten/src/ATen/native/mps/OperationUtils.mm:975:44: error: no template named 'optional' in namespace 'c10'; did you mean 'std::optional'?
                                           c10::optional<int64_t> extra) {
                                           ^~~~~~~~~~~~~
                                           std::optional
```
 - a769759dd4 should fail CUDA builds
 ```
/var/lib/jenkins/workspace/torch/csrc/distributed/c10d/CUDASymmetricMemoryOps.cu(530): error: namespace "c10" has no member "nullopt"
        input, c10::nullopt, reduce_op, group_name, out);
                    ^

1 error detected in the compilation of
```

Fixes https://github.com/pytorch/pytorch/issues/150313

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150464
Approved by: https://github.com/atalman
2025-04-26 01:15:53 +00:00
4647658247 [PT2] - Allowlist should have precedence (#151942)
Summary: When working on List[List[int]], the ints were being considered Constants regardless of their inclusion on the allowlist.

Test Plan:
CI + new test

https://www.internalfb.com/intern/testinfra/testrun/5066549856504774

Differential Revision: D73137631

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151942
Approved by: https://github.com/laithsakka
2025-04-26 00:58:43 +00:00
fa1b4ef649 Revert "Rewrite the guts of torch::jit::Lexer to speed it up (#151850)"
This reverts commit 47d34261e06e2416e7a1e7d51a3d428e4ea51f9d.

Reverted https://github.com/pytorch/pytorch/pull/151850 on behalf of https://github.com/ZainRizvi due to This codev PR is breaking  on it's internal counterpart diff D73129443.  For codev PRs like this one, please always make sure the internal diff is green and then land the diff internally. The Github PR will be automatically merged ([comment](https://github.com/pytorch/pytorch/pull/151850#issuecomment-2831686141))
2025-04-26 00:44:11 +00:00
47d34261e0 Rewrite the guts of torch::jit::Lexer to speed it up (#151850)
The trie-based approach was, apparently, not efficient. This incidentally fixes a bug where "not inp" and "is note" were lexed incorrectly; see test_lexer.cpp update.

Differential Revision: [D73129443](https://our.internmc.facebook.com/intern/diff/D73129443/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151850
Approved by: https://github.com/Skylion007
ghstack dependencies: #151801, #151802, #151803, #151804, #151805, #151806, #151807, #151810, #151849
2025-04-25 23:49:35 +00:00
0f765773e3 Revert "[BE] Do not allow PyTorch codebase to use c10::optional (#150464)"
This reverts commit 490ef768cff448080083a46f362053e025f6b95b.

Reverted https://github.com/pytorch/pytorch/pull/150464 on behalf of https://github.com/clee2000 due to broke xpu [GH job link](https://github.com/pytorch/pytorch/actions/runs/14674243034/job/41187443432) [HUD commit link](490ef768cf)? ([comment](https://github.com/pytorch/pytorch/pull/150464#issuecomment-2831608162))
2025-04-25 23:34:56 +00:00
6aa92806db [CP] Use TorchFunctionMode to dispatch SDPA for CP (#147902)
While we prefer not to use monkey patching to dispatch SDPA, TorchFunctionMode is currently not compatible with selective activation checkpointing (https://github.com/pytorch/pytorch/issues/147995). This PR adds `TorchFunctionMode` dispatch to the CP code and makes it configurable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147902
Approved by: https://github.com/XilunWu
2025-04-25 23:33:48 +00:00
e28864fc0f [MPS/inductor] Fix the approximation of polygamma for n == 0. (#152214)
Fixes #152205

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152214
Approved by: https://github.com/malfet
2025-04-25 22:42:45 +00:00
cf101d66ee Add simple direct C++ tests for torch::jit::Lexer (#151849)
We have test_jit.py, but given that I'm working on
significant changes to the lexer, it seems nice to have direct C++
tests. (Also, writing the tests caught a pair of related bugs; see the
two tests with "Bug" in their name. The rewrite will fix them.)

Differential Revision: [D73402367](https://our.internmc.facebook.com/intern/diff/D73402367/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151849
Approved by: https://github.com/malfet
ghstack dependencies: #151801, #151802, #151803, #151804, #151805, #151806, #151807, #151810
2025-04-25 22:39:49 +00:00
490ef768cf [BE] Do not allow PyTorch codebase to use c10::optional (#150464)
Extensions can still rely on it, and we should decorate it with deprecated, but it is a C++20 feature

Test plan:
 - 0def9b4acc should fail MPS builds
 ```
/Users/ec2-user/runner/_work/pytorch/pytorch/aten/src/ATen/native/mps/OperationUtils.mm:975:44: error: no template named 'optional' in namespace 'c10'; did you mean 'std::optional'?
                                           c10::optional<int64_t> extra) {
                                           ^~~~~~~~~~~~~
                                           std::optional
```
 - a769759dd4 should fail CUDA builds
 ```
/var/lib/jenkins/workspace/torch/csrc/distributed/c10d/CUDASymmetricMemoryOps.cu(530): error: namespace "c10" has no member "nullopt"
        input, c10::nullopt, reduce_op, group_name, out);
                    ^

1 error detected in the compilation of
```

Fixes https://github.com/pytorch/pytorch/issues/150313

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150464
Approved by: https://github.com/atalman
2025-04-25 22:03:48 +00:00
9e50c21e27 Fix xrefs (#151888)
Fix existing cross-references and remove old ones

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151888
Approved by: https://github.com/eqy, https://github.com/huydhn, https://github.com/svekars
2025-04-25 21:27:27 +00:00
1aa971a3bb [ROCm] Implemented dropout usage for RNN with MIOpen backend (#144572)
This PR fixes https://github.com/pytorch/pytorch/issues/107183 for ROCm.

Implemented the usage of the new RNN descriptor for the MIOpen backend that takes the dropout rate value into account using a dropout descriptor. This fixes the associated test_RNN_dropout_state test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144572
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-04-25 21:06:45 +00:00
2c5c793085 [Easy] Add more check for elapsedTime of torch.xxx.Event and torch.Event (#151404)
As the title states.

**Changes:**
- Add **record**, **query** and **enable_timing** check
- Add related tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151404
Approved by: https://github.com/albanD
2025-04-25 20:15:04 +00:00
91c590f048 [ONNX] add converters for sym_min, sym_max (#152196)
Conversion of Phi-4-multimodal-instruct fails because of missing converters for torch.sym_max and torch.sym_min.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152196
Approved by: https://github.com/justinchuby
2025-04-25 20:01:05 +00:00
9336608307 BM FM FlashAttention Test (#151974)
Reviewed By: joebos

Differential Revision: D72880307

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151974
Approved by: https://github.com/yoyoyocmu, https://github.com/Skylion007, https://github.com/malfet
2025-04-25 19:24:25 +00:00
8542d55f0c [logging] Clean up dynamo_timed usages in cudagraph_trees (#152136)
Summary: I'm investigating differences in total torch.compile overhead in our two main internal sources: dynamo_compile and pt2_compile_events. One source of discrepancy is due to cudagraphs overheads. Currently, we have a context manager that optionally attributes a dynamo_timed region to a cudagraph-related column logged to dynamo_compile, but _all_ dynamo_timed regions show up in pt2_compile_events (hence the discrepancy; pt2_compile_events is overcounting). We could filter out these specific events from pt2_compile_events when measuring overall overhead. But I'm going to argue that those timed regions that we DO NOT consider as a compiler-related overhead don't have much value in logging in the first place. So I'm suggesting we just remove those instances.

Here's the production job with the discrepancy:
* dynamo_compile: https://fburl.com/scuba/dynamo_compile/3604eypl
* pt2_compile_events: https://fburl.com/scuba/pt2_compile_events/c2dv8sty

Test Plan:
torchbench nanogpt:
* tlparse: https://fburl.com/h1n2ascc
* dynamo_compile: https://fburl.com/scuba/dynamo_compile/sandbox/u37yrynp
* pt2_compile_events: https://fburl.com/scuba/pt2_compile_events/s7avd0di

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152136
Approved by: https://github.com/BoyuanFeng
2025-04-25 19:18:12 +00:00
1bc0e2579d [aarch64] Fixes to build with ArmPL's cblas.h (#151126)
Summary:
Various fixes to make fbcode work w/ ArmPL's cblas header:
1) Avoid re-declaring prototypes for internal blas methods which ArmPL already declares.
2) Fix `std::complex` conversion when using these methods.
3) Drop `extern "C"` around the include of `cblas.h`.

Test Plan: CI

Differential Revision: D72808561

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151126
Approved by: https://github.com/Skylion007
2025-04-25 19:02:28 +00:00
56190d2577 [MPS] Fix ICE for entr bool instantiation on M1/M2 (#152204)
By instantiating it implicitly, otherwise attempts to run something like
```
% python3 -c "import torch; print(torch.special.entr(torch.testing.make_tensor(10, dtype=torch.bool, device='mps')))"
```
will fail with
```
Failed to created pipeline state object, error: Error Domain=AGXMetalG14X Code=3 "Compiler encountered an internal error"
```

Similar in spirit to https://github.com/pytorch/pytorch/pull/149123
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152204
Approved by: https://github.com/dcci
2025-04-25 19:00:49 +00:00
d7eb3a492c [Typing] Enable torch.types.IntLikeType / FloatLikeType / BoolLikeType (#152157)
### Changes

Replace `Union[SymInt, int]` and `Union[int, SymInt]` with `IntLikeType`.
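
A small usage sketch (the helper below is hypothetical):

```python
import torch
from torch.types import IntLikeType  # Union[int, torch.SymInt]

def double_dim(n: IntLikeType) -> IntLikeType:
    # Works for both concrete ints and SymInts traced by dynamo/export.
    return n * 2
```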

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152157
Approved by: https://github.com/Skylion007
2025-04-25 19:00:10 +00:00
85bfaf8cc5 Package const folded graph's cubin file (#152145)
Summary: We need to package the const-folded graph's cubin file into the final .pt2 package.

Fix https://github.com/pytorch/pytorch/issues/152067

Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:test_aot_inductor -- -r test_constant_folding_cuda
```

Differential Revision: D73626480

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152145
Approved by: https://github.com/henrylhtsang, https://github.com/desertfire
2025-04-25 18:38:32 +00:00
a5f2fd1017 Unskip index_put in cudagraphs (#152186)
The repro from the original skip in https://github.com/pytorch/pytorch/pull/105439 does not fail, so unskip it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152186
Approved by: https://github.com/Skylion007
2025-04-25 18:15:49 +00:00
bcf1031cb8 [ROCm] Fixes to enable VM-based MI300 CI runners (#152133)
New VM-based MI300 CI runners tested in https://github.com/pytorch/pytorch/pull/151708 exposed some issues in CI that this PR fixes:

* HSAKMT_DEBUG_LEVEL is a debug env var that was introduced to debug driver issues. However, in the new MI300 runners being tested, since they run inside a VM, the driver emits a debug message `Failed to map remapped mmio page on gpu_mem 0` when calling `rocminfo` or doing other GPU-related work. This results in multiple PyTorch unit tests failing when doing a string match on the stdout vs expected output.

* HSA_FORCE_FINE_GRAIN_PCIE was relevant for rccl performance improvement, but is not required now.

* amdsmi doesn't return metrics like [power_info](https://rocm.docs.amd.com/projects/amdsmi/en/latest/reference/amdsmi-py-api.html#amdsmi-get-power-cap-info) and [clock_info](https://rocm.docs.amd.com/projects/amdsmi/en/latest/reference/amdsmi-py-api.html#amdsmi-get-clock-info) in a VM ("Guest") environment. Return 0 as the default in cases where amdsmi returns "N/A"

* amdsmi throws an exception when calling `amdsmi.amdsmi_get_clock_info` on the VM-based runners. Temporarily skipping the unit test for MI300 until we find a resolution.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152133
Approved by: https://github.com/jeffdaily
2025-04-25 18:06:48 +00:00
0dae27d75b Turn on static cuda launcher in OSS (#151691)
After a few small bugfixes to tests (so that we throw/catch exceptions similar to Triton's), I think we're ready to flip the switch and turn StaticCudaLauncher on by default in OSS.

Initial round of benchmarks look good, with average compilation time going down by a few percent:
<img width="828" alt="image" src="https://github.com/user-attachments/assets/cad03e09-b4d6-49a7-a9e5-6068d1c0bd5c" />

With no changes to runtime perf:
<img width="823" alt="image" src="https://github.com/user-attachments/assets/3fcd435e-1057-43f4-878b-8d66a3812a10" />

There are a few noisy models I want to double check, though, so will run some more tests before accepting review.

Full benchmark results, showing a ~5% compile time improvement across the board:
https://hud.pytorch.org/benchmark/huggingface/inductor_with_cudagraphs?dashboard=torchinductor&startTime=Wed%2C%2016%20Apr%202025%2002%3A31%3A12%20GMT&stopTime=Wed%2C%2023%20Apr%202025%2002%3A31%3A12%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(a100)&lBranch=gh/jamesjwu/139/orig&lCommit=cc45c8667fa23dec16ca50002d9504a34688ca5c&rBranch=main&rCommit=2a9afdae81d0dde98e96d7e3c9ca840e241e5405
<img width="1482" alt="image" src="https://github.com/user-attachments/assets/6e6a7f39-7f44-459f-9845-9a37f084ea82" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151691
Approved by: https://github.com/oulgen, https://github.com/jansel, https://github.com/EikanWang
2025-04-25 17:48:53 +00:00
c03359de2d Revert "[Inductor] Record Triton’s Base32 Cache Key in .best_config for Debugging (#148981)"
This reverts commit fc6e37ceb23f99808265c11a37368078d5f982b8.

Reverted https://github.com/pytorch/pytorch/pull/148981 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. @davidberard98 can you please help get these changes validated? Details in D73628297. To validate the fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/148981#issuecomment-2831044810))
2025-04-25 17:45:13 +00:00
4ea2e093ca [inductor][BE] Clean up use_mixed_mm and mixed_mm_choice usage inside pytorch (#152071)
Differential Revision: [D73551912](https://our.internmc.facebook.com/intern/diff/D73551912/)

Decided to leave the mixed_mm tests alive.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152071
Approved by: https://github.com/eellison
2025-04-25 17:25:55 +00:00
67f75244ea Revert "[Easy] Add more check for elapsedTime of torch.xxx.Event and torch.Event (#151404)"
This reverts commit c91acad73a11825c366c51fb1e91d7e1a47d3f9e.

Reverted https://github.com/pytorch/pytorch/pull/151404 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. @albanD can you please help it get relanded? To validate the fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/151404#issuecomment-2830829368))
2025-04-25 16:08:27 +00:00
d4a8e4e30c [dynamo] Guard serialization for HASATTR (#151349)
Adding guard serialization for type HASATTR

Differential Revision: [D73059073](https://our.internmc.facebook.com/intern/diff/D73059073/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151349
Approved by: https://github.com/jansel, https://github.com/anijain2305
ghstack dependencies: #151318, #151343
2025-04-25 14:16:30 +00:00
558f45190e [dynamo] Guard serialization for NOT_PRESENT_IN_GENERIC_DICT (#151343)
Adding guard serialization for type NOT_PRESENT_IN_GENERIC_DICT

Differential Revision: [D73057304](https://our.internmc.facebook.com/intern/diff/D73057304/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151343
Approved by: https://github.com/jansel, https://github.com/anijain2305
ghstack dependencies: #151318
2025-04-25 14:16:30 +00:00
a34c28e0d2 [dynamo] Add guard serialization for tensor matches. (#151318)
This is a proof-of-concept of how we could serialize a guard and deserialize it back from the bytes.

The main behavioral change introduced in this diff is on CheckFunctionManager:

```
check_fn_manager = CheckFunctionManager(code, output_graph, guards_serialization_mode="save")

guards_state: bytes = check_fn_manager.guards_state
```

Once `guards_serialization_mode` is set to `save`, CheckFunctionManager will return an additional `bytes` object called `guards_state` which should contain all the information needed for deserializing guards later.

When we load back the guards state, we set `guards_serialization_mode` to `load`:

```
output_graph_state = pickle.loads(guards_state)
check_fn_manager = CheckFunctionManager(code, output_graph_state, guards_serialization_mode="load")
```

# TENSOR_MATCH

Since we have many types of guards to support, we will break the work into small diffs instead of a single diff to support every guards.

We kick off the work from TENSOR_MATCH from this diff.

# Testing

For each type of guard we will test it like the following:
1. Use guard_filter_fn to select 1 type of guard each time.
2. Call InstructionTranslator directly on an example function to get OutputGraph and CheckFunctionManager (reference guard manager)
3. Serialize->deserialize the output graph state and re-build the guards with a new CheckFunctionManager (loaded guard manager)
4. Throw a set of example inputs to both reference and loaded guard manager to see if their behavior match.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151318
Approved by: https://github.com/jansel, https://github.com/anijain2305
2025-04-25 14:16:23 +00:00
6e8602b558 Relax tolerance on test_aot_autograd_exhaustive_matmul_cpu_float32 without MKL (#152106)
When e.g. OpenBLAS is used instead of MKL, the differences get too large:
> Greatest absolute difference: 5.91278076171875e-05 at index (7,) (up to 1e-05 allowed)
> Greatest relative difference: 3.468156592134619e-06 at index (7,) (up to 1.3e-06 allowed)

I traced some of the matmul operations and there are differences of around 8e-6 between MKL and OpenBLAS, but I haven't found where exactly the backward pass is calculated, which is where the actual differences arise. So I couldn't check if there is some difference in the low-level BLAS function used by the autograd.

However it seems odd that there is a difference at all: For the MKL case it seems to be zero up to the accuracy shown by Python.

So it seems the AOT compilation has some differences when MKL is not available.

Maybe this is also the reason why it fails for ARM and hence the test is skipped there. Maybe @zou3519 knows more as he introduced those skip markers in https://github.com/pytorch/pytorch/pull/85565

Is there any documentation how and where `matmul_backward(_out)` is generated and how AOT transforms it with and without MKL?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152106
Approved by: https://github.com/zou3519
2025-04-25 14:03:37 +00:00
c1c8c1f8d6 [Quant][X86] add an op to compute uint8 pointwise mul (#151112)
**Summary**
Add a new op, `onednn.qmul.tensor`, for int8 elementwise mul, which accepts inputs on CPU device (instead of QuantizedCPU).
The new op is implemented by AVX512 instructions and it provides similar or better performance, depending on shape, than its counterpart for QuantizedCPU device `quantized.mul`.
The new op supports output dtypes other than uint8 (fp32, fp16 and bf16 are supported).

**Test plan**
```
pytest test/quantization/core/test_quantized_op.py -k test_int8_mul_onednn
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151112
Approved by: https://github.com/leslie-fang-intel, https://github.com/jerryzh168
2025-04-25 12:52:54 +00:00
ad81eeb7c7 Refactor to use torch.accelerator.device_index instead of torch.cuda.device for generic device context manager (#148880)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148880
Approved by: https://github.com/EikanWang, https://github.com/albanD
ghstack dependencies: #148864
2025-04-25 09:45:25 +00:00
33c75cae0a Add torch.accelerator.device_index as accelerator's device switch context (#148864)
# Motivation
We propose adding support for the Python with statement on `torch.accelerator.device_index` to enable device switching functionality. This enhancement would simplify writing device-agnostic code and provide benefits across all accelerators. Its device-specific counterparts include [`torch.cuda.device`](00199acdb8/torch/cuda/__init__.py (L482)) and  [`torch.cuda._DeviceGuard`](00199acdb8/torch/cuda/__init__.py (L469)).

**Design Philosophy**
It accepts either an `Int` or `None` as input. When `None` is passed, no device switch is performed. Supporting `None` is important for compatibility, as it's possible to encounter `None` values from `torch.device.index`.

Therefore, with this PR, we can do the following:

```python
src = 0
dst = 1
# Set src to current device
torch.accelerator.set_device_index(src)
with torch.accelerator.device_index(dst):
    # Inside with statement, we set dst to current device
    assert torch.accelerator.get_device_index() == dst
# Here the current device should be src
assert torch.accelerator.get_device_index() == src
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148864
Approved by: https://github.com/albanD
2025-04-25 09:45:25 +00:00
f38dae76ee [Proposal] Drop legacy CUDA support to slim down the wheels (#152069)
Proposal of dropping legacy CUDA support to slim down the Windows wheels.

With the latest release of 2.7.0 and the new Blackwell support, we've seen yet another rise in the size of the wheel, going from ~2.5GB with PyTorch 2.6.0 all the way to ~3.1GB with PyTorch 2.7.0 CUDA 12.8 on Python 3.12, and ~3.3GB with Python 3.13.

Python 3.12, Pytorch 2.7.0 Cuda 12.8
![image](https://github.com/user-attachments/assets/78a5bbcb-027e-4139-84f0-57bfae9f594e)

Python 3.13, Pytorch 2.7.0, Cuda 12.8
![image](https://github.com/user-attachments/assets/7f256860-46e3-41f6-81b3-65bd3ee5aa77)

These CI changes imply the removal of support for many GPUs which are now about 8 years old, if not older, including GPUs like the GTX 960M, 950M, 940M, 930M and some other Quadro GPUs from as far back as April 2016, such as the Quadro M500M, as per [Nvidia's documentation](https://developer.nvidia.com/cuda-gpus).

This change would also save on our bandwidth 😅

@seemethere
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152069
Approved by: https://github.com/seemethere, https://github.com/eqy, https://github.com/atalman
2025-04-25 08:20:00 +00:00
a811d3351b [ONNX] Implement sym_not (#152111)
Implement onnx support for sym_not. Replaces https://github.com/pytorch/pytorch/pull/147472

Fix https://github.com/pytorch/pytorch/issues/136572
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152111
Approved by: https://github.com/titaiwangms
2025-04-25 07:50:37 +00:00
6120cc8ccd [executorch hash update] update the pinned executorch hash (#151728)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151728
Approved by: https://github.com/pytorchbot
2025-04-25 05:33:09 +00:00
a936d596f6 [Cutlass] Implement EVT example tensor creation (#150904)
This PR implements a translation layer from inductor IR to "example tensors", the expected arguments of the EVT tracer. These tensors basically store the name, shape, stride, and dtype of the tensor and allow an AST-based Python parser to generate the EVT C++.

Updates to example tensor creation.

Previously merged:
* https://github.com/pytorch/pytorch/pull/150903
* https://github.com/pytorch/pytorch/pull/150346
* https://github.com/pytorch/pytorch/pull/150345
* https://github.com/pytorch/pytorch/pull/150344

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150904
Approved by: https://github.com/eellison
2025-04-25 04:43:37 +00:00
dda0c952e7 [audio hash update] update the pinned audio hash (#152149)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152149
Approved by: https://github.com/pytorchbot
2025-04-25 04:20:06 +00:00
e2c7ae52d5 [ONNX] Add group_norm support from opset 21 (#152138)
I didn't run the model in test because ORT doesn't have the op yet. Nevertheless it should be leveraged for newer opset versions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152138
Approved by: https://github.com/titaiwangms, https://github.com/shubhambhokare1, https://github.com/cyyever
2025-04-25 03:30:07 +00:00
1a6d50d407 Reducer: add check on received data to avoid segfault (#152143)
When ncclCommAbort is called, it may return invalid/corrupted data to the reducer. This adds a check so we don't read past the end of the tensors, which would lead to a segfault.

While this looks like it could be a security issue, it actually isn't, since we only read past the end of the buffer; we never write past it.

Fixes #149418

Test plan:

https://gist.github.com/d4l3k/b47c2c95cf9c37e78069e19f1b6ed2c6

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152143
Approved by: https://github.com/fduwjj, https://github.com/fegin
2025-04-25 02:16:44 +00:00
7f28c03fac Adding fbgemm to whitelist (#152079)
Adding `torch.ops.fbgemm` to GraphPickler's allowlist. Otherwise, an FX graph module containing an `fbgemm` node will return an "Unable to pickle non-standard op" error.

The validation is done on the model, and the difference appears only in the graph name, not the node.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152079
Approved by: https://github.com/aorenste
2025-04-25 01:13:51 +00:00
8313bc27f2 Revert "Add OIDC permissions to bazel workflow (#151456)"
This reverts commit 5fc1eb85fc1b9d605939830d3be3506762b3df27.

Reverted https://github.com/pytorch/pytorch/pull/151456 on behalf of https://github.com/seemethere due to This is causing downstream failures on PRs, see examples in PR comment ([comment](https://github.com/pytorch/pytorch/pull/151456#issuecomment-2829130319))
2025-04-25 00:37:15 +00:00
75c71ab371 [Break XPU] generalize newly introduced device bias code in Inductor UT. (#151926)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151926
Approved by: https://github.com/EikanWang, https://github.com/jansel
2025-04-25 00:03:23 +00:00
d70490ecfe [Inductor][CPP] Optimize the epilogue for int8 GEMM Template (#152000)
**Summary**
For the int8 GEMM template, the micro-GEMM will compute in u8s8s32 and we will do the scale/zero-point compensation in the epilogue. In general, it will be calculated as:
```
temp = micro_gemm_output * x_scale * w_scale
temp = temp - (x_scale * w_scale * x_zp) * sum(w, 0)
```
For the case when `x_scale`, `w_scale`, and `x_zp` are constant, we can pre-calculate the compensation to save runtime computation.
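
A hedged numerical sketch of the idea in plain PyTorch (not the actual C++ template codegen):

```python
import torch

M, K, N = 4, 8, 6
x_scale, w_scale, x_zp = 0.02, 0.01, 3          # all constant at compile time
x_u8 = torch.randint(0, 256, (M, K), dtype=torch.int32)
w_s8 = torch.randint(-128, 128, (K, N), dtype=torch.int32)

acc_s32 = x_u8 @ w_s8                           # micro-GEMM output in s32

# Epilogue computed at runtime, as in the formula above:
ref = acc_s32 * (x_scale * w_scale) - (x_scale * w_scale * x_zp) * w_s8.sum(0)

# With constant scales/zero point, the second term depends only on the
# weights, so it can be pre-computed once and reused:
compensation = (x_scale * w_scale * x_zp) * w_s8.sum(0)
opt = acc_s32 * (x_scale * w_scale) - compensation

assert torch.allclose(ref, opt)
```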

**Performance**
Tested with 4 cores of XEON-5 and shapes from the ViT model.
Before
```
GEMM(M=197,N=768,K=768) compile: 0.0939 ms (2.48 TOPS, 18.13 GB/s)
GEMM(M=197,N=3072,K=768) compile: 0.4275 ms (2.17 TOPS, 13.90 GB/s)
GEMM(M=197,N=768,K=3072) compile: 0.2677 ms (3.47 TOPS, 22.20 GB/s)
GEMM(M=1,N=1000,K=768) compile: 0.0148 ms (0.10 TOPS, 99.10 GB/s)
```

After
```
GEMM(M=197,N=768,K=768) compile: 0.0597 ms (3.90 TOPS, 28.53 GB/s)
GEMM(M=197,N=3072,K=768) compile: 0.2126 ms (4.37 TOPS, 27.95 GB/s)
GEMM(M=197,N=768,K=3072) compile: 0.2282 ms (4.07 TOPS, 26.04 GB/s)
GEMM(M=1,N=1000,K=768) compile: 0.0149 ms (0.10 TOPS, 98.71 GB/s)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152000
Approved by: https://github.com/Xia-Weiwen, https://github.com/CaoE, https://github.com/jansel
2025-04-24 23:36:00 +00:00
2089b22c76 [xpu] set aot device flags in cpp_extension (#149459)
If PyTorch is compiled with only AOT text strings starting with "dg2", the `_get_sycl_arch_list()` function will pass an empty string to the `-device` argument of `ocloc` and then cause a compilation crash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149459
Approved by: https://github.com/guangyey, https://github.com/dvrogozh, https://github.com/malfet

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
Co-authored-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
2025-04-24 22:55:52 +00:00
fc6e37ceb2 [Inductor] Record Triton’s Base32 Cache Key in .best_config for Debugging (#148981)
This is a follow-up PR of the reverted one https://github.com/pytorch/pytorch/pull/147019 :

Modified TorchInductor’s autotuning flow so that each best_config JSON file also includes the Triton “base32” (or base64) cache key.

Motivation

Debugging & Analysis: With this change, we can quickly identify which compiled binary and IRs belong to a given best config.
The impact is minimal since it is only an extra field in .best_config. It can help advanced performance tuning or kernel-level debugging.

Also, since Triton already stores the cubin/hsaco in its cache, developers/researchers can avoid setting store_cubin = True, since they can get the cubin/hsaco from the Triton cache; with the code provided in this PR, they can easily match the best_config with the right Triton cache directory for the "best" kernel.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148981
Approved by: https://github.com/davidberard98
2025-04-24 21:28:53 +00:00
0413358a77 Non-deterministic alert in histc_cuda for floating types only (#151701)
The note about atomic add only applies for floating point. The
implementation is deterministic for integer data types.

fixes: #151610

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151701
Approved by: https://github.com/ngimel, https://github.com/Skylion007
2025-04-24 21:16:46 +00:00
6ced5e6840 Python 3.11 and 3.13 support for Windows Arm64 (#152109)
This PR adds Python 3.11 and 3.13 support for Windows Arm64 wheels and creates the necessary jobs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152109
Approved by: https://github.com/malfet
2025-04-24 21:09:14 +00:00
eqy
d78d2af4e3 [CUDA][TF32] Account for TF32 in test_corrcoef (#151830)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151830
Approved by: https://github.com/Skylion007
2025-04-24 21:06:07 +00:00
8a9c66bb70 Improve stable library apis per Scott's feedback (#152040)
Following 3 suggestions:
1. Inline the at::Tensor arg
2. Use a unique_ptr to an array vs. std::vector
3. Document the `std::optional<S>()` case

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152040
Approved by: https://github.com/swolchok, https://github.com/albanD
2025-04-24 20:51:03 +00:00
dccc41581a Include other accelerators in capturable docstr for optimizers (#149770)
Fixes #149722

@ILCSFNO is this better?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149770
Approved by: https://github.com/albanD
2025-04-24 20:38:42 +00:00
bd09d87fdb add Out Notes (#151306)
Fixes #150181
@albanD Could you please have a check?

Build locally without pytorch build:

![Developer-FAQ](https://github.com/user-attachments/assets/351a7e0b-588e-48ae-ad0a-03f427c86e89)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151306
Approved by: https://github.com/albanD
2025-04-24 20:25:09 +00:00
92f125e622 [export] improve error message for deserializing custom triton op (#152029)
In https://github.com/pytorch/pytorch/issues/151746, users ran into an error where a custom triton op could not be resolved into an operator from its string target. We improve the error message by reminding users to register the same custom operator at deserialization time.

Now the error looks like this:
```python
torch._export.serde.serialize.SerializeError: We failed to resolve torch.ops.triton_kernel.add.default to an operator. If it's a custom op/custom triton op, this is usally because the custom op is not registered when deserializing. Please import the custom op to register it before deserializing. Otherwise, please file an issue on github. Unsupported target type for node Node(target='torch.ops.triton_kernel.add.default', inputs=[NamedArgument(name='x', arg=Argument(as_tensor=TensorArgument(name='linear')), kind=1), NamedArgument(name='y', arg=Argument(as_tensor=TensorArgument(name='mul')), kind=1)], outputs=[Argument(as_tensor=TensorArgument(name='add'))], metadata={'stack_trace': 'File "/data/users/yidi/pytorch/test.py", line 50, in forward\n    output = triton_add(dense_output, bias)', 'nn_module_stack': 'L__self__,,__main__.SimpleModel', 'torch_fn': 'add.default_1;OpOverload.add.default'}, is_hop_single_tensor_return=None): <class 'str'>.```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152029
Approved by: https://github.com/jingsh
2025-04-24 20:22:05 +00:00
24bda01a93 Pin theme to a branch (#152046)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152046
Approved by: https://github.com/albanD
2025-04-24 20:20:21 +00:00
eqy
6efc572221 [CUDA][CPU] Bump system memory requirement for test_cross_entropy_large_tensor (#151812)
`/usr/bin/time` seems to show max resident pages at 119GiB

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151812
Approved by: https://github.com/colesbury
2025-04-24 19:25:29 +00:00
b1d055fd6a Revert "[dynamo] Add guard serialization for tensor matches. (#151318)"
This reverts commit 81c4369d813facf39313dfd481adc71704cbc2c1.

Reverted https://github.com/pytorch/pytorch/pull/151318 on behalf of https://github.com/zhxchen17 due to macos test failing ([comment](https://github.com/pytorch/pytorch/pull/151318#issuecomment-2828638168))
2025-04-24 19:22:45 +00:00
b11c9e1808 [CI][docker] Use install_cusparselt when possible in docker image (#150600)
Spot-checked builds for a line like `Found CUSPARSELT: /usr/local/cuda/lib64/libcusparseLt.so`. I don't know if there's another way to do it.

I am slowly trying to reduce the duplicated code in the Docker image installs.
Pros:
* less dup code

Cons:
* more docker copies
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150600
Approved by: https://github.com/atalman
2025-04-24 18:52:10 +00:00
ff075d0815 Update docs dependencies for local build (#151796)
Fixes #151786

- Changed requirements.txt to a symlink to .ci/docker/requirements-docs.txt
- Updated README.md with better doc build instructions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151796
Approved by: https://github.com/malfet
2025-04-24 18:40:42 +00:00
81c4369d81 [dynamo] Add guard serialization for tensor matches. (#151318)
This is a proof-of-concept of how we could serialize a guard and deserialize it back from the bytes.

The main behavioral change introduced in this diff is on CheckFunctionManager:

```
check_fn_manager = CheckFunctionManager(code, output_graph, guards_serialization_mode="save")

guards_state: bytes = check_fn_manager.guards_state
```

Once `guards_serialization_mode` is set to `save`, CheckFunctionManager will return an additional `bytes` object called `guards_state` which should contain all the information needed for deserializing guards later.

When we load back the guards state, we set `guards_serialization_mode` to `load`:

```
output_graph_state = pickle.loads(guards_state)
check_fn_manager = CheckFunctionManager(code, output_graph_state, guards_serialization_mode="load")
```

# TENSOR_MATCH

Since we have many types of guards to support, we will break the work into small diffs instead of a single diff to support every guards.

We kick off the work from TENSOR_MATCH from this diff.

# Testing

For each type of guard we will test it like the following:
1. Use guard_filter_fn to select 1 type of guard each time.
2. Call InstructionTranslator directly on an example function to get OutputGraph and CheckFunctionManager (reference guard manager)
3. Serialize->deserialize the output graph state and re-build the guards with a new CheckFunctionManager (loaded guard manager)
4. Throw a set of example inputs to both reference and loaded guard manager to see if their behavior match.

Differential Revision: [D72987485](https://our.internmc.facebook.com/intern/diff/D72987485/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151318
Approved by: https://github.com/jansel, https://github.com/anijain2305
2025-04-24 18:07:01 +00:00
03970dfd4c Add functionality for installing free variables (#151134)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151134
Approved by: https://github.com/anijain2305
ghstack dependencies: #152036
2025-04-24 17:57:54 +00:00
402d19c0bd add basic unit tests and noop config (#152036)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152036
Approved by: https://github.com/anijain2305
2025-04-24 17:57:54 +00:00
9c1bc9ce46 [fake tensor] Cache None, integer and SymInts in the output (#151961)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151961
Approved by: https://github.com/bdhirsh, https://github.com/zou3519
ghstack dependencies: #151409, #151633, #151477, #151957
2025-04-24 16:44:45 +00:00
0eb554e96a Better error msg for too big to optimize (#151855)
Summary: In the "too big to optimize" error message, tell the user that they should use the torch._inductor.config.aot_inductor.compile_wrapper_opt_level = 'O0' flag
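
For reference, a quick sketch of applying the suggested flag (based on the flag named above):

```python
import torch._inductor.config as inductor_config

# Escape hatch suggested by the error message: compile the (huge) generated
# wrapper without optimization so it does not hit the "too big to optimize" limit.
inductor_config.aot_inductor.compile_wrapper_opt_level = "O0"
```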

Test Plan:
This is not added to the unit test cases because it runs for a while before the expected failure.

```

    def test_runtime_checks_error_msg(self):

        with torch.library._scoped_library("mylib", "FRAGMENT") as lib:
            torch.library.define(
                "mylib::foo",
                "(Tensor a, Tensor b) -> Tensor",
                tags=torch.Tag.pt2_compliant_tag,
                lib=lib,
            )

            @torch.library.impl("mylib::foo", "cpu", lib=lib)
            def foo(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
                return a + b

            @torch.library.impl_abstract("mylib::foo", lib=lib)
            def foo_fake_impl(a, b):
                return a + b

            class Model(torch.nn.Module):
                def __init__(self) -> None:
                    super().__init__()

                def forward(self, x):
                    for i in range(10000):
                        x = torch.ops.mylib.foo(x, x)
                    return x

            inputs = (torch.ones(8, 8, 8), )
            model = Model()
            with self.assertRaisesRegex(Exception, "torch._inductor.config.aot_inductor.compile_wrapper_opt_level"):
                with torch.no_grad():
                    AOTIRunnerUtil.compile(
                        model,
                        inputs,
                    )
```

Differential Revision: D72323380

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151855
Approved by: https://github.com/desertfire
2025-04-24 16:35:19 +00:00
56e67badc3 Move verbose warning to warning_once (#152044)
It was printing thousands of lines for me.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152044
Approved by: https://github.com/XilunWu
2025-04-24 16:18:34 +00:00
3a170a8ce6 Revert "[Cutlass] Implement EVT example tensor creation (#150904)"
This reverts commit 253059356fc93b51c7c53246a5922db3fb14e184.

Reverted https://github.com/pytorch/pytorch/pull/150904 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking the test_example_tensor_creation test internally. See D73519195 for more details. To validate your fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/150904#issuecomment-2828132914))
2025-04-24 16:00:25 +00:00
d743a7bd85 [invoke_subgraph] Cache fake tensor if no unbacked symint in the output (#151957)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151957
Approved by: https://github.com/zou3519, https://github.com/bdhirsh
ghstack dependencies: #151409, #151633, #151477
2025-04-24 14:17:22 +00:00
1d73b644a8 [fake tensor cache] Support index with non bool/int8 indices (#151477)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151477
Approved by: https://github.com/zou3519, https://github.com/bdhirsh
ghstack dependencies: #151409, #151633
2025-04-24 13:48:18 +00:00
41285f26e4 [invoke_subgraph][fake tensor] Add finalizer on subgraph instead of the functionalize ctx wrapper (#151633)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151633
Approved by: https://github.com/zou3519
ghstack dependencies: #151409
2025-04-24 13:32:08 +00:00
3278ddd50c [invoke_subgraph] Compile time traces (#151409)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151409
Approved by: https://github.com/zou3519
2025-04-24 13:20:50 +00:00
5e320eea66 [BE] follow autoformating and linter (#151507)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151507
Approved by: https://github.com/Skylion007
2025-04-24 07:37:04 +00:00
5b368fa0b7 Add torch.cuda._compile_kernel() (#151484)
Followup work on top https://github.com/pytorch/pytorch/pull/149480

Wrapper on top of nvrtc inspired by https://gist.github.com/malfet/2c9a25976dd7396430c38af603f791da from @malfet

Compiling toy kernels with this setup takes 0.01s vs 90s using `load_inline()` on my local H100. This was primarily motivated by the timeouts I was seeing in the popcorn leaderboard but would also be useful to integrate into KernelBench

This PR is in the same spirit as https://github.com/pytorch/pytorch/pull/148972 which was a similar UX for Metal

For now we are planning on landing this as a private function because we expect to iterate both on the user-facing API and the internals of the implementation; we will open up a separate issue to discuss the path towards making this work public and give a broader overview of the state of custom CUDA kernel authoring in PyTorch.
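
For reference, the `load_inline()` baseline being compared against looks roughly like this (a minimal sketch; the kernel body and names are illustrative, not from this PR):

```python
# Minimal load_inline() CUDA sketch: this is the slow full-nvcc-build path
# that the new nvrtc-based _compile_kernel() is being compared against.
import torch
from torch.utils.cpp_extension import load_inline

cuda_src = """
__global__ void add_one_kernel(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

torch::Tensor add_one(torch::Tensor x) {
    int n = x.numel();
    add_one_kernel<<<(n + 255) / 256, 256>>>(x.data_ptr<float>(), n);
    return x;
}
"""

mod = load_inline(
    name="toy_add_one",
    cpp_sources="torch::Tensor add_one(torch::Tensor x);",
    cuda_sources=cuda_src,
    functions=["add_one"],
)
print(mod.add_one(torch.zeros(16, device="cuda")))
```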

Future work, as a prereq to making the work public
* divup primitive
* support multiple kernels
* Expose _get_nvrtc_version from native code
* interop with torch.compile
* AMD support
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151484
Approved by: https://github.com/malfet
2025-04-24 07:14:31 +00:00
78953ee122 [pytorch] reland of [cutlass backend] delay construction of cutlass presets to when called (#151875) (#152031)
Differential Revision: D73524978

reland of https://github.com/pytorch/pytorch/pull/151875

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152031
Approved by: https://github.com/yangw-dev
2025-04-24 05:36:36 +00:00
2ea8653391 [vec128] Fix fmsub NEON defintion (#152075)
As reported in https://github.com/pytorch/pytorch/issues/149292, according to the manual, `vfmsq_f32` implements `c - a * b` rather than `a * b - c`, so its call must be prefixed with `vnegq_f32`

Also, adjust the tests to use OpMath for FMA computation to avoid accuracy error accumulation due to non-fused multiply-and-add over lower precision dtypes

Note that `Vectorized::fmsub` is not currently instantiated anywhere, so it could safely remain broken

TODO:
 - Enable C++ testing on MacOS and/or aarch64 platforms (right now Mac tests are built without C++ tests)

Fixes https://github.com/pytorch/pytorch/issues/149292

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152075
Approved by: https://github.com/swolchok
ghstack dependencies: #151955
2025-04-24 05:10:45 +00:00
5e9bdc9b86 [MPS] layernorm forward kernel (#152010)
Implements layernorm forward pass as a metal kernel instead of MPSGraph ops. Speed ups are indicated on the chart below:
![Figure_1](https://github.com/user-attachments/assets/27a4d2ef-b3e4-4650-9ce3-b939c080321e)

Script for generating timings: build torch with the old/new codebase, then run this with the different output file name indicated at the end of the script.
```python
import csv
import time

import numpy as np

import torch
import torch.nn.functional as F

matrix_sizes = [32, 64, 128, 256, 512, 1024, 2048, 4096, 8192]
batch_sizes = [1]
elementwise_affine = [False, True]
num_runs = 50
warmup_runs = 3

def create_input_tensor(n, batch_size):
    torch.manual_seed(42)
    return torch.randn(batch_size, n, dtype=torch.float32)

def run_layer_norm(A, normalized_shape, elementwise_affine):
    # Create affine params so the elementwise_affine flag actually affects the op
    weight = torch.ones(normalized_shape, device=A.device) if elementwise_affine else None
    bias = torch.zeros(normalized_shape, device=A.device) if elementwise_affine else None
    torch.mps.synchronize()
    start = time.perf_counter()
    out = F.layer_norm(A, normalized_shape, weight, bias)
    torch.mps.synchronize()
    end = time.perf_counter()
    return out, end - start

results = {"N": [], "elementwise_affine": [], "batch_size": [], "mean_time": [], "std_time": []}

for el_aff in elementwise_affine:
    for n in matrix_sizes:
        for batch_size in batch_sizes:
            print(f"\nBenchmarking LayerNorm for input size N={n}, batch_size={batch_size}, elementwise_affine={el_aff}")

            try:
                A_cpu = create_input_tensor(n, batch_size)
                A_mps = A_cpu.to("mps")

                normalized_shape = (n,)

                for _ in range(warmup_runs):
                    _, _ = run_layer_norm(A_mps, normalized_shape, el_aff)

                times = []
                for _ in range(num_runs):
                    _, t = run_layer_norm(A_mps, normalized_shape, el_aff)
                    times.append(t)

                mean_time = np.mean(times)
                std_time = np.std(times)

                results["N"].append(n)
                results["elementwise_affine"].append(el_aff)
                results["batch_size"].append(batch_size)
                results["mean_time"].append(mean_time)
                results["std_time"].append(std_time)

                print(f"Mean time: {mean_time:.4f}s ± {std_time:.4f}s")

            except RuntimeError as e:
                print(f"Error for N={n}, batch_size={batch_size}: {e}")
                continue

with open("layernorm_benchmark_times_new.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["N", "elementwise_affine", "batch_size", "mean_time", "std_time"])
    for i in range(len(results["N"])):
        writer.writerow(
            [
                results["N"][i],
                results["elementwise_affine"][i],
                results["batch_size"][i],
                results["mean_time"][i],
                results["std_time"][i],
            ]
        )

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152010
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-04-24 05:07:46 +00:00
a389835313 [MPS] Adjust test_sum_dtypes so it can run on MPS. (#152064)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152064
Approved by: https://github.com/malfet, https://github.com/jansel

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-04-24 05:04:49 +00:00
2102b3b4c5 [FSDP1] print fqns when debug FlatParamHandle (#151336)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151336
Approved by: https://github.com/awgu, https://github.com/Skylion007
2025-04-24 04:49:24 +00:00
2a58d2a155 StringCordView: make iterator fast when there is only one piece (#151810)
This makes the StringCordView iterator a variant holding
either the existing implementation (when there is more than one piece)
or a simple `std::string_view::iterator` (when there is only one
piece). The latter seems to be significantly cheaper.

Differential Revision: [D73379178](https://our.internmc.facebook.com/intern/diff/D73379178/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151810
Approved by: https://github.com/Skylion007
ghstack dependencies: #151801, #151802, #151803, #151804, #151805, #151806, #151807
2025-04-24 04:43:34 +00:00
76cc379bec Fix missing moves in SchemaTypeParser::parseFakeAndRealType (#151807)
Was seeing a small amount of shared_ptr traffic from these.

The std::move(text) at the top is just a piggyback.

Differential Revision: [D73376720](https://our.internmc.facebook.com/intern/diff/D73376720/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151807
Approved by: https://github.com/zou3519, https://github.com/cyyever, https://github.com/Skylion007
ghstack dependencies: #151801, #151802, #151803, #151804, #151805, #151806
2025-04-24 04:43:34 +00:00
68454b9d17 Fix a missed c10::TypeFactory::create spot in function_schema_parser (#151806)
Looks like we are supposed to be using TypeFactory instead of direct creation everywhere that might run on mobile.

Differential Revision: [D73376716](https://our.internmc.facebook.com/intern/diff/D73376716/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151806
Approved by: https://github.com/Skylion007, https://github.com/iseeyuan
ghstack dependencies: #151801, #151802, #151803, #151804, #151805
2025-04-24 04:43:34 +00:00
b237211b42 Fix easy missing moves in function_schema_parser (#151805)
Just some straightforward not-moving-upon-return.

Differential Revision: [D73376718](https://our.internmc.facebook.com/intern/diff/D73376718/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151805
Approved by: https://github.com/malfet, https://github.com/cyyever
ghstack dependencies: #151801, #151802, #151803, #151804
2025-04-24 04:43:34 +00:00
89a85d0954 Add & use Token::text_view() (which returns a string_view unlike text()) (#151804)
Sadly, I can't just fix text() because that might cause lifetime issues in somebody's code.

Differential Revision: [D73376715](https://our.internmc.facebook.com/intern/diff/D73376715/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151804
Approved by: https://github.com/zou3519, https://github.com/cyyever, https://github.com/Skylion007, https://github.com/malfet
ghstack dependencies: #151801, #151802, #151803
2025-04-24 04:43:34 +00:00
0559741d7f Fix return type of TypeFactoryBase<c10::DynamicType>::get (#151803)
getBaseType() actually returns a reference. This was causing shared_ptr copies.

Differential Revision: [D73376717](https://our.internmc.facebook.com/intern/diff/D73376717/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151803
Approved by: https://github.com/malfet, https://github.com/Skylion007
ghstack dependencies: #151801, #151802
2025-04-24 04:43:34 +00:00
fabbcddab1 Create and use DynamicTypes for check in DispatchKeyExtractor::makeBitsetForDispatchArgs (#151802)
On mobile, many but not all things in the JIT type subsystem start using DynamicType. Not using DynamicType  was imposing a startup time cost here, as explained in the comment.

Differential Revision: [D73129442](https://our.internmc.facebook.com/intern/diff/D73129442/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151802
Approved by: https://github.com/malfet
ghstack dependencies: #151801
2025-04-24 04:43:34 +00:00
5de92e676a Don't copy DynamicType argument to DynamicType::create (#151801)
This improves performance of DynamicType::isSubtypeOfExt.

Differential Revision: [D73129449](https://our.internmc.facebook.com/intern/diff/D73129449/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151801
Approved by: https://github.com/malfet
2025-04-24 04:43:34 +00:00
43f1b60ded Revert "[MPS] Adjust test_sum_dtypes so it can run on MPS. (#152064)"
This reverts commit d703f062fe7e4ead362ec0473ef33579e84532ac.

Reverted https://github.com/pytorch/pytorch/pull/152064 on behalf of https://github.com/malfet due to Lint is not green ([comment](https://github.com/pytorch/pytorch/pull/152064#issuecomment-2826305781))
2025-04-24 04:04:49 +00:00
e2cf60ff18 [MPS] Fix test_neg_index_mps (#151966)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151966
Approved by: https://github.com/malfet, https://github.com/jansel

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-04-24 04:02:09 +00:00
2ee8de54b1 [dynamic shapes] user-code friendly statically_known_true, has_static_value (#151601)
Fixes #151480

Allows `statically_known_true` in user code, and introduces `has_static_value`, which returns True if the input has a static bool/float/int value
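
A rough usage sketch (assuming both helpers are importable from `torch.fx.experimental.symbolic_shapes`; `has_static_value` is the new one):

```python
# Rough sketch: with a dynamic size, statically_known_true returns False
# instead of installing a guard, so we fall through to the general branch.
import torch
from torch.fx.experimental.symbolic_shapes import statically_known_true

@torch.compile(dynamic=True)
def f(x):
    if statically_known_true(x.size(0) % 2 == 0):
        return x.reshape(2, -1)
    return x

print(f(torch.randn(6)).shape)
```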

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151601
Approved by: https://github.com/laithsakka, https://github.com/zou3519, https://github.com/jingsh
2025-04-24 02:53:59 +00:00
d703f062fe [MPS] Adjust test_sum_dtypes so it can run on MPS. (#152064)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152064
Approved by: https://github.com/malfet, https://github.com/jansel
2025-04-24 02:32:36 +00:00
4ac2ee573d [sigmoid] memory planner C10 deps (#151275)
Summary: perf-sensitive util functions for use in our memory planner

Test Plan: CI

Differential Revision: D73002726

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151275
Approved by: https://github.com/georgiaphillips
2025-04-24 01:46:32 +00:00
c91acad73a [Easy] Add more check for elapsedTime of torch.xxx.Event and torch.Event (#151404)
As the title stated

**Changes:**
- Add **record**, **query** and **enable_timing** check
- Add related tests
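
A minimal sketch of the timing pattern these checks guard: `elapsed_time()` is only meaningful when both events were created with `enable_timing=True` and have actually been recorded.

```python
import torch

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
torch.randn(1024, 1024, device="cuda") @ torch.randn(1024, 1024, device="cuda")
end.record()
torch.cuda.synchronize()
print(f"{start.elapsed_time(end):.3f} ms")
```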
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151404
Approved by: https://github.com/albanD
2025-04-24 01:28:09 +00:00
f39a1a43ee Fix typos in meta.rst (#151979)
### Fixes made:
- "allow you to the module" → corrected to "allows you to move the module"

- "allow" → changed to "allows" to agree with the singular subject "method"
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151979
Approved by: https://github.com/colesbury
2025-04-24 01:25:09 +00:00
4e1d4333f7 [FlexAttention] Remove Old Constraint on lastdim strides (#151959)
Fixes: #148827

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151959
Approved by: https://github.com/Chillee
ghstack dependencies: #151846
2025-04-24 01:09:52 +00:00
2455ded502 [FlexAttention] Fix device test instantation (#151846)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151846
Approved by: https://github.com/Chillee, https://github.com/BoyuanFeng, https://github.com/mlazos
2025-04-24 01:09:52 +00:00
f2cfeb23e5 [Environment Variable][7/N] Use thread-safe getenv functions (#140211)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140211
Approved by: https://github.com/ezyang, https://github.com/eqy
2025-04-24 01:06:29 +00:00
8172397025 Revert "Update torch-xpu-ops commit pin (#150827)"
This reverts commit 776aa682218bad4df7b6cd46ef2a0f1d8ca1194c.

Reverted https://github.com/pytorch/pytorch/pull/150827 on behalf of https://github.com/etaf due to Inductor UT regression ([comment](https://github.com/pytorch/pytorch/pull/150827#issuecomment-2825857903))
2025-04-24 00:41:06 +00:00
4d2d833976 [CI] Update sleef submodule to v3.8 (#151955)
Should help with RISC-V cross-compilation.
3.9.0 migration is blocked by sleef project switching to C++20
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151955
Approved by: https://github.com/atalman, https://github.com/wdvr, https://github.com/Skylion007
2025-04-23 23:56:05 +00:00
fd3d339e17 [dynamic shapes] be less aggressive with runtime assert CSE for bounds (#151590)
Fixes #150540
Fixes #147772

Stops trying to CSE bound expressions and only does exact deduplication for runtime asserts. Adds test cases to check that AOTAutograd doesn't hit data-dependent errors when retracing due to not seeing the asserts.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151590
Approved by: https://github.com/laithsakka
2025-04-23 23:07:00 +00:00
47ad351ff3 [DRAFT] INitial version of sticky export (#151047)
Summary: This is to make torchnative demos and benchmarking real models simpler by not requiring people to find example inputs first.

Test Plan: CI

Differential Revision: D72815584

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151047
Approved by: https://github.com/zhxchen17
2025-04-23 22:58:43 +00:00
bd191730ce [cutlass backend] Stop using GenerateSM80 for SM90 and SM100 (#150781)
Not urgent.

We don't use the GenerateSM80 ops I believe.

For SM100, we could skip SM90 as well. But I don't have data for that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150781
Approved by: https://github.com/kadeng
2025-04-23 22:16:57 +00:00
dccb7a9cb2 [pytorch] use a mutex in initialize_torch_libraries (#151938)
Summary: The TORCH_LIBRARY_THREAD_UNSAFE_LAZY_INIT feature is thread unsafe for calling the initializers, but we want to allow the deferred initializer call to be safe from multiple threads. Add a mutex to ensure we have thread safe construction of the libraries post launch.

Differential Revision: D73457714

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151938
Approved by: https://github.com/swolchok, https://github.com/zou3519
2025-04-23 21:41:01 +00:00
562328501e Revert "Turn on static cuda launcher in OSS (#151691)"
This reverts commit e31e2d27c6739cad5327cc54e6ac9fd28a157cbf.

Reverted https://github.com/pytorch/pytorch/pull/151691 on behalf of https://github.com/malfet due to This breaks tests, see c1f51cf2c4/1 ([comment](https://github.com/pytorch/pytorch/pull/151691#issuecomment-2825427252))
2025-04-23 20:28:31 +00:00
98c53d8b39 Revert "[MPS] Fix test_neg_index_mps (#151966)"
This reverts commit 9422e24c472ccbaffc4cf3935e12d0a83f269560.

Reverted https://github.com/pytorch/pytorch/pull/151966 on behalf of https://github.com/malfet due to Looks like it broke halide testing, see https://github.com/pytorch/pytorch/actions/runs/14623941238/job/41034065229 ([comment](https://github.com/pytorch/pytorch/pull/151966#issuecomment-2825425305))
2025-04-23 20:25:49 +00:00
c1f51cf2c4 [map] defer importing AOTConfig and create_joint dependency (#151479)
Summary:
We reverted D72896450 due to a weird error that happens in a seemingly unrelated test: "buck2 run apf/data/tests:preproc_state_serializer_test -- --filter-text "test_load_artifact"".

I did some investigation and found that moving the imports of AOTConfig and create_joint inside create_fw_bw_graph delays importing their recursively imported modules from test-construction time to test-running time. The path.exists mock then gets called multiple times due to the inspect.getsource calls in multiple places in torch.

Specifically, we set a breakpoint at the side effect of the mocked os.path.exists. P1787425831 shows the import stack trace before the change; P1787431638 shows it after the change.

The notable difference is that in the second paste, we trigger an os.path.exists call (which gets recorded by the mock) when somewhere in triton inspect.getsourcelines is called while constructing OnDiskPreprocStateSerializer.

Looking at the test, what it actually wants to test is the deserialize step, so we reset_mock before that step to avoid counting calls that happened at import time.

Test Plan:
buck2 run apf/data/tests:preproc_state_serializer_test -- --filter-text "test_load_artifact"

and existing tests for map.

Differential Revision: D73138415

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151479
Approved by: https://github.com/angelayi, https://github.com/zou3519
2025-04-23 19:16:40 +00:00
99ae7d4069 Reland fast gather and index implementation (#151917)
This PR reapplies #151490 and #151753 together, and adds some missing checks when applying the fast path.
Previously missed checks:
1) indexing path has the stride in the indexed dimension in bytes, gather path has the stride in the indexed dimension in elements. When checking if fast path is applicable, I didn't take this difference into account, and still multiplied the indexing stride by element size. Fixed and test added
2) We want to take fast path only when we are copying contiguous equally spaced slices of inputs + all the necessary alignment requirements. The effective tensor size should be 2d (after all possible flattening is applied), the index stride in the last dimension should be 0, and, since in the kernel we are not applying non-indexing-related offsets to src tensor, the src tensor stride in the second dimension should be 0. This automatically happens for gather with dim=0, so I didn't put in an explicit condition for this. Sometimes all conditions except first dim "effective" stride equal to 0 are satisfied for scatter on non-zero dim, when index size in the indexing dimension is 1 and thus it is collapsed (dimensions of size 1 are always collapsed), e.g.
```
        # test gather along 1st dim that can accidentally trigger fast path
        # because due to index dimension in the gather dim being 1
        # an unexpected squashing in tensorIterator happens
        src = make_tensor((16, 2, 16), device=device, dtype=dtype)
        ind = torch.randint(2, (16, 1), device=device).view(16, 1, 1).expand(16, 1, 16)
        res = torch.gather(src, dim=1, index=ind)
        if res.device.type == "cuda":
            ref_cpu = torch.gather(src.cpu(), dim=1, index=ind.cpu())
            self.assertEqual(res.cpu(), ref_cpu, atol=0, rtol=0)
```
Note that if index size here was (16, 2, 16) instead of (16, 1, 16) then the middle dimension could not be collapsed and we wouldn't end up incorrectly taking fast path.
We could update the kernel to take this stride into account when computing offsets into src tensor, or we could specifically disallow non-zero stride on the first dimension. I took the second path for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151917
Approved by: https://github.com/eqy, https://github.com/malfet, https://github.com/Skylion007
2025-04-23 19:13:13 +00:00
69e41cee04 move find_hop_schema into _higher_order_ops/schema.py (#151147)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151147
Approved by: https://github.com/zou3519
2025-04-23 18:26:37 +00:00
5acc3e286a [Inductor] Add Additional Configs for persistent+TMA version of Triton mm and addmm (#150587)
Summary:
This PR introduces additional autotuning configurations for the persistent+TMA version of Triton `mm` and `addmm` operations. The new configurations are as follows:
* `(128, 128, 64, 5, 8)`
* `(256, 128, 64, 4, 8)`
* `(128, 128, 64, 5, 4)`

These configurations were selected based on exhaustive autotuning performed on commonly used shapes from an internal foundational model.

While these new configs are generally more performant across the board, we see notable gains in a few specific cases:
* In scenarios where `n >> m, k`, the configurations `(128, 128, 64, 5, 8)` and `(256, 128, 64, 4, 8)` tend to produce an additional 5-10% speedup over the aten baseline compared to the original configurations.
* Similarly, the configuration `(128, 128, 64, 5, 4)` yields approximately an 8% improvement in scenarios where `k >> m, n`.

These enhancements are expected to provide performance benefits across diverse use cases, particularly when compared to the original set of configurations.

Test Plan:
contbuild & OSS CI

Reviewers: paulzhan

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150587
Approved by: https://github.com/PaulZhang12, https://github.com/drisspg, https://github.com/eellison
2025-04-23 18:21:35 +00:00
3c1a17a08b [Dynamo] Use LazyVariableTracker in base VT (#151847)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151847
Approved by: https://github.com/StrongerXi
2025-04-23 18:18:01 +00:00
aa285e6512 Revert "[cutlass backend] delay construction of cutlass presets to when called (#151875)"
This reverts commit 8ca7953d510deb21cd99b92523f73beafa4588bf.

Reverted https://github.com/pytorch/pytorch/pull/151875 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/151875#issuecomment-2825030726))
2025-04-23 17:33:31 +00:00
5f63789dd2 [torchbind] fix error message when attr is a real tensor. (#151944)
Summary: Previously, when attr is defined, "if attr" would try to evaluate the data of attr, which is not intended, and we would get an ugly error stack if the attr is not evaluable (like a fake tensor) before the callable(attr) check.

Test Plan: Existing tests.

Reviewed By: yushangdi, henryoier

Differential Revision: D73460905

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151944
Approved by: https://github.com/yushangdi
2025-04-23 17:32:11 +00:00
9344da8bd1 Revert "[fake tensor cache] Support index with non bool/int8 indices (#151477)"
This reverts commit bdb34f55a0c44f82d914dc9b41e785b2eed97675.

Reverted https://github.com/pytorch/pytorch/pull/151477 on behalf of https://github.com/wdvr due to reverting confusing ghstack state ([comment](https://github.com/pytorch/pytorch/pull/151477#issuecomment-2825023953))
2025-04-23 17:30:27 +00:00
348272e67e Revert "[invoke_subgraph][fake tensor] Add finalizer on subgraph instead of the functionalize ctx wrapper (#151633)"
This reverts commit 02dd096e5154867f6eb463d434b9eba0bdc85a64.

Reverted https://github.com/pytorch/pytorch/pull/151633 on behalf of https://github.com/wdvr due to reverting confusing ghstack state ([comment](https://github.com/pytorch/pytorch/pull/151633#issuecomment-2825007363))
2025-04-23 17:23:23 +00:00
2ab752d720 Make torch.jit.Error inherit from Exception (#151947)
Summary:
I can confirm that `torch.jit.Error.mro()` contains `Exception` in the inheritance hierarchy.

This avoids a bunch of `pyre-ignore`s in D73352417.
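
A minimal check of the claim (nothing here is specific to this PR beyond the new inheritance):

```python
import torch

# After this change, a plain `except Exception` handler also catches
# TorchScript runtime errors, since Exception is now in the MRO.
assert issubclass(torch.jit.Error, Exception)
print(torch.jit.Error.mro())
```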

Test Plan: Sandcastle

Differential Revision: D73464544

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151947
Approved by: https://github.com/Skylion007
2025-04-23 17:19:25 +00:00
9422e24c47 [MPS] Fix test_neg_index_mps (#151966)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151966
Approved by: https://github.com/malfet
2025-04-23 17:06:28 +00:00
a560216abb Update description for torch.random.fork_rng (#151881)
As the title stated.
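
A minimal usage sketch of the documented behavior (RNG state consumed inside the block does not leak out of it):

```python
import torch

torch.manual_seed(0)
with torch.random.fork_rng(devices=[]):  # devices=[] forks only the CPU RNG state
    torch.manual_seed(123)
    inside = torch.rand(1)

outside = torch.rand(1)  # continues the original seed-0 stream
print(inside, outside)
```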

Related ISSUE:
https://github.com/pytorch/pytorch/issues/151784
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151881
Approved by: https://github.com/albanD
2025-04-23 16:59:29 +00:00
05114679b7 [ROCm] AtomicAdd specialization on AMD for fp64. (#151724)
Fixes https://github.com/pytorch/pytorch/issues/151039

Improve scatter add performance on MI250X.

Some numbers from the reporter's benchmark:
```
Before: dtype torch.float64 time =  3.577979326248169
After: dtype torch.float64 time =  0.0031385421752929688
```
No perf. improvement to MI300 or MI100.
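
A rough timing sketch for the fp64 scatter_add path this change targets (shapes are illustrative, not the reporter's exact benchmark):

```python
import time
import torch

dim = 1_000_000
a = torch.zeros(100, device="cuda", dtype=torch.float64)
index = torch.randint(0, 100, (dim,), device="cuda")
src = torch.rand(dim, device="cuda", dtype=torch.float64)

torch.cuda.synchronize()
start = time.perf_counter()
a.scatter_add_(0, index, src)
torch.cuda.synchronize()
print(f"scatter_add fp64: {time.perf_counter() - start:.6f}s")
```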

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151724
Approved by: https://github.com/jeffdaily
2025-04-23 16:33:32 +00:00
e31e2d27c6 Turn on static cuda launcher in OSS (#151691)
After a few small bugfixes on tests (to make it so we throw/catch similar exceptions to triton), I think we're ready to flip the switch and use StaticCudaLauncher on by default in OSS.

Initial round of benchmarks look good, with average compilation time going down by a few percent:
<img width="828" alt="image" src="https://github.com/user-attachments/assets/cad03e09-b4d6-49a7-a9e5-6068d1c0bd5c" />

With no changes to runtime perf:
<img width="823" alt="image" src="https://github.com/user-attachments/assets/3fcd435e-1057-43f4-878b-8d66a3812a10" />

There are a few noisy models I want to double check, though, so will run some more tests before accepting review.

Full benchmark results, showing a ~5% compile time improvement across the board:
https://hud.pytorch.org/benchmark/huggingface/inductor_with_cudagraphs?dashboard=torchinductor&startTime=Wed%2C%2016%20Apr%202025%2002%3A31%3A12%20GMT&stopTime=Wed%2C%2023%20Apr%202025%2002%3A31%3A12%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(a100)&lBranch=gh/jamesjwu/139/orig&lCommit=cc45c8667fa23dec16ca50002d9504a34688ca5c&rBranch=main&rCommit=2a9afdae81d0dde98e96d7e3c9ca840e241e5405
<img width="1482" alt="image" src="https://github.com/user-attachments/assets/6e6a7f39-7f44-459f-9845-9a37f084ea82" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151691
Approved by: https://github.com/oulgen
2025-04-23 15:43:24 +00:00
dcc32ff5bf [CUDA][cuBLAS][cuBLASLt] Opt-in unified cuBLAS + cuBLASLt workspaces (#151163)
opt-in version of https://github.com/pytorch/pytorch/pull/145130 as there was a lack of repro for the 70% forward issue
`TORCH_CUBLASLT_UNIFIED_WORKSPACE=1`

@izaitsevfb could you comment if it was repeatable per every forward pass, on startup, or something else?
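
A small sketch of how one would opt in (the environment variable name is taken from this PR; it needs to be set before the first cuBLAS call, so simplest is before importing torch):

```python
import os
os.environ["TORCH_CUBLASLT_UNIFIED_WORKSPACE"] = "1"  # opt-in flag from this PR

import torch

a = torch.randn(512, 512, device="cuda")
print((a @ a).sum())
```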

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151163
Approved by: https://github.com/ngimel
2025-04-23 15:24:22 +00:00
7310049c42 Revert "[FlexAttention] Fix device test instantation (#151846)"
This reverts commit b37fa20771a7aa1ddcfaf59df7e56683d3d0be3b.

Reverted https://github.com/pytorch/pytorch/pull/151846 on behalf of https://github.com/jithunnair-amd due to PR broke rocm workflow ([comment](https://github.com/pytorch/pytorch/pull/151846#issuecomment-2824607429))
2025-04-23 15:01:36 +00:00
21b0ef520d [Easy] Remove redundant code (#151883)
As the title stated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151883
Approved by: https://github.com/albanD, https://github.com/Skylion007
2025-04-23 14:25:19 +00:00
b32b002a6e [BE] Replace std::runtime_error with TORCH_CHECK [1/N] (#151880)
Part of: #148114

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151880
Approved by: https://github.com/albanD, https://github.com/Skylion007, https://github.com/cyyever
2025-04-23 11:14:35 +00:00
6d28d61323 [CI] Remove protobuf from docker image (#151933)
Pretty sure the source should be the one in third-party

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151933
Approved by: https://github.com/huydhn
2025-04-23 10:29:09 +00:00
5b9df57b50 [dynamo] context manager/decorator for dynamo config patching during tracing (#150586)
Implement traceable config patching for Dynamo: enables restricted patching of Dynamo config where user can use a context manager/decorator to change tracing behavior for parts of the code.

The new `dont_skip_tracing` decorator/context manager for ignoring most trace rules is easily implemented with this more generic traceable config patching feature.

Implementation:
- Create a new specialized context manager class representing a wrapper around torch._dynamo.config.patch
- Dynamo doesn't trace into the context manager but updates config at compile time
- Correctness is based on our correctness for handling supported context managers
- Implementation is inspired by how `GradModeVariable` is implemented.

Previous attempts: https://github.com/pytorch/pytorch/pull/148736 (decorator-only global approach) and https://github.com/pytorch/pytorch/pull/149439 (decorator-only traceback approach)

See https://docs.google.com/document/d/1vWNwKL_jpg-PLopifcaSa338wks3GqSVF4GHRguybGg/edit?tab=t.0 for more details on implementation - including previous approaches.

NOTE: this PR fixes a bug where skipped code objects were not tracked by convert_frame.py, leading to cases where code objects would be automatically skipped even after `torch._dynamo.reset()`. This exposed some latent dynamo-wrapped test failures in CI that previously passed in CI but not locally.
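
A usage sketch of the new decorator (assuming it is exposed as `torch._dynamo.dont_skip_tracing`; the helper below is illustrative):

```python
import torch

@torch._dynamo.dont_skip_tracing  # assumed exposure path for the new decorator
def helper(x):
    # Code that would normally hit a skip rule gets traced here.
    return x + 1

@torch.compile
def f(x):
    return helper(x) * 2

print(f(torch.ones(3)))
```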

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150586
Approved by: https://github.com/jansel, https://github.com/zou3519, https://github.com/anijain2305
2025-04-23 09:12:13 +00:00
62b5649b76 [Inductor] Test ND block pointers with dynamic shapes (#151646)
With ND tiling, we can get multi-dimensional block pointers with dynamic shapes. This is an important capability, but I couldn't find any CI tests for it. This PR adds a couple of tests checking that we get the expected block pointers with dynamic shapes, both for pointwise and reduction kernels.

Example kernels:
```
@triton.jit
def triton_poi_fused_div_0(in_ptr0, out_ptr0, ks0, ks1, ynumel, xnumel, YBLOCK : tl.constexpr, XBLOCK : tl.constexpr):
    yoffset = (tl.program_id(1) + tl.program_id(2) * tl.num_programs(1)) * YBLOCK
    yindex = yoffset + tl.arange(0, YBLOCK)[:, None]
    ymask = yindex < ynumel
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[None, :]
    xmask = xindex < xnumel
    x1 = xindex
    y0 = yindex
    tmp0 = tl.load(tl.make_block_ptr(in_ptr0, shape=[ks0, ks0], strides=[ks1, 1], block_shape=[YBLOCK, XBLOCK], order=[1, 0], offsets=[yoffset, xoffset]), boundary_check=[0, 1])
    tmp1 = (tmp0 / tmp0)
    tl.store(tl.make_block_ptr(out_ptr0, shape=[ks0, ks0], strides=[ks0, 1], block_shape=[YBLOCK, XBLOCK], order=[1, 0], offsets=[yoffset, xoffset]), tl.broadcast_to(tmp1, [YBLOCK, XBLOCK]).to(tl.float32), boundary_check=[0, 1])

@triton.jit
def triton_red_fused_prod_0(in_ptr0, out_ptr0, ks0, ks1, xnumel, r0_numel, r1_numel, XBLOCK : tl.constexpr, R0_BLOCK : tl.constexpr, R1_BLOCK : tl.constexpr):
    xnumel = 1
    rnumel = r0_numel * r1_numel
    RBLOCK: tl.constexpr = R0_BLOCK*R1_BLOCK
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:, None, None]
    xmask = tl.full([XBLOCK, R0_BLOCK, R1_BLOCK], True, tl.int1)
    r0_base = tl.arange(0, R0_BLOCK)[None, :, None]
    r1_base = tl.arange(0, R1_BLOCK)[None, None, :]
    rbase = r1_base + r0_base*r1_numel
    block_ptr0 = tl.make_block_ptr(in_ptr0, shape=[ks0, ks0], strides=[ks1, 1], block_shape=[R0_BLOCK, R1_BLOCK], order=[1, 0], offsets=[0, 0])
    _tmp2 = tl.full([XBLOCK, R0_BLOCK, R1_BLOCK], 1, tl.float32)
    for r0_offset in range(0, r0_numel, R0_BLOCK):
        r0_index = r0_offset + r0_base
        r0_mask = r0_index < r0_numel
        for r1_offset in range(0, r1_numel, R1_BLOCK):
            r1_index = r1_offset + r1_base
            r1_mask = r1_index < r1_numel
            roffset = r1_offset + r0_offset*r1_numel
            rindex = r1_index + r0_index*r1_numel
            r0_0 = r0_index
            r1_1 = r1_index
            tmp0 = tl.load(block_ptr0, boundary_check=[0, 1], padding_option='zero', eviction_policy='evict_first')[None, :, :]
            tmp1 = tl.broadcast_to(tmp0, [XBLOCK, R0_BLOCK, R1_BLOCK])
            tmp3 = _tmp2 * tmp1
            _tmp2 = tl.where(r0_mask & r1_mask, tmp3, _tmp2)
            block_ptr0 = tl.advance(block_ptr0, [0, R1_BLOCK])
        block_ptr0 = tl.advance(block_ptr0, [R0_BLOCK, (-1)*R1_BLOCK*(triton_helpers.div_floor_integer((-1) + ks0 + R1_BLOCK,  R1_BLOCK))])
    tmp4 = tl.reshape(_tmp2, [XBLOCK, RBLOCK])
    tmp2 = triton_helpers.prod(tmp4, 1)[:, None, None]
    tl.store(out_ptr0 + (tl.full([XBLOCK, 1, 1], 0, tl.int32)), tmp2, None)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151646
Approved by: https://github.com/jansel, https://github.com/eellison, https://github.com/shunting314
2025-04-23 06:20:04 +00:00
ee81fe40c1 Support regexes in dynamic sources allowlist (#151766)
As requested by Shuai. I also included an additional refactor to capture
changes to the whitelist over time, since previously, once it was set the
first time, it was impossible to override it when a new config was set.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151766
Approved by: https://github.com/pianpwk
2025-04-23 06:17:16 +00:00
7c97720d16 [dynamic shapes] rewrite expand with guard_or_false (#150236)
Rewrites the expand decomposition to avoid unbacked errors, assuming the general path where `input shape == output shape or input shape == 1`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150236
Approved by: https://github.com/laithsakka
2025-04-23 06:11:11 +00:00
097faa9217 [audio hash update] update the pinned audio hash (#151729)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151729
Approved by: https://github.com/pytorchbot, https://github.com/Skylion007
2025-04-23 06:04:32 +00:00
b247e5db33 [Inductor][CPU] Add GEMM templates for _weight_int4pack_mm_for_cpu with AMX (#150603)
**Summary**
It's part of the task to enable max-autotune with GEMM template for WoQ INT4 GEMM on CPU.

This PR adds AMX-based GEMM templates for `torch.ops.aten_weight_int4pack_mm_for_cpu`. It brings performance benefits on platforms where AMX is available.

**Validation results**
We have run GPT-J-6B and Llama-3-8B-Instruct on a 6th gen Xeon with 96 cores. Results show that the AMX-based microkernel outperforms AVX512-based one by >5x for prefill stage with 1024 input length.

**Test plan**
```
python test/inductor/test_cpu_select_algorithm.py -k test_int4_woq_mm_amx
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150603
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
2025-04-23 05:58:55 +00:00
54f736155b [dynamic shapes] guard_or_false for _reshape_view_helper, utils._infer_size for wildcard dims (#150127)
For reshape/view: removes fast paths for 0 elements, checking dimensions to skip. Modifies the loop accumulating input elements, to raise a UserError if we run out of dimensions, graph breaking for compile and erroring out for export.
For infer_size: assumes that if the user passes us an unbacked SymInt, it's probably not -1.

Will think about changes in https://docs.google.com/document/d/1WYx6EZwVDXtBnWyrzoecgGWdiK0V3XZKftfpWwQ5i3E/edit?tab=t.0#heading=h.22k54zym11qp in a later PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150127
Approved by: https://github.com/laithsakka
2025-04-23 05:42:30 +00:00
b37fa20771 [FlexAttention] Fix device test instantation (#151846)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151846
Approved by: https://github.com/Chillee, https://github.com/BoyuanFeng, https://github.com/mlazos
2025-04-23 05:37:25 +00:00
cc793e895e [StandaloneCompile] Autotune at compile time (#151922)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151922
Approved by: https://github.com/jamesjwu
ghstack dependencies: #151921
2025-04-23 04:32:06 +00:00
f9bdfe90ae [MegaCache] Return None on no compilation (#151921)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151921
Approved by: https://github.com/jamesjwu
2025-04-23 04:32:06 +00:00
78bbb468c6 Use /var/tmp instead of /tmp for torch cache directory on fbcode (#151466)
Summary:
We've been noticing that the cache directory has been getting cleaned up underneath us; let's use /var/tmp, which is supposed to be cleaned less frequently.

https://fb.workplace.com/groups/257735836456307/posts/883428143887070
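
For local debugging it can also help to pin the cache location explicitly rather than rely on the tmp-dir default; `TORCHINDUCTOR_CACHE_DIR` is the existing override knob (not introduced by this PR):

```python
import os
os.environ["TORCHINDUCTOR_CACHE_DIR"] = "/var/tmp/my_torchinductor_cache"

import torch

@torch.compile
def f(x):
    return x * 2

f(torch.randn(8))  # compiled artifacts land under the directory above
```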

Test Plan: unit tests

Reviewed By: masnesral

Differential Revision: D73008663

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151466
Approved by: https://github.com/masnesral
2025-04-23 03:30:51 +00:00
253059356f [Cutlass] Implement EVT example tensor creation (#150904)
This PR implements a translation layer from inductor IR to "example tensors", the expected arguments of the EVT tracer. These tensors basically store the name, shape, stride, and dtype of the tensor and allow an AST-based Python parser to generate the EVT C++.

Updates to example tensor creation.

Previously merged:
* https://github.com/pytorch/pytorch/pull/150903
* https://github.com/pytorch/pytorch/pull/150346
* https://github.com/pytorch/pytorch/pull/150345
* https://github.com/pytorch/pytorch/pull/150344

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150904
Approved by: https://github.com/eellison
2025-04-23 03:26:56 +00:00
cd021d048e Fix circular imports (#151939)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151939
Approved by: https://github.com/jamesjwu
2025-04-23 02:53:32 +00:00
13339ce086 [dynamic shapes] bound_sympy for size-oblivious min/max reasoning (#151242)
Differential Revision: D72978020

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151242
Approved by: https://github.com/bobrenjc93
2025-04-23 02:14:05 +00:00
74074fe8d8 [inductor] handle offset in ReinterpretView for alignment (#151859)
Fix https://github.com/pytorch/pytorch/issues/151589

It's interesting that the Q4_K dequantization example in the referenced GH issue does not crash even though Inductor passes Triton the wrong alignment information. I dug into this a bit. The main reason is that there are two things in Triton that decide the vectorization size:
1. alignment
2. the max number of contiguous elements a thread needs to process

Here is the Triton code that decides the vectorization size [link](c5fed8e1ca/third_party/nvidia/lib/TritonNVIDIAGPUToLLVM/LoadStoreOpToLLVM.cpp (L147-L157)), and here is the Triton code that considers contiguity for vectorization [link](c5fed8e1ca/lib/Analysis/AxisInfo.cpp (L1250-L1269))

When Inductor wrongly tells Triton that an unaligned tensor is aligned, Triton may not vectorize (or not fully vectorize) because of the second restriction.

Check this test:
```
    @parametrize(
        "size",
        (
            128,
            1024,
            1024 * 1024,
        ),
    )
    def test_slice_view_dtype(self, size):
        offset = 1

        def f(x):
            return x[2:].view(dtype=torch.float32) + 1

        x = torch.randn((size + offset) * 2, dtype=torch.bfloat16, device=self.device)
        self.common(f, (x,), reference_in_float=False)
```

Before the fix, Inductor would tell Triton that the output tensor of aten.view.dtype is aligned even though it's not. That tensor is then passed to the Triton kernel for the aten.add. Triton may make different vectorization decisions depending on the tensor size:
1. when size = 128, Triton picks ld.global.b32 to load data from global memory
2. when size = 1024, Triton uses ld.global.v2.b32
3. when size = 1024 * 1024, Triton uses ld.global.v4.b32

So whether wrong alignment metadata causes an issue depends on whether Triton picks the vectorized instructions. The latter depends on the Triton config (block size) decided by Inductor and on Triton's internal logic (how it assigns elements to each thread). We'd better make sure Inductor always generates correct metadata so that such hidden issues don't turn into crashes later.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151859
Approved by: https://github.com/jansel, https://github.com/eellison
ghstack dependencies: #151841
2025-04-23 01:50:49 +00:00
68a7501dab [Inductor][CPP] Fix Codegen Issue when Parallel Reduction under the vectorization (#151887)
**Summary**
Fixes [#151290](https://github.com/pytorch/pytorch/issues/151290) and [#151523](https://github.com/pytorch/pytorch/issues/151523), which are regressions introduced by [#144020](https://github.com/pytorch/pytorch/pull/144020). That PR enabled parallelization at the inner loop level.

However, a currently unsupported case arises when parallel reduction occurs under the vectorization loop level, specifically in patterns like:
```
for vec_loop_level:
    do_parallel_reduction
```
In such cases, a temporary buffer `tmp_acc_array` is allocated for tail scalar kernels, and another temporary buffer `tmp_acc_array` is also defined for parallel reduction. This results in a conflict due to overlapping temporary buffers. This PR disables the problematic case to avoid the conflict until proper support is implemented.

**Test Plan**
```
python test/inductor/test_flex_attention.py -k test_make_block_mask_cpu
python test/inductor/test_cpu_repro.py -k test_parallel_reduction_vectorization
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151887
Approved by: https://github.com/jansel
2025-04-23 00:41:14 +00:00
015b526a2a [MPSInductor] Warn-cast double as floats (#151963)
To support sqrt over dynamic shapes, i.e. make something like:
```python
torch.compile(dynamic=True)(lambda x: x * math.sqrt(x.size(0)))
```
compilable into
```metal
// Source node to ATen node mapping:
// Graph fragment:
//   %scalar_tensor_default : [num_users=1] = call_function[target=torch.ops.aten.scalar_tensor.default](args = (%arg0_1,), kwargs = {})
//   %convert_element_type_default : [num_users=1] = call_function[target=torch.ops.prims.convert_element_type.default](args = (%scalar_tensor_default, torch.float64), kwargs = {})
//   %sqrt_default : [num_users=1] = call_function[target=torch.ops.aten.sqrt.default](args = (%convert_element_type_default,), kwargs = {})
//   %convert_element_type_default_1 : [num_users=1] = call_function[target=torch.ops.prims.convert_element_type.default](args = (%sqrt_default, torch.float32), kwargs = {})
//   %mul_tensor : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%arg1_1, %convert_element_type_default_1), kwargs = {})
 kernel void generated_kernel(
     device float* out_ptr0,
     constant float* in_ptr0,
     constant long& ks0,
     uint xindex [[thread_position_in_grid]]
 ) {
     int x0 = xindex;
     auto tmp0 = in_ptr0[x0];
     auto tmp1 = ks0;
     auto tmp2 = static_cast<float>(tmp1);
     auto tmp3 = metal::sqrt(tmp2);
     auto tmp4 = static_cast<float>(tmp3);
     auto tmp5 = tmp0 * tmp4;
     out_ptr0[x0] = static_cast<float>(tmp5);
 }
```

TODO:
 - Figure out if this could be tweaked in fx-passes, but overhead is probably too high

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151963
Approved by: https://github.com/dcci
ghstack dependencies: #151869, #151871, #151872
2025-04-23 00:30:45 +00:00
49b7ffbb15 [MPS] Implement _print_Trunc_to_Int (#151964)
Fixes `test_device_assert_mps`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151964
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-04-23 00:30:00 +00:00
72f711e200 Revert "[inductor] Change minimum number of SMs to 60 to let Ada use Triton GEMM backend (#150888)"
This reverts commit 8d81806211bc3c0ee6c2ef235017bacf1d775a85.

Reverted https://github.com/pytorch/pytorch/pull/150888 on behalf of https://github.com/henrylhtsang due to Revert because this change isn't needed ([comment](https://github.com/pytorch/pytorch/pull/150888#issuecomment-2822768377))
2025-04-23 00:26:49 +00:00
334aab0dea Updates NCCLConfig with QOS variable (#151821)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151821
Approved by: https://github.com/kwen2501
2025-04-23 00:03:49 +00:00
aa61707a56 Fix extra heap allocation in Source constructor (#151800)
This was a sneaky one: the StringCordView default constructor allocates.

Differential Revision: [D73129448](https://our.internmc.facebook.com/intern/diff/D73129448/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151800
Approved by: https://github.com/malfet, https://github.com/cyyever, https://github.com/Skylion007
ghstack dependencies: #151682
2025-04-22 23:36:06 +00:00
cd576fdce5 [torch][fx] Add support for EXIR dialect overload ops in normalize_function (#143689)
Summary:
I had a minor annoyance when debugging graphs using EXIR dialect ops,
that all the function normalization went away. For functions with > 5 arguments,
some of which are just simple bools and ints, it's very helpful to have
the kwarg names attached.

Enhance `normalize_target` to handle EdgeOpOverload targets. To avoid
a circular dependency on Executorch from pytorch core, I just use a `hasattr`
check for "_op". This only happens if the target is not already a recognized
torch function.

Also, I noticed that the new `fx.Node.normalized_arguments` function
didn't forward an important kwarg to `normalize_target`, so I fixed that too.
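
A sketch of the kwarg-name recovery this enables, using a plain torch function (EXIR EdgeOpOverload targets now go through the same path via the `hasattr` check for "_op"):

```python
import torch
import torch.fx as fx

class M(torch.nn.Module):
    def forward(self, x):
        return torch.nn.functional.leaky_relu(x, 0.2)

gm = fx.symbolic_trace(M())
for node in gm.graph.nodes:
    if node.op == "call_function":
        # May return None if the signature can't be resolved without type hints.
        print(node.normalized_arguments(gm, normalize_to_only_use_kwargs=True))
```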

Test Plan: Tested with FxGraphDrawer and an fx Graph containing EXIR nodes.

Differential Revision: D67545909

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143689
Approved by: https://github.com/angelayi
2025-04-22 23:36:02 +00:00
4f8adde5ce Speed up OperatorEntry construction by avoiding updateDispatchTableFull_ (#151682)
The purpose of the updateDispatchTableFull_ call is, according to the comment, just to pick up fallback kernels if there are any. We can implement that directly more efficiently.

Differential Revision: [D73129447](https://our.internmc.facebook.com/intern/diff/D73129447/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151682
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/bdhirsh
2025-04-22 23:35:53 +00:00
c98340e268 [autodeps2] Replace third-party/pyyaml with third-party/pypi/pyyaml (#151668)
Summary: We should use the pypi version.

Test Plan: CI

Differential Revision: D73211869

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151668
Approved by: https://github.com/Skylion007
2025-04-22 23:27:13 +00:00
f4ac9a160d [fx] Filter stacktrace (#151029)
Filtering out the stacktrace so that the stacktrace on nodes when using fx.Tracer looks nicer. I just copied the filtering we have in [proxy_tensor.py](6720d23969/torch/fx/experimental/proxy_tensor.py (L1903-L1931)).

Previously the stacktrace looked like:
```
File "/data/users/angelayi/pytorch/moo.py", line 3964, in <module>
    run_tests()
  File "/data/users/angelayi/pytorch/torch/testing/_internal/common_utils.py", line 1342, in run_tests
    unittest.main(argv=argv)
  File "/home/angelayi/.conda/envs/pytorch-3.10/lib/python3.10/unittest/main.py", line 101, in __init__
    self.runTests()
  File "/home/angelayi/.conda/envs/pytorch-3.10/lib/python3.10/unittest/main.py", line 271, in runTests
    self.result = testRunner.run(self.test)
  File "/home/angelayi/.conda/envs/pytorch-3.10/lib/python3.10/unittest/runner.py", line 184, in run
    test(result)
  File "/home/angelayi/.conda/envs/pytorch-3.10/lib/python3.10/unittest/suite.py", line 84, in __call__
    return self.run(*args, **kwds)
  File "/home/angelayi/.conda/envs/pytorch-3.10/lib/python3.10/unittest/suite.py", line 122, in run
    test(result)
  File "/home/angelayi/.conda/envs/pytorch-3.10/lib/python3.10/unittest/suite.py", line 84, in __call__
    return self.run(*args, **kwds)
  File "/home/angelayi/.conda/envs/pytorch-3.10/lib/python3.10/unittest/suite.py", line 122, in run
    test(result)
  File "/home/angelayi/.conda/envs/pytorch-3.10/lib/python3.10/unittest/case.py", line 650, in __call__
    return self.run(*args, **kwds)
  File "/data/users/angelayi/pytorch/torch/testing/_internal/common_utils.py", line 3324, in run
    self._run_custom(
  File "/data/users/angelayi/pytorch/torch/testing/_internal/common_utils.py", line 3296, in _run_custom
    super_run(result=result)
  File "/home/angelayi/.conda/envs/pytorch-3.10/lib/python3.10/unittest/case.py", line 591, in run
    self._callTestMethod(testMethod)
  File "/home/angelayi/.conda/envs/pytorch-3.10/lib/python3.10/unittest/case.py", line 549, in _callTestMethod
    method()
  File "/data/users/angelayi/pytorch/torch/testing/_internal/common_utils.py", line 3156, in wrapper
    method(*args, **kwargs)
  File "/data/users/angelayi/pytorch/moo.py", line 1495, in test_stack_trace
    gm = torch.fx.GraphModule(m, tracer.trace(m))
  File "/data/users/angelayi/pytorch/torch/fx/_symbolic_trace.py", line 837, in trace
    (self.create_arg(fn(*args)),),
  File "/data/users/angelayi/pytorch/moo.py", line 1485, in forward
    x = x * 2
  File "/data/users/angelayi/pytorch/torch/fx/proxy.py", line 716, in impl
    return tracer.create_proxy("call_function", target, args, kwargs)
  File "/data/users/angelayi/pytorch/torch/fx/proxy.py", line 248, in create_proxy
    proxy.node.stack_trace = "".join(CapturedTraceback.extract().format())
```
Now it looks like:
```
File "/data/users/angelayi/pytorch/moo.py", line 1485, in forward
    x = x * 2
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151029
Approved by: https://github.com/jfix71, https://github.com/zou3519, https://github.com/jingsh
2025-04-22 22:50:36 +00:00
a7ccd96bbf logging start of torch elastic workers. (#150849)
Summary:
We would like to log start of the workers. It will help with complete logging.

Test Plan:
unit tests

https://www.internalfb.com/intern/testinfra/testrun/6473924724652056

e2e tests
https://www.internalfb.com/mlhub/pipelines/runs/mast/f712311762-27449483648-TrainingApplication_V403K?job_attempt=0&version=0&tab=execution_details&env=PRODUCTION

Reviewed By: tnykiel

Differential Revision: D72297314

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150849
Approved by: https://github.com/d4l3k, https://github.com/kiukchung
2025-04-22 22:35:06 +00:00
6a1b820255 [export] Enable symint inputs for AdditionalInputs and ShapesCollection (#151842)
With `AdditionalInputs`, the behavior is the same as with tensors:
```python
class M(torch.nn.Module):
    def forward(self, x, y):
        return x + y

additional_inputs = torch.export.AdditionalInputs()
additional_inputs.add((5, 5))
additional_inputs.add((3, 5))
additional_inputs.add((5, 4))
ep = torch.export.export(
    M(), (6, 7), dynamic_shapes=additional_inputs, strict=False
)
```

With `ShapesCollection`, we now need to wrap integer inputs as `_IntWrapper` so that we can have a unique identifier for each integer input.
```python
class M(torch.nn.Module):
    def forward(self, x, y):
        return x + y

from torch.export.dynamic_shapes import _IntWrapper

args = (_IntWrapper(5), _IntWrapper(5))
# Or we can do `args = pytree.tree_map_only(int, lambda a: _IntWrapper(a), orig_args)`
shapes_collection = torch.export.ShapesCollection()
shapes_collection[args[0]] = Dim.DYNAMIC
shapes_collection[args[1]] = Dim.DYNAMIC
ep = torch.export.export(
    M(), args, dynamic_shapes=shapes_collection, strict=False
)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151842
Approved by: https://github.com/pianpwk
2025-04-22 22:29:18 +00:00
43de9b75c3 Remove mention of magma-cuda in readme.md, refactor magma_conda install (#147476)
Related to: https://github.com/pytorch/pytorch/issues/138506 we migrated magma-cuda build from anaconda to aws
Last version of magma-cuda published was 12.6 https://anaconda.org/pytorch/magma-cuda126

Here is the PR that moved from anaconda to tarball: https://github.com/pytorch/pytorch/pull/140417

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147476
Approved by: https://github.com/albanD
2025-04-22 22:08:49 +00:00
c0b70f94e2 [Testing] Enable test_mutations_loop_fusion_mps (#151872)
By testing it against float32 rather than double dtype

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151872
Approved by: https://github.com/Skylion007, https://github.com/dcci, https://github.com/jansel
ghstack dependencies: #151869, #151871
2025-04-22 22:00:16 +00:00
2f851ac8f8 [MPSInductor] Implement atomic_add store mode (#151871)
Which fixes `GPUTests.test_index_put2_mps`, `GPUTests. test__unsafe_masked_index_put_accumulate_mps` and dozen of scatter/gather tests that relied on atomic_add store mode

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151871
Approved by: https://github.com/jansel, https://github.com/dcci
ghstack dependencies: #151869
2025-04-22 22:00:16 +00:00
3aecf2dc52 [MPS] Extend index_put to half precision floats (#151869)
By reusing `c10/metal/atomic.h`
This also fixes `GPUTests.test_index_put_fallback[12]_mps` that is unrolled by inductor, so no need for dedicated atomic_add support

TODOs:
 - Get rid of indexing kernel and compute it directly when kernel is run
 - Simulate atomic_add for int64 types as series of int32 atomic-add-and-fetch
 - Setup tolerances correctly to pass float16/bfloat16 tests (as CPU always takes sequential strategy)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151869
Approved by: https://github.com/Skylion007, https://github.com/dcci
2025-04-22 22:00:08 +00:00
b8f4dc5a9f [ROCm] opportunistic fastatomics for ReduceAdd operations for MI300 GPUs (#146264)
In this approach, we catch any lane within a wave that is doing fastatomics to the same destination address and compute the sum on the CU. This leads to a 3x improvement in scatter_add performance and a 2x improvement in index_select.

scatter_add performance on MI300x:
dtype|Baseline (before optimizations)|opportunistic fastatomics
-------|----------------------------------|----------------------------------
f32|1.389425039|0.430447996
fp16|2.195472956|0.779729486
bf16|2.194051027|0.784599513

Using the following reproducer
```
import torch
import triton

def main():
    dtype = torch.float32
    dim = 1305301
    a = torch.rand(100, device="cuda", dtype=dtype)
    index = torch.randint(0, 100, (dim,), device="cuda")
    src = torch.rand(dim, device="cuda", dtype=dtype)

    print("=" * 20)
    print(
        triton.testing.do_bench(
            lambda: a.scatter_add(0, index, src),
            return_mode="median",
        )
    )
    print("=" * 20)

if __name__ == "__main__":
    main()
```

co-authored by: @amd-hhashemi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146264
Approved by: https://github.com/jeffdaily, https://github.com/mxz297

Co-authored-by: Hashem Hashemi <hashem.hashemi@amd.com>
2025-04-22 21:55:40 +00:00
e05ac9b794 Use folder tagged docker images for binary builds (#151706)
Should be the last part of https://github.com/pytorch/pytorch/pull/150558, except maybe for the s390x stuff, where I'm still not sure what's going on

For binary builds, do what we do in CI: tag each image with a hash of the .ci/docker folder to ensure a docker image built from that commit gets used. Previously it would use imagename:arch-main, which could be a version of the image based on an older commit

After this, changing a docker image and then tagging with ciflow/binaries on the same PR should use the new docker images

Release and main builds should still pull from docker io

Cons:
* if someone rebuilds the image from main or a PR where the hash is the same (e.g., the folder is unchanged but the docker build is retriggered for some reason), the release would use that image instead of one built on the release branch
* spin wait for docker build to finish
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151706
Approved by: https://github.com/atalman
2025-04-22 21:50:10 +00:00
017a6bd593 add min/max_seqlen to non_differentiable (#151750)
Fixes #148988

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151750
Approved by: https://github.com/soulitzer
2025-04-22 21:46:02 +00:00
835413baed Revert "[Optimus][Observability] Improve tlparse logging (#151635)"
This reverts commit 06a3c3c8cdb2424d42d7926a49a18ee6852a40cb.

Reverted https://github.com/pytorch/pytorch/pull/151635 on behalf of https://github.com/clee2000 due to broke dynamo/test_structured_trace.py::StructuredTraceTest::test_ddp_graphs [GH job link](https://github.com/pytorch/pytorch/actions/runs/14600342064/job/40970324075) [HUD commit link](06a3c3c8cd), test did fail on PR but dr ci says it matches an existing failure, which it does, but also this PR breaks the test too ([comment](https://github.com/pytorch/pytorch/pull/151635#issuecomment-2822538113))
2025-04-22 21:39:23 +00:00
bc6c0bc344 Revert "Do not generate long log messaged for suppressed data dependent errors. (#151023)"
This reverts commit dfdf731579d7472a009f8edf35994b8701e79065.

Reverted https://github.com/pytorch/pytorch/pull/151023 on behalf of https://github.com/laithsakka due to breaking other PRs ([comment](https://github.com/pytorch/pytorch/pull/151023#issuecomment-2822483635))
2025-04-22 21:08:30 +00:00
459c62ee1d Revert "Do not log exception when recording is disabled or already recording (#151038)"
This reverts commit 73d95893a2b844ba8ee523e0e3915adf54017411.

Reverted https://github.com/pytorch/pytorch/pull/151038 on behalf of https://github.com/laithsakka due to breaking other PRs ([comment](https://github.com/pytorch/pytorch/pull/151023#issuecomment-2822483635))
2025-04-22 21:08:30 +00:00
aaf71a481b Revert "Log information about suppressed data dependent errors (#151041)"
This reverts commit ccd00359da3423ff7bae8ee682df10590fc844ce.

Reverted https://github.com/pytorch/pytorch/pull/151041 on behalf of https://github.com/laithsakka due to breaking other PRs ([comment](https://github.com/pytorch/pytorch/pull/151023#issuecomment-2822483635))
2025-04-22 21:08:30 +00:00
2f74cffab2 Remove reinterpret_casts with undefined behavior from stable/library.h (#151595)
There is a list of valid uses of `reinterpret_cast` (see https://en.cppreference.com/w/cpp/language/reinterpret_cast), and the use here was not on the list, hence undefined behavior. Implement what we meant using memcpy, which is well-defined.

Differential Revision: [D73200791](https://our.internmc.facebook.com/intern/diff/D73200791/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151595
Approved by: https://github.com/janeyx99
2025-04-22 20:24:47 +00:00
3380a46b44 Fix DTensorTestBase to barrier with device ids (#150896)
Try to get rid of the annoying warnings emitted when running the unit tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150896
Approved by: https://github.com/fegin
2025-04-22 20:22:55 +00:00
a48ccf02f9 [Inductor] move alignment tests to a separate file (#151841)
This is a pure code movement. test_torchinductor.py is already 15K lines of code. Move the alignment-related tests I added recently to a separate file; I need to add more tests of this kind.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151841
Approved by: https://github.com/jansel, https://github.com/eellison
2025-04-22 20:18:58 +00:00
596296fb0b [standalone_compile] Dynamic shape handling (#151788)
standalone_compile needs to get dynamic shape information from
somewhere. We add a new `dynamic_shapes` argument with three options:

1. from the passed-in graph (dynamic="from_graph"). This is the default.
2. from the example inputs, thereby specializing on them. (dynamic="from_example_inputs")
3. from the current tracing context (dynamic="from_tracing_context")

Options 1 and 3 are not exactly the same. Option 2 can also be used for more advanced
things, e.g., specializing on one input but not the other.

Most of this PR is tests.

Test Plan:
- a lot of new tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151788
Approved by: https://github.com/oulgen
2025-04-22 20:17:24 +00:00
7e4b89ac6c fix spammy library deinit errors when user passes an invalid TORCH_LOGS argument (#151678)
fixes https://github.com/pytorch/pytorch/issues/151055. Thanks @desertfire for the patch that fixed this.

I was a bit careful about the test - I wanted to make sure the test accurately ensures that we don't regress and our error message is not spammy when users enter an invalid `TORCH_LOGS=....` argument. But I tried to avoid using expecttests, since people occasionally add new logging artifacts and I didn't want to add too much churn by forcing this to fail CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151678
Approved by: https://github.com/desertfire, https://github.com/zou3519
2025-04-22 20:13:52 +00:00
0bb9b89fb7 Revert "[compile][compile time traces] Add more dynamo traces (#151357)"
This reverts commit 607443b16be705788ab06e9a31e4569e0f1516c3.

Reverted https://github.com/pytorch/pytorch/pull/151357 on behalf of https://github.com/wdvr due to stack in a weird state - reverting for now ([comment](https://github.com/pytorch/pytorch/pull/151357#issuecomment-2822369232))
2025-04-22 20:12:44 +00:00
d0d4e992f1 [associative_scan] Fixes for assoc_scan testcases (#149988)
This PR fixes some issues with the testcases of `associative_scan`, in particular the problem where the compile_mode is inadvertently always set to `none`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149988
Approved by: https://github.com/ydwu4
2025-04-22 20:09:12 +00:00
8ca7953d51 [cutlass backend] delay construction of cutlass presets to when called (#151875)
In hindsight, always constructing the dict is a bit silly. We should only construct it when we need it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151875
Approved by: https://github.com/yangw-dev
2025-04-22 20:03:10 +00:00
6cd1741985 [ONNX] Update decomposition logic to loop over onnx registry (#151826)
Fixes #150367

This PR builds the decomposition table from the ONNX registry, which includes registered ops beyond just ATen and prim. This helps keep custom ops that are specified in the custom_translation table from being decomposed during ONNX export.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151826
Approved by: https://github.com/justinchuby
2025-04-22 19:40:52 +00:00
69ee6a9280 [Sana][HybridCache] Fix bug in detect_attr_assignment (#151824)
Summary: tree_flatten_with_map internally calls the unflatten function with a user-supplied function, but that function was not returning anything, causing the leaves to be None. This is wrong when the constructor is sensitive to this behaviour.

Test Plan: CI

Differential Revision: D73388529

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151824
Approved by: https://github.com/bdhirsh
2025-04-22 19:39:50 +00:00
337caacd4c Use more efficient mask to index computation (#151372)
This change addresses the third time/mem "spike" observed in

https://github.com/pytorch/pytorch/issues/151351

The change seems to perform better (time/mem) for both very sparse and very dense cases. It runs faster and uses less memory, as observed on both CPU and GPU. It even avoids OOM for larger cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151372
Approved by: https://github.com/eqy
2025-04-22 19:31:12 +00:00
fbd29527d8 [MPS] Move ops modifiers to testing utils so other tests can reuse (#151781)
Test collection check:
```
python -m pytest test/test_mps.py --collect-only
```
Before:
```
6390 tests collected in 8.34s
```

After:
```
6390 tests collected in 7.71s
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151781
Approved by: https://github.com/malfet
2025-04-22 19:19:52 +00:00
982062dfc4 Cache the value of torch_key in subproc (#151057)
No need to recalculate torch_key in subprocs; let's pass it from the main process.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151057
Approved by: https://github.com/jamesjwu, https://github.com/masnesral
2025-04-22 18:54:06 +00:00
fa0f13b90b Fix doc requirements install error (#151787)
Fixes #151786

Change the version in the docs requirements to be consistent with the version in the [CI version file](https://github.com/pytorch/pytorch/blob/main/.ci/docker/requirements-docs.txt), which changed in #149331

### Test Result

![image](https://github.com/user-attachments/assets/f8646c03-116f-4f1c-b017-11b70995626b)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151787
Approved by: https://github.com/malfet
2025-04-22 18:33:44 +00:00
4bf09562e4 [EZ/Profiler] Update Submodule (#151843)
Summary: Update to d82680bbd4

Test Plan: CI

Differential Revision: D73397323

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151843
Approved by: https://github.com/Skylion007, https://github.com/aaronenyeshi
2025-04-22 18:19:43 +00:00
834a017fe3 Optimize register_full_backward_hook description when all input no grad (#151785)
Fixes #100528

## Test Result

### Before

![image](https://github.com/user-attachments/assets/5dd2e1d3-3bb1-49d0-84bf-8a7a6b18fa4b)

### After

![image](https://github.com/user-attachments/assets/2e16d17b-1586-40d8-b0ef-35559fc064f4)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151785
Approved by: https://github.com/soulitzer
2025-04-22 17:57:31 +00:00
2c27597d6a Infra for handling builtin ops (min, max, math.pow) (#151348)
Reapply of https://github.com/pytorch/pytorch/pull/150003

Differential Revision: [D73050801](https://our.internmc.facebook.com/intern/diff/D73050801/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151348
Approved by: https://github.com/zhxchen17
ghstack dependencies: #151347
2025-04-22 17:20:09 +00:00
264e8fb151 More fix for aot_export_module name collision during unlifting (#151684)
Summary: Also check the module's named buffers and parameters when resolving name collision

Test Plan:
```
buck2 run mode/dev-nosan caffe2/test/inductor:test_aot_inductor -- -r aoti_constant_tensor_name_collision
```

Differential Revision: D73264885

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151684
Approved by: https://github.com/angelayi
2025-04-22 16:59:33 +00:00
06a3c3c8cd [Optimus][Observability] Improve tlparse logging (#151635)
Summary: We improve tlparse logging for Optimus graph transformations to enable easier debugging

Test Plan:
```
TORCH_TRACE=~/my_trace_log_dir CUDA_VISIBLE_DEVICES=5 buck2 run mode/opt //aps_models/ads/ecosystem/tooling/tools/efficient_module_suite/pyper_models:pyper_model_perf_benchmark -- --flow_id 720055919 --shrink_model --mfu_profile_module "impl.shared_arch.dense_sparse_interaction" --use_synthetic_data
```

Differential Revision: D73229681

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151635
Approved by: https://github.com/Yuzhen11
2025-04-22 16:56:08 +00:00
5fc1eb85fc Add OIDC permissions to bazel workflow (#151456)
Update workflow to use OIDC authentication to access AWS resources rather than assuming the runner's default role. This is part of the multicloud effort to prepare jobs to support being run in non-AWS clouds.

The JWT ID token requires `id-token: write` in order to create the token for the job. See: https://docs.github.com/en/actions/security-for-github-actions/security-hardening-your-deployments/configuring-openid-connect-in-cloud-providers#adding-permissions-settings

Ref: pytorch-fdn/multicloud-ci-infra#3

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151456
Approved by: https://github.com/malfet
2025-04-22 16:54:14 +00:00
5d316ce0d0 Add device check for inputs (#151828)
Summary: Generate device checks for inputs in AOTI. Enable with AOTI_RUNTIME_CHECK_INPUTS=1

Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:test_aot_inductor -- -r test_runtime_checks_device_type_failed
```

Differential Revision: D73382824

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151828
Approved by: https://github.com/angelayi
2025-04-22 16:36:27 +00:00
3804aed32e Revert "[Inductor] Add Additional Configs for persistent+TMA version of Triton mm and addmm (#150587)"
This reverts commit 99aeee2c5f07f7fe6ec3f34aacb7db71569a60c5.

Reverted https://github.com/pytorch/pytorch/pull/150587 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally (see D73410693). To validate your fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/150587#issuecomment-2821828926))
2025-04-22 16:15:55 +00:00
4504910843 Revert "[ez] Make relaxed constraint error message more user friendly (#151407)"
This reverts commit e0f05229e9ff84aa6138df2bd51f5044bc743afb.

Reverted https://github.com/pytorch/pytorch/pull/151407 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally (see D73198095). To validate your fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts. ([comment](https://github.com/pytorch/pytorch/pull/151407#issuecomment-2821819654))
2025-04-22 16:12:42 +00:00
f072bf27a7 Revert "faster gather implementation (#151490)"
This reverts commit 541f8cd34cbccfcaf04a377f747390f83658d6ec.

Reverted https://github.com/pytorch/pytorch/pull/151490 on behalf of https://github.com/malfet due to Looks like it breaks demucs accuracy, though may be bogus, but let's try to revert, see c729f7dbee/3 ([comment](https://github.com/pytorch/pytorch/pull/151490#issuecomment-2821803788))
2025-04-22 16:09:14 +00:00
ed0d2ebaa0 Revert "Non-deterministic alert in histc_cuda for floating types only (#151701)"
This reverts commit b7a7741411585817daa81780b078fd15816f2d2d.

Reverted https://github.com/pytorch/pytorch/pull/151701 on behalf of https://github.com/ZainRizvi due to Sorry but this is causing inductor tests to fail. See here for more info: test_torch.py::TestTorchDeviceTypeCUDA::test_nondeterministic_alert_histc_cuda_float32 [GH job link](https://github.com/pytorch/pytorch/actions/runs/14586002763/job/40913547718) [HUD commit link](b7a7741411) ([comment](https://github.com/pytorch/pytorch/pull/151701#issuecomment-2821800837))
2025-04-22 16:07:25 +00:00
c729f7dbee [provenance_tracking][reland] Fix UT error and re-land ExternKernel support (#151709)
Summary:
ATT.

Reverted previous diff: D72572050

Test Plan:
```
 TORCH_LOGS="+inductor, output_code" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:provenance_tracing -- -r test_triton_kernel_to_post_grad_tracing_extern_kernel
```

Differential Revision: D73281217

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151709
Approved by: https://github.com/jingsh
2025-04-22 15:44:56 +00:00
d778c92e16 [Metal][BE] Move atomic ops to c10/metal/atomic.h (#151868)
To be reused from the indexing and MPSInductor implementations of atomic_add stores.
Added a wrapper for `metal::atomic<int>` (to be used by a follow-up PR).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151868
Approved by: https://github.com/Skylion007
2025-04-22 14:11:29 +00:00
159e2f96e3 [dynamo][ci] Fix recently broken test (#151877)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151877
Approved by: https://github.com/masnesral, https://github.com/jansel
2025-04-22 06:42:03 +00:00
3aeeb77a3a [Dynamo][Easy] Remove unreachable code (#151739)
This line is unreachable:

f6c1cf04b5/torch/_dynamo/output_graph.py (L275)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151739
Approved by: https://github.com/Skylion007
2025-04-22 06:27:00 +00:00
ccd00359da Log information about suppressed data dependent errors (#151041)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151041
Approved by: https://github.com/bobrenjc93
ghstack dependencies: #151023, #151038
2025-04-22 06:07:57 +00:00
73d95893a2 Do not log exception when recording is disabled or already recording (#151038)
I am not sure why we log all exceptions here and re-raise them, but at least when recording is disabled this should be
transparent; in particular, logging data-dependent errors could be spammy.

before:
<img width="995" alt="Screenshot 2025-04-10 at 12 47 31 PM" src="https://github.com/user-attachments/assets/f90d4557-d958-4558-a917-0d687366cad1" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151038
Approved by: https://github.com/bobrenjc93
ghstack dependencies: #151023
2025-04-22 06:07:57 +00:00
dfdf731579 Do not generate long log messaged for suppressed data dependent errors. (#151023)
TORCH_LOGS="all" python test/test_dynamic_shapes.py -k test_guard_or_true

 before:
<img width="1065" alt="Screenshot 2025-04-10 at 9 55 27 AM" src="https://github.com/user-attachments/assets/3ee20de0-2902-4eb1-8ab0-80f1b974fb78" />

after:
<img width="1124" alt="Screenshot 2025-04-10 at 9 54 35 AM" src="https://github.com/user-attachments/assets/4e7e1f0c-856c-417f-8763-bfe183e2450d" />

Note: we actually do not expect to see a log at all; this is an orthogonal issue in recording, where it logs each error seen
even when recording is not enabled. I will follow up with a PR for that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151023
Approved by: https://github.com/bobrenjc93
2025-04-22 06:07:57 +00:00
a09a3f4c30 [Hierarchical compile] Ensure output nodes are sorted last (#151295)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151295
Approved by: https://github.com/anijain2305
ghstack dependencies: #151293, #151294
2025-04-22 05:13:07 +00:00
283884b224 [Hierarchical Compile] Handle autocast ctx manager (#151294)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151294
Approved by: https://github.com/anijain2305
ghstack dependencies: #151293
2025-04-22 05:13:07 +00:00
4a643af992 [Hierarchical Compile] Fix small bug (#151293)
This technically would never be exposed because we never check that a node is an ancestor of itself, but it is good for it to be correct.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151293
Approved by: https://github.com/anijain2305
2025-04-22 05:13:07 +00:00
e76c0b159a Revert "[dynamic shapes] guard_or_false for _reshape_view_helper, utils._infer_size for wildcard dims (#150127)"
This reverts commit a02eae8142ddd8fbf068a3e17fc0dd276d92fc78.

Reverted https://github.com/pytorch/pytorch/pull/150127 on behalf of https://github.com/malfet due to Caused TestDynamoTimed.test_dynamo_timed to fail on macOS, see https://github.com/pytorch/pytorch/actions/runs/14584536979/job/40908019050 ([comment](https://github.com/pytorch/pytorch/pull/150127#issuecomment-2820081721))
2025-04-22 05:05:50 +00:00
0ff302e8e0 Revert "reroute index to fast implementation for indexing on 0th dimension (#151753)"
This reverts commit 4d78e19365c4e2189693c7a81b665d4ec2d2cf53.

Reverted https://github.com/pytorch/pytorch/pull/151753 on behalf of https://github.com/malfet due to Looks like it breaks bunch of distributed tests with DSA, see 4d78e19365 ([comment](https://github.com/pytorch/pytorch/pull/151753#issuecomment-2820078298))
2025-04-22 05:03:03 +00:00
95abc0f515 [c10d][fr] Fix another bug when we should continue when the op list is empty (#151798)
Differential Revision: D73375318

We shouldn't check the op list when it is empty. And later, when it is empty and we pop it out from the queue, we will check for collective matching. Added a unit test for this case, which also covers the case fixed in https://github.com/pytorch/pytorch/pull/151683.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151798
Approved by: https://github.com/d4l3k, https://github.com/wconstab, https://github.com/fegin
2025-04-22 04:43:31 +00:00
6f327128a9 [MKLDNN] Check that strides are positive (#151848)
For pooling ops. Prevents division-by-zero when argument is wrong

Fixes https://github.com/pytorch/pytorch/issues/149274

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151848
Approved by: https://github.com/atalman
2025-04-22 04:25:47 +00:00
29811f68d2 [Inductor][FlexAttention] fix vars_and_sizes divisor error (#151634)
Triton codegen currently [sorts vars by divisor](ae6f6b8efb/torch/_inductor/codegen/simd.py (L233-L237)). When there are two vars with the same divisor, the order is undecided.

```python
nodes.sort(
   key=lambda x: V.graph.sizevars.size_hint(
       x.divisor, fallback=config.unbacked_symint_fallback
   )
)
```

The test case leads to the following nodes:
```
(Pdb) nodes[0]
IterationRangesEntry(x1, ((s37 + 127)//128), 2, (xindex//ps0), {x0: ((s37 + 127)//128), x1: 2, x2: ((s12 + 127)//128), x4: 2*(((s12 + 127)//128))*(((s37 + 127)//128)), x5: 0, x6: 2, x7: (((s12 + 127)//128))*(((s37 + 127)//128))})

(Pdb) nodes[1]
IterationRangesEntry(x0, 1, ((s37 + 127)//128), ModularIndexing(xindex, 1, ps0), {x0: ((s37 + 127)//128), x1: 2, x2: ((s12 + 127)//128), x4: 2*(((s12 + 127)//128))*(((s37 + 127)//128)), x5: 0, x6: 2, x7: (((s12 + 127)//128))*(((s37 + 127)//128))})

(Pdb) nodes[2]
IterationRangesEntry(x2, 2*(((s37 + 127)//128)), ((s12 + 127)//128), (xindex//(2*(((s37 + 127)//128)))), {x0: ((s37 + 127)//128), x1: 2, x2: ((s12 + 127)//128), x4: 2*(((s12 + 127)//128))*(((s37 + 127)//128)), x5: 0, x6: 2, x7: (((s12 + 127)//128))*(((s37 + 127)//128))})

(Pdb) V.graph.sizevars.statically_known_equals(nodes[0].length, 2)
True
(Pdb) V.graph.sizevars.statically_known_equals(nodes[1].length, 1)
True
(Pdb) V.graph.sizevars.statically_known_equals(nodes[2].length, 1)
True

(Pdb) V.graph.sizevars.statically_known_equals(nodes[0].divisor, 1)
True
(Pdb) V.graph.sizevars.statically_known_equals(nodes[1].divisor, 1)
True
(Pdb) V.graph.sizevars.statically_known_equals(nodes[2].divisor, 2)
True
```

Since x1 and x0 both have divisor 1, the relative order is random across runs.
In some runs, we have order [x1, x0, x2] with divisors [1, 1, 2] and lengths [2, 1, 1]. After x1, we have [divisor = divisor * node.length](ae6f6b8efb/torch/_inductor/codegen/simd.py (L246)) = 1 * 2 = 2. Then, when processing x0, we have node.divisor=1, divisor=2, and [FloorDiv(node.divisor, divisor)](ae6f6b8efb/torch/_inductor/codegen/simd.py (L251)) = 0, which indicates an iteration length of 0 and leads to errors later.

The fix is to sort by both divisor and length_is_one. So for two nodes with the same divisor, we process the node with length=1 first.
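A self-contained toy sketch of that sort-key idea, using the concrete divisors/lengths from the debug session above (names and the helper tuple are illustrative, not the actual inductor code):
```python
from collections import namedtuple

# toy stand-ins for the IterationRangesEntry fields discussed above
Node = namedtuple("Node", ["name", "divisor", "length"])
nodes = [Node("x1", 1, 2), Node("x0", 1, 1), Node("x2", 2, 1)]

# sort by (divisor, length != 1): among nodes with equal divisors, length-1 nodes come first
nodes.sort(key=lambda n: (n.divisor, n.length != 1))
print([n.name for n in nodes])  # ['x0', 'x1', 'x2']
```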

Fixes #149789

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151634
Approved by: https://github.com/Skylion007, https://github.com/drisspg
2025-04-22 04:24:56 +00:00
529f698ad4 [logging] Put "everything" WaitCounters in dynamo_timed (#151757)
Summary: The main motivation is to capture the cudagraphs overhead in a WaitCounter. We'll combine that with Triton autotuning, and therefore rename to "compile_runtime_overheads". Since we have a couple WaitCounters where we want to capture all runtime and compile overheads, let's put the accounting in dynamo_timed so we'll automatically capture any toplevel timed regions that get added in the future. Also, dynamo_timed already has to figure out if we're timing a runtime vs. compile-time event, so we can reuse some of that logic.

Test Plan:
Ran an internal model with `TORCHINDUCTOR_BENCHMARK_FUSION=1` (to get benchmarking at compile time in addition to runtime).

Overall compile time from various sources matches up:
* tlparse: https://fburl.com/9fgsstkr. Eyeballing, total time should be 32 ranks x 2175 = ~69.6k s
* ods: https://fburl.com/canvas/r4clhnb7. Right on.
* dynamo_compile: https://fburl.com/scuba/dynamo_compile/ax71aqox. Right on.
* pt2_compile_events: https://fburl.com/scuba/pt2_compile_events/shcjd9ql. Right on.

And the runtime overhead:
* ods: https://fburl.com/canvas/nvgjb282
* dynamo_compile: https://fburl.com/scuba/dynamo_compile/f2dtv0qh

If we compare that to a run of the same model without the changes in this stack, results can mismatch by a lot:
* tlparse: https://fburl.com/cchxwd1s. Eyeballing, total time should be 32 ranks x 2300s = ~73.5k s
* ods: https://fburl.com/canvas/x1i3wvf4. It's kinda close
* dynamo_compile: https://fburl.com/scuba/dynamo_compile/l7sgxdxd. Waaay too high.
* pt2_compile_events: https://fburl.com/scuba/pt2_compile_events/jb4s9z1u. This is the only one that's actually correct.

The discrepancy is even worse if we focus on the runtime events:
* ods: https://fburl.com/canvas/a4o9f7ou
* dynamo_compile: https://fburl.com/scuba/dynamo_compile/95izaes1

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151757
Approved by: https://github.com/ppanchalia
ghstack dependencies: #151749
2025-04-22 03:29:13 +00:00
edba20b853 [logging] Fix duration logging for dynamo_compile (#151749)
Summary: There are a few issues I'm solving:
1. It's too hard to measure total pt2 overhead using the dynamo_compile table because users need to know the columns representing all the top-level events (dynamo_cumulative_compile_time_us, etc.). Instead, let's populate the existing duration_us field for all top-level events. The complication is that runtime events in particular (Triton autotuning, cudagraphify) can be collapsed into a single row, with gaps in between, so we can't simply use `end_time - start_time` in all cases. Instead, we'll sum durations for all outer events when updating the compile-time or runtime metrics context. Introduce a 'depth' counter in TLS to track the nesting of CompilationMetrics events.
2. The existing implementation relies on callers of dynamo_timed to specify whether the event is a runtime or compile-time event. That doesn't work because some methods can be called in both situations, e.g., `CachingAutotuner.benchmark_all_configs`. For example, `TORCHINDUCTOR_BENCHMARK_FUSION=1` enables benchmarking during compile time. Instead, we can figure out automatically whether we're measuring a compile-time or runtime event and log accordingly.
3. If `log_compilation_events` were to throw an exception, we'd fail to clear the aggregated counters for runtime logs and they could be attributed to the wrong compile ID. I didn't actually find evidence of this in practice, but I added exception handling for extra safety.

Test Plan:
Ran internal models and compared dynamo_compile to pt2_compile_events:
`TORCHINDUCTOR_BENCHMARK_FUSION=0`
* tlparse: https://fburl.com/itciwnxc
* dynamo_compile: https://fburl.com/scuba/dynamo_compile/yvkif5vb
* pt2_compile_events: https://fburl.com/scuba/pt2_compile_events/segijet7

`TORCHINDUCTOR_BENCHMARK_FUSION=1`
* tlparse: https://fburl.com/jgurcvkw
* dynamo_compile: https://fburl.com/scuba/dynamo_compile/uum91ceb
* pt2_compile_events: https://fburl.com/scuba/pt2_compile_events/x4xnisez

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151749
Approved by: https://github.com/Skylion007
2025-04-22 03:29:13 +00:00
b7a7741411 Non-deterministic alert in histc_cuda for floating types only (#151701)
The note about atomic add only applies for floating point. The
implementation is deterministic for integer data types.

fixes: #151610

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151701
Approved by: https://github.com/ngimel, https://github.com/Skylion007
2025-04-22 03:24:36 +00:00
14e3ffb1ff Deprecate host allocator legacy APIs (#151437)
# Motivation
This PR aims to deprecate the legacy host allocator APIs and recommend that users use the unified `getHostAllocator(device_type)` API, for example:
```cpp
at::getHostAllocator(device_type)->allocate(...);
at::getHostAllocator(device_type)->empty_cache();
at::getHostAllocator(device_type)->record_event(...);
at::getHostAllocator(device_type)->get_stats();
at::getHostAllocator(device_type)->reset_accumulated_stats();
at::getHostAllocator(device_type)->reset_peak_stats();
```

# Additional Context
TODO:
- [ ] Move is_pinned from `AcceleratorHookInterface` to `HostAllocator`
- [ ] Deprecate `getPinnedMemoryAllocator` inside `AcceleratorHookInterface` and recommend using `getHostAllocator` instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151437
Approved by: https://github.com/EikanWang, https://github.com/albanD
ghstack dependencies: #151403, #151431
2025-04-22 03:13:24 +00:00
a4fdae5c84 Lift guard checking logic to AOTAutogradCache (#151563)
This somewhat complicated PR does a few things:
- It separates out a lot of the guard checking logic into its own class, GuardedCache[T]
- It adds a new `check_guard_hit` lambda to FXGraphCache._lookup_graph, which allows callers to define their own guard checking logic
- It then uses these two combined parts to lift guard checking to AOTAutogradCache. This means that AOTAutogradCache stores its own guard expressions and evaluates them.
- FXGraphCache's guard checking logic is completely unchanged, just refactored. As part of the work, I'm able to extend a bit of the logging functionality of AOTAutogradCache into FXGraphCache, so that you can know if FXGraphCache missed due to a guard failure or a full cache miss.

# Why do this?
Lifting guards to AOTAutogradCache has a few benefits:
- First, it fixes a long standing bug in guard checking logic. Backward passes can have different symint inputs than forward passes depending on forward output, if AOTAutograd chooses to store symints for the backward. These symint inputs have the same underlying symbols as the forward, but on AOTAutogradCache hit, we don't have access to the hints backing these exact symints (we only have hints for the symints on the forward function). By lifting guard checking logic to AOTAutogradCache, we no longer need to check the backward guards, as they'll be included in the AOTAutogradCache guard expression. **I've added a unit test that failed before my diff, and now passes, as an example of this**
- Secondly, this is the first step necessary to bundle CompiledFxGraph into AOTAutogradCache. Doing so will simplify our cache logic significantly, and also make precompile logic simpler, as precompiles will only need to store AOTAutogradCacheEntrys, without needing to match them up with inductor FXGraphCache entries.
- Finally, adding guard checking logic to AOTAutogradCache may allow us in the future to handle more complicated cases like a single forward with multiple backwards, as guard checks are now storable on the cache entry itself.

# Guard checking logic of AOTAutogradCache
When AOTAutogradCache evaluates guard expressions, it no longer needs to evaluate the forward/backward guards in the FXGraphCacheEntry (since the AOTAutogradCache guard expressions will encompass them). Because of this, we still need a way for AOTAutogradCache to distinguish between multiple FXGraphCache local entries. To do so, AOTAutogradCache stores the guard string from FXGraphCache, which it uses as a second "cache key". It doesn't need to **evaluate** these guards, it just needs to find the cache entry from FXGraphCache that had the same guards as when it was stored.

After this, I will work on putting the FXGraphCache entries directly into AOTAutogradCache. If I can put CompiledFxGraphs in the cache directly, I no longer need this complicated `check_guard_hit` overriding logic.
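As a rough, hypothetical illustration of the guard-checked lookup idea described above (this is not the actual FXGraphCache/AOTAutogradCache code; the class names and the `check_guard_hit` hook here are assumptions for illustration only):
```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class GuardedEntry:
    guard_expr: str   # e.g. "s0 <= 1024 and s1 % 8 == 0", saved when the entry was stored
    payload: object   # the cached artifact

def lookup(
    entries: list[GuardedEntry],
    hints: dict[str, int],
    check_guard_hit: Callable[[str, dict[str, int]], bool],
) -> Optional[object]:
    # a "hit" requires both finding an entry and having its guards evaluate to True
    for entry in entries:
        if check_guard_hit(entry.guard_expr, hints):
            return entry.payload
    return None  # a guard miss vs. a full miss can be logged separately, as described above

# default checker: evaluate the stored guard expression against the current hints
def default_check(expr: str, hints: dict[str, int]) -> bool:
    return bool(eval(expr, {"__builtins__": {}}, dict(hints)))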

## Test Plan
Added a new unit test. There are comprehensive guard checking unit tests in `test_aot_autograd_cache` already, and those pass.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151563
Approved by: https://github.com/oulgen
2025-04-22 03:01:08 +00:00
40cf49d460 Revert "[Intel GPU] Allow XPU backend in Depthwise_conv2d&3d operators (#149114)"
This reverts commit 08831f30bbe745cd9f0c07d1868583a68f613514.

Reverted https://github.com/pytorch/pytorch/pull/149114 on behalf of https://github.com/guangyey due to CI is broken ([comment](https://github.com/pytorch/pytorch/pull/149114#issuecomment-2819890341))
2025-04-22 02:22:42 +00:00
a02eae8142 [dynamic shapes] guard_or_false for _reshape_view_helper, utils._infer_size for wildcard dims (#150127)
For reshape/view: removes the fast paths for 0 elements and for checking which dimensions to skip. Modifies the loop accumulating input elements to raise a UserError if we run out of dimensions, graph-breaking for compile and erroring out for export.
For infer_size: assumes that if the user passes us an unbacked symint, it's probably not -1

Will think about changes in https://docs.google.com/document/d/1WYx6EZwVDXtBnWyrzoecgGWdiK0V3XZKftfpWwQ5i3E/edit?tab=t.0#heading=h.22k54zym11qp in a later PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150127
Approved by: https://github.com/laithsakka
2025-04-22 01:14:15 +00:00
80a3877b3d [easy] Fix test_dynamo_timed (#151816)
Summary: The structured logging counter is a global that might have been affected by earlier tests. Clear it explicitly.
Fixes #148093

Test Plan: `pytest test/dynamo/test_utils.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151816
Approved by: https://github.com/ppanchalia
2025-04-22 00:12:31 +00:00
b3b1616560 Add explict type info in the try-catch for dynamo logging (#151733)
Differential Revision: D73295871

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151733
Approved by: https://github.com/hl475
2025-04-21 23:29:10 +00:00
a35e73b91f [c10] add #pragma once to leftright (#151710)
Summary: I am getting duplicate definitions when including this in a binary that already includes the dispatcher.

Test Plan: CI

Differential Revision: D73237748

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151710
Approved by: https://github.com/georgiaphillips
2025-04-21 23:18:49 +00:00
99aeee2c5f [Inductor] Add Additional Configs for persistent+TMA version of Triton mm and addmm (#150587)
Summary:
This PR introduces additional autotuning configurations for the persistent+TMA version of Triton `mm` and `addmm` operations. The new configurations are as follows:
* `(128, 128, 64, 5, 8)`
* `(256, 128, 64, 4, 8)`
* `(128, 128, 64, 5, 4)`

These configurations were selected based on exhaustive autotuning performed on commonly used shapes from an internal foundational model.

While these new configs are generally more performant across the board, we see notable gains in a few specific cases:
* In scenarios where `n >> m, k`, the configurations `(128, 128, 64, 5, 8)` and `(256, 128, 64, 4, 8)` tend to produce an additional 5-10% speedup over the aten baseline compared to the original configurations.
* Similarly, the configuration `(128, 128, 64, 5, 4)` yields approximately an 8% improvement in scenarios where `k >> m, n`.

These enhancements are expected to provide performance benefits across diverse use cases, particularly when compared to the original set of configurations.

Test Plan:
contbuild & OSS CI

Reviewers: paulzhan

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150587
Approved by: https://github.com/PaulZhang12, https://github.com/drisspg, https://github.com/eellison
2025-04-21 23:18:33 +00:00
4d78e19365 reroute index to fast implementation for indexing on 0th dimension (#151753)
Per title, improve `x[index]` CUDA perf for the common case of indexing along the first dim, using a vectorized gather kernel
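For context, a minimal example of the indexing pattern this targets (a sketch; shapes are illustrative and a CUDA device is assumed):
```python
import torch

x = torch.randn(10000, 256, device="cuda")
idx = torch.randint(0, 10000, (4096,), device="cuda")
out = x[idx]  # indexing along dim 0 with a 1-D index tensor -> gather-style kernel
```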

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151753
Approved by: https://github.com/eqy
2025-04-21 23:15:30 +00:00
01f1cc44cb Rename register_fake_profile to unsafe_generate_fake_kernels (#151797)
Fixes https://docs.google.com/document/d/1BZsuUR1zJ-52Y7wP4yWX8beB4dwYbgdu5o1qKam_iWg/edit?disco=AAABiJdX1XU
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151797
Approved by: https://github.com/zou3519
2025-04-21 23:08:15 +00:00
efdcc981d0 Back out "Do not propagate real tensor in extern kernel" (#151813)
Summary:
D73002775 breaks aot_compile for many draft exported models on PT2I dashboard. Revert.

Example error msg:

```
OrderedSet([]) >= OrderedSet([u1185, u1186, u1187]) (inductor >= fx)
fx node is: %embedding_bag_byte_prepack : [num_users=4] = call_function[target=torch.ops.quantized.embedding_bag_byte_prepack.default](args = (%view_10,), kwargs = {})
new operations are:
```

Differential Revision: D73381032

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151813
Approved by: https://github.com/angelayi, https://github.com/zou3519
2025-04-21 22:54:03 +00:00
79a9447f0e FlexAttention add decorator for large test cases (#151459)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151459
Approved by: https://github.com/Skylion007
2025-04-21 22:53:13 +00:00
6ea2e6a2d2 Do not do proper const fold during tensorify_python_scalars (#151494)
After chatting with Bob: the goal of this is to const-fold the floats that were tensorified by calling
guard_scalar(val) on them and then replacing their usages with their values.
Hence we do not need to do this for nodes with no float symbols.

We do not want to do proper const folding because we need to preserve statements that deferred
runtime asserts depend on (see the added test).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151494
Approved by: https://github.com/bobrenjc93
2025-04-21 22:39:50 +00:00
cd1317f92f [export] suggest dynamic re-export in input constraints hook (#151624)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151624
Approved by: https://github.com/angelayi
2025-04-21 22:29:46 +00:00
c312d8c501 [Dynamo] Clean up old torch function flag (#149711)
This is tracked via `SymbolicTorchFunctionState` now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149711
Approved by: https://github.com/StrongerXi, https://github.com/anijain2305
2025-04-21 21:33:58 +00:00
25a11850e9 [symmem] Add some code comments to rendezvous code (#151716)
While reading and learning the rendezvous code, I wanted to add some comments to explain it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151716
Approved by: https://github.com/kwen2501
2025-04-21 20:45:39 +00:00
352019bf9e [BE]: Better cleanup optimized code from #151474 (#151794)
This change addresses the first/second time/mem "spike" observed in https://github.com/pytorch/pytorch/issues/151351. It improves on #151474 by removing unnecessary stride calculations and unused arguments to the helper function.

Fixes https://github.com/pytorch/pytorch/issues/151351
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151794
Approved by: https://github.com/albanD, https://github.com/eqy
2025-04-21 20:32:11 +00:00
1f0d764b65 stage 2 of depreate silent fallback of tuning gemm (#148622)
context: https://github.com/pytorch/pytorch/issues/147479

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148622
Approved by: https://github.com/eellison
ghstack dependencies: #151506
2025-04-21 20:14:34 +00:00
02cecd1018 [inductor][test] Skip triton tests for MPS as well, also change reason for skipping SM89 to not IS_BIG_GPU (#151506)
Differential Revision:
[D73162091](https://our.internmc.facebook.com/intern/diff/D73162091/)

Combining / improving https://github.com/pytorch/pytorch/pull/150485 and https://github.com/pytorch/pytorch/pull/150343

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151506
Approved by: https://github.com/ColinPeppler
2025-04-21 20:14:34 +00:00
191b0237a6 Added to docs for out_dtype arg in torch gemms (#151704)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151704
Approved by: https://github.com/bdhirsh
2025-04-21 20:09:17 +00:00
1a6effc5d8 [torch] Expose PCI info from CUDA device (#151672)
Summary:
PR #125083 added CUDA device UUID info, but due to the Meta-internal [version of ROCm the code was excluded](https://github.com/pytorch/pytorch/pull/125083?fbclid=IwY2xjawJvLnNleHRuA2FlbQIxMQABHlY55crrkTqWBWTsr2HVfuqnZ3R1GHR3o9Kf1o3h3uvyawEmCEdhdT48iY1P_aem_8tfrGrWE9SxFYasGfH8kCQ#issuecomment-2103315320).

This change will ensure the Meta-internal code is built and PCI info is available.

Test Plan: pass CI

Differential Revision: D73253426

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151672
Approved by: https://github.com/Skylion007
2025-04-21 19:55:19 +00:00
2fb1326483 Add dates to pages (#151602)
re: #150873
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151602
Approved by: https://github.com/albanD
2025-04-21 19:53:55 +00:00
b7c7000728 Ensure runners have the required prefix (#151815)
Clone changes from https://github.com/pytorch/pytorch/pull/151696/ since that PR wouldn't merge
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151815
Approved by: https://github.com/seemethere
2025-04-21 19:09:17 +00:00
9680016bcf [MergeBot] Update PullRequestResolved Regex (#151814)
By copying an updated one from cff091f3f3

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151814
Approved by: https://github.com/izaitsevfb, https://github.com/albanD
2025-04-21 19:02:05 +00:00
d79144da52 [BE] Move aarch64 docker build to larger node (#151808)
They happen once a week or so; not sure why they need to run on the slowest machine possible.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151808
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi
2025-04-21 18:54:31 +00:00
fd04c79878 Revert "[aot autograd][logging] Profile large missing gaps in compile time tracing (#151256)"
This reverts commit 8e373592c8be3e28a5f5a774fc1d517aa3dbe8b4.

Reverted https://github.com/pytorch/pytorch/pull/151256 on behalf of https://github.com/Camyll due to breaking internal tests, cannot import ([comment](https://github.com/pytorch/pytorch/pull/151256#issuecomment-2819244186))
2025-04-21 18:49:23 +00:00
f37e138bc4 [MPS] Enable log1p and sigmoid for int64 (#151791)
It works on MacOS-15, but likely will need a skip for MacOS-13

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151791
Approved by: https://github.com/Skylion007
ghstack dependencies: #151790
2025-04-21 18:30:04 +00:00
e2b1c06319 [cutlass] Define GELU_taylor<float> only if CUTLASS version is <= 380 (#151702)
Summary:
#buildmore

df8a550d39/include/cutlass/epilogue/thread/activation.h (L610)
was added in v3.9 (not tagged yet)

Test Plan:
mostly ci.

Logic seems same.

Reviewed By: drisspg

Differential Revision: D72615240

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151702
Approved by: https://github.com/Skylion007, https://github.com/eqy
2025-04-21 18:23:46 +00:00
0f8613bf5c Introduce unsafe way to mark functions as cacheable (#151603)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151603
Approved by: https://github.com/jamesjwu
ghstack dependencies: #151768, #151609
2025-04-21 17:37:38 +00:00
67c2869a38 Unpack the output code in the standalone_compile (#151609)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151609
Approved by: https://github.com/zou3519
ghstack dependencies: #151768
2025-04-21 17:37:38 +00:00
287998b87f Run standalone compile tests on cpu/gpu (#151768)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151768
Approved by: https://github.com/zou3519
2025-04-21 17:37:29 +00:00
cea43f721a [Testing] Unskip expm1 log1p for MPS (#151790)
But don't test them for unsupported dtypes (which is float64 for MPS)
- Skip int64 for log1p for now (next PR will fix that)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151790
Approved by: https://github.com/Skylion007
2025-04-21 17:18:47 +00:00
9374064483 Revert "[Easy] Add more check for elapsedTime of torch.xxx.Event and torch.Event (#151404)"
This reverts commit 783be8f93248ca3af24b968bdf84188f5a3257d1.

Reverted https://github.com/pytorch/pytorch/pull/151404 on behalf of https://github.com/malfet due to suspected of breaking linux builds and breaks internal tests as well ([comment](https://github.com/pytorch/pytorch/pull/151404#issuecomment-2819041756))
2025-04-21 17:11:53 +00:00
33808f0ebd Revert "[Easy] The event_id of torch.cuda.Event and torch.xpu.Event always is 0 (#151226)"
This reverts commit 8e5fefedf4af3f31ccd05290c1b21eedf6a4ad1b.

Reverted https://github.com/pytorch/pytorch/pull/151226 on behalf of https://github.com/malfet due to Reverting to unblock revert of https://github.com/pytorch/pytorch/pull/151404 ([comment](https://github.com/pytorch/pytorch/pull/151226#issuecomment-2819030735))
2025-04-21 17:07:49 +00:00
515a0f606b [ez] fix typo in comment (#151755)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151755
Approved by: https://github.com/Skylion007
2025-04-21 14:52:39 +00:00
2eacdb91c3 Add OIDC permissions to xpu workflow (#151455)
The reusable workflow requires OIDC authentication to work and is configured via its only caller, xpu.yml; however, we set it here too to clarify that it is required. This setting also flags jobs that call this workflow without the required permissions set, to remind them that it needs to be set.

JWT ID token requires `id-token: write` permissions as documented here https://docs.github.com/en/actions/security-for-github-actions/security-hardening-your-deployments/configuring-openid-connect-in-cloud-providers#adding-permissions-settings

Ref: pytorch-fdn/multicloud-ci-infra#3

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151455
Approved by: https://github.com/chuanqi129, https://github.com/atalman
2025-04-21 14:39:40 +00:00
bf28d1cafc Expose bicubic mode for torch::nn::functional::grid_sample in LibTorch (#150817)
When bicubic interpolation was added to grid_sampler in #44780, `GridSampleFuncOptions` was not updated to allow a user to use bicubic mode in LibTorch, even though the function could handle it. This PR fixes the parity such that LibTorch's  `torch::nn::functional::grid_sample` behaves the same as PyTorch's `torch.nn.functional.grid_sample`.

Existing users can directly use `torch::grid_sampler` but must know what int to pass for the interpolation (2 for bicubic) and padding mode parameters, which is not ideal.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150817
Approved by: https://github.com/Skylion007
2025-04-21 08:55:27 +00:00
2a9afdae81 [Benchmarking] Add sam and stable_diffusion to MPS benchmarked models (#151748)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151748
Approved by: https://github.com/Skylion007, https://github.com/dcci
ghstack dependencies: #151747
2025-04-21 05:51:46 +00:00
f7ddc5125e [Easy] Fix the compilation warning of BlasKernel. (#151736)
As the title stated.

Warnings before the change:
```C++
[2/21] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/BlasKernel.cpp.o
/root/Git.d/pytorch/pytorch/aten/src/ATen/native/BlasKernel.cpp:346:6: warning: ‘void at::native::blas_impl::gemv_fast_path(const char*, const int*, const int*, const scalar_t*, const scalar_t*, const int*, const scalar_t*, const int*, const scalar_t*, scalar_t*, const int*) [with scalar_t = c10::Half]’ defined but not used [-Wunused-function]
  346 | void gemv_fast_path<at::Half>(
      |      ^~~~~~~~~~~~~~~~~~~~~~~~
/root/Git.d/pytorch/pytorch/aten/src/ATen/native/BlasKernel.cpp:329:6: warning: ‘bool at::native::blas_impl::gemv_use_fast_path(char, int64_t, int64_t, scalar_t, int64_t, int64_t, scalar_t, int64_t) [with scalar_t = c10::Half]’ defined but not used [-Wunused-function]
  329 | bool gemv_use_fast_path<at::Half>(
      |      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
/root/Git.d/pytorch/pytorch/aten/src/ATen/native/BlasKernel.cpp:301:6: warning: ‘void at::native::blas_impl::gemv_fast_path(const char*, const int*, const int*, const scalar_t*, const scalar_t*, const int*, const scalar_t*, const int*, const scalar_t*, scalar_t*, const int*) [with scalar_t = c10::BFloat16]’ defined but not used [-Wunused-function]
  301 | void gemv_fast_path<at::BFloat16>(
      |      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
/root/Git.d/pytorch/pytorch/aten/src/ATen/native/BlasKernel.cpp:273:6: warning: ‘bool at::native::blas_impl::gemv_use_fast_path(char, int64_t, int64_t, scalar_t, int64_t, int64_t, scalar_t, int64_t) [with scalar_t = c10::BFloat16]’ defined but not used [-Wunused-function]
  273 | bool gemv_use_fast_path<at::BFloat16>(
      |      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151736
Approved by: https://github.com/shink, https://github.com/Skylion007
2025-04-21 03:31:46 +00:00
8eb21dffa9 consolidate ATen/test/dispatch_key_set_test.cpp with rest of DispatchKeySet tests (#151697)
Doesn't seem to be a reason to have two test files for this.

Differential Revision: [D73274020](https://our.internmc.facebook.com/intern/diff/D73274020/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151697
Approved by: https://github.com/Skylion007
ghstack dependencies: #151626, #151627, #151628, #151629, #151630
2025-04-21 02:58:12 +00:00
9c2ac2b876 [pytorch][triton] Enable warp spec for FlexAttention kernel (#150470)
Summary:
Given inductor support for warp-specialization for `TritonTemplateKernel`, this change adds:
- num_consumer_groups
- num_buffers_warp_spec

to the flexattention template generated by inductor in `torch.compile`.

NOTE: Currently the default config doesn't enable warp-spec; explicit num_consumer_groups and num_buffers_warp_spec args in the kernel options are needed to enable it.

Test Plan:
### Functional Testing
```Py
import torch
from torch.nn.attention.flex_attention import flex_attention
from triton.testing import do_bench
make_tensor = lambda: torch.rand(8, 16, 8192, 128, device="cuda", dtype=torch.bfloat16)
q, k, v = make_tensor(), make_tensor(), make_tensor()
flex_compiled = torch.compile(flex_attention, fullgraph=True)
print(do_bench(lambda: flex_compiled(q, k, v, kernel_options={"num_warps": 4, "num_consumer_groups": 2,
                "num_buffers_warp_spec": 3,})))
```
- (best config) without WS: 11.06
- with WS: 9.35

Differential Revision: D70501880

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150470
Approved by: https://github.com/drisspg
2025-04-21 02:00:55 +00:00
fc2dd6d408 [Inductor] Update should_decompose_mm condition for CPU (#151730)
Summary:
Similar to what we did previously in D70033166

Previously, for cpu we decompose addmm if
```
check_device(mat1, mat2, device="cpu")
        and statically_known_true(mat1.shape[0] == 1)
        and statically_known_true(mat2.shape[0] <= 64)
        and statically_known_true(mat2.shape[1] <= 512)
```
We have a new case where `mat2.shape[0] = 80`, and benchmarks show that it is beneficial to decompose, so update the condition to
```
check_device(mat1, mat2, device="cpu")
        and statically_known_true(mat1.shape[0] == 1)
        and statically_known_true(mat2.shape[0] <= 128)
        and statically_known_true(mat2.shape[1] <= 512)
```
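For context, a rough sketch of the kind of CPU addmm these bounds cover (shapes are chosen to satisfy the updated condition; whether decomposition actually fires is decided inside inductor):
```python
import torch

def f(bias, mat1, mat2):
    return torch.addmm(bias, mat1, mat2)

# mat1 is (1, 80) and mat2 is (80, 512): mat1.shape[0] == 1,
# mat2.shape[0] <= 128 and mat2.shape[1] <= 512 per the new condition
bias = torch.randn(512)
mat1 = torch.randn(1, 80)
mat2 = torch.randn(80, 512)
out = torch.compile(f)(bias, mat1, mat2)
```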

Differential Revision: D73292985

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151730
Approved by: https://github.com/kflu, https://github.com/houseroad
2025-04-21 01:56:47 +00:00
470132c6a1 [MPS] Add support for hermite_polynomial_he (inductor/eager). (#151754)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151754
Approved by: https://github.com/malfet, https://github.com/jansel
2025-04-20 17:44:40 +00:00
c3a7278278 Use more efficient row/col computation (#151474)
This change addresses the first/second time/mem "spike" observed in

https://github.com/pytorch/pytorch/issues/151351

Fixes #151351

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151474
Approved by: https://github.com/eqy, https://github.com/amjames, https://github.com/Skylion007
2025-04-20 16:02:19 +00:00
6b45b6e6c9 run lintrunner for Export d68846308 (#151725)
fixes broken lint tests in https://github.com/pytorch/pytorch/pull/151481

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151725
Approved by: https://github.com/exclamaforte, https://github.com/Skylion007

Co-authored-by: Gabriel Ferns <gabeferns@meta.com>
Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-04-20 14:58:17 +00:00
a40e876b08 Support fp8 dtypes in assert_close (#150002)
Fixes #135998

Adds support for fp8. These are compared bitwise, without atol and rtol. The implementation uses the same comparison functions, just with atol and rtol forced to zero. The error message is different from the default case; it only tells the user the first mismatch. This is to avoid triggering the error from #135998.
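A small sketch of the described behavior (assuming a build with fp8 dtypes available; values are illustrative):
```python
import torch

a = torch.tensor([1.0, 2.0, 3.0]).to(torch.float8_e4m3fn)
b = torch.tensor([1.0, 2.0, 3.0]).to(torch.float8_e4m3fn)
torch.testing.assert_close(a, b)  # fp8 is compared bitwise (atol/rtol forced to zero)

c = torch.tensor([1.0, 2.5, 3.0]).to(torch.float8_e4m3fn)
# torch.testing.assert_close(a, c)  # would raise, reporting only the first mismatch
```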

Test Plan:
New unit test covers new code paths.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150002
Approved by: https://github.com/cyyever, https://github.com/zou3519
2025-04-20 01:24:21 +00:00
48761e9737 Revert "[Easy] Fix the function signature of torch.Event (#151221)"
This reverts commit 92baeecbdd3fb717880485e529df4efb02627c9d.

Reverted https://github.com/pytorch/pytorch/pull/151221 on behalf of https://github.com/malfet due to This broke rocm tests, see 92baeecbdd (40818271233-box) ([comment](https://github.com/pytorch/pytorch/pull/151221#issuecomment-2816883409))
2025-04-19 22:06:24 +00:00
c4482565cc Revert "[Easy][torch.Event] Fix and improve the docs of torch.Event (#151411)"
This reverts commit 1e1d0a4be63b354f762ee21bdccec03c1e5b371c.

Reverted https://github.com/pytorch/pytorch/pull/151411 on behalf of https://github.com/malfet due to This broke rocm tests, see 92baeecbdd (40818271233-box) ([comment](https://github.com/pytorch/pytorch/pull/151221#issuecomment-2816883409))
2025-04-19 22:06:24 +00:00
9b74ea2490 [Benchmarking] Run MPS benchmarks for [b]float16 (#151747)
And implicitly pass `--float32` when collecting results for "notset" option. Speedups for some models are much higher for float16 dtype, but it's important to track accuracy
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151747
Approved by: https://github.com/Skylion007
2025-04-19 16:40:08 +00:00
ed511cd537 [Testing] Make test_add_complex3 run on different devices (#151732)
By constructing the tensor on that device, because the test does not call `self.common` but rather executes directly.

Otherwise `test_add_complex3_mps` would test the CPU inductor rather than the MPS one

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151732
Approved by: https://github.com/dcci
2025-04-19 14:29:13 +00:00
483e61bfec [BE][Easy]: Simplify reversed call in graph matcher (#151674)
Another list() call on reversed() that is no longer necessary, since ItemViews support reversed().

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151674
Approved by: https://github.com/albanD
2025-04-19 14:14:31 +00:00
68f748a992 Revert "[Testing] Make test_add_complex3 run on different devices (#151732)"
This reverts commit 414ce713fb329b20f93002fa4ffd6bb23bc3b93b.

Reverted https://github.com/pytorch/pytorch/pull/151732 on behalf of https://github.com/malfet due to It breaks MacOS-13 ([comment](https://github.com/pytorch/pytorch/pull/151732#issuecomment-2816690571))
2025-04-19 12:35:41 +00:00
1e1d0a4be6 [Easy][torch.Event] Fix and improve the docs of torch.Event (#151411)
**Changes:**
- add detailed function or class signature
- fix the wrong display of torch.Event.wait and torch.Event.record
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151411
Approved by: https://github.com/albanD
ghstack dependencies: #151226, #151221
2025-04-19 12:21:02 +00:00
92baeecbdd [Easy] Fix the function signature of torch.Event (#151221)
As the title states.

There is a mismatch between the declaration and the implementation.
declaration:
d5a19e4525/torch/_C/__init__.pyi.in (L157-L162)

Implementation:
d5a19e4525/torch/csrc/Event.cpp (L30-L32)

**Question**: Which one should we choose?
- Change enable_timing to False to be consistent with torch.cuda.Event
- Change enable_timing to True to avoid BC-break
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151221
Approved by: https://github.com/albanD
ghstack dependencies: #151226
2025-04-19 11:56:37 +00:00
8e5fefedf4 [Easy] The event_id of torch.cuda.Event and torch.xpu.Event always is 0 (#151226)
Although torch.cuda.Event and torch.xpu.Event have cuda_event and sycl_event fields respectively, the event_id exposed from the base class torch.Event is always 0, which can confuse users.

The memory of torch.Event is not useful to torch.cuda.Event and torch.xpu.Event, but we still need to inherit from torch.Event because CPython will check it.

Repro with cuda:
```
>>> import torch
>>> event = torch.cuda.Event()
>>> event.cuda_event
0
>>> event.event_id
0
>>> event.record()
>>> event.cuda_event
127982096
>>> event.event_id
0
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151226
Approved by: https://github.com/albanD
2025-04-19 10:42:00 +00:00
92d0c40c49 Revert "Cache the value of torch_key in subproc (#151057)"
This reverts commit 5f5805a6ac44179520291b2aa6e18d286dc93669.

Reverted https://github.com/pytorch/pytorch/pull/151057 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/151057#issuecomment-2816614510))
2025-04-19 08:48:12 +00:00
f6c1cf04b5 [ROCm][TunableOp] Support submatrices in offline tuning (#151138)
This PR adds support for submatrices in offline tuning for:
- GEMM
- GEMM and bias
- ScaledGEMM
- Batch Strided GEMM

New UTs cover submatrices. Submatrix support for the strided batch API is not part of this PR and will be done separately.

There is also a bug fix for offline tuning for full matrix for GEMM and bias in the `NT` case. Offline and online UTs were updated to cover this corner case.

To improve code readability, swapped definition of transA and transB.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151138
Approved by: https://github.com/jeffdaily
2025-04-19 04:14:27 +00:00
2673ea4131 Add api to enable/disable NaN detector per-PG (#151723)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151723
Approved by: https://github.com/kwen2501, https://github.com/fduwjj
2025-04-19 03:55:25 +00:00
414ce713fb [Testing] Make test_add_complex3 run on different devices (#151732)
By constructing the tensor on that device, because the test does not call `self.common` but rather executes directly.

Otherwise `test_add_complex3_mps` will test the CPU inductor rather than the MPS one.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151732
Approved by: https://github.com/dcci
2025-04-19 03:14:46 +00:00
6261db7719 Revert "inductor.config.descriptive_names = False is not actually supported (#145523) (#146051) (#151481)"
This reverts commit cfc4d74b0c9a0d21debbebb41e1dfa4dd2acf2a0.

Reverted https://github.com/pytorch/pytorch/pull/151481 on behalf of https://github.com/malfet due to It indeed breaks lint, it followup PR contains it's own issues ([comment](https://github.com/pytorch/pytorch/pull/151481#issuecomment-2816490764))
2025-04-19 03:12:56 +00:00
843e4d11ba [Benchmarking] Enable HF_GPT2 benchmarking on Metal (#151721)
By building wheel with USE_DISTRIBUTED=1

Otherwise attempt to run
```
python3 benchmarks/dynamo/torchbench.py --performance --only hf_T5 --backend inductor --inference --devices mps
```
will fail with
```
  File "/Users/nshulga/Library/Python/3.10/lib/python/site-packages/transformers/modeling_utils.py", line 40, in <module>
    import torch.distributed.tensor
  File "/Users/nshulga/git/pytorch/pytorch/torch/distributed/tensor/__init__.py", line 4, in <module>
    import torch.distributed.tensor._ops  # force import all built-in dtensor ops
  File "/Users/nshulga/git/pytorch/pytorch/torch/distributed/tensor/_ops/__init__.py", line 2, in <module>
    from ._conv_ops import *  # noqa: F403
  File "/Users/nshulga/git/pytorch/pytorch/torch/distributed/tensor/_ops/_conv_ops.py", line 5, in <module>
    from torch.distributed.tensor._dtensor_spec import DTensorSpec, TensorMeta
  File "/Users/nshulga/git/pytorch/pytorch/torch/distributed/tensor/_dtensor_spec.py", line 6, in <module>
    from torch.distributed.tensor.placement_types import (
  File "/Users/nshulga/git/pytorch/pytorch/torch/distributed/tensor/placement_types.py", line 8, in <module>
    import torch.distributed._functional_collectives as funcol
  File "/Users/nshulga/git/pytorch/pytorch/torch/distributed/_functional_collectives.py", line 9, in <module>
    import torch.distributed.distributed_c10d as c10d
  File "/Users/nshulga/git/pytorch/pytorch/torch/distributed/distributed_c10d.py", line 23, in <module>
    from torch._C._distributed_c10d import (
ModuleNotFoundError: No module named 'torch._C._distributed_c10d'; 'torch._C' is not a package
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151721
Approved by: https://github.com/wdvr, https://github.com/dcci, https://github.com/huydhn
2025-04-19 02:57:03 +00:00
cfc4d74b0c inductor.config.descriptive_names = False is not actually supported (#145523) (#146051) (#151481)
Summary:

This config is not supported (it throws an error when set), and doesn't really make sense imo.

Approved by: https://github.com/eellison

Test Plan: contbuild & OSS CI, see edf266e9bb

Reviewed By: masnesral

Differential Revision: D68846308

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151481
Approved by: https://github.com/masnesral
2025-04-19 01:13:35 +00:00
adf5f38eae Don't specialize min/max (#151347)
address https://github.com/pytorch/pytorch/issues/149635
Differential Revision: [D73041489](https://our.internmc.facebook.com/intern/diff/D73041489/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151347
Approved by: https://github.com/bobrenjc93
2025-04-19 00:11:15 +00:00
359e1d517c [Profiler] Remove Decref From Python Context (#151625)
Summary: When running the on-demand profiler with stack collection, the decref causes a segfault. I tried checking the refcount and the object itself and they both look fine, but it still segfaults every time. Let's remove it for now and revisit.

This will induce a small memory leak, but it should be small enough that it does not create any significant impact on the jobs being run.

Test Plan:
Removed decref and got clean traces
https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/0/1744933624/localhost/libkineto_activities_2936811.json.gz&bucket=gpu_traces

Differential Revision: D73225468

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151625
Approved by: https://github.com/davidberard98
2025-04-18 23:55:19 +00:00
e48189cf03 Don't eagerly create AliasInfo in parseAliasDeclaration (#151630)
No need to create an AliasInfo...unless we need it.

Differential Revision: [D73129452](https://our.internmc.facebook.com/intern/diff/D73129452/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151630
Approved by: https://github.com/Skylion007, https://github.com/malfet
ghstack dependencies: #151626, #151627, #151628, #151629
2025-04-18 22:51:37 +00:00
cac8d35503 Use fmt::format for debug strings in Library init (#151629)
Observed several ms taken during `import torch` by c10::str here.

Differential Revision: [D73129453](https://our.internmc.facebook.com/intern/diff/D73129453/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151629
Approved by: https://github.com/cyyever, https://github.com/Skylion007, https://github.com/albanD, https://github.com/malfet
ghstack dependencies: #151626, #151627, #151628
2025-04-18 22:51:37 +00:00
313ceb4da3 Reserve vector in StringCordView ctor (#151628)
Add a clearly missing reserve (we should expect that pieces are not empty).

Differential Revision: [D73129445](https://our.internmc.facebook.com/intern/diff/D73129445/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151628
Approved by: https://github.com/Skylion007, https://github.com/malfet
ghstack dependencies: #151626, #151627
2025-04-18 22:51:29 +00:00
704a504e8a Reserve vectors in FunctionSchema::cloneWithRealTypes (#151627)
1) reserving is much better than not reserving
2) std::transform for a 1-line-body loop is generally not considered to be an improvement (and doesn't seem to get boiled away by clang under -Oz)

Differential Revision: [D73013363](https://our.internmc.facebook.com/intern/diff/D73013363/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151627
Approved by: https://github.com/Skylion007, https://github.com/malfet
ghstack dependencies: #151626
2025-04-18 22:51:23 +00:00
fc7d493908 Overload Library::def rather than templating it (#151626)
It ends up being templated over a bunch of reference-to-array-of-characters types with different lengths, such as `char const (&) [88]`, which is an annoyance when profiling and possibly a source of code bloat.

Differential Revision: [D73129450](https://our.internmc.facebook.com/intern/diff/D73129450/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151626
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-04-18 22:51:16 +00:00
97d97aef24 Revert "[dynamic shapes] guard_or_false for _reshape_view_helper, utils._infer_size for wildcard dims (#150127)"
This reverts commit 1dd2033c0a1de460ee2bad8d64c36a0344886071.

Reverted https://github.com/pytorch/pytorch/pull/150127 on behalf of https://github.com/clee2000 due to maybe caused export test to fail? export/test_draft_export.py::TestDraftExport::test_masked_linear [GH job link](https://github.com/pytorch/pytorch/actions/runs/14538768138/job/40794985504) [HUD commit link](1dd2033c0a), bad TD ([comment](https://github.com/pytorch/pytorch/pull/150127#issuecomment-2816232086))
2025-04-18 21:38:47 +00:00
bd77c3e054 [easy] Update test/dynamo/test_structured_trace.py (#151606)
Summary: test/dynamo/test_structured_trace.py is out of date because of some new fields. (I guess the test is disabled?). Bring it up to date.

Test Plan: `python test/dynamo/test_structured_trace.py`

Fixes #149671

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151606
Approved by: https://github.com/Skylion007
ghstack dependencies: #151599
2025-04-18 21:33:13 +00:00
56d318bfac [ONNX][Eazy] Update onnx program doc formatting and improve robustness (#151623)
- Update docstring list formatting
- Use a try finally block to keep the model unmodified if save() fails.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151623
Approved by: https://github.com/titaiwangms
2025-04-18 21:31:31 +00:00
02dd096e51 [invoke_subgraph][fake tensor] Add finalizer on subgraph instead of the functionalize ctx wrapper (#151633)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151633
Approved by: https://github.com/zou3519
ghstack dependencies: #151330, #151256, #151357, #151477
2025-04-18 21:23:21 +00:00
b74be52454 [CUDA][NVTX] Move nvtx3 code from cmake/public/cuda.cmake to cmake/Dependencies.cmake (#151583)
Fixes [#147220]

Context: In the CUDA NVTX world, there are NVTX v2 and NVTX v3. As announced in the CUDA release notes, e.g. [CUDA 12.8 Update 1](https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#deprecated-or-dropped-operating-systems): "NVTX v2 is deprecated. To migrate to NVTX v3, change your code from `#include <nvtoolsext.h>` to `#include "nvtx3/nvtoolsext.h"`. This header is included in the toolkit."
On the PyTorch side, TORCH_CUDA_USE_NVTX3 compile time macro is used and it is set to true when (most of the time) nvtx3 is found. nvtx3 is found in two cases: 1) USE_SYSTEM_NVTX=0 (default), torch build process would automatically look for the nvtx3 in pytorch/third_party/nvtx. This is the most common and default case. 2) when USE_SYSTEM_NVTX=1 is used, nvtx3 is found from the installed CUDA toolkit (e.g. CUDA 12.8 and even some earlier cuda versions).
As described in #147220, the reason it can find pytorch/third_party/nvtx is because it used
6f035d8462/cmake/public/cuda.cmake (L176)
note the "PROJECT_SOURCE_DIR" usage in [pytorch/cmake/public/cuda.cmake](6f035d8462/cmake/public/cuda.cmake (L176))

Before this PR:
The PyTorch build would succeed in finding nvtx3 due to the process described above; everything is good. But downstream projects like torchvision *can* fail, and would fail by default, because the following are happening:
1) USE_SYSTEM_NVTX=0 is used (and most likely it is this case because it is the default)
2) NVTX v2 can no longer be found (e.g. future CUDA versions because deprecation would eventually become removal)
3) TorchVision cannot find NVTX3 either, because torchvision was invoking [pytorch/cmake/public/cuda.cmake] but the PROJECT_SOURCE_DIR is no longer the pytorch source but the torchvision source!
4) One workaround is to "USE_SYSTEM_NVTX=1" but users have to explicitly set this and do the plumbing work

After this PR:
PyTorch can still find nvtx3 because the part of the code that finds nvtx3 is just moved to a new place. The CI logs are showing it being able to find nvtx3. e.g. [this job](https://productionresultssa14.blob.core.windows.net/actions-results/47f8efaa-0afe-4e1f-bc94-0a82629941cb/workflow-job-run-dc8201b1-845b-5da1-a6ea-d3360ce1b508/logs/job/job-logs.txt?rsct=text%2Fplain&se=2025-04-18T20%3A38%3A05Z&sig=yMd6egC%2Banl3lR%2BudXFX18bfUH189z0DTGLtscHQJwY%3D&ske=2025-04-19T06%3A21%3A45Z&skoid=ca7593d4-ee42-46cd-af88-8b886a2f84eb&sks=b&skt=2025-04-18T18%3A21%3A45Z&sktid=398a6654-997b-47e9-b12b-9515b896b4de&skv=2025-01-05&sp=r&spr=https&sr=b&st=2025-04-18T20%3A28%3A00Z&sv=2025-01-05), which reads "`Found nvtx3: C:/actions-runner/_work/pytorch/pytorch/pytorch/third_party/NVTX/c/include`"
For torchvision, it still invokes [pytorch/cmake/public/cuda.cmake] but it no longer tries to find nvtx3, as torchvision is not using nvtx3 (if it does in the future, it can set USE_SYSTEM_NVTX=1 by default). So it avoids the error reported in [#147220]

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151583
Approved by: https://github.com/eqy, https://github.com/atalman, https://github.com/malfet
2025-04-18 21:18:09 +00:00
6e7b6e8d57 [c10d][fr] Fix a bug when first rank is not zero in the script (#151683)
Summary: Further testing the script, we found that we shouldn't always assume rank 0 is the first rank, so we need to check all entries and see whether it is a P2P op for this coalesced group.

Test Plan: Directly test with corner case.

Differential Revision: D73266257

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151683
Approved by: https://github.com/fegin
2025-04-18 20:55:06 +00:00
a6e46faff4 Use reusable binary docker build action for manywheel (#151489)
This is part of splitting up https://github.com/pytorch/pytorch/pull/150558 into smaller chunks, please see that for more context

Similar to https://github.com/pytorch/pytorch/pull/151483 but for manywheel

Changed the job name

s390x doesn't have access to AWS ECR so it doesn't use the action. The manylinuxs390x-builder ECR repo doesn't exist in Docker Hub, so I don't know why the image name is that.

Testing:
Can't really test since PRs don't have the credentials to push to docker io, which is the image used for everything, including PRs right now
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151489
Approved by: https://github.com/seemethere
2025-04-18 20:38:33 +00:00
b0f26e81a5 Use reusable binary docker build action for libtorch (#151488)
This is part of splitting up https://github.com/pytorch/pytorch/pull/150558 into smaller chunks, please see that for more context

Similar to https://github.com/pytorch/pytorch/pull/151483 but for libtorch

Changed the job name

Testing:
Can't really test since PRs don't have the credentials to push to docker io, which is the image used for everything, including PRs right now
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151488
Approved by: https://github.com/atalman
2025-04-18 20:37:38 +00:00
88b0553c58 [AMD] Remove fbcode limit for uuid (#151652)
Summary: We're now on a later ROCm version, so it's OK to add uuid back.

Test Plan: sandcastle

Differential Revision: D73240086

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151652
Approved by: https://github.com/Skylion007, https://github.com/ngimel, https://github.com/houseroad
2025-04-18 20:37:09 +00:00
7ffa9000ed Replace perf-nightly-macos with inductor-perf-nightly-macos (#151698)
The name was updated by https://github.com/pytorch/pytorch/pull/151155.  The benchmark results weren't updated on the dashboard otherwise.

For PT2 compiler perf benchmark, we are still relying on this old workflow.  To get rid of this, we need to update PT2 benchmark dashboard to use the new benchmark database (cc @yangw-dev)

The results are there on the new database:

```
SELECT
    *
FROM
    oss_ci_benchmark_v3
WHERE
    workflow_id = 14510035576
```

but not on the old database:

```
SELECT
    *
FROM
    inductor_torch_dynamo_perf_stats
WHERE
    workflow_id = 14510035576
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151698
Approved by: https://github.com/seemethere, https://github.com/atalman
2025-04-18 20:31:36 +00:00
1b267a58a1 Revert "[export] allow partially specifying keys for dynamic shapes dict spec (#151597)"
This reverts commit c8240e3492e4813e822d7265eb3afb7f1168db39.

Reverted https://github.com/pytorch/pytorch/pull/151597 on behalf of https://github.com/clee2000 due to broke some export test export/test_converter.py::TestConverter::test_aten_len [GH job link](https://github.com/pytorch/pytorch/actions/runs/14538615968/job/40792673415) [HUD commit link](c8240e3492), bad TD ([comment](https://github.com/pytorch/pytorch/pull/151597#issuecomment-2816127271))
2025-04-18 20:17:44 +00:00
f20a266512 [easy] Update test/dynamo/test_utils.py (#151599)
Summary: test/dynamo/test_utils.py is out of date because of some new dynamo_timed fields. (I guess the test is disabled?). Bring it up to date

Test Plan: `python test/dynamo/test_utils.py`

Fixes #148093

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151599
Approved by: https://github.com/Skylion007
2025-04-18 18:49:24 +00:00
e434a9152e Revert "[inductor][test] Skip triton tests for MPS as well, also change reason for skipping SM89 to not IS_BIG_GPU (#151506)"
This reverts commit 6246c7d62ca2f091838d5c707e3d932994c5e35a.

Reverted https://github.com/pytorch/pytorch/pull/151506 on behalf of https://github.com/henrylhtsang due to seems to be breaking some rocm mi300 run ([comment](https://github.com/pytorch/pytorch/pull/151506#issuecomment-2815999009))
2025-04-18 18:40:17 +00:00
cccfc146fe [BE][Easy]: Simplify ModuleList reversed method (#151673)
Removes unnecessary list calls now that we are in Python 3.9 and KeyViews implement reversed directly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151673
Approved by: https://github.com/albanD
2025-04-18 18:39:32 +00:00
b7807759de Revert "stage 2 of depreate silent fallback of tuning gemm (#148622)"
This reverts commit 181b3883e71b9771e8a3cdaf43d627f68e9f0fa6.

Reverted https://github.com/pytorch/pytorch/pull/148622 on behalf of https://github.com/henrylhtsang due to seems to be breaking some rocm mi300 run ([comment](https://github.com/pytorch/pytorch/pull/148622#issuecomment-2815995105))
2025-04-18 18:37:09 +00:00
b73606dcc5 Add jk for force_disable_caches (#151621)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151621
Approved by: https://github.com/jamesjwu
2025-04-18 18:19:40 +00:00
9ccdeae7db Fix uint view copy (#151598)
Fix for https://github.com/pytorch/pytorch/issues/151156. We have some logic to undo our upcast prior to the dtype bitcast. This PR cleans up that logic using dtypes in codegen.
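A repro-style sketch of the kind of pattern this covers (a dtype bitcast via `Tensor.view` under `torch.compile`); this is an illustrative example, not the exact reproducer from issue #151156:
```python
import torch

def f(x):
    # bitcast the int8 storage to uint8, then do a small arithmetic op
    return x.view(torch.uint8) + 1

x = torch.tensor([1, -2, 3], dtype=torch.int8)
torch.testing.assert_close(torch.compile(f)(x), f(x))  # compiled == eager
```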

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151598
Approved by: https://github.com/zou3519
ghstack dependencies: #151562
2025-04-18 18:13:39 +00:00
28974a1ec3 Revert "[Easy] Fix the compilation warning of BlasKernel. (#151302)"
This reverts commit 32c79da789af84312a0db2de19211a7c57196ba7.

Reverted https://github.com/pytorch/pytorch/pull/151302 on behalf of https://github.com/malfet due to Breaks builds without OpenMP, see https://github.com/pytorch/pytorch/issues/151680 ([comment](https://github.com/pytorch/pytorch/pull/151302#issuecomment-2815954855))
2025-04-18 18:10:45 +00:00
115a0c6413 add privateuse1 device type to pre forward hook of fsdp (#149487)
add privateuse1 device type to pre forward hook of fsdp

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149487
Approved by: https://github.com/FFFrog, https://github.com/cyyever, https://github.com/shink, https://github.com/albanD
2025-04-18 17:50:23 +00:00
1a48382a4c [Easy] Optimize container.py typing (#151653)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151653
Approved by: https://github.com/albanD
2025-04-18 17:33:43 +00:00
931bd05560 Do not propagate real tensor in extern kernel (#151377)
Summary: See internal Diff for more details.

In ExternKernel, the FakeTensors do not have associated real tensors, because they are just created from ir.Node's shape and stride.

Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:test_aot_inductor -- -r aoti_data_dependent_ex

buck2 run mode/dev-nosan  fbcode//caffe2/test/inductor:aot_inductor_arrayref_cpu -- -r data_dependent_extern_kernel_op
```

Differential Revision: D73002775

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151377
Approved by: https://github.com/angelayi
2025-04-18 17:28:13 +00:00
181b3883e7 stage 2 of depreate silent fallback of tuning gemm (#148622)
context: https://github.com/pytorch/pytorch/issues/147479

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148622
Approved by: https://github.com/eellison
ghstack dependencies: #151506
2025-04-18 17:26:16 +00:00
6246c7d62c [inductor][test] Skip triton tests for MPS as well, also change reason for skipping SM89 to not IS_BIG_GPU (#151506)
Differential Revision:
[D73162091](https://our.internmc.facebook.com/intern/diff/D73162091/)

Combining / improving https://github.com/pytorch/pytorch/pull/150485 and https://github.com/pytorch/pytorch/pull/150343

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151506
Approved by: https://github.com/ColinPeppler
2025-04-18 17:26:16 +00:00
1dd2033c0a [dynamic shapes] guard_or_false for _reshape_view_helper, utils._infer_size for wildcard dims (#150127)
For reshape/view: removes the fast paths for 0 elements and for checking which dimensions to skip. Modifies the loop accumulating input elements to raise a UserError if we run out of dimensions, graph-breaking for compile and erroring out for export.
For infer_size: assumes that if the user passes us an unbacked symbol, it's probably not -1.

Will think about changes in https://docs.google.com/document/d/1WYx6EZwVDXtBnWyrzoecgGWdiK0V3XZKftfpWwQ5i3E/edit?tab=t.0#heading=h.22k54zym11qp in a later PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150127
Approved by: https://github.com/laithsakka
2025-04-18 17:05:11 +00:00
c8240e3492 [export] allow partially specifying keys for dynamic shapes dict spec (#151597)
Fixes #148564

Should help with exporting HF-style models, so users don't have to specify 100 Nones
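A hedged usage sketch of what this enables, inferred from the PR description (whether omitted keys are accepted is exactly the behavior added here):
```python
import torch
from torch.export import Dim, export

class M(torch.nn.Module):
    def forward(self, x, y):
        return x + 1, y + 1

# Only "x" gets a dynamic-shapes entry; "y" no longer needs an explicit
# None placeholder in the dict.
ep = export(
    M(),
    (torch.randn(2, 8), torch.randn(3, 4)),
    dynamic_shapes={"x": {0: Dim("batch")}},
)
```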

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151597
Approved by: https://github.com/angelayi
2025-04-18 16:53:01 +00:00
9eaaca2ece Turn off symm_mem when cuda version is <12.3 (#151203)
Summary: It looks like symmetric memory only supports CUDA 12.3+. We do have the definition for CUDA versions below 12.3, but we don't have an implementation, so it's a good idea to disable the definition as well.

Test Plan: CI

Reviewed By: jianyuh, houseroad, ngimel, jiawenliu64

Differential Revision: D72936993

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151203
Approved by: https://github.com/ngimel, https://github.com/houseroad
2025-04-18 16:37:12 +00:00
783be8f932 [Easy] Add more check for elapsedTime of torch.xxx.Event and torch.Event (#151404)
As the title stated

**Changes:**
- Add **record**, **query** and **enable_timing** check
- Add related tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151404
Approved by: https://github.com/albanD
2025-04-18 15:26:13 +00:00
29317f8585 [standalone_compile] Some misc fixes (#151502)
This PR fixes two things.

The first problem is that, in the vLLM-style usage, standalone_compile is
called from within a custom torch.compile backend. If there already is a
FakeTensorMode (which there is), we shouldn't create a new FakeTensorMode
with the same shape_env; instead we should just reuse the same
FakeTensorMode.

The second problem is that compile_fx can mutate the passed-in gm, so we
deepcopy it (since standalone_compile should be standalone).

Test Plan:
- new test
- updated old tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151502
Approved by: https://github.com/oulgen
ghstack dependencies: #151501, #151551
2025-04-18 12:34:13 +00:00
58310a0043 [standalone_compile] support multiple returns (#151551)
We were only returning the first one. There's an edge case about what to do
if the original function returns a single Tensor: capture(f) returns a
function that returns a tuple of one Tensor in this case, and we were
originally converting this back to a single Tensor. I think it's fine
to return a tuple of one Tensor (that is what the graph passed to
standalone_compile asked for!) but we can revisit.

Test Plan:
- modified one test to used multiple outputs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151551
Approved by: https://github.com/Skylion007, https://github.com/oulgen
ghstack dependencies: #151501
2025-04-18 12:34:13 +00:00
ac715e96b4 [standalone_compile] Don't check if path is directory if it doesn't exist (#151501)
os.path.isdir(path) will return False if the path doesn't exist.

Test Plan:
- new test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151501
Approved by: https://github.com/Skylion007, https://github.com/oulgen
2025-04-18 12:34:13 +00:00
14293c2377 [MPS] Allow isin for mixed types (#151600)
To follow the pattern set by the CPU and CUDA impls: define a common_dtype and optionally cast `elements` and `test_elements` to the common dtype if needed.
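A small sketch of the mixed-dtype case this enables (requires an MPS device; expected values shown in comments):
```python
import torch

x = torch.arange(4.0, device="mps")   # float32: [0., 1., 2., 3.]
y = torch.arange(1, 3, device="mps")  # int64:   [1, 2]
# elements/test_elements are cast to a common dtype before the comparison
print(torch.isin(x, y))               # expected: [False, True, True, False]
```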

- Add regression test, though skip it on MacOS-13, as `isin` seems to produce garbage there even for same dtypes
```
>>> import torch
>>> x=torch.arange(4.0, device='mps')
>>> y=torch.arange(1.0, 3.0, device='mps')
>>> x, y, torch.isin(x, y), torch.isin(y, x)
(tensor([0., 1., 2., 3.], device='mps:0'), tensor([1., 2.], device='mps:0'), tensor([False,  True, False, False], device='mps:0'), tensor([False, False], device='mps:0'))
>>> torch.__version__
'2.6.0'
```
- Cleanup code a bit

Fixes https://github.com/pytorch/pytorch/issues/151443
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151600
Approved by: https://github.com/Skylion007, https://github.com/dcci, https://github.com/kulinseth
2025-04-18 12:30:32 +00:00
675f69f40f collect_env: gracefully handle no pip (#151607)
If pip is not installed:

### Before

```console
> python3 torch/utils/collect_env.py
Collecting environment information...
Traceback (most recent call last):
  File "/Users/Adam/pytorch/torch/utils/collect_env.py", line 694, in <module>
    main()
    ~~~~^^
  File "/Users/Adam/pytorch/torch/utils/collect_env.py", line 677, in main
    output = get_pretty_env_info()
  File "/Users/Adam/pytorch/torch/utils/collect_env.py", line 672, in get_pretty_env_info
    return pretty_str(get_env_info())
                      ~~~~~~~~~~~~^^
  File "/Users/Adam/pytorch/torch/utils/collect_env.py", line 497, in get_env_info
    pip_version, pip_list_output = get_pip_packages(run_lambda)
                                   ~~~~~~~~~~~~~~~~^^^^^^^^^^^^
  File "/Users/Adam/pytorch/torch/utils/collect_env.py", line 450, in get_pip_packages
    for line in out.splitlines()
                ^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'splitlines'
```

### After

```console
> python3 torch/utils/collect_env.py
Collecting environment information...
PyTorch version: N/A
Is debug build: N/A
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: N/A

OS: macOS 15.4 (arm64)
GCC version: Could not collect
Clang version: 20.1.0
CMake version: version 3.31.6
Libc version: N/A

Python version: 3.13.2 (main, Apr  8 2025, 15:27:33) [Clang 17.0.0 (clang-1700.0.13.3)] (64-bit runtime)
Python platform: macOS-15.4-arm64-arm-64bit-Mach-O
Is CUDA available: N/A
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: N/A

CPU:
Apple M2 Pro

Versions of relevant libraries:
[pip3] Could not collect
[conda] Could not collect
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151607
Approved by: https://github.com/malfet
2025-04-18 12:28:58 +00:00
776aa68221 Update torch-xpu-ops commit pin (#150827)
Update the torch-xpu-ops commit to [b51dd3ef4f4d0f6b44c59e61431c5d29354dcaf6](b51dd3ef4f), including:
- Update commit pin to xpu-ops main branch
- Fixes batch_norm numeric error by adding additional boundary check
- Enable two operators: fft & jagged_to_padded_dense
- XCCL relevant changes:
1. Cache `cclStream` to improve performance.
2. Add support for complex datatypes in `allgather` and `broadcast`.
3. Support `coalescing` operations and `batch_isend_irecv`.
4. Introduce additional logging; use `export TORCH_CPP_LOG_LEVEL=INFO`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150827
Approved by: https://github.com/EikanWang

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
2025-04-18 10:12:59 +00:00
0376bbf5b3 [XPU] skip a subprocess UT for Windows (#150999)
This case creates a subprocess within a subprocess. On Windows it can't load the function in this scenario, hence I have to skip it:
```
File "C:\ProgramData\miniforge3\envs\lfq\lib\multiprocessing\spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "C:\ProgramData\miniforge3\envs\lfq\lib\multiprocessing\spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
AttributeError: Can't get attribute 'run_model' on <module '__main__' (built-in)>
Traceback (most recent call last):
  File "<string>", line 25, in <module>
  File "<string>", line 16, in test_multi_process
AssertionError
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150999
Approved by: https://github.com/guangyey, https://github.com/EikanWang

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
2025-04-18 08:55:47 +00:00
541f8cd34c faster gather implementation (#151490)
So far it's only for `gather`, but we'll move index_select and index to this implementation too. Torchtitan and fbgemm have noticed that gather/index_select perf is bad; this PR brings the core implementation on par with those customized implementations. Added benefits: all dtypes are supported, and it is a bit less strict on tensor dimensions/contiguity because we pick the fast path after TensorIterator has collapsed the dimensions.

The biggest part of this PR is not even the kernel (it's dumb; just vectorized loads are enough), but moving the utilities for vectorized loads and stores from SymmetricMemory to be generally accessible in MemoryAccess.cuh.
Additional tests are coming to make sure this implementation doesn't break anything.

`gather` is equivalent to x[indices] for 1d indices via
```
def fn_gather(x, indices):
    return torch.gather(x, dim=0, index=indices.unsqueeze(1).expand(-1, x.shape[1]))

def fn_index(x, indices):
    return x[indices]
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151490
Approved by: https://github.com/Skylion007, https://github.com/eqy
2025-04-18 07:48:31 +00:00
eb1f85a2a0 Support C++ statically_known_true (#151346)
Differential Revision: [D73040543](https://our.internmc.facebook.com/intern/diff/D73040543/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151346
Approved by: https://github.com/laithsakka
2025-04-18 06:42:12 +00:00
8895c290f4 [Easy] enable PYFMT for torch/quantization/eager (#150761)
All modifications are done through tools; the detailed commands are as follows:

```bash
lintrunner -a --take "PYFMT" --all-files
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150761
Approved by: https://github.com/jerryzh168
2025-04-18 05:53:33 +00:00
91b090c912 [executorch hash update] update the pinned executorch hash (#151632)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151632
Approved by: https://github.com/pytorchbot
2025-04-18 05:07:28 +00:00
6649ed9deb [ez] fix code owners typo (#151499)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151499
Approved by: https://github.com/laithsakka
2025-04-18 04:24:16 +00:00
bedefa46a9 Document non-pytorch CUDA memory allocation and how to query it (#150880)
This PR documents the fact that PyTorch does not have visibility into how every CUDA memory allocation happened - it only knows about allocations that went through the PyTorch CUDA allocator.

It also adds a code snippet showing how to use pynvml to query current GPU memory usage.
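For reference, a short sketch of the kind of pynvml query the doc describes (using the `nvidia-ml-py` package; the reported values are device-wide, not per-process):
```python
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)                 # GPU 0
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f"total={mem.total} used={mem.used} free={mem.free}")   # bytes
pynvml.nvmlShutdown()
```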

## Preview
Added a note at the top of "Understanding CUDA Memory Usage" doc:
<img width="732" alt="image" src="https://github.com/user-attachments/assets/69e28d2a-841a-4b1b-b886-e96fb5d76582" />

which links to a section below:
<img width="733" alt="image" src="https://github.com/user-attachments/assets/cab4f252-9ac2-4fc6-a45d-fdb958fc7dbc" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150880
Approved by: https://github.com/kwen2501, https://github.com/ngimel
2025-04-18 03:48:54 +00:00
7d282da449 Add automatic categorization for release notes: inductor (aoti) (#151569)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151569
Approved by: https://github.com/desertfire
ghstack dependencies: #151453
2025-04-18 03:39:06 +00:00
2426258789 [doc fix] fix torch export docs for preserve_module_call_signature (#151140)
The preserve_module_call_signature explanation is missing in the __init__.py. Copying that from _trace.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151140
Approved by: https://github.com/angelayi
2025-04-18 02:55:35 +00:00
33cfe30ee1 Add HostAllocator as the unified parent class (#151431)
# Motivation
This PR introduces a unified parent class `HostAllocator` with the following goals:
1. Enable backend-specific host allocator registration, including support for out-of-tree backends.
2. Provide a unified and extensible API surface for host memory management across all backends, especially accelerators.

The new interface includes:
- `at::getHostAllocator()->allocate`
- `at::getHostAllocator()->empty_cache`
- `at::getHostAllocator()->record_event`
- `at::getHostAllocator()->get_stats`
- `at::getHostAllocator()->reset_accumulated_stats`
- `at::getHostAllocator()->reset_peak_stats`

# Additional Context
We plan to deprecate legacy APIs such as `at::cuda::CachingHostAllocator_emptyCache` and recommend users migrate to the new backend-specific API, for example:
```cpp
at::getHostAllocator(at::kCUDA)->empty_cache();
```
This refactor will help standardize host memory management across devices and simplify backend integration in the future.
Another key improvement I am going to do is move the `is_pinned` functionality into the `HostAllocator` class, which enables centralized pinned memory verification through calls like `at::getHostAllocator(at::kCUDA)->is_pinned(ptr)`.
Benefits include:
 - Consistent host memory handling across all device backends
 - Decouple pinned memory functionality with `AcceleratorHooksInterface` in a more modular way
 - Clearer separation between device memory allocation and pinned host memory management

This architecture makes the system more maintainable and extensible for future device support.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151431
Approved by: https://github.com/albanD
ghstack dependencies: #151403
2025-04-18 02:44:17 +00:00
1cc5a8452b [Openreg][PrivateUse1] Fix releasing tensor issue when using pin_memory (#151091)
As the title stated.

Related PR: https://github.com/pytorch/pytorch/pull/147066

Co-authored-by: Zhenbin Lin <lin-zhenbin@qq.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151091
Approved by: https://github.com/albanD
ghstack dependencies: #151007
2025-04-18 02:40:07 +00:00
3528488061 [Openreg][PrivateUse1] Enable CI for openreg (#151007)
Changes:
- move test_openreg.py from test/cpp_extensions/open_registration_extension/ to test/
- update README.md for openreg
- enable CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151007
Approved by: https://github.com/albanD
2025-04-18 02:40:07 +00:00
09e8ff92cc refresh benchmark results (#151622)
Updating due to <1.5% increases in https://github.com/pytorch/pytorch/pull/151469; not all benchmarks were updated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151622
Approved by: https://github.com/oulgen
2025-04-18 02:39:13 +00:00
98c892749b c10d/Store: add nonblocking mode to queue_pop (#151485)
This adds a non-blocking mode to queue_pop, which allows workers to poll whether work is ready without blocking the main loop. This is useful when you want a GPU to stay at maximum utilization while something is only periodically sent on the queue.

We also expose a `torch.distributed.QueueEmptyError` so users can catch the error and handle it accordingly.
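A hedged sketch of the polling pattern this enables; the queue method name and the non-blocking flag are assumptions based on this description, not documented API:
```python
import torch.distributed as dist

store = dist.TCPStore("127.0.0.1", 29500, world_size=1, is_master=True)
try:
    item = store.queue_pop("work", block=False)  # poll; signature assumed
except dist.QueueEmptyError:
    item = None  # nothing queued yet; keep the GPU busy with other work
```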

Test plan:

```
pytest test/distributed/test_store.py -k queue -v -s -x
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151485
Approved by: https://github.com/fduwjj, https://github.com/tianfengfrank
2025-04-18 02:14:50 +00:00
3ed5f1fb77 [CUDA][cuBLAS] Aten GEMM overload for FP32 output from FP16/BF16 inputs (#150812)
Enable FP32 output from FP16/BF16 GEMMs in aten with cuBLAS. Accumulation for these GEMMs is generally already done in FP32. Adds the functionality to the following aten operators:
* mm
* bmm
* addmm
* baddmm

Follow up of customer issue: https://github.com/pytorch/pytorch/issues/146241#issuecomment-2781889390

Differential Revision: [D73126191](https://our.internmc.facebook.com/intern/diff/D73126191)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150812
Approved by: https://github.com/ngimel, https://github.com/eqy
2025-04-18 01:53:26 +00:00
a6182903cd Update PyTorchStreamReader API to take cpu allocator override (#150439)
Summary: Add allocator param in getRecord

Test Plan:
newly added UT
```
buck test caffe2/caffe2/serialize:inline_container_test
```

Differential Revision: D72252585

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150439
Approved by: https://github.com/albanD
2025-04-18 01:53:14 +00:00
b434322075 Fix has_free_symbols (#151492)
Used to fail for `self.assertFalse(has_free_symbols(sympy.S.true))`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151492
Approved by: https://github.com/bobrenjc93
ghstack dependencies: #151170, #151171
2025-04-18 01:19:01 +00:00
c2a202169d Fix implicit state dict modification (#151436)
Summary: Previously we were modifying ep.state_dict while running decomp, which we shouldn't do.

Test Plan: CI

Fixes: https://github.com/pytorch/pytorch/issues/151366

Differential Revision: D73102315

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151436
Approved by: https://github.com/angelayi
2025-04-18 00:58:55 +00:00
34266836d5 [Inductor] Suppress cuda init error for CPU only Inductor (#151528)
**Summary**
After https://github.com/pytorch/pytorch/pull/151255, invoking `torch.compile` on a non-CUDA device prints the following error:
`E0416 23:39:55.953000 418833 torch/_inductor/codegen/cuda/cuda_env.py:22] Error getting cuda arch: Torch not compiled with CUDA enabled.`
This PR updates the code to initialize `PRESETS` only when CUDA is available, preventing this error message from being printed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151528
Approved by: https://github.com/jansel, https://github.com/henrylhtsang
2025-04-18 00:55:01 +00:00
9e235c549c [C10D] avoid computing global_rank when group_rank is used (#151373)
collective APIs accept either group or global rank for src/dst rank.

We provide a helper `_canonicalize_group_rank` which converts from maybe
group or maybe global to one particular format (defined by the kwarg
return_global: bool=False).

In this PR we stop performing the mapping lookup that converts group to
global or global to group in the case that the caller wants us to return
the same value that was passed in.  The PR should be functionally
equivalent, except in cases where the mapping itself would raise an
exception but the mapping was not necessary in the first place.

This has come up in cases where people create new process groups outside
of 'init_process_group' APIs and group-specific ranks may not have a
valid mapping to the 'global' rank.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151373
Approved by: https://github.com/xunnanxu, https://github.com/d4l3k
2025-04-17 23:53:50 +00:00
d8bafd23ab [DDP] add one option to allow skipping all reduce unused parameters (#151503)
Summary: Add an option to allow skipping the all-reduce of unused parameters; this could significantly improve training throughput when the number of unused parameters in the model is large.

Test Plan: unit tests, CI

Differential Revision: D72282069

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151503
Approved by: https://github.com/mrshenli
2025-04-17 23:30:19 +00:00
6d46b530fc Remove libdevice ops in inductor (#151562)
Now that we track dtypes during codegen, we can delete all these extra ops that worked around the problem by doing dispatch at lowering time.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151562
Approved by: https://github.com/isuruf, https://github.com/jansel
2025-04-17 22:18:00 +00:00
bdb34f55a0 [fake tensor cache] Support index with non bool/int8 indices (#151477)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151477
Approved by: https://github.com/zou3519, https://github.com/bdhirsh
ghstack dependencies: #151330, #151256, #151357
2025-04-17 21:51:08 +00:00
0129c3a4e1 Use reusable binary docker build action for almalinux, clean up script (#151483)
This is part of splitting up https://github.com/pytorch/pytorch/pull/150558 into smaller chunks, please see that for more context

Use the binary docker build action from https://github.com/pytorch/pytorch/pull/151471

Change the workflow trigger to be all of .ci/docker so it will make a new image + tag whenever it changes.

build script:
* change to be independent of the CUDA_VERSION env var, since all the info should be in the imagename:tag
* remove docker push parts since that will happen during the workflow
* clean up a bit
* make the build script more like the CI build script (use a temp image name)

I don't think this image is actually used anywhere

Also push the docker image to imagename:tag; I removed this in the PR that added the reusable workflow because I thought it was not in the original scripts, but it actually is there.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151483
Approved by: https://github.com/ZainRizvi
2025-04-17 21:32:56 +00:00
652fa451a4 [dynamo] support fb internal bytecode EAGER_IMPORT_NAME (#151362)
Differential Revision: [D73127097](https://our.internmc.facebook.com/intern/diff/D73127097)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151362
Approved by: https://github.com/oulgen
2025-04-17 21:19:45 +00:00
d5dda82586 [export] Integrate meta kernel generation with draft-export (#150809)
If a custom operator does not contain a fake impl, currently draft-export will use the real-tensor propagation to get an output for the operator and continue tracing. However if we retrace the exported model using `ep.run_decompositions`, or `export`, or run the exported program with fake tensors, we'll still fail because there's no fake impl.

With this PR, after draft-export we will generate an operator profile for each operator call that we encounter, and store this on the report attached to the exported program `ep._report.op_profiles`. Users can then use `torch._library.fake_profile.register_fake_profile` to temporarily generate and register a fake impl based on these operator profiles. This way future fake tensor retracing will work.

The workflow would look something like:
```python
class M(torch.nn.Module):
    def forward(self, a, b):
        res = torch.ops.mylib.foo8(a, b)  # no fake impl
        return res

ep = export(M(), (torch.ones(3, 4), torch.ones(3, 4)) # this fails bc no fake impl
ep = draft_export(M(), (torch.ones(3, 4), torch.ones(3, 4))

ep.run_decompositions()  # this fails bc no fake impl
# this registers fake impls based on the profiles
with torch._library.fake_profile.register_fake_profile(ep._report.op_profiles):
    decomp = ep.run_decompositions()  # this works

new_inp = (
    torch.ones(2, 3, 4),
    torch.ones(2, 3, 4),
)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150809
Approved by: https://github.com/zou3519
2025-04-17 20:52:31 +00:00
4f62dccbda [Cutlass] Implement Epilogue Argument emitter (#150903)
This implements epilogue visitor tree argument generation (example type [here](3fe62887d8/include/cutlass/epilogue/fusion/sm90_callbacks_tma_warpspecialized.hpp (L332))).

Details:
The codegen task here is to implement a function which can generate a tree of C++ structs and properly extract the correct properties from Inductor buffers and write them to the correct locations in the generated struct. To implement this with the minimum amount of code, I generate the cutlass DAGIR (the EVT internal representation), which specifically has a pass, [pass_argument_type.py ](5e497243f7/python/cutlass/backend/evt/passes/pass_argument_type.py (L4)), which generates a nested tree of custom argument types for each node in the DAGIR. This nested tree of constructors is then passed kwargs to fill in the proper values, where the node's name is used to differentiate between different values in the kwarg dictionary. This, however, is non-customizable; the nested tree of EVT args is a nested tree of ctypes which looks for *actual values* so that this object can be passed directly to the cutlass-python C++ runner. Inductor, on the other hand, needs to fill this struct with string C++ expressions representing the values (or extracting the values from kernel launcher args). So `_render_argument_type` implements this: it iterates over the tree of types created by pass_argument_type.py and generates a string representing the nested structs, filling in C++ expressions representing the different fields.

Long term plan:
Long term, I will ask NVIDIA to provide an overridable [visitor_factory](5e497243f7/python/cutlass/backend/evt/passes/pass_argument_type.py (L82)) which could allow us to override the behavior of pass_argument_type.py to generate the string we would like during DAGIR generation.

Previously merged:
* #150346
* #150345
* #150344

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150903
Approved by: https://github.com/henrylhtsang, https://github.com/eellison
2025-04-17 20:30:21 +00:00
8e0f9fbccf [c10] helpers for runtime c10::alias re-use (#151361)
Summary: we need these to check whether the input tensor was re-sized/strided between executions when choosing to alias

Test Plan: CI

Reviewed By: henryoier

Differential Revision: D73061676

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151361
Approved by: https://github.com/SherlockNoMad
2025-04-17 20:27:17 +00:00
da580123a0 [BE][Easy]: Dedupe a TypeAlias in PrimsCommon (#151565)
Replaces a duplicate TypeAlias with a reference to the global constant for them
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151565
Approved by: https://github.com/albanD
2025-04-17 19:59:41 +00:00
c4688af254 Fix lint
Introduced by fb6ac2f16132f7953711ce6924bc2ee4a033228c
2025-04-17 12:48:52 -07:00
473a38b562 [DCP] Add logging for _stateful_to_state_dict(), stage_state_dict(), and synchronize_staging() (#151320)
Summary: As titled.

Test Plan: CI

Differential Revision: D73040700

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151320
Approved by: https://github.com/saumishr
2025-04-17 12:48:39 -07:00
c5b10ff119 [BE][Easy]: Normalize Dim typing in torch distributed (#151566)
Improve typing using prims_common dtypes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151566
Approved by: https://github.com/albanD
2025-04-17 19:30:09 +00:00
2ed2cb5805 add generalized pareto distribution (GPD) (#135968)
Add the GPD as a distribution class
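A hedged usage sketch; the class name and parameterization (`loc`, `scale`, `concentration`) are assumed to follow the standard GPD convention and may differ from the final API:
```python
import torch
from torch.distributions import GeneralizedPareto  # assumed export name

d = GeneralizedPareto(
    loc=torch.tensor(0.0),
    scale=torch.tensor(2.0),
    concentration=torch.tensor(0.1),
)
samples = d.sample((5,))
log_probs = d.log_prob(samples)
```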

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135968
Approved by: https://github.com/albanD

Co-authored-by: Alexander März <statmixedmlgit@gmail.com>
2025-04-17 18:51:02 +00:00
7e2081fa93 Optimize interpolate saturate description (#151304)
Fixes #108225

## Test Result

### Before

![image](https://github.com/user-attachments/assets/bdbf8a5c-d5a4-44a5-b81e-2cbb5b8bfd02)

### After

![image](https://github.com/user-attachments/assets/1c21a27d-1700-4661-9988-dbb1cdc81fa2)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151304
Approved by: https://github.com/albanD

Co-authored-by: albanD <desmaison.alban@gmail.com>
2025-04-17 18:34:29 +00:00
055e59e709 [bazel] Build flatbuffers within bazel (#151364)
This is similar to how we handle protobufs and it makes it more convenient for bazel users to handle their version of flatbuffers. (Flatbuffers is very picky about the generated code matching the runtime). Instead of using the checked in generated code, we generate it on the fly.

This is relevant to https://github.com/pytorch/pytorch/issues/112903, because having the version of flatbuffers tied to pytorch will make pytorch difficult to use as an external workspace.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151364
Approved by: https://github.com/malfet
2025-04-17 18:33:51 +00:00
3a6b3c8e0e Combine windows x64 and arm64 yaml template files (#149850)
While introducing Windows-Arm64 nightly workflows, we created a separate template file for win-arm64. This PR combines x64&arm64 and deletes the win-arm64 one.
Fixes #148776

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149850
Approved by: https://github.com/ozanMSFT, https://github.com/malfet
2025-04-17 17:58:55 +00:00
1ce7969e81 Revert "[Easy] Add more check for elapsedTime of torch.xxx.Event and torch.Event (#151404)"
This reverts commit 90c5b86cd8fcbbe6ee7c46ad17a05767f884794b.

Reverted https://github.com/pytorch/pytorch/pull/151404 on behalf of https://github.com/clee2000 due to broke a cpp extension test? test_cpp_extensions_stream_and_event.py::TestCppExtensionStreamAndEvent::test_stream_event [GH job link](https://github.com/pytorch/pytorch/actions/runs/14519277500/job/40736981315) [HUD commit link](90c5b86cd8), bad TD ([comment](https://github.com/pytorch/pytorch/pull/151404#issuecomment-2813649667))
2025-04-17 17:45:41 +00:00
ae6f6b8efb [Inductor] Remove singleton tiling splits when prefer_nd_tiling=True (#151508)
# Issue
Users who want block pointers are likely to use the config settings `{"triton.use_block_ptr": True, "triton.prefer_nd_tiling": True, "triton.max_tiles": 3}`. Among other things, these settings allow us to generate 3D block pointers for broadcasts. However, broadcasts which don't truly require 3D tiling often end up introducing a superfluous tiling dimension of size 1.

For example, given this function with elementwise multiplication:
```
def foo(x, y, z):
            a = x * y
            b = 128.0
            c = a * b
            d = a * z
            e = x * z
            return a, c, d, e

inps = [
            torch.randn((8, 11, 128), device=self.device),
            torch.randn((128,), device=self.device),
            torch.randn((8, 11, 128), device=self.device),
]

torch.compile(foo)(*inps)
```

We get the following Triton kernels:
```
@triton.jit
def triton_poi_fused_mul_0(in_ptr0, in_ptr1, out_ptr0, znumel, ynumel, xnumel, ZBLOCK : tl.constexpr, YBLOCK : tl.constexpr, XBLOCK : tl.constexpr):
    znumel = 88
    ynumel = 1
    xnumel = 128
    zoffset = tl.program_id(2) * ZBLOCK
    zindex = zoffset + tl.arange(0, ZBLOCK)[:, None, None]
    zmask = zindex < znumel
    yoffset = tl.program_id(1) * YBLOCK
    yindex = yoffset + tl.arange(0, YBLOCK)[None, :, None]
    ymask = tl.full([ZBLOCK, YBLOCK, XBLOCK], True, tl.int1)
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[None, None, :]
    xmask = xindex < xnumel
    x1 = xindex
    z0 = zindex
    tmp0 = tl.load(tl.make_block_ptr(in_ptr0, shape=[88, 128], strides=[128, 1], block_shape=[ZBLOCK, XBLOCK], order=[1, 0], offsets=[zoffset, xoffset]), boundary_check=[0, 1], eviction_policy='evict_last')[:, None, :]
    tmp1 = tl.load(tl.make_block_ptr(in_ptr1, shape=[128], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), boundary_check=[0], eviction_policy='evict_last')[None, None, :]
    tmp2 = tmp0 * tmp1
    tl.store(tl.make_block_ptr(out_ptr0, shape=[88, 128], strides=[128, 1], block_shape=[ZBLOCK, XBLOCK], order=[1, 0], offsets=[zoffset, xoffset]), tl.reshape(tl.broadcast_to(tmp2, [ZBLOCK, YBLOCK, XBLOCK]), [ZBLOCK, XBLOCK]).to(tl.float32), boundary_check=[0, 1])
''', device_str='cuda')

@triton.jit
def triton_poi_fused_mul_1(in_ptr0, in_ptr1, in_ptr2, out_ptr0, out_ptr1, out_ptr2, xnumel, XBLOCK : tl.constexpr):
    xnumel = 11264
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex
    tmp0 = tl.load(tl.make_block_ptr(in_ptr0, shape=[11264], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), boundary_check=[0])
    tmp3 = tl.load(tl.make_block_ptr(in_ptr1, shape=[11264], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), boundary_check=[0])
    tmp5 = tl.load(tl.make_block_ptr(in_ptr2, shape=[11264], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), boundary_check=[0])
    tmp1 = 128.0
    tmp2 = tmp0 * tmp1
    tmp4 = tmp0 * tmp3
    tmp6 = tmp5 * tmp3
    tl.store(tl.make_block_ptr(out_ptr0, shape=[11264], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), tl.broadcast_to(tmp2, [XBLOCK]).to(tl.float32), boundary_check=[0])
    tl.store(tl.make_block_ptr(out_ptr1, shape=[11264], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), tl.broadcast_to(tmp4, [XBLOCK]).to(tl.float32), boundary_check=[0])
    tl.store(tl.make_block_ptr(out_ptr2, shape=[11264], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), tl.broadcast_to(tmp6, [XBLOCK]).to(tl.float32), boundary_check=[0])
''', device_str='cuda')
```

Note that one kernel has `ynumel=1`. The extra dimension results in more expensive address calculations, and also seems to prevent fusion.

# Fix

To fix this, this PR filters out any splits of size 1 from the `prefer_nd_tiling` algorithm. This results in the following fused kernel, with 2D tiling:

```
@triton.jit
def triton_poi_fused_mul_0(in_ptr0, in_ptr1, in_ptr2, out_ptr0, out_ptr1, out_ptr2, out_ptr3, ynumel, xnumel, YBLOCK : tl.constexpr, XBLOCK : tl.constexpr):
    ynumel = 88
    xnumel = 128
    yoffset = tl.program_id(1) * YBLOCK
    yindex = yoffset + tl.arange(0, YBLOCK)[:, None]
    ymask = yindex < ynumel
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[None, :]
    xmask = xindex < xnumel
    x1 = xindex
    y0 = yindex
    tmp0 = tl.load(tl.make_block_ptr(in_ptr0, shape=[88, 128], strides=[128, 1], block_shape=[YBLOCK, XBLOCK], order=[1, 0], offsets=[yoffset, xoffset]), boundary_check=[0, 1], eviction_policy='evict_last')
    tmp1 = tl.load(tl.make_block_ptr(in_ptr1, shape=[128], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), boundary_check=[0], eviction_policy='evict_last')[None, :]
    tmp5 = tl.load(tl.make_block_ptr(in_ptr2, shape=[88, 128], strides=[128, 1], block_shape=[YBLOCK, XBLOCK], order=[1, 0], offsets=[yoffset, xoffset]), boundary_check=[0, 1], eviction_policy='evict_last')
    tmp2 = tmp0 * tmp1
    tmp3 = 128.0
    tmp4 = tmp2 * tmp3
    tmp6 = tmp2 * tmp5
    tmp7 = tmp0 * tmp5
    tl.store(tl.make_block_ptr(out_ptr0, shape=[88, 128], strides=[128, 1], block_shape=[YBLOCK, XBLOCK], order=[1, 0], offsets=[yoffset, xoffset]), tl.broadcast_to(tmp2, [YBLOCK, XBLOCK]).to(tl.float32), boundary_check=[0, 1])
    tl.store(tl.make_block_ptr(out_ptr1, shape=[88, 128], strides=[128, 1], block_shape=[YBLOCK, XBLOCK], order=[1, 0], offsets=[yoffset, xoffset]), tl.broadcast_to(tmp4, [YBLOCK, XBLOCK]).to(tl.float32), boundary_check=[0, 1])
    tl.store(tl.make_block_ptr(out_ptr2, shape=[88, 128], strides=[128, 1], block_shape=[YBLOCK, XBLOCK], order=[1, 0], offsets=[yoffset, xoffset]), tl.broadcast_to(tmp6, [YBLOCK, XBLOCK]).to(tl.float32), boundary_check=[0, 1])
    tl.store(tl.make_block_ptr(out_ptr3, shape=[88, 128], strides=[128, 1], block_shape=[YBLOCK, XBLOCK], order=[1, 0], offsets=[yoffset, xoffset]), tl.broadcast_to(tmp7, [YBLOCK, XBLOCK]).to(tl.float32), boundary_check=[0, 1])
''', device_str='cuda')
```
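
Conceptually, the filtering step looks something like this (a hypothetical sketch, not the actual `prefer_nd_tiling` code):

```python
def drop_trivial_splits(splits: list[int]) -> list[int]:
    # Remove size-1 splits so e.g. an (88, 1, 128) tiling becomes (88, 128),
    # avoiding the extra index math and the lost fusion described above.
    kept = [s for s in splits if s != 1]
    return kept if kept else [1]  # keep at least one dimension
```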

# Test plan
Added the test case above to CI. Checked that a single kernel is generated with 2D tiling.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151508
Approved by: https://github.com/jansel
2025-04-17 17:37:45 +00:00
b4550541ea [ROCm] upgrade nightly wheels to rocm6.4 (#151355)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151355
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-04-17 17:29:07 +00:00
ef64beb232 Include post grad gm and fx runnable in cache artifacts for tlparse (#151469)
Fixed #151462

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151469
Approved by: https://github.com/bdhirsh
2025-04-17 17:14:13 +00:00
ee3366dbb2 [MegaCache] Encode key in base64 (#151472)
I have noticed that there are some errors like
```
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x95 in position 169302: invalid start byte
```

I haven't been able to repro this locally yet, but this change should fix the encoding issues.
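
As a minimal illustration of the failure mode and the fix (not the actual MegaCache code):

```python
import base64

raw = b"\x95\x00\xffbinary-cache-bytes"  # arbitrary bytes, not valid UTF-8

try:
    raw.decode("utf-8")                  # the kind of call that raises UnicodeDecodeError
except UnicodeDecodeError as e:
    print("decode failed:", e)

encoded = base64.b64encode(raw).decode("ascii")  # always plain ASCII text
assert base64.b64decode(encoded) == raw          # lossless round trip
```
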
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151472
Approved by: https://github.com/masnesral
2025-04-17 17:12:22 +00:00
8404c09b15 [MegaCache] Rename the PGO artifact when used between different jobs (#151482)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151482
Approved by: https://github.com/bobrenjc93, https://github.com/jamesjwu
2025-04-17 17:09:29 +00:00
fe90a5c140 [Easy] Optimize clip_grad param description (#151532)
Fix missing optional description in `clip_grad_norm_` and `clip_grad_value_`

## Test Result

### Before

![image](https://github.com/user-attachments/assets/3393dd4b-a730-4dd4-8304-9b895ac669d4)

![image](https://github.com/user-attachments/assets/220c4738-a728-474b-b06d-b5be7660d150)

### After

![image](https://github.com/user-attachments/assets/5637bb68-3b6d-49a3-8ee1-3af636950aa0)

![image](https://github.com/user-attachments/assets/c0f1d966-a9ba-4fac-a874-9d4955f6e0d6)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151532
Approved by: https://github.com/Skylion007, https://github.com/albanD
2025-04-17 16:47:38 +00:00
c3a18f6126 [AOTInductor] Add states for constant folding process (#151273)
Summary:
We add states to the constant folding process for AOTInductor.
There are 3 states:
(1) None: no constants are loaded and the model is uninitialized.
(2) Initialized: constants are loaded, but not yet folded.
(3) Folded: the model is fully ready with folded constants.

Note that even if constant folding is not enabled, we still only run when the
state is FOLDED. This is okay because without constant folding, the
transition from INITIALIZED to FOLDED is just a pass-through.
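
A minimal Python sketch of the state machine described above (the real implementation lives in AOTInductor's C++ runtime; names here are illustrative):

```python
from enum import Enum, auto

class ConstantState(Enum):
    NONE = auto()         # no constants loaded, uninitialized
    INITIALIZED = auto()  # constants loaded, not yet folded
    FOLDED = auto()       # constants folded, model fully ready

def ready_to_run(state: ConstantState) -> bool:
    # We only run once FOLDED; without constant folding enabled,
    # INITIALIZED -> FOLDED is just a pass-through transition.
    return state is ConstantState.FOLDED
```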

Test Plan:
python test/inductor/test_aot_inductor.py -k test_constant_folding_with_update

Differential Revision: [D73002538](https://our.internmc.facebook.com/intern/diff/D73002538)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151273
Approved by: https://github.com/jingsh, https://github.com/desertfire
2025-04-17 16:41:38 +00:00
4843ce7611 [BE] Remove outdated script to check namespace BC (#151453)
Now that we have bc_lint in CI, this script is no longer needed (nor has it ever been conclusive). I've already updated the Runbook to not need this script.

Suppressing bc_lint as this script is not shipped as a part of torch; it is not user facing! For context, this script is (rarely) used by the release notes manager to ensure BC across releases. It had been broken since at least 2.6.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151453
Approved by: https://github.com/albanD, https://github.com/jbschlosser
2025-04-17 15:43:53 +00:00
90c5b86cd8 [Easy] Add more check for elapsedTime of torch.xxx.Event and torch.Event (#151404)
As the title stated

**Changes:**
- Add **record**, **query** and **enable_timing** check
- Add related tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151404
Approved by: https://github.com/albanD
2025-04-17 15:30:12 +00:00
7f528751cc [Inductor] fix torch._inductor.exc.InductorError: KeyError (#151424)
Fixes #151423, which is a regression after #150845

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151424
Approved by: https://github.com/eellison
2025-04-17 15:07:43 +00:00
bb11122e12 Update docker image names for s390x (#151426)
Disable switching the tag for s390x docker images and keep it that way until they are published. Otherwise there's no way to determine in advance which docker image names are needed for building s390x binaries.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151426
Approved by: https://github.com/malfet, https://github.com/seemethere
2025-04-17 12:47:23 +00:00
fa6e842527 [MPS] Make fused rms_norm traceable (#150661)
This fixes a regression introduced by https://github.com/pytorch/pytorch/issues/150629#issue-2970312779, which I should have reviewed more thoroughly.

- Defined `_fused_rms_norm`, added MPS-only implementation for it and dispatch from `rms_norm_symint`,  which is registered as `CompositeImplicitAutograd`, i.e. it is not supposed to do any computations over Tensor, only dispatch to other ops
- Register `_fused_rms_norm` as a fallback in `torch/_inductor/lowering.py`
- Added unit test to avoid those regressions in the future

TODO:
- Get rid of this op, change `rms_norm_symint` definition to `CompositeExplicitAutograd` and implement backward function in `tools/autograd/derivatives.yaml`
- Benchmark compiler and re-enable decomp as follows when compiled code is faster
```python
@register_decomposition(aten._rms_norm_fused)
def rms_norm_fused(
    self: torch.Tensor, ndim: int, weight: torch.Tensor, eps: float
) -> torch.Tensor:
    dtr = [self.dim() - i - 1 for i in range(ndim)]
    return self * weight * (self.pow(2).mean(dtr, keepdim=True).add(eps).rsqrt())
```

Fixes https://github.com/pytorch/pytorch/issues/150629

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150661
Approved by: https://github.com/manuelcandales, https://github.com/jansel
2025-04-17 11:32:00 +00:00
41b82611ee Revert "[Reopen] [Intel GPU] Set higher tolerance for some models only on XPU Device (#144756)"
This reverts commit 300e0ee13c08ef77e88f32204a2e0925c17ce216.

Reverted https://github.com/pytorch/pytorch/pull/144756 on behalf of https://github.com/malfet due to Broke rocm torch bench runs with  TypeError: unsupported operand type(s) for |: 'set' and 'list' ([comment](https://github.com/pytorch/pytorch/pull/144756#issuecomment-2812525970))
2025-04-17 11:09:01 +00:00
e4fe67f623 Revert "[MPS] Make fused rms_norm traceable (#150661)"
This reverts commit 682f09ec51526aefe6b504ac8081944baa866556.

Reverted https://github.com/pytorch/pytorch/pull/150661 on behalf of https://github.com/malfet due to Has decomp started to fail again ([comment](https://github.com/pytorch/pytorch/pull/150661#issuecomment-2812520408))
2025-04-17 11:06:05 +00:00
32c79da789 [Easy] Fix the compilation warning of BlasKernel. (#151302)
As the title stated.

Change Before:
```C++
[2/21] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/BlasKernel.cpp.o
/root/Git.d/pytorch/pytorch/aten/src/ATen/native/BlasKernel.cpp:346:6: warning: ‘void at::native::blas_impl::gemv_fast_path(const char*, const int*, const int*, const scalar_t*, const scalar_t*, const int*, const scalar_t*, const int*, const scalar_t*, scalar_t*, const int*) [with scalar_t = c10::Half]’ defined but not used [-Wunused-function]
  346 | void gemv_fast_path<at::Half>(
      |      ^~~~~~~~~~~~~~~~~~~~~~~~
/root/Git.d/pytorch/pytorch/aten/src/ATen/native/BlasKernel.cpp:329:6: warning: ‘bool at::native::blas_impl::gemv_use_fast_path(char, int64_t, int64_t, scalar_t, int64_t, int64_t, scalar_t, int64_t) [with scalar_t = c10::Half]’ defined but not used [-Wunused-function]
  329 | bool gemv_use_fast_path<at::Half>(
      |      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
/root/Git.d/pytorch/pytorch/aten/src/ATen/native/BlasKernel.cpp:301:6: warning: ‘void at::native::blas_impl::gemv_fast_path(const char*, const int*, const int*, const scalar_t*, const scalar_t*, const int*, const scalar_t*, const int*, const scalar_t*, scalar_t*, const int*) [with scalar_t = c10::BFloat16]’ defined but not used [-Wunused-function]
  301 | void gemv_fast_path<at::BFloat16>(
      |      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
/root/Git.d/pytorch/pytorch/aten/src/ATen/native/BlasKernel.cpp:273:6: warning: ‘bool at::native::blas_impl::gemv_use_fast_path(char, int64_t, int64_t, scalar_t, int64_t, int64_t, scalar_t, int64_t) [with scalar_t = c10::BFloat16]’ defined but not used [-Wunused-function]
  273 | bool gemv_use_fast_path<at::BFloat16>(
      |      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151302
Approved by: https://github.com/malfet, https://github.com/aditew01
ghstack dependencies: #151427
2025-04-17 10:50:22 +00:00
f29fe78cf2 [Dynamo] Implement sourceless named tuple support (#151266)
Fixes https://github.com/pytorch/pytorch/issues/140903

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151266
Approved by: https://github.com/williamwen42, https://github.com/StrongerXi, https://github.com/anijain2305
2025-04-17 08:43:03 +00:00
49c91b4be9 [Easy][Building] Fix the warning of int4mm.cu when building (#151427)
As the title stated.

**Changes Before:**

```C++
[999/1526] Building CUDA object caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/int4mm.cu.o
/root/Git.d/pytorch/pytorch/aten/src/ATen/native/cuda/int4mm.cu(142): warning #177-D: variable "at::native::kWarpSize" was declared but never referenced
  constexpr int32_t kWarpSize = 32;
                    ^

Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151427
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-04-17 08:21:32 +00:00
a05cc9f494 Remove Clear Cache Time from do_bench_using_profiling (#150696)
Summary: In most instances, this action would take ~33% of the total run time, which means that our benchmark would previously differ from the end results by a lot.

Test Plan:
We can compare the benchmark results for
```
CUDA_VISIBLE_DEVICES=4,5 buck run mode/opt -c python.package_style=inplace -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100a //caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark -- --model-snapshot-id=672308665_0 --lower-backend=AOT_INDUCTOR --node-replacement-dict="{'torch.nn.Linear':{'(autotune)': 'fp8_float_model_dynamic_quantization_rowwise'}}" --trace-aot-inductor-module=True --disable-acc-tracer=False --batch-size=1024
```
before and after the diff, and notice that on average, the benchmark results decrease by ~0.1ms per iteration, which is more closely aligned with the lowered modules.

Differential Revision: D72469845

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150696
Approved by: https://github.com/frank-wei
2025-04-17 07:25:41 +00:00
e0f05229e9 [ez] Make relaxed constraint error message more user friendly (#151407)
Fixes #151356

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151407
Approved by: https://github.com/Skylion007
2025-04-17 06:43:10 +00:00
10a54ffe5a [inductor] Reduce runtime of CPU OpInfo tests (#151435)
`has_triton()` returns True if Triton is present on the system and supports _any_ backend we care about. In this case, that means we _always_ check gradients, even though the intended behavior is to skip gradients when testing on CPU.

Fixes a bug from #146911.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151435
Approved by: https://github.com/masnesral
2025-04-17 05:25:14 +00:00
b7d9f44602 [executorch hash update] update the pinned executorch hash (#151493)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151493
Approved by: https://github.com/pytorchbot
2025-04-17 05:14:12 +00:00
682f09ec51 [MPS] Make fused rms_norm traceable (#150661)
This fixes a regression introduced by https://github.com/pytorch/pytorch/issues/150629#issue-2970312779, which I should have reviewed more thoroughly.

- Defined `_fused_rms_norm`, added MPS-only implementation for it and dispatch from `rms_norm_symint`,  which is registered as `CompositeImplicitAutograd`, i.e. it is not supposed to do any computations over Tensor, only dispatch to other ops
- Register `_fused_rms_norm` as a fallback in `torch/_inductor/lowering.py`
- Added unit test to avoid those regressions in the future

TODO:
- Get rid of this op, change `rms_norm_symint` definition to `CompositeExplicitAutograd` and implement backward function in `tools/autograd/derivatives.yaml`
- Benchmark compiler and re-enable decomp as follows when compiled code is faster
```python
@register_decomposition(aten._rms_norm_fused)
def rms_norm_fused(
    self: torch.Tensor, ndim: int, weight: torch.Tensor, eps: float
) -> torch.Tensor:
    dtr = [self.dim() - i - 1 for i in range(ndim)]
    return self * weight * (self.pow(2).mean(dtr, keepdim=True).add(eps).rsqrt())
```

Fixes https://github.com/pytorch/pytorch/issues/150629

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150661
Approved by: https://github.com/manuelcandales, https://github.com/jansel
2025-04-17 04:15:24 +00:00
17ea9d1478 Revert "[DCP] Add logging for _stateful_to_state_dict(), stage_state_dict(), and synchronize_staging() (#151320)"
This reverts commit fb6ac2f16132f7953711ce6924bc2ee4a033228c.

Reverted https://github.com/pytorch/pytorch/pull/151320 on behalf of https://github.com/malfet due to Broke lint ([comment](https://github.com/pytorch/pytorch/pull/151320#issuecomment-2811669325))
2025-04-17 03:57:03 +00:00
a94483329c [MPS] Start benchmarking compile results (#151155)
To know passrate and speedup
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151155
Approved by: https://github.com/dcci
2025-04-17 02:45:39 +00:00
f5851efed9 Fix torch.autograd.backward inputs validation (#150975)
- Fixes #150883
- Fixes #70504

This is my first PR to pytorch, so please tell me if I'm forgetting anything.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150975
Approved by: https://github.com/soulitzer
2025-04-17 02:11:13 +00:00
6f9ffaa991 [c10d][fr] Fix script for uneven reduce scatter and update test cases (#151475)
Somehow the type string for reduce scatter is "REDUCE_SCATTER" not "REDUCESCATTER". This PR fixed it and added more test cases.

Differential Revision: [D73141245](https://our.internmc.facebook.com/intern/diff/D73141245)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151475
Approved by: https://github.com/fegin
2025-04-17 02:11:08 +00:00
cd1db55817 Fix tensor_constant name collision in aot_export_module (#151123)
Summary:
When we have an exported program that looks like this:

```
ExportedProgram:
    class GraphModule(torch.nn.Module):
        def forward(self, b__tensor_constant0: "f32[1]", ... c_lifted_tensor_0: "i64[925]", …. , tupleized_input_0_0: "f32[10, 2139]",

            clone: "i64[925]" = torch.ops.aten.clone.default(c_lifted_tensor_0);  c_lifted_tensor_0 = None

            index_select: "f32[10, 925]" = torch.ops.aten.index_select.default(tupleized_input_0_0, 1, clone);  clone = None
```

The graph after `aot_export_module` could have a name collision: notice that the `_tensor_constant0` arg of `clone` is different from the `_tensor_constant0` in the input module.

```
def forward(self):
        arg9_1: "f32[10, 2139]"

        _tensor_constant0: "f32[1]" = self._tensor_constant0 # this should be int64, conflicted with the original _tensor_constant0, had a clone on this constant before lifting

        index: "f32[10, 925]" = torch.ops.aten.index.Tensor(arg9_1, [None, _tensor_constant0]);  _tensor_constant0 = None
```

This caused the `tensors used as indices must binary, int...` aoti error on PT2I dashboard because later we used `clone` as index.

We hit this error because we created a new `_tensor_constant0` [here](https://github.com/pytorch/pytorch/blob/main/torch/fx/_symbolic_trace.py#L403-L412), and the new `_tensor_constant0` overrides the original `_tensor_constant0` on the input Module in `_unlift_graph`. The `arg` for `clone` is created at `create_proxy` in `proxy.py`.

To fix this, we do a graph pass before we unlift the graph inputs to avoid name collisions, as sketched below.
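
A rough sketch of the renaming idea (hypothetical helper, not the actual pass):

```python
def unique_attr_name(gm, base: str = "_tensor_constant0") -> str:
    # Pick a fresh attribute name so a newly created constant does not
    # shadow an existing _tensor_constantN on the input module.
    if not hasattr(gm, base):
        return base
    i = 1
    while hasattr(gm, f"{base}_{i}"):
        i += 1
    return f"{base}_{i}"
```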

Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:torchbind -- -r aot_compile_constant_folding

buck2 run mode/dev-nosan caffe2/test/inductor:test_aot_inductor -- -r aoti_constant_tensor_name_collision
```

Differential Revision: D72761937

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151123
Approved by: https://github.com/tugsbayasgalan, https://github.com/jingsh
2025-04-17 01:52:21 +00:00
bf92c9883b Refine host caching allocator (#151403)
# Motivation
This stack of PRs aims to generalize and improve PyTorch host allocator code.

This PR introduces a `DeleterFnPtr` template parameter to `CachingHostAllocatorInterface` to resolve circular dependency issues. This change allows for better code reuse and simplifies the implementation of host allocators.

# Additional Context
TODO:
- [ ] Unify host allocator related API
- [ ] Deprecate those device-specific legacy API
- [ ] Move `is_pinned` to host allocator

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151403
Approved by: https://github.com/gujinghui, https://github.com/albanD
2025-04-17 01:50:47 +00:00
fb6ac2f161 [DCP] Add logging for _stateful_to_state_dict(), stage_state_dict(), and synchronize_staging() (#151320)
Summary: As titled.

Test Plan: CI

Differential Revision: D73040700

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151320
Approved by: https://github.com/saumishr
2025-04-17 01:08:32 +00:00
300e0ee13c [Reopen] [Intel GPU] Set higher tolerance for some models only on XPU Device (#144756)
Reopen the previous stale closed PR https://github.com/pytorch/pytorch/pull/134192

We need to increase the tolerance slightly to ensure that certain models pass accuracy check on the XPU device.
This pull request preserves the original tolerance threshold for the CUDA device and introduces a new key higher_fp16_bf16_xpu, which only impacts the XPU device.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144756
Approved by: https://github.com/chuanqi129, https://github.com/EikanWang, https://github.com/desertfire
2025-04-17 00:26:55 +00:00
2fd26925c4 improve noop elimination for view (#151095)
This PR improves noop elimination.

### View Noop

```python
>>> torch.Size([1,2,3]) == [1,2,3]
False
>>> torch.Size([1,2,3]) == (1,2,3)
True
```
So we add `tuple(size)` in `view_noop`.
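
A rough sketch of the normalized comparison (hypothetical, not the exact Inductor helper):

```python
import torch

def is_view_noop(old_size: torch.Size, new_size) -> bool:
    # torch.Size compares equal to tuples but not to lists,
    # so normalize both sides to tuples before comparing.
    return tuple(old_size) == tuple(new_size)
```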

Example:
```python
import torch

@torch.compile()
def f(x):
    batch_size = x.shape[0]
    x = x.transpose(1, 2) # (batch_size, 2, 3)
    x = x.reshape(batch_size, 2, 3) # noop
    return x

x = torch.randn((2,3,2))
f(x)

x = torch.randn((4,3,2))
f(x)
```

Before:
![image](https://github.com/user-attachments/assets/be488881-6c99-43a9-b088-fa481f675775)

After:
![image](https://github.com/user-attachments/assets/6d93be3d-128b-44d4-ad6a-d3d18e272329)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151095
Approved by: https://github.com/eellison
2025-04-16 23:55:32 +00:00
9a2624c712 Fix keepdim param optional description (#151197)
Fixes #151104

Fix the optional description of `dim` and `keepdim`, except for `torch.quantile`, which was already fixed in #146485

## Test Result

### Before

![image](https://github.com/user-attachments/assets/69f1824d-3d15-407e-8c92-f25a22e16914)

### After

![image](https://github.com/user-attachments/assets/e5aac674-ab8f-4988-a5f1-7400c36bdc99)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151197
Approved by: https://github.com/mikaylagawarecki
2025-04-16 23:15:30 +00:00
9e6ad274dc Action for building docker binary builds (#151471)
This is part of splitting up https://github.com/pytorch/pytorch/pull/150558 into smaller chunks, please see that for more context

Uses calculate docker image with the new custom tag prefix, so the naming convention of the docker images is slightly different for images built on PR

based off of a582f04608/.github/workflows/build-manywheel-images.yml (L101)

Also moves the push of the docker images from inside the build scripts to inside the workflow

Currently not used anywhere, but the binary docker builds are very similar so I'm going to change them to use this instead

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151471
Approved by: https://github.com/malfet, https://github.com/seemethere, https://github.com/ZainRizvi
2025-04-16 23:01:35 +00:00
cd7bc60e11 Migrate to new theme (#149331)
- Migrate pytorch docs, cpp docs and functorch docs to the pytorch_sphinx_theme2
- Migrate index.rst to markdown and restructure to use high-level horizontal bar sections Python API, Developer Notes
- Added python-api.md which becomes the main container for the API docs. This file will be used to add all api references in the toctree. It would be great to have lint for this file: https://github.com/pytorch/pytorch/issues/150718
- Enabled mermaid sphinx extension and opengraph sphinx extension

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149331
Approved by: https://github.com/malfet, https://github.com/atalman, https://github.com/albanD
2025-04-16 21:35:19 +00:00
1ffaa00ad7 [MPS] Migrate bitwise_not to unary operator (#151460)
That kills two birds with one stone:
 - Makes implementations more standardized (and faster for strided inputs/outputs)
 - Fixes a bug in strided in-place bitwise_not

I.e. before this change
```python
import torch
x=torch.arange(32, device="mps")
x[::2].bitwise_not_()
print(x)
```
produced
```
tensor([ -1,  -2,  -3,  -4,  -5,  -6,  -7,  -8,  -9, -10, -11, -12, -13, -14,
        -15, -16,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,  26,  27,
         28,  29,  30,  31], device='mps:0')
```
after, it generates reasonable output
```
tensor([ -1,   1,  -3,   3,  -5,   5,  -7,   7,  -9,   9, -11,  11, -13,  13,
        -15,  15, -17,  17, -19,  19, -21,  21, -23,  23, -25,  25, -27,  27,
        -29,  29, -31,  31], device='mps:0')
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151460
Approved by: https://github.com/dcci, https://github.com/qqaatw, https://github.com/Skylion007
2025-04-16 21:34:45 +00:00
f252f9df5e Revert "[Openreg][PrivateUse1] Enable CI for openreg (#151007)"
This reverts commit abbca37fe882541e0259b43dd314a324180550ed.

Reverted https://github.com/pytorch/pytorch/pull/151007 on behalf of https://github.com/clee2000 due to At least test_record_event needs to also be skipped on dynamo too, its failing and then somehow causing a hang? https://github.com/pytorch/pytorch/actions/runs/14487625709/job/40637535027#step:25:73 ([comment](https://github.com/pytorch/pytorch/pull/151007#issuecomment-2810789483))
2025-04-16 21:05:17 +00:00
e0535e823f Revert "[Openreg][PrivateUse1] Fix releasing tensor issue when using pin_memory (#151091)"
This reverts commit e229ce34c4ab8cd4e2800227615be32fb362b1e6.

Reverted https://github.com/pytorch/pytorch/pull/151091 on behalf of https://github.com/clee2000 due to At least test_record_event needs to also be skipped on dynamo too, its failing and then somehow causing a hang? https://github.com/pytorch/pytorch/actions/runs/14487625709/job/40637535027#step:25:73 ([comment](https://github.com/pytorch/pytorch/pull/151007#issuecomment-2810789483))
2025-04-16 21:05:17 +00:00
5b5399bfcd [graph partition] reorder to reduce #partitions for simple dependencies (#150814)
This PR reduces #graph partitions by reordering nodes when the `should_partition` nodes have simple dependencies. Specifically, for `should_partition` nodes:
    a. If a node has no dependency or only depends on graph inputs: move to the front. Use case is when we move symints to cuda tensor for PaddedTensorSubclass
    b. If the only user of a node is OutputNode: move it to the end.

#### Example

The following example shows a padded tensor subclass use case where we copy a symint to a cuda tensor (aka mask) in the middle of the function. Reordering still generates 1 cudagraph by moving the mask to the front.

```python
import torch

torch._inductor.config.graph_partition = True

# Two reasons for this:
# 1. We want to reuse the same mask for many masked_fill calls
# 2. Prevent inductor from fusing this op into other ops (e.g. masked_fill)
#    so we can still reorder in scheduler
@torch.library.custom_op("mylib::create_mask", mutates_args=(), tags=(torch._C.Tag.cudagraph_unsafe,))
def create_mask(padded_size: int, original_size: int, device: torch.device) -> torch.Tensor:
    mask = torch.zeros((padded_size,), dtype=torch.bool, device=device)
    mask[original_size:] = True
    return mask

@create_mask.register_fake
def _(padded_size, original_size, device):
    return torch.empty((padded_size,), dtype=torch.bool, device=device)

def f(padded_tensor, original_tensor, weight):
    original_size = original_tensor.size()[0]
    padded_size = padded_tensor.size()[0]

    # element wise op so we don't care padding value
    padded_tensor = padded_tensor + 1
    padded_tensor = torch.nn.functional.relu(padded_tensor)

    # dot product requires padding with 0
    dot_res = padded_tensor.dot(weight)
    padded_tensor += dot_res

    # min requires padding with inf, so we create mask now
    mask = create_mask(padded_size, original_size, padded_tensor.device)
    min_res = torch.min(
        torch.ops.aten.masked_fill(padded_tensor, mask, float("inf"))
    )

    # max requires padding with inf. we can reuse previous mask
    max_res = torch.max(
        torch.ops.aten.masked_fill(padded_tensor, mask, -float("inf"))
    )

    return min_res+max_res+padded_tensor

compiled_f = torch.compile(f, mode="reduce-overhead")

def run(padded_size, original_size):
    padded_tensor = torch.randn(padded_size, device="cuda")
    padded_tensor[original_size:] = 0
    original_tensor = torch.randn(original_size, device="meta")

    weight = torch.randn(padded_size, device="cuda")
    eager_out = f(padded_tensor, original_tensor, weight)
    compiled_out = compiled_f(padded_tensor, original_tensor, weight)
    assert torch.allclose(eager_out[0], compiled_out[0])
    assert torch.allclose(eager_out[1], compiled_out[1])

# new cudagraph
run(8, 4)

# new cudagraph due to recompile
run(8, 6)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150814
Approved by: https://github.com/eellison
2025-04-16 20:49:20 +00:00
a582f04608 Revert "[ez] Make relaxed constraint error message more user friendly (#151407)"
This reverts commit bc934f57d7c14b07e7497eb72a90d893270bc662.

Reverted https://github.com/pytorch/pytorch/pull/151407 on behalf of https://github.com/izaitsevfb due to breaks export tests ([comment](https://github.com/pytorch/pytorch/pull/151407#issuecomment-2810716135))
2025-04-16 20:40:22 +00:00
607443b16b [compile][compile time traces] Add more dynamo traces (#151357)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151357
Approved by: https://github.com/williamwen42
ghstack dependencies: #151330, #151256
2025-04-16 20:37:08 +00:00
8e373592c8 [aot autograd][logging] Profile large missing gaps in compile time tracing (#151256)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151256
Approved by: https://github.com/bdhirsh, https://github.com/masnesral
ghstack dependencies: #151330
2025-04-16 20:37:08 +00:00
c58b3f6be3 [invoke_subgraph][inductor] Run pre and post grad passes on invoke_subgraph (#151330)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151330
Approved by: https://github.com/eellison, https://github.com/zou3519
2025-04-16 20:37:01 +00:00
4c4a5df73b Allow to run flex_attention on HPU (#148656)
HPU-specific implementation details are to be located in the out-of-tree HPU library.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148656
Approved by: https://github.com/drisspg
2025-04-16 19:49:15 +00:00
9400f53903 [Inductor] Broadcast to range tree shape before block pointer store (#151399)
# Feature

This fixes a bug related to block pointer stores. Since Triton's block pointer stores don't support implicit broadcasting, in certain cases we need to generate a `reshape->broadcast->reshape` pattern to ensure that the tensor being stored has the same shape as the block pointer. This happens when the block indexing expression involves strides of 0 or dimensions of 1, both of which we eliminate from the block pointer.

The existing logic missed an important edge case.  We may need a broadcast prior to the first `reshape` of this pattern, in case the tensor comes from a load with implicit broadcasting. For example, if the range trees have shape `[YBLOCK, XBLOCK]`, but the load has a shape `[1, XBLOCK]`, we need to broadcast this to `[YBLOCK, XBLOCK]` prior to storing. See the example kernel below, which comes from `expand` -> `clone` with 3D tiling. The load has an implicit broadcast, and the store has a reshape. Thus, we need to insert an explicit broadcast between them.

```
@triton.jit
def triton_poi_fused_clone_0(in_ptr0, out_ptr0, znumel, ynumel, xnumel, ZBLOCK : tl.constexpr, YBLOCK : tl.constexpr, XBLOCK : tl.constexpr):
    znumel = 32
    ynumel = 1
    xnumel = 32
    zoffset = tl.program_id(2) * ZBLOCK
    zindex = zoffset + tl.arange(0, ZBLOCK)[:, None, None]
    zmask = zindex < znumel
    yoffset = tl.program_id(1) * YBLOCK
    yindex = yoffset + tl.arange(0, YBLOCK)[None, :, None]
    ymask = tl.full([ZBLOCK, YBLOCK, XBLOCK], True, tl.int1)
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[None, None, :]
    xmask = xindex < xnumel
    x1 = xindex
    z0 = zindex
    tmp0 = tl.load(tl.make_block_ptr(in_ptr0, shape=[32], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), boundary_check=[0], eviction_policy='evict_last')[None, None, :]
    tl.store(tl.make_block_ptr(out_ptr0, shape=[32, 32], strides=[32, 1], block_shape=[ZBLOCK, XBLOCK], order=[1, 0], offsets=[zoffset, xoffset]), tl.reshape(tl.broadcast_to(tmp0, [ZBLOCK, YBLOCK, XBLOCK]), [ZBLOCK, XBLOCK]).to(tl.float32), boundary_check=[0, 1])
''', device_str='cuda')
```

The tricky part is that we don't want to emit redundant broadcasts in the store. This PR reworks the logic a bit to make sure we don't emit a second broadcast unless it actually changes the shape.

# Test plan

Added a CI test for this case, which would fail on trunk. Checked that only one broadcast was emitted.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151399
Approved by: https://github.com/jansel, https://github.com/eellison
2025-04-16 19:03:40 +00:00
eqy
17bf59340c [cuSPARSE][B200] Bump tolerances for test_sparse_csr matvec (#148721)
Small tolerance bump for blackwell (appears to use same kernel as prev. arches)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148721
Approved by: https://github.com/nWEIdia, https://github.com/ngimel
2025-04-16 18:44:18 +00:00
1f29190b59 [dynamo] unimplemented -> unimplemented_v2 in variables/builtin.py (#151145)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151145
Approved by: https://github.com/Skylion007, https://github.com/StrongerXi, https://github.com/jansel, https://github.com/zou3519
2025-04-16 17:16:05 +00:00
bc934f57d7 [ez] Make relaxed constraint error message more user friendly (#151407)
Fixes #151356

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151407
Approved by: https://github.com/Skylion007
2025-04-16 17:00:06 +00:00
cedcdda0ed Add ccode for CeilToInt and IntTrueDiv (#151375)
Summary: As titled

Test Plan: Test in D73052653 -- shape calculator generates successfully

Differential Revision: D73073845

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151375
Approved by: https://github.com/kalpit-meta-1, https://github.com/Skylion007
2025-04-16 16:47:55 +00:00
6a3a6d22dc Revert "[dynamo] context manager/decorator for dynamo config patching during tracing (#150586)"
This reverts commit 40ce4fb24a536d175348df876f61956d4945778e.

Reverted https://github.com/pytorch/pytorch/pull/150586 on behalf of https://github.com/clee2000 due to broke some inductor tests? inductor/test_fuzzer.py::TestConfigFuzzer::test_config_fuzzer_dynamo_bisect [GH job link](https://github.com/pytorch/pytorch/actions/runs/14486513628/job/40635178179) [HUD commit link](40ce4fb24a), bad TD ([comment](https://github.com/pytorch/pytorch/pull/150586#issuecomment-2810064322))
2025-04-16 16:13:47 +00:00
0c77af3576 [MPSInductor] Add pow, log2 and FloorToInt ops (#151449)
That enables `test_pow_by_natural_log2_dynamic_shapes_mps`

Not sure why log2 printer function suffix is `OpaqueUnaryFn_log2`, rather than just `log2`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151449
Approved by: https://github.com/jansel
2025-04-16 15:56:21 +00:00
e229ce34c4 [Openreg][PrivateUse1] Fix releasing tensor issue when using pin_memory (#151091)
As the title stated.

Related PR: https://github.com/pytorch/pytorch/pull/147066

Co-authored-by: Zhenbin Lin <lin-zhenbin@qq.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151091
Approved by: https://github.com/albanD
ghstack dependencies: #151005, #151007
2025-04-16 13:12:17 +00:00
c7400d0026 [inductor][comms] skip reorder_for_locality for wait nodes (#150074)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150074
Approved by: https://github.com/eellison, https://github.com/bdhirsh
ghstack dependencies: #150258
2025-04-16 10:18:33 +00:00
159d8a14a6 [inductor][comms] fix node_summary for composite scheduler nodes (#150258)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150258
Approved by: https://github.com/yf225
2025-04-16 10:18:33 +00:00
41c97a72a1 [export] Add draft-export to error msg (#151065)
Given an exception in torch.export, I want to try/catch it to add the message "hey try out draft-export!". Currently I only add this message for errors that draft-export is known to fix, like DataDependentErrors, ConstraintViolationErrors, and no fake impl.

Originally the error message looks like:
```
  File "/data/users/angelayi/pytorch/torch/_library/custom_ops.py", line 626, in fake_impl
    raise RuntimeError(
RuntimeError: There was no fake impl registered for <CustomOpDef(mylib::foo2)>. This is necessary for torch.compile/export/fx tracing to work. Please use `foo2_impl.register_fake` to add an fake impl.
```

Now the error msg looks something like:
```
  File "/data/users/angelayi/pytorch/torch/_library/custom_ops.py", line 626, in fake_impl
    raise RuntimeError(
RuntimeError: There was no fake impl registered for <CustomOpDef(mylib::foo2)>. This is necessary for torch.compile/export/fx tracing to work. Please use `foo2_impl.register_fake` to add an fake impl.

The error above occurred when calling torch.export.export. If you would like to view some more information about this error, and get a list of all other errors that may occur in your export call, you can rerun your program with the `DRAFT_EXPORT=1` envvar, or replace your `export()` call with `draft_export()`.
```

In Python versions >= 3.11, we can use `exception.add_note` to add to the error message. However, for previous versions I did a hack to modify `e.args`.
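
A minimal sketch of the two code paths (the helper name and hint text are illustrative):

```python
import sys

HINT = "Consider rerunning with draft_export() for more information about this error."

def attach_hint(e: Exception) -> Exception:
    if sys.version_info >= (3, 11):
        e.add_note(HINT)  # rendered under the traceback
    else:
        # Older Pythons: fold the hint into the first arg so str(e) shows it.
        first = e.args[0] if e.args else ""
        e.args = (f"{first}\n\n{HINT}",) + e.args[1:]
    return e
```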

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151065
Approved by: https://github.com/pianpwk
ghstack dependencies: #151051
2025-04-16 08:56:02 +00:00
84e633e09d [export] Make draft-export predispatch=True by default (#151051)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151051
Approved by: https://github.com/pianpwk
2025-04-16 08:56:02 +00:00
a5c61668d7 fix ambiguous error message (#150086)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150086
Approved by: https://github.com/anijain2305
2025-04-16 08:48:05 +00:00
0a489f924d Fix: missing () in generated runtime assert c++ code (#151171)
Address one of the issues in https://github.com/pytorch/pytorch/issues/151127.

The generated code used to be `not a==5 or b==5`, but it should be `not (a==5 or b==5)`.

This also addresses one of the issues raised in the comments of https://github.com/pytorch/pytorch/issues/151127.
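
A quick illustration of the precedence difference:

```python
a, b = 3, 5
print(not a == 5 or b == 5)    # True:  parsed as (not (a == 5)) or (b == 5)
print(not (a == 5 or b == 5))  # False: the intended negation of the whole disjunction
```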

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151171
Approved by: https://github.com/aorenste, https://github.com/eellison
ghstack dependencies: #151170
2025-04-16 08:10:17 +00:00
55595e0c85 Fix Issues in deferring runtime assertions. (#151170)
This PR fixes two bugs:
1) Update `self.bound_unbacked_symbols` before emitting runtime asserts, so that runtime asserts depending on the current node are included.

2) In the pass that removes unused graph inputs, we should not remove symbols that are used by runtime assertions.

Address some of the issues in https://github.com/pytorch/pytorch/issues/151127

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151170
Approved by: https://github.com/bobrenjc93, https://github.com/eellison
2025-04-16 08:10:17 +00:00
abbca37fe8 [Openreg][PrivateUse1] Enable CI for openreg (#151007)
Changes:
- move test_openreg.py from test/cpp_extensions/open_registration_extension/ to test/
- update README.md for openreg
- enable CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151007
Approved by: https://github.com/albanD
ghstack dependencies: #151005
2025-04-16 07:55:51 +00:00
a9dbbe1aee [OpenReg][PrivateUse1] Refactoring the csrc files of pytorch_openreg (#151005)
As the title stated.

**Changes:**
- Remove unnecessary header file
- Remove unnecessary registry logic around PrivateUse1HooksRegistry, such as TORCH_DECLARE_REGISTRY, C10_DEFINE_REGISTRY, etc.
- Use a static global variable to do initialization instead of call_once

**Next Step:**
Enable test_openreg.py in CI/CD to guard the quality of PrivateUse1
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151005
Approved by: https://github.com/albanD
2025-04-16 07:55:50 +00:00
40ce4fb24a [dynamo] context manager/decorator for dynamo config patching during tracing (#150586)
Implement traceable config patching for Dynamo: enables restricted patching of Dynamo config where user can use a context manager/decorator to change tracing behavior for parts of the code.

The new `dont_skip_tracing` decorator/context manager for ignoring most trace rules is easily implemented with this more generic traceable config patching feature.

Implementation:
- Create a new specialized context manager class representing a wrapper around torch._dynamo.config.patch
- Dynamo doesn't trace into the context manager but updates config at compile time
- Correctness is based on our correctness for handling supported context managers
- Implementation is inspired by how `GradModeVariable` is implemented.

Previous attempts: https://github.com/pytorch/pytorch/pull/148736 (decorator-only global approach) and https://github.com/pytorch/pytorch/pull/149439 (decorator-only traceback approach)

See https://docs.google.com/document/d/1vWNwKL_jpg-PLopifcaSa338wks3GqSVF4GHRguybGg/edit?tab=t.0 for more details on implementation - including previous approaches.

NOTE: this PR fixes a bug where skipped code objects were not tracked by convert_frame.py, leading to cases where code objects would be automatically skipped even after `torch._dynamo.reset()`. This exposed some latent dynamo-wrapped test failures in CI that previously passed in CI but not locally.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150586
Approved by: https://github.com/jansel, https://github.com/zou3519, https://github.com/anijain2305
2025-04-16 06:49:58 +00:00
daf2ccf023 [custom ops] Fix destroy function (#151299)
Summary:
D72906445 seemed to cause a SIGABRT when running the test in the test plan. The change I narrowed it down to was the one in fake_impls where [`deregister_fake_kernel` no longer calls `lib.destroy`](https://github.com/pytorch/pytorch/pull/150806/files#diff-7fd3f4222276c63b91f3a895530bb5efe137fd23165b48f25afcf3c06a5d2a8fL65-L69).

Calling `lib.destroy` in that handle results in a maximum recursion error where someone calls library.destroy which calls the handle which calls back to library.destroy.

So I compared the implementations of `_del_library` and `lib.destroy`, and it seemed like the main difference was deleting `self.m`. Adding that fixed my issue!

Side note, I feel like we can combine `_del_library` and `library._destroy`? But I won't do it in this diff to make sure we don't break too many things 😅

Test Plan:
`buck test 'fbcode//mode/opt' fbcode//aiplatform/gmpp/bulk_eval/reader/service/tests:reader_service_handler_tests -- --exact 'aiplatform/gmpp/bulk_eval/reader/service/tests:reader_service_handler_tests - aiplatform.gmpp.bulk_eval.reader.service.tests.reader_service_handler_tests.ReaderServiceHandlerTests: test_add_preproc_output_into_queue'`
https://www.internalfb.com/intern/testinfra/testrun/10977524170296078

Differential Revision: D73017613

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151299
Approved by: https://github.com/zou3519
2025-04-16 06:18:09 +00:00
585d03fa39 Record how many parameters we're parsing within dynamo (#148508)
This allows us to track how many parameters we have in compilations.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148508
Approved by: https://github.com/jansel, https://github.com/anijain2305

Co-authored-by: Sam Larsen <slarsen@meta.com>
2025-04-16 06:15:11 +00:00
b4cee2bf57 [executorch hash update] update the pinned executorch hash (#151280)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151280
Approved by: https://github.com/pytorchbot
2025-04-16 05:39:06 +00:00
107121dfad [AOTInductor] Add interface for user managed buffer in package api. (#151325)
Summary:
https://github.com/pytorch/pytorch/pull/151141
We add an interface for user-managed buffers in the package API.

Test Plan:
Included in commit.

Reviewed By: henrylhtsang

Differential Revision: D72985440

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151325
Approved by: https://github.com/angelayi
2025-04-16 04:25:40 +00:00
82200e33b5 Make torch._chunk_cat support non-contiguous inputs (#151263)
Currently, `torch._chunk_cat` only supports contiguous inputs (because the `.view()` usage in `_pad_chunk()` supports only contiguous tensors). This doesn't work for internal models where there can be non-contiguous input tensors:

- size=[8192, 16416], stride=[16448, 1]  # stride[0] is larger than size[1]
- size=[1152, 384], stride=[1, 1152]  # column-major tensor

In this PR, we relax the assumption on contiguous input tensor, by switching from `.view()` to `.reshape()`. Note that since `.reshape()` will try to use `.view()` under the hood whenever possible, this should not cause regression to existing use cases.
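
A small illustration of why `.reshape()` is the safer choice here:

```python
import torch

x = torch.randn(384, 1152).t()   # size=[1152, 384], stride=[1, 1152]: column-major
assert not x.is_contiguous()

try:
    x.view(-1)                   # .view() needs compatible strides and raises here
except RuntimeError as e:
    print("view failed:", e)

y = x.reshape(-1)                # .reshape() falls back to a copy when .view() can't work
assert y.numel() == x.numel()
```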

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151263
Approved by: https://github.com/BoyuanFeng
2025-04-16 04:18:46 +00:00
30101aa450 [c10d][fr] Add counters for FR dump and reduce its timeout to finish dump before watchdog timeout (#151329)
After https://github.com/pytorch/pytorch/pull/150652, we still see some ranks missing dumps. Upon looking further, what happens is that the FR dump times out on its first attempt:
watchdog thread: notify FR dump -> wait for 1 min -> throw watchdog timeout -> notify elastic to kill process
FR dump thread: received FR dump signal -> timeout after 1 min on first attempt -> started 2nd attempt -> got killed.

So we want to make the FR dump timeout shorter; in reality, the logs show that the dump finishes within one second. Even if we assume a very slow speed like 200K/s, the usual FR size (1MB at most) takes around 5 secs, so 15 secs gives roughly a 3x buffer.

Also, we still let the watchdog sleep for 1 min so that there is enough time for two dumps to time out and for the following checks, like the GIL checker, to execute.

Also, if we get stuck acquiring the GIL or hit a CUDA hang, 15 seconds should be enough to detect it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151329
Approved by: https://github.com/fegin
2025-04-16 03:48:03 +00:00
3a90fd481e fix test_einsum: use initialized values (#151363)
Summary: `empty` uses uninitialized values, so they could be NaNs; thus, assert_close kept failing in FBCode.
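
A small illustration of the hazard (illustrative only, not the test code):

```python
import torch

# torch.empty returns whatever happens to be in memory, so element values
# are undefined and may even be NaN; comparisons against them are flaky.
a = torch.empty(3, 3)

# torch.randn (or zeros/ones) gives initialized, well-defined values instead.
b = torch.randn(3, 3)
```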

Test Plan:
```
buck test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:unbacked_symints_cpu -- --exact 'caffe2/test/inductor:unbacked_symints_cpu - test_einsum_cpu (caffe2.test.inductor.test_unbacked_symints.TestUnbackedSymintsCPU)' --env TORCH_LOGS="+output_code" --print-passing-details --env TORCH_LOGS_FORMAT="%(filename)s:%(lineno)s: %(message)s"
```

Differential Revision: D73067722

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151363
Approved by: https://github.com/Camyll

Co-authored-by: Camyll Harajli <camyllh@meta.com>
2025-04-16 03:10:29 +00:00
6124dabd30 [CI][NoOp] Update skip reason for argmin_with_nan (#151374)
Which is https://github.com/pytorch/pytorch/issues/130295 (i.e. torch.compile produces correct results, but eager does not)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151374
Approved by: https://github.com/dcci
2025-04-16 02:33:20 +00:00
ae53510b9e Fix setUpClass() / tearDownClass() for device-specific tests (#151129)
Finishes up the work started in #121686 + adds test

Update: this was not as straightforward as I originally imagined. Context below.

**TL;DR:** `TestFoo{CPU, CUDA}` now actually derive from `TestFoo`! Also, `{CPU, CUDA}TestBase` setup / teardown logic is now always called (it is required to set the primary device), regardless of whether `super().setUpClass()` / `super().tearDownClass()` are called or not.

**Background:** The typical way to get device-specific tests is to write a generic `TestFoo` and call `instantiate_device_type_tests(TestFoo, locals())` to get `TestFooCPU`, `TestFooCUDA`, etc. After this, generic tests (e.g. `TestFoo.test_bar()`) become `TestFooCPU.test_bar_cpu()` / `TestFooCUDA.test_bar_cuda()`.

Behind the scenes, this was historically accomplished by creating a `TestFooCUDA` that derives from both a `CUDATestBase` and an *empty class* called `TestFoo_base`. This `TestFoo_base` has the same bases as `TestFoo`, but none of the test functions (e.g. `test_bar()`). The documented reason for this is to avoid things like a derived `TestFooCUDA.test_bar()` being discovered in addition to the real device-specific test `TestFooCUDA.test_bar_cuda()`.

(1) A reason this matters is because it should be possible to call e.g. `super().setUpClass()` from a custom setup / teardown classmethod. If the generated TestFooCUDA does not derive from TestFoo, but instead derives from the empty class described above, this syntax does not work; in fact there is no way to form a proper `super()` call that works across the device-specific test variants. Here's an example that breaks in the OpInfo tests:

070f389745/test/test_ops.py (L218-L221)

(2) Further, there is some precedent within a custom `setUpClass()` impl for storing things on the `cls` object to be accessed at test time. This must be the device-specific test class (`TestFooCUDA`) and not `TestFoo` for this to work. As an example, the open device registration tests load a module during setup and use it in the test logic:

070f389745/test/test_cpp_extensions_open_device_registration.py (L63-L77)

070f389745/test/test_cpp_extensions_open_device_registration.py (L79-L80)

To accomplish both (1) and (2) at the same time, I decided to revisit the idea of utilizing a proper inheritance hierarchy for `TestFoo` -> `{TestFooCPU, TestFooCUDA}`. That is: have TestFooCPU / TestFooCUDA **actually** derive from `TestFoo`. This achieves both (1) and (2). The only thing left is to make sure the generic tests (e.g. `TestFoo.test_bar()`) are not discoverable, as was the stated reason for diverging from this in the first place. It turns out we can simply `delattr()` these generic tests from `TestFoo` once `TestFooCPU` / `TestFooCUDA` have been setup with the device-specific variants, and all works well. The `instantiate_device_type_tests(...)` logic already deletes `TestFoo` from scope, so I don't see a problem with deleting generic tests from this base class as well (CI will prove me right or wrong ofc).
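
A toy sketch of the approach (not the real `common_device_type.py` logic; the helper name and the single "cpu" device are just for illustration):

```python
import unittest

class TestFoo(unittest.TestCase):
    def test_bar(self):           # generic test, instantiated per device below
        self.assertTrue(True)

def instantiate_device_tests(base, devices, scope):
    generic = [n for n in vars(base) if n.startswith("test_")]
    for device in devices:
        # The device class really derives from the generic class now,
        # so super().setUpClass() and cls attributes behave as expected.
        cls = type(f"{base.__name__}{device.upper()}", (base,), {})
        for name in generic:
            setattr(cls, f"{name}_{device}", getattr(base, name))
        scope[cls.__name__] = cls
    # Drop the generic tests from the base so e.g. TestFooCPU.test_bar()
    # is not discovered alongside TestFooCPU.test_bar_cpu().
    for name in generic:
        delattr(base, name)

instantiate_device_tests(TestFoo, ["cpu"], globals())
```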

**Side note:** I was encountering a weird race condition where sometimes the custom `setUpClass()` / `tearDownClass()` defined & swapped in [here](4a47dd9b3f/torch/testing/_internal/common_device_type.py (L940-L955)) would be used, and sometimes it wouldn't. This non-deterministic behavior was called out previously by @ngimel here:
4a47dd9b3f/test/inductor/test_torchinductor_dynamic_shapes.py (L128-L130)

To address this, I moved this block of logic to before the first call to `instantiate_test()`, as that method queries for the primary device, and the primary device identification logic may manually invoke `setUpClass()` (see [here](4a47dd9b3f/torch/testing/_internal/common_device_type.py (L381-L384))). Goal: define the `setUpClass()` / `tearDownClass()` we want for correctness before they're ever called. This seems to work and the behavior is deterministic now AFAICT.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151129
Approved by: https://github.com/janeyx99, https://github.com/masnesral, https://github.com/malfet
2025-04-16 02:18:42 +00:00
067a7b1d4a Disable -Werror for s390x test module compilation (#150413)
This change should make the nightly test suite green again for s390x.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150413
Approved by: https://github.com/seemethere
2025-04-16 02:15:17 +00:00
aacac88bee [ROCM] Fix in-place aten sum with specialized templated kernels. (#151230)
We noticed a regression when doing aten.sum in-place (a += b) and the output type is not the same as the functor's.

Co-authored-by: Jerry Mannil <jerry.mannil@amd.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151230
Approved by: https://github.com/jeffdaily
2025-04-16 02:07:46 +00:00
cyy
cadd832c19 [1/N] Use std::string_view in torchgen (#146403)
Moves remaining c10::string_view usages to std::string_view

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146403
Approved by: https://github.com/albanD
2025-04-16 01:50:22 +00:00
dd11613f94 [cutlass backend][experimental] Try out presets for cutlass instead of searching all configs (#151255)
Differential Revision: [D72668861](https://our.internmc.facebook.com/intern/diff/D72668861/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151255
Approved by: https://github.com/mlazos
2025-04-16 01:48:06 +00:00
532025fbd0 [cutlass backend][ez] Ban FP32 output dtype from using CUTLASS GEMM backend (#151279)
FP32 not supported: https://github.com/pytorch/pytorch/issues/145952

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151279
Approved by: https://github.com/ColinPeppler
2025-04-16 01:12:18 +00:00
8780d18f64 [ONNX] Add a comment for handling bf16/fp8 tensor to numpy conversion (#151371)
Follow up of https://github.com/pytorch/pytorch/pull/151259
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151371
Approved by: https://github.com/titaiwangms
2025-04-16 00:49:38 +00:00
4bbb61812c [BE][1/2] Move original_weights_lookup attribute to constant (#151241)
Summary: As titled. Clean up usages by using a global constant.

Test Plan: `buck test 'fbcode//mode/opt' fbcode//caffe2/test:quantization_fx -- --exact 'caffe2/test:quantization_fx - test_keep_original_weights (quantization.fx.test_quantize_fx.TestQuantizeFx)'`

Differential Revision: D72892815

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151241
Approved by: https://github.com/Skylion007, https://github.com/hl475
2025-04-16 00:41:25 +00:00
44a522dd78 [BE] Fix extra-semi warning in attention.cpp (#151367)
Introduced by https://github.com/pytorch/pytorch/pull/149512

Before this change, following warning was generated
```
/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/native/transformers/attention.cpp:452:71: warning: extra ';' outside of a function is incompatible with C++98 [-Wc++98-compat-extra-semi]
  452 | REGISTER_HPU_DISPATCH(_fused_sdp_choice_stub, &_fused_sdp_choice_meta);
      |                                                                       ^
1 warning generated.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151367
Approved by: https://github.com/drisspg
2025-04-16 00:31:45 +00:00
8e6415fd32 [cutlass backend] "Fix" FlexibleLayout (#151284)
So Horace was right, Triton does fix the layout when rendering the template (i.e. roughly at the same time).

You can double-check by running the unit test with the gemm backend set to "TRITON,CUTLASS". You will notice that the layout is fixed if Triton is in the gemm backend, but flexible if Triton is not there.

code pointer: https://github.com/pytorch/pytorch/blob/main/torch/_inductor/select_algorithm.py#L927

In the future, we should remove `fix_op_layout` from class CUTLASSGemmTemplate. But maybe we can monitor it for a bit first.

Differential Revision: [D72996143](https://our.internmc.facebook.com/intern/diff/D72996143/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151284
Approved by: https://github.com/ColinPeppler
2025-04-16 00:10:52 +00:00
e55eb5c870 [Cutlass] Integrate EVT codegen into 3x gemm template (#150346)
Previously merged:
* #150345
* #150344

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150346
Approved by: https://github.com/henrylhtsang
ghstack dependencies: #150344, #150345
2025-04-16 00:08:22 +00:00
3cf0e2d8ec Add inductor standalone_compile API (#150670)
This PR adds a standalone_compile API that does precompilation via caching to support the vLLM use case in the short term while we work on the longer-term precompilation solution.

```
standalone_compile(gm, example_inputs, options) -> CompiledArtifact
CompiledArtifact.save(path, format: binary|unpacked = binary)
CompiledArtifact.load(path, format: binary|unpacked = binary)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150670
Approved by: https://github.com/jamesjwu, https://github.com/zou3519
2025-04-15 23:38:15 +00:00
9917feff50 [ONNX] Produce correct dtypes for bf16/f8 in IR TorchTensor (#151259)
Split the changes from https://github.com/pytorch/pytorch/pull/151069 to address https://github.com/microsoft/onnxscript/issues/2187, where the output np arrays do not have the correct ml_dtypes types as expected.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151259
Approved by: https://github.com/titaiwangms
2025-04-15 23:21:04 +00:00
331423e5c2 Fix tensorpipe compilation with clang-17 (#151344)
By suppressing `missing-template-arg-list-after-template-kw` warning, which seems to be required to compile Google's libnop, which is in a semi-abandoned state now
```
In file included from /Users/malfet/git/pytorch/pytorch/third_party/tensorpipe/third_party/libnop/include/nop/base/variant.h:21:
/Users/malfet/git/pytorch/pytorch/third_party/tensorpipe/third_party/libnop/include/nop/types/variant.h:241:30: error: a template argument list is expected after a name prefixed by the template keyword [-Wmissing-template-arg-list-after-template-kw]
  241 |     index_ = value_.template Construct(std::forward<Args>(args)...);
      |                              ^
/Users/malfet/git/pytorch/pytorch/third_party/tensorpipe/third_party/libnop/include/nop/types/variant.h:258:26: error: a template argument list is expected after a name prefixed by the template keyword [-Wmissing-template-arg-list-after-template-kw]
  258 |     if (!value_.template Assign(TypeTag<T>{}, index_, std::forward<U>(value))) {
      |                          ^
/Users/malfet/git/pytorch/pytorch/third_party/tensorpipe/third_party/libnop/include/nop/types/variant.h:265:26: error: a template argument list is expected after a name prefixed by the template keyword [-Wmissing-template-arg-list-after-template-kw]
  265 |     if (!value_.template Assign(index_, std::forward<T>(value))) {
      |                          ^
3 errors generated.
```

Fixes https://github.com/pytorch/pytorch/issues/151316

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151344
Approved by: https://github.com/ZainRizvi, https://github.com/seemethere
2025-04-15 22:18:06 +00:00
98b1e82ba8 Revert "Fix setUpClass() / tearDownClass() for device-specific tests (#151129)"
This reverts commit bd4cf30e31a2a0b0a57f54c7eedd3a39d5778cbe.

Reverted https://github.com/pytorch/pytorch/pull/151129 on behalf of https://github.com/jbschlosser due to flex attention tests failing ([comment](https://github.com/pytorch/pytorch/pull/151129#issuecomment-2807632119))
2025-04-15 22:07:25 +00:00
e1d8b3f838 [inductor] Check NoneLayout in update_zero_dim_cpu_tensor (#151321)
Summary:
This fixes the error in https://fb.workplace.com/groups/1075192433118967/permalink/1640802133224658/
I tried really hard but I couldn't come up with a test case to repro the issue, but I confirmed with the OP that this issue has been fixed.
```
Traceback (most recent call last):
  File "/dev/shm/uid-99/d2b830f6-seed-nspid4026547915_cgpid362302-ns-4026547912/torch/_inductor/compile_fx.py", line 746, in _compile_fx_inner
    mb_compiled_graph = fx_codegen_and_compile(
  File "/dev/shm/uid-99/d2b830f6-seed-nspid4026547915_cgpid362302-ns-4026547912/torch/_inductor/compile_fx.py", line 1343, in fx_codegen_and_compile
    return scheme.codegen_and_compile(gm, example_inputs, inputs_to_check, graph_kwargs)
  File "/dev/shm/uid-99/d2b830f6-seed-nspid4026547915_cgpid362302-ns-4026547912/torch/_inductor/compile_fx.py", line 1232, in codegen_and_compile
    compiled_module = graph.compile_to_module()
  File "/dev/shm/uid-99/d2b830f6-seed-nspid4026547915_cgpid362302-ns-4026547912/torch/_inductor/graph.py", line 2087, in compile_to_module
    return self._compile_to_module()
  File "/dev/shm/uid-99/d2b830f6-seed-nspid4026547915_cgpid362302-ns-4026547912/torch/_inductor/graph.py", line 2095, in _compile_to_module
    self.codegen_with_cpp_wrapper() if self.cpp_wrapper else self.codegen()
  File "/dev/shm/uid-99/d2b830f6-seed-nspid4026547915_cgpid362302-ns-4026547912/torch/_inductor/graph.py", line 2002, in codegen
    self._update_scheduler()
  File "/dev/shm/uid-99/d2b830f6-seed-nspid4026547915_cgpid362302-ns-4026547912/torch/_inductor/graph.py", line 1996, in _update_scheduler
    self.scheduler = Scheduler(self.operations)
  File "/dev/shm/uid-99/d2b830f6-seed-nspid4026547915_cgpid362302-ns-4026547912/torch/_inductor/scheduler.py", line 1954, in __init__
    self._init(nodes)
  File "/dev/shm/uid-99/d2b830f6-seed-nspid4026547915_cgpid362302-ns-4026547912/torch/_inductor/scheduler.py", line 1974, in _init
    self.update_zero_dim_cpu_tensor()
  File "/dev/shm/uid-99/d2b830f6-seed-nspid4026547915_cgpid362302-ns-4026547912/torch/_inductor/scheduler.py", line 4433, in update_zero_dim_cpu_tensor
    and buffer.get_size() == []
  File "/dev/shm/uid-99/d2b830f6-seed-nspid4026547915_cgpid362302-ns-4026547912/torch/_inductor/ir.py", line 3903, in get_size
    return [*self.get_layout().size]
  File "/dev/shm/uid-99/d2b830f6-seed-nspid4026547915_cgpid362302-ns-4026547912/torch/_inductor/ir.py", line 3914, in get_layout
    raise NotImplementedError(type(self.layout).__name__)
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
NotImplementedError: NoneLayout
```

Test Plan: OP said the issue is fixed

Differential Revision: D72575808

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151321
Approved by: https://github.com/BoyuanFeng
2025-04-15 21:58:09 +00:00
4518b30680 Clarify that x and dx are mutually exclusive in torch.trapezoid doc (#151190)
This PR addresses [#151105](https://github.com/pytorch/pytorch/issues/151105) by stating that x and dx are mutually exclusive parameters in torch.trapezoid()
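
For reference, a quick illustration of the two mutually exclusive spellings:

```python
import torch

y = torch.tensor([1.0, 2.0, 3.0])

torch.trapezoid(y, dx=0.5)                            # uniform spacing via dx
torch.trapezoid(y, x=torch.tensor([0.0, 1.0, 3.0]))   # explicit sample points via x
# Supplying both x and dx in the same call is not supported; pick one.
```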

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151190
Approved by: https://github.com/soulitzer
2025-04-15 21:42:05 +00:00
630cf46039 [Cutlass] Codegen for EVT Epilogue (#150345)
Previously merged:
* #150344

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150345
Approved by: https://github.com/henrylhtsang, https://github.com/eellison
ghstack dependencies: #150344
2025-04-15 21:31:21 +00:00
27ef3f6cdc [ROCm][CI/CD] Create ROCm6.4 magma tarball (#151345)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151345
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-04-15 21:12:48 +00:00
71e7dcda87 [c10d][fr] Record each individual collective being coalesced (#151238)
Our FR recording for coalesced collectives is currently inconsistent: for P2P ops we log each individual collective into FR, but for non-P2P ops we don't. This PR makes non-P2P ops also log each individual collective into FR, so that a script can check the correctness of every collective that was coalesced.

Also the added unit test also address the unit test ask in the comment in https://github.com/pytorch/pytorch/pull/150863?fbclid=IwZXh0bgNhZW0CMTEAAR4a5Rd_JyJlrbKZcacbIv5WX5b4MqBRNn0hpgl-VTSD0eeXRlPZ9Ty_CPOYhQ_aem_ALEG1ibRajwie-rn1B4n5w#pullrequestreview-2751254224.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151238
Approved by: https://github.com/d4l3k, https://github.com/wconstab
ghstack dependencies: #151247
2025-04-15 20:56:37 +00:00
ae648f047c [c10d][fr] Enable FR analysis script for rest of all coalesce op (#151247)
We revisited how coalesced collectives work in https://github.com/pytorch/pytorch/pull/151243, and we now want to enable the script to work for the slow path. The change is indeed bc-breaking, but it is needed to make this work, and the API is internal-use only, not user facing. For the slow path, each individual collective has its input sizes and output sizes recorded but no state; only the final entry has the state ready. We check the correctness of each individual collective one by one, but we don't check state matching for these collectives; we can only check the state match for the last one, which is the work item carrying the coalesced label.

Added more unit tests for the slow path.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151247
Approved by: https://github.com/d4l3k, https://github.com/XilunWu
2025-04-15 20:53:03 +00:00
f98150fc8e Warn user of existing lock file to avoid infinite waiting (#149382)
Sometimes the Python script doesn't exit normally and the lock file remains on disk. In that case, `file_baton.py` may sleep forever waiting for the lock file to be released. This PR adds a warning that shows the path of the existing lock file, so the user knows which file to delete when the wait takes too long.
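
A rough sketch of a waiting loop with such a warning (illustrative only, not the actual `file_baton.py` code; names and thresholds are assumptions):

```python
import os
import time
import warnings

def wait_for_lock(lock_path: str, interval: float = 1.0, warn_every: float = 600.0) -> None:
    waited = 0.0
    while os.path.exists(lock_path):
        time.sleep(interval)
        waited += interval
        if waited >= warn_every:
            # Tell the user which file is blocking us, so a stale lock can be removed manually.
            warnings.warn(
                f"Still waiting on lock file '{lock_path}'; "
                "if no other build is running, delete it to proceed."
            )
            waited = 0.0
```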

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149382
Approved by: https://github.com/soulitzer
2025-04-15 20:25:29 +00:00
bd4cf30e31 Fix setUpClass() / tearDownClass() for device-specific tests (#151129)
Finishes up the work started in #121686 + adds test

Update: this was not as straightforward as I originally imagined. Context below.

**TL;DR:** `TestFoo{CPU, CUDA}` now actually derive from `TestFoo`! Also, `{CPU, CUDA}TestBase` setup / teardown logic is now always called (it is required to set the primary device), regardless of whether `super().setUpClass()` / `super().tearDownClass()` are called or not.

**Background:** The typical way to get device-specific tests is to write a generic `TestFoo` and call `instantiate_device_type_tests(TestFoo, locals())` to get `TestFooCPU`, `TestFooCUDA`, etc. After this, generic tests (e.g. `TestFoo.test_bar()`) become `TestFooCPU.test_bar_cpu()` / `TestFooCUDA.test_bar_cuda()`.

Behind the scenes, this was historically accomplished by creating a `TestFooCUDA` that derives from both a `CUDATestBase` and an *empty class* called `TestFoo_base`. This `TestFoo_base` has the same bases as `TestFoo`, but none of the test functions (e.g. `test_bar()`). The documented reason for this is to avoid things like a derived `TestFooCUDA.test_bar()` being discovered in addition to the real device-specific test `TestFooCUDA.test_bar_cuda()`.

(1) A reason this matters is because it should be possible to call e.g. `super().setUpClass()` from a custom setup / teardown classmethod. If the generated TestFooCUDA does not derive from TestFoo, but instead derives from the empty class described above, this syntax does not work; in fact there is no way to form a proper `super()` call that works across the device-specific test variants. Here's an example that breaks in the OpInfo tests:

070f389745/test/test_ops.py (L218-L221)

(2) Further, there is some precedent within a custom `setUpClass()` impl for storing things on the `cls` object to be accessed at test time. This must be the device-specific test class (`TestFooCUDA`) and not `TestFoo` for this to work. As an example, the open device registration tests load a module during setup and use it in the test logic:

070f389745/test/test_cpp_extensions_open_device_registration.py (L63-L77)

070f389745/test/test_cpp_extensions_open_device_registration.py (L79-L80)

To accomplish both (1) and (2) at the same time, I decided to revisit the idea of utilizing a proper inheritance hierarchy for `TestFoo` -> `{TestFooCPU, TestFooCUDA}`. That is: have TestFooCPU / TestFooCUDA **actually** derive from `TestFoo`. This achieves both (1) and (2). The only thing left is to make sure the generic tests (e.g. `TestFoo.test_bar()`) are not discoverable, as was the stated reason for diverging from this in the first place. It turns out we can simply `delattr()` these generic tests from `TestFoo` once `TestFooCPU` / `TestFooCUDA` have been setup with the device-specific variants, and all works well. The `instantiate_device_type_tests(...)` logic already deletes `TestFoo` from scope, so I don't see a problem with deleting generic tests from this base class as well (CI will prove me right or wrong ofc).
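
A toy sketch of the inheritance-plus-`delattr()` approach described above (class names are illustrative, not the real test-infra code):

```python
import unittest

class TestFoo(unittest.TestCase):
    def test_bar(self):
        pass

class CUDATestBase(unittest.TestCase):
    device_type = "cuda"

# The device-specific class now genuinely derives from TestFoo.
class TestFooCUDA(CUDATestBase, TestFoo):
    pass

# Copy the generic test onto the device class under a device-suffixed name...
TestFooCUDA.test_bar_cuda = TestFoo.test_bar
# ...then delete it from the base so only test_bar_cuda gets discovered.
delattr(TestFoo, "test_bar")
```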

**Side note:** I was encountering a weird race condition where sometimes the custom `setUpClass()` / `tearDownClass()` defined & swapped in [here](4a47dd9b3f/torch/testing/_internal/common_device_type.py (L940-L955)) would be used, and sometimes it wouldn't. This non-deterministic behavior was called out previously by @ngimel here:
4a47dd9b3f/test/inductor/test_torchinductor_dynamic_shapes.py (L128-L130)

To address this, I moved this block of logic to before the first call to `instantiate_test()`, as that method queries for the primary device, and the primary device identification logic may manually invoke `setUpClass()` (see [here](4a47dd9b3f/torch/testing/_internal/common_device_type.py (L381-L384))). Goal: define the `setUpClass()` / `tearDownClass()` we want for correctness before they're ever called. This seems to work and the behavior is deterministic now AFAICT.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151129
Approved by: https://github.com/janeyx99, https://github.com/masnesral, https://github.com/malfet
2025-04-15 20:13:26 +00:00
d77e0cddfe [Cutlass] Import cutlass python API for EVT (#150344)
This imports the pieces of the cutlass python API that are needed for python EVT tracing. It builds on existing importing for cutlass_library. Once EVT tracing has been added to cutlass_library (should be later this year) this can be removed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150344
Approved by: https://github.com/henrylhtsang, https://github.com/eellison
2025-04-15 20:11:40 +00:00
91923f0ee1 [inductor] disable alignment asserts in fbcode (#151274)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151274
Approved by: https://github.com/Mingming-Ding, https://github.com/Microve, https://github.com/eellison
2025-04-15 19:59:54 +00:00
a2632d5241 [HOP] Reworked DispatchKey.Autograd (#151107)
This PR intends to rework the dispatching of the autograd key.
Currently, the DispatchKey.Autograd of the HOPs was triggered even if none of the HOP's operands have `requires_grad=True`. With this rework, autograd is bypassed if none of the operands require gradients and is only invoked if any of them do.
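
A minimal sketch of the kind of predicate this implies (illustrative only; not the actual dispatch code):

```python
import torch
from torch.utils import _pytree as pytree

def hop_needs_autograd(*operands) -> bool:
    # Autograd is only worth invoking if at least one tensor operand requires grad.
    return any(
        isinstance(leaf, torch.Tensor) and leaf.requires_grad
        for leaf in pytree.tree_leaves(operands)
    )
```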

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151107
Approved by: https://github.com/ydwu4
2025-04-15 19:55:46 +00:00
19a33b20c2 [ROCm][CI/CD] create ROCm 6.4 images, part 1, skip magma tarball (#151236)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151236
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-04-15 19:45:15 +00:00
8d5f7ab06c Replace all random is_fbcode imports to environment (#151283)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151283
Approved by: https://github.com/masnesral, https://github.com/Skylion007
2025-04-15 19:42:58 +00:00
eea4a7b424 update expected results for comptime benchmark (#151319)
This PR https://github.com/pytorch/pytorch/pull/150594 bumped the benchmark up by ~1%, a bit under our 1.5% "regression" mark.

Modeled this PR after https://github.com/pytorch/pytorch/pull/144274

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151319
Approved by: https://github.com/jamesjwu, https://github.com/laithsakka
2025-04-15 19:40:13 +00:00
e45a6a9300 [inductor][test] Disable Triton GEMM backend tests for SM89 (#150485)
Motivation: To deprecate a silent fallback behavior https://github.com/pytorch/pytorch/issues/150390

Problem: On SM89, the Triton GEMM backend isn't working. This seems to be a pre-existing issue. I don't have access to SM89 to debug further.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150485
Approved by: https://github.com/xmfan, https://github.com/eellison
2025-04-15 19:03:52 +00:00
f1adf22b5f improve noop elimination for slice and slice_scatter (#151175)
Improves noop elimination for `slice` and `slice_scatter`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151175
Approved by: https://github.com/zou3519
2025-04-15 18:56:50 +00:00
d7050ef48b [CI] Run test_torchinductor for MPS device (#150821)
There are only 118 failures atm, mark them all with xfail to avoid new regressions

Add `xfail_if_mps_unimplemented` decorator to distinguish between tests that call unimplemented eager op vs ones that fail for some other reason.

Added an `aten._scaled_dot_product_attention_math_for_mps` fallback to make test behavior consistent between MacOS-15 (where the fallback is in place) and MacOS-14.

Weird MacOS-14 specific skips:
- test_torchinductor.py::GPUTests::test_cat_extern_kernel_mps
- test_torchinductor.py::GPUTests::test_sort_transpose_mps (likely an eager bug)
- test_torchinductor.py::GPUTests::test_unaligned_input_mps

Numerous MacOS-13 skips, including few eager hard crashes, for example running `test_torchinductor.py::GPUTests::test_scatter5_mps` causes
```
/AppleInternal/Library/BuildRoots/c651a45f-806e-11ed-a221-7ef33c48bc85/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MPSNDArray/Kernels/MPSNDArrayScatter.mm:309: failed assertion `Rank of destination array (1) must be greater than or equal to inner-most dimension of indices array (3)'
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150821
Approved by: https://github.com/ZainRizvi, https://github.com/dcci
ghstack dependencies: #151224, #151246, #151272, #151282, #151288
2025-04-15 18:42:39 +00:00
7e5f6dcf7f Add @requires_multicast_support to test_multimem_all_gather (#151227)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151227
Approved by: https://github.com/jeffdaily
2025-04-15 18:41:12 +00:00
83d88d128d [reland] Make export._trace._WrapperModule work in strict mode (#146919) (#151264)
Summary:

as title

`export._trace._WrapperModule` is used to wrap functions into a Module so we can export the function.

We add `export._wrapper_utils` to `dynamo`'s `MOD_INLINELIST` so dynamo traces into `_WrapperModule`

Fixes https://github.com/pytorch/pytorch/issues/146867

Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test:test_export -- -r wrapper_module
```

Differential Revision: D72986826

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151264
Approved by: https://github.com/angelayi
2025-04-15 18:35:34 +00:00
61f127aac5 [Export] fix automatically convert instances of _check(u>=0) to check_is_size() (#148844)
Fixes #148826

Understanding:

1. PyTorch should automatically convert instances of _check(u>=0) to check_is_size()
2. The export mechanism should suggest using check_is_size() instead of _check(u>=0) when applicable

Changes made:
1. Added a helper function to detect non-negative checks: is_non_negative_check
2. Modified the suggestion logic in _suggest_torch_checks to detect and handle non-negative checks
3. Added unit tests: test_is_non_negative_check_function, test_suggest_torch_checks_with_non_negative_check, and test_suggest_torch_checks_with_regular_check

unit tests:

(base) sany@sandishs-Laptop pytorch % pytest test/export/test_export.py::TestExport::test_suggest_torch_checks_with_non_negative_check
=================================== test session starts ==================
platform darwin -- Python 3.9.19, pytest-7.3.2, pluggy-1.5.0
rootdir: /Users/sany/git/pytorch
configfile: pytest.ini
plugins: xdoctest-1.1.0, cpp-2.3.0, flakefinder-1.1.0, anyio-4.6.0, rerunfailures-14.0, hypothesis-5.35.1, xdist-3.3.1, subtests-0.13.1, typeguard-4.3.0
collected 1 item
Running 1 items in this shard

test/export/test_export.py .                                                                                           [100%]

======================== 1 passed in 1.67s =======================
(base) sany@sandishs-Laptop pytorch % pytest test/export/test_export.py::TestExport::test_suggest_torch_checks_with_regular_check
======================= test session starts =================
platform darwin -- Python 3.9.19, pytest-7.3.2, pluggy-1.5.0
rootdir: /Users/sany/git/pytorch
configfile: pytest.ini
plugins: xdoctest-1.1.0, cpp-2.3.0, flakefinder-1.1.0, anyio-4.6.0, rerunfailures-14.0, hypothesis-5.35.1, xdist-3.3.1, subtests-0.13.1, typeguard-4.3.0
collected 1 item
Running 1 items in this shard

test/export/test_export.py .                                                                                           [100%]

================================= 1 passed in 1.61s ================
(base) sany@sandishs-Laptop pytorch % pytest test/export/test_export.py::TestExport::test_is_non_negative_check_function
================================ test session starts =============
platform darwin -- Python 3.9.19, pytest-7.3.2, pluggy-1.5.0
rootdir: /Users/sany/git/pytorch
configfile: pytest.ini
plugins: xdoctest-1.1.0, cpp-2.3.0, flakefinder-1.1.0, anyio-4.6.0, rerunfailures-14.0, hypothesis-5.35.1, xdist-3.3.1, subtests-0.13.1, typeguard-4.3.0
collected 1 item
Running 1 items in this shard

test/export/test_export.py .                                                                                           [100%]

======================= 1 passed in 1.62s =========================
(base) sany@sandishs-Laptop pytorch %

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148844
Approved by: https://github.com/laithsakka
2025-04-15 17:41:11 +00:00
74f6bc28a7 Revert "Add inductor standalone_compile API (#150670)"
This reverts commit c9aef508984a31f03821eaad381468673ef29c0a.

Reverted https://github.com/pytorch/pytorch/pull/150670 on behalf of https://github.com/Camyll due to breaking internal builds with torch module not found error ([comment](https://github.com/pytorch/pytorch/pull/150670#issuecomment-2806975267))
2025-04-15 17:35:59 +00:00
c0a0761871 [Inductor] Refactor wrapper codegen to use Wrapper IR. (#150458)
Preparatory refactor for https://github.com/pytorch/pytorch/pull/146942.

# Feature

This PR refactors the existing wrapper codegen into `WrapperLine` subclasses, extending the existing Memory Planning IR into a fully-fledged Wrapper IR. See the diagram below.

![wrapper_ir](https://github.com/user-attachments/assets/a61db21b-caf3-45d2-bfdb-91066ae4ba6b)

The IR currently supports the following ops:
- All existing memory planning IR ops (`AllocateLine`, `FreeIfNotReusedLine`, etc.)
- Reinterpret views (`ReinterpretLine`)
- Kernel definitions (`KernelDefinitionLine`)
- Calls to defined kernels (`KernelCallLine`)
- Calls to extern kernels (`ExternKernelLine`, `ExternKernelAllocLine`)
- Ops with multiple outputs (`MultiOutputLine`)
- Tensor cleanup at the end of a graph (`FreeLine`)
- Leaving comments in code (`CommentLine`)

There are two main motivations for this refactor:
1. Unlike free-form C++ and Python code, Wrapper IR lines provide structured information about what the wrapper code does. This serves as a natural extension point for other types of wrapper codegen. For example, the parent PR generates FX IR from Wrapper IR. Wrapper IR aims to give new backends enough information to generate wrapper code without needing to modify core Inductor files such as `ir.py`.
2. This design will hopefully promote stronger modularity and encapsulation.
   a. Inductor's core compilation passes don't need to worry about whether they're targeting Python, C++, FX or anything else. They can simply focus on generating Wrapper IR, and target-specific code can be refactored into the various backends.
   b. Backends do not need to know about all the details and internal state of `V.graph` IR. For example, they don't need to consider whether a buffer has been removed from the graph when generating code. Wrapper IR will hopefully provide a simpler interface for generating wrapper code, which abstracts away the details of device code.

# Implementation details

The implementation mainly consists of separating direct C++/Python codegen into two phases:
 1. Emit Wrapper IR lines describing what the wrapper code is supposed to do.
 2. Inside the `codegen()` method of each `WrapperLine`, call backend methods which generate pure Python/C++ code using the information stored in the Wrapper IR line. For example, `KernelCallLine` calls `wrapper._generate_kernel_call_helper`, which is overriden by the various Python and C++ backends to generate the final wrapper code.

The main difficulty in implementing this is that we need to be careful that code is generated in the correct order. Wrapper codegen happens in two passes: first we write code into `self.lines` which mainly contains wrapper IR, but can also contain raw Python or C++ lines in some situations. Then, we convert the wrapper IR into the final Python/C++ code in `self.wrapper_call`. Since the same macros may be used in both passes, it's difficult to ensure that code is written to the correct buffer. The easiest solution for this was to implement a context manager overriding the `writeline` method to write to  `self.wrapper_call` after memory planning is finished. This way, `writeline` writes to `self.lines` in the first pass, and `self.wrapper_call` in the second. This obviated the need to pass `code` or `writeline` variables all the way through the call stack, which would have touched most of the existing macros.
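
A stripped-down sketch of this two-phase shape (class and method names are illustrative, not the real Inductor classes):

```python
from dataclasses import dataclass

class WrapperBackend:
    """Phase 2 target: turns Wrapper IR lines into concrete wrapper code."""
    def __init__(self):
        self.lines = []          # phase 1: structured Wrapper IR
        self.wrapper_call = []   # phase 2: generated wrapper text

    def _generate_kernel_call_helper(self, kernel_name, call_args):
        # A Python backend emits Python here; a C++ backend would emit C++.
        self.wrapper_call.append(f"{kernel_name}({', '.join(map(str, call_args))})")

@dataclass
class KernelCallLine:
    kernel_name: str
    call_args: tuple

    def codegen(self, wrapper: WrapperBackend) -> None:
        # The IR line only records *what* to do; the backend decides *how* to spell it.
        wrapper._generate_kernel_call_helper(self.kernel_name, self.call_args)
```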

# Test plan

Since this refactor touches all the existing wrapper codegen classes, the existing CI provides good coverage.

The parent PR introduces new tests for the FX IR backend. Among other things, these tests assert that `self.lines` only contains Wrapper IR lines, and no free-form code. While this would not be true of all programs today, the tests suggests that the IR implemented in this PR is sufficient to cover basic PyTorch usage.

# Future directions

These two goals are only partially realized by this PR. These are several important steps which still undergo direct Python/C++ codegen in core files:
 - User-defined Triton kernels.
 - Reinterpret views on outputs, from `gen_output_refs()`. (In the parent PR, the FX converter has a custom way of handling this. This can eventually be ported into Wrapper IR.)
 -  Fallback ops with custom `codegen()` methods, e.g. `ScatterFallback`.
 -  Misc. C++ lines emitted by the various cpp backends, e.g. declaring constants.

These cases will gradually be handled in subsequent PRs, as the Inductor->FX converter expands its coverage. Given that these refactors are pretty tricky to do, it seems wiser to execute them in stages, as opposed to porting everything to Wrapper IR at once. Some Python and C++ codegen still lives in core files such as `ir.py`, as described in previous sections. Hopefully, this PR will serve as a starting point which moves the codebase towards a more modular design. Over time, we can gradually refactor the remaining codegen (mainly in `ir.py`) into backend classes.

One limitation of this PR is that codegen still happens in two phases during `PythonWrapperCodegen`. First, we generate Wrapper IR into `self.lines`, and from there we generate Python or C++ code into `self.wrapper_call`, `self.header`, etc. In the long term, it would be cleaner to split wrapper IR into its own class which doesn't deal with Python/C++ codegen at all. (See the diagram at the top.) That would strictly enforce the boundary between Wrapper IR and Python/C++ wrapper code. However, this would probably be a much larger refactor.

Another limitation of the current code is that the helper functions have a lot of call args. It's also possible to clean this up by passing Wrapper IR ops e.g. `KernelCallLine` into helper functions like `_generate_kernel_call_helper`, since they store all the arguments. However, that change would likely be prone to merge conflicts, so I would like to save it for follow-up PRs if possible.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150458
Approved by: https://github.com/eellison
2025-04-15 17:28:36 +00:00
8f440a8e70 don't return logits for benchmark script (#151075)
PT2 benchmark scripts have a pattern like:
```
    def forward_and_backward_pass(self, mod, inputs, collect_outputs=True):
        cloned_inputs = clone_inputs(inputs)
        self.optimizer_zero_grad(mod)
        with self.autocast(**self.autocast_arg):
            pred = mod(**cloned_inputs)
            loss = self.compute_loss(pred)
        self.grad_scaler.scale(loss).backward()
        self.optimizer_step()
        if collect_outputs:
            return collect_results(mod, pred, loss, cloned_inputs)
        return None
```
for training.

The collect_outputs argument is True only for accuracy testing and it's false for performance testing.

For the HF benchmark suite, a model usually returns a tuple (loss, logits). For performance testing, even though the logits are never used anywhere, dynamo has to keep them due to the control flow.

A few bad things happen if we keep the logits here:
1. Peak memory will be higher, since the logits are large and we cannot release their memory earlier.
2. We cannot do optimizations like chunking for the logits, because the tensor needs to be returned from the pre-grad graph.

Actually I think it's fine to not return logits at all.
- For training cases, checking loss and gradients for accuracy is good enough. It's hard to see two runs having mismatched logits but matching loss/gradients.
- Also, discarding logits as soon as possible for perf benchmarking makes the comparison fairer for us.

On the other hand, it may be interesting to let dynamo support something like dynamo.constexpr (similar to tl.constexpr). A variable annotated as dynamo.constexpr will be specialized at compile time and we can do more optimization (DCE e.g.) at compile time. (A small [repro](https://gist.github.com/shunting314/0912a8947028a904c34f361021b8024d))

Benchmark results here [link](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Fri%2C%2004%20Apr%202025%2018%3A03%3A26%20GMT&stopTime=Fri%2C%2011%20Apr%202025%2018%3A03%3A26%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(h100)&lBranch=gh/shunting314/204/head&lCommit=fe25dab3f65e1b0e9db0af03f7664af70fcc9c66&rBranch=main&rCommit=55e62ff74ad5614faf80b060c7bfc551e3b7af5a)
- HF 15% (1.51 -> 1.66 compression ratio) peak memory improvement
- I also see a 5% (2.74x -> 2.79x) perf win for HF. It could be true: we may generate more efficient kernels since we don't need to keep the logits and return them from the pre-grad graph. But I'll double-check.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151075
Approved by: https://github.com/eellison, https://github.com/jansel
2025-04-15 17:13:00 +00:00
7d205b22b5 [profiler][retry] don't disable CUPTI_LAZY_REINIT for cuda >= 12.6 (#151124)
Retry of https://github.com/pytorch/pytorch/pull/150957, which was reverted due to internal meta failures

Credit to @mgmtea who wrote the initial version of this PR: https://github.com/pytorch/pytorch/pull/146604

Context: CUPTI is the NVIDIA library that Kineto uses for collecting GPU-side info during profiling. The intended usage is to register a callback while you want profiling to occur, and then unregister the callback when you want profiling to stop. But a bug would cause crashes if CUPTI callbacks were de-registered when used with cudagraphs. The workaround was to disable "CUPTI_LAZY_REINIT" and "CUPTI_TEARDOWN" in Kineto - which prevents crashes, but can result in slower execution after profiling has occurred and completed.

This bug is believed to be fixed in CUDA >= 12.6, so this PR qualifies that DISABLE_CUPTI_LAZY_REINIT=1 and CUPTI_TEARDOWN=0 should only be applied if CUDA < 12.6. Additionally, `profiler_allow_cudagraph_cupti_lazy_reinit_cuda12()` is added as an escape hatch so that we can add a killswitch in case we see more crashes related to this.
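
A sketch of the version gating this describes (the helper name and exact placement are assumptions, not the actual diff):

```python
import torch

def _should_apply_cupti_workaround() -> bool:
    # Only disable CUPTI lazy re-init / teardown on CUDA builds older than 12.6,
    # where the cudagraphs + CUPTI de-registration crash is still believed to exist.
    if torch.version.cuda is None:
        return False
    major, minor = (int(v) for v in torch.version.cuda.split(".")[:2])
    return (major, minor) < (12, 6)
```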

Differential Revision: [D72842114](https://our.internmc.facebook.com/intern/diff/D72842114/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D72842114/)!

Differential Revision: [D72842114](https://our.internmc.facebook.com/intern/diff/D72842114)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151124
Approved by: https://github.com/sraikund16
2025-04-15 16:11:49 +00:00
c5de6ff079 Remove ls from filesystem base (#151117)
Summary: A user reported an issue where they are inheriting from filesystembase but don't have the `ls` method, which was added in the PR https://github.com/pytorch/pytorch/pull/150701#discussion_r2039840129. Removing the method from the base class but keeping it in the derived class.

Test Plan: buck test 'fbcode//mode/opt' fbcode//caffe2/test/distributed/checkpoint:test_hf_storage

Differential Revision: D72867722

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151117
Approved by: https://github.com/Skylion007, https://github.com/lw
2025-04-15 14:45:20 +00:00
f1f18c75c9 Gracefully handle optree less than minimum version, part 2 (#151257)
If optree is less than the minimum version, we should pretend it doesn't
exist.

The problem right now is:
- Install optree==0.12.1
- `import torch._dynamo`
- This raise an error "min optree version is 0.13.0"

The fix is to pretend optree doesn't exist if it is less than the min
version.

There are ways to clean up this PR more (e.g. have a single source of
truth for the version, some of the variables are redundant), but I am
trying to reduce the risk as much as possible for this to go into 2.7.
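
A sketch of the guard pattern described above (the version constant and variable names are illustrative, and a real implementation would also handle pre-release version suffixes):

```python
# Treat an optree older than the minimum supported version as if it were absent.
MIN_OPTREE_VERSION = (0, 13, 0)

try:
    import optree
    _version = tuple(int(p) for p in optree.__version__.split(".")[:3])
    cxx_pytree_exists = _version >= MIN_OPTREE_VERSION
except ImportError:
    cxx_pytree_exists = False
```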

Test Plan:

I verified the above problem was fixed. Also tried some other things,
like the following, which now gives the expected behavior.
```py
>>> import torch
>>> import optree
>>> optree.__version__
'0.12.1'
>>> import torch._dynamo
>>> import torch._dynamo.polyfills.pytree
>>> import torch.utils._pytree
>>> import torch.utils._cxx_pytree
ImportError: torch.utils._cxx_pytree depends on optree, which is an optional dependency of PyTorch. To use it, please upgrade your optree package to >= 0.13.0
```

I also audited all non-test callsites of optree and torch.utils._cxx_pytree.
Follow along with me:

optree imports
- torch.utils._cxx_pytree. This is fine.
- [guarded by check] f76b7ef33c/torch/_dynamo/polyfills/pytree.py (L29-L31)

_cxx_pytree imports
- [guarded by check] torch.utils._pytree (changed in this PR)
- [guarded by check] torch/_dynamo/polyfills/pytree.py (changed in this PR)
- [guarded by try-catch] f76b7ef33c/torch/distributed/_functional_collectives.py (L17)
- [guarded by try-catch] f76b7ef33c/torch/distributed/tensor/_op_schema.py (L15)
- [guarded by try-catch] f76b7ef33c/torch/distributed/tensor/_dispatch.py (L35)
- [guarded by try-catch] f76b7ef33c/torch/_dynamo/variables/user_defined.py (L94)
- [guarded by try-catch] f76b7ef33c/torch/distributed/tensor/experimental/_func_map.py (L14)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151257
Approved by: https://github.com/malfet, https://github.com/XuehaiPan
2025-04-15 13:08:26 +00:00
12cb11a268 [Inductor UT] Refactor FlexAttention UT and add CPU tests (#144953)
This PR extends and refines all rest UTs for CPU and more devices in `test/inductor/test_flex_attention.py`  and `test/inductor/test_flex_decoding.py`, as a follow-up to https://github.com/pytorch/pytorch/pull/141453

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144953
Approved by: https://github.com/drisspg
2025-04-15 12:44:49 +00:00
2180e87d7c [fbgemm_gpu] Incorporate Torch DSA (#151148)
Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/1035

X-link: https://github.com/pytorch/FBGEMM/pull/3950

- Incorporte the PyTorch DSA infrastructure into the FBGEMM kernel launcher
  utility

Test Plan:
```
# Nvidia
buck2 test 'fbcode//mode/opt' fbcode//deeplearning/fbgemm/fbgemm_gpu/test/utils:tensor_accessor_builder
buck2 test 'fbcode//mode/opt' fbcode//deeplearning/fbgemm/fbgemm_gpu/test/utils:tensor_accessor_builder_with_memcheck
buck2 run 'fbcode//mode/opt'  -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=a100  -c fbcode.platform=platform010 fbcode//deeplearning/fbgemm/fbgemm_gpu/test/utils:kernel_launcher

# AMD
buck2 run mode/opt-amd-gpu -c fbcode.platform=platform010 fbcode//deeplearning/fbgemm/fbgemm_gpu/test/utils:tensor_accessor_builder_with_memcheck
buck2 run mode/opt-amd-gpu -c fbcode.platform=platform010 fbcode//deeplearning/fbgemm/fbgemm_gpu/test/utils:kernel_launcher
buck2 run mode/opt-amd-gpu -c fbcode.platform=platform010 fbcode//deeplearning/fbgemm/fbgemm_gpu/test/tbe:split_embeddings_utils
```

Differential Revision: D72759030

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151148
Approved by: https://github.com/huydhn
2025-04-15 11:34:04 +00:00
70e7b76707 [AOTInductor] Add Python interface for user managed buffer. (#151141)
Summary: Add pybind for user managed buffer in update_constants_buffer.

Test Plan:
Included in commit.
```
python test/inductor/test_aot_inductor.py -k user_managed
```

Differential Revision: D72892310

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151141
Approved by: https://github.com/henrylhtsang, https://github.com/desertfire
2025-04-15 09:36:30 +00:00
bd9c436c99 [Intel GPU][PT2E] Register qconv impls to general qconv_pointwise schema (#151092)
# Motivation
Refer to https://github.com/pytorch/pytorch/pull/150751, where the general schema for `qconv_pointwise` was added and `qconv2d_pointwise` was removed in callers. This PR registers the XPU backend implementations for this operator.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151092
Approved by: https://github.com/EikanWang, https://github.com/guangyey
2025-04-15 08:42:14 +00:00
a756c50315 [Intel GPU] Avoid using fp32 in sdp math path when benchmark performance. (#150996)
SDPA on XPU will fall back to the math path in some cases (i.e. training). In the dynamo benchmarks, we prefer to use fp16 for better performance. Although `allow_fp16_bf16_reduction_math_sdp` is under backends.cuda, its implementation applies to all devices.

I didn't add an `if device == xpu` check here; I assume CUDA devices will not run into the math path anyway.
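
For reference, a usage sketch of the toggle mentioned above, based on the description that the flag lives under backends.cuda but affects the math path on all devices:

```python
import torch

# Despite the backends.cuda namespace, this affects the math SDPA path on all devices.
torch.backends.cuda.allow_fp16_bf16_reduction_math_sdp(True)
```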

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150996
Approved by: https://github.com/drisspg, https://github.com/EikanWang
2025-04-15 08:08:01 +00:00
ccfce9ae86 Fix score_mod.py dynamic max autotune for backward (#151270)
Same as https://github.com/pytorch/pytorch/pull/148991 but this PR fixes the backward path.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151270
Approved by: https://github.com/drisspg, https://github.com/bobrenjc93
2025-04-15 06:33:37 +00:00
afaadce083 [MPSInductor] Adjust memory format detection (#151288)
The MPS conv implementation will only yield channels-last output if the input is in channels_last format.
Fixes `TestGPUTests.test_conv2d_backward_channels_last` on MacOS-15.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151288
Approved by: https://github.com/jansel, https://github.com/dcci
ghstack dependencies: #151224, #151246, #151272, #151282
2025-04-15 06:25:00 +00:00
b8a2824755 [MPS] Fix logit output for half/bfloat (#151282)
This also fixes the MPSInductor pointwise test.
TODO (as follow-up PRs): get rid of the special native_function.yaml dispatches and use a stub.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151282
Approved by: https://github.com/dcci
ghstack dependencies: #151224, #151246, #151272
2025-04-15 06:25:00 +00:00
a2f7764507 [Dynamo] Fix the unimplemented_v2 of EventVariable.call_method in ctx_manager.py (#151208)
Changes:
- The `explanations` field should be a `str` instead of a `tuple`.
- Not only `torch.cuda.Event` but also `torch.xpu.Event` can trigger this message.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151208
Approved by: https://github.com/Skylion007
2025-04-15 05:26:39 +00:00
9e20a8411b make einsum unbacked friendly (#151032)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151032
Approved by: https://github.com/pianpwk
2025-04-15 04:35:17 +00:00
5a51de5ab1 [cutlass backend] Add more logs for cutlass backend benchmark (#150639)
The goal is to have a way to compare whether a change makes things better or worse.

```
Average edge over aten (max(-edge, 0), higher is better):
triton: 8.596507086950552 (from 6 valid values)
triton_persistent_tma: 9.517193693923307 (from 6 valid values)
cutlass_lvl_default: 3.3234737908691785 (from 6 valid values)
cutlass_lvl_1111: 7.088173348313991 (from 6 valid values)
cutlass_lvl_2222: 7.291869722320318 (from 6 valid values)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150639
Approved by: https://github.com/ColinPeppler
2025-04-15 04:19:51 +00:00
48b4bc1640 [c10d][fr] Enable FR analysis script for all fast-path coalesce op (#151243)
This PR is to enable FR for all coalesce ops for fast path. (batch p2p is enabled in the current script, so we will mainly focus on non-P2P ops). To explain what is fast path, let's revisit how coalesced collective is working today:

For non-P2P coalesced ops, there are several ways to call them (due to legacy reasons):

- Way one: Directly call a python api like all_reduce_coalesced in python; this will be deprecated soon.
- Way two: Directly call the api inside PGNCCL, like allreduce_coalesced. Way one will eventually call into this. This is not deprecated and will not be deprecated, IIUC.
- Way three: Using _coalescing_manager in python, like:
```
with _coalescing_manager():
    for i in range(num_colls):
           dist.all_reduce(tensors[i])
```
This way has two path:
   - Fast path: when users call all-reduce, all-gather-into-tensor or reduce-scatter, we will only launch one big collective by calling the api from case 1.
   - Slow path: we call startCoalescing() in the beginning and then a bunch of collectives (each one will generate a FR entry) and then endCoalescing(). Inside startCoalescing(), groupStart() is called and inside endCoalescing(), groupEnd() is then called. So although this is going to be one collective, we call into PGNCCL for each collective coalesced in the slow path case.
   - For uneven all-gather (allgather_v) and reduce-scatter, it follows the pattern mention in slow path. It directly call cpp api inside PGNCCL.

This PR addresses the fast path, because it is the easy case: we store the collectives' info on the Python side, and we only call into PGNCCL once, so there will be only one work object and one FR entry. We can just treat them as regular coalesced collectives.

We add some e2e unit tests for the build_db function so that the change to FR is more thoroughly tested.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151243
Approved by: https://github.com/d4l3k, https://github.com/wz337
2025-04-15 04:08:28 +00:00
f66229de2b [dynamo] Remove traceable_tensor_subclasses-related code (#151062)
Since #149792 deprecates `traceable_tensor_subclasses` and it's been
landed for over a week, we can safely remove all the old code that uses
`traceable_tensor_subclasses` (they were primarily for testing purposes
and are equivalent to no-ops now).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151062
Approved by: https://github.com/mlazos, https://github.com/anijain2305
ghstack dependencies: #151060, #151061
2025-04-15 03:55:35 +00:00
6a1499d209 [dynamo] handle tensor subclass with non-classmethod __torch_function__ (#151061)
As title, this patch fixes bugs in
1. emulating `has_torch_function`
2. emulating calling `__torch_function__`
3. building a callable VT for non-classmethod `__torch_function__`

Fixes #120799, #150265, #150848.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151061
Approved by: https://github.com/anijain2305, https://github.com/mlazos
ghstack dependencies: #151060
2025-04-15 03:55:34 +00:00
73129b8974 [dynamo] Properly handle super().some_classmethod(...) (#151060)
Previously we were passing in the instance as first argument to a
`super().some_classmethod(...)` call, but we should've passed in the
type object instead, per semantics of `@classmethod`.
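
The plain-Python semantics being matched here, as a small example:

```python
class Base:
    @classmethod
    def which(cls):
        return cls

class Child(Base):
    def call_parent(self):
        # super().which() must receive the type object (Child) as cls, not the instance.
        return super().which()

assert Child().call_parent() is Child
```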

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151060
Approved by: https://github.com/Skylion007, https://github.com/mlazos, https://github.com/anijain2305
2025-04-15 03:55:34 +00:00
e178a3aa94 clang-format CUDASymmetricMemory.cu (#151260)
Ported from #146592

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151260
Approved by: https://github.com/Skylion007
2025-04-15 02:00:34 +00:00
25803d3a22 Optimize typing in lr_scheduler.py (#151219)
## Changes

- Add typing annotation in `lr_scheduler.py`

## Test Result

```bash
pytest test/optim/test_lrscheduler.py -vv
```

![image](https://github.com/user-attachments/assets/34a91965-ff3a-462a-9ab0-b46ad4b290e9)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151219
Approved by: https://github.com/janeyx99
2025-04-15 01:00:13 +00:00
4ede6705b5 test_store: fix timeout for test_queues (#151252)
Fixes #151216, #151215

Previously I forgot to revert the timeout after setting it for the timeout test.

To prevent this in the future I split the test into 3 different tests so timeout testing is isolated.

Test plan:

Stress tested

```
pytest test/distributed/test_store.py -k queue -v -s --minutes 10
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151252
Approved by: https://github.com/XilunWu
2025-04-15 00:44:19 +00:00
263f08e119 [PP] Add schedule visualizer (#150347)
Added a new private file (`_schedule_visualizer.py`) with some helper methods that can be used to visualize the operations of a schedule and plot with matplotlib.

InterleavedZeroBubble(pp_group=4, microbatches=8):
![image](https://github.com/user-attachments/assets/610ba9a8-7d18-4a99-bcad-6f43e5b23c8c)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150347
Approved by: https://github.com/kwen2501
2025-04-15 00:38:18 +00:00
070357b61a [MPSInductor] Fix silent correctness in bitcast (#151272)
By using Metal `as_type` which according to documentation does exactly
that:
> Metal adds an as_type<type-id> operator to allow any scalar or vector data type (that is not
a pointer) to be reinterpreted as another scalar or vector data type of the same size. The bits in
the operand are returned directly without modification as the new type. The usual type
promotion for function arguments is not performed.

Using `reinterpret_cast` created a potential silent correctness error when dtypes of different sizes were bitcast to each other.
Also add an explicit cast to src_type to avoid errors due to type promotion (i.e. something like `(x+1).view(dtype=torch.float16)` would work correctly in eager mode for an int16 dtype, but would fail in compile, as arithmetic operations promote int16 to int32).
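
The eager-mode pattern referred to above, spelled out:

```python
import torch

x = torch.tensor([1, 2, 3], dtype=torch.int16)
# Eager keeps the arithmetic in int16, so the 2-byte bitcast to float16 is valid.
y = (x + 1).view(dtype=torch.float16)
```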

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151272
Approved by: https://github.com/dcci
ghstack dependencies: #151224, #151246
2025-04-14 23:39:42 +00:00
508b882513 [dynamo][invoke_subgraph] Use FxGraphModule comparison instead of hashing (#150911)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150911
Approved by: https://github.com/zou3519
2025-04-14 23:34:26 +00:00
a24a9c42fb [ROCm] Improve behavior of get_torch_rocm_version helper function on non-ROCm systems. (#151040)
Fixes #150041

Return a zero tuple when ROCm is _not_ supported, similar to what is done for the CUDA version of this function.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151040
Approved by: https://github.com/jeffdaily
2025-04-14 22:50:07 +00:00
c9aef50898 Add inductor standalone_compile API (#150670)
This PR adds standalone_compile API that does precompilation via caching to support vLLM use case in the short term while we work on the longer term precompilation solution.

```
standalone_compile(gm, example_inputs, options) -> CompiledArtifact
CompiledArtifact.save(path, format: binary|unpacked = binary)
CompiledArtifact.load(path, format: binary|unpacked = binary)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150670
Approved by: https://github.com/jamesjwu, https://github.com/zou3519
2025-04-14 22:00:09 +00:00
4a47dd9b3f Revert "[map] always turn on dynamo for map (#150962)"
This reverts commit a72d56cb6be8c6ded5678b0b98003c90fd1b5a71.

Reverted https://github.com/pytorch/pytorch/pull/150962 on behalf of https://github.com/Camyll due to breaking internal builds {SHORT_REASON} ([comment](https://github.com/pytorch/pytorch/pull/150962#issuecomment-2803006282))
2025-04-14 21:09:22 +00:00
6a77a0a50c Revert "[map] make proxy mode re-dispatch to fake key (#151034)"
This reverts commit ca2e8cd3528635526a3fe09444139ffa748e97be.

Reverted https://github.com/pytorch/pytorch/pull/151034 on behalf of https://github.com/Camyll due to breaking internal builds {SHORT_REASON} ([comment](https://github.com/pytorch/pytorch/pull/150962#issuecomment-2803006282))
2025-04-14 21:09:21 +00:00
070f389745 Mark auto_functionalized HOPs as cacheable (#151194)
Fixes #151188

Test Plan:
- new tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151194
Approved by: https://github.com/oulgen, https://github.com/anijain2305
ghstack dependencies: #151193
2025-04-14 20:05:32 +00:00
dea50b0778 Improve sort with non-constant keys error message (#151193)
Fixes https://github.com/pytorch/pytorch/issues/143505

Test Plan:
- new test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151193
Approved by: https://github.com/jansel, https://github.com/anijain2305, https://github.com/williamwen42
2025-04-14 20:05:32 +00:00
46ce8f7df6 [MPSInductor] Cast halfs to floats (#151246)
To avoid accuracy issues when small reductions are unrolled, cast half to float during the `load` op, as `op_math_t<half>` is indeed float.

This fixes `test_unroll_small_reduction` for reduced precision types

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151246
Approved by: https://github.com/dcci
ghstack dependencies: #151224
2025-04-14 19:47:04 +00:00
0a6e1d6b9b Expand docs for nn.functional, and make the wording consistent (#148436)
Expands the docs for the loss functions, and makes the wording consistent.

Fixes #148353

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148436
Approved by: https://github.com/albanD
2025-04-14 19:37:12 +00:00
23a3cef5d9 [c10d] Add _allgather_base , reduce_scatter , and _reduce_scatter_base into ProcessGroupMPI to enable FSDP with MPI backend (#150162)
This PR implements _allgather_base, reduce_scatter, and _reduce_scatter_base in the MPI backend (ProcessGroupMPI), enabling support for Fully Sharded Data Parallel (FSDP) in environments that use MPI for distributed communication.

### Context

As noted in https://github.com/pytorch/pytorch/issues/85628, FSDP currently supports only the NCCL backend. Due to this limitation, FSDP cannot run on legacy HPC environments or clusters that rely on MPI.

By implementing just these three collective operations, we can enable FSDP to work with the MPI backend. These collectives are implemented in a similar manner to existing operations such as allgather.
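
A small usage sketch of the collectives now backed by these implementations (assumes a PyTorch build with MPI support, launched via mpirun):

```python
import torch
import torch.distributed as dist

dist.init_process_group(backend="mpi")
rank, world_size = dist.get_rank(), dist.get_world_size()

inp = torch.full((4,), float(rank))
out = torch.empty(world_size * 4)
dist.all_gather_into_tensor(out, inp)       # routed through the new _allgather_base

scattered = torch.empty(4)
dist.reduce_scatter_tensor(scattered, out)  # routed through the new _reduce_scatter_base
```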

### Testing

We validated this PR using pytorch/build/bin/ProcessGroupMPITest with OpenMPI, and all tests passed successfully.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150162
Approved by: https://github.com/H-Huang
2025-04-14 19:31:38 +00:00
7deed1946f Fix assert_tensor_meta (#150808)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150808
Approved by: https://github.com/pianpwk
ghstack dependencies: #150806, #150807
2025-04-14 19:28:54 +00:00
53528440e1 Generate meta kernel with operator profiles (#150807)
Added a context manager, `torch._library.fake_profile.register_fake_profile(op_profiles)`, where given an operator profile, it will generate and register a fake impl for the operator based on the operator profile.

The input to `register_fake_profile` is a dictionary mapping operator name to a set of profiles which describe the input and outputs of the operator. Here's an example of a profile for `mylib.foo.default`:
```
"mylib.foo.default": {
    OpProfile(
        args_profile=(
            TensorMetadata(rank=2, dtype=torch.float32, device=torch.device("cpu"), layout=torch.strided,),
            TensorMetadata(rank=2, dtype=torch.float32, device=torch.device("cpu"), layout=torch.strided,),
        ),
        out_profile=TensorMetadata(rank=2, dtype=torch.float32, device=torch.device("cpu"), layout=torch.strided,),
    )
}
```
`foo`'s profile contains only one profile, which says that for 2 input tensors of rank 2, dtype float32, device cpu, we will return one tensor of rank 2, dtype float32, and device cpu.

This will then generate a fake kernel where given 2 input tensors of rank 2 (and the other tensor metadata), we will output one tensor of rank 2 (and the other tensor metadata). If the operator also supports other input ranks, then we can add to the profile for the fake impl to support more input types.

This profile can either be manually written or created by draft-export, and then checked into the codebase.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150807
Approved by: https://github.com/zou3519
ghstack dependencies: #150806
2025-04-14 19:28:54 +00:00
901e37515f [ONNX] Fix bfloat16 support in onnx_program callable (#151121)
- Added a test to guard bfloat16. The optimizer incorrectly turns bfloat16 initializers into uint16, but this is not relevant to export logic.
- Fix bfloat16 support in onnx_program callable

Tested with the following with cuda

```py
import torch

class BfloatModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.param = torch.nn.Parameter(torch.tensor(2.0, dtype=torch.bfloat16))

    def forward(self, x):
        return x * torch.tensor(1.0, dtype=torch.bfloat16) * self.param

input = torch.randn(1, 10, dtype=torch.bfloat16)
model = BfloatModel()
onnx_program = torch.onnx.export(model, (input,), dynamo=True, optimize=False, verify=True)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151121
Approved by: https://github.com/titaiwangms

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-04-14 19:27:29 +00:00
f76b7ef33c Add error check for out variant of tensordot function with requries_grad tensor (#150270)
Fixes #147846. Previously there was no error for the out variant of `tensordot` when `requires_grad=True`, which can cause issues when the out tensor is part of a computation graph.

Enforces that the out variant of tensordot runs without `requires_grad=True` being set. The change is the same as in #117067.
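
A minimal illustration of the now-rejected pattern:

```python
import torch

a = torch.randn(3, 4, requires_grad=True)
b = torch.randn(4, 5)
out = torch.empty(3, 5)

# With this change, using out= on inputs that require grad raises an error
# instead of silently producing a tensor detached from the autograd graph.
torch.tensordot(a, b, dims=1, out=out)
```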

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150270
Approved by: https://github.com/soulitzer
2025-04-14 18:43:14 +00:00
1f5af12cd9 Using hasattr for _boxed_call is asking for trouble (#151130)
Summary:
There are a number of places in the code checking for the existence of `_boxed_call` instead of checking for a `True` value. This is somewhat dangerous because one would assume that setting it to `None` or `False` would be the same as not setting it (output_code.py does this, for example).

Change `hasattr()` to `getattr(..., False)` for these cases.
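
The pattern being changed, in isolation:

```python
class Compiled:
    _boxed_call = None  # "unset" in spirit, but hasattr() still sees it

fn = Compiled()

hasattr(fn, "_boxed_call")          # True  -- treats None/False as "boxed"
getattr(fn, "_boxed_call", False)   # None  -- falsy, so the check now behaves as intended
```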

Test Plan: unit tests pass

Differential Revision: D72806693

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151130
Approved by: https://github.com/Skylion007
2025-04-14 18:36:30 +00:00
6dddd6520d [dynamic shapes] add sym_and, sym_or (#150456)
This has been pretty helpful for the size-oblivious rewrite. Wanted the variadic args version to avoid `sym_or(a, sym_or(b, sym_or(c, d)))` in favor of `sym_or(a, b, c, d)`. Happy to change this to ban the 1-arg version.

This is better than plain and/or because the whole symbolic expression gets preserved, and if we guard on it or defer as a runtime assert, we preserve all branches.
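
A hedged usage sketch; the import path is an assumption about where these symbolic-shape helpers live:

```python
import torch
from torch.fx.experimental.symbolic_shapes import sym_or  # import path is an assumption

def check_supported_batch(batch):
    # One symbolic expression covering every branch, instead of nested Python `or`s
    # that would lose the untaken branches when guarded on or deferred.
    torch._check(sym_or(batch == 1, batch == 8, batch == 16))
```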

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150456
Approved by: https://github.com/laithsakka
2025-04-14 18:18:06 +00:00
785495ee29 [dynamo][error message] Hint for dict_items as inputs to the compiled region (#151169)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151169
Approved by: https://github.com/zou3519
ghstack dependencies: #151164, #151168
2025-04-14 17:38:20 +00:00
3c46808a14 [dynamo] Graph break fixes while tracing inspect module (#151168)
Fixes https://github.com/pytorch/pytorch/issues/139374

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151168
Approved by: https://github.com/jansel
ghstack dependencies: #151164
2025-04-14 17:38:20 +00:00
b0bdd76f2e [scan] Autograd with partial gradient support (#146285)
This PR introduces the Autograd feature for scan with partial gradient support. It is a combination of the already opened PRs: https://github.com/pytorch/pytorch/pull/135631 and https://github.com/bohnstingl/pytorch/pull/4

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146285
Approved by: https://github.com/ydwu4

Co-authored-by: Yidi Wu <yidi@meta.com>
2025-04-14 17:01:31 +00:00
50abc1ecc4 Super tiny fix typo (#151212)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151212
Approved by: https://github.com/Skylion007
2025-04-14 16:47:40 +00:00
184ac8c7f7 [MPSInductor] Fix noop codegen (#151224)
By adding `pass` in front of the comment for the fake set_device call.
This fixes `TestGPU.test_zero_element_mutation_mps`, which previously
failed with
```
torch._inductor.exc.InductorError: RuntimeError: Failed to import /var/folders/sc/2thx6_x95h7_h9qs8s48yh140000gn/T/tmp2emka_sx/7k/c7kmnwhb363ysalhewglr3cwtej6tiz3t4ppqa4bvhubaokmlprw.py
IndentationError: expected an indented block after 'with' statement on line 38 (c7kmnwhb363ysalhewglr3cwtej6tiz3t4ppqa4bvhubaokmlprw.py, line 40)
```
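
For context, a minimal Python illustration of why the generated code needed an explicit `pass` (unrelated to the Inductor codegen itself):

```python
import io

# A `with` block whose body is only a comment has no statements, which is what
# triggered the IndentationError above; emitting `pass` makes the file parse.
with io.StringIO() as f:
    # fake set_device(0) rendered as a comment only
    pass  # <- without this line, the generated file fails to import
```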

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151224
Approved by: https://github.com/Skylion007, https://github.com/jansel, https://github.com/dcci
2025-04-14 16:38:47 +00:00
001695c397 [ROCm][CI] Enable distributed CI on MI300 (#150667)
* Enable distributed CI on MI300 runners, same schedule-based and release-branch triggers as `periodic.yml`; also uses label `ciflow/periodic-rocm-mi300` for triggering on PRs.
* Disabled failing distributed tests on MI300 via Github issues: [151077](https://github.com/pytorch/pytorch/issues/151077), [151078](https://github.com/pytorch/pytorch/issues/151078), [151081](https://github.com/pytorch/pytorch/issues/151081), [151082](https://github.com/pytorch/pytorch/issues/151082), [151083](https://github.com/pytorch/pytorch/issues/151083), [151084](https://github.com/pytorch/pytorch/issues/151084), [151085](https://github.com/pytorch/pytorch/issues/151085), [151086](https://github.com/pytorch/pytorch/issues/151086), [151087](https://github.com/pytorch/pytorch/issues/151087), [151088](https://github.com/pytorch/pytorch/issues/151088), [151089](https://github.com/pytorch/pytorch/issues/151089), [151090](https://github.com/pytorch/pytorch/issues/151090), [151153](https://github.com/pytorch/pytorch/issues/151153)
* Disable failing distributed tests via `skipIfRocm`: ea9315ff95

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150667
Approved by: https://github.com/jeffdaily
2025-04-14 16:19:04 +00:00
cyy
eb19f5abab [2/N] Use internal linkage in aten C++ files (#151070)
Make functions and variables static if they are not used outside the aten cpp files. In some cases, missing header inclusions are added; in other cases, unused functions are removed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151070
Approved by: https://github.com/Skylion007
2025-04-14 16:07:17 +00:00
24b3ab9255 Revert "Add inductor standalone_compile API (#150670)"
This reverts commit bbc5fe850454df6860814ab77a1f3a4ca3698157.

Reverted https://github.com/pytorch/pytorch/pull/150670 on behalf of https://github.com/albanD due to Broke profiler test ([comment](https://github.com/pytorch/pytorch/pull/150670#issuecomment-2802067144))
2025-04-14 15:22:33 +00:00
d99236b68c Optimize cdist param description (#151178)
Fixes #151101

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151178
Approved by: https://github.com/soulitzer
2025-04-14 13:53:10 +00:00
8497491f38 [ez] remove unused arg in _create_wrapped_callback (#151179)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151179
Approved by: https://github.com/anijain2305, https://github.com/Skylion007
ghstack dependencies: #150753, #150754, #150755, #150828
2025-04-14 12:54:23 +00:00
d5a19e4525 [ez] dynamo fix typo in comment (#150828)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150828
Approved by: https://github.com/anijain2305, https://github.com/Skylion007
ghstack dependencies: #150753, #150754, #150755
2025-04-14 10:09:28 +00:00
5eebcb991a Add scripts to generate plots of LRSchedulers (#149189)
Fixes #92007

## Changes

- Add script to generate plots for `lr_scheduler`
- Add plots to `lr_scheduler` docs
- Add an example section where it is missing in `lr_scheduler` docs

## Test Result

### LambdaLR

![image](https://github.com/user-attachments/assets/37fc0894-e2ec-48f2-a2d6-3514e51e1ea2)

### MultiplicativeLR

![image](https://github.com/user-attachments/assets/2122b3a0-a4ce-42c7-bb45-559c1fc73e0f)

### StepLR

![image](https://github.com/user-attachments/assets/47bc9d96-4b60-4586-a000-f213583bbe8f)

### MultiStepLR

![image](https://github.com/user-attachments/assets/c822b849-d5be-4b94-aa7a-0017a2c9ff15)

### ConstantLR

![image](https://github.com/user-attachments/assets/83107cdd-7b00-44a6-b09d-e8ee849b4a12)

### LinearLR

![image](https://github.com/user-attachments/assets/60190105-691a-4101-8966-5b0c396093a4)

### ExponentialLR

![image](https://github.com/user-attachments/assets/dfcbcbca-89e5-4a2f-b1bd-33e25d2405ec)

### PolynomialLR

![image](https://github.com/user-attachments/assets/7c3d4fce-c846-40a0-b62e-f3e81c7e08bd)

### CosineAnnealingLR

![image](https://github.com/user-attachments/assets/26712769-dde9-4faa-b61b-e23c51daef50)

### ChainedScheduler

![image](https://github.com/user-attachments/assets/20734a8b-e939-424f-b45a-773f86f020b1)

### SequentialLR

![image](https://github.com/user-attachments/assets/2cd3ed67-2a0a-4c42-9ad2-e0be090d3751)

### ReduceLROnPlateau

![image](https://github.com/user-attachments/assets/b77f641e-4810-450d-b2cd-8b3f134ea188)

### CyclicLR

![image](https://github.com/user-attachments/assets/29b8666f-41b3-45e4-9159-6929074e6108)

### OneCycleLR

![image](https://github.com/user-attachments/assets/d5b683ef-41e8-4ca8-9fe8-0f1e6b433866)

### CosineAnnealingWarmRestarts

![image](https://github.com/user-attachments/assets/1d45ea80-dea8-494d-a8ab-e9cfc94c55d6)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149189
Approved by: https://github.com/janeyx99
2025-04-14 09:53:38 +00:00
5a64476ed6 [Easy] Add output_size in forward method of ConvTranspose2d (#150609)
Fixes #74593

Add description for `forward` in [ConvTranspose2d](https://pytorch.org/docs/stable/generated/torch.nn.ConvTranspose2d.html) doc

## Test Result

![image](https://github.com/user-attachments/assets/eebad7a2-f782-4219-9756-344e0f34fada)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150609
Approved by: https://github.com/mikaylagawarecki

Co-authored-by: mikaylagawarecki <mikaylagawarecki@gmail.com>
2025-04-14 09:53:22 +00:00
01f226bfb8 Add check for ctc_loss targets param (#150981)
Fixes #150835

## Test Result

```python
# cuda
>>> import torch
>>> import torch.nn.functional as F
>>> device = "cuda" # "cpu" is fine
>>> num_classes = 4
>>> log_probs = torch.rand(0, 0, num_classes, device=device)
>>> targets = torch.tensor([], device=device, dtype=torch.long)
>>> input_lengths = torch.tensor([], device=device, dtype=torch.long)
>>> target_lengths = torch.tensor([], device=device, dtype=torch.long)
>>> result = F.ctc_loss(log_probs, targets, input_lengths, target_lengths, reduction='none')

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/zong/code/pytorch/torch/nn/functional.py", line 3079, in ctc_loss
    return torch.ctc_loss(
           ^^^^^^^^^^^^^^^
RuntimeError: log_probs tensor must not be empty

# cpu
>>> device = "cpu"
>>> num_classes = 4
>>> log_probs = torch.rand(0, 0, num_classes, device=device)
>>> targets = torch.tensor([], device=device, dtype=torch.long)
>>> input_lengths = torch.tensor([], device=device, dtype=torch.long)
>>> target_lengths = torch.tensor([], device=device, dtype=torch.long)
>>> result = F.ctc_loss(log_probs, targets, input_lengths, target_lengths, reduction='none')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/zong/code/pytorch/torch/nn/functional.py", line 3079, in ctc_loss
    return torch.ctc_loss(
           ^^^^^^^^^^^^^^^
RuntimeError: log_probs tensor must not be empty

```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150981
Approved by: https://github.com/eqy
2025-04-14 07:24:30 +00:00
bbc5fe8504 Add inductor standalone_compile API (#150670)
This PR adds a standalone_compile API that does precompilation via caching, to support the vLLM use case in the short term while we work on the longer-term precompilation solution.

```
standalone_compile(gm, example_inputs, options) -> CompiledArtifact
CompiledArtifact.save(path, format: binary|unpacked = binary)
CompiledArtifact.load(path, format: binary|unpacked = binary)
```
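
A hedged usage sketch based only on the API shape above; the import path and exact signatures are assumptions, not the final API:

```python
import torch

# assumed location of the new entry point
from torch._inductor import standalone_compile

def f(x):
    return x * 2 + 1

gm = torch.fx.symbolic_trace(f)
example_inputs = [torch.randn(4)]

artifact = standalone_compile(gm, example_inputs)         # -> CompiledArtifact
artifact.save("/tmp/compiled_artifact", format="binary")  # persist to disk
restored = type(artifact).load("/tmp/compiled_artifact", format="binary")
```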

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150670
Approved by: https://github.com/jamesjwu, https://github.com/zou3519
2025-04-14 07:07:10 +00:00
189bc9283e [ez] move GuardsContext code comment to the right place (#150755)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150755
Approved by: https://github.com/anijain2305, https://github.com/Skylion007
ghstack dependencies: #150753, #150754
2025-04-14 07:03:23 +00:00
9757092aed [executorch hash update] update the pinned executorch hash (#151195)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151195
Approved by: https://github.com/pytorchbot
2025-04-14 05:46:54 +00:00
0d09a33819 [Attention] Always pad in preprocess_mask to avoid recompilations (#150403)
Motivation: for the following script:

```
# demo.py
import torch
import json
from transformers import BertModel, BertConfig

CONFIG = """
{
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.6.0.dev0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}
"""

config = json.loads(CONFIG)
bloom_config = BertConfig(**config)
model = BertModel(bloom_config).half().cuda()

torch.compiler.reset()
torch.cuda.empty_cache()
compiled_fn = torch.compile(model)
vocab_size = 30522

for b in range(1, 3):
    for s in range(1, 10):
        print(f"🚀 {b} {s}")
        input_ids = torch.randint(0, vocab_size, (b, s)).cuda()
        attention_mask = torch.ones(b, s).cuda()

        with torch.no_grad():
            out = compiled_fn(input_ids, attention_mask).last_hidden_state
```

when we run it with:

```
time TORCH_LOGS=recompiles python demo.py
```

We can see there are 7 recompilations, and it takes 2 mins (fresh build) or 1 min (cached build) on my machine.

One root cause of the recompilations is that there are guards checking the alignment of the inputs (see the patch), so there are unexpected recompilations for `(1, 4)`, `(1, 8)`, `(2, 4)` and `(2, 8)` inputs.

In this patch, we always pad the inputs if we don't know their shape at compile time, to avoid the guards on alignment. It is fine to always pad the tensor; it won't change the semantics.

Now there are only 3 recompilations, and it takes 1 min (fresh build) and 17s (cached build) on my machine.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150403
Approved by: https://github.com/drisspg
2025-04-14 04:18:22 +00:00
9458b83729 [HPU] Add HPU as a supported device for NestedTensor (#148659)
This change enables basic NestedTensor operations on HPU, fixing the runtime error when creating a NestedTensor on HPU.

- Extended `NestedTensorImpl` to recognize `hpu` as a valid storage device.
- Added `NestedTensorHPU` to `DispatchKey` parsing in `DispatchKey.cpp`.
- Updated `torchgen/model.py` to include `NestedTensorHPU` in `dispatch_keys`.
- Modified `native_functions.yaml` to enable `NestedTensorHPU` support for various ops.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148659
Approved by: https://github.com/jeromean, https://github.com/albanD, https://github.com/sujoysaraswati
2025-04-14 03:42:34 +00:00
9aca00102f [ez][dynamo] remove useless super().__init__() (#150754)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150754
Approved by: https://github.com/anijain2305, https://github.com/jansel, https://github.com/Skylion007
ghstack dependencies: #150753
2025-04-14 03:37:42 +00:00
101c4f482a Docs: Fix typos in the Symbolic Numbers docstrings (#151181)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151181
Approved by: https://github.com/soulitzer
2025-04-14 01:46:02 +00:00
ddfc14b3ae [MPS] Fix where (#151176)
Fixes #150967
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151176
Approved by: https://github.com/kulinseth, https://github.com/malfet
2025-04-13 20:44:50 +00:00
8494d5582a Propagate callable parameter types using ParamSpec (#142306) (#151014)
Partially addresses #142306

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151014
Approved by: https://github.com/Skylion007
2025-04-13 20:38:11 +00:00
3f0931b1de [ez][dynamo] some code movement (#150753)
`optimize_assert` already does the lookup for `backend` and
`backend_ctx_ctor`. This simply moves the lookups within `optimize`
lower so we don't end up calling these functions twice unnecessarily
in the `optimize_assert` path.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150753
Approved by: https://github.com/anijain2305, https://github.com/jansel
2025-04-13 15:44:42 +00:00
b0810168a3 Generalize poison fork logic for each device backend (#144664)
# Motivation
Generalize the poison_fork code to make it reusable across different devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144664
Approved by: https://github.com/EikanWang, https://github.com/albanD
2025-04-13 09:54:30 +00:00
304633152c Clean up duplicated code in lr_scheduler (#150984)
## Changes

- Remove duplicated code in `ReduceLROnPlateau`
- Remove redundant `noqa` comment

## Test Result

```bash
pytest test/optim/test_lrscheduler.py
```

![image](https://github.com/user-attachments/assets/37f91f31-0e77-4abf-9dd1-75538c0f0792)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150984
Approved by: https://github.com/janeyx99
2025-04-13 09:18:50 +00:00
b59f3d3ae0 [Intel GPU] skip a cuda api call in amp to save some host overhead on xpu (#151111)
This can save ~0.2ms on non-CUDA devices by skipping the call to `amp_definitely_not_available()`. It can improve small models in torchbench, such as lennard_jones on xpu, by 10% in both eager and inductor dynamo benchmarks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151111
Approved by: https://github.com/soulitzer
2025-04-13 06:37:07 +00:00
1c5619ef9c [DTensor] Add DTensor redistribute fwd/bwd datatype conversion to enable SimpleFSDP mixed precision training (#150740)
As titled, this PR adds `forward_dtype` and `backward_dtype` conversions to the DTensor `redistribute` API to enable SimpleFSDP's mixed precision training.

In the forward pass, the DTensor can be configured to be cast to `forward_dtype`; in the backward pass, it can be configured to be cast to `backward_dtype`.

1. **Correctness**: The end-to-end SimpleFSDP mixed precision training integration has been proved to work properly in the PR from this fork: https://github.com/tianyu-l/pytorch_intern24/pull/20. We are now migrating the code to official PyTorch DTensor.

2. **Example Usage**: There is an example in TorchTitan's SimpleFSDP implementation: https://github.com/pytorch/torchtitan/pull/1060.

In the example below, a DTensor `x` is all-gather'ed along the `self.compute_placements`, with datatype cast to `self.param_dtype`. In the backward pass, additionally, the computed gradients are reduce-scatter'ed along the `self.grad_placements`, with datatype cast to `self.reduce_dtype`.

```python
output = x.redistribute(
        placements=self.compute_placements,
        forward_dtype=self.param_dtype,
        backward_dtype=self.reduce_dtype,
).to_local(grad_placements=self.grad_placements)
```

Under the hood, in `class Redistribute(torch.autograd.Function):`, the `forward` function first takes `x`'s local tensor and converts it to `forward_dtype` before all-gathering `x`.

The `backward` function takes `grad_output` and converts it to `backward_dtype` before reduce-scattering `grad_output`.
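
A generic sketch of that forward/backward cast pattern, using a plain `autograd.Function` stand-in (not the actual `Redistribute` implementation, and with the collectives elided):

```python
import torch

class CastFwdBwd(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, forward_dtype, backward_dtype):
        ctx.backward_dtype = backward_dtype
        return x.to(forward_dtype)            # cast before the all-gather would run

    @staticmethod
    def backward(ctx, grad_output):
        # cast before the reduce-scatter would run in the backward pass
        return grad_output.to(ctx.backward_dtype), None, None

x = torch.randn(4, requires_grad=True)
y = CastFwdBwd.apply(x, torch.bfloat16, torch.float32)
y.sum().backward()
print(y.dtype, x.grad.dtype)  # torch.bfloat16 torch.float32
```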

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150740
Approved by: https://github.com/tianyu-l
2025-04-13 05:49:03 +00:00
00c6caaf3d [executorch hash update] update the pinned executorch hash (#150722)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150722
Approved by: https://github.com/pytorchbot
2025-04-13 05:37:33 +00:00
587aec2b4f [dynamo][nn_module] Use method.__self__ to find source for patched methods (#151164)
Fixes https://github.com/pytorch/pytorch/issues/137476

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151164
Approved by: https://github.com/jansel
2025-04-13 04:50:19 +00:00
7b1a2373e8 [dynamo][super variable] Fix bug to use correct source (#151154)
Fixes https://github.com/pytorch/pytorch/issues/150994

We should cherry-pick to 2.7 branch if possible, because this breaks torch.compile on some HF models. Look at the issue referenced here.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151154
Approved by: https://github.com/jansel
2025-04-13 04:48:52 +00:00
8157e76b79 Revert "[Inductor] Refactor wrapper codegen to use Wrapper IR. (#150458)"
This reverts commit fe7f425de7b76ef33d308d0a03779b97a914d186.

Reverted https://github.com/pytorch/pytorch/pull/150458 on behalf of https://github.com/clee2000 due to broke a lot of tests internally? D72906459 ([comment](https://github.com/pytorch/pytorch/pull/150458#issuecomment-2799578597))
2025-04-13 03:52:42 +00:00
67188cd38d [Testing] Skip test_unspec_inputs_float64_mps (#151167)
As the backend does not support float64

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151167
Approved by: https://github.com/dcci
ghstack dependencies: #151166
2025-04-13 00:41:51 +00:00
d289d1177c [CI] Fix GPUTests.test_scheduler_vertical_fusion1 (#151166)
By enabling the test_operators on MPS device

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151166
Approved by: https://github.com/dcci
2025-04-13 00:41:51 +00:00
9699cc3eb9 [MPSInductor] Fix larger-than-threadgroup Welford reductions (#151152)
By using `welford_combine` primitive in the loop
This fixes `GPUTests.test_multilayer_var_lowp_mps`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151152
Approved by: https://github.com/jansel
ghstack dependencies: #151042, #150824, #151151
2025-04-12 21:44:51 +00:00
7762bddd87 Revert "[MPSInductor] Fix larger-than-threadgroup Welford reductions (#151152)"
This reverts commit 71073caa00836c23e3fc7fcfe1d69b77ffb9d9c9.

Reverted https://github.com/pytorch/pytorch/pull/151152 on behalf of https://github.com/malfet due to Another lint failure ([comment](https://github.com/pytorch/pytorch/pull/151152#issuecomment-2799027274))
2025-04-12 20:27:48 +00:00
3dcb46c30e [easy] Add cache bypass traceback information to cache_info on autograd_cache_bypass (#151025)
This will help us better debug pickling errors, etc, in internal models
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151025
Approved by: https://github.com/masnesral
2025-04-12 19:56:32 +00:00
9d4de265db [AMD] Block mem efficient attention for FP32 in CK backend (#151132)
Summary: CK doesn't support FP32 attention, but aotriton does. If we prefer CK and the input dtype is FP32, we would select memory-efficient attention even though CK doesn't support it. So we exclude memory-efficient attention and pick math instead.

Differential Revision: D72880985

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151132
Approved by: https://github.com/yoyoyocmu
2025-04-12 19:36:20 +00:00
71073caa00 [MPSInductor] Fix larger-than-threadgroup Welford reductions (#151152)
By using `welford_combine` primitive in the loop
This fixes `GPUTests.test_multilayer_var_lowp_mps`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151152
Approved by: https://github.com/jansel
ghstack dependencies: #151042, #150824, #151151
2025-04-12 19:16:33 +00:00
3b86cb8dff [MPSInductor][BE] Implement reduction caching (#151151)
This avoids double/triple invocation of Welford reductions when both the
mean and the deviation must be returned.

The code has been copied from the Halide implementation:
575f348965/torch/_inductor/codegen/halide.py (L1189-L1191)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151151
Approved by: https://github.com/jansel
ghstack dependencies: #151042, #150824
2025-04-12 19:16:33 +00:00
2653498ff3 [Openreg][PrivateUse1] Refactor csrc files of Pytorch_openreg (#151004)
I want to format and refactor the csrc file of pytorch_openreg. To make the code review clearer and easier to understand, I divide the code refactoring into two parts:

- Part 1: Code formatting
- Part 2: Code refactoring and optimization (Next PR)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151004
Approved by: https://github.com/albanD
ghstack dependencies: #151000
2025-04-12 17:22:28 +00:00
c181403063 [Openreg][PrivateUse1] Improve openreg module capabilities (#151000)
----

- Add more functionalities for openreg in openreg module
- Remove related functionalities from test_cpp_extensions_open_device_registration.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151000
Approved by: https://github.com/albanD
2025-04-12 17:21:35 +00:00
be24e7b4b4 [dynamo] Use sentinel value for guard filter. (#151131)
Summary: `None` can collide with real values in the scope, so we should use a separate sentinel value. Also added "has_value" to the struct so that it is clearer whether the value is absent or not.
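
A generic sketch of the sentinel pattern described here (illustrative only, not the actual guard-filter structs):

```python
_MISSING = object()  # unique sentinel that cannot collide with user values

def lookup(scope: dict, name: str):
    value = scope.get(name, _MISSING)
    has_value = value is not _MISSING
    return has_value, (value if has_value else None)

print(lookup({"x": None}, "x"))  # (True, None)  -- None is a real value here
print(lookup({"x": None}, "y"))  # (False, None) -- genuinely absent
```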

Test Plan: CI

Differential Revision: D72881300

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151131
Approved by: https://github.com/jansel, https://github.com/anijain2305
2025-04-12 15:29:57 +00:00
5b16a0704e Fix license check for setuptools>=77 (#151158)
Fixes #151157

See issue for more information
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151158
Approved by: https://github.com/malfet
2025-04-12 13:41:12 +00:00
7dd2ed1197 [dtensor] add op support for torch._grouped_mm (#151072)
This PR would make TP work with Grouped MM in MoE implementations like https://github.com/pytorch/torchtitan/pull/1084

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151072
Approved by: https://github.com/wanchaol, https://github.com/wwwjn
2025-04-12 07:07:44 +00:00
0c59a031c8 [OpenReg][PrivateUse1] add device context for OpenReg Module (#150997)
Add device context support for the OpenReg module, which is depended on by
some tests such as ``torch.serialization.default_restore_location``.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150997
Approved by: https://github.com/albanD
2025-04-12 06:32:30 +00:00
3e9f4f3f78 docs: allow empty targets tensor in ctc_loss (#151080)
docs: allow an empty targets tensor in ctc_loss when target_lengths are zero, as described in the issue below.

Fixes #150995

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151080
Approved by: https://github.com/albanD
2025-04-12 05:26:54 +00:00
2f899f07aa Revert "Make export._trace._WrapperModule work in strict mode (#146919)"
This reverts commit dad5e5e2622c82ca272290225abe16ee461d9ac9.

Reverted https://github.com/pytorch/pytorch/pull/146919 on behalf of https://github.com/malfet due to Broke lint, see https://github.com/pytorch/pytorch/actions/runs/14415686353/job/40431799827 ([comment](https://github.com/pytorch/pytorch/pull/146919#issuecomment-2798446930))
2025-04-12 04:12:36 +00:00
dad5e5e262 Make export._trace._WrapperModule work in strict mode (#146919)
Summary:
as title

`export._trace._WrapperModule` is used to wrap functions into a Module so we can export the function.

We add `export._wrapper_utils` to `dynamo`'s `MOD_INLINELIST` so dynamo traces into `_WrapperModule`

Fixes https://github.com/pytorch/pytorch/issues/146867

Test Plan:
```
 buck run fbcode//mode/dev-nosan //caffe2/test:test_export -- -r wrapper_module
```

Differential Revision: D69434316

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146919
Approved by: https://github.com/angelayi
2025-04-12 03:22:08 +00:00
19b76bd873 hack to try to fix not empty triton dir (#151119)
Differential Revision: D72741938

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151119
Approved by: https://github.com/hl475, https://github.com/muchulee8, https://github.com/Skylion007
2025-04-12 03:21:41 +00:00
c1470d4dc4 [graph partition] support graphsafe_run_with_rng_state (#150958)
Prior to this PR, `rng_state` is in `V.graph.graph_inputs` but not in the read_writes of any IRNode. As a result, it is not identified as a partition input:
```python
def partition_0(args):
    primals_2, primals_1 = args
    ...
    buf0 = torch.ops.higher_order.graphsafe_run_with_rng_state(torch.ops.aten.rand.default, [4, 4], dtype=torch.float32, device=device(type='cuda', index=1), pin_memory=False, rng_state=fwd_rng_state_0)
    # <----- access fwd_rng_state_0 but it's not an input
    ...

def call(self, args):
    primals_1, primals_2, fwd_rng_state_0 = args
    ...
    partition0_args = [primals_2, primals_1]
    (buf2, primals_2, primals_1) = self.partitions[0](partition0_args)
     # <---- fwd_rng_state_0 is graph_inputs but is not passed to partitions[0]
     ...
```

This PR fixes this issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150958
Approved by: https://github.com/eellison
2025-04-12 03:17:08 +00:00
397d37acc5 [MPSInductor] Naive welford_reduce implementation (#150824)
Literal Python-to-Metal translation of
85549fe6de/torch/_inductor/runtime/triton_helpers.py (L217-L225)

Fixed a missing barrier in `welford_combine`.
This is sufficient to make `GPUTests.test_batch_norm_2d_2_mps` pass.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150824
Approved by: https://github.com/dcci, https://github.com/jansel
ghstack dependencies: #151042
2025-04-12 03:11:38 +00:00
32f0f414ab Add some autograd producer consumer stream sync tests (#150952)
Thanks @ngimel and @albanD for some ideas on test cases

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150952
Approved by: https://github.com/albanD
2025-04-12 02:44:09 +00:00
397b7f9b82 [custom ops] Override fake registration (#150806)
Added a flag, `allow_override`, to allow overriding existing kernel implementations in `torch.library.register_fake` and `library.impl`. The default is False: if a user tries to register a kernel to a dispatch key that already contains a kernel, it will error. This flag doesn't apply to CustomOpDefs, where overriding a fake kernel is already allowed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150806
Approved by: https://github.com/zou3519
2025-04-12 02:43:47 +00:00
77407b38a9 Revert "[MPSInductor] Naive welford_reduce implementation (#150824)"
This reverts commit 575f348965abe8ea428eba7098f67ec9764a7f9a.

Reverted https://github.com/pytorch/pytorch/pull/150824 on behalf of https://github.com/malfet due to Linter fails again, landrace this time? ([comment](https://github.com/pytorch/pytorch/pull/150824#issuecomment-2798392241))
2025-04-12 02:22:22 +00:00
f6e9e064a7 [CI][CUDA] xfail grouped gemm unit tests on blackwell (#150982)
On SM100OrLater, expect failures like:

RuntimeError: torch._grouped_mm is only supported on CUDA devices with compute capability = 9.0

To execute this test, run the following from the base repo dir:
    python test/test_matmul_cuda.py TestMatmulCudaCUDA.test_grouped_gemm_3d_2d_strided_False_a_row_major_True_b_row_major_False_cuda

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

```
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_2d_2d_strided_False_a_row_major_False_b_row_major_False_cuda SKIPPED [0.0005s] (Issue with numpy versi...) [  2%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_2d_2d_strided_False_a_row_major_False_b_row_major_True_cuda SKIPPED [0.0001s] (Issue with numpy versio...) [  4%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_2d_2d_strided_False_a_row_major_True_b_row_major_False_cuda SKIPPED [0.0001s] (Issue with numpy versio...) [  6%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_2d_2d_strided_False_a_row_major_True_b_row_major_True_cuda SKIPPED [0.0001s] (Issue with numpy version...) [  8%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_2d_2d_strided_True_a_row_major_False_b_row_major_False_cuda SKIPPED [0.0001s] (Issue with numpy versio...) [ 10%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_2d_2d_strided_True_a_row_major_False_b_row_major_True_cuda SKIPPED [0.0001s] (Issue with numpy version...) [ 12%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_2d_2d_strided_True_a_row_major_True_b_row_major_False_cuda SKIPPED [0.0001s] (Issue with numpy version...) [ 14%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_2d_2d_strided_True_a_row_major_True_b_row_major_True_cuda SKIPPED [0.0001s] (Issue with numpy version ...) [ 16%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_2d_3d_strided_False_a_row_major_False_b_row_major_False_cuda SKIPPED [0.0001s] (Issue with numpy versi...) [ 18%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_2d_3d_strided_False_a_row_major_False_b_row_major_True_cuda SKIPPED [0.0001s] (Issue with numpy versio...) [ 20%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_2d_3d_strided_False_a_row_major_True_b_row_major_False_cuda SKIPPED [0.0001s] (Issue with numpy versio...) [ 22%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_2d_3d_strided_False_a_row_major_True_b_row_major_True_cuda SKIPPED [0.0001s] (Issue with numpy version...) [ 25%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_2d_3d_strided_True_a_row_major_False_b_row_major_False_cuda SKIPPED [0.0001s] (Issue with numpy versio...) [ 27%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_2d_3d_strided_True_a_row_major_False_b_row_major_True_cuda SKIPPED [0.0001s] (Issue with numpy version...) [ 29%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_2d_3d_strided_True_a_row_major_True_b_row_major_False_cuda SKIPPED [0.0001s] (Issue with numpy version...) [ 31%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_2d_3d_strided_True_a_row_major_True_b_row_major_True_cuda SKIPPED [0.0001s] (Issue with numpy version ...) [ 33%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_3d_2d_strided_False_a_row_major_False_b_row_major_False_cuda SKIPPED [0.0002s] (Issue with numpy versi...) [ 35%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_3d_2d_strided_False_a_row_major_False_b_row_major_True_cuda SKIPPED [0.0001s] (Issue with numpy versio...) [ 37%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_3d_2d_strided_False_a_row_major_True_b_row_major_False_cuda SKIPPED [0.0001s] (Issue with numpy versio...) [ 39%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_3d_2d_strided_False_a_row_major_True_b_row_major_True_cuda SKIPPED [0.0001s] (Issue with numpy version...) [ 41%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_3d_2d_strided_True_a_row_major_False_b_row_major_False_cuda SKIPPED [0.0001s] (Issue with numpy versio...) [ 43%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_3d_2d_strided_True_a_row_major_False_b_row_major_True_cuda SKIPPED [0.0001s] (Issue with numpy version...) [ 45%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_3d_2d_strided_True_a_row_major_True_b_row_major_False_cuda SKIPPED [0.0001s] (Issue with numpy version...) [ 47%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_3d_2d_strided_True_a_row_major_True_b_row_major_True_cuda SKIPPED [0.0001s] (Issue with numpy version ...) [ 50%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_3d_3d_strided_False_a_row_major_False_b_row_major_False_cuda SKIPPED [0.0001s] (Issue with numpy versi...) [ 52%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_3d_3d_strided_False_a_row_major_False_b_row_major_True_cuda SKIPPED [0.0001s] (Issue with numpy versio...) [ 54%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_3d_3d_strided_False_a_row_major_True_b_row_major_False_cuda SKIPPED [0.0001s] (Issue with numpy versio...) [ 56%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_3d_3d_strided_False_a_row_major_True_b_row_major_True_cuda SKIPPED [0.0001s] (Issue with numpy version...) [ 58%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_3d_3d_strided_True_a_row_major_False_b_row_major_False_cuda SKIPPED [0.0001s] (Issue with numpy versio...) [ 60%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_3d_3d_strided_True_a_row_major_False_b_row_major_True_cuda SKIPPED [0.0001s] (Issue with numpy version...) [ 62%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_3d_3d_strided_True_a_row_major_True_b_row_major_False_cuda SKIPPED [0.0001s] (Issue with numpy version...) [ 64%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_3d_3d_strided_True_a_row_major_True_b_row_major_True_cuda SKIPPED [0.0001s] (Issue with numpy version ...) [ 66%]
test/test_matmul_cuda.py::TestFP8MatmulCudaCUDA::test_scaled_grouped_gemm_2d_2d_fast_accum_False_strided_False_cuda XFAIL [0.8166s]                                        [ 68%]
test/test_matmul_cuda.py::TestFP8MatmulCudaCUDA::test_scaled_grouped_gemm_2d_2d_fast_accum_False_strided_True_cuda XFAIL [0.0017s]                                         [ 70%]
test/test_matmul_cuda.py::TestFP8MatmulCudaCUDA::test_scaled_grouped_gemm_2d_2d_fast_accum_True_strided_False_cuda XFAIL [0.0012s]                                         [ 72%]
test/test_matmul_cuda.py::TestFP8MatmulCudaCUDA::test_scaled_grouped_gemm_2d_2d_fast_accum_True_strided_True_cuda XFAIL [0.0012s]                                          [ 75%]
test/test_matmul_cuda.py::TestFP8MatmulCudaCUDA::test_scaled_grouped_gemm_2d_3d_fast_accum_False_strided_False_cuda XFAIL [0.0033s]                                        [ 77%]
test/test_matmul_cuda.py::TestFP8MatmulCudaCUDA::test_scaled_grouped_gemm_2d_3d_fast_accum_False_strided_True_cuda XFAIL [0.0012s]                                         [ 79%]
test/test_matmul_cuda.py::TestFP8MatmulCudaCUDA::test_scaled_grouped_gemm_2d_3d_fast_accum_True_strided_False_cuda XFAIL [0.0015s]                                         [ 81%]
test/test_matmul_cuda.py::TestFP8MatmulCudaCUDA::test_scaled_grouped_gemm_2d_3d_fast_accum_True_strided_True_cuda XFAIL [0.0012s]                                          [ 83%]
test/test_matmul_cuda.py::TestFP8MatmulCudaCUDA::test_scaled_grouped_gemm_3d_2d_fast_accum_False_strided_False_cuda XFAIL [0.0012s]                                        [ 85%]
test/test_matmul_cuda.py::TestFP8MatmulCudaCUDA::test_scaled_grouped_gemm_3d_2d_fast_accum_False_strided_True_cuda XFAIL [0.0012s]                                         [ 87%]
test/test_matmul_cuda.py::TestFP8MatmulCudaCUDA::test_scaled_grouped_gemm_3d_2d_fast_accum_True_strided_False_cuda XFAIL [0.0011s]                                         [ 89%]
test/test_matmul_cuda.py::TestFP8MatmulCudaCUDA::test_scaled_grouped_gemm_3d_2d_fast_accum_True_strided_True_cuda XFAIL [0.0012s]                                          [ 91%]
test/test_matmul_cuda.py::TestFP8MatmulCudaCUDA::test_scaled_grouped_gemm_3d_3d_fast_accum_False_strided_False_cuda XFAIL [0.0014s]                                        [ 93%]
test/test_matmul_cuda.py::TestFP8MatmulCudaCUDA::test_scaled_grouped_gemm_3d_3d_fast_accum_False_strided_True_cuda XFAIL [0.0012s]                                         [ 95%]
test/test_matmul_cuda.py::TestFP8MatmulCudaCUDA::test_scaled_grouped_gemm_3d_3d_fast_accum_True_strided_False_cuda XFAIL [0.0011s]                                         [ 97%]
test/test_matmul_cuda.py::TestFP8MatmulCudaCUDA::test_scaled_grouped_gemm_3d_3d_fast_accum_True_strided_True_cuda XFAIL [0.0011s]                                          [100%]
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150982
Approved by: https://github.com/ngimel, https://github.com/eqy
2025-04-12 01:53:12 +00:00
fe7f425de7 [Inductor] Refactor wrapper codegen to use Wrapper IR. (#150458)
Preparatory refactor for https://github.com/pytorch/pytorch/pull/146942.

# Feature

This PR refactors the existing wrapper codegen into `WrapperLine` subclasses, extending the existing Memory Planning IR into a fully-fledged Wrapper IR. See the diagram below.

![wrapper_ir](https://github.com/user-attachments/assets/a61db21b-caf3-45d2-bfdb-91066ae4ba6b)

The IR currently supports the following ops:
- All existing memory planning IR ops (`AllocateLine`, `FreeIfNotReusedLine`, etc.)
- Reinterpret views (`ReinterpretLine`)
- Kernel definitions (`KernelDefinitionLine`)
- Calls to defined kernels (`KernelCallLine`)
- Calls to extern kernels (`ExternKernelLine`, `ExternKernelAllocLine`)
- Ops with multiple outputs (`MultiOutputLine`)
- Tensor cleanup at the end of a graph (`FreeLine`)
- Leaving comments in code (`CommentLine`)

There are two main motivations for this refactor:
1. Unlike free-form C++ and Python code, Wrapper IR lines provide structured information about what the wrapper code does. This serves as a natural extension point for other types of wrapper codegen. For example, the parent PR generates FX IR from Wrapper IR. Wrapper IR aims to give new backends enough information to generate wrapper code without needing to modify core Inductor files such as `ir.py`.
2. This design will hopefully promote stronger modularity and encapsulation.
   a. Inductor's core compilation passes don't need to worry about whether they're targeting Python, C++, FX or anything else. They can simply focus on generating Wrapper IR, and target-specific code can be refactored into the various backends.
   b. Backends do not need to know about all the details and internal state of `V.graph` IR. For example, they don't need to consider whether a buffer has been removed from the graph when generating code. Wrapper IR will hopefully provide a simpler interface for generating wrapper code, which abstracts away the details of device code.

# Implementation details

The implementation mainly consists of separating direct C++/Python codegen into two phases:
 1. Emit Wrapper IR lines describing what the wrapper code is supposed to do.
 2. Inside the `codegen()` method of each `WrapperLine`, call backend methods which generate pure Python/C++ code using the information stored in the Wrapper IR line. For example, `KernelCallLine` calls `wrapper._generate_kernel_call_helper`, which is overridden by the various Python and C++ backends to generate the final wrapper code.

The main difficulty in implementing this is that we need to be careful that code is generated in the correct order. Wrapper codegen happens in two passes: first we write code into `self.lines` which mainly contains wrapper IR, but can also contain raw Python or C++ lines in some situations. Then, we convert the wrapper IR into the final Python/C++ code in `self.wrapper_call`. Since the same macros may be used in both passes, it's difficult to ensure that code is written to the correct buffer. The easiest solution for this was to implement a context manager overriding the `writeline` method to write to  `self.wrapper_call` after memory planning is finished. This way, `writeline` writes to `self.lines` in the first pass, and `self.wrapper_call` in the second. This obviated the need to pass `code` or `writeline` variables all the way through the call stack, which would have touched most of the existing macros.
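
To make the two-phase structure described above concrete, here is a purely illustrative sketch with hypothetical names (not the actual Inductor classes): an IR line stores structured information, and its `codegen()` delegates to a backend helper that emits the final text.

```python
from dataclasses import dataclass

@dataclass
class KernelCallLine:                      # stand-in for a Wrapper IR op
    kernel_name: str
    args: tuple

    def codegen(self, backend) -> None:
        # the backend (Python, C++, FX, ...) decides how to print the call
        backend.generate_kernel_call(self.kernel_name, self.args)

class PythonBackend:
    def __init__(self):
        self.wrapper_call: list[str] = []

    def generate_kernel_call(self, name, args):
        self.wrapper_call.append(f"{name}({', '.join(map(str, args))})")

lines = [KernelCallLine("triton_poi_fused_add_0", ("buf0", "arg0_1", 1024))]
backend = PythonBackend()
for line in lines:          # second pass: turn IR lines into wrapper code
    line.codegen(backend)
print("\n".join(backend.wrapper_call))
```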

# Test plan

Since this refactor touches all the existing wrapper codegen classes, the existing CI provides good coverage.

The parent PR introduces new tests for the FX IR backend. Among other things, these tests assert that `self.lines` only contains Wrapper IR lines, and no free-form code. While this would not be true of all programs today, the tests suggests that the IR implemented in this PR is sufficient to cover basic PyTorch usage.

# Future directions

These two goals are only partially realized by this PR. These are several important steps which still undergo direct Python/C++ codegen in core files:
 - User-defined Triton kernels.
 - Reinterpret views on outputs, from `gen_output_refs()`. (In the parent PR, the FX converter has a custom way of handling this. This can eventually be ported into Wrapper IR.)
 -  Fallback ops with custom `codegen()` methods, e.g. `ScatterFallback`.
 -  Misc. C++ lines emitted by the various cpp backends, e.g. declaring constants.

These cases will gradually be handled in subsequent PRs, as the Inductor->FX converter expands its coverage. Given that these refactors are pretty tricky to do, it seems wiser to execute them in stages, as opposed to porting everything to Wrapper IR at once.Some Python and codegen still lives in core files such as `ir.py`, as described in previous sections. Hopefully, this PR will serve as a starting point which moves the codebase towards a more modular design. Over time, we can gradually refactor the remaining codegen (mainly in `ir.py`) into backend classes.

One limitation of this PR is that codegen still happens in two phases during `PythonWrapperCodegen`. First, we generate Wrapper IR into `self.lines`, and from there we generate Python or C++ code into `self.wrapper_call`, `self.header`, etc. In the long term, it would be cleaner to split wrapper IR into its own class which doesn't deal with Python/C++ codegen at all. (See the diagram at the top.) That would strictly enforce the boundary between Wrapper IR and Python/C++ wrapper code. However, this would probably be a much larger refactor.

Another limitation of the current code is that the helper functions have a lot of call args. It's also possible to clean this up by passing Wrapper IR ops e.g. `KernelCallLine` into helper functions like `_generate_kernel_call_helper`, since they store all the arguments. However, that change would likely be prone to merge conflicts, so I would like to save it for follow-up PRs if possible.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150458
Approved by: https://github.com/eellison
2025-04-12 01:15:19 +00:00
575f348965 [MPSInductor] Naive welford_reduce implementation (#150824)
Literal Python-to-Metal translation of
85549fe6de/torch/_inductor/runtime/triton_helpers.py (L217-L225)

Fixed a missing barrier in `welford_combine`.
This is sufficient to make `GPUTests.test_batch_norm_2d_2_mps` pass.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150824
Approved by: https://github.com/dcci, https://github.com/jansel
ghstack dependencies: #151042
2025-04-12 00:46:01 +00:00
83f14c0b06 Revert "[MPSInductor] Naive welford_reduce implementation (#150824)"
This reverts commit 5edfb4c4fad1bb9504482d930a2540d22427d383.

Reverted https://github.com/pytorch/pytorch/pull/150824 on behalf of https://github.com/malfet due to I should have waited for lint ([comment](https://github.com/pytorch/pytorch/pull/150824#issuecomment-2798249264))
2025-04-12 00:21:14 +00:00
ca2e8cd352 [map] make proxy mode re-dispatch to fake key (#151034)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151034
Approved by: https://github.com/zou3519
ghstack dependencies: #150962
2025-04-11 23:28:06 +00:00
a72d56cb6b [map] always turn on dynamo for map (#150962)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150962
Approved by: https://github.com/zou3519
2025-04-11 23:28:06 +00:00
5edfb4c4fa [MPSInductor] Naive welford_reduce implementation (#150824)
Literal Python-to-Metal translation of
85549fe6de/torch/_inductor/runtime/triton_helpers.py (L217-L225)

Fixed a missing barrier in `welford_combine`.
This is sufficient to make `GPUTests.test_batch_norm_2d_2_mps` pass.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150824
Approved by: https://github.com/dcci, https://github.com/jansel
ghstack dependencies: #151042
2025-04-11 23:21:35 +00:00
eqy
c4f826d5e8 [CUDA][TF32] Account for TF32 in test_alexnet_prefix (#150970)
Mainly seems to be an issue on Blackwell with e.g.,
```
Mismatched elements: 1 / 746496 (0.0%)
Greatest absolute difference: 0.005461275577545166 at index (2, 32, 11, 9)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150970
Approved by: https://github.com/soulitzer
2025-04-11 23:13:54 +00:00
2d187bf7e6 Support tuning of _scaled_grouped_mm (#150421)
This includes the default aten implementation, as well as a Triton
implementation imported from FBGEMM
(https://github.com/pytorch/FBGEMM/blob/main/fbgemm_gpu/experimental/gemm/triton_gemm/grouped_gemm.py)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150421
Approved by: https://github.com/ngimel
2025-04-11 23:03:49 +00:00
c3bc6b3542 [DTensor] Fix empty shard global-offset calculation (#150862)
`compute_local_shape_and_global_offset` util computes the local shape of
a particular shard of a DTensor, and the global offset (which describes
how the shard fits into the global tensor).

When the tensor dim does not evenly divide into the mesh dim, uneven
sharding occurs.  In some cases, uneven sharding results in an empty
shard.

e.g.
   tensor dim size: 4096
   mesh dim size: 30
   ranks 0..27 have local size 18
   rank 28 has local size 8
   rank 29 has local size 0 <--- empty shard

The global offset for an empty shard was previously undefined and
returned values that were computed based on logic that assumes no empty
shards. This caused DCP to fail to save a checkpoint, because
deduplication logic could 'throw away' real (non-empty) shards thinking
they were duplicates of zero-sized shards with the same offset.

Now, we define the global offset of an empty shard to be the dim-size,
which is out of bounds of the tensor and can't overlap with any
non-empty shards.
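
A minimal sketch of chunk-style uneven sharding with the new convention, using illustrative numbers (dim size 512 across 30 ranks) rather than the exact DTensor internals:

```python
import math

def local_size_and_offset(dim_size: int, world: int, rank: int):
    chunk = math.ceil(dim_size / world)        # size of a "full" shard
    start = min(rank * chunk, dim_size)
    end = min(start + chunk, dim_size)
    local = end - start
    # empty shard -> offset is the dim size itself, out of bounds of the tensor
    offset = start if local > 0 else dim_size
    return local, offset

for r in (27, 28, 29):
    print(r, local_size_and_offset(512, 30, r))
# 27 (18, 486)   28 (8, 504)   29 (0, 512)  <- empty shard, non-overlapping offset
```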

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150862
Approved by: https://github.com/teja-rao, https://github.com/XilunWu
2025-04-11 22:25:57 +00:00
85549fe6de Add __all__ for torch.utils.dlpack (#149026)
Fixes the issue:

```python
torch.utils.dlpack.to_dlpack(tensor)  # "to_dlpack" is not exported from module "torch.utils.dlpack" Pylance[reportPrivateImportUsage](https://github.com/microsoft/pyright/blob/main/docs/configuration.md#reportPrivateImportUsage)
```

the docs for `torch.utils.dlpack`: https://pytorch.org/docs/stable/dlpack.html
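
A minimal sketch of the kind of change involved (the exact export list is an assumption): declaring `__all__` in `torch/utils/dlpack.py` so type checkers treat the names as re-exported.

```python
# torch/utils/dlpack.py (sketch)
__all__ = ["DLDeviceType", "from_dlpack", "to_dlpack"]
```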
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149026
Approved by: https://github.com/mikaylagawarecki
2025-04-11 22:03:24 +00:00
2a909cab16 Update ninja missing error message (#147698)
In cpp_extensions

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147698
Approved by: https://github.com/Skylion007
2025-04-11 21:56:53 +00:00
a78ac409b5 [AOTI] Add _weight_int4pack_mm to the C shim fallback list (#151059)
Summary: As title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151059
Approved by: https://github.com/yushangdi
2025-04-11 21:22:35 +00:00
12281f9c18 [dynamo] Deprecate enable_cpp_framelocals_guard_eval config variable - default: True (#151008)
[dynamo] Deprecate enable_cpp_framelocals_guard_eval config variable - default: True

Reading the feature-enabling param `enable_cpp_framelocals_guard_eval` at the CPP level is time consuming and slows down dynamo, as it is done every time the function using this param is called. Reading the value only once at init isn't an option, as it would prevent modifying this param at runtime. Since this feature has been enabled by default for some time and doesn't cause known issues, the `enable_cpp_framelocals_guard_eval` configuration param is deprecated by this commit and its value is hardcoded to true.

Local microbenchmark dynamo_guard_eval.py:
- 931.9 us -> 538.9 us (3.10)

@williamwen42 @jansel @anijain2305

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151008
Approved by: https://github.com/williamwen42
2025-04-11 21:07:59 +00:00
8910e4f2bb Fix 32-bit indexing overflows in ReducedPrecisionGemV (#150949)
By changing the `lda` type from `int` to ~~`long`~~ `int64_t`

Add a regression test (but probably restrict it to CPUs, or maybe skip float32 testing on GPUs)

Fixes https://github.com/pytorch/pytorch/issues/150637

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150949
Approved by: https://github.com/Skylion007
2025-04-11 20:55:20 +00:00
05236b5045 Allow OpaqueTensorImpl to be used for views (#151028)
Summary:
When creating an `OpaqueTensorImpl`, currently there's only an option to create it for a non-view tensor, but it can be useful to create one for view tensors as well.

View tensors should contain the same autograd parameters as the original tensor, whereas non-view tensors get created with whatever `inference_mode` option is currently enabled. For this reason, `TensorImpl` has a special view constructor that takes `TensorImpl::ImplType` as its first parameter, so adding a new constructor to `OpaqueTensorImpl` that does the same thing allows us to create views with it.

Test Plan: CI

Reviewed By: scottxu0730

Differential Revision: D71748460

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151028
Approved by: https://github.com/scottxu0730, https://github.com/chaos5958
2025-04-11 20:07:47 +00:00
bb60e82672 c10d/Store: add queues (#150969)
This adds queue operations as described in https://github.com/pytorch/pytorch/issues/150943.

This works by adding two new operations `queue_push` and `queue_pop`. The semantics are designed to be blocking with a timeout. Pushing will always succeed as the queue is infinite size. Popping will first call `wait` until the key is ready and then pop the value from the queue.

This implements queues for only: HashStore, TCPStore w/ libuv. FileStore and the legacy backends are not supported.

`wait` and `check` work for queue operations though queue_push will only wake up the first waiter rather than all of them.

This also has a few cleanups to error types/documentation in related code.
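
A hypothetical usage sketch, assuming the Python `Store` bindings expose the new operations under the same names (`queue_push` / `queue_pop`):

```python
from datetime import timedelta
import torch.distributed as dist

store = dist.HashStore()                  # queues are supported for HashStore
store.set_timeout(timedelta(seconds=10))  # queue_pop blocks up to this timeout

store.queue_push("jobs", b"task-1")       # push always succeeds (unbounded queue)
store.queue_push("jobs", b"task-2")

print(store.queue_pop("jobs"))            # b"task-1"
print(store.queue_pop("jobs"))            # b"task-2"
```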

Example trace:

```
[I409 16:51:43.963833529 TCPStoreLibUvBackend.cpp:829] [c10d - trace] validate magic:1015412686 address:[localhost]:55816
[I409 16:51:43.963845838 TCPStoreLibUvBackend.cpp:842] [c10d - trace] ping nonce:2840795 address:[localhost]:55816
[I409 16:51:43.963902914 TCPStoreLibUvBackend.cpp:911] [c10d - trace] add key:init/ val:1 address:[localhost]:55816
[I409 16:51:43.963939389 TCPStoreLibUvBackend.cpp:977] [c10d - trace] wait key_count:1 keys[0]:init/ address:[localhost]:55816
[I409 16:51:43.963974842 TCPStoreLibUvBackend.cpp:893] [c10d - trace] get key:init/ address:[localhost]:55816
[I409 16:51:43.964071909 TCPStoreLibUvBackend.cpp:1121] [c10d - trace] queue_push key:/test_prefix/test_queue_support address:[localhost]:55816
[I409 16:51:43.964080221 TCPStoreLibUvBackend.cpp:940] [c10d - trace] check key_count:1 keys[0]:/test_prefix/foo address:[localhost]:55816
[I409 16:51:43.964108584 TCPStoreLibUvBackend.cpp:1121] [c10d - trace] queue_push key:/test_prefix/foo address:[localhost]:55816
[I409 16:51:43.964123207 TCPStoreLibUvBackend.cpp:1121] [c10d - trace] queue_push key:/test_prefix/foo address:[localhost]:55816
[I409 16:51:43.964128194 TCPStoreLibUvBackend.cpp:940] [c10d - trace] check key_count:1 keys[0]:/test_prefix/foo address:[localhost]:55816
[I409 16:51:43.964156347 TCPStoreLibUvBackend.cpp:977] [c10d - trace] wait key_count:1 keys[0]:/test_prefix/foo address:[localhost]:55816
[I409 16:51:43.964187493 TCPStoreLibUvBackend.cpp:977] [c10d - trace] wait key_count:1 keys[0]:/test_prefix/foo address:[localhost]:55816
[I409 16:51:43.964217709 TCPStoreLibUvBackend.cpp:1133] [c10d - trace] queue_pop key:/test_prefix/foo address:[localhost]:55816
[I409 16:51:43.964324300 TCPStoreLibUvBackend.cpp:977] [c10d - trace] wait key_count:1 keys[0]:/test_prefix/foo address:[localhost]:55816
[I409 16:51:43.964354495 TCPStoreLibUvBackend.cpp:1133] [c10d - trace] queue_pop key:/test_prefix/foo address:[localhost]:55816
[I409 16:51:43.964416299 TCPStoreLibUvBackend.cpp:940] [c10d - trace] check key_count:1 keys[0]:/test_prefix/foo address:[localhost]:55816
[I409 16:51:43.964458733 TCPStoreLibUvBackend.cpp:977] [c10d - trace] wait key_count:1 keys[0]:/test_prefix/non_existant address:[localhost]:55816
[W409 16:51:43.974516585 socket.cpp:460] [c10d] waitForInput: poll for socket SocketImpl(fd=75, addr=[localhost]:55816, remote=[localhost]:46641) returned 0, likely a timeout
[W409 16:51:43.974559169 socket.cpp:485] [c10d] waitForInput: socket SocketImpl(fd=75, addr=[localhost]:55816, remote=[localhost]:46641) timed out after 10ms
[I409 16:51:43.974600451 TCPStoreLibUvBackend.cpp:1101] [c10d - trace] cancel_wait address:[localhost]:55816
```

Test plan:

```
$ pytest test/distributed/test_store.py -k queue -v -s

test/distributed/test_store.py::FileStoreTest::test_queues SKIPPED [0.4351s] (Store does not support queues)
test/distributed/test_store.py::HashStoreTest::test_queues PASSED [0.0009s]
test/distributed/test_store.py::PrefixFileStoreTest::test_queues SKIPPED [0.0006s] (Store does not support queues)
test/distributed/test_store.py::TCPStoreTest::test_queues SKIPPED [0.0012s] (Store does not support queues)
test/distributed/test_store.py::LibUvTCPStoreTest::test_queues PASSED [0.0014s]
test/distributed/test_store.py::PrefixTCPStoreTest::test_queues PASSED [0.0014s]
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150969
Approved by: https://github.com/XilunWu, https://github.com/fduwjj
2025-04-11 19:24:17 +00:00
83ae61fd8e [Inductor] Add Subgraph as a Autotuning Choice (#150653)
Add the option for providing a Subgraph as an autotuning choice in Inductor. This is crucial for implementing the split-k optimization for GEMMs by decomposing a mm -> bmm. https://github.com/pytorch/pytorch/pull/150654 uses these changes to add decomposeK as a default autotuning choice for aten.mm in Inductor.

Using https://github.com/pytorch/pytorch/pull/150654 and a simple script:

```
import torch

def f(a, b):
    return torch.matmul(a, b)

def decompose_func(a_in, b_in):
    M, K = a_in.shape
    K, N = b_in.shape

    # TODO: Ideally we want to autotune over this parameter
    kPartitions = 256
    assert K % kPartitions == 0, "K must be divisible by Kmini"
    B = K // kPartitions

    a_reshaped = a_in.reshape(M, B, kPartitions).transpose(
        0, 1
      )  # Shape: (B, M, kPartitions)
    b_reshaped = b_in.reshape(B, kPartitions, N)  # Shape: (B, kPartitions, N)
    result = torch.bmm(a_reshaped, b_reshaped)  # Shape: (B, M, N)
    return result.sum(dim=0).to(torch.float16)  # Sum over B dimension, Shape: (M, N)

for k in [4096, 8192, 12288, 16384, 20480, 24576, 28672, 32768]:
    a = torch.randn(32, k, dtype=torch.float16, device="cuda", requires_grad=True)
    b = torch.randn(k, 32, dtype=torch.float16, device="cuda", requires_grad=True)

    compiled_res = torch.compile(f, dynamic=False)(a, b)
    decompose_res = decompose_func(a, b)

    print(f"Compiled mm result close to aten: {torch.allclose(f(a, b), compiled_res, atol=1e-5, rtol=0.5)}")
    print(f"Compiled mm result close to decompose: {torch.allclose(decompose_res, compiled_res, atol=1e-5, rtol=0.5)}")
```

we are able to autotune the decomposeK optimization against aten and the traditional Triton templates in Inductor. DecomposeK is faster than aten by about ~10% on average and gives a >4x speedup over the best Triton templates on an H100 machine, e.g.:

```
AUTOTUNE mm(32x28672, 28672x32)
  decompose_k_mm 0.0126 ms 100.0%
  mm 0.0144 ms 87.5%
  triton_mm_69 0.0579 ms 21.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, num_stages=5, num_warps=4
  triton_mm_75 0.0677 ms 18.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, num_stages=4, num_warps=4
  triton_mm_76 0.0850 ms 14.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, num_stages=5, num_warps=4
  triton_mm_68 0.1444 ms 8.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, num_stages=5, num_warps=4
  triton_mm_72 0.1546 ms 8.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, num_stages=3, num_warps=4
  triton_mm_74 0.1819 ms 6.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, num_stages=4, num_warps=4
  triton_mm_67 0.1917 ms 6.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, num_stages=2, num_warps=4
  triton_mm_73 0.2766 ms 4.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, num_stages=3, num_warps=4
```

https://pastebin.com/g3FMaauT is the generated code from Inductor containing the subgraph decomposition for aten.mm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150653
Approved by: https://github.com/eellison
2025-04-11 19:08:43 +00:00
ad5e9065ac [Profiler/Easy] Remove temp flag for on-demand Memory Snapshot (#151068)
Summary: Now that the profiler implementation is in, we don't need the temporary flag. This also updates the submodule.

Test Plan: CI

Reviewed By: sanrise

Differential Revision: D72672186

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151068
Approved by: https://github.com/davidberard98
2025-04-11 18:50:25 +00:00
fe961679d5 [Inductor] add support for disabling atomic adds (#151033)
As title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151033
Approved by: https://github.com/eellison, https://github.com/shunting314
2025-04-11 18:41:56 +00:00
67d3053d4b Revert "update benchamark result due to <1% regression (#150937)"
This reverts commit 860765d621e14730f8b6e7344da0053c4f00d540.

Reverted https://github.com/pytorch/pytorch/pull/150937 on behalf of https://github.com/laithsakka due to regression diff reverted ([comment](https://github.com/pytorch/pytorch/pull/150937#issuecomment-2797611127))
2025-04-11 17:36:47 +00:00
6b32255e37 [c10d][fr] Add logging of nccl_version into fr and its dump (#151048)
Users also want to see the NCCL version in the FR dump, so let's add it to FR. We only add it per rank per PG NCCL comm, so this really adds only a couple of bytes to FR memory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151048
Approved by: https://github.com/kwen2501
2025-04-11 17:36:09 +00:00
5f5805a6ac Cache the value of torch_key in subproc (#151057)
No need to recalculate torch_key in subprocesses; let's pass it from the main process.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151057
Approved by: https://github.com/jamesjwu, https://github.com/masnesral
2025-04-11 17:30:23 +00:00
fc1cccd012 Register also future allocations in mempool with NCCL (#150684)
This is the final PR, where everything comes together.

The problem I'm trying to solve is the following: when we register a MemPool with the NCCL ProcessGroup, it calls `ncclCommRegister` on all the allocations that are _currently_ in the pool. However, any later allocation will _not_ be registered with the NCCL communicator!

This is terribly inconvenient, because it means that every piece of code that allocates a tensor must be changed to become aware of whether it's doing so within a private pool, and it must become aware of NCCL and of all the PGs in existence, in order to re-register that pool with them.

Moreover, I believe there can be performance implications because allocating tensors is usually done in the critical path (i.e., during the forward and backward of every step of a training), whereas registering memory is a slow operation that should be done once at init time.

With this PR, once the user registers a Mempool with the NCCL PG, we install some hooks into the CachingAllocator in order to listen for all future memory allocations and, if they belong to the pool, we automatically call `ncclCommRegister` on them! (In fact, we reuse the hooks that already exist for `TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK`).
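A rough sketch of the intended flow is below. It assumes a NCCL process group has already been initialized (e.g. via torchrun); `register_mem_pool` is an assumed, illustrative name for however the NCCL backend exposes pool registration, so check the actual API before relying on it.

```
# Hedged sketch, not the exact API from this PR.
import torch
import torch.distributed as dist

# assumes dist.init_process_group("nccl") has already run (e.g. under torchrun)
backend = dist.group.WORLD._get_backend(torch.device("cuda"))

pool = torch.cuda.MemPool()
backend.register_mem_pool(pool)  # hypothetical name: register the pool with the NCCL PG

with torch.cuda.use_mem_pool(pool):
    # with this PR, allocations made here are also ncclCommRegister'ed automatically,
    # not just the ones that existed at registration time
    t = torch.empty(1 << 20, device="cuda")

dist.all_reduce(t)
```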

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150684
Approved by: https://github.com/kwen2501
ghstack dependencies: #150683
2025-04-11 17:26:37 +00:00
99642182f2 Add mempool to allocator's trace events (#150683)
In the NCCL ProcessGroup we want to support being able to "register" with NCCL all the allocations that belong to a certain private MemPool. In order to do so on-the-fly for every new allocation, we register a hook for the CachingAllocator's TraceEvents. However, we were lacking a way to know whether a given TraceEvent belonged to the MemPool that we cared about or not. With this PR, we add a MempoolId_t field to the TraceEvents.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150683
Approved by: https://github.com/syed-ahmed, https://github.com/kwen2501
2025-04-11 17:26:37 +00:00
d385179886 [dtensor] add op support for torch.cumsum (#151071)
For `torch.cumsum`, any sharding placement should propagate through if the cumsum `dim` is not sharded; otherwise the input needs to be replicated first.
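A small illustration of that rule, assuming the public `torch.distributed.tensor` API and a CPU/gloo setup (not taken from the PR):

```
# run with: torchrun --nproc-per-node 2 cumsum_dtensor.py  (illustrative sketch)
import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Shard, distribute_tensor

dist.init_process_group("gloo")
mesh = init_device_mesh("cpu", (dist.get_world_size(),))

x = torch.arange(16, dtype=torch.float32).reshape(4, 4)
dt = distribute_tensor(x, mesh, [Shard(0)])  # sharded along dim 0

out_keep = torch.cumsum(dt, dim=1)  # dim 1 is not sharded: Shard(0) propagates through
out_repl = torch.cumsum(dt, dim=0)  # dim 0 is sharded: DTensor replicates first

dist.destroy_process_group()
```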

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151071
Approved by: https://github.com/wanchaol
2025-04-11 16:42:19 +00:00
1fe260f7c4 [cutlass backend] Add and fix logs, fix types, and make cutlass generator only generate GEMM (#150973)
Differential Revision: [D72760205](https://our.internmc.facebook.com/intern/diff/D72760205/)

We hardcoded to only use GEMM anyway.

This also highlights a problem with high instantiation levels: as the instantiation level goes higher (here it is 3333), the time it takes just to list the configs can already be long (here it is >3 minutes).

If we know exactly which configs we care about, we should have a way to generate them without calling the generators. But let's see if we need that.

using this script
```
import os

os.environ["TORCH_LOGS"] = "inductor"

import torch

import torch._inductor.config

torch._inductor.config.max_autotune = True
torch._inductor.config.force_disable_caches = True
torch._inductor.config.max_autotune_gemm_backends = "Aten,CUTLASS"
# intentionally use no cutlass ops
torch._inductor.config.cuda.cutlass_max_profiling_configs = 0
torch._inductor.config.cuda.cutlass_instantiation_level = "3333"

def main():
    M = 128
    dtype = torch.float16
    A = torch.randn(M, M, device="cuda", dtype=dtype)
    B = torch.randn(M, M, device="cuda", dtype=dtype)

    compiled_model = torch.compile(torch.mm)

    _ = compiled_model(A, B)
    print("done")

if __name__ == "__main__":
    main()
```

before, with logs:
```
CUTLASS library generated 7 operations in 235.03 seconds
Got cutlass configs: total number of ops: 4753. Filtering took 10.51 seconds
```

after:
```
CUTLASS library generated 1 operations in 207.39 seconds
Got cutlass configs: total number of ops: 4753. Filtering took 9.53 seconds
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150973
Approved by: https://github.com/ColinPeppler
2025-04-11 16:24:26 +00:00
f1364431f0 Add debug_lines of FXGraphCacheKey to AOTAutogradCacheEntry (#150594)
Previously we didn't save debug_lines because it's pretty large, but compared to the size of FXGraphCache entries it's still pretty small. So let's add it to AOTAutogradCache for easier debuggability.

Differential Revision: [D72361611](https://our.internmc.facebook.com/intern/diff/D72361611/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150594
Approved by: https://github.com/oulgen
2025-04-11 15:24:13 +00:00
38bec787fa cleanup JK for duplicate pt2 compile callbacks prevention (#148704)
Summary: This diff cleans up the JK we used for enabling `add pt2 callbacks for backward pass and prevent duplicate callbacks` feature.

Differential Revision: D70643543

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148704
Approved by: https://github.com/mlazos
2025-04-11 15:17:06 +00:00
91920661b4 Don't log benchmarking event to Scuba (#151053)
These two events are really common, and also make up a huge portion (~70%) of the logs we get internally in PT2 Compile Events. I don't think it's actually that useful to aggregate them, so instead of logging them to PT2 Compile Events, let's log them only to chromium.

These two events will still be visible from tlparse: they just won't be in our internal tables. Please let me know if folks disagree.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151053
Approved by: https://github.com/oulgen, https://github.com/masnesral
2025-04-11 14:56:36 +00:00
d94cc0e994 Optimize ConvTranspose2d stride description (#150819)
Fixes #150775

## Test Result

### Before

![image](https://github.com/user-attachments/assets/81cd932f-9447-4924-9553-a5cb88fc5d0e)

### After

![image](https://github.com/user-attachments/assets/6365c71c-7268-4226-b722-ee7446cb2467)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150819
Approved by: https://github.com/jbschlosser
2025-04-11 09:37:56 +00:00
183bca41de [dynamo] unimplemented -> unimplemented_v2 in variables/builder.py (#151044)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151044
Approved by: https://github.com/anijain2305, https://github.com/zou3519
2025-04-11 09:07:01 +00:00
d6f1c72354 [PrivateUse1] Allow out-of-tree devices to pass check when validating csr tensor args (#149374)
Fixes #149303
Follow-up: #147306

Because we have a dispatch key named `DispatchKey::SparseCsrPrivateUse1` for this case, we allow users to create a csr tensor on out-of-tree devices, so we should also let that pass the check.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149374
Approved by: https://github.com/FFFrog, https://github.com/albanD
2025-04-11 09:05:20 +00:00
5590a0692c [aotinductor] fix std::{min.max} compilation error for sympy expr with multiple args (#150894)
### Compilation error
The issue is that u0 (an unbacked symint) can come from a smaller int dtype, e.g. int16 or int32.
```
error: no matching function for call to ‘min(int64_t&, short int&)’
  759 |     call_add_kernel_with_scaling_0(... std::min(100L, s97, u0) ...);
```

### Diff
The fix is to explicitly specify `int64_t` in the std::min template.
```
int64_t s97 = arg0_1_size[0];
int16_t u0_raw;      // not a long
auto u0 = u0_raw;

// Before
std::min({100L, s97, u0})
// After
std::min<int64_t>({100L, s97, u0})
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150894
Approved by: https://github.com/desertfire
2025-04-11 07:32:47 +00:00
44ed0c9fbb Revert "[profiler] don't disable CUPTI_LAZY_REINIT for cuda >= 12.6 (#150957)"
This reverts commit 37812009fd123d5c4a038ce798eedd4a89eeffad.

Reverted https://github.com/pytorch/pytorch/pull/150957 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/150957#issuecomment-2795878848))
2025-04-11 05:38:58 +00:00
6c7336cb31 [Profiler][HPU] Enable profiler.key_averages().table() for HPU devices (#150770)
Fixes #150769

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150770
Approved by: https://github.com/sraikund16, https://github.com/jeromean
2025-04-11 05:17:12 +00:00
85ada5d6dd [Dynamo] Allow dynamo to handle 'or' operator between two dicts (#147305)
Fixes #146538

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147305
Approved by: https://github.com/anijain2305
2025-04-11 04:47:31 +00:00
6f6ff8837a [Inductor UT][Break XPU] Fix UTs for XPU broken by community. (#150830)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150830
Approved by: https://github.com/anmyachev, https://github.com/desertfire, https://github.com/jansel
ghstack dependencies: #149862
2025-04-11 04:30:46 +00:00
d186c933f8 [Inductor UT][Break XPU] Apply CUDA tolerances changes on XPU that introduced by #144579. (#149862)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149862
Approved by: https://github.com/desertfire, https://github.com/jansel
2025-04-11 04:30:46 +00:00
a22d3e778e [dynamo][guards] Print relational guards only once (#150810)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150810
Approved by: https://github.com/anijain2305
2025-04-11 04:10:37 +00:00
8b5e717601 c10d/Store: add clone feature (#150966) (#150966) (#151045)
Summary:
This adds a new `clone()` method to Store which will return a new Store instance that can be used from a different thread.

This is intended to better support multiple threads with stores such as when ProcessGroupNCCL needs a store to do error propagation.

Related issue: https://github.com/pytorch/pytorch/issues/150943

Approved by: https://github.com/fduwjj

Test Plan:
contbuild & OSS CI, see 205881ea4a

Test plan from GitHub:
```
pytest test/distributed/test_store.py -k PythonStore
pytest test/distributed/test_store.py -k clone
```
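A minimal single-process sketch of the intended usage (host, port, and key names are illustrative):

```
import threading
import torch.distributed as dist

store = dist.TCPStore("localhost", 29500, world_size=1, is_master=True)

def watcher(s):
    # `s` is a clone, so it is safe to use from this worker thread
    s.set("status", "ok")

t = threading.Thread(target=watcher, args=(store.clone(),))
t.start()
print(store.get("status"))  # blocks until the worker has set the key
t.join()
```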

Differential Revision: D72789690

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151045
Approved by: https://github.com/XilunWu, https://github.com/fduwjj
2025-04-11 04:00:23 +00:00
75162aa7de [ONNX] Support running bfloat16 models with ONNX Runtime (#149646)
Use ORTValue objects to support bfloat16 and other dtypes as inputs. This only supports CUDA, as ORT only implements bfloat16 on CUDA.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149646
Approved by: https://github.com/titaiwangms
2025-04-11 03:38:26 +00:00
86370fd658 [dynamo] Allow guards to be dropped with custom filter functions. (#150936)
Summary: A follow up of https://github.com/pytorch/pytorch/pull/150689.

Test Plan: test_dynamo -k test_guard_filter_fn

Differential Revision: D72722322

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150936
Approved by: https://github.com/jansel
2025-04-11 03:06:34 +00:00
4b0cf9fc00 Optimize transformer encoder/decoder init suggestion (#146882)
Fixes #72253

Add a hint message for users to manually initialize the module after it is created.

## Test Result

**Before**

![image](https://github.com/user-attachments/assets/1914223f-008e-4ff7-aea1-c54c55679f65)

![image](https://github.com/user-attachments/assets/fd4110c1-26f7-48fe-9582-80581ab72328)

**After**

![image](https://github.com/user-attachments/assets/12270ba2-b384-4fe6-b351-4287b272d102)

![image](https://github.com/user-attachments/assets/0194e3a0-700a-40da-a9de-e9854c2d5d2e)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146882
Approved by: https://github.com/jbschlosser
2025-04-11 02:31:56 +00:00
1e92579126 Add torch._scaled_mm for CPU (#150410)
This PR is the duplicated one for https://github.com/pytorch/pytorch/pull/139975.

This PR is to add torch._scaled_mm for CPU backend.

_scaled_mm_out_cpu and _scaled_mm_cpu are newly added and included in the torch._scaled_mm CPU dispatch. We also add _scaled_mm_out_cpu_emulated as a fallback function if the current platform cannot run FP8 matmul using oneDNN. This PR also updates the various FP8-related UTs to support CPU tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150410
Approved by: https://github.com/atalman
2025-04-11 02:23:03 +00:00
24ca7e91e6 [1/N] Use internal linkage in torch/csrc C++ files. (#150930)
Turn more functions and variables into static if they are not used outside the cpp files. Unused functions are removed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150930
Approved by: https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-04-11 02:19:31 +00:00
48132de4af [c10d][fr] Fix the false positive in the dtype check in fr analysis script (#151063)
When checking dtype in the FR analysis script, we should only check it when the input or output numel is larger than zero. For the gather or scatter case, the output/input size will be an empty list on non-src or non-dst ranks, so we should just skip the check there.

Differential Revision: [D72826823](https://our.internmc.facebook.com/intern/diff/D72826823)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151063
Approved by: https://github.com/d4l3k, https://github.com/kwen2501
2025-04-11 02:11:58 +00:00
df4e5294a6 Reapply "ProcessGroupGloo: support lazy_init (#150801)" (#151031)
This reverts commit 73f3d6d9aaa128d9917e8b3790933ba2855066cc.

Reapplies #150801

Test plan:

See #150801

submodule

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151031
Approved by: https://github.com/fduwjj
2025-04-11 01:58:35 +00:00
b7c0fda163 [MPS] Fix determine_backend_memory_format logic (#151042)
If the input is channels-last, then MPS will return a channels-last output.

This fixed `GPUTests.test_convolution_4_mps` from test_torchinductor.py

That previously failed with
```
AssertionError: expected size 3==3, stride 1==192 at dim=1; expected size 12==12, stride 48==16 at dim=2; expected size 16==16, stride 3==1 at dim=3
```
This happened because the FakeTensor implementation of conv returned a `Contiguous` rather than a `ChannelsLast` layout on macOS 15 or later.
This doesn't seem to be very well documented, so I will try to document the call path for the `ExternKernel` invocation of `aten::convolution`:
 - First inductor decomp defined here is called
 c93e4b8290/torch/_inductor/kernel/conv.py (L424-L425)

- Then it goes thru FakeTensor decomposition implemented here
320914f1b6/torch/_subclasses/fake_impls.py (L739-L740)
- Finally it goes down to convolution meta registrations implemented here
320914f1b6/torch/_meta_registrations.py (L2416-L2417)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151042
Approved by: https://github.com/dcci
2025-04-11 01:51:34 +00:00
320914f1b6 [c10d][libuv] Add back correct EOF case check (#151052)
We removed the wrong EOF case in https://github.com/pytorch/pytorch/pull/150987, and we add the correct one back in this PR. Since https://github.com/pytorch/pytorch/pull/150987 is a fix, we merged that PR first and use this PR as a follow-up to make the logic more complete.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151052
Approved by: https://github.com/XilunWu
2025-04-11 01:37:30 +00:00
c93e4b8290 [BC-breaking] Set NonStrict as default for export_for_training (#150941)
Summary:
- Flip default value of `strict` argument from True to False on torch.export.export_for_training API
- All callsites have been updated to provide this argument explicitly to avoid behavior change.
- If you see any breakages, that means you may have a new callsite that was missed; please set `strict=True` explicitly at the callsite to mitigate (a minimal sketch is shown after this list).
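A minimal sketch of the new default and of opting back into the previous behavior (the toy module is illustrative):

```
import torch
from torch.export import export_for_training

class M(torch.nn.Module):
    def forward(self, x):
        return x.sin() + 1

args = (torch.randn(4),)
ep_default = export_for_training(M(), args)              # non-strict tracing is now the default
ep_strict = export_for_training(M(), args, strict=True)  # explicitly keep the old strict behavior
```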

Test Plan: CI

Differential Revision: D72724975

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150941
Approved by: https://github.com/ydwu4
2025-04-11 00:50:05 +00:00
e945247f05 Revert two recent prologue prs (#151013)
These were landed in a bit of a rush to try to make the release. Reverting; we will then re-land with https://github.com/pytorch/pytorch/pull/151009 applied and do a full benchmark run with max-autotune.

Differential Revision: [D72791103](https://our.internmc.facebook.com/intern/diff/D72791103)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151013
Approved by: https://github.com/zou3519
2025-04-10 23:48:41 +00:00
c9a35c2a6e [C10D] Document object collectives limitations (#150815)
Adds louder warning labels in the doc page and docstring for object
collectives in hopes of raising awareness of several footgun issues
including accidental creation of cuda contexts by serializing and
sending 'device-local' gpu tensors over the object-* apis.

Preview:
<img width="902" alt="image" src="https://github.com/user-attachments/assets/e0c08c70-d8e5-4e15-b3e2-5cd563714f71" />

addresses #150798

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150815
Approved by: https://github.com/kwen2501
2025-04-10 22:48:39 +00:00
dbcd0b571d Back out "[AOTI] Always use oss schema for ExternKernelNodes serialization" (#151026)
Summary: Revert for FC breaking

Test Plan: CI

Differential Revision: D72802075

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151026
Approved by: https://github.com/hl475
2025-04-10 22:36:35 +00:00
f304483e95 [ONNX] Add asdict method to VerificationInfo class (#151024)
This pull request introduces a new method to convert `VerificationInfo` objects to dictionaries and includes a corresponding test to ensure the method works correctly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151024
Approved by: https://github.com/titaiwangms
2025-04-10 22:23:33 +00:00
8d81806211 [inductor] Change minimum number of SMs to 60 to let Ada use Triton GEMM backend (#150888)
context: https://github.com/pytorch/pytorch/issues/150390#issuecomment-2790272814

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150888
Approved by: https://github.com/jansel
2025-04-10 22:10:55 +00:00
e786b3bf54 Revert "[inductor] Change minimum number of SMs to 60 to let Ada use Triton GEMM backend (#150888)"
This reverts commit 115a165f9b24e3aaaeb2d0994678116758bd636f.

Reverted https://github.com/pytorch/pytorch/pull/150888 on behalf of https://github.com/malfet due to This indeed broke all those inductor tests ([comment](https://github.com/pytorch/pytorch/pull/150888#issuecomment-2795231901))
2025-04-10 21:46:23 +00:00
6a65f2c4fe Revert "Support tuning of _scaled_grouped_mm (#150421)"
This reverts commit 8efcf21fff327d155350bf26ccba769bab58c077.

Reverted https://github.com/pytorch/pytorch/pull/150421 on behalf of https://github.com/malfet due to Looks like it broke lint, see a0ab243c3a/1 ([comment](https://github.com/pytorch/pytorch/pull/150421#issuecomment-2795218547))
2025-04-10 21:36:41 +00:00
a0ab243c3a Revert "Generalize poison fork logic for each device backend (#144664)"
This reverts commit 83bd0b63b55f224fada6d5f6dd7eb5b4cb3072fb.

Reverted https://github.com/pytorch/pytorch/pull/144664 on behalf of https://github.com/atalman due to failing internal tests ([comment](https://github.com/pytorch/pytorch/pull/144664#issuecomment-2795157082))
2025-04-10 21:02:14 +00:00
8efcf21fff Support tuning of _scaled_grouped_mm (#150421)
This includes the default aten implementation, as well as a Triton
implementation imported from FBGEMM
(https://github.com/pytorch/FBGEMM/blob/main/fbgemm_gpu/experimental/gemm/triton_gemm/grouped_gemm.py)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150421
Approved by: https://github.com/ngimel
2025-04-10 20:34:16 +00:00
abe41c5c9c Revert "c10d/Store: add clone feature (#150966)"
This reverts commit 205881ea4a451574c3a3de87c42484043a955d6e.

Reverted https://github.com/pytorch/pytorch/pull/150966 on behalf of https://github.com/atalman due to failing internally ([comment](https://github.com/pytorch/pytorch/pull/150966#issuecomment-2795063574))
2025-04-10 20:17:53 +00:00
8fdd61bc45 Fix torchscript issues with reference quantized modules (#150870)
Summary:
The reference quantized modules for linear / conv / etc fail to torchscript due to two issues

(1) The type of torch.qscheme doesn't script
(2) The "_DTYPE_TO_QVALUE_BOUNDS" values were resolving to union[float, int] instead of just int. We fix that with a hard cast.

See: <internal post> + comments for more context

Test Plan: unit tests + fixing this NB N6923590

Differential Revision: D72652616

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150870
Approved by: https://github.com/jerryzh168
2025-04-10 20:14:45 +00:00
31162214d8 Revert "[AOTI] Remove typedef for half and bfloat16 (#150657)"
This reverts commit 357814c85c00a2b5b3fb9add97735e4789caa7e0.

Reverted https://github.com/pytorch/pytorch/pull/150657 on behalf of https://github.com/atalman due to failing internally ([comment](https://github.com/pytorch/pytorch/pull/150657#issuecomment-2795042772))
2025-04-10 20:08:03 +00:00
252029b294 [Inductor] assert fallback output alignment (#150804)
The previous PR (https://github.com/pytorch/pytorch/pull/150777) fixes the alignment problem for fallback kernels assuming the meta kernel is correct. This PR handles the case where the meta kernel is incorrect: an assertion is added if the compiler assumes a fallback kernel output is aligned.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150804
Approved by: https://github.com/jansel, https://github.com/eellison
ghstack dependencies: #150777
2025-04-10 20:01:06 +00:00
115a165f9b [inductor] Change minimum number of SMs to 60 to let Ada use Triton GEMM backend (#150888)
context: https://github.com/pytorch/pytorch/issues/150390#issuecomment-2790272814

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150888
Approved by: https://github.com/jansel
2025-04-10 19:46:35 +00:00
4161c752bb [dynamo] unpack sequence lazily for list extend/deque extendleft (#150965)
Fixes https://github.com/pytorch/pytorch/issues/133063.

We were unpacking generators/iterators eagerly when we should be unpacking them one-by-one.
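An illustrative example of the kind of code this affects (not from the PR): a generator fed to `list.extend` inside a compiled function is now consumed one element at a time rather than unpacked eagerly.

```
import torch

@torch.compile
def f(xs):
    out = []
    out.extend(x * 2 for x in xs)  # generator is now unpacked lazily
    return out

print(f([torch.ones(2), torch.full((2,), 3.0)]))
```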

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150965
Approved by: https://github.com/jansel
2025-04-10 19:31:31 +00:00
389cd15265 [export] check tuple length mismatch for dynamic_shapes spec (#150976)
Summary: we weren't checking this.

Test Plan: test_export

Differential Revision: D72761995

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150976
Approved by: https://github.com/angelayi
2025-04-10 19:08:43 +00:00
f663aa4e81 [c10d][tcp_store] Fix connection reset caused by wrong socket close (#150987)
While fixing the memory leak in https://github.com/pytorch/pytorch/pull/145757, we accidentally closed the socket in the case when nread == 0, thinking it meant the connection was closed. This is not true. According to the libuv doc: https://docs.libuv.org/en/v1.x/stream.html#c.uv_read_cb.

> nread might be 0, which does not indicate an error or EOF. This is equivalent to EAGAIN or EWOULDBLOCK under read(2).

We found this bug when debugging a broken pipe issue when users first call a set and then wait for all keys right afterwards on 128 ranks. This might also cause other broken pipe issues we have seen in the prod jobs recently.

Added a unit test to test this case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150987
Approved by: https://github.com/d4l3k, https://github.com/XilunWu
2025-04-10 18:48:57 +00:00
e7ed50f27b [async TP] Fix handling of case where scatter dim = 0 for 2D output tensor (#150935)
## Summary of changes

1. Change the assertion to a warning when no all-gather or reduce-scatter patterns are found, and remove the corresponding unit test. It seems some valid TP graphs may not have any pattern matches, from what I can see.
2. Fix wrong variable name being referenced (`A_with_scatter_dim_0` instead of just `A`)
3. Simplify reshaping to target output shape (don't need to recalculate output shape)
4. When the "A" tensor is 2D, so we are doing a 2D x 2D scaled mm, we need to fix our handling of the case where the scatter dim is 0. When the scatter dim is 0 for the 2D scaled mm output shape, this is actually dim 1 in the unreduced stacked partial scaled mm outputs, which have a (logical) shape of `(group_size, M//group_size, N)`. To summarize (see the sketch after this list):
    - Unreduced stacked partials are of shape `(M, N)`
    - We view them as `(group_size, M//group_size, N)` and reduce along the scatter dim (`group_size` / dim 0).
    - The reduced output (`reduced_out`) has shape `(M//group_size, N)`
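A plain-tensor sketch of that reshaping (sizes are hypothetical; no distributed setup is needed to see the shape math):

```
import torch

group_size, M, N = 4, 16, 8
stacked_partials = torch.randn(M, N)  # unreduced stacked partial scaled-mm outputs

# view as (group_size, M//group_size, N) and reduce over dim 0 (the scatter dim)
reduced_out = stacked_partials.view(group_size, M // group_size, N).sum(dim=0)
assert reduced_out.shape == (M // group_size, N)
```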

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150935
Approved by: https://github.com/lw
2025-04-10 18:25:48 +00:00
08831f30bb [Intel GPU] Allow XPU backend in Depthwise_conv2d&3d operators (#149114)
This modification is to support XPU kernels for depthwise_conv2d and depthwise_conv3d.
Currently, when running depthwise_conv on XPU devices, it is calculated with Mkldnn via the ConvBackend::Overrideable path.
After this modification, depthwise_conv will be calculated directly using XpuDepthwise3d when the Mkldnn backend is disabled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149114
Approved by: https://github.com/guangyey, https://github.com/albanD
2025-04-10 17:49:27 +00:00
37812009fd [profiler] don't disable CUPTI_LAZY_REINIT for cuda >= 12.6 (#150957)
Credit to @mgmtea who wrote the initial version of this PR: https://github.com/pytorch/pytorch/pull/146604

Context: CUPTI is the NVIDIA library that Kineto uses for collecting GPU-side info during profiling. The intended usage is to register a callback while you want profiling to occur, and then unregister the callback when you want profiling to stop. But a bug would cause crashes if CUPTI callbacks were de-registered when used with cudagraphs. The workaround was to disable "CUPTI_LAZY_REINIT" and "CUPTI_TEARDOWN" in Kineto - which prevents crashes, but can result in slower execution after profiling has occurred and completed.

This bug is believed to be fixed in CUDA >= 12.6, so this PR qualifies that DISABLE_CUPTI_LAZY_REINIT=1 and CUPTI_TEARDOWN=0 should only be applied if CUDA < 12.6. Additionally, `profiler_allow_cudagraph_cupti_lazy_reinit_cuda12()` is added as an escape hatch so that we can add a killswitch in case we see more crashes related to this.

Differential Revision: [D72745929](https://our.internmc.facebook.com/intern/diff/D72745929)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150957
Approved by: https://github.com/aaronenyeshi, https://github.com/Skylion007
2025-04-10 17:45:01 +00:00
6720d23969 Fixing NCCL abort hang issue when a ProcessGroupNCCL manages multiple ncclComms (#150690)
Detail of the issue:

If PyTorch issues send/recv to each 2-rank comm, and these comms are managed by a single ProcessGroupNCCL instance, then the comms need to be aborted either in sequence or as a group.

I.e. the following sequential abort will cause a hang in NCCL:
recv(..., comm0, stream);
send(..., comm1, stream);
abort(comm1);
abort(comm0);

Fixes #119797

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150690
Approved by: https://github.com/kwen2501, https://github.com/eqy, https://github.com/atalman
2025-04-10 17:33:26 +00:00
1250106630 [pytorch] Remove numpy dependency from Knapsack Evaluator (#150825)
Summary:
The two implementations are functionally equivalent. They both calculate the memory budget at the knee point in the Pareto frontier using the same algorithm.

1. np.linspace -> basic list comprehension
2. runtime and memory values -> lists instead of numpy arrays
3. np.ptp -> max - min
4. np.norm -> diff with min value / range
5. np.sqrt -> **0.5
6. np.argmin -> .index(min(_)) (see the sketch after this list)
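A hedged sketch of these pure-Python replacements (the values and the exact knee-point criterion here are illustrative, not copied from the evaluator):

```
runtimes = [3.0, 2.1, 1.4, 1.2, 1.15]        # hypothetical per-budget recompute runtimes
n = len(runtimes)

budgets = [i / (n - 1) for i in range(n)]    # np.linspace(0, 1, n) -> list comprehension

lo, hi = min(runtimes), max(runtimes)        # np.ptp -> max - min
norm = [(r - lo) / (hi - lo) for r in runtimes]

# np.sqrt -> ** 0.5, np.argmin -> .index(min(...))
dists = [(b ** 2 + v ** 2) ** 0.5 for b, v in zip(budgets, norm)]
knee_budget = budgets[dists.index(min(dists))]
print(knee_budget)
```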

Test Plan:
# Unit Testing

```
buck test mode/opt //caffe2/test/functorch:test_ac_knapsack; pingme "tests done"
Buck UI: https://www.internalfb.com/buck2/f4e41eb8-e775-4f04-b4e7-8e567599deb8
Test UI: https://www.internalfb.com/intern/testinfra/testrun/10133099236155875
Network: Up: 24KiB  Down: 1.9GiB  (reSessionID-7cd11487-f3e7-43ab-982a-805510771c8d)
Executing actions. Remaining      0/259826                                                                                                  98:15:40.5s exec time total
Command: test.     Finished 3 local, 5 remote, 103467 cache (99% hit)                                                                       98:15:14.8s exec time cached (99%)
Time elapsed: 1:09.9s
Tests finished: Pass 15. Fail 0. Fatal 0. Skip 0. Build failure 0
```

# End to End Testing

### Baseline Run with DP

Let's confirm everything we are running on works.

- Optimization Algo: DP
- Memory Budget: 0.05
- AIX Link: apf_local-basilwong-2025-03-22_20:39:10
- TLParse rank 0: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpDJaWp5/rank_0/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000
- TLParse rank 1:  https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpDJaWp5/rank_1/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

### Dynamic Memory Budget (Before Change)

- Revision: 2c95489b7f79
- Optimization Algo: Dynamic Memory Budget
- Memory Budget: 0.05
- AIX Link: https://www.internalfb.com/mlhub/pipeline/4088035428184866
- TLParse:
   - https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpykEy8U/rank_0/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000
   - https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpykEy8U/rank_1/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

### Dynamic Memory Budget (After Change)

- Revision: 14353eef3c9e
- Optimization Algo: Dynamic Memory Budget
- Memory Budget: 0.05
- AIX Link: https://www.internalfb.com/mlhub/pipeline/1613558749306737
- TLParse Links:
   - https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpZKNWFw/rank_0/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000
    - https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpZKNWFw/rank_1/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

As a sanity check, let's take the AC information for compile id 7_0_0 from rank 0 of each TLParse.

 {F1976883124}

* Baseline: P1779400819
   * Saved node values show we are storing much more compared to dynamic memory:

```
  "Knapsack Saved Nodes": [
    16,
    17,
    19,
    20,
    21,
    22,
    24,
    25,
    26,
    27,
    28,
    29,
    30,
    31,
    32,
    33,
    34,
    35,
    36,
    37,
    38,
    39,
    40,
    41,
    42,
    43,
    44,
    45,
    46,
    47,
    49,
    50,
    51,
    52,
    53,
    54,
    55,
    56,
    57,
    58,
    59,
    60
  ]
```

* Before Change: P1779401775
   * Saved nodes are similar to the after-change run, but not identical.

```
  "Knapsack Saved Nodes": [
    24,
    25,
    26,
    27,
    28,
    29,
    30,
    31,
    32,
    33,
    34,
    35,
    36,
    37,
    38,
    39,
    40,
    41,
    42,
    43,
    44,
    45,
    46,
    47,
    49,
    50
  ]
```

* After Change: P1779402106
   * Here we see that the largest saved nodes are around the same, but there is a small discrepancy for the smallest nodes.

```
  "Knapsack Saved Nodes": [
    24,
    25,
    26,
    27,
    28,
    29,
    30,
    31,
    32,
    33,
    34,
    35,
    36,
    37,
    38,
    39,
    40,
    41,
    42,
    43,
    44,
    45,
    46,
    47,
    50,
    51,
    57,
    58,
    59,
    60,
    61,
    62
  ],
```

The discrepancy can be explained by looking at the estimated memory values. This is the non-deterministic part (below are the top 5 memory values for the considered candidates):

```
    0.05774741703905514,
    0.007333005338292718,
    0.007333005338292718,
    0.007333005338292718,
    0.007333005338292718,
```

vs

```
    0.049254204820440746,
    0.006254502199421049,
    0.006254502199421049,
    0.006254502199421049,
    0.006254502199421049,
```

Based on the dynamic memory implementations performing similarly in the E2E test, and on the memory estimation being non-deterministic, we should be good to land.

Differential Revision: D71692245

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150825
Approved by: https://github.com/seemethere, https://github.com/jansel
2025-04-10 17:07:03 +00:00
5471e80fb4 Remove guard_size_oblivious from vector_norm decomposition. (#148809)
This PR removes the usage of guard_size_oblivious in vector_norm by inlining it in the runtime check.
This prevents any data-dependent error from ever appearing at the locations where guard_size_oblivious
used to exist; before this PR it could potentially break. This is NOT BC-breaking and does not change semantics from eager.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148809
Approved by: https://github.com/bobrenjc93
2025-04-10 16:19:00 +00:00
e6969c1bd8 [export] Symint support (nonstrict, Dim.DYNAMIC) (#150198)
Fixes https://github.com/pytorch/pytorch/issues/113682 only in the non-strict export case. Also we only support Dim.DYNAMIC/AUTO, not named-Dims

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150198
Approved by: https://github.com/pianpwk
2025-04-10 15:06:23 +00:00
596e44d26a [inductor] Enable docstring_linter on _inductor (#144622)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144622
Approved by: https://github.com/eellison
ghstack dependencies: #144621
2025-04-10 14:32:26 +00:00
ba35793226 [inductor] Add tests for new docstring_linter features (fix #142496) (#144621)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144621
Approved by: https://github.com/eellison
2025-04-10 14:32:26 +00:00
73f3d6d9aa Revert "ProcessGroupGloo: support lazy_init (#150801)"
This reverts commit f237ee54bfb35d16cd10e358d4b78578c88a5781.

Reverted https://github.com/pytorch/pytorch/pull/150801 on behalf of https://github.com/atalman due to failing internally ([comment](https://github.com/pytorch/pytorch/pull/150801#issuecomment-2793161239))
2025-04-10 13:44:31 +00:00
7b7b9d707e [CI] Add XPU compiled check in CICD (#150771)
Address the suggestion from https://github.com/pytorch/pytorch/issues/150001#issuecomment-2753407421

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150771
Approved by: https://github.com/malfet, https://github.com/atalman
2025-04-10 13:33:27 +00:00
4273e5d15c Expose is_available API for torch.backends.mkldnn (#147432)
As the title stated.

Like torch.backends.mkl, torch.backends.openmp and so on, they all expose
is_available API for users.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147432
Approved by: https://github.com/albanD
2025-04-10 05:05:37 +00:00
1a1a32ce5a [elastic][test] fix race condition in test_barrier_timeout_rank_tracing (#150768)
# Root cause
The barrier timeout of 0.1 is too short; some threads may not have enough time to reach the barrier.

# How to reproduce
Adding some sleep will be easy to reproduce.
```python
    def test_barrier_timeout_rank_tracing(self):
        N = 3

        store = dist.HashStore()

        def run_barrier_for_rank(i: int):
            if i != 0:
                import time;time.sleep(1)  # Let some thread sleep for a while
            try:
                store_util.barrier(
                    store,
                    N,
                    key_prefix="test/store",
                    barrier_timeout=0.1,
                    rank=i,
                    rank_tracing_decoder=lambda x: f"Rank {x} host",
                    trace_timeout=0.01,
                )
            except Exception as e:
                return str(e)
            return ""

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150768
Approved by: https://github.com/d4l3k
2025-04-10 04:40:16 +00:00
a6933a1c42 [Inductor] Remove triton dtype patch which has landed (#149611)
As this [pr][0] has already landed, we should remove its patch.

Having [mentioned][1] this before, I am making this change now to avoid omissions.

[0]: https://github.com/triton-lang/triton/pull/3342
[1]: https://github.com/pytorch/pytorch/pull/147583/files#r1970440062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149611
Approved by: https://github.com/eellison
2025-04-10 03:42:55 +00:00
b80bb87689 cpp_wrapper: Miscellaneous fixups (#150143)
1. Revisit preprocessing code in cpp_builder.py, removing a hack that channels it through stdout.
2. Fix ops that return None.

Differential Revision: [D72053414](https://our.internmc.facebook.com/intern/diff/D72053414)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150143
Approved by: https://github.com/desertfire
2025-04-10 03:31:12 +00:00
cd80778ac8 Fix issue in optimized_add issue: make_optimized should be called on non args only (#150955)
PR https://github.com/pytorch/pytorch/pull/149665 made a change to optimized_add that is causing an issue internally.
In general, make_optimized should only be called with valid new_args; new_args can become None
when the elements already exist, and in that case we should break out of the loop.

Note that I also only maintained the optimized summation when both the lhs and rhs lengths are <= 2.
This is OK because the optimization is based on the inductive property of adding one symbol at a time;
the [2]+[2] case here serves as the base case (I feel we could also remove it).

Note that keeping the optimization for all sizes, while correct, may not be as efficient (we would do N log(N) insertions);
there is no current justification for that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150955
Approved by: https://github.com/Mingming-Ding, https://github.com/atalman, https://github.com/bobrenjc93
2025-04-10 03:00:21 +00:00
bf7d8ef10d [Docs] Clarify behavior when integer dtype is used with requires_grad=True in tensor.to() (#150913)
Fixes #150618

Related comment: https://github.com/pytorch/pytorch/issues/3226#issuecomment-489362234

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150913
Approved by: https://github.com/janeyx99, https://github.com/soulitzer, https://github.com/cyyever

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2025-04-10 02:52:58 +00:00
78b3d71ece Docs: Add missing whitespace in the cmake warning message (#150929)
A trailing whitespace is needed so that the message is concatenated with the following string correctly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150929
Approved by: https://github.com/Skylion007
2025-04-10 02:50:56 +00:00
3d3fcaaf7b Delegate torch.accelerator.device_count to torch.xxx.device_count for multi-process usage (#149924)
# Motivation
Adapt `torch.accelerator.device_count` for multi-process usage. For example, `torch.cuda.device_count` avoids poisoning fork, so `torch.accelerator.device_count` should meet the same requirement.
Now that `torch.get_device_module(device).device_count` supports this, `torch.accelerator.device_count` should align with this behavior as well.
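A small sketch of why fork-safety matters here, illustrative only and assuming a Linux host (fork start method): querying the count in the parent must not initialize the device runtime, so forked children still work.

```
import multiprocessing as mp
import torch

def worker() -> None:
    # the forked child can still query (and later initialize) the accelerator,
    # because the parent's device_count() call did not poison the fork
    print(torch.accelerator.device_count())

if __name__ == "__main__":
    n = torch.accelerator.device_count()  # safe to call before forking
    ctx = mp.get_context("fork")
    procs = [ctx.Process(target=worker) for _ in range(max(n, 1))]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```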

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149924
Approved by: https://github.com/albanD
ghstack dependencies: #147507
2025-04-10 02:37:37 +00:00
6972255dad Document poison fork note for accelerator APIs (#147507)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147507
Approved by: https://github.com/sraikund16, https://github.com/kwen2501, https://github.com/albanD
2025-04-10 02:37:37 +00:00
83bd0b63b5 Generalize poison fork logic for each device backend (#144664)
# Motivation
Generalize the poison_fork code to make it reusable across different devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144664
Approved by: https://github.com/EikanWang, https://github.com/albanD
2025-04-10 02:34:53 +00:00
cyy
322f883c0c Remove unneeded CUDA logic from _create_build_env (#145822)
Because FindCUDAToolkit.cmake has that logic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145822
Approved by: https://github.com/albanD
2025-04-10 02:17:28 +00:00
cyy
54827752a4 [5/N] Remove unnecessary once flag usage (#147445)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147445
Approved by: https://github.com/albanD
2025-04-10 01:48:10 +00:00
205881ea4a c10d/Store: add clone feature (#150966)
This adds a new `clone()` method to Store which will return a new Store instance that can be used from a different thread.

This is intended to better support multiple threads with stores such as when ProcessGroupNCCL needs a store to do error propagation.

Related issue: https://github.com/pytorch/pytorch/issues/150943

Test plan:

```
pytest test/distributed/test_store.py -k PythonStore
pytest test/distributed/test_store.py -k clone
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150966
Approved by: https://github.com/fduwjj
2025-04-10 01:41:50 +00:00
061832bc7a Gracefully handle optree less than minimum version (#150956)
Summary:
- We are saying the minimum version of optree that PyTorch can use is
  0.13.0
- If a user imports torch.utils._cxx_pytree, it will raise an
   ImportError if optree doesn't exist or exists and is less than the
   minimum version (see the sketch after this summary).

Fixes https://github.com/pytorch/pytorch/issues/150889. There are
actually two parts to that issue:
1. dtensor imports torch.utils._cxx_pytree, but the optree installed in
   the environment might be too old. Instead, raising ImportError in
   torch.utils._cxx_pytree solves the issue.
2. We emit an "optree too low version" warning. I've deleted the
   warning in favor of the more explicit ImportError.
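A hedged sketch of how downstream code can cope with the new ImportError, falling back to the pure-Python pytree (module paths are the private ones named above):

```
try:
    import torch.utils._cxx_pytree as pytree  # raises ImportError if optree is absent or too old
except ImportError:
    import torch.utils._pytree as pytree      # pure-Python fallback

leaves, spec = pytree.tree_flatten({"a": 1, "b": [2, 3]})
print(leaves)  # [1, 2, 3]
```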

Test Plan:
- code reading
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150956
Approved by: https://github.com/albanD, https://github.com/atalman, https://github.com/XuehaiPan
2025-04-10 01:22:50 +00:00
9d1528186f Fix static functions when using module in MSVC (#148675)
If you try to use torch in C++ using modules, it will not compile because static functions are not supported in MSVC when using modules: https://developercommunity.visualstudio.com/t/10323558.

It's also aligned with [C++20 standard](https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2020/n4849.pdf) (ISO/IEC 14882:2020) 10.2.7 Export declaration [module.interface]: "Exported names have either external linkage or no linkage".

Fixes https://github.com/pytorch/pytorch/issues/71309
Tested using the following code.

```c++
export module testModule;

import <torch/torch.h>;
import <memory>;
import <string>;
import <tuple>;
import <iostream>;

export namespace testModule
{

    export void test()
    {
        torch::Tensor tensor1 = torch::rand({ 2, 3 });
        torch::Tensor tensor2 = torch::rand({ 3, 2 });
        // Perform tensor multiplication
        torch::Tensor result = torch::matmul(tensor1, tensor2);

        // Print the tensors
        std::cout << "Tensor 1: " << tensor1 << std::endl;
        std::cout << "Tensor 2: " << tensor2 << std::endl;
        std::cout << "Result of multiplication: " << result << std::endl;
    }
}
```

```c++
import testModule;

int main()
{
	testModule::test();
	return 0;
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148675
Approved by: https://github.com/albanD, https://github.com/malfet

Co-authored-by: mantaionut <ionut@janeasystems.com>
2025-04-10 01:19:54 +00:00
69cee91a55 Code Clean: Using the new builtin function provides by python 3.8 later (#150839)
Changes:
- reversed
- math.perm
- inspect.getfile

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150839
Approved by: https://github.com/Skylion007
2025-04-10 01:17:39 +00:00
f3cf3ec591 [AOTInductor] Add User Managed buffer for AOTI constant buffer. (#150276)
Summary:
We add the functionality to allow users to directly pass an at::Tensor
into AOTInductor to be used as the constant.
This user-managed buffer skips the copying step in AOTInductor and lets
users directly manage the memory usage themselves.

Test Plan:
LD_LIBRARY_PATH=/data/users/$USER/pytorch/build/lib
/data/users/$USER/pytorch/build/bin/test_aoti_inference

Differential Revision: [D72589514](https://our.internmc.facebook.com/intern/diff/D72589514)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150276
Approved by: https://github.com/chenyang78, https://github.com/desertfire
2025-04-10 00:15:44 +00:00
92e81cf41a Add real_tensor to the FakeTensor in node.meta["val"] (#150948)
Summary: We need real_tensor on the FakeTensor in node.meta["val"] in order to aot_compile the draft exported programs. Otherwise, we cannot propagate real tensors even when fake_mode.propagate_real_tensors = True.

This also fixes real tensor propagation in `run_decomposition()`.

Test Plan:
```
 buck2 run @mode/dev-nosan  caffe2/test:test_export -- -r test_dedup_data_dependent_failure
```

Differential Revision: D72732714

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150948
Approved by: https://github.com/angelayi
2025-04-10 00:11:46 +00:00
91d1826539 Add dynamic version for mm_loop benchmark (#150865)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150865
Approved by: https://github.com/eellison
2025-04-09 23:37:43 +00:00
a8b48ff14c [DTensor] clean up _local_shard_size_and_offset (#150650)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150650
Approved by: https://github.com/wanchaol, https://github.com/XilunWu
ghstack dependencies: #150490
2025-04-09 22:07:48 +00:00
3532dd4f1e [DTensor] StridedShard support uneven sharding (#150490)
This enables using FSDP+TP on parameters with dimensions that aren't
evenly divisible by the DP/TP mesh sizes.

- this may not support all possible combinations of strided shardings
  and shardings, but the support before this PR is not complete anyway

This contains several fixes for different aspects of DTensor behavior
relating to uneven strided sharding:
- original creation of the strided tensor requires fixes in
  StridedShard._split_tensor
- full_tensor() reconstruction requires fixes in
  StridedShard._to_replicate_tensor to correctly reshuffle the data into
  the original pre-sharded order
- Distributed Checkpointing support requires correct computation of the
  compute_local_shape_and_global_offset util so it knows how a local
  shard maps to the global tensor, for reconstruction during
  load/reshard.

This PR also adds a util `_explicit_order_placements` which converts a list of
placements with StridedSharding into a list of placements with only
regular sharding, with the order shuffled such that it is equivalent.

Builds on and completes the work started in https://github.com/pytorch/pytorch/pull/148894

Uneven Sharding Example
-------
(copied from _StridedShard._to_replicate_tensor docstring)

mesh = (DP=2, TP=2)
original = torch.arange(5)

**Applying Sharding**

Step 1 - Apply TP sharding
`tp = distribute_tensor(x, world_mesh['tp'], [Shard(0)])`

local_tensors:
rank0: [0,1,2]    rank1: [3,4]
rank2: [0,1,2]    rank3: [3,4]

Step 2 - Apply FSDP sharding
`dp_tp = ...` (the process of creating a strided-shard tensor is skipped over as it is hacky and complicated)
dp_tp has placement (_StridedShard(0, split_factor=2), Shard(0))
local_tensors:
rank0: [0,1]  rank1: [3]
rank2: [2]    rank3: [4]

**Reconstructing the Full Tensor**
Now, say someone wants to reconstruct dp_tp's full tensor. This will invoke 'redistribute' to replicate.
redistribute will first replicate the "Shard(0)" placement on the rightmost mesh dim, then replicate the
StridedShard placement second, which is implemented by this function.
So our starting point (`local_tensor` arg) is the result of replicating the Shard(0) placement across the
TP dim, which looks like this.

Note the discrepancy with the 'tp sharded tensor' line above!  We'll fix it by locally shuffling data.

local_tensors:
rank0: [0,1,3]  rank1: [0,1,3]
rank2: [2,4]    rank3: [2,4]

Step 1: replicate over the DP dimension.  Afterwards, each rank can locally sort the values.
  note: we need padding to do this allgather, and we'll need to keep track of the padding amount for later
	local_tensors:
rank0: [0,1,3,2,4]    rank1: [0,1,3,2,4]
rank2: [0,1,3,2,4]    rank3: [0,1,3,2,4]

Step 2: chunk and shuffle values around to account for the wrong order of operations above
and get the original tensor content back

01324#       <- our allgather includes padding, if padding was applied in step 1
01324        <- Remove the padding
013, 24      <- chunk once, 'undoing' the DP allgather
01, 3, 2, 4  <- chunk each chunk, 'undoing' the initial (wrong) TP allgather performed by Shard(0)->Replicate()
012, 34      <- interleave with stride=TP mesh dim size
01234        <- concatenate

Co-authored-by: Luca Wehrstedt <lw@meta.com>
Co-authored-by: Will Constable <whc@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150490
Approved by: https://github.com/wanchaol, https://github.com/XilunWu
2025-04-09 22:07:48 +00:00
cc2decdb25 [CI][CUDA][Distributed]Update test_composability.py (#148578)
world_size = int(os.getenv("WORLD_SIZE", 4)) in subsequent lines indicates that the tests in this file do not just require > 1 GPU, but at least 4 GPUs. skip_if_lt_x_gpu(4) does not properly skip this on a platform with 2 GPUs.

skip_if_lt_x_gpu being broken is potentially related to a similar issue: https://github.com/pytorch/pytorch/issues/146094

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148578
Approved by: https://github.com/atalman
2025-04-09 21:57:05 +00:00
786422a4d7 Remove a workaround added in #149381 (#150693)
Remove a workaround added in https://github.com/pytorch/pytorch/pull/149381.

Fixes https://github.com/pytorch/xla/issues/8934

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150693
Approved by: https://github.com/albanD
2025-04-09 21:48:03 +00:00
087e8587cd support backed_size_oblivious in guard_or_false/guard_or_true (#150231)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150231
Approved by: https://github.com/pianpwk
2025-04-09 21:47:20 +00:00
31fe258efc [inductor] Add features to docstring_linter (see #142496) (#145834)
## Improvements to `docstring_linter`

* Add a "grandfather list" of existing undocumented classes and functions (`--grandfather`, `--grandfather-tolerance`, `--no-grandfather`, `--write-grandfather`)
* In classes, now just one of the class itself or its `__init__()` method needs to be documented (`--lint-init` turns the old behavior back on)
* Now classes and functions defined local to other functions do not need to be documented (`--lint-local` turns the old behavior back on)
* New `--report` flag produces a compact report of long, undocumented classes or function definitions: see attached example run over all pytorch: [pytorch-docs.json](https://github.com/user-attachments/files/18455981/pytorch-docs.json)

## Help text

```
$ python tools/linter/adapters/docstring_linter.py --help
usage: docstring_linter.py [-h] [-l] [-v] [--grandfather GRANDFATHER] [--grandfather-tolerance GRANDFATHER_TOLERANCE] [--lint-init]
                           [--lint-local] [--lint-protected] [--max-class MAX_CLASS] [--max-def MAX_DEF]
                           [--min-docstring MIN_DOCSTRING] [--no-grandfather] [--report] [--write-grandfather]
                           [files ...]

`docstring_linter` reports on long functions, methods or classes without docstrings

positional arguments:
  files                 A list of files or directories to lint

optional arguments:
  -h, --help            show this help message and exit
  -l, --lintrunner      Run for lintrunner and print LintMessages which aren't edits
  -v, --verbose         Print more debug info
  --grandfather GRANDFATHER, -g GRANDFATHER
                        Set the grandfather list
  --grandfather-tolerance GRANDFATHER_TOLERANCE, -t GRANDFATHER_TOLERANCE
                        Tolerance for grandfather sizes, in percent
  --lint-init, -i       Lint __init__ and class separately
  --lint-local, -o      Lint definitions inside other functions
  --lint-protected, -p  Lint functions, methods and classes that start with _
  --max-class MAX_CLASS, -c MAX_CLASS
                        Maximum number of lines for an undocumented class
  --max-def MAX_DEF, -d MAX_DEF
                        Maximum number of lines for an undocumented function
  --min-docstring MIN_DOCSTRING, -s MIN_DOCSTRING
                        Minimum number of characters for a docstring
  --no-grandfather, -n  Disable the grandfather list
  --report, -r          Print a report on all classes and defs
  --write-grandfather, -w
                        Rewrite the grandfather list
```

---

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145834
Approved by: https://github.com/amjames, https://github.com/eellison
2025-04-09 21:38:36 +00:00
357814c85c [AOTI] Remove typedef for half and bfloat16 (#150657)
Summary: typedef is prone to name collision. Explicitly spell out the actual aten types, needed for the libtorch-free codegen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150657
Approved by: https://github.com/malfet
2025-04-09 21:21:17 +00:00
d751698a36 Support negative values for fill with uint tensors (#144458)
Fixes https://github.com/pytorch/pytorch/issues/144188
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144458
Approved by: https://github.com/amjames, https://github.com/eellison
2025-04-09 21:08:06 +00:00
860765d621 update benchamark result due to <1% regression (#150937)
<img width="1503" alt="Screenshot 2025-04-09 at 9 07 13 AM" src="https://github.com/user-attachments/assets/e16f31b0-c5dc-4dd6-8adb-aac11ed988db" />

PR https://hud.pytorch.org/pr/148104
which is acceptable but we have to update this to avoid  flakiness in the future .

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150937
Approved by: https://github.com/zou3519
2025-04-09 20:25:48 +00:00
2b9d8a5633 Fix -Wmissing-braces in a few files (#150802)
Test Plan: Sandcastle

Reviewed By: wenxin0319

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150802
Approved by: https://github.com/Skylion007
2025-04-09 20:15:34 +00:00
ea0cbba1fc [export] Refine draft-export CVE with Dim.AUTO (#150876)
Instead of using refine_dynamic_shapes_from_suggested_fixes to fix ConstraintViolationErrors in draft-export, we can just convert the dims to Dim.AUTO, which is less error prone
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150876
Approved by: https://github.com/pianpwk
2025-04-09 19:44:30 +00:00
f237ee54bf ProcessGroupGloo: support lazy_init (#150801)
This adds lazy initialization support to ProcessGroupGloo via `TORCH_GLOO_LAZY_INIT` or via `create_device(..., lazy_init=True)`

This is still a draft PR as there's one race condition when doing coalesced operations that needs to be fixed upstream in Gloo first. Depends on https://github.com/facebookincubator/gloo/pull/427 landing first

This also updates the gloo submodule to include the required changes.
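
As a rough usage sketch (not from this PR's tests; the loopback interface and single-rank setup are just for illustration), lazy init would be requested either through the environment variable named above or through the new `create_device` argument:

```python
# Minimal sketch, assuming the TORCH_GLOO_LAZY_INIT env var / lazy_init flag described above.
import os
import torch
import torch.distributed as dist

os.environ["TORCH_GLOO_LAZY_INIT"] = "1"  # option 1: opt in globally via the env var

# option 2 (assumed signature from this PR): request lazy init per device
# device = dist.ProcessGroupGloo.create_device(interface="lo", lazy_init=True)

dist.init_process_group("gloo", init_method="tcp://127.0.0.1:29500",
                        rank=0, world_size=1)
dist.all_reduce(torch.ones(1))  # connections are established on first use
dist.destroy_process_group()
```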

Test plan:

added lazy init test variants

```
pytest -v test/distributed/test_c10d_gloo.py -k Lazy
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150801
Approved by: https://github.com/fduwjj
2025-04-09 19:29:50 +00:00
a4545f09da [Codemod][AddExplicitStrictExportForTrainingInferenceArg] caffe2/test/export (#150884)
Differential Revision: D72667175

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150884
Approved by: https://github.com/ydwu4
2025-04-09 19:18:33 +00:00
cfab04d01b Fix aten.div type promotion for FakeTensor (#150874)
Summary:
When we divide a FakeTensor by an integer using the fast op implementation, the type promotion should be `ELEMENTWISE_TYPE_PROMOTION_KIND.INT_TO_FLOAT` so we get a float when dividing an int FakeTensor by an integer.

```
FAST = get_fast_op_impls()
fast_div = FAST[torch.ops.aten.div.Tensor]
fast_div(fake_tensor, some_int)
```
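
For context, a minimal check of the expected promotion (an illustrative snippet, not the PR's test) could look like:

```python
# Sketch: dividing an int FakeTensor by an int should yield a float dtype.
import torch
from torch._subclasses.fake_tensor import FakeTensorMode

with FakeTensorMode():
    t = torch.ones(4, dtype=torch.int64)
    out = t / 2
    assert out.dtype == torch.float32
```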

Test Plan:
```
python test/test_fake_tensor.py -k test_fast_div
```

Differential Revision: D72667430

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150874
Approved by: https://github.com/angelayi
2025-04-09 18:52:01 +00:00
d3a2872c67 Hipify global scratch definition in AOTI codegen (#150893)
Summary: as titled. A refactor is sorely needed, I think, or at least we should unify the internal/external AOTI wrapper hipification methods.

Test Plan: P1780296121

Differential Revision: D72683568

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150893
Approved by: https://github.com/davidberard98
2025-04-09 18:35:36 +00:00
d04a6ec021 add reduce_scatter to symm mem ops (#150813)
+ a few small fixes (don't error out on 0-element tensors, a few more checks for contiguous outputs, more threads for better perf).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150813
Approved by: https://github.com/xw285cornell
2025-04-09 17:59:17 +00:00
cc185c32e0 [aoti] Use generate_fake_kernels_from_real_mismatches config for draft exported programs (#150651)
Summary:
Sometimes we get `MetadataMismatchError` in aoti compilation because draft export uses the flag below to infer the fake kernel when there’s a mismatch, but aoti doesn’t have this flag turned on.

https://fburl.com/code/9qzytl6q
 torch._functorch.config.generate_fake_kernels_from_real_mismatches

If we set this flag to True, then aoti compilation would work.
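
A minimal sketch of that workaround (the toy module and the `aoti_compile_and_package` entry point are assumptions for illustration, not part of this diff):

```python
# Sketch: enable the same config draft-export uses, then run AOTI compilation.
import torch
import torch._functorch.config as functorch_config
from torch._inductor import aoti_compile_and_package

functorch_config.generate_fake_kernels_from_real_mismatches = True

class M(torch.nn.Module):
    def forward(self, x):
        return x + 1

ep = torch.export.export(M(), (torch.randn(2),))
package_path = aoti_compile_and_package(ep)  # assumed public entry point
```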

Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:test_aot_inductor -- -r aoti_runtime_asserts
```

Differential Revision: D72345085

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150651
Approved by: https://github.com/angelayi
2025-04-09 17:28:29 +00:00
6fb089f2a2 [AO] fix per token block size calculation (#150890)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150890
Approved by: https://github.com/jerryzh168
2025-04-09 17:07:31 +00:00
c59aaa03ff [DTensor] add _explicit_order_placements util (#150493)
The util converts a list of placements in the traditional DTensor format
(e.g. [_StridedShard(0), Shard(0)], where list position is mesh_dim and sharding
is always applied left-to-right, from dim 0 to higher dims)

to a more explicitly ordered format, also replacing '_StridedShard' with
simple 'Shard' placements in the process
(e.g. the above becomes [(1, Shard(0)), (0, Shard(0))], where the first
item in each tuple is the mesh_dim and the ordering of the tuples is the
sharding order).

This is useful so far as a helper for fixing local shape computation for
strided sharding in the uneven shape case, in the following PR- but may
also be useful more broadly if we can use explicit orderings to simplify
other parts of DTensor logic.

This skips implementing some combinations of _StridedSharding that are
not currently used in the wild today, but could be supported easily.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150493
Approved by: https://github.com/wanchaol, https://github.com/XilunWu
2025-04-09 16:55:24 +00:00
01568cb17a Revert "Refactor layout constraint selection logic (#148104)"
This reverts commit 2e7c9d33e7f933ac3b723cb3bb05b9c88432c25c.

Reverted https://github.com/pytorch/pytorch/pull/148104 on behalf of https://github.com/atalman due to [GH job link](https://github.com/pytorch/pytorch/actions/runs/14357056427/job/40251630946) [HUD commit link](2e7c9d33e7) ([comment](https://github.com/pytorch/pytorch/pull/148104#issuecomment-2790369493))
2025-04-09 16:49:48 +00:00
a0e796df03 Revert "Inductor respects exact strides on custom ops by default (#150511)"
This reverts commit a4bb2f106f8cc642539d4698b6d869a87adca92f.

Reverted https://github.com/pytorch/pytorch/pull/150511 on behalf of https://github.com/atalman due to [GH job link](https://github.com/pytorch/pytorch/actions/runs/14357056427/job/40251630946) [HUD commit link](2e7c9d33e7) ([comment](https://github.com/pytorch/pytorch/pull/148104#issuecomment-2790369493))
2025-04-09 16:49:48 +00:00
a4bb2f106f Inductor respects exact strides on custom ops by default (#150511)
If a tag is not specified on a custom operator, then inductor will
assume that it needs exact strides.
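
A hedged sketch of what that default means for a user-defined op (the op itself is made up for illustration; no layout tag is set, so inductor should feed it inputs with their exact strides):

```python
# Sketch: a custom op with no layout tag -> inductor assumes exact strides are needed.
import torch

@torch.library.custom_op("mylib::scale", mutates_args=())
def scale(x: torch.Tensor, factor: float) -> torch.Tensor:
    return x * factor

@scale.register_fake
def _(x, factor):
    return torch.empty_like(x)

@torch.compile
def f(x):
    # the transposed (non-contiguous) input should reach the op with these strides
    return scale(x.t(), 2.0)

print(f(torch.randn(4, 4)))
```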

Test Plan:
- tests + CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150511
Approved by: https://github.com/eellison, https://github.com/shunting314
ghstack dependencies: #150495, #148104
2025-04-09 16:46:48 +00:00
c714d2fc0e [hop] support base_hop._gen_schema (#149688)
This PR creates two utils for generating a schema for hops from example inputs and uses base hop as an example.
1. HopArgumentInfoGen creates an argument or an output schema with mutation information.
2. CFuncitonSchemaGen pieces together the argument info of inputs and outputs and produces torch._C.FunctionSchema.

The is_write attribute of the argument info can be computed. Note that the is_write annotation only works when the inputs are flattened (e.g. it cannot support mutation inside a tuple). We need special handling for the case where we have tuple inputs, like cond.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149688
Approved by: https://github.com/zou3519
2025-04-09 16:42:55 +00:00
72755a4b7a Avoid circular imports in tracing_state_functions (#150325)
tracing_state_functions references some torch functions from submodules like `torch.onnx.is_in_onnx_export` that could trigger module initialization & circular imports. I turned the mapping into a function so that the dictionary is not initialized at torch import.

(discovered in https://github.com/pytorch/pytorch/pull/149646)
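
A minimal sketch of the pattern (the names below are placeholders, not the actual `tracing_state_functions` contents):

```python
# Sketch: build the mapping lazily inside a function instead of at module import,
# so torch.onnx and friends are only resolved when the mapping is actually needed.
import functools

@functools.lru_cache(maxsize=1)
def _tracing_state_functions():
    import torch  # resolved lazily to avoid circular imports at module load time
    return {
        torch.onnx.is_in_onnx_export: False,
        torch.jit.is_scripting: False,
    }
```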

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150325
Approved by: https://github.com/zou3519
2025-04-09 16:32:11 +00:00
8aaf296efc [c10d][fr] Refactor analysis script for modularization and reusing for coalesce collectives (#150881)
Trying to make the code of FR analysis more reusable and modularized. So we split core error analysis logic into separate functions.

This PR mostly shuffles the code around a bit.

Differential Revision: [D72690120](https://our.internmc.facebook.com/intern/diff/D72690120)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150881
Approved by: https://github.com/wz337
2025-04-09 16:10:19 +00:00
c8d37b9c85 [ez][c10d] Disable start event recording for coalesced col and improve profile title (#150863)
While looking at enabling FR analysis for coalesced collectives, I found that for the slow-path coalescing (collectives which are not all-gather, all-reduce or reduce-scatter), we still record a start event for them. This is wrong and we should do the same thing as end-event recording.

And I made the profiler title more visible when we pass in the opType for coalesced all-gather and reduce-scatter.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150863
Approved by: https://github.com/eqy, https://github.com/d4l3k, https://github.com/kwen2501
2025-04-09 16:09:56 +00:00
1a56609e75 [ONNX] Supporting different opset versions for torchlib registry (#149901)
- Allows opset_version to determine which onnx decomposition to choose
- Adds a cleanup function to modify the registry after it is built

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149901
Approved by: https://github.com/justinchuby, https://github.com/titaiwangms
2025-04-09 16:03:46 +00:00
97a5e5c6b3 Added _fused_sdp_choice_stub dispatcher support for HPU device (#149512)
Currently the HPU device has no support for the `_fused_sdp_choice_stub` dispatcher function, so `scaled_dot_product_attention` selects the `MATH` backend by default via `_fused_sdp_choice_stub` on HPU. With this PR we enable support for the `_fused_sdp_choice_stub` dispatcher function, so that any backend (for example math, flash_attention, efficient_attention, cudnn_attention, overrideable) can be invoked according to the user's choice on the HPU device.
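
For reference, an explicit backend selection on the user side might look like the sketch below (illustrative; shown with generic tensors, while this PR wires the equivalent choice into the HPU dispatcher):

```python
# Sketch: constrain scaled_dot_product_attention to a specific backend.
import torch
from torch.nn.attention import sdpa_kernel, SDPBackend

q, k, v = (torch.randn(2, 4, 8, 16) for _ in range(3))
with sdpa_kernel([SDPBackend.MATH]):
    out = torch.nn.functional.scaled_dot_product_attention(q, k, v)
```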
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149512
Approved by: https://github.com/drisspg
2025-04-09 15:48:09 +00:00
d0e3482266 Update triton wheel build, setuptools pin (#150931)
Observing failure in release workflow:
https://github.com/pytorch/pytorch/actions/runs/14346340202/job/40216804374

```
Traceback (most recent call last):
  File "/opt/python/cp311-cp311/lib/python3.11/site-packages/wheel/bdist_wheel.py", line 11, in <module>
    from setuptools.command.bdist_wheel import bdist_wheel as bdist_wheel
ModuleNotFoundError: No module named 'setuptools.command.bdist_wheel'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/tmp/tmppwpqef_x/triton/python/setup.py", line 27, in <module>
    from wheel.bdist_wheel import bdist_wheel
  File "/opt/python/cp311-cp311/lib/python3.11/site-packages/wheel/bdist_wheel.py", line 13, in <module>
    raise ImportError(ERROR) from exc
ImportError: The 'wheel.bdist_wheel' module has been removed.
Please update your setuptools to v70.1 or later.
If you're explicitly importing 'wheel.bdist_wheel', please update your import to point to 'setuptools.command.bdist_wheel' instead.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150931
Approved by: https://github.com/Skylion007
2025-04-09 15:26:07 +00:00
5a422150c3 Add torch.triu_indices, torch.tril_indices dtype description (#150749)
Fixes #150675

## Test Result

![image](https://github.com/user-attachments/assets/f30a0de0-6475-4d07-b441-15fffd453ba1)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150749
Approved by: https://github.com/bdhirsh
2025-04-09 15:03:24 +00:00
246f3b6530 [Quant][PT2E][X86] enable qconv1d-relu fusion (#150751)
**Summary**
As the title.
- The `conv1d - relu` pattern will be annotated by the `X86InductorQuantizer`.
- The pattern will be fused as `qconv_pointwise` during lowering.

**Test plan**
```
python test/inductor/test_mkldnn_pattern_matcher.py -k test_qconv1d_relu_cpu
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150751
Approved by: https://github.com/jerryzh168, https://github.com/leslie-fang-intel
2025-04-09 14:42:02 +00:00
2299087220 [ROCm] Introduce AMD specific inductor gemm tuning (#147315)
Replaces https://github.com/pytorch/pytorch/pull/143286

Adds ROCm-specific MM configs for max-autotune, incorporating ROCm-specific Triton tuning kernel args such as waves_per_eu, kpack and matrix_instr_nonkdim. This PR also introduces behavior to allow tuning GROUP_M in the Triton gemm case.

Dynamo huggingface inference benchmarks:
`TORCHINDUCTOR_MAX_AUTOTUNE=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM_BACKENDS="TRITON" python huggingface.py --performance --inference --bfloat16 --backend=inductor`

GEOMEAN speedup (before): 1.35x
GEOMEAN speedup (after): 1.42x

name | Eager - abs latency | old - abs_latency | old - speedup | new - abs_latency | new - speedup
-- | -- | -- | -- | -- | --
AlbertForMaskedLM | 26.22 | 26.52 | 98.86% | 24.58 | 106.67%
AlbertForQuestionAnswering | 25.96 | 26.40 | 98.33% | 24.10 | 107.73%
AllenaiLongformerBase | 21.03 | 10.65 | 197.50% | 10.49 | 200.58%
BartForCausalLM | 7.77 | 9.76 | 79.63% | 8.79 | 88.46%
BartForConditionalGeneration | 14.44 | 12.86 | 112.26% | 11.96 | 120.70%
BertForMaskedLM | 8.10 | 8.82 | 91.89% | 8.57 | 94.53%
BertForQuestionAnswering | 6.82 | 7.32 | 93.20% | 7.10 | 96.18%
BlenderbotForCausalLM | 10.97 | 11.39 | 96.34% | 10.10 | 108.65%
BlenderbotSmallForCausalLM | 5.91 | 5.44 | 108.72% | 4.82 | 122.67%
BlenderbotSmallForConditionalGeneration | 12.64 | 9.65 | 130.94% | 9.11 | 138.83%
CamemBert | 8.35 | 9.15 | 91.24% | 8.86 | 94.27%
DebertaForMaskedLM | 10.92 | 6.09 | 179.44% | 5.90 | 185.05%
DebertaForQuestionAnswering | 14.29 | 7.70 | 185.59% | 7.26 | 196.75%
DebertaV2ForMaskedLM | 15.47 | 10.22 | 151.32% | 9.34 | 165.55%
DebertaV2ForQuestionAnswering | 14.98 | 6.11 | 245.28% | 6.28 | 238.40%
DistilBertForMaskedLM | 8.37 | 8.70 | 96.30% | 8.22 | 101.92%
DistilBertForQuestionAnswering | 10.21 | 10.54 | 96.88% | 10.39 | 98.36%
DistillGPT2 | 8.77 | 6.78 | 129.40% | 6.31 | 138.88%
ElectraForCausalLM | 10.32 | 4.70 | 219.45% | 4.60 | 224.29%
ElectraForQuestionAnswering | 11.48 | 5.62 | 204.20% | 5.44 | 210.95%
GPT2ForSequenceClassification | 6.21 | 5.72 | 108.50% | 5.58 | 111.26%
GoogleFnet | 26.51 | 20.81 | 127.37% | 19.91 | 133.11%
LayoutLMForMaskedLM | 12.09 | 7.99 | 151.28% | 7.66 | 157.80%
LayoutLMForSequenceClassification | 10.62 | 6.49 | 163.67% | 6.25 | 169.95%
M2M100ForConditionalGeneration | 14.98 | 10.20 | 146.79% | 9.89 | 151.42%
MBartForCausalLM | 7.67 | 9.78 | 78.44% | 8.87 | 86.55%
MBartForConditionalGeneration | 13.45 | 12.69 | 105.99% | 12.03 | 111.82%
MT5ForConditionalGeneration | 19.96 | 5.32 | 375.37% | 5.08 | 393.01%
MegatronBertForCausalLM | 13.22 | 7.86 | 168.07% | 7.18 | 184.01%
MegatronBertForQuestionAnswering | 15.62 | 11.81 | 132.21% | 11.02 | 141.68%
MobileBertForMaskedLM | 26.63 | 10.82 | 245.99% | 11.95 | 222.73%
MobileBertForQuestionAnswering | 23.53 | 7.55 | 311.51% | 9.53 | 247.03%
OPTForCausalLM | 7.33 | 7.64 | 95.93% | 7.56 | 96.90%
PLBartForCausalLM | 8.73 | 7.63 | 114.40% | 7.37 | 118.58%
PLBartForConditionalGeneration | 10.46 | 8.50 | 122.98% | 8.16 | 128.13%
PegasusForCausalLM | 7.18 | 7.37 | 97.42% | 6.64 | 108.22%
PegasusForConditionalGeneration | 16.47 | 16.66 | 98.87% | 14.18 | 116.13%
RobertaForCausalLM | 10.30 | 9.95 | 103.52% | 9.52 | 108.25%
RobertaForQuestionAnswering | 6.37 | 7.13 | 89.28% | 6.79 | 93.87%
T5ForConditionalGeneration | 12.40 | 6.72 | 184.51% | 6.48 | 191.16%
T5Small | 12.02 | 6.66 | 180.55% | 6.32 | 190.33%
TrOCRForCausalLM | 14.12 | 13.31 | 106.11% | 12.45 | 113.41%
XGLMForCausalLM | 16.48 | 6.23 | 264.52% | 6.35 | 259.51%
XLNetLMHeadModel | 74.87 | 62.23 | 120.32% | 57.95 | 129.19%
YituTechConvBert | 20.21 | 10.50 | 192.48% | 9.97 | 202.72%

We are also seeing improvement ~9% on internal addmm benchmark

This PR will also slightly reduce the compilation time of AMD max-autotune: before this change we assessed every config with matrix_instr_nonkdim in [0, 16], but we now remove that sweep and use 16 for all configs.

There is currently no CI to test the max-autotune perf, but this will be enabled via https://github.com/pytorch/pytorch/pull/148672, after which we can investigate more tuning updates and config pruning.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147315
Approved by: https://github.com/jansel, https://github.com/eellison
2025-04-09 14:34:30 +00:00
886d9acb0d [docs] Add 32-bit complex to the list of dtypes (#144590)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144590
Approved by: https://github.com/janeyx99
2025-04-09 13:10:21 +00:00
64ac41f68d [pytorch] add header docs for TORCH_LIBRARY_THREAD_UNSAFE_LAZY_INIT (#150854)
Summary: Add header docs for the experimental TORCH_LIBRARY_THREAD_UNSAFE_LAZY_INIT feature, and guard behind C10_MOBILE.

Reviewed By: albanD

Differential Revision: D72572345

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150854
Approved by: https://github.com/larryliu0820, https://github.com/zou3519
2025-04-09 12:59:24 +00:00
142f0f86ce Enable modernize-use-default-member-init (#149046)
``modernize-use-default-member-init`` prefers default member initialisation in the class body, which makes more ``= default`` constructors possible. Some violations of modernize rules have also been fixed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149046
Approved by: https://github.com/zou3519
2025-04-09 11:57:24 +00:00
81f60f3880 Expand allowed_getattr_types_for_subgm to torch.Tensor (#150867)
Summary:
as titled.

A regular weight has the type torch.nn.parameter.Parameter, while buffers and tensor constants have the type torch.Tensor.

Both types are valid.

Test Plan: CI

Differential Revision: D72657275

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150867
Approved by: https://github.com/zhxchen17
2025-04-09 11:01:45 +00:00
604467de20 Code Clean: Remove specific bytecode support in dynamo for python3.8 (#150838)
Related Bytecode:
- CALL_FINALLY
- END_FINALLY
- POP_FINALLY

The bytecodes above were removed before Python 3.9; refer to [this](53908bd790/Misc/NEWS.d/3.9.0a2.rst) for more info.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150838
Approved by: https://github.com/Skylion007, https://github.com/jansel
ghstack dependencies: #150834
2025-04-09 07:16:52 +00:00
b01877aa13 Fix addbmm & addmv & baddbmm out dtype check (#148176)
----

- torch.addbmm
- torch.addmv
- torch.baddbmm

ISSUE related:
https://github.com/pytorch/pytorch/issues/138399
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148176
Approved by: https://github.com/jansel
ghstack dependencies: #148174
2025-04-09 07:02:56 +00:00
4d6ff6ca5c Fill config2launcher with correct launchers during cache hit coordinate descent (#150860)
This bug was crazy hard to reproduce, so I can't seem to get a unit test written that isn't the internal one I used for debugging.

Here's a short TLDR of the bug:

- Due to D71983456(OSS: https://github.com/pytorch/pytorch/pull/149910), we cache CachingAutotuners in memory.
- Importantly: **Saving stuff in PyCodeCache in memory is not semantically equivalent to writing to disk**. By saving it in memory, CachingAutotuners do not reset global state.
- It's possible through recompiles for different dynamo frames to compile down to exactly the same inductor output code. This involves models that run multiple times, but differ very subtly, or in ways that cause a dynamo guard failure but not different inductor output code.
- Because of this, we reuse CachingAutotuners for a second compile (with different example inputs, just the same triton kernel code)
- CachingAutotuners have a Coordinate Descent class on them, which has a cache: https://fburl.com/code/4igrsams (OSS: aafc4b6188/torch/_inductor/runtime/coordinate_descent_tuner.py (L69))
- Because we are caching these in memory and not on disk, this cache is **not cleared** between runs.
- However, this variable is *not* saved on the class, and is reinitialized every time we do autotuning: https://fburl.com/code/n2o8tmje
(OSS: aafc4b6188/torch/_inductor/runtime/triton_heuristics.py (L933))
- `config2launcher` is added when we call `benchmark_one_config`, but on a CoorDesc *cache hit*, we never call `benchmark_one_config`! So we end up returning None, and erroring with:

```
AttributeError: 'NoneType' object has no attribute 'store_cubin'
```

This fixes the problem for now by just recompiling the launcher. Technically, we might be able to save config2launcher on the class to avoid this, but I don't want to risk another weird cache safety bug here, so taking the simpler approach for now.

Note that this error only reproduces if:
- None of AOTAutogradCache, FXgraphCache hit on the second entry: otherwise, the CachingAutotuner will go through a pickling and then not be saved in memory
- We haven't spawned parallel compile workers. If there are parallel compile workers, we pickle the autotuner on the way from the worker to the parent process, once again resetting the Autotuner.
- The autotune cache doesn't already have the best config stored in it

So it was extraordinarily hard to debug/reproduce. Because of this, I have a complicated internal unit test but no OSS test that can trigger the exact problem. I'll work on a separate test later, but this needs to go in to fix a sev, so we're landing it based on an internal test only.

Differential Revision: [D72655382](https://our.internmc.facebook.com/intern/diff/D72655382/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D72655382/)!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150860
Approved by: https://github.com/oulgen
2025-04-09 04:39:37 +00:00
bc47d539fc [MPS] Support ArgumentBuffer bindings from C++/Python (#150780)
To workaround limitation of 32-arguments per kernel and being able to eventually compile something like
```python
import torch

def foo(*args):
  rc = torch.empty_like(args[0])
  for arg in args:
      rc += arg
  return rc

tensors = torch.rand(100, 32, device='mps').unbind(0)
print(torch.compile(foo)(*tensors))
```

For now, introduce `at::native::metal::get_tensor_gpu_address` and use it from both the C++ test and compile_shader to convert a list of tensors to a list of pointers valid on the GPU.

Initially this binding was done via `id<MTLArgumentEncoder>`, but according to the [Improving CPU Performance by Using Argument Buffers](https://developer.apple.com/documentation/metal/improving-cpu-performance-by-using-argument-buffers?language=objc#Encode-Resources-into-Argument-Buffers) article, this is not necessary when targeting Tier 2-only devices (which is true of all devices on macOS 13 or newer):
> To directly encode the argument buffer resources on these Tier 2 devices, write the [MTLBuffer](https://developer.apple.com/documentation/metal/mtlbuffer?language=objc).[gpuAddress](https://developer.apple.com/documentation/metal/mtlbuffer/gpuaddress?language=objc) property — and for other resource types (samplers, textures, and acceleration structures), the [gpuResourceID](https://developer.apple.com/documentation/metal/mtlcomputepipelinestate/gpuresourceid?language=objc) property — into the corresponding structure member. To encode offsets, treat these property values as uint64 types and add the offset to them.

Add both C++ and Python unit tests that validate that this works.
Please note that using either ArgumentEncoder or directly encoding the data does not guarantee that the buffer will not be freed until shader execution is complete. On the other hand, this should already be guaranteed by MPSCachingAllocator, which only frees the memory after all streams have completed their execution.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150780
Approved by: https://github.com/dcci
2025-04-09 04:24:37 +00:00
2e7c9d33e7 Refactor layout constraint selection logic (#148104)
This PR:

- cleans up some existing comments that don't make sense anymore
- hooks the "custom_op_default_layout_constraint" back up (it seems to
have broken)
- cleans up the "lazy registration path", which never seems to get hit
anymore
- adds dislike_padding to nodes that require exact strides

Test Plan:
- tests + CI

disable padding

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148104
Approved by: https://github.com/shunting314, https://github.com/eellison
ghstack dependencies: #150495
2025-04-09 02:09:18 +00:00
44deb67830 Fix _del_library (#150495)
On library deletion, we need to clear fx's schema cache.

Test Plan:
- top PR in the stack, I don't have a good test case for this PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150495
Approved by: https://github.com/eellison
2025-04-09 02:09:18 +00:00
5f18b7d877 [docs] remove --recursive flag from readme (#150785)
Fixes #150745

See https://github.com/pytorch/pytorch/issues/150745#issuecomment-2784216663

Cloning with `--recursive` as shown in the docs prevents users from checking out commits from before NCCL was removed as a submodule.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150785
Approved by: https://github.com/atalman
2025-04-09 02:07:48 +00:00
d9f47c75de Revert "Fixing NCCL abort hang issue when a ProcessGroupNCCL manages multiple ncclComms (#150690)"
This reverts commit 91173ff89aab5f632d483c736d11d5dcf60decac.

Reverted https://github.com/pytorch/pytorch/pull/150690 on behalf of https://github.com/atalman due to failing internal test ([comment](https://github.com/pytorch/pytorch/pull/150690#issuecomment-2787905966))
2025-04-09 00:06:32 +00:00
27ded359a5 Fix inplacing with multiple, fused uses (#150845)
We had `can_inplace` defined on a single use. When that buffer has multiple uses inside a fused node, we need to check if the other accesses have the same index. Otherwise we may read memory that has already been written to from inplacing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150845
Approved by: https://github.com/zou3519, https://github.com/exclamaforte, https://github.com/atalman, https://github.com/jansel
2025-04-09 00:05:07 +00:00
89505f4498 [AOTI] Always use oss schema for ExternKernelNodes serialization (#150197)
Summary: Added a field `protocol` to `ExternKernelNodes`; all the lowering passes will always use the OSS schema to serialize external kernel nodes from now on.

Test Plan: CI

Differential Revision: D72020444

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150197
Approved by: https://github.com/zhxchen17
2025-04-08 22:35:28 +00:00
17f9276e29 Code Clean: Remove Python 3.8-specific code because PyTorch now needs Python 3.9 or later (#150834)
As the title stated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150834
Approved by: https://github.com/Skylion007, https://github.com/albanD
2025-04-08 18:53:55 +00:00
901b02cf16 [Inductor] fix alignment assumption for fallback (#150777)
Inductor right now only works properly for fallback kernels that produce aligned output.
When Inductor creates the layout for a fallback kernel's output, it does not add the tensor offset to the layout [link](2a1e2b88ed/torch/_inductor/ir.py (L6935-L6941)). Thus unaligned output is treated as aligned. Adding the offset to the layout directly does not work since that changes the index expression in the generated kernel and we may apply the offset twice. Triton already considers the offset when passing in the data_ptr.

To solve this issue, we track the unaligned buffer names instead.

This potentially can fix the internal issues we are debugging here: https://fb.workplace.com/groups/1075192433118967/permalink/1618308128807392/

Differential Revision: [D72600784](https://our.internmc.facebook.com/intern/diff/D72600784)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150777
Approved by: https://github.com/eellison, https://github.com/jansel
2025-04-08 18:49:44 +00:00
c36d9b0d8d [Codemod][AddExplicitStrictExportForTrainingInferenceArg] caffe2/torch/ao (#150826)
Differential Revision: D72615631

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150826
Approved by: https://github.com/ydwu4
2025-04-08 18:49:22 +00:00
aafc4b6188 Do not depend on numpy during the import (#150816)
Summary:
Related issue: https://github.com/pytorch/pytorch/issues/149681

We can follow up with a different implementation that does not use numpy (potentially with Torch primitives).
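
The general shape of the fix is to defer the numpy import into the code path that needs it (a sketch; the helper name is a placeholder):

```python
# Sketch: numpy is imported only when the helper runs, so `import torch` no
# longer pulls numpy in at import time.
def _compute_histogram(values, bins=10):
    import numpy as np  # deferred import
    return np.histogram(values, bins=bins)
```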

Test Plan:
pending:

contbuild & OSS CI

Differential Revision: D72609835

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150816
Approved by: https://github.com/jerryzh168, https://github.com/cyyever, https://github.com/albanD
2025-04-08 18:12:53 +00:00
e6bd133866 add batching rule for torch.Tensor.scatter_add_ (#150543)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150543
Approved by: https://github.com/zou3519
2025-04-08 18:00:10 +00:00
97759614c2 [dynamo] reconstruct functions decorated in the compiled region properly (#150645)
We were previously unable to reconstruct functions that were decorated in the compiled region.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150645
Approved by: https://github.com/jansel
2025-04-08 17:32:46 +00:00
4926bd6004 Revert "Fix the Problems About Defining Static Variable in Inline Function (#147095)"
This reverts commit 3da14d38bd396f5bbe8494872d1509efa1a6f048.

Reverted https://github.com/pytorch/pytorch/pull/147095 on behalf of https://github.com/atalman due to breaks internally ([comment](https://github.com/pytorch/pytorch/pull/147095#issuecomment-2787129770))
2025-04-08 17:10:36 +00:00
3e0038ae85 Fix torch.matmul related out dtype check (#148174)
----

- torch.matmul -> CompositeImplicitAutograd -> dot_out (when left_dim == 1 & right_dim == 1)
                                            -> mv_out (when left_dim == 2 & right_dim == 1)
                                            -> mm_out (when left_dim == 1 & right_dim == 2)
                                            -> ...
- torch.dot
- torch.vdot
- torch.mm
- torch.mv

ISSUE related:
https://github.com/pytorch/pytorch/issues/138399
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148174
Approved by: https://github.com/jansel
2025-04-08 17:00:28 +00:00
173f126068 [invoke_subgraph] Preserve node meta (#150782)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150782
Approved by: https://github.com/bdhirsh
ghstack dependencies: #150666
2025-04-08 16:57:39 +00:00
4447352e64 Revert "[CUDA] Only use vec128 if CUDA version is newer than 12.8 (#150705)"
This reverts commit 5228986c395dc79f90d2a2b991deea1eef188260.

Reverted https://github.com/pytorch/pytorch/pull/150705 on behalf of https://github.com/atalman due to break periodic tests ([comment](https://github.com/pytorch/pytorch/pull/150705#issuecomment-2787017751))
2025-04-08 16:29:05 +00:00
97f34f0125 [ROCm][Windows] Include AOTriton dependent sources in Windows build (#150521)
Includes ATen native transformers hipified sources in the ROCm+Windows build. These were removed because Triton is not available on Windows, but that causes further linker errors. Setting `USE_FLASH_ATTENTION=0` and `USE_MEM_EFF_ATTENTION=0` during the build mitigates the missing headers without causing any linker errors, so we will use this approach for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150521
Approved by: https://github.com/jeffdaily
2025-04-08 16:18:15 +00:00
1239260a0e [Accelerator][Chore] Use existing acc when raising an error (#150829)
As the title said, `acc` already exists so we just use it instead of calling `current_accelerator()` again.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150829
Approved by: https://github.com/guangyey, https://github.com/Skylion007
2025-04-08 16:05:06 +00:00
ec5f2e3028 [Build] Fix fbgemm build with gcc-12+ (#150847)
By suppressing more warnings

TODO: fbgemm pin really needs to get updated

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150847
Approved by: https://github.com/atalman, https://github.com/Skylion007
2025-04-08 16:03:40 +00:00
52d172eafd Facilitate at::_weight_int4pack_mm_with_scale_and_zeros related registration (#147962)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147962
Approved by: https://github.com/jerryzh168, https://github.com/guangyey, https://github.com/EikanWang
ghstack dependencies: #137566

Co-authored-by: xiaolil1 <xiaoli.liu@intel.com>
2025-04-08 15:36:07 +00:00
da7322548b [Intel GPU] int4 WOQ gemm XPU Support (#137566)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137566
Approved by: https://github.com/liangan1, https://github.com/guangyey, https://github.com/EikanWang

Co-authored-by: xiaolil1 <xiaoli.liu@intel.com>
2025-04-08 15:36:06 +00:00
05365e380d Remove torch functions that do not support device arguments from _device_constructor (#150290)
As the title stated

In addition:
- I have checked all the functions in _device_constructor and found that ``torch.vander`` also doesn't support a device argument (see the snippet below)
- Removed duplicated functions such as torch.ones and torch.asarray
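
Sketch of the difference (illustrative only):

```python
# torch.ones accepts a device argument, torch.vander does not.
import torch

torch.ones(3, device="cpu")                      # ok: device kwarg supported
torch.vander(torch.arange(3))                    # ok: device follows the input tensor
# torch.vander(torch.arange(3), device="cpu")    # TypeError: unexpected keyword argument
```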

Related issue:https://github.com/pytorch/pytorch/issues/150284
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150290
Approved by: https://github.com/albanD
2025-04-08 15:13:55 +00:00
a402c2f203 Remove redundant code in cuda/__init__.py (#150529)
As the title stated.

Follow: https://github.com/pytorch/pytorch/pull/147078
Fix issue: https://github.com/pytorch/pytorch/issues/150519
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150529
Approved by: https://github.com/eqy
2025-04-08 15:03:21 +00:00
ad516180e0 Update CPython tests for ctx manager to use unittest (#146501)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146501
Approved by: https://github.com/zou3519
ghstack dependencies: #146500
2025-04-08 14:55:17 +00:00
f3b2fb6c66 Allow trace through unittest (#146500)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146500
Approved by: https://github.com/anijain2305
2025-04-08 14:55:17 +00:00
1791b4150b Clarify behavior of TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK (#150682)
I still don't really understand the original purpose of that env var, but it appears that its usage is completely disconnected from MemPools and from `ncclMemAlloc`/`Free`. In fact, when that env var is set, we invoke `ncclCommRegister` for _all_ NCCL communicators for _all_ the memory segments managed by the allocator (both the global ones, allocated with `cudaMalloc`, and the ones in private MemPools), and we do that both for the segments that already exist when the PG is initialized and for all segments that will be allocated later.

I'm reworking the code a bit, by using a few helper functions, whose names should make this behavior clearer.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150682
Approved by: https://github.com/kwen2501
ghstack dependencies: #150681
2025-04-08 13:00:59 +00:00
3649e2e7bd Safer bookkeeping of NCCL communicators (#150681)
This consists mainly of two changes:
- ensure we can reliably obtain the device from a `NCCLComm` object (there was one constructor which didn't set the device)
- use a RAII pattern for acquiring the lock to the global dictionary of `NCCLComms` (which ensures the lock is released in case of exceptions)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150681
Approved by: https://github.com/kwen2501
2025-04-08 11:12:37 +00:00
3da14d38bd Fix the Problems About Defining Static Variable in Inline Function (#147095)
Refer to https://github.com/pytorch/pytorch/issues/125465 for more informations

- Remove unused header files
- Move the inline function that defines the static variable to .cc

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147095
Approved by: https://github.com/cyyever, https://github.com/albanD
2025-04-08 10:23:02 +00:00
881d99495d Add more check for torch.ormqr (#150759)
As the title stated.

Please refer to https://github.com/pytorch/pytorch/issues/150674 for more info.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150759
Approved by: https://github.com/lezcano
2025-04-08 08:26:05 +00:00
a106842ea8 [XPU] Fix XPU unit test on Windows (#150520)
This PR is to resolve issue reported in https://github.com/intel/torch-xpu-ops/issues/1478

There are two cases failing in our Windows CI enabling.

- **test_xpu.py::TestXpuXPU::test_lazy_init_xpu** needs `if __name__ == '__main__':` added for Windows when using multiprocessing. Refer to https://stackoverflow.com/a/18205006
```
RuntimeError:
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.
Traceback (most recent call last):
  File "C:\Users\sdp\lufengqing\torch-xpu-ops\test\xpu\xpu_test_utils.py", line 24, in <module>
    test_multi_process(model, input)
  File "C:\Users\sdp\lufengqing\torch-xpu-ops\test\xpu\xpu_test_utils.py", line 16, in test_multi_process
    assert p.exitcode == 0
AssertionError
```

- **test_xpu.py::TestXpuXPU::test_wrong_xpu_fork_xpu** is a Linux-only test case, so we should skip it on Windows. Refer to 248487f455/test/test_multiprocessing.py (L609)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150520
Approved by: https://github.com/guangyey, https://github.com/EikanWang
2025-04-08 07:02:40 +00:00
58ede0cca3 [Inductor XPU] Refine test_mkldnn_pattern_matcher.py to be reusable for XPU. (#150286)
This PR extracts some test cases from TestPatternMatcher into a newly created TestPatternMatcherGeneric, and uses instantiate_device_type_tests to make them reusable across multiple devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150286
Approved by: https://github.com/jansel
2025-04-08 05:42:44 +00:00
f8aa6404ac Refactor: add initialization of math.lcm into torch_c_binding_in_graph_functions (#150766)
As the title stated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150766
Approved by: https://github.com/aorenste, https://github.com/jansel
2025-04-08 04:12:26 +00:00
c9c0f8eae3 Add plot for torch.nn.Threshold and torch.nn.GLU (#150171)
Fixes #150170

## Changes

- Add plot for `torch.nn.Threshold` and `torch.nn.GLU`
- Add example outputs to make it easier for users to reproduce the results (see the usage sketch below)
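
A minimal usage sketch of the two modules (values are arbitrary, just to mirror the added example output):

```python
import torch
import torch.nn as nn

x = torch.linspace(-1.0, 1.0, steps=5)
print(nn.Threshold(threshold=0.1, value=20.0)(x))  # entries <= 0.1 are replaced by 20.0
print(nn.GLU(dim=-1)(torch.randn(2, 8)).shape)     # torch.Size([2, 4]): last dim is halved
```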

## Test Result

![image](https://github.com/user-attachments/assets/f6c5bc46-f9b7-4db7-9797-e08d8423d1b3)

![image](https://github.com/user-attachments/assets/ad4e6c84-7b29-44f1-b7bd-9c81e4a92ef8)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150171
Approved by: https://github.com/albanD
2025-04-08 03:55:37 +00:00
7e11089fe5 Optimize dataloader Self typing (#146816)
Optimize `dataloader.py` method return type with Self typing
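
A small sketch of the Self-typing pattern (the class and method are placeholders, not the actual dataloader internals):

```python
# Sketch: annotating the return type as Self keeps the subclass type for callers.
from typing import Self  # Python 3.11+; older versions can use typing_extensions.Self

class _BaseIter:
    def _reset(self) -> Self:
        return self

class _WorkerIter(_BaseIter):
    pass

it: _WorkerIter = _WorkerIter()._reset()  # type-checks as _WorkerIter, not _BaseIter
```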

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146816
Approved by: https://github.com/albanD
2025-04-08 03:52:23 +00:00
836955bdbd [Manylinux 2.28] Correct Linux aarch64 cuda binaries wheel name (#150786)
Related to: https://github.com/pytorch/pytorch/issues/149044#issuecomment-2784044555
For CPU binaries we run auditwheel; however, for CUDA binaries auditwheel produces invalid results, hence we need to rename the file.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150786
Approved by: https://github.com/malfet
2025-04-08 02:58:28 +00:00
73b4938f7c [cuda] Add new faster gammabeta backward kernel (#148605) (Reapply with launch bounds) (#150625)
# Changes over the previous PR

This reverts commit 61a1f09 and adds `__launch_bounds__` to the kernel.

Previously I merged 114d404 that did not work on Blackwell because it consumed too many registers. It got reverted in 61a1f09. For more context see: https://github.com/pytorch/pytorch/issues/150266.

This PR reverts the revert (i.e. reapplies the original diff), with one additional line with `__launch_bounds__` added:

```
git diff HEAD^
diff --git a/aten/src/ATen/native/cuda/layer_norm_kernel.cu b/aten/src/ATen/native/cuda/layer_norm_kernel.cu
index 0d63a2f979c..3ce2c24c18e 100644
--- a/aten/src/ATen/native/cuda/layer_norm_kernel.cu
+++ b/aten/src/ATen/native/cuda/layer_norm_kernel.cu
@@ -657,6 +657,7 @@ bool aligned_grid
 >
 __global__
 void
+__launch_bounds__(block_dim_x * block_dim_y)
  GammaBetaBackwardCUDAKernelTemplate(
     int64_t M,
     int64_t N,
```

I managed to get a Blackwell machine and verified that the fix works. The fix was verified using this repro that I got from @drisspg

<details>
<summary> Repro script that fails on Blackwell </summary>

```
import torch
from torch.nn import init
# from transformer_nuggets import init_logging
# from transformer_nuggets.utils.benchmark import profiler
# from pathlib import Path

# init_logging()

class PermuteModule(torch.nn.Module):
    def __init__(self, permutation):
        super(PermuteModule, self).__init__()
        self.permutation = permutation
    def forward(self, x:torch.Tensor) -> torch.Tensor:
        assert len(x.shape) == len(self.permutation), f"Dimension mismatch! Unable to permute {len(x.shape)} dim input with a {len(self.permutation)} dim permutation!"
        return x.permute(*self.permutation)

def test(n_layers:int, conv_stride:int):
    _sequence = []
    for _ in range(n_layers):
        # Conv1d inputs are (N x C x L), LayerNorm expects (* x C). Dims must be permuted between modules.
        _sequence += [
            PermuteModule((0,2,1)),
            torch.nn.Conv1d(in_channels=512, out_channels=512, groups=1, kernel_size=9, dilation=1, stride=conv_stride, padding=0, bias=False),
            PermuteModule((0,2,1)),
            torch.nn.LayerNorm(512),
            torch.nn.ReLU()
        ]
    model = torch.nn.Sequential(*_sequence).to(device="cuda")
    data = torch.randn((100,2048,512), device="cuda")
    out = model(data)
    loss = torch.nn.functional.mse_loss(out, torch.rand_like(out))
    loss.backward()

torch.autograd.set_detect_anomaly(True)
print(f"Torch version: {torch.__version__}")

# with profiler(Path("conv")):
#     # print(f"layers=1, stride=1")
#     # test(n_layers=1, conv_stride=1)
#     # print(f"layers=2, stride=1")
#     # test(n_layers=2, conv_stride=1)
#     # print(f"layers=1, stride=2")
#     # test(n_layers=1, conv_stride=2)
#     print(f"layers=2, stride=2")
#     test(n_layers=2, conv_stride=2)

print(f"layers=2, stride=2")
test(n_layers=2, conv_stride=2)
# we will not reach this print statement.
print("DONE.")
```

</details>

I also re-ran my performance benchmark and found no regressions over the previous PR.

# Full description of the old PR

Original PR: https://github.com/pytorch/pytorch/pull/148605

This PR adds a new kernel for producing gamma and beta values for the backward pass in a performant way.

To test the performance against the baseline, I measured the backward pass of layernorm while sweeping over the following variables:

1. dtype in {half, float}
2. M in `2**k, 2**k - 1, 2**k + 1 for k in range(...)`
3. N in `2**k, 2**k - 1, 2**k + 1 for k in range(...)`
4. Whether we flush the L2 cache before running the backward pass

Summary: The new code performs better than the old code, especially for powers of 2. For M >> N case, it performs very well (kernel itself can be 30x faster and the overall backward pass can be 5-10x faster).

In order to visualize results of the kernel when choosing different values of M, N and dtype, I wrote some code to generate a heatmap. The heatmap has N on the x-axis, M on the y-axis and color-coded points where green shows performance improvement and red shows regressions. For example, `m=32 n=2048 1.42x` in the heatmap would indicate the normalized shape had 32 elements. The leading dimensions' product was 2048 elements and the new kernel resulted in the *backward pass* being 1.42x faster than the old *backward pass*.

Important note: This heatmap shows the total backward pass time as seen by the user. The kernel time difference can be sometimes very large while the total backward pass time is not that high. For example, for dtype=torch.half, M=32 N=2048, flush_l2_cache=True case, the heatmap shows a speedup of 1.42x, while ncu tells me the new kernel is 2.5x faster than the old:

M=32 N=2048 dtype=half flush_l2=True Old Kernel NCU summary:
```
    ----------------------- ----------- ------------
    Metric Name             Metric Unit Metric Value
    ----------------------- ----------- ------------
    DRAM Frequency                  Ghz         1.59
    SM Frequency                    Ghz         1.35
    Elapsed Cycles                cycle       27,526
    Memory Throughput                 %         2.21
    DRAM Throughput                   %         0.54
    Duration                         us        20.42
    L1/TEX Cache Throughput           %         4.31
    L2 Cache Throughput               %         2.62
    SM Active Cycles              cycle     1,475.02
    Compute (SM) Throughput           %         0.29
    ----------------------- ----------- ------------
```

M=32 N=2048 dtype=half flush_l2=True New Kernel NCU summary:
```
    ----------------------- ----------- ------------
    Metric Name             Metric Unit Metric Value
    ----------------------- ----------- ------------
    DRAM Frequency                  Ghz         1.59
    SM Frequency                    Ghz         1.34
    Elapsed Cycles                cycle       10,920
    Memory Throughput                 %         5.64
    DRAM Throughput                   %         1.35
    Duration                         us         8.13
    L1/TEX Cache Throughput           %         1.92
    L2 Cache Throughput               %         6.89
    SM Active Cycles              cycle     3,554.41
    Compute (SM) Throughput           %         0.67
    ----------------------- ----------- ------------
```

Let's look at some rows from the heatmap. For dtype=float16 flush_l2_cache=True and when input shapes are powers of 2, we get the following:

<img width="1508" alt="image" src="https://github.com/user-attachments/assets/06179599-b2f0-4a45-8664-247a1067950b" />

There are 3 columns -- the first shows all data points, the second shows speedups only and the 3rd column shows regressions only. We can see that there are dramatic speedups for M >> N cases and the regressions are not that high (less than 1%, which could just be measurement noise). Here is a small guide I made:

![image](https://github.com/user-attachments/assets/90c26f7c-e3ad-46d2-a6ce-fe4b5fb3d738)

For dtype=float32, we get a similar chart:

<img width="1499" alt="image" src="https://github.com/user-attachments/assets/c4d31a76-03b0-426c-9114-e1bfad29b530" />

The new code performs especially well for m >> n cases, and also where m and n are small. The m >> n case is special because we run 2 reduction kernels back to back and parallelize in the "M" dimension (the older kernel only parallelized in the "N" dimension).

The new code can sometimes have regressions for non-powers of 2. That is because the old code was using block sizes of {16, 32} while we have `threads.x = 32`. For example when N=33, the old code would have 3 blocks and we will have 2 blocks. I wrote some code to specialize for this case, but I think it will add complexity and @ngimel mentioned that non-powers of 2 are rare enough.

I am including the regressions here for completeness' sake:

<img width="1500" alt="image" src="https://github.com/user-attachments/assets/31c17cfb-ed9b-4106-b9c8-5c359751f530" />

To see this better:

1. Click the image
2. Right click the expanded image and open in a new tab
3. Go to that tab and left click once to zoom in

If you want to see the full data, here it is:

![image](https://github.com/user-attachments/assets/54fb60c9-8c0c-4530-a1dd-79ecda1a69a1)

I also measured binary size and compile time since those are important for developers:

Binary size comparison

![image](https://github.com/user-attachments/assets/ceef5073-1036-47f6-b9dc-cea088beda51)

```
# Original
-rwxr-xr-x 1 ahmads users 307193112 Mar  6 08:46 ./torch/lib/libtorch_cuda.so

# This PR
-rwxr-xr-x 1 ahmads users 307193112 Mar  6 08:46 ./torch/lib/libtorch_cuda.so
```

The diff in bytes is 302kB which is about a 0.1% increase.

Compile time difference:

```
# Original

real    0m10.931s
user    0m9.676s
sys     0m1.004s

# this PR

real    0m16.720s
user    0m15.514s
sys     0m1.066s

# Command I ran
time /usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DAT_PER_OPERATOR_HEADERS -DFLASHATTENTION_DISABLE_ALIBI -DFLASHATTENTION_DISABLE_SOFTCAP -DFLASH_NAMESPACE=pytorch_flash -DFMT_HEADER_ONLY=1 -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DTORCH_CUDA_BUILD_MAIN_LIB -DTORCH_CUDA_USE_NVTX3 -DUNFUSE_FMA -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_CUDA -DUSE_CUFILE -DUSE_DISTRIBUTED -DUSE_EXTERNAL_MZCRC -DUSE_FLASH_ATTENTION -DUSE_MEM_EFF_ATTENTION -DUSE_NCCL -DUSE_RPC -DUSE_TENSORPIPE -D_FILE_OFFSET_BITS=64 -Dtorch_cuda_EXPORTS -I/home/ahmads/personal/pytorch/build/aten/src -I/home/ahmads/personal/pytorch/aten/src -I/home/ahmads/personal/pytorch/build -I/home/ahmads/personal/pytorch -I/home/ahmads/personal/pytorch/cmake/../third_party/benchmark/include -I/home/ahmads/personal/pytorch/third_party/onnx -I/home/ahmads/personal/pytorch/build/third_party/onnx -I/home/ahmads/personal/pytorch/nlohmann -I/home/ahmads/personal/pytorch/third_party/flash-attention/csrc/flash_attn/src -I/home/ahmads/personal/pytorch/aten/src/THC -I/home/ahmads/personal/pytorch/aten/src/ATen/cuda -I/home/ahmads/personal/pytorch/third_party/fmt/include -I/home/ahmads/personal/pytorch/aten/src/ATen/../../../third_party/cutlass/include -I/home/ahmads/personal/pytorch/aten/src/ATen/../../../third_party/cutlass/tools/util/include -I/home/ahmads/personal/pytorch/build/caffe2/aten/src -I/home/ahmads/personal/pytorch/aten/src/ATen/.. -I/home/ahmads/personal/pytorch/build/nccl/include -I/home/ahmads/personal/pytorch/c10/cuda/../.. -I/home/ahmads/personal/pytorch/c10/.. -I/home/ahmads/personal/pytorch/third_party/tensorpipe -I/home/ahmads/personal/pytorch/build/third_party/tensorpipe -I/home/ahmads/personal/pytorch/third_party/tensorpipe/third_party/libnop/include -I/home/ahmads/personal/pytorch/torch/csrc/api -I/home/ahmads/personal/pytorch/torch/csrc/api/include -isystem /home/ahmads/personal/pytorch/build/third_party/gloo -isystem /home/ahmads/personal/pytorch/cmake/../third_party/gloo -isystem /home/ahmads/personal/pytorch/cmake/../third_party/tensorpipe/third_party/libuv/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/googletest/googlemock/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/googletest/googletest/include -isystem /home/ahmads/personal/pytorch/third_party/protobuf/src -isystem /home/ahmads/personal/pytorch/third_party/XNNPACK/include -isystem /home/ahmads/personal/pytorch/third_party/ittapi/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/eigen -isystem /usr/local/cuda/include -isystem /home/ahmads/personal/pytorch/third_party/ideep/mkl-dnn/include/oneapi/dnnl -isystem /home/ahmads/personal/pytorch/third_party/ideep/include -isystem /home/ahmads/personal/pytorch/INTERFACE -isystem /home/ahmads/personal/pytorch/third_party/nlohmann/include -isystem /home/ahmads/personal/pytorch/third_party/NVTX/c/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/cudnn_frontend/include -DLIBCUDACXX_ENABLE_SIMPLIFIED_COMPLEX_OPERATIONS -D_GLIBCXX_USE_CXX11_ABI=1 -Xfatbin -compress-all -DONNX_NAMESPACE=onnx_torch -gencode arch=compute_90,code=sm_90 -Xcudafe 
--diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda  -Wno-deprecated-gpu-targets --expt-extended-lambda -DCUB_WRAPPED_NAMESPACE=at_cuda_detail -DCUDA_HAS_FP16=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -O3 -DNDEBUG -std=c++17 -Xcompiler=-fPIC -DTORCH_USE_LIBUV -DCAFFE2_USE_GLOO -Xcompiler -Wall -Wextra -Wdeprecated -Wno-unused-parameter -Wno-missing-field-initializers -Wno-array-bounds -Wno-unknown-pragmas -Wno-strict-overflow -Wno-strict-aliasing -Wunused-function -Wunused-variable -Wunused-but-set-variable -Wno-maybe-uninitialized -MD -MT caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/layer_norm_kernel.cu.o -MF caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/layer_norm_kernel.cu.o.d -x cu -c /home/ahmads/personal/pytorch/aten/src/ATen/native/cuda/layer_norm_kernel.cu -o caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/layer_norm_kernel.cu.o

```

So the new PR adds about 6 seconds of compile time.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150625
Approved by: https://github.com/ngimel, https://github.com/atalman
2025-04-08 02:39:41 +00:00
c0991b0316 README: anaconda license violation / no longer recommend anaconda since it's no longer free to use (#150619)
hello,

I was going over the documentation to build pytorch from source.
Unfortunately, the first thing that come up is that you strongly recommend to use anaconda, which shouldn't be used because it's no longer free to use.
Could you please remove that from the doc?

I don't know if you are aware but anaconda is no longer free.
They changed their terms of service in 2020 to restrict commercial usage.
They changed their terms of service in 2024 to forbid downloading anaconda and forbid education and non-profit usage too.
The download is open and doesn't require any registration, but if you download anaconda they will sue you ^^

They started raining lawsuits against users since last year. You may have heard about anaconda vs intel in the news. They started another 5 or so in the last few months.
https://www.reuters.com/legal/litigation/intel-sued-copyright-infringement-over-ai-software-2024-08-09/

You may need to adjust more docs and your build system. The free-to-use alternative is miniforge with the conda-forge channel.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150619
Approved by: https://github.com/seemethere
2025-04-08 02:10:31 +00:00
d7f3cd0ac3 Add Half support for weight_norm on CPU (#148878)
Fixes #148867.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148878
Approved by: https://github.com/leslie-fang-intel, https://github.com/cyyever, https://github.com/albanD
2025-04-08 01:12:29 +00:00
5228986c39 [CUDA] Only use vec128 if CUDA version is newer than 12.8 (#150705)
By addressing a feedback requested at https://github.com/pytorch/pytorch/pull/145746
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150705
Approved by: https://github.com/atalman
2025-04-08 00:46:13 +00:00
e9e5682a4a [ROCm] Build Pytorch extensions with amdclang++ (#150451)
Here are the modifications made to cpp_extension.py: 1) Changed the compiler flag to use --version.
2) Added a feature to convert the alphanumeric version string returned by the compiler to a numeric string. This was the source of the error, as the parser was failing to parse the alphanumeric version string.

Built with the following PyTorch extensions: Apex, TorchVision, TorchAudio & DeepSpeed.
Unit tested with the following PyTorch extensions: Apex, TorchVision.

(cherry picked from commit c873aeac35851a7d5000eb7f24561d3f56c2ffbd)


Pull Request resolved: https://github.com/pytorch/pytorch/pull/150451
Approved by: https://github.com/jeffdaily
2025-04-07 23:31:29 +00:00
91173ff89a Fixing NCCL abort hang issue when a ProcessGroupNCCL manages multiple ncclComms (#150690)
Detail of the issue:

If PyTorch issues send/recv to each 2-rank comm, and these comms are managed by a single ProcessGroupNCCL instance, then the comms need to be aborted either in sequence or as a group.

I.e. the following sequential abort will cause a hang in NCCL:

    recv(..., comm0, stream);
    send(..., comm1, stream);
    abort(comm1);
    abort(comm0);

Fixes #119797

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150690
Approved by: https://github.com/kwen2501
2025-04-07 23:20:49 +00:00
6ea5514e04 [invoke_subgraph] Lazy backward (#150666)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150666
Approved by: https://github.com/zou3519, https://github.com/bdhirsh
2025-04-07 22:44:43 +00:00
78fe079c97 Support having no metadata file for HuggingFaceStorageReader (#150701)
Summary: If there is only one safetensors file, we don't need users to have a metadata file; we can just construct it from the keys of that file. This is a use case for some HuggingFace models, so we are adding support for it.

Test Plan:
ensure existing tests pass
tested e2e in a notebook

Differential Revision: D72472490

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150701
Approved by: https://github.com/joecummings
2025-04-07 22:10:39 +00:00
fbccbfedaf [BE] Fix Amp.metal compilation warning (#150783)
Deleting unused `uint tid` fixes
```
[114/1416] Compiling /Users/nshulga/git/pytorch/pytorch/aten/src/ATen/native/mps/kernels/Amp.metal to Amp_30.air
/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/native/mps/kernels/Amp.metal:70:10: warning: unused parameter 'tid' [-Wunused-parameter]
    uint tid [[thread_position_in_grid]]) {
         ^
1 warning generated.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150783
Approved by: https://github.com/wdvr, https://github.com/atalman
2025-04-07 22:05:00 +00:00
eba05e2d3e [AO] Refactor convert and add QuantAffinePlaceholderObserver (#150644)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150644
Approved by: https://github.com/jerryzh168
ghstack dependencies: #150642, #150643
2025-04-07 20:52:45 +00:00
5653fb3525 [AO] Add Moving Average Affine Observer (#150643)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150643
Approved by: https://github.com/jerryzh168
ghstack dependencies: #150642
2025-04-07 20:52:45 +00:00
ed0dea3e24 [AO] update port_metadata_pass to support quant_affine ops (#150642)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150642
Approved by: https://github.com/jerryzh168
2025-04-07 20:52:44 +00:00
bf1132c196 Revert "Generalize poison fork logic for each device backend (#144664)"
This reverts commit d86c14156d875b782b82dda96842a1f77910f010.

Reverted https://github.com/pytorch/pytorch/pull/144664 on behalf of https://github.com/atalman due to failing periodic test: python test/test_cpp_extensions_mtia_backend.py TestCppExtensionMTIABackend.test_device_context ([comment](https://github.com/pytorch/pytorch/pull/144664#issuecomment-2784506104))
2025-04-07 20:09:53 +00:00
f8b53f4a75 [export] raise when Dim.DYNAMIC 0/1 specializes (#150716)
Previously we didn't catch this; mark_dynamic() just doesn't allocate a symbol for it.
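A hedged sketch of the kind of program this now catches (module and shapes are made up; whether a given example trips the check depends on how the dim specializes):

```python
import torch
from torch.export import Dim, export

class M(torch.nn.Module):
    def forward(self, x):
        return x + 1

# Dim 0 of the example input is 1, so it 0/1-specializes; with this change,
# requesting Dim.DYNAMIC for it should raise instead of silently dropping the symbol.
export(M(), (torch.randn(1, 4),), dynamic_shapes={"x": {0: Dim.DYNAMIC}})
```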

Differential Revision: D72486930

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150716
Approved by: https://github.com/angelayi
2025-04-07 18:58:42 +00:00
2a1e2b88ed [logging] Add pgo remote get/put timings to dynamo_compile (#150322)
Test Plan: https://fburl.com/scuba/dynamo_compile/sandbox/xf950tw8

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150322
Approved by: https://github.com/ppanchalia
2025-04-07 18:08:26 +00:00
6fcffd8cd1 Optimize SVE embedding performance (#150176)
Change the loop unrolling strategy. Previously, the script only unrolled the inner loop over block_size when the block size was a multiple of the vector length. This version instead unrolls the outer loop, which reduces the number of loads/stores for accumulation into the output array and improves performance for cases where the block size is not a multiple of the vector length.

Benchmarking script:
```python
# SPDX-FileCopyrightText: Copyright 2025 Arm Limited and/or its affiliate <open-source-office@arm.com>
# SPDX-License-Identifier: BSD-3-Clause
import torch
import torch.nn as nn
import numpy as np
import time
import sys

np.random.seed(0)
torch.manual_seed(0)

num_embeddings = 400000
embedding_dim = int(sys.argv[1])
multi_hot = 100
batch_size = 400
nrun = 1000

class SimpleEmbeddingBagModel(nn.Module):
    def __init__(self, num_embeddings, embedding_dim):
        super(SimpleEmbeddingBagModel, self).__init__()

        weights = torch.from_numpy((np.random.random_sample((num_embeddings, embedding_dim)) + 1).astype(np.float32)).to(torch.float16)

        # Defining the EmbeddingBag layer
        self.embedding_bag = torch.nn.EmbeddingBag(num_embeddings, embedding_dim, _weight=weights,
                                                   mode='sum', include_last_offset=True, dtype=torch.float32)

    def forward(self, input, offsets):
        # Forward pass through the EmbeddingBag layer
        result32 = self.embedding_bag(input, offsets, per_sample_weights=None)
        return result32

# Instantiate the model
model = SimpleEmbeddingBagModel(num_embeddings=num_embeddings, embedding_dim=embedding_dim)
model.eval()

# Example input
input_tensor = torch.randint(0, num_embeddings, (batch_size * multi_hot,), dtype=torch.long)

offsets = torch.tensor(range(0, batch_size * multi_hot + 1, multi_hot))

with torch.no_grad():
    # warm up
    output32 = model(input_tensor, offsets)

    ti = time.time_ns()
    for i in range(nrun):
        _ = model(input_tensor, offsets)
    tf = time.time_ns()
    print("{:3d} {:.3E}".format(embedding_dim, (tf-ti)/nrun/1.e6))
```
Speedup on NEOVERSEV1 with 1 thread
![embedding](https://github.com/user-attachments/assets/16e567ed-b9a5-4db3-90b8-dec66d5414a7)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150176
Approved by: https://github.com/digantdesai, https://github.com/malfet
2025-04-07 18:01:54 +00:00
7d2411d30e [DCP][OSS] Introduce barrier util in the DistWrapper for rank local checkpointing (#150748)
Summary: Introduce barrier util in the DistWrapper for rank local checkpointing. This barrier will be used at the end of the rank local checkpointing to ensure all ranks synchronize.

Test Plan: UTs

Differential Revision: D72541431

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150748
Approved by: https://github.com/MeetVadakkanchery
2025-04-07 17:33:07 +00:00
957faaadca Avoid overflow in vector_norm for scalar input (#144073)
Fixes https://github.com/pytorch/pytorch/issues/143960, where torch.dist gave different results from eager due to vector_norm overflowing; eager mode avoids the overflow for single-element reductions by not computing the power and then the root.
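A small arithmetic illustration of the overflow described above (values chosen for illustration):

```python
import torch

x = torch.tensor([1e200], dtype=torch.float64)
naive = (x * x).sum().sqrt()        # x*x overflows to inf in float64
print(naive)                        # tensor(inf, dtype=torch.float64)
print(torch.linalg.vector_norm(x))  # 1e+200 -- the single-element path skips pow-then-root
```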
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144073
Approved by: https://github.com/eellison, https://github.com/laithsakka
2025-04-07 17:10:10 +00:00
06e9deabb6 [c10d][fr] Improve FR dump robustness with all watchdog broadcast wait and more frequent store check (#150652)
While debugging missing FR dumps and missing dump logs, I have a couple of initial findings:
1. On the same rank, if a second watchdog timeout triggers on a different PG (or subPG), that watchdog thread will immediately throw an exception instead of sleeping. We want to fix that by still making the watchdog thread wait for 1 min.
2. The FR dump takes about 900ms to 1200ms, so we are not checking the store frequently enough. But instead of changing the frequency from 1 sec to 300ms, we finally decided to just let all ranks sleep for 1 min universally rather than using a promise.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150652
Approved by: https://github.com/kwen2501
2025-04-07 16:33:27 +00:00
56ab71de98 [ROCm] Expand workspace size for gfx95 (#150632)
Use same workspace size for gfx95* as gfx94*

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150632
Approved by: https://github.com/jeffdaily

Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
2025-04-07 16:05:56 +00:00
0ad2c5d7e2 Add RECORD_FUNCTION for AOTI (#150150)
Only adds RECORD_FUNCTION for shim_fn for now.
The next step is to add RECORD_FUNCTION for all the aoti_torch_* functions.

Fixes https://github.com/pytorch/pytorch/issues/148650

Some code generated by AOTI:
```c++
    AtenTensorHandle buf1_handle;
    AtenTensorHandle buf2_handle;
    AtenTensorHandle buf3_handle;
    AtenTensorHandle buf4_handle;
    {RECORD_FUNCTION("aoti_torch_cpu__embedding_bag", c10::ArrayRef<c10::IValue>());AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_cpu__embedding_bag(L__self___sparse_arch_embedding_bag_collection_embedding_bags_t_cat_0_weight, arg80_1, arg81_1, 0, 0L, 0, nullptr, 1, -1L, &buf1_handle, &buf2_handle, &buf3_handle, &buf4_handle));}
    RAIIAtenTensorHandle buf1(buf1_handle);
    RAIIAtenTensorHandle buf2(buf2_handle);
    RAIIAtenTensorHandle buf3(buf3_handle);
    RAIIAtenTensorHandle buf4(buf4_handle);
    arg80_1.reset();
    arg81_1.reset();
```

On the trace:
```
{
  "name": "aoti_torch_cpu__embedding_bag",
  "ph": "X",
  "ts": 68874.450000,
  "dur": 361.291000,
  "tid": 2,
  "pid": "CPU Functions",
  "args": {}
},
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150150
Approved by: https://github.com/desertfire, https://github.com/EikanWang
2025-04-07 15:12:29 +00:00
f813d64f54 cpp_wrapper: Fix even more tests (#147225)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147225
Approved by: https://github.com/desertfire
ghstack dependencies: #150671, #150672
2025-04-07 14:20:06 +00:00
f0abbabac1 AOTI fallback ops: sort alphabetically (#150672)
This is just a housekeeping task that makes the listed fallback op order match what's in the generated C shim files.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150672
Approved by: https://github.com/desertfire
ghstack dependencies: #150671
2025-04-07 14:20:06 +00:00
5e3c8214b5 cpp_wrapper: Re-enable code disabled for forward compatibility (#150671)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150671
Approved by: https://github.com/desertfire
2025-04-07 14:20:06 +00:00
99c9a31386 [submodule] [Snapshot/Profiler] Memory Snapshot On Demand (#150559)
Summary:
Profiler side of memory snapshot.

1. Add API to actually do snapshot when client interface is called
2. Add ifdefs to builds so that kineto hooks snapshot correctly.

Design Philosophy: There is one interesting part of this implementation, and it is during export. For export we are calling the Python impl of the export rather than CPP, even though we are already in CPP. This is because it is better to simply have one path for export rather than two. Personally, I want there to be parity between auto-trace and on-demand, so if we can limit the side paths then we will have an easier time maintaining this relationship.

Test Plan: {F1976563426}

Reviewed By: sanrise

Differential Revision: D70733247

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150559
Approved by: https://github.com/sanrise
2025-04-07 13:04:38 +00:00
e209625334 [torchrec] update local_shards_wrapper to latest version (#150469)
Summary: Adding new ops, support for empty shards, and fixed initializations for downstream checkpointing.

Test Plan: buck2 run 'fbcode//mode/dev-nosan' fbcode//torchrec/distributed/tests:test_shards_wrapper

Differential Revision: D72271275

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150469
Approved by: https://github.com/XilunWu
2025-04-07 13:00:52 +00:00
cdf3b63e32 Update slow tests (#150283)
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml).
Update the list of slow tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150283
Approved by: https://github.com/pytorchbot
2025-04-07 11:49:59 +00:00
25662d38d5 [xla hash update] update the pinned xla hash (#132021)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132021
Approved by: https://github.com/pytorchbot
2025-04-07 11:35:56 +00:00
164d2c887b Add check in test_cow_input to ensure COW data is never changed (#150723)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150723
Approved by: https://github.com/Skylion007
2025-04-07 04:35:00 +00:00
24aadb40fb [precompile] Serialization for GlobalStateGuard (#150636)
Summary: To preserve global state guards we need to make the C++ type serializable. Using JSON because it's easier to do and we don't have a lot of data in global state.

Test Plan: test_dynamo -k test_global_state_guard_serialization

Differential Revision: D72410611

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150636
Approved by: https://github.com/williamwen42
2025-04-07 03:10:03 +00:00
b6929aef08 Fix conv2d strided prologue (#150697)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150697
Approved by: https://github.com/drisspg
2025-04-07 02:26:58 +00:00
d86c14156d Generalize poison fork logic for each device backend (#144664)
# Motivation
Generalize the poison_fork code to make it reusable across different devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144664
Approved by: https://github.com/EikanWang, https://github.com/albanD
2025-04-07 02:06:21 +00:00
d98575806b Generalize compile collective to avoid cuda-bias (#150405)
Fixes https://github.com/intel/torch-xpu-ops/issues/1527
Let the combination of `compile` and `collective` support more devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150405
Approved by: https://github.com/guangyey, https://github.com/jansel

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
2025-04-07 01:54:20 +00:00
d8d306cbc6 Suppress -Wunused-function for DSA (#150735)
Test Plan: Sandcastle

Reviewed By: dtolnay

Differential Revision: D72458590

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150735
Approved by: https://github.com/eqy, https://github.com/cyyever
2025-04-07 01:47:35 +00:00
370ba6b96f [codemod] Fix -Wambiguous-reversed-operator in aten/src/ATen/cuda/tunable/Tunable.h (#150744)
Summary:
`-Wambiguous-reversed-operator` warns about ambiguous reversed operators, e.g. `a < b` and `b > a` are both valid. Such operators are disallowed in C++20. This codemod fixes the warnings.

#buildsonlynotests - If this diff compiles, it works.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Differential Revision: D72535527

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150744
Approved by: https://github.com/drisspg
2025-04-07 01:45:03 +00:00
47b494ef69 Add type hints to _tensor_docs.add_docstr_all (#150715)
There is some sort of bug in `pytype` where if this function doesn't have type hints, `pytype` will spend 10 minutes inferring the types. Not that this matters much for a project not using `pytype`, but it led me to realize that this function could easily be type hinted and is not, so here is a PR adding some type hints.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150715
Approved by: https://github.com/Skylion007
2025-04-06 22:25:34 +00:00
0aaf35310a Overload unary - operator on at::vec::Vectorized to call neg() (#150568)
Makes Vectorized look even more like a scalar type, getting me closer to being able to use the same generic code with scalars and Vectorized (e.g., for sigmoid, which needs `exp(-x)`).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150568
Approved by: https://github.com/Skylion007
ghstack dependencies: #150380
2025-04-06 21:12:27 +00:00
912102b4ec Make at::vec::Vectorized ops work with scalars (#150380)
I noticed that I couldn't use `vec::Vectorized` operations with scalars, even though there is an implicit conversion from `T` to `vec::Vectorized<T>`, so I made it work.

Test Plan: Added tests. Reverted vec_base.h, left the new tests in place, and confirmed that new tests don't compile in that state.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150380
Approved by: https://github.com/Skylion007
2025-04-06 21:12:27 +00:00
8adfcd35c3 [cuDNN][SDPA] Loosen constraints for GQA for cuDNN Attention (#150337)
cuDNN attention doesn't require key and value tensors to have the same number of heads

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150337
Approved by: https://github.com/drisspg
2025-04-06 20:31:11 +00:00
6a8ab902a2 [AOTI][dashboard] Fix mis-calculated memory compression ratio (#150695)
Summary: https://github.com/pytorch/pytorch/pull/149817 introduced an extra warmup run to compute the AOTI memory compression ratio, but since weights are only loaded once in the AOTI run, the peak memory seen in the extra warmup won't include the weights, which causes an artificially high memory compression ratio. This PR removes that extra warmup run and calls reset_peak_memory_stats in the proper place instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150695
Approved by: https://github.com/yushangdi
2025-04-06 19:51:22 +00:00
6c38b9be73 [typing] Add type hints to __init__ methods in torch.distributions. (#144197)
Fixes #144196
Extends #144106 and #144110

## Open Problems:

- [ ] Annotating with `numbers.Number` is a bad idea, should consider using `float`, `SupportsFloat` or some `Procotol`. https://github.com/pytorch/pytorch/pull/144197#discussion_r1903324769

# Notes

- `beta.py`: needed to add `type: ignore` since `broadcast_all` is untyped.
- `categorical.py`: converted `else` branches of mutually exclusive arguments to `if` branch[^2].
- ~~`dirichlet.py`: replaced `axis` with `dim` arguments.~~ #144402
- `geometric.py`: converted `else` branches of mutually exclusive arguments to `if` branch[^2].
- ~~`independent.py`: fixed bug in `Independent.__init__` where `tuple[int, ...]` could be passed to `Distribution.__init__` instead of `torch.Size`.~~ **EDIT:** turns out the bug is related to typing of `torch.Size`. #144218
- `independent.py`: made `Independent` a generic class of its base distribution.
- `multivariate_normal.py`: converted `else` branches of mutually exclusive arguments to `if` branch[^2].
- `relaxed_bernoulli.py`: added class-level type hint for `base_dist`.
- `relaxed_categorical.py`: added class-level type hint for `base_dist`.
- ~~`transforms.py`: Added missing argument to docstring of `ReshapeTransform`~~ #144401
- ~~`transforms.py`: Fixed bug in `AffineTransform.sign` (could return `Tensor` instead of `int`).~~ #144400
- `transforms.py`: Added `type: ignore` comments to `AffineTransform.log_abs_det_jacobian`[^1]; replaced `torch.abs(scale)` with `scale.abs()`.
- `transforms.py`: Added `type: ignore` comments to `AffineTransform.__eq__`[^1].
- `transforms.py`: Fixed type hint on `CumulativeDistributionTransform.domain`. Note that this is still an LSP violation, because `Transform.domain` is defined as `Constraint`, but `Distribution.domain` is defined as `Optional[Constraint]`.
- skipped: `constraints.py`, `constraints_registry.py`, `kl.py`, `utils.py`, `exp_family.py`, `__init__.py`.

## Remark

`TransformedDistribution`: `__init__` uses the check `if reinterpreted_batch_ndims > 0:`, which can lead to the creation of `Independent` distributions with only 1 component. This results in awkward code like `base_dist.base_dist` in `LogisticNormal`.

```python
import torch
from torch.distributions import *
b1 = Normal(torch.tensor([0.0]), torch.tensor([1.0]))
b2 = MultivariateNormal(torch.tensor([0.0]), torch.eye(1))
t = StickBreakingTransform()
d1 = TransformedDistribution(b1, t)
d2 = TransformedDistribution(b2, t)
print(d1.base_dist)  # Independent with 1 dimension
print(d2.base_dist)  # MultivariateNormal
```

One could consider changing this to `if reinterpreted_batch_ndims > 1:`.

[^1]: Usage of `isinstance(value, numbers.Real)` leads to problems with static typing, as the `numbers` module is not supported by `mypy` (see <https://github.com/python/mypy/issues/3186>). This results in us having to add type-ignore comments in several places
[^2]: Otherwise, we would have to add a bunch of `type: ignore` comments to make `mypy` happy, as it isn't able to perform the type narrowing. Ideally, such code should be replaced with structural pattern matching once support for Python 3.9 is dropped.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144197
Approved by: https://github.com/malfet

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-04-06 17:50:35 +00:00
49f6cce736 [MPS] grad scaler (#150255)
Fixes #142397

Basic implementation is done. What's left:
- [x] Different dtype/device tensors in the TensorList
- [x] fast path for grouping the foreach kernel
- [x] Tests

Regarding tests, I found some tests in `test/test_torch.py` for GradScaler but I couldn't figure out what is the best way to enable the test for MPS device.

By removing `@onlyNativeDeviceTypes`, one enables the tests for MPS but also enables tests for all other devices which are not included in the native device types. If I put:
`instantiate_device_type_tests(TestTorchDeviceType, globals(), allow_mps=True)`

This enables lots of tests in that class for MPS which were not(?) being tested before? This part needs some clarification

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150255
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-04-06 17:06:55 +00:00
55e62ff74a bf16 grouped gemm (#150374)
Enabled bf16 grouped gemm with an API similar to _scaled_group_gemm, except without the scale and fast-accum arguments. All transpose variants are enabled, unlike scaled gemm. Ideally we'd factor out a lot more code from scaled gemm; currently there's a lot of repetition between the scaled and non-scaled versions. I factored out only a helper kernel that prepares arguments.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150374
Approved by: https://github.com/drisspg
2025-04-06 04:53:24 +00:00
caf8d9bc17 Revert "Fix conv2d strided prologue (#150697)"
This reverts commit 2e4ae2ab41dbe1939bd1ffb427af8e5ea8eaff41.

Reverted https://github.com/pytorch/pytorch/pull/150697 on behalf of https://github.com/ngimel due to breaks rocm build ([comment](https://github.com/pytorch/pytorch/pull/150697#issuecomment-2781218658))
2025-04-06 04:50:15 +00:00
2d98a1caf5 [MTIA] Map names to operand indices when folding submodules (#150692)
When replacing placeholders with getattrs during constant folding, we can have an argument and parameter name mismatch. In fact, there is no guarantee that the parameter name is equivalent to the argument name used in the module call.

Differential Revision: D72415970

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150692
Approved by: https://github.com/jfix71
2025-04-06 03:11:14 +00:00
15768cc34b add unit test for preferred_blas_library settings (#150581)
Follow up to #150212 that was committed without a unit test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150581
Approved by: https://github.com/atalman, https://github.com/malfet

Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-04-06 01:44:07 +00:00
83b870a28a Fix missing braces for clang CUDA (#150736)
Test Plan: Sandcastle

Differential Revision: D72469764

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150736
Approved by: https://github.com/Skylion007
2025-04-06 01:29:59 +00:00
c830c12a87 [MPSInductor] Fix tiled reduction logic (#150737)
In the case of tiles, the index must include both reduction dimensions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150737
Approved by: https://github.com/dcci
2025-04-06 00:20:41 +00:00
cfea55dbec [MPS] fix inverse bug for N>1024 (#146754)
Fixes #138200

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146754
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-04-05 21:49:21 +00:00
60a45eb862 [AOTInductor] Introduce MaybeOwningAtenTensorHandle for ConstantMap (#150275)
Summary:
We used RAIIAtenTensorHandle for ConstantMap, where RAIIAtenTensorHandle
is a unique_ptr, indicating that all memory handling is done internally by
AOTInductor.

In this PR, we introduce ConstantAtenTensorHandle, which replaces
RAIIAtenTensorHandle. This class holds a raw AtenTensorHandle and also
owns a RAIIAtenTensorHandle if the user decides to delegate memory
management to AOTInductor.

This is a prerequisite for user-managed buffers. This PR, however, only
introduces this class and makes sure it works with the existing AOTInductor,
with default behavior identical to using RAIIAtenTensorHandle.

Test Plan:
Existing tests. No change should be introduced within this PR.

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150275
Approved by: https://github.com/chenyang78, https://github.com/desertfire
2025-04-05 06:00:35 +00:00
7ac8186851 [MPSInductor] Speedup sum/prod reductions (#150566)
By using cooperative `simd_sum`/`simd_product` instead of a C-style for loop for threadgroup reductions. This also allows a significant reduction in the amount of shared memory needed to perform those reductions.

Using such reduction increases the `torch.compile` performance for gpt-fast using `stories110M` from 29 tokens/sec to 630 tokens/sec on M4 and changes perf of torch.rand as follows:
| size | before | after |
|------|--------|-------|
| 512x512 | 202.1 | 131.8 |
| 1024x1024 | 780.6 | 176.9 |
| 2048x2048 | 1423.4 | 339.9 |
| 4096x4097 | 2982.2 | 1047.2 |

Unfortunately, none of the SIMDgroup operations are available for 64-bit integers, but one can simulate the behavior using `simd_shuffle_down` of 64-bit values represented as `int2` types, which yields a reduction in $\log_2(threadgroup\_size)$ steps. [`mlx/kernels/reduction/ops.h`](86389bf970/mlx/backend/metal/kernels/reduction/ops.h (L15-L18)) contains an implementation of such an algorithm, but alas it yields wrong results on M1/M2 (and maybe M3) machines if not all threads in the simdgroup are active, which can be observed by running
```python
import torch
lib=torch.mps.compile_shader("""
kernel void do_sum(device int* out, constant int* in, uint idx [[thread_position_in_grid]]) {
  out[idx] = metal::simd_shuffle_down(in[idx], 8);
}
""")
x=torch.arange(22, device='mps', dtype=torch.int32)
y=torch.empty_like(x)
lib.do_sum(y, x)
print(y)
```
that returns following on M4
```
tensor([ 8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21,  0,  0,  0,  0, 0,  0,  0,  0], device='mps:0', dtype=torch.int32)
```
but same kernel running on M1 returns
```
tensor([ 8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 14, 15, 16, 17, 18, 19, 20, 21], device='mps:0', dtype=torch.int32)
```
This discrepancy in behavior can be addressed by using `simd_shuffle_and_fill_down`, but any kernel using `simd_shuffle_and_fill_down` causes an internal compiler error on macOS 13.2. Considering that this OS will be EOL soon, skip the offending tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150566
Approved by: https://github.com/manuelcandales
ghstack dependencies: #150452, #150457
2025-04-05 02:47:27 +00:00
c14977e91c Use 'rocm' naming for rocm-related workflows/jobs (#150555)
Reduces the number of places in the workflow files that need updating for a ROCm version update.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150555
Approved by: https://github.com/jeffdaily
2025-04-05 02:09:11 +00:00
3320efef6b Refresh expected results. (#150264)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150264
Approved by: https://github.com/bobrenjc93
2025-04-05 01:11:19 +00:00
2e4ae2ab41 Fix conv2d strided prologue (#150697)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150697
Approved by: https://github.com/drisspg
2025-04-05 00:28:56 +00:00
d6887f444f [Inductor] Fallback embedding when sparse is True (#150659)
**Summary**
Fix issue https://github.com/pytorch/pytorch/issues/150656: fall back for `embedding` when sparse is True.

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_torchinductor.py -k test_embedding_sparse
```
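For reference, a minimal sketch (module and shapes are assumed) of the kind of case the test above exercises: an `Embedding` with `sparse=True` under `torch.compile`, which now falls back to eager instead of being lowered by Inductor:

```python
import torch

emb = torch.nn.Embedding(16, 4, sparse=True)

@torch.compile
def f(idx):
    return emb(idx)

print(f(torch.tensor([0, 3, 7])).shape)  # torch.Size([3, 4])
```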

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150659
Approved by: https://github.com/jansel
2025-04-04 23:59:38 +00:00
2e23768d25 Expose symbols on macos in the xplat pytorch stack (#150487)
Summary:
X-link: https://github.com/pytorch/executorch/pull/9819

Had to revert D71321310 because it affected way too many targets and build sizes.

These changes should expose just enough symbols to be buildable in arvr mode on macOS. Could potentially narrow it down even more by avoiding e.g. `get_pt_compiler_flags`.

Differential Revision: D72255474

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150487
Approved by: https://github.com/drisspg
2025-04-04 23:03:16 +00:00
2a2ddff214 [Inductor] Fix consolidating _scaled_mm into mm template TMA error (#150686)
Summary: The previous diff broke a few tests that didn't run on internal or GH CI (T220169086); this fixes that issue. The {% if } block is only supposed to support autotuned parameters (constexpr) and should not be used for locals, based on other examples.

Test Plan: buck test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:fp8 -- --exact 'caffe2/test/inductor:fp8 - test_tensorwise_scaling_bfloat16_shape_16,32,32_has_bias_False_use_fast_accum_True_persistent_matmul_True (caffe2.test.inductor.test_fp8.TestFP8Lowering)'

Reviewed By: NikhilAPatel

Differential Revision: D72460516

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150686
Approved by: https://github.com/eellison, https://github.com/NikhilAPatel
2025-04-04 22:49:22 +00:00
861d2cc02c Add a param for save format in Storage Writer (#150025)
Summary: Add a param to specify to the storage writer how to save tensors. Right now the only options are safetensors and torch.save.

Test Plan:
(lintrunner) [ankitageorge@devgpu003.cco3 /data/users/ankitageorge/fbsource/fbcode/caffe2 (1d57cb27b)]$ buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/distributed/checkpoint:test_hf_storage
File changed: fbcode//caffe2/torch/distributed/checkpoint/filesystem.py
Buck UI: https://www.internalfb.com/buck2/e80cc963-e34a-4876-b6f4-7ce2794e48dd
Test UI: https://www.internalfb.com/intern/testinfra/testrun/3659174965882569
Network: Up: 32KiB  Down: 1.9KiB  (reSessionID-ef9fa764-a40a-451b-ab58-08eabe7a9422)
Executing actions. Remaining     0/4                                                                                             3.4s exec time total
Command: test.     Finished 2 local
Time elapsed: 19.6s
Tests finished: Pass 4. Fail 0. Fatal 0. Skip 0. Build failure 0

Reviewed By: saumishr

Differential Revision: D70271943

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150025
Approved by: https://github.com/saumishr
2025-04-04 17:52:53 +00:00
c53bc616d5 caffe2: Fix lint errors in native/xnnpack/Linear.cpp (#150508)
Summary: See title

Test Plan: Sandcastle

Differential Revision: D72275403

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150508
Approved by: https://github.com/malfet, https://github.com/Skylion007, https://github.com/cyyever
2025-04-04 17:14:43 +00:00
c93e34d7b5 Revert "bound sympy accuracy (#150383)"
This reverts commit 1bc2b2b12ae1ddd27b0401a1baac3b8099b6fc50.

Reverted https://github.com/pytorch/pytorch/pull/150383 on behalf of https://github.com/laithsakka due to big regression ([comment](https://github.com/pytorch/pytorch/pull/150383#issuecomment-2779227548))
2025-04-04 16:26:00 +00:00
f443035f10 Revert "[cuda] Add new faster gammabeta backward kernel (#148605) (Reapply with launch bounds) (#150625)"
This reverts commit c6defa9443d241dd7a0baac4e708b6e906bd012c.

Reverted https://github.com/pytorch/pytorch/pull/150625 on behalf of https://github.com/atalman due to failing internal build ([comment](https://github.com/pytorch/pytorch/pull/150625#issuecomment-2779183414))
2025-04-04 16:05:18 +00:00
07d439e782 [aoti] Split ConstantType definition out of model.h (#150545)
Summary:
Splitting the type definition of ConstantType into a separate header because it's needed by Sigmoid OSS, but including the entire model.h header causes the following compilation error:
```
2025-04-01T18:12:42.0391272Z FAILED: caffe2/CMakeFiles/torch_cpu.dir/__/torch/csrc/nativert/kernels/AOTICallDelegateKernel.cpp.o
2025-04-01T18:12:42.0417705Z /opt/cache/bin/sccache /opt/cache/bin/clang++ -DAT_PER_OPERATOR_HEADERS -DBUILD_ONEDNN_GRAPH -DCAFFE2_BUILD_MAIN_LIB -DCPUINFO_SUPPORTED_PLATFORM=1 -DFMT_HEADER_ONLY=1 -DFXDIV_USE_INLINE_ASSEMBLY=0 -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DIDEEP_USE_MKL -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DNNP_CONVOLUTION_ONLY=0 -DNNP_INFERENCE_ONLY=0 -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DTORCH_ENABLE_LLVM -DUSE_C10D_GLOO -DUSE_DISTRIBUTED -DUSE_EXTERNAL_MZCRC -DUSE_RPC -DUSE_TENSORPIPE -DXNN_LOG_LEVEL=0 -D_FILE_OFFSET_BITS=64 -Dtorch_cpu_EXPORTS -I/var/lib/jenkins/workspace/build/aten/src -I/var/lib/jenkins/workspace/aten/src -I/var/lib/jenkins/workspace/build -I/var/lib/jenkins/workspace -I/var/lib/jenkins/workspace/cmake/../third_party/benchmark/include -I/opt/llvm/include -I/var/lib/jenkins/workspace/third_party/onnx -I/var/lib/jenkins/workspace/build/third_party/onnx -I/var/lib/jenkins/workspace/nlohmann -I/var/lib/jenkins/workspace/torch/csrc/api -I/var/lib/jenkins/workspace/torch/csrc/api/include -I/var/lib/jenkins/workspace/caffe2/aten/src/TH -I/var/lib/jenkins/workspace/build/caffe2/aten/src/TH -I/var/lib/jenkins/workspace/build/caffe2/aten/src -I/var/lib/jenkins/workspace/build/caffe2/../aten/src -I/var/lib/jenkins/workspace/torch/csrc -I/var/lib/jenkins/workspace/third_party/miniz-3.0.2 -I/var/lib/jenkins/workspace/third_party/kineto/libkineto/include -I/var/lib/jenkins/workspace/third_party/kineto/libkineto/src -I/var/lib/jenkins/workspace/third_party/cpp-httplib -I/var/lib/jenkins/workspace/aten/src/ATen/.. -I/var/lib/jenkins/workspace/third_party/FXdiv/include -I/var/lib/jenkins/workspace/c10/.. -I/var/lib/jenkins/workspace/third_party/pthreadpool/include -I/var/lib/jenkins/workspace/third_party/cpuinfo/include -I/var/lib/jenkins/workspace/aten/src/ATen/native/quantized/cpu/qnnpack/include -I/var/lib/jenkins/workspace/aten/src/ATen/native/quantized/cpu/qnnpack/src -I/var/lib/jenkins/workspace/aten/src/ATen/native/quantized/cpu/qnnpack/deps/clog/include -I/var/lib/jenkins/workspace/third_party/NNPACK/include -I/var/lib/jenkins/workspace/third_party/fbgemm/include -I/
2025-04-01T18:12:42.0444143Z In file included from /var/lib/jenkins/workspace/torch/csrc/nativert/kernels/AOTICallDelegateKernel.cpp:5:
2025-04-01T18:12:42.0445081Z In file included from /var/lib/jenkins/workspace/torch/csrc/nativert/executor/AOTIDelegateExecutor.h:6:
2025-04-01T18:12:42.0446002Z In file included from /var/lib/jenkins/workspace/torch/csrc/nativert/executor/AOTInductorModelImpl.h:5:
2025-04-01T18:12:42.0447549Z /var/lib/jenkins/workspace/torch/csrc/inductor/aoti_runtime/model.h:78:13: error: function 'RAII_cpuMalloc' is not needed and will not be emitted [-Werror,-Wunneeded-internal-declaration]
2025-04-01T18:12:42.0448656Z RAIIDataPtr RAII_cpuMalloc(size_t num_bytes) {
```

model.h defines the RAII_malloc functions directly in an anonymous namespace, which seems pretty sad. We should do something about it, but maybe not in the current diff.

Test Plan: CI

Differential Revision: D72320413

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150545
Approved by: https://github.com/desertfire
2025-04-04 15:48:45 +00:00
1b0a023dde [Dynamo][Misc] Apply typing hints for codegen (#150289)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150289
Approved by: https://github.com/Skylion007, https://github.com/cyyever
2025-04-04 14:26:22 +00:00
295b7e21eb [MPS/inductor] Add support for hermite_polynomial_h. (#150664)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150664
Approved by: https://github.com/malfet
2025-04-04 13:14:52 +00:00
09c4da9325 [CUDA][avgpool2d] Fix backward launch bounds again for sm100, sm120 (#150640)
`__CUDA_ARCH__` is not visible in host code, which causes incorrect launch bounds and `too many resources requested for launch` on Blackwell

CC @atalman @malfet as we would want this in 2.7 @nWEIdia

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150640
Approved by: https://github.com/malfet, https://github.com/drisspg, https://github.com/atalman
2025-04-04 13:05:40 +00:00
73358d37da Fix codegen, change str comparison operator to == for proper equality … (#150611)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150611
Approved by: https://github.com/Skylion007, https://github.com/cyyever
2025-04-04 09:59:59 +00:00
4854926aeb Revert "Add torch._scaled_mm for CPU (#150410)"
This reverts commit 3b02f795c5ad2339794b15b370c0e4a235d36adf.

Reverted https://github.com/pytorch/pytorch/pull/150410 on behalf of https://github.com/malfet due to It breaks ROCM tests ([comment](https://github.com/pytorch/pytorch/pull/150410#issuecomment-2777704212))
2025-04-04 06:52:54 +00:00
f3cb3557d6 [executorch hash update] update the pinned executorch hash (#149817)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149817
Approved by: https://github.com/pytorchbot
2025-04-04 05:21:44 +00:00
98d06b401b [Dynamo] Fix dict.items() return type (#150112)
Fixes #150110

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150112
Approved by: https://github.com/jansel, https://github.com/zou3519
2025-04-04 04:32:13 +00:00
e6e1f8c272 [audio hash update] update the pinned audio hash (#150589)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150589
Approved by: https://github.com/pytorchbot
2025-04-04 04:29:45 +00:00
c6d79c163c [dynamic shapes] allow duck typing for 0/1 (#150222)
Fixes #150184

e.g. for config.backed_size_oblivious=True and compile

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150222
Approved by: https://github.com/laithsakka
2025-04-04 03:24:46 +00:00
7df6f930e8 Adapt test_misc.py for HPUs (#149499)
This PR is related to https://github.com/pytorch/pytorch/pull/145476. That PR had two files (test_functions.py and test_misc.py). test_functions.py was causing CI/rebase/merge issues and hence was removed for now. This PR contains only test_misc.py.

This is a continuation of https://github.com/pytorch/pytorch/pull/144387 .

## MOTIVATION
We recently integrated support for Intel Gaudi devices (identified as 'hpu') into the common_device_type framework via the pull request at https://github.com/pytorch/pytorch/pull/126970. This integration allows tests to be automatically instantiated for Gaudi devices upon loading the relevant library. Building on this development, the current pull request extends the utility of these hooks by adapting selected CUDA tests to operate on Gaudi devices. Additionally, we have confirmed that these modifications do not interfere with the existing tests on CUDA devices.

Other accelerators can also extend the functionality by adding the device in the devices list. ( For eg: xpu )

## CHANGES
- Create a separate class for test functions running on CUDA devices
- Extend the functionality of these tests to include HPUs
- Use instantiate_device_type_tests with targeted attributes to generate device-specific test instances within the new classes
- Apply skipIfHPU decorator to bypass tests that are not yet compatible with HPU devices

PS: Most of these changes were initially part of https://github.com/pytorch/pytorch/pull/147609, but that PR was closed due to merge conflicts. The review comments were handled in this PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149499
Approved by: https://github.com/EikanWang, https://github.com/desertfire, https://github.com/cyyever
2025-04-04 02:47:43 +00:00
ed0fd2fa7a clang-format aten/src/ATen/cpu/vec/*.h (#150426)
I got a complaint about indentation on #150380. Make the machines fix it for us.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150426
Approved by: https://github.com/aditew01, https://github.com/cyyever, https://github.com/frost-intel, https://github.com/Skylion007
2025-04-04 02:41:11 +00:00
bd9c42ebfb [c10d] Surface error type when we unlink and create named pipe for DumpPipe (#150648)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150648
Approved by: https://github.com/fegin, https://github.com/kwen2501
2025-04-04 02:12:32 +00:00
a9e2f22405 [Bugfix] Fix compile error with torch.Tensor.unsqueeze_ and inplace views called from Tensor Class (#150573)
Fixes #129673

### Summary:
Modifying a tensor by reshaping it in place (such as with `unsqueeze_`) should cause a graph break; however, when the method was accessed through the `torch.Tensor` API, as opposed to as a self attribute, the code crashed with an error (see the attached issue).

The traced paths differed due to the stack variable popped, as:
* `self.unsqueeze_` pops a `LazyVariableTracker` which gets resolved to `TensorVariable`, so when looking for the method, triggers the fn call `var_getattr`  in `_dynamo/variables/tensor.py`; since this is an inplace view (metadata mutation) on graph input, it is not well supported so should fall back (see [L446](1017927c83/torch/_dynamo/variables/tensor.py (L446)) in that file)
* `torch.Tensor.unsqueeze` pops a `UserDefinedClassVariable` so when looking for the method, triggers the fn call `var_getattr` in `_dynamo/variables/user_defined.py` on [L273](a8f6b40e36/torch/_dynamo/variables/user_defined.py (L273)).  This path tries to build a variable tracker from the obj popped, which resolves to a trace_rule , and as a Tensor method, is resolved to `TorchInGraphFunctionVariable` on [L3767](a8f6b40e36/torch/_dynamo/trace_rules.py (L3767))

So, one straightforward option is to check whether the fn is an inplace_view on an input tensor in `torch.py` when we resolve the `__call__` function for the `TorchInGraphFunctionVariable` instead, which resolves the bug by providing a graph break.
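A hedged repro sketch of the two call paths discussed above (input shape is illustrative); with this change both should graph-break and fall back rather than crash:

```python
import torch

@torch.compile(backend="eager")
def via_self(x):
    x.unsqueeze_(0)                 # resolved via TensorVariable -> graph break
    return x

@torch.compile(backend="eager")
def via_class(x):
    torch.Tensor.unsqueeze_(x, 0)   # resolved via TorchInGraphFunctionVariable -> used to crash
    return x

print(via_self(torch.randn(3)).shape)   # torch.Size([1, 3])
print(via_class(torch.randn(3)).shape)  # torch.Size([1, 3])
```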

### Test
```
pytest test/dynamo/test_functions.py::FunctionTests::test_unsqueeze_inplace
```

Results in
```
Running 1 items in this shard

test/dynamo/test_functions.py .                                                                                                                                                                    [100%]

=========================================================================================== 1 passed in 9.16s ==========================================================================================
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150573
Approved by: https://github.com/anijain2305
2025-04-04 01:58:34 +00:00
1979a409e9 Make CompileEventLogger more defensive w.r.t to AOTAutogradCache and FXGraphCache (#150423)
This PR makes it so that we don't crash due to logging if we invoke AOTAutogradCache/FXGraphCache without using dynamo. This is preparation for supporting certain vLLM use cases where they store graph modules and have special handling in conjunction with the caches.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150423
Approved by: https://github.com/oulgen
2025-04-04 01:55:13 +00:00
f9f6c080d8 support guard or false/true in user code and add tests (#150178)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150178
Approved by: https://github.com/pianpwk
2025-04-04 01:19:14 +00:00
d0026fa138 [ROCm][TunableOp] Fix UT race condition and reduce UT duration. (#150463)
This PR fixes two race conditions that occur when UT tests are run:
- In a particular order within a single shard.
- Concurrently in multiple shards. Each test now gets a unique filename that depends on the test name.

There were two other minor improvements to the UTs:
- matmul_offline_mgpu could occasionally fail if run on 8 GPUs. Criteria was relaxed.
- bmm_tunableop_rocm checks that the rotating buffer is not zero. Otherwise, the test is not useful.

Additionally, several UTs took over 1 minute to run. Their duration was reduced by a combination of setting max tuning iterations to one, setting the rotating buffer size to zero, and/or reducing the matrix dimensions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150463
Approved by: https://github.com/jeffdaily
2025-04-04 01:12:03 +00:00
1bc2b2b12a bound sympy accuracy (#150383)
Differential Revision: D72215735

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150383
Approved by: https://github.com/pianpwk
2025-04-04 00:15:32 +00:00
b0e28f60df Revert "add unit test for preferred_blas_library settings (#150581)"
This reverts commit 781d28e2655f88ae2fef827ed110f22ed553a0ab.

Reverted https://github.com/pytorch/pytorch/pull/150581 on behalf of https://github.com/clee2000 due to new test broken internally D72395624 ([comment](https://github.com/pytorch/pytorch/pull/150581#issuecomment-2777228731))
2025-04-03 23:51:49 +00:00
1ab6c4ff04 [Codemod][AddExplicitStrictExportForTrainingInferenceArg] caffe2/ (#149595)
internal diff: D71497480

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149595
Approved by: https://github.com/Skylion007
2025-04-03 23:50:13 +00:00
8878289f89 [aten] 8 bytes aligned vector loads for bf16 and fp16 dtypes in torch.cat (#150233)
Enable aligned vector loading for 2-byte datatypes in torch.cat. Specifically:
1. reduce the vector length to 8 bytes for 2-byte types (fp16, bf16, etc.)
2. enable it through a conditional template

The reason 8-byte vector loading was chosen for fp16 and bf16:
a 16-byte load results in heavier register overhead (i.e. 4 registers per load for fp32 -> 8 registers per load for fp16). Therefore, to employ the benefits of vectorized loading, we reduced ALIGNED_VEC_LOAD_BYTES to 8 for fp16 and bf16.

### perf testing:

before:
```
torch-cat-D1-30108-D2-624-D3-772-dtype-torch.float32:
         B  pt_eager      copy
0    100.0  0.022621  0.036162
1   1000.0  0.133616  0.207051
2  10000.0  1.326848  1.848768
3  20000.0  2.744544  3.692128
torch-cat-D1-30108-D2-624-D3-772-dtype-torch.bfloat16:
         B  pt_eager      copy
0    100.0  0.022434  0.035477
1   1000.0  0.140608  0.144518
2  10000.0  1.303792  1.229584
3  20000.0  2.668288  2.436160
```

after:
```
torch-cat-D1-30108-D2-624-D3-772-dtype-torch.float32:
         B  pt_eager      copy
0    100.0  0.022608  0.036328
1   1000.0  0.133861  0.207399
2  10000.0  1.325120  1.847136
3  20000.0  2.726528  3.693184
torch-cat-D1-30108-D2-624-D3-772-dtype-torch.bfloat16:
         B  pt_eager      copy
0    100.0  0.019942  0.035482
1   1000.0  0.084858  0.144544
2  10000.0  0.924384  1.230672
3  20000.0  1.944448  2.436480

```

### bw analysis:
BW on fp16/bf16 increased by 40%-50% for large tensors.

before:
```
Bandwidth (GB/s) for ((16384, 16384), 1) int8;fp16;fp32;int32;fp64;long|869.87|1382.74|1956.46|1952.73|1969.03|1963.66
Bandwidth (GB/s) for ((4194304,), 0) int8;fp16;fp32;int32;fp64;long|568.43|926.53|1589.20|1567.52|1771.54|1783.68
Bandwidth (GB/s) for ((16777216,), 0) int8;fp16;fp32;int32;fp64;long|752.07|1269.50|1894.86|1900.85|1954.10|1955.08
Bandwidth (GB/s) for ((33554432,), 0) int8;fp16;fp32;int32;fp64;long|807.08|1354.69|1960.48|1962.45|1972.73|1973.85
Bandwidth (GB/s) for ((134217728,), 0) int8;fp16;fp32;int32;fp64;long|864.02|1398.02|1963.43|1955.32|1963.37|1969.96
```

after:
```
Bandwidth (GB/s) for ((16384, 16384), 1) int8;fp16;fp32;int32;fp64;long|873.08|1892.16|1954.35|1962.51|1962.03|1965.98
Bandwidth (GB/s) for ((4194304,), 0) int8;fp16;fp32;int32;fp64;long|575.13|1242.45|1576.37|1571.30|1769.94|1790.22
Bandwidth (GB/s) for ((16777216,), 0) int8;fp16;fp32;int32;fp64;long|742.92|1734.57|1887.99|1897.62|1940.99|1959.25
Bandwidth (GB/s) for ((33554432,), 0) int8;fp16;fp32;int32;fp64;long|802.60|1865.45|1952.64|1947.53|1974.47|1973.48
Bandwidth (GB/s) for ((134217728,), 0) int8;fp16;fp32;int32;fp64;long|865.32|1939.07|1965.72|1963.25|1969.06|1968.72
```

### Perf testing code:

```
# pyre-strict
from typing import List, Optional, Tuple

import click
import pandas as pd

import torch

# @manual=//triton:triton
import triton

# CUDA_VISIBLE_DEVICEs=7 buck2 run @mode/opt //scripts/zhaozhu:cat_bench

@click.command()
@click.option("--data-type", type=str, default="bf16")
@click.option("--return-result", type=bool, default=False)
def main(
    data_type: str,
    return_result: bool,
) -> Optional[Tuple[List[triton.testing.Benchmark], List[pd.DataFrame]]]:
    torch.backends.cudnn.allow_tf32 = True
    torch.backends.cuda.matmul.allow_tf32 = True
    if data_type == "fp32":
        dtype = torch.float32
    elif data_type == "fp16":
        dtype = torch.float16
    elif data_type == "bf16":
        dtype = torch.bfloat16
    else:
        raise ValueError(f"Unsupported data type: {data_type}.")

    D1 = int(torch.randint(low=10000, high=50000, size=(1,)).item())
    D2 = int(torch.randint(low=100, high=1000, size=(1,)).item())
    D3 = int(torch.randint(low=500, high=1000, size=(1,)).item())

    configs: List[triton.testing.Benchmark] = [
        triton.testing.Benchmark(
            x_names=["B"],
            x_vals=[100, 1000, 10000, 20000],
            line_arg="provider",
            line_vals=["pt_eager", "copy"],
            line_names=["pt_eager", "copy"],
            styles=[("blue", "-"), ("green", "-"), ("red", "-")],
            ylabel="ms",
            plot_name=f"torch-cat-D1-{D1}-D2-{D2}-D3-{D3}-dtype-{dtype}",
            args={
                "D1": D1,
                "D2": D2,
                "D3": D3,
                "dtype": dtype,
            },
        )
    ]

    @triton.testing.perf_report(configs)
    def bench_cat(
        B: int,
        D1: int,
        D2: int,
        D3: int,
        dtype: torch.dtype,
        provider: str,
    ) -> float:
        warmup = 10
        rep = 3

        tensors = []

        a = torch.empty(
            # (B, 30108),
            (B, D1),
            dtype=dtype,
            device=torch.device("cuda"),
        ).uniform_(-1.0, 1.0)
        b = torch.empty(
            # (B, 624),
            (B, D2),
            dtype=dtype,
            device=torch.device("cuda"),
        ).uniform_(-1.0, 1.0)
        c = torch.empty(
            # (B, 772),
            (B, D3),
            dtype=dtype,
            device=torch.device("cuda"),
        ).uniform_(-1.0, 1.0)

        tensors = [a, b, c]

        total_cols: int = int(a.shape[1] + b.shape[1] + c.shape[1])

        def torch_copy(
            tensors: List[torch.Tensor], is_inplace: bool = True
        ) -> torch.Tensor:
            f = torch.zeros([B, total_cols], dtype=dtype, device=torch.device("cuda"))
            col_idx = 0
            for t in tensors:
                temp = f[:, col_idx : col_idx + t.shape[1]]
                if is_inplace:
                    temp.copy_(t)
                else:
                    f[:, col_idx : col_idx + t.shape[1]] = t
                col_idx += t.shape[1]
            return f

        def torch_cat(tensors: List[torch.Tensor]) -> torch.Tensor:
            return torch.cat(tensors, dim=1)

        ref = torch_cat(tensors)
        real = torch_copy(tensors, is_inplace=False)

        torch.testing.assert_allclose(ref, real)

        if provider == "pt_eager":
            fn = lambda: torch_cat(tensors)  # noqa E731
            ms = triton.testing.do_bench(fn, warmup=warmup, rep=rep)
            return ms
        elif provider == "stack":

            def torch_stack(tensors: List[torch.Tensor]) -> torch.Tensor:
                return torch.stack(tensors, dim=1).view(-1, total_cols)

            fn = lambda: torch_stack(tensors)
            ms = triton.testing.do_bench(fn, warmup=warmup, rep=rep)
            return ms
        elif provider == "copy":
            fn = lambda: torch_copy(tensors)
            ms = triton.testing.do_bench(fn, warmup=warmup, rep=rep)
            return ms
        else:
            raise ValueError(f"unsupported provider: {provider}")

    df = bench_cat.run(print_data=True, return_df=return_result)

    if return_result:
        return configs, df

if __name__ == "__main__":
    main()
```

and bw analysis code is from: https://github.com/pytorch/pytorch/pull/102815?fbclid=IwZXh0bgNhZW0CMTEAAR1Rwclp_O1fknl1Litpm9GeY0ZZZovdCv8_kQfGf6Zy8LaoP9JhO0ZsutM_aem_BPCZEZda5OOMnzI9Mrlapg#issue-1737409146

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150233
Approved by: https://github.com/ngimel
2025-04-03 23:40:18 +00:00
5cf3029503 Remove unused rand call if not fallback to eager for rand (#147790)
Fixes #147171

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147790
Approved by: https://github.com/eellison
2025-04-03 23:27:03 +00:00
118e3862bc [dynamo] disable new test_assert_failure_in_generic_ctx_mgr internally (#150631)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150631
Approved by: https://github.com/clee2000
ghstack dependencies: #150471
2025-04-03 23:08:25 +00:00
a2dce42654 Split up cub-RadixSortPairs.cu to parallelize compilation (#148936)
Summary: `cub-RadixSortPairs.cu` has slow compilation times, especially on Windows. These changes split up the file into smaller components to allow each component to compile in parallel. On Windows, I observed a compile time drop from about 20 minutes to 6 minutes.

Differential Revision: D70539649

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148936
Approved by: https://github.com/suo, https://github.com/eqy, https://github.com/malfet
2025-04-03 23:04:21 +00:00
c0618a3957 Update commitlist.py instructions for the GitHub repo regime (#149535)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149535
Approved by: https://github.com/albanD
2025-04-03 22:43:00 +00:00
76994d48f4 [pytorch] add experimental TORCH_LIBRARY_THREAD_UNSAFE_LAZY_INIT (#150537)
Summary: Add an experimental feature to defer PyTorch library initialization cost to post startup. As noted, this feature is not thread safe; it requires the client to maintain thread safety at library load time.

Reviewed By: zou3519

Differential Revision: D71917841

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150537
Approved by: https://github.com/zou3519
2025-04-03 22:36:17 +00:00
9e55dae2a6 CUDA CachingHostAllocator tracks registrations to call correct free (#146520)
Allocations made with cudaHostRegister should be freed with the corresponding cudaHostUnregister, and similarly for cudaHostAlloc / cudaFreeHost. In test_cuda.py, the allocator config changes from test to test, but the cache is not emptied prior to changing the config. This results in the wrong free being called later. Unit test sharding hides this issue, but running test_cuda.py with a single shard will fail.

The following reproducer demonstrates the problem.

```C++
// Reproducer: pinned memory allocated with cudaHostAlloc is released with the
// wrong API, so cudaHostUnregister rejects the pointer.
#include <cassert>
#include <cstdlib>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    void *ptr;
    assert(cudaSuccess == cudaHostAlloc(&ptr, 1024, cudaHostAllocDefault));
    // Should be cudaFreeHost(ptr); cudaHostUnregister fails for this allocation.
    assert(cudaSuccess == cudaHostUnregister(ptr));
    std::free(ptr);
    return 0;
}
```

The above code results in the following failure because the ptr is an invalid argument to cudaHostUnregister.

```
a.out: test.cpp:53: int main(int, char**): Assertion `cudaSuccess == cudaHostUnregister(ptr)' failed.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146520
Approved by: https://github.com/ngimel
2025-04-03 22:33:48 +00:00
c6defa9443 [cuda] Add new faster gammabeta backward kernel (#148605) (Reapply with launch bounds) (#150625)
# Changes over the previous PR

This reverts commit 61a1f09 and adds `__launch_bounds__` to the kernel.

Previously I merged 114d404 that did not work on Blackwell because it consumed too many registers. It got reverted in 61a1f09. For more context see: https://github.com/pytorch/pytorch/issues/150266.

This PR reverts the revert (i.e. reapplies the original diff), with one additional line with `__launch_bounds__` added:

```
git diff HEAD^
diff --git a/aten/src/ATen/native/cuda/layer_norm_kernel.cu b/aten/src/ATen/native/cuda/layer_norm_kernel.cu
index 0d63a2f979c..3ce2c24c18e 100644
--- a/aten/src/ATen/native/cuda/layer_norm_kernel.cu
+++ b/aten/src/ATen/native/cuda/layer_norm_kernel.cu
@@ -657,6 +657,7 @@ bool aligned_grid
 >
 __global__
 void
+__launch_bounds__(block_dim_x * block_dim_y)
  GammaBetaBackwardCUDAKernelTemplate(
     int64_t M,
     int64_t N,
```

I managed to get a Blackwell machine and verified that the fix works. The fix was verified using this repro that I got from @drisspg

<details>
<summary> Repro script that fails on Blackwell </summary>

```
import torch
from torch.nn import init
# from transformer_nuggets import init_logging
# from transformer_nuggets.utils.benchmark import profiler
# from pathlib import Path

# init_logging()

class PermuteModule(torch.nn.Module):
    def __init__(self, permutation):
        super(PermuteModule, self).__init__()
        self.permutation = permutation
    def forward(self, x:torch.Tensor) -> torch.Tensor:
        assert len(x.shape) == len(self.permutation), f"Dimension mismatch! Unable to permute {len(x.shape)} dim input with a {len(self.permutation)} dim permutation!"
        return x.permute(*self.permutation)

def test(n_layers:int, conv_stride:int):
    _sequence = []
    for _ in range(n_layers):
        # Conv1d inputs are (N x C x L), LayerNorm expects (* x C). Dims must be permuted between modules.
        _sequence += [
            PermuteModule((0,2,1)),
            torch.nn.Conv1d(in_channels=512, out_channels=512, groups=1, kernel_size=9, dilation=1, stride=conv_stride, padding=0, bias=False),
            PermuteModule((0,2,1)),
            torch.nn.LayerNorm(512),
            torch.nn.ReLU()
        ]
    model = torch.nn.Sequential(*_sequence).to(device="cuda")
    data = torch.randn((100,2048,512), device="cuda")
    out = model(data)
    loss = torch.nn.functional.mse_loss(out, torch.rand_like(out))
    loss.backward()

torch.autograd.set_detect_anomaly(True)
print(f"Torch version: {torch.__version__}")

# with profiler(Path("conv")):
#     # print(f"layers=1, stride=1")
#     # test(n_layers=1, conv_stride=1)
#     # print(f"layers=2, stride=1")
#     # test(n_layers=2, conv_stride=1)
#     # print(f"layers=1, stride=2")
#     # test(n_layers=1, conv_stride=2)
#     print(f"layers=2, stride=2")
#     test(n_layers=2, conv_stride=2)

print(f"layers=2, stride=2")
test(n_layers=2, conv_stride=2)
# we will not reach this print statement.
print("DONE.")
```

</details>

I also re-ran my performance benchmark and found no regressions over the previous PR.

# Full description of the old PR

Original PR: https://github.com/pytorch/pytorch/pull/148605

This PR adds a new kernel for producing gamma and beta values for the backward pass in a performant way.

To test the performance against the baseline, I measured the backward pass of layernorm while sweeping over the following variables:

1. dtype in {half, float}
2. M in `2**k, 2**k - 1, 2**k + 1 for k in range(...)`
3. N in `2**k, 2**k - 1, 2**k + 1 for k in range(...)`
4. Whether we flush the L2 cache before running the backward pass
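
The sweep above can be reproduced in spirit with a rough timing harness like the one below (a sketch, not the author's benchmark script; the L2 buffer size, iteration count, and shapes are arbitrary choices):

```python
import torch

def bench_layernorm_backward(M, N, dtype, flush_l2=True, iters=50):
    # Time only LayerNorm's backward for a given (M, N, dtype), optionally
    # evicting L2 between iterations by overwriting a large dummy buffer.
    x = torch.randn(M, N, device="cuda", dtype=dtype, requires_grad=True)
    ln = torch.nn.LayerNorm(N, device="cuda", dtype=dtype)
    grad_out = torch.randn(M, N, device="cuda", dtype=dtype)
    l2_buffer = torch.empty(64 * 1024 * 1024, device="cuda", dtype=torch.float32)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    times = []
    for _ in range(iters):
        y = ln(x)
        if flush_l2:
            l2_buffer.zero_()  # touch 256 MB to push LayerNorm data out of L2
        torch.cuda.synchronize()
        start.record()
        y.backward(grad_out)
        end.record()
        torch.cuda.synchronize()
        times.append(start.elapsed_time(end))  # milliseconds
        x.grad = None
        ln.weight.grad = None
        ln.bias.grad = None
    return sorted(times)[len(times) // 2]  # median

for M, N in [(32, 2048), (2048, 32), (4096, 4096)]:
    print(M, N, bench_layernorm_backward(M, N, torch.half))
```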

Summary: The new code performs better than the old code, especially for powers of 2. For M >> N case, it performs very well (kernel itself can be 30x faster and the overall backward pass can be 5-10x faster).

In order to visualize results of the kernel for different values of M, N, and dtype, I wrote some code to generate a heatmap. The heatmap has N on the x-axis, M on the y-axis, and color-coded points where green shows a performance improvement and red shows a regression. For example, `m=32 n=2048 1.42x` in the heatmap indicates that the normalized shape had 32 elements, the product of the leading dimensions was 2048 elements, and the new kernel made the *backward pass* 1.42x faster than the old *backward pass*.

Important note: This heatmap shows the total backward pass time as seen by the user. The kernel time difference can be sometimes very large while the total backward pass time is not that high. For example, for dtype=torch.half, M=32 N=2048, flush_l2_cache=True case, the heatmap shows a speedup of 1.42x, while ncu tells me the new kernel is 2.5x faster than the old:

M=32 N=2048 dtype=half flush_l2=True Old Kernel NCU summary:
```
    ----------------------- ----------- ------------
    Metric Name             Metric Unit Metric Value
    ----------------------- ----------- ------------
    DRAM Frequency                  Ghz         1.59
    SM Frequency                    Ghz         1.35
    Elapsed Cycles                cycle       27,526
    Memory Throughput                 %         2.21
    DRAM Throughput                   %         0.54
    Duration                         us        20.42
    L1/TEX Cache Throughput           %         4.31
    L2 Cache Throughput               %         2.62
    SM Active Cycles              cycle     1,475.02
    Compute (SM) Throughput           %         0.29
    ----------------------- ----------- ------------
```

M=32 N=2048 dtype=half flush_l2=True New Kernel NCU summary:
```
    ----------------------- ----------- ------------
    Metric Name             Metric Unit Metric Value
    ----------------------- ----------- ------------
    DRAM Frequency                  Ghz         1.59
    SM Frequency                    Ghz         1.34
    Elapsed Cycles                cycle       10,920
    Memory Throughput                 %         5.64
    DRAM Throughput                   %         1.35
    Duration                         us         8.13
    L1/TEX Cache Throughput           %         1.92
    L2 Cache Throughput               %         6.89
    SM Active Cycles              cycle     3,554.41
    Compute (SM) Throughput           %         0.67
    ----------------------- ----------- ------------
```

Let's look at some rows from the heatmap. For dtype=float16 flush_l2_cache=True and when input shapes are powers of 2, we get the following:

<img width="1508" alt="image" src="https://github.com/user-attachments/assets/06179599-b2f0-4a45-8664-247a1067950b" />

There are 3 columns -- the first shows all data points, the second shows speedups only and the 3rd column shows regressions only. We can see that there are dramatic speedups for M >> N cases and the regressions are not that high (less than 1%, which could just be measurement noise). Here is a small guide I made:

![image](https://github.com/user-attachments/assets/90c26f7c-e3ad-46d2-a6ce-fe4b5fb3d738)

For dtype=float32, we get a similar chart:

<img width="1499" alt="image" src="https://github.com/user-attachments/assets/c4d31a76-03b0-426c-9114-e1bfad29b530" />

The new code performs especially well for m >> n cases, and also where m and n are small. The m >> n case is special because we run 2 reduction kernels back to back and parallelize in the "M" dimension (the older kernel only parallelized in the "N" dimension).

The new code can sometimes have regressions for non-powers of 2. That is because the old code was using block sizes of {16, 32} while we have `threads.x = 32`. For example when N=33, the old code would have 3 blocks and we will have 2 blocks. I wrote some code to specialize for this case, but I think it will add complexity and @ngimel mentioned that non-powers of 2 are rare enough.

I am including the regressions here for completeness' sake:

<img width="1500" alt="image" src="https://github.com/user-attachments/assets/31c17cfb-ed9b-4106-b9c8-5c359751f530" />

To see this better:

1. Click the image
2. Right click the expanded image and open in a new tab
3. Go to that tab and left click once to zoom in

If you want to see the full data, here it is:

![image](https://github.com/user-attachments/assets/54fb60c9-8c0c-4530-a1dd-79ecda1a69a1)

I also measured binary size and compile time since those are important for developers:

Binary size comparison

![image](https://github.com/user-attachments/assets/ceef5073-1036-47f6-b9dc-cea088beda51)

```
# Original
-rwxr-xr-x 1 ahmads users 307193112 Mar  6 08:46 ./torch/lib/libtorch_cuda.so

# This PR
-rwxr-xr-x 1 ahmads users 307193112 Mar  6 08:46 ./torch/lib/libtorch_cuda.so
```

The size difference is 302 kB, which is about a 0.1% increase.

Compile time difference:

```
# Original

real    0m10.931s
user    0m9.676s
sys     0m1.004s

# this PR

real    0m16.720s
user    0m15.514s
sys     0m1.066s

# Command I ran
time /usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DAT_PER_OPERATOR_HEADERS -DFLASHATTENTION_DISABLE_ALIBI -DFLASHATTENTION_DISABLE_SOFTCAP -DFLASH_NAMESPACE=pytorch_flash -DFMT_HEADER_ONLY=1 -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DTORCH_CUDA_BUILD_MAIN_LIB -DTORCH_CUDA_USE_NVTX3 -DUNFUSE_FMA -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_CUDA -DUSE_CUFILE -DUSE_DISTRIBUTED -DUSE_EXTERNAL_MZCRC -DUSE_FLASH_ATTENTION -DUSE_MEM_EFF_ATTENTION -DUSE_NCCL -DUSE_RPC -DUSE_TENSORPIPE -D_FILE_OFFSET_BITS=64 -Dtorch_cuda_EXPORTS -I/home/ahmads/personal/pytorch/build/aten/src -I/home/ahmads/personal/pytorch/aten/src -I/home/ahmads/personal/pytorch/build -I/home/ahmads/personal/pytorch -I/home/ahmads/personal/pytorch/cmake/../third_party/benchmark/include -I/home/ahmads/personal/pytorch/third_party/onnx -I/home/ahmads/personal/pytorch/build/third_party/onnx -I/home/ahmads/personal/pytorch/nlohmann -I/home/ahmads/personal/pytorch/third_party/flash-attention/csrc/flash_attn/src -I/home/ahmads/personal/pytorch/aten/src/THC -I/home/ahmads/personal/pytorch/aten/src/ATen/cuda -I/home/ahmads/personal/pytorch/third_party/fmt/include -I/home/ahmads/personal/pytorch/aten/src/ATen/../../../third_party/cutlass/include -I/home/ahmads/personal/pytorch/aten/src/ATen/../../../third_party/cutlass/tools/util/include -I/home/ahmads/personal/pytorch/build/caffe2/aten/src -I/home/ahmads/personal/pytorch/aten/src/ATen/.. -I/home/ahmads/personal/pytorch/build/nccl/include -I/home/ahmads/personal/pytorch/c10/cuda/../.. -I/home/ahmads/personal/pytorch/c10/.. -I/home/ahmads/personal/pytorch/third_party/tensorpipe -I/home/ahmads/personal/pytorch/build/third_party/tensorpipe -I/home/ahmads/personal/pytorch/third_party/tensorpipe/third_party/libnop/include -I/home/ahmads/personal/pytorch/torch/csrc/api -I/home/ahmads/personal/pytorch/torch/csrc/api/include -isystem /home/ahmads/personal/pytorch/build/third_party/gloo -isystem /home/ahmads/personal/pytorch/cmake/../third_party/gloo -isystem /home/ahmads/personal/pytorch/cmake/../third_party/tensorpipe/third_party/libuv/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/googletest/googlemock/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/googletest/googletest/include -isystem /home/ahmads/personal/pytorch/third_party/protobuf/src -isystem /home/ahmads/personal/pytorch/third_party/XNNPACK/include -isystem /home/ahmads/personal/pytorch/third_party/ittapi/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/eigen -isystem /usr/local/cuda/include -isystem /home/ahmads/personal/pytorch/third_party/ideep/mkl-dnn/include/oneapi/dnnl -isystem /home/ahmads/personal/pytorch/third_party/ideep/include -isystem /home/ahmads/personal/pytorch/INTERFACE -isystem /home/ahmads/personal/pytorch/third_party/nlohmann/include -isystem /home/ahmads/personal/pytorch/third_party/NVTX/c/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/cudnn_frontend/include -DLIBCUDACXX_ENABLE_SIMPLIFIED_COMPLEX_OPERATIONS -D_GLIBCXX_USE_CXX11_ABI=1 -Xfatbin -compress-all -DONNX_NAMESPACE=onnx_torch -gencode arch=compute_90,code=sm_90 -Xcudafe 
--diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda  -Wno-deprecated-gpu-targets --expt-extended-lambda -DCUB_WRAPPED_NAMESPACE=at_cuda_detail -DCUDA_HAS_FP16=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -O3 -DNDEBUG -std=c++17 -Xcompiler=-fPIC -DTORCH_USE_LIBUV -DCAFFE2_USE_GLOO -Xcompiler -Wall -Wextra -Wdeprecated -Wno-unused-parameter -Wno-missing-field-initializers -Wno-array-bounds -Wno-unknown-pragmas -Wno-strict-overflow -Wno-strict-aliasing -Wunused-function -Wunused-variable -Wunused-but-set-variable -Wno-maybe-uninitialized -MD -MT caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/layer_norm_kernel.cu.o -MF caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/layer_norm_kernel.cu.o.d -x cu -c /home/ahmads/personal/pytorch/aten/src/ATen/native/cuda/layer_norm_kernel.cu -o caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/layer_norm_kernel.cu.o

```

So the new PR adds about 6 seconds of compile time for this file.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150625
Approved by: https://github.com/ngimel
2025-04-03 22:07:43 +00:00
2abd81402f [validations] Run nccl version check on Linux only (#150635)
Follow-up to https://github.com/pytorch/pytorch/pull/150194 to disable the NCCL version print on OSes other than Linux.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150635
Approved by: https://github.com/clee2000
2025-04-03 22:06:58 +00:00
941090a791 Make sure torch.compiler._is_compiling_flag=True in aoti (#150588)
Summary: See internal Diff summary

Differential Revision: D72355449

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150588
Approved by: https://github.com/angelayi
2025-04-03 22:02:29 +00:00
5a654deb40 Revert "Enable C++ dynamic shape guards by default (#140756)"
This reverts commit c1d503529d23f33bc0819286df8d0ecbe31b559f.

Reverted https://github.com/pytorch/pytorch/pull/140756 on behalf of https://github.com/isuruf due to new test test_runtime_checks_large hangs on CI ([comment](https://github.com/pytorch/pytorch/pull/140756#issuecomment-2776979814))
2025-04-03 21:44:41 +00:00
d41c22b578 Revert "[fx] Move Node._prepend/Node._remove_from_list to C++ (#148261)" (#150542)
Reverts #148261 due to possible memory leak

This reverts commit 5d4e7d58b42623a9024a84f0050967ff0318dcdb.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150542
Approved by: https://github.com/clee2000
2025-04-03 21:15:38 +00:00
277369ac16 Move formulas on separate line in loss.py (#150565)
Move formulas onto separate lines in loss.py for better readability.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150565
Approved by: https://github.com/mikaylagawarecki
2025-04-03 20:47:35 +00:00
a3f9e04656 [export] Make aoti_call_delegate hop traceable (#148804)
Summary: The `aoti_call_delegate` hop now uses a stateless `original_gm` for tracing with fake tensors and the OSS AOTI Runner for running with real tensors

Differential Revision: D70738393

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148804
Approved by: https://github.com/SherlockNoMad
2025-04-03 20:44:31 +00:00
51da241c0a [aoti] Fix cannot determine truth value of Relation error when propagating unbacked symint in lowering (#150570)
Summary: Fix the "cannot determine truth value of Relation" error when propagating unbacked symints in lowering.

Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:test_aot_inductor -- -r aoti_runtime_asserts
```

Differential Revision: D72331070

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150570
Approved by: https://github.com/angelayi, https://github.com/henryoier
2025-04-03 20:06:15 +00:00
c1d503529d Enable C++ dynamic shape guards by default (#140756)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140756
Approved by: https://github.com/anijain2305
ghstack dependencies: #149149, #149197, #149211
2025-04-03 20:03:52 +00:00
1843ad458d [Inductor] Cache CUDA compilation errors (#149716)
Summary: Add support for caching of CUDA (nvcc) compilation errors to codecache.py

Test Plan: CI ( for example Cutlass backend unit tests )

Reviewed By: ColinPeppler

Differential Revision: D71562040

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149716
Approved by: https://github.com/ColinPeppler
2025-04-03 19:47:27 +00:00
3b02f795c5 Add torch._scaled_mm for CPU (#150410)
This PR duplicates https://github.com/pytorch/pytorch/pull/139975.

This PR adds torch._scaled_mm for the CPU backend.

_scaled_mm_out_cpu and _scaled_mm_cpu are newly added and included in the torch._scaled_mm CPU dispatch. We also add _scaled_mm_out_cpu_emulated as a fallback for platforms that cannot run FP8 matmul via oneDNN. This PR also updates the various FP8-related unit tests to support CPU.
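
As a rough sketch of calling the new CPU path (assuming it mirrors the CUDA overload's float8 dtypes, per-tensor float32 scales, and column-major second operand; the exact requirements may differ):

```python
import torch

M, K, N = 32, 64, 16
a = torch.randn(M, K).to(torch.float8_e4m3fn)
# make the second operand column-major, as the CUDA overload expects
b = torch.randn(K, N).to(torch.float8_e4m3fn).t().contiguous().t()
scale_a = torch.tensor(1.0)
scale_b = torch.tensor(1.0)

out = torch._scaled_mm(a, b, scale_a=scale_a, scale_b=scale_b, out_dtype=torch.bfloat16)
print(out.shape)  # torch.Size([32, 16])
```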

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150410
Approved by: https://github.com/atalman
2025-04-03 19:43:45 +00:00
96f35f55e2 update get start xpu document for v2.7 (#150397)
update get start xpu document for v2.7
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150397
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/atalman

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-04-03 18:17:08 +00:00
78d1165d76 [DTensor][tp] fix errors in FSDP+TP checkpointing test (#150354)
## Summary
remove the `tp_parallelize_plan` assignment that accidentally rewrites the previous assignments in `test_fsdp_dsd.py`.

## Test
`pytest test/distributed/checkpoint/fsdp/test_fsdp_dsd.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150354
Approved by: https://github.com/wconstab
2025-04-03 17:41:46 +00:00
5d36253a7d Refactoring: fix the python constant check (#150608)
As the title states.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150608
Approved by: https://github.com/Skylion007
2025-04-03 17:33:45 +00:00
fa0fdc0cca if blaslt fails, fall back to blas (#150147)
Fixes #150016.

This is implemented for both cublasLt and hipBLASLt. On failure, gemm_and_bias falls back to the unfused path, and an Lt gemm falls back to plain gemm even if the gemm preference is set to Lt.
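
A minimal sketch of exercising the Lt path (assuming the preference setter accepts "cublaslt"; on ROCm the analogous value would be "hipblaslt"):

```python
import torch

torch.backends.cuda.preferred_blas_library("cublaslt")

x = torch.randn(128, 64, device="cuda", dtype=torch.float16)
w = torch.randn(32, 64, device="cuda", dtype=torch.float16)
b = torch.randn(32, device="cuda", dtype=torch.float16)

# linear-with-bias goes through gemm_and_bias; with this change, a runtime Lt
# failure falls back to the unfused BLAS path instead of raising
y = torch.nn.functional.linear(x, w, b)
print(y.shape)  # torch.Size([128, 32])
```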

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150147
Approved by: https://github.com/malfet
2025-04-03 16:18:59 +00:00
5be5cfe4cb [inductor][autotune cache] add torch_key() to configs hash (#150494)
Summary:
**Context**: https://github.com/pytorch/pytorch/pull/150122 (D71982587 - let's call this "the WS diff") introduces "bc/fc-breaking" cache changes.

In particular, it introduces `num_consumer_groups` and adds it to the cached config. In versions of torch that include the WS diff, `num_consumer_groups` is treated as a class variable on a triton.Config object (i.e. `triton.Config({..kwargs..}, num_consumer_groups=num_consumer_groups, ...`). And in versions of torch that don't include the WS diff, you generally don't expect to see this kwarg.

But if a program is run with WS-torch (i.e. torch with the WS diff), and the same program is later run with non-WS-torch, then non-WS-torch will find this autotune cache entry and interpret `num_consumer_groups` as a kwarg, because there's no special handling for num_consumer_groups in that version of torch. The program then crashes with a Triton failure message.

**The fix**: add the torch version / torch key into the hash, so that any changes to inductor will invalidate the cache (ensuring that other changes to triton_heuristics won't cause these bc/fc issues).

Test Plan: D72285868 (or https://gist.github.com/davidberard98/2ea697eb550c94d0d1948fedb5c5c7d8, but this doesn't repro in OSS because this version of warp specialization is not available in oss triton) can repro the failure, and the failure is fixed after this PR is patched.

Also, added a test in test/inductor/test_codecache.py which verifies that there's no cache hit if the torch_key changes (and verified that without the functional changes in this PR, the test fails).

Differential Revision: D72285303

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150494
Approved by: https://github.com/oulgen
2025-04-03 16:01:57 +00:00
440c07e56a Fix detection of GPU multicast (#150563)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150563
Approved by: https://github.com/kwen2501
2025-04-03 15:31:15 +00:00
5314a6fe82 [export] Fix deserialization issue (#150515)
An internal model was serialized in 2023, and is now breaking while loading with the following error:
```
  File "<eval_with_key>.1675", line 4
    def forward(self, arg1163_1, arg1164_1, , arg1166_1, , arg1168_1, arg1169_1, arg1170_1, , arg1172_1, arg1173_1, arg1174_1, arg1175_1, arg1176_1, arg1177_1, arg1178_1, arg1179_1, arg1180_1, arg1181_1, arg1182_1, arg1183_1, arg1184_1, arg1185_1, arg1186_1, arg1187_1, arg1188_1, arg1189_1, arg1190_1, arg1191_1, arg1192_1, arg1193_1, arg1194_1, arg1195_1, arg1196_1, arg1197_1, arg1198_1, arg1199_1, arg1200_1, arg1201_1, arg1202_1, arg1203_1, arg1204_1, arg1205_1, arg1206_1, arg1207_1, arg1208_1, arg1209_1, arg1210_1, arg1211_1, arg1212_1, arg1213_1, arg1214_1, arg1215_1, arg1216_1, , arg1218_1, arg1219_1, arg1220_1, arg1221_1, arg1222_1, arg1223_1, arg1224_1, , arg1226_1, arg1227_1, arg1228_1, , arg1230_1, , , , , , , , , , , , , , , ):
                                            ^
SyntaxError: invalid syntax
```

The syntax errors are due to inputs that are `None` when exporting. Prior to the changes in https://github.com/pytorch/pytorch/pull/123590 (landed 4/2024), input specs for none inputs looked like `InputSpec(userInput=UserInputSpec(arg=Argument(asNone=True)))`, and during deserialization when creating a node, we would just use a dummy name `arg`. After those changes, the input specs for none inputs look like `InputSpec(constantInput=InputToConstantInputSpec(name='y', value=ConstantValue(asNone=True)))`, and when creating a node we would use `y` as the name. However, the PR didn't handle the case of loading an old package which doesn't have this name, so it ended up putting empty names in the placeholder nodes.

This error was uncovered after https://github.com/pytorch/pytorch/pull/149717, where we now use the GraphModule's python codegen to run the UnflattenedModule instead of going through the interpreter path. The placeholder nodes having empty names caused the python codegen to fail.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150515
Approved by: https://github.com/yushangdi
2025-04-03 15:27:45 +00:00
a72b4eb806 Support windows in C++ shape guards (#149211)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149211
Approved by: https://github.com/anijain2305
ghstack dependencies: #149149, #149197
2025-04-03 14:42:08 +00:00
f9a7eac718 use python fallback if there are overflows (#149197)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149197
Approved by: https://github.com/anijain2305
ghstack dependencies: #149149
2025-04-03 14:39:03 +00:00
ff783f062a Fix shape guard failure to be valid python (#149149)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149149
Approved by: https://github.com/anijain2305
2025-04-03 14:36:17 +00:00
70b34a42c1 Add new dependences for gen_pyi.py (#150391)
As the title states.

When we update some functions in _torch_docs.py or _tensor_docs.py and run a command (like ``python setup.py develop``) to install the latest version, the docstring for the function we just changed is not updated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150391
Approved by: https://github.com/Skylion007, https://github.com/peterbell10
2025-04-03 14:18:18 +00:00
781d28e265 add unit test for preferred_blas_library settings (#150581)
Follow up to #150212 that was committed without a unit test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150581
Approved by: https://github.com/atalman
2025-04-03 13:27:50 +00:00
cbc901fac3 Implement raise ... from ... (#148766)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148766
Approved by: https://github.com/zou3519
2025-04-03 13:15:31 +00:00
e0d19cf6cc Enable weekly test for operator benchmark (#150502)
To regularly track the performance of the operator benchmark, enable the weekly test.

Hi, @huydhn, as you mentioned in https://github.com/pytorch/pytorch/pull/143733#issuecomment-2578317520, we could integrate the performance data from the weekly test into the OSS benchmark database for the dashboard.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150502
Approved by: https://github.com/huydhn
2025-04-03 12:17:19 +00:00
5d9c7f78e7 [fbcode]Removing @NoIntBaseDeprecated annotation in evaluation.thrift file (#150271)
Summary: #buildall

Test Plan:
```
buck test 'fbcode//mode/opt' fbcode//caffe2/torch/fb/training_toolkit/applications/bulk_eval/tests:evaluator_test -- --exact 'caffe2/torch/fb/training_toolkit/applications/bulk_eval/tests:evaluator_test - test_setup_evaluation_utils (caffe2.torch.fb.training_toolkit.applications.bulk_eval.tests.evaluator_test.EvaluatorTest)'
```

Differential Revision: D72028940

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150271
Approved by: https://github.com/huydhn
2025-04-03 12:01:59 +00:00
d4c30b4599 [AOTI][dashboard] Update how peak memory is measured (#150534)
Summary: In the dashboard measurement script, AOTI needs to run Eager first to register the output pytree, so the peak memory compression ratio on the dashboard is always close to 1. Update the AOTI run to use an extra warmup run, so the peak memory compression ratio measures the result at run time instead of at compile time.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150534
Approved by: https://github.com/yushangdi
2025-04-03 12:01:43 +00:00
6fa1b17195 ROCm: Add trailing comma for consistency in gfx architecture list (#150250)
Adding trailing comma for consistency.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150250
Approved by: https://github.com/petrex, https://github.com/jeffdaily, https://github.com/cyyever
2025-04-03 10:58:48 +00:00
e6e07ec1cf [ROCm] code cleanup of architecture checks (#150473)
This PR replaces several calls to `at::cuda::getCurrentDeviceProperties()->gcnArchName` and `at::cuda::getDeviceProperties(device_index)->gcnArchName` when checking to see if the GPU architecture is in a certain list.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150473
Approved by: https://github.com/jeffdaily, https://github.com/cyyever
2025-04-03 09:51:06 +00:00
9e106019f6 [XPU] Add an implict conversion from XPUStream to sycl::queue* (#148646)
# Motivation

Currently, in Pytorch XPU, `cudaStream_t` is mapped to `sycl::queue&`, so an implicit cast from `XPUStream` to `sycl::queue&` is provided just like `CUDAStream` has an implicit cast to `cudaStream_t`.

But on the SYCLomatic side, we migrate `cudaStream_t` to `sycl::queue*` rather than `sycl::queue&`. (One reason is that `cudaStream_t` is actually a pointer, so users can do anything with that value. Another is that the early `sycl::queue` was not implemented as a pointer, so copying by value is not desirable.)

Without this PR:
```
cudaStream_t a = getCurrentCUDAStream();
cudaStream_t b = getCurrentCUDAStream().stream();
```
needs to be migrated to:
```
queue_ptr a = &(sycl::queue&)getCurrentXPUStream();
queue_ptr b = &(getCurrentXPUStream().queue());
```
With this PR:
```
queue_ptr a = getCurrentXPUStream();
queue_ptr b = &(getCurrentXPUStream().queue());
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148646
Approved by: https://github.com/guangyey, https://github.com/EikanWang
2025-04-03 08:12:38 +00:00
c067127d47 Ensure cuda_dlink_post_cflags are quoted as well (#150151)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150151
Approved by: https://github.com/janeyx99
2025-04-03 06:50:22 +00:00
fc674b45d4 [c10d] Add logging for desync debug report (#150513)
Summary: We want to add logging to first understand the distribution of desync debug reports.

Test Plan: Test with logger staging

Differential Revision: D72249281

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150513
Approved by: https://github.com/kwen2501
2025-04-03 06:42:06 +00:00
90ddb33141 [export] specialize for aten.to (#149235)
Changes decomposition behavior of `aten.to` to respect the aliasing/non-aliasing behavior in eager, and to specialize to the input/conversion dtype & device.

Before change: we always decompose `aten.to` into `_to_copy`, regardless of aliasing behavior. This leads us to ban mutations on the result of `_to_copy` when aliased, since we can't guarantee correct program semantics. This meant users had to explicitly call `.clone()` before mutating. In the special cases where we don’t ban mutations (e.g. dtype conversion), we add runtime assertions on the input & conversion dtype/devices in the decomposed program (see https://github.com/pytorch/pytorch/pull/142420).

After change: we decompose to the aliasing/non-aliasing behavior that matches eager, allowing mutations in all cases. We also add dtype/device assertions for all `aten.to` ops, starting in the pre-dispatch graph, basically specializing the program to the dtype/devices.
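
To illustrate (a minimal sketch, not from the PR): a program that converts dtype with `aten.to` and then mutates the result can now be exported without an explicit `.clone()`, and the exported program is specialized to the input/conversion dtypes:

```python
import torch

class M(torch.nn.Module):
    def forward(self, x):
        y = x.to(torch.float32)  # non-aliasing here since the dtype changes
        y.add_(1)                # mutating the converted tensor is now allowed
        return y

ep = torch.export.export(M(), (torch.randn(4, dtype=torch.float64),))
print(ep.graph)
```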

Differential Revision: D71229547

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149235
Approved by: https://github.com/tugsbayasgalan
2025-04-03 05:20:10 +00:00
2e5d95a082 [FlexAttention] Remove dead code (#150575)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150575
Approved by: https://github.com/Chillee, https://github.com/BoyuanFeng
2025-04-03 01:46:19 +00:00
77dca3947e [aoti] make a check function for each input (#150553)
Summary: make a check function for each input to avoid a "too large to optimize" error on `__check_inputs_outputs`.

Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:test_aot_inductor -- -r runtime_checks
```

Differential Revision: D72286280

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150553
Approved by: https://github.com/desertfire
2025-04-03 00:55:35 +00:00
13f48197d2 Add Chillee as core reviewer (#150579)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150579
Approved by: https://github.com/albanD, https://github.com/drisspg, https://github.com/malfet
2025-04-03 00:40:06 +00:00
f363fe616d [AOTInductor] Fix autotuning code's codegen (#150522)
Summary:
Codegen used to generate tmp_arg_{index} as temporary args, where {index} is the position of the argument at the call site.
We changed the codegen logic so that previously generated samples can be reused and are only deleted once an arg is no longer used. With this change, we need to make {index} unique, since different functions could otherwise reuse the same "tmp_arg_{index}" name string for different args.

Test Plan: `python test/inductor/test_aot_inductor.py -k test_autotuning_args_reuse`

Differential Revision: D72297084

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150522
Approved by: https://github.com/desertfire, https://github.com/22quinn
2025-04-03 00:08:19 +00:00
24f50653c8 fix bug in logging code (#150518)
Fixes https://github.com/pytorch/pytorch/issues/150379

```python
>>> key = "aten._int_mm_1_2_3"
>>> m, n, k = key.split("_")[-3:]
>>> m, n, k
('1', '2', '3')
>>> name = "_".join(key.split("_")[:-3])
>>> name
'aten._int_mm'
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150518
Approved by: https://github.com/xmfan
2025-04-02 23:39:06 +00:00
61a1f09b5b Revert "[cuda] Add new faster gammabeta backward kernel (#148605)"
This reverts commit 114d404b0720e8073748690faeb96449e5c0b229.

Reverted https://github.com/pytorch/pytorch/pull/148605 on behalf of https://github.com/drisspg due to See https://github.com/pytorch/pytorch/issues/150266#issuecomment-2773907902 for more details ([comment](https://github.com/pytorch/pytorch/pull/148605#issuecomment-2773928838))
2025-04-02 23:14:11 +00:00
de15ef0ee8 [invoke_subgraph] Force grad_outs to be contiguous at tracing time (#150561)
I am unable to come up with a test case. With this change, many end-to-end tests pass that previously failed with a ReshapeError at https://ossci-raw-job-status.s3.amazonaws.com/log/39717218372

![image](https://github.com/user-attachments/assets/8509b485-3897-4538-968b-bbe05af63a59)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150561
Approved by: https://github.com/zou3519, https://github.com/bdhirsh
ghstack dependencies: #150082, #150450, #150486, #150556
2025-04-02 22:59:08 +00:00
0198e44f37 Update torch-xpu-ops commit pin to 98c808d (#150554)
Update the torch-xpu-ops commit to [98c808dea6de7330c415aa777d6921944cf79887](98c808dea6), which includes:

- Fixes #150001 by removing pre-CXX11 ABI logic from build script for XPU
- Fixes #150430
- Fixes XCCL build issue caused by PR #150398

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150554
Approved by: https://github.com/EikanWang, https://github.com/malfet
2025-04-02 22:42:18 +00:00
8667a00979 Add stride + dtype to autotune results (#150419)
Add stride/dtype info to autotune gemm results. New output header:

`AUTOTUNE mm(1024x1024, 1024x7680)`
`strides: [1, 1024], [7680, 1]`
`dtypes: torch.bfloat16, torch.bfloat16`

Differential Revision: [D72253313](https://our.internmc.facebook.com/intern/diff/D72253313)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150419
Approved by: https://github.com/eellison
2025-04-02 22:36:38 +00:00
0bacb90a9c [invoke_subgraph][min-cut partitioner] Fix bug to use the correct root module (#150556)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150556
Approved by: https://github.com/bdhirsh, https://github.com/zou3519
ghstack dependencies: #150082, #150450, #150486
2025-04-02 22:35:00 +00:00
a677b491c9 [Profiler] Fix Empty C Call Queue (#150370)
Summary:
My commandeer of https://github.com/pytorch/pytorch/pull/150102

Based on the description of that PR, it seems we need to add C calls for each starting Python event with a callable, so that when tracing exits there is a matching enter for any given exit. At worst this adds some unnecessary events, but it prevents segfaults/failures. My PR just cleans up some refcount implementation details and logging.

Contributors: @arjun-choudhry

Test Plan: Ran resnet test internally. Will check CI and ask reviewers to make sure it resolves their issues.

Differential Revision: D72207570

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150370
Approved by: https://github.com/aaronenyeshi
2025-04-02 22:25:46 +00:00
74aa9f571c ci: Use cache / progress when local docker build (#150551)
It's a bit annoying to work on these locally when the cache / progress isn't being used,
so let's set things up so that those flags are only applied when running in CI.

`${CI}` is a default environment variable that's defined by actions
itself.

See https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/store-information-in-variables#default-environment-variables

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150551
Approved by: https://github.com/clee2000, https://github.com/ZainRizvi, https://github.com/atalman
2025-04-02 22:08:57 +00:00
1017927c83 multidimensional slicing (#150104)
Differential Revision: D71962884

Fixes #150057

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150104
Approved by: https://github.com/angelayi
2025-04-02 20:57:16 +00:00
bb98749230 [dynamo] Always trace into tensor subclass __torch_function__ (#149792)
This patch effectively ignores traceable_tensor_subclasses, allowing
Dynamo to always try tracing into the `__torch_function__` of tensor
subclass. This helps us with 2 things:
1. allowing users to directly benefit from better compilation of tensor
   subclass, by just upgrading pytorch, without having to change legacy
   library code (see earlier patches in the stack for examples).
2. potentially exposing more issues in compiling tensor subclass, so we
   can get signals and improve them.

As a consequence, it exposed and fixes 2 subtle bugs:
1. In `build_torch_function_fn`, we could get
   `torch._C._disabled_torch_function_impl` because we have a
   `Parameter` subclass without `__torch_function__` override or if we
   have a tensor subclass with `__torch_dispatch__` override. We graph
   break on this for now, and plan to add support -- the logic for
   simulating `torch._C._disabled_torch_function_impl` is already in
   `SuperVariable`, we just need to reuse it.
2. Sometimes we create `SyntheticLocalSource` and need to remove all the
   guards installed on it, but we only removed the ones whose source
   _is_ the created synthetic source `s`, but forgot about chained
   source like `s.foo`, this showed up as
   `SYNTHETIC_LOCAL['tmp_0'].__torch_function__.__func__`.
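
As an illustration of what this enables, here is a minimal sketch (not from the PR) of a tensor subclass with a `__torch_function__` override passed into a compiled function; with this patch Dynamo attempts to trace into the override rather than requiring the subclass to be registered in traceable_tensor_subclasses:

```python
import torch

class LoggingTensor(torch.Tensor):
    # minimal subclass: defers to the default implementation, so results
    # keep the LoggingTensor type
    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        return super().__torch_function__(func, types, args, kwargs)

@torch.compile
def f(x):
    return x.sin() + 1

x = torch.randn(4).as_subclass(LoggingTensor)
print(type(f(x)))  # LoggingTensor
```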

Differential Revision: [D71906141](https://our.internmc.facebook.com/intern/diff/D71906141)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149792
Approved by: https://github.com/jansel, https://github.com/mlazos
ghstack dependencies: #149482, #149483, #149484
2025-04-02 20:57:00 +00:00
3463ea1059 [dynamo] Support tensor subclass with overriden tensor methods and properties (#149484)
This fixes most of the "torch.compile X tensor-subclass" issues
encountered in https://github.com/city96/ComfyUI-GGUF/issues/118. The
relevant tensor subclass definition is here:
298192ed60/ops.py (L18-L65).

A few things to note about the tensor subclass:
1. it overrides a lot of the `torch.Tensor` methods (e.g., `to`,
   `clone`), so this patch updates `TensorWithTFOverrideVariable.var_getattr`
   to support that.
2. it overrides the `shape` property, so this patch updates
   `TensorWithTFOverrideVariable.var_getattr` to support property as well.
3. it has calls to `torch.Tensor.size`, which returns `torch.Size`,
   which gets reconstructed in `torch.Tensor.__torch_function__`, so
   this patch adds support for calling `torch.Size(...)` on non-constant
   inputs.

Differential Revision: [D71906137](https://our.internmc.facebook.com/intern/diff/D71906137)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149484
Approved by: https://github.com/jansel, https://github.com/mlazos
ghstack dependencies: #149482, #149483
2025-04-02 20:57:00 +00:00
0d4dbfd9ed [dynamo] Support torch.Tensor._make_subclass and tracing through tensor subclass __new__ (#149483)
This builds off the previous patch in the stack, and fully fixes
https://github.com/huggingface/diffusers/issues/10795.

Essentially, tensor subclass in the issue uses
`torch.Tensor._make_subclass`, which has a pretty simple shallow-copy
plus type change semantics, as far as Dynamo is concerned. So this patch
adds a polyfill for it.

As a result, this allows us to trace through many user-defined `__new__`
in tensor subclass (it's similar to how we trace through user-defined
`__new__` for `UserDefinedClassVariable`), so this patch also faithfully
trace through these `__new__` methods.
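
For reference, a minimal sketch (hypothetical, not from the PR) of the `_make_subclass` pattern in a user-defined `__new__` that Dynamo can now trace:

```python
import torch

class WrapperTensor(torch.Tensor):
    def __new__(cls, data):
        # shallow copy + type change; Dynamo now handles this via a polyfill
        return torch.Tensor._make_subclass(cls, data, data.requires_grad)

@torch.compile
def f(x):
    w = WrapperTensor(x)
    return w * 2

print(type(f(torch.randn(3))))  # WrapperTensor
```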

Differential Revision: [D71906139](https://our.internmc.facebook.com/intern/diff/D71906139)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149483
Approved by: https://github.com/zou3519, https://github.com/mlazos
ghstack dependencies: #149482
2025-04-02 20:56:52 +00:00
33535b3eee [dynamo] Support Tensor subclass that has dynamic attributes or calls Parameter.__torch_function__ (#149482)
This fixes most of https://github.com/huggingface/diffusers/issues/10795,
except for `torch.Tensor._make_subclass`, which will be fixed in a
subsequent patch.

The relevant tensor subclass from the aforementioned issue is defined
here: fbf6b856cc/src/diffusers/quantizers/gguf/utils.py (L398-L435).

There are two things to note about the tensor subclass:
1. it calls `super().__torch_function__`, which is
   `torch._C._disabled_torch_function_impl`, so this patch updates
   `SuperVariable.call_method` to handle it (we can't do a simpler
   polyfill due to some bug with `var_getattr` raising
   `NotImplementedError`, which forgot to restore symbolic context).
2. it sets and reads attributes (`quant_type`), and
   defines new methods (`as_data`), so this patch adds support for those.
3. it has a `__init__`, which Dynamo needs to trace through in
   `TensorSubclassVariable.call_function`.

Differential Revision: [D71906140](https://our.internmc.facebook.com/intern/diff/D71906140)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149482
Approved by: https://github.com/jansel, https://github.com/mlazos
2025-04-02 20:56:43 +00:00
85df0dc246 [dynamo] emit only 1 graph break message on unrecoverable data-dependent assert fail (#150471)
Addresses https://fb.workplace.com/groups/1075192433118967/permalink/1625299684774903/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150471
Approved by: https://github.com/jansel
2025-04-02 20:42:43 +00:00
a8f6b40e36 [inductor] skip non-trivial tiling if unbacked symints are present (#150225)
Take two of https://github.com/pytorch/pytorch/pull/149994.

This time we just skip `convert_tiling_to_3d` and `candidate_tilings` if unbacked symints are present.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150225
Approved by: https://github.com/eellison
2025-04-02 20:36:02 +00:00
03c879d59b Revert "[dynamo] Support Tensor subclass that has dynamic attributes or calls Parameter.__torch_function__ (#149482)"
This reverts commit 98453c135a7778d12ff881d8b0a717257be9fc38.

Reverted https://github.com/pytorch/pytorch/pull/149482 on behalf of https://github.com/malfet due to Broke trunk, see b03c42109c/1 ([comment](https://github.com/pytorch/pytorch/pull/149482#issuecomment-2773650522))
2025-04-02 20:30:33 +00:00
18908c8ced Revert "[dynamo] Support torch.Tensor._make_subclass and tracing through tensor subclass __new__ (#149483)"
This reverts commit 203e1d681d1a4eb7794dfaeaebfa497242dde17d.

Reverted https://github.com/pytorch/pytorch/pull/149483 on behalf of https://github.com/malfet due to Broke trunk, see b03c42109c/1 ([comment](https://github.com/pytorch/pytorch/pull/149482#issuecomment-2773650522))
2025-04-02 20:30:33 +00:00
01411c739f Revert "[dynamo] Support tensor subclass with overriden tensor methods and properties (#149484)"
This reverts commit 7e53c58687482d58461e1dd8e09f59a9daf8f7b3.

Reverted https://github.com/pytorch/pytorch/pull/149484 on behalf of https://github.com/malfet due to Broke trunk, see b03c42109c/1 ([comment](https://github.com/pytorch/pytorch/pull/149482#issuecomment-2773650522))
2025-04-02 20:30:33 +00:00
e545567340 Revert "[dynamo] Always trace into tensor subclass __torch_function__ (#149792)"
This reverts commit 238109ad3245c5485f9e83b4b02d258b09329042.

Reverted https://github.com/pytorch/pytorch/pull/149792 on behalf of https://github.com/malfet due to Broke trunk, see b03c42109c/1 ([comment](https://github.com/pytorch/pytorch/pull/149482#issuecomment-2773650522))
2025-04-02 20:30:32 +00:00
af5c1b96e2 ci: Set minimum cmake version for halide build (#150560)
This was failing due to pybind being strict about their cmake version
requirements.

This resolves errors like:
```
652.1   Compatibility with CMake < 3.5 has been removed from CMake.
652.1
652.1   Update the VERSION argument <min> value.  Or, use the <min>...<max> syntax
652.1   to tell CMake that the project requires at least <min> but has been updated
652.1   to work with policies introduced by <max> or earlier.
652.1
652.1   Or, add -DCMAKE_POLICY_VERSION_MINIMUM=3.5 to try configuring anyway.
652.1
652.1
652.1 -- Configuring incomplete, errors occurred!
```

Tested this locally with the following command:

```
./build.sh pytorch-linux-jammy-py3.12-halide -t 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-jammy-py3.12-halide:8a8989876ff1aa1d5b0e465177afebbc7a9da921
```

Closes https://github.com/pytorch/pytorch/issues/150420

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150560
Approved by: https://github.com/clee2000, https://github.com/ZainRizvi, https://github.com/atalman, https://github.com/malfet
2025-04-02 20:27:24 +00:00
b03c42109c Proactively remove CompiledTritonKernels before loading from cache/starting inductor compile (#150453)
We're still running into this issue intermittently and it's hard to debug, so I thought a more aggressive cache-clearing strategy may fix it as a stopgap until we can statically launch CUDA kernels and avoid some of this.

Differential Revision: [D72257973](https://our.internmc.facebook.com/intern/diff/D72257973/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150453
Approved by: https://github.com/oulgen
2025-04-02 20:08:32 +00:00
22030efb64 expect fail scan test in sigmoid (#150475)
Summary: as titled.

Test Plan: see modified test.

Differential Revision: D72271976

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150475
Approved by: https://github.com/zhxchen17
2025-04-02 19:56:50 +00:00
d4298f2136 [CI] Use system nccl in build (#150226)
Install nccl in the docker image (which is already being done in some docker images), and use USE_SYSTEM_NCCL=1 in CI builds

Building nccl takes some time and doesn't happen in parallel, so there's less benefit in switching to a bigger runner and using more processes.

The other changes in this PR exist because there are an install_cuda script and an install_cuda_aarch64 script that both build nccl from source and define their own pins for the nccl version.  There are also .ci/docker/nccl-cu11.txt and cu12.txt files that define the pins, and this is an attempt to unify them.  Unfortunately this leads to a lot of files needing to be copied into the docker build.

This generally seems to increase docker pull times by <1 min (P1768456379), but it's hard to tell what the real increase is.
15761 mib -> 16221 [linux-focal-cuda11.8-py3.10-gcc9 / test (distributed](https://github.com/pytorch/pytorch/actions/runs/14114171729/job/39545500161#logs)
`jq '[.layers[].size, .config.size] | add / 1024 / 1024'`

Example 6eb3c2e282 (39520169577-box)
![image](https://github.com/user-attachments/assets/d44ef415-6e48-41ef-ac83-f19bab47560c)

TODO:
* Figure out a way to verify that nccl was built and works properly when it is expected (this time I just checked torch.distributed.is_nccl_available; see the snippet after this list)
* Merge the cusparse installation scripts
* Merge the cuda installation scripts
* Either split the nccl, cuda, and cusparse installations always, or make the always together in one bash script
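
A quick way to do the check mentioned in the TODO above:

```python
import torch.distributed as dist

# True only if this torch build includes NCCL support
print(dist.is_nccl_available())
```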

distributed/test_distributed_spawn
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150226
Approved by: https://github.com/seemethere, https://github.com/atalman
2025-04-02 19:42:43 +00:00
cb4cd6166e Address Cmake update issue in windows magma builds (#150549)
1. Fixes Cmake update error: https://github.com/pytorch/pytorch/actions/runs/14223930697/job/39858632864
```
CMake Error at CMakeLists.txt:1 (cmake_minimum_required):
  Compatibility with CMake < 3.5 has been removed from CMake.

  Update the VERSION argument <min> value.  Or, use the <min>...<max> syntax
  to tell CMake that the project requires at least <min> but has been updated
  to work with policies introduced by <max> or earlier.

  Or, add -DCMAKE_POLICY_VERSION_MINIMUM=3.5 to try configuring anyway.
```
2.  Removes deprecated CUDA 12.4 build
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150549
Approved by: https://github.com/clee2000
2025-04-02 19:13:44 +00:00
e62d958f02 [Inductor] Reland Merge Triton ScaledMM as epilogue to MM template #150045 (#150441)
Merges https://github.com/pytorch/pytorch/pull/150438 and https://github.com/pytorch/pytorch/pull/150045. https://github.com/pytorch/pytorch/pull/150045 had already landed, but was missing a change, which made it unable to land internally.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150441
Approved by: https://github.com/clee2000
2025-04-02 17:49:32 +00:00
238109ad32 [dynamo] Always trace into tensor subclass __torch_function__ (#149792)
This patch effectively ignores traceable_tensor_subclasses, allowing
Dynamo to always try tracing into the `__torch_function__` of tensor
subclass. This helps us with 2 things:
1. allowing users to directly benefit from better compilation of tensor
   subclass, by just upgrading pytorch, without having to change legacy
   library code (see earlier patches in the stack for examples).
2. potentially exposing more issues in compiling tensor subclass, so we
   can get signals and improve them.

As a consequence, it exposed and fixes 2 subtle bugs:
1. In `build_torch_function_fn`, we could get
   `torch._C._disabled_torch_function_impl` because we have a
   `Parameter` subclass without `__torch_function__` override or if we
   have a tensor subclass with `__torch_dispatch__` override. We graph
   break on this for now, and plan to add support -- the logic for
   simulating `torch._C._disabled_torch_function_impl` is already in
   `SuperVariable`, we just need to reuse it.
2. Sometimes we create `SyntheticLocalSource` and need to remove all the
   guards installed on it, but we only removed the ones whose source
   _is_ the created synthetic source `s`, but forgot about chained
   source like `s.foo`, this showed up as
   `SYNTHETIC_LOCAL['tmp_0'].__torch_function__.__func__`.

Differential Revision: [D71906141](https://our.internmc.facebook.com/intern/diff/D71906141)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149792
Approved by: https://github.com/jansel, https://github.com/mlazos
ghstack dependencies: #149482, #149483, #149484
2025-04-02 17:05:25 +00:00
7e53c58687 [dynamo] Support tensor subclass with overriden tensor methods and properties (#149484)
This fixes most of the "torch.compile X tensor-subclass" issues
encountered in https://github.com/city96/ComfyUI-GGUF/issues/118. The
relevant tensor subclass definition is here:
298192ed60/ops.py (L18-L65).

A few things to note about the tensor subclass:
1. it overrides a lot of the `torch.Tensor` methods (e.g., `to`,
   `clone`), so this patch updates `TensorWithTFOverrideVariable.var_getattr`
   to support that.
2. it overrides the `shape` property, so this patch updates
   `TensorWithTFOverrideVariable.var_getattr` to support property as well.
3. it has calls to `torch.Tensor.size`, which returns `torch.Size`,
   which gets reconstructed in `torch.Tensor.__torch_function__`, so
   this patch adds support for calling `torch.Size(...)` on non-constant
   inputs.

Differential Revision: [D71906137](https://our.internmc.facebook.com/intern/diff/D71906137)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149484
Approved by: https://github.com/jansel, https://github.com/mlazos
ghstack dependencies: #149482, #149483
2025-04-02 17:05:25 +00:00
203e1d681d [dynamo] Support torch.Tensor._make_subclass and tracing through tensor subclass __new__ (#149483)
This builds off the previous patch in the stack, and fully fixes
https://github.com/huggingface/diffusers/issues/10795.

Essentially, tensor subclass in the issue uses
`torch.Tensor._make_subclass`, which has a pretty simple shallow-copy
plus type change semantics, as far as Dynamo is concerned. So this patch
adds a polyfill for it.

As a result, this allows us to trace through many user-defined `__new__`
in tensor subclass (it's similar to how we trace through user-defined
`__new__` for `UserDefinedClassVariable`), so this patch also faithfully
trace through these `__new__` methods.

Differential Revision: [D71906139](https://our.internmc.facebook.com/intern/diff/D71906139)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149483
Approved by: https://github.com/zou3519, https://github.com/mlazos
ghstack dependencies: #149482
2025-04-02 17:05:19 +00:00
98453c135a [dynamo] Support Tensor subclass that has dynamic attributes or calls Parameter.__torch_function__ (#149482)
This fixes most of https://github.com/huggingface/diffusers/issues/10795,
except for `torch.Tensor._make_subclass`, which will be fixed in a
subsequent patch.

The relevant tensor subclass from the aforementioned issue is defined
here: fbf6b856cc/src/diffusers/quantizers/gguf/utils.py (L398-L435).

There are two things to note about the tensor subclass:
1. it calls `super().__torch_function__`, which is
   `torch._C._disabled_torch_function_impl`, so this patch updates
   `SuperVariable.call_method` to handle it (we can't do a simpler
   polyfill due to some bug with `var_getattr` raising
   `NotImplementedError`, which forgot to restore symbolic context).
2. it sets and reads attributes (`quant_type`), and
   defines new methods (`as_data`), so this patch adds support for those.
3. it has a `__init__`, which Dynamo needs to trace through in
   `TensorSubclassVariable.call_function`.

Differential Revision: [D71906140](https://our.internmc.facebook.com/intern/diff/D71906140)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149482
Approved by: https://github.com/jansel, https://github.com/mlazos
2025-04-02 17:05:12 +00:00
532530be34 Revert "[Profiler] Fix Empty C Call Queue (#150370)"
This reverts commit 5734909f343ab1de44ed5ab23311d43a9c6afaed.

Reverted https://github.com/pytorch/pytorch/pull/150370 on behalf of https://github.com/clee2000 due to broke some profiler tests when building with debug asserts profiler/test_memory_profiler.py::TestMemoryProfiler::test_config_check [GH job link](https://github.com/pytorch/pytorch/actions/runs/14211763078/job/39822158330) [HUD commit link](3ac5a499dd) ([comment](https://github.com/pytorch/pytorch/pull/150370#issuecomment-2773146070))
2025-04-02 16:40:54 +00:00
f38566dfe4 [MPSInductor] Disable mm/bmm decompositions (#150541)
Disables mm/bmm decompositions.
torch.compile on MPS was speeding up stories15M (~4x) but it was making stories110M much slower.

Self-contained reproducer to demonstrate the difference (before the change, after it should be identical)
```python
import torch
import timeit

def bench_mm(f, x, y):
    from torch.utils.benchmark import Timer
    return Timer(stmt="f(x, y); torch.mps.synchronize()",
                 globals={"x": x, "y": y, "f": f},
                  language="python", timer=timeit.default_timer).blocked_autorange()

x = torch.rand(1024, 512, device='mps')
y = torch.rand(512, 1, device='mps')

mm_c = torch.compile(torch.mm, options={"coordinate_descent_tuning": False})
mm_c_cdt = torch.compile(torch.mm, options={"coordinate_descent_tuning": True})

print(f"Compiled torch.mm perf (with cdt disabled) for 1024x512 and  512x1 matrices are {bench_mm(mm_c, x, y).median}")
print(f"Compiled torch.mm perf (with cdt enabled) for 1024x512 and  512x1 matrices are {bench_mm(mm_c_cdt, x, y).median}")
```

Disabling the inductor mm decomposition, speeds up stories15M further (~6x) and speeds up stories110M (~7x)
The table below show average tokens/sec across 5 runs on M1 Pro for stories15M and stories110M:

|                        | stories15M | stories110M |
|------------------------|------------|-------------|
| without compile         | 99.40      | 53.11       |
| compile before change   | 367.68     | 19.43       |
| compile after change    | 582.96     | 355.07      |

stories110M (without compile)
```
(gptfast) mcandales@mcandales-mbp gpt-fast % python generate.py --checkpoint_path checkpoints/stories110M/stories110M.pt --prompt "Once upon a time" --device mps
[...]
Average tokens/sec: 53.11
```

stories110M (compile before change)
```
(gptfast) mcandales@mcandales-mbp gpt-fast % python generate.py --checkpoint_path checkpoints/stories110M/stories110M.pt --prompt "Once upon a time" --device mps --compile
[...]
Average tokens/sec: 19.43
```

stories110M (compile after change)
```
(gptfast) mcandales@mcandales-mbp gpt-fast % python generate.py --checkpoint_path checkpoints/stories110M/stories110M.pt --prompt "Once upon a time" --device mps --compile
[...]
Average tokens/sec: 355.07
```

stories15M (without compile)
```
(gptfast) mcandales@mcandales-mbp gpt-fast % python generate.py --checkpoint_path checkpoints/stories110M/stories110M.pt --prompt "Once upon a time" --device mps
[...]
Average tokens/sec: 99.40
```

stories15M (compile before change)
```
(gptfast) mcandales@mcandales-mbp gpt-fast % python generate.py --checkpoint_path checkpoints/stories110M/stories110M.pt --prompt "Once upon a time" --device mps --compile
[...]
Average tokens/sec: 367.68
```

stories15M (compile after change)
```
(gptfast) mcandales@mcandales-mbp gpt-fast % python generate.py --checkpoint_path checkpoints/stories110M/stories110M.pt --prompt "Once upon a time" --device mps --compile
[...]
Average tokens/sec: 582.96
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150541
Approved by: https://github.com/malfet
2025-04-02 16:07:18 +00:00
8102272d8c [BE] Fix triton windows build (#150512)
Fixes #150480
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150512
Approved by: https://github.com/atalman

Co-authored-by: Andrey Talman <atalman@fb.com>
2025-04-02 15:48:11 +00:00
42c7c7f15f [invoke_subgraph] Filter out grad_out where fw_out requires_grad is False (#150486)
I am not sure if this is the right way.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150486
Approved by: https://github.com/zou3519
ghstack dependencies: #150082, #150450
2025-04-02 14:40:08 +00:00
82ceebce58 [inductor] Lowerings for max_pool3d (#148210)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148210
Approved by: https://github.com/eellison
2025-04-02 14:13:01 +00:00
5f62d07ec6 Fix log2, PowByNatural printing (#147592)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147592
Approved by: https://github.com/eellison
2025-04-02 14:12:15 +00:00
aae36929ed Rename node.meta["arg_kwarg_vals"] to node.meta["eager_input_vals"] (#148092)
And added a comment about it. Otherwise it might be confusing

Test Plan:
- wait for CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148092
Approved by: https://github.com/eellison
ghstack dependencies: #148046, #148063, #148091
2025-04-02 13:18:04 +00:00
4d121d2b02 Implement needs_exact_strides for mutable custom operators (#148091)
Mutable custom operators get wrapped into an auto_functionalized HOP, so
we need to store the arg_kwarg_vals on the auto_functionalized HOP
itself.

When Inductor does the re-inplacing, it'll use the pattern matcher to
decompose the auto_functionalized HOP back into the original op (and
0+ other view or clone operations). The pattern matcher uses the
arg_kwarg_vals to trace the subgraph to do the decomposition, so it
ultimately sets arg_kwarg_vals on the original op's node correctly.

Test Plan:
- new test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148091
Approved by: https://github.com/eellison
ghstack dependencies: #148046, #148063
2025-04-02 13:18:04 +00:00
c69c3c885e Add needs_exact_strides operator tag for Inductor to force exact strides (#148063)
Inductor will force exact strides on a custom operator tagged with
needs_exact_strides. I'll make this the default in a follow-up PR.
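
A minimal sketch of how a custom op might opt in, assuming the tag is exposed as `torch.Tag.needs_exact_strides` and can be passed via `torch.library.define`'s `tags=` argument; the `mylib::strided_op` name is hypothetical:

```python
import torch

# Register a hypothetical op and tag it so Inductor preserves exact input strides.
torch.library.define(
    "mylib::strided_op",
    "(Tensor x) -> Tensor",
    tags=(torch.Tag.needs_exact_strides,),
)

@torch.library.impl("mylib::strided_op", "CompositeExplicitAutograd")
def strided_op_impl(x):
    # A kernel that cares about the exact memory layout of `x`.
    return x.contiguous() + 1
```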

Test Plan:
- tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148063
Approved by: https://github.com/eellison
ghstack dependencies: #148046
2025-04-02 13:17:58 +00:00
c41fbb4f78 Change arg_kwarg_vals propagation strategy (#148046)
Instead of always propagating arg_kwarg_vals in _COPY_META_FIELDS, we
special-case the pattern matcher to propagate arg_kwarg_vals when
it sees triton_kernel_wrapper_functional.

The strategy is:
1) trace out the replacement graph with arg_kwarg_vals (which have accurate eager-mode metadata)
2) trace out the replacement graph with vals (which have the accurate Inductor metadata)
3) Propagate the arg_kwarg_vals from the first graph to the second.
4) Use the second graph as the replacement graph.

The strategy is this because we want to extend this to handle
auto_functionalized later up in the stack.

Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148046
Approved by: https://github.com/eellison
2025-04-02 13:17:52 +00:00
03138733ba [AOTI] Emit Triton kernels as comment (#150188)
Summary: Emit the corresponding Triton kernel code as a comment in each call_triton_ wrapper function, for easier debugging.

Differential Revision: [D72178907](https://our.internmc.facebook.com/intern/diff/D72178907)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150188
Approved by: https://github.com/yushangdi
2025-04-02 12:41:54 +00:00
75f38dfd4e cpp_wrapper: precompile a few more commonly used headers, and improve RAIIPyObject interface (#149350)
Add includes for torch.device, torch.dtype, torch.layout, and torch.memory_format to the cpp_wrapper common header, so that they get precompiled. Additionally, add move constructors and operator bool to RAIIPyObject.

Closes #142005.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149350
Approved by: https://github.com/desertfire
2025-04-02 09:54:27 +00:00
3f54b14c75 [CUDAGraph] support meta tensor (#150478)
Previously, cudagraphs were skipped if the graph contained any meta tensor. However, we should not skip, since meta tensors involve no actual computation. This PR fixes the issue.

### Example

```python
import torch

def foobar(x, y):
    return x * 2, y * 3

foo_c = torch.compile(mode="reduce-overhead")(foobar)
t = torch.empty((1, 16, 128, 128), device="meta")
y = torch.rand([64], device="cuda")

eager_out = foobar(t, y)

for _ in range(3):
    compiled_out = foo_c(t, y)
```

Prior to this PR, above code leads to
```
skipping cudagraphs due to multiple devices: device(type='cuda', index=0), device(type='meta')
```

With this PR, we don't skip.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150478
Approved by: https://github.com/eellison
2025-04-02 07:21:50 +00:00
0da8127f77 Compare device name of profiler dynamically (#150396)
Compare self.use_device of torch.autograd.profiler.profiler with _get_privateuse1_backend_name(), since privateuse1 backend can be renamed.
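
A hedged sketch of the dynamic comparison; the helper's import path (`torch._C._get_privateuse1_backend_name`) and the renamed backend name "npu" are assumptions for illustration:

```python
import torch

# "privateuseone" by default; user code may rename it (e.g. to "npu").
backend_name = torch._C._get_privateuse1_backend_name()

use_device = "npu"  # value a profiler instance might carry in self.use_device
if use_device == backend_name:
    print("profiling the (possibly renamed) privateuse1 backend")
```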

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150396
Approved by: https://github.com/sraikund16
2025-04-02 06:06:06 +00:00
c65de03196 Add Any return annotation to __getattr__ methods that return a union of types. (#150204)
Adds an `Any` return type annotation to `__getattr__` methods in `torch/_ops.py` that return a union of types. Attribute access returning a union of types can cause issues downstream because consumers would need to handle all of the possible types to make the type checker happy. This doesn't seem to matter today for mypy, presumably because `Any` is always inferred when a return type annotation is missing, but it still makes explicit what mypy is already doing implicitly.
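
A minimal sketch of the annotation pattern; the class below is illustrative, not the actual torch/_ops.py code:

```python
from typing import Any

class _Namespace:
    def __getattr__(self, name: str) -> Any:
        # Returning Any keeps attribute access usable downstream; a union
        # return type would force every caller to narrow the result first.
        ...
```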

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150204
Approved by: https://github.com/malfet
2025-04-02 05:25:07 +00:00
dee016ceb7 [MPSInductor] Add store_reduce method (#150457)
This restricts the store operation to the 0th thread, which should be much better
(though I don't observe it in the benchmark).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150457
Approved by: https://github.com/jansel, https://github.com/dcci
ghstack dependencies: #150452
2025-04-02 05:12:49 +00:00
3ac5a499dd [dynamo] add dynamo disable reasons to codebase (#150440)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150440
Approved by: https://github.com/jansel, https://github.com/zou3519
ghstack dependencies: #150341
2025-04-02 04:26:48 +00:00
25eff6e991 [dynamo] add reason field to torch.compiler.disable (#150341)
Implements https://github.com/pytorch/pytorch/issues/146445

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150341
Approved by: https://github.com/zou3519, https://github.com/jansel
2025-04-02 04:26:48 +00:00
063ea5d669 [AOTInductor] Modify test for Memory tracking for memory-related (#150269)
operations

Summary:
Fix the test for memory tracking. This PR does:
(1) Add tracking before and after all memory-related operations, making sure
each operation indeed captures the consumed memory both in CUDA and in torch's
CUDACachingAllocator.
(2) Keep track of memory being reserved by the CUDACachingAllocator in
torch and its relationship with global CUDA memory consumption.

Test Plan:
This PR is adding tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150269
Approved by: https://github.com/jingsh, https://github.com/chenyang78, https://github.com/desertfire
2025-04-02 04:18:18 +00:00
5734909f34 [Profiler] Fix Empty C Call Queue (#150370)
Summary:
My commandeer of https://github.com/pytorch/pytorch/pull/150102

Based on the description of the PR, it seems that we need to add C calls for each starting Python event with a callable, such that when the tracing exits we will have a matching enter for any given exit. At worst it adds some unnecessary events, but it prevents segfaults/failures. My PR just cleans up some refcount impl and logging.

Test Plan: Ran resnet test internally. Will check CI and ask reviewers to make sure it resolves their issues.

Differential Revision: D72207570

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150370
Approved by: https://github.com/aaronenyeshi
2025-04-02 02:44:50 +00:00
eqy
f09513e515 [CUDA][SymmetricMemory] Interpret empty string as std::nullopt in rendezvous (#149793)
This is a "temporary" fix, as the current internal API requires strings at some interfaces instead of `std::optional`, and empty strings are presumably used in lieu of `nullopt`.
e.g.,
9d02b3993f/torch/csrc/distributed/c10d/intra_node_comm.cu (L49)

This currently breaks `test_intra_node_comm_all_reduce`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149793
Approved by: https://github.com/kwen2501, https://github.com/cyyever
2025-04-02 02:41:07 +00:00
61ebe999cc [invoke_subgraph] Do not cache fake tensors for AOTDispatcher first pass (#150450)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150450
Approved by: https://github.com/zou3519
ghstack dependencies: #150082
2025-04-02 02:31:54 +00:00
b060fedfa8 [invoke_subgraph] Support None in the fwd output (#150082)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150082
Approved by: https://github.com/zou3519
2025-04-02 02:31:54 +00:00
0ae75ca2de assert on all_reduce_event only if it's not CPU device. (#150316)
Summary: For CPU based runs, `all_reduce_event` would be None since this is the result of the `all_reduce_stream.record_event()`, which does not do much other than returning None when device type is CPU.
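
A hedged sketch of the guard; the `device` and `all_reduce_event` values below stand in for the FSDP-internal state described above:

```python
import torch

device = torch.device("cpu")
all_reduce_event = None  # what record_event() effectively yields on CPU

# Only assert the event exists when the collective actually ran on a non-CPU device.
if device.type != "cpu":
    assert all_reduce_event is not None
```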

Test Plan: CI

Differential Revision: D72176406

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150316
Approved by: https://github.com/kwen2501, https://github.com/weifengpy, https://github.com/mori360
2025-04-02 01:54:35 +00:00
cyy
e872c38eb3 Remove cppcoreguidelines-pro-type-member-init_fix suppression (#148638)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148638
Approved by: https://github.com/zou3519
2025-04-02 01:33:20 +00:00
c974b5322a enable torch.compile for torch._scaled_mm nvfp4 recipe (#150462)
Summary:

Updates the meta registration for `torch._scaled_mm` to work for the
nvfp4 recipe.

Test Plan:

```bash
pytest test/test_matmul_cuda.py -s -k test_blockwise_nvfp4
```


Pull Request resolved: https://github.com/pytorch/pytorch/pull/150462
Approved by: https://github.com/eellison
2025-04-02 01:08:40 +00:00
ee97299961 [MPS][Testing] Benchmark reduction ops (#150452)
This compares eager vs. compile.
On my M4 Pro mini I'm getting the following now:
```
[---------------------------------------------------------------------------------------------  --------------------------------------------------------------------------------------------]
                           |  eager-512x512  |  compile-512x512  |  eager-1024x1024  |  compile-1024x1024  |  eager-2048x2048  |  compile-2048x2048  |  eager-4096x4096  |  compile-4096x4096
1 threads: ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      sum (torch.float32)  |      121.0      |       201.5       |       130.3       |        772.3        |       179.4       |        1470.5       |        476.1      |        2980.0
      max (torch.float32)  |      154.1      |       165.9       |       198.7       |        211.6        |       344.2       |         386.9       |       1326.6      |        1345.6
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150452
Approved by: https://github.com/dcci, https://github.com/manuelcandales
2025-04-02 01:06:27 +00:00
db32093192 [ROCm][Windows] Fix torchvision build with ROCm 6.4 on windows (#150180)
Since hipcc files and calls are restructured in HIP SDK 6.4, a case for calling hipcc.exe is added when building torchvision with HIP SDK 6.4 on Windows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150180
Approved by: https://github.com/malfet, https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-04-02 00:35:47 +00:00
d22e3d5efe [fr] Add logger config for flight record in PGNCCL (#150356)
Summary: We want to move from scuba-based direct logging to logger-config-based logging. Most changes are internal, but we need to change the exception to exception_msg.

Test Plan: Following https://www.internalfb.com/wiki/Server_Logging/Getting_Started_with_Logging/Onboarding_Existing_Scribe-Based_Logging_(Alpha)/ to test it.

Differential Revision: D72198171

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150356
Approved by: https://github.com/fegin
2025-04-01 23:54:07 +00:00
6aea4d90fb gloo: use shared Stores (#150230)
Summary:
X-link: https://github.com/facebookincubator/gloo/pull/423

This modifies `connectFullMesh` to take in a shared_ptr<IStore> instead of a reference. This is an API breaking change but fairly easy to work around.

To have backwards compatibility in PyTorch during the commit phase we add a new ifdef `GLOO_SHARED_STORE` which can provide backwards compatibility until we update the pinned Gloo version in pytorch OSS repo.

This also adds a new `wait_get` method to `IStore` which will allow us to do a more efficient operation in PyTorch TCPStore. PyTorch's `Store::get` automatically waits so we want to make sure we can avoid waiting twice to reduce network traffic.

This change will land simultaneously in PyTorch and Gloo repos.

Test Plan:
```
buck2 test //gloo/... //caffe2/caffe2/contrib/gloo:
```

Differential Revision: D72084111

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150230
Approved by: https://github.com/fduwjj
2025-04-01 23:37:25 +00:00
4934a83347 [AMD] [TRITON] [INDUCTOR] Add tl.assume to enable bufferops on AMD (#150373)
Summary: Update the GEMM template to include the necessary `tl.assume` annotations to enable bufferops with AMD.

Test Plan: Tested manually with a simple matmul run with torch.compile(f, mode="max-autotune") and the environment variables TRITON_ALWAYS_COMPILE=1 AMDGCN_ENABLE_DUMP=1 AMDGCN_USE_BUFFER_OPS=1.
Inspecting the generated AMDGCN, all loads/stores use bufferops.
Note: Since inductor loads constants for many of the shape values, assumes are generally not needed for the stride/shape information, but pid calculations are generally a gap in Triton's inference capability.

Differential Revision: D71922698

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150373
Approved by: https://github.com/eellison
2025-04-01 23:29:39 +00:00
60fe0922f6 [pytree] Register normal class to register_dataclass (#147752)
Fixes https://github.com/pytorch/pytorch/pull/147532#discussion_r1964365330

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147752
Approved by: https://github.com/zou3519
2025-04-01 23:28:20 +00:00
203a27e0ce Revert "[cuBLAS][cuBLASLt] Unify cuBLASLt workspaces with cuBLAS workspaces (#145130)"
This reverts commit 8f7fbe3d7d2cd301df48fcbe8a14f8aa1a9c1e48.

Reverted https://github.com/pytorch/pytorch/pull/145130 on behalf of https://github.com/clee2000 due to reverted internally by D72140190 ([comment](https://github.com/pytorch/pytorch/pull/145130#issuecomment-2770874244))
2025-04-01 23:07:28 +00:00
80ab233786 [Inductor] Hide reinplace_fsdp_all_gather pass behind skip_fsdp_hooks config (#150436)
The `reinplace_fsdp_all_gather` pass is currently only for Traceable FSDP2 and doesn't work together with SimpleFSDP. We should hide the pass behind `skip_fsdp_hooks` config which makes it only apply to Traceable FSDP2.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150436
Approved by: https://github.com/BoyuanFeng
2025-04-01 22:56:06 +00:00
9458460211 Revert "if blaslt fails, fall back to blas (#150147)"
This reverts commit 65139eb050817329ac8e541c377b2be3bb5ffe14.

Reverted https://github.com/pytorch/pytorch/pull/150147 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/150147#issuecomment-2770847320))
2025-04-01 22:52:22 +00:00
76e1b3ba4c Revert "[ROCm] use correct workspace for hipblaslt, silence warning (#150227)"
This reverts commit c158eac0de2afe38d68952ca401888ed5777f6b0.

Reverted https://github.com/pytorch/pytorch/pull/150227 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/150227#issuecomment-2770827563))
2025-04-01 22:31:13 +00:00
629c1bd2dd [ez][inductor][tests] Skip triton backend only for CPU tests (#150343)
Motivation: to unblock https://github.com/pytorch/pytorch/pull/148622

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150343
Approved by: https://github.com/chenyang78
2025-04-01 22:03:48 +00:00
b70d105c77 infer dynamic shapes through additional inputs (#150144)
Summary:
Instead of explicitly specifying dynamic shapes, it is possible to infer them from additional example inputs. Together with the example inputs provided to export, we can basically make any varying dim dynamic and keep any fixed dim static. This should be useful for prod scenarios that have access to tests and/or profiling data, yet are somewhat removed from the model authoring process.

However this alone is not satisfactory: the exported program by design has only one graph, representing one path through the model, and we cannot necessarily guarantee that this graph works for the additional example inputs because different guards might have been created if we had exported with them instead (corresponding to different traced paths). However, checking that the additional example inputs satisfy the guards created by the original export should be sufficient for generalization.

Now, while we don't preserve all guards in the exported program, we do check a subset of them as part of input matching. So we add a verification step at the end of export when such additional example inputs are provided. This should be enough for now.

Test Plan: added test (positive and negative cases)

Differential Revision: D72001771

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150144
Approved by: https://github.com/bobrenjc93
2025-04-01 21:13:39 +00:00
0d44a8aea1 [Hierarchical Compile] Apply deduplication after output node creation (#150306)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150306
Approved by: https://github.com/anijain2305
ghstack dependencies: #150303, #150304, #150305
2025-04-01 20:54:18 +00:00
8740ffa760 [Hierarchical Compile] Add cycle detection to graph region expansion (#150305)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150305
Approved by: https://github.com/anijain2305
ghstack dependencies: #150303, #150304
2025-04-01 20:54:18 +00:00
a2300aff94 [Hierarchical Compile] Add cycle detection function for debug (#150304)
Remove print

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150304
Approved by: https://github.com/anijain2305
ghstack dependencies: #150303
2025-04-01 20:54:10 +00:00
99fd96c10b [Hierarchical Compile] Remove spammy debug log (#150303)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150303
Approved by: https://github.com/williamwen42
2025-04-01 20:54:03 +00:00
295162ec3a Smoke Test - disable pypi package validation for binaries that package cuda libs (#150194)
Smoke Test - disable pypi package validation for binaries that package cuda libs. These binaries do not install packages via pypi.
Should Resolve this from `linux-binary-manywheel / manywheel-py3_11-cuda12_6-full-test / test`:
```
Traceback (most recent call last):
  File "/pytorch/.ci/pytorch/smoke_test/smoke_test.py", line 468, in <module>
    main()
  File "/pytorch/.ci/pytorch/smoke_test/smoke_test.py", line 462, in main
    smoke_test_cuda(
  File "/pytorch/.ci/pytorch/smoke_test/smoke_test.py", line 274, in smoke_test_cuda
    compare_pypi_to_torch_versions(
  File "/pytorch/.ci/pytorch/smoke_test/smoke_test.py", line 220, in compare_pypi_to_torch_versions
    raise RuntimeError(f"Can't find {package} in PyPI for Torch: {torch_version}")
RuntimeError: Can't find cudnn in PyPI for Torch: 9.5.1
```
Link: https://github.com/pytorch/pytorch/actions/runs/14101221665/job/39505479587#step:15:982
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150194
Approved by: https://github.com/ZainRizvi
2025-04-01 19:18:44 +00:00
d2ad9aa2f2 [dtensor][tp] add a ParallelStyle PrepareModuleInputOutput (#150372)
Needed this class because `parallelize_module` takes a dict, which doesn't allow `PrepareModuleInput` and `PrepareModuleOutput` to be applied at the same time.

The `PrepareModuleInputOutput` in this PR initializes two variables `prepare_module_input` and `prepare_module_output` and uses them to process module / inputs / outputs.
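
A hedged sketch of how the combined style might be applied; the constructor arguments are assumptions modeled on `PrepareModuleInput`/`PrepareModuleOutput`, and the "block" submodule name is hypothetical:

```python
from torch.distributed.tensor import Replicate, Shard
from torch.distributed.tensor.parallel import PrepareModuleInputOutput, parallelize_module

# One entry can now reshape both the inputs and the outputs of the same submodule.
plan = {
    "block": PrepareModuleInputOutput(
        input_layouts=(Shard(0),),
        desired_input_layouts=(Replicate(),),
        output_layouts=(Replicate(),),
        desired_output_layouts=(Shard(0),),
    ),
}
# parallelize_module(model, device_mesh, plan)  # model/device_mesh come from the training setup
```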

I had another implementation which put all code in `PrepareModuleInputOutput` and let `PrepareModuleInput` and `PrepareModuleOutput` inherit the monolithic `PrepareModuleInputOutput`. But it is
1. less clean
2. conceptually abusing inheritance, because `PrepareModuleInput` shouldn't be able to access class methods of `PrepareModuleOutput` and vice versa

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150372
Approved by: https://github.com/wanchaol
2025-04-01 19:15:43 +00:00
5d6ac2dced [dtensor] add op support for select_backward and slice_backward (#150357)
Inheriting and rebasing @awgu 's PR https://github.com/pytorch/pytorch/pull/149071
- fixed an issue for `select_backward` and an issue for `slice_backward`
- removed `_experimental_ops.py` as it becomes empty

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150357
Approved by: https://github.com/awgu, https://github.com/XilunWu
2025-04-01 19:15:25 +00:00
a37afd23fa [custom_ops][perf] Move expensive pytree traversals of tensors to C++ (#148555)
(benchmark for 1 call)

Before:
```
└─ $ python ~/task_custom_ops_perf/test_custom_ops_perf_repro.py
DO_BENCH mutate: 77.72445678710938 us PROFILE:/home/ivankobzarev/task_custom_ops_perf/mutate.json
DO_BENCH no_mutate: 64.61143493652344 us PROFILE:/home/ivankobzarev/task_custom_ops_perf/no_mutate.json
DO_BENCH direct_mutate: 11.682510375976562 us PROFILE:/home/ivankobzarev/task_custom_ops_perf/direct_mutate.json
DO_BENCH direct_no_mutate: 18.596649169921875 us PROFILE:/home/ivankobzarev/task_custom_ops_perf/direct_no_mutate.json
```

After:
```
└─ $ python ~/task_custom_ops_perf/test_custom_ops_perf_repro.py
DO_BENCH mutate: 47.6837158203125 us PROFILE:/home/ivankobzarev/task_custom_ops_perf/mutate.json
DO_BENCH no_mutate: 31.709671020507812 us PROFILE:/home/ivankobzarev/task_custom_ops_perf/no_mutate.json
DO_BENCH direct_mutate: 10.967254638671875 us PROFILE:/home/ivankobzarev/task_custom_ops_perf/direct_mutate.json
DO_BENCH direct_no_mutate: 10.728836059570312 us PROFILE:/home/ivankobzarev/task_custom_ops_perf/direct_no_mutate.json
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148555
Approved by: https://github.com/zou3519
2025-04-01 18:45:48 +00:00
78300c8205 [ROCm] update test buffer fudge factor for hipblaslt (#150348)
The default workspace for hipblaslt is larger than for cublas/cublaslt which requires a slight increase to the buffer needed.

Forward-fix for #150227 that broke ROCm distributed tests but wasn't part of initial CI signal.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150348
Approved by: https://github.com/jeffdaily
2025-04-01 18:31:25 +00:00
37ebb0b56a [inductor] Fix inductor windows linker error (#150256)
Fixes #149889

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150256
Approved by: https://github.com/anijain2305, https://github.com/eellison
2025-04-01 18:30:55 +00:00
15dbad2115 Update torch.compile issue template (#150192)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150192
Approved by: https://github.com/malfet
ghstack dependencies: #149947
2025-04-01 18:16:16 +00:00
f04cf13bdd Revert "Merge Triton ScaledMM as epilogue to MM template (#150045)"
This reverts commit 981048854da154eae8ff0bd439e72e1256ae00da.

Reverted https://github.com/pytorch/pytorch/pull/150045 on behalf of https://github.com/PaulZhang12 due to Need to add PR 150415 fixes for internal merge ([comment](https://github.com/pytorch/pytorch/pull/150045#issuecomment-2770252452))
2025-04-01 17:54:28 +00:00
b0c560ef2a [dynamo][hooks] use wrap_top_frame config for functions (#150209)
When torch.compile is applied to a module via `mod.compile(...)`, it's equivalent to `torch.compile(mod._call_impl)` which takes a different path than `OptimizedModule`. This PR ensures that the `wrap_top_frame` config can also take effect for the `torch.compile(mod._call_impl)` use case.
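
A minimal sketch of the two entry points this aligns (module choice is arbitrary):

```python
import torch

mod = torch.nn.Linear(4, 4)

opt_mod = torch.compile(mod)  # wraps the module in an OptimizedModule
mod.compile()                 # compiles in place, roughly torch.compile(mod._call_impl)
```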

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150209
Approved by: https://github.com/anijain2305
2025-04-01 17:41:23 +00:00
48af2cdd27 [BE] Move all lint runner to 24.04 (#150427)
As Ubuntu-20 reached EOL on Apr 1st, see https://github.com/actions/runner-images/issues/11101
This forces the older Python version to be 3.8.
Delete all linux-20.04 runners from lintrunner.yml.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150427
Approved by: https://github.com/seemethere
2025-04-01 17:33:15 +00:00
3b0cd9b542 [Quant][PT2E] add a lowering pass for x86 backend (#149708)
**Summary**
This PR adds a lowering pass for x86 backend
- Patterns of `dequantize -> conv/linear (-> quantize)` are fused to corresponding quantized onednn ops.
- Weights are prepacked ahead of time.
- Post ops of conv/linear are fused if supported.
- The pass returns a `GraphModule` with the modifications mentioned above.

**Test plan**
```
pytest test/quantization/pt2e/test_x86inductor_quantizer.py -k test_lowering_to_x86
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149708
Approved by: https://github.com/jerryzh168, https://github.com/leslie-fang-intel
2025-04-01 17:32:41 +00:00
783f045c4f [ez] Remove dead lite interpreter CI code (#150424)
There are no lite-interpreter build environments in CI

I assume every mac build is arm64
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150424
Approved by: https://github.com/seemethere, https://github.com/malfet
2025-04-01 17:14:32 +00:00
a17ee8181a [CI] Fix log artifact not containing test logs attempt 2 (#150234)
Fixes #ISSUE_NUMBER
Take two of https://github.com/pytorch/pytorch/pull/149577 since it didn't work
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150234
Approved by: https://github.com/malfet, https://github.com/seemethere
2025-04-01 17:13:58 +00:00
f94ac263af [MPSInductor] Fix neg for unsigned types (#150412)
By more-or-less copy-n-pasting the fix from https://github.com/pytorch/pytorch/pull/94035

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150412
Approved by: https://github.com/jansel, https://github.com/dcci
ghstack dependencies: #150382, #150386
2025-04-01 16:52:41 +00:00
ae74ef9d53 Set proper LD_LIBRARY_PATH on Linux in nightly venv in nightly pull tool (#143262)
Before this change:

```console
$ make setup-env-cuda PYTHON="${HOMEBREW_PREFIX}/bin/python3.12"
$ source venv/bin/activate
$ python3 -c 'import torch'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/PanXuehai/Projects/pytorch/torch/__init__.py", line 379, in <module>
    from torch._C import *  # noqa: F403
    ^^^^^^^^^^^^^^^^^^^^^^
ImportError: libcudnn.so.9: cannot open shared object file: No such file or directory
```

This PR adds `site-packages/nvidia/**/lib` to `LD_LIBRARY_PATH` in the `venv/bin/activate` script so that NVIDIA PyPI packages can be loaded correctly.

See also:

- #141837

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143262
Approved by: https://github.com/malfet
2025-04-01 16:51:02 +00:00
a19b667bca [ROCm] Update CUDAPluggableAllocator.h (#1984) (#150010)
Alter the flag to use the correct streamType in the CUDAPluggableAllocator class for ROCm GPUs. The TORCH_HIP_VERSION flag does not work for ROCm as intended, so it is replaced with USE_ROCM. This impacts Distributed Fused Adam in ROCm/APEX when using the nccl_ub feature. This has been tested with rocm/apex.

See PR https://github.com/ROCm/apex/pull/184

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150010
Approved by: https://github.com/jeffdaily
2025-04-01 16:49:03 +00:00
35c45a4a31 [Reland] Launch kernel on current stream & remove record_stream entirely (#150398)
Relanding #148590 due to merge conflict.

This PR has multiple changes to `ProcessGroupNCCL` (which unfortunately are related):
1. When async_op=False, we directly launch the collective on "current" stream, instead of a trampoline stream and join back.
- Resolves #147729
- Resolves #146881
- Also saves two event syncs (which have overhead in case of HIP) and one pybind when we call `work.wait()` in distributed_c10d.py on behalf of user.
2. Entirely remove `record_stream` and use CPU-side stashing for managing tensor lifetime against recycling.
- Resolves #147168
3. Remove tensor life management when async_op=False; only use it when async_op=True.
4. To guard against user not calling `work.wait()`, we ask watchdog to unstash tensors after detecting completion of collectives, to prevent us from holding reference to tensors forever. This is a safety net, rather than a service guarantee, see discussion [here](https://github.com/pytorch/pytorch/issues/147168#issuecomment-2660142460).
5. Profiles in async_op=False mode would look different -- collective kernels would show up on the same line as compute kernels.

Joint work with @cenzhaometa who wants to remove the event sync overhead.

Squashed contents:

* [ptd][nccl] use current-stream as nccl-stream under async=False mode (#147820)
PTD current workflow:
- PTD creates its own dedicated `ncclStream` for comm operation
- it will first add a dependency on the current stream (typically the compute stream) to ensure tensors are ready before invoking the collective; such stream synchronization becomes expensive in the inference world (CPU overhead: 70us vs. GPU kernel time: 160us).
This diff:
- async=False [default], will use current-stream as nccl-stream and avoid the stream-sync overhead
- async=True, will retain existing logic: create new nccl-stream, let it wait on current-stream to ensure tensors are ready
- pass down async from c10d down to NCCL-PG
This helps shave off 50% of the CPU overhead **(70us -> 35us)**, which reduces the total CPU/GPU time from **230us to 195us (by 15%)**.

* [PGNCCL] Make avoid-record-stream default

* [c10d] Add asyncOp argument to Ops

* Change python side wait

* Pass asyncOp at ProcessGroup level

* Watchdog unstashing tensors as a safety net

* Stash tensors for reduce_scatter_v and all_gather_v
Pull Request approved: https://github.com/pytorch/pytorch/pull/149753

* [c10d] Move unstashing from watchdog to main thread
Pull Request approved: https://github.com/pytorch/pytorch/pull/150079

* [PGNCCL][BE] Merge mutex into TensorShelf for encapsulation
Pull Request approved: https://github.com/pytorch/pytorch/pull/150130

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150398
Approved by: https://github.com/atalman
2025-04-01 16:46:07 +00:00
7382654ebc Update ExecuTorch pin to latest viable/strict 3/28/2025 (#150308)
From latest viable/strict: https://hud.pytorch.org/hud/pytorch/executorch/viable%2Fstrict/1?per_page=50

Fixes https://github.com/pytorch/pytorch/issues/144480

This commit has important CI stability fixes, such as https://github.com/pytorch/executorch/pull/9561 and https://github.com/pytorch/executorch/pull/9634
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150308
Approved by: https://github.com/jathu, https://github.com/malfet
2025-04-01 16:30:09 +00:00
428234bc28 [MPSInductor] torch.complex128 is unsupported on MPS (#150386)
Same as torch.float64

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150386
Approved by: https://github.com/dcci
ghstack dependencies: #150382
2025-04-01 15:19:10 +00:00
1c6e88eb03 [MPS] Test bf16 perf of few unary and binary ops (#150382)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150382
Approved by: https://github.com/Skylion007
2025-04-01 13:58:20 +00:00
0d96c38b76 [AOTI] Skip test_buffer_mutation_and_force_mmap_weights for fbcode (#150340)
Summary: Skip due to an older ideep version

Differential Revision: D72190746

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150340
Approved by: https://github.com/yushangdi
2025-04-01 13:24:21 +00:00
84c21d2147 Enable SVE ACLE implementation for tanH Aten op for FP32 dType. (#143741)
In deep learning models, the tanh (hyperbolic tangent) function is a widely used activation function, primarily in feedforward networks, recurrent neural networks (RNNs), and various other architectures.

Also, the tanh (hyperbolic tangent) function is commonly used in **Physics-Informed Neural Networks (PINNs).** PINNs are a class of machine learning models designed to solve partial differential equations (PDEs) by incorporating the governing physics directly into the loss function, along with data-driven terms.

In PINNs, activation functions like tanh are used in the neural network architecture to enable the model to learn complex mappings between inputs (such as spatial and temporal coordinates) and outputs (such as field variables).

**Operator: tanh()**
**Current Implementation in OSS in ATen Backend:**
**SVE Flow:** Uses SVE Sleef when available, else the std implementation.

**With this PR :**
**SVE Flow:** Uses SVE ACLE implementation. (Faster Implementation)

**Here are the performance improvements.**
**Single core perf numbers:**
![image](https://github.com/user-attachments/assets/c2f4bcb6-11bc-4af1-b5eb-278a4cc4a69d)

**Metric:**  CPU time avg time per iteration (In ms)

As you can see with both gcc and clang compilers, we see a significant performance gain with SVE ACLE implementation over current OSS Implementation (Sleef) and also Neon.

**Hardware:** m7g.8xlarge (Graviton 3 Instance)

**Script used in benchmarking:**
```python
import os
#os.environ["ATEN_CPU_CAPABILITY"] = "default"
os.environ["ATEN_CPU_CAPABILITY"] = "sve256"

import torch
import torch.nn as nn

#Set the random seed for reproducibility
torch.manual_seed(1)

#Create a tensor of shape (8521, 50)
x = torch.randn(8521, 50)

for i in range(10):
    output = x.tanh()

#Perform the tanh operation 1000 times and profile the performance
print("### CPU tanh")
with torch.autograd.profiler.profile(record_shapes=True) as prof:
    for i in range(1000):
        output = x.tanh()

#Print the profiling results sorted by self CPU time
print(prof.key_averages().table(sort_by="self_cpu_time_total"))

#Optionally print the final output (if needed, uncomment the following line)
print(output)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143741
Approved by: https://github.com/malfet
2025-04-01 11:54:58 +00:00
bf4814eb6a [Intel GPU] Allow XPU backend in Quantize operators (#150288)
This modification supports torch.quantize_per_channel() on XPU; without it, the call causes a segmentation fault.
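
A hedged sketch of the call that previously crashed, assuming an XPU build and device are available:

```python
import torch

x = torch.randn(2, 3, device="xpu")
scales = torch.tensor([0.1, 0.2, 0.3], device="xpu")
zero_points = torch.zeros(3, dtype=torch.int64, device="xpu")

# Per-channel quantization along dim 1; previously this segfaulted on XPU.
q = torch.quantize_per_channel(x, scales, zero_points, axis=1, dtype=torch.qint8)
```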
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150288
Approved by: https://github.com/jerryzh168, https://github.com/guangyey
2025-04-01 11:27:26 +00:00
a10b765bf1 [pytree] add APIs to determine a class is a namedtuple or PyStructSequence (#113257)
Changes in this PR:

1. Add `is_structseq` and `is_structseq_class` functions to determine whether an object or a class is a PyStructSequence.
2. Add a generic class `structseq` which can be used as the registration key for PyStructSequence types like `namedtuple` for Named Tuple types.
3. Change `is_namedtuple` to accept subclasses of namedtuple as namedtuples. Before this PR, only namedtuple classes directly created by `collections.namedtuple` or `typing.NamedTuple` were namedtuple classes, while their subclasses were not. This PR makes `is_namedtuple` return true for subclasses of a namedtuple class.

Resolves #75982. New tests are included in this PR.

- #75982

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113257
Approved by: https://github.com/zou3519
2025-04-01 10:40:43 +00:00
48e9ffc873 Unify on dynamo_compile as the overall wait counter (#150293)
Summary:
dynamo_compile for the most part has been accounting for compile time except autotuning.

all_compilation_types had earlier been injected on fx_codegen_and_compile, which was incorrect.

Add autotuning to dynamo and deprecate the all_compilation_types counter.

Differential Revision: D72145447

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150293
Approved by: https://github.com/masnesral, https://github.com/jamesjwu
2025-04-01 08:55:51 +00:00
36f2d0aaba Add "xpu" to __all__ for torch/version.py (#149695)
As the title stated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149695
Approved by: https://github.com/desertfire, https://github.com/guangyey
2025-04-01 08:44:51 +00:00
1700599266 Add one_shot_all_reduce_copy to allow non-symm-mem allocated tensors to be reduced (#150129)
Per title, we want to be able to use it even if inputs are not registered. Separate copy would add latency, and one-shot is all about the lowest possible latency.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150129
Approved by: https://github.com/xw285cornell
2025-04-01 05:36:43 +00:00
414b9ae016 enable out variant of 2-shot reduction (#150153)
Per title, this version uses symm mem input both as input source and as a work buffer, so input is modified after the end (similar to what fbgemm car reduction does). It is intended to be wrapped in an op that would first copy the real inputs to symm mem buffers that wouldn't be exposed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150153
Approved by: https://github.com/xw285cornell
2025-04-01 05:36:04 +00:00
7e7e5698cc Suppress more warnings (#149833)
Differential Revision: [D71702307](https://our.internmc.facebook.com/intern/diff/D71702307)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149833
Approved by: https://github.com/malfet, https://github.com/Skylion007
2025-04-01 05:33:04 +00:00
790d459f85 [dynamo] add error message for unsupported LOAD_BUILD_CLASS (#150323)
Improved error message for https://github.com/pytorch/pytorch/issues/128942

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150323
Approved by: https://github.com/jansel, https://github.com/zou3519
2025-04-01 05:03:50 +00:00
ce52674b76 [Doc] Update CMAKE_PREFIX_PATH for XPU windows README (#148863)
We found that `pip install cmake` and `conda install cmake` have different behavior.
The reason is that the pip-installed one doesn't find the corresponding libs under the conda env, so we need to set `CMAKE_PREFIX_PATH` for alignment.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148863
Approved by: https://github.com/CuiYifeng, https://github.com/malfet

Co-authored-by: Cui, Yifeng <yifeng.cui@intel.com>
2025-04-01 04:43:11 +00:00
31634b8c6a [fr] Added protection against missing stack frames in fr cont. (#150133)
Summary: Previously we had D70358287, which didn't fully resolve the issue.

Test Plan:
# FR
`buck2 run @//mode/opt //caffe2/fb/flight_recorder:fr_trace -- --mast_job_id f710320638-TrainingApplication --mast_job_version 0 --mast_job_attempt 0 --bucket tlcm_log_blob --world_size 128 --dump_file_name_offset 0 --allow-incomplete-ranks`
Confirm no error
# FR analyzer
`buck2 run @//mode/opt //investigations/dr_patternson/analyzers/ai_observability:ai_observability-all-analyzers-cli -- flight_recorder_analyzer --mast_job_name f710320638-TrainingApplication --mast_job_version 0 --mast_job_attempt 0`
Confirm no error

Differential Revision: D71998980

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150133
Approved by: https://github.com/fduwjj
2025-04-01 03:07:59 +00:00
827b730f4e [CI] Skip test_copy_large_tensor on M2-15 runners (#150377)
They have more than 12 GB of memory, but running this test may cause OOM in CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150377
Approved by: https://github.com/atalman
2025-04-01 02:33:43 +00:00
6470b373c1 torch.backends.mkldnn.flags() CM should not warn (#150358)
By returning `None` rather than `False` from `THPModule_allowTF32OneDNN` when USE_XPU is not defined

Added regression test

Fixes https://github.com/pytorch/pytorch/issues/149829

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150358
Approved by: https://github.com/atalman
2025-04-01 01:33:40 +00:00
5cb5675f13 [Inductor] optimize the heuristics of parallel reduction (#149614)
Fix https://github.com/pytorch/pytorch/issues/148639.

Summary:
Optimize the heuristics of parallel reduction: when the number of steps of the first inner loop beyond the maximum parallel depth is much larger than the number of steps of all outer loops within the maximum parallel depth, change the starting depth of parallelism to the first inner loop and recalculate the maximum parallel depth. I ran the Inductor benchmark with this PR on CPU. A timm model, poolformer_m36 (BF16), shows about a 25% performance improvement, and no performance regression is seen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149614
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
2025-04-01 01:31:00 +00:00
0f12951fc2 [Intel gpu] always set deterministic for xpu accuracy test (#149028)
On Intel Max 1550, models like Super_SloMo can actually pass the accuracy test after setting deterministic mode, because we do not use atomics in upsampling bilinear backward in some cases when running on XPU. Furthermore, I guess the only reason not to set deterministic mode on these models is just avoiding errors. We should use warn_only = True.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149028
Approved by: https://github.com/guangyey, https://github.com/desertfire

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
2025-04-01 01:00:11 +00:00
7ab8532cf1 [BE] Get rid of cross-compile and x86 build options for Mac (#150362)
As both cross-compilation and x86 builds has been removed a while back

Remove stale TODO about building with OpenMP support
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150362
Approved by: https://github.com/atalman, https://github.com/clee2000
2025-04-01 00:45:24 +00:00
4ce0b959ff Add a warning when a tensor with requires_grad=True is converted to a scalar (#143261)
Fixes #143071

Operations performed on tensors with `requires_grad=True` such as
```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 3
```
and
```python
x = torch.tensor(2.0, requires_grad=True)
y = torch.pow(x,3)
```
are valid operations.

While an operation using `numpy` like
```python
import numpy as np

x = torch.tensor(2.0, requires_grad=True)
y = np.pow(x,3)
# > RuntimeError: Can't call numpy() on Tensor that requires grad. Use tensor.detach().numpy() instead.
```
leads to an error.

However, an operation that uses `math` like
```python
import math

x = torch.tensor(2.0, requires_grad=True)
y = math.pow(x,3)
```
does not cause an error, and `y` is no longer a tensor with a gradient!

This represents a [footgun](https://en.wiktionary.org/wiki/footgun#Noun) for some users, like myself when training small, custom, non-neural network models.

To prevent future undesired behavior, I added a warning when converting tensors with `requires_grad=True` to scalars. Now, when using `math.pow` on a `tensor`, we get a single warning with:
```python
x = torch.tensor(2.0, requires_grad=True)
y = math.pow(x,3)
# > UserWarning: Converting a tensor with requires_grad=True to a scalar may lead to unexpected behavior.
# Consider using tensor.detach() first.
```

Please let me know if you have any questions 👍
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143261
Approved by: https://github.com/malfet

Co-authored-by: albanD <desmaison.alban@gmail.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-04-01 00:42:46 +00:00
49b7d0d84d [ROCm] Enable more inductor UTs (#149513)
Primarily enable inductor fp8 tests, also enable other inductor tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149513
Approved by: https://github.com/jeffdaily
2025-04-01 00:30:36 +00:00
c75dac5f5c Fix typo (#150363)
Fixes https://github.com/pytorch/pytorch/issues/150339

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150363
Approved by: https://github.com/atalman, https://github.com/kwen2501
2025-03-31 23:58:37 +00:00
b48505a8a1 [MPS] Add support for hermite_polynomial_h. (#150279)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150279
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-03-31 23:30:19 +00:00
a2070e2fd5 [AOTInductor] Free tensors in test (#150274)
Summary:
This PR frees tensors that were new-ed within the test itself to prevent
a memory leak.

Test Plan:
Fixing tests itself.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150274
Approved by: https://github.com/chenyang78
2025-03-31 23:28:13 +00:00
982a7f7db0 [cachinghostallocator] remove the check on cudaHostRegister path (#150070)
Summary:
In the cudaHostAlloc path, the flag we used is `cudaHostAllocDefault` [0], which doesn't really have this strict enforcement (that the devicePtr retrieved from `cudaHostGetDevicePointer()` points to the same addr as the hostPtr) according to the guide [1]. This diff removes the check so that the host register path works for ROCm.

[0]6aca002d82/aten/src/ATen/cuda/CachingHostAllocator.cpp (L97)
[1] https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1gb65da58f444e7230d3322b6126bb4902

Test Plan: test_pinned_memory_with_cudaregister tests

Differential Revision: D71932562

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150070
Approved by: https://github.com/jeffdaily
2025-03-31 23:23:05 +00:00
981048854d Merge Triton ScaledMM as epilogue to MM template (#150045)
Previously, scaled_mm's (FP8 matmul) Triton lowering for inductor was in a separate template. This PR consolidates that lowering into the mm template, with an added epilogue to deal with multiplying the scales. This paves the way for future scaled variants of BMM, Grouped GEMM in inductor.

Currently, there is still a separate template for TMA+persistent version of scaled_mm. The current mm lowering has a separate template for TMA + Persistent version. Will hopefully consolidate the extra scaled_mm TMA+persistent template when the consolidation for the mm template is done.
TODO: Consolidate TMA+Persistent logic into 1 template and remove separate scaled_mm TMA template

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150045
Approved by: https://github.com/drisspg
2025-03-31 23:20:14 +00:00
91666eef60 Update gloo submodule (#150320)
That updates its CMake minimum version(via https://github.com/facebookincubator/gloo/pull/424 ) and removes cmake-4.0.0 workarounds for gloo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150320
Approved by: https://github.com/atalman
2025-03-31 22:40:27 +00:00
1526ff955e Revert "Add a warning when a tensor with requires_grad=True is converted to a scalar (#143261)"
This reverts commit 515b45e5693dbf9dd58d8472806cbe5f49e43074.

Reverted https://github.com/pytorch/pytorch/pull/143261 on behalf of https://github.com/clee2000 due to failing internal tests D72135661 ([comment](https://github.com/pytorch/pytorch/pull/143261#issuecomment-2767531682))
2025-03-31 22:19:08 +00:00
423e4a4568 [ROCm] cmake 4 workaround for hiprtc (#150324)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150324
Approved by: https://github.com/jeffdaily, https://github.com/atalman, https://github.com/malfet
2025-03-31 21:55:53 +00:00
4e2997db73 [ROCm][CI] Increase wheel build timeout from 210 to 240 (#150221)
Fixes #150046.  Increasing the timeout from 210 to 240.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150221
Approved by: https://github.com/jeffdaily
2025-03-31 21:46:09 +00:00
925fd4aa2e [export] min/max ranges for dim hints (#149590)
Differential Revision: D71522032

Adds min/max ranges to Dim.AUTO/DYNAMIC/STATIC, so users can do `Dim.AUTO(min=2, max=2048)`.
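
A hedged sketch of the new bounds in use; the model and dim mapping are made up for illustration:

```python
import torch
from torch.export import Dim, export

class M(torch.nn.Module):
    def forward(self, x):
        return x * 2

# Let dim 0 of `x` be discovered automatically, but constrained to [2, 2048].
ep = export(
    M(),
    (torch.randn(8, 16),),
    dynamic_shapes={"x": {0: Dim.AUTO(min=2, max=2048)}},
)
```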

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149590
Approved by: https://github.com/tugsbayasgalan
2025-03-31 21:32:20 +00:00
dfcd98e684 cd: Fix naming for windows arm64 libtorch builds (#150310)
Apparently the magical incantation to name these correctly lies in the
build_variant variable; otherwise it silently does nothing.

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150310
Approved by: https://github.com/atalman
2025-03-31 20:12:03 +00:00
80b7f6b704 Adjust TestInductorOpInfo to depend on backend, not device (#146911)
As is the case with many inductor tests, this test adapts test criteria based on device type, whereas it should be adjusting for the backend registered for that device.

In this particular case, using the upstream Triton CPU backend would lead to failures, as reference_in_float would be true because it is required for the C++/OpenMP backend, which does not have float16 support. However, most Triton backends do, and as such should be tested in float16. Similarly, a Triton backend with a device not described as a GPU would get skipped from testing entirely.

A more generic solution would be ideal, but this would require a lot of work across many tests.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146911
Approved by: https://github.com/masnesral
2025-03-31 18:24:16 +00:00
ab342d3793 Make PyTorch buildable by CMake-4.x on s390x (#150294)
This is a continuation of
https://github.com/pytorch/pytorch/pull/150203
that fixes nightly build on s390x.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150294
Approved by: https://github.com/malfet
2025-03-31 18:10:02 +00:00
5e34758cef [invoke_subgraph] Support unbacked (#149298)
Differential Revision: [D71420641](https://our.internmc.facebook.com/intern/diff/D71420641)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149298
Approved by: https://github.com/zou3519
2025-03-31 17:25:09 +00:00
284b766898 [dynamic shapes] C++ bindings for guard_or_false/true (#150148)
C++ version. Would like to add it in one place to prove it works, but couldn't find one that doesn't expose a chain of data-dependent changes... so just gonna put up the base implementation

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150148
Approved by: https://github.com/laithsakka, https://github.com/jingsh
2025-03-31 17:04:25 +00:00
47cdad2995 [ROCm] Enable several fsdp related UTs (#149369)
Enabling 26 UTs for ROCm in the following files:

-  distributed._shard.sharded_optim.test_sharded_optim - 2 UTs
-  distributed._shard.sharded_tensor.ops.test_binary_cmp - 4 UTs
-  distributed._shard.sharded_tensor.ops.test_init - 3 UTs
-  distributed._shard.sharded_tensor.ops.test_embedding - 2 UTs
-  distributed._shard.sharded_tensor.ops.test_embedding_bag - 2 UTs
-  distributed._composable.test_replicate_with_compiler - 4 UTs
-  distributed._composable.fsdp.test_fully_shard_grad_scaler - 1 UTs
-  distributed.tensor.test_attention - 4 UTs
-  distributed.tensor.test_matrix_ops - 1 UTs
-  distributed.tensor.test_tensor_ops - 1 UTs
-  distributed.fsdp.test_fsdp_grad_acc - 2 UTs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149369
Approved by: https://github.com/jeffdaily
2025-03-31 16:15:57 +00:00
7c858066ae Revert "Enable TMA persistent GEMM Template by default (#149427)"
This reverts commit b8ef642f04874e13a9f2771902ddb7514f294015.

Reverted https://github.com/pytorch/pytorch/pull/149427 on behalf of https://github.com/clee2000 due to failing tests internally D72116141 ([comment](https://github.com/pytorch/pytorch/pull/149427#issuecomment-2766672200))
2025-03-31 15:58:34 +00:00
57fa99c5c3 Revert "enable out variant of 2-shot reduction (#150153)"
This reverts commit cdeb32d2d1c31b60c65133e83510977c5c180005.

Reverted https://github.com/pytorch/pytorch/pull/150153 on behalf of https://github.com/clee2000 due to failing internal builds D72083877 ([comment](https://github.com/pytorch/pytorch/pull/150153#issuecomment-2766633712))
2025-03-31 15:43:24 +00:00
e57fa18b40 Revert "Add one_shot_all_reduce_copy to allow non-symm-mem allocated tensors to be reduced (#150129)"
This reverts commit 8a872261dcb3797557d1965af6832677a77efec1.

Reverted https://github.com/pytorch/pytorch/pull/150129 on behalf of https://github.com/clee2000 due to breaking internal builds D72080428 ([comment](https://github.com/pytorch/pytorch/pull/150129#issuecomment-2766619006))
2025-03-31 15:37:54 +00:00
f74d5d576a Update torch-xpu-ops commit pin to 3ee2bd2 (#150300)
Update the torch-xpu-ops commit to [3ee2bd2f13e1ed17a685986ff667a58bed5f2aa5](3ee2bd2f13)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150300
Approved by: https://github.com/EikanWang
2025-03-31 13:36:11 +00:00
bbb9b2476b Unify use of enableCollectiveHashDebug_ and trivial updates (#142865)
Use `enableCollectiveHashDebug_` instead of checking env ad-hoc when `TORCH_DISTRIBUTED_DEBUG = DETAIL`

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142865
Approved by: https://github.com/fegin, https://github.com/kwen2501
2025-03-31 12:23:30 +00:00
c158eac0de [ROCm] use correct workspace for hipblaslt, silence warning (#150227)
Follow up to #145130. That PR caused a warning on ROCm the first time hipblaslt was called for any workload, always.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150227
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-03-31 09:49:43 +00:00
51f0403f46 Update the baseline for max_autotune ci workflow (#149107)
Since the issue https://github.com/pytorch/pytorch/issues/148535 is fixed in PR https://github.com/pytorch/pytorch/pull/148923, update the baseline for max_autotune ci workflow.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149107
Approved by: https://github.com/chuanqi129, https://github.com/leslie-fang-intel, https://github.com/desertfire
2025-03-31 09:45:44 +00:00
4aded85e79 Fix space typo in warning message (#143473)
Warning shows up like this (no space between willbe):
```
/home/xxx/.local/lib/python3.11/site-packages/torch/distributed/fsdp/_state_dict_utils.py:827:
UserWarning: When using ``NO_SHARD`` for ``ShardingStrategy``, full_state_dict willbe returned.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143473
Approved by: https://github.com/mikaylagawarecki, https://github.com/kwen2501
2025-03-31 07:38:02 +00:00
c976321541 Use variadic length tuple for torch.masked.DimOrDims (#149870)
`tuple[int]` means only a tuple of length 1, which is not what was intended.

```python
loss = torch.masked.mean(loss, mask=mask, dim=(-1, -2))  # Argument of type "tuple[Literal[-1], Literal[-2]]" cannot be assigned to parameter "dim" of type "DimOrDims"
```
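
For reference, a minimal typing sketch of the distinction; the alias below is a simplified stand-in, not the actual `torch.masked.DimOrDims` definition:

```python
from typing import Optional, Union

# `...` makes the tuple variadic, so any length is accepted.
DimOrDims = Optional[Union[int, tuple[int, ...]]]

dims: DimOrDims = (-1, -2)  # now fine for the type checker
```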
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149870
Approved by: https://github.com/Skylion007
2025-03-31 07:06:58 +00:00
f1b74037b1 Fix bug when Inductor include path contains spaces (#148271)
This PR fixes a bug with how include directories with spaces are handled on Windows. I ran into an edge case with torch.compile() - it will error out with an exception on Windows. In particular, it will try to execute the following: `cl /I C:/Program Files/Python311/Include ...`, where `C:/Program` will be treated as separate from `Files/Python311/Include`.

I looked into using something like `shlex.quote` or `pathlib.Path`, but I didn't find those options to be suitable (shlex is POSIX shell only, pathlib.Path does not escape spaces).

There is another place in the function that also deals with escaping spaces. My fix follows the same style. 0ff2e6a85a/torch/_inductor/cpp_builder.py (L1464)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148271
Approved by: https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-03-31 06:46:05 +00:00
b99e0c5412 Fix mtia_extension.cpp setDevice() to correctly set current_device (#149398)
We referred to this code and found that there was a minor bug. Fix for future reference for others.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149398
Approved by: https://github.com/janeyx99
2025-03-31 06:07:22 +00:00
4f14224dc8 [Inductor] Fix torch.polygamma() when n == 1 (#147453)
Fixes #147450

Be consistent with cpu kernel:

77dbd28535/aten/src/ATen/native/cpu/UnaryOpsKernel.cpp (L433-L444)

Got this in the case:

```
Eager: tensor([1.2914e+15]), dtype: torch.float32
Compile: tensor([1.2914e+15]), dtype: torch.float32
Expected: tensor([6.5808e+32], dtype=torch.float64), dtype: torch.float64
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147453
Approved by: https://github.com/eellison
2025-03-31 05:27:46 +00:00
9456738edf [c10d][fr] Allow multiple writer registration with warnings (#150232)
The life span of the writer is actually the whole program, which is sub-optimal, but it is a practical compromise so that writer registration can happen outside PG creation.

So we decide to allow multiple writer registrations with warnings.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150232
Approved by: https://github.com/d4l3k, https://github.com/kwen2501
2025-03-31 04:43:43 +00:00
ad54b3aae2 test 0-dim squeeze in basic.TestSqueeze (#147928)
Replace the TODO with a 0-dim squeeze test and check that the scalar is unchanged in `basic.TestSqueeze`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147928
Approved by: https://github.com/janeyx99
2025-03-31 04:35:16 +00:00
c3bb174bb2 SubsetRandomSampler - changed iteration over tensor to iteration over list (#149126)
Digging further into the problem at https://github.com/UKPLab/sentence-transformers/pull/3261, it boils down to this expensive loop over a torch tensor. Looping over a list, like in RandomSampler, solves the issue (rough sketch below).
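
A rough sketch of the difference, assuming an `__iter__` of roughly this shape (see the PR for the real change):
```python
import torch

indices = list(range(100_000))
generator = torch.Generator()

# Slow path: iterating the permutation tensor yields a 0-dim tensor per step,
# which is expensive inside a tight Python loop.
def iter_tensor():
    for i in torch.randperm(len(indices), generator=generator):
        yield indices[i]

# Fix: convert to a plain Python list first, like RandomSampler does, so each
# step is an ordinary list lookup.
def iter_list():
    for i in torch.randperm(len(indices), generator=generator).tolist():
        yield indices[i]

print(sum(1 for _ in iter_list()))  # fast; iter_tensor() is far slower
```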

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149126
Approved by: https://github.com/divyanshk, https://github.com/cyyever
2025-03-31 04:33:35 +00:00
59abb8c7a2 Fix documentation build errors caused by unsupported section titles (#150205)
Fixes #150134

Build with `make html` looks OK now:
```shell
reading sources... [100%] torch.compiler_get_started .. xpu
looking for now-outdated files... none found
pickling environment... done
checking consistency... done
preparing documents... done
writing output... [ 80%] generated/torch.nn.Softsign .. generated/torch.nn.modules.module.register_module_full_backward_
writing output... [ 86%] generated/torch.nn.modules.module.register_module_module_registration_hook .. generated/torch.r
writing output... [100%] generated/torch.xpu.get_rng_state .. xpu
generating indices... genindex done
highlighting module code... [100%] typing
writing additional pages... search done
copying images... [100%] _static/img/torch_cuda_memory/allocator_state_history.png
copying static files... done
copying extra files... done
dumping search index in English (code: en)... done
dumping object inventory... done
build succeeded.

The HTML pages are in build/html.
```

New rendering looks like this:

![image](https://github.com/user-attachments/assets/af7e23a5-9dfd-4cb6-9333-a9e8cfe47ea0)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150205
Approved by: https://github.com/albanD
2025-03-31 04:27:44 +00:00
32afecff8b [PrivateUse1] Impl isBuilt() and isAvailable() (#149594)
Follow-up: #146098

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149594
Approved by: https://github.com/albanD
2025-03-31 04:18:38 +00:00
46c8f2e965 Update docstring to match code. (#148455)
Very tiny fix to the docstring. Passing grid_size=None results in an Exception.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148455
Approved by: https://github.com/mikaylagawarecki
2025-03-31 04:16:11 +00:00
ca2ffc23ab [ROCm][TunableOp] Stricter unit tests for online and offline tuning (#150142)
Improvements to unit tests and warnings for unsupported cases in offline tuning. Here are more details:
- Previously we only compared the OpSig for the untuned vs. tuned entries. This was not strict enough so we now compare OpSig+ParamSig.
- The main offline and online UTs are now stricter to make sure we exercise the code paths for the four combinations of transA and transB.
- Offline tuning does not support some tensor shapes. Emit warning and skip tuning.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150142
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-03-31 04:12:08 +00:00
157bff22f7 [Async TP] Fuse matmul-reduce-scatters when reduce scatters have multiple users, and save fused node for backward instead of reduce_scatter node (#149946)
Fixes #149876

## Stack
- [previous PR in stack] https://github.com/pytorch/pytorch/pull/149247

## TL;DR
This PR implements support in async TP for saving the reduce-scatter result for backward, which previously would break the torchtitan AC policies: no AC, per op SAC, and per layer SAC.

## Context
In torchtitan's LLama3 per op SAC policy, we want to save the output of `reduce_scatter` ops for backward, which is useful for TP. The reduce_scatter op is also saved for No AC (since all activations are saved) and per layer SAC (since we save the activations for N full layers, which do contain reduce-scatters for TP).

However, doing this causes incompatibility with Async TP for the AC policies above, for 2 reasons:

1) The graph pattern matching specifically only matches on reduce scatter nodes with 1 user, but reduce_scatter nodes saved for backwards will have 2 users (the 2nd one being the return/output node, which saves it for backward).

2) The subgraph replacement logic which replaces the users of the `wait_tensor` after the reduce-scatter with the new fused node has no mechanism to save the fused_node for backward instead of the reduce-scatter node. This means we cannot directly replace the subgraph, since we can't delete nodes which still have users (in this case, the output node is still using the reduce-scatter node).

To fix this, we do 2 things:

1) Add additional pattern matching logic to also match reduce-scatter nodes with 2 users, so we also perform fusion when reduce-scatter is saved for backward.

2) When replacing the subgraph with the fused node, detect if the reduce-scatter was saved for backward, and if so, save the result of the fused node for backward instead. This enables us to properly erase the subgraph and prevent the memory leak which occurred in #149876

## Other changes
- Continue to throw an error if we don't find any candidate all-gathers or reduce-scatters for fusion (since TP should have both) but DON'T throw an error if we don't fuse any matmul-reduce-scatters. This is because I've found there are actually valid graphs where we do fuse reduce scatters in the forward graph but not the backward graph (in the backward pass there are reduce-scatters but the producer op is an "add" not a mm/scaled_mm).

## Test plan

1. All unit tests are passing
2. Visualized the graphs and verified the fusion is occurring properly.
3. Verified via manual torchtitan runs there is no memory leak / OOM occurring anymore.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149946
Approved by: https://github.com/fegin
2025-03-30 19:05:47 +00:00
cbc0964636 Store statically launchable CachingAutotuners inside CompiledFXGraph.triton_bundle (#149054)
This PR adds CachingAutotuners that are statically launchable to FXGraphCache's cache entry.

Regular CachingAutotuners, with triton kernels attached to them, are not very good to cache: they are very large, and take huge amounts of space since they track all of the various binary files, along with various metadata. We could probably figure out what information we could delete from the kernel and have it still work, but with StaticCudaLauncher, we no longer have to. Instead, we can cache every compiled triton kernel that is statically launchable.

Because StaticTritonCompileResult is serializable, and designed to have a very small memory footprint, we can save it into FXGraphCache without increasing the cache size significantly. We store it as a part of CompiledFxGraph.triton_bundle.

Then, on load, we repopulate the CachingAutotuner into our CompiledTritonKernel cache.

The upsides of this are many:
- We no longer need to call into a separate process on cache hit
- We can *guarantee* that the triton kernel we got from our cache entry is the one we use to launch again, so no worries about triton's own caching logic
- Once we achieve feature parity and all torch.compiled triton kernels are statically launchable, we can clean up a bunch of TritonBundler code and simplify the cache hit logic.

Fixes #149449

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149054
Approved by: https://github.com/oulgen
2025-03-30 17:51:11 +00:00
e91f84c87d [BE]: Update cudnn frontend submodule to 1.11.0 (#149759)
Update cuDNN frontend submodule to 1.11.0. Adds some new features like score_mod from flex_attention and adds a lot of bugfixes and new feature knobs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149759
Approved by: https://github.com/jansel
2025-03-30 17:14:26 +00:00
515b45e569 Add a warning when a tensor with requires_grad=True is converted to a scalar (#143261)
Fixes #143071

Operations performed on tensors with `requires_grad=True` such as
```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 3
```
and
```python
x = torch.tensor(2.0, requires_grad=True)
y = torch.pow(x,3)
```
are valid operations.

While an operation using `numpy` like
```python
import numpy as np

x = torch.tensor(2.0, requires_grad=True)
y = np.pow(x,3)
# > RuntimeError: Can't call numpy() on Tensor that requires grad. Use tensor.detach().numpy() instead.
```
leads to an error.

However, an operation that uses `math` like
```python
import math

x = torch.tensor(2.0, requires_grad=True)
y = math.pow(x,3)
```
does not cause an error, and `y` is no longer a tensor with a gradient!

This represents a [footgun](https://en.wiktionary.org/wiki/footgun#Noun) for some users, like myself when training small, custom, non-neural network models.

To prevent future undesired behavior, I added a warning when converting tensors with `requires_grad=True` to scalars. Now, when using `math.pow` on a `tensor`, we get a single warning with:
```python
x = torch.tensor(2.0, requires_grad=True)
y = math.pow(x,3)
# > UserWarning: Converting a tensor with requires_grad=True to a scalar may lead to unexpected behavior.
# Consider using tensor.detach() first.
```
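
For reference, the usual ways to avoid the warning, as the message suggests (a small illustration, not part of the PR):
```python
import math
import torch

x = torch.tensor(2.0, requires_grad=True)

y = math.pow(x.detach(), 3)   # explicit: no warning, and clearly no gradient
z = torch.pow(x, 3)           # stays a tensor, so autograd keeps working
z.backward()
print(y, x.grad)              # 8.0 tensor(12.)
```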

Please let me know if you have any questions 👍
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143261
Approved by: https://github.com/albanD

Co-authored-by: albanD <desmaison.alban@gmail.com>
2025-03-30 11:19:07 +00:00
e8a11f175e [BE] Use auto in MPS codebase more (#150000)
Non-trivial (but still no-op) changes:
- Replace `[mpsGraph broadcastTensor:[mpsGraph constantWithScalar:1 dataType:MPSDataTypeInt32] toShape:inputTensor.shape name:nil]` with `[mpsGraph constantWithScalar:1 dataType:MPSDataTypeInt32 shape:inputTensor.shape]`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150000
Approved by: https://github.com/dcci, https://github.com/cyyever
2025-03-30 05:35:58 +00:00
005c9b2f4f Fix _Waitcounter decorator and add backward pass wait counter (#150235)
Summary:
This will log a wait counter for backward compile and fixes weirdness with nested context managers.

The old wait counters added through dynamo_timed were never affected by the nesting issue. I am also changing the key nomenclature from `pytorch.dynamo_timed` to `pytorch.wait_counter`; we want to use the same nomenclature to make it easy to find keys.

Reviewed By: jamesjwu

Differential Revision: D72032055

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150235
Approved by: https://github.com/jamesjwu, https://github.com/masnesral
2025-03-30 05:20:12 +00:00
cc58ecceea Move dump location to avoid dumping twice (#150219)
Summary:
If we put the dumping code in codegen, we might get a separate node_mapping dump for the constant folded graph (https://github.com/pytorch/pytorch/blob/main/torch/_inductor/compile_fx.py#L1119).

We move it into compile_fx.py so there's only one node_mapping dump.

Test Plan: CI

Reviewed By: YUNQIUGUO

Differential Revision: D72068715

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150219
Approved by: https://github.com/YUNQIUGUO
2025-03-30 03:35:38 +00:00
3140565db6 Update type of create_block_mask to more accurately reflect things (#150244)
Fixes some mypy issues
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150244
Approved by: https://github.com/drisspg
2025-03-29 21:55:57 +00:00
879a293db8 fix et trace collection of all_to_all (#149485)
![image](https://github.com/user-attachments/assets/1e602dec-24a4-4f47-88c0-9311737e217b)
![image](https://github.com/user-attachments/assets/c48a3273-43fb-4a7f-9341-b90cb6b10785)

fix ET trace collection of all_to_all.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149485
Approved by: https://github.com/shengfukevin, https://github.com/kwen2501
2025-03-29 20:17:24 +00:00
965784eb9b [MPSInductor] Specify max_total_threads_per_threadgroup (#150247)
Specify this when generating a reduction kernel; otherwise the compiler can unroll loops so much that the kernel cannot be launched with the intended threadgroup size.

Extend `c10::metal::max` to accept different dtypes

Together this fixes `test_large_broadcast_reduction`

TODO:
  - Explore different threadgroup_sizes for best perf

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150247
Approved by: https://github.com/jansel, https://github.com/dcci
ghstack dependencies: #150246
2025-03-29 19:37:15 +00:00
52135db69a [BE] Fix signed/unsigned comparison warning (#150246)
One will see them only if compilation fails, but still
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150246
Approved by: https://github.com/cyyever, https://github.com/jansel
2025-03-29 15:12:42 +00:00
3b00ff8850 Revert "[Profiler] Give non-zero default values to start events (#149757)"
This reverts commit bc72420bcb37390af3fced885e019903e6e425bd.

Reverted https://github.com/pytorch/pytorch/pull/149757 on behalf of https://github.com/malfet due to Broke windows builds, which were also the signal on the HUD ([comment](https://github.com/pytorch/pytorch/pull/149757#issuecomment-2763461365))
2025-03-29 15:08:55 +00:00
f3c77b2458 Set requires grad in TensorMaker::make_tensor() (#148255)
Fixes #146419

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148255
Approved by: https://github.com/soulitzer
2025-03-29 08:06:42 +00:00
b8ef642f04 Enable TMA persistent GEMM Template by default (#149427)
Previously, this could not be landed given the limited H100 capacity for CI testing. Benchmarking on H100 CI looks good now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149427
Approved by: https://github.com/drisspg
2025-03-29 07:32:42 +00:00
bc72420bcb [Profiler] Give non-zero default values to start events (#149757)
The intent of the existing code is to

> // Assign system TIDs to start events based on the system TID of the next
> // observed event with the same Python TID.

However, if there are start events that don't share the same Python TID as later observed events, then they are left with the default initialization of DeviceAndResource and assigned values of `0`. This is problematic because Kineto uses `device=0, resource=0` for the first GPU (or other backend) device.

This PR maintains the previous logic of using TIDs from later events if any are present, but defaults to the current process and system thread IDs if there aren't later events to reference.
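
A minimal sketch of the fallback described above (names are illustrative; the actual logic is in the profiler's C++ post-processing):
```python
import os
import threading

# Prefer the IDs observed on a later event with the same Python TID; otherwise
# fall back to the current process/thread instead of the 0/0 default, which
# Kineto reserves for the first device.
def resolve_start_event_ids(later_event_ids=None):
    if later_event_ids is not None:
        return later_event_ids
    return (os.getpid(), threading.get_native_id())

print(resolve_start_event_ids())           # e.g. (12345, 67890)
print(resolve_start_event_ids((42, 7)))    # (42, 7)
```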

This issue was discovered while working to implement a custom backend and some CPU start events were appearing on the same process and thread as the device in the trace.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149757
Approved by: https://github.com/sraikund16
2025-03-29 06:29:25 +00:00
ec6fa547a1 Remove unnecessary "special linking" for BLAS_LIBRARIES (#145487)
Remove the "special linking" that involves listing `BLAS_LIBRARIES` thrice if `TH_BINARY_BUILD` is set, as it should not be any different from listing it just once.

The code seems to date back to commit cfcf2af95f91a88ec61cbcac8b30a718e7332aa5. The original code already listed `BLAS_LIBRARIES` thrice, but it provided no explanation for doing that — and without `TH_BINARY_BUILD`, BLAS was not linked at all.  The current version seems to originate in d6a8d28d6529a4f0b80a8c046ca9c36ca6c8b347 — and it already provided an `ELSE` clause listing `BLAS_LIBRARIES` only once.  From this, I suspect that it is probably an unnecessary leftover.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145487
Approved by: https://github.com/malfet
2025-03-29 05:13:22 +00:00
2c9e07ecd2 [BE] Remove outdated RPC benchmark (#146716)
We have lots of outdated unused + uncalled code in our codebase, namely in our benchmarks and examples folders among others. The last change to this directory was 4 years ago and this code looks dead. cc @albanD @H-Huang for feedback

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146716
Approved by: https://github.com/Skylion007, https://github.com/H-Huang
2025-03-29 04:44:36 +00:00
beea76020b Removed ROCM ifdef that governs thread count + smem parallel reduction. (#149779)
#149548 fixed the arbitrarily missing parallelism for NLL, but it also added an arbitrary #ifdef ROCM guard around the fix, preventing its use on CUDA GPUs. There was also a problem with the way the kernel reduced the intermediate shared memory, using only thread 0 walking it linearly; this has been changed to a simple parallel reduction algorithm (sketched below).
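
A toy Python sketch of the shared-memory tree reduction that replaces the serial walk (purely illustrative; the real change is in the CUDA kernel):
```python
# Purely illustrative: on the GPU the adds within each step run in parallel,
# one per thread, and the stride halves until a single partial sum remains.
def tree_reduce(smem):
    stride = len(smem) // 2          # assume a power-of-two threadgroup size
    while stride > 0:
        for tid in range(stride):    # these additions are the per-thread work
            smem[tid] += smem[tid + stride]
        stride //= 2
    return smem[0]

print(tree_reduce([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]))  # 36.0
```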

Tested changes with `python3 test/test_nn.py`

```
Ran 3551 tests in 200.554s

OK (skipped=998, expected failures=4)
```

Performance before and after with the script below with an RTX 3090, batch size x axis, time (sec) y axis. This GPU is also used for display graphics and such, so the measurements are pretty noisy, even with 100 samples.

## Before
![before_nll](https://github.com/user-attachments/assets/c19044aa-7bc2-4223-b560-9be7acedef35)

## After ifdef removal
![after_nll](https://github.com/user-attachments/assets/4672f5ca-93b0-4c34-a257-81b2ab364995)

## After Parallel SMEM reduction

![after_reduction](https://github.com/user-attachments/assets/9607b68c-7d9d-4ee0-9f99-8989d134e4fd)

```python
import torch
from matplotlib import pyplot as plt
from torch.nn import functional as F

timing = []
batches=  list(range(32, 4096, 32))

for batch in [32] + batches:
    samples = []
    for _ in range(100):
        probs = torch.rand(batch, 10).cuda()
        labels = torch.randint(0, 10, (batch,)).cuda()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        F.nll_loss(probs, labels)
        end.record()
        torch.cuda.synchronize()
        elapsed = start.elapsed_time(end)
        samples.append(elapsed)
    timing.append(sum(samples) / len(samples))
timing = timing[1:]

plt.plot(batches, timing)
plt.show()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149779
Approved by: https://github.com/jeffdaily
2025-03-29 04:27:54 +00:00
a8dd9b6c27 [cuDNN][SDPA] abide by enable_gqa convention in cuDNN (#149976)
long overdue

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149976
Approved by: https://github.com/drisspg, https://github.com/Skylion007
2025-03-29 04:24:51 +00:00
340beb7f7c Add .editorconfig (#149193)
This adds an .editorconfig file to automatically configure devs' local editors/IDEs with the basic formatting rules of the project.

List of supported editors: https://editorconfig.org/#pre-installed

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149193
Approved by: https://github.com/malfet
2025-03-29 04:07:21 +00:00
66a7a49d64 Super tiny fix typo (#149190)
... when checking the doc to build from source
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149190
Approved by: https://github.com/jingsh
2025-03-29 04:06:05 +00:00
5e787bf3e5 [reland] Support torchbind in OSS proxy executor (#150196)
Summary:
The original Diff D69500038 was reverted due to a false alarm on trunk health.

Implement torchbind support in OSSProxyExecutor.

Exactly the same as the implementation in FbProxyExecutor.

D69693697 - fbProxyExecutor
D69887230 - fbProxyExecutor but for torchbind method
D70746626 - Support None output type

Other changes:

- When generating the schema of the CallTorchBind HOP, the arg name of the torchbind object arg should be the same as the torchbind method's torchbind object arg (instead of `obj`).

- In `AOTIModelPackageLoader`, we extract everything in `data/constants` to `tmp_dir/data/aot_inductor/<model>/` folder, so the torchbind objs exist in the same folder as the rest of the files (e.g. cpp, so). This is to be consistent with how files are packaged internally (more details in internal Diff summary).

Note on using `filesystem`:

Seems like there'll be [issues](https://github.com/pytorch/pytorch/pull/137209) with using the `filesystem` header on Linux, so here I use string manipulation instead of `filesystem::path`.

Test Plan:
```
test/inductor:torchbind -- -r torchbind_aoti
test/inductor:torchbind -- -r aot_compile
```

Differential Revision: D72063691

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150196
Approved by: https://github.com/hl475, https://github.com/desertfire
2025-03-29 03:36:55 +00:00
0861af2596 [pytorch][triton] Warp specialization support in TritonTemplate for torchinductor (#148503) (#150122)
Summary:
Currently only `num_warps` and `num_stages` are supported as kernel options for inductor auto-tuning using `TritonTemplate`.

In order to allow warp specialization, the kernel options should also allow specifying `num_consumer_groups` and `num_buffers_warp_spec`.

NOTE: Currently gating changes to FBCODE using HAS_WARP_SPEC which is only available on triton/release-3.3.x

Test Plan:
## Unit test
Added `test_triton_template_warp_specialization` to verify that the generated kernel contains configs for `num_consumer_groups` and `num_buffers_warp_spec`.

## Functional Testing
Specific to flexattention.
```
import torch
from torch.nn.attention.flex_attention import flex_attention

from triton.testing import do_bench

make_tensor = lambda: torch.rand(8, 16, 8192, 128, device="cuda", dtype=torch.bfloat16)
q, k, v = make_tensor(), make_tensor(), make_tensor()

flex_compiled = torch.compile(flex_attention, fullgraph=True)

print(do_bench(lambda: flex_compiled(q, k, v, kernel_options={"num_warps": 4})))
```

triton do_bench results:
- default compile: 15.176783561706543
- with warp-spec: 9.452800750732422

## Extra notes
- generated triton kernel using `TORCH_LOGS=output_code`: P1740612877
- TTGIR for fused kernel: P1740614685

Differential Revision: D71982587

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150122
Approved by: https://github.com/eellison, https://github.com/zou3519, https://github.com/jansel
2025-03-29 03:36:50 +00:00
03313c6619 [AOTInductor] Add function for users to extract constants in container (#150163)
Summary: Add extract_constant_map that allows users to inspect the constants being used by AOTInductor

Test Plan:
`python test/inductor/test_aot_inductor.py -k extract_constants_map`

`LD_LIBRARY_PATH=/data/users/$USER/pytorch/build/lib /data/users/$USER/pytorch/build/bin/test_aoti_inference`

Differential Revision: D72020400

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150163
Approved by: https://github.com/chenyang78
2025-03-29 03:36:12 +00:00
7a470c9320 [ROCm] change preferred blas lib defaults (#150212)
Fixes #148883
Fixes #150155

Also adds at::BlasBackend::Default. Instinct cards prefer hipBLASLt; everything else prefers rocBLAS.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150212
Approved by: https://github.com/jeffdaily
2025-03-29 03:33:07 +00:00
29b3fdab01 TCPStoreLibUvBackend: support masterListenFd (#150215)
This supports `masterListenFd` which is required for full compatibility with the non-libuv TCPStore. The code was just missing a `uv_listen` call and now it works just fine.

This is required to migrate the last remaining uses of TCPStore off of the non-libuv backend.

Test plan:
```
pytest -v test/distributed/test_store.py -k test_take_over_listen_socket
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150215
Approved by: https://github.com/fduwjj
2025-03-29 01:58:07 +00:00
493c7fa66f [Cmake] Make PyTorch buildable by CMake-4.x (#150203)
By turning on compatibility mode for protobuf, nnpack, PSimd and FP16, ittapi, TensorPipe and Gloo
Update CMake requirements

 Revert 0ece461ccafe5649d2d0f058ff5477765fd56499 and b0901d62ae2c2e909f91401eacebf3731df20cbe to test that it actually works

TODO:
  - Update/get rid of those libraries

Fixes https://github.com/pytorch/pytorch/issues/150149

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150203
Approved by: https://github.com/clee2000
2025-03-29 01:39:13 +00:00
edb6f1b7a8 Move MacOS inductor tests to M2-15 runner (#150228)
To get more representative results (and be able to run more tests eventually)
Also run the workflow on pull_request (in addition to workflow dispatch) if the yml file is modified
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150228
Approved by: https://github.com/clee2000
2025-03-29 01:36:07 +00:00
65139eb050 if blaslt fails, fall back to blas (#150147)
Fixes #150016.

This is implemented for both cublaslt and hipblaslt. On failure, gemm_and_bias falls back to the unfused path, and an Lt gemm falls back to a plain gemm even if the gemm preference is set to Lt (conceptual sketch below).
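
A conceptual Python sketch of the policy only; the real fallback lives in ATen's C++ gemm paths, and the names here are made up:
```python
def gemm_with_fallback(lt_gemm, plain_gemm, *args):
    try:
        return lt_gemm(*args)      # preferred cublasLt / hipBLASLt path
    except RuntimeError:
        return plain_gemm(*args)   # on failure, fall back to cuBLAS / rocBLAS

def flaky_lt_gemm(a, b):
    raise RuntimeError("blaslt refused this problem shape")

def plain_gemm(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

print(gemm_with_fallback(flaky_lt_gemm, plain_gemm, [[1, 2]], [[3], [4]]))  # [[11]]
```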

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150147
Approved by: https://github.com/malfet
2025-03-28 23:39:53 +00:00
ccfde4dadf Revert "Move MacOS inductor tests to M2-15 runner (#150228)"
This reverts commit b1b58708b26a840f6bf0ccdd14a9916ff7291fb4.

Reverted https://github.com/pytorch/pytorch/pull/150228 on behalf of https://github.com/malfet due to  Should not have ignored lint signal ([comment](https://github.com/pytorch/pytorch/pull/150228#issuecomment-2762794366))
2025-03-28 23:05:27 +00:00
b1b58708b2 Move MacOS inductor tests to M2-15 runner (#150228)
To get more representative results (and be able to run more tests eventually)
Also run the workflow on pull_request (in addition to workflow dispatch) if the yml file is modified
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150228
Approved by: https://github.com/clee2000
2025-03-28 22:15:40 +00:00
7ac0658757 Revert "[CI] Fix docker builds failing due to cmake update by setting CMAKE_POLICY_VERSION_MINIMUM (#150220)"
This reverts commit 87549a65c96cd7e48f024c02e7daa3f227b2bf18.

Reverted https://github.com/pytorch/pytorch/pull/150220 on behalf of https://github.com/clee2000 due to doesn't solve the problem since the installed cmake 4 stays on the system, resulting in failed pytorch builds later ([comment](https://github.com/pytorch/pytorch/pull/150220#issuecomment-2762623078))
2025-03-28 21:44:03 +00:00
4271ebdbdc Explicitly state that a test-infra branch cut is required (#150214)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150214
Approved by: https://github.com/atalman
ghstack dependencies: #150210, #150211, #150213
2025-03-28 21:13:29 +00:00
2b2286c4ec Update reference for binary_build workflows (#150213)
There hasn't been a circleci for a looooong time
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150213
Approved by: https://github.com/atalman
ghstack dependencies: #150210, #150211
2025-03-28 21:13:29 +00:00
4118d7307f Update referenced PRs for ecosystem library branch cut (#150211)
The old PRs had a lot of extra changes in them which are no longer needed
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150211
Approved by: https://github.com/atalman
ghstack dependencies: #150210
2025-03-28 21:13:22 +00:00
f231500c50 Mention the cherry-picker bot in the release docs (#150210)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150210
Approved by: https://github.com/atalman
2025-03-28 21:13:15 +00:00
87549a65c9 [CI] Fix docker builds failing due to cmake update by setting CMAKE_POLICY_VERSION_MINIMUM (#150220)
Set the CMAKE_POLICY_VERSION_MINIMUM env var to make executorch and halide docker builds pass (they install from those repos which don't have cmake pinned)

This can be removed if executorch and halide update their builds and we update the hash?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150220
Approved by: https://github.com/atalman, https://github.com/malfet
2025-03-28 20:55:04 +00:00
cb83850a24 Fix docs format error in torch.nn (#150156)
Fixes #150152

Fix format error in [torch.nn.CosineSimilarity](https://pytorch.org/docs/stable/generated/torch.nn.CosineSimilarity.html#torch.nn.CosineSimilarity), [torch.nn.KLDivLoss](https://pytorch.org/docs/stable/generated/torch.nn.KLDivLoss.html#torch.nn.KLDivLoss) and other pages.

## Test Result

### Before

#### torch.nn.CosineSimilarity

![Image](https://github.com/user-attachments/assets/1ad633d9-dfaf-43f0-a536-9035a24bf858)

#### torch.nn.KLDivLoss

![Image](https://github.com/user-attachments/assets/20a001b0-1f66-414e-b554-11934d65a4bf)

### After
#### torch.nn.CosineSimilarity
![image](https://github.com/user-attachments/assets/a2d9ea8d-5637-4604-a0e4-9231a4deee44)

#### torch.nn.KLDivLoss
![image](https://github.com/user-attachments/assets/d0e319f9-a3b3-47a7-b2f8-060d46d53bc7)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150156
Approved by: https://github.com/cyyever, https://github.com/malfet
2025-03-28 20:54:09 +00:00
7c65911b11 [MPS] Fix dot/mm for conj_tensors (#150157)
- Distinguish between conjugated/non_conjugated inputs by appending conjugation to the operator key
- For matmul or dot, add `conjugateWithTensor:name:` calls before running the op
- Enable testing for conjugated ops by passing `include_conjugated_inputs` to opinfo
- Filter  `include_conjugated_inputs` argument from `sample_inputs_window` (probably should have landed as separate PR)
- Preserve conj property when gathering the views, that fixes `cov` operator

Fixes https://github.com/pytorch/pytorch/issues/148156
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150157
Approved by: https://github.com/dcci
2025-03-28 20:36:44 +00:00
9092dd2e82 [CI] Disable some tests that are failing in periodic (#150059)
Disabling some tests to restore periodic

nogpu avx512 timeout:
59f14d19ae (38492953496-box)

profiler failure: 7ae0ce6360 (38461255009-box)

test_accelerator failure:
87bfd66c3c (39476723746-box)
origin: 146098

test_overrides failure:
bf752c36da (39484562957-box)
origin: 146098

inductor cpu repro:
bb9c426024 (38447525659-box)

functorch eager transforms:
8f858e226b (39488068620-box)
f2cea01f71 (39555064878)
b5281a4a18 (39599355600)
either 148288 or 148261?

2ec9aceaeb/1

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150059
Approved by: https://github.com/ZainRizvi, https://github.com/atalman, https://github.com/malfet
2025-03-28 20:31:32 +00:00
2bd5bfa3ce [ROCm] use magma-rocm tarball for CI/CD (#149986)
Follow-up to #149902.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149986
Approved by: https://github.com/malfet
2025-03-28 19:28:50 +00:00
cdeb32d2d1 enable out variant of 2-shot reduction (#150153)
Per title, this version uses the symm mem input both as the input source and as a work buffer, so the input is modified once the op completes (similar to what the fbgemm car reduction does). It is intended to be wrapped in an op that would first copy the real inputs to symm mem buffers that wouldn't be exposed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150153
Approved by: https://github.com/xw285cornell
2025-03-28 19:06:03 +00:00
35ff5084e6 [CI] Remove the xpu env source for linux binary validate (#150138)
Because we have enabled the XPU runtime PyPI packages as dependencies directly

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150138
Approved by: https://github.com/atalman
2025-03-28 17:25:37 +00:00
85079e4380 [TD] Enable TD on distributed cpu (#150028)
Enable TD on distributed cpu; I think the only reason it's not enabled already is that I forgot to enable it

Get rid of some of the statements that are no ops:
* asan uses default shard
* nogpu got moved to periodic
* no windows cuda testing anymore

Only thing on pull and trunk that doesn't use TD is dynamo_wrapped but I think it's fast enough to be ok for now, we can take another look after this
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150028
Approved by: https://github.com/ZainRizvi
2025-03-28 17:19:11 +00:00
cf7447ae99 Revert "cpp_wrapper: Fix even more tests (#147225)"
This reverts commit d25acac357ff8663a7787e57e6bc5e69987a8f9a.

Reverted https://github.com/pytorch/pytorch/pull/147225 on behalf of https://github.com/yangw-dev due to broke test internally test/inductor/test_benchmark_fusion ([comment](https://github.com/pytorch/pytorch/pull/147225#issuecomment-2761944564))
2025-03-28 17:07:52 +00:00
e691fcae0e Revert "cpp_wrapper: precompile a few more commonly used headers, and improve RAIIPyObject interface (#149350)"
This reverts commit 2b20d1433f4e5c7556fe4679d89b8f795990d494.

Reverted https://github.com/pytorch/pytorch/pull/149350 on behalf of https://github.com/yangw-dev due to broke test internally test/inductor/test_benchmark_fusion ([comment](https://github.com/pytorch/pytorch/pull/147225#issuecomment-2761944564))
2025-03-28 17:07:52 +00:00
b0901d62ae Pin cmake to 3.31.2 for windows conda install (#150185)
Trying to fix nightly failures
Cmake 4.0 update https://pypi.org/project/cmake/4.0.0/ broke nightly builds
You can see it here: https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=cuda11_8-build
and here: https://hud.pytorch.org/hud/pytorch/pytorch/nightly/1?per_page=50&name_filter=
This fixes Windows builds; Linux and MacOS were already fixed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150185
Approved by: https://github.com/jeanschmidt, https://github.com/ZainRizvi
2025-03-28 17:03:02 +00:00
a469ddc663 [inductor] No type promotion for slice_scatter (#150090)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150090
Approved by: https://github.com/eellison, https://github.com/zou3519
ghstack dependencies: #149087, #149667, #150036, #148953
2025-03-28 17:02:01 +00:00
1bdf996e7a [CI] Fix log artifact not containing test logs? (#149577)
Sometimes I would find a log artifact that only has usage_logs.txt in it, even though there are other logs created by tests.  I think this is somehow caused by output buffering with find.  I don't understand how, but at the very least, I can see that all the jobs on this PR have the logs from the test runs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149577
Approved by: https://github.com/ZainRizvi
2025-03-28 17:00:00 +00:00
d5a8bd0688 [CI][docker] Use multistage build for triton (#149413)
Seems to reduce docker pull times by ~3 min if triton is requested; some compressed docker sizes seem to have decreased by about 1/3

Also add check that triton is installed/not installed

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149413
Approved by: https://github.com/malfet
2025-03-28 16:07:19 +00:00
0ece461cca Pin cmake==3.31.6 (#150158)
I'm not sure if this is the right thing to do, but cmake 4.0.0 got released on pypi and our builds are failing with it

Example:
aa70d62041 (39555975425-box)

I guess we have to go change all the cmake_minimum_required to >=3.5?

backwards compat still failing because its building with the base commit which this pr can't really change until it gets merged, but at least manywheel binary builds got past where they were originally failing

Also pin the conda installation, but the most recent version on conda is 3.31.2
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150158
Approved by: https://github.com/cyyever, https://github.com/malfet
2025-03-28 15:49:17 +00:00
350a479146 Fix test failures on non-x86 Linux (#148445)
The cpp contexts are only supported on x86 Linux.
The tests requiring them are skipped on non-Linux, but not when the architecture is not x86.
In most places the check is for ARM64, which is not enough; a check for x86 is required instead.

Fix the test decorators and factor out a common one in test_cuda.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148445
Approved by: https://github.com/eellison
2025-03-28 15:27:44 +00:00
d2c0c65ea1 [Dynamo] Add debug linting option for graph dedupe (#150053)
As title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150053
Approved by: https://github.com/StrongerXi, https://github.com/anijain2305
2025-03-28 14:27:09 +00:00
25309a17f0 [aotd] Config to guess_tangents_stride (#150035)
Differential Revision: [D71907684](https://our.internmc.facebook.com/intern/diff/D71907684)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150035
Approved by: https://github.com/ilyas409, https://github.com/seemethere
2025-03-28 13:54:19 +00:00
7c4e49750e Revert "Store statically launchable CachingAutotuners inside CompiledFXGraph.triton_bundle (#149054)"
This reverts commit c16af5d7984872b6ae81476d6cae64bddb7ce664.

Reverted https://github.com/pytorch/pytorch/pull/149054 on behalf of https://github.com/jamesjwu due to Sorry I forgot to fix one last test ([comment](https://github.com/pytorch/pytorch/pull/149054#issuecomment-2761381443))
2025-03-28 13:35:07 +00:00
c16af5d798 Store statically launchable CachingAutotuners inside CompiledFXGraph.triton_bundle (#149054)
This PR adds CachingAutotuners that are statically launchable to FXGraphCache's cache entry.

Regular CachingAutotuners, with triton kernels attached to them, are not very good to cache: they are very large, and take huge amounts of space since they track all of the various binary files, along with various metadata. We could probably figure out what information we could delete from the kernel and have it still work, but with StaticCudaLauncher, we no longer have to. Instead, we can cache every compiled triton kernel that is statically launchable.

Because StaticTritonCompileResult is serializable, and designed to have a very small memory footprint, we can save it into FXGraphCache without increasing the cache size significantly. We store it as a part of CompiledFxGraph.triton_bundle.

Then, on load, we repopulate the CachingAutotuner into our CompiledTritonKernel cache.

The upsides of this are many:
- We no longer need to call into a separate process on cache hit
- We can *guarantee* that the triton kernel we got from our cache entry is the one we use to launch again, so no worries about triton's own caching logic
- Once we achieve feature parity and all torch.compiled triton kernels are statically launchable, we can clean up a bunch of TritonBundler code and simplify the cache hit logic.

Fixes #149449

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149054
Approved by: https://github.com/oulgen
2025-03-28 13:28:05 +00:00
d4da0e955e [Dynamo] Fix is_compile_supported() when device_type contains device index (#147837)
Fixes #147826

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147837
Approved by: https://github.com/anijain2305
2025-03-28 07:16:29 +00:00
103bf64a3c [export] refactor _Dim into Dim (#149891)
Summary: forward fix T218515233

Test Plan: test_export

Differential Revision: D71769231

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149891
Approved by: https://github.com/jingsh, https://github.com/angelayi
2025-03-28 06:19:03 +00:00
f649ee73ce Use source hashing to generate consistent symbolic ids (#149665)
This PR was inspired by internal models that were cache missing due to PGO. At a high level the problem looks as follows

Run 1, Invocation 1: We do static compile, save some example values in PGO/automatic dynamic

Run 1, Invocation 2: We detect varying inputs, do dynamic compile, get a dynamic graph and save to PGO. Crucially what we save to PGO is actually a superset of what is actually dynamic. If we notice an input was varying, we mark it as dynamic in PGO even if later on that value gets specialized. When a value gets specialized, we actually remove the symbol from the graph. This results in an interesting conundrum where although we are producing the same isomorphic graph, PGO makes the second run cache miss. Let's see how....

Run 2, Invocation 1: We fetch the PGO, over-mark things as dynamic, get a fx graph, look it up in the cache and... whoops! cache miss! This is because of the aforementioned behavior where the PGO profile will cause us to over-allocate symbols. In practice this means we end up saving a graph in cache with symbols x:s1, y:s3 and on second attempt we cache miss with x:s1, y:s6 where symbols s3,s4,s5 were all optimistically marked dynamic by PGO and subsequently specialized.

We solve this problem by hashing the source names, which gives a somewhat stable assignment. To prevent catastrophic symbol collisions, we use linear probing (a sketch of the idea is below).
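
A toy sketch of the idea (not the actual shape-env code): derive a symbol id from a deterministic hash of the source name and linearly probe to the next free slot on collision.
```python
import hashlib

def assign_symbols(source_names, table_size=1 << 16):
    taken = {}       # slot -> source name
    symbols = {}     # source name -> symbol
    for name in source_names:
        slot = int.from_bytes(hashlib.sha256(name.encode()).digest()[:8], "big") % table_size
        while slot in taken and taken[slot] != name:
            slot = (slot + 1) % table_size   # linear probing on collision
        taken[slot] = name
        symbols[name] = f"s{slot}"
    return symbols

# The mapping depends only on the source names, not on how many other symbols
# were optimistically allocated and later specialized away.
print(assign_symbols(["L['x'].size()[0]", "L['y'].size()[0]"]))
```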

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149665
Approved by: https://github.com/Mingming-Ding, https://github.com/laithsakka
2025-03-28 05:36:32 +00:00
c49315e645 Improve attr mismatch msg (#149576)
Differential Revision: [D71513041](https://our.internmc.facebook.com/intern/diff/D71513041)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149576
Approved by: https://github.com/avikchaudhuri
2025-03-28 05:10:56 +00:00
fdc4394b16 Do not fetch NCCL when system NCCL is used (#149607)
We are compiling PyTorch in a sandbox without networking. Unconditionally fetching breaks the build and is not needed when a system NCCL is used.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149607
Approved by: https://github.com/malfet
2025-03-28 05:06:49 +00:00
c9ebf517c2 [dynamo][invoke_subgraph] Input aliasing and mutation check in Dynamo (#148953)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148953
Approved by: https://github.com/zou3519
ghstack dependencies: #149087, #149667, #150036
2025-03-28 03:50:07 +00:00
c18e2ce53b Ignore meta ops in inductor (#150137)
Fix for https://github.com/pytorch/pytorch/issues/144607

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150137
Approved by: https://github.com/BoyuanFeng
2025-03-28 03:01:57 +00:00
ddb1e97839 Revert "Support torchbind in OSS proxy executor (#149747)"
This reverts commit aa70d62041c28fe35c416aa932b32ef0e4d5bc33.

Reverted https://github.com/pytorch/pytorch/pull/149747 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/149747#issuecomment-2760040741))
2025-03-28 02:48:02 +00:00
2f785ab208 dynamo_compile: Log all compilation time under all_compilation_types (#149664)
This counter is designed to include all compilation pytorch does (triton +
dynamo_compile). However this wasn't including all of dynamo compilation, since
it was put in at the fx_codegen_and_compile spot.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149664
Approved by: https://github.com/masnesral
2025-03-28 02:27:48 +00:00
8a872261dc Add one_shot_all_reduce_copy to allow non-symm-mem allocated tensors to be reduced (#150129)
Per title, we want to be able to use it even if inputs are not registered. Separate copy would add latency, and one-shot is all about the lowest possible latency.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150129
Approved by: https://github.com/xw285cornell
2025-03-28 02:14:27 +00:00
1e55b9c0b5 Fix autotune pool shutdown (#149890)
Summary: A couple follow-ups noted in review from https://github.com/pytorch/pytorch/pull/149700:
1. Make sure we correctly signal _all_ subprocesses to shut down, even in the case where some processes are currently benchmarking.
2. Change how the pool singleton is created. That also allows us to fully initialize the object in the ctor and remove a bunch of asserts.

Test Plan: existing unit tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149890
Approved by: https://github.com/aorenste
ghstack dependencies: #149700
2025-03-28 02:09:51 +00:00
266bd22b44 Improve subproc autotuning implementation (#149700)
Summary: The primary change is to update the autotune-in-a-subproc implementation to avoid using multiprocessing spawn. Spawn (re)executes the toplevel script in the subproc, which can be problematic. The approach here is similar to Triton parallel compile: we Popen a subproc on a controlled entry point and communicate over pipes. That change drove a lot of refactoring in the TuningProcess class, so I took the opportunity to simplify some things, rename some methods, etc.
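
A minimal sketch of the Popen-plus-pipes pattern (the worker, message format, and names here are made up for illustration; the real implementation uses the `__autotune_main__.py` entry point listed below):
```python
# Popen a worker on a controlled entry point and exchange length-prefixed
# pickled messages over its stdin/stdout pipes, instead of having
# multiprocessing spawn re-execute the toplevel script.
import pickle
import struct
import subprocess
import sys

WORKER = r"""
import pickle, struct, sys, time
while True:
    header = sys.stdin.buffer.read(4)
    if not header:
        break
    req = pickle.loads(sys.stdin.buffer.read(struct.unpack("<I", header)[0]))
    t0 = time.perf_counter()   # stand-in for running one benchmark
    resp = pickle.dumps({"choice": req["choice"], "elapsed": time.perf_counter() - t0})
    sys.stdout.buffer.write(struct.pack("<I", len(resp)) + resp)
    sys.stdout.buffer.flush()
"""

proc = subprocess.Popen([sys.executable, "-c", WORKER],
                        stdin=subprocess.PIPE, stdout=subprocess.PIPE)

def call(req):
    payload = pickle.dumps(req)
    proc.stdin.write(struct.pack("<I", len(payload)) + payload)
    proc.stdin.flush()
    length = struct.unpack("<I", proc.stdout.read(4))[0]
    return pickle.loads(proc.stdout.read(length))

print(call({"choice": "triton_mm_0"}))
proc.kill()   # on timeout: kill aggressively and restart rather than attempting a graceful shutdown
```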

One other notable change is around the timeout / kill approach. After a timeout, we were previously attempting to stop the subproc in three steps (graceful shutdown, sigkill if graceful fails, sigterm if sigkill fails). I'm gonna argue that's not useful: 1) The graceful shutdown is never going to work unless the subproc happens to have just completed its task and is ready to receive the next command. 2) If we're going to kill the subproc, let's just take the most aggressive approach and move on as quickly as possible to restarting it rather than waiting to see if previous shutdown attempts succeeded. The only downside that I can find is maybe a little log spew, e.g., ` ResourceWarning: subprocess 2987680 is still running`

List of changes:
* Use Popen instead of spawn for the autotuning subprocess.
* Introduced a new entry point `__autotune_main__.py`
* Renamed some TuningProcess methods. For example `shutdown` makes more sense than `terminate` because the latter implies a forced kill.
* Simplified the implementation around benchmarking timeout and how we kill the subproc after a timeout.
* Deprecated the unused timeout configs in `_inductor/config.py`
* Moved `get_ld_library_path` helper to a common utils file.
* Added more unit tests for subproc crashes / timeouts / exceptions, etc.

Test plan:
* New unit tests
* Also ran internally with all combinations of: build mode `opt` and `dev-nosan`, and `buck run` vs. executing the `.par` file directly.
* Made sure the functionality to parallelize autotuning across different GPUs is working (it wasn't clear to me this was behaving the way we wanted it to).

Differential Revision: [D71976971](https://our.internmc.facebook.com/intern/diff/D71976971)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149700
Approved by: https://github.com/aorenste, https://github.com/jansel, https://github.com/eellison
2025-03-28 01:06:39 +00:00
8b04364914 [Easy/Profiler] Set Duration to -1 for unfinished CPU events (#150131)
Summary: Some OSS Kineto users were requesting that we allow for 0 duration events in Kineto even though they won't be seen on the trace. To allow this we changed the handling of said events in D71510383. However this causes unfinished events in collection to never be post processed; this diff fixes said issue.

Test Plan: https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/0/1743102222/localhost/libkineto_activities_631490.json.gz&bucket=gpu_traces

Differential Revision: D71993609

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150131
Approved by: https://github.com/chuanhaozhuge, https://github.com/xw285cornell
2025-03-28 00:29:22 +00:00
aa70d62041 Support torchbind in OSS proxy executor (#149747)
Summary:
Implement torchbind support in OSSProxyExecutor.

Exactly the same as the implementation in FbProxyExecutor.

D69693697 - fbProxyExecutor
D69887230 - fbProxyExecutor but for torchbind method

Other changes:

- When generating the schema of the CallTorchBind HOP, the arg name of the torchbind object arg should be the same as the torchbind method's torchbind object arg (instead of `obj`).

- In `AOTIModelPackageLoader`, we extract everything in `data/constants` to `tmp_dir/data/aot_inductor/<model>/` folder, so the torchbind objs exist in the same folder as the rest of the files (e.g. cpp, so). This is to be consistent with how files are packaged internally

Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:torchbind -- -r torchbind_aoti

buck run fbcode//mode/dev-nosan //caffe2/test/inductor:torchbind -- -r aot_compile
```

Differential Revision: D69500038

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149747
Approved by: https://github.com/desertfire
2025-03-28 00:04:19 +00:00
d670df356c Improve error handling when checking CUDA version in case nvcc is not found (#148671)
Fixes:
- https://github.com/pytorch/pytorch/issues/101138

**Description**
The PR enhances error handling in `_check_cuda_version` by verifying the existence of the `nvcc` executable before invoking `subprocess.check_output`. If `nvcc` is missing, a `FileNotFoundError` is raised with a clear message, guiding users to check their CUDA installation and path configuration.
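
A rough sketch of the described check (function name, paths, and wording here are illustrative, not the exact code in `torch.utils.cpp_extension`):
```python
import os
import subprocess

def check_cuda_version(cuda_home: str) -> str:
    nvcc = os.path.join(cuda_home, "bin", "nvcc")
    if not os.path.exists(nvcc):
        raise FileNotFoundError(
            f"nvcc not found at {nvcc}. Check that CUDA is installed and that "
            "CUDA_HOME / PATH point to a valid CUDA toolkit."
        )
    # Only shell out once we know the executable exists.
    return subprocess.check_output([nvcc, "--version"]).decode()
```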

**Testing**
Manually tested with and without `nvcc` present in the expected path.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148671
Approved by: https://github.com/malfet
2025-03-27 23:04:59 +00:00
2b20d1433f cpp_wrapper: precompile a few more commonly used headers, and improve RAIIPyObject interface (#149350)
Add includes for torch.device, torch.dtype, torch.layout, and torch.memory_format to the cpp_wrapper common header, so that they get precompiled. Additionally, add move constructors and operator bool to RAIIPyObject.

Closes #142005.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149350
Approved by: https://github.com/desertfire
ghstack dependencies: #147225
2025-03-27 23:00:01 +00:00
ef1cb6b646 [BE] Suppress user_warnings while running opinfo tests (#150115)
Some of the samples are constructed in a way that is expected to trigger those warnings, but what's the point of displaying them?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150115
Approved by: https://github.com/dcci
ghstack dependencies: #150060
2025-03-27 22:36:27 +00:00
1a3bd894ff Revert "[fbcode]Removing @NoIntBaseDeprecated annotation in caffe2.thrift file (#149742) (#149744)"
This reverts commit 6eac3a0068f028d03897ce38e0cfec11812591fe.

Reverted https://github.com/pytorch/pytorch/pull/149744 on behalf of https://github.com/malfet due to Broke tests, see 80aa88f907/1 ([comment](https://github.com/pytorch/pytorch/pull/149744#issuecomment-2759676260))
2025-03-27 22:31:54 +00:00
4c57aec5b9 Dont exclude constant_pad_nd in prologue fusion (#149947)
Originally, I excluded constant_pad_nd from fusing to be conservative on compilation time. But, on benchmarking, you do occasionally get speedups by fusing it. Also includes a fix to create a single, contiguous dep for prologues.

For instance, the following benchmark gets a 7% speedup by fusing in the constant_pad_nd.

```
import torch
import torch.nn.functional as F
torch._inductor.config.force_disable_caches = True

padded_N = 2048
n_pad_rows = 100

K, N = 2048, 4096

tensor1 = torch.randn(padded_N - n_pad_rows, 4096, device="cuda").to(torch.bfloat16)
tensor2 = torch.randn(4096, 4096, device="cuda").to(torch.bfloat16)

@torch.compile(mode='max-autotune-no-cudagraphs')
def masked_linear(input, weight, n_pad_input_rows):
    """
    Linear layer with input padded by `n_pad_input_rows` rows
    """
    # Use constant_pad_nd to pad with zeros for the invalid rows
    padded_input = F.pad(input, (0, 0, 0, n_pad_input_rows), "constant", 0)
    return F.linear(padded_input, weight)

# Invoke the function
masked_linear(tensor1, tensor2, n_pad_rows)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149947
Approved by: https://github.com/drisspg
2025-03-27 22:26:30 +00:00
80aa88f907 Revert "Store statically launchable CachingAutotuners inside CompiledFXGraph.triton_bundle (#149054)"
This reverts commit ac91f8765ba7817a0853f0520e7f9c94768babc2.

Reverted https://github.com/pytorch/pytorch/pull/149054 on behalf of https://github.com/yangw-dev due to This is breaking ROCM tests on trunk. hud.pytorch.org/ ([comment](https://github.com/pytorch/pytorch/pull/149054#issuecomment-2759604301))
2025-03-27 22:15:40 +00:00
21bcbbfb5e fix range constraints for expr (#150103)
During tracing it is possible for a `s1: VR[2, inf]` to be replaced by a `s0: VR[3, inf]` (note smaller range) by the shape env. But after export, unfortunately we'd previously record `range_constraints[s0] = VR[2, inf]` (note larger range), which is incorrect.

This is because we'd map `s1.node.expr` (`s0`) to the `var_to_range` of `s1.node._expr` (`s1`) when creating `range_constraints`. The comment surrounding this code suggests this predated `bound_sympy`, but now we can do better.

For users, this means that when using `Dim.DYNAMIC` previously they wouldn't get input constraints checked sufficiently, now they do (shifting errors early).

Differential Revision: D71962694

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150103
Approved by: https://github.com/zhxchen17
2025-03-27 22:11:39 +00:00
68414512e6 Implement aten.select.int sharding strategy (#149842)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149842
Approved by: https://github.com/XilunWu
2025-03-27 20:49:00 +00:00
d25acac357 cpp_wrapper: Fix even more tests (#147225)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147225
Approved by: https://github.com/desertfire
2025-03-27 19:21:03 +00:00
0ed0b7fa96 [aoti] Better error message when torchbind object is used as a graph input in AOTI (#149965)
Summary: Give an explicit error when a torchbind object is used as an input to AOTI

Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:torchbind -- -r test_torchbind_input
```

Differential Revision: D69490915

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149965
Approved by: https://github.com/desertfire
2025-03-27 18:48:55 +00:00
a9d08ed0ce Revert "Parallelize sort (#149505)"
This reverts commit 842d51500be144d53f4d046d31169e8f46c063f6.

Reverted https://github.com/pytorch/pytorch/pull/149505 on behalf of https://github.com/ZainRizvi due to Reverting since this is breaking inductor builds on trunk. More details [GH job link](https://github.com/pytorch/pytorch/actions/runs/14000726218/job/39207447863) [HUD commit link](842d51500b) ([comment](https://github.com/pytorch/pytorch/pull/149505#issuecomment-2759082390))
2025-03-27 18:43:11 +00:00
01cb3519b3 wire torch._scaled_mm with fp4 operands to the cublas nvfp4 kernel (#148792)
Summary:

When `a` and `b` have dtype `torch.float4_e2m1fn_x2` and `a_scale` and `b_scale` have dtype `torch.float8_e4m3fn`, makes

```python
c = torch._scaled_mm(a, b, a_scale, b_scale, out_dtype=torch.bfloat16)
```

call the cuBLAS fp4 gemm kernel, as specified in https://docs.nvidia.com/cuda/cublas/index.html?highlight=fp4#d-block-scaling-for-fp8-and-fp4-data-types

note: output scale (`scale_in_D` from the cuBLAS docs) is not tested in this PR - we can enable in a follow-up.

Test Plan:

```bash
pytest test/test_matmul_cuda.py -s -k mxfp8_nvfp4
```

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148792
Approved by: https://github.com/eqy
ghstack dependencies: #148791
2025-03-27 17:32:20 +00:00
e33bc41958 add torch.float4_e2m1fn_x2 to PyTorch (#148791)
Summary:

Redo of https://github.com/pytorch/pytorch/pull/146578 to get around
rebase conflicts.

Test Plan:

```
pytest test/quantization/core/experimental/test_floatx.py -s
```

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148791
Approved by: https://github.com/drisspg, https://github.com/eqy, https://github.com/jeffdaily
2025-03-27 17:32:20 +00:00
ac91f8765b Store statically launchable CachingAutotuners inside CompiledFXGraph.triton_bundle (#149054)
This PR adds CachingAutotuners that are statically launchable to FXGraphCache's cache entry.

Regular CachingAutotuners, with triton kernels attached to them, are not very good to cache: they are very large, and take huge amounts of space since they track all of the various binary files, along with various metadata. We could probably figure out what information we could delete from the kernel and have it still work, but with StaticCudaLauncher, we no longer have to. Instead, we can cache every compiled triton kernel that is statically launchable.

Because StaticTritonCompileResult is serializable, and designed to have a very small memory footprint, we can save it into FXGraphCache without increasing the cache size significantly. We store it as a part of CompiledFxGraph.triton_bundle.

Then, on load, we repopulate the CachingAutotuner into our CompiledTritonKernel cache.

The upsides of this are many:
- We no longer need to call into a separate process on cache hit
- We can *guarantee* that the triton kernel we got from our cache entry is the one we use to launch again, so no worries about triton's own caching logic
- Once we achieve feature parity and all torch.compiled triton kernels are statically launchable, we can clean up a bunch of TritonBundler code and simplify the cache hit logic.

Fixes #149449

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149054
Approved by: https://github.com/oulgen
ghstack dependencies: #149657
2025-03-27 17:14:44 +00:00
6eac3a0068 [fbcode]Removing @NoIntBaseDeprecated annotation in caffe2.thrift file (#149742) (#149744)
Summary:

To align with thrift-python, we are adding the int base class for `non-Flag` enums. In order to not break production code, the annotation `python.NoIntBaseClassDeprecated` was added to opt out some enums.

After the related customer code logic changes, we can now safely remove the annotations that were added earlier.

Our ultimate goal is to unconditionally add the `int` base to `thrift-py3` enums.

Test Plan:
```
buck test 'fbcode//mode/opt' fbcode//caffe2/torch/fb/training_toolkit/applications/bulk_eval/tests:evaluator_test -- --exact 'caffe2/torch/fb/training_toolkit/applications/bulk_eval/tests:evaluator_test - test_setup_evaluation_utils (caffe2.torch.fb.training_toolkit.applications.bulk_eval.tests.evaluator_test.EvaluatorTest)'
```

Reviewed By: ahilger

Differential Revision: D71446522

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149744
Approved by: https://github.com/izaitsevfb, https://github.com/huydhn
2025-03-27 17:11:26 +00:00
14f0cd7630 [StaticCudaLauncher] Support sharedMemBytes > 48KB (#149657)
Triton does some special handling when requesting more than 48 KB of shared memory: specifically, it queries the device for its maximum shared memory per block, then sets the kernel's maximum dynamic shared memory to the difference between that device limit and the kernel's static shared memory usage.

See corresponding implementation in triton land here:
https://github.com/triton-lang/triton/blob/main/third_party/nvidia/backend/driver.c#L128-L143
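
A hedged sketch of that handling in Python terms (helper name and error text are illustrative; the real logic lives in StaticCudaLauncher and the Triton driver):

```python
DEFAULT_SMEM_LIMIT = 48 * 1024  # default dynamic shared memory cap (48 KB)

def dynamic_smem_cap(requested: int, device_max_optin: int, kernel_static: int) -> int:
    # Below 48 KB the default cap already suffices. Above it, the launcher
    # must raise the per-kernel cap to the device opt-in maximum minus the
    # kernel's static shared memory usage before launching.
    if requested <= DEFAULT_SMEM_LIMIT:
        return DEFAULT_SMEM_LIMIT
    cap = device_max_optin - kernel_static
    if requested > cap:
        raise RuntimeError("requested dynamic shared memory exceeds the device limit")
    return cap
```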

Test plan:
- New unit test requesting more than 48 KB of memory

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149657
Approved by: https://github.com/jansel
2025-03-27 17:00:18 +00:00
85e4e51a7d Fix bug in _load_state_dict_from_keys method (#150058)
Summary:
The _load_state_dict_from_keys method specifies that `Loads any key specified in this set. If no keys are specified, the entire checkpoint is loaded.`
But this isn't happening right now, because an empty keys arg is passed to `_load_state_dict` as set(), while keys is expected to be None for everything to actually be included in the state_dict (https://fburl.com/code/l8yzojyx). So with the set() argument, the state_dict is always going to be empty.
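
A hedged sketch of the keys handling the fix needs (the helper below is hypothetical; only the None-vs-set() distinction is the point):

```python
from typing import Iterable, Optional, Set

def normalize_keys(keys: Optional[Iterable[str]]) -> Optional[Set[str]]:
    # None (or empty) means "load the whole checkpoint"; _load_state_dict only
    # treats keys=None that way, so don't coerce an absent arg into set().
    return set(keys) if keys else None

assert normalize_keys(None) is None             # whole checkpoint
assert normalize_keys([]) is None               # whole checkpoint
assert normalize_keys(["model.weight"]) == {"model.weight"}
```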

Test Plan: ensure existing tests pass

Differential Revision: D71930712

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150058
Approved by: https://github.com/saumishr
2025-03-27 16:36:00 +00:00
d75921d3a6 Fix sparse CUTLASS-based kernels (#150023)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150023
Approved by: https://github.com/jcaip
ghstack dependencies: #149978
2025-03-27 16:23:55 +00:00
c830d750e6 [graph partition] support splitting on custom ops (#149782)
This PR adds support for graph partition on custom ops. Land after #149458.

### API
This PR provides a new API to register/unregister custom ops for graph partition.

```python
def register_custom_op_support_cudagraph(
    operator: torch._library.custom_ops.CustomOpDef,
    is_cudagraphable: bool,
) -> None
```

Example usage:

```python
from torch._inductor.utils import register_custom_op_support_cudagraph

@torch.library.custom_op("mylib::movement", mutates_args=())
def movement(pic: torch.Tensor) -> torch.Tensor:
    img = pic.cpu()
    cropped_img = (img + 1) * 2
    return cropped_img.cuda() / 255.0

@movement.register_fake
def _(pic):
    return torch.empty_like(pic)

register_custom_op_support_cudagraph(movement, is_cudagraphable=False)
```

### Example
In this example, 1 torch-compiled region has 3 cudagraphs after splitting on 2 custom ops.

![image](https://github.com/user-attachments/assets/6d07355b-6690-4cde-89ef-e4aff6b0079c)

Code to repro:
```python
import torch
from torch._inductor.utils import register_custom_op_support_cudagraph

torch._inductor.config.graph_partition = True

@torch.library.custom_op("mylib::movement", mutates_args=())
def movement(pic: torch.Tensor) -> torch.Tensor:
    img = pic.cpu()
    cropped_img = (img + 1)*2
    return cropped_img.cuda() / 255.

@movement.register_fake
def _(pic):
    return torch.empty_like(pic)

@torch.library.custom_op("mylib::modify", mutates_args=())
def modify(pic: torch.Tensor) -> torch.Tensor:
    pic1 = pic + 1
    pic1_cpu = (pic1.cpu() + 1) * 2
    return pic1_cpu.cuda() + pic

@modify.register_fake
def _(pic):
    return torch.empty_like(pic)

@torch.library.custom_op("mylib::transform", mutates_args=())
def transform(pic: torch.Tensor) -> torch.Tensor:
    return (pic + 1) * 2

@transform.register_fake
def _(pic):
    return torch.empty_like(pic)

register_custom_op_support_cudagraph(movement, is_cudagraphable=False)
register_custom_op_support_cudagraph(modify, is_cudagraphable=False)

img = torch.randn(3, 64, 64, device="cuda")

def f(img):
    x = (img + 10) * 2
    y = movement(x)
    z = y + 1
    u = transform(z)
    v = 2*u + 1
    out = modify(v)
    return out + 1

compiled_f = torch.compile(f, mode="reduce-overhead", fullgraph=True)

eager_out = f(img)

for _ in range(3):
    compiled_out = compiled_f(img)
    assert torch.allclose(eager_out, compiled_out)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149782
Approved by: https://github.com/zou3519
2025-03-27 16:23:07 +00:00
efc975feb2 Revert "[triton] Warp specialization support in torchinductor (#148503)"
This reverts commit 36183215e8845b54cdb69097e2b688fa9e4d3daf.

Reverted https://github.com/pytorch/pytorch/pull/148503 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/148503#issuecomment-2758590645))
2025-03-27 16:06:42 +00:00
af7719a2fa Revert "Use source hashing to generate consistent symbolic ids (#149665)"
This reverts commit 1f92348dc6c60e3020a723b37ecb8226cf2480c0.

Reverted https://github.com/pytorch/pytorch/pull/149665 on behalf of https://github.com/malfet due to Broke trunk, see 6eb3c2e282/1 ([comment](https://github.com/pytorch/pytorch/pull/149665#issuecomment-2758578187))
2025-03-27 16:02:27 +00:00
6eb3c2e282 Update xla pin (#149381)
Update xla pin to fix the github test failure issue. [failure link](https://hud.pytorch.org/failure?name=pull+%2F+linux-focal-py3_9-clang9-xla+%2F+test+%28xla%2C+1%2C+1%2C+lf.linux.12xlarge%29&jobName=linux-focal-py3_9-clang9-xla+%2F+test+%28xla%2C+1%2C+1%2C+lf.linux.12xlarge%29&failureCaptures=%5B%22test_call_jax_pytree%22%2C%22TestJaxInterop%22%5D).

The test runs the torch_xla jax test but installs the jax/jaxlib dependencies as we did in https://github.com/pytorch/xla/pull/8781/files.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149381
Approved by: https://github.com/atalman
2025-03-27 13:53:25 +00:00
36183215e8 [triton] Warp specialization support in torchinductor (#148503)
Summary:
Currently only `num_warps` and `num_stages` are supported as one of the kernel options for inductor auto-tuning using `TritonTemplate`. In order to allow warp-specialization kernel options should allow specifying `num_consumer_groups` and `num_buffers_warp_spec` as well.

Test Plan:
## Unit test
Added `test_triton_template_warp_specialization` to verify the generated kernel contains configs for `num_consumer_groups` and `num_buffers_warp_spec`.

## Functional Testing
Specific to flexattention.
```
import torch
from torch.nn.attention.flex_attention import flex_attention

from triton.testing import do_bench

make_tensor = lambda: torch.rand(8, 16, 8192, 128, device="cuda", dtype=torch.bfloat16)
q, k, v = make_tensor(), make_tensor(), make_tensor()

flex_compiled = torch.compile(flex_attention, fullgraph=True)

print(do_bench(lambda: flex_compiled(q, k, v, kernel_options={"num_warps": 4})))
```

triton do_bench results:
- default compile: 15.176783561706543
- with warp-spec: 9.452800750732422
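
For reference, a hedged sketch of how the new warp-specialization knobs might be passed alongside the existing options (whether `kernel_options` forwards these exact keys here is an assumption; values are illustrative):

```python
# Illustrative only: per the summary, num_consumer_groups and
# num_buffers_warp_spec are now accepted auto-tuning options alongside
# num_warps/num_stages. The values below are example guesses.
print(do_bench(lambda: flex_compiled(
    q, k, v,
    kernel_options={
        "num_warps": 4,
        "num_consumer_groups": 2,      # assumed example value
        "num_buffers_warp_spec": 3,    # assumed example value
    },
)))
```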

## Extra notes
- generated triton kernel using `TORCH_LOGS=output_code`: P1740612877
- TTGIR for fused kernel: P1740614685

Differential Revision: D70212243

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148503
Approved by: https://github.com/eellison
2025-03-27 13:07:50 +00:00
f0e1a0838c Enabling xpu in OffsetBasedRNGTracker . (#148360)
Else torch.distributed breaks on xpu devices.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148360
Approved by: https://github.com/zhangxiaoli73, https://github.com/guangyey, https://github.com/gujinghui, https://github.com/XilunWu, https://github.com/kwen2501

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
2025-03-27 10:55:05 +00:00
e175929b8c Make codegen dynamic shapes more device agnostic (#146830)
Currently, as is the case with many inductor devices are assumed to be one of:

- CPU with C++ codegen, or
- GPU with triton codegen

This is not always the case: a CPU backend may be using the Triton CPU backend, or some other codegen entirely. This goes some way toward fixing that for the case where a CPU backend can use Triton scheduling.

A more general solution could be implemented, but this would need to be quite robust, and is probably best done more centrally and by someone who can do more testing with CUDA devices.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146830
Approved by: https://github.com/eellison, https://github.com/albanD, https://github.com/guangyey

Co-authored-by: Xuehai Pan <XuehaiPan@outlook.com>
2025-03-27 10:40:49 +00:00
6cbcdee944 Introduce guard_or_true, guard_or_false (#148430)
some context in this document:
https://docs.google.com/document/d/18nJsj-F2C_QXO7ClwzPcAUENQ-B440B43W7DdDnlDt4/edit?tab=t.0#heading=h.pgebnyi7pocj

But TLDR;
`guard_or_true`, `guard_or_false` are better than `guard_size_oblivious` due to :
- Easier to reason about what assumptions we are making while reading the code.
- Avoid size_oblivious complexity that is not needed.
- Avoid unsoundness that could make `guard_size_oblivious(a==1)` be true when it's not true for some value `a` during runtime.
- Less data dependent errors for some cases: ex, when doing `guard_size_oblivious(a==1)` and we know `a` is a tensor size, if it's traced with `a=u1-u2` `guard_size_oblivious(a==1)` will throw a data dependent error but `guard_or_false` will just return `False`.

### How is it different from statically_known_true??
**`if(cond)`:** (normal guarding) will try to evaluate statically and guard on the condition, willing to restrict input space to evaluate cond. if it fails to evaluate due to data dependent error will throw an exception (that could be converted to graph break in some situations).

**`statically_known_true(cond)`:** would be used when you never want to add a guard (restrict your input space), but just want to do a best effort check to see if you can infer that something is true/false ONLY based on existing constraints.

**`guard_or_true(cond)`/`guard_or_false(cond)`:** Those would be used in situations you prefer to guard and know the result of the expression over not guarding, but in case you hit a data dependent error you are ok with just returning true or false.
Some reasons you might be ok with returning true/false instead could be:
1. It's an optimization I do not want to fail for not performing optimization.
2. I am willing to deviate from the normal semantics when I have unbacked for the benefit of not failing (See the doc above for more details).

**`definitely_true(cond)`**: same as `guard_or_false(cond)` except does not try to do static eval for unbacked (planning to deprecate it and replace uses with `guard_or_false` or make it alias to `guard_or_false`)
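
As a hedged illustration of the intended usage (assuming the helpers land alongside `guard_size_oblivious` in `torch.fx.experimental.symbolic_shapes`; the surrounding function is hypothetical):

```python
from torch.fx.experimental.symbolic_shapes import guard_or_false

def can_fold_unit_dim(size) -> bool:
    # Prefer to guard and learn the concrete answer, but if `size` is unbacked
    # and the comparison is data dependent, fall back to False instead of
    # raising: we simply skip the optimization in that case.
    return guard_or_false(size == 1)
```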

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148430
Approved by: https://github.com/bobrenjc93
2025-03-27 09:34:05 +00:00
a9ee797e41 added fake tensor support for foreach_copy (#149127)
Fixes #149111

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149127
Approved by: https://github.com/jansel, https://github.com/jeromean
2025-03-27 09:26:23 +00:00
7aacbab0b3 Update Doc for Intel XPU Profiling (#134515)
Updated below two pages for Intel XPU
https://pytorch.org/docs/stable/torch.compiler_profiling_torch_compile.html
https://pytorch.org/docs/stable/profiler.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134515
Approved by: https://github.com/dvrogozh, https://github.com/malfet
2025-03-27 09:15:35 +00:00
e6afb51805 [AOTInductor] Free folded constants that's managed by AOTInductor (#149825)
internally.

Summary:
This diff allows freeing folded constants created by AOTInductor, by
allocating them through the CUDACachingAllocator instead of via the constant
blob from cudaMalloc directly.

Test Plan:
LD_LIBRARY_PATH=/data/users/$USER/pytorch/build/lib
/home/$USER/local/pytorch/build/bin/test_aoti_inference

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149825
Approved by: https://github.com/chenyang78, https://github.com/desertfire, https://github.com/jingsh
2025-03-27 06:05:50 +00:00
e080bac533 Revert "Introduce guard_or_true, guard_or_false (#148430)"
This reverts commit d5593ea31ceb2590336cc9815ee2c13a18db6cd7.

Reverted https://github.com/pytorch/pytorch/pull/148430 on behalf of https://github.com/laithsakka due to need to fix stuff ([comment](https://github.com/pytorch/pytorch/pull/148430#issuecomment-2756701436))
2025-03-27 05:10:20 +00:00
748252378d [ca] introduce RuntimeState to support c++ hooks via graph breaks (#149987)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149987
Approved by: https://github.com/jansel
ghstack dependencies: #149647, #149709, #149651, #149897
2025-03-27 05:05:34 +00:00
dcb378cff2 [ca] support anomaly mode nan checks with different semantics than eager (#149897)
see note in code

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149897
Approved by: https://github.com/jansel
ghstack dependencies: #149647, #149709, #149651
2025-03-27 05:05:34 +00:00
488b87cb68 [BE] do not retain/release tensor (#150075)
`Tensor::as_strided__symint` is an in-place op that returns self, so there is no need to retain it

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150075
Approved by: https://github.com/angelayi, https://github.com/atalman, https://github.com/cyyever
2025-03-27 03:43:14 +00:00
1f92348dc6 Use source hashing to generate consistent symbolic ids (#149665)
This PR was inspired by internal models that were cache missing due to PGO. At a high level the problem looks as follows

Run 1, Invocation 1: We do static compile, save some example values in PGO/automatic dynamic

Run 1, Invocation 2: We detect varying inputs, do dynamic compile, get a dynamic graph and save to PGO. Crucially what we save to PGO is actually a superset of what is actually dynamic. If we notice an input was varying, we mark it as dynamic in PGO even if later on that value gets specialized. When a value gets specialized, we actually remove the symbol from the graph. This results in an interesting conundrum where although we are producing the same isomorphic graph, PGO makes the second run cache miss. Let's see how....

Run 2, Invocation 1: We fetch the PGO, over-mark things as dynamic, get a fx graph, look it up in the cache and... whoops! cache miss! This is because of the aforementioned behavior where the PGO profile will cause us to over-allocate symbols. In practice this means we end up saving a graph in cache with symbols x:s1, y:s3 and on second attempt we cache miss with x:s1, y:s6 where symbols s3,s4,s5 were all optimistically marked dynamic by PGO and subsequently specialized.

We solve this problem by hashing the source names. This ensures somewhat stable assignment. To prevent catastrophic symbol collisions, we use linear probing to ensure no collisions.
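
A hedged sketch of the idea (the helper below is hypothetical, not the PR's implementation): hash the source name to pick a symbol id, and linear-probe on collision so assignments stay stable across runs without ever colliding.

```python
import hashlib

def stable_symbol_id(source_name: str, taken: set, table_size: int = 1 << 20) -> int:
    # Stable across processes (unlike built-in hash() on strings).
    idx = int(hashlib.sha256(source_name.encode()).hexdigest(), 16) % table_size
    while idx in taken:                 # linear probing on collision
        idx = (idx + 1) % table_size
    taken.add(idx)
    return idx
```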

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149665
Approved by: https://github.com/Mingming-Ding, https://github.com/laithsakka
2025-03-27 03:39:27 +00:00
ae29f054f5 [Async TP] More robust support for rowwise scales when fusing matmul reduce-scatter (#149247)
Part of https://github.com/pytorch/torchtitan/issues/866

## Context
- Async TP needs to support the "reshape -> scaled_mm -> reshape" pattern because scaled mm only supports 2D input tensors and 2D scales.
    - (a,b,c) => (a*b,c)
    - (a\*b,c) @ (c,d) = (a\*b,d)
    - (a\*b,d) => (a,b,d)

- Currently the implementation does not support scaled mm with rowwise scales **for all cases** of the reshape -> scaled_mm -> reshape pattern. The minimal example of this pattern is confirmed to work via this [unit test](00a2c68f67/test/distributed/tensor/parallel/test_micro_pipeline_tp.py (L406)), but more involved e2e examples in torchtitan fail silently (more context in final bullet point).
- Previously, the "A tensor" **node** referenced in the async TP graph manipulation code is the 3D+ node before the reshape, but the "A_scale" node is the 2d node from after the reshape, so they are incompatible.
- I previously implemented a simpler solution to this problem in https://github.com/pytorch/pytorch/pull/148001, with a [unit test](https://github.com/pytorch/pytorch/pull/148001/files#diff-115f1d0852382c9b58f22640d80999d879b33618e5f6c633fc9e4d0ca9781cecR406) confirming the fused node is indeed in the graph for the minimal example of the reshape->mm->reshape pattern. I also confirmed via manual e2e testing w/ torchtitan that the crash I was fixing no longer occurred. However, it turns out due to this [bug in torchtitan](https://github.com/pytorch/torchtitan/issues/866)  it was causing async TP to fail silently and fall back to vanilla TP, hiding the fact that this original solution fixed the crash but the fusion would not occur for rowwise scales. Thus, more robust solution is needed to support all cases.

## Solution TL;DR
- Use the 2D 'A' tensor and corresponding 2D scales as input to the fused_matmul_reduce_scatter implementation, instead of the 3D+ tensor/scales.
- Track the "pre mm reshape" and "post mm reshape" separately, to be referenced in the `fused_scaled_matmul_reduce_scatter` implementation, to update the scatter dim through the pre-mm reshape, and apply the post-mm reshape before applying the reduce scatter and returning the output tensor.
- Separate the `fused_matmul_reduce_scatter` and the `fused_scaled_matmul_reduce_scatter` code paths, to simplify them both.
- By fixing the bug in torchtitan (PR https://github.com/pytorch/torchtitan/pull/965) and implementing support for rowwise scales in pytorch in this PR, together these changes will solve the problem of how to support rowwise scales with all types of AC.

## Additional details for reviewers
To use the 2D A tensor while also supporting the "reshape -> mm -> reshape" pattern, the following other changes were needed:
- Track the pre-mm reshape, as it will affect the scatter dim used in the fused_matmul_reduce_scatter implementation.
- Track the post-mm reshape, as it will affect the output shape used in the fused_matmul_reduce_scatter implementation.
- Based on the pre-mm reshape and the original scatter dim, calculate the new scatter dim for the 2D tensor (see the sketch after this list). This is needed because during the pipelined producer mm implementation, the scatter dim is moved to dim 0 (so it can be sharded along the first dim and then get chunks to do mm ops on by indexing into the first dim), then moved back to its original place before the reduce-scatter.
- Use the tracked post-mm reshape to reshape the stacked partial 2D outputs of the mm ops into 3D outputs needed for 1) the reduce-scatter w/ the original scatter dim, and 2) the expected output shape to prevent shape errors with subsequent ops.
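
A hedged sketch of the scatter-dim bookkeeping for the (a, b, c) -> (a*b, c) pre-mm reshape described above (the helper name and exact semantics are illustrative, not the PR's code):

```python
def remap_scatter_dim_through_flatten(orig_ndim: int, scatter_dim: int) -> int:
    # All leading dims except the last are flattened into dim 0 of the 2D
    # operand, so any scatter dim inside that flattened group maps to 0;
    # the trailing dim maps to 1.
    num_flattened = orig_ndim - 1
    return 0 if scatter_dim < num_flattened else scatter_dim - (num_flattened - 1)
```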

## Test plan
- All existing unit tests passing.
- Expand unit tests for rowwise scales to test more scatter dims
- Added unit tests enforcing that async TP fails fast / throws an error if it fails to perform any fusions. Previously it just "failed silently" (fell back to vanilla TP without the user knowing) which has led to confusion, so this will improve the UX.
- Compared loss curves of bf16 vs float8 w/ rowwise scales to confirm integrity of numerics
- Confirmed via manual testing with torchtitan and inspecting the compile graph that the fusion is working as intended for:
    - bfloat16
    - float8 with tensorwise scales
    - float8 with rowwise scales

## Loss curves

Loss curves are virtually identical for bf16 + vanilla TP versus float8 with rowwise scales + async TP:

<img width="1017" alt="loss_async_tp" src="https://github.com/user-attachments/assets/4995db78-7012-490f-a370-f4fecc289a22" />

## Performance

#### Per op SAC
Performance benchmarks for torchtitan Llama3 8b training runs on 4 H100s with per op SAC, using FSDP degree=2, TP degree=2:
- bf16 (vanilla TP): TPS 5161.5, peak memory 50.53 GB
- bf16 (async TP): TPS  5229.5, peak memory 50.68 GB
- float8 tensorwise (vanilla TP): TPS: 5959.5, peak memory: 50.47 GB
- float8 tensorwise (async TP): TPS 5964.5, peak memory 50.47 GB
- float8 rowwise (vanilla TP): TPS: 4962.0, peak memory: 50.55 GB
- float8 rowwise (async TP): TPS 4966.5, peak memory 50.65 GB

#### Full AC
Llama3 70b training runs on 128 H100s with full AC, using FSDP=16, TP=8
- bf16 (vanilla TP): 598 TPS, peak memory 71.51 GB
- bf16 (async TP): TPS  673, peak memory 71.08 (+12.54% TPS vs vanilla TP)
- float8 tensorwise (vanilla TP): 820 TPS, peak memory  55.26 GB
- float8 tensorwise (async TP): 950 TPS, peak memory 55.91 GB (+15.85% TPS vs vanilla TP)
- float8 rowwise (vanilla TP): TPS: 540 TPS, peak memory 71.46 GB
- float8 rowwise (async TP): 560 TPS, peak memory 70.65 GB (+3.7% TPS vs vanilla TP but still unexpectedly lower than bf16)

As you can see, float8 rowwise is working but performance needs to be improved further.

## Other changes
- Added logging so the user will know why fusion failed if it does.
- Remove logic which inserted a reshape node targeting "A scale" to get it to be in 3D like the "A tensor" since it's no longer needed.

## Long term plan
- Add a `scaled_matmul` op in pytorch, which will natively support a 3D+ "A tensor" and allow us to simplify the async TP implementation by avoiding the reshape -> scaled_mm -> reshape pattern and the special handling for it.

## Visualizing fused nodes in graphs for torchtitan training runs

Below are examples of the visualized graph generated by torch compile for torchtitan llama3 8b training runs with per op SAC. These graphs provide additional evidence (beyond the new unit tests added) that the implementation is working correctly.

### bf16

<img width="900" alt="bf16-fusion" src="https://github.com/user-attachments/assets/a3bed917-28eb-4a56-8d6e-2d2bf498385c" />

### float8 with tensorwise scales

<img width="900" alt="tensorwise-node" src="https://github.com/user-attachments/assets/b212ec4a-1899-44de-a4de-18c74e1de68a" />

### float8 with rowwise scales

<img width="900" alt="rowwise" src="https://github.com/user-attachments/assets/ed3354a3-894b-4ec9-86d0-f80364bf3d83" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149247
Approved by: https://github.com/kwen2501
2025-03-27 03:15:30 +00:00
114d404b07 [cuda] Add new faster gammabeta backward kernel (#148605)
This PR adds a new kernel for producing gamma and beta values for the backward pass in a performant way.

To test the performance against the baseline, I measured the backward pass of layernorm while sweeping over the following variables:

1. dtype in {half, float}
2. M in `2**k, 2**k - 1, 2**k + 1 for k in range(...)`
3. N in `2**k, 2**k - 1, 2**k + 1 for k in range(...)`
4. Whether we flush the L2 cache before running the backward pass

Summary: The new code performs better than the old code, especially for powers of 2. For M >> N case, it performs very well (kernel itself can be 30x faster and the overall backward pass can be 5-10x faster).
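
A hedged sketch of the kind of measurement described above (shapes, dtype, and timing harness are illustrative, not the PR's exact benchmark):

```python
import torch

M, N, dtype = 2048, 32, torch.half  # an M >> N case
x = torch.randn(M, N, device="cuda", dtype=dtype, requires_grad=True)
w = torch.randn(N, device="cuda", dtype=dtype, requires_grad=True)
b = torch.randn(N, device="cuda", dtype=dtype, requires_grad=True)

y = torch.nn.functional.layer_norm(x, (N,), w, b)
grad = torch.randn_like(y)

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
torch.cuda.synchronize()
start.record()
y.backward(grad)       # dgamma/dbeta are produced in the backward pass
end.record()
torch.cuda.synchronize()
print(f"layer_norm backward: {start.elapsed_time(end):.3f} ms")
```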

In order to visualize results of the kernel when choosing different values of M, N and dtype, I wrote some code to generate a heatmap. The heatmap has N on the x-axis, M on the y-axis and color-coded points where green shows performance improvement and red shows regressions. For example, `m=32 n=2048 1.42x` in the heatmap would indicate the normalized shape had 32 elements. The leading dimensions' product was 2048 elements and the new kernel resulted in the *backward pass* being 1.42x faster than the old *backward pass*.

Important note: This heatmap shows the total backward pass time as seen by the user. The kernel time difference can be sometimes very large while the total backward pass time is not that high. For example, for dtype=torch.half, M=32 N=2048, flush_l2_cache=True case, the heatmap shows a speedup of 1.42x, while ncu tells me the new kernel is 2.5x faster than the old:

M=32 N=2048 dtype=half flush_l2=True Old Kernel NCU summary:
```
    ----------------------- ----------- ------------
    Metric Name             Metric Unit Metric Value
    ----------------------- ----------- ------------
    DRAM Frequency                  Ghz         1.59
    SM Frequency                    Ghz         1.35
    Elapsed Cycles                cycle       27,526
    Memory Throughput                 %         2.21
    DRAM Throughput                   %         0.54
    Duration                         us        20.42
    L1/TEX Cache Throughput           %         4.31
    L2 Cache Throughput               %         2.62
    SM Active Cycles              cycle     1,475.02
    Compute (SM) Throughput           %         0.29
    ----------------------- ----------- ------------
```

M=32 N=2048 dtype=half flush_l2=True New Kernel NCU summary:
```
    ----------------------- ----------- ------------
    Metric Name             Metric Unit Metric Value
    ----------------------- ----------- ------------
    DRAM Frequency                  Ghz         1.59
    SM Frequency                    Ghz         1.34
    Elapsed Cycles                cycle       10,920
    Memory Throughput                 %         5.64
    DRAM Throughput                   %         1.35
    Duration                         us         8.13
    L1/TEX Cache Throughput           %         1.92
    L2 Cache Throughput               %         6.89
    SM Active Cycles              cycle     3,554.41
    Compute (SM) Throughput           %         0.67
    ----------------------- ----------- ------------
```

Let's look at some rows from the heatmap. For dtype=float16 flush_l2_cache=True and when input shapes are powers of 2, we get the following:

<img width="1508" alt="image" src="https://github.com/user-attachments/assets/06179599-b2f0-4a45-8664-247a1067950b" />

There are 3 columns -- the first shows all data points, the second shows speedups only and the 3rd column shows regressions only. We can see that there are dramatic speedups for M >> N cases and the regressions are not that high (less than 1%, which could just be measurement noise). Here is a small guide I made:

![image](https://github.com/user-attachments/assets/90c26f7c-e3ad-46d2-a6ce-fe4b5fb3d738)

For dtype=float32, we get a similar chart:

<img width="1499" alt="image" src="https://github.com/user-attachments/assets/c4d31a76-03b0-426c-9114-e1bfad29b530" />

The new code performs especially well for m >> n cases, and also where m and n are small. The m >> n case is special because we run 2 reduction kernels back to back and parallelize in the "M" dimension (the older kernel only parallelized in the "N" dimension).

The new code can sometimes have regressions for non-powers of 2. That is because the old code was using block sizes of {16, 32} while we have `threads.x = 32`. For example when N=33, the old code would have 3 blocks and we will have 2 blocks. I wrote some code to specialize for this case, but I think it will add complexity and @ngimel mentioned that non-powers of 2 are rare enough.

I am including the regressions here for completeness' sake:

<img width="1500" alt="image" src="https://github.com/user-attachments/assets/31c17cfb-ed9b-4106-b9c8-5c359751f530" />

To see this better:

1. Click the image
2. Right click the expanded image and open in a new tab
3. Go to that tab and left click once to zoom in

If you want to see the full data, here it is:

![image](https://github.com/user-attachments/assets/54fb60c9-8c0c-4530-a1dd-79ecda1a69a1)

I also measured binary size and compile time since those are important for developers:

Binary size comparison

![image](https://github.com/user-attachments/assets/ceef5073-1036-47f6-b9dc-cea088beda51)

```
# Original
-rwxr-xr-x 1 ahmads users 307193112 Mar  6 08:46 ./torch/lib/libtorch_cuda.so

# This PR
-rwxr-xr-x 1 ahmads users 307193112 Mar  6 08:46 ./torch/lib/libtorch_cuda.so
```

The diff in bytes is 302kB which is about a 0.1% increase.

Compile time difference:

```
# Original

real    0m10.931s
user    0m9.676s
sys     0m1.004s

# this PR

real    0m16.720s
user    0m15.514s
sys     0m1.066s

# Command I ran
time /usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DAT_PER_OPERATOR_HEADERS -DFLASHATTENTION_DISABLE_ALIBI -DFLASHATTENTION_DISABLE_SOFTCAP -DFLASH_NAMESPACE=pytorch_flash -DFMT_HEADER_ONLY=1 -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DTORCH_CUDA_BUILD_MAIN_LIB -DTORCH_CUDA_USE_NVTX3 -DUNFUSE_FMA -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_CUDA -DUSE_CUFILE -DUSE_DISTRIBUTED -DUSE_EXTERNAL_MZCRC -DUSE_FLASH_ATTENTION -DUSE_MEM_EFF_ATTENTION -DUSE_NCCL -DUSE_RPC -DUSE_TENSORPIPE -D_FILE_OFFSET_BITS=64 -Dtorch_cuda_EXPORTS -I/home/ahmads/personal/pytorch/build/aten/src -I/home/ahmads/personal/pytorch/aten/src -I/home/ahmads/personal/pytorch/build -I/home/ahmads/personal/pytorch -I/home/ahmads/personal/pytorch/cmake/../third_party/benchmark/include -I/home/ahmads/personal/pytorch/third_party/onnx -I/home/ahmads/personal/pytorch/build/third_party/onnx -I/home/ahmads/personal/pytorch/nlohmann -I/home/ahmads/personal/pytorch/third_party/flash-attention/csrc/flash_attn/src -I/home/ahmads/personal/pytorch/aten/src/THC -I/home/ahmads/personal/pytorch/aten/src/ATen/cuda -I/home/ahmads/personal/pytorch/third_party/fmt/include -I/home/ahmads/personal/pytorch/aten/src/ATen/../../../third_party/cutlass/include -I/home/ahmads/personal/pytorch/aten/src/ATen/../../../third_party/cutlass/tools/util/include -I/home/ahmads/personal/pytorch/build/caffe2/aten/src -I/home/ahmads/personal/pytorch/aten/src/ATen/.. -I/home/ahmads/personal/pytorch/build/nccl/include -I/home/ahmads/personal/pytorch/c10/cuda/../.. -I/home/ahmads/personal/pytorch/c10/.. -I/home/ahmads/personal/pytorch/third_party/tensorpipe -I/home/ahmads/personal/pytorch/build/third_party/tensorpipe -I/home/ahmads/personal/pytorch/third_party/tensorpipe/third_party/libnop/include -I/home/ahmads/personal/pytorch/torch/csrc/api -I/home/ahmads/personal/pytorch/torch/csrc/api/include -isystem /home/ahmads/personal/pytorch/build/third_party/gloo -isystem /home/ahmads/personal/pytorch/cmake/../third_party/gloo -isystem /home/ahmads/personal/pytorch/cmake/../third_party/tensorpipe/third_party/libuv/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/googletest/googlemock/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/googletest/googletest/include -isystem /home/ahmads/personal/pytorch/third_party/protobuf/src -isystem /home/ahmads/personal/pytorch/third_party/XNNPACK/include -isystem /home/ahmads/personal/pytorch/third_party/ittapi/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/eigen -isystem /usr/local/cuda/include -isystem /home/ahmads/personal/pytorch/third_party/ideep/mkl-dnn/include/oneapi/dnnl -isystem /home/ahmads/personal/pytorch/third_party/ideep/include -isystem /home/ahmads/personal/pytorch/INTERFACE -isystem /home/ahmads/personal/pytorch/third_party/nlohmann/include -isystem /home/ahmads/personal/pytorch/third_party/NVTX/c/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/cudnn_frontend/include -DLIBCUDACXX_ENABLE_SIMPLIFIED_COMPLEX_OPERATIONS -D_GLIBCXX_USE_CXX11_ABI=1 -Xfatbin -compress-all -DONNX_NAMESPACE=onnx_torch -gencode arch=compute_90,code=sm_90 -Xcudafe 
--diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda  -Wno-deprecated-gpu-targets --expt-extended-lambda -DCUB_WRAPPED_NAMESPACE=at_cuda_detail -DCUDA_HAS_FP16=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -O3 -DNDEBUG -std=c++17 -Xcompiler=-fPIC -DTORCH_USE_LIBUV -DCAFFE2_USE_GLOO -Xcompiler -Wall -Wextra -Wdeprecated -Wno-unused-parameter -Wno-missing-field-initializers -Wno-array-bounds -Wno-unknown-pragmas -Wno-strict-overflow -Wno-strict-aliasing -Wunused-function -Wunused-variable -Wunused-but-set-variable -Wno-maybe-uninitialized -MD -MT caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/layer_norm_kernel.cu.o -MF caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/layer_norm_kernel.cu.o.d -x cu -c /home/ahmads/personal/pytorch/aten/src/ATen/native/cuda/layer_norm_kernel.cu -o caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/layer_norm_kernel.cu.o

```

So the new PR adds about 6 seconds of compile time for this file.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148605
Approved by: https://github.com/ngimel
2025-03-27 03:01:53 +00:00
b2b9aaf0ad Fix non-strict export doesn't turn on dynamo for hop (#149903)
Somehow torch._dynamo.is_compiling was changed to torch.compiler.is_compiling(), which also checks whether we're exporting. This was not caught by CI because we don't have an export test for scan.

Changing to torch.compiler.is_dynamo_compiling and added a test.

edit: piggyback the re-tracing support in this PR. Related code in combine_fn_is_normalized.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149903
Approved by: https://github.com/zou3519
2025-03-27 02:38:05 +00:00
dad0854d48 meta registration for torch._scaled_mm with mxfp8 (#148461)
Summary:

Adds the meta registration logic for torch.compile to work with
`torch._scaled_mm` with mxfp8.  Thanks to @eellison  for the pointer to make inductor work with this.

Test Plan:

```
pytest test/test_matmul_cuda.py -k test_blockwise_mxfp8_compile -s
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148461
Approved by: https://github.com/drisspg, https://github.com/eellison
2025-03-27 02:32:40 +00:00
d5593ea31c Introduce guard_or_true, guard_or_false (#148430)
some context in this document:
https://docs.google.com/document/d/18nJsj-F2C_QXO7ClwzPcAUENQ-B440B43W7DdDnlDt4/edit?tab=t.0#heading=h.pgebnyi7pocj

But TLDR;
`guard_or_true`, `guard_or_false` are better than `guard_size_oblivious` due to :
- Easier to reason about what assumptions we are making while reading the code.
- Avoid size_oblivious complexity that is not needed.
- Avoid unsoundness that could make `guard_size_oblivious(a==1)` be true when it's not true for some value `a` during runtime.
- Less data dependent errors for some cases: ex, when doing `guard_size_oblivious(a==1)` and we know `a` is a tensor size, if it's traced with `a=u1-u2` `guard_size_oblivious(a==1)` will throw a data dependent error but `guard_or_false` will just return `False`.

### How is it different from statically_known_true??
**`if(cond)`:** (normal guarding) will try to evaluate statically and guard on the condition, willing to restrict input space to evaluate cond. if it fails to evaluate due to data dependent error will throw an exception (that could be converted to graph break in some situations).

**`statically_known_true(cond)`:** would be used when you never want to add a guard (restrict your input space), but just want to do a best effort check to see if you can infer that something is true/false ONLY based on existing constraints.

**`guard_or_true(cond)`/`guard_or_false(cond)`:** Those would be used in situations you prefer to guard and know the result of the expression over not guarding, but in case you hit a data dependent error you are ok with just returning true or false.
Some reasons you might be ok with returning true/false instead could be:
1. It's an optimization I do not want to fail for not performing optimization.
2. I am willing to deviate from the normal semantics when I have unbacked for the benefit of not failing (See the doc above for more details).

**`definitely_true(cond)`**: same as `guard_or_false(cond)` except does not try to do static eval for unbacked (planning to deprecate it and replace uses with `guard_or_false` or make it alias to `guard_or_false`)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148430
Approved by: https://github.com/bobrenjc93
2025-03-27 02:22:20 +00:00
c2b8fead43 Allow TritonTemplate subclasses to override kernel type (#150018)
Allows subclasses of `TritonTemplate` to override the kernel type, e.g.
```
class MyTritonTemplate(TritonTemplate):
    kernel_type = MyTritonTemplateKernel
```

This means that all of the logic in `TritonTemplate` class doesn't need to be duplicated in subclasses if the only required change is the kernel type.

Note that there is precedent for doing this - see `SIMDScheduling` in `torch/_inductor/codegen/simd.py`:

```
class SIMDScheduling(BaseScheduling):
    kernel_type: type[Any] = SIMDKernel  # override in subclass
...
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150018
Approved by: https://github.com/jansel
2025-03-27 02:16:40 +00:00
8d1cfb63b5 [export] Save unflattened gm (#150030)
Summary: Reland of D71082652

Test Plan:
https://www.internalfb.com/intern/testinfra/testrun/8444249558423545
https://www.internalfb.com/intern/testinfra/testrun/7318349652864293
https://www.internalfb.com/intern/testinfra/testrun/13229323980143778
https://www.internalfb.com/intern/testinfra/testrun/11540474119884081

Differential Revision: D71902033

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150030
Approved by: https://github.com/pianpwk
2025-03-27 02:01:51 +00:00
128b32f363 cache loaded python modules (#149910)
I am splitting caching the loading of modules from caching the codegen, since it's trivial and much easier.
Module loading is 50% of the cost and codegen the other 50% of maybe_append_choice on a full-graph model, which is 40% of total compile time.

<img width="434" alt="Screenshot 2025-03-24 at 4 35 12 PM" src="https://github.com/user-attachments/assets/aa851c6a-bde9-43f8-b12d-e439504ef62c" />

running mm_loop benchmark,
before this change:
67947323682

after this change:
25845073249

2.6X faster.

It seems that the cache was there and then got dropped. I added a benchmark so it won't be dropped again by mistake.
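
A hedged sketch of the memoization idea (the helper and keying below are hypothetical, not the actual PyCodeCache change): load each generated module once per source key and reuse it on subsequent choices.

```python
import functools
import importlib.util

@functools.lru_cache(maxsize=None)
def load_generated_module(source_key: str, path: str):
    # Subsequent calls with the same (source_key, path) reuse the module
    # instead of re-exec'ing the generated source from disk.
    spec = importlib.util.spec_from_file_location(f"gen_{source_key}", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module
```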

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149910
Approved by: https://github.com/eellison, https://github.com/aorenste
ghstack dependencies: #149932
2025-03-27 00:45:09 +00:00
48cff64a54 [pt2_provenance_tracing] add combo kernel nodes post_grad nodes origin info (#149598)
Summary: found it helpful when running prod model with combo_kernel feature enabled

Test Plan: CI

Differential Revision: D71513304

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149598
Approved by: https://github.com/yushangdi
2025-03-27 00:26:24 +00:00
731b559f54 [easy] Use config patch to toggle capture_scalar_output (#150036)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150036
Approved by: https://github.com/angelayi
ghstack dependencies: #149087, #149667
2025-03-27 00:01:39 +00:00
999fa15ba8 [invoke_subgraph][fake tensor cache] Add a finalizer for id hashed objects (#149667)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149667
Approved by: https://github.com/zou3519
ghstack dependencies: #149087
2025-03-27 00:01:39 +00:00
a7596b4b34 [invoke_subgraph] Fake tensor prop caching (#149087)
Redoing https://github.com/pytorch/pytorch/pull/137808
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149087
Approved by: https://github.com/zou3519
2025-03-27 00:01:39 +00:00
3efa211e48 [ONNX] Annotate None inputs in symbolic ops (#150038)
Add `None` to type annotations of `torch.onnx.ops.symbolic*` ops and improve tests to test support for optional inputs. Previously it was omitted mistakenly even though the implementation supports it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150038
Approved by: https://github.com/titaiwangms
2025-03-27 00:01:09 +00:00
6db95ccf4c Delete linux-focal-cuda12_6-py3_10-gcc11-bazel-test (#150066)
It's been broken for a while, even when these jobs were still called `linux-focal-cuda12.4-py3.10-gcc9-bazel-test`.
The last time it ran successfully was on Feb 21st.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150066
Approved by: https://github.com/yangw-dev, https://github.com/seemethere, https://github.com/atalman
2025-03-26 23:55:58 +00:00
43cc954f88 Refactor row-wise scaled MM (#149978)
1. Add config selection for SM89.
2. Only build kernels if compiling for given arch.
3. Factor out CMake code to enforce compiling for needed archs for individual files into a function.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149978
Approved by: https://github.com/drisspg
2025-03-26 23:49:41 +00:00
6aca002d82 [MPS] Add chebyshev_polynomial_[uvw] (#150060)
For both eager and inductor

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150060
Approved by: https://github.com/dcci, https://github.com/jansel
2025-03-26 23:35:05 +00:00
185aaaaf8e Revert "Improve subproc autotuning implementation (#149700)"
This reverts commit 8cd6a133f21821f0713116f0f9a55e5368de8c1c.

Reverted https://github.com/pytorch/pytorch/pull/149700 on behalf of https://github.com/yangw-dev due to This is breaking servicelab_benchmark_pyper_local_runner internally ([comment](https://github.com/pytorch/pytorch/pull/149700#issuecomment-2755975959))
2025-03-26 23:17:01 +00:00
db8f4c1b1b [MPSInductor] Run chebyshev_polynomial_t tests (#150042)
Test name should start with `test_`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150042
Approved by: https://github.com/dcci
2025-03-26 22:50:08 +00:00
9aa0612dd3 [targets2buck] Remove tombstone messages proactively (#147897)
Summary:
X-link: https://github.com/pytorch/executorch/pull/8703

Originally we created a bunch of empty `TARGETS` files to allow us to enable `BUCK` files in fbcode by hiding the existing BUCK file. These files were subsequently merged together using `non_fbcode_target` so these tombstones are no longer necessary.

This diff fixes all files that WOULD have had the useless tombstone merged into them. To create this diff, I just ran the merger script that Codemod Service is using and then deleted the "merged from" and tombstone lines with `sed`, `arc f` and reverted any lines that didn't make sense

Test Plan: CI

Differential Revision: D69994481

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147897
Approved by: https://github.com/izaitsevfb
2025-03-26 22:15:17 +00:00
c0af782f30 [ROCm] Change LoadHIP to use find_file for rocm_version.h (#149983)
Fixes #149805

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149983
Approved by: https://github.com/jeffdaily
2025-03-26 21:26:41 +00:00
625913eefc [MTIA] [Triton] Set codename of MTIA device in triton heuristics (#149860)
Summary: Triton-MTIA expects the codename of the device as the arch when querying the module map, not the compute capability. This diff gets rid of the following error: `No libdevice is provided for arch (0, 0)`

Test Plan: CI

Reviewed By: Myrthan

Differential Revision: D70072095

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149860
Approved by: https://github.com/jansel
2025-03-26 20:58:12 +00:00
87bfd66c3c gloo: update to latest version (#149985)
This updates submodule Gloo to the latest version and brings a number of benefits:

* connection retries d2609ab5e8
* better error messages 5ca057d6cc
* multi_get support for larger scale jobs 4ff6edf45f
* metadata exchange optimizations  20dc202dd8
* miscellaneous other fixes

Old commit: 5354032ea0

Test plan:

This is already being used in production environments at scale.

PyTorch CI

```
pytest -v test/distributed/test_c10d_gloo.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149985
Approved by: https://github.com/fduwjj, https://github.com/malfet
2025-03-26 19:19:31 +00:00
039ebdc192 [Graph Partition] Support symbol inputs (#149458)
This PR supports symbol inputs to graph partition functions. Before this PR, we rely on `node.read_writes` to get partition inputs. However, this does not cover symbol inputs.

In this PR, for each graph partition, we collect all symbol inputs which are required to be in scope to successfully perform codegen, including:
- free symbols used in partition nodes.
- free symbols in partition input/node shapes, strides, and offsets. This is needed for recording cudagraphs for tensors with dynamic shapes.

### Note1: MutationLayout
In this example, node.layout is MutationLayoutSHOULDREMOVE. The symint from index `n` does not appear in the size, offset, or strides of node.layout. This symint appears in node.layout.target, so we need extra handling for it.

```python
x = torch.zeros(7, device="cuda")

def fn(n, a):
    a[n] = -1
    return a

opt_fn = torch.compile(fn, fullgraph=True)

for n in range(2, x.shape[0]):
    opt_fn(n, x)
```

### Note2: Composability with Padded Tensor Subclass

W/o graph partition, Padded Tensor subclass lifts outer shapes to input arguments (i.e., arg0_1 for s0, arg1_1 for s1) but does not lift inner shapes (i.e., s2 and s3). Since cudagraph cache relies on integer inputs, it will cache on outer shapes and ignore inner shapes, which is bad.

```
def call(args):
    arg0_1, arg1_1, arg2_1, arg3_1, arg4_1, arg5_1 = args
    args.clear()
    s0 = arg0_1
    s1 = arg1_1
    arg2_1_size = arg2_1.size()
    s2 = arg2_1_size[0]
    s3 = arg2_1_size[1]
    assert_size_stride(arg2_1, (s2, s3), (s3, 1))
    with torch.cuda._DeviceGuard(0):
        torch.cuda.set_device(0)
        buf0 = empty_strided_cuda((s2, s3), (s3, 1), torch.float32)
        # Topologically Sorted Source Nodes: [x1, mul], Original ATen: [aten.add, aten.mul]
        triton_poi_fused_add_mul_0_xnumel = s2*s3
        stream0 = get_raw_stream(0)
        triton_poi_fused_add_mul_0.run(arg2_1, buf0, triton_poi_fused_add_mul_0_xnumel, stream=stream0)
        del arg2_1
    return (buf0, s0, s1, s1, )
```

w/ graph partition, the partition function only includes tensor and inner shapes as inputs, to make sure the cudagraph caching is correct. Full Comparison: [code](https://www.internalfb.com/intern/diffing/?paste_number=1761674743)
```python
   def call(self, args):
        arg0_1, arg1_1, arg2_1, arg3_1, arg4_1, arg5_1 = args
        args.clear()
        s0 = arg0_1
        s1 = arg1_1
        arg2_1_size = arg2_1.size()
        s2 = arg2_1_size[0]
        s3 = arg2_1_size[1]
        assert_size_stride(arg2_1, (s2, s3), (s3, 1))
        partition0_args = [arg2_1, s2, s3]
        del arg2_1
        (buf0,) = self.partitions[0](partition0_args)
        del partition0_args
        return (buf0, s0, s1, s1, )
```

The number of cudagraphs is validated below: (also added to test)
```python
import torch

from padded_tensor import PaddedTensor

# Turning off graph_partition leads to
# torch._inductor.cudagraph_trees.get_container(0).tree_manager.new_graph_id().id=6
# at the end, which is wrong.
# torch._inductor.config.graph_partition = False

# Turning on graph_partition leads to
# torch._inductor.cudagraph_trees.get_container(0).tree_manager.new_graph_id().id=4
# at the end, which is correct.
torch._inductor.config.graph_partition = True

def f(x):
    x1 = x + 1
    return x1 * 2

compiled_f = torch.compile(f, mode="reduce-overhead")

def run(shape):
    x = torch.randn(*shape, device="cuda")
    pad_x = PaddedTensor.from_tensor(x, multipliers={0:4, 1:4})
    assert hasattr(pad_x, "multipliers"), breakpoint()
    eager_out = f(pad_x)

    for _ in range(3):
        compiled_out = compiled_f(pad_x)
    compiled_out = compiled_f(pad_x)

    assert eager_out.shape == compiled_out.shape
    assert eager_out.tensor.shape == compiled_out.tensor.shape
    assert torch.allclose(eager_out.tensor, compiled_out.tensor)

# static shape. record a NEW cudagraph. 1 cudagraph in total now.
run((2,3))
# outer shape is dynamic, leading to a new dynamo graph
# this new dynamo graph forces a NEW cudagraph. 2 cudagraphs in total now
run((3,4))
# outer shape changed but inner shape does not change
# so NO new cudagraph is recorded
run((2,2))
# inner shape is dynamic now, leading to a new dynamo graph
# this new dynamo graph forces a NEW cudagraph. 3 cudagraphs in total now
run((5,6))
# does NOT record a new cudagraph
run((7,8))
# record a NEW cudagraph. 4 cudagraphs in total now
run((10,11))

assert torch._inductor.cudagraph_trees.get_container(0).tree_manager.new_graph_id().id == 4
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149458
Approved by: https://github.com/eellison
2025-03-26 17:21:30 +00:00
4a9466c96a Newer conda versions require --update-deps to update dependencies such as libgcc-ng (#149599)
* When we try to install [libstdcxx-ng 12.3.0 from conda-forge](595293316d/.ci/docker/common/install_conda.sh (L65)), conda 24.7.1 updates the dependencies of that package, including libgcc-ng package to the following:  `libgcc-ng-14.2.0 | h69a702a_2 52 KB conda-forge`

* However, conda updated their installer script on Feb 6 2025 to version 25.1.1, which behaves differently from previous versions when installing conda packages.

* conda 25.1.1 does *not* update any dependencies in the above step, and hence the same installation of libgcc-ng from "defaults" channel is present: `libgcc-ng pkgs/main/linux-64::libgcc-ng-11.2.0-h1234567_1`

* Adding the "--update-deps" flags to the conda install command installs a newer libgcc-ng package from the "conda-forge" conda channel:  `libgcc-ng-12.3.0 | h77fa898_13 762 KB conda-forge`, which is compatible with the libstdcxx-ng 12.3.0 package

* Compare this [Feb 4 docker build](https://github.com/pytorch/pytorch/actions/runs/13148456164/job/36691412387#step:6:5179) to this [Feb 10 docker build](https://github.com/pytorch/pytorch/actions/runs/13247023578/job/36975931849#step:6:5451), which shows that the latter does *not* update libgcc-ng.

* This creates linking issues when trying to use a library, that was built with a newer libgcc_s.so.1 (from libcc-ng package), in the PyTorch conda environment. Eg. ONNX-RT:
```
[0;93m2025-02-13 10:18:38.492434704 [W:onnxruntime:Default, migraphx_execution_provider.cc:167 get_flags_from_env]
[MIGraphX EP] MIGraphX ENV Override Variables Set:
2025-02-13 10:18:38.628064251 [E:onnxruntime:Default, provider_bridge_ort.cc:2028 TryGetProviderInfo_ROCM] /onnxruntime/onnxruntime/core/session/provider_bridge_ort.cc:1636 onnxruntime::Provider& onnxruntime::ProviderLibrary::Get() [ONNXRuntimeError] : 1 : FAIL : Failed to load library libonnxruntime_providers_rocm.so with error: /opt/conda/envs/py_3.10/bin/../lib/libgcc_s.so.1: version `GCC_12.0.0' not found (required by /opt/conda/envs/py_3.10/lib/python3.10/site-packages/onnxruntime/capi/libonnxruntime_providers_rocm.so)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149599
Approved by: https://github.com/malfet
2025-03-26 17:04:21 +00:00
b2088f1afe Add inductor test for torchbind symint (#149980)
Summary: add test

Test Plan:
```
buck run //caffe2/test:test_export -- -r test_compile_custom_obj_unbacked_symint
```

Differential Revision: D71843179

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149980
Approved by: https://github.com/BoyuanFeng
2025-03-26 17:02:55 +00:00
a0253d2840 [Inductor] Use real input to autotune user defined triton kernels (#149553)
Summary:
User-defined Triton kernels sometimes rely on real inputs to determine
the path of execution. We need real inputs to invoke the correct
behavior of the user-defined Triton kernels (see the example in the test case,
where we have an early return for random inputs).

Test Plan:
Included in the commit.
python test/inductor/test_aot_inductor.py -k triton_autotuning
python test/inductor/test_aot_inductor.py -k triton_mutated_autotuning

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149553
Approved by: https://github.com/davidberard98, https://github.com/eellison
2025-03-26 16:42:48 +00:00
3a8171efad [MPS] Preserve in/out dtypes in binary_op name (#150024)
To be consistent with unary ops and to avoid silent correctness problems if someone tries to invoke the op with an unexpected out dtype.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150024
Approved by: https://github.com/dcci
2025-03-26 16:00:43 +00:00
32299e5f9a Reland "Introduce new template heuristic for triton autotune configs" (#147452)
This change was reverted in https://github.com/pytorch/pytorch/pull/147388 for regressing an internal workload.

I have removed the additional ir.device_type calls in mm_scaled and unpack_mixed_mm.py which could be contributing to the additional compile time.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147452
Approved by: https://github.com/jansel
2025-03-26 15:47:06 +00:00
7336b76bcc Refactor cudnn version check in smoke test for Windows (#150015)
After https://github.com/pytorch/pytorch/pull/149885

I see failures on the Windows smoke test:
https://github.com/pytorch/test-infra/actions/runs/14069923716/job/39401550854

This is due to the fact that pypi packages such as cudnn and nccl are installed only on Linux, so this should resolve the issue on the Windows platform.
On Windows, cudnn is shipped with PyTorch as opposed to being installed dynamically.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150015
Approved by: https://github.com/ZainRizvi
2025-03-26 15:15:46 +00:00
8a40fca9a1 Support huggingface reading and writing for multi rank case (#148189)
Summary: This diff adds the ability for HF reader/writer to read/write in a distributed way. We do this by sending all the tensors meant for the same file to the same rank.
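
A hedged sketch of the routing idea in the summary (the helper is hypothetical, not the actual planner logic): every tensor bound for a given file is owned by one rank, chosen deterministically from the filename.

```python
import zlib

def rank_for_file(filename: str, world_size: int) -> int:
    # Deterministic across processes (unlike built-in hash() on strings), so
    # every rank agrees on which rank writes which safetensors file.
    return zlib.crc32(filename.encode("utf-8")) % world_size

# Example: group output files by owning rank before writing.
files = ["model-00001-of-00002.safetensors", "model-00002-of-00002.safetensors"]
owners = {f: rank_for_file(f, world_size=4) for f in files}
```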

Test Plan:
ensure existing tests pass
I also ran a full end to end test on my devserver to read/write from my HF repo

Differential Revision: D70096439

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148189
Approved by: https://github.com/joecummings, https://github.com/saumishr
2025-03-26 14:47:31 +00:00
0c139fa58e Switch s390x tests to blocklist (#149507)
Switch s390x tests to blocklist
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149507
Approved by: https://github.com/seemethere
2025-03-26 12:11:41 +00:00
7379c66344 add loop mm benchmark (#149932)
results:
compile time instruction count for iteration 4 is 67947323682

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149932
Approved by: https://github.com/bobrenjc93, https://github.com/eellison
2025-03-26 11:21:30 +00:00
cyy
79e8a69257 Enable move warnings for torch targets (#149923)
This PR enables more move warnings for torch targets and fixes some code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149923
Approved by: https://github.com/malfet
2025-03-26 08:38:13 +00:00
de68ddc68e [MPS] Fix metal ops with different dtypes (#149974)
By implementing `_cast_` flavors of both dense and strided ops. Add regression tests that tests `fmax`/`fmin` for mixed dtypes.

I've been dreading writing this PR for a while, as it ended up being pretty bulky:
 - Adds `C10_METAL_ALL_TYPES_FUNCTOR` and `c10::metal::ScalarType` to `c10/metal/common.h` and tests that its values always match `c10::ScalarType`
 - Add `c10::metal::cast_to` to `c10/metal/utils.h` which could be used to cast any scalar metal dtype to any other one, including complex values
 - Implement `val_at_offs<T>(constant void *, long offs, ScalarType dtype)` that is used to dynamically cast types
 - Add `binary_strided_cast` and `binary_dense_cast` that are invoked for output dtype and cast both inputs to that output before performing the op

Benchmarks collected on an M2 Pro running fmax on 1M-element tensors (times are in microseconds):

|                                           |  dense-dense  |  transp-transp  |  dense-transp  |  transp-dense  |  dense-scalar  |  dense-bcast |
|-------------------------|---------------|----------------|----------------|----------------|---------------|--------------- |
|      fmax (torch.float16, torch.float16)  |     160.9     |      159.9      |     270.5      |     270.9      |     236.6      |     293.0
|      fmax (torch.float32, torch.float32)  |     176.9     |      171.0      |     273.7      |     293.5      |     242.6      |     294.2
|      fmax (torch.float32, torch.float16)  |     171.4     |      170.9      |     283.6      |     303.0      |     253.7      |     302.3
|      add (torch.float16, torch.float16)   |     218.0     |      223.6      |     221.0      |     222.0      |     214.9      |     218.3
|      add (torch.float32, torch.float32)   |     227.4     |      233.9      |     228.8      |     231.9      |     218.9      |     221.4
|      add (torch.float32, torch.float16)   |     226.1     |      227.5      |     227.5      |     226.9      |     177.0      |     190.8

TODOS:
 - Include input and output dtype in non-cast kernel name
 - Make TensorFactory.h use `C10_METAL_ALL_TYPES_FUNCTOR`
 - Extend mixed_dtypes testing via OpInfo

Fixes https://github.com/pytorch/pytorch/issues/149951
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149974
Approved by: https://github.com/manuelcandales
2025-03-26 07:03:21 +00:00
aa575cab71 Skip cxxabi check for s390x (#149954)
On s390x, gcc 14 is used because it contains a fix for the interaction between precompiled headers and vectorization builtins. This fix is not available in earlier gcc versions. gcc-14 uses ABI 19, but the check still fails, so skip it for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149954
Approved by: https://github.com/cyyever, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-03-26 06:50:27 +00:00
6ae8eb881c [ONNX] Clean up the diagnostics module (#149864)
Remove the diagnostics/SARIF module from the ONNX exporter because it is obsolete and unused.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149864
Approved by: https://github.com/titaiwangms
2025-03-26 05:58:32 +00:00
d256b2dcb2 Revert "[custom_ops][perf] Move expensive pytree traversals of tensors to C++ (#148555)"
This reverts commit d686d04c2f3bac110044ebad5cc46e3035d7b425.

Reverted https://github.com/pytorch/pytorch/pull/148555 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/148555#issuecomment-2753283221))
2025-03-26 05:27:52 +00:00
819b23e0b4 Support None return type in torchbind and Add more AOTI torchbind e2e tests (#149749)
Summary:
- Add more tests for torchbind in aoti

**FallBackKernel**
- In FallbackKernel.find_device, do not check the device of torchbind objects because they don't have a fixed "device"
- If no device is found for CallTorchBindObject, use cpu
- Handle None output in `export_extern_kernel_node`

Test Plan:
```
buck run //sigmoid/inference/test:e2e_test_cpu -- -r CustomClassHolderConstantDynamic
```

Differential Revision: D70746626

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149749
Approved by: https://github.com/desertfire
2025-03-26 04:20:14 +00:00
71acb1bb42 [inductor] Fix division by zero error in fractional max (#148729)
Fixes https://github.com/pytorch/pytorch/issues/148152
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148729
Approved by: https://github.com/eellison
2025-03-26 04:18:50 +00:00
eqy
9108d153ce [CUDA][SymmetricMemory] Interpret empty string as std::nullopt in rendezvous (#149793)
This is a "temporary" fix, as the current internal API requires strings at some interfaces instead of `std::optional`, and empty strings are presumably used in lieu of `nullopt`.
e.g.,
9d02b3993f/torch/csrc/distributed/c10d/intra_node_comm.cu (L49)

this currently breaks `test_intra_node_comm_all_reduce`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149793
Approved by: https://github.com/kwen2501, https://github.com/cyyever
2025-03-26 03:59:43 +00:00
ab9ca6b31f Revert "[inductor] Fix mm logging for torch._scaled_.mm (#149967)"
This reverts commit 661d74bf4483e19e158c41b55d47f02eb9fdcc21.

Reverted https://github.com/pytorch/pytorch/pull/149967 on behalf of https://github.com/malfet due to This broke ROCM testing, see 45b11730f1/1 ([comment](https://github.com/pytorch/pytorch/pull/149967#issuecomment-2753149024))
2025-03-26 03:29:59 +00:00
45b11730f1 [ROCm][TunableOp] TunableOp Context Manager for unit tests (#149930)
This PR is cleanup only. There are no feature changes or bug fixes.

We create a TunableOp context manager for setting up and cleanup. We re-write TunableOp unit tests in terms of this context manager. Ultimately reduces the amount of copy-paste code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149930
Approved by: https://github.com/jeffdaily
2025-03-26 02:59:58 +00:00
a8d0c5c928 [inductor][triton 3.3] Fix cpp_wrapper w/ TMA in triton 3.3 (#149973)
Fixes #148938

Context:

In triton 3.3, triton kernels expect a global scratch space arg to be passed in. This is fixed in #148051, which fixed most of the AOTI/cpp_wrapper failures; the fix is to inject a (null) global scratch space arg passed as an argument to all kernels.

But in the case of TMA, we need to call a non-triton-generated function - init1DTMADescriptor. The same `generate_args_decl` function used for calling triton kernels (and modified in #148051 to insert a global scratch space) is used to prepare the arguments to init1DTMADescriptor, and so it had an extra global scratch space arg. Then we'd get a null pointer passed into init1DTMADescriptor, resulting in an IMA later on when the TMA-using kernel ran.

This PR: adds an option to `generate_args_decl` to specify whether this is a triton kernel (in which case we should add the global scratch space arg) or not (when we shouldn't add the extra arg).

Note: this doesn't appear in CI because we don't run these tests with Hopper machines in CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149973
Approved by: https://github.com/drisspg
2025-03-26 00:12:02 +00:00
1b373f6cd4 Revert "cpp_wrapper: Fix even more tests (#147225)"
This reverts commit 62d351a35b1bd961afbd09057beec14ff201c41d.

Reverted https://github.com/pytorch/pytorch/pull/147225 on behalf of https://github.com/yangw-dev due to broke [ROCM mi300 test](https://github.com/pytorch/pytorch/actions/runs/14066803692/job/39393110086) in [HUD](https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=rocm-mi300%20%2F%20linux-focal-rocm6.3-py3.10%20%2F%20test%20(default%2C%201%2C%206%2C%20linux.rocm.gpu.mi300.2)&mergeLF=true) ([comment](https://github.com/pytorch/pytorch/pull/147225#issuecomment-2752799778))
2025-03-26 00:03:13 +00:00
91bf92597c Revert "cpp_wrapper: precompile a few more commonly used headers, and improve RAIIPyObject interface (#149350)"
This reverts commit 0de70fbbe73d2109497cd57ed5402e0cf9450f18.

Reverted https://github.com/pytorch/pytorch/pull/149350 on behalf of https://github.com/yangw-dev due to broke [ROCM mi300 test](https://github.com/pytorch/pytorch/actions/runs/14066803692/job/39393110086) in [HUD](https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=rocm-mi300%20%2F%20linux-focal-rocm6.3-py3.10%20%2F%20test%20(default%2C%201%2C%206%2C%20linux.rocm.gpu.mi300.2)&mergeLF=true) ([comment](https://github.com/pytorch/pytorch/pull/147225#issuecomment-2752799778))
2025-03-26 00:03:13 +00:00
3c85784980 Fix broken LazyLinear init (#149693)
Fixes #149691

I believe it does not negatively impact the fix in https://github.com/pytorch/pytorch/pull/147599, as the tests still pass, but @FFFrog should confirm.
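
For context, a minimal sketch of the lazy initialization path this touches (public API; shapes are illustrative):

```python
import torch

layer = torch.nn.LazyLinear(out_features=4)  # in_features is inferred on the first forward pass
y = layer(torch.randn(2, 8))                 # materializes a (4, 8) weight
assert y.shape == (2, 4)
```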

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149693
Approved by: https://github.com/mikaylagawarecki, https://github.com/FFFrog, https://github.com/malfet
2025-03-25 23:49:49 +00:00
661d74bf44 [inductor] Fix mm logging for torch._scaled_.mm (#149967)
Summary:
This pr is just for recreation of the original pr: https://github.com/pytorch/pytorch/pull/149769

Fix mm logging for the `torch._scaled_mm` op, which breaks the original brittle underscore-parsing assumptions.

Test Plan: CI

Differential Revision: D71828732

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149967
Approved by: https://github.com/vkuzo
2025-03-25 23:38:35 +00:00
c05328e01a [ROCm] fix uninitialized warning in BFloat16.h (#149868)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149868
Approved by: https://github.com/jeffdaily, https://github.com/cyyever
2025-03-25 23:36:10 +00:00
36eb64d60e [ROCm] missing AT_CUDA_CHECK for cub and SoftMax (#149883)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149883
Approved by: https://github.com/jeffdaily, https://github.com/Skylion007
2025-03-25 23:22:32 +00:00
eqy
de73790fe6 [cuDNN][SDPA] cuDNN SDPA supports head_dim <= 256 on sm90 and sm100 as of 9.5.1+ (#149904)
gqa check PR will go next...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149904
Approved by: https://github.com/drisspg
2025-03-25 23:10:16 +00:00
68b327341c Fix #149806 : Fix path lookup in _preload_cuda_deps (#149808)
@pytorchbot label "bug"

Fixes #149806

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149808
Approved by: https://github.com/jansel
2025-03-25 23:03:47 +00:00
ce54c430c0 [Submodule] [cpuinfo] cpuinfo update (#149305)
Updating `cpuinfo` module.

Relevant:
https://github.com/pytorch/cpuinfo/issues/270
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149305
Approved by: https://github.com/malfet
2025-03-25 22:44:50 +00:00
feb503c1df [AOTInductor] Refine error message for dlopen in AOTInductor (#149812)
Summary:
Refine the error message if dlopen failed in AOTInductor.
The original error message was ominous; it now recommends that the user rebuild AOTInductor if needed, and otherwise notes that everything is fine.

Test Plan:
None. Error message change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149812
Approved by: https://github.com/chenyang78, https://github.com/jingsh
2025-03-25 21:45:10 +00:00
0159f8ed54 [ROCm] build magma rocm and upload tarball (#149902)
This will improve docker image build times by not having to rebuild magma rocm for unrelated changes.  This PR is step 1 of 2.  The next step is a second PR to modify the docker image builds to use the magma tarball that this PR will produce.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149902
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-03-25 21:37:13 +00:00
d3b7cf7b7d Revert "[ROCm] build magma rocm and upload tarball (#149902)"
This reverts commit bf8f4efd3158204592643e6cf26889fff5afcee2.

Reverted https://github.com/pytorch/pytorch/pull/149902 on behalf of https://github.com/seemethere due to This is currently breaking lint see [GH job link](https://github.com/pytorch/pytorch/actions/runs/14069330750/job/39399569526) [HUD commit link](bf8f4efd31) ([comment](https://github.com/pytorch/pytorch/pull/149902#issuecomment-2752594578))
2025-03-25 21:33:00 +00:00
e85ce64bde [MPS/Inductor] Add support for chebyshev_polynomial_t. (#149928)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149928
Approved by: https://github.com/malfet
2025-03-25 21:02:13 +00:00
6c9d48b32b refresh results of benchmarks (#149936)
While the test was disabled I put in a fix, but another win change landed before the test was restored, so it stayed disabled.
<img width="698" alt="Screenshot 2025-03-24 at 6 26 36 PM" src="https://github.com/user-attachments/assets/2713c685-aee2-4dea-9a6c-cad01ef575cd" />
caused by
https://github.com/pytorch/pytorch/pull/149295

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149936
Approved by: https://github.com/bobrenjc93
2025-03-25 21:01:08 +00:00
90110b069f Use statically known true in should_decompose_mm (#149950)
This meta function is causing recompiles for large ads runs due to overguarding: https://www.internalfb.com/ai_infra/job_inspector/guided/pt2_compile?jobName=aps-ig_fm_v4_pt2_on-6e0a734dcc&jobVersion=0&jobAttempt=0

If we look at the reasons, it's because of this function adding guards: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/aps-ig_fm_v4_pt2_on-6e0a734dcc/attempt_0/version_0/rank_0/-_18_8_0/recompile_reasons_1971.json?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

This PR moves to statically_known_true so we don't overly guard for dynamic shapes.
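
A hedged illustration of the change in checking style (not the actual `should_decompose_mm` body; the threshold is made up):

```python
from torch.fx.experimental.symbolic_shapes import statically_known_true

def should_decompose(m, k):
    # bool(m >= 1024) on a SymInt would install a guard and can trigger recompiles;
    # statically_known_true returns False when the answer isn't known without guarding.
    return statically_known_true(m >= 1024) and statically_known_true(k >= 1024)
```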
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149950
Approved by: https://github.com/mengluy0125
2025-03-25 20:40:00 +00:00
ce3dc9e346 add some extra test oom skips for jetson due to lacking nvml support (#149587)
Add a couple of Jetson skips for oom tests in test/test_cuda.py due to failures in nvidia CI. Jetson not having full nvml support is a known issue so this is mostly a test side fix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149587
Approved by: https://github.com/eqy
2025-03-25 20:39:10 +00:00
b562d22772 test/test_cuda.py: rework TEST_PYNVML logic to make more sense, add not IS_JETSON condition (#149578)
PYNVML related tests in test/test_cuda.py are failing in nvidia internal CI for Jetson devices because Jetson devices don't fully support nvml (it exists as a stub library). In addition to skipping PYNVML tests for Jetson, this PR also reworks the TEST_PYNVML logic a bit to be more consistent with the rest of TEST_{something} conditions in test/test_cuda.py
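
A hedged sketch of the `TEST_{...}` gating convention described above (the exact condition used in the PR may differ):

```python
import unittest

import torch
from torch.testing._internal.common_utils import IS_JETSON

TEST_PYNVML = torch.cuda.is_available() and not IS_JETSON  # illustrative condition

class TestNvml(unittest.TestCase):
    @unittest.skipIf(not TEST_PYNVML, "nvml only exists as a stub library on Jetson")
    def test_nvml_query(self):
        ...
```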

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149578
Approved by: https://github.com/janeyx99, https://github.com/eqy
2025-03-25 20:38:15 +00:00
12628ba24d [AOTInductor] Bug fix for freeing buffers when freeing multiple times (#149810)
Summary:
We might free the active buffer if we free the buffer twice.

Test Plan:
```
LD_LIBRARY_PATH=/data/users/$USER/pytorch/build/lib
/home/$USER/local/pytorch/build/bin/test_aoti_inference
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149810
Approved by: https://github.com/chenyang78
2025-03-25 20:26:36 +00:00
bf8f4efd31 [ROCm] build magma rocm and upload tarball (#149902)
This will improve docker image build times by not having to rebuild magma rocm for unrelated changes.  This PR is step 1 of 2.  The next step is a second PR to modify the docker image builds to use the magma tarball that this PR will produce.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149902
Approved by: https://github.com/malfet
2025-03-25 20:20:36 +00:00
d1ff3ff675 [Bugfix] Add handling for buffer overrides (#149882)
Fixes #139167

This PR:
* Uses `named_buffers` to mark buffers as static
* Checks that `named_buffers` is of the expected type (callable, iterator) before trying to iterate over it; if not, we skip this pass

These changes fix the previous errors in dynamo that caused the crash (as shown in the issue above).
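
A hedged illustration of the guarded iteration (names are illustrative, not the dynamo internals):

```python
import torch

mod = torch.nn.BatchNorm1d(4)
named_buffers = getattr(mod, "named_buffers", None)
if callable(named_buffers):
    for name, buf in named_buffers():
        pass  # e.g. mark `buf` static; the pass is skipped entirely if the attribute was overridden
```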

### Unit Test
```
python test/dynamo/test_buffers_override.py
```

Results in:
```
.
----------------------------------------------------------------------
Ran 2 tests in 5.344s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149882
Approved by: https://github.com/anijain2305
2025-03-25 20:12:43 +00:00
8cd6a133f2 Improve subproc autotuning implementation (#149700)
Summary: The primary change is to update the autotune-in-a-subproc implementation to avoid using multiprocessing spawn. Spawn (re)executes the toplevel script in the subproc, which can be problematic. The approach here is similar to Triton parallel compile: we Popen a subproc on a controlled entry point and communicate over pipes. That change drove a lot of refactoring in the TuningProcess class, so I took the opportunity to simplify some things, rename some methods, etc.

One other notable change is around the timeout / kill approach. After a timeout, we were previously attempting to stop the subproc in three steps (graceful shutdown, sigkill if graceful fails, sigterm if sigkill fails). I'm gonna argue that's not useful: 1) The graceful shutdown is never going to work unless the subproc happens to have just completed its task and is ready to receive the next command. 2) If we're going to kill the subproc, let's just take the most aggressive approach and move on as quickly as possible to restarting it rather than waiting to see if previous shutdown attempts succeeded. The only downside that I can find is maybe a little log spew, e.g., ` ResourceWarning: subprocess 2987680 is still running`
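
A hedged sketch of the Popen-over-pipes pattern described above; the module path for the `__autotune_main__` entry point is assumed, not taken from the diff:

```python
import subprocess
import sys

proc = subprocess.Popen(
    [sys.executable, "-m", "torch._inductor.autotune_process.__autotune_main__"],  # assumed path
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
)
# Benchmark requests and results are exchanged over proc.stdin/proc.stdout; on timeout the
# subprocess is simply killed and restarted rather than asked to shut down gracefully.
```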

List of changes:
* Use Popen instead of spawn for the autotuning subprocess.
* Introduced a new entry point `__autotune_main__.py`
* Renamed some TuningProcess methods. For example `shutdown` makes more sense than `terminate` because the latter implies a forced kill.
* Simplified the implementation around benchmarking timeout and how we kill the subproc after a timeout.
* Deprecated the unused timeout configs in `_inductor/config.py`
* Moved `get_ld_library_path` helper to a common utils file.
* Added more unit tests for subproc crashes / timeouts / exceptions, etc.

Test plan:
* New unit tests
* Also ran internally with all combinations of: build mode `opt` and `dev-nosan`, and `buck run` vs. executing the `.par` file directly.
* Made sure the functionality to parallelize autotuning across different GPUs is working (it wasn't clear to me this was behaving the way we wanted it to).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149700
Approved by: https://github.com/aorenste, https://github.com/jansel, https://github.com/eellison
2025-03-25 20:07:28 +00:00
30e8be599f Revert "[ONNX] Clean up the diagnostics module (#149864)"
This reverts commit cc6e300fe225ac7f34f37494639b061ef45ceeec.

Reverted https://github.com/pytorch/pytorch/pull/149864 on behalf of https://github.com/malfet due to This indeed broke Mac testing see 1c98dc3664/1 ([comment](https://github.com/pytorch/pytorch/pull/149864#issuecomment-2752317873))
2025-03-25 19:31:50 +00:00
1c98dc3664 [dynamo] Fix handling of setattr with some tensor attributes (#149791)
We weren't handling `setattr(tensor_obj, "real", 42)` correctly, because
the attribute is a `GetSetDescriptorType` that has special setter logic.
See added test and comments for more explanations.

This patch makes it so that we graph break in those cases, rather than
resulting in silent incorrectness.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149791
Approved by: https://github.com/mlazos
ghstack dependencies: #149481
2025-03-25 18:57:56 +00:00
0de70fbbe7 cpp_wrapper: precompile a few more commonly used headers, and improve RAIIPyObject interface (#149350)
Add includes for torch.device, torch.dtype, torch.layout, and torch.memory_format to the cpp_wrapper common header, so that they get precompiled. Additionally, add move constructors and operator bool to RAIIPyObject.

Closes #142005.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149350
Approved by: https://github.com/desertfire
ghstack dependencies: #146706, #147225
2025-03-25 17:58:40 +00:00
62d351a35b cpp_wrapper: Fix even more tests (#147225)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147225
Approved by: https://github.com/desertfire
ghstack dependencies: #146706
2025-03-25 17:58:40 +00:00
0f1aaeb62e cpp_wrapper: persist autotune example tensors until last use (#146706)
Patches over an issue where randomly generated example tensors can cause kernel autotuning to fail, when those tensors would not be possible outputs from previous kernels in the sequence. This fixes a failure in `test_torchinductor_opinfo.py` when run with compile-time autotuning, `test_comprehensive_nanquantile_cuda_float64`.

For clarity, the situation triggering this PR looks like kernels `A -> BCDE -> F` (`BCDE` is fused), where one of the outputs from `A` is a boolean tensor describing some of the input data. Previously, we randomly regenerated that boolean tensor and the input data before passing them to `BCDE`, so that they no longer matched. This caused a `tl.device_assert` call in `BCDE` to fail. With this PR, we reuse the random data input to `A` and the output Boolean tensor, such that they match and pass the device assertion in `BCDE`.

Fixes #147799.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146706
Approved by: https://github.com/desertfire
2025-03-25 17:58:40 +00:00
8d1db7f39d [MPS][BE] Add c10/metal/common.h (#149955)
This header can be shared between host and Metal code.
So far it contains only one constant: the maximum number of tensor dimensions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149955
Approved by: https://github.com/Skylion007, https://github.com/manuelcandales
2025-03-25 17:37:24 +00:00
cc6e300fe2 [ONNX] Clean up the diagnostics module (#149864)
Remove the diagnostics/SARIF module from the ONNX exporter because it is obsolete and unused.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149864
Approved by: https://github.com/titaiwangms
2025-03-25 16:58:46 +00:00
84ae056d82 [invoke_subgraph] Support pending unbacked symint (#149297)
The "PendingUnbackedSymbolNotFound" error is when an unbacked symbol is created within a piece of code, but this symbol never appears in any of the outputs. I believe the original intention is to help catch incorrectly written meta kernels, where users might've unintentionally created an unbacked symbol but never used it anywhere, but in our case this is intentional. An example is the following test case:

```python
    def test_pending_unbacked(self):
        class M(torch.nn.Module):
            @mark_compile_region
            def gn(self, x):
                u = x[0].item()
                return x * u

            def forward(self, x):
                for _ in range(4):
                    x = self.gn(x)
                return x

        torch._dynamo.config.capture_scalar_outputs = True
        torch.compile(M())(torch.randn(8))
```

This fails with the error:
```
torch._dynamo.exc.InternalTorchDynamoError: PendingUnbackedSymbolNotFound: Pending unbacked symbols {zuf1} not in returned outputs (FakeTensor(..., size=(8,)),) .
```

In this case, creating the unbacked symbol is intentional, so we can bypass this using `fake_mode.shape_env.ignore_fresh_unbacked_symbols()`.
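
A hedged sketch of that bypass; `fake_mode` is assumed to be the active FakeTensorMode from the tracing context (internal API, subject to change):

```python
def meta_impl(x, fake_mode):
    with fake_mode.shape_env.ignore_fresh_unbacked_symbols():
        u = x[0].item()  # unbacked symbol created on purpose; it never appears in the outputs
        return x * u
```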

Differential Revision: [D71298926](https://our.internmc.facebook.com/intern/diff/D71298926)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149297
Approved by: https://github.com/zou3519
ghstack dependencies: #149296
2025-03-25 16:42:58 +00:00
8be1bf1dbb [export] Add mark_compiled_region support (#149296)
Differential Revision: [D71298930](https://our.internmc.facebook.com/intern/diff/D71298930)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149296
Approved by: https://github.com/zou3519
2025-03-25 16:42:58 +00:00
5c19952c83 cd: Restore windows release builds for libtorch (#149863)
These were accidentally deleted in the refactor of DEVTOOLSET +
cxx11abi.

This happened because the `build_environment` variable wasn't aware of the `build_variant` for libtorch and subsequently overwrote the original file twice, leaving the last written as the actual workflow (which in this case was the debug builds).

One thing this has made me curious about is whether we actually need `debug` builds for Windows at all. We don't release them for Linux, and I'd bet that they have low download numbers anyway, so maybe it makes sense to cut them.

Adds a build_variant parameter to the dataclass so that we can extend
these easily in the future if we want.

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149863
Approved by: https://github.com/malfet, https://github.com/atalman
2025-03-25 16:23:59 +00:00
f0ca0d45a6 [CI] Add MacOS-M2-15 as MPS test target on trunk (#149900)
Now that we have runners allocated by AWS

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149900
Approved by: https://github.com/ZainRizvi, https://github.com/seemethere
2025-03-25 16:19:35 +00:00
2cc3f5030a Add XPU and SYCL Merge Patterns (#149933)
As the title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149933
Approved by: https://github.com/atalman
2025-03-25 16:03:29 +00:00
43ee67e8dc Removing doc references to PRE_CXX11_ABI. (#149756)
Fixes #149550

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149756
Approved by: https://github.com/svekars, https://github.com/atalman
2025-03-25 16:01:59 +00:00
5dca832257 Add smoke test to validate pypi env version vs torch complied and installed versions of nccl and cudnn (#149885)
Followup after nccl update to validate both cudnn and nccl versions in nightly and release pipelines.

Tested on local dev machine, output.
Success:
```
Found matching cudnn. Torch: 9.5.1 PyPI 9.5.1.17
Found matching nccl. Torch: 2.25.1 PyPI 2.25.1
```

Failure:
```
Traceback (most recent call last):
  File "test1.py", line 29, in <module>
    compare_pypi_to_torch_versions("nccl", find_pypi_package_version("nvidia-nccl"), torch_nccl_version)
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ec2-user/test1.py", line 24, in compare_pypi_to_torch_versions
    raise RuntimeError(
        f"Wrong {package} version. Torch: {torch_version} PyPI: {pypi_version}"
    )
RuntimeError: Wrong nccl version. Torch: 2.25.1 PyPI: 2.26.2
```
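
A hedged reconstruction of the comparison helper visible in the traceback; only the error message comes from the log, and the prefix-match rule is an assumption:

```python
def compare_pypi_to_torch_versions(package, pypi_version, torch_version):
    if not pypi_version.startswith(torch_version):
        raise RuntimeError(
            f"Wrong {package} version. Torch: {torch_version} PyPI: {pypi_version}"
        )
    print(f"Found matching {package}. Torch: {torch_version} PyPI {pypi_version}")
```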
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149885
Approved by: https://github.com/malfet, https://github.com/ZainRizvi, https://github.com/d4l3k
2025-03-25 15:57:53 +00:00
d90d83c484 [torch] Fix unsafe concurrent access to autocast_enabled (#148281)
Summary: Making autocast_enabled atomic, as it can be accessed from multiple threads

Differential Revision: D70456813

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148281
Approved by: https://github.com/davidberard98
2025-03-25 14:46:12 +00:00
a2bba53f87 Improve error message when view of intermediate is returned from autograd.Function and marked dirty (#149543)
Fixes https://github.com/pytorch/pytorch/issues/149252
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149543
Approved by: https://github.com/zou3519
ghstack dependencies: #149220
2025-03-25 14:44:11 +00:00
7b218ca874 Revert "[BE] Replace XPU support packages installation to offline mode in Linux CI/CD (#149843)"
This reverts commit 86dcdf9c8bb8f69c5d28184b31ee6d7f19127d67.

Reverted https://github.com/pytorch/pytorch/pull/149843 on behalf of https://github.com/malfet due to This breaks XPU builds, see 23183fef7e/1 ([comment](https://github.com/pytorch/pytorch/pull/149843#issuecomment-2751482412))
2025-03-25 14:39:10 +00:00
29b3f409c2 [BE][CI] Update actionlint to 1.7.7 (#149919)
- fix anti-pattern started by https://github.com/pytorch/pytorch/pull/81922 when x86 actionlint binaries were placed in Linux-arm64 folder
- Fix renaming lint violations, namely
```
>>> Lint for .github/workflows/_linux-test.yml:

  Error (ACTIONLINT) [expression]
    property "workspace" is not defined in object type {arch: string; debug:
    string; environment: string; name: string; os: string; temp: string;
    tool_cache: string}

        446  |        if: failure() && steps.install-nvidia-driver.outcome && steps.install-nvidia-driver.outcome != 'skipped'
        447  |        shell: bash
        448  |        env:
    >>> 449  |          RUNNER_WORKSPACE: ${{ runner.workspace }}
        450  |        run: |
        451  |          set +e
        452  |          set -x

>>> Lint for .github/workflows/create_release.yml:

  Error (ACTIONLINT) [deprecated-commands]
    workflow command "set-output" was deprecated. use `echo "{name}={value}"
    >> $GITHUB_OUTPUT` instead: https://docs.github.com/en/actions/using-
    workflows/workflow-commands-for-github-actions

         80  |          path: ${{ env.PT_RELEASE_FILE }}
         81  |      - name: Set output
         82  |        id: release_name
    >>>  83  |        run: echo "::set-output name=pt_release_name::${{ env.PT_RELEASE_NAME }}.tar.gz"
         84  |
         85  |  upload_source_code_to_s3:
         86  |    if: ${{ github.repository == 'pytorch/pytorch' && github.event_name == 'push' && startsWith(github.ref, 'refs/tags/v') && contains(github.ref, 'rc') }}

>>> Lint for .github/workflows/target-determination-indexer.yml:

  Error (ACTIONLINT) [shellcheck]
    shellcheck reported issue in this script: SC2086:info:3:3: Double quote to
    prevent globbing and word splitting

         98  |          DOCKER_IMAGE: ${{ steps.calculate-docker-image.outputs.docker-image }}
         99  |          GITHUB_RUN_ID: ${{ github.run_id }}
        100  |          AWS_DEFAULT_REGION: us-east-1
    >>> 101  |        run: |
        102  |          # detached container should get cleaned up by teardown_ec2_linux
        103  |          container_name=$(docker run \
        104  |            ${GPU_FLAG:-} \
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149919
Approved by: https://github.com/jeanschmidt, https://github.com/atalman, https://github.com/Skylion007
ghstack dependencies: #149917, #149918, #149922
2025-03-25 14:37:10 +00:00
6c7f9f7e7d [CI][BE] Update other actions (#149922)
Discovered by actionlint-1.7.7:
- `actions/checkout@v3`->`actions/checkout@v4`
- `actions/setup-python@v4` -> `actions/setup-python@v5`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149922
Approved by: https://github.com/Skylion007
ghstack dependencies: #149917, #149918
2025-03-25 14:37:10 +00:00
535885dc8d [BE][CI] Update configure-aws-credential to v4 (#149918)
Prerequisite for update to actionlint-1.7.7
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149918
Approved by: https://github.com/Skylion007
ghstack dependencies: #149917
2025-03-25 14:37:02 +00:00
f63b03e9fc [BE] Add Mac ARM64 actionlint binary (#149917)
Downloaded from https://github.com/rhysd/actionlint/releases/tag/v1.6.21
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149917
Approved by: https://github.com/Skylion007
2025-03-25 14:36:54 +00:00
23183fef7e [Test] Add simple MPS op benchmarks (#149914)
Lots of benchmark results have been posted in PRs, but they might get lost over time.
So let's create a benchmark and populate it with results (preferably from a run on a CI machine).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149914
Approved by: https://github.com/dcci, https://github.com/cyyever
2025-03-25 11:31:27 +00:00
86dcdf9c8b [BE] Replace XPU support packages installation to offline mode in Linux CI/CD (#149843)
To ensure the build environment is stable
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149843
Approved by: https://github.com/EikanWang
2025-03-25 09:11:35 +00:00
86fbbe44cc Improve error message for CUDAGuardImpl, MPSGuardImpl, XPUGuardImpl (#149838)
Fixes #149822

Will get:

```
RuntimeError: t == DeviceType::CUDA INTERNAL ASSERT FAILED at "/home/jyh/workspace/pytorch/c10/cuda/impl/CUDAGuardImpl.h":28, please report a bug to PyTorch. CUDAGuardImpl initialized with non-CUDA DeviceType: cpu
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149838
Approved by: https://github.com/Skylion007, https://github.com/guangyey
2025-03-25 07:29:53 +00:00
a89bdc0565 [Hierarchical Compilation] Handle origin nodes without children (#149685)
Bug discovered running Hierarchical Compilation on HF.

I don't have a smaller repro for this unfortunately.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149685
Approved by: https://github.com/williamwen42, https://github.com/anijain2305
2025-03-25 07:27:11 +00:00
5a7588f183 [Build] Remove pre-CXX11 ABI logic from build script (#149888)
Only keep one in check_binary_symbols to make sure there are no pre-CXX11 ABI symbols in the library
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149888
Approved by: https://github.com/atalman, https://github.com/seemethere
ghstack dependencies: #149887
2025-03-25 03:17:16 +00:00
280e48739a [ONNX] Set is_in_onnx_export for dynamo=True (#149678)
Fixes #149141

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149678
Approved by: https://github.com/justinchuby
2025-03-25 03:16:23 +00:00
27657a00d9 Demote logger of runtime_asserts_frozen to be fired only on debug mode (#149832)
Differential Revision: [D71702305](https://our.internmc.facebook.com/intern/diff/D71702305)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149832
Approved by: https://github.com/malfet
2025-03-25 02:29:13 +00:00
FEI
59d5cf083b update torch.nn.ReplicationPad{1,2,3}d deterministic documentation (#148633)
https://github.com/pytorch/pytorch/issues/115395
That issue mentioned that when deterministic mode is turned on, a decomp for replication_pad_{1,2,3}d is added to make the backward function deterministic.
@malfet
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148633
Approved by: https://github.com/isuruf
2025-03-25 02:01:31 +00:00
d4c578082a [DCP] Cache save plan metadata to reduce the collective overhead (#149785)
Summary:
Cache save plan metadata to reduce the collective overhead.

Global plan dedupe and metadata creation are the main overheads on rank 0. This change avoids that cost on subsequent saves if the plans do not change. In a quick experiment with a 256-rank job, the global-step overhead drops by ~99%, from 90s+ to a mere 1.5s; the remaining 1.5s was mostly spent creating the checkpoint module directories and on a near-empty collective.

Differential Revision: D71631441

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149785
Approved by: https://github.com/MeetVadakkanchery
2025-03-25 02:00:15 +00:00
dc39e673e2 Remove aten.elu core ATen decomp because it is now core ATen (#149780)
Per @larryliu0820.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149780
Approved by: https://github.com/larryliu0820
2025-03-25 01:59:57 +00:00
84684e9397 [sigmoid] Fix scalar resolution for Scalar_mode aten ops. (#149755)
Summary: For Scalar variant resolution, we didn't handle a corner case of "Tensor_mode" variant (from aten::div). Adding the missing case to the graph pass.

Test Plan: buck test mode/opt caffe2/test:test_export -- -r test_operator_aten_tensor_mode_variant_cpp_runtime

Differential Revision: D71638433

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149755
Approved by: https://github.com/yushangdi
2025-03-25 01:17:36 +00:00
159e97cbcf ProcessGroupGloo: support reduce_scatter + update support chart (#149869)
This adds a `reduce_scatter` implementation for ProcessGroupGloo. It is a pretty naive implementation, as it does one allreduce per rank, but it may be useful for testing in FSDP etc. There was an existing implementation of reduce_scatter_tensor/reduce_scatter_tensor_coalesced that is very similar but requires a fixed tensor size per rank.

If users find these functions to be too slow we can address them as issues arise.
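
A hedged usage sketch (assumes the process group is launched with torchrun so the env:// variables are set):

```python
import torch
import torch.distributed as dist

dist.init_process_group("gloo")
world_size = dist.get_world_size()
output = torch.zeros(4)
inputs = [torch.full((4,), float(r)) for r in range(world_size)]
dist.reduce_scatter(output, inputs)  # naive: one allreduce per rank, as noted above
```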

Gloo now supports all major distributed operations. Quite a few of these were added by @rohan-varma and @yifuwang but they didn't update the support chart. We also have `CUDAWork` variants of most operations so those were also added to the chart.

Test plan:

```
pytest -v test/distributed/test_c10d_gloo.py -k reduce_scatter
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149869
Approved by: https://github.com/fduwjj
2025-03-25 01:16:12 +00:00
5af9cb12b7 [ROCm] Extend vectorized elementwise kernel to more heterogenous tensor types. (#149738)
This patch extends the initial support for "vectorized templated" kernels to the following input tensor type pairs:
- (BFloat16, float)
- (float, float16)
- (float16, float)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149738
Approved by: https://github.com/jeffdaily
2025-03-25 01:10:01 +00:00
2a9e737839 [caffe2] Do not use --no-as-needed on macOS (#149421)
Summary:
`--no-as-needed` is not available in ld64.lld

Applying this on all macos is potentially too broad? I am not sure if `fbcode//mode/mac` uses a different linker, but arvr mode for sure uses ld64.lld.

Test Plan: CI / used for a macOS build on top of the stack.

Differential Revision: D71315125

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149421
Approved by: https://github.com/colesbury
2025-03-25 00:41:09 +00:00
1cee6c37cc add bobren and laithsakka as ds owners (#149873)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149873
Approved by: https://github.com/laithsakka
2025-03-25 00:14:04 +00:00
23855391f1 Add regression tests for 3 missing PR-time benchmarks (#149423)
Uses values from the latest PR-time benchmark run on viable/strict. See https://github.com/pytorch/pytorch/actions/runs/13898520615/job/38900894469 for a job showing why this is needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149423
Approved by: https://github.com/laithsakka
2025-03-24 23:39:36 +00:00
ba46643df1 [MPS] tril op not handling infs correctly (#149866)
Fixes #149813

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149866
Approved by: https://github.com/malfet
2025-03-24 23:38:41 +00:00
51f91e3428 [CD] Check that nightly x86 binaries are build with gcc-11 (#149887)
Though they should have been with gcc-14, per https://github.com/pypa/manylinux?tab=readme-ov-file#manylinux_2_28-almalinux-8-based
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149887
Approved by: https://github.com/atalman, https://github.com/seemethere
2025-03-24 23:22:19 +00:00
f320c7b766 Rename README.txt to README.md (#149811)
I am 99% sure this is meant to be a .md file rather than a .txt file

Fixes an issue with viewing the README on github, idk what else this accomplishes but it's been bothering me

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149811
Approved by: https://github.com/colesbury
2025-03-24 22:33:33 +00:00
490ce7e67c [sigmoid] Support _operator.neg/truediv (#149754)
Summary: adding operator.truediv and operator.neg support to the runtime

Test Plan: buck run mode/opt caffe2/test:test_export -- -r test_sym_float_operators_cpp_runtime_nonstrict

Differential Revision: D71637267

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149754
Approved by: https://github.com/pianpwk
2025-03-24 22:15:25 +00:00
e77ca19999 [Inductor-CPU] Fix int8 WoQ AMX micro-kernel when block_n is 16 or 48 (#149359)
### Summary

When the block-size for `N` dimension is `48` for the AMX GEMM micro-kernel for int8 WoQ (BF16 activation, int8 statically quantized weights), the logic for handling the tail is incorrect - we can't always dequantize 32 elements of weights at a time because we may need to dequantize `32` followed by `16` when `block_n` is `48` (for each `K`).

This PR fixes that logic, which was initially exposed with `M=17, N=1024, K=1024`.
This PR also fixes the case of `block_n` being 16.

I had introduced [this bug](ca9813ea14) after misreading GEMM blockings as `["block_m", "block_k", "block_n"]` instead of `["block_m", "block_n", "block_k"]` (so I had wrongly assumed that `block_n` was always 32).

### Future work

While this PR simply fixes a bug, it's possible to optimize the code pertaining to dequantizing & caching the B buffer - for `block_n` being `16` or `48`, `K` would always be a multiple of 2, so `K * block_n` will always be a multiple of 32. Since `dequantized_B_buf` stores rows contiguously, when `block_n` would be `16` or `48`, we could store 32 BF16 elements at a time instead of storing `16` at a time (when `block_n` is 16), or `32` followed by `16` at a time (when `block_n` is 48). Such an optimization would lower `register -> memory` data movements.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149359
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
2025-03-24 21:27:46 +00:00
49f86a939c [AOTAutogradCache] Allow Custom Autograd functions behind a flag (#149751)
This adds a new env var and flag,

autograd_cache_allow_custom_autograd_functions (env var: `TORCHINDUCTOR_AUTOGRAD_CACHE_ALLOW_CUSTOM_AUTOGRAD`), which allows custom autograd functions into AOTAutogradCache.
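
A minimal sketch of turning it on via the env var named above (such env vars are typically read when the config module is imported, so set it before importing torch):

```python
import os

os.environ["TORCHINDUCTOR_AUTOGRAD_CACHE_ALLOW_CUSTOM_AUTOGRAD"] = "1"

import torch  # noqa: E402
```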

@hirsheybar and I worked together to verify that the higher order op AutogradFunctionApply is pure with respect to the dynamo input being passed in, so this *should* be safe. I'm still putting it behind a flag and turning it on slowly, first on an internal model, though. Once we verify that it is correct on the internal model we can work to enable the flag by default.

Differential Revision: [D71633184](https://our.internmc.facebook.com/intern/diff/D71633184/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149751
Approved by: https://github.com/bdhirsh, https://github.com/zou3519
2025-03-24 21:12:11 +00:00
ae6158500a [dynamo] fix calling torch function on newly constructed tensor subclass (#149481)
This patch updates existing `test_return_..._subclass` tests in
`test/dynamo/test_subclasses.py`, so that they end up invoking the
`__torch_function__` method of the newly constructed tensor subclass
instances.

This exposes a bug in `TensorVariable.method_as_subclass`, where it
forgot to grab the `__func__` out of `__torch_function__`, which led to
an error down the line.

This patch fixes `TensorVariable.method_as_subclass` by centralizing how
we extract and wrap torch function, in `build_torch_function_fn`.
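
An illustrative subclass of the kind these tests exercise (not the actual test code):

```python
import torch

class MySubclass(torch.Tensor):
    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        return super().__torch_function__(func, types, args, kwargs)

@torch.compile(backend="eager")
def f(x):
    y = x.as_subclass(MySubclass)  # newly constructed subclass instance
    return y.sin()                 # should route through MySubclass.__torch_function__

f(torch.randn(4))
```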

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149481
Approved by: https://github.com/jansel
2025-03-24 21:07:41 +00:00
f12969421e [DYNAMO] [BUG FIX] correct casting to boolean for TORCH_COMPILE_DISABLE (#149852)
Fixes #149840

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149852
Approved by: https://github.com/jingsh
2025-03-24 20:50:44 +00:00
b248edd7cc ProcessGroupGloo: support ReduceOp::AVG (#149781)
This adds AVG support to ProcessGroupGloo to better support FSDP on CPU. I expect there will be more issues but this is easy enough to support in a naive fashion.

This applies to both reduce and allreduce.

This is a simple SUM + division and may not be the most numerically stable but that's expected. FSDP for low precision data types implements pre/post divide and uses SUM instead.
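
A hedged usage sketch of AVG on the gloo backend, per this PR (again assuming a torchrun-style launch):

```python
import torch
import torch.distributed as dist

dist.init_process_group("gloo")
t = torch.ones(4) * dist.get_rank()
dist.all_reduce(t, op=dist.ReduceOp.AVG)  # implemented as SUM followed by division by world_size
```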

Test plan:

```
pytest -v test/distributed/test_c10d_gloo.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149781
Approved by: https://github.com/fduwjj
2025-03-24 20:29:30 +00:00
40ec9d2bfa avoid allocation when tensor_new from storage (#149797)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149797
Approved by: https://github.com/Skylion007
2025-03-24 20:02:45 +00:00
112f983056 [MPS] Replace indexed with strided flavor (#149730)
This renders non-contiguous operations much faster for larger tensors; for example, `fmax` of 1000x1000 strided tensors takes 270ms with the new algorithm versus 430ms with the old one, which needed an additional tensor of 3e6 elements to function.

TODO: Add 64-bit indexing logic, as current implementation has the same limitation as `generateKernelDataOffsets`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149730
Approved by: https://github.com/dcci, https://github.com/manuelcandales
2025-03-24 19:37:51 +00:00
9179178728 [MPS] Add support for chebyshev_polynomial_t in eager. (#149816)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149816
Approved by: https://github.com/malfet
2025-03-24 19:19:55 +00:00
1e5a561c13 [ca] fix accumulate grad polyfill when different strides between param and grad (#149651)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149651
Approved by: https://github.com/jansel
ghstack dependencies: #149647, #149709
2025-03-24 19:06:45 +00:00
754875e237 [ca] API comments and support dynamic shapes via configs (#149709)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149709
Approved by: https://github.com/jansel
ghstack dependencies: #149647
2025-03-24 19:06:45 +00:00
86ee3bf3d5 [ca] use torch.compile ca API for benchmarks (#149647)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149647
Approved by: https://github.com/jansel
2025-03-24 19:06:45 +00:00
71145059c8 Allow rebuild of triton on workflow_dispatch (#149865)
Allows rebuilding Triton from main.
The latest Triton build failed: https://github.com/pytorch/pytorch/actions/runs/13984299781/job/39298288914
The PR that caused it was reverted: https://github.com/pytorch/pytorch/pull/148419
We need to rebuild Triton now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149865
Approved by: https://github.com/seemethere, https://github.com/malfet
2025-03-24 18:17:47 +00:00
bada898f5e Revert "Extend vec backend with BF16 SVE intrinsics (#143666)"
This reverts commit d072254eaea325a507c1498431e4c8294205fe2d.

Reverted https://github.com/pytorch/pytorch/pull/143666 on behalf of https://github.com/malfet due to I'm unsure why this PR got merged, as it doesn't have a valid review ([comment](https://github.com/pytorch/pytorch/pull/143666#issuecomment-2749013169))
2025-03-24 18:13:50 +00:00
5beb5b7e47 [torch/c10d] change class variable from private to protected (#149579) (#149645)
Summary:

Change class variable from private to protected in ProcessGroupNCCL

Test Plan: Existing UT Pass.

Reviewed By: kingchc, kwen2501

Differential Revision: D71373067

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149645
Approved by: https://github.com/kwen2501
2025-03-24 17:58:54 +00:00
d0c06c4533 [ROCm] Update libamd_comgr.so file in triton wheel build (#149855)
When building Triton in the Triton-ROCm wheel build flow, ROCm 6.4 and newer no longer have **libamd_comgr.so.2**, as the .so file has been updated to **libamd_comgr.so.3**. We conditionalize on which ROCm version the wheel build is for and choose the .so accordingly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149855
Approved by: https://github.com/Skylion007, https://github.com/jeffdaily
2025-03-24 17:51:14 +00:00
60f31f551e Only print dde partial fx graph for export (#149831)
Lazos correctly pointed out this doesn't make sense for compile, since we graph break in compile, and it results in tons of unwanted user log spew. We do want this in export though, since it has drastically reduced the support load for DDEs. This PR does the refactor to keep it in export but remove it from compile.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149831
Approved by: https://github.com/mlazos
2025-03-24 17:46:18 +00:00
42e7bda53e Revert "[export] Save unflattened gm (#149717)"
This reverts commit 1e159db57c611b98a531341927b2d01f39383f7a.

Reverted https://github.com/pytorch/pytorch/pull/149717 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/149717#issuecomment-2748924563))
2025-03-24 17:41:01 +00:00
6608d4e3e9 [dynamo] keep chained exceptions in user-facing tracebacks (#149676)
This preserves graph breaks in the case that one graph break directly causes another, e.g. graph breaks in generic context managers.

```python
import torch

class CtxMgr:
    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        pass

@torch.compile(backend="eager", fullgraph=True)
def fn():
    with CtxMgr():
        with CtxMgr():
            pass
        with CtxMgr():
            with CtxMgr():
                pass
            torch._dynamo.graph_break()

fn()
```

Output:
```
torch._dynamo.exc.Unsupported: Call to `torch._dynamo.graph_break()`
  Explanation: User-inserted graph break. Message: None
  Hint: Remove the `torch._dynamo.graph_break()` call.

  Developer debug context: Called `torch._dynamo.graph_break()` with args `[]`, kwargs `{}`

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/data/users/williamwen/pytorch/playground.py", line 23, in <module>
    fn()
  File "/data/users/williamwen/pytorch/torch/_dynamo/eval_frame.py", line 664, in _fn
    raise e.with_traceback(None) from e.__cause__
torch._dynamo.exc.Unsupported: Graph break under GenericContextWrappingVariable
  Explanation: Attempted to graph break in an active context manager(s) that doesn't support graph breaking.
  Hint: Move the offending context manager(s) to outside the compiled region.
  Hint: This graph break may have been caused by an earlier graph break. Resolving the earlier graph break may resolve this one.

  Developer debug context: Active generic context managers: [GenericContextWrappingVariable(CtxMgr), GenericContextWrappingVariable(CtxMgr)]

from user code:
   File "/data/users/williamwen/pytorch/playground.py", line 20, in fn
    torch._dynamo.graph_break()

Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"
```

Note in particular that both graph breaks (torch._dynamo.graph_break and graph break in context manager) are present in the logs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149676
Approved by: https://github.com/jansel, https://github.com/zou3519, https://github.com/anijain2305
2025-03-24 17:36:13 +00:00
1e159db57c [export] Save unflattened gm (#149717)
Test Plan: CI

Differential Revision: D71082652

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149717
Approved by: https://github.com/pianpwk
2025-03-24 17:25:25 +00:00
0a0a73a9a9 [cond] don't trace fw and bw graph in autograd key (#148930)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148930
Approved by: https://github.com/zou3519
2025-03-24 17:07:29 +00:00
9bae904cb4 [inductor] fix combo_kernel logging #2 (#149772)
Summary:
fix another combo kernel logging error:

  File "/home/guorachel/local/fbsource/buck-out/v2/gen/fbcode/4bcbfa3ef39dbd6f/caffe2/test/inductor/__combo_kernels__/combo_kernels#link-tree/torch/_inductor/scheduler.py", line 2036, in _init
    self.create_combo_kernel_nodes(num_ck_nodes=None)
  File "/home/guorachel/local/fbsource/buck-out/v2/gen/fbcode/4bcbfa3ef39dbd6f/caffe2/test/inductor/__combo_kernels__/combo_kernels#link-tree/torch/_inductor/scheduler.py", line 3068, in create_combo_kernel_nodes
    log.debug("ComboKernels: Generating with num_ck_nodes = %d...", num_ck_nodes)
Message: 'ComboKernels: Generating with num_ck_nodes = %d...'
Arguments: (None,)

Test Plan:
Verified in test_combo_kernel.py

the logging error went away.

Differential Revision: D71655949

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149772
Approved by: https://github.com/ColinPeppler, https://github.com/Skylion007
2025-03-24 16:57:45 +00:00
453da423d4 Revert "ci: Add sccache to manylinux images (#148419)"
This reverts commit 1099c371505a6a3e3cab69e5afca1e747f2215a4.

Reverted https://github.com/pytorch/pytorch/pull/148419 on behalf of https://github.com/atalman due to Breaks triton build ([comment](https://github.com/pytorch/pytorch/pull/148419#issuecomment-2748759515))
2025-03-24 16:43:26 +00:00
a439524be6 [inductor] Add the largest matmul tile size to default tuning set (#149790)
While we probably don't want to expand the set of default matmul tunings too much, this is the largest tile size usable by H100 and A100, and is usually the top performing tile size for large matmuls.  E.g. on H100 adding this tile size improves perf of multiplying 8192-square matrices from 600->700 tflops.  (cuBLAS 12.6 gets 780, so Triton still isn't SOTA, but closer)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149790
Approved by: https://github.com/jansel
2025-03-24 16:32:53 +00:00
db92d0f388 A bunch of typos (#149404)
Improves readability.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149404
Approved by: https://github.com/soulitzer
2025-03-24 16:16:04 +00:00
ddc0fe903f ci/docker: use NCCL 2.26.2-1 (#149778)
Related to #149153

This updates some build scripts to hopefully fix the nightly builds which are somehow building against nccl 2.25.1 and using 2.26.2 from pip.

Test plan:

After merging rerun nightly linux jobs and validate that nccl version matches
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149778
Approved by: https://github.com/Skylion007, https://github.com/atalman

Co-authored-by: Andrey Talman <atalman@fb.com>
2025-03-24 16:14:54 +00:00
0a60a0cad4 Let pointwise sharding take arg with largest number of dims in case of ties (#149721)
Before, we would take the first argument with the largest number of shards, regardless of whether it had fewer dims than another arg with the same number of shards but more dimensions. This could lead to fewer sharding options.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149721
Approved by: https://github.com/tianyu-l
2025-03-24 15:39:39 +00:00
2c13a07002 [CI] Fix xpu linux test permission issue and add ci docker image pull (#149053)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149053
Approved by: https://github.com/atalman
2025-03-24 15:19:24 +00:00
db9b031b00 Add default XPU toolkit path to CMake (#149270)
# Motivation
Add default XPU runtime path to CMake to mitigate https://github.com/pytorch/pytorch/issues/149075
This ensures proper linking with `libtorch` when a user does not source the Torch XPU toolkit while working on a C++ library or executable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149270
Approved by: https://github.com/dvrogozh, https://github.com/EikanWang, https://github.com/atalman
2025-03-24 14:41:24 +00:00
66b0a0b61a [inductor] support dilation in max_pool2d lowering (#148209)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148209
Approved by: https://github.com/eellison
2025-03-24 13:00:12 +00:00
dfdc28ea67 Update slow tests (#149844)
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml).
Update the list of slow tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149844
Approved by: https://github.com/pytorchbot
2025-03-24 12:12:56 +00:00
248487f455 [MPS] nanmedian with dims (#149680)
Third most voted op from #77764

Tests were deleted because they are covered by the regular test_output_match tests, so they were redundant; they had been added in the last PR, before the nanmedian dim version was implemented.
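
A short usage sketch of the dim variant (assumes a machine with an MPS device):

```python
import torch

x = torch.tensor([[1.0, float("nan"), 3.0],
                  [4.0, 5.0, 6.0]], device="mps")
values, indices = torch.nanmedian(x, dim=1)  # NaNs are ignored per row
```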

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149680
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-03-24 03:49:16 +00:00
d5ce5c9509 Reuse format_size utils (#149383)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149383
Approved by: https://github.com/malfet
2025-03-24 03:06:27 +00:00
de3aca3311 [StaticCudaLauncher] Support any number of kernel arguments (#149442)
Fixes #149450

This PR adds fallback support on StaticCudaLauncher for any number of kernel arguments. Above MAX_ARGS, we can do a heap allocation/malloc instead.

For 0 arguments, triton technically invokes undefined behavior by allocating a 0-byte array and passing it to cuLaunchKernel. In reality, cuLaunchKernel never accesses the pointer if the signature of the cubin has no parameters, so we can just pass nullptr directly.

We could technically use `alloca` to stack allocate instead of heap allocate, though in my tests it didn't seem to affect runtime performance on benchmarks particularly impressively, and alloca has portability issues, so I'd rather just stick with something simpler for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149442
Approved by: https://github.com/jansel
2025-03-23 22:43:47 +00:00
2dccd70ef0 [ONNX] Clean up legacy dynamo export code (#149745)
Clean up code that is unused and obsolete. The public `torch.onnx.dynamo_export` is kept for now but the legacy implementation is removed.

Remove public option classes and OnnxRegistry that have been deprecated.

Users: use torch.onnx.export(…, dynamo=True).
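
A migration sketch along those lines; `model` and `example_inputs` are placeholders:

```python
import torch

onnx_program = torch.onnx.export(model, example_inputs, dynamo=True)
onnx_program.save("model.onnx")
```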
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149745
Approved by: https://github.com/titaiwangms, https://github.com/cyyever
2025-03-23 19:35:16 +00:00
8bece88655 [BE] Eliminate TODO for 2022 (#149557)
Need to think a bit more about what types.h includes

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149557
Approved by: https://github.com/albanD
2025-03-23 05:35:54 +00:00
c201d4dbea elif is not a cmake keyword (#149655)
The test for pocketfft_header not being in its expected place was wrong.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149655
Approved by: https://github.com/Skylion007
2025-03-23 03:28:53 +00:00
85027ef74a Super tiny fix typo (#149109)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149109
Approved by: https://github.com/malfet
2025-03-23 03:02:53 +00:00
fe954cdcbf Use correct boxed_forward_device_index when running CompiledFxGraph.post_compile (#148130)
This PR threads through the correct boxed_forward_device_index from graph_kwargs to CompiledFXGraph.post_compile. This allows us to correctly update BoxedDeviceIndex from cache hits.

We don't actually need to save `boxed_forward_device_index` in CompiledFXGraph because its value is in the cache key, so it always matches to the ambient one anyway. On forward with cudagraphs enabled, derive `boxed_forward_device_index`'s value from `device_idxs`.

Testing:

```
python benchmarks/dynamo/cachebench.py --mode training --benchmark torchbench --model BERT_pytorch --device cuda --repeat 1 --dynamic --output="dynamic.json"
```

Now cache hits properly on FXGraphCache. AOTAutogradCache has a guard failure. Will look into that as a followup.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148130
Approved by: https://github.com/eellison
2025-03-23 02:57:58 +00:00
539db4af4b load_inline no_implicit_headers mode (#149480)
In the kernelBot leaderboard we support people competing with custom CUDA extensions via `load_inline()`; however, even on toy kernels this can result in cold starts of up to 90s. This feature is primarily responsible for us having to double our timeout values.

I performed an investigation here https://github.com/msaroufim/load_inline_slow and the primary cause was that torch/extension.h and torch/types.h add in about 5,000 header files https://github.com/msaroufim/load_inline_slow/blob/main/header-analysis

So we introduce a mode `no_implicit_headers` which forces users to be explicit about exactly what they want to add. There's a proper test meant to be used from a CLI, and a pytest test that's not terribly helpful.
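As a rough illustration of the intended workflow, a hedged sketch: `no_implicit_headers` is the mode named above, but its exact exposure as a `load_inline()` keyword argument (and the header choice in the source) is an assumption here, not the confirmed API.

```python
from torch.utils.cpp_extension import load_inline

# With no_implicit_headers, load_inline would no longer prepend <torch/extension.h>,
# so the source decides exactly which headers (and how much compile time) it pays for.
cpp_src = """
#include <torch/extension.h>  // explicit; swap for a slimmer header to cut cold start
torch::Tensor add_one(torch::Tensor x) { return x + 1; }
"""

ext = load_inline(
    name="add_one_ext",
    cpp_sources=cpp_src,
    functions=["add_one"],
    no_implicit_headers=True,  # assumed keyword spelling for the new mode
)
```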

Then there's still an open question around what's the most minimal example implementation we can provide. For the baseline kernel we're showing here, it takes about 1 min to compile
1. There's using TensorBase.h (finicky to get right but can get compilation times down to 7s)
2. Just using Tensor.h (down to 15s)
3. Using Shim.h (did not try yet since the syntax is verbose relative to cuda)

This is my take so far https://gist.github.com/msaroufim/079a8d08ffebd0f91a1c2247eb0ce9e0 for a minimal implementation at 15s but @malfet has a simpler one at only 5s

There's more things I'd like to try moving forward like nvrtc and fancier compilation flags. Typical advice around using precompiled headers does not apply to us because we are mostly interested in cold starts where we tear down the machine after running a kernel

Also, in a future PR I'd like to fix issues I've noticed with load_inline:
1. It needs a force recompilation mode, I was using this quite a bit myself
2. The cache does not take into account changes in environment so the best way to force a recompilation is to change some string in the file
3. Instead of relying on pybind, can we use TORCH_LIBRARY instead

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149480
Approved by: https://github.com/malfet
2025-03-22 19:21:29 +00:00
cyy
9367f8f6f1 Remove outdated instructions from CI scripts (#149795)
Some instructions about Python 3.8 and CUDA 11.3 are removed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149795
Approved by: https://github.com/malfet
2025-03-22 18:37:07 +00:00
2b848ab192 [MPS/inductor] Add support for modified_scaled_bessel_k{0,1} (#149794)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149794
Approved by: https://github.com/malfet
2025-03-22 15:41:40 +00:00
6bbe8dbd63 [dynamo][hooks] config to wrap the top frame in a wrapper (#149758)
This should be done by default but there are too many issues. This PR is a
workaround.

https://github.com/pytorch/pytorch/issues/117584

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149758
Approved by: https://github.com/yf225
ghstack dependencies: #149712
2025-03-22 07:17:01 +00:00
621c801f78 fix dynamic float when dynamic=True (#149564)
Fixes https://github.com/pytorch/pytorch/issues/149406#issuecomment-2738111733. Basically, previously we would only make floats dynamic via automatic dynamic; now, if you set dynamic=True, we will make the floats dynamic on the first compile.
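A small sketch of what "floats dynamic on the first compile" means in user code (the function and values are illustrative):

```python
import torch

@torch.compile(dynamic=True)
def scale(x, factor: float):
    return x * factor

scale(torch.randn(4), 0.5)
# With this fix, `factor` is marked dynamic on the very first compile when
# dynamic=True, instead of waiting for automatic dynamic to trigger a recompile
# when a different float value shows up.
scale(torch.randn(4), 0.25)
```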

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149564
Approved by: https://github.com/laithsakka
2025-03-22 05:58:59 +00:00
eqy
8f7fbe3d7d [cuBLAS][cuBLASLt] Unify cuBLASLt workspaces with cuBLAS workspaces (#145130)
As `cuBLAS` workspaces are already per-stream, there shouldn't be kernel execution overlap with `cuBLASLt` kernels.

This PR reuses `cuBLAS` workspaces for `cuBLASLt` for the following benefits:

+ caching (`cuBLAS` workspaces were already cached, so now we get that for `cuBLASLt`)
+ "free" workspace size bump for `cuBLASLt`: `cuBLASLt` workspace sizes were previously smaller than those for `cuBLAS` by default, which potentially hurts performance, and we encountered difficulty in increasing the size due to downstream OOMs; see also #120925
+ fixes broken behavior with the memtracker; https://github.com/pytorch/pytorch/pull/139442 attempted to handle peaky allocation behavior that broke memtracker equivalence tests but didn't seem to fully work; here the cached/reused `cuBLAS` workspace seems to fix it
+ one environment variable to rule them all: `CUBLAS_WORKSPACE_CONFIG` applies directly to `cuBLASLt` without a confusing `CUBLASLT_WORKSPACE_SIZE` that users would also need to consider (a minimal usage sketch follows below)
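A hedged sketch of the single-knob usage; the value below is an illustrative size:count pair, and a CUDA device is assumed to be available:

```python
import os

# One env var now governs both cuBLAS and cuBLASLt workspaces; set it before CUDA init.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # illustrative value

import torch

a = torch.randn(512, 512, device="cuda")
b = torch.randn(512, 512, device="cuda")
c = a @ b  # GEMM dispatched through cuBLAS/cuBLASLt using the shared, cached workspace
```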

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145130
Approved by: https://github.com/ngimel
2025-03-22 05:50:11 +00:00
51fa8fb0ff [executorch hash update] update the pinned executorch hash (#149585)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149585
Approved by: https://github.com/pytorchbot
2025-03-22 05:14:19 +00:00
01b1d1f91b [ROCm][TunableOp] Fix offline tuning for ScaledGEMM. (#149677)
The main purpose of this PR is to fix offline tuning for ScaledGEMM. The previous UT passed because it was not strict enough. Additionally:
- All the offline tuning tests now do a comparison with the online results to ensure that the ParamSignatures match.
- We raise an error if submatrices are encountered, as they are only supported in online tuning mode.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149677
Approved by: https://github.com/jeffdaily
2025-03-22 02:22:13 +00:00
b9a5e1d038 [MPS] Add support for scaled_modified_bessel_k1 to eager. (#149783)
Another day another op

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149783
Approved by: https://github.com/malfet
2025-03-22 02:13:41 +00:00
021b3e23ec Fix is_nonzero for more than one elem tensors (#149637)
Differential Revision: [D71560442](https://our.internmc.facebook.com/intern/diff/D71560442)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149637
Approved by: https://github.com/pianpwk
2025-03-22 02:08:28 +00:00
9d02b3993f [PT2] Port use_triton_lce to PT2 pre_grad passes (#149702)
Summary:
`use_triton_lce_replace_simple_LCE` and `use_triton_lce_replace_normal_LCE`

The code is mostly the same, with some minor changes to support aten IR.

Test Plan:
```
scripts/aetk/aetk -L
%run ~/fbsource/fbcode/caffe2/test/inductor/fb/test_customized_triton_kernel_passes.py
```

will verify the qps after everything done in the stack

Reviewed By: frank-wei

Differential Revision: D68909857

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149702
Approved by: https://github.com/frank-wei
2025-03-22 00:36:58 +00:00
c73a526599 Extract reusable portions of elu_kernel into header (#149673)
Similar to #140425, we are making the implementation usable via header-only code sharing.

Review note: #62546 by @yanbing-j removed expm1 usage from this path. I don't know why and expm1 should be more efficient, so I've put it back. Please let me know if there is a good reason I shouldn't.
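For context, a quick standalone illustration of why `expm1` matters near zero (pure Python; the value is chosen only to show the cancellation):

```python
import math

x = -1e-18
print(math.exp(x) - 1.0)  # 0.0: exp(x) rounds to 1.0 in double precision, so the subtraction cancels
print(math.expm1(x))      # -1e-18: expm1 keeps full precision for small |x|
```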

Testing: existing correctness tests should cover.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149673
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2025-03-21 23:54:26 +00:00
b238e36fd9 Revert "[BE][Ez]: Update CU126 to CUDNN 12.8 too (#149254)"
This reverts commit b0a5d55c584792a504ec18600180e3d1200dfea6.

Reverted https://github.com/pytorch/pytorch/pull/149254 on behalf of https://github.com/izaitsevfb due to seems to be causing multiple test failures ([comment](https://github.com/pytorch/pytorch/pull/149254#issuecomment-2744686862))
2025-03-21 23:44:09 +00:00
27370998b2 [MPS][BE] Move polar/complex to stubs (#149752)
No need to have an in-place MPS kernel, as it is just a copy-and-paste of code
from TensorFactories.cpp into Binarykernel.mm
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149752
Approved by: https://github.com/Skylion007, https://github.com/dcci
ghstack dependencies: #149727, #149728, #149729
2025-03-21 22:36:05 +00:00
d320af0663 [dynamo] Ensure placeholder name is not an intermediate node name (#149712)
Fixes https://fb.workplace.com/groups/1075192433118967/permalink/1615671879071017/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149712
Approved by: https://github.com/zou3519
2025-03-21 22:24:45 +00:00
7f836b747f partitioner: ensure collectives saved by SAC that are actually unused in the bw are properly not saved (#149652)
This PR fixes one of the issues described here: https://github.com/pytorch/torchtitan/issues/866#issuecomment-2726015248

I spent some time trying to write a unit test and ultimately failed. If folks are interested I can spend more time trying to, but otherwise I have an E2E test with torchtitan. command:
```
CUDA_VISIBLE_DEVICES=1,2,3,4 NGPU=4 CONFIG_FILE="./torchtitan/models/llama/train_configs/llama3_8b.toml" tlp ./run_train.sh --training.steps=30  --training.tensor_parallel_degree=2 --training.compile --experimental.enable_async_tensor_parallel
```

here's the backward graph generated prior to the PR: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/hirsheybar/f7d17388-42c2-4d7e-8a55-a00387341ecb/custom/rank_0/-_0_0_0/aot_backward_graph_9.txt?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

and new backward graph with the PR: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/hirsheybar/ab8576fc-98c1-4915-af47-699aa8e2557e/custom/rank_0/-_0_0_0/aot_backward_graph_9.txt?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

The main difference is that the input arg `reduce_scatter_tensor_1` is dead code in the bw graph, causing us to unnecessarily save a giant `reduce_scatter` for bw. With the PR, we properly ensure that it is not saved for backward.

More comments in the PR, but the main thing going on is that:

(1) We have some existing logic that checks for activations that are actually dead code in the backward, and removes them

(2) collectives are not properly handled by this code. Why? Collectives are **always** followed by a `wait_tensor()` call. So we need to go one node further and check whether the "dead" code has a wait_tensor user that is also dead

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149652
Approved by: https://github.com/zou3519
ghstack dependencies: #149514
2025-03-21 22:09:19 +00:00
1c6b517e19 DTensor: more generically support CompositeImplicitAutograd ops under inference mode (#149514)
Today, if you run DTensor (or any tensor subclass) under __torch_dispatch__, you will start seeing `CompositeImplicitAutograd` ops show up in the torch_dispatch.

"handling" these ops is trivial: you can just tell them to decompose into their constituent ops. Normally this decomposing happens in autograd, above DTensor, but inference_mode turns autograd off, forcing the subclass to handle the op directly.

It looks like previously we manually added a few CompositeImplicitAutograd entries to DTensor (e.g. linear), but this PR tries to support these ops a bit more generically.

The main difference is that DTensor now needs to check if a given op is `CompositeImplicitAutograd` before attempting to run sharding prop. I ran a quick microbenchmark for the below code with `timeit`, which gave me overhead on the order of ~1us, which is hopefully not too bad for eager mode:

```
def fast_function():
    return torch._C._dispatch_has_kernel_for_dispatch_key(
        op_call.name(), torch._C.DispatchKey.CompositeImplicitAutograd
    )

import timeit
time_taken = timeit.timeit(fast_function, number=1000)
# printed 0.12..., aka 1.2us
print(f'func={str(op_call)}, time={str(time_taken)}')
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149514
Approved by: https://github.com/kwen2501, https://github.com/albanD, https://github.com/wanchaol
2025-03-21 22:09:19 +00:00
d46c16fca6 [FSDP2] warning that reshard_after_forward=1 and True are different (#149750)
People complain about spending time debugging reshard_after_forward=1 when what they actually want is reshard_after_forward=True. Since 1 and True can generally be used interchangeably in programming, add a one-time warning to remind users that here they are different (see the sketch after this list):
* reshard_after_forward=1 means resharding parameters to world size 1, by keeping unsharded parameters from forward to backward
* reshard_after_forward=True means reshard parameters to FSDP mesh
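A minimal sketch of the two spellings side by side; it assumes torch.distributed and a device mesh have already been set up elsewhere, and the two Linear modules are just stand-ins:

```python
import torch
from torch.distributed.fsdp import fully_shard

model_a = torch.nn.Linear(8, 8)
model_b = torch.nn.Linear(8, 8)

fully_shard(model_a, reshard_after_forward=True)  # bool: reshard params back to the FSDP mesh after forward
fully_shard(model_b, reshard_after_forward=1)     # int: reshard to world size 1, i.e. keep params unsharded
```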

from FSDP2 perspective, our docstring is clear about int vs bool https://pytorch.org/docs/main/distributed.fsdp.fully_shard.html

<img width="764" alt="Screenshot 2025-03-21 at 11 02 55 AM" src="https://github.com/user-attachments/assets/6675f7a4-95a0-4421-8dbf-f47e9fdeca26" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149750
Approved by: https://github.com/mori360, https://github.com/msaroufim, https://github.com/wconstab
2025-03-21 22:05:20 +00:00
ff020d32b6 [export] Patch dynamo configs when nonstrict tracing (#149295)
Differential Revision: [D71298929](https://our.internmc.facebook.com/intern/diff/D71298929)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149295
Approved by: https://github.com/ydwu4, https://github.com/zou3519
2025-03-21 21:44:54 +00:00
fb07fe6f36 pretty print graph signature (#149710)
Fixes #141243

Differential Revision: [D71604218](https://our.internmc.facebook.com/intern/diff/D71604218/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149710
Approved by: https://github.com/angelayi
2025-03-21 21:31:58 +00:00
5757aa8773 Cudagraph fix + comment cleanup (#149741)
Cudagraphs is careful to not allow any memory recorded to escape globally without having a reference to the tensor. This is because we may later reclaim that memory for a cudagraph recording and we need to mark the tensor as erroring on access. Very occasionally, a stray tensor will have been allocated locally but not yet cleaned up. In this case, we enter the slow path and try to gc.collect() to deallocate it. From a hard to repro internal use case, this was fixed by an additional `cuda.synchronize()`.

I also snuck in the removal of an outdated comment and a duplicate line.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149741
Approved by: https://github.com/BoyuanFeng, https://github.com/Skylion007
2025-03-21 21:12:36 +00:00
842d51500b Parallelize sort (#149505)
PR #142391 erroneously used `USE_OMP` instead of `USE_OPENMP`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149505
Approved by: https://github.com/fadara01, https://github.com/Skylion007
2025-03-21 20:54:40 +00:00
85f6d61421 [BE] format test/inductor/s429861_repro.py (#148554)
Split from #148186

The diff can be re-generated with the following code in the repo root directory on main branch:

```python
import re
from pathlib import Path

def replace(m: re.Match) -> str:
    s = m.group()
    if '\n' not in s:
        return s
    indent = m.group("indent")
    varnames = s.removesuffix("None").replace("=", "").replace("(", "").replace(")", "").split()
    return "\n".join(
        [
            f"{indent}(",
            *(f"{indent}    {varname}," for varname in varnames),
            f"{indent}) = (None,) * {len(varnames)}",
        ]
    )

file = Path('test/inductor/s429861_repro.py')
content = file.read_text(encoding='utf-8')

new_content = re.sub(
    r"^(?P<indent> *)\w+ *=(\s*(\(\s*\w+\s*\)|\w+)\s*=\s*)+None$",
    replace,
    content,
    flags=re.MULTILINE,
)

file.write_text(new_content, encoding='utf-8')
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148554
Approved by: https://github.com/jansel
2025-03-21 20:39:28 +00:00
c5deacc27a Fix subclass access custom op bug (#149698)
Summary: When we call torch.inference_mode, we seem to skip the Autograd key, causing the custom op that export uses to not be decomposed properly before subclass dispatching starts. We fix this by force-desugaring this op at the Python key.

Test Plan: test

Differential Revision: D71599541

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149698
Approved by: https://github.com/bdhirsh
2025-03-21 19:42:56 +00:00
09aa63ea2c preserve custom meta in placeholders (#149661)
Fixes #147338

Differential Revision: [D71573533](https://our.internmc.facebook.com/intern/diff/D71573533/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149661
Approved by: https://github.com/junpeiz, https://github.com/angelayi
2025-03-21 19:09:38 +00:00
0eb3ac9349 Make sure to write to caches atomically (#149654)
This is an attempt to fix #119698

I was unable to reproduce the original described problem on the latest trunk but the proposed fix makes sense. Instead of adding locks like the original (unlanded) fix I changed a few of the cache writes to be atomic file swaps (write to temp file, rename file) which should have the same effect without blocking reads.
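A minimal sketch of the temp-file-plus-rename pattern being described (a generic helper, not the actual inductor cache code):

```python
import os
import tempfile

def atomic_write(path: str, data: bytes) -> None:
    # Write to a temp file in the same directory, then rename it into place.
    # os.replace is atomic on both POSIX and Windows, so readers never observe
    # a partially written cache entry.
    dirname = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=dirname)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise
```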

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149654
Approved by: https://github.com/eellison
2025-03-21 18:59:41 +00:00
46dd226702 Fakify torchbind objects in compile_fx and add tests for SigridTransformsInstanceTorchBind (#149529)
Summary:
We need to properly fakify torchbind objects, including the ones in graph module attributes, so the registered fake implementation works properly.

- _fakify_script_objects in `compile_fx`
- Allow fake torchbind objects in `torchbind_constants`

Remove `node.meta["unbacked_bindings"]` for `aot_compile` in `compile_fx`. Otherwise `ShapeProp` will fail when trying to resolve the `unbacked_bindings` of `with_effect` tokens.

Update `sigrid_transforms_test` to use the latest `torch._inductor.aot_compile` API.

Add a test for `Fakify torchbind objects in compile_fx and add tests for SigridTransformsInstanceTorchBind` in `e2e_test`.

Test Plan:
```
buck run //caffe2/torch/fb/sparsenn:sigrid_test -- -r test_transform_torch_bind

buck run //sigmoid/inference/test:e2e_test_cpu -- -r SigridTransforms

buck2 run mode/dev-nosan sigmoid/inference/ts_migration:pt2i_readiness_main -- --model_id 545017754 --test_suite ads_all --mode test_preproc

```

Differential Revision: D70013257

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149529
Approved by: https://github.com/angelayi
2025-03-21 18:58:28 +00:00
19b763def1 Skip test if torchvision is not available (#149494)
The test unconditionally imports torchvision and fails if the isn't installed.
Skip it in this case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149494
Approved by: https://github.com/janeyx99
2025-03-21 18:57:13 +00:00
b0a5d55c58 [BE][Ez]: Update CU126 to CUDNN 12.8 too (#149254)
Have CUDNN have the same version for 12.6 and 12.8 for better performance and consistency. We can't do CU12.1 because it's not supported and CU12.4 isn't updated due to manywheel Linux compatibility reasons and dropping support for it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149254
Approved by: https://github.com/jansel, https://github.com/atalman, https://github.com/tinglvv
2025-03-21 18:20:44 +00:00
1b08aaeafe Supporting non-tensor-data write_size in planner write items. (#149699)
Summary:
1. The current write item structure does not contain the amount of data that needs to be written.
2. planner.item already has a size primitive, 'tensor_storage_size' (https://fburl.com/code/7a0gsmw7), but only for tensors.
3. Right now, the only way the writer layer gets hold of this property (for non-tensor data) is to first do a lookup into the actual tensor/bytes and then calculate the nbytes.
This change introduces a way to capture non-tensor data size within a write-plan item.

Test Plan: Existing UT.

Differential Revision: D71599725

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149699
Approved by: https://github.com/MeetVadakkanchery
2025-03-21 18:09:14 +00:00
f7d1b966c2 [Inductor] Unify the data type propagation between Triton and CPP Backend (#146970)
Fixes #144246

Use `DtypePropagationOpsHandler` for CSE variables of CPP backend. In addition, add static type checking for the generated CPP code similar to the `config.test_configs.runtime_triton_dtype_assert`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146970
Approved by: https://github.com/jgong5, https://github.com/eellison, https://github.com/leslie-fang-intel
2025-03-21 17:52:51 +00:00
99a4fc5a2f Add elu as core ATen (#149684)
Differential Revision: [D71590420](https://our.internmc.facebook.com/intern/diff/D71590420/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149684
Approved by: https://github.com/larryliu0820
2025-03-21 16:56:10 +00:00
fa5f556f88 [CI] enable operator benchmark on CPU (#143733)
This is to enable operator benchmark for CPU to track op level performance. This PR is motivated by PR: https://github.com/pytorch/pytorch/issues/120982 and investigate feasibility in https://github.com/pytorch/pytorch/pull/127216

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143733
Approved by: https://github.com/leslie-fang-intel, https://github.com/atalman, https://github.com/huydhn, https://github.com/malfet

Co-authored-by: diwei sun <diwei.sun@intel.com>
Co-authored-by: chuanqiw <chuanqi.wang@intel.com>
2025-03-21 16:46:03 +00:00
700260f166 [MPS][BE] Get rid of supports_dense flag (#149729)
As all binary ops now support dense
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149729
Approved by: https://github.com/dcci
ghstack dependencies: #149727, #149728
2025-03-21 16:37:03 +00:00
64d22b9fad [MPS][BE] Migrate complex_mul to tensor iterator (#149728)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149728
Approved by: https://github.com/dcci
ghstack dependencies: #149727
2025-03-21 16:37:03 +00:00
e35ef61066 [MPS][BE] Migrate torch.complex to binary_functor (#149727)
As it's very similar in nature to `torch.polar`.
The kernel is also renamed from `complex_kernel` to `make_complex`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149727
Approved by: https://github.com/dcci
2025-03-21 16:36:56 +00:00
bdc132d0e1 [MPS] Add support for scaled_modified_bessel_k0 for eager. (#149705)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149705
Approved by: https://github.com/malfet
2025-03-21 16:14:29 +00:00
1eab841185 Add release branch push triggers to inductor-rocm-mi300.yml (#149672)
In similar vein as https://github.com/pytorch/pytorch/pull/149517

When we added the rocm-mi300.yml earlier this year, we had lower capacity and we were just pipecleaning the workflow, so we set the trigger to only respond to pushes to main branch. But now we have more stability as well as capacity, and we would really like to ensure that the release branch is being tested on MI300s as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149672
Approved by: https://github.com/jeffdaily
2025-03-21 16:02:03 +00:00
5d4b5ee315 [MPS] Add inline to function definition. (#149704)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149704
Approved by: https://github.com/malfet
2025-03-21 14:53:09 +00:00
d072254eae Extend vec backend with BF16 SVE intrinsics (#143666)
- Following the work in https://github.com/pytorch/pytorch/pull/119571, BF16 SVE intrinsics are added to the Vectorized class, providing ~1.7x speedup on `silu` and `softmax`.
- Added bf16 detection in CMake
- Added a guard for native NEON code to prevent compilation errors

@aditew01 @maajidkhann please have a look

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143666
Approved by: https://github.com/swolchok, https://github.com/aditew01

Co-authored-by: Aditya Tewari <aditya.tewari@arm.com>
2025-03-21 10:55:11 +00:00
68dfd44e50 Do not depend on numpy during the import (#149683)
But a good followup would be to use torch primitives instead of numpy here
Fixes https://github.com/pytorch/pytorch/issues/149681

Test plan: Monkey-patch 2.7.0-rc and run `python -c "import torch;print(torch.compile(lambda x:x.sin() + x.cos())(torch.rand(32)))"`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149683
Approved by: https://github.com/seemethere
2025-03-21 08:14:57 +00:00
34743678b9 [Dynamo] Cleanup state management for ctx managers (#149689)
Removes state indirection for ctx managers. This isn't needed anymore since VTs are mutable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149689
Approved by: https://github.com/StrongerXi
2025-03-21 07:18:33 +00:00
cfc08caea9 [ROCm] NLLLoss (torch.nll_loss) Performance Tuning by Dynamically Selecting # of GPU threads (#149548)
Instead of fixing the number of GPU threads to 32 regardless of input size, this PR dynamically selects the number of threads based on the formula: clamp(2^round(log2(dim0/16)), min = 32, max = 1024). The experiments below were done on an MI300 machine for data type float32:
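The selection formula restated as plain Python (an illustrative re-implementation of the heuristic, not the kernel code itself):

```python
import math

def nll_loss_num_threads(dim0: int) -> int:
    # clamp(2 ** round(log2(dim0 / 16)), min=32, max=1024)
    raw = 2 ** round(math.log2(max(dim0, 1) / 16))
    return max(32, min(1024, raw))

print(nll_loss_num_threads(64))      # 32   (clamped up to the minimum)
print(nll_loss_num_threads(4096))    # 256
print(nll_loss_num_threads(10**6))   # 1024 (clamped down to the maximum)
```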

![nll_loss_threads_bests](https://github.com/user-attachments/assets/3be3d465-e3db-44ed-991a-fdfcab03baae)
![nll_loss_heauristic](https://github.com/user-attachments/assets/e82b9788-9b4d-4862-a180-8df7ad298182)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149548
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony
2025-03-21 07:16:37 +00:00
0ed34210b2 [MPS] Add support for modified_bessel_k1 to eager and inductor. (#149687)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149687
Approved by: https://github.com/malfet
2025-03-21 04:59:06 +00:00
0a396a8160 [Docs] Make torch.Library's kind have no default value to be consistent with the code (#149390)
Fixes #149389

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149390
Approved by: https://github.com/janeyx99
2025-03-21 04:42:10 +00:00
4ea580568a update aotinductor doc for XPU support (#149299)
As titled. Since AOTInductor works on Intel GPU starting from 2.7, add the related content to its doc.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149299
Approved by: https://github.com/guangyey, https://github.com/desertfire
2025-03-21 04:40:31 +00:00
ccd5d811e8 [aoti] follow up to use new api in test_provenance_tracing.py (#149387)
Summary:
As title. Follow-up of D71181284, plus some minor refactoring.

Context : D69609685 (update test runner to use new api) / https://github.com/pytorch/pytorch/pull/147105

Test Plan:
```
buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:provenance_tracing -- -r test_triton_kernel_to_post_grad_tracing_cpu
```

Differential Revision: D71375725

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149387
Approved by: https://github.com/yushangdi
2025-03-21 04:37:50 +00:00
5327894812 [BE] Introduce lapack_work_to_int function (#149682)
It can be used to safely cast floating-point values to int by adding an ULP; this is a follow-up to https://github.com/pytorch/pytorch/pull/146456

Fixes https://github.com/pytorch/pytorch/issues/149591

(Not adding unittest as it's just going to be too slow)
Test plan:
```
% python3 -c "import torch; torch.pinverse(torch.rand(50000, 8193))"
```

Before the change errored out with
```
RuntimeError: false INTERNAL ASSERT FAILED at "pytorch/pytorch/aten/src/ATen/native/BatchLinearAlgebra.cpp":1605, please report a bug to PyTorch. linalg.svd: Argument 12 has illegal value. Most certainly there is a bug in the implementation calling the backend library.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149682
Approved by: https://github.com/wdvr
2025-03-21 04:08:07 +00:00
bf6621d08f [Distributed] Add repr methods for ParallelStyles (#149478)
Fixes #149470

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149478
Approved by: https://github.com/wanchaol
2025-03-21 03:59:25 +00:00
ee6a029165 [XPU] Update triton commit to fix level_zero not found by env var LEVEL_ZERO_V1_SDK_PATH. (#149511)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149511
Approved by: https://github.com/EikanWang
2025-03-21 03:56:00 +00:00
732f9d7435 Optimize torch.equal description (#149618)
Fixes #149222

## Test Result

![image](https://github.com/user-attachments/assets/559a376f-2dd0-4474-bbd5-9299d9df51e3)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149618
Approved by: https://github.com/zou3519
2025-03-21 03:44:49 +00:00
64bd889660 [Inductor][CPP] rename shim_mkldnn.h/.cpp to shim_cpu.h/.cpp (#149372)
**Summary**
Previous discussion is here: https://github.com/pytorch/pytorch/pull/148907#issuecomment-2712795600
Rename these files because
- they may hold mkldnn-unrelated code for CPU
- filenames are aligned with files for CUDA and XPU

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149372
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/desertfire
2025-03-21 03:42:12 +00:00
a39bf846f5 [ONNX] Add draft_export as a strategy (#147529)
Create draft_export strategy.

The strategy is added before jit and after strict=True, as the third fallback. Since it is specializing tensors it should not be less robust than the jit trace strategy.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147529
Approved by: https://github.com/titaiwangms
2025-03-21 03:05:17 +00:00
0692301e25 Catch OSError in general when writing files (#149464)
Redundant exception types in `except (PermissionError, OSError):`.  Write `except OSError:`, which catches exactly the same exceptions.

https://github.com/pytorch/pytorch/actions/runs/13935844871/job/39141062991

When hipifying files or writing cprofile files, catching PermissionError is not enough when the file is located in a place that is not writable at all, or when other OS errors happen while writing files.

This fix makes the code more robust.
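A small self-contained illustration of the point (the path is hypothetical): since PermissionError is a subclass of OSError, `except OSError` covers it plus cases like a read-only filesystem.

```python
def save(path: str, text: str) -> None:
    try:
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)
    except OSError as e:  # also catches PermissionError, EROFS, ENOSPC, ...
        print(f"could not write {path}: {e}")

save("/definitely/read-only/example.txt", "hello")
```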

Example error log:
```log
  File "deepspeed/ops/adam/fused_adam.py", line 94, in __init__
    fused_adam_cuda = FusedAdamBuilder().load()
                      ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "deepspeed/ops/op_builder/builder.py", line 540, in load
    return self.jit_load(verbose)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "deepspeed/ops/op_builder/builder.py", line 587, in jit_load
    op_module = load(name=self.name,
                ^^^^^^^^^^^^^^^^^^^^
  File "torch/utils/cpp_extension.py", line 1597, in load
    return _jit_compile(
           ^^^^^^^^^^^^^
  File "torch/utils/cpp_extension.py", line 2031, in _jit_compile
    hipify_result = hipify_python.hipify(
                    ^^^^^^^^^^^^^^^^^^^^^
  File "torch/utils/hipify/hipify_python.py", line 1167, in hipify
    preprocess_file_and_save_result(output_directory, filepath, all_files, header_include_dirs,
  File "torch/utils/hipify/hipify_python.py", line 213, in preprocess_file_and_save_result
    result = preprocessor(output_directory, filepath, all_files, header_include_dirs, stats,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "torch/utils/hipify/hipify_python.py", line 940, in preprocessor
    output_source = RE_QUOTE_HEADER.sub(mk_repl('#include "{0}"', True), output_source)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "torch/utils/hipify/hipify_python.py", line 919, in repl
    preprocess_file_and_save_result(output_directory,
  File "torch/utils/hipify/hipify_python.py", line 213, in preprocess_file_and_save_result
    result = preprocessor(output_directory, filepath, all_files, header_include_dirs, stats,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "torch/utils/hipify/hipify_python.py", line 986, in preprocessor
    with clean_ctx.open(fout_path, 'w', encoding='utf-8') as fout:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "torch/utils/hipify/hipify_python.py", line 123, in open
    return open(fn, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: [Errno 30] Read-only file system: 'deepspeed/ops/csrc/adam/multi_tensor_apply_hip.cuh'
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149464
Approved by: https://github.com/janeyx99
2025-03-21 02:42:50 +00:00
362b40939d [ONNX] Improve docstring of onnx symbolic ops (#149668)
Better examples
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149668
Approved by: https://github.com/titaiwangms
2025-03-21 01:57:39 +00:00
66dd00fca0 Fix clang-tidy errors (#149581)
Summary: Cleanup clang-tidy complaints in `EmbeddingBag.cpp`: Avoid shadowed variables and unused parameters.

Test Plan: sandcastle

Differential Revision: D71512594

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149581
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-03-21 01:53:57 +00:00
e481615bc7 [aot] always lower the backward with a deepcopy (#149229)
FIXES https://github.com/pytorch/pytorch/issues/149105

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149229
Approved by: https://github.com/bdhirsh
2025-03-21 01:47:13 +00:00
5ebc283f2c [PT2] Port use_triton_dot_compress to PT2 pre_grad passes (#148517)
Summary: add use_triton_dot_compress in pre_grad

Test Plan:
```
scripts/aetk/aetk -L

%run ~/fbsource/fbcode/caffe2/test/inductor/fb/test_customized_triton_kernel_passes.py
```

Reviewed By: frank-wei

Differential Revision: D68909838

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148517
Approved by: https://github.com/frank-wei
2025-03-21 01:42:32 +00:00
c2ada9d77b [easy] Do not logspam if static cuda launcher is disabled (#149669)
No need to log.info every time someone runs with StaticCudaLauncher disabled.

Test plan: Run any benchmark and see that we don't spam the bypass message in logs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149669
Approved by: https://github.com/oulgen, https://github.com/jansel
ghstack dependencies: #148890
2025-03-21 01:22:26 +00:00
1099c37150 ci: Add sccache to manylinux images (#148419)
Adds sccache to our manylinux images. These are purposely built
without the sccache-dist binary since we're not expecting to use that.

Another caveat of these builds is that they are built with the vendored
version of openssl.

This is to set the stage for us to be able to build binaries
sequentially.

Signed-off-by: Eli Uriegas <github@terriblecode.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148419
Approved by: https://github.com/atalman
2025-03-21 01:15:34 +00:00
2975664fb0 add python root bin to windows load path. (#146573)
This PR extends the Python root bin path to the DLL load list.
It makes PyTorch more robust and compatible with more dependency libraries, such as `intel-pti`.
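A rough sketch of the idea on the Python side (the exact directory layout and call site are assumptions, not this PR's code):

```python
import os
import sys

if os.name == "nt":
    # Assume the interpreter's root has a bin/ directory holding dependency DLLs.
    bin_dir = os.path.join(os.path.dirname(sys.executable), "bin")
    if os.path.isdir(bin_dir):
        os.add_dll_directory(bin_dir)  # extend the Windows DLL search path
```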

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146573
Approved by: https://github.com/EikanWang, https://github.com/albanD
2025-03-21 00:48:43 +00:00
90543e90a0 Fix broken dynamo_timed test due to python_version field (#149659)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149659
Approved by: https://github.com/ppanchalia
2025-03-21 00:27:28 +00:00
f47aa08130 [export] Support python assertion with symints. (#149444)
Summary: This diff ports a technique from torch.fx symbolic tracing to trace through Python asserts when we run into data-dependent symbolic shape assertions, so that we can achieve the same effect as torch dynamo and automatically turn asserts into torch.check()s.

Test Plan: buck test mode/opt caffe2/test:test_export -- -r test_python_asserts_with_sym_int
Differential Revision: D71425360

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149444
Approved by: https://github.com/tugsbayasgalan
2025-03-20 23:07:45 +00:00
bf34e228c5 [export] Beef up guard_added logs (#149465)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149465
Approved by: https://github.com/pianpwk
2025-03-20 23:02:07 +00:00
1d3c50fcc5 [Dynamo] Support the torch._C.DisableTorchFunction ctx manager (#149491)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149491
Approved by: https://github.com/StrongerXi
ghstack dependencies: #149489, #149490
2025-03-20 22:19:55 +00:00
ce5adc5c05 [Dynamo] add support for torch._C._is_torch_function_all_disabled (#149490)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149490
Approved by: https://github.com/StrongerXi
ghstack dependencies: #149489
2025-03-20 22:19:55 +00:00
f64c361860 [Dynamo] Refactor DisableTorchFunction ctx manager (#149489)
Refactors the DisableTorchFunction ctx manager to properly model the eager code (no args to the context manager).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149489
Approved by: https://github.com/StrongerXi
2025-03-20 22:19:55 +00:00
a268c29b9f [distributed] fix: use group rank instead of global rank when possible (#149488)
Fixes #149200

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149488
Approved by: https://github.com/wconstab
2025-03-20 21:47:03 +00:00
b07b819912 [inductor] Add a helper for convert index_dtype to torch dtype (#149531)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149531
Approved by: https://github.com/eellison
2025-03-20 21:33:29 +00:00
a703107f7b [AOTInductor] Fix skip cpp wrapper unit test (#149606)
Summary: as title

Test Plan:
```
buck2 test 'fbcode//mode/opt' fbcode//deeplearning/aot_inductor/cpu/test:cpu_lowering_utils_test -- --exact 'deeplearning/aot_inductor/cpu/test:cpu_lowering_utils_test - test_cpu_lower_aoti_ep_called (deeplearning.aot_inductor.cpu.test.test_lowering_utils.CPULoweringTest)'
```
```
buck test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:cudagraph_trees_expandable_segments -- --exact 'caffe2/test/inductor:cudagraph_trees_expandable_segments - test_skip_cpp_wrapper (caffe2.test.inductor.test_cudagraph_trees.CudaGraphTreeTests)'
```

https://www.internalfb.com/phabricator/paste/view/P1758059197

Reviewed By: henryoier

Differential Revision: D71528281

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149606
Approved by: https://github.com/desertfire
2025-03-20 20:55:33 +00:00
406d464d97 Add is_batchedtensor to dynamo builder (#149541)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149541
Approved by: https://github.com/zou3519
2025-03-20 20:46:15 +00:00
f17ae3f7b7 [Inductor Cutlass backend] Fix imports and compilation of Cutlass SM100 Kernels (#149515)
Summary: Fixes the import and compilation of Cutlass SM100 Kernels.

Test Plan: Cutlass backend unit tests, running benchmarks/inductor_backends/cutlass.py

Differential Revision: D71196747

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149515
Approved by: https://github.com/ColinPeppler, https://github.com/chenyang78
2025-03-20 20:35:18 +00:00
24176f6e32 Revert "[cond] don't trace fw and bw graph in autograd key (#148930)"
This reverts commit 6e843a51dd5743b864fc28601ef06cdc18488b3e.

Reverted https://github.com/pytorch/pytorch/pull/148930 on behalf of https://github.com/ydwu4 due to Test failure is legit ([comment](https://github.com/pytorch/pytorch/pull/148930#issuecomment-2741585315))
2025-03-20 20:28:29 +00:00
4a4a71a73c [inductor]lowering scan to while_loop (#148580)
This PR adds a pass in post_grad that lowers scan to while_loop. See the comment before the pass for how this is implemented.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148580
Approved by: https://github.com/jansel, https://github.com/eellison
2025-03-20 20:21:02 +00:00
6e843a51dd [cond] don't trace fw and bw graph in autograd key (#148930)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148930
Approved by: https://github.com/zou3519
2025-03-20 20:18:29 +00:00
18435945af Set __context__/__cause__ when generator raise StopIteration (#148765)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148765
Approved by: https://github.com/zou3519
ghstack dependencies: #146505
2025-03-20 19:59:30 +00:00
44e6464914 Allow setting attribute to NestedUserFunctionVariable (#146505)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146505
Approved by: https://github.com/zou3519
2025-03-20 19:59:30 +00:00
aae4c0729e Fix broken build within xplat/caffe2 (#149403)
Summary:
Following a pull from open source, the build within xplat is broken
due to not finding <autograd/function.h>.

Within python_function.cpp there seems to be a convention of using the
torch/csrc prefix.

This change includes that prefix to enable the build to proceed.

Test Plan:
Build a binary using torch.

https://www.internalfb.com/buck2/83122485-d3c3-43f4-97b4-81bb90450b3b

Unit tests run too

https://www.internalfb.com/intern/testinfra/testrun/13229323975828416

Further testing in CI and elsewise expected.

Reviewed By: malfet

Differential Revision: D70331539

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149403
Approved by: https://github.com/izaitsevfb

Co-authored-by: Dominic Binks <dbinks@meta.com>
2025-03-20 19:27:55 +00:00
ffa085334c Specify the default PyTorch Distributed backend for MPS (#149538)
Fixes #149537

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149538
Approved by: https://github.com/d4l3k, https://github.com/malfet
2025-03-20 18:54:03 +00:00
1d221724fc fix missing field initializer warning (#149597)
Per title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149597
Approved by: https://github.com/drisspg, https://github.com/Skylion007
2025-03-20 18:48:05 +00:00
6285a71aba [dynamo] fix bug where non-recursive disable modifies the original function (#148896)
Fixes https://github.com/pytorch/pytorch/issues/148787.

We fix this by:
- Wrapping the original function instead of directly modifying it
- When we detect that the previous frame is the non-recursive disable wrapper, we skip tracing this frame (the non-recursive disable wrapper will always be skipped, so that frame will be present in the traceback); a minimal sketch follows this list
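A hedged sketch of the fixed behavior (the helper function is illustrative):

```python
import torch

def helper(x):
    return x + 1

# With this fix, disable(..., recursive=False) wraps `helper` instead of
# modifying it, so the original function is left untouched for other callers.
wrapped = torch._dynamo.disable(helper, recursive=False)
print(wrapped(torch.ones(2)))
```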

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148896
Approved by: https://github.com/jansel
2025-03-20 18:33:54 +00:00
88a26dbb9d [BE] simplify test_cpp_extensions_aot and .gitignore (#149231)
It is shady to clean up an install mid-test. So don't do that anymore and use .gitignore instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149231
Approved by: https://github.com/albanD, https://github.com/msaroufim
2025-03-20 18:17:19 +00:00
b99fc9d29f [MTIA] Support loading Tensors on mtia:0 for pytorch code (#149327)
Summary: The diff includes updates to the PyTorch code to enable loading tensors to MTIA.

Reviewed By: PatriceVignola

Differential Revision: D71176848

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149327
Approved by: https://github.com/ezyang
2025-03-20 18:05:15 +00:00
7bb9c36784 Hook StaticCudaLauncher up to torch.compile (cold start) (#148890)
This hooks up the previous PR to torch.compile. Will add a config flag to hide this behind in a bit, but for now it's useful for testing purposes to have it on by default.

Inductor will automatically choose to use StaticCudaLauncher to launch triton kernels if:
- The kernel is a cuda kernel and inductor can find a cubin file associated with it
- The kernel takes less than 50 arguments
- The kernel doesn't use any special features (launch hooks, large amounts of shared memory)
- The kernel is not user defined (to be supported in a later PR)

We split CompileResult into TritonCompileResult and StaticTritonCompileResult, but have them share implementations of how they exec a python launcher. StaticTritonCompileResult's python launcher has the benefit of a simpler def_args/call_args setup, since it always filters out all constexprs before running, no matter the triton version.

Some key features of StaticTritonCompileResult:
- It is fully serializable
- It stores the minimum amount of stuff, so that later it can be cached easily
- It does not depend on any triton specific types (though it does have various triton metadata).

For now, both TritonCompileResult and StaticTritonCompileResult still `exec` custom python launchers, and use GridExpr. We can change that in the future to simplify if we'd like. For now though, this custom python codegen is good for flexibility when it comes to supporting removal of constexprs, so using it for static launching is nice to not have to pay the cost of removing constexprs at kernel runtime.

Hooking everything up to torch.compile lets me run every unit test with StaticCudaLauncher to make sure that we still pass (even if we bypass StaticCudaLauncher itself). It also lets me check for compilation/runtime performance with these changes.

Fixes #149448

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148890
Approved by: https://github.com/jansel
2025-03-20 17:32:20 +00:00
c99efc08fb [ROCm] skip test_RNN_dropout_state (#149446)
PR to skip test_nn.py::TestNN::test_RNN_dropout_state
Currently ROCm doesn't support dropout value for RNN

PR to enable RNN dropout on ROCm still in review and blocked pytorch/pytorch#144572

Fixes: https://github.com/pytorch/pytorch/issues/68849

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149446
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily
2025-03-20 17:22:39 +00:00
1d9401befc ci: Remove mentions and usages of DESIRED_DEVTOOLSET and cxx11 (#149443)
This is a remnant of our migration to manylinux2_28 we should remove
these since all of our binary builds are now built with cxx11_abi

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149443
Approved by: https://github.com/izaitsevfb, https://github.com/atalman
2025-03-20 16:49:46 +00:00
6237495fcf torch.Size input (#149414)
Summary: Support for `torch.Size` inputs was patchy before because `unflatten_fn` for this type returned a tuple. This PR cleans this up.

Fixes #149158

Test Plan: added test

Differential Revision: D71403635

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149414
Approved by: https://github.com/yushangdi
2025-03-20 16:23:13 +00:00
2c4bc65366 [aotd] Guess tangents stride as output strides (#144579)
AOTDispatch  doing AOT backward graph preparation does not know real tangents that user will specify when runs backward.

AOTD guesses the tangents. Before, we guessed that the memory format of the tangents would match the memory format of the corresponding outputs. If the tangents specified at runtime did not have the same memory format as we guessed during compilation, AOTD coerced (copied) them to the guessed memory_format.

But as Horace found, there are popular use cases where the outputs of the compiled region will be in a specific memory_format, e.g. a 4D tensor with dims 1 and 2 transposed.

https://github.com/karpathy/nanoGPT/blob/master/model.py#L57

This PR changes the logic so that AOTD expects the same "strideness" for tangents as for outputs. As a result, it avoids the coercion for the case of transposed dims.
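A small sketch of the scenario (shapes are illustrative; `randn_like` preserves the output's strideness, mirroring the test change noted below):

```python
import torch

@torch.compile
def f(x):
    return x.transpose(1, 2)  # output is a non-contiguous (but dense) view

x = torch.randn(2, 8, 4, 16, requires_grad=True)
y = f(x)
grad = torch.randn_like(y)  # same strides as y
y.backward(grad)            # tangent strideness matches the guess, so no coercion copy
```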

Limitations:
We keep guessing memory_format for:
1/ Dynamic shapes (needs more changes)
2/ Tensor subclasses (needs more changes)

Other changes:
test_torchinductor was always creating contiguous tangents via `torch.randn()`; they were changed to `torch.randn_like()` to compare computation with the same strideness.

(E.g. for cuda float16 strideness affects numerics for fft ops).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144579
Approved by: https://github.com/bdhirsh
2025-03-20 15:41:36 +00:00
9b1127437e Add triton as dependency to CUDA aarch64 build (#149584)
Aarch64 Triton build was added by: https://github.com/pytorch/pytorch/pull/148705
Hence add the proper constraint to the CUDA 12.8 aarch64 build.

Please note we still want to use
```platform_system == 'Linux' and platform_machine == 'x86_64'```
for all other builds.

Since these are prototype binaries only used by the CUDA 12.8 Linux aarch64 build, we would like to serve them from download.pytorch.org.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149584
Approved by: https://github.com/nWEIdia, https://github.com/tinglvv, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-03-20 15:39:45 +00:00
80dfce2cc3 [export] Handle non OpNamespace type during decomposition. (#149431)
Summary:
Turns out we can have non-OpNamespace objects in torch.ops._dir.

We should just throw those away during iteration.

Test Plan: eyes

Differential Revision: D71417992

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149431
Approved by: https://github.com/tugsbayasgalan
2025-03-20 15:36:15 +00:00
d67c1a027e [Intel GPU][PT2E] bugfix: use zero-point to decide conv src zp mask (#149473)
# Motivation
The PR fixes a bug that wrongly decides the zero-point mask setting. Specifically, the code deems the zero-point to always be non-zero because the scale is used for the judgement. Fortunately, the bug only affects performance; accuracy is not affected.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149473
Approved by: https://github.com/EikanWang, https://github.com/guangyey
2025-03-20 14:46:07 +00:00
496bbf38be add grad_output shape check for adaptive_avg_pool2d_backward (#145241)
Fix https://github.com/pytorch/pytorch/issues/145070.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145241
Approved by: https://github.com/malfet, https://github.com/eqy
2025-03-20 14:10:31 +00:00
00a2c68f67 Fix a typo "trochrec" to "torchrec" (#149542)
Summary: As titled, the path is incorrect due to the typo

Test Plan: CI

Differential Revision: D71490709

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149542
Approved by: https://github.com/williamwen42
2025-03-20 10:14:23 +00:00
a66a9581da [dynamo] support Python 3.13t (#149549)
A few bug fixes to get Dynamo mostly working with 3.13 nogil. Dynamo encounters internal CPython assert errors in older versions of 3.13. The fix has been landed on [CPython's 3.13 branch](https://github.com/python/cpython/tree/3.13) and will be included in 3.13.3 (https://peps.python.org/pep-0719/ - april 8). If you wish to try `torch.compile` on the latest 3.13 branch, you can comment out the error checking (i.e. 70b6cd4e11/torch/__init__.py (L2535) and 70b6cd4e11/torch/_dynamo/eval_frame.py (L899)).

We will work on getting PyTorch CI up for Dynamo/dynamo-wrapped/inductor once 3.13.3 is available.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149549
Approved by: https://github.com/jansel
2025-03-20 09:49:27 +00:00
970ac2d907 [Inductor] Improve memory locality by iterating over y dimension before x (#149339)
# Feature

Fixes https://github.com/pytorch/pytorch/issues/148718 by reordering the tensor dims to `(z, y, x)`.

As a bonus refactor, block pointers no longer needed the `reorder=True` argument to `self.active_range_trees()`. Since this argument is no longer used anywhere, this PR simply deletes it as opposed to updating the logic for the new iteration order.

# Perf impact

It looks like there's a decent perf bump on A100, with cudagraphs enabled. Granted, perf runs seem to have some noise between commits. ([Workflow run](https://github.com/pytorch/pytorch/actions/runs/13914815576).)

Training (all neutral or positive):
![image](https://github.com/user-attachments/assets/57f1ef1d-60b4-446f-baf3-aca87a26b81b)

Inference (one positive, one very small negative):
![image](https://github.com/user-attachments/assets/679aa057-af23-47f1-8d8e-8520daf1bd92)

As reported in https://github.com/pytorch/pytorch/issues/148718, this PR makes consecutive threads access consecutive memory addresses. This should theoretically give the GPU more opportunities to coalesce loads and stores. From Nvidia's [kernel profiling guide](https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html):

> Local memory is private storage for an executing thread and is not visible outside of that thread. It is intended for thread-local data like thread stacks and register spills. Local memory addresses are translated to global virtual addresses by the AGU unit. Local memory has the same latency as global memory. One difference between global and local memory is that local memory is arranged such that consecutive 32-bit words are accessed by consecutive thread IDs. Accesses are therefore fully coalesced as long as all threads in a warp access the same relative address (e.g., same index in an array variable, same member in a structure variable, etc.).

I couldn't find any information on how coalescing works for other kinds of memory, but the guide mentions it is also supported for accesses to the L2 cache.

> The L2 Request Coalescer (LRC) processes incoming requests for L2 and tries to coalesce read requests before forwarding them to the L2 cache. It also serves programmatic multicast requests from the SM and supports compression for writes.

The [answer to this Stack Overflow post](https://stackoverflow.com/a/5044424) also explains coalescing in a straightforward way. Inductor's current iteration order corresponds to the first (uncoalesced) example in that answer, while the order after this PR corresponds to the second (coalesced) example.
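The same ordering point can be seen with plain index math (a pure-Python picture, not the generated Triton code):

```python
# With x as the innermost dimension, consecutive "threads" hit consecutive flat
# offsets of a contiguous (Z, Y, X) tensor, which is exactly what coalescing wants.
Z, Y, X = 2, 3, 4

def flat_offset(z: int, y: int, x: int) -> int:
    return (z * Y + y) * X + x

offsets = [flat_offset(z, y, x) for z in range(Z) for y in range(Y) for x in range(X)]
assert offsets == list(range(Z * Y * X))  # stride-1, fully coalescable accesses
```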

Besides GPUs, this order of accessing data is highly advantageous for systems relying on DMAs, as those are designed to access contiguous spans of memory. This change improves the performance of an elementwise add kernel on an internal model, using internal hardware, by 1.76x. I will share the details with reviewers who are Meta employees via a private channel.

# Test plan
 - Updated expected code on CI tests.
 - Added a new test checking the {x,y,z}indices and block pointers on a 3D pointwise kernel.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149339
Approved by: https://github.com/jansel
2025-03-20 08:12:00 +00:00
3647711a89 [AOTI][refactor] Remove dead code (#149287)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149287
Approved by: https://github.com/cyyever, https://github.com/yushangdi
2025-03-20 07:29:27 +00:00
90ef7a9561 Revert "Supporting non-tensor-data write_size in planner write items. (#149434)"
This reverts commit 1442230a267f0ce4f0bb540fca775faa71e7cfd5.

Reverted https://github.com/pytorch/pytorch/pull/149434 on behalf of https://github.com/izaitsevfb due to breaking docs build ([comment](https://github.com/pytorch/pytorch/pull/149434#issuecomment-2739378287))
2025-03-20 06:52:02 +00:00
00333c4548 [Inductor] Set prop_kind to forward_inference when grad is not needed for mkldnn_linear_pointwise and mkldnn_convolution_pointwise (#147072)
Summary:
The `prop_kind` of `mkldnn._linear_pointwise`, `mkldnn._linear_pointwise.binary`, `mkldnn._convolution_pointwise.binary` and `mkldnn._convolution_pointwise_.binary` is always `dnnl_forward`, i.e., `dnnl_forward_training`, regardless of whether `grad` is needed. Setting `prop_kind` to `dnnl_forward_inference` for these ops when `grad` is not needed could give better performance.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147072
Approved by: https://github.com/leslie-fang-intel, https://github.com/CaoE, https://github.com/jansel
2025-03-20 06:21:31 +00:00
c4d59e6279 [Inductor] Fix combo_kernel logging error (#149575)
Summary:
Fix logging error like:
```
in combinable_nodes
    log.debug(
Message: 'ComboKernels: %d template nodes are filtered'
Arguments: (OrderedSet([8]),)
--- Logging error ---
Traceback (most recent call last):
  File "/usr/local/fbcode/platform010/lib/python3.10/logging/__init__.py", line 1100, in emit
    msg = self.format(record)
  File "/usr/local/fbcode/platform010/lib/python3.10/logging/__init__.py", line 943, in format
    return fmt.format(record)
  File "/data/users/guorachel/fbsource/buck-out/v2/gen/fbcode/854b9ed00d28c5c5/caffe2/torch/fb/model_transform/experimental/benchmark/__mts_gpu_benchmark__/mts_gpu_benchmark#link-tree/torch/_logging/_internal.py", line 818, in format
    record.message = record.getMessage()
  File "/usr/local/fbcode/platform010/lib/python3.10/logging/__init__.py", line 368, in getMessage
    msg = msg % self.args
TypeError: %d format: a real number is required, not OrderedSet
```

Encountered when running a prod model with the combo kernel feature enabled.
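
A minimal sketch of the class of fix (not the exact Inductor change): the `%d` placeholder needs a real number, so log the size of the filtered set rather than the set object itself.

```python
import logging

log = logging.getLogger(__name__)
filtered_ids = {8}  # stands in for Inductor's OrderedSet of filtered node ids
log.debug("ComboKernels: %d template nodes are filtered", len(filtered_ids))
```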

Test Plan: CI

Differential Revision: D71512220

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149575
Approved by: https://github.com/ColinPeppler
2025-03-20 06:09:44 +00:00
595293316d [MPS/Inductor] Add support for modified_bessel_k0. (#149593)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149593
Approved by: https://github.com/jansel
2025-03-20 04:51:44 +00:00
9a184b1074 Monkeypatch fake mode so it errors on invalid custom ops (#149410)
Internal version: [D71294776](https://www.internalfb.com/diff/D71294776)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149410
Approved by: https://github.com/gmagogsfm
2025-03-20 04:50:57 +00:00
fe94d7da1a [Inductor][Optimus] Add move view after cat aten pattern (#149178)
Summary:
Add an aten pattern to move the view/reshape out of split cat, further reducing the number of kernels.

context: https://docs.google.com/document/d/1G2qFcQu1K7VXbz2uPe0CS2aBirnwtwI_B8lxmlBlAPQ/edit?tab=t.0

Test Plan:
### how to enable
Add the following patterns to the post grad
```
        post_grad_fusion_options={
            "normalization_aten_pass": {},
            "move_view_after_cat_aten_pass": {},
        },
```
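
For reference, a hedged usage sketch; the pass names come from the snippet above, and the assumption is that the options are applied through `torch._inductor.config` before compiling.

```python
import torch
import torch._inductor.config as inductor_config

# Enable the post-grad passes named above (assumed entry point for illustration).
inductor_config.post_grad_fusion_options = {
    "normalization_aten_pass": {},
    "move_view_after_cat_aten_pass": {},
}

model = torch.nn.Linear(8, 8)
out = torch.compile(model)(torch.randn(4, 8))
```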

### unit test
```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:split_cat_fx_aten_passes -- test_move_view_after_cat_aten
```

Buck UI: https://www.internalfb.com/buck2/3c5451be-c63a-4794-8d6b-103ecac78905
Test UI: https://www.internalfb.com/intern/testinfra/testrun/6192449704507267

### local reproduce

```
buck2 run mode/opt scripts/shuaiyang:test -- --flow_id 691990503 --use_synthetic_data --optimus
```
https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/mengluy/2025-03-13-20-59-34/trace.json.gz&bucket=gpu_traces

### E2E

baseline

f691990503

proposal

Differential Revision: D71177004

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149178
Approved by: https://github.com/Yuzhen11
2025-03-20 04:07:25 +00:00
95e71765f2 [MPS] nanmedian implementation (#149407)
Implements nanmedian on MPS. This implementation only covers `torch.nanmedian(tensor)` without `dim` and `keepdim`; support for `dim` and `keepdim` will come in a follow-up.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149407
Approved by: https://github.com/malfet
2025-03-20 03:50:26 +00:00
cca46a0b6f Fix score_mod.py dynamic max autotune (#148991)
python benchmarks/transformer/score_mod.py --dynamic --max-autotune

Previously this would crash with:

```
"/home/bobren/local/a/pytorch/torch/_inductor/select_algorithm.py", line 2306, in key_of
    node.get_device().type,

```

but with this change it no longer does.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148991
Approved by: https://github.com/drisspg
2025-03-20 03:28:51 +00:00
bc1b8730a4 [Windows][inductor] fix blank space break windows file path (#149388)
Fixes #149310

From the original error message:
```cmd
Command:
cl /I C:/Program Files/Python310/Include /I c:/code/.env/lib/site-packages/torch/include /I c:/code/.env/lib/site-packages/torch/include/torch/csrc/api/include /I c:/code/.env/lib/site-packages/torch/include/TH /I c:/code/.env/lib/site-packages/torch/include/THC /D TORCH_INDUCTOR_CPP_WRAPPER /D STANDALONE_TORCH_HEADER /D C10_USING_CUSTOM_GENERATED_MACROS /DLL /MD /O2 /std:c++20 /wd4819 /wd4251 /wd4244 /wd4267 /wd4275 /wd4018 /wd4190 /wd4624 /wd4067 /wd4068 /EHsc /openmp /openmp:experimental C:/Users/user/AppData/Local/Temp/torchinductor_user/ou/coubnfnqsm2gbdzdytufv46jotd6sxsnnhgldiw45pl5yjq5nbvz.cpp /LD /FeC:/Users/user/AppData/Local/Temp/torchinductor_user/ou/coubnfnqsm2gbdzdytufv46jotd6sxsnnhgldiw45pl5yjq5nbvz.pyd /link /LIBPATH:c:/code/.env/Scripts/libs /LIBPATH:c:/code/.env/lib/site-packages/torch/lib torch.lib torch_cpu.lib torch_python.lib sleef.lib

Output:
Microsoft (R) C/C++ Optimizing Compiler Version 19.43.34809 for x86
Copyright (C) Microsoft Corporation.  All rights reserved.

cl : Command line warning D9025 : overriding '/openmp' with '/openmp:experimental'
cl : Command line warning D9024 : unrecognized source file type 'Files/Python310/Include', object file assumed
coubnfnqsm2gbdzdytufv46jotd6sxsnnhgldiw45pl5yjq5nbvz.cpp
C:/Users/user/AppData/Local/Temp/torchinductor_user/ou/coubnfnqsm2gbdzdytufv46jotd6sxsnnhgldiw45pl5yjq5nbvz.cpp(21): fatal error C1083: Cannot open include file: 'Python.h': No such file or directory
```
Python is installed under the `C:/Program Files/Python310` path, and the blank space breaks the file path.

Solution:
Wrap Windows file paths in quotes; after that:
```cmd
cl /I "C:/Users/Xuhan/.conda/envs/new_build/Include" /I "C:/Users/Xuhan/.conda/envs/new_build/lib/site-packages/torch/include" /I "C:/Users/Xuhan/.conda/envs/new_build/lib/site-packages/torch/include/torch/csrc/api/include"  /D TORCH_INDUCTOR_CPP_WRAPPER /D STANDALONE_TORCH_HEADER /D  C10_USING_CUSTOM_GENERATED_MACROS /D CPU_CAPABILITY_AVX512  /DLL /MD /O2 /std:c++20 /wd4819 /wd4251 /wd4244 /wd4267 /wd4275 /wd4018 /wd4190 /wd4624 /wd4067 /wd4068 /EHsc /openmp /openmp:experimental  C:/Users/Xuhan/AppData/Local/Temp/tmp1wsj0m8r/za/czarp3ly5c22ge3hydvnzvad4cjimyr3hkwvofodxqffgil7frfd.cpp  /arch:AVX512  /FeC:/Users/Xuhan/AppData/Local/Temp/tmp1wsj0m8r/za/czarp3ly5c22ge3hydvnzvad4cjimyr3hkwvofodxqffgil7frfd.pyd /LD /link /LIBPATH:"C:/Users/Xuhan/.conda/envs/new_build/libs" /LIBPATH:"C:/Users/Xuhan/.conda/envs/new_build/lib/site-packages/torch/lib"  "torch.lib" "torch_cpu.lib" "torch_python.lib" "sleef.lib"
```
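
A minimal Python illustration of the idea (not the exact cpp_builder change): quoting include paths makes `cl.exe` treat a path containing a space as a single argument.

```python
# Quote each include dir so the compiler sees one argument per path.
include_dirs = [r"C:/Program Files/Python310/Include"]
args = " ".join(f'/I "{d}"' for d in include_dirs)
print(f"cl {args} kernel.cpp")
# cl /I "C:/Program Files/Python310/Include" kernel.cpp
```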

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149388
Approved by: https://github.com/jansel
2025-03-20 03:10:30 +00:00
45a879e55b xpu: improve error handling and reporting in XPU cmake files (#149353)
For #149075

* Add a graceful cmake error instead of a cryptic one if the SYCL runtime is not found:
```
The link interface of target "c10_xpu" contains:

    torch::xpurt

  but the target was not found.
```
* Suppress the unclear cmake error when the SYCL compiler is not available and the subsequent version query fails:
```
CMake Error at /home/dvrogozh/pytorch/torch/share/cmake/Caffe2/FindSYCLToolkit.cmake:37 (string):
  string sub-command REGEX, mode REPLACE needs at least 6 arguments total to
  command.
```

CC: @gujinghui @EikanWang @fengyuan14 @guangyey @jgong5
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149353
Approved by: https://github.com/guangyey, https://github.com/malfet
2025-03-20 02:00:39 +00:00
3b7bd6c63d Fix dynamic shapes reordering bug (#149528)
When we create constraints, we look at the ordering of kwargs according to the model signature, but when we trace, we use the ordering in which the user passed their kwargs. As a result, constraints and dynamic shapes end up in different orders, causing issues when they have different dynamic tensor specs.
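
A hedged illustration of the mismatch scenario, with a toy module assumed for the example: kwargs are passed in a different order than the signature declares them, each with its own dynamic spec.

```python
import torch
from torch.export import Dim, export

class M(torch.nn.Module):
    def forward(self, a, b):
        return a + b.sum()

a, b = torch.randn(4, 3), torch.randn(5, 3)
ep = export(
    M(),
    (),
    kwargs={"b": b, "a": a},  # passed out of signature order
    dynamic_shapes={"a": {0: Dim("da")}, "b": {0: Dim("db")}},
)
```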

Differential Revision: [D71478578](https://our.internmc.facebook.com/intern/diff/D71478578)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149528
Approved by: https://github.com/ydwu4
2025-03-20 01:57:44 +00:00
1e30192b19 [logging] Add python version to dynamo_compile table (#149419)
Summary: This adds a version field like the following: `3.10.9+fb (3.10:1dd9be6, May  4 2022, 01:23:45) [Clang 15.0.7 (mononoke://mononoke.internal.tfbnw.net/fbsource 5d1601b0eed7426ac`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149419
Approved by: https://github.com/c00w
2025-03-20 01:48:34 +00:00
1442230a26 Supporting non-tensor-data write_size in planner write items. (#149434)
Summary:
1. The current write item structure does not contain the amount of data that needs to be written.
2. The planner.item already has a size primitive, 'tensor_storage_size' (https://fburl.com/code/7a0gsmw7), but only for tensors.
3. Right now, the only way the writer layer gets hold of this property (for non-tensor data) is to:

- first do a lookup into the actual tensor/bytes
- then calculate the nbytes.

This change introduces a way to capture non-tensor data size within a write-plan item.

Reviewed By: daulet-askarov

Differential Revision: D70497442

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149434
Approved by: https://github.com/MeetVadakkanchery
2025-03-20 01:22:05 +00:00
02e21c7854 Fix spelling (#149277)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149277
Approved by: https://github.com/zou3519
2025-03-20 01:02:32 +00:00
826e790696 Revert "ci: Remove mentions and usages of DESIRED_DEVTOOLSET (#149443)"
This reverts commit 95a633c45304755ebdbc08396d9948d34243ddb3.

Reverted https://github.com/pytorch/pytorch/pull/149443 on behalf of https://github.com/izaitsevfb due to fails lint ([comment](https://github.com/pytorch/pytorch/pull/149443#issuecomment-2738709561))
2025-03-20 00:59:41 +00:00
95a633c453 ci: Remove mentions and usages of DESIRED_DEVTOOLSET (#149443)
This is a remnant of our migration to manylinux2_28 we should remove
these since all of our binary builds are now built with cxx11_abi

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149443
Approved by: https://github.com/izaitsevfb, https://github.com/atalman
2025-03-20 00:39:02 +00:00
cyy
29c4f2c07a Remove Ubuntu 18.04 scripts (#149479)
Ubuntu 18.04 reached end of life on May 31, 2023. This code isn't used now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149479
Approved by: https://github.com/malfet
2025-03-20 00:13:40 +00:00
6cbf97ede8 [ROCm] enable HIPMallocAsyncAllocator (#149145)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149145
Approved by: https://github.com/izaitsevfb

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-03-19 23:42:35 +00:00
2be97c7257 Update nightly s390x builds (#149337)
This change should fix new nightly build failures for s390x.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149337
Approved by: https://github.com/malfet
2025-03-19 23:27:14 +00:00
c9de76a1e4 Modify cuda aarch64 install for cudnn and nccl. Cleanup aarch64 cuda 12.6 docker (#149540)
1. Use NCCL_VERSION=v2.26.2-1. Fixes the nccl cuda aarch64 related failure seen here: https://github.com/pytorch/pytorch/actions/runs/13955856471/job/39066681549?pr=149443 . After landing: https://github.com/pytorch/pytorch/pull/149351
TODO: Follow-up required to unify NCCL definitions across the x86 and aarch64 builds

2. Cleanup: remove older CUDA versions for aarch64 builds. CUDA 12.6 was removed by: https://github.com/pytorch/pytorch/pull/148895
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149540
Approved by: https://github.com/seemethere, https://github.com/malfet, https://github.com/nWEIdia
2025-03-19 23:20:05 +00:00
5005e1bc47 support multinomial for dynamic num_samples (#149463)
Test Plan: added test

Fixes #149048
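
A hedged sketch of the use case this enables, with illustrative names: `num_samples` derived from a dynamic input size inside a compiled function.

```python
import torch

@torch.compile(dynamic=True)
def sample(probs, queries):
    n = queries.shape[0]  # dynamic num_samples
    return torch.multinomial(probs, n, replacement=True)

out = sample(torch.rand(10), torch.randn(7))
```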

Differential Revision: D71434914

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149463
Approved by: https://github.com/pianpwk
2025-03-19 23:15:29 +00:00
cc469aaf3b [CI][docker] Remove vulkan and swiftshader from docker builds (#149530)
Probably should have been removed with https://github.com/pytorch/pytorch/pull/139354/files?

Should I also remove mentions of them from build.sh and test.sh?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149530
Approved by: https://github.com/malfet
2025-03-19 23:13:27 +00:00
88c2fe533f [MPS] Add modified_bessel_k0 support to eager. (#149563)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149563
Approved by: https://github.com/malfet
2025-03-19 23:10:55 +00:00
bc86b6c55a Update ExecuTorch pin update (#149539)
Latest commit in https://hud.pytorch.org/hud/pytorch/executorch/viable%2Fstrict/1?per_page=50

Follow-up to https://github.com/pytorch/pytorch/issues/144480#issuecomment-2731150636

Also, need to incorporate change from https://github.com/pytorch/executorch/pull/8817

Test Plan:

Monitor  linux-jammy-py3-clang12-executorch test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149539
Approved by: https://github.com/larryliu0820
2025-03-19 22:29:59 +00:00
6974ba84f6 [ci][anaconda] Remove conda from linter docker images (#147789)
Remove conda usage from the linter docker images

Handles part of https://github.com/pytorch/pytorch/issues/148110
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147789
Approved by: https://github.com/atalman
2025-03-19 21:56:44 +00:00
a11538aa46 [GPU Snapshot] Add Clear History Flag (#149352)
Summary:
Oftentimes, users complain that a bunch of extra events are prepended to their desired GPU snapshot. This happens because they usually attach an OOM logger without realizing it, and when they go to collect the actual snapshot, it includes all of the OOM logger contents. Since OOM logging and regular snapshots use the same backend, we currently don't have the infra in place to split these snapshots.

As a solution, we add a flag to the snapshot frontend to clear out the history when starting auto-trace memory history recording.

A more thorough solution would be to have users pass in a handle and keep snapshots per handle to separate the events. However, this would likely be complicated and more work than it is worth, as we would have to change the callbacks in the caching allocator and pass these objects between Python and C++.
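
A hedged usage sketch; the `clear_history` keyword name is an assumption for illustration based on this description, not a confirmed signature.

```python
import torch

# `clear_history` is assumed here: drop prior (e.g. OOM-logger) events when
# starting a fresh recording pass.
torch.cuda.memory._record_memory_history(max_entries=100000, clear_history=True)
# ... run the workload of interest ...
snapshot = torch.cuda.memory._snapshot()
```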

Test Plan:
See diff below

Differential Revision: D71159720

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149352
Approved by: https://github.com/eqy, https://github.com/aaronenyeshi
2025-03-19 21:44:20 +00:00
e1d143cb7b Revert "[ROCm] enable HIPMallocAsyncAllocator (#149145)"
This reverts commit ee1a2b7810126258ce64d1e22b59fae81a3f7bcb.

Reverted https://github.com/pytorch/pytorch/pull/149145 on behalf of https://github.com/izaitsevfb due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/149145#issuecomment-2738115728))
2025-03-19 21:12:13 +00:00
37bb7f79c6 [ROCm][TunableOp] Unit test for TunableOp BLAS logging. (#148982)
Add unit test for new TunableOp BLAS logging feature.

Requires this PR to be merged in first: https://github.com/pytorch/pytorch/pull/148979

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148982
Approved by: https://github.com/jeffdaily
2025-03-19 20:57:19 +00:00
71daeddde2 [MTIA] Ensure correct stream behavior for input_buffer add autograd on MTIA (#149433)
Test Plan: CI

Differential Revision: D71414498

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149433
Approved by: https://github.com/albanD
2025-03-19 20:19:18 +00:00
fae79e91a0 Remove torch.export.export_for_inference (#149078)
Summary: Remove torch.export.export_for_inference; it is redundant and can always be replaced with torch.export.export_for_training() + run_decompositions().
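
A minimal sketch of the suggested replacement, using a toy module:

```python
import torch
from torch.export import export_for_training

class M(torch.nn.Module):
    def forward(self, x):
        return torch.relu(x)

ep = export_for_training(M(), (torch.randn(2, 3),))
ep = ep.run_decompositions()  # lower the training IR to the decomposed IR
```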

Test Plan: unit tests

Differential Revision: D71069057

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149078
Approved by: https://github.com/tugsbayasgalan
2025-03-19 19:57:18 +00:00
05fee772e5 Fix with effect lowering for list return type (#149510)
Summary: For `torch.ops.higher_order.with_effects`'s lowering, we should not extract the items out of a list (i.e. `*result` vs `result`). The `get_attr` nodes consider the result to be in the list format.

Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:torchbind -- -r test_torchbind_aot_compile

buck run fbcode//mode/dev-nosan //caffe2/test/inductor:torchbind -- -r list_return

buck run //caffe2/torch/fb/sparsenn:sigrid_test -- -r test_transform_torch_bind # tested together with D70013257

buck run fbcode//mode/dev-nosan //caffe2/test:test_export  -- -r test_custom_obj
```

Reviewed By: angelayi

Differential Revision: D71346024

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149510
Approved by: https://github.com/zou3519
2025-03-19 19:35:08 +00:00
842a072fd3 [codemod] Fix clang-tidy command line doc comments (#149524)
Summary:
Fixes the comments to match the latest updates to the checked-in tools.

Search/replace applied in this order:
* `# /fbsource/tools/lint/clangtidy/clang-tidy-platform010 -list-checks` -> `# ~/fbsource/tools/lint/clangtidy/clang-tidy-platform010-clang-17 -list-checks`
* `# ~/fbsource/tools/lint/clangtidy/clang-tidy-platform010 -list-checks` -> `# ~/fbsource/tools/lint/clangtidy/clang-tidy-platform010-clang-17 -list-checks`
* `fbsource/tools/lint/clangtidy/clang-tidy-platform010 -list-checks` -> `fbsource/tools/lint/clangtidy/clang-tidy-platform010-clang-17 -list-checks`

Test Plan: CI

Reviewed By: johnkearney

Differential Revision: D71431516

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149524
Approved by: https://github.com/janeyx99
2025-03-19 19:22:11 +00:00
96828a2155 [export] refactor DimHints for type errors (#149424)
Differential Revision: D71414367

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149424
Approved by: https://github.com/justinchuby, https://github.com/avikchaudhuri
2025-03-19 18:51:07 +00:00
9ec9f4740c [export] fix stft decomp and making it consistent with cpp impl. (#149232)
Summary: We change the fake impl of stft to follow its cpp implementation more closely ([here](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/SpectralOps.cpp#L951-L963)),

where `n_frames = 1 + (len - n_fft) / hop_length;` is also an integer division.
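
Illustrative arithmetic only, with made-up sizes, showing why the floor division matters:

```python
# Frame count uses integer (floor) division, matching the referenced C++ impl.
signal_len, n_fft, hop_length = 16000, 400, 160
n_frames = 1 + (signal_len - n_fft) // hop_length  # 98, not 98.5
```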

Test Plan: Existing tests and buck2 build --flagfile fbcode//mode/dev fbcode//executorch/examples/models/fb/llama4:speech_transform.pte

Differential Revision: D71209142

Edit: we kept the original path unchanged.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149232
Approved by: https://github.com/jackzhxng
2025-03-19 18:40:35 +00:00
94d761fbf0 [AOTI][reland] Update test runner to use the new APIs (#149412)
Summary: Reland https://github.com/pytorch/pytorch/pull/147105. Switch to the newer aoti_compile_and_package APIs. Some tests still use the legacy APIs; we will follow up with internal test refactoring.

Differential Revision: [D71470265](https://our.internmc.facebook.com/intern/diff/D71470265)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149412
Approved by: https://github.com/yushangdi
2025-03-19 17:56:44 +00:00
d686d04c2f [custom_ops][perf] Move expensive pytree traversals of tensors to C++ (#148555)
(benchmark for 1 call)

Before:
```
└─ $ python ~/task_custom_ops_perf/test_custom_ops_perf_repro.py
DO_BENCH mutate: 77.72445678710938 us PROFILE:/home/ivankobzarev/task_custom_ops_perf/mutate.json
DO_BENCH no_mutate: 64.61143493652344 us PROFILE:/home/ivankobzarev/task_custom_ops_perf/no_mutate.json
DO_BENCH direct_mutate: 11.682510375976562 us PROFILE:/home/ivankobzarev/task_custom_ops_perf/direct_mutate.json
DO_BENCH direct_no_mutate: 18.596649169921875 us PROFILE:/home/ivankobzarev/task_custom_ops_perf/direct_no_mutate.json
```

After:
```
└─ $ python ~/task_custom_ops_perf/test_custom_ops_perf_repro.py
DO_BENCH mutate: 47.6837158203125 us PROFILE:/home/ivankobzarev/task_custom_ops_perf/mutate.json
DO_BENCH no_mutate: 31.709671020507812 us PROFILE:/home/ivankobzarev/task_custom_ops_perf/no_mutate.json
DO_BENCH direct_mutate: 10.967254638671875 us PROFILE:/home/ivankobzarev/task_custom_ops_perf/direct_mutate.json
DO_BENCH direct_no_mutate: 10.728836059570312 us PROFILE:/home/ivankobzarev/task_custom_ops_perf/direct_no_mutate.json
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148555
Approved by: https://github.com/zou3519
2025-03-19 17:16:57 +00:00
518563d6ef Add release branch push triggers to rocm-mi300.yml (#149517)
When we added the rocm-mi300.yml earlier this year, we had lower capacity and we were just pipecleaning the workflow, so we set the trigger to only respond to pushes to main branch. But now we have more stability as well as capacity, and we would really like to ensure that the release branch is being tested on MI300s as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149517
Approved by: https://github.com/atalman
2025-03-19 16:14:09 +00:00
e98afa0f89 [Sigmoid] Remove magic method in CapabilityBasedPartitioner (#149400)
Summary: As title.

Test Plan: CI

Differential Revision: D70575197

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149400
Approved by: https://github.com/jfix71
2025-03-19 16:02:43 +00:00
4df66e0b7f Pin auditwheel to 6.2.0 (#149471)
Observing aarch64 failure in nightly:
https://github.com/pytorch/pytorch/actions/runs/13917778961/job/38943911228

Similar to: https://github.com/pytorch/vision/pull/8982

```
2025-03-18T08:44:58.4128744Z Repairing Wheel with AuditWheel
2025-03-18T08:44:58.5440988Z INFO:auditwheel.main_repair:Repairing torch-2.8.0.dev20250318+cpu-cp39-cp39-linux_aarch64.whl
2025-03-18T08:45:20.3393288Z Traceback (most recent call last):
2025-03-18T08:45:20.3393732Z   File "/opt/python/cp39-cp39/bin/auditwheel", line 8, in <module>
2025-03-18T08:45:20.3394115Z     sys.exit(main())
2025-03-18T08:45:20.3394559Z   File "/opt/_internal/cpython-3.9.21/lib/python3.9/site-packages/auditwheel/main.py", line 53, in main
2025-03-18T08:45:20.3395064Z     result: int | None = args.func(args, p)
2025-03-18T08:45:20.3395626Z   File "/opt/_internal/cpython-3.9.21/lib/python3.9/site-packages/auditwheel/main_repair.py", line 203, in execute
2025-03-18T08:45:20.3396163Z     out_wheel = repair_wheel(
2025-03-18T08:45:20.3396657Z   File "/opt/_internal/cpython-3.9.21/lib/python3.9/site-packages/auditwheel/repair.py", line 84, in repair_wheel
2025-03-18T08:45:20.3397184Z     raise ValueError(msg)
2025-03-18T08:45:20.3397620Z ValueError: Cannot repair wheel, because required library "libarm_compute.so" could not be located
2025-03-18T08:45:20.3678843Z Traceback (most recent call last):
2025-03-18T08:45:20.3679267Z   File "/pytorch/.ci/aarch64_linux/aarch64_wheel_ci_build.py", line 236, in <module>
2025-03-18T08:45:20.3680988Z     pytorch_wheel_name = complete_wheel("/pytorch/")
2025-03-18T08:45:20.3681449Z   File "/pytorch/.ci/aarch64_linux/aarch64_wheel_ci_build.py", line 141, in complete_wheel
2025-03-18T08:45:20.3681976Z     check_call(["auditwheel", "repair", f"dist/{wheel_name}"], cwd=folder)
2025-03-18T08:45:20.3682860Z   File "/opt/python/cp39-cp39/lib/python3.9/subprocess.py", line 373, in check_call
2025-03-18T08:45:20.3683308Z     raise CalledProcessError(retcode, cmd)
2025-03-18T08:45:20.3684034Z subprocess.CalledProcessError: Command '['auditwheel', 'repair', 'dist/torch-2.8.0.dev20250318+cpu-cp39-cp39-linux_aarch64.whl']' returned non-zero exit status 1.
2025-03-18T08:45:20.3790063Z ##[error]Process completed with exit code 1.
2025-03-18T08:45:20.3862012Z ##[group]Run pytorch/test-infra/.github/actions/teardown-linux@main
2025-03-18T08:45:20.3862448Z with:
```

Please note aarch64 CUDA failures are related to: https://github.com/pytorch/pytorch/pull/149351
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149471
Approved by: https://github.com/malfet
2025-03-19 15:55:05 +00:00
1bf443e2f2 [aoti x with_effect token] Unbacked symint and register lowering (#147656)
Differential Revision: D70022208

- When resolving unbacked symints in ExternKernel for with_effect, we need to ignore the first item in the binding path, because the `example_output` doesn't contain the effect token, but the binding paths do.
- Similarly, `node.meta["val"]` contains the effect token, so when we compute_unbacked_bindings, we need to remove that effect token

- For `torch.ops.higher_order.with_effects`'s lowering, we should not extract the items out of an list (i.e. `*result` vs `result`). The `get_attr` nodes consider the result to be in the list format.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147656
Approved by: https://github.com/angelayi, https://github.com/zou3519
2025-03-19 14:38:30 +00:00
2fcfae72b4 async fx compile (#146135)
Adds the ability to run the selected out-of-process fx compile scheme in async mode - where we kick off the compile and then run eagerly until the compile is finished.

Added a test which runs a tiny model in a loop making sure that we execute it both eagerly and then compiled.

Differential Revision: [D71135546](https://our.internmc.facebook.com/intern/diff/D71135546)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146135
Approved by: https://github.com/jamesjwu, https://github.com/jansel
2025-03-19 14:07:51 +00:00
1dce65a82c Fix the invalid link for FX (#149289)
As the title stated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149289
Approved by: https://github.com/zou3519
2025-03-19 14:03:18 +00:00
97910b6c00 Update s390x docker image (#148444)
New releases of ml_dtypes build successfully on s390x, so skip building the patched old release.
Unpin the grpcio version.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148444
Approved by: https://github.com/seemethere
2025-03-19 12:25:10 +00:00
7ca296f564 Document patched podman build for s390x runners (#147618)
Podman patches from upstream are needed to resolve a couple of issues hit when using it. Document the automated build of podman with those patches applied.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147618
Approved by: https://github.com/seemethere
2025-03-19 12:25:05 +00:00
cfbeaf7b7e Improve docker build cleanup on s390x runners (#149316)
Currently it sometimes still leaves a couple of processes running.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149316
Approved by: https://github.com/seemethere
2025-03-19 10:10:44 +00:00
466d5295c1 Fixed abnormal behavior of LazyLinear when using LazyLinear and load_state together (#147599)
Update Points:
- Update the logic of ``initialize_parameters``
- Add new testcases

Related issue:
https://github.com/pytorch/pytorch/issues/147389
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147599
Approved by: https://github.com/mikaylagawarecki
2025-03-19 10:01:12 +00:00
8bf3f3fc43 [c10d] Add a collective time estimator for NCCL comms (#149343)
We want to upstream this feature from the new NCCL so that users can estimate comm time.

Resolves #147753

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149343
Approved by: https://github.com/kwen2501
2025-03-19 07:54:02 +00:00
b963d96bad [Torchscript] Add a flag to use mangled names instead of demangled (#148906)
Summary: Optionally keep mangled names when expanding torchscript stacks

Test Plan:
```
buck2 build mode/opt //scripts/rihams/LearnPyTorch:torch_script_generate --show-full-output

/data/users/rihams/fbsource/buck-out/v2/gen/fbcode/0bd9d136228ad8a7/scripts/rihams/LearnPyTorch/__torch_script_generate__/torch_script_generate.par

buck2 build mode/opt //scripts/rihams/LearnPyTorch:torch_script_execute --show-full-output
```

- With `--torch_jit_expanded_stacks_mangled` Flag:

/data/users/rihams/fbsource/buck-out/v2/gen/fbcode/ef35e45045e8164c/scripts/rihams/LearnPyTorch/__torch_script_execute__/torch_script_execute fbcode/model.pt  --torch_jit_expanded_stacks_mangled --torch_jit_enable_expanded_stacks

https://fburl.com/scuba/strobelight_function_tracer/8die4rvm

{F1975933247}

Without Flag:

/data/users/rihams/fbsource/buck-out/v2/gen/fbcode/ef35e45045e8164c/scripts/rihams/LearnPyTorch/__torch_script_execute__/torch_script_execute ./model.pt   --torch_jit_enable_expanded_stacks

https://fburl.com/scuba/strobelight_function_tracer/x3nladpf

 {F1975933268}

Reviewed By: bbus

Differential Revision: D70905872

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148906
Approved by: https://github.com/zdevito
2025-03-19 07:53:02 +00:00
3e78c9e967 [ROCm][Windows] Disable hipSPARSE and CK declarations and remove references for Windows (#149195)
This PR removes references to `hipSPARSE` and `ck` functions and disables declarations which are not supported on Windows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149195
Approved by: https://github.com/jeffdaily

Co-authored-by: Michal Gallus <Michal.Gallus@amd.com>
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-03-19 07:30:53 +00:00
2cb42f26c1 Remove test_get_model_state_dict_del_memory (#149460)
test_get_model_state_dict_del_memory gets unexpected memory usage, leading to test failures.
Remove the test for now to avoid blocking the others.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149460
Approved by: https://github.com/fegin
2025-03-19 07:06:46 +00:00
e8a35eb7da Add Missing Communication collectives (#147379)
----

- reduce_add_coalesced
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147379
Approved by: https://github.com/mikaylagawarecki
2025-03-19 06:59:04 +00:00
981807cfcb [Inductor][Optimus] split cat aten pass (#149027)
Summary:
We add an aten pattern to optimize big cat nodes with an arbitrary order of inputs, to support APS jobs.

context: https://docs.google.com/document/d/1G2qFcQu1K7VXbz2uPe0CS2aBirnwtwI_B8lxmlBlAPQ/edit?tab=t.0

Test Plan:
### how to enable
Add the following patterns to the post grad
```
        post_grad_fusion_options={
            "normalization_aten_pass": {},
            "split_cat_aten_pass": {"threshold_to_cat": 10},
        },
```
You can tune threshold_to_cat to achieve the best performance. If nothing is given, the default value of 10 will be used.

### unit test
```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:split_cat_fx_aten_passes -- test_split_cat_post_grad
```

Buck UI: https://www.internalfb.com/buck2/9e52168d-c107-4be8-a46b-b9d239f5c50d
Test UI: https://www.internalfb.com/intern/testinfra/testrun/17732923605061752
Network: Up: 112KiB  Down: 132KiB  (reSessionID-915796e0-4a8f-486a-9f63-afb1e191d24a)
Executing actions. Remaining     0/3                                                                                   1.0s exec time total
Command: test.     Finished 2 local
Time elapsed: 4:57.9s
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0

### E2E

baseline

f691990503

proposal

Differential Revision: D71017436

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149027
Approved by: https://github.com/Yuzhen11
2025-03-19 06:01:05 +00:00
f123f2c077 [ca] fix dce for side-effects (#149336)
The AOT backward could have contained side-effectful ops, so we can't DCE them. Have CA also call the default fx.Node.is_impure, which covers some of the existing cases.
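
A minimal sketch of the general rule (not compiled autograd's internal pass): dead-code elimination must keep nodes that `fx.Node.is_impure()` reports as impure.

```python
import torch.fx as fx

def dce(graph: fx.Graph) -> None:
    # Walk bottom-up and only erase unused nodes that are safe to drop.
    for node in reversed(list(graph.nodes)):
        if node.op in ("output", "placeholder") or node.users:
            continue
        if not node.is_impure():
            graph.erase_node(node)
```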

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149336
Approved by: https://github.com/jansel
2025-03-19 05:56:47 +00:00
ddb076591d [executorch hash update] update the pinned executorch hash (#147422)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147422
Approved by: https://github.com/pytorchbot
2025-03-19 05:22:35 +00:00
42bd4a09a3 [MTIA] Add _mtia_getCurrentRawStream to MTIA module (#149436)
Summary: The FlexAttention path generates code that uses this function. Although streams are not used yet in Triton-MTIA, adding this now lets us avoid branching just for MTIA and generating different code.

Test Plan: CI

Reviewed By: chaos5958

Differential Revision: D70072057

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149436
Approved by: https://github.com/chaos5958
2025-03-19 05:17:51 +00:00
ef93cdfb8a [audio hash update] update the pinned audio hash (#149467)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149467
Approved by: https://github.com/pytorchbot
2025-03-19 04:28:57 +00:00
ee1a2b7810 [ROCm] enable HIPMallocAsyncAllocator (#149145)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149145
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-03-19 03:59:55 +00:00
20874a1f46 debug ival swap (#149206)
Summary:
Recall that we use "ivals" to track intermediate values of mutations during unflattening. Previously, for each such intermediate value, we would create a hidden shared attribute that would be updated / read by respective submodules.

Unfortunately this scheme doesn't work when some but not all of those submodules are swapped out. This is because the swapped in submodules have no knowledge of these hidden attributes. Thus the submodules that are not swapped out end up reading / updating dangling state.

This PR does away with these hidden attributes. Instead, we directly read the underlying buffer or placeholder that was updated, and update those underlying buffers and placeholders in place. This makes the graphs look much closer to their eager origins.

Test Plan: added some tests, ensured existing tests pass

Differential Revision: D71203469

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149206
Approved by: https://github.com/tugsbayasgalan
2025-03-19 03:43:30 +00:00
14dc6e732d Cache the get_device_module result (#149207)
Summary: As title.
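
A hedged sketch of the general idea, not the exact code path: memoize the module lookup so repeated calls avoid recomputing it.

```python
import functools
import torch

@functools.lru_cache(maxsize=None)
def get_device_module(device_type: str = "cuda"):
    # Cached after the first call for a given device type.
    return getattr(torch, device_type)
```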

Test Plan: OSS CIs.

Reviewed By: chaos5958

Differential Revision: D71084180

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149207
Approved by: https://github.com/jansel
2025-03-19 03:20:38 +00:00
01a57981aa [export] Add TracingContext (#149294)
TracingContext is added to all tracing locations -- in torch.export this is where we call make_fx (for training IR) and aot_export_module (for inference IR), and in run_decompositions where we call aot_export_module

Differential Revision: [D71298927](https://our.internmc.facebook.com/intern/diff/D71298927)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149294
Approved by: https://github.com/ydwu4
2025-03-19 03:11:08 +00:00
a3c286677b [compile] Switch off inference mode during compilation (#149321)
This PR does the following (a minimal user-facing sketch follows the list):
* Turns `inference_mode` off and uses `no_grad` for `convert_frame` if inference_mode is on globally.
* Turns off inference_mode for fake tensor prop. This ensures that converting a real inference tensor to a fake tensor removes the inference-ness.
* Graph breaks on is_inference and is_inference_mode_enabled.
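
A minimal user-facing sketch of the situation these changes address: calling a compiled function while `torch.inference_mode()` is active.

```python
import torch

@torch.compile
def f(x):
    return x.sin() + 1

with torch.inference_mode():
    y = f(torch.randn(8))
```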

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149321
Approved by: https://github.com/jansel, https://github.com/zou3519
2025-03-19 02:45:27 +00:00
04e251a7dd [AOTI] Add num_runners to AOTIModelPackageLoader (#149364)
Summary: AOTIModelContainerRunner takes a num_runners argument for multi-threaded inference, but AOTIModelPackageLoader forgot to take the same parameter, although its run() API already expects to take an optional cudaStream_t parameter for multi-threaded inference.

Differential Revision: [D71357418](https://our.internmc.facebook.com/intern/diff/D71357418)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149364
Approved by: https://github.com/angelayi
2025-03-19 02:28:06 +00:00
536c0c7a47 [codemod][lowrisk] Remove unused exception parameter from caffe2/aten/src/ATen/cuda/CUDABlas.cpp (#149328)
Summary:
`-Wunused-exception-parameter` has identified an unused exception parameter. This diff removes it.

This:
```
try {
    ...
} catch (exception& e) {
    // no use of e
}
```
should instead be written as
```
} catch (exception&) {
```

If the code compiles, this is safe to land.

Test Plan: Sandcastle

Reviewed By: dtolnay

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149328
Approved by: https://github.com/Skylion007, https://github.com/eqy
2025-03-19 02:05:33 +00:00
919d54b7b1 Fix format string in ck_gemm_template.h for int64_t variables (#149438)
Summary:
Change %d to %ld in printf format specifier to correctly handle int64_t variables n, m, k.
This fixes compilation errors in HIP builds where the format string didn't match the argument type.

forward fix for D71412006

```
In file included from fbcode/caffe2/aten/src/ATen/native/hip/ck_gemm_bfloat16.hip:4:
fbcode/caffe2/aten/src/ATen/native/hip/ck_gemm_template.h:386:28: error: format specifies type 'int' but the argument has type 'int64_t' (aka 'long') [-Werror,-Wformat]
  385 |         printf("error shape = %d %d %d TRANSA=%d TRANSB=%d \n",
      |                                  ~~
      |                                  %ld
  386 |                         n, m, k,TRANSA, TRANSB);
      |                            ^
fbcode/caffe2/aten/src/ATen/native/hip/ck_gemm_template.h:386:31: error: format specifies type 'int' but the argument has type 'int64_t' (aka 'long') [-Werror,-Wformat]
  385 |         printf("error shape = %d %d %d TRANSA=%d TRANSB=%d \n",
      |                                     ~~
      |                                     %ld
  386 |                         n, m, k,TRANSA, TRANSB);
      |                               ^
fbcode/caffe2/aten/src/ATen/native/hip/ck_gemm_template.h:386:25: error: format specifies type 'int' but the argument has type 'int64_t' (aka 'long') [-Werror,-Wformat]
  385 |         printf("error shape = %d %d %d TRANSA=%d TRANSB=%d \n",
      |                               ~~
      |                               %ld
  386 |                         n, m, k,TRANSA, TRANSB);
      |                         ^
```

Test Plan:
```
buck2 build --flagfile fbcode//mode/opt-amd-gpu fbcode//torchrec/sparse/tests:test_jagged_tensor_gpu
```

Differential Revision: D71418611

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149438
Approved by: https://github.com/ZainRizvi
2025-03-19 01:46:34 +00:00
6bcf9c6ce3 [xnnpack] Expose subgraph symbols (#149397)
Summary: Main XNNPack target code uses symbols from subgraph, so they need to be exported. This was uncovered on macOS, where the symbols were not visible after linking.

Test Plan: CI / used for a macOS build on top of the stack.

Differential Revision: D71315023

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149397
Approved by: https://github.com/digantdesai
2025-03-19 01:14:46 +00:00
11d4438a5f [ROCm][TunableOp] More TF32 support. (#149088)
This PR includes additional enhancements to TF32 support in TunableOp.
- OpSignature now differentiates between float32 and tf32 data types.
- Offline tuning now supports TF32.
- Unit tests for online and offline tuning of TF32.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149088
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-03-19 00:26:20 +00:00
268de64005 [ROCm][Windows] Enable torchvision build with ROCm on Windows (#147382)
- Updated HIP flags for Windows (removed non Windows flags on Windows case, added runtime library)
- Set hipcc call for Windows case
- Removed CUDA flags (not used in ROCm) on Windows
- Updated Windows compiler (added case when using ROCm on Windows)
- Fixed path issue in hipify_python

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147382
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-03-18 23:37:05 +00:00
61a64c20c4 [MPSInductor] Move threadfence at the right location (#149437)
Not sure how it worked in the past, but the fence should come before the first read from shared memory, not after it.
This bug was exposed by https://github.com/pytorch/pytorch/pull/148969, which removed an unnecessary barrier before calling `threadgroup_reduce` functions.
Test plan:
```
% python3 generate.py --checkpoint_path checkpoints/stories15M/model.pth --prompt "Once upon a time" --device mps --compile
```
Before this change it produced gibberish; now it works fine.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149437
Approved by: https://github.com/manuelcandales, https://github.com/dcci
2025-03-18 23:27:19 +00:00
ea02aac2ca [export] Update remove runtime asserts pass (#149198)
Test Plan: CI -- Removing asserts should be a noop

Differential Revision: D69566851

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149198
Approved by: https://github.com/pianpwk
2025-03-18 23:07:25 +00:00
5db3a4ac88 [Build] Guard per-op headers in ACLUtils.cpp (#149417)
To fix internal build failures, where per-op headers are not generated.
We really should have lint for something like that.

Test Plan: CI

Reviewed By: izaitsevfb

Differential Revision: D71406882

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149417
Approved by: https://github.com/Skylion007, https://github.com/izaitsevfb
2025-03-18 22:56:29 +00:00
45fec7843d Fix local compilication and hipification (#149384)
Summary:
As title, we need to fix the issue introduced by
https://github.com/pytorch/pytorch/pull/148305

Test Plan: CI and e2e https://docs.google.com/document/d/1Bu-MxJCkN7WaRkKJLVBQvnSp8yV0v3Aeb3Y9R5sjeHw/edit?tab=t.0

Differential Revision: D71373001

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149384
Approved by: https://github.com/desertfire, https://github.com/jansel, https://github.com/chenyang78
2025-03-18 22:56:02 +00:00
0d804dec0f [Profiler/Easy] Pass Overload Names To Kineto (#149333)
Summary: Right now we get overload names and forward them to the Event List frontend for the profiler, but we do not forward anything to Kineto. This diff checks whether there is an overload name for each cpu op and appends it to the name if necessary.

Test Plan: Added test in CI

Differential Revision: D71326670

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149333
Approved by: https://github.com/aaronenyeshi
2025-03-18 22:15:51 +00:00
3b48c72141 [export] Minor refactor to trace.py (#149240)
Minor refactor to trace.py
* Removed `_strict_export_lower_to_aten_ir` in favor of just `_strict_export` and `_non_strict_export`
* Matched the APIs of `_strict_export` and `_non_strict_export`
    * Instead of a `lower_to_aten_callback` which is a callable, or `dispatch_tracing_mode`, both functions take in a `_to_aten_func` which can be either `_export_to_aten_ir_make_fx` or `_export_to_aten_ir`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149240
Approved by: https://github.com/pianpwk
2025-03-18 21:40:30 +00:00
010963032c [ONNX] Create onnx_symbolic (#148905)
In the old exporter we allow users to define a symbolic() method to bypass JIT tracing for a block of logic. We can allow users to do similar things by creating symbolic ops at export.

This PR implements `torch.onnx.ops.symbolic` and `torch.onnx.ops.symbolic_multi_out` to allow users to create onnx nodes symbolically with pt2 & fx. The custom pytorch ops were designed such that the attributes are encoded to be part of a valid fx op. Users provide shape and dtype for the meta function to produce the correct fake tensor during export.

An example is

![image](https://github.com/user-attachments/assets/c62f5f21-e038-456e-a71d-b9a5d0a7cd9d)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148905
Approved by: https://github.com/titaiwangms
2025-03-18 21:32:06 +00:00
d80a70b58a Avoid unnecessary clone in torch.cuda.set_rng_state (#149283)
Clone has a performance issue according to f49c3eb6e6/megatron/core/tensor_parallel/random.py (L77-L80)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149283
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2025-03-18 20:47:57 +00:00
cd5c13d8f0 [hop] Rework the check of Metadata in the functionalization key (#148789)
This PR is a more cosmetic rework of the metadata check performed by some HOPs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148789
Approved by: https://github.com/ydwu4
2025-03-18 20:30:59 +00:00
f06e366532 partitioner: treat inputs with static indices as free to save (#148922)
Fixes https://github.com/pytorch/pytorch/issues/141881

internal xref: https://fb.workplace.com/groups/1075192433118967/posts/1538435030128036/?comment_id=1556782068293332

I tried to make a test case out of the code linked in that github issue. The setup + bad outcome today was as follows:

(1) you have a graph where one of its inputs is a model weight

(2) in the backward, you do some downstream compute on `weight`, `tmp = f(weight)`, where (a) `tmp` is of a smaller size than `weight`, and (b) the compute is trivially fusible into other kernels (so the partitioner thinks it is "free" to recompute)

(3) since `sizeof(tmp) < sizeof(weight)` and the recompute is free, the partitioner decides that it would be strictly better to save `tmp` for backward instead of weight

(4) this is bad: `weight` is a static tensor that sits in GPU memory for the duration of your entire training loop, so saving it for backward has no negative impact on peak memory.  Since we're saving `tmp` instead, we end up unnecessarily increasing peak memory. In particular - the repro involves an autograd.Function in eager that saves the weight for bw, so we end up hitting higher peak memory in compile

The fix I'm trying out in this PR is to tell the partitioner that graph inputs that we know have static addresses (aka parameters) are "free" to save.

Below is the fw/bw graph before my change, where you can see that instead of `primals_2` being saved for backward, we save `t_8` (which involves some low-precision downstream compute on `primals_2` that is only needed in the backward).

```
 ===== Forward graph 0 =====
 /data/users/hirsheybar/checkout2/pytorch/torch/fx/_lazy_graph_module.py class GraphModule(torch.nn.Module):
    def forward(self, primals_1: "bf16[64, 64][64, 1]cuda:0", primals_2: "bf16[64, 64][64, 1]cuda:0", primals_3: "bf16[64][1]cuda:0"):
         # File: /data/users/hirsheybar/checkout2/pytorch/test/dynamo/test_repros.py:6943 in forward, code: out = Fp8LinearFn.apply(
        abs_1: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten.abs.default(primals_1)
        view: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(abs_1, [64, 1, 64]);  abs_1 = None
        amax: "bf16[64, 1][1, 1]cuda:0" = torch.ops.aten.amax.default(view, [-1]);  view = None
        abs_2: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten.abs.default(primals_2)
        view_1: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(abs_2, [64, 1, 64]);  abs_2 = None
        amax_1: "bf16[64, 1][1, 1]cuda:0" = torch.ops.aten.amax.default(view_1, [-1]);  view_1 = None
        _to_copy: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten._to_copy.default(amax, dtype = torch.float32);  amax = None
        clamp: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.clamp.default(_to_copy, 1e-12);  _to_copy = None
        div: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.div.Tensor(clamp, 448.0);  clamp = None
        reciprocal: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.reciprocal.default(div)
        view_2: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(primals_1, [64, 1, 64])
        view_3: "bf16[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.view.default(view_2, [64, 1, 1, 64]);  view_2 = None
        slice_1: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.slice.Tensor(reciprocal, 0, 0, 9223372036854775807);  reciprocal = None
        unsqueeze: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_1, 1);  slice_1 = None
        slice_2: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.slice.Tensor(unsqueeze, 2, 0, 9223372036854775807);  unsqueeze = None
        unsqueeze_1: "f32[64, 1, 1, 1][1, 1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_2, 3);  slice_2 = None
        mul: "f32[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.mul.Tensor(view_3, unsqueeze_1);  view_3 = unsqueeze_1 = None
        view_4: "f32[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(mul, [64, 1, 64]);  mul = None
        view_5: "f32[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(view_4, [64, 64]);  view_4 = None
        _to_copy_1: "f8e4m3fn[64, 64][64, 1]cuda:0" = torch.ops.aten._to_copy.default(view_5, dtype = torch.float8_e4m3fn);  view_5 = None
        _to_copy_2: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten._to_copy.default(amax_1, dtype = torch.float32)
        clamp_1: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.clamp.default(_to_copy_2, 1e-12);  _to_copy_2 = None
        div_1: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.div.Tensor(clamp_1, 448.0);  clamp_1 = None
        reciprocal_1: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.reciprocal.default(div_1)
        view_6: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(primals_2, [64, 1, 64])
        view_7: "bf16[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.view.default(view_6, [64, 1, 1, 64]);  view_6 = None
        slice_3: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.slice.Tensor(reciprocal_1, 0, 0, 9223372036854775807);  reciprocal_1 = None
        unsqueeze_2: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_3, 1);  slice_3 = None
        slice_4: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.slice.Tensor(unsqueeze_2, 2, 0, 9223372036854775807);  unsqueeze_2 = None
        unsqueeze_3: "f32[64, 1, 1, 1][1, 1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_4, 3);  slice_4 = None
        mul_1: "f32[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.mul.Tensor(view_7, unsqueeze_3);  view_7 = unsqueeze_3 = None
        view_8: "f32[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(mul_1, [64, 1, 64]);  mul_1 = None
        view_9: "f32[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(view_8, [64, 64]);  view_8 = None
        _to_copy_3: "f8e4m3fn[64, 64][64, 1]cuda:0" = torch.ops.aten._to_copy.default(view_9, dtype = torch.float8_e4m3fn);  view_9 = None
        t: "f32[1, 64][1, 1]cuda:0" = torch.ops.aten.t.default(div_1);  div_1 = None
        new_ones: "f32[1, 1][1, 1]cuda:0" = torch.ops.aten.new_ones.default(div, [1, 1], pin_memory = False)
        new_ones_1: "f32[1, 1][1, 1]cuda:0" = torch.ops.aten.new_ones.default(t, [1, 1], pin_memory = False)
        t_2: "f8e4m3fn[64, 64][1, 64]cuda:0" = torch.ops.aten.t.default(_to_copy_3);  _to_copy_3 = None
        t_3: "f32[1, 1][1, 1]cuda:0" = torch.ops.aten.t.default(new_ones_1);  new_ones_1 = None
        _scaled_mm: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten._scaled_mm.default(_to_copy_1, t_2, new_ones, t_3, None, None, torch.bfloat16);  _to_copy_1 = t_2 = new_ones = t_3 = None
        view_10: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(_scaled_mm, [64, 1, 64]);  _scaled_mm = None
        view_11: "bf16[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.view.default(view_10, [64, 1, 1, 64]);  view_10 = None
        slice_5: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.slice.Tensor(div, 0, 0, 9223372036854775807);  div = None
        unsqueeze_4: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_5, 1);  slice_5 = None
        slice_6: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.slice.Tensor(unsqueeze_4, 2, 0, 9223372036854775807);  unsqueeze_4 = None
        unsqueeze_5: "f32[64, 1, 1, 1][1, 1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_6, 3);  slice_6 = None
        mul_2: "f32[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.mul.Tensor(view_11, unsqueeze_5);  view_11 = unsqueeze_5 = None
        view_12: "f32[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(mul_2, [64, 1, 64]);  mul_2 = None
        view_13: "f32[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(view_12, [64, 64]);  view_12 = None
        view_14: "f32[1, 64, 64][4096, 64, 1]cuda:0" = torch.ops.aten.view.default(view_13, [1, 64, 64]);  view_13 = None
        view_15: "f32[1, 64, 64, 1][4096, 64, 1, 1]cuda:0" = torch.ops.aten.view.default(view_14, [1, 64, 64, 1]);  view_14 = None
        slice_7: "f32[1, 64][1, 1]cuda:0" = torch.ops.aten.slice.Tensor(t, 0, 0, 9223372036854775807);  t = None
        unsqueeze_6: "f32[1, 1, 64][1, 64, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_7, 1);  slice_7 = None
        slice_8: "f32[1, 1, 64][1, 64, 1]cuda:0" = torch.ops.aten.slice.Tensor(unsqueeze_6, 2, 0, 9223372036854775807);  unsqueeze_6 = None
        unsqueeze_7: "f32[1, 1, 64, 1][1, 64, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_8, 3);  slice_8 = None
        mul_3: "f32[1, 64, 64, 1][4096, 64, 1, 1]cuda:0" = torch.ops.aten.mul.Tensor(view_15, unsqueeze_7);  view_15 = unsqueeze_7 = None
        view_16: "f32[64, 64, 1][64, 1, 1]cuda:0" = torch.ops.aten.view.default(mul_3, [64, 64, 1]);  mul_3 = None
        view_17: "f32[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(view_16, [64, 64]);  view_16 = None
        _to_copy_4: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten._to_copy.default(view_17, dtype = torch.bfloat16);  view_17 = None
        add: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten.add.Tensor(_to_copy_4, primals_3);  _to_copy_4 = primals_3 = None
        t_4: "bf16[64, 64][1, 64]cuda:0" = torch.ops.aten.t.default(primals_2);  primals_2 = None
        clone: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten.clone.default(t_4, memory_format = torch.contiguous_format);  t_4 = None
        t_5: "bf16[1, 64][1, 1]cuda:0" = torch.ops.aten.t.default(amax_1);  amax_1 = None
        view_21: "bf16[1, 1, 64][1, 64, 1]cuda:0" = torch.ops.aten.view.default(t_5, [1, 1, 64]);  t_5 = None
        amax_3: "bf16[1, 1][1, 1]cuda:0" = torch.ops.aten.amax.default(view_21, [-1]);  view_21 = None
        unsqueeze_8: "bf16[1, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(amax_3, 1);  amax_3 = None
        expand: "bf16[1, 64, 1][1, 0, 1]cuda:0" = torch.ops.aten.expand.default(unsqueeze_8, [1, 64, 1])
        clone_1: "bf16[1, 64, 1][64, 1, 1]cuda:0" = torch.ops.aten.clone.default(expand, memory_format = torch.contiguous_format);  expand = None
        view_22: "bf16[64, 1][1, 1]cuda:0" = torch.ops.aten.view.default(clone_1, [64, 1]);  clone_1 = None
        _to_copy_7: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten._to_copy.default(view_22, dtype = torch.float32);  view_22 = None
        clamp_3: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.clamp.default(_to_copy_7, 1e-12);  _to_copy_7 = None
        div_3: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.div.Tensor(clamp_3, 448.0);  clamp_3 = None
        reciprocal_3: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.reciprocal.default(div_3);  div_3 = None
        view_27: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(clone, [64, 1, 64]);  clone = None
        view_28: "bf16[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.view.default(view_27, [64, 1, 1, 64]);  view_27 = None
        slice_11: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.slice.Tensor(reciprocal_3, 0, 0, 9223372036854775807);  reciprocal_3 = None
        unsqueeze_11: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_11, 1);  slice_11 = None
        slice_12: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.slice.Tensor(unsqueeze_11, 2, 0, 9223372036854775807);  unsqueeze_11 = None
        unsqueeze_12: "f32[64, 1, 1, 1][1, 1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_12, 3);  slice_12 = None
        mul_5: "f32[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.mul.Tensor(view_28, unsqueeze_12);  view_28 = unsqueeze_12 = None
        view_29: "f32[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(mul_5, [64, 1, 64]);  mul_5 = None
        view_30: "f32[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(view_29, [64, 64]);  view_29 = None
        _to_copy_8: "f8e4m3fn[64, 64][64, 1]cuda:0" = torch.ops.aten._to_copy.default(view_30, dtype = torch.float8_e4m3fn);  view_30 = None
        t_8: "f8e4m3fn[64, 64][1, 64]cuda:0" = torch.ops.aten.t.default(_to_copy_8);  _to_copy_8 = None

        # No stacktrace found for following nodes
        view_39: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(add, [64, 64]);  add = None
        return (view_39, primals_1, unsqueeze_8, t_8)

INFO: TRACED GRAPH
 ===== Backward graph 0 =====
 <eval_with_key>.1 class GraphModule(torch.nn.Module):
    def forward(self, primals_1: "bf16[64, 64][64, 1]cuda:0", unsqueeze_8: "bf16[1, 1, 1][1, 1, 1]cuda:0", t_8: "f8e4m3fn[64, 64][1, 64]cuda:0", tangents_1: "bf16[64, 64][64, 1]cuda:0"):
         # File: /data/users/hirsheybar/checkout2/pytorch/test/dynamo/test_repros.py:6946 in forward, code: out = out.unflatten(0, input.shape[:-1])
        view_19: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(tangents_1, [64, 64]);  tangents_1 = None

         # File: /data/users/hirsheybar/checkout2/pytorch/test/dynamo/test_repros.py:6943 in forward, code: out = Fp8LinearFn.apply(
        abs_3: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten.abs.default(view_19)
        view_20: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(abs_3, [64, 1, 64]);  abs_3 = None
        amax_2: "bf16[64, 1][1, 1]cuda:0" = torch.ops.aten.amax.default(view_20, [-1]);  view_20 = None
        expand: "bf16[1, 64, 1][1, 0, 1]cuda:0" = torch.ops.aten.expand.default(unsqueeze_8, [1, 64, 1]);  unsqueeze_8 = None
        clone_1: "bf16[1, 64, 1][64, 1, 1]cuda:0" = torch.ops.aten.clone.default(expand, memory_format = torch.contiguous_format);  expand = None
        view_22: "bf16[64, 1][1, 1]cuda:0" = torch.ops.aten.view.default(clone_1, [64, 1]);  clone_1 = None
        _to_copy_5: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten._to_copy.default(amax_2, dtype = torch.float32);  amax_2 = None
        clamp_2: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.clamp.default(_to_copy_5, 1e-12);  _to_copy_5 = None
        div_2: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.div.Tensor(clamp_2, 448.0);  clamp_2 = None
        reciprocal_2: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.reciprocal.default(div_2)
        view_23: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(view_19, [64, 1, 64])
        view_24: "bf16[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.view.default(view_23, [64, 1, 1, 64]);  view_23 = None
        slice_9: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.slice.Tensor(reciprocal_2, 0, 0, 9223372036854775807);  reciprocal_2 = None
        unsqueeze_9: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_9, 1);  slice_9 = None
        slice_10: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.slice.Tensor(unsqueeze_9, 2, 0, 9223372036854775807);  unsqueeze_9 = None
        unsqueeze_10: "f32[64, 1, 1, 1][1, 1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_10, 3);  slice_10 = None
        mul_4: "f32[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.mul.Tensor(view_24, unsqueeze_10);  view_24 = unsqueeze_10 = None
        view_25: "f32[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(mul_4, [64, 1, 64]);  mul_4 = None
        view_26: "f32[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(view_25, [64, 64]);  view_25 = None
        _to_copy_6: "f8e4m3fn[64, 64][64, 1]cuda:0" = torch.ops.aten._to_copy.default(view_26, dtype = torch.float8_e4m3fn);  view_26 = None
        _to_copy_7: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten._to_copy.default(view_22, dtype = torch.float32);  view_22 = None
        clamp_3: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.clamp.default(_to_copy_7, 1e-12);  _to_copy_7 = None
        div_3: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.div.Tensor(clamp_3, 448.0);  clamp_3 = None
        t_6: "f32[1, 64][1, 1]cuda:0" = torch.ops.aten.t.default(div_3);  div_3 = None
        new_ones_2: "f32[1, 1][1, 1]cuda:0" = torch.ops.aten.new_ones.default(div_2, [1, 1], pin_memory = False)
        new_ones_3: "f32[1, 1][1, 1]cuda:0" = torch.ops.aten.new_ones.default(t_6, [1, 1], pin_memory = False)
        t_9: "f32[1, 1][1, 1]cuda:0" = torch.ops.aten.t.default(new_ones_3);  new_ones_3 = None
        _scaled_mm_1: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten._scaled_mm.default(_to_copy_6, t_8, new_ones_2, t_9, None, None, torch.bfloat16);  _to_copy_6 = t_8 = new_ones_2 = t_9 = None
        view_31: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(_scaled_mm_1, [64, 1, 64]);  _scaled_mm_1 = None
        view_32: "bf16[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.view.default(view_31, [64, 1, 1, 64]);  view_31 = None
        slice_13: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.slice.Tensor(div_2, 0, 0, 9223372036854775807);  div_2 = None
        unsqueeze_13: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_13, 1);  slice_13 = None
        slice_14: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.slice.Tensor(unsqueeze_13, 2, 0, 9223372036854775807);  unsqueeze_13 = None
        unsqueeze_14: "f32[64, 1, 1, 1][1, 1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_14, 3);  slice_14 = None
        mul_6: "f32[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.mul.Tensor(view_32, unsqueeze_14);  view_32 = unsqueeze_14 = None
        view_33: "f32[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(mul_6, [64, 1, 64]);  mul_6 = None
        view_34: "f32[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(view_33, [64, 64]);  view_33 = None
        view_35: "f32[1, 64, 64][4096, 64, 1]cuda:0" = torch.ops.aten.view.default(view_34, [1, 64, 64]);  view_34 = None
        view_36: "f32[1, 64, 64, 1][4096, 64, 1, 1]cuda:0" = torch.ops.aten.view.default(view_35, [1, 64, 64, 1]);  view_35 = None
        slice_15: "f32[1, 64][1, 1]cuda:0" = torch.ops.aten.slice.Tensor(t_6, 0, 0, 9223372036854775807);  t_6 = None
        unsqueeze_15: "f32[1, 1, 64][1, 64, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_15, 1);  slice_15 = None
        slice_16: "f32[1, 1, 64][1, 64, 1]cuda:0" = torch.ops.aten.slice.Tensor(unsqueeze_15, 2, 0, 9223372036854775807);  unsqueeze_15 = None
        unsqueeze_16: "f32[1, 1, 64, 1][1, 64, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_16, 3);  slice_16 = None
        mul_7: "f32[1, 64, 64, 1][4096, 64, 1, 1]cuda:0" = torch.ops.aten.mul.Tensor(view_36, unsqueeze_16);  view_36 = unsqueeze_16 = None
        view_37: "f32[64, 64, 1][64, 1, 1]cuda:0" = torch.ops.aten.view.default(mul_7, [64, 64, 1]);  mul_7 = None
        view_38: "f32[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(view_37, [64, 64]);  view_37 = None
        _to_copy_9: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten._to_copy.default(view_38, dtype = torch.bfloat16);  view_38 = None
        t_10: "bf16[64, 64][1, 64]cuda:0" = torch.ops.aten.t.default(view_19)
        mm: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten.mm.default(t_10, primals_1);  t_10 = primals_1 = None
        sum_1: "bf16[64][1]cuda:0" = torch.ops.aten.sum.dim_IntList(view_19, [0]);  view_19 = None
        return (_to_copy_9, mm, sum_1)

```

With the change, we save primals_2 for backward instead

```
 ===== Forward graph 0 =====
 /data/users/hirsheybar/checkout2/pytorch/torch/fx/_lazy_graph_module.py class GraphModule(torch.nn.Module):
    def forward(self, primals_1: "bf16[64, 64][64, 1]cuda:0", primals_2: "bf16[64, 64][64, 1]cuda:0", primals_3: "bf16[64][1]cuda:0"):
         # File: /data/users/hirsheybar/checkout2/pytorch/test/dynamo/test_repros.py:6943 in forward, code: out = Fp8LinearFn.apply(
        abs_1: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten.abs.default(primals_1)
        view: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(abs_1, [64, 1, 64]);  abs_1 = None
        amax: "bf16[64, 1][1, 1]cuda:0" = torch.ops.aten.amax.default(view, [-1]);  view = None
        abs_2: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten.abs.default(primals_2)
        view_1: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(abs_2, [64, 1, 64]);  abs_2 = None
        amax_1: "bf16[64, 1][1, 1]cuda:0" = torch.ops.aten.amax.default(view_1, [-1]);  view_1 = None
        _to_copy: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten._to_copy.default(amax, dtype = torch.float32);  amax = None
        clamp: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.clamp.default(_to_copy, 1e-12);  _to_copy = None
        div: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.div.Tensor(clamp, 448.0);  clamp = None
        reciprocal: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.reciprocal.default(div)
        view_2: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(primals_1, [64, 1, 64])
        view_3: "bf16[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.view.default(view_2, [64, 1, 1, 64]);  view_2 = None
        slice_1: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.slice.Tensor(reciprocal, 0, 0, 9223372036854775807);  reciprocal = None
        unsqueeze: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_1, 1);  slice_1 = None
        slice_2: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.slice.Tensor(unsqueeze, 2, 0, 9223372036854775807);  unsqueeze = None
        unsqueeze_1: "f32[64, 1, 1, 1][1, 1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_2, 3);  slice_2 = None
        mul: "f32[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.mul.Tensor(view_3, unsqueeze_1);  view_3 = unsqueeze_1 = None
        view_4: "f32[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(mul, [64, 1, 64]);  mul = None
        view_5: "f32[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(view_4, [64, 64]);  view_4 = None
        _to_copy_1: "f8e4m3fn[64, 64][64, 1]cuda:0" = torch.ops.aten._to_copy.default(view_5, dtype = torch.float8_e4m3fn);  view_5 = None
        _to_copy_2: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten._to_copy.default(amax_1, dtype = torch.float32)
        clamp_1: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.clamp.default(_to_copy_2, 1e-12);  _to_copy_2 = None
        div_1: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.div.Tensor(clamp_1, 448.0);  clamp_1 = None
        reciprocal_1: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.reciprocal.default(div_1)
        view_6: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(primals_2, [64, 1, 64])
        view_7: "bf16[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.view.default(view_6, [64, 1, 1, 64]);  view_6 = None
        slice_3: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.slice.Tensor(reciprocal_1, 0, 0, 9223372036854775807);  reciprocal_1 = None
        unsqueeze_2: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_3, 1);  slice_3 = None
        slice_4: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.slice.Tensor(unsqueeze_2, 2, 0, 9223372036854775807);  unsqueeze_2 = None
        unsqueeze_3: "f32[64, 1, 1, 1][1, 1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_4, 3);  slice_4 = None
        mul_1: "f32[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.mul.Tensor(view_7, unsqueeze_3);  view_7 = unsqueeze_3 = None
        view_8: "f32[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(mul_1, [64, 1, 64]);  mul_1 = None
        view_9: "f32[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(view_8, [64, 64]);  view_8 = None
        _to_copy_3: "f8e4m3fn[64, 64][64, 1]cuda:0" = torch.ops.aten._to_copy.default(view_9, dtype = torch.float8_e4m3fn);  view_9 = None
        t: "f32[1, 64][1, 1]cuda:0" = torch.ops.aten.t.default(div_1);  div_1 = None
        new_ones: "f32[1, 1][1, 1]cuda:0" = torch.ops.aten.new_ones.default(div, [1, 1], pin_memory = False)
        new_ones_1: "f32[1, 1][1, 1]cuda:0" = torch.ops.aten.new_ones.default(t, [1, 1], pin_memory = False)
        t_2: "f8e4m3fn[64, 64][1, 64]cuda:0" = torch.ops.aten.t.default(_to_copy_3);  _to_copy_3 = None
        t_3: "f32[1, 1][1, 1]cuda:0" = torch.ops.aten.t.default(new_ones_1);  new_ones_1 = None
        _scaled_mm: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten._scaled_mm.default(_to_copy_1, t_2, new_ones, t_3, None, None, torch.bfloat16);  _to_copy_1 = t_2 = new_ones = t_3 = None
        view_10: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(_scaled_mm, [64, 1, 64]);  _scaled_mm = None
        view_11: "bf16[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.view.default(view_10, [64, 1, 1, 64]);  view_10 = None
        slice_5: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.slice.Tensor(div, 0, 0, 9223372036854775807);  div = None
        unsqueeze_4: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_5, 1);  slice_5 = None
        slice_6: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.slice.Tensor(unsqueeze_4, 2, 0, 9223372036854775807);  unsqueeze_4 = None
        unsqueeze_5: "f32[64, 1, 1, 1][1, 1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_6, 3);  slice_6 = None
        mul_2: "f32[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.mul.Tensor(view_11, unsqueeze_5);  view_11 = unsqueeze_5 = None
        view_12: "f32[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(mul_2, [64, 1, 64]);  mul_2 = None
        view_13: "f32[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(view_12, [64, 64]);  view_12 = None
        view_14: "f32[1, 64, 64][4096, 64, 1]cuda:0" = torch.ops.aten.view.default(view_13, [1, 64, 64]);  view_13 = None
        view_15: "f32[1, 64, 64, 1][4096, 64, 1, 1]cuda:0" = torch.ops.aten.view.default(view_14, [1, 64, 64, 1]);  view_14 = None
        slice_7: "f32[1, 64][1, 1]cuda:0" = torch.ops.aten.slice.Tensor(t, 0, 0, 9223372036854775807);  t = None
        unsqueeze_6: "f32[1, 1, 64][1, 64, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_7, 1);  slice_7 = None
        slice_8: "f32[1, 1, 64][1, 64, 1]cuda:0" = torch.ops.aten.slice.Tensor(unsqueeze_6, 2, 0, 9223372036854775807);  unsqueeze_6 = None
        unsqueeze_7: "f32[1, 1, 64, 1][1, 64, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_8, 3);  slice_8 = None
        mul_3: "f32[1, 64, 64, 1][4096, 64, 1, 1]cuda:0" = torch.ops.aten.mul.Tensor(view_15, unsqueeze_7);  view_15 = unsqueeze_7 = None
        view_16: "f32[64, 64, 1][64, 1, 1]cuda:0" = torch.ops.aten.view.default(mul_3, [64, 64, 1]);  mul_3 = None
        view_17: "f32[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(view_16, [64, 64]);  view_16 = None
        _to_copy_4: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten._to_copy.default(view_17, dtype = torch.bfloat16);  view_17 = None
        add: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten.add.Tensor(_to_copy_4, primals_3);  _to_copy_4 = primals_3 = None
        t_5: "bf16[1, 64][1, 1]cuda:0" = torch.ops.aten.t.default(amax_1);  amax_1 = None
        view_21: "bf16[1, 1, 64][1, 64, 1]cuda:0" = torch.ops.aten.view.default(t_5, [1, 1, 64]);  t_5 = None
        amax_3: "bf16[1, 1][1, 1]cuda:0" = torch.ops.aten.amax.default(view_21, [-1]);  view_21 = None
        unsqueeze_8: "bf16[1, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(amax_3, 1);  amax_3 = None

        # No stacktrace found for following nodes
        view_39: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(add, [64, 64]);  add = None
        return (view_39, primals_1, primals_2, unsqueeze_8)

INFO: TRACED GRAPH
 ===== Backward graph 0 =====
 <eval_with_key>.1 class GraphModule(torch.nn.Module):
    def forward(self, primals_1: "bf16[64, 64][64, 1]cuda:0", primals_2: "bf16[64, 64][64, 1]cuda:0", unsqueeze_8: "bf16[1, 1, 1][1, 1, 1]cuda:0", tangents_1: "bf16[64, 64][64, 1]cuda:0"):
         # File: /data/users/hirsheybar/checkout2/pytorch/test/dynamo/test_repros.py:6946 in forward, code: out = out.unflatten(0, input.shape[:-1])
        view_19: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(tangents_1, [64, 64]);  tangents_1 = None

         # File: /data/users/hirsheybar/checkout2/pytorch/test/dynamo/test_repros.py:6943 in forward, code: out = Fp8LinearFn.apply(
        t_4: "bf16[64, 64][1, 64]cuda:0" = torch.ops.aten.t.default(primals_2);  primals_2 = None
        clone: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten.clone.default(t_4, memory_format = torch.contiguous_format);  t_4 = None
        abs_3: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten.abs.default(view_19)
        view_20: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(abs_3, [64, 1, 64]);  abs_3 = None
        amax_2: "bf16[64, 1][1, 1]cuda:0" = torch.ops.aten.amax.default(view_20, [-1]);  view_20 = None
        expand: "bf16[1, 64, 1][1, 0, 1]cuda:0" = torch.ops.aten.expand.default(unsqueeze_8, [1, 64, 1]);  unsqueeze_8 = None
        clone_1: "bf16[1, 64, 1][64, 1, 1]cuda:0" = torch.ops.aten.clone.default(expand, memory_format = torch.contiguous_format);  expand = None
        view_22: "bf16[64, 1][1, 1]cuda:0" = torch.ops.aten.view.default(clone_1, [64, 1]);  clone_1 = None
        _to_copy_5: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten._to_copy.default(amax_2, dtype = torch.float32);  amax_2 = None
        clamp_2: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.clamp.default(_to_copy_5, 1e-12);  _to_copy_5 = None
        div_2: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.div.Tensor(clamp_2, 448.0);  clamp_2 = None
        reciprocal_2: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.reciprocal.default(div_2)
        view_23: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(view_19, [64, 1, 64])
        view_24: "bf16[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.view.default(view_23, [64, 1, 1, 64]);  view_23 = None
        slice_9: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.slice.Tensor(reciprocal_2, 0, 0, 9223372036854775807);  reciprocal_2 = None
        unsqueeze_9: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_9, 1);  slice_9 = None
        slice_10: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.slice.Tensor(unsqueeze_9, 2, 0, 9223372036854775807);  unsqueeze_9 = None
        unsqueeze_10: "f32[64, 1, 1, 1][1, 1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_10, 3);  slice_10 = None
        mul_4: "f32[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.mul.Tensor(view_24, unsqueeze_10);  view_24 = unsqueeze_10 = None
        view_25: "f32[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(mul_4, [64, 1, 64]);  mul_4 = None
        view_26: "f32[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(view_25, [64, 64]);  view_25 = None
        _to_copy_6: "f8e4m3fn[64, 64][64, 1]cuda:0" = torch.ops.aten._to_copy.default(view_26, dtype = torch.float8_e4m3fn);  view_26 = None
        _to_copy_7: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten._to_copy.default(view_22, dtype = torch.float32);  view_22 = None
        clamp_3: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.clamp.default(_to_copy_7, 1e-12);  _to_copy_7 = None
        div_3: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.div.Tensor(clamp_3, 448.0);  clamp_3 = None
        reciprocal_3: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.reciprocal.default(div_3)
        view_27: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(clone, [64, 1, 64]);  clone = None
        view_28: "bf16[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.view.default(view_27, [64, 1, 1, 64]);  view_27 = None
        slice_11: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.slice.Tensor(reciprocal_3, 0, 0, 9223372036854775807);  reciprocal_3 = None
        unsqueeze_11: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_11, 1);  slice_11 = None
        slice_12: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.slice.Tensor(unsqueeze_11, 2, 0, 9223372036854775807);  unsqueeze_11 = None
        unsqueeze_12: "f32[64, 1, 1, 1][1, 1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_12, 3);  slice_12 = None
        mul_5: "f32[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.mul.Tensor(view_28, unsqueeze_12);  view_28 = unsqueeze_12 = None
        view_29: "f32[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(mul_5, [64, 1, 64]);  mul_5 = None
        view_30: "f32[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(view_29, [64, 64]);  view_29 = None
        _to_copy_8: "f8e4m3fn[64, 64][64, 1]cuda:0" = torch.ops.aten._to_copy.default(view_30, dtype = torch.float8_e4m3fn);  view_30 = None
        t_6: "f32[1, 64][1, 1]cuda:0" = torch.ops.aten.t.default(div_3);  div_3 = None
        new_ones_2: "f32[1, 1][1, 1]cuda:0" = torch.ops.aten.new_ones.default(div_2, [1, 1], pin_memory = False)
        new_ones_3: "f32[1, 1][1, 1]cuda:0" = torch.ops.aten.new_ones.default(t_6, [1, 1], pin_memory = False)
        t_8: "f8e4m3fn[64, 64][1, 64]cuda:0" = torch.ops.aten.t.default(_to_copy_8);  _to_copy_8 = None
        t_9: "f32[1, 1][1, 1]cuda:0" = torch.ops.aten.t.default(new_ones_3);  new_ones_3 = None
        _scaled_mm_1: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten._scaled_mm.default(_to_copy_6, t_8, new_ones_2, t_9, None, None, torch.bfloat16);  _to_copy_6 = t_8 = new_ones_2 = t_9 = None
        view_31: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(_scaled_mm_1, [64, 1, 64]);  _scaled_mm_1 = None
        view_32: "bf16[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.view.default(view_31, [64, 1, 1, 64]);  view_31 = None
        slice_13: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.slice.Tensor(div_2, 0, 0, 9223372036854775807);  div_2 = None
        unsqueeze_13: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_13, 1);  slice_13 = None
        slice_14: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.slice.Tensor(unsqueeze_13, 2, 0, 9223372036854775807);  unsqueeze_13 = None
        unsqueeze_14: "f32[64, 1, 1, 1][1, 1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_14, 3);  slice_14 = None
        mul_6: "f32[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.mul.Tensor(view_32, unsqueeze_14);  view_32 = unsqueeze_14 = None
        view_33: "f32[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(mul_6, [64, 1, 64]);  mul_6 = None
        view_34: "f32[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(view_33, [64, 64]);  view_33 = None
        view_35: "f32[1, 64, 64][4096, 64, 1]cuda:0" = torch.ops.aten.view.default(view_34, [1, 64, 64]);  view_34 = None
        view_36: "f32[1, 64, 64, 1][4096, 64, 1, 1]cuda:0" = torch.ops.aten.view.default(view_35, [1, 64, 64, 1]);  view_35 = None
        slice_15: "f32[1, 64][1, 1]cuda:0" = torch.ops.aten.slice.Tensor(t_6, 0, 0, 9223372036854775807);  t_6 = None
        unsqueeze_15: "f32[1, 1, 64][1, 64, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_15, 1);  slice_15 = None
        slice_16: "f32[1, 1, 64][1, 64, 1]cuda:0" = torch.ops.aten.slice.Tensor(unsqueeze_15, 2, 0, 9223372036854775807);  unsqueeze_15 = None
        unsqueeze_16: "f32[1, 1, 64, 1][1, 64, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_16, 3);  slice_16 = None
        mul_7: "f32[1, 64, 64, 1][4096, 64, 1, 1]cuda:0" = torch.ops.aten.mul.Tensor(view_36, unsqueeze_16);  view_36 = unsqueeze_16 = None
        view_37: "f32[64, 64, 1][64, 1, 1]cuda:0" = torch.ops.aten.view.default(mul_7, [64, 64, 1]);  mul_7 = None
        view_38: "f32[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(view_37, [64, 64]);  view_37 = None
        _to_copy_9: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten._to_copy.default(view_38, dtype = torch.bfloat16);  view_38 = None
        t_10: "bf16[64, 64][1, 64]cuda:0" = torch.ops.aten.t.default(view_19)
        mm: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten.mm.default(t_10, primals_1);  t_10 = primals_1 = None
        sum_1: "bf16[64][1]cuda:0" = torch.ops.aten.sum.dim_IntList(view_19, [0]);  view_19 = None
        return (_to_copy_9, mm, sum_1)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148922
Approved by: https://github.com/zou3519
2025-03-18 20:08:11 +00:00
b8c0c50bbe Release.md readability improvements (#149402)
Improves a bunch of readability/grammatical issues with release.md.

Note: This was a Claude Code experiment, with all changes automatically generated. But it turns out minor edits like these are _not_ a good use of Claude Code, since it asked for approval on every single changed line. It would probably be far more efficient to toss the whole file into a simple LLM.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149402
Approved by: https://github.com/atalman
2025-03-18 20:04:56 +00:00
dfdf58f8cb [ROCm] enable CK backend for bf16/fp16 on gfx11 (#143971)
This change enables the CK backend for bf16/fp16 on gfx11.
@jeffdaily

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143971
Approved by: https://github.com/jeffdaily
2025-03-18 18:18:22 +00:00
e0e8639a10 [torchbench] fix dynamic_shapes spec for moco (#148772)
Fixes https://github.com/pytorch/pytorch/issues/148333

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148772
Approved by: https://github.com/yushangdi, https://github.com/desertfire
2025-03-18 18:16:54 +00:00
dbea13ed45 [ROCm][TunableOp] Minor fix to BLAS logging for ScaledGEMM with no bias vector. (#149357)
Omit the bias type argument for BLAS logging when there is a ScaledGEMM with no bias vector.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149357
Approved by: https://github.com/jeffdaily
2025-03-18 18:14:52 +00:00
c0566e0dbf [ROCm] Fixes and improvements to CUDA->HIP flag conversion for CPP extensions (#149245)
Fixes https://github.com/ROCm/hip/issues/3764.

Fixes and improvements to CUDA->HIP flag conversion for CPP extensions

- Log flag conversion for debugging purposes.
- Fix cases where it should not touch the -I flags, or where CUDA appears more than once, by replacing only the first instance.
- Fix the case where the nvcc key may not exist.
- Fix the case where hipify should ignore flag values and only touch the flag itself.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149245
Approved by: https://github.com/jeffdaily

Co-authored-by: Qubitium-ModelCloud <qubitium@modelcloud.ai>
2025-03-18 18:01:07 +00:00
585fd972b8 Iterate over dense dim first in split reduction reindexing (#147229)
Fix for https://github.com/pytorch/pytorch/issues/144431.

Improves perf from 0.29963893827160504 -> 0.0396331632970453.

In split reductions, we view an input tensor as a single dimension, then reduce over it. When we are reducing over a tensor which has a dimension other than the last dimension as the dense dimension, we should iterate over the dense dimension first in our re-indexing.

This PR also gives evidence for the general need for reduction tiling, e.g. for cooperative-reduction handling of this case.
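
For illustration, a minimal sketch (hypothetical shapes; `torch.empty_strided` is only used here to make dim 0 the dense, stride-1 dimension) of the kind of input this re-indexing change targets:

```
import torch

# dim 0 is the dense (stride-1) dimension, dim 1 is the outer one
x = torch.empty_strided((65536, 8), (1, 65536), device="cuda").normal_()

# reducing the whole tensor to a scalar is lowered as a split reduction;
# the fix makes the re-indexing iterate over the dense dimension first
compiled_sum = torch.compile(lambda t: t.sum())
print(compiled_sum(x))
```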

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147229
Approved by: https://github.com/jansel
2025-03-18 17:35:21 +00:00
ee3a2c6ee2 [State_dict] Remove functools.cache and add unit test (#149354)
Fixes https://github.com/pytorch/pytorch/issues/149100

@functools.cache would keep `self` alive, leading to unexpected memory behavior (e.g. in the linked issue, even after the model is deleted, the model's memory is still occupied).
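
A minimal sketch of the pitfall with a hypothetical class (not the actual state_dict code): the cache lives on the function object and its keys include `self`, so the instance is kept alive after the last user reference is dropped:

```
import functools
import gc
import weakref

class Model:
    @functools.cache  # cache entries are keyed on (self, x), holding a strong ref to self
    def expensive(self, x):
        return x * 2

m = Model()
ref = weakref.ref(m)
m.expensive(3)

del m
gc.collect()
print(ref() is None)  # False: the cache on Model.expensive still keeps the instance alive
```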

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149354
Approved by: https://github.com/fegin
2025-03-18 17:30:41 +00:00
5b8cc4709a [FSDP2] Add set_reshard_after_forward (#149103)
Fixes https://github.com/pytorch/pytorch/issues/149029

Add `set_reshard_after_forward` to set `post_forward_mesh_info`, which is used to decide `_reshard_after_forward`.

Add a unit test similar to `test_fully_shard_communication_count`: after calling `set_reshard_after_forward(True)` the FSDPModule behaves as if `_reshard_after_forward=True`, and likewise when setting it to False.
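
A hedged usage sketch (the wrapped module is arbitrary, and it assumes a process group has already been initialized, e.g. via torchrun; in older releases `fully_shard` lives under `torch.distributed._composable.fsdp`):

```
import torch
from torch.distributed.fsdp import fully_shard

model = torch.nn.Linear(16, 16)  # arbitrary example module
fully_shard(model)               # model now also behaves as an FSDPModule

# keep parameters unsharded after forward for subsequent iterations,
# i.e. behave as if _reshard_after_forward=False
model.set_reshard_after_forward(False)
```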

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149103
Approved by: https://github.com/awgu
2025-03-18 17:21:54 +00:00
a8df5e5af9 [dynamo] Add mem leak test (#149358)
Test for https://github.com/pytorch/pytorch/pull/148480

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149358
Approved by: https://github.com/malfet
2025-03-18 16:38:28 +00:00
d5b1d99f78 Enable more nightly tests on s390x (#148452)
Also enable some tests which probably were accidentally disabled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148452
Approved by: https://github.com/seemethere, https://github.com/malfet
2025-03-18 16:09:39 +00:00
381d0cb239 [DCP] Avoid in-place update and deepcopy during dudpe (#149320)
Summary:
Avoid in-place updates and deepcopy during dedupe. Deepcopy becomes prohibitively expensive for models with a huge number of FQNs. This was manifested in the Ads 2K experiment as well. Here are the results from the TextRay model in Mitra:

#### Control job with deepcopy regression:
First save ~24.8s
Global step latency is ~7-8s

#### Test job with the new fix to avoid deepcopy:
First save is ~21s
Global step latency is ~2s

Test Plan:
```
buck test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/distributed/checkpoint:test_planner
```
https://www.internalfb.com/intern/testinfra/testrun/3940649945104822

Differential Revision: D71245218

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149320
Approved by: https://github.com/MeetVadakkanchery
2025-03-18 16:08:40 +00:00
c41196a4d0 [EZ][Docker] Remove install_db.sh (#149360)
Which is a vestige of caffe2 days and was no-op since https://github.com/pytorch/pytorch/pull/125092

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149360
Approved by: https://github.com/atalman, https://github.com/cyyever, https://github.com/seemethere, https://github.com/Skylion007
2025-03-18 16:07:47 +00:00
fdacf3c920 [ONNX] Update types in VerificationInfo (#149377)
torch.types.Number was rendered as is in the documentation and can be confusing. We write the original types instead to reduce confusion for users.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149377
Approved by: https://github.com/titaiwangms
2025-03-18 15:37:39 +00:00
405025778d Revert "[AOTI] Update test runner to use the new APIs (#147105)"
This reverts commit 9a78513c3cb21a5f506135e2a56f967cf1fddc60.

Reverted https://github.com/pytorch/pytorch/pull/147105 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/147105#issuecomment-2733656413))
2025-03-18 15:25:40 +00:00
5ba437fb45 Revert "[AOTI] Forward fix unit test failures (#149401)"
This reverts commit ec9e11145e1a86300aae0fe09a1d8917d21deba1.

Reverted https://github.com/pytorch/pytorch/pull/149401 on behalf of https://github.com/desertfire due to reverting the original PR instead ([comment](https://github.com/pytorch/pytorch/pull/149401#issuecomment-2733633516))
2025-03-18 15:18:48 +00:00
213eea216a [MTIA] Add _mtia_maybeExchangeDevice to MTIA module (#149340)
Summary: The FlexAttention path uses `_maybe_exchange_device`, so it will be needed eventually for MTIA as well.

Test Plan: `buck2 test fbcode//mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api -- test_maybe_exchange_device`

Reviewed By: chaos5958

Differential Revision: D70072063

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149340
Approved by: https://github.com/chaos5958
2025-03-18 15:15:12 +00:00
ec9e11145e [AOTI] Forward fix unit test failures (#149401)
Summary: There is a land conflict between https://github.com/pytorch/pytorch/pull/149161 and https://github.com/pytorch/pytorch/pull/147105. We just need to update the APIs used in two new unit tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149401
Approved by: https://github.com/ZainRizvi
2025-03-18 15:02:01 +00:00
6e2b2660b9 Make numpy check optional (#149356)
We may want to skip the numpy smoke tests, hence making the check optional.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149356
Approved by: https://github.com/ZainRizvi
2025-03-18 15:00:01 +00:00
bc88f6faa1 Use TorchVersion for triton version check (#149136)
Follow-up after https://github.com/pytorch/pytorch/pull/149092#issuecomment-2721990321
to use TorchVersion for triton version parsing.
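
For context, a small sketch of the comparisons `TorchVersion` supports (applied here to an arbitrary version string, the way the triton check uses it):

```
from torch.torch_version import TorchVersion

v = TorchVersion("3.2.0")
print(v >= "3.0.0")  # True: compares against version strings
print(v < (4, 0))    # True: tuples are accepted as well
```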

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149136
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-03-18 13:48:46 +00:00
b06b5c3e27 [ROCm] Use alternate mirror for drm repo (#149380)
Fixes an issue with building ROCm manywheel and libtorch images, e.g. https://github.com/pytorch/pytorch/actions/runs/13887711267/job/38854659005#step:4:8328

```
#53 2.832 Cloning into 'drm'...
#53 2.849 fatal: unable to access 'https://gitlab.freedesktop.org/mesa/drm.git/': The requested URL returned error: 503
#53 2.851 ./install_rocm_drm.sh: line 29: pushd: drm: No such file or directory
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149380
Approved by: https://github.com/jeffdaily
2025-03-18 13:33:25 +00:00
6055a4f612 refresh benchmarks results. (#149347)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149347
Approved by: https://github.com/jamesjwu
2025-03-18 08:53:49 +00:00
9b92828d4b Add batch dim sharding rule to sdpa (#149253)
This is a trivial rule that isn't needed in most cases, but if we want to treat the input data as actually `Shard(0)` (instead of `Replicate()`, as is currently assumed), then we need this rule.
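
A hedged sketch of the case this rule covers: SDPA inputs distributed as `Shard(0)` on the batch dimension (shapes and mesh are arbitrary, import paths may differ slightly across releases, and it assumes the default process group is already initialized):

```
import torch
import torch.distributed as dist
import torch.nn.functional as F
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Shard, distribute_tensor

mesh = init_device_mesh("cuda", (dist.get_world_size(),))

# batch-sharded q/k/v: (batch, heads, seq, head_dim)
q = distribute_tensor(torch.randn(8, 16, 128, 64), mesh, [Shard(0)])
k = distribute_tensor(torch.randn(8, 16, 128, 64), mesh, [Shard(0)])
v = distribute_tensor(torch.randn(8, 16, 128, 64), mesh, [Shard(0)])

out = F.scaled_dot_product_attention(q, k, v)  # output stays sharded on the batch dim
```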

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149253
Approved by: https://github.com/XilunWu
2025-03-18 07:54:02 +00:00
9cd52da45c [MPS/inductor] Add support for modified_bessel_i1. (#149379)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149379
Approved by: https://github.com/malfet
2025-03-18 06:02:33 +00:00
6c2db8fab0 Enable qint8 and quint8 add for AArch64 using ACL directly (#148653)
This enables qint8 and quint8 add for AArch64 through the Arm Compute Library (ACL) directly.
The relative performance improvement is ~15x with OMP_NUM_THREADS=1 and ~5.4x with OMP_NUM_THREADS=32.
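
A hedged sketch of the eager-mode op this accelerates (scales and zero points are arbitrary example values):

```
import torch

a = torch.randn(64, 64)
b = torch.randn(64, 64)

qa = torch.quantize_per_tensor(a, scale=0.05, zero_point=0, dtype=torch.qint8)
qb = torch.quantize_per_tensor(b, scale=0.05, zero_point=0, dtype=torch.qint8)

# quantized add; on AArch64 this path can now go through ACL directly
qc = torch.ops.quantized.add(qa, qb, 0.05, 0)
print(qc.dequantize()[:2, :2])
```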

Co-authored-by: David Svantesson <david.svantesson-yeung@arm.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148653
Approved by: https://github.com/malfet
ghstack dependencies: #148585
2025-03-18 05:38:39 +00:00
2e0c98ff05 [MPS] Add bicubic2d_aa (#149378)
Which is currently the most frequently requested op in https://github.com/pytorch/pytorch/issues/141287

Mostly done by refactoring `upsample_bilinear2d_aa` to accept a Functor as one of the template arguments, which closely follows ideas from eec43cfbc0/src/libImaging/Resample.c as well as
bb42e4d137/aten/src/ATen/native/cuda/UpSampleBilinear2d.cu (L472-L478)

Unit tests are populated by copying the upsample_bilinear2d_aa tests and reusing them for upsample_bicubic2d_aa.

At that point, the only differences between upsample_bilinear2d_aa and upsample_bicubic2d_aa are the convolution kernel function and its size: 3x3 for bilinear, 5x5 for bicubic.
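
A hedged example of reaching the new kernel through the public API (sizes arbitrary; requires an MPS-capable machine):

```
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 256, 256, device="mps")
# anti-aliased bicubic downsample, which lowers to the bicubic2d_aa path
y = F.interpolate(x, size=(64, 64), mode="bicubic", antialias=True)
print(y.shape)
```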
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149378
Approved by: https://github.com/dcci
2025-03-18 05:35:41 +00:00
dea7157160 nccl: upgrade to 2.26.2 to avoid hang on ncclCommAbort (#149351)
Fixes #149153

Yaml generated from:

```
python .github/scripts/generate_ci_workflows.py
```

Test plan:

Repro in https://gist.github.com/d4l3k/16a19b475952bc40ddd7f2febcc297b7

```
rm -rf third_party/nccl
python setup.py develop
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149351
Approved by: https://github.com/kwen2501, https://github.com/atalman, https://github.com/malfet
2025-03-18 05:23:18 +00:00
b8f91bcb14 [pt2_provenance_tracking] add support for cpp kernel (#149185)
Summary:
As title.

Add the inductor cpp kernel to the post-grad graph node mapping, plus a UT.

Context:
Raised as a feature request for the AOTI CPU case.

https://fb.workplace.com/groups/1028545332188949/permalink/1169020841474730/

Differential Revision: D71181284

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149185
Approved by: https://github.com/jingsh
2025-03-18 04:43:07 +00:00
7869196482 Fix torchbind schema str generation (#149239)
Summary: Fix Torchbind HOP schema generation when there's no input

Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:torchbind -- -r schema
```

Differential Revision: D71231164

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149239
Approved by: https://github.com/zou3519
2025-03-18 04:29:56 +00:00
bca75fe97a [MAIA] [Autocast] Enable autocast on MAIA device (#148511)
Fixes #148510.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148511
Approved by: https://github.com/albanD
2025-03-18 03:46:22 +00:00
c43e35d6f7 [MPS] Implement support for modified_bessel_i1 in eager. (#149368)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149368
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-03-18 03:29:10 +00:00
bb42e4d137 [AOTInductor] Add function to free buffer (#149161)
Summary:
We add a function that allows users to free the unused buffer.

Test Plan:
Testing correctness:
    python test/inductor/test_aot_inductor.py -k free_inactive

    Testing memory consumption:
    LD_LIBRARY_PATH=/data/users/$USER/pytorch/build/lib
    /home/$USER/local/pytorch/build/bin/test_aoti_inference

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149161
Approved by: https://github.com/chenyang78, https://github.com/desertfire
ghstack dependencies: #149249
2025-03-18 02:43:14 +00:00
cccdf860e2 [BE] Add STABLE_LIBRARY test for multiple returns (#149230)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149230
Approved by: https://github.com/albanD, https://github.com/zou3519
ghstack dependencies: #149052
2025-03-18 02:40:54 +00:00
988827cdfb Use schema as source of truth + support ones_like/empty_like (#149052)
This change does 2 important things:
(a) Instead of relying on the IValue type as the source of truth, we use the schema as the source of truth, which is important as IValue types are overloaded and can ambiguously convert incorrectly. For example, a MemoryFormat will look like an int and get converted to an int64_t instead of a MemoryFormat!

(b) This PR expands support for many more types to encompass way more schemas, e.g., Optional, Device, dtype, etc. The main win from this PR is the ability for aoti_torch_call_dispatcher to call TensorFactory ops like ones_like/empty_like!
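
For illustration (eager Python, not the shim itself), the kind of factory call this now handles, and why treating `memory_format` as a plain int would be wrong:

```
import torch

x = torch.randn(2, 3, 4, 5)
# memory_format is a torch.memory_format object, not an int
y = torch.ones_like(x, memory_format=torch.channels_last)
z = torch.empty_like(x, dtype=torch.float16)

print(type(torch.channels_last))                           # <class 'torch.memory_format'>
print(y.is_contiguous(memory_format=torch.channels_last))  # True
```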

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149052
Approved by: https://github.com/albanD
2025-03-18 02:40:54 +00:00
ebabd0efdd [ONNX] Expose verification utilities (#148603)
Expose verification utilities to public documentation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148603
Approved by: https://github.com/titaiwangms
2025-03-18 02:10:34 +00:00
c36ac16da1 [Inductor] optimize welford reduction (#145061)
Fix https://github.com/pytorch/pytorch/issues/141541.
Fix https://github.com/pytorch/pytorch/issues/142839.
Fix https://github.com/pytorch/pytorch/issues/143182.

**Summary:**
In order to fix the issue that the accuracy of the Welford reduction is not good enough, we follow the eager implementation and combine the Welford algorithm with cascade summation to improve numerical stability. Specifically:
1. Use Welford algorithm to compute mean and variance.
2. Use cascade summation when computing sum over input for both mean and variance.

I tested the Inductor benchmarks with this PR on CPU; no performance gains or regressions were seen.

**Example:**
Take https://github.com/pytorch/pytorch/issues/141541 as an example:
```
import torch
import torch.nn as nn
torch.manual_seed(0)

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.gn = nn.GroupNorm(num_groups=32, num_channels=32)

    def forward(self, x):
        return self.gn(x)

model = Model().eval()
c_model = torch.compile(model)
x = torch.randn(1, 32, 128, 128, 128)

with torch.no_grad():
    output = model(x)
    c_output = c_model(x)

print(torch.max(torch.abs(output - c_output)))
print(torch.allclose(output, c_output, 1.3e-6, 1e-5))
```
**logs**

- before
```
tensor(7.0095e-05)
False
```
- After
```
tensor(9.5367e-07)
True
```

- on CUDA
```
tensor(1.4305e-06, device='cuda:0', grad_fn=<MaxBackward1>)
True
```

**Generated code:**
- before
```
cpp_fused_native_group_norm_0 = async_compile.cpp_pybinding(['const float*', 'const float*', 'const float*', 'float*', 'float*', 'float*'], '''
#include "/tmp/torchinductor_jiayisun/pi/cpicxudqmdsjh5cm4klbtbrvy2cxwr7whxl3md2zzdjdf3orvfdf.h"
extern "C"  void kernel(const float* in_ptr0,
                       const float* in_ptr1,
                       const float* in_ptr2,
                       float* out_ptr0,
                       float* out_ptr1,
                       float* out_ptr2)
{
    {
        #pragma GCC ivdep
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(32L); x0+=static_cast<int64_t>(1L))
        {
            {
                Welford<float> tmp_acc0 = Welford<float>();
                Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                Welford<at::vec::Vectorized<float>> masked_tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                static WeightRecp<at::vec::Vectorized<float>> wrecps0(static_cast<int64_t>(131072L));
                for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(2097152L); x1+=static_cast<int64_t>(16L))
                {
                    {
                        if(C10_LIKELY(x1 >= static_cast<int64_t>(0) && x1 < static_cast<int64_t>(2097152L)))
                        {
                            auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x1 + 2097152L*x0), static_cast<int64_t>(16));
                            tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0, &wrecps0);
                        }
                    }
                }
                tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(masked_tmp_acc0_vec));
                tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(tmp_acc0_vec));
                out_ptr0[static_cast<int64_t>(x0)] = static_cast<float>(tmp_acc0.mean);
                out_ptr1[static_cast<int64_t>(x0)] = static_cast<float>(tmp_acc0.m2);
            }
        }
    }
    {
        #pragma GCC ivdep
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(32L); x0+=static_cast<int64_t>(1L))
        {
            for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(2097152L); x1+=static_cast<int64_t>(16L))
            {
                {
                    if(C10_LIKELY(x1 >= static_cast<int64_t>(0) && x1 < static_cast<int64_t>(2097152L)))
                    {
                        auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x1 + 2097152L*x0), static_cast<int64_t>(16));
                        auto tmp1 = out_ptr0[static_cast<int64_t>(x0)];
                        auto tmp4 = out_ptr1[static_cast<int64_t>(x0)];
                        auto tmp12 = in_ptr1[static_cast<int64_t>(x0)];
                        auto tmp15 = in_ptr2[static_cast<int64_t>(x0)];
                        auto tmp2 = at::vec::Vectorized<float>(tmp1);
                        auto tmp3 = tmp0 - tmp2;
                        auto tmp5 = static_cast<float>(2097152.0);
                        auto tmp6 = tmp4 / tmp5;
                        auto tmp7 = static_cast<float>(1e-05);
                        auto tmp8 = decltype(tmp6)(tmp6 + tmp7);
                        auto tmp9 = 1 / std::sqrt(tmp8);
                        auto tmp10 = at::vec::Vectorized<float>(tmp9);
                        auto tmp11 = tmp3 * tmp10;
                        auto tmp13 = at::vec::Vectorized<float>(tmp12);
                        auto tmp14 = tmp11 * tmp13;
                        auto tmp16 = at::vec::Vectorized<float>(tmp15);
                        auto tmp17 = tmp14 + tmp16;
                        tmp17.store(out_ptr2 + static_cast<int64_t>(x1 + 2097152L*x0));
                    }
                }
            }
        }
    }
}
''')
```
- After
```
cpp_fused_native_group_norm_0 = async_compile.cpp_pybinding(['const float*', 'const float*', 'const float*', 'float*', 'float*', 'float*'], '''
#include "/tmp/torchinductor_jiayisun/ln/clnlak27xpvmq3klpqyj6xzyq2thf4ecrezve5ddy4f4xaz4sb7w.h"
extern "C"  void kernel(const float* in_ptr0,
                       const float* in_ptr1,
                       const float* in_ptr2,
                       float* out_ptr0,
                       float* out_ptr1,
                       float* out_ptr2)
{
    {
        #pragma GCC ivdep
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(32L); x0+=static_cast<int64_t>(1L))
        {
            {
                Welford<float> tmp_acc0 = Welford<float>();
                Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                Welford<at::vec::Vectorized<float>> masked_tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                WelfordHelper<at::vec::Vectorized<float>> welford_helper0(static_cast<int64_t>(131072L));
                static WelfordHelper<at::vec::Vectorized<float>> masked_welford_helper0(static_cast<int64_t>(0L));
                for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(2097152L); x1+=static_cast<int64_t>(16L))
                {
                    {
                        if(C10_LIKELY(x1 >= static_cast<int64_t>(0) && x1 < static_cast<int64_t>(2097152L)))
                        {
                            auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x1 + 2097152L*x0), static_cast<int64_t>(16));
                            tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0, &welford_helper0);
                        }
                    }
                }
                tmp_acc0_vec = welford_combine(tmp_acc0_vec, &welford_helper0);
                masked_tmp_acc0_vec = welford_combine(masked_tmp_acc0_vec, &masked_welford_helper0);
                tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(masked_tmp_acc0_vec));
                tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(tmp_acc0_vec));
                out_ptr0[static_cast<int64_t>(x0)] = static_cast<float>(tmp_acc0.mean);
                out_ptr1[static_cast<int64_t>(x0)] = static_cast<float>(tmp_acc0.m2);
            }
        }
    }
    {
        #pragma GCC ivdep
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(32L); x0+=static_cast<int64_t>(1L))
        {
            for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(2097152L); x1+=static_cast<int64_t>(16L))
            {
                {
                    if(C10_LIKELY(x1 >= static_cast<int64_t>(0) && x1 < static_cast<int64_t>(2097152L)))
                    {
                        auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x1 + 2097152L*x0), static_cast<int64_t>(16));
                        auto tmp1 = out_ptr0[static_cast<int64_t>(x0)];
                        auto tmp4 = out_ptr1[static_cast<int64_t>(x0)];
                        auto tmp12 = in_ptr1[static_cast<int64_t>(x0)];
                        auto tmp15 = in_ptr2[static_cast<int64_t>(x0)];
                        auto tmp2 = at::vec::Vectorized<float>(tmp1);
                        auto tmp3 = tmp0 - tmp2;
                        auto tmp5 = static_cast<float>(2097152.0);
                        auto tmp6 = tmp4 / tmp5;
                        auto tmp7 = static_cast<float>(1e-05);
                        auto tmp8 = decltype(tmp6)(tmp6 + tmp7);
                        auto tmp9 = 1 / std::sqrt(tmp8);
                        auto tmp10 = at::vec::Vectorized<float>(tmp9);
                        auto tmp11 = tmp3 * tmp10;
                        auto tmp13 = at::vec::Vectorized<float>(tmp12);
                        auto tmp14 = tmp11 * tmp13;
                        auto tmp16 = at::vec::Vectorized<float>(tmp15);
                        auto tmp17 = tmp14 + tmp16;
                        tmp17.store(out_ptr2 + static_cast<int64_t>(x1 + 2097152L*x0));
                    }
                }
            }
        }
    }
}
''')
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145061
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jansel
2025-03-18 02:05:35 +00:00
cyy
1096443467 Use torch_compile_options for c10 libraries (#147821)
c10, c10_cuda, c10_hip and c10_xpu are given additional compile options by torch_compile_options, which are more restrictive and can help reveal potential bugs inside the code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147821
Approved by: https://github.com/guangyey, https://github.com/malfet
2025-03-18 01:54:23 +00:00
60523540f1 Force build to conform C++ standard on windows by adding /permissive- flag (#149035)
Fixes #147366

1. Add `/permissive-` to the `torch_compile_options` for the build to conform to the C++ standard.
2. Fix the error when trying to assign a string literal to a non-const ptr.

The `/permissive-` flag can be found at https://learn.microsoft.com/en-us/cpp/build/reference/permissive-standards-conformance?view=msvc-170

From the above [doc](https://learn.microsoft.com/en-us/cpp/build/reference/permissive-standards-conformance?view=msvc-170#remarks),
>  By default, the /permissive- option is set in new projects created by Visual Studio 2017 version 15.5 and later versions.
> The /permissive- option is implicitly set by the /std:c++latest option starting in Visual Studio 2019 version 16.8, and in version 16.11 by the /std:c++20 option.

Thus, it is reasonable to add this flag to the existing project.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149035
Approved by: https://github.com/guangyey, https://github.com/malfet
2025-03-18 01:51:46 +00:00
c1dd75e4dc Add AOTI shim for _weight_int4pack_mm_cpu_tensor (#149031)
**Summary**
The previous implementation of the shim did not align with the design and was removed by https://github.com/pytorch/pytorch/pull/148907.
This PR adds it back in the MKLDNN backend files and re-enables the CPP wrapper UT.

**Test plan**
```
pytest -s test/inductor/test_cpu_cpp_wrapper.py -k test_woq_int4
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149031
Approved by: https://github.com/leslie-fang-intel, https://github.com/EikanWang, https://github.com/desertfire
2025-03-18 01:33:13 +00:00
cyy
425c6d8eba Replace c10::is_pod with std::is_trivial (#149286)
These remaining c10::is_pod calls can be replaced without compromising the semantics.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149286
Approved by: https://github.com/zou3519
2025-03-18 01:33:01 +00:00
f9a787224c [dynamo][guards][serialization] Dont use ID_MATCH guard for bool and None (#149228)
Doing this removes the need to collect `id`s and therefore facilitates serialization. It also improves readability with recompilations; earlier, the recompile message would just show the `id`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149228
Approved by: https://github.com/jansel
2025-03-18 01:25:37 +00:00
186cc7327c [MPS/BE] Remove decorator that skipped test on macOS 12. (#149365)
macOS 12 is not really supported anymore.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149365
Approved by: https://github.com/malfet
2025-03-18 00:58:08 +00:00
a0ac63cbd9 [BE]: Apply ruff PERF403 to use dict comprehensions more often (#149257)
Fixes #ISSUE_NUMBER
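
For reference, the shape of the rewrite PERF403 suggests (illustrative snippet, not taken from the PR):

```
pairs = [("weight", 1.0), ("bias", None)]

# before: building a dict with an explicit loop
result = {}
for name, value in pairs:
    if value is not None:
        result[name] = value

# after: the dict comprehension that PERF403 prefers
result = {name: value for name, value in pairs if value is not None}
print(result)
```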

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149257
Approved by: https://github.com/jansel
2025-03-18 00:46:07 +00:00
811f587d86 [MPS/BE] @parametrize generation of pointwise_ops. (#149363)
Makes this less error-prone and reduces duplication.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149363
Approved by: https://github.com/malfet
2025-03-18 00:37:43 +00:00
9a78513c3c [AOTI] Update test runner to use the new APIs (#147105)
Summary: Switch to the newer aoti_compile_and_package APIs. Some tests still use the legacy APIs; we will follow up with internal test refactoring.

Differential Revision: [D69609685](https://our.internmc.facebook.com/intern/diff/D69609685)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147105
Approved by: https://github.com/jingsh
2025-03-18 00:27:09 +00:00
b52a8bef01 Revert "[dynamo][guards][serialization] Dont use ID_MATCH guard for bool and None (#149228)"
This reverts commit 5905bbe745b0acb4909243c93014c0e6f3512c2d.

Reverted https://github.com/pytorch/pytorch/pull/149228 on behalf of https://github.com/malfet due to I wonder if this will fix the pr-time-benchmark regressions ([comment](https://github.com/pytorch/pytorch/pull/149228#issuecomment-2731237949))
2025-03-18 00:10:50 +00:00
46226a90c8 [EZ][BE] Remove cross-compilation options from mac-build.yml (#149237)
It has long been gone
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149237
Approved by: https://github.com/seemethere, https://github.com/atalman
2025-03-17 23:50:31 +00:00
523bffd388 cd: Add no-cache for test binaries (#149218)
This is to make it so that we don't experience issues like https://github.com/pytorch/vision/actions/runs/13861462856/job/38795684317#step:13:212

```
ERROR: THESE PACKAGES DO NOT MATCH THE HASHES FROM THE REQUIREMENTS FILE. If you have updated the package versions, please update the hashes. Otherwise, examine the package contents carefully; someone may have tampered with them.
    unknown package:
        Expected sha256 8e34a6f02ac5a63763251953063a19ba9df855ac2c8a13ef409dfef708e2ba26
             Got        341156cc5067488565c1e103be6e95105b0fc0d87d8ac24ff8891f63fd33216f
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149218
Approved by: https://github.com/ZainRizvi, https://github.com/atalman, https://github.com/malfet
2025-03-17 23:26:20 +00:00
37c914ca0c fix simple-spec crash (#147723)
Found an issue while running `python torchgen/fuse/gen_patterns.py`.

Exact error:
```shell
Traceback (most recent call last):
  File "/Users/mayankmishra/Desktop/non-IBM/pytorch/torchgen/fuse/gen_patterns.py", line 19, in <module>
    joint_graph.lazy_init()
  File "/Users/mayankmishra/miniconda3/envs/ai/lib/python3.10/site-packages/torch/_inductor/pattern_matcher.py", line 2096, in lazy_init
    result = fn()
  File "/Users/mayankmishra/miniconda3/envs/ai/lib/python3.10/site-packages/torch/_inductor/fx_passes/joint_graph.py", line 53, in lazy_init
    _pad_mm_init()
  File "/Users/mayankmishra/miniconda3/envs/ai/lib/python3.10/site-packages/torch/_inductor/fx_passes/pad_mm.py", line 905, in _pad_mm_init
    gen_register_replacement(
  File "/Users/mayankmishra/miniconda3/envs/ai/lib/python3.10/site-packages/torch/_inductor/pattern_matcher.py", line 1584, in gen_register_replacement
    pat = _serialize_pattern(
  File "/Users/mayankmishra/miniconda3/envs/ai/lib/python3.10/site-packages/torch/_inductor/pattern_matcher.py", line 1539, in _serialize_pattern
    file_template = get_file_template()
  File "/Users/mayankmishra/miniconda3/envs/ai/lib/python3.10/site-packages/torch/_inductor/pattern_matcher.py", line 1513, in get_file_template
    if isinstance(attr, type) and issubclass(attr, (PatternExpr, _TargetExpr)):
  File "/Users/mayankmishra/miniconda3/envs/ai/lib/python3.10/abc.py", line 123, in __subclasscheck__
    return _abc_subclasscheck(cls, subclass)
TypeError: issubclass() arg 1 must be a class
```

This PR fixes this issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147723
Approved by: https://github.com/aorenste

Co-authored-by: Aaron Orenstein <aorenste@meta.com>
2025-03-17 23:25:48 +00:00
78715a181f Convert Tensor lr to 0-dim as needed for the optimizer to normally work (#145674)
Fixes #145461
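
A hedged sketch of what this enables: passing a 1-element Tensor `lr` and letting the optimizer normalize it to 0-dim (model and values are arbitrary):

```
import torch

model = torch.nn.Linear(4, 4)
# a 1-element, 1-dim tensor lr; the optimizer converts it to 0-dim as needed
opt = torch.optim.Adam(model.parameters(), lr=torch.tensor([1e-3]))

loss = model(torch.randn(2, 4)).sum()
loss.backward()
opt.step()
```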

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145674
Approved by: https://github.com/janeyx99

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2025-03-17 23:07:05 +00:00
1157367c78 [AOTInductor] [BE] Add macro for loading symbols in aoti runner (#149249)
Summary:
Add macro for loading symbols in aoti runner

Test Plan:
Existing tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149249
Approved by: https://github.com/chenyang78
2025-03-17 23:02:01 +00:00
24cfeec2c7 Revert "[BE]: Apply ruff PERF403 to use dict comprehensions more often (#149257)"
This reverts commit bfee141666319c80b6c5284394905beef8682515.

Reverted https://github.com/pytorch/pytorch/pull/149257 on behalf of https://github.com/malfet due to Let's see if it helps restore compiler benchmark sanity, see 8bc7bd94a5/1 ([comment](https://github.com/pytorch/pytorch/pull/149257#issuecomment-2731133812))
2025-03-17 22:57:00 +00:00
afa1eda901 Revert "[PGNCCL] Launch kernel on current stream & remove record_stream entirely (#148590)"
This reverts commit ef6296e7f20d744a0cfed81cab573d60204e7626.

Reverted https://github.com/pytorch/pytorch/pull/148590 on behalf of https://github.com/izaitsevfb due to reverted internally, see D71292427 ([comment](https://github.com/pytorch/pytorch/pull/148590#issuecomment-2731114626))
2025-03-17 22:43:15 +00:00
a16ada41b9 Fix outdated docstring of torch.export.export regarding strict flag (#149077)
Summary: Fix outdated docstring of torch.export.export regarding strict flag

Test Plan: None, doc only change

Differential Revision: D71068215

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149077
Approved by: https://github.com/zhxchen17
2025-03-17 22:29:20 +00:00
d25617255c Fix AOTI update_constant_buffer issue. (#149243)
Summary:
In D69553929 we changed the logic of constant & buffer updates in AOTI. However, this is incompatible with the current Sigmoid runtime since we have different logic for passing in buffers, which resulted in errors like
```
I0310 17:29:24.456960 3679102 AOTIDelegateExecutor.cpp:89] AOTIDelegateExecutor processing weights
*** Aborted at 1741652964 (Unix time, try 'date -d 1741652964') ***
*** Signal 11 (SIGSEGV) (0x30) received by PID 3679102 (pthread TID 0x7f9933e49000) (linux TID 3679102) (code: address not mapped to object), stack trace: ***
    @ 00000000000040b9 folly::symbolizer::(anonymous namespace)::signalHandler(int, siginfo_t*, void*)
                       ./fbcode/folly/debugging/symbolizer/SignalHandler.cpp:453
    @ 0000000000006c45 folly::fibers::(anonymous namespace)::sigsegvSignalHandler(int, siginfo_t*, void*)
                       ./fbcode/folly/fibers/GuardPageAllocator.cpp:237
    @ 000000000004455f (unknown)
                       /home/engshare/third-party2/glibc/2.34/src/glibc-2.34/signal/../sysdeps/unix/sysv/linux/libc_sigaction.c:8
                       -> /home/engshare/third-party2/glibc/2.34/src/glibc-2.34/signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c
    @ 00000000001e8164 torch::aot_inductor::AOTInductorModelContainer::update_constant_buffer(std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, AtenTensorOpaque*, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, AtenTensorOpaque*> > > const&, bool, bool)
```

Test Plan:
1) Generate lowered merge net
```
CUDA_VISIBLE_DEVICES=0 ../buck-out/v2/gen/fbcode/b5b13003c82cbdec/caffe2/torch/fb/model_transform/fx2trt/packaging/__generate_merge_net_file__/generate_merge_net_file.par  --action=generate --input-file=/home/shengqin/models/aoti_sigmoid_test/cmf_interformer_with_custom_triton_kernels_691990503_0_input --output-file=/home/shengqin/models/aoti_sigmoid_test/cmf_interformer_with_custom_triton_kernels_691990503_0_output.aoti_sigmoid --lower-backend=aot_inductor  --use_sigmoid=true --aot_inductor_config="{'max_autotune': True, 'comprehensive_padding': False}" --add_passes=use_matmul_lce_replace_normal_LCE,use_triton_dot_compress,use_matmul_fuse_lce_replace_first_LCE,use_contiguous_linear_reduction_replace_linear_reduction --disable_acc_tracer=false
```

2) Load net predictor
```
CUDA_VISIBLE_DEVICES=1 ../buck-out/v2/gen/fbcode/103717df3cc2b97a/caffe2/torch/fb/model_transform/fx2trt/packaging/__load_net_predictor__/load_net_predictor --loadMode=AccuracyAB --inputNetFile=/home/shengqin/models/aoti_sigmoid_test/cmf_interformer_with_custom_triton_kernels_691990503_0_output.aoti_ts --otherNetFile=/home/shengqin/models/aoti_sigmoid_test/cmf_interformer_with_custom_triton_kernels_691990503_0_output.aoti_sigmoid --moduleName=merge --benchmarkEnableProfiling=false —-predictor_hardware_type=1 --disableStaticRuntime=true
```

Reviewed By: hl475

Differential Revision: D71236710

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149243
Approved by: https://github.com/hl475, https://github.com/jingsh
2025-03-17 22:10:57 +00:00
a3c6e3139a allow extra args for parameterization of tests in inductor (#149154)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149154
Approved by: https://github.com/amjames, https://github.com/eellison
2025-03-17 22:05:06 +00:00
e4f6e4ac84 [MPS] Add inductor support for modified_bessel_i0. (#149342)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149342
Approved by: https://github.com/malfet
2025-03-17 21:45:51 +00:00
8bc7bd94a5 [ROCm] Input vectorization in elementwise kernels for tensors with heterogeneous types (#147527)
This patch exemplifies its use for input tensors with types (float,bfloat16) when functor type is float(float,float).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147527
Approved by: https://github.com/jeffdaily

Co-authored-by: Hashem Hashemi <hashem.hashemi@amd.com>
2025-03-17 20:51:36 +00:00
e8dd58b8cf cpp_wrapper: Precompile device-specific header files (#146928)
This saves us about a second per compilation, which is _massive_ for the OpInfo tests.  Total OpInfo test runtime is down about 2x from this change alone.

Relands #144002, with changes needed by fbcode internals.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146928
Approved by: https://github.com/desertfire
2025-03-17 20:40:15 +00:00
5e9f792479 [ROCm] Unskip flex attention UTs after triton 3.3 bump (#148327)
Enable `test_flex_attention.py::TestLearnableBiases` unit tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148327
Approved by: https://github.com/jeffdaily
2025-03-17 20:15:14 +00:00
6c7d8419e3 fix two accuracy regression (#149172)
There are 2 accuracy regressions in the 3/12 nightly perf run. I cannot repro them locally, so there is no effective way to bisect. Raise the tolerance to make them pass the accuracy check.

- error log for HF MegatronBertForQuestionAnswering https://gist.github.com/shunting314/25322b66e15e98feed32e0d9a1e43316
- error log for TIMM gluon_inception_v3 https://gist.github.com/shunting314/df64ce22327df27a7057bbbd19ef5164

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149172
Approved by: https://github.com/jansel, https://github.com/eellison
2025-03-17 19:34:00 +00:00
769f19bf95 [MTIA] Add _mtia_exchangeDevice to MTIA module (#149322)
Summary: The FlexAttention path uses `_exchange_device`, so it will be needed eventually for MTIA as well.

Test Plan: `buck2 test fbcode//mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api -- test_exchange_device`

Reviewed By: chaos5958

Differential Revision: D70072059

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149322
Approved by: https://github.com/chaos5958
2025-03-17 19:31:10 +00:00
8d7c430e84 Symintify transpose_ (#149057)
Fixes https://github.com/pytorch/pytorch/issues/148702
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149057
Approved by: https://github.com/yushangdi
2025-03-17 19:11:54 +00:00
08a644a4c4 Enable fast qlinear static/dynamic path for AArch64 through ACL directly (#148585)
This enables a fast path for eager mode static/dynamic quantization for AArch64 through Arm Compute Library (ACL) directly.

Context: PRs #126687, #139887 enabled an optimized implementation for `qlinear` and `qlinear_dynamic` for aarch64 through `ideep → oneDNN → ACL` which improved performance by ~10x compared to the previous implementation.
However, the current `qlinear` and `qlinear_dynamic` path (`ideep → oneDNN → ACL`) suffers from high overhead due to the API friction between the stateless oneDNN API and the stateful ACL low-precision GEMM (`lowp_gemm`) API - for example, ACL's `lowp_gemm` objects cache information like weights reduction or weights in optimized memory format which oneDNN does not allow due to its stateless nature.
Hence, ACL currently runs a (redundant) sum of columns and pre-transposition (to the gemm kernel's optimal format) for each GEMM operation.
This PR addresses the sub-optimalities above by integrating ACL directly with `qlinear` and `qlinear_dynamic`.

- **For `qlinear_dynamic` (dynamically quantized matmuls):**

This PR yields an **average speedup** (averaged over context_lengths of 2^3 up to 2^9) of ~**50%** for `bert-base-uncased`, `bert-large-uncased`, `roberta-base`, and `distilbert-base-uncased` with 16 threads on a Neoverse-V1 (with transformers==4.48) for the benchmarking script below:
```
# SPDX-FileCopyrightText: Copyright 2025 Arm Limited and/or its affiliate <open-source-office@arm.com>
# SPDX-License-Identifier: BSD-3-Clause
import torch
from transformers import AutoModel, AutoConfig
import time
import numpy as np
from argparse import ArgumentParser

class ModelArgumentParser(ArgumentParser):
    def __init__(self) -> None:
        super().__init__(description="huggingface model")
        self.add_argument("--context_length",
                            help="context length - number of input tokens",
                            type=int,
                            default=64
        )
        self.add_argument("--model",
                            help="model checkpoint - i.e. 'bert-base-uncased'",
                            type=str,
                            default=None)
        self.add_argument("--iters",
                          help="benchmark iterations",
                          default=500)

if __name__ == "__main__":
    parser = ModelArgumentParser()
    args = parser.parse_args()
    model_name = args.model
    config = AutoConfig.from_pretrained(model_name)
    batch_size = 1
    model = AutoModel.from_pretrained(model_name)
    model = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
    model.eval()
    inputs = torch.randint(config.vocab_size, (batch_size, args.context_length), dtype=torch.long, device="cpu")
    times = []
    with torch.no_grad():
        # warmup
        for _ in range(10):
            model(inputs)
        # benchmark
        for _ in range(args.iters):
            s = time.time_ns()
            model(inputs)
            times.append((time.time_ns() - s) / 1e6)

    print("Model = ", model_name)
    print("Context Length = ", args.context_length)
    print("Min (ms) = ", min(times))
    print("Mean (ms) = ", np.mean(times))
```

- **For `qlinear` (statically quantized matmuls):**

This PR yields an **average speedup of 2x for signed activations (`s8s8s8`) and 95x for unsigned activations (u8s8u8)** on a Neoverse-V1 with 16 threads for the benchmarking script below.
The averages are over all combinations of `M = [8, 16, ..., 512]`, `K = [768, 1024, 2048, 4096]`, `N = [768, 1024, 2048, 4096]`.
The astronomical speedup for unsigned activations is because oneDNN v3.7 does not have an optimized implementation for `u8s8u8` on AArch64.

```
# SPDX-FileCopyrightText: Copyright 2025 Arm Limited and/or its affiliate <open-source-office@arm.com>
# SPDX-License-Identifier: BSD-3-Clause
import torch
import torch.nn as nn
from torch.quantization import QConfig
from torch.ao.quantization.observer import HistogramObserver, default_weight_observer
import numpy as np
import random
from argparse import ArgumentParser
import time

class ModelArgumentParser(ArgumentParser):
    def __init__(self) -> None:
        super().__init__()
        self.add_argument("--M",
                            help="M dimension",
                            type=int,
                            default=64
        )
        self.add_argument("--K",
                            help="K dimension",
                            type=int,
                            default=64
        )
        self.add_argument("--N",
                            help="N dimension",
                            type=int,
                            default=64
        )
        self.add_argument("--signed_input",
                            help="Use (signed) torch.qint8 for inputs instead of (unsigned) torch.quint8",
                            action="store_true"
        )
        self.add_argument("--seed",
                          help="Random seed",
                          type=int,
                          default=42
        )
        self.add_argument("--iters",
                          help="benchmark iterations",
                          default=500)

def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)

class LinearModel(nn.Module):
    def __init__(self, K, N):
        super(LinearModel, self).__init__()
        self.quant = torch.quantization.QuantStub()
        self.fc = nn.Linear(K, N)
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.fc(x)
        x = self.dequant(x)
        return x

def quantize_model(model, args):
    qconfig = QConfig(
            activation=HistogramObserver.with_args(reduce_range=False,
            dtype=torch.qint8 if args.signed_input else torch.quint8),
            weight=default_weight_observer,
    )
    # Prepare the model for static quantization
    # Specify quantization configurations
    model.qconfig = qconfig
    model_prepared = torch.quantization.prepare(model)

    # Calibrate the model with sample inputs
    # Example input data for calibration
    with torch.no_grad():
        sample_data = torch.randn(args.M, args.K)
        model_prepared(sample_data)
    # Convert the prepared model to a quantized model
    model_quantized = torch.quantization.convert(model_prepared)
    return model_quantized

if __name__ == "__main__":
    parser = ModelArgumentParser()
    args = parser.parse_args()

    set_seed(args.seed)
    model_fp32 = LinearModel(args.K, args.N)
    model_quantized = quantize_model(model_fp32, args)

    inputs = torch.randn(args.M, args.K)
    times = []
    with torch.no_grad():
        # warmup
        for _ in range(10):
            model_quantized(inputs)
        # benchmark
        for _ in range(args.iters):
            s = time.time_ns()
            model_quantized(inputs)
            times.append((time.time_ns() - s) / 1e6)

    print("M,K,N,signed = ", args.M, args.K, args.N, args.signed_input)
    print("Min Times (ms) = ", min(times))
    print("Mean Times (ms) = ", np.mean(times))
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148585
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-03-17 18:21:10 +00:00
c41c2130be Fix printing INT64_MIN (#149148)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149148
Approved by: https://github.com/anijain2305
2025-03-17 17:57:18 +00:00
8cdb9adc05 do not run test_ck_blas_library on cpu (#148316)
Fix for non-ROCm builds:

```
root@e01-tw-ue5g2g3sap6:~/pytorch/test# python test_linalg.py TestLinalgCPU.test_ck_blas_library_cpu
E
======================================================================
ERROR: test_ck_blas_library_cpu (__main__.TestLinalgCPU)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/root/pytorch/torch/testing/_internal/common_utils.py", line 3108, in wrapper
    method(*args, **kwargs)
  File "/root/pytorch/torch/testing/_internal/common_device_type.py", line 480, in instantiated_test
    raise rte
  File "/root/pytorch/torch/testing/_internal/common_device_type.py", line 460, in instantiated_test
    result = test(self, **param_kwargs)
  File "/root/pytorch/torch/testing/_internal/common_device_type.py", line 1242, in dep_fn
    return fn(slf, *args, **kwargs)
  File "/root/pytorch/torch/testing/_internal/common_utils.py", line 1981, in _fn
    fn(*args, **kwargs)
  File "/root/pytorch/test/test_linalg.py", line 8621, in test_ck_blas_library
    torch.backends.cuda.preferred_blas_library('ck')
  File "/root/pytorch/torch/backends/cuda/__init__.py", line 258, in preferred_blas_library
    torch._C._set_blas_preferred_backend(_BlasBackends[backend])
RuntimeError: Cannot set preferred backend to Ck if PyTorch has not been compiled for ROCm.

To execute this test, run the following from the base repo dir:
    python test/test_linalg.py TestLinalgCPU.test_ck_blas_library_cpu

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

----------------------------------------------------------------------
Ran 1 test in 0.346s

FAILED (errors=1)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148316
Approved by: https://github.com/jeffdaily
2025-03-17 17:45:45 +00:00
224cd9f055 [ez] Flush trymerge print statements (#149012)
Logs of trymerge don't match up with timestamps, e.g.
https://github.com/pytorch/pytorch/actions/runs/13766246347/job/38493307591
Ex:
```
2025-03-10T14:20:41.4899509Z Attempting merge of https://github.com/pytorch/pytorch/pull/148648 (0.003460856278737386 minutes elapsed)
...
2025-03-10T14:20:41.4907867Z Merge of https://github.com/pytorch/pytorch/pull/148648 failed due to: Still waiting for 16 jobs to finish, first few of them are: Check Labels / Check labels, trunk / macos-py3-arm64 / build, trunk / win-vs2022-cpu-py3 / build, trunk / cuda12.4-py3.10-gcc9-sm80 / build, trunk / win-vs2022-cuda12.6-py3 / build. Retrying in 5 min
2025-03-10T14:20:41.4909772Z Attempting merge of https://github.com/pytorch/pytorch/pull/148648 (5.280085611343384 minutes elapsed)
...
2025-03-10T14:20:41.4916812Z Merge of https://github.com/pytorch/pytorch/pull/148648 failed due to: Still waiting for 15 jobs to finish, first few of them are: trunk / macos-py3-arm64 / build, trunk / win-vs2022-cpu-py3 / build, trunk / cuda12.4-py3.10-gcc9-sm80 / build, trunk / win-vs2022-cuda12.6-py3 / build, trunk / linux-focal-cuda12.6-py3.10-gcc11-no-ops / build. Retrying in 5 min
2025-03-10T14:20:41.4918183Z Attempting merge of https://github.com/pytorch/pytorch/pull/148648 (10.590279157956441 minutes elapsed)
```

Either print buffering or GitHub Actions logging is being weird?

Print with flush to see if it helps
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149012
Approved by: https://github.com/malfet
2025-03-17 17:04:48 +00:00
aaa4c3d60b [mm_logs] make aten mm info readable (#148800)
Summary:
As title: make the aten mm info into a readable table, e.g. (also see pic in test plan):

| Name    | M  | N | K  | Count |
|---------|----|---|----|-------|
| aten.mm | 16 | 6 | 16 | 1     |
...

Test Plan: {F1975907876}
<img width="1090" alt="Screenshot 2025-03-11 at 3 13 00 PM" src="https://github.com/user-attachments/assets/ffae8c56-e32c-49cc-bbfb-5b8d216b8657" />

Differential Revision: D70825664

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148800
Approved by: https://github.com/henrylhtsang
2025-03-17 17:00:58 +00:00
2a011ca904 [ROCm] testing: enable MEFF/FA unittests for gfx1100 (#148911)
Include gfx1100, and optionally enable gfx1201/gfx950 according to env var TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148911
Approved by: https://github.com/jeffdaily
2025-03-17 16:41:15 +00:00
9d37b501db Revert "[ROCm] enable HIPMallocAsyncAllocator (#149145)"
This reverts commit 2e02c07a5d1c432547542f90de2885be9ffd13cf.

Reverted https://github.com/pytorch/pytorch/pull/149145 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally.  @albanD, might you be able to help get this PR landed? See D71214814 for more details on the failure. To validate the fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/149145#issuecomment-2730104736))
2025-03-17 16:17:02 +00:00
c7c3e77324 Refine XPU oneDNN context manager API (#147349)
# Motivation
This PR introduces improvements to the XPU oneDNN context manager API:

- `GpuEngineManager::get_engine`: Added a new API that accepts a `DeviceIndex` to simplify code and improve usability - by default, using the current device index.
- `GpuStreamManager::get_stream`: Now explicitly requires a `DeviceIndex` as input to ensure correctness and consistency - by default, using the current device index.

Additionally, it enhances integration with `c10::DeviceGuard`, ensuring correct device management.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147349
Approved by: https://github.com/EikanWang
2025-03-17 14:45:56 +00:00
790f93db3a Update slow tests (#149300)
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml).
Update the list of slow tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149300
Approved by: https://github.com/pytorchbot
2025-03-17 11:39:29 +00:00
b2862f1435 optimize the decomposition of aten.native_group_norm (#144733)
Summary:
Optimize the decomposition of aten.native_group_norm. Reduce unnecessary repeated operations by changing the order of operations for `mean`, `rstd`, `weight`, `bias` and `input`, which can improve performance when `flattened_inner_size` is large.

The original decomposition:
1. compute `mean` and `rstd`,
2. out = (x - mean) * rstd, compute in the range [N, C, *],
3. out = out * weight + bias, compute in the range [N, C, *],

The new decomposition:
1. compute `mean` and `rstd`,
2. new_weight = rstd * weight, new_bias = -mean * rstd * weight + bias, compute in the range [N, C],
3. out = x * new_weight + new_bias, compute in the range [N, C, *],

I tested the Inductor performance benchmark with this PR on both CPU and A100. On CPU, two torchbench models (functorch_dp_cifar10 and opacus_cifar10) have about 25% performance improvement, and two diffusion models (Stable Diffusion and Latent Consistency Model (LCM)) have about 2% performance improvement. On A100, no performance gains or regressions were seen.
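A minimal sketch of the reordered decomposition described above (illustrative PyTorch, assuming contiguous channel groups; not the actual Inductor decomposition):

```python
import torch

def group_norm_decomposed(x, num_groups, weight, bias, eps=1e-5):
    # x: [N, C, *]; weight/bias: [C]
    N, C = x.shape[:2]
    xg = x.reshape(N, num_groups, -1)
    mean = xg.mean(dim=-1)                                   # [N, G]
    rstd = (xg.var(dim=-1, unbiased=False) + eps).rsqrt()    # [N, G]

    # Step 2 of the new decomposition: fold mean/rstd into weight/bias in the [N, C] range
    mean_c = mean.repeat_interleave(C // num_groups, dim=1)  # [N, C]
    rstd_c = rstd.repeat_interleave(C // num_groups, dim=1)  # [N, C]
    new_weight = rstd_c * weight                             # [N, C]
    new_bias = -mean_c * rstd_c * weight + bias              # [N, C]

    # Step 3: a single affine transform over the full [N, C, *] range
    shape = (N, C) + (1,) * (x.dim() - 2)
    return x * new_weight.reshape(shape) + new_bias.reshape(shape)

x = torch.randn(2, 6, 8, 8)
w, b = torch.randn(6), torch.randn(6)
ref = torch.nn.functional.group_norm(x, 3, w, b)
out = group_norm_decomposed(x, 3, w, b)
print(torch.allclose(ref, out, atol=1e-5))  # True, up to numerical tolerance
```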

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144733
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
2025-03-17 09:27:01 +00:00
1cc5f6b623 Optimize MaxPool1d param ceil_mode description (#148869)
Fixes #148123

Add output shape formula based on `ceil_mode` value, according to

00199acdb8/aten/src/ATen/native/Pool.h (L61-L75)
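For reference, a small sketch of the output-length computation the formula describes (a simplified reading of the pooling shape logic in the file linked above; that file remains the source of truth):

```python
import math

def max_pool1d_out_len(l_in, kernel_size, stride=None, padding=0, dilation=1, ceil_mode=False):
    stride = stride or kernel_size
    numerator = l_in + 2 * padding - dilation * (kernel_size - 1) - 1
    if ceil_mode:
        l_out = math.ceil(numerator / stride) + 1
        # the last window must start inside the input, not entirely in padding
        if (l_out - 1) * stride >= l_in + padding:
            l_out -= 1
    else:
        l_out = math.floor(numerator / stride) + 1
    return l_out

print(max_pool1d_out_len(10, kernel_size=3, stride=2, ceil_mode=False))  # 4
print(max_pool1d_out_len(10, kernel_size=3, stride=2, ceil_mode=True))   # 5
```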

## Test Result

### Before

![image](https://github.com/user-attachments/assets/0a175178-a104-4348-a14b-516e866d533a)

### After

![image](https://github.com/user-attachments/assets/ce621d4b-1986-41fb-bd71-2b03c0aa996e)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148869
Approved by: https://github.com/mikaylagawarecki
2025-03-17 08:50:40 +00:00
916e8979d3 Skip some tests not using gradcheck on slowgradcheck (#149220)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149220
Approved by: https://github.com/seemethere
2025-03-17 00:34:52 +00:00
eqy
6048d88afe [ARM64][CUDA] skip string pattern matching in test_workspace_allocation_error (#149236)
`unwind()` on ARM64 seems to elide the strings of interest

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149236
Approved by: https://github.com/malfet, https://github.com/eellison, https://github.com/BoyuanFeng
2025-03-17 00:30:43 +00:00
bfee141666 [BE]: Apply ruff PERF403 to use dict comprehensions more often (#149257)
Fixes #ISSUE_NUMBER
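For context, PERF403 flags loops that build a dict item by item and suggests a dict comprehension; a small illustrative example (not taken from this PR):

```python
pairs = [("a", 1), ("b", 2), ("c", 3)]

# Before: building the dict in a loop
result = {}
for key, value in pairs:
    result[key] = value

# After: the rewrite ruff PERF403 suggests
result = {key: value for key, value in pairs}
```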

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149257
Approved by: https://github.com/jansel
2025-03-16 23:52:58 +00:00
6b1b95ad2a Support subclass constructor capturing in export (#147014)
Notable TODOs:
1. Need to implement AutogradHOP to get rid of subclasses before serializing
2. Need to implement mechanism to figure out what subclasses will be used in export when they are not expressed in the inputs

Differential Revision: [D69640673](https://our.internmc.facebook.com/intern/diff/D69640673)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147014
Approved by: https://github.com/bdhirsh
2025-03-16 18:19:19 +00:00
5905bbe745 [dynamo][guards][serialization] Dont use ID_MATCH guard for bool and None (#149228)
Doing this removes the need to collect `id` and therefore facilitates serialization. It also improves readability of recompilation messages: earlier, the recompile message would just show the `id`.
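As a small illustration (a simplified sketch, not Dynamo's actual guard code): `True`, `False` and `None` are interned singletons, so an equality/constant guard is as strong as an identity guard and needs no recorded object id:

```python
def make_equals_guard(expected):
    # an equality-style guard: no id() captured, trivially serializable
    return lambda value: type(value) is type(expected) and value == expected

guard = make_equals_guard(True)
print(guard(True), guard(False), guard(1))  # True False False (1 is rejected by the type check)
```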

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149228
Approved by: https://github.com/jansel
2025-03-16 15:56:17 +00:00
9f33c6f0a0 [MPS] Add support for modified_bessel_i0 in eager. (#149264)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149264
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-03-16 04:45:49 +00:00
f80bee4934 [MPS][BE] Move common binary ops macros to indexing.h (#149263)
And binary op invocation logic to OperationUtils.mm

This is a no-op change, additional sanity checks/logic improvements will be added as followups
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149263
Approved by: https://github.com/dcci
ghstack dependencies: #149262
2025-03-16 02:06:40 +00:00
21c2edfec8 [MPS/metal] Add missing inline to function definitions. (#149265)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149265
Approved by: https://github.com/malfet
2025-03-16 00:33:27 +00:00
3e2c4086ad [EZ][BE] Reuse result_of from c10/metal/utils.h (#149262)
No need for one more implementation
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149262
Approved by: https://github.com/dcci
2025-03-16 00:21:28 +00:00
acf42b0048 Fix memory leak in subproc_pool future (#149259)
Summary: The future holds a reference to the callback, and the callback captures the outer future. Seems to create a cycle that the garbage collector doesn't clean up. Verified by compiling 15k synthetic Triton kernels and observing that subprocess memory overhead improves.
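A minimal illustration of the kind of future/callback reference cycle described (simplified: here the callback captures the very future it is registered on; names are hypothetical, not the actual subproc_pool code):

```python
from concurrent.futures import Future

def make_chained_future():
    fut = Future()

    def callback(done_future):
        # `callback` closes over `fut`, and `fut`'s callback list holds `callback`:
        # fut -> callback -> fut is a cycle only the cyclic garbage collector can reclaim
        print("result:", fut.result())

    fut.add_done_callback(callback)
    return fut

fut = make_chained_future()
fut.set_result(7)  # prints "result: 7"; the cycle persists until the GC runs
```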

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149259
Approved by: https://github.com/Skylion007
2025-03-15 20:26:30 +00:00
a9c55277d7 [Reland] First version of statically compiled launcher for triton compiled CUDA kernels (#149238)
This is a new version of https://github.com/pytorch/pytorch/pull/148561 fixing the ROCM test failure

Putting this up for a first pass review, though I will likely make a bunch of changes before landing to add more features, etc.

This diff implements a first version of a static CUDA kernel launcher in `torch._C`. The goal here is to take a cubin file and some metadata from a CompiledKernel from `triton`, and launch the cubin file directly.

Background doc: https://docs.google.com/document/d/1rjRcHl6MfauHG30nCoQX-9UKvKyIs4WWMy_GsGyqb9g/edit?tab=t.0#heading=h.ut5lf39lzq66

Normally, using triton's CompiledKernel.make_launcher(), we would pay the cost of codegenning C++ and running it at compile time. With this new approach, we can use one statically compiled library to launch the kernel.

The tradeoff here is that this new kernel launcher will not be able to use codegen to deal with different lengths/types of arguments. So we use templating to handle up to 10 arguments for now. We also allocate 8 bytes on the stack per argument no matter the argument type, which can take more memory than codegenning. On the other hand, we improve compile time on cold and warm start by not having to call the C++ compiler at all.

This diff does not add the launcher to torch, but introduces a basic test suite.

A list of TODOs that are not yet complete:
- Handle `nvTmaDesc` and `cuTensorMap`, which triton handles
- Embed the grid logic instead of passing in gridX,Y,Z
- Handle launch_enter and exit hooks? (Not sure if inductor has these)
- Benchmarking to see if there's runtime performance loss
- Probably lots of features of the triton C++ generated code that I haven't handled yet.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149238
Approved by: https://github.com/oulgen
2025-03-15 15:06:46 +00:00
c83c711da8 Remove some memory overhead in parallel compile workers (#149168)
Summary: The parallel compile workers are holding on to more memory than they need to because they're loading the compiled modules into memory. Update the post-fork initializer to record when in a subprocess and skip some of the unnecessary overhead.

Test Plan: Ran a test script to compile 15k Triton kernels and used tracemalloc in the subprocs to investigate the overhead. On my devgpu:
* After importing torch in a subproc: 371M
* Without this PR, after compiling 15k kernels: 825M
* With this PR, after compiling 15k kernels: 531M

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149168
Approved by: https://github.com/jansel
2025-03-15 14:20:40 +00:00
e7e477c1f9 Not generate custom obj json when it's empty (#149246)
Summary: as title.

See internal Diff summary for more context.

Test Plan: buck run @fbcode//mode/dev-nosan //caffe2/test/inductor:torchbind -- -r config_not_generated

Differential Revision: D71241676

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149246
Approved by: https://github.com/houseroad

Co-authored-by: Huamin Li <huaminli@meta.com>
2025-03-15 13:00:48 +00:00
4482a65fef Add side_effect to avoid dce custom op in CA graph (#149181)
We found that in compiled_autograd, when defining a custom op, the custom op would be DCE'd from the backward graph. We added a side-effect condition to the DCE function to prevent eliminating custom ops with side effects in the CA graph.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149181
Approved by: https://github.com/xmfan
2025-03-15 04:15:49 +00:00
115fc98cc0 Migrate aten.split.Tensor from using Sharding Rule to Sharding Strategy (#149106)
Summary:
Use Sharding Strategy for aten.split.Tensor instead of sharding rule

Test Plan:
pytest test/distributed/tensor/test_dtensor_ops.py -s -k split

Reviewers:
xilunwu

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149106
Approved by: https://github.com/XilunWu, https://github.com/tianyu-l
2025-03-15 04:03:40 +00:00
740ce0fa5f op should NOT be static in aoti_torch_call_dispatcher (#149208)
aoti_torch_call_dispatcher is meant to call different ops, so the op must not be static. Otherwise, every call to this API will call the first op that was ever called, which is not the intended behavior of any human being.
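A rough Python analogy of the C++ `static` local pitfall described above (illustrative only; the real code resolves ATen ops in C++):

```python
_ops = {"aten::add": lambda a, b: a + b, "aten::mul": lambda a, b: a * b}
_cached_op = None  # plays the role of the C++ `static` local

def call_dispatcher_buggy(op_name, *args):
    global _cached_op
    if _cached_op is None:          # resolved only for the first op_name ever seen
        _cached_op = _ops[op_name]
    return _cached_op(*args)

print(call_dispatcher_buggy("aten::add", 2, 3))  # 5
print(call_dispatcher_buggy("aten::mul", 2, 3))  # still 5 -- the first op is reused, as described
```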

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149208
Approved by: https://github.com/albanD, https://github.com/zou3519, https://github.com/malfet
2025-03-15 01:47:11 +00:00
578160c875 [ca] don't inline accumulate grad op (#149014)
We use dummy tensors in our initial trace, so we should never inline. The subclass dispatch might not support the dummy tensor; e.g., DTensor accumulate grad will check that both param and grad are DTensors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149014
Approved by: https://github.com/jansel
ghstack dependencies: #149064
2025-03-15 01:10:54 +00:00
f4368d8872 [ca] clean up aot node deduping (#149064)
rename the AOT nodes as we copy paste them into the CA graph

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149064
Approved by: https://github.com/jansel
2025-03-15 01:10:54 +00:00
96795e9533 [BE] Parametrize TestMPS.test_binops_dtype_precedence (#149234)
No op change, just splits a longer tests into a series of a smaller ones
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149234
Approved by: https://github.com/atalman, https://github.com/dcci
ghstack dependencies: #149216, #149233
2025-03-15 00:37:11 +00:00
1c7196f04b Add new GHA workflow to cache ROCm CI docker images on MI300 CI runners periodically (#148394)
Refiling https://github.com/pytorch/pytorch/pull/148387 from pytorch repo branch to get AWS login via OIDC working

Successful docker caching run: https://github.com/pytorch/pytorch/actions/runs/13843689908/job/38737095535
Run without cached docker image: https://github.com/pytorch/pytorch/actions/runs/13843692637/job/38746033460
![image](https://github.com/user-attachments/assets/c410ff35-a150-4885-b904-3a5e1888c032)
Run with cached docker image:
![image](https://github.com/user-attachments/assets/41e417b5-a795-4ed2-a9cd-00151db8f813)
~6 min vs 3 s :)

Thanks @saienduri for the help on the MI300 infra side

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148394
Approved by: https://github.com/jeffdaily
2025-03-15 00:34:04 +00:00
9ad6265d04 [AOTI][XPU] Fix: model_container_runner_xpu.cpp is not built into libtorch_xpu.so (#149175)
The missing of model_container_runner_xpu.cpp will cause compilation failure when user build CPP inference application on XPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149175
Approved by: https://github.com/jansel
2025-03-15 00:30:04 +00:00
7537b19c73 [FSDP2] Update ignored_params docstring and add unit test (#149074)
Fixes https://github.com/pytorch/pytorch/issues/148242

ignored_params won't be moved to devices in full_shard(); update the docstring accordingly.
Add unit test `test_move_states_to_device_ignored_param_device` to show that ignored_params won't be moved during full_shard(), but will be after `model.cuda()`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149074
Approved by: https://github.com/awgu
2025-03-15 00:23:09 +00:00
09f7f62cfe Fix atomic operation compatibility for ARMv8-A (Raspberry Pi 4) by adjusting compilation flags (#148070)
**Issue:**
* The ldaddal instruction is an AArch64 atomic operation available from ARMv8.1-A onwards.
* Raspberry Pi 4 (Cortex-A72) is ARMv8-A, which does not support ldaddal, leading to failures when running PyTorch built with march=armv8.2-a+sve
* This led to an issue when running PyTorch on ARMv8-A (Raspberry Pi 4), as unsupported atomic operations were generated.

**Fix:**
* Updated the build flags to explicitly use **-march=armv8-a+sve**, ensuring GCC and Clang handle it correctly. This resolves the compatibility issue on ARMv8-A while still enabling SVE on supported hardware, as before.
* This ensures that PyTorch builds correctly for ARMv8-A platforms (e.g., Raspberry Pi 4) while still enabling SVE for supported hardware.

Test plan:
 - Allocate `a1.4xlarge` on AWS
 - Run following script using wheel produced by this PR
 ```python
import torch
def f(x):
    return x.sin() + x.cos()

print(torch.__version__)
f_c = torch.jit.script(f)
```
- Observe no crash
```
$ python3 foo.py
2.7.0.dev20250313+cpu
```
- Observe crash with 2.6.0
```
$ python3 foo.py
2.6.0+cpu
Illegal instruction (core dumped)
```

Fixes #146792

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148070
Approved by: https://github.com/malfet
2025-03-15 00:02:38 +00:00
08af311fc2 [MPS] Fix type promotion for torch.floor_divide (#149233)
And delete some duplicating glue code by relying on the stub
After this change, `torch.arange(10, device = 'mps') // torch.arange(10., device='mps')` will return a tensor of floats, which is the common dtype for a float + integral operation, rather than a tensor of ints.
Checked by `test_div2` inductor testing
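For reference, a quick check of the CPU type-promotion behavior that MPS now matches (illustrative, not from the PR):

```python
import torch

a = torch.arange(10)       # int64
b = torch.arange(1., 11.)  # float32
print(torch.result_type(a, b))  # torch.float32 -- the common dtype for an integral // float op
print((a // b).dtype)           # torch.float32 on CPU; MPS matches this after the change
```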

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149233
Approved by: https://github.com/atalman
ghstack dependencies: #149216
2025-03-15 00:00:42 +00:00
eb7bf4202d Make dynamism code robust to NotImplementedException (#148823)
In prod many models have `@property` methods that raise
NotImplementedError. This PR updates our dynamism code to be more robust
to these types of models.
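A minimal example of the pattern described (hypothetical model, not code from the PR): a `@property` that raises, and the guarded access a robust inspection path needs:

```python
import torch

class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(4, 4)

    @property
    def backbone(self):
        raise NotImplementedError("subclasses must provide a backbone")

    def forward(self, x):
        return self.linear(x)

m = Model()
# Robust attribute inspection must tolerate properties that raise:
try:
    _ = m.backbone
except NotImplementedError:
    pass  # skip this attribute instead of crashing the analysis
print(m(torch.randn(2, 4)).shape)  # torch.Size([2, 4])
```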

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148823
Approved by: https://github.com/laithsakka
2025-03-14 23:38:19 +00:00
ff58ccec6c [ATen-CPU] Add math.h for Gelu (#149164)
Summary:
## Context

This PR is mostly to enable ExecuTorch build for Windows: https://github.com/pytorch/executorch/pull/9198

In ExecuTorch, the optimized GeLU kernel calls the ATen implementation. However, on Windows `math.h` needs to be included with `#define _USE_MATH_DEFINES` in order for math constants to be defined.

Test Plan:
Rely on CI to make sure existing tests do not break. Tested separately with ExecuTorch to make sure Windows build is successful.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149164
Approved by: https://github.com/swolchok
2025-03-14 23:37:25 +00:00
f9b4856989 Revert "[pytree] add APIs to determine a class is a namedtuple or PyStructSequence (#113257)"
This reverts commit c95a6b416b4d1b830535f82e2719c055d077cbad.

Reverted https://github.com/pytorch/pytorch/pull/113257 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. @zou3519 can you please help land this internally? See the sigmoid tests in D71198793 for details. To validate the fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/113257#issuecomment-2725982539))
2025-03-14 23:13:34 +00:00
643aaea133 Revert "[RFC] First version of statically compiled launcher for triton compiled CUDA kernels (#148561)"
This reverts commit 5a843f8973d7fc6a601f089fc969d2a5ac7e5338.

Reverted https://github.com/pytorch/pytorch/pull/148561 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/148561#issuecomment-2725969268))
2025-03-14 23:01:26 +00:00
05f2cbfe19 Add meta function for out variants of ones,zeros,empty (#149098)
Open another PR to fix merge conflicts. Fixes https://github.com/pytorch/pytorch/issues/135832

For aten.ones, aten.zeros, followed this [link](https://docs.google.com/document/d/1GgvOe7C8_NVOMLOCwDaYV1mXXyHMXY7ExoewHqooxrs/edit?tab=t.0#heading=h.64r4npvq0w0) to register meta functions.

For aten.empty.out, followed this [part](https://docs.google.com/document/d/1GgvOe7C8_NVOMLOCwDaYV1mXXyHMXY7ExoewHqooxrs/edit?tab=t.0#heading=h.iy9lxhxhtl5v) to register a decomp for empty that handles the FakeTensor input.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149098
Approved by: https://github.com/williamwen42
2025-03-14 22:17:30 +00:00
d7d9a71e19 [MPSInductor] Add support for atan2 (#149216)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149216
Approved by: https://github.com/dcci
2025-03-14 21:53:03 +00:00
dd6e9df3d0 [MPS] fix attention enable_gqa crash on mps (#149147)
Fixes #149132

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149147
Approved by: https://github.com/malfet
2025-03-14 21:25:54 +00:00
0bd863a62f [MPS] Add inductor support for i1e. (#149221)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149221
Approved by: https://github.com/malfet
2025-03-14 21:18:38 +00:00
a0893475ba Enable oneDNN dispatch for gemm bf16bf16->bf16 (#148197)
Currently, `linear` layers using BF16 are dispatched to OpenBLAS, provided that sbgemm_ is available.
However, profiling on AArch64 shows that dispatching to oneDNN results in a significant speedup. This PR updates the dispatch logic to leverage oneDNN for improved performance.

Attaching some benchmark results. Instance: NeoverseV1., on 16 threads.

<img width="482" alt="Screenshot 2025-02-28 at 17 18 38" src="https://github.com/user-attachments/assets/b84e7455-af6e-417f-920d-bdd2bec2e8f9" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148197
Approved by: https://github.com/malfet
2025-03-14 20:58:24 +00:00
1bdbf12672 Update as strided doc (#149146)
Make it clearer why it is not recommended to use it and when the resulting Tensor will have undefined behavior.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149146
Approved by: https://github.com/gchanan, https://github.com/jbschlosser
2025-03-14 19:49:57 +00:00
69aeb87eca update error message in get_backend() more detail_ (#141796)
Fixes #ISSUE_NUMBER
When attempting to reconfigure the environment without properly handling the PyTorch-related settings, you may encounter the following message.
```
                             │ /root/.cache/pypoetry/virtualenvs/app-rag-sample-9TtSrW0h-py3.10/lib/python3.10/site-packages/torch/distributed/distribut │
                             │ ed_c10d.py:1215 in get_backend                                                                                            │
                             │                                                                                                                           │
                             │   1212 │   if _rank_not_in_group(pg):                                                                                     │
                             │   1213 │   │   raise ValueError("Invalid process group specified")                                                        │
                             │   1214 │   pg_store = _world.pg_map[pg] if pg in _world.pg_map else None                                                  │
                             │ ❱ 1215 │   return Backend(not_none(pg_store)[0])                                                                          │
                             │   1216                                                                                                                    │
                             │   1217                                                                                                                    │
                             │   1218 def _get_process_group_uid(pg: ProcessGroup) -> int:                                                               │
                             │                                                                                                                           │
                             │ /root/.cache/pypoetry/virtualenvs/app-rag-sample-9TtSrW0h-py3.10/lib/python3.10/site-packages/torch/utils/_typing_utils.p │
                             │ y:13 in not_none                                                                                                          │
                             │                                                                                                                           │
                             │   10                                                                                                                      │
                             │   11 def not_none(obj: Optional[T]) -> T:                                                                                 │
                             │   12 │   if obj is None:                                                                                                  │
                             │ ❱ 13 │   │   raise TypeError("Invariant encountered: value was None when it should not be")                               │
                             │   14 │   return obj                                                                                                       │
                             │   15                                                                                                                      │
                             ╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
                             TypeError: Invariant encountered: value was None when it should not be
Exception ignored in: <function Vllm.__del__ at 0x7f35f96b6dd0>
```
Since this message can confuse developers, this PR adds additional detail to the error message to help clarify the situation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141796
Approved by: https://github.com/kwen2501
2025-03-14 19:42:42 +00:00
5e79b61e8a add PrivateUse1 backend in fsdp collectives (#147260)
Add PrivateUse1 backend in FSDP collectives.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147260
Approved by: https://github.com/weifengpy
2025-03-14 19:41:41 +00:00
fe01af2242 [AOTI][debug logger] small fix for intermediate value debugger for jit when arg is not tensor (#149007)
repro:
```
import torch
import torch._inductor.config as config

config.aot_inductor.debug_intermediate_value_printer = "2"
config.aot_inductor.filtered_kernel_names = "triton_poi_fused__to_copy_add_0"

class Model(torch.nn.Module):
    def forward(self, x):
        x = x.to(torch.float)
        return x + 1

model = Model().cuda()
x = torch.randn(10).cuda().to(torch.float8_e4m3fn)
_ = torch.compile(model, fullgraph=True)(x)

print("done")
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149007
Approved by: https://github.com/jingsh
2025-03-14 19:40:41 +00:00
c96ed7e6f5 [BE]: No include left behind - recursive glob setuptools support (#148258)
Fixes #148256
Test Plan: check the printout from the setup.py build and verify the files are still included.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148258
Approved by: https://github.com/malfet, https://github.com/benjaminglass1
2025-03-14 19:39:21 +00:00
9d7945e382 [EZ] Fix typo in UnaryOps.mm (#149217)
s/imput/input/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149217
Approved by: https://github.com/ZainRizvi, https://github.com/dcci
2025-03-14 19:31:20 +00:00
a7f8de2198 Add nn.Bilinear param validation (#149018)
Fixes #103425

## Changes

- Add doc description size value `must be > 0`
- Add validation for `in1_features` param

Currently, only `in1_features` will cause a runtime error; adding checks for `in2_features` and `out_features` as well might be BC-breaking.

```python
import torch
from torch import nn

class lenet(nn.Module):
    def __init__(self):
        super(lenet, self).__init__()
        self.conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=5, stride=1)

        # Raises an error; with `in1_features=1, in2_features=0, out_features=0` there is no error
        self.linear = nn.Bilinear(in1_features=0, in2_features=0, out_features=0)

    def forward(self, x):
        # 1st block
        x = self.conv(x)
        x = self.linear(x)

        return x

if __name__ == '__main__':
    net = lenet()

```

## Test Result

```bash
pytest test/test_nn.py -k test_bilinear -vv
```

![image](https://github.com/user-attachments/assets/20617ba9-bac5-4db2-aecc-1831dbc8eb43)

![image](https://github.com/user-attachments/assets/401e4e1f-051a-4e1c-952b-48e85de64b0b)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149018
Approved by: https://github.com/mikaylagawarecki
2025-03-14 19:26:12 +00:00
5a843f8973 [RFC] First version of statically compiled launcher for triton compiled CUDA kernels (#148561)
Putting this up for a first pass review, though I will likely make a bunch of changes before landing to add more features, etc.

This diff implements a first version of a static CUDA kernel launcher in `torch._C`. The goal here is to take a cubin file and some metadata from a CompiledKernel from `triton`, and launch the cubin file directly.

Background doc: https://docs.google.com/document/d/1rjRcHl6MfauHG30nCoQX-9UKvKyIs4WWMy_GsGyqb9g/edit?tab=t.0#heading=h.ut5lf39lzq66

Normally, using triton's CompiledKernel.make_launcher(), we would pay the cost of codegenning C++ and running it at compile time. With this new approach, we can use one statically compiled library to launch the kernel.

The tradeoff here is that this new kernel launcher will not be able to use codegen to deal with different lengths/types of arguments. So we use templating to handle up to 10 arguments for now. We also allocate 8 bytes on the stack per argument no matter the argument type, which can take more memory than codegenning. On the other hand, we improve compile time on cold and warm start by not having to call the C++ compiler at all.

This diff does not add the launcher to torch, but introduces a basic test suite.

A list of TODOs that are not yet complete, will do in separate diff:
- Handle `nvTmaDesc` and `cuTensorMap`, which triton handles
- Embed the grid logic instead of passing in gridX,Y,Z. With https://github.com/pytorch/pytorch/pull/147583, we should be able to handle all of the grid logic directly in _StaticCudaLauncher.launch_kernel, and get rid of the python evaluation.
- Handle launch_enter and exit hooks? (Not sure if inductor has these)
- Benchmarking to see if there's runtime performance loss
- Hooking it up with a config to inductor
- Testing harness to test against torch generated triton kernels

Differential Revision: [D69926783](https://our.internmc.facebook.com/intern/diff/D69926783/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148561
Approved by: https://github.com/aorenste, https://github.com/syed-ahmed
2025-03-14 19:12:13 +00:00
97272e4b49 Fix torch.nn.functional.hardswish gradients corner case (#148049)
Fixes #147801

## Changes

- Change the hardswish gradient compute condition to match [torch.nn.functional.hardswish](https://pytorch.org/docs/stable/generated/torch.nn.functional.hardswish.html)
- Enable CUDA for test `test_hardswish_grad_corner`
- Add a test case for value=-3 (see the sketch after this list for the piecewise derivative involved)
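As background for the corner case, a sketch of the piecewise derivative involved (an illustration based on the hardswish definition, not the kernel code; the exact values chosen at x = ±3 are what the changed condition governs):

```python
import torch

def hardswish_grad_reference(x):
    # piecewise derivative of hardswish(x) = x * relu6(x + 3) / 6:
    #   x < -3      -> 0
    #   -3 < x < 3  -> (2x + 3) / 6
    #   x > 3       -> 1
    return torch.where(x < -3, torch.zeros_like(x),
           torch.where(x > 3, torch.ones_like(x), (2 * x + 3) / 6))

x = torch.tensor([-4.0, -3.0, 0.0, 3.0, 4.0], requires_grad=True)
torch.nn.functional.hardswish(x).sum().backward()
print(x.grad)                                # autograd values, including the x = -3 boundary
print(hardswish_grad_reference(x.detach()))  # reference values away from the boundaries
```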

## Test Result

```bash
pytest test/test_nn.py -k test_hardswish
pytest test/test_unary_ufuncs.py -k test_hardswish
pytest test/inductor/test_torchinductor.py -k test_hardswish
```

![image](https://github.com/user-attachments/assets/000cb5c4-15f5-4bfd-ab45-f52bf810ff3d)
![image](https://github.com/user-attachments/assets/38b08cf8-ea84-47a2-8e37-0a213da3e0c8)
![image](https://github.com/user-attachments/assets/54bc57be-2c57-46cc-ab90-94ea6cbe1c34)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148049
Approved by: https://github.com/soulitzer
2025-03-14 18:53:10 +00:00
2e02c07a5d [ROCm] enable HIPMallocAsyncAllocator (#149145)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149145
Approved by: https://github.com/jeffdaily
2025-03-14 18:21:27 +00:00
f2221b2fce [MPS] Add support for i1e (#149203)
Followup after https://github.com/pytorch/pytorch/pull/149174
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149203
Approved by: https://github.com/dcci
2025-03-14 17:33:52 +00:00
f067eafabb [MPS] Modify a test to test the correct function. (#149204)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149204
Approved by: https://github.com/malfet
2025-03-14 17:27:47 +00:00
42e468d9b0 [MPSInductor] Adjust check_bounds (#147205)
To make upper bound inclusive, which fixes `test_vectorized_ops_masked` and results in the following code
```python
mps_lib_0 = compile_mps_shader("""
    #include <c10/metal/random.h>
    #include <c10/metal/special_math.h>
    #include <c10/metal/utils.h>
    kernel void generated_kernel(
        device float* out_ptr0,
        constant float* in_ptr0,
        uint xindex [[thread_position_in_grid]]
    ) {
        int x0 = (xindex) % (64);
        int x1 = (xindex) / (64);
        auto tmp5 = in_ptr0[x0 + 63*x1];
        int x2 = xindex;
        auto tmp0 = x0;
        auto tmp1 = static_cast<long>(tmp0);
        auto tmp2 = 63;
        auto tmp3 = tmp1 < tmp2;
        if (x0 > 63) return;
        auto tmp6 = tmp3 ? tmp5 : 7;
        out_ptr0[x2] = static_cast<float>(tmp6);
    }
""")
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147205
Approved by: https://github.com/jansel, https://github.com/dcci
ghstack dependencies: #147211
2025-03-14 17:26:00 +00:00
cyy
a9aae05a6b Remove test decorations on MacOS 12 (#148942)
MacOS 12 may reach EOL, as from https://endoflife.date/macos
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148942
Approved by: https://github.com/malfet
2025-03-14 17:22:37 +00:00
f2ea77c099 [MPS] Add inductor support for i0e. (#149180)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149180
Approved by: https://github.com/malfet
2025-03-14 16:15:52 +00:00
71795f159e Revert "[AOTInductor] [BE] Add swap_constant_buffer into pybind for tests. (#149167)"
This reverts commit bea181ff7eeead9fcdd806e286846296c4ab2d67.

Reverted https://github.com/pytorch/pytorch/pull/149167 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. See D71177501 for the failure. To validate your fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/149167#issuecomment-2725001232))
2025-03-14 15:16:21 +00:00
706c22549c [MPS] Add support for i0e in eager. (#149174)
Add `special.i0e` to XFAIL_GRADLIST for now, as its backward op is not yet implemented
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149174
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-03-14 14:43:46 +00:00
68bbe20db7 Add test coverage (#149182)
Summary: Follow up from D71160718

Differential Revision: D71177037

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149182
Approved by: https://github.com/houseroad
2025-03-14 09:38:29 +00:00
c95a6b416b [pytree] add APIs to determine a class is a namedtuple or PyStructSequence (#113257)
Changes in this PR:

1. Add `is_structseq` and `is_structseq_class` functions to determine whether an object or a class is a PyStructSequence.
2. Add a generic class `structseq` which can be used as the registration key for PyStructSequence types, like `namedtuple` for named tuple types.
3. Change `is_namedtuple` to accept subclasses of namedtuple as namedtuples. Before this PR, only namedtuple classes directly created by `collections.namedtuple` or `typing.NamedTuple` were considered namedtuple classes, while their subclasses were not. This PR makes `is_namedtuple` return true for subclasses of a namedtuple class.

Resolves #75982. New tests are included in this PR.

- #75982
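For illustration, a minimal sketch of a namedtuple check that also accepts subclasses (a simplified assumption of the semantics, not the actual pytree implementation):

```python
import collections

def is_namedtuple_class(cls) -> bool:
    # namedtuple classes are tuple subclasses that carry a _fields attribute
    return isinstance(cls, type) and issubclass(cls, tuple) and hasattr(cls, "_fields")

Point = collections.namedtuple("Point", ["x", "y"])

class Point3D(Point):  # a subclass of a namedtuple class
    pass

assert is_namedtuple_class(Point)
assert is_namedtuple_class(Point3D)  # accepted, matching the behavior change described above
```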

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113257
Approved by: https://github.com/zou3519
2025-03-14 08:50:30 +00:00
05ac99042f Clean up grid in execution trace (#149159)
Summary: This diff https://www.internalfb.com/diff/D70471332 removed the input "grid" when calling a Triton kernel. The PyTorch execution trace needs to make the corresponding change; this covers both capturing and replaying the ET.

Test Plan:
buck2 run mode/opt caffe2/test:test_profiler_cuda  -- profiler.test_execution_trace.TestExecutionTraceCUDA.test_execution_trace_with_pt2_cuda

buck2 run mode/opt param_bench/fb/integration_tests:test_et_replay

Differential Revision: D71152464

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149159
Approved by: https://github.com/sraikund16, https://github.com/jansel
2025-03-14 07:12:16 +00:00
be4e6c1c8e Revert "[MPS] Add support for i0e in eager. (#149174)"
This reverts commit b4745db90482ff139ea62d06ec0a18468e1131b7.

Reverted https://github.com/pytorch/pytorch/pull/149174 on behalf of https://github.com/malfet due to MPS are red on trunk ([comment](https://github.com/pytorch/pytorch/pull/149174#issuecomment-2723774600))
2025-03-14 06:35:01 +00:00
e162758051 [MPSInductor] Add bessel_[jy][01] ops (#149179)
By simply calling corresponding special functions

Followup TODO: tweak bessel_y0 to match CPU implementation for `torch.half` dtype

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149179
Approved by: https://github.com/dcci
ghstack dependencies: #149123
2025-03-14 06:33:30 +00:00
d4496346b9 Update logic when producing key name for keep_original_weights (#149171)
Differential Revision: D71160718

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149171
Approved by: https://github.com/houseroad
2025-03-14 05:29:54 +00:00
db6d72213b [MPS] Add torch.special.bessel_[jy][01] implementations (#149123)
By copy-n-pasting functions from
f59064f2b7/aten/src/ATen/native/cuda/Math.cuh (L1463)

With an  ugly workaround for `bessel_y[01]` to avoid internal compiler exception on M1/M2 machines (see FB16863363 /  https://gist.github.com/malfet/e7785e4b572e7740887a83a2386ef769 )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149123
Approved by: https://github.com/Skylion007, https://github.com/dcci
2025-03-14 05:13:55 +00:00
e6839819c8 Revert "[ROCm] Input vectorization in elementwise kernels for tensors with heterogeneous types (#147527)"
This reverts commit 4f8391db55c8c3a574d61d99d6d6a4a0b6723acb.

Reverted https://github.com/pytorch/pytorch/pull/147527 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally.  @albanD, would you be able to help them land the fixes internally? The error looks really simple. See D71152448 for details. To validate the fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/147527#issuecomment-2723531085))
2025-03-14 05:11:01 +00:00
9e6b2ca58d Fix sympy float printing (#147552)
Fixes https://github.com/pytorch/pytorch/pull/147261
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147552
Approved by: https://github.com/bobrenjc93, https://github.com/cyyever
2025-03-14 05:07:06 +00:00
bea181ff7e [AOTInductor] [BE] Add swap_constant_buffer into pybind for tests. (#149167)
Summary:
We expose swap_constant_buffer through pybind so that we can add tests.

Test Plan:
python test/inductor/test_aot_inductor.py -k test_update_inactive_constant_buffer

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149167
Approved by: https://github.com/chenyang78, https://github.com/jingsh
2025-03-14 04:12:48 +00:00
e567900998 [AOTInductor] Activate CPU test for update_constant_buffer (#149162)
Summary:
Fixed by #145459

Test Plan:
Re-activating tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149162
Approved by: https://github.com/chenyang78, https://github.com/jingsh
2025-03-14 04:09:57 +00:00
aed0b7a742 [c10d] Add param recording for uniqueID broadcasting and allgather (#149166)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149166
Approved by: https://github.com/kwen2501
2025-03-14 03:51:30 +00:00
b4745db904 [MPS] Add support for i0e in eager. (#149174)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149174
Approved by: https://github.com/malfet
2025-03-14 02:51:28 +00:00
c179971bfc xpu: update filter out of dg2 AOT target (#148677)
torch-xpu-ops has updated its list of AOT targets and now uses `dg2` instead of `dg2-g10`. This requires an update in cpp_extension.py, which currently filters out `dg2-`-prefixed AOT targets.

CC: @gujinghui @EikanWang @fengyuan14 @guangyey @jgong5
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148677
Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/albanD
2025-03-14 02:24:06 +00:00
56b2e4b8f0 ci: Update linux.20_04 --> linux.24_04 (#149142)
Ubuntu 20.04 is getting deprecated soon, so we might as well proactively
move to the latest LTS, which is 24.04.

> [!NOTE]
> The oldest supported version of python on 24.04 is Python 3.8. Since we test for Python 3.6 compat in our collect_env test we need to have this particular job stick with 20.04 for now until we decide to upgrade it to a newer python version.

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149142
Approved by: https://github.com/atalman, https://github.com/wdvr
2025-03-14 02:20:10 +00:00
cyy
e66ad221e9 Use std::string_view in get_fully_qualified_type_name (#145197)
The same as #139164 but open a new PR due to messy history there.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145197
Approved by: https://github.com/r-barnes
2025-03-14 01:58:35 +00:00
e8d36019d4 [c10d] Make getDefaultBackend more fault tolerant without relying on exceptions (#149152)
Summary: no-except builds are terminating when this exception is thrown. We should proactively check if a backend is available before calling has_hooks, instead of trying and failing.

Test Plan: CI

Differential Revision: D71144456

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149152
Approved by: https://github.com/kwen2501
2025-03-14 01:27:52 +00:00
15cd6921a5 [export] Fix tensor_constant and buffer naming conflicts in TS converter (#148803)
Summary: In the TS converter, tensor constants are traced as BUFFER and later converted back to CONSTANT_TENSOR, so we need to prevent naming conflicts during the lift-constants pass.

Test Plan: CI

Differential Revision: D70826426

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148803
Approved by: https://github.com/angelayi
2025-03-14 00:38:12 +00:00
49570cb402 Revert "Split up cub-RadixSortPairs.cu to parallelize compilation (#148936)"
This reverts commit 9a3d26cfcdb1c1be84a04baa3ee554dbe67cb049.

Reverted https://github.com/pytorch/pytorch/pull/148936 on behalf of https://github.com/ZainRizvi due to Breaks lint in trunk [GH job link](https://github.com/pytorch/pytorch/actions/runs/13845459825/job/38742803351) [HUD commit link](9a3d26cfcd) ([comment](https://github.com/pytorch/pytorch/pull/148936#issuecomment-2722853628))
2025-03-13 22:54:33 +00:00
4cae8f48cc [ROCm] Improve softmax performance (#149076)
This patch improves the performance of softmax for 2D tensors by:

- using a softmax calculation that eliminates the growth of shared memory usage with tensor size: tensor data is read from global memory, while shared memory is still used for the actual reduction step (the shared memory used for the reduction is constant and does not grow with tensor size).
- for the final computation, replacing the division by the sum with a multiplication by 1/sum; the 1/sum is computed as the last step of the warp reduction.
- replacing the use of the exp function with the __expf function.

The impact on numerical accuracy is within 1e-5 for half precision and 1e-7 for full precision.

The impact on performance on MI300X is a 22% to 50% improvement over current runtimes.
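
A minimal Python/PyTorch reference of the reformulation described above (the actual change lives in the ROCm kernel, which also uses the `__expf` intrinsic and warp-level reductions; this sketch only mirrors the math):
```python
import torch

def softmax_rows(x: torch.Tensor) -> torch.Tensor:
    m = x.max(dim=-1, keepdim=True).values              # row max for numerical stability
    e = torch.exp(x - m)                                 # the kernel uses the fast __expf here
    inv_sum = e.sum(dim=-1, keepdim=True).reciprocal()   # 1/sum computed once per row
    return e * inv_sum                                   # multiply by 1/sum instead of dividing

x = torch.randn(4, 1024)
assert torch.allclose(softmax_rows(x), torch.softmax(x, dim=-1), atol=1e-6)
```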

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149076
Approved by: https://github.com/jeffdaily
2025-03-13 22:07:28 +00:00
9a3d26cfcd Split up cub-RadixSortPairs.cu to parallelize compilation (#148936)
Summary: `cub-RadixSortPairs.cu` has slow compilation times, especially on Windows. These changes split up the file into smaller components to allow each component to compile in parallel. On Windows, I observed a compile time drop from about 20 minutes to 6 minutes.

Differential Revision: D70539649

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148936
Approved by: https://github.com/suo, https://github.com/eqy
2025-03-13 22:02:05 +00:00
4098a229a0 Add back fake class registration to test_torchbind (#149137)
Fixes #149121

Summary: as title, to fix https://github.com/pytorch/pytorch/issues/149121

Test Plan:
```
 python test/export/test_torchbind.py
```

Differential Revision: D71129321

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149137
Approved by: https://github.com/yiming0416
2025-03-13 21:26:37 +00:00
e5fccb2bab [pytorch] Fix duplicated Malloc/Free insertation when using IRBuilderBase::CreateMalloc/CreateFree in LLVM 18+ (#149058)
Summary:
PyTorch unit tests hang when jitting the Tensor kernel. The problem exists for LLVM version >= 18 due to this upstream change: 45bb45f2ae

`IRBuilderBase::CreateCall` inserts the instruction into the BasicBlock by default, so we don't need to explicitly insert the instruction when compiling the tensor kernel.

Test Plan:
## Test with the release toolchain
```
buck test 'mode/dev' //caffe2/test:jit -- --exact 'caffe2/test:jit - test_concat_invariant (test_jit_fuser_te.TestTEFuserDynamic)'
```
## Test with the Buckified toolchain
Apply this D71046097 to select the LLVM libraries.
```
# Build tests
buck build 'mode/dev-asan' //caffe2/test:jit --show-output
```
```
# Run test (Change HASH and paths accordingly)
HASH="b755f1c435832a1e"

ENABLE_FLATBUFFER=0 FB_OVERRIDE_PYBIND11_GIL_INCREF_DECREF_CHECK=1 MKL_NUM_THREADS=1 NO_MULTIPROCESSING_SPAWN=0 OMP_NUM_THREADS=1 PYTORCH_TEST=1 PYTORCH_TEST_FBCODE=1 PYTORCH_TEST_WITH_ASAN=1 PYTORCH_TEST_WITH_DEV_DBG_ASAN=1 PYTORCH_TEST_WITH_TSAN=0 PYTORCH_TEST_WITH_UBSAN=1 SKIP_TEST_BOTTLENECK=1 TENSORPIPE_TLS_DATACENTER=test_dc TEST_PILOT=True TPX_IS_TEST_EXECUTION=true TPX_TIMEOUT_SEC=6000 \
buck-out/v2/gen/$HASH/caffe2/test/__jit__/jit.par --test-filter test_jit_fuser_te.TestTEFuserDynamic.test_concat_invariant
```

Differential Revision: D71046799

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149058
Approved by: https://github.com/dcci, https://github.com/Skylion007
2025-03-13 20:37:47 +00:00
38e81a5332 [ROCm] Use generated CK config.h rather than system (#147993)
Prevents PyTorch from potentially using the system version of config.h and instead prioritizes the CK submodule's version.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147993
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-03-13 20:04:12 +00:00
4f8391db55 [ROCm] Input vectorization in elementwise kernels for tensors with heterogeneous types (#147527)
This patch exemplifies its use for input tensors with types (float,bfloat16) when functor type is float(float,float).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147527
Approved by: https://github.com/jeffdaily

Co-authored-by: Hashem Hashemi <hashem.hashemi@amd.com>
2025-03-13 19:56:26 +00:00
0dcd482e54 [SDPA] Respect sdpa_kernel's priority_order setting in torch.compile (#147768)
https://github.com/pytorch/pytorch/pull/140467 added the option to specify a priority order for SDPA, but the `torch.compile` path silently ignored this setting, as I wasn't aware of the separate context-manager handling under `torch.compile`.
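
A hedged usage sketch of the behavior this fixes; the `set_priority` keyword name is an assumption based on #140467, so check `torch.nn.attention.sdpa_kernel` in your build for the exact signature (requires a CUDA build):
```python
import torch
from torch.nn.attention import SDPBackend, sdpa_kernel

def attn(q, k, v):
    return torch.nn.functional.scaled_dot_product_attention(q, k, v)

compiled_attn = torch.compile(attn)
q = k = v = torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16)

# The listed backends are treated as a priority order; with this fix, the
# torch.compile path respects it instead of silently ignoring the setting.
with sdpa_kernel([SDPBackend.CUDNN_ATTENTION, SDPBackend.FLASH_ATTENTION], set_priority=True):
    out = compiled_attn(q, k, v)
```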

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147768
Approved by: https://github.com/drisspg
2025-03-13 18:52:34 +00:00
5e1b715dda BC fix for AOTIModelPackageLoader() constructor defaults (#149082)
The default value for `run_single_threaded` was wrongly specified in the .cpp file instead of the header, breaking C++-side instantiation of `AOTIModelPackageLoader` with no arguments. This PR fixes this and adds a test for the use case of running with `AOTIModelPackageLoader` instead of `AOTIModelContainerRunner` on the C++ side.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149082
Approved by: https://github.com/desertfire
2025-03-13 18:40:53 +00:00
cyy
970fefcc53 Remove outdated skipCUDAIfCudnnVersionLessThan decoration (#148940)
Test conditions for CUDNN 7 and 8 were removed because we have moved to CUDNN 9.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148940
Approved by: https://github.com/mikaylagawarecki
2025-03-13 18:02:50 +00:00
c73c72b1e1 ci: Update linux_job references to v2 (#149102)
This is probably a bit overdue but trying to update these so we can
finally get rid of all the remnants that rely on non-manylinux2_28 stuff
and conda stuff

Signed-off-by: Eli Uriegas <github@terriblecode.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149102
Approved by: https://github.com/Skylion007, https://github.com/atalman, https://github.com/malfet
ghstack dependencies: #149104
2025-03-13 17:31:55 +00:00
77ea66695a ci: Fix check_binary gcc abi check (#149104)
All of our binaries should be built with the cxx11-abi now, so let's fix
this check to reflect reality.

I also noticed that this particular script is not used widely since this
issue should've been caught in nightlies a long time ago.

Maybe worth an investigation to just remove this script if it's not
actually being used.

Signed-off-by: Eli Uriegas <github@terriblecode.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149104
Approved by: https://github.com/Skylion007, https://github.com/atalman, https://github.com/malfet
2025-03-13 17:31:55 +00:00
7c87ec1b50 [ca] always do initial trace with dynamic shapes (#148801)
HUD: https://fburl.com/wzvx6tax no regressions (ignore the pass rate improvements, those come from #149030)
<img width="864" alt="image" src="https://github.com/user-attachments/assets/d7598f98-b378-4abb-a0c7-e4311162f681" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148801
Approved by: https://github.com/jansel
ghstack dependencies: #148799, #149030
2025-03-13 17:30:29 +00:00
b263b272fa [ca] fix lazily compiled aot bwd (#149030)
FIXES https://github.com/pytorch/pytorch/issues/137372

Sometimes the AOT backward is lowered lazily, so the bw_module we saved in CompiledFunction._lazy_backward_info hasn't gone through post-grad passes, specifically the view_to_reshape pass. Running it directly will then sometimes error, because the AOT forward has already changed its views to reshapes, and that is reflected in the gradients we see in CA.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149030
Approved by: https://github.com/bdhirsh
ghstack dependencies: #148799
2025-03-13 17:30:29 +00:00
e6f560a262 [ca] support for dynamic shapes CopySlices (#148799)
I'm changing the CA initial trace to always trace as dynamic; this fixes these errors:
```python
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
FAILED [0.2139s] test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_autograd_python_custom_function_inplace - RuntimeError: !has_symbolic_sizes_strides_ INTERNAL ASSERT FAILED at "/home/xmfan/core/a/pytorch/aten/src/ATen/TensorGeometry.h":63, please report a bug to PyTorch.
To execute this test, run the following from the base repo dir:
    python test/test_autograd.py TestAutogradWithCompiledAutograd.test_autograd_python_custom_function_inplace
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
FAILED [0.0057s] test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_copy_slices_graph_task_updates - RuntimeError: !has_symbolic_sizes_strides_ INTERNAL ASSERT FAILED at "/home/xmfan/core/a/pytorch/aten/src/ATen/TensorGeometry.h":63, please report a bug to PyTorch.
To execute this test, run the following from the base repo dir:
    python test/test_autograd.py TestAutogradWithCompiledAutograd.test_copy_slices_graph_task_updates
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
FAILED [0.9662s] test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_inplace_on_view_weak_grad_fn - RuntimeError: !has_symbolic_sizes_strides_ INTERNAL ASSERT FAILED at "/home/xmfan/core/a/pytorch/aten/src/ATen/TensorGeometry.h":63, please report a bug to PyTorch.
To execute this test, run the following from the base repo dir:
    python test/test_autograd.py TestAutogradWithCompiledAutograd.test_inplace_on_view_weak_grad_fn
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
FAILED [0.0077s] test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_leaf_assignment - RuntimeError: !has_symbolic_sizes_strides_ INTERNAL ASSERT FAILED at "/home/xmfan/core/a/pytorch/aten/src/ATen/TensorGeometry.h":63, please report a bug to PyTorch.
To execute this test, run the following from the base repo dir:
    python test/test_autograd.py TestAutogradWithCompiledAutograd.test_leaf_assignment
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
FAILED [5.0485s] test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_setitem_mask - RuntimeError: !has_symbolic_sizes_strides_ INTERNAL ASSERT FAILED at "/home/xmfan/core/a/pytorch/aten/src/ATen/TensorGeometry.h":63, please report a bug to PyTorch.
To execute this test, run the following from the base repo dir:
    python test/test_autograd.py TestAutogradWithCompiledAutograd.test_setitem_mask
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
FAILED [0.0102s] test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_tensor_hooks_inplace_over_view - RuntimeError: !has_symbolic_sizes_strides_ INTERNAL ASSERT FAILED at "/home/xmfan/core/a/pytorch/aten/src/ATen/TensorGeometry.h":63, please report a bug to PyTorch.
To execute this test, run the following from the base repo dir:
    python test/test_autograd.py TestAutogradWithCompiledAutograd.test_tensor_hooks_inplace_over_view
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148799
Approved by: https://github.com/jansel, https://github.com/zou3519
2025-03-13 17:30:20 +00:00
e84cc4c052 Update Kineto Submodule (#149089)
Summary: We have made a lot of changes in Kineto this month. It is a good idea to update the submodule now, especially since the roctracer-sdk change will be very large.

Test Plan: CI

Differential Revision: D71082829

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149089
Approved by: https://github.com/Skylion007
2025-03-13 17:18:16 +00:00
6856d81c60 [BE]: Update CU128 cudnn to 9.8.0.87 (#148963)
Also, cu12.6 is on an old cuDNN version; we may want to upgrade it for all the performance reasons, as I don't see a manywheel linux reason to stay back on the old 9.5 release. I might split that into its own PR. This one just updates CU126 to the latest and greatest.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148963
Approved by: https://github.com/jansel, https://github.com/eqy, https://github.com/nWEIdia, https://github.com/tinglvv, https://github.com/atalman
2025-03-13 16:59:12 +00:00
b9803a5c81 [AOTI] Re-enable AOTI cpp unit test (#149085)
Summary: test_inductor_aoti was removed by accident previously. Add it back.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149085
Approved by: https://github.com/jbschlosser
2025-03-13 16:00:38 +00:00
3e605fe46d [CUDAGraph] Graph Partition (#147648)
This PR implements cudagraph partition, following previous PR on inductor graph partition (#147038). Since there are many ops that cudagraph cannot support, this PR focuses on `cpu ops` and will add more partition rules in the next PR.

## Example
```python
import torch

torch._inductor.config.graph_partition = True

def f(x, y):
    x1 = x + 1
    y1 = y + 1
    y_cpu = y1.cpu() + 1
    z = x @ y
    return x1 + y1 + z + y_cpu.cuda()

x, y = [torch.ones(2, 2, device="cuda") for _ in range(2)]
x_cloned, y_cloned = [tmp.clone() for tmp in [x,y]]
eager_out = f(x, y)

f_compiled = torch.compile(f, mode="reduce-overhead")

for _ in range(5):
    compiled_out = f_compiled(x_cloned, y_cloned)
    assert torch.allclose(eager_out, compiled_out)
```

w/o graph partition, we will skip cudagraph:
```
skipping cudagraphs due to skipping cudagraphs due to cpu device (device_put). Found from :
   File "/home/boyuan/playground/cudagraph/graph_partition/graph_partition.py", line 9, in f
    y_cpu = y1.cpu() + 1 # 3
```

w/ graph partition, we can see two cudagraphify under the same torch-compiled region:
![image](https://github.com/user-attachments/assets/4e22d428-2687-433d-b92a-0814a2201b25)

## Design

PR #147038 splits `def call(args)` function into multiple `def partition_id(args)`. In this PR, we use `recursively_apply_fns()` to wrap each `partition_id()` function with `cudagraphify`. One major design point is, `cudagraphify` takes metadata such as static_input_idxs and we need to provide such metadata for each graph partition. However, we previously only have such metadata for the original graph instead of graph partitions.

The [idea](https://github.com/pytorch/pytorch/pull/147038#discussion_r1964124800) is:
- compute a mapping from the partition metadata (e.g., input/output idx) to the graph metadata, stored in `GraphPartitionMap`.
- during post_compile, get the `CudagraphMetadata` for each partition based on the graph-level metadata and `GraphPartitionMap`, via `get_partition_cudagraph_metadata()`.
- finally, in `cudagraph_partition_pos_compile`, we compute the `CudagraphMetadata` and apply cudagraphify for each graph via `recursively_apply_fns`.

#### Q: How does it work with codecache?

While we have multiple graph partitions, we still have 1 file and 1 `call` function for 1 dynamo graph. The major difference is we need to additionally load a `recursively_apply_fns()` for graph partition. We also add `partition_maps: Optional[list[GraphPartitionMap]]` to `CompiledFxGraph` so it will be serialized and could be deserialized later.

## Edge Case 1
PyTorch has an assumption on input/output orders. For example, backward inputs take saved tensors first and then tangents. In graph partition, we respect such orders via `graph_partition_signature_reorder`.

## Edge Case 2
Cudagraphifying `call` function gives 2 cudagraph managed tensors `buf0` and `primals_1`. However, cudagraphifying `partition_0` gives only 1 cudagraph managed tensor `buf0`. This leads to a semantic difference between cudagraph w/ and w/o graph partition. [full code comparison](https://www.internalfb.com/intern/diffing/?paste_number=1747654420)

![image](https://github.com/user-attachments/assets/03d08ce0-f1d1-4d1d-8432-805a07e1dd40)

To achieve the same semantics, we return an input tensor as an output if it is not freed in a graph partition. This allows more cudagraph managed tensors and is important for handling saved tensors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147648
Approved by: https://github.com/eellison
2025-03-13 16:00:21 +00:00
65d19a5699 Remove runtime dependency on packaging (#149092)
Looks like after https://github.com/pytorch/pytorch/pull/148924
we are seeing this error in the nightly test:
https://github.com/pytorch/pytorch/actions/runs/13806023728/job/38616861623

```
  File "/Users/runner/work/_temp/anaconda/envs/test_conda_env/lib/python3.13/site-packages/torch/_inductor/pattern_matcher.py", line 79, in <module>
    from .lowering import fallback_node_due_to_unsupported_type
  File "/Users/runner/work/_temp/anaconda/envs/test_conda_env/lib/python3.13/site-packages/torch/_inductor/lowering.py", line 7024, in <module>
    from . import kernel
  File "/Users/runner/work/_temp/anaconda/envs/test_conda_env/lib/python3.13/site-packages/torch/_inductor/kernel/__init__.py", line 1, in <module>
    from . import mm, mm_common, mm_plus_mm
  File "/Users/runner/work/_temp/anaconda/envs/test_conda_env/lib/python3.13/site-packages/torch/_inductor/kernel/mm.py", line 6, in <module>
    from packaging.version import Version
ModuleNotFoundError: No module named 'packaging'
```

Hence, remove the runtime dependency on packaging, since it may not be installed by default.
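
A dependency-free sketch of version comparison that avoids importing `packaging` at runtime; this is illustrative only and not necessarily the replacement used in the PR:
```python
def version_tuple(v: str) -> tuple:
    # Keep only the leading numeric release segment, e.g. "12.4.1+cu124" -> (12, 4, 1).
    parts = []
    for piece in v.split("+")[0].split("."):
        digits = "".join(ch for ch in piece if ch.isdigit())
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)

# Unlike naive string comparison, this orders versions numerically.
assert version_tuple("12.4") < version_tuple("12.10")
```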

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149092
Approved by: https://github.com/drisspg, https://github.com/davidberard98
2025-03-13 14:53:13 +00:00
f59064f2b7 [FIX] remove the duplicate key in DEFAULT_STATIC_QUANT_MODULE_MAPPINGS (#149043)
nn.Dropout appeared at line 81
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149043
Approved by: https://github.com/jingsh
2025-03-13 12:42:33 +00:00
bdf57fb8f7 [AOTI][refactor] Split MiniArrayRef into a separate header (#149073)
Summary: MiniArrayRef is a common utility and will be used by the libtorch-free AOTI.

Differential Revision: [D71064657](https://our.internmc.facebook.com/intern/diff/D71064657)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149073
Approved by: https://github.com/yushangdi
2025-03-13 11:57:32 +00:00
a8b1767ae5 [DTensor] Fix local_map with multi-threading (#149070)
Using `nonlocal device_mesh` is not safe with multi-threading
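
A small illustrative sketch (not the DTensor code) of why state captured with `nonlocal` is unsafe under multi-threading, compared with passing the value through arguments:
```python
import threading

def make_unsafe():
    current_mesh = None
    def run(mesh):
        nonlocal current_mesh
        current_mesh = mesh      # one cell shared by every thread using this closure
        # ... other work here gives another thread a chance to overwrite the cell ...
        return current_mesh      # may no longer be the caller's mesh
    return run

def run_safe(mesh):
    return mesh                  # no shared mutable capture; each call is independent

unsafe = make_unsafe()
ok = []
def worker(i):
    ok.append(unsafe(i) == i)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# With enough interleaving, some entries in `ok` can be False for the unsafe closure,
# while run_safe(i) == i always holds.
assert run_safe(3) == 3
```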

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149070
Approved by: https://github.com/wanchaol
2025-03-13 10:58:59 +00:00
df60500ab8 Fix too big to optimize in test, actually use O0 when aot_inductor.compile_wrapper_with_O0 is set (#148714)
Summary:
1. Check against the "0" char instead

2. We got the following error when using anything other than the O0 flag: `error: Function ZN5torch12aot_inductorL22__check_inputs_outputsEPP16AtenTensorOpaqueS3 is too big to optimize [-Werror,-Wignored-optimization-argument]`. So we use the O0 flag in the wrapper code when `aot_inductor.compile_wrapper_opt_level` is set to `O0`.

Test Plan:
```
 buck run  'fbcode//mode/opt' fbcode//deeplearning/aot_inductor/cpu/test:ads_second_stage_dsnn_models_aoti_lowering_test -- -r AdsSecondStageDSNNModelsAOTILoweringTest
```

Differential Revision: D70670957

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148714
Approved by: https://github.com/desertfire
2025-03-13 10:22:06 +00:00
96a6a71ac7 skip test_torch_dynamo_codegen_pow if CPU backend is not cpp (#146595)
The test asserts that `aten.pow` is not present in the generated kernel code. When using a CPU backend other than cpp, the kernel contains comments referencing the aten ops that produced it, in this case `aten.pow`.

This PR skips that test case if the CPU backend is not cpp.
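
A hedged sketch of the skip condition; the `cpu_backend` config attribute is an assumption here (read defensively with `getattr`), and the real test may gate on a different helper:
```python
import unittest
import torch
import torch._inductor.config as inductor_config

class TestPowCodegen(unittest.TestCase):
    @unittest.skipIf(
        getattr(inductor_config, "cpu_backend", "cpp") != "cpp",
        "non-cpp CPU backends embed aten op names in kernel comments",
    )
    def test_no_aten_pow_in_kernel(self):
        compiled = torch.compile(lambda x: x ** 2)
        compiled(torch.randn(8))  # the real test also inspects the generated kernel source

if __name__ == "__main__":
    unittest.main()
```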

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146595
Approved by: https://github.com/williamwen42
2025-03-13 10:03:29 +00:00
d90f9e9a34 [inductor] Fix issue with set_linter, improve linter framework (#144620)
### `set_linter` only

* Fix gnarly [bug](dbed747aae/tools/test/set_linter_testdata/python_code.py.txt.python (L42)) which would have garbled Python files involving sets contained in sets.
* Better handling of new Python3.12 token types

### Both linters.

* Recover from and report on unparseable Python files
* Remove `ParseError.check()` (it made it harder to read the code)
* FileLinter is now generic on `PythonFile`

### Notes

As I started working on new docstring features, I found a nasty bug and an edge case bug in set linter, and realized both the linters crash when there is a badly-formed Python file in the repo.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144620
Approved by: https://github.com/amjames, https://github.com/jansel
2025-03-13 09:49:40 +00:00
f4bffb7461 [docs] fix autograd description on convex function case (#148658)
The sub-gradient of minimum norm is the least steep descent direction.

```python
import torch

x = torch.tensor([-2, -1, 0, 1, 2.], requires_grad=True)
torch.relu(x).sum().backward()
print(x.grad) # tensor([0., 0., 0., 1., 1.])

y = torch.tensor([-2, -1, 0, 1, 2.], requires_grad=True)
torch.abs(y).sum().backward()
print(y.grad) # tensor([-1., -1.,  0.,  1.,  1.])
```

(How can I request a reviewer? I don't have the button on the right)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148658
Approved by: https://github.com/lezcano
2025-03-13 09:06:15 +00:00
75c8b7d972 [Profiler][HPU] Fix incorrect availabilities for HPU (#148663)
Fixes #148661

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148663
Approved by: https://github.com/jeromean, https://github.com/albanD
2025-03-13 08:03:52 +00:00
eqy
ec93aa7f84 fix cuDNN SDPA meta registration (#148921)
Update `cuDNN SDPA` meta registration to matching memory layout behavior in: https://github.com/pytorch/pytorch/pull/138354

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148921
Approved by: https://github.com/drisspg, https://github.com/jbschlosser
2025-03-13 07:33:16 +00:00
2a7d583452 Consolidate torchbind fake class registration (#149063)
Summary: Remove duplicated fake class registration

Test Plan: CI

Differential Revision: D71052419

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149063
Approved by: https://github.com/angelayi
2025-03-13 06:57:13 +00:00
c208f21791 [Dynamo] Replace unimplemented with unimplemented_v2 in torch/_dynamo/variables/base.py (#148177)
Part of #147913

Replace `unimplemented` with `unimplemented_v2` in `torch/_dynamo/variables/base.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148177
Approved by: https://github.com/williamwen42
2025-03-13 06:35:51 +00:00
037d7af778 [Inductor UT] Enable PYTORCH_TESTING_DEVICE_ONLY_FOR test case filter for test_torchinductor.py (#149023)
The environment variable PYTORCH_TESTING_DEVICE_ONLY_FOR controls the devices
in get_desired_device_type_test_bases, so we add RUN_CPU and RUN_GPU to
make sure cases are only enabled for the devices specified in PYTORCH_TESTING_DEVICE_ONLY_FOR,
e.g. only enable GPU cases and not CPU cases, even when HAS_CPU is true.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149023
Approved by: https://github.com/jansel, https://github.com/cyyever
2025-03-13 05:15:28 +00:00
7cdbb913e7 [logging] Set compile_id in the CachingAutotuner during compilation so we have it for dynamo_timed logging (#148693)
Summary: This is a simpler alternative to https://github.com/pytorch/pytorch/pull/146455, where we can stick the compileId (and forward/backward bool) in the CachingAutotuner so that we have it for logging `benchmark_all_configs`. Recall that the first attempt put the compileId in the inductor_meta and that interfered with caching.

Test Plan:
`python benchmarks/dynamo/torchbench.py --performance --training --amp --backend inductor --device cuda --print-compilation-time --repeat 5 --cold-start-latency --only nanogpt`
* tlparse: https://fburl.com/e71yn6uc
* dynamo_compile: https://fburl.com/scuba/dynamo_compile/sandbox/4ageghhv
* pt2_compile_events: https://fburl.com/scuba/pt2_compile_events/4fgv1itq

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148693
Approved by: https://github.com/eellison
2025-03-13 03:50:58 +00:00
3646d4dbc8 [partitioner] always ban compiler-driven recompute of collectives by default (#147561)
This should fix the hang in https://fb.workplace.com/groups/1075192433118967/permalink/1603268720311333/

The argument here is that:

(1) in general, it is not safe for the partitioner to sometimes choose to recompute collectives in the backward. Why? If we are running a distributed job, where many ranks are compiling at the same time, we need every rank to make a consistent decision about which collectives are recomputed for backward. If we let each compiler instance make its own choice without any cross-rank communication, they can make different choices and cause NCCL hangs (see the link above)

(2) later on, we'll want an `spmd_mode` flag that causes the compiler to issue collectives and communicate info across ranks. Once we have such a config, then turning it on should make it safe for the partitioner to potentially choose to recompute collectives (and agree on the binary "recompute-or-save" choice across all ranks)

(3) even without an `spmd_mode`, users can override this choice by using `torch.utils.checkpoint()` in their user code. User checkpointing generally always overrides the partitioner, and this should be safe because we expect the user to apply checkpointing consistently across ranks
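
A minimal sketch of point (3): user-driven activation checkpointing still forces recompute of the wrapped region, and because users apply it the same way on every rank, the recompute decision stays consistent. The collective inside the block is only imagined here for illustration:
```python
import torch
from torch.utils.checkpoint import checkpoint

def block(x):
    # imagine a collective (e.g. an all_gather) happening inside this region
    return torch.relu(x @ x.t())

x = torch.randn(16, 16, requires_grad=True)
y = checkpoint(block, x, use_reentrant=False)  # recomputed in backward by user choice
y.sum().backward()
```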

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147561
Approved by: https://github.com/zou3519
2025-03-13 03:36:13 +00:00
420a9be743 [regression] Fix pin_memory() when it is called before device lazy initialization. (#149033)
PR #145752 added a check in isPinnedPtr to verify that a device is initialized before checking whether the tensor is pinned. That PR also added a lazy initialization trigger when at::empty is called with the pinned param set to true. However, when the tensor is created first and then pinned in a separate pin_memory() call, lazy device init is not triggered, so is_pinned always returns false.

With this PR, the lazy initialization is moved to the getPinnedMemoryAllocator function, which ensures the device is initialized before we pin a tensor.
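
A hedged repro sketch of the behavior described above (issue #149032), assuming a build with an accelerator available:
```python
import torch

t = torch.empty(1024)        # plain CPU tensor; no device initialization has happened yet
p = t.pin_memory()           # with this PR, pinning itself triggers lazy device init
print(p.is_pinned())         # expected: True (previously could report False)

q = torch.empty(1024, pin_memory=True)  # this path already triggered lazy init via #145752
print(q.is_pinned())         # True
```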

Fixes #149032

@ngimel @albanD
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149033
Approved by: https://github.com/ngimel, https://github.com/albanD
2025-03-13 02:56:24 +00:00
f2d43d866c [cutlass backend] switch layout for cutlass backend benchmark (#149009)
```
python benchmarks/inductor_backends/cutlass.py
```

logs:
```
Experiment group: mm (1024x1024, 1024x1024) torch.float16
+-----------------------+--------------------+----------------------+---------------------+
|         name          | forward_time (us)  | compilation_time (s) | perf_over_aten (%)  |
+-----------------------+--------------------+----------------------+---------------------+
|         aten          | 13.059554621577263 |  1.580178506206721   |         NA          |
|        triton         | 10.245470330119133 | 0.04118620231747627  | -21.54808776410064  |
| triton_persistent_tma | 10.388538241386414 | 0.04225084185600281  | -20.45258400908819  |
|  cutlass_lvl_default  | 12.882896699011326 |  231.14990583620965  | -1.3527101626732294 |
|   cutlass_lvl_1111    | 11.362981051206589 |  126.41650272067636  | -12.99105229490415  |
|   cutlass_lvl_2222    | 11.107578873634338 |  555.8380545829423   | -14.946725248331441 |
+-----------------------+--------------------+----------------------+---------------------+

Experiment group: mm (1024x1024, 1024x1024) torch.bfloat16
+-----------------------+--------------------+----------------------+---------------------+
|         name          | forward_time (us)  | compilation_time (s) | perf_over_aten (%)  |
+-----------------------+--------------------+----------------------+---------------------+
|         aten          | 14.037585817277431 | 0.21587548777461052  |         NA          |
|        triton         | 10.571777820587158 |  78.15654796129093   | -24.68948750735019  |
| triton_persistent_tma | 10.761583223938942 |  1.3195342738181353  | -23.337364672110443 |
|  cutlass_lvl_default  | 12.872588820755482 |  237.0100042372942   | -8.299126443010406  |
|   cutlass_lvl_1111    | 11.08622644096613  |  137.55013868492097  | -21.02469338195443  |
|   cutlass_lvl_2222    | 11.044904589653015 |   551.265836935956   | -21.319059178545007 |
+-----------------------+--------------------+----------------------+---------------------+

Experiment group: mm (2048x2048, 2048x2048) torch.float16
+-----------------------+--------------------+----------------------+---------------------+
|         name          | forward_time (us)  | compilation_time (s) | perf_over_aten (%)  |
+-----------------------+--------------------+----------------------+---------------------+
|         aten          | 30.483894050121307 | 0.27990864124149084  |         NA          |
|        triton         | 29.567627236247063 |  99.87172158574685   | -3.005740711366232  |
| triton_persistent_tma | 29.66325916349888  |  1.3695051120594144  | -2.692027748401006  |
|  cutlass_lvl_default  | 29.82821688055992  |  72.61214569816366   | -2.150897022812533  |
|   cutlass_lvl_1111    | 29.476772993803024 |   67.7428645719774   | -3.303780857728953  |
|   cutlass_lvl_2222    | 30.113255605101585 |  233.84051702311262  | -1.2158500630212203 |
+-----------------------+--------------------+----------------------+---------------------+

Experiment group: mm (2048x2048, 2048x2048) torch.bfloat16
+-----------------------+--------------------+----------------------+---------------------+
|         name          | forward_time (us)  | compilation_time (s) | perf_over_aten (%)  |
+-----------------------+--------------------+----------------------+---------------------+
|         aten          | 30.58255836367607  | 0.058386584743857384 |         NA          |
|        triton         | 29.799651354551315 |  100.18178300186992  | -2.559978795150901  |
| triton_persistent_tma | 29.362043365836143 |  1.534341821912676   | -3.990885861562106  |
|  cutlass_lvl_default  |  29.4346883893013  |  73.68858492700383   | -3.7533484305817093 |
|   cutlass_lvl_1111    | 29.164200648665428 |  75.44329373072833   | -4.637799421958348  |
|   cutlass_lvl_2222    | 29.13798950612545  |  227.33327346481383  |  -4.7235056020244   |
+-----------------------+--------------------+----------------------+---------------------+

Experiment group: mm (8192x8192, 8192x8192) torch.float16
+-----------------------+--------------------+----------------------+--------------------+
|         name          | forward_time (us)  | compilation_time (s) | perf_over_aten (%) |
+-----------------------+--------------------+----------------------+--------------------+
|         aten          | 1656.6237211227417 |  0.0549461180344224  |         NA         |
|        triton         | 1892.8285837173462 |  2.3174119112081826  | 14.258208401997386 |
| triton_persistent_tma | 1665.332317352295  |  2.7922237082384527  | 0.525683419747917  |
|  cutlass_lvl_default  | 1705.5492401123047 |  108.31571159465238  | 2.9533272019312116 |
|   cutlass_lvl_1111    | 1714.9059772491455 |  17.64627545280382   | 3.518134829489478  |
|   cutlass_lvl_2222    | 1680.4152727127075 |  306.9972395859659   | 1.4361469829637354 |
+-----------------------+--------------------+----------------------+--------------------+

Experiment group: mm (8192x8192, 8192x8192) torch.bfloat16
+-----------------------+--------------------+----------------------+--------------------+
|         name          | forward_time (us)  | compilation_time (s) | perf_over_aten (%) |
+-----------------------+--------------------+----------------------+--------------------+
|         aten          | 1621.416687965393  | 0.06300561130046844  |         NA         |
|        triton         | 1782.3902368545532 |  2.318530729971826   | 9.927956834535548  |
| triton_persistent_tma | 1586.0934257507324 |  2.7931175641715527  | -2.178543151605614 |
|  cutlass_lvl_default  | 1657.4617624282837 |  43.31810224894434   | 2.2230605328307784 |
|   cutlass_lvl_1111    | 1641.5367126464844 |  17.648567833006382  | 1.2408916739557292 |
|   cutlass_lvl_2222    | 1645.8417177200317 |  249.33647010894492  | 1.5064005407078918 |
+-----------------------+--------------------+----------------------+--------------------+
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149009
Approved by: https://github.com/chenyang78, https://github.com/jingsh
2025-03-13 01:57:47 +00:00
4a12777ffe [Partitioner] Remove unnecessary upstream nodes in dependency viewer (#146580)
We iterate over upstream nodes to update the partition map, but this actually does nothing because we iterate over nodes in reversed topological order (https://github.com/pytorch/pytorch/pull/136608/files#diff-f2f9dd3903fd99955732eb694941fea0cb7301a58d59554787f3311d417e5615L193), so no upstream nodes exist in the assignment yet. Remove it to reduce the for-loop overhead, which is up to O(N * N) complexity.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146580
Approved by: https://github.com/Skylion007, https://github.com/jerome-habana
2025-03-13 01:42:10 +00:00
1e37e5b836 Update nightly PyTorch version to 2.8.0 (#149038)
Branch for 2.7: https://github.com/pytorch/pytorch/tree/release/2.7
Same as https://github.com/pytorch/pytorch/pull/135916

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149038
Approved by: https://github.com/ZainRizvi
2025-03-12 23:51:04 +00:00
e51615cb73 Revert "[Profiler][HPU] Fix incorrect availabilities for HPU (#148663)"
This reverts commit 28b78800b92a4d847a2360ab0e0b87d3e00a6138.

Reverted https://github.com/pytorch/pytorch/pull/148663 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. @albanD, could you please help get this relanded? See D71052806 for more details. To validate the fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/148663#issuecomment-2719297055))
2025-03-12 22:52:11 +00:00
b1980b2405 Revert "Make dynamism code robust to NotImplementedException (#148823)"
This reverts commit 60576419a2a5cc09e4a92be870fda8f3fc305ddc.

Reverted https://github.com/pytorch/pytorch/pull/148823 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally, see D71042206 for details. To validate your fixes internally before relanding, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/148823#issuecomment-2719287467))
2025-03-12 22:45:39 +00:00
38c5cf99b3 [CI] Don't clean workspace when fetching repo (#147994)
Tested on https://github.com/pytorch/pytorch/pull/148995
Do two checkouts: the first attempts to use an existing checkout if possible. The second removes the workspace and re-pulls everything if the first fails.

This is probably not going to be useful if we switch entirely to ephemeral runners but w/e

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147994
Approved by: https://github.com/malfet, https://github.com/atalman
2025-03-12 22:29:52 +00:00
3f1769f785 Add ninja to requirements-ci for all arch (#148778)
So I can get ninja_logs for the builds.
No negative consequences afaik.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148778
Approved by: https://github.com/malfet, https://github.com/atalman
2025-03-12 22:07:46 +00:00
0c8ec26d3b [ROCm][TunableOp] hipblaslt tf32 support (#145946)
TF32 is supported by hipblaslt. Support added by #143549.  This PR expands integration to the TunableOp feature.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145946
Approved by: https://github.com/pruthvistony, https://github.com/echen4096, https://github.com/yoyoyocmu

Co-authored-by: Nichols A. Romero <nick.romero@amd.com>
2025-03-12 21:17:11 +00:00
ab45aaca97 Set non-strict export as default mode (#148790)
Summary:
- Flip the default value of strict argument in torch.export.export from True to False
- Update test infra to cope with the change, some of them made the assumption of strict mode as default
- Disabled some tests that fail in non-strict mode

Test Plan: Sandcastle

Differential Revision: D70228628

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148790
Approved by: https://github.com/angelayi
2025-03-12 21:10:58 +00:00
e3ebf61589 Create and send full_tensor on ProcessGroup-supported device in _broadcast_tensors (#148865)
Fixes #138842

`device` is always the device of the `local_state_dict`, which may be CPU, a device that the NCCL backend does not support.

Instead, create broadcasted tensors on one of `pg._device_types` and then move the tensors back if `local_state_dict`'s `device` was not supported by the `ProcessGroup`.
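
A hedged sketch of the approach described above (not the actual DCP code); `_device_types` is an internal attribute and is read defensively here, and its elements are assumed to be torch.device objects:
```python
import torch
import torch.distributed as dist

def broadcast_on_supported_device(t: torch.Tensor, src: int, pg) -> torch.Tensor:
    supported = {d.type for d in getattr(pg, "_device_types", [torch.device("cpu")])}
    if t.device.type in supported:
        dist.broadcast(t, src=src, group=pg)
        return t
    target = next(iter(supported))          # e.g. "cuda" for an NCCL group
    work = t.to(target)                     # move the CPU tensor to a supported device
    dist.broadcast(work, src=src, group=pg)
    return work.to(t.device)                # move back to the local_state_dict's device
```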

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148865
Approved by: https://github.com/mori360
2025-03-12 20:56:31 +00:00
b5191b9312 [codemod][lowrisk] Fix deprecated use of 0/NULL in caffe2/aten/src/ATen/native/quantized/cpu/qnnpack/src/fc-unpack.cc + 1 (#148996)
Summary:
`nullptr` is typesafe. `0` and `NULL` are not. In the future, only `nullptr` will be allowed.

This diff helps us embrace the future _now_ in service of enabling `-Wzero-as-null-pointer-constant`.

Test Plan: Sandcastle

Reviewed By: dtolnay

Differential Revision: D70939306

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148996
Approved by: https://github.com/Skylion007
2025-03-12 20:06:19 +00:00
eqy
b90698f5ba [CUDA] try to abate some flakiness in test_stream_event_nogil (#148796)
threshold twiddling as one in a few dozen runs tend to fail the current threshold

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148796
Approved by: https://github.com/Skylion007
2025-03-12 19:12:50 +00:00
215f856142 Add XPU device to nested_layer_norm (#148593)
Work with https://github.com/intel/torch-xpu-ops/pull/1416 .

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148593
Approved by: https://github.com/guangyey, https://github.com/jbschlosser
2025-03-12 19:07:08 +00:00
66300d3d55 [cutlass backend] try make cutlass backend benchmark more robust (#149015)
Differential Revision: [D71006269](https://our.internmc.facebook.com/intern/diff/D71006269/)

I want to make sure the benchmark can still print most of the results even if some experiments fail.

```
Experiment group: mm (3x3, 3x3) torch.bfloat16
+-----------------------+-------------------+----------------------+---------------------+
|         name          | forward_time (us) | compilation_time (s) | perf_over_aten (%)  |
+-----------------------+-------------------+----------------------+---------------------+
|         aten          | 6.175220478326082 |  0.5982149520423263  |         NA          |
|        triton         | 5.326753947883844 |  3.2067150759976357  | -13.739858089605114 |
| triton_persistent_tma | 5.340870004147291 |  3.279932268196717   | -13.51126615004617  |
|  cutlass_lvl_default  |        inf        |         inf          |         inf         |
|   cutlass_lvl_1111    |        inf        |         inf          |         inf         |
|   cutlass_lvl_2222    |        inf        |         inf          |         inf         |
|   cutlass_lvl_3333    |        inf        |         inf          |         inf         |
+-----------------------+-------------------+----------------------+---------------------+
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149015
Approved by: https://github.com/chenyang78, https://github.com/jingsh
2025-03-12 18:59:49 +00:00
86bc154d61 [scan] Flattened output of HOP scan (#148955)
This is required because downstream operations expect HOPs to return a flattened list of output elements.
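
An illustrative sketch of the flattening convention, using the internal pytree helper (the names here are just for demonstration):
```python
import torch
from torch.utils._pytree import tree_flatten, tree_unflatten

nested = (torch.zeros(2), {"carry": torch.ones(3)})   # what a body_fn might return
flat, spec = tree_flatten(nested)                     # -> flat list of tensors for the HOP
restored = tree_unflatten(flat, spec)                 # callers can rebuild the structure
```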

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148955
Approved by: https://github.com/ydwu4
2025-03-12 18:27:27 +00:00
fb0e9cb0a0 Remove warnings on non-buffer tensor constants (#148483)
Export already registers tensor constants directly in the graph and this is also true for Torchbind objects. This removes warning that pollutes the output.

Differential Revision: [D70577856](https://our.internmc.facebook.com/intern/diff/D70577856)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148483
Approved by: https://github.com/zhxchen17, https://github.com/zou3519
ghstack dependencies: #148364
2025-03-12 18:20:04 +00:00
29fd875bc1 Automate stable CUDA update and linter using min Python verison (#148912)
1. Fixes: https://github.com/pytorch/pytorch/issues/145571 . CUDA Stable is the same CUDA version that is published to PyPI; it is also used to set the Metadata section in the rest of the wheel scripts and to tag the Docker releases with the latest tag.
2. Updates the minimum Python version used in the linter.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148912
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-03-12 18:12:34 +00:00
01e9036bd2 skip torchbind in constant folding (#148993)
Summary:
Do not fold torchbind objects in constant folding

Any operation on these torchbind objects can have arbitrary side effects, so we can't effectively constant fold anything torchbind-obj-related anyway.
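
A hedged sketch of the guard (illustrative; the actual check lives in the inductor constant-folding pass):
```python
import torch

def has_torchbind_input(node_args) -> bool:
    # torchbind objects surface in Python as torch.ScriptObject instances
    return any(isinstance(a, torch.ScriptObject) for a in node_args)

def can_constant_fold(node_args) -> bool:
    # calls on ScriptObjects may have arbitrary side effects, so never fold them
    return not has_torchbind_input(node_args)
```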

Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:torchbind -- -r aot_compile_constant_folding
```

Reviewed By: angelayi

Differential Revision: D69946541

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148993
Approved by: https://github.com/angelayi
2025-03-12 18:08:08 +00:00
923ce10f6c [while_loop] require stride to be the same as input for body_fn (#148002)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148002
Approved by: https://github.com/zou3519
2025-03-12 17:15:10 +00:00
28b78800b9 [Profiler][HPU] Fix incorrect availabilities for HPU (#148663)
Fixes #148661

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148663
Approved by: https://github.com/jeromean, https://github.com/Skylion007, https://github.com/EikanWang, https://github.com/albanD
2025-03-12 17:06:57 +00:00
b040dc3a53 Reland: [inductor] Simplify grid handling (#148305)
Summary:
Relands D69965761 / https://github.com/pytorch/pytorch/pull/147583

Before this PR, calling a triton kernel would look like:
```py
kernel.run(a, b, xnumel, grid=grid(xnumel), stream=stream0)
```
where the `grid=` was passed as a callable (function closure) arg.  This PR removes the grid arg:
```py
kernel.run(a, b, xnumel, stream=stream0)
```
instead now the grid computation is included in the kernel launcher, with something like:
```py
def launcher(in_ptr0, out_ptr0, xnumel, stream):
    grid_0 = ((xnumel + 1023) >> 10)
    grid_1 = 1
    grid_2 = 1
    runner(grid_0, grid_1, grid_2, stream, function, metadata, None, launch_enter_hook, launch_exit_hook, in_ptr0, out_ptr0, xnumel)
```

This should be faster, since we remove multiple function/dict calls and are able to specialize the grid computation for each `triton.Config`.

It also allows us to unify the handling of grids between the Python and C++ wrapper code.  Before this, C++ wrapper code didn't actually support dynamic grid sizes and instead burned in a static grid.

This unification allows this PR to be a net deletion of code.

Differential [disconnected] Revision: D70471332

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148305
Approved by: https://github.com/shunting314, https://github.com/eellison
2025-03-12 15:52:16 +00:00
626a5e22eb Revert "[CI] Don't clean workspace when fetching repo (#147994)"
This reverts commit e5fef8a08ebb8548e8413ae54ef0ad9a11f1f4c0.

Reverted https://github.com/pytorch/pytorch/pull/147994 on behalf of https://github.com/clee2000 due to broke checkout on xpu, probably lack of sudo? ([comment](https://github.com/pytorch/pytorch/pull/147994#issuecomment-2718335186))
2025-03-12 15:50:38 +00:00
9a0f65d3d3 [TD] test_cpp_extensions_aot_ninja corresponds to things in test/cpp_extensions (#148992)
Manually map test_cpp_extensions_aot_ninja to files in test/cpp_extensions since test_cpp_extensions_aot_ninja isn't an actual file you can edit, but a wrapper for files in test/cpp_extensions.

Idk if this is a good idea, feels very manual.  Maybe it would be better to classify this the same as any other TD failure where TD simply can't figure out the tests it needs to run
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148992
Approved by: https://github.com/malfet, https://github.com/seemethere, https://github.com/janeyx99
2025-03-12 15:40:06 +00:00
488c4480f9 [inductor] Fix profiler tests with latest Triton (#149025)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149025
Approved by: https://github.com/yanboliang
2025-03-12 15:34:26 +00:00
5ada4e6a53 Revert "Reland: [inductor] Simplify grid handling (#148305)"
This reverts commit 8d08b4901586f230353a558ee00c16ad57f95178.

Reverted https://github.com/pytorch/pytorch/pull/148305 on behalf of https://github.com/jithunnair-amd due to Broke ROCm CI ([comment](https://github.com/pytorch/pytorch/pull/148305#issuecomment-2718177044))
2025-03-12 14:58:43 +00:00
cyy
8fa81a6066 Enable misc-use-internal-linkage check and apply fixes (#148948)
Enables clang-tidy rule [`misc-use-internal-linkage`](https://clang.llvm.org/extra/clang-tidy/checks/misc/use-internal-linkage.html). This new check was introduced in Clang-Tidy 18 and is available due to recent update of Clang-Tidy 19.

The check marks functions and variables used only in their translation unit as static. Therefore, undesired symbols are not leaked into other units, more link-time optimisations are possible, and the resulting binaries may be smaller.

The detected violations were mostly fixed by using static. In other cases, the symbols were indeed consumed by other files, so their declaring headers were included. Still, some declarations were wrong and have been fixed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148948
Approved by: https://github.com/Skylion007
2025-03-12 14:22:56 +00:00
f349304c08 [Inductor][CPP] Fix expr issue in loop split (#148882)
**Summary**
Fix issue: https://github.com/pytorch/pytorch/issues/148058. In this case, `indexing_expr` is a plain integer, which does not have a `find` method.

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_issue_148058
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148882
Approved by: https://github.com/jgong5
2025-03-12 11:08:07 +00:00
81aee3c9c4 [Partitioner] Reduce time consuming of partitions merger (#146582)
This patch optimizes the maybe_merge_partition func in three ways:

- Remove an unnecessary copy (https://github.com/pytorch/pytorch/blob/main/torch/fx/passes/infra/partitioner.py#L99). The number of copied nodes is large if we can merge all of the graph's nodes into one partition.
- Record the users of each partition to avoid duplicate iteration over nodes (https://github.com/pytorch/pytorch/blob/main/torch/fx/passes/infra/partitioner.py#L133). The trip count of this loop can be very large.
- The node counts of partitions may be unbalanced (https://github.com/pytorch/pytorch/blob/main/torch/fx/passes/infra/partitioner.py#L145). We often hit the case where one partition has n nodes and the other has a single node; merging the smaller partition into the larger one helps reduce the time spent (see the sketch below).
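
A small sketch of the third point, merging the smaller partition into the larger one (illustrative only; the real logic is in torch/fx/passes/infra/partitioner.py):
```python
def merge_partitions(partitions: dict, a: int, b: int) -> int:
    # Moving nodes from the smaller set into the larger one keeps repeated merges close
    # to O(N log N) total work instead of the O(N^2) worst case of always copying one side.
    if len(partitions[a]) < len(partitions[b]):
        a, b = b, a
    partitions[a] |= partitions.pop(b)
    return a

parts = {0: {"n0", "n1", "n2"}, 1: {"n3"}}
kept = merge_partitions(parts, 0, 1)   # -> 0; parts == {0: {"n0", "n1", "n2", "n3"}}
```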

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146582
Approved by: https://github.com/jerome-habana, https://github.com/Skylion007
2025-03-12 09:24:38 +00:00
d547a56668 [AMD] Various fixes for mem efficient attention on CK backend (#148986)
Summary: Decouple aotriton vs. ck for mem efficient attention. Also fixed HW check.

Reviewed By: henryhu6

Differential Revision: D70872677

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148986
Approved by: https://github.com/jianyuh, https://github.com/houseroad
2025-03-12 07:36:46 +00:00
3375 changed files with 175837 additions and 79466 deletions

View File

@ -20,7 +20,7 @@ cd /
# on the mounted pytorch repo
git config --global --add safe.directory /pytorch
pip install -r /pytorch/requirements.txt
pip install auditwheel
pip install auditwheel==6.2.0
if [ "$DESIRED_CUDA" = "cpu" ]; then
echo "BASE_CUDA_VERSION is not set. Building cpu wheel."
#USE_PRIORITIZED_TEXT_FOR_LD for enable linker script optimization https://github.com/pytorch/pytorch/pull/121975/files

View File

@ -31,33 +31,47 @@ def build_ArmComputeLibrary() -> None:
"build=native",
]
acl_install_dir = "/acl"
acl_checkout_dir = "ComputeLibrary"
os.makedirs(acl_install_dir)
check_call(
[
"git",
"clone",
"https://github.com/ARM-software/ComputeLibrary.git",
"-b",
"v25.02",
"--depth",
"1",
"--shallow-submodules",
]
)
acl_checkout_dir = os.getenv("ACL_SOURCE_DIR", "ComputeLibrary")
if os.path.isdir(acl_install_dir):
shutil.rmtree(acl_install_dir)
if not os.path.isdir(acl_checkout_dir) or not len(os.listdir(acl_checkout_dir)):
check_call(
[
"git",
"clone",
"https://github.com/ARM-software/ComputeLibrary.git",
"-b",
"v25.02",
"--depth",
"1",
"--shallow-submodules",
]
)
check_call(
["scons", "Werror=1", "-j8", f"build_dir=/{acl_install_dir}/build"]
+ acl_build_flags,
["scons", "Werror=1", f"-j{os.cpu_count()}"] + acl_build_flags,
cwd=acl_checkout_dir,
)
for d in ["arm_compute", "include", "utils", "support", "src"]:
for d in ["arm_compute", "include", "utils", "support", "src", "build"]:
shutil.copytree(f"{acl_checkout_dir}/{d}", f"{acl_install_dir}/{d}")
def update_wheel(wheel_path, desired_cuda) -> None:
def replace_tag(filename) -> None:
with open(filename) as f:
lines = f.readlines()
for i, line in enumerate(lines):
if line.startswith("Tag:"):
lines[i] = line.replace("-linux_", "-manylinux_2_28_")
print(f"Updated tag from {line} to {lines[i]}")
break
with open(filename, "w") as f:
f.writelines(lines)
def package_cuda_wheel(wheel_path, desired_cuda) -> None:
"""
Update the cuda wheel libraries
Package the cuda wheel libraries
"""
folder = os.path.dirname(wheel_path)
wheelname = os.path.basename(wheel_path)
@ -88,30 +102,19 @@ def update_wheel(wheel_path, desired_cuda) -> None:
"/usr/lib64/libgfortran.so.5",
"/acl/build/libarm_compute.so",
"/acl/build/libarm_compute_graph.so",
"/usr/local/lib/libnvpl_lapack_lp64_gomp.so.0",
"/usr/local/lib/libnvpl_blas_lp64_gomp.so.0",
"/usr/local/lib/libnvpl_lapack_core.so.0",
"/usr/local/lib/libnvpl_blas_core.so.0",
]
if enable_cuda:
if "128" in desired_cuda:
libs_to_copy += [
"/usr/local/lib/libnvpl_lapack_lp64_gomp.so.0",
"/usr/local/lib/libnvpl_blas_lp64_gomp.so.0",
"/usr/local/lib/libnvpl_lapack_core.so.0",
"/usr/local/lib/libnvpl_blas_core.so.0",
]
if "126" in desired_cuda:
libs_to_copy += [
"/usr/local/cuda/lib64/libnvrtc-builtins.so.12.6",
"/usr/local/cuda/lib64/libcufile.so.0",
"/usr/local/cuda/lib64/libcufile_rdma.so.1",
]
elif "128" in desired_cuda:
libs_to_copy += [
"/usr/local/cuda/lib64/libnvrtc-builtins.so.12.8",
"/usr/local/cuda/lib64/libcufile.so.0",
"/usr/local/cuda/lib64/libcufile_rdma.so.1",
]
else:
libs_to_copy += [
"/opt/OpenBLAS/lib/libopenblas.so.0",
"/usr/local/cuda/lib64/libnvrtc-builtins.so.12.8",
"/usr/local/cuda/lib64/libcufile.so.0",
"/usr/local/cuda/lib64/libcufile_rdma.so.1",
]
# Copy libraries to unzipped_folder/a/lib
for lib_path in libs_to_copy:
lib_name = os.path.basename(lib_path)
@ -120,6 +123,13 @@ def update_wheel(wheel_path, desired_cuda) -> None:
f"cd {folder}/tmp/torch/lib/; "
f"patchelf --set-rpath '$ORIGIN' --force-rpath {folder}/tmp/torch/lib/{lib_name}"
)
# Make sure the wheel is tagged with manylinux_2_28
for f in os.scandir(f"{folder}/tmp/"):
if f.is_dir() and f.name.endswith(".dist-info"):
replace_tag(f"{f.path}/WHEEL")
break
os.mkdir(f"{folder}/cuda_wheel")
os.system(f"cd {folder}/tmp/; zip -r {folder}/cuda_wheel/{wheelname} *")
shutil.move(
@ -136,6 +146,9 @@ def complete_wheel(folder: str) -> str:
"""
wheel_name = list_dir(f"/{folder}/dist")[0]
# Please note for cuda we don't run auditwheel since we use custom script to package
# the cuda dependencies to the wheel file using update_wheel() method.
# However we need to make sure filename reflects the correct Manylinux platform.
if "pytorch" in folder and not enable_cuda:
print("Repairing Wheel with AuditWheel")
check_call(["auditwheel", "repair", f"dist/{wheel_name}"], cwd=folder)
@ -147,7 +160,14 @@ def complete_wheel(folder: str) -> str:
f"/{folder}/dist/{repaired_wheel_name}",
)
else:
repaired_wheel_name = wheel_name
repaired_wheel_name = wheel_name.replace(
"linux_aarch64", "manylinux_2_28_aarch64"
)
print(f"Renaming {wheel_name} wheel to {repaired_wheel_name}")
os.rename(
f"/{folder}/dist/{wheel_name}",
f"/{folder}/dist/{repaired_wheel_name}",
)
print(f"Copying {repaired_wheel_name} to artifacts")
shutil.copy2(
@ -184,8 +204,10 @@ if __name__ == "__main__":
).decode()
print("Building PyTorch wheel")
build_vars = "MAX_JOBS=5 CMAKE_SHARED_LINKER_FLAGS=-Wl,-z,max-page-size=0x10000 "
os.system("cd /pytorch; python setup.py clean")
build_vars = "CMAKE_SHARED_LINKER_FLAGS=-Wl,-z,max-page-size=0x10000 "
# MAX_JOB=5 is not required for CPU backend (see commit 465d98b)
if enable_cuda:
build_vars = "MAX_JOBS=5 " + build_vars
override_package_version = os.getenv("OVERRIDE_PACKAGE_VERSION")
desired_cuda = os.getenv("DESIRED_CUDA")
@ -232,6 +254,6 @@ if __name__ == "__main__":
print("Updating Cuda Dependency")
filename = os.listdir("/pytorch/dist/")
wheel_path = f"/pytorch/dist/{filename[0]}"
update_wheel(wheel_path, desired_cuda)
package_cuda_wheel(wheel_path, desired_cuda)
pytorch_wheel_name = complete_wheel("/pytorch/")
print(f"Build Complete. Created {pytorch_wheel_name}..")

View File

@ -19,13 +19,11 @@ import boto3
# AMI images for us-east-1, change the following based on your ~/.aws/config
os_amis = {
"ubuntu18_04": "ami-078eece1d8119409f", # login_name: ubuntu
"ubuntu20_04": "ami-052eac90edaa9d08f", # login_name: ubuntu
"ubuntu22_04": "ami-0c6c29c5125214c77", # login_name: ubuntu
"redhat8": "ami-0698b90665a2ddcf1", # login_name: ec2-user
}
ubuntu18_04_ami = os_amis["ubuntu18_04"]
ubuntu20_04_ami = os_amis["ubuntu20_04"]
@ -659,18 +657,6 @@ def configure_system(
"sudo apt-get install -y python3-dev python3-yaml python3-setuptools python3-wheel python3-pip"
)
host.run_cmd("pip3 install dataclasses typing-extensions")
# Install and switch to gcc-8 on Ubuntu-18.04
if not host.using_docker() and host.ami == ubuntu18_04_ami and compiler == "gcc-8":
host.run_cmd("sudo apt-get install -y g++-8 gfortran-8")
host.run_cmd(
"sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-8 100"
)
host.run_cmd(
"sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-8 100"
)
host.run_cmd(
"sudo update-alternatives --install /usr/bin/gfortran gfortran /usr/bin/gfortran-8 100"
)
if not use_conda:
print("Installing Cython + numpy from PyPy")
host.run_cmd("sudo pip3 install Cython")
@ -1026,7 +1012,7 @@ if __name__ == "__main__":
install_condaforge_python(host, args.python_version)
sys.exit(0)
python_version = args.python_version if args.python_version is not None else "3.8"
python_version = args.python_version if args.python_version is not None else "3.9"
if args.use_torch_from_pypi:
configure_system(host, compiler=args.compiler, python_version=python_version)

View File

@ -10,5 +10,3 @@ example: `py2-cuda9.0-cudnn7-ubuntu16.04`. The Docker images that are
built on Jenkins and are used in triggered builds already have this
environment variable set in their manifest. Also see
`./docker/jenkins/*/Dockerfile` and search for `BUILD_ENVIRONMENT`.
Our Jenkins installation is located at https://ci.pytorch.org/jenkins/.

View File

@ -13,10 +13,6 @@ if [[ "${BUILD_ENVIRONMENT}" == *-android* ]]; then
echo 'Skipping tests'
exit 0
fi
if [[ "${BUILD_ENVIRONMENT}" == *-rocm* ]]; then
# temporary to locate some kernel issues on the CI nodes
export HSAKMT_DEBUG_LEVEL=4
fi
# These additional packages are needed for circleci ROCm builds.
if [[ $BUILD_ENVIRONMENT == *rocm* ]]; then
# Need networkx 2.0 because bellman_ford was moved in 2.1. Scikit-image by

View File

@ -34,5 +34,5 @@ See `build.sh` for valid build environments (it's the giant switch).
./build.sh pytorch-linux-bionic-py3.8-gcc9 -t myimage:latest
# Set flags (see build.sh) and build image
sudo bash -c 'PROTOBUF=1 ./build.sh pytorch-linux-bionic-py3.8-gcc9 -t myimage:latest'
sudo bash -c 'TRITON=1 ./build.sh pytorch-linux-bionic-py3.8-gcc9 -t myimage:latest'
```

View File

@ -1,6 +1,7 @@
ARG CUDA_VERSION=12.4
ARG BASE_TARGET=cuda${CUDA_VERSION}
FROM amd64/almalinux:8 as base
ARG ROCM_IMAGE=rocm/dev-almalinux-8:6.3-complete
FROM amd64/almalinux:8.10-20250519 as base
ENV LC_ALL en_US.UTF-8
ENV LANG en_US.UTF-8
@ -8,12 +9,10 @@ ENV LANGUAGE en_US.UTF-8
ARG DEVTOOLSET_VERSION=11
ENV LC_ALL en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LANGUAGE en_US.UTF-8
RUN yum -y update
RUN yum -y install epel-release
# install glibc-langpack-en to make sure the en_US.UTF-8 locale is available
RUN yum -y install glibc-langpack-en
RUN yum install -y sudo wget curl perl util-linux xz bzip2 git patch which perl zlib-devel openssl-devel yum-utils autoconf automake make gcc-toolset-${DEVTOOLSET_VERSION}-toolchain
# Just add everything as a safe.directory for git since these will be used in multiple places with git
RUN git config --global --add safe.directory '*'
@ -41,9 +40,12 @@ RUN bash ./install_conda.sh && rm install_conda.sh
# Install CUDA
FROM base as cuda
ARG CUDA_VERSION=12.4
ARG CUDA_VERSION=12.6
RUN rm -rf /usr/local/cuda-*
ADD ./common/install_cuda.sh install_cuda.sh
COPY ./common/install_nccl.sh install_nccl.sh
COPY ./ci_commit_pins/nccl-cu* /ci_commit_pins/
COPY ./common/install_cusparselt.sh install_cusparselt.sh
ENV CUDA_HOME=/usr/local/cuda-${CUDA_VERSION}
# Preserve CUDA_VERSION for the builds
ENV CUDA_VERSION=${CUDA_VERSION}
@ -54,18 +56,20 @@ FROM cuda as cuda11.8
RUN bash ./install_cuda.sh 11.8
ENV DESIRED_CUDA=11.8
FROM cuda as cuda12.1
RUN bash ./install_cuda.sh 12.1
ENV DESIRED_CUDA=12.1
FROM cuda as cuda12.4
RUN bash ./install_cuda.sh 12.4
ENV DESIRED_CUDA=12.4
FROM cuda as cuda12.6
RUN bash ./install_cuda.sh 12.6
ENV DESIRED_CUDA=12.6
FROM cuda as cuda12.8
RUN bash ./install_cuda.sh 12.8
ENV DESIRED_CUDA=12.8
FROM ${ROCM_IMAGE} as rocm
ENV PYTORCH_ROCM_ARCH="gfx900;gfx906;gfx908;gfx90a;gfx942;gfx1030;gfx1100;gfx1101;gfx1102;gfx1200;gfx1201"
ADD ./common/install_mkl.sh install_mkl.sh
RUN bash ./install_mkl.sh && rm install_mkl.sh
ENV MKLROOT /opt/intel
# Install MNIST test data
FROM base as mnist
ADD ./common/install_mnist.sh install_mnist.sh
@ -73,9 +77,8 @@ RUN bash ./install_mnist.sh
FROM base as all_cuda
COPY --from=cuda11.8 /usr/local/cuda-11.8 /usr/local/cuda-11.8
COPY --from=cuda12.1 /usr/local/cuda-12.1 /usr/local/cuda-12.1
COPY --from=cuda12.4 /usr/local/cuda-12.4 /usr/local/cuda-12.4
COPY --from=cuda12.6 /usr/local/cuda-12.6 /usr/local/cuda-12.6
COPY --from=cuda12.8 /usr/local/cuda-12.8 /usr/local/cuda-12.8
# Final step
FROM ${BASE_TARGET} as final

View File

@ -1,82 +1,70 @@
#!/usr/bin/env bash
# Script used only in CD pipeline
set -eou pipefail
set -exou pipefail
image="$1"
shift
if [ -z "${image}" ]; then
echo "Usage: $0 IMAGE"
echo "Usage: $0 IMAGENAME:ARCHTAG"
exit 1
fi
DOCKER_IMAGE_NAME="pytorch/${image}"
# Go from imagename:tag to tag
DOCKER_TAG_PREFIX=$(echo "${image}" | awk -F':' '{print $2}')
CUDA_VERSION=""
ROCM_VERSION=""
EXTRA_BUILD_ARGS=""
if [[ "${DOCKER_TAG_PREFIX}" == cuda* ]]; then
# extract cuda version from image name and tag. e.g. manylinux2_28-builder:cuda12.8 returns 12.8
CUDA_VERSION=$(echo "${DOCKER_TAG_PREFIX}" | awk -F'cuda' '{print $2}')
EXTRA_BUILD_ARGS="--build-arg CUDA_VERSION=${CUDA_VERSION}"
elif [[ "${DOCKER_TAG_PREFIX}" == rocm* ]]; then
# extract rocm version from image name and tag. e.g. manylinux2_28-builder:rocm6.2.4 returns 6.2.4
ROCM_VERSION=$(echo "${DOCKER_TAG_PREFIX}" | awk -F'rocm' '{print $2}')
EXTRA_BUILD_ARGS="--build-arg ROCM_IMAGE=rocm/dev-almalinux-8:${ROCM_VERSION}-complete"
fi
export DOCKER_BUILDKIT=1
TOPDIR=$(git rev-parse --show-toplevel)
CUDA_VERSION=${CUDA_VERSION:-12.1}
case ${CUDA_VERSION} in
case ${DOCKER_TAG_PREFIX} in
cpu)
BASE_TARGET=base
DOCKER_TAG=cpu
;;
all)
BASE_TARGET=all_cuda
DOCKER_TAG=latest
cuda*)
BASE_TARGET=cuda${CUDA_VERSION}
;;
rocm*)
BASE_TARGET=rocm
;;
*)
BASE_TARGET=cuda${CUDA_VERSION}
DOCKER_TAG=cuda${CUDA_VERSION}
echo "ERROR: Unknown docker tag ${DOCKER_TAG_PREFIX}"
exit 1
;;
esac
# TODO: Remove LimitNOFILE=1048576 patch once https://github.com/pytorch/test-infra/issues/5712
# is resolved. This patch is required in order to fix timing out of Docker build on Amazon Linux 2023.
sudo sed -i s/LimitNOFILE=infinity/LimitNOFILE=1048576/ /usr/lib/systemd/system/docker.service
sudo systemctl daemon-reload
sudo systemctl restart docker
(
set -x
# TODO: Remove LimitNOFILE=1048576 patch once https://github.com/pytorch/test-infra/issues/5712
# is resolved. This patch is required in order to fix timing out of Docker build on Amazon Linux 2023.
sudo sed -i s/LimitNOFILE=infinity/LimitNOFILE=1048576/ /usr/lib/systemd/system/docker.service
sudo systemctl daemon-reload
sudo systemctl restart docker
export DOCKER_BUILDKIT=1
TOPDIR=$(git rev-parse --show-toplevel)
tmp_tag=$(basename "$(mktemp -u)" | tr '[:upper:]' '[:lower:]')
docker build \
--target final \
--progress plain \
--build-arg "BASE_TARGET=${BASE_TARGET}" \
--build-arg "CUDA_VERSION=${CUDA_VERSION}" \
--build-arg "DEVTOOLSET_VERSION=11" \
-t ${DOCKER_IMAGE_NAME} \
$@ \
-f "${TOPDIR}/.ci/docker/almalinux/Dockerfile" \
${TOPDIR}/.ci/docker/
)
docker build \
--target final \
--progress plain \
--build-arg "BASE_TARGET=${BASE_TARGET}" \
--build-arg "DEVTOOLSET_VERSION=11" \
${EXTRA_BUILD_ARGS} \
-t ${tmp_tag} \
$@ \
-f "${TOPDIR}/.ci/docker/almalinux/Dockerfile" \
${TOPDIR}/.ci/docker/
if [[ "${DOCKER_TAG}" =~ ^cuda* ]]; then
if [ -n "${CUDA_VERSION}" ]; then
# Test that we're using the right CUDA compiler
(
set -x
docker run --rm "${DOCKER_IMAGE_NAME}" nvcc --version | grep "cuda_${CUDA_VERSION}"
)
fi
GITHUB_REF=${GITHUB_REF:-$(git symbolic-ref -q HEAD || git describe --tags --exact-match)}
GIT_BRANCH_NAME=${GITHUB_REF##*/}
GIT_COMMIT_SHA=${GITHUB_SHA:-$(git rev-parse HEAD)}
DOCKER_IMAGE_BRANCH_TAG=${DOCKER_IMAGE_NAME}-${GIT_BRANCH_NAME}
DOCKER_IMAGE_SHA_TAG=${DOCKER_IMAGE_NAME}-${GIT_COMMIT_SHA}
if [[ "${WITH_PUSH:-}" == true ]]; then
(
set -x
docker push "${DOCKER_IMAGE_NAME}"
if [[ -n ${GITHUB_REF} ]]; then
docker tag ${DOCKER_IMAGE_NAME} ${DOCKER_IMAGE_BRANCH_TAG}
docker tag ${DOCKER_IMAGE_NAME} ${DOCKER_IMAGE_SHA_TAG}
docker push "${DOCKER_IMAGE_BRANCH_TAG}"
docker push "${DOCKER_IMAGE_SHA_TAG}"
fi
)
docker run --rm "${tmp_tag}" nvcc --version | grep "cuda_${CUDA_VERSION}"
fi
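The rewritten almalinux build script derives its build arguments from the image tag itself, e.g. `manylinux2_28-builder:cuda12.8` yields `CUDA_VERSION=12.8`, while `...:rocm6.2.4` selects a `rocm/dev-almalinux-8:6.2.4-complete` base image. A small Python sketch of the same tag-parsing rule follows; the helper name and returned dict shape are mine, the mappings come from the script.

```python
def parse_docker_tag(image: str) -> dict:
    """Map IMAGENAME:ARCHTAG to the build args used by the almalinux Dockerfile."""
    tag = image.split(":", 1)[1]          # e.g. "cuda12.8", "rocm6.2.4", "cpu"
    if tag.startswith("cuda"):
        version = tag[len("cuda"):]
        return {"BASE_TARGET": f"cuda{version}", "CUDA_VERSION": version}
    if tag.startswith("rocm"):
        version = tag[len("rocm"):]
        return {"BASE_TARGET": "rocm",
                "ROCM_IMAGE": f"rocm/dev-almalinux-8:{version}-complete"}
    if tag == "cpu":
        return {"BASE_TARGET": "base"}
    raise ValueError(f"Unknown docker tag {tag}")

assert parse_docker_tag("manylinux2_28-builder:cuda12.8")["CUDA_VERSION"] == "12.8"
```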

View File

@ -85,9 +85,6 @@ elif [[ "$image" == *linter* ]]; then
DOCKERFILE="linter/Dockerfile"
fi
# CMake 3.18 is needed to support CUDA17 language variant
CMAKE_VERSION=3.18.5
_UCX_COMMIT=7bb2722ff2187a0cad557ae4a6afa090569f83fb
_UCC_COMMIT=20eae37090a4ce1b32bcce6144ccad0b49943e0b
if [[ "$image" == *rocm* ]]; then
@ -95,66 +92,32 @@ if [[ "$image" == *rocm* ]]; then
_UCC_COMMIT=0c0fc21559835044ab107199e334f7157d6a0d3d
fi
tag=$(echo $image | awk -F':' '{print $2}')
# It's annoying to rename jobs every time you want to rewrite a
# configuration, so we hardcode everything here rather than do it
# from scratch
case "$image" in
case "$tag" in
pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc11)
CUDA_VERSION=12.6.3
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=11
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9-inductor-benchmarks)
CUDA_VERSION=12.4.1
pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc9-inductor-benchmarks)
CUDA_VERSION=12.8
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
CONDA_CMAKE=yes
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-focal-cuda12.4-cudnn9-py3.12-gcc9-inductor-benchmarks)
CUDA_VERSION=12.4.1
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.12
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
CONDA_CMAKE=yes
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-focal-cuda12.4-cudnn9-py3.13-gcc9-inductor-benchmarks)
CUDA_VERSION=12.4.1
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.13
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
CONDA_CMAKE=yes
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
@ -163,57 +126,45 @@ case "$image" in
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc9-inductor-benchmarks)
CUDA_VERSION=12.6.3
pytorch-linux-jammy-cuda12.6-cudnn9-py3-gcc9-inductor-benchmarks)
CUDA_VERSION=12.6
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
CONDA_CMAKE=yes
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-focal-cuda12.6-cudnn9-py3.12-gcc9-inductor-benchmarks)
CUDA_VERSION=12.6.3
pytorch-linux-jammy-cuda12.6-cudnn9-py3.12-gcc9-inductor-benchmarks)
CUDA_VERSION=12.6
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.12
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
CONDA_CMAKE=yes
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-focal-cuda12.6-cudnn9-py3.13-gcc9-inductor-benchmarks)
CUDA_VERSION=12.6.3
pytorch-linux-jammy-cuda12.6-cudnn9-py3.13-gcc9-inductor-benchmarks)
CUDA_VERSION=12.6
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.13
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
CONDA_CMAKE=yes
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
@ -222,115 +173,81 @@ case "$image" in
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-focal-py3-clang10-onnx)
ANACONDA_PYTHON_VERSION=3.9
CLANG_VERSION=10
PROTOBUF=yes
DB=yes
VISION=yes
CONDA_CMAKE=yes
ONNX=yes
;;
pytorch-linux-focal-py3.9-clang10)
ANACONDA_PYTHON_VERSION=3.9
CLANG_VERSION=10
PROTOBUF=yes
DB=yes
VISION=yes
VULKAN_SDK_VERSION=1.2.162.1
SWIFTSHADER=yes
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-focal-py3.11-clang10)
ANACONDA_PYTHON_VERSION=3.11
CLANG_VERSION=10
PROTOBUF=yes
DB=yes
VISION=yes
VULKAN_SDK_VERSION=1.2.162.1
SWIFTSHADER=yes
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-focal-py3.9-gcc9)
ANACONDA_PYTHON_VERSION=3.9
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-focal-rocm-n-1-py3)
pytorch-linux-jammy-rocm-n-1-py3)
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=11
PROTOBUF=yes
DB=yes
VISION=yes
ROCM_VERSION=6.2.4
NINJA_VERSION=1.9.0
CONDA_CMAKE=yes
TRITON=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-focal-rocm-n-py3)
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=11
PROTOBUF=yes
DB=yes
VISION=yes
ROCM_VERSION=6.3
NINJA_VERSION=1.9.0
CONDA_CMAKE=yes
TRITON=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-jammy-xpu-2024.0-py3)
ANACONDA_PYTHON_VERSION=3.9
pytorch-linux-jammy-rocm-n-py3)
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=11
PROTOBUF=yes
DB=yes
VISION=yes
XPU_VERSION=0.5
ROCM_VERSION=6.4
NINJA_VERSION=1.9.0
CONDA_CMAKE=yes
TRITON=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-jammy-xpu-2025.0-py3)
ANACONDA_PYTHON_VERSION=3.9
GCC_VERSION=11
PROTOBUF=yes
DB=yes
VISION=yes
XPU_VERSION=2025.0
NINJA_VERSION=1.9.0
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-jammy-xpu-2025.1-py3)
ANACONDA_PYTHON_VERSION=3.9
GCC_VERSION=11
VISION=yes
XPU_VERSION=2025.1
NINJA_VERSION=1.9.0
TRITON=yes
;;
pytorch-linux-jammy-py3.9-gcc11-inductor-benchmarks)
ANACONDA_PYTHON_VERSION=3.9
GCC_VERSION=11
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
CONDA_CMAKE=yes
TRITON=yes
DOCS=yes
INDUCTOR_BENCHMARKS=yes
@ -340,40 +257,30 @@ case "$image" in
CUDA_VERSION=11.8
CUDNN_VERSION=9
CLANG_VERSION=12
PROTOBUF=yes
DB=yes
VISION=yes
TRITON=yes
;;
pytorch-linux-jammy-py3-clang12-asan)
ANACONDA_PYTHON_VERSION=3.9
CLANG_VERSION=12
PROTOBUF=yes
DB=yes
VISION=yes
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-jammy-py3-clang15-asan)
ANACONDA_PYTHON_VERSION=3.10
CLANG_VERSION=15
CONDA_CMAKE=yes
VISION=yes
;;
pytorch-linux-jammy-py3-clang18-asan)
ANACONDA_PYTHON_VERSION=3.10
CLANG_VERSION=18
CONDA_CMAKE=yes
VISION=yes
;;
pytorch-linux-jammy-py3.9-gcc11)
ANACONDA_PYTHON_VERSION=3.9
GCC_VERSION=11
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
CONDA_CMAKE=yes
TRITON=yes
DOCS=yes
UNINSTALL_DILL=yes
@ -381,14 +288,12 @@ case "$image" in
pytorch-linux-jammy-py3-clang12-executorch)
ANACONDA_PYTHON_VERSION=3.10
CLANG_VERSION=12
CONDA_CMAKE=yes
EXECUTORCH=yes
;;
pytorch-linux-jammy-py3.12-halide)
CUDA_VERSION=12.6
ANACONDA_PYTHON_VERSION=3.12
GCC_VERSION=11
CONDA_CMAKE=yes
HALIDE=yes
TRITON=yes
;;
@ -396,29 +301,23 @@ case "$image" in
CUDA_VERSION=12.6
ANACONDA_PYTHON_VERSION=3.12
GCC_VERSION=11
CONDA_CMAKE=yes
TRITON_CPU=yes
;;
pytorch-linux-focal-linter)
# TODO: Use 3.9 here because of this issue https://github.com/python/mypy/issues/13627.
# We will need to update mypy version eventually, but that's for another day. The task
# would be to upgrade mypy to 1.0.0 with Python 3.11
ANACONDA_PYTHON_VERSION=3.9
CONDA_CMAKE=yes
PYTHON_VERSION=3.9
;;
pytorch-linux-jammy-cuda11.8-cudnn9-py3.9-linter)
ANACONDA_PYTHON_VERSION=3.9
PYTHON_VERSION=3.9
CUDA_VERSION=11.8
CONDA_CMAKE=yes
;;
pytorch-linux-jammy-aarch64-py3.10-gcc11)
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=11
ACL=yes
PROTOBUF=yes
DB=yes
VISION=yes
CONDA_CMAKE=yes
# snadampal: skipping llvm src build install because the current version
# from pytorch/llvm:9.0.1 is x86 specific
SKIP_LLVM_SRC_BUILD_INSTALL=yes
@ -427,10 +326,7 @@ case "$image" in
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=11
ACL=yes
PROTOBUF=yes
DB=yes
VISION=yes
CONDA_CMAKE=yes
# snadampal: skipping llvm src build install because the current version
# from pytorch/llvm:9.0.1 is x86 specific
SKIP_LLVM_SRC_BUILD_INSTALL=yes
@ -438,8 +334,6 @@ case "$image" in
;;
*)
# Catch-all for builds that are not hardcoded.
PROTOBUF=yes
DB=yes
VISION=yes
echo "image '$image' did not match an existing build configuration"
if [[ "$image" == *py* ]]; then
@ -455,8 +349,7 @@ case "$image" in
TRITON=yes
# To ensure that any ROCm config will build using conda cmake
# and thus have LAPACK/MKL enabled
CONDA_CMAKE=yes
fi
fi
if [[ "$image" == *centos7* ]]; then
NINJA_VERSION=1.10.2
fi
@ -472,9 +365,6 @@ case "$image" in
if [[ "$image" == *glibc* ]]; then
extract_version_from_image_name glibc GLIBC_VERSION
fi
if [[ "$image" == *cmake* ]]; then
extract_version_from_image_name cmake CMAKE_VERSION
fi
;;
esac
@ -488,14 +378,20 @@ if [[ "$image" == *cuda* && ${OS} == "ubuntu" ]]; then
fi
fi
no_cache_flag=""
progress_flag=""
# Do not use cache and progress=plain when in CI
if [[ -n "${CI:-}" ]]; then
no_cache_flag="--no-cache"
progress_flag="--progress=plain"
fi
# Build image
docker build \
--no-cache \
--progress=plain \
${no_cache_flag} \
${progress_flag} \
--build-arg "BUILD_ENVIRONMENT=${image}" \
--build-arg "PROTOBUF=${PROTOBUF:-}" \
--build-arg "LLVMDEV=${LLVMDEV:-}" \
--build-arg "DB=${DB:-}" \
--build-arg "VISION=${VISION:-}" \
--build-arg "UBUNTU_VERSION=${UBUNTU_VERSION}" \
--build-arg "CENTOS_VERSION=${CENTOS_VERSION}" \
@ -503,14 +399,12 @@ docker build \
--build-arg "GLIBC_VERSION=${GLIBC_VERSION}" \
--build-arg "CLANG_VERSION=${CLANG_VERSION}" \
--build-arg "ANACONDA_PYTHON_VERSION=${ANACONDA_PYTHON_VERSION}" \
--build-arg "PYTHON_VERSION=${PYTHON_VERSION}" \
--build-arg "GCC_VERSION=${GCC_VERSION}" \
--build-arg "CUDA_VERSION=${CUDA_VERSION}" \
--build-arg "CUDNN_VERSION=${CUDNN_VERSION}" \
--build-arg "TENSORRT_VERSION=${TENSORRT_VERSION}" \
--build-arg "GRADLE_VERSION=${GRADLE_VERSION}" \
--build-arg "VULKAN_SDK_VERSION=${VULKAN_SDK_VERSION}" \
--build-arg "SWIFTSHADER=${SWIFTSHADER}" \
--build-arg "CMAKE_VERSION=${CMAKE_VERSION:-}" \
--build-arg "NINJA_VERSION=${NINJA_VERSION:-}" \
--build-arg "KATEX=${KATEX:-}" \
--build-arg "ROCM_VERSION=${ROCM_VERSION:-}" \
@ -518,7 +412,6 @@ docker build \
--build-arg "IMAGE_NAME=${IMAGE_NAME}" \
--build-arg "UCX_COMMIT=${UCX_COMMIT}" \
--build-arg "UCC_COMMIT=${UCC_COMMIT}" \
--build-arg "CONDA_CMAKE=${CONDA_CMAKE}" \
--build-arg "TRITON=${TRITON}" \
--build-arg "TRITON_CPU=${TRITON_CPU}" \
--build-arg "ONNX=${ONNX}" \
@ -527,6 +420,7 @@ docker build \
--build-arg "EXECUTORCH=${EXECUTORCH}" \
--build-arg "HALIDE=${HALIDE}" \
--build-arg "XPU_VERSION=${XPU_VERSION}" \
--build-arg "UNINSTALL_DILL=${UNINSTALL_DILL}" \
--build-arg "ACL=${ACL:-}" \
--build-arg "SKIP_SCCACHE_INSTALL=${SKIP_SCCACHE_INSTALL:-}" \
--build-arg "SKIP_LLVM_SRC_BUILD_INSTALL=${SKIP_LLVM_SRC_BUILD_INSTALL:-}" \
@ -544,7 +438,7 @@ docker build \
UBUNTU_VERSION=$(echo ${UBUNTU_VERSION} | sed 's/-rc$//')
function drun() {
docker run --rm "$tmp_tag" $*
docker run --rm "$tmp_tag" "$@"
}
if [[ "$OS" == "ubuntu" ]]; then
@ -592,3 +486,23 @@ if [ -n "$KATEX" ]; then
exit 1
fi
fi
HAS_TRITON=$(drun python -c "import triton" > /dev/null 2>&1 && echo "yes" || echo "no")
if [[ -n "$TRITON" || -n "$TRITON_CPU" ]]; then
if [ "$HAS_TRITON" = "no" ]; then
echo "expecting triton to be installed, but it is not"
exit 1
fi
elif [ "$HAS_TRITON" = "yes" ]; then
echo "expecting triton to not be installed, but it is"
exit 1
fi
# Sanity check cmake version. Executorch reinstalls cmake and I'm not sure if
# they support 4.0.0 yet, so exclude them from this check.
CMAKE_VERSION=$(drun cmake --version)
if [[ "$EXECUTORCH" != *yes* && "$CMAKE_VERSION" != *4.* ]]; then
echo "CMake version is not 4.0.0:"
drun cmake --version
exit 1
fi
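The new tail of the docker build script asserts that Triton is importable exactly when TRITON or TRITON_CPU was requested, and that the CMake inside the image is a 4.x release (ExecuTorch images are excluded because they reinstall CMake). Here is a hedged Python sketch of the same expectation checks; `run_in_image` is an assumed stand-in for the script's `drun` helper, not an existing function.

```python
import subprocess

def run_in_image(image: str, *cmd: str) -> subprocess.CompletedProcess:
    """Stand-in for drun(): run a command inside the freshly built image."""
    return subprocess.run(["docker", "run", "--rm", image, *cmd],
                          capture_output=True, text=True)

def check_triton(image: str, triton_requested: bool) -> None:
    has_triton = run_in_image(image, "python", "-c", "import triton").returncode == 0
    if triton_requested and not has_triton:
        raise RuntimeError("expecting triton to be installed, but it is not")
    if not triton_requested and has_triton:
        raise RuntimeError("expecting triton to not be installed, but it is")

def check_cmake_is_4x(image: str, is_executorch: bool) -> None:
    if is_executorch:  # ExecuTorch reinstalls its own cmake; skip the check
        return
    version = run_in_image(image, "cmake", "--version").stdout
    if "4." not in version:
        raise RuntimeError(f"CMake version is not 4.x: {version}")
```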

View File

@ -17,9 +17,8 @@ RUN bash ./install_base.sh && rm install_base.sh
# Update CentOS git version
RUN yum -y remove git
RUN yum -y remove git-*
RUN yum -y install https://packages.endpoint.com/rhel/7/os/x86_64/endpoint-repo-1.9-1.x86_64.rpm || \
(yum -y install https://packages.endpointdev.com/rhel/7/os/x86_64/endpoint-repo-1.9-1.x86_64.rpm && \
sed -i "s/packages.endpoint/packages.endpointdev/" /etc/yum.repos.d/endpoint.repo)
RUN yum -y install https://packages.endpointdev.com/rhel/7/os/x86_64/endpoint-repo-1.9-1.x86_64.rpm && \
sed -i 's/packages.endpoint/packages.endpointdev/' /etc/yum.repos.d/endpoint.repo
RUN yum install -y git
# Install devtoolset
@ -40,7 +39,6 @@ RUN bash ./install_user.sh && rm install_user.sh
# Install conda and other packages (e.g., numpy, pytest)
ARG ANACONDA_PYTHON_VERSION
ARG CONDA_CMAKE
ENV ANACONDA_PYTHON_VERSION=$ANACONDA_PYTHON_VERSION
ENV PATH /opt/conda/envs/py_$ANACONDA_PYTHON_VERSION/bin:/opt/conda/bin:$PATH
COPY requirements-ci.txt /opt/conda/requirements-ci.txt
@ -48,20 +46,6 @@ COPY ./common/install_conda.sh install_conda.sh
COPY ./common/common_utils.sh common_utils.sh
RUN bash ./install_conda.sh && rm install_conda.sh common_utils.sh /opt/conda/requirements-ci.txt
# (optional) Install protobuf for ONNX
ARG PROTOBUF
COPY ./common/install_protobuf.sh install_protobuf.sh
RUN if [ -n "${PROTOBUF}" ]; then bash ./install_protobuf.sh; fi
RUN rm install_protobuf.sh
ENV INSTALLED_PROTOBUF ${PROTOBUF}
# (optional) Install database packages like LMDB and LevelDB
ARG DB
COPY ./common/install_db.sh install_db.sh
RUN if [ -n "${DB}" ]; then bash ./install_db.sh; fi
RUN rm install_db.sh
ENV INSTALLED_DB ${DB}
# (optional) Install vision packages like OpenCV
ARG VISION
COPY ./common/install_vision.sh ./common/cache_vision_models.sh ./common/common_utils.sh ./
@ -75,7 +59,7 @@ COPY ./common/install_rocm.sh install_rocm.sh
RUN bash ./install_rocm.sh
RUN rm install_rocm.sh
COPY ./common/install_rocm_magma.sh install_rocm_magma.sh
RUN bash ./install_rocm_magma.sh
RUN bash ./install_rocm_magma.sh ${ROCM_VERSION}
RUN rm install_rocm_magma.sh
COPY ./common/install_amdsmi.sh install_amdsmi.sh
RUN bash ./install_amdsmi.sh
@ -89,12 +73,6 @@ ENV MAGMA_HOME /opt/rocm/magma
ENV LANG en_US.utf8
ENV LC_ALL en_US.utf8
# (optional) Install non-default CMake version
ARG CMAKE_VERSION
COPY ./common/install_cmake.sh install_cmake.sh
RUN if [ -n "${CMAKE_VERSION}" ]; then bash ./install_cmake.sh; fi
RUN rm install_cmake.sh
# (optional) Install non-default Ninja version
ARG NINJA_VERSION
COPY ./common/install_ninja.sh install_ninja.sh

View File

@ -1 +1 @@
5e4d6b6380d575e48e37e9d987fded4ec588e7bc
b173722085b3f555d6ba4533d6bbaddfd7c71144

View File

@ -1 +1 @@
v2.25.1-1
v2.26.5-1

View File

@ -1 +1 @@
83111ab22be6e4a588d184ac45175986a7dde9fc
b0e26b7359c147b8aa0af686c20510fb9b15990a

View File

@ -1 +1 @@
96316ce50fade7e209553aba4898cd9b82aab83b
c8757738a7418249896224430ce84888e8ecdd79

View File

@ -37,7 +37,7 @@ install_ubuntu() {
if [[ "$UBUNTU_VERSION" == "20.04"* && "$CUDA_VERSION" == "11.8"* ]]; then
maybe_libnccl_dev="libnccl2=2.15.5-1+cuda11.8 libnccl-dev=2.15.5-1+cuda11.8 --allow-downgrades --allow-change-held-packages"
elif [[ "$UBUNTU_VERSION" == "20.04"* && "$CUDA_VERSION" == "12.4"* ]]; then
maybe_libnccl_dev="libnccl2=2.25.1-1+cuda12.4 libnccl-dev=2.25.1-1+cuda12.4 --allow-downgrades --allow-change-held-packages"
maybe_libnccl_dev="libnccl2=2.26.2-1+cuda12.4 libnccl-dev=2.26.2-1+cuda12.4 --allow-downgrades --allow-change-held-packages"
else
maybe_libnccl_dev=""
fi
@ -99,9 +99,6 @@ install_centos() {
ccache_deps="asciidoc docbook-dtds docbook-style-xsl libxslt"
numpy_deps="gcc-gfortran"
# Note: protobuf-c-{compiler,devel} on CentOS are too old to be used
# for Caffe2. That said, we still install them to make sure the build
# system opts to build/use protoc and libprotobuf from third-party.
yum install -y \
$ccache_deps \
$numpy_deps \

View File

@ -9,7 +9,7 @@ install_ubuntu() {
# Instead use lib and headers from OpenSSL1.1 installed in `install_openssl.sh`
apt-get install -y cargo
echo "Checking out sccache repo"
git clone https://github.com/mozilla/sccache -b v0.9.1
git clone https://github.com/mozilla/sccache -b v0.10.0
cd sccache
echo "Building sccache"
cargo build --release

View File

@ -4,16 +4,10 @@ set -ex
if [ -n "$CLANG_VERSION" ]; then
if [[ $CLANG_VERSION == 9 && $UBUNTU_VERSION == 18.04 ]]; then
sudo apt-get update
# gpg-agent is not available by default on 18.04
sudo apt-get install -y --no-install-recommends gpg-agent
wget --no-check-certificate -O - https://apt.llvm.org/llvm-snapshot.gpg.key | sudo apt-key add -
apt-add-repository "deb http://apt.llvm.org/bionic/ llvm-toolchain-bionic-${CLANG_VERSION} main"
elif [[ $UBUNTU_VERSION == 22.04 ]]; then
if [[ $UBUNTU_VERSION == 22.04 ]]; then
# work around ubuntu apt-get conflicts
sudo apt-get -y -f install
wget --no-check-certificate -O - https://apt.llvm.org/llvm-snapshot.gpg.key | sudo apt-key add -
wget --no-check-certificate -O - https://apt.llvm.org/llvm-snapshot.gpg.key | sudo apt-key add -
if [[ $CLANG_VERSION == 18 ]]; then
apt-add-repository "deb http://apt.llvm.org/jammy/ llvm-toolchain-jammy-18 main"
fi
@ -41,7 +35,7 @@ if [ -n "$CLANG_VERSION" ]; then
# clang's packaging is a little messed up (the runtime libs aren't
# added into the linker path), so give it a little help
clang_lib=("/usr/lib/llvm-$CLANG_VERSION/lib/clang/"*"/lib/linux")
echo "$clang_lib" > /etc/ld.so.conf.d/clang.conf
echo "$clang_lib" >/etc/ld.so.conf.d/clang.conf
ldconfig
# Cleanup package manager

View File

@ -1,31 +0,0 @@
#!/bin/bash
set -ex
[ -n "$CMAKE_VERSION" ]
# Remove system cmake install so it won't get used instead
ID=$(grep -oP '(?<=^ID=).+' /etc/os-release | tr -d '"')
case "$ID" in
ubuntu)
apt-get remove cmake -y
;;
centos)
yum remove cmake -y
;;
*)
echo "Unable to determine OS..."
exit 1
;;
esac
# Turn 3.6.3 into v3.6
path=$(echo "${CMAKE_VERSION}" | sed -e 's/\([0-9].[0-9]\+\).*/v\1/')
file="cmake-${CMAKE_VERSION}-Linux-x86_64.tar.gz"
# Download and install specific CMake version in /usr/local
pushd /tmp
curl -Os --retry 3 "https://cmake.org/files/${path}/${file}"
tar -C /usr/local --strip-components 1 --no-same-owner -zxf cmake-*.tar.gz
rm -f cmake-*.tar.gz
popd

View File

@ -7,7 +7,7 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
BASE_URL="https://repo.anaconda.com/miniconda"
CONDA_FILE="Miniconda3-latest-Linux-x86_64.sh"
if [[ $(uname -m) == "aarch64" ]] || [[ "$BUILD_ENVIRONMENT" == *xpu* ]]; then
BASE_URL="https://github.com/conda-forge/miniforge/releases/latest/download"
BASE_URL="https://github.com/conda-forge/miniforge/releases/latest/download" # @lint-ignore
CONDA_FILE="Miniforge3-Linux-$(uname -m).sh"
fi
@ -62,7 +62,7 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
# libstdcxx from conda default channels are too old, we need GLIBCXX_3.4.30
# which is provided in libstdcxx 12 and up.
conda_install libstdcxx-ng=12.3.0 -c conda-forge
conda_install libstdcxx-ng=12.3.0 --update-deps -c conda-forge
# Install PyTorch conda deps, as per https://github.com/pytorch/pytorch README
if [[ $(uname -m) == "aarch64" ]]; then
@ -75,19 +75,11 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
# and libpython-static for torch deploy
conda_install llvmdev=8.0.0 "libpython-static=${ANACONDA_PYTHON_VERSION}"
# Use conda cmake in some cases. Conda cmake will be newer than our supported
# min version (3.5 for xenial and 3.10 for bionic), so we only do it in those
# following builds that we know should use conda. Specifically, Ubuntu bionic
# and focal cannot find conda mkl with stock cmake, so we need a cmake from conda
if [ -n "${CONDA_CMAKE}" ]; then
conda_install cmake
fi
# Magma package names are concatenation of CUDA major and minor ignoring revision
# I.e. magma-cuda102 package corresponds to CUDA_VERSION=10.2 and CUDA_VERSION=10.2.89
# Magma is installed from a tarball in the ossci-linux bucket into the conda env
if [ -n "$CUDA_VERSION" ]; then
${SCRIPT_FOLDER}/install_magma_conda.sh $(cut -f1-2 -d'.' <<< ${CUDA_VERSION}) ${ANACONDA_PYTHON_VERSION}
conda_run ${SCRIPT_FOLDER}/install_magma_conda.sh $(cut -f1-2 -d'.' <<< ${CUDA_VERSION})
fi
# Install some other packages, including those needed for Python test reporting
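Two small rules in install_conda.sh are worth spelling out: aarch64 and XPU builds download Miniforge instead of Miniconda, and the magma installer is now invoked through conda_run with only the major.minor part of CUDA_VERSION. A Python sketch of both rules, with function names that are illustrative only:

```python
import platform

def conda_installer_url(build_environment: str = "") -> str:
    """Mirror the aarch64/XPU special case: those builds use Miniforge, others Miniconda."""
    machine = platform.machine()
    if machine == "aarch64" or "xpu" in build_environment:
        base = "https://github.com/conda-forge/miniforge/releases/latest/download"
        return f"{base}/Miniforge3-Linux-{machine}.sh"
    return "https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh"

def cuda_major_minor(cuda_version: str) -> str:
    """'12.6.3' -> '12.6', the form passed to install_magma_conda.sh."""
    return ".".join(cuda_version.split(".")[:2])
```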

View File

@ -3,11 +3,11 @@
set -uex -o pipefail
PYTHON_DOWNLOAD_URL=https://www.python.org/ftp/python
PYTHON_DOWNLOAD_GITHUB_BRANCH=https://github.com/python/cpython/archive/refs/heads
PYTHON_DOWNLOAD_GITHUB_BRANCH=https://github.com/python/cpython/archive/refs/heads # @lint-ignore
GET_PIP_URL=https://bootstrap.pypa.io/get-pip.py
# Python versions to be installed in /opt/$VERSION_NO
CPYTHON_VERSIONS=${CPYTHON_VERSIONS:-"3.8.1 3.9.0 3.10.1 3.11.0 3.12.0 3.13.0 3.13.0t"}
CPYTHON_VERSIONS=${CPYTHON_VERSIONS:-"3.9.0 3.10.1 3.11.0 3.12.0 3.13.0 3.13.0t"}
function check_var {
if [ -z "$1" ]; then

View File

@ -2,140 +2,82 @@
set -ex
NCCL_VERSION=v2.25.1-1
CUDNN_VERSION=9.5.1.17
arch_path=''
targetarch=${TARGETARCH:-$(uname -m)}
if [ ${targetarch} = 'amd64' ] || [ "${targetarch}" = 'x86_64' ]; then
arch_path='x86_64'
else
arch_path='sbsa'
fi
function install_cusparselt_040 {
# cuSparseLt license: https://docs.nvidia.com/cuda/cusparselt/license.html
mkdir tmp_cusparselt && pushd tmp_cusparselt
wget -q https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-x86_64/libcusparse_lt-linux-x86_64-0.4.0.7-archive.tar.xz
tar xf libcusparse_lt-linux-x86_64-0.4.0.7-archive.tar.xz
cp -a libcusparse_lt-linux-x86_64-0.4.0.7-archive/include/* /usr/local/cuda/include/
cp -a libcusparse_lt-linux-x86_64-0.4.0.7-archive/lib/* /usr/local/cuda/lib64/
popd
rm -rf tmp_cusparselt
function install_cuda {
version=$1
runfile=$2
major_minor=${version%.*}
rm -rf /usr/local/cuda-${major_minor} /usr/local/cuda
if [[ ${arch_path} == 'sbsa' ]]; then
runfile="${runfile}_sbsa"
fi
runfile="${runfile}.run"
wget -q https://developer.download.nvidia.com/compute/cuda/${version}/local_installers/${runfile} -O ${runfile}
chmod +x ${runfile}
./${runfile} --toolkit --silent
rm -f ${runfile}
rm -f /usr/local/cuda && ln -s /usr/local/cuda-${major_minor} /usr/local/cuda
}
function install_cusparselt_062 {
# cuSparseLt license: https://docs.nvidia.com/cuda/cusparselt/license.html
mkdir tmp_cusparselt && pushd tmp_cusparselt
wget -q https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-x86_64/libcusparse_lt-linux-x86_64-0.6.2.3-archive.tar.xz
tar xf libcusparse_lt-linux-x86_64-0.6.2.3-archive.tar.xz
cp -a libcusparse_lt-linux-x86_64-0.6.2.3-archive/include/* /usr/local/cuda/include/
cp -a libcusparse_lt-linux-x86_64-0.6.2.3-archive/lib/* /usr/local/cuda/lib64/
popd
rm -rf tmp_cusparselt
}
function install_cusparselt_063 {
# cuSparseLt license: https://docs.nvidia.com/cuda/cusparselt/license.html
mkdir tmp_cusparselt && pushd tmp_cusparselt
wget -q https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-x86_64/libcusparse_lt-linux-x86_64-0.6.3.2-archive.tar.xz
tar xf libcusparse_lt-linux-x86_64-0.6.3.2-archive.tar.xz
cp -a libcusparse_lt-linux-x86_64-0.6.3.2-archive/include/* /usr/local/cuda/include/
cp -a libcusparse_lt-linux-x86_64-0.6.3.2-archive/lib/* /usr/local/cuda/lib64/
popd
rm -rf tmp_cusparselt
function install_cudnn {
cuda_major_version=$1
cudnn_version=$2
mkdir tmp_cudnn && cd tmp_cudnn
# cuDNN license: https://developer.nvidia.com/cudnn/license_agreement
filepath="cudnn-linux-${arch_path}-${cudnn_version}_cuda${cuda_major_version}-archive"
wget -q https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-${arch_path}/${filepath}.tar.xz
tar xf ${filepath}.tar.xz
cp -a ${filepath}/include/* /usr/local/cuda/include/
cp -a ${filepath}/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf tmp_cudnn
}
function install_118 {
CUDNN_VERSION=9.1.0.70
NCCL_VERSION=v2.21.5-1
echo "Installing CUDA 11.8 and cuDNN ${CUDNN_VERSION} and NCCL ${NCCL_VERSION} and cuSparseLt-0.4.0"
rm -rf /usr/local/cuda-11.8 /usr/local/cuda
# install CUDA 11.8.0 in the same container
wget -q https://developer.download.nvidia.com/compute/cuda/11.8.0/local_installers/cuda_11.8.0_520.61.05_linux.run
chmod +x cuda_11.8.0_520.61.05_linux.run
./cuda_11.8.0_520.61.05_linux.run --toolkit --silent
rm -f cuda_11.8.0_520.61.05_linux.run
rm -f /usr/local/cuda && ln -s /usr/local/cuda-11.8 /usr/local/cuda
echo "Installing CUDA 11.8 and cuDNN ${CUDNN_VERSION} and NCCL and cuSparseLt-0.4.0"
install_cuda 11.8.0 cuda_11.8.0_520.61.05_linux
# cuDNN license: https://developer.nvidia.com/cudnn/license_agreement
mkdir tmp_cudnn && cd tmp_cudnn
wget -q https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/cudnn-linux-x86_64-${CUDNN_VERSION}_cuda11-archive.tar.xz -O cudnn-linux-x86_64-${CUDNN_VERSION}_cuda11-archive.tar.xz
tar xf cudnn-linux-x86_64-${CUDNN_VERSION}_cuda11-archive.tar.xz
cp -a cudnn-linux-x86_64-${CUDNN_VERSION}_cuda11-archive/include/* /usr/local/cuda/include/
cp -a cudnn-linux-x86_64-${CUDNN_VERSION}_cuda11-archive/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf tmp_cudnn
install_cudnn 11 $CUDNN_VERSION
# NCCL license: https://docs.nvidia.com/deeplearning/nccl/#licenses
# Follow build: https://github.com/NVIDIA/nccl/tree/master?tab=readme-ov-file#build
git clone -b $NCCL_VERSION --depth 1 https://github.com/NVIDIA/nccl.git
cd nccl && make -j src.build
cp -a build/include/* /usr/local/cuda/include/
cp -a build/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf nccl
CUDA_VERSION=11.8 bash install_nccl.sh
install_cusparselt_040
CUDA_VERSION=11.8 bash install_cusparselt.sh
ldconfig
}
function install_124 {
CUDNN_VERSION=9.1.0.70
echo "Installing CUDA 12.4.1 and cuDNN ${CUDNN_VERSION} and NCCL ${NCCL_VERSION} and cuSparseLt-0.6.2"
rm -rf /usr/local/cuda-12.4 /usr/local/cuda
# install CUDA 12.4.1 in the same container
wget -q https://developer.download.nvidia.com/compute/cuda/12.4.1/local_installers/cuda_12.4.1_550.54.15_linux.run
chmod +x cuda_12.4.1_550.54.15_linux.run
./cuda_12.4.1_550.54.15_linux.run --toolkit --silent
rm -f cuda_12.4.1_550.54.15_linux.run
rm -f /usr/local/cuda && ln -s /usr/local/cuda-12.4 /usr/local/cuda
echo "Installing CUDA 12.4.1 and cuDNN ${CUDNN_VERSION} and NCCL and cuSparseLt-0.6.2"
install_cuda 12.4.1 cuda_12.4.1_550.54.15_linux
# cuDNN license: https://developer.nvidia.com/cudnn/license_agreement
mkdir tmp_cudnn && cd tmp_cudnn
wget -q https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz -O cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz
tar xf cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz
cp -a cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive/include/* /usr/local/cuda/include/
cp -a cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf tmp_cudnn
install_cudnn 12 $CUDNN_VERSION
# NCCL license: https://docs.nvidia.com/deeplearning/nccl/#licenses
# Follow build: https://github.com/NVIDIA/nccl/tree/master?tab=readme-ov-file#build
git clone -b $NCCL_VERSION --depth 1 https://github.com/NVIDIA/nccl.git
cd nccl && make -j src.build
cp -a build/include/* /usr/local/cuda/include/
cp -a build/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf nccl
CUDA_VERSION=12.4 bash install_nccl.sh
install_cusparselt_062
CUDA_VERSION=12.4 bash install_cusparselt.sh
ldconfig
}
function install_126 {
echo "Installing CUDA 12.6.3 and cuDNN ${CUDNN_VERSION} and NCCL ${NCCL_VERSION} and cuSparseLt-0.6.3"
rm -rf /usr/local/cuda-12.6 /usr/local/cuda
# install CUDA 12.6.3 in the same container
wget -q https://developer.download.nvidia.com/compute/cuda/12.6.3/local_installers/cuda_12.6.3_560.35.05_linux.run
chmod +x cuda_12.6.3_560.35.05_linux.run
./cuda_12.6.3_560.35.05_linux.run --toolkit --silent
rm -f cuda_12.6.3_560.35.05_linux.run
rm -f /usr/local/cuda && ln -s /usr/local/cuda-12.6 /usr/local/cuda
CUDNN_VERSION=9.5.1.17
echo "Installing CUDA 12.6.3 and cuDNN ${CUDNN_VERSION} and NCCL and cuSparseLt-0.6.3"
install_cuda 12.6.3 cuda_12.6.3_560.35.05_linux
# cuDNN license: https://developer.nvidia.com/cudnn/license_agreement
mkdir tmp_cudnn && cd tmp_cudnn
wget -q https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz -O cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz
tar xf cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz
cp -a cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive/include/* /usr/local/cuda/include/
cp -a cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf tmp_cudnn
install_cudnn 12 $CUDNN_VERSION
# NCCL license: https://docs.nvidia.com/deeplearning/nccl/#licenses
# Follow build: https://github.com/NVIDIA/nccl/tree/master?tab=readme-ov-file#build
git clone -b $NCCL_VERSION --depth 1 https://github.com/NVIDIA/nccl.git
cd nccl && make -j src.build
cp -a build/include/* /usr/local/cuda/include/
cp -a build/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf nccl
CUDA_VERSION=12.6 bash install_nccl.sh
install_cusparselt_063
CUDA_VERSION=12.6 bash install_cusparselt.sh
ldconfig
}
@ -240,35 +182,17 @@ function prune_126 {
}
function install_128 {
CUDNN_VERSION=9.7.1.26
echo "Installing CUDA 12.8.0 and cuDNN ${CUDNN_VERSION} and NCCL ${NCCL_VERSION} and cuSparseLt-0.6.3"
rm -rf /usr/local/cuda-12.8 /usr/local/cuda
# install CUDA 12.8.0 in the same container
wget -q https://developer.download.nvidia.com/compute/cuda/12.8.0/local_installers/cuda_12.8.0_570.86.10_linux.run
chmod +x cuda_12.8.0_570.86.10_linux.run
./cuda_12.8.0_570.86.10_linux.run --toolkit --silent
rm -f cuda_12.8.0_570.86.10_linux.run
rm -f /usr/local/cuda && ln -s /usr/local/cuda-12.8 /usr/local/cuda
CUDNN_VERSION=9.8.0.87
echo "Installing CUDA 12.8.1 and cuDNN ${CUDNN_VERSION} and NCCL and cuSparseLt-0.6.3"
# install CUDA 12.8.1 in the same container
install_cuda 12.8.1 cuda_12.8.1_570.124.06_linux
# cuDNN license: https://developer.nvidia.com/cudnn/license_agreement
mkdir tmp_cudnn && cd tmp_cudnn
wget -q https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz -O cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz
tar xf cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz
cp -a cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive/include/* /usr/local/cuda/include/
cp -a cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf tmp_cudnn
install_cudnn 12 $CUDNN_VERSION
# NCCL license: https://docs.nvidia.com/deeplearning/nccl/#licenses
# Follow build: https://github.com/NVIDIA/nccl/tree/master?tab=readme-ov-file#build
git clone -b $NCCL_VERSION --depth 1 https://github.com/NVIDIA/nccl.git
cd nccl && make -j src.build
cp -a build/include/* /usr/local/cuda/include/
cp -a build/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf nccl
CUDA_VERSION=12.8 bash install_nccl.sh
install_cusparselt_063
CUDA_VERSION=12.8 bash install_cusparselt.sh
ldconfig
}
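The refactored install_cuda.sh folds the per-version copy/paste into install_cuda() and install_cudnn() helpers and picks an x86_64 or sbsa artifact based on the target architecture. A Python sketch of how the download URLs are assembled under those rules; the URL layout is copied from the script, the functions themselves are mine.

```python
def cuda_runfile_url(version: str, runfile: str, arch_path: str) -> str:
    """e.g. ('12.8.1', 'cuda_12.8.1_570.124.06_linux', 'sbsa') appends _sbsa and .run
    under https://developer.download.nvidia.com/compute/cuda/<version>/local_installers/."""
    if arch_path == "sbsa":
        runfile = f"{runfile}_sbsa"
    return (f"https://developer.download.nvidia.com/compute/cuda/"
            f"{version}/local_installers/{runfile}.run")

def cudnn_archive_url(arch_path: str, cudnn_version: str, cuda_major: int) -> str:
    """Archive name pattern used by install_cudnn(): cudnn-linux-<arch>-<ver>_cuda<major>-archive."""
    filename = f"cudnn-linux-{arch_path}-{cudnn_version}_cuda{cuda_major}-archive"
    return (f"https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/"
            f"linux-{arch_path}/{filename}.tar.xz")
```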

View File

@ -1,211 +0,0 @@
#!/bin/bash
# Script used only in CD pipeline
set -ex
NCCL_VERSION=v2.21.5-1
CUDNN_VERSION=9.5.1.17
function install_cusparselt_062 {
# cuSparseLt license: https://docs.nvidia.com/cuda/cusparselt/license.html
mkdir tmp_cusparselt && pushd tmp_cusparselt
wget -q https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-sbsa/libcusparse_lt-linux-sbsa-0.6.2.3-archive.tar.xz
tar xf libcusparse_lt-linux-sbsa-0.6.2.3-archive.tar.xz
cp -a libcusparse_lt-linux-sbsa-0.6.2.3-archive/include/* /usr/local/cuda/include/
cp -a libcusparse_lt-linux-sbsa-0.6.2.3-archive/lib/* /usr/local/cuda/lib64/
popd
rm -rf tmp_cusparselt
}
function install_cusparselt_063 {
# cuSparseLt license: https://docs.nvidia.com/cuda/cusparselt/license.html
mkdir tmp_cusparselt && pushd tmp_cusparselt
wget -q https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-sbsa/libcusparse_lt-linux-sbsa-0.6.3.2-archive.tar.xz
tar xf libcusparse_lt-linux-sbsa-0.6.3.2-archive.tar.xz
cp -a libcusparse_lt-linux-sbsa-0.6.3.2-archive/include/* /usr/local/cuda/include/
cp -a libcusparse_lt-linux-sbsa-0.6.3.2-archive/lib/* /usr/local/cuda/lib64/
popd
rm -rf tmp_cusparselt
}
function install_124 {
CUDNN_VERSION=9.1.0.70
echo "Installing CUDA 12.4.1 and cuDNN ${CUDNN_VERSION} and NCCL ${NCCL_VERSION} and cuSparseLt-0.6.2"
rm -rf /usr/local/cuda-12.4 /usr/local/cuda
# install CUDA 12.4.1 in the same container
wget -q https://developer.download.nvidia.com/compute/cuda/12.4.1/local_installers/cuda_12.4.1_550.54.15_linux_sbsa.run
chmod +x cuda_12.4.1_550.54.15_linux_sbsa.run
./cuda_12.4.1_550.54.15_linux_sbsa.run --toolkit --silent
rm -f cuda_12.4.1_550.54.15_linux_sbsa.run
rm -f /usr/local/cuda && ln -s /usr/local/cuda-12.4 /usr/local/cuda
# cuDNN license: https://developer.nvidia.com/cudnn/license_agreement
mkdir tmp_cudnn && cd tmp_cudnn
wget -q https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-sbsa/cudnn-linux-sbsa-${CUDNN_VERSION}_cuda12-archive.tar.xz -O cudnn-linux-sbsa-${CUDNN_VERSION}_cuda12-archive.tar.xz
tar xf cudnn-linux-sbsa-${CUDNN_VERSION}_cuda12-archive.tar.xz
cp -a cudnn-linux-sbsa-${CUDNN_VERSION}_cuda12-archive/include/* /usr/local/cuda/include/
cp -a cudnn-linux-sbsa-${CUDNN_VERSION}_cuda12-archive/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf tmp_cudnn
# NCCL license: https://docs.nvidia.com/deeplearning/nccl/#licenses
# Follow build: https://github.com/NVIDIA/nccl/tree/master?tab=readme-ov-file#build
git clone -b ${NCCL_VERSION} --depth 1 https://github.com/NVIDIA/nccl.git
cd nccl && make -j src.build
cp -a build/include/* /usr/local/cuda/include/
cp -a build/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf nccl
install_cusparselt_063
ldconfig
}
function prune_124 {
echo "Pruning CUDA 12.4"
#####################################################################################
# CUDA 12.4 prune static libs
#####################################################################################
export NVPRUNE="/usr/local/cuda-12.4/bin/nvprune"
export CUDA_LIB_DIR="/usr/local/cuda-12.4/lib64"
export GENCODE="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"
export GENCODE_CUDNN="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"
if [[ -n "$OVERRIDE_GENCODE" ]]; then
export GENCODE=$OVERRIDE_GENCODE
fi
# all CUDA libs except CuDNN and CuBLAS
ls $CUDA_LIB_DIR/ | grep "\.a" | grep -v "culibos" | grep -v "cudart" | grep -v "cudnn" | grep -v "cublas" | grep -v "metis" \
| xargs -I {} bash -c \
"echo {} && $NVPRUNE $GENCODE $CUDA_LIB_DIR/{} -o $CUDA_LIB_DIR/{}"
# prune CuDNN and CuBLAS
$NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublas_static.a -o $CUDA_LIB_DIR/libcublas_static.a
$NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublasLt_static.a -o $CUDA_LIB_DIR/libcublasLt_static.a
#####################################################################################
# CUDA 12.4 prune visual tools
#####################################################################################
export CUDA_BASE="/usr/local/cuda-12.4/"
rm -rf $CUDA_BASE/libnvvp $CUDA_BASE/nsightee_plugins $CUDA_BASE/nsight-compute-2024.1.0 $CUDA_BASE/nsight-systems-2023.4.4/
}
function install_126 {
echo "Installing CUDA 12.6.3 and cuDNN ${CUDNN_VERSION} and NCCL ${NCCL_VERSION} and cuSparseLt-0.6.3"
rm -rf /usr/local/cuda-12.6 /usr/local/cuda
# install CUDA 12.6.3 in the same container
wget -q https://developer.download.nvidia.com/compute/cuda/12.6.3/local_installers/cuda_12.6.3_560.35.05_linux_sbsa.run
chmod +x cuda_12.6.3_560.35.05_linux_sbsa.run
./cuda_12.6.3_560.35.05_linux_sbsa.run --toolkit --silent
rm -f cuda_12.6.3_560.35.05_linux_sbsa.run
rm -f /usr/local/cuda && ln -s /usr/local/cuda-12.6 /usr/local/cuda
# cuDNN license: https://developer.nvidia.com/cudnn/license_agreement
mkdir tmp_cudnn && cd tmp_cudnn
wget -q https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-sbsa/cudnn-linux-sbsa-${CUDNN_VERSION}_cuda12-archive.tar.xz -O cudnn-linux-sbsa-${CUDNN_VERSION}_cuda12-archive.tar.xz
tar xf cudnn-linux-sbsa-${CUDNN_VERSION}_cuda12-archive.tar.xz
cp -a cudnn-linux-sbsa-${CUDNN_VERSION}_cuda12-archive/include/* /usr/local/cuda/include/
cp -a cudnn-linux-sbsa-${CUDNN_VERSION}_cuda12-archive/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf tmp_cudnn
# NCCL license: https://docs.nvidia.com/deeplearning/nccl/#licenses
# Follow build: https://github.com/NVIDIA/nccl/tree/master?tab=readme-ov-file#build
git clone -b ${NCCL_VERSION} --depth 1 https://github.com/NVIDIA/nccl.git
cd nccl && make -j src.build
cp -a build/include/* /usr/local/cuda/include/
cp -a build/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf nccl
install_cusparselt_063
ldconfig
}
function prune_126 {
echo "Pruning CUDA 12.6"
#####################################################################################
# CUDA 12.6 prune static libs
#####################################################################################
export NVPRUNE="/usr/local/cuda-12.6/bin/nvprune"
export CUDA_LIB_DIR="/usr/local/cuda-12.6/lib64"
export GENCODE="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"
export GENCODE_CUDNN="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"
if [[ -n "$OVERRIDE_GENCODE" ]]; then
export GENCODE=$OVERRIDE_GENCODE
fi
if [[ -n "$OVERRIDE_GENCODE_CUDNN" ]]; then
export GENCODE_CUDNN=$OVERRIDE_GENCODE_CUDNN
fi
# all CUDA libs except CuDNN and CuBLAS
ls $CUDA_LIB_DIR/ | grep "\.a" | grep -v "culibos" | grep -v "cudart" | grep -v "cudnn" | grep -v "cublas" | grep -v "metis" \
| xargs -I {} bash -c \
"echo {} && $NVPRUNE $GENCODE $CUDA_LIB_DIR/{} -o $CUDA_LIB_DIR/{}"
# prune CuDNN and CuBLAS
$NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublas_static.a -o $CUDA_LIB_DIR/libcublas_static.a
$NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublasLt_static.a -o $CUDA_LIB_DIR/libcublasLt_static.a
#####################################################################################
# CUDA 12.6 prune visual tools
#####################################################################################
export CUDA_BASE="/usr/local/cuda-12.6/"
rm -rf $CUDA_BASE/libnvvp $CUDA_BASE/nsightee_plugins $CUDA_BASE/nsight-compute-2024.3.2 $CUDA_BASE/nsight-systems-2024.5.1/
}
function install_128 {
CUDNN_VERSION=9.7.1.26
echo "Installing CUDA 12.8.0 and cuDNN ${CUDNN_VERSION} and NCCL ${NCCL_VERSION} and cuSparseLt-0.6.3"
rm -rf /usr/local/cuda-12.8 /usr/local/cuda
# install CUDA 12.8.0 in the same container
wget -q https://developer.download.nvidia.com/compute/cuda/12.8.0/local_installers/cuda_12.8.0_570.86.10_linux_sbsa.run
chmod +x cuda_12.8.0_570.86.10_linux_sbsa.run
./cuda_12.8.0_570.86.10_linux_sbsa.run --toolkit --silent
rm -f cuda_12.8.0_570.86.10_linux_sbsa.run
rm -f /usr/local/cuda && ln -s /usr/local/cuda-12.8 /usr/local/cuda
# cuDNN license: https://developer.nvidia.com/cudnn/license_agreement
mkdir tmp_cudnn && cd tmp_cudnn
wget -q https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-sbsa/cudnn-linux-sbsa-${CUDNN_VERSION}_cuda12-archive.tar.xz -O cudnn-linux-sbsa-${CUDNN_VERSION}_cuda12-archive.tar.xz
tar xf cudnn-linux-sbsa-${CUDNN_VERSION}_cuda12-archive.tar.xz
cp -a cudnn-linux-sbsa-${CUDNN_VERSION}_cuda12-archive/include/* /usr/local/cuda/include/
cp -a cudnn-linux-sbsa-${CUDNN_VERSION}_cuda12-archive/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf tmp_cudnn
# NCCL license: https://docs.nvidia.com/deeplearning/nccl/#licenses
# Follow build: https://github.com/NVIDIA/nccl/tree/master?tab=readme-ov-file#build
git clone -b ${NCCL_VERSION} --depth 1 https://github.com/NVIDIA/nccl.git
cd nccl && make -j src.build
cp -a build/include/* /usr/local/cuda/include/
cp -a build/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf nccl
install_cusparselt_063
ldconfig
}
# idiomatic parameter and option handling in sh
while test $# -gt 0
do
case "$1" in
12.4) install_124; prune_124
;;
12.6) install_126; prune_126
;;
12.8) install_128;
;;
*) echo "bad argument $1"; exit 1
;;
esac
shift
done

View File

@ -5,7 +5,7 @@ if [[ -n "${CUDNN_VERSION}" ]]; then
mkdir tmp_cudnn
pushd tmp_cudnn
if [[ ${CUDA_VERSION:0:4} == "12.8" ]]; then
CUDNN_NAME="cudnn-linux-x86_64-9.7.1.26_cuda12-archive"
CUDNN_NAME="cudnn-linux-x86_64-9.8.0.87_cuda12-archive"
elif [[ ${CUDA_VERSION:0:4} == "12.6" ]]; then
CUDNN_NAME="cudnn-linux-x86_64-9.5.1.17_cuda12-archive"
elif [[ ${CUDA_VERSION:0:2} == "12" ]]; then

View File

@ -1,38 +0,0 @@
#!/bin/bash
set -ex
install_ubuntu() {
apt-get update
# Cleanup
apt-get autoclean && apt-get clean
rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
}
install_centos() {
# Need EPEL for many packages we depend on.
# See http://fedoraproject.org/wiki/EPEL
yum --enablerepo=extras install -y epel-release
# Cleanup
yum clean all
rm -rf /var/cache/yum
rm -rf /var/lib/yum/yumdb
rm -rf /var/lib/yum/history
}
# Install base packages depending on the base OS
ID=$(grep -oP '(?<=^ID=).+' /etc/os-release | tr -d '"')
case "$ID" in
ubuntu)
install_ubuntu
;;
centos)
install_centos
;;
*)
echo "Unable to determine OS..."
exit 1
;;
esac

View File

@ -13,7 +13,7 @@ clone_executorch() {
# and fetch the target commit
pushd executorch
git checkout "${EXECUTORCH_PINNED_COMMIT}"
git submodule update --init
git submodule update --init --recursive
popd
chown -R jenkins executorch
@ -50,10 +50,9 @@ setup_executorch() {
pushd executorch
export PYTHON_EXECUTABLE=python
export EXECUTORCH_BUILD_PYBIND=ON
export CMAKE_ARGS="-DEXECUTORCH_BUILD_XNNPACK=ON -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON"
export CMAKE_ARGS="-DEXECUTORCH_BUILD_PYBIND=ON -DEXECUTORCH_BUILD_XNNPACK=ON -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON"
as_jenkins .ci/scripts/setup-linux.sh cmake || true
as_jenkins .ci/scripts/setup-linux.sh --build-tool cmake || true
popd
}

View File

@ -17,7 +17,7 @@ if [ -n "${UBUNTU_VERSION}" ];then
libopenblas-dev libeigen3-dev libatlas-base-dev libzstd-dev
fi
conda_install numpy scipy imageio cmake ninja
pip_install numpy scipy imageio cmake ninja
git clone --depth 1 --branch release/16.x --recursive https://github.com/llvm/llvm-project.git
cmake -DCMAKE_BUILD_TYPE=Release \
@ -35,7 +35,9 @@ git clone https://github.com/halide/Halide.git
pushd Halide
git checkout ${COMMIT} && git submodule update --init --recursive
pip_install -r requirements.txt
cmake -G Ninja -DCMAKE_BUILD_TYPE=Release -S . -B build
# NOTE: pybind has a requirement for cmake > 3.5 so set the minimum cmake version here with a flag
# Context: https://github.com/pytorch/pytorch/issues/150420
cmake -G Ninja -DCMAKE_POLICY_VERSION_MINIMUM=3.5 -DCMAKE_BUILD_TYPE=Release -S . -B build
cmake --build build
test -e ${CONDA_PREFIX}/lib/python3 || ln -s python${ANACONDA_PYTHON_VERSION} ${CONDA_PREFIX}/lib/python3
cmake --install build --prefix ${CONDA_PREFIX}

View File

@ -14,16 +14,9 @@ function install_timm() {
local commit
commit=$(get_pinned_commit timm)
# TODO (huydhn): There is no torchvision release on 3.13 when I write this, so
# I'm using nightly here instead. We just need to package to be able to install
# TIMM. Removing this once vision has a release on 3.13
if [[ "${ANACONDA_PYTHON_VERSION}" == "3.13" ]]; then
pip_install --pre torch torchvision --index-url https://download.pytorch.org/whl/nightly/cu124
fi
pip_install "git+https://github.com/huggingface/pytorch-image-models@${commit}"
# Clean up
conda_run pip uninstall -y cmake torch torchvision triton
conda_run pip uninstall -y torch torchvision triton
}
# Pango is needed for weasyprint which is needed for doctr

View File

@ -2,8 +2,6 @@
set -ex
source "$(dirname "${BASH_SOURCE[0]}")/common_utils.sh"
if [ -n "${UBUNTU_VERSION}" ]; then
apt update
apt-get install -y clang doxygen git graphviz nodejs npm libtinfo5
@ -15,8 +13,8 @@ chown -R jenkins pytorch
pushd pytorch
# Install all linter dependencies
pip_install -r requirements.txt
conda_run lintrunner init
pip install -r requirements.txt
lintrunner init
# Cache .lintbin directory as part of the Docker image
cp -r .lintbin /tmp

View File

@ -1,26 +1,23 @@
#!/usr/bin/env bash
# Script that replaces the magma install from a conda package
# Script that installs magma from tarball inside conda environment.
# It replaces anaconda magma-cuda package which is no longer published.
# Execute it inside active conda environment.
# See issue: https://github.com/pytorch/pytorch/issues/138506
set -eou pipefail
function do_install() {
cuda_version_nodot=${1/./}
anaconda_python_version=$2
cuda_version_nodot=${1/./}
anaconda_dir=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}
MAGMA_VERSION="2.6.1"
magma_archive="magma-cuda${cuda_version_nodot}-${MAGMA_VERSION}-1.tar.bz2"
anaconda_dir="/opt/conda/envs/py_${anaconda_python_version}"
(
set -x
tmp_dir=$(mktemp -d)
pushd ${tmp_dir}
curl -OLs https://ossci-linux.s3.us-east-1.amazonaws.com/${magma_archive}
tar -xvf "${magma_archive}"
mv include/* "${anaconda_dir}/include/"
mv lib/* "${anaconda_dir}/lib"
popd
)
}
do_install $1 $2
MAGMA_VERSION="2.6.1"
magma_archive="magma-cuda${cuda_version_nodot}-${MAGMA_VERSION}-1.tar.bz2"
(
set -x
tmp_dir=$(mktemp -d)
pushd ${tmp_dir}
curl -OLs https://ossci-linux.s3.us-east-1.amazonaws.com/${magma_archive}
tar -xvf "${magma_archive}"
mv include/* "${anaconda_dir}/include/"
mv lib/* "${anaconda_dir}/lib"
popd
)
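After this change install_magma_conda.sh takes only the CUDA version, strips the dot to form the tarball name, and installs into whatever conda environment is active via CONDA_PREFIX. A short Python sketch of the naming rule, with the magma version pinned as in the script:

```python
MAGMA_VERSION = "2.6.1"

def magma_archive_name(cuda_version: str) -> str:
    """'12.6' -> 'magma-cuda126-2.6.1-1.tar.bz2', fetched from the ossci-linux S3 bucket."""
    cuda_version_nodot = cuda_version.replace(".", "")
    return f"magma-cuda{cuda_version_nodot}-{MAGMA_VERSION}-1.tar.bz2"

assert magma_archive_name("12.6") == "magma-cuda126-2.6.1-1.tar.bz2"
```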

View File

@ -0,0 +1,26 @@
#!/bin/bash
set -ex
NCCL_VERSION=""
if [[ ${CUDA_VERSION:0:2} == "11" ]]; then
NCCL_VERSION=$(cat ci_commit_pins/nccl-cu11.txt)
elif [[ ${CUDA_VERSION:0:2} == "12" ]]; then
NCCL_VERSION=$(cat ci_commit_pins/nccl-cu12.txt)
else
echo "Unexpected CUDA_VERSION ${CUDA_VERSION}"
exit 1
fi
if [[ -n "${NCCL_VERSION}" ]]; then
# NCCL license: https://docs.nvidia.com/deeplearning/nccl/#licenses
# Follow build: https://github.com/NVIDIA/nccl/tree/master?tab=readme-ov-file#build
git clone -b $NCCL_VERSION --depth 1 https://github.com/NVIDIA/nccl.git
pushd nccl
make -j src.build
cp -a build/include/* /usr/local/cuda/include/
cp -a build/lib/* /usr/local/cuda/lib64/
popd
rm -rf nccl
ldconfig
fi
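The new install_nccl.sh reads the NCCL tag from a per-CUDA-major pin file (ci_commit_pins/nccl-cu11.txt or nccl-cu12.txt) and builds NCCL from that tag. A Python sketch of the pin lookup, assuming the same relative paths as the docker build context:

```python
from pathlib import Path

def nccl_pin(cuda_version: str, pin_dir: str = "ci_commit_pins") -> str:
    """Return the pinned NCCL branch/tag for the given CUDA version."""
    major = cuda_version.split(".")[0]
    if major not in ("11", "12"):
        raise ValueError(f"Unexpected CUDA_VERSION {cuda_version}")
    return Path(pin_dir, f"nccl-cu{major}.txt").read_text().strip()
```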

View File

@ -31,8 +31,7 @@ pip_install \
pip_install coloredlogs packaging
pip_install onnxruntime==1.18.1
pip_install onnx==1.17.0
pip_install onnxscript==0.2.2 --no-deps
pip_install onnxscript==0.2.6 --no-deps
# required by onnxscript
pip_install ml_dtypes

View File

@ -1,19 +0,0 @@
#!/bin/bash
set -ex
pb_dir="/usr/temp_pb_install_dir"
mkdir -p $pb_dir
# On the nvidia/cuda:9-cudnn7-devel-centos7 image we need this symlink or
# else it will fail with
# g++: error: ./../lib64/crti.o: No such file or directory
ln -s /usr/lib64 "$pb_dir/lib64"
curl -LO "https://github.com/protocolbuffers/protobuf/releases/download/v3.17.3/protobuf-all-3.17.3.tar.gz" --retry 3
tar -xvz --no-same-owner -C "$pb_dir" --strip-components 1 -f protobuf-all-3.17.3.tar.gz
NPROC=$[$(nproc) - 2]
pushd "$pb_dir" && ./configure && make -j${NPROC} && make -j${NPROC} check && sudo make -j${NRPOC} install && sudo ldconfig
popd
rm -rf $pb_dir

View File

@ -0,0 +1,15 @@
#!/bin/bash
set -ex
apt-get update
# Use deadsnakes in case we need an older python version
sudo add-apt-repository ppa:deadsnakes/ppa
apt-get install -y python${PYTHON_VERSION} python${PYTHON_VERSION}-dev python3-pip python${PYTHON_VERSION}-venv
# Use a venv because uv and some other package managers don't support --user install
ln -s /usr/bin/python${PYTHON_VERSION} /usr/bin/python
python -m venv /var/lib/jenkins/ci_env
source /var/lib/jenkins/ci_env/bin/activate
python -mpip install --upgrade pip
python -mpip install -r /opt/requirements-ci.txt

View File

@ -8,10 +8,6 @@ ver() {
install_ubuntu() {
apt-get update
if [[ $UBUNTU_VERSION == 18.04 ]]; then
# gpg-agent is not available by default on 18.04
apt-get install -y --no-install-recommends gpg-agent
fi
if [[ $UBUNTU_VERSION == 20.04 ]]; then
# gpg-agent is not available by default on 20.04
apt-get install -y --no-install-recommends gpg-agent
@ -23,6 +19,13 @@ install_ubuntu() {
apt-get install -y libc++1
apt-get install -y libc++abi1
# Make sure rocm packages from repo.radeon.com have highest priority
cat << EOF > /etc/apt/preferences.d/rocm-pin-600
Package: *
Pin: release o=repo.radeon.com
Pin-Priority: 600
EOF
# Add amdgpu repository
UBUNTU_VERSION_NAME=`cat /etc/os-release | grep UBUNTU_CODENAME | awk -F= '{print $2}'`
echo "deb [arch=amd64] https://repo.radeon.com/amdgpu/${ROCM_VERSION}/ubuntu ${UBUNTU_VERSION_NAME} main" > /etc/apt/sources.list.d/amdgpu.list
@ -63,17 +66,25 @@ install_ubuntu() {
done
# ROCm 6.3 had a regression where initializing static code objects had significant overhead
if [[ $(ver $ROCM_VERSION) -eq $(ver 6.3) ]]; then
# ROCm 6.4 has not yet fixed the regression; the HIP branch names also differ
if [[ $(ver $ROCM_VERSION) -eq $(ver 6.3) ]] || [[ $(ver $ROCM_VERSION) -eq $(ver 6.4) ]]; then
if [[ $(ver $ROCM_VERSION) -eq $(ver 6.3) ]]; then
HIP_BRANCH=rocm-6.3.x
VER_STR=6.3
elif [[ $(ver $ROCM_VERSION) -eq $(ver 6.4) ]]; then
HIP_BRANCH=release/rocm-rel-6.4
VER_STR=6.4
fi
# clr build needs CppHeaderParser but can only find it using conda's python
/opt/conda/bin/python -m pip install CppHeaderParser
git clone https://github.com/ROCm/HIP -b rocm-6.3.x
git clone https://github.com/ROCm/HIP -b $HIP_BRANCH
HIP_COMMON_DIR=$(readlink -f HIP)
git clone https://github.com/jeffdaily/clr -b release/rocm-rel-6.3-statco-hotfix
git clone https://github.com/jeffdaily/clr -b release/rocm-rel-${VER_STR}-statco-hotfix
mkdir -p clr/build
pushd clr/build
cmake .. -DCLR_BUILD_HIP=ON -DHIP_COMMON_DIR=$HIP_COMMON_DIR
make -j
cp hipamd/lib/libamdhip64.so.6.3.* /opt/rocm/lib/libamdhip64.so.6.3.*
cp hipamd/lib/libamdhip64.so.${VER_STR}.* /opt/rocm/lib/libamdhip64.so.${VER_STR}.*
popd
rm -rf HIP clr
fi

View File

@ -1,50 +1,32 @@
#!/bin/bash
# Script used in CI and CD pipeline
#!/usr/bin/env bash
# Script used only in CD pipeline
set -ex
set -eou pipefail
# Magma build scripts need `python`
ln -sf /usr/bin/python3 /usr/bin/python
function do_install() {
rocm_version=$1
rocm_version_nodot=${1//./}
ID=$(grep -oP '(?<=^ID=).+' /etc/os-release | tr -d '"')
case "$ID" in
almalinux)
yum install -y gcc-gfortran
;;
*)
echo "No preinstalls to build magma..."
;;
esac
# Version 2.7.2 + ROCm related updates
MAGMA_VERSION=a1625ff4d9bc362906bd01f805dbbe12612953f6
magma_archive="magma-rocm${rocm_version_nodot}-${MAGMA_VERSION}-1.tar.bz2"
MKLROOT=${MKLROOT:-/opt/conda/envs/py_$ANACONDA_PYTHON_VERSION}
rocm_dir="/opt/rocm"
(
set -x
tmp_dir=$(mktemp -d)
pushd ${tmp_dir}
curl -OLs https://ossci-linux.s3.us-east-1.amazonaws.com/${magma_archive}
if tar -xvf "${magma_archive}"
then
mkdir -p "${rocm_dir}/magma"
mv include "${rocm_dir}/magma/include"
mv lib "${rocm_dir}/magma/lib"
else
echo "${magma_archive} not found, skipping magma install"
fi
popd
)
}
# "install" hipMAGMA into /opt/rocm/magma by copying after build
git clone https://bitbucket.org/icl/magma.git
pushd magma
# Version 2.7.2 + ROCm related updates
git checkout a1625ff4d9bc362906bd01f805dbbe12612953f6
cp make.inc-examples/make.inc.hip-gcc-mkl make.inc
echo 'LIBDIR += -L$(MKLROOT)/lib' >> make.inc
if [[ -f "${MKLROOT}/lib/libmkl_core.a" ]]; then
echo 'LIB = -Wl,--start-group -lmkl_gf_lp64 -lmkl_gnu_thread -lmkl_core -Wl,--end-group -lpthread -lstdc++ -lm -lgomp -lhipblas -lhipsparse' >> make.inc
fi
echo 'LIB += -Wl,--enable-new-dtags -Wl,--rpath,/opt/rocm/lib -Wl,--rpath,$(MKLROOT)/lib -Wl,--rpath,/opt/rocm/magma/lib -ldl' >> make.inc
echo 'DEVCCFLAGS += --gpu-max-threads-per-block=256' >> make.inc
export PATH="${PATH}:/opt/rocm/bin"
if [[ -n "$PYTORCH_ROCM_ARCH" ]]; then
amdgpu_targets=`echo $PYTORCH_ROCM_ARCH | sed 's/;/ /g'`
else
amdgpu_targets=`rocm_agent_enumerator | grep -v gfx000 | sort -u | xargs`
fi
for arch in $amdgpu_targets; do
echo "DEVCCFLAGS += --offload-arch=$arch" >> make.inc
done
# hipcc with openmp flag may cause isnan() on __device__ not to be found; depending on context, compiler may attempt to match with host definition
sed -i 's/^FOPENMP/#FOPENMP/g' make.inc
make -f make.gen.hipMAGMA -j $(nproc)
LANG=C.UTF-8 make lib/libmagma.so -j $(nproc) MKLROOT="${MKLROOT}"
make testing/testing_dgemm -j $(nproc) MKLROOT="${MKLROOT}"
popd
mv magma /opt/rocm
do_install $1
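Usage sketch matching the Dockerfile invocation further down in this diff (install_rocm_magma.sh ${ROCM_VERSION}):
bash ./install_rocm_magma.sh 6.4   # fetches magma-rocm64-${MAGMA_VERSION}-1.tar.bz2 and unpacks it into /opt/rocm/magma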

View File

@ -1,24 +0,0 @@
#!/bin/bash
set -ex
[ -n "${SWIFTSHADER}" ]
retry () {
$* || (sleep 1 && $*) || (sleep 2 && $*) || (sleep 4 && $*) || (sleep 8 && $*)
}
_https_amazon_aws=https://ossci-android.s3.amazonaws.com
# SwiftShader
_swiftshader_dir=/var/lib/jenkins/swiftshader
_swiftshader_file_targz=swiftshader-abe07b943-prebuilt.tar.gz
mkdir -p $_swiftshader_dir
_tmp_swiftshader_targz="/tmp/${_swiftshader_file_targz}"
curl --silent --show-error --location --fail --retry 3 \
--output "${_tmp_swiftshader_targz}" "$_https_amazon_aws/${_swiftshader_file_targz}"
tar -C "${_swiftshader_dir}" -xzf "${_tmp_swiftshader_targz}"
export VK_ICD_FILENAMES="${_swiftshader_dir}/build/Linux/vk_swiftshader_icd.json"

View File

@ -2,14 +2,16 @@
set -ex
mkdir -p /opt/triton
if [ -z "${TRITON}" ] && [ -z "${TRITON_CPU}" ]; then
echo "TRITON and TRITON_CPU are not set. Exiting..."
exit 0
fi
source "$(dirname "${BASH_SOURCE[0]}")/common_utils.sh"
get_conda_version() {
as_jenkins conda list -n py_$ANACONDA_PYTHON_VERSION | grep -w $* | head -n 1 | awk '{print $2}'
}
conda_reinstall() {
as_jenkins conda install -q -n py_$ANACONDA_PYTHON_VERSION -y --force-reinstall $*
get_pip_version() {
conda_run pip list | grep -w $* | head -n 1 | awk '{print $2}'
}
if [ -n "${XPU_VERSION}" ]; then
@ -31,11 +33,9 @@ if [ -n "${UBUNTU_VERSION}" ];then
apt-get install -y gpg-agent
fi
if [ -n "${CONDA_CMAKE}" ]; then
# Keep the current cmake and numpy version here, so we can reinstall them later
CMAKE_VERSION=$(get_conda_version cmake)
NUMPY_VERSION=$(get_conda_version numpy)
fi
# Keep the current cmake and numpy version here, so we can reinstall them later
CMAKE_VERSION=$(get_pip_version cmake)
NUMPY_VERSION=$(get_pip_version numpy)
if [ -z "${MAX_JOBS}" ]; then
export MAX_JOBS=$(nproc)
@ -52,6 +52,7 @@ cd triton
as_jenkins git checkout ${TRITON_PINNED_COMMIT}
as_jenkins git submodule update --init --recursive
cd python
pip_install pybind11==2.13.6
# TODO: remove patch setup.py once we have a proper fix for https://github.com/triton-lang/triton/issues/4527
as_jenkins sed -i -e 's/https:\/\/tritonlang.blob.core.windows.net\/llvm-builds/https:\/\/oaitriton.blob.core.windows.net\/public\/llvm-builds/g' setup.py
@ -60,28 +61,35 @@ if [ -n "${UBUNTU_VERSION}" ] && [ -n "${GCC_VERSION}" ] && [[ "${GCC_VERSION}"
# Triton needs at least gcc-9 to build
apt-get install -y g++-9
CXX=g++-9 pip_install .
CXX=g++-9 conda_run python setup.py bdist_wheel
elif [ -n "${UBUNTU_VERSION}" ] && [ -n "${CLANG_VERSION}" ]; then
# Triton needs <filesystem> which surprisingly is not available with clang-9 toolchain
add-apt-repository -y ppa:ubuntu-toolchain-r/test
apt-get install -y g++-9
CXX=g++-9 pip_install .
CXX=g++-9 conda_run python setup.py bdist_wheel
else
pip_install .
conda_run python setup.py bdist_wheel
fi
if [ -n "${CONDA_CMAKE}" ]; then
# TODO: This is to make sure that the same cmake and numpy version from install conda
# script is used. Without this step, the newer cmake version (3.25.2) downloaded by
# triton build step via pip will fail to detect conda MKL. Once that issue is fixed,
# this can be removed.
#
# The correct numpy version also needs to be set here because conda claims that it
# causes inconsistent environment. Without this, conda will attempt to install the
# latest numpy version, which fails ASAN tests with the following import error: Numba
# needs NumPy 1.20 or less.
conda_reinstall cmake="${CMAKE_VERSION}"
# Note that we install numpy with pip as conda might not have the version we want
pip_install --force-reinstall numpy=="${NUMPY_VERSION}"
# Copy the wheel to /opt for multi stage docker builds
cp dist/*.whl /opt/triton
# Install the wheel for docker builds that don't use multi stage
pip_install dist/*.whl
# TODO: This is to make sure that the same cmake and numpy version from install conda
# script is used. Without this step, the newer cmake version (3.25.2) downloaded by
# triton build step via pip will fail to detect conda MKL. Once that issue is fixed,
# this can be removed.
#
# The correct numpy version also needs to be set here because conda claims that it
# causes inconsistent environment. Without this, conda will attempt to install the
# latest numpy version, which fails ASAN tests with the following import error: Numba
# needs NumPy 1.20 or less.
# Note that we install numpy with pip as conda might not have the version we want
if [ -n "${CMAKE_VERSION}" ]; then
pip_install "cmake==${CMAKE_VERSION}"
fi
if [ -n "${NUMPY_VERSION}" ]; then
pip_install "numpy==${NUMPY_VERSION}"
fi
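A hedged sketch of the expected invocation inside the CI image build; the variables shown are the ones this script actually reads:
TRITON=1 MAX_JOBS=$(nproc) bash ./install_triton.sh
# The built wheel is copied to /opt/triton for multi-stage Docker builds and is
# also pip-installed into the current environment.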

View File

@ -1,24 +0,0 @@
#!/bin/bash
set -ex
[ -n "${VULKAN_SDK_VERSION}" ]
retry () {
$* || (sleep 1 && $*) || (sleep 2 && $*) || (sleep 4 && $*) || (sleep 8 && $*)
}
_vulkansdk_dir=/var/lib/jenkins/vulkansdk
_tmp_vulkansdk_targz=/tmp/vulkansdk.tar.gz
curl \
--silent \
--show-error \
--location \
--fail \
--retry 3 \
--output "${_tmp_vulkansdk_targz}" "https://ossci-android.s3.amazonaws.com/vulkansdk-linux-x86_64-${VULKAN_SDK_VERSION}.tar.gz"
mkdir -p "${_vulkansdk_dir}"
tar -C "${_vulkansdk_dir}" -xzf "${_tmp_vulkansdk_targz}" --strip-components 1
rm -rf "${_tmp_vulkansdk_targz}"

View File

@ -26,7 +26,7 @@ function install_ubuntu() {
wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB \
| gpg --dearmor > /usr/share/keyrings/oneapi-archive-keyring.gpg.gpg
echo "deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg.gpg] \
https://apt.repos.intel.com/${XPU_REPO_NAME} all main" \
https://apt.repos.intel.com/oneapi all main" \
| tee /etc/apt/sources.list.d/oneAPI.list
# Update the packages list and repository index
@ -74,7 +74,7 @@ function install_rhel() {
tee > /etc/yum.repos.d/oneAPI.repo << EOF
[oneAPI]
name=Intel for Pytorch GPU dev repository
baseurl=https://yum.repos.intel.com/${XPU_REPO_NAME}
baseurl=https://yum.repos.intel.com/oneapi
enabled=1
gpgcheck=1
repo_gpgcheck=1
@ -118,7 +118,7 @@ function install_sles() {
https://repositories.intel.com/gpu/sles/${VERSION_SP}${XPU_DRIVER_VERSION}/unified/intel-gpu-${VERSION_SP}.repo
rpm --import https://repositories.intel.com/gpu/intel-graphics.key
# To add the online network package repository for the Intel Support Packages
zypper addrepo https://yum.repos.intel.com/${XPU_REPO_NAME} oneAPI
zypper addrepo https://yum.repos.intel.com/oneapi oneAPI
rpm --import https://yum.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB
# The xpu-smi packages
@ -141,10 +141,10 @@ if [[ "${XPU_DRIVER_TYPE,,}" == "rolling" ]]; then
XPU_DRIVER_VERSION=""
fi
XPU_REPO_NAME="intel-for-pytorch-gpu-dev"
XPU_PACKAGES="intel-for-pytorch-gpu-dev-0.5 intel-pti-dev-0.9"
if [[ "$XPU_VERSION" == "2025.0" ]]; then
XPU_REPO_NAME="oneapi"
# Default to Intel® oneAPI Deep Learning Essentials 2025.0
if [[ "$XPU_VERSION" == "2025.1" ]]; then
XPU_PACKAGES="intel-deep-learning-essentials-2025.1"
else
XPU_PACKAGES="intel-deep-learning-essentials-2025.0"
fi
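Illustrative effect of the version switch above; XPU_VERSION comes from the Dockerfile ENV shown elsewhere in this diff:
XPU_VERSION=2025.1 bash ./install_xpu.sh   # installs intel-deep-learning-essentials-2025.1
XPU_VERSION=2025.0 bash ./install_xpu.sh   # falls back to intel-deep-learning-essentials-2025.0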

View File

@ -49,6 +49,9 @@ RUN bash ./install_mkl.sh && rm install_mkl.sh
FROM cpu as cuda
ADD ./common/install_cuda.sh install_cuda.sh
ADD ./common/install_magma.sh install_magma.sh
COPY ./common/install_nccl.sh install_nccl.sh
COPY ./ci_commit_pins/nccl-cu* /ci_commit_pins/
COPY ./common/install_cusparselt.sh install_cusparselt.sh
ENV CUDA_HOME /usr/local/cuda
FROM cuda as cuda11.8
@ -72,6 +75,7 @@ RUN bash ./install_magma.sh 12.8
RUN ln -sf /usr/local/cuda-12.8 /usr/local/cuda
FROM cpu as rocm
ARG ROCM_VERSION
ARG PYTORCH_ROCM_ARCH
ENV PYTORCH_ROCM_ARCH ${PYTORCH_ROCM_ARCH}
ENV MKLROOT /opt/intel
@ -86,11 +90,11 @@ ADD ./common/install_rocm_magma.sh install_rocm_magma.sh
# gfortran and python needed for building magma from source for ROCm
RUN apt-get update -y && \
apt-get install gfortran -y && \
apt-get install python -y && \
apt-get install python3 python-is-python3 -y && \
apt-get clean
RUN bash ./install_rocm_drm.sh && rm install_rocm_drm.sh
RUN bash ./install_rocm_magma.sh && rm install_rocm_magma.sh
RUN bash ./install_rocm_magma.sh ${ROCM_VERSION} && rm install_rocm_magma.sh
FROM ${BASE_TARGET} as final
COPY --from=openssl /opt/openssl /opt/openssl

View File

@ -1,83 +1,63 @@
#!/usr/bin/env bash
# Script used only in CD pipeline
set -eou pipefail
set -eoux pipefail
image="$1"
shift
if [ -z "${image}" ]; then
echo "Usage: $0 IMAGE"
echo "Usage: $0 IMAGENAME:ARCHTAG"
exit 1
fi
DOCKER_IMAGE="pytorch/${image}"
TOPDIR=$(git rev-parse --show-toplevel)
GPU_ARCH_TYPE=${GPU_ARCH_TYPE:-cpu}
GPU_ARCH_VERSION=${GPU_ARCH_VERSION:-}
WITH_PUSH=${WITH_PUSH:-}
DOCKER=${DOCKER:-docker}
case ${GPU_ARCH_TYPE} in
# Go from imagename:tag to tag
DOCKER_TAG_PREFIX=$(echo "${image}" | awk -F':' '{print $2}')
GPU_ARCH_VERSION=""
if [[ "${DOCKER_TAG_PREFIX}" == cuda* ]]; then
# extract cuda version from image name. e.g. manylinux2_28-builder:cuda12.8 returns 12.8
GPU_ARCH_VERSION=$(echo "${DOCKER_TAG_PREFIX}" | awk -F'cuda' '{print $2}')
elif [[ "${DOCKER_TAG_PREFIX}" == rocm* ]]; then
# extract rocm version from image name. e.g. manylinux2_28-builder:rocm6.2.4 returns 6.2.4
GPU_ARCH_VERSION=$(echo "${DOCKER_TAG_PREFIX}" | awk -F'rocm' '{print $2}')
fi
case ${DOCKER_TAG_PREFIX} in
cpu)
BASE_TARGET=cpu
DOCKER_TAG=cpu
GPU_IMAGE=ubuntu:20.04
DOCKER_GPU_BUILD_ARG=""
;;
cuda)
cuda*)
BASE_TARGET=cuda${GPU_ARCH_VERSION}
DOCKER_TAG=cuda${GPU_ARCH_VERSION}
GPU_IMAGE=ubuntu:20.04
DOCKER_GPU_BUILD_ARG=""
;;
rocm)
rocm*)
BASE_TARGET=rocm
DOCKER_TAG=rocm${GPU_ARCH_VERSION}
GPU_IMAGE=rocm/dev-ubuntu-20.04:${GPU_ARCH_VERSION}-complete
GPU_IMAGE=rocm/dev-ubuntu-22.04:${GPU_ARCH_VERSION}-complete
PYTORCH_ROCM_ARCH="gfx900;gfx906;gfx908;gfx90a;gfx942;gfx1030;gfx1100;gfx1101;gfx1102;gfx1200;gfx1201"
DOCKER_GPU_BUILD_ARG="--build-arg PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH}"
DOCKER_GPU_BUILD_ARG="--build-arg PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH} --build-arg ROCM_VERSION=${GPU_ARCH_VERSION}"
;;
*)
echo "ERROR: Unrecognized GPU_ARCH_TYPE: ${GPU_ARCH_TYPE}"
echo "ERROR: Unrecognized DOCKER_TAG_PREFIX: ${DOCKER_TAG_PREFIX}"
exit 1
;;
esac
tmp_tag=$(basename "$(mktemp -u)" | tr '[:upper:]' '[:lower:]')
(
set -x
DOCKER_BUILDKIT=1 ${DOCKER} build \
--target final \
${DOCKER_GPU_BUILD_ARG} \
--build-arg "GPU_IMAGE=${GPU_IMAGE}" \
--build-arg "BASE_TARGET=${BASE_TARGET}" \
-t "${DOCKER_IMAGE}" \
$@ \
-f "${TOPDIR}/.ci/docker/libtorch/Dockerfile" \
"${TOPDIR}/.ci/docker/"
)
GITHUB_REF=${GITHUB_REF:-$(git symbolic-ref -q HEAD || git describe --tags --exact-match)}
GIT_BRANCH_NAME=${GITHUB_REF##*/}
GIT_COMMIT_SHA=${GITHUB_SHA:-$(git rev-parse HEAD)}
DOCKER_IMAGE_BRANCH_TAG=${DOCKER_IMAGE}-${GIT_BRANCH_NAME}
DOCKER_IMAGE_SHA_TAG=${DOCKER_IMAGE}-${GIT_COMMIT_SHA}
if [[ "${WITH_PUSH}" == true ]]; then
(
set -x
${DOCKER} push "${DOCKER_IMAGE}"
if [[ -n ${GITHUB_REF} ]]; then
${DOCKER} tag ${DOCKER_IMAGE} ${DOCKER_IMAGE_BRANCH_TAG}
${DOCKER} tag ${DOCKER_IMAGE} ${DOCKER_IMAGE_SHA_TAG}
${DOCKER} push "${DOCKER_IMAGE_BRANCH_TAG}"
${DOCKER} push "${DOCKER_IMAGE_SHA_TAG}"
fi
)
fi
DOCKER_BUILDKIT=1 ${DOCKER} build \
--target final \
${DOCKER_GPU_BUILD_ARG} \
--build-arg "GPU_IMAGE=${GPU_IMAGE}" \
--build-arg "BASE_TARGET=${BASE_TARGET}" \
-t "${tmp_tag}" \
$@ \
-f "${TOPDIR}/.ci/docker/libtorch/Dockerfile" \
"${TOPDIR}/.ci/docker/"

View File

@ -18,28 +18,31 @@ COPY ./common/install_user.sh install_user.sh
RUN bash ./install_user.sh && rm install_user.sh
# Install conda and other packages (e.g., numpy, pytest)
ARG ANACONDA_PYTHON_VERSION
ARG CONDA_CMAKE
ENV ANACONDA_PYTHON_VERSION=$ANACONDA_PYTHON_VERSION
ENV PATH /opt/conda/envs/py_$ANACONDA_PYTHON_VERSION/bin:/opt/conda/bin:$PATH
COPY requirements-ci.txt /opt/conda/requirements-ci.txt
COPY ./common/install_conda.sh install_conda.sh
COPY ./common/common_utils.sh common_utils.sh
COPY ./common/install_magma_conda.sh install_magma_conda.sh
RUN bash ./install_conda.sh && rm install_conda.sh install_magma_conda.sh common_utils.sh /opt/conda/requirements-ci.txt
ARG PYTHON_VERSION
ARG PIP_CMAKE
# Put venv into the env vars so users don't need to activate it
ENV PATH /var/lib/jenkins/ci_env/bin:$PATH
ENV VIRTUAL_ENV /var/lib/jenkins/ci_env
COPY requirements-ci.txt /opt/requirements-ci.txt
COPY ./common/install_python.sh install_python.sh
RUN bash ./install_python.sh && rm install_python.sh /opt/requirements-ci.txt
# Install cuda and cudnn
ARG CUDA_VERSION
COPY ./common/install_cuda.sh install_cuda.sh
RUN bash ./install_cuda.sh ${CUDA_VERSION} && rm install_cuda.sh
COPY ./common/install_nccl.sh install_nccl.sh
COPY ./ci_commit_pins/nccl-cu* /ci_commit_pins/
COPY ./common/install_cusparselt.sh install_cusparselt.sh
RUN bash ./install_cuda.sh ${CUDA_VERSION} && rm install_cuda.sh install_nccl.sh /ci_commit_pins/nccl-cu* install_cusparselt.sh
ENV DESIRED_CUDA ${CUDA_VERSION}
ENV PATH /usr/local/nvidia/bin:/usr/local/cuda/bin:$PATH
# Note that Docker build forbids copying files from outside the build context
COPY ./common/install_linter.sh install_linter.sh
COPY ./common/common_utils.sh common_utils.sh
RUN bash ./install_linter.sh
RUN rm install_linter.sh common_utils.sh
RUN rm install_linter.sh
RUN chown -R jenkins:jenkins /var/lib/jenkins/ci_env
USER jenkins
CMD ["bash"]

View File

@ -15,20 +15,17 @@ COPY ./common/install_user.sh install_user.sh
RUN bash ./install_user.sh && rm install_user.sh
# Install conda and other packages (e.g., numpy, pytest)
ARG ANACONDA_PYTHON_VERSION
ARG CONDA_CMAKE
ENV ANACONDA_PYTHON_VERSION=$ANACONDA_PYTHON_VERSION
ENV PATH /opt/conda/envs/py_$ANACONDA_PYTHON_VERSION/bin:/opt/conda/bin:$PATH
COPY requirements-ci.txt /opt/conda/requirements-ci.txt
COPY ./common/install_conda.sh install_conda.sh
COPY ./common/common_utils.sh common_utils.sh
RUN bash ./install_conda.sh && rm install_conda.sh common_utils.sh /opt/conda/requirements-ci.txt
ARG PYTHON_VERSION
ENV PATH /var/lib/jenkins/ci_env/bin:$PATH
ENV VIRTUAL_ENV /var/lib/jenkins/ci_env
COPY requirements-ci.txt /opt/requirements-ci.txt
COPY ./common/install_python.sh install_python.sh
RUN bash ./install_python.sh && rm install_python.sh /opt/requirements-ci.txt
# Note that Docker build forbids copying files from outside the build context
COPY ./common/install_linter.sh install_linter.sh
COPY ./common/common_utils.sh common_utils.sh
RUN bash ./install_linter.sh
RUN rm install_linter.sh common_utils.sh
RUN rm install_linter.sh
USER jenkins
CMD ["bash"]

View File

@ -1,200 +0,0 @@
# syntax = docker/dockerfile:experimental
ARG ROCM_VERSION=3.7
ARG BASE_CUDA_VERSION=11.8
ARG GPU_IMAGE=centos:7
FROM centos:7 as base
ENV LC_ALL en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LANGUAGE en_US.UTF-8
ARG DEVTOOLSET_VERSION=9
# Note: This patch is required since CentOS has reached EOL,
# otherwise any yum install step will fail
RUN sed -i s/mirror.centos.org/vault.centos.org/g /etc/yum.repos.d/*.repo
RUN sed -i s/^#.*baseurl=http/baseurl=http/g /etc/yum.repos.d/*.repo
RUN sed -i s/^mirrorlist=http/#mirrorlist=http/g /etc/yum.repos.d/*.repo
RUN yum install -y wget curl perl util-linux xz bzip2 git patch which perl zlib-devel
# Just add everything as a safe.directory for git since these will be used in multiple places with git
RUN git config --global --add safe.directory '*'
RUN yum install -y yum-utils centos-release-scl
RUN yum-config-manager --enable rhel-server-rhscl-7-rpms
# Note: After running yum-config-manager --enable rhel-server-rhscl-7-rpms
# patch is required once again. Somehow this step adds mirror.centos.org
RUN sed -i s/mirror.centos.org/vault.centos.org/g /etc/yum.repos.d/*.repo
RUN sed -i s/^#.*baseurl=http/baseurl=http/g /etc/yum.repos.d/*.repo
RUN sed -i s/^mirrorlist=http/#mirrorlist=http/g /etc/yum.repos.d/*.repo
RUN yum install -y devtoolset-${DEVTOOLSET_VERSION}-gcc devtoolset-${DEVTOOLSET_VERSION}-gcc-c++ devtoolset-${DEVTOOLSET_VERSION}-gcc-gfortran devtoolset-${DEVTOOLSET_VERSION}-binutils
ENV PATH=/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/bin:$PATH
ENV LD_LIBRARY_PATH=/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/lib64:/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/lib:$LD_LIBRARY_PATH
RUN yum --enablerepo=extras install -y epel-release
# cmake-3.18.4 from pip
RUN yum install -y python3-pip && \
python3 -mpip install cmake==3.18.4 && \
ln -s /usr/local/bin/cmake /usr/bin/cmake
RUN yum install -y autoconf aclocal automake make sudo
FROM base as openssl
# Install openssl (this must precede `build python` step)
# (In order to have a proper SSL module, Python is compiled
# against a recent openssl [see env vars above], which is linked
# statically. We delete openssl afterwards.)
ADD ./common/install_openssl.sh install_openssl.sh
RUN bash ./install_openssl.sh && rm install_openssl.sh
# EPEL for cmake
FROM base as patchelf
# Install patchelf
ADD ./common/install_patchelf.sh install_patchelf.sh
RUN bash ./install_patchelf.sh && rm install_patchelf.sh
RUN cp $(which patchelf) /patchelf
FROM patchelf as python
# build python
COPY manywheel/build_scripts /build_scripts
ADD ./common/install_cpython.sh /build_scripts/install_cpython.sh
RUN bash build_scripts/build.sh && rm -r build_scripts
FROM base as cuda
ARG BASE_CUDA_VERSION=10.2
# Install CUDA
ADD ./common/install_cuda.sh install_cuda.sh
RUN bash ./install_cuda.sh ${BASE_CUDA_VERSION} && rm install_cuda.sh
FROM base as intel
# MKL
ADD ./common/install_mkl.sh install_mkl.sh
RUN bash ./install_mkl.sh && rm install_mkl.sh
FROM base as magma
ARG BASE_CUDA_VERSION=10.2
# Install magma
ADD ./common/install_magma.sh install_magma.sh
RUN bash ./install_magma.sh ${BASE_CUDA_VERSION} && rm install_magma.sh
FROM base as jni
# Install java jni header
ADD ./common/install_jni.sh install_jni.sh
ADD ./java/jni.h jni.h
RUN bash ./install_jni.sh && rm install_jni.sh
FROM base as libpng
# Install libpng
ADD ./common/install_libpng.sh install_libpng.sh
RUN bash ./install_libpng.sh && rm install_libpng.sh
FROM ${GPU_IMAGE} as common
RUN sed -i s/mirror.centos.org/vault.centos.org/g /etc/yum.repos.d/*.repo
RUN sed -i s/^#.*baseurl=http/baseurl=http/g /etc/yum.repos.d/*.repo
RUN sed -i s/^mirrorlist=http/#mirrorlist=http/g /etc/yum.repos.d/*.repo
ENV LC_ALL en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LANGUAGE en_US.UTF-8
RUN yum install -y \
aclocal \
autoconf \
automake \
bison \
bzip2 \
curl \
diffutils \
file \
git \
make \
patch \
perl \
unzip \
util-linux \
wget \
which \
xz \
yasm
RUN yum install -y \
https://repo.ius.io/ius-release-el7.rpm \
https://ossci-linux.s3.amazonaws.com/epel-release-7-14.noarch.rpm
RUN yum swap -y git git236-core
# git236+ would refuse to run git commands in repos owned by other users
# Which causes version check to fail, as pytorch repo is bind-mounted into the image
# Override this behaviour by treating every folder as safe
# For more details see https://github.com/pytorch/pytorch/issues/78659#issuecomment-1144107327
RUN git config --global --add safe.directory "*"
ENV SSL_CERT_FILE=/opt/_internal/certs.pem
# Install LLVM version
COPY --from=openssl /opt/openssl /opt/openssl
COPY --from=python /opt/python /opt/python
COPY --from=python /opt/_internal /opt/_internal
COPY --from=python /opt/python/cp39-cp39/bin/auditwheel /usr/local/bin/auditwheel
COPY --from=intel /opt/intel /opt/intel
COPY --from=patchelf /usr/local/bin/patchelf /usr/local/bin/patchelf
COPY --from=jni /usr/local/include/jni.h /usr/local/include/jni.h
COPY --from=libpng /usr/local/bin/png* /usr/local/bin/
COPY --from=libpng /usr/local/bin/libpng* /usr/local/bin/
COPY --from=libpng /usr/local/include/png* /usr/local/include/
COPY --from=libpng /usr/local/include/libpng* /usr/local/include/
COPY --from=libpng /usr/local/lib/libpng* /usr/local/lib/
COPY --from=libpng /usr/local/lib/pkgconfig /usr/local/lib/pkgconfig
FROM common as cpu_final
ARG BASE_CUDA_VERSION=10.1
ARG DEVTOOLSET_VERSION=9
# Install Anaconda
ADD ./common/install_conda_docker.sh install_conda.sh
RUN bash ./install_conda.sh && rm install_conda.sh
ENV PATH /opt/conda/bin:$PATH
RUN sed -i s/mirror.centos.org/vault.centos.org/g /etc/yum.repos.d/*.repo
RUN sed -i s/^#.*baseurl=http/baseurl=http/g /etc/yum.repos.d/*.repo
RUN sed -i s/^mirrorlist=http/#mirrorlist=http/g /etc/yum.repos.d/*.repo
RUN yum install -y yum-utils centos-release-scl
RUN yum-config-manager --enable rhel-server-rhscl-7-rpms
RUN sed -i s/mirror.centos.org/vault.centos.org/g /etc/yum.repos.d/*.repo
RUN sed -i s/^#.*baseurl=http/baseurl=http/g /etc/yum.repos.d/*.repo
RUN sed -i s/^mirrorlist=http/#mirrorlist=http/g /etc/yum.repos.d/*.repo
RUN yum install -y devtoolset-${DEVTOOLSET_VERSION}-gcc devtoolset-${DEVTOOLSET_VERSION}-gcc-c++ devtoolset-${DEVTOOLSET_VERSION}-gcc-gfortran devtoolset-${DEVTOOLSET_VERSION}-binutils
ENV PATH=/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/bin:$PATH
ENV LD_LIBRARY_PATH=/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/lib64:/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/lib:$LD_LIBRARY_PATH
# cmake is already installed inside the rocm base image, so remove if present
RUN rpm -e cmake || true
# cmake-3.18.4 from pip
RUN yum install -y python3-pip && \
python3 -mpip install cmake==3.18.4 && \
ln -s /usr/local/bin/cmake /usr/bin/cmake
# ninja
RUN yum install -y ninja-build
FROM cpu_final as cuda_final
RUN rm -rf /usr/local/cuda-${BASE_CUDA_VERSION}
COPY --from=cuda /usr/local/cuda-${BASE_CUDA_VERSION} /usr/local/cuda-${BASE_CUDA_VERSION}
COPY --from=magma /usr/local/cuda-${BASE_CUDA_VERSION} /usr/local/cuda-${BASE_CUDA_VERSION}
RUN ln -sf /usr/local/cuda-${BASE_CUDA_VERSION} /usr/local/cuda
ENV PATH=/usr/local/cuda/bin:$PATH
FROM cpu_final as rocm_final
ARG ROCM_VERSION=3.7
ARG PYTORCH_ROCM_ARCH
ENV PYTORCH_ROCM_ARCH ${PYTORCH_ROCM_ARCH}
# Adding ROCM_PATH env var so that LoadHip.cmake (even with logic updated for ROCm6.0)
# find HIP works for ROCm5.7. Not needed for ROCm6.0 and above.
# Remove below when ROCm5.7 is not in support matrix anymore.
ENV ROCM_PATH /opt/rocm
ENV MKLROOT /opt/intel
# No need to install ROCm as base docker image should have full ROCm install
#ADD ./common/install_rocm.sh install_rocm.sh
#RUN ROCM_VERSION=${ROCM_VERSION} bash ./install_rocm.sh && rm install_rocm.sh
ADD ./common/install_rocm_drm.sh install_rocm_drm.sh
RUN bash ./install_rocm_drm.sh && rm install_rocm_drm.sh
# cmake3 is needed for the MIOpen build
RUN ln -sf /usr/local/bin/cmake /usr/bin/cmake3
ADD ./common/install_rocm_magma.sh install_rocm_magma.sh
RUN bash ./install_rocm_magma.sh && rm install_rocm_magma.sh
ADD ./common/install_miopen.sh install_miopen.sh
RUN bash ./install_miopen.sh ${ROCM_VERSION} && rm install_miopen.sh

View File

@ -7,8 +7,8 @@ ENV LC_ALL en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LANGUAGE en_US.UTF-8
ARG DEVTOOLSET_VERSION=11
RUN yum install -y sudo wget curl perl util-linux xz bzip2 git patch which perl zlib-devel yum-utils gcc-toolset-${DEVTOOLSET_VERSION}-toolchain
ARG DEVTOOLSET_VERSION=13
RUN yum install -y sudo wget curl perl util-linux xz bzip2 git patch which perl zlib-devel yum-utils gcc-toolset-${DEVTOOLSET_VERSION}-gcc gcc-toolset-${DEVTOOLSET_VERSION}-gcc-c++ gcc-toolset-${DEVTOOLSET_VERSION}-gcc-gfortran gcc-toolset-${DEVTOOLSET_VERSION}-gdb
ENV PATH=/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/bin:$PATH
ENV LD_LIBRARY_PATH=/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/lib64:/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/lib:$LD_LIBRARY_PATH
@ -33,10 +33,13 @@ RUN rm -rf /opt/python/cp33-cp33m /opt/_internal/cpython-3.3.6
RUN rm -rf /opt/python/cp34-cp34m /opt/_internal/cpython-3.4.6
FROM base as cuda
ARG BASE_CUDA_VERSION=11.8
ARG BASE_CUDA_VERSION=12.6
# Install CUDA
ADD ./common/install_cuda.sh install_cuda.sh
RUN bash ./install_cuda.sh ${BASE_CUDA_VERSION} && rm install_cuda.sh
COPY ./common/install_nccl.sh install_nccl.sh
COPY ./ci_commit_pins/nccl-cu* /ci_commit_pins/
COPY ./common/install_cusparselt.sh install_cusparselt.sh
RUN bash ./install_cuda.sh ${BASE_CUDA_VERSION} && rm install_cuda.sh install_nccl.sh ci_commit_pins/nccl-cu* install_cusparselt.sh
FROM base as intel
# MKL
@ -44,7 +47,7 @@ ADD ./common/install_mkl.sh install_mkl.sh
RUN bash ./install_mkl.sh && rm install_mkl.sh
FROM base as magma
ARG BASE_CUDA_VERSION=10.2
ARG BASE_CUDA_VERSION=12.6
# Install magma
ADD ./common/install_magma.sh install_magma.sh
RUN bash ./install_magma.sh ${BASE_CUDA_VERSION} && rm install_magma.sh
@ -61,7 +64,7 @@ ADD ./common/install_libpng.sh install_libpng.sh
RUN bash ./install_libpng.sh && rm install_libpng.sh
FROM ${GPU_IMAGE} as common
ARG DEVTOOLSET_VERSION=11
ARG DEVTOOLSET_VERSION=13
ENV LC_ALL en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LANGUAGE en_US.UTF-8
@ -84,13 +87,12 @@ RUN yum install -y \
wget \
which \
xz \
gcc-toolset-${DEVTOOLSET_VERSION}-toolchain \
glibc-langpack-en
RUN yum install -y \
https://repo.ius.io/ius-release-el7.rpm \
https://ossci-linux.s3.amazonaws.com/epel-release-7-14.noarch.rpm
glibc-langpack-en \
gcc-toolset-${DEVTOOLSET_VERSION}-gcc \
gcc-toolset-${DEVTOOLSET_VERSION}-gcc-c++ \
gcc-toolset-${DEVTOOLSET_VERSION}-gcc-gfortran \
gcc-toolset-${DEVTOOLSET_VERSION}-gdb
RUN yum swap -y git git236-core
# git236+ would refuse to run git commands in repos owned by other users
# Which causes version check to fail, as pytorch repo is bind-mounted into the image
# Override this behaviour by treating every folder as safe
@ -114,8 +116,8 @@ COPY --from=libpng /usr/local/lib/pkgconfig /usr/local/
COPY --from=jni /usr/local/include/jni.h /usr/local/include/jni.h
FROM common as cpu_final
ARG BASE_CUDA_VERSION=11.8
ARG DEVTOOLSET_VERSION=11
ARG BASE_CUDA_VERSION=12.6
ARG DEVTOOLSET_VERSION=13
# Install Anaconda
ADD ./common/install_conda_docker.sh install_conda.sh
RUN bash ./install_conda.sh && rm install_conda.sh
@ -154,11 +156,14 @@ ENV ROCM_PATH /opt/rocm
# and avoid 3.21.0 cmake+ninja issues with ninja inserting "-Wl,--no-as-needed" in LINK_FLAGS for static linker
RUN python3 -m pip install --upgrade pip && \
python3 -mpip install cmake==3.28.4
# replace the libdrm in /opt/amdgpu with custom amdgpu.ids lookup path
ADD ./common/install_rocm_drm.sh install_rocm_drm.sh
RUN bash ./install_rocm_drm.sh && rm install_rocm_drm.sh
# ROCm 6.4 rocm-smi depends on system drm.h header
RUN yum install -y libdrm-devel
ENV MKLROOT /opt/intel
ADD ./common/install_rocm_magma.sh install_rocm_magma.sh
RUN bash ./install_rocm_magma.sh && rm install_rocm_magma.sh
RUN bash ./install_rocm_magma.sh ${ROCM_VERSION} && rm install_rocm_magma.sh
ADD ./common/install_miopen.sh install_miopen.sh
RUN bash ./install_miopen.sh ${ROCM_VERSION} && rm install_miopen.sh
@ -169,6 +174,6 @@ ENV XPU_DRIVER_TYPE ROLLING
RUN python3 -m pip install --upgrade pip && \
python3 -mpip install cmake==3.28.4
ADD ./common/install_xpu.sh install_xpu.sh
ENV XPU_VERSION 2025.0
ENV XPU_VERSION 2025.1
RUN bash ./install_xpu.sh && rm install_xpu.sh
RUN pushd /opt/_internal && tar -xJf static-libs-for-embedding-only.tar.xz && popd

View File

@ -1,7 +1,6 @@
FROM quay.io/pypa/manylinux_2_28_aarch64 as base
# Graviton needs GCC 10 or above for the build. GCC12 is the default version in almalinux-8.
ARG GCCTOOLSET_VERSION=11
ARG GCCTOOLSET_VERSION=13
# Language variables
ENV LC_ALL=en_US.UTF-8
@ -36,7 +35,10 @@ RUN yum install -y \
yasm \
zstd \
sudo \
gcc-toolset-${GCCTOOLSET_VERSION}-toolchain
gcc-toolset-${GCCTOOLSET_VERSION}-gcc \
gcc-toolset-${GCCTOOLSET_VERSION}-gcc-c++ \
gcc-toolset-${GCCTOOLSET_VERSION}-gcc-gfortran \
gcc-toolset-${GCCTOOLSET_VERSION}-gdb
# (optional) Install non-default Ninja version
ARG NINJA_VERSION

View File

@ -1,94 +0,0 @@
FROM quay.io/pypa/manylinux2014_aarch64 as base
# Graviton needs GCC 10 for the build
ARG DEVTOOLSET_VERSION=10
# Language variables
ENV LC_ALL=en_US.UTF-8
ENV LANG=en_US.UTF-8
ENV LANGUAGE=en_US.UTF-8
# Installed needed OS packages. This is to support all
# the binary builds (torch, vision, audio, text, data)
RUN yum -y install epel-release
RUN yum -y update
RUN yum install -y \
autoconf \
automake \
bison \
bzip2 \
curl \
diffutils \
file \
git \
make \
patch \
perl \
unzip \
util-linux \
wget \
which \
xz \
yasm \
less \
zstd \
libgomp \
sudo \
devtoolset-${DEVTOOLSET_VERSION}-gcc \
devtoolset-${DEVTOOLSET_VERSION}-gcc-c++ \
devtoolset-${DEVTOOLSET_VERSION}-gcc-gfortran \
devtoolset-${DEVTOOLSET_VERSION}-binutils
# Ensure the expected devtoolset is used
ENV PATH=/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/bin:$PATH
ENV LD_LIBRARY_PATH=/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/lib64:/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/lib:$LD_LIBRARY_PATH
# git236+ would refuse to run git commands in repos owned by other users
# Which causes version check to fail, as pytorch repo is bind-mounted into the image
# Override this behaviour by treating every folder as safe
# For more details see https://github.com/pytorch/pytorch/issues/78659#issuecomment-1144107327
RUN git config --global --add safe.directory "*"
###############################################################################
# libglfortran.a hack
#
# libgfortran.a from quay.io/pypa/manylinux2014_aarch64 is not compiled with -fPIC.
# This causes __stack_chk_guard@@GLIBC_2.17 on pytorch build. To solve, get
# ubuntu's libgfortran.a which is compiled with -fPIC
# NOTE: Need a better way to get this library as Ubuntu's package can be removed by the vendor, or changed
###############################################################################
RUN cd ~/ \
&& curl -L -o ~/libgfortran-10-dev.deb http://ports.ubuntu.com/ubuntu-ports/pool/universe/g/gcc-10/libgfortran-10-dev_10.5.0-4ubuntu2_arm64.deb \
&& ar x ~/libgfortran-10-dev.deb \
&& tar --use-compress-program=unzstd -xvf data.tar.zst -C ~/ \
&& cp -f ~/usr/lib/gcc/aarch64-linux-gnu/10/libgfortran.a /opt/rh/devtoolset-10/root/usr/lib/gcc/aarch64-redhat-linux/10/
# install cmake
RUN yum install -y cmake3 && \
ln -s /usr/bin/cmake3 /usr/bin/cmake
FROM base as openssl
# Install openssl (this must precede `build python` step)
# (In order to have a proper SSL module, Python is compiled
# against a recent openssl [see env vars above], which is linked
# statically. We delete openssl afterwards.)
ADD ./common/install_openssl.sh install_openssl.sh
RUN bash ./install_openssl.sh && rm install_openssl.sh
ENV SSL_CERT_FILE=/opt/_internal/certs.pem
FROM base as openblas
# Install openblas
ADD ./common/install_openblas.sh install_openblas.sh
RUN bash ./install_openblas.sh && rm install_openblas.sh
FROM openssl as final
# remove unnecessary python versions
RUN rm -rf /opt/python/cp26-cp26m /opt/_internal/cpython-2.6.9-ucs2
RUN rm -rf /opt/python/cp26-cp26mu /opt/_internal/cpython-2.6.9-ucs4
RUN rm -rf /opt/python/cp33-cp33m /opt/_internal/cpython-3.3.6
RUN rm -rf /opt/python/cp34-cp34m /opt/_internal/cpython-3.4.6
COPY --from=openblas /opt/OpenBLAS/ /opt/OpenBLAS/
ENV LD_LIBRARY_PATH=/opt/OpenBLAS/lib:$LD_LIBRARY_PATH

View File

@ -1,7 +1,7 @@
FROM quay.io/pypa/manylinux_2_28_aarch64 as base
# Cuda ARM build needs gcc 11
ARG DEVTOOLSET_VERSION=11
ARG DEVTOOLSET_VERSION=13
# Language variables
ENV LC_ALL=en_US.UTF-8
@ -34,7 +34,10 @@ RUN yum install -y \
zstd \
libgomp \
sudo \
gcc-toolset-${DEVTOOLSET_VERSION}-toolchain
gcc-toolset-${DEVTOOLSET_VERSION}-gcc \
gcc-toolset-${DEVTOOLSET_VERSION}-gcc-c++ \
gcc-toolset-${DEVTOOLSET_VERSION}-gcc-gfortran \
gcc-toolset-${DEVTOOLSET_VERSION}-gdb
# Ensure the expected devtoolset is used
ENV PATH=/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/bin:$PATH
@ -66,8 +69,11 @@ RUN rm -rf /opt/python/cp34-cp34m /opt/_internal/cpython-3.4.6
FROM base as cuda
ARG BASE_CUDA_VERSION
# Install CUDA
ADD ./common/install_cuda_aarch64.sh install_cuda_aarch64.sh
RUN bash ./install_cuda_aarch64.sh ${BASE_CUDA_VERSION} && rm install_cuda_aarch64.sh
ADD ./common/install_cuda.sh install_cuda.sh
COPY ./common/install_nccl.sh install_nccl.sh
COPY ./common/install_cusparselt.sh install_cusparselt.sh
COPY ./ci_commit_pins/nccl-cu* /ci_commit_pins/
RUN bash ./install_cuda.sh ${BASE_CUDA_VERSION} && rm install_cuda.sh install_nccl.sh ci_commit_pins/nccl-cu* install_cusparselt.sh
FROM base as magma
ARG BASE_CUDA_VERSION

View File

@ -5,7 +5,9 @@ ENV LC_ALL=C.UTF-8
ENV LANG=C.UTF-8
ENV LANGUAGE=C.UTF-8
ARG DEVTOOLSET_VERSION=13
# there is a bugfix in gcc >= 14 for precompiled headers and s390x vectorization interaction.
# with earlier gcc versions test/inductor/test_cpu_cpp_wrapper.py will fail.
ARG DEVTOOLSET_VERSION=14
# Installed needed OS packages. This is to support all
# the binary builds (torch, vision, audio, text, data)
RUN yum -y install epel-release
@ -42,6 +44,7 @@ RUN yum install -y \
llvm-devel \
libzstd-devel \
python3.12-devel \
python3.12-test \
python3.12-setuptools \
python3.12-pip \
python3-virtualenv \
@ -57,7 +60,8 @@ RUN yum install -y \
libxslt-devel \
libxml2-devel \
openssl-devel \
valgrind
valgrind \
ninja-build
ENV PATH=/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/bin:$PATH
ENV LD_LIBRARY_PATH=/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/lib64:/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/lib:$LD_LIBRARY_PATH
@ -101,24 +105,33 @@ CMD ["/bin/bash"]
# install test dependencies:
# - grpcio requires system openssl, bundled crypto fails to build
# - ml_dtypes 0.4.0 requires some fixes provided in later commits to build
RUN dnf install -y \
protobuf-devel \
protobuf-c-devel \
protobuf-lite-devel \
wget \
patch
hdf5-devel \
python3-h5py \
git
RUN env GRPC_PYTHON_BUILD_SYSTEM_OPENSSL=True pip3 install grpcio==1.65.4
RUN cd ~ && \
git clone https://github.com/jax-ml/ml_dtypes && \
cd ml_dtypes && \
git checkout v0.4.0 && \
RUN env GRPC_PYTHON_BUILD_SYSTEM_OPENSSL=True pip3 install grpcio
# cmake-3.28.0 from pip for onnxruntime
RUN python3 -mpip install cmake==3.28.0
# build onnxruntime 1.21.0 from source.
# it is not possible to build it from source using pip,
# so just build it from the upstream repository.
# h5py is a dependency of onnxruntime_training.
# h5py==3.11.0 builds with hdf5-devel 1.10.5 from the repository.
# install newest flatbuffers version first:
# for some reason old version is getting pulled in otherwise.
# packaging package is required for onnxruntime wheel build.
RUN pip3 install flatbuffers && \
pip3 install h5py==3.11.0 && \
pip3 install packaging && \
git clone https://github.com/microsoft/onnxruntime && \
cd onnxruntime && git checkout v1.21.0 && \
git submodule update --init --recursive && \
wget https://github.com/jax-ml/ml_dtypes/commit/b969f76914d6b30676721bc92bf0f6021a0d1321.patch && \
wget https://github.com/jax-ml/ml_dtypes/commit/d4e6d035ecda073eab8bcf60f4eef572ee7087e6.patch && \
patch -p1 < b969f76914d6b30676721bc92bf0f6021a0d1321.patch && \
patch -p1 < d4e6d035ecda073eab8bcf60f4eef572ee7087e6.patch && \
python3 setup.py bdist_wheel && \
pip3 install dist/*.whl && \
rm -rf ml_dtypes
./build.sh --config Release --parallel 0 --enable_pybind \
--build_wheel --enable_training --enable_training_apis \
--enable_training_ops --skip_tests --allow_running_as_root \
--compile_no_warning_as_error && \
pip3 install ./build/Linux/Release/dist/onnxruntime_training-*.whl && \
cd .. && /bin/rm -rf ./onnxruntime
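A small, hedged smoke test that could follow the build above to confirm the wheel imports (illustrative only; not part of the Dockerfile):
python3 -c "import onnxruntime; print(onnxruntime.__version__)"   # expected to print 1.21.0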

View File

@ -1,7 +1,7 @@
#!/usr/bin/env bash
# Script used only in CD pipeline
set -eou pipefail
set -exou pipefail
TOPDIR=$(git rev-parse --show-toplevel)
@ -9,152 +9,108 @@ image="$1"
shift
if [ -z "${image}" ]; then
echo "Usage: $0 IMAGE"
echo "Usage: $0 IMAGE:ARCHTAG"
exit 1
fi
DOCKER_IMAGE="pytorch/${image}"
# Go from imagename:tag to tag
DOCKER_TAG_PREFIX=$(echo "${image}" | awk -F':' '{print $2}')
DOCKER_REGISTRY="${DOCKER_REGISTRY:-docker.io}"
GPU_ARCH_VERSION=""
if [[ "${DOCKER_TAG_PREFIX}" == cuda* ]]; then
# extract cuda version from image name. e.g. manylinux2_28-builder:cuda12.8 returns 12.8
GPU_ARCH_VERSION=$(echo "${DOCKER_TAG_PREFIX}" | awk -F'cuda' '{print $2}')
elif [[ "${DOCKER_TAG_PREFIX}" == rocm* ]]; then
# extract rocm version from image name. e.g. manylinux2_28-builder:rocm6.2.4 returns 6.2.4
GPU_ARCH_VERSION=$(echo "${DOCKER_TAG_PREFIX}" | awk -F'rocm' '{print $2}')
fi
GPU_ARCH_TYPE=${GPU_ARCH_TYPE:-cpu}
GPU_ARCH_VERSION=${GPU_ARCH_VERSION:-}
MANY_LINUX_VERSION=${MANY_LINUX_VERSION:-}
DOCKERFILE_SUFFIX=${DOCKERFILE_SUFFIX:-}
WITH_PUSH=${WITH_PUSH:-}
case ${GPU_ARCH_TYPE} in
cpu)
case ${image} in
manylinux2_28-builder:cpu)
TARGET=cpu_final
DOCKER_TAG=cpu
GPU_IMAGE=centos:7
DOCKER_GPU_BUILD_ARG=" --build-arg DEVTOOLSET_VERSION=9"
;;
cpu-manylinux_2_28)
TARGET=cpu_final
DOCKER_TAG=cpu
GPU_IMAGE=amd64/almalinux:8
DOCKER_GPU_BUILD_ARG=" --build-arg DEVTOOLSET_VERSION=11"
DOCKER_GPU_BUILD_ARG=" --build-arg DEVTOOLSET_VERSION=13"
MANY_LINUX_VERSION="2_28"
;;
cpu-aarch64)
manylinux2_28_aarch64-builder:cpu-aarch64)
TARGET=final
DOCKER_TAG=cpu-aarch64
GPU_IMAGE=arm64v8/centos:7
DOCKER_GPU_BUILD_ARG=" --build-arg DEVTOOLSET_VERSION=10"
MANY_LINUX_VERSION="aarch64"
;;
cpu-aarch64-2_28)
TARGET=final
DOCKER_TAG=cpu-aarch64
GPU_IMAGE=arm64v8/almalinux:8
DOCKER_GPU_BUILD_ARG=" --build-arg DEVTOOLSET_VERSION=11 --build-arg NINJA_VERSION=1.12.1"
DOCKER_GPU_BUILD_ARG=" --build-arg DEVTOOLSET_VERSION=13 --build-arg NINJA_VERSION=1.12.1"
MANY_LINUX_VERSION="2_28_aarch64"
;;
cpu-cxx11-abi)
manylinuxcxx11-abi-builder:cpu-cxx11-abi)
TARGET=final
DOCKER_TAG=cpu-cxx11-abi
GPU_IMAGE=""
DOCKER_GPU_BUILD_ARG=" --build-arg DEVTOOLSET_VERSION=9"
MANY_LINUX_VERSION="cxx11-abi"
;;
cpu-s390x)
manylinuxs390x-builder:cpu-s390x)
TARGET=final
DOCKER_TAG=cpu-s390x
GPU_IMAGE=s390x/almalinux:8
DOCKER_GPU_BUILD_ARG=""
MANY_LINUX_VERSION="s390x"
;;
cuda)
manylinux2_28-builder:cuda11*)
TARGET=cuda_final
DOCKER_TAG=cuda${GPU_ARCH_VERSION}
# Keep this up to date with the minimum version of CUDA we currently support
GPU_IMAGE=centos:7
DOCKER_GPU_BUILD_ARG="--build-arg BASE_CUDA_VERSION=${GPU_ARCH_VERSION} --build-arg DEVTOOLSET_VERSION=9"
;;
cuda-manylinux_2_28)
TARGET=cuda_final
DOCKER_TAG=cuda${GPU_ARCH_VERSION}
GPU_IMAGE=amd64/almalinux:8
DOCKER_GPU_BUILD_ARG="--build-arg BASE_CUDA_VERSION=${GPU_ARCH_VERSION} --build-arg DEVTOOLSET_VERSION=11"
MANY_LINUX_VERSION="2_28"
;;
cuda-aarch64)
manylinux2_28-builder:cuda12*)
TARGET=cuda_final
DOCKER_TAG=cuda${GPU_ARCH_VERSION}
GPU_IMAGE=arm64v8/centos:7
DOCKER_GPU_BUILD_ARG="--build-arg BASE_CUDA_VERSION=${GPU_ARCH_VERSION} --build-arg DEVTOOLSET_VERSION=11"
GPU_IMAGE=amd64/almalinux:8
DOCKER_GPU_BUILD_ARG="--build-arg BASE_CUDA_VERSION=${GPU_ARCH_VERSION} --build-arg DEVTOOLSET_VERSION=13"
MANY_LINUX_VERSION="2_28"
;;
manylinuxaarch64-builder:cuda*)
TARGET=cuda_final
GPU_IMAGE=amd64/almalinux:8
DOCKER_GPU_BUILD_ARG="--build-arg BASE_CUDA_VERSION=${GPU_ARCH_VERSION} --build-arg DEVTOOLSET_VERSION=13"
MANY_LINUX_VERSION="aarch64"
DOCKERFILE_SUFFIX="_cuda_aarch64"
;;
rocm|rocm-manylinux_2_28)
manylinux2_28-builder:rocm*)
TARGET=rocm_final
DOCKER_TAG=rocm${GPU_ARCH_VERSION}
GPU_IMAGE=rocm/dev-centos-7:${GPU_ARCH_VERSION}-complete
DEVTOOLSET_VERSION="9"
if [ ${GPU_ARCH_TYPE} == "rocm-manylinux_2_28" ]; then
MANY_LINUX_VERSION="2_28"
DEVTOOLSET_VERSION="11"
GPU_IMAGE=rocm/dev-almalinux-8:${GPU_ARCH_VERSION}-complete
fi
MANY_LINUX_VERSION="2_28"
DEVTOOLSET_VERSION="11"
GPU_IMAGE=rocm/dev-almalinux-8:${GPU_ARCH_VERSION}-complete
PYTORCH_ROCM_ARCH="gfx900;gfx906;gfx908;gfx90a;gfx942;gfx1030;gfx1100;gfx1101;gfx1102;gfx1200;gfx1201"
DOCKER_GPU_BUILD_ARG="--build-arg ROCM_VERSION=${GPU_ARCH_VERSION} --build-arg PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH} --build-arg DEVTOOLSET_VERSION=${DEVTOOLSET_VERSION}"
;;
xpu)
manylinux2_28-builder:xpu)
TARGET=xpu_final
DOCKER_TAG=xpu
GPU_IMAGE=amd64/almalinux:8
DOCKER_GPU_BUILD_ARG=" --build-arg DEVTOOLSET_VERSION=11"
MANY_LINUX_VERSION="2_28"
;;
*)
echo "ERROR: Unrecognized GPU_ARCH_TYPE: ${GPU_ARCH_TYPE}"
echo "ERROR: Unrecognized image name: ${image}"
exit 1
;;
esac
IMAGES=''
if [[ -n ${MANY_LINUX_VERSION} && -z ${DOCKERFILE_SUFFIX} ]]; then
DOCKERFILE_SUFFIX=_${MANY_LINUX_VERSION}
fi
(
set -x
# Only activate this if in CI
if [ "$(uname -m)" != "s390x" ] && [ -v CI ]; then
# TODO: Remove LimitNOFILE=1048576 patch once https://github.com/pytorch/test-infra/issues/5712
# is resolved. This patch is required in order to fix timing out of Docker build on Amazon Linux 2023.
sudo sed -i s/LimitNOFILE=infinity/LimitNOFILE=1048576/ /usr/lib/systemd/system/docker.service
sudo systemctl daemon-reload
sudo systemctl restart docker
fi
DOCKER_BUILDKIT=1 docker build \
${DOCKER_GPU_BUILD_ARG} \
--build-arg "GPU_IMAGE=${GPU_IMAGE}" \
--target "${TARGET}" \
-t "${DOCKER_IMAGE}" \
$@ \
-f "${TOPDIR}/.ci/docker/manywheel/Dockerfile${DOCKERFILE_SUFFIX}" \
"${TOPDIR}/.ci/docker/"
)
GITHUB_REF=${GITHUB_REF:-"dev"}
GIT_BRANCH_NAME=${GITHUB_REF##*/}
GIT_COMMIT_SHA=${GITHUB_SHA:-$(git rev-parse HEAD)}
DOCKER_IMAGE_BRANCH_TAG=${DOCKER_IMAGE}-${GIT_BRANCH_NAME}
DOCKER_IMAGE_SHA_TAG=${DOCKER_IMAGE}-${GIT_COMMIT_SHA}
if [[ "${WITH_PUSH}" == true ]]; then
(
set -x
docker push "${DOCKER_IMAGE}"
if [[ -n ${GITHUB_REF} ]]; then
docker tag ${DOCKER_IMAGE} ${DOCKER_IMAGE_BRANCH_TAG}
docker tag ${DOCKER_IMAGE} ${DOCKER_IMAGE_SHA_TAG}
docker push "${DOCKER_IMAGE_BRANCH_TAG}"
docker push "${DOCKER_IMAGE_SHA_TAG}"
fi
)
# Only activate this if in CI
if [ "$(uname -m)" != "s390x" ] && [ -v CI ]; then
# TODO: Remove LimitNOFILE=1048576 patch once https://github.com/pytorch/test-infra/issues/5712
# is resolved. This patch is required in order to fix timing out of Docker build on Amazon Linux 2023.
sudo sed -i s/LimitNOFILE=infinity/LimitNOFILE=1048576/ /usr/lib/systemd/system/docker.service
sudo systemctl daemon-reload
sudo systemctl restart docker
fi
tmp_tag=$(basename "$(mktemp -u)" | tr '[:upper:]' '[:lower:]')
DOCKER_BUILDKIT=1 docker build \
${DOCKER_GPU_BUILD_ARG} \
--build-arg "GPU_IMAGE=${GPU_IMAGE}" \
--target "${TARGET}" \
-t "${tmp_tag}" \
$@ \
-f "${TOPDIR}/.ci/docker/manywheel/Dockerfile${DOCKERFILE_SUFFIX}" \
"${TOPDIR}/.ci/docker/"

View File

@ -97,7 +97,7 @@ find /opt/_internal -type f -print0 \
| xargs -0 -n1 strip --strip-unneeded 2>/dev/null || true
# We do not need the Python test suites, or indeed the precompiled .pyc and
# .pyo files. Partially cribbed from:
# https://github.com/docker-library/python/blob/master/3.4/slim/Dockerfile
# https://github.com/docker-library/python/blob/master/3.4/slim/Dockerfile # @lint-ignore
find /opt/_internal \
\( -type d -a -name test -o -name tests \) \
-o \( -type f -a -name '*.pyc' -o -name '*.pyo' \) \

View File

@ -2,7 +2,7 @@
# Helper utilities for build
# Script used only in CD pipeline
OPENSSL_DOWNLOAD_URL=https://www.openssl.org/source/old/1.1.1/
OPENSSL_DOWNLOAD_URL=https://www.openssl.org/source/old/1.1.1/ # @lint-ignore
CURL_DOWNLOAD_URL=https://curl.se/download
AUTOCONF_DOWNLOAD_URL=https://ftp.gnu.org/gnu/autoconf

View File

@ -41,11 +41,14 @@ fbscribelogger==0.1.7
#Pinned versions: 0.1.6
#test that import:
flatbuffers==2.0
flatbuffers==2.0 ; platform_machine != "s390x"
#Description: cross platform serialization library
#Pinned versions: 2.0
#test that import:
flatbuffers ; platform_machine == "s390x"
#Description: cross platform serialization library; Newer version is required on s390x for new python version
hypothesis==5.35.1
# Pin hypothesis to avoid flakiness: https://github.com/pytorch/pytorch/issues/31136
#Description: advanced library for generating parametrized tests
@ -90,7 +93,7 @@ librosa>=0.6.2 ; python_version < "3.11"
#Pinned versions:
#test that import:
mypy==1.14.0
mypy==1.15.0
# Pin MyPy version because new errors are likely to appear with each release
#Description: linter
#Pinned versions: 1.14.0
@ -102,10 +105,10 @@ networkx==2.8.8
#Pinned versions: 2.8.8
#test that import: functorch
#ninja
#Description: build system. Note that it install from
#here breaks things so it is commented out
#Pinned versions: 1.10.0.post1
ninja==1.11.1.3
#Description: build system. Used in some tests. Used in build to generate build
#time tracing information
#Pinned versions: 1.11.1.3
#test that import: run_test.py, test_cpp_extensions_aot.py,test_determination.py
numba==0.49.0 ; python_version < "3.9"
@ -163,10 +166,10 @@ pillow==11.0.0
#Pinned versions: 10.3.0
#test that import:
protobuf==3.20.2
#Description: Googles data interchange format
#Pinned versions: 3.20.1
#test that import: test_tensorboard.py
protobuf==5.29.4
#Description: Google's data interchange format
#Pinned versions: 5.29.4
#test that import: test_tensorboard.py, test/onnx/*
psutil
#Description: information on running processes and system utilization
@ -334,12 +337,12 @@ sympy==1.13.3
#Pinned versions:
#test that import:
onnx==1.17.0
#Description: Required by mypy and test_public_bindings.py when checking torch.onnx._internal
onnx==1.18.0
#Description: Required by onnx tests, and mypy and test_public_bindings.py when checking torch.onnx._internal
#Pinned versions:
#test that import:
onnxscript==0.2.2
onnxscript==0.2.6
#Description: Required by mypy and test_public_bindings.py when checking torch.onnx._internal
#Pinned versions:
#test that import:
@ -353,7 +356,7 @@ parameterized==0.8.1
#Pinned versions: 1.24.0
#test that import: test_sac_estimator.py
pwlf==2.2.1 ; python_version >= "3.8"
pwlf==2.2.1
#Description: required for testing torch/distributed/_tools/sac_estimator.py
#Pinned versions: 2.2.1
#test that import: test_sac_estimator.py
@ -365,10 +368,9 @@ PyYAML
pyzstd
setuptools
ninja==1.11.1 ; platform_machine == "aarch64"
scons==4.5.2 ; platform_machine == "aarch64"
pulp==2.9.0 ; python_version >= "3.8"
pulp==2.9.0
#Description: required for testing ilp formulation under torch/distributed/_tools
#Pinned versions: 2.9.0
#test that import: test_sac_ilp.py
@ -377,3 +379,6 @@ dataclasses_json==0.6.7
#Description: required for data pipeline and scripts under tools/stats
#Pinned versions: 0.6.7
#test that import:
cmake==4.0.0
#Description: required for building

View File

@ -1,15 +1,24 @@
sphinx==5.3.0
#Description: This is used to generate PyTorch docs
#Pinned versions: 5.3.0
-e git+https://github.com/pytorch/pytorch_sphinx_theme.git#egg=pytorch_sphinx_theme
-e git+https://github.com/pytorch/pytorch_sphinx_theme.git@pytorch_sphinx_theme2#egg=pytorch_sphinx_theme2
# TODO: sphinxcontrib.katex 0.9.0 adds a local KaTeX server to speed up pre-rendering
# but it doesn't seem to work and hangs around idly. The initial thought is probably
# something related to Docker setup. We can investigate this later
sphinxcontrib.katex==0.8.6
#Description: This is used to generate PyTorch docs
#Pinned versions: 0.8.6
sphinxext-opengraph==0.9.1
#Description: This is used to generate PyTorch docs
#Pinned versions: 0.9.1
sphinx_sitemap==2.6.0
#Description: This is used to generate sitemap for PyTorch docs
#Pinned versions: 2.6.0
matplotlib==3.5.3
#Description: This is used to generate PyTorch docs
#Pinned versions: 3.5.3
@ -46,5 +55,6 @@ myst-nb==0.17.2
# The following are required to build torch.distributed.elastic.rendezvous.etcd* docs
python-etcd==0.4.5
sphinx-copybutton==0.5.0
sphinx-panels==0.4.1
sphinx-design==0.4.0
sphinxcontrib-mermaid==1.0.0
myst-parser==0.18.1

View File

@ -1 +1 @@
3.3.0
3.3.1

View File

@ -2,7 +2,7 @@ ARG UBUNTU_VERSION
ARG CUDA_VERSION
ARG IMAGE_NAME
FROM ${IMAGE_NAME}
FROM ${IMAGE_NAME} as base
ARG UBUNTU_VERSION
ARG CUDA_VERSION
@ -26,7 +26,6 @@ RUN bash ./install_docs_reqs.sh && rm install_docs_reqs.sh
ARG ANACONDA_PYTHON_VERSION
ENV ANACONDA_PYTHON_VERSION=$ANACONDA_PYTHON_VERSION
ENV PATH /opt/conda/envs/py_$ANACONDA_PYTHON_VERSION/bin:/opt/conda/bin:$PATH
ARG CONDA_CMAKE
COPY requirements-ci.txt /opt/conda/requirements-ci.txt
COPY ./common/install_conda.sh install_conda.sh
COPY ./common/common_utils.sh common_utils.sh
@ -43,20 +42,6 @@ ARG CLANG_VERSION
COPY ./common/install_clang.sh install_clang.sh
RUN bash ./install_clang.sh && rm install_clang.sh
# (optional) Install protobuf for ONNX
ARG PROTOBUF
COPY ./common/install_protobuf.sh install_protobuf.sh
RUN if [ -n "${PROTOBUF}" ]; then bash ./install_protobuf.sh; fi
RUN rm install_protobuf.sh
ENV INSTALLED_PROTOBUF ${PROTOBUF}
# (optional) Install database packages like LMDB and LevelDB
ARG DB
COPY ./common/install_db.sh install_db.sh
RUN if [ -n "${DB}" ]; then bash ./install_db.sh; fi
RUN rm install_db.sh
ENV INSTALLED_DB ${DB}
# (optional) Install vision packages like OpenCV
ARG VISION
COPY ./common/install_vision.sh ./common/cache_vision_models.sh ./common/common_utils.sh ./
@ -90,21 +75,21 @@ COPY ci_commit_pins/timm.txt timm.txt
RUN if [ -n "${INDUCTOR_BENCHMARKS}" ]; then bash ./install_inductor_benchmark_deps.sh; fi
RUN rm install_inductor_benchmark_deps.sh common_utils.sh timm.txt huggingface.txt
# (optional) Install non-default CMake version
ARG CMAKE_VERSION
COPY ./common/install_cmake.sh install_cmake.sh
RUN if [ -n "${CMAKE_VERSION}" ]; then bash ./install_cmake.sh; fi
RUN rm install_cmake.sh
ARG TRITON
FROM base as triton-builder
# Install triton, this needs to be done before sccache because the latter will
# try to reach out to S3, which docker build runners don't have access
COPY ./common/install_triton.sh install_triton.sh
COPY ./common/common_utils.sh common_utils.sh
COPY ci_commit_pins/triton.txt triton.txt
COPY triton_version.txt triton_version.txt
RUN if [ -n "${TRITON}" ]; then bash ./install_triton.sh; fi
RUN rm install_triton.sh common_utils.sh triton.txt triton_version.txt
RUN bash ./install_triton.sh
FROM base as final
COPY --from=triton-builder /opt/triton /opt/triton
RUN if [ -n "${TRITON}" ]; then pip install /opt/triton/*.whl; chown -R jenkins:jenkins /opt/conda; fi
RUN rm -rf /opt/triton
ARG HALIDE
# Build and install halide
@ -159,6 +144,16 @@ COPY ./common/install_cusparselt.sh install_cusparselt.sh
RUN bash install_cusparselt.sh
RUN rm install_cusparselt.sh
# Install NCCL
ARG CUDA_VERSION
COPY ./common/install_nccl.sh install_nccl.sh
COPY ./ci_commit_pins/nccl-cu* /ci_commit_pins/
RUN bash install_nccl.sh
RUN rm install_nccl.sh /ci_commit_pins/nccl-cu*
ENV USE_SYSTEM_NCCL=1
ENV NCCL_INCLUDE_DIR="/usr/local/cuda/include/"
ENV NCCL_LIB_DIR="/usr/local/cuda/lib64/"
# Install CUDSS
ARG CUDA_VERSION
COPY ./common/install_cudss.sh install_cudss.sh

View File

@ -27,7 +27,6 @@ RUN bash ./install_docs_reqs.sh && rm install_docs_reqs.sh
ARG ANACONDA_PYTHON_VERSION
ENV ANACONDA_PYTHON_VERSION=$ANACONDA_PYTHON_VERSION
ENV PATH /opt/conda/envs/py_$ANACONDA_PYTHON_VERSION/bin:/opt/conda/bin:$PATH
ARG CONDA_CMAKE
COPY requirements-ci.txt /opt/conda/requirements-ci.txt
COPY ./common/install_conda.sh install_conda.sh
COPY ./common/common_utils.sh common_utils.sh
@ -43,20 +42,6 @@ ARG CLANG_VERSION
COPY ./common/install_clang.sh install_clang.sh
RUN bash ./install_clang.sh && rm install_clang.sh
# (optional) Install protobuf for ONNX
ARG PROTOBUF
COPY ./common/install_protobuf.sh install_protobuf.sh
RUN if [ -n "${PROTOBUF}" ]; then bash ./install_protobuf.sh; fi
RUN rm install_protobuf.sh
ENV INSTALLED_PROTOBUF ${PROTOBUF}
# (optional) Install database packages like LMDB and LevelDB
ARG DB
COPY ./common/install_db.sh install_db.sh
RUN if [ -n "${DB}" ]; then bash ./install_db.sh; fi
RUN rm install_db.sh
ENV INSTALLED_DB ${DB}
# (optional) Install vision packages like OpenCV
ARG VISION
COPY ./common/install_vision.sh ./common/cache_vision_models.sh ./common/common_utils.sh ./
@ -70,7 +55,7 @@ COPY ./common/install_rocm.sh install_rocm.sh
RUN bash ./install_rocm.sh
RUN rm install_rocm.sh
COPY ./common/install_rocm_magma.sh install_rocm_magma.sh
RUN bash ./install_rocm_magma.sh
RUN bash ./install_rocm_magma.sh ${ROCM_VERSION}
RUN rm install_rocm_magma.sh
ADD ./common/install_miopen.sh install_miopen.sh
RUN bash ./install_miopen.sh ${ROCM_VERSION} && rm install_miopen.sh
@ -115,12 +100,6 @@ COPY ci_commit_pins/timm.txt timm.txt
RUN if [ -n "${INDUCTOR_BENCHMARKS}" ]; then bash ./install_inductor_benchmark_deps.sh; fi
RUN rm install_inductor_benchmark_deps.sh common_utils.sh timm.txt huggingface.txt
# (optional) Install non-default CMake version
ARG CMAKE_VERSION
COPY ./common/install_cmake.sh install_cmake.sh
RUN if [ -n "${CMAKE_VERSION}" ]; then bash ./install_cmake.sh; fi
RUN rm install_cmake.sh
# (optional) Install non-default Ninja version
ARG NINJA_VERSION
COPY ./common/install_ninja.sh install_ninja.sh

View File

@ -28,7 +28,6 @@ RUN bash ./install_docs_reqs.sh && rm install_docs_reqs.sh
# Install conda and other packages (e.g., numpy, pytest)
ARG ANACONDA_PYTHON_VERSION
ARG CONDA_CMAKE
ARG DOCS
ARG BUILD_ENVIRONMENT
ENV ANACONDA_PYTHON_VERSION=$ANACONDA_PYTHON_VERSION
@ -77,13 +76,6 @@ COPY triton_version.txt triton_version.txt
RUN if [ -n "${TRITON}" ]; then bash ./install_triton.sh; fi
RUN rm install_triton.sh common_utils.sh triton-xpu.txt triton_version.txt
# (optional) Install database packages like LMDB and LevelDB
ARG DB
COPY ./common/install_db.sh install_db.sh
RUN if [ -n "${DB}" ]; then bash ./install_db.sh; fi
RUN rm install_db.sh
ENV INSTALLED_DB ${DB}
# (optional) Install vision packages like OpenCV
ARG VISION
COPY ./common/install_vision.sh ./common/cache_vision_models.sh ./common/common_utils.sh ./
@ -91,12 +83,6 @@ RUN if [ -n "${VISION}" ]; then bash ./install_vision.sh; fi
RUN rm install_vision.sh cache_vision_models.sh common_utils.sh
ENV INSTALLED_VISION ${VISION}
# (optional) Install non-default CMake version
ARG CMAKE_VERSION
COPY ./common/install_cmake.sh install_cmake.sh
RUN if [ -n "${CMAKE_VERSION}" ]; then bash ./install_cmake.sh; fi
RUN rm install_cmake.sh
# (optional) Install non-default Ninja version
ARG NINJA_VERSION
COPY ./common/install_ninja.sh install_ninja.sh


@ -1,6 +1,6 @@
ARG UBUNTU_VERSION
FROM ubuntu:${UBUNTU_VERSION}
FROM ubuntu:${UBUNTU_VERSION} as base
ARG UBUNTU_VERSION
@ -28,7 +28,6 @@ RUN bash ./install_docs_reqs.sh && rm install_docs_reqs.sh
# Install conda and other packages (e.g., numpy, pytest)
ARG ANACONDA_PYTHON_VERSION
ARG CONDA_CMAKE
ARG DOCS
ENV ANACONDA_PYTHON_VERSION=$ANACONDA_PYTHON_VERSION
ENV PATH /opt/conda/envs/py_$ANACONDA_PYTHON_VERSION/bin:/opt/conda/bin:$PATH
@ -52,9 +51,17 @@ RUN bash ./install_lcov.sh && rm install_lcov.sh
# Install cuda and cudnn
ARG CUDA_VERSION
COPY ./common/install_cuda.sh install_cuda.sh
RUN bash ./install_cuda.sh ${CUDA_VERSION} && rm install_cuda.sh
COPY ./common/install_nccl.sh install_nccl.sh
COPY ./ci_commit_pins/nccl-cu* /ci_commit_pins/
COPY ./common/install_cusparselt.sh install_cusparselt.sh
RUN bash ./install_cuda.sh ${CUDA_VERSION} && rm install_cuda.sh install_nccl.sh /ci_commit_pins/nccl-cu* install_cusparselt.sh
ENV DESIRED_CUDA ${CUDA_VERSION}
ENV PATH /usr/local/nvidia/bin:/usr/local/cuda/bin:$PATH
# No effect if cuda not installed
ENV USE_SYSTEM_NCCL=1
ENV NCCL_INCLUDE_DIR="/usr/local/cuda/include/"
ENV NCCL_LIB_DIR="/usr/local/cuda/lib64/"
# (optional) Install UCC
ARG UCX_COMMIT
@ -67,20 +74,6 @@ ADD ./common/install_ucc.sh install_ucc.sh
RUN if [ -n "${UCX_COMMIT}" ] && [ -n "${UCC_COMMIT}" ]; then bash ./install_ucc.sh; fi
RUN rm install_ucc.sh
# (optional) Install protobuf for ONNX
ARG PROTOBUF
COPY ./common/install_protobuf.sh install_protobuf.sh
RUN if [ -n "${PROTOBUF}" ]; then bash ./install_protobuf.sh; fi
RUN rm install_protobuf.sh
ENV INSTALLED_PROTOBUF ${PROTOBUF}
# (optional) Install database packages like LMDB and LevelDB
ARG DB
COPY ./common/install_db.sh install_db.sh
RUN if [ -n "${DB}" ]; then bash ./install_db.sh; fi
RUN rm install_db.sh
ENV INSTALLED_DB ${DB}
# (optional) Install vision packages like OpenCV
ARG VISION
COPY ./common/install_vision.sh ./common/cache_vision_models.sh ./common/common_utils.sh ./
@ -88,24 +81,6 @@ RUN if [ -n "${VISION}" ]; then bash ./install_vision.sh; fi
RUN rm install_vision.sh cache_vision_models.sh common_utils.sh
ENV INSTALLED_VISION ${VISION}
# (optional) Install Vulkan SDK
ARG VULKAN_SDK_VERSION
COPY ./common/install_vulkan_sdk.sh install_vulkan_sdk.sh
RUN if [ -n "${VULKAN_SDK_VERSION}" ]; then bash ./install_vulkan_sdk.sh; fi
RUN rm install_vulkan_sdk.sh
# (optional) Install swiftshader
ARG SWIFTSHADER
COPY ./common/install_swiftshader.sh install_swiftshader.sh
RUN if [ -n "${SWIFTSHADER}" ]; then bash ./install_swiftshader.sh; fi
RUN rm install_swiftshader.sh
# (optional) Install non-default CMake version
ARG CMAKE_VERSION
COPY ./common/install_cmake.sh install_cmake.sh
RUN if [ -n "${CMAKE_VERSION}" ]; then bash ./install_cmake.sh; fi
RUN rm install_cmake.sh
# (optional) Install non-default Ninja version
ARG NINJA_VERSION
COPY ./common/install_ninja.sh install_ninja.sh
@ -127,20 +102,21 @@ RUN if [ -n "${INDUCTOR_BENCHMARKS}" ]; then bash ./install_inductor_benchmark_d
RUN rm install_inductor_benchmark_deps.sh common_utils.sh timm.txt huggingface.txt
ARG TRITON
# Install triton; this needs to be done before sccache because the latter will
# try to reach out to S3, which docker build runners don't have access to
ARG TRITON_CPU
# Create a separate stage for building Triton and Triton-CPU. install_triton
# will check for the presence of env vars
FROM base as triton-builder
COPY ./common/install_triton.sh install_triton.sh
COPY ./common/common_utils.sh common_utils.sh
COPY ci_commit_pins/triton.txt triton.txt
RUN if [ -n "${TRITON}" ]; then bash ./install_triton.sh; fi
RUN rm install_triton.sh common_utils.sh triton.txt
ARG TRITON_CPU
COPY ./common/install_triton.sh install_triton.sh
COPY ./common/common_utils.sh common_utils.sh
COPY ci_commit_pins/triton-cpu.txt triton-cpu.txt
RUN if [ -n "${TRITON_CPU}" ]; then bash ./install_triton.sh; fi
RUN rm install_triton.sh common_utils.sh triton-cpu.txt
RUN bash ./install_triton.sh
FROM base as final
COPY --from=triton-builder /opt/triton /opt/triton
RUN if [ -n "${TRITON}" ] || [ -n "${TRITON_CPU}" ]; then pip install /opt/triton/*.whl; chown -R jenkins:jenkins /opt/conda; fi
RUN rm -rf /opt/triton
ARG EXECUTORCH
# Build and install executorch

.ci/magma-rocm/.gitignore vendored Normal file

@ -0,0 +1,2 @@
output/
magma-rocm*/

.ci/magma-rocm/Makefile Normal file

@ -0,0 +1,35 @@
SHELL=/usr/bin/env bash
DOCKER_CMD ?= docker
DESIRED_ROCM ?= 6.4
DESIRED_ROCM_SHORT = $(subst .,,$(DESIRED_ROCM))
PACKAGE_NAME = magma-rocm
# inherit this from underlying docker image, do not pass this env var to docker
#PYTORCH_ROCM_ARCH ?= gfx900;gfx906;gfx908;gfx90a;gfx942;gfx1030;gfx1100;gfx1101;gfx1102;gfx1200;gfx1201
DOCKER_RUN = set -eou pipefail; ${DOCKER_CMD} run --rm -i \
-v $(shell git rev-parse --show-toplevel)/.ci:/builder \
-w /builder \
-e PACKAGE_NAME=${PACKAGE_NAME}${DESIRED_ROCM_SHORT} \
-e DESIRED_ROCM=${DESIRED_ROCM} \
"pytorch/almalinux-builder:rocm${DESIRED_ROCM}" \
magma-rocm/build_magma.sh
.PHONY: all
all: magma-rocm64
all: magma-rocm63
.PHONY:
clean:
$(RM) -r magma-*
$(RM) -r output
.PHONY: magma-rocm64
magma-rocm64: DESIRED_ROCM := 6.4
magma-rocm64:
$(DOCKER_RUN)
.PHONY: magma-rocm63
magma-rocm63: DESIRED_ROCM := 6.3
magma-rocm63:
$(DOCKER_RUN)

.ci/magma-rocm/README.md Normal file

@ -0,0 +1,48 @@
# Magma ROCm
This folder contains the scripts and configurations to build libmagma.so, linked for various versions of ROCm.
## Building
Look in the `Makefile` for available targets to build. To build any target, for example `magma-rocm63`, run
```
# Using `docker`
make magma-rocm63
# Using `podman`
DOCKER_CMD=podman make magma-rocm63
```
This spawns a `pytorch/almalinux-builder:rocm<version>` docker image (the image used by `DOCKER_RUN` in the `Makefile`), which has the required `devtoolset` and ROCm versions installed.
Within the docker image, it runs `build_magma.sh` with the correct environment variables set, which packages the necessary files
into a tarball with the following structure:
```
.
├── include # header files
├── lib # libmagma.so
├── info
│ ├── licenses # license file
│ └── recipe # build script
```
More specifically, `build_magma.sh` copies over the relevant files from the `package_files` directory depending on the ROCm version.
Outputted binaries should be in the `output` folder.
## Pushing
Packages can be uploaded to an S3 bucket using:
```
aws s3 cp output/*/magma-rocm*.bz2 <bucket-with-path>
```
If you do not have upload permissions, please ping @seemethere or @soumith to gain access.
## New versions
New ROCm versions can be added by creating a new make target with the next desired version. For ROCm version N.n, the target should be named `magma-rocmNn`.
Make sure to edit the appropriate environment variables (e.g., DESIRED_ROCM) in the `Makefile` accordingly. Remember also to check `build_magma.sh` to ensure the logic for copying over the files remains correct.
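For illustration, a hypothetical target for a future ROCm 6.5 release might look like the sketch below; the 6.5 version number and the matching `pytorch/almalinux-builder:rocm6.5` image are assumptions, not a statement that such a release exists:
```
.PHONY: magma-rocm65
magma-rocm65: DESIRED_ROCM := 6.5
magma-rocm65:
	$(DOCKER_RUN)
```
The new target should also be listed under the `all` target (e.g. `all: magma-rocm65`) so that it is built by default.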

.ci/magma-rocm/build_magma.sh Executable file

@ -0,0 +1,42 @@
#!/usr/bin/env bash
set -eou pipefail
# Environment variables
# The script expects DESIRED_ROCM and PACKAGE_NAME to be set (passed in via the Makefile's DOCKER_RUN)
ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
# Version 2.7.2 + ROCm related updates
MAGMA_VERSION=a1625ff4d9bc362906bd01f805dbbe12612953f6
# Folders for the build
PACKAGE_FILES=${ROOT_DIR}/magma-rocm/package_files # metadata
PACKAGE_DIR=${ROOT_DIR}/magma-rocm/${PACKAGE_NAME} # build workspace
PACKAGE_OUTPUT=${ROOT_DIR}/magma-rocm/output # where tarballs are stored
PACKAGE_BUILD=${PACKAGE_DIR} # where the content of the tarball is prepared
PACKAGE_RECIPE=${PACKAGE_BUILD}/info/recipe
PACKAGE_LICENSE=${PACKAGE_BUILD}/info/licenses
mkdir -p ${PACKAGE_DIR} ${PACKAGE_OUTPUT}/linux-64 ${PACKAGE_BUILD} ${PACKAGE_RECIPE} ${PACKAGE_LICENSE}
# Fetch magma sources and verify checksum
pushd ${PACKAGE_DIR}
git clone https://bitbucket.org/icl/magma.git
pushd magma
git checkout ${MAGMA_VERSION}
popd
popd
# build
pushd ${PACKAGE_DIR}/magma
# The build.sh script expects to be executed from the sources root folder
INSTALL_DIR=${PACKAGE_BUILD} ${PACKAGE_FILES}/build.sh
popd
# Package recipe, license and tarball
# Folder and package name are backward compatible for the build workflow
cp ${PACKAGE_FILES}/build.sh ${PACKAGE_RECIPE}/build.sh
cp ${PACKAGE_DIR}/magma/COPYRIGHT ${PACKAGE_LICENSE}/COPYRIGHT
pushd ${PACKAGE_BUILD}
tar cjf ${PACKAGE_OUTPUT}/linux-64/${PACKAGE_NAME}-${MAGMA_VERSION}-1.tar.bz2 include lib info
echo Built in ${PACKAGE_OUTPUT}/linux-64/${PACKAGE_NAME}-${MAGMA_VERSION}-1.tar.bz2
popd


@ -0,0 +1,38 @@
# Magma build scripts need `python`
ln -sf /usr/bin/python3 /usr/bin/python
ID=$(grep -oP '(?<=^ID=).+' /etc/os-release | tr -d '"')
case "$ID" in
almalinux)
yum install -y gcc-gfortran
;;
*)
echo "No preinstalls to build magma..."
;;
esac
MKLROOT=${MKLROOT:-/opt/conda/envs/py_$ANACONDA_PYTHON_VERSION}
cp make.inc-examples/make.inc.hip-gcc-mkl make.inc
echo 'LIBDIR += -L$(MKLROOT)/lib' >> make.inc
if [[ -f "${MKLROOT}/lib/libmkl_core.a" ]]; then
echo 'LIB = -Wl,--start-group -lmkl_gf_lp64 -lmkl_gnu_thread -lmkl_core -Wl,--end-group -lpthread -lstdc++ -lm -lgomp -lhipblas -lhipsparse' >> make.inc
fi
echo 'LIB += -Wl,--enable-new-dtags -Wl,--rpath,/opt/rocm/lib -Wl,--rpath,$(MKLROOT)/lib -Wl,--rpath,/opt/rocm/magma/lib -ldl' >> make.inc
echo 'DEVCCFLAGS += --gpu-max-threads-per-block=256' >> make.inc
export PATH="${PATH}:/opt/rocm/bin"
if [[ -n "$PYTORCH_ROCM_ARCH" ]]; then
amdgpu_targets=`echo $PYTORCH_ROCM_ARCH | sed 's/;/ /g'`
else
amdgpu_targets=`rocm_agent_enumerator | grep -v gfx000 | sort -u | xargs`
fi
for arch in $amdgpu_targets; do
echo "DEVCCFLAGS += --offload-arch=$arch" >> make.inc
done
# hipcc with the openmp flag may cause isnan() on __device__ not to be found; depending on context, the compiler may attempt to match it with the host definition
sed -i 's/^FOPENMP/#FOPENMP/g' make.inc
make -f make.gen.hipMAGMA -j $(nproc)
LANG=C.UTF-8 make lib/libmagma.so -j $(nproc) MKLROOT="${MKLROOT}"
make testing/testing_dgemm -j $(nproc) MKLROOT="${MKLROOT}"
cp -R lib ${INSTALL_DIR}
cp -R include ${INSTALL_DIR}


@ -12,13 +12,12 @@ DOCKER_RUN = set -eou pipefail; ${DOCKER_CMD} run --rm -i \
-e PACKAGE_NAME=${PACKAGE_NAME}${DESIRED_CUDA_SHORT} \
-e DESIRED_CUDA=${DESIRED_CUDA} \
-e CUDA_ARCH_LIST="${CUDA_ARCH_LIST}" \
"pytorch/manylinux2_28-builder:cuda${DESIRED_CUDA}-main" \
"pytorch/almalinux-builder:cuda${DESIRED_CUDA}-main" \
magma/build_magma.sh
.PHONY: all
all: magma-cuda128
all: magma-cuda126
all: magma-cuda124
all: magma-cuda118
.PHONY:
@ -37,11 +36,6 @@ magma-cuda126: DESIRED_CUDA := 12.6
magma-cuda126:
$(DOCKER_RUN)
.PHONY: magma-cuda124
magma-cuda124: DESIRED_CUDA := 12.4
magma-cuda124:
$(DOCKER_RUN)
.PHONY: magma-cuda118
magma-cuda118: DESIRED_CUDA := 11.8
magma-cuda118: CUDA_ARCH_LIST += -gencode arch=compute_37,code=sm_37


@ -18,12 +18,10 @@ retry () {
$* || (sleep 1 && $*) || (sleep 2 && $*) || (sleep 4 && $*) || (sleep 8 && $*)
}
PLATFORM="manylinux2014_x86_64"
PLATFORM=""
# TODO move this into the Docker images
OS_NAME=$(awk -F= '/^NAME/{print $2}' /etc/os-release)
if [[ "$OS_NAME" == *"CentOS Linux"* ]]; then
retry yum install -q -y zip openssl
elif [[ "$OS_NAME" == *"AlmaLinux"* ]]; then
if [[ "$OS_NAME" == *"AlmaLinux"* ]]; then
retry yum install -q -y zip openssl
PLATFORM="manylinux_2_28_x86_64"
elif [[ "$OS_NAME" == *"Red Hat Enterprise Linux"* ]]; then
@ -36,6 +34,9 @@ elif [[ "$OS_NAME" == *"Ubuntu"* ]]; then
retry apt-get update
retry apt-get -y install zip openssl
else
echo "Unknown OS: '$OS_NAME'"
exit 1
fi
# We use the package name to test the package by passing this to 'pip install'
@ -79,8 +80,6 @@ if [[ -e /opt/openssl ]]; then
export CMAKE_INCLUDE_PATH="/opt/openssl/include":$CMAKE_INCLUDE_PATH
fi
mkdir -p /tmp/$WHEELHOUSE_DIR
export PATCHELF_BIN=/usr/local/bin/patchelf
@ -111,12 +110,6 @@ case ${DESIRED_PYTHON} in
;;
esac
if [[ "$DESIRED_DEVTOOLSET" == *"cxx11-abi"* ]]; then
export _GLIBCXX_USE_CXX11_ABI=1
else
export _GLIBCXX_USE_CXX11_ABI=0
fi
if [[ "$DESIRED_CUDA" == *"rocm"* ]]; then
echo "Calling build_amd.py at $(date)"
python tools/amd_build/build_amd.py
@ -209,12 +202,6 @@ if [[ -n "$BUILD_PYTHONLESS" ]]; then
mkdir -p /tmp/$LIBTORCH_HOUSE_DIR
if [[ "$DESIRED_DEVTOOLSET" == *"cxx11-abi"* ]]; then
LIBTORCH_ABI="cxx11-abi-"
else
LIBTORCH_ABI=
fi
zip -rq /tmp/$LIBTORCH_HOUSE_DIR/libtorch-$LIBTORCH_ABI$LIBTORCH_VARIANT-$PYTORCH_BUILD_VERSION.zip libtorch
cp /tmp/$LIBTORCH_HOUSE_DIR/libtorch-$LIBTORCH_ABI$LIBTORCH_VARIANT-$PYTORCH_BUILD_VERSION.zip \
/tmp/$LIBTORCH_HOUSE_DIR/libtorch-$LIBTORCH_ABI$LIBTORCH_VARIANT-latest.zip
@ -333,8 +320,8 @@ for pkg in /$WHEELHOUSE_DIR/torch_no_python*.whl /$WHEELHOUSE_DIR/torch*linux*.w
# ROCm workaround for roctracer dlopens
if [[ "$DESIRED_CUDA" == *"rocm"* ]]; then
patchedpath=$(fname_without_so_number $destpath)
# Keep the so number for XPU dependencies
elif [[ "$DESIRED_CUDA" == *"xpu"* ]]; then
# Keep the so number for XPU dependencies and libgomp.so.1 to avoid twice load
elif [[ "$DESIRED_CUDA" == *"xpu"* || "$filename" == "libgomp.so.1" ]]; then
patchedpath=$destpath
else
patchedpath=$(fname_with_sha256 $destpath)


@ -36,10 +36,8 @@ if [[ -n "$DESIRED_CUDA" ]]; then
if [[ ${DESIRED_CUDA} =~ ^[0-9]+\.[0-9]+$ ]]; then
CUDA_VERSION=${DESIRED_CUDA}
else
# cu90, cu92, cu100, cu101
if [[ ${#DESIRED_CUDA} -eq 4 ]]; then
CUDA_VERSION="${DESIRED_CUDA:2:1}.${DESIRED_CUDA:3:1}"
elif [[ ${#DESIRED_CUDA} -eq 5 ]]; then
# cu126, cu128 etc...
if [[ ${#DESIRED_CUDA} -eq 5 ]]; then
CUDA_VERSION="${DESIRED_CUDA:2:2}.${DESIRED_CUDA:4:1}"
fi
fi
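# A worked example of the slicing above (hypothetical input, not part of the script's logic):
#   DESIRED_CUDA=cu128  ->  ${DESIRED_CUDA:2:2} is "12", ${DESIRED_CUDA:4:1} is "8", so CUDA_VERSION="12.8"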
@ -61,10 +59,6 @@ case ${CUDA_VERSION} in
TORCH_CUDA_ARCH_LIST="${TORCH_CUDA_ARCH_LIST};9.0"
EXTRA_CAFFE2_CMAKE_FLAGS+=("-DATEN_NO_TEST=ON")
;;
12.4)
TORCH_CUDA_ARCH_LIST="${TORCH_CUDA_ARCH_LIST};9.0"
EXTRA_CAFFE2_CMAKE_FLAGS+=("-DATEN_NO_TEST=ON")
;;
11.8)
TORCH_CUDA_ARCH_LIST="${TORCH_CUDA_ARCH_LIST};3.7;9.0"
EXTRA_CAFFE2_CMAKE_FLAGS+=("-DATEN_NO_TEST=ON")
@ -91,14 +85,15 @@ fi
mkdir -p "$PYTORCH_FINAL_PACKAGE_DIR" || true
OS_NAME=$(awk -F= '/^NAME/{print $2}' /etc/os-release)
if [[ "$OS_NAME" == *"CentOS Linux"* ]]; then
LIBGOMP_PATH="/usr/lib64/libgomp.so.1"
elif [[ "$OS_NAME" == *"AlmaLinux"* ]]; then
if [[ "$OS_NAME" == *"AlmaLinux"* ]]; then
LIBGOMP_PATH="/usr/lib64/libgomp.so.1"
elif [[ "$OS_NAME" == *"Red Hat Enterprise Linux"* ]]; then
LIBGOMP_PATH="/usr/lib64/libgomp.so.1"
elif [[ "$OS_NAME" == *"Ubuntu"* ]]; then
LIBGOMP_PATH="/usr/lib/x86_64-linux-gnu/libgomp.so.1"
else
echo "Unknown OS: '$OS_NAME'"
exit 1
fi
DEPS_LIST=(
@ -108,26 +103,8 @@ DEPS_SONAME=(
"libgomp.so.1"
)
# CUDA 11.8 have to ship the libcusparseLt.so.0 with the binary
# since nvidia-cusparselt-cu11 is not available in PYPI
if [[ $USE_CUSPARSELT == "1" && $CUDA_VERSION == "11.8" ]]; then
DEPS_SONAME+=(
"libcusparseLt.so.0"
)
DEPS_LIST+=(
"/usr/local/cuda/lib64/libcusparseLt.so.0"
)
fi
# Turn USE_CUFILE off for CUDA 11.8, 12.4 since nvidia-cufile-cu11 and 1.9.0.20 are
# not available in PYPI
if [[ $CUDA_VERSION == "11.8" || $CUDA_VERSION == "12.4" ]]; then
export USE_CUFILE=0
fi
# CUDA_VERSION 12.4, 12.6, 12.8
# CUDA_VERSION 12.6, 12.8
if [[ $CUDA_VERSION == 12* ]]; then
export USE_STATIC_CUDNN=0
# Try parallelizing nvcc as well
@ -151,6 +128,8 @@ if [[ $CUDA_VERSION == 12* ]]; then
"/usr/local/cuda/lib64/libnvToolsExt.so.1"
"/usr/local/cuda/lib64/libnvrtc.so.12"
"/usr/local/cuda/lib64/libnvrtc-builtins.so"
"/usr/local/cuda/lib64/libcufile.so.0"
"/usr/local/cuda/lib64/libcufile_rdma.so.1"
)
DEPS_SONAME+=(
"libcudnn_adv.so.9"
@ -168,17 +147,9 @@ if [[ $CUDA_VERSION == 12* ]]; then
"libnvToolsExt.so.1"
"libnvrtc.so.12"
"libnvrtc-builtins.so"
"libcufile.so.0"
"libcufile_rdma.so.1"
)
if [[ $USE_CUFILE == 1 ]]; then
DEPS_LIST+=(
"/usr/local/cuda/lib64/libcufile.so.0"
"/usr/local/cuda/lib64/libcufile_rdma.so.1"
)
DEPS_SONAME+=(
"libcufile.so.0"
"libcufile_rdma.so.1"
)
fi
else
echo "Using nvidia libs from pypi."
CUDA_RPATHS=(
@ -194,12 +165,8 @@ if [[ $CUDA_VERSION == 12* ]]; then
'$ORIGIN/../../cusparselt/lib'
'$ORIGIN/../../nvidia/nccl/lib'
'$ORIGIN/../../nvidia/nvtx/lib'
'$ORIGIN/../../nvidia/cufile/lib'
)
if [[ $USE_CUFILE == 1 ]]; then
CUDA_RPATHS+=(
'$ORIGIN/../../nvidia/cufile/lib'
)
fi
CUDA_RPATHS=$(IFS=: ; echo "${CUDA_RPATHS[*]}")
export C_SO_RPATH=$CUDA_RPATHS':$ORIGIN:$ORIGIN/lib'
export LIB_SO_RPATH=$CUDA_RPATHS':$ORIGIN'
@ -214,11 +181,25 @@ if [[ $CUDA_VERSION == 12* ]]; then
fi
elif [[ $CUDA_VERSION == "11.8" ]]; then
export USE_STATIC_CUDNN=0
# Turn USE_CUFILE off for CUDA 11.8 since nvidia-cufile-cu11 and 1.9.0.20 are
# not available in PYPI
export USE_CUFILE=0
# Try parallelizing nvcc as well
export TORCH_NVCC_FLAGS="-Xfatbin -compress-all --threads 2"
# Bundle ptxas into the wheel, see https://github.com/pytorch/pytorch/pull/119750
export BUILD_BUNDLE_PTXAS=1
# CUDA 11.8 has to ship libcusparseLt.so.0 with the binary
# since nvidia-cusparselt-cu11 is not available on PyPI
if [[ $USE_CUSPARSELT == "1" ]]; then
DEPS_SONAME+=(
"libcusparseLt.so.0"
)
DEPS_LIST+=(
"/usr/local/cuda/lib64/libcusparseLt.so.0"
)
fi
if [[ -z "$PYTORCH_EXTRA_INSTALL_REQUIREMENTS" ]]; then
echo "Bundling with cudnn and cublas."
DEPS_LIST+=(


@ -22,9 +22,7 @@ retry () {
# TODO move this into the Docker images
OS_NAME=`awk -F= '/^NAME/{print $2}' /etc/os-release`
if [[ "$OS_NAME" == *"CentOS Linux"* ]]; then
retry yum install -q -y zip openssl
elif [[ "$OS_NAME" == *"AlmaLinux"* ]]; then
if [[ "$OS_NAME" == *"AlmaLinux"* ]]; then
retry yum install -q -y zip openssl
elif [[ "$OS_NAME" == *"Red Hat Enterprise Linux"* ]]; then
retry dnf install -q -y zip openssl
@ -35,6 +33,9 @@ elif [[ "$OS_NAME" == *"Ubuntu"* ]]; then
sed -i 's/.*nvidia.*/# &/' $(find /etc/apt/ -type f -name "*.list")
retry apt-get update
retry apt-get -y install zip openssl
else
echo "Unknown OS: '$OS_NAME'"
exit 1
fi
# Version: setup.py uses $PYTORCH_BUILD_VERSION.post$PYTORCH_BUILD_NUMBER if
@ -95,12 +96,6 @@ python setup.py clean
retry pip install -qr requirements.txt
retry pip install -q numpy==2.0.1
if [[ "$DESIRED_DEVTOOLSET" == *"cxx11-abi"* ]]; then
export _GLIBCXX_USE_CXX11_ABI=1
else
export _GLIBCXX_USE_CXX11_ABI=0
fi
if [[ "$DESIRED_CUDA" == *"rocm"* ]]; then
echo "Calling build_amd.py at $(date)"
python tools/amd_build/build_amd.py
@ -169,12 +164,6 @@ fi
)
if [[ "$DESIRED_DEVTOOLSET" == *"cxx11-abi"* ]]; then
LIBTORCH_ABI="cxx11-abi-"
else
LIBTORCH_ABI=
fi
(
set -x


@ -20,7 +20,11 @@ fi
source /opt/intel/oneapi/compiler/latest/env/vars.sh
source /opt/intel/oneapi/pti/latest/env/vars.sh
source /opt/intel/oneapi/umf/latest/env/vars.sh
source /opt/intel/oneapi/ccl/latest/env/vars.sh
source /opt/intel/oneapi/mpi/latest/env/vars.sh
export USE_STATIC_MKL=1
export USE_ONEMKL=1
export USE_XCCL=1
WHEELHOUSE_DIR="wheelhousexpu"
LIBTORCH_HOUSE_DIR="libtorch_housexpu"


@ -10,5 +10,3 @@ example: `py2-cuda9.0-cudnn7-ubuntu16.04`. The Docker images that are
built on Jenkins and are used in triggered builds already have this
environment variable set in their manifest. Also see
`./docker/jenkins/*/Dockerfile` and search for `BUILD_ENVIRONMENT`.
Our Jenkins installation is located at https://ci.pytorch.org/jenkins/.


@ -35,7 +35,7 @@ if [[ "$BUILD_ENVIRONMENT" == *cuda* ]]; then
fi
if [[ "$BUILD_ENVIRONMENT" == *cuda11* ]]; then
if [[ "$BUILD_ENVIRONMENT" != *cuda11.3* && "$BUILD_ENVIRONMENT" != *clang* ]]; then
if [[ "$BUILD_ENVIRONMENT" != *clang* ]]; then
# TODO: there is a linking issue when building with UCC using clang;
# disable it for now, to be fixed later.
# TODO: disable UCC temporarily to enable CUDA 12.1 in CI
@ -171,6 +171,12 @@ fi
if [[ "$BUILD_ENVIRONMENT" == *xpu* ]]; then
# shellcheck disable=SC1091
source /opt/intel/oneapi/compiler/latest/env/vars.sh
# shellcheck disable=SC1091
source /opt/intel/oneapi/ccl/latest/env/vars.sh
# shellcheck disable=SC1091
source /opt/intel/oneapi/mpi/latest/env/vars.sh
# Enable XCCL build
export USE_XCCL=1
# XPU kineto feature dependencies are not fully ready, disable kineto build as temp WA
export USE_KINETO=0
export TORCH_XPU_ARCH_LIST=pvc
@ -277,10 +283,8 @@ else
# or building non-XLA tests.
if [[ "$BUILD_ENVIRONMENT" != *rocm* &&
"$BUILD_ENVIRONMENT" != *xla* ]]; then
if [[ "$BUILD_ENVIRONMENT" != *py3.8* ]]; then
# Install numpy-2.0.2 for builds which are backward compatible with 1.X
python -mpip install numpy==2.0.2
fi
# Install numpy-2.0.2 for builds which are backward compatible with 1.X
python -mpip install numpy==2.0.2
WERROR=1 python setup.py clean
@ -303,6 +307,18 @@ else
fi
pip_install_whl "$(echo dist/*.whl)"
if [[ "$BUILD_ENVIRONMENT" == *xpu* ]]; then
echo "Checking that xpu is compiled"
pushd dist/
if python -c 'import torch; exit(0 if torch.xpu._is_compiled() else 1)'; then
echo "XPU support is compiled in."
else
echo "XPU support is NOT compiled in."
exit 1
fi
popd
fi
# TODO: I'm not sure why, but somehow we lose verbose commands
set -x


@ -59,78 +59,16 @@ else
export install_root="$(dirname $(which python))/../lib/python${py_dot}/site-packages/torch/"
fi
###############################################################################
# Setup XPU ENV
###############################################################################
if [[ "$DESIRED_CUDA" == 'xpu' ]]; then
set +u
# Refer https://www.intel.com/content/www/us/en/developer/articles/tool/pytorch-prerequisites-for-intel-gpus.html
source /opt/intel/oneapi/compiler/latest/env/vars.sh
source /opt/intel/oneapi/pti/latest/env/vars.sh
fi
###############################################################################
# Check GCC ABI
###############################################################################
# NOTE [ Building libtorch with old vs. new gcc ABI ]
#
# Packages built with one version of ABI could not be linked against by client
# C++ libraries that were compiled using the other version of ABI. Since both
# gcc ABIs are still common in the wild, we need to support both ABIs. Currently:
#
# - All the nightlies built on CentOS 7 + devtoolset7 use the old gcc ABI.
# - All the nightlies built on Ubuntu 16.04 + gcc 5.4 use the new gcc ABI.
# NOTE: As of https://github.com/pytorch/pytorch/issues/126551 we only produce
# wheels with cxx11-abi
echo "Checking that the gcc ABI is what we expect"
if [[ "$(uname)" != 'Darwin' ]]; then
function is_expected() {
if [[ "$DESIRED_DEVTOOLSET" == *"cxx11-abi"* || "$DESIRED_CUDA" == *"rocm"* ]]; then
if [[ "$1" -gt 0 || "$1" == "ON " ]]; then
echo 1
fi
else
if [[ -z "$1" || "$1" == 0 || "$1" == "OFF" ]]; then
echo 1
fi
fi
}
# First we check that the env var in TorchConfig.cmake is correct
# We search for D_GLIBCXX_USE_CXX11_ABI=1 in torch/TorchConfig.cmake
torch_config="${install_root}/share/cmake/Torch/TorchConfig.cmake"
if [[ ! -f "$torch_config" ]]; then
echo "No TorchConfig.cmake found!"
ls -lah "$install_root/share/cmake/Torch"
exit 1
fi
echo "Checking the TorchConfig.cmake"
cat "$torch_config"
# The sed call below is
# don't print lines by default (only print the line we want)
# -n
# execute the following expression
# e
# replace lines that match with the first capture group and print
# s/.*D_GLIBCXX_USE_CXX11_ABI=\(.\)".*/\1/p
# any characters, D_GLIBCXX_USE_CXX11_ABI=, exactly one any character, a
# quote, any characters
# Note the exactly one single character after the '='. In the case that the
# variable is not set the '=' will be followed by a '"' immediately and the
# line will fail the match and nothing will be printed; this is what we
# want. Otherwise it will capture the 0 or 1 after the '='.
# /.*D_GLIBCXX_USE_CXX11_ABI=\(.\)".*/
# replace the matched line with the capture group and print
# /\1/p
actual_gcc_abi="$(sed -ne 's/.*D_GLIBCXX_USE_CXX11_ABI=\(.\)".*/\1/p' < "$torch_config")"
if [[ "$(is_expected "$actual_gcc_abi")" != 1 ]]; then
echo "gcc ABI $actual_gcc_abi not as expected."
exit 1
fi
# We also check that there are [not] cxx11 symbols in libtorch
# We also check that there are cxx11 symbols in libtorch
#
echo "Checking that symbols in libtorch.so have the right gcc abi"
python3 "$(dirname ${BASH_SOURCE[0]})/smoke_test/check_binary_symbols.py"
@ -208,35 +146,11 @@ setup_link_flags () {
TEST_CODE_DIR="$(dirname $(realpath ${BASH_SOURCE[0]}))/test_example_code"
build_and_run_example_cpp () {
if [[ "$DESIRED_DEVTOOLSET" == *"cxx11-abi"* ]]; then
GLIBCXX_USE_CXX11_ABI=1
else
GLIBCXX_USE_CXX11_ABI=0
fi
setup_link_flags
g++ ${TEST_CODE_DIR}/$1.cpp -I${install_root}/include -I${install_root}/include/torch/csrc/api/include -D_GLIBCXX_USE_CXX11_ABI=$GLIBCXX_USE_CXX11_ABI -std=gnu++17 -L${install_root}/lib ${REF_LIB} ${ADDITIONAL_LINKER_FLAGS} -ltorch $TORCH_CPU_LINK_FLAGS $TORCH_CUDA_LINK_FLAGS $C10_LINK_FLAGS -o $1
g++ ${TEST_CODE_DIR}/$1.cpp -I${install_root}/include -I${install_root}/include/torch/csrc/api/include -std=gnu++17 -L${install_root}/lib ${REF_LIB} ${ADDITIONAL_LINKER_FLAGS} -ltorch $TORCH_CPU_LINK_FLAGS $TORCH_CUDA_LINK_FLAGS $C10_LINK_FLAGS -o $1
./$1
}
build_example_cpp_with_incorrect_abi () {
if [[ "$DESIRED_DEVTOOLSET" == *"cxx11-abi"* ]]; then
GLIBCXX_USE_CXX11_ABI=0
else
GLIBCXX_USE_CXX11_ABI=1
fi
set +e
setup_link_flags
g++ ${TEST_CODE_DIR}/$1.cpp -I${install_root}/include -I${install_root}/include/torch/csrc/api/include -D_GLIBCXX_USE_CXX11_ABI=$GLIBCXX_USE_CXX11_ABI -std=gnu++17 -L${install_root}/lib ${REF_LIB} ${ADDITIONAL_LINKER_FLAGS} -ltorch $TORCH_CPU_LINK_FLAGS $TORCH_CUDA_LINK_FLAGS $C10_LINK_FLAGS -o $1
ERRCODE=$?
set -e
if [ "$ERRCODE" -eq "0" ]; then
echo "Building example with incorrect ABI didn't throw error. Aborting."
exit 1
else
echo "Building example with incorrect ABI throws expected error. Proceeding."
fi
}
###############################################################################
# Check simple Python/C++ calls
###############################################################################
@ -246,11 +160,6 @@ if [[ "$PACKAGE_TYPE" == 'libtorch' ]]; then
export LD_LIBRARY_PATH=/usr/local/cuda/lib64
fi
build_and_run_example_cpp simple-torch-test
# `_GLIBCXX_USE_CXX11_ABI` is always ignored by gcc in devtoolset7, so we test
# the expected failure case for Ubuntu 16.04 + gcc 5.4 only.
if [[ "$DESIRED_DEVTOOLSET" == *"cxx11-abi"* ]]; then
build_example_cpp_with_incorrect_abi simple-torch-test
fi
else
pushd /tmp
python -c 'import torch'
@ -307,6 +216,14 @@ else
fi
fi
###############################################################################
# Check XPU configured correctly
###############################################################################
if [[ "$DESIRED_CUDA" == 'xpu' && "$PACKAGE_TYPE" != 'libtorch' ]]; then
echo "Checking that xpu is compiled"
python -c 'import torch; exit(0 if torch.xpu._is_compiled() else 1)'
fi
###############################################################################
# Check CUDA configured correctly
###############################################################################
@ -385,10 +302,22 @@ except RuntimeError as e:
fi
###############################################################################
# Check for C++ ABI compatibility between gcc7 and gcc9 compiled binaries
# Check for C++ ABI compatibility to GCC-11 - GCC 13
###############################################################################
if [[ "$(uname)" == 'Linux' && "$PACKAGE_TYPE" == 'manywheel' ]]; then
pushd /tmp
python -c "import torch; exit(0 if torch.compiled_with_cxx11_abi() else (0 if torch._C._PYBIND11_BUILD_ABI == '_cxxabi1011' else 1))"
# Per https://gcc.gnu.org/onlinedocs/gcc/C_002b_002b-Dialect-Options.html
# gcc-11 is ABI16, gcc-13 is ABI18, gcc-14 is ABI19
# gcc 11 - CUDA 11.8, xpu, rocm
# gcc 13 - CUDA 12.6, 12.8 and cpu
# Please see issue for reference: https://github.com/pytorch/pytorch/issues/152426
if [[ "$(uname -m)" == "s390x" ]]; then
cxx_abi="19"
elif [[ "$DESIRED_CUDA" != 'cu118' && "$DESIRED_CUDA" != 'xpu' && "$DESIRED_CUDA" != 'rocm'* ]]; then
cxx_abi="18"
else
cxx_abi="16"
fi
python -c "import torch; exit(0 if torch._C._PYBIND11_BUILD_ABI == '_cxxabi10${cxx_abi}' else 1)"
popd
fi
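# For local debugging, one can print the tag that gets compared above (a sketch, not part of the CI check):
#   python -c "import torch; print(torch._C._PYBIND11_BUILD_ABI)"   # e.g. prints "_cxxabi1018" for a gcc-13 build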


@ -13,10 +13,6 @@ if [[ "${BUILD_ENVIRONMENT}" == *rocm* ]]; then
# HIP_PLATFORM is auto-detected by hipcc; unset to avoid build errors
unset HIP_PLATFORM
export PYTORCH_TEST_WITH_ROCM=1
# temporary to locate some kernel issues on the CI nodes
export HSAKMT_DEBUG_LEVEL=4
# improve rccl performance for distributed tests
export HSA_FORCE_FINE_GRAIN_PCIE=1
fi
# TODO: Re-enable libtorch testing for macOS, see https://github.com/pytorch/pytorch/issues/62598


@ -1,31 +1,50 @@
#!/bin/bash
# Script for installing sccache on the xla build job, which uses xla's docker
# image and doesn't have sccache installed on it. This is mostly copied from
# .ci/docker/install_cache.sh. Changes are: removing checks that will always
# return the same thing, e.g. checks for rocm and CUDA, and changing the path
# where sccache is installed, and not changing /etc/environment.
# image, which has sccache installed but doesn't write the stubs. This is
# mostly copied from .ci/docker/install_cache.sh. Changes are: removing checks
# that will always return the same thing, e.g. checks for rocm and CUDA, changing
# the path where sccache is installed, not changing /etc/environment, and not
# installing/downloading sccache as it is already in the docker image.
set -ex -o pipefail
install_binary() {
echo "Downloading sccache binary from S3 repo"
curl --retry 3 https://s3.amazonaws.com/ossci-linux/sccache -o /tmp/cache/bin/sccache
}
mkdir -p /tmp/cache/bin
mkdir -p /tmp/cache/lib
export PATH="/tmp/cache/bin:$PATH"
install_binary
chmod a+x /tmp/cache/bin/sccache
function write_sccache_stub() {
# Unset LD_PRELOAD for ps because of asan + ps issues
# https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90589
# shellcheck disable=SC2086
# shellcheck disable=SC2059
printf "#!/bin/sh\nif [ \$(env -u LD_PRELOAD ps -p \$PPID -o comm=) != sccache ]; then\n exec sccache $(which $1) \"\$@\"\nelse\n exec $(which $1) \"\$@\"\nfi" > "/tmp/cache/bin/$1"
if [ "$1" == "gcc" ]; then
# Do not call sccache recursively when dumping preprocessor argument
# For some reason it's very important for the first cached nvcc invocation
cat >"/tmp/cache/bin/$1" <<EOF
#!/bin/sh
# sccache does not support -E flag, so we need to call the original compiler directly in order to avoid calling this wrapper recursively
for arg in "\$@"; do
if [ "\$arg" = "-E" ]; then
exec $(which "$1") "\$@"
fi
done
if [ \$(env -u LD_PRELOAD ps -p \$PPID -o comm=) != sccache ]; then
exec sccache $(which "$1") "\$@"
else
exec $(which "$1") "\$@"
fi
EOF
else
cat >"/tmp/cache/bin/$1" <<EOF
#!/bin/sh
if [ \$(env -u LD_PRELOAD ps -p \$PPID -o comm=) != sccache ]; then
exec sccache $(which "$1") "\$@"
else
exec $(which "$1") "\$@"
fi
EOF
fi
chmod a+x "/tmp/cache/bin/$1"
}
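# Hypothetical usage sketch (the real call sites are outside this hunk):
#   write_sccache_stub gcc   # creates /tmp/cache/bin/gcc that forwards to `sccache <real gcc>`
#   write_sccache_stub g++
# With /tmp/cache/bin first on PATH, ordinary compiler invocations are then routed through sccache.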


@ -33,56 +33,15 @@ if which sccache > /dev/null; then
export PATH="${tmp_dir}:$PATH"
fi
cross_compile_arm64() {
# Cross compilation for arm64
print_cmake_info
if [[ ${BUILD_ENVIRONMENT} == *"distributed"* ]]; then
# Needed for inductor benchmarks, as lots of HF networks make `torch.distributed` calls
USE_DISTRIBUTED=1 USE_OPENMP=1 WERROR=1 python setup.py bdist_wheel
else
# Explicitly set USE_DISTRIBUTED=0 to align with the default build config on mac. This also serves as the sole CI config that tests
# that building with USE_DISTRIBUTED=0 works at all. See https://github.com/pytorch/pytorch/issues/86448
USE_DISTRIBUTED=0 CMAKE_OSX_ARCHITECTURES=arm64 MACOSX_DEPLOYMENT_TARGET=11.0 USE_MKLDNN=OFF USE_QNNPACK=OFF WERROR=1 BUILD_TEST=OFF USE_PYTORCH_METAL=1 python setup.py bdist_wheel
}
compile_arm64() {
# Compilation for arm64
# TODO: Compile with OpenMP support (but this causes CI regressions as cross-compilation was done with OpenMP disabled)
USE_DISTRIBUTED=0 USE_OPENMP=1 MACOSX_DEPLOYMENT_TARGET=11.0 WERROR=1 BUILD_TEST=OFF USE_PYTORCH_METAL=1 python setup.py bdist_wheel
}
compile_x86_64() {
USE_DISTRIBUTED=0 WERROR=1 python setup.py bdist_wheel --plat-name=macosx_10_9_x86_64
}
build_lite_interpreter() {
echo "Testing libtorch (lite interpreter)."
CPP_BUILD="$(pwd)/../cpp_build"
# Ensure the removal of the tmp directory
trap 'rm -rfv ${CPP_BUILD}' EXIT
rm -rf "${CPP_BUILD}"
mkdir -p "${CPP_BUILD}/caffe2"
# It looks like libtorch needs to be built in the "${CPP_BUILD}/caffe2" folder.
BUILD_LIBTORCH_PY=$PWD/tools/build_libtorch.py
pushd "${CPP_BUILD}/caffe2" || exit
VERBOSE=1 DEBUG=1 python "${BUILD_LIBTORCH_PY}"
popd || exit
"${CPP_BUILD}/caffe2/build/bin/test_lite_interpreter_runtime"
}
print_cmake_info
if [[ ${BUILD_ENVIRONMENT} = *arm64* ]]; then
if [[ $(uname -m) == "arm64" ]]; then
compile_arm64
else
cross_compile_arm64
fi
elif [[ ${BUILD_ENVIRONMENT} = *lite-interpreter* ]]; then
export BUILD_LITE_INTERPRETER=1
build_lite_interpreter
else
compile_x86_64
USE_DISTRIBUTED=0 USE_OPENMP=1 MACOSX_DEPLOYMENT_TARGET=11.0 WERROR=1 BUILD_TEST=OFF USE_PYTORCH_METAL=1 python setup.py bdist_wheel --plat-name macosx_11_0_arm64
fi
if which sccache > /dev/null; then
print_sccache_stats
fi


@ -20,14 +20,4 @@ print_cmake_info() {
CONDA_INSTALLATION_DIR=$(dirname "$CMAKE_EXEC")
# Print all libraries under cmake rpath for debugging
ls -la "$CONDA_INSTALLATION_DIR/../lib"
export CMAKE_EXEC
# Explicitly add conda env lib folder to cmake rpath to address the flaky issue
# where cmake dependencies couldn't be found. This seems to point to how conda
# links $CMAKE_EXEC to its package cache when cloning a new environment
install_name_tool -add_rpath @executable_path/../lib "${CMAKE_EXEC}" || true
# Adding the rpath will invalidate cmake signature, so signing it again here
# to trust the executable. EXC_BAD_ACCESS (SIGKILL (Code Signature Invalid))
# with an exit code 137 otherwise
codesign -f -s - "${CMAKE_EXEC}" || true
}


@ -42,6 +42,16 @@ test_python_all() {
assert_git_not_dirty
}
test_python_mps() {
setup_test_python
time python test/run_test.py --verbose --mps
MTL_CAPTURE_ENABLED=1 ${CONDA_RUN} python3 test/test_mps.py --verbose -k test_metal_capture
assert_git_not_dirty
}
test_python_shard() {
if [[ -z "$NUM_TEST_SHARDS" ]]; then
echo "NUM_TEST_SHARDS must be defined to run a Python test shard"
@ -155,6 +165,7 @@ test_jit_hooks() {
torchbench_setup_macos() {
git clone --recursive https://github.com/pytorch/vision torchvision
git clone --recursive https://github.com/pytorch/audio torchaudio
brew install jpeg-turbo libpng
pushd torchvision
git fetch
@ -169,7 +180,8 @@ torchbench_setup_macos() {
git checkout "$(cat ../.github/ci_commit_pins/audio.txt)"
git submodule update --init --recursive
python setup.py clean
python setup.py develop
# TODO: Remove me when we figure out how to make TorchAudio find brew-installed OpenMP
USE_OPENMP=0 python setup.py develop
popd
# Shellcheck doesn't like it when you pass no arguments to a function that can take args. See https://www.shellcheck.net/wiki/SC2120
@ -177,9 +189,8 @@ torchbench_setup_macos() {
checkout_install_torchbench
}
conda_benchmark_deps() {
conda install -y astunparse numpy scipy ninja pyyaml setuptools cmake typing-extensions requests protobuf numba cython scikit-learn
conda install -y -c conda-forge librosa
pip_benchmark_deps() {
python -mpip install --no-input astunparse requests cython scikit-learn
}
@ -187,7 +198,7 @@ test_torchbench_perf() {
print_cmake_info
echo "Launching torchbench setup"
conda_benchmark_deps
pip_benchmark_deps
torchbench_setup_macos
TEST_REPORTS_DIR=$(pwd)/test/test-reports
@ -214,32 +225,62 @@ test_torchbench_smoketest() {
print_cmake_info
echo "Launching torchbench setup"
conda_benchmark_deps
pip_benchmark_deps
# shellcheck disable=SC2119,SC2120
torchbench_setup_macos
TEST_REPORTS_DIR=$(pwd)/test/test-reports
mkdir -p "$TEST_REPORTS_DIR"
local backend=eager
local dtype=notset
local device=mps
local models=(hf_T5 llama BERT_pytorch dcgan hf_GPT2 yolov3 resnet152 sam pytorch_unet stable_diffusion_text_encoder speech_transformer Super_SloMo doctr_det_predictor doctr_reco_predictor)
local hf_models=(GoogleFnet YituTechConvBert Speech2Text2ForCausalLM)
touch "$TEST_REPORTS_DIR/inductor_${backend}_torchbench_${dtype}_training_${device}_performance.csv"
touch "$TEST_REPORTS_DIR/inductor_${backend}_torchbench_${dtype}_inference_${device}_performance.csv"
for backend in eager inductor; do
echo "Setup complete, launching torchbench training performance run"
for model in hf_T5 llama BERT_pytorch dcgan hf_GPT2 yolov3 resnet152; do
PYTHONPATH="$(pwd)"/torchbench python benchmarks/dynamo/torchbench.py \
--performance --only "$model" --backend "$backend" --training --devices "$device" \
--output "$TEST_REPORTS_DIR/inductor_${backend}_torchbench_${dtype}_training_${device}_performance.csv"
done
for dtype in notset float16 bfloat16; do
echo "Launching torchbench inference performance run for backend ${backend} and dtype ${dtype}"
local dtype_arg="--${dtype}"
if [ "$dtype" == notset ]; then
dtype_arg="--float32"
fi
touch "$TEST_REPORTS_DIR/inductor_${backend}_torchbench_${dtype}_inference_${device}_performance.csv"
for model in "${models[@]}"; do
PYTHONPATH="$(pwd)"/torchbench python benchmarks/dynamo/torchbench.py \
--performance --only "$model" --backend "$backend" --inference --devices "$device" "$dtype_arg" \
--output "$TEST_REPORTS_DIR/inductor_${backend}_torchbench_${dtype}_inference_${device}_performance.csv" || true
if [ "$backend" == "inductor" ]; then
PYTHONPATH="$(pwd)"/torchbench python benchmarks/dynamo/torchbench.py \
--accuracy --only "$model" --backend "$backend" --inference --devices "$device" "$dtype_arg" \
--output "$TEST_REPORTS_DIR/inductor_${backend}_torchbench_${dtype}_inference_${device}_accuracy.csv" || true
fi
done
for model in "${hf_models[@]}"; do
if [ "$backend" == "inductor" ]; then
PYTHONPATH="$(pwd)"/torchbench python benchmarks/dynamo/huggingface.py \
--performance --only "$model" --backend "$backend" --inference --devices "$device" "$dtype_arg" \
--output "$TEST_REPORTS_DIR/inductor_${backend}_huggingface_${dtype}_inference_${device}_performance.csv" || true
PYTHONPATH="$(pwd)"/torchbench python benchmarks/dynamo/huggingface.py \
--accuracy --only "$model" --backend "$backend" --inference --devices "$device" "$dtype_arg" \
--output "$TEST_REPORTS_DIR/inductor_${backend}_huggingface_${dtype}_inference_${device}_accuracy.csv" || true
fi
done
done
for dtype in notset amp; do
echo "Launching torchbench training performance run for backend ${backend} and dtype ${dtype}"
touch "$TEST_REPORTS_DIR/inductor_${backend}_torchbench_${dtype}_training_${device}_performance.csv"
local dtype_arg="--${dtype}"
if [ "$dtype" == notset ]; then
dtype_arg="--float32"
fi
for model in "${models[@]}"; do
PYTHONPATH="$(pwd)"/torchbench python benchmarks/dynamo/torchbench.py \
--performance --only "$model" --backend "$backend" --training --devices "$device" "$dtype_arg" \
--output "$TEST_REPORTS_DIR/inductor_${backend}_torchbench_${dtype}_training_${device}_performance.csv" || true
done
done
echo "Launching torchbench inference performance run"
for model in hf_T5 llama BERT_pytorch dcgan hf_GPT2 yolov3 resnet152; do
PYTHONPATH="$(pwd)"/torchbench python benchmarks/dynamo/torchbench.py \
--performance --only "$model" --backend "$backend" --inference --devices "$device" \
--output "$TEST_REPORTS_DIR/inductor_${backend}_torchbench_${dtype}_inference_${device}_performance.csv"
done
echo "Pytorch benchmark on mps device completed"
@ -249,7 +290,7 @@ test_hf_perf() {
print_cmake_info
TEST_REPORTS_DIR=$(pwd)/test/test-reports
mkdir -p "$TEST_REPORTS_DIR"
conda_benchmark_deps
pip_benchmark_deps
torchbench_setup_macos
echo "Launching HuggingFace training perf run"
@ -265,7 +306,7 @@ test_timm_perf() {
print_cmake_info
TEST_REPORTS_DIR=$(pwd)/test/test-reports
mkdir -p "$TEST_REPORTS_DIR"
conda_benchmark_deps
pip_benchmark_deps
torchbench_setup_macos
echo "Launching timm training perf run"
@ -291,6 +332,8 @@ elif [[ $TEST_CONFIG == *"perf_timm"* ]]; then
test_timm_perf
elif [[ $TEST_CONFIG == *"perf_smoketest"* ]]; then
test_torchbench_smoketest
elif [[ $TEST_CONFIG == *"mps"* ]]; then
test_python_mps
elif [[ $NUM_TEST_SHARDS -gt 1 ]]; then
test_python_shard "${SHARD_NUMBER}"
if [[ "${SHARD_NUMBER}" == 1 ]]; then


@ -1,22 +0,0 @@
#!/bin/bash
set -e
run_test () {
rm -rf test_tmp/ && mkdir test_tmp/ && cd test_tmp/
"$@"
cd .. && rm -rf test_tmp/
}
get_runtime_of_command () {
TIMEFORMAT=%R
# runtime=$( { time ($@ &> /dev/null); } 2>&1 1>/dev/null)
runtime=$( { time "$@"; } 2>&1 1>/dev/null)
if [[ $runtime == *"Error"* ]]; then
exit 1
fi
runtime=${runtime#+++ $@}
runtime=$(python -c "print($runtime)")
echo "$runtime"
}


@ -1,91 +0,0 @@
import argparse
import json
import math
import sys
parser = argparse.ArgumentParser()
parser.add_argument(
"--test-name", dest="test_name", action="store", required=True, help="test name"
)
parser.add_argument(
"--sample-stats",
dest="sample_stats",
action="store",
required=True,
help="stats from sample",
)
parser.add_argument(
"--update",
action="store_true",
help="whether to update baseline using stats from sample",
)
args = parser.parse_args()
test_name = args.test_name
if "cpu" in test_name:
backend = "cpu"
elif "gpu" in test_name:
backend = "gpu"
data_file_path = f"../{backend}_runtime.json"
with open(data_file_path) as data_file:
data = json.load(data_file)
if test_name in data:
mean = float(data[test_name]["mean"])
sigma = float(data[test_name]["sigma"])
else:
# Let the test pass if baseline number doesn't exist
mean = sys.maxsize
sigma = 0.001
print("population mean: ", mean)
print("population sigma: ", sigma)
# Let the test pass if baseline number is NaN (which happened in
# the past when we didn't have logic for catching NaN numbers)
if math.isnan(mean) or math.isnan(sigma):
mean = sys.maxsize
sigma = 0.001
sample_stats_data = json.loads(args.sample_stats)
sample_mean = float(sample_stats_data["mean"])
sample_sigma = float(sample_stats_data["sigma"])
print("sample mean: ", sample_mean)
print("sample sigma: ", sample_sigma)
if math.isnan(sample_mean):
raise Exception("""Error: sample mean is NaN""") # noqa: TRY002
elif math.isnan(sample_sigma):
raise Exception("""Error: sample sigma is NaN""") # noqa: TRY002
z_value = (sample_mean - mean) / sigma
print("z-value: ", z_value)
if z_value >= 3:
raise Exception( # noqa: TRY002
f"""\n
z-value >= 3, there is high chance of perf regression.\n
To reproduce this regression, run
`cd .ci/pytorch/perf_test/ && bash {test_name}.sh` on your local machine
and compare the runtime before/after your code change.
"""
)
else:
print("z-value < 3, no perf regression detected.")
if args.update:
print("We will use these numbers as new baseline.")
new_data_file_path = f"../new_{backend}_runtime.json"
with open(new_data_file_path) as new_data_file:
new_data = json.load(new_data_file)
new_data[test_name] = {}
new_data[test_name]["mean"] = sample_mean
new_data[test_name]["sigma"] = max(sample_sigma, sample_mean * 0.1)
with open(new_data_file_path, "w") as new_data_file:
json.dump(new_data, new_data_file, indent=4)


@ -1,18 +0,0 @@
import json
import sys
import numpy
sample_data_list = sys.argv[1:]
sample_data_list = [float(v.strip()) for v in sample_data_list]
sample_mean = numpy.mean(sample_data_list)
sample_sigma = numpy.std(sample_data_list)
data = {
"mean": sample_mean,
"sigma": sample_sigma,
}
print(json.dumps(data))


@ -1,43 +0,0 @@
#!/bin/bash
set -e
. ./common.sh
test_cpu_speed_mini_sequence_labeler () {
echo "Testing: mini sequence labeler, CPU"
export OMP_NUM_THREADS=4
export MKL_NUM_THREADS=4
git clone https://github.com/pytorch/benchmark.git
cd benchmark/
git checkout 726567a455edbfda6199445922a8cfee82535664
cd scripts/mini_sequence_labeler
SAMPLE_ARRAY=()
NUM_RUNS=$1
for (( i=1; i<=NUM_RUNS; i++ )) do
runtime=$(get_runtime_of_command python main.py)
SAMPLE_ARRAY+=("${runtime}")
done
cd ../../..
stats=$(python ../get_stats.py "${SAMPLE_ARRAY[@]}")
echo "Runtime stats in seconds:"
echo "$stats"
if [ "$2" == "compare_with_baseline" ]; then
python ../compare_with_baseline.py --test-name "${FUNCNAME[0]}" --sample-stats "${stats}"
elif [ "$2" == "compare_and_update" ]; then
python ../compare_with_baseline.py --test-name "${FUNCNAME[0]}" --sample-stats "${stats}" --update
fi
}
if [[ "${BASH_SOURCE[0]}" == "${0}" ]]; then
run_test test_cpu_speed_mini_sequence_labeler "$@"
fi


@ -1,45 +0,0 @@
#!/bin/bash
set -e
. ./common.sh
test_cpu_speed_mnist () {
echo "Testing: MNIST, CPU"
export OMP_NUM_THREADS=4
export MKL_NUM_THREADS=4
git clone https://github.com/pytorch/examples.git -b perftests
cd examples/mnist
conda install -c pytorch torchvision-cpu
# Download data
python main.py --epochs 0
SAMPLE_ARRAY=()
NUM_RUNS=$1
for (( i=1; i<=NUM_RUNS; i++ )) do
runtime=$(get_runtime_of_command python main.py --epochs 1 --no-log)
echo "$runtime"
SAMPLE_ARRAY+=("${runtime}")
done
cd ../..
stats=$(python ../get_stats.py "${SAMPLE_ARRAY[@]}")
echo "Runtime stats in seconds:"
echo "$stats"
if [ "$2" == "compare_with_baseline" ]; then
python ../compare_with_baseline.py --test-name "${FUNCNAME[0]}" --sample-stats "${stats}"
elif [ "$2" == "compare_and_update" ]; then
python ../compare_with_baseline.py --test-name "${FUNCNAME[0]}" --sample-stats "${stats}" --update
fi
}
if [[ "${BASH_SOURCE[0]}" == "${0}" ]]; then
run_test test_cpu_speed_mnist "$@"
fi


@ -1,29 +0,0 @@
#!/bin/bash
. ./common.sh
test_cpu_speed_torch () {
echo "Testing: torch.*, CPU"
export OMP_NUM_THREADS=4
export MKL_NUM_THREADS=4
git clone https://github.com/yf225/perf-tests.git
if [ "$1" == "compare_with_baseline" ]; then
export ARGS=(--compare ../cpu_runtime.json)
elif [ "$1" == "compare_and_update" ]; then
export ARGS=(--compare ../cpu_runtime.json --update ../new_cpu_runtime.json)
elif [ "$1" == "update_only" ]; then
export ARGS=(--update ../new_cpu_runtime.json)
fi
if ! python perf-tests/modules/test_cpu_torch.py "${ARGS[@]}"; then
echo "To reproduce this regression, run \`cd .ci/pytorch/perf_test/ && bash ${FUNCNAME[0]}.sh\` on your local machine and compare the runtime before/after your code change."
exit 1
fi
}
if [[ "${BASH_SOURCE[0]}" == "${0}" ]]; then
run_test test_cpu_speed_torch "$@"
fi


@ -1,29 +0,0 @@
#!/bin/bash
. ./common.sh
test_cpu_speed_torch_tensor () {
echo "Testing: torch.Tensor.*, CPU"
export OMP_NUM_THREADS=4
export MKL_NUM_THREADS=4
git clone https://github.com/yf225/perf-tests.git
if [ "$1" == "compare_with_baseline" ]; then
export ARGS=(--compare ../cpu_runtime.json)
elif [ "$1" == "compare_and_update" ]; then
export ARGS=(--compare ../cpu_runtime.json --update ../new_cpu_runtime.json)
elif [ "$1" == "update_only" ]; then
export ARGS=(--update ../new_cpu_runtime.json)
fi
if ! python perf-tests/modules/test_cpu_torch_tensor.py "${ARGS[@]}"; then
echo "To reproduce this regression, run \`cd .ci/pytorch/perf_test/ && bash ${FUNCNAME[0]}.sh\` on your local machine and compare the runtime before/after your code change."
exit 1
fi
}
if [[ "${BASH_SOURCE[0]}" == "${0}" ]]; then
run_test test_cpu_speed_torch_tensor "$@"
fi


@ -1,44 +0,0 @@
#!/bin/bash
set -e
. ./common.sh
test_gpu_speed_cudnn_lstm () {
echo "Testing: CuDNN LSTM, GPU"
export OMP_NUM_THREADS=4
export MKL_NUM_THREADS=4
git clone https://github.com/pytorch/benchmark.git
cd benchmark/
git checkout 43dfb2c0370e70ef37f249dc09aff9f0ccd2ddb0
cd scripts/
SAMPLE_ARRAY=()
NUM_RUNS=$1
for (( i=1; i<=NUM_RUNS; i++ )) do
runtime=$(get_runtime_of_command python cudnn_lstm.py --skip-cpu-governor-check)
echo "$runtime"
SAMPLE_ARRAY+=("${runtime}")
done
cd ../..
stats=$(python ../get_stats.py "${SAMPLE_ARRAY[@]}")
echo "Runtime stats in seconds:"
echo "$stats"
if [ "$2" == "compare_with_baseline" ]; then
python ../compare_with_baseline.py --test-name "${FUNCNAME[0]}" --sample-stats "${stats}"
elif [ "$2" == "compare_and_update" ]; then
python ../compare_with_baseline.py --test-name "${FUNCNAME[0]}" --sample-stats "${stats}" --update
fi
}
if [[ "${BASH_SOURCE[0]}" == "${0}" ]]; then
run_test test_gpu_speed_cudnn_lstm "$@"
fi


@ -1,44 +0,0 @@
#!/bin/bash
set -e
. ./common.sh
test_gpu_speed_lstm () {
echo "Testing: LSTM, GPU"
export OMP_NUM_THREADS=4
export MKL_NUM_THREADS=4
git clone https://github.com/pytorch/benchmark.git
cd benchmark/
git checkout 43dfb2c0370e70ef37f249dc09aff9f0ccd2ddb0
cd scripts/
SAMPLE_ARRAY=()
NUM_RUNS=$1
for (( i=1; i<=NUM_RUNS; i++ )) do
runtime=$(get_runtime_of_command python lstm.py --skip-cpu-governor-check)
echo "$runtime"
SAMPLE_ARRAY+=("${runtime}")
done
cd ../..
stats=$(python ../get_stats.py "${SAMPLE_ARRAY[@]}")
echo "Runtime stats in seconds:"
echo "$stats"
if [ "$2" == "compare_with_baseline" ]; then
python ../compare_with_baseline.py --test-name "${FUNCNAME[0]}" --sample-stats "${stats}"
elif [ "$2" == "compare_and_update" ]; then
python ../compare_with_baseline.py --test-name "${FUNCNAME[0]}" --sample-stats "${stats}" --update
fi
}
if [[ "${BASH_SOURCE[0]}" == "${0}" ]]; then
run_test test_gpu_speed_lstm "$@"
fi


@ -1,44 +0,0 @@
#!/bin/bash
set -e
. ./common.sh
test_gpu_speed_mlstm () {
echo "Testing: MLSTM, GPU"
export OMP_NUM_THREADS=4
export MKL_NUM_THREADS=4
git clone https://github.com/pytorch/benchmark.git
cd benchmark/
git checkout 43dfb2c0370e70ef37f249dc09aff9f0ccd2ddb0
cd scripts/
SAMPLE_ARRAY=()
NUM_RUNS=$1
for (( i=1; i<=NUM_RUNS; i++ )) do
runtime=$(get_runtime_of_command python mlstm.py --skip-cpu-governor-check)
echo "$runtime"
SAMPLE_ARRAY+=("${runtime}")
done
cd ../..
stats=$(python ../get_stats.py "${SAMPLE_ARRAY[@]}")
echo "Runtime stats in seconds:"
echo "$stats"
if [ "$2" == "compare_with_baseline" ]; then
python ../compare_with_baseline.py --test-name "${FUNCNAME[0]}" --sample-stats "${stats}"
elif [ "$2" == "compare_and_update" ]; then
python ../compare_with_baseline.py --test-name "${FUNCNAME[0]}" --sample-stats "${stats}" --update
fi
}
if [[ "${BASH_SOURCE[0]}" == "${0}" ]]; then
run_test test_gpu_speed_mlstm "$@"
fi


@ -1,48 +0,0 @@
#!/bin/bash
set -e
. ./common.sh
test_gpu_speed_mnist () {
echo "Testing: MNIST, GPU"
export OMP_NUM_THREADS=4
export MKL_NUM_THREADS=4
git clone https://github.com/pytorch/examples.git -b perftests
cd examples/mnist
conda install -c pytorch torchvision
# Download data
python main.py --epochs 0
SAMPLE_ARRAY=()
NUM_RUNS=$1
# Needs warm up to get accurate number
python main.py --epochs 1 --no-log
for (( i=1; i<=NUM_RUNS; i++ )) do
runtime=$(get_runtime_of_command python main.py --epochs 1 --no-log)
echo "$runtime"
SAMPLE_ARRAY+=("${runtime}")
done
cd ../..
stats=$(python ../get_stats.py "${SAMPLE_ARRAY[@]}")
echo "Runtime stats in seconds:"
echo "$stats"
if [ "$2" == "compare_with_baseline" ]; then
python ../compare_with_baseline.py --test-name "${FUNCNAME[0]}" --sample-stats "${stats}"
elif [ "$2" == "compare_and_update" ]; then
python ../compare_with_baseline.py --test-name "${FUNCNAME[0]}" --sample-stats "${stats}" --update
fi
}
if [[ "${BASH_SOURCE[0]}" == "${0}" ]]; then
run_test test_gpu_speed_mnist "$@"
fi


@ -1,53 +0,0 @@
#!/bin/bash
set -e
. ./common.sh
test_gpu_speed_word_language_model () {
echo "Testing: word language model on Wikitext-2, GPU"
export OMP_NUM_THREADS=4
export MKL_NUM_THREADS=4
git clone https://github.com/pytorch/examples.git -b perftests
cd examples/word_language_model
cd data/wikitext-2
# Reduce dataset size, so that we can have more runs per test
sed -n '1,200p' test.txt > test_tmp.txt
sed -n '1,1000p' train.txt > train_tmp.txt
sed -n '1,200p' valid.txt > valid_tmp.txt
mv test_tmp.txt test.txt
mv train_tmp.txt train.txt
mv valid_tmp.txt valid.txt
cd ../..
SAMPLE_ARRAY=()
NUM_RUNS=$1
for (( i=1; i<=NUM_RUNS; i++ )) do
runtime=$(get_runtime_of_command python main.py --cuda --epochs 1)
echo "$runtime"
SAMPLE_ARRAY+=("${runtime}")
done
cd ../..
stats=$(python ../get_stats.py "${SAMPLE_ARRAY[@]}")
echo "Runtime stats in seconds:"
echo "$stats"
if [ "$2" == "compare_with_baseline" ]; then
python ../compare_with_baseline.py --test-name "${FUNCNAME[0]}" --sample-stats "${stats}"
elif [ "$2" == "compare_and_update" ]; then
python ../compare_with_baseline.py --test-name "${FUNCNAME[0]}" --sample-stats "${stats}" --update
fi
}
if [[ "${BASH_SOURCE[0]}" == "${0}" ]]; then
run_test test_gpu_speed_word_language_model "$@"
fi


@ -1,14 +0,0 @@
import json
import sys
data_file_path = sys.argv[1]
commit_hash = sys.argv[2]
with open(data_file_path) as data_file:
data = json.load(data_file)
data["commit"] = commit_hash
with open(data_file_path, "w") as data_file:
json.dump(data, data_file)


@ -119,12 +119,6 @@ popd
git rm -rf "$install_path" || true
mv "$pt_checkout/docs/build/html" "$install_path"
# Prevent Google from indexing $install_path/_modules. This folder contains
# generated source files.
# NB: the following only works on gnu sed. The sed shipped with mac os is different.
# One can `brew install gnu-sed` on a mac and then use "gsed" instead of "sed".
find "$install_path/_modules" -name "*.html" -print0 | xargs -0 sed -i '/<head>/a \ \ <meta name="robots" content="noindex">'
git add "$install_path" || true
git status
git config user.email "soumith+bot@pytorch.org"


@ -76,7 +76,7 @@ fi
# Environment initialization
if [[ "$(uname)" == Darwin ]]; then
# Install the testing dependencies
retry conda install -yq future hypothesis ${NUMPY_PACKAGE} ${PROTOBUF_PACKAGE} pytest setuptools six typing_extensions pyyaml
retry pip install -q future hypothesis ${NUMPY_PACKAGE} ${PROTOBUF_PACKAGE} pytest setuptools six typing_extensions pyyaml
else
retry pip install -qr requirements.txt || true
retry pip install -q hypothesis protobuf pytest setuptools || true
@ -91,7 +91,6 @@ fi
echo "Testing with:"
pip freeze
conda list || true
##############################################################################
# Smoke tests


@ -1,71 +0,0 @@
#!/bin/bash
SCRIPT_PARENT_DIR=$(dirname "${BASH_SOURCE[0]}")
# shellcheck source=.ci/pytorch/common.sh
source "$SCRIPT_PARENT_DIR/common.sh"
cd .ci/pytorch/perf_test
echo "Running CPU perf test for PyTorch..."
pip install -q awscli
# Set multipart_threshold to be sufficiently high, so that `aws s3 cp` is not a multipart read
# More info at https://github.com/aws/aws-cli/issues/2321
aws configure set default.s3.multipart_threshold 5GB
UPSTREAM_DEFAULT_BRANCH="$(git remote show https://github.com/pytorch/pytorch.git | awk '/HEAD branch/ {print $NF}')"
if [[ "$COMMIT_SOURCE" == "$UPSTREAM_DEFAULT_BRANCH" ]]; then
# Get current default branch commit hash
DEFAULT_BRANCH_COMMIT_ID=$(git log --format="%H" -n 1)
export DEFAULT_BRANCH_COMMIT_ID
fi
# Find the default branch commit to test against
git remote add upstream https://github.com/pytorch/pytorch.git
git fetch upstream
IFS=$'\n'
while IFS='' read -r commit_id; do
if aws s3 ls s3://ossci-perf-test/pytorch/cpu_runtime/"${commit_id}".json; then
LATEST_TESTED_COMMIT=${commit_id}
break
fi
done < <(git rev-list upstream/"$UPSTREAM_DEFAULT_BRANCH")
aws s3 cp s3://ossci-perf-test/pytorch/cpu_runtime/"${LATEST_TESTED_COMMIT}".json cpu_runtime.json
if [[ "$COMMIT_SOURCE" == "$UPSTREAM_DEFAULT_BRANCH" ]]; then
# Prepare new baseline file
cp cpu_runtime.json new_cpu_runtime.json
python update_commit_hash.py new_cpu_runtime.json "${DEFAULT_BRANCH_COMMIT_ID}"
fi
# Include tests
# shellcheck source=./perf_test/test_cpu_speed_mini_sequence_labeler.sh
. ./test_cpu_speed_mini_sequence_labeler.sh
# shellcheck source=./perf_test/test_cpu_speed_mnist.sh
. ./test_cpu_speed_mnist.sh
# shellcheck source=./perf_test/test_cpu_speed_torch.sh
. ./test_cpu_speed_torch.sh
# shellcheck source=./perf_test/test_cpu_speed_torch_tensor.sh
. ./test_cpu_speed_torch_tensor.sh
# Run tests
export TEST_MODE="compare_with_baseline"
if [[ "$COMMIT_SOURCE" == "$UPSTREAM_DEFAULT_BRANCH" ]]; then
export TEST_MODE="compare_and_update"
fi
# Operator tests
run_test test_cpu_speed_torch ${TEST_MODE}
run_test test_cpu_speed_torch_tensor ${TEST_MODE}
# Sample model tests
run_test test_cpu_speed_mini_sequence_labeler 20 ${TEST_MODE}
run_test test_cpu_speed_mnist 20 ${TEST_MODE}
if [[ "$COMMIT_SOURCE" == "$UPSTREAM_DEFAULT_BRANCH" ]]; then
# This could cause a race condition if we are testing the same default branch commit twice,
# but the chance of them executing this line at the same time is low.
aws s3 cp new_cpu_runtime.json s3://ossci-perf-test/pytorch/cpu_runtime/"${DEFAULT_BRANCH_COMMIT_ID}".json --acl public-read
fi
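The heart of this (now deleted) CPU perf-test driver, and of the GPU variant further below, is the baseline lookup: walk the upstream default branch history and take the first commit that already has a runtime JSON uploaded to S3. A rough Python sketch of that loop, assuming boto3 is installed and the same bucket/key layout as the script; names here are illustrative, not part of the original tooling:

import subprocess

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
BUCKET = "ossci-perf-test"
KEY_PREFIX = "pytorch/cpu_runtime"  # the GPU script uses pytorch/gpu_runtime


def find_latest_tested_commit(branch: str = "upstream/main") -> str | None:
    """Return the newest ancestor commit that already has a baseline JSON in S3.

    The default branch name here is an assumption; the script derives it from
    `git remote show` instead of hard-coding it.
    """
    commits = subprocess.check_output(["git", "rev-list", branch], text=True).split()
    for commit_id in commits:
        try:
            s3.head_object(Bucket=BUCKET, Key=f"{KEY_PREFIX}/{commit_id}.json")
            return commit_id  # rev-list is newest-first, so the first hit wins
        except ClientError:
            continue
    return None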

View File

@@ -1,76 +0,0 @@
#!/bin/bash
# shellcheck source=./common.sh
source "$(dirname "${BASH_SOURCE[0]}")/common.sh"
pushd .ci/pytorch/perf_test
echo "Running GPU perf test for PyTorch..."
# Trying to uninstall PyYAML can cause problems. Workaround according to:
# https://github.com/pypa/pip/issues/5247#issuecomment-415571153
pip install -q awscli --ignore-installed PyYAML
# Set multipart_threshold to be sufficiently high, so that `aws s3 cp` is not a multipart read
# More info at https://github.com/aws/aws-cli/issues/2321
aws configure set default.s3.multipart_threshold 5GB
UPSTREAM_DEFAULT_BRANCH="$(git remote show https://github.com/pytorch/pytorch.git | awk '/HEAD branch/ {print $NF}')"
if [[ "$COMMIT_SOURCE" == "$UPSTREAM_DEFAULT_BRANCH" ]]; then
# Get current default branch commit hash
DEFAULT_BRANCH_COMMIT_ID=$(git log --format="%H" -n 1)
export DEFAULT_BRANCH_COMMIT_ID
fi
# Find the default branch commit to test against
git remote add upstream https://github.com/pytorch/pytorch.git
git fetch upstream
IFS=$'\n'
while IFS='' read -r commit_id; do
if aws s3 ls s3://ossci-perf-test/pytorch/gpu_runtime/"${commit_id}".json; then
LATEST_TESTED_COMMIT=${commit_id}
break
fi
done < <(git rev-list upstream/"$UPSTREAM_DEFAULT_BRANCH")
aws s3 cp s3://ossci-perf-test/pytorch/gpu_runtime/"${LATEST_TESTED_COMMIT}".json gpu_runtime.json
if [[ "$COMMIT_SOURCE" == "$UPSTREAM_DEFAULT_BRANCH" ]]; then
# Prepare new baseline file
cp gpu_runtime.json new_gpu_runtime.json
python update_commit_hash.py new_gpu_runtime.json "${DEFAULT_BRANCH_COMMIT_ID}"
fi
# Include tests
# shellcheck source=./perf_test/test_gpu_speed_mnist.sh
. ./test_gpu_speed_mnist.sh
# shellcheck source=./perf_test/test_gpu_speed_word_language_model.sh
. ./test_gpu_speed_word_language_model.sh
# shellcheck source=./perf_test/test_gpu_speed_cudnn_lstm.sh
. ./test_gpu_speed_cudnn_lstm.sh
# shellcheck source=./perf_test/test_gpu_speed_lstm.sh
. ./test_gpu_speed_lstm.sh
# shellcheck source=./perf_test/test_gpu_speed_mlstm.sh
. ./test_gpu_speed_mlstm.sh
# Run tests
if [[ "$COMMIT_SOURCE" == "$UPSTREAM_DEFAULT_BRANCH" ]]; then
run_test test_gpu_speed_mnist 20 compare_and_update
run_test test_gpu_speed_word_language_model 20 compare_and_update
run_test test_gpu_speed_cudnn_lstm 20 compare_and_update
run_test test_gpu_speed_lstm 20 compare_and_update
run_test test_gpu_speed_mlstm 20 compare_and_update
else
run_test test_gpu_speed_mnist 20 compare_with_baseline
run_test test_gpu_speed_word_language_model 20 compare_with_baseline
run_test test_gpu_speed_cudnn_lstm 20 compare_with_baseline
run_test test_gpu_speed_lstm 20 compare_with_baseline
run_test test_gpu_speed_mlstm 20 compare_with_baseline
fi
if [[ "$COMMIT_SOURCE" == "$UPSTREAM_DEFAULT_BRANCH" ]]; then
# This could cause a race condition if we are testing the same default branch commit twice,
# but the chance of two jobs executing this line at the same time is low.
aws s3 cp new_gpu_runtime.json s3://ossci-perf-test/pytorch/gpu_runtime/"${DEFAULT_BRANCH_COMMIT_ID}".json --acl public-read
fi
popd

View File

@@ -80,7 +80,7 @@ def grep_symbols(lib: str, patterns: list[Any]) -> list[str]:
return functools.reduce(list.__add__, (x.result() for x in tasks), [])
def check_lib_symbols_for_abi_correctness(lib: str, pre_cxx11_abi: bool = True) -> None:
def check_lib_symbols_for_abi_correctness(lib: str) -> None:
print(f"lib: {lib}")
cxx11_symbols = grep_symbols(lib, LIBTORCH_CXX11_PATTERNS)
pre_cxx11_symbols = grep_symbols(lib, LIBTORCH_PRE_CXX11_PATTERNS)
@@ -88,28 +88,12 @@ def check_lib_symbols_for_abi_correctness(lib: str, pre_cxx11_abi: bool = True)
num_pre_cxx11_symbols = len(pre_cxx11_symbols)
print(f"num_cxx11_symbols: {num_cxx11_symbols}")
print(f"num_pre_cxx11_symbols: {num_pre_cxx11_symbols}")
if pre_cxx11_abi:
if num_cxx11_symbols > 0:
raise RuntimeError(
f"Found cxx11 symbols, but there shouldn't be any, see: {cxx11_symbols[:100]}"
)
if num_pre_cxx11_symbols < 1000:
raise RuntimeError("Didn't find enough pre-cxx11 symbols.")
# Check for no recursive iterators, regression test for https://github.com/pytorch/pytorch/issues/133437
rec_iter_symbols = grep_symbols(
lib, [re.compile("std::filesystem::recursive_directory_iterator.*")]
if num_pre_cxx11_symbols > 0:
raise RuntimeError(
f"Found pre-cxx11 symbols, but there shouldn't be any, see: {pre_cxx11_symbols[:100]}"
)
if len(rec_iter_symbols) > 0:
raise RuntimeError(
f"recursive_directory_iterator in used pre-CXX11 binaries, see; {rec_iter_symbols}"
)
else:
if num_pre_cxx11_symbols > 0:
raise RuntimeError(
f"Found pre-cxx11 symbols, but there shouldn't be any, see: {pre_cxx11_symbols[:100]}"
)
if num_cxx11_symbols < 100:
raise RuntimeError("Didn't find enough cxx11 symbols")
if num_cxx11_symbols < 100:
raise RuntimeError("Didn't find enough cxx11 symbols")
def main() -> None:
@@ -121,9 +105,8 @@ def main() -> None:
else:
install_root = Path(distutils.sysconfig.get_python_lib()) / "torch"
libtorch_cpu_path = install_root / "lib" / "libtorch_cpu.so"
pre_cxx11_abi = "cxx11-abi" not in os.getenv("DESIRED_DEVTOOLSET", "")
check_lib_symbols_for_abi_correctness(libtorch_cpu_path, pre_cxx11_abi)
libtorch_cpu_path = str(install_root / "lib" / "libtorch_cpu.so")
check_lib_symbols_for_abi_correctness(libtorch_cpu_path)
if __name__ == "__main__":

View File

@@ -0,0 +1,74 @@
import ctypes
import os
import sys
from pathlib import Path


def get_gomp_thread():
    """
    Retrieves the maximum number of OpenMP threads after loading the `libgomp.so.1` library
    and the `libtorch_cpu.so` library. It then queries the
    maximum number of threads available for OpenMP parallel regions using the
    `omp_get_max_threads` function.

    Returns:
        int: The maximum number of OpenMP threads available.

    Notes:
        - The function assumes the default path for `libgomp.so.1` on AlmaLinux OS.
        - The path to `libtorch_cpu.so` is constructed based on the Python executable's
          installation directory.
        - This function is specific to environments where PyTorch and OpenMP are used
          together and may require adjustments for other setups.
    """
    python_path = Path(sys.executable).resolve()
    python_prefix = (
        python_path.parent.parent
    )  # Typically goes to the Python installation root
    # Get the additional ABI flags (if any); it may be an empty string.
    abiflags = getattr(sys, "abiflags", "")
    # Construct the Python directory name correctly (e.g., "python3.13t").
    python_version = (
        f"python{sys.version_info.major}.{sys.version_info.minor}{abiflags}"
    )
    libtorch_cpu_path = (
        python_prefix
        / "lib"
        / python_version
        / "site-packages"
        / "torch"
        / "lib"
        / "libtorch_cpu.so"
    )
    # use the default gomp path of AlmaLinux OS
    libgomp_path = "/usr/lib64/libgomp.so.1"
    os.environ["GOMP_CPU_AFFINITY"] = "0-3"
    libgomp = ctypes.CDLL(libgomp_path)
    libgomp = ctypes.CDLL(libtorch_cpu_path)
    libgomp.omp_get_max_threads.restype = ctypes.c_int
    libgomp.omp_get_max_threads.argtypes = []
    omp_max_threads = libgomp.omp_get_max_threads()
    return omp_max_threads


def main():
    omp_max_threads = get_gomp_thread()
    print(
        f"omp_max_threads after loading libgomp.so and libtorch_cpu.so: {omp_max_threads}"
    )
    if omp_max_threads == 1:
        raise RuntimeError(
            "omp_max_threads is 1. Check whether libgomp.so is loaded twice."
        )


if __name__ == "__main__":
    main()
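To see what a healthy baseline looks like before libtorch_cpu.so enters the picture, the same ctypes pattern can be exercised against the system libgomp alone; a minimal sketch, assuming a Linux host with libgomp at the AlmaLinux default path used above:

import ctypes
import os

# Same affinity pin the script applies before loading anything.
os.environ["GOMP_CPU_AFFINITY"] = "0-3"

# Query the system OpenMP runtime directly; compare this number with the one the
# script prints after libtorch_cpu.so has been loaded on top of it. The script
# treats a collapse to 1 as the symptom of libgomp being loaded twice.
libgomp = ctypes.CDLL("/usr/lib64/libgomp.so.1")
libgomp.omp_get_max_threads.restype = ctypes.c_int
print("baseline omp_get_max_threads:", libgomp.omp_get_max_threads())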

View File

@@ -7,6 +7,7 @@ import subprocess
import sys
from pathlib import Path
from tempfile import NamedTemporaryFile
from typing import Optional
import torch
import torch._dynamo
@@ -76,10 +77,13 @@ def read_release_matrix():
def test_numpy():
import numpy as np
try:
import numpy as np
x = np.arange(5)
torch.tensor(x)
x = np.arange(5)
torch.tensor(x)
except ImportError:
print("Numpy check skipped. Numpy is not installed.")
def check_version(package: str) -> None:
@@ -192,8 +196,41 @@ def test_cuda_gds_errors_captured() -> None:
)
def find_pypi_package_version(package: str) -> Optional[str]:
from importlib import metadata
dists = metadata.distributions()
for dist in dists:
if dist.metadata["Name"].startswith(package):
return dist.version
return None
def cudnn_to_version_str(cudnn_version: int) -> str:
patch = int(cudnn_version % 10)
minor = int((cudnn_version / 100) % 100)
major = int((cudnn_version / 10000) % 10000)
return f"{major}.{minor}.{patch}"
def compare_pypi_to_torch_versions(
package: str, pypi_version: str, torch_version: str
) -> None:
if pypi_version is None:
raise RuntimeError(f"Can't find {package} in PyPI for Torch: {torch_version}")
if pypi_version.startswith(torch_version):
print(f"Found matching {package}. Torch: {torch_version} PyPI {pypi_version}")
else:
raise RuntimeError(
f"Wrong {package} version. Torch: {torch_version} PyPI: {pypi_version}"
)
def smoke_test_cuda(
package: str, runtime_error_check: str, torch_compile_check: str
package: str,
runtime_error_check: str,
torch_compile_check: str,
pypi_pkg_check: str,
) -> None:
if not torch.cuda.is_available() and is_cuda_system:
raise RuntimeError(f"Expected CUDA {gpu_arch_ver}. However CUDA is not loaded.")
@@ -223,20 +260,30 @@ def smoke_test_cuda(
raise RuntimeError(
f"Wrong CUDA version. Loaded: {torch.version.cuda} Expected: {gpu_arch_ver}"
)
print(f"torch cuda: {torch.version.cuda}")
# todo add cudnn version validation
print(f"torch cudnn: {torch.backends.cudnn.version()}")
print(f"cuDNN enabled? {torch.backends.cudnn.enabled}")
print(f"torch cuda: {torch.version.cuda}")
torch.cuda.init()
print("CUDA initialized successfully")
print(f"Number of CUDA devices: {torch.cuda.device_count()}")
for i in range(torch.cuda.device_count()):
print(f"Device {i}: {torch.cuda.get_device_name(i)}")
# nccl is available only on Linux
print(f"cuDNN enabled? {torch.backends.cudnn.enabled}")
torch_cudnn_version = cudnn_to_version_str(torch.backends.cudnn.version())
print(f"Torch cuDNN version: {torch_cudnn_version}")
if sys.platform in ["linux", "linux2"]:
print(f"torch nccl version: {torch.cuda.nccl.version()}")
torch_nccl_version = ".".join(str(v) for v in torch.cuda.nccl.version())
print(f"Torch nccl; version: {torch_nccl_version}")
# PyPI dependencies are installed on Linux only, and nccl is available only on Linux.
if pypi_pkg_check == "enabled" and sys.platform in ["linux", "linux2"]:
compare_pypi_to_torch_versions(
"cudnn", find_pypi_package_version("nvidia-cudnn"), torch_cudnn_version
)
compare_pypi_to_torch_versions(
"nccl", find_pypi_package_version("nvidia-nccl"), torch_nccl_version
)
if runtime_error_check == "enabled":
test_cuda_runtime_errors_captured()
@@ -395,6 +442,13 @@ def parse_args():
choices=["enabled", "disabled"],
default="enabled",
)
parser.add_argument(
"--pypi-pkg-check",
help="Check pypi package versions cudnn and nccl",
type=str,
choices=["enabled", "disabled"],
default="enabled",
)
return parser.parse_args()
@@ -410,6 +464,7 @@ def main() -> None:
smoke_test_conv2d()
test_linalg()
test_numpy()
if is_cuda_system:
test_linalg("cuda")
test_cuda_gds_errors_captured()
@@ -418,7 +473,10 @@ def main() -> None:
smoke_test_modules()
smoke_test_cuda(
options.package, options.runtime_error_check, options.torch_compile_check
options.package,
options.runtime_error_check,
options.torch_compile_check,
options.pypi_pkg_check,
)
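Taken together, the new helpers cross-check the versions torch reports at runtime against whichever nvidia-* wheels are installed: an NCCL tuple such as (2, 26, 2) becomes "2.26.2", and a cuDNN wheel version like 9.1.0.70 is accepted for a torch-reported "9.1.0" because the comparison is a prefix match. A small standalone sketch of that lookup, using only importlib.metadata; package names and version numbers are illustrative:

from importlib import metadata


def find_installed_version(prefix: str) -> str | None:
    """First installed distribution whose name starts with `prefix`, e.g. nvidia-cudnn-cu12."""
    for dist in metadata.distributions():
        if (dist.metadata["Name"] or "").startswith(prefix):
            return dist.version
    return None


# Illustrative comparison mirroring compare_pypi_to_torch_versions().
torch_cudnn_version = "9.1.0"
pypi_version = find_installed_version("nvidia-cudnn") or "9.1.0.70"  # fallback is a made-up value
print("match" if pypi_version.startswith(torch_cudnn_version) else "mismatch")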

View File

@@ -191,6 +191,10 @@ if [[ "$BUILD_ENVIRONMENT" == *xpu* ]]; then
# shellcheck disable=SC1091
source /opt/intel/oneapi/umf/latest/env/vars.sh
fi
# shellcheck disable=SC1091
source /opt/intel/oneapi/ccl/latest/env/vars.sh
# shellcheck disable=SC1091
source /opt/intel/oneapi/mpi/latest/env/vars.sh
# Check XPU status before testing
xpu-smi discovery
fi
@@ -314,6 +318,12 @@ test_python() {
assert_git_not_dirty
}
test_python_smoke() {
# Smoke tests for H100
time python test/run_test.py --include test_matmul_cuda inductor/test_fp8 inductor/test_max_autotune $PYTHON_TEST_EXTRA_OPTION --upload-artifacts-while-running
assert_git_not_dirty
}
test_lazy_tensor_meta_reference_disabled() {
export TORCH_DISABLE_FUNCTIONALIZATION_META_REFERENCE=1
echo "Testing lazy tensor operations without meta reference"
@@ -398,8 +408,15 @@ test_inductor_aoti() {
# We need to hipify before building again
python3 tools/amd_build/build_amd.py
fi
BUILD_AOT_INDUCTOR_TEST=1 python setup.py develop
CPP_TESTS_DIR="${BUILD_BIN_DIR}" LD_LIBRARY_PATH="${TORCH_LIB_DIR}" python test/run_test.py --cpp --verbose -i cpp/test_aoti_abi_check cpp/test_aoti_inference
if [[ "$BUILD_ENVIRONMENT" == *sm86* ]]; then
BUILD_AOT_INDUCTOR_TEST=1 TORCH_CUDA_ARCH_LIST=8.6 USE_FLASH_ATTENTION=OFF python setup.py develop
# TODO: Replace me completely, as one should not use conda libstdc++, nor need special path to TORCH_LIB
LD_LIBRARY_PATH=/opt/conda/envs/py_3.10/lib/:${TORCH_LIB_DIR}:$LD_LIBRARY_PATH
CPP_TESTS_DIR="${BUILD_BIN_DIR}" python test/run_test.py --cpp --verbose -i cpp/test_aoti_abi_check cpp/test_aoti_inference -dist=loadfile
else
BUILD_AOT_INDUCTOR_TEST=1 python setup.py develop
CPP_TESTS_DIR="${BUILD_BIN_DIR}" LD_LIBRARY_PATH="${TORCH_LIB_DIR}" python test/run_test.py --cpp --verbose -i cpp/test_aoti_abi_check cpp/test_aoti_inference -dist=loadfile
fi
}
test_inductor_cpp_wrapper_shard() {
@@ -414,10 +431,11 @@ test_inductor_cpp_wrapper_shard() {
if [[ "$1" -eq "2" ]]; then
# For now, manually put the opinfo tests in shard 2, and all other tests in
# shard 1. Test specific things triggering past bugs, for now.
# shard 1. Run all CPU tests, as well as specific GPU tests triggering past
# bugs, for now.
python test/run_test.py \
--include inductor/test_torchinductor_opinfo \
-k 'linalg or to_sparse' \
-k 'linalg or to_sparse or TestInductorOpInfoCPU' \
--verbose
exit
fi
@@ -802,16 +820,7 @@ test_inductor_torchbench_smoketest_perf() {
done
}
test_inductor_get_core_number() {
if [[ "${TEST_CONFIG}" == *aarch64* ]]; then
echo "$(($(lscpu | grep 'Cluster(s):' | awk '{print $2}') * $(lscpu | grep 'Core(s) per cluster:' | awk '{print $4}')))"
else
echo "$(($(lscpu | grep 'Socket(s):' | awk '{print $2}') * $(lscpu | grep 'Core(s) per socket:' | awk '{print $4}')))"
fi
}
test_inductor_set_cpu_affinity(){
#set jemalloc
JEMALLOC_LIB="$(find /usr/lib -name libjemalloc.so.2)"
export LD_PRELOAD="$JEMALLOC_LIB":"$LD_PRELOAD"
export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:-1,muzzy_decay_ms:-1"
@@ -823,14 +832,23 @@ test_inductor_set_cpu_affinity(){
export KMP_AFFINITY=granularity=fine,compact,1,0
export KMP_BLOCKTIME=1
fi
cores=$(test_inductor_get_core_number)
# Set number of cores to 16 on Aarch64 for performance runs.
# Use nproc here instead of lscpu because it takes into account cgroups slice
cpus=$(nproc)
thread_per_core=$(lscpu | grep 'Thread(s) per core:' | awk '{print $4}')
cores=$((cpus / thread_per_core))
# Set number of cores to 16 on aarch64 for performance runs
if [[ "${TEST_CONFIG}" == *aarch64* && $cores -gt 16 ]]; then
cores=16
fi
export OMP_NUM_THREADS=$cores
end_core=$((cores-1))
export TASKSET="taskset -c 0-$end_core"
# Handle cgroups slice start and end CPU
start_cpu=$(python -c 'import os; print(min(os.sched_getaffinity(0)))')
# Leaving one physical CPU for other tasks
end_cpu=$(($(python -c 'import os; print(max(os.sched_getaffinity(0)))') - thread_per_core))
export TASKSET="taskset -c $start_cpu-$end_cpu"
}
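# Worked example (hypothetical host): nproc reports 32 CPUs in the cgroup and lscpu reports
# 2 threads per core, so cores=16 and OMP_NUM_THREADS=16. If os.sched_getaffinity(0) returns
# {0..31}, then start_cpu=0 and end_cpu=31-2=29, i.e. TASKSET="taskset -c 0-29".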
test_inductor_torchbench_cpu_smoketest_perf(){
@@ -1173,7 +1191,7 @@ build_xla() {
apply_patches
SITE_PACKAGES="$(python -c 'from distutils.sysconfig import get_python_lib; print(get_python_lib())')"
# These functions are defined in .circleci/common.sh in pytorch/xla repo
retry install_deps_pytorch_xla $XLA_DIR $USE_CACHE
retry install_pre_deps_pytorch_xla $XLA_DIR $USE_CACHE
CMAKE_PREFIX_PATH="${SITE_PACKAGES}/torch:${CMAKE_PREFIX_PATH}" XLA_SANDBOX_BUILD=1 build_torch_xla $XLA_DIR
assert_git_not_dirty
}
@@ -1474,14 +1492,11 @@ test_executorch() {
pushd /executorch
export PYTHON_EXECUTABLE=python
export EXECUTORCH_BUILD_PYBIND=ON
export CMAKE_ARGS="-DEXECUTORCH_BUILD_XNNPACK=ON -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON"
export CMAKE_ARGS="-DEXECUTORCH_BUILD_PYBIND=ON -DEXECUTORCH_BUILD_XNNPACK=ON -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON"
# For llama3
bash examples/models/llama3_2_vision/install_requirements.sh
# NB: We need to rebuild ExecuTorch runner here because it depends on PyTorch
# from the PR
bash .ci/scripts/setup-linux.sh cmake
bash .ci/scripts/setup-linux.sh --build-tool cmake
echo "Run ExecuTorch unit tests"
pytest -v -n auto
@@ -1521,12 +1536,33 @@ test_linux_aarch64() {
inductor/test_inplacing_pass inductor/test_kernel_benchmark inductor/test_layout_optim \
inductor/test_max_autotune inductor/test_memory_planning inductor/test_metrics inductor/test_multi_kernel inductor/test_pad_mm \
inductor/test_pattern_matcher inductor/test_perf inductor/test_profiler inductor/test_select_algorithm inductor/test_smoke \
inductor/test_split_cat_fx_passes inductor/test_standalone_compile inductor/test_torchinductor \
inductor/test_split_cat_fx_passes inductor/test_compile inductor/test_torchinductor \
inductor/test_torchinductor_codegen_dynamic_shapes inductor/test_torchinductor_dynamic_shapes inductor/test_memory \
inductor/test_triton_cpu_backend inductor/test_triton_extension_backend inductor/test_mkldnn_pattern_matcher inductor/test_cpu_cpp_wrapper \
--shard "$SHARD_NUMBER" "$NUM_TEST_SHARDS" --verbose
}
test_operator_benchmark() {
TEST_REPORTS_DIR=$(pwd)/test/test-reports
mkdir -p "$TEST_REPORTS_DIR"
TEST_DIR=$(pwd)
test_inductor_set_cpu_affinity
cd benchmarks/operator_benchmark/pt_extension
python setup.py install
cd "${TEST_DIR}"/benchmarks/operator_benchmark
$TASKSET python -m benchmark_all_test --device "$1" --tag-filter "$2" \
--output-dir "${TEST_REPORTS_DIR}/operator_benchmark_eager_float32_cpu.csv"
pip_install pandas
python check_perf_csv.py \
--actual "${TEST_REPORTS_DIR}/operator_benchmark_eager_float32_cpu.csv" \
--expected "expected_ci_operator_benchmark_eager_float32_cpu.csv"
}
if ! [[ "${BUILD_ENVIRONMENT}" == *libtorch* || "${BUILD_ENVIRONMENT}" == *-bazel-* ]]; then
(cd test && python -c "import torch; print(torch.__config__.show())")
(cd test && python -c "import torch; print(torch.__config__.parallel_info())")
@@ -1557,6 +1593,19 @@ elif [[ "$TEST_CONFIG" == distributed ]]; then
if [[ "${SHARD_NUMBER}" == 1 ]]; then
test_rpc
fi
elif [[ "${TEST_CONFIG}" == *operator_benchmark* ]]; then
TEST_MODE="short"
if [[ "${TEST_CONFIG}" == *cpu* ]]; then
if [[ "${TEST_CONFIG}" == *long* ]]; then
TEST_MODE="long"
elif [[ "${TEST_CONFIG}" == *all* ]]; then
TEST_MODE="all"
fi
test_operator_benchmark cpu ${TEST_MODE}
fi
elif [[ "${TEST_CONFIG}" == *inductor_distributed* ]]; then
test_inductor_distributed
elif [[ "${TEST_CONFIG}" == *inductor-halide* ]]; then
@@ -1619,6 +1668,7 @@ elif [[ "${TEST_CONFIG}" == *inductor_cpp_wrapper* ]]; then
install_torchvision
checkout_install_torchbench hf_T5 llama moco
PYTHONPATH=$(pwd)/torchbench test_inductor_cpp_wrapper_shard "$SHARD_NUMBER"
test_inductor_aoti
elif [[ "${TEST_CONFIG}" == *inductor* ]]; then
install_torchvision
test_inductor_shard "${SHARD_NUMBER}"
@@ -1672,6 +1722,8 @@ elif [[ "${BUILD_ENVIRONMENT}" == *xpu* ]]; then
test_python
test_aten
test_xpu_bin
elif [[ "${TEST_CONFIG}" == smoke ]]; then
test_python_smoke
else
install_torchvision
install_monkeytype

View File

@@ -37,6 +37,11 @@ call %INSTALLER_DIR%\activate_miniconda3.bat
if errorlevel 1 goto fail
if not errorlevel 0 goto fail
:: Update CMake
call choco upgrade -y cmake --no-progress --installargs 'ADD_CMAKE_TO_PATH=System' --apply-install-arguments-to-dependencies --version=3.27.9
if errorlevel 1 goto fail
if not errorlevel 0 goto fail
call pip install mkl-include==2021.4.0 mkl-devel==2021.4.0
if errorlevel 1 goto fail
if not errorlevel 0 goto fail
@@ -88,7 +93,7 @@ set PATH=%CUDA_PATH%\bin;%CUDA_PATH%\libnvvp;%PATH%
:cuda_build_end
set DISTUTILS_USE_SDK=1
set PATH=%TMP_DIR_WIN%\bin;%PATH%
set PATH=%TMP_DIR_WIN%\bin;C:\Program Files\CMake\bin;%PATH%
:: The latest Windows CUDA test is running on AWS G5 runner with A10G GPU
if "%TORCH_CUDA_ARCH_LIST%" == "" set TORCH_CUDA_ARCH_LIST=8.6

View File

@@ -24,7 +24,7 @@ if "%CUDA_SUFFIX%" == "" (
if "%REBUILD%"=="" (
if "%BUILD_ENVIRONMENT%"=="" (
curl --retry 3 --retry-all-errors -k https://s3.amazonaws.com/ossci-windows/magma_2.5.4_%CUDA_SUFFIX%_%BUILD_TYPE%.7z --output %TMP_DIR_WIN%\magma_2.5.4_%CUDA_SUFFIX%_%BUILD_TYPE%.7z
curl --retry 3 --retry-all-errors -k https://s3.amazonaws.com/ossci-windows/magma_2.5.4_%CUDA_SUFFIX%_%BUILD_TYPE%.7z --output %TMP_DIR_WIN%\magma_2.5.4_%CUDA_SUFFIX%_%BUILD_TYPE%.7z & REM @lint-ignore
) else (
aws s3 cp s3://ossci-windows/magma_2.5.4_%CUDA_SUFFIX%_%BUILD_TYPE%.7z %TMP_DIR_WIN%\magma_2.5.4_%CUDA_SUFFIX%_%BUILD_TYPE%.7z --quiet
)

Some files were not shown because too many files have changed in this diff.