pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-10-20 21:14:14 +08:00

Author	SHA1	Message	Date
Thanh Ha	10a9fb641b	Switch build jobs from linux.4xlarge to c7i (#165057 ) Switch build jobs that use linux.4xlarge which uses c5 instance types to c7i variant. This should improve performance by ~15-20% while cutting costs by ~10-15%. Relates to pytorch/test-infra#7175 Pull Request resolved: https://github.com/pytorch/pytorch/pull/165057 Approved by: https://github.com/huydhn	2025-10-10 15:13:40 +00:00
PyTorch MergeBot	9420944033	Revert "[AMP][Refactor] Simplify dtype support logic in autocast context manager (#163446 )" This reverts commit 960b0d5f0d0efb1f1962bddcf62e2a698e26edd2. Reverted https://github.com/pytorch/pytorch/pull/163446 on behalf of https://github.com/izaitsevfb due to breaks autocast tests on linux and mac ([comment](https://github.com/pytorch/pytorch/pull/163446#issuecomment-3390688642))	2025-10-10 15:12:46 +00:00
Chinmay Kuchinad	55f01a48af	[ROCm] Enable and fix several FSDP + Inductor distributed unit tests (#165011 ) This PR enables a number of distributed unit tests and applies necessary fixes to ensure they pass on ROCm platforms. The changes have been successfully tested on both MI200 and MI300 hardware. This work addresses the following issues: https://github.com/ROCm/frameworks-internal/issues/13586 https://github.com/ROCm/frameworks-internal/issues/13578 Enabled Tests The following tests have been enabled and are now passing: 1. test_compiled_autograd_ctx 2. test_simple_mlp_fullgraph_backend_aot_eager 3. test_simple_mlp_fullgraph_backend_aot_eager_decomp_partition 4. test_simple_mlp_fullgraph_backend_inductor 5. test_nested_fully_shard_backend_aot_eager 6. test_nested_fully_shard_backend_aot_eager_decomp_partition 7. test_nested_fully_shard_backend_inductor_fullgraph_True 8. test_nested_fully_shard_backend_inductor_fullgraph_True_graph_partition 9. test_transformer_backend_aot_eager 10. test_transformer_backend_aot_eager_decomp_partition 11. test_storage_resize_zero_gpu 12. test_storage_resize_nonzero_gpu 13. test_fake_distributed_inductor Tests skipped due to upstream issues: 1. test_nested_fully_shard_backend_inductor_fullgraph_False 2. test_transformer_backend_inductor_fullgraph_True 3. test_transformer_backend_inductor_fullgraph_True_graph_partition 4. test_transformer_backend_inductor_fullgraph_False Pull Request resolved: https://github.com/pytorch/pytorch/pull/165011 Approved by: https://github.com/jeffdaily	2025-10-10 14:10:54 +00:00
PaulZhang12	68913d8f2a	Fix truediv numerics between eager and compile (#164144 ) Addresses numeric differences between eager and compile in https://github.com/pytorch/pytorch/issues/141753 Pull Request resolved: https://github.com/pytorch/pytorch/pull/164144 Approved by: https://github.com/eellison, https://github.com/jansel, https://github.com/ngimel	2025-10-10 14:00:46 +00:00
PyTorch MergeBot	b8be796a57	Revert "[2/N] More ruff SIM fixes (#165031 )" This reverts commit 38095fbd1323ee4a9541fbcbb9b28bd20f2cd956. Reverted https://github.com/pytorch/pytorch/pull/165031 on behalf of https://github.com/albanD due to One of the changed line started to fail on trunk ([comment](https://github.com/pytorch/pytorch/pull/165031#issuecomment-3390190870))	2025-10-10 13:42:14 +00:00
Howard Huang	238dd5517d	[PP] Move profiler record_function in schedule (#164976 ) Better engineering to move the `record_function` call to also encompass the custom callback, this line is the only change: https://github.com/pytorch/pytorch/pull/164976/files#diff-1d3d91f53db88fb886901fb178d69e47776e71b8103f85688fa9ca64cc55d068R2147, the rest is just formatting. Pull Request resolved: https://github.com/pytorch/pytorch/pull/164976 Approved by: https://github.com/fegin ghstack dependencies: #162016, #164962	2025-10-10 13:09:23 +00:00
eellison	d272ed4b3e	Fix identity expansion (#165066 ) In some cases, we wrap indexing with `Identity` to prevent expansion from int32 -> int64 range. There are some checks in codegen which intend to check for constants, which did not handle Identity. Update these checks and update Identity so that it recursively prints inputs. Fix for https://github.com/pytorch/pytorch/issues/164700 Replaces https://github.com/pytorch/pytorch/pull/160190 cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @njriasan Pull Request resolved: https://github.com/pytorch/pytorch/pull/165066 Approved by: https://github.com/njriasan, https://github.com/shunting314, https://github.com/jansel	2025-10-10 13:07:15 +00:00
Yuanyuan Chen	70925bdf82	[1/N] Use "is" in python type comparison (#165037 ) It is generally recommended to use `is/is not` to compare types. This series of changes therefore applies that suggestion across the code base, with the aim of eventually enabling the related linter checks. Pull Request resolved: https://github.com/pytorch/pytorch/pull/165037 Approved by: https://github.com/mlazos	2025-10-10 12:36:50 +00:00
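The recommendation in #165037 can be illustrated with a small sketch. Every class and function name below is hypothetical and chosen only for illustration; none of them comes from the PR. The sketch shows that `is` checks class identity, whereas `==` can be redirected by a metaclass that overrides `__eq__`.

```python
class Base:
    pass


class Child(Base):
    pass


class AlwaysEqualMeta(type):
    # Hypothetical metaclass: its classes compare equal to anything via ==.
    def __eq__(cls, other):
        return True

    # Defining __eq__ clears __hash__, so restore the default.
    __hash__ = type.__hash__


class Weird(metaclass=AlwaysEqualMeta):
    pass


def is_exactly(obj: object, cls: type) -> bool:
    # Identity comparison: true only when obj's class is cls itself,
    # not a subclass and not something that merely compares equal.
    return type(obj) is cls


print(is_exactly(Child(), Base))   # subclass instance is not "exactly" Base
print(type(Weird()) == int)        # == is fooled by the metaclass
print(is_exactly(Weird(), int))    # identity comparison is not
```

When subclass instances should also match, `isinstance` is the usual choice; `is` is for exact-type checks.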
KarhouTam	960b0d5f0d	[AMP][Refactor] Simplify dtype support logic in autocast context manager (#163446 ) ## Description: This PR refactors the autocast context manager in `autocast_mode.py` to simplify and centralize the logic for checking supported dtypes for each device. The previous implementation repeated similar checks for multiple device types. Now, a single mapping `device_supported_dtypes` is used to associate device types with their supported dtypes, and the validation logic is unified. In my view, this makes the code easier to maintain and extend for new devices. Please share any suggestions and comments with me. BTW, in the original `xla` branch, the `supported_dtype` are `[torch.float16, torch.bfloat16]`, `5d8a226e23/torch/amp/autocast_mode.py (L358-L363)` but the warning message has only `torch.bfloat16`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/163446 Approved by: https://github.com/FFFrog, https://github.com/albanD	2025-10-10 12:30:06 +00:00
FFFrog	e0abcee3b5	[Code Clean] Remove support of python3.9 (#163846 ) As the title stated. Pull Request resolved: https://github.com/pytorch/pytorch/pull/163846 Approved by: https://github.com/ezyang	2025-10-10 11:11:56 +00:00
Shangdi Yu	77bf23d85c	Add an option to store large mmap weights on disk (#164526 ) As title. On Windows, we cannot modify the .dll to append weights at the end; the Windows .dll loader will complain that it is not a valid .dll file. So we store the weight blob as a separate file. 1. We add the following API, which allows passing in a pointer to the weight blob and getting the size of the weight blob. ```cpp AOTI_API AOTIRuntimeError AOTInductorModelContainerGetConstantsBlobSize( AOTInductorModelContainerHandle container_handle, uint64_t* ret_size); // Load weights from a single blob in weight_blob_ptr AOTI_API AOTIRuntimeError AOTInductorModelUpdateConstantsFromBlob( AOTInductorModelContainerHandle container_handle, const uint8_t* weight_blob_ptr); ``` 2. We also add a method in ModelContainerRunner to load the weights: if the runner sees that there is a `.blob` file in the package, it will mmap the .blob file and use its content to load the constants. 3. We also add the `USE_MMAP_EXTERNAL` macro. When this macro is defined, the model expects to load the weights from external mmap'd weights. 
Test Plan: ``` buck run @mode/dev-nosan caffe2/test/inductor:test_aot_inductor -- -r test_large_mmaped_weights_on_disk ``` Also tested for windows-cross compilation with `6542566585/demo/main_voxtral.cpp` ``` Loaded model.dll audio_encoder loaded C:\Users\shangdiy\source\repos\torchnative\demo\token_embedding\data\aotinductor\model\model.wrapper.so Loaded model.dll token_embedding loaded C:\Users\shangdiy\source\repos\torchnative\demo\text_decoder\data\aotinductor\model\model.wrapper.so Loaded model.dll Loading weights from C:\Users\shangdiy\source\repos\torchnative\demo\text_decoder\data\aotinductor\model\model.wrapper_weights.blob text_decoder loaded Load latency (ms): audio_encoder: 1011.234 archive extraction: 0.000 .so loading: 1011.197 token_embedding: 525.773 archive extraction: 0.000 .so loading: 525.704 text_decoder: 3324.130 archive extraction: 0.000 .so loading: 3323.979 Run latency (ms): audio_encoder: 285.958 audio_encoder output: dtype=bfloat16, shape=[1, 1125, 3072], numel=3456000 token_embedding: 6.676 token_embedding output: dtype=bfloat16, shape=[1, 1138, 3072], numel=3495936 text_decoder: 576.519 text_decoder output: dtype=bfloat16, shape=[1, 1138, 131072], numel=149159936 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/164526 Approved by: https://github.com/desertfire	2025-10-10 07:53:57 +00:00
PyTorch MergeBot	d2cb183344	Revert "[inductor] verify determinism with inductor benchmark script (#164904 )" This reverts commit a3c700656f9a666eb33074b60333a23eb7e99a15. Reverted https://github.com/pytorch/pytorch/pull/164904 on behalf of https://github.com/huydhn due to Sorry for reverting your PR but there seems to be some failed vLLM failures coming out of this ([comment](https://github.com/pytorch/pytorch/pull/164904#issuecomment-3388443678))	2025-10-10 06:23:07 +00:00
Yuanyuan Chen	38095fbd13	[2/N] More ruff SIM fixes (#165031 ) This is follow-up of #164695 to apply ruff SIM rules to more files. Most changes are about simplifying dict.get because None is already the default value. Pull Request resolved: https://github.com/pytorch/pytorch/pull/165031 Approved by: https://github.com/mlazos	2025-10-10 05:37:46 +00:00
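The `dict.get` simplification mentioned in #165031 can be sketched as follows. The dictionary contents and variable names are made up for illustration; the only fact relied on is that `dict.get` already returns `None` when the key is missing and no default is given.

```python
# Hypothetical settings dictionary, used only for illustration.
settings = {"backend": "inductor"}

# Before: the explicit None default is redundant.
mode_explicit = settings.get("mode", None)

# After: same behavior, since None is already the default.
mode_implicit = settings.get("mode")

# A non-None default is still meaningful and should stay explicit.
mode_with_default = settings.get("mode", "default")
```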
Yuanyuan Chen	ffc9559d9f	[7/N] Apply ruff UP035 rule (#164653 ) This PR is follow-up of #164438 to continue applying `UP035` rule. All changes are about proper `Callable` importation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/164653 Approved by: https://github.com/aorenste viable/strict/1760091921	2025-10-10 05:16:17 +00:00
Simon Layton	172d6ed8b8	Refactor _scaled_grouped_mm_cuda dispatch (#165060 ) Summary: * Clean & simplify different scaling recipe dispatch * Split out recipes into separate dispatch functions Test Plan: ``` pytest -svv -k grouped test/test_scaled_matmul_cuda.py ``` Reviewers: Subscribers: Tasks: Tags: Signed-off-by: Simon Layton <simonlayton@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/165060 Approved by: https://github.com/danielvegamyhre, https://github.com/ngimel	2025-10-10 04:44:25 +00:00
Nikita Shulga	9a3c4b917e	[CMake] Remove forcing of `-O2` from `torch_compile_options` (#164894 ) That was introduced by `75a65ffe0f`. Hat tip to @jathu for alerting me about the issue. As a result, all our PyTorch builds were shipped with `-O2` for almost all of PyTorch's modern history. Partially undoes the damage introduced by https://github.com/pytorch/pytorch/pull/128406, which causes cross-ISA symbols to leak, to be properly followed up in https://github.com/pytorch/pytorch/issues/165123 Pull Request resolved: https://github.com/pytorch/pytorch/pull/164894 Approved by: https://github.com/ezyang	2025-10-10 04:43:53 +00:00
PyTorch MergeBot	df514a6d5a	Revert "[inductor][eazy] change how torch.use_deterministic_algorithms affect inductor (#164905 )" This reverts commit 344e6365a0068c2d2847fcec0c55dd53291d475e. Reverted https://github.com/pytorch/pytorch/pull/164905 on behalf of https://github.com/huydhn due to Sorry for reverting your PR but there seems to be some failed vLLM failures coming out of this ([comment](https://github.com/pytorch/pytorch/pull/164905#issuecomment-3388258660))	2025-10-10 04:37:09 +00:00
Maggie Moss	48fe858fef	Fix error, remove file from pyrefly checking (#165094 ) Reported issue with formatting and parsing. Removing suppressions and avoiding this file in future type checking until we can get a more complete fix in. Pull Request resolved: https://github.com/pytorch/pytorch/pull/165094 Approved by: https://github.com/albanD	2025-10-10 04:34:51 +00:00
PyTorch MergeBot	7ab00c7c17	Revert "Hotfix test scaled matmul cuda (#165104 )" This reverts commit 9aa92f246fa5fe5cfda17970d41d167b19a0612a. Reverted https://github.com/pytorch/pytorch/pull/165104 on behalf of https://github.com/malfet due to Looks like it broke cuda tests, isn't it, see `44b1ff54e9/1` ([comment](https://github.com/pytorch/pytorch/pull/165104#issuecomment-3388247886)) viable/strict/1760089782	2025-10-10 04:32:18 +00:00
Nikita Shulga	44b1ff54e9	[CD] Do not propagate download.pytorch.org IP into container (#165075 ) Followup after https://github.com/pytorch/pytorch/pull/164969 Should fix binary build test failures Pull Request resolved: https://github.com/pytorch/pytorch/pull/165075 Approved by: https://github.com/seemethere, https://github.com/huydhn ghstack dependencies: #164968, #164969	2025-10-10 04:27:29 +00:00
PyTorch MergeBot	daea35df5c	Revert "[CD] Do not propagate download.pytorch.org IP into container (#165075 )" This reverts commit 6d27a8e5093ee2a21d44dceeeffcb272e6e0f655. Reverted https://github.com/pytorch/pytorch/pull/165075 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](https://github.com/pytorch/pytorch/pull/165075#issuecomment-3388228013))	2025-10-10 04:20:51 +00:00
Laith Sakka	7f2a902ea2	more sizelike deprecation (#164889 ) Remove expect_size C++ bindings and usages. Pull Request resolved: https://github.com/pytorch/pytorch/pull/164889 Approved by: https://github.com/mlazos ghstack dependencies: #164884, #164885, #164886, #164887, #164888	2025-10-10 03:45:06 +00:00
Mikayla Gawarecki	9c057d9863	[BE] Refresh documentation for stable ABI / API (#163899 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/163899 Approved by: https://github.com/janeyx99	2025-10-10 03:26:28 +00:00
Yiming Zhou	938869e7d3	[DTensor] Improve sharding propagation error msg in DTensor dispatch (#164623 ) Fixes #164543 This PR improves the `__str__` method of DTensor's `OpSchema` to provide better readable error message when dispatch fails as the error message prints `{op_info.schema}` example 1 `aten.embedding` ``` aten.embedding.default(Spec(f32[2048, 256](S(0))), Spec(i64[16, 2048](S(0)R))) on DeviceMesh((dp=2, tp=2), 'cuda', stride=(2, 1))) ``` example 2 `aten.mm` ``` aten.mm.default(Spec(f32[1024, 512](S(1))), Spec(f32[512, 256](S(0)))) on DeviceMesh((tp=4), 'cuda', stride=(1,))) ``` example 3 `aten._scaled_dot_product_flash_attention` ``` aten._scaled_dot_product_flash_attention.default(Spec(f16[8, 16, 128, 64](RS(1))), Spec(f16[8, 16, 128, 64](RS(1))), Spec(f16[8, 16, 128, 64](RS(1)))) on DeviceMesh((dp=2, tp=4), 'cuda', stride=(4, 1))) ``` Added test ``` python test/distributed/tensor/test_dtensor_ops.py -k test_embedding_error_msg ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/164623 Approved by: https://github.com/zpcore	2025-10-10 03:16:04 +00:00
Yuanyuan Chen	ce6b589545	Enable B904 check of flake8 (#165047 ) The description of `B904` is `Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling. ` Pull Request resolved: https://github.com/pytorch/pytorch/pull/165047 Approved by: https://github.com/Lucaskabela	2025-10-10 03:08:01 +00:00
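The pattern that `B904` asks for can be sketched like this. The function names and error messages are hypothetical; the sketch only shows the two forms the rule accepts, `raise ... from err` and `raise ... from None`.

```python
def parse_port(text: str) -> int:
    """Chain the original error so the traceback shows the real cause."""
    try:
        return int(text)
    except ValueError as err:
        # B904-compliant: the new exception records err as its __cause__.
        raise RuntimeError(f"invalid port: {text!r}") from err


def parse_port_quiet(text: str) -> int:
    """Deliberately hide the original error from the traceback."""
    try:
        return int(text)
    except ValueError:
        # Also B904-compliant: 'from None' suppresses the implicit context.
        raise RuntimeError(f"invalid port: {text!r}") from None
```

Without either form, Python still attaches the original exception as implicit context, and the traceback reads "During handling of the above exception, another exception occurred", which looks like a bug in the handler itself.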
Dzmitry Huba	ae25dd51fc	Simplifying computation of the final result for equals op on DTensor (#164999 ) Instead of collecting local results using all_gather_object followed by a local reduction, with this change we switch to using a single all_reduce with the MIN reduction operation to compute the final equals result. This change is needed to enable the LocalTensor work (all_gather_object introduces challenges for DTensor and LocalTensor integration). topic: not user facing Pull Request resolved: https://github.com/pytorch/pytorch/pull/164999 Approved by: https://github.com/ezyang	2025-10-10 03:01:28 +00:00
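The idea behind #164999 can be sketched without any distributed runtime. The code below is a simulation under stated assumptions: a plain Python list stands in for the per-rank local results, and the two helper functions are hypothetical names, not the PR's code. It shows why a MIN reduction over 0/1 flags gives the same answer as gathering every flag and reducing locally.

```python
def equal_via_gather(local_results: list[bool]) -> bool:
    # Simulates all_gather_object: every rank ends up with the full list,
    # then reduces it locally.
    gathered = list(local_results)
    return all(gathered)


def equal_via_min_reduce(local_results: list[bool]) -> bool:
    # Simulates all_reduce with MIN: each rank contributes 1 (equal) or
    # 0 (not equal); the minimum is 1 only if every rank contributed 1.
    flags = [1 if result else 0 for result in local_results]
    return min(flags) == 1


# Hypothetical per-rank outcomes for a 4-rank job.
all_ranks_equal = [True, True, True, True]
one_rank_differs = [True, False, True, True]
```

The reduction moves a single integer per rank instead of pickled Python objects, which is one reason it is easier to integrate with tensor-only code paths.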
Simon Fan	a61d0de9f9	[hop] support local_map filtered gradients (#164437 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/164437 Approved by: https://github.com/ezyang ghstack dependencies: #164296, #164321, #164419, #164420, #164340, #163602, #164431, #164433	2025-10-10 02:34:27 +00:00
Simon Fan	3ad88924ad	[hop] support local_map None placements (#164433 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/164433 Approved by: https://github.com/ezyang ghstack dependencies: #164296, #164321, #164419, #164420, #164340, #163602, #164431	2025-10-10 02:34:27 +00:00
Simon Fan	3241b9c15f	[hop] support local_map None gradients (#164431 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/164431 Approved by: https://github.com/bdhirsh ghstack dependencies: #164296, #164321, #164419, #164420, #164340, #163602	2025-10-10 02:34:27 +00:00
Simon Fan	25d4d5107e	[dynamo] trace local_map with local shapes for AP (#163602 ) Context is in https://www.internalfb.com/excalidraw/EX519691 and https://docs.google.com/document/d/1qnuXLZk_GYt_PksHTwkn7L2ELRDnYlIRPkHAlXTyuhw/edit?tab=t.0. And the description of the previous PR: https://github.com/pytorch/pytorch/pull/164340. The previous PR adds the support on the HOP side for eager execution and AOTAutograd. Dynamo is still passing the HOP a subgraph with wrong shapes. This PR fixes that. This is similar to the HOP implementation, however we additionally need to manually keep the TensorVariable metadata in sync. Pull Request resolved: https://github.com/pytorch/pytorch/pull/163602 Approved by: https://github.com/ydwu4 ghstack dependencies: #164296, #164321, #164419, #164420, #164340	2025-10-10 02:34:27 +00:00
Simon Fan	e4fe811be8	[hop] trace local_map with local shapes in fake key (#164340 ) Context is in https://www.internalfb.com/excalidraw/EX519691 and https://docs.google.com/document/d/1qnuXLZk_GYt_PksHTwkn7L2ELRDnYlIRPkHAlXTyuhw/edit?tab=t.0. So for Autoparallel initial trace, we want to trace the graph with global shapes initially. But, for the local_map region, we are forced to trace with the expected local tensors. To the tracers, this looks weird, because it's a plain tensor input (representing DTensor's full tensor .to_local()) that we need to "redistribute". After hacking a miserable version that had cross-key dependencies, @ydwu4 proposed this simpler approach to override the fake key. This means the shape conversion will be invisible to all dispatch keys above fake, this covers all current tracing mechanisms. This manifests as the joint graph for the HOP body being traced with local shapes: ```python # HOP forward, note local shapes (10, 80) class GraphModule(torch.nn.Module): def forward(self, primals_0: "f32[10, 80]"): # No stacktrace found for following nodes view: "f32[800]" = torch.ops.aten.view.default(primals_0, [-1]); primals_0 = None add: "f32[800]" = torch.ops.aten.add.Tensor(view, 10); view = None view_1: "f32[10, 80]" = torch.ops.aten.view.default(add, [10, 80]); add = None return (view_1,) # HOP backward, note local shapes (10, 80) class GraphModule(torch.nn.Module): def forward(self, tangents_0: "f32[10, 80]"): # No stacktrace found for following nodes clone: "f32[10, 80]" = torch.ops.aten.clone.default(tangents_0); tangents_0 = None return (clone,) ``` while the rest of the graph is still traced with global shapes: ```python # Parent graph joint, note global shapes (80, 80) class inner_f(torch.nn.Module): def forward(self, primals, tangents): primals_1: "f32[80, 80]"; tangents_1: "f32[80, 80]"; primals_1, tangents_1, = fx_pytree.tree_flatten_spec([primals, tangents], self._in_spec) # File: 
/home/xmfan/core/a/pytorch/test/higher_order_ops/test_local_map.py:597 in forward, code: return fn(x) call_local_map = torch._higher_order_ops.local_map.call_local_map(primals_1); primals_1 = None getitem: "f32[80, 80]" = call_local_map[0]; call_local_map = None call_local_map_1 = torch._higher_order_ops.local_map.call_local_map(tangents_1); tangents_1 = None getitem_1: "f32[80, 80]" = call_local_map_1[0]; call_local_map_1 = None return pytree.tree_unflatten([getitem, getitem_1], self._out_spec) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/164340 Approved by: https://github.com/ydwu4 ghstack dependencies: #164296, #164321, #164419, #164420	2025-10-10 02:34:27 +00:00
Simon Fan	82c71af59a	[hop] local_map validate partitioned fw/bw wrt placements (#164420 ) Reviewed GPT-5 Summary: Summary / Goal Add validation that partitioned forward/backward graphs respect placements. Details - Validates placement alignment in local_map. - The HOP's autograd key gets called when we are tracing the joint, we need to validate: - the inputs to the HOP's fwd gm (typically this is the dynamo rewritten inputs) - the inputs to the HOP partitioned fwd/bwd gm - the outputs of the HOP partitioned fwd/bwd gm Motivation Catch mismatch errors earlier, improve debugging. Pull Request resolved: https://github.com/pytorch/pytorch/pull/164420 Approved by: https://github.com/ezyang ghstack dependencies: #164296, #164321, #164419	2025-10-10 02:34:27 +00:00
Simon Fan	7bd704a346	[hop] local_map fix fw_gm/bw_gm naming (#164419 ) Reviewed GPT5 summary: Summary / Goal Fix inconsistent variable naming for forward/backward graphs. Details - Those methods are actually for both fw and bw graphs now that we reuse the same op for fw/bw Motivation Improves clarity, avoids confusion. Pull Request resolved: https://github.com/pytorch/pytorch/pull/164419 Approved by: https://github.com/bdhirsh ghstack dependencies: #164296, #164321	2025-10-10 02:34:27 +00:00
Simon Fan	ae139b73e0	[dynamo] Better error message for local_map subgraph mismatches number of inputs/outputs with placement info (#164321 ) Reviewed GPT5 summary: Summary / Goal Improve error reporting when local_map subgraph input/output counts mismatch placement info. Details - Adds descriptive runtime error messages. Motivation Helps debug local_map misalignments. ```python AssertionError: Expecting 2 inputs to local_map function based on placements, but found 1. If the count matches for eager, Dynamo may have flattened inputs to the function or found additional tensors used via closures. Please adjust the input placements to match what the traced graph sees: class GraphModule(torch.nn.Module): def forward(self, l_args_0_: "f32[8, 8, 16]"): # File: /home/xmfan/core/a/pytorch/test/higher_order_ops/test_local_map.py:523 in mismatch_input, code: return x + scalar, scalar child: "f32[8, 8, 16]" = l_args_0_ + 10; l_args_0_ = None return (child,) . ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/164321 Approved by: https://github.com/ezyang, https://github.com/mlazos ghstack dependencies: #164296	2025-10-10 02:34:27 +00:00
Simon Fan	cbaa07e438	[dtensor] add util to compute expected local sizes/strides for even sharding (#164296 ) Reviewed GPT5 summary: Summary / Goal Add a utility to compute expected local tensor sizes and strides under even sharding in dtensor. Details - New function in `torch/distributed/tensor/_utils.py`. - Computes local sizes/strides given global shape, mesh, and placements. - Enforces divisibility of global dimension by mesh size (strict even sharding). - Complements `compute_global_tensor_info`. Motivation Ensures correctness for stride/layout computations in distributed tensors. Pull Request resolved: https://github.com/pytorch/pytorch/pull/164296 Approved by: https://github.com/ezyang	2025-10-10 02:34:27 +00:00
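The computation described in #164296 can be approximated in plain Python. This is a sketch under assumptions, not the utility added to `torch/distributed/tensor/_utils.py`: the function name, its signature, the encoding of placements as "tensor dim or `None`", and the choice of contiguous row-major strides are all made up here for illustration.

```python
from typing import Optional, Sequence


def even_local_size_and_stride(
    global_shape: Sequence[int],
    mesh_shape: Sequence[int],
    shard_dims: Sequence[Optional[int]],
) -> tuple[tuple[int, ...], tuple[int, ...]]:
    """shard_dims[i] is the tensor dim sharded over mesh dim i, or None
    when the tensor is replicated over that mesh dim (assumed encoding)."""
    local = list(global_shape)
    for mesh_size, dim in zip(mesh_shape, shard_dims):
        if dim is None:
            continue  # replicated: local size is unchanged
        if local[dim] % mesh_size != 0:
            # Strict even sharding: the dimension must divide evenly.
            raise ValueError(
                f"dim {dim} of size {local[dim]} is not divisible by {mesh_size}"
            )
        local[dim] //= mesh_size

    # Assumption: the local tensor is contiguous in row-major order.
    strides = []
    running = 1
    for size in reversed(local):
        strides.append(running)
        running *= size
    strides.reverse()
    return tuple(local), tuple(strides)
```

For example, a global `(80, 80)` tensor sharded on dim 0 over a mesh of size 8 yields a local size of `(10, 80)` under these assumptions.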
Yuanyuan Chen	bc0e2a0d2b	Fix a condition error in torch/_inductor/codegen/debug_utils.py (#165033 ) This PR fixes the condition ``` if arg_signatures is None and self.kernel_type == "cpp" or "extern" ``` which is interpreted as ``` if (arg_signatures is None and self.kernel_type == "cpp") or ("extern"): ``` and it is always evaluated to `True`. According to the context the intention was ``` if arg_signatures is None and (self.kernel_type == "cpp" or self.kernel_type == "extern"): ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/165033 Approved by: https://github.com/Skylion007	2025-10-10 02:20:00 +00:00
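The precedence issue fixed in #165033 can be reproduced in isolation. The two functions below are hypothetical stand-ins that take the values as parameters instead of reading `self.kernel_type`; the fixed version uses a tuple membership test, which is equivalent to the two `==` comparisons quoted in the PR description.

```python
def buggy_condition(arg_signatures, kernel_type: str) -> bool:
    # Parsed as: (arg_signatures is None and kernel_type == "cpp") or "extern"
    # The non-empty string "extern" is truthy, so this is always True.
    return bool(arg_signatures is None and kernel_type == "cpp" or "extern")


def fixed_condition(arg_signatures, kernel_type: str) -> bool:
    # Intended meaning: no signatures AND the kernel is one of the two types.
    return arg_signatures is None and kernel_type in ("cpp", "extern")


# The buggy form ignores both inputs entirely.
print(buggy_condition(["sig"], "triton"))
print(fixed_condition(["sig"], "triton"))
```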
drisspg	0747d95994	Add Loads from fixed inputs (#162031 ) ## TODO Check on multi indices ```Python @cute.jit def score_mod(tSrS_ssa, b_idx, h_idx, q_idx, kv_idx, buffers): in_ptr4 = buffers[0] tmp0 = tSrS_ssa tmp1 = b_idx tmp2 = h_idx tmp3 = cute.make_fragment(1, cutlass.Int32) tmp4 = tmp3.store(32*tmp1 + tmp2) tmp5 = cute.make_fragment(1, cutlass.BFloat16) tmp6 = tmp3[0] tmp7 = tmp5[0] = (in_ptr4[tmp6]) tmp8 = (tmp5.load()).to(cutlass.Float32) tmp9 = (tmp0 + tmp8) tSrS_ssa = tmp9 return tSrS_ssa ``` I don't think that ``` tmp4 = tmp3.store(32*tmp1 + tmp2) tmp5 = cute.make_fragment(1, cutlass.BFloat16) tmp6 = tmp3[0] tmp7 = tmp5[0] = (in_ptr4[tmp6]) ``` is right, since this tmp6 value will be larger than the actual index dim (in this case it's B) -> see if it's possible to 1d index Pull Request resolved: https://github.com/pytorch/pytorch/pull/162031 Approved by: https://github.com/v0i0 ghstack dependencies: #161118	2025-10-10 01:23:37 +00:00
drisspg	0a2cde2f06	Add Flash Attention support to FlexAttention (#161118 ) Relies on this PR in Flash Attention: https://github.com/Dao-AILab/flash-attention/pull/1840 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161118 Approved by: https://github.com/v0i0	2025-10-10 01:23:37 +00:00
Jithun Nair	c7b57d9349	Add gfx1100 to build target for ROCm docker builds (#165103 ) Fixes issue of gfx1100 test jobs timing out Pull Request resolved: https://github.com/pytorch/pytorch/pull/165103 Approved by: https://github.com/jeffdaily	2025-10-10 01:18:56 +00:00
PyTorch MergeBot	7614338b69	Revert "Add SVE128 ISA (#158932 )" This reverts commit 92284fb2ff44f09a9c7df0d8cf6cac9903e376a4. Reverted https://github.com/pytorch/pytorch/pull/158932 on behalf of https://github.com/malfet due to Hmm, but from OSS point of view, this is a no-op ([comment](https://github.com/pytorch/pytorch/pull/158932#issuecomment-3387961238))	2025-10-10 01:17:02 +00:00
Edward Z. Yang	a6fa4f9c28	Do not decompose in functionalization/proxy tensor if autograd wouldn't have decomposed (#164939 ) This fixes AOTAutograd rms_norm not being bitwise equivalent to eager, because it avoids a decomposition. You can force the decomposition by having the decomposition in the dispatch table, but if eager mode wouldn't have decomposed (because it went to the fused one), we now default to preserving the fused call by default. This largely reverts https://github.com/pytorch/pytorch/pull/103275/ for view ops. This means that in inference mode we could hit the wrong C++ kernel; if this occurs we should just SymInt'ify the C++ kernel. Another neat side effect of this change is that Inductor's generated kernels for rms_norm now have rms_norm in their name. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/164939 Approved by: https://github.com/bdhirsh	2025-10-10 00:15:00 +00:00
Shunting Zhang	344e6365a0	[inductor][eazy] change how torch.use_deterministic_algorithms affect inductor (#164905 ) Previously, when torch.are_deterministic_algorithms_enabled() is True, Inductor will - skip autotuning pointwise kernels - pick a fixed (and quite arbitrary) config for reduction. This PR changes the behavior to - for pointwise kernels, we still do autotuning - for reduction kernels, we use the recently added heuristic to pick a config Pull Request resolved: https://github.com/pytorch/pytorch/pull/164905 Approved by: https://github.com/jansel, https://github.com/v0i0 ghstack dependencies: #164801, #164532, #164904	2025-10-10 00:00:58 +00:00
Shunting Zhang	a3c700656f	[inductor] verify determinism with inductor benchmark script (#164904 ) Verify the deterministic mode with torch.compile benchmark scripts. Here is what my testing script does (pasted at the end): - run a model in default mode, save its result - run the model again in default mode, but distort the benchmarking results. Compare it with the saved result. - Do the above again in deterministic mode. I tried to test a few models - BertForMaskedLM and GoogleFnet: I can repro the numeric change by distorting the benchmark result in the default mode. The non-determinism is gone in the deterministic mode - DistillGPT2: I cannot repro the numeric change by distorting the benchmarking result in the default mode. It does not surprise me much. Reduction order change does not always cause numeric change. ``` model=GoogleFnet export TORCHINDUCTOR_WRITE_ARE_DETERMINISTIC_ALGORITHMS_ENABLED=0 export TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 # disable autotune cache export TORCHINDUCTOR_FX_GRAPH_REMOTE_CACHE=0 export TORCHINDUCTOR_FX_GRAPH_CACHE=0 export TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_shunting/ export TORCHINDUCTOR_BENCHMARK_KERNEL=1 export TORCHINDUCTOR_UNIQUE_KERNEL_NAMES=1 export INDUCTOR_TEST_DISABLE_FRESH_CACHE=1 # Non deterministic mode # --float32 rather than --amp to make it easier to repro non-deterministic echo "Save results for non-deterministic mode" python benchmarks/dynamo/huggingface.py --backend inductor --float32 --accuracy --only $model --training --disable-cudagraphs --save-model-outputs-to=/tmp/saved-non-deterministic.pkl echo "Compare results with distorted benchmarking in non-deterministic mode" TORCHINDUCTOR_DISTORT_BENCHMARKING_RESULT=inverse python benchmarks/dynamo/huggingface.py --backend inductor --float32 --accuracy --only $model --training --disable-cudagraphs --compare-model-outputs-with=/tmp/saved-non-deterministic.pkl echo "Save results for deterministic mode" TORCHINDUCTOR_DETERMINISTIC=1 python 
benchmarks/dynamo/huggingface.py --backend inductor --float32 --accuracy --only $model --training --disable-cudagraphs --save-model-outputs-to=/tmp/saved-deterministic.pkl echo "Compare results with distorted benchmarking in deterministic mode" TORCHINDUCTOR_DETERMINISTIC=1 TORCHINDUCTOR_DISTORT_BENCHMARKING_RESULT=inverse python benchmarks/dynamo/huggingface.py --backend inductor --float32 --accuracy --only $model --training --disable-cudagraphs --compare-model-outputs-with=/tmp/saved-deterministic.pkl ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/164904 Approved by: https://github.com/jansel, https://github.com/v0i0 ghstack dependencies: #164801, #164532	2025-10-10 00:00:58 +00:00
Yidi Wu	600db525bd	[easy][while_loop] use copy_input instead of clone in _clone_aliased_inputs (#164955 ) Compared with clone, ExternKernel.copy_input additionally realizes the buffer; downstream code assumes the input buffers are realized. Pull Request resolved: https://github.com/pytorch/pytorch/pull/164955 Approved by: https://github.com/BoyuanFeng	2025-10-09 23:39:00 +00:00
Animesh Jain	f6de195616	[dynamo][trace_rules] Add ao.quantization (#165069 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/165069 Approved by: https://github.com/tugsbayasgalan, https://github.com/mlazos	2025-10-09 23:08:42 +00:00
angelayi	4a0df39f81	Symintify fused_scaled_matmul_reduce_scatter (#165086 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/165086 Approved by: https://github.com/zou3519, https://github.com/Skylion007	2025-10-09 23:07:40 +00:00
PyTorch MergeBot	34ac9b61cb	Revert "[export] Turn on install_free_tensors flag (#164691 )" This reverts commit 0e9b3a772ab96e998ab85591d5b2a9c1d41bacb0. Reverted https://github.com/pytorch/pytorch/pull/164691 on behalf of https://github.com/izaitsevfb due to breaks tests internally, author asked to revert, see [D84230990](https://www.internalfb.com/diff/D84230990) ([comment](https://github.com/pytorch/pytorch/pull/164691#issuecomment-3387718323))	2025-10-09 22:53:50 +00:00
Jeff Daily	9aa92f246f	Hotfix test scaled matmul cuda (#165104 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/165104 Approved by: https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-10-09 22:51:30 +00:00
Tugsbayasgalan Manlaibaatar	a57a14868d	Better handling of restore_state_dict (#164401 ) After lean export, we might want to be able to restore the original fqn. This PR refactors one util function in export that sort of does this. Note that strict_export has some complicated logic of updating the graph signature as well which we don't want. I think we can gradually make this util more refined by handling constants, non persistent buffers etc and change how strict_export does it today. Differential Revision: [D83687844](https://www.internalfb.com/diff/D83687844) Pull Request resolved: https://github.com/pytorch/pytorch/pull/164401 Approved by: https://github.com/avikchaudhuri viable/strict/1760070345	2025-10-09 22:39:11 +00:00
PyTorch MergeBot	47956196d9	Revert "Call internal log_compilation_event if it exists (#164855 )" This reverts commit 98a081a24c22072362dc536afd39a469e28939d4. Reverted https://github.com/pytorch/pytorch/pull/164855 on behalf of https://github.com/albanD due to We should not land this kind of code in core ([comment](https://github.com/pytorch/pytorch/pull/164855#issuecomment-3387692988))	2025-10-09 22:38:45 +00:00

94270 Commits