## Problem
There are limitations with today's `atomically_apply_size_hint`, even though it works for most of the failures we've observed so far. However, it's easy to come up with an edge case.
Suppose you encounter this setup.
```
a: [s0 + u0]
b: [s1 + u1]
c: [u2 + u3]
d: [u100]
```
Today, we use a few heuristics to specify the LHS and RHS for replacements.
10d2734d9b/torch/_inductor/sizevars.py (L730-L759)
It's possible to end up with these replacement rules. Notice how there's no replacement for `s1 + u1` and `u2 + u3` :( That's because, today, which expression ends up as the LHS vs. the RHS matters a lot, and `s1 + u1` & `u2 + u3` happened to end up on the RHS.
```
s0 + u0 => s1 + u1
s0 + u0 => u2 + u3 # overrides previous replacement; each expr only gets one replacement
s0 + u0 => u100 # overrides previous replacement; ditto
```
I believe what we really want is this: everybody gets a replacement! And they all should (eventually) settle at the same canonical expr (i.e. `u100`) when running the replacement several times.
```
s1 + u1 ==> s0 + u0
u2 + u3 ==> s0 + u0
s0 + u0 ==> u100
```
We can just short-cut this by using the canonical expr as the replacement.
```
s1 + u1 ==> u100
u2 + u3 ==> u100
s0 + u0 ==> u100
```
## Implementation
I offer one way to deal with this:
1. ensure every expression has one canonical replacement (i.e. `u100`)
2. if two expressions are equal (inferred from `deferred_runtime_asserts`), then they must have the same canonical replacement
We can implement the above with union find.
* Whenever you see `Eq(lhs, rhs)` then do `union(lhs, rhs)`.
* Whenever you want to find the canonical replacement for a given expr then do `find(expr)`.
* When picking the canonical replacement we can use a few heuristics, e.g. (1) prefer a fully backed expr, (2) prefer replacing with sub-expressions, and whatever else we'd like (see the sketch below).
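A minimal union-find sketch of this idea (the class, heuristic, and usage below are hypothetical illustrations, not the actual `sizevars.py` implementation):
```python
import sympy


class CanonicalReplacements:
    """Union-find over sympy exprs; find() returns the canonical replacement."""

    def __init__(self) -> None:
        self.parent: dict[sympy.Expr, sympy.Expr] = {}

    def find(self, expr: sympy.Expr) -> sympy.Expr:
        root = expr
        while self.parent.get(root, root) != root:
            root = self.parent[root]
        # Path compression: point every node on the walk directly at the root.
        node = expr
        while node != root:
            nxt = self.parent[node]
            self.parent[node] = root
            node = nxt
        return root

    def union(self, lhs: sympy.Expr, rhs: sympy.Expr) -> None:
        # Call this for every Eq(lhs, rhs) in deferred_runtime_asserts.
        a, b = self.find(lhs), self.find(rhs)
        if a == b:
            return
        # Stand-in heuristic for picking the canonical representative: prefer
        # the expression with fewer ops (the real heuristic would prefer a
        # fully backed expr, sub-expressions, etc., as described above).
        canonical, other = sorted((a, b), key=sympy.count_ops)
        self.parent[other] = canonical


s0, s1, u0, u1, u2, u3, u100 = sympy.symbols("s0 s1 u0 u1 u2 u3 u100")
uf = CanonicalReplacements()
uf.union(s0 + u0, s1 + u1)
uf.union(s0 + u0, u2 + u3)
uf.union(s0 + u0, u100)
# Everybody settles at the same canonical expr, u100.
assert uf.find(s1 + u1) == uf.find(u2 + u3) == uf.find(s0 + u0) == u100
```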
Differential Revision: [D84549260](https://our.internmc.facebook.com/intern/diff/D84549260)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164324
Approved by: https://github.com/laithsakka
### Implementation of #151705
This PR introduces the initial implementation of native `tl.dot` support in Inductor, with the goal of generating Triton matmul kernels directly—without relying on predefined templates.
To avoid complexity and ease the review process, I plan to split this work into two phases as outlined in #151705:
1. **Basic support** (this PR)
2. **Lazy broadcasting** for optimal performance (future PR)
### Summary of This PR
This PR implements the basic functionality. It does **not** include lazy broadcasting, so the generated kernels may involve explicit `tl.reshape` and `tl.trans` operations before calling `tl.dot`, which introduces some overhead.
### Notable Changes
1. Adds a new config flag: `config.triton.enable_native_matmul` (see the usage sketch after this list)
2. Introduces a new `ops.dot` IR node in Inductor and lowers `aten.mm` and `aten.bmm` to it when native matmul is enabled
3. Enforces tiling suitable for matmul when the native matmul flag is enabled
4. Implements code generation for `ops.dot`
5. Adds Triton autotuning heuristics: for now, I’ve copied the configuration from the existing matmul templates. However, this may not be optimal—it currently takes a long time to tune, and I think there must be a better way to tackle this.
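A minimal usage sketch of the flag from item 1 (the flag name comes from this PR; the model and shapes below are illustrative, and the lowering only kicks in where `aten.mm`/`aten.bmm` are hit):
```python
import torch
from torch._inductor import config

# Opt in to native tl.dot codegen instead of the predefined matmul templates.
config.triton.enable_native_matmul = True


@torch.compile
def mm(a, b):
    return a @ b


a = torch.randn(128, 64, device="cuda", dtype=torch.float16)
b = torch.randn(64, 256, device="cuda", dtype=torch.float16)
out = mm(a, b)  # aten.mm lowers to the new ops.dot IR node when the flag is on
```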
@eellison @jansel @PaulZhang12 @shunting314
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157743
Approved by: https://github.com/jansel
As title
On Windows, we cannot modify the .dll to append weights at the end; the Windows .dll loader will complain that it's not a valid .dll file. So we store the weight blob as a separate file.
1. We add the following APIs, which allow querying the size of the weight blob and passing in a pointer to a blob of that size:
```cpp
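// Query the size of the weight blob stored in the container.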
AOTI_API AOTIRuntimeError AOTInductorModelContainerGetConstantsBlobSize(
AOTInductorModelContainerHandle container_handle,
uint64_t* ret_size);
// Load weights from a single blob in weight_blob_ptr
AOTI_API AOTIRuntimeError AOTInductorModelUpdateConstantsFromBlob(
AOTInductorModelContainerHandle container_handle,
const uint8_t* weight_blob_ptr);
```
2. We also add a method in `ModelContainerRunner` to load the weights: if the runner sees that there is a `.blob` file in the package, it will mmap the `.blob` file and use its content to load the constants.
3. We also add the `USE_MMAP_EXTERNAL` macro. When this macro is defined, the model expects to load its weights from an external mmap'd weight blob.
Test Plan:
```
buck run @mode/dev-nosan caffe2/test/inductor:test_aot_inductor -- -r test_large_mmaped_weights_on_disk
```
Also tested Windows cross-compilation with 6542566585/demo/main_voxtral.cpp
```
Loaded model.dll
audio_encoder loaded
C:\Users\shangdiy\source\repos\torchnative\demo\token_embedding\data\aotinductor\model\model.wrapper.so
Loaded model.dll
token_embedding loaded
C:\Users\shangdiy\source\repos\torchnative\demo\text_decoder\data\aotinductor\model\model.wrapper.so
Loaded model.dll
Loading weights from C:\Users\shangdiy\source\repos\torchnative\demo\text_decoder\data\aotinductor\model\model.wrapper_weights.blob
text_decoder loaded
Load latency (ms):
audio_encoder: 1011.234
archive extraction: 0.000
.so loading: 1011.197
token_embedding: 525.773
archive extraction: 0.000
.so loading: 525.704
text_decoder: 3324.130
archive extraction: 0.000
.so loading: 3323.979
Run latency (ms):
audio_encoder: 285.958
audio_encoder output: dtype=bfloat16, shape=[1, 1125, 3072], numel=3456000
token_embedding: 6.676
token_embedding output: dtype=bfloat16, shape=[1, 1138, 3072], numel=3495936
text_decoder: 576.519
text_decoder output: dtype=bfloat16, shape=[1, 1138, 131072], numel=149159936
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164526
Approved by: https://github.com/desertfire
To run this, you need to install `mingw64-gcc-c++` and download the Windows CUDA library toolkit.
See design doc and demo instructions in https://docs.google.com/document/d/1iDaChqA5nNKkBFTzsdkmoomvQlXHbnlb1Z4yEp7xaJA/edit?tab=t.0
If `cross_platform_target` is windows, we do the following:
- do not link to `sleef`. This can be improved in the future if we need it; currently I avoid it because it requires extra setup on the Linux side
- use `mingw64-gcc-c++` to compile
- use `WINDOWS_CUDA_HOME` instead of `CUDA_HOME` when linking to CUDA
```
python test/inductor/test_aot_inductor_windows.py -k so
```
Other changes:
- decouples the compile_standalone config and the dynamic link flag
- creates a new aot_inductor_mode config module, which is used to control configs in aot_inductor
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163188
Approved by: https://github.com/desertfire
Summary:
Original commit changeset: 06888d7ebff0
Original Phabricator Diff: D82932788
Restricted the test to SM90 for `scaled_grouped_mm`.
Test Plan: TBD (will share the Linux CI results)
Differential Revision: D83283991
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163905
Approved by: https://github.com/angelayi
## Issue
From an internal use case, we found that if we have an equality rule like:
```
Max(15, u0) == s0 * Max(15, u0)
```
This would lead to a wrong substitution rule being generated in the substitution table. As a result, the process gets stuck in the substitution loop and appears to hang indefinitely, as it keeps doing the following substitutions:
```
Max(15, u0)
--> s0 * Max(15, u0)
--> s0 ** 2 * Max(15, u0)
--> s0 ** 3 * Max(15, u0)
--> s0 ** 4 * Max(15, u0)
...
```
The root cause is in SymPy expression comparison: since `Max` is [not inside the op class table](https://github.com/sympy/sympy/blob/1.14/sympy/core/basic.py#L50-L86), it takes the [UNKNOWN](https://github.com/sympy/sympy/blob/1.14/sympy/core/basic.py#L120) order and is considered larger than any other type of expression.
## Fix
1. Added a break-out from the substitution while-loop that warns about excessive substitutions. What threshold should be used here, and how to pass it in, are open to suggestion; a hard-coded static value keeps it simple for now.
2. Enhanced the sympy expression comparison logic so that we first check whether one expr "has" the other, to work around the issue with `Max` here (sketched below).
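For illustration, a standalone sketch of the idea behind fix 2 (the helper name and the fallback ordering are hypothetical; the actual change lives in the sizevars substitution logic):
```python
import sympy


def pick_substitution(lhs: sympy.Expr, rhs: sympy.Expr) -> tuple[sympy.Expr, sympy.Expr]:
    """Return (src, dst), meaning 'replace src with dst'."""
    # If one side structurally contains the other (e.g. s0*Max(15, u0) contains
    # Max(15, u0)), replace the containing expression with the contained one,
    # never the reverse; this avoids the unbounded
    # Max(15, u0) -> s0*Max(15, u0) -> s0**2*Max(15, u0) -> ... loop.
    if lhs.has(rhs):
        return lhs, rhs
    if rhs.has(lhs):
        return rhs, lhs
    # Otherwise fall back to a deterministic ordering.
    if sympy.default_sort_key(lhs) >= sympy.default_sort_key(rhs):
        return lhs, rhs
    return rhs, lhs


s0, u0 = sympy.symbols("s0 u0", positive=True, integer=True)
src, dst = pick_substitution(sympy.Max(15, u0), s0 * sympy.Max(15, u0))
assert (src, dst) == (s0 * sympy.Max(15, u0), sympy.Max(15, u0))
```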
## Testing
- with the unittest alone --> the unittest got stuck
- with the unittest and the while-loop break-out, we could see the tests finish with the warning "**Substitution limit reached**":
```
test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpu::test_unbounded_expr_substitutions_cpu W0923 13:00:37.864000 46140 /data/users/q1l1/pytorch/torch/_export/__init__.py:70] +============================+
W0923 13:00:37.864000 46140 /data/users/q1l1/pytorch/torch/_export/__init__.py:71] | !!! WARNING !!! |
W0923 13:00:37.865000 46140 /data/users/q1l1/pytorch/torch/_export/__init__.py:72] +============================+
W0923 13:00:37.865000 46140 /data/users/q1l1/pytorch/torch/_export/__init__.py:73] torch._export.aot_compile()/torch._export.aot_load() is being deprecated, please switch to directly calling torch._inductor.aoti_compile_and_package(torch.export.export())/torch._inductor.aoti_load_package() instead.
stats [('calls_captured', 5), ('unique_graphs', 1)]
inductor [('extern_calls', 2)]
graph_break []
aten_mm_info [('aten.mm_Max(15, u0)_16_64', 1)]
PASSED [5.6947s]
test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleGpu::test_unbounded_expr_substitutions_cuda W0923 13:00:39.633000 46140 /data/users/q1l1/pytorch/torch/_inductor/sizevars.py:765] [0/0] Substitution limit (30) reached w/ u1**30*Max(15, u0)
W0923 13:00:39.679000 46140 /data/users/q1l1/pytorch/torch/_inductor/sizevars.py:765] [0/0] Substitution limit (30) reached w/ 64*u1**30*Max(15, u0)
stats [('calls_captured', 5), ('unique_graphs', 1)]
inductor [('extern_calls', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('async_compile_cache_miss', 1)]
graph_break []
aten_mm_info [('aten.mm_Max(15, u0)_16_64', 1)]
PASSED [5.6278s]
test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleMps::test_unbounded_expr_substitutions_mps SKIPPED [0.0002s]
============================ 2 passed, 1 skipped, 870 deselected in 19.66s ============================
```
- with the unittest + the enhanced comparison logic, we don't see the warning anymore:
```
Running 3 items in this shard
test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpu::test_unbounded_expr_substitutions_cpu W0923 13:15:39.560000 290812 /data/users/q1l1/pytorch/torch/_export/__init__.py:70] +============================+
W0923 13:15:39.561000 290812 /data/users/q1l1/pytorch/torch/_export/__init__.py:71] | !!! WARNING !!! |
W0923 13:15:39.561000 290812 /data/users/q1l1/pytorch/torch/_export/__init__.py:72] +============================+
W0923 13:15:39.562000 290812 /data/users/q1l1/pytorch/torch/_export/__init__.py:73] torch._export.aot_compile()/torch._export.aot_load() is being deprecated, please switch to directly calling torch._inductor.aoti_compile_and_package(torch.export.export())/torch._inductor.aoti_load_package() instead.
stats [('calls_captured', 5), ('unique_graphs', 1)]
inductor [('extern_calls', 2)]
graph_break []
aten_mm_info [('aten.mm_Max(15, u0)_16_64', 1)]
PASSED [6.6093s]
test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleGpu::test_unbounded_expr_substitutions_cuda stats [('calls_captured', 5), ('unique_graphs', 1)]
inductor [('extern_calls', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('async_compile_cache_miss', 1)]
graph_break []
aten_mm_info [('aten.mm_Max(15, u0)_16_64', 1)]
PASSED [6.0502s]
test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleMps::test_unbounded_expr_substitutions_mps SKIPPED [0.0002s]
============================ 2 passed, 1 skipped, 870 deselected in 21.99s ============================
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163685
Approved by: https://github.com/jansel
Summary:
Otherwise, we may hit:
```
Exception: Expected all tensors to be on the same device, but got other is on cuda:0, different from other tensors on cpu (when checking argument in method wrapper_CUDA__equal)
```
Test Plan: UTs
Reviewed By: yushangdi
Differential Revision: D82974062
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163529
Approved by: https://github.com/yushangdi, https://github.com/Skylion007
Users can specify the following to get a libtorch-free `.so`:
"aot_inductor.use_libtorch": False,
The following config is only used for torchnative (see https://github.com/meta-pytorch/torchnative/pull/110). It's not intended to be used by executorch. The reason we need it for torchnative is that a lot of the symbol definitions in the torchnative repo are only in header files.
"aot_inductor.libtorch_free_header": "/data/users/shangdiy/torchnative/standalone,/data/users/shangdiy/torchnative/" (or their custom headers)
The main motivating use case is for executorch to produce a libtorch-free `.so`.
TODO for follow-up PR: this flag should be consolidated with the `compile_standalone` flag.
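A hypothetical end-to-end sketch of how these configs would be passed (the config keys come from this PR; the module, inputs, and header paths are placeholders):
```python
import torch
import torch._inductor


class M(torch.nn.Module):
    def forward(self, x):
        return x * 2


ep = torch.export.export(M(), (torch.randn(4),))
pkg_path = torch._inductor.aoti_compile_and_package(
    ep,
    inductor_configs={
        # Produce a libtorch-free .so.
        "aot_inductor.use_libtorch": False,
        # Comma-separated header search paths that provide the needed symbol
        # definitions (torchnative-style, header-only).
        "aot_inductor.libtorch_free_header": "/path/to/torchnative/standalone,/path/to/torchnative/",
    },
)
```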
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162655
Approved by: https://github.com/angelayi
Summary:
Fix memory leak in AOTI when calling `aoti_torch_as_strided`
If you have something like `AtenTensorHandle buf_handle` and you allocate memory to it, you have to wrap it in a `RAIIAtenTensorHandle` so the ownership is released. Otherwise you have leaked the memory, because even when the program ends there is still a pointer pointing to the underlying storage of `buf_handle_restrided`, and the storage is never freed.
Test Plan:
```
buck run fbcode//mode/dev-nosan fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_pad_non_zero_memory_leak
```
Also verified by looking at `print(f"Allocated memory: {torch.cuda.memory_allocated() / 1024 ** 2:.2f} MB")`
Differential Revision: D81640339
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162118
Approved by: https://github.com/angelayi
Summary:
When we have a user-defined Triton kernel, it marks the mutated outputs as `MutationOutput` with a `NoneLayout`. This `MutationOutput` may later be used as input to another Inductor-generated Triton kernel.
When we determine whether to use int32 or int64 for the Inductor-generated Triton kernel, we need to look at the number of elements of all buffers involved. If one of the buffers is a `MutationOutput`, we should still consider its number of elements instead of skipping it.
To get a hint on the `MutationOutput` size, we look at the buffers corresponding to its `mutation_names`.
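A toy, self-contained sketch of that sizing decision (the `Buf` class and helpers below are hypothetical stand-ins, not the real Inductor IR types):
```python
from dataclasses import dataclass, field

INT32_MAX = 2**31 - 1


@dataclass
class Buf:
    # Toy stand-in for an Inductor buffer; real code uses ir.Buffer/MutationOutput.
    name: str
    numel: int = 0
    mutation_names: list[str] = field(default_factory=list)

    @property
    def is_mutation_output(self) -> bool:
        return bool(self.mutation_names)


def buffer_numel_hint(buf: Buf, buffers_by_name: dict[str, Buf]) -> int:
    # Number of elements to consider when picking int32 vs int64 indexing.
    if buf.is_mutation_output:
        # Don't skip a MutationOutput: fall back to the buffers it mutates.
        return max(buffers_by_name[n].numel for n in buf.mutation_names)
    return buf.numel


def needs_int64_indexing(buffers: list[Buf], buffers_by_name: dict[str, Buf]) -> bool:
    return any(buffer_numel_hint(b, buffers_by_name) > INT32_MAX for b in buffers)


big = Buf("buf0", numel=INT32_MAX + 1)
mutated = Buf("buf1", mutation_names=["buf0"])  # output of a user-defined Triton kernel
assert needs_int64_indexing([mutated], {"buf0": big, "buf1": mutated})
```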
Test Plan:
```
buck run mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_autotune_int64_user_defined_triton_kernel
```
Differential Revision: D81530083
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162020
Approved by: https://github.com/davidberard98, https://github.com/eellison
Summary: This is a reland of D80285441, fixed the unit test.
Test Plan:
```
buck2 run mode/opt-amd-gpu -m rocm641 -c fbcode.split-dwarf=true -c fbcode.use_link_groups=true -c fbcode.enable_gpu_sections=true //hpc/new/models/feed/benchmark:feed_lower_benchmark -- --load=manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/894698382/0/gpu_lowering/new_input8 --skip-eager --skip-flop-estimation --sync-mode=0 --lower-backend=AOT_INDUCTOR
```
will succeed after this diff.
Differential Revision: D80971224
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161521
Approved by: https://github.com/frank-wei
Summary: AMD-specific kwargs need to be removed from the guard; otherwise a KeyError will be raised when executing the kernel.
Test Plan:
```
buck2 run mode/opt-amd-gpu -m rocm641 -c fbcode.split-dwarf=true -c fbcode.use_link_groups=true -c fbcode.enable_gpu_sections=true //hpc/new/models/feed/benchmark:feed_lower_benchmark -- --load=manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/894698382/0/gpu_lowering/new_input8 --skip-eager --skip-flop-estimation --sync-mode=0 --lower-backend=AOT_INDUCTOR
```
can succeed after this change.
Differential Revision: D80285441
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160671
Approved by: https://github.com/muchulee8
Fixes https://github.com/pytorch/pytorch/issues/159995
Currently there are two problems with extern kernels in subgraphs:
1. They don't get serialized to the extern kernel json file because we only look at the toplevel graph.
2. Since the scope of each extern_kernels list is within its own subgraph, the indices referencing the operators get messed up, because each subgraph starts counting from 0.
So, this PR moves the extern_kernels list to a global view (under virtualized) so that we can count the extern kernels across subgraphs and the toplevel graph.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160004
Approved by: https://github.com/ydwu4
Summary:
Previously our implementation for RecordFunction injected ATen into codegen, which breaks the ABI contract for AOTInductor.
C10::IValue is added to call the full record function. The extension to more profiling info will come in later PRs.
Test Plan:
Included in commit.
Differential Revision: [D79622071](https://our.internmc.facebook.com/intern/diff/D79622071)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159842
Approved by: https://github.com/desertfire
Fixes https://github.com/pytorch/pytorch/issues/160535
Index may contain `torch.utils._sympy.functions.Identity`. When we call `SymPyOps.index_expr`, if the value is a sympy.Expr containing Identity, `TypedExpr(value, dtype)` will fail. So when we unwrap arguments, we expand the sympy expression to unwrap the Identity.
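For illustration, one way to strip `Identity` wrappers from a sympy expression (the PR itself does this via expansion at argument-unwrapping time; this standalone snippet just shows the shape of the problem):
```python
import sympy
from torch.utils._sympy.functions import Identity

s0 = sympy.Symbol("s0", integer=True)
expr = 2 * Identity(s0 + 1)  # an index expression wrapped in Identity

# Replace every Identity(arg) node with its argument so downstream code that
# only understands plain sympy arithmetic (e.g. TypedExpr) can handle it.
unwrapped = expr.replace(Identity, lambda arg: arg)
print(unwrapped)  # Identity is gone: 2*s0 + 2
```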
Test Plan:
buck run @mode/dev-nosan //caffe2/test/inductor:test_aot_inductor -- -r test_sym_expr_indexing
Differential Revision: D76308640
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155504
Approved by: https://github.com/eellison
Currently, the device-bias linter only targets functions decorated with @requires_gpu. This PR adds support for two new detection scenarios:
1. Detect device-bias code in functions decorated with @requires_triton.
2. Detect device-bias code for entire test suites that are defined as shared across GPUs. For example:
```
if __name__ == "__main__":
if HAS_GPU:
run_tests()
```
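For context, a hypothetical test file illustrating scenario 2 (whether the linter flags exactly this pattern is an assumption based on the description above; the imports mirror the usual Inductor test setup):
```python
import torch
from torch._inductor.test_case import TestCase, run_tests
from torch.testing._internal.inductor_utils import GPU_TYPE, HAS_GPU


class SharedGpuTests(TestCase):
    def test_add(self):
        x = torch.ones(4, device="cuda")    # device-bias: hard-codes cuda
        y = torch.ones(4, device=GPU_TYPE)  # portable across GPU backends
        self.assertEqual(x.cpu(), y.cpu())


if __name__ == "__main__":
    if HAS_GPU:
        run_tests()
```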
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159949
Approved by: https://github.com/EikanWang, https://github.com/jansel
Summary: When compiling for standalone, make embed_kernel_binary and emit_multi_arch_kernel default to True, and add a default name for model_name_for_generated_files to make the generated cpp project easier to understand. Also improve the weights object file naming to be more readable.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158560
Approved by: https://github.com/yushangdi
In some cases we have MPS kernels which are reused across higher-order-op subgraphs and the toplevel code. However, we currently initialize the variable for an MPS kernel the first time we use it, which runs into an issue if we first encounter the kernel within a subgraph, since the kernel will then only be initialized within the subgraph scope. For instance:
```
if ...
auto mps_lib_0_func = ...
mps_lib_0_func->run()
// since we already used mps_lib_0 once, we don't re-initialize it
mps_lib_0_func->run() // error, mps_lib_0_func not initialized
```
So the solution we took here is to initialize all the kernels at the beginning:
```
const std::shared_ptr<at::native::mps::MetalKernelFunction> get_mps_lib_0() {
static const auto func = mps_lib_0.getKernelFunction("generated_kernel");
return func;
}
AOTIMetalKernelFunctionHandle get_mps_lib_0_handle() {
static const auto handle = AOTIMetalKernelFunctionHandle(get_mps_lib_0().get());
return handle;
}
...
if ...
get_mps_lib_0()->run()
get_mps_lib_0()->run() // success
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159753
Approved by: https://github.com/malfet
ghstack dependencies: #159456, #159695