pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-10-20 12:54:11 +08:00

Author	SHA1	Message	Date
Yuanyuan Chen	3255e7872b	Enable all flake8-logging-format rules (#164655 ) These rules are enabled by removing existing suppressions. Pull Request resolved: https://github.com/pytorch/pytorch/pull/164655 Approved by: https://github.com/janeyx99, https://github.com/mlazos	2025-10-19 00:59:28 +00:00
Colin L Reliability Rice	98a488c9aa	Start recording inductor provenance (#162669 ) Summary: This stores information on where fx graphs come from, which makes it significantly easier to debug. One outstanding question 1) I only stored the kernel stack traces, do we also want the node mappings? Test Plan: I wrote a explicit logging test which makes a module, fx traces it, compiles it, and makes sure the logging infomration shows up. ``` clr@devvm17763 ~/fbsource/fbcode/caffe2/test/dynamo % buck2 test @//mode/opt fbcode//caffe2/test/dynamo:test_dynamo -- test_utils File changed: fbsource//xplat/caffe2/test/dynamo/test_utils.py File changed: fbcode//caffe2/test/dynamo/test_utils.py Buck UI: https://www.internalfb.com/buck2/528dea32-2416-4a62-a1ec-39f3c0efdd2e Test UI: https://www.internalfb.com/intern/testinfra/testrun/13229324015574003 Network: Up: 0B Down: 0B Executing actions. Remaining 0/2 Command: test. Time elapsed: 17.3s Tests finished: Pass 16. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` Rollback Plan: Differential Revision: D82037582 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162669 Approved by: https://github.com/yushangdi	2025-10-16 23:05:31 +00:00
Oguz Ulgen	7d0f872cb3	Use union syntax in torch/_inductor runtime and fx_passes (#165652 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/165652 Approved by: https://github.com/aorenste	2025-10-16 20:51:59 +00:00
Oguz Ulgen	f6daffc54d	Codemod codecache.py from Optional to union none (#165604 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/165604 Approved by: https://github.com/aorenste	2025-10-16 04:59:37 +00:00
Rahul Agrawal	4f8a986b8f	Make LOCK_TIMEOUT in codecache configurable (#165030 ) - Introduce file_lock_timeout in config (defaults to current value of 600) - Use the above config instead of hardcoded 600 config. This is useful when running stress tests. Differential Revision: D84109142 Privacy Context Container: L1297311 Pull Request resolved: https://github.com/pytorch/pytorch/pull/165030 Approved by: https://github.com/hl475	2025-10-10 20:22:11 +00:00
Shangdi Yu	77bf23d85c	Add an option to put store large mmap weights on disk (#164526 ) As title In windows, we cannot modify the .dll to append weights at the end, the windows .dll loader will complain it's not a valid .dll file. So we store the weight blob as a separete file. 1. We add the following API which allows passing in a pointer to the weight blob and get the size of the weight blob. ```cpp AOTI_API AOTIRuntimeError AOTInductorModelContainerGetConstantsBlobSize( AOTInductorModelContainerHandle container_handle, uint64_t* ret_size); // Load weights from a single blob in weight_blob_ptr AOTI_API AOTIRuntimeError AOTInductorModelUpdateConstantsFromBlob( AOTInductorModelContainerHandle container_handle, const uint8_t* weight_blob_ptr); ``` 2. We also add a method in ModelContainerRunner to load the weight: If the runner see that there is a `.blob` file in the package, if will mmap the .blob file and use the content to load the constants. 3. We also add the `USE_MMAP_EXTERNAL` macro. When this macro is defined, the model expects to load the weights from external mmap'd weights. Test Plan: ``` buck run @mode/dev-nosan caffe2/test/inductor:test_aot_inductor -- -r test_large_mmaped_weights_on_disk ``` Also tested for windows-cross compilation with `6542566585/demo/main_voxtral.cpp` ``` Loaded model.dll audio_encoder loaded C:\Users\shangdiy\source\repos\torchnative\demo\token_embedding\data\aotinductor\model\model.wrapper.so Loaded model.dll token_embedding loaded C:\Users\shangdiy\source\repos\torchnative\demo\text_decoder\data\aotinductor\model\model.wrapper.so Loaded model.dll Loading weights from C:\Users\shangdiy\source\repos\torchnative\demo\text_decoder\data\aotinductor\model\model.wrapper_weights.blob text_decoder loaded Load latency (ms): audio_encoder: 1011.234 archive extraction: 0.000 .so loading: 1011.197 token_embedding: 525.773 archive extraction: 0.000 .so loading: 525.704 text_decoder: 3324.130 archive extraction: 0.000 .so loading: 3323.979 Run latency (ms): audio_encoder: 285.958 audio_encoder output: dtype=bfloat16, shape=[1, 1125, 3072], numel=3456000 token_embedding: 6.676 token_embedding output: dtype=bfloat16, shape=[1, 1138, 3072], numel=3495936 text_decoder: 576.519 text_decoder output: dtype=bfloat16, shape=[1, 1138, 131072], numel=149159936 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/164526 Approved by: https://github.com/desertfire	2025-10-10 07:53:57 +00:00
Maggie Moss	9944cac6e6	Add suppressions to torch/_inductor (#165062 ) Adds suppressions to pyrefly will typecheck clean: https://github.com/pytorch/pytorch/issues/163283 Split this directory into two PRs to keep them from being too large. Test plan: dmypy restart && python3 scripts/lintrunner.py -a pyrefly check step 1: delete lines in the pyrefly.toml file from the project-excludes field step 2: run pyrefly check step 3: add suppressions, clean up unused suppressions before: https://gist.github.com/maggiemoss/4b3bf2037014e116bc00706a16aef199 after: INFO 0 errors (6,884 ignored) Pull Request resolved: https://github.com/pytorch/pytorch/pull/165062 Approved by: https://github.com/oulgen, https://github.com/mlazos	2025-10-09 20:34:20 +00:00
Sam Larsen	608792153f	[inductor][codecache] Print bytes in codecache debug output (#164898 ) Summary: We have an internal request to help understand why the hash of `post_grad_custom_post_pass` is changing between attempts. We don't get useful info from the debug output, because we just print "<bytes>". Instead, attempt to print at least _some_ of the value in case it contains readable characters. Test Plan: Registered a dummy post_grad_custom_pass and printed codecache debug output `TORCH_LOGS=+torch._inductor.codecache python ~/foo.py` Yields something like: ``` V1007 16:41:19.024000 3546009 /data/users/slarsen/pytorch-3.10_4/torch/_inductor/codecache.py:989] [0/0] [law2ujt2wzjb5tyiu6jh64r2lxpvl62yvxcsmdouhg3qyelhhdv] post_grad_custom_post_pass: HelloWorld!��... ``` Differential Revision: [D84108770](https://our.internmc.facebook.com/intern/diff/D84108770) Pull Request resolved: https://github.com/pytorch/pytorch/pull/164898 Approved by: https://github.com/oulgen	2025-10-08 02:45:20 +00:00
Shangdi Yu	28c1d2f81b	[aoti] AOTI mingw cross compilation (#163188 ) To run this, you need to install `mingw64-gcc-c++` and download windows cuda library toolkit. See design doc and demo instructions in https://docs.google.com/document/d/1iDaChqA5nNKkBFTzsdkmoomvQlXHbnlb1Z4yEp7xaJA/edit?tab=t.0 If cross_platform_target is windows, we do the following: - do not link to `sleef`. This can be improved in the future if we need it. Currently I avoid it because that requires extra setup on the linux side - Use `mingw64-gcc-c++` to compile - Use `WINDOWS_CUDA_HOME` instead of `CUDA_HOME` when linking to cuda ``` python test/inductor/test_aot_inductor_windows.py -k so ``` Other changes: - de-couples compile_standalone config and dynamic link flag - create a new aot_inductor_mode config module, which is used to control configs in aot_inductor. Pull Request resolved: https://github.com/pytorch/pytorch/pull/163188 Approved by: https://github.com/desertfire	2025-10-01 02:22:06 +00:00
angelayi	11b4c0eb9e	[aoti] Save compute information (#163792 ) Metadata looks like: ``` { 'AOTI_DEVICE_KEY': 'cpu', 'AOTI_PLATFORM': 'linux', 'AOTI_MACHINE': 'x86_64', 'AOTI_CPU_ISA': 'AVX512', 'AOTI_COMPUTE_CAPABILITY': '90' } ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/163792 Approved by: https://github.com/yushangdi, https://github.com/desertfire ghstack dependencies: #163779	2025-09-26 05:40:44 +00:00
Shangdi Yu	ebfc87e303	Always produce kernel_info.json (#163715 ) Summary: Always produce kernel_info.json so zoomer can use this json to populate GPU traces Test Plan: CI Differential Revision: D82762435 Pull Request resolved: https://github.com/pytorch/pytorch/pull/163715 Approved by: https://github.com/angelayi	2025-09-25 19:38:49 +00:00
Jason Ansel	3e1b1a30f2	Revert "[inductor] Fix issue with scalar arg handling" (#163737 ) This reverts commit a8cd437183142e17ba6fc8d7b5e9dcee462d7904. See https://github.com/pytorch/pytorch/pull/163481#issuecomment-3326310774 This PR might also cause issues with cudagraphs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/163737 Approved by: https://github.com/ezyang ghstack dependencies: #163386, #163398, #163387, #163414, #163415, #163419, #163434, #163393, #163412, #163422, #163481, #163520, #163482	2025-09-24 07:33:12 +00:00
Jason Ansel	ca512af3e7	[inductor] Fix issue with scalar arg handling (#163481 ) Fixes #163420 Pull Request resolved: https://github.com/pytorch/pytorch/pull/163481 Approved by: https://github.com/eellison ghstack dependencies: #163386, #163398, #163387, #163414, #163415, #163419, #163434, #163393, #163412, #163422	2025-09-24 02:52:36 +00:00
Edward Yang	8627454c84	Add local file path to inductor_output_code trace metadata (#160920 ) ## Summary - include local file path in `inductor_output_code` structured trace metadata - adjust structured trace tests for new `file_path` field ## Testing - `python test/dynamo/test_structured_trace.py StructuredTraceTest.test_compile_id_serialization_deserialization` - `lintrunner -a torch/_inductor/codecache.py torch/_inductor/graph.py test/dynamo/test_structured_trace.py` (fails: MYPY failure) ------ https://chatgpt.com/codex/tasks/task_e_68a2b02b54ec8323ae820120605a9f1c Pull Request resolved: https://github.com/pytorch/pytorch/pull/160920 Approved by: https://github.com/oulgen	2025-09-18 18:39:46 +00:00
Xu Han	52d4660ae9	[AOTI] Fix Windows fail to zip opened file. (#162617 ) Original issue: <img width="1767" height="544" alt="Image" src="https://github.com/user-attachments/assets/9de90d50-217f-4049-8f19-77ff1660c8b0" /> reproducer: ```cmd pytest test\inductor\test_aot_inductor.py -v -k test_weight_on_disk_legacy_cpu ``` Fixed list: 1. `WritableTempFile`'s `__exit__` function auto unlink opened file, when the file was opened, it should raise error. Ignore it on Windows. 2. When open zip file, if the file is opened, it would be failed. Switch to `_wfsopen` with shared access flag, which can open file with shared access. Local test passed: <img width="1101" height="233" alt="image" src="https://github.com/user-attachments/assets/935cbf2e-52db-41f1-80fa-617569b92a96" /> Pull Request resolved: https://github.com/pytorch/pytorch/pull/162617 Approved by: https://github.com/jansel	2025-09-11 06:22:21 +00:00
Xuan Zhang	4d4abec80f	allow user to pass in custom partitioner function (#157580 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/157580 Approved by: https://github.com/bdhirsh	2025-09-05 22:49:39 +00:00
Xu Han	451ed93156	[inductor] fix split_aot_inductor_output_path on Windows. (#162058 ) fix split_aot_inductor_output_path on Windows. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162058 Approved by: https://github.com/angelayi	2025-09-03 16:53:38 +00:00
Shangdi Yu	92c2daebb6	Add inductor provenance tracking artifacts to cache (#161440 ) Summary: - Add inductor provenance tracking artifacts to cache - Update the tlparse version pin to `0.4.0`. The old tlparse version errors out on the new tlparse output. The lowest tlparse version that works is `0.3.42`. tlparse error: ``` thread 'main' panicked at src/parsers.rs:671:71: called `Result::unwrap()` on an `Err` value: Error("EOF while parsing a value", line: 1, column: 0) stack backtrace: 0: 0x55e4ff1c7f00 - <std::sys::backtrace::BacktraceLock::print::DisplayBacktrace as core::fmt::Display>::fmt::h6d42cc84fc840290 1: 0x55e4ff1ee503 - core::fmt::write::h5af61a909e3ec64d 2: 0x55e4ff1c4c33 - std::io::Write::write_fmt::h5a7b54aa6e4a315d 3: 0x55e4ff1c7d52 - std::sys::backtrace::BacktraceLock::print::h555579e7396c26ac 4: 0x55e4ff1c8caf - std::panicking::default_hook::{{closure}}::h9128866118196224 5: 0x55e4ff1c8b1a - std::panicking::default_hook::h52e9e7314e0255f6 6: 0x55e4ff1c9652 - std::panicking::rust_panic_with_hook::h541791bcc774ef34 7: 0x55e4ff1c93fa - std::panicking::begin_panic_handler::{{closure}}::h6479a2f0137c7d19 8: 0x55e4ff1c8419 - std::sys::backtrace::__rust_end_short_backtrace::ha04e7c0fc61ded91 9: 0x55e4ff1c908d - rust_begin_unwind 10: 0x55e4fef7a030 - core::panicking::panic_fmt::h5764ee7030b7a73d 11: 0x55e4fef7a406 - core::result::unwrap_failed::h3ff7104a9ace307a 12: 0x55e4fefb3c56 - <tlparse::parsers::ArtifactParser as tlparse::parsers::StructuredLogParser>::parse::h20bc51a17ffc494a 13: 0x55e4fef9669a - tlparse::run_parser::h20c7729f151eec62 14: 0x55e4fef99a1b - tlparse::parse_path::he4892147f47fbade 15: 0x55e4fef7c760 - tlparse::main::hdc05613b32f4f53b 16: 0x55e4fef89263 - std::sys::backtrace::__rust_begin_short_backtrace::h15f188f3edf42596 17: 0x55e4fef8827d - std::rt::lang_start::{{closure}}::he2c21e32a442538e 18: 0x55e4ff1be0f0 - std::rt::lang_start_internal::h15895544e2012228 19: 0x55e4fef83975 - main 20: 0x7f0b3662a610 - __libc_start_call_main 21: 0x7f0b3662a6c0 - __libc_start_main_alias_2 22: 0x55e4fef7a610 - <unknown> 23: 0x0 - <unknown> ``` Test Plan: ``` buck run mode/dev-nosan fbcode//caffe2/test/inductor:provenance_tracing -- -r test_kernel_information_generation python test/dynamo/test_structured_trace.py -k test_chromium_event ``` Differential Revision: D80976585 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161440 Approved by: https://github.com/oulgen	2025-08-28 01:16:02 +00:00
Xu Han	97200c9711	[inductor] Add get page_size support for Windows. (#161273 ) `resource` can't work on Windows, as it is a Unix specific package as seen in https://docs.python.org/2/library/resource.html Use Windows system API to get page_size. Local tested: <img width="467" height="433" alt="image" src="https://github.com/user-attachments/assets/47a39060-3aea-46c3-bd8e-35a39413c51f" /> Pull Request resolved: https://github.com/pytorch/pytorch/pull/161273 Approved by: https://github.com/angelayi	2025-08-22 18:36:14 +00:00
Sandeep Narendranath Karjala	2b62ef7420	Add kernel information JSON generation for AOTI packages (#160540 ) Summary: Build on D80031559. Generate kernel_information.json in AOTI compiled artifacts by combining stack traces and node mappings from provenance tracking. This implementation delivers exactly what Zoomer team requested: 1. Core Function: `create_kernel_information_json()` in debug.py combines 3 data sources: - `_inductor_kernel_stack_trace` → `stack_traces` field - `_inductor_triton_kernel_to_post_grad_node_info` → `post_grad_nodes` field - `_inductor_post_to_pre_grad_nodes["postToPre"]` → `pre_grad_nodes` field 2. AOTI Integration: codecache.py writes `kernel_information.json` to pt2 packages when both AOTI packaging and provenance tracking are enabled. 3. Test Coverage: TestKernelInformationAOTI class validates: - JSON file creation in AOTI packages using zipfile - Exact format compliance - Proper disabling without provenance tracking Output Format (exact specification): ```json { "triton_kernel_name_1": { "stack_traces": [str, str, ...], "post_grad_nodes": [str, str, ...], "pre_grad_nodes": [str, str, ...] } } ``` Test Plan: ``` buck test fbcode//caffe2/test/inductor:provenance_tracing -- TestKernelInformationAOTI ``` Manual validation: ```python import torch model = torch.nn.Linear(10, 1) with torch._inductor.config.patch("aot_inductor.package", True): with torch._inductor.config.patch("trace.basic_provenance_tracking", True): # AOTI compilation should generate kernel_information.json compiled = torch.export.export(model, (torch.randn(1, 10),)) ``` --- Rollback Plan: Differential Revision: D80139160 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160540 Approved by: https://github.com/yushangdi	2025-08-20 02:33:45 +00:00
PyTorch MergeBot	eddaaa6c2a	Revert "Recheck Autotune cache on Precompile serialization to prune compilation results (#158656 )" This reverts commit 664005662ad8c9aa1942015397048aa9ca14fd6d. Reverted https://github.com/pytorch/pytorch/pull/158656 on behalf of https://github.com/seemethere due to failing internal tests, see D80486843 ([comment](https://github.com/pytorch/pytorch/pull/158656#issuecomment-3201491561))	2025-08-19 16:53:20 +00:00
Michael Lazos	5cf6567c1f	[Inductor] add cuda compile cmd to autotuning logging (#160906 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160906 Approved by: https://github.com/henrylhtsang	2025-08-19 01:14:46 +00:00
James Wu	664005662a	Recheck Autotune cache on Precompile serialization to prune compilation results (#158656 ) This PR rechecks the autotune cache on Precompile.serialize(), allowing us to ahead of time save autotune results for statically compiled triton kernels, so that warm start does not need to check the autotune cache. It has a few extra changes to make this work: ### Storing source code in TritonBundler - We now store the source_code for statically compiled triton kernels instead of the hash of the source code in TritonBundler, so that we can easily access their source code when rechecking the autotune cache on PrecompileContext.serialize. To make sure that this is not a huge space concern, I ran the entire hugging face benchmark on training. The total space of `/tmp/torchinductor_jjwu/fxgraph` before my change was 1185004 KB (1.18 GB). After my change, this increased to 1207312 KB (1.2 GB), for an increased storage cost of ~1.8%, which seems safe. - We now return early from recheck_autotune_cache if the number of triton kernels being compiled is 1, since there's no reason to check the cache at all in those cases. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158656 Approved by: https://github.com/zhxchen17	2025-08-18 17:55:10 +00:00
PyTorch MergeBot	30d2f98daa	Revert "[cutlass backend] re-add pip cutlass path (#160180 )" This reverts commit d556586448f3caab85673c7da0978fe31c7748f7. Reverted https://github.com/pytorch/pytorch/pull/160180 on behalf of https://github.com/atalman due to broke macos nightly ([comment](https://github.com/pytorch/pytorch/pull/160180#issuecomment-3192311552))	2025-08-15 18:00:41 +00:00
Alexander Grund	d556586448	[cutlass backend] re-add pip cutlass path (#160180 ) Revert #156651 to allow using the cutlass PIP package which is easier for users than the Git checkout or similar method. Also fix a bug where the PIP cutlass path wouldn't be available to subprocesses spawned during benchmarking for algorithm selection. Looks like the "spawn" method does not inherit the (potentially) already set up `config.cuda.cutlass_dir` so in the subprocess the include paths will still be set to `"../third_party/cutlass/"` leading to compilation failure due to missing headers. Ensure `try_import_cutlass` is called at that point, which due to caching is a no-op in most cases, so doesn't hurt. Change the logic to return `None` when cutlass isn't available returning more useful values for include paths, namely an empty list. This is in line with other inductor code which disables the CUTLASS backend when `try_import_cutlass` returns False Pull Request resolved: https://github.com/pytorch/pytorch/pull/160180 Approved by: https://github.com/henrylhtsang, https://github.com/mlazos	2025-08-14 14:48:31 +00:00
Alexander Grund	adcca7d9a1	Do not rpath CUDA stubs folder in JIT generated code (#160179 ) `_transform_cuda_paths` intentionally includes the CUDA stubs folder. However this path must not be added to the rpath as otherwise any CUDA command will fail at runtime with > CUDA_ERROR_STUB_LIBRARY: "CUDA driver is a stub library" This results in e.g. non-descriptive errors like ``` cutlass_library/source/tools/util/include/cutlass/util/device_memory.h:67 cutlass::device_memory::allocate: cudaMalloc failed: bytes=4096 terminate called after throwing an instance of 'cutlass::cuda_exception' what(): std::exception ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/160179 Approved by: https://github.com/jansel	2025-08-13 18:29:24 +00:00
Aaron Gokaslan	beb4d7816d	[BE]: ruff PLC0207 - use maxsplit kwarg (#160107 ) Automatically replaces split with rsplit when relevant and only performs the split up to the first ( or last value). This allows early return of the split function and improve efficiency. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160107 Approved by: https://github.com/albanD	2025-08-08 03:14:59 +00:00
Bin Bao	a4b07fe8f6	[AOTI] Add more default options to compile_standalone (#158560 ) Summary: When compiling for standalone, make embed_kernel_binary and emit_multi_arch_kernel default to True, and add a default name for model_name_for_generated_files to make the generated cpp project easier to understand. Also improved the weights object file naming to be more readable. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158560 Approved by: https://github.com/yushangdi	2025-08-06 15:59:27 +00:00
Xu Han	510e8b4ae0	[inductor] use writable temp file on windows (#159738 ) Use `WritableTempFile` on Windows, reference to: https://github.com/pytorch/pytorch/pull/159342 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159738 Approved by: https://github.com/angelayi, https://github.com/Skylion007	2025-08-04 21:51:02 +00:00
Aleksei Nikiforov	6646461764	S390X: fix detection of magic number placeholder in inductor (#157784 ) This change fixes multiple tests in test/inductor/test_aot_inductor_arrayref.py such as test_cond_with_parameters_cpu_with_stack_allocation, test_issue_140766_cpu_with_stack_allocation, test_model_modified_weights_cpu_with_stack_allocation, test_nested_tensor_from_jagged_cpu_with_stack_allocation. Enable tests in test/inductor/test_aot_inductor_arrayref.py This change is split off from https://github.com/pytorch/pytorch/pull/150116 Pull Request resolved: https://github.com/pytorch/pytorch/pull/157784 Approved by: https://github.com/huydhn	2025-08-04 12:42:31 +00:00
Xu Han	7e00f2ec9d	[AOTI] add zero size consts asm handler (#159225 ) Add `get_zero_consts_asm_code` to handle zero size consts to object. This function is used to handle zero consts situation. Because cpp standard does not allow zero size array: https://stackoverflow.com/questions/9722632/what-happens-if-i-define-a-0-size-array-in-c-c 1. On Windows, MSVC will report error C2466: https://learn.microsoft.com/en-us/cpp/error-messages/compiler-errors-1/compiler-error-c2466?view=msvc-170 So, we can use assmbely compiler to handle this situation. 2. On Windows, why not use Win32 asm to handle all path? Because ml64 only supports up to align `16`, it is not aligned to pytorch's `64`. Reference: https://learn.microsoft.com/en-us/cpp/assembler/masm/ml-and-ml64-command-line-reference?view=msvc-170 ``` Packs structures on the specified byte boundary. The alignment can be 1, 2, 4, 8, or 16. ``` 3. It function can handle zero size case on both Windows and Linux, as that: A. On Linux, we added `-pedantic` to disable zero size array on C++ compiler. `8e07c9870d/torch/_inductor/cpp_builder.py (L580)` B. On Windows, msvc is not support zero size array by default. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159225 Approved by: https://github.com/desertfire	2025-07-31 22:46:33 +00:00
Xu Han	d5c719ec3c	[inductor] fix open temp file failed on Windows. (#159342 ) Fix open temp file failed on Windows. Error message: <img width="1181" height="239" alt="image" src="https://github.com/user-attachments/assets/e4a6f438-cb06-44c6-959b-0a6a49d2f44f" /> Here two option to fix this issue: https://stackoverflow.com/questions/66744497/python-tempfile-namedtemporaryfile-cant-use-generated-tempfile 1. `tempfile.NamedTemporaryFile` must setup `delete=False` on Windows 2. Use `WritableTempFile` to handle this case on Windows. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159342 Approved by: https://github.com/jansel	2025-07-31 04:58:02 +00:00
Lucas Kabela	2b1ae29960	[Dynamo][Better Engineering] Add typing annotations to guard and source (#158397 ) (#159491 ) Summary: X-link: https://github.com/pytorch/executorch/pull/12986 As part of better engineering week, we would like to improve out type support to improve dev experience in dynamo This PR adds strict typing support to a critical set of files for dynamo, `source.py` and the base `_guards.py` Running ``` mypy torch/_dynamo/source.py torch/_guards.py --linecount-report /tmp/coverage_log ``` \| -------- \| Lines Unannotated \| Lines Total \| % lines covered \| Funcs Unannotated \| Funcs Total \| % funcs covered \| \| -------- \| ------- \| -------- \| ------- \| ------- \| ------- \| ------- \| \| Main \| 1227 \| 2208 \| 55.57% \| 207 \| 362 \| 57.18% \| \| This PR \| 2217 \| 2217 \| 100.00% \| 362 \| 362 \| 100.00% \| \| Delta \| +990 \| +9 \| +44.43% \| +155 \| 0 \| +42.82% \| cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10 jerryzh168 voznesenskym penguinwu EikanWang Guobing-Chen zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy chenyang78 kadeng muchulee8 amjames chauhang aakhundov coconutruben Test Plan: Imported from GitHub, without a `Test Plan:` line. Rollback Plan: Reviewed By: JacobSzwejbka, yangw-dev Differential Revision: D79199389 Pulled By: Lucaskabela Pull Request resolved: https://github.com/pytorch/pytorch/pull/159491 Approved by: https://github.com/anijain2305, https://github.com/yangw-dev	2025-07-30 22:57:50 +00:00
Sam Larsen	af39144a93	Don't use torch.backends.cuda.matmul.allow_tf32 in inductor cache key (#159480 ) Summary: According to https://github.com/pytorch/pytorch/pull/158209, the API is deprecated and we should be using torch.backends.cuda.matmul.fp32_precision instead. Fixes https://github.com/pytorch/pytorch/issues/159440 Test Plan: CI Pull Request resolved: https://github.com/pytorch/pytorch/pull/159480 Approved by: https://github.com/xmfan, https://github.com/oulgen	2025-07-30 21:29:38 +00:00
PyTorch MergeBot	d987a6f7f0	Revert "[Dynamo][Better Engineering] Add typing annotations to guard and source (#158397 )" This reverts commit abcb24f4de11f8fedf2c2c9ff53b6092ef42306d. Reverted https://github.com/pytorch/pytorch/pull/158397 on behalf of https://github.com/yangw-dev due to Suggested to fix failing internal signals on D78911890 ([comment](https://github.com/pytorch/pytorch/pull/158397#issuecomment-3133823766))	2025-07-29 19:49:40 +00:00
Lucas Kabela	abcb24f4de	[Dynamo][Better Engineering] Add typing annotations to guard and source (#158397 ) As part of better engineering week, we would like to improve out type support to improve dev experience in dynamo This PR adds strict typing support to a critical set of files for dynamo, `source.py` and the base `_guards.py` Running ``` mypy torch/_dynamo/source.py torch/_guards.py --linecount-report /tmp/coverage_log ``` \| -------- \| Lines Unannotated \| Lines Total \| % lines covered \| Funcs Unannotated \| Funcs Total \| % funcs covered \| \| -------- \| ------- \| -------- \| ------- \| ------- \| ------- \| ------- \| \| Main \| 1227 \| 2208 \| 55.57% \| 207 \| 362 \| 57.18% \| \| This PR \| 2217 \| 2217 \| 100.00% \| 362 \| 362 \| 100.00% \| \| Delta \| +990 \| +9 \| +44.43% \| +155 \| 0 \| +42.82% \| Pull Request resolved: https://github.com/pytorch/pytorch/pull/158397 Approved by: https://github.com/anijain2305	2025-07-24 15:55:18 +00:00
Xu Han	5e386eec94	[AOTI] enable aot inductor on Windows (#158915 ) With many PRs landed, we can run the first aot inductor example on Windows. <img width="640" height="427" alt="image" src="https://github.com/user-attachments/assets/131db159-ce17-4857-a3d5-a4b03638f01d" /> Let's remove the Windows check on `AotCodeCompiler`. CC: @angelayi , @desertfire , @jansel Pull Request resolved: https://github.com/pytorch/pytorch/pull/158915 Approved by: https://github.com/desertfire	2025-07-23 16:29:15 +00:00
Mwiza Kunda	d3d9bc1c31	[inductor] Allow backends to register their own custom config object (#158254 ) An out of tree backend can have its own configuration options that the user can enable to control inductor compilation. These config options need to be taken into account when calculating the key that is used to determine cache miss / hits. This PR allows out of tree backends to specify a custom config module that has the same type as `torch._inductor.config` that can be used to control codegen (in addition to the default config), and will be used when creating the cache key. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158254 Approved by: https://github.com/eellison	2025-07-23 15:56:06 +00:00
PyTorch MergeBot	7d6f340238	Revert "[AOTI] Add more default options to compile_standalone (#158560 )" This reverts commit a991e285ae35159680b0ad4be24669906a6fa256. Reverted https://github.com/pytorch/pytorch/pull/158560 on behalf of https://github.com/jeffdaily due to broke rocm CI, no test signal was available from rocm ciflow/trunk, need to add ciflow/rocm to reland ([comment](https://github.com/pytorch/pytorch/pull/158560#issuecomment-3103633964))	2025-07-22 16:20:17 +00:00
Bin Bao	a991e285ae	[AOTI] Add more default options to compile_standalone (#158560 ) Summary: When compiling for standalone, make embed_kernel_binary and emit_multi_arch_kernel default to True, and add a default name for model_name_for_generated_files to make the generated cpp project easier to understand. Also improved the weights object file naming to be more readable. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158560 Approved by: https://github.com/yushangdi	2025-07-21 21:16:48 +00:00
Xu Han	16b21fa8b2	[AOTI] skip ld and objcopy on Windows. (#158545 ) Skip `ld` and `objcopy` on Windows. They are not support on Windows. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158545 Approved by: https://github.com/desertfire	2025-07-17 15:43:24 +00:00
Xu Han	da4c7b4ced	[AOTI] align signature to model_base.h (#158554 ) Remove `const` keyword, align its signature to `model_base.h` `eeda1a75ac/torch/csrc/inductor/aoti_runtime/model_base.h (L51-L53)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/158554 Approved by: https://github.com/desertfire	2025-07-17 14:44:32 +00:00
Xu Han	a04bd11895	[AOTI] Use format_consts_to_cpp on Windows. (#158543 ) `format_consts_to_asm` is not supported on Windows, force use `format_consts_to_cpp` on Windows. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158543 Approved by: https://github.com/desertfire	2025-07-17 14:40:34 +00:00
Han, Xu	4805a6ead6	[aot][XPU] switch xpu to use consts cpp build. (#158425 ) Intel compiler is not support `format_consts_to_asm`, let's use `format_consts_to_cpp`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158425 Approved by: https://github.com/jansel	2025-07-16 16:19:33 +00:00
henrylhtsang	7e433d5f42	[cutlass backend] cache a few things for codegen and properties (#158158 ) Differential Revision: [D78193404](https://our.internmc.facebook.com/intern/diff/D78193404/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/158158 Approved by: https://github.com/ColinPeppler	2025-07-15 00:18:31 +00:00
bobrenjc93	5221448574	multi-kernel matmuls based on varying hint sizes (#156628 ) The core idea is to generate multiple matmul kernels using different hints for symbolic variables, then select the most appropriate one at runtime for each unique shape we encounter. You can find some early experimentation details in these posts: https://fb.workplace.com/groups/8940092306109185/posts/9803850776399996/ https://fb.workplace.com/groups/8940092306109185/posts/9695805170537891/ https://fb.workplace.com/groups/257735836456307/posts/906589324904285/ Here’s a graph illustrating the empirically observed worst-case performance if an oracle always selected the least optimal hint for a given runtime size: ![image](https://github.com/user-attachments/assets/6d90ee06-a572-453e-9cba-03006f343301) This graph illustrates the performance of a hint size of 64 relative to the worst case. Notice that as the runtime sizes increase, the performance gradually approaches the worst case: ![image](https://github.com/user-attachments/assets/85ad49fe-165a-474c-8d03-db2e57654213) This graph shows the performance of a hint size of 4096 — very poor for small sizes, and also suboptimal for some mid-sized shapes: ![image](https://github.com/user-attachments/assets/adea1106-3bc8-40f3-97b0-20d940fb74f1) Finally, here’s the graph that motivated this PR. It illustrates the performance when selecting the best of three kernels generated with three different hints — 64, 256, and 4096: ![image](https://github.com/user-attachments/assets/a7cb0ce5-8139-48b1-b5c9-7670e75cbfce) ## How to review this PR At a high level, this extends @shunting314's multi-kernel abstraction to support varying GEMM choices driven by different hints. A few key points: 1. Unlike reduction kernels, triton template matmuls pass their grid as arguments to the kernel. This PR updates `MultiKernelCall` to support kernels with varying arguments. 2. The `V.graph.sizevars.size_hints` API is extended to accept a `hint_override`, allowing us to substitute the example input’s size hint with a custom value when generating multiple kernels. 3. The choice generation and benchmarking logic is updated to support multiple hint values. One kernel is generated per value in `torch._inductor.config.multi_kernel_hints`, and at runtime, we select the most suitable kernel for the current shape. 4. This PR does not add support for cpp wrapper codegen to keep it scoped. That will be added in the next PR. ## Results The following is a basic test that shows our basic multi kernel working where we no longer show significant variance based on the original hint size: https://gist.github.com/bobrenjc93/ba711d529e65fd65839b34799f6323ec Before ``` Hint\Runtime \| 64 \| 256 \| 4096 --------------------------------------------------- 64 \| 0.0948 \| 0.3124 \| 4.9477 256 \| 0.2243 \| 0.2256 \| 3.3880 4096 \| 0.3384 \| 0.3404 \| 3.3010 ``` After ``` Hint\Runtime \| 64 \| 256 \| 4096 --------------------------------------------------- 64 \| 0.0951 \| 0.2289 \| 3.3013 256 \| 0.0952 \| 0.2258 \| 3.4045 4096 \| 0.0957 \| 0.2231 \| 3.3146 ``` We also see an average speedup of 5.04% for the matrix of all hint/runtime pairs in [64, 4096] for every increment of 64: https://docs.google.com/spreadsheets/d/12TmYUDrAAFASGuP3POXTKPeAvQWIRzKzdrVSIb3vQkA/edit?gid=480268938#gid=480268938 ![Worst Case, multi-kernel](https://github.com/user-attachments/assets/712df23b-87e2-4d9d-95c2-cc25305ba2ed) NB: This is just the beginning and I plan on doing more investigation to see further improve on this initial result. For posterity the script used to generate that matrix is here: https://gist.github.com/bobrenjc93/c211fd0bd97fad8f46b91ad9dee76ad0 HUD benchmark runs: base: https://github.com/pytorch/pytorch/actions/runs/15889871988 head: https://github.com/pytorch/pytorch/actions/runs/15889876842 Pull Request resolved: https://github.com/pytorch/pytorch/pull/156628 Approved by: https://github.com/jansel	2025-07-12 15:08:21 +00:00
Xu Han	aacb944079	[aot inductor] fix clang-asan for consts_cpp. (#158175 ) From the perivous PR: https://github.com/pytorch/pytorch/pull/157608 , I added `format_consts_to_cpp` to build consts bytes. But it still raise clang ASAN `stack alloction`, when build large size consts. This PR: 1. add `test_aot_inductor_consts_cpp_build` to stack allocation skip list. 2. add ATTRIBUTE_NO_SANITIZE_ADDRESS to skip ASAN check, because consts array is locate in global area. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158175 Approved by: https://github.com/jansel	2025-07-12 07:14:05 +00:00
Xuehai Pan	7f14b42adf	[BE][2/16] fix typos in torch/ (torch/_*/) (#156312 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/156312 Approved by: https://github.com/albanD	2025-07-12 05:47:06 +00:00
PyTorch MergeBot	9c189ed29a	Revert "multi-kernel matmuls based on varying hint sizes (#156628 )" This reverts commit 6c795306378c47341d58109da03371bba2bec46e. Reverted https://github.com/pytorch/pytorch/pull/156628 on behalf of https://github.com/huydhn due to Sorry for reverting your change but some ROCM jobs went crazy after this lands, so I try to see if reverting helps ([comment](https://github.com/pytorch/pytorch/pull/156628#issuecomment-3064617123))	2025-07-12 03:48:39 +00:00
bobrenjc93	6c79530637	multi-kernel matmuls based on varying hint sizes (#156628 ) The core idea is to generate multiple matmul kernels using different hints for symbolic variables, then select the most appropriate one at runtime for each unique shape we encounter. You can find some early experimentation details in these posts: https://fb.workplace.com/groups/8940092306109185/posts/9803850776399996/ https://fb.workplace.com/groups/8940092306109185/posts/9695805170537891/ https://fb.workplace.com/groups/257735836456307/posts/906589324904285/ Here’s a graph illustrating the empirically observed worst-case performance if an oracle always selected the least optimal hint for a given runtime size: ![image](https://github.com/user-attachments/assets/6d90ee06-a572-453e-9cba-03006f343301) This graph illustrates the performance of a hint size of 64 relative to the worst case. Notice that as the runtime sizes increase, the performance gradually approaches the worst case: ![image](https://github.com/user-attachments/assets/85ad49fe-165a-474c-8d03-db2e57654213) This graph shows the performance of a hint size of 4096 — very poor for small sizes, and also suboptimal for some mid-sized shapes: ![image](https://github.com/user-attachments/assets/adea1106-3bc8-40f3-97b0-20d940fb74f1) Finally, here’s the graph that motivated this PR. It illustrates the performance when selecting the best of three kernels generated with three different hints — 64, 256, and 4096: ![image](https://github.com/user-attachments/assets/a7cb0ce5-8139-48b1-b5c9-7670e75cbfce) ## How to review this PR At a high level, this extends @shunting314's multi-kernel abstraction to support varying GEMM choices driven by different hints. A few key points: 1. Unlike reduction kernels, triton template matmuls pass their grid as arguments to the kernel. This PR updates `MultiKernelCall` to support kernels with varying arguments. 2. The `V.graph.sizevars.size_hints` API is extended to accept a `hint_override`, allowing us to substitute the example input’s size hint with a custom value when generating multiple kernels. 3. The choice generation and benchmarking logic is updated to support multiple hint values. One kernel is generated per value in `torch._inductor.config.multi_kernel_hints`, and at runtime, we select the most suitable kernel for the current shape. 4. This PR does not add support for cpp wrapper codegen to keep it scoped. That will be added in the next PR. ## Results The following is a basic test that shows our basic multi kernel working where we no longer show significant variance based on the original hint size: https://gist.github.com/bobrenjc93/ba711d529e65fd65839b34799f6323ec Before ``` Hint\Runtime \| 64 \| 256 \| 4096 --------------------------------------------------- 64 \| 0.0948 \| 0.3124 \| 4.9477 256 \| 0.2243 \| 0.2256 \| 3.3880 4096 \| 0.3384 \| 0.3404 \| 3.3010 ``` After ``` Hint\Runtime \| 64 \| 256 \| 4096 --------------------------------------------------- 64 \| 0.0951 \| 0.2289 \| 3.3013 256 \| 0.0952 \| 0.2258 \| 3.4045 4096 \| 0.0957 \| 0.2231 \| 3.3146 ``` We also see an average speedup of 5.04% for the matrix of all hint/runtime pairs in [64, 4096] for every increment of 64: https://docs.google.com/spreadsheets/d/12TmYUDrAAFASGuP3POXTKPeAvQWIRzKzdrVSIb3vQkA/edit?gid=480268938#gid=480268938 ![Worst Case, multi-kernel](https://github.com/user-attachments/assets/712df23b-87e2-4d9d-95c2-cc25305ba2ed) NB: This is just the beginning and I plan on doing more investigation to see further improve on this initial result. For posterity the script used to generate that matrix is here: https://gist.github.com/bobrenjc93/c211fd0bd97fad8f46b91ad9dee76ad0 HUD benchmark runs: base: https://github.com/pytorch/pytorch/actions/runs/15889871988 head: https://github.com/pytorch/pytorch/actions/runs/15889876842 Pull Request resolved: https://github.com/pytorch/pytorch/pull/156628 Approved by: https://github.com/jansel	2025-07-11 19:38:10 +00:00

1 2 3 4 5 ...

743 Commits