129 Commits

Author SHA1 Message Date
9fff8155c3 [2/N] Fix clang-tidy readability checks (#164652)
This PR applies clang-tidy readability checks to jit sources and all headers in the code base.
`readability-redundant-inline-specifier` is suppressed because it incurs too many changes. `readability-redundant-inline-specifier` is used to detect redundant inline specifiers on function and variable declarations. There are many in-class method definitions that are marked inline.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164652
Approved by: https://github.com/Skylion007
2025-10-06 01:06:01 +00:00
2c5ed6e7c0 Revert "[2/N] Fix clang-tidy readability checks (#164652)"
This reverts commit 3c5ca685d6f5b6f3971c0cd20a054aa355610419.

Reverted https://github.com/pytorch/pytorch/pull/164652 on behalf of https://github.com/izaitsevfb due to need to revert due to a conflict with revert of https://github.com/pytorch/pytorch/pull/162659 ([comment](https://github.com/pytorch/pytorch/pull/164652#issuecomment-3369346707))
2025-10-05 21:36:57 +00:00
3c5ca685d6 [2/N] Fix clang-tidy readability checks (#164652)
This PR applies clang-tidy readability checks to jit sources and all headers in the code base.
`readability-redundant-inline-specifier` is suppressed because it incurs too many changes. `readability-redundant-inline-specifier` is used to detect redundant inline specifiers on function and variable declarations. There are many in-class method definitions that are marked inline.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164652
Approved by: https://github.com/Skylion007
2025-10-05 07:05:11 +00:00
bf717ce346 [AOTI win] Add ABI stable method for updating constant buffer (#163819)
Add `struct AOTInductorConstantMapEntry` to represent the constant map in AOTI Model. We cannot use `std::unordered_map` for cross-compilation, because it is not ABI stable.

It will be tested when we test `update_user_managed_constant_buffer` for Windows cross-compilation.

Example usage:

```
        // Load constants. Create random constants here.
        auto* fc1_w = new slim::SlimTensor(slim::empty({16, 10}, c10::kFloat, c10::Device(c10::kCUDA, 0)));
        fc1_w->fill_(1.0);

.....

        // Build pairs
        std::vector<AOTInductorConstantPair> constants{
            {"fc1_weight", fc1_w},
            {"fc1_bias",   fc1_b},
            {"fc2_weight", fc2_w},
            {"fc2_bias",   fc2_b},
        };

        // Call runtime (pass raw pointer + size)
        update_user_managed_constant_buffer_abi(
            container_handle,
            constants.data(),
            constants.size(),
            /*use_inactive=*/false,
            /*validate_full_update=*/true);
```
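
The reason for passing a raw pointer plus a size can be illustrated with a minimal sketch. The names `ConstantPairSketch` and `to_map` are hypothetical and are not the actual AOTI types; the sketch only assumes that a plain struct of pointers keeps the same layout across compilers, so the map is rebuilt on the callee's side of the boundary.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for an ABI-stable entry: only C-compatible members.
struct ConstantPairSketch {
  const char* name;  // null-terminated constant name
  void* handle;      // opaque tensor handle
};

// Rebuilds a map inside the callee, so no std:: container crosses the boundary.
std::unordered_map<std::string, void*> to_map(
    const ConstantPairSketch* pairs, std::size_t num_pairs) {
  assert(pairs != nullptr || num_pairs == 0);
  std::unordered_map<std::string, void*> result;
  for (std::size_t i = 0; i < num_pairs; ++i) {
    result.emplace(pairs[i].name, pairs[i].handle);
  }
  return result;
}
```

The caller only ever hands over `pairs.data()` and `pairs.size()`, which mirrors the call shape in the example usage.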

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163819
Approved by: https://github.com/desertfire
2025-10-02 18:31:00 +00:00
28c1d2f81b [aoti] AOTI mingw cross compilation (#163188)
To run this, you need to install `mingw64-gcc-c++` and download the Windows CUDA library toolkit.

See design doc and demo instructions in https://docs.google.com/document/d/1iDaChqA5nNKkBFTzsdkmoomvQlXHbnlb1Z4yEp7xaJA/edit?tab=t.0

If cross_platform_target is windows, we do the following:

- do not link to `sleef`. This can be improved in the future if we need it. Currently I avoid it because that requires extra setup on the linux side
- Use `mingw64-gcc-c++` to compile
- Use `WINDOWS_CUDA_HOME` instead of `CUDA_HOME` when linking to cuda

```
python test/inductor/test_aot_inductor_windows.py -k so
```

Other changes:
- Decouples the compile_standalone config from the dynamic link flag
- Creates a new aot_inductor_mode config module, which is used to control configs in aot_inductor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163188
Approved by: https://github.com/desertfire
2025-10-01 02:22:06 +00:00
8f30a8dc47 [AOTInductor] Add grid information for Triton Kernels (#160131)
Summary:
Add grid information for Triton Kernels for profiling in Kineto.

Test Plan:
Before change:
<img width="539" height="625" alt="Screenshot 2025-08-07 at 1 09 07 PM" src="https://github.com/user-attachments/assets/dd0778a9-2ff3-4819-acd3-de585cf7f9d1" />

After change:
<img width="550" height="898" alt="Screenshot 2025-08-07 at 1 05 49 PM" src="https://github.com/user-attachments/assets/d84988df-bb83-41ed-80ac-8a6d843a1a9d" />

*Note: grid size and similar details can be extracted from the device-side trace, but this PR focuses on the host side, mainly so that more host-side information needed for performance profiling can be added in the future.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160131
Approved by: https://github.com/desertfire
2025-09-23 02:15:24 +00:00
9f8a311af0 [Inductor][Intel GPU] Save threads_per_warp from Triton compiled kernel for launching kernel correctly in cpp wrapper. (#163315)
On the Inductor XPU backend, `threads_per_warp` is not always 32. For Intel GEMM Triton kernels, it can be 16. This information must be preserved for XPU so that the Cpp wrapper can launch the kernel with the correct configuration.
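
A minimal sketch of why the value matters, assuming the launch size is derived as `num_warps * threads_per_warp`. The struct and function names are hypothetical and are not taken from the cpp wrapper.

```cpp
#include <cassert>

// Hypothetical metadata saved alongside a compiled kernel.
struct KernelLaunchMeta {
  int num_warps;
  int threads_per_warp;  // often 32, but may be 16 for Intel GEMM kernels
};

// Threads per work-group that the wrapper would request at launch.
int threads_per_group(const KernelLaunchMeta& meta) {
  assert(meta.num_warps > 0 && meta.threads_per_warp > 0);
  return meta.num_warps * meta.threads_per_warp;
}
```

With a hard-coded 32, a kernel compiled for 16 threads per warp would be launched with twice the intended number of threads.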

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163315
Approved by: https://github.com/EikanWang, https://github.com/desertfire
2025-09-19 21:06:56 +00:00
2291199e9b [AOTInductor] Use CudaCachingAllocator for memory allocation (#162893)
Summary:
Use c10::CudaCachingAllocator for AOTInductor's initial constant buffer
allocation.

Test Plan:
Activate test under test/cpp/aoti_inference/test.cpp

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162893
Approved by: https://github.com/desertfire
2025-09-17 17:08:20 +00:00
b16d3f4c8c [AOTI] Fix a bug from load_constants (#161887)
Summary:
`constants_internal_offset` is sized for the non-folded constants only:
```
std::vector<size_t> constants_internal_offset(
        num_constants - num_folded_constants);
```

but the for loop does not account for this and indexes it with `i`, which ranges over all constants:
```
for (size_t i = 0; i < num_constants; i++) {
...
constants_internal_offset[i]
...
```
The for loop does skip folded constants:
```
bool from_folded = this->constant_from_folded(i);
      if (from_folded) {
        continue;
      }
```
but `i` could still be the wrong index into `constants_internal_offset`.
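
One way to avoid this kind of index mismatch, sketched under assumptions: keep a separate counter that advances only for non-folded constants. The function and parameter names are hypothetical, and this is not the actual change in the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns, for each non-folded constant, its offset into a packed blob.
// `from_folded[i]` and `sizes[i]` are hypothetical per-constant inputs.
std::vector<std::size_t> internal_offsets(
    const std::vector<bool>& from_folded,
    const std::vector<std::size_t>& sizes) {
  assert(from_folded.size() == sizes.size());
  std::size_t num_folded = 0;
  for (bool folded : from_folded) {
    num_folded += folded ? 1 : 0;
  }

  std::vector<std::size_t> offsets(from_folded.size() - num_folded);
  std::size_t slot = 0;    // index into `offsets`; advances only for kept constants
  std::size_t cursor = 0;  // running byte offset
  for (std::size_t i = 0; i < from_folded.size(); ++i) {
    if (from_folded[i]) {
      continue;  // folded constants own no slot
    }
    offsets[slot++] = cursor;
    cursor += sizes[i];
  }
  return offsets;
}
```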

Differential Revision: D81425007

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161887
Approved by: https://github.com/angelayi
2025-09-03 07:45:16 +00:00
b4ad38279b [AOTI] Add Windows-compatible implementation of the mmap-related funcs (#161805)
Add Windows-compatible implementation of the mmap-related functions.

This code was validated in a small development project: https://github.com/xuhancn/cross_os_mmap?tab=readme-ov-file#cross_os_mmap

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161805
Approved by: https://github.com/angelayi
2025-09-02 20:07:41 +00:00
40311e2ec1 [AOTInductor] ABI-Compatibility for RecordFunction. (#159842)
Summary:
Previously, our implementation of RecordFunction injected ATen into
codegen, which breaks the ABI contract for AOTInductor.

c10::IValue is added to call the full record function. Extensions with
more profiling info will come in later PRs.

Test Plan:
Included in commit.

Differential Revision: [D79622071](https://our.internmc.facebook.com/intern/diff/D79622071)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159842
Approved by: https://github.com/desertfire
2025-08-15 21:45:47 +00:00
3ddfd46bd2 Cut a version of TORCH_ERROR_CODE_CHECK in headeronly from AOTI (#159604)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159604
Approved by: https://github.com/albanD, https://github.com/desertfire
2025-08-06 00:29:56 +00:00
096dc35d77 [aoti][mps] Fix update constants buffer (#158349)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158349
Approved by: https://github.com/malfet
2025-07-23 00:54:52 +00:00
9894d43b6c [AOTI] explicit aoti wrapper functions for Windows. (#158713)
On Windows, the exported APIs need explicit declarations, because the package loader calls these APIs via GetProcAddress.
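
A minimal sketch of what an explicit export declaration can look like. `SKETCH_EXPORT` and `sketch_entry_point` are hypothetical names; the macro AOTI actually uses may differ.

```cpp
#include <cassert>

// Hypothetical export macro; the real AOTI macro name may differ.
#if defined(_WIN32)
#define SKETCH_EXPORT extern "C" __declspec(dllexport)
#else
#define SKETCH_EXPORT extern "C" __attribute__((visibility("default")))
#endif

// An unmangled, explicitly exported entry point that a loader could look up
// by name (GetProcAddress on Windows, dlsym elsewhere).
SKETCH_EXPORT int sketch_entry_point(int x) {
  return x + 1;
}
```

`extern "C"` keeps the symbol name unmangled, so a by-name lookup can use the plain function name.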

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158713
Approved by: https://github.com/desertfire
2025-07-21 15:59:44 +00:00
b6454a9058 [AOT_inductor] model_base.h add Windows include files. (#158477)
Add Windows include files to model_base.h.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158477
Approved by: https://github.com/desertfire, https://github.com/jansel
2025-07-17 00:57:48 +00:00
52772765e0 Change AOTI_RUNTIME_DEVICE_CHECK to be device specific (#157818)
Summary:
Change AOTI_RUNTIME_DEVICE_CHECK to the following depending on device:

AOTI_RUNTIME_CUDA_CHECK
AOTI_RUNTIME_XPU_CHECK
AOTI_RUNTIME_CPU_CHECK

Currently in the codebase, only `AOTI_RUNTIME_CUDA_CHECK` is used.

This shouldn't change anything as of now, but we do this to prepare for simultaneously loading multiple backends (e.g. CPU and CUDA) in AOTI standalone.

We don't want people writing `AOTI_RUNTIME_DEVICE_CHECK` for both CPU and CUDA checks. This could cause compilation problems when we statically link both CPU and CUDA models.

Test Plan:
CI

Reviewed By: muchulee8

Differential Revision: D77742977

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157818
Approved by: https://github.com/jingsh
2025-07-09 18:34:56 +00:00
6f05d58f2b [AOTI] Split aoti_runtime/model.h to prepare for model static linking (#157592)
Summary:
Prepare for https://github.com/pytorch/pytorch/pull/157129.

We split the file so we can reuse the `model.h` part to codegen a separate header for each model in static linkage.

Test Plan:
CI

Differential Revision: D77761249

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157592
Approved by: https://github.com/desertfire
2025-07-07 22:13:22 +00:00
5b210bb3a6 [BE][9/16] fix typos in torch/ (torch/csrc/) (#156319)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156319
Approved by: https://github.com/albanD
ghstack dependencies: #156313, #156314, #156315, #156316, #156317
2025-06-23 02:57:50 +00:00
1d3bca40ed Revert "[BE][9/16] fix typos in torch/ (torch/csrc/) (#156319)"
This reverts commit a23ccaa8479e038e79532759a64e9947c0fac43d.

Reverted https://github.com/pytorch/pytorch/pull/156319 on behalf of https://github.com/atalman due to export/test_torchbind.py::TestCompileTorchbind::test_compile_error_on_input_aliasing_contents_backend_aot_eager [GH job link](https://github.com/pytorch/pytorch/actions/runs/15804799771/job/44548489912) [HUD commit link](c95f7fa874) ([comment](https://github.com/pytorch/pytorch/pull/156313#issuecomment-2994171213))
2025-06-22 12:31:56 +00:00
a23ccaa847 [BE][9/16] fix typos in torch/ (torch/csrc/) (#156319)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156319
Approved by: https://github.com/albanD
ghstack dependencies: #156313, #156314, #156315, #156316, #156317
2025-06-22 08:43:49 +00:00
a4ab392251 [aoti][mps] mps constants support (#154287)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154287
Approved by: https://github.com/malfet
ghstack dependencies: #155752
2025-06-12 23:33:07 +00:00
3040ca6d0f [Cutlass] Include fp8 headers in aoti cpp wrapper (#155173)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155173
Approved by: https://github.com/desertfire
ghstack dependencies: #154829, #154835, #155195
2025-06-11 01:21:16 +00:00
447b481c79 [AOTI] Save data sizes to constants_info (#154534)
Differential Revision: D75223179

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154534
Approved by: https://github.com/muchulee8
2025-05-29 06:39:13 +00:00
eaf355cb11 [BE] Clean up unused parameter input in AOTIModel (#154276)
Summary: As title

Test Plan: CI

Differential Revision: D74691763

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154276
Approved by: https://github.com/Skylion007
2025-05-27 22:17:32 +00:00
870133b2a0 Use get_device_context in aoti runtime for XPU directly (#154360)
# Motivation
Reuse [c10::xpu::get_device_context](1bebe0424e/c10/xpu/XPUFunctions.h (L27)) directly to reduce overhead, as it returns a cached `sycl::context` managed by PyTorch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154360
Approved by: https://github.com/EikanWang
2025-05-27 05:55:59 +00:00
b33b7d5c8c [aoti] Add MPS runner and shim (#153964)
Added AOTIModelContainerRunnerMps and a shim for mps fallback ops.
I also added a mps-specific shim which contains one operator, which will be used to set arguments being passed to the Metal kernel:

```
AOTI_TORCH_EXPORT AOTITorchError aoti_torch_mps_set_arg(
    AOTIMetalKernelFunctionHandle func,
    unsigned idx,
    AtenTensorHandle tensor);
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153964
Approved by: https://github.com/malfet, https://github.com/desertfire
2025-05-23 05:45:35 +00:00
a82c8891d5 Revert "[aoti] Add MPS runner and shim (#153964)"
This reverts commit 918ae5d36188f419a47f3b1315f9fb373035ed66.

Reverted https://github.com/pytorch/pytorch/pull/153964 on behalf of https://github.com/angelayi due to broke frl build ([comment](https://github.com/pytorch/pytorch/pull/153964#issuecomment-2901876832))
2025-05-22 16:35:59 +00:00
5b6e551c0f [AOTI][refactor] Fix an anonymous namespace issue (#154033)
Summary: Remove anonymous namespace in model_container.h to fix the following compiler warning,
```
warning: ‘torch::aot_inductor::AOTInductorModelContainer’ has a field ‘torch::aot_inductor::AOTInductorModelContainer::constant_folded_’ whose type uses the anonymous namespace [-Wsubobject-linkage]
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154033
Approved by: https://github.com/chenyang78
2025-05-21 22:29:09 +00:00
918ae5d361 [aoti] Add MPS runner and shim (#153964)
Added AOTIModelContainerRunnerMps and a shim for mps fallback ops.
I also added a mps-specific shim which contains one operator, which will be used to set arguments being passed to the Metal kernel:

```
AOTI_TORCH_EXPORT AOTITorchError aoti_torch_mps_set_arg(
    AOTIMetalKernelFunctionHandle func,
    unsigned idx,
    AtenTensorHandle tensor);
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153964
Approved by: https://github.com/malfet, https://github.com/desertfire
2025-05-21 21:55:59 +00:00
dcb3edd30d [AOTI][XPU] Refactor AOTInductor runtime API for Intel GPU. (#153929)
Simplify and improve code format for sycl_runtime_wrappers.h

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153929
Approved by: https://github.com/desertfire
ghstack dependencies: #153924
2025-05-21 03:52:54 +00:00
0087f5f0af [AOTI][XPU] Embed SPIR-V files into .so (#153924)
Following the design of #150739, this PR supports embedding kernel SPIR-V files so AOTI is one step closer to generating a single binary.
Fixes #153829
Fixes #153830

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153924
Approved by: https://github.com/desertfire
2025-05-20 17:38:53 +00:00
33a5179269 [AOTI][reland2] Remove typedef for half and bfloat16 (#153467)
Summary:
Reland https://github.com/pytorch/pytorch/pull/151109 after fixing cutlass AOTI build issues.

typedef is prone to name collision. Explicitly spell out the actual aten types, needed for the standalone AOTI codegen.

Differential Revision: D74398762

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153467
Approved by: https://github.com/jingsh, https://github.com/henrylhtsang, https://github.com/cyyever
2025-05-14 02:37:18 +00:00
c227865720 [AOTInductor] Fix state of ConstantFolding (#153152)
Summary:
Bug fix for constant folding states. We were not setting the correct state on each update.
One race condition would be:
(1) All threads obtain the model_exec_lock from the main run.
(2) In the second round of updating the constant buffer, we should have set the secondary buffer as INITIALIZED, but the primary is mistakenly set instead.
(3) run_const_fold gets called and a model_exec_lock is obtained, waiting for availability at this time.
(4) The main run enters INITIALIZED, waiting for a unique_lock (while a shared_lock is being held by (3) at this moment).

Test Plan:
TBD

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153152
Approved by: https://github.com/jingsh, https://github.com/chenyang78
2025-05-09 16:03:05 +00:00
470cd3a995 [aotinductor] Don't alloc weights if they don't exist (#152692)
Fixes https://github.com/pytorch/pytorch/issues/152356

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152692
Approved by: https://github.com/henrylhtsang
2025-05-06 02:50:21 +00:00
8d427e9e76 [AOTInductor] Inherit Buffer if not being updated (#152092)
Summary: Inherit buffer from original constants buffer if it's not being updated.

Test Plan: TBD

Differential Revision: D73571260

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152092
Approved by: https://github.com/kflu, https://github.com/jingsh
2025-04-26 04:28:23 +00:00
c3a18f6126 [AOTInductor] Add states for constant folding process (#151273)
Summary:
We add states to the constant folding process for AOTInductor.
There are 3 states:
(1) None: The state when no constants are loaded and the model is uninitialized.
(2) Initialized: The state when constants are loaded, but not yet
folded.
(3) Folded: The state where the model is fully ready with folded
constants.

Note that even if constant folding is not enabled, we still only run
when the state is FOLDED. This is okay because without constant folding, the
transition from INITIALIZED to FOLDED is just a pass-through.
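
The three states and the pass-through transition can be sketched as follows. All type and member names are hypothetical; this is an illustration of the description above, not the container's real code.

```cpp
#include <cassert>

// Hypothetical mirror of the three states described above.
enum class FoldStateSketch { None, Initialized, Folded };

struct ContainerSketch {
  FoldStateSketch state = FoldStateSketch::None;
  bool folding_enabled = false;
  int fold_runs = 0;  // counts how often folding work actually ran

  void load_constants() { state = FoldStateSketch::Initialized; }

  // Moves Initialized -> Folded; without folding this is a pass-through.
  void ensure_folded() {
    assert(state != FoldStateSketch::None);
    if (state == FoldStateSketch::Initialized) {
      if (folding_enabled) {
        ++fold_runs;
      }
      state = FoldStateSketch::Folded;
    }
  }

  // Running is only allowed once the state is Folded.
  bool can_run() const { return state == FoldStateSketch::Folded; }
};
```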

Test Plan:
python test/inductor/test_aot_inductor.py -k test_constant_folding_with_update

Differential Revision: [D73002538](https://our.internmc.facebook.com/intern/diff/D73002538)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151273
Approved by: https://github.com/jingsh, https://github.com/desertfire
2025-04-17 16:41:38 +00:00
70e7b76707 [AOTInductor] Add Python interface for user managed buffer. (#151141)
Summary: Add pybind for user managed buffer in update_constants_buffer.

Test Plan:
Included in commit.
```
python test/inductor/test_aot_inductor.py -k user_managed
```

Differential Revision: D72892310

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151141
Approved by: https://github.com/henrylhtsang, https://github.com/desertfire
2025-04-15 09:36:30 +00:00
f3cf3ec591 [AOTInductor] Add User Managed buffer for AOTI constant buffer. (#150276)
Summary:
We add the functionality to allow users to directly pass an at::Tensor
into AOTInductor, which is then used as the constant.
This user managed buffer skips the copying step in AOTInductor and lets
users directly manage the memory usage themselves.

Test Plan:
LD_LIBRARY_PATH=/data/users/$USER/pytorch/build/lib
/data/users/$USER/pytorch/build/bin/test_aoti_inference

Differential Revision: [D72589514](https://our.internmc.facebook.com/intern/diff/D72589514)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150276
Approved by: https://github.com/chenyang78, https://github.com/desertfire
2025-04-10 00:15:44 +00:00
60a45eb862 [AOTInductor] Introduce MaybeOwningAtenTensorHandle for ConstantMap (#150275)
Summary:
We used RAIIAtenTensorHandle for ConstantMap, where RAIIAtenTensorHandle
is a unique_ptr, indicating that all memory handling is done by
AOTInductor internally.

In this PR, we introduce ConstantAtenTensorHandle, which replaces
RAIIAtenTensorHandle. This class holds a raw AtenTensorHandle, and also
owns a RAIIAtenTensorHandle if the user decides to delegate memory
management to AOTInductor.

This is a prerequisite for the user managed buffer. This PR, however, only
introduces this class and makes sure it works with existing AOTInductor
and has default behavior identical to using RAIIAtenTensorHandle.
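
The idea of a handle that always exposes a raw pointer and only sometimes owns it can be sketched as below. `TensorSketch` and `MaybeOwningHandleSketch` are hypothetical stand-ins and do not reproduce the class added in this PR.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Stand-in for an opaque tensor; the real handle type is AtenTensorHandle.
struct TensorSketch {
  int value;
};

// Always exposes a raw pointer; optionally also owns the object.
class MaybeOwningHandleSketch {
 public:
  // Owning: lifetime is delegated to this wrapper.
  explicit MaybeOwningHandleSketch(std::unique_ptr<TensorSketch> owned)
      : raw_(owned.get()), owned_(std::move(owned)) {}

  // Non-owning: the caller keeps managing the memory.
  explicit MaybeOwningHandleSketch(TensorSketch* user_managed)
      : raw_(user_managed) {}

  TensorSketch* get() const { return raw_; }
  bool owns() const { return owned_ != nullptr; }

 private:
  TensorSketch* raw_;                    // declared first, so it is set before the move
  std::unique_ptr<TensorSketch> owned_;  // empty in the non-owning case
};
```

Readers of the map only call `get()`, so the owning and non-owning cases look identical to them.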

Test Plan:
Existing tests. No change should be introduced within this PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150275
Approved by: https://github.com/chenyang78, https://github.com/desertfire
2025-04-05 06:00:35 +00:00
07d439e782 [aoti] Split ConstantType definition out of model.h (#150545)
Summary:
Splitting the type definition of ConstantType into a separate header because it's needed by Sigmoid OSS, but including the entire model.h header causes the following compilation error:
```
2025-04-01T18:12:42.0391272Z FAILED: caffe2/CMakeFiles/torch_cpu.dir/__/torch/csrc/nativert/kernels/AOTICallDelegateKernel.cpp.o
2025-04-01T18:12:42.0417705Z /opt/cache/bin/sccache /opt/cache/bin/clang++ -DAT_PER_OPERATOR_HEADERS -DBUILD_ONEDNN_GRAPH -DCAFFE2_BUILD_MAIN_LIB -DCPUINFO_SUPPORTED_PLATFORM=1 -DFMT_HEADER_ONLY=1 -DFXDIV_USE_INLINE_ASSEMBLY=0 -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DIDEEP_USE_MKL -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DNNP_CONVOLUTION_ONLY=0 -DNNP_INFERENCE_ONLY=0 -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DTORCH_ENABLE_LLVM -DUSE_C10D_GLOO -DUSE_DISTRIBUTED -DUSE_EXTERNAL_MZCRC -DUSE_RPC -DUSE_TENSORPIPE -DXNN_LOG_LEVEL=0 -D_FILE_OFFSET_BITS=64 -Dtorch_cpu_EXPORTS -I/var/lib/jenkins/workspace/build/aten/src -I/var/lib/jenkins/workspace/aten/src -I/var/lib/jenkins/workspace/build -I/var/lib/jenkins/workspace -I/var/lib/jenkins/workspace/cmake/../third_party/benchmark/include -I/opt/llvm/include -I/var/lib/jenkins/workspace/third_party/onnx -I/var/lib/jenkins/workspace/build/third_party/onnx -I/var/lib/jenkins/workspace/nlohmann -I/var/lib/jenkins/workspace/torch/csrc/api -I/var/lib/jenkins/workspace/torch/csrc/api/include -I/var/lib/jenkins/workspace/caffe2/aten/src/TH -I/var/lib/jenkins/workspace/build/caffe2/aten/src/TH -I/var/lib/jenkins/workspace/build/caffe2/aten/src -I/var/lib/jenkins/workspace/build/caffe2/../aten/src -I/var/lib/jenkins/workspace/torch/csrc -I/var/lib/jenkins/workspace/third_party/miniz-3.0.2 -I/var/lib/jenkins/workspace/third_party/kineto/libkineto/include -I/var/lib/jenkins/workspace/third_party/kineto/libkineto/src -I/var/lib/jenkins/workspace/third_party/cpp-httplib -I/var/lib/jenkins/workspace/aten/src/ATen/.. -I/var/lib/jenkins/workspace/third_party/FXdiv/include -I/var/lib/jenkins/workspace/c10/.. 
-I/var/lib/jenkins/workspace/third_party/pthreadpool/include -I/var/lib/jenkins/workspace/third_party/cpuinfo/include -I/var/lib/jenkins/workspace/aten/src/ATen/native/quantized/cpu/qnnpack/include -I/var/lib/jenkins/workspace/aten/src/ATen/native/quantized/cpu/qnnpack/src -I/var/lib/jenkins/workspace/aten/src/ATen/native/quantized/cpu/qnnpack/deps/clog/include -I/var/lib/jenkins/workspace/third_party/NNPACK/include -I/var/lib/jenkins/workspace/third_party/fbgemm/include -I/
2025-04-01T18:12:42.0444143Z In file included from /var/lib/jenkins/workspace/torch/csrc/nativert/kernels/AOTICallDelegateKernel.cpp:5:
2025-04-01T18:12:42.0445081Z In file included from /var/lib/jenkins/workspace/torch/csrc/nativert/executor/AOTIDelegateExecutor.h:6:
2025-04-01T18:12:42.0446002Z In file included from /var/lib/jenkins/workspace/torch/csrc/nativert/executor/AOTInductorModelImpl.h:5:
2025-04-01T18:12:42.0447549Z /var/lib/jenkins/workspace/torch/csrc/inductor/aoti_runtime/model.h:78:13: error: function 'RAII_cpuMalloc' is not needed and will not be emitted [-Werror,-Wunneeded-internal-declaration]
2025-04-01T18:12:42.0448656Z RAIIDataPtr RAII_cpuMalloc(size_t num_bytes) {
```

model.h defines the RAII_malloc functions directly in an anonymous namespace, which seems pretty sad. We should do something about it, but maybe not in the current diff.

Test Plan: CI

Differential Revision: D72320413

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150545
Approved by: https://github.com/desertfire
2025-04-04 15:48:45 +00:00
03313c6619 [AOTInductor] Add function for users to extract constants in container (#150163)
Summary: Add extract_constant_map that allows users to inspect the constants being used by AOTInductor

Test Plan:
`python test/inductor/test_aot_inductor.py -k extract_constants_map`

`LD_LIBRARY_PATH=/data/users/$USER/pytorch/build/lib /data/users/$USER/pytorch/build/bin/test_aoti_inference`

Differential Revision: D72020400

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150163
Approved by: https://github.com/chenyang78
2025-03-29 03:36:12 +00:00
e6afb51805 [AOTInductor] Free folded constants that's managed by AOTInductor (#149825)
internally.

Summary:
This diff allows freeing folded constants that are created by
AOTInductor through CUDACachingAllocator instead of the constant blob
from cudaMalloc directly.

Test Plan:
LD_LIBRARY_PATH=/data/users/$USER/pytorch/build/lib
/home/$USER/local/pytorch/build/bin/test_aoti_inference

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149825
Approved by: https://github.com/chenyang78, https://github.com/desertfire, https://github.com/jingsh
2025-03-27 06:05:50 +00:00
12628ba24d [AOTInductor] Bug fix for freeing buffers when freeing multiple times (#149810)
Summary:
We might free the active buffer if we free the buffer twice.
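
One way to make such a free operation safe to repeat, sketched under assumptions: select the slot to free by index relative to the active one, so a second call is a no-op. `DoubleBufferSketch` is hypothetical and is not the actual fix in this PR.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical double-buffered constant storage.
class DoubleBufferSketch {
 public:
  DoubleBufferSketch() {
    buffers_[0] = std::make_unique<std::vector<char>>(16);
    buffers_[1] = std::make_unique<std::vector<char>>(16);
  }

  // Frees only the inactive slot, selected by index, so calling it
  // repeatedly is a no-op and never touches the active buffer.
  void free_inactive() { buffers_[1 - active_].reset(); }

  bool active_allocated() const { return buffers_[active_] != nullptr; }
  bool inactive_allocated() const { return buffers_[1 - active_] != nullptr; }

 private:
  std::unique_ptr<std::vector<char>> buffers_[2];
  int active_ = 0;
};
```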

Test Plan:
```
LD_LIBRARY_PATH=/data/users/$USER/pytorch/build/lib
/home/$USER/local/pytorch/build/bin/test_aoti_inference
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149810
Approved by: https://github.com/chenyang78
2025-03-25 20:26:36 +00:00
bb42e4d137 [AOTInductor] Add function to free buffer (#149161)
Summary:
We add a function that allows users to free the unused buffer.

Test Plan:
Testing correctness:
    python test/inductor/test_aot_inductor.py -k free_inactive

Testing memory consumption:
    LD_LIBRARY_PATH=/data/users/$USER/pytorch/build/lib
    /home/$USER/local/pytorch/build/bin/test_aoti_inference

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149161
Approved by: https://github.com/chenyang78, https://github.com/desertfire
ghstack dependencies: #149249
2025-03-18 02:43:14 +00:00
d25617255c Fix AOTI update_constant_buffer issue. (#149243)
Summary:
In D69553929 we changed the logic of constant & buffer update in AOTI. However, this is incompatible with the current Sigmoid runtime, since it uses different logic to pass in buffers, resulting in errors like
```
I0310 17:29:24.456960 3679102 AOTIDelegateExecutor.cpp:89] AOTIDelegateExecutor processing weights
*** Aborted at 1741652964 (Unix time, try 'date -d 1741652964') ***
*** Signal 11 (SIGSEGV) (0x30) received by PID 3679102 (pthread TID 0x7f9933e49000) (linux TID 3679102) (code: address not mapped to object), stack trace: ***
    @ 00000000000040b9 folly::symbolizer::(anonymous namespace)::signalHandler(int, siginfo_t*, void*)
                       ./fbcode/folly/debugging/symbolizer/SignalHandler.cpp:453
    @ 0000000000006c45 folly::fibers::(anonymous namespace)::sigsegvSignalHandler(int, siginfo_t*, void*)
                       ./fbcode/folly/fibers/GuardPageAllocator.cpp:237
    @ 000000000004455f (unknown)
                       /home/engshare/third-party2/glibc/2.34/src/glibc-2.34/signal/../sysdeps/unix/sysv/linux/libc_sigaction.c:8
                       -> /home/engshare/third-party2/glibc/2.34/src/glibc-2.34/signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c
    @ 00000000001e8164 torch::aot_inductor::AOTInductorModelContainer::update_constant_buffer(std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, AtenTensorOpaque*, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, AtenTensorOpaque*> > > const&, bool, bool)
```

Test Plan:
1) Generate lowered merge net
```
CUDA_VISIBLE_DEVICES=0 ../buck-out/v2/gen/fbcode/b5b13003c82cbdec/caffe2/torch/fb/model_transform/fx2trt/packaging/__generate_merge_net_file__/generate_merge_net_file.par  --action=generate --input-file=/home/shengqin/models/aoti_sigmoid_test/cmf_interformer_with_custom_triton_kernels_691990503_0_input --output-file=/home/shengqin/models/aoti_sigmoid_test/cmf_interformer_with_custom_triton_kernels_691990503_0_output.aoti_sigmoid --lower-backend=aot_inductor  --use_sigmoid=true --aot_inductor_config="{'max_autotune': True, 'comprehensive_padding': False}" --add_passes=use_matmul_lce_replace_normal_LCE,use_triton_dot_compress,use_matmul_fuse_lce_replace_first_LCE,use_contiguous_linear_reduction_replace_linear_reduction --disable_acc_tracer=false
```

2) Load net predictor
```
CUDA_VISIBLE_DEVICES=1 ../buck-out/v2/gen/fbcode/103717df3cc2b97a/caffe2/torch/fb/model_transform/fx2trt/packaging/__load_net_predictor__/load_net_predictor --loadMode=AccuracyAB --inputNetFile=/home/shengqin/models/aoti_sigmoid_test/cmf_interformer_with_custom_triton_kernels_691990503_0_output.aoti_ts --otherNetFile=/home/shengqin/models/aoti_sigmoid_test/cmf_interformer_with_custom_triton_kernels_691990503_0_output.aoti_sigmoid --moduleName=merge --benchmarkEnableProfiling=false --predictor_hardware_type=1 --disableStaticRuntime=true
```

Reviewed By: hl475

Differential Revision: D71236710

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149243
Approved by: https://github.com/hl475, https://github.com/jingsh
2025-03-17 22:10:57 +00:00
bdf57fb8f7 [AOTI][refactor] Split MiniArrayRef into a separate header (#149073)
Summary: MiniArrayRef is a common utility and will be used by the libtorch-free AOTI.

Differential Revision: [D71064657](https://our.internmc.facebook.com/intern/diff/D71064657)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149073
Approved by: https://github.com/yushangdi
2025-03-13 11:57:32 +00:00
85467ed063 Fix for AOTI + CUDAGraphs when calling from Python (#148601)
**Background**: I've been comparing performance of torch.compile vs. torch.export + AOTI (specifically, loaded from Python) on the Flux model and found a ~1.4% performance decrease with the latter. The trace shows that CUDAGraphs are not utilized for torch.export + AOTI, leading to higher overhead.

When trying to manually CUDAGraph the loaded, previously exported + AOTIed model (thanks to @eellison for the logic here), I get:
```
Error: operation not permitted when stream is capturing
```

@desertfire confirms that this is due to multi-threading logic on the AOTI runtime side (in `AOTIModelContainer` / `AOTIModel`) conflicting with the use of CUDAGraphs.

**Fix**: This PR takes the approach of providing an alternate, single-threaded method for running loaded models with the AOTI runtime. Details:
* Python side introduces a new flag to enable this behavior (needs a better name): `torch._inductor.package.load_package(..., run_single_threaded=False)`
    * This flag is passed down to the C++ side's `AOTIModelPackageLoader`, which passes it to the `CreateAOTIModelRunnerFunc` during `AOTIModelContainerRunner` construction.
* C++ side introduces single-threaded alternatives to model running and model container running:
    * `AOTIModelContainer.run_single_threaded()` / `AOTIModel.run_single_threaded()`. The interfaces match those of `run()`, but the synchronization logic has been removed.
    * Introduces `AOTInductorModelContainerRunSingleThreaded` to AOTI's `interface.h`; this is invoked by the `AOTIModelContainerRunner` utility class when `run_single_threaded=true`.

I've verified on both a small repro and my real-world use case that I can manually CUDAGraph a loaded model that was previously exported + AOTIed.

**Future work:**
* Flip default value to `run_single_threaded=True` as Python-side inference doesn't take advantage of the AOTI runtime thread pool
    * There are some BC concerns here - models need to be re-serialized so the .so contains the new `AOTInductorModelContainerRunSingleThreaded` interface func. We can flip the default value and warn (instead of crashing) if the `AOTInductorModelContainerRunSingleThreaded` symbol does not exist.
* Compose with cudagraph trees as opposed to manual cuda graph wrapping
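
The relationship between the two entry points can be sketched as below. The synchronization in the real runtime is more involved; a plain mutex stands in for it here purely as a placeholder, and `RunnerSketch` is a hypothetical name.

```cpp
#include <cassert>
#include <mutex>

class RunnerSketch {
 public:
  // Default path: guarded by a lock (placeholder for the runtime's
  // real synchronization logic).
  int run(int input) {
    std::lock_guard<std::mutex> lock(mutex_);
    return compute(input);
  }

  // Single-threaded path: same interface, no synchronization.
  int run_single_threaded(int input) { return compute(input); }

 private:
  static int compute(int input) { return input * 2; }
  std::mutex mutex_;
};
```

Both paths return the same result; only the caller's guarantee of single-threaded use differs.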

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148601
Approved by: https://github.com/desertfire
2025-03-08 02:44:14 +00:00
dfb4094b9c Skip buffer in dense update (#148533)
Summary:
as title.

PyTorch Module buffers will not be published in delta publishing.  In Quinn's previous diff, constant type annotations were introduced.

In addition to skipping constants, we also need to skip a buffer if it is not found in the user-provided delta weights list.

Test Plan: https://docs.google.com/document/d/1wiqUo0PyZ4g6YJIJlL_LE084ZEuE74iu74gZjqGGjWY/edit?tab=t.0#heading=h.dby6cwiw1xrn

Differential Revision: D69553929

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148533
Approved by: https://github.com/22quinn, https://github.com/jingsh
2025-03-07 01:59:58 +00:00
b160dda743 cpp_wrapper: reduce memory usage by removing unneeded temporaries (#147403)
This PR contains a set of interrelated changes, listed below, with the upshot that compiled model memory usage in `cpp_wrapper` mode is now roughly equivalent to the default inductor mode.

Changes:

1. Refactor `reinterpret_view` calls in `cpp_wrapper` to always return a temporary RAII tensor object, rather than saving off a "temporary" tensor handle that persisted through the end of the function. This matches the behavior of the base Python wrapper class, and is responsible for the majority of the memory usage reductions.
2. Eliminate nearly all other cases where a "temporary" tensor handle was saved off (with the exception of one or two places where the tensor would immediately be destroyed by going out-of-scope). This necessitated some ugly-looking code to handle `Optional[Tensor]` and `Optional[Sequence[Any]]`, since `Optional` is passed by pointer into the C-shim functions (making passing temporary objects difficult). This code is justified by the fact that it only appears in controlled circumstances that we auto-generate, so there are minimal user-facing footguns.
3. Delete the list containing the input tensors to the `cpp_wrapper` main function after casting them to `AtenTensorHandle` objects, which have an internal reference count keeping them alive.

The [TorchInductor benchmark](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Sat%2C%2015%20Feb%202025%2018%3A38%3A08%20GMT&stopTime=Sat%2C%2022%20Feb%202025%2018%3A38%3A08%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(a100)&lBranch=gh/benjaminglass1/73/head&lCommit=4d5edaf67e80ca9ca36d301af1ded13967a04790&rBranch=main&rCommit=e1bf892d9004a4dba0748d0eda5c3b4eced0ea70) I ran shows the increased memory compression.

Differential Revision: [D70648897](https://our.internmc.facebook.com/intern/diff/D70648897)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147403
Approved by: https://github.com/desertfire
2025-03-06 16:08:16 +00:00
a5c0dab900 [AOTInductor] Guard RAII_cpuMalloc with macro (#147150)
Summary: Silence RAII_cpuMalloc(size_t) defined but not used [-Wunused-function]

Test Plan: Existing tests

Differential Revision: D69623481

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147150
Approved by: https://github.com/henrylhtsang
2025-02-14 23:21:35 +00:00