## Problem
There are limitations with today's `atomically_apply_size_hint`, even though it works for most of the failures we've observed so far. However, it's easy to come up with an edge case.
Suppose you encounter this setup.
```
a: [s0 + u0]
b: [s1 + u1]
c: [u2 + u3]
d: [u100]
```
Today, we use a few heuristics to specify the LHS and RHS for replacements.
10d2734d9b/torch/_inductor/sizevars.py (L730-L759)
It's possible to end up with these replacement rules. Notice how there's no replacement for `s1 + u1` and `u2 + u3` :( That's because, today, which expression ends up as the LHS vs. the RHS matters a lot, and `s1 + u1` & `u2 + u3` happened to end up on the RHS.
```
s0 + u0 => s1 + u1
s0 + u0 => u2 + u3 # overrides previous replacement; each expr only gets one replacement
s0 + u0 => u100 # overrides previous replacement; ditto
```
I believe what we really want is this: everybody gets a replacement! And they all should (eventually) settle at the same canonical expr (i.e. `u100`) when running the replacement several times.
```
s1 + u1 ==> s0 + u0
u2 + u3 ==> s0 + u0
s0 + u0 ==> u100
```
We can just short-cut this by using the canonical expr as the replacement.
```
s1 + u1 ==> u100
u2 + u3 ==> u100
s0 + u0 ==> u100
```
## Implementation
I offer one way to deal with this:
1. ensure every expression has one canonical replacement (i.e. `u100`)
2. if two expressions are equal (inferred from `deferred_runtime_asserts`), then they must have the same canonical replacement
We can implement the above with union find.
* Whenever you see `Eq(lhs, rhs)` then do `union(lhs, rhs)`.
* Whenever you want to find the canonical replacement for a given expr then do `find(expr)`.
* When picking the canonical replacement we can use a few heuristics, e.g. (1) prefer a fully backed expr, (2) prefer replacing with sub-expressions, and whatever else we'd like (see the sketch below).
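A minimal union-find sketch of this idea (the class, heuristic, and usage below are hypothetical illustrations, not the actual `sizevars.py` implementation):
```python
import sympy


class CanonicalReplacements:
    """Union-find over sympy exprs; find() returns the canonical replacement."""

    def __init__(self) -> None:
        self.parent: dict[sympy.Expr, sympy.Expr] = {}

    def find(self, expr: sympy.Expr) -> sympy.Expr:
        root = expr
        while self.parent.get(root, root) != root:
            root = self.parent[root]
        # Path compression: point every node on the walk directly at the root.
        node = expr
        while node != root:
            nxt = self.parent[node]
            self.parent[node] = root
            node = nxt
        return root

    def union(self, lhs: sympy.Expr, rhs: sympy.Expr) -> None:
        # Call this for every Eq(lhs, rhs) in deferred_runtime_asserts.
        a, b = self.find(lhs), self.find(rhs)
        if a == b:
            return
        # Stand-in heuristic for picking the canonical representative: prefer
        # the expression with fewer ops (the real heuristic would prefer a
        # fully backed expr, sub-expressions, etc., as described above).
        canonical, other = sorted((a, b), key=sympy.count_ops)
        self.parent[other] = canonical


s0, s1, u0, u1, u2, u3, u100 = sympy.symbols("s0 s1 u0 u1 u2 u3 u100")
uf = CanonicalReplacements()
uf.union(s0 + u0, s1 + u1)
uf.union(s0 + u0, u2 + u3)
uf.union(s0 + u0, u100)
# Everybody settles at the same canonical expr, u100.
assert uf.find(s1 + u1) == uf.find(u2 + u3) == uf.find(s0 + u0) == u100
```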
Differential Revision: [D84549260](https://our.internmc.facebook.com/intern/diff/D84549260)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164324
Approved by: https://github.com/laithsakka
### Implementation of #151705
This PR introduces the initial implementation of native `tl.dot` support in Inductor, with the goal of generating Triton matmul kernels directly—without relying on predefined templates.
To avoid complexity and ease the review process, I plan to split this work into two phases as outlined in #151705:
1. **Basic support** (this PR)
2. **Lazy broadcasting** for optimal performance (future PR)
### Summary of This PR
This PR implements the basic functionality. It does **not** include lazy broadcasting, so the generated kernels may involve explicit `tl.reshape` and `tl.trans` operations before calling `tl.dot`, which introduces some overhead.
### Notable Changes
1. Adds a new config flag: `config.triton.enable_native_matmul` (see the usage sketch after this list)
2. Introduces a new `ops.dot` IR node in Inductor and lowers `aten.mm` and `aten.bmm` to it when native matmul is enabled
3. Enforces tiling suitable for matmul when the native matmul flag is enabled
4. Implements code generation for `ops.dot`
5. Adds Triton autotuning heuristics: for now, I’ve copied the configuration from the existing matmul templates. However, this may not be optimal—it currently takes a long time to tune, and I think there must be a better way to tackle this.
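A minimal usage sketch of the flag from item 1 (the flag name comes from this PR; the model and shapes below are illustrative, and the lowering only kicks in where `aten.mm`/`aten.bmm` are hit):
```python
import torch
from torch._inductor import config

# Opt in to native tl.dot codegen instead of the predefined matmul templates.
config.triton.enable_native_matmul = True


@torch.compile
def mm(a, b):
    return a @ b


a = torch.randn(128, 64, device="cuda", dtype=torch.float16)
b = torch.randn(64, 256, device="cuda", dtype=torch.float16)
out = mm(a, b)  # aten.mm lowers to the new ops.dot IR node when the flag is on
```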
@eellison @jansel @PaulZhang12 @shunting314
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157743
Approved by: https://github.com/jansel
As title
On Windows, we cannot modify the .dll to append weights at the end; the Windows .dll loader will complain that it's not a valid .dll file. So we store the weight blob as a separate file.
1. We add the following APIs, which allow querying the size of the weight blob and passing in a pointer to a blob of that size:
```cpp
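// Query the size of the weight blob stored in the container.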
AOTI_API AOTIRuntimeError AOTInductorModelContainerGetConstantsBlobSize(
AOTInductorModelContainerHandle container_handle,
uint64_t* ret_size);
// Load weights from a single blob in weight_blob_ptr
AOTI_API AOTIRuntimeError AOTInductorModelUpdateConstantsFromBlob(
AOTInductorModelContainerHandle container_handle,
const uint8_t* weight_blob_ptr);
```
2. We also add a method in `ModelContainerRunner` to load the weights: if the runner sees that there is a `.blob` file in the package, it will mmap the `.blob` file and use its content to load the constants.
3. We also add the `USE_MMAP_EXTERNAL` macro. When this macro is defined, the model expects to load its weights from an external mmap'd weight blob.
Test Plan:
```
buck run @mode/dev-nosan caffe2/test/inductor:test_aot_inductor -- -r test_large_mmaped_weights_on_disk
```
Also tested Windows cross-compilation with 6542566585/demo/main_voxtral.cpp
```
Loaded model.dll
audio_encoder loaded
C:\Users\shangdiy\source\repos\torchnative\demo\token_embedding\data\aotinductor\model\model.wrapper.so
Loaded model.dll
token_embedding loaded
C:\Users\shangdiy\source\repos\torchnative\demo\text_decoder\data\aotinductor\model\model.wrapper.so
Loaded model.dll
Loading weights from C:\Users\shangdiy\source\repos\torchnative\demo\text_decoder\data\aotinductor\model\model.wrapper_weights.blob
text_decoder loaded
Load latency (ms):
audio_encoder: 1011.234
archive extraction: 0.000
.so loading: 1011.197
token_embedding: 525.773
archive extraction: 0.000
.so loading: 525.704
text_decoder: 3324.130
archive extraction: 0.000
.so loading: 3323.979
Run latency (ms):
audio_encoder: 285.958
audio_encoder output: dtype=bfloat16, shape=[1, 1125, 3072], numel=3456000
token_embedding: 6.676
token_embedding output: dtype=bfloat16, shape=[1, 1138, 3072], numel=3495936
text_decoder: 576.519
text_decoder output: dtype=bfloat16, shape=[1, 1138, 131072], numel=149159936
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164526
Approved by: https://github.com/desertfire
To run this, you need to install `mingw64-gcc-c++` and download the Windows CUDA library toolkit.
See design doc and demo instructions in https://docs.google.com/document/d/1iDaChqA5nNKkBFTzsdkmoomvQlXHbnlb1Z4yEp7xaJA/edit?tab=t.0
If `cross_platform_target` is windows, we do the following:
- do not link to `sleef`. This can be improved in the future if we need it; currently I avoid it because it requires extra setup on the Linux side
- use `mingw64-gcc-c++` to compile
- use `WINDOWS_CUDA_HOME` instead of `CUDA_HOME` when linking to CUDA
```
python test/inductor/test_aot_inductor_windows.py -k so
```
Other changes:
- decouples the compile_standalone config and the dynamic link flag
- creates a new aot_inductor_mode config module, which is used to control configs in aot_inductor
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163188
Approved by: https://github.com/desertfire
Summary:
Original commit changeset: 06888d7ebff0
Original Phabricator Diff: D82932788
Restricted the test to SM90 for `scaled_grouped_mm`.
Test Plan: TBD (will share the Linux CI results)
Differential Revision: D83283991
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163905
Approved by: https://github.com/angelayi
## Issue
From an internal use case, we found that if we have an equality rule like:
```
Max(15, u0) == s0 * Max(15, u0)
```
This would lead to a wrong substitution rule being generated in the substitution table. As a result, the process gets stuck in the substitution loop and appears to hang indefinitely, as it keeps doing the following substitutions:
```
Max(15, u0)
--> s0 * Max(15, u0)
--> s0 ** 2 * Max(15, u0)
--> s0 ** 3 * Max(15, u0)
--> s0 ** 4 * Max(15, u0)
...
```
The root cause is in SymPy expression comparison: since `Max` is [not inside the op class table](https://github.com/sympy/sympy/blob/1.14/sympy/core/basic.py#L50-L86), it takes the [UNKNOWN](https://github.com/sympy/sympy/blob/1.14/sympy/core/basic.py#L120) order and is considered larger than any other type of expression.
## Fix
1. Added a break-out from the substitution while-loop that warns about excessive substitutions. What threshold should be used here, and how to pass it in, are open to suggestion; a hard-coded static value keeps it simple for now.
2. Enhanced the sympy expression comparison logic so that we first check whether one expr "has" the other, to work around the issue with `Max` here (sketched below).
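For illustration, a standalone sketch of the idea behind fix 2 (the helper name and the fallback ordering are hypothetical; the actual change lives in the sizevars substitution logic):
```python
import sympy


def pick_substitution(lhs: sympy.Expr, rhs: sympy.Expr) -> tuple[sympy.Expr, sympy.Expr]:
    """Return (src, dst), meaning 'replace src with dst'."""
    # If one side structurally contains the other (e.g. s0*Max(15, u0) contains
    # Max(15, u0)), replace the containing expression with the contained one,
    # never the reverse; this avoids the unbounded
    # Max(15, u0) -> s0*Max(15, u0) -> s0**2*Max(15, u0) -> ... loop.
    if lhs.has(rhs):
        return lhs, rhs
    if rhs.has(lhs):
        return rhs, lhs
    # Otherwise fall back to a deterministic ordering.
    if sympy.default_sort_key(lhs) >= sympy.default_sort_key(rhs):
        return lhs, rhs
    return rhs, lhs


s0, u0 = sympy.symbols("s0 u0", positive=True, integer=True)
src, dst = pick_substitution(sympy.Max(15, u0), s0 * sympy.Max(15, u0))
assert (src, dst) == (s0 * sympy.Max(15, u0), sympy.Max(15, u0))
```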
## Testing
- with the unittest alone --> the unittest got stuck
- with the unittest and the while-loop break-out, we could see the tests finish with the warning "**Substitution limit reached**":
```
test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpu::test_unbounded_expr_substitutions_cpu W0923 13:00:37.864000 46140 /data/users/q1l1/pytorch/torch/_export/__init__.py:70] +============================+
W0923 13:00:37.864000 46140 /data/users/q1l1/pytorch/torch/_export/__init__.py:71] | !!! WARNING !!! |
W0923 13:00:37.865000 46140 /data/users/q1l1/pytorch/torch/_export/__init__.py:72] +============================+
W0923 13:00:37.865000 46140 /data/users/q1l1/pytorch/torch/_export/__init__.py:73] torch._export.aot_compile()/torch._export.aot_load() is being deprecated, please switch to directly calling torch._inductor.aoti_compile_and_package(torch.export.export())/torch._inductor.aoti_load_package() instead.
stats [('calls_captured', 5), ('unique_graphs', 1)]
inductor [('extern_calls', 2)]
graph_break []
aten_mm_info [('aten.mm_Max(15, u0)_16_64', 1)]
PASSED [5.6947s]
test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleGpu::test_unbounded_expr_substitutions_cuda W0923 13:00:39.633000 46140 /data/users/q1l1/pytorch/torch/_inductor/sizevars.py:765] [0/0] Substitution limit (30) reached w/ u1**30*Max(15, u0)
W0923 13:00:39.679000 46140 /data/users/q1l1/pytorch/torch/_inductor/sizevars.py:765] [0/0] Substitution limit (30) reached w/ 64*u1**30*Max(15, u0)
stats [('calls_captured', 5), ('unique_graphs', 1)]
inductor [('extern_calls', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('async_compile_cache_miss', 1)]
graph_break []
aten_mm_info [('aten.mm_Max(15, u0)_16_64', 1)]
PASSED [5.6278s]
test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleMps::test_unbounded_expr_substitutions_mps SKIPPED [0.0002s]
============================ 2 passed, 1 skipped, 870 deselected in 19.66s ============================
```
- with the unittest + the enhanced comparison logic, we don't see the warning anymore:
```
Running 3 items in this shard
test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpu::test_unbounded_expr_substitutions_cpu W0923 13:15:39.560000 290812 /data/users/q1l1/pytorch/torch/_export/__init__.py:70] +============================+
W0923 13:15:39.561000 290812 /data/users/q1l1/pytorch/torch/_export/__init__.py:71] | !!! WARNING !!! |
W0923 13:15:39.561000 290812 /data/users/q1l1/pytorch/torch/_export/__init__.py:72] +============================+
W0923 13:15:39.562000 290812 /data/users/q1l1/pytorch/torch/_export/__init__.py:73] torch._export.aot_compile()/torch._export.aot_load() is being deprecated, please switch to directly calling torch._inductor.aoti_compile_and_package(torch.export.export())/torch._inductor.aoti_load_package() instead.
stats [('calls_captured', 5), ('unique_graphs', 1)]
inductor [('extern_calls', 2)]
graph_break []
aten_mm_info [('aten.mm_Max(15, u0)_16_64', 1)]
PASSED [6.6093s]
test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleGpu::test_unbounded_expr_substitutions_cuda stats [('calls_captured', 5), ('unique_graphs', 1)]
inductor [('extern_calls', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('async_compile_cache_miss', 1)]
graph_break []
aten_mm_info [('aten.mm_Max(15, u0)_16_64', 1)]
PASSED [6.0502s]
test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleMps::test_unbounded_expr_substitutions_mps SKIPPED [0.0002s]
============================ 2 passed, 1 skipped, 870 deselected in 21.99s ============================
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163685
Approved by: https://github.com/jansel
Summary:
Otherwise, we may hit:
```
Exception: Expected all tensors to be on the same device, but got other is on cuda:0, different from other tensors on cpu (when checking argument in method wrapper_CUDA__equal)
```
Test Plan: UTs
Reviewed By: yushangdi
Differential Revision: D82974062
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163529
Approved by: https://github.com/yushangdi, https://github.com/Skylion007
Users can specify the following to get a libtorch-free `.so`:
"aot_inductor.use_libtorch": False,
The following config is only used for torchnative (see https://github.com/meta-pytorch/torchnative/pull/110). It's not intended to be used by executorch. The reason we need it for torchnative is that a lot of the symbol definitions in the torchnative repo are only in header files.
"aot_inductor.libtorch_free_header": "/data/users/shangdiy/torchnative/standalone,/data/users/shangdiy/torchnative/" (or their custom headers)
The main motivating use case is for executorch to produce a libtorch-free `.so`.
TODO for follow-up PR: this flag should be consolidated with the `compile_standalone` flag.
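A hypothetical end-to-end sketch of how these configs would be passed (the config keys come from this PR; the module, inputs, and header paths are placeholders):
```python
import torch
import torch._inductor


class M(torch.nn.Module):
    def forward(self, x):
        return x * 2


ep = torch.export.export(M(), (torch.randn(4),))
pkg_path = torch._inductor.aoti_compile_and_package(
    ep,
    inductor_configs={
        # Produce a libtorch-free .so.
        "aot_inductor.use_libtorch": False,
        # Comma-separated header search paths that provide the needed symbol
        # definitions (torchnative-style, header-only).
        "aot_inductor.libtorch_free_header": "/path/to/torchnative/standalone,/path/to/torchnative/",
    },
)
```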
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162655
Approved by: https://github.com/angelayi
Summary:
Fix memory leak in AOTI when calling `aoti_torch_as_strided`
If you have something like `AtenTensorHandle buf_handle` and you allocate memory to it, you have to wrap it in a `RAIIAtenTensorHandle` so the ownership is released. Otherwise you have leaked the memory, because even when the program ends there is still a pointer pointing to the underlying storage of `buf_handle_restrided`, and the storage is never freed.
Test Plan:
```
buck run fbcode//mode/dev-nosan fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_pad_non_zero_memory_leak
```
Also verified by looking at `print(f"Allocated memory: {torch.cuda.memory_allocated() / 1024 ** 2:.2f} MB")`
Differential Revision: D81640339
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162118
Approved by: https://github.com/angelayi
Summary:
When we have a user-defined Triton kernel, it marks the mutated outputs as `MutationOutput` with a `NoneLayout`. This `MutationOutput` may later be used as input to another Inductor-generated Triton kernel.
When we determine whether to use int32 or int64 for the Inductor-generated Triton kernel, we need to look at the number of elements of all buffers involved. If one of the buffers is a `MutationOutput`, we should still consider its number of elements instead of skipping it.
To get a hint on the `MutationOutput` size, we look at the buffers corresponding to its `mutation_names`.
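A toy, self-contained sketch of that sizing decision (the `Buf` class and helpers below are hypothetical stand-ins, not the real Inductor IR types):
```python
from dataclasses import dataclass, field

INT32_MAX = 2**31 - 1


@dataclass
class Buf:
    # Toy stand-in for an Inductor buffer; real code uses ir.Buffer/MutationOutput.
    name: str
    numel: int = 0
    mutation_names: list[str] = field(default_factory=list)

    @property
    def is_mutation_output(self) -> bool:
        return bool(self.mutation_names)


def buffer_numel_hint(buf: Buf, buffers_by_name: dict[str, Buf]) -> int:
    # Number of elements to consider when picking int32 vs int64 indexing.
    if buf.is_mutation_output:
        # Don't skip a MutationOutput: fall back to the buffers it mutates.
        return max(buffers_by_name[n].numel for n in buf.mutation_names)
    return buf.numel


def needs_int64_indexing(buffers: list[Buf], buffers_by_name: dict[str, Buf]) -> bool:
    return any(buffer_numel_hint(b, buffers_by_name) > INT32_MAX for b in buffers)


big = Buf("buf0", numel=INT32_MAX + 1)
mutated = Buf("buf1", mutation_names=["buf0"])  # output of a user-defined Triton kernel
assert needs_int64_indexing([mutated], {"buf0": big, "buf1": mutated})
```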
Test Plan:
```
buck run mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_autotune_int64_user_defined_triton_kernel
```
Differential Revision: D81530083
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162020
Approved by: https://github.com/davidberard98, https://github.com/eellison
Summary: This is a reland of D80285441, fixed the unit test.
Test Plan:
```
buck2 run mode/opt-amd-gpu -m rocm641 -c fbcode.split-dwarf=true -c fbcode.use_link_groups=true -c fbcode.enable_gpu_sections=true //hpc/new/models/feed/benchmark:feed_lower_benchmark -- --load=manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/894698382/0/gpu_lowering/new_input8 --skip-eager --skip-flop-estimation --sync-mode=0 --lower-backend=AOT_INDUCTOR
```
will succeed after this diff.
Differential Revision: D80971224
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161521
Approved by: https://github.com/frank-wei
Summary: AMD-specific kwargs need to be removed from the guard; otherwise a KeyError will be raised when executing the kernel.
Test Plan:
```
buck2 run mode/opt-amd-gpu -m rocm641 -c fbcode.split-dwarf=true -c fbcode.use_link_groups=true -c fbcode.enable_gpu_sections=true //hpc/new/models/feed/benchmark:feed_lower_benchmark -- --load=manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/894698382/0/gpu_lowering/new_input8 --skip-eager --skip-flop-estimation --sync-mode=0 --lower-backend=AOT_INDUCTOR
```
can succeed after this change.
Differential Revision: D80285441
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160671
Approved by: https://github.com/muchulee8
Fixes https://github.com/pytorch/pytorch/issues/159995
Currently there are two problems with extern kernels in subgraphs:
1. They don't get serialized to the extern kernel json file because we only look at the toplevel graph.
2. Since the scope of each extern_kernels list is within its own subgraph, the indices referencing the operators get messed up, because each subgraph starts counting from 0.
So, this PR moves the extern_kernels list to a global view (under virtualized) so that we can count the extern kernels across subgraphs and the toplevel graph.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160004
Approved by: https://github.com/ydwu4
Summary:
Previously our implementation for RecordFunction injected ATen into codegen, which breaks the ABI contract for AOTInductor.
C10::IValue is added to call the full record function. The extension to more profiling info will come in later PRs.
Test Plan:
Included in commit.
Differential Revision: [D79622071](https://our.internmc.facebook.com/intern/diff/D79622071)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159842
Approved by: https://github.com/desertfire
Fixes https://github.com/pytorch/pytorch/issues/160535
Index may contain `torch.utils._sympy.functions.Identity`. When we call `SymPyOps.index_expr`, if the value is a sympy.Expr containing Identity, `TypedExpr(value, dtype)` will fail. So when we unwrap arguments, we expand the sympy expression to unwrap the Identity.
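For illustration, one way to strip `Identity` wrappers from a sympy expression (the PR itself does this via expansion at argument-unwrapping time; this standalone snippet just shows the shape of the problem):
```python
import sympy
from torch.utils._sympy.functions import Identity

s0 = sympy.Symbol("s0", integer=True)
expr = 2 * Identity(s0 + 1)  # an index expression wrapped in Identity

# Replace every Identity(arg) node with its argument so downstream code that
# only understands plain sympy arithmetic (e.g. TypedExpr) can handle it.
unwrapped = expr.replace(Identity, lambda arg: arg)
print(unwrapped)  # Identity is gone: 2*s0 + 2
```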
Test Plan:
buck run @mode/dev-nosan //caffe2/test/inductor:test_aot_inductor -- -r test_sym_expr_indexing
Differential Revision: D76308640
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155504
Approved by: https://github.com/eellison
Currently, the device-bias linter only targets functions decorated with @requires_gpu. This PR adds support for two new detection scenarios:
1. Detect device-bias code in functions decorated with @requires_triton.
2. Detect device-bias code for entire test suites that are defined as shared across GPUs. For example:
```
if __name__ == "__main__":
if HAS_GPU:
run_tests()
```
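For context, a hypothetical test file illustrating scenario 2 (whether the linter flags exactly this pattern is an assumption based on the description above; the imports mirror the usual Inductor test setup):
```python
import torch
from torch._inductor.test_case import TestCase, run_tests
from torch.testing._internal.inductor_utils import GPU_TYPE, HAS_GPU


class SharedGpuTests(TestCase):
    def test_add(self):
        x = torch.ones(4, device="cuda")    # device-bias: hard-codes cuda
        y = torch.ones(4, device=GPU_TYPE)  # portable across GPU backends
        self.assertEqual(x.cpu(), y.cpu())


if __name__ == "__main__":
    if HAS_GPU:
        run_tests()
```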
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159949
Approved by: https://github.com/EikanWang, https://github.com/jansel
Summary: When compiling for standalone, make embed_kernel_binary and emit_multi_arch_kernel default to True, and add a default name for model_name_for_generated_files to make the generated cpp project easier to understand. Also improve the weights object file naming to be more readable.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158560
Approved by: https://github.com/yushangdi
In some cases we have MPS kernels which are reused across higher-order-op subgraphs and the toplevel code. However, we currently initialize the variable for an MPS kernel the first time we use it, which runs into an issue if we first encounter the kernel within a subgraph, since the kernel will then only be initialized within the subgraph scope. For instance:
```
if ...
auto mps_lib_0_func = ...
mps_lib_0_func->run()
// since we already used mps_lib_0 once, we don't re-initialize it
mps_lib_0_func->run() // error, mps_lib_0_func not initialized
```
So the solution we took here is to initialize all the kernels at the beginning:
```
const std::shared_ptr<at::native::mps::MetalKernelFunction> get_mps_lib_0() {
static const auto func = mps_lib_0.getKernelFunction("generated_kernel");
return func;
}
AOTIMetalKernelFunctionHandle get_mps_lib_0_handle() {
static const auto handle = AOTIMetalKernelFunctionHandle(get_mps_lib_0().get());
return handle;
}
...
if ...
get_mps_lib_0()->run()
get_mps_lib_0()->run() // success
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159753
Approved by: https://github.com/malfet
ghstack dependencies: #159456, #159695