Commit Graph

327 Commits

Author SHA1 Message Date
b90aa18569 [aoti] Add initial custom op support (#127034)
Re-land of https://github.com/pytorch/pytorch/pull/125242

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127034
Approved by: https://github.com/malfet
2024-07-24 20:29:55 +00:00
16a2a1aad3 Annotate graph.py (#131400)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131400
Approved by: https://github.com/shunting314
2024-07-23 07:04:12 +00:00
b2ad16f01d avoid OpOverloadPacket.__getattr__ calls in inductor lowering (#131348)
we have seen stacktrace samples showing that a lot of compilation time is spent in exceptions raised in `OpOverloadPacket.__getattr__`. It's not entirely clear why/how this happens, but I spot-checked a few places in `_inductor.graph.py` where we previously may have been calling `hasattr(OpOverloadPacket, ...)`, that can be avoided (hasattr will go through getattr, which, for OpOverloadPacket, will do a lookup in the dispatch table for all overload names of the packet).

Test Plan: CI

Differential Revision: D60048270

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131348
Approved by: https://github.com/davidberard98
2024-07-23 04:30:04 +00:00
b6d477fd56 [BE][Easy][16/19] enforce style for empty lines in import segments in torch/_i*/ (#129768)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129768
Approved by: https://github.com/jansel
2024-07-20 16:20:58 +00:00
27c2a0d63b [inductor] Separate Buffer and Operation into two concepts (#130831)
Resubmit of #128893

Currently a buffer represents both a tensor with physical storage and a
computation that produces the tensor as a result.

This PR attempts to split these into two different concepts in the scheduler.
This should allow us to have multiple outputs from a single operation.

Differential Revision: [D59876059](https://our.internmc.facebook.com/intern/diff/D59876059)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130831
Approved by: https://github.com/lezcano
2024-07-20 02:05:07 +00:00
419b8df0b6 [inductor][easy] add debug logs for inlining constants (#130799)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130799
Approved by: https://github.com/chenyang78
2024-07-17 04:21:08 +00:00
9ae40c6bc0 Fix and improve raise_comms and sink_waits (#129980)
The tests for `raise_comms` and `sink_waits` passes were not enabled in CI. The passes are now broken due to functional collective v2 and possibly other changes.

Correctness issues:
- The original passes did not take mutation into consideration and may yield semantically different scheduling order. This may be due to the recent changes to how mutations are expressed in Inductor IR (e.g., MutationOutput).

Effectiveness issues:
- The original passes only moved the comm/wait nodes themselves. However, comm nodes can come with prologues (e.g., clone for all_reduce_, split-cat for non-zero dim all-gather). Whenever there are any prologues, the comms won't be raised at all.
- The prologues are often horizontally fused with other pointwise nodes. This can severely delay the scheduling of the comm node.

This PR:
- Make the passes handle mutation correctly.
- Instead of moving individual comm/wait nodes, schedule all node using a scored method. This way the comm nodes can be optimally raised even in the presence of prologues.
- The horizontal fusion of prolofues often severely delays the scheduling of the comm node. Horizontally fusing this clone can almost never out-perform scheduling the comm node earlier. Also in most cases, this clone is eliminated via in-place reuse. Therefore, we tell the scheduler to not fuse it.
- Enable the tests in CI.

Co-authored-by: Will Feng <yf225@cornell.edu>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129980
Approved by: https://github.com/yf225
2024-07-12 01:55:47 +00:00
edf273edf4 Revert some PRs (#130303)
Summary:
Revert https://github.com/pytorch/pytorch/pull/129346 thru
https://github.com/pytorch/pytorch/pull/128893

For S430832

Test Plan: Tests

Differential Revision: D59503843

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130303
Approved by: https://github.com/bdhirsh
2024-07-09 14:46:00 +00:00
fb078c20c1 [inductor] Separate Buffer and Operation into two concepts (#128893)
Currently a buffer represents both a tensor with physical storage and a
computation that produces the tensor as a result.

This PR attempts to split these into two different concepts in the scheduler.
This should allow us to have multiple outputs from a single operation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128893
Approved by: https://github.com/lezcano
2024-07-02 23:49:57 +00:00
62e5d045c0 [AOTI] Auto-tune Triton kernels in a seperate block (#129057)
Summary: Currently AOTI does a two-pass compilation for the CUDA backend. In the first pass AOTI generates Python code, runs the generated code once with real example inputs to trigger Triton kernel compilation and tuning, and then AOTI runs the second pass to generate cpp code and compiles that into a shared library.

There are several problems with this approach when we want to enable the cpp wrapper mode for JIT Inductor:
* Compilation time: JIT compilation is more sensitive to compilation time than AOT compilation. The two-pass approach does add extra overhead for compilation.
* Peak memory size: when executing the first-pass generated code with real inputs, some inputs need to be cloned to avoid side effect coming from input mutation. This can raise the high-water mark for memory consumption.
* Missing triton kernel autotuning: Because kernel autotune depends on the kernel being executed in the two-pass approach, some kernels will not be autotuned when a model contains control flow such as torch.if or torch.while.

This PR is the first step towards solving these problems by moving Triton kernel autotuning to the compile time and use random inputs for tuning. The cpp wrapper codegen still has two passes, but in the first pass, Inductor will generate a separate code just for kernel autotuning, with https://gist.github.com/desertfire/606dc772b3e989b5e2edc66d76593070 as an example, and we no longer need to execute the model after the first-pass finishes. After that we rerun a second pass to generate cpp code. This reduces peak memory consumption and enables kernel autotuning when there is control flow. Truly making the codegen into one-pass will come later once this solution is proven stable and generates as performant kernels as before.

Differential Revision: [D58782766](https://our.internmc.facebook.com/intern/diff/D58782766)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129057
Approved by: https://github.com/jansel, https://github.com/eellison
2024-06-21 14:34:13 +00:00
571a0db132 [inductor] Fix logging for run_and_get_cpp_code (#128794)
Summary: Found during testing with remote caching: Use the same output logger object between graph.py and codecache.py since it's patched in `run_and_get_cpp_code`. That allows us to capture any logging produced from the codecache path when using `run_and_get_cpp_code`. I'm also fixing a few tests that were passing mistakenly because logging was missing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128794
Approved by: https://github.com/oulgen, https://github.com/leslie-fang-intel
2024-06-19 21:32:34 +00:00
17b45e905a Fix get output code when caching is enabled (#128445)
Summary: Improve output code retrieval mechanism so that it works in the presence of cache hits.

Test Plan: ci

Differential Revision: D58429602

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128445
Approved by: https://github.com/jansel, https://github.com/eellison, https://github.com/masnesral
2024-06-13 16:00:30 +00:00
c897651392 [inductor] Add BackendFeature gating (#128266)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128266
Approved by: https://github.com/shunting314
2024-06-13 07:31:51 +00:00
2229884102 Introduce int_oo (#127693)
In a previous life, we used sympy.oo to represent the lower/upper bounds of integer ranges. Later, we changed this to be sys.maxsize - 1 for a few reasons: (1) sometimes we do tests on a value being exactly sys.maxsize, and we wanted to avoid a data dependent guard in this case, (2) sympy.oo corresponds to floating point infinity, so you get incorrect types for value ranges with oo, and (3) you can do slightly better reasoning if you assume that input sizes fall within representable 64-bit integer range.

After working in the sys.maxsize regime for a bit, I've concluded that this was actually a bad idea. Specifically, the problem is that you end up with sys.maxsize in your upper bound, and then whenever you do any sort of size-increasing computation like size * 2, you end up with 2 * sys.maxsize, and you end up doing a ton of arbitrary precision int computation that is totally unnecessary. A symbolic bound is better.

But especially after #126905, we can't go back to using sympy.oo, because that advertises that it's not an integer, and now your ValueRanges is typed incorrectly. So what do we do? We define a new numeric constant `int_oo`, which is like `sympy.oo` but it advertises `is_integer`. **test/test_sympy_utils.py** describes some basic properties of the number, and **torch/utils/_sympy/numbers.py** has the actual implementation.

The rest of the changes of the PR are working out the implications of this change. I'll give more commentary as inline comments.

Fixes https://github.com/pytorch/pytorch/issues/127396

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127693
Approved by: https://github.com/lezcano
ghstack dependencies: #126905
2024-06-13 04:08:20 +00:00
5d8c7f39d4 Revert "Introduce int_oo (#127693)"
This reverts commit 9cab5987bdeb66df8efbc581b3469bfe300e168c.

Reverted https://github.com/pytorch/pytorch/pull/127693 on behalf of https://github.com/clee2000 due to sorry executorch CI is a bit weird regarding pins, I'll make a chat with mergen with the choices of what to do and how it'll affect executorch CI, reverting for now to prevent more divergences in the meantime ([comment](https://github.com/pytorch/pytorch/pull/127693#issuecomment-2161775400))
2024-06-11 23:36:08 +00:00
9cab5987bd Introduce int_oo (#127693)
In a previous life, we used sympy.oo to represent the lower/upper bounds of integer ranges. Later, we changed this to be sys.maxsize - 1 for a few reasons: (1) sometimes we do tests on a value being exactly sys.maxsize, and we wanted to avoid a data dependent guard in this case, (2) sympy.oo corresponds to floating point infinity, so you get incorrect types for value ranges with oo, and (3) you can do slightly better reasoning if you assume that input sizes fall within representable 64-bit integer range.

After working in the sys.maxsize regime for a bit, I've concluded that this was actually a bad idea. Specifically, the problem is that you end up with sys.maxsize in your upper bound, and then whenever you do any sort of size-increasing computation like size * 2, you end up with 2 * sys.maxsize, and you end up doing a ton of arbitrary precision int computation that is totally unnecessary. A symbolic bound is better.

But especially after #126905, we can't go back to using sympy.oo, because that advertises that it's not an integer, and now your ValueRanges is typed incorrectly. So what do we do? We define a new numeric constant `int_oo`, which is like `sympy.oo` but it advertises `is_integer`. **test/test_sympy_utils.py** describes some basic properties of the number, and **torch/utils/_sympy/numbers.py** has the actual implementation.

The rest of the changes of the PR are working out the implications of this change. I'll give more commentary as inline comments.

Fixes https://github.com/pytorch/pytorch/issues/127396

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127693
Approved by: https://github.com/lezcano
ghstack dependencies: #126905
2024-06-10 19:09:53 +00:00
3964a3ec73 Complete revamp of float/promotion sympy handling (#126905)
At a high level, the idea behind this PR is:

* Make it clearer what the promotion and int/float rules for various Sympy operations are. Operators that previously were polymorphic over int/float are now split into separate operators for clarity. We never do mixed int/float addition/multiplication etc in sympy, instead, we always promote to the appropriate operator. (However, equality is currently not done correctly.)
* Enforce strict typing on ValueRanges: if you have a ValueRange for a float, the lower and upper MUST be floats, and so forth for integers.

The story begins in **torch/utils/_sympy/functions.py**. Here, I make some changes to how we represent certain operations in sympy expressions:

* FloorDiv now only supports integer inputs; to do float floor division, do a truediv and then a trunc. Additionally, we remove the divide out addition by gcd optimization, because sympy gcd is over fields and is willing to generate rationals (but rationals are bad for ValueRange strict typing).
* ModularIndexing, LShift, RShift now assert they are given integer inputs.
* Mod only supports integer inputs; eventually we will support FloatMod (left for later work, when we build out Sympy support for floating operations). Unfortunately, I couldn't assert integer inputs here, because of a bad interaction with sympy's inequality solver that is used by the offline solver
* TrueDiv is split into FloatTrueDiv and IntTrueDiv. This allows for us to eventually generate accurate code for Python semantics IntTrueDiv, which is written in a special way to preserve precision when the inputs are >= 2**53 beyond what first coercing the integer to floats and then doing true division.
* Trunc is split to TruncToFloat and TruncToInt.
* Round is updated to return a float, not an int, making it consistent with the round op handler in Inductor. To get Python-style conversion to int, we call TruncToInt on the result.
* RoundDecimal updated to consistently only ever return a float
* Add ToFloat for explicit coercion to float (required so we can enforce strict ValueRanges typing)

In **torch/__init__.py**, we modify SymInt and SymFloat to appropriately call into new bindings that route to these refined sympy operations.  Also, we modify `torch.sym_min` and `torch.sym_max` to have promotion semantics (if one argument is a float, the return result is always a float), making them inconsistent with builtins.min/max, but possible to do type analysis without runtime information.

We also need to introduce some new op handlers in **torch/_inductor/ops_handler.py**:

* `to_int` for truncation to int64, directly corresponding to TruncToInt; this can be implemented by trunc and dtype, but with a dedicated handler it is more convenient for roundtripping in Sympy
* `int_truediv` for Python-style integer true division, which has higher precision than casting to floats and then running `truediv`

These changes have consequences. First, we need to make some administrative changes:

* Actually wire up these Sympy functions from SymInt/SymFloat in **torch/fx/experimental/sym_node.py**, including the new promotion rules (promote2)
* Add support for new Sympy functions in **torch/utils/_sympy/interp.py**, **torch/utils/_sympy/reference.py**
  * In particular, in torch.utils._sympy.reference, we have a strong preference to NOT do nontrivial compute, instead, everything in ops handler should map to a singular sympy function
  * TODO: I chose to roundtrip mod back to our Mod function, but I think I'm going to have to deal with the C/Python inconsistency this to fix tests here
* Add printer support for the Sympy functions in **torch/_inductor/codegen/common.py**, **torch/_inductor/codegen/cpp_utils.py**, **torch/_inductor/codegen/triton.py**. `int_truediv` and mixed precision equality is currently not implemented soundly, so we will lose precision in codegen for large values. TODO: The additions here are not exhaustive yet
* Update ValueRanges logic to use new sympy functions in **torch/utils/_sympy/value_ranges.py**. In general, we prefer to use the new Sympy function rather than try to roll things by hand, which is what was done previously for many VR analysis functions.

In **torch/fx/experimental/symbolic_shapes.py** we need to make some symbolic reasoning adjustments:

* Avoid generation of rational subexpressions by removing simplification of `x // y` into `floor(x / y)`. This simplification then triggers an addition simplification rule `(x + y) / c --> x / c + y / c` which is bad because x / c is a rational number now
* `_assert_bound_is_rational` is no more, we no longer generate rational bounds
* Don't intersect non-int value ranges with the `int_range`
* Support more sympy Functions for guard SYMPY_INTERP
* Assert the type of value range is consistent with the variable type

The new asserts uncovered necessary bug fixes:

* **torch/_inductor/codegen/cpp.py**, **torch/_inductor/select_algorithm.py**, **torch/_inductor/sizevars.py** - Ensure Wild/Symbol manually allocated in Inductor is marked `is_integer` so it's accepted to build expressions
* **torch/_inductor/utils.py** - make sure you actually pass in sympy.Expr to these functions
* **torch/_inductor/ir.py** - make_contiguous_strides_for takes int/SymInt, not sympy.Expr!
* **torch/export/dynamic_shapes.py** - don't use infinity to represent int ranges, instead use sys.maxsize - 1

Because of the removal of some symbolic reasoning that produced rationals, some of our symbolic reasoning has gotten worse and we are unable to simplify some guards. Check the TODO at **test/test_proxy_tensor.py**

**Reland notes.** This requires this internal fbcode diff https://www.internalfb.com/phabricator/paste/view/P1403322587 but I cannot prepare the diff codev due to https://fb.workplace.com/groups/osssupport/posts/26343544518600814/

It also requires this Executorch PR https://github.com/pytorch/executorch/pull/3911 but the ET PR can be landed prior to this landing.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126905
Approved by: https://github.com/xadupre, https://github.com/lezcano
2024-06-09 06:20:25 +00:00
ea614fb2b1 Flip default value for mypy disallow_untyped_defs [2/11] (#127839)
See #127836 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127839
Approved by: https://github.com/oulgen
2024-06-08 18:23:08 +00:00
917387f66d [AOTI] fix a constant tensor device move issue (#128265)
Summary: When copying a constant tensor to another device, `.to` returns a fake tensor and causes a problem when a real tensor is expected.

Test Plan: CI

Differential Revision: D58313034

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128265
Approved by: https://github.com/chenyang78
2024-06-08 13:23:49 +00:00
ac51f782fe Revert "Complete revamp of float/promotion sympy handling (#126905)"
This reverts commit 2f7cfecd86009a9d396fdbdcdfb4ba7a005db16b.

Reverted https://github.com/pytorch/pytorch/pull/126905 on behalf of https://github.com/atalman due to Sorry need to revert - failing internally ([comment](https://github.com/pytorch/pytorch/pull/126905#issuecomment-2155118778))
2024-06-07 16:01:46 +00:00
5a3bea1e88 Remove unused arg to GraphLowering (#126821)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126821
Approved by: https://github.com/eellison
ghstack dependencies: #126820
2024-06-06 06:39:59 +00:00
2f7cfecd86 Complete revamp of float/promotion sympy handling (#126905)
At a high level, the idea behind this PR is:

* Make it clearer what the promotion and int/float rules for various Sympy operations are. Operators that previously were polymorphic over int/float are now split into separate operators for clarity. We never do mixed int/float addition/multiplication etc in sympy, instead, we always promote to the appropriate operator. (However, equality is currently not done correctly.)
* Enforce strict typing on ValueRanges: if you have a ValueRange for a float, the lower and upper MUST be floats, and so forth for integers.

The story begins in **torch/utils/_sympy/functions.py**. Here, I make some changes to how we represent certain operations in sympy expressions:

* FloorDiv now only supports integer inputs; to do float floor division, do a truediv and then a trunc. Additionally, we remove the divide out addition by gcd optimization, because sympy gcd is over fields and is willing to generate rationals (but rationals are bad for ValueRange strict typing).
* ModularIndexing, LShift, RShift now assert they are given integer inputs.
* Mod only supports integer inputs; eventually we will support FloatMod (left for later work, when we build out Sympy support for floating operations). Unfortunately, I couldn't assert integer inputs here, because of a bad interaction with sympy's inequality solver that is used by the offline solver
* TrueDiv is split into FloatTrueDiv and IntTrueDiv. This allows for us to eventually generate accurate code for Python semantics IntTrueDiv, which is written in a special way to preserve precision when the inputs are >= 2**53 beyond what first coercing the integer to floats and then doing true division.
* Trunc is split to TruncToFloat and TruncToInt.
* Round is updated to return a float, not an int, making it consistent with the round op handler in Inductor. To get Python-style conversion to int, we call TruncToInt on the result.
* RoundDecimal updated to consistently only ever return a float
* Add ToFloat for explicit coercion to float (required so we can enforce strict ValueRanges typing)

In **torch/__init__.py**, we modify SymInt and SymFloat to appropriately call into new bindings that route to these refined sympy operations.  Also, we modify `torch.sym_min` and `torch.sym_max` to have promotion semantics (if one argument is a float, the return result is always a float), making them inconsistent with builtins.min/max, but possible to do type analysis without runtime information.

We also need to introduce some new op handlers in **torch/_inductor/ops_handler.py**:

* `to_int` for truncation to int64, directly corresponding to TruncToInt; this can be implemented by trunc and dtype, but with a dedicated handler it is more convenient for roundtripping in Sympy
* `int_truediv` for Python-style integer true division, which has higher precision than casting to floats and then running `truediv`

These changes have consequences. First, we need to make some administrative changes:

* Actually wire up these Sympy functions from SymInt/SymFloat in **torch/fx/experimental/sym_node.py**, including the new promotion rules (promote2)
* Add support for new Sympy functions in **torch/utils/_sympy/interp.py**, **torch/utils/_sympy/reference.py**
  * In particular, in torch.utils._sympy.reference, we have a strong preference to NOT do nontrivial compute, instead, everything in ops handler should map to a singular sympy function
  * TODO: I chose to roundtrip mod back to our Mod function, but I think I'm going to have to deal with the C/Python inconsistency this to fix tests here
* Add printer support for the Sympy functions in **torch/_inductor/codegen/common.py**, **torch/_inductor/codegen/cpp_utils.py**, **torch/_inductor/codegen/triton.py**. `int_truediv` and mixed precision equality is currently not implemented soundly, so we will lose precision in codegen for large values. TODO: The additions here are not exhaustive yet
* Update ValueRanges logic to use new sympy functions in **torch/utils/_sympy/value_ranges.py**. In general, we prefer to use the new Sympy function rather than try to roll things by hand, which is what was done previously for many VR analysis functions.

In **torch/fx/experimental/symbolic_shapes.py** we need to make some symbolic reasoning adjustments:

* Avoid generation of rational subexpressions by removing simplification of `x // y` into `floor(x / y)`. This simplification then triggers an addition simplification rule `(x + y) / c --> x / c + y / c` which is bad because x / c is a rational number now
* `_assert_bound_is_rational` is no more, we no longer generate rational bounds
* Don't intersect non-int value ranges with the `int_range`
* Support more sympy Functions for guard SYMPY_INTERP
* Assert the type of value range is consistent with the variable type

The new asserts uncovered necessary bug fixes:

* **torch/_inductor/codegen/cpp.py**, **torch/_inductor/select_algorithm.py**, **torch/_inductor/sizevars.py** - Ensure Wild/Symbol manually allocated in Inductor is marked `is_integer` so it's accepted to build expressions
* **torch/_inductor/utils.py** - make sure you actually pass in sympy.Expr to these functions
* **torch/_inductor/ir.py** - make_contiguous_strides_for takes int/SymInt, not sympy.Expr!
* **torch/export/dynamic_shapes.py** - don't use infinity to represent int ranges, instead use sys.maxsize - 1

Because of the removal of some symbolic reasoning that produced rationals, some of our symbolic reasoning has gotten worse and we are unable to simplify some guards. Check the TODO at **test/test_proxy_tensor.py**

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126905
Approved by: https://github.com/xadupre, https://github.com/lezcano
2024-06-06 02:29:45 +00:00
d5cb5d623a Revert "Complete revamp of float/promotion sympy handling (#126905)"
This reverts commit fb696ef3aa34e20c0fef1c0210a397abd3ea5885.

Reverted https://github.com/pytorch/pytorch/pull/126905 on behalf of https://github.com/ezyang due to internal user reported ceiling equality simplification problem, I have a plan ([comment](https://github.com/pytorch/pytorch/pull/126905#issuecomment-2148805840))
2024-06-05 03:57:58 +00:00
0eb9ec958a Revert "Inductor respects strides for custom ops by default (#126986)" (#127923)
This reverts commit dd64ca2a02434944ecbc8f3e186d44ba81e3cb26.

There's a silent incorrectness bug with needs_fixed_stride_order=True and
mutable custom ops, so it's better to flip the default back to avoid
silent incorrectness.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127923
Approved by: https://github.com/williamwen42
2024-06-04 22:25:45 +00:00
fb696ef3aa Complete revamp of float/promotion sympy handling (#126905)
At a high level, the idea behind this PR is:

* Make it clearer what the promotion and int/float rules for various Sympy operations are. Operators that previously were polymorphic over int/float are now split into separate operators for clarity. We never do mixed int/float addition/multiplication etc in sympy, instead, we always promote to the appropriate operator. (However, equality is currently not done correctly.)
* Enforce strict typing on ValueRanges: if you have a ValueRange for a float, the lower and upper MUST be floats, and so forth for integers.

The story begins in **torch/utils/_sympy/functions.py**. Here, I make some changes to how we represent certain operations in sympy expressions:

* FloorDiv now only supports integer inputs; to do float floor division, do a truediv and then a trunc. Additionally, we remove the divide out addition by gcd optimization, because sympy gcd is over fields and is willing to generate rationals (but rationals are bad for ValueRange strict typing).
* ModularIndexing, LShift, RShift now assert they are given integer inputs.
* Mod only supports integer inputs; eventually we will support FloatMod (left for later work, when we build out Sympy support for floating operations). Unfortunately, I couldn't assert integer inputs here, because of a bad interaction with sympy's inequality solver that is used by the offline solver
* TrueDiv is split into FloatTrueDiv and IntTrueDiv. This allows for us to eventually generate accurate code for Python semantics IntTrueDiv, which is written in a special way to preserve precision when the inputs are >= 2**53 beyond what first coercing the integer to floats and then doing true division.
* Trunc is split to TruncToFloat and TruncToInt.
* Round is updated to return a float, not an int, making it consistent with the round op handler in Inductor. To get Python-style conversion to int, we call TruncToInt on the result.
* RoundDecimal updated to consistently only ever return a float
* Add ToFloat for explicit coercion to float (required so we can enforce strict ValueRanges typing)

In **torch/__init__.py**, we modify SymInt and SymFloat to appropriately call into new bindings that route to these refined sympy operations.  Also, we modify `torch.sym_min` and `torch.sym_max` to have promotion semantics (if one argument is a float, the return result is always a float), making them inconsistent with builtins.min/max, but possible to do type analysis without runtime information.

We also need to introduce some new op handlers in **torch/_inductor/ops_handler.py**:

* `to_int` for truncation to int64, directly corresponding to TruncToInt; this can be implemented by trunc and dtype, but with a dedicated handler it is more convenient for roundtripping in Sympy
* `int_truediv` for Python-style integer true division, which has higher precision than casting to floats and then running `truediv`

These changes have consequences. First, we need to make some administrative changes:

* Actually wire up these Sympy functions from SymInt/SymFloat in **torch/fx/experimental/sym_node.py**, including the new promotion rules (promote2)
* Add support for new Sympy functions in **torch/utils/_sympy/interp.py**, **torch/utils/_sympy/reference.py**
  * In particular, in torch.utils._sympy.reference, we have a strong preference to NOT do nontrivial compute, instead, everything in ops handler should map to a singular sympy function
  * TODO: I chose to roundtrip mod back to our Mod function, but I think I'm going to have to deal with the C/Python inconsistency this to fix tests here
* Add printer support for the Sympy functions in **torch/_inductor/codegen/common.py**, **torch/_inductor/codegen/cpp_utils.py**, **torch/_inductor/codegen/triton.py**. `int_truediv` and mixed precision equality is currently not implemented soundly, so we will lose precision in codegen for large values. TODO: The additions here are not exhaustive yet
* Update ValueRanges logic to use new sympy functions in **torch/utils/_sympy/value_ranges.py**. In general, we prefer to use the new Sympy function rather than try to roll things by hand, which is what was done previously for many VR analysis functions.

In **torch/fx/experimental/symbolic_shapes.py** we need to make some symbolic reasoning adjustments:

* Avoid generation of rational subexpressions by removing simplification of `x // y` into `floor(x / y)`. This simplification then triggers an addition simplification rule `(x + y) / c --> x / c + y / c` which is bad because x / c is a rational number now
* `_assert_bound_is_rational` is no more, we no longer generate rational bounds
* Don't intersect non-int value ranges with the `int_range`
* Support more sympy Functions for guard SYMPY_INTERP
* Assert the type of value range is consistent with the variable type

The new asserts uncovered necessary bug fixes:

* **torch/_inductor/codegen/cpp.py**, **torch/_inductor/select_algorithm.py**, **torch/_inductor/sizevars.py** - Ensure Wild/Symbol manually allocated in Inductor is marked `is_integer` so it's accepted to build expressions
* **torch/_inductor/utils.py** - make sure you actually pass in sympy.Expr to these functions
* **torch/_inductor/ir.py** - make_contiguous_strides_for takes int/SymInt, not sympy.Expr!
* **torch/export/dynamic_shapes.py** - don't use infinity to represent int ranges, instead use sys.maxsize - 1

Because of the removal of some symbolic reasoning that produced rationals, some of our symbolic reasoning has gotten worse and we are unable to simplify some guards. Check the TODO at **test/test_proxy_tensor.py**

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126905
Approved by: https://github.com/xadupre, https://github.com/lezcano
2024-06-04 11:47:32 +00:00
d6963e769c Force Inductor output code to be dumped even if it fails to compile (#127700)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127700
Approved by: https://github.com/oulgen
2024-06-03 14:06:53 +00:00
7e97b33fbb [Dynamo] Log backward graph compilation metrics (#126629)
Fixes #125313

Compilation metric logs for the code example at #125313:
```
%s CompilationMetrics(compile_id='0/0', frame_key='1', co_name='forward', co_filename='/data/users/ybliang/debug/debug2.py', co_firstlineno=10, cache_size=0, accumulated_cache_size=0, guard_count=11, shape_env_guard_count=0, graph_op_count=1, graph_node_count=3, graph_input_count=1, start_time=1716247236.6165977, entire_frame_compile_time_s=7.926939964294434, backend_compile_time_s=7.887059926986694, inductor_compile_time_s=4.108498811721802, code_gen_time_s=3.97833514213562, fail_type=None, fail_reason=None, fail_user_frame_filename=None, fail_user_frame_lineno=None, non_compliant_ops=set(), compliant_custom_ops=set(), restart_reasons={"'skip function graph_break in file /home/ybliang/local/pytorch/torch/_dynamo/decorators.py'"}, dynamo_time_before_restart_s=0.025330543518066406, has_guarded_code=True, is_fwd=True)
%s CompilationMetrics(compile_id='1/0', frame_key='2', co_name='torch_dynamo_resume_in_forward_at_12', co_filename='/data/users/ybliang/debug/debug2.py', co_firstlineno=12, cache_size=0, accumulated_cache_size=0, guard_count=10, shape_env_guard_count=0, graph_op_count=2, graph_node_count=5, graph_input_count=1, start_time=1716247244.544928, entire_frame_compile_time_s=0.10148310661315918, backend_compile_time_s=0.08753013610839844, inductor_compile_time_s=0.03691983222961426, code_gen_time_s=0.022417306900024414, fail_type=None, fail_reason=None, fail_user_frame_filename=None, fail_user_frame_lineno=None, non_compliant_ops=set(), compliant_custom_ops=set(), restart_reasons=set(), dynamo_time_before_restart_s=0.0, has_guarded_code=True, is_fwd=True)
tensor([[-0.1622, -0.0000, -0.0000,  0.5643, -0.0000,  0.0000, -0.5087,  0.0914,
         -0.0000, -0.0421]], grad_fn=<CompiledFunctionBackward>)
%s CompilationMetrics(compile_id='1/0', frame_key=None, co_name=None, co_filename=None, co_firstlineno=None, cache_size=None, accumulated_cache_size=None, guard_count=None, shape_env_guard_count=None, graph_op_count=None, graph_node_count=None, graph_input_count=None, start_time=None, entire_frame_compile_time_s=None, backend_compile_time_s=None, inductor_compile_time_s=0.026738643646240234, code_gen_time_s=0.016446352005004883, fail_type=None, fail_reason=None, fail_user_frame_filename=None, fail_user_frame_lineno=None, non_compliant_ops=None, compliant_custom_ops=None, restart_reasons=None, dynamo_time_before_restart_s=None, has_guarded_code=None, is_fwd=False)
%s CompilationMetrics(compile_id='0/0', frame_key=None, co_name=None, co_filename=None, co_firstlineno=None, cache_size=None, accumulated_cache_size=None, guard_count=None, shape_env_guard_count=None, graph_op_count=None, graph_node_count=None, graph_input_count=None, start_time=None, entire_frame_compile_time_s=None, backend_compile_time_s=None, inductor_compile_time_s=0.14563536643981934, code_gen_time_s=0.08652091026306152, fail_type=None, fail_reason=None, fail_user_frame_filename=None, fail_user_frame_lineno=None, non_compliant_ops=None, compliant_custom_ops=None, restart_reasons=None, dynamo_time_before_restart_s=None, has_guarded_code=None, is_fwd=False)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126629
Approved by: https://github.com/ezyang
2024-06-03 03:55:33 +00:00
25994a7ed1 [AOTI] Fix a bug when mutated buffer meets .to (#127671)
Summary: Before this change, the added unit test will trigger: `AssertionError: Can not find the original value for L__self____tensor_constant0_cuda0`. The reason is GraphLowering.constant_name could rename a constant with a device suffix but AOTI requires that new name being registered properly.

Differential Revision: [D58047165](https://our.internmc.facebook.com/intern/diff/D58047165)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127671
Approved by: https://github.com/ColinPeppler, https://github.com/22quinn
2024-06-01 12:30:56 +00:00
942d9abd66 [AOTI] Update reinplace to cover mutated buffer (#127297)
Summary: Unlike JIT Inductor, AOTI currently unlifts weights and buffers from input args, so the reinplace pass didn't really work for AOTI because it only checks mutation on placeholder, which led to excessive memory copies for kv_cache updates in LLM models. This PR removes those memory copies and roughly offers a 2x speedup. In the future, we will revert the unlift logic in AOTI and make the behvior consitent with JIT Inductor.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127297
Approved by: https://github.com/peterbell10, https://github.com/chenyang78
2024-05-29 13:07:53 +00:00
3cb16ebf08 [BE]: Update ruff to 0.4.5 (#126979)
Update ruff to 0.4.5 and addresses some false negatives that have been found in the newer version.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126979
Approved by: https://github.com/ezyang
2024-05-24 18:38:35 +00:00
dd64ca2a02 Inductor respects strides for custom ops by default (#126986)
Previously, the default was that Inductor did not respect strides for
all (builtin and custom) ops unless the op has a
"needs_fixed_stride_order" tag on it. This PR changes it so that:

- inductor doesn't respect strides for builtin ops. To change the
  behavior, one can add the "needs_fixed_stride_order" tag
- inductor does respect strides for custom ops. To change the behavior,
  one can add the "does_not_need_fixed_stride_order" tag

Test Plan:
- new tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126986
Approved by: https://github.com/ezyang, https://github.com/albanD
2024-05-24 11:11:18 +00:00
51e707650f Fix flexattention not realizing inputs before lowering (also refactored runtime estimation) (#126615)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126615
Approved by: https://github.com/yanboliang, https://github.com/drisspg, https://github.com/xmfan
2024-05-22 17:28:46 +00:00
8a4597980c Revert "Fix flexattention not realizing inputs before lowering (also refactored runtime estimation) (#126615)"
This reverts commit 831efeeadf5fa8d9e7f973057e634a57e3bcf04b.

Reverted https://github.com/pytorch/pytorch/pull/126615 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/126615#issuecomment-2124169157))
2024-05-22 08:23:40 +00:00
f6ffe32a9d [AOTInductor] Automatic detection for buffer mutation and binary linking (#126706)
Summary: Instead of a explicit config for users to determine buffer mutation, we automatically detect whether there's buffer mutation in the model and determine which section constants would be placed. If constants are too large and doesn't fit within section, we error out directly.

Test Plan: Existing tests for buffer mutation and large weight linking

Differential Revision: D57579800

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126706
Approved by: https://github.com/desertfire
2024-05-21 21:49:13 +00:00
831efeeadf Fix flexattention not realizing inputs before lowering (also refactored runtime estimation) (#126615)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126615
Approved by: https://github.com/yanboliang, https://github.com/drisspg, https://github.com/xmfan
2024-05-20 23:40:56 +00:00
45f2d09452 [Quant][Inductor] Enable lowering of qlinear-binary(-unary) fusion for X86Inductor (#122593)
**Description**
Lower the qlinear binary post op pattern to Inductor. Use post op sum (in-place) if the extra input has the same dtype as output. Otherwise, it uses binary add.

**Supported linear-binary(-unary) patterns**
```
    linear(X)   extra input
           \   /
            Add
             |
        Optional(relu)
             |
             Y

1. int8-mixed-fp32
+---+---------------+-----------+------------------------------+---------+
| # | Add type      | Quant out | Pattern                      | Post op |
+---+---------------+-----------+------------------------------+---------+
| 1 | In-/out-place | Yes       | linear + fp32 -> (relu) -> q | add     |
+---+---------------+-----------+------------------------------+---------+
| 2 | In-/out-place | No        | linear + fp32 -> (relu)      | sum     |
+---+---------------+-----------+------------------------------+---------+

2. int8-mixed-bf16
+---+----------+---------------+-----------+--------------------------------------------------+---------+
| # | X2 dtype | Add type      | Quant out | Pattern                                          | Post op |
+---+----------+---------------+-----------+--------------------------------------------------+---------+
| 1 | BF16     | In-/out-place | Yes       | linear + bf16 -> (relu) -> to_fp32 -> q          | add     |
+---+----------+---------------+-----------+--------------------------------------------------+---------+
| 2 | BF16     | In-/out-place | No        | linear + bf16 -> (relu)                          | sum     |
+---+----------+---------------+-----------+--------------------------------------------------+---------+
| 3 | FP32     | Out-place     | Yes       | linear + fp32 -> (relu) -> q                     | add     |
|   |          | In-place right|           |                                                  |         |
+---+----------+---------------+-----------+--------------------------------------------------+---------+
| 4 | FP32     | Out-place     | No        | linear + fp32 -> (relu)                          | sum     |
|   |          | In-place right|           |                                                  |         |
+---+----------+---------------+-----------+--------------------------------------------------+---------+
| 5 | FP32     | In-place left | Yes       | linear + fp32 -> to_bf16 -> relu -> to_fp32 -> q | add     |
+---+----------+---------------+-----------+--------------------------------------------------+---------+
| 6 | FP32     | In-place left | No        | linear + fp32 -> to_bf16 -> (relu)               | add     |
+---+----------+---------------+-----------+--------------------------------------------------+---------+
```
Note
(1) The positions of linear and the extra input can be swapped.
(2) we don't insert q-dq before the extra input of linear-add by recipe. But if q-dq is found at the
extra input, we don't match that pattern because we cannot match all these patterns in 3 passes.

**Test plan**
python test/inductor/test_mkldnn_pattern_matcher.py -k test_qlinear_add
python test/inductor/test_cpu_cpp_wrapper.py -k test_qlinear_add

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122593
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/eellison
2024-05-17 07:46:48 +00:00
9e1826deff [torchbind] Add inductor support (#123709)
Example inductor generated python code: [P1245776497](https://www.internalfb.com/phabricator/paste/view/P1245776497)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123709
Approved by: https://github.com/eellison
2024-05-13 18:18:17 +00:00
aca0807101 [AOTI] Use random inputs to autotune the backward pass (#125291)
Summary: This is for JIT Inductor with cpp wrapper, fixing https://github.com/pytorch/pytorch/issues/117367.

In the backward pass, we don't have real inputs to execute the backward pass to autotune kernels. We have 3 options here, 1) use random tensor inputs; 2) store the forward outputs and feed them to backward (non-trivial because of parameter re-ordering); 3) autotune each kernel with random inputs in a subprocess (similar to select_algorithm). This PR uses the easist option 1. Option 3 is where we are going as the next step, which will simplify the cpp wrapper codegen for the CUDA backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125291
Approved by: https://github.com/chenyang78, https://github.com/angelayi
2024-05-10 20:13:34 +00:00
82b7b59d2a [inductor] Check if n is the input tensor of conv_pointwise (#125119)
Fix https://github.com/pytorch/pytorch/issues/124837.
Check whether n is the input tensor of convolution_pointwise or qconv2d_pointwise, if so freeze the layout to channels_last.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125119
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
2024-05-08 13:25:49 +00:00
48b6c8dbc3 [Inductor] log fusion failure due to index mismatch (#124986)
The scheduler searches for fusion opportunities by looking for common memory access. Two memory access are considered common not only when the buffer name match, but it also requires more things
- index formula matches
- var_ranges matches

In this PR, I want to log all the fusion failures due to mismatch index formula or var_ranges. I also want to further categories the failures. Right now I found the following failure categories
- rand_seed: the index for rand seed access is an integer and different access uses different integer offset
- different numel: this happens for cat operation
- broadcast: e.g. kernel A write a buffer which is broadcasted and read by kernel B
- different loop orders: the major category we want inductor to be able to fuse
- different offset: happens when use a concatenated linear layer to project Q/K/V and then split the result. Each split will point to the same buffer with different offset.
- unknown

My hope is to make sure for the models I tested, there is no fusion failure falling in the unknown category so all the failures are well understood and categories. Right now it's true for BertForMaskedLM ( https://gist.github.com/shunting314/6dc2c903629d342fa63ba731a171adc2  ), DistillGPT2 ( https://gist.github.com/shunting314/145176f2e850103c7fad4ad72f0e200e ) and llm.c ( https://gist.github.com/shunting314/cfc64a326312a889ba55f79bd47b2082 )

For BertForMaskedLM, we found 82 instances of fusion failures and majority of them are due to different loop orders! Studying the log a bit more can help us figure out where all these loop order mismatch comes from in real models.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124986
Approved by: https://github.com/eellison, https://github.com/jansel
2024-05-07 02:29:00 +00:00
7ffa5558ee Revert "[FX] Update type hints in torch.fx._compatibility.py (#125469)"
This reverts commit 235b4d6ec22ddac35b2e47b7e871ef10538d4aee.

Reverted https://github.com/pytorch/pytorch/pull/125469 on behalf of https://github.com/izaitsevfb due to breaks pyre in dependent projects (internal: see D56986361) ([comment](https://github.com/pytorch/pytorch/pull/125469#issuecomment-2096665396))
2024-05-06 18:36:43 +00:00
1dd42e42c4 [BE]: Try TCH autofixes on torch/ (#125536)
Tries TCH autofixes and see what breaks

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125536
Approved by: https://github.com/ezyang
2024-05-05 23:13:59 +00:00
235b4d6ec2 [FX] Update type hints in torch.fx._compatibility.py (#125469)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125469
Approved by: https://github.com/Skylion007
ghstack dependencies: #125468
2024-05-05 19:30:22 +00:00
e5e623af4b Codegen runtime asserts in Inductor (#124874)
This completely subsumes https://github.com/pytorch/pytorch/pull/120816

This makes use of the unbacked binding machinery to teach Inductor how to generate deferred runtime asserts directly. There is some back story about why I did it this way, let me explain.

Previously, our strategy for generating runtime asserts was that Dynamo would insert them into the FX graph after finishing tracing, and we would attempt to code generate them based on the FX graph. This is a good strategy for export, where we immediately export the graph. However, this strategy was afflicted by problems in eager, where we reuse the same ShapeEnv as before. In particular, on subsequent graph passes, we would immediately turn all of these assertions into noops, because when we evaluated their expressions, we would see that because we had a deferred runtime assert in the ShapeEnv, we know "oh, of course this expression is True" already. Oops!

So, with this PR, we take the attitude that as long as the ShapeEnv sticks around, the ShapeEnv's list of deferred runtime asserts is the source of truth, and we don't put anything in the graph. So we just need to decide when to actually generate asserts, and the place I picked was Inductor lowering, since we already have an AssertScalar buffer concept, and so I just need to insert them at this point. AssertScalar also uses raw sympy.Expr rather than SymInt/Bool, so it is easier to prevent unrestricted simplification at this point.

There are a few things jumbled together in this PR. I can split them if you want, but some of the changes are before I changed my strategy, but they're useful changes anyway.

**torch/_dynamo/output_graph.py** and **torch/_inductor/lowering.py** - Here, we stop putting deferred runtime asserts in the graph. I also have to make sure we don't DCE unused symbol arguments; we're going to get some goofy graph arguments this way, will be good to restore that optimization eventually. We also just disable codegen for `_assert_scalar`  entirely; we assume that ShapeEnv will be good enough to capture all of these.

**torch/_inductor/codegen/wrapper.py** and **torch/_inductor/ir.py** - Add a way to codegen sizevars without forcing simplification

**torch/_inductor/graph.py** - The main logic. Our strategy is to interpose in the same place we are testing that unbacked SymInts are properly showing up in lowered code. The logic is directly analogous to the logic in the existing insert deferred runtime asserts FX pass, but it's simpler because sympy expressions can be directly stored on inductor IR nodes.

**torch/fx/experimental/symbolic_shapes.py** - For extra safety, we have a way of freezing runtime asserts, so that if you try to add more we error. This prevents us from adding runtime asserts after we've done lowering. There's a funny interaction with backwards which there's a comment for in graph.py

**torch/fx/passes/runtime_assert.py** - This is not really needed in this PR, but I rewrote the runtime assert logic to use unbacked_bindings rather than inferring it by looking for unbacked SymInts. Now, keypaths are translated into FX node acessors. Unfortunately, I couldn't delete the old inference code, because you still need it to find backed SymInts from arguments (as this pass may be used on graphs which don't explicitly bind all their shape variables as argments). There are some new tests exercising this.

TODO: I think we need to generate asserts for replacements too. This is a preexisting problem that the old FX pass had too.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124874
Approved by: https://github.com/jansel
ghstack dependencies: #124864
2024-04-29 10:19:29 +00:00
555f1aeb02 Fix module buffer mutation (#124586)
Fixes #124583

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124586
Approved by: https://github.com/leslie-fang-intel, https://github.com/desertfire
2024-04-29 06:05:12 +00:00
7321005dd8 Add support for capturing tensors with score_mod (#124444)
```
import torch
from torch import nn
import torch.nn.functional as F
import torch._inductor.config as config
# torch.set_default_device('cuda')

import torch
from torch.nn.attention._templated_attention import _templated_attention as templated_attention
from triton.testing import do_bench
from torch.nn.attention import SDPBackend, sdpa_kernel

index = torch.ops.aten
torch.manual_seed(0)

B = 16
H = 16
S = 2048
D = 64

head_scale = torch.randn(H, device='cuda')
def alibi(score, batch, head, token_q, token_kv):
    return score + torch.ops.aten.index(head_scale, [head]) * (token_q - token_kv)
bias = torch.randn(H, S, S, dtype=torch.float16, device='cuda')

query = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16)
key = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16)
value = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16)

compiled = torch.compile(templated_attention)
out = compiled(query, key, value, score_mod=alibi)
out2 = templated_attention(query, key, value,score_mod=alibi)
print((out - out2).abs().mean())
assert (out - out2).abs().mean() < 1e-3
print("Flash (no mask): ", do_bench(lambda: F.scaled_dot_product_attention(query, key, value)))
print("Flash (mask): ", do_bench(lambda: F.scaled_dot_product_attention(query, key, value, attn_mask=bias)))
print("flexattention: ", do_bench(lambda: compiled(query, key, value, score_mod=alibi)))
```
<img width="324" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/18c175d0-2720-4dfd-8747-85b8a8f609f5">

Differential Revision: [D56583900](https://our.internmc.facebook.com/intern/diff/D56583900)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124444
Approved by: https://github.com/jansel, https://github.com/drisspg
2024-04-26 01:02:28 +00:00
4259e5d0e0 [inductor] Specialize on unguarded alignment of example inputs (#123319)
When inductor generates triton code, the triton code can either assume that the inputs given to it are aligned or unaligned. If they are aligned, triton can use more efficient instructions (like vectorized loads or tensor cores). However, if we generate "aligned" code and pass in unaligned inputs, the triton code will error out; to fix this, we clone unaligned inputs that are passed to triton kernels that expect aligned inputs. This can lead to excessive clones if we have inputs that are not expected to be aligned.

In this PR, we use the example input to decide whether the generated triton code should assume alignment or not. If the example input is aligned, then we will generate triton code that assumes alignment; if at runtime we receive an unaligned input, we'll make a clone. Meanwhile, if the example input is not aligned, the generated triton code will not assume inputs are aligned and we won't ever need to clone.

Note that the alignment of the inputs is not guarded on; we found that adding guards on tensor offsets (a) was slow in cases where we do a lot of comparisons on tensor offsets, and (b) led to a lot of recompilations.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123319
Approved by: https://github.com/eellison
2024-04-25 22:28:15 +00:00
b4597fffce Try to reuse old symbol name rather than new symbol name when renaming (#124782)
Previously, unbacked SymInts would gradually get larger and larger as we kept rebinding them.  Now, we do the replacement to preserve the old symbol.

Actually doing this is a bit tricky.  Here’s the order things happen when retracing data dependent:

1. Run fake tensor prop: allocate new unbacked SymInt
2. Run proxy tensor mode, calculate bindings and associate them with FX node
3. Run PropagateUnbackedSymInts, rename unbacked bindings to their old ones so they are consistent

So the problem is when we calculate bindings in step (2), we don't know
what the original names are yet, we only find out later at (3).  But by
the time (3) runs, we've already stuffed some new bindings in
meta["unbacked_bindings"] and we don't know how to update them!  To fix
this, I introduce resolve_unbacked_bindings which post facto applies any
of the renamings we discovered in (3).

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124782
Approved by: https://github.com/lezcano
ghstack dependencies: #124310, #124314, #124316, #124394, #124739
2024-04-25 14:02:42 +00:00
0ca1ff3dce Revert "Add support for capturing tensors with score_mod (#124444)"
This reverts commit 7c253a777641791247f7fcc19fe5c60f24be32b9.

Reverted https://github.com/pytorch/pytorch/pull/124444 on behalf of https://github.com/jeanschmidt due to Breaking internal tests, check D56522566 ([comment](https://github.com/pytorch/pytorch/pull/124444#issuecomment-2076908582))
2024-04-25 10:56:38 +00:00
13ab24f192 Reimplement unbacked symbol bindings in Inductor (#124394)
This PR has a lot of "draw the rest of the fucking owl" energy. Here's how to break it down.

1. **torch/_inductor/graph.py** - We start by tightening unbacked symbol invariants. Specifically, as we lower FX nodes, we check whether or not every unbacked_binding recorded on the FX node meta, actually ends up getting bound (according to get_unbacked_symbol_defs) in all the buffers generated by the lowering. Hopefully this invariant is self evident. This leads to a lot of failures.
2. **torch/_inductor/ir.py** - Problem 1: There is softness in how Inductor computes defs of unbacked symbols in IR node. Previously, we tried to infer it by looking at the output sizes/strides/etc and see if new unbacked symbols popped up that we hadn't seen in the inputs. I don't know exactly what was buggy about the old code, but sometimes we would fail to notice an unbacked symbol had been bound, or rebind an unbacked symbol multiple times. Fortunately, thanks to the earlier PRs in our stack, we now have a nice list of unbacked symbol bindings from FX, so we now just store it directly on ExternKernel and use it directly to report defs. This has to be done twice: once for FallbackKernel (e.g., nonzero) and once for DynamicScalar (e.g., item) (see also **torch/_inductor/lowering.py**, **torch/_inductor/codegen/wrapper.py** and  **torch/_inductor/codegen/cpp_wrapper_cpu.py** for the lowering and codegen changes for item)
   * **process_kernel** - Sidequest! It turns out that Inductor lowering can reallocate unbacked symbols. This happens specifically when we repropagate fake tensors through the operator in `process_kernel`. This repropagation process is necessary because Inductor may have changed the strides of input tensors, and it must now recompute the strides so that it can continue to appropriately plan the rest of the lowering process. This is fine: we just make sure we do the rebind unbacked + compute_unbacked_bindings dance we've been doing previously in the PR stack. But instead of putting unbacked_bindings on a new FX node, they go straight into our unbacked_bindings on the Inductor IR node.
    * **codegen_unbacked_symbol_defs** - Sidequest! FallbackKernel lowering is done in two steps. First, you emit the FallbackKernel buffer. Then, you emit MultiOutput buffers which actually give access to the individual outputs of FallbackKernel, which may have been multi-output. There is a design decision here: does the FallbackKernel bind the unbacked symbols, or the MultiOutput buffer? Historically, we put the binding on MultiOutput buffer, because it's more convenient: the FallbackKernel buffer is fake, in fact, it doesn't even get a name in C++ codegen. But it's kind of inconsistent with the keypath model that we've been tracking unbacked bindings with: if you have a multi-output node, you'd expect a keypath like `[0].size()[0]` representing the first output's first dimension size. That suggests that it's the FallbackKernel that should define the things. So that was my first implementation. Unfortunately, the C++ codegen is too cursed and I could not understand how to make it work in that case. So now we just unsoundly assume you cannot have multi-output data dependent output, and do the codegen in MultiOutput. There are some comments explaining exactly what we are improperly assuming.
3. **_rename_unbacked_to** in **torch/fx/experimental/symbolic_shapes.py** - Previously, when we renamed unbacked symbols, we clobbered any facts we previously knew about them. So for example, if we had a replacement `u0 -> s0` but then we renamed u0 to u1, we would now setup the replacement `u0 -> u1`, clobbering the old replacement. This apparently didn't matter in earlier PRs in the stack, but with Inductor now on the ball, there were some tests that indicated this was a problem. The solution is easy: if u0 had a preexisting replacement, reapply it to u1. However...
    * **torch/_functorch/_aot_autograd/collect_metadata_analysis.py** - When we run forward analysis, this triggers fake tensor repropagation and fresh allocations. Previously, we just cleared out the pending symbols when finished the analysis. But with the change above, this would also migrate replacements to the new symbols... which are now dead. So now we explicitly suppress generation of these symbols with `ignore_fresh_unbacked_symbols` so that no rebinding happens at all.
    * **torch/_dynamo/eval_frame.py** - same deal; I just searched for all sites we called clear() on pending
4. The last step is fixing the long tail of extra problems that show up, now that unbacked_bindings are load bearing into Inductor
    * **torch/_dynamo/eval_frame.py** - Some of the exports are making copies of nodes without repropagating fake tensors, so in this case, it is important to also copy the `unbacked_bindings` (apparently this didn't matter before without the Inductor changes)
    * **torch/_export/pass_base.py** - I discover that this is doing fake tensor repropagation via a test suite failure. Do the same playbook as AOTAutograd: PropagateUnbackedSymInts too!  Actually, they also have implemented their own tracer as well, so do the same playbook as proxy_tensor: record unbacked_bindings on the newly traced nodes. UGH code duplication.
    * **torch/_subclasses/fake_tensor.py**, **torch/_subclasses/fake_impls.py** (with call site updates at  **torch/_functorch/_aot_autograd/traced_function_transforms.py** and **torch/fx/passes/fake_tensor_prop.py**) - What's this new epoch thing? I noticed that sometimes I would be retracing, call nonzero() on a fake tensor, and not allocate a new unbacked symbol. This is actually bad, because if I don't get a new unbacked symbol, I don't know there's a binding site, and `unbacked_bindings` is now missing a binding. The reason for this is memoization: if I reuse the exact same fake tensor on my retrace, it will already have an unbacked symint memoized on it and we will short circuit allocation. Well, that's no good. So I associate the memos with a fake tensor epoch, and every time you start a new fake tensor propagation from scratch, you bump the epoch so that I clear all the memos.
    * **torch/_inductor/scheduler.py** - I notice in unit tests that V.current_node is not always set when we call process_kernel. So I save it into the IR node and restore it when we are running `get_estimated_runtime`.
    * **torch/fx/experimental/symbolic_shapes.py** - A few things
      * **rebind_unbacked** (re **_tensor_version**). Ordinarily, when you have an unbacked SymInt, you persistently hvae it all the way to the end of the program. `_tensor_version` violates this: this generates an unbacked SymInt (for reasons I don't quite understand?) and then gets rid of it later. This triggered an assert violation. I think this op is kind of misusing unbacked SymInt, but I didn't know how to refactor it, so it gets a special case.
      * **rebind_unbacked** (re **Simplify SymBool binding**). Ugh, SymBool, what a pain in the butt. I have an assert that you can only rebind unbacked symbol to another unbacked symbol. This assert fails when a boolean is involved, because the result of running keypath on the result is not `u1`, it's `sympy.Piecewise(... sympy.Eq(u1, 1) ...)`. This is actually just `u1`, but Sympy doesn't know it because it doesn't know that `u1` value range is `[0, 1]`. So we manually implement the simplification needed to get the assert to pass.
      * **compute_unbacked_bindings** (re **This is pretty fragile**). There is a really funny disaster involving memoization and Inductor process kernel. Ordinarily when I retrace, if there was a memo hit in the old trace, there will be a memo hit in the new trace. However, Inductor process kernel breaks this, because it recreates fake tensor inputs to the operator call from scratch (since they might have different strides), and obviously these tensor inputs don't have the memo from the old one. I tried a little bit to try to manually transplant the memo to the new fake tensor but it seemed hopeless, so I just let the fresh symbol ride, allocating a new unbacked symbol. However, in one of our tests, we rely on knowing that the first nonzero call is equal to the second (memoized) nonzero call. The equality test looked pretty easy to discharge, so I just went ahead and added a deferred runtime assert to this effect and it worked.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124394
Approved by: https://github.com/jansel
ghstack dependencies: #124310, #124314, #124316
2024-04-25 02:08:59 +00:00