Summary: ONNX team and recent transformer upgrade ran into this error and we also ran into during our export benchmarking. This diff makes it possible to trace through vmap implementation in pre-dispatch IR. Note that we don't support serializing functorch ops in pre-dispatch IR and in the future, we should desugar them to post-grad ops.
The implementation strategy is:
1. We add python wrappers around vmap APIs so that we attach custom torch function handler that is only on during non-strict export. The reason is we don't want to add this to default torch_function handler because it will break BC.
2. Some dynamo changes to make sure it picks up new python wrapper APIs. The reason is when we do strict export, we need to re-materialize these APIs in pre-dispatch IR from torch IR. We can avoid this by special casing in dynamo for export to proxy different API calls but i feel that is too much chaos because you need to be able to proxy 2 different variants of same vmap API.
Test Plan: CI
Differential Revision: D75623875
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154650
Approved by: https://github.com/ezyang, https://github.com/zou3519
This PR fixes a bug where user defined triton kernels hidden behind `triton_op` do not register source code changes. If a user *only* changes a triton kernel source_code, because triton kernels are hidden under the custom op, dynamo hasn't traced into them yet.
This means at AOTAutograd time, we don't know the list of triton kernels that are defined by custom ops. This is an initial fix for the issue by parsing the AST of the custom op looking for triton kernels. This won't catch more degenerate cases if the custom op calls other custom ops/functions that then call triton kernels, and then the toplevel compiled graph doesn't know about it. To handle that, we'd have to trace through the custom op at dynamo time.
This should handle 99% of cases, though. I added an expectedFailure test to show the limitation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160120
Approved by: https://github.com/zou3519
Currently, the device-bias linter only targets functions decorated with @requires_gpu. This PR adds support for two new detection scenarios:
1. Detect device-bias code in functions decorated with @requires_triton.
2. Detect device-bias code for entire test suites that are defined as shared across GPUs. For example:
```
if __name__ == "__main__":
if HAS_GPU:
run_tests()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159949
Approved by: https://github.com/EikanWang, https://github.com/jansel
The goal of this PR is to fix a specific bug when turning precompile on/off between caching runs.
If you try to turn on BundledAOTAutogradCacheEntry today in between local runs, the FXGraphCache may randomly hit *between* the two runs, because FXGraphCache knows nothing about AOTAutogradCache's config. When FXGraphCache hits, it immediately will call make_launchers() immediately on the triton code it launches, which then causes an assertion failure because pickle should not be called after make_launchers.
One way to resolve the bug is just to add whether precompile is enabled to teh FxGraph cache key. But the better fix for this, however, is higher level/philosophical:
When using BundledAOTAutogradCacheEntry, the entire CompiledFxGraph is saved directly to the cache entry, and we expect the two caches to work in sync, i.e. as one cache. So to simplify the programming model, we disable FxGraphCache when BundledAOTAUtogradCache is turned on.
BundledAOTAutogradCacheEntry is only used for precompile use cases now; if we wanted to use BundledAOTAutogradCache for traditional caching use cases, there's a bunch of further work, one of which would be to re-enable FxGraphCache in the event that BundledAOTAutogradCache has to bypass. However, for precompile, this is not a scenario that should happen: we should always expect the entire callable to be saveable, and we should expect to never bypass. So we don't do that change for now.
Added a unit test demonstrating this behavior. Also updated existing unit tests to show that all fx graph cache operations are now 0 (but all tests still pass).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156611
Approved by: https://github.com/zhxchen17
This PR adds the necessary things to register and record backend ids from BundledAOTAutogradCacheEntry.
One TODO to point out; in this diff, if there are multiple backends that would have the same AOTAutogradCache key (traditional cache key, not backend_id), we just end up serializing the same BundledAOTAutogradCache entry multiple times. This is not ideal obviously, so we'll want to deduplicate these and just track the different keys that one BundledAOTAutogradCacheEntry is associated with instead. This shouldn't be super hard to do, though, as we just need to run a deduplication step on call to `serialize()`, I think.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155387
Approved by: https://github.com/oulgen
This PR adds the necessary things to register and record backend ids from BundledAOTAutogradCacheEntry.
One TODO to point out; in this diff, if there are multiple backends that would have the same AOTAutogradCache key (traditional cache key, not backend_id), we just end up serializing the same BundledAOTAutogradCache entry multiple times. This is not ideal obviously, so we'll want to deduplicate these and just track the different keys that one BundledAOTAutogradCacheEntry is associated with instead. This shouldn't be super hard to do, though, as we just need to run a deduplication step on call to `serialize()`, I think.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155387
Approved by: https://github.com/oulgen
Compiled Autograd retraces AOT's bw_module at backward runtime into a larger graph, and today this runs into an issue on warm cache runs because the bw_module is not restored. This PR adds it to the cache, by first stripping it bare from unserializable metadata. I also intentionally differentiate the cached and non-cached versions to avoid accidental attempts of AOT compilation with a restored bw_module (would probably crash).
The bw_module's generated code is then serialized, and at compiled autograd runtime, it is restored via symbolic_trace. This also means that presence of tensor constructors will be lifted as constants. Something we will address separately.
Note that since the cache entry may be used by runs that use compiled autograd and runs that do not, we need to cache both the lowered backward and the bw_module.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151860
Approved by: https://github.com/jamesjwu
ghstack dependencies: #156120
On this line, we see that the bw_compiler that dynamo uses for AotAutograd automatically disables the backward runnable:
05dd638ee9/torch/_dynamo/backends/common.py (L76)
This disables dynamo in the bw_compiler but also disables the runnable the compiler returns.
On a AOTAutogradCache hit, however, we never call the bw_compiler! So we don't disable dynamo properly. This only has an effect on certain cases of cpu tensors' backwards, where the backward is being done in python land, and dynamo unnecessarily tries to trace through the inductor generated code. It also only matters if the backward is being accessed outside of dynamo itself (say, in a graph break in eager mode), since dynamo properly disables the forward function already.
```
I0605 09:58:40.135000 3981970 torch/_dynamo/eval_frame.py:517] TorchDynamo attempted to trace the following frames: [
I0605 09:58:40.135000 3981970 torch/_dynamo/eval_frame.py:517] * fn /home/jjwu/test.py:9
I0605 09:58:40.135000 3981970 torch/_dynamo/eval_frame.py:517] * cast /data/users/jjwu/a/pytorch-env/lib/python3.10/typing.py:1737
I0605 09:58:40.135000 3981970 torch/_dynamo/eval_frame.py:517] * call /tmp/torchinductor_jjwu/rq/crq327nhoyjzog5n3qlchauucdrunrtutwmmoh7ipoe2ngnson5s.py:35
I0605 09:58:40.135000 3981970 torch/_dynamo/eval_frame.py:517] * fn /home/jjwu/test.py:9
I0605 09:58:40.135000 3981970 torch/_dynamo/eval_frame.py:517] * cast /data/users/jjwu/a/pytorch-env/lib/python3.10/typing.py:1737
I0605 09:58:40.135000 3981970 torch/_dynamo/eval_frame.py:517] * call /tmp/torchinductor_jjwu/rq/crq327nhoyjzog5n3qlchauucdrunrtutwmmoh7ipoe2ngnson5s.py:35
I0605 09:58:40.135000 3981970 torch/_dynamo/eval_frame.py:517] ]
```
This PR fixes the issue and adds a unit test showing that with or without cache hit, the frames dynamo is tracing is identical.
Fixes https://github.com/pytorch/pytorch/issues/154536
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155251
Approved by: https://github.com/bdhirsh, https://github.com/anijain2305
AOTAutogradCache uses FXGraphCache which uses the tracing context to get the ShapeEnv. Although the TracingContext global_context is cleared by the time we get around to reusing it, we don't actually need it. We just need the ShapeEnv in the TracingContext, which isn't cleared at the end of dynamo and does persist. This PR adds the tracing context manager around the specialized compile to ensure our caching infrastructure can get access to the ShapeEnv. A test was also added to prove correctness.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153526
Approved by: https://github.com/jamesjwu, https://github.com/zou3519
ghstack dependencies: #153433, #153449
https://github.com/pytorch/pytorch/issues/148222
Goal:
At the moment autograd saved tensors hooks are run in eager after compiled forward.
They are executed at the same time for all saved tensors.
Hooks can be used to reduce amout of memory used for saved tensors, doing quantization or offloading to cpu.
This is suboptimal for optimization of peak memory.
Better solution will be to put the hooks in the graph, as close as possible to the last usage of the tensor.
To get user specified autograd saved tensors hooks in the graph.
Logic:
UX:
If user specifies with torch.autograd.graph.saved_tensors_hooks(pack_gm, unpack_gm).
Where pack_gm and unpack_gm are torch.fx.GraphModule.
Then AotAutograd will retrace those graph modules, doing decompositions and functionalization in aot_autograd, inlining the result graphs in forward epilogue and backward prologue.
User may want to use control logic in the hooks, for example applying quantization only for specific dtypes and sizes.
This is also possible, user can put it into torch.fx.wrap function and use symbolic trace to make a GraphModule.
In that case AotAutograd cahing will work only in case when user explicitly set to the torch.fx.wrap call_function node "user_cache_hash" metadata.
If this metadata set - then aot_autograd cache can use saved cache artifact.
If metadata is not set - then cache is bypassed.
Dynamo:
Dynamo traces pack and unpack hooks and installs them as subgraph and explicitly adds to the output_graph. (As those subgraphs are not used and will not be copied in the result by default).
The complexity here is that at this moment we do not have example of inputs for the hooks.
We trace pack_hook with some Tensor from the inputs.
The result subgraphs are added to the hashing of AotAutograd Cache.
In AotAutograd we retrace the graph with the true saved tensors coming from partitioner.
Backwards Compatibility:
As current hooks are executed in eager mode and not all of them will be traceable - we only try to put in the graph hooks, explicitly marked by user with annotation (@_inlineable_saved_tensors_hooks).
For other hooks or if compiled autograd is enabled - keep the same logic.
Recompilations:
Hooks are guarded with lambda guard matching function id to cause recompilation if user reruns compiled function.
Aot_autograd:
After partitioner prepared forward and backward module - we trace prepared at Dynamo graphs for pack and unpack hooks and inline them in epilogue of forward and prologue of backward. Forward outputs and backward inputs are changed, transparently for user.
We do not try to put it close the last usage etc., relying on inductor to do this optimization.
```
INFO: TRACED GRAPH
===== Forward graph pre saved_tensors_hooks inlining 3 =====
/data/users/ivankobzarev/a/pytorch/torch/fx/_lazy_graph_module.py class GraphModule(torch.nn.Module):
def forward(self, primals_1: "Sym(s0)", primals_2: "Sym(s1)", primals_3: "f32[s0, s1][s1, 1]cuda:0"):
# File: /data/users/ivankobzarev/a/pytorch/test/functorch/test_aotdispatch.py:6660 in simple_fn, code: x = x + 1
add: "f32[s0, s1][s1, 1]cuda:0" = torch.ops.aten.add.Tensor(primals_3, 1); primals_3 = None
# File: /data/users/ivankobzarev/a/pytorch/test/functorch/test_aotdispatch.py:6661 in simple_fn, code: x = SAF.apply(x)
view: "f32[s0, s1][s1, 1]cuda:0" = torch.ops.aten.view.default(add, [primals_1, primals_2])
return (view, add, primals_1, primals_2)
INFO: TRACED GRAPH
===== Backward graph pre saved_tensors_hooks inlining 3 =====
/data/users/ivankobzarev/a/pytorch/torch/fx/_lazy_graph_module.py class GraphModule(torch.nn.Module):
def forward(self, primals_1: "Sym(s0)", primals_2: "Sym(s1)", primals_3: "f32[s0, s1][s1, 1]cuda:0"):
# File: /data/users/ivankobzarev/a/pytorch/test/functorch/test_aotdispatch.py:6660 in simple_fn, code: x = x + 1
add: "f32[s0, s1][s1, 1]cuda:0" = torch.ops.aten.add.Tensor(primals_3, 1); primals_3 = None
# File: /data/users/ivankobzarev/a/pytorch/test/functorch/test_aotdispatch.py:6661 in simple_fn, code: x = SAF.apply(x)
view: "f32[s0, s1][s1, 1]cuda:0" = torch.ops.aten.view.default(add, [primals_1, primals_2])
return (view, add, primals_1, primals_2)
INFO: TRACED GRAPH
===== saved_tensors_pack_hook add 3 =====
/data/users/ivankobzarev/a/pytorch/torch/fx/_lazy_graph_module.py class pack_float8(torch.nn.Module):
def forward(self, x_1: "f32[s0, s1][s1, 1]cuda:0"):
# No stacktrace found for following nodes
_to_copy: "f8e4m3fn[s0, s1][s1, 1]cuda:0" = torch.ops.aten._to_copy.default(x_1, dtype = torch.float8_e4m3fn); x_1 = None
return (torch.float32, _to_copy)
INFO: TRACED GRAPH
===== saved_tensors_unpack_hook add 3 =====
<eval_with_key>.22 from /data/users/ivankobzarev/a/pytorch/torch/fx/experimental/proxy_tensor.py:1225 in wrapped class pack_float8(torch.nn.Module):
def forward(self, x_1: "f32[s0, s1][s1, 1]cuda:0"):
# No stacktrace found for following nodes
_to_copy: "f8e4m3fn[s0, s1][s1, 1]cuda:0" = torch.ops.aten._to_copy.default(x_1, dtype = torch.float8_e4m3fn); x_1 = None
return (torch.float32, _to_copy)
INFO: TRACED GRAPH
===== Forward graph 3 =====
/data/users/ivankobzarev/a/pytorch/torch/fx/_lazy_graph_module.py class GraphModule(torch.nn.Module):
def forward(self, primals_1: "Sym(s0)", primals_2: "Sym(s1)", primals_3: "f32[s0, s1][s1, 1]cuda:0"):
# File: /data/users/ivankobzarev/a/pytorch/test/functorch/test_aotdispatch.py:6660 in simple_fn, code: x = x + 1
add: "f32[s0, s1][s1, 1]cuda:0" = torch.ops.aten.add.Tensor(primals_3, 1); primals_3 = None
# No stacktrace found for following nodes
_to_copy: "f8e4m3fn[s0, s1][s1, 1]cuda:0" = torch.ops.aten._to_copy.default(add, dtype = torch.float8_e4m3fn)
# File: /data/users/ivankobzarev/a/pytorch/test/functorch/test_aotdispatch.py:6661 in simple_fn, code: x = SAF.apply(x)
view: "f32[s0, s1][s1, 1]cuda:0" = torch.ops.aten.view.default(add, [primals_1, primals_2]); add = None
return (view, _to_copy, primals_1, primals_2)
INFO: TRACED GRAPH
===== Backward graph 3 =====
<eval_with_key>.21 class GraphModule(torch.nn.Module):
def forward(self, primals_1: "Sym(s0)", primals_2: "Sym(s1)", add_packed_2: "f8e4m3fn[s0, s1][s1, 1]cuda:0", tangents_1: "f32[s0, s1][s1, 1]cuda:0"):
# No stacktrace found for following nodes
_to_copy: "f32[s0, s1][s1, 1]cuda:0" = torch.ops.aten._to_copy.default(add_packed_2, dtype = torch.float32); add_packed_2 = None
# File: /data/users/ivankobzarev/a/pytorch/test/functorch/test_aotdispatch.py:6661 in simple_fn, code: x = SAF.apply(x)
add_7: "f32[s0, s1][s1, 1]cuda:0" = torch.ops.aten.add.Tensor(tangents_1, _to_copy); tangents_1 = _to_copy = None
return (None, None, add_7)
```
Differential Revision: [D72187044](https://our.internmc.facebook.com/intern/diff/D72187044)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150032
Approved by: https://github.com/bdhirsh
Finally, this PR adds BundledAOTAutogradCacheEntry. A BundledAOTAutogradCacheEntry is an AOTAutogradCacheEntry that saves the entire CompiledFxGraph directly in the entry.
This has some advantages:
- No more dependency on FxGraphCache at all
- Clearing FxGraphCache does not result in AOTAutogradCache miss
- Simpler logic, as BundledAOTAutogradCacheEntry has everything you need to load a full compiled python wrapper from a dynamo output
We plan to use BundledAOTAutogradCacheEntry for precompile. There's also a question of whether we want to use it for regular caching — the main disadvantage of this is having to save the same CompiledFxGraph twice, once in Inductor cache and once for AOTAutogradCache. With MegaCaching, this *could* be a regression in total cache size (as well as a minor cold start regression, as you have to save the same graph twice). I will import this and measure the mega cache space complexity, and if it looks good I'll enable it by default for caching as well.
On warm start, if AOTAutogradCache hits, you won't have to load inductor at all, so warm start overhead should be unaffected.
Differential Revision: [D74593304](https://our.internmc.facebook.com/intern/diff/D74593304)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152840
Approved by: https://github.com/zhxchen17
Compiled Autograd retraces AOT's bw_module at backward runtime into a larger graph, and today this runs into an issue on warm cache runs because the bw_module is not restored. This PR adds it to the cache, by first stripping it bare from unserializable metadata. I also intentionally differentiate the cached and non-cached versions to avoid accidental attempts of AOT compilation with a restored bw_module (would probably crash).
Note that since the cache entry may be used by runs that use compiled autograd and runs that do not, we need to cache both the lowered backward and the bw_module.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151860
Approved by: https://github.com/jamesjwu
ghstack dependencies: #149707
Today, we mark graph outputs as maybe dynamic, this lets a compilation to communicate to future compilations whether certain graph inputs are dynamic. Similarly, we can do this to saved activations, which may be used in future compilations as well. This is especially prevalent in compiled autograd, where tensor activations will always become graph inputs.
Changes to the tests were mainly cosmetic, with the exception of tests that relied on duck shaping. By annotating tensor dims, we prevent them from reusing pre-existing symbols, so this change will make graphs use duck shapes less than before, which affects some of the caching tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149707
Approved by: https://github.com/bdhirsh
This somewhat complicated PR does a few things:
- It separates out a lot of the guard checking logic into its own class, GuardedCache[T]
- It adds a new `check_guard_hit` lambda to FXGraphCache._lookup_graph, which allows callers to define their own guard checking logic
- It then uses these two combined parts to lift guard checking to AOTAutogradCache. This means that AOTAutogradCache stores its own guard expressions and evaluates them.
- FXGraphCache's guard checking logic is completely unchanged, just refactored. As part of the work, I'm able to extend a bit of the logging functionality of AOTAutogradCache into FXGraphCache, so that you can know if FXGraphCache missed due to a guard failure or a full cache miss.
# Why do this?
Lifting guards to AOTAutogradCache has a few benefits:
- First, it fixes a long standing bug in guard checking logic. Backward passes can have different symint inputs than forward passes depending on forward output, if AOTAutograd chooses to store symints for the backward. These symint inputs have the same underlying symbols as the forward, but on AOTAutogradCache hit, we don't have access to the hints backing these exact symints (we only have hints for the symints on the forward function). By lifting guard checking logic to AOTAutogradCache, we no longer need to check the backward guards, as they'll be included in the AOTAutogradCache guard expression. **I've added a unit test that failed before my diff, and now passes, as an example of this**
- Secondly, this is the first step necessary to bundle CompiledFxGraph into AOTAutogradCache. Doing so will simplify our cache logic significantly, and also make precompile logic simpler, as precompiles will only need to store AOTAutogradCacheEntrys, without needing to match them up with inductor FXGraphCache entries.
- Finally, adding guard checking logic to AOTAutogradCache my allow us in the future to handle more complicated cases like a single forward with multiple backwards, as guard checks are now storable on the cache entry itself.
# Guard checking logic of AOTAutogradCache
When AOTAutogradCache evaluates guard expressions, it no longer needs to evaluate the forward/backward guards in the FXGraphCacheEntry (since the AOTAutogradCache guard expressions will encompass them). Because of this, we still need a way for AOTAutogradCache to distinguish between multiple FXGraphCache local entries. To do so, AOTAutogradCache stores the guard string from FXGraphCache, which it uses as a second "cache key". It doesn't need to **evaluate** these guards, it just needs to find the cache entry from FXGraphCache that had the same guards as when it was stored.
After this, I will work on putting the FXGraphCache entries directly into AOTAutogradCache. If I can put CompiledFxGraphs in the cache directly, I no longer need this complicated `check_guard_hit` overriding logic.
## Test Plan
Added a new unit test. There are comprehensive guard checking unit tests in `test_aot_autograd_cache` already, and those pass.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151563
Approved by: https://github.com/oulgen
This PR was inspired by internal models that were cache missing due to PGO. At a high level the problem looks as follows
Run 1, Invocation 1: We do static compile, save some example values in PGO/automatic dynamic
Run 1, Invocation 2: We detect varying inputs, do dynamic compile, get a dynamic graph and save to PGO. Crucially what we save to PGO is actually a superset of what is actually dynamic. If we notice an input was varying, we mark it as dynamic in PGO even if later on that value gets specialized. When a value gets specialized, we actually remove the symbol from the graph. This results in an interesting conundrum where although we are producing the same isomorphic graph, PGO makes the second run cache miss. Let's see how....
Run 2, Invocation 1: We fetch the PGO, over-mark things as dynamic, get a fx graph, look it up in the cache and... whoops! cache miss! This is because of the aforementioned behavior where the PGO profile will cause us to over-allocate symbols. In practice this means we end up saving a graph in cache with symbols x:s1, y:s3 and on second attempt we cache miss with x:s1, y:s6 where symbols s3,s4,s5 were all optimistically marked dynamic by PGO and subsequently specialized.
We solve this problem by hashing the source names. This ensures somewhat stable assignment. To prevent catastrophic symbol collisions, we use linear probing to ensure no collisions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149665
Approved by: https://github.com/Mingming-Ding, https://github.com/laithsakka
This PR was inspired by internal models that were cache missing due to PGO. At a high level the problem looks as follows
Run 1, Invocation 1: We do static compile, save some example values in PGO/automatic dynamic
Run 1, Invocation 2: We detect varying inputs, do dynamic compile, get a dynamic graph and save to PGO. Crucially what we save to PGO is actually a superset of what is actually dynamic. If we notice an input was varying, we mark it as dynamic in PGO even if later on that value gets specialized. When a value gets specialized, we actually remove the symbol from the graph. This results in an interesting conundrum where although we are producing the same isomorphic graph, PGO makes the second run cache miss. Let's see how....
Run 2, Invocation 1: We fetch the PGO, over-mark things as dynamic, get a fx graph, look it up in the cache and... whoops! cache miss! This is because of the aforementioned behavior where the PGO profile will cause us to over-allocate symbols. In practice this means we end up saving a graph in cache with symbols x:s1, y:s3 and on second attempt we cache miss with x:s1, y:s6 where symbols s3,s4,s5 were all optimistically marked dynamic by PGO and subsequently specialized.
We solve this problem by hashing the source names. This ensures somewhat stable assignment. To prevent catastrophic symbol collisions, we use linear probing to ensure no collisions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149665
Approved by: https://github.com/Mingming-Ding, https://github.com/laithsakka
This adds a new env var and flag,
autograd_cache_allow_custom_autograd_functions, (env var: `TORCHINDUCTOR_AUTOGRAD_CACHE_ALLOW_CUSTOM_AUTOGRAD`) which allows custom autograd functions into AOTAutogradCache.
@hirsheybar and I worked together to verify that the higher order op AutogradFunctionApply is pure with respect to the dynamo input being passed in, so this *should* be safe. I'm still putting it behind a flag and turning it on slowly, first on an internal model, though. Once we verify that it is correct on the internal model we can work to enable the flag by default.
Differential Revision: [D71633184](https://our.internmc.facebook.com/intern/diff/D71633184/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149751
Approved by: https://github.com/bdhirsh, https://github.com/zou3519
This PR intends to fix the cache related issues from https://github.com/pytorch/pytorch/issues/147405.
It does *not* handle the dynamo recompile case in process, because it does not introduce any extra guards. For FXGraphCache and AOTAutogradCache, we simply have to have the device context in the cache key.
Note that for any function that accepts tensor inputs, the device context is naturally already included in the cache key by the metadata of example inputs. However, for functions that return constants or have no arguments, the device context still needs to be in the cache key.
A more robust fix for this would be to have inductor generate device guards that are dynamic, instead of specialized. This would also help us share more cache artifacts.
I've added unit tests for FXGraphCache and AOTAutogradCache, both of which would fail without this change.
Differential Revision: [D69875939](https://our.internmc.facebook.com/intern/diff/D69875939)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147464
Approved by: https://github.com/bdhirsh, https://github.com/anijain2305
Fix: #141974
This PR makes `ViewMeta` sequence, present in functional tensors,
serializable with pickle. In order to accomplish that, it makes
`ViewMeta` an abstract class with overridable `forward` and `reverse`
functions. In this context, each operation that once instanciated
`ViewMeta`, should now create a new specialized class that inherits from
`ViewMeta. Therefore, this PR also uses codegen for creating these
specializations.
In summary, these are the changes this PR introduces:
- `ViewMeta` is turned into an abstract class (see
_FunctionalStorageImpl.cpp_). `forward` and `reverse` are pure virtual
functions that need to be implemented. `to_out_index` should be
implemented by operations that might return more than 1 output.
- New `ViewMeta` specializations for `resize_` and `_unsafe_view` are
created (see _FunctionalizeFallbackKernel.h_).
- New templates _ViewMetaClasses.{cpp,h}_ are created. They hold the
declaration and definition of the `ViewMeta` specializations, which
are automatically generated in the ATen codegen (see _gen.py_).
- New `_functionalization` Python sub-module is created (see
_Module.cpp_). It serves as namespace for the `ViewMeta`
specializations and `InverseReturnMode` enum.
- New template _ViewMetaClassesPythonBinding.cpp_ is created. It holds
the automatically generated Python bindings for the `ViewMeta`
specialization, which are generated in the torch codegen (see
_generate_code.py_).
Note that this PR makes use of codegen at 2 different moments:
- ATen codegen (_gen.py_): generates the `ViewMeta` specialized classes.
- Torch codegen (_generate_code.py_): generated the Python bindings for
them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143712
Approved by: https://github.com/bdhirsh
FXGraphCache supports freezing, but AOTAutogradCache does not. This is due to the fact that when freezing is turned on, instead of using the constants from the graph module that was saved on cache miss, we have to take the constants from the AOTAutograd generated graph module. This PR does two things:
- It bypasses AOTAutogradCache when freezing is turned on. We should have always been doing this.
- It refactors the code to be way more clear about the constants we're using and when we're using them.
Basically, there are two possible sets of constants we can grab from the compiled fx graph.
1. If freezing is turned off, we save the constants directly in CompiledFxGraph.
2. If freezing is turned on, we save the *names* of the constants in CompiledFxGraph, and use the runtime GraphModule's actual constant values: we reconstruct them from the saved names + the new graph module from AOTDispatch.
We implement two different classes for doing just this: one that has access to the post aotdispatch gm, which supports freezing, and one that doesn't have it, which does not support freezing. Then we construct the wrappers and unwrap the result as needed.
This makes it clear that the gm passed to AOTAutogradCache is *not* part of post compile, only the cache key generated from it is.
The whole flow is pretty confusing, but hopefully this gives us better types and static information for understanding what the different codepaths are doing.
Will add a specific AOTAutogradCache to confirm we bypass freezing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141897
Approved by: https://github.com/ezyang, https://github.com/masnesral
This turns on AOTAutogradCache for all inductor tests. It clears AOTAutogradCache on each test as well, by virtue of the local cache using the same directory to store cache entries.
I've also tested with INDUCTOR_TEST_DISABLE_FRESH_CACHE=1, running all the tests. AOTAutogradCache successfully caches 99% of these. There are a few tests that use view_replay and therefore save functional tensors, which cause AOTAutogradCache to fail to pickle its result. Will look into next steps there, but for now, it seems okay if the cache just misses on those cases where it can't serialize the result. It would be better to check before pickling, though.
I've made the following small bugfixes to get this working:
- Inductor is sometimes used in a standalone mode without dynamo, which leads to attribute errors in check_can_cache. In general, we should *never* crash in cache checking, only bypass. So I change a try catch to check Exception instead of just a specific exception.
- Add extra structured logging for metadata on cache hits
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140890
Approved by: https://github.com/bdhirsh
Summary: https://github.com/pytorch/pytorch/pull/136505 changed the cache_clear operation to remove loaded modules from disk. That change caused some problems with TORCHINDUCTOR_FORCE_DISABLE_CACHES=1, where there are some code paths (coordinate descent tuning at least), where we call `PyCodeCache.load_by_key_path` and expect that the files are still on disk. (But when caches are disabled, we call cache_clear before every inductor compile). It seems we probably have a shortcoming in the disable-cache logic, but since we also have flakey test failures with the same `'could not get source code'` error, let's restore the previous functionality until I can investigate further.
Since some tests actually _DO_ want to delete on-disk artifacts (e.g., to test remote caching), then I added a `purge` param to optionally delete files
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140216
Approved by: https://github.com/eellison
Summary: The main changes to support freezing are:
1) When pickling constant tensors as part of the cache key calculation: If freezing has not been applied, then keep the existing behavior (pickle the metadata and values). If freezing has been applied, then pickle the values if the constant will be inlined; otherwise, consider only the metadata.
2) If freezing has been applied, modify what we store in the cache: Instead of storing the constant attributes in the cache entry, store the _names_ of the constants, and then grab those constants from the GraphModule when we need attache the attributes to a newly-loaded Python module. Since the cache lookup path loads the Python module, this bullet means we need to thread through a GraphModule argument in several places.
3) Since this feature means that we may need to reload the same Python module path more than once (but attach different constant attributes), I changed PyCodeCache.load_by_key_path to not store an in-memory map of path to module (since there may be more than one). I don't _think_ this will have any affect on performance, however.. It's unclear why we were using an in-memory cache here anyway, since this function should only be called once for each module needed to be loaded.
4) Several tests were removing on-disk PyCodeCache artifacts by iterating over the modules. I made this more straightforward by implementing a cache_clear method that removes the on-disk artifacts. Arguably, this should have been the implementation all along.
Differential Revision: [D63542170](https://our.internmc.facebook.com/intern/diff/D63542170)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136505
Approved by: https://github.com/eellison
This threads through all of the necessary parts into aot autograd from the FXGraphCache changes so that we can run cudagraphs properly on a AOTAutograd cache hit.
Specifics:
- AOTAutograd needs access to the `cudagraphs` boxedbool in order to properly set the backward to not use cudagraphs on a cache hit from the forward.
- We have lots of tests that test this already from the previous PR, so I just added an extra test and made the previous test work with both AOTAutogradCache and FXGraphCache at the same time.
```
TORCH_LOGS=torch._functorch._aot_autograd.autograd_cache,cudagraphs ENABLE_AOT_AUTOGRAD_CACHE=1 TORCHINDUCTOR_FX_GRAPH_CACHE=1 tlp python benchmarks/gpt_fast/benchmark.py --output ~/gpt_fast_benchmark.csv
```
Twice, once on cache miss and once and cache hit.
Here is the perfetto trace for each(FB only link):
**Cache Miss:**
Logs:
```
Loading model Llama-2-7b-chat-hf
Time to load model: 0.66 seconds
I0813 10:53:34.416000 911030 torch/_functorch/_aot_autograd/autograd_cache.py:479] [0/0] AOTAutograd cache miss for key alqchc7zw6ynsxj2bzktcsngu4cajwcb3tmhvwlyqkuinx3zhmey
I0813 10:53:51.395000 911030 torch/_functorch/_aot_autograd/autograd_cache.py:558] [0/0] Writing AOTAutograd cache entry to /tmp/torchinductor_jjwu/aotautograd/alqchc7zw6ynsxj2bzktcsngu4cajwcb3tmhvwlyqkuinx3zhmey/entry
I0813 10:54:17.579000 911030 torch/_functorch/_aot_autograd/autograd_cache.py:479] [1/0] AOTAutograd cache miss for key a3nq2ywjxku342c6ag7rsqkalnxfshlcgve3tb2bigg7a45uz6pt
I0813 10:54:38.636000 911030 torch/_functorch/_aot_autograd/autograd_cache.py:558] [1/0] Writing AOTAutograd cache entry to /tmp/torchinductor_jjwu/aotautograd/a3nq2ywjxku342c6ag7rsqkalnxfshlcgve3tb2bigg7a45uz6pt/entry
I0813 10:54:39.228000 911030 torch/_inductor/cudagraph_trees.py:385] [__cudagraphs] recording cudagraph tree for graph without symints
V0813 10:54:39.939000 911030 torch/_inductor/cudagraph_trees.py:2160] [__cudagraphs] Running warmup of function 0
V0813 10:55:10.615000 911030 torch/_inductor/cudagraph_trees.py:2119] [__cudagraphs] Recording function 0 of graph recording id 0
Compilation time: 101.24 seconds
Average tokens/sec: 147.96 tokens/sec
Average bandwidth achieved: 1955.22 GB/s
Memory used: 14.51 GB
```
Chromium Event(fb only):
https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html?url=https%3A%2F%2Finterncache-all.fbcdn.net%2Fmanifold%2Ftlparse_reports%2Ftree%2Flogs%2Fjjwu%2Fcustom%2Fchromium_events.json#!/viewer?url=https%3A%2F%2Finterncache-all.fbcdn.net%2Fmanifold%2Ftlparse_reports%2Ftree%2Flogs%2Fjjwu%2Fcustom%2Fchromium_events.json&local_cache_key

**Cache Hit:**
Logs:
```
Loading model Llama-2-7b-chat-hf
Time to load model: 0.67 seconds
I0813 10:55:51.821000 944420 torch/_functorch/_aot_autograd/autograd_cache.py:474] [0/0] AOTAutograd cache hit for key alqchc7zw6ynsxj2bzktcsngu4cajwcb3tmhvwlyqkuinx3zhmey
I0813 10:55:55.465000 944420 torch/_functorch/_aot_autograd/autograd_cache.py:474] [1/0] AOTAutograd cache hit for key a3nq2ywjxku342c6ag7rsqkalnxfshlcgve3tb2bigg7a45uz6pt
I0813 10:55:56.030000 944420 torch/_inductor/cudagraph_trees.py:385] [__cudagraphs] recording cudagraph tree for graph without symints
V0813 10:55:56.192000 944420 torch/_inductor/cudagraph_trees.py:2160] [__cudagraphs] Running warmup of function 0
V0813 10:55:56.426000 944420 torch/_inductor/cudagraph_trees.py:2119] [__cudagraphs] Recording function 0 of graph recording id 0
Compilation time: 9.40 seconds
Average tokens/sec: 147.94 tokens/sec
Average bandwidth achieved: 1954.98 GB/s
Memory used: 14.51 GB
```
Chromium Event(fb only):
https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html?url=https%3A%2F%2Finterncache-all.fbcdn.net%2Fmanifold%2Ftlparse_reports%2Ftree%2Flogs%2Fjjwu%2Fcustom2%2Fchromium_events.json#!/viewer?url=https%3A%2F%2Finterncache-all.fbcdn.net%2Fmanifold%2Ftlparse_reports%2Ftree%2Flogs%2Fjjwu%2Fcustom2%2Fchromium_events.json&local_cache_key

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132294
Approved by: https://github.com/eellison
This PR refactors process_inputs so that it occurs earlier outside of create_aot_dispatcher_function for the purpose of calculating a cache key with the inputs after they have been processed.
This way, if tensors have symint sizes/strides, we successfully factor that into the cache key instead of specializing on every possible size and stride. Test that utilizes this incoming.
# Guard behavior
Note that it's technically possible for tensors with symint arguments to introduce guards in aot_dispatch, if they trace through decompositions that branch on tensor size/stride. This can result in multiple graph modules with differing guards having the same key in the cache.
FXGraphCache has this same issue, and the remote FXGraphCache intentionally does not handle this: instead it only saves the first result in the cache, and cache misses if guards miss. The local FXGraphCache does handle this by storing multiple files and iterating through them, but we opt not to introduce that complexity just yet for AOTAutogradCache until we deem it necessary (i.e., models appear where saving multiple cache results with different guards but the same cache key becomes important). Instead, AOTAutogradCache will save a single entry per result, overriding it if it cache misses due to guards.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130962
Approved by: https://github.com/bdhirsh
Summary: This makes it so that stress tests on separate processes on the same machine don't clobber the directories of each other. InductorTestCase will automatically make a fresh tmpdir for each unit test.
Test Plan:
```
buck2 test -j 18 'fbcode//mode/opt' fbcode//caffe2/test/dynamo:test_dynamo -- --exact 'caffe2/test/dynamo:test_dynamo - test_aot_autograd_cache.py::AOTAutogradCacheTests::test_nn_module_with_params_global_constant' --run-disabled --stress-runs 10 --record-results
```
Now passes
Differential Revision: D60604811
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132432
Approved by: https://github.com/masnesral
Summary:
AOTAutogradCache currently only checks the local directory instead of both local and remote when saving/loading from the cache, so if remote cache is turned on, it will cache miss.
Disable remote caching for now on these tests: when I work on remote caching compatibility, I'll re-enable them here.
Test Plan:
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/dynamo:test_dynamo -- --exact 'caffe2/test/dynamo:test_dynamo - test_aot_autograd_cache.py::AOTAutogradCacheTests::test_nn_module_with_params_global_constant' --run-disabled
passes
Differential Revision: D60588615
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132409
Approved by: https://github.com/masnesral