pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-10-20 21:14:14 +08:00

Author	SHA1	Message	Date
Brian Hirsh	7d710403b0	Reapply "Make functionalization `ViewMeta` serializable with pickle. (#143712 )" (#163769 ) ### Summary: NOTE: This is a re-export of https://github.com/pytorch/pytorch/pull/161994 ; the changes between these two PRs is exclusively to the buck/build files (Summary from #161994 ) Attempted rebase of https://github.com/pytorch/pytorch/pull/143712. This reverts commit 6c713ccb5e0df227dd5b630057cbccd373cbe7d6. cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang amjames Lucaskabela imported-using-ghimport Test Plan: Imported from OSS Differential Revision: D81524507 Pulled By: Lucaskabela Pull Request resolved: https://github.com/pytorch/pytorch/pull/163769 Approved by: https://github.com/dolpm Co-authored-by: Brian Hirsh <hirsheybar@fb.com>	2025-09-25 10:27:37 +00:00
Tugsbayasgalan (Tugsuu) Manlaibaatar	dbef606631	Add support for tracing vmap in pre-dispatch export (#154650 ) Summary: ONNX team and recent transformer upgrade ran into this error and we also ran into during our export benchmarking. This diff makes it possible to trace through vmap implementation in pre-dispatch IR. Note that we don't support serializing functorch ops in pre-dispatch IR and in the future, we should desugar them to post-grad ops. The implementation strategy is: 1. We add python wrappers around vmap APIs so that we attach custom torch function handler that is only on during non-strict export. The reason is we don't want to add this to default torch_function handler because it will break BC. 2. Some dynamo changes to make sure it picks up new python wrapper APIs. The reason is when we do strict export, we need to re-materialize these APIs in pre-dispatch IR from torch IR. We can avoid this by special casing in dynamo for export to proxy different API calls but i feel that is too much chaos because you need to be able to proxy 2 different variants of same vmap API. Test Plan: CI Differential Revision: D75623875 Pull Request resolved: https://github.com/pytorch/pytorch/pull/154650 Approved by: https://github.com/ezyang, https://github.com/zou3519	2025-08-20 19:31:07 +00:00
James Wu	9708fcf92d	Account for triton kernel source code hidden in custom ops properly in AOTAutogradCache (#160120 ) This PR fixes a bug where user defined triton kernels hidden behind `triton_op` do not register source code changes. If a user only changes a triton kernel source_code, because triton kernels are hidden under the custom op, dynamo hasn't traced into them yet. This means at AOTAutograd time, we don't know the list of triton kernels that are defined by custom ops. This is an initial fix for the issue by parsing the AST of the custom op looking for triton kernels. This won't catch more degenerate cases if the custom op calls other custom ops/functions that then call triton kernels, and then the toplevel compiled graph doesn't know about it. To handle that, we'd have to trace through the custom op at dynamo time. This should handle 99% of cases, though. I added an expectedFailure test to show the limitation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160120 Approved by: https://github.com/zou3519	2025-08-12 14:11:06 +00:00
ghostspiders	af10f1f86c	Fix requires_cuda to requires_cuda_and_triton (#160222 ) Fixes ##159399 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160222 Approved by: https://github.com/janeyx99	2025-08-10 07:05:52 +00:00
xinan.lin	8047421fbb	[Linter] Expanding the scope of detecting device-bias code. (#159949 ) Currently, the device-bias linter only targets functions decorated with @requires_gpu. This PR adds support for two new detection scenarios: 1. Detect device-bias code in functions decorated with @requires_triton. 2. Detect device-bias code for entire test suites that are defined as shared across GPUs. For example: ``` if __name__ == "__main__": if HAS_GPU: run_tests() ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/159949 Approved by: https://github.com/EikanWang, https://github.com/jansel	2025-08-09 09:41:16 +00:00
Xuehai Pan	02715d0876	[BE][5/6] fix typos in test/ (test/dynamo/) (#157639 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/157639 Approved by: https://github.com/yewentao256, https://github.com/jansel ghstack dependencies: #157638	2025-07-06 06:34:25 +00:00
James Wu	e7a66166ce	[precompile] When using BundledAOTAutogradCache, disable FXGraphCache (#156611 ) The goal of this PR is to fix a specific bug when turning precompile on/off between caching runs. If you try to turn on BundledAOTAutogradCacheEntry today in between local runs, the FXGraphCache may randomly hit between the two runs, because FXGraphCache knows nothing about AOTAutogradCache's config. When FXGraphCache hits, it immediately will call make_launchers() immediately on the triton code it launches, which then causes an assertion failure because pickle should not be called after make_launchers. One way to resolve the bug is just to add whether precompile is enabled to teh FxGraph cache key. But the better fix for this, however, is higher level/philosophical: When using BundledAOTAutogradCacheEntry, the entire CompiledFxGraph is saved directly to the cache entry, and we expect the two caches to work in sync, i.e. as one cache. So to simplify the programming model, we disable FxGraphCache when BundledAOTAUtogradCache is turned on. BundledAOTAutogradCacheEntry is only used for precompile use cases now; if we wanted to use BundledAOTAutogradCache for traditional caching use cases, there's a bunch of further work, one of which would be to re-enable FxGraphCache in the event that BundledAOTAutogradCache has to bypass. However, for precompile, this is not a scenario that should happen: we should always expect the entire callable to be saveable, and we should expect to never bypass. So we don't do that change for now. Added a unit test demonstrating this behavior. Also updated existing unit tests to show that all fx graph cache operations are now 0 (but all tests still pass). Pull Request resolved: https://github.com/pytorch/pytorch/pull/156611 Approved by: https://github.com/zhxchen17	2025-06-25 21:01:42 +00:00
James Wu	10fb98a004	[Precompile] Hook up backend="inductor" (#155387 ) This PR adds the necessary things to register and record backend ids from BundledAOTAutogradCacheEntry. One TODO to point out; in this diff, if there are multiple backends that would have the same AOTAutogradCache key (traditional cache key, not backend_id), we just end up serializing the same BundledAOTAutogradCache entry multiple times. This is not ideal obviously, so we'll want to deduplicate these and just track the different keys that one BundledAOTAutogradCacheEntry is associated with instead. This shouldn't be super hard to do, though, as we just need to run a deduplication step on call to `serialize()`, I think. Pull Request resolved: https://github.com/pytorch/pytorch/pull/155387 Approved by: https://github.com/oulgen	2025-06-22 15:05:08 +00:00
Animesh Jain	5ab257c74c	[invoke_subgraph] Make invoke_subgraph cacheable (#156448 ) Its unclear to me what happens if the subgraph itself is not cacheable. Imo, there is nothing special about invoke_subgraph to prevent any caching. Pull Request resolved: https://github.com/pytorch/pytorch/pull/156448 Approved by: https://github.com/oulgen, https://github.com/zou3519	2025-06-20 21:20:23 +00:00
PyTorch MergeBot	edd45f3a02	Revert "[Precompile] Hook up backend="inductor" (#155387 )" This reverts commit 2c68c3e8d5e9a235f5861be6486de4959f80c840. Reverted https://github.com/pytorch/pytorch/pull/155387 on behalf of https://github.com/atalman due to dynamo/test_precompile_context.py::PrecompileContextTests::test_basic [GH job link](https://github.com/pytorch/pytorch/actions/runs/15772892021/job/44464141039) [HUD commit link](`2c68c3e8d5`) ([comment](https://github.com/pytorch/pytorch/pull/155387#issuecomment-2992044073))	2025-06-20 15:30:04 +00:00
James Wu	2c68c3e8d5	[Precompile] Hook up backend="inductor" (#155387 ) This PR adds the necessary things to register and record backend ids from BundledAOTAutogradCacheEntry. One TODO to point out; in this diff, if there are multiple backends that would have the same AOTAutogradCache key (traditional cache key, not backend_id), we just end up serializing the same BundledAOTAutogradCache entry multiple times. This is not ideal obviously, so we'll want to deduplicate these and just track the different keys that one BundledAOTAutogradCacheEntry is associated with instead. This shouldn't be super hard to do, though, as we just need to run a deduplication step on call to `serialize()`, I think. Pull Request resolved: https://github.com/pytorch/pytorch/pull/155387 Approved by: https://github.com/oulgen	2025-06-20 06:38:29 +00:00
Simon Fan	3bec588bf5	[aot][ca] save bw_module in AOTAutogradCache (#151860 ) Compiled Autograd retraces AOT's bw_module at backward runtime into a larger graph, and today this runs into an issue on warm cache runs because the bw_module is not restored. This PR adds it to the cache, by first stripping it bare from unserializable metadata. I also intentionally differentiate the cached and non-cached versions to avoid accidental attempts of AOT compilation with a restored bw_module (would probably crash). The bw_module's generated code is then serialized, and at compiled autograd runtime, it is restored via symbolic_trace. This also means that presence of tensor constructors will be lifted as constants. Something we will address separately. Note that since the cache entry may be used by runs that use compiled autograd and runs that do not, we need to cache both the lowered backward and the bw_module. Pull Request resolved: https://github.com/pytorch/pytorch/pull/151860 Approved by: https://github.com/jamesjwu ghstack dependencies: #156120	2025-06-19 03:47:41 +00:00
Oguz Ulgen	a2a75be0f8	Rename inductor cache (#156128 ) Requested by Simon on a different PR Pull Request resolved: https://github.com/pytorch/pytorch/pull/156128 Approved by: https://github.com/xmfan	2025-06-17 03:57:18 +00:00
James Wu	be2ad70cfa	Fix dynamo tracing into AOTAutogradCache results in cpu tensors (#155251 ) On this line, we see that the bw_compiler that dynamo uses for AotAutograd automatically disables the backward runnable: `05dd638ee9/torch/_dynamo/backends/common.py (L76)` This disables dynamo in the bw_compiler but also disables the runnable the compiler returns. On a AOTAutogradCache hit, however, we never call the bw_compiler! So we don't disable dynamo properly. This only has an effect on certain cases of cpu tensors' backwards, where the backward is being done in python land, and dynamo unnecessarily tries to trace through the inductor generated code. It also only matters if the backward is being accessed outside of dynamo itself (say, in a graph break in eager mode), since dynamo properly disables the forward function already. ``` I0605 09:58:40.135000 3981970 torch/_dynamo/eval_frame.py:517] TorchDynamo attempted to trace the following frames: [ I0605 09:58:40.135000 3981970 torch/_dynamo/eval_frame.py:517] * fn /home/jjwu/test.py:9 I0605 09:58:40.135000 3981970 torch/_dynamo/eval_frame.py:517] * cast /data/users/jjwu/a/pytorch-env/lib/python3.10/typing.py:1737 I0605 09:58:40.135000 3981970 torch/_dynamo/eval_frame.py:517] * call /tmp/torchinductor_jjwu/rq/crq327nhoyjzog5n3qlchauucdrunrtutwmmoh7ipoe2ngnson5s.py:35 I0605 09:58:40.135000 3981970 torch/_dynamo/eval_frame.py:517] * fn /home/jjwu/test.py:9 I0605 09:58:40.135000 3981970 torch/_dynamo/eval_frame.py:517] * cast /data/users/jjwu/a/pytorch-env/lib/python3.10/typing.py:1737 I0605 09:58:40.135000 3981970 torch/_dynamo/eval_frame.py:517] * call /tmp/torchinductor_jjwu/rq/crq327nhoyjzog5n3qlchauucdrunrtutwmmoh7ipoe2ngnson5s.py:35 I0605 09:58:40.135000 3981970 torch/_dynamo/eval_frame.py:517] ] ``` This PR fixes the issue and adds a unit test showing that with or without cache hit, the frames dynamo is tracing is identical. Fixes https://github.com/pytorch/pytorch/issues/154536 Pull Request resolved: https://github.com/pytorch/pytorch/pull/155251 Approved by: https://github.com/bdhirsh, https://github.com/anijain2305	2025-06-09 02:06:16 +00:00
bobrenjc93	e7bf72c908	[multigraph] fix composabilty with aotautograd cache (#153526 ) AOTAutogradCache uses FXGraphCache which uses the tracing context to get the ShapeEnv. Although the TracingContext global_context is cleared by the time we get around to reusing it, we don't actually need it. We just need the ShapeEnv in the TracingContext, which isn't cleared at the end of dynamo and does persist. This PR adds the tracing context manager around the specialized compile to ensure our caching infrastructure can get access to the ShapeEnv. A test was also added to prove correctness. Pull Request resolved: https://github.com/pytorch/pytorch/pull/153526 Approved by: https://github.com/jamesjwu, https://github.com/zou3519 ghstack dependencies: #153433, #153449	2025-05-30 16:56:17 +00:00
IvanKobzarev	4439255148	[aotd] Support saved tensors hooks in aot_autograd (#150032 ) https://github.com/pytorch/pytorch/issues/148222 Goal: At the moment autograd saved tensors hooks are run in eager after compiled forward. They are executed at the same time for all saved tensors. Hooks can be used to reduce amout of memory used for saved tensors, doing quantization or offloading to cpu. This is suboptimal for optimization of peak memory. Better solution will be to put the hooks in the graph, as close as possible to the last usage of the tensor. To get user specified autograd saved tensors hooks in the graph. Logic: UX: If user specifies with torch.autograd.graph.saved_tensors_hooks(pack_gm, unpack_gm). Where pack_gm and unpack_gm are torch.fx.GraphModule. Then AotAutograd will retrace those graph modules, doing decompositions and functionalization in aot_autograd, inlining the result graphs in forward epilogue and backward prologue. User may want to use control logic in the hooks, for example applying quantization only for specific dtypes and sizes. This is also possible, user can put it into torch.fx.wrap function and use symbolic trace to make a GraphModule. In that case AotAutograd cahing will work only in case when user explicitly set to the torch.fx.wrap call_function node "user_cache_hash" metadata. If this metadata set - then aot_autograd cache can use saved cache artifact. If metadata is not set - then cache is bypassed. Dynamo: Dynamo traces pack and unpack hooks and installs them as subgraph and explicitly adds to the output_graph. (As those subgraphs are not used and will not be copied in the result by default). The complexity here is that at this moment we do not have example of inputs for the hooks. We trace pack_hook with some Tensor from the inputs. The result subgraphs are added to the hashing of AotAutograd Cache. In AotAutograd we retrace the graph with the true saved tensors coming from partitioner. Backwards Compatibility: As current hooks are executed in eager mode and not all of them will be traceable - we only try to put in the graph hooks, explicitly marked by user with annotation (@_inlineable_saved_tensors_hooks). For other hooks or if compiled autograd is enabled - keep the same logic. Recompilations: Hooks are guarded with lambda guard matching function id to cause recompilation if user reruns compiled function. Aot_autograd: After partitioner prepared forward and backward module - we trace prepared at Dynamo graphs for pack and unpack hooks and inline them in epilogue of forward and prologue of backward. Forward outputs and backward inputs are changed, transparently for user. We do not try to put it close the last usage etc., relying on inductor to do this optimization. ``` INFO: TRACED GRAPH ===== Forward graph pre saved_tensors_hooks inlining 3 ===== /data/users/ivankobzarev/a/pytorch/torch/fx/_lazy_graph_module.py class GraphModule(torch.nn.Module): def forward(self, primals_1: "Sym(s0)", primals_2: "Sym(s1)", primals_3: "f32[s0, s1][s1, 1]cuda:0"): # File: /data/users/ivankobzarev/a/pytorch/test/functorch/test_aotdispatch.py:6660 in simple_fn, code: x = x + 1 add: "f32[s0, s1][s1, 1]cuda:0" = torch.ops.aten.add.Tensor(primals_3, 1); primals_3 = None # File: /data/users/ivankobzarev/a/pytorch/test/functorch/test_aotdispatch.py:6661 in simple_fn, code: x = SAF.apply(x) view: "f32[s0, s1][s1, 1]cuda:0" = torch.ops.aten.view.default(add, [primals_1, primals_2]) return (view, add, primals_1, primals_2) INFO: TRACED GRAPH ===== Backward graph pre saved_tensors_hooks inlining 3 ===== /data/users/ivankobzarev/a/pytorch/torch/fx/_lazy_graph_module.py class GraphModule(torch.nn.Module): def forward(self, primals_1: "Sym(s0)", primals_2: "Sym(s1)", primals_3: "f32[s0, s1][s1, 1]cuda:0"): # File: /data/users/ivankobzarev/a/pytorch/test/functorch/test_aotdispatch.py:6660 in simple_fn, code: x = x + 1 add: "f32[s0, s1][s1, 1]cuda:0" = torch.ops.aten.add.Tensor(primals_3, 1); primals_3 = None # File: /data/users/ivankobzarev/a/pytorch/test/functorch/test_aotdispatch.py:6661 in simple_fn, code: x = SAF.apply(x) view: "f32[s0, s1][s1, 1]cuda:0" = torch.ops.aten.view.default(add, [primals_1, primals_2]) return (view, add, primals_1, primals_2) INFO: TRACED GRAPH ===== saved_tensors_pack_hook add 3 ===== /data/users/ivankobzarev/a/pytorch/torch/fx/_lazy_graph_module.py class pack_float8(torch.nn.Module): def forward(self, x_1: "f32[s0, s1][s1, 1]cuda:0"): # No stacktrace found for following nodes _to_copy: "f8e4m3fn[s0, s1][s1, 1]cuda:0" = torch.ops.aten._to_copy.default(x_1, dtype = torch.float8_e4m3fn); x_1 = None return (torch.float32, _to_copy) INFO: TRACED GRAPH ===== saved_tensors_unpack_hook add 3 ===== <eval_with_key>.22 from /data/users/ivankobzarev/a/pytorch/torch/fx/experimental/proxy_tensor.py:1225 in wrapped class pack_float8(torch.nn.Module): def forward(self, x_1: "f32[s0, s1][s1, 1]cuda:0"): # No stacktrace found for following nodes _to_copy: "f8e4m3fn[s0, s1][s1, 1]cuda:0" = torch.ops.aten._to_copy.default(x_1, dtype = torch.float8_e4m3fn); x_1 = None return (torch.float32, _to_copy) INFO: TRACED GRAPH ===== Forward graph 3 ===== /data/users/ivankobzarev/a/pytorch/torch/fx/_lazy_graph_module.py class GraphModule(torch.nn.Module): def forward(self, primals_1: "Sym(s0)", primals_2: "Sym(s1)", primals_3: "f32[s0, s1][s1, 1]cuda:0"): # File: /data/users/ivankobzarev/a/pytorch/test/functorch/test_aotdispatch.py:6660 in simple_fn, code: x = x + 1 add: "f32[s0, s1][s1, 1]cuda:0" = torch.ops.aten.add.Tensor(primals_3, 1); primals_3 = None # No stacktrace found for following nodes _to_copy: "f8e4m3fn[s0, s1][s1, 1]cuda:0" = torch.ops.aten._to_copy.default(add, dtype = torch.float8_e4m3fn) # File: /data/users/ivankobzarev/a/pytorch/test/functorch/test_aotdispatch.py:6661 in simple_fn, code: x = SAF.apply(x) view: "f32[s0, s1][s1, 1]cuda:0" = torch.ops.aten.view.default(add, [primals_1, primals_2]); add = None return (view, _to_copy, primals_1, primals_2) INFO: TRACED GRAPH ===== Backward graph 3 ===== <eval_with_key>.21 class GraphModule(torch.nn.Module): def forward(self, primals_1: "Sym(s0)", primals_2: "Sym(s1)", add_packed_2: "f8e4m3fn[s0, s1][s1, 1]cuda:0", tangents_1: "f32[s0, s1][s1, 1]cuda:0"): # No stacktrace found for following nodes _to_copy: "f32[s0, s1][s1, 1]cuda:0" = torch.ops.aten._to_copy.default(add_packed_2, dtype = torch.float32); add_packed_2 = None # File: /data/users/ivankobzarev/a/pytorch/test/functorch/test_aotdispatch.py:6661 in simple_fn, code: x = SAF.apply(x) add_7: "f32[s0, s1][s1, 1]cuda:0" = torch.ops.aten.add.Tensor(tangents_1, _to_copy); tangents_1 = _to_copy = None return (None, None, add_7) ``` Differential Revision: [D72187044](https://our.internmc.facebook.com/intern/diff/D72187044) Pull Request resolved: https://github.com/pytorch/pytorch/pull/150032 Approved by: https://github.com/bdhirsh	2025-05-22 14:09:38 +00:00
James Wu	c31e239910	[precompile] Add BundledAOTAutogradCacheEntry (#152840 ) Finally, this PR adds BundledAOTAutogradCacheEntry. A BundledAOTAutogradCacheEntry is an AOTAutogradCacheEntry that saves the entire CompiledFxGraph directly in the entry. This has some advantages: - No more dependency on FxGraphCache at all - Clearing FxGraphCache does not result in AOTAutogradCache miss - Simpler logic, as BundledAOTAutogradCacheEntry has everything you need to load a full compiled python wrapper from a dynamo output We plan to use BundledAOTAutogradCacheEntry for precompile. There's also a question of whether we want to use it for regular caching — the main disadvantage of this is having to save the same CompiledFxGraph twice, once in Inductor cache and once for AOTAutogradCache. With MegaCaching, this could be a regression in total cache size (as well as a minor cold start regression, as you have to save the same graph twice). I will import this and measure the mega cache space complexity, and if it looks good I'll enable it by default for caching as well. On warm start, if AOTAutogradCache hits, you won't have to load inductor at all, so warm start overhead should be unaffected. Differential Revision: [D74593304](https://our.internmc.facebook.com/intern/diff/D74593304) Pull Request resolved: https://github.com/pytorch/pytorch/pull/152840 Approved by: https://github.com/zhxchen17	2025-05-21 18:08:42 +00:00
James Wu	74d0300804	Change unsafe_marked_cacheable_functions to a dictionary, so that you can specify a static cache key (#152486 ) Fixes https://github.com/pytorch/pytorch/issues/152434 Pull Request resolved: https://github.com/pytorch/pytorch/pull/152486 Approved by: https://github.com/oulgen	2025-05-19 02:16:33 +00:00
PyTorch MergeBot	a28dcdba2c	Revert "[aot][ca] save bw_module in AOTAutogradCache (#151860 )" This reverts commit 613bd462721f3246888030de0a3f6932d52f515a. Reverted https://github.com/pytorch/pytorch/pull/151860 on behalf of https://github.com/huydhn due to Chatting with @xmfan and decide to revert and reland this instead ([comment](https://github.com/pytorch/pytorch/pull/151860#issuecomment-2856709646))	2025-05-07 00:56:54 +00:00
Simon Fan	613bd46272	[aot][ca] save bw_module in AOTAutogradCache (#151860 ) Compiled Autograd retraces AOT's bw_module at backward runtime into a larger graph, and today this runs into an issue on warm cache runs because the bw_module is not restored. This PR adds it to the cache, by first stripping it bare from unserializable metadata. I also intentionally differentiate the cached and non-cached versions to avoid accidental attempts of AOT compilation with a restored bw_module (would probably crash). Note that since the cache entry may be used by runs that use compiled autograd and runs that do not, we need to cache both the lowered backward and the bw_module. Pull Request resolved: https://github.com/pytorch/pytorch/pull/151860 Approved by: https://github.com/jamesjwu ghstack dependencies: #149707	2025-05-01 21:59:43 +00:00
Simon Fan	c461ba6522	[aot] mark dynamic activations as maybe dynamic (#149707 ) Today, we mark graph outputs as maybe dynamic, this lets a compilation to communicate to future compilations whether certain graph inputs are dynamic. Similarly, we can do this to saved activations, which may be used in future compilations as well. This is especially prevalent in compiled autograd, where tensor activations will always become graph inputs. Changes to the tests were mainly cosmetic, with the exception of tests that relied on duck shaping. By annotating tensor dims, we prevent them from reusing pre-existing symbols, so this change will make graphs use duck shapes less than before, which affects some of the caching tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149707 Approved by: https://github.com/bdhirsh	2025-05-01 21:59:36 +00:00
James Wu	a4fdae5c84	Lift guard checking logic to AOTAutogradCache (#151563 ) This somewhat complicated PR does a few things: - It separates out a lot of the guard checking logic into its own class, GuardedCache[T] - It adds a new `check_guard_hit` lambda to FXGraphCache._lookup_graph, which allows callers to define their own guard checking logic - It then uses these two combined parts to lift guard checking to AOTAutogradCache. This means that AOTAutogradCache stores its own guard expressions and evaluates them. - FXGraphCache's guard checking logic is completely unchanged, just refactored. As part of the work, I'm able to extend a bit of the logging functionality of AOTAutogradCache into FXGraphCache, so that you can know if FXGraphCache missed due to a guard failure or a full cache miss. # Why do this? Lifting guards to AOTAutogradCache has a few benefits: - First, it fixes a long standing bug in guard checking logic. Backward passes can have different symint inputs than forward passes depending on forward output, if AOTAutograd chooses to store symints for the backward. These symint inputs have the same underlying symbols as the forward, but on AOTAutogradCache hit, we don't have access to the hints backing these exact symints (we only have hints for the symints on the forward function). By lifting guard checking logic to AOTAutogradCache, we no longer need to check the backward guards, as they'll be included in the AOTAutogradCache guard expression. I've added a unit test that failed before my diff, and now passes, as an example of this - Secondly, this is the first step necessary to bundle CompiledFxGraph into AOTAutogradCache. Doing so will simplify our cache logic significantly, and also make precompile logic simpler, as precompiles will only need to store AOTAutogradCacheEntrys, without needing to match them up with inductor FXGraphCache entries. - Finally, adding guard checking logic to AOTAutogradCache my allow us in the future to handle more complicated cases like a single forward with multiple backwards, as guard checks are now storable on the cache entry itself. # Guard checking logic of AOTAutogradCache When AOTAutogradCache evaluates guard expressions, it no longer needs to evaluate the forward/backward guards in the FXGraphCacheEntry (since the AOTAutogradCache guard expressions will encompass them). Because of this, we still need a way for AOTAutogradCache to distinguish between multiple FXGraphCache local entries. To do so, AOTAutogradCache stores the guard string from FXGraphCache, which it uses as a second "cache key". It doesn't need to evaluate these guards, it just needs to find the cache entry from FXGraphCache that had the same guards as when it was stored. After this, I will work on putting the FXGraphCache entries directly into AOTAutogradCache. If I can put CompiledFxGraphs in the cache directly, I no longer need this complicated `check_guard_hit` overriding logic. ## Test Plan Added a new unit test. There are comprehensive guard checking unit tests in `test_aot_autograd_cache` already, and those pass. Pull Request resolved: https://github.com/pytorch/pytorch/pull/151563 Approved by: https://github.com/oulgen	2025-04-22 03:01:08 +00:00
Oguz Ulgen	0f8613bf5c	Introduce unsafe way to mark functions as cacheable (#151603 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/151603 Approved by: https://github.com/jamesjwu ghstack dependencies: #151768, #151609	2025-04-21 17:37:38 +00:00
James Wu	f1364431f0	Add debug_lines of FXGraphCacheKey to AOTAutogradCacheEntry (#150594 ) Previously we didn't save debug_lines because it's pretty large, but compared to the size of FXGraphCache entries it's still pretty small. So let's add it to AOTAutogradCache for easier debugability. Differential Revision: [D72361611](https://our.internmc.facebook.com/intern/diff/D72361611/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/150594 Approved by: https://github.com/oulgen	2025-04-11 15:24:13 +00:00
bobrenjc93	f649ee73ce	Use source hashing to generate consistent symbolic ids (#149665 ) This PR was inspired by internal models that were cache missing due to PGO. At a high level the problem looks as follows Run 1, Invocation 1: We do static compile, save some example values in PGO/automatic dynamic Run 1, Invocation 2: We detect varying inputs, do dynamic compile, get a dynamic graph and save to PGO. Crucially what we save to PGO is actually a superset of what is actually dynamic. If we notice an input was varying, we mark it as dynamic in PGO even if later on that value gets specialized. When a value gets specialized, we actually remove the symbol from the graph. This results in an interesting conundrum where although we are producing the same isomorphic graph, PGO makes the second run cache miss. Let's see how.... Run 2, Invocation 1: We fetch the PGO, over-mark things as dynamic, get a fx graph, look it up in the cache and... whoops! cache miss! This is because of the aforementioned behavior where the PGO profile will cause us to over-allocate symbols. In practice this means we end up saving a graph in cache with symbols x:s1, y:s3 and on second attempt we cache miss with x:s1, y:s6 where symbols s3,s4,s5 were all optimistically marked dynamic by PGO and subsequently specialized. We solve this problem by hashing the source names. This ensures somewhat stable assignment. To prevent catastrophic symbol collisions, we use linear probing to ensure no collisions. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149665 Approved by: https://github.com/Mingming-Ding, https://github.com/laithsakka	2025-03-28 05:36:32 +00:00
PyTorch MergeBot	af7719a2fa	Revert "Use source hashing to generate consistent symbolic ids (#149665 )" This reverts commit 1f92348dc6c60e3020a723b37ecb8226cf2480c0. Reverted https://github.com/pytorch/pytorch/pull/149665 on behalf of https://github.com/malfet due to Broke trunk, see `6eb3c2e282/1` ([comment](https://github.com/pytorch/pytorch/pull/149665#issuecomment-2758578187))	2025-03-27 16:02:27 +00:00
bobrenjc93	1f92348dc6	Use source hashing to generate consistent symbolic ids (#149665 ) This PR was inspired by internal models that were cache missing due to PGO. At a high level the problem looks as follows Run 1, Invocation 1: We do static compile, save some example values in PGO/automatic dynamic Run 1, Invocation 2: We detect varying inputs, do dynamic compile, get a dynamic graph and save to PGO. Crucially what we save to PGO is actually a superset of what is actually dynamic. If we notice an input was varying, we mark it as dynamic in PGO even if later on that value gets specialized. When a value gets specialized, we actually remove the symbol from the graph. This results in an interesting conundrum where although we are producing the same isomorphic graph, PGO makes the second run cache miss. Let's see how.... Run 2, Invocation 1: We fetch the PGO, over-mark things as dynamic, get a fx graph, look it up in the cache and... whoops! cache miss! This is because of the aforementioned behavior where the PGO profile will cause us to over-allocate symbols. In practice this means we end up saving a graph in cache with symbols x:s1, y:s3 and on second attempt we cache miss with x:s1, y:s6 where symbols s3,s4,s5 were all optimistically marked dynamic by PGO and subsequently specialized. We solve this problem by hashing the source names. This ensures somewhat stable assignment. To prevent catastrophic symbol collisions, we use linear probing to ensure no collisions. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149665 Approved by: https://github.com/Mingming-Ding, https://github.com/laithsakka	2025-03-27 03:39:27 +00:00
James Wu	49f86a939c	[AOTAutogradCache] Allow Custom Autograd functions behind a flag (#149751 ) This adds a new env var and flag, autograd_cache_allow_custom_autograd_functions, (env var: `TORCHINDUCTOR_AUTOGRAD_CACHE_ALLOW_CUSTOM_AUTOGRAD`) which allows custom autograd functions into AOTAutogradCache. @hirsheybar and I worked together to verify that the higher order op AutogradFunctionApply is pure with respect to the dynamo input being passed in, so this should be safe. I'm still putting it behind a flag and turning it on slowly, first on an internal model, though. Once we verify that it is correct on the internal model we can work to enable the flag by default. Differential Revision: [D71633184](https://our.internmc.facebook.com/intern/diff/D71633184/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149751 Approved by: https://github.com/bdhirsh, https://github.com/zou3519	2025-03-24 21:12:11 +00:00
Simon Fan	666508eb17	[aot cache][ca] remove restriction on caching ca's aot inference graph (#148491 ) but still can't cache CA's aot inference graph yet: the CA functional ops aren't serializable Pull Request resolved: https://github.com/pytorch/pytorch/pull/148491 Approved by: https://github.com/jamesjwu ghstack dependencies: #148381	2025-03-08 06:08:26 +00:00
James Wu	8728d4b815	Clear triton kernels after parent make_launcher (#148604 ) Before, we were clearing the cache only after inductor compile. But inductor may not always compile, i.e. on AOTAutogradCache hit. So instead, we should clear it when the future is consumed. This is a more robust fix for the issue in D69476856 Differential Revision: [D70646281](https://our.internmc.facebook.com/intern/diff/D70646281/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148604 Approved by: https://github.com/masnesral	2025-03-06 03:28:38 +00:00
James Wu	574371d828	Add current cuda device index to FXGraphCache key (#147464 ) This PR intends to fix the cache related issues from https://github.com/pytorch/pytorch/issues/147405. It does not handle the dynamo recompile case in process, because it does not introduce any extra guards. For FXGraphCache and AOTAutogradCache, we simply have to have the device context in the cache key. Note that for any function that accepts tensor inputs, the device context is naturally already included in the cache key by the metadata of example inputs. However, for functions that return constants or have no arguments, the device context still needs to be in the cache key. A more robust fix for this would be to have inductor generate device guards that are dynamic, instead of specialized. This would also help us share more cache artifacts. I've added unit tests for FXGraphCache and AOTAutogradCache, both of which would fail without this change. Differential Revision: [D69875939](https://our.internmc.facebook.com/intern/diff/D69875939) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147464 Approved by: https://github.com/bdhirsh, https://github.com/anijain2305	2025-02-20 12:38:21 +00:00
Aaron Gokaslan	e738f7ba23	[BE]: Enable ruff rule SIM113 (#147290 ) Lint rules that tells the user to avoid keeping track of their own counter and use the builtin enumerate when possible. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147290 Approved by: https://github.com/jansel	2025-02-16 22:41:16 +00:00
PyTorch MergeBot	6c713ccb5e	Revert "Make functionalization `ViewMeta` serializable with pickle. (#143712 )" This reverts commit b8abdaa286fd161af48af57a675827f4f849914d. Reverted https://github.com/pytorch/pytorch/pull/143712 on behalf of https://github.com/kit1980 due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/143712#issuecomment-2597205261))	2025-01-17 00:52:50 +00:00
Yukio Siraichi	b8abdaa286	Make functionalization `ViewMeta` serializable with pickle. (#143712 ) Fix: #141974 This PR makes `ViewMeta` sequence, present in functional tensors, serializable with pickle. In order to accomplish that, it makes `ViewMeta` an abstract class with overridable `forward` and `reverse` functions. In this context, each operation that once instanciated `ViewMeta`, should now create a new specialized class that inherits from `ViewMeta. Therefore, this PR also uses codegen for creating these specializations. In summary, these are the changes this PR introduces: - `ViewMeta` is turned into an abstract class (see _FunctionalStorageImpl.cpp_). `forward` and `reverse` are pure virtual functions that need to be implemented. `to_out_index` should be implemented by operations that might return more than 1 output. - New `ViewMeta` specializations for `resize_` and `_unsafe_view` are created (see _FunctionalizeFallbackKernel.h_). - New templates _ViewMetaClasses.{cpp,h}_ are created. They hold the declaration and definition of the `ViewMeta` specializations, which are automatically generated in the ATen codegen (see _gen.py_). - New `_functionalization` Python sub-module is created (see _Module.cpp_). It serves as namespace for the `ViewMeta` specializations and `InverseReturnMode` enum. - New template _ViewMetaClassesPythonBinding.cpp_ is created. It holds the automatically generated Python bindings for the `ViewMeta` specialization, which are generated in the torch codegen (see _generate_code.py_). Note that this PR makes use of codegen at 2 different moments: - ATen codegen (_gen.py_): generates the `ViewMeta` specialized classes. - Torch codegen (_generate_code.py_): generated the Python bindings for them. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143712 Approved by: https://github.com/bdhirsh	2025-01-16 19:41:41 +00:00
James Wu	6e77d7cac5	Add AOTAutogradCache support for cache hot loading APIs (#144499 ) This diff adds AOTAutogradCache support to the mega cache. Differential Revision: [D67991059](https://our.internmc.facebook.com/intern/diff/D67991059/) NOTE FOR REVIEWERS: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D67991059/)! Pull Request resolved: https://github.com/pytorch/pytorch/pull/144499 Approved by: https://github.com/oulgen	2025-01-13 07:07:18 +00:00
Tom Ritchford	d25e6e623f	Fix unused Python variables in test/[a-d]* (#134665 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134665 Approved by: https://github.com/albanD	2024-12-13 22:13:12 +00:00
James Wu	60a192036b	Refactor optional graph module into CompiledFxGraphConstants (#141897 ) FXGraphCache supports freezing, but AOTAutogradCache does not. This is due to the fact that when freezing is turned on, instead of using the constants from the graph module that was saved on cache miss, we have to take the constants from the AOTAutograd generated graph module. This PR does two things: - It bypasses AOTAutogradCache when freezing is turned on. We should have always been doing this. - It refactors the code to be way more clear about the constants we're using and when we're using them. Basically, there are two possible sets of constants we can grab from the compiled fx graph. 1. If freezing is turned off, we save the constants directly in CompiledFxGraph. 2. If freezing is turned on, we save the names of the constants in CompiledFxGraph, and use the runtime GraphModule's actual constant values: we reconstruct them from the saved names + the new graph module from AOTDispatch. We implement two different classes for doing just this: one that has access to the post aotdispatch gm, which supports freezing, and one that doesn't have it, which does not support freezing. Then we construct the wrappers and unwrap the result as needed. This makes it clear that the gm passed to AOTAutogradCache is not part of post compile, only the cache key generated from it is. The whole flow is pretty confusing, but hopefully this gives us better types and static information for understanding what the different codepaths are doing. Will add a specific AOTAutogradCache to confirm we bypass freezing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/141897 Approved by: https://github.com/ezyang, https://github.com/masnesral	2024-12-05 00:34:14 +00:00
James Wu	326f487809	Bypass AutogradCache when view replay affects the mutation meta (#141978 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/141978 Approved by: https://github.com/bdhirsh	2024-12-04 22:13:12 +00:00
James Wu	a7ca6a9113	Enable autograd cache on inductor tests (#140890 ) This turns on AOTAutogradCache for all inductor tests. It clears AOTAutogradCache on each test as well, by virtue of the local cache using the same directory to store cache entries. I've also tested with INDUCTOR_TEST_DISABLE_FRESH_CACHE=1, running all the tests. AOTAutogradCache successfully caches 99% of these. There are a few tests that use view_replay and therefore save functional tensors, which cause AOTAutogradCache to fail to pickle its result. Will look into next steps there, but for now, it seems okay if the cache just misses on those cases where it can't serialize the result. It would be better to check before pickling, though. I've made the following small bugfixes to get this working: - Inductor is sometimes used in a standalone mode without dynamo, which leads to attribute errors in check_can_cache. In general, we should never crash in cache checking, only bypass. So I change a try catch to check Exception instead of just a specific exception. - Add extra structured logging for metadata on cache hits Pull Request resolved: https://github.com/pytorch/pytorch/pull/140890 Approved by: https://github.com/bdhirsh	2024-11-27 20:41:43 +00:00
James Wu	c7d072db99	[AOTAutogradCache] Allowlist various ops found from models to safe list (#140825 ) From running internal models, I found a bunch of AOTAutogradCache ops that seem safe to cache. Would appreciate any suggestions for how to allowlist these in a more general way, but starting with these for now. Differential Revision: [D66010326](https://our.internmc.facebook.com/intern/diff/D66010326/) NOTE FOR REVIEWERS: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D66010326/)! Pull Request resolved: https://github.com/pytorch/pytorch/pull/140825 Approved by: https://github.com/bdhirsh	2024-11-21 00:04:17 +00:00
Sam Larsen	f6ba95a76f	[inductor] PyCodeCache: only delete on-disk artifacts if purge=True (#140216 ) Summary: https://github.com/pytorch/pytorch/pull/136505 changed the cache_clear operation to remove loaded modules from disk. That change caused some problems with TORCHINDUCTOR_FORCE_DISABLE_CACHES=1, where there are some code paths (coordinate descent tuning at least), where we call `PyCodeCache.load_by_key_path` and expect that the files are still on disk. (But when caches are disabled, we call cache_clear before every inductor compile). It seems we probably have a shortcoming in the disable-cache logic, but since we also have flakey test failures with the same `'could not get source code'` error, let's restore the previous functionality until I can investigate further. Since some tests actually _DO_ want to delete on-disk artifacts (e.g., to test remote caching), then I added a `purge` param to optionally delete files Pull Request resolved: https://github.com/pytorch/pytorch/pull/140216 Approved by: https://github.com/eellison	2024-11-14 19:34:57 +00:00
James Wu	dd6a5de00d	Allow OpOverloadPackets as safe torch functions, sanitize dynamo gm before running aotdispatch with cache (#139785 ) Summary: This diff implements two things to improve cache hit rates after testing AOTAutogradCache with internal cogwheel jobs: - We should allow torch functions that are OpOverloadPackets - When running with cache, there are some fields that dynamo puts into the input graph module to aotdispatch that are not stable between runs. We use a context manager to null these out so that they can't be used to affect the output of AOTAutograd, and then we put the fields back onto the gm before returning from AOTAutogradCache.load(). Test Plan: New unit tests + running nanogpt with AOTAutogradCache. Meta: Run on a long running job Cache miss: {F1953831996} Cache hit: {F1953830872} Servicelabs here: https://www.internalfb.com/servicelab/experiment/4301352991/ Cache hit: https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/f660597709-TrainingApplication/attempt_0/version_0/rank_0/index.html Cache miss: https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/f660569960-TrainingApplication/attempt_0/version_0/rank_0/index.html We can see that with these changes, autograd cache hits and saves compile time: https://fburl.com/scuba/pt2_compile_events/ycddxstd Differential Revision: D65436373 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139785 Approved by: https://github.com/bdhirsh	2024-11-06 16:34:02 +00:00
Sam Larsen	d8b606ecb5	[fx graph cache] Support freezing with FX graph caching (#136505 ) Summary: The main changes to support freezing are: 1) When pickling constant tensors as part of the cache key calculation: If freezing has not been applied, then keep the existing behavior (pickle the metadata and values). If freezing has been applied, then pickle the values if the constant will be inlined; otherwise, consider only the metadata. 2) If freezing has been applied, modify what we store in the cache: Instead of storing the constant attributes in the cache entry, store the _names_ of the constants, and then grab those constants from the GraphModule when we need attache the attributes to a newly-loaded Python module. Since the cache lookup path loads the Python module, this bullet means we need to thread through a GraphModule argument in several places. 3) Since this feature means that we may need to reload the same Python module path more than once (but attach different constant attributes), I changed PyCodeCache.load_by_key_path to not store an in-memory map of path to module (since there may be more than one). I don't _think_ this will have any affect on performance, however.. It's unclear why we were using an in-memory cache here anyway, since this function should only be called once for each module needed to be loaded. 4) Several tests were removing on-disk PyCodeCache artifacts by iterating over the modules. I made this more straightforward by implementing a cache_clear method that removes the on-disk artifacts. Arguably, this should have been the implementation all along. Differential Revision: [D63542170](https://our.internmc.facebook.com/intern/diff/D63542170) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136505 Approved by: https://github.com/eellison	2024-11-01 18:29:29 +00:00
Guilherme Leobas	8785353f2f	Fix tensor subclass + dynamic shapes in torch.compile + aot autograd (#125941 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125941 Approved by: https://github.com/bdhirsh ghstack dependencies: #133337	2024-10-28 21:58:59 +00:00
Xu Han	97fd087cdb	[inductor] calibration inductor windows uts (6/N) (#134419 ) Disable UTs for Windows: `test/dynamo/test_aot_autograd_cache.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/134419 Approved by: https://github.com/jansel	2024-08-25 20:39:34 +00:00
James Wu	0bde3c4f2f	Run cudagraphs on AOTAutograd cache hit (#132294 ) This threads through all of the necessary parts into aot autograd from the FXGraphCache changes so that we can run cudagraphs properly on a AOTAutograd cache hit. Specifics: - AOTAutograd needs access to the `cudagraphs` boxedbool in order to properly set the backward to not use cudagraphs on a cache hit from the forward. - We have lots of tests that test this already from the previous PR, so I just added an extra test and made the previous test work with both AOTAutogradCache and FXGraphCache at the same time. ``` TORCH_LOGS=torch._functorch._aot_autograd.autograd_cache,cudagraphs ENABLE_AOT_AUTOGRAD_CACHE=1 TORCHINDUCTOR_FX_GRAPH_CACHE=1 tlp python benchmarks/gpt_fast/benchmark.py --output ~/gpt_fast_benchmark.csv ``` Twice, once on cache miss and once and cache hit. Here is the perfetto trace for each(FB only link): Cache Miss: Logs: ``` Loading model Llama-2-7b-chat-hf Time to load model: 0.66 seconds I0813 10:53:34.416000 911030 torch/_functorch/_aot_autograd/autograd_cache.py:479] [0/0] AOTAutograd cache miss for key alqchc7zw6ynsxj2bzktcsngu4cajwcb3tmhvwlyqkuinx3zhmey I0813 10:53:51.395000 911030 torch/_functorch/_aot_autograd/autograd_cache.py:558] [0/0] Writing AOTAutograd cache entry to /tmp/torchinductor_jjwu/aotautograd/alqchc7zw6ynsxj2bzktcsngu4cajwcb3tmhvwlyqkuinx3zhmey/entry I0813 10:54:17.579000 911030 torch/_functorch/_aot_autograd/autograd_cache.py:479] [1/0] AOTAutograd cache miss for key a3nq2ywjxku342c6ag7rsqkalnxfshlcgve3tb2bigg7a45uz6pt I0813 10:54:38.636000 911030 torch/_functorch/_aot_autograd/autograd_cache.py:558] [1/0] Writing AOTAutograd cache entry to /tmp/torchinductor_jjwu/aotautograd/a3nq2ywjxku342c6ag7rsqkalnxfshlcgve3tb2bigg7a45uz6pt/entry I0813 10:54:39.228000 911030 torch/_inductor/cudagraph_trees.py:385] [__cudagraphs] recording cudagraph tree for graph without symints V0813 10:54:39.939000 911030 torch/_inductor/cudagraph_trees.py:2160] [__cudagraphs] Running warmup of function 0 V0813 10:55:10.615000 911030 torch/_inductor/cudagraph_trees.py:2119] [__cudagraphs] Recording function 0 of graph recording id 0 Compilation time: 101.24 seconds Average tokens/sec: 147.96 tokens/sec Average bandwidth achieved: 1955.22 GB/s Memory used: 14.51 GB ``` Chromium Event(fb only): https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html?url=https%3A%2F%2Finterncache-all.fbcdn.net%2Fmanifold%2Ftlparse_reports%2Ftree%2Flogs%2Fjjwu%2Fcustom%2Fchromium_events.json#!/viewer?url=https%3A%2F%2Finterncache-all.fbcdn.net%2Fmanifold%2Ftlparse_reports%2Ftree%2Flogs%2Fjjwu%2Fcustom%2Fchromium_events.json&local_cache_key ![image](https://github.com/user-attachments/assets/47fdd77e-3cc1-437e-8e68-7901646269bb) Cache Hit: Logs: ``` Loading model Llama-2-7b-chat-hf Time to load model: 0.67 seconds I0813 10:55:51.821000 944420 torch/_functorch/_aot_autograd/autograd_cache.py:474] [0/0] AOTAutograd cache hit for key alqchc7zw6ynsxj2bzktcsngu4cajwcb3tmhvwlyqkuinx3zhmey I0813 10:55:55.465000 944420 torch/_functorch/_aot_autograd/autograd_cache.py:474] [1/0] AOTAutograd cache hit for key a3nq2ywjxku342c6ag7rsqkalnxfshlcgve3tb2bigg7a45uz6pt I0813 10:55:56.030000 944420 torch/_inductor/cudagraph_trees.py:385] [__cudagraphs] recording cudagraph tree for graph without symints V0813 10:55:56.192000 944420 torch/_inductor/cudagraph_trees.py:2160] [__cudagraphs] Running warmup of function 0 V0813 10:55:56.426000 944420 torch/_inductor/cudagraph_trees.py:2119] [__cudagraphs] Recording function 0 of graph recording id 0 Compilation time: 9.40 seconds Average tokens/sec: 147.94 tokens/sec Average bandwidth achieved: 1954.98 GB/s Memory used: 14.51 GB ``` Chromium Event(fb only): https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html?url=https%3A%2F%2Finterncache-all.fbcdn.net%2Fmanifold%2Ftlparse_reports%2Ftree%2Flogs%2Fjjwu%2Fcustom2%2Fchromium_events.json#!/viewer?url=https%3A%2F%2Finterncache-all.fbcdn.net%2Fmanifold%2Ftlparse_reports%2Ftree%2Flogs%2Fjjwu%2Fcustom2%2Fchromium_events.json&local_cache_key ![image](https://github.com/user-attachments/assets/9bdd14ec-d12a-4c89-8705-135c999ac746) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132294 Approved by: https://github.com/eellison	2024-08-17 21:24:54 +00:00
James Wu	cd565bc455	Refactor process_inputs outside of create_aot_dispatcher_function (#130962 ) This PR refactors process_inputs so that it occurs earlier outside of create_aot_dispatcher_function for the purpose of calculating a cache key with the inputs after they have been processed. This way, if tensors have symint sizes/strides, we successfully factor that into the cache key instead of specializing on every possible size and stride. Test that utilizes this incoming. # Guard behavior Note that it's technically possible for tensors with symint arguments to introduce guards in aot_dispatch, if they trace through decompositions that branch on tensor size/stride. This can result in multiple graph modules with differing guards having the same key in the cache. FXGraphCache has this same issue, and the remote FXGraphCache intentionally does not handle this: instead it only saves the first result in the cache, and cache misses if guards miss. The local FXGraphCache does handle this by storing multiple files and iterating through them, but we opt not to introduce that complexity just yet for AOTAutogradCache until we deem it necessary (i.e., models appear where saving multiple cache results with different guards but the same cache key becomes important). Instead, AOTAutogradCache will save a single entry per result, overriding it if it cache misses due to guards. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130962 Approved by: https://github.com/bdhirsh	2024-08-13 14:56:00 +00:00
James Wu	1250171866	Use fresh inductor cache on unit tests (#132432 ) Summary: This makes it so that stress tests on separate processes on the same machine don't clobber the directories of each other. InductorTestCase will automatically make a fresh tmpdir for each unit test. Test Plan: ``` buck2 test -j 18 'fbcode//mode/opt' fbcode//caffe2/test/dynamo:test_dynamo -- --exact 'caffe2/test/dynamo:test_dynamo - test_aot_autograd_cache.py::AOTAutogradCacheTests::test_nn_module_with_params_global_constant' --run-disabled --stress-runs 10 --record-results ``` Now passes Differential Revision: D60604811 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132432 Approved by: https://github.com/masnesral	2024-08-02 03:02:36 +00:00
James Wu	f467d55329	Disable remote cache on test_aot_autograd_cache (#132409 ) Summary: AOTAutogradCache currently only checks the local directory instead of both local and remote when saving/loading from the cache, so if remote cache is turned on, it will cache miss. Disable remote caching for now on these tests: when I work on remote caching compatibility, I'll re-enable them here. Test Plan: buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/dynamo:test_dynamo -- --exact 'caffe2/test/dynamo:test_dynamo - test_aot_autograd_cache.py::AOTAutogradCacheTests::test_nn_module_with_params_global_constant' --run-disabled passes Differential Revision: D60588615 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132409 Approved by: https://github.com/masnesral	2024-08-01 17:26:11 +00:00
Oguz Ulgen	920f0426ae	Add None return type to init -- tests rest (#132376 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132376 Approved by: https://github.com/jamesjwu ghstack dependencies: #132335, #132351, #132352	2024-08-01 15:44:51 +00:00

1 2

62 Commits