166 Commits

Author SHA1 Message Date
24b3ab9255 Revert "Add inductor standalone_compile API (#150670)"
This reverts commit bbc5fe850454df6860814ab77a1f3a4ca3698157.

Reverted https://github.com/pytorch/pytorch/pull/150670 on behalf of https://github.com/albanD due to Broke profiler test ([comment](https://github.com/pytorch/pytorch/pull/150670#issuecomment-2802067144))
2025-04-14 15:22:33 +00:00
bbc5fe8504 Add inductor standalone_compile API (#150670)
This PR adds standalone_compile API that does precompilation via caching to support vLLM use case in the short term while we work on the longer term precompilation solution.

```
standalone_compile(gm, example_inputs, options) -> CompiledArtifact
CompiledArtifact.save(path, format: binary|unpacked = binary)
CompiledArtifact.load(path, format: binary|unpacked = binary)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150670
Approved by: https://github.com/jamesjwu, https://github.com/zou3519
2025-04-14 07:07:10 +00:00
5be5cfe4cb [inductor][autotune cache] add torch_key() to configs hash (#150494)
Summary:
**Context**: https://github.com/pytorch/pytorch/pull/150122 (D71982587 - let's call this "the WS diff") introduces "bc/fc-breaking" cache changes.

In particular, it introduces `num_consumer_groups` and adds it to the cached config. In versions of torch that include the WS diff, `num_consumer_groups` is treated as a class variable on a triton.Config object (i.e. `triton.Config({..kwargs..}, num_consumer_groups=num_consumer_groups, ...)`). And in versions of torch that don't include the WS diff, you generally don't expect to see this kwarg.

But if a program is run with WS-torch (i.e. torch w/ the WS diff), and then later you run the same program with non-WS-torch, then non-WS-torch is going to find this autotune cache entry and interpret `num_consumer_groups` as a kwarg, because there's no special handling for num_consumer_groups in this version of torch. Then the program crashes with a triton failure message.

**The fix**: add the torch version / torch key into the hash, so that any changes to inductor will invalidate the cache (ensuring that other changes to triton_heuristics won't cause these bc/fc issues).
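
A minimal sketch of the idea, assuming a hypothetical `autotune_configs_hash` helper and made-up build keys (the real code uses inductor's `torch_key()`, which is not reproduced here): mixing a build-identifying key into the hash means an entry written by one build is never looked up by another.

```python
import hashlib
import json


def autotune_configs_hash(configs, torch_key: bytes) -> str:
    """Hash autotune configs together with a build-identifying key.

    `torch_key` is a stand-in for whatever bytes identify the torch build;
    the name and the JSON encoding are assumptions made for illustration.
    """
    payload = json.dumps(configs, sort_keys=True).encode("utf-8")
    return hashlib.sha256(torch_key + b"\0" + payload).hexdigest()


configs = [{"XBLOCK": 64, "num_warps": 4}]
hash_build_a = autotune_configs_hash(configs, b"torch-build-A")
hash_build_b = autotune_configs_hash(configs, b"torch-build-B")
# Same configs, different builds: the hashes differ, so build B cannot
# pick up an entry that build A wrote.
```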

Test Plan: D72285868 (or https://gist.github.com/davidberard98/2ea697eb550c94d0d1948fedb5c5c7d8, but this doesn't repro in OSS because this version of warp specialization is not available in oss triton) can repro the failure, and the failure is fixed after this PR is patched.

Also, added a test in test/inductor/test_codecache.py which verifies that there's no cache hit if the torch_key changes (and verified that without the functional changes in this PR, the test fails).

Differential Revision: D72285303

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150494
Approved by: https://github.com/oulgen
2025-04-03 16:01:57 +00:00
cbc0964636 Store statically launchable CachingAutotuners inside CompiledFXGraph.triton_bundle (#149054)
This PR adds CachingAutotuners that are statically launchable to FXGraphCache's cache entry.

Regular CachingAutotuners, with triton kernels attached to them, are not very good to cache: they are very large, and take huge amounts of space since they track all of the various binary files, along with various metadata. We could probably figure out what information we could delete from the kernel and have it still work, but with StaticCudaLauncher, we no longer have to. Instead, we can cache every compiled triton kernel that is statically launchable.

Because StaticTritonCompileResult is serializable, and designed to have a very small memory footprint, we can save it into FXGraphCache without increasing the cache size significantly. We store it as a part of CompiledFxGraph.triton_bundle.

Then, on load, we repopulate the CachingAutotuner into our CompiledTritonKernel cache.

The upsides of this are many:
- We no longer need to call into a separate process on cache hit
- We can *guarantee* that the triton kernel we got from our cache entry is the one we use to launch again, so no worries about triton's own caching logic
- Once we achieve feature parity and all torch.compiled triton kernels are statically launchable, we can clean up a bunch of TritonBundler code and simplify the cache hit logic.

Fixes #149449

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149054
Approved by: https://github.com/oulgen
2025-03-30 17:51:11 +00:00
7c4e49750e Revert "Store statically launchable CachingAutotuners inside CompiledFXGraph.triton_bundle (#149054)"
This reverts commit c16af5d7984872b6ae81476d6cae64bddb7ce664.

Reverted https://github.com/pytorch/pytorch/pull/149054 on behalf of https://github.com/jamesjwu due to Sorry I forgot to fix one last test ([comment](https://github.com/pytorch/pytorch/pull/149054#issuecomment-2761381443))
2025-03-28 13:35:07 +00:00
c16af5d798 Store statically launchable CachingAutotuners inside CompiledFXGraph.triton_bundle (#149054)
This PR adds CachingAutotuners that are statically launchable to FXGraphCache's cache entry.

Regular CachingAutotuners, with triton kernels attached to them, are not very good to cache: they are very large, and take huge amounts of space since they track all of the various binary files, along with various metadata. We could probably figure out what information we could delete from the kernel and have it still work, but with StaticCudaLauncher, we no longer have to. Instead, we can cache every compiled triton kernel that is statically launchable.

Because StaticTritonCompileResult is serializable, and designed to have a very small memory footprint, we can save it into FXGraphCache without increasing the cache size significantly. We store it as a part of CompiledFxGraph.triton_bundle.

Then, on load, we repopulate the CachingAutotuner into our CompiledTritonKernel cache.

The upsides of this are many:
- We no longer need to call into a separate process on cache hit
- We can *guarantee* that the triton kernel we got from our cache entry is the one we use to launch again, so no worries about triton's own caching logic
- Once we achieve feature parity and all torch.compiled triton kernels are statically launchable, we can clean up a bunch of TritonBundler code and simplify the cache hit logic.

Fixes #149449

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149054
Approved by: https://github.com/oulgen
2025-03-28 13:28:05 +00:00
80aa88f907 Revert "Store statically launchable CachingAutotuners inside CompiledFXGraph.triton_bundle (#149054)"
This reverts commit ac91f8765ba7817a0853f0520e7f9c94768babc2.

Reverted https://github.com/pytorch/pytorch/pull/149054 on behalf of https://github.com/yangw-dev due to This is breaking ROCM tests on trunk. hud.pytorch.org/ ([comment](https://github.com/pytorch/pytorch/pull/149054#issuecomment-2759604301))
2025-03-27 22:15:40 +00:00
ac91f8765b Store statically launchable CachingAutotuners inside CompiledFXGraph.triton_bundle (#149054)
This PR adds CachingAutotuners that are statically launchable to FXGraphCache's cache entry.

Regular CachingAutotuners, with triton kernels attached to them, are not very good to cache: they are very large, and take huge amounts of space since they track all of the various binary files, along with various metadata. We could probably figure out what information we could delete from the kernel and have it still work, but with StaticCudaLauncher, we no longer have to. Instead, we can cache every compiled triton kernel that is statically launchable.

Because StaticTritonCompileResult is serializable, and designed to have a very small memory footprint, we can save it into FXGraphCache without increasing the cache size significantly. We store it as a part of CompiledFxGraph.triton_bundle.

Then, on load, we repopulate the CachingAutotuner into our CompiledTritonKernel cache.

The upsides of this are many:
- We no longer need to call into a separate process on cache hit
- We can *guarantee* that the triton kernel we got from our cache entry is the one we use to launch again, so no worries about triton's own caching logic
- Once we achieve feature parity and all torch.compiled triton kernels are statically launchable, we can clean up a bunch of TritonBundler code and simplify the cache hit logic.

Fixes #149449

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149054
Approved by: https://github.com/oulgen
ghstack dependencies: #149657
2025-03-27 17:14:44 +00:00
bed92a8523 [Window][Inductor UT] Fix for tempfile.NamedTemporaryFile(delete=True) not work on Windows. (#148632)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148632
Approved by: https://github.com/jansel
2025-03-11 05:05:15 +00:00
57addfcd58 Significantly speed up save_cache_artifacts (#148227)
While using save_cache_artifacts on internal workloads, we have noticed that repeatedly calling this function after every batch is incredibly expensive. This PR significantly speeds up this function call by opting out of pickle and redesigning the serialization algorithm.

Essentially, what we want is to be able to call serialize many times without paying the full serialization cost from scratch on each call.
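
One way to get that property, sketched under assumptions: the `IncrementalSerializer` class and its length-prefixed record format below are hypothetical and are not the format this PR uses. Each artifact is encoded once when it is recorded, so a later `serialize()` call only copies the buffer.

```python
import struct


class IncrementalSerializer:
    """Hypothetical append-only serializer; not the PyTorch implementation."""

    def __init__(self):
        self._buf = bytearray()
        self._seen = set()

    def record(self, key: str, content: bytes) -> None:
        # Encode each artifact exactly once, at record time.
        if key in self._seen:
            return
        self._seen.add(key)
        raw_key = key.encode("utf-8")
        self._buf += struct.pack("<II", len(raw_key), len(content))
        self._buf += raw_key + content

    def serialize(self) -> bytes:
        # No re-encoding of earlier artifacts: just a copy of the buffer.
        return bytes(self._buf)


def deserialize(blob: bytes) -> dict:
    """Decode the length-prefixed records back into a key -> bytes dict."""
    out, pos = {}, 0
    while pos < len(blob):
        key_len, content_len = struct.unpack_from("<II", blob, pos)
        pos += 8
        key = blob[pos:pos + key_len].decode("utf-8")
        pos += key_len
        out[key] = blob[pos:pos + content_len]
        pos += content_len
    return out
```

With this layout, the blob produced after batch N is a prefix of the blob produced after batch N+1.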

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148227
Approved by: https://github.com/jamesjwu
ghstack dependencies: #148226
2025-03-03 17:28:41 +00:00
f98cd84b04 cpp_wrapper: use largeTensorTest for test memory checks (#146991)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146991
Approved by: https://github.com/desertfire
2025-02-27 00:30:21 +00:00
574371d828 Add current cuda device index to FXGraphCache key (#147464)
This PR intends to fix the cache related issues from https://github.com/pytorch/pytorch/issues/147405.
It does *not* handle the dynamo recompile case in process, because it does not introduce any extra guards. For FXGraphCache and AOTAutogradCache, we simply have to have the device context in the cache key.

Note that for any function that accepts tensor inputs, the device context is naturally already included in the cache key by the metadata of example inputs. However, for functions that return constants or have no arguments, the device context still needs to be in the cache key.
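
A minimal sketch of the no-argument case, assuming a hypothetical `fx_graph_cache_key` helper (the real key is built from far more inputs): with no example inputs there is no tensor metadata to carry the device, so the device index has to be hashed explicitly.

```python
import hashlib


def fx_graph_cache_key(graph_src, example_input_meta, device_index):
    """Illustrative key: graph source + input metadata + current device index."""
    details = repr((graph_src, tuple(example_input_meta), device_index))
    return hashlib.sha256(details.encode("utf-8")).hexdigest()


# A graph with no arguments: the input metadata is empty on every device.
src = "def forward(): return torch.ones(2, device='cuda')"
key_dev0 = fx_graph_cache_key(src, [], device_index=0)
key_dev1 = fx_graph_cache_key(src, [], device_index=1)
# Without device_index in the tuple these two keys would be identical.
```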

A more robust fix for this would be to have inductor generate device guards that are dynamic, instead of specialized. This would also help us share more cache artifacts.

I've added unit tests for FXGraphCache and AOTAutogradCache, both of which would fail without this change.

Differential Revision: [D69875939](https://our.internmc.facebook.com/intern/diff/D69875939)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147464
Approved by: https://github.com/bdhirsh, https://github.com/anijain2305
2025-02-20 12:38:21 +00:00
23524699d5 Only call triton in worker process, kick off worker processes earlier, during inductor codegen (#146417)
### Big idea
This PR extends https://github.com/pytorch/pytorch/pull/144288 by combining calling triton in worker processes with the future cache: we kick off triton compilation in the worker processes earlier, during inductor codegen. Basically instead of calling async_compile.triton for the first time only after the entire code has been generated, we start compiling as soon as we know we'll need to compile the kernel. Then, when loading the generated inductor code, we can simply read from our in memory future cache, considerably increasing the parallelism.
### Implementation Overview
In total, the diff does the following:
- Converts TritonFuture to LambdaFuture, only calling triton.compile on worker processes
- Now that triton.compile() isn't called on the main process, we call TritonBundler on all compiled kernels when we get them back from workers
- Extend @eellison's future cache to a class, mostly as a refactor
- Finally, call async_compile.triton ahead of time in Scheduler.codegen if workers are warmed up. This causes the subsequent async_compile.triton call that occurs after codegen to cache hit on cold start.
In the diffs after this, I will add more to CompiledTritonKernels so that TritonBundler automatically populates the in-memory cache with the existing triton kernels on a warm start, avoiding calling triton altogether.
Because LambdaFutures are much faster to kick off than TritonFutures, due to not needing to load from TritonCodeCache at all, the time spent kicking off these worker jobs is pretty minimal for inductor codegen.
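
The "kick off early, read the future later" pattern can be sketched as follows. This is an illustration only: `FutureCache` and `fake_compile` are hypothetical names, and a thread pool stands in for inductor's worker processes.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_compile(source: str) -> str:
    """Stand-in for a triton compile running on a worker."""
    return f"binary<{len(source)}>"


class FutureCache:
    """In-memory map from kernel source to its in-flight compile future."""

    def __init__(self, pool):
        self._pool = pool
        self._futures = {}
        self.hits = 0

    def submit(self, source: str):
        fut = self._futures.get(source)
        if fut is not None:
            self.hits += 1  # compile already started during codegen
            return fut
        fut = self._pool.submit(fake_compile, source)
        self._futures[source] = fut
        return fut


with ThreadPoolExecutor(max_workers=2) as pool:
    cache = FutureCache(pool)
    early = cache.submit("kernel_a")  # issued while code is still being generated
    late = cache.submit("kernel_a")   # issued when the generated code is loaded
    result = late.result()
```

The second `submit` returns the future created by the first, so the load step waits only for whatever compile time is left.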

Differential Revision: [D69123174](https://our.internmc.facebook.com/intern/diff/D69123174/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146417
Approved by: https://github.com/jansel
2025-02-11 03:46:16 +00:00
2811f33d12 Fix code cache + freezing compile-time regression (#145868)
Summary: The current implementation introduces a compile-time regression due to the overhead of hashing large constants. To support freezing+caching, we consider only the tensor metadata of frozen params, but we neglect to do the same for any constants created as a result of folding frozen params. This PR explicitly marks the constants created during freezing (and constant folding during freezing) and uses that info in the inductor cache to determine when to hash a tensor value+metadata vs. metadata only.

Test Plan: `python benchmarks/dynamo/torchbench.py --backend inductor --device cuda --only alexnet --bfloat16 --cold-start-latency --print-compilation-time --inference --performance --freezing`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145868
Approved by: https://github.com/eellison
2025-01-31 02:04:15 +00:00
8e258e2ecd Parallelize epilogue/prologue benchmarking (#143408)
When we attempt prologue or epilogue fusion with a TritonTemplate, we benchmark it at compile time in order to determine profitability. This avoids slowdowns/register spilling, and allows us to pick fusion when a base triton template is slower than cublas but faster when considering an epilogue. However, that fused benchmarking does not do the same async compilation as we do for the base TritonTemplate. The Base TritonTemplate is async compiled during lowering, then later waited on and benchmarked.

This PR extends a similar process to benchmarking fused TritonTemplates in the scheduler. We keep a list of pending fusions which have async compilations. And we resolve any pending fusions a node is in prior to attempting to fuse it with any other node.

Initially, I saw some slowdowns with this because we kick off async compilations of identical fusions in parallel. To address this I added source code caching at the `async_compile` level (we also already cache benchmark runs, but that would not happen in parallel).

Compilation speedups:

<img width="717" alt="image" src="https://github.com/user-attachments/assets/8e8f7d6c-7824-4210-83f9-a2a0f6db5ac9" />

This also should let us be a bit more aggressive with either configs, or benchmarking other fusions which are hard to determine profitability of.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143408
Approved by: https://github.com/jansel, https://github.com/shunting314
2025-01-28 18:18:24 +00:00
cd68d54911 Inductor cache: Revamp how we handle frozen params (#143808)
Summary: In https://github.com/pytorch/pytorch/pull/143563 we have a report of a problem with the treatment of frozen params in the inductor cache implementation. There seems to be a path where new constants are added in the `GraphLowering`. On a cache hit when we try to find those constant names in the `torch.fx.GraphModule`, they do not exist. The current approach treats all constants differently if the GM has any frozen params. This PR changes the approach to only treat the _frozen_ params specially, but store all other constants in the cache entry (as we do without freezing):
1) When creating a cache entry, store the names of any frozen params, but the values of any other constants.
2) On a cache hit, restore the values of the frozen params by looking up in the current GM.
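
The two steps above can be sketched as follows, using plain dicts and lists in place of tensors and GraphModule attributes. The function names and the entry layout are assumptions for illustration, not the cache's actual schema.

```python
def make_cache_entry(constants: dict, frozen_names: set) -> dict:
    """Step 1: keep only the names of frozen params, but the values of the rest."""
    return {
        "frozen_param_names": sorted(n for n in constants if n in frozen_names),
        "constant_values": {
            n: v for n, v in constants.items() if n not in frozen_names
        },
    }


def restore_constants(entry: dict, current_gm_attrs: dict) -> dict:
    """Step 2: on a hit, look frozen params up by name in the current GM."""
    restored = dict(entry["constant_values"])
    for name in entry["frozen_param_names"]:
        restored[name] = current_gm_attrs[name]
    return restored


entry = make_cache_entry(
    {"_frozen_param0": [1.0, 2.0], "folded_const": [3.0]},
    frozen_names={"_frozen_param0"},
)
# A later run hits the same entry with different frozen weights.
restored = restore_constants(entry, {"_frozen_param0": [9.0, 8.0]})
```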

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143808
Approved by: https://github.com/leslie-fang-intel, https://github.com/eellison
2025-01-24 01:20:07 +00:00
99dbc5b0e2 PEP585 update - test (#145176)
See #145101 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145176
Approved by: https://github.com/bobrenjc93
2025-01-22 04:48:28 +00:00
efa88e04e1 Don't overspecialize float when propagating cache guards to ShapeEnv (#145078)
Fixes https://github.com/pytorch/pytorch/issues/142507

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145078
Approved by: https://github.com/Skylion007
2025-01-21 18:05:43 +00:00
6e77d7cac5 Add AOTAutogradCache support for cache hot loading APIs (#144499)
This diff adds AOTAutogradCache support to the mega cache.

Differential Revision: [D67991059](https://our.internmc.facebook.com/intern/diff/D67991059/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D67991059/)!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144499
Approved by: https://github.com/oulgen
2025-01-13 07:07:18 +00:00
9ee242213b [RFC] Introduce cache hot loading APIs (a.k.a. "Mega-cache") (#143341)
This PR essentially introduces two new APIs
* torch.compiler.save_cache_artifacts
* torch.compiler.load_cache_artifacts

which aim to create a mega cache experience where the user can start collecting cache artifacts, and later call the save API to fetch them. In the next attempt, the user can "hot load" the cache artifacts via the load function.

This bundling approach reduces the need to rely on porting individual files one by one, or relying on many network requests.

Note that these APIs CANNOT log to structured logging as these functions will be called before and after compilation, as opposed to during compilation. Due to this limitation, the API returns a struct that the user can log with.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143341
Approved by: https://github.com/jansel
2025-01-07 23:13:24 +00:00
d8c8ba2440 Fix unused Python variables in test/[e-z]* (#136964)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136964
Approved by: https://github.com/justinchuby, https://github.com/albanD
2024-12-18 23:02:30 +00:00
2531543c5f [user triton cache] Dedup user-defined Triton kernels by config in codecache (#143353)
Previously, the same kernel source with different autotuning configs would generate the same cache key, which can lead to a wrong cache hit and silent incorrectness. Here we add the configs to the cache key in `FxGraphHashDetails`.

Test Plan:

```
python3 test/inductor/test_codecache.py -k test_triton_higher_order_op_different_configs
...
----------------------------------------------------------------------
Ran 2 tests in 3.590s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143353
Approved by: https://github.com/oulgen
2024-12-17 08:41:22 +00:00
0f6bfc58a2 Introduce remote cache key prefix to break cache (#142148)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142148
Approved by: https://github.com/jamesjwu, https://github.com/ezyang
2024-12-10 00:35:50 +00:00
4981bd8355 Make cache keys consistent between OSS and internal (#142147)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142147
Approved by: https://github.com/jamesjwu, https://github.com/masnesral
2024-12-05 22:29:07 +00:00
5bc09ac5e9 Remove option for fork-based compile pool (#142001)
Summary: This has been set to "subproc" for a while internally and externally, so we can remove and simplify some of the code. Note that there's no pressing need here -- just that since we've had internal outage with the legacy "fork" implementation, it doesn't seem helpful to leave it available. But if people aren't in the mood for this sort of cleanup, I won't be offended to abandon it.

Test Plan: CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142001
Approved by: https://github.com/eellison, https://github.com/jansel
2024-12-05 17:02:08 +00:00
78e53a92c3 Remove monkeypatch of has_frozen_params in test/inductor/test_codecache.py (#141898)
Summary: This particular test isn't really needed since the code path is already exercised in `test_freezing`. While I was here, I beefed up testing in that method to consider whether the frozen parameter is inlinable vs. not since the caching behavior is different.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141898
Approved by: https://github.com/ezyang, https://github.com/jansel
2024-12-03 20:38:10 +00:00
af88326250 Ensure that BlockMask length must always exactly match the sequence length in flex_attention (#141625)
Fixes https://github.com/pytorch/pytorch/issues/141435

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141625
Approved by: https://github.com/drisspg
ghstack dependencies: #138788
2024-12-03 04:45:05 +00:00
b97a786125 Inline compile_to_fn at its only call site (#141691)
Stacked on https://github.com/pytorch/pytorch/pull/141689

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141691
Approved by: https://github.com/jansel
ghstack dependencies: #141681, #141683, #141685, #141688, #141689
2024-11-29 01:15:38 +00:00
dbbebee9d7 Code motion CompiledFxGraph to a dedicated file (#141654)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141654
Approved by: https://github.com/aorenste, https://github.com/jansel
ghstack dependencies: #141491, #141492, #141574
2024-11-27 20:42:21 +00:00
a7ca6a9113 Enable autograd cache on inductor tests (#140890)
This turns on AOTAutogradCache for all inductor tests. It clears AOTAutogradCache on each test as well, by virtue of the local cache using the same directory to store cache entries.

I've also tested with INDUCTOR_TEST_DISABLE_FRESH_CACHE=1, running all the tests. AOTAutogradCache successfully caches 99% of these. There are a few tests that use view_replay and therefore save functional tensors, which cause AOTAutogradCache to fail to pickle its result. Will look into next steps there, but for now, it seems okay if the cache just misses on those cases where it can't serialize the result. It would be better to check before pickling, though.

I've made the following small bugfixes to get this working:
- Inductor is sometimes used in a standalone mode without dynamo, which leads to attribute errors in check_can_cache. In general, we should *never* crash in cache checking, only bypass. So I changed a try/except to catch Exception instead of just a specific exception.
- Add extra structured logging for metadata on cache hits
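
A minimal sketch of the "never crash, only bypass" rule from the first bullet. The names `BypassCache`, `check_can_cache`, and `try_cache_lookup` here are illustrative stand-ins, not the real AOTAutogradCache code.

```python
class BypassCache(Exception):
    """Raised when a graph is known to be uncacheable."""


def check_can_cache(gm) -> None:
    # In a standalone setting `gm.meta` may not exist, which raises
    # AttributeError rather than BypassCache.
    if gm.meta.get("uncacheable"):
        raise BypassCache("graph marked uncacheable")


def try_cache_lookup(gm):
    try:
        check_can_cache(gm)
    except Exception as exc:  # broad on purpose: any failure means bypass
        return ("bypass", type(exc).__name__)
    return ("lookup", None)


class GraphWithoutMeta:
    """Stand-in for a graph object that lacks the expected attribute."""


outcome = try_cache_lookup(GraphWithoutMeta())
```

Catching only `BypassCache` would let the `AttributeError` escape and fail the compile; the broad handler turns it into a cache miss instead.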

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140890
Approved by: https://github.com/bdhirsh
2024-11-27 20:41:43 +00:00
3473dfa698 Add triton_op test for user defined triton caching (#141407)
Fix failing internal codecache test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141407
Approved by: https://github.com/aorenste
2024-11-23 07:54:39 +00:00
a8ab6b0938 Fix failing internal codecache test (#141405)
When internal remote cache version was bumped to 11, this test started failing, I guess no one noticed it, and it got disabled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141405
Approved by: https://github.com/aorenste
2024-11-23 02:01:02 +00:00
a173186566 [RFC] Implement caching for user defined triton kernels (#140326)
This PR adds caching for user defined triton kernels by putting the transitive closure of source code in node.meta along with constant arguments.

One HUGE hack we do here is a node looks like
```
triton_kernel_wrapper_functional_proxy = torch.ops.higher_order.triton_kernel_wrapper_functional(kernel_idx = 0, constant_args_idx = 1, grid = [(1, 1, 1)], tma_descriptor_metadata = {}, kwargs = {'in_ptr0': arg0_1, 'in_ptr1': arg1_1, 'out_ptr': arg0_1}, tensors_to_clone = ['out_ptr']);
```
so we use regex to remove `kernel_idx = 0, constant_args_idx = 1` parts as they are not relevant to cache hash. This is horrible and I'd like to eventually not use pickle as a hashing alternative but this is a longer project.
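
As an illustration of that normalization step (the pattern below is an assumption written for this example, not the regex used in the PR), stripping the two index arguments makes node strings that differ only in those indices hash identically:

```python
import re

# Hypothetical pattern: drops the two index kwargs and the trailing ", ".
_IDX_RE = re.compile(r"kernel_idx = \d+, constant_args_idx = \d+, ")


def strip_kernel_indices(node_repr: str) -> str:
    """Remove the per-process index arguments before hashing the node string."""
    return _IDX_RE.sub("", node_repr)


first_run = "wrapper(kernel_idx = 0, constant_args_idx = 1, grid = [(1, 1, 1)])"
second_run = "wrapper(kernel_idx = 3, constant_args_idx = 7, grid = [(1, 1, 1)])"
normalized_first = strip_kernel_indices(first_run)
normalized_second = strip_kernel_indices(second_run)
```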

Differential Revision: [D65895744](https://our.internmc.facebook.com/intern/diff/D65895744)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140326
Approved by: https://github.com/zou3519
2024-11-16 02:37:16 +00:00
f6ba95a76f [inductor] PyCodeCache: only delete on-disk artifacts if purge=True (#140216)
Summary: https://github.com/pytorch/pytorch/pull/136505 changed the cache_clear operation to remove loaded modules from disk. That change caused some problems with TORCHINDUCTOR_FORCE_DISABLE_CACHES=1, where there are some code paths (coordinate descent tuning at least) where we call `PyCodeCache.load_by_key_path` and expect that the files are still on disk. (When caches are disabled, however, we call cache_clear before every inductor compile.) It seems we probably have a shortcoming in the disable-cache logic, but since we also have flaky test failures with the same `'could not get source code'` error, let's restore the previous functionality until I can investigate further.

Since some tests actually _DO_ want to delete on-disk artifacts (e.g., to test remote caching), I added a `purge` param to optionally delete files.
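
The resulting behavior can be sketched with a toy cache. `TinyCodeCache` is a hypothetical stand-in, not `PyCodeCache` itself: clearing always drops the in-memory state, and files are removed only when the caller asks for it.

```python
import os
import tempfile


class TinyCodeCache:
    """Toy module cache used only to illustrate the purge flag."""

    def __init__(self):
        self.modules = {}  # path -> source, standing in for loaded modules

    def write(self, directory: str, name: str, source: str) -> str:
        path = os.path.join(directory, name)
        with open(path, "w") as f:
            f.write(source)
        self.modules[path] = source
        return path

    def cache_clear(self, purge: bool = False) -> None:
        if purge:
            # Only an explicit purge touches the files on disk.
            for path in self.modules:
                try:
                    os.remove(path)
                except FileNotFoundError:
                    pass
        self.modules.clear()
```

A default `cache_clear()` therefore leaves generated files in place for code paths that reload them by path, while tests that need a truly cold disk pass `purge=True`.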

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140216
Approved by: https://github.com/eellison
2024-11-14 19:34:57 +00:00
320374b011 [Inductor] Refine triton_bundler.py to support correctly on Intel GPU and fix CI failures. (#139705)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139705
Approved by: https://github.com/EikanWang, https://github.com/jansel, https://github.com/guangyey
2024-11-07 07:10:28 +00:00
c0d21b6581 End TritonBundle on non-cache write codepaths (#139698)
Summary:
When we bypassed the cache write in inductor, we were also forgetting to reset the bundle. This moves resetting the bundle into the post_compile step so it gets uniformly reset.

This diff also turns on the cache for internal so that we can do a code rollout.

Test Plan: updated tests

Differential Revision: D65457224

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139698
Approved by: https://github.com/ezyang
2024-11-05 17:00:40 +00:00
585dbfa583 Profile guided optimization for automatic_dynamic (#139001)
Previously: https://github.com/pytorch/pytorch/pull/138052 but the implementation is done from scratch, so I open a new PR.

This implements the ability to save and load profiles of automatic dynamic decisions, so on subsequent runs we can directly make something automatically dynamic. Unlike the previous implementation, this cache is never enabled by default; instead, you have to specify a "job id" that says it's OK to share results. We will be able to automatically populate this id for internal MAST jobs but for generic OSS users you will have to explicitly opt into it.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139001
Approved by: https://github.com/oulgen
2024-11-03 06:29:57 +00:00
d8b606ecb5 [fx graph cache] Support freezing with FX graph caching (#136505)
Summary: The main changes to support freezing are:
1) When pickling constant tensors as part of the cache key calculation: If freezing has not been applied, then keep the existing behavior (pickle the metadata and values). If freezing has been applied, then pickle the values if the constant will be inlined; otherwise, consider only the metadata.
2) If freezing has been applied, modify what we store in the cache: Instead of storing the constant attributes in the cache entry, store the _names_ of the constants, and then grab those constants from the GraphModule when we need to attach the attributes to a newly-loaded Python module. Since the cache lookup path loads the Python module, this bullet means we need to thread through a GraphModule argument in several places.
3) Since this feature means that we may need to reload the same Python module path more than once (but attach different constant attributes), I changed PyCodeCache.load_by_key_path to not store an in-memory map of path to module (since there may be more than one). I don't _think_ this will have any effect on performance, however. It's unclear why we were using an in-memory cache here anyway, since this function should only be called once for each module that needs to be loaded.
4) Several tests were removing on-disk PyCodeCache artifacts by iterating over the modules. I made this more straightforward by implementing a cache_clear method that removes the on-disk artifacts. Arguably, this should have been the implementation all along.

Differential Revision: [D63542170](https://our.internmc.facebook.com/intern/diff/D63542170)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136505
Approved by: https://github.com/eellison
2024-11-01 18:29:29 +00:00
e3e3ab805b [fx graph cache] Refactor FxGraphCachePickler (#138682)
Summary: In an upcoming change, we need to modify FxGraphCachePickler to behave differently depending on whether the graph has frozen parameters. To do that, it will be convenient to change FxGraphCachePickler into a regular object instead of a collection of classmethods.

Test Plan: unit tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138682
Approved by: https://github.com/eellison
ghstack dependencies: #138681
2024-10-31 03:31:51 +00:00
69ea2e726c Consolidate Triton cache into Inductor cache (#138239)
Summary:
This diff/PR attempts to consolidate Triton caching into the Inductor caching so that there can be just one cache that unifies them both, reducing network requests and increasing success rate.

Implementation details can be found via reading the code or the post: https://fb.workplace.com/groups/1553867532149891/posts/1605037517032892

I did not use the Autotune bundler code at all since I want to simplify that and merge it into this on the next diff/PR.

In terms of instrumentation
1) Dynamo compile: `triton_bundler_time_saved_s` is the sum of all triton.compile calls. We don't have to use the specific number; we can use this as a binary value.
2) Events table: I used dynamo_timed to measure how much time we spend on bundler collect and write functions which is all the work we do in this diff
3) TLParse: I emitted number of kernels and triton_bundler_time_saved_s into tlparse as well

Test Plan: Updated unit tests

Adhoc running
```
TORCHINDUCTOR_BUNDLE_TRITON_INTO_FX_GRAPH_CACHE=1 buck2 run @mode/opt //scripts/oulgen:runner
```
gives
https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpmTZt6b/0_0_0/fx_graph_cache_hit_4.json
<img width="771" alt="image" src="https://github.com/user-attachments/assets/478782a2-ee47-40cb-b723-fcac2bf9dd93">

Differential Revision: D64504909

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138239
Approved by: https://github.com/ezyang
2024-10-31 01:37:16 +00:00
ad933578ed [fx graph cache] FxGraphPickler: Remove hack to stabilize device string hashes (#138681)
Summary: With the fast pickling mode, we don't need the custom hack for replacing device strings in tensors. This was previously needed because, e.g., two strings "cuda" will pickle differently if they are the same object vs. not.
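
The string behavior described here can be reproduced with the standard `pickle` module alone. This sketch does not use FxGraphCachePickler, and the `dumps` helper is defined only for this example. With memoization on, a repeated object is written as a back-reference, so the bytes depend on object identity; `Pickler.fast = True` disables the memo.

```python
import io
import pickle


def dumps(obj, fast: bool) -> bytes:
    """Pickle `obj`, optionally with the pickler's memo disabled."""
    buf = io.BytesIO()
    pickler = pickle.Pickler(buf, protocol=pickle.HIGHEST_PROTOCOL)
    pickler.fast = fast
    pickler.dump(obj)
    return buf.getvalue()


a = "cuda"
b = "".join(["cu", "da"])  # equal to `a`, but a separate object in CPython

same_object = [a, a]
equal_objects = [a, b]

differs_with_memo = dumps(same_object, fast=False) != dumps(equal_objects, fast=False)
stable_in_fast_mode = dumps(same_object, fast=True) == dumps(equal_objects, fast=True)
```

Both lists compare equal, yet only fast mode gives them the same bytes, which is the property a content hash needs.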

Test Plan:
The new test fails with fast mode commented out, but succeeds when enabled:
`python test/inductor/test_codecache.py -k test_stable_strings`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138681
Approved by: https://github.com/oulgen
2024-10-28 15:23:56 +00:00
36b7135c6f Revert "[fx graph cache] FxGraphPickler: Remove hack to stabilize device string hashes (#138681)"
This reverts commit 6cadf616aeb612f3c866b734268919ad1616ffaf.

Reverted https://github.com/pytorch/pytorch/pull/138681 on behalf of https://github.com/jeanschmidt due to Introduced regressions on linux-focal-cuda11.8-py3.10-gcc9 ([comment](https://github.com/pytorch/pytorch/pull/138681#issuecomment-2438945493))
2024-10-25 22:07:30 +00:00
6cadf616ae [fx graph cache] FxGraphPickler: Remove hack to stabilize device string hashes (#138681)
Summary: With the fast pickling mode, we don't need the custom hack for replacing device strings in tensors. This was previously needed because, e.g., two strings "cuda" will pickle differently if they are the same object vs. not.

Test Plan:
The new test fails with fast mode commented out, but succeeds when enabled:
`python test/inductor/test_codecache.py -k test_stable_strings`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138681
Approved by: https://github.com/oulgen
2024-10-25 15:52:58 +00:00
524fe784ec BundledAutotuneCache (take 2) (#137902)
Summary:
Add a cache to combine individual autotune caches into a single cached bundle. We still rely on the individual autotune caches - on a cache hit we copy the individual results into the local caches so they can be retrieved later.

Attempt 2 of #134959 (D60677499).

Various configs:
env: TORCHINDUCTOR_BUNDLED_AUTOTUNE_REMOTE_CACHE
config: bundled_autotune_remote_cache
jk: pytorch/remote_cache:bundled_autotune_remote_cache_version
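
The bundling idea can be sketched with plain dictionaries. Everything below is hypothetical (class name, method names, JSON serialization, key format) and is only meant to show the flow: collect per-kernel results on a miss, store them as one bundle, and on a hit copy them back into the local cache.

```python
import json


class BundledAutotuneCacheSketch:
    """Hypothetical illustration; not the PyTorch implementation."""

    def __init__(self, remote, local):
        self.remote = remote  # bundle key -> serialized bundle
        self.local = local    # per-kernel key -> autotune result
        self.bundle_key = None
        self.pending = {}

    def begin(self, bundle_key):
        """Return True on a bundle hit, after filling the local cache."""
        self.bundle_key = bundle_key
        self.pending = {}
        blob = self.remote.get(bundle_key)
        if blob is None:
            return False
        # Hit: copy the individual results so later per-kernel lookups succeed.
        self.local.update(json.loads(blob))
        return True

    def record(self, kernel_key, result):
        """Called when a kernel is autotuned during a bundle miss."""
        self.pending[kernel_key] = result
        self.local[kernel_key] = result

    def end(self):
        """Publish everything collected during this compile as one bundle."""
        if self.pending:
            self.remote[self.bundle_key] = json.dumps(self.pending, sort_keys=True)


remote = {}

# Cold run: bundle miss, kernels are tuned individually, then bundled.
cold = BundledAutotuneCacheSketch(remote, local={})
cold_hit = cold.begin("graph-abc")
cold.record("kernel_0", {"BLOCK": 64, "num_warps": 4})
cold.record("kernel_1", {"BLOCK": 128, "num_warps": 8})
cold.end()

# Warm run on a machine with an empty local cache: one remote fetch fills it.
warm_local = {}
warm = BundledAutotuneCacheSketch(remote, warm_local)
warm_hit = warm.begin("graph-abc")
```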

Test Plan:
unit tests

Manually tested w/ EMU:
```
cd fbcode/accelerators/workloads/models/emu_flash/v1p4
make build_benchmark_model && make save_model_to_path
make test_pt2_latency
```

- on a cold run we got 0 hits and 40 misses. On a warm run we got 40 hits and 0 misses.
- perf seems a little better - for 8 runs:
  - no bundled cache averaged 14m11s
  - bundled cache averaged 14m6s
  - 125ms saved per cache entry seems reasonable
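
The per-entry figure follows directly from the two averages and the 40 bundle hits reported above; a quick check of the arithmetic:

```python
def to_seconds(minutes, seconds):
    """Convert a minutes/seconds pair to total seconds."""
    return minutes * 60 + seconds


no_bundle_s = to_seconds(14, 11)  # average without the bundled cache
bundle_s = to_seconds(14, 6)      # average with the bundled cache
cache_entries = 40                # bundled cache hits on the warm run

saved_s = no_bundle_s - bundle_s
saved_ms_per_entry = saved_s * 1000 / cache_entries
```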

Cache Metrics for a sample run:
no bundled cache:
```
INFO: Cache Metrics:
  FbMemcacheRemoteKernelCache: {hit: 2256, miss: 0, put: 0, exception: 0}
  FbRemoteAutotuneCache: {hit: 0, miss: 0, put: 7, exception: 0}
  FbRemoteFxGraphCache: {hit: 40, miss: 0, put: 0, exception: 0}
  LocalAutotuneCache: {hit: 878, miss: 0, put: 7, exception: 0}
  backend:MemcacheCache: {hit: 2256, miss: 0, put: 7, exception: 0}
  backend:_LocalAutotuneCacheBackend: {hit: 878, miss: 0, put: 7, exception: 0}
  backend:_ManifoldCache: {hit: 40, miss: 0, put: 0, exception: 0}
```
bundled cache:
```
INFO: Cache Metrics:
  FbMemcacheRemoteKernelCache: {hit: 2258, miss: 0, put: 0, exception: 0}
  FbRemoteAutotuneCache: {hit: 0, miss: 0, put: 8, exception: 0}
  FbRemoteBundledAutotuneCache: {hit: 40, miss: 0, put: 0, exception: 0} <<<<<<
  FbRemoteFxGraphCache: {hit: 40, miss: 0, put: 0, exception: 0}
  LocalAutotuneCache: {hit: 878, miss: 0, put: 886, exception: 0}
  backend:MemcacheCache: {hit: 2258, miss: 0, put: 8, exception: 0}
  backend:_LocalAutotuneCacheBackend: {hit: 878, miss: 0, put: 886, exception: 0}
  backend:_ManifoldCache: {hit: 80, miss: 0, put: 0, exception: 0}
```

Differential Revision: D64336043

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137902
Approved by: https://github.com/oulgen
2024-10-15 18:39:47 +00:00
0dbbcfa7ae [Inductor UT] Generalize newly introduced inductor UTs for intel GPU (Part 3) (#136947)
[Inductor UT] Generalize Newly introduced inductor UTs for intel GPU
reuse `test/inductor/test_pattern_matcher.py`
reuse `test/inductor/test_snode_runtime.py`
reuse `test/inductor/test_unbacked_symints.py`
fix `test/inductor/test_triton_kernels.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136947
Approved by: https://github.com/etaf, https://github.com/EikanWang, https://github.com/jansel
2024-10-12 13:21:20 +00:00
1358969fa1 Revert "BundledAutotuneCache (#134959)"
This reverts commit 709021143d9c9aa90df578a2f5abb93a91a4852a.

Reverted https://github.com/pytorch/pytorch/pull/134959 on behalf of https://github.com/albanD due to The newly added test fails on rocm CI ([comment](https://github.com/pytorch/pytorch/pull/134959#issuecomment-2408091754))
2024-10-11 20:43:56 +00:00
709021143d BundledAutotuneCache (#134959)
Add a cache to combine individual autotune caches into a single cached bundle. We still rely on the individual autotune caches - on a cache hit we copy the individual results into the local caches so they can be retrieved later.

Various related configs:
env: TORCHINDUCTOR_BUNDLED_AUTOTUNE_REMOTE_CACHE
config: bundled_autotune_remote_cache
jk: pytorch/remote_cache:bundled_autotune_remote_cache_version

Testing:

Manually tested w/ EMU:
```
cd fbcode/accelerators/workloads/models/emu_flash/v1p4
make build_benchmark_model && make save_model_to_path
make test_pt2_latency
```

- on a cold run we got 0 hits and 40 misses. On a warm run we got 40 hits and 0 misses.
- perf seems a little better - for 8 runs:
  - no bundled cache averaged 14m11s
  - bundled cache averaged 14m6s
  - 125ms saved per cache entry seems reasonable

Cache Metrics for a sample run:
no bundled cache:
```
INFO: Cache Metrics:
  FbMemcacheRemoteKernelCache: {hit: 2256, miss: 0, put: 0, exception: 0}
  FbRemoteAutotuneCache: {hit: 0, miss: 0, put: 7, exception: 0}
  FbRemoteFxGraphCache: {hit: 40, miss: 0, put: 0, exception: 0}
  LocalAutotuneCache: {hit: 878, miss: 0, put: 7, exception: 0}
  backend:MemcacheCache: {hit: 2256, miss: 0, put: 7, exception: 0}
  backend:_LocalAutotuneCacheBackend: {hit: 878, miss: 0, put: 7, exception: 0}
  backend:_ManifoldCache: {hit: 40, miss: 0, put: 0, exception: 0}
```
bundled cache:
```
INFO: Cache Metrics:
  FbMemcacheRemoteKernelCache: {hit: 2258, miss: 0, put: 0, exception: 0}
  FbRemoteAutotuneCache: {hit: 0, miss: 0, put: 8, exception: 0}
  FbRemoteBundledAutotuneCache: {hit: 40, miss: 0, put: 0, exception: 0}
  FbRemoteFxGraphCache: {hit: 40, miss: 0, put: 0, exception: 0}
  LocalAutotuneCache: {hit: 878, miss: 0, put: 886, exception: 0}
  backend:MemcacheCache: {hit: 2258, miss: 0, put: 8, exception: 0}
  backend:_LocalAutotuneCacheBackend: {hit: 878, miss: 0, put: 886, exception: 0}
  backend:_ManifoldCache: {hit: 80, miss: 0, put: 0, exception: 0}
```

Differential Revision: D60677499

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134959
Approved by: https://github.com/oulgen
2024-10-11 19:12:41 +00:00
319eda9dfd [inductor] Add API to make post_grad_custom passes cache-able (#137298)
Summary: See https://github.com/pytorch/pytorch/issues/130772

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137298
Approved by: https://github.com/oulgen, https://github.com/eellison
2024-10-07 21:11:54 +00:00
4db199f15f Implement Remote AOTAutogradCache (#137278)
Summary: Implement Remote AOTAutogradCache. It uses all the same tech as Remote FXGraphCache, just with its own name.

Test Plan:
Run benchmark:
TORCHINDUCTOR_AUTOGRAD_REMOTE_CACHE=1 TORCHINDUCTOR_FX_GRAPH_REMOTE_CACHE=1 TORCHINDUCTOR_AUTOGRAD_CACHE=0 TORCHINDUCTOR_FX_GRAPH_CACHE=0 TORCH_LOGS=+torch._functorch._aot_autograd.autograd_cache buck run mode/opt benchmarks/dynamo:torchbench -- --training --backend=inductor --only nanogpt --repeat 5 --performance --cold-start-latency

Verify that it gets cache hits even with the local cache removed.

Results show up in remote cache logs https://fburl.com/scuba/pt2_remote_cache/5893dbaj

New unit tests

Reviewed By: oulgen

Differential Revision: D63323958

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137278
Approved by: https://github.com/oulgen
2024-10-07 15:38:54 +00:00
190e09d8b6 [Inductor UT] Generalize device-bias code introduced from #134874 and (#136596)
[Inductor UT] Generalize device-bias code introduced from #134874 and fix unexpected success test cases.
Fix #136595

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136596
Approved by: https://github.com/EikanWang, https://github.com/jansel

Co-authored-by: Yu, Guangye <guangye.yu@intel.com>
2024-09-26 02:56:59 +00:00