Commit Graph

518 Commits

Author SHA1 Message Date
229daf7470 Inline FxGraphCache.load into its sole call site (#141681)
I need to restructure the body of FxGraphCache.load with the outer if-else in its call site, so inline it goes!

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141681
Approved by: https://github.com/jamesjwu, https://github.com/jansel
2024-11-28 17:43:11 +00:00
dbbebee9d7 Code motion CompiledFxGraph to a dedicated file (#141654)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141654
Approved by: https://github.com/aorenste, https://github.com/jansel
ghstack dependencies: #141491, #141492, #141574
2024-11-27 20:42:21 +00:00
a7ca6a9113 Enable autograd cache on inductor tests (#140890)
This turns on AOTAutogradCache for all inductor tests. It clears AOTAutogradCache on each test as well, by virtue of the local cache using the same directory to store cache entries.

I've also tested with INDUCTOR_TEST_DISABLE_FRESH_CACHE=1, running all the tests. AOTAutogradCache successfully caches 99% of these. There are a few tests that use view_replay and therefore save functional tensors, which cause AOTAutogradCache to fail to pickle its result. Will look into next steps there, but for now, it seems okay if the cache just misses on those cases where it can't serialize the result. It would be better to check before pickling, though.

I've made the following small bugfixes to get this working:
- Inductor is sometimes used in a standalone mode without dynamo, which leads to attribute errors in check_can_cache. In general, we should *never* crash in cache checking, only bypass. So I change a try catch to check Exception instead of just a specific exception.
- Add extra structured logging for metadata on cache hits

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140890
Approved by: https://github.com/bdhirsh
2024-11-27 20:41:43 +00:00
7ea0da2d57 Modest code motion in compile_fx (#141574)
Do code review with whitespace changes off. Check comments for what I changed.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141574
Approved by: https://github.com/bobrenjc93, https://github.com/jansel
ghstack dependencies: #141491, #141492
2024-11-27 13:38:14 +00:00
4e34fbdcbc Add inductor_fx_graph_cache stats to dynamo_utils (#141190)
Summary:
Add the following inductor fx graph cache stats to dynamo compile

- inductor_fx_cache_hit_count
- inductor_fx_cache_miss_count
- inductor_fx_cache_backend_type
- inductor_fx_cache_hit_keys
- inductor_fx_cache_miss_keys
- remote_cache_version

Test Plan: Run local tests and staging logger: P1683061460

Differential Revision: D66232206

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141190
Approved by: https://github.com/masnesral
2024-11-21 20:59:10 +00:00
7392e88219 Instead of using node.meta read from side table directly (#141146)
When a transformation phase copies/modifies a node, it might drop node.meta, same as graph.meta, so they are not a good storage locations. instead directly read from the side table.

Differential Revision: [D66249968](https://our.internmc.facebook.com/intern/diff/D66249968/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141146
Approved by: https://github.com/ezyang
2024-11-21 06:19:12 +00:00
12e95aa4ee [BE]: Apply PERF401 autofixes from ruff (#140980)
* Automatically applies ruff rule 401. Turns loops into equivalent list comprehensions which are faster and do not leak the scope of the loop variables.
* list comprehensions not only often have better typing, but are 50+% faster than for loops on overhead. They also preserve length information etc and are better for the interpreter to optimize.
* Manually went back and made mypy happy after the change.
* Also fixed style lints in files covered by flake8 but not by pyfmt

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140980
Approved by: https://github.com/justinchuby, https://github.com/malfet
2024-11-20 17:52:07 +00:00
ff17d2b83e [easy][logging] Remove dynamo_timed fwd_only param (#140993)
Summary: It's ignored; remove it

Test Plan: CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140993
Approved by: https://github.com/ezyang
2024-11-20 02:31:51 +00:00
4f2543c31d [logs] Add dynamo_timed to get better compilation time breakdown for AOTI (#140198)
Adding some dynamo timed for the purpose of better understanding AOTI compilation time.

Probably would require a few more passes. A lot of time is spent in Scheduler.__init__, and not enough annotations are there.

run_command_and_check takes a lot time as well. But there is probably not much we can do. Maybe we can add a config to tune C++ optimization level?

traces:
<img width="1205" alt="Screenshot 2024-11-08 at 4 41 10 PM" src="https://github.com/user-attachments/assets/61645264-b3af-4d4a-804d-700b0f831c7c">

Differential Revision: D65554141

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140198
Approved by: https://github.com/desertfire
2024-11-19 18:54:17 +00:00
a173186566 [RFC] Implement caching for user defined triton kernels (#140326)
This PR adds caching for user defined triton kernels by putting the transitive closure of source code in node.meta along with constant arguments.

One HUGE hack we do here is a node looks like
```
triton_kernel_wrapper_functional_proxy = torch.ops.higher_order.triton_kernel_wrapper_functional(kernel_idx = 0, constant_args_idx = 1, grid = [(1, 1, 1)], tma_descriptor_
metadata = {}, kwargs = {'in_ptr0': arg0_1, 'in_ptr1': arg1_1, 'out_ptr': arg0_1}, tensors_to_clone = ['out_ptr']);
```
so we use regex to remove `kernel_idx = 0, constant_args_idx = 1` parts as they are not relevant to cache hash. This is horrible and I'd like to eventually not use pickle as a hashing alternative but this is a longer project.

Differential Revision: [D65895744](https://our.internmc.facebook.com/intern/diff/D65895744)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140326
Approved by: https://github.com/zou3519
2024-11-16 02:37:16 +00:00
baf756a785 [reland] [aoti] Selectively package AOTI generated files (#140675)
Summary: Reland  https://github.com/pytorch/pytorch/pull/140022

Test Plan: CI

Differential Revision: D65929964

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140675
Approved by: https://github.com/desertfire
2024-11-15 23:48:34 +00:00
f6ba95a76f [inductor] PyCodeCache: only delete on-disk artifacts if purge=True (#140216)
Summary: https://github.com/pytorch/pytorch/pull/136505 changed the cache_clear operation to remove loaded modules from disk. That change caused some problems with TORCHINDUCTOR_FORCE_DISABLE_CACHES=1, where there are some code paths (coordinate descent tuning at least), where we call `PyCodeCache.load_by_key_path` and expect that the files are still on disk. (But when caches are disabled, we call cache_clear before every inductor compile). It seems we probably have a shortcoming in the disable-cache logic, but since we also have flakey test failures with the same `'could not get source code'` error, let's restore the previous functionality until I can investigate further.

Since some tests actually _DO_ want to delete on-disk artifacts (e.g., to test remote caching), then I added a `purge` param to optionally delete files

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140216
Approved by: https://github.com/eellison
2024-11-14 19:34:57 +00:00
b11ff3cf60 [logging] Overhaul dynamo_timed and CompilationMetrics logging. (#139849)
Here's the overview:

There's a new contextmanager singleton called MetricsContext. Entering the MetricsContext is how we demarcate the boundary on which we'll create a single CompilationMetrics object, and therefore, a single dynamo_compile log entry. While we're inside the MetricsContext, we can update/set many different metrics. Most importantly: `dynamo_timed` can also update the in-progress MetricsContext. In the proposal here, we tell `dynamo_timed` that we want it to do so by providing the name of the MetricsContext field to increment. There can be many `dynamo_timed` calls in different parts of the code updating different fields. Then when the MetricsContext exits, that's when the logging of everything gathered finally happens. One potential footgun is trying to use `dynamo_timed` when we haven't entered the MetricsContext, but we assert on that problem. Another problem is that we re-enter the context recursively, but we watch for that and do the logging only when the outermost exits.

Some specifics:
* Introduce MetricsContext - a context manager that on exit, records the CompilationMetrics (which also logs to dynamo_compile).
* Completely remove the concept of frame_phase_timing. Instead, update the MetricsContext during compilation, either directly or via dynamo_timed.
* Remove some globals we previously used to accumulate counters to later populate a CompilationMetrics. We use CompilationMetrics set/update/increment APIs instead.
* `record_compilation_metrics` is now called on exit from MetricsContext.
* Populate legacy CompilationMetrics fields right before logging, inside `record_compilation_metrics`.
* Remove the one-off `add_remote_cache_time_saved` helper; capture that timing directly into the MetricsContext.

And specifically, several changes to dynamo_timed:
* "Modernize" the parameters and update all callsites accordingly.
* Move the backwards logging of the CompilationMetrics to the backwards compile location.
* Add a parameter for which CompilationMetrics field to update

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139849
Approved by: https://github.com/ezyang
2024-11-14 19:11:20 +00:00
d63eb3c46c Revert "[logging] Overhaul dynamo_timed and CompilationMetrics logging. (#139849)"
This reverts commit cb15c1515778499ae801dcf67d55c8bdab4724ef.

Reverted https://github.com/pytorch/pytorch/pull/139849 on behalf of https://github.com/kit1980 due to Breaking an internal tests + there is a bug according to the author ([comment](https://github.com/pytorch/pytorch/pull/139849#issuecomment-2474459094))
2024-11-13 18:47:51 +00:00
b4cc5d38b4 Revert "[aoti] Remove dir after packaging (#140022)"
This reverts commit ba136a78ba613d3c7f5d2de53b9fff556e04cfba.

Reverted https://github.com/pytorch/pytorch/pull/140022 on behalf of https://github.com/angelayi due to sorry I realized I need to land from internal ([comment](https://github.com/pytorch/pytorch/pull/140022#issuecomment-2473814720))
2024-11-13 14:43:15 +00:00
ba136a78ba [aoti] Remove dir after packaging (#140022)
Update AOTI to return a list of files that it generates when `aot_inductor.package=True`. Then we will only package the files that are in that list.

This should fix the [caching issue](https://fb.workplace.com/groups/1028545332188949/permalink/1081702043539944/) and hopefully https://github.com/pytorch/pytorch/issues/140053.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140022
Approved by: https://github.com/larryliu0820, https://github.com/desertfire, https://github.com/malfet
2024-11-13 12:17:19 +00:00
d48ea29b9a Revert "[aoti] Remove dir after packaging (#140022)"
This reverts commit 8c6abe5a8c42be3909496d2cd3d1f194a8493460.

Reverted https://github.com/pytorch/pytorch/pull/140022 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the lint failure is legit ([comment](https://github.com/pytorch/pytorch/pull/140022#issuecomment-2471847439))
2024-11-12 23:35:27 +00:00
8c6abe5a8c [aoti] Remove dir after packaging (#140022)
Update AOTI to return a list of files that it generates when `aot_inductor.package=True`. Then we will only package the files that are in that list.

This should fix the [caching issue](https://fb.workplace.com/groups/1028545332188949/permalink/1081702043539944/) and hopefully https://github.com/pytorch/pytorch/issues/140053.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140022
Approved by: https://github.com/larryliu0820, https://github.com/desertfire, https://github.com/malfet
2024-11-12 21:36:24 +00:00
cb15c15157 [logging] Overhaul dynamo_timed and CompilationMetrics logging. (#139849)
Here's the overview:

There's a new contextmanager singleton called MetricsContext. Entering the MetricsContext is how we demarcate the boundary on which we'll create a single CompilationMetrics object, and therefore, a single dynamo_compile log entry. While we're inside the MetricsContext, we can update/set many different metrics. Most importantly: `dynamo_timed` can also update the in-progress MetricsContext. In the proposal here, we tell `dynamo_timed` that we want it to do so by providing the name of the MetricsContext field to increment. There can be many `dynamo_timed` calls in different parts of the code updating different fields. Then when the MetricsContext exits, that's when the logging of everything gathered finally happens. One potential footgun is trying to use `dynamo_timed` when we haven't entered the MetricsContext, but we assert on that problem. Another problem is that we re-enter the context recursively, but we watch for that and do the logging only when the outermost exits.

Some specifics:
* Introduce MetricsContext - a context manager that on exit, records the CompilationMetrics (which also logs to dynamo_compile).
* Completely remove the concept of frame_phase_timing. Instead, update the MetricsContext during compilation, either directly or via dynamo_timed.
* Remove some globals we previously used to accumulate counters to later populate a CompilationMetrics. We use CompilationMetrics set/update/increment APIs instead.
* `record_compilation_metrics` is now called on exit from MetricsContext.
* Populate legacy CompilationMetrics fields right before logging, inside `record_compilation_metrics`.
* Remove the one-off `add_remote_cache_time_saved` helper; capture that timing directly into the MetricsContext.

And specifically, several changes to dynamo_timed:
* "Modernize" the parameters and update all callsites accordingly.
* Move the backwards logging of the CompilationMetrics to the backwards compile location.
* Add a parameter for which CompilationMetrics field to update

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139849
Approved by: https://github.com/ezyang
ghstack dependencies: #140094
2024-11-11 14:24:23 +00:00
ca30704f0b [Inductor][ROCm][CK] Add standalone runner (#139441)
Generate standalone executable to debug and profile CK gemm instances

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139441
Approved by: https://github.com/ColinPeppler
2024-11-07 06:21:27 +00:00
1270c78268 Add logging for num_triton_bundles (#139807)
Summary: Adding logs for number of inductor cache triton bundles

Test Plan:
Ran adhoc code and looked at dynamo_compile/sandbox

https://fburl.com/scuba/dynamo_compile/sandbox/nhktfy19

Differential Revision: D65490826

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139807
Approved by: https://github.com/masnesral
2024-11-06 21:11:04 +00:00
d1e2e81ede [AOTI] Fix two test failures from #139471 (#139885)
Summary: https://github.com/pytorch/pytorch/pull/139471 caused two internal test failures due to different compiler path settings.

Differential Revision: D65519537

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139885
Approved by: https://github.com/hl475
2024-11-06 18:41:28 +00:00
c0d21b6581 End TritonBundle on non-cache write codepaths (#139698)
Summary:
When we bypass cache write on inductor, we were also forgetting to reset the bundle, this moves resetting the bundle into post_compile step so it gets uniformly reset.

This diff also turns on the cache for internal so that we can do a code rollout.

Test Plan: updated tests

Differential Revision: D65457224

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139698
Approved by: https://github.com/ezyang
2024-11-05 17:00:40 +00:00
740054ffe6 [AOTI][reland] Switch OSS dashboard to use aoti_compile_and_package (#139597)
Summary: Reland https://github.com/pytorch/pytorch/pull/139154

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139597
Approved by: https://github.com/angelayi
2024-11-04 18:53:17 +00:00
f4ee5a243d Add PT2 Compile Events for triton and kernel compilation + load_by_key_path (#139402)
Adds a few more dynamo_timed() to measure triton compilation and load_by_key_path times.

In the case of async compilation with multiple threads, we'll generate a single `kernel_compile` event that occurs when waiting on all the parallel compiles to finish.

In the case where async parallel compilation is disabled (or, compile threads are warming up), we'll generate a `triton_compile` event for each kernel.

The `triton_compile` events is a bit questionable: do we need a row for each triton compile event? It might eat up on our already low retention, so I might just remove that. Will discuss with @slarsen.

Differential Revision: [D65215707](https://our.internmc.facebook.com/intern/diff/D65215707/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139402
Approved by: https://github.com/oulgen
2024-11-04 06:37:18 +00:00
a46a79fe92 [AOTI] Ignore .o files in package_aoti (#139153)
Summary: There is no point to package .o files since a .so file is included in that package.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139153
Approved by: https://github.com/angelayi
2024-11-02 03:10:05 +00:00
8c17830dea [AOTI] Unify how weights are stored as data section (#139471)
Summary: https://github.com/pytorch/pytorch/pull/118076 introduced a cleaner way to link weights as a data section for macos. Unify the code by adopting that approach for Linux as well.

Test Plan: CI

Differential Revision: D65302273

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139471
Approved by: https://github.com/chenyang78
2024-11-02 00:23:24 +00:00
ee2f8a50d3 Class rename (#139490)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139490
Approved by: https://github.com/exclamaforte, https://github.com/zou3519
ghstack dependencies: #139295
2024-11-02 00:10:17 +00:00
d8b606ecb5 [fx graph cache] Support freezing with FX graph caching (#136505)
Summary: The main changes to support freezing are:
1) When pickling constant tensors as part of the cache key calculation: If freezing has not been applied, then keep the existing behavior (pickle the metadata and values). If freezing has been applied, then pickle the values if the constant will be inlined; otherwise, consider only the metadata.
2) If freezing has been applied, modify what we store in the cache: Instead of storing the constant attributes in the cache entry, store the _names_ of the constants, and then grab those constants from the GraphModule when we need attache the attributes to a newly-loaded Python module. Since the cache lookup path loads the Python module, this bullet means we need to thread through a GraphModule argument in several places.
3) Since this feature means that we may need to reload the same Python module path more than once (but attach different constant attributes), I changed PyCodeCache.load_by_key_path to not store an in-memory map of path to module (since there may be more than one). I don't _think_ this will have any affect on performance, however.. It's unclear why we were using an in-memory cache here anyway, since this function should only be called once for each module needed to be loaded.
4) Several tests were removing on-disk PyCodeCache artifacts by iterating over the modules. I made this more straightforward by implementing a cache_clear method that removes the on-disk artifacts. Arguably, this should have been the implementation all along.

Differential Revision: [D63542170](https://our.internmc.facebook.com/intern/diff/D63542170)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136505
Approved by: https://github.com/eellison
2024-11-01 18:29:29 +00:00
c8a648d4df Add option to dynamo_timed and chromium_event_logger for logging pt2 compile events (#139309)
This diff considerably changes the column format of PT2 Compile Events:

- Now, instead of logging one new column per every piece of metadata, we just log a single column, "metadata". This vastly decreases the number of columns we need to log, which should help with retention.

- Now, we only log to scuba for a set of dynamo_timed() events that we actually care about aggregating. To do so, we add a boolean to dynamo_timed() that decides whether or not to log a pt2_compile_event. We'll always log a chromium_event for every dynamo_timed(), but only log a subset of those to scuba.

Differential Revision: [D65225598](https://our.internmc.facebook.com/intern/diff/D65225598/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139309
Approved by: https://github.com/oulgen
2024-11-01 02:40:25 +00:00
d21a25c6b7 [fx graph cache] Refactor FxGraphCachePickler, step 2 (#138683)
Summary: Move all the custom `_reduce_*` functions inside the FxGraphCachePickler class. This is mostly a cosmetic change since they're conceptually members of FxGraphCachePickler. But also in an upcoming diff, I'll add a member variable to the class to control how we handle constant tensors, so it will be convenient to be able to query that setting via `self`. I made the analogous changes to AOTAutogradCachePickler for consistency.

Test Plan: unit tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138683
Approved by: https://github.com/eellison
ghstack dependencies: #138681, #138682
2024-10-31 15:12:18 +00:00
3192bdeea4 [AOTI] Use len(serialized_weights) when calculating consts_size (#139054)
Fixes the failure of INT8 DLRM using AOTI.
The previous code calculates `consts_size` directly using `tensor` from `graph.constants`:
```
  consts_size = sum(
      get_nbytes_of_tensor(tensor, all_cuda)
      for (name, tensor) in graph.constants.items()
      if name not in graph.folded_constants
  )
```
Meanwhile, the actual bytes to serialize (`serialized_weights`) is using `graph.get_original_value_of_constant(name)`:
```
  serialized_weights = b"".join(
      _to_bytes(graph.get_original_value_of_constant(name), all_cuda)
      for name in graph.constants.keys()
      if name not in graph.folded_constants
  )
```

`tensor` from `graph.constants` could be different from `graph.get_original_value_of_constant(name)` thus making the `consts_size` inconsistent with the actual byte size of the `serialized_weights`, resulting in runtime error `weights_offset must be aligned to 16K boundary`, similar to what happened in https://github.com/pytorch/pytorch/pull/135205.

This PR direclty gets `consts_size ` using `len(serialized_weights)`, which fixes the inconsistency.

We also added a `reduce_range` argument to the `get_default_x86_inductor_quantization_config` function, which is needed in the unit test to avoid accuracy issue on CI machines (earlier CPUs without VNNI).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139054
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/desertfire
2024-10-31 09:54:16 +00:00
e3e3ab805b [fx graph cache] Refactor FxGraphCachePickler (#138682)
Summary: In an upcoming change, we need to modify FxGraphCachePickler to behave differently depending on whether the graph has frozen parameters (whether or not we have frozen parameters). To do that, it will be convenient to change FxGraphCachePickler into a regular object instead of a collection of classmethods.

Test Plan: unit tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138682
Approved by: https://github.com/eellison
ghstack dependencies: #138681
2024-10-31 03:31:51 +00:00
69ea2e726c Consolidate Triton cache into Inductor cache (#138239)
Summary:
This diff/PR attempts to consolidate Triton caching into the Inductor caching so that there can be just one cache that unifies them both, reducing network requests and increasing success rate.

Implementation details can be found via reading the code or the post: https://fb.workplace.com/groups/1553867532149891/posts/1605037517032892

I did not use the Autotune bundler code at all since I want to simplify that and merge it into this on the next diff/PR.

In terms of instrumentation
1) Dynamo compile: `triton_bundler_time_saved_s` this is sum of all triton.compile calls. We dont have to use the specific number, can use this as a binary value.
2) Events table: I used dynamo_timed to measure how much time we spend on bundler collect and write functions which is all the work we do in this diff
3) TLParse: I emitted number of kernels and triton_bundler_time_saved_s into tlparse as well

Test Plan: Updated unit tests

Adhoc running
```
TORCHINDUCTOR_BUNDLE_TRITON_INTO_FX_GRAPH_CACHE=1 buck2 run @mode/opt //scripts/oulgen:runner
```
gives
https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpmTZt6b/0_0_0/fx_graph_cache_hit_4.json
<img width="771" alt="image" src="https://github.com/user-attachments/assets/478782a2-ee47-40cb-b723-fcac2bf9dd93">

Differential Revision: D64504909

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138239
Approved by: https://github.com/ezyang
2024-10-31 01:37:16 +00:00
ad933578ed [fx graph cache] FxGraphPickler: Remove hack to stabilize device string hashes (#138681)
Summary: With the fast pickling mode, we don't need the custom hack for replacing device strings in tensors. This was previously needed because, e.g., two strings "cuda" will pickle differently if they are the same object vs. not.

Test Plan:
The new test fails with fast mode commented out, but succeeds when enabled:
`python test/inductor/test_codecache.py -k test_stable_strings`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138681
Approved by: https://github.com/oulgen
2024-10-28 15:23:56 +00:00
42994234a6 std::value/std::type -> std::_v/std::_t (#138746)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138746
Approved by: https://github.com/cyyever, https://github.com/malfet
2024-10-26 20:59:24 +00:00
36b7135c6f Revert "[fx graph cache] FxGraphPickler: Remove hack to stabilize device string hashes (#138681)"
This reverts commit 6cadf616aeb612f3c866b734268919ad1616ffaf.

Reverted https://github.com/pytorch/pytorch/pull/138681 on behalf of https://github.com/jeanschmidt due to Introduced regressions on linux-focal-cuda11.8-py3.10-gcc9 ([comment](https://github.com/pytorch/pytorch/pull/138681#issuecomment-2438945493))
2024-10-25 22:07:30 +00:00
6cadf616ae [fx graph cache] FxGraphPickler: Remove hack to stabilize device string hashes (#138681)
Summary: With the fast pickling mode, we don't need the custom hack for replacing device strings in tensors. This was previously needed because, e.g., two strings "cuda" will pickle differently if they are the same object vs. not.

Test Plan:
The new test fails with fast mode commented out, but succeeds when enabled:
`python test/inductor/test_codecache.py -k test_stable_strings`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138681
Approved by: https://github.com/oulgen
2024-10-25 15:52:58 +00:00
282e6383c1 Add inductor cache metrics (#138603)
Each inductor event should have exactly one hit, miss, bypass etc. Add it to the inductor compile event.

Add triton_compile as a compiler phase with `dynamo_timed`. This way, we get PT2 Compile Event Logs for triton as well.

Here's what triton events look like:  {F1941513932}
And this on a cache hit(since we still redo this work):
 {F1941514350}

Inductor cache info:
 {F1941528530}

Differential Revision: [D64703392](https://our.internmc.facebook.com/intern/diff/D64703392/)

@diff-train-skip-merge

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138603
Approved by: https://github.com/oulgen
2024-10-24 22:09:34 +00:00
0d9fb51028 Fix lru_cache where config is used (#134235)
Ensure that any use of functools.lru_cache does not prevent config from being changed after the function has already run.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134235
Approved by: https://github.com/masnesral
2024-10-24 10:43:34 +00:00
07cc4bd3e2 typing compile_fx.py (#138033)
Type annotations for compile_fx.
- Some of the stuff here is pretty complicated (functions which return functions that take functions) so I bailed on those and used `Any` just to get the rest landed.
- There are also changes to type signatures in other files which I did just to let mypy know more about the types in compile_fx.py.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138033
Approved by: https://github.com/Skylion007
2024-10-21 18:14:59 +00:00
d5035f0aab fix codecache write_atomic path issue on Windows. (#138331)
Fixes #138211

`Path.rename` function has Windows OS specific behavior, that will raise `FileExistsError` when the target file existing.
This behavior is not happened on Linux, so I write a small repoduce code to figure out what happened.

After stepping trace the repo code:
```python
import os
import sys
from pathlib import Path

_IS_WINDOWS = sys.platform == "win32"

def test_case():
    cwd = os.getcwd()
    path1 = os.path.join(cwd, "haha1.txt")
    path2 = Path(os.path.join(cwd, "haha2.txt"))

    try:
        path2.rename(path1)
    except FileExistsError as e_file_exist:
        if _IS_WINDOWS:
            # on Windows file exist is expected: https://docs.python.org/3/library/pathlib.html#pathlib.Path.rename
            shutil.copy2(path2, path1)
            os.remove(path2)
        else:
            raise e_file_exist
    except BaseException as e:
        raise e

    print("run here.")

if __name__ == "__main__":
    test_case()
```
We found the code `path2.rename(path1)` can breakdown into:
1. copy file2's content to file1.
2. delete file2.

So, we can implemented equal code on Windows path:
```python
shutil.copy2(src=tmp_path, dst=path)
os.remove(tmp_path)
```

So, we can get current PR.

TODO: need cherry-pick to release/2.5 branch, CC: @atalman .

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138331
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-10-19 01:27:12 +00:00
524fe784ec BundledAutotuneCache (take 2) (#137902)
Summary:
Add a cache to combine individual autotune caches into a single cached bundle.  We still rely on the individual autotune caches - on a cache hit we copy the individual results into the local caches so they can retrieved later.

Attempt 2 of #134959 (D60677499).

Various configs:
env: TORCHINDUCTOR_BUNDLED_AUTOTUNE_REMOTE_CACHE
config: bundled_autotune_remote_cache
jk: pytorch/remote_cache:bundled_autotune_remote_cache_version

Test Plan:
unit tests

Manually tested w/ EMU:
```
cd fbcode/accelerators/workloads/models/emu_flash/v1p4
make build_benchmark_model && make save_model_to_path
make test_pt2_latency
```

- on a cold run we got 0 hits and 40 misses. On a warm run it got 40 hits and 0 miss.
- perf seems a little better - for 8 runs:
  - no bundled cache averaged 14m11s
  - bundled cache averaged 14m6s
  - 125ms saved per cache entry seems reasonable

Cache Metrics for an sample run:
no bundled cache:
```
INFO: Cache Metrics:
  FbMemcacheRemoteKernelCache: {hit: 2256, miss: 0, put: 0, exception: 0}
  FbRemoteAutotuneCache: {hit: 0, miss: 0, put: 7, exception: 0}
  FbRemoteFxGraphCache: {hit: 40, miss: 0, put: 0, exception: 0}
  LocalAutotuneCache: {hit: 878, miss: 0, put: 7, exception: 0}
  backend:MemcacheCache: {hit: 2256, miss: 0, put: 7, exception: 0}
  backend:_LocalAutotuneCacheBackend: {hit: 878, miss: 0, put: 7, exception: 0}
  backend:_ManifoldCache: {hit: 40, miss: 0, put: 0, exception: 0}
```
bundled cache:
```
INFO: Cache Metrics:
  FbMemcacheRemoteKernelCache: {hit: 2258, miss: 0, put: 0, exception: 0}
  FbRemoteAutotuneCache: {hit: 0, miss: 0, put: 8, exception: 0}
  FbRemoteBundledAutotuneCache: {hit: 40, miss: 0, put: 0, exception: 0} <<<<<<
  FbRemoteFxGraphCache: {hit: 40, miss: 0, put: 0, exception: 0}
  LocalAutotuneCache: {hit: 878, miss: 0, put: 886, exception: 0}
  backend:MemcacheCache: {hit: 2258, miss: 0, put: 8, exception: 0}
  backend:_LocalAutotuneCacheBackend: {hit: 878, miss: 0, put: 886, exception: 0}
  backend:_ManifoldCache: {hit: 80, miss: 0, put: 0, exception: 0}
```

Differential Revision: D64336043

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137902
Approved by: https://github.com/oulgen
2024-10-15 18:39:47 +00:00
5b442e8e92 Time torch_key computation in overall Dynamo stats (#137877)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137877
Approved by: https://github.com/oulgen, https://github.com/masnesral
2024-10-15 05:47:19 +00:00
d64492e4cb Increase verbosity of inductor cache hit/miss to INFO level (#137876)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137876
Approved by: https://github.com/Skylion007
2024-10-14 19:59:31 +00:00
1358969fa1 Revert "BundledAutotuneCache (#134959)"
This reverts commit 709021143d9c9aa90df578a2f5abb93a91a4852a.

Reverted https://github.com/pytorch/pytorch/pull/134959 on behalf of https://github.com/albanD due to The newly added test fails on rocm CI ([comment](https://github.com/pytorch/pytorch/pull/134959#issuecomment-2408091754))
2024-10-11 20:43:56 +00:00
709021143d BundledAutotuneCache (#134959)
Add a cache to combine individual autotune caches into a single cached bundle.  We still rely on the individual autotune caches - on a cache hit we copy the individual results into the local caches so they can retrieved later.

Various related configs:
env: TORCHINDUCTOR_BUNDLED_AUTOTUNE_REMOTE_CACHE
config: bundled_autotune_remote_cache
jk: pytorch/remote_cache:bundled_autotune_remote_cache_version

Testing:

Manually tested w/ EMU:
```
cd fbcode/accelerators/workloads/models/emu_flash/v1p4
make build_benchmark_model && make save_model_to_path
make test_pt2_latency
```

 - on a cold run we got 0 hits and 40 misses. On a warm run it got 40 hits and 0 miss.
- perf seems a little better - for 8 runs:
  - no bundled cache averaged 14m11s
  - bundled cache averaged 14m6s
  - 125ms saved per cache entry seems reasonable

Cache Metrics for an sample run:
no bundled cache:
```
INFO: Cache Metrics:
  FbMemcacheRemoteKernelCache: {hit: 2256, miss: 0, put: 0, exception: 0}
  FbRemoteAutotuneCache: {hit: 0, miss: 0, put: 7, exception: 0}
  FbRemoteFxGraphCache: {hit: 40, miss: 0, put: 0, exception: 0}
  LocalAutotuneCache: {hit: 878, miss: 0, put: 7, exception: 0}
  backend:MemcacheCache: {hit: 2256, miss: 0, put: 7, exception: 0}
  backend:_LocalAutotuneCacheBackend: {hit: 878, miss: 0, put: 7, exception: 0}
  backend:_ManifoldCache: {hit: 40, miss: 0, put: 0, exception: 0}
```
bundled cache:
```
INFO: Cache Metrics:
  FbMemcacheRemoteKernelCache: {hit: 2258, miss: 0, put: 0, exception: 0}
  FbRemoteAutotuneCache: {hit: 0, miss: 0, put: 8, exception: 0}
  FbRemoteBundledAutotuneCache: {hit: 40, miss: 0, put: 0, exception: 0}
  FbRemoteFxGraphCache: {hit: 40, miss: 0, put: 0, exception: 0}
  LocalAutotuneCache: {hit: 878, miss: 0, put: 886, exception: 0}
  backend:MemcacheCache: {hit: 2258, miss: 0, put: 8, exception: 0}
  backend:_LocalAutotuneCacheBackend: {hit: 878, miss: 0, put: 886, exception: 0}
  backend:_ManifoldCache: {hit: 80, miss: 0, put: 0, exception: 0}
```

Differential Revision: D60677499

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134959
Approved by: https://github.com/oulgen
2024-10-11 19:12:41 +00:00
47af7cc962 Add compiler bisector (#131936)
This is a utility to aid the torch.compile debugging. You provide a function that returns True on success, False on failure, or do something out of process and run bisect_helper `good | bad`.

The bisector will first go through backends - `eager`, `aot_eager`, `aot_eager_decomp_partition`, `inductor` to find the first failing backend. Then, it will go through subsystems within the backend - currently limited but could be expanded - and try to find the first subsystem for which disabling fixes the problem. Once it has found the failing subsystem, it will find the number of times the subsystem is applied, and then bisect through it.

An example usage of how to hook it up for aot_eager_decomp_partition and decomposition subsystem is :

```
    from torch._inductor.bisect_helper import BisectionManager
    if op in CURRENT_DECOMPOSITION_TABLE:
        if BisectionManager.disable_subsystem("aot_eager_decomp_partition", "decomposition", lambda: repr(op)):
            return NotImplemented
```

Once it has discovered the problematic change, it will print out the associated debug info, and you can set the same limits with `TORCH_BISECT_BACKEND` `TORCH_BISECT_SUBSYSTEM` and `TORCH_BISECT_MAX`.

We could add further options as an automated way of going through a check list for checking divergence - e.g., the mode to emulate amp casts.

Fix for https://github.com/pytorch/pytorch/issues/126546

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131936
Approved by: https://github.com/ezyang
2024-10-09 20:34:11 +00:00
0414aeacd9 AOTInductor: silence linker warnings about executable stacks (#136924)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136924
Approved by: https://github.com/desertfire
2024-10-09 04:02:30 +00:00
319eda9dfd [inductor] Add API to make post_grad_custom passes cache-able (#137298)
Summary: See https://github.com/pytorch/pytorch/issues/130772

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137298
Approved by: https://github.com/oulgen, https://github.com/eellison
2024-10-07 21:11:54 +00:00