This PR also adds source_get_cache to the AOT compile case. Since the guard manager loader code can be shared between the AOT and caching workflows, we add a new function, load_guard_manager, to avoid duplicating the guard-loading logic between the two.
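A hypothetical sketch of the refactor described above: both the caching and the AOT-compile paths go through one shared helper instead of duplicating the guard-loading logic. Only the name load_guard_manager comes from this PR; the surrounding names and arguments are illustrative.
```
def load_guard_manager(guards_state, source_get_cache):
    """Shared guard-manager loading logic used by both workflows (illustrative)."""
    ...

def load_from_cache(entry):
    # Caching path: delegate to the shared helper.
    return load_guard_manager(entry.guards_state, entry.source_get_cache)

def load_aot_compiled(entry):
    # AOT compile path: now also threads source_get_cache through the same helper.
    return load_guard_manager(entry.guards_state, entry.source_get_cache)
```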
Test Plan: test_guard_serialization.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164773
Approved by: https://github.com/yiming0416, https://github.com/dolpm
# TLDR
This PR removes the regression in `torch.topk` introduced in torch 2.7.0 and delivers much better performance for large inputs.
The table below reports execution times on an NVIDIA H20 for various input sizes with float32 data, extracting the top-100 values. The results indicate that this PR restores and improves performance, especially on large inputs.
| Input Shape    | torch 2.6.0 (ms) | torch 2.8.0 (ms) | 2.8.0 + this PR (ms) |
| -------------- | ---------------- | ---------------- | -------------------- |
| (1, 1B)        | 36.6             | 1564.1           | 25.6                 |
| (1, 100M)      | 3.56             | 17.4             | 2.54                 |
| (1, 1M)        | 0.135            | 0.145            | 0.098                |
| (512, 128000)  | 1.33             | 1.33             | 1.32                 |
| (8192, 128000) | 19.6             | 19.6             | 19.4                 |
# Background
After upgrading PyTorch from 2.6.0 to 2.7.0, we observed a significant GPU performance regression in `torch.topk` on NVIDIA GPUs. For instance, the time to extract the top-1000 largest values from one billion floats on an NVIDIA H20 increased from **36 ms** to **1.6 s**.
Profiling with Nsight Compute indicates that the slowdown is caused by redundant memory accesses introduced in [PR #145536](https://github.com/pytorch/pytorch/pull/145536).
# Analysis
`torch.topk` relies on **RadixSelect** to find the target values. Each radix pass requires computing a histogram of the input values. For large inputs, histogram computation is split into two stages (a sketch follows the list):
1. **Local histogram**: Each CUDA block processes a subset of the input and writes its local histogram to global memory.
2. **Global reduction**: A single CUDA block reads all local histograms from global memory and reduces them into the final global histogram.
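A minimal NumPy sketch of this two-stage scheme, for intuition only: the real implementation is a CUDA kernel operating on the radix digits of the value bit patterns, and `RADIX_BITS` and `num_blocks` below are illustrative choices.
```
import numpy as np

RADIX_BITS = 8
RADIX_SIZE = 1 << RADIX_BITS                     # 256 digit bins per radix pass

def local_histograms(x, num_blocks):
    # Stage 1: each "block" histograms its own chunk of the input and writes
    # the result to a (num_blocks, RADIX_SIZE) buffer in "global memory".
    chunks = np.array_split(x, num_blocks)
    return np.stack([
        np.bincount(c & (RADIX_SIZE - 1), minlength=RADIX_SIZE)
        for c in chunks
    ])

def global_histogram(local_hists):
    # Stage 2: a single "block" reduces all local histograms into the final
    # per-digit counts for this radix pass.
    return local_hists.sum(axis=0)

x = np.random.randint(0, 2**31, size=1_000_000, dtype=np.int64)
digit_counts = global_histogram(local_histograms(x, num_blocks=128))
assert digit_counts.sum() == x.size              # every element counted exactly once
```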
Before [PR #145536](https://github.com/pytorch/pytorch/pull/145536), both stages ran inside a single kernel (`radixFindKthValues`), using a semaphore to ensure that all local histograms were completed before reduction.
In PR #145536, the global histogram computation was merged with subsequent top-k calculations into a single kernel (`computeBlockwiseKthCounts`) to avoid the semaphore. While this simplifies synchronization, it introduces **redundant memory reads**:
- `computeBlockwiseKthCounts` launches `numInputSlices * blocks_per_slice` blocks.
- For each row (slice), `blocks_per_slice` CUDA blocks redundantly reload the same local histograms from global memory, so each slice's local histograms are read `blocks_per_slice` times instead of once.
# This PR
To address this inefficiency, we introduce the following optimizations:
1. **Dedicated kernel**: Refactor the global histogram and cumsum computation into a separate GPU kernel, `computeDigitCumSum` (sketched below).
2. **Loop unrolling**: Apply loop unrolling in `computeDigitCumSum` to speed up local histogram reads.
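A hypothetical NumPy sketch of what the dedicated reduction step computes. The name mirrors the PR's `computeDigitCumSum`, but this is pseudocode: the actual kernel is CUDA, and the loop unrolling mentioned above is a CUDA-level detail with no NumPy equivalent.
```
import numpy as np

def compute_digit_cumsum(local_hists):
    # local_hists: (num_slices, blocks_per_slice, RADIX_SIZE) counts produced
    # by the local-histogram stage.
    # Each slice's local histograms are read once here, instead of being
    # re-read by every one of that slice's blocks_per_slice blocks.
    global_hist = local_hists.sum(axis=1)        # (num_slices, RADIX_SIZE)
    return np.cumsum(global_hist, axis=1)        # per-slice digit cumsum
```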
# Performance
We benchmarked `torch.topk` on an NVIDIA H20 with float32 inputs, extracting the top-100 values across different input sizes. The results in the table below show that this PR eliminates the performance regression introduced in 2.7.0 and delivers substantial improvements on large inputs.
| Input Shape    | torch 2.6.0 (ms) | torch 2.8.0 (ms) | 2.8.0 + this PR (ms) |
| -------------- | ---------------- | ---------------- | -------------------- |
| (1, 1B)        | 36.6             | 1564.1           | 25.6                 |
| (1, 100M)      | 3.56             | 17.4             | 2.54                 |
| (1, 1M)        | 0.135            | 0.145            | 0.098                |
| (512, 128000)  | 1.33             | 1.33             | 1.32                 |
| (8192, 128000) | 19.6             | 19.6             | 19.4                 |
In addition, I have verified the correctness of this PR with a variety of inputs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164459
Approved by: https://github.com/ngimel, https://github.com/Skylion007
Summary:
The context module provides configurable context selection plus isolation-key hashing.
Context selection is broken into runtime and compile context. Runtime context is decided at call time (inductor configs, precision configs, etc.), and compile context is decided at compile time (hardware type, software hashes).
Callees are given access to SelectedRuntimeContext and SelectedCompileContext, which they can use to determine and select whatever context is relevant to the function being cached.
These selected contexts are wrapped in an IsolationSchema, which denotes what context should be taken into account when producing an isolation key. The isolation key is essentially a salt of the function signature key: it says that a given function signature key's result is valid under a given context (isolation schema).
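A hypothetical sketch of how these pieces could fit together. The class names follow the summary above, but the fields and hashing details are illustrative only, not the actual caching module API.
```
import hashlib
import json
from dataclasses import dataclass, field

@dataclass
class SelectedRuntimeContext:
    # Decided at call time, e.g. inductor configs, precision configs.
    values: dict = field(default_factory=dict)

@dataclass
class SelectedCompileContext:
    # Decided at compile time, e.g. hardware type, software hashes.
    values: dict = field(default_factory=dict)

@dataclass
class IsolationSchema:
    # Denotes which selected context participates in the isolation key.
    runtime: SelectedRuntimeContext
    compile: SelectedCompileContext

def isolation_key(function_signature_key: str, schema: IsolationSchema) -> str:
    # Salt the function signature key with the selected context, so a cached
    # result is only considered valid under a matching context.
    salt = json.dumps(
        {"runtime": schema.runtime.values, "compile": schema.compile.values},
        sort_keys=True,
    )
    return hashlib.sha256((function_signature_key + salt).encode()).hexdigest()
```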
Test Plan:
```
buck test fbcode//mode/opt caffe2/test/inductor:caching
```
Reviewed By: aorenste
D83714689
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164549
Approved by: https://github.com/aorenste
This PR temporarily unblocks various experiments that need to re-use the fake mode dynamo creates. Note that this is still not what we want as the end state. The end state should look something like:
```
out = fullgraph_capture(mod, inputs)
fake_mode = out.backend_inputs.fake_mode
gm = out.module()
```
This doesn't work today because export requires wrapping the original module to set up a flat module to trace, for easier pytree handling. As a result, we would need to carry an export-specific flag in fullgraph_capture, which seems not ideal.
Regardless, the end state is that we need to give downstream users a graph module and a fake mode in some form, so I think it is reasonable for _dynamo_graph_capture_for_export to return the fake mode within the graph module itself via gm.meta.
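Concretely, the interim usage could look something like the following, mirroring the pseudocode above; the exact call signature of _dynamo_graph_capture_for_export and the gm.meta key name are assumptions made for illustration.
```
# Hypothetical interim usage under this PR; signature and meta key are assumptions.
gm = _dynamo_graph_capture_for_export(mod, inputs)   # hypothetical signature
fake_mode = gm.meta.get("fake_mode")                 # fake mode carried on the graph module
```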
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164730
Approved by: https://github.com/avikchaudhuri
Fixes #89034
Updated the tensor_to_numpy() function in tensor_numpy.cpp to handle ZeroTensors: it throws an error if force=False and returns an array full of zeros if force=True.
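A usage sketch of the behavior described above. It assumes the private helper torch._efficientzerotensor produces a ZeroTensor, and that the force=False path raises a RuntimeError (the exact error type is an assumption here).
```
import torch

zt = torch._efficientzerotensor((2, 3))   # assumed way to obtain a ZeroTensor

try:
    zt.numpy()                    # force=False: expected to raise for ZeroTensors
except RuntimeError as e:
    print("expected error:", e)

arr = zt.numpy(force=True)        # force=True: returns an all-zeros ndarray
print(arr.shape, arr.sum())       # (2, 3) 0.0
```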
@ngimel, I just saw that you mentioned PyTorch is not too concerned with this issue, but I had already worked on it, so I figured I would push it anyway and see what you think. Feel free to close the PR if you think it is not worth merging.
@albanD
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164487
Approved by: https://github.com/izaitsevfb