# TLDR
This PR removes the regression in torch.topk introduced from torch 2.7.0 and delivers much better performance for large inputs.
The table below reports execution times on H20 for various input sizes with float32 data, extracting the top-100 values. Results indicate that this PR restores and improves performance, especially on large inputs.
| Input Shape | torch2.6.0 (ms) | torch2.8.0 (ms) | 2.8.0+this PR (ms) |
| -------------- | --------------- | --------------- | ------------------ |
| (1, 1B) | 36.6 | 1564.1 | 25.6 |
| (1, 100M) | 3.56 | 17.4 | 2.54 |
| (1, 1000,000) | 0.135 | 0.145 | 0.098 |
| (512, 128000) | 1.33 | 1.33 | 1.32 |
| (8192, 128000) | 19.6 | 19.6 | 19.4 |
# Background
After upgrading PyTorch from 2.6.0 to 2.7.0, we observed a significant GPU performance regression in `torch.topk` on NVIDIA GPUs. For instance, extracting the top-1000 largest values from one billion floats on an NVIDIA H20 increased from **36 ms** to **1.6 s**.
Profiling with Nsight Compute indicates that the slowdown is caused by redundant memory accesses introduced in [PR #145536](https://github.com/pytorch/pytorch/pull/145536).
# Analysis
`torch.topk` relies on **RadixSelect** to find the target values. Each radix pass requires computing a histogram of the input values. For large inputs, histogram computation is split into two stages:
1. **Local histogram**: Each CUDA block processes a subset of the input and writes its local histogram to global memory.
2. **Global reduction**: A single CUDA block reads all local histograms from global memory and reduces them into the final global histogram.
Before [PR #145536](https://github.com/pytorch/pytorch/pull/145536), both stages ran inside a single kernel (`radixFindKthValues`), using a semaphore to ensure that all local histograms were completed before reduction.
In PR #145536, the global histogram computation was merged with subsequent top-k calculations into a single kernel (`computeBlockwiseKthCounts`) to avoid the semaphore. While this simplifies synchronization, it introduces **redundant memory reads**:
- `computeBlockwiseKthCounts` launches `numInputSlices * blocks_per_slice` blocks.
- For each row (slice), `blocks_per_slice` CUDA blocks redundantly reload the same local histograms from global memory.
# This PR
To address this inefficiency, we introduce the following optimizations:
1. **Dedicated kernel**: Refactor global histogram and cumsum computation into a separate GPU kernel, `computeDigitCumSum`.
2. **Loop unrolling**: Apply loop unrolling in `computeDigitCumSum` to speed up local histogram reads.
# Performance
We benchmarked torch.topk on NVIDIA H20 with float32 inputs, extracting the top-100 values across different input sizes. The results in the table below demonstrate that this PR effectively eliminates the performance regression introduced in 2.7.0 and delivers substantial improvements on large inputs.
| Input Shape | torch2.6.0 (ms) | torch2.8.0 (ms) | 2.8.0+this PR (ms) |
| -------------- | --------------- | --------------- | ------------------ |
| (1, 1B) | 36.6 | 1564.1 | 25.6 |
| (1, 100M) | 3.56 | 17.4 | 2.54 |
| (1, 1000,000) | 0.135 | 0.145 | 0.098 |
| (512, 128000) | 1.33 | 1.33 | 1.32 |
| (8192, 128000) | 19.6 | 19.6 | 19.4 |
Besides, I have verified the correctness of this PR with different inputs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164459
Approved by: https://github.com/ngimel, https://github.com/Skylion007
Summary:
the context module provides configurable context selection + isolation key hashing;
context selection is broken into runtime and compile context. runtime context is decided at call time (inductor configs, precision configs, etc.) and compile context is decided at compile time (hardware type, software hashes).
callees will be given access to SelectedRuntimeContext and SelectedCompileContext, which they can use to determine and select what context is necessary with regards to the function which is being cached.
these selected contexts are wrapped in an IsolationSchema, which denotes what context should be taken into consideration when producing an isolation key. The isolation key is essentially a salt of the function signature key, which says that some function signature key result is valid under a given context (isolation schema)
Test Plan:
```
buck test fbcode//mode/opt caffe2/test/inductor:caching
```
Reviewed By: aorenste
D83714689
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164549
Approved by: https://github.com/aorenste
This PR is to temporarily unblock various experiments to re-use dynamo create fake mode. Note that this is still not what we want as the end state. The end state should look sth like:
```
out = fulllgraph_capture(mod, inputs)
fake_mode = out.backend_inputs.fake_mode
gm = out.module()
```
This doesn't work today because export requires we need to wrap the original module to setup a flat module to trace for easier handling of pytree. As a result, we would need to carry export specific flag in fullgraph_capture which seems not ideal.
Regardless, the end state is that we need to give downstream user a graph module and a fake mode in some form, so I think _dynamo_graph_capture_for_export returning a fake mode within graph module itself via gm.meta
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164730
Approved by: https://github.com/avikchaudhuri
Fixes#89034
Updated tensor_to_numpy() function in tensor_numpy.cpp to handle ZeroTensors by throwing an error if force=False and returning an array full of zeros if force=True.
@ngimel, I just saw that you mentioned PyTorch is not too concerned with this issue but I had already worked on it so I figured I would push it anyways and see what you thought. Feel free to close the PR if you think it is not worth merging.
@albanD
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164487
Approved by: https://github.com/izaitsevfb
Fixes some tests that seemed to start flaking out as reported in #163202, due to cuBLASLt workspaces becoming persistent following that change.
It's relatively obvious why the workspaces/allocations corresponding to them should be cleaned up for `test_memory_snapshot_script` but less obvious for `test_memory_plots_free_segment_stack`? Why does not cleaning up workspace prevent `empty_cache` from showing up?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163299
Approved by: https://github.com/albanD
Fixes#163330
I tried to reproduce the bug with my 4-GPU setup (the original issue used 8 GPUs). I created several different test scenarios, trying to trigger the bug by:
- creating two different device meshes
- slicing them in various ways
- checking if get_root_mesh() would get confused
but the bug didn't show up! Everything worked correctly in `2.10`. I found that there was a massive refactoring of the `DeviceMesh` code (PR #163213) that landed on October 2nd. That PR completely rewrote how `DeviceMesh` tracks relationships between parent meshes and submeshes using. It seems like this refactoring fixed the bug! But I added a regression test to make sure it doesn't come back. The test (`test_get_root_mesh_multiple_independent_meshes`) does exactly what the bug report described:
- creates two independent meshes
- slices them both
- verifies that each submesh correctly points back to its real parent
- makes sure submeshes from mesh1 don't incorrectly claim mesh2 as their parent
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164731
Approved by: https://github.com/fduwjj