We only want to cache the latest CI docker image for the `main` and `release` branches in cases where multiple `docker-builds` workflow runs get triggered in quick succession. This is because the latest run will overwrite the cached images anyway: we do not maintain a cached image per SHA, only one per branch (to minimize cache size and docker load times at runner bringup).
Also removes `workflow_dispatch` as a trigger since it won't work (it needs artifacts from a `docker-builds` run).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/168104
Approved by: https://github.com/jeffdaily
Summary:
`-Wunused-exception-parameter` has identified an unused exception parameter. This diff removes it.
This:
```
try {
...
} catch (exception& e) {
// no use of e
}
```
should instead be written as
```
} catch (exception&) {
```
If the code compiles, this is safe to land.
Test Plan: Sandcastle
Reviewed By: dtolnay
Differential Revision: D87273132
Pull Request resolved: https://github.com/pytorch/pytorch/pull/168056
Approved by: https://github.com/malfet, https://github.com/Skylion007
For GPU: It was previously reported that only a single sample could be tested with the huber_loss functional. The current snapshot of the code does not appear to suffer from the numerical issues reported before.
For CPU: While testing GPU, it was discovered that the Half computation appears to be numerically unstable. This commit resolves the issue on CPU by upcasting Half to float for the computation.
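A quick way to sanity-check the Half path against a float reference (a sketch, not the PR's actual test; shapes and tolerances are illustrative, and it assumes the Half CPU path is available, which this change addresses):
```python
import torch
import torch.nn.functional as F

x = torch.randn(64, dtype=torch.half)
t = torch.randn(64, dtype=torch.half)
ref = F.huber_loss(x.float(), t.float())   # float reference
out = F.huber_loss(x, t)                   # Half path, now upcast internally on CPU
torch.testing.assert_close(out.float(), ref, rtol=1e-3, atol=1e-3)
```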
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166952
Approved by: https://github.com/benjaminglass1, https://github.com/isuruf
The bucketing dtype fusing was causing nodes which had dependencies to be erased. Transfer those deps over to the new nodes, and also add an assertion that none of our deps are erased to catch this type of error in the future.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167863
Approved by: https://github.com/fmassa
ghstack dependencies: #167852, #167853
The all-gather bucketing was partway to fusing dtype casts into the bucket. We do this by allocating the group bucket buffer, then viewing each slice of it as the destination dtype. We then `foreach_copy_` into the allocated buffer, with each collective copying into its destination dtype.
This logic was causing an issue in a later part of the stack, but was not fully firing, so we might as well fix it.
Note: custom ops don't yet support `list[dtype]`, so I worked around it with `list[int]`, but will fix that in a follow-up.
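A conceptual sketch of the bucket-with-cast scheme described above (shapes, dtypes, and the byte-typed bucket buffer are illustrative, not the pass's actual implementation):
```python
import torch

srcs = [torch.randn(4), torch.randn(6)]          # per-collective inputs (fp32 here)
dst_dtypes = [torch.bfloat16, torch.float16]     # destination dtypes inside the bucket
nbytes = [s.numel() * d.itemsize for s, d in zip(srcs, dst_dtypes)]

bucket = torch.empty(sum(nbytes), dtype=torch.uint8)   # one group bucket buffer
views, off = [], 0
for d, n in zip(dst_dtypes, nbytes):
    views.append(bucket[off:off + n].view(d))          # view each slice as its destination dtype
    off += n

torch._foreach_copy_(views, srcs)   # each copy casts into its destination dtype
```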
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167853
Approved by: https://github.com/ruisizhang123
ghstack dependencies: #167852
This PR introduces a `Tensor` subclass which represents a complex tensor in terms of two real ones. Ops are decomposed as individual ops on the real and imaginary parts.
It is compatible with `torch.compile`, so long as the real ops used are also compatible. Autograd "works", but is WIP due to different edge-case behaviour.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167621
Approved by: https://github.com/ezyang
Summary: The export_memory_timeline method in torch.profiler is being deprecated in favor of the newer memory snapshot API (torch.cuda.memory._record_memory_history and torch.cuda.memory._export_memory_snapshot). This change adds the deprecated decorator from typing_extensions and updates the docstring to guide users to the recommended alternative. The decorator will emit a FutureWarning at runtime, and the docstring now includes a .. deprecated:: directive for documentation visibility.
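A minimal sketch of the recommended replacement mentioned in the docstring (file name illustrative):
```python
import torch

torch.cuda.memory._record_memory_history(max_entries=100_000)
# ... run the workload to be profiled ...
torch.cuda.memory._export_memory_snapshot("memory_snapshot.pickle")
torch.cuda.memory._record_memory_history(enabled=None)  # stop recording
```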
Test Plan: Manual verification that the decorator is properly applied and the deprecation message is informative.
Differential Revision: D87272399
Pull Request resolved: https://github.com/pytorch/pytorch/pull/168036
Approved by: https://github.com/valentinandrei
This PR outputs chars to stream without building temporary strings.
They were modified by running (in fish)
```
sed -i -e 's/<< "\([^\\\']\)"/<< \'\1\'/g' (grep '<< "."' -r torch c10 aten -l)
```
and then reverting some invalid changes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167899
Approved by: https://github.com/Skylion007
Splits each torch library registration in the 2.10 folder into its own file -- I had a script that parsed kernel.cpp to do this, but I felt like forcing this responsibility on the user might be less error-prone.
Compiles each file targeting 2.9 and asserts that compilation fails. (There are two 2.9 kernels we use as negative tests where compilation is expected to succeed.)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167962
Approved by: https://github.com/janeyx99
ghstack dependencies: #168025, #167802, #167803, #167804
Fix for this issue in the DSV3 autobucketing pass: https://github.com/pytorch/torchtitan/issues/2037. Users should now be able to run DSV3 autobucketing end-to-end.
It fixes three things:
(1) A bug in NCCL estimation support for all-to-all.
(2) For dynamic token dispatch/combine in MoE, adds a fallback value hint to all-to-all's collective size estimation.
(3) Previously, for the schedulable-node check, I directly modified `is_wait` in bucketing.py. It is safer to add these criteria in overlap_scheduling.py as another function, `_schedulable_wait_node`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167797
Approved by: https://github.com/eellison
This is a cleaned up version of the POC at https://github.com/pytorch/pytorch/pull/167752/files
This PR adds an inductor option which you can pass into torch.compile that wraps all inductor-generated code in a HOP, allowing it to be read by torch dispatches.
This hop is created in output_code.post_compile, so it's cache safe. The configuration to turn it on is part of `inductor_config`, and therefore already part of the cache key. I've added a test that shows this HOP is cache safe.
Because this wrapper occurs at compile time, there should be little to no cpu overhead from creating it, besides that of actually processing the torch_dispatches themselves.
The context here is we want to be able to support compiled regions such as flex attention in eager mode, while working with other torch dispatch tracers like SAC. Will add more tests for SAC/flex attention specific things next.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167844
Approved by: https://github.com/ezyang
This adds a `list` Store API and implements it for all backends.
This is intended to be used for debugging and will allow inspecting all keys in a store locally as well as remotely in the case of TCPStore.
Test plan:
```
pytest test/distributed/test_store.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167883
Approved by: https://github.com/fduwjj
With smaller, aten nodes, we might want to overlap a single collective with multiple nodes. Updates the overlapping, and bucketing code so that a collective can be hidden by multiple nodes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167847
Approved by: https://github.com/fmassa
Following the previous implementation of HOP print, this continues to enable HOP print for dynamo, so as to support eager fullgraph and the aot_eager backend for torch.compile. With this implementation, HOP print supports stateful printing without causing a graph break. With the prior built-in print handling, dynamo was able to reduce graph breaks but unable to eliminate them; this enables format-based printing for that purpose in dynamo.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167571
Approved by: https://github.com/angelayi
ghstack dependencies: #167016
Fixes #167091
DTensor convolution operations crashed when bias=None was passed with
torch.compile because the code assumed bias always exists, but the ATen
schema defines it as optional (Tensor?).
This fix:
- Handles None bias_spec in convolution_rules (forward pass)
- Handles None bias_shape_opt in convolution_backward_rules
- Returns None for grad_bias_spec when bias is None
- Extends None output handling to indices 0,1,2 in _sharding_prop.py
Added 3 regression tests covering compile mode, backward pass, and
nn.Conv2d module API with bias=False.
This is related to issue https://github.com/pytorch/pytorch/issues/159959 and this PR https://github.com/pytorch/pytorch/pull/165438 that resolves it, overlapping in the `_sharding_prop.py` change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167258
Approved by: https://github.com/XilunWu
Summary:
Noticed this bug with subproc autotuning while working on async autotuning.
The created subprocs don't respect changes to cache dirs, specifically the Triton cache dir, which causes subproc autotuning to cache miss on otherwise-cached Triton kernels; the net effect is that precompile in the gemm autotuning path became an expensive no-op.
On the torchbench model I tested with, compile time with subproc autotuning went down from ~1k seconds to ~500 seconds, now matching in-process autotuning.
Test Plan: CI
Differential Revision: D87170069
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167918
Approved by: https://github.com/aorenste
Fix for https://github.com/pytorch/pytorch/issues/166653.
Two fixes:
- We were inducing a split for broadcasted loads, e.g. `(x // 16)`. While a split of 16 here will make the load coalesced in one of the tile vars, the load is already in cache, so it's not worth splitting, and it would make the other tile var load from memory that isn't in cache.
- Add a slight term for uncoalesced memory. This prevents tiling for loads which are a small percentage of the overall kernel.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167771
Approved by: https://github.com/v0i0
This is a follow-up to [#154333](https://github.com/pytorch/pytorch/pull/154333), where I initially introduced a fallback mechanism in deserialize_torch_artifact.
In this revised PR:
Cleaned up commit history for clarity and reproducibility.
Relocated the test into the TestSaveLoad class in test_serialize.py.
The previous PR had inconsistencies due to local branch issues and was closed in favor of this cleaner submission.
Feedback is very welcome
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158247
Approved by: https://github.com/angelayi
Summary:
Adds support for symint placeholders.
Added two test cases with dynamic reshape:
- dynamic info coming from tmd on placeholders
- dynamic info coming from placeholders (symints)
Test Plan:
test_reshape_dynamic_ph
test_reshape_dynamic_tmd
Differential Revision: D86984100
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167757
Approved by: https://github.com/blaine-rister
Per title:
1) Allows the `self` argument to have the same precision as the output.
2) Fixes broadcasting of the `self` argument - it used to allocate an incorrectly sized output and resize it later, causing a warning in addmm and erroring out in baddbmm.
3) Fixes `out` handling for the `out` baddbmm overload, where the implementation used uninitialized memory in `out` instead of copying `self` to `out`.
4) Removes a couple of unneeded IIFE patterns.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167931
Approved by: https://github.com/PaulZhang12, https://github.com/drisspg, https://github.com/malfet
Fixes #158429
Updated LogAddExpKernel.cu to allow for complex numbers. Also updated the unit test to run test_logaddexp on CUDA with complex data types, and added a unit test in test_linalg.py to compare results between CUDA and CPU.
@drisspg
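A quick check of the new complex support (a sketch; requires a CUDA build, and mirrors the CUDA-vs-CPU comparison the new test performs):
```python
import torch

a = torch.randn(4, dtype=torch.complex64, device="cuda")
b = torch.randn(4, dtype=torch.complex64, device="cuda")
torch.testing.assert_close(torch.logaddexp(a, b).cpu(), torch.logaddexp(a.cpu(), b.cpu()))
```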
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163509
Approved by: https://github.com/isuruf
Inductor may treat an outer reduction as an inner reduction when the reduction ranges contain a 1. This causes a weird issue where we skip fusing with mix-order reduction. While I'm still debugging why that happens, I think we should fix the decision here anyway.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167697
Approved by: https://github.com/jansel, https://github.com/v0i0
The following tests are failing with Python 3.14 on a Linux machine:
* TestSetAffinity::test_set_affinity_in_worker_init
  * Why? 3.14 makes `forkserver` the default start method for multiprocessing. With it, local functions are not picklable and the unit test fails.
* TestIndividualWorkerQueue::test_ind_worker_queue
  * Why? The test was hitting the timeout. This is also related to the start method. I am increasing the timeout and reducing batch-size iterations to reduce total unit test time.
* Fixes https://github.com/pytorch/pytorch/issues/68643
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167429
Approved by: https://github.com/aelavender, https://github.com/ramanishsingh
getAllOperatorsFor returns a const reference to internal state that is protected by a lock. Presuming that the lock is necessary in the first place (about which I offer no opinion because it's unclear to what extent the GIL should help here), this is a straightforward way to cause callers to create race conditions.
This should fix those race conditions by copying the state instead. I modified calling code to stop binding a const reference to the result for clarity.
Differential Revision: [D87088731](https://our.internmc.facebook.com/intern/diff/D87088731/)
**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D87088731/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167860
Approved by: https://github.com/zou3519
Make the PyObject preservation scheme thread-safe with free threaded (nogil) Python. The general idea is:
* Python Tensor and Storage objects always hold a strong reference to their underlying c10 object
* c10 objects hold a strong reference to their Python objects if there's at least one other reference to the c10 object
This is implemented in `intrusive_ptr`:
* The topmost bit (`kHasPyObject`) of the weakref count is now used to indicate whether the `intrusive_ptr_target` has an associated PyObject. So `kHasPyObject` is one bit, the weakref count is now 31 bits, and the strong refcount remains 32 bits.
* When the reference count increases from one to two and `kHasPyObject` is set, we incref the associated Python object to ensure that it's kept alive.
* When the reference count decreases from two to one (i.e., there are no C++ references to the `intrusive_ptr_target` other than from the Python object), we decref the associated Python object to break the cycle.
Other benefits:
* We can delete a lot of the copypasta from Python internal `subtype_dealloc`
* This fixes the weakref and GC bugs we had in the previous scheme. Python weakrefs on Tensors and Storages should just work as expected now.
Risks:
* Extra branch for reference count operations on `intrusive_ptr<TensorImpl>`, `intrusive_ptr<StorageImpl>`, and the generic `intrusive_ptr<intrusive_ptr_target>` even when we're not using Python.
* It's a big change
(Second attempt at https://github.com/pytorch/pytorch/pull/166342)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167564
Approved by: https://github.com/albanD, https://github.com/Skylion007
Update the torch-xpu-ops commit to [intel/torch-xpu-ops@1e69f4](1e69f40b3c), includes:
- Add PTL in the default AOT target list for both Win and Lin
- Use PyTorch p2p API in Copy kernel
- Add event cache and event timing to XCCL
- Add Float8_e8m0fnu support for copy
- Add CMAKE_SYCL_COMPILER_LAUNCHER for sccache
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167698
Approved by: https://github.com/EikanWang
The PR #167401 reminded me that the removal of old NVTX interface is long overdue, as the header-only NVTX3 has been around for more than 5 years and is shipped with all CUDA Toolkit versions of 12+. In addition to that, `libnvToolsExt.so` was removed in CUDA Toolkit 13 and onward.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167637
Approved by: https://github.com/eqy
This PR fixes a bug where `torch.clamp` on MPS fails when min/max tensors have more dimensions than the input tensor.
CPU already supports this broadcasting, but MPS raised a RuntimeError.
Example of failing case before the fix:
```python
x = torch.randn(2, 3, device="mps")
min_t = torch.randn(1, 2, 3, device="mps")
max_t = torch.randn(1, 2, 3, device="mps")
torch.clamp(x, min=min_t, max=max_t) # RuntimeError
```
After this fix, MPS matches CPU behavior.
Fixes #160734
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165058
Approved by: https://github.com/malfet
Summary: This diff would be a follow-up diff for D85883723.
Test Plan:
See D86719598. We are now able to publish the model.
Unit test:
```
buck run fbcode//mode/opt -c remoteexecution.local=enabled fbcode//sigmoid/inference/test:test_passes -m ovr_config//triton:experimental -- -r test_triton_hop_cpu
```
Differential Revision: D87091238
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167862
Approved by: https://github.com/XueningXu
# Description
Fixes #114850; we port the test utils and schema check to Intel GPU.
We enable Intel GPU with the following methods while trying our best to keep the original code style:
# Changes
1. Get the device type with the accelerator and get_devtype helper methods.
2. Replace the requires-CUDA statements with device_type.
3. Add HAS_XPU and HAS_GPU checks to replace some of the existing checks.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166684
Approved by: https://github.com/ezyang, https://github.com/guangyey
Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
This PR moves the implementations of Tensor accessor classes to headeronly with the following modifications:
- Add ArrayRef and IndexBoundsCheck template parameters to refactor out the usages of `IntArrayRef` and `TORCH_CHECK_INDEX` from Tensor accessor implementations.
- Eliminate usage of `c10::irange` as it is not headeronly-compatible.
- Introduce `torch::headeronly::{TensorAccessorBase,TensorAccessor, GenericPackedTensorAccessorBase, GenericPackedTensorAccessor}` that are headeronly-equivalent to `at::{TensorAccessorBase,TensorAccessor, GenericPackedTensorAccessorBase, GenericPackedTensorAccessor}`. Both these sets of template classes use original implementations from `torch::headeronly::detail` that have new template parameters `ArrayRefCls` and `IndexBoundsCheck` to facilitate `at` and `torch::headeronly` implementations of ArrayRef and checking indices.
TODO:
- ~when https://github.com/pytorch/pytorch/pull/164991 lands, eliminate the placeholder class HeaderOnlyArrayRef~ UPDATE: done.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166855
Approved by: https://github.com/janeyx99
Summary:
For specific hardware (A100), Autocast will generate a relatively large error on Transformer (torch.nn.TransformerEncoder) when using the no_grad decorator with dim=256 (and larger, presumably).
H100 seems fine, as does A100 with MIG (so fewer than the full SMs).
For now, backing this out and revisiting next week.
Test Plan:
failed jobs:
https://fburl.com/scuba/remote_execution_action/jzcmujgk
{F1983543613}
Reviewed By: t-ivan-gr
Differential Revision: D87111518
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167884
Approved by: https://github.com/malfet
Summary:
`-Wunused-exception-parameter` has identified an unused exception parameter. This diff removes it.
This:
```
try {
...
} catch (exception& e) {
// no use of e
}
```
should instead be written as
```
} catch (exception&) {
```
If the code compiles, this is safe to land.
Test Plan: Sandcastle
Differential Revision: D85813836
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167604
Approved by: https://github.com/malfet, https://github.com/seemethere
Fixes #167767.
Original issue was that using std::to_string(value) does not work intended here if the value is smaller than 1e-6. The caching keys ended up as `clamp_out_mps_min:0.000000_scalar::f32[1]` instead of `clamp_out_mps_min:0.0000001_scalar::f32[1]`. After the change the values are stored as the hex representation for the floating point number. So for min_value 1e-7 the key will be `impl_min:0x1.ad7f2ap-24_scalar::f32[1]` and for min_value 0.0 `clamp_out_mps_min:0x0p+0_scalar::f32[1]`
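The collision is easy to see with Python's fixed-point vs. hex formatting (Python's `float.hex` shows the double-precision analogue of the float32 hex used in the key):
```python
print(f"{0.0:.6f}", f"{1e-7:.6f}")   # '0.000000' '0.000000' -> identical key fragments
print((0.0).hex(), (1e-7).hex())     # '0x0p+0' vs '0x1.ad7f2...p-24' -> distinct keys
```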
Output of the repro code before the change:
```
tensor([0.], device='mps:0')
tensor([0.], device='mps:0')
tensor([0.], device='mps:0')
tensor([0.], device='mps:0')
tensor([0.], device='mps:0')
tensor([1.0000e-07], device='mps:0')
tensor([0.], device='mps:0')
tensor([1.0000e-07], device='mps:0')
```
Output for the repro code after the change:
```
tensor([0.], device='mps:0')
tensor([1.0000e-07], device='mps:0')
tensor([0.], device='mps:0')
tensor([1.0000e-07], device='mps:0')
tensor([0.], device='mps:0')
tensor([1.0000e-07], device='mps:0')
tensor([0.], device='mps:0')
tensor([1.0000e-07], device='mps:0')
```
which matches the expected CPU reference.
Snippet to test with:
```
import torch
device='mps'
dtype=torch.float32
a = torch.zeros(1, device=device, dtype=dtype)
# the following line triggers the incorrect behavior, when commented, the remainder of the script appears to work as expected
a_clamped = a.clamp(min=0.0)
b = torch.zeros(1, device=device)
print(b)
c = b.clamp(min=1e-7)
print(c)
b = torch.zeros(1, device=device)
print(b)
c = b.clamp(min=1e-7, max=None)
print(c)
b = torch.zeros(1, device=device)
print(b)
c = b.clamp(min=1e-7, max=torch.inf)
print(c)
b = torch.zeros(1, device=device)
print(b)
c = b.clamp_min(1e-7)
print(c)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167777
Approved by: https://github.com/malfet
# Summary
This PR optimizes the CUDA kernels for `torch.nn.EmbeddingBag` by reducing GPU register pressure introduced by `CUDA_KERNEL_ASSERT`, which improves kernel occupancy and overall performance. The optimization separates input validation into a dedicated loop before the main processing loop, allowing the compiler to better optimize register allocation. By extensively testing on various GPUs and CUDA versions, `torch.nn.EmbeddingBag` performance improves by 29% to 111% with this PR.
# Performance Results
The following table shows the performance improvements on various input distributions and GPUs. All benchmarks use PyTorch 2.9.0 compiled with CUDA 12.8.
**Input Distribution Types (simulating recommendation system ID patterns):**
- **random id**: Randomly sampled embedding indices from the full vocabulary (uniform distribution)
- **one-hot**: One ID appears with very high frequency across all bags, simulating a popular item in recommendation systems
- **multi-hot**: Multiple IDs appear with high frequency across all bags, simulating multiple popular items in recommendation systems
**Test Configuration:**
- Embedding shape: `(5000000, 128)` (5M vocabulary size, 128-dimensional embeddings)
- Batch size: 2048 bags
- Average bag size: 150 indices per bag
| GPU | Input Distribution | Before (µs) | After (µs) | Speedup |
| ---- | ------------------ | ----------- | ---------- | ------- |
| H100 | random id | 162.4 | 105.9 | 1.53× |
| H100 | one-hot | 120.4 | 88.6 | 1.36× |
| H100 | multi-hot | 113.1 | 87.8 | 1.29× |
| H20 | random id | 278.6 | 132.2 | 2.11× |
| H20 | one-hot | 189.7 | 110.3 | 1.72× |
| H20 | multi-hot | 172.4 | 107.4 | 1.61× |
# Motivation
The original implementation performed bounds checking using `CUDA_KERNEL_ASSERT` inline within the main processing loop, which increased register pressure and limited GPU occupancy. From NSight Compute analysis on H20, using PyTorch 2.9 compiled with CUDA 12.8, removing the `CUDA_KERNEL_ASSERT` from the main loop with this PR increases the overall occupancy from 50% to 75% (registers per thread: 52 -> 40).
By separating validation into a dedicated loop, we:
1. **Reduce register pressure in the main loop**: The validation loop uses minimal registers, allowing the compiler to optimize the main processing loop independently with better register allocation.
2. **Maintain correctness**: All input validation is still performed, but in a more register-efficient manner.
# Changes
## Modified Kernels
1. **`EmbeddingBag_updateOutputKernel_max`**: Added separate validation loop before main processing
2. **`EmbeddingBag_updateOutputKernel_sum_mean`**: Added separate validation loop before main processing
## Key Implementation Details
- **Separate validation loop**: Input indices are validated in a dedicated loop that checks all indices before processing begins
- **No early exit**: The validation loop intentionally avoids using `break` for early exit, as benchmarking showed that early exit degrades performance, possibly due to increased branch divergence and reduced instruction-level parallelism
- **Consistent error messages**: Improved error message clarity for invalid input indices
- **Design choice: validation loop vs. separate kernel**: We considered removing `CUDA_KERNEL_ASSERT` entirely and performing bounds checking in a separate GPU kernel, which would achieve even better performance (e.g., on H20 with random id distribution: 132.2 µs → 124.6 µs). However, this approach is harder to maintain as it requires coordinating two separate kernel launches and managing additional kernel launch overhead. Instead, we chose the current approach of using a separate validation loop within the same kernel, which provides a good balance between performance improvement and code maintainability.
## Code Changes
```cpp
// Separate validation loop reduces register pressure in the main loop below.
// No early exit (break) on invalid input as benchmarking shows it degrades performance.
bool has_invalid_index = false;
for (int64_t emb = begin; emb < end; emb++) {
index_t input_idx = input[emb];
has_invalid_index = has_invalid_index || (input_idx < 0 || input_idx >= numRows);
}
CUDA_KERNEL_ASSERT(!has_invalid_index && "Invalid input index in EmbeddingBag: index out of range [0, numRows)");
// Main processing loop (now with reduced register pressure)
for (int64_t emb = begin; emb < end; emb++) {
// ... processing logic ...
}
```
# Testing & Compatibility
## Performance Testing
I conducted extensive performance testing across multiple configurations. All tests show significant performance improvements:
**Tested CUDA Versions:**
- CUDA 12.6, 12.8, 13.0
**Tested GPU Architectures:**
- A100, H20, H100
**Tested Input Configurations:**
- **Embedding shapes**: Various sizes including `[5000000, 128]` and `[128000, 4096]`
- **Embedding dtypes**: `torch.float32`, `torch.float16`
- **Input distributions**: Random indices, one-hot (high-frequency single ID), and multi-hot (high-frequency multiple IDs) patterns, simulating recommendation system workloads
- **Input sizes**: Average bag sizes of 150, 20, and 10 indices per bag
## Correctness Testing
- ✅ Correctness tests pass for various embedding types (bfloat16, float32), shapes, and input distributions
- ✅ Register usage reduction verified with NSight Compute
- ✅ Linter passes
## Compatibility
- ✅ No API/ABI changes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167834
Approved by: https://github.com/ngimel, https://github.com/eqy
Summary:
`torch._scaled_mm_v2` didn't have a valid meta registration, or
`FakeTensor` tests, so anything expecting inductor to work (like
torch.ao tests) would fail horribly.
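For context, this is the general shape of a fake/meta registration, shown here for a hypothetical custom op (the actual change registers a meta kernel for `aten::_scaled_mm_v2` itself):
```python
import torch

torch.library.define("demo::row_major_mm", "(Tensor a, Tensor b) -> Tensor")

@torch.library.register_fake("demo::row_major_mm")
def _(a, b):
    # only shape/dtype propagation runs under FakeTensor; no real compute
    return a.new_empty((a.shape[0], b.shape[1]))
```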
Test Plan:
```
pytest -sv -k "scaled_mm_v2" test/test_ops.py
```
Signed-off-by: Simon Layton <simonlayton@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167653
Approved by: https://github.com/drisspg
Summary: This moves torchfrtrace to be under `torch.distributed.flight_recorder` instead of `tools.flight_recorder` as the `tools` package is not included in the torch wheels. This makes it so you can use fr trace analyze without using it from a source checkout
Test Plan:
```
buck run //caffe2/fb/flight_recorder:fr_trace
```
CI
Differential Revision: D87022129
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167782
Approved by: https://github.com/fduwjj
This PR adds 2 new CI jobs to run dynamo core (`test/dynamo/*`) and
`dynamo_wrapped` tests on Python 3.11/3.12.
**Selected Machine**
Tests are executed on `linux.c7i.2xlarge` without GPU. Which means all
cuda tests (if any) are skipped.
**Runtime**
- The core tests takes 30 minutes to run
- The `dynamo_wrapped` test is divided into three shards and each one
takes around 1.5 hours to execute
**Schedule**
Tests are executed every day at 1:29 PDT or in the presence of
`ciflow/dynamo` label
Co-authored-by: Rob Timpe <rtimpe@openteams.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166978
Approved by: https://github.com/atalman, https://github.com/malfet
ghstack dependencies: #167092
Summary: I didn't understand what this macro was doing so I created a bit of a mess, mess be gone!
Test Plan: `buck2 ctargets fbcode//caffe2/... fbsource//xplat/caffe2/...`
Reviewed By: mzlee
Differential Revision: D86460608
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167690
Approved by: https://github.com/seemethere
Summary:
Support semantics when using backed_size_oblivious, similar to https://github.com/pytorch/pytorch/pull/167232
We see errors in a model exported with dynamic shapes, like
```
RuntimeError: non-broadcasting semantics require s67 == 41
While executing %expand : [num_users=1] = call_method[target=expand](args = (%reshape_5, -1, -1, %getitem_9), kwargs = {})
```
Test Plan:
test_dynamic_shapes:
```
test_backed_size_oblivious_expand (test_dynamic_shapes.TestUbackedOps) ... I1112 14:07:54.724596 1386932 Logger.cpp:995] Dropping logs in unit tests.
ok
```
Differential Revision: D86902546
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167689
Approved by: https://github.com/laithsakka
Fixes #161858
### Summary:
Added comprehensive documentation examples for `torch.is_storage()` to help users understand how to check if an object is a PyTorch storage object.
### Impact:
- Enhances API Documentation
- Helps users distinguish between PyTorch storage objects and other types
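A sketch of the kind of example being added (the exact doc snippet may differ):
```python
import torch

x = torch.tensor([1.0, 2.0, 3.0])
torch.is_storage(x)                    # False: a Tensor is not a storage object
torch.is_storage(x.untyped_storage())  # True: the tensor's underlying storage
```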
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161898
Approved by: https://github.com/isuruf, https://github.com/malfet
Fixes #89629
When using torch.ops.aten.convolution_backward, the optional argument bias_sizes was being used in the Python function registration without checking whether it was defined.
## For the fix
There are two modes to consider, with different results.
First, @dynamo.optimize("inductor") is the most demanding.
We cannot be wrong about the size passed into the function, but we should not ignore what the user wants/thinks they are doing. For this case, we want to throw an error when the user is wrong. If the user passes in None, we calculate the expected size directly.
Second, @dynamo.optimize("eager") is very lenient.
We really can provide any value we want here. If the user is wrong about the bias shape in eager mode, the op will just reshape the bias to the proper size, so no error is thrown here.
## For testing
An OpInfo was added for torch.ops.aten.convolution_backward.default.
For the CUDA test_noncontiguous_samples test, a slightly updated error tolerance was necessary for the compounded add multiply (for 2x2 kernel).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165008
Approved by: https://github.com/bdhirsh
Fixes #165427
## Description of Bug 🐛
As reported in #165427, when both inputs of `atan2` are zero, the gradient becomes `NaN`. During the forward pass, `atan2` successfully avoids the division-by-zero issue, but during backpropagation the gradients become `NaN`.
This is because the backward pass calculates `(self * self + other * other).reciprocal()`, which becomes `inf` at `(0, 0)`. The subsequent multiplication by zero `(0 * inf)` results in `NaN`.
## Changes
- Added an `at::where` condition to handle zero denominators in `atan2_backward`.
- If the denominator is zero, return 0 for the reciprocal; otherwise, use the original value (see the sketch below).
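A conceptual Python equivalent of the guarded reciprocal (the actual fix lives in the C++ backward formula; names here are illustrative):
```python
import torch

def atan2_grad_scale(self, other):
    # d/dself atan2 = other / (self^2 + other^2); d/dother = -self / (self^2 + other^2)
    denom = self * self + other * other
    return torch.where(denom == 0, torch.zeros_like(denom), denom.reciprocal())
```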
## Testing
- Added `test_atan2_zero_gradient` in `test/test_autograd.py` to verify `atan2` returns `0.0` gradients for `(0, 0)`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166787
Approved by: https://github.com/soulitzer
Summary:
`-Wunused-exception-parameter` has identified an unused exception parameter. This diff removes it.
This:
```
try {
...
} catch (exception& e) {
// no use of e
}
```
should instead be written as
```
} catch (exception&) {
```
If the code compiles, this is safe to land.
Test Plan: Sandcastle
Differential Revision: D85813824
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167612
Approved by: https://github.com/seemethere, https://github.com/malfet
When we know that all tensors and intermediate tensors fit in 32 bits but use unbacked dynamic shapes,
we want a way to assume that we can use 32-bit indexing (we will runtime-assert on it).
It is not practical to torch._check every possible intermediate tensor size ahead of time.
This is needed to enhance vLLM perf with unbacked symbols, since in vLLM all tensors and
intermediates are assumed to fit in 32 bits.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167784
Approved by: https://github.com/jansel
Summary:
Prior to this PR we would always build global and torch function guards in all cases.
In this PR we did 2 changes to dynamo guards:
1. Created a new guard called "GLOBAL_STATE" which corresponds to the global state guard and can be filtered out using guard_filter_fn
2. Repurpose the existing "TORCH_FUNCTION_STATE" guard for checking torch function mode stack.
Also added a new helper `torch.compiler.skip_all_guards_unsafe` which can be useful for use cases like vllm
Test Plan:
CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167674
Approved by: https://github.com/anijain2305
This PR enables ROCm/HIP support for PyTorch's StaticCudaLauncher, which provides static compilation and launching of Triton kernels. The implementation has been tested on AMD MI300 and MI200 hardware.
**Changes**
**Python (torch/_inductor/runtime/)**
- static_cuda_launcher.py: Added ROCm detection, .hsaco binary support, and ROCm-specific scratch parameter handling
- triton_heuristics.py: Updated device type checks to support both cuda and hip
**C++ (torch/csrc/)**
- Module.cpp: Enabled StaticCudaLauncher for ROCm builds
- inductor/static_cuda_launcher.cpp: Added HIP API equivalents for all CUDA driver calls
- inductor/static_cuda_launcher.h: Updated header guard
**Tests (test/inductor/)**
- test_static_cuda_launcher.py: Removed @skipIfRocm decorators and updated binary file handling
**Enabled Unit Tests**
All tests in test/inductor/test_static_cuda_launcher.py now pass on ROCm:
1. test_basic
2. test_unsigned_integers
3. test_signed_integers
4. test_basic_1arg
5. test_constexpr
6. test_implied_constant
7. test_kernel_no_args
8. test_high_shared_mem
9. test_too_high_shared_mem
10. test_kernel_empty_tensor
11. test_kernel_many_args
12. test_basic_compile
13. test_incompatible_code
14. test_static_launch_user_defined_triton_kernels
15. test_empty_tensor
16. test_any
17. test_disable_static_cuda_launcher
In addition to this, the following tests from test/inductor/test_codecache.py also pass:
1. test_remote_cache_load_function_device_cuda_float32_dynamic_False_bundle_triton_False_use_static_cuda_launcher_False
2. test_remote_cache_load_function_device_cuda_float32_dynamic_False_bundle_triton_True_use_static_cuda_launcher_False
3. test_remote_cache_load_function_device_cuda_float32_dynamic_False_bundle_triton_True_use_static_cuda_launcher_True
4. test_remote_cache_load_function_device_cuda_bfloat16_dynamic_False_bundle_triton_False_use_static_cuda_launcher_False
5. test_remote_cache_load_function_device_cuda_bfloat16_dynamic_False_bundle_triton_True_use_static_cuda_launcher_False
6. test_remote_cache_load_function_device_cuda_bfloat16_dynamic_False_bundle_triton_True_use_static_cuda_launcher_True
The following tests are skipped since triton bundling is necessary for StaticCudaLauncher:
1. test_remote_cache_load_function_device_cuda_float32_dynamic_False_bundle_triton_False_use_static_cuda_launcher_True
2. test_remote_cache_load_function_device_cuda_bfloat16_dynamic_False_bundle_triton_False_use_static_cuda_launcher_True
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166492
Approved by: https://github.com/jeffdaily
Fixes #161871.
Behaviour on arm:
```
PyTorch version: 2.10.0a0+gitdef3b05
Architecture: arm64
Platform: Darwin
Processor: arm
Testing mvlgamma_ with integer tensor on arm64...
Got expected error: mvlgamma: result type Long can't be cast to the desired output type Float
```
and on x86:
```
PyTorch version: 2.10.0a0+git1310d6a
Architecture: x86_64
Platform: Linux
Processor: x86_64
Testing mvlgamma_ with integer tensor on x86_64...
Got expected error: mvlgamma: result type Long can't be cast to the desired output type Float
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164230
Approved by: https://github.com/albanD
Summary:
getCurrentCUDABlasHandle() and getCUDABlasLtWorkspace() use static mutable maps that are not protected from concurrent read-and-write. This leads to crashes.
This diff adds mutexes to synchronize access to the static maps.
Note: this is a re-land of D86316117 / https://github.com/pytorch/pytorch/pull/167248 (see comments for details)
Test Plan:
Use a GPU OD, run multi-threaded tests (cuda_cublas_handle_pool_test) with TSAN:
```
buck test fbcode//mode/dev-tsan fbcode//caffe2:cuda_cublas_handle_pool_test -- --stress-runs 100
```
https://www.internalfb.com/intern/testinfra/testrun/14355223937501118
TSAN output (before synchronization was added): P2026731804
Differential Revision: D86964261
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167722
Approved by: https://github.com/malfet
This PR adds a basic spin configuration to allow for linting. It is designed as a drop-in replacement for the current Makefile based solution, i.e. it sets up and updates lintrunner based on the hashes of certain configuration files.
Lintrunner is called via Uv's `uvx` command, separating its environment from the general development environment in an effort to reduce instances of competing requirements breaking environments.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167226
Approved by: https://github.com/atalman, https://github.com/albanD
Fixes #163374.
Here is the output from reproducible code:
```
W1006 09:09:26.329000 2457 /home/fedora/github/pytorch/torch/distributed/run.py:811]
W1006 09:09:26.329000 2457 /home/fedora/github/pytorch/torch/distributed/run.py:811] *****************************************
W1006 09:09:26.329000 2457 /home/fedora/github/pytorch/torch/distributed/run.py:811] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W1006 09:09:26.329000 2457 /home/fedora/github/pytorch/torch/distributed/run.py:811] *****************************************
aten::clamp_(dt: f32[][R], None, 2)
redistribute_input(0, [P] -> [R])
redistribute_input(t: f32[], [P] -> [R])
_c10d_functional::all_reduce(t: f32[], sum, 0)
_c10d_functional::wait_tensor(t: f32[])
aten::clamp_(t: f32[], None, 2)
aten::view(t: f32[], [])
(Replicate(),)
tensor(2., device='cuda:0')
```
The behavior is now matching what you were expecting in issue #163374:
Expected behavior (from the issue):
1. Placement should change from Partial(sum) to Replicate()
2. Value should be tensor(2.) instead of tensor(144.)
Actual output from this build:
1. (Replicate(),) - placement is correct
2. tensor(2., device='cuda:0') - value is correct
So the in-place operation now properly redistributes the partial DTensor to replicate before performing the clamp and maintains the correct aliasing semantics. It also produces the expected clamped value.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164729
Approved by: https://github.com/ezyang
To support use case in https://github.com/pytorch/helion/pull/1122, i.e.
```
@helion.kernel
def foo(
x: Tensor,
group_name: str
):
x_remotes = torch.ops.symm_mem.get_remote_tensors(x, group_name)
for t in x_remotes:
...
```
Helion uses fake tensor to trace a program, thus we cannot use the following code in a Helion function:
```
hdl = rendezvous(tensor)
remote_tensors = tuple(
hdl.get_remote_tensor(peer, ...) for peer in range(world_size)
)
```
The reason is that when `tensor` is fake, the returned `hdl` is None, thus any subsequent call on it will fail.
This PR wraps the above functionality as an op:
```
lib.define("get_remote_tensors(Tensor x, str group_name) -> Tensor[]")
```
so that things like `hdl` is not exposed to Helion. The op also provides a `meta` implementation so that Helion can trace it without actually running the rendezvous.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167779
Approved by: https://github.com/yf225
We need to track all symbols. We used to skip `u = item()` and fail with
```
File "/home/lsakka/pytorch10/pytorch/torch/fx/passes/_tensorify_python_scalars.py", line 149, in _sympy_interp
expr_to_sym_proxy[expr]
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
KeyError: u0
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166573
Approved by: https://github.com/bobrenjc93
## Summary
This PR enriches OpenReg device management codes and adds focused documentation.
## Key Changes
- Introduced device management documentation in `device.md`.
- Updated `OpenRegFunctions.h` and `OpenRegFunctions.cpp` to use `DeviceIndex` and added error handling.
- Implemented `check_device_index` function for validating device indices.
- Enhanced Python bindings in `Module.cpp` for device management.
- Added tests for invalid device index handling in `test_device.py`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165897
Approved by: https://github.com/fffrog
Fixes #150477
### Summary:
- Added frame information (function name, file, line number) to all graph break/skip messages
- Standardized message format: "torch.compile will skip tracing the frame <name> (<file> line <N>) and fall back to eager. Reason: <reason>"
### Impacts:
module: dynamo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167067
Approved by: https://github.com/williamwen42
Fixes #167706
- Add `torch.fx.experimental.proxy_tensor.set_original_aten_op()` around flex_attention HOP dispatch so we have `original_aten` populated for flex_attention
- Update the usages of `original_aten` to also expect HOP in addition to OpOverload
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167749
Approved by: https://github.com/drisspg
## MOTIVATION
To generalize Distributed test cases for non-CUDA devices
## CHANGES
- Replaced hard-coded devices/backends with torch.accelerator.current_accelerator() and dist.get_default_backend_for_device, as sketched after this list
- Use DistributedTestBase instead of MultiProcessTestCase to use common utilities
- Remove instantiate_device_tests and make use of torch.accelerator.current_accelerator for test/distributed/test_c10d_object_collectives.py
- fix deterministic context issue for non-cuda devices in test/distributed/optim/test_zero_redundancy_optimizer.py
- use torch.accelerator.device_count() for multi-gpu check in torch/testing/_internal/distributed/_tensor/common_dtensor.py
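The device-agnostic pattern used above, sketched below (assumes an accelerator is present; backend names depend on the platform):
```python
import torch
import torch.distributed as dist

device_type = torch.accelerator.current_accelerator().type   # e.g. "cuda" or "xpu"
backend = dist.get_default_backend_for_device(device_type)   # e.g. "nccl" or "xccl"
```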
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165067
Approved by: https://github.com/guangyey, https://github.com/albanD
Currently, conv1d converts the 3D view to 4D before calling onednn::convolution().
However, that function converts the 4D tensor to a channels-last memory format for computation, resulting in incorrect return results (the correct result should be channels-first).
This PR fixes the issue, ensuring that the output format is consistent with the expected format.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162944
Approved by: https://github.com/EikanWang
Differential Revision: D86366889
This PR adds the `flush` option to `TailLog`, and it will automatically flush (by setting `buffering=1`) the files opened by that `TailLog` instance.
This is mainly to resolve the race condition between the default flushing of `TailLog` and where we read the duplicated error files in the termination handler.
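For reference, line buffering is what `buffering=1` gives on the duplicated files (a sketch; file name and payload are illustrative):
```python
f = open("replica_0_error.json", "w", buffering=1)   # line-buffered: flushes on each newline
f.write('{"message": "worker failed"}\n')            # visible to readers immediately
f.close()
```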
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167169
Approved by: https://github.com/fduwjj
Summary:
The previous implementation incorrectly attempted to read from a `NamedTemporaryFile` file pointer after calling `profiler.export_chrome_trace(fp.name)`. The issue is that `export_chrome_trace()` writes to a file at the path `fp.name`, but doesn't write to the file pointer `fp` itself. This meant when the code tried to read from `fp`, it got empty content.
The fix explicitly closes the temporary file first, then calls `export_chrome_trace(fp.name)` which writes the JSON trace to a file at that path. We then open that file separately for reading and copy its contents to the gzipped output file. This ensures we're reading from the actual file that was written to, not an empty file pointer.
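A sketch of the corrected read-back pattern described above (paths and the profiler handle are illustrative):
```python
import gzip
import tempfile

with tempfile.NamedTemporaryFile("w+t", suffix=".json", delete=False) as fp:
    path = fp.name
# prof.export_chrome_trace(path) writes the trace to `path`, not to the (now closed) handle
with open(path, "rb") as fin, gzip.open(path + ".gz", "wb") as fout:
    fout.write(fin.read())
```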
Changes made in both `fbcode/caffe2/torch/profiler/profiler.py` and `xplat/caffe2/torch/profiler/profiler.py`:
- `export_chrome_trace()`: Fixed file reading for gzipped chrome trace exports by opening the written file separately
- `export_memory_timeline()`: Fixed file reading for gzipped memory timeline exports by opening the written file separately
Test Plan:
* run benchmark
```
buck2 run fbcode//mode/opt fbcode//torchrec/distributed/benchmark:benchmark_train_pipeline -- \
--yaml_config=fbcode/torchrec/distributed/benchmark/yaml/sparse_data_dist_base.yml
```
* upload trace
```
DIFF=D86737513 fbcode/torchrec/fb/scripts/trace_to_manifold.sh
```
======== markdown ============
[manifold folder](https://www.internalfb.com/manifold/explorer/torchrec_benchmark_traces/tree/permanent_traces/DIFF/D86737513)
[trace-sparse_data_dist_base-rank0.json.gz](https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/permanent_traces/DIFF/D86737513/trace-sparse_data_dist_base-rank0.json.gz&bucket=torchrec_benchmark_traces)
Differential Revision: D86737513
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167601
Approved by: https://github.com/angelayi
When executing code like torch._check(numel % newsize == 0, ...), we previously allocated a new unbacked symbol due to #113165. However, this allocation is no longer necessary and can cause issues due to inconsistent behavior when tracing torch._check multiple times.
In particular, the allocation can lead to a memo disaster where the previously allocated symbol is returned instead of a new one, causing unexpected behavior.
This PR removes the unnecessary allocation, ensuring consistent behavior and avoiding potential issues. The change is validated by the following code, which now compiles without issues:
```
import torch
def fn(x):
i0 = x.nonzero().size(0)
y = torch.zeros((i0, 192))
return y.view([12, -1, 192])
with torch._dynamo.config.patch({"capture_dynamic_output_shape_ops": True}):
torch.compile(fn, fullgraph=True)(torch.ones((12,)))
```
By removing this unnecessary allocation, we simplify the code and avoid potential issues.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167123
Approved by: https://github.com/Lucaskabela
Adds `torch.hash_tensor` (#154149) as a tensor hashing variant; allows a tuple of hash functions in log annotations for more info (e.g. `with DebugMode.log_tensor_hashes(hash_fn=["norm", "hash_tensor"]): ...`).
Also fixes some corner cases around norm hashing (preserving NaNs/infs and avoiding errors on smaller dtypes).
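Minimal use of the new hashing variant (a sketch; output shown only via print):
```python
import torch

x = torch.randn(8)
print(torch.hash_tensor(x))   # tensor-valued hash added in #154149
```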
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167486
Approved by: https://github.com/xmfan