Preferred solution to #166380
Changes:
- Moved summary print to bottom of CMakeLists.txt
- Fixed the problem that `add_compile_options` should be called before targets are defined, so opted for `append_cxx_flag_if_supported` and `append_c_flag_if_supported` (new).
- Added extra verbosity so it can be seen when the linker script is added.
(Unfortunately the linker script has to be added per-target rather than globally due to ninja/cmake dependency tracking.)
Also move summary print to bottom of CMakeLists.txt and improve logging
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166407
Approved by: https://github.com/Aidyn-A, https://github.com/atalman
Summary:
https://github.com/pytorch/pytorch/pull/166609 updates `node.is_impure` to consider a submodule as impure if submodule contains impure node. This in turn changes `graph.eliminate_dead_code()` function behavior, which does not eliminate nodes with side effects, see [pytorch documentation](https://docs.pytorch.org/docs/stable/fx.html#torch.fx.Graph.eliminate_dead_code)
> Remove all dead code from the graph, based on each node’s number of users, and whether the nodes have any side effects.
While it is correct that a submodule containing side-effectful ops is itself side-effectful and should not be dead-code eliminated, some customers rely on dead code elimination to eliminate submodules that contain impure ops, which was the behavior before the #166609 fix.
Due to production environment constraints, we have to revert https://github.com/pytorch/pytorch/pull/166609 and move the side-effectful submodule check logic to `const_fold.py`, which will correctly **not** const-fold a submodule that contains impure ops.
NOTE other call sites that use `node.is_impure()` to make decisions are still incorrectly eliminating side-effectful submodules, but we can't safely change that today.
## This pr
- Move `_subgraph_has_impure_op` into `fx/experimental/const_fold.py`; use it to check for and prevent const-folding of an impure submodule (a minimal sketch is shown after this list).
- Add a note in `node.is_impure` to highlight the incorrect behavior and the context, in case people go looking in the future.
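For illustration, a minimal sketch of what such a check could look like (the helper name follows the PR description; the body is an assumption, not the actual implementation):
```python
import torch.fx

def _subgraph_has_impure_op(gm: torch.fx.GraphModule) -> bool:
    # Walk the submodule's graph; any impure (side-effectful) node makes
    # the whole submodule unsafe to const-fold.
    return any(node.is_impure() for node in gm.graph.nodes)
```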
Test Plan: run test_fx_const_fold and all tests pass
Differential Revision: D86641994
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167443
Approved by: https://github.com/jfix71
We plan to use `StridedShard` to express `shard_order`. This PR adds the function to support the conversion between `StridedShard` and `shard_order`.
I moved some test-related functions into torch/testing/_internal/common_utils.py. For review, we may only care about **_dtensor_spec.py** and **test_utils.py** in this PR.
### How to convert shard order to StridedShard:
Considering the example:
- placements = $[x_0, x_1, x_2, x_3, x_4]$, where all $x_i$ are shards on the same tensor dim.
Let's see how the shard order impacts the split_factor (sf). We loop from right to left over the placements to construct the split_factor under different assumed shard orders. Starting from $x_4$: this should be a normal shard.
Then $x_3$. There are two possibilities: $x_3$'s order can be before $x_4$'s, in which case $x_3$'s sf=1, because $x_3$ is already before $x_4$ in the placements. Otherwise $x_3$'s order is after $x_4$'s, and $x_3$'s sf should be the mesh dim size of $x_4$, which is $T(x_4)$:
<img width="820" height="431" alt="image" src="https://github.com/user-attachments/assets/f53b4b24-2523-42cc-ad6f-41f3c280db70" />
We can use this method to decide on the split factor for $x_2$, $x_1$ and so on.
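A hedged sketch of this rule (helper name and argument layout are assumptions for illustration; `mesh_sizes[i]` plays the role of $T(x_i)$ and `order[i]` is the shard order of placements[i]):
```python
def split_factors(mesh_sizes, order):
    # split_factor of x_i is the product of mesh sizes of placements to its
    # right that are ordered *before* it, per the rule described above.
    sfs = []
    for i in range(len(mesh_sizes)):
        sf = 1
        for j in range(i + 1, len(mesh_sizes)):
            if order[j] < order[i]:
                sf *= mesh_sizes[j]
        sfs.append(sf)
    return sfs

# e.g. placements [x0, x1] sharding the same tensor dim with x1 ordered first:
# split_factors([2, 4], order=[1, 0]) -> [4, 1]
```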
### How to convert StridedShard to shard order:
This follows the same method as above. We check all possible orderings and use the real split_factor to see which one matches. If there is no match, the StridedShard cannot be converted to a shard order.
---
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166740
Approved by: https://github.com/ezyang
Pass `dim_map` to `_requires_data_exchange` and return False if both spatial and channels dimensions are replicated
Modify `test_conv1d` and `test_conv3d` to check values rather than just shape, and replicate `conv3d` across batch dimension
In general, it feels like the current Convolution implementation was written to work only if the tensor is sharded across the last dimension.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167402
Approved by: https://github.com/ezyang
- The logger name in test_fully_shard_logging.py was wrong, so the logs were never emitted.
- The `device` variable in test_fully_shard_logging is expected to be a string, so quote it.
- `unittest.skipIf` is used, so `unittest` needs to be imported instead of `unittest.mock`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167312
Approved by: https://github.com/Skylion007, https://github.com/cyyever
## Summary
- Fixes #163548 by replacing the quadratic chunk-overlap scan in `_validate_global_plan` with a sweep-line pass that sorts chunk intervals and keeps an active set via `bisect_right`, giving O(n log n) behavior for metadata validation (see the sketch after this list).
- Add focused tests in `TestValidateGlobalPlan` covering overlapping and non-overlapping shard layouts to lock in the faster path.
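A simplified 1D sketch of the sweep-line idea (assumed shape of the approach, not the actual `_validate_global_plan` code):
```python
import bisect

def has_overlap(intervals):
    # intervals: list of (start, end) offsets for chunks along one dimension.
    intervals = sorted(intervals)
    active_ends = []  # sorted end offsets of intervals still "open"
    for start, end in intervals:
        # drop intervals that end at or before the current start
        idx = bisect.bisect_right(active_ends, start)
        del active_ends[:idx]
        if active_ends:  # an open interval extends past `start` -> overlap
            return True
        bisect.insort(active_ends, end)
    return False
```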
## Testing
- python test/distributed/checkpoint/test_planner.py -k ValidateGlobalPlan
## Benchmarks
| chunks | old runtime | new runtime |
|--------|-------------|-------------|
| 1024   | 0.121 s     | 0.0014 s    |
| 2048   | 0.486 s     | 0.0027 s    |
| 4096   | 2.474 s     | 0.0058 s    |
| 8192   | 8.014 s     | 0.0126 s    |
| 16384  | 32.740 s    | 0.026 s     |
@ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166820
Approved by: https://github.com/LucasLLC, https://github.com/Skylion007
Summary:
It seems D80948073 has caused some issues in a lowering pkg built on trunk: https://fburl.com/mlhub/o6p60pno
error log: P2001933683
which we were able to lower successfully in an older ien pkg: https://fburl.com/mlhub/1ro094zo
D85172267 fixed this issue for the if conditional, but the issue still exists for the else conditional. The logic is moved right before the if-else to cover both cases.
Test Plan:
checkout D85605372
buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.platform=platform010 -c fbcode.nvcc_arch=a100,h100 -c fbcode.split-dwarf=true -c fbcode.dwp=true -c fbcode.enable_distributed_thinlto=true -c fbcode.use_link_groups=true fbcode//inference_enablement/model_processing/infra/components/lowering/re:re_cinder -- -r "$(cat ./fbcode/minimal_viable_ai/umia_v1/ig/ss_omni_exp/re_lower_aoti.json)"
with the diff, no issue was encountered.
Reviewed By: tissue3
Differential Revision: D86474796
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167279
Approved by: https://github.com/pianpwk
This adds zero-bubble / DualPipeV support for (S)AC
Before:
- AC will always retrigger recompute upon every distinct backward.
After:
- Any checkpointed regions encountered by backward under the same instance of this context manager will only trigger recompute at most once, even if there are multiple calls to backward.
- Backward calls under the same instance of this context manager must execute over non-overlapping regions of the backward graph even if retain_graph=True.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166536
Approved by: https://github.com/albanD
**Summary:** While the implementation is correct, these checks are just a subset of the Partial placement checks that are done in https://github.com/pytorch/pytorch/pull/165962. This means for ops aten.linalg_vector_norm.default and aten._foreach_norm.Scalar, we're unnecessarily checking for Partial placements twice.
**Test Cases**
1. pytest test/distributed/tensor/test_math_ops.py -k test_vector_norm_partial
2. pytest test/distributed/tensor/test_math_ops.py -k test_foreach_norm_partial
3. pytest test/distributed/tensor/test_math_ops.py -k test_partial_reduction_ops
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167247
Approved by: https://github.com/XilunWu
# Problem
The FX backend currently allocates tensors on an exact device index, such as `"cuda:0"`. In contrast, the Python backend allocates on a device type, such as `"cuda"`. This avoids edge cases where fake tensor propagation can fail due to mismatched devices.
# Fix
Allocate tensors on `device.type` instead of the device.
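For illustration (not the generated FX code itself), the change amounts to allocating on the device type rather than the indexed device:
```python
import torch

dev = torch.device("cuda:0")
# Before: buffers allocated on the exact index ("cuda:0"); after: on the type
# ("cuda"), matching the Python backend and avoiding fake-tensor device mismatches.
buf = torch.empty(16, device=dev.type)
```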
# Test plan
Added a CI test passing in sample inputs on an indexed device, and checking that the output device in the generated FX graph is not indexed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167358
Approved by: https://github.com/mlazos, https://github.com/nandesuka, https://github.com/eellison
Not sure why, but running some tests (for example `test_weights_only_safe_globals_build`) with `pytest` on 3.14 makes the global name `test_serialization.ClassThatUsesBuildInstruction` instead of the expected `__main__.ClassThatUsesBuildInstruction`.
Also, change the expected exception type from `AttributeError` to `PicklingError`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167333
Approved by: https://github.com/atalman
Some important notes:
a) Just like IValues steal the ownership of ArrayRefs and any std::vectors in order to convert the inner elements into IValues, we do the same thing with StableIValue. This O(N) traversal is unavoidable.
b) As a result, since StableIValues are owning and our contract is that to<T>(StableIValue) transfers ownership, you cannot ever convert from StableIValue to a nonowning HeaderOnlyArrayRef<V>.
We handle memory similar to AtenTensorHandle, but we have a StableListHandle!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165953
Approved by: https://github.com/malfet
ghstack dependencies: #164991, #165152, #165153
Summary:
If quiesce ends up being called twice (which is likely not possible with the timer-based implementation, but is possible with either manual calls or the context manager implementation), this assertion fires.
Instead, make this assertion tolerant of reentrant calls to quiesce.
Test Plan: Added an explicit test which calls quiesce twice.
Differential Revision: D86251534
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167025
Approved by: https://github.com/masnesral
Provides type coverage to torch/_dynamo/variables/functions.py
Coverage report:
`mypy torch/_dynamo/variables/functions.py --linecount-report /tmp/coverage_log`
Compare before to after - we go from 0 lines and 0 funcs covered to 2698 lines and 166 funcs covered
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167103
Approved by: https://github.com/mlazos, https://github.com/fxdawnn
This fixes 2 issues with the DTensor data-dependent test case:
1) ShapeEnv not found when doing shard prop on data-dependent ops - fix was to detect the outer tracing fake mode. Maybe ShardingPropagator should just own a FakeMode & ShapeEnv for these purposes? The previous behavior was to initialize a new fake mode on every call.
2) Pending unbacked symbols not found. This happens because DTensor dispatch runs fake prop twice: once while figuring out the output sharding (2bba37309b/torch/distributed/tensor/_sharding_prop.py (L175)), and again to actually get the resulting local tensor (2bba37309b/torch/distributed/tensor/_dispatch.py (L254-L255)). With data-dependent ops, both calls will produce an unbacked symbol, but symbols in the first invocation are never surfaced, producing this error, so we ignore pending symbols from this site.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166989
Approved by: https://github.com/ezyang
TORCH_CHECK is non-fatal by design, but the TORCH_CHECK_{COND} macros are fatal. This is confusing, and we should limit fatality to the set of debug macros.
Differential Revision: D86168955
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167004
Approved by: https://github.com/malfet
Fixes #163971
Problem:
PyTorch's inductor compiler crashed with `IndexError: list index out of range` when compiling code that uses 0-dimensional tensors with operations like `torch.softmax(scalar_tensor, dim=0)`.
A 0-dim tensor has shape = torch.Size([]) (empty shape)
```
ndim = 0 (zero dimensions)
len(shape) = 0 (no indices to access)
# Line 972: Pad other_shape to match inp dimensions
other_shape = [1] * (inp_ndim - len(other_shape)) + list(other_shape)
# For scalar tensors:
# inp_ndim = 0 # as input is scalar
# other_shape = []
# Result: [1] * (0 - 0) + [] = [] (still empty!)
dim = match.kwargs["dim"] # dim = 0
if isinstance(dim, int):
dim = (dim,)
# crash is happening here!
return all(statically_known_true(other_shape[d] == 1) for d in dim)
# ^^^^^^^^^^^^^^^^
# Tries other_shape[0] but other_shape = [] (empty!)
# → IndexError: list index out of range
```
The function `_other_is_broadcasted_in_dim()` is an optimization check for a softmax fusion pattern. It verifies whether it's safe to rewrite:
```
# From
scaled = inp * other
result = scaled - scaled.amax(dim, keepdim=True)
# To this more stable form:
result = (inp - inp.amax(dim, keepdim=True)) * other
```
The optimization is only valid if `other` is constant across the reduction dimension (i.e., broadcast to size 1 in that dimension). Otherwise, scaling changes which element is the maximum.
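For context, a minimal repro shaped after the issue description (assumed, not taken from the test suite):
```python
import torch

@torch.compile  # inductor backend
def f(x):
    return torch.softmax(x, dim=0)

# A 0-dim (scalar) tensor previously hit "IndexError: list index out of range"
# inside the softmax fusion check described above.
print(f(torch.tensor(1.0)))
```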
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166547
Approved by: https://github.com/jansel
- Since the MultiProcContinuous class spawns one process per GPU and runs a UT in each of the processes, we need to ensure we propagate the exit code associated with a skip all the way to the main worker thread that spawned all the child processes.
- This commit also updates several UTs that are meant for 4 GPUs but incorrectly call skip_if_lt_x_gpu with 2 as an input. Examples:
- test_replicate_with_fsdp.py
- test_dtensor_resharding.py
- test_state_dict.py
- test_functional_api.py: Fix typo: multi-accelerator doesn't exist; replaced with multi-gpu
- test_op_strategy.py: world_size was hardcoded
- test_math_ops.py: UT written for 4 GPU, so skipping for anything less
- test_schedule_multiproc.py: All UTs in this suite are required to run on 2+ GPUs, therefore adding skips if fewer than 4 GPUs are supplied
Fixes https://github.com/pytorch/pytorch/issues/166875
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167281
Approved by: https://github.com/jeffdaily
Recursive function call creates a reference cycle: closure <- function <- cell inside closure
Capturing self (the PyCodegen instance) in the same closure prolongs its life until the next gc.collect(), which might result in worse resource management.
After the introduction of e9209e0, OOM issues have been observed. While looking for reference cycles, one was uncovered that would prolong the lifetime of tensors; as a result, OOM issues might occur. The following dependency chain has been uncovered:
<img width="1059" height="540" alt="image" src="https://github.com/user-attachments/assets/359a8534-e7cd-491f-be40-547c2af5cbbc" />
At the end of it a reference cycle can be found that consists of the closure for the function collect_temp_source, the function itself, and a cell object inside the closure that points to the function due to the recursive call.
This issue can be resolved either by removing the recursion or by removing the PyCodegen instance from the closure.
Another precaution is to explicitly empty the f_locals dict. This way we cut the tensors out of the chain leading to the reference cycle.
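A minimal sketch of the cycle pattern described above (names are illustrative, not the actual dynamo code):
```python
class CodegenLike:
    def __init__(self):
        self.seen = []

    def make_collector(self):
        def collect_temp_source(source):
            # The recursive reference creates: closure <- function <- cell in closure,
            # and capturing `self` ties the CodegenLike instance (and anything it
            # references, e.g. tensors in f_locals) to that cycle until the GC runs.
            if getattr(source, "base", None) is not None:
                collect_temp_source(source.base)
            self.seen.append(source)
        return collect_temp_source
```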
Fixes #166721
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166714
Approved by: https://github.com/Lucaskabela, https://github.com/Skylion007, https://github.com/jeromean, https://github.com/williamwen42, https://github.com/mlazos
Fixes T243967987
Move `enrich_profiler_metadata` from `torch._dynamo.config` to `torch.fx.experimental._config`.
We cannot import anything inside recompile(); it caused some perf regressions internally. We move the config so we can import it at the top of `graph_module.py` without causing any circular import.
We also cannot delete the old config right now because some internal tests rely on copies of the old `graph_module.py` cpp file in unit tests. But I think we should be able to delete the old config soon after this PR lands.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167114
Approved by: https://github.com/angelayi
Summary: Before calling the tritonparse hook with `config_args`, ensure that we set `config_args` within `inductor_meta`. This way, even if it is not set, the hook still gets run and we can at least get the launch arguments.
Test Plan: Tritonparse tests
Differential Revision: D86463732
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167261
Approved by: https://github.com/FindHao
This PR is meant to swap the PR-based ciflow tags from the mi200 nodes (less stable) to the mi300 nodes (more stable). This will ensure that developers see consistent testing on their PRs as well as on main. This PR does all of the following:
- Rename rocm.yml to rocm-mi200.yml : for clarity
- Add ciflow/rocm-mi200 trigger to rocm-mi200.yml : for devs who want to opt-in to single-GPU unit tests on MI200
- Move ciflow/rocm trigger from rocm-mi200.yml to rocm-mi300.yml : so PRs target MI300 runners by default
- Rename inductor-rocm.yml to inductor-rocm-mi200.yml : for clarity
- Remove ciflow/inductor-rocm trigger from inductor-rocm-mi200.yml : prevent MI200 inductor config unit tests being triggered by default
- Add ciflow/inductor-rocm-mi200 trigger to inductor-rocm-mi200.yml : for devs who want to opt-in to inductor config unit tests on MI200
- Move ciflow/periodic trigger from periodic-rocm-mi200.yml to periodic-rocm-mi300.yml : so PRs target MI300 runners by default
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167225
Approved by: https://github.com/jeffdaily, https://github.com/huydhn
Co-authored-by: Jithun Nair <jithun.nair@amd.com>
A few improvements for autotuning
- While testing mix order reduction for internal workloads, Paul found that tuning num-stages could be very helpful for triton kernels. The idea is illustrated in his diff: https://www.internalfb.com/diff/D86341591
- When rnumel is small, a larger XBLOCK could be helpful for perf.
This PR adds the ability to autotune num-stages and XBLOCK. This brings further 19% speedup for RMSNorm BWD on B200.
Testing result:
- eager: 11 data points
- compiled: 11 data points, 17.07x speedup (was 14.39x before the PR; the PR brings a further 19% speedup)
- quack: 11 data points, 12.72x speedup
- liger: 11 data points, 11.75x speedup
- compiled-no-fusion: 11 data points, 9.93x speedup
<img width="3564" height="2368" alt="RMSNormBackward_bench" src="https://github.com/user-attachments/assets/3e415242-a988-42bf-8a47-4ed5f11148a3" />
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167161
Approved by: https://github.com/jansel
ghstack dependencies: #166669, #166938
The PR includes a misc list of fixes for the regressions I see from the dashboard:
1. The dashboard may use very small shapes for rmsnorm backward. The data set can be fully cached in L2, so mix order reduction does not show much benefit and may even have worse perf. Disable mix order reduction for small workloads.
2. Disable the autotuning of split size by default to avoid the compilation time hit.
3. Avoid mix order reduction if there is non-contiguous memory access. Previously the check was only done for shared buffers accessed by both reductions; it turns out to be necessary to extend the check to buffers accessed by only one reduction. See test test_avoid_non_coalesced_access, which is simplified from a TIMM model. Note that a larger XBLOCK could fix the perf problem and keep mix order reduction applicable, but I don't think that's high priority: with a larger XBLOCK, the kernel would consume much more shared memory/registers, which could also cause perf issues.
Dashboard result [here](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Wed%2C%2029%20Oct%202025%2003%3A40%3A22%20GMT&stopTime=Wed%2C%2005%20Nov%202025%2004%3A40%3A22%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(h100)&lBranch=gh/shunting314/257/head&lCommit=b6f4a24ea5f7574d6b1d3b854022aa09d70593db&rBranch=main&rCommit=22a745737a09b0600bb0b85b4c0bbb9fb627f137).
<img width="1484" height="531" alt="Screenshot 2025-11-04 at 10 58 48 PM" src="https://github.com/user-attachments/assets/60cda211-3cc4-4fe1-9eaf-d2fb2c7d15a1" />
- The perf drop for TIMM (default) is not real; it's due to one more model passing the accuracy test.
- The perf drop for HF (cudagraphs) is not real. I checked each individual model that showed as regressed on the dashboard, and they fall into the following categories:
  - Showed as regressed, but absolute execution time got reduced, e.g. OPTForCausalLM.
  - Showed as regressed, but has a slight speedup on an h100 dev server: MobileBertForMaskedLM, sped up from 57.847709 ms to 56.711640 ms.
  - Showed as regressed, but the PR does not change the generated kernels (mix order reduction is skipped due to small workload or other reasons), e.g. XGLMForCausalLM, AlbertForMaskedLM.
Note that the neutral result on the dashboard is expected due to small workload size. For large workload, we see about 1.5x geomean for rmsnorm/layernorm backward on average and 2.2x for some shapes used by internal model. For 8GPU torchtitan training on llama3, we see 4% TPS (tokens per second) improvement.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166938
Approved by: https://github.com/jansel
ghstack dependencies: #166669
Fixes #165873
# Title
Fix load_state_dict: raise error for 1D (size > 1) -> 0D parameter loads
## Summary
This PR fixes a bug where loading a 1D tensor (size > 1) into a scalar (0D) parameter would silently take the first element instead of raising an error. The fix preserves backward compatibility for 1D tensors of size 1 while catching genuine shape mismatches.
## Motivation
Previously, loading a 1D tensor like torch.randn(32000) into a 0D scalar parameter would silently slice the first element, leading to silent data loss and potential bugs. This change ensures users get a clear error when there's a genuine shape mismatch.
## Behavior change
Before:
- 1D tensor (any length) -> 0D scalar -> silently coerced using input_param[0]
After:
- 1D tensor (size == 1) -> 0D scalar -> allowed (backward compatibility)
- 1D tensor (size > 1) -> 0D scalar -> raises RuntimeError with size mismatch message
In torch/nn/modules/module.py, `_load_from_state_dict`, added an `input_param.shape[0] == 1` check to the backward compatibility condition to only allow single-element 1D tensors.
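A small illustration of the new behavior (module name and values are assumptions):
```python
import torch
import torch.nn as nn

class M(nn.Module):
    def __init__(self):
        super().__init__()
        self.scale = nn.Parameter(torch.tensor(1.0))  # 0D (scalar) parameter

m = M()
m.load_state_dict({"scale": torch.randn(1)})          # 1D, size 1: still allowed
try:
    m.load_state_dict({"scale": torch.randn(32000)})  # 1D, size > 1: now raises
except RuntimeError as e:
    print("size mismatch:", e)
```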
## Tests
Added test_scalar_param_1d_tensor_raises to verify that loading 1D tensors of size > 1 raises an error, while size 1 loads successfully.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166335
Approved by: https://github.com/mikaylagawarecki
Update the torch-xpu-ops commit to [intel/torch-xpu-ops@9aac5a](9aac5a1ddf), includes:
- Enable FP8 concat/where/flip/index_put/index.Tensor on XPU backend
- Remove BUILD_SPLIT_KERNEL_LIB flag
- Fix the initialization order of ProcessGroupXCCL
- Separates communication initialization logic from getXCCLComm
- Fix segmentation fault in NLLLoss kernel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166945
Approved by: https://github.com/EikanWang
* `c10::fetch_and_cast` and `c10::cast_and_store` produce branchy code since they support all datatypes
* So, we do special handling for binary elementwise broadcast with mixed dtypes of float/bfloat16/half
* This improves performance
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167233
Approved by: https://github.com/jeffdaily
# Motivation
This PR aims to introduce two variables (`TEST_ACCELERATOR` and `TEST_MULTIACCELERATOR`) to simplify UT generalization. Since out-of-tree backends may be imported later, these variables are defined as lazy values.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167196
Approved by: https://github.com/albanD
Problem: the migration of `AT_DISPATCH_V2` macros to headeronly cannot be a simple copy-paste of macro definitions from one header file to another because the macros `AT_DISPATCH_SWITCH` and `AT_DISPATCH_CASE` may use functions that cannot be migrated to headeronly, e.g. when a selective build feature is enabled, there will be functions that are generated. On the other hand, when not using selective build, the dtype-dispatch macros are perfectly suitable for migrating to headeronly.
In this PR, the migration problem above is tackled by refactoring `AT_DISPATCH` related macros into headeronly macros and non-headeronly macros while preserving the current API and semantics. For instance, consider the current V2 macro definitions:
```c++
#define AT_DISPATCH_V2(TYPE, NAME, BODY, ...) \
AT_DISPATCH_SWITCH(TYPE, NAME, AT_AP_VAR(AT_WRAP(BODY), TYPE, __VA_ARGS__))
#define AT_AP_VAR(N, T, ...) \
AT_EXPAND(AT_CONCAT(AT_AP, AT_NUM_ARGS(__VA_ARGS__))(AT_WRAP(N), __VA_ARGS__))
#define AT_AP1(N, _1) AT_DISPATCH_CASE(_1, N)
...
```
where the headeronly-migration-problematic parts are using AT_DISPATCH_SWITCH and AT_DISPATCH_CASE macros (defined in ATen/Dispatch.h). In this PR, we introduce parametric versions of the `AT_DISPATCH_V2` and `AT_AP1` macros that have `_TMPL` suffixes, have DISPATCH_SWITCH and DISPATCH_CASE arguments, and are defined in `torch/headeronly/core/Dispatch_v2.h`:
```c++
#define THO_DISPATCH_V2_TMPL( \
DISPATCH_SWITCH, DISPATCH_CASE, TYPE, NAME, BODY, ...) \
DISPATCH_SWITCH( \
TYPE, \
NAME, \
THO_AP_VAR_TMPL(DISPATCH_CASE, AT_WRAP(BODY), TYPE, __VA_ARGS__))
#define THO_AP_VAR_TMPL(C, N, T, ...) \
AT_EXPAND( \
AT_CONCAT(THO_AP, AT_NUM_ARGS(__VA_ARGS__))(C, AT_WRAP(N), __VA_ARGS__))
#define THO_AP1(C, N, _1) C(_1, N)
...
```
so that original V2 macro definition, defined in ATen/Dispatch_v2.h, becomes:
```c++
#define AT_DISPATCH_V2(TYPE, NAME, BODY, ...) \
THO_DISPATCH_V2_TMPL( \
AT_DISPATCH_SWITCH, \
AT_DISPATCH_CASE, \
TYPE, \
NAME, \
AT_WRAP(BODY), \
__VA_ARGS__)
```
that has exactly the same API and semantics as the original definition.
Note 1: ~we have changed the definition of `AT_AP1(N, _1) ...` to `AT_AP1(C, N, _1) ...` without renaming `AT_AP1` because `AT_AP1` is a helper macro that is not a part of public API (for instance, nothing in pytorch explicitly uses `AT_AP1`).~ UPDATE: restored the original `AT_AP` macros and introduced new `THO_AP` macros.
Note 2: this PR introduces a new API macro THO_DISPATCH_V2_TMPL that will be available for stable ABI users, who can use it by providing custom versions of the `AT_DISPATCH_SWITCH` and `AT_DISPATCH_CASE` macros, say, with selective build features removed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165856
Approved by: https://github.com/janeyx99
Fixes #166405.
I've fixed this by adding epsilon validation in `torch.nn.functional.batch_norm` to reject non-positive values before they cause undefined behavior. Also added a test case `test_batchnorm_invalid_eps` to verify the fix works correctly.
While working on this, I noticed that `layer_norm`, `group_norm`, and `instance_norm` also don't validate epsilon and could have the same issue. Should I add validation for those in this PR as well?
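For illustration, under the fix a call like the commented one below would be rejected (the exact error type and message follow the new validation):
```python
import torch
import torch.nn.functional as F

x = torch.randn(4, 3)
running_mean, running_var = torch.zeros(3), torch.ones(3)

F.batch_norm(x, running_mean, running_var, eps=1e-5)     # fine
# F.batch_norm(x, running_mean, running_var, eps=-1e-5)  # non-positive eps: now rejected
```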
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166756
Approved by: https://github.com/mikaylagawarecki
Summary:
The SIMD path is using the SLEEF version of pow, which is slightly different from std::pow. The fix is to use the same vectorized code (with partial load and store) for the trailing data as well, to ensure consistency between results.
Deploy:
Need to make a hotfix in waas to monitor release signals, since this diff can cause testing failures in veloski and waas release correctness tests.
Test Plan: Sandcastle.
Differential Revision: D86218207
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166986
Approved by: https://github.com/swolchok
This is the initial prototype of distributed autotuning. It's intended to be a basis for iteration rather than the final end product.
Currently, when we run an SPMD program we compile the ranks independently. As a result the autotuning is repeated on every rank, so for an 8-GPU program with 8 matmul operators we'll autotune 64 (8*8) times.
Distributed autotuning uses collectives to distribute the autotuning across the ranks so that each rank autotunes 1/world_size of the total operators. In our 8-GPU example we would only perform 8 autotunes total (one on each rank) rather than 64.
There are several advantages:
1. Faster autotuning times - each CPU/GPU does less work total
2. Better determinism - currently it's possible for two ranks to choose different algorithms for the same operator. With distributed autotuning we choose the algorithm once for the entire program.
Results:
In testing using llama3 8B on torchtitan max-autotune time was reduced from 52s -> 26s and exhaustive-autotuning was reduced from 2009s -> 613s.
Usage:
The feature is controlled by the environment variable TORCHINDUCTOR_DISTRIBUTED_AUTOTUNE.
Co-authored-by: @PaulZhang12
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163369
Approved by: https://github.com/PaulZhang12
Handle `torch._check` in `TorchInGraphFunctionVariable.call_function`. Basically, it has two arguments: a predicate (bool) and a message (callable). If the predicate is a constant, evaluate `torch._check`: if the predicate is true, it will just compile and nothing happens; if the predicate is false, `torch._check` will raise an exception.
If the predicate is not constant, we manually emit a proxy. I tried to build as_proxy() inside NestedUserFunctionVariable but failed to, which is why I create it here. I try to extract the message: if it's a function, I retrieve it; if not, I set it to None. Maybe we could extract it if the message is a closure, but I am not sure how.
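For reference, the two paths roughly correspond to the following (a hedged example; which path is taken depends on whether dynamo can prove the predicate constant at trace time):
```python
import torch

@torch.compile
def f(x):
    # Predicate dynamo can evaluate at trace time: checked immediately,
    # nothing is emitted into the graph.
    torch._check(x.dim() == 2, lambda: "expected a 2D tensor")
    # If the predicate is symbolic (e.g. with dynamic shapes), a proxy for
    # torch._check is emitted into the graph instead.
    torch._check(x.shape[0] >= 1, lambda: "expected a non-empty batch")
    return x.sum()

f(torch.randn(3, 4))
```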
Fixes #163668
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164676
Approved by: https://github.com/williamwen42, https://github.com/mlazos
Co-authored-by: William Wen <william.wen42@gmail.com>
Summary: The current delta update has a strong assumption that the non-lowered weights share the same tensor dtype as the lowered version. This is not true by design. When the dtype mismatches, the data loading will load the data into an unexpected dtype, which introduces undefined behavior. This diff aims to close the gap by always loading the tensor with its original dtype first, then casting to the desired dtype.
Test Plan:
No more NaN values!
{P2022339213}
Reviewed By: kqfu
Differential Revision: D86181685
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167039
Approved by: https://github.com/henryoier
This PR enables ROCm/HIP support for PyTorch's StaticCudaLauncher, which provides static compilation and launching of Triton kernels. The implementation has been tested on AMD MI300 and MI200 hardware.
**Changes**
**Python (torch/_inductor/runtime/)**
- static_cuda_launcher.py: Added ROCm detection, .hsaco binary support, and ROCm-specific scratch parameter handling
- triton_heuristics.py: Updated device type checks to support both cuda and hip
**C++ (torch/csrc/)**
- Module.cpp: Enabled StaticCudaLauncher for ROCm builds
- inductor/static_cuda_launcher.cpp: Added HIP API equivalents for all CUDA driver calls
- inductor/static_cuda_launcher.h: Updated header guard
**Tests (test/inductor/)**
- test_static_cuda_launcher.py: Removed @skipIfRocm decorators and updated binary file handling
**Enabled Unit Tests**
All tests in test/inductor/test_static_cuda_launcher.py now pass on ROCm:
1. test_basic
2. test_unsigned_integers
3. test_signed_integers
4. test_basic_1arg
5. test_constexpr
6. test_implied_constant
7. test_kernel_no_args
8. test_high_shared_mem
9. test_too_high_shared_mem
10. test_kernel_empty_tensor
11. test_kernel_many_args
12. test_basic_compile
13. test_incompatible_code
14. test_static_launch_user_defined_triton_kernels
15. test_empty_tensor
16. test_any
17. test_disable_static_cuda_launcher
In addition to this, the following tests from test/inductor/test_codecache.py also pass:
1. test_remote_cache_load_function_device_cuda_float32_dynamic_False_bundle_triton_False_use_static_cuda_launcher_False
2. test_remote_cache_load_function_device_cuda_float32_dynamic_False_bundle_triton_True_use_static_cuda_launcher_False
3. test_remote_cache_load_function_device_cuda_float32_dynamic_False_bundle_triton_True_use_static_cuda_launcher_True
4. test_remote_cache_load_function_device_cuda_bfloat16_dynamic_False_bundle_triton_False_use_static_cuda_launcher_False
5. test_remote_cache_load_function_device_cuda_bfloat16_dynamic_False_bundle_triton_True_use_static_cuda_launcher_False
6. test_remote_cache_load_function_device_cuda_bfloat16_dynamic_False_bundle_triton_True_use_static_cuda_launcher_True
The following tests are skipped since triton bundling is necessary for StaticCudaLauncher:
1. test_remote_cache_load_function_device_cuda_float32_dynamic_False_bundle_triton_False_use_static_cuda_launcher_True
2. test_remote_cache_load_function_device_cuda_bfloat16_dynamic_False_bundle_triton_False_use_static_cuda_launcher_True
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166492
Approved by: https://github.com/jeffdaily
## Summary
This PR adds multi-architecture kernel compilation support for ROCm in PyTorch's AOT Inductor module, enabling a single compiled model to run across multiple AMD GPU architectures (MI200, MI300, MI350, etc.) without recompilation.
## Implementation
- **Multi-arch compilation pipeline**: Compiles LLVM IR to multiple GPU architectures and bundles them using `clang-offload-bundler`
- **Architecture detection**: Automatically detects target architectures from `torch.cuda.get_arch_list()`, with overrides via `PYTORCH_ROCM_ARCH` environment variable
- **ROCm-specific utilities**: New `rocm_multiarch_utils.py` module handles ROCm toolchain integration
- **Test infrastructure**: Adapted AOT Inductor tests to support both CUDA and ROCm compilation paths
## Testing
Successfully tested on:
- MI200
- MI300
**Enabled tests:**
- `test_simple_multi_arch`
- `test_compile_after_package_multi_arch`
- `test_compile_with_exporter`
- `test_compile_with_exporter_weights`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166357
Approved by: https://github.com/jeffdaily
Fixes #166906
```
python test/export/test_export.py -k test_annotate_on_assert
```
The assertions are not marked with annotation because these nodes are created in `apply_runtime_assertion_pass`. Currently the annotation will only be added if the nodes are created during tracing. So we need to manually add the annotation.
Nodes added in `apply_runtime_assertion_pass` will have the same annotation as the input node to the assertion.
Output graph:
Note that `_assert_scalar_default_1` is not annotated because it's an assertion on the size of `x`, which is not annotated.
```
ExportedProgram:
class GraphModule(torch.nn.Module):
def forward(self, x: "f32[s77]", y: "i64[]"):
# No stacktrace found for following nodes
sym_size_int_1: "Sym(s77)" = torch.ops.aten.sym_size.int(x, 0)
# Annotation: {'moo': 0} File: /data/users/shangdiy/pytorch/test/export/test_export.py:729 in forward, code: x = torch.cat([x, x])
cat: "f32[2*s77]" = torch.ops.aten.cat.default([x, x]); x = None
# Annotation: {'moo': 0} File: /data/users/shangdiy/pytorch/test/export/test_export.py:730 in forward, code: b = y.item()
item: "Sym(u0)" = torch.ops.aten.item.default(y); y = None
ge_1: "Sym(u0 >= 4)" = item >= 4
_assert_scalar_default = torch.ops.aten._assert_scalar.default(ge_1, "Runtime assertion failed for expression u0 >= 4 on node 'ge_1'"); ge_1 = _assert_scalar_default = None
# No stacktrace found for following nodes
mul_1: "Sym(2*s77)" = 2 * sym_size_int_1; sym_size_int_1 = None
le: "Sym(2*s77 <= u0)" = mul_1 <= item; mul_1 = None
_assert_scalar_default_1 = torch.ops.aten._assert_scalar.default(le, "Runtime assertion failed for expression 2*s77 <= u0 on node 'le'"); le = _assert_scalar_default_1 = None
# Annotation: {'moo': 0} File: /data/users/shangdiy/pytorch/test/export/test_export.py:732 in forward, code: return x * b
mul: "f32[2*s77]" = torch.ops.aten.mul.Tensor(cat, item); cat = item = None
return (mul,)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167171
Approved by: https://github.com/angelayi
## Summary
Fix outdated unbiased parameter references in normalization module documentation. Replace deprecated torch.var(input, unbiased=False/True) with modern torch.var(input, correction=0/1) API throughout BatchNorm, InstanceNorm, LayerNorm, and GroupNorm docstrings.
## Changes
- torch/nn/modules/batchnorm.py: Updated 4 instances across BatchNorm1d, BatchNorm2d, BatchNorm3d, and SyncBatchNorm
- torch/nn/modules/instancenorm.py: Updated 3 instances across InstanceNorm1d, InstanceNorm2d, and InstanceNorm3d
- torch/nn/modules/normalization.py: Updated 2 instances in LayerNorm and GroupNorm
## Test plan
Mathematical behavior remains identical: unbiased=False ≡ correction=0 (biased estimator), unbiased=True ≡ correction=1 (unbiased estimator). Documentation now uses consistent modern API terminology with no functional changes to code behavior.
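The equivalence can be checked directly (a quick illustration, not part of the PR):
```python
import torch

x = torch.randn(100)
assert torch.allclose(torch.var(x, unbiased=False), torch.var(x, correction=0))  # biased
assert torch.allclose(torch.var(x, unbiased=True), torch.var(x, correction=1))   # unbiased
```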
Fixes #166804
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167209
Approved by: https://github.com/albanD
In a trunk failure today, we saw the same test running on both trunk and slow shards. The reason is that this test didn't invoke `super().setUp()`, so test features like slow and disabled tests didn't apply to it.
I used Claude to find all test classes with a `setUp()` method that didn't call `super().setUp()` and patched all of them.
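The required pattern is simply (sketch):
```python
from torch.testing._internal.common_utils import TestCase

class MyTest(TestCase):
    def setUp(self):
        super().setUp()  # without this, slow/disabled-test handling is skipped
        # ... per-test fixtures ...
```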
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167163
Approved by: https://github.com/malfet
We have some logic to figure out "given which inputs have static indices in the pre-subclass-desugaring graph, figure out the static indices in the post-subclass-desugaring graph", and it was busted for training.
Separately, we should probably not have to do this logic at all - as @eellison mentioned, inputs/outputs in the graph are less likely to be tweaked through graph passes, so it would be more convenient and less hassle if we just stashed whether a given input is static directly on its Descriptor. I did not end up doing that in this PR though.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167127
Approved by: https://github.com/ezyang
[Flight Recorder] Reverted to include stack traces for dump pipe triggered FR dump (#167023)
Summary:
We should also retry if including stack traces failed. The change was introduced in https://github.com/pytorch/pytorch/pull/164591
Test Plan: eyes
Reviewed By: fduwjj
Differential Revision: D86248484
This pull request updates the PyTorch documentation build system to support newer versions of Sphinx and its related dependencies, improves coverage checking for undocumented objects, and adds configuration enhancements to the docs build. The most important changes are grouped below.
**Dependency Upgrades and Compatibility:**
* Upgraded `sphinx` to version 7.2.6 and updated related documentation dependencies (`breathe`, `exhale`, `docutils`, `myst-nb`, `sphinx-design`, `myst-parser`, and others) in `.ci/docker/requirements-docs.txt` to ensure compatibility with Python 3.13 and improve documentation generation. [[1]](diffhunk://#diff-b5577a8e38a2e4c5d91865096b259738cc1dbcb97921abb73045dae0255b1479L1-L12) [[2]](diffhunk://#diff-b5577a8e38a2e4c5d91865096b259738cc1dbcb97921abb73045dae0255b1479L39-R45) [[3]](diffhunk://#diff-b5577a8e38a2e4c5d91865096b259738cc1dbcb97921abb73045dae0255b1479L59-R64)
* Replaced the editable install of `pytorch_sphinx_theme2` with a pinned version for stability in documentation builds.
**Documentation Coverage and Build Improvements:**
* Updated the coverage check logic in `.ci/pytorch/python_doc_push_script.sh` to parse the new Sphinx 7.2.6+ coverage report format, extracting the undocumented count from the statistics table for more reliable coverage validation.
**Configuration and Formatting Enhancements:**
* Introduced `autosummary_filename_map` in `docs/source/conf.py` to resolve duplicated autosummary output filenames for functions and classes with the same name, improving documentation clarity.
**Minor Documentation Formatting:**
* Removed an unused `:template:` directive from `docs/source/quantization-support.md` for cleaner autosummary output.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164901
Approved by: https://github.com/albanD
As per title.
It seems safe to generalize to arbitrary contiguous inputs since `at::matmul` is likely to do the flattening to avoid `baddmm`.
Additionally, we guard for bias to be 1D and contiguous which is guaranteed to be fused with no copies.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166071
Approved by: https://github.com/ngimel
I.e., remove the distinction between the two cases, and always preload the full set of libraries.
For some reason, when one uses `virtualenv` instead of `venv`, preloading `cudart` works, but it fails to find cudnn or cublasLT later on.
Fix it by getting rid of the partial preload logic for one of the cases and always preloading the full set of libraries.
Test plan on stock Ubuntu:
```
pip install virtualenv
virtualenv --symlinks -p python3.11 --prompt virtv venv-virt
source venv-virt/bin/activate
pip install torch
python -c 'import torch'
```
Fixes https://github.com/pytorch/pytorch/issues/165812
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167046
Approved by: https://github.com/atalman
torch.cuda.memory.set_per_process_memory_fraction allows setting
an upper bound on how much device memory is allocated. This PR
exposes this setting to an environment variable.
For example, PYTORCH_CUDA_ALLOC_CONF="per_process_memory_fraction:0.5"
will limit the device memory to half of the available memory.
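For example, from Python (the variable must be set before CUDA is initialized):
```python
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "per_process_memory_fraction:0.5"

import torch  # noqa: E402 - imported after setting the allocator config
```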
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161035
Approved by: https://github.com/ngimel, https://github.com/eqy
Add Inline Fusion Support for Custom Op Autotuning
--------------------------------------------------
This PR extends PyTorch Inductor's custom op autotuning with inline fusion capabilities, enabling the winning decomposition to be inlined directly into the computation graph for fusion with surrounding operations.
### Usage
```python
def decompose_k_implementation(
a: torch.Tensor, b: torch.Tensor, k_splits: int = 4
) -> torch.Tensor:
"""Matrix multiply with k-way decomposition."""
...
@torch.library.custom_op("my_lib::matmul_relu", mutates_args={})
def custom_matmul_relu_dk(
a: torch.Tensor, b: torch.Tensor, k_splits: int
) -> torch.Tensor:
return torch.relu(decompose_k_implementation(a, b, k_splits))
register_custom_op_autotuning(
custom_op=custom_matmul_relu_dk,
configs=[
CustomOpConfig(k_splits=2),
CustomOpConfig(k_splits=4),
CustomOpConfig(k_splits=8),
CustomOpConfig(k_splits=32),
CustomOpConfig(k_splits=64),
],
name="decompose_k_autotuned",
input_gen_fns={
"a": lambda fake: torch.randn_like(fake, device='cuda'),
"b": lambda fake: torch.randn_like(fake, device='cuda'),
}
)
```
### How It Works
Enable optimizations from Inductor by inlining the best decomposition, allowing fusion with surrounding elementwise operations and other graph-level optimizations. This provides potentially better performance and memory efficiency.
During the custom op autotuning phase, we still benchmark all CustomOpConfigs to find the fastest implementation. Then, during inline fusion, inductor inlines the decomposition into the main graph, converting the winning choice to individual ComputedBuffer IR nodes (fusable). At the end, Inductor automatically fuses the inlined operations with surrounding elementwise ops (e.g., bias add, ReLU, scaling). Note that the winning choice must be a SubgraphChoiceCaller (decomposition-based) rather than an ExternKernelChoice for inlining to work; if an ExternKernelChoice is returned, no inlining happens.
Performance Results
Benchmarked on matmul+relu workload with decompose-k fusion (H100 GPU, 15 test shapes):
<img width="782" height="377" alt="Screenshot 2025-11-04 at 12 43 11 AM" src="https://github.com/user-attachments/assets/22131d4c-a8ce-4f55-bdcd-ac758ddad8cd" />
Metric | Result
-- | --
Average Speedup vs ATen | 1.28x
Max Speedup vs ATen | 1.41x
The performance comparison is detailed in the plots below. We see that in most use cases, inline fusion gains better performance compared to the aten baseline and the current torch.compile.
<img width="4874" height="3545" alt="image" src="https://github.com/user-attachments/assets/190a1233-412f-4f34-84cd-9b7cb582f504" />
**Test**: `test_decompose_k_with_fusion` demonstrates decompose-k with inline fusion enabled.
--------------
### Integration to mm.py decomposeK with a flag enable_inline_subgraph_fusion=True in config (deprecated to avoid breaking async compilation. removed from the PR already)
FP32:
<img width="738" height="357" alt="Screenshot 2025-11-04 at 12 05 08 AM" src="https://github.com/user-attachments/assets/ee421d22-c426-42f2-8dcd-4dcc547d6219" />
FP16:
<img width="769" height="403" alt="Screenshot 2025-11-04 at 12 13 49 AM" src="https://github.com/user-attachments/assets/346d1ffc-15af-40b0-9378-cf9b297711c2" />
The TCF column represents torch compile fusion, which is close to custom_op decomposek. The difference might be due to different candidate k values.
#### Usage:
Note: this only happens when we don't benchmark_epilogue_fusion, i.e., not using multi_template_buffer.
```python
# Define the matmul+relu function
def matmul_relu(x, y):
return torch.nn.functional.relu(torch.matmul(x, y))
# Compile with inline subgraph fusion enabled
@torch.compile
def compiled_matmul_relu(x, y):
return matmul_relu(x, y)
# Reset dynamo to ensure clean compilation
torch._dynamo.reset()
with config.patch(
{
"max_autotune": True,
# CRITICAL: These two flags enable inline subgraph fusion
"benchmark_epilogue_fusion": False, # Must be False for inline fusion!
"enable_inline_subgraph_fusion": True, # Enable inline fusion
}
):
# Compile and run
result = compiled_matmul_relu(a, b)
torch.cuda.synchronize()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165952
Approved by: https://github.com/PaulZhang12, https://github.com/eellison
### Summary
Adds a debug-level logging statement to torch.fx.Interpreter.run_node, as proposed in [#117351](https://github.com/pytorch/pytorch/issues/117351), to make FX graph execution traceable when debugging or instrumenting model transformations.
When debug logging is enabled, each executed node emits a single structured log line formatted via `LazyString(lambda: n.format_node())`, deferring string construction unless logging is active.
### Example Output
With `logging.DEBUG` enabled:
```
run_node x = x()
run_node add = _operator.add(x, 1)
run_node clamp = torch.clamp(add, min=0.0, max=5.0)
run_node output = output(clamp)
```
With `logging.DEBUG` disabled no additional output is produced (unchanged default behavior).
### Test Plan
Verified locally with Python 3.11 on macOS using a PyTorch build from source.
- With `logging.DEBUG` enabled: each node emits a debug log via LazyString.
- With `logging.DEBUG` disabled: no additional output.
- Confirmed all `Interpreter` tests pass locally:
`pytest test/test_fx.py -k "Interpreter"`
Updated the example output to reflect the new `_format_fx_node` helper and inclusion of `kwargs`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166622
Approved by: https://github.com/aorenste
Intended to make it easier to reuse this logic for processing operator arguments as IValues in following PR(s).
Testing: python test/test_python_dispatch.py (broke during development, seems to work now)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166368
Approved by: https://github.com/albanD
Summary: Inside the group_split API, we share the reference to the PG option with the parent PG if a PG option is not explicitly specified. This is bad because if we split the parent PG multiple times, we will run into errors.
Test Plan: UT + internal test.
Differential Revision: D86225394
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167125
Approved by: https://github.com/Skylion007
Fixes #160513
## The Problem Summary
The issue boiled down to the data type promotion logic. The code base has two different functions that deal with dtype promotion logic. If it is purely multi-dimensional tensor operations, the cpp code gets triggered, and that follows the numpy dtype promotion logic. That is why in #160513 NDim tensors are fine, as NDim dtypes get precedence. The issue came with python scalars and 0Dim tensors. When it detects "scalars", a python implementation of the dtype promotion logic gets triggered (torch/_prims_common/__init__.py:1544). Since this is in python, the implementation can't distinguish a wrapped number from a 0Dim tensor and thus will just take the highest dtype, which is the python double wrapped number.
## The Fix
The python implementation for dtype promotion had to know where the scalar came from. Once the scalar can be distinguished, the appropriate dtype can be set. The first approach was to try to expose the `is_wrapped_number` method, but this came with a big issue: during `forward_ad`, the derivatives of those scalars turn out to be `ZeroTensor`s. The `ZeroTensor` internally uses a hack to initialize a meta dtype tensor, which skips expensive dispatch operations, but the copy would not grab everything, especially the `is_wrapped_number_` property. I thought about modifying the copy, but that seemed to go against the spirit of what the copy was intended for; plus, the check for `is_wrapped_number_` requires `dim > 0`, and a scalar `ZeroTensor` is a meta dtype tensor, which complicates things.
So I chose the route of creating a new property called `was_wrapped_number` and exposed this property to the python tensor API. I had to modify the autograd code generation to set `was_wrapped_number` in the mul, add, and div operations in `VariableType.cpp`. Once this property was set, the dtype promotion logic could be updated to consider wrapped numbers and 0Dim numbers. Once that hierarchy was taken care of, the buggy behavior was fixed.
I wrote a new ops testing module, `TestForwardADWithScalars`. I saw that this bug was unique and required a new testing paradigm. It only tests multiply, add, and divide; I chose these because all operations boil down to these three operations.
[edit]: Just used the `efficientzerotensor` meta and converted that to a python number. Since the wrapped number is converted back to a python number, dtype promotion is preserved. This is achieved by setting the forward-grad zero tensor of a wrapped number with a wrapped-number flag, since the tangent of a wrapped number should still be a wrapped number. After that, this specific zerotensor is sent through as a meta type in `BinaryOps.cpp` to get the appropriate dtype for the resulting arithmetic.
@ezyang @OihanJoyot
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165784
Approved by: https://github.com/ezyang
Fixes #166746 by removing squeezes that caused shape mismatches when calling backward through `BCELoss(reduction='none')`.
Based on running these tests, it seems MPSGraph can handle inputs without squeezing.
```
python test/test_mps.py TestMPS -k test_bce
python test/test_mps.py TestConsistency -k binary_cross
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166786
Approved by: https://github.com/malfet
Slice knows how to handle an unbacked start; we do not need to offset start before calling slice, we can leave it to slice.
The only edge case is when start < 0 and start + length == 0; in that case slice and narrow would deviate.
For that case we shall pass dim_size instead of start + length.
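A concrete illustration of that edge case in eager mode:
```python
import torch

x = torch.arange(5)
# start < 0 and start + length == 0
print(torch.narrow(x, 0, -2, 2))  # tensor([3, 4])
# a naive slice with end = start + length == 0 would instead be empty:
print(x[-2:0])                    # tensor([], dtype=torch.int64)
```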
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166361
Approved by: https://github.com/aorenste
Summary:
- there are various places that access fr's `entries_` field
- if we empty the entries_ on reset, the accesses can result in an error
- so we only perform a soft delete instead of clearing out the entries completely
- only reset id_ on the reset
- keep track of a reset_epoch which increments every time reset is called
- dump_entries only returns entries from the latest epoch
- APIs that access entries also check if the reset epoch matches
- make the `next_` always track the index in the circular buffer - this change was needed to make the soft delete's implementation easier
---
Stack created with [Sapling](https://sapling-scm.com). Best reviewed with [ReviewStack](https://reviewstack.dev/pytorch/pytorch/pull/166970).
* #166972
* #166971
* __->__ #166970
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166970
Approved by: https://github.com/fduwjj
# Summary
This PR improves `torch.sort` and `torch.unique` performance by **15% to 50%** on NVIDIA GPUs by optimizing CUDA register allocation in radix sort operations.
The key change: specialize `OpaqueType<N>` to use native integer types (uint8_t, uint16_t, uint32_t, uint64_t) for common sizes (1, 2, 4, 8 bytes) instead of `char data[N]`. This enables more efficient register allocation while preserving the template deduplication strategy.
The following table shows the speedup on various input shapes and GPUs. Sorting is performed on the last dimension, and baseline torch version is 2.9.0.
| GPU | input shape | input dtype | **Before** **(ms)** | After (ms) | Speedup |
| ---- | ----------- | ----------- | ------------------- | ---------- | ------- |
| H100 | (16, 1e6) | int32 | 1.61 | 1.37 | 1.18× |
| H100 | (1, 1e8) | int32 | 6.6 | 5.0 | 1.3× |
| H20 | (16, 1e6) | int64 | 3.57 | 3.03 | 1.18× |
| H20 | (1, 1e8) | int64 | 19.3 | 13.0 | 1.48× |
# Analysis
`torch.sort` and `torch.unique` use `radix_sort_pairs`, which internally calls `cub::DeviceRadixSort::SortPairs`. Since values are only copied (never compared), we cast them to `OpaqueType<sizeof(value_t)>` to minimize template instantiations. For example, both `int32` and `float32` values map to the same `OpaqueType<4>.`
## The Problem
The previous `char data[N]` implementation causes inefficient register allocation. Here is one reason I found in the SASS code. For 8-byte types:
- `char data[8]`: the compiler may allocate 8 registers (one per byte)
- `uint64_t data`: the compiler allocates 2 registers (standard 64-bit handling)
This happens because the compiler doesn't recognize char[8] as a cohesive 64-bit value, treating each byte independently, which increases register pressure and reduces GPU occupancy.
From Nsight Compute, when using `char data[8]`, the register count per thread is 166 and the corresponding theoretical occupancy is 18.75%. When using native `uint64_t`, the register count per thread is 80 and the corresponding theoretical occupancy is 37.5%.
## The Solution
Specialize `OpaqueType<N>` for common sizes using native integer types:
```
// Before
template <int N> struct alignas(N) OpaqueType { char data[N]; };
// After
template <int N> struct alignas(N) OpaqueType { char data[N]; }; // fallback
template <> struct alignas(1) OpaqueType<1> { uint8_t data; };
template <> struct alignas(2) OpaqueType<2> { uint16_t data; };
template <> struct alignas(4) OpaqueType<4> { uint32_t data; };
template <> struct alignas(8) OpaqueType<8> { uint64_t data; };
```
This preserves the template deduplication strategy (all 8-byte types still use the same `OpaqueType<8>` instantiation) while enabling better register allocation.
# Testing & Compatibility
## Testing:
✅ Correctness tests pass for various input types (bfloat16, int32, float32, int64), shapes, and dimensions (1, 2, 3)
✅ Register usage reduction verified with NSight Compute
✅ Linter passes
## Compatibility:
✅ No API/ABI changes
✅ Template instantiation count unchanged
# Reference
For detailed analysis, please refer to my previous blog: [Performance Optimization of torch.sort on GPU](https://yywangcs.notion.site/Performance-Optimization-of-torch-sort-on-GPU-192fc9f5d8058018a1bec1efa35da3f9)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167094
Approved by: https://github.com/ngimel, https://github.com/Skylion007
Initial autotuning support for foreach kernels, giving a 4x improvement for some kernels in an internal workload. More improvements can surely be made here in the future. Removing num_warps from the definition to enable autotune support in generated wrapper code.
Before:
triton_for_fused_18.kd 🔍 | 4.986 ms | 4.986 ms | 2.493 ms | 2 |
triton_for_fused_6.kd 🔍 | 0.098 ms | 0.098 ms | 0.049 ms | 2 |
triton_for_fused_7.kd 🔍 | 0.036 ms | 0.036 ms | 0.018 ms | 2 |
After:
triton_for_fused_18.kd 🔍 | 1.273 ms | 1.273 ms | 0.636 ms | 2 |
triton_for_fused_6.kd 🔍 | 0.044 ms | 0.044 ms | 0.022 ms | 2 |
triton_for_fused_7.kd 🔍 | 0.024 ms | 0.024 ms | 0.012 ms | 2 |
num_warps=8 default due to https://github.com/pytorch/pytorch/blob/main/torch/_inductor/codegen/triton_combo_kernel.py#L374
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162053
Approved by: https://github.com/mlazos, https://github.com/naromero77amd, https://github.com/jeffdaily
Co-authored-by: Nichols A. Romero <nick.romero@amd.com>
Provides type coverage to torch/_dynamo/variables/dicts.py
Coverage report:
`mypy torch/_dynamo/variables/dicts.py --linecount-report /tmp/coverage_log`
Compare before to after - we go from 0 lines and 0 funcs covered to 1547 lines and 89 funcs covered
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167022
Approved by: https://github.com/Skylion007
Example below. You need to trace your function with DTensor inputs in order for the graph proxies to run on DTensor (and not the inner local tensor). You also need to run with `tracing_mode="fake"`, or with your own `FakeTensorMode`, to see the nice DTensor printing. If this doesn't feel very ergonomic then maybe we can find some better UX for printing a graph with DTensor in it:
<img width="1446" height="582" alt="image" src="https://github.com/user-attachments/assets/99ea5ce6-1008-4ba5-b58e-542cd34a340b" />
```
import torch
from torch.testing._internal.distributed.fake_pg import FakeStore
from torch.distributed.tensor import distribute_tensor, Shard, Replicate
from torch.utils._debug_mode import DebugMode
from torch.fx.experimental.proxy_tensor import make_fx
from torch.utils._python_dispatch import TorchDispatchMode
from torch.utils import _pytree as pytree
world_size = 8
device_type = "cpu"
fake_store = FakeStore()
torch.distributed.init_process_group("fake", store=fake_store, rank=0, world_size=world_size)
device_mesh = torch.distributed.init_device_mesh(device_type, (world_size,))
dim = 128
A = torch.randn(8, dim)
B = torch.randn(dim, dim)
dA = distribute_tensor(A, device_mesh, [Shard(0)]).requires_grad_()
dB = distribute_tensor(B, device_mesh, [Replicate()]).requires_grad_()
def f(dA, dB):
    dy = dA @ dB
    loss = dy.sum()
    loss.backward()
    return dA.grad, dB.grad
# We actually need the tracing_mode='fake' here, or to trace under a FakeTensorMode.
# make_fx has some logic to ensure we don't accidentally stash real tensors in the graph
# so we won't stash our DTensors properly if they don't hold Fake inner tensors
gm = make_fx(f, tracing_mode='fake')(dA, dB)
# DCE isn't necessary here, there were just a lot of dead detach() nodes that spammed the graph
gm.graph.eliminate_dead_code()
gm.recompile()
gm.print_readable(colored=True)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166750
Approved by: https://github.com/ezyang, https://github.com/wconstab, https://github.com/Skylion007
We are working on a translation from as_strided to view operations, but
only when the as_strided is representable as a plain view. A useful
testing utility in this situation is the ability to enumerate all valid
views on an original tensor. So we have a small test here that shows
it is possible.
To avoid an explosion of states, we don't handle permutes and size=1,
which are degenerate cases (you can always do a single permute and
a series of unsqueezes to get to the final desired state.)
Authored with claude code assistance.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167076
Approved by: https://github.com/albanD
ghstack dependencies: #166868, #166867
Fixes #161366
All 4 combinations of input dimensions are supported:
2d-2d, 2d-3d, 3d-3d, 3d-2d. The corresponding test cases in test_matmul_cuda pass
for both the forward and backward pass.
The CK path is enabled for gfx942, gfx950.
ToDo: Need to enable support on gfx90a, since the CK kernel used in this commit produces a GPU error there;
based on the profiler results on gfx90a, it might require a different CK kernel config.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166334
Approved by: https://github.com/atalman
When a fn compiled with `torch.compile` calls `.item()` on a float tensor arg (e.g., for thresholds in `torch.clamp`), the generated triton kernel references an unbacked float symbol (e.g., `zuf0`) that was never added to the kernel's parameter list, causing a compilation error.
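A minimal sketch of the failing pattern, with illustrative names, shapes, and threshold values that are not taken from the original issue (run on a GPU build to hit the Triton codegen path):
```python
import torch

# Needed so .item() produces an unbacked symbol instead of a graph break.
torch._dynamo.config.capture_scalar_outputs = True

@torch.compile(fullgraph=True)
def clamp_with_tensor_thresholds(x, lo_t, hi_t):
    lo = lo_t.item()  # unbacked float symbol (e.g. zuf0) at compile time
    hi = hi_t.item()
    return torch.clamp(x, min=lo, max=hi)

x = torch.randn(8)
out = clamp_with_tensor_thresholds(x, torch.tensor(-0.5), torch.tensor(0.5))
```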
Fixes: #166888
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166890
Approved by: https://github.com/eellison
Then we can point a ClickHouse ingestor at this S3 path and get the results into ClickHouse while the job is running.
Use filelock to make sure each JSON is uploaded only once, so we don't end up with duplicates in ClickHouse.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166988
Approved by: https://github.com/izaitsevfb
For [this repro](https://gist.github.com/eellison/75a99616a0fcca0436316bbfd8987fae), this enables fusion of `to_blocked` with the prior `to_mx` calculation, so that there is only a single kernel per tensor, resulting in a 10% speedup of the non-conversion code (I need to update my local devserver to 12.9 to time the matmul as well).
The `to_mx` kernel has a contiguous write:
```Py
op6_op7: FusedSchedulerNode(SchedulerNode,SchedulerNode)
op6_op7.writes = [MemoryDep('buf6', c0, {c0: 2097152}), MemoryDep('buf7', c0, {c0: 67108864})]
op6_op7.unmet_dependencies = []
op6_op7.met_dependencies = [MemoryDep('arg1_1', c0, {c0: 67108864})]
op6_op7.outputs = [
buf6: ComputedBuffer
buf6.layout = FixedLayout('cuda:0', torch.float32, size=[8192, 256], stride=[256, 1])
buf6.users = [
NodeUser(node=SchedulerNode(name='op7'), can_inplace=False, is_weak=False),
NodeUser(node=SchedulerNode(name='op9'), can_inplace=False, is_weak=False),
]
buf7: ComputedBuffer
buf7.layout = FixedLayout('cuda:0', torch.float8_e4m3fn, size=[8192, 256, 32], stride=[8192, 32, 1])
buf7.users = [NodeUser(node=ExternKernelSchedulerNode(name='op10'), can_inplace=False, is_weak=False)]
]
```
While the `to_blocked` has a single discontiguous read and a single contiguous write.
```Py
op9: SchedulerNode(ComputedBuffer)
op9.writes = [MemoryDep('buf9', c0, {c0: 2097152})]
op9.unmet_dependencies = [ MemoryDep('buf6', 32768*((c0//32768)) + 8192*(((ModularIndexing(c0, 1, 16))//4)) + 256*(ModularIndexing(c0, 16, 32)) + 4*(ModularIndexing(c0, 512, 64)) + (ModularIndexing(ModularIndexing(c0, 1, 16), 1, 4)), {c0: 2097152})]
op9.met_dependencies = []
op9.outputs = [
buf9: ComputedBuffer
buf9.layout = FixedLayout('cuda:0', torch.float8_e8m0fnu, size=[2097152], stride=[1])
buf9.users = [NodeUser(node=ExternKernelSchedulerNode(name='op10'), can_inplace=False, is_weak=False)]
]
```
To enable fusion, we invert the read, giving op9 a contiguous read and a discontiguous write. More explanation here: https://gist.github.com/eellison/6f9f4a7ec10a860150b15b719f9285a9
[Tlparse with this optimization](https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/eellison/custom/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000).
[Tlparse without this optimization](https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/eellison/custom/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161404
Approved by: https://github.com/shunting314
# Summary
This PR fixes an issue where `torch.fx.Interpreter.boxed_run` would silently ignore extra input arguments instead of validating the argument count.
Previously, `boxed_run` would only consume as many inputs as there were placeholder nodes and then clear the entire `args_list`, hiding potential bugs. This change introduces a strict check to ensure `len(args_list)` matches the number of placeholder nodes, raising a `RuntimeError` on a mismatch.
Fixes #166583.
# Changes
* Validate `len(args_list)` against the number of placeholder nodes at the beginning of `boxed_run`.
* Raise a `RuntimeError` with a clear message ("extra arguments" or "missing arguments") if the counts do not match.
* Move `args_list.clear()` to only execute after successful validation and environment setup. If an error is raised, `args_list` is preserved for debugging.
# Testing
* Added `test_interpreter_boxed_run_argument_validation` to `test/test_fx.py`.
* This test covers three scenarios:
1. Correct number of arguments (succeeds, `args_list` is cleared).
2. Extra arguments (raises `RuntimeError`, `args_list` is preserved).
3. Missing arguments (raises `RuntimeError`, `args_list` is preserved).
# User-facing impact / BC notes
This is a bug fix. Code that was incorrectly passing the wrong number of arguments to `boxed_run` will now fail fast with a `RuntimeError` instead of executing silently with unintended inputs. Correctly written code is unaffected.
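A hedged usage sketch of the new behavior; the traced function and tensors are illustrative, not taken from the test suite:
```python
import torch
from torch.fx import Interpreter, symbolic_trace

def f(x, y):
    return x + y

interp = Interpreter(symbolic_trace(f))

args = [torch.randn(2), torch.randn(2), torch.randn(2)]  # one extra argument
try:
    interp.boxed_run(args)
except RuntimeError as exc:
    # args is preserved for debugging instead of being silently cleared
    print(f"boxed_run rejected the call: {exc}; {len(args)} args still available")
```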
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166784
Approved by: https://github.com/ezyang, https://github.com/xmfan
This change introduces LocalRunnerMode, which allows you to run multiple
SPMD functions concurrently. SPMD functions execute one at a time,
yielding execution while waiting for send or receive operations
to complete. Send and receive peer operations are only supported while running
under LocalRunnerMode.
The example test in this change demonstrates how ranks are sending data
to the next peer and receiving data from the previous peer (ring).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166595
Approved by: https://github.com/wconstab, https://github.com/ezyang
## The bug
In some environments, if run:
```py
def inner_func(x):
    return x.to(torch.float32, memory_format=torch.channels_last)

x = torch.randn(2, 2, 3, 4, device="cpu", dtype=torch.float64)
torch.vmap(inner_func)(x)
```
we get:
```
E RuntimeError: Batching rule not implemented for aten::to.dtype_layout; the fallback path doesn't work on out= or view ops.
```
Otherwise, it would always fall back and result in an error for ops like `to.dtype` and `to.dtype_layout`, even though the kernels are registered.
## The cause
The alias key of `FuncTorchBatchedDecomposition` is not properly translated to runtime dispatch keys when updating the dispatch table of `OperatorEntry::dispatchTable_`. [[link](984b096d10/aten/src/ATen/core/dispatch/OperatorEntry.cpp (L500-L501))]
The [`getRuntimeDispatchKeySet`](f3fa560dec/c10/core/DispatchKeySet.cpp (L62)) uses if-else to translate all other alias keys but `FuncTorchBatchedDecomposition`.
This would result in not finding the kernel in many cases.
## The fix
This PR adds one more `if` statement to `getRuntimeDispatchKeySet` to map `FuncTorchBatchedDecomposition` to the corresponding runtime dispatch key, `FuncTorchBatched`.
So, that the dispatch table can be properly updated.
This fix allows people to use ops inside vmaps in more environments and across more compilers.
## Why does it work without the PR
As long as the `FuncTorchBatchedDecomposition` [[link](51319ca090/aten/src/ATen/functorch/BatchRulesDecompositions.cpp (L35))]
is registered before the fallback method of `FuncTorchBatched` [[link](d311a3d1dc/aten/src/ATen/functorch/LegacyBatchingRegistrations.cpp (L759))], everything runs fine.
In this case, it relies on the registration of the fallback method to update the dispatch table, which flushes all the kernels in `OperatorEntry::kernels_` into `dispatchTable_`, among which there are kernels registered with `FuncTorchBatchedDecomposition`.
## When does it fail
However, the order of the op registration and the fallback registration is not guaranteed at all.
It relies on the C++ static initialization order, which varies from environment to environment.
On our compiler, the fallback registration goes first and the alias-key kernels under `FuncTorchBatchedDecomposition` come later, so they do not get flushed into the dispatch table by the fallback registration.
Therefore, the dispatcher cannot find the kernel.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166032
Approved by: https://github.com/albanD
Provides type coverage to torch/_dynamo/variables/ctx_manager.py
Coverage report:
`mypy torch/_dynamo/variables/ctx_manager.py --linecount-report /tmp/coverage_log`
Compare before to after - we go from 0 lines and 0 funcs covered to 1541 lines and 144 funcs covered
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166878
Approved by: https://github.com/Skylion007
Fixes [#166850](https://github.com/pytorch/pytorch/issues/166850)
- Create a default `.github/copilot-instructions.md` file (used Claude Sonnet 4.5 in Copilot).
- Add `.github/copilot-instructions.md` to the `.gitignore` file.
The prompt used is below, which is preset by Copilot:
```
Analyze this codebase to generate or update `.github/copilot-instructions.md` for guiding AI coding agents.
Focus on discovering the essential knowledge that would help an AI agents be immediately productive in this codebase. Consider aspects like:
- The "big picture" architecture that requires reading multiple files to understand - major components, service boundaries, data flows, and the "why" behind structural decisions
- Critical developer workflows (builds, tests, debugging) especially commands that aren't obvious from file inspection alone
- Project-specific conventions and patterns that differ from common practices
- Integration points, external dependencies, and cross-component communication patterns
Source existing AI conventions from `**/{.github/copilot-instructions.md,AGENT.md,AGENTS.md,CLAUDE.md,.cursorrules,.windsurfrules,.clinerules,.cursor/rules/**,.windsurf/rules/**,.clinerules/**,README.md}` (do one glob search).
Guidelines (read more at https://aka.ms/vscode-instructions-docs):
- If `.github/copilot-instructions.md` exists, merge intelligently - preserve valuable content while updating outdated sections
- Write concise, actionable instructions (~20-50 lines) using markdown structure
- Include specific examples from the codebase when describing patterns
- Avoid generic advice ("write tests", "handle errors") - focus on THIS project's specific approaches
- Document only discoverable patterns, not aspirational practices
- Reference key files/directories that exemplify important patterns
Update `.github/copilot-instructions.md` for the user, then ask for feedback on any unclear or incomplete sections to iterate.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166864
Approved by: https://github.com/malfet
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Summary: In certain scenarios, such as when the first stride is 0, the entire tensor may not be contiguous, but the 2D matrix within each batch can still be contiguous, allowing us to apply max autotune. This diff specifically checks for contiguity within the 2D matrix of each batch, and enables more uses of the cpp bmm template.
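An illustration of the situation described (this mirrors the idea, not the actual check in the diff): a batched tensor whose first stride is 0 is not contiguous overall, yet every per-batch 2D matrix is.
```python
import torch

mat = torch.randn(64, 32)                      # contiguous 2D matrix
batched = mat.unsqueeze(0).expand(8, 64, 32)   # stride (0, 32, 1), batch stride is 0

print(batched.is_contiguous())     # False: the whole tensor is not contiguous
print(batched[0].is_contiguous())  # True: the 2D matrix within each batch is
```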
Differential Revision: D84561331
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165469
Approved by: https://github.com/desertfire
Some users at pytorch conference were asking me about whether it is safe to share a memory pool among cuda graphs that never run concurrently, but may run in arbitrary order, if they don't depend upon each other's output. Even though your capture order doesn't match replay order in this situation, this is safe. However, our documents confusingly said this wasn't allowed. This update is intended to help with that. Since vLLM essentially depends upon this behavior, I call it out specifically.
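A hedged sketch of the pattern now called out as safe (assuming a CUDA build; the captured ops are illustrative): two graphs captured into the same pool, never run concurrently, and not consuming each other's outputs.
```python
import torch

pool = torch.cuda.graph_pool_handle()
x = torch.randn(16, device="cuda")

g1, g2 = torch.cuda.CUDAGraph(), torch.cuda.CUDAGraph()
with torch.cuda.graph(g1, pool=pool):
    y1 = x * 2
with torch.cuda.graph(g2, pool=pool):
    y2 = x + 1

# Replay order need not match capture order, as long as replays don't overlap.
g2.replay()
g1.replay()
```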
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166975
Approved by: https://github.com/eellison, https://github.com/BoyuanFeng
Slice knows how to handle an unbacked start, so we do not need to offset start before calling slice; we can leave that to slice.
The only edge case is when start < 0 and start + length == 0: in that case slice and narrow would deviate,
so we pass dim_size instead of start + length.
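A worked illustration of that edge case (the sizes are illustrative):
```python
import torch

x = torch.arange(10)
print(x.narrow(0, -3, 3))   # tensor([7, 8, 9])
# With end = start + length, the equivalent slice would be x[-3:0], which is empty,
# so when start < 0 and start + length == 0 the slice end must be dim_size instead:
print(x[-3:10])             # tensor([7, 8, 9])
```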
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166361
Approved by: https://github.com/aorenste
Fixes https://github.com/pytorch/pytorch/issues/166900
Implementation notes:
- I tried to disallow guard generation before side effect application in order to futureproof improper guard generation. However, this was not feasible since it is possible to realize lazy VTs while generating side effects (e.g. realizing a constant variable that is used in a deque update).
- `codegen_save_tempvars` now generates `TempLocalSource` for created temporary variables, so that they won't get confused with `LocalSource` - we should error out when we attempt to create guards for `TempLocalSource`. I considered using `SyntheticLocalSource`, but that has additional `subguards_allowed` behavior that we may not want to have for temp variables.
- We moved the guard installation for constant user-defined pytree objects from `as_python_constant` to `__init__`. Objects created outside the compile-region will be guarded, while objects created inside the compile-region will not be guarded.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166917
Approved by: https://github.com/anijain2305
Fixes #159445
### Summary
- Fixed a stride layout issue in the `torch.linalg.eig` meta kernel that prevented successful compilation with the inductor backend. The meta kernel was producing incorrect row-major strides.
- LAPACK/BLAS libraries (underlying implementation) expect column-major layout
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162484
Approved by: https://github.com/isuruf
The major change is to switch to a timer-based implementation. Additionally,
we get rid of the context manager for turning off the compile pool. We
still have the warmup calls.
Note that this only modifies the async_compile methods, the fx pool is
left running.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166581
Approved by: https://github.com/masnesral
ghstack dependencies: #166467
The deprecation warning led to warning spamming in PyTorch APIs, like
torch.compile. This is not how a deprecation warning should go: if we
add a deprecation warning, we'd better update our built-in APIs to
prevent warning spam.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166956
Approved by: https://github.com/albanD
We store a mapping between generated fx graph code and original model code stack trace in `fx.traceback._FX_METADATA_REGISTRY`. And we do a post-processing on the memory snapshot to append the original model stack trace information.
To achieve this, the biggest change we had to do in `aot_eager` mode is to give each generated fx graph a unique stack trace, i.e. it cannot just be `<eval_with_key>`. We set `co_filename` to **pretend** that the code comes from a file with that name. Now instead of `<eval_with_key>` in the stack trace, we get something like `fx_generated_3a4b5c6d7e8f9a0.py`.
`augment_with_fx_traces` arg is added to `torch.cuda.memory._snapshot` and `_dump_snapshot`. When the arg is set to True, a post-processing will run to populate the original model stack trace to the snapshot frames.
The new behavior of GraphModule can be controlled by `TORCH_ENRICH_RPOFILER_STACK_TRACE` or `_dynamo.config.enrich_profiler_metadata=True`.
Alternative:
Instead of setting co_filename, we can also do it like below:
Note that if we do it this way, we will need to dump the file to make the graph module torch-scriptable. TorchScript requires source access in order to carry out compilation, so we need to make sure original .py files are available.
```
key = filename
globals_copy = globals.copy()
globals_copy["__file__"] = key
globals_copy["__name__"] = key
linecache.lazycache(key, globals_copy)
exec(compile(src, key, "exec"), globals)
```
Other changes:
- Update `MemoryViz.js` to display fx node information and original model code if exist
```
python test/test_fx.py -k test_lineno_map
python test/test_fx.py -k test_custom_traceback_raised
python test/test_public_bindings.py
python test/test_cuda.py -k test_fx_memory
python test/test_fx.py -k test_informative_co_filename
python test/test_fx.py -k test_autowrap_functions
python test/dynamo/test_utils.py -k test_inductor_provenance
```
```python
# Profile with memory snapshot
torch.cuda.memory._record_memory_history()
with torch._dynamo.config.patch("enrich_profiler_stack_trace", True):
    compiled = torch.compile(mod, backend="aot_eager", fullgraph=True)
    result = compiled(torch.randn(10, 10, device="cuda:0"))
torch.cuda.memory._dump_snapshot("memory_snapshot.pickle", augment_with_fx_traces=True)
torch.cuda.memory._record_memory_history(enabled=None)
```
<img width="913" height="711" alt="Screenshot 2025-10-30 at 10 40 44 AM" src="https://github.com/user-attachments/assets/8d7a1833-f98d-4756-b666-1d63ab57b27b" />
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166676
Approved by: https://github.com/albanD, https://github.com/ezyang
Results from CI:
No failures but generally takes longer, maybe ~20% increase in time?
But the smaller runner is ~25% of the cost of the current runner, so in terms of cost this is a decrease
If the 20% is too much, we can try the 4x larger runners, which are about half the cost of the current runner, so it would probably still result in cost savings with hopefully less impact to time
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164989
Approved by: https://github.com/BoyuanFeng, https://github.com/huydhn
Summary:
cuBlasLt enforces size/stride requirements for 1x128 and 128x128 blockwise scaling
kernels, some of which weren't being handled, causing silent incorrect
answers especially for 128x128 scaling cases.
For deepseek-style scaling ([docs](https://docs.nvidia.com/cuda/cublas/#scaling-factors-layouts)), cuBlasLt enforces the following
for `A: MxN`, `B: KxN`:
```Py
L = K // 128
L4 = round_up(L, 4)
1x128 x 128x128:
* A_scale: [M, K // 128], stride: [1, M]
* B_scale: [L4, N // 128], stride: [1, L4]
128x128 x 1x128:
* A_scale: [L4, M // 128], stride: [1, L4]
* B_scale: [N, K // 128], stride: [1, N]
1x128 x 1x128:
* A_scale: [M, K // 128], stride: [1, M]
* B_scale: [N, K // 128], stride: [1, N]
```
Notable here is the `L4` term, which means that we must round up to the nearest multiple of 4 blocks
in the `K` dimension. This wasn't enforced previously, and caused silent wrong answers
where `(K // 128) % 4 != 0`.
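A small worked example of the `L4` rounding described above (the `round_up` helper and the numbers are illustrative):
```python
def round_up(x: int, multiple: int) -> int:
    return ((x + multiple - 1) // multiple) * multiple

K = 640
L = K // 128           # 5 scale blocks along K
L4 = round_up(L, 4)    # 8 -> the scale tensor must cover 8 blocks, not 5
print(L, L4)           # 5 8
# (K // 128) % 4 == 1 != 0, i.e. exactly the case that previously produced wrong results
```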
@vkuzo
Signed-off-by: Simon Layton <simonlayton@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166752
Approved by: https://github.com/drisspg, https://github.com/vkuzo
Fixes #164684
### Description
Symbolic tracing fails during multiplication between a `SymBool` and a `Tensor`. This scenario is triggered when `.item()` is called on a 0-dim boolean tensor within a `torch.compile` region. In compile mode, this yields a `SymBool`, and the subsequent `SymBool * FakeTensor` operation is unsupported, leading to a `TypeError` or a data-dependent `UserError`.
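A minimal sketch of the triggering pattern (illustrative shapes and values; with the fix, this compiles and matches eager):
```python
import torch

torch._dynamo.config.capture_scalar_outputs = True  # so .item() stays symbolic

@torch.compile(fullgraph=True)
def f(flag_t, x):
    flag = flag_t.item()   # SymBool in the compiled region
    return flag * x        # SymBool * Tensor: previously hit the type-promotion error

x = torch.randn(4)
print(f(torch.tensor(True), x))
```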
### Solution
This PR addresses the issue at the type-conversion level, as suggested by reviewers.
The root cause of the `TypeError` is that `torch.sym_float()` (which is called by `_maybe_convert_to_dtype` during type promotion for `aten.mul`) lacks a conversion path for `SymBool` and incorrectly falls back to `builtins.float(SymBool)`.
This fix addresses this by implementing the `__sym_float__(self)` method within the `SymBool` class (defined in `torch/__init__.py`).
The `torch.sym_float(a)` utility function is already designed to check for `hasattr(a, "__sym_float__")` before falling back to `builtins.float()`. By adding this method, `SymBool` instances now correctly advertise their ability to be cast to `SymFloat`. The new method implementation leverages `self.node.sym_float()` to correctly convert the symbolic boolean value to its symbolic float representation (0.0 or 1.0), resolving the `TypeError` at its source.
This approach is more fundamental than modifying a specific operation in `builtin.py` and ensures `SymBool` can be correctly promoted to `SymFloat` in any operation, while still preserving its boolean nature for control flow operations like `guard_or_false` (which is verified by a new test case).
### Verification
1. **Bug Reproduced**: The initial `UserError: Could not guard on data-dependent expression` was successfully reproduced with the script from the issue. As shown below
<img width="1369" height="945" alt="Screenshot 2025-10-13 at 10 29 05" src="https://github.com/user-attachments/assets/8daa4555-3347-4af5-906a-02150b8df9d1" />
2. **Fix Validated**: After applying the code changes, the same script now runs to completion, printing `✅ eager success` and `✅ compile success`. As shown below
<img width="1228" height="82" alt="Screenshot 2025-10-13 at 10 29 21" src="https://github.com/user-attachments/assets/94c4f143-b898-4dda-9bff-0ad5450a30fa" />
3. Added a new test class DynamoOpPromotionTests to test/dynamo/test_misc.py with three new test cases:
1. test_symbool_tensor_mul_does_not_fail: Verifies that the original bug report code (with .item() + *) no longer raises an error when compiled.
2. test_symbool_guard_or_false: Verifies that this fix does not cause a regression for guard_or_false(SymBool) (the concern raised by reviewers).
3. test_symbool_tensor_mul: Verifies the behavior of Tensor(bool) * Tensor(float) (without .item()) for completeness.
All new tests were added and pass locally.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165264
Approved by: https://github.com/laithsakka, https://github.com/Lucaskabela
# Motivation
This PR introduces support for peer-to-peer (P2P) access between devices, including querying and enabling P2P connections between two devices.
It supports two categories of allocations:
- Regular allocations;
- Expandable segment allocations.
# Additional Context
The follow-up is that we should use this feature to optimize our copy kernel when P2P is supported.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166424
Approved by: https://github.com/gujinghui, https://github.com/albanD
ghstack dependencies: #166299, #166292
# Motivation
This PR intends to add expandable segment feature support on XPU. This will help
- Reduce memory fragmentation;
- Gradually map physical pages into virtual address space as needed.
# Additional Context
The traditional caching allocator frequently allocates and frees device memory blocks. However, over time, with varying tensor size, the device address space becomes fragmented. Even when there's enough total free memory, a lack of contiguous space can cause large allocations to fail.
The **expandable segment** feature addresses this by dynamically extending physical memory within a reserved virtual address range, reducing fragmentation and minimizing reallocation overhead.
The potential drawbacks are
- Virtual memory overhead;
- Potential page mapping overhead;
- Increased complexity.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166292
Approved by: https://github.com/albanD, https://github.com/EikanWang, https://github.com/gujinghui
ghstack dependencies: #166299
Fixes#159139
## The Cause
The bug occurs because the `OptimizedModule` wrapper in `torch._dynamo.eval_frame` doesn't implement the `__len__` method. This causes Python's `bool()` check to fall back to the default object truthiness (always `True`) instead of correctly evaluating containers with `len() == 0` as `False`.
## The Fix
A very easy fix: I just added the `__len__` method to the `OptimizedModule` class in `torch._dynamo.eval_frame` to delegate the call to the original module:
```python
def __len__(self):
    """
    Proxy the len() call to the original module to fix truthiness checks.
    """
    return len(self._orig_mod)
```
This successfully fixes the issue. The script now works as expected.
## Reproduction Script
```python
import torch
import torch.nn as nn
# Create an empty nn.ModuleList
original = nn.ModuleList()
# Compile it using torch.compile
compiled = torch.compile(original)
# Compare their boolean evaluations
print(f"bool(original): {bool(original)}")
print(f"bool(compiled): {bool(compiled)}")
# Trigger failure if they differ
assert bool(original) == bool(compiled), "BUG: truthiness behavior mismatch after compilation"
```
## Output
bool(original): False
bool(compiled): False
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159208
Approved by: https://github.com/Lucaskabela
Co-authored-by: pushkar-hue <pushkarsharma.rtm@gmail.com>
Co-authored-by: Lucas Kabela <lucasakabela@gmail.com>
Provides type coverage to torch/_dynamo/variables/builtin.py
### Coverage report:
`mypy torch/_dynamo/variables/builtin.py --linecount-report /tmp/coverage_log`
Compare before to after - we go from 2213 lines and 64 funcs covered to 3212 lines and 85 funcs covered
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166745
Approved by: https://github.com/williamwen42
Fixes#163149
### Summary:
Fixes mypy type checking failures in `test_type_hints` by consolidating typing imports and eliminating duplicate/conflicting import patterns that caused mypy to fail resolving type annotations.
### Impact:
- `test_type_hints` works fine now
- module: tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163150
Approved by: https://github.com/Skylion007
Summary:
Conversion from/to bfloat16 was not getting covered by the conversion templates, because these used bfloat16_t as the data type instead of the custom c10::BFloat16.
Conversion by casting from/to bfloat16_t is broken in clang-[17, 20] and fixed in clang-21.
Because PyTorch does not currently have CI running binaries compiled using clang-21, we won't implement this approach for now.
We are currently only adding conversion from bfloat16, as it can be implemented by zero-extending into a 4-byte float.
We've observed the following performance improvements when compiling with clang-19 and targeting armv9a+sve2:
Before:
bfloat16_t->uint8 ===> 423.583us
bfloat16_t->int8 ===> 424.090us
bfloat16_t->int16 ===> 430.817us
bfloat16_t->int64 ===> 571.547us
bfloat16_t->double ===> 459.089us
After:
bfloat16_t->uint8 ===> 123.783us ----> 342% higher throughput
bfloat16_t->int8 ===> 131.575us -----> 322% higher throughput
bfloat16_t->int16 ===> 136.794us ----> 315% higher throughput
bfloat16_t->int64 ===> 177.699us ----> 322% higher throughput
bfloat16_t->double ===> 165.556us ---> 277% higher throughput
Test Plan:
Correctness:
buck2 test mode/opt //caffe2/test:test_ops
buck2 test mode/opt //caffe2/test:torch
Performance:
buck2 run mode/opt //caffe2/benchmarks/operator_benchmark/fb:operator_benchmark_test
Differential Revision: D86119613
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166880
Approved by: https://github.com/mcfi, https://github.com/aditew01
**Summary:** Just like in fully_shard, I added two overload replicate functions. The `@overload` declarations are necessary because the `@contract` decorator uses `ParamSpec` to capture function parameters, which creates a generic `_ContractFn` protocol signature (`*args: _P.args, **kwargs: _P.kwargs`) that Pyrefly cannot properly type-check when calling the function with explicit keyword arguments. In addition, to make the api cleaner I changed device_mesh input argument to mesh to match fully_shard formatting.
**Test Cases**
1. pytest test/distributed/_composable/test_replicate_with_fsdp.py
2. pytest test/distributed/_composable/test_replicate_training.py
3. pytest test/distributed/_composable/test_composability/test_pp_composability.py -k test_replicate_pp
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166459
Approved by: https://github.com/weifengpy
ghstack dependencies: #166433
Summary:
https://github.com/pytorch/pytorch/pull/166609 updated `is_impure` check to now check ops inside a subgraph to decide whether a `call_module` node is pure or not.
This change of behavior affects dead code elimination, commonly run as `gm.graph.eliminate_dead_code()`. Specifically, dead code elimination will not erase a node that has no users if this node has side effect or is impure. With above mentioned pr, dead code elimination no longer eliminates unused subgraphs that contain side-effectful ops.
This affects `const_fold.split_const_subgraph`, what this function does is:
1. split a graph into two submodules, one containing all const ops and one containing non-const ops
2. inline the submodule containing non-const ops back to main graph.
3. run dead code elimination to remove the unused non-const submodule.
With pr #166609 step 3 no longer erases the unused module. As an example, exported graph
```
graph():
%x : [num_users=2] = placeholder[target=x]
%_guards_fn : [num_users=0] = call_module[target=_guards_fn](args = (%x,), kwargs = {})
%empty_permuted : [num_users=1] = call_function[target=torch.ops.aten.empty_permuted.default](args = ([5, 10], [0, 1]), kwargs = {device: cpu, pin_memory: False})
%bernoulli : [num_users=1] = call_function[target=torch.ops.aten.bernoulli.p](args = (%empty_permuted, 0.6), kwargs = {})
%mul : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%x, %bernoulli), kwargs = {})
%div : [num_users=1] = call_function[target=torch.ops.aten.div.Tensor](args = (%mul, 0.6), kwargs = {})
return (div,)
```
After running const_fold, empty_permuted is const-folded, the rest of ops are not, and the main graph looks like
```
graph():
%x : [num_users=3] = placeholder[target=x]
%_fx_const_folded_attrs : [num_users=2] = get_attr[target=_FX_CONST_FOLDED_ATTRS]
%_guards_fn : [num_users=0] = call_module[target=_guards_fn](args = (%x,), kwargs = {})
%bernoulli_p : [num_users=1] = call_function[target=torch.ops.aten.bernoulli.p](args = (%_fx_const_folded_attrs, 0.6), kwargs = {})
%mul_tensor : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%x, %bernoulli_p), kwargs = {})
%div_tensor : [num_users=1] = call_function[target=torch.ops.aten.div.Tensor](args = (%mul_tensor, 0.6), kwargs = {})
%submod_1 : [num_users=0] = call_module[target=submod_1](args = (%x, %_fx_const_folded_attrs), kwargs = {})
return (div_tensor,)
```
`submod_1` is dangling, unused, and just inlined into the graph.
## Fix
This pr updates `const_fold._inline_module` function to explicitly remove the non-const submodule which is unused, after it has inlined the submodule's ops into main graph.
Test Plan:
Added a test in `test_fx_const_fold.py`.
The test would have failed before this PR because it yields the above example graph, leaving an unused `call_module[target=submod_1]` op.
With the PR, the module is erased from main graph correctly.
Differential Revision: D86056354
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166871
Approved by: https://github.com/blaine-rister, https://github.com/mlazos
**Summary:** I have created a new composable replicate API that's integrated into FSDP's codebase with minimal changes. The key changes I made are: when we use DDPMeshInfo, we use Replicate placements, prevent initial sharding of parameters, and set the world size to 1 to skip all-gathers and reduce-scatters.
**Test Cases**
1. pytest test/distributed/_composable/test_replicate_training.py
2. pytest test_pp_composability.py
3. pytest test_replicate_with_fsdp.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166433
Approved by: https://github.com/weifengpy
Summary:
DCP checkpoint background process currently determines the port used for pg via get_free_port().
During checkpoint background process initialization, gloo pg init occasionally times out on the first call but succeeds in a subsequent call.
We hypothesized that the timeouts are related to the port being used, and the solution would be to create the pg with PrefixStore and reuse the master port.
This diff adds the option for checkpoint background process to use PrefixStore with MASTER_ADDR + MASTER_PORT.
The default behavior is unchanged. Enabling the new PrefixStore behavior requires setting "DCP_USE_PREFIX_STORE" env var to "1".
context:
https://fb.workplace.com/groups/319878845696681/permalink/1516883985996155/
Differential Revision: D84928180
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166560
Approved by: https://github.com/meetv18
Fixes#166253
## Summary
When `torch.full` is called with a 0-D tensor as `fill_value` inside a `torch.compile`'d function, the value was being incorrectly cached, causing subsequent calls with different values to return the first value.
## Root Cause
The Dynamo handler for `torch.full` was calling `aten._local_scalar_dense` to convert tensor fill_values to Python scalars at compile time, which baked the value into the compiled graph as a constant.
## Solution
Modified the Dynamo handler to decompose `torch.full(size, tensor_fill_value)` into `empty(size).fill_(tensor_fill_value)` when `fill_value` is a `TensorVariable`, keeping the fill value dynamic in the compiled graph.
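A hedged sketch of the symptom and the post-fix behavior (values are illustrative):
```python
import torch

@torch.compile
def make_full(fill):
    return torch.full((2, 2), fill)

print(make_full(torch.tensor(1.0)))  # filled with 1.0
print(make_full(torch.tensor(2.0)))  # was incorrectly 1.0 before the fix; now 2.0
```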
## Testing
Added test case that verifies torch.full works correctly with dynamic tensor fill_values across multiple calls and dtypes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166554
Approved by: https://github.com/Lucaskabela
Summary:
make_fx() will register tensor constants as new buffers while tracing a shuffle graph for dynamo graph capture. This breaks the invariant that the resulting graph looks identical to the original eager model in terms of state dict.
So we need to de-register the buffers and set them as plain tensor constants.
Test Plan:
pytest test/export/test_experimental.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166777
Approved by: https://github.com/tugsbayasgalan
ghstack dependencies: #166775, #166776
Summary:
dict_keys_getitem can show up in the bytecode, but it uses dict.keys(), which is not fx traceable.
fx.wrap should make it a standalone function in the graph to be invoked later with real inputs.
Test Plan:
pytest test/export/test_experimental.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166776
Approved by: https://github.com/jamesjwu
ghstack dependencies: #166775
Summary:
As titled, we should return the entire tracing_context object instead of only fake_mode, since the tracing context contains the full set of information.
Test Plan:
pytest test/export/test_experimental.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166775
Approved by: https://github.com/tugsbayasgalan
By wrapping the python objects with FakeScriptObject(FakeOpaqueQueue), we restrict users from doing anything to this object. torch.compile support can be easily enabled by the rest of [this stack](https://github.com/pytorch/pytorch/pull/163936) and the existing support for ScriptObjects.
One thing to note is that by default in functionalization we mark all ops that take in FakeScriptObjects as being effectful. Should this be the case for these custom ops that take in python objs?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165005
Approved by: https://github.com/zou3519
Summary: NVIDIA uses CUTLASS for row-wise scaling prior to cuBLAS version 12.9. This change enables support for FP16 data type for both bias and output when using CUTLASS.
Test Plan:
pytest -svv test/test_scaled_matmul_cuda.py
Test results on cuda-12.4:
```
test/test_scaled_matmul_cuda.py::TestFP8MatmulCUDA::test_scaled_mm_vs_emulated_row_wise_bfloat16_cuda PASSED [0.0022s]
test/test_scaled_matmul_cuda.py::TestFP8MatmulCUDA::test_scaled_mm_vs_emulated_row_wise_float16_cuda PASSED [0.0023s]
test/test_scaled_matmul_cuda.py::TestFP8MatmulCUDA::test_scaled_mm_vs_emulated_row_wise_float32_cuda SKIPPED [0.0005s]
======================= 51 passed, 516 skipped in 5.26s ========================
```
Test results on cuda-12.9:
```
test/test_scaled_matmul_cuda.py::TestFP8MatmulCUDA::test_scaled_mm_vs_emulated_row_wise_bfloat16_cuda PASSED [0.0046s]
test/test_scaled_matmul_cuda.py::TestFP8MatmulCUDA::test_scaled_mm_vs_emulated_row_wise_float16_cuda PASSED [0.0040s]
test/test_scaled_matmul_cuda.py::TestFP8MatmulCUDA::test_scaled_mm_vs_emulated_row_wise_float32_cuda PASSED [0.0038s]
======================= 70 passed, 482 skipped in 5.88s ========================
```
Reviewed By: pranavsharma, RandySheriff
Differential Revision: D84169910
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166744
Approved by: https://github.com/slayton58
Summary:
Support tile-wise `1x128` scaling in Inductor Triton for FP8 GEMMs, i.e. scaling values along tensors `a` and `b` represent a `1x128` slice of input.
NOTE: Block-wise `128x128` and `1x128` scaling is only supported in CUDA 12.9+; therefore, tile-wise scaling is currently unsupported in `fbcode` (CUDA 12.4). Use OSS PyTorch to run tile-wise scaling (as with deepseek-style scaling).
Test Plan:
Works out-of-the-box with TritonBench:
```
TORCHINDUCTOR_CACHE_DIR=~/personal/cache_dir_inductor CUDA_LAUNCH_BLOCKING=1 TORCH_USE_CUDA_DSA=1 TRITON_PRINT_AUTOTUNING=1 TRITON_ALWAYS_COMPILE=1 TORCH_LOGS=+inductor TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 ENABLE_PERSISTENT_TMA_MATMUL=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM=1 buck2 run mode/{opt,inplace} pytorch/tritonbench:run -- --op fp8_gemm --only torch_fp8_gemm,pt2_fp8_gemm --metrics tflops,accuracy --m 256 --n 768 --k 512 --output="/home/jananisriram/personal/random_bench.csv" --scaling-pair=BlockWise1x128,BlockWise1x128 --atol=1e-2 --rtol=0.5
```
Differential Revision: D84025878
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165132
Approved by: https://github.com/eqy, https://github.com/drisspg, https://github.com/njriasan
This updates the docs-build nightly configuration to match other uses of the _linux-build.yml workflow using `runner_prefix` rather than `runner` directly. The default runner defined in _linux-build.yml is the c7i variant so this also updates the runner appropriately.
Relates to pytorch/test-infra#7175. While moving to c7i costs 5% more, CPU-intensive jobs should run roughly 15-20% faster, resulting in a cost reduction of 10-15% for those jobs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166727
Approved by: https://github.com/huydhn
### Summary
Adds support for NVIDIA’s cuSolver backend to torch.linalg.eig and torch.linalg.eigvals within the ATen/Linalg framework.
### Motivation
Extending PyTorch’s Linalg backends with NVIDIA’s cuSolver enables faster execution of torch.linalg.eig and torch.linalg.eigvals, complementing existing MAGMA and CPU implementations.
The speedup observed on consumer hardware (RTX4070/Ryzen 5700x) is in the order of **2x**, with preliminary testing on HPC hardware (H100, EPYC 9454) suggesting **up to 10x speedup**.
### Details
- Implements cuSolver support for linalg_eig and linalg_eigvals using the interface described in [NVIDIA cuSolver documentation](https://docs.nvidia.com/cuda/cusolver/index.html#cusolverdnxgeev) as introduced in CUDA 12.8 [CUDA 12.8 release notes](https://docs.nvidia.com/cuda/archive/12.8.0/cuda-toolkit-release-notes/index.html)
- Follows the existing MAGMA backend design, adapting it for cuSolver’s cusolverDnXgeev API.
- Integrates with existing eig/eigvals dispatch mechanism.
- No automatic CPU↔GPU backend switching. (Happy to discuss)
- Verified via existing Linalg test coverage; no new tests introduced in this PR.
- Tested successfully against both test_linalg.py including slow test suites.
- Tested MAGMA fallback successfully using CUDA 12.4. (observed unrelated test failures)
### Impact
- Enables much faster solving of eigenvalue problems
- Maintains numerical consistency and test stability across backends.
- No change to public API or user-facing behavior.
Special thanks to @AlbanD for prior feedback and discussions regarding the PR and @lezcano for feedback on the related testing PR [https://github.com/pytorch/pytorch/pull/166322](https://github.com/pytorch/pytorch/pull/166322).
Happy to discuss backend dispatch strategy, results from performance and stability testing can be seen here [https://dev-discuss.pytorch.org/](https://dev-discuss.pytorch.org/t/cusolver-dnxgeev-faster-cuda-eigenvalue-calculations/3248/7)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166715
Approved by: https://github.com/lezcano, https://github.com/albanD
This PR simplifies the `std::copy_n` calls in CopyKernel and IndexKernel. `std::copy_n` is used to create a data pointer array from the input data pointers. However, a more careful review reveals that the dest pointers are actually aliases of the original pointers, so we can remove the pointer manipulations.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143544
Approved by: https://github.com/albanD
Summary: Projects that use `-Wswitch-enum` will encounter issues when building and using *PyTorch* (`caffe2`). Address these issues to empower more rigorous upstream compiler warnings/errors.
Test Plan: CI Pass
Differential Revision: D85893917
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166760
Approved by: https://github.com/atalman
This PR adds two checks:
```
readability-static-definition-in-anonymous-namespace
Finds static function and variable definitions
in anonymous namespace.
readability-named-parameter
Find functions with unnamed arguments.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164692
Approved by: https://github.com/Skylion007
`static_input_indices` is used for cudagraphs to determine which input indices are static and will not have changing addresses. Since export never integrated with cudagraphs this information was not necessary. But now we need it!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166761
Approved by: https://github.com/BoyuanFeng
Replace separate min() and max() calls with single aminmax() call in max_unpool_out_mps_template to improve performance by reducing tensor traversals from O(2n) to O(n).
Changes:
- Use indices.aminmax() instead of separate indices.min()/max() calls
- Add required ATen/ops/aminmax.h header for AT_PER_OPERATOR_HEADERS
- Maintain identical bounds checking logic and error handling
This optimization is particularly beneficial for large indices tensors, improving cache locality and reducing computational overhead.
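An illustration of the single-pass pattern described above (not the MPS kernel itself; the tensor is illustrative):
```python
import torch

indices = torch.randint(0, 1000, (1_000_000,))
lo, hi = indices.aminmax()   # one traversal instead of separate .min() and .max() calls
assert lo == indices.min() and hi == indices.max()
```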
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165394
Approved by: https://github.com/cyyever, https://github.com/Skylion007
Previously, we would stash a single stream value we constructed at trace time in a global and return the same value from repeated calls to the graph.
With this PR, we construct the stream value in advance, reference the constructed value in the graph via the lookup table, and if that value is returned as an output, read the value from the lookup table and return it (in bytecode, not as a graph output, since we don't support arbitrary stream outputs).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164819
Approved by: https://github.com/anijain2305
Fixes#163971
Problem:
PyTorch's inductor compiler crashed with `IndexError: list index out of range` when compiling code that uses 0-dimensional tensors with operations like `torch.softmax(scalar_tensor, dim=0)`.
A 0-dim tensor has `shape = torch.Size([])` (empty shape):
```
# ndim = 0         (zero dimensions)
# len(shape) = 0   (no indices to access)

# Line 972: Pad other_shape to match inp dimensions
other_shape = [1] * (inp_ndim - len(other_shape)) + list(other_shape)
# For scalar tensors:
#   inp_ndim = 0     (the input is a scalar)
#   other_shape = []
#   Result: [1] * (0 - 0) + [] = []   (still empty!)

dim = match.kwargs["dim"]  # dim = 0
if isinstance(dim, int):
    dim = (dim,)

# The crash happens here!
return all(statically_known_true(other_shape[d] == 1) for d in dim)
#                                ^^^^^^^^^^^^^^^^
# Tries other_shape[0] but other_shape = [] (empty!)
# → IndexError: list index out of range
```
The function _other_is_broadcasted_in_dim() is an optimization check for a softmax fusion pattern. It verifies whether it's safe to rewrite:
```
# From
scaled = inp * other
result = scaled - scaled.amax(dim, keepdim=True)
# To this more stable form:
result = (inp - inp.amax(dim, keepdim=True)) * other
```
The optimization is only valid if other is constant across the reduction dimension (i.e., broadcasted to size 1 in that dimension). Otherwise, scaling changes which element is the maximum.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166547
Approved by: https://github.com/jansel, https://github.com/eellison, https://github.com/leslie-fang-intel
It's nice to add a curve with customized compilation options so that we can compare side-by-side the perf improvement of new features.
E.g. for mix-order-reduction, by running the following command
```
python benchmarks/dynamo/genai_layers/benchmark.py --tolerance=1e-2 --exit-on-accuracy-failure --visualize rmsnorm_backward --custom-compile-name="compiled-no-fusion" --custom-compile-options='{"triton.mix_order_reduction":false}'
```
I get following output:
```
Geomean speedup for benchmark RMSNormBackward
eager 11 data points
compiled 11 data points, 15.82x speedup
quack 11 data points, 15.45x speedup
liger 11 data points, 14.06x speedup
compiled-no-fusion 11 data points, 10.26x speedup
```
The output shows that the feature on average improves perf by `15.82 / 10.26 = 1.54x` for all the shapes tested. (I removed a shape (32768, 32768) whose rnumel is too large and not representative.)
The new curve also shows up in the figure:
<img width="3564" height="2368" alt="RMSNormBackward_bench" src="https://github.com/user-attachments/assets/1ffac2bc-e726-4f1e-806d-e9e5de711492" />
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166697
Approved by: https://github.com/BoyuanFeng
ghstack dependencies: #166053, #166382, #166461, #166585, #166675
Adds customizable hooks on `__torch_dispatch__` calls for logging/recording arbitrary values.
Recording hooks store the hook outputs for each call at `debug_mode.operators[*].record`
```python
with DebugMode() as debug_mode, DebugMode.dispatch_hooks(record_hook = some_func):
# some compute
...
```
Logging hooks annotate the string dump:
```python
with DebugMode() as debug_mode, DebugMode.dispatch_hooks(log_hook = some_func):
...
```
Adds default hooks `DebugMode.record_outputs()` and `DebugMode.log_tensor_hashes()`, for checking numerical equivalence. The hashing hook borrows from the Observer. Example dump:
```
aten::sum(dt: f32[8, 32]| S(0))
aten::sum(t: f32[1, 32]) # {'hash': 3.2215590476989746}
_c10d_functional::all_gather_into_tensor(t: f32[1, 32], 8, 0) # {'hash': 204.8783062621951}
_c10d_functional::wait_tensor(t: f32[8, 32]) # {'hash': 204.8783062621951}
aten::mm(t: f32[1, 8], t: f32[8, 32]) # {'hash': 12.014171155635267}
aten::sum(t: f32[1, 32]) # {'hash': 3.2215590476989746}
aten::t(t: f32[1, 8]) # {'hash': 3.7167285680770874}
aten::detach(t: f32[8, 1]) # {'hash': 3.7167285680770874}
...
```
On the FSDP2 / simple FSDP NE in https://github.com/pytorch/pytorch/pull/164939, with hashing, this produces 2 log dumps (FSDP2: P2010198620, simple FSDP: P2010198963). I asked Claude to check the hashes, it wrote an analysis script, and was able to guess RMS norm as the root cause: P2010195076
Another throw-away example for logging per-op memory usage: https://gist.github.com/pianpwk/372082bf29467aa4aa25cb26dee24aea
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166348
Approved by: https://github.com/yushangdi
Keep track of the reduction before splitting.
In the mix-order reduction context, if one of the reductions is split, it becomes much harder to fuse with the other reduction. Tracking the metadata of the reduction before splitting makes the fusion possible.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166053
Approved by: https://github.com/jansel
A few things to note:
1. Customers like vllm use a custom backend (e.g. VllmBackend), split the graph, and call standalone_compile for each split. If we let the bisector override the backend, we won't bisect thru the custom backend. `test_configs.bisect_keep_custom_backend_for_inductor` is used to keep the custom backend if we are bisecting for inductor.
2. pre_grad_graph bisecting and lowering bisecting so far do not compose well with each other, since an issue may just be captured by the first one we try. `test_configs.bisect_pre_grad_graph` is used to enable the 'pre_grad_graph' bisecting.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166344
Approved by: https://github.com/eellison
This PR replaces c10::call_once calls with static initialization when possible. C++11 guarantees that the initialization of static local variables is thread-safe. Static initialization also has lower cost than using c10::call_once.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166381
Approved by: https://github.com/malfet
The goal of this PR is to provide a standard way to create simple treespec instances and hide the implementation details of the `PyTreeSpec` class.
Changes:
1. Add function `treespec_leaf()` to replace `LeafSpec()`.
2. Add function `treespec_tuple(...)` and `treespec_dict(...)` to create treespec for `tuple` / `dict` which is used for `*args` / `**kwargs`. This avoids direct modification to `treespec` instances that rely on the implementation details of the `PyTreeSpec` class.
3. Change `len(spec.children_specs)` to `spec.num_children`.
4. Change `isinstance(spec, LeafSpec)` to `spec.is_leaf()`.
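A hedged usage sketch of the new helpers listed above (the exact signatures may differ slightly from what is shown here):
```python
import torch.utils._pytree as pytree

leaf = pytree.treespec_leaf()
args_spec = pytree.treespec_tuple([leaf, leaf])       # treespec for two positional args
kwargs_spec = pytree.treespec_dict({"scale": leaf})   # treespec for keyword args

print(args_spec.num_children, kwargs_spec.num_children, leaf.is_leaf())
```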
------
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160843
Approved by: https://github.com/mlazos
Summary:
As per title, a bot created this diff because this test broke due to [a different PR.](https://github.com/pytorch/pytorch/pull/166026)
<Erased bot summary in case anything we don't want to make external.>
Test Plan:
Bot ran the tests and they passed.
<Erased bot test plan in case anything we don't want to make external.>
Differential Revision: D85745809
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166753
Approved by: https://github.com/d4l3k
Graph partition relies on `get_free_symbol_uses()` to collect symbol inputs.
ee7434be82/torch/_inductor/scheduler.py (L4869-L4885)
I empirically observed that `get_free_symbol_uses()` becomes slower for larger graphs. Specifically, I tried the aten fallback for torchtitan, which results in 10k+ aten nodes. When processing the 600th node, a single `get_free_symbol_uses()` call takes seconds.
Why? Because `get_free_symbol_uses()` may recursively call another `get_free_symbol_uses()`, which can end up running many times.
ee7434be82/torch/_inductor/ir.py (L4541-L4543)
This PR fixes the issue by caching the results of `get_free_symbol_uses()`. I validated on torchtitan that the issue is fixed.
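A minimal illustration of the caching idea (toy classes, not the actual Inductor IR):
```python
import functools

class Node:
    def __init__(self, symbols, children=()):
        self.symbols = frozenset(symbols)
        self.children = tuple(children)

@functools.lru_cache(maxsize=None)
def free_symbol_uses(node):
    # Without the cache, shared children are revisited over and over as the
    # graph grows; with it, each node is computed exactly once.
    out = node.symbols
    for child in node.children:
        out |= free_symbol_uses(child)
    return out

leaf = Node({"s0"})
root = Node({"s1"}, children=(leaf, Node({"s2"}, children=(leaf,))))
assert free_symbol_uses(root) == {"s0", "s1", "s2"}
```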
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166338
Approved by: https://github.com/eellison
Provides type coverage to ~3000 LOC and 200 methods in `torch/_dynamo/variables/`
This is the first part of the final step to having 100% strict type coverage in dynamo - see previous comments in https://github.com/pytorch/pytorch/pull/166535 (combined into this one PR because ghstack was giving issues...)
### Coverage report:
```
mypy torch_dynamo/variables --linecount-report /tmp/coverage_log
```
Compare before to after - we go from 3826 to 7221 lines covered
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166569
Approved by: https://github.com/williamwen42, https://github.com/Skylion007
- Remove all complex defines logic from the header
- Make GreenContext constructor private, as it should only be created via the static method as singleton
- Delete unused `getContext` and `getGreenContext` methods
- Rename `CUDA_HAS_GREEN_CONTEXT` to `HAS_CUDA_GREEN_CONTEXT()`, which results in compilation error if one accidentally makes a typo
- Suppress `-Wunused-private-field` if GreenContext is not available
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166462
Approved by: https://github.com/ngimel, https://github.com/eqy
This PR refactors the name AOTAutogradCacheEntry into AOTAutogradResult, and BundledAOTAutogradCacheEntry into BundledAOTAutogradResult. It also moves the corresponding code to a new file, `aot_autograd_result`, which is analogous to `output_code.py` from Inductor.
Having all these be called cache entries made sense when all we used them for was caching. But with AOT compile using BundledAOTAutogradCacheEntry, we want a more generalized naming structure.
This is a no-op change, and all existing tests should pass.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166656
Approved by: https://github.com/zhxchen17
ghstack dependencies: #166650
This PR does two things:
- It genericizes `BundledAOTAutogradCacheEntry` to support *any* outputcode, not just CompiledFxGraphs
- It adds a brand new OutputCode for the `aot_eager_regional_inductor` backend, i.e. a graph module that has regional inductor components in it.
This allows BundledAOTAutogradCache to just integrate nicely with inductor out of the box, but more importantly, it allows the result of aot_autograd to be fully serializable when using `aot_eager_regional_inductor`. This will allow us to AOT precompile cases where we have an eager graph that has scooped up inductor bits.
It's a bit unfortunate that the naming makes BundledAOTAutogradCacheEntry sound like its primary use is for caching, but really the more common use is going to be as an AOTAutogradOutput. It may be worth revisiting how to refactor/rename these in a later PR:
- AOTAutogradCacheEntry -> AOTAutogradResult
- BundledAOTAutogradCacheEntry -> BundledAOTAutogradResult
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166650
Approved by: https://github.com/zhxchen17
Partitioner functionality is appealing to use in different scenarios (E.g. Autoparallel)
We have special logic around the "partitioner_tag" in node meta that is only needed for the forward/backward split.
This adds an optional argument to skip it and do only a generic split based on inputs/outputs.
Potentially we want to make `_extract_graph_with_inputs_outputs` public (drop the underscore) :)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166725
Approved by: https://github.com/bdhirsh
Summary:
## Context
when `const_fold.split_const_subgraphs` sees a `call_module` node that targets a GraphModule, the existing implementation can mark this node as const-foldable when it shouldn't.
For example, a parent graph contains a `call_module` to a subgraph that has no inputs but contains impure ops inside.
```
parent graph():
%sub : [num_users=1] = call_module[target=sub](args = (), kwargs = {})
%getitem : [num_users=1] = call_function[target=operator.getitem](args = (%sub, slice(None, None, None)), kwargs = {})
return (getitem,)
submodule graph():
%randn : [num_users=1] = call_function[target=torch.ops.aten.randn.default](args = ([5, 10],), kwargs = {device: cpu, pin_memory: False})
%add : [num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%randn, 1), kwargs = {})
return (add,)
```
When the `submodule` graph is fed to `const_fold.split_const_subgraphs`, it comes out unmodified since randn is impure.
But if the `submodule` is called by a `parent` graph, feeding `parent` to `const_fold.split_const_subgraphs` produces a folded graph.
```
parent after fold graph():
%_fx_const_folded_attrs : [num_users=1] = get_attr[target=_FX_CONST_FOLDED_ATTRS]
return (_fx_const_folded_attrs,)
```
This is because the `node.is_impure()` check inside `const_fold.split_const_subgraphs` falls through, leading the call_module node to be marked as pure.
## Fix
We can update the `fx.node.Node.is_impure` function to check for ops inside a call_module node with an additional `subgraph_has_impure_ops` check:
- if a call_module node calls a GraphModule,
- check whether any call_function nodes are impure ops,
- recursively check any call_module nodes that call a GraphModule.
If the call_module subgraph has impure ops, `is_impure` returns True (see the sketch below).
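A hedged sketch of that recursive check (the helper name and exact signature here are illustrative):
```python
from torch.fx import GraphModule, Node

def subgraph_has_impure_ops(node: Node, owning_module: GraphModule) -> bool:
    # Only call_module nodes that target a GraphModule are inspected.
    if node.op != "call_module":
        return False
    submod = owning_module.get_submodule(node.target)
    if not isinstance(submod, GraphModule):
        return False
    for sub_node in submod.graph.nodes:
        if sub_node.op == "call_function" and sub_node.is_impure():
            return True
        if sub_node.op == "call_module" and subgraph_has_impure_ops(sub_node, submod):
            return True
    return False
```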
Test Plan: added tests to test_fx_const_fold.py
Differential Revision: D85798483
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166609
Approved by: https://github.com/blaine-rister
Provides type coverage to ~3000 LOC and 200 methods in `torch/_dynamo/variables/`
This is the first part of the final step to having 100% strict type coverage in dynamo - see previous comments in https://github.com/pytorch/pytorch/pull/166535 (combined into this one PR because ghstack was giving issues...)
### Coverage report:
```
mypy torch_dynamo/variables --linecount-report /tmp/coverage_log
```
Compare before to after - we go from 3826 to 7221 lines covered
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166569
Approved by: https://github.com/williamwen42
Summary:
We were setting the custom inductor choice using `torch._inductor.virtualized.V.set_choices_handler(CustomInductorChoices())`. However, this leads to inconsistent behaviors, even for jobs that are submitted back to back.
In this diff, we pass in the choice handler via an inductor config and overwrite the default behavior when the config is provided. This solves the inconsistent behavior.
Test Plan: see D85785892 (internal only)
Differential Revision: D85785879
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166607
Approved by: https://github.com/eellison
The goal of this PR is to provide a standard way to create simple treespec instances and hide the implementation details of the `PyTreeSpec` class.
Changes:
1. Add function `treespec_leaf()` to replace `LeafSpec()`.
2. Add function `treespec_tuple(...)` and `treespec_dict(...)` to create treespec for `tuple` / `dict` which is used for `*args` / `**kwargs`. This avoids direct modification to `treespec` instances that rely on the implementation details of the `PyTreeSpec` class.
3. Change `len(spec.children_specs)` to `spec.num_children`.
4. Change `isinstance(spec, LeafSpec)` to `spec.is_leaf()`.
------
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160843
Approved by: https://github.com/mlazos
Summary:
FXConverter configures `_node_metadata_hook`, passing in `fake_mode` explicitly, which is relevant for cases down the line like `_generate_triton_call` that insert a `triton_kernel_wrapper_mutation` node.
This `fake_mode` is obtained from `_detect_fake_mode_from_gm`, which can be different from inductor set `V.fake_mode`.
For example, while `V.fake_mode` is not None, `_detect_fake_mode_from_gm` can be **None** for a parent graph containing only a submodule which has no input args and only constants
```
parent graph():
%sub : [num_users=1] = call_module[target=sub](args = (), kwargs = {})
%getitem : [num_users=1] = call_function[target=operator.getitem](args = (%sub, slice(None, None, None)), kwargs = {})
return (getitem,)
submodule graph():
%randn : [num_users=1] = call_function[target=torch.ops.aten.randn.default](args = ([5, 10],), kwargs = {device: cuda, pin_memory: False})
%add : [num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%randn, 1), kwargs = {})
return (add,)
```
This discrepancy is problematic: it makes `_node_metadata_hook` try running inputs in a different fake_mode, or no fake_mode at all, while the rest of lowering uses `V.fake_mode`. In some cases where an input is placed on a custom non-GPU device, it can even complain with "requires device to be started" or a tensor device mismatch.
So this diff updates FXConverter.generate to use `V.fake_mode` which is populated by inductor properly.
Test Plan:
added a test `test_const_folded_subgraph` in `test_fxir_backend.py`, this test:
- creates a graph module that calls a subgraph with no inputs and containing only const-foldable ops
- const fold the subgraph
- run FXConverter.generate, expect `fake_mode` used to code-generate is not None
On the prior implementation when `_detect_fake_mode_from_gm` was used, this test would fail as fake_mode would be `None`.
With this change, the test passes, `fake_mode` is properly collected from `V.fake_mode` which is not None.
Differential Revision: D85767475
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166591
Approved by: https://github.com/blaine-rister, https://github.com/mlazos, https://github.com/eellison
This diff tries to fix a limitation in the Sigmoid + Triton interaction, where float arguments are not correctly passed. NativeRT passes float arguments as double, while Triton kernels were reading them as float, resulting in wrong values.
---
## Limitations in (de)serialization
In triton, float arguments to a kernel are encoded as "fp32" ([code](https://github.com/triton-lang/triton-cpu/blob/main-merged/python/triton/runtime/jit.py#L310-L326)):
```
elif isinstance(arg, float):
return ("fp32", None)
```
But it seems that torch export serde uses double ([code](d2eff5d454/torch/_export/serde/export_schema.thrift (L149))) because Thrift only has the double type:
```
union Argument {
10: bool as_none;
20: TensorArgument as_tensor;
30: list<TensorArgument> as_tensors;
50: i64 as_int;
70: list<i64> as_ints;
80: double as_float; ===> actually double
...
```
The `TritonKernel` constructor loads attributes from a node, where `Constant` represents the variant type, and it only has `double` ([code](d2eff5d454/torch/nativert/graph/Graph.h (L86))):
```
using Constant = std::variant<
None,
int64_t,
std::vector<int64_t>,
double, ===> triton float is loaded as double
```
So, NativeRT passes float arguments (originally in Triton) as double to triton kernels. But, all of the triton backends (nvidia, amd and cpu) are reading them as float because the signature still says `fp32`.
D84423898 was the current workaround: wrapping float arguments with tensors.
## The Fix
Fixing the Thrift definition isn't viable because Thrift only supports the double type. It's also possible to fix this on the Triton side by downcasting from double to float, but that would require fixing all backends.
Instead, I think this diff is the most effective way: when building `TritonKernel`, downcast the float arguments right after loading them as doubles.
Test Plan:
```
buck test fbcode//mode/opt-amd-gpu fbcode//caffe2/test:test_export --
```
Differential Revision: D85747160
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166620
Approved by: https://github.com/XueningXu
Provides type coverage to ~3000 LOC and 200 methods in `torch/_dynamo/variables/`
This is the first part of the final step to having 100% strict type coverage in dynamo - see previous comments in https://github.com/pytorch/pytorch/pull/166535 (combined into this one PR because ghstack was giving issues...)
### Coverage report:
```
mypy torch_dynamo/variables --linecount-report /tmp/coverage_log
```
Compare before to after - we go from 3826 to 7221 lines covered
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166569
Approved by: https://github.com/williamwen42
Graph partition relies on `get_free_symbol_uses()` to collect symbol inputs.
ee7434be82/torch/_inductor/scheduler.py (L4869-L4885)
I empirically observed that `get_free_symbol_uses()` becomes slower for larger graphs. Specifically, I tried the aten fallback for torchtitan, which results in 10k+ aten nodes. When processing the 600th node, a single `get_free_symbol_uses()` call takes seconds.
Why? Because `get_free_symbol_uses()` may recursively call another `get_free_symbol_uses()`, which can end up running many times.
ee7434be82/torch/_inductor/ir.py (L4541-L4543)
This PR fixes the issue by caching the results of `get_free_symbol_uses()`. I validated on torchtitan that the issue is fixed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166338
Approved by: https://github.com/eellison
This PR introduces CustomOp autotuning. It allows users to provide a CustomOpConfig:
(1) to register (optionally) multiple decomposition implementations for custom operations, and
(2) to register parameter tuning knobs and the values they want to tune for the decompositions,
so that Inductor automatically selects the best-performing variant through its autotune benchmarking.
Example:
```python
register_custom_op_autotuning(
custom_op=my_attention_op,
configs=[
CustomOpConfig(attention_impl, head_dim=32, method='chunked'),
CustomOpConfig(attention_impl, head_dim=64, method='tiled'),
CustomOpConfig(head_dim=128), # no decompositions
],
input_gen_fns={
"query": lambda fake: torch.randn_like(fake, device='cuda'),
"key": lambda fake: torch.randn_like(fake, device='cuda'),
"value": lambda fake: torch.randn_like(fake, device='cuda'),
}
)
```
**CustomOpConfig**: Each CustomOpConfig defines exactly one autotuning variant with specific parameter values and optional decomposition implementation with PyTorch aten ops. Users can register their own tuning knobs and optional decomposition functions for the same custom operation. The system automatically benchmarks all variants to select the best performing. If no decomposition is provided in the config, the CustomOp's default implementation will be used.
**Custom Input Generation**: Users can provide custom input generators via an optional `input_gen_fns` to control how synthetic inputs are created during benchmarking. This enables more realistic performance testing by generating inputs that match expected data distributions and characteristics for each tensor argument.
**More Examples with autotune logs:**:
1. Allow users to register customOp decompositions with tuning parameters for autotuning. Example usage:
```python
from torch._inductor.kernel.custom_op import CustomOpConfig, register_custom_op_autotuning
def decompose_k_implementation(a: torch.Tensor, b: torch.Tensor, k_splits: int = 4) -> torch.Tensor:
"""Matrix multiply with k-way decomposition."""
# Implementation...with k_splits
@torch.library.custom_op("my_lib::decompose_k", mutates_args=())
def test_decompose_k_op(
a: torch.Tensor, b: torch.Tensor, k_splits: int
) -> torch.Tensor:
return decompose_k_implementation(a, b, k_splits)
# Register autotuning with different k_splits values
register_custom_op_autotuning(
custom_op=test_decompose_k_op,
configs=[
CustomOpConfig(decompose_k_implementation, k_splits=2),
CustomOpConfig(decompose_k_implementation, k_splits=32),
CustomOpConfig(decompose_k_implementation, k_splits=64),
CustomOpConfig(k_splits=128), # can make decomposition optional, then use default impl test_decompose_k_op
CustomOpConfig(k_splits=256)
],
input_gen_fns={
"a": lambda fake: torch.randn_like(fake, device='cuda') * 0.1,
"b": lambda fake: torch.randn_like(fake, device='cuda') * 0.1,
}
)
```
Example result:
```
{"num_choices": 6, "num_triton_choices": 0, "best_kernel": "test_decompose_k_autotuned_fallback_default", "best_time": 0.09980800002813339}
AUTOTUNE test_decompose_k_autotuned(256x65536, 65536x1024)
strides: [65536, 1], [1024, 1]
dtypes: torch.float16, torch.float16
test_decompose_k_autotuned_fallback_default 0.0998 ms 100.0%
test_decompose_k_autotuned_decompose_k_implementation_k_splits_2_0 0.1096 ms 91.0% CustomOp decompose_k_implementation_k_splits_2
test_decompose_k_autotuned_decompose_k_implementation_k_splits_32_1 0.1277 ms 78.2% CustomOp decompose_k_implementation_k_splits_32
test_decompose_k_autotuned_decompose_k_implementation_k_splits_64_2 0.1454 ms 68.6% CustomOp decompose_k_implementation_k_splits_64
test_decompose_k_autotuned_decompose_k_implementation_k_splits_128_3 0.1536 ms 65.0% CustomOp decompose_k_implementation_k_splits_128
test_decompose_k_autotuned_decompose_k_implementation_k_splits_256_4 0.2084 ms 47.9% CustomOp decompose_k_implementation_k_splits_256
```
2. Allow users to tune a parameter knob by passing the parameter and its values in the CustomOpConfig.
**Example**
```python
def mlp_variants(input_tensor, gate_weight, up_weight, down_weight, method):
"""MLP implementation with different computational approaches."""
if method == 0:
# Standard separate matmuls
# ... implementation
elif method == 1:
# Batched approach with torch.mm
# ... implementation
elif method == 2:
# Fused weights approach
# ... implementation
@torch.library.custom_op("my_lib::mlp_op", mutates_args=())
def mlp_op(
input_tensor: torch.Tensor,
gate_weight: torch.Tensor,
up_weight: torch.Tensor,
down_weight: torch.Tensor,
method: int,
) -> torch.Tensor:
return mlp_variants(
input_tensor, gate_weight, up_weight, down_weight, method=method
)
register_custom_op_autotuning(
custom_op=mlp_op,
configs=[
CustomOpConfig(method=0),
CustomOpConfig(method=1),
CustomOpConfig(method=2),
# method=0 is the default fallback in the original op
],
input_gen_fns={
"input_tensor": lambda fake: torch.randn_like(fake, device='cuda') * 0.1,
"gate_weight": lambda fake: torch.randn_like(fake, device='cuda') * 0.05,
# ... other input generators
}
)
```
Example result:
```
AUTOTUNE test_mlp_autotuned(4x32x512, 512x1024, 512x1024, 1024x256)
test_mlp_autotuned_mlp_variants_method_2 0.0181 ms 100.0% CustomOp mlp_variants_method_2
test_mlp_autotuned_mlp_variants_method_1 0.0185 ms 97.8% CustomOp mlp_variants_method_1
test_mlp_autotuned_mlp_default_fallback_method_0 0.0198 ms 91.4% CustomOp fallback
```
### Test Suite (`test/inductor/test_custom_op_autotune.py`)
* **RMSNorm autotuning**: Tests different RMSNorm implementations with dynamic input shapes
* **MLP autotuning**: Tests different MLP decomposition and tuning "method" parameter
* **DecomposeK**: Tests different k_splits values for matrix multiplication decomposition with k dim split
* **Multi-parameter tuning**: Tests configs with multiple tuning parameters (scale_mode, chunk_size)
### Next Step:
- Enable Max-autotune with user passed in max-autotune config. https://github.com/pytorch/pytorch/pull/165526/files
- Support inline epilogue fusion for selected best customop decomposition with surrounding elementwise ops. https://github.com/pytorch/pytorch/pull/165952/files
- Support customop autotune considering fusion with multiTemplateBuffer. WIP
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164212
Approved by: https://github.com/zou3519
Fixes #166253
## Summary
When `torch.full` is called with a 0-D tensor as `fill_value` inside a `torch.compile`'d function, the value was being incorrectly cached, causing subsequent calls with different values to return the first value.
## Root Cause
The Dynamo handler for `torch.full` was calling `aten._local_scalar_dense` to convert tensor fill_values to Python scalars at compile time, which baked the value into the compiled graph as a constant.
## Solution
Modified the Dynamo handler to decompose `torch.full(size, tensor_fill_value)` into `empty(size).fill_(tensor_fill_value)` when `fill_value` is a `TensorVariable`, keeping the fill value dynamic in the compiled graph.
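An eager-mode illustration of the decomposition (the actual Dynamo handler operates on VariableTrackers in the traced graph):
```python
import torch

size = (2, 3)
fill = torch.tensor(7.0)  # 0-D tensor fill_value

out_full = torch.full(size, fill)
# empty(size).fill_(fill) keeps the fill value as a (dynamic) input to the graph
# rather than baking it in as a compile-time constant.
out_decomposed = torch.empty(size, dtype=out_full.dtype).fill_(fill)
assert torch.equal(out_full, out_decomposed)
```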
## Testing
Added test case that verifies torch.full works correctly with dynamic tensor fill_values across multiple calls and dtypes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166554
Approved by: https://github.com/Lucaskabela
- Remove all complex defines logic from the header
- Make GreenContext constructor private, as it should only be created via the static method as singleton
- Delete unused `getContext` and `getGreenContext` methods
- Rename `CUDA_HAS_GREEN_CONTEXT` to `HAS_CUDA_GREEN_CONTEXT()`, which results in compilation error if one accidentally makes a typo
- Suppress `-Wunused-private-field` if GreenContext is not available
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166462
Approved by: https://github.com/ngimel, https://github.com/eqy
DebugMode currently stores dispatch call args & kwargs, which is all intermediate tensors and more. This quickly OOMed on GPU when trying to debug some torchtitan / llama 8b models.
This PR defaults to storing the stringified version, and adds a flag `DebugMode(store_original_args=True)` for users who want to store the original args as-is (and for BC).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166347
Approved by: https://github.com/yushangdi
Our CP codebase now contains several files and we are adding more. This
PR refactors the code to consolidate the files into a context_parallel
folder but keep the import so that the existing users of CP won't be
affected.
Unfortunately, we have to split this PR into two PRs as the PyTorch
infra cannot accept a PR with 3000+ LoC change and git cannot recognize
that _context_parallel/_attention.py is moved from _attention.py because
we want to keep BC.
This is the second PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166501
Approved by: https://github.com/Skylion007
ghstack dependencies: #166456
Looking for feedback on this approach.
Received user reports of spurious pyrefly errors for users using hg instead of git. I think this was due to the fact that when using a venv and git, `make setup-env` installs requirements and pulls from a nightly torch wheel, which is needed for pyrefly to type check properly.
Initial documentation for `make setup-env` I found here: https://github.com/pytorch/pytorch/blob/main/CONTRIBUTING.md#developing-pytorch
Testing:
```
hg clone --git ssh://git@github.com/pytorch/pytorch.git
conda create -n pytorch_env python=3.10 # (or manually create venv instead of using script)
cd pytorch
pip install -r requirements.txt
pip install -r requirements-build.txt
lintrunner init
# check how many pyrefly errors - 15,709 errors (11,693 ignored)
lintrunner # confirm error message / warning appears
>>> General linter failure:
Warning (PYREFLY) nightly-wheel-not-run
pytorch-nightly.pth not found. You may need to run make setup-env or make
setup-env-conda to install nightly binaries and type stubs.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166603
Approved by: https://github.com/aorenste
Our CP codebase now contains several files and we are adding more. This PR refactors the code to consolidate the files into a context_parallel folder but keep the import so that the existing users of CP won't be affected.
Unfortunately, we have to split this PR into two PRs as the PyTorch infra cannot accept a PR with 3000+ LoC change and git cannot recognize that _context_parallel/_attention.py is moved from _attention.py because we want to keep BC.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166456
Approved by: https://github.com/Skylion007
Summary:
Using `CUDA_KERNEL_ASSERT_PRINTF` inside kernels allows us to log invalid values to the console (which can in turn be used to surface _hopefully_ clearer error messages).
This does have an impact on the number of registers needed for the values being logged (I confirmed via diffing PTX that there is no other impact relative to using `__assert_fail`).
To avoid causing perf bottlenecks, this change adds a compile-time switch to enable more verbose errors in some of the common kernels that cause DSAs. There is also a Buck config that can be used to configure this switch more conveniently.
## Alternatives considered
I considered making the behavior of `CUDA_KERNEL_ASSERT_PRINTF` controllable via a compile-time macro instead of writing another wrapper for it but there are kernels where the extra register pressure is not as severe and in those cases, having more useful error messages by default is pretty useful.
Test Plan:
## Simple Python Driver:
```
# scatter_errors.py
import torch

def main() -> None:
    a = torch.rand(128, device="cuda:0")
    idx = torch.randint(0, 128, (100,), device="cuda:0")
    idx[0] = 9999
    b = torch.scatter(a, 0, idx, 555.0)
    print(b)

if __name__ == "__main__":
    main()
```
When running normally via:
```
$ buck2 run @//mode/opt :scatter_errors
```
we see the following DSA message:
```
fbcode/caffe2/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:410: operator(): block: [0,0,0], thread: [0,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
```
Running via:
```
$ buck2 run @//mode/opt -c fbcode.c10_enable_verbose_assert=1 :scatter_errors
```
however produces:
```
[CUDA_KERNEL_ASSERT] fbcode/caffe2/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:410: operator(): block: [0,0,0], thread: [0,0,0]: Assertion failed: `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"`: Expected 0 <= idx_dim < index_size (128), but got idx_dim = 9999
```
Differential Revision: D85185987
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166171
Approved by: https://github.com/ngimel
Summary:
After a precision study, we concluded it is OK to use ACL's exp function in f32's erf().
We can keep erf inline this way.
Benchmarks show about 91% higher throughput when processing a tensor of 1M elements, compiling with clang-19:
Before:
f32 erf: 2539.179us
After:
f32 erf: 1329.063us
Test Plan:
Correctness:
buck2 test mode/opt //caffe2/test:test_ops
buck2 test mode/opt //caffe2/test:torch
Performance:
buck2 run mode/opt //caffe2/benchmarks/operator_benchmark/fb:operator_benchmark_test
Differential Revision: D85730452
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166594
Approved by: https://github.com/mcfi, https://github.com/fadara01
Fixes #165177
When converting guards to sources, if we were unable to get the expected symbol from symbol_to_source, then try to get it from var_to_sources.
I was unable to make a simpler repro than what was described in the issue (which relies on llama3 - so inappropriate for a unit test).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165723
Approved by: https://github.com/bobrenjc93
Summary: When emitting symbolic variable definition in fx_graph_runnable.py, we need to check if a SymNode is actually an expression, so that we won't generate something like "s27*s53**2 = 36".
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166529
Approved by: https://github.com/mlazos
ghstack dependencies: #166432
In the initial PR for overlap-preserving bucketing, for a graph like:
```
def foo(...):
ag = all_gather(...)
hiding_compute = mm(...)
wait(ag)
```
We would add dependencies from mm -> ag, and from wait -> hiding_compute, to prevent bucketing from reordering these collectives so that overlap no longer occurred. However, there is an additional way for bucketing to prevent overlap.
If we were to reorder another collective so the graph looked like:
```
def foo(...):
ag = all_gather(...)
ar = all_reduce(...)
wait(ar)
hiding_compute = mm(...)
wait(ag)
```
Overlap would not occur, because the wait for the all reduce would also force realization of every collective enqueued on the same stream prior to the all reduce. NCCL uses a single stream per process group.
To model this, we initially set a strict ordering of all collective starts, waits, and hiding compute when bucketing. Then, when trying to add a collective to a bucket, we check whether we interfere with overlap for all of the following possible bucketings:
[move collective start to bucket start, move bucket start to collective start] x [move collective wait to bucket wait, move bucket wait to collective wait].
For each of these positions, we check whether overlap would have been interfered with because of stream queue semantics. If not, we remove the moving start and wait from the constrained ordering of collectives and check whether it is topologically valid to merge the nodes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166324
Approved by: https://github.com/IvanKobzarev
ghstack dependencies: #166309
Previously, we would stash a single stream value we constructed at trace time in a global and return the same value from repeated calls to the graph.
With this PR, we construct the stream value in advance, reference the constructed value in the graph via the lookup table, and if that value is returned as an output, read the value from the lookup table and return it (in bytecode, not as a graph output, since we don't support arbitrary stream outputs).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164819
Approved by: https://github.com/anijain2305
ghstack dependencies: #164304, #164522
Summary:
I was digging through matmul padding for other work and noticed that the compute-bound check won't work on MI350 since we haven't supplied the tech specs yet.
I added the MI350 specs following the predefined format.
Test Plan: CI
Differential Revision: D85804980
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166576
Approved by: https://github.com/leitian
**Summary**
This implements the backward pass for the Varlen API and registers `_varlen_attn()` as a custom op.
**Benchmarking**
To benchmark, we compare runtime and TFLOPs against the current SDPA approach with padding.
Settings:
- 1 H100 machine
- `batch_size=8`, `max_seq_len=2048`, `embed_dim=1024`, `num_heads=16`
- dtype `torch.bfloat16`
- `is_causal=False`
- for variable length, we set sequences to be random multiples of 64 up to `max_seq_len`
- 100 runs
| | Variable Length API | SDPA |
|--------|--------------------|----------|
| Runtime | 0.8189142608642578 ms | 3.263883056640625 ms |
| TFLOPs | 268.652 | 158.731 |
We can see that runtime for Varlen is >3x faster
**Testing**
Run `python test/test_varlen_attention.py` for unit tests where we verify basic functionality and confirm numerical match between varlen gradients vs SDPA.
For custom op testing, `test_custom_op_registration` uses logging mode to verify that `_varlen_attn()` was called and tests with `torch.compile`. `test_custom_op_compliances` uses `torch.library.opcheck()` to verify.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164504
Approved by: https://github.com/drisspg
* We are separating out the rocm jobs of the periodic workflow
* We are introducing a new label `ciflow/periodic-rocm-mi200` to allow us to run distributed tests only on ROCm runners, without triggering many other jobs on the `periodic.yml` workflow (via `ciflow/periodic`)
* This new workflow will also be triggered via the `ciflow/periodic`, thus maintaining the old status quo.
* We are reverting to the `linux.rocm.gpu.4` label since it targets a lot more CI nodes at this point than the K8s/ARC-based `linux.rocm.gpu.mi250.4` label, as that is still having some network/scaling issues.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166544
Approved by: https://github.com/jeffdaily
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Summary:
This adds a waitcounter for whether or not the pool is running, as well as whether we
are running jobs.
This also adds waitcounters for the first job within a pool. First job and running are working correctly. The job waitcounter seems to either be detecting a leaked job, or is subtly broken.
Test Plan:
We've tested this internally and see valid ods metrics.
Note that we may be leaking jobs, or the job logic may not be handling an exception correctly.
Differential Revision: D83705931
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164527
Approved by: https://github.com/masnesral
torch.cuda._busy_wait_for_flag() will launch a kernel that spins until a flag is set by a corresponding torch.cuda._clear_flag(). These **must** be run on separate streams or it will deadlock.
When used correctly these kernels will put work on the GPU that is more predictable than torch.cuda._sleep() in cases where the unit test depends on the GPU being busy.
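A hedged usage sketch, assuming both helpers take no arguments (per the description above); the two calls must be issued on different streams:
```python
import torch

a = torch.randn(1024, 1024, device="cuda")
b = torch.randn(1024, 1024, device="cuda")

spin_stream = torch.cuda.Stream()
clear_stream = torch.cuda.Stream()

with torch.cuda.stream(spin_stream):
    torch.cuda._busy_wait_for_flag()  # spin kernel blocks this stream until the flag is set
    c = a @ b                         # queued behind the spin kernel, i.e. a predictably "busy" GPU

with torch.cuda.stream(clear_stream):
    torch.cuda._clear_flag()          # releases the spin kernel on the other stream

torch.cuda.synchronize()
```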
Fixes #120318.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166218
Approved by: https://github.com/jeffdaily
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Currently most variable trackers implement `iter` via `_call_iter_tuple_list`.
This makes it difficult to customize the behavior of `iter` for different
variable types. Instead, implement `iter` via a polyfill, which will delegate
to the appropriate `__iter__` method.
While this method is more flexible, it increases the overhead of dynamo tracing.
For example, `iter(x)` will generate 9x more instructions than the current
implementation for common iterable types. Microbenchmarking shows a ~6x
slowdown for this operation. I suspect this would be much less for realistic
workloads, but more work would be needed to get specific numbers. If the
performance is a concern we could also consider adding a fast path for types
that are known to correctly implement `__iter__`.
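For reference, the delegation a polyfill performs is roughly equivalent to the following pure-Python version of single-argument `iter` (an illustration, not Dynamo's actual polyfill):
```python
def iter_polyfill(obj):
    cls = type(obj)
    if hasattr(cls, "__iter__"):
        # Delegate to the type's __iter__, mirroring what builtins.iter does.
        return cls.__iter__(obj)
    if hasattr(cls, "__getitem__"):
        # Old-style sequence protocol: index from 0 until IndexError.
        def seq_iter():
            i = 0
            while True:
                try:
                    yield obj[i]
                except IndexError:
                    return
                i += 1
        return seq_iter()
    raise TypeError(f"'{cls.__name__}' object is not iterable")

assert list(iter_polyfill([1, 2, 3])) == [1, 2, 3]
```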
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162768
Approved by: https://github.com/guilhermeleobas
ghstack dependencies: #166349
Summary:
- the ranks in the default pg config are local ranks
- however fr trace analysis requires them to be global ranks
- so we transform the local ranks to global ranks before the analysis kicks in based on a cli flag
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166149
Approved by: https://github.com/fduwjj
This reverts commit 3e77a2b478f596a8a0aef0af502f6bb1a247aa85.
Otherwise it fails the ARM build with older compilers, with errors that look
as follows:
```
vec128_bfloat16_neon.h:666:12: error: operation not permitted on type ‘bfloat16_t’
666 | return (-x) * y - z;
```
For more self-contained example see https://godbolt.org/z/bbY4xWh45
(that compiles the same code using clang-15 and clang-19)
Summary:
- in torchft we have multiple default pg's, 1 for each task group
- for flight recorder to work, each of these needs to have a different name, so entries can be matched
- change the `init_process_group` api to optionally take a list of ranks. if provided, we use the hash of the ranks as the name of the pg. for torchft, we'll pass global ranks here so the default pg has a different name on each task group
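A hedged sketch of deriving a per-task-group name from the global ranks (the exact hashing scheme here is illustrative, not necessarily the one used in the PR):
```python
import hashlib

def pg_name_from_ranks(ranks):
    # Same global ranks -> same name on every member of the task group;
    # different task groups -> different default-pg names for flight recorder.
    payload = ",".join(str(r) for r in sorted(ranks))
    return hashlib.sha256(payload.encode()).hexdigest()[:16]

assert pg_name_from_ranks([0, 1, 2, 3]) != pg_name_from_ranks([4, 5, 6, 7])
```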
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166182
Approved by: https://github.com/fduwjj
Summary:
This PR introduces an inductor-level fallback mechanism that gives users control over which operations or subgraphs Inductor should lower and which should fall back to preexisting kernels. This has similar motivation as #164776 in providing flexibility to selectively disable Inductor lowering for specific nodes.
The implementation simply adds a check for the `"should_fallback"` metadata annotation on FX graph nodes. If this is set to `True`, the lowerer falls back before attempting the normal lowering path. Note that since these are user-directed fallbacks dependent upon specific, customized conditions, use `add_to_fallback_set=False` to avoid permanent overwrites of inductor's lowering/fallback rules.
Simple example marking nodes for fallback based on custom predicates:
```
def should_fallback_predicate(node: torch.fx.Node, pred: Callable[[torch.fx.Node], bool]) -> None:
    # Apply the predicate and mark the node for fallback if needed
    if pred(node):
        node.meta["should_fallback"] = True
```
Test Plan: added a CI test
Differential Revision: D85347587
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166339
Approved by: https://github.com/blaine-rister, https://github.com/eellison
1. Add `extension_build_version` and `is_internal` to `FromImpl`/`ToImpl` (this will be useful for future if we need to break the BC of any type) #163832 has the PoC of how we would actually use this system
2. Add `aoti_torch_library_impl_v2` that takes in an additional `extension_build_version` argument, updates callsite in `torch/csrc/stable/library.h` to always pass `TORCH_ABI_VERSION` for this argument
3. Add `extension_build_version` to `from_ivalue` and `to_ivalue` and update all callsites
4. Add a private `_from` and `_to` that pass `is_internal=True` to `FromImpl`/`ToImpl`, making it easier to reason about what is being called from libtorch-land / extension-land
**Note: This PR does not include a linter that tells the user to update from/to if changing the ABI of a type in headeronly, which I intend to do in https://github.com/pytorch/pytorch/pull/163998**
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164332
Approved by: https://github.com/janeyx99
ghstack dependencies: #164356, #166373, #163683
Move `from_ivalue` and `to_ivalue` and their dependents `StableIValueBoxedKernel`, `aoti_torch_library_impl`, and `aoti_torch_call_dispatcher` into the new (non-AOTI) shim_common.cpp.
This is in prep for the above PRs where I add v2s (`torch_call_dispatcher` and `torch_library_impl`) that are versioning aware
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166373
Approved by: https://github.com/janeyx99
ghstack dependencies: #164356
As per title.
It seems safe to generalize to arbitrary contiguous inputs since `at::matmul` is likely to do the flattening to avoid `baddmm`.
Additionally, we guard for the bias to be 1D and contiguous, which is guaranteed to be fused with no copies.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166071
Approved by: https://github.com/ngimel
Summary:
When dealing with a large memory trace, the resulting plot can be challenging to interpret and analyze.
This commit introduces a feature that enables filtering of allocations that have already been freed, providing a more focused view.
The remaining events in the plot often warrant closer examination, as they may be indicative of potential out-of-memory (OOM) issues.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165752
Approved by: https://github.com/zdevito
The goal of this PR is to provide a standard way to create simple treespec instances and hide the implementation details of the `PyTreeSpec` class.
Changes:
1. Add function `treespec_leaf()` to replace `LeafSpec()`.
2. Add function `treespec_tuple(...)` and `treespec_dict(...)` to create treespec for `tuple` / `dict` which is used for `*args` / `**kwargs`. This avoids direct modification to `treespec` instances that rely on the implementation details of the `PyTreeSpec` class.
3. Change `len(spec.children_specs)` to `spec.num_children`.
4. Change `isinstance(spec, LeafSpec)` to `spec.is_leaf()`.
------
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160843
Approved by: https://github.com/mlazos
Fixes #100842
Disable jiterator for complex tan and tanh kernels due to accuracy issues, matching the existing approach used for acos, acosh, asin, and asinh. Reverts to thrust implementation which provides better numerical accuracy.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165250
Approved by: https://github.com/ezyang
Fixes #161366
All 4 combinations of input dimensions are supported:
2d-2d, 2d-3d, 3d-3d, 3d-2d. The corresponding test cases in test_matmul_cuda pass
for both the forward and backward pass.
The CK path is enabled for gfx942 and gfx950.
ToDo: need to enable support on gfx90a, since the CK kernel used in this commit produces a GPU error;
it might require a different CK kernel config, based on the profiler result on gfx90a.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166334
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony
# Context
Previously, we would modify the parent process's NUMA bindings in order to force child process to inherit them.
However, this would not work correctly if `start_method="forkserver"`, because the subprocesses would actually inherit their bindings from the forkserver middleman process. In this case, the inherited affinity would actually be incorrect for all but the first subprocess (because the forkserver process would get created lazily, and hence inherit and then stick with the bindings intended for the first subprocess).
# This PR
* `str` entrypoints: Use `numactl` CLI
* `Callable` entrypoints: Wrap the `Callable` entrypoint and call `os.sched_setaffinity` inside it.
Hopefully this will be the last necessary iteration.
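A minimal sketch of the `Callable` wrapping (the wrapper name is made up for illustration):
```python
import os
from typing import Callable

def with_cpu_affinity(entrypoint: Callable[..., None], cpus: set) -> Callable[..., None]:
    # The wrapper body runs inside the spawned/forked child, so the binding is
    # applied in the child itself rather than inherited from a middleman process.
    def wrapped(*args, **kwargs):
        os.sched_setaffinity(0, cpus)  # pid 0 == the calling (child) process
        return entrypoint(*args, **kwargs)
    return wrapped
```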
# Test Plan
## Automated
`$ pytest test/test_numa_binding.py`
## Manual
Verified flops/sec and memory locality wins on several different types of jobs
* `Callable` with forkserver
* `str` entrypoint with spawn
* `Callable` entrypoint with spawn
More details in [this doc (Meta-only).](https://docs.google.com/document/d/1vxD-OKYBTT27jbBwtW9iz9g0tNM0u-i0tiTJg_ieQA8/edit?tab=t.scjv58yswi64)
# Later PR
Update all the documentation when we're confident this has stabilized.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166026
Approved by: https://github.com/d4l3k
Co-authored-by: PyTorch MergeBot <pytorchmergebot@users.noreply.github.com>
# Motivation
In https://github.com/pytorch/pytorch/pull/145591, `std::hardware_destructive_interference_size` was introduced in CUDACachingAllocator. Later, https://github.com/pytorch/pytorch/pull/160067 moved it to `c10/core/alignment.h` for code reuse.
However, on **GCC 13+** using `std::hardware_destructive_interference_size` triggers the following warning:
```bash
warning: use of ‘std::hardware_destructive_interference_size’ [-Winterference-size]
/home/pt-gpu/4T-4652/guangyey/stock-pytorch/aten/src/ATen/core/CachingHostAllocator.h:42:16: note: its value can vary between compiler versions or with different ‘-mtune’ or ‘-mcpu’ flags
/home/pt-gpu/4T-4652/guangyey/stock-pytorch/aten/src/ATen/core/CachingHostAllocator.h:42:16: note: if this use is part of a public ABI, change it to instead use a constant variable you define
/home/pt-gpu/4T-4652/guangyey/stock-pytorch/aten/src/ATen/core/CachingHostAllocator.h:42:16: note: the default value for the current CPU tuning is 64 bytes
/home/pt-gpu/4T-4652/guangyey/stock-pytorch/aten/src/ATen/core/CachingHostAllocator.h:42:16: note: you can stabilize this value with ‘--param hardware_destructive_interference_size=64’, or disable this warning with ‘-Wno-interference-size’
```
# Solution
- Solution 1: Replace `c10::hardware_destructive_interference_size` with a constant 64.
```cpp
constexpr std::size_t hardware_destructive_interference_size = 64;
```
- Solution 2: Add `-Wno-interference-size` to 8d4e48831e/cmake/public/utils.cmake (L386) to suppress the warning.
# Additional Context
The current implementation uses the second approach. If the reviewers prefer the first approach, I am happy to update it accordingly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166297
Approved by: https://github.com/ezyang
It is recommended to use `python -m pip install --no-build-isolation .` instead of `pip3 install --no-build-isolation .` because most of us use a virtual environment, and the latter probably relies on the system `pip3` rather than the conda or uv. We need to make it consistent with the Python we use, and it is also consistent with how `torch` is installed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166235
Approved by: https://github.com/fffrog, https://github.com/ezyang
When the slice bound is a tensor, we decompose it into an .item() call and pass the unbacked symbol to the slice to avoid a data-dependent error (DDE).
The diff also fixes an existing bug in codegen_dynamic_slice_size in the cpp wrapper: a +1 should be a -1, making it match the
Python codegen.
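An eager-level illustration of the decomposition (the real change rewrites the traced graph):
```python
import torch

x = torch.arange(10)
start_t = torch.tensor(3)  # data-dependent slice bound held in a tensor

# Rather than guarding on the tensor's value (a data-dependent guard under
# torch.compile), extract it with .item(); under compile this produces an
# unbacked SymInt that is passed straight to the slice.
start = start_t.item()
y = x[start:]
assert torch.equal(y, torch.arange(3, 10))
```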
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165074
Approved by: https://github.com/Lucaskabela
- Remove all complex defines logic from the header
- Make GreenContext constructor private, as it should only be created via the static method as singleton
- Delete unused `getContext` and `getGreenContext` methods
- Rename `CUDA_HAS_GREEN_CONTEXT` to `HAS_CUDA_GREEN_CONTEXT()`, which results in compilation error if one accidentally makes a typo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166462
Approved by: https://github.com/ngimel, https://github.com/eqy
Summary:
Previously we didn't correctly handle the closure tuple when there's content in it. This adds additional code for serializing the tuple and merging it with the guard manager local scope.
Test Plan:
pytest test/dynamo/test_aot_compile.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166351
Approved by: https://github.com/Lucaskabela
This pull request introduces a standardized YAML-based configuration system for transformer attention benchmarks, making it easier to run and manage comprehensive performance tests. It adds example configs, and a wrapper script to convert YAML configs into CLI arguments for the benchmark runner.
#### Next Steps:
CI Enablement: This change would further lead to running the attention ops in CI for regression tracking.
#### Developer flow: (Run locally)
`python score_mod.py --config configs/config_test.yaml`
#### Enabling CI run: https://github.com/pytorch/pytorch/pull/165915
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164155
Approved by: https://github.com/jbschlosser
The global pytree registration of `BlockMask` was added in https://github.com/pytorch/pytorch/pull/166045
In general people assume `BlockMask` is a leaf, so the global registration could lead to unexpected failures when calling `tree_map()` on a `BlockMask`, since it now flattens all the way down.
Therefore, we remove the global registration but keep the `_flatten()` and `_unflatten()` classmethods. Users can still do a local registration easily when it is needed (see the sketch below).
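A hedged sketch of such a local registration, assuming the kept classmethods match the flatten/unflatten signatures expected by `register_pytree_node`:
```python
import torch.utils._pytree as pytree
from torch.nn.attention.flex_attention import BlockMask

# Opt in explicitly where flattening a BlockMask is actually desired, instead of
# relying on a global registration that changes tree_map() behavior everywhere.
pytree.register_pytree_node(
    BlockMask,
    BlockMask._flatten,
    BlockMask._unflatten,
)
```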
Test in pytorch:
```
python test/distributed/tensor/test_dtensor_export.py -k test_flex_attention_dtensor_export
```
Test in torchtitan:
```
python -m tests.integration_tests.run_tests ./outputs --test_suite features --ngpu 8
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166434
Approved by: https://github.com/wwwjn
Fixes #165810
If we regenerate a node during functionalization, we override the "stack_trace", "custom", and "seq_nr" metadata of the regenerated node with the node meta of the original node.
```
python test/functorch/test_aot_joint_with_descriptors.py -k test_preserve_annotate_replay_view
python test/functorch/test_aotdispatch.py TestAOTAutogradWithDynamo.test_duplicated_arguments_on_tensor_overlap
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166200
Approved by: https://github.com/bdhirsh
**Summary**
This implements the backward pass for the Varlen API and registers `_varlen_attn()` as a custom op.
**Benchmarking**
To benchmark, we compare runtime and TFLOPs against the current SDPA approach with padding.
Settings:
- 1 H100 machine
- `batch_size=8`, `max_seq_len=2048`, `embed_dim=1024`, `num_heads=16`
- dtype `torch.bfloat16`
- `is_causal=False`
- for variable length, we set sequences to be random multiples of 64 up to `max_seq_len`
- 100 runs
| | Variable Length API | SDPA |
|--------|--------------------|----------|
| Runtime | 0.8189142608642578 ms | 3.263883056640625 ms |
| TFLOPs | 268.652 | 158.731 |
We can see that runtime for Varlen is >3x faster
**Testing**
Run `python test/test_varlen_attention.py` for unit tests where we verify basic functionality and confirm numerical match between varlen gradients vs SDPA.
For custom op testing, `test_custom_op_registration` uses logging mode to verify that `_varlen_attn()` was called and tests with `torch.compile`. `test_custom_op_compliances` uses `torch.library.opcheck()` to verify.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164504
Approved by: https://github.com/drisspg
Moving symbolic shape guards to C++ causes compile-time issues. This basically boils down to a tradeoff.
For models that have a large number of dynamic shape guards, this flag will help reduce guard latency. But most models have very few dynamic shape guards, so their guard latency is small anyway, and they would still see a high compile-time hit from calling gcc during compilation.
So a good default value seems to be False. We can write a doc to give guidance on reducing guard latency.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166427
Approved by: https://github.com/zou3519
Fixes #165944
**Summary**
Today, if we attempt to run flash attention with batch_size 0, we get error `Runtime Error: batch size must be positive`. This PR fixes this by returning early with empty tensors in the fwd and bwd.
**Test plan**
`python test/test_transformers.py -k test_scaled_dot_product_attention` - added case for batch_size=0
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166318
Approved by: https://github.com/drisspg
Fixes #159139
## The Cause
The bug occurs because the OptimizedModule wrapper in torch._dynamo.eval_frame doesn't define the `__len__` method. This causes Python's bool() check to fall back to the default object truthiness (always True) instead of correctly evaluating containers with len() == 0 as False.
## The Fix
A very easy fix: I just added the `__len__` method to the OptimizedModule class in torch._dynamo.eval_frame to delegate the call to the original module.
```python
def __len__(self):
"""
Proxy the len() call to the original module to fix truthiness checks.
"""
return len(self._orig_mod)
```
This successfully fixes the issue. The script now works as expected.
## Reproduction Script
```python
import torch
import torch.nn as nn
# Create an empty nn.ModuleList
original = nn.ModuleList()
# Compile it using torch.compile
compiled = torch.compile(original)
# Compare their boolean evaluations
print(f"bool(original): {bool(original)}")
print(f"bool(compiled): {bool(compiled)}")
# Trigger failure if they differ
assert bool(original) == bool(compiled), "BUG: truthiness behavior mismatch after compilation"
```
## Output
bool(original): False
bool(compiled): False
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159208
Approved by: https://github.com/andrewboldi, https://github.com/Lucaskabela
Co-authored-by: pushkar-hue <pushkarsharma.rtm@gmail.com>
Co-authored-by: Lucas Kabela <lucasakabela@gmail.com>
I accepted a PR that added this code, but re-examining it now, I'm questioning the approach. It seems like we're working around an issue with the inductor generating incorrect sizes. A comment suggests it might be related to unsqueezed u0 values. Removing this code didn't cause any failures, so I'll take it out and address the root issue if it arises.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166278
Approved by: https://github.com/Lucaskabela
Summary: To fix https://github.com/pytorch/pytorch/issues/159400. Currently, test_aoti_abi_check and test_aoti_inference need to be built in two passes: first build pytorch using the regular `python setup.py develop`, and then build with `CMAKE_FRESH=1 BUILD_AOT_INDUCTOR_TEST=1 python setup.py develop`. This is cumbersome. Fix by rewriting CMakeLists.txt for test_aoti_inference as a one-pass build which runs AOTI to compile models at test time. Also update the CI test script to get rid of the two-pass build. test_aoti_abi_check is not AOTI specific, so we make it not guarded by BUILD_AOT_INDUCTOR_TEST.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164277
Approved by: https://github.com/janeyx99
Pull the replace_node function out of Scheduler.finalize_multi_template_buffers(). This is needed by the next PR (#163369). As part of this, also pull _replace_operation_buffer() up to top level since it needs no self references.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163368
Approved by: https://github.com/PaulZhang12
### Summary
Improves validation of `torch.linalg.eig` results by verifying the eigen decomposition identity **A v − v λ = 0**.
### Motivation
Eigenvectors are not unique, and numerical differences between backends (cuSOLVER, MAGMA, CPU)
can cause false test failures. This PR replaces direct elementwise comparisons with a mathematical
identity check, improving robustness across devices.
### Details
- Introduces `fulfills_eigen_decomposition_identity()` in `test_eig_compare_backends()` to validate the eigen equation.
- Uses CPU matmul for high-precision verification.
- Handles zero-sized matrices explicitly.
- Tolerances derived from numerical comparisons between cuSOLVER and NumPy.
See discussion: [dev-discuss.pytorch.org link](https://dev-discuss.pytorch.org/t/cusolver-dnxgeev-faster-cuda-eigenvalue-calculations/3248/6)
### Impact
- Improves test stability and correctness across eig backends.
- No change to public API.
- All tests pass; lintrunner reports no issues.
- Enables introduction of new eig backends without false test failures.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166322
Approved by: https://github.com/lezcano
This PR refactors the autocast context manager in autocast_mode.py to simplify and centralize the logic for checking supported dtypes for each device. The previous implementation repeated similar checks for multiple device types. Now, a single mapping device_supported_dtypes is used to associate device types with their supported dtypes, and the validation logic is unified.
**The former PR #163446 was merged but reverted due to failing CI tests related to `openreg`.**
This PR additionally slightly modifies some test assertions to pass the CI tests. CI failed because of assertions on the exact error message. For example:
```
File "/var/lib/jenkins/workspace/test/cpp_extensions/open_registration_extension/torch_openreg/tests/test_autocast.py", line 9, in test_autocast_with_unsupported_type
with self.assertWarnsRegex(
AssertionError: "In openreg autocast, but the target dtype torch.float32 is not supported." does not match "In openreg autocast, but the target dtype is not supported. Disabling autocast."
```
Sorry for the inconvenience again.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165221
Approved by: https://github.com/albanD
RMS/Layer norm backward generates two kinds of reductions:
- the reduction computing dx, which reduces across the hidden dimension (in the context of a transformer)
- the reduction computing dw/db, which reduces across the BxT (batch size x sequence length) dimension.
These two sets of reductions have common input buffers, but Inductor cannot fuse them because of different loop orders.
There are multiple sources of custom kernels that implement fused versions of such kernels (Liger-Kernel, quack, Paul Zhang's internal post). This PR enables Inductor to generate such kernels automatically.
The generated kernel is very similar to 33924d20b6/src/liger_kernel/ops/rms_norm.py (L114) .
To keep the implementation simple and performant, we enable such fusion only if the inner reduction (computing dx) is a persistent reduction. This should be true for representative inputs. Persistent reduction is critical for the perf here, to make sure a loaded tensor does not need to be reloaded.
To make the inner reduction (computing dx) and the outer reductions (computing dw/db) fusible, the PR does the following:
1. convert the outer reductions to pointwise by replacing the 'reduction' & 'store_reduction' nodes with a new type of node, 'partial_accumulate'. The new node collects the reduction type, buffer name, input of the reduction, etc., which is essential for proper codegen.
2. do loop reordering (relying on the earlier loop-ordering-after-fusion work) to reorder the loops of the converted pointwise so it can be fused with the inner reduction.
3. add the epilogues that are needed at the end, e.g. the outer reduction may be followed by a division for the mean, or by a downcast if dw/db is in low precision (fp16/bf16).
Some early benchmarking on H100 shows about a 2x speedup for both RMSNorm and LayerNorm backward for the shape (1152 * 500, 384) used in an internal model. Note that I manually disabled split reduction in this benchmarking since otherwise the fusion would currently be skipped. The next PR will make mix-order reduction compose better with split reduction.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165370
Approved by: https://github.com/jansel
ghstack dependencies: #166204
Summary: In this diff, we add an event pool so that we don't have to create/destroy events all the time; instead, we re-use events from the pool.
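A minimal sketch of the event-pool idea (illustrative only, assuming CUDA events; the actual class, locking, and lifecycle in the diff differ):
```python
import torch

class CudaEventPool:
    """Illustrative sketch: reuse CUDA events from a free list instead of
    creating and destroying them on every use."""

    def __init__(self):
        self._free = []

    def acquire(self):
        # Hand back a previously released event if one is available.
        return self._free.pop() if self._free else torch.cuda.Event()

    def release(self, event):
        # Return the event to the pool for later reuse.
        self._free.append(event)
```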
Test Plan: contbuild
Differential Revision: D84685495
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165647
Approved by: https://github.com/bbus
Fixes #166068
Use the ROCm version string that does not contain a hash. The string is set in LoadHIP.cmake.
Tested on repro provided by reporter.
For a ROCm 7.0 docker container, we get `7.0.0`.
For a ROCm 7.0.2 docker container, we get `7.0.2`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166336
Approved by: https://github.com/jeffdaily
Summary:
Support deepseek-style scaling in Inductor Triton for FP8 GEMMs. DeepSeek-style scaling is a colloquial term for a fine-grained mixed precision framework using FP8 to train [Deepseek-V3](https://arxiv.org/pdf/2412.19437), DeepSeek AI's recent MoE (Mixture of Experts) model. DeepSeek-style scaling effectively extends the dynamic range of FP8 by mitigating dequantization overhead under increased-precision accumulation, which is key to achieving more accurate FP8 GEMM results.
DeepSeek-style scaling on matmul `A @ B` leverages two different types of scaling strategies to preserve a balance between numerical stability and training efficiency:
- Activations (input tensor `A`): tile-wise (1x128 across shape `(M, K)`)
- Weights (input tensor `B`): block-wise (128x128 across shape `(N, K)`)
This diff enables Inductor users to replicate past successes with deepseek-style scaling and achieve higher numerical stability while increasing training efficiency.
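A rough sketch of the two scale layouts (shapes only, assuming a 128-element block size and the M/N/K from the test plan below; the actual quantization and Inductor codegen differ):
```python
import torch

M, N, K = 4096, 768, 512
BLOCK = 128
fp8_max = torch.finfo(torch.float8_e4m3fn).max

a = torch.randn(M, K)  # activations
b = torch.randn(N, K)  # weights

# Activations: tile-wise scales, one scale per 1x128 tile of A -> shape (M, K // BLOCK).
a_scale = a.abs().reshape(M, K // BLOCK, BLOCK).amax(dim=-1) / fp8_max

# Weights: block-wise scales, one scale per 128x128 block of B -> shape (N // BLOCK, K // BLOCK).
b_scale = (
    b.abs()
    .reshape(N // BLOCK, BLOCK, K // BLOCK, BLOCK)
    .amax(dim=(1, 3))
    / fp8_max
)
```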
NOTE: Block-wise 128x128 scaling is only supported in CUDA 12.9+; therefore, deepseek-style scaling is currently unsupported in `fbcode` (CUDA 12.4). Use OSS PyTorch to run deepseek-style scaling.
NOTE: Accuracy for FP8 is unstable, even with high tolerances, which is why TritonBench benchmarks are unlikely to be accurate against a `torch` implementation.
Test Plan:
In OSS PyTorch, run
```
TORCHINDUCTOR_CACHE_DIR=~/personal/cache_dir_inductor CUDA_LAUNCH_BLOCKING=1 TORCH_USE_CUDA_DSA=1 TRITON_PRINT_AUTOTUNING=1 TRITON_ALWAYS_COMPILE=1 TORCH_LOGS=+inductor TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 ENABLE_PERSISTENT_TMA_MATMUL=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM=1 python run.py --op fp8_gemm --only torch_fp8_gemm,pt2_fp8_gemm --metrics tflops,accuracy --m 4096 --n 768 --k 512 --output="{output_dir}/deepseek_bench.csv" --scaling_deepseek --atol=1e-2 --rtol=0.5 2>&1 | tee ~/personal/deepseek_style/deepseek_bench.log
```
Differential Revision: D83609850
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164404
Approved by: https://github.com/slayton58
Previously, we would completely skip building and calling any resume function if the leaf frame's resume instruction was RETURN_VALUE/RETURN_CONST. Now, we only skip building/calling resume functions for frames that are resuming on RETURN_VALUE.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165808
Approved by: https://github.com/Lucaskabela
ghstack dependencies: #166013, #166015
This `patch.dict(counters, ...)` appears to be ancient code that doesn't really seem to be doing anything? It causes issues in nested graph breaks because the patch cleanup clears out the record of the nested graph break. Removing the patch to see if it's even needed in the first place.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166015
Approved by: https://github.com/Lucaskabela
ghstack dependencies: #166013
Summary:
as title
- Some IR nodes are created during `finalize_multi_template_buffers()` in Scheduler. This PR adds provenance (`origin_node` and `origins`) for those nodes.
- Extract `assign_origin_node` function
Test Plan:
```
buck run mode/opt fbcode//caffe2/test/inductor:provenance_tracing -- -r test_deferred_triton_kernels
```
Differential Revision: D83979975
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164746
Approved by: https://github.com/mlazos
Summary:
Add a gflag that allows us to skip empty constant named parameters during
dense loading. In [vm_parameters.py](https://fburl.com/code/7xr9ihwy), there is
a constant `_empty_tensor` parameter used by the model. This constant parameter
is skipped in XL weights during model publish because it is empty. That later
breaks model in-place update, because the parameter is reported by the AOTI
container but cannot be found in the model's merged weights. This diff allows
us to solve the problem.
Test Plan: Verified inplace update in job https://www.internalfb.com/vanguard/serving_test_cases/1165842932095688
Reviewed By: muchulee8, joannec3634
Differential Revision: D85082330
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166037
Approved by: https://github.com/muchulee8, https://github.com/jcwchen
Namely, error out rather than crash when the out dtype is of an unexpected type.
Resize the output tensor to the expected size in the `_out` operation, to prevent a crash when a tensor of an unexpected size is passed.
Preserve symbolic shapes whenever possible.
Test plan: Run `python test_ops.py -v -k test_out_warning_fft_hfft_mps` for the MPS device; without this change it crashes with `Error: Invalid KernelDAG, equalShape for destination failed'`. Run `python ../test/test_ops.py -v -k test_dtypes_stft_mps`; without this change it crashes with `A complex mlir::Type does not have a corresponding complex MPSDataType"` when the input dtype is bfloat16.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166272
Approved by: https://github.com/kulinseth
Fixes #163929
Fixes argmin/argmax operations to return correct logical indices instead of physical memory offsets when applied to transposed/permuted tensors. When `argmin()` or `argmax()` is called on a transposed tensor, Inductor was returning physical memory indices instead of logical row-major indices. This caused incorrect results that don't match eager mode behavior.
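A minimal repro sketch of the symptom (illustrative; the actual regression test lives in the Inductor test suite):
```python
import torch

def f(t):
    return t.argmax()

x = torch.tensor([[1.0, 9.0, 3.0],
                  [4.0, 5.0, 6.0]])
xt = x.t()  # transposed view: logical layout differs from physical memory layout

eager = f(xt)                    # logical row-major index of 9.0 in xt -> 2
compiled = torch.compile(f)(xt)  # must match eager, not the physical offset (1)
assert eager == compiled
```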
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165983
Approved by: https://github.com/shunting314
\# why
- enable users to control which choices get used on which inputs
- reduce lowering time, and pin kernel selection, by selecting
them for the inputs
\# what
- a new InductorChoices subclass that implements a lookup table
- a README explaining the usage
- corresponding testing
- currently only supports templates that go through
`V.choices.get_template_configs`
\# testing
```
python3 -bb -m pytest test/inductor/test_lookup_table.py -v
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164978
Approved by: https://github.com/PaulZhang12, https://github.com/eellison
description: Add unsigned integer (uint) type support to PyTorch operators by updating AT_DISPATCH macros. Use when adding support for uint16, uint32, uint64 types to operators, kernels, or when user mentions enabling unsigned types, barebones unsigned types, or uint support.
---
# Add Unsigned Integer (uint) Support to Operators
This skill helps add support for unsigned integer types (uint16, uint32, uint64) to PyTorch operators by updating their AT_DISPATCH macros.
## When to use this skill
Use this skill when:
- Adding uint16, uint32, or uint64 support to an operator
- User mentions "unsigned types", "uint support", "barebones unsigned types"
- Enabling support for kUInt16, kUInt32, kUInt64 in kernels
- Working with operator implementations that need expanded type coverage
## Quick reference
**Add unsigned types to existing dispatch:**
```cpp
// Before
AT_DISPATCH_V2(dtype,"op",AT_WRAP([&](){
kernel<scalar_t>();
}),AT_EXPAND(AT_ALL_TYPES));
// After (method 1: add unsigned types explicitly)
description: Convert PyTorch AT_DISPATCH macros to AT_DISPATCH_V2 format in ATen C++ code. Use when porting AT_DISPATCH_ALL_TYPES_AND*, AT_DISPATCH_FLOATING_TYPES*, or other dispatch macros to the new v2 API. For ATen kernel files, CUDA kernels, and native operator implementations.
---
# AT_DISPATCH to AT_DISPATCH_V2 Converter
This skill helps convert PyTorch's legacy AT_DISPATCH macros to the new AT_DISPATCH_V2 format, as defined in `aten/src/ATen/Dispatch_v2.h`.
## When to use this skill
Use this skill when:
- Converting AT_DISPATCH_* macros to AT_DISPATCH_V2
- Porting ATen kernels to use the new dispatch API
- Working with files in `aten/src/ATen/native/` that use dispatch macros
- User mentions "AT_DISPATCH", "dispatch v2", "Dispatch_v2.h", or macro conversion
description: Write docstrings for PyTorch functions and methods following PyTorch conventions. Use when writing or updating docstrings in PyTorch code.
---
# PyTorch Docstring Writing Guide
This skill describes how to write docstrings for functions and methods in the PyTorch project, following the conventions in `torch/_tensor_docs.py` and `torch/nn/functional.py`.
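For example, a short docstring in this style (the function itself is hypothetical):
```python
def row_sum(input, dim=0):
    r"""Returns the sum of each slice of :attr:`input` along :attr:`dim`.

    Args:
        input (Tensor): the input tensor.
        dim (int, optional): the dimension to reduce. Default: ``0``.

    Example::

        >>> t = torch.ones(2, 3)
        >>> row_sum(t, dim=0)
        tensor([2., 2., 2.])
    """
    return input.sum(dim=dim)
```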
description: Guide users through creating Agent Skills for Claude Code. Use when the user wants to create, write, author, or design a new Skill, or needs help with SKILL.md files, frontmatter, or skill structure.
---
# Skill Writer
This Skill helps you create well-structured Agent Skills for Claude Code that follow best practices and validation requirements.
## When to use this Skill
Use this Skill when:
- Creating a new Agent Skill
- Writing or updating SKILL.md files
- Designing skill structure and frontmatter
- Troubleshooting skill discovery issues
- Converting existing prompts or workflows into Skills
## Instructions
### Step 1: Determine Skill scope
First, understand what the Skill should do:
1. **Ask clarifying questions**:
- What specific capability should this Skill provide?
- When should Claude use this Skill?
- What tools or resources does it need?
- Is this for personal use or team sharing?
2. **Keep it focused**: One Skill = one capability
- Good: "PDF form filling", "Excel data analysis"
- Too broad: "Document processing", "Data tools"
### Step 2: Choose Skill location
Determine where to create the Skill:
**Personal Skills** (`~/.claude/skills/`):
- Individual workflows and preferences
- Experimental Skills
- Personal productivity tools
**Project Skills** (`.claude/skills/`):
- Team workflows and conventions
- Project-specific expertise
- Shared utilities (committed to git)
### Step 3: Create Skill structure
Create the directory and files:
```bash
# Personal
mkdir -p ~/.claude/skills/skill-name
# Project
mkdir -p .claude/skills/skill-name
```
For multi-file Skills:
```
skill-name/
├── SKILL.md (required)
├── reference.md (optional)
├── examples.md (optional)
├── scripts/
│ └── helper.py (optional)
└── templates/
└── template.txt (optional)
```
### Step 4: Write SKILL.md frontmatter
Create YAML frontmatter with required fields:
```yaml
---
name: skill-name
description: Brief description of what this does and when to use it
---
```
### Step 5: Write an effective description
The description is critical for Claude to discover your Skill.
**Formula**: `[What it does] + [When to use it] + [Key triggers]`
**Examples**:
✅ **Good**:
```yaml
description: Extract text and tables from PDF files, fill forms, merge documents. Use when working with PDF files or when the user mentions PDFs, forms, or document extraction.
```
✅ **Good**:
```yaml
description: Analyze Excel spreadsheets, create pivot tables, and generate charts. Use when working with Excel files, spreadsheets, or analyzing tabular data in .xlsx format.
```
❌ **Too vague**:
```yaml
description: Helps with documents
description: For data analysis
```
**Tips**:
- Include specific file extensions (.pdf, .xlsx, .json)
- Mention common user phrases ("analyze", "extract", "generate")
- List concrete operations (not generic verbs)
- Add context clues ("Use when...", "For...")
### Step 6: Structure the Skill content
Use clear Markdown sections:
```markdown
# Skill Name
Brief overview of what this Skill does.
## Quick start
Provide a simple example to get started immediately.
## Instructions
Step-by-step guidance for Claude:
1. First step with clear action
2. Second step with expected outcome
3. Handle edge cases
## Examples
Show concrete usage examples with code or commands.
## Best practices
- Key conventions to follow
- Common pitfalls to avoid
- When to use vs. not use
## Requirements
List any dependencies or prerequisites:
```bash
pip install package-name
```
## Advanced usage
For complex scenarios, see [reference.md](reference.md).
```
### Step 7: Add supporting files (optional)
Create additional files for progressive disclosure:
**reference.md**: Detailed API docs, advanced options
**examples.md**: Extended examples and use cases
**scripts/**: Helper scripts and utilities
**templates/**: File templates or boilerplate
Reference them from SKILL.md:
```markdown
For advanced usage, see [reference.md](reference.md).
Run the helper script:
\`\`\`bash
python scripts/helper.py input.txt
\`\`\`
```
### Step 8: Validate the Skill
Check these requirements:
✅ **File structure**:
- [ ] SKILL.md exists in correct location
- [ ] Directory name matches frontmatter `name`
✅ **YAML frontmatter**:
- [ ] Opening `---` on line 1
- [ ] Closing `---` before content
- [ ] Valid YAML (no tabs, correct indentation)
- [ ] `name` follows naming rules
- [ ] `description` is specific and < 1024 chars
✅ **Content quality**:
- [ ] Clear instructions for Claude
- [ ] Concrete examples provided
- [ ] Edge cases handled
- [ ] Dependencies listed (if any)
✅ **Testing**:
- [ ] Description matches user questions
- [ ] Skill activates on relevant queries
- [ ] Instructions are clear and actionable
### Step 9: Test the Skill
1. **Restart Claude Code** (if running) to load the Skill
2. **Ask relevant questions** that match the description:
```
Can you help me extract text from this PDF?
```
3. **Verify activation**: Claude should use the Skill automatically
4. **Check behavior**: Confirm Claude follows the instructions correctly
### Step 10: Debug if needed
If Claude doesn't use the Skill:
1. **Make description more specific**:
- Add trigger words
- Include file types
- Mention common user phrases
2. **Check file location**:
```bash
ls ~/.claude/skills/skill-name/SKILL.md
ls .claude/skills/skill-name/SKILL.md
```
3. **Validate YAML**:
```bash
cat SKILL.md | head -n 10
```
4. **Run debug mode**:
```bash
claude --debug
```
## Common patterns
### Read-only Skill
```yaml
---
name: code-reader
description: Read and analyze code without making changes. Use for code review, understanding codebases, or documentation.
allowed-tools: Read, Grep, Glob
---
```
### Script-based Skill
```yaml
---
name: data-processor
description: Process CSV and JSON data files with Python scripts. Use when analyzing data files or transforming datasets.
---
```
@@ -19,7 +18,7 @@ aspects of contributing to PyTorch.
- [Python Unit Testing](#python-unit-testing)
- [Better local unit tests with `pytest`](#better-local-unit-tests-with-pytest)
- [Local linting](#local-linting)
- [Running `mypy`](#running-mypy)
- [Running `pyrefly`](#running-pyrefly)
- [C++ Unit Testing](#c-unit-testing)
- [Run Specific CI Jobs](#run-specific-ci-jobs)
- [Merging your Change](#merging-your-change)
@@ -67,23 +66,6 @@ aspects of contributing to PyTorch.
Follow the instructions for [installing PyTorch from source](https://github.com/pytorch/pytorch#from-source). If you get stuck when developing PyTorch on your machine, check out the [tips and debugging](#tips-and-debugging) section below for common solutions.
### Setup the development environment
First, you need to [fork the PyTorch project on GitHub](https://github.com/pytorch/pytorch/fork) and follow the instructions at [Connecting to GitHub with SSH](https://docs.github.com/en/authentication/connecting-to-github-with-ssh) to setup your SSH authentication credentials.
Then clone the PyTorch project and setup the development environment:
# Or run `make setup-env-cuda` for pre-built CUDA binaries
# Or run `make setup-env-rocm` for pre-built ROCm binaries
source venv/bin/activate # or `. .\venv\Scripts\activate` on Windows
```
### Tips and Debugging
* If you want to have no-op incremental rebuilds (which are fast), see [Make no-op build fast](#make-no-op-build-fast) below.
@@ -299,7 +281,7 @@ dependencies as well as the nightly binaries into the repo directory.
**Prerequisites**:
The following packages should be installed with `pip`:
- `expecttest` and `hypothesis` - required to run tests
- `mypy` - recommended for linting
- `pyrefly` - recommended for type checking. [Pyrefly](https://pyrefly.org/)
- `pytest` - recommended to run tests more selectively
Running
```
@@ -368,15 +350,32 @@ make lint
Learn more about the linter on the [lintrunner wiki page](https://github.com/pytorch/pytorch/wiki/lintrunner)
#### Running `mypy`
#### Running `pyrefly`
`mypy` is an optional static type checker for Python. We have multiple `mypy`
configs for the PyTorch codebase that are automatically validated against whenever the linter is run.
[Pyrefly](https://pyrefly.org/) is a high-performance static type checker for Python. It provides fast type checking along with IDE features like autocomplete and instant error feedback.
PyTorch uses Pyrefly for type checking across the codebase. The configuration is managed in `pyrefly.toml` at the root of the repository.
**Getting Started with Pyrefly:**
To run type checking on the PyTorch codebase:
```bash
pyrefly check
```
For more detailed error information with summaries: