The deprecation warning led to warning spamming in PyTorch APIs, like
torch.compile. This is not how a deprecation warning should go: if we
add a deprecation warning, we'd better update our built-in APIs to
prevent warning spam.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166956
Approved by: https://github.com/albanD
We store a mapping between generated fx graph code and original model code stack trace in `fx.traceback._FX_METADATA_REGISTRY`. And we do a post-processing on the memory snapshot to append the original model stack trace information.
To achieve this, the biggest change we had to do in `aot_eager` mode is to give each generated fx graph a unique stack trace, i.e. it cannot just be `<eval_with_key>`. We set co_filename to **pretend** that the code is from `co_filename` file. Now instead of `<eval_with_key>` in stack trace, we get something like `fx_generated_3a4b5c6d7e8f9a0.py`.
`augment_with_fx_traces` arg is added to `torch.cuda.memory._snapshot` and `_dump_snapshot`. When the arg is set to True, a post-processing will run to populate the original model stack trace to the snapshot frames.
The new behavior of GraphModule can be controlled by `TORCH_ENRICH_RPOFILER_STACK_TRACE` or `_dynamo.config.enrich_profiler_metadata=True`.
Alternative:
Instead of setting co_filename, we can also do it like below:
Note that if we do it this way, we will need to dump the file to make the graph module torch-scriptable. TorchScript requires source access in order to carry out compilation, so we need to make sure original .py files are available.
```
key = filename
globals_copy = globals.copy()
globals_copy["__file__"] = key
globals_copy["__name__"] = key
linecache.lazycache(key, globals_copy)
exec(compile(src, key, "exec"), globals)
````
Other changes:
- Update `MemoryViz.js` to display fx node information and original model code if exist
```
python test/test_fx.py -k test_lineno_map
python test/test_fx.py -k test_custom_traceback_raised
python test/test_public_bindings.py
python test/test_cuda.py -k test_fx_memory
python test/test_fx.py -k test_informative_co_filename
python test/test_fx.py -k test_autowrap_functions
python test/dynamo/test_utils.py -k test_inductor_provenance
```
```python
# Profile with memory snapshot
torch.cuda.memory._record_memory_history()
with torch._dynamo.config.patch("enrich_profiler_stack_trace", True):
compiled = torch.compile(mod, backend="aot_eager", fullgraph=True)
result = compiled(torch.randn(10, 10, device="cuda:0"))
torch.cuda.memory._dump_snapshot("memory_snapshot.pickle", augment_with_fx_traces=True)
torch.cuda.memory._record_memory_history(enabled=None)
```
<img width="913" height="711" alt="Screenshot 2025-10-30 at 10 40 44 AM" src="https://github.com/user-attachments/assets/8d7a1833-f98d-4756-b666-1d63ab57b27b" />
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166676
Approved by: https://github.com/albanD, https://github.com/ezyang
Results from CI:
No failures but generally takes longer, maybe ~20% increase in time?
But the smaller runner is ~25% of the cost of the current runner, so in terms of cost this is a decrease
If the 20% is too much, we can try the 4x larger runners, which are about half the cost of the current runner, so it would probably still result in cost savings with hopefully less impact to time
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164989
Approved by: https://github.com/BoyuanFeng, https://github.com/huydhn
Summary:
cuBlasLt enforces size/stride requirements for 1x128 and 128x128 blockwise scaling
kernels, some of which weren't being handled, causing silent incorrect
answers especially for 128x128 scaling cases.
cuBlasLt enforces ([docs](https://docs.nvidia.com/cuda/cublas/#scaling-factors-layouts)) for deepseek-style
scaling, for `A: MxN`, `B: KxN` you have the following:
```Py
L = K // 128
L4 = round_up(L, 4)
1x128 x 128x128:
* A_scale: [M, K // 128], stride: [1, M]
* B_scale: [L4, N // 128], stride: [1, L4]
128x128 x 1x128:
* A_scale: [L4, M // 128], stride: [1, L4]
* B_scale: [N, K // 128], stride: [1, N]
1x128 x 1x128:
* A_scale: [M, K // 128], stride: [1, M]
* B_scale: [N, K // 128], stride: [1, N]
```
Notable here is the `L4` term, which means that we must round up to the nearest multiple of 4 blocks
in the `K` dimension. This wasn't enforced previously, and caused silent wrong answers
where `(K // 128) % 4 != 0`.
Test Plan:
Reviewers:
Subscribers:
@vkuzo
Tasks:
Tags:
Signed-off-by: Simon Layton <simonlayton@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166752
Approved by: https://github.com/drisspg, https://github.com/vkuzo
Fixes#164684
### Description
Symbolic tracing fails during multiplication between a `SymBool` and a `Tensor`. This scenario is triggered when `.item()` is called on a 0-dim boolean tensor within a `torch.compile` region. In compile mode, this yields a `SymBool`, and the subsequent `SymBool * FakeTensor` operation is unsupported, leading to a `TypeError` or a data-dependent `UserError`.
### Solution
This PR addresses the issue at the type-conversion level, as suggested by reviewers.
The root cause of the TypeError is that torch.sym_float() (which is called by _maybe_convert_to_dtype during type promotion for aten.mul) lacks a conversion path for SymBool and incorrectly falls back to builtins.float(SymBool).
This fix addresses this by implementing the __sym_float__(self) method within the SymBool class (defined in torch/__init__.py).
The torch.sym_float(a) utility function is already designed to check for hasattr(a, "__sym_float__") before falling back to builtins.float(). By adding this method, SymBool instances now correctly advertise their ability to be cast to SymFloat. The new method implementation leverages self.node.sym_float() to correctly convert the symbolic boolean value to its symbolic float representation (0.0 or 1.0), resolving the TypeError at its source.
This approach is more fundamental than modifying a specific operation in builtin.py and ensures SymBool can be correctly promoted to SymFloat in any operation, while still preserving its boolean nature for control flow operations like guard_or_false (which is verified by a new test case).
### Verification
1. **Bug Reproduced**: The initial `UserError: Could not guard on data-dependent expression` was successfully reproduced with the script from the issue. As shown below
<img width="1369" height="945" alt="Screenshot 2025-10-13 at 10 29 05" src="https://github.com/user-attachments/assets/8daa4555-3347-4af5-906a-02150b8df9d1" />
2. **Fix Validated**: After applying the code changes, the same script now runs to completion, printing `✅ eager success` and `✅ compile success`. As shown below
<img width="1228" height="82" alt="Screenshot 2025-10-13 at 10 29 21" src="https://github.com/user-attachments/assets/94c4f143-b898-4dda-9bff-0ad5450a30fa" />
3. Added a new test class DynamoOpPromotionTests to test/dynamo/test_misc.py with three new test cases:
1. test_symbool_tensor_mul_does_not_fail: Verifies that the original bug report code (with .item() + *) no longer raises an error when compiled.
2. test_symbool_guard_or_false: Verifies that this fix does not cause a regression for guard_or_false(SymBool) (the concern raised by reviewers).
3. test_symbool_tensor_mul: Verifies the behavior of Tensor(bool) * Tensor(float) (without .item()) for completeness.
All new tests were added and pass locally.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165264
Approved by: https://github.com/laithsakka, https://github.com/Lucaskabela
# Motivation
This PR introduces support for peer-to-peer (P2P) access between devices, including querying and enabling P2P connections between two devices.
It supports two categories of allocations:
- Regular allocations;
- Expandable segment allocations.
# Additional Context
The follow-up is that we should use this feature to optimize our copy kernel when P2P is supported.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166424
Approved by: https://github.com/gujinghui, https://github.com/albanD
ghstack dependencies: #166299, #166292
# Motivation
This PR intends to add expandable segment feature support on XPU. This will help
- Reduce memory fragmentation;
- Gradually map physical pages into virtual address space as needed.
# Additional Context
The traditional caching allocator frequently allocates and frees device memory blocks. However, over time, with varying tensor size, the device address space becomes fragmented. Even when there's enough total free memory, a lack of contiguous space can cause large allocations to fail.
The **expandable segment** feature addresses this by dynamically extending physical memory within a reserved virtual address range, reducing fragmentation and minimizing reallocation overhead.
The potential drawbacks are
- Virtual memory overhead;
- Potential page mapping overhead;
- Increased complexity.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166292
Approved by: https://github.com/albanD, https://github.com/EikanWang, https://github.com/gujinghui
ghstack dependencies: #166299
Fixes#159139
## The Cause
The bug occurs because the OptimizedModule wrapper in torch._dynamo.eval_frame doesn't call the len method. This causes Python's bool() check to fall back to the default object truthiness (always True) instead of correctly evaluating containers with len() == 0 as False.
## The Fix
A very easy fix . I just added the len method to OptimizedModule in torch._dynamo.eval_frame class to delegate the call to the original module
```python
def __len__(self):
"""
Proxy the len() call to the original module to fix truthiness checks.
"""
return len(self._orig_mod)
```
This successfully fixes the issue . The script now works as expected.
## Reproduction Script
```python
import torch
import torch.nn as nn
# Create an empty nn.ModuleList
original = nn.ModuleList()
# Compile it using torch.compile
compiled = torch.compile(original)
# Compare their boolean evaluations
print(f"bool(original): {bool(original)}")
print(f"bool(compiled): {bool(compiled)}")
# Trigger failure if they differ
assert bool(original) == bool(compiled), "BUG: truthiness behavior mismatch after compilation"
```
## Output
bool(original): False
bool(compiled): False
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159208
Approved by: https://github.com/Lucaskabela
Co-authored-by: pushkar-hue <pushkarsharma.rtm@gmail.com>
Co-authored-by: Lucas Kabela <lucasakabela@gmail.com>
Provides type coverage to torch/_dynamo/variables/builtin.py
### Coverage report:
`mypy torch/_dynamo/variables/builtin.py --linecount-report /tmp/coverage_log`
Compare before to after - we go from 2213 lines and 64 funcs covered to 3212 lines and 85 funcs covered
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166745
Approved by: https://github.com/williamwen42
Fixes#163149
### Summary:
Fixes mypy type checking failures in `test_type_hints` by consolidating typing imports and eliminating duplicate/conflicting import patterns that caused mypy to fail resolving type annotations.
### Impact:
- `test_type_hints` works fine now
- module: tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163150
Approved by: https://github.com/Skylion007
Summary:
Conversion from/to bfloat16 was not getting covered by conversion templates, because these used bfloat16_t as data type instead of the custom c10::BFloat16
Conversion by casting from/to bfloat16_t is broken in clang-[17, 20], fixed in clang-21.
Because Pytorch does not currently have CI running binaries compiled using clang-21, we won't implement this approach for now.
We are currently only adding conversion from bfloat16, as it can be implementing by zero-extending into a 4-byte float.
We've observed the following performance improvements, when compiling with clang-19 and targeting armv9a+sve2:
Before:
bfloat16_t->uint8 ===> 423.583us
bfloat16_t->int8 ===> 424.090us
bfloat16_t->int16 ===> 430.817us
bfloat16_t->int64 ===> 571.547us
bfloat16_t->double ===> 459.089us
After:
bfloat16_t->uint8 ===> 123.783us ----> 342% higher throughput
bfloat16_t->int8 ===> 131.575us -----> 322% higher throughput
bfloat16_t->int16 ===> 136.794us ----> 315% higher throughput
bfloat16_t->int64 ===> 177.699us ----> 322% higher throughput
bfloat16_t->double ===> 165.556us ---> 277% higher throughput
Test Plan:
Correctness:
buck2 test mode/opt //caffe2/test:test_ops
buck2 test mode/opt //caffe2/test:torch
Performance:
buck2 run mode/opt //caffe2/benchmarks/operator_benchmark/fb:operator_benchmark_test
Differential Revision: D86119613
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166880
Approved by: https://github.com/mcfi, https://github.com/aditew01
**Summary:** Just like in fully_shard, I added two overload replicate functions. The `@overload` declarations are necessary because the `@contract` decorator uses `ParamSpec` to capture function parameters, which creates a generic `_ContractFn` protocol signature (`*args: _P.args, **kwargs: _P.kwargs`) that Pyrefly cannot properly type-check when calling the function with explicit keyword arguments. In addition, to make the api cleaner I changed device_mesh input argument to mesh to match fully_shard formatting.
**Test Cases**
1. pytest test/distributed/_composable/test_replicate_with_fsdp.py
2. pytest test/distributed/_composable/test_replicate_training.py
3. pytest test/distributed/_composable/test_composability/test_pp_composability.py -k test_replicate_pp
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166459
Approved by: https://github.com/weifengpy
ghstack dependencies: #166433
Summary:
https://github.com/pytorch/pytorch/pull/166609 updated `is_impure` check to now check ops inside a subgraph to decide whether a `call_module` node is pure or not.
This change of behavior affects dead code elimination, commonly run as `gm.graph.eliminate_dead_code()`. Specifically, dead code elimination will not erase a node that has no users if this node has side effect or is impure. With above mentioned pr, dead code elimination no longer eliminates unused subgraphs that contain side-effectful ops.
This affects `const_fold.split_const_subgraph`, what this function does is:
1. split a graph into two submodules, one containing all const ops and one containing non-const ops
2. inline the submodule containing non-const ops back to main graph.
3. run dead code elimination to remove the unused non-const submodule.
With pr #166609 step 3 no longer erases the unused module. As an example, exported graph
```
graph():
%x : [num_users=2] = placeholder[target=x]
%_guards_fn : [num_users=0] = call_module[target=_guards_fn](args = (%x,), kwargs = {})
%empty_permuted : [num_users=1] = call_function[target=torch.ops.aten.empty_permuted.default](args = ([5, 10], [0, 1]), kwargs = {device: cpu, pin_memory: False})
%bernoulli : [num_users=1] = call_function[target=torch.ops.aten.bernoulli.p](args = (%empty_permuted, 0.6), kwargs = {})
%mul : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%x, %bernoulli), kwargs = {})
%div : [num_users=1] = call_function[target=torch.ops.aten.div.Tensor](args = (%mul, 0.6), kwargs = {})
return (div,)
```
After running const_fold, empty_permuted is const-folded, the rest of ops are not, and the main graph looks like
```
graph():
%x : [num_users=3] = placeholder[target=x]
%_fx_const_folded_attrs : [num_users=2] = get_attr[target=_FX_CONST_FOLDED_ATTRS]
%_guards_fn : [num_users=0] = call_module[target=_guards_fn](args = (%x,), kwargs = {})
%bernoulli_p : [num_users=1] = call_function[target=torch.ops.aten.bernoulli.p](args = (%_fx_const_folded_attrs, 0.6), kwargs = {})
%mul_tensor : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%x, %bernoulli_p), kwargs = {})
%div_tensor : [num_users=1] = call_function[target=torch.ops.aten.div.Tensor](args = (%mul_tensor, 0.6), kwargs = {})
%submod_1 : [num_users=0] = call_module[target=submod_1](args = (%x, %_fx_const_folded_attrs), kwargs = {})
return (div_tensor,)
```
`submod_1` is dangling, unused, and just inlined into the graph.
## Fix
This pr updates `const_fold._inline_module` function to explicitly remove the non-const submodule which is unused, after it has inlined the submodule's ops into main graph.
Test Plan:
Added a test in `test_fx_const_fold.py`.
The test would have failed before this PR becuase it yields above example graph leaving an unused `call_module[target=submod_1]` op.
With the PR, the module is erased from main graph correctly.
Differential Revision: D86056354
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166871
Approved by: https://github.com/blaine-rister, https://github.com/mlazos
**Summary:** I have created a new composable replicate api that's integrated into FSDP's codebase with minimal changes. The key changes I made are when we use DDPMeshInfo, we use Replicate placements, prevent initial sharding of parameters, set worldsize to 1 to skip allgathers and reducescatter.
**Test Cases**
1. pytest test/distributed/_composable/test_replicate_training.py
2. pytest test_pp_composability.py
3. pytest test_replicate_with_fsdp.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166433
Approved by: https://github.com/weifengpy
Summary:
DCP checkpoint background process currently determines the port used for pg via get_free_port().
During checkpoint background process initialization, gloo pg init occasionally times out on the first call but succeeds in a subsequent call.
We hypothesized that the timeouts are related to the port being used, and the solution would be to create the pg with PrefixStore and reuse the master port.
This diff adds the option for checkpoint background process to use PrefixStore with MASTER_ADDR + MASTER_PORT.
The default behavior is unchanged. Enabling the new PrefixStore behavior requires setting "DCP_USE_PREFIX_STORE" env var to "1".
context:
https://fb.workplace.com/groups/319878845696681/permalink/1516883985996155/
Differential Revision: D84928180
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166560
Approved by: https://github.com/meetv18
Fixes#166253
## Summary
When `torch.full` is called with a 0-D tensor as `fill_value` inside a `torch.compile`'d function, the value was being incorrectly cached, causing subsequent calls with different values to return the first value.
## Root Cause
The Dynamo handler for `torch.full` was calling `aten._local_scalar_dense` to convert tensor fill_values to Python scalars at compile time, which baked the value into the compiled graph as a constant.
## Solution
Modified the Dynamo handler to decompose `torch.full(size, tensor_fill_value)` into `empty(size).fill_(tensor_fill_value)` when `fill_value` is a `TensorVariable`, keeping the fill value dynamic in the compiled graph.
## Testing
Added test case that verifies torch.full works correctly with dynamic tensor fill_values across multiple calls and dtypes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166554
Approved by: https://github.com/Lucaskabela
Summary:
make_fx() will register tensor constants as new buffers while tracing a shuffle graph for dynamo graph capture. This breaks the invariance that the resulting graph looks identical to the original eager model in terms of state dict.
So we need to de-register the buffers and set them as plain tensor constants.
Test Plan:
pytest test/export/test_experimental.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166777
Approved by: https://github.com/tugsbayasgalan
ghstack dependencies: #166775, #166776
Summary:
dict_keys_getitem can show up in the bytecode but it's using dict.keys() which is not fx tracable.
fx.wrap should make it as a standalone function in the graph to be invoked later with real inputs.
Test Plan:
pytest test/export/test_experimental.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166776
Approved by: https://github.com/jamesjwu
ghstack dependencies: #166775
Summary:
as title, we should return an entire tracing_context object instead of fake_mode only, since tracing context should contain full set of information.
Test Plan:
pytest test/export/test_experimental.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166775
Approved by: https://github.com/tugsbayasgalan
By wrapping the python objects with FakeScriptObject(FakeOpaqueQueue) we restrict users to do anything to this object. torch.compile support can be easily enabled by the rest of [this stack](https://github.com/pytorch/pytorch/pull/163936) and existing support for ScriptObjects.
One thing to note is that by default in functionalization we mark all ops that take in FakeScriptObjects as being effectful. Should this be the case for these custom ops that take in python objs?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165005
Approved by: https://github.com/zou3519
Slice knows how to handle unbacked start, we do not need to offset start before calling slice, we can leave it for slice.
The only edge case is when start<0 and start+length ==0 in that case slice and narrow would deviate,
for that case we shall pass dim_size instead of start+length
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166361
Approved by: https://github.com/aorenste
Summary: NVIDIA uses CUTLASS for row-wise scaling prior to cuBLAS version 12.9. This change enables support for FP16 data type for both bias and output when using CUTLASS.
Test Plan:
pytest -svv test/test_scaled_matmul_cuda.py
Test results on cuda-12.4:
```
test/test_scaled_matmul_cuda.py::TestFP8MatmulCUDA::test_scaled_mm_vs_emulated_row_wise_bfloat16_cuda PASSED [0.0022s]
test/test_scaled_matmul_cuda.py::TestFP8MatmulCUDA::test_scaled_mm_vs_emulated_row_wise_float16_cuda PASSED [0.0023s]
test/test_scaled_matmul_cuda.py::TestFP8MatmulCUDA::test_scaled_mm_vs_emulated_row_wise_float32_cuda SKIPPED [0.0005s]
======================= 51 passed, 516 skipped in 5.26s ========================
```
Test results on cuda-12.9:
```
test/test_scaled_matmul_cuda.py::TestFP8MatmulCUDA::test_scaled_mm_vs_emulated_row_wise_bfloat16_cuda PASSED [0.0046s]
test/test_scaled_matmul_cuda.py::TestFP8MatmulCUDA::test_scaled_mm_vs_emulated_row_wise_float16_cuda PASSED [0.0040s]
test/test_scaled_matmul_cuda.py::TestFP8MatmulCUDA::test_scaled_mm_vs_emulated_row_wise_float32_cuda PASSED [0.0038s]
======================= 70 passed, 482 skipped in 5.88s ========================
```
Reviewed By: pranavsharma, RandySheriff
Differential Revision: D84169910
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166744
Approved by: https://github.com/slayton58
Summary:
Support tile-wise `1x128` scaling in Inductor Triton for FP8 GEMMs, i.e. scaling values along tensors `a` and `b` represent a `1x128` slice of input.
NOTE: Block-wise `128x128` and `1x128` scaling is only supported in CUDA 12.9+; therefore, tile-wise scaling is currently unsupported in `fbcode` (CUDA 12.4). Use OSS PyTorch to run tile-wise scaling (as with deepseek-style scaling).
Test Plan:
Works out-of-the-box with TritonBench:
```
TORCHINDUCTOR_CACHE_DIR=~/personal/cache_dir_inductor CUDA_LAUNCH_BLOCKING=1 TORCH_USE_CUDA_DSA=1 TRITON_PRINT_AUTOTUNING=1 TRITON_ALWAYS_COMPILE=1 TORCH_LOGS=+inductor TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 ENABLE_PERSISTENT_TMA_MATMUL=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM=1 buck2 run mode/{opt,inplace} pytorch/tritonbench:run -- --op fp8_gemm --only torch_fp8_gemm,pt2_fp8_gemm --metrics tflops,accuracy --m 256 --n 768 --k 512 --output="/home/jananisriram/personal/random_bench.csv" --scaling-pair=BlockWise1x128,BlockWise1x128 --atol=1e-2 --rtol=0.5
```
Differential Revision: D84025878
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165132
Approved by: https://github.com/eqy, https://github.com/drisspg, https://github.com/njriasan
**Summary:** Just like in fully_shard, I added two overload replicate functions. The `@overload` declarations are necessary because the `@contract` decorator uses `ParamSpec` to capture function parameters, which creates a generic `_ContractFn` protocol signature (`*args: _P.args, **kwargs: _P.kwargs`) that Pyrefly cannot properly type-check when calling the function with explicit keyword arguments. In addition, to make the api cleaner I changed device_mesh input argument to mesh to match fully_shard formatting.
**Test Cases**
1. pytest test/distributed/_composable/test_replicate_with_fsdp.py
2. pytest test/distributed/_composable/test_replicate_training.py
3. pytest test/distributed/_composable/test_composability/test_pp_composability.py -k test_replicate_pp
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166459
Approved by: https://github.com/weifengpy
ghstack dependencies: #166433
**Summary:** I have created a new composable replicate api that's integrated into FSDP's codebase with minimal changes. The key changes I made are when we use DDPMeshInfo, we use Replicate placements, prevent initial sharding of parameters, set worldsize to 1 to skip allgathers and reducescatter.
**Test Cases**
1. pytest test/distributed/_composable/test_replicate_training.py
2. pytest test_pp_composability.py
3. pytest test_replicate_with_fsdp.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166433
Approved by: https://github.com/weifengpy
This updates the docs-build nightly configuration to match other uses of the _linux-build.yml workflow using `runner_prefix` rather than `runner` directly. The default runner defined in _linux-build.yml is the c7i variant so this also updates the runner appropriately.
Relates to pytorch/test-infra#7175. While moving to c7i costs 5% more, CPU intensive jobs should run roughly 15-20% faster resulting in a cost reduection of 10-15% for those jobs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166727
Approved by: https://github.com/huydhn
### Summary
Adds support for NVIDIA’s cuSolver backend to torch.linalg.eig and torch.linalg.eigvals within the ATen/Linalg framework.
### Motivation
Extending PyTorch’s Linalg backends with NVIDIA’s cuSolver enables faster execution of torch.linalg.eig and torch.linalg.eigvals, complementing existing MAGMA and CPU implementations.
The speedup observed on consumer hardware (RTX4070/Ryzen 5700x) is in the order of **2x**, with preliminary testing on HPC hardware (H100, EPYC 9454) suggesting **up to 10x speedup**.
### Details
- Implements cuSolver support for linalg_eig and linalg_eigvals using the interface described in [NVIDIA cuSolver documentation](https://docs.nvidia.com/cuda/cusolver/index.html#cusolverdnxgeev) as introduced in CUDA 12.8 [CUDA 12.8 release notes](https://docs.nvidia.com/cuda/archive/12.8.0/cuda-toolkit-release-notes/index.html)
- Follows the existing MAGMA backend design, adapting it for cuSolver’s cusolverDnXgeev API.
- Integrates with existing eig/eigvals dispatch mechanism.
- No automatic CPU↔GPU backend switching. (Happy to discuss)
- Verified via existing Linalg test coverage; no new tests introduced in this PR.
- Tested successfully against both test_linalg.py including slow test suites.
- Tested MAGMA fallback successfully using CUDA 12.4. (observed unrelated test failures)
### Impact
- Enables much faster solving of eigenvalue problems
- Maintains numerical consistency and test stability across backends.
- No change to public API or user-facing behavior.
Special thanks to @AlbanD for prior feedback and discussions regarding the PR and @lezcano for feedback on the related testing PR [https://github.com/pytorch/pytorch/pull/166322](https://github.com/pytorch/pytorch/pull/166322).
Happy to discuss backend dispatch strategy, results from performance and stability testing can be seen here [https://dev-discuss.pytorch.org/](https://dev-discuss.pytorch.org/t/cusolver-dnxgeev-faster-cuda-eigenvalue-calculations/3248/7)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166715
Approved by: https://github.com/lezcano, https://github.com/albanD
This PR simplifies `std::copy_n` calls in CopyKernel and IndexKernel. `std::copy_n` is used to create a data pointer array from the input data pointers. However, more careful review reveals that the dest pointers are actually aliases of the original pointers. So we can removes the pointer manipulations.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143544
Approved by: https://github.com/albanD
Summary: Projects that use `-Wswitch-enum` will encounter issues when building and using *PyTorch* (`caffe2`). Address these issues to empower more rigorous upstream compiler warnings/errors.
Test Plan: CI Pass
Differential Revision: D85893917
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166760
Approved by: https://github.com/atalman
This PR adds two checks:
```
readability-static-definition-in-anonymous-namespace
Finds static function and variable definitions
in anonymous namespace.
readability-named-parameter
Find functions with unnamed arguments.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164692
Approved by: https://github.com/Skylion007
`static_input_indices` is used for cudagraphs to determine which input indices are static and will not have changing addresses. Since export never integrated with cudagraphs this information was not necessary. But now we need it!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166761
Approved by: https://github.com/BoyuanFeng
Replace separate min() and max() calls with single aminmax() call in max_unpool_out_mps_template to improve performance by reducing tensor traversals from O(2n) to O(n).
Changes:
- Use indices.aminmax() instead of separate indices.min()/max() calls
- Add required ATen/ops/aminmax.h header for AT_PER_OPERATOR_HEADERS
- Maintain identical bounds checking logic and error handling
This optimization is particularly beneficial for large indices tensors, improving cache locality and reducing computational overhead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165394
Approved by: https://github.com/cyyever, https://github.com/Skylion007
Preivously, we would stash a single stream value we constructed at trace time in a global and return the same value from repeated calls to the graph.
With this PR, we construct the stream value in advance, reference the constructed value in the graph via the lookup table, and if that value is returned as an output, read the value from the lookup table and return it (in bytecode, not as a graph output, since we don't support arbitrary stream outputs).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164819
Approved by: https://github.com/anijain2305
Fixes#163971
Problem:
PyTorch's inductor compiler crashed with IndexError: list index out of range when compiling code that uses 0-dimensional tensors with operations like torch.softmax(scalar_tensor, dim=0).
A 0-dim tensor has shape = torch.Size([]) (empty shape)
```
ndim = 0 (zero dimensions)
len(shape) = 0 (no indices to access)
# Line 972: Pad other_shape to match inp dimensions
other_shape = [1] * (inp_ndim - len(other_shape)) + list(other_shape)
# For scalar tensors:
# inp_ndim = 0 # as input is scalar
# other_shape = []
# Result: [1] * (0 - 0) + [] = [] (still empty!)
dim = match.kwargs["dim"] # dim = 0
if isinstance(dim, int):
dim = (dim,)
# crash is happening here!
return all(statically_known_true(other_shape[d] == 1) for d in dim)
# ^^^^^^^^^^^^^^^^
# Tries other_shape[0] but other_shape = [] (empty!)
# → IndexError: list index out of range
```
The function _other_is_broadcasted_in_dim() is an optimization check for a softmax fusion pattern. It verifies whether it's safe to rewrite:
```
# From
scaled = inp * other
result = scaled - scaled.amax(dim, keepdim=True)
# To this more stable form:
result = (inp - inp.amax(dim, keepdim=True)) * other
```
The optimization is only valid if other is constant across the reduction dimension (i.e., broadcasted to size 1 in that dimension). Otherwise, scaling changes which element is the maximum.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166547
Approved by: https://github.com/jansel, https://github.com/eellison, https://github.com/leslie-fang-intel
It's nice to add a curve with a customized compilation options so that we can compare side-by-side the perf improvement of new features.
E.g. for mix-order-reduction, by running the following command
```
python benchmarks/dynamo/genai_layers/benchmark.py --tolerance=1e-2 --exit-on-accuracy-failure --visualize rmsnorm_backward --custom-compile-name="compiled-no-fusion" --custom-compile-options='{"triton.mix_order_reduction":false}'
```
I get following output:
```
Geomean speedup for benchmark RMSNormBackward
eager 11 data points
compiled 11 data points, 15.82x speedup
quack 11 data points, 15.45x speedup
liger 11 data points, 14.06x speedup
compiled-no-fusion 11 data points, 10.26x speedup
```
The output shows that the feature on average improve perf by `15.82 / 10.26 = 1.54x` for all the shapes tested. (I remove a shape (32768, 32768) whose rnumel is too large and not representative).
The new curve also shows up in the figure:
<img width="3564" height="2368" alt="RMSNormBackward_bench" src="https://github.com/user-attachments/assets/1ffac2bc-e726-4f1e-806d-e9e5de711492" />
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166697
Approved by: https://github.com/BoyuanFeng
ghstack dependencies: #166053, #166382, #166461, #166585, #166675
Adds customizable hooks on `__torch_dispatch__` calls for logging/recording arbitrary values.
Recording hooks store the hook outputs for each call at `debug_mode.operators[*].record`
```python
with DebugMode() as debug_mode, DebugMode.dispatch_hooks(record_hook = some_func):
# some compute
...
```
Logging hooks annotate the string dump:
```python
with DebugMode() as debug_mode, DebugMode.dispatch_hooks(log_hook = some_func):
...
```
Adds default hooks `DebugMode.record_outputs()` and `DebugMode.log_tensor_hashes()`, for checking numerical equivalence. The hashing hook borrows from the Observer. Example dump:
```
aten::sum(dt: f32[8, 32]| S(0))
aten::sum(t: f32[1, 32]) # {'hash': 3.2215590476989746}
_c10d_functional::all_gather_into_tensor(t: f32[1, 32], 8, 0) # {'hash': 204.8783062621951}
_c10d_functional::wait_tensor(t: f32[8, 32]) # {'hash': 204.8783062621951}
aten::mm(t: f32[1, 8], t: f32[8, 32]) # {'hash': 12.014171155635267}
aten::sum(t: f32[1, 32]) # {'hash': 3.2215590476989746}
aten::t(t: f32[1, 8]) # {'hash': 3.7167285680770874}
aten::detach(t: f32[8, 1]) # {'hash': 3.7167285680770874}
...
```
On the FSDP2 / simple FSDP NE in https://github.com/pytorch/pytorch/pull/164939, with hashing, this produces 2 log dumps (FSDP2: P2010198620, simple FSDP: P2010198963). I asked Claude to check the hashes, it wrote an analysis script, and was able to guess RMS norm as the root cause: P2010195076
Another throw-away example for logging per-op memory usage: https://gist.github.com/pianpwk/372082bf29467aa4aa25cb26dee24aea
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166348
Approved by: https://github.com/yushangdi
Keep tracking of the reduction before splitting.
In the mix-order reduction context, if one of the reduction is split, it makes it much harder to fuse with the other reduction. Tracking the metadata of the reduction before splitting to make the fusion possible.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166053
Approved by: https://github.com/jansel
A few things to note:
1. Customers like vllm use a custom backend (e.g. VllmBackend), split the graph, and call standalone_compile for each split. If we let the bisector override the backend, we won't bisect thru the custom backend. `test_configs.bisect_keep_custom_backend_for_inductor` is used to keep the custom backend if we are bisecting for inductor.
2. pre_grad_graph bisecting and lowering bisecting so far does not compose well with each other since an issue may be just captured by the first one we try. `test_configs.bisect_pre_grad_graph` is used to enable the 'pre_grad_graph' bisecting.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166344
Approved by: https://github.com/eellison
Slice knows how to handle unbacked start, we do not need to offset start before calling slice, we can leave it for slice.
The only edge case is when start<0 and start+length ==0 in that case slice and narrow would deviate,
for that case we shall pass dim_size instead of start+length
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166361
Approved by: https://github.com/aorenste
This PR replaces c10::call_once calls with static initialization when possible. C++11 semantics guarantees that static initialization is atomic. Static initialization also has lower cost than using c10::call_once.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166381
Approved by: https://github.com/malfet
The goal of this PR is to provide a standard way to create simple treespec instances and hide the implementation details of the `PyTreeSpec` class.
Changes:
1. Add function `treespec_leaf()` to replace `LeafSpec()`.
2. Add function `treespec_tuple(...)` and `treespec_dict(...)` to create treespec for `tuple` / `dict` which is used for `*args` / `**kwargs`. This avoids direct modification to `treespec` instances that rely on the implementation details of the `PyTreeSpec` class.
3. Change `len(spec.children_specs)` to `spec.num_children`.
4. Change `isinstance(spec, LeafSpec)` to `spec.is_leaf()`.
------
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160843
Approved by: https://github.com/mlazos
Summary:
As per title, a bot created this diff because this test broke due to [a different PR.](https://github.com/pytorch/pytorch/pull/166026)
<Erased bot summary in case anything we don't want to make external.>
Test Plan:
Bot ran the tests and they passed.
<Erased bot test plan in case anything we don't want to make external.>
Differential Revision: D85745809
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166753
Approved by: https://github.com/d4l3k
Graph partition relies on `get_free_symbol_uses()` to collect symbol inputs.
ee7434be82/torch/_inductor/scheduler.py (L4869-L4885)
I empirically observed that `get_free_symbol_uses()` becomes slower for larger graphs. Specifically, I tried to aten fallback for torchtitan which results in 10k+ aten nodes. When processing the 600-th node, it takes seconds to `get_free_symbol_uses()` for 1 node.
Why? Because `get_free_symbol_uses()` may recursively call another `get_free_symbol_uses()`, which could recursively run many times.
ee7434be82/torch/_inductor/ir.py (L4541-L4543)
This PR fixes the issue by caching the results of `get_free_symbol_uses()`. I validated on torchtitan that the issue is fixed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166338
Approved by: https://github.com/eellison
Provides type coverage to ~3000 LOC and 200 methods in `torch/_dynamo/variables/`
This is the first part of the final step to having 100% strict type coverage in dynamo - see previous comments in https://github.com/pytorch/pytorch/pull/166535 (combined into this one PR because ghstack was giving issues...)
### Coverage report:
```
mypy torch_dynamo/variables --linecount-report /tmp/coverage_log
```
Compare before to after - we go from 3826 to 7221 lines covered
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166569
Approved by: https://github.com/williamwen42, https://github.com/Skylion007
- Remove all complex defines logic from the header
- Make GreenContext constructor private, as it should only be created via the static method as singleton
- Delete unused `getContext` and `getGreenContext` methods
- Rename `CUDA_HAS_GREEN_CONTEXT` to `HAS_CUDA_GREEN_CONTEXT()`, which results in compilation error if one accidentally makes a typo
- Suppress `-Wunused-private-field` is GreenContext is not available
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166462
Approved by: https://github.com/ngimel, https://github.com/eqy
This PR refactors the name AOTAutogradCacheEntry into AOTAutogradResult, and BundledAOTAutogradCacheEntry into BundledAOTAutogradResult. It also moves all coresponding files to a new file, `aot_autograd_result`, which is analogous to `output_code.py` from Inductor.
Having all these be called cache entries made sense when all we used them for was caching. But with AOT compile using BundledAOTAutogradCacheEntry, we want a more generalized naming structure.
This is a no-op change, and all existing tests should pass.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166656
Approved by: https://github.com/zhxchen17
ghstack dependencies: #166650
This PR does two things:
- It genericizes `BundledAOTAutogradCacheEntry` to support *any* outputcode, not just CompiledFxGraphs
- It adds a brand new OutputCode for the `aot_eager_regional_inductor` backend, i.e. a graph module that has regional inductor components in it.
This allows BundledAOTAutogradCache to just integrate nicely with inductor out of the box, but more importantly, it allows the result of aot_autograd to be fully serializable when using `aot_eager_regional_inductor`. This will allow us to AOT precompile cases where we have an eager graph that has scooped up inductor bits.
It's a bit unfortunate that the naming makes BundledAOTAutogradCacheEntry sound like its primary use is for caching, but really the more common use is going to be as an AOTAutogradOutput. It may be worth revisiting how to refactor/rename these in a later PR:
- AOTAutogradCacheEntry -> AOTAutogradResult
- BundledAOTAutogradCacheEntry -> BundledAOTAutogradResult
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166650
Approved by: https://github.com/zhxchen17
Partitioner functionality is appealing to use in different scenarios (E.g. Autoparallel)
We have special logic about "partitioner_tag" from meta that is only needed for forward/backward split.
Adding optional argument to avoid it and do only generic split based on inputs/outputs.
Potentially we want to make `_extract_graph_with_inputs_outputs` without underscore :)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166725
Approved by: https://github.com/bdhirsh
Summary:
## Context
when `const_fold.split_const_subgraphs` sees a `call_module` node that is a GraphModule, by the existing implementation it can mark this node as const-foldable when it shouldn't.
For example, a parent graph contains a `call_module` to a subgraph that has no inputs but contain impure ops inside.
```
parent graph():
%sub : [num_users=1] = call_module[target=sub](args = (), kwargs = {})
%getitem : [num_users=1] = call_function[target=operator.getitem](args = (%sub, slice(None, None, None)), kwargs = {})
return (getitem,)
submodule graph():
%randn : [num_users=1] = call_function[target=torch.ops.aten.randn.default](args = ([5, 10],), kwargs = {device: cpu, pin_memory: False})
%add : [num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%randn, 1), kwargs = {})
return (add,)
```
when `submodule` graph is fed to const_fold.split_const_subgraph, it would come out unmodified since randn is impure.
But if the `submodule` is called by a `parent` graph, when `parent` is fed to const_fold.split_const_subgraph, it would come out folded.
```
parent after fold graph():
%_fx_const_folded_attrs : [num_users=1] = get_attr[target=_FX_CONST_FOLDED_ATTRS]
return (_fx_const_folded_attrs,)
```
This is because `node.is_impure()` check inside `const_fold.split_const_subgraph` fail through, leading the call_module node to be marked as pure.
## Fix
We can update `fx.node.Node.is_impure` function to check for ops inside a call_module node with an additional `subgraph_has_impure_ops` check:
- if a call_module node calls a GraphModule,
- check any call_function nodes are impure ops
- recursively check any call_module nodes that call GraphModule
If the call_module subgraph has impure ops, return True to `is_impure`
Test Plan: added tests to test_fx_const_fold.py
Differential Revision: D85798483
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166609
Approved by: https://github.com/blaine-rister
Provides type coverage to ~3000 LOC and 200 methods in `torch/_dynamo/variables/`
This is the first part of the final step to having 100% strict type coverage in dynamo - see previous comments in https://github.com/pytorch/pytorch/pull/166535 (combined into this one PR because ghstack was giving issues...)
### Coverage report:
```
mypy torch_dynamo/variables --linecount-report /tmp/coverage_log
```
Compare before to after - we go from 3826 to 7221 lines covered
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166569
Approved by: https://github.com/williamwen42
Summary:
We were setting the custom inductor choice using `torch._inductor.virtualized.V.set_choices_handler(CustomInductorChoices())`. However, this leads to inconsistent behaviors, even for jobs that are submitted back to back.
In this diff, we pass in the choice handler via an inductor config and overwrite the default behavior when the config is provided. This sovles the inconsistent behavior.
Test Plan: see D85785892 (internal only)
Differential Revision: D85785879
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166607
Approved by: https://github.com/eellison
The goal of this PR is to provide a standard way to create simple treespec instances and hide the implementation details of the `PyTreeSpec` class.
Changes:
1. Add function `treespec_leaf()` to replace `LeafSpec()`.
2. Add function `treespec_tuple(...)` and `treespec_dict(...)` to create treespec for `tuple` / `dict` which is used for `*args` / `**kwargs`. This avoids direct modification to `treespec` instances that rely on the implementation details of the `PyTreeSpec` class.
3. Change `len(spec.children_specs)` to `spec.num_children`.
4. Change `isinstance(spec, LeafSpec)` to `spec.is_leaf()`.
------
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160843
Approved by: https://github.com/mlazos
Summary:
FXConverter configurs _node_metadata_hook passing in `fake_mode` explicitly, which is relevant for cases down the line like `_generate_triton_call` that inserts a `triton_kernel_wrapper_mutation` node.
This `fake_mode` is obtained from `_detect_fake_mode_from_gm`, which can be different from inductor set `V.fake_mode`.
For example, while `V.fake_mode` is not None, `_detect_fake_mode_from_gm` can be **None** for a parent graph containing only a submodule which has no input args and only constants
```
parent graph():
%sub : [num_users=1] = call_module[target=sub](args = (), kwargs = {})
%getitem : [num_users=1] = call_function[target=operator.getitem](args = (%sub, slice(None, None, None)), kwargs = {})
return (getitem,)
submodule graph():
%randn : [num_users=1] = call_function[target=torch.ops.aten.randn.default](args = ([5, 10],), kwargs = {device: cuda, pin_memory: False})
%add : [num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%randn, 1), kwargs = {})
return (add,)
```
Getting this discrepnancy is flawed, it makes `_node_metadata_hook` try running inputs in a different "fake_mode" or no fake_mode when the rest of lowering uses `V.fake_mode`. In some cases where input is placed on custom non-gpu device, it can even complain with "requires device to be started" or tensor device mismatch.
So this diff updates FXConverter.generate to use `V.fake_mode` which is populated by inductor properly.
Test Plan:
added a test `test_const_folded_subgraph` in `test_fxir_backend.py`, this test:
- creates a graph module that calls a subgraph with no inputs and containing only const-foldable ops
- const fold the subgraph
- run FXConverter.generate, expect `fake_mode` used to code-generate is not None
On the prior implementation when `_detect_fake_mode_from_gm` was used, this test would fail as fake_mode would be `None`.
With this change, the test passes, `fake_mode` is properly collected from `V.fake_mode` which is not None.
Differential Revision: D85767475
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166591
Approved by: https://github.com/blaine-rister, https://github.com/mlazos, https://github.com/eellison
This diff tries to fix a limitation in Sigmoid + Triton interaction, where float arguments are not correctly passed. NativeRT passes float arguments as double, while triton kernels were reading as a float, resulting in wrong values.
---
## Limitations in (de)seriazliation
In triton, float arguments to a kernel are encoded as "fp32" ([code](https://github.com/triton-lang/triton-cpu/blob/main-merged/python/triton/runtime/jit.py#L310-L326)):
```
elif isinstance(arg, float):
return ("fp32", None)
```
But it seems like that torch export serde uses double ([code](d2eff5d454/torch/_export/serde/export_schema.thrift (L149))) because Thrift only has the double type:
```
union Argument {
10: bool as_none;
20: TensorArgument as_tensor;
30: list<TensorArgument> as_tensors;
50: i64 as_int;
70: list<i64> as_ints;
80: double as_float; ===> actually double
...
```
`TritonKernel` constructor loads attributes from a node, where `Constant` represents the variant type. And it only has `double` ([code](d2eff5d454/torch/nativert/graph/Graph.h (L86))):
```
using Constant = std::variant<
None,
int64_t,
std::vector<int64_t>,
double, ===> triton float is loaded as double
```
So, NativeRT passes float arguments (originally in Triton) as double to triton kernels. But, all of the triton backends (nvidia, amd and cpu) are reading them as float because the signature still says `fp32`.
D84423898 was the current workaround: wrapping float arguments with tensors.
## The Fix
Fixing the thrift definition isn't viable because Thrift only supports double type. It's also possible to fix on the triton side: it can downcast from double to float. But I needed to fix all backends.
Instead, I think this diff would be the most effective way: when building `TritonKernel`, have downcasted float values, right after loading double arguments.
Test Plan:
```
buck test fbcode//mode/opt-amd-gpu fbcode//caffe2/test:test_export --
```
Differential Revision: D85747160
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166620
Approved by: https://github.com/XueningXu
Provides type coverage to ~3000 LOC and 200 methods in `torch/_dynamo/variables/`
This is the first part of the final step to having 100% strict type coverage in dynamo - see previous comments in https://github.com/pytorch/pytorch/pull/166535 (combined into this one PR because ghstack was giving issues...)
### Coverage report:
```
mypy torch_dynamo/variables --linecount-report /tmp/coverage_log
```
Compare before to after - we go from 3826 to 7221 lines covered
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166569
Approved by: https://github.com/williamwen42
Graph partition relies on `get_free_symbol_uses()` to collect symbol inputs.
ee7434be82/torch/_inductor/scheduler.py (L4869-L4885)
I empirically observed that `get_free_symbol_uses()` becomes slower for larger graphs. Specifically, I tried to aten fallback for torchtitan which results in 10k+ aten nodes. When processing the 600-th node, it takes seconds to `get_free_symbol_uses()` for 1 node.
Why? Because `get_free_symbol_uses()` may recursively call another `get_free_symbol_uses()`, which could recursively run many times.
ee7434be82/torch/_inductor/ir.py (L4541-L4543)
This PR fixes the issue by caching the results of `get_free_symbol_uses()`. I validated on torchtitan that the issue is fixed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166338
Approved by: https://github.com/eellison
This PR introduces CustomOp autotuning. It allows user to provide a CustomOpConfig:
(1) to register (optional) multiple decomposition implementations for custom operations and
(2) to register parameter tuning knobs and values they want to tune for the decompositions
so that inductor automatically select the best-performing variant through Inductor's autotune benchmarking.
Example:
```python
register_custom_op_autotuning(
custom_op=my_attention_op,
configs=[
CustomOpConfig(attention_impl, head_dim=32, method='chunked'),
CustomOpConfig(attention_impl, head_dim=64, method='tiled'),
CustomOpConfig(head_dim=128), # no decompositions
],
input_gen_fns={
"query": lambda fake: torch.randn_like(fake, device='cuda'),
"key": lambda fake: torch.randn_like(fake, device='cuda'),
"value": lambda fake: torch.randn_like(fake, device='cuda'),
}
)
```
**CustomOpConfig**: Each CustomOpConfig defines exactly one autotuning variant with specific parameter values and optional decomposition implementation with PyTorch aten ops. Users can register their own tuning knobs and optional decomposition functions for the same custom operation. The system automatically benchmarks all variants to select the best performing. If no decomposition is provided in the config, the CustomOp's default implementation will be used.
**Custom Input Generation**: Users can provide custom input generators via an optional `input_gen_fns` to control how synthetic inputs are created during benchmarking. This enables more realistic performance testing by generating inputs that match expected data distributions and characteristics for each tensor argument.
**More Examples with autotune logs:**:
1. Allow user to register customOp decompositions with tuning parameters for autotuning. Example usage:
```python
from torch._inductor.kernel.custom_op import CustomOpConfig, register_custom_op_autotuning
def decompose_k_implementation(a: torch.Tensor, b: torch.Tensor, k_splits: int = 4) -> torch.Tensor:
"""Matrix multiply with k-way decomposition."""
# Implementation...with k_splits
@torch.library.custom_op("my_lib::decompose_k", mutates_args=())
def test_decompose_k_op(
a: torch.Tensor, b: torch.Tensor, k_splits: int
) -> torch.Tensor:
return decompose_k_implementation(a, b, k_splits)
# Register autotuning with different k_splits values
register_custom_op_autotuning(
custom_op=test_decompose_k_op,
configs=[
CustomOpConfig(decompose_k_implementation, k_splits=2),
CustomOpConfig(decompose_k_implementation, k_splits=32),
CustomOpConfig(decompose_k_implementation, k_splits=64),
CustomOpConfig(k_splits=128), # can make decomposition optional, then use default impl test_decompose_k_op
CustomOpConfig(k_splits=256)
],
input_gen_fns={
"a": lambda fake: torch.randn_like(fake, device='cuda') * 0.1,
"b": lambda fake: torch.randn_like(fake, device='cuda') * 0.1,
}
)
```
Example result:
```
{"num_choices": 6, "num_triton_choices": 0, "best_kernel": "test_decompose_k_autotuned_fallback_default", "best_time": 0.09980800002813339}
AUTOTUNE test_decompose_k_autotuned(256x65536, 65536x1024)
strides: [65536, 1], [1024, 1]
dtypes: torch.float16, torch.float16
test_decompose_k_autotuned_fallback_default 0.0998 ms 100.0%
test_decompose_k_autotuned_decompose_k_implementation_k_splits_2_0 0.1096 ms 91.0% CustomOp decompose_k_implementation_k_splits_2
test_decompose_k_autotuned_decompose_k_implementation_k_splits_32_1 0.1277 ms 78.2% CustomOp decompose_k_implementation_k_splits_32
test_decompose_k_autotuned_decompose_k_implementation_k_splits_64_2 0.1454 ms 68.6% CustomOp decompose_k_implementation_k_splits_64
test_decompose_k_autotuned_decompose_k_implementation_k_splits_128_3 0.1536 ms 65.0% CustomOp decompose_k_implementation_k_splits_128
test_decompose_k_autotuned_decompose_k_implementation_k_splits_256_4 0.2084 ms 47.9% CustomOp decompose_k_implementation_k_splits_256
```
2. Allow user to tune parameter knob by passing the parameter and values in the CustomOpConfig.
**Example**
```python
def mlp_variants(input_tensor, gate_weight, up_weight, down_weight, method):
"""MLP implementation with different computational approaches."""
if method == 0:
# Standard separate matmuls
# ... implementation
elif method == 1:
# Batched approach with torch.mm
# ... implementation
elif method == 2:
# Fused weights approach
# ... implementation
@torch.library.custom_op("my_lib::mlp_op", mutates_args=())
def mlp_op(
input_tensor: torch.Tensor,
gate_weight: torch.Tensor,
up_weight: torch.Tensor,
down_weight: torch.Tensor,
method: int,
) -> torch.Tensor:
return mlp_variants(
input_tensor, gate_weight, up_weight, down_weight, method=method
)
register_custom_op_autotuning(
custom_op=mlp_op,
configs=[
CustomOpConfig(method=0),
CustomOpConfig(method=1),
CustomOpConfig(method=2),
# method=0 is the default fallback in the original op
],
input_gen_fns={
"input_tensor": lambda fake: torch.randn_like(fake, device='cuda') * 0.1,
"gate_weight": lambda fake: torch.randn_like(fake, device='cuda') * 0.05,
# ... other input generators
}
)
```
Example result:
```
AUTOTUNE test_mlp_autotuned(4x32x512, 512x1024, 512x1024, 1024x256)
test_mlp_autotuned_mlp_variants_method_2 0.0181 ms 100.0% CustomOp mlp_variants_method_2
test_mlp_autotuned_mlp_variants_method_1 0.0185 ms 97.8% CustomOp mlp_variants_method_1
test_mlp_autotuned_mlp_default_fallback_method_0 0.0198 ms 91.4% CustomOp fallback
```
### Test Suite (`test/inductor/test_custom_op_autotune.py`)
* **RMSNorm autotuning**: Tests different RMSNorm implementations with dynamic input shapes
* **MLP autotuning**: Tests different MLP decomposition and tuning "method" parameter
* **DecomposeK**: Tests different k_splits values for matrix multiplication decomposition with k dim split
* **Multi-parameter tuning**: Tests configs with multiple tuning parameters (scale_mode, chunk_size)
### Next Step:
- Enable Max-autotune with user passed in max-autotune config. https://github.com/pytorch/pytorch/pull/165526/files
- Support inline epilogue fusion for selected best customop decomposition with surrounding elementwise ops. https://github.com/pytorch/pytorch/pull/165952/files
- Support customop autotune considering fusion with multiTemplateBuffer. WIP
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164212
Approved by: https://github.com/zou3519
Fixes#166253
## Summary
When `torch.full` is called with a 0-D tensor as `fill_value` inside a `torch.compile`'d function, the value was being incorrectly cached, causing subsequent calls with different values to return the first value.
## Root Cause
The Dynamo handler for `torch.full` was calling `aten._local_scalar_dense` to convert tensor fill_values to Python scalars at compile time, which baked the value into the compiled graph as a constant.
## Solution
Modified the Dynamo handler to decompose `torch.full(size, tensor_fill_value)` into `empty(size).fill_(tensor_fill_value)` when `fill_value` is a `TensorVariable`, keeping the fill value dynamic in the compiled graph.
## Testing
Added test case that verifies torch.full works correctly with dynamic tensor fill_values across multiple calls and dtypes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166554
Approved by: https://github.com/Lucaskabela
- Remove all complex defines logic from the header
- Make GreenContext constructor private, as it should only be created via the static method as singleton
- Delete unused `getContext` and `getGreenContext` methods
- Rename `CUDA_HAS_GREEN_CONTEXT` to `HAS_CUDA_GREEN_CONTEXT()`, which results in compilation error if one accidentally makes a typo
- Suppress `-Wunused-private-field` is GreenContext is not available
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166462
Approved by: https://github.com/ngimel, https://github.com/eqy
DebugMode currently stores dispatch call args & kwargs, which is all intermediate tensors and more. This quickly OOMed on GPU when trying to debug some torchtitan / llama 8b models.
This defaults to storing the stringified version, adding a flag `DebugMode(store_original_args=True)` if users want to store the original args as-is (and for BC).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166347
Approved by: https://github.com/yushangdi
Our CP codebase now contains several files and we are adding more. This
PR refactors the code to consolidate the files into a context_parallel
folder but keep the import so that the existing users of CP won't be
affected.
Unfortunately, we have to split this PR into two PRs as the PyTorch
infra cannot accept a PR with 3000+ LoC change and git cannot recognize
that _context_parallel/_attention.py is moved from _attention.py because
we want to keep BC.
This is the second PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166501
Approved by: https://github.com/Skylion007
ghstack dependencies: #166456
Looking for feedback on this approach.
Received user reports of spurious pyrefly errors for users using hg instead of git. I think this was due to the fact that when using a venv and git, `make setup-env` installs requirements and pulls from a nightly torch wheel, which is needed for pyrefly to type check properly.
Initial documentation for `make setup-env` I found here: https://github.com/pytorch/pytorch/blob/main/CONTRIBUTING.md#developing-pytorch
Testing:
```
hg clone --git ssh://git@github.com/pytorch/pytorch.git
conda create -n pytorch_env python=3.10 # (or manually create venv instead of using script)
cd pytorch
pip install -r requirements.txt
pip install -r requirements-build.txt
lintrunner init
# check how many pyrefly errors - 15,709 errors (11,693 ignored)
lintrunner # confirm error message / warning appears
>>> General linter failure:
Warning (PYREFLY) nightly-wheel-not-run
pytorch-nightly.pth not found. You may need to run make setup-env or make
setup-env-conda to install nightly binaries and type stubs.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166603
Approved by: https://github.com/aorenste
Our CP codebase now contains several files and we are adding more. This PR refactors the code to consolidate the files into a context_parallel folder but keep the import so that the existing users of CP won't be affected.
Unfortunately, we have to split this PR into two PRs as the PyTorch infra cannot accept a PR with 3000+ LoC change and git cannot recognize that _context_parallel/_attention.py is moved from _attention.py because we want to keep BC.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166456
Approved by: https://github.com/Skylion007
Summary:
Using `CUDA_KERNEL_ASSERT_PRINTF` inside kernels allows us to log invalid values to the console (that can be in turn used to surface _hopefully_ more clearer error messages).
This does have an impact in the number of registers needed for the values being logged (I confirmed via diffing PTX that there is no other impact relative to using `__assert_fail`)
To avoid causing perf bottlenecks, this change adds a compile-time switch to enable more verbose errors in some of the common kernels that cause DSAs. There is also a Buck config that can be used to configure this switch more conveniently.
## Alternatives considered
I considered making the behavior of `CUDA_KERNEL_ASSERT_PRINTF` controllable via a compile-time macro instead of writing another wrapper for it but there are kernels where the extra register pressure is not as severe and in those cases, having more useful error messages by default is pretty useful.
Test Plan:
## Simple Python Driver:
```
# scatter_errors.py
import torch
def main() -> None:
a = torch.rand(128, device="cuda:0")
idx = torch.randint(0, 128, (100,), device="cuda:0")
idx[0] = 9999
b = torch.scatter(a, 0, idx, 555.0)
print(b)
```
When running normally via:
```
$ buck2 run @//mode/opt :scatter_errors
```
we see the followng DSA message:
```
fbcode/caffe2/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:410: operator(): block: [0,0,0], thread: [0,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
```
Running via:
```
$ buck2 run @//mode/opt -c fbcode.c10_enable_verbose_assert=1 :scatter_errors
```
however produces:
```
[CUDA_KERNEL_ASSERT] fbcode/caffe2/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:410: operator(): block: [0,0,0], thread: [0,0,0]: Assertion failed: `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"`: Expected 0 <= idx_dim < index_size (128), but got idx_dim = 9999
```
Differential Revision: D85185987
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166171
Approved by: https://github.com/ngimel
Summary:
After a precision study, we concluded it is ok to use ACL's exp function on f32's erf()
We can keep erf inline this way.
Benchmarks show about 91% higher throughput when processing a tensor of 1M elements, compiling with clang-19:
Before:
f32 erf: 2539.179us
After:
f32 erf: 1329.063us
Test Plan:
Correctness:
buck2 test mode/opt //caffe2/test:test_ops
buck2 test mode/opt //caffe2/test:torch
Performance:
buck2 run mode/opt //caffe2/benchmarks/operator_benchmark/fb:operator_benchmark_test
Differential Revision: D85730452
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166594
Approved by: https://github.com/mcfi, https://github.com/fadara01
Fixes#165177
When converting guards to sources if we were unable to get the expected symbol from symbol_to_source then try to get it from var_to_sources.
I was unable to make a simpler repro than what was described in the issue (which relies on llama3 - so inappropriate for a unit test).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165723
Approved by: https://github.com/bobrenjc93
Summary: When emitting symbolic variable definition in fx_graph_runnable.py, we need to check if a SymNode is actually an expression, so that we won't generate something like "s27*s53**2 = 36".
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166529
Approved by: https://github.com/mlazos
ghstack dependencies: #166432
In the initial pr for overlapping preserving bucketing, for a graph like:
```
def foo(...):
ag = all_gather(...)
hiding_compute = mm(...)
wait(ag)
```
We would add dependencies from mm -> ag, and wait from wait -> hiding_compute, to prevent bucketing reordering these collectives so that overlap no long occurred. however, there is an additional way for bucketing to prevent overlap.
If we were to reorder another collective so the graph looked like:
```
def foo(...):
ag = all_gather(...)
ar = all_reduce(...)
wait(ar)
hiding_compute = mm(...)
wait(ag)
```
Overlap would not occur, because the wait for the all reduce would also force realization of every collective enqueued on the same stream prior to the all reduce. NCCL uses a single stream per process group.
To model, we set a set a strict ordering of all collective starts, waits, and hiding compute initially when bucketing. Then, when trying to add a collective to a bucket, we will see if we interfere with overlap for all of the following possible bucketings:
[move collective start to bucket start, move bucket start to collective start] x [move collective wait to bucket wait x move bucket wait to collective wait].
For any of these positions, we check if overlap would have been interfered with because of stream queue semantics. Then, if not, we remove the moving start and wait from the constrained ordering of collectives, and see if it's topologically valid to merge the nodes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166324
Approved by: https://github.com/IvanKobzarev
ghstack dependencies: #166309
Preivously, we would stash a single stream value we constructed at trace time in a global and return the same value from repeated calls to the graph.
With this PR, we construct the stream value in advance, reference the constructed value in the graph via the lookup table, and if that value is returned as an output, read the value from the lookup table and return it (in bytecode, not as a graph output, since we don't support arbitrary stream outputs).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164819
Approved by: https://github.com/anijain2305
ghstack dependencies: #164304, #164522
Summary:
was digging through matmul padding for other work, and I noticed that the compute bound checking won't work on MI350 since we haven't supplied the tech specs yet.
I added MI350 specs following the predefined format
Test Plan: CI
Differential Revision: D85804980
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166576
Approved by: https://github.com/leitian
**Summary**
This implements the backward pass for the Varlen API and registers `_varlen_attn()` as a custom op.
**Benchmarking**
To benchmark, we compare runtime and TFLOPs against the current SDPA approach with padding.
Settings:
- 1 H100 machine
- `batch_size=8`, `max_seq_len=2048`, `embed_dim=1024`, `num_heads=16`
- dtype `torch.bfloat16`
- `is_causal=False`
- for variable length, we set sequences to be random multiples of 64 up to `max_seq_len`
- 100 runs
| | Variable Length API | SDPA |
|--------|--------------------|----------|
| Runtime | 0.8189142608642578 ms | 3.263883056640625 ms |
| TFLOPs | 268.652 | 158.731 |
We can see that runtime for Varlen is >3x faster
**Testing**
Run `python test/test_varlen_attention.py` for unit tests where we verify basic functionality and confirm numerical match between varlen gradients vs SDPA.
For custom op testing, `test_custom_op_registration` uses logging mode to verify that `_varlen_attn()` was called and tests with `torch.compile`. `test_custom_op_compliances` uses `torch.library.opcheck()` to verify.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164504
Approved by: https://github.com/drisspg
* We are separating out the rocm jobs of the periodic workflow
* We are introducing a new label `ciflow/periodic-rocm-mi200` to allow us to run distributed tests only on ROCm runners, without triggering many other jobs on the `periodic.yml` workflow (via `ciflow/periodic`)
* This new workflow will also be triggered via the `ciflow/periodic`, thus maintaining the old status quo.
* We are reverting to the `linux.rocm.gpu.4` label since it targets a lot more CI nodes at this point than the K8s/ARC-based `linux.rocm.gpu.mi250.4` label, as that is still having some network/scaling issues.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166544
Approved by: https://github.com/jeffdaily
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Summary:
This ads waitcounter for whether or not the pool is running, as well as if we
are running jobs.
This also ads waitcounters for the first job within a pool. First job and running are working correctly. The job waitcounter seems to either be detecting a leak of a job, or is broken subtly.
Test Plan:
We've tested this internally and see valid ods metrics.
Note that we may be leaking jobs, or the job logic may not be handling an exception correctly.
Differential Revision: D83705931
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164527
Approved by: https://github.com/masnesral
torch.cuda._busy_wait_for_flag() will launch a kernel that spins until a flag is set by a corresponding torch.cuda._clear_flag(). These **must** be run on separate streams or it will deadlock.
When used correctly these kernels will put work on the GPU that is more predictable than torch.cuda._sleep() in cases where the unit test is depending on the GPU being busy.
Fixes#120318.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166218
Approved by: https://github.com/jeffdaily
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Currently most variable trackers implement `iter` via `_call_iter_tuple_list`.
This makes it difficult to customize the behavior of `iter` for different
variable types. Instead, implement `iter` via a polyfill, which will delegate
to the appropriate `__iter__` method.
While this method is more flexible, it increases the overhead of dynamo tracing.
For example, `iter(x)` will generate 9x more instructions than the current
implementation for common iterable types. Microbenchmarking shows a ~6x
slowdown for this operation. I suspect this would be much less for realistic
workloads, but more work would be needed to get specific numbers. If the
performance is a concern we could also consider adding a fast path for types
that are known to correctly implement `__iter__`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162768
Approved by: https://github.com/guilhermeleobas
ghstack dependencies: #166349
Summary:
- the ranks in the default pg config are local ranks
- however fr trace analysis requires them to be global ranks
- so we transform the local ranks to global ranks before the analysis kicks in based on a cli flag
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166149
Approved by: https://github.com/fduwjj
This reverts commit 3e77a2b478f596a8a0aef0af502f6bb1a247aa85.
Otherwise it fails ARM build with older compilers with errors that looks
as follows:
```
vec128_bfloat16_neon.h:666:12: error: operation not permitted on type ‘bfloat16_t’
666 | return (-x) * y - z;
```
For more self-contained example see https://godbolt.org/z/bbY4xWh45
(that compiles the same code using clang-15 and clang-19)
Summary:
- in torchft we have multiple default pg's, 1 for each task group
- for flight recorder to work, each of these need to have a different name, so entries can be matched
- change the `init_process_group` api to optionally take a list of ranks. if provided, we use the hash of the ranks as the name of the pg. for torchft, we'll pass global ranks here so the default pg have a different name on each task group
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166182
Approved by: https://github.com/fduwjj
Summary:
This PR introduces an inductor-level fallback mechanism that gives users control over which operations or subgraphs Inductor should lower and which should fall back to preexisting kernels. This has similar motivation as #164776 in providing flexibility to selectively disable Inductor lowering for specific nodes.
The implementation simply adds a check for the `"should_fallback"` metadata annotation on FX graph nodes. If this is set to `True`, the lowerer falls back before attempting the normal lowering path. Note that since these are user-directed fallbacks dependent upon specific, customized conditions, use `add_to_fallback_set=False` to avoid permanent overwrites of inductor's lowering/fallback rules.
Simple example marking nodes for fallback based on custom predicates:
```
def should_fallback_predicate(node: torch.fx.Node, pred: Callable[torch.fx.Node, bool]):
# Apply predicate and mark for fallback if needed
if self.predicate(node):
node.meta["should_fallback"] = True
```
Test Plan: added a CI test
Differential Revision: D85347587
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166339
Approved by: https://github.com/blaine-rister, https://github.com/eellison
1. Add `extension_build_version` and `is_internal` to `FromImpl`/`ToImpl` (this will be useful for future if we need to break the BC of any type) #163832 has the PoC of how we would actually use this system
2. Add `aoti_torch_library_impl_v2` that takes in an additional `extension_build_version` argument, updates callsite in `torch/csrc/stable/library.h` to always pass `TORCH_ABI_VERSION` for this argument
3. Add `extension_build_version` to `from_ivalue` and `to_ivalue` and update all callsites
4. Add a private `_from` and `_to` that pass `is_internal=True` to `FromImpl`/`ToImpl`, making it easier to reason about what is being called from libtorch-land / extension-land
**Note: This PR does not include a linter that tells the user to update from/to if changing the ABI of a type in headeronly, which I intend to do in https://github.com/pytorch/pytorch/pull/163998**
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164332
Approved by: https://github.com/janeyx99
ghstack dependencies: #164356, #166373, #163683
Move `from_ivalue` and `to_ivalue` and their dependents `StableIValueBoxedKernel`, `aoti_torch_library_impl` `aoti_torch_call_dispatcher` into new (non-aoti shim_common.cpp)
This is in prep for the above PRs where I add v2s (`torch_call_dispatcher` and `torch_library_impl`) that are versioning aware
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166373
Approved by: https://github.com/janeyx99
ghstack dependencies: #164356
As per title.
It seems safe to be able to generalize to arbitrary contiguous inputs since `at::matmul` is likely to do the flattening to avoid `baddmm`.
Additionally, we guard for bias to be 1D and contiguous which is guaranteed to be fused with no copies.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166071
Approved by: https://github.com/ngimel
Summary:
When dealing with a large memory trace, the resulting plot can be challenging to interpret and analyze.
This commit introduces a feature that enables filtering of allocations that have already been freed, providing a more focused view.
The remaining events in the plot often warrant closer examination, as they may be indicative of potential out-of-memory (OOM) issues.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165752
Approved by: https://github.com/zdevito
The goal of this PR is to provide a standard way to create simple treespec instances and hide the implementation details of the `PyTreeSpec` class.
Changes:
1. Add function `treespec_leaf()` to replace `LeafSpec()`.
2. Add function `treespec_tuple(...)` and `treespec_dict(...)` to create treespec for `tuple` / `dict` which is used for `*args` / `**kwargs`. This avoids direct modification to `treespec` instances that rely on the implementation details of the `PyTreeSpec` class.
3. Change `len(spec.children_specs)` to `spec.num_children`.
4. Change `isinstance(spec, LeafSpec)` to `spec.is_leaf()`.
------
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160843
Approved by: https://github.com/mlazos
Fixes#100842
Disable jiterator for complex tan and tanh kernels due to accuracy issues, matching the existing approach used for acos, acosh, asin, and asinh. Reverts to thrust implementation which provides better numerical accuracy.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165250
Approved by: https://github.com/ezyang
Fixes#161366
All the 4 types of dimension matrix are supported.
2d-2d, 2d-3d, 3d-3d, 3d-2d. The corresponding test cases in test_matmul_cuda are working
for both forward and backward pass.
The CK path is enabled for gfx942, gfx950.
ToDo: Need to enable support on gfx90a since the ck kernel used in this commit produces gpu error,
might require a different CK kernel config, based on the profiler result on gfx90a.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166334
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony
# Context
Previously, we would modify the parent process's NUMA bindings in order to force child process to inherit them.
However, this would not work correctly if `start_method="forkserver"`, because the subprocesses would actually inherit their bindings from the forkserver middleman process. In this case, the inherited affinity would actually be incorrect for all but the first subprocess (because the forkserver process would get created lazily, and hence inherit and then stick with the bindings intended for the first subprocess).
# This PR
* `str` entrypoints: Use `numactl` CLI
* `Callable` entrypoints: Wrap the `Callable` entrypoint and call `os.sched_setaffinity` inside it.
Hopefully this will be the last necessary iteration.
# Test Plan
## Automated
`$ pytest test/test_numa_binding.py`
## Manual
Verified flops/sec and memory locality wins on several different types of jobs
* `Callable` with forkserver
* `str` entrypoint with spawn
* `Callable` entrypoint with spawn
More details in [this doc (Meta-only).](https://docs.google.com/document/d/1vxD-OKYBTT27jbBwtW9iz9g0tNM0u-i0tiTJg_ieQA8/edit?tab=t.scjv58yswi64)
# Later PR
Update all the documentation when we're confident this has stabilized.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166026
Approved by: https://github.com/d4l3k
Co-authored-by: PyTorch MergeBot <pytorchmergebot@users.noreply.github.com>
# Motivation
In https://github.com/pytorch/pytorch/pull/145591, `std::hardware_destructive_interference_size` was introduced in CUDACachingAllocator. Later, https://github.com/pytorch/pytorch/pull/160067 moved it to `c10/core/alignment.h` for code reuse.
However, on **GCC 13+** using `std::hardware_destructive_interference_size` triggers the following warning:
```bash
warning: use of ‘std::hardware_destructive_interference_size’ [-Winterference-size]
/home/pt-gpu/4T-4652/guangyey/stock-pytorch/aten/src/ATen/core/CachingHostAllocator.h:42:16: note: its value can vary between compiler versions or with different ‘-mtune’ or ‘-mcpu’ flags
/home/pt-gpu/4T-4652/guangyey/stock-pytorch/aten/src/ATen/core/CachingHostAllocator.h:42:16: note: if this use is part of a public ABI, change it to instead use a constant variable you define
/home/pt-gpu/4T-4652/guangyey/stock-pytorch/aten/src/ATen/core/CachingHostAllocator.h:42:16: note: the default value for the current CPU tuning is 64 bytes
/home/pt-gpu/4T-4652/guangyey/stock-pytorch/aten/src/ATen/core/CachingHostAllocator.h:42:16: note: you can stabilize this value with ‘--param hardware_destructive_interference_size=64’, or disable this warning with ‘-Wno-interference-size’
```
# Solution
- Solution 1: Replace `c10::hardware_destructive_interference_size` with a constant 64.
```cpp
constexpr std::size_t hardware_destructive_interference_size = 64;
```
- Solution 2: adding `-Wno-interference-size’ to 8d4e48831e/cmake/public/utils.cmake (L386) to suppress the warning.
# Additional Context
The current implementation uses the second approach. If the reviewers prefer the first approach, I am happy to update it accordingly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166297
Approved by: https://github.com/ezyang
It is recommended to use `python -m pip install --no-build-isolation .` instead of `pip3 install --no-build-isolation .` because most of us use a virtual environment, and the latter probably relies on the system `pip3` rather than the conda or uv. We need to make it consistent with the Python we use, and it is also consistent with how `torch` is installed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166235
Approved by: https://github.com/fffrog, https://github.com/ezyang
when the slice is tensor, we decompose it to .item() call and pass the unbacked symbol to the slice to avoid DDE.
the diff also fix an existing bug in codegen_dynamic_slice_size in the cpp wrapper. a +1 should be -1 making it match
python codegen.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165074
Approved by: https://github.com/Lucaskabela
- Remove all complex defines logic from the header
- Make GreenContext constructor private, as it should only be created via the static method as singleton
- Delete unused `getContext` and `getGreenContext` methods
- Rename `CUDA_HAS_GREEN_CONTEXT` to `HAS_CUDA_GREEN_CONTEXT()`, which results in compilation error if one accidentally makes a typo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166462
Approved by: https://github.com/ngimel, https://github.com/eqy
Summary:
Previously we didn't correctly handle closure tuple when there's content in it. Adding additional code for serializing the tuple and merge it with guard manager local scope.
Test Plan:
pytest test/dynamo/test_aot_compile.py
Reviewers:
Subscribers:
Tasks:
Tags:
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166351
Approved by: https://github.com/Lucaskabela
This pull request introduces a standardized YAML-based configuration system for transformer attention benchmarks, making it easier to run and manage comprehensive performance tests. It adds example configs, and a wrapper script to convert YAML configs into CLI arguments for the benchmark runner.
#### Next Steps:
CI Enablement: This change would further lead to running the attention ops in CI for regression tracking.
#### Developer flow: (Run locally)
`python score_mod.py --config configs/config_test.yaml`
#### Enabling CI run: https://github.com/pytorch/pytorch/pull/165915
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164155
Approved by: https://github.com/jbschlosser
The global pytree registration of `BlockMask` was added in https://github.com/pytorch/pytorch/pull/166045
In general ppl assume `BlockMask` is a leaf, so the global registration could lead to some unexpected failure when calling `tree_map()` on a `BlockMask` since now it will flatten all the way down.
Therefore, we remove the global registration but keep the `_flatten()` and `_unflatten()` classmethod. Users could do a local registration easily when it is needed.
in pytorch
```
python test/distributed/tensor/test_dtensor_export.py -k test_flex_attention_dtensor_export
```
in torchtitan
```
python -m tests.integration_tests.run_tests ./outputs --test_suite features --ngpu 8
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166434
Approved by: https://github.com/wwwjn
Fixes#165810
If we regenerate a node during functionalization, we override the "stack_trace", "custom", and "seq_nr" metadata of the regenerated node with the node meta of the original node.
```
python test/functorch/test_aot_joint_with_descriptors.py -k test_preserve_annotate_replay_view
python test/functorch/test_aotdispatch.py TestAOTAutogradWithDynamo.test_duplicated_arguments_on_tensor_overlap
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166200
Approved by: https://github.com/bdhirsh
**Summary**
This implements the backward pass for the Varlen API and registers `_varlen_attn()` as a custom op.
**Benchmarking**
To benchmark, we compare runtime and TFLOPs against the current SDPA approach with padding.
Settings:
- 1 H100 machine
- `batch_size=8`, `max_seq_len=2048`, `embed_dim=1024`, `num_heads=16`
- dtype `torch.bfloat16`
- `is_causal=False`
- for variable length, we set sequences to be random multiples of 64 up to `max_seq_len`
- 100 runs
| | Variable Length API | SDPA |
|--------|--------------------|----------|
| Runtime | 0.8189142608642578 ms | 3.263883056640625 ms |
| TFLOPs | 268.652 | 158.731 |
We can see that runtime for Varlen is >3x faster
**Testing**
Run `python test/test_varlen_attention.py` for unit tests where we verify basic functionality and confirm numerical match between varlen gradients vs SDPA.
For custom op testing, `test_custom_op_registration` uses logging mode to verify that `_varlen_attn()` was called and tests with `torch.compile`. `test_custom_op_compliances` uses `torch.library.opcheck()` to verify.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164504
Approved by: https://github.com/drisspg
Moving symbolic shape guards to C++ causes compile time issues. This basically boils down to a tradeoff question.
For models that have large amount of dynamic shape guards, this flag will help reduce guard latency. But for most of the models, that have a very few dynamic shape guards, the guard lantecy is anyways small. These models will still see a high compile time hit because of calling gcc during the compile.
So a good default value seems to be False. We can write a doc to give guidance on reducing guard latency.
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166427
Approved by: https://github.com/zou3519
Fixes#165944
**Summary**
Today, if we attempt to run flash attention with batch_size 0, we get error `Runtime Error: batch size must be positive`. This PR fixes this by returning early with empty tensors in the fwd and bwd.
**Test plan**
`python test/test_transformers.py -k test_scaled_dot_product_attention` - added case for batch_size=0
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166318
Approved by: https://github.com/drisspg
Fixes#159139
## The Cause
The bug occurs because the OptimizedModule wrapper in torch._dynamo.eval_frame doesn't call the len method. This causes Python's bool() check to fall back to the default object truthiness (always True) instead of correctly evaluating containers with len() == 0 as False.
## The Fix
A very easy fix . I just added the len method to OptimizedModule in torch._dynamo.eval_frame class to delegate the call to the original module
```python
def __len__(self):
"""
Proxy the len() call to the original module to fix truthiness checks.
"""
return len(self._orig_mod)
```
This successfully fixes the issue . The script now works as expected.
## Reproduction Script
```python
import torch
import torch.nn as nn
# Create an empty nn.ModuleList
original = nn.ModuleList()
# Compile it using torch.compile
compiled = torch.compile(original)
# Compare their boolean evaluations
print(f"bool(original): {bool(original)}")
print(f"bool(compiled): {bool(compiled)}")
# Trigger failure if they differ
assert bool(original) == bool(compiled), "BUG: truthiness behavior mismatch after compilation"
```
## Output
bool(original): False
bool(compiled): False
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159208
Approved by: https://github.com/andrewboldi, https://github.com/Lucaskabela
Co-authored-by: pushkar-hue <pushkarsharma.rtm@gmail.com>
Co-authored-by: Lucas Kabela <lucasakabela@gmail.com>
I accepted a PR that added this code, but re-examining it now, I'm questioning the approach. It seems like we're working around an issue with the inductor generating incorrect sizes. A comment suggests it might be related to unsqueezed u0 values. Removing this code didn't cause any failures, so I'll take it out and address the root issue if it arises.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166278
Approved by: https://github.com/Lucaskabela
Summary: To fix https://github.com/pytorch/pytorch/issues/159400. Currently, test_aoti_abi_check and test_aoti_inference need to be built in two passes, first build pytorch using the regular `pythonsetup.py develop` and then build with `CMAKE_FRESH=1 BUILD_AOT_INDUCTOR_TEST=1 python setup.py devleop`. This is cumbersome. Fix by rewriting CMakeLists.txt for test_aoti_inference to one-pass build which runs AOTI to compile models at the test time. Also update CI test script to get rid of two-pass build. For test_aoti_abi_check, it is not AOTI specific, so we make it not guarded by BUILD_AOT_INDUCTOR_TEST.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164277
Approved by: https://github.com/janeyx99
Pull replace_node function out of Scheduler.finalize_multi_template_buffers(). This is needed by the next PR (#163369). As part of this also pull the _replace_operation_buffer() up to top-level since it needed no self references.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163368
Approved by: https://github.com/PaulZhang12
### Summary
Improves validation of `torch.linalg.eig` results by verifying the eigen decomposition identity **A v − v λ = 0**.
### Motivation
Eigenvectors are not unique, and numerical differences between backends (cuSOLVER, MAGMA, CPU)
can cause false test failures. This PR replaces direct elementwise comparisons with a mathematical
identity check, improving robustness across devices.
### Details
- Introduces `fulfills_eigen_decomposition_identity()` in `test_eig_compare_backends()` to validate the eigen equation.
- Uses CPU matmul for high-precision verification.
- Handles zero-sized matrices explicitly.
- Tolerances derived from numerical comparisons between cuSOLVER and NumPy.
See discussion: [dev-discuss.pytorch.org link](https://dev-discuss.pytorch.org/t/cusolver-dnxgeev-faster-cuda-eigenvalue-calculations/3248/6)
### Impact
- Improves test stability and correctness across eig backends.
- No change to public API.
- All tests pass; lintrunner reports no issues.
- Enables introduction of new eig backends without false test failures.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166322
Approved by: https://github.com/lezcano
This PR refactors the autocast context manager in autocast_mode.py to simplify and centralize the logic for checking supported dtypes for each device. The previous implementation repeated similar checks for multiple device types. Now, a single mapping device_supported_dtypes is used to associate device types with their supported dtypes, and the validation logic is unified.
**The former PR #163446 was merged but reverted due to failed CI test on `openreg` related tests.**
This RR additionally slightly modified some test assertions for passing the CI tests. CI failed due to assertion for the exactly same error message. For example:
```
File "/var/lib/jenkins/workspace/test/cpp_extensions/open_registration_extension/torch_openreg/tests/test_autocast.py", line 9, in test_autocast_with_unsupported_type
with self.assertWarnsRegex(
AssertionError: "In openreg autocast, but the target dtype torch.float32 is not supported." does not match "In openreg autocast, but the target dtype is not supported. Disabling autocast."
```
Sorry for the inconvenience again.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165221
Approved by: https://github.com/albanD
RMS/Layer norm backward would generated 2 kind of reductions:
- the reduction computing dx which reduce across the hidden dimension (in the context of transformer)
- the reduction computing dw/db which reduce across the BxT (batch size , sequence length) dimension.
These 2 set of reductions have common input buffers but inductor can not fuse them because of different loop orders.
There are multiple sources of custom kernels that implement fused version of such kernel (Liger-Kernel, quack, Paul Zhang's internal post). This PR enable Inductor to generate such kernels automatically.
The generated kernel is very similar to 33924d20b6/src/liger_kernel/ops/rms_norm.py (L114) .
To make the implementation simple and performing, we enable such fusion only if the inner reduction (computing dx) is a persistent reduction. This should be true for representative inputs. Persistent reduction is critical for the perf here to make sure a loaded tensor does not need to be reload.
To make sure the inner reduction (computing dx) and outer reductions (computing dw/db) being fusible, the PR does the following:
1. convert the outer reductions to pointwise by replacing 'reduction' & 'store_reduction' node with a new type of node 'parital_accumulate'. The new node will collect the reduction type, buffer name, input of reduction etc, which is essential for proper codegening.
2. do loop reordering (rely on the earlier loop ordering after fusion work) to reorder the loops of the converted pointwise so it can be fused with the inner reduction
3. there can be epilogues that need to be added in the end. E.g. the outer reduction may be followed by a division for mean , or followed by a down cast if dw/db is in low precision (fp16/bf16).
Some early benchmarking on H100 shows about 2X speedup for both RMSNorm and LayerNorm backward for shape (1152 * 500, 384 ) used in some internal model. Note that, I manually disable split reduction in this benchmarking since otherwise the fusion will be skipped right now. The next PR will make the mix-order-reduction compose better with split reduction
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165370
Approved by: https://github.com/jansel
ghstack dependencies: #166204
Summary: In this diff, we add a event pool so that we dont have to create/destroy events all the time, instead re-use the events from the pool.
Test Plan: contbuild
Differential Revision: D84685495
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165647
Approved by: https://github.com/bbus
Fixes#166068
Use the ROCm version string that does not contain a hash. The string is set in LoadHIP.cmake.
Tested on repro provided by reporter.
For a ROCm 7.0 docker container, we get `7.0.0`.
For a ROCm 7.0.2 docker container, we get `7.0.2`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166336
Approved by: https://github.com/jeffdaily
Summary:
Support deepseek-style scaling in Inductor Triton for FP8 GEMMs. DeepSeek-style scaling is a colloquial term for a fine-grained mixed precision framework using FP8 to train [Deepseek-V3](https://arxiv.org/pdf/2412.19437), DeepSeek AI's recent MoE (Mixture of Experts) model. DeepSeek-style scaling effectively extends the dynamic range of FP8 by mitigating dequantization overhead under increased-precision accumulation, which is key to achieving more accurate FP8 GEMM results.
DeepSeek-style scaling on matmul `A @ B` leverages two different types of scaling strategies to preserve a balance between numerical stability and training efficiency:
- Activations (input tensor `A`): tile-wise (1x128 across shape `(M, K)`)
- Weights (input tensor `B`): block-wise (128x128 across shape `(N, K)`)
This diff enables Inductor users to replicate past successes with deepseek-style scaling and achieve higher numerical stability while increasing training efficiency.
NOTE: Block-wise 128x128 scaling is only supported in CUDA 12.9+; therefore, deepseek-style scaling is currently unsupported in `fbcode` (CUDA 12.4). Use OSS PyTorch to run deepseek-style scaling.
NOTE: Accuracy for FP8 is unstable, even with high tolerances, which is why TritonBench benchmarks are unlikely to be accurate against a `torch` implementation.
Test Plan:
In OSS PyTorch, run
```
TORCHINDUCTOR_CACHE_DIR=~/personal/cache_dir_inductor CUDA_LAUNCH_BLOCKING=1 TORCH_USE_CUDA_DSA=1 TRITON_PRINT_AUTOTUNING=1 TRITON_ALWAYS_COMPILE=1 TORCH_LOGS=+inductor TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 ENABLE_PERSISTENT_TMA_MATMUL=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM=1 python run.py --op fp8_gemm --only torch_fp8_gemm,pt2_fp8_gemm --metrics tflops,accuracy --m 4096 --n 768 --k 512 --output="{output_dir}/deepseek_bench.csv" --scaling_deepseek --atol=1e-2 --rtol=0.5 2>&1 | tee ~/personal/deepseek_style/deepseek_bench.log
```
Differential Revision: D83609850
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164404
Approved by: https://github.com/slayton58
Previously, we would completely skip building and calling any resume function if the leaf frame's resume instruction was RETURN_VALUE/RETURN_CONST. Now, we only skip building/calling resume functions for frames that are resuming on RETURN_VALUE.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165808
Approved by: https://github.com/Lucaskabela
ghstack dependencies: #166013, #166015
This `patch.dict(counters, ...` appears to be ancient code that doesn't really seem to be doing anything? It causes issues in nested graph breaks because the patch cleanup clears out the record of the nested graph break. Removing the patch to see if it's even needed in the first place.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166015
Approved by: https://github.com/Lucaskabela
ghstack dependencies: #166013
Summary:
as title
- Some IR nodes are created during `finalize_multi_template_buffers()` in Scheduler. This PR adds provenance (`origin_node` and `origins`) for those nodes.
- Extract `assign_origin_node` function
Test Plan:
```
buck run mode/opt fbcode//caffe2/test/inductor:provenance_tracing -- -r test_deferred_triton_kernels
```
Differential Revision: D83979975
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164746
Approved by: https://github.com/mlazos
Summary:
Add a gflag to allow us skip empty constant named parameter during
dense loading. In [vm_parameters.py](https://fburl.com/code/7xr9ihwy), there is
a constant _empty_tensor parameter used for the model. This constant parameter
is skipped in XL weights during model publish because it is empty. This will
break model inplace update later because it will be reported by the AOTI
container but cannot be found from the model merge weights. This diff will
allow us to solve the problem.
Test Plan: Verified inplace update in job https://www.internalfb.com/vanguard/serving_test_cases/1165842932095688
Reviewed By: muchulee8, joannec3634
Differential Revision: D85082330
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166037
Approved by: https://github.com/muchulee8, https://github.com/jcwchen
Namely, error out rather than crash when out dtype is of an unexpected type
Resize output tensor to the expected size in `_out` operation, to prevent crash when tensor of an unexpected size is passed.
Preserve symbolic shapes whenever possible
Test plan: Run `python test_ops.py -v -k test_out_warning_fft_hfft_mps` for MPS device, without this change it crashes with `Error: Invalid KernelDAG, equalShape for destination failed'`, run `python ../test/test_ops.py -v -k test_dtypes_stft_mps`, without this change it crashes with `A complex mlir::Type does not have a corresponding complex MPSDataType"`, when input dtype is bfloat16
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166272
Approved by: https://github.com/kulinseth
Fixes#163929
Fixes argmin/argmax operations to return correct logical indices instead of physical memory offsets when applied to transposed/permuted tensors. When `argmin()` or `argmax()` is called on a transposed tensor, Inductor was returning physical memory indices instead of logical row-major indices. This caused incorrect results that don't match eager mode behavior.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165983
Approved by: https://github.com/shunting314
\# why
- enable users to control which choices get used on which inputs
- reduce lowering time, and pin kernel selection, by selecting
them for the inputs
\# what
- a new InductorChoices subclass that implements a lookup table
- a README explaining the usage
- corresponding testing
- currently only supports templates that go through
`V.choices.get_template_configs`
\# testing
```
python3 -bb -m pytest test/inductor/test_lookup_table.py -v
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164978
Approved by: https://github.com/PaulZhang12, https://github.com/eellison
Summary: Fallback kernels are created with flattened constant args and an `unflatten` utility to unflatten them when needed. Apply it in FXConverter to preserve the original structure
Test Plan: added new CI tests
Differential Revision: D85347589
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166144
Approved by: https://github.com/blaine-rister
- TheRock build system for ROCm builds OpenBLAS from source and uses a custom name for the library.
- Following existing conventions in `FindOpenBLAS.cmake` to support finding a custom named version of OpenBLAS.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166333
Approved by: https://github.com/jeffdaily
description: Add unsigned integer (uint) type support to PyTorch operators by updating AT_DISPATCH macros. Use when adding support for uint16, uint32, uint64 types to operators, kernels, or when user mentions enabling unsigned types, barebones unsigned types, or uint support.
---
# Add Unsigned Integer (uint) Support to Operators
This skill helps add support for unsigned integer types (uint16, uint32, uint64) to PyTorch operators by updating their AT_DISPATCH macros.
## When to use this skill
Use this skill when:
- Adding uint16, uint32, or uint64 support to an operator
- User mentions "unsigned types", "uint support", "barebones unsigned types"
- Enabling support for kUInt16, kUInt32, kUInt64 in kernels
- Working with operator implementations that need expanded type coverage
## Quick reference
**Add unsigned types to existing dispatch:**
```cpp
// Before
AT_DISPATCH_V2(dtype,"op",AT_WRAP([&](){
kernel<scalar_t>();
}),AT_EXPAND(AT_ALL_TYPES));
// After (method 1: add unsigned types explicitly)
description: Convert PyTorch AT_DISPATCH macros to AT_DISPATCH_V2 format in ATen C++ code. Use when porting AT_DISPATCH_ALL_TYPES_AND*, AT_DISPATCH_FLOATING_TYPES*, or other dispatch macros to the new v2 API. For ATen kernel files, CUDA kernels, and native operator implementations.
---
# AT_DISPATCH to AT_DISPATCH_V2 Converter
This skill helps convert PyTorch's legacy AT_DISPATCH macros to the new AT_DISPATCH_V2 format, as defined in `aten/src/ATen/Dispatch_v2.h`.
## When to use this skill
Use this skill when:
- Converting AT_DISPATCH_* macros to AT_DISPATCH_V2
- Porting ATen kernels to use the new dispatch API
- Working with files in `aten/src/ATen/native/` that use dispatch macros
- User mentions "AT_DISPATCH", "dispatch v2", "Dispatch_v2.h", or macro conversion
description: Write docstrings for PyTorch functions and methods following PyTorch conventions. Use when writing or updating docstrings in PyTorch code.
---
# PyTorch Docstring Writing Guide
This skill describes how to write docstrings for functions and methods in the PyTorch project, following the conventions in `torch/_tensor_docs.py` and `torch/nn/functional.py`.
description: Guide users through creating Agent Skills for Claude Code. Use when the user wants to create, write, author, or design a new Skill, or needs help with SKILL.md files, frontmatter, or skill structure.
---
# Skill Writer
This Skill helps you create well-structured Agent Skills for Claude Code that follow best practices and validation requirements.
## When to use this Skill
Use this Skill when:
- Creating a new Agent Skill
- Writing or updating SKILL.md files
- Designing skill structure and frontmatter
- Troubleshooting skill discovery issues
- Converting existing prompts or workflows into Skills
## Instructions
### Step 1: Determine Skill scope
First, understand what the Skill should do:
1.**Ask clarifying questions**:
- What specific capability should this Skill provide?
- When should Claude use this Skill?
- What tools or resources does it need?
- Is this for personal use or team sharing?
2.**Keep it focused**: One Skill = one capability
- Good: "PDF form filling", "Excel data analysis"
- Too broad: "Document processing", "Data tools"
### Step 2: Choose Skill location
Determine where to create the Skill:
**Personal Skills** (`~/.claude/skills/`):
- Individual workflows and preferences
- Experimental Skills
- Personal productivity tools
**Project Skills** (`.claude/skills/`):
- Team workflows and conventions
- Project-specific expertise
- Shared utilities (committed to git)
### Step 3: Create Skill structure
Create the directory and files:
```bash
# Personal
mkdir -p ~/.claude/skills/skill-name
# Project
mkdir -p .claude/skills/skill-name
```
For multi-file Skills:
```
skill-name/
├── SKILL.md (required)
├── reference.md (optional)
├── examples.md (optional)
├── scripts/
│ └── helper.py (optional)
└── templates/
└── template.txt (optional)
```
### Step 4: Write SKILL.md frontmatter
Create YAML frontmatter with required fields:
```yaml
---
name:skill-name
description:Brief description of what this does and when to use it
The description is critical for Claude to discover your Skill.
**Formula**: `[What it does] + [When to use it] + [Key triggers]`
**Examples**:
✅ **Good**:
```yaml
description: Extract text and tables from PDF files, fill forms, merge documents. Use when working with PDF files or when the user mentions PDFs, forms, or document extraction.
```
✅ **Good**:
```yaml
description: Analyze Excel spreadsheets, create pivot tables, and generate charts. Use when working with Excel files, spreadsheets, or analyzing tabular data in .xlsx format.
```
❌ **Too vague**:
```yaml
description: Helps with documents
description: For data analysis
```
**Tips**:
- Include specific file extensions (.pdf, .xlsx, .json)
- Mention common user phrases ("analyze", "extract", "generate")
- List concrete operations (not generic verbs)
- Add context clues ("Use when...", "For...")
### Step 6: Structure the Skill content
Use clear Markdown sections:
```markdown
# Skill Name
Brief overview of what this Skill does.
## Quick start
Provide a simple example to get started immediately.
## Instructions
Step-by-step guidance for Claude:
1. First step with clear action
2. Second step with expected outcome
3. Handle edge cases
## Examples
Show concrete usage examples with code or commands.
## Best practices
- Key conventions to follow
- Common pitfalls to avoid
- When to use vs. not use
## Requirements
List any dependencies or prerequisites:
```bash
pip install package-name
```
## Advanced usage
For complex scenarios, see [reference.md](reference.md).
```
### Step 7: Add supporting files (optional)
Create additional files for progressive disclosure:
**reference.md**: Detailed API docs, advanced options
**examples.md**: Extended examples and use cases
**scripts/**: Helper scripts and utilities
**templates/**: File templates or boilerplate
Reference them from SKILL.md:
```markdown
For advanced usage, see [reference.md](reference.md).
Run the helper script:
\`\`\`bash
python scripts/helper.py input.txt
\`\`\`
```
### Step 8: Validate the Skill
Check these requirements:
✅ **File structure**:
- [ ] SKILL.md exists in correct location
- [ ] Directory name matches frontmatter `name`
✅ **YAML frontmatter**:
- [ ] Opening `---` on line 1
- [ ] Closing `---` before content
- [ ] Valid YAML (no tabs, correct indentation)
- [ ] `name` follows naming rules
- [ ] `description` is specific and < 1024 chars
✅ **Content quality**:
- [ ] Clear instructions for Claude
- [ ] Concrete examples provided
- [ ] Edge cases handled
- [ ] Dependencies listed (if any)
✅ **Testing**:
- [ ] Description matches user questions
- [ ] Skill activates on relevant queries
- [ ] Instructions are clear and actionable
### Step 9: Test the Skill
1. **Restart Claude Code** (if running) to load the Skill
2. **Ask relevant questions** that match the description:
```
Can you help me extract text from this PDF?
```
3. **Verify activation**: Claude should use the Skill automatically
4. **Check behavior**: Confirm Claude follows the instructions correctly
### Step 10: Debug if needed
If Claude doesn't use the Skill:
1. **Make description more specific**:
- Add trigger words
- Include file types
- Mention common user phrases
2. **Check file location**:
```bash
ls ~/.claude/skills/skill-name/SKILL.md
ls .claude/skills/skill-name/SKILL.md
```
3. **Validate YAML**:
```bash
cat SKILL.md | head -n 10
```
4. **Run debug mode**:
```bash
claude --debug
```
## Common patterns
### Read-only Skill
```yaml
---
name: code-reader
description: Read and analyze code without making changes. Use for code review, understanding codebases, or documentation.
allowed-tools: Read, Grep, Glob
---
```
### Script-based Skill
```yaml
---
name:data-processor
description:Process CSV and JSON data files with Python scripts. Use when analyzing data files or transforming datasets.
@ -67,23 +66,6 @@ aspects of contributing to PyTorch.
Follow the instructions for [installing PyTorch from source](https://github.com/pytorch/pytorch#from-source). If you get stuck when developing PyTorch on your machine, check out the [tips and debugging](#tips-and-debugging) section below for common solutions.
### Setup the development environment
First, you need to [fork the PyTorch project on GitHub](https://github.com/pytorch/pytorch/fork) and follow the instructions at [Connecting to GitHub with SSH](https://docs.github.com/en/authentication/connecting-to-github-with-ssh) to setup your SSH authentication credentials.
Then clone the PyTorch project and setup the development environment:
Beware that none of the topics under [Using Pytorch Securely](#using-pytorch-securely) are considered vulnerabilities of Pytorch.
Beware that none of the topics under [Using PyTorch Securely](#using-pytorch-securely) are considered vulnerabilities of PyTorch.
However, if you believe you have found a security vulnerability in PyTorch, we encourage you to let us know right away. We will investigate all legitimate reports and do our best to quickly fix the problem.
Please report security issues using https://github.com/pytorch/pytorch/security/advisories/new
All reports submitted thru the security advisories mechanism would **either be made public or dismissed by the team within 90 days of the submission**. If advisory has been closed on the grounds that it is not a security issue, please do not hesitate to create an [new issue](https://github.com/pytorch/pytorch/issues/new?template=bug-report.yml) as it is still likely a valid issue within the framework.
All reports submitted through the security advisories mechanism would **either be made public or dismissed by the team within 90 days of the submission**. If advisory has been closed on the grounds that it is not a security issue, please do not hesitate to create an [new issue](https://github.com/pytorch/pytorch/issues/new?template=bug-report.yml) as it is still likely a valid issue within the framework.
Please refer to the following page for our responsible disclosure policy, reward guidelines, and those things that should not be reported:
https://www.facebook.com/whitehat
## Using Pytorch Securely
**Pytorch models are programs**, so treat its security seriously -- running untrusted models is equivalent to running untrusted code. In general we recommend that model weights and the python code for the model are distributed independently. That said, be careful about where you get the python code from and who wrote it (preferentially check for a provenance or checksums, do not run any pip installed package).
## Using PyTorch Securely
**PyTorch models are programs**, so treat its security seriously -- running untrusted models is equivalent to running untrusted code. In general we recommend that model weights and the python code for the model are distributed independently. That said, be careful about where you get the python code from and who wrote it (preferentially check for a provenance or checksums, do not run any pip installed package).
### Untrusted models
Be careful when running untrusted models. This classification includes models created by unknown developers or utilizing data obtained from unknown sources[^data-poisoning-sources].
**Prefer to execute untrusted models within a secure, isolated environment such as a sandbox** (e.g., containers, virtual machines). This helps protect your system from potentially malicious code. You can find further details and instructions in [this page](https://developers.google.com/code-sandboxing).
**Be mindful of risky model formats**. Give preference to share and load weights with the appropriate format for your use case. [safetensors](https://huggingface.co/docs/safetensors/en/index) gives the most safety but is the most restricted in what it supports. [`torch.load`](https://pytorch.org/docs/stable/generated/torch.load.html#torch.load) has a significantly larger surface of attack but is more flexible in what it can serialize. See the documentation for more details.
**Be mindful of risky model formats**. Give preference to share and load weights with the appropriate format for your use case. [Safetensors](https://huggingface.co/docs/safetensors/en/index) gives the most safety but is the most restricted in what it supports. [`torch.load`](https://pytorch.org/docs/stable/generated/torch.load.html#torch.load) has a significantly larger surface of attack but is more flexible in what it can serialize. See the documentation for more details.
Even for more secure serialization formats, unexpected inputs to the downstream system can cause diverse security threats (e.g. denial of service, out of bound reads/writes) and thus we recommend extensive validation of any untrusted inputs.
@ -43,7 +43,7 @@ Important Note: The trustworthiness of a model is not binary. You must always de
### TorchScript models
TorchScript models should treated the same way as locally executable code from an unknown source. Only run TorchScript models if you trust the provider. Please note, that tools for introspecting TorchScript models (such as `torch.utils.model_dump`) may also execute partial or full code stored in those models, therefore they should be used only if you trust the provider of the binary you are about to load.
TorchScript models should be treated the same way as locally executable code from an unknown source. Only run TorchScript models if you trust the provider. Please note, that tools for introspecting TorchScript models (such as `torch.utils.model_dump`) may also execute partial or full code stored in those models, therefore they should be used only if you trust the provider of the binary you are about to load.
### Untrusted inputs during training and prediction
@ -59,9 +59,9 @@ If applicable, prepare your model against bad inputs and prompt injections. Some
### Data privacy
**Take special security measures if your model if you train models with sensitive data**. Prioritize [sandboxing](https://developers.google.com/code-sandboxing) your models and:
- Do not feed sensitive data to untrusted model (even if runs in a sandboxed environment)
- If you consider publishing a model that was partially trained with sensitive data, be aware that data can potentially be recovered from the trained weights (especially if model overfits).
**Take special security measures if you train your models with sensitive data**. Prioritize [sandboxing](https://developers.google.com/code-sandboxing) your models and:
- Do not feed sensitive data to an untrusted model (even if runs in a sandboxed environment)
- If you consider publishing a model that was partially trained with sensitive data, be aware that data can potentially be recovered from the trained weights (especially if the model overfits).
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.