Summary:
Previously, weight deduplication was done by simply grouping tensors with their untyped storage and saving the first tensor in the group.
A more rigorous approach would be to find a complete tensor that covers the storage and store that tensor. This is particularly important for GPU weights because when saving to raw bytes, we move the weight to CPU first, and if the weight being saved is not a complete one, it will lose the storage information during the copy to CPU.
In this diff, we reuse code in `_package_weights.py` for better weights and constants deduplication in `torch.export.save`.
Test Plan: buck2 run mode/dev-nosan caffe2/test:test_export -- -r test_weight_sharing_gpu
Differential Revision: D83523690
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164196
Approved by: https://github.com/angelayi
Summary: Relax absolute tolerance from 1e-2 to 1e-1 for `test_non_contiguous_input_mm_plus_mm` in `test_max_autotune.py`.
Test Plan: `test_max_autotune.py`
Differential Revision: D83391942
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164022
Approved by: https://github.com/eellison
Summary:
Fix decompose_k test failure (`test_max_autotune_decompose_k `) in `test_max_autotune.py` on B200s by setting `torch._inductor.config` patches for variables `comprehensive_padding` and `shape_padding`. Initial failure was `AssertionError: False is not true : Could not find a split in {3, 9, 2187, 81, 243, 729, 27} in # AOT ID: ['6_forward']`.
Refactor decompose_k test to follow patch semantics when setting all environment variables within a test.
Test Plan:
`test_max_autotune.py`:
```
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:max_autotune -c fbcode.nvcc_arch=b200a -c fbcode.enable_gpu_sections=true -c fbcode.platform010_cuda_version=12.8 -c fbcode.re_gpu_tests=False -- test_max_autotune_decompose_k
```
Differential Revision: D83390563
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164021
Approved by: https://github.com/njriasan, https://github.com/mlazos, https://github.com/eellison
Adds [control_deps](https://en.wikipedia.org/wiki/Control_dependency) higher-order operator to enforce explicit scheduling dependencies in FX graphs. This prevents unwanted operation reordering/fusion by giving nodes additional dependencies, which we also respect in inductor by adding weakdeps on the additional dependencies.
This can be generally useful (such as for ordering collectives) but in this case I am using it so that fusions do not interfere with aten planned comm-compute overlap.
There's definitely some similarity with the `with_effects` hop. Talked with @angelayi - when @zou3519 is back we will figure out how we want to consolidate.
The implementation needs to be a subgraph (as opposed to `with_effects`) because inductor relies on `V.graph.current_node`. Changing the signature of the node with `with_effects` breaks this, and additionally, also breaks striding constraints on the wrapped node - see this [TODO](aed66248a0/torch/fx/experimental/proxy_tensor.py (L1246-L1249)). By maintaining the node with its original calling structure in subgraph this all works.
Example transformation:
Before:
```
%add : [num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%arg0_1, 1), kwargs = {})
%mm : [num_users=1] = call_function[target=torch.ops.aten.mm.default](args = (%arg1_1, %arg1_1), kwargs = {})
%mul : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%add, 2), kwargs = {})
```
After:
```
add: "f32[256, 256]" = torch.ops.aten.add.Tensor(arg0_1, 1)
mm: "f32[256, 256]" = torch.ops.higher_order.control_deps((add,), subgraph_mm, arg1_1, arg1_1)
mul: "f32[256, 256]" = torch.ops.higher_order.control_deps((mm,), subgraph_mul, add)
```
The mm operation now explicitly depends on add completing first, and mul depends on mm, with original operations preserved in subgraphs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164568
Approved by: https://github.com/ezyang, https://github.com/IvanKobzarev
This PR applies clang-tidy readability checks to jit sources and all headers in the code base.
`readability-redundant-inline-specifier` is suppressed because it incurs too many changes. `readability-redundant-inline-specifier` is used to detect redundant inline specifiers on function and variable declarations. There are many in-class method definitions that are marked inline.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164652
Approved by: https://github.com/Skylion007
This PR applies clang-tidy readability checks to jit sources and all headers in the code base.
`readability-redundant-inline-specifier` is suppressed because it incurs too many changes. `readability-redundant-inline-specifier` is used to detect redundant inline specifiers on function and variable declarations. There are many in-class method definitions that are marked inline.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164652
Approved by: https://github.com/Skylion007
```bash
DEBUG /data/vllm-community-homes/vllm-user-6/pytorch/aten/src/ATen/cuda/CUDAGraph.h(59): warning #68-D: integer conversion resulted in a change of sign
DEBUG CaptureId_t capture_id_ = -1;
DEBUG ^
DEBUG
DEBUG Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
DEBUG
DEBUG /data/vllm-community-homes/vllm-user-6/pytorch/aten/src/ATen/cuda/CUDAGraph.h(59): warning #68-D: integer conversion resulted in a change of sign
DEBUG CaptureId_t capture_id_ = -1;
DEBUG ^
DEBUG
DEBUG Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
DEBUG
DEBUG /data/vllm-community-homes/vllm-user-6/pytorch/aten/src/ATen/cuda/CUDAGraph.h(59): warning #68-D: integer conversion resulted in a change of sign
DEBUG CaptureId_t capture_id_ = -1;
DEBUG ^
```
Cuda won't use 0 as a capture id, so it is safe to initialize with 0, which also matches the initialization in `pytorch/aten/src/ATen/native/cudnn/RNN.cpp:2362`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163898
Approved by: https://github.com/houseroad
Fixes#89034
Updated tensor_to_numpy() function in tensor_numpy.cpp to handle ZeroTensors by throwing an error if force=False and returning an array full of zeros if force=True.
@ngimel, I just saw that you mentioned PyTorch is not too concerned with this issue but I had already worked on it so I figured I would push it anyways and see what you thought. Feel free to close the PR if you think it is not worth merging.
@albanD
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164487
Approved by: https://github.com/ngimel, https://github.com/albanD
Fixes#162812
Adds support for either passing a device directly into get_rng_state, or passing in a string or int (which is then wrapped into a device inside, as in torch.cuda.get_rng_state).
I wasn't exactly sure where tests for this should go, please let me know. I used this script for testing:
```python
import torch
# note: when running with CUDA GPU, first three tests will give the same result,
# as will the last two
# test with no device specified
print(torch.get_rng_state())
# test with CPU
cpu_device = torch.device("cpu")
print(torch.get_rng_state(cpu_device))
# test with direct name
print(torch.get_rng_state("cpu"))
# test with CUDA
cuda_device = torch.device("cuda:0")
print(torch.get_rng_state(cuda_device))
# test with integer
print(torch.get_rng_state(0))
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163034
Approved by: https://github.com/ezyang, https://github.com/cyyever
# Propagate custom meta data to backward
Support propagating the user annotation tags to backward graph, by extending the `copy_fwd_metadata_to_bw_nodes` utils (recommended by @xmfan , thanks!).
Example annotation API (added in https://github.com/pytorch/pytorch/pull/163673):
```
class M(torch.nn.Module):
def forward(self, x):
with fx_traceback.annotate({"pp_stage": 0}):
with fx_traceback.annotate({"fdsp_bucket": 0}):
x = x + 1
x = x - 2
with fx_traceback.annotate({"cuda_stream": 2, "fsdp_bucket": 1}):
x = x * 2
x = x / 3
return x
```
Assumptions (some inherited from https://github.com/pytorch/pytorch/pull/126573):
- I am trusting the seq_nr mapping introduced to aot_autograd nodes in https://github.com/pytorch/pytorch/pull/103129
- I am also trusting that the forward is single threaded, since seq_nr is thread local. If this isn't always true, we'll need to also plumb thread_id through the same machinery which is populating seq_nr.
- **(This is changed in this PR!) I assume all backward graph nodes has "is_backward" for 'partitioner_tag', and all other nodes are forward graph nodes**. If we don't run export before `aot_export_join_with_descriptors`, then none of the nodes has "nn_module_stack" in node meta. If we do run export first, then we don't need this change.
- I copy "custom" node meta from forward to backward graph nodes.
Question:
- Is it a good idea to copy all "custom" node meta? Or should we create a dedicated key in custom node meta to be copied? @SherlockNoMad
- Do we expect people to run export before using `aot_export_join_with_descriptors`?
- Can we assume the following for graph produced by `aot_export_join_with_descriptors`? "all backward graph nodes has "is_backward" for 'partitioner_tag', and all other nodes are forward graph nodes". Maybe this is a question for @ezyang
```
python test/functorch/test_aot_joint_with_descriptors.py -k test_preserve_
python test/export/test_export.py -k preserve_anno
python test/distributed/tensor/test_dtensor_export.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164174
Approved by: https://github.com/xmfan, https://github.com/SherlockNoMad
Even though `offset_intragraph_` only tracks RNG consumption within a single graph replay, we have observed that the 32bit storage for these offsets is easy to overshoot, especially for cases with big CUDA graph captures including kernels that are generating a large amount of random numbers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164515
Approved by: https://github.com/eee4017, https://github.com/eqy
In reality we found the current mismatch order does not match the actual error distribution, so we reorder it a bit as following:
1. We do collective type check first
2. Then size check (excluding all2all)
3. dtype check
4. state check
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164606
Approved by: https://github.com/VieEeEw
Add a deterministic mode to skip the on device benchmarking that we know should affect numeric. This include
- pad-mm
- dynamic rblock scaling
- template autotuning
- coordinate descent tuning for reduction
- reduction config autotuning in CachingAutotuner. For reduction both RBLOCK, num_warps should affect numeric. XBLOCK does not. We can still autotune XBLOCK for reductions.
- benchmarking for computation communication reordering pass
The mode definitely has perf hit.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163589
Approved by: https://github.com/v0i0
Summary: Minor refactor where we push some args in the aot joint with descriptors workflow that are not used in export stage to the compile stage where they are actually used.
Test Plan: existing tests should pass
Differential Revision: D83850316
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164584
Approved by: https://github.com/tugsbayasgalan
1\ DTensor abstraction on its own, does not support arbitrary length shards in its distributed tensors representation. It supports a single uneven shard, bit it has to be the last shard in the sharding dimension.
2\ However, DCP supports an API called checkpointable. This API allows you to define your custom shardable tensor structure. I have given a UT example ( look for CheckpointableDistTensor). Therefore, one option is to use CheckpointableDistTensor to save/load uneven shards.
3\ While exploring this path, I also noticed that torch.rec module also encountered a similar problem while working with DTensor. They workaround it by implementing Checkpointable API in DTensor and introducing an auxillary structure called LocalShardsWrapper. This is the second option we can use to unblock data loader resharding work.
In summary;
Use LocalShardWrapper + DTensor as the first option to unblock.
Second preference is to use new implementation of Checkpointable API. ( similar to CheckpointbaleDistTensor I have introduced in this example).
Differential Revision: D80182564
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160533
Approved by: https://github.com/saumishr
This PR is to enable the following tests rocm.
test/test_jit.py::TestBackends::test_save_load
test/test_jit.py::TestBackends::test_execution
test/test_jit.py::TestBackends::test_errors
test/test_jit.py::TestCUDA::test_current_stream
Verified that the tests pass on AMD gfx90a and gfx942 arch.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164582
Approved by: https://github.com/jeffdaily
Increase the tolerance for the following UTs as there was a slight mismatch seen on MI200.
- test_data_parallel.py:test_strided_grad_layout
- test_c10d_nccl.py:test_grad_layout_1devicemodule_1replicaperprocess
Skip for MI200:
- test_fully_shard_training.py:test_2d_mlp_with_nd_mesh
- test_2d_composability.py:test_train_parity_2d_mlp
- test_fully_shard_overlap.py:test_fully_shard_training_overlap
Fixes#159489Fixes#159488Fixes#152700Fixes#125555Fixes#134139
Working as is on both MI200 and MI300:
Fixes#125991Fixes#125918
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164390
Approved by: https://github.com/jeffdaily
Summary: Added a set of fixes triggered by fm training job. Overall the theme here is that we should get rid of saved objects as much as possible when they are not used in guard reconstruction. Sometimes for objects that cannot be saved (like local functions) we still try our best to save their closures.
Test Plan:
test_guard_serialization.py
test_lazy_awatiable.py
Differential Revision: D83766926
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164490
Approved by: https://github.com/jamesjwu
Summary:
This diff adds the feature of allocating a large pinned memory segment upfront based on the provided config. This large segment is then used to serve all the small pinned memory requests to avoid expensive device level APIs (slow paths).
Example:
PYTORCH_CUDA_ALLOC_CONF=pinned_reserve_segment_size_mb:2048
This reserves a 2GB pinned memory segment for the process and then all incoming small requests are just served from this segment and no cudaHostAlloc/cudaHostRegister apis are being called.
Differential Revision: D83779074
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164501
Approved by: https://github.com/yangw-dev
So this fixes at least two issues:
1) When we are invoking inductor backend, we apply pre-grad passes which try to find correct fake mode to use. In the nested case, we will run into clash when there is closure variable in the inductor region because non-strict would have fakified this variable before hand and inner torch.compile would have created a new fresh fake mode. This is not a problem in regular torch.compile because inner torch.compile gets ignored. I don't know if we are supposed to inherit fake mode from parent context in this case. But we can avoid this problem if we just default to eager backend which is fine in this case because the point of export is to capture aten operators. Going to inductor would mean we will lose inner torch.compile ops.
2) There is custom torch function modes in export that track number of torch fns executed and inner compile itself doesn't work because of guard failure as this mode state gets changed. I noticed torch.cond fixes this problem by carefully stashing the torch function mode and defer it in the backend. So the correct thing to do here is just re-use torch.cond implementation unconditionally.
So the things i did for fixing above were:
1) Always default to eager backend when compile is invoked inside export. I needed to make how torch.cond sets up the fresh tracing env into an util that can be shared.
2) The previous eager backend for torch.cond was wrong because the context managers didn't actually persist until the backend is invoked.
3) torch.cond used only disable TorchFunctionMetadata tf mode and stash it for later, but in fact, we should do both TorchFunctionMetadata and PreDispatchTorchFunctionMode.
With above fixes, we are able to export flex attention in export.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164171
Approved by: https://github.com/ydwu4
This PR moves the call to copy the generated code from `/tmp/...` so that it is still called if attempting to compile the generated code fails. In both cases now, the generated code will be copied across to `torch_compile_debug/run_.../torchinductor/output_code.py` which makes debugging bad generated code easier.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161615
Approved by: https://github.com/eellison
This pull request adds support for running operator microbenchmarks on ROCm (AMD GPU) environments in the CI workflow. The main changes involve introducing new build and test jobs for ROCm in the `.github/workflows/operator_microbenchmark.yml` file.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164173
Approved by: https://github.com/huydhn
We want to refactor the internal bookkeeping of DeviceMesh so that:
Simply the bookkeeping logics and make it generic enough so that it is easy to support new transformations like flatten noncontiguous dim, reshape and unflatten. (We leveraged the CuTe layout). This new layout also let us handle non-contiguous slicing, flatten, transpose possible.
Concretely, in this PR, we do the following:
1. Use the `_MeshLayout` to handle all index operations rather use a map to record mesh dims.
2. Removed `flatten_name_to_root_dims`, because now we can directly get layout from a flattened device mesh.
3. Replaced `_get_slice_mesh_dims` with `_get_slice_mesh_layout`.
4. Use the newly added function `check_overlap` to check layout overlap.
5. Use a new function `to_remapping_tensor` to use layout ranks as indices when the mesh tensor is not representable as CuTe. The reason is that layout acts as a backend of mesh tensor bookkeeping (indexing indices), it needs to be used as indices for remap back to the mesh tensor for new DeviceMesh generation and backend init. For example, in the case of 2K to 4K, the underlying layout is (2K, 1) but the actual value of the mesh tensor is [2K, 2K+1, ....,]. While flattening, slicing, we need to remap the layout back to the new mesh tensor so it maps the actual device allocation. For example, in the 2K to 4K case, if the shape is (1K, 1K) with dim_names ("dp", "tp"). Then when slicing "tp", the mesh tensor should be (2K, 2K+1, ..., 3K-1) or (3K, 3K+1, ... 4K-1). not the global ranks generated from the layout. (1K, 1).
Verified that loss curve is very close for DeepSeekV3 on torchtitan, note that exact same match is challenging because even if we run the baseline twice, the loss curve does not exactly match.
<img width="1113" height="490" alt="image" src="https://github.com/user-attachments/assets/7877b5a4-337e-4ad8-b878-2378f4f0f38d" />
The PR looks big indeed but we don't change any existing behavior of DeviceMesh, so it is a pure refactor.
With this refactoring we also enabled the slicing and flatten of non-contiguous dims of a device mesh which is hard to implement without cute layout.
This is a continue of https://github.com/pytorch/pytorch/pull/161106 (original one got messed with EasyCLA)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163213
Approved by: https://github.com/lw, https://github.com/fegin
Modified `multimem_one_shot_all_reduce_out` function to accept a `root` argument, making it a `multimem_reduce` op.
The original `multimem_one_shot_all_reduce` op becomes a caller of the `multimem_reduce`, with each rank providing its own rank id as root.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164517
Approved by: https://github.com/ngimel
`grad_dtype` is a new attribute on Tensor to control gradient dtype:
- Access/setting is leaf-only.
- grad_dtype is respected when (1) when assigning to .grad, and (2) in the engine after the previous node produces incoming gradients for AccumulateGrad. (See table below for details)
- Not setting grad_dtype preserves the current behavior. Accessing it returns `t.dtype`
- `grad_dtype` cannot be set when there is already a `.grad` present and the dtypes conflict.
| `grad_dtype` setting | Setting `.grad` manually | Incoming gradient from autograd engine |
|-----------------------|--------------------------|-----------------------------------------|
| **Default (tensor’s dtype)** | `.grad` must match tensor’s dtype | Engine casts incoming grad to tensor’s dtype |
| **Set to specific dtype** | `.grad` must match that dtype | Engine casts incoming grad to the specified dtype |
| **Set to `None`** | `.grad` may be any dtype | Engine does not cast; accepts incoming grad dtype as-is |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162815
Approved by: https://github.com/albanD
Fixes#162129. Added validation in _rank_not_in_group() to check if ```FakeProcessGroup``` is properly initialized before use, raising a clear error message if ```torch.distributed.init_process_group(backend='fake')``` hasn't been called first.
This prevents silent failures and ensures proper dispatch system integration for all distributed operations.
Added test case test_fake_process_group_direct_usage_error() that validates the error is raised for ```all_reduce``` and ```all_to_all_single``` operations.
Please let me know if additional distributed operators should be tested or if any other updates are needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163665
Approved by: https://github.com/ezyang
**Summary**
Raise error when the `local_tensor` argument passed to `DTensor.from_local` is
a DTensor, this prevents users from accidentally calling `from_local` over a DTensor
object.
The error message is organized in this way:
```
the local_tensor argument only accepts torch.Tensor but got <class 'torch.distributed.tensor.DTensor'> value.
```
**Test**
`pytest test/distributed/tensor/test_dtensor.py -k test_from_local`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164496
Approved by: https://github.com/ezyang
it was previously quite misleading since it looks like the inputs to the
dynamo graph are plain tensors when in reality they are tensor subclasses
before
```
class GraphModule(torch.nn.Module):
def forward(self, L_input_batch_inputs_: "i64[2, 512][512, 1]cuda:0", L_self_parameters_weight_: "f32[202048, 256][256, 1]cuda:0"):
```
after
```
class GraphModule(torch.nn.Module):
def forward(self, L_input_batch_inputs_: "DTensor(i64[2, 512][512, 1]cuda:0)", L_self_parameters_weight_: "DTensor(f32[202048, 256][256, 1]cuda:0)"):
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164403
Approved by: https://github.com/ezyang
Shared memory is allocated by creating a file in /dev/shm (by default) that can run out of space. Pytorch reserves the file size by calling ftruncate() that creates a sparse file, so it succeeds even if sufficient disk space is not available.
This could lead to a situation when a shared memory region is successfully created but a subsequent access to a shared memory page results in SIGBUS due to the disk being full.
Using posix_fallocate() instead of ftruncate() eliminates this problem because the former syscall always allocates space and it returns an error if the disk is full.
Related to https://github.com/pytorch/pytorch/issues/5040
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161910
Approved by: https://github.com/mikaylagawarecki
Changelog:
1. When we run into an operation we didn't proxy, we end up emitting fake constants. We error under a config and we disable the config for some internal users. The reason we want to error is this signals a coverage problem we need to address but at the same time, we don't wnat to be disruptive to already working flows.
2. Previous attribute mutation detection logic in non-strict didn't account for nested module structure. This fixes silent incorrectness issue of exporting esm and qwen in non-strict and some torchbench models like levit_128 and demucs.
3. Previous logic didn't work on the cases where we mutate a container attribute as the previous approach used to pytree over old and new attributes resulting in length mismatch. We gracefully handle this now.
Differential Revision: [D83673054](https://our.internmc.facebook.com/intern/diff/D83673054)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164372
Approved by: https://github.com/avikchaudhuri
Add `struct AOTInductorConstantMapEntry` to represent the constant map in AOTI Model. We cannot use `std::unordered_map` for cross-compilation, because it is not ABI stable.
it will be tested when we test `update_user_managed_constant_buffer` for windows cross-compilation
Example usage:
```
// Load constants. Create random constants here.
auto* fc1_w = new slim::SlimTensor(slim::empty({16, 10}, c10::kFloat, c10::Device(c10::kCUDA, 0)));
fc1_w->fill_(1.0);
.....
// Build pairs
std::vector<AOTInductorConstantPair> constants{
{"fc1_weight", fc1_w},
{"fc1_bias", fc1_b},
{"fc2_weight", fc2_w},
{"fc2_bias", fc2_b},
};
// Call runtime (pass raw pointer + size)
update_user_managed_constant_buffer_abi(
container_handle,
constants.data(),
constants.size(),
/*use_inactive=*/false,
/*validate_full_update=*/true);
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163819
Approved by: https://github.com/desertfire
This reverts commit b1033789fea2bc82901eafed498a5252985b80e9.
Reverted https://github.com/pytorch/pytorch/pull/164256 on behalf of https://github.com/yangw-dev due to failed internal test: (pytorch.tritonbench.test.test_gpu.main.TestTritonbenchGpu) Error Details: torch._inductor.exc.InductorError: LoweringException: NoValidChoicesError: No choices to select. Provided reason: All choices failed to compile for backend. please consider adding ATEN into max_autotune_gemm_backends config (defined in torch/_inductor/config.py) to allow at least one choice. ([comment](https://github.com/pytorch/pytorch/pull/164256#issuecomment-3362359624))
Summary:
as title
- Some IR nodes are created during `finalize_multi_template_buffers()` in Scheduler. This PR adds provenance (`origin_node` and `origins`) for those nodes.
- Extract `assign_origin_node` function
Differential Revision: D82871244
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164255
Approved by: https://github.com/mlazos
We should surface the CUDA architecture matrix to make things more transparent. I believe this can later become its own page where we will publish supported matrix for each release.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164471
Approved by: https://github.com/Camyll
We want to refactor the internal bookkeeping of DeviceMesh so that:
Simply the bookkeeping logics and make it generic enough so that it is easy to support new transformations like flatten noncontiguous dim, reshape and unflatten. (We leveraged the CuTe layout). This new layout also let us handle non-contiguous slicing, flatten, transpose possible.
Concretely, in this PR, we do the following:
1. Use the `_MeshLayout` to handle all index operations rather use a map to record mesh dims.
2. Removed `flatten_name_to_root_dims`, because now we can directly get layout from a flattened device mesh.
3. Replaced `_get_slice_mesh_dims` with `_get_slice_mesh_layout`.
4. Use the newly added function `check_overlap` to check layout overlap.
5. Use a new function `to_remapping_tensor` to use layout ranks as indices when the mesh tensor is not representable as CuTe. The reason is that layout acts as a backend of mesh tensor bookkeeping (indexing indices), it needs to be used as indices for remap back to the mesh tensor for new DeviceMesh generation and backend init. For example, in the case of 2K to 4K, the underlying layout is (2K, 1) but the actual value of the mesh tensor is [2K, 2K+1, ....,]. While flattening, slicing, we need to remap the layout back to the new mesh tensor so it maps the actual device allocation. For example, in the 2K to 4K case, if the shape is (1K, 1K) with dim_names ("dp", "tp"). Then when slicing "tp", the mesh tensor should be (2K, 2K+1, ..., 3K-1) or (3K, 3K+1, ... 4K-1). not the global ranks generated from the layout. (1K, 1).
Verified that loss curve is very close for DeepSeekV3 on torchtitan, note that exact same match is challenging because even if we run the baseline twice, the loss curve does not exactly match.
<img width="1113" height="490" alt="image" src="https://github.com/user-attachments/assets/7877b5a4-337e-4ad8-b878-2378f4f0f38d" />
The PR looks big indeed but we don't change any existing behavior of DeviceMesh, so it is a pure refactor.
With this refactoring we also enabled the slicing and flatten of non-contiguous dims of a device mesh which is hard to implement without cute layout.
This is a continue of https://github.com/pytorch/pytorch/pull/161106 (original one got messed with EasyCLA)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163213
Approved by: https://github.com/lw, https://github.com/fegin
# Summary
This happends when flex_attention is not tagged with the ` CheckpointPolicy.MUST_SAVE` policy. This causes the lse to be unrealized. I think in general this probably not the best policy but we shoudn't error
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164421
Approved by: https://github.com/Skylion007
Actually we would like to not graph break even in the case of Dynamo. But there is a weird-unsolved bug with Kineto + Dynamo when there are distributed jobs that lead to NCCL timeouts. This bug is a rare edege case, but we have not been able to root cause it yet.
But for export, we do not anticipate JIT tracing in distributed job training and therefore this PR is safe for export.
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164418
Approved by: https://github.com/StrongerXi, https://github.com/williamwen42
We do some fixes in pinned memory allocation stats collection and better differentiate between active vs allocated bytes.
Reviewed By: bbus, sayitmemory
Differential Revision: D83162346
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164412
Approved by: https://github.com/mradmila
A couple minor things to clean up the structure of `compile_fx` before we hit pre grad passes:
1. After patching config and recursively calling `compile_fx`, we don't need the patches any more. We make the subsequent logic call a `_maybe_wrap_and_compile_fx_main` (both when cpp wrapper exists and doesn't).
2. There's some recursive wrapping that happens on inputs and outputs before hitting pre grad passes, which are now also separated out before calling a `_compile_fx_main`, where actual work finally happens.
These also happen to fix a couple of TODOs in the old code.
Differential Revision: D83500704
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164169
Approved by: https://github.com/zhxchen17
Fixes#156052 and #156444.
This PR setup the privateuseone key in Python to be used as a python backend for pytorch.
Meaning that, after calling `setup_privateuseone_for_python_backend('npy')`, one can use a subclass to with that device to hold arbitrary python data as "device data" and use `torch.library` to register ops that takes that Tensor.
Changes done in this PR:
1. Register an vanilla Device Guard: I extended NoOpDeviceGuard to have allow device index of 0 and to not raise errors when event related functions are accessed. If I don't do those, when calling backward I would get errors. (CPU backend uses NoOpDeviceGuard just fine, although there seems to be special treatment of CPU in the autograd engine.
2. Tensor subclass allows not having `__torch_dispatch__` if the device is not CUDA or CPU. The comment of the check suggests it was to avoid segfault when calling into ops that expects a storage. Here we have a different device so will not call into those ops.
3. python function that invokes the other incantations to setup the privateusekey backend.
This took inspiration of https://github.com/bdhirsh/pytorch_open_registration_example and https://github.com/tinygrad/tinygrad/blob/master/extra/torch_backend/wrapped_tensor.cpp; great thanks to @bdhirsh and @geohot.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157859
Approved by: https://github.com/albanD
**Summary:** In order to test numerics for replicate + pp, stage.py needs to be able to call replicate's backward manually as pipeline parallelism doesn't have this feature.
**Test Case**
1. pytest test/distributed/_composable/test_composability/test_pp_composability.py -k test_replicate_pp
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164031
Approved by: https://github.com/weifengpy, https://github.com/H-Huang
ghstack dependencies: #163897
```Py
@triton.jit
def foo(dest, src):
nvshmem.get_nbi(dest, src, 100, 0)
# Some independent computation which overlaps with the get operation
...
# Wait for completion of the get operation
nvshmem.quiet()
```
Allows us to overlap comm and compute in the same kernel, instead of two kernels + signals.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163540
Approved by: https://github.com/ngimel, https://github.com/fegin
The current behavior is to do "nothing", which means you will corrupt
data. If you're doing something similar to LocalTensor, where you're
overriding the behavior of collectives to do something numerically,
this can be unwelcome behavior. If you can error when this happens
it can help prevent silent numerical incorrectness.
Authored with claude code.
Signed-off-by: Edward Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162841
Approved by: https://github.com/dcci
----
This PR will be part of a series of PR's that aims to remove `.ci/aarch64_linux` folder entirely, such that Aarch64 manylinux build happens as part of `.ci/manywheel/build.sh`, the same as other platforms.
In this PR:
- We prebuild + install Arm Compute Library in the manylinux docker image ( at /acl ), instead of a build time for every pytorch build. Also updated jammy install path to be /acl too.
- We can therefore remove build_ArmComputeLibrary functions from the ci build scripts.
- There is also some refactoring of install_openblas.sh and install_acl.sh to align them together ( similar formatting, similar variable names, same place for version number update )
- We had 2 places to define openblas version, this has been reduced to 1 now ( install_openblas.sh ).
- ACL_VERSION and OPENBLAS_VERSION are now able to be overriden at build.sh level for developers, but there is only 1 version of each hardcoded for ci.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159737
Approved by: https://github.com/seemethere, https://github.com/aditew01
In #163455 , the `reshape` was not a pure view op.
The `permute` before it created an non-contiguous tensor, which would trigger a data copy during the reshape.
This PR improved the implementation by remove the `urtensor` intermediate tensor completely.
By simply expanding the `xtensor` would achieve the `repeat` effect.
Before this PR, there were two data copies (in `urtensor.copy_` and `urtensor.reshape`).
Now, there is only one data copy in the `.copy_()`.
Reshape would not copy data because it is on a contiguous tensor.
One more note is that we do want at one copy because we want to duplicate the elements for the repeats.
User can inplace modify single elements without afffecting others.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163842
Approved by: https://github.com/Skylion007
Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
Both Linux, Windows and MacOS CI workflows should use `.ci/docker/requirements-ci.txt`
TODOS:
- Investigate why `choco install cmake` is needed to successfully detect MKL
- Move `psutil` installation from specific scripts into requirements-ci.txt
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163396
Approved by: https://github.com/Skylion007
Summary: `nn.Parameter()` by default has `requires_grad=True` and would cause issues when there are non-float parameters.
Test Plan: buck2 run mode/dev-nosan caffe2/test:test_export -- -r test_non_float_weight
Differential Revision: D83598796
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164290
Approved by: https://github.com/angelayi
Summary:
New test sizes for `test_scaled_mm_vs_emulated_block_wise` all fail with
```
RuntimeError: Invalid scaling configuration
```
Disable these new tests for now (the remaining test is a parametrized
version of the original test case)
Test Plan:
`pytest test/test_scaled_matmul_cuda.py`
Reviewers:
Subscribers:
Tasks:
Tags:
Signed-off-by: Simon Layton <simonlayton@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164259
Approved by: https://github.com/jananisriram
ghstack dependencies: #164266
To run this, you need to install `mingw64-gcc-c++` and download windows cuda library toolkit.
See design doc and demo instructions in https://docs.google.com/document/d/1iDaChqA5nNKkBFTzsdkmoomvQlXHbnlb1Z4yEp7xaJA/edit?tab=t.0
If cross_platform_target is windows, we do the following:
- do not link to `sleef`. This can be improved in the future if we need it. Currently I avoid it because that requires extra setup on the linux side
- Use `mingw64-gcc-c++` to compile
- Use `WINDOWS_CUDA_HOME` instead of `CUDA_HOME` when linking to cuda
```
python test/inductor/test_aot_inductor_windows.py -k so
```
Other changes:
- de-couples compile_standalone config and dynamic link flag
- create a new aot_inductor_mode config module, which is used to control configs in aot_inductor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163188
Approved by: https://github.com/desertfire
Summary: This diffs add an API to query expandable segment size for each stream so that we can use this info to warmup the segment in advance, so we dont incur any performance penalty during steady state inference for new CUDA memory allocations.
Differential Revision: D76447308
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163771
Approved by: https://github.com/bbus
Previous work #158352 delivered CUDAGraph memory footprint reduction with no replay-time impact, but capture time regressed (up to 20× slower) due to repeated full-graph traversals. See previous benchmark results [here](https://github.com/pytorch/pytorch/pull/158352#issuecomment-3215947565)
This PR removes capture/reply overhead while preserving the memory savings:
1. **Terminals as free markers**
We stop inserting empty nodes and instead record the current stream terminals as free markers. This avoids mutating the user’s graph and keeps semantics unchanged.
2. **Incremental, cached reachability**
We add a **per-graph reuse context** that caches reverse-traversal state:
* `graph_reuse_context[graph].visited[stream]` tracks nodes already seen from that stream’s terminal frontier.
* On each allocation during capture, we resume traversal from the latest terminals and only visit unseen nodes.
* A block is freed when all its recorded markers are in the visited set of its allocation stream—i.e., all markers are proven predecessors of future work.
See [the performance results here](https://docs.google.com/spreadsheets/d/e/2PACX-1vRPvdd9Xa8W87ixbiA0da_qvOhrUAjUpFz0G-_j-MsDnoeRyhEa4_ut_W3rqcg1VVZVFJ-gucwov-3b/pubhtml?gid=1468302443&single=true), we sweep synthetic multi-stream CUDA Graphs built by `capture_benchmark.py` (same as before, we generate random interleaving of alloc/free/join with given probabilities, see [gist here](https://gist.github.com/eee4017/e2092d215b1d4bd46534148939af39e3)), and we compare median capture/replay times and memory. On an NVIDIA H100 PCIe across 24 configs, the optimization preserves reserved memory reduction at ~24–98%, leaves allocated memory unchanged, and brings capture time back to baseline (range 0.96–1.04× vs. baseline) with replay time unchanged (range 0.97–1.11×).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162186
Approved by: https://github.com/eqy, https://github.com/ngimel
**Summary:** In order to ensure that replicate acts as intended (a specialized version of hsdp) we need to make sure that it can pass the same tests that fully_shard can for training. The first test verifies Replicate works with gradient accumulation properly. The second verifies that replicate works correctly with a One-Forward-One-Backward (1F1B) pipeline parallelism schedule
**Test Cases**
1. pytest test/distributed/_composable/test_replicate_training.py -k test_gradient_accumulation
2. pytest test/distributed/_composable/test_replicate_training.py -k test_1f1b_microbatching
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162839
Approved by: https://github.com/mori360
ghstack dependencies: #162830, #162836
Those were very useful in the past, because:
- CI builder jobs did not generates wheels, but rather run `python setup.py develop` and shared docker layers, which is no longer the case, all CI jobs produce wheels
- CD jobs were targeting pre-CXX11 ABI, but this is no longer the case after manylinux2_28 migration
Existing, but acceptable gaps:
- Windows libtorch debug builds sometimes might fail, but IMO it's ok not to be able to produce those for a few days, as number of libtorch users are somewhat small
- All CD jobs are based on AlmaLinux, while CI are based on Ubuntu, but this could be adjusted if needed, besides AlmaLinux-9 and Ubuntu-22.04 are pretty close in terms of glibc and gcc versions
- CD jobs build for all GPU architectures, while CI only for the one being tested, but there are now periodic H100 and B200 jobs, and not a lot of development happens for Voltas or Pascals
Besides there are better tools to alert about the nightly failures
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164260
Approved by: https://github.com/seemethere, https://github.com/atalman
Fixes#147521
This modification allow user to put any size of var in GaussianNLLLoss if the var is broadcastable (to input/target's size)
Therefore, the demo code in #147521 will result in expected behaviour and correct output.
This allow all input size that match:
`input.size = (..., n, ...), var.size = (..., 1, ...)`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147522
Approved by: https://github.com/mikaylagawarecki
Fixes#163702.
This fixes 2 issues:
1. The value may inconsistently be a shape or string. This normalizes to handle both of these.
2. 1D shapes should not transpose data. This fixes the order of operations to prevent this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163966
Approved by: https://github.com/eellison
Summary:
As titled.
sdpa will select backend based on hardware check, and it fails when exporting with cuda under fake mode on a cuda-less machine.
We guard `at::cuda::is_available()` check before `at::cuda::getCurrentDeviceProperties()` and give warnings.
Test Plan: buck2 run mode/dev-nosan caffe2/test:test_export -- -r nn_functional_scaled_dot_product_attention
Differential Revision: D83496154
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164162
Approved by: https://github.com/SherlockNoMad
# Problem
Inductor's FX backend receives sympy expressions for Triton launch grids, and passes these to a tracer to generate equivalent FX IR. However, the tracer does not support all possible sympy expressions. In particular, it can't handle ops like `floor` and `Pow` which would be found in an expression like `floor(x / y)`. Instead, it expects `FloorDiv(x, y)`, which has the advantage that all intermediate values are integers, unlike `x / y`.
Inductor's Python backend uses a trick where `ceil(x / y)` is computed in Python as `-(x // -y)`, which is faster when evaluating Python launch grids at runtime. However, this trick generates more complex sympy expressions, so the FX backend introduced a `"python_slow"` mode using a more familiar form of ceil division. However, this mode is slower to evaluate, which increased production CPU usage. (Internal reviewers see T237853632.)
# Solution
To get the best of both worlds, this PR removes `"python_slow"` mode, and generalizes the `replace_floor_div` function to handle the more complex expressions resulting from the `"python"` grid mode. The new algorithm is conceptually similar to the existing one, except instead of analyzing only the first argument to a `sympy.Mul` op, it checks all factors, so it can handle expressions containing both `Rational` and `Pow` ops, among other cases. It also uses `Mul.make_args` to handle the case when the argument to `floor` is not a `Mul`. Finally, it uses `expr.is_positive` to check the sign of symbolic exponents.
This new algorithm is guaranteed to convert all `floor` ops to an equivalent expression using `FloorDiv`. (To see this, consider that `floor(x) == FloorDiv(x, 1)`.) Note it may not remove all `Pow` ops, with a counterexample being `floor(x / (2 + z ** y))`, but it covers everything we've seen in practice for symbolic launch grids. In particular, it covers the typical case where `Pow` is a factor of the argument to `floor`, and the exponent is `-1`. Is this situation, we move the `Pow` to the denominator of `FloorDiv` and the exponent becomes `1`, eliminating the `Pow` op.
# Test plan
This PR adds an end-to-end test for static padding with dynamic outer dimensions, which creates a difficult sympy expression that the existing algorithm would not be able to handle.
This PR also adds some unit tests for the `replace_floor_div` function. It can be difficult to construct end-to-end tests that expose all the trickiest expressions, as those tests have to pass through a number of other systems handling dynamic shapes. Therefore, it's easier to expose the edge cases with these new unit tests. The tests check that we can replace all `floor` ops in the input expression with `FloorDiv`, then they expand `FloorDiv` back to `floor` and check equality with the original expression.
Note this PR also requires some MTIA changes to pass internal tests. Those will be stacked onto the imported diff.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163828
Approved by: https://github.com/nandesuka, https://github.com/angelayi, https://github.com/jansel
Summary:
Under circumstances it seems reasonable to return a callable directly without guard check when user use aot_compile on a function with single compilation result.
When having multiple entries (aot_compile_module), we should start enabling guard check to differetiate different compiled functions apart.
Test Plan: CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163432
Approved by: https://github.com/dolpm, https://github.com/mlazos
# Summary
- Add a note to each `nn.LPPool*d` docstring explaining how `ceil_mode=True` interacts with right padding.
- Mirror the same clarification in the `torch.nn.functional.lp_pool*` docstrings so the rendered functional docs stay in sync.
# Motivation
The current PyTorch spec for **LPPool** does not fully match runtime behavior, which has led to downstream confusion in other specs (e.g., ONNX) and runtimes (e.g., [onnxruntime issue #25848](https://github.com/microsoft/onnxruntime/issues/25848)). A corresponding clarification was also made in the ONNX spec: [onnx/onnx#5741](https://github.com/onnx/onnx/pull/5741).
PyTorch’s **LPPool** implementation calls into **AvgPool**, which enforces the rule that windows starting entirely in the right padded region are ignored when `ceil_mode=True`. As a result, **LPPool** inherits the same behavior.
This is an edge case where the output size formula shown in the LPPool docs/spec is not sufficient on its own. Without the added caveat, the documentation is technically incorrect. This PR brings the LPPool docs in line with actual behavior.
Note that this is a trivial fix to the spec as all major implementers of the spec adhere to this caveat.
For comparison, both **MaxPool** and **AvgPool** already include this clarification in their spec. Their docstrings explicitly state:
> *When `ceil_mode=True`, sliding windows are allowed to go off-bounds if they start within the left padding or the input. Sliding windows that would start in the right padded region are ignored.*
Adding the same note to LPPool ensures consistency across all pooling operators.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163186
Approved by: https://github.com/mikaylagawarecki
This is a simple refactor that just moves some logic in `_precompile_config` to two new functions for separation of concerns. This will allow subclasses e.g. out of tree to configure options and metadata for triton.compile.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162406
Approved by: https://github.com/exclamaforte
Fixes#156052 and #156444.
This PR setup the privateuseone key in Python to be used as a python backend for pytorch.
Meaning that, after calling `setup_privateuseone_for_python_backend('npy')`, one can use a subclass to with that device to hold arbitrary python data as "device data" and use `torch.library` to register ops that takes that Tensor.
Changes done in this PR:
1. Register an vanilla Device Guard: I extended NoOpDeviceGuard to have allow device index of 0 and to not raise errors when event related functions are accessed. If I don't do those, when calling backward I would get errors. (CPU backend uses NoOpDeviceGuard just fine, although there seems to be special treatment of CPU in the autograd engine.
2. Tensor subclass allows not having `__torch_dispatch__` if the device is not CUDA or CPU. The comment of the check suggests it was to avoid segfault when calling into ops that expects a storage. Here we have a different device so will not call into those ops.
3. python function that invokes the other incantations to setup the privateusekey backend.
This took inspiration of https://github.com/bdhirsh/pytorch_open_registration_example and https://github.com/tinygrad/tinygrad/blob/master/extra/torch_backend/wrapped_tensor.cpp; great thanks to @bdhirsh and @geohot.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157859
Approved by: https://github.com/albanD
I experimented with 3 paths to get joint graph for DTensorized module and input
1. strict_export + aot_export_joint_with_descriptors
2. graph_capture + aot_export_joint_with_descriptors
3. aot_export_joint_with_descriptors alone
Added test to guard them.
1 doesn't work, as bw graph region is missing from the joint graph.
I am leaning towards making 2 the recommended path.
If 2 doesn't work going forward, we can fallback to 3.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163609
Approved by: https://github.com/tugsbayasgalan
Co-authored-by: suo <suo@fb.com>
Summary:
Original commit changeset: 06888d7ebff0
Original Phabricator Diff: D82932788
Restricted the test to SM90 for scaled_grouped_mm
Test Plan: TBD (will share the linux CI results)
Differential Revision: D83283991
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163905
Approved by: https://github.com/angelayi
tl;dr performs bucketing while preserving comm-compute overlap.
In comm-compute overlap we will have a graph with:
```
def foo(...):
ag = all_gather(...)
hiding_compute = mm(...)
wait(ag)
```
There is no explicit dependency between the hiding compute and the collectives, but we want to add implicit dependencies from wait->hiding_compute, and from hiding_compute->all_gather to preserve overlap.
Additionally, while bucketing, we will merge collective starts and collective waits together. In this case, we will want to treat the two nodes as a single subgraph - each node in the merged set will have the union of all deps in the set.
We perform bucketing while augmenting the graph with these relationships. This can be done separably from comm-compute overlap, so long as the hiding compute relationships are passed in.
TODO:
- need to instrument fx graph so inductor respects these relationships.
- the compile time of the bucketing search can be sped up significantly by limiting what portion of the graph we traverse through
- more memory aware handling
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163960
Approved by: https://github.com/ruisizhang123, https://github.com/v0i0, https://github.com/IvanKobzarev
ghstack dependencies: #163215, #163754, #163959
In comm-compute overlap we will have a graph with:
```
def foo(...):
ag = all_gather(...)
hiding_compute = mm(...)
wait(ag)
```
There is no explicit dependency between the hiding compute and the collectives, but we want to add implicit dependencies from wait->hiding_compute, and from hiding_compute->all_gather to preserve overlap.
Additionally, while bucketing, we will merge collective starts and collective waits together. In this case, we will want to treat the two nodes as a single subgraph - each node in the merged set will have the union of all deps in the set.
This pr adds `AugmentedGraphHelper` that adds the apis, and allows querying for dependency with this augmented graph.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163959
Approved by: https://github.com/v0i0, https://github.com/IvanKobzarev
ghstack dependencies: #163215, #163754
This is first part of the stack that does comm/compute reordering, and then uses the exposure analysis to do bucketing.
Subsequent prs will handle:
- use of exposure analysis to do bucketing
- make sure inductor respects comm/compute overlapping done at fx level
- non-profiling mm estimation/rank broadcasting of profile results
Other mis:
- Validate accuracy of nccl estimations ( use ruisi's profiling instead ?)
For a llama 2d parallelism test, on forward, we overlap all but 2 of potentially hidden collectives. For backward, we overlap 217/269 of potentially hidden collectives. If you increase `compute_overlap_multipler` (for fudge factor of inaccurate comms estimation), that goes down to all but 16 of potentially hidden collectives.
fwd example: https://gist.github.com/eellison/76209c49d8829c5f1e323d34a3f040c3
bwd example: https://gist.github.com/eellison/6cfc2285df53a94cfa4012f5fdae5c51
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163215
Approved by: https://github.com/IvanKobzarev
----
- `cmake_dependent_option` condition should be `USE_ROCM OR (USE_CUDA AND NOT MSVC)` (similar to the one for flash attention)
- Default settings should be user overridable, i.e. even if one builds for SM_10, they should be able to pass `USE_FBGEMM_GENAI=0` and skip the build
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164165
Approved by: https://github.com/Skylion007
See also #163972, which was intended to be this PR.
Triton (release/3.5.x) by default ships CUDA12.8 ptxas.
This PR tries to bundle a ptxas version for cuda13, so that it can help https://github.com/pytorch/pytorch/issues/163801 when users run on new devices like THOR and Spark.
Fixes https://github.com/pytorch/pytorch/issues/163801
Test Plan:
Check binary size increase against nightly or v2.9RC
Install the binary from into a working THOR and GB200/GH100 machine (reproduce the original issue first on THOR), then install the binary built from this PR and we expect the issue to be gone without any additional user setting. Testing on GB200 is to ensure no regression.
Reference: https://github.com/pytorch/pytorch/pull/119750 and 5c814e2527
Note: with this PR, the pytorch world's torch.compile is supposed to find ptxas via "torch/_inductor/runtime/compile_tasks.py" and "_set_triton_ptxas_path". Use cases that do not go through "_set_triton_ptxas_path" may not be able to use the cuda13 ptxas binary.
However, as is, the triton world does not know the existence of this new cuda13 ptxas. So IF a users thinks there is already pytorch/bin/ptxas and delete the ptxas from triton, then c6ad34f7eb/python/triton/knobs.py (L216) would still complain ptxas not found (if removed - it won't know this new one available)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163988
Approved by: https://github.com/atalman
Previously we already replaced most use of `python setup.py develop/install`.
This PR also replaces the use of `setup.py bdist_wheel` with the modern `python -m build --wheel` alternative.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156712
Approved by: https://github.com/atalman
ghstack dependencies: #156711
* Changes some internal logic for grouping so hopefully it's slightly less annoying write code for
* Changes the invoking file summary to just use file, which I think is correct most of the time
* Adds some fields to the file summary, like skips, errors, etc so I can reuse it for file report regression things
Output should be the same, maybe with slightly more fields since I got rid of some of the pops
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164016
Approved by: https://github.com/huydhn
Summary:
Previously, many arvr targets transitively depended on c10, not c10_ovrsource,
because they either explicitly depended on c10 (because they didn't know
better) or they depended on legacy Caffe2, which never got the ovrsource
treatment. So we found all these spots (driven by D82283623) and forced them
to query arvr mode to figure out which one they should use. The goal is you
NEVER have both targets in the same build rule at the same time.
This diff could be reverted if D82224960 works out but I haven't gotten it to work yet.
Test Plan: sandcastle
Reviewed By: EscapeZero
Differential Revision: D82390436
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164128
Approved by: https://github.com/albanD, https://github.com/malfet
In comm-compute overlap we will have a graph with:
```
def foo(...):
ag = all_gather(...)
hiding_compute = mm(...)
wait(ag)
```
There is no explicit dependency between the hiding compute and the collectives, but we want to add implicit dependencies from wait->hiding_compute, and from hiding_compute->all_gather to preserve overlap.
Additionally, while bucketing, we will merge collective starts and collective waits together. In this case, we will want to treat the two nodes as a single subgraph - each node in the merged set will have the union of all deps in the set.
This pr adds `AugmentedGraphHelper` that adds the apis, and allows querying for dependency with this augmented graph.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163959
Approved by: https://github.com/v0i0, https://github.com/IvanKobzarev
ghstack dependencies: #163215, #163754
This is first part of the stack that does comm/compute reordering, and then uses the exposure analysis to do bucketing.
Subsequent prs will handle:
- use of exposure analysis to do bucketing
- make sure inductor respects comm/compute overlapping done at fx level
- non-profiling mm estimation/rank broadcasting of profile results
Other mis:
- Validate accuracy of nccl estimations ( use ruisi's profiling instead ?)
For a llama 2d parallelism test, on forward, we overlap all but 2 of potentially hidden collectives. For backward, we overlap 217/269 of potentially hidden collectives. If you increase `compute_overlap_multipler` (for fudge factor of inaccurate comms estimation), that goes down to all but 16 of potentially hidden collectives.
fwd example: https://gist.github.com/eellison/76209c49d8829c5f1e323d34a3f040c3
bwd example: https://gist.github.com/eellison/6cfc2285df53a94cfa4012f5fdae5c51
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163215
Approved by: https://github.com/IvanKobzarev
# Problem
Inductor sometimes generates unbacked symints to handle things like mismatched branches of `torch.cond`. This code is represented by `pytree.KeyPath`, with special codegen logic to convert it to Python and C++. This was not previously supported by the FX backend.
# Feature
This PR adds support for unbacked symbol declarations to the FX backend. The implementation is fairly straightforward.
1. Instead of raw Python/C++, update the wrapper codegen method to emit a new Wrapper IR line called `UnbackedSymbolDefsLine`. This contains all the information needed to generate the Python and C++ code.
2. Move the existing Python/C++ codegen to a private method, which is invoked by `UnbackedSymbolDefsLine.codegen()`.
3. Implement a method to generate FX IR from unbacked symbol definitions. The implementation is based on recursive descent, consuming some keypath entries, emitting an FX IR node, and recursing to the rest of the keypath. It is conceptually identical to the existing algorithm for Python and C++, except it generates FX nodes.
4. The FX backend currently relies on size hints to generate autotuning arguments, and consequently autotuning does not support unbacked SymInts. At some point, we would like to generalize the autotuning logic to support these. But for now, simply emit a warning and skip autotuning when we see them.
5. The new test case exposed some tricky issues reconciling Triton call args with constants stored in `triton_meta`. This PR rewrites the relevant helper function to do this in a more principled way.
# Test plan
This PR imports an existing control flow test to the FX backend's test suite. The test uses unbacked symbol definitions to handle mismatched dynamic shapes coming from `torch.cond` branches.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163729
Approved by: https://github.com/jansel
Before returning a comand buffer, as subsequent calle are very likely to allocate their own encoder, which results in the following runtime error
```
tryCoalescingPreviousComputeCommandEncoderWithConfig:nextEncoderClass:]:1090: failed assertion `A command encoder is already encoding to this command buffer'
```
Added regression test to `test_mps_extension`
Please note, that `torch::mps::get_command_buffer()` should be called with dispatch_queue held, both before and after this change, but many implementations skip that
Fixes https://github.com/pytorch/pytorch/issues/163721
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164093
Approved by: https://github.com/atalman, https://github.com/Skylion007
Summary: Relax stride check for block-wise scaling (1x128, 128x128) when a dimension of the scaling factor is 1. When the scaling tensor has a dimension of size 1, the stride is effectively "meaningless" to PyTorch, i.e. PyTorch decides to replace its stride with a default of `[1, 1]`. However, the old stride check required the stride to match one of the scaling dimensions. Here, we relax the stride check when the effective stride is 1 in order to allow for cases in which `K <= 128` and `N <= 128`.
Test Plan:
```
pytest -s -v test/test_matmul_cuda.py::TestFP8MatmulCUDA::test_scaled_mm_vs_emulated_block_wise_float32_lhs_block_1_rhs_block_128_cuda 2>&1 | tee ~/personal/stride_check.log
```
Differential Revision: D83023706
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163829
Approved by: https://github.com/lw, https://github.com/eqy
Summary: For H100s and below, add `op_name="scaled_mm"` to the template heuristic for `CUDAScaledTMATemplateConfigHeuristic` such that `scaled_mm` persistent + TMA tests do not default to the "mm" heuristics.
Test Plan: `test_max_autotune.py`
Differential Revision: D83390775
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164019
Approved by: https://github.com/njriasan
Fixes#163597
- Updates fast SDPA implementations to take in query tensor stride info similar to key and value instead of assuming stride.
- Updated tests with additional transpose/permutation layouts. New tests catch the regression.
### Benchmarking with script found in [implementation PR](https://github.com/pytorch/pytorch/pull/152781#:~:text=19.8%25%20speed%20improvement-,Script%20to%20get%20perf%3A,-import%20torch%0Aimport)
Times are averaged over 100000 iterations. This change should not have any significant performance difference. Tested on an M3 Pro
### Vector Fast Path (q_len=1, k_len=256)
- Before: 0.160 ms
- After: 0.157 ms
### Vector 2-pass (q_len=1, k_len=4096)
- Before: 0.342 ms
- After: 0.339 ms
### Vector Fast Path (q_len=8, k_len=256)
- Before: 0.228 ms
- After: 0.231 ms
### Vector 2-pass (q_len=8, k_len=4096)
- Before: 0.432 ms
- After: 0.436 ms
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163598
Approved by: https://github.com/malfet
Summary:
ran into this when precompiling baidu/ERNIE-4.5-21B-A3B-PT
codegen after fix:
```py
import triton
import triton.language as tl
from torch._inductor.runtime.triton_heuristics import start_graph, end_graph
from torch._C import _cuda_getCurrentRawStream as get_raw_stream
with torch.cuda._DeviceGuard(0):
stream0 = get_raw_stream(0)
...
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163707
Approved by: https://github.com/jamesjwu
Summary: Partly Importing and adapting https://github.com/pytorch/pytorch/pull/138388, adding SVE128 as ISA.
Intention is to add SVE128 translation layers for Vectorized data types.
Idea is to have 1 PR per file, aside from the current one, plus a last one modifying cmake files to enable the new ISA selectively.
Tested current changes on a nightly run, to verify no regressions occur on systems leveraging SVE256.
No regressions spotted when running test_ops.py, a set of 34k unit tests. A machine leveraging SVE128 was used towards this testing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158932
Approved by: https://github.com/malfet
For https://github.com/pytorch/pytorch/issues/114850, we will port aten unit tests to Intel GPU. This PR will work on some test case of test/test_ops.py. We could enable Intel GPU with following methods and try the best to keep the original code styles:
1. Extended XPUTestBase.get_all_devices to support multiple devices
2. Added skipXPU decorator
3. Extended onlyOn to support device list
4. Enabled 'xpu' for some test pathes
5. Added allow_xpu=True for supported test class.
6. Replaced onlyCUDA with onlyOn(['cuda', 'xpu']) for supported tests
7. Use skipIfXpu and skipXPU to disable unsupported test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159944
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/albanD
Its unclear why we had disable in the first place. With
install_free_tensors, we are tracing into this hook. A better way would
be to place the tracer without any hook. For now, disable the checking
while dynamo is tracing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164084
Approved by: https://github.com/tugsbayasgalan
This PR introduces a new "operator microbenchmark" CI workflow and GitHub Actions for operator microbenchmarks, updating test scripts and job matrices to support new parameters, and broadening the operator benchmark tests to include more data types, larger shapes, and gradient tests. The benchmark configurations now focus more on different cuda hardware and multiple dtypes (bf16, fp16, fp32), for both compile and eager mode.
**Benchmark Configuration and Coverage:**
* Expanded operator benchmark configurations in `addmm_test.py`, `bmm_test.py`, `matmul_test.py`, and `mm_test.py` to benchmark multiple dtypes on CUDA devices, in eager and compile mode, for forward and backward run. The configs with tag "long" for the above mentioned files are being run in CI.
* The CI benchmarking is running on various hardwares: H100, A100.
* The CI job also uploads the microbenchmarking outputs to a [HUD](https://hud.pytorch.org/benchmark/llms?repoName=pytorch%2Fpytorch&benchmarkName=PyTorch+operator+microbenchmark) dashboard.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162530
Approved by: https://github.com/huydhn
Co-authored-by: Huy Do <huydhn@gmail.com>
This adds basic support for subclass inputs in export (specifically for non-strict). I had to make fakify little more complicated which risks further divergence from dynamo fakification. But dynamo one is so complex, so i feel it is better to do this way. Also improved fake mode detection logic to recursively look into subclass inner tensors.
Differential Revision: [D83156489](https://our.internmc.facebook.com/intern/diff/D83156489)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163770
Approved by: https://github.com/avikchaudhuri
Replace the **runtime_error** of the vallina C++ exceptions with **TORCH_CEHCK** in **torch/nativert/***
The vallina C++ exception should not exist in the core part of pytorch for its corss-languanges trait. Comparing with the vallina C++ exceptions, TORCH_CHECK have the richer error context and It has the unified error handling mechanism. This commit replace the runtime_error with TORCH_CHECK of the files in
torch/nativert/* .
Fixes part of #148114
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163308
Approved by: https://github.com/dolpm
Currently, the Cholesky factorization and least squares operation defaults to magma when Pytorch is compiled for ROCm. This shows suboptimal performance.
This change allows PyTorch to rely on hipSolver instead of Magma.
@jeffdaily
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163977
Approved by: https://github.com/Skylion007
Summary:
Running a toy example through `torch.compile(fullgraph=True, backend="inductor")` with default inductor config, I tried to see what passes are run in each of pre-grad, joint-graph, and post-grad phases by printing out the subsystem in `GraphTransformObserver`. However the subsystem showed up as None in a bunch of transforms that were run in each of those phases, so this PR adds some additional annotations.
Note that these annotations are probably not a complete set, since other transforms may run based on changes to the config that are not covered here.
Hopefully this doesn't change behavior. However, I did notice that bisecting relies on disabling various phases, which means that while before some passes would *not* be disabled (because their subsystem was `None`), now they would.
Test Plan: existing tests + manual test described in summary
Differential Revision: D83306676
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163922
Approved by: https://github.com/jansel
A DTensor that contains partial placement shouldn't be checkpointed (DCP.save) -- the result is not correct and DCP doesn't know how to handle it.
There are several APIs that are only used by checkpointing, e.g.,`__create_write_items__`. These APIs should raise an exception if the DTensor, `self`, has Partial placement.
Ideally, we want to add the following test:
```
with self.assertRaisesRegex(
RuntimeError, "Any checkpointing related operations are not supported for"
):
dcp.save({"dtensor": dtensor}, checkpoint_id=tempfile.gettempdir())
```
While we do see the RuntimeError is raised, the error was raised in another thread due to DTensor checkpoint APIs are called by DCP in a separate thread, which assertRaisesRegex cannot capture.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163941
Approved by: https://github.com/tianyu-l
As the title stated.
**Changes:**
- torch.cuda.amp.autocast
- torch.cpu.amp.autocast
- add explicit `__new__` and `__init_subclass__` for those class above for inspect.signature to retrieve correct signature
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163654
Approved by: https://github.com/Skylion007
This adds a PR time benchmark that checks for runtime overhead on a very small graph. This will help track regressions in runtime overhead.
Example Results:
```
runtime_overhead_inductor,instruction_count,222645
runtime_overhead_inductor_inference_mode,instruction_count,234998
runtime_overhead_inductor_requires_grad,instruction_count,293556
runtime_overhead_inductor_requires_grad_backward,instruction_count,78181
runtime_overhead_inductor_dynamic,instruction_count,234870
runtime_overhead_inductor_inference_mode_dynamic,instruction_count,248711
runtime_overhead_inductor_requires_grad_dynamic,instruction_count,309979
runtime_overhead_inductor_requires_grad_backward_dynamic,instruction_count,77599
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163866
Approved by: https://github.com/jansel, https://github.com/mlazos, https://github.com/anijain2305
We were seeing instances of stdlib files in clang-tidy output so this
just essentially removes them from the things that lintrunner will
report up. Longer term fix here would be to just modify the clang-tidy
configuration in order to do the correct thing here but that requires a
bit more investigation as to why this is only happening in CI and is not
reproduceable locally.
Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164008
Approved by: https://github.com/ZainRizvi
While refactoring the bookkeeping for DeviceMesh while leveraging CuTe layout, we found that we need to have two more util functions. One is to check whether one layout has overlap inside it or not. For example, (2,2):(2:1) has no overlap while (2,2):(2:2) has overlap.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163367
Approved by: https://github.com/fegin
ghstack dependencies: #163212, #163288, #163928, #163930
Fixes#160598Fixes#160551Fixes#160507
This PR fixes a bug in the `test_garbage_collect_expandable` unit test where the finally block incorrectly re-reads the current per process memory fraction instead of setting the original value. With out the fix the other tests in the `test/test_cuda.py` test suite were impacted and failed with OOM error on ROCm.
This ensures proper cleanup and isolation of test state, maintaining test correctness and avoiding side effects like the below OOM error that it caused.
For example, `test_autocast_checkpointing` failed with the below error https://github.com/pytorch/pytorch/actions/runs/17982223758/job/51153974194 on ROCm
`torch.OutOfMemoryError: HIP out of memory. Tried to allocate 76.00 MiB. GPU 0 has a total capacity of 255.69 GiB of which 252.97 GiB is free. 1.20 GiB allowed; Of the allocated memory 1.14 GiB is allocated by PyTorch, with 17.00 MiB allocated in private pools (e.g., HIP Graphs), and 18.63 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164000
Approved by: https://github.com/jeffdaily
Fixes#163640
This PR avoids a mask left align check in the case that we're operating under torch.compile / torch.export. Originally, I planned to make a more invasive change to auto-disable the fast path entirely underneath torch.compile / torch.export, but I realized during testing that the fast path wasn't actually causing compile issues outside of the narrow issue identified here.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163773
Approved by: https://github.com/mikaylagawarecki
Summary:
As titled.
Without the diff, we got P1963055009
With the diff passing in the enviroment, we can do correct sym_int deduction:
https://fburl.com/mlhub/p5zy7o28
Test Plan:
```
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:unbacked_symints -- test_sdfpa_unbacked_strides --print-passing-details --env TORCHDYNAMO_EXTENDED_DEBUG_CPP=1 --env TORCHDYNAMO_EXTENDED_DEBUG_GUARD_ADDED="Eq(u0, 0)"
```
Without the fix: P1964887260
With the fix: P1964888579
Differential Revision: D83211018
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163925
Approved by: https://github.com/ColinPeppler
cuda-bindings>=12.0,<13.0 ; platform_machine != "s390x" and platform_system != "Darwin"
#Description: required for testing CUDAGraph::raw_cuda_graph(). See https://nvidia.github.io/cuda-python/cuda-bindings/latest/support.html for how this version was chosen. Note "Any fix in the latest bindings would be backported to the prior major version" means that only the newest version of cuda-bindings will get fixes. Depending on the latest version of 12.x is okay because all 12.y versions will be supported via "CUDA minor version compatibility". Pytorch builds against 13.z versions of cuda toolkit work with 12.x versions of cuda-bindings as well because newer drivers work with old toolkits.
# Copied from https://github.com/pytorch/test-infra/blob/be01a40157c36cd5a48391fdf44a7bc3ebd4c7e3/aws/ami/windows/scripts/Installers/Install-Pip-Dependencies.ps1#L16 with some adjustments
# pytest-rerunfailures==10.3 as 10.2 fails with INTERNALERROR> pluggy._manager.PluginValidationError: unknown hook 'pytest_configure_node'
Following is the release cadence. All future dates below are tentative. For latest updates on the release schedule, please follow [dev discuss](https://dev-discuss.pytorch.org/c/release-announcements/27). Please note: Patch Releases are optional.
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.