* [BE] [CD] Remove pygit2 dep for aarch64_wheel build (#144716)
As it's incompatible with Python 3.13t and is only used to fetch the branch name, which can be done by running
```
git rev-parse --abbrev-ref HEAD
```
Also, remove yet another reference to the long-gone `master` branch.
Test plan:
Download `manywheel-py3_11-cpu-aarch64.zip` produced by this PR, install it inside a Docker container, and check its version
```
# pip install torch-2.7.0.dev20250113+cpu-cp311-cp311-manylinux_2_28_aarch64.whl
...
Installing collected packages: mpmath, typing-extensions, sympy, networkx, MarkupSafe, fsspec, filelock, jinja2, torch
Successfully installed MarkupSafe-3.0.2 filelock-3.16.1 fsspec-2024.12.0 jinja2-3.1.5 mpmath-1.3.0 networkx-3.4.2 sympy-1.13.1 torch-2.7.0.dev20250113+cpu typing-extensions-4.12.2
root@434f2540345e:/# python
Python 3.11.9 (main, Aug 1 2024, 23:33:10) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.__version__
'2.7.0.dev20250113+cpu'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144716
Approved by: https://github.com/atalman
ghstack dependencies: #144696, #144697
(cherry picked from commit 58302c4eaa6e48fd503f6d4e18e5945954ed02be)
* [CD] Enable python3.13t builds for aarch64 (#144698)
But make sure that the right numpy version is picked (2.0.2 does not support 3.13)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144698
Approved by: https://github.com/atalman
ghstack dependencies: #144696, #144697, #144716
(cherry picked from commit 6053242890e91a78eb31f50d2d5cd3c2858feac1)
* Regenerate workflow
[CD] Fix slim-wheel nvjit-link import problem (#141063)
When another toolkit (say CUDA-12.3) is installed and `LD_LIBRARY_PATH` points to it, `import torch` will fail with
```
ImportError: /usr/local/lib/python3.10/dist-packages/torch/lib/../../nvidia/cusparse/lib/libcusparse.so.12: undefined symbol: __nvJitLinkComplete_12_4, version libnvJitLink.so.12
```
It could not be worked around by tweaking rpath, as it also depends on the library load order, which is not guaranteed by any linker. Instead, solve this by preloading `nvjitlink` right after global deps are loaded, by running something along the lines of the following
```python
if version.cuda in ["12.4", "12.6"]:
    with open("/proc/self/maps") as f:
        _maps = f.read()
    # libtorch_global_deps.so always depends on cudart; check if it's installed via wheel
    if "nvidia/cuda_runtime/lib/libcudart.so" in _maps:
        # If all above-mentioned conditions are met, preload nvjitlink
        _preload_cuda_deps("nvjitlink", "libnvJitLink.so.*[0-9]")
```
Fixes https://github.com/pytorch/pytorch/issues/140797
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141063
Approved by: https://github.com/kit1980
Co-authored-by: Sergii Dymchenko <sdym@meta.com>
(cherry picked from commit f2975717f3c268ca4164f92268fc4f4a8f080eb7)
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
[dtensor] improve doc of the DTensor class (#144099)
As titled: explicitly list all public members to make sure the public
API stays consistent, and use groupwise member ordering to make the doc
look better.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144099
Approved by: https://github.com/awgu
(cherry picked from commit 48a05ee7735709406b782474e66f0c6231e2ad2e)
Extend bmm tiling to work up to 2^32 elem in any single output dim (#143095)
The previous tiling implementation worked for up to 2^32 total elements per single batch entry. This extends the functionality to support the dimensions encountered in ComfyUI (output shape: 1,72250,72250).
Fixes #141909
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143095
Approved by: https://github.com/kulinseth
(cherry picked from commit afa313e669e53142618c9116c9337c7b7a54a9e9)
Co-authored-by: Joona Havukainen <jhavukainen@apple.com>
[ONNX] Avoid overwriting overlapped decomposed functions (#142831)
Fixes #141770
The decomposed functions in `torch.export.default_decompositions().items()` are overwritten by `torch._decomp.decomposition_table`. From the `torch.onnx.export()` perspective, we should respect the decomposition table in `torch.export.default_decompositions().items()` and avoid overwriting it with `torch._decomp.decomposition_table`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142831
Approved by: https://github.com/justinchuby
(cherry picked from commit 0ddb33ba2299542f9558d949aa482a9ca30ceb30)
Co-authored-by: titaiwangms <titaiwang@microsoft.com>
Amazon Linux 2023: Preload cusparseLt.so (#144477)
Fixes https://github.com/pytorch/pytorch/issues/144433
Tested with some debug statements added:
```
>>> import torch
trying to load libcublas.so.*[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/cublas/lib/libcublas.so.12']
trying to load libcublas.so.*[0-9] from /usr/local/lib/python3.9/site-packages/nvidia/cublas/lib/libcublas.so.12
trying to load libcudnn.so.*[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/cudnn/lib/libcudnn.so.9']
trying to load libcudnn.so.*[0-9] from /usr/local/lib/python3.9/site-packages/nvidia/cudnn/lib/libcudnn.so.9
trying to load libnvrtc.so.*[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/cuda_nvrtc/lib/libnvrtc.so.12']
trying to load libnvrtc.so.*[0-9] from /usr/local/lib/python3.9/site-packages/nvidia/cuda_nvrtc/lib/libnvrtc.so.12
trying to load libcudart.so.*[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/cuda_runtime/lib/libcudart.so.12']
trying to load libcudart.so.*[0-9] from /usr/local/lib/python3.9/site-packages/nvidia/cuda_runtime/lib/libcudart.so.12
trying to load libcupti.so.*[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/cuda_cupti/lib/libcupti.so.12']
trying to load libcupti.so.*[0-9] from /usr/local/lib/python3.9/site-packages/nvidia/cuda_cupti/lib/libcupti.so.12
trying to load libcufft.so.*[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/cufft/lib/libcufft.so.11']
trying to load libcufft.so.*[0-9] from /usr/local/lib/python3.9/site-packages/nvidia/cufft/lib/libcufft.so.11
trying to load libcurand.so.*[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/curand/lib/libcurand.so.10']
trying to load libcurand.so.*[0-9] from /usr/local/lib/python3.9/site-packages/nvidia/curand/lib/libcurand.so.10
trying to load libnvJitLink.so.*[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/nvjitlink/lib/libnvJitLink.so.12']
trying to load libnvJitLink.so.*[0-9] from /usr/local/lib/python3.9/site-packages/nvidia/nvjitlink/lib/libnvJitLink.so.12
trying to load libcusparse.so.*[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/cusparse/lib/libcusparse.so.12']
trying to load libcusparse.so.*[0-9] from /usr/local/lib/python3.9/site-packages/nvidia/cusparse/lib/libcusparse.so.12
trying to load libcusparseLt.so.*[0-9] from []
trying to load libcusparseLt.so.*[0-9] from /usr/local/lib/python3.9/site-packages/cusparselt/lib/libcusparseLt.so.0
trying to load libcusolver.so.*[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/cusolver/lib/libcusolver.so.11']
trying to load libcusolver.so.*[0-9] from /usr/local/lib/python3.9/site-packages/nvidia/cusolver/lib/libcusolver.so.11
trying to load libnccl.so.*[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/nccl/lib/libnccl.so.2']
trying to load libnccl.so.*[0-9] from /usr/local/lib/python3.9/site-packages/nvidia/nccl/lib/libnccl.so.2
trying to load libnvToolsExt.so.*[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/nvtx/lib/libnvToolsExt.so.1']
trying to load libnvToolsExt.so.*[0-9] from /usr/local/lib/python3.9/site-packages/nvidia/nvtx/lib/libnvToolsExt.so.1
/usr/local/lib64/python3.9/site-packages/torch/_subclasses/functional_tensor.py:275: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:81.)
cpu = _conversion_method_template(device=torch.device("cpu"))
>>> exit()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144477
Approved by: https://github.com/Skylion007, https://github.com/nWEIdia
(cherry picked from commit 2b241a8206843f43f0568b7b65473ebb593c4740)
Co-authored-by: atalman <atalman@fb.com>
[inductor][cpu] Fix bmm b_index for dynamic expressions in inductor autotuner (#143141)
Fixes #143102
Addresses 2 problems relating to dynamic batch size in BMM autotuner:
1. With dynamic batch size, the input may be a sympy Mult expression, such as `s0*8`, which occurs in many dynamo benchmark models. We address this by using `size_hints` to resolve any such expressions (see the sketch after this list). This is safe since this section of the code is only called to generate inputs for benchmarking.
2. Some epilogue nodes may use the dynamic batch size as part of their codegen, for example when an input to the epilogue node is transposed and has the dynamic batch size in the stride of other dimensions. When such epilogue nodes exist and the sizevar is not already present in `kernel.args`, a new sizevar with a new name is created. Subsequent calls to `def_kernel` could overwrite this variable name, so to avoid this we pass all the sizevars as `extra_sizevars` to the `def_kernel` calls for the GEMM functions, so no variable renaming happens later in the BMM definition.
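A small illustrative sketch of the first point (the `size_hints` mapping and `s0` names here are placeholders, not the autotuner's actual code): a dynamic batch expression is substituted with its size hint to obtain a concrete size for generating benchmarking inputs.
```python
# Hypothetical sketch: resolve a dynamic batch expression like s0*8 to a
# concrete value for benchmarking-input generation using a size hint.
import sympy

s0 = sympy.Symbol("s0")
batch_expr = s0 * 8            # dynamic batch size as seen by the autotuner
size_hints = {s0: 4}           # hint gathered for the dynamic dimension
concrete_batch = int(batch_expr.subs(size_hints))
print(concrete_batch)  # 32
```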
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143141
Approved by: https://github.com/jansel, https://github.com/leslie-fang-intel, https://github.com/jgong5
(cherry picked from commit 51a37a42e0e50df6b199732f2680afa5ed14c94f)
Co-authored-by: Mitchell, Frost <frost.mitchell@intel.com>
[CD] Aarch64 builds should not override `OVERRIDE_PACKAGE_VERSION` envvar (#144285)
Currently our nightly aarch64 binaries have the correct +cpu or +cu126 suffixes, but release binaries are missing them. To correct this and make sure our nightly and release binaries are consistent, I propose this change.
I see that the override is already set correctly in the release workflow:
https://github.com/pytorch/pytorch/actions/runs/12383179841/job/34565381200
For CPU:
```
OVERRIDE_PACKAGE_VERSION="2.6.0+cpu"
```
For CUDA:
```
OVERRIDE_PACKAGE_VERSION="2.6.0+cu126"
```
The removed code would set `OVERRIDE_PACKAGE_VERSION="2.6.0"` for both CUDA and CPU release builds.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144285
Approved by: https://github.com/malfet, https://github.com/tinglvv
(cherry picked from commit 8d35333498e9433a379611746c177285fa51c8c5)
Co-authored-by: atalman <atalman@fb.com>
Fix int8 mm V.ops.mul dispatching (#143127)
This is somewhat subtle: because we were doing `V.ops.mul` at binding time, we don't redispatch later when we invoke the epilogue, and we then run into the assertion checking added in the PR above.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143127
Approved by: https://github.com/drisspg
ghstack dependencies: #143048
(cherry picked from commit 7968732f5b84ac6509d800a54bfb23fb791d3b88)
Co-authored-by: eellison <elias.ellison@gmail.com>
Remove assert from partitioner.py (#143376)
Remove erroneous assert assuming a dependent (user) node to be in the partition. This partially reverts #136616 by removing the assert.
Tested locally with a failing ExecuTorch Arm test using
```
$ python -m examples.arm.aot_arm_compiler --model_name mv2 --target ethos-u55-128 --delegate --quantize
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143376
Approved by: https://github.com/tarun292
(cherry picked from commit 6829897682b2b46a592a92d84417cb4124c26e88)
Co-authored-by: Digant Desai <digantdesai@meta.com>
Enable CPP/CUDAExtension with py_limited_api for python agnosticism (#138088)
This is getting tested with ao, but now there is a real test I added.
## What does this PR do?
We want to allow custom PyTorch extensions to be able to build one wheel for multiple Python versions, in other words, achieve python agnosticism. It turns out that there is such a way that setuptools/Python provides already! Namely, if the user promises to use only the Python limited API in their extension, they can pass in `py_limited_api` to their Extension class and to the bdist_wheel command (with a min python version) in order to build one wheel that will suffice across multiple Python versions.
Sounds lovely! Why don't people do that already with PyTorch? Well, two things. This workflow is hardly documented (even searching for python agnostic specifically does not reveal many answers), so I'd expect that people simply don't know about it. But even if they did, _PyTorch_ custom Extensions would still not work, because we always link torch_python, which does not abide by py_limited_api rules.
So this is where this PR comes in! We respect when the user specifies py_limited_api and skip linking torch_python under that condition, allowing users to enroll in the provided functionality I just described.
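As a rough sketch of the workflow described above (the extension name and source file are made up; this is not the exact test added in this PR), a setup.py opting into the limited API might look like:
```python
# Hypothetical setup.py sketch: build one wheel usable across Python versions
# by promising to stick to the limited API (and, with this PR, skipping the
# torch_python link).
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CppExtension

setup(
    name="my_limited_api_ext",           # illustrative name
    ext_modules=[
        CppExtension(
            "my_limited_api_ext",
            ["my_limited_api_ext.cpp"],  # must only use the Python limited API
            py_limited_api=True,         # tells setuptools and (now) PyTorch
        )
    ],
    cmdclass={"build_ext": BuildExtension},
    options={"bdist_wheel": {"py_limited_api": "cp39"}},  # min Python version
)
```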
## How do I know this PR works?
I manually tested my silly little ultra_norm locally (with `import python_agnostic`) and wrote a test case for the extension showing that
- torch_python doesn't show up in the ldd tree
- no Py- symbols show up
It may be a little confusing that our test case is actually python-free (more clean than python-agnostic) but it is sufficient (and not necessary) towards showing that this change works.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138088
Approved by: https://github.com/ezyang, https://github.com/albanD
(cherry picked from commit be27dbf2b806e5d9c8d63ac6f6f96712299f98c3)
Co-authored-by: Jane Xu <janeyx@meta.com>
As per title.
The following implementation removes the usage of `repeat_interleave, tile` and `full_coo_indices` and replaces them with broadcasting. That way we reduce memory traffic (and are likely to hit cache a lot) and the total number of launched kernels.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142364
Approved by: https://github.com/amjames, https://github.com/cpuhrsch
This adds an option to cause automatic dynamic shapes to trigger
unbacked SymInts rather than backed SymInts. This can potentially
help if you are still seeing recompilations from 0/1 specialization
but it also might just cause your program to fail with
GuardOnDataDependent errors.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141415
Approved by: https://github.com/bobrenjc93
Motivation: Generalize unit tests so that can be executed for cuda and non cuda devices.
Dependency: #133209, merged now.
There was a previous PR #135242 for these changes, which was closed due to incorrect commits. I have incorporated the changes as suggested in the comments.
@kwen2501 @zeshengzong Please review the changes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139184
Approved by: https://github.com/kwen2501
Co-authored-by: Yu, Guangye <guangye.yu@intel.com>
@tugsbayasgalan found a bug for nested subclasses:
E.g. we have
TwoTensor(TwoTensor(t1, t2), t0).
After right_inverse we have:
rebuilt_stack == [(TwoTensor, meta, ["a", "b"]), (TwoTensor, meta, ["a", "b"])]
plain_tensors == [t0, t1, t2]
We will first put plain tensors, and only then the nested TwoTensor.
But when we unflatten:
todo = [t0, t1, t2]
we first create TwoTensor(t1, t2),
put it into todo: [t0, TwoTensor(t1, t2)],
and as a result get
TwoTensor(t0, TwoTensor(t1, t2)),
which swaps the original a and b :)
So the fix should be different: we need to preserve the order of elements in the stack for plain tensors and subclasses.
I will think about the fix.
Fix:
Keep the order of inner_tensor_attr_names according to the order in which they were added to the stack (first plain tensor attributes, then subclass attributes).
Test:
```
python test/functorch/test_aotdispatch.py -k test_subclass_parameters
```
Differential Revision: [D67032477](https://our.internmc.facebook.com/intern/diff/D67032477)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142481
Approved by: https://github.com/tugsbayasgalan, https://github.com/bdhirsh
Added a check in aten/src/ATen/native/EmbeddingBag.cpp that checks whether per_sample_weights needs a gradient in order to determine whether at::_embedding_bag_forward_only or at::_embedding_bag should run.
Also added two tests in test_embedding.py that check that the call now works.
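A minimal repro sketch of the scenario this addresses (shapes and values are illustrative): with per_sample_weights requiring grad, the forward must pick at::_embedding_bag (not the forward-only variant) so the backward can run.
```python
import torch
import torch.nn.functional as F

weight = torch.randn(10, 3)
input = torch.tensor([1, 2, 4, 5, 4, 3, 2, 9])
offsets = torch.tensor([0, 4])
per_sample_weights = torch.rand(8, requires_grad=True)

# per_sample_weights is only supported with mode="sum"
out = F.embedding_bag(input, weight, offsets, mode="sum",
                      per_sample_weights=per_sample_weights)
out.sum().backward()
print(per_sample_weights.grad.shape)  # torch.Size([8])
```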
Fixes #136457
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142338
Approved by: https://github.com/soulitzer
Summary: This diff implements the interface of the "torch.mtia.max_memory_allocated" API. The internal implementation will be addressed in a separate diff.
Test Plan:
Passed a local unit test: `buck run //mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api`
```
----------------------------------------------------------------------
Ran 15 tests in 16.862s
OK
I1127 11:31:14.613909 2272144 afg_bindings.cpp:943] afg-aten::mul.out-dtype_Float-uqJKuNc0 executable has been unloaded
I1127 11:31:14.615438 2272144 afg_bindings.cpp:943] afg-add-dtype_Float-fa37JncC executable has been unloaded
```
Reviewed By: ttrung149, nautsimon
Differential Revision: D66553954
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142406
Approved by: https://github.com/nautsimon
This PR implements the deduplication pass (blocked by config currently) for dynamo where identical regions from https://github.com/pytorch/pytorch/pull/141381 are replaced with a common subgraph.
The two phases of deduplication are explained below.
**Subgraph creation**:
Subgraph creation works by taking one representative region from each region group and creating a subgraph from it, which will then be used to replace all regions in the group. This is implemented by first copying all nodes of the region to the new subgraph and then finding all inputs which are not within the region and creating placeholders for them. For the outputs, all regions in a region group need to be scanned to ensure the largest set of outputs is found, and then an output node is created which returns a tuple of all outputs.
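The sketch below illustrates the subgraph-creation idea with plain torch.fx (it is not the actual dynamo implementation and skips the output-union scan across all regions described above): external inputs become placeholders, region nodes are copied, and externally used nodes become a tuple output.
```python
import torch.fx as fx

def extract_region_to_subgraph(region: list[fx.Node]) -> fx.Graph:
    # Illustrative only; region is assumed to be in topological order.
    subgraph = fx.Graph()
    region_set = set(region)
    env = {}  # original node -> corresponding node in the new subgraph
    # Inputs produced outside the region become placeholders.
    for node in region:
        for arg in node.all_input_nodes:
            if arg not in region_set and arg not in env:
                env[arg] = subgraph.placeholder(arg.name)
    # Copy the region's nodes, remapping their args through env.
    for node in region:
        env[node] = subgraph.node_copy(node, lambda n: env[n])
    # Nodes with users outside the region become the subgraph outputs (a tuple).
    outputs = tuple(env[n] for n in region
                    if any(u not in region_set for u in n.users))
    subgraph.output(outputs)
    return subgraph
```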
**Graph replacement**:
To replace each region with the extracted subgraph, the node index in the region and argument index within the node's flattened args and kwargs are recorded once during subgraph creation. This allows us to determine which (external to the region) nodes and in which order these nodes are passed as inputs. For the outputs, getitem nodes are created for each output, and all nodes in the region with external outputs are replaced by the proper getitem node. Finally, all original nodes are erased (there should be no uses of these left in the graph).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141383
Approved by: https://github.com/zou3519
ghstack dependencies: #141381, #141382
This PR implements graph region tracking for later extraction into common subgraphs. The algorithm is as follows:
`GraphRegionTracker` tracks each node added to the output graph and generates a key based on the source location, instruction pointer, input shapes, and global state at the time the node is inserted into the graph. Nodes with the same key are grouped together in a list of identical nodes.
Once graph capture is complete, these nodes are organized into region groups. A region group looks like this:
[[IdenticalNode1], [IdenticalNode2], [IdenticalNode3]] and each sublist is called a region. For each region group (starting at the topologically latest region group), the inner regions are gradually expanded one node at a time from the args and kwargs of the node in each region, provided that for all regions in the group the nodes being added are also identical (i.e. have the same key computed above). The `get_identical_regions` function is the main entry point which will be used by the graph replacement algorithm in #141383
Edge cases to add more testing for in future PRs (in progress):
* ~~multiple nodes on the same line~~ (implemented)
* ~~dynamic shapes checking (need to verify symbolic inputs are the same across subgraphs)~~ (implemented)
* ensure we don't expand regions where it will create a cycle during subgraph replacement
* ensure outputs are always tensors (or tuples of tensors iirc)
* ~~out of order kwargs, unevenly nested kwargs~~ (implemented)
* input aliasing - TBD, we may add support for this in `invoke_subgraph` or reuse the aliasing analysis here to not form regions with these properties
* ~~all global state~~ (implemented)
Other followups:
* consolidate global state checking across all caching infra
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141381
Approved by: https://github.com/zou3519
This allows one to do something like this:
```python
import torch
x = torch.ones(10, device="mps")
m = torch.mps._compile_shader("""
kernel void foo(device float* x, uint idx [[thread_position_in_grid]]) {
    x[idx] += idx;
}
""")
m.foo(x)
```
And in general enables writing custom operators using Metal shaders purely in Python
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141478
Approved by: https://github.com/manuelcandales
Instead of an ancient prebuilt binary
This is a followup from https://github.com/pytorch/pytorch/pull/121323
For some reason, newer `sccache` does not work when `gcc` is invoked with the `-E` option, so one has to special-case `-E` in the `/opt/ccache/bin/gcc` wrapper, which also had to be special-cased to work with `nvcc` by checking whether `-E` is passed not only as the first or second, but also as the 3rd argument (to be followed up by a generic https://github.com/pytorch/pytorch/pull/142813 ), i.e. to generate the following wrapper:
```shell
#!/bin/sh
if [ "$1" = "-E" ] || [ "$2" = "-E" ] || [ "$3" = "-E" ]; then
exec /usr/bin/gcc "$@"
elif [ $(env -u LD_PRELOAD ps -p $PPID -o comm=) != sccache ]; then
exec sccache /usr/bin/gcc "$@"
else
exec /usr/bin/gcc "$@"
fi
```
Without it, `sccache nvcc hello.cu` failed with a non-descriptive
```
sccache: error: failed to execute compile
sccache: caused by: Compiler not supported: ""
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140614
Approved by: https://github.com/wdvr
Co-authored-by: Wouter Devriendt <wouterdevriendt@meta.com>
Summary:
Change the log back to VLOG(2) in waitForFutureOrTimeout. Instead, print a more user-friendly message if FR completes successfully.
This message is meant for developers only - so don't default to `INFO` in this function.
Also, change one more message from LOG(ERROR) to LOG(INFO).
Tested locally.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142505
Approved by: https://github.com/kwen2501
Summary:
# Why
- sampling the same config multiple times is wasteful, especially on exhaustive
- for AMD we rewrite the configs to have a specific number of stages, which might lead to some configs appearing multiple times
# What
cast the configs, already defined as a tuple, through a set to remove duplicates
Test Plan:
taken from the `mm_kernel_configs` logic in the same file
```
>>> mm_kernel_configs = [ {"config": (BLOCK_M, BLOCK_N, BLOCK_K, num_stages, num_warps), "cond": True} for BLOCK_M, BLOCK_N, BLOCK_K in itertools.product( [16, 32, 64, 128, 256], repeat=3 ) for num_stages in [1, 2, 3, 4, 5] for num_warps in [2, 4, 8] ]
>>> configs = [c['config'] for c in mm_kernel_configs]
>>> a = tuple((c[0], c[1], c[2], 0, c[4]) for c in configs)
>>> len(set(a))
375
>>> len(a)
1875
>>>
```
Differential Revision: D66893774
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142254
Approved by: https://github.com/henrylhtsang
The idea is that the parent hop's fake tensor mode should ignore the newly allocated unbacked symints in the subgraph, because the bindings of unbacked symbols in the subgraph should already be done when we trace the subgraph. E.g., if there's an operator in the subgraph that produces unbacked symints, the track_tensor_tree logic for that operator will take care of it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142031
Approved by: https://github.com/zou3519
ghstack dependencies: #142162
For prologues which only do either loads like gathers or dtype conversions, and no actual arithmetic on lower-precision types, we can codegen them without upcasting to fp32 without changing numerics.
Prologues that actually do arithmetic will need to use invoke quant. But I would like to support upcasts/gathers out of the box.
We could potentially extend this in the future to avoid upcasting max pooling operations as well, if there were perf benefits to be had (less likely).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142402
Approved by: https://github.com/jansel
ghstack dependencies: #134532, #142350, #142400, #142401
We load inputs to prologue fusion with a mask. That mask must still be zero before we run `tl.dot`. Previously, we would always apply the mask:
```
tmp0 = tl.load(in_ptr1 + (tl.broadcast_to(xindex, xindex.shape)), a_mask, eviction_policy='evict_last')
tmp1 = tmp0.to(tl.float32)
a = tl.where(a_mask, tmp1, 0.0)
```
now we do not need to ->
```
tmp0 = tl.load(in_ptr1 + (tl.broadcast_to(xindex, xindex.shape)), a_mask, eviction_policy='evict_last')
tmp1 = tmp0.to(tl.float32)
a = tmp1
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142401
Approved by: https://github.com/jansel
ghstack dependencies: #134532, #142350, #142400
We introduced a special graph break to avoid max-recursion-depth error
in #100296.
After #111415, the original `test_list_self_reference` no longer
triggers the special graph break because we started modeling root frame
free variables with `LazyVariableTracker`.
After #117426, we no longer build the list items eagerly, and they'll hit
`variable_tracker_cache` when they get lazily constructed later.
As a result, this patch updates the `test_list_self_reference` test and
removes the special graph break.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142438
Approved by: https://github.com/jansel
ghstack dependencies: #142437
Dynamo was generating `GetItemSource(tuple_source, index)` for items of
`NamedTupleVariable`, but that stops working when a user-supplied named
tuple has a custom `__getitem__` function with different semantics.
This patch
- fixes the aforementioned issue by using `AttrSource` instead.
- handles named tuple outside `wrap_listlike`, by removing the special
case of named tuple in `BaseListVariable.cls_for_instance`, since the
semantics of named tuple is different enough.
- makes sure all constructions of `NamedTupleVariable` have items with
proper sources.
Fixes #142399.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142437
Approved by: https://github.com/jansel
This patch applies a local and practical workaround for custom dict
construction when multiple inheritance is involved.
Handling multiple inheritance in general could be a lot more involved,
so I created #142414 to track that.
Fixes #141118.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142416
Approved by: https://github.com/jansel
Summary: We already have CUDA OVERHEAD events enabled in on-demand so we should also add them to auto-trace
Test Plan: Tested using internal performance suites and found no noticeable performance change
Differential Revision: D66904879
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142271
Approved by: https://github.com/ngimel
**Summary**
This PR adds a new experimental API `set_rotate_method` for Context Parallel. This API allows the user to choose the desired communication method (between all-to-all and all-gather) for shard rotation, as shown in the sketch below.
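A minimal usage sketch; the exact import path is an assumption (the API is experimental and lives under the experimental Context Parallel attention module), and the string values mirror the two rotation methods described above.
```python
# Hypothetical usage: pick all-to-all instead of all-gather for shard rotation.
from torch.distributed.tensor.experimental._attention import set_rotate_method

set_rotate_method("alltoall")  # or "allgather"
```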
**Test**
`pytest test/distributed/_tensor/test_attention.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142093
Approved by: https://github.com/fegin
Summary:
Change log message for future execution back from VLOG(2) to LOG(INFO).
This message is useful for Flight Recorder to verify that flight recorder dumps completed successfully (or not).
Test Plan: Tested manually on a mast job and noted that the INFO message was as expected.
(meta only link: https://fburl.com/mlhub/iui2tpc9)
```
[trainer5]:I1208 10:21:00.772841 7528 ProcessGroupNCCL.cpp:1294] [PG ID 0 PG GUID 0(precheck) Rank 21] future is successfully executed for: Flight recorder dump in heartbeatMonitor
```
Differential Revision: D66996439
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142441
Approved by: https://github.com/fduwjj
Fixes https://github.com/pytorch/pytorch/issues/141305.
```python
class M(torch.nn.Module):
    def forward(self, x, y, z):
        a = y.shape[0]
        b = z.shape[0]

        def true_fn(x):
            return x + a

        def false_fn(x):
            return x + b * z

        # When exporting with non-strict: a and b are symints,
        # so torch.compile needs to wrap and trace symint inputs.
        return torch.cond(x.shape[0] > 5, true_fn, false_fn, (x,))
```
In non-strict export, when inputs are annotated with dynamic shapes, `a` and `b` in the above example are of torch.SymInt type. true_fn and false_fn will then have closures over torch.SymInt values. The error is triggered because we didn't handle SymInt inputs in dynamo and ended up using a UserDefinedObjectVariable for them, which doesn't have a proxy. We added support by following how we handled SymBool inputs previously.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141524
Approved by: https://github.com/zou3519
ghstack dependencies: #141610, #142185
Summary:
In the thrift schema, we represent every None value as "True/False", while we represent None as () in the OSS schema. This causes some inconsistency between the type systems, and the simplest thing to do here is changing Tuple[()] to bool in the OSS schema.
This change should NOT cause a version bump, because on the deserializer side we never read the value from as_none fields, as it doesn't have real meaning. Therefore this schema change should be considered safe.
Test Plan: CI
Reviewed By: SherlockNoMad
Differential Revision: D66888892
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142257
Approved by: https://github.com/yiming0416, https://github.com/hl475
This PR extends our ability to fuse pointwise nodes onto triton templates with the ability to fuse pointwise nodes into triton templates - prologue fusion.
Similar to the store_output api:
`{{store_output(("idx_m", "idx_n"), "acc", "mask")}}`
And the modification api:
```
{{ modification(
subgraph_number=0,
output_name="post_mod_scores",
score="qk",
out="qk"
) | indent_except_first(1) }}
```
We have:
```{{load_input("B", "b", ("idx_m", "idx_n"), mask=None if EVEN_K else "b_mask", indent_width=8)}}```
Because we are now loading the input with explicit indices and mask, I needed to rewrite the mm kernel to no longer update the [pointers by BLOCK_K](bb03ef7aca/torch/_inductor/kernel/mm.py (L110-L111)) on every iteration and instead compute the indices from the k_idx of each loop iteration. This did not have any perf difference.
There are a couple main use cases for prologue fusion:
- Fusing dequants into a matmul. particularly for more bandwidth bound scenarios.
- Fusing gather into a matmul. This is useful particularly in MOE. See https://github.com/pytorch/pytorch/issues/134535 for more details.
Prologue fusion is generally much less profitable than epilogue fusion, because it must be applied to an element of an input on each loop of the matmul, compared to only once in the epilogue (gather into matmul is a potential exception). Accordingly, we are much less aggressive in attempting prologue fusion. We only attempt fusion if it does not increase the number of memory bytes read inside the triton template, multiplied by a small factor to allow gathers. This restricts reliably unprofitable fusions like fp32->fp16 inside the kernel. In a future PR we could potentially add an api for being more aggressive if we know we are in a bandwidth-bound regime. See: https://github.com/pytorch/pytorch/pull/134532/files#diff-d2539c9c8dc6a3d7e457767a880612e96d3c85752a77ead49a9e4e00a3e4c3c7R3060-R3066
Other notes:
By default we will upcast to fp32 inside every kernel. This matches eager numerics. This is fine enough for epilogue because it is only done once (although it is probably unnecessary for say a relu) but tanks perf for prologue. I am currently using the `codegen_upcast_to_fp32` option to avoid it, but that will not work for libdevice calls that require fp32. We will need https://github.com/pytorch/pytorch/pull/136778/ and dtype-aware codegen to upcast fp16 ops into libdevice calls.
With prologue fusion, we now have essentially separate kernels for each input, and for the output. I had to increase the number of fields that are swapped out in `set_subgraph_body` by a large number :/ I also updated the fusion logic because the inputs will have a different group than the outputs. Maybe as part of enabling multiple outputs, this could get cleaned up a bit.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134532
Approved by: https://github.com/jansel
This PR brings the FlexAttention inference support for the inductor backend in torch.compile (support precisions: bf16 and fp32) on CPUs.
Based on the existing CPP template, this PR extends and implements a FlexAttention CPP template to support broad attention variants, and meanwhile brings optimized performance on CPUs.
With this, users can transparently extend their Flex Attention usages to CPUs with good and common support from torch.compile, both functionality and performance.
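A small usage sketch of what this enables (shapes, dtype, and the score_mod are illustrative, not taken from the PR's tests): compiling flex_attention for bf16 inference on CPU.
```python
import torch
from torch.nn.attention.flex_attention import flex_attention

def rel_bias(score, b, h, q_idx, kv_idx):
    # simple relative-position bias, purely illustrative
    return score + (kv_idx - q_idx)

q, k, v = (torch.randn(2, 4, 128, 64, dtype=torch.bfloat16) for _ in range(3))
compiled_flex = torch.compile(flex_attention)  # lowered via the CPP template on CPU
out = compiled_flex(q, k, v, score_mod=rel_bias)
print(out.shape, out.dtype)  # torch.Size([2, 4, 128, 64]) torch.bfloat16
```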
For UT tests, in this PR we include a subset of critical tests for CPUs, as listed below (inference tests only):
```
pytest test/inductor/test_flex_attention.py
`TestFlexAttention`
#common functions:
run_test
preprocess_paged_attention
run_paged_attention
run_test_with_paged_attention
run_test_with_call
run_dynamic_test
run_automatic_dynamic_test
#test functions:
test_builtin_score_mods
test_builtin_score_mods_automatic_dynamic
test_builtin_score_mods_different_seqlen
test_builtin_score_mods_different_block_size
test_kv_batch_broadcast
test_GQA
test_cpu_error_message_return_lse
test_validate_cpu_dtype_error_message
`TestPagedAttention`
#test function:
test_paged_builtin_score_mods
```
For the remaining UTs in `test/inductor/test_flex_attention.py` and `test/inductor/test_flex_decoding.py`, due to the large number of changed lines (1500+ LOC) that would make this PR hard to review, we will submit another PR specifically for enabling and refactoring the CPU device UTs.
Besides, more optimizations are planned in follow-up PRs, including:
- Block sparse computation
- Flash decoding tuning
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141453
Approved by: https://github.com/drisspg, https://github.com/leslie-fang-intel
Co-authored-by: Wu, Chunyuan <chunyuan.wu@intel.com>
Summary: Provide a standalone path to compile and run a ExportedProgram in C.
Test Plan:
(1) Generate a compiled model from ExportedProgram
```
python generate_lowered_cpu.py --input-path /tmp/$USER/ep.pt --output-path /tmp/$USER/final.pt
```
(2) Compile a standalone test runner
```
TORCH_ROOT_DIR=/data/users/$USER/pytorch sh standalone_compile.sh standalone_test.cpp standalone_test.out
```
(3) Run test for the compiled model in step (1)
```
LD_LIBRARY_PATH=/data/users/$USER/pytorch/build/lib ./standalone_test.out /tmp/$USER/final.pt
```
Differential Revision: D66872380
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142327
Approved by: https://github.com/hl475
We found that [this commit](6eca0aee76) caused a ~6% performance drop in ViT INT8. This was due to changes to the `numbytes_hint` for `NoneLayout`. In this PR, we reverted the changes in `numbytes_hint` to allow more fusions.
```
class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.dense = torch.nn.Linear(768, 768)
        self.layernorm = torch.nn.LayerNorm(768, eps=1e-12)

    def forward(self, context_layer, hidden_states):
        attention_output = self.dense(context_layer)
        hidden_states = attention_output + hidden_states
        layer_output = self.layernorm(hidden_states)
        return layer_output
```
The generated code before (left) and after (right) this PR is shown in the screenshot attached to the PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141766
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jansel
Combining several fixes to unflatten for bugs revealed by random graph testing.
The fixes target two categories of bugs:
1. Some bugs show up as exponential blowups for largish systems of nn modules. These are fixed by converting lists to sets, using caching, or otherwise rewriting to reuse computation more efficiently.
2. Other bugs were due to missing intermediate modules created when attributes such as submodules and buffers are accessed through longish paths before calling the corresponding intermediate modules, or missing attributes such as buffers and constants in submodules corresponding to multiple calls.
Differential Revision: [D66659795](https://our.internmc.facebook.com/intern/diff/D66659795/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142141
Approved by: https://github.com/ydwu4
**Summary:**
(Since I am trying the other solution for https://github.com/pytorch/pytorch/pull/141082, I moved out the test case fixes from that pr to a separate pr to land first.)
-----
Testing float8 dynamic scaling case with `TORCHINDUCTOR_LOOP_ORDERING_AFTER_FUSION=1` didn't make any difference.
The test case for fp8 (https://github.com/pytorch/pytorch/blob/main/test/inductor/test_loop_ordering.py#L425) is also failing, https://www.internalfb.com/intern/test/844425111960859?ref_report_id=0
-------
The main change here is to modify the condition of calling `loop_reordering` from `shared_data_score == 0` to `shared_data_score < config.score_fusion_memory_threshold`.
Before the change:
`shared_data_score > 0 -> won't loop_reorder -> can't fused because of shared_data_score < config.score_fusion_memory_threshold`
After the change:
`shared_data_score > 0 -> loop_reorder (shared_data_score < config.score_fusion_memory_threshold) -> get a larger shared_data_score -> fused`
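A paraphrased sketch of the condition change (not verbatim inductor code): reordering is now attempted whenever the shared-data score is still below the fusion threshold, rather than only when it is exactly zero.
```python
def should_try_loop_reordering(shared_data_score: int, fusion_threshold: int) -> bool:
    # old behavior: return shared_data_score == 0
    return shared_data_score < fusion_threshold
```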
----
It's the same issue as fixed in https://github.com/pytorch/pytorch/pull/136782. But the condition to call loop_reorder might be changed later, causing the test case to fail again.
**Test Plan:**
```
buck2 test 'fbcode//mode/opt' caffe2/test/inductor:loop_ordering
```
And ran a float8 dynamic scaling training script to verify it e2e
-----
Differential Revision: D66906175
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142273
Approved by: https://github.com/eellison
As `torch._C._scatter` is only defined for CUDA/ROCm (and maybe XPU?)
This is a regression introduced by https://github.com/pytorch/pytorch/pull/141098 that went unnoticed due to https://github.com/pytorch/pytorch/issues/142206
Test plan:
```
python test_autograd.py -v -k test_dataparallel_saved_tensors_hooks
```
Before this change it failed with
```
ERROR: test_dataparallel_saved_tensors_hooks (__main__.TestMultithreadAutograd.test_dataparallel_saved_tensors_hooks)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/Users/malfet/git/pytorch/pytorch/torch/testing/_internal/common_utils.py", line 3108, in wrapper
method(*args, **kwargs)
~~~~~~^^^^^^^^^^^^^^^^^
File "/Users/malfet/git/pytorch/pytorch/test/test_autograd.py", line 13074, in test_dataparallel_saved_tensors_hooks
model = torch.nn.DataParallel(Model())
File "/Users/malfet/git/pytorch/pytorch/torch/nn/parallel/data_parallel.py", line 153, in __init__
raise RuntimeError("no available devices were found")
RuntimeError: no available devices were found
```
After this change it passes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142448
Approved by: https://github.com/kit1980
Summary:
Currently, CUB_VERSION is 0 for USE_ROCM.
CUB_VERSION is used to determine whether to use advanced cub APIs for some implementations.
Test Plan:
`buck2 build --flagfile fbsource//arvr/mode/win/vs2022/cpp20/cuda12_5/dev --flagfile fbsource//arvr/mode/cuda/rtx30 fbsource//arvr/libraries/eye/apollo_visualizer:unit_test_apollo_hu_module_capability`
`buck2 build --flagfile fbcode//mode/amd-gpu fbcode//aiplatform/modelstore/checkpointing/pyper:tensor_save_load_utils`
Differential Revision: D63054638
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140861
Approved by: https://github.com/eqy, https://github.com/zoranzhao, https://github.com/houseroad
### Motivation:
As design illustrated in Intel distributed support RFC https://github.com/pytorch/pytorch/issues/141741, two sections are needed to enable intel distributed backend (`XCCL`) support in PyTorch.
1. Intel GPU distributed Backend integration in PyTorch `torch-xpu-ops`.
2. **Intel distributed Backend register in PyTorch distributed package**. This PR is to contribute section 2 change.
### Example:
Here is a simple example of using spawn to launch XCCL backend and perform allreduce on XPU tensors.
```
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def setup(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group(rank=rank, world_size=world_size)

def cleanup():
    dist.destroy_process_group()

def run_allreduce(rank, world_size):
    setup(rank, world_size)
    device = torch.device('xpu:{}'.format(rank))
    x = torch.randn([2, 2], device=device)
    dist.all_reduce(x)
    cleanup()

if __name__ == '__main__':
    world_size = 2
    mp.spawn(run_allreduce, args=(world_size,), nprocs=world_size, join=True)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141856
Approved by: https://github.com/kwen2501, https://github.com/gujinghui, https://github.com/albanD
Internal only: the previous approach meant that `from torch.distributed._composable.fsdp import fully_shard` was importing `fully_shard.py`, not the function `fully_shard`. For some reason, the resolution order is different from open source.
To fix this, we match the old import as closely as possible. Namely, we import `fully_shard.py` contents from `.fully_shard`. This should force that import to take precedence.
@diff-train-skip-merge
Differential Revision: [D66990327](https://our.internmc.facebook.com/intern/diff/D66990327)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142419
Approved by: https://github.com/weifengpy
Preparatory refactor for https://github.com/pytorch/pytorch/pull/137243.
# Feature
Follow up to https://github.com/pytorch/pytorch/pull/141751. Since we now represent `numels` as a dict, it's natural to extend this to `size_hints`. The latter are basically just the former rounded up to the nearest power of 2. This simplifies various heuristics such as the coordinate descent tuner. Where we previously needed to determine which index in `size_hints` corresponds to each dimension, now we can just query by prefix. This will be especially important when we enable 2D reductions, as it becomes harder to keep track of these things when we have multiple reduction dimensions. (See the previous PR for some examples.)
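A toy sketch of the representation change (not the actual inductor code): both numels and size_hints are keyed by prefix, with each hint being the numel rounded up to the next power of 2, so heuristics can query by prefix instead of positional index.
```python
def next_power_of_2(n: int) -> int:
    return 1 << max(n - 1, 0).bit_length()

numels = {"x": 1000, "r": 769}  # e.g. a pointwise dim and a reduction dim
size_hints = {prefix: next_power_of_2(n) for prefix, n in numels.items()}
print(size_hints)  # {'x': 1024, 'r': 1024}
```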
# Test plan
The existing CI provides good coverage. This PR modifies a few tests which explicitly constructed size hints.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142249
Approved by: https://github.com/jansel
In nonblocking mode, we always check if the NCCL communicator is ready between issuing commands to it.
Today this is done by the `waitReady()` function.
Unfortunately, the `waitReady()` function is hard-coded to use `C10D_NCCL_CHECK_TIMEOUT_SLEEP`, which sleeps for an interval between two consecutive checks.
While this is nice when waiting for comm init or finalize, it degrades the performance of collective calls (which would almost certainly return success immediately).
This PR adds a `bool longInterval` argument to `waitReady` and let call site determine whether long wait is likely; if not, `waitReady` would use `sched_yield()` to more eagerly check for readiness.
Thanks @eqy for reporting the issue that small collectives have a perf impact in nonblocking mode.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142291
Approved by: https://github.com/eqy, https://github.com/fduwjj
Making CUDA or NCCL calls in object destruction can be dangerous because the CUDA context may have exited before the destructor runs, in which case the CUDA calls would see a "CUDA driver shutting down" error.
This PR takes a destroy call away from the NCCLComm dtor and doesn't add a new one. If users are calling destroy_process_group or abort_process_group as recommended, then we are destroying for them; otherwise we are OK with letting them possibly leak resources (and get a warning).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141511
Approved by: https://github.com/eqy, https://github.com/wconstab
ghstack dependencies: #141510
The usage of `onCompletionHook` is mostly similar to what Flight Recorder does today -- for example, measuring how long a collective takes and put it into a profiler's "database".
Since FR already records and can dump info like this, we are considering deprecating the onCompletionHook support to save a side thread. (Each PG runs 3 side threads today, which is resource consuming and complicates the code)
User can file an issue if additional information needs to be recorded.
They can also file an RFC if Flight Recorder needs to accept plugins that customize the recording.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142390
Approved by: https://github.com/fduwjj, https://github.com/fegin
Fixes https://github.com/pytorch/pytorch/issues/141120
Registering a fallthrough for a backend correctly alters nonFallthroughKeysPerBackend_[backend_idx]. However, the backend_idx calculation does not take into account the local dispatch key set, which is used to temporarily turn on Meta as a backend. This means that makeFallthrough does not behave exactly as if it was a normal function which redispatched rather than a "fake function" implemented with a key mask.
So e.g. impl::computeDispatchKeySet(ks, nonFallthroughKeysPerBackend_[backend_idx]); will exclude keys like Meta which may be in the TLS include set.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141581
Approved by: https://github.com/bdhirsh
Summary:
This diff refactors the code for the "torch.mtia.memory_stats" API to maintain the same file hierarchy as its CUDA counterpart:
- All device memory APIs are now located under ".../mtia/memory.py".
- Device memory APIs can be accessed using either "torch.mtia.XYZ" or "torch.mtia.memory.XYZ".
Test Plan:
Passed a local unit test: `buck run //mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api`
```
Ran 14 tests in 16.657s
OK
I1127 11:06:06.505201 2133030 afg_bindings.cpp:943] afg-aten::mul.out-dtype_Float-bBtLGD6Y executable has been unloaded
I1127 11:06:06.506654 2133030 afg_bindings.cpp:943] afg-add-dtype_Float-fa37JncC executable has been unloaded
W1127 11:06:08.731138 2133030 HazptrDomain.h:148] Tagged objects remain. This may indicate a higher-level leak of object(s) that use hazptr_obj_cohort.
```
Differential Revision: D66549179
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141723
Approved by: https://github.com/nautsimon
Added c7564f31f7/wheel/build_wheel.sh to `.ci/wheel/` folder
Commented out the call to 39532891a0/run_tests.sh, because since 2018 this script has just checked that the tests folder is there and exited, as there is no way to run all pytorch tests in a single shard; see this logic:
```bash
#!/bin/bash
set -eux -o pipefail
# Essentially runs pytorch/test/run_test.py, but keeps track of which tests to
# skip in a centralized place.
#
# TODO Except for a few tests, this entire file is a giant TODO. Why are these
# tests # failing?
# TODO deal with Windows
# This script expects to be in the pytorch root folder
if [[ ! -d 'test' || ! -f 'test/run_test.py' ]]; then
  echo "builder/test.sh expects to be run from the Pytorch root directory " \
       "but I'm actually in $(pwd)"
  exit 2
fi
# Allow master skip of all tests
if [[ -n "${SKIP_ALL_TESTS:-}" ]]; then
  exit 0
fi
```
https://github.com/pytorch/pytorch/pull/123390 was a mistaken attempt to interpret the above-mentioned logic, as run_tests will be skipped if `${SKIP_ALL_TESTS}` is a non-empty string
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142277
Approved by: https://github.com/huydhn, https://github.com/atalman
ghstack dependencies: #142276
Over time, a large number of the existing type ignores have become irrelevant/unused/dead as a result of improvements in annotations and type checking.
Having these `# type: ignore` linger around is not ideal for two reasons:
- They are verbose/ugly syntactically.
- They could hide genuine bugs in the future, if a refactoring would actually introduce a bug but it gets hidden by the ignore.
I'm counting over 1500 unused ignores already. This is a first PR that removes some of them. Note that I haven't touched type ignores that looked "conditional" like the import challenge mentioned in https://github.com/pytorch/pytorch/pull/60006#issuecomment-2480604728. I will address these at a later point, and eventually would enable `warn_unused_ignores = True` in the mypy configuration as discussed in that comment to prevent accumulating more dead ignores going forward.
This PR should have no effect on runtime at all.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142325
Approved by: https://github.com/Skylion007, https://github.com/janeyx99
Summary: Before we would recompile the model unnecessarily even
if the model is already in the desired mode. For training
frameworks that assume `model.train()` is idempotent and calls
this before every single training step, this led to a bunch of
tiny graphs and poor performance. This commit makes these calls
no-ops if we're already in the target train/eval mode.
Test Plan:
python test/test_quantization -k TestQuantizePT2E.test_allow_exported_model_train_eval_idempotent
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142239
Approved by: https://github.com/jerryzh168
CUDA 12.4 is the default now. This frees up some resources. This also fixes newly added Python 3.13 job by #140733. That PR missed adding the new Docker image `pytorch-linux-focal-cuda12.4-cudnn9-py3.13-gcc9-inductor-benchmarks` into docker build workflow.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142177
Approved by: https://github.com/atalman
Summary: With autotune_at_compile_time enabled, AOTI now can perform CUDA codegen in one pass. CUDA kernel related code is generated in a deferred way, after autotuning is done. This one-pass implementation will eliminate any issue caused by disparity between passes in the previous two-pass implementation (which caused multiple bug reports in the past). One-pass implementation also avoids cloning mutated inputs needed in the two-pass implementation, which will reduce GPU memory consumption.
Differential Revision: [D66739414](https://our.internmc.facebook.com/intern/diff/D66739414)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141980
Approved by: https://github.com/chenyang78
**Description**
Fuse and prepack weight for `linear_dynamic_fp16` with post op relu. In Inductor, the pattern we see is
```
fp32 activation
|
(reshape)
|
mm/addmm <- t <- to_fp32 <- to_fp16 <- weight
|
(reshape) <- relu
```
Or
```
fp32 activation
|
expand
|
bmm <- expand <- t <- to_fp32 <- to_fp16 <- weight
|
(add) <- relu
```
The second pattern is for x.ndim > 2 and x is not contiguous. The first pattern is for other cases.
Fuse the pattern with weight prepack, and we get
```
fp32 activation
|
onednn.linear_relu_dynamic_fp16 <- onednn.linear_prepack_fp16 <- weight
```
After freezing, the prepack op is gone.
**Test plan**
```
python test/inductor/test_mkldnn_pattern_matcher.py -k test_linear_relu_dynamic_fp16
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141556
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
ghstack dependencies: #141549
# Summary
This PR adds an alternative triton lowering for _scaled_mm. This uses an updated mm template that utilizes persistent scheduling + TMAs on A and B matrices.
Limitations:
* This implementation does not work with Bias values: 0602676c8d/torch/_inductor/kernel/mm_scaled.py (L106) The plan is to remove this workaround and enforce that both scaling and bias are properly done as epilogues onto the existing templates
* K dim must be 32 or greater for these to take effect
* Gated by a config flag (currently defaults to off; maybe it should be on)
## Testing
We don't have any tests exercising this code in CI/CD, but I updated the relevant tests in test_fp8 and they are all green:
<img width="1680" alt="Screenshot 2024-12-05 at 7 24 07 PM" src="https://github.com/user-attachments/assets/9c520541-d97a-416f-9af7-e68b366ec90f">
## Follow Ups
* Work to update the base mm triton templates and utilize the same template from mm/addmm/scaled_mm w/ respective epilogues
* Tuning on Persistent kernel configs. I found ones that work for my problem shapes but need to do some more NCU work
### Some profiling code I was using
Code I am using to iterate with:
```Python
import torch
from dataclasses import dataclass
from jsonargparse import CLI
import logging
from pathlib import Path
from transformer_nuggets.utils.benchmark import ProfileConfig, profile_function
from torchao.float8.inference import (
    addmm_float8_unwrapped_inference,
    preprocess_data,
    Float8MMConfig,
)
from transformer_nuggets.fp8.fp8_matmul import (
    matmul_persistent,
    matmul_tma_persistent,
    matmul_device_tma_persistent,
)
from enum import Enum

logging.getLogger("transformer_nuggets").setLevel(logging.INFO)


class FP8Kernel(Enum):
    PERSISTENT = "Persistent"
    PERSISTENT_TMA = "Persistent-TMA"
    DEVICE_TMA = "Device-TMA"
    SCALED_MM = "Scaled-MM"


class ScalingStrategy(Enum):
    PER_TENSOR = "PerTensor"
    PER_ROW = "PerRow"


@dataclass(frozen=True)
class ExperimentConfig:
    M: int
    K: int
    N: int
    scaling_strategy: ScalingStrategy
    fp8_kernel: FP8Kernel
    compile: bool


def get_fp8_matmul(
    A: torch.Tensor,
    B: torch.Tensor,
    scaling_strategy: ScalingStrategy,
    fp8_kernel: FP8Kernel,
):
    A_fp8 = A.to(torch.float8_e4m3fn)
    B_fp8 = B.to(torch.float8_e4m3fn)
    A_fp8, B_fp8 = preprocess_data(A_fp8, B_fp8, Float8MMConfig(use_fast_accum=True))
    if scaling_strategy == ScalingStrategy.PER_TENSOR:
        a_scale = torch.tensor(1, device="cuda", dtype=torch.float32)
        b_scale = torch.tensor(1, device="cuda", dtype=torch.float32)
    elif scaling_strategy == ScalingStrategy.PER_ROW:
        a_scale = torch.ones((A_fp8.size(0), 1), device="cuda", dtype=torch.float32)
        b_scale = torch.ones((B_fp8.size(1), 1), device="cuda", dtype=torch.float32).T
    else:
        raise ValueError(f"Invalid scaling strategy: {scaling_strategy}")
    assert fp8_kernel == FP8Kernel.SCALED_MM
    return lambda: addmm_float8_unwrapped_inference(
        A_fp8, a_scale, B_fp8, b_scale, output_dtype=torch.bfloat16, use_fast_accum=True
    )


def run_matmul(config: ExperimentConfig):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    A = torch.randn(config.M, config.K, device=device, dtype=torch.bfloat16)
    B = torch.randn(config.K, config.N, device=device, dtype=torch.bfloat16)
    fp8_matmul = get_fp8_matmul(A, B, config.scaling_strategy, config.fp8_kernel)
    if config.compile and config.fp8_kernel == FP8Kernel.SCALED_MM:
        fp8_matmul = torch.compile(fp8_matmul, mode="max-autotune-no-cudagraphs")
    _ = fp8_matmul()
    return


def main():
    torch.random.manual_seed(123)
    # Define your experiment configuration here
    config = ExperimentConfig(
        M=8192,
        K=8192,
        N=8192,
        scaling_strategy=ScalingStrategy.PER_TENSOR,
        fp8_kernel=FP8Kernel.SCALED_MM,
        compile=True,
    )
    run_matmul(config)


if __name__ == "__main__":
    CLI(main)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142045
Approved by: https://github.com/eellison
# Feature
Previously, only the codegen for `torch.sqrt` was dtype aware. This PR updates most of the `libdevice`/`tl.math` ops to support dtype-aware codegen as well. This is often necessary to get correct code when `config.triton.codegen_upcast_to_fp32=False`, as most Triton math ops do not support float16/bfloat16.
This PR enables dtype aware codegen via the `maybe_upcast_float32` decorator. This wraps `TritonOverrides` macros to upcast arguments to float32, and downcast the result back to the original dtype. The exception is for ops that return booleans, in which case we set `convert_output=False` and skip the output cast.
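As a rough illustration only (not the actual Inductor implementation; the signature and the string-based expression handling here are simplified assumptions), the decorator conceptually wraps a codegen macro like this:
```python
# Conceptual sketch: upcast low-precision inputs to fp32 before the math op,
# then downcast the result unless the op returns booleans (convert_output=False).
def maybe_upcast_float32(convert_output: bool = True):
    low_precision = {"tl.float16", "tl.bfloat16"}

    def decorator(codegen_op):
        def wrapped(expr: str, dtype: str) -> str:
            needs_upcast = dtype in low_precision
            if needs_upcast:
                expr = f"{expr}.to(tl.float32)"
            result = codegen_op(expr)
            if needs_upcast and convert_output:
                result = f"({result}).to({dtype})"
            return result
        return wrapped
    return decorator

@maybe_upcast_float32()
def sqrt(expr: str) -> str:
    return f"tl.math.sqrt({expr})"

print(sqrt("x", "tl.float16"))  # (tl.math.sqrt(x.to(tl.float32))).to(tl.float16)
```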
# Test Plan
Added CI tests for all the new ops. The list of ops to test is automatically generated based on uses of the `maybe_upcast_float32` decorator, and stored in the new `OpDtypeSupport` class. In each new test, we search the generated code for upcasts/downcasts using a regex.
Also added a unit test for `OpDtypeSupport` which checks that we have correct dtype info for ops that require upcasts.
This PR also moves some existing tests around, to collect all the dtype aware codegen tests in one file.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140864
Approved by: https://github.com/eellison, https://github.com/arui-meta
Co-authored-by: eellison <elias.ellison@gmail.com>
Now syntax errors look like this:
```
torch/_dynamo/variables/base.py:20:
Error (RUFF) E999
SyntaxError: Expected ',', found indent.
See https://beta.ruff.rs/docs/rules/
>>> 19 |class SourceType(Enum]:
20 | """
21 | This Enum divides VariableTracker into 2 cases, depending on the variable
22 | it represents:
[...more errors...]
```
Note that most syntax errors lead to a cascade of other errors, so the exception is generally wrong, but the location and name are good.
Before they looked like this:
```
>>> General linter failure:
Error (RUFF) Linter failed
Linter failed. This a bug, please file an issue against the linter
maintainer.
CONTEXT:
Linter command failed with non-zero exit code.
STDERR:
<MainThread:DEBUG> $ /home/rec/.conda/envs/pytorch-dev-constant/
bin/python3 -m ruff check --exit-zero --quiet --output-format=json
--config=pyproject.toml /home/rec/git-constant/pytorch/torch/_dynamo/
variables/base.py
<MainThread:DEBUG> took 38ms
Traceback (most recent call last):
File "/home/rec/git-constant/pytorch/tools/linter/adapters/
ruff_linter.py", line 465, in <module>
main()
File "/home/rec/git-constant/pytorch/tools/linter/adapters/
ruff_linter.py", line 424, in main
lint_messages = check_files(
File "/home/rec/git-constant/pytorch/tools/linter/adapters/
ruff_linter.py", line 273, in check_files
return [
File "/home/rec/git-constant/pytorch/tools/linter/adapters/
ruff_linter.py", line 288, in <listcomp>
severity=severities.get(vuln["code"],
get_issue_severity(vuln["code"])),
File "/home/rec/git-constant/pytorch/tools/linter/adapters/
ruff_linter.py", line 172, in get_issue_severity
if any(
File "/home/rec/git-constant/pytorch/tools/linter/adapters/
ruff_linter.py", line 173, in <genexpr>
code.startswith(x)
AttributeError: 'NoneType' object has no attribute 'startswith'
STDOUT:
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142312
Approved by: https://github.com/Skylion007
Tag the wheels with appropriate manylinux 2.28 tags
Initially used auditwheel, but it does much more than just adding tags: it also tries to package multiple libs into the wheel, which we don't want at this point. Hence we just changed the tag and filename. If no libs are repackaged by auditwheel, all it does is tag and rename.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141988
Approved by: https://github.com/kit1980, https://github.com/malfet
This PR brings the FlexAttention inference support for the inductor backend in torch.compile (support precisions: bf16 and fp32) on CPUs.
Based on the existing CPP template, this PR extends and implements a FlexAttention CPP template to support broad attention variants, and meanwhile brings optimized performance on CPUs.
With this, users can transparently extend their Flex Attention usages to CPUs with good and common support from torch.compile, both functionality and performance.
For unit tests, this PR includes a subset of the critical tests for CPUs (inference tests), as follows:
```
pytest test/inductor/test_flex_attention.py
`TestFlexAttention`
#common functions:
run_test
preprocess_paged_attention
run_paged_attention
run_test_with_paged_attention
run_test_with_call
run_dynamic_test
run_automatic_dynamic_test
#test functions:
test_builtin_score_mods
test_builtin_score_mods_automatic_dynamic
test_builtin_score_mods_different_seqlen
test_builtin_score_mods_different_block_size
test_kv_batch_broadcast
test_GQA
test_cpu_error_message_return_lse
test_validate_cpu_dtype_error_message
`TestPagedAttention`
#test function:
test_paged_builtin_score_mods
```
For the remaining UTs in `test/inductor/test_flex_attention.py` and `test/inductor/test_flex_decoding.py`, which involve a larger set of changes (1500+ LOC) that would make this PR hard to review, we will submit another PR specifically for enabling CPU device UTs and refactoring.
Besides, more optimizations are also planned in follow up PRs, including:
- Block sparse computation
- Flash decoding tuning
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141453
Approved by: https://github.com/jgong5, https://github.com/drisspg, https://github.com/leslie-fang-intel
Co-authored-by: Wu, Chunyuan <chunyuan.wu@intel.com>
Fixes #141652
This PR fixes (at least in part) the unit test failure. However, we may also need to do a separate flush of the untuned results; if this test continues to be flaky, another PR would be needed to flush the untuned results as well.
Tested locally and it seems to be working.
Also fixes code in the unit test that was accidentally commented out in the prior multi-gpu offline tuning PR https://github.com/pytorch/pytorch/pull/139673
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142269
Approved by: https://github.com/jeffdaily
Currently there are a few type annotations that falsely state that mypy doesn't support recursive types.
Recursive type support is available in mypy for a few years already. It has been officially enabled in [version 0.991](https://mypy-lang.blogspot.com/2022/11/mypy-0990-released.html). Pyright even had support for recursive types earlier (https://github.com/microsoft/pyright/issues/569), so there is probably no reason not to model these types correctly.
This PR models these types properly now. Since this has turned a few implicit `Any` into fully typed variables that are not narrowed cleanly, a small number of type ignores were necessary.
Note that regarding `Argument`, it is desirable to model it in a covariant way (i.e. using `Sequence` and `Mapping`) instead of making it invariant unnecessarily (using `List` and `Dict`). If it were modeled invariant, it would for instance mean that a `List[Node]` would not type check as `Argument`, because invariance would mean that it really has to be a `List[Argument]` (i.e., including all the branches of the union type). Since even the name of the type, "Argument", strongly suggests that it is semantically used as an argument, covariance is natural anyway.
There are no changes in this PR that affect runtime behavior.
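A hedged sketch of the covariance point, using simplified recursive type aliases rather than the actual `torch.fx` definitions:
```python
from typing import Mapping, Sequence, Union

class Node:
    pass

# Recursive, covariant modeling: Sequence/Mapping rather than List/Dict.
Argument = Union[None, int, str, Node, Sequence["Argument"], Mapping[str, "Argument"]]

def takes_argument(a: Argument) -> None:
    pass

nodes: Sequence[Node] = [Node(), Node()]
takes_argument(nodes)  # accepted thanks to covariance; an invariant List[Argument] would reject a List[Node]
```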
CC @Skylion007
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142300
Approved by: https://github.com/ezyang, https://github.com/Skylion007
Summary:
This is an attempt to improve the flight recorder efficacy.
We have a small subset of jobs that are timing out (i.e. failing to write out FR logs in 1 minute) and some that are throwing a `std::exception - broken promise`.
There are two changes in here.
1. We attempt to write out FR buffer with stack traces. If this fails, we attempt to capture FR buffer again - but this time without stack traces. The assumption here is that FR could be locking up when unwinding stack.
Note, to keep things simple, I'm re-using the same file name for both with/without stack_trace.
2. Add additional catch statements in the Manifold writer. There might be something going on in here - so we'll get a log statement if this is failing.
TODO:
- there's nothing differentiating in the output that says whether stack traces were omitted purposefully or not.
This info might be useful for the analyzer - so I'll add this in a follow on diff.
Test Plan: Unit tests.
Differential Revision: D66843194
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142178
Approved by: https://github.com/kwen2501
Preparatory refactor for https://github.com/pytorch/pytorch/pull/137243.
# Feature
Follow up to https://github.com/pytorch/pytorch/pull/141751. Since we now represent `numels` as a dict, it's natural to extend this to `size_hints`. The latter are basically just the former rounded up to the nearest power of 2. This simplifies various heuristics such as the coordinate descent tuner. Where we previously needed to determine which index in `size_hints` corresponds to each dimension, now we can just query by prefix. This will be especially important when we enable 2D reductions, as it becomes harder to keep track of these things when we have multiple reduction dimensions. (See the previous PR for some examples.)
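As a rough illustration (the dict keys and helper name are assumptions for this sketch, not the exact Inductor code), size hints are just the numels rounded up to the next power of 2, keyed by prefix:
```python
import math

def next_power_of_2(n: int) -> int:
    # Round up to the nearest power of 2 (1 for n <= 1).
    return 1 if n <= 1 else 2 ** math.ceil(math.log2(n))

numels = {"x": 1000, "r0_": 768}  # per-prefix numels (illustrative)
size_hints = {prefix: next_power_of_2(v) for prefix, v in numels.items()}
print(size_hints)  # {'x': 1024, 'r0_': 1024}
```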
# Test plan
The existing CI provides good coverage. This PR modifies a few tests which explicitly constructed size hints.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142249
Approved by: https://github.com/jansel
Locally shards a full tensor based on indicated sharding arrangement, and returns a DTensor containing the local shard.
Warning: this is a private API intended to skip the communication otherwise required by `distribute_tensor`. It is only applicable to the case where all ranks have the same `full_tensor`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142288
Approved by: https://github.com/wz337
Summary:
Original Context Doc: https://docs.google.com/document/d/1Gv5ZqN3UY7kTuCUd0JLU3L_onwVIr2vd3AiZyim1Muc/edit?disco=AAABZX_tqdk
### Changes
This diff restructures the Partitioners.py file in the _functorch package.
* Moves the three knapsack problem algorithms (greedy, ilp, dp) into a separate file
Test Plan:
### Unit Testing
```
$ buck test mode/opt //caffe2/test/functorch:test_ac
File changed: fbsource//xplat/caffe2/test/functorch/TARGETS
File changed: fbsource//xplat/caffe2/test/functorch
File changed: fbsource//xplat/caffe2/test
7 additional file change events
Soft Error: source_directory_includes_subpackage: Directory `v2.17.1-1` of package `fbsource//third-party/nccl` may not cover any subpackages, but includes subpackage `v2.17.1-1/src/tests`.
Soft Error: source_directory_includes_subpackage: Directory `v2.18.3-1` of package `fbsource//third-party/nccl` may not cover any subpackages, but includes subpackage `v2.18.3-1/src/tests`.
Soft Error: source_directory_includes_subpackage: Directory `v2.19.3-1` of package `fbsource//third-party/nccl` may not cover any subpackages, but includes subpackage `v2.19.3-1/src/tests`.
Buck UI: https://www.internalfb.com/buck2/a2f91f8a-5326-435e-9075-5af0de930b8b
Test UI: https://www.internalfb.com/intern/testinfra/testrun/7036874660108924
Network: Up: 28MiB Down: 3.5GiB (reSessionID-19af3d71-7528-448c-9126-5615d27b3bd7)
Jobs completed: 423656. Time elapsed: 3:57.2s.
Cache hits: 99%. Commands: 146147 (cached: 145758, remote: 317, local: 72)
Tests finished: Pass 8. Fail 0. Fatal 0. Skip 0. Build failure 0
```
### Tested on Local Training Run
```
CUDA_VISIBLE_DEVICES=5,6 AOT_PARTITIONER_DEBUG=1 PARTITIONER_MEMORY_BUDGET_PARETO=0 buck2 run mode/opt //aps_models/ads/icvr:icvr_launcher -- mode=local_fb_fm_v4 launcher.num_workers=2 2>&1 | tee log_2024-11-2320:41:50.757256__bento_trigger.txt
```
Output Summary Paste: P1685697066
Differential Revision: D65800097
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141614
Approved by: https://github.com/jansel, https://github.com/Chillee
**Description**
For `linear_dynamic_fp16`, we insert `quantize` and `dequantize` between x/w and linear to have the following pattern:
```
x
|
linear <- to_fp32 <- to_fp16 <- w
```
In Inductor, the pattern we finally see will be
```
fp32 activation
|
(reshape)
|
mm/addmm <- t <- to_fp32 <- to_fp16 <- weight
|
(reshape)
```
Or
```
fp32 activation
|
expand
|
bmm <- expand <- t <- to_fp32 <- to_fp16 <- weight
|
(add)
```
The second pattern is for x.ndim > 2 and x is not contiguous. The first pattern is for other cases.
Fuse the pattern with weight prepack, and we get
```
fp32 activation
|
onednn.linear_dynamic_fp16 <- onednn.linear_prepack_fp16 <- weight
```
After freezing, the prepack op is gone.
**Test plan**
```
python test/inductor/test_mkldnn_pattern_matcher.py -k test_linear_dynamic_fp16
```
Differential Revision: [D66802159](https://our.internmc.facebook.com/intern/diff/D66802159)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141549
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
Summary:
With the changes in https://github.com/pytorch/pytorch/pull/140755 and https://github.com/pytorch/pytorch/pull/141997, I added a load_constants function to the packaging API. Currently this doesn't work for cpu.
The workflow is something like:
```
ep = torch.export.export(model, example_inputs)
package = torch._inductor.aoti_compile_and_package(ep, inductor_configs=inductor_configs)
compiled = torch._inductor.aoti_load_package(package)
print(compiled.get_constant_fqns()) # see what are the fqns needed/available
compiled.load_constants(new_state_dict, check_full_update=True) # update the constants in AOTI
```
You can also use the `aot_inductor.package_constants_in_so` config to stop including the constants in the so:
```
package = torch._inductor.aoti_compile_and_package(ep, inductor_configs={"aot_inductor.package_constants_in_so": False})
compiled = torch._inductor.aoti_load_package(package)
compiled(*inputs) # segfaults because there are no constants --> we should probably have a better error msg
compiled.load_constants(new_state_dict, check_full_update=True)
compiled(*inputs)
```
Test Plan: `buck2 run @//mode/dev-nosan //caffe2/test/inductor:aot_inductor_package -- -r "test_so_without_weight" `
Differential Revision: D66796206
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142246
Approved by: https://github.com/henrylhtsang, https://github.com/desertfire
Summary: Failed to write the auto-tune result to `PYTORCH_TUNABLEOP_FILENAME` with this change (empty file) , reverting to unblock.
Test Plan: CI
Differential Revision: D66870750
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142226
Approved by: https://github.com/leitian
Utils for creating random dags and generating code for nn modules based on such dags.
In particular, this was used to do fuzz testing for unflatten, where the random dags instructed the generation of calls, const accesses, and buffer mutations in a system of nn modules.
Example of generated test:
```python
def test_unflatten_random_dag_const_preserving_3(self):
    class N2(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.const = torch.ones(1)

        def forward(self, x):
            return x + 1

    class N1(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.const = torch.ones(1)
            self.n2 = N2()

        def forward(self, x):
            x = x + self.n2.const
            x = self.n2(x + 1)
            return x + 1

    class N0(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.const = torch.ones(1)
            self.n1 = N1()

        def forward(self, x):
            x = x + self.n1.n2.const
            x = self.n1(x + 1)
            x = self.n1.n2(x + 1)
            return x + 1

    inp = (torch.ones(1),)
    eager = N0()(*inp)
    ep = torch.export.export(
        N0(),
        inp,
        strict=False,
        preserve_module_call_signature=(
            "n1",
            "n1.n2",
        ),
    )
    epm = ep.module()
    ufm = torch.export.unflatten(ep)
    assert torch.allclose(epm(*inp), eager)
    assert torch.allclose(ufm(*inp), eager)
```
Differential Revision: [D66838348](https://our.internmc.facebook.com/intern/diff/D66838348/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142180
Approved by: https://github.com/angelayi, https://github.com/ydwu4
This implements a new wrapper class, AOTDispatchCompiler, which is just a wrapper around a callable that returns an OutputCode. We can then use it in AOTDispatch to decide whether or not to use the cache: if fw_compiler, bw_compiler and inference_compiler are all AOTDispatchCompilers, then we enable caching.
This type is pretty close to _CompiledFxGraphCallable, except it's not allowed to take any kwargs. Not sure how to consolidate the two ideas together just yet: unfortunately, there's no way to properly annotate the types to make them related. But a lot of the time, the input to this function will be a partially applied _CompiledFxGraphCallable.
This allows the PR above this one to enable AOTAutogradCache everywhere, but not increase instruction count or enable cache on unit tests that use aot_eager or other non inductor compilers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142205
Approved by: https://github.com/oulgen, https://github.com/bdhirsh
Notable new features for SDPA operators on AMD systems from AOTriton 0.8b:
1. Nestedtensor support;
2. MQA/GQA support;
3. Restore Efficient attention support for causal=True and seqlen_q != seqlen_k cases;
+ The kernel should use top-left alignment, bottom right alignment will be added later
4. Move gfx1100 (RX7900/W7800/W7900) out of experimental support status.
However, users are strongly recommended to update to ROCM 6.2.4, notably for
its firmware updates.
Related unit tests are enabled as well.
Notable related changes from AOTriton 0.8b:
1. AOTriton 0.8b moves the GPU kernel out of libaotriton.so to a separate directory `aotriton.images`;
2. LZMA replaces ZSTD as the GPU kernel compression algorithm for a better compression ratio: AOTriton 0.8b (.so + aotriton.images) takes 350MB, compared to 800MB for the AOTriton 0.7b .so;
3. The compression cannot be disabled now, and `liblzma` is a hard run-time dependency.
+ Should not be a problem, since `lzma` is part of Python Standard Library
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140172
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily
Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
Otherwise certain sequences of tests will fail with OOM e.g.,
```
# python test/test_cuda.py -k max_split_expandable -k test_assigning_back_deleter_fns_to_tensor --repeat 100
..
----------------------------------------------------------------------
Ran 2 tests in 0.311s

OK
E.
======================================================================
ERROR: test_assigning_back_deleter_fns_to_tensor (__main__.TestBlockStateAbsorption.test_assigning_back_deleter_fns_to_tensor)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/workspace/pytorch/torch/testing/_internal/common_utils.py", line 3058, in wrapper
method(*args, **kwargs)
File "/workspace/pytorch/test/test_cuda.py", line 4320, in test_assigning_back_deleter_fns_to_tensor
graph, outputs = cudagraphify(foo, [inp])
^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/pytorch/test/test_cuda.py", line 4080, in cudagraphify
fn(*inputs)
File "/workspace/pytorch/test/test_cuda.py", line 4316, in foo
int8_cuda(LARGE_BUFFER) + x,
~~~~~~~~~~~~~~~~~~~~~~~~^~~
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 160.00 MiB. GPU 0 has a total capacity of 31.73 GiB of which 31.30 GiB is free. Process 2916661 has 442.00 MiB memory in use. 120.00 MiB allowed; Of the allocated memory 52.00 MiB is allocated by PyTorch, and 6.00 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
To execute this test, run the following from the base repo dir:
python test/test_cuda.py TestBlockStateAbsorption.test_assigning_back_deleter_fns_to_tensor
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
----------------------------------------------------------------------
Ran 2 tests in 0.136s
FAILED (errors=1)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140852
Approved by: https://github.com/Skylion007
Fixes #105203
Facing a similar problem to the linked issue, where variable sized input data can mean that a handful of slow-to-process samples holds up smaller and faster-to-process samples from being used. This also leads to lower GPU utilization. In certain cases, e.g. evaluation epochs, inference pipelines or other cases where reproducibility isn't important, this can bring significant speed ups.
This PR adds an `allow_out_of_order` bool input to the `DataLoader` class, defaulting to `false`, which when set to `true` will return data from workers in whatever order they are ready/processed in, rather than in strict index order.
Instead of storing data that was returned out of order, it is passed directly to the main thread and the entry in `_task_info` is deleted. The main changes are to check that an entry in `_task_info` exists, and to only increase `self._rcvd_idx` when the lowest remaining index has been returned.
Two tests are added to test this for iterable type datasets and index type datasets.
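A minimal usage sketch of the flag described above (the dataset is illustrative, and the parameter name follows this PR description; the final landed name may differ):
```python
from torch.utils.data import DataLoader

dataset = list(range(1000))  # any map-style dataset works

loader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=4,
    allow_out_of_order=True,  # yield batches as workers finish them, not in index order
)
for batch in loader:
    pass
```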
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141833
Approved by: https://github.com/andrewkho
CC @zdevito @janeyx99
This isn't ideal but cuBLASLt workspaces are not currently cached, so this additional untracked allocation will cause `test_cuda_tracker_equivalence` to fail with a large enough workspace size e.g., `CUBLAS_LT_WORKSPACE_SIZE=32768`. One solution is to just use byte-tensors for the workspace instead of going directly to the caching allocator.
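A hedged illustration of the byte-tensor idea (the size value is illustrative, not the actual cuBLASLt default):
```python
import torch

# Allocating the workspace as a uint8 tensor routes it through the caching allocator,
# so memory trackers see it like any other tensor allocation.
workspace_size_bytes = 32768 * 1024  # e.g. CUBLAS_LT_WORKSPACE_SIZE=32768 (in KiB)
if torch.cuda.is_available():
    workspace = torch.empty(workspace_size_bytes, dtype=torch.uint8, device="cuda")
```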
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139442
Approved by: https://github.com/Aidyn-A, https://github.com/albanD, https://github.com/janeyx99
Fixes #142144
A global x is saved in a checkpoint as `GLOBAL x.__module__ x.__name__`. So, after allowlisting a GLOBAL, it is expected to match any GLOBAL instruction of the form `GLOBAL x.__module__ x.__name__`. However, there are edge cases where, for the same API from the same module, what `__module__` gives changes between versions, which prevents users from allowlisting the global.
In this case, in numpy < 2.1
```
torch.save(np_array, "bla")
# checkpoint has GLOBAL "np.core.multiarray" "_reconstruct"
```
In np version 2.1
```
with safe_globals([np.core.multiarray._reconstruct]):
torch.load("bla")
```
np.core.multiarray._reconstruct.__module__ gives "np._core.multiarray" (note the extra _ before core) and see what was done [here](https://github.com/numpy/numpy/blob/main/numpy/core/multiarray.py)
Since the dictionary to access safe globals is keyed on "{foo.__module__}.{foo.__name__}", __module__, __name__ will no longer match that in the checkpoint so "np.core.multiarray._reconstruct" can no longer be properly allowlisted (instead np._core.multiarray._reconstruct is a key in the dict).
We allow `add_safe_globals`/`safe_globals` to optionally take tuples of (global, str of module.name) to work around such (odd/edge case) situations.
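A hedged sketch of the tuple form described above (the checkpoint path is illustrative; the string should match whatever the checkpoint's GLOBAL instruction records):
```python
import numpy as np
import torch

# Pair the callable with the module.name string as it appears in the checkpoint.
with torch.serialization.safe_globals(
    [(np.core.multiarray._reconstruct, "numpy.core.multiarray._reconstruct")]
):
    obj = torch.load("checkpoint_with_numpy_array.pt", weights_only=True)
```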
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142153
Approved by: https://github.com/albanD
Summary:
In this diff we implement a way to ensure the internal thrift schema from cfgr (configerator/structs/caffe2/torch/export/schema.thrift) and the schema in OSS (torch/_export/serde/schema.thrift) are in sync, by adding a unittest to reflect on the type names and fields from each schema and compare them field by field.
When we detect new fields/types from torch/_export/serde/schema.thrift, there'll be a test failure on the trunk and the error message hints people to add the missing field/type to the thrift schema from cfgr, so that they are always in sync in practice.
Test Plan: buck test mode/opt caffe2/test:test_export -- -r test_thrift_schema_in_sync
Differential Revision: D66716834
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141989
Approved by: https://github.com/yiming0416
Previously when Dynamo encounters a `functools.wraps(...)` call, it would
check `VariableTracker.can_reconstruct` and graph break if failed.
That has 2 issues:
1. Implementation of `can_reconstruct` is incorrect, since logic of
reconstructability isn't necessarily encapsulated in
`VariableTracker.reconstruct` -- for some VTs like `CellVariable`,
it's also in `SideEffects.codegen_save_tempvars`. This is exposed by
#134731.
2. We don't always need to reconstruct the result of
`functools.wraps(...)`; for those cases we don't want to give up
tracing by an early `can_reconstruct` check. Instead we could just
let it fall through, and graph break in the actual `reconstruct` call
later, if needed.
This patch removes the `can_reconstruct` check altogether. It was
introduced in #114279, but the added tests pass even without the check
now; this might be because of some recent bug fixing on cells and side
effects.
Fixes #134731, #141514.
D66838708
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142000
Approved by: https://github.com/zou3519
Allow mutations for subclasses that are non-contiguous.
Changes:
- Remove the assert in collect_metadata_analysis.
Main requested testcase:
- Compilation of NJT.index_put(): a test is added in test_nestedtensor.py that compiles NJT.index_put(). It is decomposed to NJT split/unbind, which needed additional `torch._check`/`torch._check_is_size` for NJT.unbind() and guard_size_oblivious() usage in _meta_registrations and _inductor/lowering.py.
Special case:
- If a tangent is mutated outside of the graph, it does not participate in the backward graph. Autograd in this case will set this tangent to a zeros tensor. We handle it separately in CompiledFunction.backward: no processing is done for this tangent, and it is broadcast to the number of expected subclass unwrapped arguments.
Disabling 2 tests for dynamo:
1. For nested tensor: a symbolic shapes issue on the nested_tensor index operation that does splits [0, 0, 0] - there is a failure with "pending unbacked symints". This PR does not add more .tolist()/item() ops than there were before.
2. Since we no longer fail with an exception in collect_metadata_analysis, new paths for dynamo started working; they fail with something strange where set_ handling for storage_offset (because of the test for views) updates the storage "cpu" -> "meta".
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139630
Approved by: https://github.com/bdhirsh
Today, when user does `init_process_group()`, without `backend` or `device_id` specification, we would auto-translate it into `cuda:nccl,cpu:gloo`. The idea was to initialize all **default** backends to cover what the user may do later.
A side effect is increase of initialization time and resources.
This PR changes it to detect the accelerator type on the machine and initialize only the backend for that accelerator.
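A minimal single-process sketch of the new default behavior (the rendezvous settings are illustrative):
```python
import os
import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

# No `backend`/`device_id` given: only the backend for the detected accelerator
# (e.g. NCCL on a CUDA machine, Gloo on a CPU-only machine) is initialized.
dist.init_process_group(rank=0, world_size=1)
dist.destroy_process_group()
```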
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142216
Approved by: https://github.com/wconstab, https://github.com/XilunWu
This basically undoes some workarounds introduced in #119926, the
root causes of which have been fixed by #142078 and other changes in
Dynamo.
Now that Dynamo traces the spec comparison code, the test also needs update:
- removing the `_jvp_treespec_compare` calls in fx graph
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142081
Approved by: https://github.com/zou3519
ghstack dependencies: #142078, #142080
This basically undoes most of the workarounds introduced in #119405, the
root causes of which have been fixed by #142078 and other changes in
Dynamo.
Now that Dynamo traces the spec comparison code, the test also needs update:
1. renaming `o` to `primals_out`
2. removing the `_vjp_treespec_compare` calls in fx graph
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142080
Approved by: https://github.com/zou3519
ghstack dependencies: #142078
Flatten the inputs to minifier so AOTI Minifier can handle unflattened inputs and kwargs.
- flatten the inputs in minifier
- changed the "load_and_run" part of the minifier verification to run on the flattened inputs.
- refactored code to keep `torch._inductor.__init__.py` clean
- update doc
`python test/inductor/test_minifier.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141156
Approved by: https://github.com/desertfire
Summary:
### Context
* When compiling the object file for a CUTLASS kernel, CUDA RT symbols are left undefined.
* When compiling the final shared object file, we statically link with `libcudart_static.a`.
* One important thing is that ordering matters when specifying the lib search paths (-L).
Test Plan:
```
// before diff
RuntimeError: Failure loading .so: /tmp/tmpqhz_dnza/model.so: undefined symbol: cudaLaunchKernelExC
```
Differential Revision: D66793974
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142094
Approved by: https://github.com/chenyang78, https://github.com/hl475
**Background:** conversion from outer dim -> inner dim makes the (previously valid) assumption that the ragged dim is immediately next to the batch dim. This is no longer the case after #137125.
This PR:
* Updates the outer dim -> inner dim conversion logic to match the actual ragged_idx. Since ragged_idx tells us where the packed ragged / batch dim is, both ragged and batch outer dims should map to this inner dim. The conversion logic must now take in `ragged_idx` to make this possible, so the PR updates all call-sites to pass this.
* Fixes outputs across keepdim settings when reducing over ragged / batch dims.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142173
Approved by: https://github.com/drisspg
Add a 3D (pp+tp+fsdp) test `test_3d_with_tp_dp_pp` in test_pp_composability.
Currently `@parametrize` is provided on:
- "ScheduleClass" for pp in [ScheduleGPipe, Schedule1F1B, ScheduleInterleaved1F1B, ScheduleLoopedBFS, ScheduleInterleavedZeroBubble]
- "MixedPrecisionParam" for fsdp in [torch.bfloat16, torch.float32]
Future work:
1. add fp8
2. add cp(context parallelism) to enable 4D test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141398
Approved by: https://github.com/wconstab, https://github.com/kwen2501
Parametrization cannot be registered for non-direct child parameters of the module.
We have to iterate through all submodules and register the parametrization at every level.
The original testcase did not cover nested modules; a submodule is added to the test (see the sketch below).
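A minimal sketch of the idea (a toy parametrization, not the PR's code): walk every submodule and register the parametrization on its direct parameters, rather than only on the root module.
```python
import torch.nn as nn
from torch.nn.utils import parametrize

class Double(nn.Module):
    def forward(self, w):
        return 2.0 * w

model = nn.Sequential(nn.Linear(4, 4), nn.Sequential(nn.Linear(4, 2)))
for module in model.modules():
    # recurse=False: only this module's direct parameters, so each level is covered once.
    for name, _ in list(module.named_parameters(recurse=False)):
        parametrize.register_parametrization(module, name, Double())
```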
Testing:
```
python test/functorch/test_aotdispatch.py -k test_subclass_parameters
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142155
Approved by: https://github.com/bdhirsh
Summary:
Static assert that NCCL VERSION is greater than 2.4.
This is in preparation of enabling error checking by default in PyTorch library and removal of some macros.
This is in PR #141914.
The rationale behind this version is:
1. 2.4 was released ~2 years ago, so it's unlikely that someone is still using the old library.
2. Enabling error checking is beneficial to the community as it helps debug subtle bugs in production environments.
Test Plan: unit tests
Differential Revision: D66737055
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142023
Approved by: https://github.com/kwen2501
Summary: Avoid using expr_printer as an overridden class member for WrapperCodegen. Instead, use pexpr and cexpr explicitly for Python and C++ expression printing, respectively. This is to prepare for one-pass AOTI CUDA codegen, where PythonWrapperCodegen is used to generate the autotune block and CppWrapperCodegen is used to generate the model code.
Differential Revision: [D66459992](https://our.internmc.facebook.com/intern/diff/D66459992)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141388
Approved by: https://github.com/chenyang78
This PR updates the documentation for `torch.mean()` to explicitly mention that computing the mean over an empty tensor returns `nan`. This clarification helps users understand the behavior and handle it appropriately in their code.
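For example:
```python
import torch

print(torch.tensor([]).mean())  # tensor(nan)
```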
Fixes #141057
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142039
Approved by: https://github.com/albanD
Summary: This has been set to "subproc" for a while internally and externally, so we can remove and simplify some of the code. Note that there's no pressing need here -- just that since we've had internal outage with the legacy "fork" implementation, it doesn't seem helpful to leave it available. But if people aren't in the mood for this sort of cleanup, I won't be offended to abandon it.
Test Plan: CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142001
Approved by: https://github.com/eellison, https://github.com/jansel
Magma is built for specific CUDA versions and stored in the ossci-linux bucket. Install it from there rather than the deprecated conda package.
There are two places where magma is installed today:
- `install_conda.sh`: extract the magma package in the same exact location where conda would install it, using a dedicated `install_magma_conda.sh` script. The new script is included in the relevant Dockerfiles where CUDA+magma is needed
- `install_magma.sh`: this script already uses a tarball. Use the new tarball instead of the tarball from the conda package. The format of the new tarball is compatible with the old one, so changes here are minimal.
Fixes #140538
Test PR: https://github.com/pytorch/pytorch/pull/141584
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140417
Approved by: https://github.com/atalman
This PR removes the functorch config that set an upper limit on the number of aliased
inputs with dynamic shapes. After moving them to be run at runtime in C++, the compilation
time and runtime (in true alias cases) improved, rendering the error no longer relevant.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141680
Approved by: https://github.com/bdhirsh
ghstack dependencies: #139554, #139555, #140013
This PR moves the logic for computing the overlapping relations between input tensors that
share a storage instance to C++.
In summary, this PR:
- Moves both `tensors_definitely_do_not_overlap` and part of `compute_overlapping_tensors` to C++
- Introduces a `check_overlapping` function that re-runs `compute_overlapping_tensors`, checking that the result is consistent with what is expected
- Introduces the `StorageOverlapChecker` class
  - Keeps track of overlapping and non-overlapping tensors
  - Actually checks the overlapping relation (call `check_overlapping`) when all tensors are collected
- Introduces the `STORAGE_OVERLAPPING` relational guard
  - Has a reference to a `StorageOverlapChecker`
  - Stores the to-be-checked tensors in the checker, and triggers its check
- Introduces the `install_storage_overlapping_guard` python function
  - Creates an instance of `StorageOverlapChecker`
  - Creates 2 instances of the `STORAGE_OVERLAPPING` guard (for overlapping and non-overlapping tensors), referencing the same `StorageOverlapChecker` instance
**Why is `StorageOverlapChecker` needed?**
The way `GuardManager` is implemented, we have no control over the order in which the
check methods are called, i.e. no control over the order the tensors are collected. So, we
can't easily split them in "overlapping" and non-overlapping kinds.
Instead, we create 2 instances of `STORAGE_OVERLAPPING` guard, each of which helps
collecting the tensors for one of the kinds mentioned above. They are then used in a
single `StorageOverlapChecker` instance.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140013
Approved by: https://github.com/bdhirsh
ghstack dependencies: #139554, #139555
Fix: #118214
This PR replaces the guards introduced by running `_tensors_definitely_do_not_overlap` at
compile-time by a single `___check_overlapping` guard. When evaluated, this function calls
the original `_tensors_definitely_do_not_overlap` so as to check whether the current state
of the inputs are consistent, i.e. tensors that should overlap do overlap, and those that
shouldn't don't.
In summary, the changes are:
- Introduce `StorageOverlap` derived class from `GuardEnvExpr`
- Plumb `AOTConfig` to the `compute_overlapping_inputs` function, so as to have access to
AOTAutograd input sources
- Suppress the guards generated by `_tensors_definitely_do_not_overlap` function at runtime
- Issue a `StorageOverlap` AOTAutograd guard, specifying the sources that should and
shouldn't overlap
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139555
Approved by: https://github.com/bdhirsh
ghstack dependencies: #139554
This PR fixes the issue where AOTAutograd would produce a guard that used a symbolic value
that came from one of the input's base.
```python
@torch.compile(backend="aot_eager", dynamic=True)
def f(a, b):
    a.add_(1)
    b.add_(1)
    return a

x = torch.ones(10)
f(x[1:], x[1:])
```
In the example above, AOTAutograd functionalizes the mutation by making use of
`as_strided_scatter` operation, which produces the guard: `s0 >= s1 + 1`, where:
- `s0`: corresponds to `x.size()[0]`
- `s1`: corresponds to `a.size()[0]`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139554
Approved by: https://github.com/bdhirsh
A few changes to MetaTensorDesc and friends:
1. Change view_func from a raw method to an ADT where the common case (FakeTensor._view_func_unsafe) is a simple representation instead.
2. (minor) Remove and fix some `type: ignore`s added by #141839
3. (minor) Fix _UNSERIALIZABLE to be a set instead of a dict which is converted into a set each time it's used.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141926
Approved by: https://github.com/ezyang
Fixes https://github.com/pytorch/pytorch/issues/140229
Fixes https://github.com/pytorch/pytorch/issues/139474
The issue was that:
(1) DDPOptimizer has some logic to partition the dynamo graph into buckets, and run AOTAutograd/inductor on each bucket
(2) doing so requires knowing the **exact** strides of the outputs of each subgraph, so we can have example inputs (with correct strides) to each of the later subgraphs to compile with
(3) there is some existing logic to do this today: we have a `fakify_first_call` flag in AOTAutograd that lets you run it with fake tensor inputs (to handle the calling convention changes that AOTAutograd performs at runtime). During this process, we query inductor for the output strides that it compiled with
(4) these output strides are stored in the FX graph cache as raw strings of sympy expressions. We have a function, `evaluate_symexpr`, which given the sympy string, and the ShapeEnv's `var_to_val` mapping, will evaluate the sympy string to generate concrete strides
(5) evaluating this expression will specialize on the exact values of any variables in our shape env, however. In DDPOptimizer, we want to know what inductor's stride outputs are symbolically. This requires converting the (string) sympy expression into actual `SymInts` that we can return.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140751
Approved by: https://github.com/eellison
Summary: Add an option in config to not include weights in .so
Test Plan: `test/inductor:test_aot_inductor -- -r test_so_without_weight_cuda`
Reviewed By: desertfire
Differential Revision: D65968885
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141997
Approved by: https://github.com/desertfire
Summary: Change fx graph module's _replace_hook from a single hook, to a list of hooks. This is to prepare to registering more hooks for inductor provenance tracking, where we might need to register multiple hooks for node replacement.
Test Plan:
```
buck run mode/dev-nosan caffe2/test:fx -- -r test_hooks_for_node_update
buck run mode/dev-nosan caffe2/test:test_export -- -r test_replace_hook
```
Differential Revision: D66726724
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142006
Approved by: https://github.com/zhxchen17
**Overview**
This PR moves `torch/distributed/_composable/fsdp` to `torch/distributed/fsdp/_fully_shard` and makes public APIs available from `torch.distributed.fsdp`, e.g.:
```
from torch.distributed.fsdp import fully_shard
```
This is targeting 2.6 release. I rewrote some of the documentation with (hopefully) improved phrasing.
**Follow-Ups**
- [x] Add some explanation in the docs about FSDP1 vs. FSDP2
- [ ] Move unit tests from `test/distributed/_composable/fsdp` to `test/distributed/fsdp/fully_shard/`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141868
Approved by: https://github.com/kwen2501, https://github.com/wconstab, https://github.com/weifengpy
Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
Fixes https://github.com/pytorch/pytorch/issues/142076. Under compile, functional collectives are supposed to **not** return `AsyncCollectiveTensor`, and instead immediately issue calls to `wait_tensor()` (that we rely on the compiler to reorder as necessary).
This is done with a function `_are_we_tracing()`, that tries to detect if we are running from inside of the compiler. One of the checks it performs is `is_torchdynamo_compiling()` ([here](https://github.com/pytorch/pytorch/blob/main/torch/distributed/_functional_collectives.py#L808C8-L808C34)).
Unfortunately, this will always return False, even if dynamo is indeed tracing. The problem is that this function only returns true if dynamo **intercepts** the bytecode for `is_torchdynamo_compiling()`. However, this function is called during fake-tensor propagation, which is run as part of dynamo, but is not actually intercepted by dynamo itself.
One thing that we know is the case during dynamo tracing, however, is that a `FakeTensorMode` is active. So I tweaked the logic to assume that we are tracing if there is an active fake mode.
This could potentially have consequences for anybody running functional collectives with a fake mode directly, without compile in the loop. Although hopefully it's not too unreasonable to issue wait() calls immediately if you are running with fake tensor (presumably you only care about fake tensor propagation, in which case the wait() calls should technically be a no-op).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142075
Approved by: https://github.com/yifuwang, https://github.com/kwen2501
ghstack dependencies: #141725, #141728
When running fallback operations in `cpp_wrapper` mode, Python errors thrown in the fallback should be propagated up the stack. This PR fixes the current situation, which discards all Python errors thrown in the fallback op in favor of an uninformative `RuntimeError`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141369
Approved by: https://github.com/desertfire
ghstack dependencies: #141368, #141580
FXGraphCache supports freezing, but AOTAutogradCache does not. This is due to the fact that when freezing is turned on, instead of using the constants from the graph module that was saved on cache miss, we have to take the constants from the AOTAutograd generated graph module. This PR does two things:
- It bypasses AOTAutogradCache when freezing is turned on. We should have always been doing this.
- It refactors the code to be way more clear about the constants we're using and when we're using them.
Basically, there are two possible sets of constants we can grab from the compiled fx graph.
1. If freezing is turned off, we save the constants directly in CompiledFxGraph.
2. If freezing is turned on, we save the *names* of the constants in CompiledFxGraph, and use the runtime GraphModule's actual constant values: we reconstruct them from the saved names + the new graph module from AOTDispatch.
We implement two different classes for doing just this: one that has access to the post aotdispatch gm, which supports freezing, and one that doesn't have it, which does not support freezing. Then we construct the wrappers and unwrap the result as needed.
This makes it clear that the gm passed to AOTAutogradCache is *not* part of post compile, only the cache key generated from it is.
The whole flow is pretty confusing, but hopefully this gives us better types and static information for understanding what the different codepaths are doing.
Will add a specific AOTAutogradCache test to confirm we bypass freezing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141897
Approved by: https://github.com/ezyang, https://github.com/masnesral
Large n input caused a regression starting in ROCm 6.1. The for loop will run for an excessive number of iterations. The root cause seems to be how static_cast<int64_t> behaves for large float values such as 1e20 that certain unit tests will use. The workaround is to break out of the loop once the returned value reaches nan.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141150
Approved by: https://github.com/eqy, https://github.com/malfet
Related to #107302
To use `numpy>=2`, we need to upgrade `scipy>=1.13.0` from `1.11.0`.
This PR fixes a failed test caused by the `scipy` upgrade.
The `scipy` implementation of `logsumexp` has changed and deviated from the torch implementation.
So, we replace it with a simple custom implementation as the ground truth.
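A minimal sketch of such a ground-truth implementation (an assumption for illustration, not necessarily the exact code added to the test):
```python
import numpy as np

def logsumexp_ref(a, axis=None, keepdims=False):
    # Numerically stable log(sum(exp(a))) via max subtraction.
    m = np.max(a, axis=axis, keepdims=True)
    out = m + np.log(np.sum(np.exp(a - m), axis=axis, keepdims=True))
    return out if keepdims else np.squeeze(out, axis=axis)

x = np.array([[1.0, 2.0, 3.0], [0.0, -1.0, 5.0]])
print(logsumexp_ref(x, axis=1))
```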
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141986
Approved by: https://github.com/rgommers, https://github.com/albanD
Today `destroy_process_group()` is implemented via `ncclCommAbort`.
When user call it in CPU, risk is that a healthy NCCL kernel gets preempted, which causes data corruption.
Instead of aborting kernels, we should flush collectives in `destroy_process_group`, i.e. let them complete normally, before we tear down resources.
This PR implements such "flushing" behavior using `ncclCommFinalize`, then reclaims resources via `ncclCommDestroy`.
Expected behaviors:
For a bad program, a hang is expected at `destroy_process_group()`. If the PG uses non-blocking communicators, such a hang is recoverable, because we attach a timeout to the flush behavior.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141510
Approved by: https://github.com/wconstab
Previously when Dynamo encounters a `functools.wraps(...)` call, it would
check `VariableTracker.can_reconstruct` and graph break if failed.
That has 2 issues:
1. Implementation of `can_reconstruct` is incorrect, since logic of
reconstructability isn't necessarily encapsulated in
`VariableTracker.reconstruct` -- for some VTs like `CellVariable`,
it's also in `SideEffects.codegen_save_tempvars`. This is exposed by
#134731.
2. We don't always need to reconstruct the result of
`functools.wraps(...)`; for those cases we don't want to give up
tracing by an early `can_reconstruct` check. Instead we could just
let it fall through, and graph break in the actual `reconstruct` call
later, if needed.
This patch removes the `can_reconstruct` check altogether. It was
introduced in #114279, but the added tests pass even without the check
now; this might be because of some recent bug fixing on cells and side
effects.
Fixes #134731, #141514.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142000
Approved by: https://github.com/zou3519
Hermite polynomials diverge to NaN at high orders due to numerical overflow. The proposal is to prematurely return NaN if it is known that at this value it will be NaN.
According to my short test
```Python
import torch

device = "cuda"
dtype = torch.float32
x = torch.linspace(-1000, 1000, 100000, device=device, dtype=dtype)

for n in range(1024):
    if torch.special.hermite_polynomial_h(x, n).isnan().sum().item() == x.shape[0]:
        print(f"hermite_polynomial_h: all outputs are nans! n = {n}")
        break

for n in range(1024):
    if torch.special.hermite_polynomial_he(x, n).isnan().sum().item() == x.shape[0]:
        print(f"hermite_polynomial_he: all outputs are nans! n = {n}")
        break
```
The output values become NaNs at these orders:
```
hermite_polynomial_h: all outputs are nans! n = 53, dtype=torch.float32
hermite_polynomial_he: all outputs are nans! n = 61, dtype=torch.float32
hermite_polynomial_h: all outputs are nans! n = 272, dtype=torch.float64
hermite_polynomial_he: all outputs are nans! n = 304, dtype=torch.float64
```
Surely, it makes sense to increase the limit as a safety margin.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141955
Approved by: https://github.com/malfet
This PR refactors all builtin autograd nodes (e.g. MulBackward0) from
having a single MulBackward0::apply into having:
- a "pure function variant" `MulBackward0_apply_functional`
- a stateful variant MulBackward0::apply that ends up calling
`MulBackward0_apply_functional`.
In order to do this we left the stateful pieces in MulBackward0::apply
(like unpacking of saved vars, determining which gradients actually need
computing).
The motivation is that this will be useful for compiled autograd in a
future PR. We might refactor this more later, but I wanted to get
something reviewed, shipped, and tested in-tree because the entire stack
is going to be big and this change by itself might have subtle perf issues.
The new codegen looks like the following:
- https://gist.github.com/zou3519/84721cfbef71bb640ddf1a64ef8583a3
Here's the old codegen for comparison:
- https://gist.github.com/zou3519/73f925fe6aca6dd3ceb0a6e6fcf5f77d
Test Plan:
- existing tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141278
Approved by: https://github.com/soulitzer
- Set the dtype of "acc" appropriately so that epilogue fusion will have args with dtype
- Update dtype propagation to use `type_to_dtype` instead of instantiating tensor
- Throw if we have a string arg where we should have a proper CSEVariable, unless we're doing the Modification Subgraph thing, which is not yet implemented. Everything else is appropriately typed (cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov @drisspg ).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141991
Approved by: https://github.com/drisspg
ghstack dependencies: #139945, #140057, #141495, #141882
# Summary
Fix tensor argument ordering for autotuning flex attention, and change how we enable scatter codegen for Triton. We used to go through the existing store_output Triton codegen, but now we short-circuit and generate the correct expression earlier on.
This enables us to instead of relying on arg.python_defs to thread arguments through via input_buffers we can instead reuse the exact same mutated buffer infra as we did for multiple outputs before.
Test cases added for both default and max-autotune-no-cudagraphs modes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141531
Approved by: https://github.com/Chillee
(Got reverted due to a silly bug, fixed now.)
This is causing FbFxGraphRemoteCache.init to no longer be idempotent, i.e. only safe to call once per compile. AOTAutogradCache initializes a new remote cache for the forward and the backward.
Technically, we could make AOTAutogradCache smart and globally thread through a single FbFxGraphRemoteCache everywhere. But there's no reason to do so, as this class is just the handle to access the cache. Plus, it's very brittle for FbFxGraphRemoteCache to not be safe to call multiple times
Differential Revision: [D66701970](https://our.internmc.facebook.com/intern/diff/D66701970/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141967
Approved by: https://github.com/laithsakka
Fixes https://github.com/pytorch/pytorch/issues/141710. Previously, if we called flash attention with a `SymFloat` scale that was properly symbolic, we would unsafely cast its raw `SymFloat._data` into a `float`, which is pretty much guaranteed to give `NaN`.
This avoids the NaNs in the linked issue, but I'm not sure if it's worth landing yet because we'll start specializing and recompiling for every distinct `scale` that is passed in (which in the dynamic shapes case, is some function of `query.size(-1)`).
The real fix would be to ensure that the flash attention (and related) ops all accept a symbolic version of the `scale`. I'm not sure if we should use `SymFloat` or `Scalar` though - more discussion in the issue
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141725
Approved by: https://github.com/ezyang
There are four core ATen ops with Composite Implicit Autograd (CIA) dispatch: upsample_bilinear2d.vec, upsample_nearest2d.vec, avg_pool1d, and adaptive_avg_pool1d. Op variant auto-generation is currently skipped for CIA ops. In preparation to disable the decompositions for upsample ops by default in export, we need to generate out variants for these ops.
This change enables autogen for core-tagged CIA ops, which enables generation of upsample_bilinear2d.vec_out and upsample_nearest2d.vec_out.
Test Plan:
Added a new test test_functional_variant_autogen_out_variant_core to cover this case in test_codegen.py.
Confirmed that upsample_bilinear2d.vec_out and upsample_nearest2d.vec_out op overloads are registered (they were previously not available).
Differential Revision: D66590257
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141797
Approved by: https://github.com/larryliu0820
If (ynumel / YBLOCK) > get_max_ygrids(), the z dimension will be used if znumel is None. However, if (ynumel / YBLOCK) % get_max_ygrids() != 0, there will be program launches with inputs that require masking, and so this needs to be considered when determining if the y dimension has a constant mask.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139751
Approved by: https://github.com/eellison
Co-authored-by: George White <georgew@graphcore.ai>
Add option to split Linear gates for Quantizable LSTM into separate ops (#141366)
Summary:
Reattempt to land D65283170, adding pyre-fixmes / mypy ignores following D52890934
For LSTM, the input and hidden state are projected with Linear layers to construct the 4 gates. This is typically performed together as a single Linear (for each state) with output channel count `4 * hidden_dim` for efficiency.
https://www.internalfb.com/code/fbsource/[ebef7c4238aa55948b2b444044f2c8ed2040de55]/fbcode/caffe2/torch/ao/nn/quantizable/modules/rnn.py?lines=52-58
The output is then ultimately split into 4:
https://www.internalfb.com/code/fbsource/[ebef7c4238aa55948b2b444044f2c8ed2040de55]/fbcode/caffe2/torch/ao/nn/quantizable/modules/rnn.py?lines=83-87
For on-device latency (and possibly memory) considerations, we want to avoid constructing the intermediate `gates` tensor (which can be relatively large), by splitting `igates` and `hgates` first (as 4x `Linear(hidden_dim, hidden_dim)` each), applying add separately, then proceeding as usual.
This functionality can be enabled by specifying `split_gates=True` (default False is original behavior) at any entry point (directly with `torch.ao.nn.quantizable.LSTM` or via `_get_lstm_with_individually_observed_parts`).
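As a standalone illustration of the idea (this sketch is not the actual rnn.py code), splitting the combined projection into per-gate Linears avoids materializing the `4 * hidden_dim` intermediate:
```python
import torch
import torch.nn as nn

hidden_dim = 32
x = torch.randn(8, hidden_dim)

# Original behavior: one Linear produces all 4 gates, then the result is split.
combined = nn.Linear(hidden_dim, 4 * hidden_dim)
i_gate, f_gate, g_gate, o_gate = combined(x).chunk(4, dim=1)

# split_gates=True behavior: four Linear(hidden_dim, hidden_dim) projections,
# so the intermediate (batch, 4 * hidden_dim) gates tensor is never built.
split = nn.ModuleList(nn.Linear(hidden_dim, hidden_dim) for _ in range(4))
i2, f2, g2, o2 = (proj(x) for proj in split)
```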
Test Plan:
piggy back on existing test to check for correct swap handling, numerics, and jit.script during prepare/convert
```
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/quantization:test_quantization -- --exact 'caffe2/test/quantization:test_quantization - test_custom_module_lstm (caffe2.test.quantization.core.test_quantized_op.TestQuantizedOps)'
```
https://www.internalfb.com/intern/testinfra/testrun/4503599884152725
This test is now quite long-running (more than double the original time).
---
shorter test to confirm original `LSTMCell` passes
```
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test:quantization_fx -- --exact 'caffe2/test:quantization_fx - test_static_lstm_with_custom_fixed_qparams (quantization.fx.test_quantize_fx.TestQuantizeFx)'
```
https://www.internalfb.com/intern/testinfra/testrun/11258999127933996
Reviewed By: Ninja91
Differential Revision: D66380336
- Turn fx_codegen_and_compile() into a class (FxCompile) so we can override the implementation.
- Pull the current body into an implementation (_InProcessFxCompile) which just performs the existing behavior.
- Add an async interface. (See below)
The intended future behavior of Async Compile will be to allow dynamo functions to start compiling in the background (and on a separate machine) while we continue to run eager in the foreground. As such we'll need to put the compilation behind some kind of Future implementation - it makes sense to reuse the existing python futures for that. An async function is just a syntactic way to return an asyncio.Future.
Because asyncio.run() adds confusion to the stack traces when the called function isn't actually being used in an asynchronous way, we also provide a synchronous interface which can be called directly.
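A rough sketch of the resulting shape (class and method names here are illustrative, not the exact ones introduced by the PR):
```python
from abc import ABC, abstractmethod

class FxCompile(ABC):
    """Strategy object for compiling an FX graph; implementations can be swapped."""

    @abstractmethod
    def codegen_and_compile(self, gm, example_inputs):
        """Synchronous entry point, callable directly so stack traces stay clean."""

    async def codegen_and_compile_async(self, gm, example_inputs):
        # Default async wrapper; a background/remote implementation would
        # override this to kick off compilation without blocking the caller.
        return self.codegen_and_compile(gm, example_inputs)

class _InProcessFxCompile(FxCompile):
    def codegen_and_compile(self, gm, example_inputs):
        # The body of the existing fx_codegen_and_compile() would live here.
        raise NotImplementedError("placeholder for the existing in-process path")
```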
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141505
Approved by: https://github.com/ezyang
ghstack dependencies: #141502
Moved some code from FxGraphCache.lookup_graph() which dealt with serializing and deserializing CompiledFxGraph into CompiledFxGraph itself so it can be reused later by Async Compile.
Async Compile will need to serialize the compiled CompiledFxGraph from one process and deserialize it in another - so it's very similar to the cache.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141502
Approved by: https://github.com/ezyang
Summary:
We were ignoring the with_escaped_quotes param in the format_list inline function in utils.cpp in the case where we had to truncate a list of more than kTruncatelength items.
In that case we would truncate the list into a string but always return it wrapped in escaped quotes. This causes issues if the string is meant to be added to other lists which also go through formatting, leading to cases like `"["[a, b, c, ...]"]"`.
Now the above will be well formatted as `"[[a, b, c, ...]]"`, since the escape-quote requests are honored.
Differential Revision: D66521676
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141626
Approved by: https://github.com/sraikund16
Summary: This particular test isn't really needed since the code path is already exercised in `test_freezing`. While I was here, I beefed up testing in that method to consider whether the frozen parameter is inlinable vs. not, since the caching behavior is different.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141898
Approved by: https://github.com/ezyang, https://github.com/jansel
This logs a waitcounter of the name pytorch.dynamo_timed.{key}.
Primarily sending this now to make sure everyone likes the API, then
I'll add tests, and migrate one dynamo_timed to use it. (likely starting
with
https://github.com/pytorch/pytorch/pull/141379).
Testing is a bit harder, since we don't normally have any way to read
_WaitCounter state AFAICT. I want to poke around and see if I can figure
out a way to read the state, otherwise I'll just mock it to at least
make sure it's mostly working.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141402
Approved by: https://github.com/jamesjwu, https://github.com/masnesral
Thanks @drisspg and @albanD for finding the fix
**TEST PLAN**
```
import gc
import torch
import torch.nn as nn
from torch.utils.module_tracker import ModuleTracker
class MyModel(nn.Module):
    def forward(self, x):
        return x * x

print(f"torch=={torch.__version__}")
m = MyModel()
m.cuda()
m.to(torch.bfloat16)
mt = ModuleTracker()
for i in range(1000):
    if i % 100 == 0:
        gc.collect()
        print("memory_allocated:", torch.cuda.memory_allocated())
    x = torch.randn([128, 256], device="cuda", dtype=torch.bfloat16, requires_grad=True)
    with mt:
        m(x)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141960
Approved by: https://github.com/albanD
FSDP1's `fully_shard` frontend was an exploration at the end of 2022 H2 as part of the `torch/distributed/_composable` APIs to avoid `nn.Module` wrappers. It calls into the same backend code as FSDP1's `FullyShardedDataParallel`.
The API did not gain traction internally, so we instead reused the name `fully_shard` for FSDP2, which similarly is not an `nn.Module` wrapper and follows similar design principles as FSDP1's `fully_shard`.
To the best of our knowledge, we have removed all instances of FSDP1's `fully_shard` internally, and we put the deprecation warning in open source in 2.4 saying it will be removed after 2.5 (which is now):
4959784dac/torch/distributed/_composable/fully_shard.py (L40-L48)
We are skipping the PR sanity check because this PR is only removing code, not adding new code, and should not require this sanity check.
Differential Revision: [D66664988](https://our.internmc.facebook.com/intern/diff/D66664988)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141875
Approved by: https://github.com/weifengpy
We've been using it privately for half a year and everything's been
good. This PR:
1. Makes torch.library.triton_op public
2. Renames capture_triton -> wrap_triton. We got feedback that no one
knew what "capture triton" does.
3. Makes torch.library.wrap_triton public.
triton_op is used to construct a Python custom operator that may call 1+
triton kernels. Each of those triton kernels must be annotated with
wrap_triton.
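A minimal usage sketch following the documented pattern (the `mylib::add` op and the kernel are made up for illustration; running it requires a GPU with Triton available):
```python
import torch
import triton
import triton.language as tl
from torch.library import triton_op, wrap_triton

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)

@triton_op("mylib::add", mutates_args={})
def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    # Each triton kernel called inside a triton_op must go through wrap_triton.
    wrap_triton(add_kernel)[(triton.cdiv(n, 1024),)](x, y, out, n, BLOCK_SIZE=1024)
    return out

# The resulting custom op works eagerly and under torch.compile, e.g.:
# z = add(torch.randn(32, device="cuda"), torch.randn(32, device="cuda"))
```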
Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141880
Approved by: https://github.com/albanD
ghstack dependencies: #141894
Fixes #139970, #139812.
Revise mkldnn pattern matcher UTs, to check the relevant specific matched patterns instead of the total matched number.
1) Add the missing specific counters in pattern matchers, e.g. `mkldnn_unary_fusion_matcher_nodes`/`mkldnn_conv_weight_pack_matcher_count`.
2) In UTs, change the general `matcher_count`/`matcher_nodes` checks to the specific ones, e.g. `mkldnn_unary_fusion_matcher_nodes`/`mkldnn_conv_weight_pack_matcher_count`.
3) In UTs, remove the option of `matcher_count`/`matcher_nodes` params in _test_common and make `matcher_check_fn` a necessary param.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141334
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jansel
As title, this also uncovered a few invalid use cases; the cases that
cause error are fixed in separate patches prior to this patch, and the
rest are fixed in this patch.
This patch also moves a few `.source` mutation to variable construction,
to increase the coverage of the validation.
Fixes #133027.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141717
Approved by: https://github.com/jansel
ghstack dependencies: #141713, #141714, #141715, #141902, #141716
A subsequent patch attempts to fix a side-effect issue for range
iterators, which in turn exposed an existing issue with guards for range
iterators -- the following test started failing:
```
PYTORCH_TEST_WITH_DYNAMO=1 python test/test_tensor_creation_ops.py TestTensorCreationCPU.test_hstack_column_stack_cpu_int16
```
This patch adds a `RANGE_ITERATOR_MATCH` guard to make sure that we
properly guard on range iterators, and adds a regression test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141902
Approved by: https://github.com/jansel
ghstack dependencies: #141713, #141714, #141715
Dynamo accidentally passed the original `ConstDictVariable.source` to
the result of `dict.copy(...)`, which caused aliasing issue when the
result escapes the graph (e.g., is a return value).
This patch fixes that and adds a regression test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141715
Approved by: https://github.com/jansel
ghstack dependencies: #141713, #141714
Previously we never replayed side effects to `DequeVariable` with a
source; the bug was already in the `test_deque_input` test, but went
unnoticed because we didn't check the deque objects.
This patch adds limited but practical support for this (see comments in
`side_effects.py` for why limited), and updates the deque tests to check
for this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141714
Approved by: https://github.com/jansel
ghstack dependencies: #141713
Currently constants are not broadcasted on vectorised stores in `CppTile2DKernel`. This leads to errors like the following:
```shell
error:: request for member 'store' in 'tmp1', which is of non-class type 'signed char'
61 | tmp1.store(tmp2 + static_cast<int64_t>(8L*x0_inner), static_cast<int64_t>(8));
| ^~~~~
```
This PR adds the required broadcasting.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140262
Approved by: https://github.com/jgong5
This PR removes the copycast of reduced-precision types to float before printing, which was added in https://github.com/pytorch/pytorch/pull/14418, probably to unblock printing when many operations, like `is_nan` and `max`, were not supported for those types on CPU.
(Reusing old test plan) Before the PR:
```python
In [1]: import torch; a = torch.rand(1, 1700, 34, 50, dtype=torch.float16)
In [2]: %timeit str(a)
621 μs ± 5.06 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
```
after the PR
```python
In [1]: import torch; a = torch.rand(1, 1700, 34, 50, dtype=torch.float16)
In [2]: %timeit str(a)
449 μs ± 2.34 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
```
Also, this allows printing 15GB Metal tensors on a 32GB Mac machine:
```
% python3 -c "import torch;print(torch.empty(72250,72250, device='mps', dtype=torch.float16))"
tensor([[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]], device='mps:0', dtype=torch.float16)
```
Before this change it failed with non-descriptive
```
% python3 -c "import torch;print(torch.empty(72250,72250, device='mps', dtype=torch.float16))"
Traceback (most recent call last):
File "<string>", line 1, in <module>
import torch;print(torch.empty(72250,72250, device='mps', dtype=torch.float16))
~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/malfet/git/pytorch/pytorch/torch/_tensor.py", line 568, in __repr__
return torch._tensor_str._str(self, tensor_contents=tensor_contents)
~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/malfet/git/pytorch/pytorch/torch/_tensor_str.py", line 708, in _str
return _str_intern(self, tensor_contents=tensor_contents)
File "/Users/malfet/git/pytorch/pytorch/torch/_tensor_str.py", line 625, in _str_intern
tensor_str = _tensor_str(self, indent)
File "/Users/malfet/git/pytorch/pytorch/torch/_tensor_str.py", line 339, in _tensor_str
self = self.float()
RuntimeError: Invalid buffer size: 19.45 GB
```
Fp8 dtypes are converted to float16 instead, as the full float range is overkill for them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141927
Approved by: https://github.com/ezyang
Summary:
Early prototype of adding CK backend for aten.bmm. Currently, it is very limited in that:
1. BF16 only
2. A single CK instance
3. NT layout only
4. Alpha=1, Beta=0 only
Reviewed By: xw285cornell, zjing14
Differential Revision: D65954695
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140758
Approved by: https://github.com/bradleyhd
Using EC2 G6 instance, based on NVIDIA L4, added to scale config in https://github.com/pytorch/test-infra/pull/5376
To enable more balanced sharding, I had to push 148ae19935.
Added `@xfailIfSM89` to the following tests:
- test_fp8_pattern_2
- test_original_aten_preserved_split_addmm
- test_sparse_semi_structured_scaled_mm
- test_sparse_semi_structured_scaled_mm_fp8
- test_sparse_fp8fp8_mm
Increased tolerance to 2e-4 for `RNNTest.BidirectionalMultilayerGRU_CPU_vs_CUDA`
Skipped the following inductor tests (which either flakily OOM or time out):
- test_reduction_fn_std_float64
- test_reduction_fn_var_mean_float64
- test_multi_output_unbacked_custom_op
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140305
Approved by: https://github.com/wdvr, https://github.com/ZainRizvi
Inputs with the same shape can be represented with different symbols, e.g.
```python
def body_fn(a, b):
return b.sin(), a.sin()
```
, where a = torch.randn(3, 4) and b = torch.randn(3, 4). There could be 4 symbols allocated for a and b, so instead of checking that their shape and stride symbols are the same, we just use guard_equals to enforce the constraint.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141734
Approved by: https://github.com/zou3519, https://github.com/eellison
This PR fixes the shape checks that are done in the associative_scan operation.
Previously, all shapes of the input leaves were required to be the same. With this PR, only the shapes of the combine_fn outputs and the corresponding input leaves need to match; the input leaves no longer need to match among each other.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141698
Approved by: https://github.com/ydwu4
Fixes https://github.com/pytorch/pytorch/issues/141139
How the 3 versions of the error message now look
### Version 1
Old error message:
```
_pickle.UnpicklingError: Weights only load failed. This file can still be loaded, to do so you have two options, do those steps only if you trust the source of the checkpoint.
(1) Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
(2) Alternatively, to load with `weights_only=True` please check the recommended steps in the following error message.
WeightsUnpickler error: Unsupported global: GLOBAL __main__._rebuild_class_that_uses_build_instruction was not an allowed global by default. Please use `torch.serialization.add_safe_globals([_rebuild_class_that_uses_build_instruction])` or the `torch.serialization.safe_globals([_rebuild_class_that_uses_build_instruction])` context manager to allowlist this global if you trust this class/function.
Check the documentation of torch.load to learn more about types accepted by default with weights_only https://pytorch.org/docs/stable/generated/torch.load.html.
```
New error message:
```
_pickle.UnpicklingError: Weights only load failed. This file can still be loaded, to do so you have two options, do those steps only if you trust the source of the checkpoint.
(1) In PyTorch 2.6, we changed the default value of the `weights_only` argument in `torch.load` from `False` to `True`. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
(2) Alternatively, to load with `weights_only=True` please check the recommended steps in the following error message.
WeightsUnpickler error: Unsupported global: GLOBAL __main__._rebuild_class_that_uses_build_instruction was not an allowed global by default. Please use `torch.serialization.add_safe_globals([_rebuild_class_that_uses_build_instruction])` or the `torch.serialization.safe_globals([_rebuild_class_that_uses_build_instruction])` context manager to allowlist this global if you trust this class/function.
Check the documentation of torch.load to learn more about types accepted by default with weights_only https://pytorch.org/docs/stable/generated/torch.load.html.
```
### Version 2
Old error message:
```
_pickle.UnpicklingError: Weights only load failed. ``torch.nested`` and ``torch._dynamo`` must be imported to load nested jagged tensors (NJTs)
```
New error message:
```
_pickle.UnpicklingError: Weights only load failed. ``torch.nested`` and ``torch._dynamo`` must be imported to load nested jagged tensors (NJTs)
In PyTorch 2.6, we changed the default value of the `weights_only` argument in `torch.load` from `False` to `True`. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
```
### Version 3
Old error message
```
_pickle.UnpicklingError: Weights only load failed. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
Trying to load unsupported GLOBAL posix.execv whose module posix is blocked.
Check the documentation of torch.load to learn more about types accepted by default with weights_only https://pytorch.org/docs/stable/generated/torch.load.html.
```
New error message
```
_pickle.UnpicklingError: Weights only load failed. In PyTorch 2.6, we changed the default value of the `weights_only` argument in `torch.load` from `False` to `True`. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
Trying to load unsupported GLOBAL posix.execv whose module posix is blocked.
Check the documentation of torch.load to learn more about types accepted by default with weights_only https://pytorch.org/docs/stable/generated/torch.load.html.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141304
Approved by: https://github.com/zou3519
This PR introduces support for finding HIP-SDK Libraries on Windows.
Since reading the code changes using the diff view is a bit cumbersome due to introduced if branch, let me explain what was changed:
- The linux-specific steps to find HIP packages have been dragged into `if(UNIX) block`
- Windows steps follow in the `else()` clause
The separation was needed, because of several factors:
- HIP SDK for Windows typically names its components using `hip` in their names (for example: `hip_version.h` instead of `rocm_version.h`, `HIP_VERSION_DEV_MAJOR` instead of `ROCM_VERSION_DEV_MAJOR`, etc.),
- The libraries included in HIP SDK are only a subset of what is available in Linux ROCm (missing hsa-rt, rccl, roctx)
- MIOpen isn't a part of HIP SDK, but can be built separately and as of now requires an additional path to be defined using an env var.
- Windows can only find the hip package in a version greater than 1.0, and its libraries, if the lowercase `find_package(hip ...)` is invoked first. This is because the lowercase `hip` name will cause the mechanism to find hip's packages using [config mode](https://cmake.org/cmake/help/latest/command/find_package.html#search-modes), which is the only one supported on Windows, assuming we also want to [include its libraries](https://rocm.docs.amd.com/en/latest/conceptual/cmake-packages.html#consuming-the-hip-api-in-c-code). The upper-case module-mode-searched `find_package(HIP)` is used later for inclusion of macros such as `hip_add_library` and related macros.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137279
Approved by: https://github.com/jeffdaily
This is causing FbFxGraphRemoteCache.init to no longer be idempotent, i.e. only safe to call once per compile. AOTAutogradCache initializes a new remote cache for the forward and the backward.
Technically, we could make AOTAutogradCache smart and globally thread through a single FbFxGraphRemoteCache everywhere. But there's no reason to do so, as this class is just the handle to access the cache. Plus, it's very brittle for FbFxGraphRemoteCache to not be safe to call multiple times.
(Same problem, different fix of D66502138)
Differential Revision: [D66508492](https://our.internmc.facebook.com/intern/diff/D66508492/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141707
Approved by: https://github.com/ezyang
Certain `cpp_wrapper`-enabled tests were OOM-ing in the CI pipeline, with error messages suggesting that sufficient memory was accessible. This ultimately resulted from an internal memory limitation that was not queryable in the API. This PR adds querying for that limit.
Additionally, the failing tests had incorrect memory availability checks, and are updated with measured memory requirements.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140620
Approved by: https://github.com/malfet, https://github.com/eqy
ghstack dependencies: #141367
By introducing `DynamicMetalShaderLibrary` and `MetalShaderFunction`
Added unit tests that also serve as an example of how the API works.
Using this primitive, one can compile and dispatch any 1D or 2D shader over an MPS tensor using the following pattern:
```cpp
auto x = torch::empty({8, 16}, at::device(at::kMPS));
DynamicMetalShaderLibrary lib(R"MTL(
  kernel void full(device float* t, constant ulong2& strides, uint2 idx [[thread_position_in_grid]]) {
    t[idx.x*strides.x + idx.y*strides.y] = idx.x + 33.0 * idx.y;
  }
)MTL");
auto func = lib.getKernelFunction("full");
func->runCommandBlock([&] {
  func->startEncoding();
  func->setArg(0, x);
  func->setArg(1, x.strides());
  func->dispatch({8, 16});
});
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141547
Approved by: https://github.com/Skylion007
Fixes https://github.com/pytorch/pytorch/issues/141332
`F.logsigmoid` will return two outputs: `output` and `buffer`.
For the `F.logsigmoid` CPU path, `buffer` stores intermediate values used when computing gradients, so a `buffer` tensor with nonzero size is returned. For the CUDA and XPU paths, `buffer` is unused, so the `buffer` tensor size for XPU `F.logsigmoid` is zero, just like CUDA. The root cause of the issue is that the code in `decompositions.py` (ref: https://github.com/pytorch/pytorch/blob/main/torch/_decomp/decompositions.py#L2803) only handles the CUDA case; when a fake tensor on an XPU device reaches it, it takes the CPU path and returns a `buffer` with nonzero size, which conflicts with the implementation of the concrete Intel XPU tensor. Therefore this PR adds conditions to handle the XPU case and make sure the two returned buffer sizes match each other.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141333
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/ezyang
This patch
1. removes `AutoDerefLocalSource` in favor of `LocalSource`, thereby
removing its special handling in guards.
2. introduces a `LocalCellSource` for cells from the root frame, with
only `reconstruct` implemented, to programmatically enforce that these
cells should never be used by other components like guards.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141629
Approved by: https://github.com/jansel
ghstack dependencies: #141628
This gets a bit messy, but this appears to be the best spot to make a
true / false decision.
Note that since we're looking at whether or not it's used, if the pool
doesn't warm up within the time it takes for a compile, we will mark the
feature use as false.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141074
Approved by: https://github.com/masnesral
ghstack dependencies: #141059
Fixes #139970, #139812.
Revise mkldnn pattern matcher UTs, to check the relevant specific matched patterns instead of the total matched number.
1) Add the missing specific counters in pattern matchers, e.g. `mkldnn_unary_fusion_matcher_nodes`/`mkldnn_conv_weight_pack_matcher_count`.
2) In UTs, change the general `matcher_count`/`matcher_nodes` checks to the specific ones, e.g. `mkldnn_unary_fusion_matcher_nodes`/`mkldnn_conv_weight_pack_matcher_count`.
3) In UTs, remove the option of `matcher_count`/`matcher_nodes` params in _test_common and make `matcher_check_fn` a necessary param.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141334
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
**Summary**
When addressing [issue #134998](https://github.com/pytorch/pytorch/issues/134998), we will verify if any node in the current graph shares the same storage as the node we intend to prune. In the implementation, we assumed that when creating the `GraphLowering` in post-grad phase, there would be no `submodules`, and all `get_attr` nodes would correspond to a `torch.Tensor`. However, this assumption proves incorrect when enabling `FlexAttention`. In this scenario, `submodules` are present as `get_attr` node in post-grad phase. For example:
```
V1128 23:23:47.071000 1965794 torch/_inductor/compile_fx.py:875] [0/1] [__post_grad_graphs] class sdpa_score30(torch.nn.Module):
V1128 23:23:47.071000 1965794 torch/_inductor/compile_fx.py:875] [0/1] [__post_grad_graphs] def forward(self, arg0_1: "bf16[][]cpu", arg1_1: "i32[][]cpu", arg2_1: "i32[][]cpu", arg3_1: "i32[][]cpu", arg4_1: "i32[][]cpu"):
V1128 23:23:47.071000 1965794 torch/_inductor/compile_fx.py:875] [0/1] [__post_grad_graphs] return arg0_1
V1128 23:23:45.482000 1965794 torch/_inductor/freezing.py:118] [0/1] sdpa_score30 = self.sdpa_score30
V1128 23:23:45.482000 1965794 torch/_inductor/freezing.py:118] [0/1] sdpa_mask30 = self.sdpa_mask30
V1128 23:23:45.482000 1965794 torch/_inductor/freezing.py:118] [0/1] flex_attention_30 = torch.ops.higher_order.flex_attention(add_276, index_put_60, index_put_61, sdpa_score30, (_frozen_param293, _frozen_param295, _frozen_param296, _frozen_param297, _frozen_param298, _frozen_param299, _frozen_param300, _frozen_param301, 64, 64, sdpa_mask30), 0.08838834764831843, {'SKIP_MASK_SCORE': True, 'PRESCALE_QK': False, 'ROWS_GUARANTEED_SAFE': False, 'BLOCKS_ARE_CONTIGUOUS': False, 'OUTPUT_LOGSUMEXP': False}, (), (_frozen_param294,)); add_276 = sdpa_score30 = sdpa_mask30 = None
V1128 23:23:45.482000 1965794 torch/_inductor/freezing.py:118] [0/1] getitem_60: "bf16[1, 32, 1, 128]" = flex_attention_30[0]; flex_attention_30 = None
```
We added an extra check in the implementation to ensure only comparing the `get_attr` node with `torch.Tensor`. It is difficult to reproduce this issue using pure high-order operators. Adding a unit test after https://github.com/pytorch/pytorch/pull/141453 lands would be more straightforward.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141798
Approved by: https://github.com/jgong5
When the input tensor to Conv3d is in the channels_last_3d memory format the Conv3d op will generate incorrect output (see example image in #141471). This PR checks if the op is 3d, and then attempts to convert the input tensor to contiguous.
Added a regression test that verifies the output by running the same op on the CPU.
I'm unsure if Conv3d supports the channels last memory format after #128393. If it does, we should consider updating the logic to utilize this as it would be more efficient. Perhaps @DenisVieriu97 knows or has more context?
Fixes #141471
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141780
Approved by: https://github.com/malfet
Preparatory refactor for https://github.com/pytorch/pytorch/pull/137243. Previously, we would typically check for reductions by `tree.prefix == "r"`. This PR moves the check into a helper function. This makes it easier to generalize the code to multi-dimensional reductions, which could have multiple prefixes like `("r0_", "r1_")`.
Tested by the existing CI.
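A minimal sketch of the kind of helper this refers to (the exact name and location in inductor may differ):
```python
def prefix_is_reduction(prefix: str) -> bool:
    # Generalizes the old `tree.prefix == "r"` check so multi-dimensional
    # reduction prefixes like "r0_" and "r1_" are also recognized.
    return prefix.startswith("r")

assert prefix_is_reduction("r0_") and prefix_is_reduction("r1_")
assert not prefix_is_reduction("x")
```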
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141738
Approved by: https://github.com/jansel
Summary: To make sure schema.py and schema.thrift are kept in sync, we use the int keys from thrift and use Python Annotated type to associate fields between thrift and schema.py. Later we will use this association to build a single source of truth between the schemas.
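As a rough illustration of the association (the class, field names, and ids below are made up, not the real export schema):
```python
from dataclasses import dataclass
from typing import Annotated, List

# The Annotated metadata carries the thrift field id, so a checker can verify
# that schema.py and schema.thrift agree field-by-field.
@dataclass
class TensorMeta:
    dtype: Annotated[str, 10]            # thrift field 10
    sizes: Annotated[List[int], 20]      # thrift field 20
    requires_grad: Annotated[bool, 30]   # thrift field 30
```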
Test Plan: CI
Differential Revision: D66253157
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141611
Approved by: https://github.com/yiming0416
**Summary**
DTensor RNG code raises an error if the seed passed in is beyond the `torch.int64` range (e.g. `torch.tensor([2**64-1])` raises an error). The solution is to specify `dtype=torch.uint64` in the `torch.tensor()` call.
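A minimal sketch mirroring the fix described above:
```python
import torch

seed = 2**64 - 1                       # beyond the torch.int64 range
# torch.tensor([seed])                 # raises: value does not fit in int64
seed_t = torch.tensor([seed], dtype=torch.uint64)  # the fix described above
```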
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141532
Approved by: https://github.com/wconstab
ghstack dependencies: #141731, #141220, #141223
**Summary**
This PR proposes 4 changes to DTensor RNG management:
1. DTensor allows users to eagerly initialize the RNG tracker by calling `torch.distributed.tensor._random.manual_seed`.
2. DTensor `manual_seed` no longer checks the integrity of the `seed` argument. Users are responsible for setting the same seed on all ranks within an SPMD group, but if there are multiple separate SPMD groups (e.g. across pipeline stages), users should set a _different_ seed for each SPMD group. For cases like Pipeline Parallel, users can set different initial seed for pipelining stages by calling
```
world_mesh = init_device_mesh(
    device_type="cuda",
    mesh_shape=(2, 2, 2),
    mesh_dim_names=("pp", "dp", "tp"),
)
pp_mesh = world_mesh["pp"]
pp_rank = pp_mesh.get_local_rank()
spmd_mesh = world_mesh["dp", "tp"]._flatten("spmd") # this flattening is only needed if you need to call collective over this mesh
torch.distributed.tensor._random.manual_seed(123+pp_rank, spmd_mesh)
```
In other words, if users want to call `torch.distributed.tensor._random.manual_seed`, they are responsible for passing in the right value, and DTensor won't perform any checks on it. If the current rank is not a part of the mesh, it will use the current device RNG state to initialize.
3. `OffsetBasedRNGTracker` still performs RNG state synchronization by broadcasting the RNG state on rank 0 to `WORLD`. However, calling `torch.distributed.tensor._random.manual_seed` is an exception. In this case, no broadcast will happen.
4. Enforce that the `manual_seed` call only accept "full mesh" i.e. the DTensor RNG state on every rank must be set through the call. This makes sure that no rank has its RNG state left uninitialized and the SPMD ranks have their RNG state synchronous.
**Motivation**
tl;dr
1. Lazily initializing DTensor RNG tracker causes hang in non-SPMD code such as Pipeline Parallel.
2. Users may want to set different seed on ranks in one device mesh.
3. We want to keep the old behavior if users prefer not curating the RNG state and want to have DTensor take care of it.
see detail in https://github.com/pytorch/pytorch/issues/140301
**Test**
`pytest test/distributed/_tensor/test_random_ops.py`
`pytest test/distributed/tensor/parallel/test_tp_random_state.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141223
Approved by: https://github.com/wconstab
ghstack dependencies: #141731, #141220
**Summary**
The ad-hoc DTensor RNG tracker was used to mimic Megatron DDP+TP RNG behavior, but it turns out to be incompatible with PyTorch Distributed FSDP2+TP, so we decided to deprecate it and replace it with `OffsetBasedRNGTracker`, which follows the SPMD semantics (replicas get the same random sampling result, shards get different results).
**Motivation**
`TensorParallelRNGTracker` was designed for DDP+TP where the random operators produce the same result along the data parallel mesh dimension and different results along the tensor parallel dimension. However this does not apply to the new FSDP+TP composable combination where the model weights are sharded along data parallel mesh dimension as well. Therefore we decide to remove this outdated RNG tracker type for now. If users have demands for exact match between PyTorch Distributed and Megatron on Random Number generation result, feel free to file an issue.
**Impact**
`TensorParallelRNGTracker` was only used when Tensor Parallel is used (i.e. calling `parallelize_module`).
For non-FSDP users, the "replicas get the same random numbers and shards get different ones" remains unchanged. Unlike `TensorParallelRNGTracker` which sets different seeds (`base_seed + 2718 + TP_rank`) within the TP group, DTensor now sets the same seed (default value is 1234 but users can call `torch.distributed.tensor._random.manual_seed` to modify) on all ranks but choose the right RNG offset based on DTensor placements to enforce the "replicas get the same random numbers and shards get different ones" invariant.
For FSDP2 users, improvement should be observed in a way that DTensor sharded within DP group now gets different random number sampling which `TensorParallelRNGTracker` failed to do, though we're not sure how much this change will improve the eventual training loss convergence.
**Test**
1-d model weight meta init:
`pytest test/distributed/_tensor/test_random_ops.py -s -k test_tp_model_meta_init`
2-d model weight meta init:
`pytest test/distributed/_tensor/test_random_ops.py -s -k test_fsdp_tp_model_meta_init`
TP model weight init test:
`pytest test/distributed/tensor/parallel/test_tp_random_state.py`
FSDP+TP model weight init test:
`pytest test/distributed/_composable/fsdp/test_fully_shard_init.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141220
Approved by: https://github.com/wconstab
ghstack dependencies: #141731
**Summary**
Added tests for model meta init on 1-d mesh (TP) and 2-d mesh (FSDP+TP). This exploits the issue where DTensor RNG failed to initialize weights differently across FSDP ranks.
**Test**
`pytest test/distributed/_tensor/test_random_ops.py -s -k meta_init`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141731
Approved by: https://github.com/wconstab
Annotate linear node for `linear_dynamic_fp16` with `X86InductorQuantizer`
After `convert_pt2e`, the pattern will be
```
x
|
linear <- to_fp32 <- to_fp16 <- w
```
**Test plan**
```
pytest test/quantization/pt2e/test_x86inductor_quantizer.py -k test_linear_dynamic_fp16
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141480
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
# Overview
Add monitor script to collect system-level utilization data during CI tests.
Currently all monitoring scripts are disabled.
# Details
- Add flag to customize the time intervals for logging
- Enable multiple GPU utilization logging
# Next step
Enable the monitor script in non-perf-test workflows.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141438
Approved by: https://github.com/huydhn
I was constantly annoyed by the fact that we had a separate else branch for when the cache was disabled, distinct from when the cache was bypassed. This diff gets rid of the disabled-cache branch, so we use the same logic for bypass/disable. I think this change probably didn't matter much for the POC, but it's cleaner.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141685
Approved by: https://github.com/aorenste
ghstack dependencies: #141681, #141683
Summary:
Fix logic for inserting broadcast on kernel with load going directly to store. In the case where load is going directly to store, we insert a tl.broadcast on the store, regardless of the block size on the load. In the case where a broadcast is not required, the downstream Triton compiler is expected to remove this no-op broadcast instruction.
Test Plan: Added tests under test_torchinductor_strided_blocks.py:test_expand_broadcast in OSS and internal test cases.
Reviewed By: blaine-rister
Differential Revision: D65518033
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141693
Approved by: https://github.com/blaine-rister
Summary:
This is very much the same solution proposed by bobrenjc93, except that it restricts it to expressions and axioms that have FloorDiv, since those are the only ones that could have become CleanDiv, and the only ones that can change as the shape env changes.
This also does not break the torchrec benchmarks. It might be worth knowing why the generalization of this does break them, but we could just be hitting another bug or NYI situation.
Overhead?
None on
```
buck2 run fbcode//mode/opt fbcode//torchrec/distributed/tests:pt2_compile_benchmark -- --num-features=1000
```
Differential Revision: D66307433
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141267
Approved by: https://github.com/ezyang
This PR combines Manylinux 2_28 and Manylinux 2014 builds of triton under one workflow. This is required in order to support torch cpu, cuda 118, cuda 12.4 wheels built with Manylinux 2014 and torch cuda 12.6 wheels built with Manylinux 2_28.
Manylinux 2014 wheels:
``pytorch_triton-3.2.0+git35c6c7c6-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl``
Manylinux 2_28 wheels:
``pytorch_triton-3.2.0+git35c6c7c6-cp39-cp39-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl``
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141704
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/huydhn
- Add in upcast_compute_type on creation of new tensors (loads, constants)
- Fixes index_expr - right now we are somewhat inconsistent in dtype and don't always respect the dtype specified. Would be nice to fix fully, but not in this PR.
- Bug fix in view dtype where we were always upcasting back to fp32 when the input was in bf16/fp16. We should only be doing that if the output is also in bf16/fp16.
- For masked, avoid calling dtype propagation and just use the output dtype.
Turns on the runtime dtype verification for opinfo tests. The separate test file is still useful because we can use it for testing turning off codegen_upcast_to_fp32.
Follow ups:
- We could consider requiring less explicit upcast_compute_types calls and do it automatically. That would potentially make things easier but be less flexible in the future. Maybe I should have done it this pr.
- Be more consistent on our index expr dtype printing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141495
Approved by: https://github.com/blaine-rister, https://github.com/arui-meta, https://github.com/ezyang
ghstack dependencies: #139945, #140057
The `__ANDROID__` macro was used as a proxy to check whether compilation is targeting a 32- or 64-bit system, causing build failures on non-Android 32-bit Linux targets like ARMv7.
This modification adjusts the check to fail if and only if int64_t and long are not the same type on 64-bit systems, i.e. systems where `sizeof(void*) == 8`.
Like I said in issue #141043, I'm not sure whether a different `Scalar` constructor should be defined in the 32-bit case. My code does not break, but I'm not sure other people's code won't.
Fixes #141043
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141244
Approved by: https://github.com/malfet
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
# Motivation
Use default context in Windows to keep consistency with Linux. It makes it easy to interact with external libraries like `dlpack`.
# Additional Context
This PR depends on Intel GPU oneAPI 2025.0.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138049
Approved by: https://github.com/gujinghui
So I found this utility by accident, trying to find how many html files we have in the repo so I could convert them to markdown.
Turns out we package some html and js files in pytorch to visualize torchscript models. This seems kinda strange and probably shouldn't be in core, so I removed the tests I could find. Maybe some internal tests will break, but considering torchscript is being superseded, it might make sense to do this.
The last time there was a meaningful update to the test for this file was about 2 years ago by @digantdesai; since then it's been a bunch of routine upgrades.
It seems like this package is unused https://github.com/search?type=code&auto_enroll=true&q=torch.utils.model_dump&p=1 I skimmed through 5 pages of these and the only time this shows up in code search is when someone is either cloning pytorch or checking in their venv into GitHub.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141540
Approved by: https://github.com/malfet
Prior to this patch, we are using `ConstantVariable.create` to create VT
for frozenset objects, and intended yet failed to predicate that on all
items being literals (see https://github.com/pytorch/pytorch/pull/140984#discussion_r1847393736).
The code was from https://github.com/pytorch/torchdynamo/commit/7c03434 and
the original goal was to help DBR quantization, but as the new test in
this patch shows, it could lead to silent incorrectness.
Upon a closer look, this exposes some subtleties in how Dynamo handles
`ConstantVariable` and `LOAD_CONST`, so this patch both fixes the
aforementioned issue and documents, enforces, and makes explicit the
invariants around `ConstantVariable` and `LOAD_CONST` -- only immutable
objects are supported.
Specifically, this patch:
1. refine the checks for wrapping a `frozenset` object, document why we
can't just wrap its items directly due to lack of `Source` for set
items, and use a safe workaround (`SourcelessBuilder`) to ensure
soundness while keeping the DBR quantization support.
2. Adds more types to `common_constant_types`, thereby making
`ConstantVariable.is_base_literal` more lenient, and strictly checks
this property in the constructor of `ConstantVariable`.
3. Change relevant uses of `create_instruction("LOAD_CONST", ...)` to
`create_load_const` which checks `is_safe_constant`, and makes
developer overrides explicit by using `create_load_const_unchecked`
when needed.
4. In a few places, use more specific `VariableTracker`, e.g.,
`TypingVariable` rather than `ConstantVariable`, and
`FrozensetVariable` rather than `SetVariable`.
(2) and (3) are mainly to future-proof Dynamo against bugs like (1).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141504
Approved by: https://github.com/jansel
Magma is built for specific CUDA versions and stored in the ossci-linux bucket. Install it from there rather than the deprecated conda package.
There are two places where magma is installed today:
- `install_conda.sh`: extract the magma package in the same exact location where conda would install it, using a dedicated `install_magma_conda.sh` script. The new script is included in the relevant Dockerfiles where CUDA+magma is needed
- `install_magma.sh`: this script already uses a tarball. Use the new tarball instead of the tarball from the conda package. The format of the new tarball is compatible with the old one, so changes here are minimal.
Fixes#140538
Test PR: https://github.com/pytorch/pytorch/pull/141584
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140417
Approved by: https://github.com/atalman
This turns on AOTAutogradCache for all inductor tests. It clears AOTAutogradCache on each test as well, by virtue of the local cache using the same directory to store cache entries.
I've also tested with INDUCTOR_TEST_DISABLE_FRESH_CACHE=1, running all the tests. AOTAutogradCache successfully caches 99% of these. There are a few tests that use view_replay and therefore save functional tensors, which cause AOTAutogradCache to fail to pickle its result. Will look into next steps there, but for now, it seems okay if the cache just misses on those cases where it can't serialize the result. It would be better to check before pickling, though.
I've made the following small bugfixes to get this working:
- Inductor is sometimes used in a standalone mode without dynamo, which leads to attribute errors in check_can_cache. In general, we should *never* crash in cache checking, only bypass. So I changed a try/except to catch Exception instead of just a specific exception.
- Add extra structured logging for metadata on cache hits
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140890
Approved by: https://github.com/bdhirsh
Fixes #137280
When we have multiple indexings of the same array as returned items in a pattern replacement, we shouldn't ignore their indexing numbers. Otherwise, we may create a wrong pattern_to_node mapping.
A unit test is added in this PR. In this unit test, the function `rms_pattern_static` is replaced with `rms_replacement_static` when called. The function `rms_pattern_static` calls two functionalized custom operators, `torch.ops.vllm.rms_norm.default` and `torch.ops.vllm.static_scaled_int8_quant.default`, and it returns at2[1] and at2[2] as outputs. The function `rms_replacement_static` calls one functionalized custom operator `torch.ops.vllm.fused_rms_norm_quant_static.default`, which returns two corresponding items.
Run `python test/inductor/test_pattern_matcher.py -k test_multioutput_register_replacement` to test. After setting `TORCH_COMPILE_DEBUG` to 1, the final part of `fx_graph_readable.py` looks like the following.
```python
# File: /home/yhao/p9/pytorch/test/inductor/test_pattern_matcher.py:1673 in rms_pattern_static, code: at1 = auto_functionalized(
auto_functionalized = torch.ops.higher_order.auto_functionalized(torch.ops.vllm.rms_norm.default, result = permute_1, input = convert_element_type, weight = convert_element_type_1, epsilon = 1e-06); permute_1 = convert_element_type = convert_element_type_1 = None
getitem_1: "bf16[5, 4]" = auto_functionalized[1]; auto_functionalized = None
# File: /home/yhao/p9/pytorch/test/inductor/test_pattern_matcher.py:1680 in rms_pattern_static, code: at2 = auto_functionalized(
auto_functionalized_1 = torch.ops.higher_order.auto_functionalized(torch.ops.vllm.static_scaled_int8_quant.default, result = permute, input = getitem_1, scale = full_default, azp = None); permute = getitem_1 = full_default = None
getitem_3: "i8[5, 4]" = auto_functionalized_1[1]
getitem_4: "f32[1, 1]" = auto_functionalized_1[2]; auto_functionalized_1 = None
return (getitem_3, getitem_4)
```
This happens before pattern matching, so it is expected to call `static_scaled_int8_quant` and `rms_norm` and return auto_functionalized_1 as outputs.
However, for pytorch before this PR, the `fx_graph_transformed.py`, which is after pattern matching, has the following code.
```python
# File: /home/yhao/p9/pytorch/test/inductor/test_pattern_matcher.py:1748 in my_func_static, code: scale = torch.ones((1, 1))
full_default: "f32[1, 1]" = torch.ops.aten.full.default([1, 1], 1, dtype = torch.float32, layout = torch.strided, device = device(type='cpu'), pin_memory = False)
# No stacktrace found for following nodes
as_strided_default: "i8[20]" = torch.ops.aten.as_strided.default(permute, [20], [1], 0)
clone_default: "i8[20]" = torch.ops.aten.clone.default(as_strided_default); as_strided_default = None
as_strided_default_1: "i8[5, 4]" = torch.ops.aten.as_strided.default(clone_default, [5, 4], [4, 1], 0); clone_default = None
as_strided_default_2: "f32[1]" = torch.ops.aten.as_strided.default(full_default, [1], [1], 0)
clone_default_1: "f32[1]" = torch.ops.aten.clone.default(as_strided_default_2); as_strided_default_2 = None
as_strided_default_3: "f32[1, 1]" = torch.ops.aten.as_strided.default(clone_default_1, [1, 1], [1, 1], 0); clone_default_1 = None
static_scaled_int8_quant_default = torch.ops.vllm.static_scaled_int8_quant.default(as_strided_default_1, permute_1, as_strided_default_3); as_strided_default_1 = permute_1 = static_scaled_int8_quant_default = None
fused_rms_norm_quant_static_default = torch.ops.vllm.fused_rms_norm_quant_static.default(permute, convert_element_type, convert_element_type_1, full_default, None, 1e-06); convert_element_type = convert_element_type_1 = full_default = fused_rms_norm_quant_static_default = None
return (permute, as_strided_default_3)
```
Here, it returns `(permute, as_strided_default_3)` while `permute` is written by fused_rms_norm_quant_static and `as_strided_default_3` is written by `static_scaled_int8_quant`. This is wrong because in our expectation, the `static_scaled_int8_quant` should be removed since it is replaced with `fused_rms_norm_quant_static`. It is supposed to return `(permute, full_default)`.
The root cause is the following. When we [generate patterns](5f4a21dc58/torch/_inductor/pattern_matcher.py (L1580)) with the traced fx graph and call the following function, the indexing numbers' type `int` in the traced graph is ignored via `ignore_types`. So, the final arguments of the patterns for those two output items look like `(CallFunction(auto_functionalized, XXX)), *)`.
5f4a21dc58/torch/_inductor/pattern_matcher.py (L1839-L1847)
When we do pattern matching after generating the patterns, `sorted(itertools.chain.from_iterable(nodes), reverse=True)` is `[getitem_4, getitem_3, getitem_1]`. The getitem_4 iteration is always a FailedMatch because we always use the first element to do the pattern match here (it fails in different match functions before and after this PR, but the reason is always the indexing-numbers issue): d4cdc09881/torch/_inductor/pattern_matcher.py (L848). However, when we do pattern matching for getitem_3, the child_match returns a match for getitem_3 again, because the `*` pattern can match anything. Then getitem_3's pattern matching returns `[getitem_3, getitem_3]` as outputs, which is wrong.
d4cdc09881/torch/_inductor/pattern_matcher.py (L856)
d4cdc09881/torch/_inductor/pattern_matcher.py (L1750-L1774)
This PR doesn't ignore `int` type when we generate patterns for getitem functions because integer indexing numbers are important to them. Thus, the indexing information is kept in patterns, ensuring correct matchings. With this PR, the above `child_match` returns a match for getitem_4, and the final getitem_3's pattern matching returns the correct `[getitem_3, getitem_4]`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140193
Approved by: https://github.com/eellison
When multiple nodes have similar sizes and are part of the `banned_nodes` (which is a `set` and not a `list`), there is non-determinism present in the partitioner due to sorting only by node-size.
This PR fixes this by also sorting by node name.
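A standalone sketch of the idea (Node and its size field are stand-ins for the real partitioner bookkeeping):
```python
from collections import namedtuple

Node = namedtuple("Node", ["name", "size"])
banned_nodes = {Node("add_2", 1024), Node("mul_1", 1024), Node("relu_3", 512)}

# Sorting only by size leaves ties in arbitrary set-iteration order;
# adding the node name as a tie-breaker makes the result deterministic.
ordered = sorted(banned_nodes, key=lambda n: (n.size, n.name), reverse=True)
```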
It would be good to add some tests, but I'm not sure about the best way to do it here.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141682
Approved by: https://github.com/Chillee, https://github.com/yf225
Summary: We want to return the mismatch ranks info in the `mismatch_collectives` field. Also update the logging message when no error is found and it's not partial analysis.
Test Plan: CI
Differential Revision: D66522602
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141631
Approved by: https://github.com/c-p-i-o
A couple changes.
- Tries to reuse dtype propagation rules that were already registered in inductor. These were present both with `pointwise_overrides_data` and the `boolean_ops` list. Additionally, the registration of pointwise ops already specified dtype propagation rules. Saves those registrations and reuses them later.
- Factors out `get_promoted_dtype`, which uses functools.lru_cache and therefore takes non-CSEVariable args, because CSEVariable objects will not work with the functools cache (a simplified sketch follows below).
Tests get added later in the stack when everything is implemented.
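A simplified sketch of the caching constraint (the real helper lives in inductor and takes more than plain dtypes):
```python
import functools
import torch

@functools.lru_cache(None)
def get_promoted_dtype(*dtypes: torch.dtype) -> torch.dtype:
    # lru_cache requires hashable arguments, so this takes plain dtypes
    # rather than CSEVariable objects.
    return functools.reduce(torch.promote_types, dtypes)

assert get_promoted_dtype(torch.float16, torch.float32) == torch.float32
```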
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139945
Approved by: https://github.com/blaine-rister, https://github.com/arui-meta, https://github.com/ezyang
The test was initially added due to accuracy issues which is sufficiently covered by the `self.common(fn, (x,))` assertion.
Unfortunately, the test fails due to tiling logic on `128-bit` vector size, which is outside the scope of this test and therefore it was overly specific.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141246
Approved by: https://github.com/desertfire
Added documentation note clarifying the rounding behavior of `torch.arange` when using floating-point dtypes, particularly for reduced precision types like `bfloat16`. This helps users understand potential issues like repeated values and provides guidance on using integer dtypes for precise sequences.
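A short illustration of the behavior the note warns about (values are indicative; the exact output depends on the build):
```python
import torch

# bfloat16 has roughly 8 bits of precision, so not every integer above 256 is
# representable and consecutive arange entries can round to the same value.
a = torch.arange(0, 300, dtype=torch.bfloat16)
print(a[254:260])  # may contain repeated values

# For precise sequences, use an integer dtype instead.
b = torch.arange(0, 300, dtype=torch.int32)  # exact
```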
## Changes
- Added explanatory note about floating-point rounding behavior and its effects
- Included specific mention of `bfloat16` dtype issues
- Added recommendation to use integer dtypes for precise sequences
Fixes [#137774](https://github.com/pytorch/pytorch/issues/137774)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141655
Approved by: https://github.com/cpuhrsch
The current `fuse` function supports conv/BN fusions only. This commit adds support for linear/BN fusion as well. Changes to follow the API guidelines are also applied.
(This will close the PR #141352 which I created for the same topic and got approval but had lint and API guideline problems.)
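For reference, eval-mode linear/BN folding works roughly like the following standalone sketch (this is the underlying math, not the code added by this commit):
```python
import torch
import torch.nn as nn

def fuse_linear_bn(linear: nn.Linear, bn: nn.BatchNorm1d) -> nn.Linear:
    # Fold BN's per-channel affine transform into the Linear weights/bias.
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
    fused = nn.Linear(linear.in_features, linear.out_features)
    with torch.no_grad():
        fused.weight.copy_(linear.weight * scale[:, None])
        fused.bias.copy_((linear.bias - bn.running_mean) * scale + bn.bias)
    return fused

linear, bn = nn.Linear(16, 32), nn.BatchNorm1d(32)
linear.eval(); bn.eval()
x = torch.randn(4, 16)
torch.testing.assert_close(fuse_linear_bn(linear, bn)(x), bn(linear(x)))
```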
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141585
Approved by: https://github.com/ezyang
Summary:
Splitting this PR into two, one for the cuSPARSELt improvements, and one
for the inductor lowering.
This PR adds in the additional cuSPARSELt bindings into pytorch.
* `torch._cslt_sparse_mm_search` will be deprecated in a future PR,
so a warning has been added
* Added a header file for cuSPARSELtOps.cpp
* max_id is now available in `torch.backends.cusparselt` via
`torch.backends.cusparselt.get_max_alg_id()`
* fixed meta registrations for float8
Test Plan:
python test/test_sparse_semi_structured.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137427
Approved by: https://github.com/cpuhrsch, https://github.com/eqy
Summary:
Ensures we support dims of size 0 properly in `torch._scaled_mm`. Follows the behavior from `torch.mm`.
For now only enable support for tensorwise, we can tackle rowwise in a future PR.
Test Plan:
```
python test/test_matmul_cuda.py -k test_zero_dim
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140967
Approved by: https://github.com/eqy, https://github.com/drisspg
This is super duper minor, but I believe this corrects a typo in the documentation of `torch.trapezoid`.
The documentation says the input is a 1-dimensional tensor $y_0, \dots, y_n$, but it uses summations going from 1 to n-1. Since it's summing over terms $y_i - y_{i-1}$, stopping at n-1 excludes the last partition $y_n - y_{n-1}$, which doesn't match the implementation...
```python
import torch
# (just showing it does include $y_n - y_{n-1}$)
torch.trapezoid(torch.tensor([0., 0., 9999.])) == 9999 / 2  # tensor(True)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141459
Approved by: https://github.com/colesbury
Summary:
float8 dtypes were missing from this map, adding
Test Plan:
CI, and unbreaks debugging in torchao
If there is an existing test I can add this to - lmk
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141385
Approved by: https://github.com/soulitzer
`SymmetricMemory.has_multicast_support()` throws an exception rather than returning `False` when called with a `DeviceType` that does not support. For example:
```
from torch._C._distributed_c10d import _SymmetricMemory
from torch._C._autograd import DeviceType
try:
    supports_multicast = _SymmetricMemory.has_multicast_support(DeviceType.CPU, 0)
except RuntimeError as exc:
    assert str(exc) == "SymmetricMemory does not support device type cpu"
```
This is problematic when building PyTorch from source without `CUDASymmetricMemory.cu` since the [`@requires_multicast_support`](https://github.com/pytorch/pytorch/blob/main/torch/testing/_internal/common_distributed.py#L353) test decorator will throw an exception rather than skipping the test (as intended)
This PR makes `_SymmetricMemory.has_multicast_support()` properly return `False` when multicast is not supported on the passed device.
cc @malfet, @atalman
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141598
Approved by: https://github.com/yifuwang
This PR contains three `unsqueeze()`-related fixes for NJT:
1. Adjusts the output's `_ragged_idx` when `unsqueeze()` inserts a dim before the ragged dim
2. Corrects the unbind reference for `unsqueeze()` after the last input dim. For this case, the dim kwarg canonicalization logic needs to be applied wrt `inp.dim() + 1` to account for `dim=-1` properly
3. Adds ragged dim support to `unsqueeze()`, allowing for e.g. `(B, j1, D) -> (B, 1, j1, D)`. This is okay now after #137125
Note that `unsqueeze()` still doesn't support batch dim operation, and arguably should never support this.
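A rough sketch of the newly allowed ragged-dim case (assumes a build with this change; shapes are illustrative):
```python
import torch

# (B, j1, D) nested jagged tensor
nt = torch.nested.nested_tensor(
    [torch.randn(2, 8), torch.randn(5, 8), torch.randn(3, 8)],
    layout=torch.jagged,
)
out = nt.unsqueeze(1)  # inserts a dim at the ragged position: (B, 1, j1, D)
print(out.shape)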
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141392
Approved by: https://github.com/cpuhrsch
ghstack dependencies: #141500, #140736, #140161
This PR introduces `ExtraOpData`, a structure that contains op metadata regarding whether the op is a view and the dim-related args it accepts. It also populates a huge database for dim-wise / view ops with this info.
Test logic (sample input generation, references) have been updated to utilize this data. It allows for a fairly generic set of sample inputs & a reference for the class of ops that accept a single NJT and operate dim-wise (AKA "unary dimwise ops").
Testing is added over the following ops:
* `chunk()`
* `narrow()`
* `select()`
* `split()`
* `split_with_sizes()`
* `squeeze()`
* `unflatten()`
* `unsqueeze()`
Most of the above do not operate on the ragged / batch dims or on non-contiguous NJTs, so the proper xfails are added as needed.
I also slipped in a couple minor fixes (sorry):
1. The `_wrap_jagged_dim()` helper now avoids assuming the `nt._ragged_idx == 1` and allows for a batch dim to be a valid input, disambiguating the converted inner dim as necessary through an additional `operating_on_batch` return value (i.e. both dim=0 and dim=1 map to dim=0 on the inner values tensor, since that dim represents a packed ragged dim for all batch items)
2. Padded dense -> NJT conversion requires shape gymnastics to operate with the restrictive FBGEMM kernel. The gymnastics were slightly wrong for the transposed NJT case, and this PR fixes that
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140161
Approved by: https://github.com/Skylion007, https://github.com/cpuhrsch
ghstack dependencies: #141500, #140736
Convenience improvements for building the PyTorch docs:
1. The build docs weren't clear that `make html` is the main command intended to be run
2. Once you run `make html` you need to visualize the result; opening up a simple HTTP server seems like the simplest solution, so this adds a `make serve` command
Usage
```shell
numpy ❯ make serve PORT=8080 # Add port optionally
Serving HTTP on :: port 8080 (http://[::]:8080/) ...
::1 - - [26/Nov/2024 10:05:41] "GET / HTTP/1.1" 200 -
::1 - - [26/Nov/2024 10:05:41] "GET /_static/copybutton.css HTTP/1.1" 200 -
::1 - - [26/Nov/2024 10:05:41] "GET /_static/katex-math.css HTTP/1.1" 200 -
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141590
Approved by: https://github.com/svekars, https://github.com/malfet
This improves `PyProcessGroup` so you can override rank, world size and group name/desc methods from Python. These will be needed to support resizable process groups in torchft.
This also has some small fixes in test_c10d_pypg.py to use threads instead of processes which speeds up the test execution by ~10x.
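A heavily hedged sketch of the kind of Python-side override this enables (the constructor signature and overridable method names are assumptions based on the test file, not a verified API):
```python
import torch.distributed as dist

class ResizablePG(dist.ProcessGroup):
    """Toy Python process group that reports its own rank/world size."""

    def __init__(self, rank: int, world_size: int) -> None:
        super().__init__(rank, world_size)
        self._rank, self._world_size = rank, world_size

    # With this change, these queries can be answered from Python:
    def rank(self) -> int:
        return self._rank

    def size(self) -> int:
        return self._world_size
```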
Test plan:
```
pytest test/distributed/test_c10d_pypg.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141529
Approved by: https://github.com/fegin
Usage:
```bash
python3 tools/packaging/split_wheel.py bdist_wheel
python3 tools/packaging/split_wheel.py install
python3 tools/packaging/split_wheel.py develop
```
Ideally this should make it easier to do the split build locally while
we're doing development.
Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141359
Approved by: https://github.com/kit1980
**Background:** It's common to use `scalar_tensor()` in the input to `where()` to convert any scalars present to compatible tensors with matching options, *including layout*. This shows up in various places, notably including derivative formulas ([example](78491d6afc/tools/autograd/derivatives.yaml (L432-L434))). It causes problems for NJTs because they have `layout=torch.jagged` and it never makes sense to create a scalar tensor with this layout. Some of the breakage only seems to happen in CI for reasons I don't fully understand (see the revert of #140736 due to softshrink's derivative formula).
**This PR:**
* Allows non-contiguous NJT inputs to `where()` + adds tests for this
* Handles scalar tensor / dense tensor inputs for `condition` / `other` + adds tests for this
* Uses limited `broadcast_tensors()` / `broadcast_to()` support
* Improves `expand()` to work on non-contig NJTs
* Changes `scalar_tensor()` to use `torch.strided` instead of `torch.jagged` in both eager and torch.compile (i.e. meta registration)
* Changes backward formulas for `sinc`, `pow`, `special.i1`, and `special.i1e` to uses `scalar_tensor()` instead of e.g. `zeros({})`
**Alternative approach:** Update all problematic usages of `scalar_tensor()` to avoid ever passing `layout=torch.jagged`. This is an extensive change and includes `torch.where()` logic, a bunch of derivative formulas, and likely other places not yet discovered.
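For illustration, a rough sketch of a now-supported pattern (assumes an NJT-enabled build; values are illustrative):
```python
import torch

nt = torch.nested.nested_tensor(
    [torch.randn(2, 4), torch.randn(5, 4)], layout=torch.jagged
)
# 'other' is a dense scalar tensor; with this change it is treated as a strided
# scalar rather than being forced into a jagged-layout scalar_tensor().
out = torch.where(nt > 0, nt, torch.tensor(0.0))
```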
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141500
Approved by: https://github.com/malfet, https://github.com/cpuhrsch, https://github.com/soulitzer
Summary: Add a helper function to put a const graph back into the top-level graph, which can be useful when we're taking const graphs from delegates.
Test Plan: CI
Reviewed By: trieuat
Differential Revision: D63031982
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140854
Approved by: https://github.com/SherlockNoMad
Summary: `TestFunc` is annotated as `Callable[[object], object]`, which represents a callable that takes a single argument of any type (`object`) and returns a value of any type (`object`). However, in reality `TestFunc` can take any number of arguments; as a result, the correct typing should be `Callable[..., object]`, which represents a callable that takes any number of arguments (including zero) and returns a value of any type (`object`).
Test Plan: Contbuild & OSS CI
Differential Revision: D66463705
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141513
Approved by: https://github.com/wz337, https://github.com/Skylion007
Thanks to https://github.com/pytorch/pytorch/pull/137978 from @Skylion007, which bumps cuDNN to 9.5.1, the broken assumption of dO strides == O strides is fixed.
Note that there is still the restriction that the innermost stride of the grad output is 1 (this is almost always guaranteed because this condition is required of the input tensors). The main exception would be in test code that does e.g., `.sum().backward()` which yields grad output tensors with strides `[0, 0, 0, 0]`.
CC @drisspg
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141147
Approved by: https://github.com/drisspg
This PR enhances offline tuning to support multiple GPUs.
High-level description of the algorithm:
- Duplicate GEMMs are first eliminated
- GEMMs are distributed to multiple GPUs for tuning
- Results are gathered into a file with `_full` in the filename
Also adding support for GemmAndBias and ScaledGemm
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139673
Approved by: https://github.com/jeffdaily, https://github.com/hongxiayang
We added `CudaEventCache` in https://github.com/pytorch/pytorch/pull/133727; this feature tries to reuse CUDA events so that we don't call the CUDA event destructor, which has caused hangs in the past. We already have a bunch of tests, plus testing on TorchTitan and internal workloads, and so far no errors or crashes have been found, so we decided to roll this out to all OSS users. For internal workloads, this PR has no effect because of internal gating.
Also, we observed some multi-device use cases in OSS, so we want to bring back the multi-device support originally proposed in https://github.com/pytorch/pytorch/pull/122732/files.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140975
Approved by: https://github.com/eqy, https://github.com/kwen2501
Summary:
Another try for D66198138. Original diff had some weird issue with type checking. Setting everything to int this time to get around it.
Addresses https://github.com/pytorch/pytorch/issues/91888
We use wait as the amount you wait in between cycles when profiling and skip_first to delay the start of said profiling. However, once skip_first steps are completed, we immediately go to the wait phase. This is not problematic if wait is smaller than skip_first because we can just lower the values of skip_first, but if it is larger then we end up starting the first profile much later than desired. For example imagine a skip first of 1 and a wait of 100 with repeat of 2. We do want to wait 100 steps in between cycle 1 and 2 but we may not want to start warmup of cycle 1 at step 101 (forced because wait occurs directly after first steps skipped). This diff addresses this by adding a flag to skip the first wait.
Adds new flag but sets to false by default so that existing impl is not affected.
Test Plan:
Got following traces with this schedule:
schedule=torch.profiler.schedule(
wait=10, warmup=3, active=1, repeat=1, skip_first=1, skip_first_wait=1
)
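For context, a minimal end-to-end sketch of exercising the new flag (assumes a build where `torch.profiler.schedule` accepts `skip_first_wait`):
```python
import torch
from torch.profiler import ProfilerActivity, profile, schedule

# skip_first_wait=1 drops the wait phase that would otherwise directly follow
# the skip_first steps, so warmup for the first cycle starts right away.
sched = schedule(wait=10, warmup=3, active=1, repeat=1, skip_first=1, skip_first_wait=1)

with profile(activities=[ProfilerActivity.CPU], schedule=sched) as prof:
    for _ in range(30):
        torch.mm(torch.randn(64, 64), torch.randn(64, 64))
        prof.step()
```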
Differential Revision: D66465860
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141512
Approved by: https://github.com/aaronenyeshi
### Background:
`set(x,y)` changes the untyped storage of x to be the same as y.
```python
import torch
from torch._subclasses.fake_tensor import FakeTensorMode
x1 = torch.ones(2,3)
y1 = torch.ones(2,3)
z1 = torch.ops.aten.set_.source_Tensor(x1, y1)
fake_tensor_mode = FakeTensorMode()
x2 = fake_tensor_mode.from_tensor(torch.ones(2,3))
y2 = fake_tensor_mode.from_tensor(torch.ones(2,3))
z2 = torch.ops.aten.set_.source_Tensor(x2, y2)
print(f"x1: {x1.untyped_storage()._cdata}, y1: {y1.untyped_storage()._cdata}, z1: {z1.untyped_storage()._cdata}")
print(f"x2: {x2.untyped_storage()._cdata}, y2: {y2.untyped_storage()._cdata}, z2: {z2.untyped_storage()._cdata}")
# x1: 99973024, y1: 99973024, z1: 99973024
# x2: 112107232, y2: 112107232, z2: 112107232
```
### Error before this diff
Consider this example:
```python
import torch
def fn(x):
    p = torch.nn.Parameter(x + 123)
    return p, p.sin()
opt = torch.compile(fn, fullgraph=True)
x = torch.ones(16, device="cuda", requires_grad=True)
p, r = opt(x)
r.sum().backward()
```
When running with `TORCH_LOGS=aot`, we have `set_` in the graph.
```
def forward(self, primals_1: "f32[16][1]cuda:0", primals_2: "f32[16][1]cuda:0"):
    # File: /home/boyuan/playground/inductor/donated_buffer.py:4 in fn, code: p = torch.nn.Parameter(x + 123)
    add: "f32[16][1]cuda:0" = torch.ops.aten.add.Tensor(primals_1, 123); primals_1 = None
    # File: /home/boyuan/playground/inductor/donated_buffer.py:5 in fn, code: return p, p.sin()
    sin: "f32[16][1]cuda:0" = torch.ops.aten.sin.default(add)
    # No stacktrace found for following nodes
    set_: "f32[16][1]cuda:0" = torch.ops.aten.set_.source_Tensor(primals_2, add); primals_2 = set_ = None
    return (sin, add)
```
`set_: "f32[16][1]cuda:0" = torch.ops.aten.set_.source_Tensor(primals_2, add)` should change the storage of `primals_2` to be the same as `add`. However, this is not true before this diff. We found different untyped_storage() for meta['val'] of `set_`, `add`, and `primals_2`.
This also leads to an error with donated buffer (#130580), which checks alias by untyped_storage. Since `add` and `primals_2` have different untyped_storage (which is wrong), add is wrongly marked as donated buffer.
### Root Cause
During tracing, we have args, kwargs, out, and proxy_args, proxy_kwargs, proxy_out.
We use args and kwargs to compute `out = func(*args, **kwargs)` ([Here](https://github.com/pytorch/pytorch/blob/main/torch/fx/experimental/proxy_tensor.py#L912)). Later, we set out to its proxy, essentially calling `proxy_out.node.meta["val"] = out.detach()`.
Due to the detach, the storage change happens on args but not on proxy_args.node.meta["val"] when func is torch.ops.aten.set_. I repro'ed this behavior of detach in eager code.
```python
import torch
x = torch.ones(2,3)
x_detach = x.detach()
y = torch.ones(2,3)
z = torch.ops.aten.set_.source_Tensor(x_detach, y)
print(f"x: {x.untyped_storage()._cdata}, x_detach: {x_detach.untyped_storage()._cdata}, y: {y.untyped_storage()._cdata}, z: {z.untyped_storage()._cdata}")
# x: 97023632, x_detach: 97026480, y: 97026480, z: 97026480
```
To fix the issue, this PR manually resets node.meta["val"] if the storage has changed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141308
Approved by: https://github.com/bdhirsh
# Motivation
This PR adds `XPUInductorQuantizer`, which defines the recipe for int8 quantization on the XPU backend.
# Details
The `XPUInductorQuantizer` is a class derived from `X86InductorQuantizer`, as both quantizers take advantage of the highly optimized operators in the oneDNN library (qconv, qlinear, qconv/qlinear fusion).
We share the same recipe as `X86InductorQuantizer`, so we have the same `annotate_xxxx` methods. In the ideal situation, `XPUInductorQuantizer` would have no class body, as all of the implementation can be inherited from the base class.
In this PR, we override the `annotate_xxx` methods for operators that have NOT been implemented. All operators the XPU backend does not implement fall back to the fp32 implementation, as the node in the graph stays as a `dq-op-q` pair. This helps provide good out-of-the-box usability for the XPU backend. On the other hand, the implemented operators use the `annotate_op` implemented in the base class and can be lowered successfully.
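A hedged sketch of where the quantizer slots into the pt2e flow (the module path, helper names, and toy model below are assumptions modeled on the `X86InductorQuantizer` counterpart):
```python
import torch
import torch.nn as nn
from torch.ao.quantization.quantize_pt2e import convert_pt2e, prepare_pt2e
from torch.ao.quantization.quantizer.xpu_inductor_quantizer import (
    XPUInductorQuantizer,
    get_default_xpu_inductor_quantization_config,
)

model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU()).eval().to("xpu")
example_inputs = (torch.randn(1, 3, 32, 32, device="xpu"),)

quantizer = XPUInductorQuantizer()
quantizer.set_global(get_default_xpu_inductor_quantization_config())

exported = torch.export.export_for_training(model, example_inputs).module()
prepared = prepare_pt2e(exported, quantizer)
prepared(*example_inputs)            # calibration
quantized = convert_pt2e(prepared)   # unimplemented ops remain as dq-op-q pairs
```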
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139578
Approved by: https://github.com/EikanWang, https://github.com/leslie-fang-intel, https://github.com/CuiYifeng, https://github.com/jerryzh168
ghstack dependencies: #133080
# Motivation
This PR enables XPU quantized convolution. The operators it registers are `onednn::qconv_prepack`, `onednn::qconv1d_pointwise`, `onednn::qconv2d_pointwise`, and `onednn::qconv3d_pointwise`. We share the same operator schemas as the Intel CPU backend, as both call kernels implemented in the oneDNN library.
# Details
The implemented operators will be further integrated into the pt2e quant flow. In this PR, we validated the kernel functionality via the UTs in `test/inductor/test_mkldnn_pattern_matcher.py`, where the CPU backend defines a series of UTs for quantized convolution. Also, we extend device support for the inductor lowering pass and the inductor IR defined in `torch/_inductor/fx_passes/quantization.py` and `torch/_inductor/mkldnn_ir.py`. The overall picture is that the CPU and GPU backends can share the general optimization passes (op fusion) and the quantization inductor IR; after lowering, the final kernel is dispatched to different implementations in the oneDNN library.
In this PR, we share the same int8 quantizer as the CPU, namely `X86InductorQuantizer`. In the next PR, #139578, we will add an `XPUInductorQuantizer`, which will customize the pt2e behaviors for the XPU backend. The capability of `XPUInductorQuantizer` will gradually grow along with the development of quantized operators on XPU.
# Validation
* UT testing
```bash
python test/inductor/test_mkldnn_pattern_matcher.py -v \
-k test_qconv2d_xpu \
-k test_qconv2d_silu_xpu \
-k test_qconv2d_relu6_xpu \
-k test_qconv2d_hardtanh_xpu \
-k test_qconv2d_hardswish_xpu
```
* Runtime exemplification
```bash
#qconv2d
onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,forward_training,src_u8::blocked:acdb::f0 wei_s8::blocked:acdb::f0 bia_undef::undef::: dst_f32::blocked:acdb::f0,attr-scratchpad:user attr-scales:src0:0:f32+wei:1:f32 attr-zero-points:src0:0:s32 attr-post-ops:binary_add:f32:2+eltwise_linear:1,alg:convolution_direct,mb1_ic128oc128_ih6oh4kh3sh1dh0ph0_iw6ow4kw3sw1dw0pw0,0.0668945
#qconv2d_silu
onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,forward_training,src_u8::blocked:acdb::f0 wei_s8::blocked:acdb::f0 bia_undef::undef::: dst_u8::blocked:acdb::f0,attr-scratchpad:user attr-scales:src0:0:f32+wei:1:f32 attr-zero-points:src0:0:s32 attr-post-ops:eltwise_swish:1+binary_add:f32:2+eltwise_linear:0.0124779:22,alg:convolution_direct,mb1_ic3oc128_ih8oh6kh3sh1dh0ph0_iw8ow6kw3sw1dw0pw0,0.0881348
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133080
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/atalman
- Pass `group_name` to `CUDASymmetricMemory::alloc()` instead of `CUDASymmetricMemory::rendezvous()`. We can only move the argument to rendezvous() once all the underlying operators do the same.
- Added `float` to the allowlist for intra-node all-reduces.
- Added a warning when `IntraNodeComm::rendezvous()` is performed with overlapping devices among participants.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141200
Approved by: https://github.com/weifengpy, https://github.com/kwen2501
The old implementation of `SetVariable.call_method("update", ...)` was
incorrect because it wouldn't handle iterable inputs. This patch
removes the input type restriction altogether and implements the method
as a polyfill (like how most of the other set methods are handled).
Fixes #141283.
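A minimal repro-style sketch of the pattern this enables (assumes a build containing the polyfill):
```python
import torch

@torch.compile(fullgraph=True)
def f(x):
    s = {1, 2}
    s.update(range(3, 5))  # iterable argument, previously unsupported in Dynamo
    return x + len(s)

print(f(torch.ones(2)))  # tensor([5., 5.])
```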
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141286
Approved by: https://github.com/anijain2305
With largish systems of nn modules with buffers, sinking params suffered from some kind of exponential blowup that is easily fixed by using a set instead of a list to keep track of unlifted buffer placeholders.
Test Plan: added random dag test that failed previously
Differential Revision: D66457661
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141494
Approved by: https://github.com/angelayi
Using the same `tools/generate_torch_version.py` script.
It's already available at the Python level, but not at the C++ one.
Please note that updating the commit hash will force recompilation of fewer than 10 files, according to
```
% touch caffe2/core/macros.h; ninja -d explain -j1 -v -n torch_python
ninja explain: output caffe2/torch/CMakeFiles/gen_torch_version doesn't exist
ninja explain: caffe2/torch/CMakeFiles/gen_torch_version is dirty
ninja explain: /Users/malfet/git/pytorch/pytorch/torch/version.py is dirty
ninja explain: output third_party/kineto/libkineto/CMakeFiles/libkineto_defs.bzl of phony edge with no inputs doesn't exist
ninja explain: third_party/kineto/libkineto/CMakeFiles/libkineto_defs.bzl is dirty
ninja explain: output caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/Version.cpp.o older than most recent input /Users/malfet/git/pytorch/pytorch/build/caffe2/core/macros.h (1732301546390618881 vs 1732301802196214000)
ninja explain: caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/Version.cpp.o is dirty
ninja explain: output caffe2/CMakeFiles/torch_cpu.dir/core/common.cc.o older than most recent input /Users/malfet/git/pytorch/pytorch/build/caffe2/core/macros.h (1732301546233600752 vs 1732301802196214000)
ninja explain: caffe2/CMakeFiles/torch_cpu.dir/core/common.cc.o is dirty
ninja explain: output caffe2/CMakeFiles/torch_cpu.dir/serialize/inline_container.cc.o older than most recent input /Users/malfet/git/pytorch/pytorch/build/caffe2/core/macros.h (1732301546651089243 vs 1732301802196214000)
ninja explain: caffe2/CMakeFiles/torch_cpu.dir/serialize/inline_container.cc.o is dirty
ninja explain: output caffe2/CMakeFiles/torch_cpu.dir/serialize/file_adapter.cc.o older than most recent input /Users/malfet/git/pytorch/pytorch/build/caffe2/core/macros.h (1732301546224176845 vs 1732301802196214000)
ninja explain: caffe2/CMakeFiles/torch_cpu.dir/serialize/file_adapter.cc.o is dirty
ninja explain: output caffe2/CMakeFiles/torch_cpu.dir/utils/threadpool/ThreadPool.cc.o older than most recent input /Users/malfet/git/pytorch/pytorch/build/caffe2/core/macros.h (1732301546464535054 vs 1732301802196214000)
ninja explain: caffe2/CMakeFiles/torch_cpu.dir/utils/threadpool/ThreadPool.cc.o is dirty
ninja explain: output caffe2/CMakeFiles/torch_cpu.dir/__/torch/csrc/jit/runtime/static/impl.cpp.o older than most recent input /Users/malfet/git/pytorch/pytorch/build/caffe2/core/macros.h (1732301550062608920 vs 1732301802196214000)
ninja explain: caffe2/CMakeFiles/torch_cpu.dir/__/torch/csrc/jit/runtime/static/impl.cpp.o is dirty
ninja explain: output caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/mps/MPSFallback.mm.o older than most recent input /Users/malfet/git/pytorch/pytorch/build/caffe2/core/macros.h (1732301547538843492 vs 1732301802196214000)
ninja explain: caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/mps/MPSFallback.mm.o is dirty
```
Differential Revision: [D66468257](https://our.internmc.facebook.com/intern/diff/D66468257)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141313
Approved by: https://github.com/ezyang
Summary:
Populate nn.module.stack in _fuse_conv_bn_qat for replacement nodes that correspond to a `get_attr` node in the original graph.
In the new training IR, `get_attr` nodes don't have `nn_module_stack` in node meta anymore (because the get_attr nodes are de-duplicated, so one get_attr node can potentially have users in different module stacks).
We populate it by checking if "conv_input" or "conv_weight" replacement node has nn_module_stack. If not, we copy it from the conv node.
Test Plan:
CI
```
buck run fbcode//caffe2/test:quantization_pt2e -- -r test_preserve_nn_module_stack
```
Differential Revision: D66393517
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141400
Approved by: https://github.com/angelayi
### Motivation
`ncclCommInitRank` needs GPU guard (documented in NCCL).
`ncclCommAbort`, `ncclCommFinalize` and `ncclCommDestroy` may also need GPU guard (undocumented in NCCL); otherwise, extra CUDA context may be created (or worse, hang); both effects have been seen before in our tests.
### Solution
This PR records a device index during `NCCLComm` object creation, so that we can add GPU guard in `NCCLComm`'s method calling which direct to the above NCCL APIs.
### Note
This is not a bug fix. Just a safety improvement.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141270
Approved by: https://github.com/eqy
ghstack dependencies: #141374
# Summary
The follow up PR to: https://github.com/pytorch/pytorch/pull/137526. In this pr, we actually update the lowerings for the flex_attention backwards kernel to generate fused backward gradient calculations for any captured buffers that require grads.
We are doing this using tl.atomic_add to scatter the correct gradients into a zeroed-out buffer for any captured buffers that require grads. Added many test cases and, along the way, found some masking bugs.
There are likely some performance cliffs here, specifically with different dtypes and on different GPUs. We plan to profile the current strategy in a follow-up. We are explicitly choosing reduced memory over increased performance right now.
By using atomics, we do not need to realize a full attention scores matrix. However, this comes with two downsides: one, it is potentially slower in some cases, and two, the gradient calculation for any captured buffers is non-deterministic.
## Worked Example
Let's do the case where you are reading from one bias that doesn't require grad and using this to index into another that does.
ScoreMod:
```Python
bias = torch.randn(
    params.seq_length,
    device=self.device,
    dtype=params.dtype,
    requires_grad=True,
)
offset = torch.randint(
    0,
    params.seq_length,
    (params.seq_length,),
    device=self.device,
)

def score_mod(score, b, h, q_idx, kv_idx):
    return score + bias[offset[q_idx]]
```
I am removing all but the new subgraph injected into the backwards:
``` Python
dsT = pT * (dpT - Di[None, :])
# ~~~~~~~~~~~~~~~~~~~ Apply joint modification ~~~~~~~~~~~~~~~~~~~
grad_scores = (dsT)
# ~~~~~~~~~~~~~~~~~~~ Apply other buffer grad writes ~~~~~~~~~~~~~
idx_b = off_z
idx_h = off_hq
idx_m = m
idx_n = n
scatter_mask = offs_m1[None, :] < Q_LEN and offs_n1[:, None] < KV_LEN
tmp4 = (dsT).to(tl.float32)
tl.atomic_add(out_ptr1 + (tl.broadcast_to(tl.load(in_ptr16 + idx_m), tmp4.shape)), tmp4, scatter_mask, sem='relaxed')
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
```
## Key points
* We always accumulate to float32 grad buffers regardless of the type in the forward. This is because we normally do all computation intra-kernel with fp32 accumulation, and we want the same behavior for atomic additions
* We are currently restricted to 1 scatter in the kernel. I have some ideas on fx rewrites that would remove this restriction, but for now there is a nice error message with a workaround, and I will leave this as a follow-up.
* Will do more extensive performance/memory profiling in a follow-up.
### Toy E2E example
I have a toy E2E training example PR in the gym for now: https://github.com/pytorch-labs/attention-gym/pull/84/
I plan to update to a realistic learnable bias before landing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137452
Approved by: https://github.com/Chillee
Summary:
It looks like RCCL does have support for those two error types, ncclRemoteError and ncclInProgress: https://github.com/ROCm/rccl/blob/develop/src/nccl.h.in#L57. And I do see my job throwing out those errors. But pytorch just said:
```
RuntimeError: Unconvertible NCCL type
```
Even though nccl says:
```
develop/src/init.cc.hip:502 NCCL WARN Attempt to use communicator before the previous operation returned ncclSuccess
```
Therefore just enabling those.
Test Plan: CI
Differential Revision: D66434341
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141461
Approved by: https://github.com/eqy
Fixes #126268
I've basically followed @ezyang's suggestion (I think) to use `func.decompose(...)`. Since `__torch_dispatch__` won't be called a second time for the same op, I've added a second `TorchDispatchMode` (`_DecomposedCounterMode`) that simply dispatches to the parent flop counter. Using `self` as the inner context manager is not possible, since the second call to `__enter__` would re-initialize the counter's tracking state.
Let me know if there's something wrong with this implementation, since I'm quite unsure how the decomposition thing actually works :D
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138508
Approved by: https://github.com/ezyang
Co-authored-by: Edward Z. Yang <ezyang@meta.com>
Summary: Fix outstanding TODOs related to logging of CompilationMetrics by moving the population of common fields to record_compilation_metrics() instead of populating those independently wherever we use the metrics_context context manager:
* Keep track of start and end time in MetricsContext and pass those to record_compilation_metrics() and populate those fields in that function.
* Pass exception info to record_compilation_metrics() and populate those field in that function.
* Add a new contextmanager, chromium_event_timed, to create the start/end "dynamo" event. This is important because I want this contextmanager to complete _after_ building the CompilationMetrics.
* Populate the compile_id field centrally in record_compilation_metrics().
* Populate the structured_logging_overhead centrally in record_compilation_metrics().
* Add the CompilationMetrics to the current chromium event in record_compilation_metrics(), after all common fields have been added. In a future diff, I can also add _all_ compilation metrics to the chromium event.
Test plan: Unit tests. Also see internal testing:
* dynamo_compile: https://fburl.com/scuba/dynamo_compile/sandbox/jrascnf9
* pt2_compile_events: https://fburl.com/scuba/pt2_compile_events/l3jnla06
* tlparse: https://fburl.com/bq5a9nqs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141291
Approved by: https://github.com/jamesjwu
### Summary
1. Improve the heuristic for loop split optimization: The divisor needs to be an integer and cannot be too small (needs to be greater than 8, this threshold has been tuned).
2. Improve the heuristic for disabling vectorization: add quantity_threshold and relax ratio_threshold for the number of non-contiguous load/store/index_expr in the loop body.
This PR will bring performance improvements for two torchbench models(functorch_dp_cifar10, opacus_cifar10) and one timm model(sebotnet33ts_256).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137550
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jansel
This PR contains three `unsqueeze()`-related fixes for NJT:
1. Adjusts the output's `_ragged_idx` when `unsqueeze()` inserts a dim before the ragged dim
2. Corrects the unbind reference for `unsqueeze()` after the last input dim. For this case, the dim kwarg canonicalization logic needs to be applied wrt `inp.dim() + 1` to account for `dim=-1` properly
3. Adds ragged dim support to `unsqueeze()`, allowing for e.g. `(B, j1, D) -> (B, 1, j1, D)`. This is okay now after #137125
Note that `unsqueeze()` still doesn't support batch dim operation, and arguably should never support this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141392
Approved by: https://github.com/cpuhrsch
ghstack dependencies: #140736, #140161
This PR introduces `ExtraOpData`, a structure that contains op metadata regarding whether the op is a view and the dim-related args it accepts. It also populates a huge database for dim-wise / view ops with this info.
Test logic (sample input generation, references) have been updated to utilize this data. It allows for a fairly generic set of sample inputs & a reference for the class of ops that accept a single NJT and operate dim-wise (AKA "unary dimwise ops").
Testing is added over the following ops:
* `chunk()`
* `narrow()`
* `select()`
* `split()`
* `split_with_sizes()`
* `squeeze()`
* `unflatten()`
* `unsqueeze()`
Most of the above do not operate on the ragged / batch dims or on non-contiguous NJTs, so the proper xfails are added as needed.
I also slipped in a couple minor fixes (sorry):
1. The `_wrap_jagged_dim()` helper now avoids assuming the `nt._ragged_idx == 1` and allows for a batch dim to be a valid input, disambiguating the converted inner dim as necessary through an additional `operating_on_batch` return value (i.e. both dim=0 and dim=1 map to dim=0 on the inner values tensor, since that dim represents a packed ragged dim for all batch items)
2. Padded dense -> NJT conversion requires shape gymnastics to operate with the restrictive FBGEMM kernel. The gymnastics were slightly wrong for the transposed NJT case, and this PR fixes that
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140161
Approved by: https://github.com/Skylion007, https://github.com/cpuhrsch
ghstack dependencies: #140736
Newer versions of distutils no longer import `_msvccompiler` upon init (on the Windows platform; that was not the case on other platforms even before 74), but it's still accessible if one chooses to import it directly.
Test plan:
```
% python -c 'from setuptools import distutils; print(distutils.__version__, hasattr(distutils, "_msvccompiler")); from distutils import _msvccompiler; import setuptools; print(setuptools.__version__, _msvccompiler.__file__)'
3.10.9 False
65.5.0 /usr/local/fbcode/platform010/Python3.10.framework/Versions/3.10/lib/python3.10/site-packages/setuptools/_distutils/_msvccompiler.py
```
and
```
% python -c 'from setuptools import distutils; print(distutils.__version__, hasattr(distutils, "_msvccompiler")); from distutils import _msvccompiler; import setuptools; print(setuptools.__version__, _msvccompiler.__file__)'
3.13.0 False
75.6.0 /Users/malfet/py312-venv/lib/python3.13/site-packages/setuptools/_distutils/_msvccompiler.py
```
Gave up trying to appease the linter, so rewrote it as the following function:
```python
def _get_vc_env(vc_arch: str) -> dict[str, str]:
    try:
        from setuptools import distutils  # type: ignore[import]
        return distutils._msvccompiler._get_vc_env(vc_arch)  # type: ignore[no-any-return]
    except AttributeError:
        from setuptools._distutils import _msvccompiler  # type: ignore[import]
        return _msvccompiler._get_vc_env(vc_arch)  # type: ignore[no-any-return]
```
This PR also undoes the setuptools version restriction introduced by https://github.com/pytorch/pytorch/pull/136489, as the premise for that restriction is incorrect.
Fixes https://github.com/pytorch/pytorch/issues/141319
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141363
Approved by: https://github.com/huydhn, https://github.com/atalman
Summary:
In SJD, we register the callbacks to get notified of an active compilation. Using this information, we can allow an increased time budget for the training loop.
The callbacks currently do not account for the entire time, and in several cases the end callback is not called at all.
This leads to a bunch of APS jobs getting terminated incorrectly: https://fburl.com/scuba/mast_hpc_job_run_status/ondwzt2w
In this diff, we basically install a context manager which will call the start and end callbacks, similar to how we log counters and other information.
Test Plan:
```
buck2 run mode/opt //aps_models/examples/dlrm:dlrm_train_app -- --config-name train_mast_fsdp_torchdynamo launcher.data_project=apf_ai_infra launcher.fbl_entitlement=ai_infra_training_rnd_tc launcher.hardware=TC_ANY_80G
```
Led to https://www.internalfb.com/mlhub/pipelines/runs/mast/aps-atuljangra-ef2285ba9a?job_attempt=0&version=0&env=prod. https://fburl.com/ai_infra/sv0a213y confirms that the callback was correctly called and a lease was properly installed, which takes over the training loop lease.
Differential Revision: D66347023
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141323
Approved by: https://github.com/ezyang
Summary:
Addresses https://github.com/pytorch/pytorch/issues/91888
We use wait as the amount you wait in between cycles when profiling and skip_first to delay the start of said profiling. However, once skip_first steps are completed, we immediately go to the wait phase. This is not problematic if wait is smaller than skip_first because we can just lower the values of skip_first, but if it is larger then we end up starting the first profile much later than desired. For example imagine a skip first of 1 and a wait of 100 with repeat of 2. We do want to wait 100 steps in between cycle 1 and 2 but we may not want to start warmup of cycle 1 at step 101 (forced because wait occurs directly after first steps skipped). This diff addresses this by adding a flag to skip the first wait.
Adds new flag but sets to false by default so that existing impl is not affected.
Test Plan:
Got reasonable traces with this schedule:
schedule=torch.profiler.schedule(
wait=10, warmup=3, active=1, repeat=1, skip_first=1, skip_first_wait=1
)
D66198138
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141070
Approved by: https://github.com/aaronenyeshi, https://github.com/briancoutinho
For better tracking, we need to mark maybe-aliasing/mutating ops with a proper tag. We need to special-case native_batch_norm because it is not a CIA but has a wrong schema. I guess native_batch_norm will be removed at some point, so until then we just keep it around.
D60347117
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131990
Approved by: https://github.com/bdhirsh
Summary: This was failing with a `/usr/local/fbcode/platform010/lib/python3.10/asyncio/events.py:666: DeprecationWarning` that seems unrelated.
Test Plan:
```
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/dynamo:test_dynamo -- --exact 'caffe2/test/dynamo:test_dynamo - test_misc.py::InlineInbuiltNNModulesMiscTests::test_numpy_readonly_inline_inbuilt_nn_modules' --run-disabled
```
Differential Revision: D66394773
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141399
Approved by: https://github.com/yanboliang
Handling of nested modules in unflatten had several bugs, which were caught by trying to preserve module call signatures for nested modules.
* A module `k` encountered when calling `k.n()` before `k()` used to become an empty nn module. This caused some information to be dropped when `k()` was eventually called. Relatedly, we would also lose call counts for `k.n()` through different paths (say, when `k()` calls `n()`).
* Deleting call-indexed modules and patching up their call sites was broken for nested modules when creating dispatcher modules, because of silliness when handling their fqns.
An interesting aside is that we used random graph generation for testing some of these changes. A future PR will add the infra to create tests using these random graphs.
Differential Revision: D66192799
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141066
Approved by: https://github.com/angelayi
Summary:
Mirroring changes in D64604573, it appears this code in libcaffe2 is
mostly a copy of folly's one. Copy of the original diff summary:
This particular inline assembly use cannot be converted to a constraint template parameter (until llvm 18 / gcc 14), as there is no way (until those versions) to specify that a non-pic relocation is needed when compiling under pic. This inline assembly requires a non-pic relocation because it is being written to the .notes section, which is non .text.
Differential Revision: D66038989
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141191
Approved by: https://github.com/dcci
Summary: This is a possible fix for https://fb.workplace.com/groups/1075192433118967/permalink/794017756161443/
Test Plan:
```
buck2 test 'fbcode//mode/opt' fbcode//ai_infra/distributed_ai/pyper_test_framework/pt2_staging_tests/sw_v2:smallworld_cmf_test -- --exact 'ai_infra/distributed_ai/pyper_test_framework/pt2_staging_tests/sw_v2:smallworld_cmf_test - test_train (ai_infra.distributed_ai.pyper_test_framework.pt2_staging_tests.sw_v2.smallworld_cmf_test.CmfTest)'
```
Differential Revision: D66340927
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141310
Approved by: https://github.com/ezyang
For custom ops that do not have a meta kernel, draft export automatically creates a meta kernel based on the tracing example inputs. To ensure that the assumptions made during tracing are clear to the user, we add assertions into the traced exported program:
An example graph:
```
ExportedProgram:
    class GraphModule(torch.nn.Module):
        def forward(self, a: "f32[s0, s1]", b: "f32[s2, s3]"):
            # File: /data/users/angelayi/pytorch/test/export/test_draft_export.py:172 in forward, code: res1 = torch.ops.mylib.foo4(a, b)
            _assert_tensor_metadata = torch.ops.aten._assert_tensor_metadata(a, dtype = torch.float32, device = device(type='cpu')); _assert_tensor_metadata = None
            _assert_tensor_metadata_1 = torch.ops.aten._assert_tensor_metadata(b, dtype = torch.float32, device = device(type='cpu')); _assert_tensor_metadata_1 = None
            foo4: "f32[u2, u3]" = torch.ops.mylib.foo4.default(a, b); a = b = None
            return (foo4,)
```
Differential Revision: [D66321129](https://our.internmc.facebook.com/intern/diff/D66321129)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141072
Approved by: https://github.com/pianpwk
ghstack dependencies: #141071
Summary:
We found another corner case in merge splits, where the first split node does not have consecutive getitem indices; we need to skip such cases.
Test Plan:
# local reproduce
```
buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode batch-split --flow_id 666002198 2>&1 | tee ~/cmf.txt
```
P1683429791
Differential Revision: D66275387
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141194
Approved by: https://github.com/jackiexu1992
Summary:
Latest attempt after [136802](https://github.com/pytorch/pytorch/pull/136802) and [140084](https://github.com/pytorch/pytorch/pull/140084) got shelved.
This keeps the string format for `expr_str`, but calls `sympy.printing.repr.srepr(s)` instead of `str(s)`, which prints expressions more explicitly, e.g.
```
((2*x)//(3*y + 4)) -> "FloorDiv(Mul(Integer(2), Symbol('x')), Add(Mul(Integer(3), Symbol('y')), Integer(4)))"
```
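A small self-contained illustration of the round-trip property being relied on (plain sympy; `FloorDiv` in the real code is a torch-specific sympy function, so an ordinary expression stands in for it here):
```python
import sympy
from sympy.printing.repr import srepr

x, y = sympy.symbols("x y")
expr = 2 * x + 3 * y + 4          # stand-in for the FloorDiv-style expressions above
s = srepr(expr)                   # "Add(Mul(Integer(2), Symbol('x')), ...)"
assert sympy.sympify(s) == expr   # sympify parses the explicit form back losslessly
```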
This is nice because:
- we have better roundtrippability for deserialization, robust to pretty printing changes like [this](6c9bfd52b6/torch/utils/_sympy/functions.py (L208)) that caused the issue in the first place.
- this preserves the BC surface for both 1) sigmoid thrift serialization, by keeping the string format, and 2) deserialization for old IRs, since `sympy.sympify(...)` still handles the old `str(s)` format.
- more memory efficient than storing ASTs; the [AST attempt](https://github.com/pytorch/pytorch/pull/140084) increased artifact size by 20% on some toy programs.
- doesn't even require a schema version bump.
Additionally to push some test cases over the line, this redoes expression processing (handling ranges, symbol caching) by doing bottom-up processing instead of the current hacky-ish workflow.
Test Plan: test_serdes, test_serialize, internal tests broken by AST PR
Differential Revision: D66283208
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141284
Approved by: https://github.com/zhxchen17
After `destroy_process_group`, it may be possible that the CUDA context finishes its job and exits, thus NVML detects 0 processes on the device. This PR relaxes the current check condition (there must be exactly 1 active process on that device) to cover this possibility.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141298
Approved by: https://github.com/eqy
Bump the Triton pin to the release candidate commit for Triton 3.2.
A few changes beyond the pin bump itself are needed:
* Remove the script that adds a git version hash suffix to the Triton wheel, since as of https://github.com/triton-lang/triton/pull/4812 Triton adds that itself
* Add `pybind11` to the Triton build setup, since Triton now depends on it
* Use manylinux-2.28 for the Triton wheel builder, and use clang+lld for building to pick up the right glibc
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139206
Approved by: https://github.com/malfet, https://github.com/atalman
Co-authored-by: Andrey Talman <atalman@fb.com>
Summary: The struct type is named "InputToConsantInputSpec" in thrift, which causes some inconsistency between the schemas. Changing the type name from one to another is okay-ish because that doesn't change the on-wire format.
Test Plan: CI
Differential Revision: D66240951
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141151
Approved by: https://github.com/yiming0416
When executing the following code:
```
import pytorch_openreg
import torch
if __name__ == "__main__":
    a = torch.tensor(1, device="openreg")
```
Sometimes releasing tensor `a` fails after the process finishes executing the `main` function. The trace of releasing `a` is `~Tensor()` -> ... -> `OpenRegMem.cpp` -> `OpenRegHooks.cpp` -> `_aten_impl.py`.
There are two failure scenarios I've found:
1. Segmentation fault: before executing `~Tensor()`, the process has already released global variables in `_aten_impl.py`, which causes the issue.
2. Waiting indefinitely: the main process passes the `free ptr` command to the daemon process; however, the daemon processes have already shut down.
The fix is to ignore the `del ptr` operation when the process is shutting down.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140936
Approved by: https://github.com/ezyang
# Motivation
The distributed APIs rely on backend names for the creation of a process group.
To abstract out references to these names from PG creation, an API is added to get the default distributed backend for a device.
The device code would need to register its device and backend via ```torch.distributed.Backend.register_backend``` or update the map ``` torch.distributed.Backend.default_device_backend_map["device"] = "distributed_backend" ``` prior to using the API.
An example of use is added in the test file (which can be used to check the abstracted APIs); see the sketch below.
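A hedged sketch of what this enables (the query-API name below is an assumption based on this description, not a verified signature; the device name is hypothetical):
```python
import torch.distributed as dist

# A hypothetical out-of-tree device registers which backend it defaults to:
dist.Backend.default_device_backend_map["my_device"] = "gloo"

# The abstracted query API (name assumed) can then be used instead of
# hard-coding backend strings when creating process groups:
backend = dist.get_default_backend_for_device("my_device")
print(backend)  # "gloo"
```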
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140536
Approved by: https://github.com/kwen2501
Summary:
Addresses https://github.com/pytorch/pytorch/issues/91888
We use wait as the amount you wait in between cycles when profiling and skip_first to delay the start of said profiling. However, once skip_first steps are completed, we immediately go to the wait phase. This is not problematic if wait is smaller than skip_first because we can just lower the values of skip_first, but if it is larger then we end up starting the first profile much later than desired. For example imagine a skip first of 1 and a wait of 100 with repeat of 2. We do want to wait 100 steps in between cycle 1 and 2 but we may not want to start warmup of cycle 1 at step 101 (forced because wait occurs directly after first steps skipped). This diff addresses this by adding a flag to skip the first wait.
Adds new flag but sets to false by default so that existing impl is not affected.
Test Plan:
Got reasonable traces with this schedule:
schedule=torch.profiler.schedule(
wait=10, warmup=3, active=1, repeat=1, skip_first=1, skip_first_wait=1
)
Differential Revision: D66198138
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141070
Approved by: https://github.com/aaronenyeshi, https://github.com/briancoutinho
Summary: Without this change, calling `str(MatchState.SOMETHING)` will cause an exception.
Test Plan:
Can we add unittest somewhere?
Ensure `str(MatchState.FULLY_MATCHED)` and `str(MatchState.FULLY_MATCHED())` won't raise exception.
Differential Revision: D66321609
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141297
Approved by: https://github.com/fduwjj
#140739 and #140740 made it such that `get_safe_globals` no longer returns an empty list by default.
This caused some tests that check the content of `get_safe_globals` to fail when run individually (they didn't fail in the test suite because other tests that ran before them called `clear_safe_globals`) [T208186010](https://www.internalfb.com/intern/tasks/?t=208186010):
test_safe_globals_for_weights_only
test_safe_globals_context_manager_weights_only
This PR fixes that and also makes most tests that call `clear_safe_globals` use the `safe_globals` context manager rather than try: finally.
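For reference, a small sketch of the context-manager pattern the tests are moving to (the class name is illustrative):
```python
import torch

class MyOptimizerState:  # illustrative user class
    pass

with torch.serialization.safe_globals([MyOptimizerState]):
    assert MyOptimizerState in torch.serialization.get_safe_globals()
# On exit the allowlist is restored, so tests no longer need try/finally cleanup.
```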
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141300
Approved by: https://github.com/awgu
Summary:
`repro.py` can have nested graph modules, e.g.
```
class Repro(torch.nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.true_graph_0 = GraphModule()

    def forward(self):
        true_graph_0 = self.true_graph_0
        return (true_graph_0,)
```
So dumping the string doesn’t always work.
So,
1) we use exported program in repro.py instead
2) we still dump the graph module string, but only put it in comments
We also added two flags to `minifier_launcher.py`
- `minifier-export-mode`: whether strict or non-strict export is used in the minifier
- `skip-export-error`: intermediate graphs that cannot be exported will be skipped.
Test Plan:
```
buck2 run fbcode//caffe2/test/inductor:minifier_utils_cpu -- -r string
python test/inductor/test_minifier.py
```
Differential Revision: D66175257
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141159
Approved by: https://github.com/henrylhtsang
As suggested by @leslie-fang-intel in 4c83e4e751 (diff-139642bd981df977f70f4c18c1c34bd1a85c1d6b9ffa06aaa98426ed83942a31R537) - all elements of `B` tiles (not referring to AMX tiles, but the tiles at the granularity of the micro-kernel) have contiguous elements since `B` matrix is pre-packed, so dequantized buffer loading logic can be simplified. While the previous approach kept elements to be loaded into a B AMX tile contiguous, the new approach doesn't entail any performance penalty either because that data is already in L1D, so loading AMX tiles from non-contiguous dequantized B elements doesn't adversely affect performance.
Also rectified the size of the dequantized B buffer.
Fixes #140208.
A subsequent PR will factor out caching of dequantized int8 weights into a separate codegen function.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140258
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel
Summary:
When we have both `set_grad` and `autocast` HOP, name collision might happen when we try to inline a node.
For example, for a GraphModule like this:
```
GraphModule(
    (submod_0): GraphModule(
        (submod_1): GraphModule()
    )
    (submod_1): GraphModule()
    (submod_2): GraphModule()
)
```
when we inline `submod_0`, we might accidentally overwrite `submod_1`.
In this PR, we fix this by checking whether the graph module already has an attribute with the same name; if so, we keep trying the next "submod_{i}" until there is no name collision.
Partially fixes https://github.com/pytorch/pytorch/issues/140589.
Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export -- -r test_predispatch_autocast_and_set_grad
```
Differential Revision: D66200994
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141169
Approved by: https://github.com/angelayi
fsspec transactions do not support concurrency and assume that there is at most 1 running transaction per filesystem. This is *not* true in our usage, where, because of multi-threading, we usually have multiple concurrent transactions running at once.
Previously, this would just (unsafely) pass but lead to hard-to-debug race conditions (since the commit of one transaction will blow away the state of the other transaction). In fsspec 2024.3.0, trying to commit concurrent transactions will actually crash (see the code at 76ca4a6888/fsspec/transaction.py (L39) -- because each filesystem can have a single transaction, this tear-down logic will error).
Instead, let's manually handle committing / discarding changes to the file. This does this "the old-fashioned way" instead of using `fsspec`'s commit/rollback behavior because the internal PathManagerFileSystem used for `iopath` does not properly support that behavior.
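A rough sketch of the manual commit/discard pattern described here (the paths and local filesystem are illustrative; the actual checkpoint code operates through its own storage abstraction):
```python
import fsspec

fs = fsspec.filesystem("file")
tmp, final = "/tmp/ckpt.pt.tmp", "/tmp/ckpt.pt"

try:
    with fs.open(tmp, "wb") as f:
        f.write(b"serialized checkpoint bytes")
    fs.mv(tmp, final)        # commit: publish the finished file
except Exception:
    if fs.exists(tmp):
        fs.rm(tmp)           # discard partial work instead of relying on fs transactions
    raise
```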
I don't have a minimal test-case, but in Meta this solves a broken test on `fsspec >= 2024.3.0`:
Before: https://www.internalfb.com/intern/testinfra/testrun/7318349626774607
After: https://www.internalfb.com/intern/testinfra/testrun/2251800062722633
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135541
Approved by: https://github.com/Skylion007
Fixes #140986
This includes several improvements to the grammar and wording of nn/module.py, mostly simple one-word fixes, but also other slightly more elaborate ones.
It addresses about half of the docs for module.py but I would be glad to cover the rest of it if required to do so.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140987
Approved by: https://github.com/mikaylagawarecki
Ever since #135140, this test will fail if run with CPU parameterization (e.g. test_out__refs_logical_or_cpu_float32) while CUDA is available - as far as I can tell, the PyTorch CI isn't currently checking for this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137140
Approved by: https://github.com/ezyang
# Summary
We have another IMA for captured buffers when the sequences are not divisible.
Running test before this commit:
```Shell
========= Error: process didn't terminate successfully
========= Target application returned an error
========= ERROR SUMMARY: 447 errors
========= ERROR SUMMARY: 347 errors were not printed. Use --print-limit option to adjust the number of printed errors
```
And After
```Shell
❯ CUDA_LAUNCH_BLOCKING=1 PYTORCH_NO_CUDA_MEMORY_CACHING=1 compute-sanitizer --tool memcheck pytest test/inductor/test_flex_attention.py -k "test_non_divisible_with_captured_buffer"
========= COMPUTE-SANITIZER
====================================================== test session starts =======================================================
platform linux -- Python 3.12.7, pytest-7.4.0, pluggy-1.5.0
rootdir: /home/drisspg/meta/pytorch
configfile: pytest.ini
plugins: hypothesis-6.115.5, typeguard-4.3.0
collected 518 items / 517 deselected / 1 selected
Running 1 items in this shard
test/inductor/test_flex_attention.py . [100%]
=============================================== 1 passed, 517 deselected in 13.31s ===============================================
========= ERROR SUMMARY: 0 errors
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141164
Approved by: https://github.com/Chillee
Summary:
Add the following inductor fx graph cache stats to dynamo compile
- inductor_fx_cache_hit_count
- inductor_fx_cache_miss_count
- inductor_fx_cache_backend_type
- inductor_fx_cache_hit_keys
- inductor_fx_cache_miss_keys
- remote_cache_version
Test Plan: Run local tests and staging logger: P1683061460
Differential Revision: D66232206
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141190
Approved by: https://github.com/masnesral
Changes the semantics of __repr__ of P2POp: s, d are now group ranks instead
of global ranks. I think this is OK since I also updated the field names
to make this obvious.
Also add mypy annotations
Partially addresses RFC 0042 (pytorch/rfcs#71)
See more details/motivation in #140460
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141054
Approved by: https://github.com/kwen2501
Fix: #139936
This PR modifies the lowering of the `split` operation so that it won't generate guards
specializing on the sizes parameter. Instead, it specializes on the number of output
tensors being generated (i.e. a function of the size of the base tensor and the sizes
parameter).
As a result, operations such as `chunk` (whose number of output tensors usually is
constant given a static chunk number) won't trigger recompiles when varying the size of
the base tensor.
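An illustrative sketch of the behavior this targets (dynamic shapes assumed; exact recompile behavior depends on the rest of the configuration):
```python
import torch

@torch.compile(dynamic=True)
def f(x):
    a, b, c = torch.chunk(x, 3)   # constant number of outputs
    return a.sum() + b.sum() + c.sum()

f(torch.randn(12))
f(torch.randn(18))  # larger input, same chunk count: no size-specializing guard expected
```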
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141077
Approved by: https://github.com/ezyang
Currently real tensor tracing raises MetadataMismatchErrors if registered fake kernels don't match the real kernels (e.g. shape, aliasing, dtype, etc.). This adds an option to use fake kernel inference to bypass mismatches - this option defaults to False for real tensor tracing, but is on for draft export.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139766
Approved by: https://github.com/angelayi, https://github.com/zou3519
Adding `destroy_pg_upon_exit` property to allow derived Test classes to control whether auto destroy is desired.
(Otherwise, derived test classes will need to rewrite the `_run()` method, leading to duplicated code of `_run()` and if one needs to add things to `_run` in the future, more code change is needed.)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141192
Approved by: https://github.com/wconstab
This PR implements the framework for supporting HOP in the ONNX exporter. Refer to https://github.com/pytorch/pytorch/issues/140995 for the design.
- Implement support for torch.cond
- Refactor `_add_nodes` into `_translate_fx_graph` to handle nested subgraphs. To support building subgraphs as functions using the same logic, new handlers for `placeholder` and `output` nodes are added to register inputs and outputs on the onnx function.
- Functions are created under the domain of `pkg.torch.__subgraph__`
- Updated the type promotion pass to run on nested subgraphs.
- Implement torch.cond in `_torchlib/ops/hop.py`. Updated the registry to discover these ops.
- Improve opset_import handling robustness with `add_opset_imports` IR pass. To achieve this, we added opset version to all Nodes. Fixes https://github.com/pytorch/pytorch/issues/139503 Fixes #117655 Fixes #123972 Fixes #93743 Closes https://github.com/pytorch/pytorch/issues/140995
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137428
Approved by: https://github.com/justinchuby
Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
This test is failing locally in 3.12 and 3.13 and is blocking 3.13 CI enablement.
It may have to do with the scipy version; see .ci/docker/requirements-ci.txt (3.12+ has scipy 1.12.0/1.14.1, whereas < 3.12 requires scipy 1.10.1).
Wanted to xfail these tests, but they somehow pass sometimes on CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140731
Approved by: https://github.com/ezyang, https://github.com/malfet
Differential Revision: [D65362160](https://our.internmc.facebook.com/intern/diff/D65362160)
State after this PR:
1. The tests that require inference IR are replaced with ep.run_decomp({}), so export_for_training_run_decomp is somewhat redundant, but it is still nice that multiple rounds of retracing keep working. In general, we need some auditing to reduce our redundant test coverage.
2. After this PR has landed and not been reverted for a week or so, I will replace the export_for_training calls with export, as they are the same thing now.
3. Added more tests to also cover the now-"deprecated" old IR by patching export to use the old export. For reviewers, please look at the internal version.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139511
Approved by: https://github.com/ydwu4, https://github.com/angelayi, https://github.com/avikchaudhuri
Fixes a bunch of benchmarks that failed with cudagraph errors including `tlp python benchmarks/dynamo/timm_models.py --device cuda --inductor --accuracy --amp --training --only resmlp_12_224` when `specialize_float=False`
Also brings down number of overall failures (with keep-going) from 108 => 62. I'd estimate >80% of those 62 are wobbly expect tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140346
Approved by: https://github.com/ezyang
ghstack dependencies: #140983, #141003
Looks like a regression caused by use of the strided API, but adding the test revealed (at least in CI) that on Ventura it worked but returned garbage results, so this is fixed by removing all the logic about channels-last (as it's irrelevant for the strided API case, and the placeholder already turns the tensor into a correct one)
This also allows one to remove `mem_format_key` and `ns_shape_key` (it was redundant even back then, as `mem_format_key` + `getTensorsStringKey(grad_output_t)` already uniquely identified the operation)
Fixes https://github.com/pytorch/pytorch/issues/140902
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141009
Approved by: https://github.com/manuelcandales
hipGraphExecDestroy doesn't immediately free memory since ROCm 6.2.
It waits for the next sync point in order to free the memory, to ensure that all hipGraphLaunch calls have finished before we release any memory.
We need to ensure all async operations finish before deleting the object.
The capture_dev_ variable is used to save the device number when the capture_begin() method is called.
But a CUDAGraph can be created and destroyed without calling capture_begin(); `capture_dev_ = UNDEFINED_DEVICE;` allows us to detect such a case and skip the sync.
Tests impacted:
test_cuda.py::TestCuda::test_graph_make_graphed_callables_*
distributed/test_c10d_nccl.py::ProcessGroupNCCLTest::test_allreduce_in_cudagraph
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138722
Approved by: https://github.com/malfet, https://github.com/eqy, https://github.com/jeffdaily
* Automatically applies ruff rule 401, turning loops into equivalent list comprehensions, which are faster and do not leak the loop variables into the enclosing scope (see the illustrative example after this list).
* List comprehensions not only often have better typing, but are 50+% faster than for loops in terms of overhead. They also preserve length information etc. and are better for the interpreter to optimize.
* Manually went back and made mypy happy after the change.
* Also fixed style lints in files covered by flake8 but not by pyfmt
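A toy illustration of the rewrite (not actual PyTorch code):
```python
items = range(10)

# before: explicit loop with append
squares = []
for x in items:
    squares.append(x * x)

# after: equivalent list comprehension; faster, and the comprehension's x
# does not leak into the enclosing scope
squares = [x * x for x in items]
```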
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140980
Approved by: https://github.com/justinchuby, https://github.com/malfet
Summary:
**wins**
On the torchrec benchmark, for 2K nodes this saves 40 seconds;
with the recent sympy changes (https://www.internalfb.com/diff/D65883538) we save around 13 seconds (with the max opt on).
```
buck2 run fbcode//mode/opt fbcode//torchrec/distributed/tests:pt2_compile_benchmark -- --num-features=200
```
This diff optimizes construction of expressions of the form
a+b+c... (all unique symbols),
which are very common in torchrec models.
**How**
Expressions of the form a+b+c are not optimized by add; the only needed optimization is sorting them.
If we have a+b+c and we are adding (d) to it, we can do a binary search to find
the position of (d) and avoid re-optimizing the new expression by passing along the new order.
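A rough sketch of the binary-search idea (purely illustrative; the function and variable names are not from the actual change):
```python
import bisect
import sympy

def add_sorted_term(sorted_syms, new_sym):
    # Terms of a sum of unique symbols are kept sorted, so inserting one more
    # term is a binary-search insertion rather than re-sorting the whole sum.
    keys = [s.name for s in sorted_syms]
    pos = bisect.bisect_left(keys, new_sym.name)
    return sorted_syms[:pos] + [new_sym] + sorted_syms[pos:]

a, b, d = sympy.symbols("a b d")
print(add_sorted_term([a, b], d))  # [a, b, d]
```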
**Extensions**:
1. support constant terms.
2. support 10a+10b+... (this will give even more wins; support will be extended in a second PR)
Differential Revision: D66008482
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140822
Approved by: https://github.com/ezyang
Tracking issue: #138399
This PR changes the `pow` C++ implementation, making its C++ meta kernel consistent with
its Python ref implementation. The following example shows the inconsistency between the
two:
```python
import torch

def run(device):
    S = (5,)
    a = torch.rand(S, device=device, dtype=torch.float32)
    b = 2
    out = torch.empty(S, device=device, dtype=torch.float64)
    return torch.pow(a, b, out=out)
>>> run("cpu")
Traceback (most recent call last):
File "test.py", line 34, in run
return torch.pow(a, b, out=out)
RuntimeError: Found dtype Double but expected Float
>>> run("meta")
tensor(..., device='meta', size=(5,), dtype=torch.float64)
```
**~Update:~**
~Note that this happens only for `pow.Tensor_Scalar` overloads. Therefore, this PR needed
further 2 modifications:~
- ~Split the `pow` ref implementation, making `pow.Tensor_Scalar` error on mismatching
output dtypes~
- ~Create a dispatch for `pow` when `_refs.pow()` is called~
**Update:**
Changing the `TensorIteratorConfig` for `pow.Tensor_Scalar` was easier and,
after the discussion below, more correct. The solution was to change the
`TensorIteratorBase::build_output_borrowing_argument_owning_unary_op` function,
setting:
- `cast_common_dtype_to_outputs`; and
- `enforce_safe_casting_to_output`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140287
Approved by: https://github.com/ezyang
Detailed description:
The code below will raise an error:
```Python
import torch
from torch.fx.experimental.proxy_tensor import make_fx
def func(a):
    b = a + 1
    c = b.view(-1)
    c.add_(1)
    return b

input = torch.randn(2)
out = make_fx(func)(input)
```
The error is as follows:
```Python
...
File "/root/Git.d/pytorch/pytorch/torch/_dynamo/codegen.py", line 34, in <module>
from .variables.torch_function import TensorWithTFOverrideVariable
File "/root/Git.d/pytorch/pytorch/torch/_dynamo/variables/torch_function.py", line 185, in <module>
populate_builtin_to_tensor_fn_map()
File "/root/Git.d/pytorch/pytorch/torch/_dynamo/variables/torch_function.py", line 146, in populate_builtin_to_tensor_fn_map
inp0 = torch.ones(1)
File "/root/Git.d/pytorch/pytorch/torch/fx/experimental/proxy_tensor.py", line 1240, in __torch_function__
return func(*args, **kwargs)
File "/root/Git.d/pytorch/pytorch/torch/utils/_stats.py", line 21, in wrapper
return fn(*args, **kwargs)
File "/root/Git.d/pytorch/pytorch/torch/fx/experimental/proxy_tensor.py", line 1342, in __torch_dispatch__
return proxy_call(self, func, self.pre_dispatch, args, kwargs)
File "/root/Git.d/pytorch/pytorch/torch/fx/experimental/proxy_tensor.py", line 907, in proxy_call
name=proxy_mode.tracer.graph._target_to_str(func.overloadpacket.__name__),
AttributeError: 'PythonKeyTracer' object has no attribute 'graph'
...
```
Solution:
Import torch._dynamo before dispatch_trace is called, so that the context set up before dispatch_trace does not affect the torch._dynamo import.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141022
Approved by: https://github.com/ezyang
Fixes `python test/inductor/test_fused_attention.py SDPAPatternRewriterCpuTests.test_pattern_fails_with_unsupported_mask_cpu` when `specialize_float=False`. You might wonder how it's related: it's because there is a "negative" test that expects us not to match. Previously it would fail on isinstance(param, Tensor), but now that we tensorify the float, it did match and caused a failure. This check ensures the mask has the same shape, so that this negative test case actually fails to match.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141003
Approved by: https://github.com/ezyang
ghstack dependencies: #140983
Summary: The way we've been de/serializing sympy.Exprs is not roundtrippable in all cases (serialize by calling `str(expr)`, and deserialize by calling `sympy.sympify(expr_str)`). This has led to expressions being mathematically equivalent but structurally different, causing issues in ValueRanges. Example issue: https://github.com/pytorch/pytorch/issues/136797
This starts to deprecate the use of `expr_str` and stores expressions in AST format instead. For BC purposes, `expr_str` deserialization is still supported, but we will always serialize to `expr_ast`. We'll kill this once the serialization upgrader design is finalized and implemented.
Test Plan: test_export
Differential Revision: D65638757
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140084
Approved by: https://github.com/angelayi
Fix https://github.com/pytorch/pytorch/issues/140462 .
Horace found that when we implicitly fall back to eager, some eager kernels may not work correctly if Inductor provides non-contiguous inputs (due to padding etc.). The original issue was found for the backward op of weight_norm. The fix in this PR is a general one: we force inputs to all implicit fallback kernels to be contiguous.
I had to refactor the code a bit to make it work. Previously we applied the layout constraint in `GraphLowering.run_node`, while we look for implicit fallbacks in `call_function`. The problem is that when we set up the implicit fallback in `call_function` with a layout constraint, we don't have a chance to apply the constraint. The refactor moves the code that applies layout constraints to `call_function`.
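Conceptually, the constraint amounts to something like the sketch below (illustrative only; the function names are not Inductor's actual code path):
```python
import torch

def _to_contiguous(a):
    return a.contiguous() if isinstance(a, torch.Tensor) else a

def make_fallback_args_contiguous(args, kwargs):
    # Before dispatching an implicit eager fallback, force every tensor argument
    # to be contiguous so the eager kernel never sees Inductor's padded,
    # non-contiguous layouts.
    return [_to_contiguous(a) for a in args], {k: _to_contiguous(v) for k, v in kwargs.items()}
```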
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140996
Approved by: https://github.com/jansel
When the stub file `nn/parallel/distributed.pyi` was removed (#88701), some types that existed are no longer available. This pull request adds them back.
Just for reference, these types are used in pytorch-lightning's LightningCLI. Command line interfaces are created automatically, and having type hints make them nicer.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136835
Approved by: https://github.com/kwen2501
Summary:
Some graphs produced by the minifier graph cutter cannot be used for AOTI/export (illegal graphs); these should be considered graphs that don't fail in the minifier, so the minifier keeps searching.
One example is the following graph, where `true_graph_0` is an fx.GraphModule. Here, export.export() would give a `UserError` with `ErrorType = UserErrorType.INVALID_OUTPUT`.
```
# graph():
# %true_graph_0 : [num_users=1] = get_attr[target=true_graph_0]
# return (true_graph_0,)
```
This graph could be obtained from the module below:
```python
import torch

class M(torch.nn.Module):
    def forward(self, x, flag):
        flag = flag.item()

        def true_fn(x):
            return x.clone()

        return torch.cond(flag > 0, true_fn, true_fn, [x])
```
So we detect such errors and exclude them from the minifier's search (consider these graphs as not having failed).
This is OK and won't miss any actual errors, since the AOTI minifier is only designed to catch errors in the AOTI phase anyway; it is not responsible for catching export bugs.
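Roughly, the check works along these lines (a sketch assuming the module-fails callback is a plain function and that `UserError` carries its `UserErrorType` as `error_type`; the real minifier utilities differ):
```python
from torch._dynamo.exc import UserError, UserErrorType

def graph_fails(check_fn, gm, args):
    # Treat export user errors on cutter-produced illegal graphs as "did not fail",
    # so the minifier keeps searching for the real AOTI-phase failure.
    try:
        return check_fn(gm, args)
    except UserError as e:
        if e.error_type == UserErrorType.INVALID_OUTPUT:
            return False
        raise
```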
Test Plan:
```
buck2 run fbcode//caffe2/test/inductor:test_minifier_utils -- -r invalid_output
```
Differential Revision: D66143487
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140999
Approved by: https://github.com/henrylhtsang
Summary: We are working on onboarding legokit modules to ModuleStability and this is needed to fix the serialization issue found in P1680200613
Test Plan:
`buck2 test //torchrec/fb/legokit/module_stability_tests/layer_norm_stability_test:layer_norm_stability_test -- --env ADD_NEW_STABILITY_CONFIGS=True`
serialization succeeds when the above command is run on top of this diff.
Differential Revision: D66034492
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141047
Approved by: https://github.com/angelayi
Defining `static char shaderSource[]` in the header will instantiate it as often as it is included.
Solved the problem by renaming `static auto getCPLState(const std::string&)` into `auto getFusedAdamCPLState(const std::string&)` and instantiating it only once, which resulted in a 500K reduction in binary size (and perhaps even more in runtime footprint)
I.e. before
```
% ls -lak lib/libtorch_cpu.dylib
-rwxr-xr-x 1 malfet staff 183357744 Nov 19 17:58 lib/libtorch_cpu.dylib
```
and after
```
% ls -lak lib/libtorch_cpu.dylib
-rwxr-xr-x 1 malfet staff 183357120 Nov 19 17:57 lib/libtorch_cpu.dylib
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141090
Approved by: https://github.com/Skylion007
ghstack dependencies: #141089
### Background
This PR adds the functionality to xfail / skip on a per-`SampleInput` basis for `OpInfo` tests. See #89354 and #82669 for some requests asking for this type of functionality.
This was originally landed for NJT in #138370 and is generalized and slightly tweaked here.
### Design
#### Principles
* Clean separation among `SampleInput` generation logic, test logic that uses the `SampleInput`s, and xfail / skip logic (which will change as bugs are addressed).
* Flexibility in xfail / skip predicate specification - ideally each bug can be handled by a single skip / xfail, even if it surfaces across a specific class of ops.
* This is important in practice for NJT, where it's common to have a bug that affects all binary ops, for example.
* Opt-in with minimal test logic changes + no substantial impact on other tests.
#### Details
The core new concept is a `SampleRule`, which can be either an `XFailRule` or `SkipRule`.
```python
@dataclass
class SampleRule(ABC):
    # function to indicate whether the rule applies to this op; return True if so
    # NB: str arg of callable is device_type
    op_match_fn: Callable[[str, OpInfo], bool] = None
    # function to indicate whether the rule applies to this sample; return True if so
    sample_match_fn: Callable[[torch.device, SampleInput], bool] = None
    # optional name for identifying the rule
    name: str = ""

@dataclass
class XFailRule(SampleRule):
    # expected error type
    error_type: TypeVar = Exception
    # expected error message
    error_msg: str = ".*"

@dataclass
class SkipRule(SampleRule):
    ...
```
* See below for example usage details, but at a high level: each test should have a corresponding list of `sample_skips_and_xfails`.
* The list of `sample_skips_and_xfails` is traversed in order, and the first rule that matches (if any) is applied, so order can matter.
* The PR includes a logging mechanism for matched rules accessible by setting the loglevel to `DEBUG`.
* The split between `op_match_fn` and `sample_match_fn` is made to allow pre-filtering of the list of rules to get only those that apply to the op under test.
* Each `SampleInput` is run within a subtest context so they can be individually skipped / xfailed as needed. This also means that a test will no longer stop after the first erroring `SampleInput`; all samples will be run through test logic.
### Example Usage
Consider the following OpInfo test:
```python
class MyTestCase(TestCase):
    @ops(op_db)
    def test_foo(self, device, dtype, op):
        for sample in op.sample_inputs(device, dtype, requires_grad=False):
            # do some SampleInput-based test logic
            output = op.op(sample.input, *sample.args, **sample.kwargs)
            ...
```
This is a common pattern for such tests; simply generate a list of `SampleInputs` and run them through the op. Now say you want to xfail one of these `SampleInput`s for a given op. Today, you have to xfail the entire test or hack around this in the test logic.
This PR lets you do this to get very flexible xfail / skips based on op / sample input properties:
```python
# NB: Define rules for per-SampleInput xfails / skips. These can also be defined in-line in the @ops decorator, but
# it can be more readable to maintain these somewhere else. These are attempted to be matched in order and
# the first one that matches applies, so order can matter.
FOO_SKIPS_AND_XFAILS = [
    XFailRule(
        error_type=ValueError,
        error_msg="2D inputs not supported",
        op_match_fn=lambda device, op: (
            # NB: logic for which ops this rule applies to goes here
            op.full_name == "add"
        ),
        sample_match_fn=lambda device, sample: (
            # NB: logic which samples this rule applies to goes here
            sample.input.dim() == 2
        ),
        # NB: optional rule identifier can help with debugging matched rules
        name="add_with_2D_inputs_not_supported",
    ),
    # NB: This follows a similar structure as XFailRule but without error_type / error_msg. Obviously
    # this skips a particular SampleInput instead of xfailing :)
    SkipRule(...),
    ...
]

class MyTestCase(TestCase):
    @ops(op_db)
    @sample_skips_and_xfails(FOO_SKIPS_AND_XFAILS)
    # NB: the @ops decorator automatically filters out any rules that don't apply to this op
    def test_foo(self, device, dtype, op):
        for sample, subtest_ctx in op.sample_inputs(
            # NB: use_subtests=True is required for skips / xfails to work. If skips / xfails are defined and use_subtests != True,
            # an informative error will be thrown.
            device, dtype, requires_grad=False, use_subtests=True
        ):
            # NB: this subtest context manager runs each sample input as a "subtest" and handles skips / xfails appropriately
            with subtest_ctx(self):
                # do some SampleInput-based test logic
                output = op.op(sample.input, *sample.args, **sample.kwargs)
                ...
```
More examples can be seen in `test/test_nestedtensor.py`, where this system is used in practice.
I also demonstrate usage of syntactic sugar over this system in `test/functorch/test_vmap.py`. Here, a skip for the `to()` operator is replaced with a granular xfail for `test_vmap_exhaustive()`:
```python
    ...
    # pre-existing xfail
    xfail("item"),
    # new granular xfail using syntactic sugar over the general system
    xfailIf(
        "to",
        lambda sample: (
            sample.kwargs["memory_format"] == torch.channels_last
        ),
    ),
    ...
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140443
Approved by: https://github.com/janeyx99, https://github.com/zou3519
ghstack dependencies: #140160, #138370
Adding some dynamo_timed annotations for the purpose of better understanding AOTI compilation time.
This would probably require a few more passes. A lot of time is spent in Scheduler.__init__, and not enough annotations are there.
run_command_and_check takes a lot of time as well, but there is probably not much we can do there. Maybe we can add a config to tune the C++ optimization level?
traces:
<img width="1205" alt="Screenshot 2024-11-08 at 4 41 10 PM" src="https://github.com/user-attachments/assets/61645264-b3af-4d4a-804d-700b0f831c7c">
Differential Revision: D65554141
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140198
Approved by: https://github.com/desertfire
During export, we nub out most CIA ops to return NotImplemented to avoid decomposing them during tracing. To recover the existing shape propagation behavior, we register these CIA decomps directly as FakeTensorMode rules as well. The reason we have to do this is that when we return NotImplemented, FakeTensor falls back to running these CIAs with the Meta backend, causing device-branching CIA ops to fail (because now the device is Meta; one example is sdpa). If we register a kernel directly to FakeTensorMode, we won't fall back to the Meta backend.
Differential Revision: [D65716260](https://our.internmc.facebook.com/intern/diff/D65716260/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140465
Approved by: https://github.com/bdhirsh
Expands the `test_linalg_qr_autograd_errors` unit test to check all cases of differentiability/non-differentiability as given in the docs https://pytorch.org/docs/stable/generated/torch.linalg.qr.html:
- mode= ‘reduced’ (default): Returns (Q, R) of shapes (*, m, k), (*, k, n) respectively. It is always differentiable.
- mode= ‘complete’: Returns (Q, R) of shapes (*, m, m), (*, m, n) respectively. It is differentiable for m <= n.
- mode= ‘r’: Computes only the reduced R. Returns (Q, R) with Q empty and R of shape (*, k, n). It is never differentiable.
(in particular, the happy paths are added)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135097
Approved by: https://github.com/IvanYashchuk, https://github.com/nikitaved
On ROCm, hipification converts std::min to ::min, but ::min is not returning the right result. This impacts the index_add_ operation on a large tensor: we end up picking the large value instead of the max supported block size (128). This leads to the GPU accessing memory out of bounds.
While we wait for ::min to be fixed, we can use the < operator to compare instead of relying on ::min.
Example Code w/ failure:
```
import torch

D = 6144
hidden_states = torch.zeros([16384, 6144], device="cuda:0", dtype=torch.bfloat16)
index = torch.randint(0, 16384, (1, 32, 16384), device="cuda:0", dtype=torch.int64)
output = torch.empty([1, 32, 16384, 6144], device="cuda:0", dtype=torch.bfloat16)
hidden_states.index_add_(0, index.view(-1), output.view(-1, D))
```
```
Traceback (most recent call last):
RuntimeError: HIP error: invalid configuration argument
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139087
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony
Summary:
1. Clearly specify in error messages that we are referring to a collective_sequence_id and an internal_record id for an entry.
The entry id is semi-useless for the end consumer, so at least let them know that this is an internal record id.
2. Add some missing fields in types.py.
self.missing_ranks = set()
self.input_numel = tuple()
self.output_numel = tuple()
self.errors = set()
These were showing up as linter errors when I opened the file in vs-code
Test Plan:
```
buck2 run //caffe2/fb/flight_recorder:fr_trace -- -m f665492593-nerf_training-96ab95e0 -w 8 --mast_job_version 0 -a 0
Buck UI: https://www.internalfb.com/buck2/2cac9273-1b7b-47bf-867f-82f9a4c1d581
Network: Up: 0B Down: 0B
Not all ranks joining collective: sequence number: 31117
internal record id: 31116
group info: 0:default_pg
collective: nccl:all_reduce
missing ranks: {3, 4, 5, 6, 7}
input sizes: [[1571911]]
output sizes: [[1571911]]
world size: 8
expected ranks: {0, 1, 2, 3, 4, 5, 6, 7}
collective state: scheduled
collective stack trace:
all_reduce at /packages/fblearner.flow.canary/workflow#link-tree/torch/distributed/distributed_c10d.py:2707
wrapper at /packages/fblearner.flow.canary/workflow#link-tree/torch/distributed/c10d_logger.py:81
sync_buffers at /packages/fblearner.flow.canary/workflow#link-tree/xri_mapsr/neural_fields/models/gaussian_splatting.py:650
decorate_context at /packages/fblearner.flow.canary/workflow#link-tree/torch/utils/_contextlib.py:116
step at /packages/fblearner.flow.canary/workflow#link-tree/xri_mapsr/neural_fields/training/training_manager/splatting.py:356
main at /packages/fblearner.flow.canary/workflow#link-tree/xri_mapsr/neural_fields/nerf_training.py:260
main_impl at /packages/fblearner.flow.canary/workflow#link-tree/rl_aiep/mast/endpoint.py:57
main at /packages/fblearner.flow.canary/workflow#link-tree/rl_aiep/mast/endpoint.py:34
wrapper at /packages/fblearner.flow.canary/workflow#link-tree/torch/distributed/elastic/multiprocessing/errors/__init__.py:355
<module> at /packages/fblearner.flow.canary/workflow#link-tree/rl_aiep/mast/endpoint.py:118
_run_code at /packages/fblearner.flow.canary/workflow#link-tree/runtime/lib/python3.10/runpy.py:86
_run_module_as_main at /packages/fblearner.flow.canary/workflow#link-tree/runtime/lib/python3.10/runpy.py:196
run_as_main at /packages/fblearner.flow.canary/workflow#link-tree/__par__/bootstrap.py:69
run_as_main at /packages/fblearner.flow.canary/workflow#link-tree/__par__/meta_only/bootstrap.py:98
__invoke_main at /packages/fblearner.flow.canary/workflow#link-tree/__run_lpar_main__.py:28
<module> at /packages/fblearner.flow.canary/workflow#link-tree/__run_lpar_main__.py:31
...
Differential Revision: D66018461
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140969
Approved by: https://github.com/Skylion007, https://github.com/fduwjj
test test_save_load_transform in [test_transforms.py](https://github.com/pytorch/pytorch/blob/main/test/distributions/test_transforms.py)
_pytest test_transforms.py -k test_save_load_transform_
error message:
```
.
.
.
File "/workspace/pytorch/test/distributions/test_transforms.py", line 555, in test_save_load_transform
other = torch.load(stream)
^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/serialization.py", line 1444, in load
raise pickle.UnpicklingError(_get_wo_message(str(e))) from None
_pickle.UnpicklingError: Weights only load failed. This file can still be loaded, to do so you have two options, do those steps only if you trust the source of the checkpoint.
(1) Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
(2) Alternatively, to load with `weights_only=True` please check the recommended steps in the following error message.
WeightsUnpickler error: Unsupported global: GLOBAL torch.distributions.transformed_distribution.TransformedDistribution was not an allowed global by default. Please use `torch.serialization.add_safe_globals([TransformedDistribution])` or the `torch.serialization.safe_globals([TransformedDistribution])` context manager to allowlist this global if you trust this class/function.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140494
Approved by: https://github.com/mikaylagawarecki
Citing @malfet's [comment](https://github.com/pytorch/pytorch/pull/136343#pullrequestreview-2318792396) in https://github.com/pytorch/pytorch/pull/136343
> It would be great, if users do not have to modify their programs for every new backend, but rather use with torch.device('xpu'): and keep rest of the code unchanged.
This PR makes the backend specification ("nccl", "gloo") optional when the user provides a `device_id` to `init_process_group` (the acceptance of `device_id` has been previously supported for the purpose of eager init).
New user experience:
```
device = torch.device(device_type, rank % device_count)
dist.init_process_group(device_id=device)
```
The line of `device = torch.device(...)` is anyway needed because user would use it for tensor creation etc.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140963
Approved by: https://github.com/wconstab
Per discussion with @malfet, only allow weights_only unpickler to load NJT if `torch.nested` and `torch._dynamo` are imported
(this is slightly weird as technically `torch.nested` is actually imported by default and `torch._dynamo.decorators._DimRange` is actually what needs to be imported)
we can't import this from `torch.nested` as this would
- undo dynamo lazy import
- cause circular import
===========================
Redo of https://github.com/pytorch/pytorch/pull/140304 caused issues as `torch.nested._internal.foo` needs to be imported, which causes issues like
```python
torch/_weights_only_unpickler.py", line 339, in load
if full_path in _get_allowed_globals():
torch/_weights_only_unpickler.py", line 188, in _get_allowed_globals
torch.nested._internal.nested_tensor.NestedTensor
AttributeError: module 'torch.nested' has no attribute '_internal'
```
**This likely wasn't caught in our CI because imports are global during unit tests(?), so we use subprocess to properly test this time**
Differential Revision: [D65961691](https://our.internmc.facebook.com/intern/diff/D65961691)
@jbschlosser
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140739
Approved by: https://github.com/malfet
Summary:
Bug in quantizer: Conv + ReLU is fused even when the preceding conv has more than one user. Conv and ReLU cannot be fused in this case because the result of Conv must be used elsewhere.
XNNPACK Delegate naturally handles this by inserting a clamp node for ReLU.
Test Plan: CI
Reviewed By: digantdesai
Differential Revision: D65989599
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140846
Approved by: https://github.com/digantdesai
Doc updates:
* This adds documentation for the object oriented ProcessGroup APIs that are being used in torchft as well as https://github.com/pytorch/rfcs/pull/71 .
* It also does some general cleanups to simplify the distributed.rst by using `:methods`.
* It adds `__init__` definitions for the Stores
* I've reordered things so the collective APIs are before the Store/PG apis
Test plan:
```
lintrunner -a
cd docs && sphinx-autobuild source build/ -j auto -WT --keep-going
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140853
Approved by: https://github.com/kwen2501
# Overview
Currently monitor.py produces error-only results; this PR introduces a disable-monitor option to all *-test.yml files. We would also like to explore how the monitor code affects benchmark results.
# next steps
- fix the monitor.py
- enable non-benchmark tests with monitor
- investigate benchmark test behavior with monitor background job
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140857
Approved by: https://github.com/huydhn
Tracking issue: #138399
This PR fixes a number of reference implementations (which are also used as meta
functions), making them more consistent with CPU device. More specifically, it fixes those
operations that use the `_make_elementwise_unary_reference` decorator and don't error on a
mismatching out argument dtype, while they do error when using concrete devices (e.g. CPU).
The fixed operations are:
- `abs`
- `ceil`
- `floor`
- `frac`
- `isneginf`
- `isposinf`
- `sgn`
- `sign`
- `signbit`
- `trunc`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140288
Approved by: https://github.com/ezyang
ghstack dependencies: #140186, #140286
Fixes #139320
### Summary:
#### (1) Add `_rename_dynamic_shapes_with_model_inputs` for dynamic_shapes to play along with input_names
* Use the model forward signature to rename dynamic_shapes when dynamic_shapes is not nested and directly uses customized names. This solves the issue that torch.export.export expects dynamic_shapes to use only the model input names (see the sketch after these bullets).
* If dynamic_shapes is nested, we do nothing.
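A minimal sketch of the renaming idea, assuming a flat dict keyed by user-chosen names (the helper name and exact logic below are illustrative, not the actual implementation):
```python
import inspect
import torch

def rename_dynamic_shapes_with_model_inputs(model: torch.nn.Module, dynamic_shapes):
    # Re-key a flat dynamic_shapes dict onto the forward() parameter names,
    # since torch.export.export only accepts the model's input names.
    params = list(inspect.signature(model.forward).parameters)
    if isinstance(dynamic_shapes, dict) and set(dynamic_shapes) - set(params):
        return dict(zip(params, dynamic_shapes.values()))
    return dynamic_shapes
```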
#### (2) Add `_from_dynamic_shapes_to_dynamic_axes` for fallback
* We flatten dynamic_shapes with leaves defined by _pytree.tree_leaves()
~~* If a dynamic_shapes is not nested, and defined in dict. We can use the key as the input_names, since it should be renamed by `_rename_dynamic_shapes_with_model_inputs` already.~~
* If a dynamic_shapes is provided, input_names is required to assign the names, because dynamic_axes needs it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139532
Approved by: https://github.com/justinchuby
I.e. replace `at::detail::getMPSHooks().isOnMacOSorNewer` with `is_macos_13_or_newer`, which is a direct function call instead of going through a virtual method call.
Hooks are only needed to provide a feature-agnostic interface to query something even on platforms that might not have support for the feature, while functions implemented in `ATen/native/xxx` should be able to call those platform-specific methods directly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140950
Approved by: https://github.com/Skylion007
ghstack dependencies: #140896
Summary: When AOT_PARTITIONER_DEBUG is set to 1 and debug logging is turned on we can now log the full input and output for each knapsack problem.
Differential Revision: D65633086
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140757
Approved by: https://github.com/jansel
Which is a variadic template that automates the tedious (and error-prone) process of passing arguments via a series of
```cpp
mtl_setBuffer(encoder, b1, 0);
mtl_setBuffer(encoder, b2, 1);
mtl_setBytes(encoder, param, 2);
```
into a compact
```
mtl_setArgs(encoder, b1, b2, param);
```
Introduce a few more specializations of `mps_setArg`, such as:
- Call `setBuffer` for `id<MTLBuffer>`
- Copy double as float (as MPS does not support double precision types)
- Accept `std::optional<at::Tensor>` that will not call setBuffer if the optional is empty
Also, re-metaprogram `mtl_setBytes` to make it usable with any trivially copyable struct, but keep a separate implementation for containers, as uploading `c10::SmallVector` (which is trivially copyable) would overwrite the next arguments, which luckily resulted in test failures of `test_cross_entropy_label_smoothing_weight_ignore_indices_mps`.
Introduce `has_size_type_v` which could be used to differentiate between trivially copyable `std::array` and `c10::ArrayRef` vs other trivially copyable structs.
```cpp
template <typename T>
class has_size_type {
  template <typename U>
  static constexpr std::true_type check(typename U::size_type*);
  template <typename>
  static constexpr std::false_type check(...);

 public:
  static constexpr bool value = decltype(check<T>(nullptr))::value;
};

template <typename T>
constexpr bool has_size_type_v = has_size_type<T>::value;
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140896
Approved by: https://github.com/Skylion007
Fixes#140598
Allows ragged structures for query and key+value sequence lengths to differ (i.e. supports cross attention for Flex + NJT).
Technically, this is BC-breaking thanks to arg renaming and positional arg reordering in `create_nested_block_mask()`, but Flex + NJT support isn't in a major release yet so I'm hoping we can just do it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140723
Approved by: https://github.com/drisspg
# Motivation
This pr is an extension of #131758. As described in #131758, these changes are looking to make distributed UTs more accessible to users of all device types.
It is a demonstration of a few changes discussed by @kwen2501 and @jgong5 in the discussion for #131758 (https://github.com/pytorch/pytorch/pull/131758#discussion_r1762422784).
This PR contains two types of changes. The first is to the common distributed folder, where we have added a new class derived from MultiProcessTestCase which helps abstract out the process group creation/deletion and other functionality for a given device.
The new generalized content can be added by deriving from this base class.
Also includes other misc changes for gaudi support
The second changed file is test_functional_api.py, a test file in common distributed. This file is a POC for how we can use this new class to write more device-agnostic distributed test cases.
The following changes have been made to test_functional_api.py:
- Functionality has been added to test for non-CUDA devices, using Intel HPU as an example
- Multiple setup steps previously required by MultiProcessTestCase have been abstracted out
- Misc adaptations to allow for general calls to accelerators while adding test skips instead of explicitly skipping for multiple GPUs
- SkipIfHpu flags have been added to enable skipping a few Multithreaded test cases which are as yet not supported on HPUs
NOTE: Within test functional api, there are tests which require the use of some multithreading functions which are as yet not supported on HPUs. These have been skipped for hpu using skipHPU decorator.
I will be raising a separate PR to improve usability of said decorators in a device-agnostic setting in the manner suggested by @kwen2501 in a comment on this PR.
This PR is a cleaned-up version of a previous PR (#136988) which I closed due to human error. I have addressed some of the comments made by @kwen2501 in this one as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138216
Approved by: https://github.com/kwen2501, https://github.com/guangyey
Here's are some explanations of this PR.
1. Changes in `aten/src/ATen/core/Tensor.cpp` and `c10/core/DispatchKey.cpp`: Support toString method for `QuantizedPrivateUse1` backend, make pytorch print out correct backend string for it.
2. Add header `DispatchStub.h` in `aten/src/ATen/native/quantized/IndexKernel.h`: if this header is not included, we can't utilize `masked_fill_kernel_quantized_stub` even when we include the `IndexKernel.h` header; it would throw an error during compilation.
3. Add multiple `TORCH_API`s in `aten/src/ATen/native/quantized/AffineQuantizer.h`: these functions are useful for other privateuse1 backends supporting quantization functions; if these `TORCH_API`s are missing, it would throw an error at runtime (undefined symbol).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139860
Approved by: https://github.com/bdhirsh
Summary:
We observed another corner case where not all split items are used, see the screenshot
{F1960315622}
We thus skip such cases by checking the getitem indices.
Test Plan:
# local reproduce
```
buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode batch-split --flow_id 663157369 2>&1 | tee ~/cmf.txt
```
P1679677122
# E2E
before fix
f663157369
after fix
Differential Revision: D65990213
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140788
Approved by: https://github.com/jackiexu1992
Also add missing mypy typing and a few asserts to make mypy happy
Partially addresses RFC 0042 (pytorch/rfcs#71)
See more details/motivation in #140460
Note: object collective version canonicalizes to global instead of group
rank, simply because this left more of the original code intact and
required fewer conversions overall.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140827
Approved by: https://github.com/kwen2501
Previously `SymmetricMemory` only had private pybind APIs:
```python
from torch.distributed._symmetric_memory import _SymmetricMemory
t = _SymmetricMemory.empty_strided_p2p(
size=(64,),
stride=(1,),
dtype=torch.float32,
device=device,
)
symm_mem_hdl = _SymmetricMemory.rendezvous(t, group_name=group.group_name)
```
This PR introduces user-facing APIs empty() and rendezvous():
```python
import torch.distributed._symmetric_memory as symm_mem
t = symm_mem.empty(64, device="cuda")
symm_mem_hdl = symm_mem.rendezvous(t, group_name=group.group_name)
```
Notable differences compared to the pybind APIs:
- `empty()` now resembles `torch.empty()`:
  - shape can either be an integer sequence or a variadic pack of ints
  - no need to/can't specify stride anymore
  - device can either be `torch.device` or a string
- `group_name` needs to be specified at rendezvous time as opposed to allocation time. See https://github.com/pytorch/pytorch/pull/139529 for the rationales. I feel the new semantic is superior, hence enforcing it in the public API.
- Currently, the pybind API still supports specifying `group_name` at rendezvous time.
This PR does not change the behavior of the pybind APIs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139677
Approved by: https://github.com/lw
ghstack dependencies: #139529
I.e. fixes
```
1082/1084] Building OBJCXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/mps/operations/UpSample.mm.o
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/UpSample.mm:224:10: warning: non-portable path to file '<ATen/native/mps/UpSample_metallib.h>'; specified path differs in case from file name on disk [-Wnonportable-include-path]
224 | #include <ATen/native/mps/Upsample_metallib.h>
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| <ATen/native/mps/UpSample_metallib.h>
```
as generated header name should have the same capitalization as respective shader file, i.e. `kernels/UpSample.metal`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140891
Approved by: https://github.com/Skylion007
Before this PR, users need to call `empty_strided_p2p()` with a `group_name`:
```python
tensor = _SymmetricMemory.empty_strided_p2p((1024,), (1,), device=device, group_name="0")
symm_mem = _SymmetricMemory.rendezvous(tensor)
```
Users can now omit `group_name` at allocation time and specify it later at rendezvous time:
```python
tensor = _SymmetricMemory.empty_strided_p2p((1024,), (1,), device=device)
symm_mem = _SymmetricMemory.rendezvous(tensor, group_name="0")
```
Rationales for this change:
- This allows the same allocation to establish symmetric memory under different groups
- Specifying `group_name` at rendezvous time instead of allocation time is a more natural UX
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139529
Approved by: https://github.com/lw
Faced with an annoying string of warnings like this when running tests,
<img width="1644" alt="Screenshot 2024-11-15 at 11 23 21 AM" src="https://github.com/user-attachments/assets/91ff4e1d-3c29-4510-9a61-46e7df68a212">
My choices seem to be (1) call destroy_process_group() at the end of
each test fn, (2) do this in some wrapper, (3) do it in the base test
class.
Since tests in MultiProcessTestCase are responsible for calling
init_process_group themselves, they should also be responsible for
calling destroy (or at least method (3) would be asymmetric and may
result in double-destroy).
But it doesn't feel worth it to go add a destroy call manually to each
test, and try/except for a possible second destroy call seems like a
happy middle ground.
Note: tests that want to ensure that destroy runs cleanly can and should
still call destroy _inside_ the test, and this change does not affect
that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140820
Approved by: https://github.com/fegin
ghstack dependencies: #140460, #140815
Avoid copypaste of send/isend and recv/irecv impl.
This does change the warning issued from send to include the identifier
"isend" instead of "send", but I think thats not a big deal.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140815
Approved by: https://github.com/fegin
ghstack dependencies: #140460
This PR adds caching for user defined triton kernels by putting the transitive closure of source code in node.meta along with constant arguments.
One HUGE hack we do here is that a node looks like
```
triton_kernel_wrapper_functional_proxy = torch.ops.higher_order.triton_kernel_wrapper_functional(kernel_idx = 0, constant_args_idx = 1, grid = [(1, 1, 1)], tma_descriptor_
metadata = {}, kwargs = {'in_ptr0': arg0_1, 'in_ptr1': arg1_1, 'out_ptr': arg0_1}, tensors_to_clone = ['out_ptr']);
```
so we use regex to remove `kernel_idx = 0, constant_args_idx = 1` parts as they are not relevant to cache hash. This is horrible and I'd like to eventually not use pickle as a hashing alternative but this is a longer project.
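Conceptually, the normalization looks something like this (an illustrative sketch; the pattern and the sample node text are assumptions, not the exact code from the PR):
```python
import re

node_repr = (
    "triton_kernel_wrapper_functional(kernel_idx = 0, constant_args_idx = 1, "
    "grid = [(1, 1, 1)], tma_descriptor_metadata = {}, ...)"
)
# Strip the cache-irrelevant index arguments before hashing the node text.
normalized = re.sub(r"kernel_idx = \d+, constant_args_idx = \d+, ", "", node_repr)
print(normalized)
```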
Differential Revision: [D65895744](https://our.internmc.facebook.com/intern/diff/D65895744)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140326
Approved by: https://github.com/zou3519
Summary:
Add a wait counter for the dump function.
This is useful to see if we get stuck in the dump function and never return for a particular job.
Test Plan: Tested locally I and see `pytorch.wait_counter.NCCLTraceBuffer__dump.busy_time_us.sum.60` in ODS.
Differential Revision: D65823433
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140823
Approved by: https://github.com/fduwjj
Summary:
Customize splitter behavior to mark `get_attr` nodes as acc supported.
Currently these nodes are excluded by `FxNetAccNodesFinder` which marks all nodes with op not in `CALLABLE_NODE_OPS` ("call_module", "call_function", "call_method") as unsupported.
Before this change, merge-net is split into an almost empty cpu submodule with a single empty output node:
```
INFO:caffe2.torch.fb.model_transform.experimental.prepare_fx_model:###### debug_print nodes for _run_on_cpu_0
INFO:caffe2.torch.fb.model_transform.experimental.prepare_fx_model:Found output node: n.name='output', n.target='output', n.args=((),), n.kwargs={}, n.meta={}
INFO:caffe2.torch.fb.model_transform.experimental.prepare_fx_model:return ()
INFO:caffe2.torch.fb.model_transform.experimental.prepare_fx_model:
_run_on_cpu_0 stats for merge:
[output] output: 1
```
full log: P1678727348 (generated using same command as below)
Test Plan:
Tested by lowering `ig_organic_feed_cn_v2_mtml` using cmd:
```
buck run mode/opt-split-dwarf //tgif/cli:cli -- --model-name=ig_organic_feed_cn_v2_mtml --model-type ig_organic_feed_cn_v2_mtml --world-size=1 --storage-mode 1 --inference-dtype=FP16 --meta-transform=False --use-random-weights=True --accelerator-arch=3 --enable-input-dist=True --embedding-tables-dtype=FP16 --mtia-use-torch-export=True embedding-quantization-pass torchrec-sharding-pass tgif-split-pass gen-app-graph-pass tgif-mtia-lowering-pass dense-quantization-pass save-torch-package-pass generate-model-package-pass pack-weights-and-save-pass 2>&1 | tee /tmp/publish_ig_organic_feed_cn_v2_mtml_mtia_export_20241114_splitter_2.log
```
Output shows only 1 acc submodule is generated for merge:
```
INFO 18:33:15.951 1735650 utils.py:235: [TGIF] num of acc submodules: 1
INFO 18:33:15.952 1735650 utils.py:236: [TGIF] num of cpu submodules: 0
INFO 18:33:16.534 1735650 logging_utils.py:53: [TGIF] _run_on_acc_0 graph module debug info: https://www.internalfb.com/intern/everpaste/?color=0&handle=GK4VKhWsDKF9VdsDAKxhR6KAlhJ0br0LAAAz
INFO 18:33:16.534 1735650 utils.py:257: [TGIF] Start MTIA lowering _run_on_acc_0 in merge, device ordinal: -1
```
full log: P1679596796
Differential Revision: D65983916
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140794
Approved by: https://github.com/ezyang
Summary:
### Motivation
In D65283170, we need a subclass of quantizable LSTM to enable split_gates. This is also required for tests.
### What's the change?
As the subclass is not part of the no_observer() set, an improper observer is added after the quantizable LSTM module. Here, we change the exact class check to an issubclass check on the no_observer set.
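Roughly, the change is along these lines (an illustrative sketch; the helper name and the class set are assumptions, not the actual quantizer code):
```python
from torch.ao.nn.quantizable import LSTM as QuantizableLSTM

def skips_observer(module, no_observer_classes=(QuantizableLSTM,)):
    # before: type(module) in no_observer_classes  (exact class match only)
    # after: subclasses of a quantizable module also skip observer insertion
    return issubclass(type(module), no_observer_classes)
```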
Test Plan:
- N6206576
- CI.
Reviewed By: andrewor14
Differential Revision: D65989314
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140818
Approved by: https://github.com/andrewor14
The implementation of the `softmax_backward_data` operator for the CPU backend produces incorrect results when the `output` argument is non-contiguous.
Here is a test case that demonstrates this issue:
```python
import torch

torch.manual_seed(0)
op = torch.ops.aten._softmax_backward_data
grad_output = torch.ones(3, 3, 3)
temp = torch.randn(3, 10, 3)
out = temp[:, :3, :]
out = out.contiguous()
print(out.is_contiguous())
grad_input = op(grad_output, out, 1, torch.float32)
print(grad_input)
```
In this test case, the variable `grad_input` yields incorrect results if the line `out = out.contiguous()` is commented out. With this fix, `grad_input` consistently produces the same results whether or not `output` is contiguous.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139740
Approved by: https://github.com/zou3519
Now that all cells are modeled as `NewCellVariable` in Dynamo, we no
longer need to put cell variables into this special `closure_cells`,
rather we just merge `closure_cells` with `symbolic_locals`.
This allows us to merge and remove some code paths, notably make
`LOAD_CLOSURE` the same as `LOAD_FAST`, and `LOAD_DEREF` & `STORE_DEREF`
the same for inlining or regular `InstructionTranslator`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140154
Approved by: https://github.com/jansel
ghstack dependencies: #140330, #140152, #140436, #140435, #140153
In addition to `NewCellVariable`, Dynamo has 3 ways of modeling cell objects:
1. For cells captured and created by the root frame, represent them as
their contents in `root_tx.symbolic_locals`, which `LOAD_DEREF` and
`STORE_DEREF` update directly, without going through `SideEffects`.
2. `ClosureVariable`: this is created when cells from (1) are captured
by a newly created function Dynamo is about to inline. It's a handle
with a name that redirects `LOAD_DEREF` and `STORE_DEREF` back (1),
to make `root_tx.symbolic_locals` up-to-date.
3. For cells that are captured by both the root frame and some
pre-existing function Dynamo is about to inline, represent those
cells as contents, and do not allow writes to them.
Note that (2) and (3) are mainly to conform with (1) -- to make sure
Dynamo has a consistent modeling of cells for the same cell objects.
In this patch, we represent all of these cells as `NewCellVariable`. The
main new code paths introduced are:
- using `NewCellVariable` to model cell objects created by the root
frame (the cells are passed in as input to `InstructionTranslator`),
this is what allows us to get rid of all 3 legacy paths above.
- adding a new `AutoDerefLocalSource` to deal with the python-code
level (guards) and bytecode level (codegen) auto-dereferencing
behavior, when accessing pre-existing python cells. This also
involves a tiny update to guard manager generation.
- plumbing some extra info into `LocalSource` and `CellVariable` so that
we can still emit `LOAD_DEREF`, `STORE_DEREF`, `LOAD_CLOSURE` (instead
of `make_cell`, `cell_contents` attribute access, and `LOAD_FAST`),
which is important for readability, performance, and some
assumptions `bytecode_transformation.py` makes.
As a result, this patch removes a lot of the now-dead code paths and
TODOs. Notably, it significantly simplified the `prune_dead_locals`
function, which was duplicating a lot of the logic from
`prune_dead_object_new`; this conveniently closes #137123.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140153
Approved by: https://github.com/jansel
ghstack dependencies: #140330, #140152, #140436, #140435
In `match_nested_cell`, Dynamo tried to identify pre-existing captured
cells by `(cell_name, id(cell_contents))`. This works in most cases, but
as the test added in this patch shows, it's not a complete solution.
This patch
1. changes `match_nested_cell` to `lookup_variable_for_captured_cell`,
and does the lookup based on id of cell objects, not their contents.
This requires plumbing a tuple of captured cell objects from
different CPython versions all the way to
`InstructionTranslator.__init__`, where we store a mapping from the
ids of these cell objects, and use it later in
`UserFunctionVariable.bind_args` to look for these unboxed cells.
2. builds off (1) -- rather than using a `VariableTracker` that
represents the content of the unboxed cells, use `ClosureVariable`,
which enables codegen in case these cells escape as closure of a
`NestedUserFunctionVariable`.
The patch adds a regression test for each of the scenarios above:
1. `test_write_to_cells_with_name_shadowing` where Dynamo mistakenly
thought the program is writing to a cell captured by root frame (which
it doesn't support atm), which resulted in
```
File "/Users/ryanguo99/Documents/work/pytorch/torch/_dynamo/symbolic_convert.py", line 3340, in STORE_DEREF
unimplemented("write to __closure__ while inlining")
File "/Users/ryanguo99/Documents/work/pytorch/torch/_dynamo/exc.py", line 313, in unimplemented
raise Unsupported(msg, case_name=case_name)
torch._dynamo.exc.Unsupported: write to __closure__ while inlining
```
2. `test_existing_func_that_creates_capturing_nested_func` where Dynamo
ended up trying to codegen a `NestedUserFunctionVariable` that
captures a cell which was also captured by the root frame, so it was
unboxed and ends up emitting `LOAD_DEREF` rather than
`LOAD_FAST/LOAD_CLOSURE` during codegen, resulting in
```
File "/Users/ryanguo99/Documents/work/pytorch/torch/_dynamo/variables/functions.py", line 105, in _create_nested_fn
func = FunctionType(code, f_globals, name, defaults, closure)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: arg 5 (closure) expected cell, found int
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140436
Approved by: https://github.com/jansel, https://github.com/williamwen42
ghstack dependencies: #140330, #140152
This patch introduces a `DynamoFrameType` to serve as a layer between
Dynamo and different versions of the Python frame object. In
`DynamoFrameType`, we only register attributes Dynamo cares about (e.g.,
`f_code`, `f_locals`, etc.).
This will be helpful when it comes to adding new attributes to this
`DynamoFrameType`, or dealing with Python version changes.
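A minimal sketch of the wrapper idea, assuming only the attributes mentioned above (the real class lives inside Dynamo and registers more):
```python
import types

class DynamoFrameType:
    def __init__(self, frame: types.FrameType):
        # Only expose the attributes Dynamo actually reads, so version-specific
        # frame details stay behind this one layer.
        self.f_code = frame.f_code
        self.f_locals = frame.f_locals
```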
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140330
Approved by: https://github.com/jansel, https://github.com/williamwen42
Context: we are trying to pass an empty tensor through the system now (sometimes... it's an edge case), and it seems to cause all_reduce to segfault, which is unexpected to me
Deep Shah and Pavan identified the issue, I'm just pushing for a fix :)
Test Plan: idk what i'm doing here, someone help
Reviewed By: shuqiangzhang
Differential Revision: D65956095
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140741
Approved by: https://github.com/shuqiangzhang
Features:
(1) Add support for tree structure.
(2) Add user warning before axes to shapes conversion
(3) Add suggestion of providing `dynamic_shapes` when conversion fails
Notes:
(1) `input_names` is crucial to the conversion, as we don't know the ONNX graph inputs.
(2) min and max are set by default, so LLMs have a higher chance of failing if users use `dynamic_axes`, because of the min/max constraint dependencies between `attention_mask` and `sequence_length`, etc. (found in llama-3.2-1B_Instruct)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140488
Approved by: https://github.com/justinchuby
Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
Summary:
If unhealthy, the user should be able to get the type of errors, e.g.,
timeout, NCCL error, or remote error.
This API is applied to PG level, compared to the work.get_future_result() API which is applied to Work Level.
Error detection at PG level is much more convenient for users to handle the PG failure as a whole, e.g, restarting the PG.
Error handling at the work level is still useful for users to attach work-specific context and debug the root cause of the specific failing work/collective
Note it is critical for all ranks in the PG to be notified about an error as soon as it occurs, so we introduce an errorType of REMOTE_ERROR, which is 'broadcasted' from a src rank (which detects a local error) to all other ranks in the PG, the broadcast is done through TCPStore currently
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140087
Approved by: https://github.com/kwen2501
Functionally, the two decorators are very similar, but one should rely on expectedFailure as much as possible to get a signal when something is fixed.
- Move `product_version` variable from `test_mps` to common_utils, but call it `MACOS_VERSION`
- Introduce `skipIfMPSOnMacOS13` to decorate the hard crashes that happen only on MacOS 13 (which at this point will not get any fixes and will be deprecated soon)
- Add `device_type='mps'` to all `skipIfMPS` per https://github.com/pytorch/pytorch/issues/140560
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139940
Approved by: https://github.com/janeyx99, https://github.com/huydhn
Fixes following warnings:
```
In file included from /Users/malfet/git/pytorch/pytorch/torch/csrc/Generator.cpp:25:
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/mps/MPSGeneratorImpl.h:40:63: warning: extra ';' after member function definition [-Wextra-semi]
40 | void set_engine(at::Philox4_32 engine) { engine_ = engine; };
| ^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/mps/MPSGeneratorImpl.h:41:46: warning: extra ';' after member function definition [-Wextra-semi]
41 | at::Philox4_32 engine() { return engine_; };
| ^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/mps/MPSGeneratorImpl.h:43:62: warning: extra ';' after member function definition [-Wextra-semi]
43 | static DeviceType device_type() { return DeviceType::MPS; };
| ^
3 warnings generated.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140776
Approved by: https://github.com/Skylion007
Currently, we get all partition ids by iterating over `assignment`, whose size equals the number of nodes in the graph. But we can reach the same result by iterating over `partitions_by_id`, whose size is much smaller than the number of nodes. Assuming the number of nodes is N and the number of partitions is P, the time complexity decreases from O(N * N) to O(N * P) after this patch.
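A hedged sketch of the difference (illustrative names, not the actual partitioner code):
```python
def partition_ids_old(assignment):
    # assignment maps node -> partition id; its size is the number of nodes N
    return set(assignment.values())

def partition_ids_new(partitions_by_id):
    # partitions_by_id maps partition id -> partition; its size is the number of partitions P
    return set(partitions_by_id.keys())
```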
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136598
Approved by: https://github.com/mcr229
Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
When `TORCH_SYMM_MEM_ALLOW_OVERLAPPING_DEVICES` is set, the check for overlapping devices and multicast support will be disabled. This is useful for testing with a single device.
Making this an env var instead of an API argument since this is likely only useful for testing.
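For example, a test might opt in like this (assuming the variable is read when symmetric memory is initialized):
```python
import os

os.environ["TORCH_SYMM_MEM_ALLOW_OVERLAPPING_DEVICES"] = "1"
```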
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140127
Approved by: https://github.com/lw
Summary: The gm_torch_level can be a _LazyGraphModule(GraphModule) instead of a GraphModule. When we call .recompile(), GraphModule populates the self._out_spec, but _LazyGraphModule(GraphModule).recompile() doesn't populate it.
Test Plan: CI
Differential Revision: D65902135
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140608
Approved by: https://github.com/tugsbayasgalan
This reintroduces support for high channel sizes for convs. The guard for macOS versions < 15.1 is still present to prevent reintroducing #129207.
I'm unsure about the specific macOS version support, but I'm assuming this was fixed in 15.1, and I'm relying on signals from ci for verification. I'm expecting the new test will fail for macOS versions < 15.1, and the old test will start failing for > 15.0. I've added xfails for this and extended the version helpers to support 15.1+.
Fixes #140722
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140726
Approved by: https://github.com/malfet
@ezyang noticed this exercises a multithreading bug that is causing tests to become disabled:
```
2024-11-13T21:05:55.8363582Z inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCPU::test_comprehensive_fft_ihfftn_cpu_int32 /opt/conda/envs/py_3.9/lib/python3.9/site-packages/_pytest/threadexception.py:73: PytestUnhandledThreadExceptionWarning: Exception in thread Thread-3
2024-11-13T21:05:55.8364857Z
2024-11-13T21:05:55.8364974Z Traceback (most recent call last):
2024-11-13T21:05:55.8365491Z File "/opt/conda/envs/py_3.9/lib/python3.9/threading.py", line 980, in _bootstrap_inner
2024-11-13T21:05:55.8366003Z self.run()
2024-11-13T21:05:55.8366371Z File "/opt/conda/envs/py_3.9/lib/python3.9/threading.py", line 917, in run
2024-11-13T21:05:55.8366858Z self._target(*self._args, **self._kwargs)
2024-11-13T21:05:55.8367518Z File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/fbscribelogger/__init__.py", line 176, in _run_event_loop
2024-11-13T21:05:55.8368189Z self.loop.run_until_complete(self.task)
2024-11-13T21:05:55.8368774Z File "/opt/conda/envs/py_3.9/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
2024-11-13T21:05:55.8369348Z return future.result()
2024-11-13T21:05:55.8369980Z File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/fbscribelogger/__init__.py", line 214, in _worker
2024-11-13T21:05:55.8370603Z message = await asyncio.wait_for(
2024-11-13T21:05:55.8371090Z File "/opt/conda/envs/py_3.9/lib/python3.9/asyncio/tasks.py", line 442, in wait_for
2024-11-13T21:05:55.8371573Z return await fut
2024-11-13T21:05:55.8372156Z File "/opt/conda/envs/py_3.9/lib/python3.9/asyncio/queues.py", line 166, in get
2024-11-13T21:05:55.8372613Z await getter
2024-11-13T21:05:55.8374010Z RuntimeError: Task <Task pending name='Task-1' coro=<FbScribeLogger._worker() running at /opt/conda/envs/py_3.9/lib/python3.9/site-packages/fbscribelogger/__init__.py:214> cb=[_run_until_complete_cb() at /opt/conda/envs/py_3.9/lib/python3.9/asyncio/base_events.py:184]> got Future <Future pending> attached to a different loop
2024-11-13T21:05:55.8375366Z
2024-11-13T21:05:55.8375603Z warnings.warn(pytest.PytestUnhandledThreadExceptionWarning(msg))
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140717
Approved by: https://github.com/ezyang, https://github.com/zxiiro
Summary: https://github.com/pytorch/pytorch/pull/136505 changed the cache_clear operation to remove loaded modules from disk. That change caused some problems with TORCHINDUCTOR_FORCE_DISABLE_CACHES=1, where there are some code paths (coordinate descent tuning at least) where we call `PyCodeCache.load_by_key_path` and expect that the files are still on disk. (But when caches are disabled, we call cache_clear before every inductor compile). It seems we probably have a shortcoming in the disable-cache logic, but since we also have flaky test failures with the same `'could not get source code'` error, let's restore the previous functionality until I can investigate further.
Since some tests actually _DO_ want to delete on-disk artifacts (e.g., to test remote caching), then I added a `purge` param to optionally delete files
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140216
Approved by: https://github.com/eellison
Here's the overview:
There's a new contextmanager singleton called MetricsContext. Entering the MetricsContext is how we demarcate the boundary on which we'll create a single CompilationMetrics object, and therefore, a single dynamo_compile log entry. While we're inside the MetricsContext, we can update/set many different metrics. Most importantly: `dynamo_timed` can also update the in-progress MetricsContext. In the proposal here, we tell `dynamo_timed` that we want it to do so by providing the name of the MetricsContext field to increment. There can be many `dynamo_timed` calls in different parts of the code updating different fields. Then when the MetricsContext exits, that's when the logging of everything gathered finally happens. One potential footgun is trying to use `dynamo_timed` when we haven't entered the MetricsContext, but we assert on that problem. Another problem is that we re-enter the context recursively, but we watch for that and do the logging only when the outermost exits.
Some specifics:
* Introduce MetricsContext - a context manager that on exit, records the CompilationMetrics (which also logs to dynamo_compile).
* Completely remove the concept of frame_phase_timing. Instead, update the MetricsContext during compilation, either directly or via dynamo_timed.
* Remove some globals we previously used to accumulate counters to later populate a CompilationMetrics. We use CompilationMetrics set/update/increment APIs instead.
* `record_compilation_metrics` is now called on exit from MetricsContext.
* Populate legacy CompilationMetrics fields right before logging, inside `record_compilation_metrics`.
* Remove the one-off `add_remote_cache_time_saved` helper; capture that timing directly into the MetricsContext.
And specifically, several changes to dynamo_timed:
* "Modernize" the parameters and update all callsites accordingly.
* Move the backwards logging of the CompilationMetrics to the backwards compile location.
* Add a parameter for which CompilationMetrics field to update
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139849
Approved by: https://github.com/ezyang
Pylance infers the type of the first argument (`enabled`) to `_record_memory_history` as `str` even though the function accepts `Literal[None, "state", "all"]`.
This raises an issue when passing `None`, even though it is a legitimate argument.
This PR addresses the issue by adding the type annotation in the doc string.
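The call pattern in question looks like the following (requires a CUDA build; per the docs, `None` stops recording):
```python
import torch

torch.cuda.memory._record_memory_history("all")   # start recording allocation traces
# ... run the workload under investigation ...
torch.cuda.memory._record_memory_history(None)    # stop recording; None is legitimate
```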
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140545
Approved by: https://github.com/Skylion007
Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
Differential Revision: D63206258
This diff introduces a mechanism to generate a json-compatible deserializer in cpp using nlohmann json (already being used by AOTI).
Why do we need this? Because there will be a lot of cases where people don't want to use Python to load the graph (e.g. cpp runtime), and instead they can use this header to deserialize the JSON graph.
Every time we call update_schema.py to update the schema, the header will be auto generated and included into the source files.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136398
Approved by: https://github.com/angelayi
On the executor side, when it is found that meta.data_ptr is not in the allocated memory, tensor creation will fail; but there is no need to allocate memory when creating an empty tensor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140496
Approved by: https://github.com/ezyang
**About this PR**
This PR adds the following ops for `linear_dynamic_fp16` in onednn namespace. These ops are intended for PT2E quantization eager mode.
- `onednn::linear_prepack_fp16`: packs fp32 weight to an fp16 MkldnnCPU tensor.
- `onednn::linear_dynamic_fp16`: takes an fp32 CPU tensor and an fp16 MkldnnCPU tensor and computes linear in fp32
- `onednn::linear_relu_dynamic_fp16`: similar to the former, but applies ReLU on the output.
**Test plan**
`python test/test_quantization.py -k test_linear_dynamic_fp16_onednn`
**Implementation**
These ops call oneDNN lib under the hood. It's worth noting that oneDNN does not support f32 * f16 -> f32 computation, so we have to convert fp16 weight to fp32 before computation. And weight is still in plain format after packing.
**Correctness and performance**
Correctness is guaranteed by UT.
Performance of the new ops may be better than the FBGEMM implementation when weight shape is small but worse when weight shape is large. It's because weight dtype conversion and computation are not fused.
For example, I ran benchmarks on an Intel(R) Xeon(R) Platinum 8490H machine with different cores and shapes. When using 1 core per instance, the new implementation generally is faster for weight shape < 1024 * 1024. When using more cores, the threshold will increase.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140376
Approved by: https://github.com/jerryzh168, https://github.com/jgong5
Related to #107302
We saw `test_float_to_int_conversion_nonfinite` fail as we upgraded to NumPy 2.
It is caused by the undefined behavior of `numpy` casting `inf`, `-inf` and `nan` from `np.float32` to other dtypes.
The test is using NumPy as reference for the ground truth. (see line 1013-1015)
However, these behaviors are undefined in NumPy.
If you do `np.array([float("inf")]).astype(np.uint8, casting="safe")`, it results in an error `TypeError: Cannot cast array data from dtype('float64') to dtype('uint8') according to the rule 'safe'`.
The undefined behaviors are always subject to change.
This PR addresses this issue by passing concrete values as the ground truth references.
In the future, even if NumPy changes its behavior, the test will remain stable.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138131
Approved by: https://github.com/drisspg
…al studio build tool is only needed for Windows
I created no issue since the suggested change is actually very small. This is my very first PR so partly I am creating it just to dip my toes in the water. In fact I would understand if the change does not get accepted since it's a simple modification to part of the wording in the README. The wording as it currently stands is probably clear enough for most people, but I still missed the fact that the visual studio build tool must only be installed for Windows (even though that is stated there), and I thought by adding some parentheses this might become even more clear, especially since elsewhere in the README the formatting makes it more explicit that some steps must only be run for Windows/Linux/MacOS.
As I said, it's a trivial change so I'd understand if it's not accepted, and I am looking forward to making more meaningful contributions as time goes on.
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140442
Approved by: https://github.com/soulitzer
Summary:
Removes print statements and implements logging via the logging library.
Hopefully this will allow more control on the level of logging when running models.
Test Plan:
```
AOT_PARTITIONER_DEBUG=1 buck2 run @mode/opt //aps_models/ads/icvr:icvr_launcher -- mode=local_fb_fm_v4 launcher.num_workers=2
```
Resulting output paste: P1674535630
* Full logs paste: P1674535621
```
pastry P1674535621 | grep "functorch/partitioners.py" | pastry
```
Logging results: P1674549514
Differential Revision: D61678215
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139782
Approved by: https://github.com/paryxyt, https://github.com/jansel
This PR adds native implementation of unfold_backward as metal shader, mostly copy-n-paste of algorithms used in CUDA and CPU implementations, i.e. considering `out = in.unfold(dim, size, step)`, then following holds true:
* `out.shape[dim] == (in.shape[dim] - size) / step + 1`
* `out.shape[-1] == size`
* `out.ndim == in.ndim + 1`
`unfold_backward` Metal kernel receives `grad_in` and returns `grad_out` such that:
* `grad_in.shape == out.shape`
* `grad_out.shape == in.shape`
For each index in `grad_out` find the elements contributing to it and sum them up. Such algorithm requires no synchronization between threads.
That is, `grad_out[...,out_dim_idx,...]` accumulates all values `grad_in[...,in_dim_idx,...,in_last_idx]`, where `in_dim_idx` is in the range [`(out_dim_idx - size) / step`, `out_dim_idx / step`] clamped to (0, `in_dim_size`) and `in_last_idx` equals `out_dim_idx - in_dim_idx * step`. The accumulation step is skipped if `in_last_idx` is outside of the [0, size] range.
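A quick numeric check of the forward-shape relationships listed above (hypothetical sizes):
```python
import torch

x = torch.randn(10)                # in.shape[dim] == 10
dim, size, step = 0, 4, 2
out = x.unfold(dim, size, step)
assert out.shape[dim] == (x.shape[dim] - size) // step + 1   # 4
assert out.shape[-1] == size
assert out.ndim == x.ndim + 1
```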
This operator has been requested 16 times on https://github.com/pytorch/pytorch/issues/77764
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135411
Approved by: https://github.com/manuelcandales
Co-authored-by: Manuel Candales <42380156+manuelcandales@users.noreply.github.com>
Summary: output nodes may be eliminated into the input nodes if only a subset of the output nodes is specified. Add an option to check results for all output nodes in the partitioned graph.
Test Plan: see D65367305
Reviewed By: qcyuan
Differential Revision: D65367305
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139774
Approved by: https://github.com/jfix71
Summary:
It seems like this issue is due to leftover CUPTI events from warmup staying persistent in the queue during profiling. These events start before our actual time window and therefore have a timestamp lower than our basetime. This makes the delta become negative, which results in unsigned overflow. This then creates a large number which later gets sign added, which creates the signed overflow.
Solution: If a raw timestamp is less than the base timestamp, just mark the process timestamp as -1 so we can mark these events as "to ignore". In Kineto, add a special case to ignore timestamps that are negative.
Test Plan: Test with ASAN
Differential Revision: D65835650
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140441
Approved by: https://github.com/davidberard98
Fixes https://github.com/pytorch/pytorch/issues/138715
It looks like we were previously ignoring guards on FSDP module parameters. In the issue linked above, this was causing inductor size/stride asserts to fire. The root cause is that for some code like this:
```
m = FSDP(
    torch.nn.Sequential(
        torch.compile(torch.nn.Linear(1024, 1024)),
        torch.compile(torch.nn.Linear(1024, 4096))
    )
)
```
We need to generate two different graphs for the two linear layers, and it looks like without a `TENSOR_MATCH` guard on the linear parameters, dynamo would think that it could re-use the same graph across both layers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138819
Approved by: https://github.com/anijain2305
Summary:
LLVM-15 has a warning `-Wunused-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.
This diff either (a) removes an unused variable and, possibly, its associated code or (b) qualifies the variable with `[[maybe_unused]]`.
#buildsonlynotests - Builds are sufficient
- If you approve of this diff, please use the "Accept & Ship" button :-)
Test Plan: Sandcastle
Reviewed By: meyering
Differential Revision: D65833225
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140569
Approved by: https://github.com/Skylion007
Currently, when ptxas errors occur in one of the autotuning configs, we error out. This doesn't match the newly introduced behavior of native Triton ([here](915c149978/python/triton/runtime/autotuner.py (L164))). In this PR, we match Inductor's autotuning behavior to native Triton's by ignoring ptxas errors and skipping the configs that trigger them.
This unblocks PT2 compilation of an internal model.
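A hedged sketch of the behavior described (illustrative names, not Inductor's actual autotuner code):
```python
def benchmark_configs(configs, compile_and_bench):
    timings = {}
    for cfg in configs:
        try:
            timings[cfg] = compile_and_bench(cfg)
        except RuntimeError as e:
            if "ptxas" in str(e).lower():
                continue   # drop the offending config, matching native Triton
            raise
    return timings
```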
Differential Revision: D65861236
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140495
Approved by: https://github.com/chenyang78
Enable concat linear for CPU mkldnn path.
Previously, we had a concat linear pass among the freezing passes, but it did not work on CPU.
This is because the `concat_linear` pattern ran after `mkldnn_weight_prepack`, and `concat_linear` only handles `addmm/mm` etc.
```
addmm -> mkldnn linear
addmm -> mkldnn linear -> cannot concat
# only worked when disable mkldnn
addmm ->
addmm -> concat linear
```
Now we make the `mkldnn linear`-related pass numbers larger than the `concat_linear` pass number.
```
addmm -> concat linear -> mkldnn linear
addmm ->
```
So it can work fine with mkldnn linear now.
Also, since concat linear does not always have benefits, we add a flag `config.cpp.enable_concat_linear` with a default value of False. Users can enable it as needed.
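A usage sketch under the assumption that concat linear runs as a freezing pass (the flag name is taken from this PR; the rest is illustrative):
```python
import torch
import torch._inductor.config as inductor_config

inductor_config.freezing = True                   # concat linear is a freezing pass
inductor_config.cpp.enable_concat_linear = True   # off by default, per this PR

model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.Linear(64, 64)).eval()
compiled = torch.compile(model)
with torch.no_grad():
    compiled(torch.randn(8, 64))
```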
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139048
Approved by: https://github.com/jgong5, https://github.com/jansel
Summary:
1. We don't want to exit with exceptions when there are so many mismatches. We should just break and return.
2. Polish the message for dtype mismatch. The dtype of input/output is actually a list, not a string, so we don't want to show a list like ['double'] in the output message.
Test Plan:
Testing on the case when we see too many collective dtype mismatch
{F1958467224}
Differential Revision: D65841830
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140451
Approved by: https://github.com/c-p-i-o
aten._add_relu doesn't have a meta function registered, so in the dynamic-shape case it throws an error in the dynamo logs:
Error:
`V1107 11:25:32.344000 140481543555072 torch/_dynamo/symbolic_convert.py:534] [0/1] [__graph_breaks] NotImplementedError: aten::_add_relu.Tensor: attempted to run this operator with Meta tensors, but there was no fake impl or Meta kernel registered. You may have run into this message while using an operator with PT2 compilation APIs (torch.compile/torch.export); in order to use this operator with those APIs you'll need to add a fake impl.`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140009
Approved by: https://github.com/ezyang
This PR is a supplement to https://github.com/pytorch/pytorch/pull/133980. The previous PR fulfilled the basic functionality of the XPU device guard, but we found that it fails to address structured operators.
With the current PR, the code snippet in RegisterXPU.cpp is as follows, where we can see the device guard is successfully generated.
```c++
struct structured_exp_out_functional final : public at::native::structured_exp_out {
    void set_output_strided(
        int64_t output_idx, IntArrayRef sizes, IntArrayRef strides,
        TensorOptions options, DimnameList names
    ) override {
        auto current_device = guard_.current_device();
        if (C10_UNLIKELY(current_device.has_value())) {
            TORCH_INTERNAL_ASSERT(*current_device == options.device(),
                "structured kernels don't support multi-device outputs");
        } else {
            guard_.reset_device(options.device());
        }
        outputs_[output_idx] = create_out(sizes, strides, options);
        if (!names.empty()) {
            namedinference::propagate_names(outputs_[output_idx], names);
        }
        // super must happen after, so that downstream can use maybe_get_output
        // to retrieve the output
        at::native::structured_exp_out::set_output_raw_strided(output_idx, sizes, strides, options, names);
    }
    void set_output_raw_strided(
        int64_t output_idx, IntArrayRef sizes, IntArrayRef strides,
        TensorOptions options, DimnameList names
    ) override {
        auto current_device = guard_.current_device();
        if (C10_UNLIKELY(current_device.has_value())) {
            TORCH_INTERNAL_ASSERT(*current_device == options.device(),
                "structured kernels don't support multi-device outputs");
        } else {
            guard_.reset_device(options.device());
        }
        outputs_[output_idx] = create_out(sizes, strides, options);
        if (!names.empty()) {
            namedinference::propagate_names(outputs_[output_idx], names);
        }
        // super must happen after, so that downstream can use maybe_get_output
        // to retrieve the output
        at::native::structured_exp_out::set_output_raw_strided(output_idx, sizes, strides, options, names);
    }
    const Tensor& maybe_get_output(int64_t output_idx) override {
        return outputs_[output_idx];
    }
    std::array<Tensor, 1> outputs_;
    c10::OptionalDeviceGuard guard_;
};
```
However, without the current change, the generated code is
```c++
struct structured_exp_out_functional final : public at::native::structured_exp_out {
    void set_output_strided(
        int64_t output_idx, IntArrayRef sizes, IntArrayRef strides,
        TensorOptions options, DimnameList names
    ) override {
        outputs_[output_idx] = create_out(sizes, strides, options);
        if (!names.empty()) {
            namedinference::propagate_names(outputs_[output_idx], names);
        }
        // super must happen after, so that downstream can use maybe_get_output
        // to retrieve the output
        at::native::structured_exp_out::set_output_raw_strided(output_idx, sizes, strides, options, names);
    }
    void set_output_raw_strided(
        int64_t output_idx, IntArrayRef sizes, IntArrayRef strides,
        TensorOptions options, DimnameList names
    ) override {
        outputs_[output_idx] = create_out(sizes, strides, options);
        if (!names.empty()) {
            namedinference::propagate_names(outputs_[output_idx], names);
        }
        // super must happen after, so that downstream can use maybe_get_output
        // to retrieve the output
        at::native::structured_exp_out::set_output_raw_strided(output_idx, sizes, strides, options, names);
    }
    const Tensor& maybe_get_output(int64_t output_idx) override {
        return outputs_[output_idx];
    }
    std::array<Tensor, 1> outputs_;
};
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138802
Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/ezyang
There has been a series of attempts to provide support for resizing in
torch operators like `torch.sigmoid(x, out=y)`, i.e., `y` would have a
different shape before and after this expression. Prior to this patch,
we have some checks to graph break if the shape changed.
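For illustration, a minimal instance of the pattern (hypothetical shapes):
```python
import torch

x = torch.randn(4)
y = torch.empty(2)        # deliberately the wrong shape
torch.sigmoid(x, out=y)   # eager resizes y to shape (4,) with a warning;
                          # under torch.compile this kind of resize now graph-breaks
```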
This patch:
1. extends the existing check and graph break for any shape change, not
just for `TensorVariable` with source field.
2. removes an old code path which was introduced to address the shape
change, but became obsolete in that regard because we added extra
checks to graph break upon shape change. Moreover, this old code path
is unsound, it tries to replace references to the old
`TensorVariable` the new one returned by `wrap_fx_proxy`, but it only
does the replacement in `symbolic_locals`, which breaks when cells
are involved. In general the old `TensorVariable` could be _anywhere_,
think the `replace_all` we had for immutable VTs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140202
Approved by: https://github.com/jansel
ghstack dependencies: #140035, #140036, #140149, #140150, #140151, #140201
This patch fixes 2 things which are exposed if we have `NewCellVariable`
rather than `ClosureVariable` to model python cells:
1. `codegen_save_tempvars` must run first, to establish `source` for
objects, otherwise they can't reconstruct.
2. `prune_dead_object_new` must account for `OutputGraph.backward_state`
as well, since it also contains variables that must live.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140201
Approved by: https://github.com/jansel
ghstack dependencies: #140035, #140036, #140149, #140150, #140151
The `cell_or_freevar` was added in #106403 to help us ensure
Dynamo-export only allows graph input that depends on the frame input
(rather than a captured cell, for instance).
However, when taken literally, the `cell_or_freevar` condition is
actually not accurate, because for frame inputs that are also cells
(i.e., captured by some inner function), we actually set the
`cell_or_freevar` flag to false. This makes sense, because otherwise the
existing implementation would prevent Dynamo-export to add any of these
inputs to the graph.
To help with reasoning, this patch refines the `cell_or_freevar` flag to
what we really want to check -- `is_input`, and updates the relevant use
sites.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140151
Approved by: https://github.com/jansel
ghstack dependencies: #140035, #140036, #140149, #140150
In `UserFunctionVariable.bind_args`, there's a rare case when the
underlying function satisfies all conditions below
1. The function captures a pre-existing cell
2. The cell isn't captured by root frame
3. `UserFunctionVariable.source` is `None`
In such cases, Dynamo would model the cell as its content (just like
what we do for cells in the root frame). However, this could break in
two cases:
- We could have multiple instances of `UserFunctionVariable`, where some
have source and others don't. This means sometimes we'll model the
cell as a `NewCellVariable`, and sometimes as its content. This
causes issues because writes to the `NewCellVariable` would be
buffered in `SideEffects` and never get picked up by the other
modeling.
- Only when `UserFunctionVariable` has a source, do we check whether we
already had a `NewCellVariable` for the captured cell. This again causes
Dynamo to potentially have multiple representations for the same cell
object, resulting in a similar "buffered writes not reflected" issue
as above.
This patch fixes the above 2 issues by
1. modeling captured cells of sourceless `UserFunctionVariable` as
immutable `NewCellVariable`, and adds a few lines in `SideEffects` to
account for its immutability.
2. always checking whether we already had a `NewCellVariable` for the
captured cell, before constructing a new one.
Tests are added for each aforementioned case.
I also left a TODO to investigate why exactly we would lose source
information for `UserFunctionVariable`. Some cases are easily fixable,
but others not so much.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140150
Approved by: https://github.com/jansel
ghstack dependencies: #140035, #140036, #140149
We added an unboxing optimization to avoid writes to cells that existed
before Dynamo tracing (such writes interfere with HOPs). However, the
avoided write shouldn't be there in the first place, since we were
basically creating an empty `NewCellVariable`, and then write the
pre-existing content into the variable.
This patch
1. adds logic to bypass the initial write for pre-existing cells
without undermining correctness.
2. removes the unboxing optimization and the restart code path.
Fixes #137456, #138491; also see those issues for more historical
context.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140149
Approved by: https://github.com/ezyang, https://github.com/jansel
ghstack dependencies: #140035, #140036
The `export_freevars` method was introduced very early on, for
propagating writes to unboxed cells from child to parent frame, see
https://github.com/pytorch/torchdynamo/commit/d0c10341.
However, it's no longer needed after we started to modify root tracer's
`symbolic_locals` directly for the unboxed cells, see
https://github.com/pytorch/torchdynamo/commit/663e4d92.
As a result, we no longer need `export_freevars`. In fact, it can cause
a very subtle bug when name collision happens across the parent and
child frames during inlining, because the parent frame isn't necessarily
the frame that defined the cell captured by child frame.
In summary, this patch removes the `export_freevars` bits, and adds a
regression test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140036
Approved by: https://github.com/williamwen42, https://github.com/jansel
ghstack dependencies: #140035
This patch establishes the invariant that `ClosureVariable` and
`NewCellVariable` are always in `closure_cells`, never in
`symbolic_locals`, and therefore removes some duplicated code paths.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140035
Approved by: https://github.com/jansel
Summary:
`with_comms()` is mostly used as a decorator with an optional input argument `eager_init`. The problem with a decorator that takes an input argument is that it always has to be invoked, i.e., you have to write `with_comms()` rather than `with_comms`, which is what the majority of existing usages do.
This diff tries to provide a solution such that we could use `with_comms`, `with_comms()`, `with_comms(eager_init=False)`, and `with_comms(eager_init=True)`.
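A minimal sketch of how a decorator can support both the bare and the parenthesized forms (illustrative only; `init_comms`/`destroy_comms` are hypothetical helpers, not the actual test-base methods):
```python
import functools

def with_comms(func=None, *, eager_init=False):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(self, *args, **kwargs):
            self.init_comms(eager_init=eager_init)   # hypothetical setup helper
            try:
                return fn(self, *args, **kwargs)
            finally:
                self.destroy_comms()                  # hypothetical teardown helper
        return wrapper
    # Bare usage (@with_comms) passes the test function directly as `func`.
    return decorator(func) if func is not None else decorator
```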
Test Plan: Contbuild & OSS CI
Differential Revision: D65385700
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139637
Approved by: https://github.com/wz337
Previously the split decomp would return the input when there were no splits. This errors in torch.compile (or FakeTensorMode) with:
> RuntimeError: View operation returned a tensor that is the same as the input base tensor. This is no longer allowed; you must explicitly create a new tensor (e.g., using .detach()). As a user, you could have made a mistake implementing __torch_dispatch__ or a Python operator decomposition or meta registration; if that's not the case, please report a bug to PyTorch or the backend you are using.
Fix for https://github.com/pytorch/pytorch/issues/133394
Differential Revision: [D65635070](https://our.internmc.facebook.com/intern/diff/D65635070)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140065
Approved by: https://github.com/bdhirsh
Summary:
The recent tries with the bandwidth profiler were not as expected. I have observed a few issues and tried to fix them in this diff:
1. The return of the DebugAutotuner class
2. Profiling results shows really large overhead.
DebugAutotuner.run() returns the benchmark time around 45ms while CachingAutotuner.run() returns the benchmark time around 0.45ms.
The `_find_names` and `re.match` takes 45ms: P1669186358
After we commenting out the above _find_names and re.match, the benchmark time become consistent with non-profiling mode: P1669185589
3. Introduce a variable `bandwidth_info` to control the path in DebugAutotuner.run(). During benchmarking for configuration selection, we should turn off `bandwidth_info`.
After applying this diff, the profiling issues mentioned above are fixed: P1669273172
Test Plan:
```
TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_PROFILE=1 TORCHINDUCTOR_PROFILE_OUTPUT=~/tmp/profile.txt TORCH_LOGS='+inductor,+schedule,output_code' TORCHINDUCTOR_UNIQUE_KERNEL_NAMES=1 TORCHINDUCTOR_BENCHMARK_KERNEL=1 TORCHINDUCTOR_MAX_AUTOTUNE=1 CUDA_VISIBLE_DEVICES=5 buck run mode/{opt,inplace} scripts/wwei6/triton_examples:test_mat 2>&1 | tee profiling-5.log
```
If we want to disable the Aten backend, just add TORCHINDUCTOR_MAX_AUTOTUNE_GEMM_BACKENDS="TRITON"
Differential Revision: D64883079
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139607
Approved by: https://github.com/chenyang78
Summary: Add a log warning users about how disabling only CUDA events can cause incorrect correlation IDs
Test Plan: Log was printed in the correct scenario
Differential Revision: D65762576
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140285
Approved by: https://github.com/sanrise
Remove most references to rockset:
* replace comments and docs with a generic "backend database"
* Delete `upload_to_rockset`, so we no longer need to install the package.
* Do not upload perf stats to rockset as well (we should be completely on DynamoDB now right @huydhn?)
According to VSCode, it went from 41 -> 7 instances of "rockset" in the repo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139922
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi
Before this change, if one builds PyTorch without XPU, the build process will
be perpetually regenerating code because of a reference to a non-existent
file, which makes the autograd codegened files always out of date; see part of the `ninja -d explain torch_cpu` output:
```
ninja explain: output ../torch/csrc/inductor/aoti_torch/generated/c_shim_xpu.cpp doesn't exist
ninja explain: output third_party/kineto/libkineto/CMakeFiles/libkineto_defs.bzl of phony edge with no inputs doesn't exist
ninja explain: third_party/kineto/libkineto/CMakeFiles/libkineto_defs.bzl is dirty
ninja explain: /Users/malfet/git/pytorch/pytorch/torch/csrc/autograd/generated/Functions.cpp is dirty
```
This is a regression introduced by https://github.com/pytorch/pytorch/pull/139025.
After this change, incremental rebuilds with no changes cause no build actions:
```
% ninja -j1 -v -d explain -n torch_cpu
ninja explain: output third_party/kineto/libkineto/CMakeFiles/libkineto_defs.bzl of phony edge with no inputs doesn't exist
ninja explain: third_party/kineto/libkineto/CMakeFiles/libkineto_defs.bzl is dirty
ninja: no work to do.
```
Test plan: Wait for at least one XPU build to finish...
Fixes https://github.com/pytorch/pytorch/issues/140432
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140438
Approved by: https://github.com/kit1980, https://github.com/huydhn
Previously we assumed that the number of tensor elements multiplied by the type size is not greater than the allocated memory size. However, in some scenarios such as `tensor.expand`, the stride can be zero, which makes that assumption untrue.
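A small illustration of the scenario: after `expand()`, element count times element size exceeds the actual allocation because the expanded dimension has stride 0.
```python
import torch

x = torch.zeros(1, 4)
y = x.expand(1000, 4)
print(y.stride())                    # (0, 1)
print(y.numel() * y.element_size())  # 16000 "logical" bytes
print(y.untyped_storage().nbytes())  # 16 bytes actually allocated
```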
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140255
Approved by: https://github.com/ezyang
This PR replaces the parameter names specified in the `triangular_solve_meta`
function (specifically in its `@out_wrapper(...)` decorator) by those written in the
_native_functions.yaml_ file.
This name mismatch caused the operation to fail when using the meta device (see error
below):
```python
Traceback (most recent call last):
File "examples/test.py", line 23, in <module>
torch.triangular_solve(b.to("meta"), A.to("meta"), out=meta_out)
File "torch/_decomp/__init__.py", line 100, in _fn
return f(*args, **kwargs, out=None if is_none else out_kwargs)
File "torch/_prims_common/wrappers.py", line 289, in _fn
result = fn(*args, **kwargs)
TypeError: triangular_solve_meta() got an unexpected keyword argument 'X'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140186
Approved by: https://github.com/ezyang
Differential Revision: [D65307961](https://our.internmc.facebook.com/intern/diff/D65307961/)
This PR introduces the concept of a "dispatcher" module `n` that carries multiple interpreter modules `n`, `n@1`, `n@2`, etc., each corresponding to a particular call of `n` and thus might carry a different specialized graph. We only do this when we're preserving module call signatures for `n`. The carried modules have the same number and order of calls to `n` appearing in the original module / exported program. In the unflattened module, all those calls go to the "dispatcher" module which internally tracks how many calls have been made so far and invokes the corresponding interpreter module. We reset this tracking after a successful or unsuccessful run of the unflattened module.
Overall this makes swapping easier when module call signatures are preserved.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139439
Approved by: https://github.com/tugsbayasgalan
ghstack dependencies: #139438
Differential Revision: [D65308061](https://our.internmc.facebook.com/intern/diff/D65308061/)
When a shared submodule is called multiple times with different aliases, e.g., `self.a` and `self.b` are both `C()` under the hood and we have calls to both `self.a(...)` and `self.b(...)`, we wrap `C()` to emit as many export tracepoints as there are aliases. This caused us to compute module call signatures that conflated information: we'd add inputs and outputs of one call to inputs and outputs of a different call. Overall preserving module call signatures in the presence of shared submodules was borked because of this bug.
The fix is to pay attention to the nn module stack, which accurately tracks individual calls, thus allowing us to ignore some export tracepoints that get the module correct but not the alias through which the call was made.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139438
Approved by: https://github.com/zhxchen17
This was introduced in https://github.com/pytorch/torchdynamo/commit/d0c10341
as limited support for pre-existing cells, since we know `__class__` wouldn't be modified
in most cases. It's no longer needed now that we have much more support for these cells.
Example:
```python
class Foo():
    def __init__(self):
        super().__init__()

print(Foo.__init__.__code__.co_freevars)  # ('__class__',)
print(Foo.__init__.__closure__)  # (<cell at 0x1011fb310: type object at 0x10fe185b0>,)
```
This patch also exposed and fixes a bug in
`NNModuleVariable.var_getattr`, where Dynamo wasn't propagating source
correctly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140034
Approved by: https://github.com/williamwen42, https://github.com/anijain2305, https://github.com/jansel
Summary:
UBSan hits undefined behavior in this file. This fixes it by marking these pointers as unaligned.
```
caffe2/aten/src/ATen/native/quantized/cpu/qnnpack/__ukernels_sse2__/buck-private-headers/q8gemm/4x4c2-sse2.c:325:5: runtime error: store to misaligned address 0x62900313891f for type 'uint32_t' (aka 'unsigned int'), which requires 4 byte alignment
0x62900313891f: note: pointer points here
be be be be be be be be be be be be be be be be be be be be be be be be be be be be be be be be
^
UndefinedBehaviorSanitizer: undefined-behavior buck-caffe2/aten/src/ATen/native/quantized/cpu/qnnpack/__ukernels_sse2__/buck-private-headers/q8gemm/4x4c2-sse2.c:325:5 in
```
The fix is to mark these variables as unaligned following D42179009's example
q8gemm.cc + internal integration test
Differential Revision: D65637959
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140188
Approved by: https://github.com/digantdesai
Leverage existing FindGloo CMake module to locate system's library and headers. Add system's gloo headers to include path rather than the gloo from third party when USE_SYSTEM_GLOO is specified.
Fixes #140274
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140275
Approved by: https://github.com/malfet
It was raised that the backwards layer norm on AMD was slightly off the accuracy of the equivalent NVIDIA implementation.
On AMD we call into a helper kernel `cuLoadWriteStridedInputs` which processes strided input and accumulates the partial gradients into shared memory.
In this kernel (https://github.com/pytorch/pytorch/pull/87635) we truncated `mean` and `rstd` from T_ACC type to T which causes numerical issues in the warp buffers created in this kernel. This PR will use the correct accumulator type for mean and rstd.
Note: Only AMD call into this call stack for backwards layer norm, so this was not an issue for NV.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140259
Approved by: https://github.com/jianyuh
This PR updates OpInfo-based tests for NJTs:
* Adds extensive coverage across non-contiguous NJTs (both non-contiguous transposed and non-contiguous with holes)
* The `_sample_njts()` helper that `sample_input_func`s utilize now produces non-contig NJTs as well
* Utilizes a `SampleInput`-based xfail system for granular classification of bugs. For example, it's possible to indicate that a class of ops is expected to fail only on non-contig with holes NJT inputs.
* I decided on adding `SampleInput`s and utilizing this system over using test parametrization for two reasons:
* Test perf - adding `SampleInput`s is faster than generating entire new tests
* Avoiding the possibility of `sample_input_func`s not respecting the non-contig test parameter - this would result in silently incorrect passing of these tests. Keeping the responsibility for `SampleInput` generation firmly within each `OpInfo`'s `sample_input_func` means weirdness like this isn't possible
* Improves `SampleInput` naming for a bunch of `sample_input_func`s. This makes it easier to xfail them as needed. For example, binary / unary / other ops now use the new `_describe_njt()` helper to get a string repr that uniquely defines the type of NJT being passed to the op
* Adds appropriate `XFailRule`s to get tests passing for forward / backward / forward compile / backward compile. In general, each xfail corresponds to some bug that needs to be fixed
```python
# Represents a rule indicating how to xfail a particular test. It allows granularity
# at the device, dtype, op, and individual sample levels. This flexibility allows entire
# bugs to be represented by a single rule, even if this corresponds with multiple conceptual
# test cases across multiple ops.
@dataclass
class XFailRule:
    # expected error type
    error_type: TypeVar = Exception
    # expected error message
    error_msg: str = ".*"
    # function to indicate whether the rule applies; return True if so
    match_fn: Callable[[torch.device, torch.dtype, OpInfo, SampleInput], bool] = None
    # optional name for identifying the rule
    name: str = ""

    def match(self, device, dtype, op, sample) -> bool:
        return self.match_fn(device, dtype, op, sample)
```
Example:
```python
# Bug when broadcasting a binary op with non-contiguous with holes NJT + dense
# tensor with 1 in ragged dim.
XFailRule(
    error_type=RuntimeError,
    error_msg="cannot call binary pointwise function .* with inputs of shapes",
    match_fn=lambda device, dtype, op, sample: (
        isinstance(op, BinaryUfuncInfo)
        and "noncontig_holes" in sample.name
        and "broadcasting 1 over ragged" in sample.name
    ),
    name="binary_noncontig_holes_broadcasting_1_over_ragged",
),
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138370
Approved by: https://github.com/cpuhrsch, https://github.com/soulitzer
ghstack dependencies: #140160
This PR updates the binding for `stream_write_value32` to be consistent with `memset32` which IMO makes more sense for this type of utilities:
- Changed the API to take a uint32 tensor as argument, instead of a device pointer
- Changed the Python binding to be a static method of `_SymmetricMemory`, instead of a object method
- Use the dispatcher for device dispatching, as opposed to `SymmetricMemory` backends
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139934
Approved by: https://github.com/weifengpy
ghstack dependencies: #139227
Buck1 is no longer supported in favor of buck2. This CI tests the old buck1 flow, however it is difficult to maintain especially since buck1 doesn't support aarch64 mac.
I am suggesting that this CI be deprecated until a decision on buck2 is made, and buck2 support is added. As of now, there seems to be no push towards adding buck2 support.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140067
Approved by: https://github.com/huydhn
It was wrong to add it to MPSDevice in the first place, as in the end it's just a regular shader, like all others.
I.e. this PR:
- Moves contents of `at::mps::indexing_metal_shaders` into `kernels/Indexing.metal`
- Deletes `MPSDevice::getMetalIndexingLibrary()` and `MPSDevice::metalIndexingPSO` methods
- Moves `at::native::mps::generateKernelDataOffsets` implementation from `OperationUtils.mm` to `Indexing.mm`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140271
Approved by: https://github.com/Skylion007
Publish current state of s390x builder image to allow reproducing worker setup.
Also, if this image gets published to a docker repository later, it'd be possible to download the published image instead of building it into the worker image in https://github.com/pytorch/pytorch/blob/main/.github/scripts/s390x-ci/self-hosted-builder/actions-runner.Dockerfile#L66, which should improve restart time at the cost of additional runtime overhead.
Compared to first attempt to merge:
- default docker repository settings are added to all runners. Changes are mirrored in this PR.
- job is moved into separate workflow file.
- it's no longer attempted to update limits on s390x. Limits should be properly set up there on the host. And it's not possible to update them from worker since it runs in container. Also, worker container currently doesn't have sudo installed or configured or any systemd running.
- github token is now passed once via named pipe instead of environment variable. This should increase security of tokens.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132983
Approved by: https://github.com/huydhn, https://github.com/malfet
This PR adds "PrimHOPBase", which is intended to be a base class that
one can extend to create new HOPs that match some criteria:
- they take one subgraph as input, and their semantics are running the
subgraph on some operands
- the HOP stays alive until Inductor
The motivation is that we are seeing a lot more HOPs (invoke_subgraph,
invoke_quant) that have this property and there can be a lot of shared
code between them.
Future:
- Migrate invoke_subgraph to use this
- There are some TODOs in the code
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139898
Approved by: https://github.com/anijain2305, https://github.com/ydwu4
In old triton versions, you take the hash of the triton kernel and use it in the filepath for the cached kernel. In Triton 3.2 (after https://github.com/triton-lang/triton/pull/4553), the filepath will use the base-64-encoded representation of the hash in the path.
This PR checks whether the `_base64` function exists in triton, and if so, uses the base-64-encoded representation in the path.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140190
Approved by: https://github.com/ezyang
Here's the overview:
There's a new contextmanager singleton called MetricsContext. Entering the MetricsContext is how we demarcate the boundary on which we'll create a single CompilationMetrics object, and therefore, a single dynamo_compile log entry. While we're inside the MetricsContext, we can update/set many different metrics. Most importantly: `dynamo_timed` can also update the in-progress MetricsContext. In the proposal here, we tell `dynamo_timed` that we want it to do so by providing the name of the MetricsContext field to increment. There can be many `dynamo_timed` calls in different parts of the code updating different fields. Then when the MetricsContext exits, that's when the logging of everything gathered finally happens. One potential footgun is trying to use `dynamo_timed` when we haven't entered the MetricsContext, but we assert on that problem. Another problem is that we re-enter the context recursively, but we watch for that and do the logging only when the outermost exits.
Some specifics:
* Introduce MetricsContext - a context manager that on exit, records the CompilationMetrics (which also logs to dynamo_compile).
* Completely remove the concept of frame_phase_timing. Instead, update the MetricsContext during compilation, either directly or via dynamo_timed.
* Remove some globals we previously used to accumulate counters to later populate a CompilationMetrics. We use CompilationMetrics set/update/increment APIs instead.
* `record_compilation_metrics` is now called on exit from MetricsContext.
* Populate legacy CompilationMetrics fields right before logging, inside `record_compilation_metrics`.
* Remove the one-off `add_remote_cache_time_saved` helper; capture that timing directly into the MetricsContext.
And specifically, several changes to dynamo_timed:
* "Modernize" the parameters and update all callsites accordingly.
* Move the backwards logging of the CompilationMetrics to the backwards compile location.
* Add a parameter for which CompilationMetrics field to update
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139849
Approved by: https://github.com/ezyang
ghstack dependencies: #140094
Summary:
I was looking into why the non-standard bool value will fail for msort - it makes sense for argsort and sort to fail, because we're randomly generating uint8 so the order will be different (and thus the indices will be different). But msort should work.
After some digging, it's interesting that even though scalar_t is bool, when the actual value is a uint8_t, the comparison will treat them as signed. I tried lhs=255 and rhs=0: lhs < rhs is equivalent to -1 < 0 which is true (but it's supposed to be False)
Therefore we add an explicit type cast.
Test Plan: Remove the test skip
Differential Revision: D65472170
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139870
Approved by: https://github.com/Skylion007, https://github.com/davidberard98
This PR updates the binding for `stream_write_value32` to be consistent with `memset32` which IMO makes more sense for this type of utilities:
- Changed the API to take a uint32 tensor as argument, instead of a device pointer
- Changed the Python binding to be a static method of `_SymmetricMemory`, instead of a object method
- Use the dispatcher for device dispatching, as opposed to `SymmetricMemory` backends
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139934
Approved by: https://github.com/weifengpy
ghstack dependencies: #139227
This changes the conda-builder workflow to almalinux-builder and switches Docker file to almalinux.
Please note: Published conda-builder images will still be available, hence workflows that use these images will still work.
We will be switching workflows that use conda-builder images to almalinux-builder
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140157
Approved by: https://github.com/malfet
# Motivation
This PR aims to maintain backward compatibility when building PyTorch XPU with the old and new compilers.
# Additional Context
The details are described here. The new compiler (2025.0.0) has some breaking changes compared with the old compiler (2024.1), for example:
1. On Windows, sycl library is named `sycl7.lib` in the old compiler but is named `sycl.lib` in the new compiler.
2. On Linux, in order to support ABI=0, we have to link `libsycl-preview.so` in the old compiler but we could link `libsycl.so` in the new compiler to have the same ABI compatibility.
3. We added a macro `SYCL_COMPILER_VERSION` to support our new code has good backward compatibility with the old compiler. Now the new feature(Event elapsed_time, memory summary, and device architecture property) introduced by the new compiler will be controlled within the macro `SYCL_COMPILER_VERSION`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139258
Approved by: https://github.com/EikanWang, https://github.com/atalman, https://github.com/gujinghui
[AOTI] Introduce an extensibility mechanism for the c shim codegen to make it easy to produce c shims for out-of-tree OP kernels as well. Add c shim for XPU.
### Motivation
The current c shim codegen only produces C wrappers for ops registered in `aten/src/ATen/native/native_functions.yaml`. For the same backend, some out-of-tree ops are registered externally instead, for example in `third_party/torch-xpu-ops/yaml/native_functions.yaml`. In that case, the existing codegen cannot extend the already-produced in-tree c shims with those out-of-tree ops.
### Design
To extend the c shim with more OP for a backend from out-of-tree.
The PR provided a bool option `--aoti-extend` to indicate the codegen is to extend c shim from out-of-tree.
The generated c shim is stored in the `extend` subdirectory , for example:
```
torch/include/torch/csrc/inductor/aoti_torch/generated/c_shim_xpu.h
torch/include/torch/csrc/inductor/aoti_torch/generated/c_shim_xpu.cpp
torch/include/torch/csrc/inductor/aoti_torch/generated/extend/c_shim_xpu.h
torch/include/torch/csrc/inductor/aoti_torch/generated/extend/c_shim_xpu.cpp
```
example usage:
`python -m torchgen.gen --source-path third_party/torch-xpu-ops/yaml/ --xpu --aoti-extend --update-aoti-c-shim `
`--xpu`: generate c shim for XPU
`--aoti-extend `: this is an out-of-tree OPs(defined in `third_party/torch-xpu-ops/yaml/native_functions.yaml`) extend for in-tree ops(defined in `aten/src/ATen/native/native_functions.yaml`)
`--update-aoti-c-shim`: always generate c_shim_xpu.h for the extend c_shim.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136742
Approved by: https://github.com/EikanWang, https://github.com/desertfire
ghstack dependencies: #139025
[Intel GPU] Support RegisterXPU.cpp codegen and compile for the in-tree XPU structured GEMM ops.
Motivation: There are two parts of aten ops for XPU: one is in-tree ops like GEMM-related ops, and the other is out-of-tree ops in torch-xpu-ops. For the in-tree part, since PyTorch uses native_functions.yaml registration and is equipped with convenient codegen capabilities, we want to take advantage of these benefits as well.
At the same time, since AOT Inductor also uses native_functions.yaml to generate c shim wrappers, we also need to enable this mechanism for XPU.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139025
Approved by: https://github.com/EikanWang, https://github.com/jansel, https://github.com/desertfire
Summary: Add time log to cudagraph, including `create deferred_cudagraphify wrapper`, `warmup`, `record`, and `checkpoint`.
Test Plan:
1. buck2 run fbcode//mode/opt //pytorch/benchmark:run -- resnet50 -d cuda -t train --inductor --pt2-triton-cudagraph
2. Found the result in [scuba table](https://fburl.com/scuba/pt2_compile_events/0oik8nu9).
{F1954034920}
Differential Revision: D65505659
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139818
Approved by: https://github.com/eellison
Summary:
AMD lowering duration is 1.55x longer than H100. Profiling shows hipification related functions took 22% of overall lowering time.
This diff cuts that time by safely memoizing the trie-to-regex logic. The trick is to incrementally build a state of the trie during trie construction. The state is the hash of all the words added to the trie.
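A hedged sketch of the memoization idea (illustrative names, not the real hipification code): the trie keeps an incrementally updated hash of the words added to it, which serves as the cache key for the expensive trie-to-regex conversion.
```python
import functools
import hashlib

class Trie:
    def __init__(self):
        self.words = []
        self._hasher = hashlib.sha256()

    def add(self, word):
        self.words.append(word)
        self._hasher.update(word.encode())   # state built incrementally on insert

    @property
    def state(self):
        return self._hasher.hexdigest()

@functools.lru_cache(maxsize=None)
def _regex_for_state(state, words):
    # stand-in for the real (expensive) trie-to-regex conversion
    return "|".join(sorted(words))

def trie_to_regex(trie):
    return _regex_for_state(trie.state, tuple(trie.words))
```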
Differential Revision: D65659445
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140156
Approved by: https://github.com/ColinPeppler
Co-authored-by: Kefei Lu <kefeilu@meta.com>
Fixes#126268
I've basically followed @ezyang's suggestion (I think) to use `func.decompose(...)`. Since `__torch_dispatch__` won't be called a second time for the same op, I've added a second `TorchDispatchMode` (`_DecomposedCounterMode`) that simply dispatches to the parent flop counter. Using `self` as the inner context manager is not possible, since the second call to `__enter__` would re-initialize the counter's tracking state.
Let me know if there's something wrong with this implementation, since I'm quite unsure how the decomposition thing actually works :D
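A minimal sketch of the shape of this approach, assuming a simplified counter with a hypothetical `record` hook (the real `FlopCounterMode` is more involved):
```python
import torch
from torch.utils._python_dispatch import TorchDispatchMode

class _DecomposedCounterSketch(TorchDispatchMode):
    """Inner mode: forwards every op seen during decomposition to the parent counter."""
    def __init__(self, parent):
        super().__init__()
        self.parent = parent

    def __torch_dispatch__(self, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        self.parent.record(func)
        return func(*args, **kwargs)

class SketchFlopCounter(TorchDispatchMode):
    """Outer counter: ops are decomposed under the child mode, since re-entering
    `self` would re-initialize its tracking state."""
    def __init__(self):
        super().__init__()
        self.counts = {}

    def record(self, func):
        self.counts[func] = self.counts.get(func, 0) + 1

    def __torch_dispatch__(self, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        with _DecomposedCounterSketch(self):
            out = func.decompose(*args, **kwargs)
        if out is not NotImplemented:
            return out
        self.record(func)  # no decomposition available: count the op itself
        return func(*args, **kwargs)

with SketchFlopCounter() as counter:
    torch.matmul(torch.randn(4, 4), torch.randn(4, 4))
```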
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138508
Approved by: https://github.com/ezyang
This fix was a bit more involved:
1) It fixes an item_memo loss.
2) It updates a test to be eager instead of aot_eager, since it reveals a very obscure bug related to replacements that's not worth solving; in practice Inductor will regenerate the runtime asserts anyway.
3) It updates tensorify to specialize more places now that the aforementioned bug is fixed.
Fixes `PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=6 python test/inductor/test_torchinductor_opinfo.py TestInductorOpInfoCPU.test_comprehensive_linalg_norm_cpu_float16` when `specialize_float=False`
while ensuring `python test/dynamo/test_dynamic_shapes.py DynamicShapesMiscTests.test_runtime_assert_replacement_dynamic_shapes` doesn't regress
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139587
Approved by: https://github.com/ezyang
ghstack dependencies: #139569, #139457, #139568, #139572, #139846, #139454, #139896, #139935
Currently, we get all partition ids by iterating `assignment`, whose size equals the number of nodes in the graph. But we can reach the same result by iterating `partitions_by_id`, whose size is much smaller than the number of nodes. If the number of nodes is N and the number of partitions is P, the time complexity decreases from O(N * N) to O(N * P) after this patch.
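Schematically (the data-structure shapes follow the description above, not necessarily the real partitioner code):
```python
# assignment: node -> partition id (one entry per graph node, N entries)
# partitions_by_id: partition id -> partition (P entries, with P usually << N)

def partition_ids_via_assignment(assignment):
    return set(assignment.values())        # walks N entries

def partition_ids_via_partitions(partitions_by_id):
    return set(partitions_by_id.keys())    # walks only P entries
```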
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136598
Approved by: https://github.com/ezyang
Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
This PR contains several fixes related to non-contiguous NJTs:
1. Propagates `lengths` through op calls appropriately (see desc of #138098)
* SDPA now calls `nested_view_from_values_offsets_lengths()` instead of `nested_view_from_values_offsets()`
2. Allows non-contig NJTs in unsqueeze / transpose / select
3. Expands padded dense -> NJT conversion to support non-contig NJTs
4. (unrelated sorry) Updates `split` / `split_with_sizes` to allow for optional `dim`, matching the ATen signature
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140160
Approved by: https://github.com/cpuhrsch
Fixes#136559
As we upgrade to NumPy 2, torch incorrectly filtered out `numpy.random` as unsupported in dynamo tracing.
This PR changes the filtering rules to include them while keeping behavior with numpy 1 unchanged.
Before this PR, the following tests failed:
```
PYTORCH_TEST_WITH_ASAN=1 PYTORCH_TEST_WITH_UBSAN=1 python test/dynamo/test_functions.py -k FunctionTests.test_numpy_random
PYTORCH_TEST_WITH_ASAN=1 PYTORCH_TEST_WITH_UBSAN=1 python test/dynamo/test_unspec.py -k UnspecTests.test_to_tensor
PYTORCH_TEST_WITH_ASAN=1 PYTORCH_TEST_WITH_UBSAN=1 python test/test_fake_tensor.py -k FakeTensorTest.test_export_numpy
PYTORCH_TEST_WITH_ASAN=1 PYTORCH_TEST_WITH_UBSAN=1 python test/test_fake_tensor.py -k PropagateRealTensorsFakeTensorTest.test_export_numpy_propagate_real_tensors
```
With this PR, the supported/unsupported ops in NumPy 1 are not changed.
For NumPy 2, only the `numpy.random` ops that are already supported with NumPy 1 are added to the supported list.
I used the following scripts to check the differences before and after the change for both NumPy 1 & 2.
The output is empty for NumPy 1 since there is no change.
The output is a list of `numpy.random` ops that are considered supported for NumPy 2.
```py
from torch._dynamo import trace_rules
import numpy as np

def new_numpy_function_ids():
    unsupported_funcs = {"seed", "ranf", "get_bit_generator", "RandomState", "set_bit_generator", "sample"}

    def is_supported(k, v, mod):
        if not callable(v):
            return False
        if not getattr(v, "__module__", None):
            return True
        if v.__module__ == mod.__name__:
            return True
        if v.__module__ == "numpy.random.mtrand" and mod.__name__ == "numpy.random" and k not in unsupported_funcs:
            return True
        return False

    rv = {}
    for mod in trace_rules.NP_SUPPORTED_MODULES:
        for k, v in mod.__dict__.items():
            if is_supported(k, v, mod):
                rv[id(v)] = f"{mod.__name__}.{k}"
    return rv

def old_numpy_function_ids():
    rv = {}
    for mod in trace_rules.NP_SUPPORTED_MODULES:
        rv.update(
            {
                id(v): f"{mod.__name__}.{k}"
                for k, v in mod.__dict__.items()
                if callable(v)
                and (getattr(v, "__module__", None) or mod.__name__) == mod.__name__
            }
        )
    return rv

rv1 = set(old_numpy_function_ids().values())
rv2 = set(new_numpy_function_ids().values())
for v in (rv1 - rv2):
    print(v)
print("****")
for v in (rv2 - rv1):
    print(v)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138686
Approved by: https://github.com/williamwen42
This PR introduces the following:
### torch.ops.symm_mem._async_input_mm
`_async_input_mm(Tensor a, Tensor b, Tensor a_chunk_signals, int a_chunk_pivot) -> Tensor`
An mm impl that supports consuming asynchronous input. It guarantees the following rasterization order, and that the corresponding signal arrives before an input chunk is consumed.
```
num_chunks = a_chunks_signals.numel()
for chunk_idx in range(a_chunk_pivot, num_chunks + a_chunk_pivot):
    chunk_idx = chunk_idx % num_chunks
    wait_signal(a_chunk_signals, chunk_idx)
    # Compute output tiles that consume the input chunk
```
### PersistentAsyncInputScheduler
This is a forked version of PersistentScheduler that supports consuming asynchronous input. This tile scheduler introduces the following arguments:
- `tiles_per_chunk_m` – Specifies the size of an M chunk. Chunks are the granularity at which the asynchronous input becomes ready. It must be an integer multiple of the size of an M tile.
- `chunk_signals` – `chunk_signals[i] == 1` indicates that chunk i is ready. Before returning a work tile, get_current_work() waits for the signal to ensure that the corresponding chunk is ready.
- `tile_idx_pivot_m` – After applying swizzling, apply `pivot(m) => (m + tile_idx_pivot_m) % tiles_m` to `m`. In a distributed setting, this allows different ranks to process different m indices at the same time, thus avoiding communication hotspots.
Note that this scheduler currently only supports the `KernelTmaWarpSpecializedCooperative` kernel schedule. This is enforced via the template argument `KernelSchedule`.
Usage:
```
using GemmKernel = cutlass::gemm::kernel::GemmUniversal<
    Shape<int, int, int, int>,
    CollectiveMainloop,
    CollectiveEpilogue,
    cutlass::gemm::PersistentAsyncInputScheduler<KernelSchedule>>;
```
### _fused_all_gather_matmul_native
An ag-mm impl that combines `torch.ops.symm_mem._async_input_mm` and progress-aware all-gather. This is not yet enabled via the async-tp passes. We will use it as a backend to optimize the current decomposition-based async-tp impl.
## Benchmarks
### 4096x3584x8192
- cublas + nccl: 539us
- decomp-based async-tp w/o cuda graph: 694us
- decomp-based async-tp w/ cuda graph: 478us
- new cutlass kernel: 408us
<img width="478" alt="image" src="https://github.com/user-attachments/assets/39f316ab-36c5-4b41-af77-07854a385dfc">
### 2048x3584x8192
- cublas + nccl: 301us
- decomp-based async-tp w/o cuda graph: 687us
- decomp-based async-tp w/ cuda graph: 356us
- new cutlass kernel: 276us
<img width="441" alt="image" src="https://github.com/user-attachments/assets/9e23ce21-863b-43dd-a562-fb05d3a5a144">
## Next Steps
- Add tuning logic
- Use `_fused_all_gather_matmul_native` as a backend for the decomp-based async-tp impl
Differential Revision: [D65623152](https://our.internmc.facebook.com/intern/diff/D65623152)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139227
Approved by: https://github.com/weifengpy, https://github.com/Chillee
As manylinuxaarch64-builder already comes pre-built with all versions of the Python runtime.
Refactor logic for setting path to DESIRED_PYTHON from `manywheel/build_common` into `set_desired_python.sh` and call it from aarch64_ci_setup.sh
In followup PRs move scons and ninja installation into base docker image
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140093
Approved by: https://github.com/atalman
When we have hardware support, we can use it. When we don't have hardware support, we can still do better than vec_base.h. I'm not sure to what extent we're set up to properly test both `defined(__ARM_FEATURE_BF16)` and `!defined(__ARM_FEATURE_BF16)` builds, feedback especially welcome there.
Testing: vec_test_all_types should cover correctness. For perf, seems clear that using vectorized intrinsics should be better than vec_base?
Differential Revision: [D64997747](https://our.internmc.facebook.com/intern/diff/D64997747/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139090
Approved by: https://github.com/jgong5, https://github.com/malfet
ghstack dependencies: #139084
Summary: Tighten the AOTIModelContainerRunner::run interface to take a const vector of at::Tensor, which 1) makes it clear that the runner will not modify the input tensor vector; 2) runner will be able to take a temp vector of tensors as the input.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139955
Approved by: https://github.com/chenyang78
Previously the split decomp would return the input when there were no splits. This errors in torch.compile (or FakeTensorMode) with:
> RuntimeError: View operation returned a tensor that is the same as the input base tensor. This is no longer allowed; you must explicitly create a new tensor (e.g., using .detach()). As a user, you could have made a mistake implementing __torch_dispatch__ or a Python operator decomposition or meta registration; if that's not the case, please report a bug to PyTorch or the backend you are using.
Fix for https://github.com/pytorch/pytorch/issues/133394
Differential Revision: [D65635070](https://our.internmc.facebook.com/intern/diff/D65635070)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140065
Approved by: https://github.com/bdhirsh
This PR adds support for the `restore_value` argument of the
`@triton.autotune` for the user-defined Triton kernels in PT2.
The `kernel.restore_idx` are extracted in the
`ir.UserDefinedTritonKernel` and the corresponding arg names are
placed into the `triton_meta["restore_value"]`. From there, those
are added to the existing `mutated_arg_names` in the caching autotuner
infra, which already exists and leads to the listed args being cloned.
This achieves the equivalent effect to the native `restore_value`.
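A minimal example of the kind of user-defined kernel this enables (config values and names are arbitrary; `restore_value` lists the in-place-mutated argument so autotuning benchmark runs don't corrupt it):
```python
import torch
import triton
import triton.language as tl

@triton.autotune(
    configs=[triton.Config({"BLOCK": 64}), triton.Config({"BLOCK": 128})],
    key=["n_elements"],
    restore_value=["x_ptr"],  # x_ptr is mutated in place, so restore it between benchmark runs
)
@triton.jit
def add_one_inplace(x_ptr, n_elements, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n_elements
    x = tl.load(x_ptr + offs, mask=mask)
    tl.store(x_ptr + offs, x + 1, mask=mask)

@torch.compile
def f(x):
    n = x.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK"]),)
    add_one_inplace[grid](x, n)
    return x
```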
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139851
Approved by: https://github.com/oulgen
This is a first step towards removing builds dependency to conda.
Currently we build magma as a conda package in a pytorch conda channel, implemented in a1b372dbda/magma.
This commit adapts the logic from pytorch/builder as follows:
- use pytorch/manylinux-cuda<cuda-version> as base image
- apply patches and invoke the build.sh script directly (not anymore through conda build)
- stores license and build files along with the built artifact, in an info subfolder
- create a tarball file which resembles that created by conda, without any conda-specific metadata
A new matrix workflow is added, which runs the build for each supported CUDA version and uploads the binaries to the PyTorch S3 bucket.
For the upload, define an upload.sh script, which will be used by the magma windows job as well, to upload to `s3://ossci-*` buckets.
The build runs on PR and push, upload runs in DRY_RUN mode in case of PR.
Fixes #139397
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139888
Approved by: https://github.com/atalman, https://github.com/malfet, https://github.com/seemethere
1. My company is using privateuseone to connect a new hardware device and requires the `batch_isend_irecv` function. However, `batch_isend_irecv` is currently only available for CUDA, so I add a `supports_coalescing` property to `c10d::Backend` to determine whether a backend supports coalescing.
2. If `pg._has_hooks` returns True, we don't need to determine whether the current device is CUDA. So privateuseone can also support `pg._wait_for_pending_works`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135338
Approved by: https://github.com/kwen2501
Here are the cases that Inductor does autotuning at compile time:
1. pad mm: benchmark to decide if we should pad or not
2. template autotuning: benchmark triton/cutlass templates and ATen kernel for matmul/conv and pick the fastest one.
This PR annotates these cases with `dynamo_timed`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139431
Approved by: https://github.com/ezyang
**About the PR**
In the implementation of SmoothQuant in Torchao, quantized linear is computed by `_int_mm(a, b)` + `mul(b_scale)` + `mul(a_scale)` (+ optional `add` for bias) with `reshape` and `convert_dtype` in between.
This PR adds a pass to fuse the corresponding patterns:
- (no bias) `reshape -> _int_mm -> convert_element_type -> (expand -> mul) -> mul -> reshape`
- (with bias) `pattern_no_bias -> add -> reshape -> reshape`
The patterns are replaced by `onednn.qlinear_pointwise` and `onednn.qlinear_prepack`, the latter of which is evaluated and frozen during the freezing process of Inductor. The final graph contains `onednn.qlinear_pointwise` only with packed weight constants.
Note that `onednn.qlinear_pointwise` does not support per-channel quantization of activation, which is a limitation of oneDNN library, so in that case we set activation scale to 1 and bias to none and apply scales and add bias after `onednn.qlinear_pointwise`.
**Validation results**
Accuracy/perplexity is not changed with or without this fusion pass.
Latency is improved by >10% with the fusion pass.
Test method:
- Model: EleutherAI/gpt-j-6b
- Hardware: Intel(R) Xeon(R) Platinum 8490H, running on 1 socket, 60 cores
- Using Intel OMP and Tcmalloc
- Running [the example script of SmoothQuant in Torchao](https://github.com/pytorch/ao/blob/main/torchao/prototype/smoothquant/example.py) with `TORCHINDUCTOR_FREEZING=1 numactl -N1 python example.py -m EleutherAI/gpt-j-6b --device=cpu --quant-mode=dynamic --compile`
**Test plan**
```
python test/inductor/test_mkldnn_pattern_matcher.py -k test_smooth_quant_with_int_mm
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139595
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jerryzh168
Test uses `torch.load()` for DTensor state_dict:
```
python3 test/distributed/fsdp/test_fsdp_dtensor_state_dict.py -k TestFSDPWithDeviceMeshAndDTensor
```
In this PR, we add `DTensor`-related classes to the allowed safe globals so we can still `torch.load()` a `DTensor` with `weights_only=True`. We also need this for backward compatibility, since a `DTensor` could be loaded with `torch.load()` before `weights_only` defaulted to True. Without the change, `torch.load()` on a `DTensor` would run into the following error:
```
_pickle.UnpicklingError: Weights only load failed. This file can still be loaded, to do so you have two options, do those steps only if you trust the source of the checkpoint.
(1) Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
(2) Alternatively, to load with `weights_only=True` please check the recommended steps in the following error message.
WeightsUnpickler error: Unsupported global: GLOBAL torch.distributed.tensor.DTensor was not an allowed global by default. Please use `torch.serialization.add_safe_globals([DTensor])` or the `torch.serialization.safe_globals([DTensor])` context manager to allowlist this global if you trust this class/function.
```
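For reference, the manual workaround suggested by that error message looks like this (the checkpoint path is a placeholder):
```python
import torch
from torch.distributed.tensor import DTensor

# process-wide allowlist
torch.serialization.add_safe_globals([DTensor])
state = torch.load("dtensor_checkpoint.pt", weights_only=True)

# or scoped to a single load
with torch.serialization.safe_globals([DTensor]):
    state = torch.load("dtensor_checkpoint.pt", weights_only=True)
```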
The unit test failure was not captured by CI when `weights_only` was rolled out as the default for `torch.load()`. This is due to another issue: the test communication wrapper `with_comms` lets unit tests silently pass without capturing failures, due to a recent change (https://github.com/pytorch/pytorch/pull/138108). This wrapper issue is going to be fixed by a separate PR https://github.com/pytorch/pytorch/pull/139637.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139949
Approved by: https://github.com/mikaylagawarecki
At line 205, I believe the code `x = self.activations[act](x)` should be indented so that it is in the body of the for loop. Otherwise, applying the four linear modules has the same effect as applying a single linear module, in the sense that it is still just a linear map so there is no point in having four of them. In other words, each layer of this network should have a nonlinearity.
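A hypothetical reconstruction of the corrected pattern (the module and attribute names are stand-ins for the tutorial code, not the actual file):
```python
import torch
import torch.nn as nn

class TinyMLP(nn.Module):
    # hypothetical stand-in for the model being discussed
    def __init__(self, dim=8, act="relu"):
        super().__init__()
        self.linears = nn.ModuleList(nn.Linear(dim, dim) for _ in range(4))
        self.activations = nn.ModuleDict({"relu": nn.ReLU(), "tanh": nn.Tanh()})
        self.act = act

    def forward(self, x):
        for linear in self.linears:
            x = linear(x)
            x = self.activations[self.act](x)  # nonlinearity applied inside the loop, once per layer
        return x

out = TinyMLP()(torch.randn(2, 8))
```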
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139667
Approved by: https://github.com/malfet
Reverts PR https://github.com/pytorch/pytorch/pull/137523
Reasons for the reversion:
1. NCCL profiler plugin is meant to be opened by NCCL. And the profiler's implementation is meant to be provided by a profiler. There is no evidence that `torch.distributed` is at a better position to be either an opener or a provider. (The PR to be reverted made `torch.distributed` an opener).
2. The main purpose of the reverted PR is to dlopen a dump function, with the help of an environment variable `NCCL_PROFILER_PLUGIN_FUN` that provides the symbol name, as in code below:
c19c384690/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp (L415-L427)
After some investigation, NCCL does not support env var `NCCL_PROFILER_PLUGIN_FUN`. And NCCL's profiler contract `nccl_profiler.h` does not have a function called "ncclProfilerPluginDump" defined. So this looks like a private add-on.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139847
Approved by: https://github.com/c-p-i-o
Summary: I'm refactoring dynamo_timed and updating the params. It will be much easier to do this refactor entirely in OSS. So this diff essentially provides a couple aliases in the OSS area that I can update without affecting the internal usage.
Test Plan: Ran locally and made sure I still got samples: https://fburl.com/scuba/dynamo_compile/sandbox/qub89lwj
Reviewed By: oulgen
Differential Revision: D65580302
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140016
Approved by: https://github.com/oulgen
My sales pitch: I need to ssh into the runner from time to time on my PR to debug issues, but it's well-known that LF runners don't support SSH login anymore. So, the proposed fix here is to introduce a new label called ~no-runner-determinator~ `no-runner-experiments` that can be attached to the PR. Whenever `.github/scripts/runner_determinator.py` runs on a PR and sees this label, it will not apply any logic and will just use an empty prefix.
### Testing
With the label:
```
python3 runner_determinator.py \
--github-token "MY_TOKEN" \
--github-issue "5132" \
--github-branch "install-torchao-torchtune-et" \
--github-actor "huydhn" \
--github-issue-owner "huydhn" \
--github-ref-type "branch" \
--github-repo "pytorch/pytorch" \
--eligible-experiments "" \
--pr-number "139947"
INFO : Opt-out runner determinator because #139947 has no-runner-determinator label
WARNING : No env var found for GITHUB_OUTPUT, you must be running this code locally. Falling back to the deprecated print method.
::set-output name=label-type::
```
Without the label:
```
python3 runner_determinator.py \
--github-token "MY_TOKEN" \
--github-issue "5132" \
--github-branch "install-torchao-torchtune-et" \
--github-actor "huydhn" \
--github-issue-owner "huydhn" \
--github-ref-type "branch" \
--github-repo "pytorch/pytorch" \
--eligible-experiments "" \
--pr-number "139947"
INFO : Based on rollout percentage of 95%, enabling experiment lf.
INFO : Skipping experiment 'awsa100', as it is not a default experiment
WARNING : No env var found for GITHUB_OUTPUT, you must be running this code locally. Falling back to the deprecated print method.
::set-output name=label-type::lf.
```
Running in trunk commit without a PR number will use the regular logic:
```
python3 runner_determinator.py \
--github-token "MY_TOKEN" \
--github-issue "5132" \
--github-branch "install-torchao-torchtune-et" \
--github-actor "huydhn" \
--github-issue-owner "huydhn" \
--github-ref-type "branch" \
--github-repo "pytorch/pytorch" \
--eligible-experiments "" \
--pr-number ""
INFO : Based on rollout percentage of 95%, enabling experiment lf.
INFO : Skipping experiment 'awsa100', as it is not a default experiment
WARNING : No env var found for GITHUB_OUTPUT, you must be running this code locally. Falling back to the deprecated print method.
::set-output name=label-type::lf.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140054
Approved by: https://github.com/malfet, https://github.com/ZainRizvi
Summary:
In most cases, we don't need to turn on AttrProxy tracing for two reasons:
1. It's only needed when you have one submodule owning multiple FQNs.
2. AND it will cause a model using module identity to be traced incorrectly (because we substitute module objects at tracing time).
Overall, after offline discussion with some export folks, we think it's better to turn off AttrProxy if we can make sure every submodule has a unique FQN, which tends to be the common case.
Test Plan: buck test mode/opt caffe2/test:test_export -- -r module_dict_key
Differential Revision: D65555919
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139918
Approved by: https://github.com/tugsbayasgalan
I recently added a new pattern in https://github.com/pytorch/pytorch/pull/139136 to remove pointless view/permute pairs. In that PR, I already updated the matched pattern/node count in `test_linear_binary` to account for the new pattern. But it looks like with cpp wrapper, one more pattern is matched.
```
7 patterns without cpp-wrapper:
========== pattern matched <code object pointless_view at 0x7f6d25c67aa0, file "/home/shunting/ws/pytorch/torch/_inductor/fx_passes/joint_graph.py", line 568> =======
========== pattern matched <code object pointless_view_pair at 0x7f6d25c67b50, file "/home/shunting/ws/pytorch/torch/_inductor/fx_passes/joint_graph.py", line 581> =======
========== pattern matched <code object pointless_view at 0x7f6d25c67aa0, file "/home/shunting/ws/pytorch/torch/_inductor/fx_passes/joint_graph.py", line 568> =======
========== pattern matched <code object pointless_view at 0x7f6d25c67aa0, file "/home/shunting/ws/pytorch/torch/_inductor/fx_passes/joint_graph.py", line 568> =======
========== pattern matched <code object linear at 0x7f6d176e5dc0, file "/home/shunting/ws/pytorch/torch/_inductor/fx_passes/mkldnn_fusion.py", line 1121> =======
========== pattern matched <code object reshape_linear_reshape_pattern at 0x7f6d176e5210, file "/home/shunting/ws/pytorch/torch/_inductor/fx_passes/mkldnn_fusion.py", line 732> =======
========== pattern matched <code object fn at 0x7f6d176d3ec0, file "/home/shunting/ws/pytorch/torch/_inductor/fx_passes/mkldnn_fusion.py", line 476> =======
8 patterns with cpp wrapper:
========== pattern matched <code object pointless_view at 0x7f8e78bf07c0, file "/home/shunting/ws/pytorch/torch/_inductor/fx_passes/joint_graph.py", line 568> =======
========== pattern matched <code object pointless_view_pair at 0x7f8e78bf0870, file "/home/shunting/ws/pytorch/torch/_inductor/fx_passes/joint_graph.py", line 581> =======
========== pattern matched <code object pointless_view at 0x7f8e78bf07c0, file "/home/shunting/ws/pytorch/torch/_inductor/fx_passes/joint_graph.py", line 568> =======
========== pattern matched <code object pointless_view at 0x7f8e78bf07c0, file "/home/shunting/ws/pytorch/torch/_inductor/fx_passes/joint_graph.py", line 568> =======
========== pattern matched <code object pointless_view at 0x7f8e78bf07c0, file "/home/shunting/ws/pytorch/torch/_inductor/fx_passes/joint_graph.py", line 568> =======
========== pattern matched <code object linear at 0x7f8e59c04190, file "/home/shunting/ws/pytorch/torch/_inductor/fx_passes/mkldnn_fusion.py", line 1121> =======
========== pattern matched <code object reshape_linear_reshape_pattern at 0x7f8e59dfb520, file "/home/shunting/ws/pytorch/torch/_inductor/fx_passes/mkldnn_fusion.py", line 732> =======
========== pattern matched <code object fn at 0x7f8e59dfa290, file "/home/shunting/ws/pytorch/torch/_inductor/fx_passes/mkldnn_fusion.py", line 476> =======
```
I fixed this test by adding 1 to the expected number when cpp wrapper is enabled. But fundamentally, can we avoid asserting on the total number of patterns matched in the test? That makes the test very fragile: people adding new patterns may keep breaking these 'un-related' tests. One possible improvement is to keep a counter for each specific pattern and, in the tests, check the match count for the ***RELEVANT*** patterns instead of the total number of patterns matched. That should reduce false positives for broken tests. cc possible test creator @jgong5
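A rough sketch of that suggestion (not existing Inductor infrastructure; names are hypothetical):
```python
from collections import Counter

pattern_counters: Counter = Counter()

def record_match(pattern_name: str) -> None:
    pattern_counters[pattern_name] += 1

# in a test, assert only on the patterns the test actually cares about,
# instead of the total number of matches:
record_match("reshape_linear_reshape_pattern")
record_match("pointless_view")
assert pattern_counters["reshape_linear_reshape_pattern"] == 1
```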
Fixes https://github.com/pytorch/pytorch/issues/139812 (we need this to run the disabled test on this PR)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139942
Approved by: https://github.com/huydhn, https://github.com/jgong5
Fixes #139755, #139621
The new stream pipeliner in the AMD Triton backend makes num_stages function equivalently to the NV backend. This upgrade in Triton 3.2 causes OOM issues in flex attention due to the num_stages=3 setting; we have tuned this to num_stages=1, which is the best setting for flash attention kernels and avoids the shmem issues.
We will follow up this PR with some config tuning on AMD backend.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139883
Approved by: https://github.com/bertmaher
Fixes `PYTORCH_TEST_WITH_DYNAMO=1 python test/test_torch.py TestTorchDeviceTypeCPU.test_gradient_type_promotion_cpu` when `specialize_float=False`
Reviewers might wonder why we need to have this whitelist. Can't we rely on python_arg_parser.h to do the specialization generically? Alas this path doesn't actually FFI to C++ so we do need to do the specialization in pythonland.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139935
Approved by: https://github.com/ezyang
ghstack dependencies: #139569, #139457, #139568, #139572, #139846, #139454, #139896
Instead of moving these queries to ClickHouse, we're just going to remove them since they're not really used. We do want something for test aggregates, but we can make a new script instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139915
Approved by: https://github.com/huydhn
Set PYTORCH_MIOPEN_SUGGEST_NHWC environment variable to force output layout to channels-last.
This way, the channels-last CK instances will be added to benchmark choices in max autotune
# Testing
```
pytest test/inductor/test_ck_backend.py -k conv2d
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138643
Approved by: https://github.com/chenyang78
The shell script is still referencing the builder checkout rather than PyTorch, which results in
```
python /builder/aarch64_linux/aarch64_wheel_ci_build.py --enable-mkldnn
python: can't open file '/builder/aarch64_linux/aarch64_wheel_ci_build.py': [Errno 2] No such file or directory
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140020
Approved by: https://github.com/atalman
The test was failing when I ran the whole test suite. I'm guessing that the exact indices would previously depend on the order that tests would run; by resetting the kernel_side_table we should hopefully get results that are reproducible independent of the test execution order.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139907
Approved by: https://github.com/oulgen, https://github.com/aakhundov
- Refactored traceback code into `work.printTraceback()`. cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o @shuqiangzhang
- Refactored desync debug code into `class DesyncDebugger`.
- Moved occurrences of `futureWorkResult_->markCompleted` into `checkAndSetException` and `checkTimeout`, respectively. cc @shuqiangzhang
- Modularized dump signal broadcast code into `ProcessGroupNCCL::broadcastDumpSignal`. cc @fduwjj @c-p-i-o
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139834
Approved by: https://github.com/shuqiangzhang
Based on discussion here: https://github.com/pytorch/pytorch/pull/138731
Introducing the ability for a subclass to implement type conversion to expected_type.
```
def __coerce_same_metadata_as_tangent__(
    self, expected_metadata: Any, expected_type: Optional[Type] = None
):
```
Here `expected_type=None` means the subclass type itself is expected.
E.g. for `DTensor` we may find an `AsyncCollectiveTensor` tangent where a `Tensor` was expected - in this case
the method will be called with `expected_type=Tensor` at runtime.
Adds an implementation to AsyncCollectiveTensor that just triggers `wait()`.
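A minimal sketch of such an implementation on a hypothetical async tensor subclass (`wait()` is assumed to resolve the pending collective and return a plain `Tensor`; the real code lives on `AsyncCollectiveTensor`):
```python
from typing import Any, Optional, Type

import torch

class _AsyncTensorSketch(torch.Tensor):
    def wait(self) -> torch.Tensor:
        ...  # assumed: resolves the pending collective and returns a plain Tensor

    def __coerce_same_metadata_as_tangent__(
        self, expected_metadata: Any, expected_type: Optional[Type] = None
    ):
        if expected_type is torch.Tensor:
            return self.wait()  # a plain Tensor tangent was expected
        return self  # hypothetical: nothing to coerce otherwise
```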
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139095
Approved by: https://github.com/bdhirsh
@frost-intel discovered that some Inductor auto-tuning UTs for CPU are currently broken on machines supporting AMX ISA. That's because in #136688, I had reverted a change in the AMX GEMM micro-kernel that was introduced in #131887, but it looks like some other implementations introduced after the aforementioned change rely upon it, so it should not have been reverted.
Added a fix.
Ideally, a CI machine that supports AMX should cover these UTs (test/inductor/test_cpu_select_algorithm.py). We do have at least one CI machine that supports AMX.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139906
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
Support stream: when the driver communicates with the executor, it sends the stream id corresponding to the execution command; when the executor receives a command with a stream id, it ignores the stream id because the CPU backend doesn't support asynchronous execution.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136991
Approved by: https://github.com/ezyang
There are 4 parts (they are hard to further break into smaller ones cause they're highly coupled) in this PR:
1. **Whenever we call create_graph_input, we try to bind the symbols in the graph input.**
We've enforced the invariant that all create_graph_inputs calls must provide an example value, we could intercept at the create_graph_input calls (This PR only handles free symbols in tensors).
2. **We cache the bound_symbols** to avoid lifting the same symbol repeatedly.
3. For lifted symbols, we reuse **lifted_freevars**, i.e. the mapping between a symbol's proxy in the parent graph and the lifted placeholders in the current subgraph, which is how we handle lifted tensors. In this way, all HOPs that support lifted tensors should be able to handle lifted symints automatically (at least in the dynamo part).
4. For **unbacked symbols** created during tracing, we need to also bind these symbols to their proxies. This is to support the test cases where we want to lift unbacked symbols as input. We need the proxy of the unbacked symbol in the parent graph in order to properly create the args to the hop.
5. We change all the tests after free symbols are lifted in subgraphs, and also support the lifted symbols in existing higher-order ops.
**The interaction of nested tracers:**
The previous design for lifting tensor closures is: suppose we're in nested tracers; whenever we see a new proxy that's not created by the current tracer, we recursively look for the proxy in the parent tracer until we find the tracer that creates this proxy (either a placeholder or some intermediate result). More detail is in Note [Nested SubgraphTracer and free_variable handling].
Given the above design, the plan for lifting the free symbols is: whenever we lift a free tensor to be the inputs of current subgraph, we'll look at the symbols in it and bind the symbols at the same time.
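Schematic pseudocode for that plan (`bound_symbols`, `maybe_bind_symbol`, and the tracer attributes follow the description above, not the exact dynamo internals):
```python
def lift_tensor_into_subgraph(tracer, proxy_of_x, example_tensor):
    ph = tracer.create_graph_input("x", example_value=example_tensor)
    tracer.lifted_freevars[proxy_of_x] = ph
    # whenever a tensor is lifted, also bind the free symbols in its shape
    for size in example_tensor.shape:
        maybe_bind_symbol(tracer, size)
    return ph

def maybe_bind_symbol(tracer, size):
    if isinstance(size, int):
        return                                   # static dim, nothing to bind
    expr = size.node.expr
    if expr in tracer.bound_symbols:
        return                                   # already an input of this subgraph
    if tracer.parent is not None:
        maybe_bind_symbol(tracer.parent, size)   # make sure the parent exposes it first
    tracer.bound_symbols[expr] = tracer.create_graph_input(str(expr), example_value=size)
```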
For example, suppose we have the following function:
```python
def f(x: [s1, s2]):
    def true_f():
        def true_f_inner():
            return x.sin()
```
what will happen in time order:
1. we create a subtracer 1 and start to speculate the outer cond's true_f
2. we create a another subtracer 2 and start to speculate the inner cond's true_f_inner.
3. dynamo realize the tensor input x by calling wrap_tensor in top-level to create graph input x (tracer 0), we bind the symbol s1, s2 after ph for x is created. So the graph now looks like:
```python
def gm(s1, s2, x):
```
4. when seeing TensorVariable.call_method on x, tracer 2 wants to create call_function(sin, proxy_of_x), but it finds that proxy_of_x is not created by the current tracer. So it recursively looks up its parent tracer 1, finds that tracer 1 also doesn't track this proxy_of_x, and then finds the root tracer 0, which is the creator and tracks it as a placeholder. Then tracer 1 calls create_graph_input to lift the closure to its input ph1 and adds the (proxy_of_x: ph1) k-v pair to **lifted_freevars** of tracer 1.
Now the graph looks like:
```python
def gm(s1, s2, x):
    def true_gm(x):
```
5. Since there are free symbols inside this new tensor input, tracer 1 also binds the symbols (maybe_bind_symbol), which calls create_graph_input for s1 and s2. Now the graph looks like
```python
def gm(s1, s2, x):
    def true_gm(s1, s2, x):
```
6. then it goes back to tracer 2, which calls create_graph_input for x and gets ph2; tracer 2's **lifted_freevars** records (ph1, ph2), and tracer 2 also binds the symbols in this new tensor input. Now the graph looks like:
```python
def gm(s1, s2, x):
    def true_gm(s1, s2, x):
        def true_gm_inner(s1, s2, x):
```
7. Finally the sin call_function node is created by tracer 2.
**This PR also handles the following cases:**
- What if we lift two tensors that share the same symbol? e.g. x1 [s1, s2], x2 [s2, s3]? Each subtracer maintains bound_symbols as a cache that maps a symbol's expr to its proxy in the current tracer. So when we see x1, we'll track s1 and s2 as inputs and bind s1 to ph1 and s2 to ph2. When we then try to bind the symbols of x2, s2 is already tracked, so no graph input is created.
- What if a subgraph closes over a symint? e.g.
```python
def f(x):
    def true_f():
        c = x.size(0)
        def true_fn_inner():
            return c
```
When we speculate true_fn_inner, we find proxy_of_c is not tracked by tracer 2, so it recursively looks up its parent. At this point, x and its symbols have been lifted as inputs of true_f (as a result of lifting x while tracing true_f in tracer 1). Specifically the graph looks like:
```python
def gm(s1, s2, x):
    def true_gm(s1, s2, x):
        def true_gm_inner():
```
So tracer 2 is able to find that s1 has been tracked as a placeholder in tracer 1, so it returns back to gm and calls create_graph_input on s1. The graph now looks like:
```python
def gm(s1, s2, x):
    def true_gm(s1, s2, x):
        def true_gm_inner(s1):
            return s1
```
- What if a subgraph closes over an unbacked symint? e.g.
```python
def f(x):
    def true_f():
        c = x.item()
        def true_f_inner():
            return c
```
When x.item() is called, proxy_of_c and its symnode variable are created for tracer 1, and we also call track_unbacked_symbols to record this relationship. So when tracer 2 finds proxy_of_c is not created by the current tracer, it recursively looks up its parent tracer and finds that the expression u0 has been tracked as a result of track_unbacked_symbol in tracer 1. So it stops the recursion and calls create_graph_input for u0 in tracer 2. The graph looks like:
```python
def f(x):
    def true_f(s1, s2, x):
        c = x.item()
        def true_gm_inner(u0):
            return u0
        cond(pred, true_gm_inner, false_gm_inner, (c,))
```
- What if a subgraph closes over a tensor with an unbacked symint shape?
```python
def f(x):
    def true_f():
        c = x.item()
        r = torch.randn((c,))
        def true_f_inner():
            return r + 1
```
This is the same as the case of closing over tensors with backed shapes, where we first lift r, then bind u0 in it, which recursively calls bind_symint on u0 in its parent and finds that u0 is tracked in the parent tracer as a result of the .item() call.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138363
Approved by: https://github.com/zou3519
This PR introduces the following:
### torch.ops.symm_mem._async_input_mm
`_async_input_mm(Tensor a, Tensor b, Tensor a_chunk_signals, int a_chunk_pivot) -> Tensor`
An mm impl that supports consuming asynchronous input. It guarantees the following rasterization order, and that the corresponding signal arrives before an input chunk is consumed.
```
num_chunks = a_chunks_signals.numel()
for chunk_idx in range(a_chunk_pivot, num_chunks + a_chunk_pivot):
    chunk_idx = chunk_idx % num_chunks
    wait_signal(a_chunk_signals, chunk_idx)
    # Compute output tiles that consume the input chunk
```
### PersistentAsyncInputScheduler
This is a forked version of PersistentScheduler that supports consuming asynchronous input. This tile scheduler introduces the following arguments:
- `tiles_per_chunk_m` – Specifies the size of an M chunk. Chunks are the granularity at which the asynchronous input becomes ready. It must be an integer multiple of the size of an M tile.
- `chunk_signals` – `chunk_signals[i] == 1` indicates that chunk i is ready. Before returning a work tile, get_current_work() waits for the signal to ensure that the corresponding chunk is ready.
- `tile_idx_pivot_m` – After applying swizzling, apply `pivot(m) => (m + tile_idx_pivot_m) % tiles_m` to `m`. In a distributed setting, this allows different ranks to process different m indices at the same time, thus avoiding communication hotspots.
Note that this scheduler currently only supports the `KernelTmaWarpSpecializedCooperative` kernel schedule. This is enforced via the template argument `KernelSchedule`.
Usage:
```
using GemmKernel = cutlass::gemm::kernel::GemmUniversal<
    Shape<int, int, int, int>,
    CollectiveMainloop,
    CollectiveEpilogue,
    cutlass::gemm::PersistentAsyncInputScheduler<KernelSchedule>>;
```
### _fused_all_gather_matmul_native
An ag-mm impl that combines `torch.ops.symm_mem._async_input_mm` and progress-aware all-gather. This is not yet enabled via the async-tp passes. We will use it as a backend to optimize the current decomposition-based async-tp impl.
## Benchmarks
### 4096x3584x8192
- cublas + nccl: 539us
- decomp-based async-tp w/o cuda graph: 694us
- decomp-based async-tp w/ cuda graph: 478us
- new cutlass kernel: 408us
<img width="478" alt="image" src="https://github.com/user-attachments/assets/39f316ab-36c5-4b41-af77-07854a385dfc">
### 2048x3584x8192
- cublas + nccl: 301us
- decomp-based async-tp w/o cuda graph: 687us
- decomp-based async-tp w/ cuda graph: 356us
- new cutlass kernel: 276us
<img width="441" alt="image" src="https://github.com/user-attachments/assets/9e23ce21-863b-43dd-a562-fb05d3a5a144">
## Next Steps
- Add tuning logic
- Use `_fused_all_gather_matmul_native` as a backend for the decomp-based async-tp impl
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139227
Approved by: https://github.com/weifengpy, https://github.com/Chillee
This patch
1. Adds documentation to `PyCodegen.__call__`, `PyCodegen.tempvars` and
the `allow_cache` flag.
2. Merges a few existing code paths in `PyCodegen.__call__`.
3. removes the `elif var in cg.tempvars` code path in
`codegen_save_tempvars`, because it's no longer needed after #113725,
as we have up-to-date `VariableTracker.source` now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139670
Approved by: https://github.com/jansel
ghstack dependencies: #139538
This effectively undoes #115095, which is no longer needed after #113725.
Why did we need #115095? I went back in history and found that [this line](https://github.com/pytorch/pytorch/pull/113725/files#diff-0bb1756725c4426408938314b0c9d3988ae5bf49994892d7038ad7746e209e9fR86)
actually fixed what #115095 fixed. Specifically, without the
`allow_cache` check for the "dup_top" optimization, we could incorrectly
codegen based on source, despite `codegen_update_mutated` requested to
codegen from value, for updates to pre-existing lists, etc. Since #113725 added
the `allow_cache` check, we no longer need the `mutable_side_effects_from_source`
code path from #115095.
However, #115442 introduced a `value_from_source` flag which didn't
account for the `mutable_side_effects_from_source` branch. So this patch
adds an extra check to keep existing behavior for export, and leaves a
TODO for investigating what exactly export wants from codegen, when it
comes to side effects and sources.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139538
Approved by: https://github.com/jansel
Summary:
Move flight recorder logger class out from utils.py into its own file.
This makes the program more modular.
This is mostly a refactoring/non-functional change.
Test Plan:
Build fr_trace locally and ran it.
```
buck build //caffe2/fb/flight_recorder:fr_trace
Buck UI: https://www.internalfb.com/buck2/875ca6a3-e86e-4263-95a0-579502494c5c
Network: Up: 0B Down: 0B
Jobs completed: 6818. Time elapsed: 0.2s.
BUILD SUCCEEDED
```
Ran it as follows:
```
cd buck-out/v2/gen/fbcode/caffe2/fb/flight_recorder
./fr_trace.par -p trace_ /tmp
Not all ranks joining collective 3 at entry 2
group info: 0:default_pg
collective: nccl:all_reduce
missing ranks: {1}
input sizes: [[4, 5]]
output sizes: [[4, 5]]
expected ranks: 2
collective state: scheduled
collective stack trace:
<module> at /home/cpio/test/c.py:66
```
Differential Revision: D65503768
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139806
Approved by: https://github.com/fduwjj
Fixes #138550.
### Description
In the fusion of two nodes, the node with fewer variables (`node_to_recomp`) would align its variable ranges with the other node (`ref_node`). In detail, `node_to_recomp` would change its variable ranges to the original ranges of `ref_node`. However, if both nodes have changed their ranges, i.e. the simplified variable ranges are different from the original ones, the issue comes up.
### Solution
For the case where the `ref_node` also changes its variable ranges, we recompute the size and body for it, to ensure the nodes are simplified to the same size.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138568
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
Fixes #111824
Currently, if the user specifies their group normalization input to be in NHWC format, PyTorch will default to NCHW tensors and convert. This conversion is not immediately obvious to the user unless they check the format themselves, which is not intuitive. This PR adds support for NHWC on CUDA by adding the necessary kernels.
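For illustration, this is the layout the PR targets (requires a CUDA device; whether the output stays channels-last is exactly what the added kernels enable):
```python
import torch

x = torch.randn(8, 32, 64, 64, device="cuda").to(memory_format=torch.channels_last)
gn = torch.nn.GroupNorm(num_groups=8, num_channels=32).cuda()
out = gn(x)
# with dedicated NHWC kernels, the output can stay channels-last instead of
# being silently computed through an NCHW round trip
print(out.is_contiguous(memory_format=torch.channels_last))
```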
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126635
Approved by: https://github.com/eqy, https://github.com/mikaylagawarecki
Summary:
This diff reverts D65290089
This change is introducing more logging than I realized and could present problems for tlparse.
Test Plan: NA
Reviewed By: jamesjwu
Differential Revision: D65541060
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139893
Approved by: https://github.com/jamesjwu
Follow up to some issues @malfet's recent PR pointed out about missing ops #139763. Tried to mirror it to other important nearby ops. Seems like we could automate / autogen this more for generic pointwise ops like this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139890
Approved by: https://github.com/malfet
This allows Configs to handle setting their defaults (or overriding
themselves) via environment variables.
The environment variables are resolved at install time (which is usually
import time). This is done 1) to avoid any race conditions between threads etc., and 2) to encourage people to modify the configs directly rather than overriding environment variables to change PyTorch behaviour.
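A minimal illustration of the idea only, not the actual torch Config API (the flag name is hypothetical):
```python
import os

def _env_default(name: str, default: bool) -> bool:
    # resolved once, at import/install time, so later os.environ edits don't flip the setting mid-run
    val = os.environ.get(name)
    if val is None:
        return default
    return val not in ("0", "false", "False")

ENABLE_FOO = _env_default("TORCHINDUCTOR_ENABLE_FOO", False)  # hypothetical flag name
```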
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138956
Approved by: https://github.com/ezyang
ghstack dependencies: #138766
Fixes#137512
Relaxes the restriction that the ragged dim is immediately next to the batch dim e.g. `(B, *, D_0, ..., D_N)`. This allows for constructing NJTs of shape e.g. `(B, D, j0)` directly. It's possible before this PR to get an NJT of e.g. shape `(B, D, j0)` by constructing an NJT of shape `(B, j0, D)` and transposing it. This PR allows a user to go straight there without the transpose. The standard `torch.nested.nested_tensor(list)` constructor has been updated to support this.
At the very least, this is useful for testing on transposed NJTs. I'm willing to make this functionality private if needed.
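A hedged example of the direct construction described above (assuming the updated constructor behavior; the exact reported ragged-dim name may differ):
```python
import torch

# components share the leading D=4 dim and vary only in the last dim
components = [torch.randn(4, 3), torch.randn(4, 5), torch.randn(4, 2)]
nt = torch.nested.nested_tensor(components, layout=torch.jagged)
print(nt.shape)  # expected to report something like (3, 4, j1), with the ragged dim last
```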
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137125
Approved by: https://github.com/cpuhrsch, https://github.com/soulitzer
Add a new documentation to show one memory usage benefit brought by TorchDynamo-based ONNX exporter.
Also add a unit test to make sure TorchDynamo-based ONNX exporter works well under FakeTensorMode.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139388
Approved by: https://github.com/xadupre
Per our discussion in https://fburl.com/gdoc/voce5o06, we will run slow jobs more frequently on all trunk commits. Note that slowgradcheck jobs are moved to periodic as they are not about running slow tests.
There are currently 3 GPU + 2 ROCm + some CPU `linux.4xlarge` runners running slow jobs. So, I don't expect to see a big increase in CI cost after this.
Also, these slow jobs will only run in trunk commits, not in PRs, so their duration won't affect PR TTS.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139842
Approved by: https://github.com/clee2000
Summary: When we use aoti_compile_and_package to package the AOTI compiled artifacts, cubin files will be included, and at deploy time we should set up the cubin file directory to the right path that contains the unzipped cubin files.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139848
Approved by: https://github.com/aakhundov
As MacOS-15 or newer supports those out of the box. This significantly reduces memory requirements and improves performance for some stable diffusion networks.
Test plan: Run
```python
from diffusers import StableDiffusionXLPipeline, AutoencoderKL, EulerAncestralDiscreteScheduler
import torch
import time

vae = AutoencoderKL.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0",
                                    subfolder='vae',
                                    torch_dtype=torch.bfloat16,
                                    force_upcast=False).to('mps')
pipe = StableDiffusionXLPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", vae=vae,
                                                 torch_dtype=torch.bfloat16, variant="fp16").to('mps')
pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)
start_time = time.time()
start_mps_mem = torch.mps.driver_allocated_memory()
image = pipe(prompt="Spherical cow in vacuum",
             num_inference_steps=10,
             guidance_scale=8,
             generator=torch.Generator("mps").manual_seed(42),
             ).images[0]
end_mps_mem = torch.mps.driver_allocated_memory()
run_time = time.time() - start_time
print(f"run time in {run_time:.2f} sec, end_mps_mem {end_mps_mem/1024.0**2:.2f} Mb mem increase {(end_mps_mem-start_mps_mem)/1024.0**2:.2f} Mb")
image.save('bfloat16.png')
```
Before the change, total memory use was 16 GB and the run needed 65 sec to complete; after it, memory drops to 14 GB and the run takes 50 sec to finish on an M2 Pro, though the generated image remains the same:

Fixes https://github.com/pytorch/pytorch/issues/139389
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139791
Approved by: https://github.com/drisspg, https://github.com/Skylion007
ghstack dependencies: #139788, #139784, #139763
fsspec transactions do not support concurrency and assume that there is at most 1 running transaction per filesystem. This is *not* true in our usage, where because of multi-threading we usually have multiple concurrent transactions running at once.
Previously, this would just (unsafely) pass but lead to hard-to-debug race conditions (since the commit of one transaction will blow away the state of the other transaction). In fsspec 2024.3.0, trying to commit concurrent transactions will actually crash (see the code at 76ca4a6888/fsspec/transaction.py (L39) -- because each filesystem can have a single transaction, this tear-down logic will error).
Instead, let's manually handle committing / discarding changes to the file.
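A rough sketch of that manual pattern (paths are placeholders; this is not the exact checkpoint-writer code):
```python
import fsspec

fs, path = fsspec.core.url_to_fs("file:///tmp/ckpt_shard0.pt")
tmp_path = path + ".tmp"
try:
    with fs.open(tmp_path, "wb") as f:
        f.write(b"payload")
    fs.mv(tmp_path, path)      # "commit"
except Exception:
    if fs.exists(tmp_path):
        fs.rm(tmp_path)        # "discard"
    raise
```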
I don't have a minimal test-case, but in Meta this solves a broken test on `fsspec >= 2024.3.0`:
Before: https://www.internalfb.com/intern/testinfra/testrun/7318349626774607
After: https://www.internalfb.com/intern/testinfra/testrun/2251800062722633
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135541
Approved by: https://github.com/Skylion007
**About the PR**
In the implementation of SmoothQuant in Torchao, quantized linear is computed by `_int_mm(a, b)` + `mul(b_scale)` + `mul(a_scale)` (+ optional `add` for bias) with `reshape` and `convert_dtype` in between.
This PR adds a pass to fuse the corresponding patterns:
- (no bias) `reshape -> _int_mm -> convert_element_type -> (expand -> mul) -> mul -> reshape`
- (with bias) `pattern_no_bias -> add -> reshape -> reshape`
The patterns are replaced by `onednn.qlinear_pointwise` and `onednn.qlinear_prepack`, the latter of which is evaluated and frozen during the freezing process of Inductor. The final graph contains `onednn.qlinear_pointwise` only with packed weight constants.
Note that `onednn.qlinear_pointwise` does not support per-channel quantization of activation, which is a limitation of oneDNN library, so in that case we set activation scale to 1 and bias to none and apply scales and add bias after `onednn.qlinear_pointwise`.
**Validation results**
Accuracy/perplexity is not changed with or without this fusion pass.
Latency is improved by >10% with the fusion pass.
Test method:
- Model: EleutherAI/gpt-j-6b
- Hardware: Intel(R) Xeon(R) Platinum 8490H, running on 1 socket, 60 cores
- Using Intel OMP and Tcmalloc
- Running [the example script of SmoothQuant in Torchao](https://github.com/pytorch/ao/blob/main/torchao/prototype/smoothquant/example.py) with `TORCHINDUCTOR_FREEZING=1 numactl -N1 python example.py -m EleutherAI/gpt-j-6b --device=cpu --quant-mode=dynamic --compile`
**Test plan**
```
python test/inductor/test_mkldnn_pattern_matcher.py -k test_smooth_quant_with_int_mm
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139595
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jerryzh168
The only difference between `convert_boolean_attn_mask_cudnn` and `convert_boolean_attn_mask` is the value we initialize the boolean tensor to.
Reduce duplication by introducing `convert_boolean_attn_mask_` that takes a `neg_inf` value, and make the abovementioned implementations trivial one-line calls.
Also, as suggested by @Skylion007, replace `at::where(foo->logical_not, -inf, 0)` with `at::where(*foo, 0, -inf)`
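A Python rendering of the shared helper described above (the real implementation is the C++ helper in ATen; `neg_inf` is whatever value the calling backend wants):
```python
import torch

def convert_boolean_attn_mask_(attn_mask, dtype, neg_inf):
    if attn_mask is None or attn_mask.dtype != torch.bool:
        return attn_mask
    zero = torch.zeros((), dtype=dtype, device=attn_mask.device)
    filled = torch.full((), neg_inf, dtype=dtype, device=attn_mask.device)
    return torch.where(attn_mask, zero, filled)  # allowed positions -> 0, masked -> neg_inf

mask = torch.tensor([[True, False], [True, True]])
print(convert_boolean_attn_mask_(mask, torch.float32, float("-inf")))
```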
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139784
Approved by: https://github.com/Skylion007, https://github.com/drisspg
ghstack dependencies: #139788
Summary:
save around 8% on the torchrec model.
In most cases the new implications are not optimizations anyway; in some cases they are,
but optimizing them is useless.
ex:
```
generating implications for Eq(Mod(s0, 3), 0)
adding Eq(Mod(s0, 3), 0)
adding Eq(0, Mod(s0, 3))
adding Ne(Mod(s0, 3), 0)
adding Ne(0, Mod(s0, 3))
adding Mod(s0, 3) <= 0
adding 0 < Mod(s0, 3)
adding True
adding False
```
VS
```
generating implications for Eq(Mod(s0, 3), 0)
adding Eq(Mod(s0, 3), 0)
adding Eq(0, Mod(s0, 3))
adding Ne(Mod(s0, 3), 0)
adding Ne(0, Mod(s0, 3))
adding Mod(s0, 3) <= 0
adding 0 < Mod(s0, 3)
adding 0 <= Mod(s0, 3)
adding Mod(s0, 3) < 0
```
the main difference is that 0 <= Mod(s0, 3) could be simplified to True and Mod(s0, 3) < 0 to False, but with this change
that won't happen. However, True: True and False: False entries are useless anyway, so I think that's ok.
```
buck2 run fbcode//mode/opt fbcode//torchrec/distributed/tests:pt2_compile_benchmark -- --num-features=1000
```
<img width="1082" alt="Screenshot 2024-11-04 at 9 25 51 PM" src="https://github.com/user-attachments/assets/a26e291b-9280-4b55-9275-f3201a36ac51">
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139738
Approved by: https://github.com/ezyang
ghstack dependencies: #139703
Assuming the forward pass user code looks like:
```
for _ in range(2):
    x = layer(x)
```
and we have `fully_shard(layer)`, then:
- the forward pass will be like: "unshard layer -> call layer 1st time -> reshard layer -> unshard layer -> call layer 2nd time-> reshard layer" (currently same for both eager and compile)
- the backward pass will be like: "unshard layer -> call layer 1st time -> reshard layer -> unshard layer -> call layer 2nd time-> reshard layer" in eager, but currently it's "unshard layer -> call layer 1st time -> call layer 2nd time -> reshard layer" in compile
The behavior in the backward pass is different between eager and compile, which is not ideal.
I am currently trying to look for a way to fix this non-ideal behavior of compile - tried a few things:
1. Tracing the RegisterPostBackwardFunction custom autograd function - this stills seems to be a no-go, due to HOP not supporting side-effects.
2. Instead of custom autograd function, do a "multi-grad hook" to wait for all gradients to be ready before triggering post_backward. However, this approach seems to have bad interaction with register_hook of pre_backward, in the sense that it's unclear which of them will be triggered first in practice.
3. Force execute any pending post_backward before unshard in pre_backward hook, and rely on compiler to move the reshard to the right place to optimize peak memory. -> This PR
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139671
Approved by: https://github.com/awgu
- Remove "mypy: allow-untyped-defs" and mark functions individually with "no-untyped-def"
- Mark some trivial functions with the proper return types (`None` and `torch.dtype`)
- Fixed a type bug in the signature of supported_dtype_of_cpp_wrapper()
- `ruff check torch/_inductor/ir.py --select ANN --fix --unsafe-fixes` and then fixed up things that looked incorrectly applied.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139238
Approved by: https://github.com/Skylion007, https://github.com/ezyang
**Summary**
In the case of LLaMA2, a linear operation with an activation size of `(4, 1, 4096)` and a stride of `(4096, 128, 1)` is decomposed into `matmul`, and the decomposition of `matmul` results in `bmm` due to a strict contiguity check. We can align the contiguity check with ATen by skipping dims of size 1, enabling decomposition into `mm` instead.
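A sketch of the relaxed check (helper names are hypothetical): the strict version requires every dim's stride to match the contiguous layout, so the activation below fails it, while skipping size-1 dims, as ATen does, accepts it:
```python
import torch

def is_contiguous_strict(t: torch.Tensor) -> bool:
    expected = 1
    for s, st in zip(reversed(t.shape), reversed(t.stride())):
        if st != expected:
            return False
        expected *= s
    return True

def is_contiguous_skip_size1(t: torch.Tensor) -> bool:
    expected = 1
    for s, st in zip(reversed(t.shape), reversed(t.stride())):
        if s == 1:
            continue  # the stride of a size-1 dim is irrelevant
        if st != expected:
            return False
        expected *= s
    return True

x = torch.empty_strided((4, 1, 4096), (4096, 128, 1))
print(is_contiguous_strict(x), is_contiguous_skip_size1(x))  # False True
```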
**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_mkldnn_pattern_matcher.py -k test_linear_input_non_contiguous_3D_wo_bias
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139172
Approved by: https://github.com/jgong5, https://github.com/ezyang
This is a bug on main exposed by https://github.com/pytorch/pytorch/issues/139476
We have a dict tag optimization where, if the dict tag does not change, we
skip guards on all the items of the dict that are "immutable". We
considered tensors immutable in such scenarios. This is critical for
guard eval performance, because users generally don't change their
parameters.
If I try to remove this optimization, we see slowdowns, e.g. 3.03x to
2.95x on the conv_mixer TIMM benchmark.
So, I am adding a flag which keeps the current state but allows
users to remove this optimization. Not ideal, but given how important guard eval perf is,
we are in the gray area of the unsoundness vs performance tradeoff.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139560
Approved by: https://github.com/jansel
This PR adds some instructions for how to add a TARGETS file to run the
fx_graph_runnable script. I'm planning to add some followups that will
add additional imports for custom ops and use autodeps to get the
dependencies, but I figure this PR is an easy first step.
Test Plan:
- pytest test/dynamo/test_structured_trace.py
- Does anyone have suggestions for how to test this?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139481
Approved by: https://github.com/eellison
This patch adds 2 simple methods `VariableTracker.is_mutable()` and
`VariableTracker.is_immutable()`, which helps clarify intention. For
instance, rather than writing
```python
if var.mutation_type:
    ...
```
After this patch one can write
```python
if var.is_mutable():
    ...
```
This patch also simplifies `mutation_type` propagation in some
`ListVariable` methods.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139341
Approved by: https://github.com/mlazos, https://github.com/anijain2305
ghstack dependencies: #139339, #139340
This patch addresses the renaming part of #133027, specifically, it
renames the following and adds documentation for relevant classes.
1. `VariableTracker.mutable_local` to `mutation_type`
2. `MutableLocal` to `ValueMutationNew`
3. `MutableSideEffects` to `ValueMutationExisting`
4. `MutableLocalSource` to `SourceType`
5. `MutableLocalSource.Local` to `New`
Note that (2), (3) and (5) are mainly to bring consistency between them
and `AttributeMutationNew`, `AttributeMutationExisting`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139339
Approved by: https://github.com/jansel, https://github.com/mlazos, https://github.com/anijain2305
### Motivation
Today, watchdog only reports that it found a collective timeout:
```
[rank1]:[E1104 14:02:18.767594328 ProcessGroupNCCL.cpp:688] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=200, NumelOut=200, Timeout(ms)=5000) ran for 5096 milliseconds before timing out.
```
While this is nice, it is hard to associate the error with user's program or library stack.
### This PR
This PR gives watchdog the ability to report the call-time stack of the collective, so that it would be easier to track the error back to the program's behavior.
The call-time stack was recorded by Flight Recorder with minimal overhead (for details, please read this [doc](https://dev-discuss.pytorch.org/t/fast-combined-c-python-torchscript-inductor-tracebacks/1158) written by @zdevito ). In `ProcessGroupNCCL`, we are only tracking / reporting the python part so that it fits most PyTorch users.
### Demo
[stack_demo.py](https://gist.github.com/kwen2501/6758e18d305d67fc6f3f926217825c09).
```
TORCH_NCCL_TRACE_BUFFER_SIZE=100 torchrun --nproc-per-node 2 stack_demo.py
```
`TORCH_NCCL_TRACE_BUFFER_SIZE` is for turning on the Flight Recorder.
Output:
```
[rank0]:[E1104 14:19:27.591610653 ProcessGroupNCCL.cpp:695] Stack trace of the timedout collective operation:
#0 all_reduce from /data/users/kw2501/pytorch/torch/distributed/distributed_c10d.py:2696
#1 wrapper from /data/users/kw2501/pytorch/torch/distributed/c10d_logger.py:83
#2 bar from /data/users/kw2501/sync_async/repro.py:15
#3 foo from /data/users/kw2501/sync_async/repro.py:24
#4 main from /data/users/kw2501/sync_async/repro.py:34
#5 <module> from /data/users/kw2501/sync_async/repro.py:40
[rank1]:[E1104 14:19:27.771430164 ProcessGroupNCCL.cpp:695] Stack trace of the timedout collective operation:
#0 all_gather_into_tensor from /data/users/kw2501/pytorch/torch/distributed/distributed_c10d.py:3630
#1 wrapper from /data/users/kw2501/pytorch/torch/distributed/c10d_logger.py:83
#2 baz from /data/users/kw2501/sync_async/repro.py:20
#3 foo from /data/users/kw2501/sync_async/repro.py:26
#4 main from /data/users/kw2501/sync_async/repro.py:34
#5 <module> from /data/users/kw2501/sync_async/repro.py:40
```
From the log above, we can tell that `bar()` and `baz()` are the places where the two ranks diverge.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139659
Approved by: https://github.com/wconstab, https://github.com/fduwjj
## This Stack
This stack does the following things to support `xformers`-style, comm-aware Triton kernels:
- Exposes `signal_pad`s as tensors in Python
- Adds a binding for `cuMemsetAsync`
These in combination aim to provide users with more flexibility to express custom signaling/synchronization patterns.
## This PR
Make `cuMemset32Async` available via `_SymmetricMemory.memset32`. We chose `cuMemset32Async` over `cudaMemsetAsync` because it allows for `uint32_t`-wise memset. This provides users with better flexibility.
To enable this, we also added the following cuda driver APIs in `c10::cuda::DriverAPI`:
- `cuDevicePrimaryCtxRetain` - for obtaining the primary context of a device in the form of `CUcontext`.
- `cuCtxGetCurrent`/`cuCtxSetCurrent` - for setting and restoring the context for cuda driver APIs such as `cuMemset32Async`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138755
Approved by: https://github.com/weifengpy, https://github.com/eqy, https://github.com/lw
This PR enables donated buffer in OSS and handles two edge cases:
1. While donated buffer relies on storage to check aliasing, sparse tensor subclasses do not provide access to storage. So we skip sparse tensor subclasses for donated buffer.
2. Handles missing "val" from n.meta. This is observed from `inductor/test_fused_attention.py::SDPAPatternRewriterCpuTests::test_sdpa_rewriter_11_cpu`,
`functorch/test_aotdispatch.py::TestAOTAutograd::test_input_mutation_simple_with_none_and_nontensor`, and
`inductor/test_compiled_autograd.py::TestCompiledAutograd::test_trace_run_with_rng_state`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139669
Approved by: https://github.com/bdhirsh
We don't need to do a loop over all the args, kwargs in the
ADInplaceOrView key; we just need to bump the version on the args,
kwargs that are mutable.
On the benchmark mentioned in
https://github.com/pytorch/pytorch/issues/139494
this made the time go from
```
mutate2 = 61.72943878173828
no_mutate2 = 36.89440155029297
mutate = 236.3092498779297
no_mutate = 59.31964874267578
```
to
```
mutate2 = 47.976478576660156
no_mutate2 = 38.37468719482422
mutate = 71.21315002441406
no_mutate = 59.7432975769043
```
Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139513
Approved by: https://github.com/bdhirsh
ghstack dependencies: #139509
Summary:
Dedup the data-dependent errors based on the stacktrace they point to. Right now we just display every propagate-real-tensor log that shows up, but we actually can dedup them if they are due to the same piece of code (e.g. there could be multiple calls to a piece of code that does some data-dependent computation).
This occurred when trying out draft export on the PT2I model zoo. For a specific model, previously we would get ~3k data dependent errors, but after deduping based on the stacktrace we now only get 4 errors.
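A rough sketch of the dedup idea, assuming each propagate-real-tensor log record carries the stack trace it was raised from (the record layout here is an assumption):
```python
from collections import OrderedDict

def dedup_data_dependent_errors(errors):
    # Key each error by the user stack trace that produced it, keeping the first
    # occurrence so repeated hits on the same piece of code show up only once.
    seen = OrderedDict()
    for err in errors:
        key = tuple(err["stack_trace"])
        seen.setdefault(key, err)
    return list(seen.values())
```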
Test Plan: CI
Differential Revision: D65374254
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139540
Approved by: https://github.com/pianpwk, https://github.com/zou3519
Summary:
When we bypass the cache write on inductor, we were also forgetting to reset the bundle; this moves resetting the bundle into the post_compile step so it gets uniformly reset.
This diff also turns on the cache for internal so that we can do a code rollout.
Test Plan: updated tests
Differential Revision: D65457224
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139698
Approved by: https://github.com/ezyang
Summary: During dynamic rendezvous, we shouldn't use the address from the store but just use `self._this_node.addr` directly because sometimes the store host is not the host of rank0. Passing the wrong host will cause a timeout error. This is a follow-up fix to S463164; for internal tests, we disable the TCPStore sharing for now.
Test Plan: CI.
Differential Revision: D65453312
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139702
Approved by: https://github.com/XilunWu
Disable tree vectorization in vec_convert.h for gcc10 and aarch64+sve, which causes a compiler error to occur.
```
/tmp/tmpuqk7lj9j/zx/czx2eyturb6j6m727xhvknkjbdu3y5nqqk66wgxcjkwnxuzvpm5r.cpp:3:18: internal compiler error: in vect_get_vector_types_for_stmt, at tree-vect-stmts.c:12252
3 | extern "C" void kernel(const float* in_ptr0,
```
Fixes #137775
I've not linked a gcc bug report yet as they require a minimal reproducer to be made.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137795
Approved by: https://github.com/malfet
Summary:
Currently, we incorrectly log process_group for DCP based events.
We rely on [c10d_logger.py](https://fburl.com/v4mdme9z) to fill in information about process_group (e.g. backend, nccl_version if available).
In [checkpoint/logger.py](https://fburl.com/yho9nqbu) we pass the `msg_dict` to c10d_logger which never contains the `process_group` param, so [c10d_logger](https://fburl.com/zlw2ukxp) logs information about the default process_group which is always `NCCL`.
Test Plan:
Before:
Always defaults to NCCL even though GLOO is passed by caller.
{F1950847585}
After:
GLOO backend shows up.
{F1950848375}
Differential Revision: D65255871
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139428
Approved by: https://github.com/teja-rao, https://github.com/mhorowitz
As an outcome of https://fburl.com/gdoc/voce5o06, I want to assign owner(s) to any periodic or slow jobs that are still needed but couldn't run more frequently (too $$$, capacity constraints, don't fail that often). They include:
* multigpu
* debug build
* ROCm (distributed, slow)
@malfet @soulitzer I put down your names as the owners of debug build and slowgradcheck respectively. Please let me know if you are ok with that, or if you have a better option in mind.
Any jobs there without an owner are owned by us (PT Dev Infra)
### Testing
The owners show up in the job name https://hud.pytorch.org/pr/139519
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139519
Approved by: https://github.com/malfet
Summary:
I think we can inplace a buffer if all of the users of said buffer are "inconsequential", defined as having been removed, being completed, or being part of the ancestors set. In particular, this allows LayerNorm to inplace its input buffer.
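Roughly, the rule reads like the sketch below (the attribute names are assumptions, not the actual scheduler API):
```python
def all_users_inconsequential(buffer, ancestors):
    # A user no longer blocks in-placing if it was removed, has already completed,
    # or is part of the ancestors set of the node we want to in-place into.
    return all(
        user.is_removed or user.is_completed or user.node in ancestors
        for user in buffer.users
    )
```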
Implements:
https://github.com/pytorch/pytorch/issues/132826
Test Plan:
New unit test of matmul followed by LayerNorm, make sure there's an inplaced buffer.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138383
Approved by: https://github.com/eellison
Summary:
I realized I wanted to check "are my cache entries/IO unreasonably large"
and there's no easy way to do it. This lets me do it.
Test Plan: servicelab
Differential Revision: D65390363
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139627
Approved by: https://github.com/c00w
Summary:
While testing exportability for PT2 Inference models, we found various cases of invalid op inputs during tracing, for example errors like: `a and b must have same reduction dim`, `expected scalar type Long but found Int`, etc. Looking more closely, these happened to be due to the same few meta kernels & eager kernels producing mismatched outputs upstream (e.g. different output tensor dtype, int output).
Adding checks to catch mismatched outputs in real tensor prop upstream, so errors are raised at the mismatched op, instead of the downstream ops taking them as inputs. Relies a lot on utils from [CrossRefFakeMode](929797dedb/torch/_subclasses/fake_utils.py (L78))
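A simplified sketch of the kind of check added (the real implementation builds on the CrossRefFakeMode utilities and covers more cases):
```python
def assert_outputs_match(op, fake_out, real_out):
    # Raise at the op whose meta kernel disagrees with eager, rather than letting
    # downstream ops consume the mismatched output and fail later.
    if fake_out.dtype != real_out.dtype:
        raise RuntimeError(
            f"{op}: meta kernel produced dtype {fake_out.dtype}, "
            f"but the eager kernel produced {real_out.dtype}"
        )
    if tuple(fake_out.shape) != tuple(real_out.shape):
        raise RuntimeError(
            f"{op}: meta kernel produced shape {tuple(fake_out.shape)}, "
            f"but the eager kernel produced {tuple(real_out.shape)}"
        )
```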
Follow ups: could add more checks, and maybe have a flag to only enable these for cases like draft mode, so perf doesn't suffer?
Test Plan: test_export, test_fake_tensor
Differential Revision: D64210055
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137747
Approved by: https://github.com/zou3519
# Why?
I want the following code to work.
minimal repro:
```
class M(torch.nn.Module):
    def forward(self, dilate_flag):
        return dilate_flag.item()

input1 = (torch.tensor([1], dtype=torch.bool, device="cuda"),)
model = M().cuda()
ep = torch.export.export(model, input1, strict=True)
path = torch._inductor.aot_compile(ep.module(), input1)
aot_model = torch._export.aot_load(path, device="cuda")
actual_output = aot_model(*input1)
```
error: AssertionError: Encountered an unsupported object of type <class 'torch.SymBool'> while writing the metadata for exported program
second error will be handled by https://github.com/pytorch/pytorch/pull/138760
# Motivation
I could technically bypass it with a torch.int tensor. However, it doesn't work with torch.cond. I want the following to work. It would also require https://github.com/pytorch/pytorch/pull/138760 for aot compile to work.
```
class M(torch.nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.dilate_flag = 0

    def forward(self, dilate_flag):
        self.dilate_flag = dilate_flag.item()

        def true_fn(dilate_flag):
            return dilate_flag.clone()

        def false_fn(dilate_flag):
            return dilate_flag.clone()

        torch.cond(
            self.dilate_flag,
            true_fn,
            false_fn,
            (dilate_flag,),
        )
        return self.dilate_flag

input1 = (torch.tensor([1], dtype=torch.bool, device="cuda"),)
input2 = (torch.tensor([0], dtype=torch.bool, device="cuda"),)
inputs = (input1, input2)
model = M().cuda()

for input in inputs:
    expected_output = model(*input)

    ep = torch.export.export(model, input, strict=False)
    path = torch._inductor.aot_compile(ep.module(), input)
    aot_model = torch._export.aot_load(path, device="cuda")
    actual_output = aot_model(*input)

    assert (
        expected_output == actual_output
    ), f"henry they are not equal {expected_output} != {actual_output}"
```
Differential Revision: D64867504
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138765
Approved by: https://github.com/ydwu4
This refactoring is for getting a deterministic ordering of binding tensors and the sizes of tensors. When seeing a free tensor x with shape (s0,) in a subgraph, the ordering of lifting changes from
```
lift_x_in_child, lift_s0_in_child, lift_s0_in_parent, lift_x_in_parent
```
to
```
lift_x_in_parent, lift_s0_in_parent, lift_x_in_child, lift_s0_in_child
```
This produces a deterministic ordering of handling the symints in lifted tensors.
This is also the current contract of dynamo top-level graph: we lift free_symbols in sizes after tensor x and insert the free symbols before the tensor x's proxy.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138559
Approved by: https://github.com/zou3519
ghstack dependencies: #138345, #138428, #138558, #138737
Code refactoring only. We move the wrap_to_fake_tensor_logic out of wrap_fx_proxy for placeholders to provide the invariant that **all graph inputs must set their example values when creating the inputs**. This invariant helps us to identify all the free symbols in the graph in top-level and sub-graphs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138428
Approved by: https://github.com/ezyang, https://github.com/zou3519
ghstack dependencies: #138345
Enabling Manywheel builds here: https://github.com/pytorch/pytorch/pull/138732
During the build I observe the failure with cuda jobs:
```
-- Compiler does not support SVE extension. Will not build perfkernels.
-- Found CUDA: /usr/local/cuda (found version "11.8")
-- The CUDA compiler identification is unknown
CMake Error at cmake/public/cuda.cmake:47 (enable_language):
No CMAKE_CUDA_COMPILER could be found.
Tell CMake where to find the compiler by setting either the environment
variable "CUDACXX" or the CMake cache entry CMAKE_CUDA_COMPILER to the full
path to the compiler, or to the compiler name if it is in the PATH.
Call Stack (most recent call first):
cmake/Dependencies.cmake:44 (include)
CMakeLists.txt:851 (include)
```
While the correct sequence is supposed to be:
```
-- Found CUDA: /usr/local/cuda (found version "11.8")
-- The CUDA compiler identification is NVIDIA 11.8.89
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- Found CUDAToolkit: /usr/local/cuda/include (found version "11.8.89")
```
The issue was found to be a missing PATH setting in the 2_28 Docker file. This section exists in the CentOS Docker file here:
https://github.com/pytorch/pytorch/blob/main/.ci/docker/manywheel/Dockerfile#L174-L175
(Please note these Docker images are not used yet. https://github.com/pytorch/pytorch/pull/138732 should enable using these images.)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139631
Approved by: https://github.com/malfet, https://github.com/huydhn
Partially fixing https://github.com/pytorch/pytorch/issues/138685
Add a (relatively safe?) heuristic to skip fusion if it can potentially increase peak memory.
The doc string mainly explains what this PR is doing:
```
The implementation is more like a heuristic since we don't really know if we are at peak
or not when trying to fuse these two nodes. The order of nodes may change later which makes the
peak memory estimation hard.
Here is how we decide the LOWER BOUND of extra memory allocation if we fuse these 2 nodes:
1. find all buffers read by each node with a single user. These buffers are supposed to
be reused if we don't fuse these 2 nodes
2. find the intersection of these buffers for the two nodes and sum the total buffer size.
If we don't fuse these two nodes, we can at least avoid this much memory allocation.
Note that the extra memory allocation is not necessarily causing peak memory increase.
This is just a heuristic.
We return true only if the saving from fusion cannot outweigh the extra memory allocation.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138756
Approved by: https://github.com/jansel
ghstack dependencies: #139136
Since any stage can run a mixture of full backwards and split backwards,
it is important to count the sum of (full_backwards + backward_weight)
when comparing to num microbatches to determine the last backward.
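In other words, something along these lines (names are illustrative):
```python
def is_last_backward(num_full_backwards: int, num_backward_weights: int, num_microbatches: int) -> bool:
    # Neither count alone covers all microbatches when full and split backwards are mixed.
    return (num_full_backwards + num_backward_weights) == num_microbatches
```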
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139415
Approved by: https://github.com/H-Huang
This addresses
https://github.com/pytorch/pytorch/pull/137677/files#r1799836499, which
had to set `allow_cache=False` for codegen on `DataPtrVariable.base`,
which is a `TensorVariable`, otherwise we observe failure of
`test_no_grad_copy` when testing with Dynamo.
I've seen `test_no_grad_copy` failing a few times, and every single time
it's related to cyclic references; my best guess is the cyclic reference
holds some tensor object in memory longer than necessary, preventing the
optimization introduced in #11165.
This patch makes `OutputGraph.cleanup()` more aggressive by clearing out
all fields that might reference a `VariableTracker`. As a result, we can
remove the aforementioned `allow_cache=False`, which helps generate
better code (e.g., in the case of `test_no_grad_copy`, it skipped generating
a redundant graph whose only op is returning the input tensor; instead we just
generate a single `LOAD_FAST`).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139487
Approved by: https://github.com/jansel, https://github.com/aakhundov
These are not artificial patterns I came up with; they show up in the linear+CrossEntropyLoss graph.
Consider this snippet:
```
class LinearAndCEL(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(C, V)
        self.ce = nn.CrossEntropyLoss()

    def forward(self, x, y):
        return self.ce(self.linear(x).view(B * T, V), y.view(-1))
```
`x` passed to `forward` is a 3D tensor of shape [B, T, C].
The `self.linear` will first view x as a [BxT, C] tensor, do the matmul to produce a [BxT, V] tensor, and then view this output back to a 3D tensor with shape [B, T, V]. The user code then adds another view op to convert the tensor shape to [B x T, V]. This generates a pair of redundant views. A pair of redundant permutes happens in the backward pass when we compute gradients.
The view ops make it hard to chunk linear+CEL. When the view op breaks up the dimension being chunked, what should the chunker do (even if we merge those dimensions again later)? Removing these pointless view pairs makes the chunker simpler, and I think it's in general nice to do.
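Concretely, the redundant view pair looks like this (shapes chosen for illustration):
```python
import torch

B, T, C, V = 4, 128, 512, 1024
x = torch.randn(B, T, C)
W = torch.randn(V, C)

flat = x.view(B * T, C)            # linear's decomposition flattens the 3D input...
h = (flat @ W.t()).view(B, T, V)   # ...and views the matmul result back to 3D
logits = h.view(B * T, V)          # user code immediately undoes that view
```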
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139136
Approved by: https://github.com/Chillee, https://github.com/jansel
This is tested in PR stacked above in
```python
python test/distributed/fsdp/test_fsdp_state_dict.py TestFSDPStateDict.test_torch_save_load
```
We cannot depend on `hasattr(..., "__slots__")` to know whether a BUILD instruction has slotstate. For example, if a class subclasses ABC, `hasattr(cls, "__slots__")` will be `True` but there might be no slots (and hence `state` will not be a tuple). So revert #138936 to follow the pickle library's code
```python
>>> from abc import ABC
>>> hasattr(ABC, "__slots__")
True
```
So
```python
import torch
from abc import ABC
from dataclasses import dataclass

class Foo(ABC):
    pass

class FooWrapper(Foo):
    def __init__(self, x, y):
        self.x = x
        self.y = y

f = FooWrapper(1, 2)
torch.save(f, "temp.pt")
with torch.serialization.safe_globals([FooWrapper]):
    torch.load("temp.pt")
```
Would fail on the previous code with
```
File "/data/users/mg1998/pytorch/torch/serialization.py", line 1934, in _load
result = unpickler.load()
File "/data/users/mg1998/pytorch/torch/_weights_only_unpickler.py", line 366, in load
for k, v in slotstate.items():
```
As there is actually no slotstate
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139541
Approved by: https://github.com/malfet
ghstack dependencies: #138936, #139221, #139433
In this diff, I make the test_torchbind.py tests handle the training IR. Today in the training IR, we don't see the effect token and HOP because this happens at FunctionalTensorMode. Maybe in the future, we should move this logic up to the training IR so that writing passes etc. on training IR is safer. But for migration purposes, I think it is ok for now. I also fixed a few bugs:
1. ep.module() doesn't register all aliased constants in the module.
2. When we retrace, we need to fakify the original Torchbind object.
3. We don't run any DCE on training IR, so we need to add some more torch ops to the verifier.
Differential Revision: [D64853530](https://our.internmc.facebook.com/intern/diff/D64853530)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138658
Approved by: https://github.com/ydwu4, https://github.com/zhxchen17
Adds a few more dynamo_timed() to measure triton compilation and load_by_key_path times.
In the case of async compilation with multiple threads, we'll generate a single `kernel_compile` event that occurs when waiting on all the parallel compiles to finish.
In the case where async parallel compilation is disabled (or, compile threads are warming up), we'll generate a `triton_compile` event for each kernel.
The `triton_compile` events are a bit questionable: do we need a row for each triton compile event? It might eat into our already low retention, so I might just remove that. Will discuss with @slarsen.
Differential Revision: [D65215707](https://our.internmc.facebook.com/intern/diff/D65215707/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139402
Approved by: https://github.com/oulgen
Previously: https://github.com/pytorch/pytorch/pull/138052 but the implementation is done from scratch, so I open a new PR.
This implements the ability to save and load profiles of automatic dynamic decisions, so on subsequent runs we can directly make something automatically dynamic. Unlike the previous implementation, this cache is never enabled by default; instead, you have to specify a "job id" that says it's OK to share results. We will be able to automatically populate this id for internal MAST jobs but for generic OSS users you will have to explicitly opt into it.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139001
Approved by: https://github.com/oulgen
This is the next step in supporting dynamic float arguments in PT2: docs.google.com/document/d/1HswUSp9H6mg8Vg27mhRk8YzC9q_uf63b6wz-gwx65BQ/edit?pli=1#heading=h.xvyiqp8tuje6. To make this more incremental and tractable, we've decided to opt the export path out of this first phase of the rollout.
Fixes python test/export/test_export.py TestExport.test_export_input_mutation_dynamic_shape when specialize_float=False
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139486
Approved by: https://github.com/ezyang
ghstack dependencies: #139451, #139482, #139484
In https://github.com/pytorch/pytorch/pull/134685, I transformed the following code:
```CPP
if (CUDAAllocatorConfig::release_lock_on_cudamalloc()) {
  // At scope exit, acquire the lock again. This provides safety against
  // any potential exceptions in the cudaMallocMaybeCapturing function.
  auto sg = c10::make_scope_exit([&]() { lock.lock(); });
  lock.unlock();
  p.err = cudaMallocMaybeCapturing(&ptr, size);
} else {
  p.err = cudaMallocMaybeCapturing(&ptr, size);
}
if (CUDAAllocatorConfig::release_lock_on_cudamalloc()) {
  TORCH_CHECK(
      lock.owns_lock(), "Failed to acquire lock after cudaMalloc");
}
```
into:
```CPP
if (CUDAAllocatorConfig::release_lock_on_cudamalloc()) {
  // At scope exit, acquire the lock again. This provides safety against
  // any potential exceptions in the cudaMallocMaybeCapturing function.
  auto sg = c10::make_scope_exit([&]() { lock.lock(); });
  lock.unlock();
}
auto active_pool = MemPoolContext::getActiveMemPool();
if (active_pool && active_pool->allocator() &&
    p.pool->owner_PrivatePool) {
  ptr = active_pool->allocator()->raw_alloc(size);
  p.err = ptr ? cudaSuccess : cudaErrorMemoryAllocation;
} else {
  p.err = cudaMallocMaybeCapturing(&ptr, size);
}
if (CUDAAllocatorConfig::release_lock_on_cudamalloc()) {
  TORCH_CHECK(
      lock.owns_lock(), "Failed to acquire lock after cudaMalloc");
}
```
This is wrong because I didn't realize what `c10::make_scope_exit([&]() { lock.lock(); });` does, so my change doesn't let `release_lock_on_cudamalloc` unlock..execute alloc..lock; instead it just unlocks and immediately re-locks. This PR rectifies that change, and in addition adds an ASSERT ensuring the active pool and p.pool are the same (mirroring the behavior from released_cached_blocks).
Thanks @zvon82 for reporting this!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139430
Approved by: https://github.com/ezyang
This PR ensures that the `nanmean()` function raises a `RuntimeError` when using `int64` or `bool` dtypes, even for empty tensors. Previously, non-empty tensors correctly raised errors for unsupported dtypes, while empty tensors did not. This change brings consistent error handling for both cases.
addressing the need raised in an issue by @hyperkai (Issue [#131043](https://github.com/pytorch/pytorch/issues/131043)).
### Changes
- Added checks in `nanmean_out()` to raise errors for `int64` and `bool` dtypes regardless of tensor size.
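The expected behavior after the change, sketched below (the exact error message text is not reproduced here):
```python
import torch

for t in (torch.tensor([1, 2], dtype=torch.int64),  # non-empty: already raised
          torch.tensor([], dtype=torch.int64)):      # empty: now raises too
    try:
        torch.nanmean(t)
    except RuntimeError as e:
        print("raised:", e)
```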
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138745
Approved by: https://github.com/ezyang
Fixes issue with timm models where
example_value = 0.09999
proxy.node.target = <built-in function sub>
would fall through to
```
unimplemented(
    "torch.* op returned non-Tensor "
    + f"{typestr(example_value)} {proxy.node.op} {proxy.node.target}",
    case_name="unsupported_operator",
)
```
and graph break
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139482
Approved by: https://github.com/ezyang
ghstack dependencies: #139451
Previously: https://github.com/pytorch/pytorch/pull/138052 but the implementation is done from scratch, so I open a new PR.
This implements the ability to save and load profiles of automatic dynamic decisions, so on subsequent runs we can directly make something automatically dynamic. Unlike the previous implementation, this cache is never enabled by default; instead, you have to specify a "job id" that says it's OK to share results. We will be able to automatically populate this id for internal MAST jobs but for generic OSS users you will have to explicitly opt into it.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Differential Revision: [D65065497](https://our.internmc.facebook.com/intern/diff/D65065497)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139001
Approved by: https://github.com/oulgen
Summary:
Restarting (aborting and re-initialize a PG) is a basic need if we want
to achieve in-process restart of PGs without tearing down the whole
process.
Add this tests to verify that this is supported by current NCCL.
Note that this restart test passes steadily only in blocking mode for now.
In nonblocking mode, there is a problem in either NCCL init or abort that
needs further investigation.
Test Plan:
new UT
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139496
Approved by: https://github.com/c-p-i-o, https://github.com/kwen2501
PyTorch MPS backend for the most part relies on MPSGraph to provide specific operations, but recently, more and more often, one had to implement custom kernels that were simply embedded in the operator codebase and were compiled directly using [`- id<MTLLibrary>newLibraryWithSource:options:error:`](https://developer.apple.com/documentation/metal/mtldevice/1433431-newlibrarywithsource) (the first Metal kernel in the MPS backend was added in https://github.com/pytorch/pytorch/pull/82307 )
Later on, as the number of operators grew, those were refactored into the `MetalShaderLibrary` convenience class (see https://github.com/pytorch/pytorch/pull/125550 )
But as the number of kernels keeps growing, it's time to take the next step and properly compile them into `.metallib`
This PR does exactly that by:
- Moving shader sources into separate .metal files
- Adds a check on whether full Xcode is installed or just DeveloperTools
- If full Xcode is installed, compiles and links shaders into .metallib for Metal-3.0(Available on MacOS 13) and Metal-3.1 standard (available on MacOS 14, can use bfloat) and bundles both using `-sectcreate` linker option and `getsectiondata` API call. `metallib_dummy.cpp` file is used to properly express dependencies between metallib build and torch_cpu link stages. Logic for generating metallibraries is loosely based on https://github.com/ml-explore/mlx/blob/main/mlx/backend/metal/kernels/CMakeLists.txt.
- If only DeveloperTools CLI is installed, automatically wraps .metal into `_metallib.h` that contains shader source wrapped in `MetalShaderLibrary`
Bulk of changes introduced in this PR are just moving code around. I.e. for every file that contains non-templated shader definition in `aten/src/ATen/native/mps/operators` folder, corresponding `.metal` file is created in `aten/src/ATen/native/mps/kernels` folder and embedded shader definition is replaced with the following
```cpp
#ifndef PYTORCH_JIT_COMPILE_SHADERS
static auto& lib = MetalShaderLibrary::getBundledLibrary();
#else
#include <ATen/native/mps/OpName_metallib.h>
#endif
```
Some historical stats:
| PyTorch Version | Number of shaders in MPS | Ops added |
| ------------- | ------------- | ---- |
| 1.12 | 0 | |
| 1.13 | 2 | bitwise_ops and index.out |
| 2.0 | 4 | cross, repeat and view |
| 2.1 | 9 | unary_ops, histogram, renorm, binary_ops |
| 2.2 | 11 | gamma and bucketization |
| 2.3 | 12 | naive_matmul (to work around a crash) |
| 2.4 | 13 | quantized_mm |
| 2.5 | 14 | fused_adam |
Pros:
- Better code structure/readability
- Eventually allows one to use shared headers (and implement something like `TensorIterator`)
- Faster runtime (as compilation is done ahead of time) and perhaps better optimized compiled kernels
Cons:
- Build process is a bit more complicated than it used to be
- Need to maintain two codepaths (as our CI builders only have DeveloperTools installed)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138636
Approved by: https://github.com/manuelcandales
The ONNX custom ops registration API.
## Design
1. Create a "custom_translation_table: dict[Callable, Sequence[Callable] | Callable" parameter for specifying extra functions
2. Use a callable as the key to support all possible call_function targets in the fx graph
3. Allow a callable or a Sequence of callables as values.
- When there is a single callable, it is the translation function for the op
- When there is a Sequence of callable, the exporter's dispatcher will dispatch to these callables in order based on input dtypes.
- The translation functions can be a plain python function that calls onnxscript ops (traced), or an onnxscript function.
- Complex input support: We create special type annotations for annotating real representations of complex inputs, which are needed to handle complex computation in the ONNX graph, as we don't have any ops in ONNX that handle complex inputs. The dispatcher will have knowledge of these newly created type annotations and dispatch correctly. The complex functions will be in the same overload pool as the real functions.
```py
torch.onnx.export(
    dynamo=True,
    custom_translation_table={
        torch.ops.aten.add: [overload1, overload2],
        torch.sym_not: sym_not_onnx,
    },
)
```
Support for functions that handles complex inputs will be in separate PRs.
fixes https://github.com/pytorch/pytorch/issues/138391
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135403
Approved by: https://github.com/titaiwangms
Summary:
Triton has added some integer overflow detection when kernels are compiled with
`debug=True`, and this test results in integer overflow (2.0 is 0x40000000,
times 2 is 0x80000000 which overflows a signed int32).
Assertion `int32 overflow detected for operation mul` failed
Fixes #139479
Test Plan:
```
python inductor/test_torchinductor.py -k test_float32_to_int32_cuda
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139489
Approved by: https://github.com/eellison, https://github.com/jansel, https://github.com/chenyang78
## This Stack
This stack does the following things to support `xformers`-style, comm-aware Triton kernels:
- Exposes `signal_pad`s as tensors in Python
- Adds a binding for `cuMemsetAsync`
These in combination aim to provide users with more flexibility to express custom signaling/synchronization patterns.
## This PR
```python
# Obtain the signal pad of the specified peer rank as a tensor.
# If both shape and dtype are unspecified, the returned tensor will be a
# 1d uint32 tensor, which is most natural for signaling purposes.
symm_mem.get_signal_pad(peer_rank)
# If only shape is specified, it is equivalent to:
# symm_mem.get_signal_pad(peer_rank)[:shape.numel()].view(shape)
symm_mem.get_signal_pad(peer_rank, shape)
# If only dtype is specified, it is equivalent to:
# symm_mem.get_signal_pad(peer_rank).view(dtype)
symm_mem.get_signal_pad(peer_rank, dtype=dtype)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138754
Approved by: https://github.com/weifengpy, https://github.com/lw
Fixes#136559
As we upgrade to NumPy 2, torch falsely filtered out `numpy.random` as unsupported in dynamo tracing.
This PR changes the filtering rules to include them while keeping behavior with numpy 1 unchanged.
Before this PR, the following tests failed:
```
PYTORCH_TEST_WITH_ASAN=1 PYTORCH_TEST_WITH_UBSAN=1 python test/dynamo/test_functions.py -k FunctionTests.test_numpy_random
PYTORCH_TEST_WITH_ASAN=1 PYTORCH_TEST_WITH_UBSAN=1 python test/dynamo/test_unspec.py -k UnspecTests.test_to_tensor
PYTORCH_TEST_WITH_ASAN=1 PYTORCH_TEST_WITH_UBSAN=1 python test/test_fake_tensor.py -k FakeTensorTest.test_export_numpy
PYTORCH_TEST_WITH_ASAN=1 PYTORCH_TEST_WITH_UBSAN=1 python test/test_fake_tensor.py -k PropagateRealTensorsFakeTensorTest.test_export_numpy_propagate_real_tensors
```
With this PR, the supported/unsupported ops in NumPy 1 are not changed.
For NumPy 2, only the `numpy.random` ops that are already supported with NumPy 1 are added to the supported list.
I used the following scripts to check the differences before and after the change for both NumPy 1 & 2.
The output is empty for NumPy 1 since there is no change.
The output is a list of `numpy.random` that considered supported for NumPy 2.
```py
from torch._dynamo import trace_rules
import numpy as np

def new_numpy_function_ids():
    unsupported_funcs = {"seed", "ranf", "get_bit_generator", "RandomState", "set_bit_generator", "sample"}

    def is_supported(k, v, mod):
        if not callable(v):
            return False
        if not getattr(v, "__module__", None):
            return True
        if v.__module__ == mod.__name__:
            return True
        if v.__module__ == "numpy.random.mtrand" and mod.__name__ == "numpy.random" and k not in unsupported_funcs:
            return True
        return False

    rv = {}
    for mod in trace_rules.NP_SUPPORTED_MODULES:
        for k, v in mod.__dict__.items():
            if is_supported(k, v, mod):
                rv[id(v)] = f"{mod.__name__}.{k}"
    return rv

def old_numpy_function_ids():
    rv = {}
    for mod in trace_rules.NP_SUPPORTED_MODULES:
        rv.update(
            {
                id(v): f"{mod.__name__}.{k}"
                for k, v in mod.__dict__.items()
                if callable(v)
                and (getattr(v, "__module__", None) or mod.__name__) == mod.__name__
            }
        )
    return rv

rv1 = set(old_numpy_function_ids().values())
rv2 = set(new_numpy_function_ids().values())
for v in (rv1 - rv2):
    print(v)
print("****")
for v in (rv2 - rv1):
    print(v)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138686
Approved by: https://github.com/lezcano, https://github.com/williamwen42
This is to match the default layout constraint for custom operators. By
default, Inductor should match the stride order of inputs to a triton
kernel.
IF THIS IS BREAKING YOU, PLEASE REACH OUT, especially if it's been
more than two weeks since this landed. You can flip the config locally
as a workaround.
Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137064
Approved by: https://github.com/albanD, https://github.com/eellison
This teaches install_config_module (and the underlying code) to
understand Config objects. Additionally, we've added a JK option to this
which resolves the JK.
This config gets stored within the _ConfigEntry class and is evaluated
when __getattr__ is called. If justknobs is set, it'll call
justknobs_check to see the result.
Due to preceding work, basically everything works correctly here; we only
had to update a couple of tests and modify the getattr behaviour.
Note that we are updating the justknob_check function to support a
default option, to make default work.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138766
Approved by: https://github.com/ezyang
Summary: The main changes to support freezing are:
1) When pickling constant tensors as part of the cache key calculation: If freezing has not been applied, then keep the existing behavior (pickle the metadata and values). If freezing has been applied, then pickle the values if the constant will be inlined; otherwise, consider only the metadata (see the sketch after this list).
2) If freezing has been applied, modify what we store in the cache: Instead of storing the constant attributes in the cache entry, store the _names_ of the constants, and then grab those constants from the GraphModule when we need to attach the attributes to a newly-loaded Python module. Since the cache lookup path loads the Python module, this bullet means we need to thread through a GraphModule argument in several places.
3) Since this feature means that we may need to reload the same Python module path more than once (but attach different constant attributes), I changed PyCodeCache.load_by_key_path to not store an in-memory map of path to module (since there may be more than one). I don't _think_ this will have any effect on performance, however. It's unclear why we were using an in-memory cache here anyway, since this function should only be called once for each module needed to be loaded.
4) Several tests were removing on-disk PyCodeCache artifacts by iterating over the modules. I made this more straightforward by implementing a cache_clear method that removes the on-disk artifacts. Arguably, this should have been the implementation all along.
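A rough sketch of the keying rule in (1) (helper names and the exact serialization are assumptions):
```python
import io
import torch

def constant_cache_key(tensor, frozen: bool, will_be_inlined: bool):
    metadata = (str(tensor.dtype), tuple(tensor.shape), tuple(tensor.stride()), str(tensor.device))
    if not frozen or will_be_inlined:
        # No freezing (existing behavior), or the constant gets baked into the generated
        # code: the values affect the output, so they must affect the key.
        buf = io.BytesIO()
        torch.save(tensor, buf)  # stand-in for "pickle the values"
        return (metadata, buf.getvalue())
    # Frozen and kept as a module attribute: only the metadata matters.
    return metadata
```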
Differential Revision: [D63542170](https://our.internmc.facebook.com/intern/diff/D63542170)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136505
Approved by: https://github.com/eellison
# Summary
The AMX ISA based GEMM micro-kernel template for int8 weight-only quantization (BF16 activation, int8 weights) should cache dequantized weights (int8 -> int32 -> fp32 -> bf16) so that they would not have to be dequantized again in subsequent calls to the _inner-kernel_ that uses the same weights.
This change leverages the fact that even for BF16 x BF16 GEMM template, cache-blocking ensures that `Nr * Kc` weight elements are cached in L1D cache (more info [here](https://static.sched.com/hosted_files/pytorch2024/59/TorchInductor%20CPU%20Backend%20Advancements%20-%20New%20Features%20and%20Performance%20Improvements_20240915.pdf)). Here, `Nr` is the register blocking size for `N` dimension (at the granularity of the GEMM micro-kernel, it's currently also the cache blocking size for `N` dimension, although that may change in the future), and `Kc` is the cache blocking size for `K` dimension.
The figure below is from the document linked above -
<img width="476" alt="image" src="https://github.com/user-attachments/assets/e23e5476-d910-46d1-a9b3-cbf77de76d94">
## Performance data
Collected on 48 physical cores of one socket of Intel Xeon Platinum 8468H (Xeon SP 4th gen). Intel OpenMP & tcmalloc were preloaded.
|M | N | K | Latency with ATen _weight_int8pack_mm | Latency with codegened templated GEMM (current main branch) | Latency with codegened templated GEMM (this PR) |
|-----|-----|-----|------|----------|----|
|4096|4096|4096| 45.844 ms | 9.322 ms| 5.2181 ms |
|4096|11008|4096| 127.618 ms |24.6258 ms | 13.6046 ms|
|4096|4096|11008| 121.953 ms | 25.4692 ms | 10.2669 ms |
|4096|32000|4096| 478.450 ms| 75.3942 ms | 48.21 ms |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136688
Approved by: https://github.com/jgong5
Summary:
The change comes from D65214804 (https://github.com/pytorch/pytorch/pull/139239)
`buck2 test @//fbobjc/mode/buck2/ios-tests fbsource//xplat/caffe2/c10:c10_testApple` doesn't like having 2 `testConversionToString` in the same suite `StringViewTest`, so just need to use a different name there.
Test Plan: `buck2 test @//fbobjc/mode/buck2/ios-tests fbsource//xplat/caffe2/c10:c10_testApple` passes
Differential Revision: D65314266
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139448
Approved by: https://github.com/cyyever, https://github.com/malfet
This PR fixes a compilation time regression manifested in timm_models/hrnet_w18 caused by https://github.com/pytorch/pytorch/pull/136732.
The regression is reproducible locally. The compilation time is a bit noisy, but it's still possible to tell the difference.
```
Before the offending PR
compilation_latency mean=176.022 seconds
compilation_latency mean=176.564 seconds
On the offending PR
compilation_latency mean=180.096 seconds
compilation_latency mean=179.101 seconds
On the fix
compilation_latency mean=173.153 seconds
compilation_latency mean=174.182 seconds
```
(I think the fix being faster than the baseline is due to noise)
The cause of the regression is an inefficiency in `is_user_visible_output()`. Specifically, it used `output_node.args[0].index(node)` to obtain the output idx for each node (and we called this for each node twice). The offending PR had the assumption that `len(output_node.args[0])` is rather small. However, it has been proven false by the benchmark (it was 1900+ for timm_models/hrnet_w18).
The fix is to precompute `user_visible_output_strides` once by iterating only over the nodes in `output_node.args[0]`.
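The fix amounts to something like the sketch below (names are assumptions based on the description):
```python
def compute_user_visible_output_strides(output_node, user_visible_output_idxs):
    # Built once, in O(len(output_node.args[0])), instead of calling
    # output_node.args[0].index(node) for every node in the graph.
    strides = {}
    for idx, out in enumerate(output_node.args[0]):
        if idx in user_visible_output_idxs:
            strides[out] = out.meta["val"].stride()
    return strides

def is_user_visible_output(node, user_visible_output_strides):
    return node in user_visible_output_strides  # dict lookup instead of list.index()
```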
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139420
Approved by: https://github.com/ezyang
Fix https://github.com/pytorch/pytorch/issues/128063 .
Now for this snippet
```
def f(x):
    y = torch.sum(torch.sum(x, dim=-1))
    z = x / 10.0
    z_t = z.t().contiguous().t()
    return y, z, z_t
```
Inductor could generate a single kernel for the first reduction and the two pointwise kernels (if loop ordering after fusion is enabled), and the generated kernel reads `x` only ONCE (without proper handling, the two pointwise kernels may each access x once even if they are fused).
The PR needs to fix 2 subtle bugs regarding LOAF:
1. when we reorder loops for a FusedSchedulerNode, we check if each sub-node's sizes match. But some nodes have sizes in `list` type (if their loop is not reordered) while others have their sizes in `tuple` type (if their loop is reordered). I could change the upstream code to uniformly use either `list` or `tuple`, but without strong enforcement, future code could break this. So I just convert sizes to a uniform type before comparison.
2. We have a cache for tiling decisions of a BaseSchedulerNode. If we reorder loops for the node, we should invalidate the cache. Otherwise, a stale tiling decision can result in a (very) bad kernel.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139376
Approved by: https://github.com/jansel, https://github.com/eellison
Currently, we get all partition ids by iterating over assignment, whose size equals the number of nodes in the graph. But we can reach the same result by iterating over partitions_by_id, whose size is much smaller than the number of nodes. Assuming the number of nodes is N and the number of partitions is P, the time complexity decreases from O(N * N) to O(N * P) after this patch.
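Schematically (variable names are assumptions):
```python
def partition_ids_before(assignment):
    # O(N): scan the per-node assignment to recover the set of partition ids.
    return set(assignment.values())

def partition_ids_after(partitions_by_id):
    # O(P), with P << N: the ids are already the keys of partitions_by_id.
    return set(partitions_by_id.keys())
```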
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136598
Approved by: https://github.com/tarun292
Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
Also, update tests to use I (BACKWARD_INPUT) vs B (FULL_BACKWARD)
consistently.
Previously, schedules would issue a 'B' operation and leave it ambiguous
whether that operation should be BACKWARD_INPUT or FULL_BACKWARD,
depending on a separate flag (use_full_backward) passed to the schedule
class, which would determine which behavior was taken at runtime.
Now, use_full_backward is removed and the schedule class is required to
produce unambiguous IR. The logic for 'use_full_backward' is removed
from the runtime.
_validate_pipeline_order is replaced with _simulate_comms_compute. Both
offer similar functionality, to validate the correctness of a schedule
IR. 'validate' operates on compute-only IR, while simulate operates on
compute + comm IR. To convert from using validate to simulate, you have
to first insert comm actions via '_add_send_recv'.
'simulate' was inefficiently written before this PR and needed to be
optimized to run quickly for extra large schedules with >32 ranks and
microbatches per rank used in some unit tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138886
Approved by: https://github.com/H-Huang
As discussed w/ @ezyang offline, one way to de-risk the `specialize_float=False` rollout is to specialize all backed symfloats that we fail to tensorify away. This diff does a few things:
1) It fixes a bug where item_memo gets dropped (due to incorrect epoch invalidation)
2) It updates the tensorify pass to do the backup specialization
This pass was originally part of the [PR](https://github.com/pytorch/pytorch/pull/137782) that flips `specialize_float=False` but we learned that the blast radius is simply too large. We've pivoted to a more milestone driven approach where we learn from the failures of the aforementioned PR and cherry pick fixes into main first. After this current PR lands our strategy is as follows:
1) Integrate turning off specialize float only in the automatic dynamic pass.
2) Put up a canary diff that only turns off specialize float in `backend=eager` mode to sniff out symfloat related bugs in dynamo due to code paths we previously never exercised.
3) Put up a canary diff that only turns off specialize float in `backend=aot_eager` mode to sniff out symfloat related bugs in aotautograd due to code paths we previously never exercised.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138868
Approved by: https://github.com/ezyang
This diff considerably changes the column format of PT2 Compile Events:
- Now, instead of logging one new column per every piece of metadata, we just log a single column, "metadata". This vastly decreases the number of columns we need to log, which should help with retention.
- Now, we only log to scuba for a set of dynamo_timed() events that we actually care about aggregating. To do so, we add a boolean to dynamo_timed() that decides whether or not to log a pt2_compile_event. We'll always log a chromium_event for every dynamo_timed(), but only log a subset of those to scuba.
Differential Revision: [D65225598](https://our.internmc.facebook.com/intern/diff/D65225598/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139309
Approved by: https://github.com/oulgen
Reason:
Currently we have multiple traversals for tangents at runtime:
- To check that types and structure are identical to what we guessed during tracing time
- Coerce metadata
- Coerce memory_format
- Unwrap_tensor_subclass
All of them traverse tangents via `__tensor_flatten__` calls through the tree of subclasses.
Change:
To do everything in one traversal at runtime (including flattening)
Implementation details:
1. Add memory_format information inside SubclassCreationMeta; for PlainTensors, keep not only the (int) unwrapped_index but the memory_format too.
Preparing memory_format is optional (controlled by with_memory_format=True).
2. Remove the unused subclass_utils.create_metadata_for_subclass, which does not have any usages inside torch and would require updating the logic.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139068
Approved by: https://github.com/bdhirsh
Summary:
Move check_no_missing_dump_files to after the "just print" location.
This allows us to print dump_files when there are actual missing files.
Test Plan:
```
torchfrtrace -j ~/pyper-training-online-924394600 --selected-ranks 1 2
Inferred common prefix nccl_trace_rank_
loaded 95 files in 0.040270328521728516s
built groups, memberships
Rank 1 Rank 2
------------------------------------------------------------------ ------------------------------------------------------------------
broadcast(input_sizes=[[2]], state=completed) broadcast(input_sizes=[[2]], state=completed)
```
Without this change, the command was erroring out.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139417
Approved by: https://github.com/Skylion007, https://github.com/fduwjj
Used in both simulator and add_send_recv pass, the ready_to_schedule
logic works by looking at all the previously scheduled ops on a rank to
see if any of them 'unblocks' the current op to be scheduled. For example,
to schedule a FORWARD op, a previous RECV_F op is needed, unless this is
stage 0 or there is a previous stage on the same rank that ran FORWARD
already.
The old implementation iteratively compared the candidate op to the
previous ops. The new implementation uses set lookups to reduce
complexity. It also maintains the set of previous ops as ops are
scheduled rather than constructing a set on demand.
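A sketch of the set-based formulation (the prerequisite computation is elided and assumed):
```python
scheduled: set = set()  # maintained incrementally as actions are scheduled

def ready_to_schedule(prereqs: set) -> bool:
    # e.g. FORWARD(stage, mb) is unblocked by RECV_F(stage, mb), or by
    # FORWARD(stage - 1, mb) if the previous stage lives on the same rank.
    return not prereqs or not scheduled.isdisjoint(prereqs)

def record_scheduled(action) -> None:
    scheduled.add(action)
```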
I did not save benchmark results, but this results in a 10-100x speedup
which is most noticeable for unit tests with artificially huge schedule
IR, the largest of which took longer than 20m before (I never let it
finish) but now takes less than 14s. Most schedules take less than
10ms.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138924
Approved by: https://github.com/H-Huang
ghstack dependencies: #138928, #131762
### Separate dI / dW:
PipelineScheduleRuntime now supports execution of merged FULL_BACKWARD
or separate dI / dW operations.
Separating the B and W may add execution overhead or may be suboptimal
in cases where BW are 'fused', but it is worthwhile when separating B, W
lets the schedule be more efficient by filling in bubbles. In some
cases, the schedule will still issue B followed by W at certain points,
so in these cases just merge them back into BW ops and execute them as
full backwards rather than executing a B followed by a W.
### V-schedules:
V-schedules have a special case where the last rank has 2 adjacent
stages.
E.g. if rank3 had stage 3 and stage 4, then we should implement direct
transfer of stage3 outputs to stage4 inputs without a
send/recv.
In the scheduling logic, we also must allow scheduling the
stage 4 forward after running stage 3 forward, without expecting a stage
4 RECV_F
In the runtime, we pass activations between adjacent stages without
using SEND/RECV ops since the stages are on the same rank/process. We
add new APIs to PipelineStage abstraction for passing the activations
both during forward and backward. Currently the implementation directly
modifies the 'recv buffers' the stage is managing, so the
forward/backward execution logic does not need to know the difference.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131762
Approved by: https://github.com/H-Huang
ghstack dependencies: #138928
Summary:
had a land race with another diff, D65166035, to fix the schema.
According to export team's discussion, we are upgrading min_val and max_val to optional fields which shouldn't break BC and allows the schema to express infinity.
Test Plan: buck2 test 'fbcode//mode/opt' fbcode//apf/rec/ir/tests:ir_export_deserialize_test
Differential Revision: D65273170
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139394
Approved by: https://github.com/yiming0416
#### Summary
This pull request introduces new weighted loss functions to the PyTorch library: `weighted_huber_loss`, `wmse_loss`, and `wmae_loss`. These functions allow for precise control over the influence of each sample during training, important for imbalanced data or when certain samples are more significant than others.
#### Changes
- **`weighted_huber_loss`**: Huber loss modified to incorporate weights, providing a balance between L1 and L2 loss based on the `delta` parameter.
- **`wmse_loss`** (Weighted Mean Squared Error): Applies weights to the standard MSE loss, useful for emphasizing certain samples in regression tasks.
- **`wmae_loss`** (Weighted Mean Absolute Error): Adjusts MAE loss calculation by including weights, ideal for datasets with outliers.
#### Code Details
- **Input Validation**: Ensures `input`, `target`, and `weights` tensors match in size to prevent broadcasting errors.
- **Reduction Options**: Supports `none`, `mean`, and `sum` reductions to suit various computational needs.
- **Backward Compatibility**: Maintains support for deprecated arguments `size_average` and `reduce`, while encouraging use of the `reduction` argument.
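A minimal sketch of `weighted_huber_loss` consistent with the description above (the actual implementation may differ, e.g. in how the mean reduction normalizes the weights):
```python
import torch

def weighted_huber_loss(input, target, weights, delta=1.0, reduction="mean"):
    err = input - target
    abs_err = err.abs()
    quad = abs_err.clamp(max=delta)              # quadratic region
    lin = abs_err - quad                         # linear region beyond delta
    loss = weights * (0.5 * quad ** 2 + delta * lin)
    if reduction == "mean":
        return loss.mean()
    if reduction == "sum":
        return loss.sum()
    return loss
```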
#### Usage Example
```python
import torch
input = torch.tensor([0.5, 2.5, 2.0], dtype=torch.float32)
target = torch.tensor([0.0, 2.0, 1.5], dtype=torch.float32)
weights = torch.tensor([1.0, 0.5, 1.5], dtype=torch.float32)
loss = weighted_huber_loss(input, target, weights, delta=1.0)
print(loss)
```
---
Feedback on these implementations is welcome; please let me know if further modifications are required.
Resolves #132465
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132049
Approved by: https://github.com/mikaylagawarecki
Co-authored-by: mikaylagawarecki <mikaylagawarecki@gmail.com>
So for functional autograd + CA, most nodes are inlined in aot autograd. But user-defined callables aren't safe to make_fx unless dynamo traces through them. The AOT backward must be inlined by dynamo time. We plan to directly insert calls to the backward in the graph:
- call prologue
- call bwd graph
- call epilogue
Restructuring our AOT bwd implementation will make this implementation easier.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139331
Approved by: https://github.com/zou3519
I'm sick of reductions not working properly - spotty dim coverage, missing backwards, etc. This PR fixes quite a bit.
It applies to the following ops:
* `sum` / `mean` / `prod`
* `all` / `any`
* `amin` / `amax`
* `min` / `max`
* `argmin` / `argmax`
The general reduction logic has been factored out into a helper `_apply_reduction(func, func_name, identity_element, *args, **kwargs)`. The idea is that by providing a valid identity element, we can utilize conversions to padded dense when needed for reducing over the ragged dim.
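As a rough illustration of the identity-element idea, here is a minimal sketch using the public nested-tensor padding API (this is not the actual `_apply_reduction` helper, just the underlying intuition):
```python
import torch

# Pad the ragged dim with the reduction's identity element so padded slots
# cannot affect the result, then reduce densely over that dim.
# The identity element for amax is -inf.
nt = torch.nested.nested_tensor([torch.randn(3), torch.randn(5)])
padded = torch.nested.to_padded_tensor(nt, float("-inf"))
ragged_max = padded.amax(dim=1)  # per-sequence max over the (formerly ragged) dim
```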
Extensive test coverage includes:
* reductions across ragged dim
* reductions across non-batch, non-ragged dims
* reductions across both batch and ragged dims
* multiple dim reductions (for ops that support this)
* full reduction -> scalar
Bonus: the PR includes backwards fixes for `sum` and `mean`, which have never worked.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139317
Approved by: https://github.com/cpuhrsch
Summary:
This PR adds an Auto-Trace implementation for Trace ID. By default, the python side will generate a uuid in the same format as the one set in the backend by kineto. Upon running an auto-trace, the python generated trace id will overwrite the one set in kineto using the Config variable. Since we don't expect users to generate on-demand traces after an auto-trace, we can simply keep overwriting the backend trace id whenever autotrace is run. If we eventually want to do something like this, we simply have to add a call in kineto on the backend to generate a new ID upon start of profiling.
We also implement a custom callback in the frontend such that users can generate their own trace ids if they wish to. This works similarly to the default, the only difference being that they have to manually set this callback after a profiler is generated. We use a specific call to set this rather than putting it in the frontend initializer in case users want to change the trace_id for different repeats.
Test Plan: Tested both default and custom callbacks using the verbose prints added. Trace ids on the frontend and the prints on the backend for the manifold upload matched.
Differential Revision: D65178308
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139310
Approved by: https://github.com/shengfukevin
According to the documentation, decay is a number in the [0, 1] range, [i.e.](https://pytorch.org/docs/stable/optim.html)
```
Decay is a parameter between 0 and 1 that controls how fast the averaged parameters are decayed. If not provided to get_ema_multi_avg_fn, the default is 0.999.
```
An inspection of `swa_utils.py` indicates there are no checks for invalid values of `decay`. Adding asserts, as this PR does, ensures a valid compute range (one way to enforce correct behavior; there are perhaps more suitable ones). The papers `torch` cites for the reference idea/implementation also consider exclusively this range (e.g., https://arxiv.org/pdf/2310.04415).
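A minimal sketch of the kind of range check being added (the helper name is illustrative, not the exact `swa_utils` code):
```python
import torch

def checked_ema_update(avg_param: torch.Tensor, param: torch.Tensor, decay: float) -> torch.Tensor:
    # Hypothetical helper: reject decay values outside [0, 1] before averaging.
    assert 0.0 <= decay <= 1.0, f"decay must be in [0, 1], got {decay}"
    return decay * avg_param + (1.0 - decay) * param
```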
Fixes https://github.com/pytorch/pytorch/issues/133772
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133773
Approved by: https://github.com/janeyx99
The special case was added during experimentation with batched send/recv
ops. The ops needed to be jointly scheduled or the simulator would
think that each op was unschedulable since each contained a recv that
depended on the other's send. The workaround I added was to let the
scheduler 'peek' one op ahead for unblocking, which let batched ops be
scheduled but also changed the behavior of non-batched ops. It let RECV
ops be simulated one step earlier than the unblocking SEND ops, which
shortened the simulated duration of schedules.
Removing this workaround simplifies the simulator but more importantly
lends itself to optimizing the runtime of the simulator by making it much
easier to avoid copying or extending lists of previous ops on each
iteration. It also restores the output of the simulator for non-batched
ops to a more natural output where RECV must happen at the same time or
later than matching SEND, rather than possibly a step earlier.
For example, for this test:
`python test/distributed/pipelining/test_schedule.py -k test_send_recv_test_info0`
Before:
```
Step 0: 0F0 1RECV_F0
Step 1: 0SEND_F0
Step 2: 0F1 1RECV_F1
Step 3: 0SEND_F1 1F0
Step 4: 0RECV_B0 1B0
Step 5: 0B0 1SEND_B0
Step 6: 1F1
Step 7: 0RECV_B1 1B1
Step 8: 0B1 1SEND_B1
```
After:
```
Rank 0 Rank 1
Step 00: 0F0
Step 01: 0SEND_F0 1RECV_F0
Step 02: 0F1
Step 03: 0SEND_F1 1RECV_F1
Step 04: 1F0
Step 05: 1B0
Step 06: 0RECV_B0 1SEND_B0
Step 07: 0B0 1F1
Step 08: 1B1
Step 09: 0RECV_B1 1SEND_B1
Step 10: 0B1
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138928
Approved by: https://github.com/H-Huang
These are not artificial patterns I came up with; they show up in the linear + CrossEntropyLoss graph.
Consider this snippet:
```
import torch.nn as nn

# B, T, C, V (batch, sequence length, hidden size, vocab size) are assumed to be defined.
class LinearAndCEL(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(C, V)
        self.ce = nn.CrossEntropyLoss()

    def forward(self, x, y):
        return self.ce(self.linear(x).view(B * T, V), y.view(-1))
```
`x` passed to `forward` is a 3D tensor of shape [B, T, C].
`self.linear` first views x as a [BxT, C] tensor, does the matmul to produce a [BxT, V] tensor, and then views this output back to a 3D tensor with shape [B, T, V]. The user code then adds another view op to convert the tensor back to shape [BxT, V]. This generates a pair of redundant views. A pair of redundant permutes shows up in the backward part when we compute gradients.
These view ops make it hard to chunk linear+CEL. When a view op breaks up the dimension being chunked, what should the chunker do (even if we merge those dimensions again later)? Removing these pointless view pairs makes the chunker simpler, and I think it's in general a nice thing to do.
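For concreteness, a small sketch of the redundant view pair described above (illustrative shapes only, not the actual generated graph):
```python
import torch

B, T, C, V = 2, 4, 8, 16  # illustrative sizes
x = torch.randn(B, T, C)
w = torch.randn(C, V)

out = (x.view(B * T, C) @ w).view(B, T, V)  # linear's internal flatten + un-flatten
flat = out.view(B * T, V)                   # user's view immediately undoes the previous one
```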
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139136
Approved by: https://github.com/Chillee, https://github.com/jansel
Summary: Move all the custom `_reduce_*` functions inside the FxGraphCachePickler class. This is mostly a cosmetic change since they're conceptually members of FxGraphCachePickler. But also in an upcoming diff, I'll add a member variable to the class to control how we handle constant tensors, so it will be convenient to be able to query that setting via `self`. I made the analogous changes to AOTAutogradCachePickler for consistency.
Test Plan: unit tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138683
Approved by: https://github.com/eellison
ghstack dependencies: #138681, #138682
Fixes#131020
As discussed in the issue thread, we can use `KINETO_DAEMON_INIT_DELAY_S` to delay the initialization of `kineto` in case `kineto` is initialized before `libtorch_cuda.so`.
It's not clear how to set a proper value for the environment variable `KINETO_DAEMON_INIT_DELAY_S`, so here's a trick to make the initialization of `kineto` happen after the initialization of the `torch` module. I'm not sure whether this is an acceptable trick; please take a look at this PR, thanks.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131448
Approved by: https://github.com/sraikund16, https://github.com/briancoutinho
Previously, we tried to sort SymInt strides to determine the stride
order. This PR makes the sorting more unbacked symint aware: given a Tensor
with sizes (u0, u1, u2), it has strides (u1 * u2, u1, 1), which is
sortable under the guard_size_oblivious assumptions.
Test Plan:
- test case
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137063
Approved by: https://github.com/eellison
Fixes the failure of INT8 DLRM using AOTI.
The previous code calculates `consts_size` directly using `tensor` from `graph.constants`:
```
consts_size = sum(
get_nbytes_of_tensor(tensor, all_cuda)
for (name, tensor) in graph.constants.items()
if name not in graph.folded_constants
)
```
Meanwhile, the actual bytes to serialize (`serialized_weights`) is using `graph.get_original_value_of_constant(name)`:
```
serialized_weights = b"".join(
_to_bytes(graph.get_original_value_of_constant(name), all_cuda)
for name in graph.constants.keys()
if name not in graph.folded_constants
)
```
`tensor` from `graph.constants` could be different from `graph.get_original_value_of_constant(name)` thus making the `consts_size` inconsistent with the actual byte size of the `serialized_weights`, resulting in runtime error `weights_offset must be aligned to 16K boundary`, similar to what happened in https://github.com/pytorch/pytorch/pull/135205.
This PR directly gets `consts_size` using `len(serialized_weights)`, which fixes the inconsistency.
We also added a `reduce_range` argument to the `get_default_x86_inductor_quantization_config` function, which is needed in the unit test to avoid accuracy issue on CI machines (earlier CPUs without VNNI).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139054
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/desertfire
Summary:
This diff reverts D65167805, which broke the release pipeline.
Test Plan: NA
Differential Revision: D65245198
@diff-train-skip-merge (to silence facebook-github-bot until I have a stamp to land this)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139371
Approved by: https://github.com/malfet
Fixes#131040
## Description
Add docs for `torch.inf` and `torch.nan`.
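For reference, a tiny usage sketch of the two constants being documented:
```python
import math
import torch

# torch.inf and torch.nan are plain Python floats, usable wherever a float is expected.
assert torch.inf == math.inf
assert math.isnan(torch.nan)
t = torch.tensor([torch.inf, -torch.inf, torch.nan])
```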
## Checklist
- [x] The issue that is being fixed is referred in the description (see above "Fixes #ISSUE_NUMBER")
- [x] Only one issue is addressed in this pull request
- [x] Labels from the issue that this PR is fixing are added to this pull request
- [x] No unnecessary issues are included into this pull request.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138430
Approved by: https://github.com/ezyang
# Motivation
This PR intends to update torch-xpu-ops commit pin. It mainly includes the following two highlighted changes:
1. split the DLL library into 4 smaller libraries to avoid the 2G limitation on Windows;
2. some new operators added, for example, `cdist`, `pdist`, `maxunpool2d`, `maxunpool3d`, `upsample_trilinear3d`, Bessel operators, etc.
# Additional Context
We have to supply XPU device check logic in `cdist` and `pdist` ops.
This PR depends on https://github.com/pytorch/pytorch/pull/139050 to fix Windows build issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139041
Approved by: https://github.com/EikanWang, https://github.com/ezyang
Summary: In an upcoming change, we need to modify FxGraphCachePickler to behave differently depending on whether the graph has frozen parameters. To do that, it will be convenient to change FxGraphCachePickler into a regular object instead of a collection of classmethods.
Test Plan: unit tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138682
Approved by: https://github.com/eellison
ghstack dependencies: #138681
This PR adds C shim for `QConvPointWisePT2E` and `QConvPointWiseBinaryPT2E` similar to https://github.com/pytorch/pytorch/pull/138439. Besides that, we aligned the implementation of `qconv_pointwise` with `qlinear_pointwise` in the following aspects:
1. The parameter orders of `qconv_pointwise` and `qlinear_pointwise` were quite different, so we aligned the schema of `qconv_pointwise` to have a similar parameter order to `qlinear_pointwise` for consistency.
2. We always convert `x_scale` and `x_zero_point` to Tensors, just like in the lowering of `qlinear_pointwise`. This avoids the need to create two separate C APIs (one for `double x_scale` and `int64_t x_zero_point`, and another for `Tensor` versions); instead, we only need one API for `Tensor`-based `x_scale` and `x_zero_point`. If we later add dynamic quantization for qconv (which will use `Tensor` for `x_scale` and `x_zero_point`), we can reuse the code from this PR and won't need to change the C shim layer API.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138540
Approved by: https://github.com/jgong5, https://github.com/desertfire
ghstack dependencies: #138691, #138806
Summary:
This diff/PR attempts to consolidate Triton caching into the Inductor caching so that there can be just one cache that unifies them both, reducing network requests and increasing success rate.
Implementation details can be found via reading the code or the post: https://fb.workplace.com/groups/1553867532149891/posts/1605037517032892
I did not use the Autotune bundler code at all since I want to simplify that and merge it into this on the next diff/PR.
In terms of instrumentation:
1) Dynamo compile: `triton_bundler_time_saved_s`, which is the sum of all triton.compile call times. We don't have to use the specific number; we can use this as a binary value.
2) Events table: I used dynamo_timed to measure how much time we spend on bundler collect and write functions which is all the work we do in this diff
3) TLParse: I emitted number of kernels and triton_bundler_time_saved_s into tlparse as well
Test Plan: Updated unit tests
Adhoc running
```
TORCHINDUCTOR_BUNDLE_TRITON_INTO_FX_GRAPH_CACHE=1 buck2 run @mode/opt //scripts/oulgen:runner
```
gives
https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpmTZt6b/0_0_0/fx_graph_cache_hit_4.json
<img width="771" alt="image" src="https://github.com/user-attachments/assets/478782a2-ee47-40cb-b723-fcac2bf9dd93">
Differential Revision: D64504909
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138239
Approved by: https://github.com/ezyang
# Motivation
The code changes in the `GpuStreamManager` class are intended to help manage `dnnl::stream` efficiently.
# Additional Context
Use the following code for a simple benchmark.
```python
import torch
import time
device = torch.device("xpu")
M, N, K = 64, 64, 64 # You can change these dimensions as needed
torch.manual_seed(0)
A = torch.randn(M, K, device=device)
B = torch.randn(K, N, device=device)
# Warm-up
for _ in range(10):
    torch.matmul(A, B)
s1 = torch.xpu.Stream()
s2 = torch.xpu.Stream()
# Measure the time for the GEMM operation
start_time = time.time()
with torch.xpu.stream(s1):
    for _ in range(50000):
        C = torch.matmul(A, B)
with torch.xpu.stream(s2):
    for _ in range(50000):
        D = torch.matmul(A, B)
torch.xpu.synchronize()
end_time = time.time()
# Calculate elapsed time
elapsed_time = end_time - start_time
# Print the results
print(f"Time taken for GEMM operation: {elapsed_time:.6f} seconds")
```
Compared with the old implementation, which takes 2.077069s, the new implementation takes 2.023017s, which is a ~2% performance improvement.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139176
Approved by: https://github.com/gujinghui, https://github.com/jgong5
Fixes https://github.com/pytorch/pytorch/issues/138920. See comments there for details.
I still need to try to get a smaller repro to write an actual test. But with the guards suppressed, I no longer see the specialization in the CA graph in the linked example:
```
aot1_view_3: ... = torch.ops.aten.view.default(aot1_tangents_1, [aot1_sym_size_int, 48, 1])
aot1_view_4: ... = torch.ops.aten.view.default(aot1_view_3, [aot1_sym_size_int, 48])
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138968
Approved by: https://github.com/yf225, https://github.com/xmfan
As discussed w/ @ezyang offline, one way to de-risk the `specialize_float=False` rollout is to specialize all backed symfloats that we fail to tensorify away. This diff does a few things:
1) It fixes a bug where item_memo gets dropped (due to incorrect epoch invalidation)
2) It updates the tensorify pass to do the backup specialization
This pass was originally part of the [PR](https://github.com/pytorch/pytorch/pull/137782) that flips `specialize_float=False` but we learned that the blast radius is simply too large. We've pivoted to a more milestone driven approach where we learn from the failures of the aforementioned PR and cherry pick fixes into main first. After this current PR lands our strategy is as follows:
1) Integrate turning off specialize float only in the automatic dynamic pass.
2) Put up a canary diff that only turns off specialize float in `backend=eager` mode to sniff out symfloat related bugs in dynamo due to code paths we previously never exercised.
3) Put up a canary diff that only turns off specialize float in `backend=aot_eager` mode to sniff out symfloat related bugs in aotautograd due to code paths we previously never exercised.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138868
Approved by: https://github.com/ezyang
The schedule simulator is useful for detecting hangs in schedules and
validating that they won't hang. It also inserts bubbles (None actions)
at any timestep where a rank cannot enqueue its next action due to
unmet dependencies, which can serve as a rough metric for schedule
efficiency. The output can be visualized. The simulator expects a full
comm + compute schedule as input.
Chrometrace dump is a basic visualization utility. It currently just
renders one 'process' per rank, and lets users visualize the schedule in
a UI instead of as text.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138134
Approved by: https://github.com/H-Huang
Reference: https://github.com/pytorch/pytorch/issues/138399
This PR introduces an `OpInfo` test that checks whether running each `out=` operation
using meta inputs is consistent with using concrete (e.g. CPU) inputs. More specifically,
it tests the case where the output tensors are not of the expected data type. According to
the `out=` specification, some operations should error.
I have added XFAIL to the set of operations that are currently failing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138515
Approved by: https://github.com/ezyang
Summary:
I think we can inplace a buffer if all of the users of said buffer are "inconsequential", defined as having been removed, being completed, or being part of the ancestors set. In particular, this allows LayerNorm to inplace its input buffer.
Implements:
https://github.com/pytorch/pytorch/issues/132826
Test Plan:
New unit test of matmul followed by LayerNorm, make sure there's an inplaced buffer.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138383
Approved by: https://github.com/eellison
This PR changes real_tensor_prop to also infer fake kernels when the
operator doesn't have it.
We infer the fake output to be of the same properties as the real
output, with unbacked symints in the sizes and some stride order.
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139213
Approved by: https://github.com/pianpwk
ghstack dependencies: #139212
Summary: According to export team's discussion, we are upgrading min_val and max_val to optional fields which shouldn't break BC and allows the schema to express infinity.
Test Plan: buck test mode/opt caffe2/test:test_export -- -r test_serialize_infinite_sym_int
Differential Revision: D65167805
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139223
Approved by: https://github.com/yiming0416
When we see a custom op:
- check that its mutation annotations are correct
- check that its aliasing constraints match our constraints for custom
ops.
Otherwise, there may be undefined behavior.
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139212
Approved by: https://github.com/angelayi
```
NOTE [lowering-time collective optimization]
In collective communication libraries such as NCCL, every rank maintains
communication buffers that are remotely accessible by some peers. Depending
on the underlying transport, remote accessibility may be established via
mechanisms such as ib_reg_mr, CUDA P2P, or CUDA multicast. Typically, these
buffers are private to the communication library by default, and
communication ops copy user data in and out of these buffers.
To prevent these copies, an optimization commonly known as "user buffer
registration" can be employed. This allows direct establishment of remote
accessibility on user buffers, eliminating the need for copying. However,
this optimization introduces stringent usage requirements, which are
typically hard to satisfy without being intrusive to the user code:
- Establishing remote accessibility is expensive and often done ahead of
time. In such implementations, all ranks must agree on the set of allocations
used for every collective op. Failing to meet this requirement can
lead to runtime errors or even silent correctness issues.
- Even if the collective communication library supports gracefully falling
back to "unregistered" implementations, the fallback mechanism would nullify
the optimization.
- Some communication mechanisms impose stricter requirements than others. For
example, CUDA's multicast + multi-mem instructions require all ranks to agree
not only on the allocations used for every collective but also on the offsets
within these allocations.
To support all different mechanisms with optimal results, we aim to satisfy
the strictest requirement for this family of optimizations - we ensure that
every collective op invocation is guaranteed to operate on the same
allocation, at the same offset, in every iteration.
For eligible collective ops, we identify communication buffers at lowering
time and optionally choose to lower the op to a different kernel
(communication libraries like NCCL handle both registered and non-registered
buffers transparently within the same op, though some may require different
ops for different cases). Later, the codegen will perform "persistent
allocation" to satisfy the aforementioned constraints, and optionally,
perform buffer planning to optimize overall memory usage.
```
### Changes
- Created `comm_lowering.py` for the lowerings of `_c10d_functional` ops. This is to prevent cluttering `lowering.py` as we add more lowering-time collective optimizations. This PR moved the lowerings for `all_reduce` and `all_reduce_` to the file.
- Added `comm_buffer_type: Dict[str, str]` to `GraphLowering` to track whether a buffer is a comm buffer and the type of the comm buffer.
- Added codegen allocation support for comm buffers of type "symm_mem".
- Added support for auto-lowering `_c10d_functional.all_reduce_` to `symm_mem.one_shot_all_reduce`.
- Added an Inductor config for collective optimizations in general (`config._collective`).
### Limitation
Currently, each persistently allocated comm buffer is dedicated to a single callsite. This is not viable in terms of memory usage. However, this is a necessary intermediate state before we tackle memory planning for comm buffers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138029
Approved by: https://github.com/Chillee
ghstack dependencies: #138028
`dcp.load()` is documented as "operating in place", updating the state of existing state_dict elements instead of replacing them wherever possible. However, it appears that in the case of a stateful element, the code both updates its state in-place, then replaces it with a copy of itself in the state_dict. This looks like a simple oversight, so here's a PR that should fix it!
[From the docs:](https://pytorch.org/docs/stable/distributed.checkpoint.html)
> DCP is different than torch.save and torch.load in a few significant ways: *...*
> - It operates in place, meaning that the model should allocate its data first and DCP uses that storage instead.
This manifested as a strange bug in TorchTitan, causing a model loaded from a checkpoint to be saved incorrectly, resulting in a twice-resumed model being subtly broken.
Let me know if this makes sense, and if there's anything else I should add!
Thanks for all the work on PyTorch!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138575
Approved by: https://github.com/kwen2501, https://github.com/fegin
Fixes #138742. In the issue, matrix multiplication with DTensor failed when the size of one of the mesh dimensions is 1 and the mesh is > 1D. We were missing tests covering this corner case where mesh_shape is (n, 1) or (1, n). The DTensor mm op is correct when the 1D mesh is of shape (self.world_size,) or for a 2D mesh where none of the mesh dimensions has size 1.
In this PR, we fixed the corner case by updating `gen_einsum_strategies` in `_einsum_strategy.py`. Specifically, we cannot skip generating `mesh_dim_strategies` when `mesh_dim <= 1`, as this is not valid for an nD mesh where one of the mesh dimension sizes is 1.
Without the fix, the OpStrategy generated for a 2D mesh with mesh_shape (1, n) or (n, 1) is wrong, as the generated OpStrategy is 1D.
```
all_mesh_dim_strategies=[[[Replicate(), Replicate(), Replicate()], [Partial(sum), Shard(dim=1), Shard(dim=0)], [Shard(dim=0), Shard(dim=0), Replicate()], [Shard(dim=1), Replicate(), Shard(dim=1)]]]
OpStrategy(all_strategies)::: [(R, R) -> R, (S(1), S(0)) -> P, (S(0), R) -> S(0), (R, S(1)) -> S(1)] @ mesh: (4, 1)[(R, R) -> R, (S(1), S(0)) -> P, (S(0), R) -> S(0), (R, S(1)) -> S(1)] @ mesh: (4, 1)
```
After the fix, we can see the OpStrategy generated is correct with 2D strategy.
```
all_mesh_dim_strategies=[[[Replicate(), Replicate(), Replicate()], [Partial(sum), Shard(dim=1), Shard(dim=0)], [Shard(dim=0), Shard(dim=0), Replicate()], [Shard(dim=1), Replicate(), Shard(dim=1)]]][[[Replicate(), Replicate(), Replicate()], [Partial(sum), Shard(dim=1), Shard(dim=0)], [Shard(dim=0), Shard(dim=0), Replicate()], [Shard(dim=1), Replicate(), Shard(dim=1)]]]
OpStrategy(all_strategies) = [(RR, RR) -> RR, (RS(1), RS(0)) -> RP, (RS(0), RR) -> RS(0), (RR, RS(1)) -> RS(1), (S(1)R, S(0)R) -> PR, (S(1)S(1), S(0)S(0)) -> PP, (S(1)S(0), S(0)R) -> PS(0), (S(1)R, S(0)S(1)) -> PS(1), (S(0)R, RR) -> S(0)R, (S(0)S(1), RS(0)) -> S(0)P, (S(0)S(0), RR) -> S(0)S(0), (S(0)R, RS(1)) -> S(0)S(1), (RR, S(1)R) -> S(1)R, (RS(1), S(1)S(0)) -> S(1)P, (RS(0), S(1)R) -> S(1)S(0), (RR, S(1)S(1)) -> S(1)S(1)] @ mesh: (4, 1)
```
*******
As a follow up, we should add more test coverage for DTensor op with 2D mesh and 2D mesh with one of the size of mesh dimension being 1.
*******
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139134
Approved by: https://github.com/fegin
**Summary**
Remove the redundant methods of X86 Inductor Quantizer: `get_supported_quantization_configs`, `get_supported_operator_for_quantization_config` and `get_supported_operators`. They are not required to implement a customized Quantizer and are not mentioned in the existing documentation on how to use X86 Inductor Quantizer.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139161
Approved by: https://github.com/jgong5
`mask` is already defined as `uint16x8_t`, so there is no need to reinterpret it
bd369bb182/aten/src/ATen/cpu/vec/vec128/vec128_half_neon.h (L220)
Fixes
```
var/lib/jenkins/workspace/aten/src/ATen/cpu/vec/vec128/vec128_half_neon.h: In static member function 'static at::vec::DEFAULT::Vectorized<c10::Half> at::vec::DEFAULT::Vectorized<c10::Half>::set(const at::vec::DEFAULT::Vectorized<c10::Half>&, const at::vec::DEFAULT::Vectorized<c10::Half>&, int64_t)':
/var/lib/jenkins/workspace/aten/src/ATen/cpu/vec/vec128/vec128_half_neon.h:227:39: error: cannot convert 'uint16x8_t' to 'float16x8_t'
227 | vreinterpretq_u16_f16(mask),
| ^~~~
| |
| uint16x8_t
In file included from /var/lib/jenkins/workspace/aten/src/ATen/cpu/vec/intrinsics.h:23,
from /var/lib/jenkins/workspace/aten/src/ATen/cpu/vec/vec128/vec128.h:4,
from /var/lib/jenkins/workspace/aten/src/ATen/cpu/vec/vec.h:6,
from /var/lib/jenkins/workspace/aten/src/ATen/test/vec_test_all_types.h:2,
from /var/lib/jenkins/workspace/aten/src/ATen/test/vec_test_all_types.cpp:1:
/usr/lib/gcc/aarch64-linux-gnu/11/include/arm_neon.h:5841:36: note: initializing argument 1 of 'uint16x8_t vreinterpretq_u16_f16(float16x8_t)'
5841 | vreinterpretq_u16_f16 (float16x8_t __a)
| ~~~~~~~~~~~~^~~
```
introduced by https://github.com/pytorch/pytorch/pull/137911
Also, guard any use of NEON intrinsics in `ReducedPrecisionFloatGemvFastPathKernel.cpp` with `!defined(CPU_CAPABILITY_SVE)` otherwise compilation fails with
```
/var/lib/jenkins/workspace/aten/src/ATen/native/cpu/ReducedPrecisionFloatGemvFastPathKernel.cpp: In function 'float at::native::SVE256::reduce(at::vec::SVE256::VectorizedN<c10::Half, 16>&)':
/var/lib/jenkins/workspace/aten/src/ATen/native/cpu/ReducedPrecisionFloatGemvFastPathKernel.cpp:77:24: error: cannot convert 'at::vec::SVE256::Vectorized<float>' to 'float32x4_t'
77 | return vaddvq_f32(t0 + t1);
| ~~~^~~~
| |
| at::vec::SVE256::Vectorized<float>
In file included from /var/lib/jenkins/workspace/c10/util/Half.h:51,
from /var/lib/jenkins/workspace/c10/util/Float8_e5m2.h:17,
from /var/lib/jenkins/workspace/c10/core/ScalarType.h:8,
from /var/lib/jenkins/workspace/c10/core/TensorImpl.h:11,
from /var/lib/jenkins/workspace/c10/core/GeneratorImpl.h:8,
from /var/lib/jenkins/workspace/aten/src/ATen/core/Generator.h:18,
from /var/lib/jenkins/workspace/aten/src/ATen/CPUGeneratorImpl.h:3,
from /var/lib/jenkins/workspace/aten/src/ATen/Context.h:4,
from /var/lib/jenkins/workspace/aten/src/ATen/native/cpu/ReducedPrecisionFloatGemvFastPathKernel.cpp:2,
from /var/lib/jenkins/workspace/build/aten/src/ATen/native/cpu/ReducedPrecisionFloatGemvFastPathKernel.cpp.SVE256.cpp:1:
/usr/lib/gcc/aarch64-linux-gnu/11/include/arm_neon.h:10423:25: note: initializing argument 1 of 'float32_t vaddvq_f32(float32x4_t)'
10423 | vaddvq_f32 (float32x4_t __a)
| ~~~~~~~~~~~~^~~
In file included from /var/lib/jenkins/workspace/build/aten/src/ATen/native/cpu/ReducedPrecisionFloatGemvFastPathKernel.cpp.SVE256.cpp:1:
/var/lib/jenkins/workspace/aten/src/ATen/native/cpu/ReducedPrecisionFloatGemvFastPathKernel.cpp: In function 'float at::native::SVE256::reduce(at::vec::SVE256::Vectorized<float>)':
/var/lib/jenkins/workspace/aten/src/ATen/native/cpu/ReducedPrecisionFloatGemvFastPathKernel.cpp:119:21: error: cannot convert 'at::vec::SVE256::Vectorized<float>' to 'float32x4_t'
119 | return vaddvq_f32(x);
| ^
| |
| at::vec::SVE256::Vectorized<float>
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139235
Approved by: https://github.com/huydhn
Summary:
Added logging of where constraint violations are being added during draft export.
Example output:
```
1. Constraint violation error.
The specified input dynamic_shapes spec was found to be incorrect during tracing.
Specifically, this guard was added: Eq(s0, 3), where {'s0': "L['args'][0][0].size()[0]"}.
This occured at the following stacktrace:
File /data/users/angelayi/fbsource/buck-out/v2/gen/fbcode/1beb9df83fd74b9a/scripts/angelayi/draft_export/__test_draft_export__/test_draft_export#link-tree/torch/nn/modules/module.py, lineno 1736, in _wrapped_call_impl
File /data/users/angelayi/fbsource/buck-out/v2/gen/fbcode/1beb9df83fd74b9a/scripts/angelayi/draft_export/__test_draft_export__/test_draft_export#link-tree/torch/nn/modules/module.py, lineno 1747, in _call_impl
File /data/users/angelayi/fbsource/buck-out/v2/gen/fbcode/1beb9df83fd74b9a/scripts/angelayi/draft_export/__test_draft_export__/test_draft_export#link-tree/scripts/angelayi/draft_export/test_draft_export.py, lineno 138, in forward.
Because of this, we have modified the dynamic shapes structure to be the following:
    dynamic_shapes = {'a': {0: 3}}
```
The result of this diff is also that `dynamic` logs are permanently turned on during draft export. Otherwise we cannot capture the `[guard added]` logs from symbolic_shapes.py.
Test Plan: `buck2 run @//mode/dev-nosan scripts/angelayi/draft_export:test_draft_export -- -r "test_shape_failure" `
Differential Revision: D64862374
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138748
Approved by: https://github.com/ezyang
Canonically, the empty_cache API releases all cached blocks of the CUDACachingAllocator. There is no API that can release only the cached blocks of a given pool.
In this PR, we extend the functionality of empty_cache API such that it only releases the cached blocks of an active pool. When empty_cache API is called under a MemPoolContext, we only release the cached blocks that correspond to the pool id of the active pool.
Part of https://github.com/pytorch/pytorch/issues/124807.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133602
Approved by: https://github.com/ezyang
Summary:
These tests are backed by an older version of `typing_extensions`, but this runtime could not care less about type-checking.
So pretend that it has `TypeIs` by replacing it with `TypeGuard`.
Fixes test failures introduced by https://github.com/pytorch/pytorch/pull/133814 / D65030974
Test Plan: `buck2 test 'fbcode//mode/opt' fbcode//multipy/runtime:test_deploy -- --exact 'multipy/runtime:test_deploy - TorchpyTest.TestNumpy'`
Differential Revision: D65145409
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139195
Approved by: https://github.com/Skylion007
Fixes#138761
Add a test file for _building.py to verify and guarantee the correct behavior of OpRecorder. Note that the tests do not validate the model itself, but the expected behavior of the evaluator adding extra ops during input preprocessing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139107
Approved by: https://github.com/justinchuby
This modifies the config system to use a single mapping of config ->
ConfigEntry and to store the default and user values within them.
We could have used multiple dicts (i.e. user_override and default), but
as we add more fields (justknobs in this PR, perhaps testing and env
variables later), it quickly becomes painful.
There are a couple design decisions we could change.
1) All configs we save store the resolved value - not the default and
user override separately.
2) All configs we load apply the resolved value as a user override.
This means that certain complexities of default behaviour and deletion
(as well as JK) will change if you save + load a config.
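For intuition, a minimal sketch of what such an entry could look like (hypothetical names, not the actual torch config internals):
```python
from dataclasses import dataclass
from typing import Any

_UNSET = object()  # sentinel meaning "no user override"

@dataclass
class ConfigEntry:
    default: Any
    user_override: Any = _UNSET

    def resolve(self) -> Any:
        # The user override wins; otherwise fall back to the default.
        return self.default if self.user_override is _UNSET else self.user_override

configs = {"some_flag": ConfigEntry(default=False)}
configs["some_flag"].user_override = True
assert configs["some_flag"].resolve() is True
```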
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138758
Approved by: https://github.com/ezyang
**Background:** The `@parametrize` decorator enjoys widespread usage as a convenient tool for ensuring extensive test coverage. One particular feature that makes this easy is the ability to stack such decorators, testing over the cross-product of inputs. Example:
```python
class MyTestClass(TestCase):
    @parametrize("x", range(3))
    @parametrize("y", [False, True])
    def test_foo(self, x, y):
        # Invoked with:
        # x=0, y=False
        # x=1, y=False
        # x=2, y=False
        # x=0, y=True
        # x=1, y=True
        # x=2, y=True
        ...
```
Note that the `@ops` and `@modules` decorators employ the same underlying machinery for parametrizing over `OpInfo` / `ModuleInfo` entries. These decorators also parametrize over op-specific `device` / `dtype` info *according to what is supported for each op*.
```python
class MyTestClass(TestCase):
    @ops(op_db)
    def test_foo(self, op, device, dtype):
        # Invoked for each OpInfo in the db along with each device / dtype that corresponds
        # with this op according to the OpInfo entry.
        ...
```
Note that this in contrast to the naive cross product between ops and devices / dtypes, which would generate too many tests. Certain use cases benefit from a similar type of flexible parametrization that is more intelligent than simple cross-product composition. It is expensive to generate / run too many tests, even if the unneeded ones are skipped appropriately.
This PR attempts to generalize such flexible parametrization and satisfy these use cases through the introduction of a `@reparametrize` decorator, which operates on an existing parametrizer and allows for customized on-the-fly parametrization through the use of an `adapter_fn`. Examples:
```python
# adapter_fn that adds a new arg
def include_is_even_arg(test_name, param_kwargs):
    x = param_kwargs["x"]
    is_even = x % 2 == 0
    new_param_kwargs = dict(param_kwargs)
    new_param_kwargs["is_even"] = is_even
    is_even_suffix = "_even" if is_even else "_odd"
    new_test_name = f"{test_name}{is_even_suffix}"
    yield (new_test_name, new_param_kwargs)

# adapter_fn that excludes certain values
def exclude_odds(test_name, param_kwargs):
    x = param_kwargs["x"]
    is_even = x % 2 == 0
    yield None if not is_even else (test_name, param_kwargs)

class MyTestClass(TestCase):
    @reparametrize(parametrize("x", range(5)), include_is_even_arg)
    def test_foo(self, x, is_even):
        # Invoked with both the x value and the new is_even arg
        ...

    @reparametrize(parametrize("x", range(5)), exclude_odds)
    def test_bar(self, x):
        # Only invoked with even x values
        ...
```
For a more real-world use case, imagine you want to write a set of OpInfo tests that parametrize over additional op-specific things beyond `device` / `dtype` (in NJT's case, this includes contiguity type, whether to operate over the batch / ragged / other dims, etc.). The `@reparametrize` decorator allows you to customize the `@ops` parametrization to add in these additional args as they make sense on a per-op basis.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138369
Approved by: https://github.com/janeyx99
# Add workspace buffer support for Triton templates
## Summary
Adds support for templates to allocate and use temporary workspace buffers
## Key Changes
- Add `WorkspaceArg` support in Triton template system
- Automatic workspace allocation/deallocation around kernel execution
- Zero-initialization support for workspace buffers
- Seamless integration with existing tensor management
## Example Usage
```python
def generate(self, ...):
    workspace_arg = WorkspaceArg(
        count=1024 * 1024,  # 1MB workspace
        zero_fill=True,     # Zero-initialized
    )
    return TritonTemplateCaller(..., workspace_arg=workspace_arg)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138050
Approved by: https://github.com/Chillee, https://github.com/eellison
**Summary**
This PR fixes a calculation error in DeviceMesh's create_sub_mesh().
**Error Description**
When users call `device_mesh["dim0", "dim1", "dim2", "dim3"]`, it creates a slice of the mesh, which we call a "submesh". Users can also slice a submesh from a flattened mesh. For example:
```
flattened_mesh = device_mesh["dim0", "dim1", "dim2"]._flatten("dim0-2")
alias_flattened_mesh = device_mesh["dim0-2"] # this mesh slice leads to error in current impl
```
It triggers the error in the size calculation `reduce(lambda, mesh_dim)` happening in `create_sub_mesh`:
```
IndexError: Dimension out of range (expected to be in range of [-4, 3], but got 4)
```
**Fix**
The lambda arguments were used incorrectly: in `reduce`'s `lambda x, y`, `x` is the accumulated value while `y` is the current element from the iterable.
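For illustration, with `functools.reduce` (illustrative numbers, not the actual DeviceMesh code):
```python
from functools import reduce

mesh_shape = (4, 1, 2)        # hypothetical mesh sizes
dims_to_flatten = (0, 1, 2)   # mesh dims being flattened

# acc is the running product; dim is the next mesh dim index from the iterable.
flattened_size = reduce(lambda acc, dim: acc * mesh_shape[dim], dims_to_flatten, 1)
assert flattened_size == 8
```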
**Test**
`pytest test/distributed/test_device_mesh.py -s -k test_flatten_mesh_4d`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138945
Approved by: https://github.com/wz337
Fixes#113564
When I used PyTorch's profiler to analyze the performance of vLLM, I encountered the following error. This error is similar to #113564. After analysis and troubleshooting, I changed the temporary file from text mode to binary mode, and it no longer reported an error and ran normally.
```bash
ERROR 10-28 10:25:50 engine.py:160] File "/usr/local/lib/python3.12/dist-packages/torch/profiler/profiler.py", line 722, in stop
ERROR 10-28 10:25:50 engine.py:160] self._transit_action(self.current_action, None)
ERROR 10-28 10:25:50 engine.py:160] File "/usr/local/lib/python3.12/dist-packages/torch/profiler/profiler.py", line 751, in _transit_action
ERROR 10-28 10:25:50 engine.py:160] action()
ERROR 10-28 10:25:50 engine.py:160] File "/usr/local/lib/python3.12/dist-packages/torch/profiler/profiler.py", line 745, in _trace_ready
ERROR 10-28 10:25:50 engine.py:160] self.on_trace_ready(self)
ERROR 10-28 10:25:50 engine.py:160] File "/usr/local/lib/python3.12/dist-packages/torch/profiler/profiler.py", line 444, in handler_fn
ERROR 10-28 10:25:50 engine.py:160] prof.export_chrome_trace(os.path.join(dir_name, file_name))
ERROR 10-28 10:25:50 engine.py:160] File "/usr/local/lib/python3.12/dist-packages/torch/profiler/profiler.py", line 220, in export_chrome_trace
ERROR 10-28 10:25:50 engine.py:160] fout.writelines(fin)
ERROR 10-28 10:25:50 engine.py:160] File "<frozen codecs>", line 322, in decode
ERROR 10-28 10:25:50 engine.py:160] UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8e in position 5896: invalid start byte
```
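A minimal sketch of the change described above (illustrative paths, not the exact `profiler.py` code): copy the intermediate trace file in binary mode so the raw bytes pass through without UTF-8 decoding.
```python
# Binary mode copies the bytes verbatim; text mode would try to decode them as
# UTF-8 and can raise UnicodeDecodeError, as in the traceback above.
with open("trace_tmp.json", "rb") as fin, open("trace.json", "wb") as fout:
    fout.writelines(fin)
```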
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139062
Approved by: https://github.com/ezyang
The current DDP quantization hooks use the .cuda() API to move tensors and parameters onto backend devices. This limits the DDP quantization hooks to the CUDA backend only.
The change makes the code backend agnostic by moving tensors/parameters based on **tensor.device**.
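A minimal sketch of the idea (hypothetical helper, not the exact hook code):
```python
import torch

def move_like(t: torch.Tensor, ref: torch.Tensor) -> torch.Tensor:
    # Place t on whatever device the reference (bucket) tensor lives on,
    # instead of hard-coding .cuda().
    return t.to(ref.device)
```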
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138816
Approved by: https://github.com/kwen2501
Fixes the error of running WOQ-INT8 LLaMA:
```
E In file included from /home/user/inductor/pytorch/torch/include/torch/csrc/inductor/aoti_runtime/arrayref_tensor.h:3,
E from /tmp/torchinductor_user/sw/csw5gfmlzp5iooqvfwl2gwn574frwdpmtrx2y6nu2m6x76d3xcux.cpp:4:
E /tmp/torchinductor_user/sw/csw5gfmlzp5iooqvfwl2gwn574frwdpmtrx2y6nu2m6x76d3xcux.cpp: In function ‘void inductor_entry_impl(AtenTensorOpaque**, AtenTensorOpaque**)’:
E /tmp/torchinductor_user/sw/csw5gfmlzp5iooqvfwl2gwn574frwdpmtrx2y6nu2m6x76d3xcux.cpp:117:33: error: ‘aoti_torch_cpu__weight_int8pack_mm’ was not declared in this scope
E 117 | AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_cpu__weight_int8pack_mm(convert_arrayref_tensor_to_tensor(arg8_1), _frozen_param0, _frozen_param1, &buf0_handle));
E | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138691
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/desertfire
**MOTIVATION**
We recently integrated support for Intel Gaudi devices (identified as 'hpu') into the common_device_type framework via the pull request at https://github.com/pytorch/pytorch/pull/126970. This integration allows tests to be automatically instantiated for Gaudi devices upon loading the relevant library. Building on this development, the current pull request extends the utility of these hooks by adapting selected CUDA tests to operate on Gaudi devices. Additionally, we have confirmed that these modifications do not interfere with the existing tests on CUDA devices.
**CHANGES**
- Add support for HPU devices within the test_move_exported_model_bn using TEST_HPU flag
- Use instantiate_device_type_tests with targeted attributes to generate device-specific test instances.
- Apply skipIfHPU decorator to bypass tests that are not yet compatible with HPU devices.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137863
Approved by: https://github.com/jerryzh168
Summary:
`torch.fx.Interpreter.run()` only takes args as input. Currently we pass kwargs as well, which causes errors during retracing.
Flattening the kwargs and concatenating them with args solves the issue.
Several previously failing tests under `_retraceability_non_strict` now pass.
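Roughly, under simplified assumptions (a sketch of the idea, not the exact retracing code):
```python
import torch
from torch.fx import Interpreter, symbolic_trace

def f(x, scale):
    return x * scale

gm = symbolic_trace(f)
args, kwargs = (torch.randn(3),), {"scale": 2.0}
# Interpreter.run() accepts positional args only, so flatten the kwargs and
# append them in placeholder order before running.
flat_args = (*args, *kwargs.values())
out = Interpreter(gm).run(*flat_args)
```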
Test Plan:
```
buck2 test @//mode/dev-nosan //caffe2/test:test_export -- -r _retraceability_non_strict
```
Differential Revision: D64980053
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138927
Approved by: https://github.com/angelayi
As older versions are affected by CVE-2024-6345
Also, update `typing_extensions` to 4.11 to support `TypeIs`, otherwise some of the workflows report following error (but succeed somehow), see [this](https://github.com/pytorch/pytorch/actions/runs/11566785190/job/32196549021):
```
2024-10-29T03:55:01.3601410Z + /Users/ec2-user/runner/_work/_temp/miniconda/bin/conda run -p /Users/ec2-user/runner/_work/_temp/conda_environment_11566785190 --no-capture-output python3 -c 'import torch'
2024-10-29T03:55:01.3602260Z ~/runner/_work/_temp ~/runner/_work/pytorch/pytorch
2024-10-29T03:55:01.8043630Z Traceback (most recent call last):
2024-10-29T03:55:01.8044540Z File "<string>", line 1, in <module>
2024-10-29T03:55:01.8045670Z File "/Users/ec2-user/runner/_work/_temp/conda_environment_11566785190/lib/python3.9/site-packages/torch/__init__.py", line 37, in <module>
2024-10-29T03:55:01.8046690Z from typing_extensions import ParamSpec as _ParamSpec, TypeIs as _TypeIs
2024-10-29T03:55:01.8048010Z ImportError: cannot import name 'TypeIs' from 'typing_extensions' (/Users/ec2-user/runner/_work/_temp/conda_environment_11566785190/lib/python3.9/site-packages/typing_extensions.py)
```
Also delete macOS-X86 as we no longer build those
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139144
Approved by: https://github.com/Skylion007, https://github.com/kit1980, https://github.com/huydhn
This PR aims to support the following use case:
```python
def all_reduce_eager(x):
    y = x * x
    req = dist.all_reduce(y, op=dist.ReduceOp.SUM, async_op=True)
    assert isinstance(req, torch.distributed.Work)
    return y

@torch.compile(fullgraph=True)
def all_reduce_wait_compiled(y):
    torch.ops.c10d_functional.wait_tensor(y)
    return y * y

x = torch.ones(1280, 1280, device="cuda") + self.rank
with allow_inflight_collective_as_graph_input_ctx():
    y = all_reduce_eager(x)
    z = all_reduce_wait_compiled(y)
```
where the collective is issued in eager (with `async_op=True`) but waited in compiled region.
This is important for internal use cases such as TorchRec, where we issue collectives in eager for SparseArch all_to_all but want to wait for them in compiled region at beginning of OverArch, so that the all_to_all can be overlapped with the DenseArch compute that runs in parallel.
----
**Update**: Did two items to prevent regression to existing use cases:
1. Added memory-stressed test case to test_c10d_nccl.py `test_unwaited` to cover existing user's "not calling work.wait() for non-functional collective" use case
2. Gated all new `register_work()` / `unregister_work()` calls with `c10d::allow_inflight_collective_as_graph_input()` check, which is a new context manager that requires explicit user enablement (i.e. not on by default, so should not affect existing users).
The risk of this new version of PR causing regression should be very low.
------
Test commands:
- `pytest -rA test/distributed/test_inductor_collectives.py::TestCollectivesMultiProc::test_eager_async_allreduce_inductor_wait`
- `pytest -rA test/test_fx.py::TestDCE::test_keep_collectives`
- `pytest -rA test/test_fx.py::TestDCE::test_keep_collectives_no_overload`
- `pytest -rA test/distributed/test_c10d_functional_native.py::TestWithNCCL::test_wait_tensor`
- `pytest -rA test/distributed/test_c10d_functional_native.py::TestWithNCCL::test_unwaited`
- `pytest -rA test/distributed/test_c10d_nccl.py::CommTest::test_wait_tensor`
- `pytest -rA test/distributed/test_c10d_nccl.py::CommTest::test_unwaited`
- `pytest -rA test/distributed/_tensor/test_tensor_ops.py::DistTensorOpsTest::test_equal`
- `pytest -rA test/distributed/_tensor/test_random_ops.py::DistTensorRandomOpTest::test_manual_seed`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_ddp_baseline_aot_eager_multiprocess`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_aot_eager`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_setattr`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_unspecialized_forced_getattr_inline`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_unspecialized_forced_getattr_no_inline`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_asymmetric_compilation`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_automatic_dynamic_scalar`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_automatic_dynamic_speculation_divergence`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_automatic_dynamic_tensor`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_dim_mismatch`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_graph_break_empty_graph_still_collective`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_missing_source`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_scalar_missing_source`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_type_mismatch`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_ddp_activation_checkpointing`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_ddp_baseline_aot_eager_multiprocess`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_activation_checkpointing`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_aot_eager`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_inductor`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_setattr`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_unspecialized_forced_getattr_inline`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_unspecialized_forced_getattr_no_inline`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_aot_eager`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_aot_eager_static_graph`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_inductor`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_inductor_static_graph`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_fsdp_activation_checkpointing`
- `pytest -rA test/distributed/_tensor/test_experimental_ops.py::DistOtherOpsTest::test_bernoulli`
- `pytest -rA test/distributed/_tensor/test_dtensor_compile.py::TestDTensorCompileE2E::test_tp_compile_fullgraph_is_seq_parallel_True`
- `pytest -rA test/distributed/test_inductor_collectives.py::TestCollectivesMultiProc::test_allreduce_inductor_cudagraph_trees`
- `python benchmarks/dynamo/torchbench.py --ci --accuracy --timing --explain --inductor --device cuda --inference --bfloat16 --total-partitions 2 --partition-id 1 --output inference_torchbench.csv --only moco`
------
Differential Revision: [D65023311](https://our.internmc.facebook.com/intern/diff/D65023311)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137763
Approved by: https://github.com/yifuwang
Canonically, the snapshot API returns the entire memory state of the CUDACachingAllocator (using `get_all_blocks`). There is no API that can only return the memory state of a given pool.
In this PR, we extend the functionality of snapshot API such that it can only return the memory addresses of an active pool. When snapshot API is called under a MemPoolContext, we only return the blocks that correspond to the pool id of the active pool.
Part of https://github.com/pytorch/pytorch/issues/124807.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133601
Approved by: https://github.com/ezyang
Previously the decomposition would upcast inputs to fp32. This led to a slowdown compared to eager, which would run in fp16. We also tried keeping the bmm in fp16 and upcasting only for the epilogue, but that led to worse numerics because the bmm in eager would do the epilogue all in fp32 without a downcast in the bmm accumulator.
Fix for https://github.com/pytorch/pytorch/issues/137897
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137904
Approved by: https://github.com/ngimel
We should only pass the `device_id` when the backend is `nccl`. Otherwise, we would run into the following error:
```
RuntimeError: No backend for the parent process group or its backend does not support splitting
```
This also fixes the issue that test failures are not asserted when using `with_comms()` or `with_comms(eager_init=False)`.
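A minimal sketch of the guard described above (hypothetical helper, not the exact test-harness code):
```python
import torch
import torch.distributed as dist

def init_pg(backend: str, rank: int, world_size: int) -> None:
    # Only pass device_id for nccl; other backends may not support comm splitting.
    device_id = torch.device("cuda", rank) if backend == "nccl" else None
    dist.init_process_group(backend=backend, rank=rank, world_size=world_size, device_id=device_id)
```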
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139097
Approved by: https://github.com/XilunWu
This adds host-side Triton TMA support to AOTInductor. Notes:
- Two helper functions, `init1DTMADescriptor` and `init2DTMADescriptor` are added to the C++ wrapper codegen on GPU, conditioned on the model having user-defined Triton kernels with host-side TMA (CUDA-specific).
- C++ wrapper codegen on GPU emits TMA descriptor initialization via the aforementioned helper functions.
- Special handling added for the TMA descriptors (in the Python wrapper codegen) during the compile-time autotuning, as the underlying tensor can't be passed directly to the user-defined Triton kernel. TMA descriptors are generated in-between the source tensor's buffer and the kernel call, like in the full Python wrapper codegen.
- This PR concludes the host-side Triton TMA support in PT2.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138878
Approved by: https://github.com/desertfire, https://github.com/chenyang78
ghstack dependencies: #138759, #138877
In this PR, I make test_export compatible with training IR. The idea is that when we flip the IR to non-functional training IR, all these tests should be green. The changes involve reading through the test cases and adding the necessary decompositions etc. to make sure the tests pass. For example, if the tests expect to see mutated buffers returned, we need to get them via running run_decomp.
Differential Revision: [D64732360](https://our.internmc.facebook.com/intern/diff/D64732360)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138517
Approved by: https://github.com/avikchaudhuri
This PR enables you to inspect PyObjects in C using `INSPECT(...)` without requiring https://docs.python.org/3/howto/gdb_helpers.html. `torch._dynamo.eval_frame.raise_sigtrap` can also be used to set gdb breakpoints while running Python code, e.g.
```python
x = x + 1
torch._dynamo.eval_frame.raise_sigtrap()
# can breakpoint on ceval.c:CALL to breakpoint the `sin` call in C.
x = torch.sin(x)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138030
Approved by: https://github.com/jansel
### Summary
The fake impl for `nonzero` sets the symint's upper range to `sys.maxsize - 1` if there are any SymInts in the original input tensor shape. This PR constrains the range more intelligently by using the upper ranges of each SymInt in the input tensor shape.
See https://github.com/pytorch/pytorch/pull/134899 as a merged solution for a similar problem for a different op.
### Test plan
Added unit test to verify upper bound reduction calculation (`python test/export/test_export.py TestExport.test_nonzero_dynamic`)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137663
Approved by: https://github.com/ezyang
This PR adds FlexAttention + NJT support. In particular:
* To handle raggedness, treats the packed sequence dim of input NJTs as a giant "stacked sequence". To ensure user `score_mod` / `mask_mod` functions can still be written in the original NJT sequence space, this PR handles conversions for indices within the giant "stacked sequence" -> sequence relative indices automatically.
* Provides `py_impls` for `NestedTensor` to the HOPs for flex attention forward / backward that simply wrap / unwrap NJTs appropriately
* Adds barebones `new_empty()` support to NJT since FlexAttention utilizes this repeatedly; right now, only `new_empty()` with a shape of `()` is supported
* Tests that FlexAttention with a causal mask matches causal SDPA
* Adds a new public API for FlexAttention usage:
* `create_nested_block_mask(mask_mod, B, H, njt, BLOCK_SIZE, _compile)` - NJT analogue for `create_block_mask()` that utilizes the `njt`'s ragged structure to create an appropriately-sized block mask (e.g. `(1, 1, total_seqlen, total_seqlen)`). This function handles the index conversion from "stacked sequence" space -> relative sequence space.
* Minor note: as this is a public API, this function is purposefully named with "nested" instead of "njt" to keep the latter as an informal, mostly internal-only term.
Example usage:
```python
def causal_mask(b, h, q_idx, kv_idx):
return q_idx >= kv_idx
query = ... # NJT of shape (B, H, S*, D)
key = ... # NJT of shape (B, H, S*, D)
value = ... # NJT of shape (B, H, S*, D)
# create_nested_block_mask() automatically converts indices from "stacked sequence" space -> relative sequence space
block_mask = create_nested_block_mask(causal_mask, 1, 1, query) # block mask conceptual shape is (B, H, sum(S*), sum(S*))
output = flex_attention(query, key, value, block_mask=block_mask)
def causal_score_mod(score, b, h, q_idx, kv_idx):
return torch.where(q_idx >= kv_idx, score, float("-inf"))
# flex_attention() automatically converts indices from "stacked sequence" space -> relative sequence space for NJT inputs
output2 = flex_attention(query, key, value, score_mod=causal_score_mod)
```
TODO:
* ~~Determine the right level of abstraction for public API helpers + move them alongside other helpers~~ Verify this with others though
* ~~Some cleanup~~
* ~~`njt_score_mod_adapter`~~
* ~~Q: should `create_njt_block_mask()` call `njt_mask_mod_adapter()` so we don't need two calls?~~
* Can we avoid materializing the `sum(s)` length `seq_idx` used for conversion between stacked sequence -> sequence relative indices?
* Not for now, although future work may deepen the integration between Flex + NJT (possibly requiring custom templates). We should try to cache this though.
* ~~Demonstrate non-causal mask~~
* Support non-contiguous NJTs with holes (**booted to future PR**)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136792
Approved by: https://github.com/drisspg
ghstack dependencies: #138841
Some tests in `test/dynamo` are marked as expected failures when testing with `PYTORCH_TEST_WITH_DYNAMO=1`, i.e., we added files with those test names in the `dynamo_expected_failures` folder.
However, a lot of those dynamo tests seem to be passing with `PYTORCH_TEST_WITH_DYNAMO=1`, so this patch removes them from `dynamo_expected_failures`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138935
Approved by: https://github.com/anijain2305
Summary:
Unflatten was broken for HOPs for a couple of reasons:
(1) we didn't expect `get_attr` nodes in the exported program, but they can occur to hold graph arguments to HOPs; such attributes must be moved from the exported program to the corresponding unflattened submodule containing the HOP call.
(2) we don't record metadata for graph arguments on serialization (there's nothing to hold it in our schema), and accordingly the `get_attr` nodes we create on deserialization don't have `nn_module_stack` metadata, which obviously wrecks unflatten.
Test Plan: added a couple of tests
Differential Revision: D65013647
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138978
Approved by: https://github.com/zhxchen17
The dot reference implementation should be consistent with the CPU/CUDA implementations, since it may be used for meta dispatch, i.e.:
```python
import torch
x = torch.tensor([1,2,3], dtype=torch.float32)
y = torch.tensor([4,5,6], dtype=torch.float16)
x.dot(y)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
RuntimeError: dot : expected both vectors to have same dtype, but found Float and Half
```
However, the following does not raise an exception:
```python
x.to("meta").dot(y.to("meta"))
```
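A minimal sketch of the kind of dtype check the reference/meta implementation should perform to match eager behavior (illustrative only; the helper name is made up and this is not the actual `torch._refs` code):
```python
import torch

def check_dot_dtypes(x: torch.Tensor, y: torch.Tensor) -> None:
    # Mirror the eager error so the meta path also rejects mixed dtypes.
    torch._check(
        x.dtype == y.dtype,
        lambda: (
            f"dot : expected both vectors to have same dtype, but found "
            f"{x.dtype} and {y.dtype}"
        ),
    )
```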
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138596
Approved by: https://github.com/bdhirsh
Looking at the function record_shapeenv_event, it's hard to tell that it does not always run; we currently disable it by setting the top-level is_recording to True when self.should_record_events is false.
This change makes that more explicit, to avoid confusion and overloading is_recording.
Alternatively, we could rename is_recording to do_not_record.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138965
Approved by: https://github.com/ezyang
ghstack dependencies: #138804
Minor, adds a linter that ensures that all jobs run on pull_request, schedule, push etc. have an `if: github.repository_owner == 'pytorch'` or are dependent on a job that has that check
There is also a setting in Github repos that can disable all workflows for that repo
A lot of these are unnecessary because many jobs use reusable workflows that have that check. However, this is a one time change so I'm not that bothered
Unfortunately I can't put this at the workflow level, which would make this better
Lots of weird string parsing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138849
Approved by: https://github.com/malfet
Companion logger diff: https://www.internalfb.com/diff/D65012523
* Using float seconds for timestamps is bad because our internal system defaults to float32 precision and you don't even get second precision for timestamps in float32
* We decide to use microseconds instead of milliseconds because millisecond granularity you can end up with the same timestamp if compilation is happening very quickly; much better to force non-overlapping spans
* Because there are so many new fields and I don't feel like reimplementing each on BwdCompilationMetrics, BwdCompilationMetrics is no more, it's just that everything in CompilationMetrics is now optional.
* The actual frame compile times collection is not modified (still float) to reduce blast radius, so I just convert to microseconds before making the record. At float64 precision (Python's default), you get about microsecond precision on timestamps so shouldn't be a data problem (https://www.leebutterman.com/2021/02/01/store-your-unix-epoch-times-as-float64.html)
* I rename some entries for clarity. In particular, whenever a timing contains all of its lower phases (e.g., how Inductor also contains Triton compilation) we put "cumulative" in its name. If something doesn't happen at compile time but is delayed until we have actual real inputs, we put "runtime" in its name.
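As a small illustration of the conversion described above (a sketch with made-up variable names, not the actual CompilationMetrics code):
```python
import time

# Frame compile times are still collected as float seconds (to limit blast radius)
# and converted to integer microseconds just before the record is logged.
start_s = time.time()
# ... compilation work ...
end_s = time.time()
start_us = int(start_s * 1_000_000)
duration_us = int((end_s - start_s) * 1_000_000)
```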
Test plan:
```
buck2 run @mode/opt @mode/inplace //scripts/oulgen:runner
```
And then inspect https://fburl.com/scuba/dynamo_compile/sandbox/mslu7f5w and verify the us columns are populated and meaningful.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138975
Approved by: https://github.com/masnesral
Reference: https://github.com/pytorch/pytorch/issues/138399
This PR introduces an `OpInfo` test that checks whether running each `out=` operation
using meta inputs is consistent with using concrete (e.g. CPU) inputs. More specifically,
it tests the case where the output tensors are not of the expected data type. According to
the `out=` specification, some operations should error.
I have added XFAIL to the set of operations that are currently failing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138515
Approved by: https://github.com/ezyang
Summary: With the fast pickling mode, we don't need the custom hack for replacing device strings in tensors. This was previously needed because, e.g., two strings "cuda" will pickle differently if they are the same object vs. not.
Test Plan:
The new test fails with fast mode commented out, but succeeds when enabled:
`python test/inductor/test_codecache.py -k test_stable_strings`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138681
Approved by: https://github.com/oulgen
I've found that when using `torch.utils.cpp_extension.load` on my Windows system, decoding errors occur when my .cpp/.cu files contain certain non-English characters.
`test.py`:
```py
from torch.utils.cpp_extension import load
my_lib = load(name='my_cuda_kernel', sources=['my_cuda_kernel.cu'], extra_cuda_cflags=['-O2', '-std=c++17'])
# ......
```
`my_cuda_kernel.cu`:
```cpp
#include <torch/types.h>
#include <torch/extension.h>
// 向量化 <------ some chinese characters
// ......
```
Errors will be reported as:
```
Traceback (most recent call last):
File "E:\test\test.py", line 8, in <module>
my_lib = load(
^^^^^
File "C:\Users\XXX\AppData\Roaming\Python\Python311\site-packages\torch\utils\cpp_extension.py", line 1314, in load
return _jit_compile(
^^^^^^^^^^^^^
File "C:\Users\XXX\AppData\Roaming\Python\Python311\site-packages\torch\utils\cpp_extension.py", line 1680, in _jit_compile
version = JIT_EXTENSION_VERSIONER.bump_version_if_changed(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\XXX\AppData\Roaming\Python\Python311\site-packages\torch\utils\_cpp_extension_versioner.py", line 46, in bump_version_if_changed
hash_value = hash_source_files(hash_value, source_files)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\XXX\AppData\Roaming\Python\Python311\site-packages\torch\utils\_cpp_extension_versioner.py", line 17, in hash_source_files
hash_value = update_hash(hash_value, file.read())
^^^^^^^^^^^
UnicodeDecodeError: 'gbk' codec can't decode byte 0x96 in position 141: illegal multibyte sequence
```
The issue lies in the fact that the `open()` function in Python is platform-dependent, which can cause decoding errors when a file contains characters that are not supported by the default encoding. Pytorch uses file contents to generate hash string:
60c1433041/torch/utils/_cpp_extension_versioner.py (L16-L17)
On my Windows machine the default encoding is `gbk`, but all of my cpp files are in `utf-8`.
I think there is a simple solution to this problem: just change the file reading mode to binary, which avoids issues related to file encoding. It works perfectly on my machine.
```diff
- with open(filename) as file:
+ with open(filename, 'rb') as file:
hash_value = update_hash(hash_value, file.read())
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138565
Approved by: https://github.com/malfet, https://github.com/janeyx99
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
The current pytorch.wait_counter.codegen_and_compile scopes over cache hit/miss, so it doesn't accurately say whether you're actually spending time doing an Inductor compile or not. This counter is /only/ triggered when we're actually about to spend time in Inductor.
It covers Inductor lowering and codegen as well as Triton compilation.
It does NOT cover Triton compilation that occurs on a cache hit.
Some more bikeshedding may be needed.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138010
Approved by: https://github.com/markkm
During the work to dedup graphs for hierarchical compilation I tried to tame the `wrap_fx_proxy_cls` mess by separating the wrapping into three distinct scenarios (vs a jumble of conditionals). These are:
1) wrapping a preexisting tensor (`_wrap_fx_preexisting_tensor`)
2) wrapping and tracing a new op into the graph (`_wrap_fx_proxy`)
3) handling a value that is some other proxyable data structure
See `wrap_fx_proxy_cls` for the conditional tree handling these three cases.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138933
Approved by: https://github.com/williamwen42
Summary:
The current implementation for lifted graphs takes a dict of [constant name: constant value]. The constant values are used by run_node to execute the constant graph, get the folded values, and then create new getattr nodes for the folded values.
We don't have constant values for lifted graphs during model compilation on MTIA. I think it is more general to allow the constant folding pass to take only the constant names to produce the constant graph, and to represent the folded nodes as placeholders, making it consistent with lifted graphs. Additionally, this mimics the real situation on Sigmoid, where Sigmoid executes the constant graph, gets the folded values, and sets them on the main graph. This diff updates the pass to work with a list of constant names.
Test Plan:
```
buck run mode/opt caffe2/test:test_export -- -r split_const_gm
```
Differential Revision: D62144791
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135060
Approved by: https://github.com/SherlockNoMad
Co-authored-by: Tuan Trieu <tuant@meta.com>
Summary:
* Fixed real tensor tracing w/ torchbind objs by passing the cloned tensor obj. For now I just catch the exception and have an error message if the `_clone` fails, but up for discussion on what to do here
* Separate question, should we require people to set up FakeScriptObjects and stuff for draft mode?
* Prevent side effects from happening when we do the first pass of custom ops profiling by cloning/copying everything. Not sure if deepcopying the model will succeed in all cases... But also I guess this path can be removed once custom ops profiling turns into one pass.
Test Plan: `buck2 run @//mode/dev-nosan //scripts/angelayi/draft_export:test_draft_export`
Reviewed By: ydwu4
Differential Revision: D64124825
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138797
Approved by: https://github.com/ydwu4
Title + we avoid calling defer_assert when we statically know the guard results.
timing for pnasnet5large
```
TIMING: code_gen:21.79672 inductor_compile:39.57726 backend_compile:65.30649 entire_frame_compile:95.22052 total_wall_time:95.22052
```
matches without the diff
```
TIMING: code_gen:21.89314 inductor_compile:39.72298 backend_compile:65.38539 entire_frame_compile:95.0854 total_wall_time:95.0854
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138804
Approved by: https://github.com/ezyang
Skipped `test_exponential` and `test_multinomial` because simply printing the result of an operator does not constitute a test. The testing framework does not attempt to interpret the output.
Modified `test_print_non_contiguous` to get the tensors' string representation, which is an equivalent operation
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139009
Approved by: https://github.com/Skylion007
### Why use non-blocking mode in eager init?
For overlapping comm init and model init, etc.

### Why can we set non-blocking as default?
If the setting is dangling -- i.e. not passed in by user nor set via env -- `ProcessGroupNCCL` can have some preferred logic. And torch-level API semantics does not change whether the NCCL comm is blocking or non-blocking (handled within `ProcessGroupNCCL`).
### Why not make non-blocking default for lazy mode as well?
PR https://github.com/pytorch/pytorch/pull/137544 tried it.
Two reasons why that's not preferred today:
1. It is hard -- too big a blast radius.
2. There is no gain by doing lazy init in non-blocking mode, because the right next CPU call is a collective, and we will block there waiting for comm to be ready, so same effect as blocked init, no "opening" compared to eager mode.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138527
Approved by: https://github.com/wconstab
ghstack dependencies: #138860
Summary: Move arrayref specific header codegen logic to cpp_wrapper_cpu_array_ref.py, and consolidate some header files codegen logic
Test Plan: CI
Differential Revision: D64899248
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138882
Approved by: https://github.com/hl475
# Motivation
According to [[RFC]A device-agnostic Python runtime API design for stream-based accelerators](https://github.com/pytorch/pytorch/issues/128403), this PR intends to introduce a device-agnostic runtime API design.
I personally prefer the **Simple Version** APIs that no longer accept the device type as an input argument. It means we will leverage `getAccelerator` to fetch the current accelerator. And it is flexible to expand these APIs to handle multiple types of accelerator scenarios. The design does **NOT** break the previous design philosophies.
I also believe that namespace torch.accelerator is better. It lets users know that the APIs they are calling are running on an accelerator rather than CPU. This is important. Meanwhile, we can follow a simple API design principle:
1. Device-agnostic APIs should be placed under the torch.accelerator namespace and not accept a device_type optional parameter.
2. Device-specific APIs should be placed under device-specific submodules.
3. APIs required by both CPU and accelerators should be placed under the torch namespace and accept a device_type optional parameter.
Also, I list the pros and cons of **Simple Version** here:
Pros:
- `torch.accelerator.foo` will have the same input argument as `torch.xxx.foo`, bringing a better user experience;
- more concise, facilitate the developer to write a device-agnostic code.
Cons:
- no obvious drawbacks.
# Additional Context
I list the new APIs here:
```python
torch.accelerator.is_available() -> bool:
torch.accelerator.current_accelerator() -> torch.device:
torch.accelerator.device_count() -> int:
torch.accelerator.current_device_idx() -> int:
torch.accelerator.set_device_idx(device: Union[torch.device, str, int, None]) -> None:
torch.accelerator.current_stream(device: Union[torch.device, str, int, None]) -> torch.Stream:
torch.accelerator.set_stream(stream: torch.Stream) -> None:
torch.accelerator.synchronize(device: Union[torch.device, str, int, None]) -> None:
```
According to the discussion with Alban, we decided to change the API name `set_device` to `set_device_idx` and `current_device` to `current_device_idx` to be more explicit. We will submit another PR to support device and stream context managers.
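For illustration, a minimal device-agnostic usage sketch based on the API list above (assuming an accelerator backend is available; this is not from the PR itself):
```python
import torch

if torch.accelerator.is_available():
    dev = torch.accelerator.current_accelerator()  # e.g. torch.device("cuda") or torch.device("xpu")
    torch.accelerator.set_device_idx(0)
    x = torch.randn(1024, device=dev)
    y = (x * 2).sum()
    torch.accelerator.synchronize()  # synchronize the current accelerator
```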
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132204
Approved by: https://github.com/EikanWang, https://github.com/abhilash1910, https://github.com/gujinghui, https://github.com/albanD
Try to fix https://github.com/pytorch/pytorch/issues/138772 .
aten._scaled_dot_product_efficient_attention_backward requires the out and gradient_out to have stride order (3, 1, 2, 0). When Inductor layout optimization is enabled, Inductor may change tensor strides if they are not user visible. For efficient_attention_backward, Inductor tries to follow eager strides. But the eager strides Inductor gets for the backward graph may be the ones after optimization. There are a few possible fixes:
1. change the kernel to allow stride order other than (3, 1, 2, 0). This is probably hard
2. back out https://github.com/pytorch/pytorch/pull/112045/files and don't do layout optimization if the model contains efficient_attention.
3. Force (3, 1, 2, 0) strides order for the relevant tensors
4. Pass original eager layouts to Inductor for the backward graph. Let Inductor follow those layouts for tensors with extra layout requirement.
The PR implements option 3. Option 4 looks more general to me, I think we can do this in long term.
I tried to add a test but failed to repro: https://gist.github.com/shunting314/fe37a246aad269de9ea00199446688f6
Here is the original command to repro the issue:
```
TORCHINDUCTOR_LAYOUT_OPTIMIZATION=1 PYTORCH_NO_CUDA_MEMORY_CACHING=1 CUDA_LAUNCH_BLOCKING=1 time python benchmark.py --model maxvit_nano_rw_256 --precision bfloat16 --torchcompile --bench train --no-retry -b 64
```
benchmark.py is https://github.com/huggingface/pytorch-image-models/blob/main/benchmark.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138879
Approved by: https://github.com/drisspg, https://github.com/eellison
This is semantics-changing if you are dealing with multiple code objects which have exactly the same filename/firstlineno/name but are distinct objects and need non-aliasing automatic dynamic state. Otherwise, this should be equivalent (modulo lifetime). I want to do this because when I do PGO I can't index on code object identity; I need a stable identifier.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138740
Approved by: https://github.com/bobrenjc93
ghstack dependencies: #138693, #138717
While working on automatic dynamic PGO (https://github.com/pytorch/pytorch/pull/138052) one abstract property I was looking for out of profile information is that it formed a semilattice: I could join together two profiles and get a merged profile that is consistent with the profiles that I saw in both cases. While working on this data structure that supported joins, I realized that the base automatic dynamic algorithm could be implemented in this way, therefore this refactor.
The basic recipe is that we now support a join operation on FrameStateSizeEntry. Intuitively, if you join two sizes that are equal, you get back that size (join(2, 2) == 2), but if you join two different sizes you get a special singleton auto_dynamic indicating that the size of the tensor is dynamic (join(2, 3) == auto_dynamic). So now, the automatic dynamic algorithm is: (1) compute the FrameStateSizeEntry that corresponds to the concrete values we've seen, and (2) join it into the ambient FrameStateSizeEntry. As a bonus, compiler collectives can buy into the same abstraction (we're simply distributing FrameStateSizeEntry from each node to every other node). For convenience, I also added the necessary `auto_unset` extra state which is the identity element (which makes our semilattice bounded from both top and bottom). Here, join(2, auto_unset) == 2.
While doing this, there was a complication: the infer stride algorithm wasn't technically a semilattice. Here, I did what I suggested in the original code review https://github.com/pytorch/pytorch/pull/130232 which is stop using a heuristic, and instead replicate the stride inference algorithm in automatic dynamic. This means that when I join strides together, I don't join their concrete values, instead, if a stride can be inferred as the contiguous stride for a particular inner dimension, then you represent it as InferStride(dim). There's an example in code which I recommend looking at.
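A small sketch of the join operation described above (illustrative only; it does not mirror the actual FrameStateSizeEntry implementation or the InferStride handling):
```python
class _Sentinel:
    def __init__(self, name):
        self.name = name
    def __repr__(self):
        return self.name

AUTO_UNSET = _Sentinel("auto_unset")      # identity element (bottom of the lattice)
AUTO_DYNAMIC = _Sentinel("auto_dynamic")  # absorbing element (top): size is dynamic

def join_size(a, b):
    # join(x, auto_unset) == x; join(x, x) == x; joining two different sizes gives auto_dynamic
    if a is AUTO_UNSET:
        return b
    if b is AUTO_UNSET:
        return a
    if a is AUTO_DYNAMIC or b is AUTO_DYNAMIC or a != b:
        return AUTO_DYNAMIC
    return a

assert join_size(2, 2) == 2
assert join_size(2, 3) is AUTO_DYNAMIC
assert join_size(2, AUTO_UNSET) == 2
```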
Some other extra things that are happening in this PR:
* I tried to deduplicate the size/stride automatic dynamic logic as much as possible. So hopefully less code to review here.
* I had to reimplement all the logging. For the most part I tried to track the logging as closely to the original as possible, but I think we could be emitting less Chrome events here
* The `marked_dynamic` handling is still preserved as is, but I kind of don't like it and we should figure out how to put it somewhere else
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138717
Approved by: https://github.com/bobrenjc93
ghstack dependencies: #138693
Summary:
Disable comprehensive_padding when use_runtime_constant_folding=True.
We need to disable comprehensive padding because it modifies the strides, so the stride information between the constant graph and the main graph would differ.
Test Plan:
```
buck2 run mode/opt -c fbcode.platform010_cuda_version=12 -c fbcode.nvcc_arch=a100 caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark -- --model-path=manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/643940255/17/gpu_lowering/input.predictor.disagg.gpu.merge --lower-backend="AOT_INDUCTOR_EP" --aot-inductor-config="{'max_autotune': True, 'aot_inductor.use_runtime_constant_folding': True}"
```
Reviewed By: 22quinn, henryoier
Differential Revision: D64927546
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138872
Approved by: https://github.com/chenyang78
## Context
Previously, the stride preservation of user-visible nodes worked as follows:
- After joint-graph tracing, we recorded the **names** of user-visible nodes and passed them to GraphLowering.
- In GraphLowering, we determined whether we needed to preserve the striding for a certain node by checking if the node's name was in `user_visible_outputs`.
- We obtained the original strides by checking `node.meta["val"].stride()`.
However, there's a problem with this approach: the nodes in output_node.args[0] and their strides could change between the completion of joint-graph tracing and the consumption of `user_visible_outputs` (e.g., during post-grad passes), making it unreliable.
## This PR
- After joint graph tracing:
- Record the original strides for all nodes in `output_nodes.args[0]` as `output_node.meta["original_output_strides"]` (recording for all nodes in case we need the info for other purposes such as debugging).
- Record the indices of user-visible outputs as `output_node.meta["user_visible_output_idxs"]`.
- Remove the original plumbing of `user_visible_outputs`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136732
Approved by: https://github.com/Chillee
- Instead of generating shader from templated code on host, just define two specializations of one kernel template
- Get rid of unused `threads_per_threadgroup` argument
- Replace `if (typeid(scalar_t) == typeid(int32_t))` with `if constexpr (std::is_same_v<scalar_t, int32_t>)` in the host code
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138962
Approved by: https://github.com/janeyx99
1) Always generate expected_results.csv up to the accuracy of the first three significant digits, e.g. 112313212312 --> 112000000000, etc. (a small sketch of this rounding follows the example output below).
2) Regenerate all records in expected_results.csv and not just the failed ones. Why? Because if something changes by 1.3% while the noise threshold is 1.5%, we still want to reflect that.
3) Add the message "please update all results that changed significantly, and not only the failed ones".
```
(myenv) [lsakka@devgpu005.nha1 ~/pytorch/benchmarks/dynamo/pr_time_benchmarks (check_result_ehancements)]$ python check_results.py test_check_result/expected_test.csv te
st_check_result/result_test.csv out
WIN: benchmark ('a', 'instruction count') failed, actual result 9011111111 is -18.16% lower than expected 11011111111 ±1.00% please update the expected results.
please update all results that changed significantly, and not only the failed ones
REGRESSION: benchmark ('b', 'memory') failed, actual result 20011111111 is 99.89% higher than expected 10011111111 ±+10.00% if this is an expected regression, please update the expected results.
please update all results that changed significantly, and not only the failed ones
REGRESSION: benchmark ('c', 'something') failed, actual result 107111111111 is 969.92% higher than expected 10011111111 ±+10.00% if this is an expected regression, please update the expected results.
please update all results that changed significantly, and not only the failed ones
MISSING REGRESSION TEST: benchmark ('d', 'missing-test') does not have a regression test enabled for it.
new expected results file content if needed:
a,instruction count,9011000000,0.01
b,memory,20010000000,0.1
c,something,107100000000,0.1
There was some failures you can use the new reference expected result stored at path:out and printed above
```
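A small sketch of the rounding in (1), assuming integer benchmark values (the helper name here is made up for illustration):
```python
def keep_three_significant_digits(n: int) -> int:
    # Keep the first three digits of the result and zero out the rest,
    # matching the "accuracy of first three digits" described above.
    s = str(n)
    if len(s) <= 3:
        return n
    return int(s[:3] + "0" * (len(s) - 3))

assert keep_three_significant_digits(112313212312) == 112000000000
```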
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137925
Approved by: https://github.com/aorenste
Before this PR, calling `is_nested` in-graph would result in graph code like the following:
```python
class GraphModule(torch.nn.Module):
def forward(self, L_nt_: "f64[3, s1, 5]", s1: "Sym(s1)"):
l_nt_ = L_nt_
# Note this useless line!
getattr_1 = l_nt_.is_nested; getattr_1 = None
add: "f64[3, s1, 5]" = l_nt_ + 2; l_nt_ = None
return (add,)
```
This PR follows what is done for `is_sparse` / `is_quantized`: store it onto `TensorVariable` and have `getattr` calls to `is_nested` return the stored value as a constant. This removes the useless line above from the graph. Note that guarding is handled through tensor type check guards, so no need to guard on `is_nested` status.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138841
Approved by: https://github.com/soulitzer
update_hint_regression has been behaving well, so I am setting a 2% noise threshold for it, and 1.5% for sum_floordiv_regression.
I have one concern with the way we do the regression detection: small changes below the threshold level will accumulate and eventually trigger a failure. To avoid that, we would have to keep an eye on the dashboard and potentially refresh the expected-results file regularly even when there are no failures.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137548
Approved by: https://github.com/aorenste
Canonically, the snapshot API returns the entire memory state of the CUDACachingAllocator (using `get_all_blocks`). There is no API that can only return the memory state of a given pool.
In this PR, we extend the functionality of snapshot API such that it can only return the memory addresses of an active pool. When snapshot API is called under a MemPoolContext, we only return the blocks that correspond to the pool id of the active pool.
Part of https://github.com/pytorch/pytorch/issues/124807.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133601
Approved by: https://github.com/ezyang
This patch
- removes the `is_lazy_module` check from `is_dynamic_nn_module`, and
adds a regression test.
- removes a series of dynamo expected failures on lazy modules. The few
ones I checked all were failing due to speculation log divergence,
similar to #138489.
Note that #100047 introduced the conditional removed in this patch, and
it was trying to fix #100001. But I've confirmed locally that #100001 no
longer repros after this patch.
Fixes #138489. See more context in the issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138639
Approved by: https://github.com/jansel
Fixes#120614
Takes over #120615
In summary, this PR:
- Adds a `__dlpack__` attribute check in the tensor creation path (i.e. [`internal_new_from_data` @ tensor_new.cpp](cdfe1bffd1/torch/csrc/utils/tensor_new.cpp (L266)))
- Creates the tensor by using the DLPack machinery, instead of an element-by-element copy
- No changes since #120615
- Adds a test, making sure the DLPack machinery is used
- Wraps a tensor in a fresh `TensorDLPackWrapper` class that implements only the DLPack methods
- Creates a new tensor from an instance of `TensorDLPackWrapper`
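A rough sketch of what such a DLPack-only wrapper looks like (illustrative; the actual `TensorDLPackWrapper` used in the test may differ in details):
```python
import torch

class DLPackOnlyWrapper:
    """Exposes only the DLPack protocol, forwarding to a real tensor."""
    def __init__(self, t):
        self._t = t
    def __dlpack__(self, stream=None):
        return self._t.__dlpack__(stream=stream)
    def __dlpack_device__(self):
        return self._t.__dlpack_device__()

src = torch.arange(6, dtype=torch.float32)
# With this PR, tensor creation goes through the DLPack machinery
# instead of an element-by-element copy.
out = torch.tensor(DLPackOnlyWrapper(src))
assert torch.equal(out, src)
```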
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138697
Approved by: https://github.com/ezyang
Co-authored-by: Wenzel Jakob <wenzel.jakob@epfl.ch>
In non-blocking mode, it seems a single `ncclRecv` or `ncclSend` call can "early return" `ncclSuccess` before the kernel is fully enqueued. This causes the event record below to miss the P2P kernel, leading to data corruption.
Side note: per NCCL, it is legal to call `ncclSend` or `ncclRecv` only if there is only one P2P op. This is true whether we are in blocking or non-blocking mode.
In this fix, we use ncclGroup semantics to ensure that the kernel is enqueued for single-P2P ops. The ncclGroup call itself should introduce minimal overhead.
Added a test `test_non_blocking_p2p`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138860
Approved by: https://github.com/shuqiangzhang
Set PYTORCH_MIOPEN_SUGGEST_NHWC environment variable to force output layout to channels-last.
This way, the channels-last CK instances will be added to benchmark choices in max autotune
# Testing
```
pytest test/inductor/test_ck_backend.py -k conv2d
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138643
Approved by: https://github.com/chenyang78
Summary: Capture the timing for the remote fx graph cache get and put operations and add them to the logger logging.
Test Plan:
1) Landed D64483593 and waited for logger actualization.
2) Ran test script on devserver: `buck2 run mode/opt scripts/slarsen/torch_compile_model:run`
3) Queried dynamo_compile/sandbox:
```
(pytorch-3.10_4) devvm2296:~/local/pytorch-3.10_4 $ scuba -e="select time,co_filename,remote_fx_graph_cache_get_time_s,remote_fx_graph_cache_put_time_s from \`dynamo_compile/sandbox\` where remote_fx_graph_cache_put_time_s is not null"
+------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------+----------------------------------+
| time | co_filename | remote_fx_graph_cache_get_time_s | remote_fx_graph_cache_put_time_s |
+------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------+----------------------------------+
| 1729136266 | null | 0.05652284622192383 | 0.9691152572631836 |
| 1729136263 | /data/users/slarsen/fbsource/buck-out/v2/gen/fbcode/289bb46b326874c6/scripts/slarsen/torch_compile_model/__run__/run-inplace#link-tree/scripts/slarsen/torch_compile_model/run.py | 0.8298435211181641 | 0.18642282485961914 |
+------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------+----------------------------------+
```
Reviewed By: oulgen
Differential Revision: D64484025
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138164
Approved by: https://github.com/jamesjwu, https://github.com/ezyang
Summary: The current clamp normalization does not cover torch.clamp calls whose min and max are not normalized to kwargs, thus the batch fusion of clamp can hit the "min and max are both empty" problem.
Test Plan:
```
buck2 run mode/opt servicelab/ai_ml/auto_tune:local_model_pt2 -- --flow_id 654509735 --test_mode split
```
GPU type: NVIDIA PG509-210
=============Print full analysis for offsite_cvr_oba_optout_dedicated_model================
| Metric | Value |
|:-------------------|:-----------------|
| GPU type | A100 |
| Batch size | 10 |
| Latency | 227.13 ms |
| Model size | 2322763344 bytes |
| Flops/example | 1136.52 G |
| TFLOPS | 50.04 |
| MFU | 16.04% |
| Activation/example | 2722.49 MB |
I1023 112249.043 local_model_with_pt2.py:25] benchmark results [('batch_size', 10), ('latency_ms', 22712), ('model_size_bytes', 2322763344), ('flops_per_example', 113652), ('tflops_g', 5003), ('mfu', 1603), ('activation_per_example_mb', 272249)
Differential Revision: D64848369
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138723
Approved by: https://github.com/jackiexu1992
Somehow `make setup-env`, as recommended in CONTRIBUTING.md, does not install all dependencies required to run tests.
This change makes that slightly clearer when running tests.
Specific repro on my side was
```
git checkout e7679663070e3149ae7cd6e28d376d86852ce9e4
make setup-env
conda activate pytorch-deps
python test/test_utils_internal.py
```
which is what my reading of the instructions implies should be correct.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137797
Approved by: https://github.com/albanD
I was discussing with James March how the current fx_codegen_and_compile
counter doesn't actually capture all compile time. This one is more
accurate and corresponds closely to the existing events in dynamo_compile
table.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138072
Approved by: https://github.com/markkm
Summary:
- This diff introduces `dtype` attribute to `TritonCSEVariable` and a dtype propagation helper function to infer dtype from input to output for each op.
- There will be a follow-up diff that uses this `dtype` information in `TritonCSEVariable` to perform dtype-aware codegen.
Test Plan: CI
Differential Revision: D61815079
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136778
Approved by: https://github.com/eellison, https://github.com/blaine-rister
Summary: while reading through inductor template code I found a few places where else statements were driving me crazy. Fixing them as I read
Test Plan: CI
Differential Revision: D64882385
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138789
Approved by: https://github.com/aakhundov
Summary: there is an issue with functionalization V2 in export. This is a quick fix that plumbs `is_export` through to `run_functionalized_fw_and_collect_metadata`.
Test Plan: CI
Differential Revision: D64915263
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138836
Approved by: https://github.com/tugsbayasgalan
I was debugging an internal ne divergence for a while that ended up being caused by a bad meta. I added an explicit config option and an explicit backend `aot_eager_decomp_partition_crossref` to enable FakeCrossRefMode when running the graph. I added an explicit backend because I suspect it will be useful for internal models, but I'm also happy to leave it as a config option.
It will only test ops that have meta to avoid memory overhead of hitting fallback path and running in eager.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138651
Approved by: https://github.com/zou3519, https://github.com/bdhirsh
This change fixes the RUNPATH of installed c++ tests so that the linker can find the shared libraries they depend on.
For example, currently:
```bash
venv/lib/python3.10/site-packages/torch $ ./bin/test_lazy
./bin/test_lazy: error while loading shared libraries: libtorch.so: cannot open shared object file: No such file or directory
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136627
Approved by: https://github.com/malfet
Fixes #138509
tl.constexpr globals would be codegen-ed as `constexpr()` instead of `tl.constexpr()` if they were un-annotated. This fixes the issue (and adds a test). The correct handling was already added but the corrected string was not being used in the un-annotated branch.
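For reference, a sketch of the un-annotated `tl.constexpr` global pattern that used to be mis-codegen'd (illustrative; not the test added by this PR):
```python
import triton
import triton.language as tl

BLOCK = tl.constexpr(128)  # un-annotated global (no `: tl.constexpr` annotation)

@triton.jit
def add_one(x_ptr, y_ptr, n):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    tl.store(y_ptr + offs, tl.load(x_ptr + offs, mask=mask) + 1, mask=mask)
```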
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138757
Approved by: https://github.com/oulgen
Summary: During compilation, a profiler context gets ignored so we should temporarily turn off tests that are failing due to dynamo. Once profiler integration with dynamo is introduced we can reintroduce these tests
Test Plan: Make sure CI is passing again
Differential Revision: D64867447
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138762
Approved by: https://github.com/davidberard98
As called out in https://github.com/pytorch/pytorch/pull/137999, preserving signatures of multiple calls when buffer mutations are present was NYI. The main problem was that intermediate values of buffers were not tracked, so couldn't be propagated statefully between multiple calls (i.e., they would need to be explicitly passed around, defeating the unlifting needed for preserving signatures).
This PR fixes this situation, by introducing module attributes that carry the necessary intermediate values of buffer mutations. In general, a buffer mutation can have several intermediate values it depends on recursively, even other buffers. So rather than tying an intermediate value with a particular buffer, we tie it with the submodules that create and read it. We install an attribute on all modules that create or read a particular intermediate value, sharing the same initial storage (i.e., initialized with the same empty tensor). For the module that creates this intermediate value, we copy the value into the corresponding attribute; and for the modules that read it, we read the corresponding attribute instead.
Another complication that needed to be addressed was that a `run_decompositions` following an `export_for_training` was not preserving module call graphs, which is needed for unflattening and, in particular, used when remapping inputs. Fortunately some existing metadata already tracks provenance of nodes, which we could use to update a module call graph after functionalization / decomposition.
Differential Revision: D64806175
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138669
Approved by: https://github.com/tugsbayasgalan
Summary:
We always need to give the heartbeat monitor thread time to write out flight recorder dumps. Otherwise, the watchdog thread kills the heartbeat monitor thread too fast before it has time to write out the Flight Recorder logs.
This change:
1. Removes the "sleep after exception" JK. We don't need to sleep for 8 minutes.
2. Use a promise between the watchdog thread and the heartbeat monitor thread to delay, at most, one minute to give Flight Recorder time to write out its log on timeout.
Test Plan:
Tested on my local job and flight recorder successfully executed for the job.
https://fburl.com/mlhub/38fj5yne
The watchdog thread gives heartbeat thread time to write out the logs.
In the logs we see:
```
[trainer4]:I1023 17:39:29.755507 12592 ProcessGroupNCCL.cpp:1950] [PG ID 0 PG GUID 0(precheck) Rank 12] slept for 1647ms giving time for flight recorder dumps to finish.
```
Differential Revision: D64857928
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138828
Approved by: https://github.com/d4l3k, https://github.com/fduwjj
We found that in the `export() -> _inductor.aot_compile()` lowering, 3 different ShapeEnvs get created, leading to errors when one ShapeEnv processes expressions created by another ShapeEnv. This plumbs the 2 places where ShapeEnv creation happens, detecting the original ShapeEnv from the GraphModule example values, so the original ShapeEnv is just reused.
Differential Revision: D64613290
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138362
Approved by: https://github.com/angelayi
Each inductor event should have exactly one hit, miss, bypass etc. Add it to the inductor compile event.
Add triton_compile as a compiler phase with `dynamo_timed`. This way, we get PT2 Compile Event Logs for triton as well.
Here's what triton events look like: {F1941513932}
And this on a cache hit(since we still redo this work):
{F1941514350}
Inductor cache info:
{F1941528530}
Differential Revision: [D64703392](https://our.internmc.facebook.com/intern/diff/D64703392/)
@diff-train-skip-merge
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138603
Approved by: https://github.com/oulgen
Summary: We expand the tests to cover serdes_non_strict. Currently failing tests are skipped.
Test Plan:
```
buck2 test @//mode/dev-nosan //caffe2/test:test_export -- -r _serdes_non_strict
```
Differential Revision: D64709285
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138662
Approved by: https://github.com/avikchaudhuri
As discussed with @ezyang, this set of diffs extracts fixes to problems discovered while flipping `specialize_float=False` in https://github.com/pytorch/pytorch/pull/137782. Since these codepaths are exercised in existing tests, I'm going to bias towards shipping speed and put these up with the primary test plan being the global CI. These code paths are all tested via existing tests when `specialize_float=False`, and it feels a bit wonky to add more gated tests that only test behavior when this flag is True, especially since these code paths are already covered. That being said, I'm happy to add individual tests if reviewers insist or have a different POV.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138598
Approved by: https://github.com/ezyang
ghstack dependencies: #138595
When fp8 dtype is involved, Inductor may set min_elem_per_thread to a positive value. This forces increasing XBLOCK even for a small xnumel (e.g. 1). Inductor will report an error later when sanity-checking the triton config.
The simple fix here is to just not let XBLOCK be larger than xnumel.
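A minimal sketch of the capping described above (illustrative; not the actual Inductor config-sanitization code):
```python
def sanitize_xblock(xblock: int, xnumel: int) -> int:
    # Never let XBLOCK exceed the number of elements along x.
    return min(xblock, xnumel)

assert sanitize_xblock(512, 1) == 1  # a small xnumel no longer trips the sanity check
```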
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138730
Approved by: https://github.com/Chillee
ghstack dependencies: #136782
unwrap_tensor_subclasses -> get_plain_tensors is used at runtime. For small models this overhead is noticeable in comparison with the small compiled kernel.
1/ Remove asserts from the runtime path
2/ Remove list creation by using an optional output list to append arguments to
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138498
Approved by: https://github.com/bdhirsh
Summary:
Currently, whether split should be used depends on the size of the subgroup.
It's possible that the default PG is not eagerly initialized yet, but split is still called.
This PR fixes the issue by removing split's dependency on the subgroup size.
Test Plan:
Modified UT
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138781
Approved by: https://github.com/kwen2501
# Motivation
Fix https://github.com/pytorch/pytorch/issues/138577.
# Solution
1. All UTs in `test/inductor/test_compiled_optimizers.py` are fixed by https://github.com/pytorch/pytorch/pull/134170
2. UT in `test/inductor/test_pattern_matcher.py` is introduced by https://github.com/pytorch/pytorch/pull/138089, we will skip this UT due to the unsupported feature `max_autotune_gemm_backends:Triton`.
3. We have a new impl related to `histc`, so we remove the expected failure from `test/inductor/test_torchinductor_opinfo.py`
4. We support `avg_pool3d` for `fp16` data type, so we remove the expected failure from `test/inductor/test_torchinductor_opinfo.py`
5. CUDA-biased code was introduced by https://github.com/pytorch/pytorch/issues/138472; we just generalize it to `GPU_TYPE`.
# Additional Context
> Why update torch-xpu-ops commit pin here?
We have to update commit pin to avoid the build failure raised by the code change [C10_UNUSED](https://github.com/pytorch/pytorch/pull/138364).
> What does the torch-xpu-ops update bring?
1. Adds some foreach ops, like unary ops and `foreach_clamp_max`, etc.;
2. Adds forward and backward for some pooling ops, like `avg_pool3d` and `max_pool3d`;
3. Adds some other ops, like `log_normal_`, `index_copy`, and `mode`, etc.;
4. Fixes the build failure related to `C10_UNUSED`;
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138548
Approved by: https://github.com/malfet, https://github.com/EikanWang
We did a lot of optimization for PyTorch Windows and made good progress. But some models still have a performance gap between PyTorch Windows and PyTorch Linux. Ref: https://pytorch.org/blog/performance-boost-windows/#conclusion
From the blog's conclusion, we found that `ResNet50` is a typical case of it.
Let's focus on `ResNet50` and collect the profiling log:
```cmd
(nightly) D:\xu_git\dnnl_cb>python test_script_resnet50.py
--------------------------------- ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls
--------------------------------- ------------ ------------ ------------ ------------ ------------ ------------
model_inference 3.91% 682.427ms 100.00% 17.448s 17.448s 1
aten::conv2d 0.18% 30.906ms 64.79% 11.305s 2.133ms 5300
aten::convolution 0.45% 78.031ms 64.62% 11.275s 2.127ms 5300
aten::_convolution 0.30% 51.670ms 64.17% 11.196s 2.113ms 5300
aten::mkldnn_convolution 63.58% 11.093s 63.87% 11.145s 2.103ms 5300
aten::batch_norm 0.13% 23.536ms 20.10% 3.506s 661.580us 5300
aten::_batch_norm_impl_index 0.28% 49.486ms 19.96% 3.483s 657.139us 5300
aten::native_batch_norm 19.26% 3.360s 19.64% 3.427s 646.615us 5300
aten::max_pool2d 0.01% 1.038ms 5.84% 1.018s 10.181ms 100
aten::max_pool2d_with_indices 5.83% 1.017s 5.83% 1.017s 10.171ms 100
aten::add_ 3.38% 588.907ms 3.38% 588.907ms 85.349us 6900
aten::relu_ 0.35% 60.358ms 1.67% 292.155ms 59.624us 4900
aten::clamp_min_ 1.33% 231.797ms 1.33% 231.797ms 47.306us 4900
aten::empty 0.46% 80.195ms 0.46% 80.195ms 1.513us 53000
aten::linear 0.01% 927.300us 0.23% 39.353ms 393.532us 100
aten::addmm 0.20% 35.379ms 0.21% 37.016ms 370.155us 100
aten::empty_like 0.12% 20.455ms 0.17% 29.976ms 5.656us 5300
aten::as_strided_ 0.11% 18.830ms 0.11% 18.830ms 3.553us 5300
aten::adaptive_avg_pool2d 0.00% 419.900us 0.08% 14.265ms 142.647us 100
aten::mean 0.01% 1.737ms 0.08% 13.845ms 138.448us 100
aten::sum 0.05% 8.113ms 0.05% 8.648ms 86.479us 100
aten::resize_ 0.03% 5.182ms 0.03% 5.182ms 0.978us 5300
aten::div_ 0.01% 1.445ms 0.02% 3.460ms 34.600us 100
aten::to 0.00% 337.000us 0.01% 2.015ms 20.154us 100
aten::_to_copy 0.01% 977.500us 0.01% 1.678ms 16.784us 100
aten::copy_ 0.01% 1.474ms 0.01% 1.474ms 7.371us 200
aten::t 0.00% 775.900us 0.01% 1.410ms 14.104us 100
aten::flatten 0.00% 420.900us 0.01% 1.311ms 13.106us 100
aten::view 0.01% 889.700us 0.01% 889.700us 8.897us 100
aten::transpose 0.00% 410.700us 0.00% 634.500us 6.345us 100
aten::expand 0.00% 496.800us 0.00% 566.800us 5.668us 100
aten::fill_ 0.00% 534.800us 0.00% 534.800us 5.348us 100
aten::as_strided 0.00% 293.800us 0.00% 293.800us 1.469us 200
aten::empty_strided 0.00% 241.700us 0.00% 241.700us 2.417us 100
aten::resolve_conj 0.00% 54.800us 0.00% 54.800us 0.274us 200
--------------------------------- ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 17.448s
Execution time: 20.02380895614624
```
We found the major kernel consuming CPU resources is `aten::mkldnn_convolution`, which is dispatched to `MKLDNN`.
Actually, we had already optimized memory allocation by integrating mimalloc into the PyTorch c10 module. It helps PyTorch Windows a lot, but it does not cover `MKL` and `MKLDNN`'s intermediary temporary memory.
We still have potential to improve PyTorch Windows performance by optimizing `MKL` and `MKLDNN`'s intermediary temporary memory.
So, I discussed with the Intel MKL team and got a method to register a high-performance memory allocation API with MKL, which helps MKL boost memory performance. Please check the online document: https://www.intel.com/content/www/us/en/docs/onemkl/developer-guide-windows/2023-0/redefining-memory-functions.html
This PR optimizes MKL memory allocation performance on Windows by registering mi_malloc with MKL. PR changes:
1. Add cmake option: `USE_MIMALLOC_ON_MKL`, It is sub-option of `USE_MIMALLOC`.
2. Wrap and export mi_malloc APIs in C10, when `USE_MIMALLOC_ON_MKL` is `ON`.
3. Add MklAllocationHelp.cpp to register allocation APIs to MKL, when `USE_MIMALLOC_ON_MKL` is `ON`.
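For example, a build-configuration sketch (assuming the usual source build, where `USE_*` environment variables are forwarded to CMake; adjust to your setup):
```cmd
set USE_MIMALLOC=1
set USE_MIMALLOC_ON_MKL=1
python setup.py develop
```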
For `oneDNN`, this is still being tracked in this proposal: https://github.com/oneapi-src/oneDNN/issues/1898
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138419
Approved by: https://github.com/jgong5, https://github.com/ezyang
This patch improves the performance of individual reductions on MI300X. These improvements are measured on individual sum reduction operations of varying sizes. The patch impacts the following tensor types:
- 1D tensors
- 2D tensors when reducing along dimension 0
- 2D tensors when reducing along dimension 1
Runtime reduction between 0 and 75% depending on tensor shape.
The patch uses the maximum number of threads per CU and the number of CUs itself to control the number of threadblocks in various situations (i.e. for various reduction types and tensor dimensions).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137737
Approved by: https://github.com/eqy, https://github.com/jeffdaily, https://github.com/pruthvistony, https://github.com/xw285cornell
This split in JKs was never actually used (We just set both JKs to the same values except when we accidentally didn't due to being humans who make mistakes). This simplifies the overall JK structure and eventually, will let us delete the duplicate JK
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137501
Approved by: https://github.com/oulgen
Summary:
Currently, if watchdog + healthcheck are enabled via knobs but the watchdog is disabled via SJD config, we observe a hang when the watchdog loop attempts to open the watchdog file path. This is because the FileTimerClient that is usually set in TorchElasticWatchdog will not be set, since disabling the watchdog via SJD config bypasses the TorchElasticWatchdog initialization.
The workaround is to update the healthcheck time when calling `get_last_progress_time`
Test Plan:
Logs show that the progress time value is being changed despite client not being set
Behavior when watchdog is enabled with SJD config is left unchanged
Differential Revision: D64733766
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138615
Approved by: https://github.com/gag1jain
Summary: Currently, calling `torch._logging.set_logs()` resets the log directory leading to multiple tlparse outputs. This prevents the dir from resetting after the first call.
Reviewed By: ezyang
Differential Revision: D64118047
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137793
Approved by: https://github.com/ezyang
Adds the job `get-test-label-type` in `.github/workflows/inductor-perf-compare.yml` checking for the experiment `awsa100`.
It is then used by the job `linux-focal-cuda12_1-py3_10-gcc9-inductor-build` to define the prefix for the runners that will run the benchmark.
Those runners temporarily accept the labels `awsa100.linux.gcp.a100` and `linux.aws.a100`. This is used so we can migrate via experimentation from `linux.gcp.a100`. After successfully experimenting with those instances, we will remove those labels, update the workflows to use `linux.aws.a100`, and decommission the GCP fleet.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138204
Approved by: https://github.com/ZainRizvi, https://github.com/huydhn
### Why use non-blocking mode in eager init?
For overlapping comm init and model init, etc.

### Why can we set non-blocking as default?
If the setting is dangling -- i.e. not passed in by user nor set via env -- `ProcessGroupNCCL` can have some preferred logic. And torch-level API semantics does not change whether the NCCL comm is blocking or non-blocking (handled within `ProcessGroupNCCL`).
### Why not make non-blocking default for lazy mode as well?
PR https://github.com/pytorch/pytorch/pull/137544 tried it.
Two reasons why that's not preferred today:
1. It is hard -- too big a blast radius.
2. There is no gain by doing lazy init in non-blocking mode, because the right next CPU call is a collective, and we will block there waiting for comm to be ready, so same effect as blocked init, no "opening" compared to eager mode.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138527
Approved by: https://github.com/wconstab
ghstack dependencies: #137855, #138488, #138374, #138384
Summary:
This diff fixes a deadlock issue in the Execution Trace observer. ExecutionTraceObserver takes a lock in recordOperatorStart and onFunctionExit. However, inside these two functions, the input/output values are evaluated, which can trigger the Python GIL in some use cases. In that case, the lock order is: ET lock -> GIL.
One of the ads applications acquires the GIL first, then calls all-gather to collect some metrics from all ranks. When ET is on, the all-gather is captured by the ET observer. In this case, the lock order is: GIL -> ET lock.
That is why the deadlock happens. To fix it, I changed the ET lock scope so that the input/output evaluation is no longer inside the scope of the ET lock.
Test Plan: buck2 test mode/opt caffe2/test:test_profiler_cuda
Differential Revision: D63556608
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136892
Approved by: https://github.com/aaronenyeshi
As discussed with @ezyang, this set of diffs extracts fixes to problems discovered while flipping `specialize_float=False` in https://github.com/pytorch/pytorch/pull/137782. Since these codepaths are exercised in existing tests, I'm going to bias towards shipping speed and put these up with the primary test plan being the global CI. These code paths are all tested via existing tests when `specialize_float=False`, and it feels a bit wonky to add more gated tests that only test behavior when this flag is True, especially since these code paths are already covered. That being said, I'm happy to add individual tests if reviewers insist or have a different POV.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138599
Approved by: https://github.com/ezyang
As discussed with @ezyang, this set of diffs extracts fixes for problems discovered while flipping `specialize_float=False` in https://github.com/pytorch/pytorch/pull/137782. Since these code paths are exercised in existing tests, I'm going to bias towards shipping speed and put these up with the global CI as the primary test plan. These code paths are all covered by existing tests when `specialize_float=False`, and it feels a bit wonky to add more gated tests that only exercise behavior when this flag is True, especially since these code paths are already covered. That being said, I'm happy to add individual tests if reviewers insist or have a different POV.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138595
Approved by: https://github.com/ezyang
Prior to this PR, when a model failed to export, we fell back to the legacy TorchScript exporter. However, we didn't stop once the model was exported with the TorchScript exporter: an optimization pass was still applied to the graph.
Ideally the optimization would also boost the performance of models exported with the legacy TorchScript exporter, but currently, for benchmarking purposes and for the fallback guarantee we give users, we should keep it simple and only return the graph.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138265
Approved by: https://github.com/xadupre, https://github.com/justinchuby
Previously we only waited for the comm to become ready after its initialization.
That's not enough: there are other NCCL APIs that can leave the comm InProgress, e.g. P2P calls, commSplit, commFinalize, etc.
Therefore, we ensure the comm is ready every "next time" we need to access ncclComm.
The place to add such a gatekeeper is `getNcclComm`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138384
Approved by: https://github.com/shuqiangzhang, https://github.com/fduwjj
ghstack dependencies: #137855, #138488, #138374
Summary:
This PR adds a lowering for `torch._cslt_sparse_mm` to find the optimal
alg_id and cache it when running with `torch.compile`
Seeing speedups on both bfloat16 and float8 dtypes:
<img width="641" alt="Screenshot 2024-10-17 at 2 10 38 PM" src="https://github.com/user-attachments/assets/b928cd11-32a3-43e5-b209-8e4028896f0b">
<img width="1274" alt="Screenshot 2024-10-17 at 1 39 03 PM" src="https://github.com/user-attachments/assets/d9edd684-a8ec-46fd-b3da-2e76dbcb7bb6">
* `torch._cslt_sparse_mm_search` has been modified to return optimal
split-k parameters as well as max alg_id.
* max_id is now available in `torch.backends.cusparselt` via
`torch.backends.cusparselt.get_max_alg_id()`
* fixed meta registrations for float8
Test Plan:
python test/test_sparse_semi_structured.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137427
Approved by: https://github.com/cpuhrsch
This diff does a few things:
## Add metadata to events in progress
Adds the ability to add extra metadata to Chromium Events via `add_event_data`.
Metadata can only be added to chromium events that have started but not ended (so, in-progress events).
- When you add the data, the metadata is attached to the event when you call log_event_end().
- The metadata appears in chromium events in tlparse. It also gets logged to scuba.
## New `dynamo` chromium event
We add a new `dynamo` chromium event to the top of the stack, where we collect various metadata found in dynamo_compile. So the new order of events goes:
```
__start__
-> dynamo (dynamo compile metrics)
-> entire_frame_compile (compile.inner)
-> backend_compile (i.e. aotdispatch)
-> create_aot_dispatch_function
-> inductor_compile
-> ...
```
BackwardCompilationMetrics doesn't have any dynamo specific information (as it's mostly inductor timings). So we don't include that here.
*FAQ: Why can't we use `entire_frame_compile` as the event?*
This is mostly due to backward compatibility with `dynamo_compile`. `dynamo_compile` collects CompilationMetrics outside of `compile.compile_inner`, and uses `dynamo_timed` to grab timings from phases of the compiler, including `entire_frame_compile`. So we don't have a CompilationMetric object until after an `entire_frame_compile` event ends! Separately, `dynamo` as a name for all of dynamo compile is more descriptive than `entire_frame_compile`, imo.
## Log metadata as separate columns
(Meta only): Separately, this also changes the `metadata` column in PT2 Compile Events. Instead of logging a single metadata column in JSON, it splits the JSON into separate columns. This is much better for data analysis. Now that this table is more mature, I think logging keys to separate columns is a better system.
Differential Revision: [D64696287](https://our.internmc.facebook.com/intern/diff/D64696287/)
**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D64696287/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138477
Approved by: https://github.com/aorenste
This PR fixes an issue with `torch._dynamo.assume_constant_result` causing global values to be overwritten.
Currently `torch._dynamo.assume_constant_result` saves the constant result into a global variable derived from the name of the function. This causes that function to be overwritten in the global scope. This PR checks that the name is unique in the global scope as well, avoiding the issue of overriding the function.
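As a rough illustration of the API in question (the function names here are made up for the example):
```python
import torch

@torch._dynamo.assume_constant_result
def get_scale():
    return 2.0

@torch.compile
def fn(x):
    # get_scale()'s result is treated as a constant; dynamo stores it in a
    # global whose name is derived from the function name, which is what
    # previously clobbered the original function in the global scope.
    return x * get_scale()

fn(torch.ones(3))
```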
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132431
Approved by: https://github.com/jansel
Summary: We expand the tests to cover retraceability_non_strict. Currently failing tests are skipped.
Test Plan:
```
buck2 test @//mode/dev-nosan //caffe2/test:test_export -- -r _retraceability
```
Differential Revision: D64611532
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138380
Approved by: https://github.com/angelayi
Before this PR, NJT would dispatch e.g. `NJT * nested_int` to `mul.Tensor`, wrongly interpreting the SymInt as a tensor and outputting garbage. This PR verifies that there are no nested ints in the list of args before dispatching for pointwise ops.
I originally tried checking that `the number of passed tensor args == the number of func schema tensor args`, but this wrongly disallows `nt * 2`, which (non-intuitively to me at least at first) dispatches via the `mul.Tensor` overload.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138602
Approved by: https://github.com/soulitzer
Was a bit confusing to read when working on #138354
"computer-assisted proof"
```
import random
from typing import List


def argsort(seq):
    # preserve original order for equal strides
    getter = seq.__getitem__
    a_r = range(len(seq))
    return list(reversed(sorted(a_r, key=getter, reverse=True)))  # noqa: C413


def stride_order2fill_order(order):
    """
    Convert stride order to fill order
    For channel last format,
    stride order = [3, 0, 2, 1] and fill order = [1, 3, 2, 0]
    """
    lookup = {pos: idx for idx, pos in enumerate(order)}
    fill_order = [lookup[i] for i in range(len(order))]
    return fill_order


def get_stride_order(seq):
    """
    Convert strides to stride order
    """
    sorted_idx: List[int] = argsort(seq)
    out = [0 for _ in range(len(seq))]
    a = sorted_idx.copy()
    for i, elem in enumerate(sorted_idx):
        out[elem] = i
    fillorder = stride_order2fill_order(out)
    assert fillorder == sorted_idx
    return out


for _ in range(1000):
    a = [0, 1, 2, 3]
    random.shuffle(a)
    get_stride_order(a)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138376
Approved by: https://github.com/drisspg
On ROCm, using a non-vectorized index_put kernel provides ~2x perf improvement over the hipified CUDA kernel. None of the existing unit tests were exercising the large index case so a new unit test was added.
It was also noted that the scale value in the original kernel was hard-coded to 1.0 which would be a no-op, so it was removed from the simplified rocm kernel.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138259
Approved by: https://github.com/xw285cornell, https://github.com/leitian, https://github.com/eqy
Summary: The problem happened after splitting CppWrapperCpu and CppWrapperCpuArrayRef, because CppWrapperCpuArrayRef.generate_index_put_fallback missed a statement.
Running test_aot_inductor.py as a whole didn't reveal the problem, but running test_index_put_with_none_index_cpu_with_stack_allocation individually did. Digging deeper, the root cause is that init_backend_registration had incorrectly cached the CPU CppWrapperCodegen class, which means CppWrapperCpuArrayRef was never picked when running test_aot_inductor.py as a whole. To fix the problem, all the ArrayRef tests are split into a separate file. Also, a code check is added to regex-match AOTInductorModelRunMinimalArrayrefInterface so this kind of false passing signal won't go unnoticed.
Differential Revision: [D64734106](https://our.internmc.facebook.com/intern/diff/D64734106)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138541
Approved by: https://github.com/frank-wei
This PR combines a number of cleanups in one PR. If any of the specific cleanups don't seem to make sense, let me know and I can remove them.
Cleanups
- This PR adds a set of test suites for the config module code, which handles basically all the APIs and ways it is used. Please let me know if you see anything critical that is not tested that I missed. This test suite is primarily used as the regression test suite for later changes in this diff. Note that there is some dynamo specific testing of the config module, but it isn't as verbose.
- I removed all internal usage of shallow_copy_dict. Those usages could all use the deep copy, and did not depend on the reference behavior of certain config values that shallow_copy_dict allows.
- I removed shallow copy semantics for configuration with a deprecation warning. I think this requires a release note, so hopefully I did that correctly. Let me know if we want to continue to expose shallow copy value semantics, but I just can't find a case where I expect anyone would want it. It also complicated later internal changes to the API (i.e. breaking apart various layers of the config changes).
- I fixed what I believe is a bug in how hashes are calculated on configs. In particular, if you got the hash, then made a config change, and then got the hash again, it would not update the hash. @oulgen, please let me know if I'm misunderstanding this behavior and it is desired.
- I switched our multiple implementations of iterating through the dictionary to a single one. This is primarily to make later changes easier, but it also makes it clear how inconsistent our various config ignoring options are. Let me know if people would be interested in me unifying the various options for ignoring config values.
- I updated the test patcher (not the performance critical one, just the normal one), to use __setattr__ and __getattr__ to remove direct API access to the underlying config fetcher.
For release notes -- not sure exactly how to communicate this, but something like:
"ConfigModule.to_dict and ConfigModule.shallow_copy_dict no longer retain their shallow copy semantics, which allowed reference value objects to be modified. If you wish to modify the config object, call load_config explicitly."
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138377
Approved by: https://github.com/ezyang, https://github.com/jansel, https://github.com/jovianjaison
Previously we'd been raising UserErrors when `Dim()` and DimHints (`Dim.AUTO/Dim.DYNAMIC`) were both specified in `dynamic_shapes`, this PR stops that, and uses `Dim()` objects to guide DimHints.
The key to this was making the `EqualityConstraint` class happy when it checks that inferred equivalence relations were specified in the original `dynamic_shapes` spec, and this introduces a `RelaxedConstraint` object to mark the hinted dimensions, so equality checks between `RelaxedConstraints` and other constraints are treated as valid.
Current behavior is that:
```
class Foo(torch.nn.Module):
    def forward(self, x, y):
        return x - y

inputs = (torch.randn(4, 4), torch.randn(4, 4))
shapes = {
    "x": (Dim.AUTO, Dim("d1", min=3)),
    "y": (Dim("d0", max=8), Dim.DYNAMIC),
}
ep = export(Foo(), inputs, dynamic_shapes=shapes)
```
The dimensions marked `AUTO` and `DYNAMIC` will have max & min ranges of 8 & 3 respectively. Note that inferred equality between `Dim()` objects & `Dim.STATIC` will still raise errors - `Dim()` suggests not specializing to a constant.
Differential Revision: D64636101
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138490
Approved by: https://github.com/avikchaudhuri
Summary:
This implementation does not utilize the benefit that after allgather we can directly perform the SDPA without doing the ring-based SDPA, but we can overlap the communication with the first sharded kv computation. This implementation shows some performance benefit and memory saving compared to the original alltoall implementation in certain cases.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132820
Approved by: https://github.com/XilunWu
This PR refactors some ref-counting functionality out of `beginAllocateToPool` and `releasePool`. The ref-counting logic is then used in construction and destruction of `torch.cuda.MemPool`.
The `use_count` variable in the CUDACachingAllocator is essentially a refcount of how many context managers are using the pool. Since we are now lifting the MemPool abstraction up to the user, the MemPool object itself now needs to hold an extra reference as well.
Part of https://github.com/pytorch/pytorch/issues/124807.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133600
Approved by: https://github.com/eqy, https://github.com/ezyang
Use `split_group()` to create sub_groups for nccl backend if the default pg is eagerly initialized. Otherwise, it will still go through the normal lazy init process and call `new_group()` instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138129
Approved by: https://github.com/kwen2501
Currently, when tuple values are encountered in dynamo, they are encoded using `repr(arg)`. This causes an issue if one of the values inside the tuple is not properly encoded by `repr`. In particular, if an enum is contained inside a tuple, invalid Python code will be generated.
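A tiny illustration of why `repr()` breaks down for enums inside tuples:
```python
import enum

class Color(enum.Enum):
    RED = 1

# repr() of a tuple containing an enum is not valid Python source, so emitting
# it verbatim into generated code produces a syntax error:
print(repr((1, Color.RED)))  # (1, <Color.RED: 1>) -- cannot round-trip via eval()
```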
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133123
Approved by: https://github.com/jansel
Summary: same as title. The plan is to pass a callable to the partitioner to perform custom autoAC via an ILP. This is the same as a previous diff, D63714905, which was landed and then subsequently reverted by PyTorch Release Engineering because of a failing unit test (f7b8d36c28). We think the unit test is buggy, and we fix it here as well.
Test Plan: tbd
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137785
Approved by: https://github.com/basilwong
Co-authored-by: Huy Do <huydhn@gmail.com>
This has the following benefits:
1) It's much easier to aggregate test failure repros into, say, a CSV or shell script from scuba.
2) We can do analysis (e.g. take the set difference of tests across two PRs).
3) We can get results faster at test-level granularity instead of the job-level granularity we see in the HUD/GH.
I tested this by introducing a breaking change, adding ci-scribe label and then verifying that the failed tests were logged to scuba: https://fburl.com/scuba/torch_open_source_signpost/w6qt7qr9
I then reverted the breaking change and published this PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138394
Approved by: https://github.com/ezyang
Forward fix for build issue introduced by #137855:
```
In file included from fbcode/caffe2/torch/csrc/distributed/c10d/NCCLUtils.cpp:2:
fbcode/caffe2/torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp:508:21: error: use of undeclared identifier 'NCCL_SPLIT_NOCOLOR'
508 | int split_color{NCCL_SPLIT_NOCOLOR - 1};
| ^
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138488
Approved by: https://github.com/fduwjj
ghstack dependencies: #137855
Does what it says on the tin. I believe the right behavior here is to ensure that `record_stream()` is called on all tensor components of the NJT to ensure they all live until stream computation is complete.
This is an ask from torchrec as the op is used there.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137099
Approved by: https://github.com/ngimel
Looking in the code I see
```
// NB: __cplusplus doesn't work for MSVC, so for now MSVC always uses
// the "__declspec(deprecated)" implementation and not the C++14
// "[[deprecated]]" attribute. We tried enabling "[[deprecated]]" for C++14 on
// MSVC, but ran into issues with some older MSVC versions.
```
But looking at the [MSVC C++ support table](https://learn.microsoft.com/en-us/cpp/overview/visual-cpp-language-conformance?view=msvc-170) I see that the `[[deprecated]]` attribute is supported as of MSVC 2015 and that the vast majority of C++17 features became supported in MSVC 2015 _or later_.
Since PyTorch is C++17 now, I infer that PyTorch must not support versions of MSVC earlier than MSVC 2015, so the versions of MSVC supported by PyTorch must support `[[deprecated]]`.
Therefore, since we are finished deprecating old MSVCs we can deprecate `C10_DEPRECATED`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138406
Approved by: https://github.com/cyyever, https://github.com/malfet
Since CUDA 12.4 binaries are now the default binaries on PyPI, the pytorch_extra_install_requirements need to use 12.4.
This would need to be cherry-picked to release 2.5 branch to avoid injecting these versions into metadata during pypi promotion.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138458
Approved by: https://github.com/malfet
## The problem
In a typical debugger, `repr()` is used to display variables and not `str()`.
Several classes in Dynamo have a `__str__()` method that returns useful information and a `__repr__()` that does not. Having to call `str(x)` or `[str(i) for i in x]` in the debugger all the time is a chore.
`str()` should be ["informal, nicely printable"](https://docs.python.org/3/library/stdtypes.html#str) and `repr()` should ["attempt to return a string that would yield an object with the same value when passed to eval()](https://docs.python.org/3/library/functions.html#repr)".
## The solution
In the Python object model, if there is no `__str__` method, `__repr__` is used instead (but not the other way around).
So renaming `__str__` to `__repr__` in a few cases where no `__repr__` method exists now should not change observable behavior, and should make debugging easier.
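A small illustration of the fallback rule being relied on here:
```python
class Foo:
    def __repr__(self):
        return "Foo(<useful debug info>)"

f = Foo()
print(str(f))   # Foo(<useful debug info>) -- str() falls back to __repr__
print(repr(f))  # Foo(<useful debug info>) -- what the debugger shows
```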
The specific classes changed were all in `torch._dynamo.variables`:
* `builtin.BuiltinVariable`
* `constant.ConstantVariable`
* `constant.EnumVariable`
* `functions.UserMethodVariable`
* `lazy.LazyVariableTracker`
* `lazy.LazySymNodeFormatString`
* `misc.GetAttrVariable`
* `misc.NullVariable`
* `user_defined.UserDefinedObjectVariable`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136316
Approved by: https://github.com/XuehaiPan, https://github.com/jansel
Type annotations for compile_fx.
- Some of the stuff here is pretty complicated (functions which return functions that take functions) so I bailed on those and used `Any` just to get the rest landed.
- There are also changes to type signatures in other files which I did just to let mypy know more about the types in compile_fx.py.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138033
Approved by: https://github.com/Skylion007
This PR contains multiple fixes for issue https://github.com/pytorch/pytorch/issues/135279:
## First part:
Moves the GPU guard (`cudaSetDevice`) before the `currentStreamCaptureStatusMayInitCtx` call.
As its name suggests, it May Init Ctx.
## Second part:
Even with the above fix, additional contexts are still observed during Work object destruction, e.g.
```
work = dist.all_reduce(tensor, async_op=True)
time.sleep(5) <-- no additional context yet
del work <-- additional context shows up
```
### Debug process
Chasing it down to destruction of a `Future` object -- a member variable of `Work`.
Then further down to the following member of `Future`:
```
std::vector<c10::Event> events_;
```
When the `events_` are destroyed, we hit the road down to:
1f3a793790/c10/cuda/impl/CUDAGuardImpl.h (L106-L121)
When there is no "preset" CUDA context (**which is the case for python garbage collector**), line 112: `c10::cuda::GetDevice(&orig_device)` will set `orig_device` to 0. Then, at line 120, `c10::cuda::SetDevice(orig_device)` will "officially" set the context to device 0 --
**that's where rank 1, 2, ... can create extra context on device 0!**
### Solution
This PR adds an explicit destructor to `Future`. In this destructor, destroy each event with a device guard.
## Test
Added test_extra_cuda_context, implemented via
- `pynvml` (if available), or
- memory consumption check.
`python test/distributed/test_c10d_nccl.py -k test_extra_cuda_context`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135273
Approved by: https://github.com/fduwjj, https://github.com/wconstab, https://github.com/eqy
ghstack dependencies: #137161
Co-authored-by: Will Feng <yf225@cornell.edu>
Instead of calling `safe_expand` right after symbolic expression construction, we invoke it in `ShapeEnv.simplify`. This enables more simplification with product form, e.g.,
```
(a + b)^2 / (a + b) --> (a + b)
```
which won't happen if we expand eagerly during product construction:
```
(a^2 + 2ab + b^2) / (a + b) --> no change
```
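For reference, a small SymPy illustration (the library the ShapeEnv builds on) of why delaying expansion helps:
```python
import sympy

a, b = sympy.symbols("a b", positive=True)

# Keeping the product form lets the quotient cancel on construction:
print((a + b) ** 2 / (a + b))       # a + b

# Eagerly expanding the numerator first prevents that automatic cancellation:
num = sympy.expand((a + b) ** 2)    # a**2 + 2*a*b + b**2
print(num / (a + b))                # (a**2 + 2*a*b + b**2)/(a + b)
```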
Fixes #136044.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138235
Approved by: https://github.com/ezyang
replace_by_example is used to implement some pattern-matching passes in inductor. Previously, replace_by_example would generate nodes with very little metadata. In particular, `meta["original_aten"]` would be lost; that meant that when generating triton kernel names, you could get empty names like `triton_tem_fused_0` if the input nodes to the fused kernel were the result of a pattern-matching pass that used replace_by_example.
This also adds metadata to register_replacement patterns, including pad_mm.
This fixes the issue by copying metadata from the original node to the replacement nodes. If there are multiple original nodes we skip the metadata transfer; so if you have a `add(z, mm(x, y))`, then the metadata won't be transferred right now.
Differential Revision: [D64480755](https://our.internmc.facebook.com/intern/diff/D64480755)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138089
Approved by: https://github.com/aakhundov
Summary: same as title. The plan is to pass a callable to the partitioner to perform custom autoAC via an ILP. This is the same as a previous diff, D63714905, which was landed and then subsequently reverted by PyTorch Release Engineering because of a failing unit test (f7b8d36c28). We think the unit test is buggy, and we fix it here as well.
Test Plan: tbd
Differential Revision: D64246495
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137785
Approved by: https://github.com/basilwong
Summary: The problem happened after splitting CppWrapperCpu and CppWrapperCpuArrayRef, because CppWrapperCpuArrayRef.generate_index_put_fallback missed a statement. Running test_aot_inductor.py as a whole didn't reveal the problem, but running test_index_put_with_none_index_cpu_with_stack_allocation individually did. Digging deeper, the root cause is init_backend_registration has incorrectly cached CPU CppWrapperCodegen class, which means CppWrapperCpuArrayRef was never picked when running test_aot_inductor.py as a whole.
Differential Revision: [D64598714](https://our.internmc.facebook.com/intern/diff/D64598714)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138303
Approved by: https://github.com/hl475
As `backend`, `pg_options`, and `group_desc` are the same for each mesh dimension, we don't need to get or create these args for `new_group` multiple times. This PR moves it from the inner loop of the subgroup creation (each subgroup ranks of each mesh dimension) to the outer loop (each mesh_dimension).
For example, given we have a 2 * 4 DeviceMesh, we are re-creating the variables `backend`, `pg_options`, and `group_desc` 2*4 = 8 times. After the change, we only create these variables once per mesh dimension, which is 2 times.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138117
Approved by: https://github.com/kwen2501
replace_by_example is used to implement some pattern-matching passes in inductor. Previously, replace_by_example would generate nodes with very little metadata. In particular, `meta["original_aten"]` would be lost; that meant that when generating triton kernel names, you could get empty names like `triton_tem_fused_0` if the input nodes to the fused kernel were the result of a pattern-matching pass that used replace_by_example.
This also adds metadata to register_replacement patterns, including pad_mm.
This fixes the issue by copying metadata from the original node to the replacement nodes. If there are multiple original nodes we skip the metadata transfer; so if you have a `add(z, mm(x, y))`, then the metadata won't be transferred right now.
Differential Revision: [D64480755](https://our.internmc.facebook.com/intern/diff/D64480755)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138089
Approved by: https://github.com/aakhundov
Fixes: https://github.com/pytorch/pytorch/issues/138069
I tested this by running `python test/inductor/test_torchinductor_dynamic_shapes.py DynamicShapesCpuTests.test_builtins_round_float_ndigits_pos_dynamic_shapes_cpu` before and after the change and verifying no more log spew.
I'm uncertain whether it makes sense to add a test for this PR. Question for reviewers: is there a standard paradigm for testing these log-spew-based fixes? Happy to add a test if someone can point me in the right direction.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138356
Approved by: https://github.com/ezyang
This fixes an internal crash due to an invalid buffer size computation when the sliced API is used.
Not sure what the purpose was of
```c++
IntArrayRef baseShape;
if (src.is_view()) {
baseShape = src._base().sizes();
} else {
baseShape = getIMPSAllocator()->getBufferShape(src.storage().data());
}
int flattenedShaped = 1;
for (const auto i : c10::irange(baseShape.size())) {
flattenedShaped *= baseShape[i];
}
```
As `flattenedShaped` could be computed much more easily as `[srcBuf length] / src.element_size()`, and even if `srcBuf` is padded it's a safe thing to do.
When someone allocated a buffer to hold, say, uint8 and then view-casted it to float16, the attempt to compute `baseShape` returned the sizes of the original tensor in its data type, rather than the size in the new dtype.
Fixes https://github.com/pytorch/pytorch/issues/137800
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138314
Approved by: https://github.com/albanD, https://github.com/DenisVieriu97
## Summary
We are currently [updating](https://github.com/conda-forge/pytorch-cpu-feedstock/pull/277) the [`conda-forge::pytorch`](https://anaconda.org/conda-forge/pytorch) package to version 2.5.0. This update includes a new dependency, the third_party/NVTX submodule. However, like other package management frameworks (e.g., apt), conda-forge prefers using system-installed packages instead of vendor-provided third-party packages.
This pull request aims to add an option, `USE_SYSTEM_NVTX`, to select whether to use the vendored nvtx or the system-installed one, with the default being the vendored one (which is the current behavior).
## Test Plan
The `USE_SYSTEM_NVTX` option is tested by building the `conda-forge::pytorch` package with the change applied as a [patch](cd1d2464dd/recipe/patches/0005-Use-system-nvtx3.patch).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138287
Approved by: https://github.com/albanD
Fixes https://github.com/pytorch/pytorch/issues/137856.
### Issue 1
Today under `ProcessGroupNCCL::Options`, color is declared as:
```
int64_t split_color{0};
```
When passing this variable to `ncclCommSplit` which accepts `int`, the value may overflow and become negative, as in #137856. But NCCL API only accepts non-negative colors (or `NCCL_SPLIT_NOCOLOR`).
But that's not all.
### Issue 2
`split_color` is pybind'ed to python frontend. If we just change from `int64_t` to `int` in C++, pybind will complain:
```
[rank0]: TypeError: (): incompatible function arguments. The following argument types are supported:
[rank0]: 1. (self: torch._C._distributed_c10d.ProcessGroupNCCL.Options, arg0: int) -> None
```
This is because a Python `int` represents a wider range than a C++ `int`. So we cannot pass hash values -- which are potentially big ints -- from Python to C++. The PR takes the hash value modulo `c_int`'s max value.
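A minimal sketch of the clamp described above (the helper name is illustrative, not the actual code):
```python
import ctypes

# Python ints are unbounded, so a hash must be folded into the non-negative
# range of a C `int` before it is assigned to the pybind'ed split_color field.
_MAX_C_INT = 2 ** (8 * ctypes.sizeof(ctypes.c_int) - 1) - 1  # typically 2**31 - 1

def to_split_color(group_name: str) -> int:
    return hash(group_name) % _MAX_C_INT  # always in [0, _MAX_C_INT)
```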
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137855
Approved by: https://github.com/wconstab
The rationale behind this PR is to:
1. Move the dump of C++ traces after the FR dump, because the FR dump is timed (meaning it will not block forever) while the dumping of C++ traces is likely to be blocking, so we swap the order. Ideally we also want to make the C++ stacktrace dump a future wait; if we want to go down that path, we can make it happen in another PR.
2. Add the log prefix to the logs that do not have it yet.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138368
Approved by: https://github.com/c-p-i-o
Summary:
This change unblocks the CFR AOTI lowering runtime error.
TL;DR:
In this model, one Triton kernel expects a scalar input with dtype i64 but gets an i32. The reason is that "auto" can infer a smaller data type if the variable passed in is, e.g., i32, which then causes a CUDA IMA.
Original problematic kernel: `triton_poi_fused_add_ge_logical_and_logical_or_lt_46_grid_100`.
This diff manually casts all symbolic arguments to i64 at compile time for i64 Triton kernel inputs, instead of using `auto var_x = {arg}` in the cpp wrapper code.
Test Plan:
Verified in FLB locally:
```
PYTORCH_NO_CUDA_MEMORY_CACHING=1 AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=3 TORCH_LOGS="output_code" TORCHINDUCTOR_MAX_AUTOTUNE=1 TORCH_SHOW_CPP_STACKTRACES=1 CUDA_LAUNCH_BLOCKING=1 ~/fbsource/buck-out/v2/gen/fbcode/98e643f8bb44fe9d/hpc/new/models/feed/benchmark/__feed_lower_benchmark__/feed_lower_benchmark.par --skip-eager --skip-flop-estimation --lower-backend="AOT_INDUCTOR" --sync-mode=0 --precision bf16 --output-precision bf16 --lower-presets="ifr_cint;disable_new_lowering_weights;disable_dper_passes:passes=fuse_parallel_linear_no_weight_change" --remove-unexpected-type-cast=False --load="manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/924293663/0/gpu_lowering/input.merge"
```
Differential Revision: D64490039
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138106
Approved by: https://github.com/ColinPeppler
Fixes #138211
The `Path.rename` function has Windows-specific behavior: it raises `FileExistsError` when the target file already exists.
This does not happen on Linux, so I wrote a small reproducer to figure out what is happening.
After stepping through the repro code:
```python
import os
import shutil
import sys
from pathlib import Path

_IS_WINDOWS = sys.platform == "win32"

def test_case():
    cwd = os.getcwd()
    path1 = os.path.join(cwd, "haha1.txt")
    path2 = Path(os.path.join(cwd, "haha2.txt"))
    try:
        path2.rename(path1)
    except FileExistsError as e_file_exist:
        if _IS_WINDOWS:
            # on Windows file exist is expected: https://docs.python.org/3/library/pathlib.html#pathlib.Path.rename
            shutil.copy2(path2, path1)
            os.remove(path2)
        else:
            raise e_file_exist
    except BaseException as e:
        raise e
    print("run here.")

if __name__ == "__main__":
    test_case()
```
We found that the code `path2.rename(path1)` can be broken down into:
1. copy file2's content to file1.
2. delete file2.
So we can implement equivalent code on the Windows path:
```python
shutil.copy2(src=tmp_path, dst=path)
os.remove(tmp_path)
```
That gives us the current PR.
TODO: needs cherry-pick to the release/2.5 branch, CC: @atalman.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138331
Approved by: https://github.com/malfet
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Enable -Werror on s390x
Example of original issue on s390x:
https://github.com/pytorch/pytorch/actions/runs/11014606340/job/30585632704
Most of the warnings are not specific to s390x but to gcc-13 or gcc-14; to reproduce them on s390x, an image with gcc-13 is needed. On s390x, new regressions are checked on every merge via the trunk workflow.
`-Wdangling-reference` produces either obviously false warnings or suspicious warnings, which on closer inspection look plausibly safe.
`-Wredundant-move` with new gcc complains about `std::move(...)` disabling copy elision. But removing `std::move(...)` makes used clang versions complain about copying objects when they could be moved. For now also disable it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136527
Approved by: https://github.com/malfet
Add an optional `eager_init` flag to `with_comms`.
When `eager_init` is True and backend is `nccl`, we pass the `device_id` to `init_process_group()` for eager initialization.
Otherwise, `device_id` is still `None` and this goes through the normal lazy call.
Default for `eager_init` is False.
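A minimal sketch of how the flag might be used in a test (this assumes `with_comms` accepts the flag as a decorator argument; the class and test names are illustrative):
```python
from torch.testing._internal.distributed._tensor.common_dtensor import (
    DTensorTestBase,
    with_comms,
)

class EagerInitTest(DTensorTestBase):
    @with_comms(eager_init=True)  # passes device_id to init_process_group up front
    def test_something(self):
        ...  # comms are already initialized eagerly here
```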
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138108
Approved by: https://github.com/kwen2501
Summary: I have to keep bypassing issues because of these clang rules. Let's start with all of the bug-related checks rather than the variable-name ones, because the latter would introduce a lot of lines of code and can make things hard to read.
Test Plan: Format tests pass.
Differential Revision: D64411171
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138296
Approved by: https://github.com/aaronenyeshi, https://github.com/Skylion007
This diff is the starting steps of https://docs.google.com/document/u/2/d/1kAEBt4AyW7HTAhXHbjoz8FBFHNyyEA2Qo2mPn7v3WUQ/edit?usp=drive_web&ouid=113555078003219714709
It implements the following changes:
- Only log spans to scuba, so no start events are ever logged
- Log events as the full event name, without "START" or "END"
- Only log to scuba major phases from chromium events. These are:
- entire_frame_compile (dynamo)
- backend_compile (aotdispatch)
- inductor_compile (inductor)
- codegen (inductor codegen)
Tlparse chromium events stay basically the same. But I implemented a few changes to clean that up as well:
- When there's a phase name available, log the phase name instead of the function name as the event name. This simplifies the trace so it doesn't have two identical rows. The fn_name is available as metadata on the chromium event, if interested.
- Log new events for pre and post grad passes. These do *not* log to scuba.
By making the phases much simpler in Scuba, with only categories for major phases of PT2 Compilation, we pave the way to add **much** more metadata and information to each individual event type. Diffs for that will come later.
**IMPLEMENTATION NOTES:**
- The logic for `log_chromium_event_internal` (which is the function that logs to Scuba) lives in chromium_events for now, but in the future as we add more metadata, it may belong independently in dynamo_timed or even outside of dynamo_timed. I haven't explored in detail what the refactor will look like. Once we start logging metadata for dynamo, aotdispatch, inductor, I suspect we will call log_pt2_compile_event directly, instead of making chromium event logger handle the pt2_compile_event logic. But that refactor is left for another PR on top of this one.
- There's an interesting space after pre grad passes within AOT autograd logic, that's between create_aot_dispatcher_function and pre grad passes. I'm not sure what we're spending time doing in that time, but I'll find out with a profile later.
Differential Revision: [D64479033](https://our.internmc.facebook.com/intern/diff/D64479033/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138093
Approved by: https://github.com/ezyang
This patch addresses parts of the side-effect refactor proposed in #133027;
specifically, it does 3 things:
1. Change `SideEffects.store_attr_mutations` and `PyCodegen.tempvars`
to index on `VariableTracker` rather than `MutableLocalBase`.
2. Remove the `source` field from `MutableSideEffects` and
`AttributeMutation`, and use `VariableTracker.source` instead.
3. Plumb a `overridden_sources: Dict[Source, Source]` from
`handle_aliases_for_stolen_lists` to `PyCodegen` so that we don't
update `VariableTracker.source` in place, while still preserving what
`handle_aliases_for_stolen_lists` needed (i.e., modifying codegen for
certain `VariableTracker`).
(1) and (2) are merged in 1 patch because of some dependency between
a. `OutputGraph.handle_aliases_for_stolen_lists` which iterates over
`SideEffects.store_attr_mutations.keys()`, and potentially updates
its source field to be completely different.
b. `SideEffects.codegen_update_mutated`, which happens after the above
and uses `cg(var.mutable_local.source)`.
where if we apply (1) only, (b) breaks, and if we apply (2) only, (a)
breaks.
(3) is needed for correctness, see comments in the PR for details.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137905
Approved by: https://github.com/jansel, https://github.com/anijain2305, https://github.com/mlazos
Removing just the LF canary scale config for now to test the changes in https://github.com/pytorch/test-infra/pull/5767
Those changes have been deployed to prod and appear to be working, but this will be the final proof that it is in fact reading the test-config version of scale-config and not the pytorch/pytorch copy.
Note: This will break the Scale config validation workflow on test-infra, but it's worth it since this test will be very short lived and that workflow only runs when someone modifies scale config
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138361
Approved by: https://github.com/wdvr
Add an additional check that scalars wrapped to 0-D tensors by dynamo are actually 0-D. This fixes a bug where a 1-D tensor was mistakenly converted to a scalar value rather than passed as a pointer.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137303
Approved by: https://github.com/eellison
ghstack dependencies: #135701
Summary: From experiment, it seems like aten.constant_pad_nd has better QPS compared to torch.cat. The qps gain for ig ctr is ~10%, and ~5% for oc.
Test Plan:
```
buck2 run mode/opt -c fbcode.nvcc_arch=a100 //caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark -- --model-path=manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/585279927/480/gpu_lowering/input.predictor.disagg.gpu.merge --lower-backend=AOT_INDUCTOR
```
```
buck2 run mode/opt //caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark -- --model-path=manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/588102397/1500/gpu_lowering/input.predictor.disagg.gpu.merge --lower-backend=AOT_INDUCTOR
```
Differential Revision: D64271583
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137820
Approved by: https://github.com/eellison
In this PR, we implement a lazy dictionary for export decomp behaviour for the following reasons:
1. Custom op loading can happen after import time; as a result, the decomp table might not be able to pick up the decomp. Therefore we try to delay materialization as late as possible.
I intentionally separated out the core_aten_decomp to not have any custom CIA ops in this PR to mitigate the risk of getting reverted, but in the future, core_aten_decomp under torch/_decomp will exist as an alias to the official export table (torch.export.default_decompositions).
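A hedged sketch of the intended usage (the module is a placeholder; `torch.export.default_decompositions` is the export-facing table referred to above, and its lazy materialization is what lets late-registered custom ops still be picked up):
```python
import torch

class M(torch.nn.Module):
    def forward(self, x):
        return x.relu() + 1

ep = torch.export.export(M(), (torch.randn(2, 2),))
table = torch.export.default_decompositions()   # lazy table, materialized here
ep = ep.run_decompositions(decomp_table=table)
```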
Differential Revision: [D64140807](https://our.internmc.facebook.com/intern/diff/D64140807)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137650
Approved by: https://github.com/justinchuby, https://github.com/bdhirsh
Skip over commands that do not have an output file specified.
Recently I've noticed that `generate_torch_version.py` started to run on every rebuild, which results in a failed plan for debug-info rebuilds.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138290
Approved by: https://github.com/Skylion007
## `VariableTracker::build()` hides the Builders
### The problem
In the current code, creating a `VariableTracker` involves choosing one of two `Builder` classes and either calling a method, or calling a constructor that creates an object that you immediately call, [like this](083c9149b7/torch/_dynamo/variables/functions.py (L761-L768)).
Variations on this code are repeated in many places.
More, the `Builder` classes have a lot of dependencies, so they have to be loaded late in the whole import process to avoid circular imports, so they end up being repeatedly imported at local scope.
### The solution
In this commit, the import from `builder` and the logic of choosing and calling the Builder class are hidden in a single static factory method, `VariableTracker.build()`, easier to reason about and to import.
This commit net lowers the total lines of code by over 150 lines by removing repetitive logic and unnecessary local imports.
**CHANGES:** Originally the name of the static method was `VariableTracker.create()` but a static method on a derived class, `LazyVariableTracker.create()` now exists with a different signature that's irreconcilable, so the new static method was renamed to `VariableTracker.build()`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135714
Approved by: https://github.com/jansel
Previously we would error when trying to preserve the call signature for a module when it was called multiple times. This PR can now do this without erroring. The fix is to propagate call indices in a few more places.
Note that while this works in the presence of params, buffers, and tensor constants, preserving call signatures for multiple calls to a module when buffers are mutated is not supported yet. This is future work. The main problem is that we do not have enough metadata to `copy_` mutated buffers at the end of each call to a module, so the next call can read those buffers at the beginning. Making this work will likely need some explicit tracking of intermediate values of mutated buffers when collecting metadata during functionalization in export.
Note also that we stop short of creating a single graph out of multiple graphs: that is still future work. So the unflattened module will still have different targets `n`, `n@1`, `n@2`, etc. for each call when we ask the module call signature of `n` to be preserved. However it is way easier to swap all of these targets with a replacement that behaves similar to the original, because all of these calls will respect the original module call signature. (In particular, any constant inputs will be carried by the calls.)
Differential Revision: D64406945
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137999
Approved by: https://github.com/tugsbayasgalan
This adds Dynamo tracing support for the host-side Triton TMA API (see `create_2d_tma_descriptor` calls on the host in the [Triton tutorial](https://triton-lang.org/main/getting-started/tutorials/09-persistent-matmul.html#sphx-glr-getting-started-tutorials-09-persistent-matmul-py)). A few notes:
- Here we assume the availability of the host-side TMA API added to upstream Triton in https://github.com/triton-lang/triton/pull/4498. As of time of writing, this is not a part of the PT2 OSS Triton pin (although back-ported internally). OSS Triton pin update should be done in December 2024.
- Due to the Dynamo support implemented in the previous PR, the `tma_descriptor_metadata` dict is delivered to the `triton_kernel_wrap_` lowering and passed to `ir.UserDefinedTritonKernel` as an additional argument.
- Looking into the `tma_descriptor_metadata`, `ir.UserDefinedTritonKernel` substitutes the corresponding `TensorBox` arguments of the kernel (swapped upstream in Dynamo) by the new `ir.TMADescriptor` nodes implementing TMA descriptors in Inductor IR.
- `ir.TMADescriptor.__init__` provides the wiring between the upstream underlying `ir.TensorBox` and the downstream `ir.UserDefinedTritonKernel` kernel. In particular, we use `ir.NonOwnedLayout` wrapping `ir.ReinterpretView` to avoid the upstream tensor's buffer being deleted prematurely (before the TMA descriptor is used in the Triton kernel).
- Via `ir.TMADescriptor.codegen`, the Triton's `create_{1d,2d}_tma_descriptor` function call is codegened in the wrapper (in the host code).
- New `TMADescriptorArg` dataclass is added to handle the Triton kernel metadata pertinent to host-side TMA.
- AOT Inductor support will be implemented in a follow-up PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137950
Approved by: https://github.com/eellison
ghstack dependencies: #137677
Fix TODO in code
```python
# TODO: create an internal helper function and extract the duplicate code in FP16_compress and BF16_compress.
```
1. Extract common logic in `fp16_compress_hook` and `bf16_compress_hook` to `_compress_hook` method
2. Let `fp16_compress_hook` and `bf16_compress_hook` invoke `_compress_hook` with different `dtype`s (see the sketch below)
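A minimal structural sketch of the refactor (the real helper lives in `torch.distributed.algorithms.ddp_comm_hooks.default_hooks`; the body is elided here, and the `(process_group, bucket)` signature follows the DDP comm-hook convention):
```python
import functools

import torch

def _compress_hook(dtype: torch.dtype, process_group, bucket):
    # Cast the bucket to `dtype`, all-reduce it asynchronously, then cast the
    # result back to the original dtype when the future completes.
    ...

fp16_compress_hook = functools.partial(_compress_hook, torch.float16)
bf16_compress_hook = functools.partial(_compress_hook, torch.bfloat16)
```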
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138182
Approved by: https://github.com/awgu
By guarding the calls to `-[MTLCompileOptions setFastMathEnabled]` with `C10_DIAGNOSTIC_PUSH` and `POP`
and using `-[MTLCompileOptions setMathMode:]` and `-[MTLCompileOptions setMathFloatingPointFunctions:]` on MacOS15
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138238
Approved by: https://github.com/atalman
Summary:
Our watchdog does not clearly differentiate timeouts from NCCL errors, in terms of both logs and code paths.
It's important for c10d to differentiate the different reasons for watchdog failures, e.g. timeouts vs. NCCL errors, and possibly let users handle the errors differently depending on the type of error.
Test Plan:
UT
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138240
Approved by: https://github.com/Skylion007
Summary: logs if an operator is run with the TorchScript runtime, using a thread_local variable set in `InterpreterState.run()`
Test Plan: buck2 run mode/dev-nosan caffe2/torch/fb/observers:scuba_observer_runner
Reviewed By: zou3519
Differential Revision: D64200781
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137986
Approved by: https://github.com/angelayi
Dynamo stance is recently added in https://github.com/pytorch/pytorch/pull/137504. When Dynamo stance is "force_eager" (user explicitly wants to fall back to eager), we would like Compiled Autograd to fall back to eager as well. This will allow the Traceable FSDP2 use case to work since "eager forward + compiled autograd backward" is not supported for Traceable FSDP2.
In general, if user wants to do "eager forward + compiled autograd backward", they should explicitly run the forward in eager instead of applying compile and then set stance to "force_eager".
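A rough illustration of the interaction (in recent releases the stance API is exposed via `torch.compiler.set_stance`; treat the exact entry point as an assumption):
```python
import torch

@torch.compile
def step(x):
    return x.sin().sum()

torch.compiler.set_stance("force_eager")
out = step(torch.randn(4, requires_grad=True))
# Forward runs eagerly; with this change, compiled autograd also falls back to
# eager for the backward instead of being unsupported.
out.backward()
```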
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138113
Approved by: https://github.com/xmfan
This fixes 4 main issues:
- The way the cuda sanitizer handles its state is weird. In particular, because the lifetime of the Mode is linked to the submodule, it might outlive the python runtime and other loaded modules. On my current version, this even outlives the "sys" module. Given that I'm not sure of the impact of changing this lifetime handling, I'm making the exit handler a no-op when python is already dying, since there is no point cleaning up then.
- Adds a "disable" method to be able to test after the mode is enabled.
- Fix `Tensor.as_subclass()` to properly disable modes when creating the new Tensor object, just like we already do in `make_subclass` and `make_wrapper_subclass`. The change here is just to apply the exact same treatment to it.
- ~Fix `Tensor.as_subclass()` not to propagate autograd, as there is no valid backward associated here.~ We have tests that check that this behavior happens, so I guess this is not an obvious bugfix and is expected behavior. Reverted that change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138218
Approved by: https://github.com/ngimel
Previously the decomposition would upcast inputs to fp32. This led to a slowdown compared to eager, which would run in fp16. We also tried keeping the bmm in fp16 and upcasting only for the epilogue, but that led to worse numerics because the bmm in eager does the epilogue entirely in fp32 without a downcast in the bmm accumulator.
Fix for https://github.com/pytorch/pytorch/issues/137897
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137904
Approved by: https://github.com/ngimel
This is only a minor patch that I hope will change how I talk to contributors when lint fails, so that I can tell them to read the logs about lintrunner. There have been too many times when I have had to click "approve all workflows" just for lint to fail again because the developer is manually applying every fix and using CI to test. I understand there are times when lintrunner doesn't work, but I'd like most contributors to at least give it a try once to start.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138232
Approved by: https://github.com/kit1980, https://github.com/Skylion007
Compiled Autograd uses this AOT inference path, but it shows up as "aot_forward_graph" in tlparse output, which causes it to not be easily differentiable from normal "aot_forward_graph"s that are also in the tlparse output. This PR renames it to "aot_inference_graph" which makes it easier to tell which tlparse graph block is from Compiled Autograd.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137803
Approved by: https://github.com/Microve, https://github.com/bdhirsh, https://github.com/ezyang
There's an annoying pattern emerging for pulling out the NJT min / max seqlen ints if they exist without computing / caching if they don't. This PR introduces private convenience functions to simplify handling this and avoiding redundant checks.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138130
Approved by: https://github.com/soulitzer
Summary:
Support autocast re-tracing by giving it the same treatment as set_grad.
In re-tracing, when dynamo encounters an autocast HOP, we want it to trace through `with torch.autocast()` again, and replace the HOP with the traced subgraph.
Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export -- -r test_export_with_autocast
```
Differential Revision: D63856081
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138082
Approved by: https://github.com/ydwu4
Summary:
1) Add sleef back to enable SIMD on AMD
2) adding kpack to triton compute_meta for AMD triton, since there will be user-defined triton kernels using this for k-dim packing
Test Plan:
```
HIP_VISIBLE_DEVICES=0 TORCHINDUCTOR_UNIQUE_KERNEL_NAMES=1 TORCH_LOGS="output_code,graph_code" buck run mode/{opt,amd-gpu} -c fbcode.triton_backend=amd -c fbcode.enable_gpu_sections=true //hpc/new/models/feed/benchmark:feed_lower_benchmark -- --skip-flop-estimation --skip-trt --skip-ait --enable-aot-inductor --sync-mode=0 --gpu-trace --sample-input-tile-factor=1 --load="manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/925729118/0/gpu_lowering/input.merge" --lowering-input-str='{"serialized_inference_model_input_path":"ads_storage_fblearner/tree/user/facebook/fblearner/predictor/925729118/0/gpu_lowering/input.merge","serialized_inference_model_output_path":"ads_storage_fblearner/tree/user/facebook/fblearner/predictor/925729118/0/gpu_lowering/mi300_output.merge","submodule_names_to_lower":["merge"],"inductor_lowering_context":{"aot_inductor_lowering_settings":{"use_scripting":true,"preset_lowerer":"ifu_cint;disable_new_lowering_weights;disable_dper_passes:passes=fuse_parallel_linear_no_weight_change","precision":3,"output_precision":3, "remove_unexpected_type_cast":false, "sample_input_tile_factor":32}},"model_entity_id":925729118,"model_snapshot_id":0,"add_sample_inputs":false,"hardware_type":0,"platform_arch":1,"dense_in_place_format":2}' --precision=bf16 2>&1 | tee local_benchmark_log.txt
```
Differential Revision: D64262924
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137939
Approved by: https://github.com/frank-wei
Summary:
Currently, https://fburl.com/code/uka25j1i checks whether the guarded object supports weakref by looking at its `__class__`
```
if hasattr(guarded_object.__class__, "__weakref__") and not isinstance(
    guarded_object, enum.Enum
):
    obj_ref = weakref.ref(guarded_object)
```
However, we have reason to modify this slightly because we use classes that "pretend" to be some other classes (e.g. nn.Parameter). Example https://fburl.com/code/8bcktgoh :
```
class QuantizedWeights:
    # TODO: Ugly trick so torch allows us to replace parameters
    # with our custom weights. Do this properly.
    @property
    def __class__(self) -> Type[nn.parameter.Parameter]:
        return nn.Parameter

    @property
    def grad_fn(self) -> None:
        return None
```
For example, Fp8RowwiseWeights, which inherits from the base class above and also from namedtuple, actually does not have a `__weakref__` attribute, but its "class" will say it does.
I think the easiest change is to use instance-level checking rather than class-level
```
if hasattr(guarded_object, "__weakref__") ...
```
But I'm wondering if this will harm any of the existing behaviors.
I'd appreciate reviews from the experts
(I just added all recommended reviewers since I'm not sure who is the best person to consult...)
Test Plan: CI?
Reviewed By: YJYJLee
Differential Revision: D64140537
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137838
Approved by: https://github.com/williamwen42, https://github.com/jansel
`test_replicate_with_compiler.py` and `test_fully_shard_compile.py` requires bf16, so needs to be run within test_inductor_distributed job (which uses A10G (SM80) and has bf16 support).
This allows us to migrate distributed jobs to T4 machines in https://github.com/pytorch/pytorch/pull/137161, as the compiled distributed jobs are the only blocking ones now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138178
Approved by: https://github.com/xmfan
The suggestions unusably clog up early draft PRs that are not necessarily lint clean yet. Making matters worse, even if I fix them I have to manually click through hundreds of comments to "Resolve" them even though I've fixed it. Disabling it on ghstack helps, but I occasionally do standard PRs via fbcode export mechanism. Opt me out.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138054
Approved by: https://github.com/huydhn, https://github.com/malfet, https://github.com/PaliC
Summary: Float8 is becoming an increasingly popular datatype now that it is well supported on GPUs. This diff enables FP8 to work with `torch.cat`. This is pretty straightforward since memory operations don't vary based on the input dtype, but it can be quite helpful for FP8-based models.
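A quick illustration of the behavior this enables (float8_e4m3fn is used as an example dtype; availability depends on the build):
```python
import torch

a = torch.randn(4, 8).to(torch.float8_e4m3fn)
b = torch.randn(2, 8).to(torch.float8_e4m3fn)
out = torch.cat([a, b], dim=0)   # previously unsupported for float8
print(out.shape, out.dtype)      # torch.Size([6, 8]) torch.float8_e4m3fn
```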
Test Plan:
```
buck2 run mode/opt -c fbcode.enable_gpu_sections=true -c fbcode.platform=platform010 -c fbcode.nvcc_arch=h100a -c fbcode.platform010_cuda_version=12 //caffe2/test:tensor_creation -- -r test_cat_all_dtypes_and_devices
```
Differential Revision: D64443965
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138046
Approved by: https://github.com/eqy, https://github.com/qchip, https://github.com/jianyuh
The current command for setting the `CMAKE_PREFIX_PATH` environment variable overwrites any values it had already been set to. Changing it to use `:` appends the conda env search path to the existing values, avoiding library-not-found issues.
`export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}:${CMAKE_PREFIX_PATH}`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134934
Approved by: https://github.com/malfet, https://github.com/EikanWang
Dynamo stance is recently added in https://github.com/pytorch/pytorch/pull/137504. When Dynamo stance is "force_eager" (user explicitly wants to fall back to eager), we would like Compiled Autograd to fall back to eager as well. This will allow the Traceable FSDP2 use case to work since "eager forward + compiled autograd backward" is not supported for Traceable FSDP2.
In general, if user wants to do "eager forward + compiled autograd backward", they should explicitly run the forward in eager instead of applying compile and then set stance to "force_eager".
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138113
Approved by: https://github.com/xmfan
ghstack dependencies: #138105
Our matmul support is abysmal - let's at least get this working and do it performantly later.
Bonus: implements `bmm` as well.
jagged <-> padded dense conversions are utilized when possible, and an unbind-based fallback otherwise (the former works with torch.compile and the latter doesn't). Some testing is missing because we don't have factory function support yet :(
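A rough usage sketch of what this enables (hedged: shapes and values are made up, and this assumes a recent build with the jagged layout available):
```python
import torch

# matmul between a jagged-layout nested tensor and a dense weight
a = torch.nested.nested_tensor(
    [torch.randn(3, 8), torch.randn(5, 8)], layout=torch.jagged
)
w = torch.randn(8, 4)
out = torch.matmul(a, w)  # still jagged, last dim becomes 4
print(out.is_nested, out.size(0), out.size(-1))  # True 2 4
```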
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138121
Approved by: https://github.com/cpuhrsch
Summary:
Tensor constants can show up through wrapped methods, so they may not always be found in constant attributes. They nevertheless need to be fakified and their meta vals need to be found to create graph signatures; otherwise non-strict export fails.
Longer term maybe we should pull this fakification up in non-strict.
Test Plan: added test
Differential Revision: D64480272
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138091
Approved by: https://github.com/tugsbayasgalan
Fixes https://github.com/pytorch/pytorch/issues/136640
Today, inductor has some logic to figure out when it needs to do broadcasting during lowering, which just checks if any of the input shapes have sizes equal to 1.
In particular: we should already have this information by the time we get to inductor, because our FakeTensor compute will have branched/guarded on whether any ops performed broadcasting, appropriately.
In particular, if we have a tensor with a size value of `(64//((2048//(s3*((s2//s3))))))`, and it happens to be equal to one (and it is used in an op that requires this dim to be broadcasted), FakeTensorProp will have generated a guard:
```
Eq((64//((2048//(s3*((s2//s3)))))), 1)
```
I chose the simplest possible way to beef up inductor's checks to know when a given size is equal to 1: loop over the existing shape env guards, and if our current size is a sympy expression on the LHS of one of our `Eq(LHS, 1)` guards, then return True.
I'm hoping for feedback on whether or not this approach is reasonable. One better option I could imagine is that our symbolic reasoning should have automatically simplified the size of our tensor down to a constant as part of evaluating that guard. I was originally going to try to do this directly in the shape env, but I ran into a few issues:
(1) I wanted to call some version of `set_replacement(expr, 1)`. But `set_replacement()` only accepts plain symbols on the LHS, not expressions
(2) in theory I could get this to work if I could rework the above expression to move everything that is not a free variable to the RHS, e.g. `Eq(s2, 32)`. It looks like our existing `try_solve()` logic is... [not quite able](https://github.com/pytorch/pytorch/blob/main/torch/utils/_sympy/solve.py#L27) to do this generally though.
Checking the guards feels pretty simple-and-easy. Are we worried that it is too slow to iterate over all the guards? I could also cache the lookup so we only need to iterate over guards that are of the form `Eq(LHS, 1)`
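A minimal sketch of the proposed guard lookup (hypothetical helper, not Inductor's actual code; it only illustrates matching a size expression against `Eq(LHS, 1)` guards):
```python
import sympy

def size_is_one_per_guards(expr, guards):
    # a size expression is known to be 1 if some guard has the form Eq(expr, 1)
    for g in guards:
        if isinstance(g, sympy.Equality) and g.rhs == 1 and g.lhs == expr:
            return True
    return False

s2, s3 = sympy.symbols("s2 s3", positive=True, integer=True)
size = 64 // (2048 // (s3 * (s2 // s3)))
guards = [sympy.Eq(size, 1)]
print(size_is_one_per_guards(size, guards))  # True
```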
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136670
Approved by: https://github.com/ezyang
This PR introduces three changes:
1. Before this PR, the subgraph's output was ([], []); in this PR, we change it to a flattened list for easier codegen and consistency with other control flow operators.
2. Before this PR, the combine_fn of scan took a sliced input but kept the sliced dimension. For example, suppose xs = torch.randn(3, 4, 5) and we scan over dim 0; the combine_fn looks like:
```
# x.shape = (1, 4, 5) instead of (4, 5)
def combine_fn(carry, x):
    ...
```
In this PR, we fixed this and also simplified some of the slicing logic.
3. This PR also makes sure we always stack ys on the first dimension.
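For intuition, an eager-mode reference of the semantics described above (illustrative only, not the actual HOP implementation):
```python
import torch

def scan_reference(combine_fn, init, xs, dim=0):
    # combine_fn sees slices *without* the scanned dim, returns a (carry, y) pair,
    # and the ys are always stacked on the first dimension
    carry = init
    ys = []
    for x in xs.unbind(dim):  # x.shape == (4, 5) for xs of shape (3, 4, 5), dim=0
        carry, y = combine_fn(carry, x)
        ys.append(y)
    return carry, torch.stack(ys, dim=0)

xs = torch.randn(3, 4, 5)
carry, ys = scan_reference(lambda c, x: (c + x, 2 * x), torch.zeros(4, 5), xs)
print(carry.shape, ys.shape)  # torch.Size([4, 5]) torch.Size([3, 4, 5])
```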
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135601
Approved by: https://github.com/zou3519
ghstack dependencies: #135600
Summary:
Record the world size in log and scuba table.
This helps us quickly figure out if there are missing flight recorder files from ranks.
Test Plan: Ran locally and noted that size was logged to scuba
Differential Revision: D64442949
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138044
Approved by: https://github.com/Skylion007
Summary:
## Why
random.fork_rng doesn't support meta device:
```
[rank0]: File "/data/users/lyu1/fbsource/buck-out/v2/gen/fbcode/581363ebaea3320a/aps_models/ads/tools/memory_estimator/__memory_estimator__/memory_estimator-inplace#link-tree/aps_models/ads/tools/memory_estimator/estimation_dense.py", line 655, in estimate_dense_memory_size
[rank0]: losses.sum().backward()
[rank0]: File "/data/users/lyu1/fbsource/buck-out/v2/gen/fbcode/581363ebaea3320a/aps_models/ads/tools/memory_estimator/__memory_estimator__/memory_estimator-inplace#link-tree/torch/_tensor.py", line 604, in backward
[rank0]: return handle_torch_function(
[rank0]: File "/data/users/lyu1/fbsource/buck-out/v2/gen/fbcode/581363ebaea3320a/aps_models/ads/tools/memory_estimator/__memory_estimator__/memory_estimator-inplace#link-tree/torch/overrides.py", line 1718, in handle_torch_function
[rank0]: result = mode.__torch_function__(public_api, types, args, kwargs)
[rank0]: File "/data/users/lyu1/fbsource/buck-out/v2/gen/fbcode/581363ebaea3320a/aps_models/ads/tools/memory_estimator/__memory_estimator__/memory_estimator-inplace#link-tree/torch/utils/_device.py", line 106, in __torch_function__
[rank0]: return func(*args, **kwargs)
[rank0]: File "/data/users/lyu1/fbsource/buck-out/v2/gen/fbcode/581363ebaea3320a/aps_models/ads/tools/memory_estimator/__memory_estimator__/memory_estimator-inplace#link-tree/torch/_tensor.py", line 613, in backward
[rank0]: torch.autograd.backward(
[rank0]: File "/data/users/lyu1/fbsource/buck-out/v2/gen/fbcode/581363ebaea3320a/aps_models/ads/tools/memory_estimator/__memory_estimator__/memory_estimator-inplace#link-tree/torch/autograd/__init__.py", line 347, in backward
[rank0]: _engine_run_backward(
[rank0]: File "/data/users/lyu1/fbsource/buck-out/v2/gen/fbcode/581363ebaea3320a/aps_models/ads/tools/memory_estimator/__memory_estimator__/memory_estimator-inplace#link-tree/torch/autograd/graph.py", line 825, in _engine_run_backward
[rank0]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[rank0]: File "/data/users/lyu1/fbsource/buck-out/v2/gen/fbcode/581363ebaea3320a/aps_models/ads/tools/memory_estimator/__memory_estimator__/memory_estimator-inplace#link-tree/torch/utils/checkpoint.py", line 1125, in unpack_hook
[rank0]: frame.recompute_fn(*args)
[rank0]: File "/data/users/lyu1/fbsource/buck-out/v2/gen/fbcode/581363ebaea3320a/aps_models/ads/tools/memory_estimator/__memory_estimator__/memory_estimator-inplace#link-tree/torch/utils/checkpoint.py", line 1507, in recompute_fn
[rank0]: with torch.random.fork_rng(
[rank0]: File "/data/users/lyu1/fbsource/buck-out/v2/gen/fbcode/581363ebaea3320a/aps_models/ads/tools/memory_estimator/__memory_estimator__/memory_estimator-inplace#link-tree/runtime/lib/python3.10/contextlib.py", line 135, in __enter__
[rank0]: return next(self.gen)
[rank0]: File "/data/users/lyu1/fbsource/buck-out/v2/gen/fbcode/581363ebaea3320a/aps_models/ads/tools/memory_estimator/__memory_estimator__/memory_estimator-inplace#link-tree/torch/random.py", line 153, in fork_rng
[rank0]: raise RuntimeError(
[rank0]: RuntimeError: torch has no module of `meta`, you should register a module by `torch._register_device_module`.
```
This blocks us from running backward() on a model with checkpointing enabled in meta mode.
## What
This diff handles the case of meta device in random.fork_rng.
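A minimal sketch of the pattern this unblocks (hedged: it assumes the fix makes fork_rng tolerate devices without RNG state, such as meta; shapes are made up):
```python
import torch

with torch.device("meta"):
    x = torch.randn(8, 8, requires_grad=True)

# Previously this raised "torch has no module of `meta`" via checkpoint's recompute path.
with torch.random.fork_rng(device_type="meta"):
    y = (x @ x.t()).sum()
print(y.device)  # meta
```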
Test Plan: Tested with toy model which has checkpoint on its module: P1641201046
Differential Revision: D64161410
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137715
Approved by: https://github.com/kit1980
Summary:
Add testing for autocast and set_grad nodes for export_for_training. In export_for_training, we do not wrap the autocast and set_grad nodes into HOPs, but we should still have the set_grad_enabled/autocast nodes.
Add support for autocast in non-strict export. Previously, `_enter_autocast` and `_exit_autocast` nodes didn't show up in the export graph when we used `strict=False`.
- In autocast's enter and exit functions, we dispatch to `PreDispatchTorchFunctionMode.__torch_function__`.
If we have PreDispatchTorchFunctionMode in our function_mode_stack, the call stack looks like below. This is mostly the same call stack as in strict mode, except that strict mode enters [here](https://www.internalfb.com/code/fbsource/[0d4f1135cacdb26c6e01d5dce1ce52a15d61ee48]/xplat/caffe2/torch/_dynamo/variables/ctx_manager.py?lines=806).
```
- torch.amp.autocast.__enter__()'s torch.overrides.handle_torch_function
- torch.fx.experimental.proxy_tensor.TorchFunctionMetadataMode.__torch_function__
- torch.amp._enter_autocast()'s torch.overrides.handle_torch_function
- PreDispatchTorchFunctionMode.__torch_function__
```
- In `PreDispatchTorchFunctionMode.__torch_function__`, we create the autocast nodes.
- To match the strict mode behavior, we let the input node to the `_exit_autocast` node be the corresponding `_enter_autocast` node. This requires us to maintain a stack in `PreDispatchTorchFunctionMode`.
Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export -- -r test_export_with_autocast
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export -- -r test_export_with_set_grad
```
Differential Revision: D64016023
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137495
Approved by: https://github.com/bdhirsh
We used a LIFO stack to store the CudaEvents in the cache. We like a FIFO deque better, so aside from improving the readability of the code, we now use a deque instead. As @wconstab pointed out, both methods are equally correct because the moment we put the event into the stack/deque, the event is already ready for reuse; this change is mostly a preference change, not trying to fix anything.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138048
Approved by: https://github.com/kwen2501
ghstack dependencies: #138040
**MOTIVATION**
We recently integrated support for Intel Gaudi devices (identified as 'hpu') into the common_device_type framework via the pull request at https://github.com/pytorch/pytorch/pull/126970. This integration allows tests to be automatically instantiated for Gaudi devices upon loading the relevant library. Building on this development, the current pull request extends the utility of these hooks by adapting selected CUDA tests to operate on Gaudi devices. Additionally, we have confirmed that these modifications do not interfere with the existing tests on CUDA devices.
**CHANGES**
- Add support for HPU devices within the payload function.
- Use instantiate_device_type_tests with targeted attributes to generate device-specific test instances.
- Expand the supported_activities() function to include checks for torch.profiler.ProfilerActivity.HPU.
- Apply skipIfHPU decorator to bypass tests that are not yet compatible with HPU devices.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133975
Approved by: https://github.com/briancoutinho, https://github.com/aaronenyeshi
Summary:
Blocking wait mode is not widely used, but it is probably useful for debugging.
In blockingWait mode, we don't need to enable the watchdog thread to
check for timeouts or NCCL errors, because the main thread would throw an
exception if an error happens; it is obvious to the user which work fails,
and it is the user's responsibility to handle the exception.
Test Plan:
CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138001
Approved by: https://github.com/fduwjj, https://github.com/c-p-i-o
ghstack dependencies: #137799
Here is why using `CUDAEventCache` causes crashes and data corruption:
1. The deleter is doing its job and appends the event to the stack.
2. In create, instead of getting a reference, we are getting a copy of eventsArray_[i] (which is a std::vector). This is bad because we didn't really remove the element from the stack. While we thought we had already popped the last one from the stack, it turns out the last one is still in the stack; we end up reusing the same event again and again. What's worse, since we keep adding new events to the stack, this will eventually blow up the stack and a crash happens.
The fix is easy: just get a reference. A local torchtitan run sees a non-NaN loss.
Also, we want to use a deque instead of a stack, and refactor the code a bit to make it more readable (in a separate PR).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138040
Approved by: https://github.com/kwen2501, https://github.com/shuqiangzhang
NOTE: this PR removes `ScheduleFlexibleInterleaved1F1B`, let me know if there are any concerns.
`ScheduleFlexibleInterleaved1F1B` is a superset of `Interleaved1F1B` and uses mostly the same implementation, but relaxes the condition that `n_microbatches % pp_size == 0`. This PR refactors the implementation into `Interleaved1F1B` and then removes it, since it is confusing to have both schedules with similar names. This also refactors the zero-bubble logic to belong in the `ZeroBubble` schedule class.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137783
Approved by: https://github.com/wconstab
This adds Dynamo tracing support for the host-side Triton TMA API (see `create_2d_tma_descriptor` calls on the host in the [Triton tutorial](https://triton-lang.org/main/getting-started/tutorials/09-persistent-matmul.html#sphx-glr-getting-started-tutorials-09-persistent-matmul-py)). A few notes:
- Here we assume the availability of the host-side TMA API added to upstream Triton in https://github.com/triton-lang/triton/pull/4498. As of time of writing, this is not a part of the PT2 OSS Triton pin (although back-ported internally). OSS Triton pin update should be done in December 2024.
- To capture the chain of calls `t.data_ptr() --> create_{1d,2d}_tma_descriptor(ptr, ...) --> kernel[grid](tma_desc, ...)`, we add three new variable trackers: `DataPtrVariable`, `CreateTMADescriptorVariable` (for the function), `TMADescriptorVariable` (for TMA descriptor object). This is to maintain the path back from the Triton kernel to the Tensor from which the TMA descriptor has been created.
- The newly introduced variables have `reconstruct` methods used in case of graph breaks.
- The `tma_descriptor_metadata` extracted from the captured `create_{1d,2d}_tma_descriptor` calls is propagated through the HOPs in Dynamo and AOTAutograd to be used by the downstream compiler (e.g., Inductor). See the unit tests for how the captured HOP arguments look like.
- In the Dynamo-captured fx graph, we replace the TMA descriptor arguments of the Triton kernel by the underlying Tensors, to be able to track the input/output relationships in terms of Tensors.
- In the Triton kernel mutation analysis pass (in AOTAutograd), we use the `tt.experimental_descriptor_store` TTIR op to detect mutations of the underlying tensors via TMA descriptors. So that downstream AOTAutograd can perform functionalizations as required.
- JIT Inductor and AOT Inductor support will be implemented in follow-up PRs.
Differential Revision: [D64404928](https://our.internmc.facebook.com/intern/diff/D64404928)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137677
Approved by: https://github.com/zou3519
Summary:
This PR is trying to let users know which exact collective call from the python thread is failing, and to
customize their own error handling function, instead of the watchdog thread crashing everything.
This is potentially very useful in fault-tolerant training, in which we can have in-process restarts.
E.g., when an NCCL error is detected, users can potentially abort comms, re-init comms, and go back to the previously checkpointed step and try again, instead of crashing the whole job.
This is to allow users to check the status of each collective call,
using the ivalue::future libs in PT core. This also allows users to
attach their own customized failure handling functions by:
work.get_future_result().then(error_handling_func)
Note that the above call is also non-blocking for the CPU thread.
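A hedged usage sketch of the new API (the handler name and body are illustrative, not part of the PR; run under torchrun so the NCCL group can initialize):
```python
import os
import torch
import torch.distributed as dist

def error_handling_func(fut):
    work_result = fut.value()  # a status value for the collective, not its tensors
    print("collective completed with:", work_result)
    # e.g. abort/re-init comms and roll back to the last checkpoint here

if __name__ == "__main__":
    dist.init_process_group("nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    t = torch.ones(4, device="cuda")
    work = dist.all_reduce(t, async_op=True)
    work.get_future_result().then(error_handling_func)  # non-blocking for the CPU thread
    work.wait()
    dist.destroy_process_group()
```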
Test Plan:
Added a new test, test_get_future_result, to verify that the WorkResult is
correctly propagated to the users.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137799
Approved by: https://github.com/fduwjj, https://github.com/wconstab
```
Parallelization strategy: after each rank copies its shard into its local
p2p buffer, every rank issues independent p2p copy -> shard_consumer
sequences to two streams. In addition to computation/communication
overlapping, the strategy allows for computation/computation overlapping,
greatly reducing quantization inefficiency.
Notation:
- "mv" for the copy to local buffer
- "cp" for p2p copies
- "b" for barriers
Constraints:
- The GPU scheduler may or may not overlap "mv" with the first shard_consumer.
- "cp" from different streams cannot overlap.
Ideal scenario 0 - "mv" overlaps with the first shard_consumer:
stream 0: [ shard_consumer ][ cp ][ shard_consumer ]
stream 1: [ mv ][b][ cp ][ shard_consumer ]
Ideal scenario 1 - "mv" is scheduled before the first shard_consumer:
stream 0: [ shard_consumer ][ cp ][ shard_consumer ]
stream 1: [ mv ][b][ cp ][ shard_consumer ]
Suboptimal scenario 0 - "mv" is scheduled after the first shard_consumer:
stream 0: [ shard_consumer ] [ cp ][ shard_consumer ]
stream 1: [ mv ][b][ cp ][ shard_consumer ]
Suboptimal scenario 1 - "b" is scheduled after the first shard_consumer:
stream 0: [ shard_consumer ] [ cp ][ shard_consumer ]
stream 1: [ mv ] [b][ cp ][ shard_consumer ]
We haven't yet figured out a way to ensure "mv" and "b" are either
overlapped with or scheduled before the first shard_consumer. Thus, to
prevent suboptimal scenarios, we are giving up the chance to overlap "mv"
and "b" with the first shard_consumer for now.
```
This PR improves the scheduling for mm kernels with high SM utilization. The GPU scheduler tends to not overlap local DtoD copies with such kernels, which leads to suboptimal scheduling. The following is an example of pipelining PyTorch's cutlass-based, row-wise scaling fp8 kernel:
Before this PR:
<img width="298" alt="image" src="https://github.com/user-attachments/assets/81e0a7f4-18ee-47c6-b258-04fdaca7a6a2">
With this PR:
<img width="253" alt="image" src="https://github.com/user-attachments/assets/982de5a8-da1e-4a8f-b67e-c9c869b0a77f">
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137850
Approved by: https://github.com/weifengpy
ghstack dependencies: #137643, #137738, #137805, #137836
```
Parallelization strategy: every rank issues independent compute
-> barrier -> p2p copy sequences on two streams. In addition to
computation/communication overlapping, the strategy allows for
computation/computation overlapping, greatly reducing
quantization inefficiency.
Ideally, stream activities would look like this ("b" for
barriers, "cp" for p2p copies):
[rank 0]
stream 0: [ chunk_producer ][b][ cp ][ chunk_producer ][b][ cp ]
stream 1: [ chunk_producer ][b][ cp ][ chunk_producer ][b][ cp ]
[rank 1]
stream 0: [ chunk_producer ][b][ cp ][ chunk_producer ][b][ cp ]
stream 1: [ chunk_producer ][b][ cp ][ chunk_producer ][b][ cp ]
Note that the barriers synchronize streams with the same ID
across ranks. They don't synchronize streams on the same rank.
Since the work on both streams is independent, there's no
guarantee that the chunk_producer from stream 0 or stream 1 will
be scheduled first. If there is a scheduling mismatch across
ranks, the barrier forces all ranks to wait for the slowest.
When scheduling mismatches occur among ranks, the stream
activities might look like this (note that p2p copies from
different streams cannot overlap with each other):
[rank 0]
stream 0: [ chunk_producer ][b ][ cp ][ chunk_producer ][b ][ cp ]
stream 1: [ chunk_producer ][b] [ cp ][ chunk_producer ][b] [ cp ]
[rank 1]
stream 0: [ chunk_producer ][b] [ cp ][ chunk_producer ][b] [ cp ]
stream 1: [ chunk_producer ][b ][ cp ][ chunk_producer ][b ][ cp ]
To prevent this, we need to ensure that the chunk_producer on
stream 1 gets scheduled first on every rank. Without access to
the underlying kernels, CUDA offers no API to control the
scheduling order of two independent, overlapping kernels. Our
solution is to issue a small sleep kernel in stream 0. The sleep
duration is insignificant, but having an extra task in stream 0
will almost guarantee that the chunk_producer on stream 1 gets
scheduled first. Once the first chunk_producer is scheduled in
the correct order, there's very little room for the scheduling
order of subsequent kernels to be inconsistent across ranks.
```
Currently, we perform stream synchronization to ensure scheduling order. The stream synchronization has no bearing on correctness, but prevents inconsistent scheduling orders across ranks.
Without the stream synchronization, ranks may have inconsistent scheduling order, and the barriers cause all ranks to wait for the slowest rank:
<img width="379" alt="image" src="https://github.com/user-attachments/assets/ffb97e76-7e19-4449-b121-83c32ec3e91d">
With stream synchronization, the inconsistent scheduling order issue is addressed, but we lose compute/compute overlapping (this is the state before this PR):
<img width="378" alt="image" src="https://github.com/user-attachments/assets/4cb76246-625f-4fc1-b49a-823ae46d3f23">
With this PR, we get both consistent scheduling order across ranks and compute/compute overlap:
<img width="327" alt="image" src="https://github.com/user-attachments/assets/51ab1bdc-4f60-46e0-b53c-6d208e2d4888">
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137836
Approved by: https://github.com/weifengpy
ghstack dependencies: #137643, #137738, #137805
This PR adds support for `A_scale` to be a row-wise scale. The op can automatically detect whether the row-wise scale is sharded or replicated. When the row-wise scale is sharded, the op will all-gather the scale in a pipelined fashion.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137805
Approved by: https://github.com/weifengpy
ghstack dependencies: #137643, #137738
This will retry connection timeout failures up to the timeout duration. Under heavy load the server may not be able to immediately accept the connection; in such a case we do want to retry the connection rather than fall back to ipv4 for the remainder of the connection timeout.
The connection timeout here is not the same as the c10d timeout which appears to be higher. We could adjust the linux timeout directly but using the c10d retry loop keeps things more consistent and gives us things like exponential backoff, logs, etc.
Example failure:
```
socket.cpp:752] [c10d] The client socket has failed to connect to [...]:29400 (errno: 110 - Connection timed out).
socket.cpp:752] [c10d] The IPv4 network addresses of (..., 29400) cannot be retrieved (gai error: -2 - Name or service not known).
... repeats ipv4 connection failure
```
From Linux man page: https://man7.org/linux/man-pages/man2/connect.2.html
```
ETIMEDOUT
Timeout while attempting connection. The server may be
too busy to accept new connections. Note that for IP
sockets the timeout may be very long when syncookies are
enabled on the server.
```
Test plan:
CI for backwards compatibility
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138003
Approved by: https://github.com/c-p-i-o, https://github.com/fduwjj, https://github.com/rsdcastro
### Fix 1: Throw async error during init wait
Previously we just busy-waited for `ncclSuccess`; if the nonblocking init encountered an error, we never reported it. Added detection of async errors via `ncclGetAsyncError`.
### Fix 2: Add wait after comm split
```
// After calling ncclCommSplit in non-blocking mode, we should wait for the
// source communicator to be out of ncclInProgress state.
// Reason 1:
// it's unsafe to call new operations on the parent comm while it's in
// ncclInProgress state.
// Reason 2:
// as of NCCL 2.23, the ptr value of child comm will not be filled until the
// state of parent comm is ncclSuccess. This may change in the future. See:
// https://github.com/NVIDIA/nccl/issues/1472
```
This wait does not mean the child comm is ready for use, neither does it block till that point.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137741
Approved by: https://github.com/shuqiangzhang
The Triton `AttrsDescriptor` object was refactored in https://github.com/triton-lang/triton/pull/4734. These changes add support for the new `AttrsDescriptor` while maintaining backwards compatibility with the existing version. The main changes are different names for the initialization of the descriptor parameters, and creation via a static method instead of the class constructor.
Depends on #137458 which removes some unused logic around the old descriptor. Those changes make this PR cleaner, but if for some reason that old logic is still used I can make adjustments.
Use of the new `AttrsDescriptor` depends on https://github.com/triton-lang/triton/pull/4888
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137757
Approved by: https://github.com/jansel
Related to #107302
The following tests failed in test_binary_ufuncs.py when testing with NumPy 2.
```
FAILED [0.0050s] test/test_binary_ufuncs.py::TestBinaryUfuncsCPU::test_scalar_support__refs_sub_cpu_complex64 - AssertionError
FAILED [0.0043s] test/test_binary_ufuncs.py::TestBinaryUfuncsCPU::test_scalar_support__refs_sub_cpu_float32 - AssertionError
FAILED [0.0048s] test/test_binary_ufuncs.py::TestBinaryUfuncsCPU::test_scalar_support_sub_cpu_complex64 - AssertionError
FAILED [0.0043s] test/test_binary_ufuncs.py::TestBinaryUfuncsCPU::test_scalar_support_sub_cpu_float32 - AssertionError
FAILED [0.0028s] test/test_binary_ufuncs.py::TestBinaryUfuncsCPU::test_shift_limits_cpu_uint8 - OverflowError: Python integer -100 out of bounds for uint8
```
This PR fixes them.
More details:
* `test_shift_limits` failed because `np.left_shift()` and `np.right_shift()` no longer support negative shift values in NumPy 2.
* `test_scalar_support` failed because NumPy 2 changed its dtype promo rules. We special-cased the incompatible cases by changing the expected dtypes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137937
Approved by: https://github.com/albanD
Related to #107302
`TestExport.test_exported_objects` in `test/torch_np/test_basic.py` is failing with NumPy 2.
The test is checking if all methods under `torch._numpy` exist in `numpy`.
However, some of them are removed in NumPy 2.
This PR fixes the issue by not checking the removed methods when running with NumPy 2.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137814
Approved by: https://github.com/albanD
The Intel GPU ATen library (libtorch_xpu) utilizes `torchgen` to generate structured kernels. Currently, the generated structured kernels are decorated with `TORCH_API` to control visibility, while `TORCH_API` is controlled by the `CAFFE2_BUILD_MAIN_LIB` macro. However, we cannot naively enable `CAFFE2_BUILD_MAIN_LIB` for the Intel GPU ATen library, because the macro does not only serve the `TORCH_API` semantic: with it enabled, `TORCH_API` means the symbol is `hidden`.
https://github.com/pytorch/pytorch/blob/main/c10/macros/Export.h#L95-L99
Therefore, we need to use `TORCH_XPU_API` to decorate the produced structured kernels.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137794
Approved by: https://github.com/atalman
ghstack dependencies: #137873
MOTIVATION
We recently verified some quantization tests on devices other than CPU (e.g. CUDA and Intel Gaudi devices, identified as 'hpu'). We noticed a device mismatch error because eps is a tensor created on the CPU while the other tensors (min_val_neg, max_val_pos, scale, zero_point) are moved to the targeted _device_.
CHANGES
Move eps to the _device_ of the other tensors.
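A self-contained sketch of the pattern and the fix (tensor names mirror the observer code described above; the actual observer lines differ):
```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
min_val_neg = torch.zeros(4, device=device)
scale = torch.rand(4, device=device)

# Before: eps was created on the CPU, so combining it with tensors on another
# device could raise a device-mismatch error.
eps = torch.tensor([torch.finfo(torch.float32).eps], device=min_val_neg.device)
scale = torch.max(scale, eps)
print(scale.device)
```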
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135204
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
Previously the decomposition would upcast inputs to fp32. This led to a slowdown compared to eager, which would run in fp16. We also tried keeping the bmm in fp16 and upcasting for the epilogue, but that led to worse numerics because the bmm in eager would do the epilogue all in fp32 without a downcast in the bmm accumulator.
Fix for https://github.com/pytorch/pytorch/issues/137897
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137904
Approved by: https://github.com/ngimel
Fixes an issue where if the context arg is not provided, Dynamo would throw an arg mismatch error.
The skips are there because Dynamo would previously fall back to eager on those tests due to the arg mismatch error.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137809
Approved by: https://github.com/drisspg
Summary: Tests Fix Clear On Fork by forking a process after a profile has already been done. Afterwards we check that all the PID/TID are as expected.
Test Plan: Ran buck2 test 'fbcode//mode/dev' fbcode//caffe2/test:profiler -- --exact 'caffe2/test:profiler - test_forked_process (profiler.test_profiler.TestProfiler)'
Differential Revision: D63992036
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137511
Approved by: https://github.com/sanrise, https://github.com/aaronenyeshi
The `AttrsDescriptor` class has been present in Triton for almost a year now (introduced [here](72c9833927)), so we should be able to rely on it existing. I am in the process of supporting the new `AttrsDescriptor` class and @jansel suggested I split changes to the existing class out separately to make sure nothing breaks removing the legacy attribute descriptor attributes.
Initially I attempted to remove the branching around detecting whether `AttrsDescriptor` exists but that breaks because PyTorch must build without Triton. So, I went back and updated for the naming introduced in the commit linked above, and also removed two unused attributes `divisible_by_8` and `ids_to_fold` which were removed in Feb 2024 (https://github.com/triton-lang/triton/pull/3122 and https://github.com/triton-lang/triton/pull/3080 respectively).
With these changes only the internal workings of the `AttrsDescriptor` class will differ between supported Triton versions, but the data stored will remain consistent.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137458
Approved by: https://github.com/jansel
This reverts commit e688b78791d01bd91614a61e57726c32beb46ee4.
We're reverting this because:
1) The original PR (#134872) fixed a bug but caused another one. The
assessment is that the bug it caused is worse than the bug it fixed.
2) it was reverted on the release 2.5 branch, so we want to prevent
divergence
3) The original author is out-of-office for a while so we don't want the
divergence to wait until they're back
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137891
Approved by: https://github.com/Skylion007
For `autograd.Function`, the engine will try to allocate correctly-shaped zeros for `None` grads (i.e. in the case where the output isn't used downstream). It determines the shape of these zeros from the `VariableInfo` entry, which is derived from the forward output shape. For the NJT forward output case, the size info stored will contain a nested int, and calling `zeros()` with this size throws:
```
RuntimeError: .../build/aten/src/ATen/RegisterCPU.cpp:5260: SymIntArrayRef expected to contain only concrete integers
```
This PR fixes this by storing the full tensor in the `VariableInfo` for the nested case and calling `zeros_like()` to allocate correctly-shaped zeros. This is pretty inefficient; ideally we would want to save just the NJT shape and be able to construct zeros from it, but this requires factory function support for nested ints (WIP). So this is a short-term fix until we have that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136875
Approved by: https://github.com/soulitzer, https://github.com/huydhn
A proposal addressing Issue #1489: **Optimizer should track parameter names and not id.**
(also mentioned here: [[RFC] Introducing FQNs/clarity eyeglasses to optim state_dict](https://dev-discuss.pytorch.org/t/rfc-introducing-fqns-clarity-to-optim-state-dict/1552))
## Summary
This PR introduces a backward-compatible enhancement where optimizers track parameter names instead of just their id.
Optimizers can be initialized with `named_parameters()` as:
```python
optimizer = optim.SGD(model.named_parameters(), lr=0.01, momentum=0.9)
```
This allows for greater clarity and ease when handling optimizers, as the parameters' names are preserved within the optimizer’s `state_dict` as:
```
state_dict = {
    'state': {
        0: {'momentum_buffer': tensor(...), ...},
        1: {'momentum_buffer': tensor(...), ...},
    },
    'param_groups': [
        {
            'lr': 0.01,
            'weight_decay': 0,
            ...
            'params': [0, 1],
            'param_names': ['layer.weight', 'layer.bias']  # (optional)
        }
    ]
}
```
Loading `state_dict` is not changed (backward-compatible) and the `param_names` key will be ignored.
## Key Features
#### Named Parameters in Optimizer Initialization:
Optimizers can accept the output of `model.named_parameters()` during initialization, allowing them to store parameter names directly.
#### Parameter Names in `state_dict`:
The parameter names are saved as a list in the optimizer’s `state_dict` with key `param_names`, alongside the `params` indices, ensuring seamless tracking of both names and parameters.
## Backward Compatibility
#### No Breaking Changes:
This change is fully backward-compatible. The added `param_names` key in the optimizer's `state_dict` is ignored when loading a state to the optimizer.
#### Customization with Hooks:
For more control, the loaded state_dict can be modified using a custom `register_load_state_dict_pre_hook`, providing flexibility for different design needs.
## Documentation Updates
Please refer to the documentation changes for more details on how this feature is implemented and how it can be used effectively.
## Solution Example:
A suggested solution to the problem mentioned in #1489, for the same parameters but in a different order.
The following `register_load_state_dict_pre_hook` should be added to the optimizer before loading to enable loading the state dict:
```python
def adapt_state_dict_ids(optimizer, state_dict):
    # assuming a single param group.
    current_state_group = optimizer.state_dict()['param_groups'][0]
    loaded_state_group = state_dict['param_groups'][0]
    # same number of params, same names, only different ordering
    current_state_name_to_id_mapping = {}  # mapping -- param_name: id
    for i, name in enumerate(current_state_group['param_names']):
        current_state_name_to_id_mapping[name] = current_state_group['params'][i]
    # changing the ids of the loaded state dict to match the order of the given state dict.
    for i, name in enumerate(current_state_group['param_names']):
        loaded_state_group['params'][i] = current_state_name_to_id_mapping[name]
    return state_dict
```
In this code, the loaded `state_dict` ids are adapted to match the order of the current optimizer `state_dict`.
Both the previous and the current optimizers need to be initialized with `named_parameters()` to have the 'param_names' key in the dict.
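A short usage sketch tying the pieces together (hedged: `model` is a placeholder, and `adapt_state_dict_ids` is the hook defined above):
```python
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(4, 2)  # placeholder model
optimizer = optim.SGD(model.named_parameters(), lr=0.01, momentum=0.9)
saved_state_dict = optimizer.state_dict()  # stands in for a checkpointed state_dict

# Reorder the loaded ids by param name before loading.
optimizer.register_load_state_dict_pre_hook(adapt_state_dict_ids)
optimizer.load_state_dict(saved_state_dict)
```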
### Note
This is my first contribution to PyTorch, and I wish to receive feedback or suggestions for improvement.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134107
Approved by: https://github.com/janeyx99
Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
The test is failing (flakily?) on periodic Windows CUDA jobs with the following error:
```
__________ TestLinalgCUDA.test_matmul_offline_tunableop_cuda_float16 __________
Traceback (most recent call last):
File "C:\actions-runner\_work\pytorch\pytorch\test\test_linalg.py", line 4618, in test_matmul_offline_tunableop
os.remove(filename)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'tunableop_untuned0.csv'
```
For example, https://github.com/pytorch/pytorch/actions/runs/11292745299/job/31410578167#step:15:15097
The test tried to catch and ignore this, but this is Windows. So, the fix is to:
1. Ignore if these files couldn't be removed
2. Write them to a temp directory instead, otherwise, [assert_git_not_dirty](https://github.com/pytorch/pytorch/blob/main/.ci/pytorch/test.sh#L286) won't be happy
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137835
Approved by: https://github.com/atalman
This PR adds a Selective Activation Checkpointing (SAC) Estimator, built on top of the `Runtime Estimator`, for estimating memory and recomputation time trade-offs.
It provides a `TorchDispatchMode`-based context manager that estimates the memory and runtime trade-offs of functions or `torch.nn.Module`s for SAC, using the `Runtime Estimator` #134243 under the hood to support two estimation modes: 'operator-level-benchmark' and 'operator-level-cost-model' (roofline model). The SAC Estimator provides detailed statistics and metadata for the operators of each module, including a greedy order for selecting operators to be recomputed/checkpointed and per-module trade-off graphs. This estimator is designed to be used under FakeTensorMode and currently supports estimation of compute time and memory usage.
It's inspired by: [XFormers SAC](https://github.com/facebookresearch/xformers/blob/main/xformers/checkpoint.py) by @fmassa
End-to-end example:
```
import torch
from torch._subclasses.fake_tensor import FakeTensorMode
from torch.distributed._tools.sac_estimator import SACEstimator
from torch.testing._internal.distributed._tensor.common_dtensor import (
    ModelArgs,
    Transformer,
)

if __name__ == "__main__":
    dev = torch.cuda.current_device()
    vocab_size = 8192
    bsz, seq_len = 8, 1024
    model_args = ModelArgs(
        n_layers=4,
        n_heads=12,
        vocab_size=vocab_size,
        max_seq_len=seq_len,
        dim=768,
        dropout_p=0.1,
    )
    with FakeTensorMode():
        with torch.device(dev):
            model = Transformer(model_args)
        inp = torch.randint(
            0, model_args.vocab_size, (bsz, model_args.max_seq_len), device=dev
        )
        sace = SACEstimator()
        with sace(estimate_mode_type='operator-level-cost-model'):
            loss = model(inp).sum()
            loss.backward()
        sace.pwlf_sac_tradeoff_curve(n_segments=2, save_tradeoff_graphs=True)
        sace.display_modulewise_sac_stats(depth=4, print_tabular=True)
```
Example AC stats for one of the transformer layers: (screenshot in the PR)
Example AC trade-off for one of the transformer layers: (screenshot in the PR)
Example AC trade-off graph for one of the transformer layers: (screenshot in the PR)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135208
Approved by: https://github.com/weifengpy
With LTO (Link Time Optimization) enabled in CFLAGS, some compilers will optimize away and strip the unwind_c function because they cannot resolve the reference correctly, which breaks the build with an undefined reference in unwind_entry. Add an attribute to avoid this.
Fixes #121282
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137862
Approved by: https://github.com/Skylion007
Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
Builds upon #76951.
Benchmarking code is the same as in #76950.
AMD Ryzen Threadripper PRO 3995WX:
```
batch_size drop_last origin new speedup
------------ ----------- -------- ------ ---------
4 True 0.94 0.5706 64.74%
4 False 0.9745 0.9468 2.93%
8 True 0.7423 0.3715 99.82%
8 False 0.7974 0.5666 40.73%
64 True 0.5394 0.2085 158.76%
64 False 0.6083 0.2697 125.51%
640 True 0.5448 0.1985 174.41%
640 False 0.7085 0.2308 206.91%
6400 True 0.5554 0.2028 173.88%
6400 False 0.7711 0.2109 265.60%
64000 True 0.556 0.2091 165.82%
64000 False 0.7803 0.2078 275.58%
```
When `drop_last == True`, it uses `zip` to speed things up.
When `drop_last == False`, it uses `itertools` to speed things up.
`itertools` was the fastest way I could find that deals with the last batch if it is smaller than `batch_size`. I have a pure python method too, but it is slower when `batch_size` is 4 or 8, so I have committed the `itertools` version for now.
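A minimal sketch of the two tricks (hedged: this is not the exact sampler.py code, just the batching idea):
```python
from itertools import islice

def batched_indices(indices, batch_size, drop_last):
    if drop_last:
        # zip the same iterator with itself batch_size times; the incomplete tail is dropped
        args = [iter(indices)] * batch_size
        yield from (list(b) for b in zip(*args))
    else:
        it = iter(indices)
        while batch := list(islice(it, batch_size)):
            yield batch

print(list(batched_indices(range(7), 3, drop_last=True)))   # [[0, 1, 2], [3, 4, 5]]
print(list(batched_indices(range(7), 3, drop_last=False)))  # [[0, 1, 2], [3, 4, 5], [6]]
```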
Happy to chat further about this change :-) I understand you may not want to introduce the `itertools` package into [sampler.py](https://github.com/pytorch/pytorch/blob/main/torch/utils/data/sampler.py).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137423
Approved by: https://github.com/Skylion007
PyStructSequence is the C API equivalent for `collections.namedtuple` in Python. But they have different constructors:
```python
tuple = NamedTupleType(*args)
tuple = NamedTupleType._make(args)
tuple = StructSequenceType(args)
```
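A concrete illustration using a standard-library struct sequence next to a namedtuple (illustrative example, not from the PR):
```python
import time
from collections import namedtuple

t = time.struct_time((2025, 1, 1, 0, 0, 0, 2, 1, 0))  # single iterable argument

Point = namedtuple("Point", ["x", "y"])
p1 = Point(1, 2)           # positional *args
p2 = Point._make((1, 2))   # from an iterable
print(t.tm_year, p1, p2)
```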
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137776
Approved by: https://github.com/jansel
We use nn_module_stack in unflatten to recognize when module calls begin and end. However the current format is not sufficient to detect module call boundaries when we have successive calls to the same module, because the successive instructions (end of one call, begin of next call) have the same nn_module_stack. This causes us to effectively "unroll" successive calls to a single call. This can cause problems when preserving module call signatures because the outputs of the successive calls might be concatenated in the single call.
Previously we introduced the concept of a "call index" to generate multiple graphs when unflattening, one per call. This PR pushes this concept into nn_module_stack itself. In particular, the keys of nn_module_stack now go from `key` to `key@call_index`. (In a previous attempt, https://github.com/pytorch/pytorch/pull/137457, instead values in nn_module_stack go from (fqn, type) to (fqn, type, call_index), which is BC-breaking.)
Note that we still do not have the ability to preserve module call signatures for multiple calls to the same module. But now instead of randomly crashing we give a proper error. OTOH when not preserving module call signatures we simply generate multiple calls, each with its own graph, possibly deduplicated, matching what we would do for non-successive calls.
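An illustrative (made-up) example of the new key format; the `(fqn, type)` values stay the same, only the keys gain a call index:
```python
nn_module_stack = {
    "L__self___mlp@0": ("mlp", "my_model.MLP"),  # first call to self.mlp
    "L__self___mlp@1": ("mlp", "my_model.MLP"),  # second, successive call
}
for key, (fqn, typ) in nn_module_stack.items():
    prefix, _, call_index = key.partition("@")
    print(prefix, call_index, fqn, typ)
```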
Test Plan: Like D64014936
Differential Revision: D64136277
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137646
Approved by: https://github.com/angelayi
This PR adds new meta functions for `lerp`, `addcmul`, and `addcdiv` (including their
respective in-place versions).
These functions only had refs implementations, which was the root cause of significant
overhead ([issue][1]) when running the `AdamW` optimizer step on the PyTorch/XLA
backend. Running the meta functions resulted in the following improvements:
- `lerp` calls: 1,550ms to 140ms (10x)
- `addcdiv` calls: 640ms to 350ms (1.8x)
- `addcmul` calls: 620ms to 300ms (2.05x)
[1]: https://github.com/pytorch/xla/issues/7923
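A quick illustration on the meta device (shapes are made up): with real meta kernels these calls only propagate shape/dtype and never touch data.
```python
import torch

a = torch.empty(4, 4, device="meta")
b = torch.empty(4, 4, device="meta")
c = torch.empty(4, 4, device="meta")
print(torch.lerp(a, b, 0.3).shape)              # torch.Size([4, 4])
print(torch.addcmul(a, b, c, value=0.5).shape)  # torch.Size([4, 4])
print(torch.addcdiv(a, b, c, value=0.5).shape)  # torch.Size([4, 4])
```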
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136909
Approved by: https://github.com/jansel
Summary:
# Latest Update
This diff is no longer needed because we did need the check to exist, to make meta behave the same as other devices, see D54526190.
---------------------------------
# Background
T176105639
| case | embedding bag weight | per_sample_weight | fbgemm lookup | forward in meta |
| --- | --- | --- | --- | --- |
| A | fp32 | fp32 | good | good |
| B | fp16 | fp32 | good | failed [check](https://fburl.com/code/k3n3h031) that forces weight dtype == per_sample_weights dtype |
| C | fp16 | fp16 | P1046999270, RuntimeError: "expected scalar type Float but found Half from fbgemm call" | good |
| D | fp32 | fp16 | N/A | N/A |
Currently we are in case A. Users need to add `use_fp32_embedding` in training to force the embedding bag dtype to be fp32. However, users actually want case B, using fp16 as the embedding bag weight. When deleting `use_fp32_embedding`, they fail the [check](https://fburl.com/code/k3n3h031) that forces `weight dtype == per_sample_weights dtype` in meta_registration.
The check is actually not necessary, because the fbgemm backend does support case B. Additionally, later on in `meta_embedding_bag`, `weight` and `per_sample_weights` don't need to be the same dtype (https://fburl.com/code/q0tho05h, weight is src, per_sample_weights is scale) for `is_fast_path_index_select`.
# This diff
Therefore, this diff remove the unnecessary [check](https://fburl.com/code/k3n3h031) to support case B in meta forward. With such, users are able to use fp16 to be the emb bag dtype without the need to force per_sample_weights the same dtype in meta forward (see Test Plan).
# Reference diffs to resolve this issue
Diff 1: D52591217
This passes embedding bag dtype to feature_processor to make per_sample_weights same dtype as emb bag weight. However, `is_meta` also needs to be passed because of case C. fbgemm still does not support per_sample_weights = fp16 (see the above table). Therefore users are forced to only make per_sample_weights fp16 when it is on meta. The solution requires too many hacks.
Diff 2: D53232739
Basically doing the same thing as diff 1 (D52591217), except that the hack is added in the TorchRec library. This adds an `if` in EBC and PEA: when the emb bag weight is fp16, it forces per_sample_weights to fp16 too. However, it then runs into the fbgemm issue as well and has broken a bunch of prod models.
Test Plan:
# APS
The following command will run icvr_launcher which triggers ads_launcher and run forward in meta device:
```
buck2 run mode/opt -c python.package_style=inplace //aps_models/ads/icvr:icvr_launcher_publish -- mode=mast_ig_fm_when_combo0_uhm_publish launcher.fbl_entitlement=ads_global_tc_ads_score launcher.data_project=oncall_ads_model_platform launcher.tags=[ads_ranking_taxonomy_exlarge_fm_prod] stages.train=false
```
Result:
{F1461463993}
Reviewed By: ezyang
Differential Revision: D54175438
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136774
Approved by: https://github.com/ezyang
This test is currently failing in trunk when the memory leak check is enabled, for example https://github.com/pytorch/pytorch/actions/runs/11296206361/job/31422348823#step:22:1970. When testing locally, calling `backward` on a masked tensor always causes a memory leak until I clean up the data and the mask manually. This is probably related to this warning from masked tensor: `UserWarning: It is not recommended to create a MaskedTensor with a tensor that requires_grad. To avoid this, you can use data.clone().detach()`, but I don't know enough about the internal details here to go further. So, let's just fix the test first.
### Testing
```
PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 python test/test_maskedtensor.py TestBasicsCUDA.test_stack_cuda
```
passes and doesn't warn about memory leak anymore.
The test itself came from https://github.com/pytorch/pytorch/pull/125262#issuecomment-2344068012
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137815
Approved by: https://github.com/kit1980
Related to #107302
The breakages are caused by backward incompatibility between NumPy 1 and NumPy 2.
This PR fixes all the corresponding test failures in `test_torch.py`.
1. The dtype of the return value of `np.percentile` when passed a `torch.float32` tensor.
NumPy 1: Returns `np.float64`.
NumPy 2: Returns `np.float32`.
Solution: Enforce it with `.astype(np.float64)`.
2. The return type of `np.gradient()` when it returns multiple arrays.
NumPy 1: A list of arrays.
NumPy 2: A tuple of arrays.
Solution: Cast the tuple to a list.
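A small sketch of both adjustments (hedged: this mirrors the idea of the fix, not the exact test code):
```python
import numpy as np
import torch

t = torch.rand(10, dtype=torch.float32)
# NumPy 1 returned float64 from np.percentile on a float32 array; NumPy 2 preserves
# float32, so enforce one dtype explicitly before comparing.
q = np.asarray(np.percentile(t.numpy(), 50)).astype(np.float64)

# NumPy 2 returns a tuple from np.gradient with multiple axes; cast to a list so the
# old list-based comparisons keep working.
g = list(np.gradient(np.arange(12.0).reshape(3, 4), axis=(0, 1)))
print(q.dtype, len(g))  # float64 2
```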
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137740
Approved by: https://github.com/ezyang
Updates all references to runner determinator workflow (`_runner-determinator.yml`) from current cloned version to main version.
This enables the team to push updates to this workflow, like fixing bugs or pushing improvements, and have them immediately reflected on all open PRs, avoiding potentially breaking situations and enabling fast iteration and simple recovery in case of bugs.
From:
```
jobs:
get-label-type:
uses: ./.github/workflows/_runner-determinator.yml
```
To:
```
jobs:
get-label-type:
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137791
Approved by: https://github.com/malfet, https://github.com/huydhn, https://github.com/zxiiro
Performs shape inference at runtime using user-provided real tensors.
- avoids the need for users to precompute shapes which is difficult and error prone
- lets us remove args from the PipelineStage ctor (in a later PR)
- deprecates the existing inference helper in the PipelineStage constructor, for several reasons: it's problematic to have to reason about the stage submod being on the right device for shape inference
The current state as of this PR:
- Users should not pass any input or output shapes into PipelineStage ctor, and shape inference will run automatically
- To override shape inference, they can continue to pass input/output args as previously
Currently, does not add a barrier after shape-inference, which essentially pipelines shape inference with the subsequent schedule action for that stage. If this complicates debugging, we could add in a barrier (it comes at a cost, but only during the first step).
Testing:
- Removed input args from all PP test cases, thus exposing them all to shape-inference.
- Verified visually (nvidia-smi) that torchtitan PP 3D test runs shape inference fine without creating extra cuda contexts.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136912
Approved by: https://github.com/kwen2501, https://github.com/H-Huang
The original PR #122396 used the CPU device since at that point in time
there was no actual Triton CPU backend. After #133408, this is no longer
the case, so we now have multiple backends getting registered for the
CPU. The test still works in OSS but fails internally due to different
test runners initializing the backends in a different order.
This PR doesn't actually end up fixing the test internally because
cpp_extension -- needed to implement the privateuseone device -- isn't
supported there, so we simply skip it instead. However, it still makes the
OSS test independent of initialization order, which is good.
Differential Revision: [D63838169](https://our.internmc.facebook.com/intern/diff/D63838169/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137611
Approved by: https://github.com/henrylhtsang
Previously, instances of `SchedulerNode` and `FusedSchedulerNode` would explicitly check whether the compilation target is Triton when codegen'ing debug strings. Generating debug triton code is instead implemented as a callback set on scheduler nodes by `TritonScheduling`. This makes the codegen more device-agnostic and allows schedulers to customise the codegen output as opposed to it being closely coupled to the debug string codegen
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135015
Approved by: https://github.com/jansel
Follow up to https://github.com/pytorch/pytorch/pull/131936. In the original bisector you'd have to test inline whether we were disabling a component - `if BisectionManager.disable_subsystem("inductor", "post_grad_passes", debug_info)`. This adds a convenient way of testing config changes for root-causing issues. I've added `emulate_precision_casts` and aot_eager_decomp_partition cse as the initial ones.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137346
Approved by: https://github.com/zou3519
This function computes a topological sort using a non-recursive implementation of DFS. Upon first reading, I thought it was using Kahn’s algorithm because it uses a variable called `queue`, but upon closer reading, I noticed this variable is actually used as a stack.
This pull request improves readability by renaming the stack and changing it from `std::vector` to `std::stack`.
Note: this also changes the backing store from an `std::vector` to an `std::deque`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130526
Approved by: https://github.com/alanwaketan, https://github.com/malfet
adds a `default` tag to experiment configurations, allowing some experiments to be excluded by default from the random draw:
```
experiments:
lf:
rollout_perc: 25
otherExp:
rollout_perc: 25
default: false
---
```
and includes the configuration to filter what experiments are of interest for a particular workflow (comma separated):
```
get-test-label-type:
name: get-test-label-type
uses: ./.github/workflows/_runner-determinator.yml
with:
...
check_experiments: "awsa100"
```
The end goal is to enable us to run multiple experiments that are independent from one another. For example, while we still run the LF infra experiment, we want to migrate other runners leveraging the current solution. An immediate use case is the A100 instances, which we want to migrate to AWS.
During the migration period, those new instances will be labeled both `awsa100.linux.gcp.a100` and `linux.aws.a100`. Once the experiment ends, we will remove the first, confusing one.
```
jobs:
get-build-label-type:
name: get-build-label-type
uses: ./.github/workflows/_runner-determinator.yml
with:
...
get-test-label-type:
name: get-test-label-type
uses: ./.github/workflows/_runner-determinator.yml
with:
...
check_experiments: "awsa100"
linux-focal-cuda12_1-py3_10-gcc9-inductor-build:
name: cuda12.1-py3.10-gcc9-sm80
uses: ./.github/workflows/_linux-build.yml
needs:
- get-build-label-type
- get-test-label-type
with:
runner_prefix: "${{ needs.get-build-label-type.outputs.label-type }}"
...
test-matrix: |
{ include: [
{ config: "inductor_huggingface_perf_compare", shard: 1, num_shards: 1, runner: "${{ needs.get-test-label-type.outputs.label-type }}linux.gcp.a100" },
...
]}
...
```
```
experiments:
lf:
rollout_perc: 50
awsa100:
rollout_perc: 50
default: false
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137614
Approved by: https://github.com/malfet
Earlier the subgraphs were getting inlined into the output code. This PR lifts the subgraphs into a function, and then we just call the function in the output code.
This is the output code for test `test_cond_reintepret_view_inputs_outputs`
Before this PR - https://www.internalfb.com/intern/paste/P1632948905/
With this PR - https://www.internalfb.com/intern/paste/P1632946348/
A relevant snippet from the above paste is
~~~
def false_graph_0(args):
    false_graph_0_arg0_1, false_graph_0_arg1_1, s0 = args
    args.clear()
    s0 = s0
    with torch.cuda._DeviceGuard(0):
        torch.cuda.set_device(0)
        false_graph_0_buf0 = empty_strided_cuda(((-1) + s0, 20), (20, 1), torch.float32)
        false_graph_0_buf1 = empty_strided_cuda(((-1) + s0, 20), (20, 1), torch.float32)
        # Unsorted Source Nodes: [cond, z1, z2], Original ATen: [aten.sub, aten.add]
        triton_poi_fused_add_sub_1_xnumel = (-20) + (20*s0)
        stream0 = get_raw_stream(0)
        triton_poi_fused_add_sub_1.run(false_graph_0_arg0_1, false_graph_0_arg1_1, false_graph_0_buf0, false_graph_0_buf1, triton_poi_fused_add_sub_1_xnumel, grid=grid(triton_poi_fused_add_sub_1_xnumel), stream=stream0)
        del false_graph_0_arg0_1
        del false_graph_0_arg1_1
    return (reinterpret_tensor(false_graph_0_buf0, ((-3) + s0, 20), (20, 1), 40), reinterpret_tensor(false_graph_0_buf1, ((-1) + s0, 16), (20, 1), 4), )

async_compile.wait(globals())
del async_compile

def call(args):
    arg0_1, arg1_1, arg2_1, arg3_1 = args
    args.clear()
    s0 = arg0_1
    assert_size_stride(arg1_1, (s0, 20), (20, 1))
    assert_size_stride(arg2_1, (s0, 20), (20, 1))
    assert_size_stride(arg3_1, (), ())
    with torch.cuda._DeviceGuard(0):
        torch.cuda.set_device(0)
        buf0 = [None] * 2
        buf0 = [None] * 2
        if arg3_1.item():
            # subgraph: true_graph_0
            true_graph_0_arg0_1 = reinterpret_tensor(arg1_1, ((-1) + s0, 20), (20, 1), 0)
            true_graph_0_arg1_1 = reinterpret_tensor(arg2_1, ((-1) + s0, 20), (20, 1), 0)
            (true_graph_0_buf0, true_graph_0_buf1) = true_graph_0([true_graph_0_arg0_1, true_graph_0_arg1_1, s0])
            buf0[0] = true_graph_0_buf0
            buf0[1] = true_graph_0_buf1
        else:
            # subgraph: false_graph_0
            false_graph_0_arg0_1 = reinterpret_tensor(arg1_1, ((-1) + s0, 20), (20, 1), 0)
            false_graph_0_arg1_1 = reinterpret_tensor(arg2_1, ((-1) + s0, 20), (20, 1), 0)
            (false_graph_0_buf0, false_graph_0_buf1) = false_graph_0([false_graph_0_arg0_1, false_graph_0_arg1_1, s0])
            buf0[0] = false_graph_0_buf0
            buf0[1] = false_graph_0_buf1
        del arg1_1
        del arg2_1
        del arg3_1
    buf1 = buf0[0]
    buf2 = buf0[1]
    del buf0
    return (buf1, buf2, )
~~~
The key change is to recursively call `codegen` for the subgraph and rely on `SubgraphPythonWrapper` to generate just the subgraph `fn`. The resulting subgraph_code is then inserted into the parent wrapper.
Note that this PR only works for the Python wrapper. For the cpp wrapper, we need a lot of refactoring to ensure that we don't duplicate the global variables in the output_code. So, for now, I fall back to the old way of inlining for the cpp wrapper. I am hoping someone with more familiarity with the cpp wrapper can support subgraph lifting (cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov).
This work will unblock hierarchical compilation (or cold start compile time work).
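For context, a hedged repro sketch (not the exact unit test, and assuming a CUDA build since the generated code above targets CUDA) of the kind of `torch.cond` program whose lifted-subgraph output code is shown above; the operands are views of the inputs and each branch returns views of its results:
```python
import torch

def f(pred, x, y):
    def true_fn(a, b):
        z1, z2 = a + b, a - b
        return z1[2:], z2[:, 4:]

    def false_fn(a, b):
        z1, z2 = a - b, a + b
        return z1[2:], z2[:, 4:]

    # operands are reinterpreted views of the inputs
    return torch.cond(pred, true_fn, false_fn, (x[:-1], y[:-1]))

compiled = torch.compile(f, dynamic=True)
out = compiled(
    torch.tensor(False, device="cuda"),
    torch.randn(8, 20, device="cuda"),
    torch.randn(8, 20, device="cuda"),
)
```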
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137200
Approved by: https://github.com/desertfire, https://github.com/eellison
This PR reduces the overhead on the CPU side by eliminating the use of c10::str in creating signatures. Instead we use the fmt library. TunableOp overhead on the CPU is reduced by around 40%. The improvement is most noticeable on small GEMMs. This PR does not contain any bug fixes or new features.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135371
Approved by: https://github.com/jeffdaily
# Summary
I started to explore the performance of _scaled_mm against a triton-based persistent TMA kernel for RowWise scaling.
There are more details here: https://github.com/drisspg/transformer_nuggets/pull/36
It clearly showed that there was some room for improvement on larger problem sizes compared to Triton's performance. Note that the Triton kernel only has a 128x128x128 tile shape, whereas scaled_mm has a 64x128x128 tile shape that we use for smaller problem sizes, which may explain some of the perf delta at smaller shapes.
This led to seeing if we can improve our triton codegen lowering for _scaled_mm (I think we should still do this: https://github.com/pytorch/pytorch/pull/137517).
In the meantime, @Chillee suggested I make sure swizzling is set for the large matmul shapes.
This PR makes sure that we increase the max_swizzle_size for the large matmuls.
## Performance
Note: red means the Triton-based TMA kernel beats `_scaled_mm`; blue means `_scaled_mm` is faster.
On nightly w/ Triton at (2ef33c6c4c3):

You can see that as M,K,N increase there is a clear win for the Triton Persistent TMA.
After this PR:

For example, with this change (power-limited GPU), at M=16384, K=16384, N=16384:
TFlops before: `985.49`
TFlops after: `1304.69`
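For reference, a minimal sketch of the row-wise scaled matmul being benchmarked (small assumed shapes, not the benchmark sizes; needs an fp8-capable GPU such as H100):
```python
import torch

M, K, N = 1024, 2048, 4096
a = torch.randn(M, K, device="cuda").to(torch.float8_e4m3fn)       # row-major
b = torch.randn(N, K, device="cuda").to(torch.float8_e4m3fn).t()   # column-major (K, N)
scale_a = torch.rand(M, 1, device="cuda", dtype=torch.float32)     # one scale per row of a
scale_b = torch.rand(1, N, device="cuda", dtype=torch.float32)     # one scale per column of b

out = torch._scaled_mm(a, b, scale_a=scale_a, scale_b=scale_b, out_dtype=torch.bfloat16)
```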
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137681
Approved by: https://github.com/eqy
Fixes #136722, fixes #136718
By default, it goes to onednn, so this PR adds a check to ensure stride > 0. The program will now quit with an error message if stride is 0.
FBGEMM and QNNPACK can create modules with stride=0 without error, but the program crashes when calling forward.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136739
Approved by: https://github.com/jgong5
Thanks @eqy for reminding me of this RFC: https://github.com/pytorch/pytorch/issues/119797
This PR is meant to:
- provide a way to abort multiple PGs without deadlocking each other.
- provide a possibility to manually handle comm errors or timeouts (and potentially recovery of such).
One can find an example from: https://github.com/NVIDIA/nccl/issues/1013
## How is it different from `destroy_process_group`?
`destroy_process_group` is meant for normal exit, while `_abort_process_group` is meant for bailout upon hangs or failures. Similar to `ncclCommDestroy` vs `ncclCommAbort`.
## What's new in `_abort_process_group`?
It added support for "group abort" semantic. The "group abort" semantic is capable of aborting multiple NCCL comms concurrently, avoiding deadlock in otherwise serialized `ncclCommAbort` executions. Details are in the [RFC](https://github.com/pytorch/pytorch/issues/119797) targeting [the hang issue in multi-comm case](https://github.com/NVIDIA/nccl/issues/1013). `Group abort` semantic is added in NCCL 2.22.
## What's next?
Ideally, the watchdog's behavior should support "group abort" too. But this is hard to implement today due to a lack of "global view" by each PG's individual watchdog. A semi-big refactor may be needed to "uplift" the watchdogs to a global level or consolidate them into one (i.e. one dog watching multiple PGs).
In any case, it may not be a bad idea to experiment with the "group abort" feature via a manual API first and then extend it to the automatic mode (watchdog).
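A hedged usage sketch of the manual bailout path (the call below assumes the `_abort_process_group` helper in `torch.distributed.distributed_c10d`; the exact signature may differ from what lands):
```python
import torch
import torch.distributed as dist
from torch.distributed import distributed_c10d

dist.init_process_group("nccl")
pg = dist.new_group(backend="nccl")
try:
    dist.all_reduce(torch.ones(1, device="cuda"), group=pg)
except RuntimeError:
    # Abort this PG's NCCL communicators instead of the normal destroy path;
    # with the "group abort" semantic, several PGs can be aborted concurrently
    # without deadlocking each other.
    distributed_c10d._abort_process_group(pg)
```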
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132291
Approved by: https://github.com/eqy
Enable FSDP to handle channels_last memory-formatted tensors. Preserving the channels_last memory format makes FSDP compatible with the best kernels cuDNN offers.
Summary of changes:
1) Store strides information along with shapes
2) Replace calls to flatten() with as_strided(size=(param.numel(),), stride=(1,)) for flattening
3) Replace calls to view() with as_strided with the stored sizes and strides for unflattening
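A minimal sketch of the flatten/unflatten round trip from the list above (illustrative, not the FSDP internals): as_strided-based flattening keeps the channels_last layout recoverable.
```python
import torch

param = torch.randn(2, 3, 4, 4).to(memory_format=torch.channels_last)
size, stride = param.size(), param.stride()                   # 1) store strides along with shapes

flat = param.as_strided(size=(param.numel(),), stride=(1,))   # 2) flatten via as_strided

restored = flat.as_strided(size=size, stride=stride)          # 3) unflatten with the stored sizes/strides
assert restored.is_contiguous(memory_format=torch.channels_last)
```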
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137382
Approved by: https://github.com/awgu
Resolves #137540
Summary:
We might get different state_dict and named_parameters results when the module has registered custom state_dict hooks.
For exported_program's state_dict, we want the state_dict to reflect the actual module hierarchy at runtime, and it might be different from the model's state_dict() output if the model has state_dict hooks.
To do weight swapping, one needs to either re-export or turn off the hooks when saving the model's state_dict().
Previously, ExportedProgram used nn.Module's state_dict() method to populate its own state_dict, but this doesn't work for some models (e.g. llama3_3_vision) because ExportedProgram's state_dict and an nn.Module's state_dict have some subtle differences semantically.
nn.Module's state_dict is about how the state should be serialized, and it reflects the structure of the original user model code. In contrast, export specializes on a “run” of a model, and its state_dict needs to reflect the runtime module hierarchy.
One example where these two are different is TorchTune's Llama3_2_vision text decoder. Here, a FusionLayer is added as a local optimization and it is not part of the "static model definition". At runtime, we have mod.layers[3].layer.sa_norm.scale.
But in nn.Module's state_dict, the authors of the model added a state_dict hook to remove the "layer" in mod.state_dict() to reflect the static model definition, so we have mod.state_dict()["layers.3.sa_norm.scale"].
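A hedged illustration of this mismatch (not the actual TorchTune code): a state_dict hook strips the extra `layer.` level, so `state_dict()` keys differ from the runtime hierarchy reported by `named_parameters()`.
```python
import torch.nn as nn

class Fusion(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(4, 4)
        self._register_state_dict_hook(self._strip_layer)

    @staticmethod
    def _strip_layer(module, state_dict, prefix, local_metadata):
        # Drop the "layer." level from the serialized keys, in place.
        for key in list(state_dict):
            if key.startswith(prefix + "layer."):
                state_dict[prefix + key[len(prefix) + len("layer."):]] = state_dict.pop(key)

m = Fusion()
print(sorted(m.state_dict()))                       # ['bias', 'weight']           <- hook applied
print(sorted(n for n, _ in m.named_parameters()))   # ['layer.bias', 'layer.weight'] <- runtime hierarchy
```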
In this Diff, we change ExportedProgram to populate its state_dict using named_parameters() and named_buffers() instead. So in ExportedProgram's state_dict, we have "layers.3.layer.sa_norm.scale", which reflects the runtime module hierarchy.
Now one problem this presents is weight swapping. Since ExportedProgram's state and the model's state is not the same anymore, weight swapping procedure also needs to change slightly.
In internal Ads and RecSys model deployments, weight swapping is where they have one model that is currently deployed and serving traffic, and they want to swap out the weights with newly trained model weights without having to redo the whole exporting/lowering process and create a new artifact. So they would move the deployed model’s pointer to the state dict over to the new state dict. Because of this, it was previously a requirement that the FQNs match between the exported and the eager model’s state dict.
The new ExportedProgram's state dict still supports weight swapping, but the state_dict to be swapped needs to be obtained from torch.export.exported_program instead of model.state_dict() if the model has state_dict hooks.
The new requirement is that the FQNs match between the exported program's state dict and the state_dict obtained under the `_disabled_load_state_dict_hooks(M)` context manager. One benefit of having this new API is that we are now in full control within export of gathering and updating the model state.
If a model doesn't have any state_dict hooks, one can still use model.state_dict() for weight swapping, so it's BC.
Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export -- -r test_export_for_training_with_state_dict_hooks
```
Differential Revision: D64080561
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137609
Approved by: https://github.com/angelayi, https://github.com/pianpwk
Move optimization from the export call to the `optimize()` method in ONNXProgram.
Users can call `optimize()` before calling `save()` to save the model. Right now if users set `optimize=True` in `torch.onnx.export` it will have the same effect as calling `optimize()`, but in the future we can evolve the method to be more flexible (e.g. target aware etc.)
Example
```python
onnx_program = torch.onnx.export(..., dynamo=True)
onnx_program.optimize()
onnx_program.save("model.onnx")
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137667
Approved by: https://github.com/titaiwangms
ghstack dependencies: #137666
Related to #107302
The following test fails with NumPy 2.
```
_________ TestNumPyInteropCPU.test_numpy_array_interface_cpu __________
Traceback (most recent call last):
File "/usr/local/google/home/haifengj/git/pytorch_np2/test/test_numpy_interop.py", line 415, in test_numpy_array_interface
wrapped_x = np.array([1, -2, 3, -4], dtype=dtype)
OverflowError: Python integer -2 out of bounds for uint8
To execute this test, run the following from the base repo dir:
python test/test_numpy_interop.py TestNumPyInteropCPU.test_numpy_array_interface_cpu
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
```
According to the official warning from NumPy 1, assigning a negative value to a `uint8` array is deprecated.
The recommended way is `np.array([1, -2, 3, -4]).astype(np.uint8)`.
See the following for details.
```
>>> np.array([1, -2, 3, -4], dtype=np.uint8)
<stdin>:1: DeprecationWarning: NumPy will stop allowing conversion of out-of-bound Python integers to integer arrays. The conversion of -2 to uint8 will fail in the future.
For the old behavior, usually:
np.array(value).astype(dtype)
will give the desired result (the cast overflows).
<stdin>:1: DeprecationWarning: NumPy will stop allowing conversion of out-of-bound Python integers to integer arrays. The conversion of -4 to uint8 will fail in the future.
For the old behavior, usually:
np.array(value).astype(dtype)
will give the desired result (the cast overflows).
array([ 1, 254, 3, 252], dtype=uint8)
```
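Concretely, the NumPy-2-compatible form (the values wrap the same way the old behavior did):
```python
import numpy as np

# Build the array first, then cast, so out-of-bounds values wrap as before.
wrapped_x = np.array([1, -2, 3, -4]).astype(np.uint8)
print(wrapped_x)  # [  1 254   3 252]
```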
This PR fixes the test failure.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137532
Approved by: https://github.com/soulitzer
This PR contains multiple fixes for issue https://github.com/pytorch/pytorch/issues/135279:
## First part:
Moves the GPU guard (`cudaSetDevice`) before the `currentStreamCaptureStatusMayInitCtx` call.
As its name suggests, it May Init Ctx.
## Second part:
Even with the above fix, additional contexts are still observed during Work object destruction, e.g.
```
work = dist.all_reduce(tensor, async_op=True)
time.sleep(5) <-- no additional context yet
del work <-- additional context shows up
```
### Debug process
Chasing it down to destruction of a `Future` object -- a member variable of `Work`.
Then further down to the following member of `Future`:
```
std::vector<c10::Event> events_;
```
When the `events_` are destroyed, we hit the road down to:
1f3a793790/c10/cuda/impl/CUDAGuardImpl.h (L106-L121)
When there is no "preset" CUDA context (**which is the case for python garbage collector**), line 112: `c10::cuda::GetDevice(&orig_device)` will set `orig_device` to 0. Then, at line 120, `c10::cuda::SetDevice(orig_device)` will "officially" set the context to device 0 --
**that's where rank 1, 2, ... can create extra context on device 0!**
### Solution
This PR adds an explicit destructor to `Future`. In this destructor, destroy each event with a device guard.
## Test
Added test_extra_cuda_context, implemented via
- `pynvml` (if available), or
- memory consumption check.
`python test/distributed/test_c10d_nccl.py -k test_extra_cuda_context`
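A simplified, hedged sketch of the pynvml-based check (not the exact test code): after the collective and `del work`, only rank 0's process should hold a context on device 0.
```python
import torch
import torch.distributed as dist
import pynvml

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank())
work = dist.all_reduce(torch.ones(1, device="cuda"), async_op=True)
work.wait()
del work  # previously this path could create an extra context on device 0

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
procs = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
if dist.get_rank() == 0:
    assert len(procs) == 1, f"extra CUDA contexts on device 0: {len(procs)}"
dist.destroy_process_group()
```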
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135273
Approved by: https://github.com/fduwjj, https://github.com/wconstab, https://github.com/eqy
ghstack dependencies: #137161
### Context
Fixes a CUDA IMA (illegal memory access) in autotune_at_compile_time, where we would generate an example tensor with an incorrect stride.
In the case below, the stride should be (u0 * 128, 128, 1). However, we applied the fallback to the entire expression (i.e. u0 * 128).
```
# buf817 = tensor(size=(s0, u0, 128), stride=(u0 * 128, 128, 1))
buf812 = generate_example_value(
(64, 8192, 128), (8192, 128, 1), "cuda:0", torch.bfloat16, 0
)
```
The fix is to apply the fallback on each symbol.
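A hedged sketch of that per-symbol fallback (assuming the usual 8192 fallback value for unbacked symbols, as in the snippet above):
```python
import sympy

u0 = sympy.Symbol("u0", positive=True, integer=True)
stride_expr = u0 * 128
fallback = 8192

# buggy behavior: the whole expression collapses to the fallback value
buggy_stride = fallback  # 8192

# fixed behavior: substitute the fallback per free symbol, then evaluate
fixed_stride = int(stride_expr.subs({s: fallback for s in stride_expr.free_symbols}))  # 1048576
```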
### Test
```
PYTORCH_NO_CUDA_MEMORY_CACHING=1 compute-sanitizer python test_aot_inductor.py -k test_stride_with_unbacked_expr_abi_compatible_cuda
========= Invalid __global__ write of size 2 bytes
```
Differential Revision: [D64074561](https://our.internmc.facebook.com/intern/diff/D64074561)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137537
Approved by: https://github.com/jingsh
Currently, there are compilation warnings, shown below, which are resolved by this fix:
```
/tmp/torchinductor_root/c7t6qm4gf35cxkk5jywa5booovl5n6ivzwdbbs5og7rdemqtgrzh/caoefkofe5jrkuaoch4lfpjwtodlcy4savxgzsxqldkcdof7ifh7.cpp: In function ‘ihipModuleSymbol_t* loadKernel(std::string, const string&, uint32_t, const std::optional<std::__cxx11::basic_string<char> >&)’:
/tmp/torchinductor_root/c7t6qm4gf35cxkk5jywa5booovl5n6ivzwdbbs5og7rdemqtgrzh/caoefkofe5jrkuaoch4lfpjwtodlcy4savxgzsxqldkcdof7ifh7.cpp:482:25: warning: ignoring returned value of type ‘hipError_t’, declared with attribute nodiscard [-Wunused-result]
482 | hipDrvGetErrorString(code, &msg); \
| ~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~
/tmp/torchinductor_root/c7t6qm4gf35cxkk5jywa5booovl5n6ivzwdbbs5og7rdemqtgrzh/caoefkofe5jrkuaoch4lfpjwtodlcy4savxgzsxqldkcdof7ifh7.cpp:519:5: note: in expansion of macro ‘CUDA_DRIVER_CHECK’
519 | CUDA_DRIVER_CHECK(hipModuleLoad(&mod, filePath.c_str()));
| ^~~~~~~~~~~~~~~~~
In file included from /opt/rocm/include/hip/hip_runtime.h:70,
from /pytorch/torch/include/torch/csrc/inductor/aoti_runtime/device_utils.h:14,
from /pytorch/torch/include/torch/csrc/inductor/aoti_runtime/model.h:17,
from /pytorch/torch/include/torch/csrc/inductor/aoti_runtime/model_container.h:13,
from /tmp/torchinductor_root/c7t6qm4gf35cxkk5jywa5booovl5n6ivzwdbbs5og7rdemqtgrzh/caoefkofe5jrkuaoch4lfpjwtodlcy4savxgzsxqldkcdof7ifh7.cpp:4:
/opt/rocm/include/hip/hip_runtime_api.h:2369:12: note: in call to ‘hipError_t hipDrvGetErrorString(hipError_t, const char**)’, declared here
2369 | hipError_t hipDrvGetErrorString(hipError_t hipError, const char** errorString);
| ^~~~~~~~~~~~~~~~~~~~
In file included from /opt/rocm/include/hip/hip_runtime.h:70,
from /pytorch/torch/include/torch/csrc/inductor/aoti_runtime/device_utils.h:14,
from /pytorch/torch/include/torch/csrc/inductor/aoti_runtime/model.h:17,
from /pytorch/torch/include/torch/csrc/inductor/aoti_runtime/model_container.h:13,
from /tmp/torchinductor_root/c7t6qm4gf35cxkk5jywa5booovl5n6ivzwdbbs5og7rdemqtgrzh/caoefkofe5jrkuaoch4lfpjwtodlcy4savxgzsxqldkcdof7ifh7.cpp:4:
/opt/rocm/include/hip/hip_runtime_api.h:399:3: note: ‘hipError_t’ declared here
399 | } hipError_t;
| ^~~~~~~~~~
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137626
Approved by: https://github.com/ColinPeppler, https://github.com/chenyang78
**Motivation**
Enable SVE vectorization with `torch.compile`
Extends PR: #119571
* This PR enables vectorization in the codegen path using SVE-256 (vector length)
* The changes can be extended to other SVE vector lengths
I've done some comparisons of the existing NEON implementation against the SVE-vectorized route for `torch.compile`.
Test results are for 8 cores on ARM Neoverse_V1
<img width="359" alt="Screenshot 2024-08-28 at 16 02 07" src="https://github.com/user-attachments/assets/6961fbea-8285-4ca3-b92e-934a2db50ee2">
It's worth mentioning that for the standalone `SiLU` op there's a ~1.8x speedup with `torch.compile`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134672
Approved by: https://github.com/jgong5, https://github.com/malfet
The change in pytorch/pytorch#136785 enabled these jobs to run on LF runners; however, we saw a sudden large spike in cost once that happened last week, which would have caused us to overuse our available AWS credits. This change hard-locks these jobs to Meta runners. We need this at least until we can figure out how to handle the additional spend caused by these jobs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137616
Approved by: https://github.com/Skylion007, https://github.com/seemethere
Summary: We hipify NCCLUtils.h from nccl.h to rccl/rccl.h. This follows the format of the ROCm rpm suite (the header is in include/rccl/rccl.h); however, in the source code it is just src/rccl.h. Using rccl/rccl.h will make us find the rpm's header but not the source code's header.
Test Plan:
buck run mode/opt-amd-gpu -c hpc_comms.use_rccl=develop -c fbcode.split-dwarf=True --config rccl.build_rdma_core=true --config rccl.adhoc_brcm=true //aps_models/ads/icvr:icvr_launcher -- mode=local_ctr_cvr_cmf_rep_1000x_v1_no_atom data_loader.dataset.table_ds=[2024-09-04] data_loader.dataset.batch_size=512 max_ind_range=10
Without this diff, it shows NCCL version 2.18.
Differential Revision: D62371434
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135472
Approved by: https://github.com/jeffdaily, https://github.com/cenzhaometa
Summary: Fixes a couple of problems: constants didn't have metadata before creating graph signatures, and graph signatures weren't updated when lifting constants.
Test Plan: fixed test
Differential Revision: D64081786
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137547
Approved by: https://github.com/tugsbayasgalan
**Summary**
Previously, we assumed the packed weight for (`MKL/MKLDNN`) linear operations was at `new_input_nodes[1]`. However, this is not the case for `MKL linear`, where `new_input_nodes[1]` contains the original weight instead of the packed weight. To generalize the code, in this PR, we identify nodes that are present in `input_nodes` but not in `new_input_nodes`—indicating they are no longer used by the GEMM template and can be considered candidates for deletion.
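A hedged sketch of that rule (illustrative names, not the Inductor internals):
```python
def deletion_candidates(input_nodes, new_input_nodes):
    # Nodes the GEMM template stopped referencing can be considered for deletion.
    kept = {id(n) for n in new_input_nodes}
    return [n for n in input_nodes if id(n) not in kept]

# For MKL linear this flags the right nodes without assuming the packed
# weight sits at new_input_nodes[1].
```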
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135101
Approved by: https://github.com/jgong5, https://github.com/jansel
It seems that there's a bug in `TensorMaker` - it would treat `storage_offset` as bytes when calculating the storage size, but as an element count when setting the tensor's `storage_offset`. This seems to be causing tensors returned by get_buffer() with a non-zero offset to report the wrong storage size.
Will look into the `TensorMaker` issue further. But for `get_buffer()`, it seems more natural to just incorporate the offset into the data pointer.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137569
Approved by: https://github.com/weifengpy
ghstack dependencies: #137567
During auto_functionalize_v2, if we encounter a view whose size(), stride(), and storage_offset() match the base, we regenerate the view by calling aten.alias instead of as_strided, for better performance.
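A hedged sketch of the regeneration rule (illustrative, not the actual auto_functionalize_v2 code):
```python
import torch

def regenerate_view(base, size, stride, storage_offset):
    # If the view's metadata matches its base exactly, a plain alias is enough.
    if (list(base.size()) == list(size)
            and list(base.stride()) == list(stride)
            and base.storage_offset() == storage_offset):
        return torch.ops.aten.alias.default(base)
    return torch.ops.aten.as_strided.default(base, size, stride, storage_offset)
```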
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137149
Approved by: https://github.com/zou3519
Remove the `static` keyword for the vec-register constants in the exp_u20 implementation. With the bf16 input shape of BertLarge, the SDPA kernel improves from 5.1ms to 4.7ms on SPR with 56 threads.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137571
Approved by: https://github.com/jgong5
Fixes #115725. Note that the GitHub issue title is misleading; read the comments to understand what the problem is really about.
The PR improves the documentation and CMake's behavior for ROCM builds.
- Documentation: There were two environment variables for ROCm builds that are now documented. `ROCM_PATH` and `PYTORCH_ROCM_ARCH`.
- CMake: Improved diagnostic messaging and error handling with respect to `ROCM_PATH`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137308
Approved by: https://github.com/pruthvistony, https://github.com/jithunnair-amd, https://github.com/jeffdaily
Scripts for building aarch64 PyTorch PIP Wheels. These scripts build the following wheels:
* torch
* torchvision
* torchaudio
* torchtext
* torchdata
## Aarch64_ci_build.sh
This script is designed to support CD operations within the PyPI manylinux aarch64 container and to be executed inside the container. It prepares the container and then executes __aarch64_wheel_ci_build.py__ to build the wheels. The script "assumes" the PyTorch repo is located at ```/pytorch``` and will put the wheels into ```/artifacts```.
This app allows a person to build using AWS EC2 resources and requires AWS CLI and Boto3 with AWS credentials to support building EC2 instances for the wheel builds. It can be used in a CodeBuild CD job or from a local system.
If you do not have upload permissions, please ping @seemethere or @soumith to gain access
## New versions
New CUDA versions can be added by creating a new make target with the next desired version. For CUDA version NN.n, the target should be named `magma-cudaNNn`.
Make sure to edit the appropriate environment variables (e.g., DESIRED_CUDA, CUDA_ARCH_LIST) in the `Makefile` accordingly. Remember also to check `build_magma.sh` to ensure the logic for copying over the files remains correct.
New patches can be added by editing `Makefile` and `build_magma.sh` the same way `getrf_nbparam.patch` is implemented.
# File "/pytorch/test/common.py", line 399, in assertEqual
# super(TestCase, self).assertEqual(x, y, message)
# AssertionError: None != 1
tests_to_skip+=('TestDistBackend and test_broadcast')
# Memory leak very similar to all the conda ones below, but appears on manywheel
# 3.6m_cu80
# AssertionError: 1605632 not less than or equal to 1e-05 : __main__.TestEndToEndHybridFrontendModels.test_vae_cuda leaked 1605632 bytes CUDA memory on device 0
tests_to_skip+=('TestEndToEndHybridFrontendModels and test_vae_cuda')