Summary:
att
Test Plan:
doc page generated from CI
Reviewers:
Subscribers:
Tasks:
Tags:
ghstack-source-id: 14150e4e3a5d65f1a9eb5a4201fcd4bac09920f1
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157766
This PR adds a new config option, `caching_precompile`, and a `DynamoCache`, which loads and saves Dynamo Cache entries automatically. It also hooks up DynamoCache to PrecompileContext, so that we can save multiple cache entries.
When this configuration is turned on, we:
- Automatically create and initialize a CompilePackage on every torch.compile
- Automatically use BundledAutogradcache
- Automatically save the CompilePackage entry to DynamoCache after every compile
You can also use PrecompileContext.serialize() to manually serialize a full object.
I've added unit tests to exhibit this behavior.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155913
Approved by: https://github.com/zhxchen17
This PR uses `find_library` to replace `find_path`.
It also searches for NVSHMEM host lib and device lib separately.
Tested against system install location: /usr/local/lib and /usr/local/include.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157695
Approved by: https://github.com/Skylion007
ghstack dependencies: #157513
The structure is
```
torch/
__init__.py
version.py
```
When we import torch, only `torch/__init__.py` is executed by default.
The submodules like `version.py` are not automatically imported or attached to the torch module.
So without anything in `__init__.py`, `torch.version` may not be found. So in this PR, we make the import explicit.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157584
Approved by: https://github.com/ezyang
Summary: We want to add versioning to DCP to the metadata so that whenever planner logic changes, we can use the version on save to determine how to load the data
Test Plan:
added a test
Rollback Plan:
Differential Revision: D76135887
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155343
Approved by: https://github.com/teja-rao
Note on backward precision over fp16:
A float16 number has 10 bits of mantissa, 5 bits of exponent, and 1 bit for the sign. If the sign bit is positive, then with a mantissa $m$ and exponent $e$ represented in base 10, the number that the float16 format represents is $(1 + m / 1024) \exp2(e)$. ([source](https://en.wikipedia.org/wiki/Half-precision_floating-point_format))
Consider adding two numbers $a$ and $b$ which have arbitrary mantissas, and say their exponents are $e_a = 1$ (so $2 \le a \lt 4$) and $e_b=-3$ (so $0.175 \le b \lt 0.25$). Assume that the result has the same exponent as $a$. Since the exponents differ by 4, we'll effectively need to truncate the 4 rightmost bits of $b$'s mantissa, which would introduce a maximum error on the order of $(2^4 / 1024) \exp2(-3) \approx 0.002$.
The error is nearly the same if $e_b = -2$ (so $0.25 \le b \lt 0.5$), where the 3 rightmost bits are truncated, giving a maximum error on the order of $(2^3 / 1024) \exp2(-2) \approx 0.002$. Same for $e_b=-1$.
So if we're adding up nine different numbers that all have exponents -3, -2, or -1, and they sum to a number with exponent 1, then we would expect a maximum error of several times greater than 0.002. In my comments above, summing those particular nine numbers in different ways gave results that ranged between 3.1816 and 3.1758, a difference of $0.0058 \approx 2.9 * 0.002$.
That's within the acceptable bounds, and we can safely just increase the error tolerance used in test_output_grad_match for the case of max_pool3d_backward with float16.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157498
Approved by: https://github.com/malfet
This PR fixes a minor typo in a comment in `torch/_dynamo/variables/torch.py`, changing 'paramter' to the correct spelling 'parameter'.
These small but meaningful changes help improve code readability and maintain the overall quality of the codebase.
Thanks for your time and review!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157651
Approved by: https://github.com/Skylion007
***SUMMARY***
The main function in this tests overrides that of the Gtest framework which contains it's `RUN_ALL_TESTS()` function. The main function in this test is called conditionally when conditions apply, in this case, when the C10_MOBILE directive is provided. This is wrong as we always want to call the `RUN_ALL_TEST()` function.
In this PR, we only make the test suite available for cases that apply, i.e if the C10_MOBILE directive exist which represents the caching allocator and is only exposed on mobile
***TEST PLAN***
This tests should run in modes where it applies which should be covered in the CI run.
Below shows a sample run in the dev-nosan mode which do not have the cache allocator
BEFORE
```
buck test fbcode//caffe2:cpu_caching_allocator_test
Discovered 0. Pass 0. Fail 0. Fatal 0. Skip 0. Timeout 0
⚠ Listing failed: caffe2:cpu_caching_allocator_test
Listing tests failed with error:
Failed to read from /data/users/ysuleiman/fbsource/buck-out/v2/test/buck-out/v2/test_discovery/fbcode/6dcc55a61c1b90b3/default/tpx_execution_dir/gtest_output_file.json. Listing process stdout: , stderr:
```
AFTER
```
buck test '@fbcode//mode/dev-nosan' fbcode//caffe2:cpu_caching_allocator_test
Analyzing targets. Remaining 0/46242 1871690 actions, 2251668 artifacts declared
Executing actions. Remaining 0/257870 83:28:24.4s exec time total
Command: test. Finished 10 remote, 112314 cache (99% hit) 83:22:43.5s exec time cached (99%)
Time elapsed: 2:57.7s
Tests finished: Pass 0. Fail 0. Fatal 0. Skip 0. Build failure 0
NO TESTS RAN
```
Rollback Plan:
steps:
- manual.note:
content: Revert this diff
Reviewed By: patskovn
Differential Revision: D77229077
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156816
Approved by: https://github.com/kimishpatel
This PR adds a new config option, `caching_precompile`, and a `DynamoCache`, which loads and saves Dynamo Cache entries automatically. It also hooks up DynamoCache to PrecompileContext, so that we can save multiple cache entries.
When this configuration is turned on, we:
- Automatically create and initialize a CompilePackage on every torch.compile
- Automatically use BundledAutogradcache
- Automatically save the CompilePackage entry to DynamoCache after every compile
You can also use PrecompileContext.serialize() to manually serialize a full object.
I've added unit tests to exhibit this behavior.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155913
Approved by: https://github.com/zhxchen17
When `torch.backends.mkldnn.matmul.fp32_precision=='bf16'`, we also enabled mkldnn linear in inductor path and allow to run with bf16 computation data type.
Testplan:
```
python test/inductor/test_mkldnn_pattern_matcher.py -k test_linear_unary
python test/inductor/test_mkldnn_pattern_matcher.py -k test_linear_fp32
python test/inductor/test_mkldnn_pattern_matcher.py -k test_multi_linear_share_same_input
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127294
Approved by: https://github.com/jgong5, https://github.com/jansel
Co-authored-by: Jiang, Yanbing <yanbing.jiang@intel.com>
Fixes#151223
Because FSDP stores original parameters as views into a flattened tensor, changing the flattened parameter’s tensor directly can desynchronize the views. With the NO_SHARD strategy this caused a shape mismatch error when writing back modified parameters.
Ensured writeback handles NO_SHARD correctly by flattening tensors before copying. The logic now flattens the source parameter or gradient when the strategy is unsharded to maintain the expected 1‑D shape for writeback operations
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154369
Approved by: https://github.com/weifengpy
When CC and CXX compiler is set to clang, and clang was compiled with libc++, compilation of torchvision fails with:
```
File "/usr/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 585, in build_extensions
compiler_name, compiler_version = self._check_abi()
^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 1034, in _check_abi
_, version = get_compiler_abi_compatibility_and_version(compiler)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 449, in get_compiler_abi_compatibility_and_version
if tuple(map(int, version)) >= minimum_required_version:
^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: invalid literal for int() with base 10: '7+libcxx'
```
Compiler identification is a valid semantic version:
```
$ clang -dumpfullversion -dumpversion
20.1.7+libcxx
```
After adjusting parser of version, clang is able to compile extensions successfully.
Fixes#157665
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157666
Approved by: https://github.com/msaroufim
This PR addresses a minor typo in the file `test/quantization/fx/test_model_report_fx.py`:
- Corrected the word "paramter" to "parameter" for better readability and accuracy.
While it's a small change, correcting such typographical errors contributes to maintaining the overall quality and professionalism of the codebase.
Thank you for your time and consideration in reviewing this PR. I'm happy to make any further adjustments if needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157646
Approved by: https://github.com/yewentao256, https://github.com/ezyang
This pull request fixes a minor typo in the doc comments of `test/nn/test_parametrization.py`.
- Replaced `'Intializing'` with `'Initializing'` in two docstring comments to improve clarity and maintain consistency across the codebase.
This is a non-functional change and does not impact behavior or test outcomes.
Thank you for maintaining such a high-quality codebase. Please let me know if any adjustments are needed. I'd be happy to help!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157362
Approved by: https://github.com/ezyang
They might have been slow on CUDA-11.3, but this version of CUDA is long gone. More fundamental underlying issue were linear complexity of the recursive polynomial definitions for higher order polynomials, for example see this loop from implementation of Chebyshev polynomial of the first kind
7081b8233a/aten/src/ATen/native/Math.h (L2969-L2973)
which were tested by `test_compare_cpu` using following values (as sample index 16)
7081b8233a/torch/testing/_internal/opinfo/core.py (L2079)
Luckily chebyshev polynomials for absolute values higher than 1 pretty quickly reach infinity, see below
```
python3 -c "import torch;print(torch.special.chebyshev_polynomial_v(torch.nextafter(torch.tensor(1.0), torch.tensor(2.0)), torch.tensor(1e6)))"
tensor(nan)
```
Which is not the case for Laguerre polynomials, but it's probably fine to just limit it to 1e7
Before
```
$ PYTORCH_TEST_WITH_SLOW=1 python test_ops.py -k chebyshev_polynomial_
ssssssss..ssssss..ssssss..ssssssssssssssssssssss..ssssss/home/ubuntu/py3.10-nightly/lib/python3.10/site-packages/torch/backends/cuda/__init__.py:131: UserWarning: This API is going to be deprecated, please see https://pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices (Triggered internally at /pytorch/aten/src/ATen/Context.cpp:78.)
return torch._C._get_cublas_allow_tf32()
....ssssssssssss..ssssss..ssssss............ssssssssssssssssssssssssssssssssssss..ssssssssssssss..ssssss..ssssssssssssssssssssssssssssss..ssssss....ssssssssssss..ssssss..ssssss............ssssssssssssssssssssssssssssssssssss..ssssss..ssssssssssssss..ssssss..ssssss..ssssssssssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssssssssssss
----------------------------------------------------------------------
Ran 432 tests in 8.575s
OK (skipped=344)
```
After
```
$ PYTORCH_TEST_WITH_SLOW=1 python test_ops.py -k chebyshev_polynomial_
ssssssss........................ssssssssssssssss......../home/ubuntu/pytorch/torch/backends/cuda/__init__.py:131: UserWarning: This API is going to be deprecated, please see https://pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices (Triggered internally at /home/ubuntu/pytorch/aten/src/ATen/Context.cpp:78.)
return torch._C._get_cublas_allow_tf32()
........................................................................................xxxxxxxx................ssssssssssssssssssssssss........................................................................................................ssssssss........................ssssssss........................................................................................ssssssss
----------------------------------------------------------------------
Ran 432 tests in 45.580s
OK (skipped=72, expected failures=8)
```
Fixes https://github.com/pytorch/pytorch/issues/79528
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157464
Approved by: https://github.com/Skylion007, https://github.com/dcci
ghstack dependencies: #157488
Fixes #ISSUE_NUMBER
This PR fixes a small punctuation issue in the PyTorch README.
Specifically:
Added a missing full stop at the end of the sentence:
"Note: You could refer to the cuDNN Support Matrix for cuDNN versions with the various supported CUDA, CUDA driver and NVIDIA hardware."
Added comma for clarity between "CUDA driver" and "NVIDIA hardware".
These edits improve the readability and grammatical correctness of the documentation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157623
Approved by: https://github.com/Skylion007
This PR addresses a typo in the file `test/mobile/model_test/gen_test_model.py`.
### Changes:
- Corrected "occurances" to the correct spelling "occurrences"
- Renamed associated variables to reflect this change for consistency and clarity
This is a non-functional, cleanup-only PR to improve code readability.
Thanks to the PyTorch team for maintaining such a high-quality codebase
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157629
Approved by: https://github.com/Skylion007
This PR addresses a minor typo in the documentation file aten/src/ATen/cuda/tunable/README.md, where paramters has been corrected to parameters for improved clarity and consistency.
Context
Accurate and clear documentation is crucial for helping developers and contributors understand PyTorch internals. This small fix contributes to the overall quality and readability of the project.
Thank you to the PyTorch team and maintainers for your continued efforts in building such an incredible framework. I'm happy to contribute in any way I can — even if just with a small doc improvement like this one.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157575
Approved by: https://github.com/eqy
Change `build-system.build-backend`: `setuptools.build_meta:__legacy__` -> `setuptools.build_meta`. Also, move static package info from `setup.py` to `pyproject.toml`.
Now the repo can be installed from source via `pip` command instead of `python setup.py develop`:
```bash
python -m pip install --verbose --editable .
python -m pip install --verbose --no-build-isolation --editable .
```
In addition, the SDist is also buildable:
```bash
python -m build --sdist
python -m install dist/torch-*.tar.gz # build from source using SDist
```
Note that we should build the SDist with a fresh git clone if we will upload the output to PyPI. Because all files under `third_party` will be included in the SDist. The SDist file will be huge if the git submodules are initialized.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155998
Approved by: https://github.com/ezyang, https://github.com/cyyever, https://github.com/atalman
ghstack dependencies: #157557
This PR corrects a small spelling error in `test/jit/test_alias_analysis.py`.
- "initalized" → "initialized"
This is a minor comment correction and does not affect functionality or logic.
Thank you for maintaining this amazing codebase.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157628
Approved by: https://github.com/Skylion007
Fixes#154978
## Test Result
```python
>>> import torch
>>> import numpy as np
>>> import torch.nn as nn
>>> import torch.distributions.normal as norm
>>> device = torch.device(('cuda' if torch.cuda.is_available() else 'cpu'))
>>> print('Using {}'.format(device))
Using cuda
>>> m = nn.Sequential(nn.Linear(1, 128).cuda(), nn.Tanh(), nn.Linear(128, 128).cuda(), nn.Tanh(), nn.Linear(128, 128).cuda(), nn.Tanh())
>>> m.to(device, dtype=None, non_blocking=False)
Sequential(
(0): Linear(in_features=1, out_features=128, bias=True)
(1): Tanh()
(2): Linear(in_features=128, out_features=128, bias=True)
(3): Tanh()
(4): Linear(in_features=128, out_features=128, bias=True)
(5): Tanh()
)
>>> opt = torch.optim.Adam(m.parameters(), lr=0.001)
>>> print('Number of trainable parameters: ', sum((p.numel() for p in m.parameters() if p.requires_grad)))
Number of trainable parameters: 33280
>>> input_tensor = torch.tensor(77.0, device=device)
>>> target = torch.tensor(66.0)
>>> loss_function = nn.MSELoss()
>>> print('Loss Function: ', loss_function)
Loss Function: MSELoss()
>>> loss = loss_function(input_tensor, target)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/zong/code/pytorch/torch/nn/modules/module.py", line 1767, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/zong/code/pytorch/torch/nn/modules/module.py", line 1778, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/zong/code/pytorch/torch/nn/modules/loss.py", line 610, in forward
return F.mse_loss(input, target, reduction=self.reduction)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/zong/code/pytorch/torch/nn/functional.py", line 3903, in mse_loss
return torch._C._nn.mse_loss(
^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155089
Approved by: https://github.com/cyyever, https://github.com/albanD
Thanks @huydhn for spotting two name mismatches in the CI configs.
We were matching against "test_h100_symm_mem" instead of "h100-symm-mem".
Also, replaced `TORCH_SYMMMEM` env setting with programmatic method:
`symm_mem.set_backend(...)`
Further, skips a hanged test in `test_nvshmem_trion.py`. (#TODO @codingwithsurya )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157597
Approved by: https://github.com/fduwjj, https://github.com/huydhn
Summary: We currently have foreach kernel implementations for MTIA, and for when we don't we internally decompose the ops. Anyone using this list for compatibility checks should be sending through the foreach kernels.
Reviewed By: egienvalue, scottxu0730
Differential Revision: D77751248
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157583
Approved by: https://github.com/egienvalue
Summary: Add a pass use_triton_fp8_swish_replace_normal_swish to replace _triton_swish_rms_norm with its counterpart that supports fp8 triton_swish_rms_norm, and turn on fp8 during inference.
Test Plan:
```
buck2 run mode/opt mode/inplace -c fbcode.platform010_cuda_version=12.4 -c fbcode.nvcc_arch=h100 caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark -- --lower-backend=AOT_INDUCTOR --model-snapshot-id=899072727_0 --node-replacement-dict="{}" --gpu-trace --add-passes=use_triton_fp8_swish_replace_normal_swish
```
The perf improvement on the 100x model with this pass is roughly ~7%, details are recorded [here](https://docs.google.com/document/d/1eIV_OTQyQcf_DlEDxwycTwhyGxT5OJkLzs8cPL6EMYc/edit?tab=t.0)
Rollback Plan:
Reviewed By: frank-wei
Differential Revision: D76531303
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157574
Approved by: https://github.com/frank-wei
Summary:
This is an improvement over `_broadcast_rank0_decision` where we uses the rank0's decision to broadcast to every rank. The issue of `_broadcast_rank0_decision` is that we observed large variance on the peak memory usage. One cause is that different ranks receive different dynamic shaped tensors and the hints of those tensors are different in different ranks. If we only rely on rank0's decision and it's unlucky to get unrepresentative hints, then the decision it makes may not be suitable for other ranks.
Here, we introduce `sync_cross_rank_decision` which comes up with the decision after comparing all ranks' local decision, it will:
1. all gather decisions from all ranks;
2. test each decision on the current rank and get its estimated memory usage;
3. all reduce estimated memory usage with ReduceOp.MAX, so that we know the maximum memory usage of each decision on all ranks;
4. pick the decision which gives us minimum maximum memory memory usage;
A graph to show more details
https://internalfb.com/excalidraw/EX484509
After applying sync_cross_rank_decision, we observed that the variance are much smaller
Rollback Plan:
Differential Revision: D76714005
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156287
Approved by: https://github.com/fmassa, https://github.com/bdhirsh
Summary:
- symbolic shapes statically_known_true usage is wrong, this API is meant to be used for SymNodes. what is needed is V.graph.sizevars.statically_known_true. or V.graph.sizevars.statically_known_Equals or ideally V.graph.sizevars.statically_known_multiple_of.
- The construction using == 0 is not symbolic, this used to always return false for symbolic inputs.
Differential Revision: D77619293
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157400
Approved by: https://github.com/ColinPeppler
Porting passes to bucket all_gathers
The main logic of the pass is done via
1. Searching for all all_gathers from the buckets
Copying tests from @wconstab PR to test compatibility with reordering.
Test checks only compatibility, as because of (3) the joint all_gather will be scheduled already as early as possible and no space for reordering.
Pass changes:
Using mutation ops to match performance of fsdp, in future the perfect scenario will be to have only functional graph, that inductor does all memory optimizations on its own without mutable ops.
Inductor changes:
Adding foreach_copy_ lowering
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157396
Approved by: https://github.com/wconstab
This PR corrects minor typos in developer-facing comments:
- Replaces 'recieve' with 'receive' in:
- `FunctionalTensorWrapper.cpp`
- `make_boxed_from_unboxed_functor.h`
These changes improve code readability and maintain comment correctness.
Thank you for reviewing!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157544
Approved by: https://github.com/soulitzer
I am trying to see if I can easily add the linearity support for aten.mul to allow Partial placement to propagate through. But it turns out that I have to completely rework the current linearity propagation.
In short, before this PR, linearity mainly support aten.add and some trival ops. It is done by allowing input Partial to propagate, and in the meanwhile, redistribute Replicate inputs to Partial to preserve the single device semantic, i.e suppose we want to execute `aten.add(lhs, rhs)` on 2 ranks:
* `lhs` is partial, value on rank 0: `r0`, lhs value on rank 1: `r1`
* `rhs` is replicate, value: `a`
Then in order to preserve single device semantic (which should produce the value of `a + r0 + r1`), we do `rhs/world_size` first, then add `rhs` to `lhs`. This means every operand would first need be partial, then we can add them together.
But this become non-true for multiplicative operations, like `aten.mul`, for `aten.mul`, assuming the same `aten.mul(lhs, rhs)` and value, we don't need to divide lhs by world_size to preserve single device semantic, b.c. `a* (r0+r1) = a* r0 + a* r1`
So to accomodate the difference of add/mul, in this PR I:
* change linearity to be a int to support different linearity types, add linearity and multiplicative are separate
* add checks to ensure only a subset of partial types can support linearity (namely partial-sum/avg)
* handle the linearity type plumbing through the pointwise ops.
* add `mul.Tensor/Scalar` to be the multiplicative linearity
* added the tests to show that the partial placements can be propagated with `aten.mul`
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157340
Approved by: https://github.com/zpcore
This PR fixes a minor typo in `test/jit/test_modules.py`:
- Before: `intialized`
- After: `initialized`
There are no functional code changes — this is a comment-only fix to improve clarity and consistency.
Thank you to the PyTorch team for maintaining this outstanding project.
Please let me know if anything else is needed.
With appreciation,
Abhishek Nandy
[@abhitorch81](https://github.com/abhitorch81)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157226
Approved by: https://github.com/Skylion007
This is a minor typo fix in `test/test_transformers.py`:
- Renamed `intial_query_grad` to `initial_query_grad` for improved clarity and correctness in test variable naming.
There are **no functional or logic changes** — this PR is aimed purely at improving readability and maintaining code quality.
Thanks to the PyTorch team for their work and review time
Please feel free to suggest if this needs any adjustment.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157306
Approved by: https://github.com/Skylion007
Summary:
This is a follow up after the PR to add comm override support: https://github.com/pytorch/pytorch/pull/155189
The previous PR loosely checks the allocation mixin classes, which isn't really safe as the actual hook may still override the behavior.
This may lead to unnecessary confusion for no good use case. So for now we just make the 2 sets of APIs largely incompatible:
1. setting custom comms after `set_allocate_memory_from_process_group_for_comm()` is ok.
2. setting `set_allocate_memory_from_process_group_for_comm()` after custom comms is ko.
Basically `set_allocate_memory_from_process_group_for_comm` is like a drop in hammer while the `set_custom_all_gather/reduce_scatter()` are like finer-grained scalpels that require more code crafted.
We can revisit this if there's use case in between but for now they can be largely viewed independent from each other (even tho we do share some of the underlying pieces for now, that could be subject to change and should not be exposed to end users).
Test Plan: added UT
Differential Revision: D77681620
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157487
Approved by: https://github.com/weifengpy
Covers the case where the output of one collective feeds the input of another collective.
e.g. TP + FSDP - all_gather(tp+dp sharded param on TP dim) -> allgather dp_sharded buffer on DP dim
Fixes a bug where the reordering pass specifically exempted wait nodes from dependencies.
Note: this exemption was incorrect, so it should be removed. But it was also put there for a reason, to help move collectives past wait nodes that are not related to that collective. After this fix, reordering performance may be worse and we need to find a smarter way to decide if a particular wait node is a blocker for a given collective.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157489
Approved by: https://github.com/IvanKobzarev
ghstack dependencies: #156879
The reason for inner/outer method is to keep the outer method conforming
to the typedef for a comms graph pass which returns one obj, while
allowing unit tests to call the inner method that returns more metadata
useful for testing the pass. The logs should be in the inner part, so
they are functional also during unit testing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156879
Approved by: https://github.com/IvanKobzarev
Change `build-system.build-backend`: `setuptools.build_meta:__legacy__` -> `setuptools.build_meta`. Also, move static package info from `setup.py` to `pyproject.toml`.
Now the repo can be installed from source via `pip` command instead of `python setup.py develop`:
```bash
python -m pip install --verbose --editable .
python -m pip install --verbose --no-build-isolation --editable .
```
In addition, the SDist is also buildable:
```bash
python -m build --sdist
python -m install dist/torch-*.tar.gz # build from source using SDist
```
Note that we should build the SDist with a fresh git clone if we will upload the output to PyPI. Because all files under `third_party` will be included in the SDist. The SDist file will be huge if the git submodules are initialized.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155998
Approved by: https://github.com/ezyang, https://github.com/cyyever, https://github.com/atalman
They were slow on CUDA-11.3, which has long been gone, let's see if they work now
Before
```
$ python test_ops.py -k chebyshev_polynomial_
ssssssss..ssssss..ssssss..ssssssssssssssssssssss..ssssss/home/ubuntu/py3.10-nightly/lib/python3.10/site-packages/torch/backends/cuda/__init__.py:131: UserWarning: This API is going to be deprecated, please see https://pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices (Triggered internally at /pytorch/aten/src/ATen/Context.cpp:78.)
return torch._C._get_cublas_allow_tf32()
....ssssssssssss..ssssss..ssssss............ssssssssssssssssssssssssssssssssssss..ssssssssssssss..ssssss..ssssssssssssssssssssssssssssss..ssssss....ssssssssssss..ssssss..ssssss............ssssssssssssssssssssssssssssssssssss..ssssss..ssssssssssssss..ssssss..ssssss..ssssssssssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssssssssssss
----------------------------------------------------------------------
Ran 432 tests in 8.575s
OK (skipped=344)
```
After
```
$ python test_ops.py -k chebyshev_polynomial_
ssssssss........................ssssssssssssssss......../home/ubuntu/py3.10-nightly/lib/python3.10/site-packages/torch/backends/cuda/__init__.py:131: UserWarning: This API is going to be deprecated, please see https://pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices (Triggered internally at /pytorch/aten/src/ATen/Context.cpp:78.)
return torch._C._get_cublas_allow_tf32()
........................................................................................ssssssss................ssssssssssssssssssssssss........................................................................................................ssssssss........................ssssssss........................................................................................ssssssss
----------------------------------------------------------------------
Ran 432 tests in 42.379s
OK (skipped=80)
```
Fixes https://github.com/pytorch/pytorch/issues/79528
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157464
Approved by: https://github.com/Skylion007
Summary:
When we compute contiguity for a tensor with dynamic shapes we first:
1) Try to compute it without guarding.
2) If all shapes hinted, compute it with potentially adding guards.
3) if any input is not hinted, compute it symbolically.
sym_is_contiguous return a SymBool that is then either evaluated or guard_or_false can be called
on it to avoid data dependent errors.
ex:
bool is_contiguous = input.sym_is_contiguous().guard_or_false(__FILE__, __LINE__);
is_contiguous_or_false is a helper function that does that.
In this PR I only handle default contiguity, will follow up with changes for other formats like channel_last .
We use this patter in this PR for several locations to avoid DDEs.
Test Plan:
contbuild & OSS CI,
Rollback Plan:
Reviewed By: malfet
Differential Revision: D77639021
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157472
Approved by: https://github.com/aorenste
Add complex number unwrapping in functional collectives used by DTensor.
Complex tensors are not directly supported by underlying comm kernels
(e.g. nccl) but complex tensors can be viewed as real tensors of a
higher rank (added size-2 tensor dim represents real vs im component).
Collective output is then viewed as complex to restore the
original/expected shape and dtype.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157329
Approved by: https://github.com/XilunWu
Change the branch/tag deletion script that runs once per day to delete more tags
Previous: only delete ciflow tags that didn't correspond to an open PR
New: delete ciflow tags attached to commits that are > 7 days old. Also delete `trunk/<sha>` (I think they are for autorevert) tags that are attached to commits that are > 7 days old
It's hard to figure out when the actual tag was pushed or created, so instead it looks at the commit date, which might lead to unexpected behavior if the tag was pushed much later than the commit (ex triggering periodic later to bisect). I think it's ok though since you don't really need the tag after the workflow runs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157468
Approved by: https://github.com/izaitsevfb
Serialize BUILTIN_MATCH since they are all stored in __builtin__ dict.
Also fixed an issue that the wrong global scope is passed to CheckFunctionManager while loading guards. Previously we can always reuse the compile-time global scope for evaluating guards because the compile-time and runtime global scope are always the same.
For precompile, we need to serialize the compile-time global scope for loading only. We need to point the CheckFunctionManager to the new global scope after loading is finished for evaluating guards.
Differential Revision: [D77159313](https://our.internmc.facebook.com/intern/diff/D77159313/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157016
Approved by: https://github.com/jansel, https://github.com/jamesjwu
Summary: The global gemm cache has not been maintained in ~1 year, and the only entry point (`search_autotune_cache`) was recently deprecated. Meaning, this is now dead code that we can remove.
Test Plan:
CI
Rollback Plan:
Differential Revision: D77520979
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157327
Approved by: https://github.com/jansel
Otherwise it turns test into a trivial one(that always succeeds), as following example demonstrates
```python
import torch
from torch.testing._internal.common_utils import serialTest, run_tests, TestCase
class MegaTest(TestCase):
@serialTest
def test_foo(self):
if hasattr(self.test_foo, "pytestmark"):
print("foo has attr and it is", self.test_foo.pytestmark)
print("foo")
@serialTest()
def test_bar(self):
if hasattr(self.test_bar, "pytestmark"):
print("bar has attr and it is", self.test_bar.pytestmark)
print("bar")
if __name__ == "__main__":
run_tests()
```
That will print
```
test_bar (__main__.MegaTest.test_bar) ... bar has attr and it is [Mark(name='serial', args=(), kwargs={})]
bar
ok
test_foo (__main__.MegaTest.test_foo) ... ok
----------------------------------------------------------------------
Ran 2 tests in 0.013s
```
Added assert that arg is boolean in the decorator to prevent such silent skips in the future
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157388
Approved by: https://github.com/clee2000
Description:
This PR fixes a small documentation typo in torch/_C/_distributed_c10d.pyi, correcting:
Intializes → Initializes
This helps improve clarity in internal docstrings for maintainers and contributors.
Let me know if further changes are needed. Thanks for your time and the amazing work on PyTorch!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157455
Approved by: https://github.com/Skylion007, https://github.com/malfet
Fixes#155006
Inductor sometimes codegens triton kernel definitions into a triple-quoted text block. If the text block itself contains triple-quotes, this breaks. Notably, this can happen for user-defined triton kernels, where the user may have added a docstring in their triton kernel.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157322
Approved by: https://github.com/zou3519, https://github.com/drisspg
`_torchdynamo_orig_callable` was being used in two distinct places:
- to get the original user function from nested eval_frame.py decorators
- to get the original backend from nested convert_frame.py callbacks
We rename ~the first usage to `_torchdynamo_orig_fn`~ and the second to `_torchdynamo_orig_backend` in order to distinguish these cases.
UPDATE: seems like both internal and OSS users depend on `_torchdynamo_orig_callable`, but it only seems in the first context. We should thus keep the original name for the first case then.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156901
Approved by: https://github.com/StrongerXi, https://github.com/jansel
Summary:
This change introduces 2 comm override APIs: `set_custom_all_gather` and `set_custom_reduce_scatter` to allow for custom behavior respectively.
This allow users to control how the comm buffers are allocated and the exact comm implementation for flexibility.
For details, see docstring in `Comm` in `_fsdp_api.py`
Related PR:
https://github.com/pytorch/pytorch/pull/150564
Test Plan: CI
Differential Revision: D75714362
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155189
Approved by: https://github.com/weifengpy
Summary: For priors like layer norm, the order of the weight quantization kernel might be different and therefore have a different suffix, so we use regular expression instead.
Test Plan:
Trying this on model id 737772166 with
```
buck2 run mode/opt mode/inplace -c fbcode.platform010_cuda_version=12 -c fbcode.nvcc_arch=h100 caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark -- --lower-backend=AOT_INDUCTOR --model-snapshot-id=737772166_0 --trace-aot-inductor-module=True --disable-acc-tracer=False --batch-size=1024 --node_replacement_dict "{'(autotune)':{'(1000+,1000+)':'fp8_float_model_dynamic_quantization_rowwise'}"
```
will allow more linears to be correctly replaced with fp8.
An example of the gpu trace can be found in https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/hpc/new/models/feed/benchmark/libkineto_activities_773108_f58b57e208c04787acd3bcb01a3e8771.json.gz&bucket=gpu_traces.
Rollback Plan:
Differential Revision: D76092551
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155722
Approved by: https://github.com/Skylion007
Fixes#155006
Inductor sometimes codegens triton kernel definitions into a triple-quoted text block. If the text block itself contains triple-quotes, this breaks. Notably, this can happen for user-defined triton kernels, where the user may have added a docstring in their triton kernel.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157322
Approved by: https://github.com/zou3519, https://github.com/drisspg
When we compute contiguity for a tensor with dynamic shapes we first:
1) Try to compute it without guarding.
2) If all shapes hinted, compute it with potentially adding guards.
3) if any input is not hinted, compute it symbolically.
sym_is_contiguous return a SymBool that is then either evaluated or guard_or_false can be called
on it to avoid data dependent errors.
ex:
bool is_contiguous = input.sym_is_contiguous().guard_or_false(__FILE__, __LINE__);
is_contiguous_or_false is a helper function that does that.
In this PR I only handle default contiguity, will follow up with changes for other formats like channel_last .
We use this patter in this PR for several locations to avoid DDEs.
Differential Revision: [D77183032](https://our.internmc.facebook.com/intern/diff/D77183032)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155590
Approved by: https://github.com/ezyang
Everything should go thru a generalized kernels, and Metal kernels should work with the same sizes and strides as CPU or CUDA backends to avoid problems with `torch.compile` that relies on the meta kernels to tell what its ouput going to look like.
To avoid returning tensors with different layout depending on whether upper parameter is true or false, templatize `factorDiagonalBlock`, `applyTRSM` and `applySYRK` to take upper/lower (actually row-wise vs column-wise) as template argument and call appropriate templates from host
TODOs:
- Rename upper parameter to something more sensible and add comments
- Use simd_groupsize instead of hardcoded 32 everywhere
Fixes https://github.com/pytorch/pytorch/issues/156658
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157014
Approved by: https://github.com/Skylion007, https://github.com/dcci
ghstack dependencies: #157179
Currently, every time we construct a GLOBAL_STATE guard, we always create a fresh guard based on the current global state. For precompile, we want to create a GLOBAL_STATE guard always based on some external sources, e.g. serialized global states. This can also be applied with the normal case where we just pass in the global state guard from Python.
Differential Revision: [D77400988](https://our.internmc.facebook.com/intern/diff/D77400988/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157285
Approved by: https://github.com/jansel
adding more information to the error message for debugging.
example error message:
```
Detected recompile when torch.compile stance is 'fail_on_recompile'. filename: 'caffe2/test/dynamo/test_misc.py', function name: 'fn', line number: 0
Failed on the following precompiled guards:
TREE_GUARD_MANAGER:
+- RootGuardManager
| +- LAMBDA_GUARD: isinstance(L['x'], bool)
GuardDebugInfo(
result=0,
verbose_code_parts=["isinstance(L['x'], bool)"],
num_guards_executed=1)
```
Differential Revision: [D76987126](https://our.internmc.facebook.com/intern/diff/D76987126/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156433
Approved by: https://github.com/jamesjwu
Fixes#149280. Follow up to #147966, but now available for ROCm.
Since hipblaslt does not support HIPBLASLT_MATMUL_DESC_CU_COUNT_TARGET, we instead create a hipStream that has a CU mask applied. We pass this masked stream to hipblaslt instead of pytorch's current stream. We ensure stream ordering between streams using hipEvents and stream synchronization.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149466
Approved by: https://github.com/malfet, https://github.com/atalman
Changes needed for ROCm7.0:
* `warpSize` is _not_ a compile-time constant on device-side compilation for ROCm anymore
* `warpSize` is _not_ defined on host-side compilation, hence `at::cuda::warp_size()` must be used to query warpsize at runtime
* Redefining `C10_WARP_SIZE` to be a compile-time constant, with a reasonable value for device-side compilation, but an unreasonable value of 1 for host-side compilation
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156979
Approved by: https://github.com/jeffdaily
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
The biggest bottleneck that we found with two-shot allreduce was that the compiler was serializing all the load operations for some reason. To avoid these load delays, we've added de-serialization of loads. Along with this improvement, we also found that on AMD GPUs a different block and thread size gives a nice performance boost. Here are the bandwidth numbers I am getting with this PR:

The rows that are green are the tensor sizes that we are interested in because two-shot is only used for bigger sizes (one-shot is used for smaller sizes). As we can see, our baseline numbers wrt to fbgemm numbers were consistently underperforming. However, with this deserialize change, most of the tensor sizes have a performance boost (positive %) for the green tensors. There's one tensor with negative performance, but that's within error margin.
co-authored by: @amd-hhashemi
https://github.com/pytorch/FBGEMM/issues/4072
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156746
Approved by: https://github.com/jeffdaily
Co-authored-by: Hashem Hashemi <hashem.hashemi@amd.com>
Generate source tarball with PEP 517 conform build tools instead of the custom routine in place right now.
Closes#150461.
The current procedure for generating the source tarball consists in creation of a source tree by manual copying and pruning of source files.
This PR replaces that with a call to the standard [build tool](https://build.pypa.io/en/stable/), which works with the build backend to produce an sdist. For that to work correctly, the build backend also needs to be configured. In the case of Pytorch, the backend currently is (the legacy version of) the setuptools backend, the source dist part of which is mostly configured via the `MANIFEST.in` file.
The resulting source distribution can be used to install directly from source with `pip install ./torch-{version}.tar.gz` or to build wheels directly from source with `pip wheel ./torch-{version}.tar.gz`; both should be considered experimental for now.
## Issues
### sdist name
According to PEP 517, the name of the source distribution file must coincide with the project name, or [more precisely](https://peps.python.org/pep-0517/#source-distributions), the source distribution of a project that generates `{NAME}-{...}.whl` wheels are required to be named `{NAME}-{...}.tar.gz`. Currently, the source tarball is called `pytorch-{...}.tar.gz`, but the generated wheels and python package are called `torch-{...}`.
### Symbolic Links
The source tree at the moment contains a small number of symbolic links. This [has been seen as problematic](https://github.com/pypa/pip/issues/5919) largely because of lack of support on Windows, but also because of [a problem in setuptools](https://github.com/pypa/setuptools/issues/4937). Particularly unfortunate is a circular symlink in the third party `ittapi` module, which can not be resolved by replacing it with a copy.
PEP 721 (now integrated in the [Source Distribution Format Specification](https://packaging.python.org/en/latest/specifications/source-distribution-format/#source-distribution-archive-features)) allows for symbolic links, but only if they don't point outside the destination directory and if they don't contain `../` in their target.
The list of symbolic links currently is as follows:
<details>
|source|target|problem|solution|
|-|-|-|-|
| `.dockerignore` | `.gitignore` | ✅ ok (individual file) ||
| `docs/requirements.txt` | `../.ci/docker/requirements-docs.txt` |❗`..` in target|swap source and target[^1]|
| `functorch/docs/source/notebooks` | `../../notebooks/` |❗`..` in target|swap source and target[^1]|
| `.github/ci_commit_pins/triton.txt` | `../../.ci/docker/ci_commit_pins/triton.txt` | ✅ ok (omitted from sdist)||
| `third_party/flatbuffers/docs/source/CONTRIBUTING.md` | `../../CONTRIBUTING.md` |❗`..` in target|omit from sdist[^2]|
| `third_party/flatbuffers/java/src/test/java/DictionaryLookup` | `../../../../tests/DictionaryLookup` |❗`..` in target|omit from sdist[^3]|
| `third_party/flatbuffers/java/src/test/java/MyGame` | `../../../../tests/MyGame` |❗`..` in target|omit from sdist[^3]|
| `third_party/flatbuffers/java/src/test/java/NamespaceA` | `../../../../tests/namespace_test/NamespaceA` |❗`..` in target|omit from sdist[^3]|
| `third_party/flatbuffers/java/src/test/java/NamespaceC` | `../../../../tests/namespace_test/NamespaceC` |❗`..` in target|omit from sdist[^3]|
| `third_party/flatbuffers/java/src/test/java/optional_scalars` | `../../../../tests/optional_scalars` |❗`..` in target|omit from sdist[^3]|
| `third_party/flatbuffers/java/src/test/java/union_vector` | `../../../../tests/union_vector` |❗`..` in target|omit from sdist[^3]|
| `third_party/flatbuffers/kotlin/benchmark/src/jvmMain/java` | `../../../../java/src/main/java` |❗`..` in target|omit from sdist[^3]|
| `third_party/ittapi/rust/ittapi-sys/c-library` | `../../` |❗`..` in target|omit from sdist[^4]|
| `third_party/ittapi/rust/ittapi-sys/LICENSES` | `../../LICENSES` |❗`..` in target|omit from sdist[^4]|
| `third_party/opentelemetry-cpp/buildscripts/pre-merge-commit` | `./pre-commit` |✅ ok (individual file)||
| `third_party/opentelemetry-cpp/third_party/prometheus-cpp/cmake/project-import-cmake/sample_client.cc` | `../../push/tests/integration/sample_client.cc` |❗`..` in target|omit from sdist[^5]|
| `third_party/opentelemetry-cpp/third_party/prometheus-cpp/cmake/project-import-cmake/sample_server.cc` | `../../pull/tests/integration/sample_server.cc` |❗`..` in target|omit from sdist[^5]|
| `third_party/opentelemetry-cpp/third_party/prometheus-cpp/cmake/project-import-pkgconfig/sample_client.cc` | `../../push/tests/integration/sample_client.cc` |❗`..` in target|omit from sdist[^5]|
| `third_party/opentelemetry-cpp/third_party/prometheus-cpp/cmake/project-import-pkgconfig/sample_server.cc` | `../../pull/tests/integration/sample_server.cc` |❗`..` in target|omit from sdist[^5]|
| `third_party/XNNPACK/tools/xngen` | `xngen.py` | ✅ ok (individual file)||
</details>
The introduction of symbolic links inside the `.ci/docker` folder creates a new problem, however, because Docker's `COPY` command does not allow symlinks in this way. We work around that by using `tar ch` to dereference the symlinks before handing them over to `docker build`.
[^1]: These resources can be naturally considered to be part of the docs, so moving the actual files into the place of the current symlinks and replacing them with (unproblematic) symlinks can be said to improve semantics as well.
[^2]: The flatbuffers docs already actually use the original file, not the symlink and in the most recent releases, starting from flatbuffers-25.1.21 the symlink is replaced by the actual file thanks to a documentation overhaul.
[^3]: These resources are flatbuffers tests for java and kotlin and can be omitted from our sdist.
[^4]: We don't need to ship the rust bindings for ittapi.
[^5]: These are demonstration examples for how to link to prometheus-cpp using cmake and can be omitted.
### Nccl
Nccl used to be included as a submodule. However, with #146073 (first released in v2.7.0-rc1), the submodule was removed and replaced with a build time checkout procedure in `tools/build_pytorch_libs.py`, which checks out the required version of nccl from the upstream repository based on a commit pin recorded in `.ci/docker/ci_commit_pins/nccl-cu{11,12}.txt`.
This means that a crucial third party dependency is missing from the source distribution and as the `.ci` folder is omitted from the source distribution, it is not possible to use the build time download.
However, it *is* possible to use a system provided Nccl using the `USE_SYSTEM_NCCL` environment variable, which now also is the default for the official Pytorch wheels.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152098
Approved by: https://github.com/atalman
This PR improves the parallelize_module API to support more corner cases:
1. if the plan entry specified as "", it should apply the style to the current module
2. if the plan entry does not have a corresponding submodule to apply, raise a warning and ignore this plan entry
As working on this PR, I also found that the while-loop inside is actually not necessary and could produce some nasty on the fly modifying while iterating behavior.. So I removed the while loop
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157182
Approved by: https://github.com/tianyu-l
Pytorch build is failing on power system from this commit ec24f8f58a74502c5a2488f5d9e85a817616dda0
***Build Failure Logs***
**Error related to mkldnn**
```
pytorch/aten/src/ATen/native/Blas.cpp:302:26: error: ‘cpuinfo_has_x86_amx_int8’ was not declared in this scope
302 | if ((!mixed_dtype && cpuinfo_has_x86_amx_int8()) ||
| ^~~~~~~~~~~~~~~~~~~~~~~~
pytorch/aten/src/ATen/native/Blas.cpp:303:25: error: ‘cpuinfo_has_x86_amx_fp16’ was not declared in this scope
303 | (mixed_dtype && cpuinfo_has_x86_amx_fp16())) {
| ^~~~~~~~~~~~~~~~~~~~~~~~
```
**Error related to vec256 complex float redefinition**
```
aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:19:7: error: specialization of ‘at::vec::DEFAULT::Vectorized<c10::complex<float> >’ after instantiation
19 | class Vectorized<ComplexFlt> {
| ^~~~~~~~~~~~~~~~~~~~~~
aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:19:7: error: redefinition of ‘class at::vec::DEFAULT::Vectorized<c10::complex<float> >’
aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:633:18: error: ‘const class at::vec::DEFAULT::Vectorized<c10::complex<float> >’ has no member named ‘abs_2_’
633 | auto abs_a = a.abs_2_();
| ^~~~~~
aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:634:18: error: ‘const class at::vec::DEFAULT::Vectorized<c10::complex<float> >’ has no member named ‘abs_2_’
634 | auto abs_b = b.abs_2_();
| ^~~~~~
/aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:666:17: error: ‘const class at::vec::DEFAULT::Vectorized<c10::complex<float> >’ has no member named ‘vec0’
666 | vec_add(a.vec0(), b.vec0()), vec_add(a.vec1(), b.vec1())};
aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:673:17: error: ‘const class at::vec::DEFAULT::Vectorized<c10::complex<float> >’ has no member named ‘vec0’
673 | vec_sub(a.vec0(), b.vec0()), vec_sub(a.vec1(), b.vec1())};
| ^~~~
aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:680:27: error: ‘const class at::vec::DEFAULT::Vectorized<c10::complex<float> >’ has no member named ‘vec0’
680 | vec_and(a.vec0(), b.vec0()), vec_and(a.vec1(), b.vec1())};
```
***With this changes build logs***
```
Building wheel torch-2.8.0a0+gita3098a7
-- Building version 2.8.0a0+gita3098a7
-- Checkout nccl release tag: v2.26.5-1
cmake -GNinja -DBLAS=OpenBLAS -DBUILD_PYTHON=True -DBUILD_TEST=True -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/home/avanish/OfficeWork2025/JuneWork/pytorch_5Jun/pack/torch_night_5Jun/pytorch/torch -DCMAKE_PREFIX_PATH=/home/avanish/OfficeWork2025/JuneWork/pyenv/pytorch_5Jun/lib/python3.12/site-packages -DPython_EXECUTABLE=/home/avanish/OfficeWork2025/JuneWork/pyenv/pytorch_5Jun/bin/python -DTORCH_BUILD_VERSION=2.8.0a0+gita3098a7 -DUSE_MKLDNN=ON -DUSE_MKLDNN_CBLAS=ON -DUSE_NUMPY=True -DUSE_OPENMP=ON /home/avanish/OfficeWork2025/JuneWork/pytorch_5Jun/pack/torch_night_5Jun/pytorch
cmake --build . --target install --config Release
running build_ext
-- Building with NumPy bindings
-- Not using cuDNN
-- Not using CUDA
-- Not using XPU
-- Using MKLDNN
-- Not using Compute Library for the Arm architecture with MKLDNN
-- Using CBLAS in MKLDNN
-- Not using NCCL
-- Building with distributed package:
-- USE_TENSORPIPE=True
-- USE_GLOO=True
-- USE_MPI=False
-- Building Executorch
-- Not using ITT
Copying functorch._C from functorch/functorch.so to /home/avanish/OfficeWork2025/JuneWork/pytorch_5Jun/pack/torch_night_5Jun/pytorch/build/lib.linux-ppc64le-cpython-312/functorch/_C.cpython-312-powerpc64le-linux-gnu.so
copying functorch/functorch.so -> /home/avanish/OfficeWork2025/JuneWork/pytorch_5Jun/pack/torch_night_5Jun/pytorch/build/lib.linux-ppc64le-cpython-312/functorch/_C.cpython-312-powerpc64le-linux-gnu.so
building 'torch._C' extension
creating build/temp.linux-ppc64le-cpython-312/torch/csrc
```
This patch will fix the pytorch build issue on power, and i am able to build successfully.
Hi @malfet @albanD
Please review this PR for pytorch build issue that we are observing on power.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155255
Approved by: https://github.com/albanD, https://github.com/malfet
This PR relaxes the device_mesh argument constraint in the local_map API. The current restriction is too strict, i.e. all the input arguments must have the same device mesh if they are DTensors. But many times user might want to pass in DTensors to this function that lives on different device mesh, i.e. weight and activation could live in different device mesh.
When using the local_map, we are extracting the local tensors from DTensors, and as long as the placements user specified match with the actual DTensor placements, user knows clearly that the inputs are intended to live in different mesh. So this PR removes the same mesh check and update doc to clearly document the behavior.
The `device_mesh` argument now serves for a main purpose, allow user to specify the device_mesh for the output DTensor reconstruction
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157049
Approved by: https://github.com/Chillee, https://github.com/zpcore
With Triton main things were failing with:
```py
File "/home/jansel/pytorch/torch/_inductor/codecache.py", line 205, in get_system
from triton.compiler.compiler import triton_key
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
ImportError: cannot import name 'triton_key' from 'triton.compiler.compiler' (/home/jansel/pytorch/triton/compiler/compiler.py)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157242
Approved by: https://github.com/aorenste
This PR makes the necessary changes in order to upgrade PyTorch DLPack
support to version 1.0. In summary, we add support for the following:
- Support both `DLManagedTensor` and `DLManagedTensorVersioned` when
producing and consuming DLPack capsules
- New parameter for `__dlpack__` method: `max_version`
- Version checks:
- Fallback to old implementation if no `max_version` or if version
lower than 1.0
- Check that the to-be-consumed capsule is of version up to 1.X
In order to accommodate these new specifications, this PR adds the
following main changes:
- `torch._C._to_dlpack_versioned` Python API (Module.cpp): new Python
API for creating a versioned DLPack capsule (called by `__dlpack__`
method)
- `DLPackTraits<T>` class (DLConvertor.h): select the correct
traits (e.g. capsule name, conversion functions) depending on which
DLPack tensor class is being used
- `toDLPackImpl<T>` function (DLConvertor.cpp): populates the
common fields of both classes
- `fromDLPackImpl<T>` function (DLConvertor.cpp): constructs a tensor
from a DLPAck capsule
- `fillVersion<T>` function (DLConvertor.cpp): populates the version
field for `DLManagedTensorVersioned` (no-op for `DLManagedTensor`)
- `tensor_fromDLPackImpl<T>` function (tensor_new.cpp): outer function
for constructing a tensor out of a DLPack capsule that also marks the
capsule as used
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145000
Approved by: https://github.com/albanD
Summary: Since we increment the counter after performing the callback, it leads to the assertion error when callback raises an error and increment never happens. Let's increment first to avoid it.
Test Plan:
tba
Rollback Plan:
Differential Revision: D77475650
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157185
Approved by: https://github.com/xmfan
Summary:
D76832520 switched Executorch to use the caffe c10 headers. This copy contains a shadow, which is treated as an error for certain embedded compile flows.
Simple rename to avoid.
Test Plan:
CI
Rollback Plan:
Differential Revision: D77446104
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157107
Approved by: https://github.com/Skylion007
Summary: pretty simple. if planner exists, which implies that planning is enabled, create a manager for each frame. the associated serial executor will use the withMemoryPlannner fn to ensure the deallocation is done after execution completes.
Test Plan: CI
Differential Revision: D73635809
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157053
Approved by: https://github.com/henryoier, https://github.com/georgiaphillips
Summary: Fixes a gap in the Triton update where the traverse would break because `get_tma_stores` didn't handle both TMA APIs.
Test Plan:
`buck test -m ovr_config//triton:beta 'fbcode//mode/dev-nosan' fbcode//ads_mkl/ops/tests:gdpa_dcpp_test -- --exact 'ads_mkl/ops/tests:gdpa_dcpp_test - test_gdpa_dcpp (ads_mkl.ops.tests.gdpa_dcpp_test.GdpaDCPPTest)'`
Rollback Plan:
Differential Revision: D77501582
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157212
Approved by: https://github.com/davidberard98
Fixes https://github.com/pytorch/pytorch/issues/155046
This change allows Cholesky inversion to use rocSOLVER. This is now also the default on ROCm for Cholesky inversion which aligns with the behavior on NVIDIA (which defaults to cuSOLVER for this linear algebra operation). This fix also gets around a memory access fault encountered in MAGMA for large matrices.
MAGMA can still be forced on ROCm by doing:
```
torch.backends.cuda.preferred_linalg_library(backend='magma')
```
Ran all Cholesky UT on ROCm and there were no regressions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157154
Approved by: https://github.com/jeffdaily
Creating contiguous strides creates an expression max(1, x). Often we know that x >= 1, in
which case we should simplify max(1, x) to x.
This appeared in two situations:
1) An internal user complained about statically_known_true(x == max(1, x)) failing (internal link: https://fb.workplace.com/groups/1028545332188949/permalink/1232958568414290).
This https://github.com/pytorch/pytorch/pull/155938 won't be needed with this.
3) Not simplifying the above could result in wrong ConstraintViolationErrors.
Because we assume non-trival single arg guards shall evaporate see the logic in the function
issue_guard in symbolic_shapes.py
with this change we longer throw ConstraintViolationErrors with the program bellow
this is blocking landing this [PR](https://github.com/pytorch/pytorch/pull/155590) from landing
internally. Due to internal export tests throwing ConstraintViolationErrors.
like
```
Constraints violated (width)!
- Not all values of width = L['x'].size()[3] in the specified range 224 <= width <= 455 satisfy the generated guard max(1, 1 + (((-1) + L['x'].size()[3]) // 2)) == (1 + (((-1) + L['x'].size()[3]) // 2)).
````
```
x = torch.rand(10)
torch._dynamo.mark_dynamic(x, 0, max=20, min=5)
@torch.compile(fullgraph=True, dynamic=True)
def func(x):
if max(1, (-1 + x.size()[0]//2)) == (-1+x.size()[0]//2):
return x*400
else:
return (x*10)*100
func(x)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157189
Approved by: https://github.com/pianpwk
Similar to cublas/hipblas, LT now allocates one workspace per handle+stream combo.
- fixes hipblaslt issue where memory use increased during graph capture
- preserves CUDA env var TORCH_CUBLASLT_UNIFIED_WORKSPACE
- moves LT workspace and size from CUDABlas.cpp into CublasHandlePool.cpp, new APIs
- size_t getCUDABlasLtWorkspaceSize()
- void* getCUDABlasLtWorkspace()
Fixes https://github.com/ROCm/pytorch/issues/2286.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156495
Approved by: https://github.com/eqy
Really, pytorch shoudn't be messing with basic _global_ cmake configuration like this, but without a careful analysis what all depends on this behaviour, I'm not confident to propose a change.
But at least notifying the user that something wonky is going on seems like a good idea.
@drisspg
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156123
Approved by: https://github.com/drisspg, https://github.com/msaroufim
Co-authored-by: Mark Saroufim <marksaroufim@meta.com>
`index_put` with a boolean mask (`target[mask] = src`) causes a `cudaStreamSynchronize`. When both `mask` and `target` tensors are on GPU this is expected.
However, the sync can be prevented if the `mask` is a CPU tensor.
Internally a new index tensor is created with `mask.nonzero()` so we can use a non-blocking copy to transfer it to the GPU since it cannot be accidentally mutated by the user between its creation and the device copy. @ngimel Let me know if I'm missing something.
I think this is useful since users can't prevent a sync simply by making sure all tensors are on the same device as with other ops. Instead one would need to do something like this which is much less readable
```python
indices = mask.nonzero().squeeze(1).to("cuda", non_blocking=True)
target[indices] = src
```
Fixes#12461
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156384
Approved by: https://github.com/ngimel
We notice model code contains indexing syntax like [nanogpt model code](f144fe9095/torchbenchmark/models/nanogpt/model.py (L240)), which causes training fail in the backward pass when using DTensor.
In the code, `x = x[:, [-1], :]` calls the index op and in the backward pass, it will trigger `aten.index_put.default` with the second argument to be of type `torch::List<std::optional<Tensor>>`, e.g., `[None, tensor([-1], device='cuda:0')]`. We are unable to unwarp the op info into Dtensor based on the current logic [here](2625c70aec/torch/distributed/tensor/_dispatch.py (L339-L358)). We need to set runtime_schema_info for the op and enable needs_pytree to support the conversion of tensor list arg.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156240
Approved by: https://github.com/wanchaol
Summary:
With the way these were written, any string literals that were being passed in, like `__func__`, were only ever passed down as a `const char*`, so this switches it over to take a `std::string_view` at the deepest part.
This also has the side effect of allowing `std::string_view` to be passed to the `RECORD_FUNCTION` macros as well.
Test Plan:
contbuilds
Rollback Plan:
Differential Revision: D74681042
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153567
Approved by: https://github.com/Skylion007, https://github.com/swolchok
Instead of skipping the whole test as the CUPTI team figures out what is wrong, let's temporarily skip the profiler check portion. It is high pri to add it back to ensure foreach ops are actually performant.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156871
Approved by: https://github.com/albanD
ghstack dependencies: #156876
Fixed error message:
On main:
```
KeyError: ("Invalid mesh_dim_names ('dp_shard', 'dp_shard') specified. ", 'Found mesh dim indices to slice: [(1,), (1,)]. ', 'Mesh dim indices should be in ascending order.')
```
On PR:
```
KeyError: Invalid mesh_dim_names ('dp_shard', 'dp_shard') specified. Found mesh dim indices to slice: [(1,), (1,)]. Mesh dim indices should be in ascending order.'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157096
Approved by: https://github.com/Skylion007
This PR rewrite how load balancing and sharding works in the current
context parallel implementation.
Why the changes? We should NOT expose another layer of "sharding"
concept as it would confuse the user about its difference with DTensor
sharding. The current CP perform sharding weirdly simply because it
mixed the concept of load balancing and sharding.
I think load balancing and sharding need to be decoupled to separate
layers:
* The load balancing layer is responsible to reorder the input sequence
so that the attention computation are evenly balanced across rows/ranks.
* Sharding is a separate layer after it, it simply take the input reordered by
the load balancer and shard it exactly as how DTensor shard tensor sequentially
In this PR:
* I removed the "Sharder" and "LoadBalancer" mixed usage, and
simply generate a roundrobin indices when the mask is a casual mask
* use `distribute_tensor` to perform the sharding. We still keep the local
shard instead of the DTensor objects to allow maximum compatibility with
arbitrary model architecture given DTensor op coverage is not high
enough.
One alternative design is to still keep the LoadBalancer and add the indices
generation and restore to be the protocol of the LoadBalancer. I thought through
it and think we might want to directly expose the load_balancing indices as
an argument instead of a dedicated class interface, so I removed it here. More
discussion on this is welcomed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155442
Approved by: https://github.com/XilunWu
ghstack dependencies: #155441
as titled, I'm working on a series of changes to make ring attention
impl and DTensor works better together, this PR specifically refactor the
current implemtnation to:
* remove dead/unused code
* restructure the functions to make them stay organized
* refactor to remove/make error message better
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155441
Approved by: https://github.com/fegin
Differential Revision: D77249427
Due to memoization and graph order update, it can happen that a backed symbol is passed into compute_unbacked_bindings and lead to failure. An example as follow:
- There are 2 boolean indexing operators (e.g. op1 and op2) with the same mask.
- A unbacked symint is generated from op1, and then op2 reuses the unbacked symint due to a nonzero_memo in nonzero's fake implementation and no rebinding is needed for op2.
- Since op1 generated the unbacked symint, its meta has "unbacked_bindings" field filled and op2's meta doesn't have it.
- Output from op1 and op2 are later concated with others with backed symint, so that the unbacked symint can be replaced by a backed symint.
- In Inductor, during fake tensor prop, there is no memoi because new fake tensor is always generated (for the same node). op1 generates an unbacked symint and the unbacked can be rebound successfully to the backed symint. Since there is no memoi, op2 also generates a new unbacked symint, but no rebinding can happen because op2's meta doesn't have "unbacked_bindings". And "compute_unbacked_bindings/_rename_unbacked_to" fails to assert op2's old symbol to be unbacked.
From discussion with [@ezyang](https://www.internalfb.com/intern/profile/?id=503862770), there is no easy way to fix this issue.
- We can try to enable memoization for fake tensor prop in Inductor, however, we need to ensure that op1 is visited before op2 during Inductor fake tensor prop for this to work (op2's meta doesn't have "unbacked_bindings" so no rebinding can happen and we need to do rebinding from op1. But there are passes such as reorder_for_locality that can change the graph order so this doesn't work.
- A simple hack is to just replace the unbacked symbol in op2 by the backed symbol.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156911
Approved by: https://github.com/ezyang
Summary:
Add MTIA_INSIGHT to kMtiaTypes in kineto_shim.cpp
For insight, user can use MTIA_INSIGHT_VERBOSE_TRACES=0 to disable the profiler. So, we can enable it by default
Test Plan:
{F1979756361}
When the environment var isn't set, it uses 0.
Rollback Plan:
Differential Revision: D77315882
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156853
Approved by: https://github.com/sraikund16
This PR adds a new config `backward_pass_autocast`, to set the backward autocast
behavior. It does not change the existing behavior.
The reason why we need this is that torch.compile acquires a forward and
backward graph at the time of the forward pass. This means that
implemented naively, if there are any context managers active outside
the call to torch.compile, the backward graph will also get the
behaviors from those context managers. This PR gives users a way to
tweak the autocast behavior of the backward pass.
Please see torch._functorch.config for the options to the
`backward_pass_autocast` config.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156356
Approved by: https://github.com/bdhirsh
ghstack dependencies: #155354
The dtype documentation has not been updated in awhile, let's do a revamp.
1. combine the duplicated docs for dtypes from `tensors.rst` and `tensor_attributes.rst` to live in `tensor_attributes.rst`, and link to that page from `tensors.rst`
2. split the dtype table into floating point and integer dtypes
3. add the definition of shell dtype
4. add the float8 and MX dtypes as shell dtypes to the dtype table
5. remove legacy quantized dtypes from the table
6. add the definition of various dtype suffixes ("fn", etc)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156087
Approved by: https://github.com/albanD
Fixes#136662
There are two problems:
1) canonicalize_view_scatter_ops adds some new nodes into the graph.
These new nodes cause the alias info on the graph to be wrong. To fix
this, we try to run FakeTensorUpdater on the graph again.
2) FakeTensorUpdater's alias information is wrong. It tries to skip
nodes that it thinks have "equivalent" FakeTensor metadata.
It should not be allowed to do this if any users of the node can
alias the node. The example
is if we have `x = foo(...); y = x.view(...)`. If the user replaces
`foo` with a new `bar` node and sets bar.meta["val"] correctly, then
FakeTensorUpdater still needs to update y's meta["val"] to be a view
of the new bar node.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152011
Approved by: https://github.com/yf225
When we compute contiguity for a tensor with dynamic shapes we first:
1) Try to compute it without guarding.
2) If all shapes hinted, compute it with potentially adding guards.
3) if any input is not hinted, compute it symbolically.
sym_is_contiguous return a SymBool that is then either evaluated or guard_or_false can be called
on it to avoid data dependent errors.
ex:
bool is_contiguous = input.sym_is_contiguous().guard_or_false(__FILE__, __LINE__);
is_contiguous_or_false is a helper function that does that.
In this PR I only handle default contiguity, will follow up with changes for other formats like channel_last .
We use this patter in this PR for several locations to avoid DDEs.
Differential Revision: [D77183032](https://our.internmc.facebook.com/intern/diff/D77183032)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155590
Approved by: https://github.com/ezyang
## Summary
Moved the Triton-specific NVSHMEM tests in `test_nvshmem.py` into a dedicated `test_nvshmem_triton.py` file. Also put the shared Triton JIT kernels at the top-level of new file for reusability.
## Testing
```bash
TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py
TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem_triton.py
```
All 16 original tests pass with no functionality changes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156685
Approved by: https://github.com/mandroid6, https://github.com/kwen2501
ghstack dependencies: #156684
Tests: `python test/inductor/test_aot_inductor.py -vvv -k device_tma`
Device-side TMA in Triton allows the kernel author to construct the TMA descriptor on the device (which composes with things like autotuning much better). However, it also requires a scratch space to be provided into which the TMA descriptor will be constructed. In the new TMA API (tl.make_tensor_descriptor), this is implemented using a "global scratch space" - a tensor which is allocated beforehand and then passed in as an argument for the kernel.
To support this in AOTI, this PR:
* records the global scratch space needed (triton_heuristics.py), so that it can be used during AOTI codegen
* allocates global scratch, if needed (cuda/device_op_overrides.py)
* plumbs `device_idx_` into the triton caller function, so that global scratch can be allocated on the right device)
* updates tests to verify this works for dynamically shaped inputs
This PR should support both inductor-generated device-side TMA (e.g. persistent TMA mm) and user-defined triton kernels that contain device-side TMA (which is the test I ran to verify this works)
Note: this overrides any user-provided allocator function (typically with eager triton code, the user must provide their own custom allocator function that is used to allocate scratch space).
For Meta reviewers, here is a tlparse from running `python test/inductor/test_aot_inductor.py -vvv -k test_triton_kernel_on_device_tma_dynamic_True_tma_version_new_cuda` https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpFg13g1/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000
Differential Revision: [D77352139](https://our.internmc.facebook.com/intern/diff/D77352139)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155896
Approved by: https://github.com/desertfire
**Summary**
Fix the performance regression of `functorch_maml_omniglot` in TorchBench. The issue reported in [#151523](https://github.com/pytorch/pytorch/issues/151523) occurs only when a parallel reduction is performed under the vectorized loop and a scalar kernel is used for the tail loop. Previously, we addressed this regression in [#151887](https://github.com/pytorch/pytorch/pull/151887) by disabling all cases where a parallel reduction occurs under the vectorized loop. However, for `functorch_maml_omniglot`, we found that a masked vector kernel is used in the tail loop instead of the scalar kernel in the job of `inductor_torchbench_cpu_smoketest_perf`. In this PR, we refine the fix by excluding the cases where a masked vector kernel is used in the tail loop, rather than disabling all such scenarios.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156526
Approved by: https://github.com/CaoE
Fixes#150951
Summary:
For complex.pow(2) on GPU:
Uses complex * complex directly.
Produces results consistent with CPU implementation.
Eliminates spurious imaginary components for real inputs.
🧪 Tests
Added unit tests to verify correctness of the new kernel path.
Verified numerical consistency with CPU results.
This change is backward-compatible and only affects the specific case of pow(2) on complex tensors on GPU.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152373
Approved by: https://github.com/ezyang
Calling `at::native::nansum_out` causes the fake kernel to dispatch to a
`make_reduction` call and then segfaults later due to the
`mutable_data_ptr` call in `TensorIteratorBase::build`. It also causes
fake tensor propagation issue in Dynamo. The added tests demonstrate the
aforementioned 2 issues.
This patch fixes it by dispatching to `at::nansum_out` instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156642
Approved by: https://github.com/zou3519
Summary:
cuBLAS used to have strict alignment requirements for TF32 usage, even if TF32 was enabled by users; this caused a numeric SEV in the past, when Triton would use TF32 even if cuBLAS could not due to failing the alignment checks
we believe that cuBLAS no longer has alignment requirements for TF32 usage, based on some testing in D77265581; we'd like to deprecate `force_same_precision` since it no longer functions as expected
changing the default to False in fbcode, guarded by a jk so that we can quickly revert to the original behavior if needed
Test Plan:
CI
Rollback Plan:
Differential Revision: D77265930
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156789
Approved by: https://github.com/jhadidjojo, https://github.com/masnesral
`_torchdynamo_orig_callable` was being used in two distinct places:
- to get the original user function from nested eval_frame.py decorators
- to get the original backend from nested convert_frame.py callbacks
We rename the first usage to `_torchdynamo_orig_fn` and the second to `_torchdynamo_orig_backend` in order to distinguish these cases.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156901
Approved by: https://github.com/StrongerXi, https://github.com/jansel
ghstack dependencies: #156527
This should prevent bad resume function prologues from slipping by. In particular, graph breaks in resume function prologues will now hard error.
Implementation details:
- The resume function prologue is surrounded by `LOAD_CONST arg, STORE_FAST __is_tracing_resume_prologue` instructions. The first sequence has `arg=True` and the second sequence has `arg=False`.
- InstructionTranslator will know when it is tracing a resume function prologue when it detects `STORE_FAST __is_tracing_resume_prologue`. The top of stack will be True to mark the start of the prologue, False to mark the end.
- When `convert_frame.py` detects that an error occurred while the InstructionTranslator was tracing a resume function prologue, we will wrap the exception and hard error
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154564
Approved by: https://github.com/jansel
ghstack dependencies: #154283, #154289, #154782, #156762, #155166
See added test for the case that this PR handles. In particular, the semantics for nested torch.compile with toggled fullgraph settings was strange before - `@torch.compile(fullgraph=True)` overrides the existing fullgraph setting, while `@torch.compile(fullgraph=False)` does not.
Note that this change will add an extra frame to any inlined torch.compile'd function (which I don't expect to happen frequently).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155166
Approved by: https://github.com/jansel
ghstack dependencies: #154283, #154289, #154782, #156762
- Make the fullgraph argument of set_fullgraph a positional argument
- Fix behavior on nested calls by updating `tracer.error_on_graph_break` in more places. In particular, a tracer's error_on_graph_break is set to the inlined tracer's error_on_graph_break upon the latter's exit. We also track error_on_graph_break in the speculation log now, since if we encounter a nested graph break, we will restart analysis and we need to somehow remember the error_on_graph_break setting after attempting to run the nested function (but we don't actually trace into it in the restart analysis).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154782
Approved by: https://github.com/jansel
ghstack dependencies: #154283, #154289
Implements https://github.com/pytorch/pytorch/issues/144908.
Implementation notes:
- `set_fullgraph` is implemented using `patch_config`, which changes config correctly during runtime and tracing.
- Moved setting `config.error_on_graph_break` from convert_frame.py to eval_frame.py. This is because this should only be done at the top-level decorated function. If we kept this in convert_frame.py, we would be changing `config.error_on_graph_break` on every top-level frame, which causes confusing behavior (see added test for example).
- InstructionTranslator reads from `config.error_on_graph_break` every `step()`. This is to determine the value of `config.error_on_graph_break` at the time of the graph break, because tracer cleanup will restore the value of `config.error_on_graph_break` .
- `convert_frame.py` determines whether we should abort tracing (fullgraph=True) or continue (fullgraph=False) by reading the value of the tracer's `error_on_graph_break`. If there is no tracer (failed to initialize), then default to reading `config.error_on_graph_break`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154289
Approved by: https://github.com/jansel, https://github.com/zou3519
ghstack dependencies: #154283
`torch.compile` now always goes through `torch._dynamo._optimize`. fullgraph is now implemented in `torch.compile` by looking at `config.error_on_graph_break`. Export still goes through `torch._dynamo._optimize_assert`, which uses `tx.one_graph` instead of `config.error_on_graph_break`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154283
Approved by: https://github.com/jansel, https://github.com/anijain2305
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):
Today the only way to choose allocation backend is via env `TORCH_SYMMMEM=...`.
This is a bit hard to set in CI on test file basis. (The env has to be set before program is loaded).
This PR added a programmatic way -- a `set_backend` API.
Implementation:
Since this API is slightly more dynamic than static registration, at static time each backend registers its availability rather than filling itself as **the** allocator directly. Later when `set_backend` is called, the allocator would actually fill in the device-to-allocation `map_`.
Though added, `set_backend` is **not** a necessary API for user to call -- one backend is still registered as the default at static time.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156661
Approved by: https://github.com/ngimel, https://github.com/fduwjj
Summary:
Encountered 'register_foward_pre_hook not supported on ScriptModule' error when trying to publish CFR MTML with placing remote_ro module in remote. Issue may come from the fact that the local net from torchArrow is already scriptModule before gen_app_graph pass.
{F1979770267}
Test Plan:
hg checkout 1ff14dfaade4ac1f3cbbf38fbd72f7fdd5cdcd16
bash hstu_blocker.sh
Rollback Plan:
Reviewed By: RenfeiChen-FB
Differential Revision: D77341370
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156904
Approved by: https://github.com/jingsh
## Summary
This PR removes unnecessary `dist.barrier` calls up in our Triton NVSHMEM test suite and adds signal_op support, which is a lightweight device-side signaling mechanism. Added test for this in our `wait_until` kernel and corresponding `core.extern` wrapper.
**Why did we drop the `dist.barrier()` calls?**
We dropped the host‐side dist.barrier() in all Triton NVSHMEM tests (except the raw put/get cases) because every other test already uses NVSHMEM collectives or device‐side sync primitives (fence/quiet/signal/wait), making the extra barrier redundant. This keeps synchronization entirely on the GPU and leverages NVSHMEM’s native ordering guarantees for clearer, more efficient tests.
**`test_triton_wait_until` update**
- **Rank 1**: after `put_kernel` writes the data, launches `signal_op_kernel` to atomically set Rank 0's flag via `nvshmemx_signal_op`
- **Rank 0**: drops its old `dist.barrier()` and simply calls `wait_until_kernel` to spin-wait on the device flag, then asserts data correctness
- Changes made per [this comment](https://github.com/pytorch/pytorch/pull/156472#discussion_r2159734046)
## Testing
```bash
TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156684
Approved by: https://github.com/kwen2501, https://github.com/mandroid6
unbind will always specialize on dim, because it determine the number of output tensors.
guard_size_oblivious is not useful there and more confusing probably for code readers
added a comment and a test that verifies the specialization.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148815
Approved by: https://github.com/pianpwk
Summary:
Sample error message:
```
RuntimeError: Failed to find a generated cpp file or so file for model 'forward' in the zip archive.
Available models in the archive:
model
To load a specific model, please provide its name using the `model_name` parameter when calling AOTIModelPackageLoader() or torch._inductor.package.load_package.
The following files were loaded from the archive:
c7l7jkswdq7ud6gpvpmunx76hi3c357l7epyc7oofeemzeoy7euo.wrapper/data/aotinductor/model/cqdxv6zki2oiiytjeqrg774uxlxgqdemhdxn5dycn4nnc3rmcd7w.cubin
c7l7jkswdq7ud6gpvpmunx76hi3c357l7epyc7oofeemzeoy7euo.wrapper/data/aotinductor/model/c7l7jkswdq7ud6gpvpmunx76hi3c357l7epyc7oofeemzeoy7euo.wrapper.cpp
c7l7jkswdq7ud6gpvpmunx76hi3c357l7epyc7oofeemzeoy7euo.wrapper/data/aotinductor/model/ctmp7adn3spwyscdotllyj4yx3vrqcnxk3thkpgdcax7zvqmyyp3.kernel.cpp
c7l7jkswdq7ud6gpvpmunx76hi3c357l7epyc7oofeemzeoy7euo.wrapper/data/aotinductor/model/c7l7jkswdq7ud6gpvpmunx76hi3c357l7epyc7oofeemzeoy7euo.wrapper_metadata.json
c7l7jkswdq7ud6gpvpmunx76hi3c357l7epyc7oofeemzeoy7euo.wrapper/data/aotinductor/model/ctmp7adn3spwyscdotllyj4yx3vrqcnxk3thkpgdcax7zvqmyyp3.kernel_metadata.json
c7l7jkswdq7ud6gpvpmunx76hi3c357l7epyc7oofeemzeoy7euo.wrapper/data/aotinductor/model/c7l7jkswdq7ud6gpvpmunx76hi3c357l7epyc7oofeemzeoy7euo.wrapper.so
c7l7jkswdq7ud6gpvpmunx76hi3c357l7epyc7oofeemzeoy7euo.wrapper/archive_format
c7l7jkswdq7ud6gpvpmunx76hi3c357l7epyc7oofeemzeoy7euo.wrapper/archive_version
c7l7jkswdq7ud6gpvpmunx76hi3c357l7epyc7oofeemzeoy7euo.wrapper/.data/version
c7l7jkswdq7ud6gpvpmunx76hi3c357l7epyc7oofeemzeoy7euo.wrapper/byteorder
c7l7jkswdq7ud6gpvpmunx76hi3c357l7epyc7oofeemzeoy7euo.wrapper/.data/serialization_id
```
Test Plan:
```
buck2 run @//mode/dev-nosan //caffe2/test/inductor:aot_inductor_package -- -r "test_loading_wrong_model"
```
Rollback Plan:
Differential Revision: D77320485
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156863
Approved by: https://github.com/tugsbayasgalan
Original issue: https://github.com/pytorch/pytorch/issues/154820
The issue happens when there is a mutation for the same input in forward AND in backward.
AOTD emited copy_ after joint_function tracing. This made this fx-node to correspond to the side effects of both mutations (in forward and in backward).
After that partitioner can put it either in forward or in backward.
The fix:
1/ Introduce joint_function.handle that allows to set "post_forward" callback, to be able to check inputs state after forward
We do not want to apply the mutation after joint, if we already applied it in forward. For that we need "mutation_counter" and memorize the version of mutation that we applied for forward mutation.
2/ Exposing mutation_counter to python
We want to keep invariant that copy_ exist only in the end of joint graph.
3/ We memorize mutation_counter and state of the inputs after forward, using the handle post_forward.
Emit post_forward mutations after joint graph fully traced.
add for post_forward mutations "must_be_in_forward" tag (similar to existing "must_be_in_backward") to keep them in forward.
4/ Ban recompute of the source of mutation. Recompute can apply the same op (e.g. add) in forward and backward.
For this set MUST_SAVE for the source of mutation in forward.
proxy_tensor changes:
By default proxy tensor updates tensor_tracker. In this case applied mutations will be chained.
But we want that this copy_ will be independent and applied just to primals.
For this introducing a contextmanager to be able to disable update of tensor_tracker for adding forward mutations.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155354
Approved by: https://github.com/bdhirsh
Fixes#136662
There are two problems:
1) canonicalize_view_scatter_ops adds some new nodes into the graph.
These new nodes cause the alias info on the graph to be wrong. To fix
this, we try to run FakeTensorUpdater on the graph again.
2) FakeTensorUpdater's alias information is wrong. It tries to skip
nodes that it thinks have "equivalent" FakeTensor metadata.
It should not be allowed to do this if any users of the node can
alias the node. The example
is if we have `x = foo(...); y = x.view(...)`. If the user replaces
`foo` with a new `bar` node and sets bar.meta["val"] correctly, then
FakeTensorUpdater still needs to update y's meta["val"] to be a view
of the new bar node.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152011
Approved by: https://github.com/yf225
Based on the [conversation](https://github.com/pytorch/pytorch/issues/121791), we plan to drop the "highest, high, medium" to represent fp32 internal computation data types . Instead, we will directly use the algorithm to represent it.
### Design Choice: Directly use algorithms name like "TF32", "BF16".
#### Pros
- The names are more informative. 'tf32' is more informative than a simple "high".
- Easier to extend new algorithm like `tf32x3`
#### Cons
- "HIGHEST, HIGH, MEDIUM" indicated the relative precision between different algorithms. However, we can have more documents to discuss them.
### We provide a layered structure for backends/operators.
('f32' is short for 'fp32_precision')

### We provide 3 fp32 compute precision can be set:
- **"ieee"**: Not allowed to use any other internal computation data types .
- **"tf32"**: Allowed to use tf32 as internal computation data types.
- **"bf16"**: Allowed to use bf16 as internal computation data types.
- **"none"**: Precision's are not set. Can be override by its father node.
### Overriding Precision Settings
Child node can be override by its father node if it is set to default.
For current default settings:
```
backend = generic, op = all, precision setting = none
backend = cuda, op = all, precision setting = none
backend = cuda, op = conv, precision setting = tf32
backend = cuda, op = rnn, precision setting = tf32
backend = cuda, op = matmul, precision setting = none
backend = matmul, op = all, precision setting = none
backend = matmul, op = conv, precision setting = none
backend = matmul, op = rnn, precision setting = none
backend = matmul, op = matmul, precision setting = none
```
- If the user set `torch.backends.mkldnn.fp32_precision="bf16"`, his child nodes `torch.backends.mkldnn.matmul.fp32_precision` / `torch.backends.mkldnn.conv.fp32_precision` / `torch.backends.mkldnn.rnn.fp32_precision` will also be override to "bf16".
- If the user set `torch.backends.fp32_precision="bf16"`, `torch.backends.mkldnn.fp32_precision` and his child nodes will also we override to "bf16".
### Backward Compatible
Since new API allow user to have more fine-grained control. There will be some conflict. For example, previous `torch.backends.cudnn.allow_tf32` are not enough to represent the status for `torch.backends.cudnn.rnn.fp32_precision="ieee"` and `torch.backends.cudnn.conv.fp32_precision="tf32"`. Therefore, our goal for backward compatible is
- If the user only uses previous APIs, it will work as previous expectations.
- If the user use **new** API to change the status to an **un-representable** status for old API, and try to access the status by **old** API. We will raise Runtime Error and point the document for user.
### Test Plan
```
python test/test_cuda.py -k test_fp32_precision_with_tf32
python test/test_cuda.py -k test_fp32_precision_with_float32_matmul_precision
python test/test_cuda.py -k test_invalid_status_for_legacy_api
python test/test_mkldnn.py -k test_mlkdnn_get_set
python test/test_mkldnn.py -k test_generic_precision
python test/test_mkldnn.py -k test_invalid
python test/test_mkldnn.py -k test_default_use_parent
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125888
Approved by: https://github.com/jgong5, https://github.com/albanD
Co-authored-by: Jiang, Yanbing <yanbing.jiang@intel.com>
We discovered that when importing latest 12.9 arm nightly wheel, it is missing the NCCL lib. With the use of USE_SYSTEM_NCCL=1, we need to copy the libnccl.so lib into our big wheel environment, so that it can be dynamically linked at runtime.
https://github.com/pytorch/pytorch/pull/152835 enabled USE_SYSTEM_NCCL=1, which would use the system NCCL by default, and it would no longer use the one built from libtorch_cuda.so. With this PR, we add back the libnccl.so to be used at runtime. In this way, we also provide the flexibility to use different versions of NCCL from what came with the original pytorch build.
related - https://github.com/pytorch/pytorch/issues/144768
```
Python 3.12.3 (main, Jun 18 2025, 17:59:45) [GCC 13.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.12/dist-packages/torch/__init__.py", line 417, in <module>
from torch._C import * # noqa: F403
^^^^^^^^^^^^^^^^^^^^^^
ImportError: libnccl.so.2: cannot open shared object file: No such file or directory
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156888
Approved by: https://github.com/atalman
Fixes https://github.com/pytorch/pytorch/issues/156815
As far as testing goes
* I tried to use cuobjdump but that was kinda goofy bccd9393a5 the problem was that the name of the cubin will have a single gencode always
* Another idea was to read stderr and check that the right amount of gencodes is there 0beadc01b3 this helped a lot to convince me locally that this test works, the test passed on my dev gpu but was failing in CI and I suspect it's because of a bad interaction with subprocesses
* Last approach was to have a simpler unit test to check which flags get added by default, this is not as comprehensive as the previous ideas but it works and is fast so will opt for this since I'm convinced testing is working per my own experiments and customers
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156850
Approved by: https://github.com/malfet
This adds the `dist_info` command to the list of non-building commands of `setup.py`, which avoids the current situation where simple metadata generation with any packaging tool already triggers a build.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156709
Approved by: https://github.com/Skylion007
----
- serialization
- dlpack
**Next Steps**:
- The rest of `test/test_cpp_extensions_open_device_registration.py` is about the fallback mechanism. In order to keep it consistent with other accelerator usage (C++ registration), the implementation of OpenReg needs to be refactored:
* Simulate multiple device memory in a single process (a brief RFC will be submitted this week)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156589
Approved by: https://github.com/albanD
ghstack dependencies: #156588
Changes WITH_PUSH and the environment check to be ok with giving credentials to push to docker io if its on the main branch, a tag starting with v, or the release branch
Credentials for pushing to docker io are in the environment, so without the environment, you can't push to docker io. You also don't do the push unless WITH_PUSH is true
binary builds on release branch were failing because they pull from docker io, but the docker build wasn't pushing to docker io because it was either on the release branch (didn't have credentials https://github.com/pytorch/pytorch/actions/runs/15888166271/job/44813180986) or it was on the tag (doesn't have WITH_PUSH)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156910
Approved by: https://github.com/atalman
Fixes#154328
**Summary**
Fail reason:
The input value is infinity in float and it has undefined behavior to convert it to int64_t. On X86, it will be converted to the min value of int64_t, which is not expected.
Fix:
Clamping `(input * inv_scale + zero_point)` to `[quant_min, quant_max]` before converting it to int64_t.
**Test plan**
```
pytest test/quantization/core/test_workflow_ops.py -k test_fake_quantize_per_tensor_affine_inf
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155109
Approved by: https://github.com/leslie-fang-intel, https://github.com/jerryzh168
Summary:
- Consolidate the stack trace recording code in TracerBase and PythonKeyTracer
- Change `make_fx`'s arg name to be consistent with TracerBase member name `record_stack_traces`
We move the stack trace logic from `create_proxy` to `create_node` so all inherited classes of TracerBase and re-use the same stack trace logic.
Test Plan:
```
buck run caffe2/test:test_export -- -r test_stack_trace
```
Rollback Plan:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156257
Approved by: https://github.com/angelayi, https://github.com/zou3519
Summary: Undo highlevel BUCKification in favor of something more organized by moving it to the dir itself
Test Plan:
CI
Rollback Plan:
Reviewed By: swolchok
Differential Revision: D76920013
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156503
Approved by: https://github.com/swolchok
The goal of this PR is to fix a specific bug when turning precompile on/off between caching runs.
If you try to turn on BundledAOTAutogradCacheEntry today in between local runs, the FXGraphCache may randomly hit *between* the two runs, because FXGraphCache knows nothing about AOTAutogradCache's config. When FXGraphCache hits, it immediately will call make_launchers() immediately on the triton code it launches, which then causes an assertion failure because pickle should not be called after make_launchers.
One way to resolve the bug is just to add whether precompile is enabled to teh FxGraph cache key. But the better fix for this, however, is higher level/philosophical:
When using BundledAOTAutogradCacheEntry, the entire CompiledFxGraph is saved directly to the cache entry, and we expect the two caches to work in sync, i.e. as one cache. So to simplify the programming model, we disable FxGraphCache when BundledAOTAUtogradCache is turned on.
BundledAOTAutogradCacheEntry is only used for precompile use cases now; if we wanted to use BundledAOTAutogradCache for traditional caching use cases, there's a bunch of further work, one of which would be to re-enable FxGraphCache in the event that BundledAOTAutogradCache has to bypass. However, for precompile, this is not a scenario that should happen: we should always expect the entire callable to be saveable, and we should expect to never bypass. So we don't do that change for now.
Added a unit test demonstrating this behavior. Also updated existing unit tests to show that all fx graph cache operations are now 0 (but all tests still pass).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156611
Approved by: https://github.com/zhxchen17
New set of batchnorm tests to verify NCHW 2D/3D BatchNorm
This test also allows to add and configure different BatchNorm tests (dtypes, NCHW/NHWC, Mixed) in the future
based on:
- Train [test_batchnorm_cudnn_nhwc](1051b93192/test/test_nn.py (L4985))
- Inference [test_batchnorm_nhwc_cuda](1051b93192/test/test_nn.py (L5130))
```
test_batchnorm_3D_inference_NCHW_vs_cpu_float32 (__main__.TestNN.test_batchnorm_3D_inference_NCHW_vs_cpu_float32) ... ok (0.113s)
test_batchnorm_3D_inference_NCHW_vs_cpu_mixed_bfloat16 (__main__.TestNN.test_batchnorm_3D_inference_NCHW_vs_cpu_mixed_bfloat16) ... ok (0.057s)
test_batchnorm_3D_inference_NCHW_vs_cpu_mixed_float16 (__main__.TestNN.test_batchnorm_3D_inference_NCHW_vs_cpu_mixed_float16) ... ok (0.063s)
test_batchnorm_3D_inference_NCHW_vs_native_float32 (__main__.TestNN.test_batchnorm_3D_inference_NCHW_vs_native_float32) ... ok (0.059s)
test_batchnorm_3D_inference_NCHW_vs_native_mixed_bfloat16 (__main__.TestNN.test_batchnorm_3D_inference_NCHW_vs_native_mixed_bfloat16) ... ok (0.006s)
test_batchnorm_3D_inference_NCHW_vs_native_mixed_float16 (__main__.TestNN.test_batchnorm_3D_inference_NCHW_vs_native_mixed_float16) ... ok (0.006s)
test_batchnorm_3D_train_NCHW_vs_cpu_float32 (__main__.TestNN.test_batchnorm_3D_train_NCHW_vs_cpu_float32) ... ok (0.007s)
test_batchnorm_3D_train_NCHW_vs_cpu_mixed_bfloat16 (__main__.TestNN.test_batchnorm_3D_train_NCHW_vs_cpu_mixed_bfloat16) ... ok (0.005s)
test_batchnorm_3D_train_NCHW_vs_cpu_mixed_float16 (__main__.TestNN.test_batchnorm_3D_train_NCHW_vs_cpu_mixed_float16) ... ok (0.005s)
test_batchnorm_3D_train_NCHW_vs_native_float32 (__main__.TestNN.test_batchnorm_3D_train_NCHW_vs_native_float32) ... ok (0.003s)
test_batchnorm_3D_train_NCHW_vs_native_mixed_bfloat16 (__main__.TestNN.test_batchnorm_3D_train_NCHW_vs_native_mixed_bfloat16) ... skip: bfloat16 NCHW train failed due to native tolerance issue (0.001s)
test_batchnorm_3D_train_NCHW_vs_native_mixed_float16 (__main__.TestNN.test_batchnorm_3D_train_NCHW_vs_native_mixed_float16) ... skip: 3D float16 NCHW train failed on ROCm<7.0 (0.001s)
test_batchnorm_2D_inference_NCHW_vs_cpu_float32 (__main__.TestNN.test_batchnorm_2D_inference_NCHW_vs_cpu_float32) ... ok (0.016s)
test_batchnorm_2D_inference_NCHW_vs_cpu_mixed_bfloat16 (__main__.TestNN.test_batchnorm_2D_inference_NCHW_vs_cpu_mixed_bfloat16) ... ok (0.003s)
test_batchnorm_2D_inference_NCHW_vs_cpu_mixed_float16 (__main__.TestNN.test_batchnorm_2D_inference_NCHW_vs_cpu_mixed_float16) ... ok (0.003s)
test_batchnorm_2D_inference_NCHW_vs_native_float32 (__main__.TestNN.test_batchnorm_2D_inference_NCHW_vs_native_float32) ... ok (0.054s)
test_batchnorm_2D_inference_NCHW_vs_native_mixed_bfloat16 (__main__.TestNN.test_batchnorm_2D_inference_NCHW_vs_native_mixed_bfloat16) ... ok (0.002s)
test_batchnorm_2D_inference_NCHW_vs_native_mixed_float16 (__main__.TestNN.test_batchnorm_2D_inference_NCHW_vs_native_mixed_float16) ... ok (0.001s)
test_batchnorm_2D_train_NCHW_vs_cpu_float32 (__main__.TestNN.test_batchnorm_2D_train_NCHW_vs_cpu_float32) ... ok (0.007s)
test_batchnorm_2D_train_NCHW_vs_cpu_mixed_bfloat16 (__main__.TestNN.test_batchnorm_2D_train_NCHW_vs_cpu_mixed_bfloat16) ... ok (0.004s)
test_batchnorm_2D_train_NCHW_vs_cpu_mixed_float16 (__main__.TestNN.test_batchnorm_2D_train_NCHW_vs_cpu_mixed_float16) ... ok (0.004s)
test_batchnorm_2D_train_NCHW_vs_native_float32 (__main__.TestNN.test_batchnorm_2D_train_NCHW_vs_native_float32) ... ok (0.003s)
test_batchnorm_2D_train_NCHW_vs_native_mixed_bfloat16 (__main__.TestNN.test_batchnorm_2D_train_NCHW_vs_native_mixed_bfloat16) ... skip: bfloat16 NCHW train failed due to native tolerance issue (0.001s)
test_batchnorm_2D_train_NCHW_vs_native_mixed_float16 (__main__.TestNN.test_batchnorm_2D_train_NCHW_vs_native_mixed_float16) ... ok (0.002s)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156498
Approved by: https://github.com/jeffdaily
Summary:
ncclUniqueID is only relevant when a comm is created using ncclCommCreate or ncclCommCreateConfig. If a comm is created with ncclCommSplit, this field is unset, causing its usage to create unexpected behavior.
This patch creates a unique hash key for each comm, irrespective of how the comm is created.
Test Plan:
CI
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156790
Approved by: https://github.com/fduwjj, https://github.com/kwen2501
This PR adds the `@deprecate` decorator for internal functions which we are prepping for deprecation. Add it on top of an internal function to emit a deprecation warning + allow bc with the non internal version of the function.
Tested with `python test/test_utils.py TestDeprecate.test_deprecated `
Furthermore, testing with a modified version of the tes in the pr gives something like this which is what we want
```
/home/sahanp/repos/pytorch/test/test_utils.py:1239: UserWarning: deprecated_api is DEPRECATED, please consider using an alternative API(s).
deprecated_api(1, 2)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155127
Approved by: https://github.com/albanD
Co-authored-by: albanD <desmaison.alban@gmail.com>
**Summary**
Enable fp8 qlinear on CPU. It's part of the plan to enable fp8 static quantization on CPU. This PR only adds FP8 support of the existing int8 qlinear op. It does not add a new op nor does it affect frontend or quantization flow. The schema of the qlinear op is not changed either.
So, the FP8 qlinear shares the same op as INT8 qlinear and the difference is that src/wei dtype is fp8 instead of int8. The output dtype can be fp8/float32/bfloat16. The implementation uses the oneDNN library.
The differences of qlinear from `_scaled_mm` are that
- Qlinear supports post op fusion while `_scaled_mm` does not
- Weights are prepacked for qlinear
**Test plan**
```
pytest test/quantization/core/test_quantized_op.py -k "qlinear and fp8"
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155678
Approved by: https://github.com/leslie-fang-intel, https://github.com/jerryzh168
Fixes#154328
**Summary**
Fail reason:
The input value is infinity in float and it has undefined behavior to convert it to int64_t. On X86, it will be converted to the min value of int64_t, which is not expected.
Fix:
Clamping `(input * inv_scale + zero_point)` to `[quant_min, quant_max]` before converting it to int64_t.
**Test plan**
```
pytest test/quantization/core/test_workflow_ops.py -k test_fake_quantize_per_tensor_affine_inf
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155109
Approved by: https://github.com/leslie-fang-intel, https://github.com/jerryzh168
Summary: Wrapper tensor for DTensor is losing shape in offload_tensor. This PR fixes this bug.
Test Plan:
updated the test. Test fails with old code and passes with the fix.
Rollback Plan:
Differential Revision: D77269733
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156774
Approved by: https://github.com/mikaylagawarecki
Notable new features/optimizations for SDPA operators on AMD systems from AOTriton 0.10b:
* Official support of gfx950/gfx1201
* Experimental support of gfx1101/gfx1151/gfx1150/gfx1200
* Reduce libaotriton.so binary size by over 80%.
+ Without this optimization the binary size of `libaotriton.so` could be
over 100MiB due to 2x more supported architectures compared with 0.9b.
Now it is only about 11MiB.
* Support sliding window attention (SWA) in
`_flash_attention_forward/backward`. Should fix#154582
See https://github.com/ROCm/aotriton/releases/tag/0.10b for full details,
including Known Problems.
Notable changes to SDPA backend:
* `std::optional<int64_t>` `window_size_left/right` are directly passed to
ROCM's SDPA backend, because the default value `-1` is meaningful to
AOTriton's backend and bottom-right aligned causal mask is implemented with
negative `window_size_left/right`
* Some code clean up around `USE_CK_FLASH_ATTENTION`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156499
Approved by: https://github.com/jeffdaily, https://github.com/jithunnair-amd
When the async compile subprocesses crash in C++ they tend to just silently die instead of leaving any kind of trace. This installs a crash handler so that if they SEGV, ILL, or ABRT they'll attempt to output a backtrace instead.
While in there I also cleaned up the CLANGTIDY warnings coming from Module.cpp.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155068
Approved by: https://github.com/masnesral
# Motivation
Update the doc, to make `torch.device`'s constructor officially support the following methods:
- A device string, which is a string representation of the device type and optionally the device ordinal.
- A device type and a device ordinal.
- A device ordinal, which is treated as the current accelerator type.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156686
Approved by: https://github.com/albanD
Implementations:
1. Move collective ops to c10d namespace, so that we can call them externally.
2. Add AOTI shims for collective ops.
Testing
1. Add c10d functional UT for cpu.
2. Include the above one in cpp wrapper UT.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154492
Approved by: https://github.com/desertfire
Summary:
Adds an experimental implementation for a rank local checkpointer with save and load with partial load, blind load and in-place load.
This uses an new API and simpler format.
Plan to add async checkpointing, IO layer, pluggable storage backend, layout customization, Resharding, deduplication etc are not implemented.
Test Plan: unit tests
Reviewed By: saumishr
Differential Revision: D75426560
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156142
Approved by: https://github.com/saumishr
The changes are documentation changes to the function lobpcg. There are three changes to the doc.
1. Match doc arg description to be in the same order as the parameters to the function.
2. Update documentation for arg `n` to indicate that when arg `x` is specified value of `n` is ignored if set.
3. Add warning that `m` must be bigger than 3 x the number of requested eigenpairs.
Fixes#152107
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156139
Approved by: https://github.com/soulitzer
Differential Revision: D76904773
In the current scheduler logic, if a template buffer is only a Triton template, which can result from only 1 Triton choice in the autotuning, the fusion won't be benchmarked.
This can lead to an edge case in which a Triton GEMM template from the autotune lookup table can have a problematic fusion, leading to shared memory requirements above the hardware limit. `(256, 128, 64, 4, 8, 8)` is such a config, where we have seen fusion with a `.to(torch.float32)` can lead to this issue, `out of resource: shared memory, Required: 264224, Hardware limit: 232448`. We benchmark the fusion for this case to ensure it's safe.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156500
Approved by: https://github.com/jansel
Fixes#156261
Thanks to @ngimel's fast eyes
For testing, I had experimented with a broader test case change but found that creating a tensor of 2**31+1 size was too expensive to do more than just a few times. Note that while the test case does not run in CI, I did run it locally to ensure it passes with new changes and fails without.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156719
Approved by: https://github.com/albanD
Summary:
The functions guard_lt, guard_equals, and guard_leq work similarly to torch.check and expect_true, but they operate on SymPy expressions. Notably, guard_equals applies local replacements before comparison, which might be better extracted into a separate function.
This pull request standardizes naming conventions to match symbolic_shapes.py. Specifically,
- it introduces size_vars.expect_true and size_vars.check.
- guard_lt becomes check_lt
- guard_leq becomes check_leq
- guard_equals becomes check_equals
I am also seeing a couple of wrong usages !! that i will fix in the next PR
Test Plan:
OSS and cont
Rollback Plan:
Differential Revision: D77054177
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156518
Approved by: https://github.com/bobrenjc93
This PR runs an automatic check as part of dynamo_wrapped to make sure that all unimplemented_v2() callsites are mapped to the JSON file. It also fixes the issue of the CI not able to expand the hints, which was the root cause of the previous workflow failure. If not, the dev gets a message giving them instructions on how to update the JSON file. I also updated a dynamic gb_type to static and updated its test_error_message to include the GBID link for the graph break (before the link would not be produced).
Testing:
I ran the file with the argument to ensure all cases were covered, and also tested the test in CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156237
Approved by: https://github.com/williamwen42
Summary: we need a mechanism that provided the functionschemas for each kernel will be able to trace aliasing behaviour s.t., we have correct value lifetimes when we plan.
Test Plan: ci + unit tests
Reviewed By: SherlockNoMad
Differential Revision: D73635213
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156676
Approved by: https://github.com/zhxchen17
----
* test_baseexception.py
* test_exceptions.py
* test_exception_variations.py
* test_raise.py
* test_sys.py
Minor changes were made to each test to run them inside Dynamo
One can reproduce the changes by downloading the tests from CPython and applying the diff:
```bash
for f in "test_raise" "test_sys" "test_exceptions" "test_baseexception" "test_exception_variations"; do
wget -O "test/dynamo/cpython/3_13/${f}.py" "https://raw.githubusercontent.com/python/cpython/refs/heads/3.13/Lib/test/${f}.py"
git apply "test/dynamo/cpython/3_13/${f}.diff"
done
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150789
Approved by: https://github.com/zou3519
Summary:
# Why
- keep code cleaner
- modular way to hook up preprocessing steps
- expand testability of flows that change which choices are provided e.g. to test performance models and lookup tables by running torch.compile
# What
- similar to feedback_saver_fns, now there are preprocessing_fns
- the existing regex logic is exported into those as a proof of concept
Test Plan:
```
buck2 run mode/opt scripts/coconutruben/torchmm:experiment 2>&1 | tee /tmp/epx038
```
This does not exercise the logic, it just shows that it's safe right now
Rollback Plan:
Differential Revision: D76946993
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156464
Approved by: https://github.com/masnesral
Usually `world_size=torch.cuda.device_count()` for FSDPTest-based tests
But distributed test class `TestFullyShardAllGatherExtensionsMultiProcess` [forces to use `world_size=2`](0a6e1d6b9b/test/distributed/_composable/fsdp/test_fully_shard_extensions.py (L170)) even for 1 GPU.
Then NCCL fails with errors:
```
HIP_VISIBLE_DEVICES=0 python distributed/_composable/fsdp/test_fully_shard_extensions.py -v -k test_all_gather_extensions_train_parity
...
ncclInvalidUsage: This usually reflects invalid usage of NCCL library.
Duplicate GPU detected : rank 1 and rank 0 both on CUDA device c000
Duplicate GPU detected : rank 0 and rank 1 both on CUDA device c000
```
The test method [has `@skip_if_lt_x_gpu(2)` decorator](0a6e1d6b9b/test/distributed/_composable/fsdp/test_fully_shard_extensions.py (L209)), but test fails during test class initialization before decorator activation
This PR will skip FSDPtest-based tests if `world_size > torch.cuda.device_count()`
```
HIP_VISIBLE_DEVICES=0 python distributed/_composable/fsdp/test_fully_shard_extensions.py -v -k test_all_gather_extensions_train_parity
...
dist init r=0, world=2
dist init r=1, world=2
SKIPPED [15.5507s] (Need at least 2 CUDA devices)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155836
Approved by: https://github.com/jeffdaily
Should help with https://github.com/pytorch/pytorch/issues/156425
The one I saw today was because the job was waiting for an environment deployment approval for mergebot environment, which I think comes from something like a temporary github outage or a dropped webhook since it should have permissions as it was on the main branch, and other runs are fine
The run is https://github.com/pytorch/pytorch/actions/runs/15820977440 but you can't see anything about waiting for deployment anymore
My solution is to change the concurrency group so that it will cancel in progress jobs if there is one. My hope is that if one gets stuck, the next one will cancel and re do the environment check. I don't know how to replicate this because apparently you're just supposed to fail if you don't match the protection rules https://github.com/pytorch/pytorch/actions/runs/15830920815
The job runs every 30 minutes so there might be an issue if this job needs to run for >30 minutes to find a green sha, but takes <5 minutes to run usually so I think its ok
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156619
Approved by: https://github.com/atalman
Tests:
* test_generator.py
* test_generator_stop.py
* test_contextlib.py
Minor changes were made to each test to run them inside Dynamo. We
intentionally didn't copy the binary files stored in
`python/Lib/test/archivetestdata` for security reasons. There's a single
test that requires a binary file and it is skipped because of that.
The tests were downloaded from CPython 3.13 and the diff was generated
using `git diff` to apply the changes:
```bash
for f in "test_contextlib" "test_generators" "test_generator_stop"; do
wget -O "test/dynamo/cpython/3_13/${f}.py" "https://raw.githubusercontent.com/python/cpython/refs/heads/3.13/Lib/test/${f}.py"
git apply "test/dynamo/cpython/3_13/${f}.diff"
done
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150796
Approved by: https://github.com/williamwen42
2025-06-24 13:15:04 +00:00
1460 changed files with 53750 additions and 14434 deletions
#Description: required for testing CUDAGraph::raw_cuda_graph(). See https://nvidia.github.io/cuda-python/cuda-bindings/latest/support.html for how this version was chosen. Note "Any fix in the latest bindings would be backported to the prior major version" means that only the newest version of cuda-bindings will get fixes. Depending on the latest version of 12.x is okay because all 12.y versions will be supported via "CUDA minor version compatibility". Pytorch builds against 13.z versions of cuda toolkit work with 12.x versions of cuda-bindings as well because newer drivers work with old toolkits.
TORCH_CUDA_ARCH_LIST="7.5;8.0;8.6;9.0;10.0;12.0+PTX"#removing sm_50-sm_70 as these architectures are deprecated in CUDA 12.8/9 and will be removed in future releases
EXTRA_CAFFE2_CMAKE_FLAGS+=("-DATEN_NO_TEST=ON")
#removing sm_50-sm_70 as these architectures are deprecated in CUDA 12.8/9 and will be removed in future releases
@ -200,7 +200,7 @@ If you want to compile with CUDA support, [select a supported version of CUDA fr
- [NVIDIA cuDNN](https://developer.nvidia.com/cudnn) v8.5 or above
- [Compiler](https://gist.github.com/ax3l/9489132) compatible with CUDA
Note: You could refer to the [cuDNN Support Matrix](https://docs.nvidia.com/deeplearning/cudnn/backend/latest/reference/support-matrix.html) for cuDNN versions with the various supported CUDA, CUDA driver and NVIDIA hardware
Note: You could refer to the [cuDNN Support Matrix](https://docs.nvidia.com/deeplearning/cudnn/backend/latest/reference/support-matrix.html) for cuDNN versions with the various supported CUDA, CUDA driver, and NVIDIA hardware.
If you want to disable CUDA support, export the environment variable `USE_CUDA=0`.
Other potentially useful environment variables may be found in `setup.py`. If
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.