The deprecation warning led to warning spam in PyTorch APIs such as
torch.compile. This is not how a deprecation should be rolled out: if we
add a deprecation warning, we should also update our built-in APIs so that
they do not trigger it and spam users.
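A minimal sketch of the intended pattern, with illustrative names (not the actual torch.compile code): the deprecated entry point keeps warning for direct user calls, while built-in code paths suppress the warning they already know about.
```python
import warnings

def deprecated_api():
    # Deprecated entry point: warns on every direct call.
    warnings.warn("deprecated_api is deprecated", DeprecationWarning, stacklevel=2)
    return 42

def builtin_api():
    # Built-in caller: suppresses the known deprecation warning so that only
    # direct user calls of deprecated_api() see it.
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", DeprecationWarning)
        return deprecated_api()
```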
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166956
Approved by: https://github.com/albanD
(cherry picked from commit 527b1109a8a8d8ae9e1c76c057468aacb302ed84)
Co-authored-by: Richard Zou <zou3519@gmail.com>
[Minor][Inductor] move some combo kernel log from warning to debug (#166993)
Combo kernels warn for long reductions and large pointwise kernels. This becomes too spammy for users such as vLLM.
This PR moves these logs from warning to debug. I validated that the spammy log is gone on llama-3.1-8B.
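A minimal sketch of the change; the logger name and message are illustrative, not the actual Inductor code.
```python
import logging

# Assumed logger name for illustration; the real logger lives under torch._inductor.
log = logging.getLogger("torch._inductor.combo_kernels")

def report_long_reduction(kernel_name: str) -> None:
    # Previously emitted via log.warning(...), which spammed large models;
    # now emitted at DEBUG so it only shows up when the user opts in.
    log.debug("combo kernel %s contains a long reduction", kernel_name)
```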
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166993
Approved by: https://github.com/zou3519, https://github.com/eellison
(cherry picked from commit e020fb3431371ea335a0d5db5094810c9f1e104d)
Co-authored-by: Boyuan Feng <boyuan@meta.com>
[Graph Partition] fix partition x memory plan issue (#165514)
For `test_graph_partition_with_memory_plan_reuse`, before this PR, when using graph partition, it would error ([P1992728479](https://www.internalfb.com/phabricator/paste/view/P1992728479)):
```
def partition_0(args):
    ...
    del buf0
    return (buf3, buf4, buf5, buf2, primals_4, )
...
  File "/tmp/torchinductor_boyuan/ww/cwwc7ukfqscg2vy6ankby2fizdb377tvgyx3fwdgddrxe3g47jg6.py", line 132, in partition_0
    return (buf3, buf4, buf5, buf2, primals_4, )
                              ^^^^
NameError: name 'buf2' is not defined. Did you mean: 'buf0'?
```
When not using graph partition, it would work and give the following code ([P1992997521](https://www.internalfb.com/phabricator/paste/view/P1992997521)):
```
def call(self, args):
    ...
    buf2 = buf0; del buf0 # reuse
    ...
```
Note that the issue is that buf0 is not reused for buf2 when using graph partition.
Why? Because codegen runs `run_wrapper_ir_passes` and `memory_plan_reuse`, which pops a trailing `MemoryPlanningLine` unless its buffer is a graph output, as determined by `V.graph.get_output_names()`. However, for graph partition, we should check the outputs of the current partition instead of the outputs of the graph before partitioning.
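Conceptually, the fix amounts to something like the sketch below (a hypothetical helper for illustration, not the real Inductor code):
```python
def can_pop_planning_line(buffer_name, graph_output_names, partition_output_names, in_partition):
    # When codegen-ing a graph partition, "is this buffer an output?" must be
    # answered against the partition's own outputs rather than
    # V.graph.get_output_names(); otherwise a reuse line like
    # `buf2 = buf0; del buf0` is dropped and buf2 is never defined.
    output_names = partition_output_names if in_partition else graph_output_names
    return buffer_name not in output_names
```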
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165514
Approved by: https://github.com/ProExpertProg, https://github.com/eellison
(cherry picked from commit f071f17911ac7ace9b170e5289e44d50ae460c43)
Co-authored-by: Boyuan Feng <boyuan@meta.com>
[Graph Partition] fix graph partition input signature for fallback kernels (#165815)
Scheduler relies on node.last_usage to free buffers. `last_usage` may contain a buffer that is allocated in a previous graph partition AND not directly accessed in the current graph partition.
## Example
```python
def f(x):
    y = x + 1
    z = torch.ops.aten.view.dtype(y, torch.float8_e4m3fn)
    z_cpu = z.cpu()
    u_cuda = z_cpu.cuda()
    return u_cuda
```
In the generated code, we have
```
def partition_0(args):
    ...
    # Topologically Sorted Source Nodes: [y, z], Original ATen: [aten.add, aten.view]
    buf1 = torch.ops.aten.view.dtype(buf0, torch.float8_e4m3fn) # < ------ buf1 is a view of buf0
    buf2 = buf1 # <------- buf2 is buf1
    assert_size_stride(buf2, (8, ), (1, ), 'torch.ops.aten.view.dtype')
    assert_alignment(buf2, 16, 'torch.ops.aten.view.dtype')
    return (buf2, )
def call(self, args):
    ...
    (buf2,) = self.partitions[0](partition0_args)
    ...
    buf3.copy_(buf2, False)
    del buf0
    del buf1
    del buf2 # <---- `del buf2` leads to `del buf0`. BUT `buf0` is not returned from partition_0.
    ...
```
Note: view is treated as a fallback kernel due to its special dtype.
de09bab4b6/torch/_inductor/lowering.py (L841-L843)
## Fix
This PR fixes the issue by also returning such buffers from the partition so that they can be freed later.
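A hedged sketch of the idea (hypothetical helper, not the real Inductor code): a partition must return not only the buffers that later partitions read, but also the buffers that later partitions merely free.
```python
def partition_return_buffers(produced, read_by_later_partitions, freed_by_later_partitions):
    # Returning buffers that are only freed later keeps `del buf` in the caller valid.
    return [buf for buf in produced
            if buf in read_by_later_partitions or buf in freed_by_later_partitions]
```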
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165815
Approved by: https://github.com/eellison
(cherry picked from commit 1891239a1d876b15f6f97be7ada9aac1f84dab38)
Co-authored-by: Boyuan Feng <boyuan@meta.com>
[dynamo] Revert C++-fying of symbolic shape guards (#166427)
Moving symbolic shape guards to C++ causes compile-time issues. This basically boils down to a tradeoff.
For models with a large number of dynamic shape guards, this flag helps reduce guard latency. But most models have very few dynamic shape guards, so their guard latency is small anyway, while they would still see a high compile-time hit from calling gcc during compilation.
So a good default value seems to be False. We can write a doc to give guidance on reducing guard latency.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166427
Approved by: https://github.com/zou3519
(cherry picked from commit 84a2715d341f068a26a281a252c3907bbe827d9b)
Co-authored-by: Animesh Jain <anijain@umich.edu>
CUDA 13.0 builds fix on Amazon Linux 2023 (#164870)
During 2.9 RC testing I am seeing an issue on Amazon Linux 2023 with CUDA 13.0 builds.
This is related to:
https://github.com/pytorch/pytorch/issues/152756
Workflow: https://github.com/pytorch/test-infra/actions/runs/18324074610/job/52184079262
Error:
```
WARNING: There was an error checking the latest version of pip.
+ python3.11 .ci/pytorch/smoke_test/smoke_test.py --package torchonly
Traceback (most recent call last):
File "/usr/local/lib64/python3.11/site-packages/torch/__init__.py", line 333, in _load_global_deps
ctypes.CDLL(global_deps_lib_path, mode=ctypes.RTLD_GLOBAL)
File "/usr/lib64/python3.11/ctypes/__init__.py", line 376, in __init__
self._handle = _dlopen(self._name, mode)
^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: libcudart.so.13: cannot open shared object file: No such file or directory
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/pytorch/pytorch/.ci/pytorch/smoke_test/smoke_test.py", line 12, in <module>
import torch
File "/usr/local/lib64/python3.11/site-packages/torch/__init__.py", line 425, in <module>
_load_global_deps()
File "/usr/local/lib64/python3.11/site-packages/torch/__init__.py", line 383, in _load_global_deps
_preload_cuda_deps(lib_folder, lib_name)
File "/usr/local/lib64/python3.11/site-packages/torch/__init__.py", line 317, in _preload_cuda_deps
raise ValueError(f"{lib_name} not found in the system path {sys.path}")
Traceback (most recent call last):
ValueError: libnvToolsExt.so.*[0-9] not found in the system path ['/pytorch/pytorch/.ci/pytorch/smoke_test', '/usr/lib64/python311.zip', '/usr/lib64/python3.11', '/usr/lib64/python3.11/lib-dynload', '/usr/local/lib64/python3.11/site-packages', '/usr/local/lib/python3.11/site-packages', '/usr/lib64/python3.11/site-packages', '/usr/lib/python3.11/site-packages']
File "/home/ec2-user/actions-runner/_work/test-infra/test-infra/test-infra/.github/scripts/run_with_env_secrets.py", line 102, in <module>
main()
File "/home/ec2-user/actions-runner/_work/test-infra/test-infra/test-infra/.github/scripts/run_with_env_secrets.py", line 98, in main
run_cmd_or_die(f"docker exec -t {container_name} /exec")
File "/home/ec2-user/actions-runner/_work/test-infra/test-infra/test-infra/.github/scripts/run_with_env_secrets.py", line 39, in run_cmd_or_die
raise RuntimeError(f"Command {cmd} failed with exit code {exit_code}")
RuntimeError: Command docker exec -t 7d9c5bd403cac9a9ee824d63a1d6f6057ecce89a7daa94a81617dbf8eff0ff2e /exec failed with exit code 1
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164870
Approved by: https://github.com/Camyll
(cherry picked from commit 483f4e0db91166128ad8922d86dc7222338d4ecc)
Co-authored-by: atalman <atalman@fb.com>
Co-authored-by: Eli Uriegas <1700823+seemethere@users.noreply.github.com>
fix cpp extension distributed warning spew (#162764)
With the new change we only log the warning if we're running non-distributed code or if we're on rank 0. Unit testing that certain messages get printed only on certain ranks feels fairly janky, so the test plan is below instead.
Test plan
```python
# torchrun --nproc_per_node=2 demo_fix.py
import os
import logging
logging.getLogger('torch.utils.cpp_extension').setLevel(logging.DEBUG)
import torch
if 'RANK' in os.environ:
    torch.distributed.init_process_group('nccl')
from torch.utils.cpp_extension import _get_cuda_arch_flags
_get_cuda_arch_flags()
print(f"Rank {os.environ.get('RANK', '0')} done")
```
Logs showing how `TORCH_CUDA_ARCH_LIST` only shows up once if we explicitly set the logging level to `logging.DEBUG`. The change also improves the debug message to explain what the actual behavior will be.
```
(source) [marksaroufim@devgpu005]~% torchrun --nproc_per_node=2 demo_fix.py
W0911 18:30:16.594000 1315439 /home/marksaroufim/pytorch/torch/distributed/run.py:814]
W0911 18:30:16.594000 1315439 /home/marksaroufim/pytorch/torch/distributed/run.py:814] *****************************************
W0911 18:30:16.594000 1315439 /home/marksaroufim/pytorch/torch/distributed/run.py:814] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0911 18:30:16.594000 1315439 /home/marksaroufim/pytorch/torch/distributed/run.py:814] *****************************************
[rank0]:V0911 18:30:18.921000 1316753 pytorch/torch/utils/cpp_extension.py:2444] TORCH_CUDA_ARCH_LIST is not set, using TORCH_CUDA_ARCH_LIST='10.0+PTX' for visible GPU architectures. Set os.environ['TORCH_CUDA_ARCH_LIST'] to override.
Rank 0 done
Rank 1 done
```
But if we just use the default and comment out `logging.getLogger('torch.utils.cpp_extension').setLevel(logging.DEBUG)`
Then we get
```
(source) [marksaroufim@devgpu005]~% torchrun --nproc_per_node=2 demo_fix.py
W0911 18:14:33.926000 690759 /home/marksaroufim/pytorch/torch/distributed/run.py:814]
W0911 18:14:33.926000 690759 /home/marksaroufim/pytorch/torch/distributed/run.py:814] *****************************************
W0911 18:14:33.926000 690759 /home/marksaroufim/pytorch/torch/distributed/run.py:814] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0911 18:14:33.926000 690759 /home/marksaroufim/pytorch/torch/distributed/run.py:814] *****************************************
Rank 0 done
Rank 1 done
(source) [marksaroufim@devgpu005]~%
```
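For reference, a minimal sketch of the rank-aware gating described above (the helper name is hypothetical; the real logic lives inside torch.utils.cpp_extension):
```python
import logging

import torch.distributed as dist

logger = logging.getLogger("torch.utils.cpp_extension")

def _maybe_log_arch_list(msg: str) -> None:
    # Log only when running non-distributed code or on rank 0, so torchrun
    # does not print N copies of the same message.
    if not dist.is_available() or not dist.is_initialized() or dist.get_rank() == 0:
        logger.warning(msg)
```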
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162764
Approved by: https://github.com/ezyang, https://github.com/zou3519
(cherry picked from commit f7e83219619a05934a344ca699c33ee69d5a3642)
Co-authored-by: Mark Saroufim <marksaroufim@meta.com>
Reapply "Make functionalization `ViewMeta` serializable with pickle. (#143712)" (#163769)
NOTE: This is a re-export of https://github.com/pytorch/pytorch/pull/161994 ; the changes between these two PRs are exclusively in the buck/build files
(Summary from #161994 )
Attempted rebase of https://github.com/pytorch/pytorch/pull/143712.
This reverts commit 6c713ccb5e0df227dd5b630057cbccd373cbe7d6.
cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang amjames Lucaskabela
imported-using-ghimport
Test Plan: Imported from OSS
Differential Revision: D81524507
Pulled By: Lucaskabela
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163769
Approved by: https://github.com/dolpm
(cherry picked from commit 7d710403b003e44bf31d367673a05468e49df75d)
Co-authored-by: Brian Hirsh <hirsheybar@fb.com>
[Flex] Fix silent correctness w/ backpropping grads (#163677)
Fixes https://github.com/pytorch/pytorch/issues/162228
# Summary
The majority of our tests only compile flex-attention in isolation. This means that, for fake tensor propagation, the input primals and all captured buffers don't do any intermediate computation below autograd. As a result, they happen by chance to match the `requires_grad`-ness of the eager implementation, and this check passes. However, if score_mod is the result of some other intermediate fake tensor propagation, it is not guaranteed to have accurate `requires_grad`-ness, which is what was happening here.
TL;DR: this was a belt-and-suspenders check that was actually harmful; we should just let joint-graph tracing create the correct joint graph.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163677
Approved by: https://github.com/ydwu4
(cherry picked from commit e2ce79e4cce5327b71fcf366fad1133030563285)
Co-authored-by: drisspg <drisspguessous@gmail.com>
[a2av] Separate in/out splits into two tensors (#163837)
Old signature:
`all_to_all_vdev(Tensor input, Tensor(a!) out, Tensor(a!) in_out_splits, str group_name)`
New signature:
`all_to_all_vdev(Tensor input, Tensor(a!) out, Tensor in_splits, Tensor(a!) out_splits_offsets, str group_name)`
i.e. split `in_out_splits` into an IN tensor and an OUT tensor so that we can define the TORCH_LIBRARY signature better.
This also brings it in line with the 2D version.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163837
Approved by: https://github.com/fduwjj
ghstack dependencies: #163886
(cherry picked from commit bbf8aa43efe755b9c310347b3780962fca85bf9c)
Co-authored-by: Ke Wen <kw2501@meta.com>
[Flex attention] Fix flex attention head broadcast (#163426)
Fixes part of #163314
In particular bug: **Bug 1: H=None Broadcasting Produces Incorrect Results**
This fixes a shape bug when slicing BlockMask on the Q-tile axis with an int (**mask[:, :, i]**). That form of indexing collapses the Q dimension, so kv_num_blocks/kv_indices lose their expected [B, H, Q_tiles, …] shape. Because they lose that shape, even though the mask_mod remains "interpretable", the kernel's stride math reads wrong offsets, and we get silent numerical mismatches compared to regular SDPA, especially with single-position decoding / H broadcasting.
The B=None, H=None case only works by accident: with a singleton batch/head the kernel maps to index 0 via `sparse_idx_z = off_zq % 1` and `sparse_idx_hq = off_hq % 1`, and with a single Q tile `q_start // SPARSE_Q_MULTIPLE = 0`. The missing Q-tiles stride is multiplied by 0, so the bad offset from the collapsed Q axis doesn't move the pointer, and it happens to read the first tile correctly. Once H > 1 or there are multiple Q tiles, those terms become nonzero and the kernel indexes with wrong strides, causing silent errors.
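To make the shape collapse concrete, a small illustration (the sizes are made up; only the indexing behavior matters):
```python
import torch

# kv_num_blocks is conceptually [B, H, Q_tiles]. Int-indexing the Q-tile axis
# drops that dimension, while a length-1 slice preserves it.
kv_num_blocks = torch.zeros(2, 4, 8, dtype=torch.int32)
collapsed = kv_num_blocks[:, :, 3]      # shape [2, 4]    -- Q-tile dim gone
preserved = kv_num_blocks[:, :, 3:4]    # shape [2, 4, 1] -- Q-tile dim kept
print(collapsed.shape, preserved.shape)
```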
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163426
Approved by: https://github.com/drisspg
(cherry picked from commit 1a42656d6c43a9bb7eb90c511884ce451d29422f)
Co-authored-by: Isalia20 <irakli.salia854@gmail.com>
Update Microsoft C++ Redistributable to the latest version (#161430)
Update Microsoft C++ Redistributable link to the latest version as one of the libraries used by AMD currently has a dependency on that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161430
Approved by: https://github.com/malfet
(cherry picked from commit 1330c638bef7fac64a42935b5a46ee32637ddd4d)
Co-authored-by: Saman Khatir <saman.khatir@amd.com>
[MPSHooks] Release pending command encoder (#164093)
Release the pending command encoder before returning a command buffer, as subsequent callers are very likely to allocate their own encoder, which otherwise results in the following runtime error:
```
tryCoalescingPreviousComputeCommandEncoderWithConfig:nextEncoderClass:]:1090: failed assertion `A command encoder is already encoding to this command buffer'
```
Added a regression test to `test_mps_extension`.
Please note that `torch::mps::get_command_buffer()` should be called with the dispatch queue held, both before and after this change, but many implementations skip that.
Fixes https://github.com/pytorch/pytorch/issues/163721
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164093
Approved by: https://github.com/atalman, https://github.com/Skylion007
(cherry picked from commit 8f32adc90a7fee83583c9ba89dbdfabb317e0452)
Co-authored-by: Nikita Shulga <nikita.shulga@gmail.com>
[SDPA] [MPS] Fixes regression in 2.8.0 for scaled_dot_product_attention using mps (#163598)
Fixes #163597
- Updates fast SDPA implementations to take in query tensor stride info similar to key and value instead of assuming stride.
- Updated tests with additional transpose/permutation layouts. New tests catch the regression.
### Benchmarking with script found in [implementation PR](https://github.com/pytorch/pytorch/pull/152781#:~:text=19.8%25%20speed%20improvement-,Script%20to%20get%20perf%3A,-import%20torch%0Aimport)
Times are averaged over 100000 iterations. This change should not have any significant performance difference. Tested on an M3 Pro
### Vector Fast Path (q_len=1, k_len=256)
- Before: 0.160 ms
- After: 0.157 ms
### Vector 2-pass (q_len=1, k_len=4096)
- Before: 0.342 ms
- After: 0.339 ms
### Vector Fast Path (q_len=8, k_len=256)
- Before: 0.228 ms
- After: 0.231 ms
### Vector 2-pass (q_len=8, k_len=4096)
- Before: 0.432 ms
- After: 0.436 ms
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163598
Approved by: https://github.com/malfet
(cherry picked from commit 1c12d7416bc4f1cf0bc8a229e64169fc361b688e)
Co-authored-by: Vismai Khanderao <59114226+Vismai-Khanderao@users.noreply.github.com>
[AARCH64][CD][CUDA13][Triton][PTXAS] Turn on BUILD_BUNDLE_PTXAS=1 (#163988)
See also #163972, which was intended to be this PR.
Triton (release/3.5.x) by default ships CUDA12.8 ptxas.
This PR tries to bundle a ptxas version for cuda13, so that it can help https://github.com/pytorch/pytorch/issues/163801 when users run on new devices like THOR and Spark.
Fixes https://github.com/pytorch/pytorch/issues/163801
Test Plan:
Check binary size increase against nightly or v2.9RC
Install the binary into a working THOR and GB200/GH100 machine (reproduce the original issue first on THOR), then install the binary built from this PR; we expect the issue to be gone without any additional user settings. Testing on GB200 is to ensure no regression.
Reference: https://github.com/pytorch/pytorch/pull/119750 and 5c814e2527
Note: with this PR, PyTorch's torch.compile is supposed to find ptxas via "torch/_inductor/runtime/compile_tasks.py" and "_set_triton_ptxas_path". Use cases that do not go through "_set_triton_ptxas_path" may not be able to use the CUDA 13 ptxas binary.
However, as is, the Triton world does not know about this new CUDA 13 ptxas. So if a user assumes there is already a pytorch/bin/ptxas and deletes the ptxas shipped with Triton, then c6ad34f7eb/python/triton/knobs.py (L216) will still complain that ptxas is not found, since Triton does not know this new one is available.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163988
Approved by: https://github.com/atalman
(cherry picked from commit 3b4ad4a17d69e2db495ecaf3bae8916282a4eb0d)
Co-authored-by: Wei Wang <weiwan@nvidia.com>
Add operator benchmarking run to CI nightly (#162530)
This PR introduces a new "operator microbenchmark" CI workflow and GitHub Actions for operator microbenchmarks, updating test scripts and job matrices to support new parameters, and broadening the operator benchmark tests to include more data types, larger shapes, and gradient tests. The benchmark configurations now focus on different CUDA hardware and multiple dtypes (bf16, fp16, fp32), for both compile and eager mode.
**Benchmark Configuration and Coverage:**
* Expanded operator benchmark configurations in `addmm_test.py`, `bmm_test.py`, `matmul_test.py`, and `mm_test.py` to benchmark multiple dtypes on CUDA devices, in eager and compile mode, for forward and backward run. The configs with tag "long" for the above mentioned files are being run in CI.
* The CI benchmarking runs on various hardware: H100, A100.
* The CI job also uploads the microbenchmarking outputs to a [HUD](https://hud.pytorch.org/benchmark/llms?repoName=pytorch%2Fpytorch&benchmarkName=PyTorch+operator+microbenchmark) dashboard.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162530
Approved by: https://github.com/huydhn
(cherry picked from commit 54b38f3b46c33a1cc4e8f7894619358afcbd7c89)
Co-authored-by: jainapurva <apurvajain.kota@gmail.com>
Co-authored-by: Huy Do <huydhn@gmail.com>
Update the operator benchmarking, to benchmark using torch.compile (#161394)
This pull request enhances the PyTorch operator benchmarking suite by introducing support for benchmarking with `torch.compile` mode, in addition to the existing Eager and JIT modes. It also adds peak memory measurement (fwd/bwd pass), improves the JSON output format used by the dashboard for reporting, and introduces some more CLI options. The new CLI flags introduced are:
- Added `--use-compile` CLI argument and corresponding logic to run benchmarks using `torch.compile`, including mutual exclusivity with `--use-jit`
- Added `--benchmark-name` argument for customizing the benchmark name in output
- Updated default value for `--output-json-for-dashboard` to `benchmark-results.json` for more predictable output file name
Sample command to run a single operator:
`python -m pt.mm_test --use-compile`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161394
Approved by: https://github.com/jbschlosser
(cherry picked from commit af60398c3a057506363e028bf328843a755b4f24)
Co-authored-by: jainapurva <apurvajain.kota@gmail.com>
[SymmMem] Barrier on team instead of world (#163298)
As titled. Avoiding a potential hang when running dispatch and combine in subgroups.
The rest is just re-arrange of the tests to create a sub-group test class. (no substantial change)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163298
Approved by: https://github.com/fegin
(cherry picked from commit f8fb437197033c33ecc435cd5e1e6a5b2bc5bf69)
Co-authored-by: Ke Wen <kw2501@meta.com>
[SymmMem] Fix memory allocation hold-up (#162680)
Problem:
Without a MemPool, it looks like the nvshmem backend never deallocates memory.
Cause:
Handles in `symm_mems_` (a map) keep references to memory allocations.
Solution:
- Remove the reference to the allocation from the handles -- the reference is never used anyway.
- Use `unique_ptr` instead of `shared_ptr` to wrap the allocation, to ensure single ownership.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162680
Approved by: https://github.com/ezyang
ghstack dependencies: #163298
(cherry picked from commit 7130b174e07dbc1a708934b18dede3d88e8f779f)
Co-authored-by: Ke Wen <kw2501@meta.com>
[Inductor][Intel GPU] Save `threads_per_warp` from Triton compiled kernel for launching the kernel correctly in the cpp wrapper. (#163315)
On the Inductor XPU backend, `threads_per_warp` is not always 32. For Intel GEMM Triton kernels, it can be 16. This information must be preserved for XPU so that the cpp wrapper can launch the kernel with the correct configuration.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163315
Approved by: https://github.com/EikanWang, https://github.com/desertfire
(cherry picked from commit 9f8a311af09586ac4026d6a56fc7c4ac7acc62ed)
Co-authored-by: xinan.lin <xinan.lin@intel.com>
[CI] Fix test_triton_wait_until hang (#163886)
I don't know why `nvshmem_barrier_all_kernel` causes the test to hang; will investigate.
But since it is an unnecessary call here, I am removing it to unblock other PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163886
Approved by: https://github.com/fegin
(cherry picked from commit 96275dbf88372bb32a123c4ea918498128fbecb9)
Co-authored-by: Ke Wen <kw2501@meta.com>
[CD] Add statically linked windows libraries to exclude list (#163768)
Fixes: https://github.com/pytorch/pytorch/issues/159514
Seeing following in the Wheel build logs:
```
Linking CXX static library lib\kineto.lib
Linking CXX static library lib\dnnl.lib
....
```
These files are around 800MB uncompressed and 109MB compressed, so excluding them provides a ~50% size reduction for Windows CPU builds.
Test Plan: Build Pytorch Windows binary. Build vision, audio and torchcodec with this binary. Smoke test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163768
Approved by: https://github.com/albanD, https://github.com/malfet
(cherry picked from commit 98c4e35f14601909c113b4fd2857b6f0fb525316)
Co-authored-by: atalman <atalman@fb.com>
[CI] Move Windows build/tests to Python-3.10 (#162862)
What was supposed to be a very simple change ended up being quite involved, as the current Windows CI framework is quite inflexible: it takes lots of arguments but later on ignores them, namely:
- `PYTHON_VERSION` used to be a no-op that was simply ignored by the scripts
- With this change, the `setup-win` action will create an environment called `py_tmp` with the specific python version + intel-openmp (which is a hard runtime requirement, but for some reason is not packaged into the wheel nor marked as such)
- Copied test-type dependencies from be01a40157/aws/ami/windows/scripts/Installers/Install-Pip-Dependencies.ps1 (L16) into `win-test.sh`, but made some adjustments to be compatible with the 3.10 runtime (scipy version update) and to make rerun-tests compatible with the rest of the deps
I think in the long run, one needs to update 4432e2cacd/aws/ami/windows/scripts/Installers/Install-Miniconda3.ps1, which currently pins the Miniconda python to 3.9, but also figure out how CI can still create a new environment without having to download all the dependencies every time
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162862
Approved by: https://github.com/wdvr, https://github.com/huydhn
ghstack dependencies: #163339, #163341
Co-authored-by: Nikita Shulga <nikita.shulga@gmail.com>
[BE] Introduce `CONDA_ROOT_DIR` (#163341)
Which is equal to `%CONDA_PARENT_DIR%/Miniconda3`; replace this pattern with `%CONDA_ROOT_DIR%` throughout the codebase.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163341
Approved by: https://github.com/clee2000
ghstack dependencies: #163339
(cherry picked from commit a273475b01e912f402378a522bb9c4ed37e8413a)
Co-authored-by: Nikita Shulga <nshulga@meta.com>
[export] Remove .contiguous() when saving weights to raw bytes (#163587)
Summary: `.contiguous()` will discard the original storage size of the tensor, and could lead to issues during loading.
Test Plan:
buck2 run mode/dev-nosan caffe2/test:test_export -- -r test_1D_tensor_slicing
buck2 run mode/dev-nosan caffe2/test:test_export -- -r test_2D_tensor_slicing
Differential Revision: D83016250
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163587
Approved by: https://github.com/angelayi
(cherry picked from commit 720a7b2887ca4efc8d63b32373182bc97918c76e)
Co-authored-by: Yiming Zhou <yimingzhou@meta.com>
Update pytorch_sphinx_theme2 to latest hash (#163269)
The updated theme:
- Fixes articleBody in the json+ld that caused previous Google Search issues
- Other minor fixes
- 404.html fixes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163269
Approved by: https://github.com/albanD
(cherry picked from commit 68e75be86ab618bb6b1dc32b603a780ff6046262)
Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
[SymmMem] Fix put_signal + wait_until hang (#163194)
The test used the wrong pointers to refer to the remote address:
```
dst_ptr = out_hdl.buffer_ptrs[peer]
src_ptr = inp_hdl.buffer_ptrs[rank]
sig_ptr = out_hdl.signal_pad_ptrs[peer]
```
All three indices should be `rank` instead of `peer`, because NVSHMEM APIs accept local addresses as input and perform the translation internally. Without the correct signal address, the peer would keep waiting, hence the hang.
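For clarity, the corrected indexing implied by the fix (same variables as in the snippet above):
```
dst_ptr = out_hdl.buffer_ptrs[rank]
src_ptr = inp_hdl.buffer_ptrs[rank]
sig_ptr = out_hdl.signal_pad_ptrs[rank]
```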
Also adjusted the signature of `nvshmem.putmem_signal_block` to accept tensor instead of pointer.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163194
Approved by: https://github.com/ngimel
ghstack dependencies: #163025, #163152
(cherry picked from commit 80f8be9840c20c3efe1274266b52ab098f4d1030)
Co-authored-by: Ke Wen <kw2501@meta.com>
[Graph Partition] improve custom op output alias (#163227)
For a custom op with multiple outputs, we will see the following generated code:
```
buf1 = op1(arg0)
buf3 = buf1[0]
buf4 = buf1[1]
del buf1 # <--- if buf1 is not accessed in the future
```
If `buf1` is not accessed in the future, it's good to deallocate it early, so we don't delay the `del` until buf3 and buf4 are also no longer used. Note that buf3 and buf4 hold references to the data, so `del buf1` does not prevent their use.
However, when there are mutating args, we don't see `del buf1` immediately.
```python
@torch.library.custom_op(
    "mylib::op1",
    mutates_args=["x"],
    schema="(Tensor(a!)? x) -> (Tensor, Tensor)",
    device_types="cuda",
)
def op1(x) -> tuple[torch.Tensor, torch.Tensor]:
    x = x + 1
    return (x + 1, x + 2)
```
[Screenshot of the generated code omitted: `del buf1` is not emitted immediately.]
Why? Because `buf3` is a MultiOutput with `buf1` as input, and it believes `buf1` (an output of FallbackKernel op1) has inputs that alias its output.
72fedf0575/torch/_inductor/ir.py (L7976-L7982)
According to `[NOTE: FallbackKernel supported operators]`, as a mutating op that is auto-functionalizable, op1's outputs should NOT alias any of its inputs. This PR improves `get_inputs_that_alias_output` of FallbackKernel accordingly.
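A hedged sketch of the improvement (attribute and helper names are hypothetical, not the actual Inductor implementation):
```python
def get_inputs_that_alias_output(kernel):
    # An auto-functionalizable mutating fallback kernel's outputs do not alias
    # its inputs, so report no aliasing; MultiOutput users such as buf3/buf4
    # then no longer keep the parent buffer (buf1) alive.
    if getattr(kernel, "is_auto_functionalizable_mutating_op", False):
        return []
    return list(getattr(kernel, "possibly_aliasing_inputs", []))
```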
Use case: [moe custom op in vllm](https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/fused_moe/layer.py#L2057-L2064)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163227
Approved by: https://github.com/zou3519
(cherry picked from commit 4967ad8baa724b8b1acc123698bb1265723feb87)
Co-authored-by: Boyuan Feng <boyuan@meta.com>
Add decomp rule to assert_tensor_metadata for BatchedTensors (#163008)
Whenever there is a device move, export introduces the assert_tensor_metadata aten operator to make sure to guard for device specialization. This aten op didn't work with vmap because we didn't register an explicit decomp rule saying we should just skip the BatchedTensor and call the op on the underlying tensor.
Differential Revision: [D82483979](https://our.internmc.facebook.com/intern/diff/D82483979)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163008
Approved by: https://github.com/huydhn
(cherry picked from commit e28983be76aa4651e3cb69dc3a4234d75038d938)
Co-authored-by: Tugsbayasgalan Manlaibaatar <tmanlaibaatar@fb.com>
[SymmMem] Fix NVSHMEM plugin + Triton 3.5 (#163152)
1. The dispatch signatures defined in the `core.extern_elementwise` call must match the C signatures of the NVSHMEM functions, in particular the dtypes. Otherwise, there are weird errors, such as IMA or a hang. When matched, most of the time the NVSHMEM device function will be inlined into the generated PTX. When not matched, it is represented as a function call in the PTX (not sure if it is the function call that goes wrong).
2. When calling the `core.extern` wrappers from `triton.jit` kernels, the inputs must be cast to match the signatures defined in 1, e.g. via `nbytes.to(tl.int64)`. Otherwise, Triton will report a key error when searching for such a kernel.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163152
Approved by: https://github.com/ngimel
ghstack dependencies: #163025
(cherry picked from commit 57a54a04b6eb78e0aa7d13b48e25fb8c0c49fd60)
Co-authored-by: Ke Wen <kw2501@meta.com>
Support vmap + custom autograd function/improve DTensor constructor inefficiency (#162240)
This makes gemma3 exportable on transformers=4.55.4.
In HF, there is a torch function mode called TransformGetItemToIndex which internally calls a custom autograd function. When this custom autograd function is called under vmap, it triggers CustomFunctionHigherOrderOP, which errored because there was no pre-dispatch proxy mode implementation.
Since there have been a number of requests lately to add various operators to the pre-dispatch IR, I introduce a decorator in export that works similarly to `allow_in_graph`. Basically:
1) We intercept custom_autograd_function.apply at pre-dispatch mode when this decorator is applied
2) We apply the `flat_apply` HOP to hide the pytree spec for this autograd function. Note that this adds the restriction that the custom autograd function needs to take in fx-able types.
3) The subclass constructor decorator is implemented similarly, so we just refactor it to use a similar implementation as this new decorator. Eventually we should delete the subclass constructor decorator.
4) Move some code in the subclass constructor decorator to exit early in a non-export environment, which should shave off some inefficiency (around 1% according to @swolchok's benchmark)
Fixes: https://github.com/pytorch/pytorch/issues/161563#issuecomment-3246309758
Differential Revision: [D82141316](https://our.internmc.facebook.com/intern/diff/D82141316)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162240
Approved by: https://github.com/ydwu4
(cherry picked from commit 463fbc8ca0537e5635236190d2ca38ce6fcef831)
Co-authored-by: Tugsbayasgalan Manlaibaatar <tmanlaibaatar@fb.com>
[CD] Aarch64 Fix packaging ``libarm_compute.so`` and other libraries to the aarch64 CUDA wheels (#162566)
Fixes aarch64 Linux packaging; see the following error:
https://github.com/pytorch/vision/actions/runs/17612462583/job/50037380487#step:15:62
```
Traceback (most recent call last):
File "/__w/vision/vision/pytorch/vision/setup.py", line 13, in <module>
import torch
File "/__w/_temp/conda_environment_17612462583/lib/python3.11/site-packages/torch/__init__.py", line 415, in <module>
from torch._C import * # noqa: F403
^^^^^^^^^^^^^^^^^^^^^^
ImportError: libarm_compute.so: cannot open shared object file: No such file or directory
```
This is due to missing dependencies.
Current (broken) behavior:
1. File torch-2.10.0.dev20250910+cu130-cp310-cp310-linux_aarch64.whl is extracted
2. File is repackaged as torch-2.10.0.dev20250910+cu130-cp310-cp310-manylinux_2_28_aarch64.whl
3. File torch-2.10.0.dev20250910+cu130-cp310-cp310-linux_aarch64.whl is renamed to torch-2.10.0.dev20250910+cu130-cp310-cp310-manylinux_2_28_aarch64.whl
Hence the repackaging does not take effect.
This PR does the following:
1. File torch-2.10.0.dev20250910+cu130-cp310-cp310-linux_aarch64.whl is extracted
2. File torch-2.10.0.dev20250910+cu130-cp310-cp310-linux_aarch64.whl is deleted
3. File is repackaged as torch-2.10.0.dev20250910+cu130-cp310-cp310-manylinux_2_28_aarch64.whl
It looks like, after migrating from zipping the wheel to `wheel pack`, renaming the wheel is no longer necessary, hence the renaming step is removed and the old file is deleted.
```
2025-09-10T10:10:05.9652454Z Using nvidia libs from pypi - skipping CUDA library bundling
2025-09-10T10:10:05.9656595Z Copying to /pytorch/dist/tmp/torch/lib/libgomp.so.1
2025-09-10T10:10:05.9873843Z Copying to /pytorch/dist/tmp/torch/lib/libgfortran.so.5
2025-09-10T10:10:06.0410041Z Copying to /pytorch/dist/tmp/torch/lib/libarm_compute.so
2025-09-10T10:10:06.2869242Z Copying to /pytorch/dist/tmp/torch/lib/libarm_compute_graph.so
2025-09-10T10:10:06.4385740Z Copying to /pytorch/dist/tmp/torch/lib/libnvpl_lapack_lp64_gomp.so.0
2025-09-10T10:10:06.5461372Z Copying to /pytorch/dist/tmp/torch/lib/libnvpl_blas_lp64_gomp.so.0
2025-09-10T10:10:06.5728970Z Copying to /pytorch/dist/tmp/torch/lib/libnvpl_lapack_core.so.0
2025-09-10T10:10:06.6231872Z Copying to /pytorch/dist/tmp/torch/lib/libnvpl_blas_core.so.0
2025-09-10T10:10:14.1503110Z Updated tag from Tag: cp310-cp310-linux_aarch64
2025-09-10T10:10:14.1503482Z to Tag: cp310-cp310-manylinux_2_28_aarch64
2025-09-10T10:10:14.1503682Z
2025-09-10T10:10:41.6498892Z Repacking wheel as /pytorch/dist/torch-2.10.0.dev20250910+cu130-cp310-cp310-manylinux_2_28_aarch64.whl...OK
2025-09-10T10:10:41.9394460Z Renaming torch-2.10.0.dev20250910+cu130-cp310-cp310-linux_aarch64.whl wheel to torch-2.10.0.dev20250910+cu130-cp310-cp310-manylinux_2_28_aarch64.whl
```
Test Plan, Executed on local file:
```
inflating: ubuntu/dist/tmp/torch-2.9.0.dev20250909+cu130.dist-info/WHEEL
inflating: ubuntu/dist/tmp/torch-2.9.0.dev20250909+cu130.dist-info/entry_points.txt
inflating: ubuntu/dist/tmp/torch-2.9.0.dev20250909+cu130.dist-info/top_level.txt
inflating: ubuntu/dist/tmp/torch-2.9.0.dev20250909+cu130.dist-info/RECORD
Bundling CUDA libraries with wheel
Updated tag from Tag: cp310-cp310-manylinux_2_28_aarch64
to Tag: cp310-cp310-manylinux_2_28_aarch64
Repacking wheel as ubuntu/dist/torch-2.9.0.dev20250909+cu130-cp310-cp310-manylinux_2_28_aarch64.whl...OK
Copying torch-2.9.0.dev20250909+cu130-cp310-cp310-manylinux_2_28_aarch64.whl to artifacts
Build Complete. Created torch-2.9.0.dev20250909+cu130-cp310-cp310-manylinux_2_28_aarch64.whl..
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162566
Approved by: https://github.com/jeanschmidt, https://github.com/NicolasHug
(cherry picked from commit 3d32bb114bf0d5bd0193dc40f20253635dddf080)
Co-authored-by: atalman <atalman@fb.com>