Currently, whenever we construct a GLOBAL_STATE guard, we create a fresh guard based on the current global state. For precompile, we instead want to create GLOBAL_STATE guards from external sources, e.g. serialized global states. This also applies to the normal case, where we can simply pass in the global state guard from Python.
Differential Revision: [D77400988](https://our.internmc.facebook.com/intern/diff/D77400988/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157285
Approved by: https://github.com/jansel
Adding more information to the error message for debugging.
Example error message:
```
Detected recompile when torch.compile stance is 'fail_on_recompile'. filename: 'caffe2/test/dynamo/test_misc.py', function name: 'fn', line number: 0
Failed on the following precompiled guards:
TREE_GUARD_MANAGER:
+- RootGuardManager
| +- LAMBDA_GUARD: isinstance(L['x'], bool)
GuardDebugInfo(
result=0,
verbose_code_parts=["isinstance(L['x'], bool)"],
num_guards_executed=1)
```
Differential Revision: [D76987126](https://our.internmc.facebook.com/intern/diff/D76987126/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156433
Approved by: https://github.com/jamesjwu
Fixes #149280. Follow-up to #147966, but now available for ROCm.
Since hipblaslt does not support HIPBLASLT_MATMUL_DESC_CU_COUNT_TARGET, we instead create a hipStream with a CU mask applied. We pass this masked stream to hipblaslt instead of PyTorch's current stream, and we ensure ordering between the streams using hipEvents and stream synchronization.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149466
Approved by: https://github.com/malfet, https://github.com/atalman
Changes needed for ROCm 7.0:
* `warpSize` is _not_ a compile-time constant in device-side compilation for ROCm anymore
* `warpSize` is _not_ defined in host-side compilation, hence `at::cuda::warp_size()` must be used to query the warp size at runtime (see the sketch after this list)
* `C10_WARP_SIZE` is redefined to be a compile-time constant, with a reasonable value for device-side compilation, but a deliberately unreasonable value of 1 for host-side compilation
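Because host code can no longer assume a fixed warp size at compile time, it has to query it at runtime. The same runtime value is also visible from Python; a minimal sketch (illustrative only, not part of this PR):
```py
# Runtime warp-size query; the C++ analogue is at::cuda::warp_size().
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    # 32 on NVIDIA GPUs; 32 or 64 on AMD GPUs, depending on the architecture.
    print(props.warp_size)
```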
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156979
Approved by: https://github.com/jeffdaily
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
The biggest bottleneck we found with two-shot allreduce was that the compiler was serializing all of the load operations for some reason. To avoid these load delays, we de-serialize the loads so that they can be in flight concurrently. Along with this improvement, we also found that on AMD GPUs a different block and thread size gives a nice performance boost. Here are the bandwidth numbers obtained with this PR:

(Bandwidth comparison table not reproduced here; see the PR.)

The rows highlighted in green are the tensor sizes we are interested in, because two-shot is only used for larger sizes (one-shot is used for smaller sizes). As the table shows, our baseline numbers consistently underperformed relative to the fbgemm numbers. With this de-serialization change, however, most of the green tensor sizes see a performance boost (positive %). One tensor size shows a negative delta, but it is within the error margin.
co-authored by: @amd-hhashemi
https://github.com/pytorch/FBGEMM/issues/4072
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156746
Approved by: https://github.com/jeffdaily
Co-authored-by: Hashem Hashemi <hashem.hashemi@amd.com>
Generate the source tarball with PEP 517 conformant build tools instead of the custom routine currently in place.
Closes #150461.
The current procedure for generating the source tarball consists of creating a source tree by manually copying and pruning source files.
This PR replaces that with a call to the standard [build tool](https://build.pypa.io/en/stable/), which works with the build backend to produce an sdist. For that to work correctly, the build backend also needs to be configured. In the case of PyTorch, the backend currently is (the legacy version of) the setuptools backend, whose source-dist part is mostly configured via the `MANIFEST.in` file.
The resulting source distribution can be used to install directly from source with `pip install ./torch-{version}.tar.gz` or to build wheels directly from source with `pip wheel ./torch-{version}.tar.gz`; both should be considered experimental for now.
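For reference, the sdist can also be produced programmatically through the [build](https://build.pypa.io/en/stable/) package's Python API; a minimal sketch, assuming it is run from the repository root (output path illustrative):
```py
# Equivalent to `python -m build --sdist`: asks the configured build backend
# (here: legacy setuptools, driven by MANIFEST.in) to produce the sdist.
from build import ProjectBuilder

builder = ProjectBuilder(".")                 # path to the PyTorch checkout
sdist_path = builder.build("sdist", "dist/")  # returns the path to the tarball
print(sdist_path)                             # e.g. dist/torch-<version>.tar.gz
```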
## Issues
### sdist name
According to PEP 517, the name of the source distribution file must coincide with the project name, or [more precisely](https://peps.python.org/pep-0517/#source-distributions), the source distribution of a project that generates `{NAME}-{...}.whl` wheels is required to be named `{NAME}-{...}.tar.gz`. Currently, the source tarball is called `pytorch-{...}.tar.gz`, but the generated wheels and Python package are called `torch-{...}`.
### Symbolic Links
The source tree currently contains a small number of symbolic links. This [has been seen as problematic](https://github.com/pypa/pip/issues/5919), largely because of lack of support on Windows, but also because of [a problem in setuptools](https://github.com/pypa/setuptools/issues/4937). Particularly unfortunate is a circular symlink in the third-party `ittapi` module, which cannot be resolved by replacing it with a copy.
PEP 721 (now integrated in the [Source Distribution Format Specification](https://packaging.python.org/en/latest/specifications/source-distribution-format/#source-distribution-archive-features)) allows for symbolic links, but only if they don't point outside the destination directory and if they don't contain `../` in their target.
The list of symbolic links currently is as follows:
<details>
|source|target|problem|solution|
|-|-|-|-|
| `.dockerignore` | `.gitignore` | ✅ ok (individual file) ||
| `docs/requirements.txt` | `../.ci/docker/requirements-docs.txt` |❗`..` in target|swap source and target[^1]|
| `functorch/docs/source/notebooks` | `../../notebooks/` |❗`..` in target|swap source and target[^1]|
| `.github/ci_commit_pins/triton.txt` | `../../.ci/docker/ci_commit_pins/triton.txt` | ✅ ok (omitted from sdist)||
| `third_party/flatbuffers/docs/source/CONTRIBUTING.md` | `../../CONTRIBUTING.md` |❗`..` in target|omit from sdist[^2]|
| `third_party/flatbuffers/java/src/test/java/DictionaryLookup` | `../../../../tests/DictionaryLookup` |❗`..` in target|omit from sdist[^3]|
| `third_party/flatbuffers/java/src/test/java/MyGame` | `../../../../tests/MyGame` |❗`..` in target|omit from sdist[^3]|
| `third_party/flatbuffers/java/src/test/java/NamespaceA` | `../../../../tests/namespace_test/NamespaceA` |❗`..` in target|omit from sdist[^3]|
| `third_party/flatbuffers/java/src/test/java/NamespaceC` | `../../../../tests/namespace_test/NamespaceC` |❗`..` in target|omit from sdist[^3]|
| `third_party/flatbuffers/java/src/test/java/optional_scalars` | `../../../../tests/optional_scalars` |❗`..` in target|omit from sdist[^3]|
| `third_party/flatbuffers/java/src/test/java/union_vector` | `../../../../tests/union_vector` |❗`..` in target|omit from sdist[^3]|
| `third_party/flatbuffers/kotlin/benchmark/src/jvmMain/java` | `../../../../java/src/main/java` |❗`..` in target|omit from sdist[^3]|
| `third_party/ittapi/rust/ittapi-sys/c-library` | `../../` |❗`..` in target|omit from sdist[^4]|
| `third_party/ittapi/rust/ittapi-sys/LICENSES` | `../../LICENSES` |❗`..` in target|omit from sdist[^4]|
| `third_party/opentelemetry-cpp/buildscripts/pre-merge-commit` | `./pre-commit` |✅ ok (individual file)||
| `third_party/opentelemetry-cpp/third_party/prometheus-cpp/cmake/project-import-cmake/sample_client.cc` | `../../push/tests/integration/sample_client.cc` |❗`..` in target|omit from sdist[^5]|
| `third_party/opentelemetry-cpp/third_party/prometheus-cpp/cmake/project-import-cmake/sample_server.cc` | `../../pull/tests/integration/sample_server.cc` |❗`..` in target|omit from sdist[^5]|
| `third_party/opentelemetry-cpp/third_party/prometheus-cpp/cmake/project-import-pkgconfig/sample_client.cc` | `../../push/tests/integration/sample_client.cc` |❗`..` in target|omit from sdist[^5]|
| `third_party/opentelemetry-cpp/third_party/prometheus-cpp/cmake/project-import-pkgconfig/sample_server.cc` | `../../pull/tests/integration/sample_server.cc` |❗`..` in target|omit from sdist[^5]|
| `third_party/XNNPACK/tools/xngen` | `xngen.py` | ✅ ok (individual file)||
</details>
The introduction of symbolic links inside the `.ci/docker` folder creates a new problem, however, because Docker's `COPY` command does not allow symlinks in this way. We work around that by using `tar ch` to dereference the symlinks before handing them over to `docker build`.
[^1]: These resources can be naturally considered to be part of the docs, so moving the actual files into the place of the current symlinks and replacing them with (unproblematic) symlinks can be said to improve semantics as well.
[^2]: The flatbuffers docs actually already use the original file, not the symlink; in the most recent releases, starting from flatbuffers-25.1.21, the symlink has been replaced by the actual file thanks to a documentation overhaul.
[^3]: These resources are flatbuffers tests for java and kotlin and can be omitted from our sdist.
[^4]: We don't need to ship the rust bindings for ittapi.
[^5]: These are demonstration examples for how to link to prometheus-cpp using cmake and can be omitted.
### NCCL
NCCL used to be included as a submodule. However, with #146073 (first released in v2.7.0-rc1), the submodule was removed and replaced with a build-time checkout procedure in `tools/build_pytorch_libs.py`, which checks out the required version of NCCL from the upstream repository based on a commit pin recorded in `.ci/docker/ci_commit_pins/nccl-cu{11,12}.txt`.
This means that a crucial third-party dependency is missing from the source distribution, and because the `.ci` folder is omitted from the source distribution, the build-time download cannot be used either.
However, it *is* possible to use a system-provided NCCL via the `USE_SYSTEM_NCCL` environment variable, which is now also the default for the official PyTorch wheels.
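For example, a source build against the system NCCL might be driven like this (a hedged sketch; the tarball name is a placeholder):
```py
# Hypothetical driver script: install from the sdist using the system NCCL.
import os
import subprocess
import sys

sdist = "./torch-2.8.0.tar.gz"  # placeholder name for the generated sdist
env = dict(os.environ, USE_SYSTEM_NCCL="1")
subprocess.run(
    [sys.executable, "-m", "pip", "install", sdist],
    env=env,
    check=True,
)
```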
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152098
Approved by: https://github.com/atalman
This PR improves the parallelize_module API to support more corner cases:
1. if a plan entry is specified as "", the style is applied to the current module
2. if a plan entry has no corresponding submodule to apply to, a warning is raised and the entry is ignored (see the sketch after this list)
While working on this PR, I also found that the internal while-loop was not actually necessary and could produce some nasty modify-while-iterating behavior, so I removed it.
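A minimal sketch of the new behavior (module structure and sizes are illustrative; assumes an 8-rank job with torch.distributed already initialized):
```py
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)

mesh = init_device_mesh("cuda", (8,))
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 16))

parallelize_module(model, mesh, {
    "0": ColwiseParallel(),  # applies to model[0]
    "2": RowwiseParallel(),  # applies to model[2]
    # "": <style> would apply the style to `model` itself (case 1 above);
    # a key with no matching submodule now warns and is ignored (case 2 above).
})
```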
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157182
Approved by: https://github.com/tianyu-l
The PyTorch build is failing on Power systems since commit ec24f8f58a74502c5a2488f5d9e85a817616dda0.
***Build Failure Logs***
**Error related to mkldnn**
```
pytorch/aten/src/ATen/native/Blas.cpp:302:26: error: ‘cpuinfo_has_x86_amx_int8’ was not declared in this scope
302 | if ((!mixed_dtype && cpuinfo_has_x86_amx_int8()) ||
| ^~~~~~~~~~~~~~~~~~~~~~~~
pytorch/aten/src/ATen/native/Blas.cpp:303:25: error: ‘cpuinfo_has_x86_amx_fp16’ was not declared in this scope
303 | (mixed_dtype && cpuinfo_has_x86_amx_fp16())) {
| ^~~~~~~~~~~~~~~~~~~~~~~~
```
**Error related to vec256 complex float redefinition**
```
aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:19:7: error: specialization of ‘at::vec::DEFAULT::Vectorized<c10::complex<float> >’ after instantiation
19 | class Vectorized<ComplexFlt> {
| ^~~~~~~~~~~~~~~~~~~~~~
aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:19:7: error: redefinition of ‘class at::vec::DEFAULT::Vectorized<c10::complex<float> >’
aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:633:18: error: ‘const class at::vec::DEFAULT::Vectorized<c10::complex<float> >’ has no member named ‘abs_2_’
633 | auto abs_a = a.abs_2_();
| ^~~~~~
aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:634:18: error: ‘const class at::vec::DEFAULT::Vectorized<c10::complex<float> >’ has no member named ‘abs_2_’
634 | auto abs_b = b.abs_2_();
| ^~~~~~
/aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:666:17: error: ‘const class at::vec::DEFAULT::Vectorized<c10::complex<float> >’ has no member named ‘vec0’
666 | vec_add(a.vec0(), b.vec0()), vec_add(a.vec1(), b.vec1())};
aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:673:17: error: ‘const class at::vec::DEFAULT::Vectorized<c10::complex<float> >’ has no member named ‘vec0’
673 | vec_sub(a.vec0(), b.vec0()), vec_sub(a.vec1(), b.vec1())};
| ^~~~
aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:680:27: error: ‘const class at::vec::DEFAULT::Vectorized<c10::complex<float> >’ has no member named ‘vec0’
680 | vec_and(a.vec0(), b.vec0()), vec_and(a.vec1(), b.vec1())};
```
***Build logs with these changes***
```
Building wheel torch-2.8.0a0+gita3098a7
-- Building version 2.8.0a0+gita3098a7
-- Checkout nccl release tag: v2.26.5-1
cmake -GNinja -DBLAS=OpenBLAS -DBUILD_PYTHON=True -DBUILD_TEST=True -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/home/avanish/OfficeWork2025/JuneWork/pytorch_5Jun/pack/torch_night_5Jun/pytorch/torch -DCMAKE_PREFIX_PATH=/home/avanish/OfficeWork2025/JuneWork/pyenv/pytorch_5Jun/lib/python3.12/site-packages -DPython_EXECUTABLE=/home/avanish/OfficeWork2025/JuneWork/pyenv/pytorch_5Jun/bin/python -DTORCH_BUILD_VERSION=2.8.0a0+gita3098a7 -DUSE_MKLDNN=ON -DUSE_MKLDNN_CBLAS=ON -DUSE_NUMPY=True -DUSE_OPENMP=ON /home/avanish/OfficeWork2025/JuneWork/pytorch_5Jun/pack/torch_night_5Jun/pytorch
cmake --build . --target install --config Release
running build_ext
-- Building with NumPy bindings
-- Not using cuDNN
-- Not using CUDA
-- Not using XPU
-- Using MKLDNN
-- Not using Compute Library for the Arm architecture with MKLDNN
-- Using CBLAS in MKLDNN
-- Not using NCCL
-- Building with distributed package:
-- USE_TENSORPIPE=True
-- USE_GLOO=True
-- USE_MPI=False
-- Building Executorch
-- Not using ITT
Copying functorch._C from functorch/functorch.so to /home/avanish/OfficeWork2025/JuneWork/pytorch_5Jun/pack/torch_night_5Jun/pytorch/build/lib.linux-ppc64le-cpython-312/functorch/_C.cpython-312-powerpc64le-linux-gnu.so
copying functorch/functorch.so -> /home/avanish/OfficeWork2025/JuneWork/pytorch_5Jun/pack/torch_night_5Jun/pytorch/build/lib.linux-ppc64le-cpython-312/functorch/_C.cpython-312-powerpc64le-linux-gnu.so
building 'torch._C' extension
creating build/temp.linux-ppc64le-cpython-312/torch/csrc
```
This patch fixes the PyTorch build issue on Power; with it, I am able to build successfully.
Hi @malfet @albanD,
please review this PR for the PyTorch build issue we are observing on Power.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155255
Approved by: https://github.com/albanD, https://github.com/malfet
This PR relaxes the device_mesh argument constraint in the local_map API. The current restriction is too strict: all DTensor input arguments must live on the same device mesh. Users may well want to pass in DTensors that live on different device meshes, e.g. weights and activations could live on different meshes.
When using local_map, we extract the local tensors from the DTensors, and as long as the placements the user specified match the actual DTensor placements, the user has clearly indicated that the inputs are intended to live on different meshes. So this PR removes the same-mesh check and updates the docs to clearly document the behavior.
The `device_mesh` argument now serves one main purpose: allowing the user to specify the device mesh for the output DTensor reconstruction. A sketch of the relaxed usage follows.
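A hedged sketch (shapes and mesh setup are illustrative; assumes an initialized 8-rank job). The two inputs live on different meshes, which the relaxed check now permits as long as the stated placements match:
```py
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Replicate, Shard, distribute_tensor
from torch.distributed.tensor.experimental import local_map

mesh_act = init_device_mesh("cuda", (8,))  # mesh for activations
mesh_w = init_device_mesh("cuda", (8,))    # mesh for weights; need not match

local_mm = local_map(
    torch.mm,
    out_placements=(Shard(0),),
    in_placements=((Shard(0),), (Replicate(),)),
    device_mesh=mesh_act,  # only used to reconstruct the output DTensor
)

x = distribute_tensor(torch.randn(16, 8), mesh_act, [Shard(0)])
w = distribute_tensor(torch.randn(8, 8), mesh_w, [Replicate()])
y = local_mm(x, w)  # DTensor on mesh_act with placements (Shard(0),)
```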
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157049
Approved by: https://github.com/Chillee, https://github.com/zpcore
With Triton main, things were failing with:
```py
File "/home/jansel/pytorch/torch/_inductor/codecache.py", line 205, in get_system
from triton.compiler.compiler import triton_key
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
ImportError: cannot import name 'triton_key' from 'triton.compiler.compiler' (/home/jansel/pytorch/triton/compiler/compiler.py)
```
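The fix guards against this kind of layout change in Triton. Below is an illustrative defensive-import pattern, not necessarily the actual fix in #157242; `get_triton_cache_key` is a hypothetical helper:
```py
# Illustrative only: degrade gracefully when `triton_key` moves or disappears.
def get_triton_cache_key() -> str:
    try:
        from triton.compiler.compiler import triton_key  # older Triton layout
        return triton_key()
    except ImportError:
        import triton
        # Fall back to the Triton version string as a coarse cache key.
        return f"triton-{getattr(triton, '__version__', 'unknown')}"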
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157242
Approved by: https://github.com/aorenste
This PR makes the necessary changes in order to upgrade PyTorch DLPack
support to version 1.0. In summary, we add support for the following:
- Support both `DLManagedTensor` and `DLManagedTensorVersioned` when
producing and consuming DLPack capsules
- New parameter for `__dlpack__` method: `max_version`
- Version checks:
- Fallback to old implementation if no `max_version` or if version
lower than 1.0
- Check that the to-be-consumed capsule is of version up to 1.X
In order to accommodate these new specifications, this PR adds the
following main changes:
- `torch._C._to_dlpack_versioned` Python API (Module.cpp): new Python
API for creating a versioned DLPack capsule (called by `__dlpack__`
method)
- `DLPackTraits<T>` class (DLConvertor.h): select the correct
traits (e.g. capsule name, conversion functions) depending on which
DLPack tensor class is being used
- `toDLPackImpl<T>` function (DLConvertor.cpp): populates the
common fields of both classes
- `fromDLPackImpl<T>` function (DLConvertor.cpp): constructs a tensor
from a DLPack capsule
- `fillVersion<T>` function (DLConvertor.cpp): populates the version
field for `DLManagedTensorVersioned` (no-op for `DLManagedTensor`)
- `tensor_fromDLPackImpl<T>` function (tensor_new.cpp): outer function
for constructing a tensor out of a DLPack capsule that also marks the
capsule as used
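A quick sketch of the version negotiation from the Python side (a minimal example using a CPU tensor for simplicity):
```py
import torch

t = torch.arange(4)

# The consumer advertises the highest DLPack spec version it understands.
# With max_version >= (1, 0), the producer returns a versioned capsule
# (DLManagedTensorVersioned); without max_version, it falls back to the
# legacy DLManagedTensor capsule.
capsule = t.__dlpack__(max_version=(1, 0))

# torch.from_dlpack consumes either capsule flavor, checking that a
# versioned capsule's major version is at most 1.
u = torch.from_dlpack(capsule)
assert torch.equal(t, u)
```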
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145000
Approved by: https://github.com/albanD