The structure is
```
torch/
__init__.py
version.py
```
When we import torch, only `torch/__init__.py` is executed by default.
The submodules like `version.py` are not automatically imported or attached to the torch module.
So without anything in `__init__.py`, `torch.version` may not be found. So in this PR, we make the import explicit.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157584
Approved by: https://github.com/ezyang
#153622 introduced a hook for getting the relevant code objects after frame tracing. The idea is to have vLLM use this instead of monkey-patching `inline_call_()` to determine the source code files to hash. Unfortunately, the hook runs too late; the vLLM backend needs access to the set of source code filenames while it's running.
This PR replaces the newly-added hook with a utility function that a backend can call to get this information. I've made the change in vLLM and can verify that this allows the information to be queried at the right time.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155249
Approved by: https://github.com/zou3519
AMD is beginning to roll out ROCm distribution via Python wheels. This patch adds the `__init__.py` hook that is necessary to bootstrap ROCm correctly on Linux and Windows when built from these wheels.
See draft, developer documentation describing the mechanism here: https://github.com/ROCm/TheRock/blob/main/docs/packaging/python_packaging.md
This operates to similar effect as how Torch can depend on CUDA wheels, with some differences:
* ROCm libraries and checks are delegated to helpers in the `rocm_sdk` module, which knows how to find and configure access to the installed libraries. This limits the amount of plumbing and path machinations that must match up between the framework and ROCm.
* When building torch against ROCm, no ROCm system install is needed: instead the proper SDK development wheel is installed and the `CMAKE_PREFIX_PATH` is obtained via `rocm-sdk path --cmake`.
* It is expected that whoever produces such a build will also place a generated `_rocm_init.py` in the `torch` module with initialization logic to preload libraries, check versions, verify GPU compatibility, etc.
* See [build_prod_wheels.py](https://github.com/ROCm/TheRock/blob/main/external-builds/pytorch/build_prod_wheels.py) for an example build script that is being used to generate nightlies in this configuration.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155285
Approved by: https://github.com/jeffdaily
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
This PR:
* Expands `Hooks` with a new, optional `frame_traced_fn` field. It should be a callable receiving the list of traced code objects
* Maintains a list of `traced_code` objects in the `TracingContext` of an `OutputGraph`
* Whenever an `inline_call()` is encountered, the corresponding code object is added to this set
* `OutputGraph`'s associated `f_code` is added to the list just before the hook is called
I believe use of this hook should enable the source code hashing that vLLM does in a better way than monkey-patching `inline_call()`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153622
Approved by: https://github.com/jansel
### Changes
- Detect NVSHMEM install location via `sysconfig.get_path("purelib")`, which typically resolves to `<conda_env>/lib/python/site-packages`, and NVSHMEM include and lib live under `nvidia/nvshmem`
- Added link dir via `target_link_directories`
- Removed direct dependency on mlx5
- Added preload rule (following other other NVIDIA libs)
### Plan of Record
1. End user experience: link against NVSHMEM dynamically (NVSHMEM lib size is 100M, similar to NCCL, thus we'd like users to `pip install nvshmem` than torch carrying the bits)
2. Developer experience: at compile time, prefers wheel dependency than using Git submodule
General rule: submodule for small lib that torch can statically link with
If user pip install a lib, our CI build process should do the same, rather than building from Git submodule (just for its header, for example)
3. Keep `USE_NVSHMEM` to gate non-Linux platforms, like Windows, Mac
4. At configuration time, we should be able to detect whether nvshmem is available, if not, we don't build `NVSHMEMSymmetricMemory` at all.
For now, we have symbol dependency on two particular libs from NVSHMEM:
- libnvshmem_host.so: contains host side APIs;
- libnvshmem_device.a: contains device-side global variables AND device function impls.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153010
Approved by: https://github.com/ngimel, https://github.com/fduwjj, https://github.com/Skylion007
Retry of https://github.com/pytorch/pytorch/pull/150957, which was reverted due to internal meta failures
Credit to @mgmtea who wrote the initial version of this PR: https://github.com/pytorch/pytorch/pull/146604
Context: CUPTI is the NVIDIA library that Kineto uses for collecting GPU-side info during profiling. The intended usage is to register a callback while you want profiling to occur, and then unregister the callback when you want profiling to stop. But a bug would cause crashes if CUPTI callbacks were de-registered when used with cudagraphs. The workaround was to disable "CUPTI_LAZY_REINIT" and "CUPTI_TEARDOWN" in Kineto - which prevents crashes, but can result in slower execution after profiling has occurred and completed.
This bug is believed to be fixed in CUDA >= 12.6, so this PR qualifies that DISABLE_CUPTI_LAZY_REINIT=1 and CUPTI_TEARDOWN=0 should only be applied if CUDA >= 12.6. Additionally, `profiler_allow_cudagraph_cupti_lazy_reinit_cuda12()` is added as an escape hatch so that we can add a killswitch in case we see more crashes related to this.
Differential Revision: [D72842114](https://our.internmc.facebook.com/intern/diff/D72842114/)
**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D72842114/)!
Differential Revision: [D72842114](https://our.internmc.facebook.com/intern/diff/D72842114)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151124
Approved by: https://github.com/sraikund16
Credit to @mgmtea who wrote the initial version of this PR: https://github.com/pytorch/pytorch/pull/146604
Context: CUPTI is the NVIDIA library that Kineto uses for collecting GPU-side info during profiling. The intended usage is to register a callback while you want profiling to occur, and then unregister the callback when you want profiling to stop. But a bug would cause crashes if CUPTI callbacks were de-registered when used with cudagraphs. The workaround was to disable "CUPTI_LAZY_REINIT" and "CUPTI_TEARDOWN" in Kineto - which prevents crashes, but can result in slower execution after profiling has occurred and completed.
This bug is believed to be fixed in CUDA >= 12.6, so this PR qualifies that DISABLE_CUPTI_LAZY_REINIT=1 and CUPTI_TEARDOWN=0 should only be applied if CUDA >= 12.6. Additionally, `profiler_allow_cudagraph_cupti_lazy_reinit_cuda12()` is added as an escape hatch so that we can add a killswitch in case we see more crashes related to this.
Differential Revision: [D72745929](https://our.internmc.facebook.com/intern/diff/D72745929)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150957
Approved by: https://github.com/aaronenyeshi, https://github.com/Skylion007
This PR remove the usage of guard_size_oblivious in vector_norm by inlining it in the runtime check,
this prevent any data dependent error from ever appearing here at the locations where guard_size_oblivious
used to exist. Before this PR it used to break potentially. This is NOT BC breaking or changing of semantics from eager.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148809
Approved by: https://github.com/bobrenjc93
Follow up after https://github.com/pytorch/pytorch/pull/148137
Make sure we don't try to load cufile on CUDA 11.8
Test:
```
>>> import torch
/usr/local/lib64/python3.9/site-packages/torch/_subclasses/functional_tensor.py:276: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:81.)
cpu = _conversion_method_template(device=torch.device("cpu"))
>>> torch.__version__
'2.7.0.dev20250227+cu118'
>>>
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148184
Approved by: https://github.com/mikaylagawarecki
Fixes: https://github.com/pytorch/pytorch/issues/148120
Test with almalinux/9-base:latest :
```
>>> import torch
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib64/python3.9/site-packages/torch/__init__.py", line 401, in <module>
from torch._C import * # noqa: F403
ImportError: libcufile.so.0: cannot open shared object file: No such file or directory
>>> exit()
[root@18b37257e416 /]# vi /usr/local/lib64/python3.9/site-packages/torch/__init__.py
[root@18b37257e416 /]# python3
Python 3.9.19 (main, Sep 11 2024, 00:00:00)
[GCC 11.5.0 20240719 (Red Hat 11.5.0-2)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
/usr/local/lib64/python3.9/site-packages/torch/_subclasses/functional_tensor.py:276: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:81.)
cpu = _conversion_method_template(device=torch.device("cpu"))
>>> torch.__version__
'2.7.0.dev20250227+cu126'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148137
Approved by: https://github.com/malfet
Similar fix as: https://github.com/pytorch/pytorch/pull/144816
Fixes: https://github.com/pytorch/pytorch/issues/145580
Found during testing of https://github.com/pytorch/pytorch/issues/138340
Please note both nvrtc and nvjitlink exist for cuda 11.8, 12.4 and 12.6 hence we can safely remove if statement. Preloading can apply to all supporting cuda versions.
CUDA 11.8 path:
```
(.venv) root@b4ffe5c8ac8c:/pytorch/.ci/pytorch/smoke_test# ls /.venv/lib/python3.12/site-packages/torch/lib/../../nvidia/cuda_nvrtc/lib
__init__.py __pycache__ libnvrtc-builtins.so.11.8 libnvrtc-builtins.so.12.4 libnvrtc.so.11.2 libnvrtc.so.12
(.venv) root@b4ffe5c8ac8c:/pytorch/.ci/pytorch/smoke_test# ls /.venv/lib/python3.12/site-packages/torch/lib/../../nvidia/nvjitlink/lib
__init__.py __pycache__ libnvJitLink.so.12
```
Test with rc 2.6 and CUDA 11.8:
```
python cudnn_test.py
2.6.0+cu118
---------------------------------------------SDPA-Flash---------------------------------------------
ALL GOOD
---------------------------------------------SDPA-CuDNN---------------------------------------------
ALL GOOD
```
Thank you @nWEIdia for discovering this issue
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145582
Approved by: https://github.com/nWEIdia, https://github.com/eqy, https://github.com/kit1980, https://github.com/malfet
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
When other toolkit (say CUDA-12.3) is installed and `LD_LIBRARY_PATH` points to there, import torch will fail with
```
ImportError: /usr/local/lib/python3.10/dist-packages/torch/lib/../../nvidia/cusparse/lib/libcusparse.so.12: undefined symbol: __nvJitLinkComplete_12_4, version libnvJitLink.so.12
```
It could not be worked around by tweaking rpath, as it also depends on the library load order, which are not guaranteed by any linker. Instead solve this by preloading `nvjitlink` right after global deps are loaded, by running something along the lines of the following
```python
if version.cuda in ["12.4", "12.6"]:
with open("/proc/self/maps") as f:
_maps = f.read()
# libtorch_global_deps.so always depends in cudart, check if its installed via wheel
if "nvidia/cuda_runtime/lib/libcudart.so" in _maps:
# If all abovementioned conditions are met, preload nvjitlink
_preload_cuda_deps("nvjitlink", "libnvJitLink.so.*[0-9]")
```
Fixes https://github.com/pytorch/pytorch/issues/140797
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141063
Approved by: https://github.com/kit1980
Co-authored-by: Sergii Dymchenko <sdym@meta.com>
Fixes https://github.com/pytorch/pytorch/issues/144433
Test with some debug statements added:
```
>>> import torch
trying to load libcublas.so.*[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/cublas/lib/libcublas.so.12']
trying to load libcublas.so.*[0-9] from /usr/local/lib/python3.9/site-packages/nvidia/cublas/lib/libcublas.so.12
trying to load libcudnn.so.*[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/cudnn/lib/libcudnn.so.9']
trying to load libcudnn.so.*[0-9] from /usr/local/lib/python3.9/site-packages/nvidia/cudnn/lib/libcudnn.so.9
trying to load libnvrtc.so.*[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/cuda_nvrtc/lib/libnvrtc.so.12']
trying to load libnvrtc.so.*[0-9] from /usr/local/lib/python3.9/site-packages/nvidia/cuda_nvrtc/lib/libnvrtc.so.12
trying to load libcudart.so.*[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/cuda_runtime/lib/libcudart.so.12']
trying to load libcudart.so.*[0-9] from /usr/local/lib/python3.9/site-packages/nvidia/cuda_runtime/lib/libcudart.so.12
trying to load libcupti.so.*[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/cuda_cupti/lib/libcupti.so.12']
trying to load libcupti.so.*[0-9] from /usr/local/lib/python3.9/site-packages/nvidia/cuda_cupti/lib/libcupti.so.12
trying to load libcufft.so.*[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/cufft/lib/libcufft.so.11']
trying to load libcufft.so.*[0-9] from /usr/local/lib/python3.9/site-packages/nvidia/cufft/lib/libcufft.so.11
trying to load libcurand.so.*[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/curand/lib/libcurand.so.10']
trying to load libcurand.so.*[0-9] from /usr/local/lib/python3.9/site-packages/nvidia/curand/lib/libcurand.so.10
trying to load libnvJitLink.so.*[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/nvjitlink/lib/libnvJitLink.so.12']
trying to load libnvJitLink.so.*[0-9] from /usr/local/lib/python3.9/site-packages/nvidia/nvjitlink/lib/libnvJitLink.so.12
trying to load libcusparse.so.*[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/cusparse/lib/libcusparse.so.12']
trying to load libcusparse.so.*[0-9] from /usr/local/lib/python3.9/site-packages/nvidia/cusparse/lib/libcusparse.so.12
trying to load libcusparseLt.so.*[0-9] from []
trying to load libcusparseLt.so.*[0-9] from /usr/local/lib/python3.9/site-packages/cusparselt/lib/libcusparseLt.so.0
trying to load libcusolver.so.*[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/cusolver/lib/libcusolver.so.11']
trying to load libcusolver.so.*[0-9] from /usr/local/lib/python3.9/site-packages/nvidia/cusolver/lib/libcusolver.so.11
trying to load libnccl.so.*[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/nccl/lib/libnccl.so.2']
trying to load libnccl.so.*[0-9] from /usr/local/lib/python3.9/site-packages/nvidia/nccl/lib/libnccl.so.2
trying to load libnvToolsExt.so.*[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/nvtx/lib/libnvToolsExt.so.1']
trying to load libnvToolsExt.so.*[0-9] from /usr/local/lib/python3.9/site-
packages/nvidia/nvtx/lib/libnvToolsExt.so.1
/usr/local/lib64/python3.9/site-packages/torch/_subclasses/functional_tensor.py:275: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:81.)
cpu = _conversion_method_template(device=torch.device("cpu"))
>>> exit()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144477
Approved by: https://github.com/Skylion007, https://github.com/nWEIdia