pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-10-21 13:44:15 +08:00

Author	SHA1	Message	Date
joshuamarkovic	559e8d1c20	[doc]: Small typos (#162982 ) Small typo fixes Pull Request resolved: https://github.com/pytorch/pytorch/pull/162982 Approved by: https://github.com/ezyang, https://github.com/zou3519	2025-09-16 17:42:19 +00:00
Aaryaman Vasishta	52af91e4c1	[ROCm/Windows] Support load_inline on windows (#162577 ) Supports `torch.utils.cpp_extension.load_inline` on Windows with ROCm. Tested on Windows with gfx1201. Note that it currently only works when CC and CXX are set to `clang-cl`. This is also needed when building extensions via. `setuptools` due to linker errors when using `cl` directly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162577 Approved by: https://github.com/ezyang	2025-09-12 08:10:07 +00:00
Shangdi Yu	636a511084	[aoti] add config for libtorch free so (#162655 ) Users can specify the following to get a libtorch_free `.so`: "aot_inductor.use_libtorch": False, The following config is only used for torchnative (see https://github.com/meta-pytorch/torchnative/pull/110). It's not intended to be used by executorch. We need it for torchnative because many of the symbol definitions in the torchnative repo are only in header files. "aot_inductor.libtorch_free_header": "/data/users/shangdiy/torchnative/standalone,/data/users/shangdiy/torchnative/" (or their custom headers) The main motivating use case is for executorch to produce a libtorch free `.so`. TODO for follow-up PR: this flag should be consolidated with the `compile_standalone` flag. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162655 Approved by: https://github.com/angelayi	2025-09-12 07:31:04 +00:00
Mark Saroufim	f7e8321961	fix cpp extension distributed warning spew (#162764 ) With the new change we only log the warning if we're running non-distributed code or if we're in rank 0. Unit testing that certain messages get printed on certain ranks only feels kinda jank, so the test plan is below instead. Test plan ```python # torchrun --nproc_per_node=2 demo_fix.py import os import logging logging.getLogger('torch.utils.cpp_extension').setLevel(logging.DEBUG) import torch if 'RANK' in os.environ: torch.distributed.init_process_group('nccl') from torch.utils.cpp_extension import _get_cuda_arch_flags _get_cuda_arch_flags() print(f"Rank {os.environ.get('RANK', '0')} done") ``` Logs showing how `TORCH_CUDA_ARCH_LIST` only shows up once if we explicitly set the logging level to `logging.DEBUG`. It also improves the debug message to explain what the actual behavior will be ``` (source) [marksaroufim@devgpu005]~% torchrun --nproc_per_node=2 demo_fix.py W0911 18:30:16.594000 1315439 /home/marksaroufim/pytorch/torch/distributed/run.py:814] W0911 18:30:16.594000 1315439 /home/marksaroufim/pytorch/torch/distributed/run.py:814] *************************************** W0911 18:30:16.594000 1315439 /home/marksaroufim/pytorch/torch/distributed/run.py:814] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. W0911 18:30:16.594000 1315439 /home/marksaroufim/pytorch/torch/distributed/run.py:814] ************************************* [rank0]:V0911 18:30:18.921000 1316753 pytorch/torch/utils/cpp_extension.py:2444] TORCH_CUDA_ARCH_LIST is not set, using TORCH_CUDA_ARCH_LIST='10.0+PTX' for visible GPU architectures. Set os.environ['TORCH_CUDA_ARCH_LIST'] to override. 
Rank 0 done Rank 1 done ``` But if we just use the default and comment out `logging.getLogger('torch.utils.cpp_extension').setLevel(logging.DEBUG)` Then we get ``` (source) [marksaroufim@devgpu005]~% torchrun --nproc_per_node=2 demo_fix.py W0911 18:14:33.926000 690759 /home/marksaroufim/pytorch/torch/distributed/run.py:814] W0911 18:14:33.926000 690759 /home/marksaroufim/pytorch/torch/distributed/run.py:814] ************************************* W0911 18:14:33.926000 690759 /home/marksaroufim/pytorch/torch/distributed/run.py:814] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. W0911 18:14:33.926000 690759 /home/marksaroufim/pytorch/torch/distributed/run.py:814] *************************************** Rank 0 done Rank 1 done (source) [marksaroufim@devgpu005]~% ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/162764 Approved by: https://github.com/ezyang, https://github.com/zou3519	2025-09-12 06:12:46 +00:00
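The rank-gating idea in the commit above can be sketched in a few lines. This is a minimal illustration, not the code from the PR: it assumes the `RANK` environment variable set by `torchrun` is enough to identify the process, and the logger name `demo.cpp_extension` and both helper functions are hypothetical.

```python
import logging
import os

# Hypothetical logger name, used only for this sketch.
logger = logging.getLogger("demo.cpp_extension")


def is_primary_rank(environ=os.environ) -> bool:
    """True for non-distributed runs (no RANK set) or for rank 0."""
    rank = environ.get("RANK")
    return rank is None or rank == "0"


def log_arch_default(arch_list: str, environ=os.environ) -> bool:
    """Emit the default-arch debug message on the primary rank only.

    Returns True when the message was handed to the logger.
    """
    if not is_primary_rank(environ):
        return False
    # Lazy %s formatting: the string is only built if DEBUG is enabled.
    logger.debug(
        "TORCH_CUDA_ARCH_LIST is not set, using TORCH_CUDA_ARCH_LIST='%s'",
        arch_list,
    )
    return True
```

With two processes launched by `torchrun`, only the one with `RANK=0` would reach the `logger.debug` call, which matches the single `[rank0]` line in the logs above.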
Guilherme Leobas	789d494212	Defer loading hipify until it is needed (#160824 ) Saves a few milliseconds when running a test case: Before: ``` $ PYTORCH_TEST_WITH_DYNAMO=1 python test/dynamo/cpython/3_13/test_float.py GeneralFloatCases.test_float_pow frames [('total', 1), ('ok', 1)] inline_call [] . ---------------------------------------------------------------------- Ran 1 test in 1.497s ``` After: ``` $ PYTORCH_TEST_WITH_DYNAMO=1 python test/dynamo/cpython/3_13/test_float.py GeneralFloatCases.test_float_pow frames [('total', 1), ('ok', 1)] inline_call [] . ---------------------------------------------------------------------- Ran 1 test in 0.909s ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/160824 Approved by: https://github.com/zou3519	2025-09-02 15:27:37 +00:00
PyTorch MergeBot	13b65196db	Revert "Defer loading hipify until it is needed (#160824 )" This reverts commit 403a3a393cda7e60f503f3b04b8805a845dcf45d. Reverted https://github.com/pytorch/pytorch/pull/160824 on behalf of https://github.com/atalman due to Broke slow tests test_utils.py::TestHipifyTrie::test_special_char_export_trie_to_regex [GH job link](https://github.com/pytorch/pytorch/actions/runs/17387051351/job/49355619371) [HUD commit link](`403a3a393c`) ([comment](https://github.com/pytorch/pytorch/pull/160824#issuecomment-3243281628))	2025-09-01 21:34:13 +00:00
Guilherme Leobas	403a3a393c	Defer loading hipify until it is needed (#160824 ) Saves a few milliseconds when running a test case: Before: ``` $ PYTORCH_TEST_WITH_DYNAMO=1 python test/dynamo/cpython/3_13/test_float.py GeneralFloatCases.test_float_pow frames [('total', 1), ('ok', 1)] inline_call [] . ---------------------------------------------------------------------- Ran 1 test in 1.497s ``` After: ``` $ PYTORCH_TEST_WITH_DYNAMO=1 python test/dynamo/cpython/3_13/test_float.py GeneralFloatCases.test_float_pow frames [('total', 1), ('ok', 1)] inline_call [] . ---------------------------------------------------------------------- Ran 1 test in 0.909s ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/160824 Approved by: https://github.com/zou3519	2025-09-01 20:57:41 +00:00
Benjamin Glass	cbc53b7696	Update pybind11 submodule to 3.0.1 (#160754 ) Upgrade to PyBind11 v3. This allows us to strip out our own (possibly broken?) handling of the C++ ABI when building extensions, in favor of the more-complete PyBind11 internal handling. Fixes a few test failures due to https://github.com/pybind/pybind11/issues/5774, which effectively makes the `__qualname__` attribute of functions platform-dependent. Test plan: CI Pull Request resolved: https://github.com/pytorch/pytorch/pull/160754 Approved by: https://github.com/Skylion007	2025-08-27 21:15:01 +00:00
PyTorch MergeBot	1b34e04485	Revert "Update pybind11 submodule to 3.0.1 (#160754 )" This reverts commit 660b0b8128181d11165176ea3f979fa899f24db1. Reverted https://github.com/pytorch/pytorch/pull/160754 on behalf of https://github.com/atalman due to please see https://github.com/pytorch/pytorch/pull/160754#issuecomment-3226051449 ([comment](https://github.com/pytorch/pytorch/pull/160754#issuecomment-3226078102))	2025-08-26 23:35:22 +00:00
Benjamin Glass	660b0b8128	Update pybind11 submodule to 3.0.1 (#160754 ) Upgrade to PyBind11 v3. This allows us to strip out our own (possibly broken?) handling of the C++ ABI when building extensions, in favor of the more-complete PyBind11 internal handling. Fixes a few test failures due to https://github.com/pybind/pybind11/issues/5774, which effectively makes the `__qualname__` attribute of functions platform-dependent. Test plan: CI Pull Request resolved: https://github.com/pytorch/pytorch/pull/160754 Approved by: https://github.com/Skylion007	2025-08-26 01:21:18 +00:00
Dmitry Rogozhkin	eb5549a431	xpu: fix cpp_extension compatibility with oneapi dpc++ 2025.2 compiler (#161012 ) The Intel oneAPI DPC++ compiler has changed (fixed) its parsing of the `-fsycl-host-compiler-options` option with respect to how it treats arguments with escaped quotes. This commit adds branches that depend on the compiler version. Fixes: https://github.com/intel/torch-xpu-ops/issues/1938 CC: @chuanqi129 @EikanWang @guangyey Pull Request resolved: https://github.com/pytorch/pytorch/pull/161012 Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-08-25 09:29:53 +00:00
Scott Todd	b7ca502f29	[ROCm][Windows] Add hipcc compatibility flags to cpp_extension.py. (#159790 ) This is a similar change to https://github.com/pytorch/pytorch/pull/153986, this time adding flags to the hipcc command under `cpp_extension.py`. The `-Wno-ignored-attributes` flag in particular avoids about 200MB of warning spam when building torchvision, like these: ``` In file included from D:\b\vision_main\torchvision\csrc\ops\hip\deform_conv2d_kernel.hip:72: In file included from D:\projects\TheRock\external-builds\pytorch\.venv\Lib\site-packages\torch\include\ATen/ATen.h:13: In file included from D:\projects\TheRock\external-builds\pytorch\.venv\Lib\site-packages\torch\include\ATen/Functions.h:386: In file included from D:\projects\TheRock\external-builds\pytorch\.venv\Lib\site-packages\torch\include\ATen/ops/_sparse_softmax.h:21: D:\projects\TheRock\external-builds\pytorch\.venv\Lib\site-packages\torch\include\ATen/ops/_sparse_softmax_ops.h:18:8: warning: __declspec attribute 'dllimport' is not supported [-Wignored-attributes] 18 \| struct TORCH_API _sparse_softmax_int { \| ^~~~~~~~~ D:\projects\TheRock\external-builds\pytorch\.venv\Lib\site-packages\torch\include\torch/headeronly/macros/Export.h:100:19: note: expanded from macro 'TORCH_API' 100 \| #define TORCH_API C10_IMPORT \| ^~~~~~~~~~ D:\projects\TheRock\external-builds\pytorch\.venv\Lib\site-packages\torch\include\torch/headeronly/macros/Export.h:53:31: note: expanded from macro 'C10_IMPORT' 53 \| #define C10_IMPORT __declspec(dllimport) \| ^~~~~~~~~ ``` The `-fms-extensions` flag just seems beneficial to include: https://clang.llvm.org/docs/MSVCCompatibility.html. See also this downstream issue where these changes were tested: https://github.com/ROCm/TheRock/issues/910. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159790 Approved by: https://github.com/jeffdaily	2025-08-16 02:20:49 +00:00
Johnny	9c5601ecc3	[NVIDIA] Refactor Family Blackwell Support codegen (#156176 ) With the legacy driver (nvgpu) used for CUDA 12.9, Thor was operating with SM 10.1. This changes to SM 11.0 when the newer driver model (OpenRM), which is intended for CUDA 13.0, is introduced. Thor 10.1 --> 11.0 Spark 12.1 Pull Request resolved: https://github.com/pytorch/pytorch/pull/156176 Approved by: https://github.com/ezyang	2025-08-15 02:51:26 +00:00
Hao Zhang(张浩)	ab8874bd26	Suppress warning when using native arch for jit loading cuda extensions. (#156923 ) Previously, users who wanted PyTorch to determine the CUDA arch when JIT-loading CUDA extensions had to leave the environment variable `TORCH_CUDA_ARCH_LIST` empty, which raised a warning. This commit adds the option of setting `TORCH_CUDA_ARCH_LIST=native` to tell PyTorch that the user intentionally wants the native CUDA arch. Pull Request resolved: https://github.com/pytorch/pytorch/pull/156923 Approved by: https://github.com/ezyang	2025-07-09 02:51:20 +00:00
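The three cases this commit distinguishes (variable unset, `native`, explicit list) can be illustrated with a small sketch. The helper `resolve_arch_list` and its `visible_archs` argument are hypothetical, and the separator handling is an assumption; the sketch only mirrors the behavior described in the message, not the actual code in `cpp_extension.py`.

```python
import warnings


def resolve_arch_list(env_value, visible_archs):
    """Hypothetical helper: choose the arch list for a JIT build.

    env_value     -- value of TORCH_CUDA_ARCH_LIST, or None/"" when unset
    visible_archs -- archs detected on the visible GPUs, e.g. ["8.9"]
    """
    if env_value == "native":
        # Explicit opt-in: use the detected archs and stay quiet.
        return list(visible_archs)
    if not env_value:
        # Unset or empty: same result, but the user is warned.
        warnings.warn(
            "TORCH_CUDA_ARCH_LIST is not set; using archs of visible GPUs."
        )
        return list(visible_archs)
    # Explicit list. Assumption: ';' and ' ' are both accepted as separators.
    return [arch for arch in env_value.replace(" ", ";").split(";") if arch]
```

In a script, the opt-in amounts to `os.environ["TORCH_CUDA_ARCH_LIST"] = "native"` before the extension is loaded.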
Sv. Lockal	7cda4017dd	Fix torch.utils.cpp_extension parser for clang version 20.1.7+libcxx (#157666 ) When CC and CXX compiler is set to clang, and clang was compiled with libc++, compilation of torchvision fails with: ``` File "/usr/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 585, in build_extensions compiler_name, compiler_version = self._check_abi() ^^^^^^^^^^^^^^^^^ File "/usr/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 1034, in _check_abi _, version = get_compiler_abi_compatibility_and_version(compiler) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 449, in get_compiler_abi_compatibility_and_version if tuple(map(int, version)) >= minimum_required_version: ^^^^^^^^^^^^^^^^^^^^^^^^ ValueError: invalid literal for int() with base 10: '7+libcxx' ``` Compiler identification is a valid semantic version: ``` $ clang -dumpfullversion -dumpversion 20.1.7+libcxx ``` After adjusting parser of version, clang is able to compile extensions successfully. Fixes #157665 Pull Request resolved: https://github.com/pytorch/pytorch/pull/157666 Approved by: https://github.com/msaroufim	2025-07-06 01:35:00 +00:00
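The failure in the traceback above comes from calling `int()` on `'7+libcxx'`. One tolerant way to parse such a string is sketched below; `parse_compiler_version` is a hypothetical name and this is not necessarily how the PR adjusted the parser. It keeps only the leading digits of each dot-separated component.

```python
import re


def parse_compiler_version(raw: str) -> tuple:
    """Return the leading numeric components of a compiler version string.

    Suffixes such as '+libcxx' are ignored; parsing stops at the first
    component that does not start with a digit.
    """
    parts = []
    for piece in raw.strip().split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


# The result compares cleanly against a minimum-version tuple.
minimum_required_version = (5, 0, 0)  # illustrative value only
print(parse_compiler_version("20.1.7+libcxx") >= minimum_required_version)
```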
Xuehai Pan	d40aaa42ee	[BE][16/16] fix typos in torch/ (torch/utils/) (#156606 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/156606 Approved by: https://github.com/albanD ghstack dependencies: #156318, #156320, #156602, #156604	2025-07-02 22:55:29 +00:00
Mark Saroufim	18b01afa9e	load inline user overridable gencode (#156850 ) Fixes https://github.com/pytorch/pytorch/issues/156815 As far as testing goes * I tried to use cuobjdump but that was kinda goofy `bccd9393a5` the problem was that the name of the cubin will have a single gencode always * Another idea was to read stderr and check that the right amount of gencodes is there `0beadc01b3` this helped a lot to convince me locally that this test works, the test passed on my dev gpu but was failing in CI and I suspect it's because of a bad interaction with subprocesses * Last approach was to have a simpler unit test to check which flags get added by default, this is not as comprehensive as the previous ideas but it works and is fast so will opt for this since I'm convinced testing is working per my own experiments and customers Pull Request resolved: https://github.com/pytorch/pytorch/pull/156850 Approved by: https://github.com/malfet	2025-06-26 10:15:08 +00:00
PyTorch MergeBot	b1d62febd0	Revert "Use official CUDAToolkit module in CMake (#154595 )" This reverts commit 08dae945ae380d80efbaf140a95abfc5d96e5100. Reverted https://github.com/pytorch/pytorch/pull/154595 on behalf of https://github.com/malfet due to It breaks on some local setup with no clear diagnostic, but looks like it fails to find cuFile ([comment](https://github.com/pytorch/pytorch/pull/154595#issuecomment-2997959344))	2025-06-23 21:15:31 +00:00
cyy	08dae945ae	Use official CUDAToolkit module in CMake (#154595 ) Use CUDA language in CMake and remove forked FindCUDAToolkit.cmake. Some CUDA targets are also renamed with `torch::` prefix. Pull Request resolved: https://github.com/pytorch/pytorch/pull/154595 Approved by: https://github.com/albanD	2025-06-22 05:44:29 +00:00
Xuehai Pan	63360e64da	[BE][Easy] do not install yanked `types-pkg-resources` in lint environment (#156462 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/156462 Approved by: https://github.com/ezyang	2025-06-20 16:00:43 +00:00
Dmitry Rogozhkin	443b5b43c3	xpu: fix AOT compilation in sycl cpp extension (#156364 ) This commit fixes AOT compilation in the SYCL cpp extension, which was accidentally dropped in aca2c99a652 (the build fell back to JIT compilation). It also fixes the override logic for default SYCL targets, allowing targets to be specified externally. Further, it extends test coverage to cover this case and fixes an issue in the test where consecutive tests executed the same (first) compiled extension due to name conflicts. Fixes: #156249 Fixes: aca2c99a652 ("xpu: get xpu arch flags at runtime in cpp_extensions (#152192)") CC: @pengxin99, @guangyey Pull Request resolved: https://github.com/pytorch/pytorch/pull/156364 Approved by: https://github.com/ezyang	2025-06-19 20:11:38 +00:00
Jithun Nair	e4adf5df39	[ROCm] cpp_extension allow user to override default flags (#152432 ) We need -fgpu-rdc for projects such as DeepEP + rocSHMEM. The default of -no-gpu-rdc doesn't work for such cases. As per https://github.com/pytorch/pytorch/pull/152432#issuecomment-2840899088: "rocshmem shares the same global variable in different files, as deepEP uses CUDAExtention to build the project `65e2a700f0/setup.py (L51)` and depends on rocshmem, this -fgpu-rdc is needed. The current logic in Pytorch prevents users from overriding this flag." Pull Request resolved: https://github.com/pytorch/pytorch/pull/152432 Approved by: https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-05-15 21:06:18 +00:00
Aaron Gokaslan	fb85ebd710	[BE]: Use undocumented temp shim to restore setuptools compat (#153052 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/153052 Approved by: https://github.com/albanD	2025-05-12 18:33:41 +00:00
Dmitry Rogozhkin	aca2c99a65	xpu: get xpu arch flags at runtime in cpp_extensions (#152192 ) This commit moves the query for xpu arch flags to runtime when building SYCL extensions, which allows adjusting `TORCH_XPU_ARCH_LIST` at the Python script level. That's handy, for example, in a CI test that tries a few variants of the list. CC: @malfet, @jingxu10, @EikanWang, @guangyey Pull Request resolved: https://github.com/pytorch/pytorch/pull/152192 Approved by: https://github.com/guangyey, https://github.com/gujinghui, https://github.com/albanD	2025-05-09 05:43:50 +00:00
Jithun Nair	88b56774bd	At least one of ROCM_HOME or CUDA_HOME must be None (#152236 ) Copied description by @hj-wei from https://github.com/ROCm/pytorch/pull/1809 > Hi all, I manually generating nvcc to bypass NVIDIA component checks(Megatron-LM), see `2da43ef4c1/megatron/legacy/fused_kernels/__init__.py (L57)` > but it can lead to incorrect CUDA_HOME configurations. This can cause initialization anomalies in downstream libraries like DeepSpeed Pull Request resolved: https://github.com/pytorch/pytorch/pull/152236 Approved by: https://github.com/jeffdaily	2025-05-08 22:20:25 +00:00
Mark Saroufim	7a3cae4b20	Configurable logging for cpp_extensions.py (#152260 ) Today `cpp_extensions` makes heavy use of printing to stderr. This makes our life harder in KernelBot, where we typically rely on stderr to only surface real errors, but today cpp_extensions leverages stderr for updates that could be qualified as INFO, WARNING, ERROR. Now instead we'll recommend users of our cpp extension system to do something like ```python import logging cpp_ext_logger = logging.getLogger("torch.utils.cpp_extension") cpp_ext_logger.setLevel(logging.WARNING) ``` While this dramatically reduces log spew, it can be viewed as a BC breaking change if people were relying on certain strings being present in stdout or stderr. Considering different teams might want to silence errors differently, this PR proposes replacing all `print()` statements with `logging` statements with the same heuristics that the python logging module recommends 1. DEBUG: For things like detailed compilation steps or reading filepaths - by default gets logged on stdout 2. INFO: Build progress - by default gets logged on stdout 3. WARNING: Surfacing issues that might cause bad performance or slow compilation times - by default gets logged on stdout 4. ERROR: Problems that prevent proper functioning - by default gets logged on stdout Note that warnings.warn is a different library and is not hooked up to the python logging module by default. So the goal of this PR is to make it possible for teams to set the logging that is most appropriate to them. One annoying thing is that logger throws ruff errors if you try to use it in conjunction with f strings or .format, so we have to use old school %s. An unrelated improvement I'd be happy to push to a separate PR is adding support for "native" in `TORCH_CUDA_ARCH_LIST`, which would just pick the ARCH for the current device. An example of what's in stderr today ``` Using /root/.cache/torch_extensions/py311_cu124 as PyTorch extensions root... 
Detected CUDA files, patching ldflags Emitting ninja build file /root/.cache/torch_extensions/py311_cu124/grayscale/build.ninja... /usr/local/lib/python3.11/site-packages/torch/utils/cpp_extension.py:2059: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST']. warnings.warn( Building extension module grayscale... Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) Loading extension module grayscale... /usr/local/lib/python3.11/site-packages/torch/_dynamo/variables/functions.py:679: UserWarning: Graph break due to unsupported builtin grayscale.PyCapsule.grayscale. This function is either a Python builtin (e.g. _warnings.warn) or a third-party C/C++ Python extension (perhaps created with pybind). If it is a Python builtin, please file an issue on GitHub so the PyTorch team can add support for it and see the next case for a workaround. If it is a third-party C/C++ Python extension, please either wrap it into a PyTorch-understood custom operator (see https://pytorch.org/tutorials/advanced/custom_ops_landing_page.html for more details) or, if it is traceable, use torch.compiler.allow_in_graph. 
torch._dynamo.utils.warn_once(msg) ``` Whereas after this PR users can do `python benchmark_load_inline.py > >(tee stdout.txt) 2> >(tee stderr.txt >&2)` ```python import os import sys from pathlib import Path import shutil import tempfile import torch from torch.utils.cpp_extension import load_inline import logging cpp_ext_logger = logging.getLogger("torch.utils.cpp_extension") cpp_ext_logger.setLevel(logging.WARNING) os.environ["TORCH_CUDA_ARCH_LIST"] = "native" cpp_code = """ torch::Tensor to_gray(torch::Tensor input); """ cuda_kernel_code = """ torch::Tensor to_gray(torch::Tensor input) { auto output = torch::epty({input.size(0), input.size(1)}, input.options()); return output ; } """ # Avoid caching results with tempfile.TemporaryDirectory() as build_dir: cuda_module = load_inline( name="to_gray_cuda", cpp_sources=cpp_code, cuda_sources=cuda_kernel_code, functions=["to_gray"], with_cuda=True, verbose=True, extra_cflags=["-std=c++17"], # "-ftime-report", "-H"], extra_cuda_cflags=["-arch=sm_89"], build_directory=build_dir, ) ``` ## New logs ### On failure Which gives a much more reasonable stdout ``` [1/3] /usr/local/cuda-12.8/bin/nvcc --generate-dependencies-with-compile --dependency-output cuda.cuda.o.d -DTORCH_EXTENSION_NAME=to_gray_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -isystem /home/marksaroufim/pytorch/torch/include -isystem /home/marksaroufim/pytorch/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -isystem /usr/local/cuda/targets/x86_64-linux/include -isystem /home/marksaroufim/.conda/envs/nv/include/python3.10 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -arch=sm_89 -std=c++17 -c /tmp/tmpbg_xzv0r/cuda.cu -o cuda.cuda.o 
FAILED: cuda.cuda.o /usr/local/cuda-12.8/bin/nvcc --generate-dependencies-with-compile --dependency-output cuda.cuda.o.d -DTORCH_EXTENSION_NAME=to_gray_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -isystem /home/marksaroufim/pytorch/torch/include -isystem /home/marksaroufim/pytorch/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -isystem /usr/local/cuda/targets/x86_64-linux/include -isystem /home/marksaroufim/.conda/envs/nv/include/python3.10 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -arch=sm_89 -std=c++17 -c /tmp/tmpbg_xzv0r/cuda.cu -o cuda.cuda.o /tmp/tmpbg_xzv0r/cuda.cu(6): error: namespace "torch" has no member "epty" auto output = torch::epty({input.size(0), input.size(1)}, input.options()); ^ 1 error detected in the compilation of "/tmp/tmpbg_xzv0r/cuda.cu". [2/3] c++ -MMD -MF main.o.d -DTORCH_EXTENSION_NAME=to_gray_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -isystem /home/marksaroufim/pytorch/torch/include -isystem /home/marksaroufim/pytorch/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -isystem /usr/local/cuda/targets/x86_64-linux/include -isystem /home/marksaroufim/.conda/envs/nv/include/python3.10 -fPIC -std=c++17 -std=c++17 -c /tmp/tmpbg_xzv0r/main.cpp -o main.o ninja: build stopped: subcommand failed. 
``` And stderr ``` Traceback (most recent call last): File "/home/marksaroufim/pytorch/torch/utils/cpp_extension.py", line 2874, in _run_ninja_build subprocess.run( File "/home/marksaroufim/.conda/envs/nv/lib/python3.10/subprocess.py", line 526, in run raise CalledProcessError(retcode, process.args, subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1. The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/home/marksaroufim/load_inline_slow/benchmark_load_inline.py", line 30, in <module> cuda_module = load_inline( File "/home/marksaroufim/pytorch/torch/utils/cpp_extension.py", line 2261, in load_inline return _jit_compile( File "/home/marksaroufim/pytorch/torch/utils/cpp_extension.py", line 2367, in _jit_compile _write_ninja_file_and_build_library( File "/home/marksaroufim/pytorch/torch/utils/cpp_extension.py", line 2528, in _write_ninja_file_and_build_library _run_ninja_build( File "/home/marksaroufim/pytorch/torch/utils/cpp_extension.py", line 2892, in _run_ninja_build raise RuntimeError(message) from e RuntimeError: Error building extension 'to_gray_cuda' ``` ### On success stdout ``` [1/3] c++ -MMD -MF main.o.d -DTORCH_EXTENSION_NAME=to_gray_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -isystem /home/marksaroufim/pytorch/torch/include -isystem /home/marksaroufim/pytorch/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -isystem /usr/local/cuda/targets/x86_64-linux/include -isystem /home/marksaroufim/.conda/envs/nv/include/python3.10 -fPIC -std=c++17 -std=c++17 -c /tmp/tmpxv_ovlrf/main.cpp -o main.o [2/3] /usr/local/cuda-12.8/bin/nvcc --generate-dependencies-with-compile --dependency-output cuda.cuda.o.d -DTORCH_EXTENSION_NAME=to_gray_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" 
-DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -isystem /home/marksaroufim/pytorch/torch/include -isystem /home/marksaroufim/pytorch/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -isystem /usr/local/cuda/targets/x86_64-linux/include -isystem /home/marksaroufim/.conda/envs/nv/include/python3.10 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -arch=sm_89 -std=c++17 -c /tmp/tmpxv_ovlrf/cuda.cu -o cuda.cuda.o [3/3] c++ main.o cuda.cuda.o -shared -L/home/marksaroufim/pytorch/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda -ltorch -ltorch_python -L/usr/local/cuda-12.8/lib64 -lcudart -o to_gray_cuda.so ``` And an empty stderr as expected Pull Request resolved: https://github.com/pytorch/pytorch/pull/152260 Approved by: https://github.com/albanD	2025-04-30 18:30:28 +00:00
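As a complement to the `setLevel` snippet in the commit above, the sketch below attaches an explicit handler so a team can choose where the `torch.utils.cpp_extension` logger writes. The logger name comes from the commit message; the function `configure_cpp_extension_logging`, the format string, and the choice to disable propagation are assumptions made for illustration.

```python
import logging
import sys


def configure_cpp_extension_logging(level=logging.WARNING, stream=None):
    """Route cpp_extension log records to one stream at a chosen level."""
    logger = logging.getLogger("torch.utils.cpp_extension")
    logger.setLevel(level)
    handler = logging.StreamHandler(stream if stream is not None else sys.stdout)
    handler.setFormatter(logging.Formatter("%(levelname)s %(name)s: %(message)s"))
    logger.addHandler(handler)
    # Keep records from also reaching the root logger's handlers.
    logger.propagate = False
    return logger


cpp_ext_logger = configure_cpp_extension_logging()
# %s-style arguments, as the commit notes, rather than f-strings or .format.
cpp_ext_logger.warning("Building extension module %s...", "grayscale")
```

Records below the chosen level, such as INFO build progress, are dropped before they reach the handler.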
Aidyn-A	36acaaae3f	[CUDA] Add new architectures (#152414 ) CUDA 12.9 will introduce a couple of new architectures `sm_103` and `sm_121`. We do not need to build for them, because they are going to be compatible with`sm_100` and `sm_120` respectively (similar to `sm_86` and `sm_89`), but PyTorch must be "aware" of them. Pull Request resolved: https://github.com/pytorch/pytorch/pull/152414 Approved by: https://github.com/Skylion007, https://github.com/eqy, https://github.com/malfet	2025-04-30 09:55:27 +00:00
Jing Xu	2089b22c76	[xpu] set aot device flags in cpp_extension (#149459 ) If PyTorch is compiled with only AOT text strings starting with "dg2", the `_get_sycl_arch_list()` function will pass an empty string to `-device` argument of `ocloc` and then cause a compilation crash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149459 Approved by: https://github.com/guangyey, https://github.com/dvrogozh, https://github.com/malfet Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com> Co-authored-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>	2025-04-24 22:55:52 +00:00
Prachi Gupta	b8f4dc5a9f	[ROCm] opportunistic fastatomics for ReduceAdd operations for MI300 GPUs (#146264 ) In this approach, we are catching any lane within a wave that is doing fastatomics to the same destination address and computing the sum on the CU. This is leading to 3x improvement in scatter_add performance and 2x improvement in index_select. scatter_add performance on MI300x: dtype\|Baseline (before optimizations)\|opportunistic fastatomics -------\|----------------------------------\|---------------------------------- f32\|1.389425039\|0.430447996 fp16\|2.195472956\|0.779729486 bf16\|2.194051027\|0.784599513 Using the following reproducer ``` import torch import triton def main(): dtype = torch.float32 dim = 1305301 a = torch.rand(100, device="cuda", dtype=dtype) index = torch.randint(0, 100, (dim,), device="cuda") src = torch.rand(dim, device="cuda", dtype=dtype) print("=" * 20) print( triton.testing.do_bench( lambda: a.scatter_add(0, index, src), return_mode="median", ) ) print("=" * 20) if __name__ == "__main__": main() ``` co-authored by: @amd-hhashemi Pull Request resolved: https://github.com/pytorch/pytorch/pull/146264 Approved by: https://github.com/jeffdaily, https://github.com/mxz297 Co-authored-by: Hashem Hashemi <hashem.hashemi@amd.com>	2025-04-22 21:55:40 +00:00
albanD	2a909cab16	Update ninja missing error message (#147698 ) In cpp_extensions Pull Request resolved: https://github.com/pytorch/pytorch/pull/147698 Approved by: https://github.com/Skylion007	2025-04-11 21:56:53 +00:00
Akash Verma	e9e5682a4a	[ROCm] Build Pytorch extensions with amdclang++ (#150451 ) The following modifications were made to cpp_extension.py: 1) Changed the compiler flag to use --version. 2) Added conversion of the alphanumeric version string returned by the compiler to a numeric string. The parser was failing on alphanumeric version strings, which was the source of the error. Built with the following PyTorch extensions: Apex, TorchVision, TorchAudio & DeepSpeed. Unit tested with the following PyTorch extensions: Apex, TorchVision. (cherry picked from commit c873aeac35851a7d5000eb7f24561d3f56c2ffbd) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/150451 Approved by: https://github.com/jeffdaily	2025-04-07 23:31:29 +00:00
Saagar Jha	c067127d47	Ensure cuda_dlink_post_cflags are quoted as well (#150151 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/150151 Approved by: https://github.com/janeyx99	2025-04-03 06:50:22 +00:00
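Why flags need quoting before they are written into a build command can be shown with the standard library. This is a sketch under the assumption that the flags end up in a POSIX shell command line; `quote_flags` is a hypothetical helper and the flag values are made up.

```python
import shlex


def quote_flags(flags):
    """Quote each flag so a POSIX shell passes it through unchanged."""
    return [shlex.quote(flag) for flag in flags]


# A flag with spaces and quotes would otherwise be split by the shell.
dlink_flags = ['-DBUILD_TAG="nightly build"', "-O2"]
command_line = " ".join(quote_flags(dlink_flags))

# Splitting the command line the way a shell would recovers the original flags.
assert shlex.split(command_line) == dlink_flags
```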
tvukovic-amd	db32093192	[ROCm][Windows] Fix torchvision build with ROCm 6.4 on windows (#150180 ) Since hipcc files and calls are restructured in HIP SDK 6.4, a case for calling hipcc.exe is added for building torchvision with HIP SDK 6.4 on Windows Pull Request resolved: https://github.com/pytorch/pytorch/pull/150180 Approved by: https://github.com/malfet, https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-04-02 00:35:47 +00:00
Taras	d670df356c	Improve error handling when checking CUDA version in case nvcc is not found (#148671 ) Fixes: - https://github.com/pytorch/pytorch/issues/101138 Description The PR enhances error handling in `_check_cuda_version` by verifying the existence of the `nvcc` executable before invoking `subprocess.check_output`. If `nvcc` is missing, a `FileNotFoundError` is raised with a clear message, guiding users to check their CUDA installation and path configuration. Testing Manually tested with and without `nvcc` present in the expected path. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148671 Approved by: https://github.com/malfet	2025-03-27 23:04:59 +00:00
Nikita Shulga	5a7588f183	[Build] Remove pre-CXX11 ABI logic from build script (#149888 ) Only keep one in check_binary_symbols to make sure there are no pre-CXX11 ABI symbols in the library Pull Request resolved: https://github.com/pytorch/pytorch/pull/149888 Approved by: https://github.com/atalman, https://github.com/seemethere ghstack dependencies: #149887	2025-03-25 03:17:16 +00:00
Mark Saroufim	539db4af4b	load_inline no_implicit_headers mode (#149480 ) In the kernelBot leaderboard we support people competing with custom cuda extensions via `load_inline()`, however even on toy kernels this can result in cold starts of up to 90s - this feature is primarily responsible for us having to double our timeout values I performed an investigation here https://github.com/msaroufim/load_inline_slow and the primary cause was that torch/extension.h and torch/types.h add in about 5,000 header files https://github.com/msaroufim/load_inline_slow/blob/main/header-analysis So we introduce a mode `no_implicit_headers` which forces users to be explicit about exactly what they want to add. There's a proper test meant to be used in a CLI and a pytest test that's not terribly helpful Then there's still an open question around what's the most minimal example implementation we can provide. For the baseline kernel we're showing here, it takes about 1 min to compile 1. There's using TensorBase.h (finicky to get right but can get compilation times down to 7s) 2. Just using Tensor.h (down to 15s) 3. Using Shim.h (did not try yet since the syntax is verbose relative to cuda) This is my take so far https://gist.github.com/msaroufim/079a8d08ffebd0f91a1c2247eb0ce9e0 for a minimal implementation at 15s but @malfet has a simpler one at only 5s There's more things I'd like to try moving forward like nvrtc and fancier compilation flags. Typical advice around using precompiled headers does not apply to us because we are mostly interested in cold starts where we tear down the machine after running a kernel Also in a future PR I'd like to fix issue I've noticed with load_inline 1. It needs a force recompilation mode, I was using this quite a bit myself 2. The cache does not take into account changes in environment so the best way to force a recompilation is to change some string in the file 3. Instead of relying on pybind, can we use TORCH_LIBRARY instead Pull Request resolved: https://github.com/pytorch/pytorch/pull/149480 Approved by: https://github.com/malfet	2025-03-22 19:21:29 +00:00
tvukovic-amd	268de64005	[ROCm][Windows] Enable torchvision build with ROCm on Windows (#147382 ) - Updated HIP flags for Windows (removed non Windows flags on Windows case, added runtime library) - Set hipcc call for Windows case - Removed CUDA flags (not used in ROCm) on Windows - Updated Windows compiler (added case when using ROCm on Windows) - Fixed path issue in hipify_python Pull Request resolved: https://github.com/pytorch/pytorch/pull/147382 Approved by: https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-03-18 23:37:05 +00:00
Nichols A. Romero	c0566e0dbf	[ROCm] Fixes and improvements to CUDA->HIP flag conversion for CPP extensions (#149245 ) Fixes https://github.com/ROCm/hip/issues/3764. Fixes and improvements to CUDA->HIP flag conversion for CPP extensions - Log flag conversion for debugging purposes. - Fix cases where it should not touch the -I flags or cases where CUDA appears more than once by replacing only the first instance. - Fix case where nvcc key may not exist - Fix case where hipify should ignore flag values and only touch the flag itself Pull Request resolved: https://github.com/pytorch/pytorch/pull/149245 Approved by: https://github.com/jeffdaily Co-authored-by: Qubitium-ModelCloud <qubitium@modelcloud.ai>	2025-03-18 18:01:07 +00:00
Dmitry Rogozhkin	c179971bfc	xpu: update filter out of dg2 AOT target (#148677 ) torch-xpu-ops has updated list of AOT targets to use and used `dg2` instead of `dg2-g10`. This requires an update in cpp_extension.py which currently filters out `dg2-` prefixed AOT targets. CC: @gujinghui @EikanWang @fengyuan14 @guangyey @jgong5 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148677 Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/albanD	2025-03-14 02:24:06 +00:00
Dmitry Rogozhkin	70410f93f2	doc/xpu: align description of SyclExtension with CPP/CUDA (#147988 ) This commit just aligns description of `py_limited_api` feature in SyclExtension with CPP/CUDA. We've missed this change on doing SyclExtension due to parallel work on the changes. For CPP/CUDA change was done in 515e55e6927ad5f57ec222d7779712630341acf3. CC: @gujinghui @EikanWang @fengyuan14 @guangyey @jgong5 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147988 Approved by: https://github.com/janeyx99, https://github.com/guangyey	2025-03-04 04:17:36 +00:00
Dmitry Rogozhkin	baccadb2f1	xpu: torch.xpu.get_arch_list() to return [] if xpu not compiled (#147431 ) Initially discussed here: https://github.com/pytorch/pytorch/pull/132945#discussion_r1957366131 Previously `torch.xpu.get_arch_list()` got relaxed to work even if XPU device is not available. However, we overlooked the case when pytorch is not compiled with XPU support. In such a case function throws an exception. This commit adjusts this behavior and makes function return `[]` even if pytorch is not compiled with XPU support. CC: @EikanWang @fengyuan14 @guangyey @malfet @albanD Pull Request resolved: https://github.com/pytorch/pytorch/pull/147431 Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/albanD	2025-02-24 01:35:54 +00:00
Dmitry Rogozhkin	d27ecf85db	xpu: support sycl with torch.utils.cpp_extension APIs (#132945 ) This patch adds support for sycl kernels build via `torch.utils.cpp_extension.load`, `torch.utils.cpp_extension.load_inline` and (new) `class SyclExtension` APIs. Files having `.sycl` extension are considered to have sycl kernels and are compiled with `icpx` (dpc++ sycl compiler from Intel). Files with other extensions, `.cpp`, `.cu`, are handled as before. API supports building sycl along with other file types into single extension. Note that `.sycl` file extension is a PyTorch convention for files containing sycl code which I propose to adopt. We did follow up with compiler team to introduce such file extension in the compiler, but they are opposed to this. At the same time discussion around sycl file extension and adding sycl language support into such tools as cmake is ongoing. Eventually cmake also considers to introduce some file extension convention for sycl. I hope we can further influence cmake and compiler communities to broader adopt `.sycl` file extension. By default SYCL kernels are compiled for all Intel GPU devices for which pytorch native aten SYCL kernels are compiled. At the moment `pvc,xe-lpg`. This behavior can be overridden by setting `TORCH_XPU_ARCH_LIST` environment variables to the comma separated list of desired devices to compile for. Fixes: #132944 CC: @gujinghui @EikanWang @fengyuan14 @guangyey @jgong5 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132945 Approved by: https://github.com/albanD, https://github.com/guangyey, https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-02-16 16:50:59 +00:00
PyTorch MergeBot	dd5d0ea6bb	Revert "xpu: support sycl with torch.utils.cpp_extension APIs (#132945 )" This reverts commit 607379960bc5093a1fe51ff72c3e0fd39ac126ab. Reverted https://github.com/pytorch/pytorch/pull/132945 on behalf of https://github.com/malfet due to It just broke all the tests, see `b16ae97ad0/1` ([comment](https://github.com/pytorch/pytorch/pull/132945#issuecomment-2661498747))	2025-02-16 16:03:42 +00:00
Dmitry Rogozhkin	607379960b	xpu: support sycl with torch.utils.cpp_extension APIs (#132945 ) This patch adds support for sycl kernels build via `torch.utils.cpp_extension.load`, `torch.utils.cpp_extension.load_inline` and (new) `class SyclExtension` APIs. Files having `.sycl` extension are considered to have sycl kernels and are compiled with `icpx` (dpc++ sycl compiler from Intel). Files with other extensions, `.cpp`, `.cu`, are handled as before. API supports building sycl along with other file types into single extension. Note that `.sycl` file extension is a PyTorch convention for files containing sycl code which I propose to adopt. We did follow up with compiler team to introduce such file extension in the compiler, but they are opposed to this. At the same time discussion around sycl file extension and adding sycl language support into such tools as cmake is ongoing. Eventually cmake also considers to introduce some file extension convention for sycl. I hope we can further influence cmake and compiler communities to broader adopt `.sycl` file extension. By default SYCL kernels are compiled for all Intel GPU devices for which pytorch native aten SYCL kernels are compiled. At the moment `pvc,xe-lpg`. This behavior can be overridden by setting `TORCH_XPU_ARCH_LIST` environment variables to the comma separated list of desired devices to compile for. Fixes: #132944 CC: @gujinghui @EikanWang @fengyuan14 @guangyey @jgong5 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132945 Approved by: https://github.com/albanD, https://github.com/guangyey	2025-02-16 10:16:09 +00:00
Jane Xu	515e55e692	Set -DPy_LIMITED_API flag for py_limited_api=True extensions (#145764 ) This could be BC breaking, because there was a period of time when we use py_limited_api=True but don't enforce the flag, and now that we will start enforcing the flag, people's custom extensions may fail to build. This is strictly still better behavior, as it is sketchy to claim CPython agnosticism without the flag, but calling this out as potential people yelling at us. Ways to mitigate this risk + reasons this may not be too big a deal: - People haven't known about py_limited_api for extensions much due to lack of docs from python so usage is low right now - My current tutorial is in store to make new users of py_limited_api pass this flag, so it'd be a noop for them. Test plan: * Locally i'm confident as I tried rebuilding ao with this change and it reliably failed (cuz importing torch/extension.h is a nono) * Unit test wise, the normal python_agnostic one I added should work Pull Request resolved: https://github.com/pytorch/pytorch/pull/145764 Approved by: https://github.com/ezyang, https://github.com/zou3519, https://github.com/albanD	2025-01-28 20:11:05 +00:00
H. Vetinari	e6c1e6e20e	simplify torch.utils.cpp_extension.include_paths; use it in cpp_builder (#145480 ) While working on conda-forge integration, I needed to look at the way the include paths are calculated, and noticed an avoidable duplication between `torch/utils/cpp_extension.py` and `torch/_inductor/cpp_builder.py`. The latter already imports the former anyway, so simply reuse the same function. Furthermore, remove long-obsolete include-paths. AFAICT, the `/TH` headers have not existed since pytorch 1.11. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145480 Approved by: https://github.com/ezyang	2025-01-27 07:19:42 +00:00
Johnny	732c4998f3	[NVIDIA] Full Family Blackwell Support codegen (#145436 ) More references: https://github.com/NVIDIA/nccl Pull Request resolved: https://github.com/pytorch/pytorch/pull/145436 Approved by: https://github.com/ezyang, https://github.com/drisspg	2025-01-24 04:36:00 +00:00
Irem Yuksel	66bf7da446	Enable sleef for Win Arm64 (#144876 ) Sleef module was disabled for Windows Arm64 on `b021486405` This PR enables it again since the issue is no longer valid. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144876 Approved by: https://github.com/albanD, https://github.com/malfet Co-authored-by: Ozan Aydin <148207261+ozanMSFT@users.noreply.github.com>	2025-01-23 19:22:58 +00:00
Johnny	a57133e3c7	[NVIDIA] Jetson Thor Blackwell Support codegen (#145395 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145395 Approved by: https://github.com/eqy, https://github.com/malfet	2025-01-22 20:13:19 +00:00
johnnynunez	35f5668f7e	[NVIDIA] RTX50 Blackwell Support codegen (#145270 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145270 Approved by: https://github.com/ezyang	2025-01-21 21:10:05 +00:00

358 Commits