Fixes https://github.com/pytorch/pytorch/issues/156815
Testing notes:
* I first tried `cuobjdump` (bccd9393a5), but that did not work well: the name of the cubin always contains a single gencode.
* Another idea was to read stderr and check that the expected number of gencodes is present (0beadc01b3). This helped a lot to convince me locally that the test works. The test passed on my dev GPU but failed in CI, and I suspect that is because of a bad interaction with subprocesses.
* The last approach was a simpler unit test that checks which flags get added by default. It is not as comprehensive as the previous ideas, but it works and is fast, so I will opt for it, since my own experiments and customers have convinced me that this works.
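The gencode-counting idea from the notes above can be sketched in a few lines. This is only an illustration and not the test added by this PR: the helper name `gencode_archs`, the regular expression, and the sample `nvcc` command line are all made up for the example.

```python
import re
from typing import List

# Hypothetical pattern for flags such as -gencode=arch=compute_80,code=sm_80.
GENCODE_RE = re.compile(
    r"-gencode[= ]arch=compute_(\d+[a-z]?),code=(?:sm|compute)_\d+[a-z]?"
)


def gencode_archs(command: str) -> List[str]:
    """Return the compute capability of every -gencode flag in a command line."""
    return GENCODE_RE.findall(command)


# Made-up compile command, used only to exercise the helper.
sample = (
    "nvcc -O2 -gencode=arch=compute_80,code=sm_80 "
    "-gencode=arch=compute_90,code=sm_90 -c kernel.cu -o kernel.o"
)
archs = gencode_archs(sample)
```

A flag-level check like this runs without a GPU or a compiler, which is why it is fast, at the cost of not inspecting the binary that is actually produced.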
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156850
Approved by: https://github.com/malfet
This commit fixes AOT compilation in the SYCL cpp extension, which was accidentally dropped in aca2c99a652 (a fallback to JIT compilation happened instead). It also fixes the override logic for the default SYCL targets, so that targets can be specified externally. Finally, it extends test coverage to cover this case and fixes an issue in the test where subsequent tests executed the same (first) compiled extension due to name conflicts.
Fixes: #156249
Fixes: aca2c99a652 ("xpu: get xpu arch flags at runtime in cpp_extensions (#152192)")
CC: @pengxin99, @guangyey
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156364
Approved by: https://github.com/ezyang
This patch adds support for building SYCL kernels via the `torch.utils.cpp_extension.load` and `torch.utils.cpp_extension.load_inline` APIs and the new `SyclExtension` class. Files with the `.sycl` extension are assumed to contain SYCL kernels and are compiled with `icpx` (the DPC++ SYCL compiler from Intel). Files with other extensions, such as `.cpp` and `.cu`, are handled as before. The API supports building SYCL together with other file types into a single extension.
Note that the `.sycl` file extension is a PyTorch convention for files containing SYCL code, which I propose to adopt. We followed up with the compiler team about introducing this file extension in the compiler, but they are opposed to it. Meanwhile, discussion about a SYCL file extension and about adding SYCL language support to tools such as CMake is ongoing, and CMake is also considering introducing a file extension convention for SYCL. I hope we can further influence the CMake and compiler communities to adopt the `.sycl` file extension more broadly.
By default, SYCL kernels are compiled for all Intel GPU devices for which PyTorch's native ATen SYCL kernels are compiled, currently `pvc,xe-lpg`. This behavior can be overridden by setting the `TORCH_XPU_ARCH_LIST` environment variable to a comma-separated list of the devices to compile for.
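The override rule can be sketched as follows. This is a minimal illustration, not the code in `torch.utils.cpp_extension`: the helper `resolve_xpu_arch_list` and the constant `DEFAULT_XPU_ARCH_LIST` are hypothetical names, and only the `TORCH_XPU_ARCH_LIST` variable and the `pvc,xe-lpg` default come from the description above.

```python
import os
from typing import List, Mapping, Optional

# Default taken from the description above; the constant name is hypothetical.
DEFAULT_XPU_ARCH_LIST = "pvc,xe-lpg"


def resolve_xpu_arch_list(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the devices to compile for, honoring TORCH_XPU_ARCH_LIST if set."""
    env = os.environ if env is None else env
    raw = env.get("TORCH_XPU_ARCH_LIST", "").strip()
    if not raw:
        # Unset or empty: fall back to the default device list.
        raw = DEFAULT_XPU_ARCH_LIST
    return [arch.strip() for arch in raw.split(",") if arch.strip()]


default_targets = resolve_xpu_arch_list({})
overridden_targets = resolve_xpu_arch_list({"TORCH_XPU_ARCH_LIST": "pvc"})
```

Treating an empty value the same as an unset one is a choice made for this sketch only.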
Fixes: #132944
CC: @gujinghui @EikanWang @fengyuan14 @guangyey @jgong5
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132945
Approved by: https://github.com/albanD, https://github.com/guangyey, https://github.com/malfet
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
I've found that when using `torch.utils.cpp_extension.load` on my Windows system, decoding errors occur when my .cpp/.cu files contain certain non-English characters.
`test.py`:
```py
from torch.utils.cpp_extension import load
my_lib = load(name='my_cuda_kernel', sources=['my_cuda_kernel.cu'], extra_cuda_cflags=['-O2', '-std=c++17'])
# ......
```
`my_cuda_kernel.cu`:
```cpp
#include <torch/types.h>
#include <torch/extension.h>
// 向量化 <------ some chinese characters
// ......
```
Errors will be reported as:
```
Traceback (most recent call last):
  File "E:\test\test.py", line 8, in <module>
    my_lib = load(
             ^^^^^
  File "C:\Users\XXX\AppData\Roaming\Python\Python311\site-packages\torch\utils\cpp_extension.py", line 1314, in load
    return _jit_compile(
           ^^^^^^^^^^^^^
  File "C:\Users\XXX\AppData\Roaming\Python\Python311\site-packages\torch\utils\cpp_extension.py", line 1680, in _jit_compile
    version = JIT_EXTENSION_VERSIONER.bump_version_if_changed(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\XXX\AppData\Roaming\Python\Python311\site-packages\torch\utils\_cpp_extension_versioner.py", line 46, in bump_version_if_changed
    hash_value = hash_source_files(hash_value, source_files)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\XXX\AppData\Roaming\Python\Python311\site-packages\torch\utils\_cpp_extension_versioner.py", line 17, in hash_source_files
    hash_value = update_hash(hash_value, file.read())
                                         ^^^^^^^^^^^
UnicodeDecodeError: 'gbk' codec can't decode byte 0x96 in position 141: illegal multibyte sequence
```
The root cause is that the default encoding used by Python's `open()` is platform-dependent, which can cause decoding errors when a file contains bytes that are not valid in that encoding. PyTorch uses the file contents to generate a hash string:
60c1433041/torch/utils/_cpp_extension_versioner.py (L16-L17)
On my Windows system the default encoding is `gbk`, but all of my C++ files are encoded as UTF-8.
I think there is a simple solution to this problem: open the file in binary mode, which avoids encoding issues altogether. It works perfectly on my computer.
```diff
- with open(filename) as file:
+ with open(filename, 'rb') as file:
      hash_value = update_hash(hash_value, file.read())
```
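As a standalone illustration of why binary mode sidesteps the problem, the sketch below hashes a UTF-8 source file from its raw bytes. It uses `hashlib.sha256` as a stand-in for PyTorch's own `update_hash`, and the helper name `hash_source_file` is hypothetical.

```python
import hashlib
import os
import tempfile


def hash_source_file(path: str) -> str:
    """Hash the raw bytes of a file, so no text decoding is involved."""
    with open(path, "rb") as file:
        return hashlib.sha256(file.read()).hexdigest()


source = "#include <torch/extension.h>\n// 向量化\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "my_cuda_kernel.cu")
    # Write the file as UTF-8 regardless of the platform default encoding.
    with open(path, "w", encoding="utf-8", newline="") as file:
        file.write(source)
    digest = hash_source_file(path)
```

Because the bytes are never decoded, the result is the same whether the locale's default encoding is `gbk`, `utf-8`, or anything else.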
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138565
Approved by: https://github.com/malfet, https://github.com/janeyx99
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
The test may fail because it either uses target flags newer than the GPU, which results in failures loading the compiled binary, or targets a GPU that CUDA does not support yet or no longer supports.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122400
Approved by: https://github.com/ezyang
The test uses the CUDA compute capabilities of the current device to
compile an extension. If nvcc is older than the device, it will fail
with a message like "Unsupported gpu architecture 'compute_80'"
resulting in a `RuntimeError: Error building extension 'cudaext_archflags'`
ultimately failing the test.
This change checks for this case and allows execution to continue.
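One way to express such a check is sketched below. It is not the code from this PR: the helper `try_build` is hypothetical, and it assumes that the compiler output ends up in the text of the raised `RuntimeError`.

```python
from typing import Callable


def try_build(build_fn: Callable[[], object]) -> bool:
    """Run a build; return False if nvcc does not know the requested arch.

    Any other build failure is re-raised so that real errors stay visible.
    """
    try:
        build_fn()
    except RuntimeError as err:
        # Assumption: the nvcc message is part of the exception text.
        if "Unsupported gpu architecture" in str(err):
            return False
        raise
    return True
```

A test could call `try_build` around the extension build and only inspect the build artifacts when it returns `True`.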
Fixes https://github.com/pytorch/pytorch/issues/51950
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122402
Approved by: https://github.com/ezyang
This updates ruff to 0.0.285, which is faster, better, and fixes a bunch of false negatives with regard to f-strings.
I also enabled RUF017, which looks for accidental quadratic list summation. Luckily, there seem to be no instances of it in our codebase, so I am enabling it to keep it that way. :)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107519
Approved by: https://github.com/ezyang
Fixes https://github.com/pytorch/pytorch/issues/61655
`test_jit_cuda_extension` is flaky and fails whenever `test_jit_cuda_archflags` is run. `test_jit_cuda_archflags` was a slow test on the old Windows runner. It is currently running again on trunk due to a problem with populating the slow-test JSON file ~Interestingly, its performance is getting better in the new Windows G5 runner and it becomes a borderline slow test, where it run sometimes~. Whenever it runs, the next test, `test_jit_cuda_extension`, fails.
* Build and load the different CUDA arch modules from `test_jit_cuda_archflags` in separate processes to avoid importing them into the current one. The test only checks the build artifacts. Importing them causes `test_jit_cuda_extension` to fail, as described in https://github.com/pytorch/pytorch/issues/61655
* Clean up the temp build dir on Windows. The Windows CUDA runner is non-ephemeral, so it is better to clean things up properly to avoid any funny business the next time the runner is used.
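Both points can be illustrated with the standard library alone. The sketch below is not the test code: `fake_ext_sm80` is a made-up pure-Python module standing in for a compiled extension, and `run_isolated` is a hypothetical helper.

```python
import os
import subprocess
import sys
import tempfile


def run_isolated(code: str) -> str:
    """Run `code` in a fresh interpreter and return its stdout."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# The temporary directory is removed on exit, which mirrors the build-dir cleanup.
with tempfile.TemporaryDirectory() as build_dir:
    with open(os.path.join(build_dir, "fake_ext_sm80.py"), "w") as f:
        f.write("ARCH = 'sm_80'\n")

    child_code = (
        "import sys\n"
        f"sys.path.insert(0, {build_dir!r})\n"
        "import fake_ext_sm80\n"
        "print(fake_ext_sm80.ARCH)\n"
    )
    child_output = run_isolated(child_code)

# The module was imported only in the child, never in this process.
imported_in_parent = "fake_ext_sm80" in sys.modules
```

The parent only sees what the child prints, so nothing the child imports can leak into later tests that run in the parent process.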
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93332
Approved by: https://github.com/davidberard98
Attempts to fix #92656
BC-breaking! This changes the default of `zero_grad` in optim and in nn so that grads are set to `None` instead of zero tensors. We are changing the default because there are proven perf wins and existing code has typically not regressed due to this change. (will probably have to flesh out this note more).
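A small sketch of what the change means for user code, assuming a PyTorch build that includes it; the model and optimizer are arbitrary choices for the example.

```python
import torch

model = torch.nn.Linear(2, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# With the new default (set_to_none=True), gradients become None.
model(torch.ones(1, 2)).sum().backward()
optimizer.zero_grad()
grads_after_default = [p.grad for p in model.parameters()]

# Passing set_to_none=False keeps the previous behavior: zero-filled tensors.
model(torch.ones(1, 2)).sum().backward()
optimizer.zero_grad(set_to_none=False)
grads_after_opt_out = [p.grad for p in model.parameters()]
```

Code that reads `p.grad` right after `zero_grad()` needs to handle `None`, or opt out with `set_to_none=False`.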
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92731
Approved by: https://github.com/ngimel
Fixes: https://github.com/pytorch/pytorch/issues/88010
This PR does a couple things to stop slow gradcheck from timing out:
- Splits out test_ops_fwd_gradients from test_ops_gradients, and factors out TestFwdGradients and TestBwdGradients which both inherit from TestGradients, now situated in common_utils (maybe there is a better place?)
- Skips CompositeCompliance (and several other test files) for slow gradcheck CI since they do not use gradcheck
- Because test times for test_ops_fwd_gradients and test_ops_gradients are either unknown or wrong, we hardcode them for now to prevent the two from being placed on the same shard. We can undo the hack once we see that the actual test times are updated. ("def calculate_shards" randomly divides tests with unknown test times in a round-robin fashion.)
- Updates references to test_ops_gradients and TestGradients
- Test files that are skipped for slow gradcheck CI are now centrally located in run_tests.py. This reduces how fine-grained we can be with the skips, so for some skips (one so far) we still use the old skipping mechanism, e.g. for test_mps
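To show why hardcoding the times keeps the two files apart, here is a simplified sketch of time-based sharding. It is not the real `calculate_shards`: the function name, the greedy strategy, and the 3600-second values are assumptions made for the illustration.

```python
from typing import Dict, List, Tuple


def calculate_shards_sketch(
    num_shards: int, tests: List[str], times: Dict[str, float]
) -> List[Tuple[float, List[str]]]:
    """Assign tests with known times greedily, the rest round-robin."""
    shards: List[list] = [[0.0, []] for _ in range(num_shards)]

    known = sorted(
        (t for t in tests if t in times), key=lambda t: times[t], reverse=True
    )
    unknown = [t for t in tests if t not in times]

    # Longest known tests first, each onto the currently lightest shard.
    for test in known:
        lightest = min(shards, key=lambda shard: shard[0])
        lightest[0] += times[test]
        lightest[1].append(test)

    # Tests without timing data are spread round-robin.
    for index, test in enumerate(unknown):
        shards[index % num_shards][1].append(test)

    return [(total, names) for total, names in shards]


# Hypothetical hardcoded times for the two slow files.
hardcoded = {"test_ops_gradients": 3600.0, "test_ops_fwd_gradients": 3600.0}
tests = ["test_ops_gradients", "test_ops_fwd_gradients", "test_nn"]
shards = calculate_shards_sketch(2, tests, hardcoded)
```

With equal, large hardcoded times, the second file always goes to a different shard than the first, as long as there are at least two shards.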
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88216
Approved by: https://github.com/albanD
This functionality does not seem to be used
and there are some requests to update the dependency.
Add `third_party` to torch_cpu include directories if compiling with
Caffe2 support, as `caffe2/quantization/server/conv_dnnlowp_op.cc` depends on `third_party/fbgemm/src/RefImplementations.h`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75394
Approved by: https://github.com/janeyx99, https://github.com/seemethere