As in title
I don't know how the install_cmake script is used: I see it being called with 3.18, but in the build jobs some report 3.18 and others 3.31.
Make everything install cmake via requirements-ci.txt instead. I don't know if the comment at 5d36485b4a/.ci/docker/common/install_conda.sh (L78) still holds, but pretty much every build has CONDA_CMAKE set to true, so I'm defaulting to installing through pip.
Also default to 4.0.0 everywhere except the executorch docker build, because executorch reinstalls 3.31.something.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152537
Approved by: https://github.com/cyyever, https://github.com/atalman, https://github.com/malfet
We do not need `install_aotriton.sh` and `aotriton_version.txt` any more since `aotriton.cmake` now installs the best binary release package as the default option when building pytorch.
This should resolve the issue of needing a pre-installed aotriton package when building PyTorch for ROCm from source, which is not feasible if building PyTorch *outside* a CI docker image. With this change, a user can have a pre-installed AOTriton in their environment, if desired, and have the build pick it up by specifying the `AOTRITON_INSTALLED_PREFIX` env var, or have the build automatically detect and install the compatible version. As a third option, the user can also force AOTriton to build from source instead, using the `AOTRITON_INSTALL_FROM_SOURCE` env var.
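The three options above can be summarized with a small sketch. This is not the logic in `aotriton.cmake` itself: the function name `resolve_aotriton_mode`, the mode labels, the precedence (prefix first), and the treatment of any non-empty `AOTRITON_INSTALL_FROM_SOURCE` value as "on" are all assumptions made for illustration.

```python
import os
from typing import Mapping, Optional, Tuple


def resolve_aotriton_mode(
    env: Optional[Mapping[str, str]] = None,
) -> Tuple[str, Optional[str]]:
    """Pick one of the three AOTriton options described above.

    Hypothetical helper: the precedence and the truthiness rule are
    assumptions, not taken from aotriton.cmake.
    """
    env = os.environ if env is None else env

    # Option 1: use an AOTriton that is already installed.
    prefix = env.get("AOTRITON_INSTALLED_PREFIX")
    if prefix:
        return ("preinstalled", prefix)

    # Option 3: force a build from source (assumed: any non-empty value).
    if env.get("AOTRITON_INSTALL_FROM_SOURCE"):
        return ("build_from_source", None)

    # Option 2 (default): detect and install a compatible binary release.
    return ("download_release", None)
```

With neither variable set, the sketch falls through to the binary release, which matches the default described in this PR.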
Also, with the changes in this PR, the cmake build process handles the tasks of copying aotriton .so and images directory from `torch/lib` to the installation path.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137443
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily
Co-authored-by: Jithun Nair <jithun.nair@amd.com>
This patch implements `with sdpa_kernel(SDPBackend.EFFICIENT_ATTENTION):` by reusing AOTriton's accelerated SDPA implementation.
Known limitations:
- Only supports MI200/MI300X GPUs
- Does not support varlen
- Does not support `CausalVariant`
- Optional arguments `causal_diagonal` and `seqlen_k` in `_efficient_attention_forward/backward` must be null
- Does not work well with inductor's SDPA rewriter. The rewriter has been updated to only use math and flash attention on ROCm.
This PR also uses a different approach of installing AOTriton binary instead of building it from source in the base docker image. More details on motivation: https://github.com/pytorch/pytorch/pull/124885#issuecomment-2153229129
`PYTORCH_TEST_WITH_ROCM=1 PYTORCH_TESTING_DEVICE_ONLY_FOR="cuda" python test/test_transformers.py` yields "55028 passed, 20784 skipped" results with this change. [Previous result](https://hud.pytorch.org/pr/127528) of `test_transformers.py` was 0 error, 0 failure, 55229 skipped out of 75517 tests in total (the XML report does not contain total number of passed tests).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124885
Approved by: https://github.com/malfet
Mitigates #126111
AOTriton, as a math library, takes a long time to build. However, it does not change as fast as PyTorch does, so building it for every CI check is not cost-efficient.
This PR moves the AOTriton build from the PyTorch build into its base docker image, which avoids repeating the long build on every job.
Pre-this-PR:
* PyTorch base docker build job duration: 1.1-1.3h
* PyTorch build job duration: 1.4-1.5h (includes AOTriton build time of 1h6min on a linux.2xlarge node)
Post-this-PR:
* PyTorch base docker build job duration: 1.3h (includes AOTriton build time of 20min on a linux.12xlarge node)
* PyTorch build job duration: <20 min
Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127012
Approved by: https://github.com/jithunnair-amd, https://github.com/pruthvistony, https://github.com/huydhn
Following the example I did for ONNX in https://github.com/pytorch/pytorch/pull/96793, this caches the pretrained `mobilenet_v2` and `mobilenet_v3_large` models used by CI jobs. I think there might be an issue either with AWS or with the domain download.pytorch.org, as the connection to the latter has been failing a lot in the past few days.
Related flaky jobs:
* https://github.com/pytorch/pytorch/actions/runs/4835873487/jobs/8618836446
* https://github.com/pytorch/pytorch/actions/runs/4835783539/jobs/8618404639
```
Downloading: "https://download.pytorch.org/models/mobilenet_v2-b0353104.pth" to /var/lib/jenkins/.cache/torch/hub/checkpoints/mobilenet_v2-b0353104.pth
Traceback (most recent call last):
  File "/opt/conda/envs/py_3.8/lib/python3.8/urllib/request.py", line 1354, in do_open
    h.request(req.get_method(), req.selector, req.data, headers,
  File "/opt/conda/envs/py_3.8/lib/python3.8/http/client.py", line 1256, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/opt/conda/envs/py_3.8/lib/python3.8/http/client.py", line 1302, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/opt/conda/envs/py_3.8/lib/python3.8/http/client.py", line 1251, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/opt/conda/envs/py_3.8/lib/python3.8/http/client.py", line 1011, in _send_output
    self.send(msg)
  File "/opt/conda/envs/py_3.8/lib/python3.8/http/client.py", line 951, in send
    self.connect()
  File "/opt/conda/envs/py_3.8/lib/python3.8/http/client.py", line 1418, in connect
    super().connect()
  File "/opt/conda/envs/py_3.8/lib/python3.8/http/client.py", line 922, in connect
    self.sock = self._create_connection(
  File "/opt/conda/envs/py_3.8/lib/python3.8/socket.py", line 808, in create_connection
    raise err
  File "/opt/conda/envs/py_3.8/lib/python3.8/socket.py", line 796, in create_connection
    sock.connect(sa)
OSError: [Errno 99] Cannot assign requested address
```
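The idea of caching the checkpoints ahead of time, so that test jobs never reach the network, can be sketched as follows. This is a minimal illustration and not the script added in this PR: the helper name `cache_checkpoints` and its injectable `fetch` parameter are hypothetical, and the sketch downloads with the standard library rather than through torchvision.

```python
import os
import urllib.request
from typing import Callable, Iterable, List


def cache_checkpoints(
    urls: Iterable[str],
    cache_dir: str,
    fetch: Callable[[str, str], object] = urllib.request.urlretrieve,
) -> List[str]:
    """Download each checkpoint into cache_dir unless it is already there.

    Hypothetical helper: run once while building the docker image so that
    later CI jobs find the file on disk and skip the download.
    """
    os.makedirs(cache_dir, exist_ok=True)
    paths = []
    for url in urls:
        # Keep the upstream file name, e.g. mobilenet_v2-b0353104.pth.
        dest = os.path.join(cache_dir, os.path.basename(url))
        if not os.path.exists(dest):
            fetch(url, dest)
        paths.append(dest)
    return paths


# Example call (performs a real download, so it is left commented out):
# cache_checkpoints(
#     ["https://download.pytorch.org/models/mobilenet_v2-b0353104.pth"],
#     "/var/lib/jenkins/.cache/torch/hub/checkpoints",
# )
```

The URL and cache directory in the commented call are the ones that appear in the log above; a second call with the same arguments does no network work because the file already exists.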
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100302
Approved by: https://github.com/ZainRizvi