Commit Graph

26 Commits

Author SHA1 Message Date
7d39e73c57 Fix more URLs (#153277)
Or ignore them.
Found by running the lint_urls.sh script locally with https://github.com/pytorch/pytorch/pull/153246

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153277
Approved by: https://github.com/malfet
2025-05-14 16:23:50 +00:00
4b8b7c7fb9 [CI] Use cmake from pip instead of conda in CI docker images (#152537)
As in title

idk how the install_cmake script is used because I see it being called with 3.18 but when I look at the build jobs some say 3.18 and others 3.31

Just make everything install cmake via the requirements-ci.txt.  I don't know if the comment at 5d36485b4a/.ci/docker/common/install_conda.sh (L78) still holds, but pretty much every build has CONDA_CMAKE set to true, so I'm just defaulting to installing through pip

Also defaulting to 4.0.0 everywhere except the executorch docker build because executorch reinstalls 3.31.something
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152537
Approved by: https://github.com/cyyever, https://github.com/atalman, https://github.com/malfet
2025-05-08 18:58:10 +00:00
a7ea115494 Revert "[CI] Use cmake from pip instead of conda in CI docker images (#152537)"
This reverts commit 941062894a1accfd472d0acd2716493e1f173bd7.

Reverted https://github.com/pytorch/pytorch/pull/152537 on behalf of https://github.com/malfet due to Sorry to revert this PR, but it broke doc builds, see 4976b1a3a8/1 ([comment](https://github.com/pytorch/pytorch/pull/152537#issuecomment-2863337268))
2025-05-08 14:53:34 +00:00
941062894a [CI] Use cmake from pip instead of conda in CI docker images (#152537)
As in title

idk how the install_cmake script is used because I see it being called with 3.18 but when I look at the build jobs some say 3.18 and others 3.31

Just make everything install cmake via the requirements-ci.txt.  I don't know if the comment at 5d36485b4a/.ci/docker/common/install_conda.sh (L78) still holds, but pretty much every build has CONDA_CMAKE set to true, so I'm just defaulting to installing through pip

Also defaulting to 4.0.0 everywhere except the executorch docker build because executorch reinstalls 3.31.something
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152537
Approved by: https://github.com/cyyever, https://github.com/atalman, https://github.com/malfet
2025-05-08 10:10:27 +00:00
6d28d61323 [CI] Remove protobuf from docker image (#151933)
Pretty sure the source should be the one in third-party

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151933
Approved by: https://github.com/huydhn
2025-04-23 10:29:09 +00:00
2bd5bfa3ce [ROCm] use magma-rocm tarball for CI/CD (#149986)
Follow-up to #149902.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149986
Approved by: https://github.com/malfet
2025-03-28 19:28:50 +00:00
c41196a4d0 [EZ][Docker] Remove install_db.sh (#149360)
Which is a vestige of the caffe2 days and has been a no-op since https://github.com/pytorch/pytorch/pull/125092

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149360
Approved by: https://github.com/atalman, https://github.com/cyyever, https://github.com/seemethere, https://github.com/Skylion007
2025-03-18 16:07:47 +00:00
bc576355a2 Let aotriton.cmake detect the best binary package to use, and deprecate aotriton_version.txt (#137443)
We do not need `install_aotriton.sh` and `aotriton_version.txt` any more since `aotriton.cmake` now installs the best binary release package as the default option when building pytorch.

This should resolve the issue of needing a pre-installed aotriton package when building PyTorch for ROCm from source, which is not feasible if building PyTorch *outside* a CI docker image. With this change, a user can have a pre-installed AOTriton in their environment, if desired, and have the build pick it up by specifying the `AOTRITON_INSTALLED_PREFIX` env var, or have the build automatically detect and install the compatible version. As a third option, the user can also force AOTriton to build from source instead, using the `AOTRITON_INSTALL_FROM_SOURCE` env var.

Also, with the changes in this PR, the cmake build process handles the tasks of copying aotriton .so and images directory from `torch/lib` to the installation path.
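
The three options described above can be summarized as a small decision function. This is only an illustrative Python sketch: the real selection happens in `aotriton.cmake`, the function name `choose_aotriton_strategy` is hypothetical, and the precedence between the two environment variables when both are set is an assumption that this commit message does not state.

```python
import os
from typing import Mapping, Optional


def choose_aotriton_strategy(env: Optional[Mapping[str, str]] = None) -> str:
    """Return which of the three described AOTriton options would apply.

    Hypothetical helper for illustration; it does not mirror aotriton.cmake.
    """
    env = os.environ if env is None else env

    # Option 1: reuse a pre-installed AOTriton when the user points at one.
    # Assumption: this takes precedence if both variables are set.
    if env.get("AOTRITON_INSTALLED_PREFIX"):
        return "use-preinstalled"

    # Option 2: the user forces a build from source.
    # Assumption: any non-empty value counts as "enabled".
    if env.get("AOTRITON_INSTALL_FROM_SOURCE"):
        return "build-from-source"

    # Default: let the build detect and install a compatible binary release.
    return "download-binary"
```

Passing an explicit mapping instead of reading `os.environ` directly keeps the sketch easy to exercise without touching the real environment.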

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137443
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily

Co-authored-by: Jithun Nair <jithun.nair@amd.com>
2025-01-09 00:00:02 +00:00
034717a029 [ROCm] remove triton-rocm commit pin and merge pins with triton.txt (#133438)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133438
Approved by: https://github.com/jithunnair-amd, https://github.com/malfet

Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
2024-09-05 20:36:45 +00:00
a1ba8e61d1 Revert "[ROCm] remove triton-rocm commit pin and merge pins with triton.txt (#133438)"
This reverts commit 5e8bf29148a590318f678620f84be8f4d5ffff5c.

Reverted https://github.com/pytorch/pytorch/pull/133438 on behalf of https://github.com/ZainRizvi due to This still breaks linux binary builds. Added the appropriate labels to ensure tests can pass. See [GH job link](https://github.com/pytorch/pytorch/actions/runs/10626427003/job/29460479554) [HUD commit link](5e8bf29148) ([comment](https://github.com/pytorch/pytorch/pull/133438#issuecomment-2322246198))
2024-08-30 20:00:41 +00:00
5e8bf29148 [ROCm] remove triton-rocm commit pin and merge pins with triton.txt (#133438)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133438
Approved by: https://github.com/jithunnair-amd, https://github.com/malfet

Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
2024-08-30 03:38:35 +00:00
4648848696 Revert "[ROCm] remove triton-rocm commit pin and merge pins with triton.txt (#133438)"
This reverts commit f71c3d265ab52589f983dd252d61461db4e7dbbd.

Reverted https://github.com/pytorch/pytorch/pull/133438 on behalf of https://github.com/jeanschmidt due to seems to have introduced breakages in linux binary builds ([comment](https://github.com/pytorch/pytorch/pull/133438#issuecomment-2308787310))
2024-08-25 11:20:30 +00:00
f71c3d265a [ROCm] remove triton-rocm commit pin and merge pins with triton.txt (#133438)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133438
Approved by: https://github.com/jithunnair-amd, https://github.com/malfet
2024-08-24 18:26:49 +00:00
d34075e0bd Add Efficient Attention support on ROCM (#124885)
This patch implements `with sdpa_kernel(SDPBackend.EFFICIENT_ATTENTION):` by reusing AOTriton's accelerated SDPA implementation

Known limitations:
- Only supports MI200/MI300X GPUs
- Does not support varlen
- Does not support `CausalVariant`
- Optional arguments `causal_diagonal` and `seqlen_k` in `_efficient_attention_forward/backward` must be null
- Does not work well with inductor's SDPA rewriter. The rewriter has been updated to only use math and flash attention on ROCM.

This PR also uses a different approach of installing AOTriton binary instead of building it from source in the base docker image. More details on motivation: https://github.com/pytorch/pytorch/pull/124885#issuecomment-2153229129

`PYTORCH_TEST_WITH_ROCM=1 PYTORCH_TESTING_DEVICE_ONLY_FOR="cuda" python test/test_transformers.py` yields "55028 passed, 20784 skipped" results with this change.  [Previous result](https://hud.pytorch.org/pr/127528) of `test_transformers.py` was 0 error, 0 failure, 55229 skipped out of 75517 tests in total (the XML report does not contain total number of passed tests).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124885
Approved by: https://github.com/malfet
2024-06-08 22:41:05 +00:00
ef9451ac8d Move the build of AOTriton to base ROCM docker image. (#127012)
Mitigates #126111

AOTriton, as a math library, takes a long time to build. However, the library is not moving as fast as PyTorch itself, so it is not cost-efficient to build it for every CI check.

This PR moves the build of AOTriton from PyTorch to its base docker image, which avoids duplicated and lengthy builds.

Pre-this-PR:
* PyTorch base docker build job duration: 1.1-1.3h
* PyTorch build job duration: 1.4-1.5hr (includes AOTriton build time of 1hr6min on a linux.2xlarge node)

Post-this-PR:
* PyTorch base docker build job duration: 1.3h (includes AOTriton build time of 20min on a linux.12xlarge node)
* PyTorch build job duration: <20 min
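
As a rough cross-check of the numbers above, subtracting the reported AOTriton build time from the pre-PR build duration gives the time the old job spent on everything else. The sketch below only redoes that arithmetic in Python; the variable names are illustrative, and it assumes the quoted ranges are directly comparable.

```python
# Figures quoted above, converted to minutes.
pre_build_min = (1.4 * 60, 1.5 * 60)   # pre-PR PyTorch build job: 1.4-1.5 h
aotriton_on_2xlarge_min = 66           # AOTriton build inside that job: 1 h 6 min

# Time the pre-PR job spent on everything except AOTriton.
remainder_min = tuple(round(t - aotriton_on_2xlarge_min) for t in pre_build_min)
print(remainder_min)  # (18, 24)
```

The remainder of roughly 18 to 24 minutes is in the same range as the post-PR build job duration reported above.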

Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127012
Approved by: https://github.com/jithunnair-amd, https://github.com/pruthvistony, https://github.com/huydhn
2024-06-03 20:35:22 +00:00
d30cdc4321 [ROCm] amdsmi library integration (#119182)
Adds monitoring support for ROCm using amdsmi in place of pynvml.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119182
Approved by: https://github.com/jeffdaily, https://github.com/malfet, https://github.com/xw285cornell
2024-05-21 01:59:26 +00:00
cyy
3f11958d39 Remove FFMPEG from CI scripts (#125546)
Because FFMPEG was solely used by Caffe2.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125546
Approved by: https://github.com/r-barnes, https://github.com/kit1980, https://github.com/albanD, https://github.com/malfet, https://github.com/seemethere
2024-05-11 16:46:13 +00:00
4dad988822 Revert "Remove vision packages from CI scripts (#125546)"
This reverts commit f42ea14c3f795082138421fcef90d24f64c6fd35.

Reverted https://github.com/pytorch/pytorch/pull/125546 on behalf of https://github.com/huydhn due to I think we are using vision in inductor tests with their various models there ([comment](https://github.com/pytorch/pytorch/pull/125546#issuecomment-2105174723))
2024-05-10 19:43:23 +00:00
cyy
f42ea14c3f Remove vision packages from CI scripts (#125546)
Because they were solely used by Caffe2.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125546
Approved by: https://github.com/r-barnes, https://github.com/kit1980, https://github.com/albanD
2024-05-10 17:53:48 +00:00
0d4fdb0bb7 Revert "[ROCm] amdsmi library integration (#119182)"
This reverts commit 85447c41e32b1e43a025ea19ac812a0c7f88ff57.

Reverted https://github.com/pytorch/pytorch/pull/119182 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but the ROCm failed test is legit 85447c41e3 ([comment](https://github.com/pytorch/pytorch/pull/119182#issuecomment-2103433197))
2024-05-09 21:18:21 +00:00
85447c41e3 [ROCm] amdsmi library integration (#119182)
Adds monitoring support for ROCm using amdsmi in place of pynvml.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119182
Approved by: https://github.com/jeffdaily, https://github.com/malfet, https://github.com/xw285cornell
2024-05-09 18:21:38 +00:00
07123bc198 [ROCm] Build Triton in Centos for ROCm (#112050)
Triton build for centos-based ROCm Dockerfile was missing. This brings centos Dockerfile up-to-date with ubuntu Dockerfile. No CI job covers this change; this change is independently verified by ROCm QA team.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112050
Approved by: https://github.com/jataylo, https://github.com/malfet
2023-11-05 20:43:56 +00:00
85bd6bc010 Cache pretrained mobilenet_v2 and mobilenet_v3_large models in Docker (#100302)
Following the example I set for ONNX in https://github.com/pytorch/pytorch/pull/96793, this caches the pretrained `mobilenet_v2` and `mobilenet_v3_large` models used by CI jobs.  I think there might be an issue either with AWS or with the domain download.pytorch.org, as connections to the latter have been failing a lot in the past few days.

Related flaky jobs:
* https://github.com/pytorch/pytorch/actions/runs/4835873487/jobs/8618836446
* https://github.com/pytorch/pytorch/actions/runs/4835783539/jobs/8618404639

```
Downloading: "https://download.pytorch.org/models/mobilenet_v2-b0353104.pth" to /var/lib/jenkins/.cache/torch/hub/checkpoints/mobilenet_v2-b0353104.pth
Traceback (most recent call last):
  File "/opt/conda/envs/py_3.8/lib/python3.8/urllib/request.py", line 1354, in do_open
    h.request(req.get_method(), req.selector, req.data, headers,
  File "/opt/conda/envs/py_3.8/lib/python3.8/http/client.py", line 1256, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/opt/conda/envs/py_3.8/lib/python3.8/http/client.py", line 1302, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/opt/conda/envs/py_3.8/lib/python3.8/http/client.py", line 1251, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/opt/conda/envs/py_3.8/lib/python3.8/http/client.py", line 1011, in _send_output
    self.send(msg)
  File "/opt/conda/envs/py_3.8/lib/python3.8/http/client.py", line 951, in send
    self.connect()
  File "/opt/conda/envs/py_3.8/lib/python3.8/http/client.py", line 1418, in connect
    super().connect()
  File "/opt/conda/envs/py_3.8/lib/python3.8/http/client.py", line 922, in connect
    self.sock = self._create_connection(
  File "/opt/conda/envs/py_3.8/lib/python3.8/socket.py", line 808, in create_connection
    raise err
  File "/opt/conda/envs/py_3.8/lib/python3.8/socket.py", line 796, in create_connection
    sock.connect(sa)
OSError: [Errno 99] Cannot assign requested address
```
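
The failure above happens while downloading the checkpoint at test time. A minimal sketch of the check that pre-caching relies on: if a file named after the last path component of the URL already exists in the checkpoint directory, no network access is needed. The helper names are hypothetical, and the assumption that the cached filename is the URL's basename is inferred from the log line above rather than taken from this PR.

```python
from pathlib import Path
from urllib.parse import urlparse


def cached_checkpoint_path(url: str, checkpoint_dir: str) -> Path:
    """Map a checkpoint URL to the file expected in the local cache.

    Assumption: the cached file is named after the URL's last path component,
    as the "Downloading: ... to ..." log line above suggests.
    """
    return Path(checkpoint_dir) / Path(urlparse(url).path).name


def needs_download(url: str, checkpoint_dir: str) -> bool:
    """Return True when the checkpoint is not already present on disk."""
    return not cached_checkpoint_path(url, checkpoint_dir).is_file()


url = "https://download.pytorch.org/models/mobilenet_v2-b0353104.pth"
cache_dir = "/var/lib/jenkins/.cache/torch/hub/checkpoints"
print(cached_checkpoint_path(url, cache_dir))
```

With the file baked into the Docker image at that location, `needs_download` would return `False` and the job would not depend on download.pytorch.org being reachable.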
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100302
Approved by: https://github.com/ZainRizvi
2023-05-01 19:31:37 +00:00
371f587c92 Dockerize lint jobs (#94255)
This is to minimize network flakiness when running lint jobs.  I created a new Docker image for the linter and installed all linter dependencies there.  After that, all linter jobs were converted to use the Nova generic Linux job https://github.com/pytorch/test-infra/blob/main/.github/workflows/linux_job.yml with the new image.

For a future task: I encountered this issue with the current mypy version we are using and Python 3.11: https://github.com/python/mypy/issues/13627.  Fixing it requires upgrading mypy to a newer version, but that can be done separately (it requires formatting/fixing `*.py` files with the newer mypy version).

`collect_env` linter job is currently not included here as it needs older Python versions (3.5).  It could also be converted to use the same mechanism (with another Docker image, probably).  This one rarely fails though.

### Testing

BEFORE
https://github.com/pytorch/pytorch/actions/runs/4130366955 took a total of ~14m

AFTER
https://github.com/pytorch/pytorch/actions/runs/4130712385 also takes a total of ~14m
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94255
Approved by: https://github.com/ZainRizvi
2023-02-11 21:56:19 +00:00
dddc0b41db [ROCm] centos update endpoint repo and fix sudo (#92034)
* Update ROCm centos Dockerfile
* Update install_user.sh for centos sudo issue

Fixes the ROCm centos Dockerfile, which fails because the file https://packages.endpoint.com/rhel/7/os/x86_64/endpoint-repo-1.9-1.x86_64.rpm is not accessible.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92034
Approved by: https://github.com/malfet
2023-02-09 21:30:58 +00:00
6c4dc98b9d [CI][BE] Move docker folder to .ci (#93104)
Follow up after https://github.com/pytorch/pytorch/pull/92569

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93104
Approved by: https://github.com/huydhn, https://github.com/seemethere, https://github.com/ZainRizvi
2023-02-03 12:25:33 +00:00