Compare commits

...

111 Commits

Author SHA1 Message Date
9e4229de28 [inductor] getting AOT inductor to treat None args correctly
linter

remove import

ghstack-source-id: 72ceaf4a8e8c5bb2c465cf293c1e436876186645
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138910

lint

address feedback

lint

nit
2024-10-28 13:33:27 -07:00
e6180cd8ed Bugfix for passing None args to user defined Triton kernel (#138472)
add test

fewer failing tests

more tests passing

tests passing

lint

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138472
Approved by: https://github.com/aakhundov
2024-10-28 10:37:25 -07:00
d2e81d9c6f [CI/CD] Disable split build (#138752)
See https://github.com/pytorch/pytorch/issues/138750

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138752
Approved by: https://github.com/kit1980, https://github.com/huydhn
2024-10-28 10:37:25 -07:00
7e274747a9 [EZ] Fix typo in test_mps.py (#138738)
s/emedding_weight/embedding_weight/

Stolen from 074766d9b4

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138738
Approved by: https://github.com/atalman
2024-10-28 10:37:25 -07:00
52ca2d7075 Fix test on windows (#138641)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138641
Approved by: https://github.com/huydhn
2024-10-28 10:37:25 -07:00
f9c9b2f290 [hierarchical-compilation][inductor] Support invoke_subgraph HOP (#138031)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138031
Approved by: https://github.com/eellison
ghstack dependencies: #137538, #138036, #137965
2024-10-28 10:37:25 -07:00
9b8015ccf8 Add dump_launch_params config in triton/inductor (#137143)
Summary: Moves the checking of TORCHINDUCTOR_DUMP_LAUNCH_PARAMS into the config module to pull it out of the critical path.

Test Plan: Existing unit tests cover this env variable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137143
Approved by: https://github.com/eellison
2024-10-28 10:37:25 -07:00
beaa796c5d Refactor: Move _nested_int_aware_sort top level (#138693)
I need to use it from some other places later in the PR stack

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138693
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2024-10-28 10:37:25 -07:00
29191cc334 [export] fix test_unbacked_bindings_for_divisible_u_symint (#138607)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138607
Approved by: https://github.com/angelayi
2024-10-28 10:37:25 -07:00
4166d499a3 Clean up a c10::optional and fix documentation (#138700)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138700
Approved by: https://github.com/Skylion007
2024-10-28 10:37:25 -07:00
03b371eaeb Fix unbind_copy and add its decomposition (#134319)
* Fixes https://github.com/pytorch/pytorch/issues/130829

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134319
Approved by: https://github.com/amjames, https://github.com/eellison
2024-10-28 10:37:25 -07:00
b390a00516 Do not run CI on forks (#138714)
Add `if: github.repository_owner == 'pytorch'` for some jobs that were missing it

Fixes #138564
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138714
Approved by: https://github.com/huydhn, https://github.com/kit1980
2024-10-28 10:37:25 -07:00
1d8ced9d1d Introduce torch.sym_add, variadic add (#138660)
Tested internally here: https://www.internalfb.com/diff/D64057744
This is a reland after previous internal failures.
The main change is:
```
if min is None and max is None:
    torch._check_is_size(size)
    return
```

Partially addresses https://github.com/pytorch/pytorch/issues/128150

When you have big sums of values, we end up computing long chains of
binary addition in our FX graph representation.  Not only is this ugly,
it also is quadratic, as the sympy.Add constructor is O(N) in number
of arguments.  Instead, ensure that we maintain the summation as a
single FX node so we can do the entire addition all in one go.

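A minimal sketch of the intended usage, assuming the variadic signature the title describes (plain ints shown for brevity; the payoff is with SymInts in an FX graph):

```python
import torch

# With a variadic add, summing N values stays a single node instead of a
# chain of N-1 binary adds; building that chain is quadratic because the
# sympy.Add constructor is O(N) in its number of arguments.
sizes = [3, 4, 5, 6]
total = torch.sym_add(*sizes)  # one call: sym_add(3, 4, 5, 6)
assert total == sum(sizes)
```
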
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138660
Approved by: https://github.com/ezyang, https://github.com/bobrenjc93
2024-10-28 10:37:25 -07:00
8a6174dc2b Generate slice.Tensor view operations instead of as_strided when split is used in the original program. (#137225)
test_recompile asserts that the changes do not add more recompilations, comparing against the eager backend.
The motivation for this is that slice can be lowered in a more efficient way.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137225
Approved by: https://github.com/zou3519
2024-10-28 10:37:25 -07:00
65cf6d012f Add decomposition for permute_copy (#130944)
* Extracted from #129476

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130944
Approved by: https://github.com/amjames, https://github.com/eellison
2024-10-28 10:37:25 -07:00
3c15220ddb [SJD] [RFC] force setting last progress time (#138615)
Summary:
Currently, if watchdog + healthcheck are enabled via knobs but watchdog is disabled via SJD config, we observe a hang when the watchdog loop attempts to open the watchdog file path. This is because the FileTimerClient that is usually set in TorchElasticWatchdog will not be set, since disabling watchdog via SJD config bypasses the TorchElasticWatchdog initialization.

The workaround is to update the healthcheck time when calling `get_last_progress_time`

Test Plan:

Logs show that the progress time value is being changed despite the client not being set

Behavior when watchdog is enabled with SJD config is left unchanged

Differential Revision: D64733766

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138615
Approved by: https://github.com/gag1jain
2024-10-28 10:37:25 -07:00
583a007aa5 Revert "[PGNCCL] Use non-blocking mode by default in eager init (#138527)"
This reverts commit 8fbf866904661b16cba4c799af81121557ba9da8.

Reverted https://github.com/pytorch/pytorch/pull/138527 on behalf of https://github.com/jeanschmidt due to Seems to have introduce regressions on main, pull / linux-focal-cuda11.8-py3.10-gcc9 / test (distributed, 2, 3, linux.g4dn.12xlarge.nvidia.gpu) checking if revert will do ([comment](https://github.com/pytorch/pytorch/pull/138527#issuecomment-2432479338))
2024-10-28 10:37:25 -07:00
25eeca7eb7 Make trace log dir persist through multiple set_logs() calls (#137793)
Summary: Currently, calling `torch._logging.set_logs()` resets the log directory, leading to multiple tlparse outputs. This change prevents the dir from resetting after the first call.

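For context, a sketch of the repeated-call pattern this fixes, using the standard `torch._logging` API:

```python
import logging
import torch

torch._logging.set_logs(dynamo=logging.INFO)
# ... compile and run something ...
torch._logging.set_logs(dynamo=logging.DEBUG)
# Before this fix, the second call reset the trace log dir, splitting the
# run into multiple tlparse outputs; now the dir persists across calls.
```
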
Reviewed By: ezyang

Differential Revision: D64118047

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137793
Approved by: https://github.com/ezyang
2024-10-28 10:37:25 -07:00
d68453d068 [Inductor] New Triton Attrs Descriptor Fixups (#138390)
Fixes additional areas where we need to use the new Triton AttrsDescriptor if it is available.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138390
Approved by: https://github.com/jansel, https://github.com/huydhn
2024-10-28 10:37:25 -07:00
8424f6173c [CI] Introduces experiment awsa100 to inductor-perf-compare.yml workflow using _runner-determinator.yml (#138204)
Adds the job `get-test-label-type` in `.github/workflows/inductor-perf-compare.yml` checking for the experiment `awsa100`.

It is then used by the job `linux-focal-cuda12_1-py3_10-gcc9-inductor-build` to define the prefix for the runners that will run the benchmark.

Those runners temporarily accept the labels `awsa100.linux.gcp.a100` and `linux.aws.a100`. This is used so we can migrate, via experimentation, from `linux.gcp.a100`. After successfully experimenting with those instances, we will remove those labels, update the workflows to use `linux.aws.a100`, and decommission the GCP fleet.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138204
Approved by: https://github.com/ZainRizvi, https://github.com/huydhn
2024-10-28 10:37:25 -07:00
82dcfaa887 Eliminate c10 string_utils (#138499)
Test Plan: Sandcastle

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138499
Approved by: https://github.com/swolchok
2024-10-28 10:37:24 -07:00
c2aecbe5e4 [Quant][Inductor] expand quantization conv-binary(-unary) pattern fusion inside inductor (#138051)
### Summary
Expand quantization conv-binary(-unary) pattern fusion inside inductor to support the following two patterns (a runnable sketch of Pattern 1 follows the diagrams):
Pattern 1:
```
    Conv(X)   extra input
           \   /
            Add
             |
        Optional(relu)
             |
             Y
```
Pattern 2:
```
    extra input   Conv(X)
           \   /
            Add
             |
        Optional(relu)
             |
             Y
```

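As referenced above, a runnable sketch of an eager module matching Pattern 1 (the module structure is illustrative; the fusion itself happens in Inductor's pattern matcher):

```python
import torch

class ConvAddReLU(torch.nn.Module):
    """Pattern 1: Conv(X) and an extra input feed an Add, then an optional ReLU."""

    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(3, 8, kernel_size=3, padding=1)
        self.relu = torch.nn.ReLU()

    def forward(self, x, extra):
        return self.relu(self.conv(x) + extra)  # conv output on the left of Add

m = ConvAddReLU()
out = m(torch.randn(1, 3, 16, 16), torch.randn(1, 8, 16, 16))
```
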
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138051
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel, https://github.com/jgong5
2024-10-28 10:37:24 -07:00
d506df5d5e [CD] fix xpu support packages version (#138189)
Works for https://github.com/pytorch/pytorch/issues/114850
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138189
Approved by: https://github.com/EikanWang, https://github.com/malfet, https://github.com/atalman
2024-10-28 10:37:24 -07:00
63fa95cf2f [PGNCCL] Use non-blocking mode by default in eager init (#138527)
### Why use non-blocking mode in eager init?
For overlapping comm init and model init, etc.
![image](https://github.com/user-attachments/assets/9b0bf7a9-be26-4d16-827b-dbe861f083cd)

### Why can we set non-blocking as default?
If the setting is dangling -- i.e. not passed in by the user nor set via env -- `ProcessGroupNCCL` can apply some preferred logic. The torch-level API semantics do not change whether the NCCL comm is blocking or non-blocking (this is handled within `ProcessGroupNCCL`).

### Why not make non-blocking default for lazy mode as well?
PR https://github.com/pytorch/pytorch/pull/137544 tried it.
Two reasons why that's not preferred today:
1. It is hard -- too big a blast radius.
2. There is no gain from doing lazy init in non-blocking mode: the very next CPU call is a collective, and we will block there waiting for the comm to be ready, so it has the same effect as blocking init, with no overlap "opening" compared to eager mode. A minimal eager-init sketch follows.

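A minimal sketch of the eager-init path in question (passing `device_id` to `init_process_group` is what triggers eager comm init; the rendezvous env vars are assumed to be set by the launcher):

```python
import torch
import torch.distributed as dist

# Eager init: the NCCL comm is created here rather than at the first
# collective. In non-blocking mode this init can overlap with the model
# construction below. (Assumes RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT.)
dist.init_process_group("nccl", device_id=torch.device("cuda", 0))
model = torch.nn.Linear(8, 8).cuda()  # overlapped with comm init
```
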
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138527
Approved by: https://github.com/wconstab
ghstack dependencies: #137855, #138488, #138374, #138384
2024-10-28 10:37:24 -07:00
c8d7ea2038 Fixed dead lock in execution trace (#136892)
Summary:
This diff fixes a deadlock issue in execution trace. ExecutionTraceObserver takes a lock in recordOperatorStart and onFunctionExit. However, inside these two functions the input/output values are evaluated, which can acquire the Python GIL in some use cases. In that case, the lock order is: ET lock -> GIL.

One of the ads applications acquires the GIL first, then calls all-gather to collect some metrics from all ranks. When ET is on, the all-gather is captured by the ET observer. In that case, the lock order is: GIL -> ET lock.

That is why the deadlock happens. To fix it, I changed the ET lock scope so that the input/output evaluation is no longer inside the scope of the ET lock.

Test Plan: buck2 test mode/opt caffe2/test:test_profiler_cuda

Differential Revision: D63556608

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136892
Approved by: https://github.com/aaronenyeshi
2024-10-28 10:37:24 -07:00
453041e58f [ONNX] Fix sequence handling in graph building (#138656)
Prior to this PR, op.Concat was called without its required attribute (axis), and the val/arg handling seemed wrongly coded.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138656
Approved by: https://github.com/justinchuby
2024-10-28 10:37:24 -07:00
8ba468e0d6 add CUDA 12.6 to conda docker image (#138417)
Adds cuda 12.6 to common installation script.
Adds cuda 12.6 to conda docker image build matrix.

fixes #138440

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138417
Approved by: https://github.com/cyyever, https://github.com/atalman
2024-10-28 10:37:24 -07:00
2174aab940 Add support for SymFloats in split_module fx pass (#138599)
As discussed with @ezyang, this set of diffs extracts fixes to problems discovered when flipping `specialize_float=False` in https://github.com/pytorch/pytorch/pull/137782. Since these codepaths are exercised in existing tests, I'm going to bias towards shipping speed and put these up with the primary test plan as the global CI. These code paths are all tested via existing tests when `specialize_float=False`, and it feels a bit wonky to add more gated tests that only test behavior when this flag is True, especially since these code paths are already covered. That being said, I'm happy to add individual tests if reviewers insist or have a different POV.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138599
Approved by: https://github.com/ezyang
2024-10-28 10:37:24 -07:00
94fce5beda Support conditionals on sym node variables in the __bool__ and __len__ case (#138595)
As discussed with @ezyang, this set of diffs extracts fixes to problems discovered when flipping `specialize_float=False` in https://github.com/pytorch/pytorch/pull/137782. Since these codepaths are exercised in existing tests, I'm going to bias towards shipping speed and put these up with the primary test plan as the global CI. These code paths are all tested via existing tests when `specialize_float=False`, and it feels a bit wonky to add more gated tests that only test behavior when this flag is True, especially since these code paths are already covered. That being said, I'm happy to add individual tests if reviewers insist or have a different POV.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138595
Approved by: https://github.com/ezyang
2024-10-28 10:37:24 -07:00
1a1ec9f15f [ONNX] Avoid optimize onnx_dynamo-fallback (#138265)
Prior to this PR, when a model failed to export, we fell back to the legacy TorchScript exporter. However, we didn't stop once it was exported with the TorchScript exporter: an optimization pass was still applied to the graph.

Ideally the optimization would also boost the performance of models exported with the legacy TorchScript exporter, but for benchmarking purposes and for the fallback guarantee we give users, we should keep it simple and only return the graph.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138265
Approved by: https://github.com/xadupre, https://github.com/justinchuby
2024-10-28 10:37:24 -07:00
e60d44efd6 [PyTorch] Classify Unsupported mutated Dynamic Shapes as User Error (#137054)
Summary: We don't need a hard assert for unsupported dynamic-shape inputs; this removes the assert and raises a user exception instead.

Differential Revision: D63661569

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137054
Approved by: https://github.com/bdhirsh
2024-10-28 10:37:24 -07:00
cyy
1797f1c9dc Update ruff to 0.7.0 (#138597)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138597
Approved by: https://github.com/ezyang
2024-10-28 10:37:24 -07:00
ae3bb0645d [easy] Log subproc pool creation (#138642)
Summary: Request from internal to log subproc pool creation

Test Plan:
```
$ TORCH_LOGS=+torch._inductor.async_compile python ~/add.py
I1022 14:12:41.915000 444394 torch/_inductor/async_compile.py:165] Creating subprocess pool with 32 workers
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138642
Approved by: https://github.com/eellison
2024-10-28 10:37:24 -07:00
cyy
3efe2f4e05 [1/N] Don't skip ASAN on some tests (#138571)
Clang 15's ASAN is new enough that it's possible to re-evaluate the disabled ASAN tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138571
Approved by: https://github.com/ezyang
2024-10-28 10:37:24 -07:00
202e398db3 [tests] fix broken tests caused by AotEagerAndRecordGraphs typo (#138492)
Summary:
Name change happened in https://github.com/pytorch/pytorch/pull/138231

AttributeError: module 'torch._dynamo.testing' has no attribute 'AOTEagerAndRecordGraphs'. Did you mean: 'AotEagerAndRecordGraphs'?

Test Plan: ci

Differential Revision: D64704686

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138492
Approved by: https://github.com/aakhundov
2024-10-28 10:37:24 -07:00
0893b5dbd4 Update torchbench.txt (#138569)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138569
Approved by: https://github.com/huydhn, https://github.com/malfet
2024-10-28 10:37:24 -07:00
011e9bd00d [PGNCCL] Ensure comm is ready before all accesses (#138384)
Previously we only waited for the comm to become ready after its initialization.
That's not enough: other NCCL APIs can also leave the comm InProgress, e.g. P2P calls, commSplit, commFinalize, etc.
Therefore, we now ensure the comm is ready every "next time" we need to access ncclComm.
The place to add such a gatekeeper is `getNcclComm`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138384
Approved by: https://github.com/shuqiangzhang, https://github.com/fduwjj
ghstack dependencies: #137855, #138488, #138374
2024-10-28 10:37:24 -07:00
8bcd9e543c Fix .to(cpu) for Storage (#138011)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138011
Approved by: https://github.com/albanD
2024-10-28 10:37:24 -07:00
108a8311d5 [AOTI][refactor] Move use_minimal_arrayref_interface logic (#138250)
Summary: Move use_minimal_arrayref_interface specific logic from CppWrapperCpu to CppWrapperCpuArrayRef. This is a copy-on-write style refactor, to simplify the default AOTI generated code.

Differential Revision: [D64598715](https://our.internmc.facebook.com/intern/diff/D64598715)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138250
Approved by: https://github.com/chenyang78
ghstack dependencies: #138544, #138379
2024-10-28 10:37:24 -07:00
7b6905618e [AOTI] Fix check_model_with_multiple_inputs in test_aot_inductor (#138379)
Summary: Add missing use_minimal_arrayref_interface setting to check_model_with_multiple_inputs.

Differential Revision: [D64635211](https://our.internmc.facebook.com/intern/diff/D64635211)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138379
Approved by: https://github.com/hl475
ghstack dependencies: #138544
2024-10-28 10:37:24 -07:00
b25e2b459d Remove some pre-cpp17 stuff (#138410)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138410
Approved by: https://github.com/Skylion007
2024-10-28 10:37:24 -07:00
ab3bf915d2 Wrap autograd and autocast ops in training IR (#138516)
Differential Revision: [D64732361](https://our.internmc.facebook.com/intern/diff/D64732361)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138516
Approved by: https://github.com/yushangdi
ghstack dependencies: #138261
2024-10-28 10:37:24 -07:00
7f11f58108 Revert "[Inductor] New Triton Attrs Descriptor Fixups (#138390)"
This reverts commit 215999452eb5517213b3a31f72eb9a7e843d12a0.

Reverted https://github.com/pytorch/pytorch/pull/138390 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it still has another lint error ([comment](https://github.com/pytorch/pytorch/pull/138390#issuecomment-2430566004))
2024-10-28 10:37:24 -07:00
efd6c418cc Move test_serialize to training IR (#138261)
Differential Revision: [D64572253](https://our.internmc.facebook.com/intern/diff/D64572253)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138261
Approved by: https://github.com/yushangdi
2024-10-28 10:37:24 -07:00
63ae5e9554 Remove parallel_and and parallel_or (#138135)
Not used, suggested by @ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138135
Approved by: https://github.com/ezyang
2024-10-28 10:37:24 -07:00
cyy
775752512f [1/N] Enable cppcoreguidelines-special-member-functions (#137405)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137405
Approved by: https://github.com/ezyang
2024-10-28 10:37:24 -07:00
eeb3f6bcc8 [EZ][DTensor] Update DTensor readme to use the new import path (#138625)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138625
Approved by: https://github.com/XilunWu
2024-10-28 10:37:24 -07:00
3a52108033 [dynamo] reset compiler stance after test (#138277)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138277
Approved by: https://github.com/anijain2305, https://github.com/jansel
2024-10-28 10:37:24 -07:00
d37fb92fb5 [executorch hash update] update the pinned executorch hash (#135287)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135287
Approved by: https://github.com/pytorchbot, https://github.com/huydhn

Co-authored-by: Huy Do <huydhn@gmail.com>
2024-10-28 10:37:24 -07:00
c133661f5d Disabling amp context when invoking compiler (#138624)
Fix for https://github.com/pytorch/pytorch/issues/133974

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138624
Approved by: https://github.com/bdhirsh, https://github.com/drisspg
2024-10-28 10:37:24 -07:00
ec16fd258d [Inductor] New Triton Attrs Descriptor Fixups (#138390)
Fixes additional areas where we need to use the new Triton AttrsDescriptor if it is available.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138390
Approved by: https://github.com/jansel
2024-10-28 10:37:24 -07:00
c4b4c1793b Revert "Make Context to be Device-agnostic Step by Step (2/N) (#136526)"
This reverts commit 8aacbee8e0d6c03096f2ce94b70e2a8fab17ee81.

Reverted https://github.com/pytorch/pytorch/pull/136526 on behalf of https://github.com/wdvr due to this one has failing internal tests, not related to a landrace with #138398 - reverting this one ([comment](https://github.com/pytorch/pytorch/pull/136526#issuecomment-2430460176))
2024-10-28 10:37:24 -07:00
2a80480777 [sparse] add search for optimal alg_id to torch.compile (#137427)
Summary:

This PR adds a lowering for `torch._cslt_sparse_mm` to find the optimal
alg_id and cache it when running with `torch.compile`

Seeing speedups on both bfloat16 and float8 dtypes:
<img width="641" alt="Screenshot 2024-10-17 at 2 10 38 PM" src="https://github.com/user-attachments/assets/b928cd11-32a3-43e5-b209-8e4028896f0b">
<img width="1274" alt="Screenshot 2024-10-17 at 1 39 03 PM" src="https://github.com/user-attachments/assets/d9edd684-a8ec-46fd-b3da-2e76dbcb7bb6">

* `torch._cslt_sparse_mm_search` has been modified to return optimal
  split-k parameters as well as max alg_id.

* max_id is now available in `torch.backends.cusparselt` via
  `torch.backends.cusparselt.get_max_alg_id()`

* fixed meta registrations for float8

Test Plan:

python test/test_sparse_semi_structured.py

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137427
Approved by: https://github.com/cpuhrsch
2024-10-28 10:37:24 -07:00
c81e2466c3 [EZ] Use at::detail nested namespace in Dispatch.h (#138633)
Instead of `namespace at { namespace detail {`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138633
Approved by: https://github.com/Skylion007
2024-10-28 10:37:24 -07:00
be281e74a9 [AOTI][refactor] Clean up test_aot_inductor skip list (#138544)
Summary: Remove skips for already-fixed tests. Change the remaining skips to xfail so that the failure list can be more proactively maintained.

Differential Revision: [D64761257](https://our.internmc.facebook.com/intern/diff/D64761257)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138544
Approved by: https://github.com/chenyang78, https://github.com/hl475
2024-10-28 10:37:24 -07:00
17f8cec511 Add support for adding extra metadata to chromium events, log to separate columns (#138477)
This diff does a few things:

## Add metadata to events in progress
Adds the ability to add extra metadata to Chromium Events via `add_event_data`.
Metadata can only be added to chromium events that have started, but not ended (so, in progress events)
- When you add the data, the metadata is appended to the event when you call log_event_end() (see the sketch after this list).
- The metadata appears in chromium events in tlparse. It also gets logged to scuba.

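A rough sketch of the call pattern described above; the accessor name and keyword-style metadata here are assumptions based on this description, not verified API:

```python
# Sketch only: accessor and signature are assumed, not confirmed.
from torch._dynamo.utils import get_chromium_event_logger

log = get_chromium_event_logger()
# Attach extra metadata to an in-progress event; per the description it is
# appended when log_event_end() fires, then shows up in tlparse and scuba.
log.add_event_data("dynamo", guard_count=42)
```
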
## New `dynamo` chromium event
We add a new `dynamo` chromium event to the top of the stack, where we collect various metadata found in dynamo_compile. So the new order of events goes:

```
__start__
-> dynamo (dynamo compile metrics)
-> entire_frame_compile (compile.inner)
-> backend_compile (i.e. aotdispatch)
-> create_aot_dispatch_function
-> inductor_compile
-> ...
```

BackwardCompilationMetrics doesn't have any dynamo specific information (as it's mostly inductor timings). So we don't include that here.

*FAQ: Why can't we use `entire_frame_compile` as the event?*
This is mostly due to backward compatibility with `dynamo_compile`. `dynamo_compile` collects CompilationMetrics outside of `compile.compile_inner`, and uses `dynamo_timed` to grab timings from phases of the compiler, including `entire_frame_compile`. So we don't have a CompilationMetric object until after an `entire_frame_compile` event ends! Separately, `dynamo` as a name for all of dynamo compile is more descriptive than `entire_frame_compile`, imo.

## Log metadata as separate columns
(Meta only): Separately, this also changes the `metadata` column in PT2 Compile Events. Instead of logging a single metadata column in JSON, it separates the JSON into separate columns. This is much better for data analysis. Now that this table is more mature, I think logging keys to separate columns is a better system.

Differential Revision: [D64696287](https://our.internmc.facebook.com/intern/diff/D64696287/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D64696287/)!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138477
Approved by: https://github.com/aorenste
2024-10-28 10:37:24 -07:00
a3e229bec1 Fixes issue with torch._dynamo.assume_constant_result with global functions (#132431)
This PR fixes an issue with `torch._dynamo.assume_constant_result` causing global values to be overwritten.
Currently `torch._dynamo.assume_constant_result` saves the constant result into a global variable whose name is derived from the name of the function. This causes that function to be overwritten in the global scope. This PR also checks that the name is unique in the global scope, avoiding the issue of overwriting the function.

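A minimal repro-style sketch of the affected usage (standard `torch._dynamo.assume_constant_result` decorator):

```python
import torch

@torch._dynamo.assume_constant_result
def get_scale():
    return 2

@torch.compile
def f(x):
    return x * get_scale()

f(torch.ones(3))
# Before this fix, the cached constant could be stored under a global name
# derived from the function's name, clobbering `get_scale` itself.
```
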
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132431
Approved by: https://github.com/jansel
2024-10-28 10:37:24 -07:00
cfcd399c2e [export] Add retraceability_non_strict to tests (#138380)
Summary: We expand the tests to cover retraceability_non_strict. Currently failing tests are skipped.

Test Plan:
```
buck2 test @//mode/dev-nosan //caffe2/test:test_export -- -r _retraceability
```

Differential Revision: D64611532

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138380
Approved by: https://github.com/angelayi
2024-10-28 10:37:24 -07:00
def2f4ee78 Update copyrights to 2024 (#138638)
Spiritual successor of https://github.com/pytorch/pytorch/pull/119413 + CPP docs copyright update as well
Fixes https://github.com/pytorch/pytorch/issues/138630

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138638
Approved by: https://github.com/atalman
2024-10-28 10:37:24 -07:00
51fbd5873e Bump webrick from 1.7.0 to 1.8.2 in /ios/TestApp (#136593)
Bumps [webrick](https://github.com/ruby/webrick) from 1.7.0 to 1.8.2.
- [Release notes](https://github.com/ruby/webrick/releases)
- [Commits](https://github.com/ruby/webrick/compare/v1.7.0...v1.8.2)

---
updated-dependencies:
- dependency-name: webrick
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-10-28 10:37:24 -07:00
52efd5d4c0 Improve input validation for NJT pointwise ops (#138602)
Before this PR, NJT would dispatch e.g. `NJT * nested_int` to `mul.Tensor`, wrongly interpreting the SymInt as a tensor and outputting garbage. This PR verifies that there are no nested ints in the list of args before dispatching for pointwise ops.

I originally tried checking that `the number of passed tensor args == the number of func schema tensor args`, but this wrongly disallows `nt * 2`, which (non-intuitively to me at least at first) dispatches via the `mul.Tensor` overload.
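
A sketch of both sides of this behavior, using the jagged-layout nested tensor API:

```python
import torch

nt = torch.nested.nested_tensor(
    [torch.randn(2, 3), torch.randn(4, 3)], layout=torch.jagged
)
out = nt * 2          # allowed: the scalar dispatches via the mul.Tensor overload
ragged = nt.shape[1]  # a nested int (symbolic ragged dimension)
# nt * ragged         # previously produced garbage; now rejected by validation
```
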
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138602
Approved by: https://github.com/soulitzer
2024-10-28 10:37:24 -07:00
cyy
e7a6590abb [6/N] Fix extra warnings brought by clang-tidy-17 (#138572)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138572
Approved by: https://github.com/Skylion007
2024-10-28 10:37:24 -07:00
9988121c55 [ONNX] Add complex constant support (#138279)
Transform complex Python constants to a float representation as well, matching what we already do for tensors.

PS: I don't find it reasonable to add the complex->float handling on the IR side, so I put it here.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138279
Approved by: https://github.com/justinchuby

Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
2024-10-28 10:37:24 -07:00
53c9de34f5 Remove unused enforce_cond_guards_match Dynamo feature flag. (#138589)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138589
Approved by: https://github.com/clee2000
2024-10-28 10:37:24 -07:00
f78e347c40 Aarch64 binary builds - fix passing env_file to Docker (#138588)
Aarch64 builds skipped the logic that sources the binary env file. As a result, the PYTORCH_EXTRA_INSTALL_REQUIREMENTS passed to aarch64 builds did not include the triton dependency constraint. This PR makes sure aarch64 builds follow the same path as our regular manywheel builds.

To work around this issue we had to inject triton into aarch64 builds for release 2.5, which is not ideal: https://github.com/pytorch/builder/pull/2011
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138588
Approved by: https://github.com/jeanschmidt, https://github.com/malfet
2024-10-28 10:37:23 -07:00
eqy
316c9d4185 [Flex Attention] Don't compute fill order to compute stride order just to get fill order back (#138376)
Was a bit confusing to read when working on #138354

"computer-assisted proof"
```
import random
from typing import List

def argsort(seq):
    # preserve original order for equal strides
    getter = seq.__getitem__
    a_r = range(len(seq))
    return list(reversed(sorted(a_r, key=getter, reverse=True)))  # noqa: C413

def stride_order2fill_order(order):
    """
    Convert stride order to fill order
    For channel last format,

    stride order = [3, 0, 2, 1] and fill order = [1, 3, 2, 0]
    """
    lookup = {pos: idx for idx, pos in enumerate(order)}
    fill_order = [lookup[i] for i in range(len(order))]
    return fill_order

def get_stride_order(seq):
    """
    Convert strides to stride order
    """
    sorted_idx: List[int] = argsort(seq)
    out = [0 for _ in range(len(seq))]
    a = sorted_idx.copy()
    for i, elem in enumerate(sorted_idx):
        out[elem] = i
    fillorder = stride_order2fill_order(out)
    assert fillorder == sorted_idx
    return out

for _ in range(1000):
    a = [0, 1, 2, 3]
    random.shuffle(a)
    get_stride_order(a)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138376
Approved by: https://github.com/drisspg
2024-10-28 10:37:23 -07:00
8cfbfceded [Inductor][ROCm][CK] add CK grouped conv2d fwd kernels to ROCm codegen (#137947)
Plugging into lowering and end-to-end testing will come in a later PR.

Instance parsing companion PR https://github.com/ROCm/composable_kernel/pull/1585

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137947
Approved by: https://github.com/ColinPeppler, https://github.com/chenyang78
2024-10-28 10:37:23 -07:00
c8008102fd [EZ] [BE] Remove (now) unused scale config (#138511)
Final step of moving scale config files to test-infra repo.  Details in https://github.com/pytorch/test-infra/pull/5767

Scale configs are now read from test-infra.  This PR is just cleaning up stale files
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138511
Approved by: https://github.com/clee2000
2024-10-28 10:37:23 -07:00
07525763d0 Fix for MSVC problem on Windows Arm64 (#136765)
This PR proposes a workaround for an internal issue introduced in MSVC 14.37 for the Windows Arm64 target. It is still an ongoing problem.
The fix will be released with a future version of Visual Studio 2022, but until then the changes to cpu/vec/vec_base.h should be sufficient.
We also opened a new ticket on Visual Studio Developer Community, it can be found here: https://developercommunity.visualstudio.com/t/MSVC-loop-unrolling-problem-194033813-/10720692

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136765
Approved by: https://github.com/malfet

Co-authored-by: Stefan-Alin Pahontu <56953855+alinpahontu2912@users.noreply.github.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Co-authored-by: Eli Uriegas <1700823+seemethere@users.noreply.github.com>
2024-10-28 10:37:23 -07:00
7f69d96c71 Revert "Remove C10_DEPRECATED (#138406)"
This reverts commit 70ec86d7542d461ff6f01ba1a1c9a4f38637af8e.

Reverted https://github.com/pytorch/pytorch/pull/138406 on behalf of https://github.com/wdvr due to failing internal tests - see D64714374 ([comment](https://github.com/pytorch/pytorch/pull/138406#issuecomment-2429912896))
2024-10-28 10:37:23 -07:00
01ff36e5e2 Upload artifacts during test run (#125799)
Zip and upload artifacts while run_test is running
Upgrade boto3 because I get errors about not having `botocore.vendored.six.move` if I don't
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125799
Approved by: https://github.com/huydhn
2024-10-28 10:37:23 -07:00
b4f3a4c5c5 [hierarchical-compilation][invoke_subgraph] Use tracing context to cache artifacts of dispatch keys (#137965)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137965
Approved by: https://github.com/zou3519
ghstack dependencies: #137538, #138036
2024-10-28 10:37:23 -07:00
fd779b287c [hierarchical-compilation][invoke_subgraph] Graph break on input mutation or aliasing (#138036)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138036
Approved by: https://github.com/zou3519
ghstack dependencies: #137538
2024-10-28 10:37:23 -07:00
85814f9047 [hierarchical-compilation][hop] Introduce invoke_subgraph (#137538)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137538
Approved by: https://github.com/zou3519
2024-10-28 10:37:23 -07:00
170c622400 [ROCm] index_put performance improvement (#138259)
On ROCm, using a non-vectorized index_put kernel provides ~2x perf improvement over the hipified CUDA kernel. None of the existing unit tests were exercising the large-index case, so a new unit test was added.

It was also noted that the scale value in the original kernel was hard-coded to 1.0 which would be a no-op, so it was removed from the simplified rocm kernel.

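An illustrative large-index case of the kind the new unit test exercises (sizes here are made up; requires a GPU):

```python
import torch

t = torch.zeros(1_000_000, device="cuda")
idx = torch.randint(0, t.numel(), (500_000,), device="cuda")
vals = torch.ones(500_000, device="cuda")
t.index_put_((idx,), vals, accumulate=True)  # large-index accumulate path
```
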
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138259
Approved by: https://github.com/xw285cornell, https://github.com/leitian, https://github.com/eqy
2024-10-28 10:37:23 -07:00
9e901b34da [AOTI][reland] Fix test_index_put_with_none_index_cpu_with_stack_allocation (#138541)
Summary: The problem happened after splitting CppWrapperCpu and CppWrapperCpuArrayRef, because CppWrapperCpuArrayRef.generate_index_put_fallback missed a statement.

Running test_aot_inductor.py as a whole didn't reveal the problem, but running test_index_put_with_none_index_cpu_with_stack_allocation individually did. Digging deeper, the root cause is that init_backend_registration had incorrectly cached the CPU CppWrapperCodegen class, which means CppWrapperCpuArrayRef was never picked when running test_aot_inductor.py as a whole. To fix the problem, all the ArrayRef tests are split into a separate file. Also, a code check is added to regex-match AOTInductorModelRunMinimalArrayrefInterface so this kind of false passing signal won't go unnoticed.

Differential Revision: [D64734106](https://our.internmc.facebook.com/intern/diff/D64734106)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138541
Approved by: https://github.com/frank-wei
2024-10-28 10:37:23 -07:00
442a63a9ba config: simplify most of the config handling and fix some bugs (#138377)
This PR combines a number of cleanups in one PR. If any of the specific cleanups don't seem to make sense, let me know and I can remove them.

Cleanups

- This PR adds a set of test suites for the config module code, which handles basically all the APIs and ways it is used. Please let me know if you see anything critical that is not tested that I missed. This test suite is primarily used as the regression test suite for later changes in this diff. Note that there is some dynamo specific testing of the config module, but it isn't as verbose.
- I removed all internal usage of shallow_copy_dict. Those usages could all use the deep copy, and did not depend on the reference behavior of certain config values that shallow_copy_dict allows.
- I removed shallow copy semantics for configuration with a deprecation warning. I think this requires a release note, so hopefully I did that correctly. Let me know if we want to continue to expose shallow copy value semantics, but I just can't find a case where I expect anyone would want it. It also complicated later internal changes to the API (i.e. breaking apart various layers of the config changes).
- I fixed what I believe is a bug in how hashes are calculated on configs. In particular, if you got the hash, then made a config change, and then got the hash again, it would not update the hash. @oulgen, please let me know if I'm misunderstanding this behavior and it is desired.
- I switched our multiple implementations of iterating through the dictionary to a single one. This is primarily to make later changes easier, but it also makes it clear how inconsistent our various config ignoring options are. Let me know if people would be interested in me unifying the various options for ignoring config values.
- I updated the test patcher (not the performance critical one, just the normal one), to use __setattr__ and __getattr__ to remove direct API access to the underlying config fetcher.

For release notes: not sure exactly how to communicate this, but something like
"ConfigModule.to_dict and ConfigModule.shallow_copy_dict no longer retain their shallow-copy semantics, which allowed referenced value objects to be modified. If you wish to modify the config object, call load_config explicitly".

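A sketch of the new value semantics described above, using `torch._inductor.config` and its `max_autotune` option as assumed examples:

```python
import torch._inductor.config as cfg

before = cfg.max_autotune
snap = cfg.to_dict()               # now a true deep copy
snap["max_autotune"] = not before  # mutating the copy...
assert cfg.max_autotune == before  # ...no longer leaks into the live config
cfg.load_config(snap)              # apply the change explicitly instead
assert cfg.max_autotune != before
```
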
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138377
Approved by: https://github.com/ezyang, https://github.com/jansel, https://github.com/jovianjaison
2024-10-28 10:37:23 -07:00
297258a623 Add type stub for SymInt.rsub (#138543)
Fixes https://github.com/pytorch/pytorch/issues/138478

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138543
Approved by: https://github.com/malfet
2024-10-28 10:37:23 -07:00
21ce69be05 Add out_dtype kw argument to optimize_bsr_dense_addmm (#136626)
As in the title.

Addresses the task in https://github.com/pytorch/ao/pull/821#issuecomment-2373290266

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136626
Approved by: https://github.com/amjames, https://github.com/cpuhrsch
2024-10-28 10:37:23 -07:00
53f25d34fd [compiled autograd] tls access helpers (#138061)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138061
Approved by: https://github.com/yf225
ghstack dependencies: #137953, #137821
2024-10-28 10:37:23 -07:00
e969f3f2d7 [compiled autograd] Compiled autograd configs in TLS (#137821)
Multithreading doesn't work yet; this adds Python-side TLS only, for the Python-side state.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137821
Approved by: https://github.com/jansel, https://github.com/yf225
ghstack dependencies: #137953
2024-10-28 10:37:23 -07:00
c9ff5375ff [compiled autograd] directly use python Logger class in cpp (#137953)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137953
Approved by: https://github.com/jansel, https://github.com/yf225
2024-10-28 10:37:23 -07:00
0a40ddd336 [aoti] Cond symint input support (#138373)
If the input is a symint, we don't want to emit the aoti_torch_assign_tensors_out call.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138373
Approved by: https://github.com/larryliu0820, https://github.com/desertfire
2024-10-28 10:37:23 -07:00
1880dcfa78 make DimHints compatible with Dims (#138490)
Previously we'd been raising UserErrors when `Dim()` and DimHints (`Dim.AUTO/Dim.DYNAMIC`) were both specified in `dynamic_shapes`. This PR stops that, and instead uses `Dim()` objects to guide DimHints.

The key to this was making the `EqualityConstraint` class happy when it checks that inferred equivalence relations were specified in the original `dynamic_shapes` spec, and this introduces a `RelaxedConstraint` object to mark the hinted dimensions, so equality checks between `RelaxedConstraints` and other constraints are treated as valid.

Current behavior is that:
```
class Foo(torch.nn.Module):
    def forward(self, x, y):
        return x - y

inputs = (torch.randn(4, 4), torch.randn(4, 4))
shapes = {
    "x": (Dim.AUTO, Dim("d1", min=3)),
    "y": (Dim("d0", max=8), Dim.DYNAMIC),
}
ep = export(Foo(), inputs, dynamic_shapes=shapes)
```

The dimensions marked `AUTO` and `DYNAMIC` will have max & min ranges of 8 & 3 respectively. Note that inferred equality between `Dim()` objects & `Dim.STATIC` will still raise errors - `Dim()` suggests not specializing to a constant.

Differential Revision: D64636101

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138490
Approved by: https://github.com/avikchaudhuri
2024-10-28 10:37:23 -07:00
01e43190a3 [SDPA-CUDNN] Make CuDNN Attention Opt in (#138522)
# Summary
Currently we have a `cudnn_order` that says: on H100, with a new enough cuDNN backend (we ship a 9.1 version in OSS), try to run cuDNN attention first. We have already encountered a few bugs with the release of 2.5:

1. https://github.com/pytorch/pytorch/issues/138529
2. https://github.com/huggingface/diffusers/issues/9704
3. https://github.com/pytorch/pytorch/pull/138354

In light of the above we are going to make the CuDNN backend Opt-in by default.

This can be done easily with the context manager for choosing backends I.e.:
``` Python
from torch.nn.attention import sdpa_kernel, SDPBackend

with sdpa_kernel(SDPBackend.CUDNN_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v)

```

This PR puts the CuDNN backend as the lowest precedence in the backend list, meaning that the Math backend will always be chosen unless disabled (which is done via the context manager).

Cc @atalman

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138522
Approved by: https://github.com/ngimel, https://github.com/eqy, https://github.com/malfet
2024-10-28 10:37:23 -07:00
debc2170c0 Mark torch.get_device as overridable at the python level (#132706)
Summary:
- add a value to `get_testing_overrides` function for `torch.get_device()`
- remove `torch.get_device()` from the `get_ignored_functions` list (a quick check is sketched below)

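As noted above, a quick check of the two changes via the public `torch.overrides` helpers:

```python
import torch
from torch.overrides import get_ignored_functions, get_testing_overrides

assert torch.get_device in get_testing_overrides()      # now has a dummy override
assert torch.get_device not in get_ignored_functions()  # no longer ignored
```
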
Test Plan:
Existing override testing infra, which should pick up the updates to these two variables.

Closes the loop on:
https://github.com/pytorch/pytorch/pull/132567

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132706
Approved by: https://github.com/ezyang
2024-10-28 10:37:23 -07:00
dcd776c0ad bug in unbacked_bindings for a*u0 (#138136)
Summary: we were storing a*u0 instead of u0 in unbacked_bindings / unbacked_var_to_val

Test Plan: -

Differential Revision: D64508936

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138136
Approved by: https://github.com/ezyang
2024-10-28 10:37:23 -07:00
91ac179599 [pt2] Log is_forward field to dynamo_compile scuba table (#138505)
Differential Revision: [D64711721](https://our.internmc.facebook.com/intern/diff/D64711721)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138505
Approved by: https://github.com/oulgen
2024-10-28 10:37:23 -07:00
356bd932d8 [CP] Implement AllGather based context parallelism (#132820)
Summary:

This implementation does not exploit the fact that, after the all-gather, we could perform the SDPA directly rather than doing the ring-based SDPA; however, we can overlap the communication with the first sharded KV computation. This implementation shows some performance benefit and memory savings compared to the original all-to-all implementation in certain cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132820
Approved by: https://github.com/XilunWu
2024-10-28 10:37:23 -07:00
8a04f3e0ff [PGNCCL] Add default value for nccl_nonblocking_timeout (#138374)
- Added default value for `nccl_nonblocking_timeout` (30 mins, previous: -1).
- Reuse C10D_CHECK_TIMEOUT in other CHECK macros

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138374
Approved by: https://github.com/eqy
ghstack dependencies: #137855, #138488
2024-10-28 10:37:23 -07:00
ba828d0fc6 Properly uses ref-counting for torch.cuda.use_mem_pool (#133600)
This PR refactors some ref-counting functionality out of `beginAllocateToPool` and `releasePool`. The ref-counting logic is then used in construction and destruction of `torch.cuda.MemPool`.

The `use_count` variable in the CUDACachingAllocator is essentially a refcount of how many context managers are using the pool. Since we are now lifting the MemPool abstraction up to the user, the MemPool object itself now needs to hold an extra reference as well.

Part of https://github.com/pytorch/pytorch/issues/124807.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133600
Approved by: https://github.com/eqy, https://github.com/ezyang
2024-10-28 10:37:23 -07:00
fedbbd83dd [easy] in ROCmTemplate set kwargs when creating Buffer (#138521)
Summary: https://github.com/pytorch/pytorch/pull/137768 makes Inductor IR kw only

Test Plan: CI

Differential Revision: D64723804

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138521
Approved by: https://github.com/tenpercent, https://github.com/chenyang78
2024-10-28 10:37:23 -07:00
cyy
828536de0f Use Wmissing-prototypes on torch_cuda (#136080)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136080
Approved by: https://github.com/ezyang
2024-10-28 10:37:23 -07:00
4155456b85 Fix training IR bug by changing passes order (#138292)
Inserting runtime_assertions causes the gm to have different names, but the graph signature was populated earlier. To avoid this kind of error in the future, I refactored these steps into a helper function.

Differential Revision: [D64576251](https://our.internmc.facebook.com/intern/diff/D64576251)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138292
Approved by: https://github.com/avikchaudhuri
ghstack dependencies: #138266
2024-10-28 10:37:23 -07:00
8f3f9044a0 Don't try to load cufile (#138501)
Trying to load it caused a big issue with the 2.5.0 release - https://github.com/pytorch/pytorch/issues/138324

cufile is not actually used currently by default, see #133489

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138501
Approved by: https://github.com/atalman, https://github.com/mikaylagawarecki, https://github.com/malfet
2024-10-28 10:37:23 -07:00
56e3472fea Training IR should preserve custom metadata (#138266)
Differential Revision: [D64576252](https://our.internmc.facebook.com/intern/diff/D64576252)

@diff-train-skip-merge
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138266
Approved by: https://github.com/yushangdi
2024-10-28 10:37:23 -07:00
73121c8521 [inductor] add a threshold for membw saving during fusion (#136782)
Fixes https://github.com/pytorch/pytorch/issues/133242. In that issue, Inductor fuses 2 nodes because they access the same scalar tensor. This saving is very small (4 bytes), and if we ignore it, by default we cannot fuse. But if loop ordering after fusion kicks in, we can reorder loops and fuse those 2 nodes, and we get 33% memory bandwidth savings.

I think adding a threshold for membw saving in general is not bad.

I'll run a perf test. ( https://github.com/pytorch/pytorch/actions/runs/11375421752 )

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136782
Approved by: https://github.com/jansel
2024-10-28 10:37:23 -07:00
03da94b981 Revert "[AOTI] Fix test_index_put_with_none_index_cpu_with_stack_allocation (#138303)"
This reverts commit 1417b2cd0562e0e4d4349024ef7c731b99214890.

Reverted https://github.com/pytorch/pytorch/pull/138303 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/138303#issuecomment-2427991065))
2024-10-28 10:37:23 -07:00
c7e063c42c [DeviceMesh] Use split_group to create sub_groups for nccl backend if the default pg is eagerly initialized (#138129)
Use `split_group()` to create sub_groups for nccl backend if the default pg is eagerly initialized. Otherwise, it will still go through the normal lazy init process and call `new_group()` instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138129
Approved by: https://github.com/kwen2501
2024-10-28 10:37:23 -07:00
39169e2e44 Fixes issue with enums in a tuple for dynamo (#133123)
Currently, when tuple values are encountered in dynamo, they are encoded using `repr(arg)`. This causes an issue if one of the values inside the tuple is not properly encoded by repr. In this case, if an enum is contained inside a tuple, invalid Python code is generated.

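A minimal repro-style sketch of the failing pattern, an enum member inside a tuple argument:

```python
import enum
import torch

class Mode(enum.Enum):
    FAST = 1
    SLOW = 2

@torch.compile
def f(x, cfg):
    # cfg is a tuple containing an enum member; encoding it with repr()
    # used to generate invalid Python source.
    return x + 1 if cfg[0] is Mode.FAST else x - 1

f(torch.ones(3), (Mode.FAST,))
```
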
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133123
Approved by: https://github.com/jansel
2024-10-28 10:37:23 -07:00
d8e5d7ddea Add environment variable to force no weights_only load (#138225)
In preparation for the `weights_only` flip, for users who don't have access to the `torch.load` call.

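A sketch of the escape hatch; the variable name `TORCH_FORCE_NO_WEIGHTS_ONLY_LOAD` is used on the assumption that it matches this PR's code:

```python
import os
import torch

# Assumed name: forces torch.load to behave as weights_only=False even
# after the default flips, for code that can't edit the torch.load call.
os.environ["TORCH_FORCE_NO_WEIGHTS_ONLY_LOAD"] = "1"

torch.save({"w": torch.ones(2)}, "ckpt.pt")
state = torch.load("ckpt.pt")  # loads without the weights_only restriction
```
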
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138225
Approved by: https://github.com/albanD
2024-10-28 10:37:23 -07:00
5a04072af5 [Traceable FSDP2][CI] Skip more tests on rocm (#138497)
Some of the test checks don't work well with ROCm.

Fixes https://github.com/pytorch/pytorch/issues/138409.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138497
Approved by: https://github.com/fduwjj
2024-10-28 10:37:23 -07:00
a3d27ce9f2 [inductor][subgraph] Add size asserts (#138424)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138424
Approved by: https://github.com/eellison
ghstack dependencies: #137555
2024-10-28 10:37:23 -07:00
715092be96 [AC] Backward Pass Aware AC - adding hooks to partitioner to pass callable (#137785)
Summary: same as title. The plan is to pass a callable to the partitioner to perform custom autoAC via an ILP. This is the same as a previous diff, D63714905, which was landed and then subsequently reverted by PyTorch Release Engineering because of a failing unit test (f7b8d36c28). We think the unit test is buggy, and we also fix it here.

Test Plan: tbd

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137785
Approved by: https://github.com/basilwong

Co-authored-by: Huy Do <huydhn@gmail.com>
2024-10-28 10:37:23 -07:00
99d88c7c46 Log all failing test repros to scuba (#138394)
This has the benefit that

1) It's much easier to aggregate test failure repros into say a CSV or shell script from scuba
2) We can do analysis (e.g. take the set difference of tests across two PRs)
3) We can get results faster at the test-level granularity instead of job-level granularity we see in the HUD/GH.

I tested this by introducing a breaking change, adding the ci-scribe label, and then verifying that the failed tests were logged to scuba: https://fburl.com/scuba/torch_open_source_signpost/w6qt7qr9

I then reverted the breaking change and published this PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138394
Approved by: https://github.com/ezyang
2024-10-28 10:37:23 -07:00
55c1b7229a More appropriate socket errors and debug messages (#130347)
Fixes #128998

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130347
Approved by: https://github.com/fduwjj
2024-10-28 10:37:22 -07:00
2772c6d4aa [Forward Fix][PGNCCL] Add define guard for NCCL_SPLIT_NOCOLOR (#138488)
Forward fix for build issue introduced by #137855:
```
In file included from fbcode/caffe2/torch/csrc/distributed/c10d/NCCLUtils.cpp:2:
fbcode/caffe2/torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp:508:21: error: use of undeclared identifier 'NCCL_SPLIT_NOCOLOR'
  508 |     int split_color{NCCL_SPLIT_NOCOLOR - 1};
      |                     ^
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138488
Approved by: https://github.com/fduwjj
ghstack dependencies: #137855
2024-10-28 10:37:22 -07:00
8cd8e3833c Support record_stream() for NJT (#137099)
Does what it says on the tin. I believe the right behavior here is to ensure that `record_stream()` is called on all tensor components of the NJT to ensure they all live until stream computation is complete.

This is an ask from torchrec as the op is used there.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137099
Approved by: https://github.com/ngimel
2024-10-28 10:37:22 -07:00
e96e531ac3 Remove C10_DEPRECATED (#138406)
Looking in the code I see
```
// NB: __cplusplus doesn't work for MSVC, so for now MSVC always uses
// the "__declspec(deprecated)" implementation and not the C++14
// "[[deprecated]]" attribute. We tried enabling "[[deprecated]]" for C++14 on
// MSVC, but ran into issues with some older MSVC versions.
```
But looking at the [MSVC C++ support table](https://learn.microsoft.com/en-us/cpp/overview/visual-cpp-language-conformance?view=msvc-170) I see that the `[[deprecated]]` attribute is supported as of MSVC 2015 and that the vast majority of C++17 features became supported in MSVC 2015 _or later_.

Since PyTorch is C++17 now, I infer that PyTorch must not support versions of MSVC earlier than MSVC 2015, so the versions of MSVC supported by PyTorch must support `[[deprecated]]`.

Therefore, since we are finished supporting those old MSVC versions, we can remove `C10_DEPRECATED`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138406
Approved by: https://github.com/cyyever, https://github.com/malfet
2024-10-28 10:37:22 -07:00
40e2169c40 [user triton] typing triton_kernel_wrap.py (#138230)
Remove `# mypy: allow-untyped-defs` from triton_kernel_wrap.py, and fix all the mypy errors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138230
Approved by: https://github.com/oulgen, https://github.com/Skylion007
2024-10-28 10:37:22 -07:00
12f2d34f12 Use cuda 12.4 pytorch_extra_install_requirements as default (#138458)
Since CUDA 12.4 binaries are now the default binaries on PyPI, the pytorch_extra_install_requirements need to use 12.4.
This needs to be cherry-picked to the release 2.5 branch to avoid injecting the wrong versions into metadata during PyPI promotion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138458
Approved by: https://github.com/malfet
2024-10-28 10:37:22 -07:00
278 changed files with 7321 additions and 4539 deletions

```
@@ -1 +1 @@
-cd1c833b079adb324871dcbbe75b43d42ffc0ade
+ca4783992ed7602a39528ba304d61f00396b2a5a
```

```
@@ -137,6 +137,39 @@ function install_124 {
  ldconfig
}

function install_126 {
  echo "Installing CUDA 12.6.2 and cuDNN ${CUDNN_VERSION} and NCCL ${NCCL_VERSION} and cuSparseLt-0.6.2"
  rm -rf /usr/local/cuda-12.6 /usr/local/cuda
  # install CUDA 12.6.2 in the same container
  wget -q https://developer.download.nvidia.com/compute/cuda/12.6.2/local_installers/cuda_12.6.2_560.35.03_linux.run
  chmod +x cuda_12.6.2_560.35.03_linux.run
  ./cuda_12.6.2_560.35.03_linux.run --toolkit --silent
  rm -f cuda_12.6.2_560.35.03_linux.run
  rm -f /usr/local/cuda && ln -s /usr/local/cuda-12.6 /usr/local/cuda
  # cuDNN license: https://developer.nvidia.com/cudnn/license_agreement
  mkdir tmp_cudnn && cd tmp_cudnn
  wget -q https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz -O cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz
  tar xf cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz
  cp -a cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive/include/* /usr/local/cuda/include/
  cp -a cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive/lib/* /usr/local/cuda/lib64/
  cd ..
  rm -rf tmp_cudnn
  # NCCL license: https://docs.nvidia.com/deeplearning/nccl/#licenses
  # Follow build: https://github.com/NVIDIA/nccl/tree/master?tab=readme-ov-file#build
  git clone -b $NCCL_VERSION --depth 1 https://github.com/NVIDIA/nccl.git
  cd nccl && make -j src.build
  cp -a build/include/* /usr/local/cuda/include/
  cp -a build/lib/* /usr/local/cuda/lib64/
  cd ..
  rm -rf nccl
  install_cusparselt_062
  ldconfig
}

function prune_118 {
  echo "Pruning CUDA 11.8 and cuDNN"
  #####################################################################################

@@ -227,12 +260,46 @@ function prune_124 {
  $NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublasLt_static.a -o $CUDA_LIB_DIR/libcublasLt_static.a
  #####################################################################################
  # CUDA 12.1 prune visual tools
  # CUDA 12.4 prune visual tools
  #####################################################################################
  export CUDA_BASE="/usr/local/cuda-12.4/"
  rm -rf $CUDA_BASE/libnvvp $CUDA_BASE/nsightee_plugins $CUDA_BASE/nsight-compute-2024.1.0 $CUDA_BASE/nsight-systems-2023.4.4/
}

function prune_126 {
  echo "Pruning CUDA 12.6"
  #####################################################################################
  # CUDA 12.6 prune static libs
  #####################################################################################
  export NVPRUNE="/usr/local/cuda-12.6/bin/nvprune"
  export CUDA_LIB_DIR="/usr/local/cuda-12.6/lib64"
  export GENCODE="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"
  export GENCODE_CUDNN="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"
  if [[ -n "$OVERRIDE_GENCODE" ]]; then
    export GENCODE=$OVERRIDE_GENCODE
  fi
  if [[ -n "$OVERRIDE_GENCODE_CUDNN" ]]; then
    export GENCODE_CUDNN=$OVERRIDE_GENCODE_CUDNN
  fi
  # all CUDA libs except CuDNN and CuBLAS
  ls $CUDA_LIB_DIR/ | grep "\.a" | grep -v "culibos" | grep -v "cudart" | grep -v "cudnn" | grep -v "cublas" | grep -v "metis" \
    | xargs -I {} bash -c \
    "echo {} && $NVPRUNE $GENCODE $CUDA_LIB_DIR/{} -o $CUDA_LIB_DIR/{}"
  # prune CuDNN and CuBLAS
  $NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublas_static.a -o $CUDA_LIB_DIR/libcublas_static.a
  $NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublasLt_static.a -o $CUDA_LIB_DIR/libcublasLt_static.a
  #####################################################################################
  # CUDA 12.6 prune visual tools
  #####################################################################################
  export CUDA_BASE="/usr/local/cuda-12.6/"
  rm -rf $CUDA_BASE/libnvvp $CUDA_BASE/nsightee_plugins $CUDA_BASE/nsight-compute-2024.3.2 $CUDA_BASE/nsight-systems-2024.5.1/
}

# idiomatic parameter and option handling in sh
while test $# -gt 0
do

@@ -243,6 +310,8 @@ do
    ;;
    12.4) install_124; prune_124
    ;;
    12.6) install_126; prune_126
    ;;
    *) echo "bad argument $1"; exit 1
    ;;
esac
```


@ -41,13 +41,16 @@ function install_ubuntu() {
libegl-mesa0 libegl1-mesa libegl1-mesa-dev libgbm1 libgl1-mesa-dev libgl1-mesa-dri \
libglapi-mesa libgles2-mesa-dev libglx-mesa0 libigdgmm12 libxatracker2 mesa-va-drivers \
mesa-vdpau-drivers mesa-vulkan-drivers va-driver-all vainfo hwinfo clinfo
if [[ "${XPU_DRIVER_TYPE,,}" == "rolling" ]]; then
apt-get install -y intel-ocloc
fi
# Development Packages
apt-get install -y libigc-dev intel-igc-cm libigdfcl-dev libigfxcmrt-dev level-zero-dev
# Install Intel Support Packages
if [ -n "$XPU_VERSION" ]; then
apt-get install -y intel-for-pytorch-gpu-dev-${XPU_VERSION} intel-pti-dev
apt-get install -y intel-for-pytorch-gpu-dev-${XPU_VERSION} intel-pti-dev-0.9
else
apt-get install -y intel-for-pytorch-gpu-dev intel-pti-dev
apt-get install -y intel-for-pytorch-gpu-dev-0.5 intel-pti-dev-0.9
fi
# Cleanup
@ -97,7 +100,7 @@ EOF
intel-igc-opencl-devel level-zero-devel intel-gsc-devel libmetee-devel \
level-zero-devel
# Install Intel Support Packages
yum install -y intel-for-pytorch-gpu-dev intel-pti-dev
yum install -y intel-for-pytorch-gpu-dev-0.5 intel-pti-dev-0.9
# Cleanup
dnf clean all
@ -131,7 +134,7 @@ function install_sles() {
zypper install -y libigdfcl-devel intel-igc-cm libigfxcmrt-devel level-zero-devel
# Install Intel Support Packages
zypper install -y intel-for-pytorch-gpu-dev intel-pti-dev
zypper install -y intel-for-pytorch-gpu-dev-0.5 intel-pti-dev-0.9
}


@ -70,6 +70,10 @@ FROM cuda as cuda12.4
RUN bash ./install_cuda.sh 12.4
ENV DESIRED_CUDA=12.4
FROM cuda as cuda12.6
RUN bash ./install_cuda.sh 12.6
ENV DESIRED_CUDA=12.6
# Install MNIST test data
FROM base as mnist
ADD ./common/install_mnist.sh install_mnist.sh
@ -79,6 +83,7 @@ FROM base as all_cuda
COPY --from=cuda11.8 /usr/local/cuda-11.8 /usr/local/cuda-11.8
COPY --from=cuda12.1 /usr/local/cuda-12.1 /usr/local/cuda-12.1
COPY --from=cuda12.4 /usr/local/cuda-12.4 /usr/local/cuda-12.4
COPY --from=cuda12.6 /usr/local/cuda-12.6 /usr/local/cuda-12.6
# Final step
FROM ${BASE_TARGET} as final


@ -1,10 +1,12 @@
# cf. https://github.com/pypa/manylinux/issues/53
import sys
from urllib.request import urlopen
GOOD_SSL = "https://google.com"
BAD_SSL = "https://self-signed.badssl.com"
import sys
print("Testing SSL certificate checking for Python:", sys.version)
@ -12,14 +14,8 @@ if sys.version_info[:2] < (2, 7) or sys.version_info[:2] < (3, 4):
print("This version never checks SSL certs; skipping tests")
sys.exit(0)
if sys.version_info[0] >= 3:
from urllib.request import urlopen
EXC = OSError
else:
from urllib import urlopen
EXC = IOError
EXC = OSError
print(f"Connecting to {GOOD_SSL} should work")
urlopen(GOOD_SSL)


@ -5,7 +5,7 @@
#Pinned versions: 1.6
#test that import:
boto3==1.19.12
boto3==1.35.42
#Description: AWS SDK for python
#Pinned versions: 1.19.12, 1.16.34
#test that import:


@ -284,7 +284,7 @@ test_python_shard() {
# modify LD_LIBRARY_PATH to ensure it has the conda env.
# This set of tests has been shown to be buggy without it for the split-build
time python test/run_test.py --exclude-jit-executor --exclude-distributed-tests $INCLUDE_CLAUSE --shard "$1" "$NUM_TEST_SHARDS" --verbose $PYTHON_TEST_EXTRA_OPTION
time python test/run_test.py --exclude-jit-executor --exclude-distributed-tests $INCLUDE_CLAUSE --shard "$1" "$NUM_TEST_SHARDS" --verbose $PYTHON_TEST_EXTRA_OPTION --upload-artifacts-while-running
assert_git_not_dirty
}
@ -310,7 +310,8 @@ test_dynamo_shard() {
--exclude-distributed-tests \
--exclude-torch-export-tests \
--shard "$1" "$NUM_TEST_SHARDS" \
--verbose
--verbose \
--upload-artifacts-while-running
assert_git_not_dirty
}
@ -1354,7 +1355,7 @@ test_executorch() {
echo "Run ExecuTorch regression tests for some models"
# TODO(huydhn): Add more coverage here using ExecuTorch's gather models script
# shellcheck disable=SC1091
source .ci/scripts/test.sh mv3 cmake xnnpack-quantization-delegation ''
source .ci/scripts/test_model.sh mv3 cmake xnnpack-quantization-delegation ''
popd


@ -114,6 +114,12 @@ if [[ "$PACKAGE_TYPE" =~ .*wheel.* && -n "$PYTORCH_BUILD_VERSION" && "$PYTORCH_B
fi
fi
USE_GLOO_WITH_OPENSSL="ON"
if [[ "$GPU_ARCH_TYPE" =~ .*aarch64.* ]]; then
USE_GLOO_WITH_OPENSSL="OFF"
USE_GOLD_LINKER="OFF"
fi
cat >"$envfile" <<EOL
# =================== The following code will be executed inside Docker container ===================
export TZ=UTC
@ -153,7 +159,7 @@ export DOCKER_IMAGE="$DOCKER_IMAGE"
export USE_GOLD_LINKER="${USE_GOLD_LINKER}"
export USE_GLOO_WITH_OPENSSL="ON"
export USE_GLOO_WITH_OPENSSL="${USE_GLOO_WITH_OPENSSL}"
# =================== The above code will be executed inside Docker container ===================
EOL


@ -26,7 +26,7 @@ runs:
retry_wait_seconds: 30
command: |
set -eu
python3 -m pip install boto3==1.19.12
python3 -m pip install boto3==1.35.42
- name: Download the cache
shell: bash


@ -33,7 +33,7 @@ runs:
retry_wait_seconds: 30
command: |
set -eu
python3 -m pip install boto3==1.19.12
python3 -m pip install boto3==1.35.42
- name: Upload the cache
shell: bash


@ -1 +1 @@
23512dbebd44a11eb84afbf53c3c071dd105297e
e522b45cd4535b9dfe067aa68d7315755df38f48


@ -1,251 +0,0 @@
# This file is generated by .github/scripts/validate_scale_config.py in test-infra
# It defines runner types that will be provisioned by LF Self-hosted runners
# scale-config.yml:
# Powers what instance types are available for GHA auto-scaled
# runners. Runners listed here will be available as self-hosted
# runners; configuration is directly pulled from the main branch.
#
#
# NOTES:
# - Linux runners are by default non-ephemeral to reduce the amount of CreateInstances calls
# to avoid RequestLimitExceeded issues
# - When updating this file, run the following command to validate the YAML and to generate
# corresponding versions of scale-config for the pytorch/pytorch repo and merge the
# pytorch/pytorch changes before merging these changes.
# `python .github/scripts/validate_scale_config.py --test-infra-repo-root [path_to_test-infra_root] --pytorch-repo-root [path_to_pytorch_root]`
#
# TODO: Add some documentation on how the auto-scaling works
#
# NOTE: Default values,
#
# runner_types:
# runner_label:
# instance_type: m4.large
# os: linux
# max_available: 20
# disk_size: 50
# is_ephemeral: true
runner_types:
lf.linux.12xlarge:
disk_size: 200
instance_type: c5.12xlarge
is_ephemeral: false
max_available: 2000
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.linux.10xlarge.avx2:
disk_size: 200
instance_type: m4.10xlarge
is_ephemeral: false
max_available: 450
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.linux.24xl.spr-metal:
disk_size: 200
instance_type: c7i.metal-24xl
is_ephemeral: false
max_available: 150
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.linux.16xlarge.spr:
disk_size: 200
instance_type: c7i.16xlarge
is_ephemeral: false
max_available: 150
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.linux.9xlarge.ephemeral:
disk_size: 200
instance_type: c5.9xlarge
is_ephemeral: true
max_available: 50
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
variants:
am2:
ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs
lf.linux.12xlarge.ephemeral:
disk_size: 200
instance_type: c5.12xlarge
is_ephemeral: true
max_available: 300
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.linux.16xlarge.nvidia.gpu:
disk_size: 150
instance_type: g3.16xlarge
is_ephemeral: false
max_available: 150
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.linux.24xlarge:
disk_size: 150
instance_type: c5.24xlarge
is_ephemeral: false
max_available: 500
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.linux.24xlarge.ephemeral:
disk_size: 150
instance_type: c5.24xlarge
is_ephemeral: true
max_available: 200
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.linux.2xlarge:
disk_size: 150
instance_type: c5.2xlarge
is_ephemeral: false
max_available: 3120
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.linux.4xlarge:
disk_size: 150
instance_type: c5.4xlarge
is_ephemeral: false
max_available: 1000
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.linux.4xlarge.nvidia.gpu:
disk_size: 150
instance_type: g3.4xlarge
is_ephemeral: false
max_available: 1000
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.linux.8xlarge.nvidia.gpu:
disk_size: 150
instance_type: g3.8xlarge
is_ephemeral: false
max_available: 400
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.linux.g4dn.12xlarge.nvidia.gpu:
disk_size: 150
instance_type: g4dn.12xlarge
is_ephemeral: false
max_available: 250
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.linux.g4dn.metal.nvidia.gpu:
disk_size: 150
instance_type: g4dn.metal
is_ephemeral: false
max_available: 300
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.linux.g5.48xlarge.nvidia.gpu:
disk_size: 150
instance_type: g5.48xlarge
is_ephemeral: false
max_available: 200
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.linux.g5.12xlarge.nvidia.gpu:
disk_size: 150
instance_type: g5.12xlarge
is_ephemeral: false
max_available: 150
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.linux.g5.4xlarge.nvidia.gpu:
disk_size: 150
instance_type: g5.4xlarge
is_ephemeral: false
max_available: 2400
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.linux.g6.4xlarge.experimental.nvidia.gpu:
disk_size: 150
instance_type: g6.4xlarge
is_ephemeral: false
max_available: 50
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.linux.large:
max_available: 1200
disk_size: 15
instance_type: c5.large
is_ephemeral: false
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.linux.arm64.2xlarge:
disk_size: 256
instance_type: t4g.2xlarge
is_ephemeral: false
max_available: 200
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-arm64
lf.linux.arm64.m7g.4xlarge:
disk_size: 256
instance_type: m7g.4xlarge
is_ephemeral: false
max_available: 200
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-arm64
lf.linux.arm64.2xlarge.ephemeral:
disk_size: 256
instance_type: t4g.2xlarge
is_ephemeral: true
max_available: 200
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-arm64
lf.linux.arm64.m7g.4xlarge.ephemeral:
disk_size: 256
instance_type: m7g.4xlarge
is_ephemeral: true
max_available: 200
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-arm64
lf.linux.arm64.m7g.metal:
disk_size: 256
instance_type: m7g.metal
is_ephemeral: false
max_available: 100
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-arm64
lf.windows.g4dn.xlarge:
disk_size: 256
instance_type: g4dn.xlarge
is_ephemeral: true
max_available: 100
os: windows
lf.windows.g4dn.xlarge.nonephemeral:
disk_size: 256
instance_type: g4dn.xlarge
is_ephemeral: false
max_available: 100
os: windows
lf.windows.4xlarge:
disk_size: 256
instance_type: c5d.4xlarge
is_ephemeral: true
max_available: 420
os: windows
lf.windows.4xlarge.nonephemeral:
disk_size: 256
instance_type: c5d.4xlarge
is_ephemeral: false
max_available: 420
os: windows
lf.windows.8xlarge.nvidia.gpu:
disk_size: 256
instance_type: p3.2xlarge
is_ephemeral: true
max_available: 300
os: windows
lf.windows.8xlarge.nvidia.gpu.nonephemeral:
disk_size: 256
instance_type: p3.2xlarge
is_ephemeral: false
max_available: 300
os: windows
lf.windows.g5.4xlarge.nvidia.gpu:
disk_size: 256
instance_type: g5.4xlarge
is_ephemeral: false
max_available: 250
os: windows


@ -4,7 +4,7 @@
# docs/cpp/requirements.txt
# functorch/docs/requirements.txt
# .ci/docker/requirements-ci.txt
boto3==1.19.12
boto3==1.35.42
jinja2==3.1.4
lintrunner==0.10.7
ninja==1.10.0.post1


@ -1,4 +1,4 @@
boto3==1.19.12
boto3==1.35.42
hypothesis==6.56.4
expecttest==0.2.1
fbscribelogger==0.1.6


@ -459,7 +459,7 @@ def generate_wheels_matrix(
".", "_"
),
"pytorch_extra_install_requirements": (
PYTORCH_EXTRA_INSTALL_REQUIREMENTS["12.1"]
PYTORCH_EXTRA_INSTALL_REQUIREMENTS["12.4"]
if os != "linux" and gpu_arch_type != "xpu"
else ""
),


@ -114,20 +114,21 @@ LINUX_BINARY_BUILD_WORFKLOWS = [
isolated_workflow=True,
),
),
BinaryBuildWorkflow(
os=OperatingSystem.LINUX,
package_type="manywheel",
build_configs=generate_binary_build_matrix.generate_wheels_matrix(
OperatingSystem.LINUX,
use_split_build=True,
arches=["11.8", "12.1", "12.4", "cpu"],
),
ciflow_config=CIFlowConfig(
labels={LABEL_CIFLOW_BINARIES, LABEL_CIFLOW_BINARIES_WHEEL},
isolated_workflow=True,
),
use_split_build=True,
),
# See https://github.com/pytorch/pytorch/issues/138750
# BinaryBuildWorkflow(
# os=OperatingSystem.LINUX,
# package_type="manywheel",
# build_configs=generate_binary_build_matrix.generate_wheels_matrix(
# OperatingSystem.LINUX,
# use_split_build=True,
# arches=["11.8", "12.1", "12.4", "cpu"],
# ),
# ciflow_config=CIFlowConfig(
# labels={LABEL_CIFLOW_BINARIES, LABEL_CIFLOW_BINARIES_WHEEL},
# isolated_workflow=True,
# ),
# use_split_build=True,
# ),
BinaryBuildWorkflow(
os=OperatingSystem.LINUX,
package_type="conda",
@ -180,21 +181,22 @@ LINUX_BINARY_SMOKE_WORKFLOWS = [
),
branches="main",
),
BinaryBuildWorkflow(
os=OperatingSystem.LINUX,
package_type="manywheel",
build_configs=generate_binary_build_matrix.generate_wheels_matrix(
OperatingSystem.LINUX,
arches=["11.8", "12.1", "12.4"],
python_versions=["3.9"],
use_split_build=True,
),
ciflow_config=CIFlowConfig(
labels={LABEL_CIFLOW_PERIODIC},
),
branches="main",
use_split_build=True,
),
# See https://github.com/pytorch/pytorch/issues/138750
# BinaryBuildWorkflow(
# os=OperatingSystem.LINUX,
# package_type="manywheel",
# build_configs=generate_binary_build_matrix.generate_wheels_matrix(
# OperatingSystem.LINUX,
# arches=["11.8", "12.1", "12.4"],
# python_versions=["3.9"],
# use_split_build=True,
# ),
# ciflow_config=CIFlowConfig(
# labels={LABEL_CIFLOW_PERIODIC},
# ),
# branches="main",
# use_split_build=True,
# ),
BinaryBuildWorkflow(
os=OperatingSystem.LINUX,
package_type="libtorch",


@ -271,7 +271,7 @@ jobs:
)
docker exec -t -w "${PYTORCH_ROOT}" "${container_name}" bash -c "bash .circleci/scripts/binary_populate_env.sh"
if [[ ${BUILD_ENVIRONMENT} == *"aarch64"* ]]; then
docker exec -t "${container_name}" bash -c "bash /builder/aarch64_linux/aarch64_ci_build.sh"
docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash /builder/aarch64_linux/aarch64_ci_build.sh"
elif [[ ${{ inputs.PACKAGE_TYPE }} == "manywheel" || ${{ inputs.PACKAGE_TYPE }} == "libtorch" ]]; then
docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash /pytorch/.ci/${{ inputs.PACKAGE_TYPE }}/build.sh"
else


@ -230,7 +230,7 @@ jobs:
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
SCRIBE_GRAPHQL_ACCESS_TOKEN: ${{ secrets.SCRIBE_GRAPHQL_ACCESS_TOKEN }}
IS_A100_RUNNER: ${{ contains(matrix.runner, 'a100') && '1' || '0' }}
ARTIFACTS_FILE_SUFFIX: ${{ github.job }}-${{ matrix.config }}-${{ matrix.shard }}-${{ matrix.num_shards }}-${{ matrix.runner }}_${{ steps.get-job-id.outputs.job-id }}
run: |
set -x
@ -289,6 +289,7 @@ jobs:
-e SCRIBE_GRAPHQL_ACCESS_TOKEN \
-e DASHBOARD_TAG \
-e IS_A100_RUNNER \
-e ARTIFACTS_FILE_SUFFIX \
--env-file="/tmp/github_env_${GITHUB_RUN_ID}" \
--security-opt seccomp=unconfined \
--cap-add=SYS_PTRACE \


@ -40,6 +40,8 @@ on:
jobs:
runner-determinator:
# Don't run on forked repos
if: github.repository_owner == 'pytorch'
runs-on: ubuntu-latest
outputs:
label-type: ${{ steps.set-condition.outputs.label-type }}


@ -35,7 +35,7 @@ jobs:
runs-on: linux.9xlarge.ephemeral
strategy:
matrix:
cuda_version: ["11.8", "12.1", "12.4", "cpu"]
cuda_version: ["11.8", "12.1", "12.4", "12.6", "cpu"]
env:
CUDA_VERSION: ${{ matrix.cuda_version }}
steps:


@ -65,7 +65,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_9-cpu-aarch64
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.5.8; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.1.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_9-cpu-aarch64-test: # Testing
@ -185,7 +185,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_10-cpu-aarch64
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.5.8; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.1.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_10-cpu-aarch64-test: # Testing
@ -305,7 +305,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_11-cpu-aarch64
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.5.8; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.1.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_11-cpu-aarch64-test: # Testing
@ -425,7 +425,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_12-cpu-aarch64
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.5.8; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.1.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_12-cpu-aarch64-test: # Testing


@ -1,182 +0,0 @@
# @generated DO NOT EDIT MANUALLY
# Template is at: .github/templates/linux_binary_build_workflow.yml.j2
# Generation script: .github/scripts/generate_ci_workflows.py
name: linux-binary-manywheel-split
on:
push:
branches:
- main
tags:
- 'ciflow/periodic/*'
workflow_dispatch:
env:
# Needed for conda builds
ALPINE_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine"
ANACONDA_USER: pytorch
AWS_DEFAULT_REGION: us-east-1
BINARY_ENV_FILE: /tmp/env
BUILD_ENVIRONMENT: linux-binary-manywheel-split
BUILDER_ROOT: /builder
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
PR_NUMBER: ${{ github.event.pull_request.number }}
PYTORCH_FINAL_PACKAGE_DIR: /artifacts
PYTORCH_ROOT: /pytorch
SHA1: ${{ github.event.pull_request.head.sha || github.sha }}
SKIP_ALL_TESTS: 0
concurrency:
group: linux-binary-manywheel-split-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}
cancel-in-progress: true
jobs:
get-label-type:
name: get-label-type
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
curr_branch: ${{ github.head_ref || github.ref_name }}
curr_ref_type: ${{ github.ref_type }}
manywheel-py3_9-cuda11_8-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu118
GPU_ARCH_VERSION: 11.8
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.8-main
use_split_build: True
DESIRED_PYTHON: "3.9"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_9-cuda11_8
build_environment: linux-binary-manywheel-split
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu11==11.8.87; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu11==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu11==11.11.3.6; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu11==10.9.0.58; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu11==10.3.0.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu11==11.4.1.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu11==11.7.5.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu11==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu11==11.8.86; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_9-cuda11_8-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs:
- manywheel-py3_9-cuda11_8-build
- get-label-type
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu118
GPU_ARCH_VERSION: 11.8
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.8-main
use_split_build: True
DESIRED_PYTHON: "3.9"
build_name: manywheel-py3_9-cuda11_8
build_environment: linux-binary-manywheel-split
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.4xlarge.nvidia.gpu
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_9-cuda12_1-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu121
GPU_ARCH_VERSION: 12.1
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.1-main
use_split_build: True
DESIRED_PYTHON: "3.9"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_9-cuda12_1
build_environment: linux-binary-manywheel-split
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_9-cuda12_1-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs:
- manywheel-py3_9-cuda12_1-build
- get-label-type
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu121
GPU_ARCH_VERSION: 12.1
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.1-main
use_split_build: True
DESIRED_PYTHON: "3.9"
build_name: manywheel-py3_9-cuda12_1
build_environment: linux-binary-manywheel-split
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.4xlarge.nvidia.gpu
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_9-cuda12_4-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.4-main
use_split_build: True
DESIRED_PYTHON: "3.9"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_9-cuda12_4
build_environment: linux-binary-manywheel-split
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.5.8; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.1.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_9-cuda12_4-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs:
- manywheel-py3_9-cuda12_4-build
- get-label-type
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.4-main
use_split_build: True
DESIRED_PYTHON: "3.9"
build_name: manywheel-py3_9-cuda12_4
build_environment: linux-binary-manywheel-split
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.4xlarge.nvidia.gpu
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}

File diff suppressed because it is too large


@ -64,7 +64,7 @@ jobs:
ALPINE_IMAGE: "docker.io/s390x/alpine"
build_name: manywheel-py3_9-cpu-s390x
build_environment: linux-s390x-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.5.8; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.1.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_9-cpu-s390x-test: # Testing
@ -133,7 +133,7 @@ jobs:
ALPINE_IMAGE: "docker.io/s390x/alpine"
build_name: manywheel-py3_10-cpu-s390x
build_environment: linux-s390x-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.5.8; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.1.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_10-cpu-s390x-test: # Testing
@ -202,7 +202,7 @@ jobs:
ALPINE_IMAGE: "docker.io/s390x/alpine"
build_name: manywheel-py3_11-cpu-s390x
build_environment: linux-s390x-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.5.8; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.1.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_11-cpu-s390x-test: # Testing
@ -271,7 +271,7 @@ jobs:
ALPINE_IMAGE: "docker.io/s390x/alpine"
build_name: manywheel-py3_12-cpu-s390x
build_environment: linux-s390x-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.5.8; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.1.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_12-cpu-s390x-test: # Testing
@ -340,7 +340,7 @@ jobs:
ALPINE_IMAGE: "docker.io/s390x/alpine"
build_name: manywheel-py3_13-cpu-s390x
build_environment: linux-s390x-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.5.8; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.1.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13-cpu-s390x-test: # Testing


@ -46,7 +46,7 @@ jobs:
GPU_ARCH_TYPE: cpu
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.9"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.5.8; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.1.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
# NOTE: These environment variables are put here so that they can be applied on every job equally
# They are also here because setting them at a workflow level doesn't give us access to the
@ -162,7 +162,7 @@ jobs:
GPU_ARCH_TYPE: cpu
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.10"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.5.8; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.1.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
# NOTE: These environment variables are put here so that they can be applied on every job equally
# They are also here because setting them at a workflow level doesn't give us access to the
@ -278,7 +278,7 @@ jobs:
GPU_ARCH_TYPE: cpu
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.11"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.5.8; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.1.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
# NOTE: These environment variables are put here so that they can be applied on every job equally
# They are also here because setting them at a workflow level doesn't give us access to the
# runner.temp variable, which we need.
@ -394,7 +394,7 @@ jobs:
GPU_ARCH_TYPE: cpu
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.12"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.5.8; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.1.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
# NOTE: These environment variables are put here so that they can be applied on every job equally
# They are also here because setting them at a workflow level doesn't give us access to the
# runner.temp variable, which we need.
@ -510,7 +510,7 @@ jobs:
GPU_ARCH_TYPE: cpu
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.13"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.5.8; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.1.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
# NOTE: These environment variables are put here so that they can be applied on every job equally
# They are also here because setting them at a workflow level doesn't give us access to the
# runner.temp variable, which we need.


@ -55,7 +55,7 @@ jobs:
GPU_ARCH_TYPE: cpu
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.9"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.5.8; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.1.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
- name: Display EC2 information
shell: bash
@ -322,7 +322,7 @@ jobs:
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.9"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.5.8; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.1.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
- name: Display EC2 information
shell: bash
@ -591,7 +591,7 @@ jobs:
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.9"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.5.8; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.1.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
- name: Display EC2 information
shell: bash
@ -860,7 +860,7 @@ jobs:
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.9"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.5.8; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.1.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
- name: Display EC2 information
shell: bash
@ -1393,7 +1393,7 @@ jobs:
GPU_ARCH_TYPE: cpu
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.10"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.5.8; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.1.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
- name: Display EC2 information
shell: bash
@ -1660,7 +1660,7 @@ jobs:
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.10"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.5.8; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.1.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
- name: Display EC2 information
shell: bash
@ -1929,7 +1929,7 @@ jobs:
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.10"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.5.8; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.1.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
- name: Display EC2 information
shell: bash
@ -2198,7 +2198,7 @@ jobs:
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.10"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.5.8; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.1.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
- name: Display EC2 information
shell: bash
@ -2731,7 +2731,7 @@ jobs:
GPU_ARCH_TYPE: cpu
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.11"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.5.8; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.1.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
- name: Display EC2 information
shell: bash
@ -2998,7 +2998,7 @@ jobs:
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.11"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.5.8; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.1.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
- name: Display EC2 information
shell: bash
@ -3267,7 +3267,7 @@ jobs:
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.11"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.5.8; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.1.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
- name: Display EC2 information
shell: bash
@ -3536,7 +3536,7 @@ jobs:
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.11"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.5.8; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.1.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
- name: Display EC2 information
shell: bash
@ -4069,7 +4069,7 @@ jobs:
GPU_ARCH_TYPE: cpu
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.12"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.5.8; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.1.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
- name: Display EC2 information
shell: bash
@ -4336,7 +4336,7 @@ jobs:
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.12"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.5.8; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.1.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
- name: Display EC2 information
shell: bash
@ -4605,7 +4605,7 @@ jobs:
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.12"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.5.8; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.1.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
- name: Display EC2 information
shell: bash
@ -4874,7 +4874,7 @@ jobs:
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.12"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.5.8; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.1.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
- name: Display EC2 information
shell: bash


@ -13,8 +13,8 @@ concurrency:
permissions: read-all
jobs:
get-label-type:
name: get-label-type
get-default-label-prefix:
name: get-default-label-prefix
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
with:
triggering_actor: ${{ github.triggering_actor }}
@ -22,21 +22,33 @@ jobs:
curr_branch: ${{ github.head_ref || github.ref_name }}
curr_ref_type: ${{ github.ref_type }}
get-test-label-type:
name: get-test-label-type
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
curr_branch: ${{ github.head_ref || github.ref_name }}
curr_ref_type: ${{ github.ref_type }}
check_experiments: "awsa100"
linux-focal-cuda12_1-py3_10-gcc9-inductor-build:
name: cuda12.1-py3.10-gcc9-sm80
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
needs:
- get-default-label-prefix
- get-test-label-type
with:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runner_prefix: "${{ needs.get-default-label-prefix.outputs.label-type }}"
build-environment: linux-focal-cuda12.1-py3.10-gcc9-sm80
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9-inductor-benchmarks
cuda-arch-list: '8.0'
test-matrix: |
{ include: [
{ config: "inductor_huggingface_perf_compare", shard: 1, num_shards: 1, runner: "linux.gcp.a100" },
{ config: "inductor_timm_perf_compare", shard: 1, num_shards: 2, runner: "linux.gcp.a100" },
{ config: "inductor_timm_perf_compare", shard: 2, num_shards: 2, runner: "linux.gcp.a100" },
{ config: "inductor_torchbench_perf_compare", shard: 1, num_shards: 1, runner: "linux.gcp.a100" },
{ config: "inductor_huggingface_perf_compare", shard: 1, num_shards: 1, runner: "${{ needs.get-test-label-type.outputs.label-type }}linux.gcp.a100" },
{ config: "inductor_timm_perf_compare", shard: 1, num_shards: 2, runner: "${{ needs.get-test-label-type.outputs.label-type }}linux.gcp.a100" },
{ config: "inductor_timm_perf_compare", shard: 2, num_shards: 2, runner: "${{ needs.get-test-label-type.outputs.label-type }}linux.gcp.a100" },
{ config: "inductor_torchbench_perf_compare", shard: 1, num_shards: 1, runner: "${{ needs.get-test-label-type.outputs.label-type }}linux.gcp.a100" },
]}
secrets:
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}


@ -8,9 +8,10 @@ permissions:
contents: read
jobs:
get-label-type:
name: get-label-type
# Don't run on forked repos
if: github.repository_owner == 'pytorch'
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
with:
triggering_actor: ${{ github.triggering_actor }}
@ -19,6 +20,8 @@ jobs:
curr_ref_type: ${{ github.ref_type }}
llm-retrieval:
# Don't run on forked repos
if: github.repository_owner == 'pytorch'
runs-on: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge"
continue-on-error: true
needs: get-label-type


@ -32,7 +32,7 @@ jobs:
cache: pip
- run: |
pip3 install requests==2.32.2 rockset==1.0.3 boto3==1.19.12
pip3 install requests==2.32.2 rockset==1.0.3 boto3==1.35.42
- name: Upload external contribution stats
uses: nick-fields/retry@v3.0.0


@ -333,6 +333,7 @@ jobs:
name: linux-focal-cuda12.1-py3.10-gcc9-experimental-split-build
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
if: false # See https://github.com/pytorch/pytorch/issues/138750
with:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
use_split_build: true
@ -363,6 +364,7 @@ jobs:
name: linux-focal-cuda11.8-py3.9-gcc9-experimental-split-build
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
if: false # See https://github.com/pytorch/pytorch/issues/138750
with:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
use_split_build: true
@ -390,6 +392,7 @@ jobs:
name: linux-focal-cuda11.8-py3.10-gcc9-experimental-split-build
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
if: false # See https://github.com/pytorch/pytorch/issues/138750
with:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
use_split_build: true


@ -579,6 +579,7 @@ jobs:
secrets: inherit
linux-focal-py3_12-clang10-experimental-split-build:
if: false # See https://github.com/pytorch/pytorch/issues/138750
name: linux-focal-py3.12-clang10-experimental-split-build
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type


@ -7,6 +7,8 @@ jobs:
get-label-type:
name: get-label-type
# Don't run on forked repos
if: github.repository_owner == 'pytorch'
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
with:
triggering_actor: ${{ github.triggering_actor }}
@ -70,7 +72,7 @@ jobs:
PR_NUMBER: ${{ github.event.pull_request.number }}
run: |
unzip -o .additional_ci_files/llm_results/mappings.zip -d .additional_ci_files/llm_results || true
python3 -m pip install boto3==1.19.12
python3 -m pip install boto3==1.35.42
python3 tools/testing/do_target_determination_for_s3.py
- name: Upload TD results to s3


@ -256,6 +256,7 @@ jobs:
tests-to-include: "test_nn test_torch test_cuda test_ops test_unary_ufuncs test_binary_ufuncs test_autograd inductor/test_torchinductor distributed/test_c10d_common distributed/test_c10d_nccl"
linux-focal-cuda12_4-py3_10-gcc9-experimental-split-build:
if: false # See https://github.com/pytorch/pytorch/issues/138750
name: linux-focal-cuda12.4-py3.10-gcc9-experimental-split-build
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type


@ -29,5 +29,5 @@ jobs:
aws-region: us-east-1
- name: Update PyTorch labels list in S3
run: |
python3 -m pip install boto3==1.19.12
python3 -m pip install boto3==1.35.42
.github/scripts/export_pytorch_labels.py pytorch pytorch


@ -53,7 +53,7 @@ jobs:
cache: pip
- run: |
pip3 install requests==2.32.2 rockset==1.0.3 boto3==1.19.12
pip3 install requests==2.32.2 rockset==1.0.3 boto3==1.35.42
- name: Upload test artifacts
id: upload-s3


@ -49,7 +49,7 @@ jobs:
cache: pip
- run: |
pip3 install requests==2.32.2 rockset==1.0.3 boto3==1.19.12
pip3 install requests==2.32.2 rockset==1.0.3 boto3==1.35.42
- name: Upload torch dynamo performance stats to S3
id: upload-s3


@ -28,7 +28,7 @@ jobs:
cache: pip
- run: |
pip3 install requests==2.32.2 rockset==1.0.3 boto3==1.19.12
pip3 install requests==2.32.2 rockset==1.0.3 boto3==1.35.42
- name: Upload test stats
env:


@ -1402,7 +1402,7 @@ init_command = [
'black==23.12.1',
'usort==1.0.8.post1',
'isort==5.13.2',
'ruff==0.6.3', # sync with RUFF
'ruff==0.7.0', # sync with RUFF
]
is_formatter = true
@ -1487,7 +1487,7 @@ init_command = [
'python3',
'tools/linter/adapters/pip_init.py',
'--dry-run={{DRYRUN}}',
'ruff==0.6.3', # sync with PYFMT
'ruff==0.7.0', # sync with PYFMT
]
is_formatter = true


@ -878,7 +878,7 @@ Process 87741 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
frame #0: 0x00000001024e2628 libtorch_python.dylib`at::indexing::impl::applySelect(self=0x00000001004ee8a8, dim=0, index=(data_ = 3), real_dim=0, (null)=0x000000016fdfe535, self_sizes= Has Value=true ) at TensorIndexing.h:239:7
236 const at::Device& /*self_device*/,
237 const c10::optional<SymIntArrayRef>& self_sizes) {
237 const std::optional<SymIntArrayRef>& self_sizes) {
238 // See NOTE [nested tensor size for indexing]
-> 239 if (self_sizes.has_value()) {
240 auto maybe_index = index.maybe_as_int();
@ -1081,10 +1081,6 @@ Here are a few well known pitfalls and workarounds:
catch all of these problems: stay vigilant to the possibility that
your crash is due to a real memory problem.
* (NVCC) `c10::optional` does not work when used from device code. Don't use
it from kernels. Upstream issue: https://github.com/akrzemi1/Optional/issues/58
and our local issue #10329.
* `constexpr` generally works less well on MSVC.
* The idiom `static_assert(f() == f())` to test if `f` is constexpr


@ -43,9 +43,19 @@ class TORCH_API Context {
if (device_type == at::kCPU) {
return at::detail::getDefaultCPUGenerator();
} else if (device_type == at::kCUDA) {
return at::detail::getCUDAHooks().getDefaultCUDAGenerator(device.index());
} else if (device_type == at::kMPS) {
return at::detail::getMPSHooks().getDefaultMPSGenerator();
} else if (device_type == at::kXPU) {
return at::detail::getXPUHooks().getDefaultXPUGenerator(device.index());
} else if (device_type == at::kIPU) {
return at::detail::getIPUHooks().getDefaultIPUGenerator(device.index());
} else if (device_type == at::kPrivateUse1) {
return at::detail::getPrivateUse1Hooks().getDefaultGenerator(
device.index());
} else {
return getAcceleratorHooksInterface(device_type)
.getDefaultGenerator(device.index());
AT_ERROR(c10::DeviceTypeName(device_type), " device type not enabled.");
}
}
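The hunk above swaps between two ways of resolving a per-device default generator: an explicit else-if ladder over device types, and a single lookup through an accelerator-hooks interface. Below is a minimal, self-contained sketch of the hooks-lookup style; all names (HooksInterface, getHooksFor, the string keys) are hypothetical stand-ins, not the real ATen API.

#include <map>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a per-backend hooks interface.
struct HooksInterface {
  virtual ~HooksInterface() = default;
  virtual int getDefaultGenerator(int device_index) const = 0;
};

struct CudaHooks : HooksInterface {
  int getDefaultGenerator(int device_index) const override {
    return 100 + device_index;  // pretend generator id
  }
};

// Hypothetical registry; plays the role of a
// getAcceleratorHooksInterface()-style accessor.
const HooksInterface& getHooksFor(const std::string& device_type) {
  static CudaHooks cuda;
  static const std::map<std::string, const HooksInterface*> registry{{"cuda", &cuda}};
  auto it = registry.find(device_type);
  if (it == registry.end()) {
    throw std::runtime_error(device_type + " device type not enabled.");
  }
  return *it->second;
}

int getDefaultGeneratorFor(const std::string& device_type, int index) {
  if (device_type == "cpu") {
    return 0;  // CPU keeps a dedicated default-generator path
  }
  // One virtual dispatch instead of a per-backend else-if ladder.
  return getHooksFor(device_type).getDefaultGenerator(index);
}

int main() {
  return getDefaultGeneratorFor("cuda", 0) == 100 ? 0 : 1;
}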


@ -38,11 +38,9 @@ inline constexpr bool should_include_kernel_dtype(
* binary.
*/
#if defined ENABLE_RECORD_KERNEL_FUNCTION_DTYPE
namespace at {
namespace detail {
namespace at::detail {
TORCH_API void record_kernel_function_dtype(std::string name);
}
} // namespace at
} // namespace at::detail
#define RECORD_KERNEL_FUNCTION_DTYPE(NAME, enum_type) \
at::detail::record_kernel_function_dtype( \
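The change above is purely syntactic: C++17's nested-namespace-definition collapses `namespace at { namespace detail {` into `namespace at::detail {`. A tiny sketch with hypothetical names:

// Pre-C++17 spelling: two nested blocks.
namespace outer { namespace inner {
inline int f() { return 1; }
}} // namespace outer::inner

// C++17 spelling: one block, same namespace.
namespace outer::inner {
inline int g() { return f() + 1; }  // f is visible: both blocks open outer::inner
}

int main() { return outer::inner::g() == 2 ? 0 : 1; }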


@ -112,6 +112,10 @@ class TORCH_API RefcountedMapAllocator : private RefcountedMapAllocatorArgCheck,
size_t size);
static RefcountedMapAllocator* fromDataPtr(const at::DataPtr&);
RefcountedMapAllocator(const RefcountedMapAllocator&) = delete;
RefcountedMapAllocator(RefcountedMapAllocator&&) = delete;
RefcountedMapAllocator& operator=(const RefcountedMapAllocator&) = delete;
RefcountedMapAllocator& operator=(RefcountedMapAllocator&&) = delete;
static at::DataPtr makeDataPtr(
const char* filename,
int flags,
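The added `= delete` lines follow the standard rule for types that own a shared OS resource: forbid copying and moving so the mapping cannot be duplicated or left dangling. A minimal sketch of the pattern, with hypothetical names:

#include <cstdio>

class MappedRegion {
 public:
  explicit MappedRegion(const char* filename) { std::printf("map %s\n", filename); }
  ~MappedRegion() { std::printf("unmap\n"); }

  // All four special members deleted: duplication is a compile-time error.
  MappedRegion(const MappedRegion&) = delete;
  MappedRegion(MappedRegion&&) = delete;
  MappedRegion& operator=(const MappedRegion&) = delete;
  MappedRegion& operator=(MappedRegion&&) = delete;
};

int main() {
  MappedRegion r("/tmp/example");
  // MappedRegion r2 = r;  // would not compile: copy constructor is deleted
  return 0;
}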


@ -61,7 +61,7 @@ MemOverlapStatus get_overlap_status(const TensorImpl* a, const TensorImpl* b) {
// same pointer across multiple storages there are many
// similar situations (e.g., storage().data() == storage().data()+1)
// which we will miss.
auto a_storage = a->unsafe_storage();
const auto& a_storage = a->unsafe_storage();
if (a_storage && a_storage.is_alias_of(b->unsafe_storage())) {
const auto a_begin = static_cast<const char*>(a->data());
const auto a_end = a_begin + a->numel() * a->itemsize();
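Binding `const auto&` instead of `auto` avoids copying the storage handle, which for a refcounted type means skipping an atomic increment/decrement pair on a hot path. A sketch of the difference, assuming a shared_ptr-backed handle:

#include <memory>

struct StorageHandle {
  std::shared_ptr<int> impl;  // refcounted, like a tensor storage handle
};

long use_count_after_bind(const StorageHandle& s) {
  const auto& view = s;  // reference bind: no refcount traffic
  // auto copy = s;      // would copy: atomic increment now, decrement later
  return view.impl.use_count();
}

int main() {
  StorageHandle s{std::make_shared<int>(7)};
  return use_count_after_bind(s) == 1 ? 0 : 1;
}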


@ -8,6 +8,17 @@
namespace c10 {
namespace detail {
template <typename Base, typename Child, typename... Args>
std::enable_if_t<
!std::is_array_v<Base> && !std::is_array_v<Child> &&
std::is_base_of_v<Base, Child>,
std::unique_ptr<Base>>
make_unique_base(Args&&... args) {
return std::unique_ptr<Base>(new Child(std::forward<Args>(args)...));
}
}
inline KernelFunction::KernelFunction()
: boxed_kernel_func_()
, unboxed_kernel_func_(nullptr)
@ -183,7 +194,7 @@ inline KernelFunction KernelFunction::makeFromUnboxedFunction(FuncPtr func_ptr)
#if !defined(C10_MOBILE)
(void)func_ptr; // Suppress unused variable warning
return makeFromUnboxedFunctor<AllowLegacyTypes, typename impl::WrapFunctionIntoFunctor<FuncPtr>::type>(
guts::make_unique_base<OperatorKernel, typename impl::WrapFunctionIntoFunctor<FuncPtr>::type>()
detail::make_unique_base<OperatorKernel, typename impl::WrapFunctionIntoFunctor<FuncPtr>::type>()
);
#else
// On mobile, we rather want to optimize for binary size than for performance,
@ -200,7 +211,7 @@ inline KernelFunction KernelFunction::makeFromUnboxedRuntimeFunction(FuncType* f
TORCH_INTERNAL_ASSERT(func != nullptr, "Kernel function cannot be nullptr");
return makeFromUnboxedFunctor<AllowLegacyTypes, impl::WrapFunctionIntoRuntimeFunctor<std::decay_t<FuncType>>>(
guts::make_unique_base<OperatorKernel, impl::WrapFunctionIntoRuntimeFunctor<std::decay_t<FuncType>>>(func)
detail::make_unique_base<OperatorKernel, impl::WrapFunctionIntoRuntimeFunctor<std::decay_t<FuncType>>>(func)
);
}
@ -210,7 +221,7 @@ inline std::enable_if_t<guts::is_stateless_lambda<std::decay_t<Lambda>>::value,
#if !defined(C10_MOBILE)
return makeFromUnboxedFunctor<AllowLegacyTypes, impl::WrapFunctionIntoRuntimeFunctor<std::decay_t<Lambda>>>(
guts::make_unique_base<OperatorKernel, impl::WrapFunctionIntoRuntimeFunctor<std::decay_t<Lambda>>>(std::forward<Lambda>(lambda))
detail::make_unique_base<OperatorKernel, impl::WrapFunctionIntoRuntimeFunctor<std::decay_t<Lambda>>>(std::forward<Lambda>(lambda))
);
#else
// On mobile, we rather want to optimize for binary size than for performance,
@ -226,7 +237,7 @@ inline std::enable_if_t<!guts::is_stateless_lambda<std::decay_t<Lambda>>::value,
static_assert(guts::is_functor<std::decay_t<Lambda>>::value, "Tried to call KernelFunction::makeFromUnboxedLambda with a non-lambda type.");
return makeFromUnboxedFunctor<AllowLegacyTypes, impl::WrapFunctionIntoRuntimeFunctor<std::decay_t<Lambda>>>(
guts::make_unique_base<OperatorKernel, impl::WrapFunctionIntoRuntimeFunctor<std::decay_t<Lambda>>>(std::forward<Lambda>(lambda))
detail::make_unique_base<OperatorKernel, impl::WrapFunctionIntoRuntimeFunctor<std::decay_t<Lambda>>>(std::forward<Lambda>(lambda))
);
}
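The helper added to `c10::detail` above is a SFINAE-guarded factory that constructs a derived type but returns ownership typed as the base. A standalone sketch of the same pattern (`Widget`/`Gadget` are illustrative names, not PyTorch types):

#include <memory>
#include <type_traits>
#include <utility>

struct Widget { virtual ~Widget() = default; };
struct Gadget : Widget { explicit Gadget(int n) : n_(n) {} int n_; };

template <typename Base, typename Child, typename... Args>
std::enable_if_t<
    !std::is_array_v<Base> && !std::is_array_v<Child> &&
        std::is_base_of_v<Base, Child>,
    std::unique_ptr<Base>>
make_unique_base(Args&&... args) {
  // Build the Child in place, hand back a unique_ptr to the Base subobject.
  return std::unique_ptr<Base>(new Child(std::forward<Args>(args)...));
}

int main() {
  auto w = make_unique_base<Widget, Gadget>(7);  // std::unique_ptr<Widget>
}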

View File

@ -3,6 +3,7 @@
#include <condition_variable>
#include <memory>
#include <optional>
#include <tuple>
#include <type_traits>
#include <utility>

View File

@ -295,6 +295,19 @@ public:
}
return false;
}
// TODO: Remove this once the issue with MSVC is fixed
// See https://developercommunity.visualstudio.com/t/MSVC-loop-unrolling-problem-194033813-/10720692
#if defined(_WIN32) && defined(__aarch64__)
Vectorized<T> map(T (*const f)(T)) const {
Vectorized<T> ret;
for (int64_t i = 0; i < size(); i++) {
ret[i] = f(values[i]);
if (++i < size())
ret[i] = f(values[i]);
}
return ret;
}
#else
Vectorized<T> map(T (*const f)(T)) const {
Vectorized<T> ret;
for (int64_t i = 0; i != size(); i++) {
@ -302,6 +315,7 @@ public:
}
return ret;
}
#endif
Vectorized<T> map(T (*const f)(const T &)) const {
Vectorized<T> ret;
for (int64_t i = 0; i != size(); i++) {
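The `_WIN32`/`__aarch64__` branch above dodges an MSVC loop-unrolling miscompile by unrolling the loop by hand, two elements per step. A self-contained sketch of the same shape (plain arrays standing in for Vectorized<T>):

#include <cstdint>

template <typename T, int64_t N>
void map_by_pairs(T (&dst)[N], const T (&src)[N], T (*f)(T)) {
  for (int64_t i = 0; i < N; i++) {
    dst[i] = f(src[i]);
    if (++i < N)           // second element of the pair; the explicit bound
      dst[i] = f(src[i]);  // check keeps odd N safe
  }
}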

View File

@ -34,7 +34,7 @@ public:
private:
cublasHandle_t handle;
cublasPointerMode_t previous_mode;
cublasPointerMode_t previous_mode{};
};
/* LEVEL 3 BLAS FUNCTIONS */

View File

@ -31,7 +31,7 @@ static std::vector<Generator> default_gens_cuda;
* Warning: this function must only be called once!
*/
static void initCUDAGenVector() {
num_gpus = c10::cuda::device_count();
num_gpus = static_cast<int32_t>(c10::cuda::device_count());
cuda_gens_init_flag.resize(num_gpus);
default_gens_cuda.resize(num_gpus);
}

View File

@ -5,7 +5,6 @@
#include <ATen/core/TensorBase.h>
#include <ATen/cuda/PhiloxCudaState.h>
#include <atomic>
#include <limits>
#include <memory>
#include <unordered_set>
namespace at {
@ -168,7 +167,7 @@ struct TORCH_CUDA_CPP_API CUDAGeneratorImpl : public c10::GeneratorImpl {
CUDAGeneratorImpl* clone_impl() const override;
c10::intrusive_ptr<CUDAGeneratorState> state_;
std::atomic_flag no_reset_rnn_state_;
std::atomic_flag no_reset_rnn_state_{};
};
namespace cuda::detail {

View File

@ -7,9 +7,7 @@
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>
namespace at::cuda {
@ -19,8 +17,7 @@ constexpr int kSynchronizeBusyWaitMillis = 10;
MempoolId_t graph_pool_handle() {
// Sets just the second value, to distinguish it from MempoolId_ts created from
// cudaStreamGetCaptureInfo id_s in capture_begin.
auto new_pool = c10::cuda::MemPool();
return new_pool.id();
return c10::cuda::MemPool::graph_pool_handle();
}
/**
@ -115,8 +112,7 @@ void CUDAGraph::capture_begin(MempoolId_t pool/*=0*/, cudaStreamCaptureMode capt
} else {
// User did not ask us to share a mempool. Create graph pool handle using is_user_created=false.
// Sets just the first value, to distinguish it from MempoolId_ts created by graph_pool_handle().
auto mempool = c10::cuda::MemPool({}, false);
mempool_id_ = mempool.id();
mempool_id_ = c10::cuda::MemPool::graph_pool_handle(false);
TORCH_INTERNAL_ASSERT(mempool_id_.first > 0);
}
@ -124,7 +120,7 @@ void CUDAGraph::capture_begin(MempoolId_t pool/*=0*/, cudaStreamCaptureMode capt
// autograd thread's free() call triggering an invalid cudaEventRecord in the caching allocator
// due to the capture status being updated _after_ a capture had already started.
c10::cuda::CUDACachingAllocator::beginAllocateToPool(capture_dev_, mempool_id_, [this](cudaStream_t stream) {
cudaStreamCaptureStatus status;
cudaStreamCaptureStatus status{};
CaptureId_t stream_capture_id = 0;
AT_CUDA_CHECK(cudaStreamGetCaptureInfo(stream, &status, &stream_capture_id));
return status == cudaStreamCaptureStatus::cudaStreamCaptureStatusActive && stream_capture_id == capture_id_;
@ -144,7 +140,7 @@ void CUDAGraph::capture_begin(MempoolId_t pool/*=0*/, cudaStreamCaptureMode capt
// https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__STREAM.html#group__CUDART__STREAM_1g9d0535d93a214cbf126835257b16ba85
AT_CUDA_CHECK(cudaStreamBeginCapture(capture_stream_, capture_mode));
cudaStreamCaptureStatus status;
cudaStreamCaptureStatus status{};
AT_CUDA_CHECK(cudaStreamGetCaptureInfo(stream, &status, &capture_id_));
TORCH_INTERNAL_ASSERT(status == cudaStreamCaptureStatus::cudaStreamCaptureStatusActive);
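For context, a rough usage sketch of the capture path that exercises these mempool helpers (hedged: stream handling is simplified and `x` is a placeholder CUDA tensor):

#include <ATen/ATen.h>
#include <ATen/cuda/CUDAGraph.h>
#include <c10/cuda/CUDAStream.h>

void capture_and_replay(at::Tensor& x) {
  // Capture must run on a non-default stream.
  auto stream = c10::cuda::getStreamFromPool();
  c10::cuda::setCurrentCUDAStream(stream);

  at::cuda::CUDAGraph graph;
  graph.capture_begin();   // picks up a graph mempool id as shown above
  x.add_(1);               // recorded into the graph, not executed yet
  graph.capture_end();

  graph.replay();          // launches the captured kernels
}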

View File

@ -39,7 +39,6 @@
#include <sstream>
#include <cstddef>
#include <functional>
#include <memory>
namespace c10::cuda::_internal {
@ -61,7 +60,7 @@ namespace {
bool _hasPrimaryContext(DeviceIndex device_index) {
TORCH_CHECK(device_index >= 0 && device_index < at::cuda::device_count(),
"hasPrimaryContext expects a valid device index, but got device_index=", device_index);
unsigned int ctx_flags;
unsigned int ctx_flags = 0;
// In standalone tests of cuDevicePrimaryCtxGetState, I've seen the "active" argument end up with weird
// (garbage-looking nonzero) values when the context is not active, unless I initialize it to zero.
int ctx_is_active = 0;
@ -103,7 +102,7 @@ void CUDAHooks::init() const {
#endif
}
const Generator& CUDAHooks::getDefaultGenerator(DeviceIndex device_index) const {
const Generator& CUDAHooks::getDefaultCUDAGenerator(DeviceIndex device_index) const {
return at::cuda::detail::getDefaultCUDAGenerator(device_index);
}
@ -124,7 +123,7 @@ bool CUDAHooks::isPinnedPtr(const void* data) const {
if (primary_ctx_device_index.has_value()) {
device_guard.reset_device(at::Device(at::DeviceType::CUDA, *primary_ctx_device_index));
}
cudaPointerAttributes attr;
cudaPointerAttributes attr{};
// We do not believe that CUDA needs mutable access to the data
// here.
cudaError_t err = cudaPointerGetAttributes(&attr, data);
@ -325,10 +324,10 @@ bool CUDAHooks::hasCUDART() const {
std::string CUDAHooks::showConfig() const {
std::ostringstream oss;
int runtimeVersion;
int runtimeVersion = 0;
cudaRuntimeGetVersion(&runtimeVersion);
auto printCudaStyleVersion = [&](int v) {
auto printCudaStyleVersion = [&](size_t v) {
#ifdef USE_ROCM
// HIP_VERSION value format was changed after ROCm v4.2 to include the patch number
if(v < 500) {
@ -369,7 +368,7 @@ std::string CUDAHooks::showConfig() const {
#if AT_CUDNN_ENABLED()
auto printCudnnStyleVersion = [&](int v) {
auto printCudnnStyleVersion = [&](size_t v) {
oss << (v / 1000) << "." << (v / 100 % 10);
if (v % 100 != 0) {
oss << "." << (v % 100);

View File

@ -3,7 +3,6 @@
#include <ATen/detail/CUDAHooksInterface.h>
#include <ATen/Generator.h>
#include <optional>
// TODO: No need to have this whole header, we can just put it all in
// the cpp file
@ -22,8 +21,7 @@ struct CUDAHooks : public at::CUDAHooksInterface {
void init() const override;
Device getDeviceFromPtr(void* data) const override;
bool isPinnedPtr(const void* data) const override;
const Generator& getDefaultGenerator(
DeviceIndex device_index = -1) const override;
const Generator& getDefaultCUDAGenerator(DeviceIndex device_index = -1) const override;
bool hasCUDA() const override;
bool hasMAGMA() const override;
bool hasCuDNN() const override;

View File

@ -37,7 +37,7 @@ within the next one.
bool maybeOverlappingIndices(const TensorBase& t) {
/* Extract size/stride arrays; only consider size >1 dims. */
std::vector<SizeAndStride> info(t.dim());
int dims = t.dim();
auto dims = t.dim();
int nonSize1Dims = 0;
for (int i = 0; i < dims; ++i) {
int64_t size = t.size(i);

View File

@ -8,7 +8,6 @@
#include <iostream>
#include <utility>
#include <chrono>
namespace at {
namespace native {

View File

@ -18,7 +18,7 @@ namespace at::cuda::tunable {
class StreamTimer : public ITimer {
public:
StreamTimer();
virtual ~StreamTimer() override;
~StreamTimer() override;
void Start() override;

View File

@ -19,16 +19,10 @@
#include <cxxabi.h>
#endif
#include <chrono>
#include <fstream>
#include <functional>
#include <limits>
#include <memory>
#include <mutex>
#include <sstream>
#include <string>
#include <thread>
#include <type_traits>
#include <unordered_map>
#include <unordered_set>
#include <utility>
@ -83,7 +77,7 @@ ResultEntry TuningResultsManager::Lookup(const std::string& op_signature, const
return it->second;
}
inline void TuningResultsManager::AddImpl(const std::string& op_signature,
void TuningResultsManager::AddImpl(const std::string& op_signature,
const std::string& params_signature,
ResultEntry best,
KernelMap& kernel_map) {
@ -98,7 +92,7 @@ inline void TuningResultsManager::AddImpl(const std::string& op_signature,
}
TUNABLE_LOG2(op_signature, "(", params_signature, ") -> ", best);
kernel_map.emplace(params_signature, best);
kernel_map.emplace(params_signature, std::move(best));
}
void TuningResultsManager::Add(const std::string& op_signature, const std::string& params_signature, ResultEntry best) {
@ -109,7 +103,7 @@ void TuningResultsManager::Add(const std::string& op_signature, const std::strin
it = results_.insert({op_signature, {}}).first;
}
AddImpl(op_signature, params_signature, best, it->second);
AddImpl(op_signature, params_signature, std::move(best), it->second);
}
void TuningResultsManager::RecordUntuned( std::ofstream& untuned_file, const std::string& op_signature, const std::string& params_signature) {
@ -155,7 +149,7 @@ void TuningResultsManager::Delete(const std::string& op_signature, const std::st
it->second.erase(it2);
}
inline void TuningResultsManager::DisjointMergeImpl(
void TuningResultsManager::DisjointMergeImpl(
const std::string& op_signature,
const KernelMap& kernel_map,
/*out*/ std::unordered_map<std::string, KernelMap>& results) {
@ -205,7 +199,7 @@ size_t TuningResultsManager::GetSize() {
TuningResultsValidator::TuningResultsValidator() {
RegisterValidator(
"PT_VERSION",
[this]() { return GetPyTorchVersion(); },
[]() { return GetPyTorchVersion(); },
[this](auto&& k) { return ValidatePyTorchVersion(std::forward<decltype(k)>(k)); });
#ifdef USE_ROCM
// rocm
@ -368,7 +362,7 @@ void TuningResultsValidator::RegisterValidator(const std::string& key, const Get
}
}
std::string TuningResultsValidator::GetPyTorchVersion() const {
std::string TuningResultsValidator::GetPyTorchVersion() {
return TORCH_VERSION;
}
@ -487,7 +481,7 @@ std::ofstream& TuningContext::GetUntunedFile(){
std::string filename = (env == nullptr) ? "tunableop_untuned.csv" : env;
std::string device = c10::str(int(c10::cuda::current_device()));
std::size_t found = filename.rfind(".");
std::size_t found = filename.rfind('.');
if (found != std::string::npos) {
filename.insert(found, device);
} else {
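The `rfind` change above switches to the single-character overload, avoiding a temporary string. A small sketch of the rewrite it feeds (the else branch here is an assumption, since the original is truncated above):

#include <string>

std::string with_device_suffix(std::string filename, const std::string& device) {
  // "tunableop_untuned.csv" + device "0" -> "tunableop_untuned0.csv"
  std::size_t found = filename.rfind('.');
  if (found != std::string::npos) {
    filename.insert(found, device);
  } else {
    filename += device;  // assumed fallback when there is no extension
  }
  return filename;
}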

View File

@ -10,6 +10,7 @@
#pragma once
#include <c10/util/CallOnce.h>
#include <c10/util/StringUtil.h>
#include <fstream>
#include <functional>
@ -17,11 +18,9 @@
#include <memory>
#include <mutex>
#include <string>
#include <type_traits>
#include <unordered_map>
#include <unordered_set>
#include <utility>
#include <vector>
namespace at::cuda::tunable {
@ -34,11 +33,11 @@ struct MaybeDelete {
using OstreamPtr = std::unique_ptr<std::ostream, MaybeDelete>;
static OstreamPtr get_stream(std::string filename) {
if (filename.compare("out") == 0) {
inline OstreamPtr get_stream(const std::string& filename) {
if (filename == "out") {
return OstreamPtr { &std::cout, MaybeDelete {false} };
}
else if (filename.compare("err") == 0) {
else if (filename == "err") {
return OstreamPtr { &std::cerr, MaybeDelete {false} };
}
else {
@ -72,7 +71,7 @@ enum TORCH_CUDA_CPP_API TuningStatus {
// Mapping from params signature to kernel id
class TORCH_CUDA_CPP_API ResultEntry {
public:
explicit ResultEntry(const std::string& key, double time) : key_(key), time_(time) {}
explicit ResultEntry(std::string key, double time) : key_(std::move(key)), time_(time) {}
bool operator==(const ResultEntry& other) { return key_ == other.key_; }
bool operator!=(const ResultEntry& other) { return key_ != other.key_; }
operator std::string () { return key_; }
@ -108,7 +107,7 @@ class TORCH_CUDA_CPP_API TuningResultsManager {
ResultEntry Lookup(const std::string& op_signature, const std::string& params_signature);
inline void AddImpl(const std::string& op_signature,
void AddImpl(const std::string& op_signature,
const std::string& params_signature,
ResultEntry best,
KernelMap& kernel_map);
@ -119,7 +118,7 @@ class TORCH_CUDA_CPP_API TuningResultsManager {
void Delete(const std::string& op_signature, const std::string& params_signature);
inline void DisjointMergeImpl(
void DisjointMergeImpl(
const std::string& op_signature,
const KernelMap& kernel_map,
/*out*/ ResultsMap& results);
@ -154,7 +153,7 @@ class TORCH_CUDA_CPP_API TuningResultsValidator {
void RegisterValidator(const std::string& key, const GetFunc& gf, const ValidateFunc& vf);
protected:
std::string GetPyTorchVersion() const;
static std::string GetPyTorchVersion();
TuningStatus ValidatePyTorchVersion(const std::string& value) const;
public:
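The `ResultEntry` constructor change in this header is the standard sink-parameter idiom: take the string by value and move it into the member, so rvalue callers pay one move instead of a copy. A minimal standalone version:

#include <string>
#include <utility>

class Entry {
 public:
  explicit Entry(std::string key, double time)
      : key_(std::move(key)), time_(time) {}  // rvalue arguments are moved

 private:
  std::string key_;
  double time_;
};

Entry e{std::string("gemm_64x64_nn"), 0.12};  // the temporary is moved, not copied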

View File

@ -18,7 +18,6 @@
#endif
#include <string>
#include <type_traits>
#include <unordered_map>
#include <vector>
@ -146,7 +145,7 @@ class TunableOp {
bool use_buffer_rotation = (rotating_size > 0);
size_t param_size = params->GetSize(use_buffer_rotation);
size_t param_count = (rotating_size / param_size) + 1;
constexpr size_t MB = 1024*1024;
constexpr size_t MB = 1024ull*1024;
if (use_buffer_rotation) {
TUNABLE_LOG2("Rotating buffer ", rotating_size/MB, " MiB. ",
"Needed Size: ", param_size/MB, " MiB. ",
@ -266,6 +265,7 @@ class TunableOp {
std::string CreateSignature() {
#ifndef _WIN32
const auto* name = typeid(*this).name();
// NOLINTNEXTLINE(*array*)
char buf[256];
size_t buf_len = 256;
abi::__cxa_demangle(name, buf, &buf_len, nullptr);

View File

@ -1,13 +1,9 @@
#pragma once
#include <ATen/core/Generator.h>
#include <c10/core/Allocator.h>
#include <c10/core/Device.h>
#include <c10/core/Stream.h>
#include <c10/core/Allocator.h>
C10_DIAGNOSTIC_PUSH_AND_IGNORED_IF_DEFINED("-Wunused-parameter")
namespace at {
// AcceleratorHooksInterface is a shared interface provided by all
@ -62,18 +58,7 @@ struct TORCH_API AcceleratorHooksInterface {
virtual Device getDeviceFromPtr(void* data) const {
TORCH_CHECK(false, "Backend doesn't support getDeviceFromPtr()");
}
virtual const Generator& getDefaultGenerator(
C10_UNUSED DeviceIndex device_index = -1) const {
TORCH_CHECK(false, "Backend doesn`t support getDefaultGenerator()");
}
virtual Generator getNewGenerator(
C10_UNUSED DeviceIndex device_index = -1) const {
TORCH_CHECK(false, "Backend doesn`t support getNewGenerator()");
}
};
} // namespace at
C10_DIAGNOSTIC_POP()

View File

@ -6,13 +6,16 @@
#include <ATen/detail/AcceleratorHooksInterface.h>
// NB: Class must live in `at` due to limitations of Registry.h.
// Forward-declares at::Generator and at::cuda::NVRTC
namespace at {
// Forward-declares at::cuda::NVRTC
struct Generator;
namespace cuda {
struct NVRTC;
} // namespace cuda
} // namespace at
// NB: Class must live in `at` due to limitations of Registry.h.
namespace at {
#ifdef _MSC_VER
constexpr const char* CUDA_HELP =
@ -66,8 +69,8 @@ struct TORCH_API CUDAHooksInterface : AcceleratorHooksInterface {
TORCH_CHECK(false, "Cannot initialize CUDA without ATen_cuda library. ", CUDA_HELP);
}
const Generator& getDefaultGenerator(
[[maybe_unused]] DeviceIndex device_index = -1) const override {
virtual const Generator& getDefaultCUDAGenerator(
[[maybe_unused]] DeviceIndex device_index = -1) const {
TORCH_CHECK(
false,
"Cannot get default CUDA generator without ATen_cuda library. ",

View File

@ -1,13 +1,19 @@
#pragma once
#include <c10/core/Allocator.h>
#include <c10/core/GeneratorImpl.h>
#include <c10/util/Exception.h>
#include <c10/util/Registry.h>
#include <ATen/detail/AcceleratorHooksInterface.h>
#include <memory>
namespace at {
class Context;
}
// NB: Class must live in `at` due to limitations of Registry.h.
namespace at {
@ -24,9 +30,8 @@ struct TORCH_API HIPHooksInterface : AcceleratorHooksInterface {
TORCH_CHECK(false, "Cannot initialize HIP without ATen_hip library.");
}
const Generator& getDefaultGenerator(
C10_UNUSED DeviceIndex device_index = -1) const override {
TORCH_CHECK(false, "Cannot initialize HIP without ATen_hip library.");
virtual std::unique_ptr<c10::GeneratorImpl> initHIPGenerator(Context*) const {
AT_ERROR("Cannot initialize HIP generator without ATen_hip library.");
}
virtual bool hasHIP() const {
@ -45,6 +50,10 @@ struct TORCH_API HIPHooksInterface : AcceleratorHooksInterface {
TORCH_CHECK(false, "Pinned memory requires HIP.");
}
virtual void registerHIPTypes(Context*) const {
AT_ERROR("Cannot registerHIPTypes() without ATen_hip library.");
}
virtual int getNumGPUs() const {
return 0;
}

View File

@ -1,5 +1,6 @@
#pragma once
#include <ATen/core/Generator.h>
#include <ATen/detail/AcceleratorHooksInterface.h>
#include <c10/core/Allocator.h>
@ -8,7 +9,7 @@
namespace at {
struct TORCH_API IPUHooksInterface : AcceleratorHooksInterface {
struct TORCH_API IPUHooksInterface: AcceleratorHooksInterface {
~IPUHooksInterface() override = default;
void init() const override {
@ -20,14 +21,16 @@ struct TORCH_API IPUHooksInterface : AcceleratorHooksInterface {
return false;
}
const Generator& getDefaultGenerator(
C10_UNUSED DeviceIndex device_index = -1) const override {
TORCH_CHECK(false, "Cannot initialize IPU without ATen_ipu library.");
virtual const Generator& getDefaultIPUGenerator(
DeviceIndex device_index [[maybe_unused]] = -1) const {
AT_ERROR(
"Cannot get the default IPU generator: the IPU backend is not "
"available.");
}
Generator getNewGenerator(
DeviceIndex device_index [[maybe_unused]] = -1) const override {
TORCH_CHECK(false, "Cannot initialize IPU without ATen_ipu library.");
virtual Generator newIPUGenerator(DeviceIndex device_index [[maybe_unused]] = -1) const {
AT_ERROR(
"Cannot create a new IPU generator: the IPU backend is not available.");
}
};

View File

@ -2,9 +2,9 @@
#pragma once
#include <ATen/detail/AcceleratorHooksInterface.h>
#include <c10/core/Allocator.h>
#include <ATen/core/Generator.h>
#include <ATen/detail/AcceleratorHooksInterface.h>
#include <c10/util/Exception.h>
#include <c10/util/Registry.h>
@ -31,8 +31,7 @@ struct TORCH_API MPSHooksInterface : AcceleratorHooksInterface {
virtual bool isOnMacOSorNewer(unsigned major = 13, unsigned minor = 0) const {
FAIL_MPSHOOKS_FUNC(__func__);
}
const Generator& getDefaultGenerator(
C10_UNUSED DeviceIndex device_index = -1) const override {
virtual const Generator& getDefaultMPSGenerator() const {
FAIL_MPSHOOKS_FUNC(__func__);
}
virtual Allocator* getMPSDeviceAllocator() const {

View File

@ -1,20 +1,18 @@
#pragma once
#include <ATen/core/Generator.h>
#include <ATen/detail/AcceleratorHooksInterface.h>
#include <c10/core/Allocator.h>
#include <c10/core/Device.h>
#include <c10/core/Storage.h>
#include <c10/util/Exception.h>
C10_DIAGNOSTIC_PUSH_AND_IGNORED_IF_DEFINED("-Wunused-parameter")
namespace at {
struct TORCH_API PrivateUse1HooksInterface : AcceleratorHooksInterface {
~PrivateUse1HooksInterface() override = default;
const at::Generator& getDefaultGenerator(
c10::DeviceIndex device_index) const override {
virtual const at::Generator& getDefaultGenerator(
c10::DeviceIndex device_index) const {
TORCH_CHECK_NOT_IMPLEMENTED(
false,
"You should register `PrivateUse1HooksInterface` for PrivateUse1 before call `getDefaultGenerator`.");
@ -26,17 +24,17 @@ struct TORCH_API PrivateUse1HooksInterface : AcceleratorHooksInterface {
"You should register `PrivateUse1HooksInterface` for PrivateUse1 before call `getDeviceFromPtr`.");
}
bool isPinnedPtr(const void* data) const override {
virtual bool isPinnedPtr(const void* data) const override {
return false;
}
Allocator* getPinnedMemoryAllocator() const override {
virtual Allocator* getPinnedMemoryAllocator() const override {
TORCH_CHECK(
false,
"You should register `PrivateUse1HooksInterface` for PrivateUse1 before call `getPinnedMemoryAllocator`.");
}
bool hasPrimaryContext(DeviceIndex device_index) const override {
virtual bool hasPrimaryContext(DeviceIndex device_index) const override {
TORCH_CHECK_NOT_IMPLEMENTED(
false,
"You should register `PrivateUse1HooksInterface` for PrivateUse1 before call `hasPrimaryContext`.");

View File

@ -4,6 +4,7 @@
#include <c10/util/Exception.h>
#include <c10/util/Registry.h>
#include <ATen/core/Generator.h>
#include <ATen/detail/AcceleratorHooksInterface.h>
C10_DIAGNOSTIC_PUSH_AND_IGNORED_IF_DEFINED("-Wunused-parameter")
@ -31,15 +32,15 @@ struct TORCH_API XPUHooksInterface : AcceleratorHooksInterface{
TORCH_CHECK(false, "Cannot get XPU global device index without ATen_xpu library.");
}
const Generator& getDefaultGenerator(
[[maybe_unused]] DeviceIndex device_index = -1) const override {
TORCH_CHECK(
false, "Cannot get default XPU generator without ATen_xpu library.");
virtual Generator getXPUGenerator(
[[maybe_unused]] DeviceIndex device_index = -1) const {
TORCH_CHECK(false, "Cannot get XPU generator without ATen_xpu library.");
}
Generator getNewGenerator(
[[maybe_unused]] DeviceIndex device_index = -1) const override {
TORCH_CHECK(false, "Cannot get XPU generator without ATen_xpu library.");
virtual const Generator& getDefaultXPUGenerator(
[[maybe_unused]] DeviceIndex device_index = -1) const {
TORCH_CHECK(
false, "Cannot get default XPU generator without ATen_xpu library.");
}
virtual DeviceIndex getNumGPUs() const {

View File

@ -19,8 +19,7 @@ struct MPSHooks : public at::MPSHooksInterface {
bool isOnMacOSorNewer(unsigned major, unsigned minor) const override;
// MPSGeneratorImpl interface
const Generator& getDefaultGenerator(
DeviceIndex device_index = -1) const override;
const Generator& getDefaultMPSGenerator() const override;
// MPSStream interface
void deviceSynchronize() const override;

View File

@ -59,7 +59,7 @@ Allocator* MPSHooks::getMPSDeviceAllocator() const {
return at::mps::GetMPSAllocator();
}
const Generator& MPSHooks::getDefaultGenerator([[maybe_unused]] DeviceIndex device_index) const {
const Generator& MPSHooks::getDefaultMPSGenerator() const {
return at::mps::detail::getDefaultMPSGenerator();
}

View File

@ -26,6 +26,7 @@
#include <ATen/native/cpu/SerialStackImpl.h>
#include <ATen/native/cpu/StackKernel.h>
#include <ATen/quantized/QTensorImpl.h>
#include <c10/core/GradMode.h>
#include <c10/util/Exception.h>
#include <optional>
#include <c10/util/SmallVector.h>
@ -4071,29 +4072,41 @@ void split_copy_Tensor_out(const at::Tensor & self, int64_t split_size, int64_t
}
}
void split_with_sizes_copy_out(const at::Tensor & self, at::IntArrayRef split_sizes, int64_t dim, at::TensorList out) {
auto tmp = self.split_with_sizes(split_sizes, dim);
namespace {
TORCH_CHECK(out.size() == tmp.size(), "split_with_sizes_copy_out() expected an out= argument of size ", tmp.size(), ", got size ", out.size());
void copy_tensor_array_to_out(const char* name, const std::vector<Tensor>& array, at::TensorList out) {
TORCH_CHECK(out.size() == array.size(), name, " expected an out= argument of size ", array.size(), ", got size ", out.size());
for (const auto i : c10::irange(out.size())) {
if (resize_output_check(out[i], tmp[i].sizes())) {
out[i].resize_(tmp[i].sizes());
if (resize_output_check(out[i], array[i].sizes())) {
out[i].resize_(array[i].sizes());
}
TORCH_CHECK(out[i].dtype() == tmp[i].dtype(),
"Expected out tensor to have dtype ", tmp[i].dtype(), ", but got ", out[i].dtype(), " instead");
TORCH_CHECK(out[i].device() == tmp[i].device(),
"Expected out tensor to have device ", tmp[i].device(), ", but got ", out[i].device(), " instead");
out[i].copy_(tmp[i]);
TORCH_CHECK(out[i].dtype() == array[i].dtype(),
"Expected out tensor to have dtype ", array[i].dtype(), ", but got ", out[i].dtype(), " instead");
TORCH_CHECK(out[i].device() == array[i].device(),
"Expected out tensor to have device ", array[i].device(), ", but got ", out[i].device(), " instead");
out[i].copy_(array[i]);
}
}
void unbind_copy_int_out(const at::Tensor & self, int64_t dim, at::TensorList out) {
auto tmp = self.unbind(dim);
}
TORCH_CHECK(out.size() == tmp.size(), "unbind_copy_int_out() expected an out= argument of size ", tmp.size(), ", got size ", out.size());
for (const auto i : c10::irange(out.size())) {
out[i].copy_(tmp[i]);
void split_with_sizes_copy_out(const at::Tensor & self, at::IntArrayRef split_sizes, int64_t dim, at::TensorList out) {
auto tmp = self.split_with_sizes(split_sizes, dim);
copy_tensor_array_to_out("split_with_sizes_copy_out()", tmp, out);
}
void unbind_copy_int_out(const at::Tensor & self, int64_t dim, at::TensorList out) {
if (at::GradMode::is_enabled()) {
for (const auto i : c10::irange(out.size())) {
TORCH_CHECK(!out[i].requires_grad(),
"unbind_copy(): functions with out=... arguments don't support automatic differentiation, "
"but one of the arguments requires grad."
);
}
}
auto tmp = self.unbind(dim);
copy_tensor_array_to_out("unbind_copy_int_out()", tmp, out);
}
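A hedged usage sketch of the semantics guarded here: unbind returns views that alias self, unbind_copy allocates, and with out= the new check refuses autograd-tracked outputs:

auto x = at::arange(6).reshape({3, 2});
auto views  = at::unbind(x, 0);       // slices alias x's storage
auto copies = at::unbind_copy(x, 0);  // independent storage
copies[0].add_(1);                    // x is unchanged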
int64_t sparse_dim_default(const Tensor& self) {

View File

@ -36,6 +36,7 @@
#include <ATen/native/TensorIteratorDynamicCasting.h>
#include <ATen/cpu/vec/vec.h>
#include <tuple>
#include <utility>
namespace at::native { inline namespace CPU_CAPABILITY {

View File

@ -6,10 +6,9 @@
#include <c10/core/Scalar.h>
#include <c10/util/irange.h>
#include <sstream>
#include <type_traits>
namespace at { namespace native { inline namespace CPU_CAPABILITY {
namespace at::native { inline namespace CPU_CAPABILITY {
using namespace vec;
@ -308,4 +307,4 @@ void binary_kernel_reduce_lastdim(TensorIteratorBase& iter, reduce_func_t reduce
sub_iter.for_each(loop, grain_size);
}
}}} // namespace at::native::<anonymous>
}} // namespace at::native::<anonymous>

View File

@ -124,6 +124,55 @@ __global__ void indexing_backward_kernel(
}
}
#ifdef USE_ROCM
template <typename scalar_t, bool accumulate>
__global__ void indexing_backward_kernel_rocm(
const int64_t* sorted_indices, const int64_t* indices, const scalar_t* grad_output, scalar_t* grad_weight,
int64_t numel, int64_t stride, int64_t stride_before, int64_t outer_dim) {
// This implementation is adapted from indexing_backward_kernel above.
using opmath_t = at::opmath_type<scalar_t>;
for (int64_t z = blockIdx.z; z < outer_dim; z += gridDim.z){
int64_t idx = blockIdx.x * blockDim.y + threadIdx.y;
if (idx < numel && (idx == 0 || sorted_indices[idx] != sorted_indices[idx - 1])){
do {
// if not accumulate, we only keep the last duplicate index so skip those before it
if constexpr (!accumulate) {
if ((idx < numel - 1) && sorted_indices[idx] == sorted_indices[idx + 1]) {
idx++;
continue;
}
}
const int64_t weight_row = ((int64_t) sorted_indices[idx]) * stride + z * stride_before;
const int64_t grad_row = ((int64_t) indices[idx]) * stride + z * numel * stride;
opmath_t gradient;
opmath_t weight;
int64_t feature_dim = threadIdx.x + blockIdx.y * blockDim.x;
while (feature_dim < stride) {
gradient = static_cast<opmath_t>(grad_output[grad_row + feature_dim]);
if constexpr (accumulate) {
weight = static_cast<opmath_t>(grad_weight[weight_row + feature_dim]);
}
if constexpr (accumulate) {
weight += gradient;
} else {
weight = gradient;
}
grad_weight[weight_row + feature_dim] = static_cast<scalar_t>(weight);
feature_dim += gridDim.y * blockDim.x;
}
idx++;
} while (idx < numel && sorted_indices[idx] == sorted_indices[idx - 1]);
}
}
}
#endif
template <typename scalar_t>
__global__ void indexing_backward_kernel_stride_1(
const int64_t* sorted_indices, const int64_t* indices, const scalar_t* grad_output, scalar_t* grad_weight,
@ -491,7 +540,11 @@ void index_put_with_sort_kernel(Tensor & self, const c10::List<std::optional<Ten
linearIndex.numel()*sliceSize*nElemBefore == expandedValue.numel(),
"number of flattened indices did not match number of elements in the value tensor: ",
linearIndex.numel()*sliceSize*nElemBefore, " vs ", expandedValue.numel());
#ifdef USE_ROCM
const int UNROLL = 1;
#else
const int UNROLL = 4;
#endif
const int indices_per_block = 4;
const int warp_size = at::cuda::warp_size();
dim3 grid(ceil_div(num_indices, (int64_t) indices_per_block),
@ -549,6 +602,54 @@ void index_put_with_sort_kernel(Tensor & self, const c10::List<std::optional<Ten
kHalf,
kBool,
kBFloat16);
#ifdef USE_ROCM
} else if (UNROLL == 1) {
if (accumulate) {
AT_DISPATCH_V2(
expandedValue.scalar_type(),
"indexing_backward",
AT_WRAP([&] {
indexing_backward_kernel_rocm<scalar_t, true><<<grid, block, 0, stream>>>(
sorted_indices.const_data_ptr<int64_t>(),
orig_indices.const_data_ptr<int64_t>(),
expandedValue.const_data_ptr<scalar_t>(),
src_.mutable_data_ptr<scalar_t>(),
num_indices,
sliceSize,
strideBefore,
nElemBefore);
C10_CUDA_KERNEL_LAUNCH_CHECK();
}),
AT_EXPAND(AT_ALL_TYPES_AND_COMPLEX),
AT_EXPAND(AT_FLOAT8_TYPES),
kComplexHalf,
kHalf,
kBool,
kBFloat16);
} else {
AT_DISPATCH_V2(
expandedValue.scalar_type(),
"indexing_backward",
AT_WRAP([&] {
indexing_backward_kernel_rocm<scalar_t, false><<<grid, block, 0, stream>>>(
sorted_indices.const_data_ptr<int64_t>(),
orig_indices.const_data_ptr<int64_t>(),
expandedValue.const_data_ptr<scalar_t>(),
src_.mutable_data_ptr<scalar_t>(),
num_indices,
sliceSize,
strideBefore,
nElemBefore);
C10_CUDA_KERNEL_LAUNCH_CHECK();
}),
AT_EXPAND(AT_ALL_TYPES_AND_COMPLEX),
AT_EXPAND(AT_FLOAT8_TYPES),
kComplexHalf,
kHalf,
kBool,
kBFloat16);
}
#endif
} else {
AT_DISPATCH_V2(
expandedValue.scalar_type(),
@ -572,8 +673,8 @@ void index_put_with_sort_kernel(Tensor & self, const c10::List<std::optional<Ten
kHalf,
kBool,
kBFloat16);
}
}
}
if (permuted) {
self.copy_(src_.permute(inversePerm));
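A small semantic sketch of what these kernels implement (hedged: with accumulate=false, duplicate indices resolve to a single surviving write; the ROCm kernel above keeps the last duplicate):

auto dst = at::zeros({4});
auto idx = at::tensor({1, 1, 3});
auto val = at::tensor({1.0f, 2.0f, 5.0f});

dst.index_put_({idx}, val, /*accumulate=*/true);   // duplicates sum: dst[1] == 3.0
dst.zero_();
dst.index_put_({idx}, val, /*accumulate=*/false);  // one write survives: dst[1] == 2.0 here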

View File

@ -11,6 +11,7 @@
#include <ATen/native/cuda/MemoryAccess.cuh>
#include <tuple>
namespace at::native {

View File

@ -2363,7 +2363,8 @@ DropoutState& get_dropout_state(
std::unique_lock<std::mutex> lock{state_cache_mut};
auto& state = dropout_state_cache.at(device);
if (train && dropout_p > 0) {
const auto& gen = at::detail::getCUDAHooks().getDefaultGenerator(device);
const auto& gen =
at::detail::getCUDAHooks().getDefaultCUDAGenerator(device);
auto gen_impl = gen.get<at::CUDAGeneratorImpl>();
bool reset_rnn_state = gen_impl->reset_rnn_state();
if (!state.buffer.defined() || reset_rnn_state) {

View File

@ -3357,7 +3357,7 @@
dispatch:
CUDA: _cslt_compress
- func: _cslt_sparse_mm(Tensor compressed_A, Tensor dense_B, Tensor? bias=None, Tensor? alpha=None, ScalarType? out_dtype=None, bool transpose_result=False, int alg_id=0) -> Tensor
- func: _cslt_sparse_mm(Tensor compressed_A, Tensor dense_B, Tensor? bias=None, Tensor? alpha=None, ScalarType? out_dtype=None, bool transpose_result=False, int alg_id=0, int split_k=1, bool split_k_one_kernel=True) -> Tensor
dispatch:
CUDA: _cslt_sparse_mm

View File

@ -22,6 +22,7 @@
#include <ATen/ops/tensor.h>
#endif
#include <tuple>
#include <utility>
#include <vector>

View File

@ -1,20 +1,7 @@
#include <ATen/cuda/CUDAContext.h>
#include <ATen/cuda/CUDADataType.h>
#include <ATen/cuda/CUDASparse.h>
#include <ATen/cuda/CUDAConfig.h>
#include <ATen/core/Tensor.h>
#include <ATen/Dispatch.h>
#include <ATen/Functions.h>
#include <c10/core/ScalarType.h>
#include <c10/cuda/CUDACachingAllocator.h>
#include <c10/util/Half.h>
#include <cusparse.h>
#include <cstdint>
#include <ATen/native/sparse/cuda/cuSPARSELtOps.h>
#if AT_CUSPARSELT_ENABLED()
#include <cusparseLt.h>
namespace at::native {
// Ideally we would use the same DeviceThreadHandlePool mechanism as used in aten/src/ATen/cuda/CuSparseHandlePool.cpp
@ -56,6 +43,7 @@ at::Tensor _cslt_compress(const Tensor& sparse_input)
#if defined(CUSPARSELT_VERSION) && CUSPARSELT_VERSION >= 602
case at::ScalarType::Float8_e4m3fn:
type = CUDA_R_8F_E4M3;
compression_factor = 10;
break;
#endif
default:
@ -103,7 +91,7 @@ at::Tensor _cslt_compress(const Tensor& sparse_input)
return compressed_tensor;
}
std::tuple<int64_t, at::Tensor> _cslt_sparse_mm_impl(
std::tuple<at::Tensor, int64_t, int64_t, bool, int64_t> _cslt_sparse_mm_impl(
const Tensor& compressed_A,
const Tensor& dense_B,
const std::optional<Tensor>& bias_opt,
@ -111,6 +99,8 @@ std::tuple<int64_t, at::Tensor> _cslt_sparse_mm_impl(
const std::optional<c10::ScalarType> out_dtype_opt,
bool transpose_result,
int alg_id,
int split_k,
bool split_k_one_kernel,
bool search_alg_id
)
{
@ -169,6 +159,7 @@ std::tuple<int64_t, at::Tensor> _cslt_sparse_mm_impl(
output_type = CUDA_R_8F_E4M3;
C_type = CUDA_R_16F;
compute_type = CUSPARSE_COMPUTE_32F;
compression_factor = 10;
break;
#endif
// cuSPARSELt <= v0.5.2 uses CUSPARSE_COMPUTE_TF32, CUSPARSE_COMPUTE_16F
@ -335,10 +326,21 @@ std::tuple<int64_t, at::Tensor> _cslt_sparse_mm_impl(
TORCH_CUDASPARSE_CHECK(cusparseLtMatmulAlgSelectionInit(
&handle, &alg_sel, &matmul, CUSPARSELT_MATMUL_ALG_DEFAULT));
// set alg_id
// set matmul search params
TORCH_CUDASPARSE_CHECK(cusparseLtMatmulAlgSetAttribute(
&handle, &alg_sel, CUSPARSELT_MATMUL_ALG_CONFIG_ID, &alg_id, sizeof(alg_id)));
cusparseLtSplitKMode_t splitKMode;
int max_alg_id;
if (split_k != 1) {
TORCH_CUDASPARSE_CHECK(cusparseLtMatmulAlgSetAttribute(
&handle, &alg_sel, CUSPARSELT_MATMUL_SPLIT_K, &split_k, sizeof(split_k)));
splitKMode = split_k_one_kernel ? CUSPARSELT_SPLIT_K_MODE_ONE_KERNEL : CUSPARSELT_SPLIT_K_MODE_TWO_KERNELS;
TORCH_CUDASPARSE_CHECK(cusparseLtMatmulAlgSetAttribute(
&handle, &alg_sel, CUSPARSELT_MATMUL_SPLIT_K_MODE, &splitKMode, sizeof(splitKMode)));
}
// set tensor_alpha_mode and alpha pointer for matmul
const auto alpha_tensor = alpha_opt.has_value() ? *alpha_opt: Tensor{};
auto alpha_ptr = &alpha;
@ -381,9 +383,23 @@ std::tuple<int64_t, at::Tensor> _cslt_sparse_mm_impl(
&stream,
1));
// get alg_id used
// get matmul params used
TORCH_CUDASPARSE_CHECK(cusparseLtMatmulAlgGetAttribute(
&handle, &alg_sel, CUSPARSELT_MATMUL_ALG_CONFIG_ID, &alg_id, sizeof(alg_id)));
TORCH_CUDASPARSE_CHECK( cusparseLtMatmulAlgGetAttribute(&handle, &alg_sel,
CUSPARSELT_MATMUL_SPLIT_K,
&split_k, sizeof(split_k)));
TORCH_CUDASPARSE_CHECK( cusparseLtMatmulAlgGetAttribute(&handle, &alg_sel,
CUSPARSELT_MATMUL_SPLIT_K_MODE,
&splitKMode, sizeof(splitKMode)));
TORCH_CUDASPARSE_CHECK( cusparseLtMatmulAlgGetAttribute(&handle, &alg_sel,
CUSPARSELT_MATMUL_ALG_CONFIG_MAX_ID,
&max_alg_id, sizeof(max_alg_id)));
}
else {
// do normal matmul
@ -411,7 +427,7 @@ std::tuple<int64_t, at::Tensor> _cslt_sparse_mm_impl(
// destroy plan
TORCH_CUDASPARSE_CHECK(cusparseLtMatmulPlanDestroy(&plan));
return {alg_id, res};
return {res, alg_id, split_k, splitKMode == CUSPARSELT_SPLIT_K_MODE_ONE_KERNEL, max_alg_id};
}
at::Tensor _cslt_sparse_mm(
@ -421,7 +437,9 @@ at::Tensor _cslt_sparse_mm(
const std::optional<Tensor>& alpha_opt,
const std::optional<c10::ScalarType> out_dtype_opt,
bool transpose_result,
int64_t alg_id
int64_t alg_id,
int64_t split_k,
bool split_k_one_kernel
)
{
auto result = _cslt_sparse_mm_impl(
@ -432,8 +450,10 @@ at::Tensor _cslt_sparse_mm(
out_dtype_opt,
transpose_result,
(int) alg_id,
(int) split_k,
split_k_one_kernel,
false);
return std::get<1>(result);
return std::get<0>(result);
}
int64_t _cslt_sparse_mm_search(
@ -445,7 +465,10 @@ int64_t _cslt_sparse_mm_search(
bool transpose_result
)
{
TORCH_WARN_ONCE("torch._cslt_sparse_mm_search is deprecated and will be removed in a future PyTorch release. Please use torch._C._cusparselt.mm_search instead.");
int alg_id_int = 0;
int split_k = 1;
bool split_k_one_kernel= true;
auto result = _cslt_sparse_mm_impl(
compressed_A,
dense_B,
@ -454,11 +477,12 @@ int64_t _cslt_sparse_mm_search(
out_dtype_opt,
transpose_result,
alg_id_int,
split_k,
split_k_one_kernel,
true);
return (int64_t) std::get<0>(result);
return (int64_t) std::get<1>(result);
}
} // namespace at::native
#else // No cuSPARSELt support, throw error if these functions are called.
@ -476,7 +500,9 @@ at::Tensor _cslt_sparse_mm(
const std::optional<Tensor>& alpha_opt,
const std::optional<c10::ScalarType> out_dtype,
bool transpose_result,
int64_t alg_id)
int64_t alg_id,
int64_t split_k,
bool split_k_one_kernel)
{
TORCH_CHECK(false, "cuSPARSELt not supported on your machine.");
}

View File

@ -0,0 +1,58 @@
#pragma once
#include <ATen/cuda/CUDAContext.h>
#include <ATen/cuda/CUDADataType.h>
#include <ATen/cuda/CUDASparse.h>
#include <ATen/cuda/CUDAConfig.h>
#include <ATen/core/Tensor.h>
#include <ATen/Dispatch.h>
#include <ATen/Functions.h>
#include <c10/core/ScalarType.h>
#include <c10/cuda/CUDACachingAllocator.h>
#include <c10/util/Half.h>
#include <cusparse.h>
#include <cstdint>
#if AT_CUSPARSELT_ENABLED()
#include <cusparseLt.h>
#endif
namespace at::native {
at::Tensor _cslt_compress(const Tensor& sparse_input);
TORCH_CUDA_CPP_API std::tuple<at::Tensor, int64_t, int64_t, bool, int64_t> _cslt_sparse_mm_impl(
const Tensor& compressed_A,
const Tensor& dense_B,
const std::optional<Tensor>& bias_opt,
const std::optional<Tensor>& alpha_opt,
const std::optional<c10::ScalarType> out_dtype_opt,
bool transpose_result,
int alg_id,
int split_k,
bool split_k_one_kernel,
bool search_alg_id
);
at::Tensor _cslt_sparse_mm(
const Tensor& compressed_A,
const Tensor& dense_B,
const std::optional<Tensor>& bias_opt,
const std::optional<Tensor>& alpha_opt,
const std::optional<c10::ScalarType> out_dtype_opt,
bool transpose_result,
int64_t alg_id,
int64_t split_k,
bool split_k_one_kernel
);
int64_t _cslt_sparse_mm_search(
const Tensor& compressed_A,
const Tensor& dense_B,
const std::optional<Tensor>& bias_opt,
const std::optional<Tensor>& alpha_opt,
const std::optional<c10::ScalarType> out_dtype_opt,
bool transpose_result
);
} // namespace at::native
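A hypothetical call site for the widened impl signature (placeholder tensor names; the tuple order matches the declaration above: result first, then the tuning parameters actually used):

auto [res, alg_id, split_k, split_k_one_kernel, max_alg_id] =
    at::native::_cslt_sparse_mm_impl(
        compressed_A, dense_B,
        /*bias_opt=*/std::nullopt, /*alpha_opt=*/std::nullopt,
        /*out_dtype_opt=*/std::nullopt, /*transpose_result=*/false,
        /*alg_id=*/0, /*split_k=*/1, /*split_k_one_kernel=*/true,
        /*search_alg_id=*/true);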

View File

@ -68,16 +68,11 @@ bool check_prefer_cudnn_attention() {
std::array<SDPBackend, num_backends> priority_order(sdp_params const& params) {
constexpr std::array<SDPBackend, num_backends> default_order{
SDPBackend::flash_attention,
SDPBackend::cudnn_attention,
SDPBackend::efficient_attention,
SDPBackend::math};
constexpr std::array<SDPBackend, num_backends> cudnn_order{
SDPBackend::math,
SDPBackend::cudnn_attention,
SDPBackend::flash_attention,
SDPBackend::efficient_attention,
SDPBackend::math};
static const bool prefer_cudnn = check_prefer_cudnn_attention();
return prefer_cudnn ? cudnn_order : default_order;
};
return default_order;
}
bool use_tensor_cores(sdp_params const& params, cudaDeviceProp* dprops, bool is_half) {

View File

@ -34,12 +34,13 @@ int32_t XPUHooks::getGlobalIdxFromDevice(const at::Device& device) const {
#endif
}
const Generator& XPUHooks::getDefaultGenerator(DeviceIndex device_index) const {
return at::xpu::detail::getDefaultXPUGenerator(device_index);
Generator XPUHooks::getXPUGenerator(DeviceIndex device_index) const {
return make_generator<at::XPUGeneratorImpl>(device_index);
}
Generator XPUHooks::getNewGenerator(DeviceIndex device_index) const {
return make_generator<at::XPUGeneratorImpl>(device_index);
const Generator& XPUHooks::getDefaultXPUGenerator(
DeviceIndex device_index) const {
return at::xpu::detail::getDefaultXPUGenerator(device_index);
}
Device XPUHooks::getDeviceFromPtr(void* data) const {

View File

@ -11,9 +11,9 @@ struct XPUHooks : public at::XPUHooksInterface {
bool hasXPU() const override;
std::string showConfig() const override;
int32_t getGlobalIdxFromDevice(const at::Device& device) const override;
const Generator& getDefaultGenerator(
Generator getXPUGenerator(DeviceIndex device_index = -1) const override;
const Generator& getDefaultXPUGenerator(
DeviceIndex device_index = -1) const override;
Generator getNewGenerator(DeviceIndex device_index = -1) const override;
Device getDeviceFromPtr(void* data) const override;
c10::DeviceIndex getNumGPUs() const override;
DeviceIndex current_device() const override;

View File

@ -0,0 +1,44 @@
import sys
from benchmark_base import BenchmarkBase
import torch
class Benchmark(BenchmarkBase):
    N = 200

    def name(self):
        return "symint_sum"

    def description(self):
        return "see https://docs.google.com/document/d/11xJXl1etSmefUxPiVyk885e0Dl-4o7QwxYcPiMIo2iY/edit"

    def _prepare_once(self):
        torch._dynamo.config.capture_scalar_outputs = True
        torch.manual_seed(0)
        self.splits = torch.randint(10, (self.N,))

    def _prepare(self):
        torch._dynamo.reset()

    def _work(self):
        @torch.compile(fullgraph=True)
        def f(a):
            xs = a.tolist()
            y = sum(xs)
            return torch.tensor(y)

        f(self.splits)


def main():
    result_path = sys.argv[1]
    Benchmark().enable_compile_time_instruction_count().collect_all().append_results(
        result_path
    )


if __name__ == "__main__":
    main()

View File

@ -1,253 +0,0 @@
import argparse
import random
import pandas as pd
from tqdm import tqdm
import torch
import torch.utils.benchmark as benchmark
from torch import nn
from torch.sparse import SparseSemiStructuredTensor, to_sparse_semi_structured
torch.set_printoptions(
precision=2,
threshold=None,
edgeitems=16,
linewidth=480,
profile=None,
sci_mode=False,
)
# helper model definition for pruner
class Model(nn.Module):
    def __init__(self, m, k, dtype=None):
        super().__init__()
        # transposed so reversed
        self.linear = nn.Linear(k, m)

    def forward(self, x):
        return self.linear(x)


def rand_sparse_semi_structured_mask(
    r, c, dtype=torch.float16, device="cuda", choice=None
):
    """
    This function returns a 1:2 sparse matrix of size (r, c).
    Note that this means this matrix will also be 2:4 and 4:8 sparse as well.
    """
    choices = [[0, 1], [1, 0]]
    mask_entries = [choice or random.choice(choices) for i in range(r * c // 2)]
    return (
        torch.tensor(mask_entries, dtype=dtype, device=device)
        .reshape(r, c)
        .contiguous()
    )


def test_linear(m, k, n, dtype, contiguous, backend):
    SparseSemiStructuredTensor._FORCE_CUTLASS = backend == "cutlass"
    mask = rand_sparse_semi_structured_mask(m, k, dtype=dtype)
    sparse_weight = torch.rand(m, k).to(dtype).cuda() * mask
    input_tensor = torch.zeros(n, k).to(dtype).cuda()
    model = Model(m, k).to(dtype).cuda().eval()

    dense_measurement = benchmark.Timer(
        stmt="model(input_tensor)",
        globals=locals(),
    ).blocked_autorange()

    dense_output = model(input_tensor)
    print(dense_output.shape)

    # sparsify weights
    model.linear.weight = nn.Parameter(
        to_sparse_semi_structured(
            sparse_weight,
        )
    )

    sparse_output = model(input_tensor)
    print(sparse_output.shape)

    sparse_measurement = benchmark.Timer(
        stmt="model(input_tensor)",
        globals=locals(),
    ).blocked_autorange()

    correct = torch.allclose(dense_output, sparse_output, rtol=1e-3, atol=1e-3)

    return {
        "test_function": "linear",
        "m": m,
        "k": k,
        "n": n,
        "dtype": str(dtype),
        "backend": backend,
        "sparse_latency (ms)": sparse_measurement.median * 1000,
        "dense_latency (ms)": dense_measurement.median * 1000,
        "speedup (d/s)": dense_measurement.median / sparse_measurement.median,
        "correct": correct,
        "contiguous": sparse_output.is_contiguous(),
    }


def test_tensor(m, k, n, dtype, contiguous, backend):
    A = rand_sparse_semi_structured_mask(m, k, dtype=dtype)
    B = torch.zeros(k, n).to(dtype).cuda()
    bias = torch.rand(n).to(dtype).cuda()

    sA = to_sparse_semi_structured(A)

    # torch.mm calculation
    if dtype is not torch.int8:
        dense_output = torch.mm(A, B)

        dense_measurement = benchmark.Timer(
            stmt="torch.mm(A, B)",
            globals=locals(),
        ).blocked_autorange()
    else:
        print("int8 baseline not supported")
        dense_output = torch.mm(sA, B)

        dense_measurement = benchmark.Timer(
            stmt="torch.mm(sA, B)",
            globals=locals(),
        ).blocked_autorange()

    sparse_output = torch.mm(sA, B)
    sparse_measurement = benchmark.Timer(
        stmt="torch.mm(sA, B)",
        globals=locals(),
    ).blocked_autorange()

    correct = torch.allclose(dense_output, sparse_output, rtol=1e-3, atol=1e-3)

    return {
        "test_function": "tensor",
        "m": m,
        "k": k,
        "n": n,
        "dtype": str(dtype),
        "backend": backend,
        "sparse_latency (ms)": sparse_measurement.median * 1000,
        "dense_latency (ms)": dense_measurement.median * 1000,
        "speedup (d/s)": dense_measurement.median / sparse_measurement.median,
        "correct": correct,
        "contiguous": sparse_output.is_contiguous(),
    }


if __name__ == "__main__":
    dtype_lookup = {
        "int8": torch.int8,
        "fp16": torch.float16,
        "bf16": torch.bfloat16,
        "fp32": torch.float32,
    }

    parser = argparse.ArgumentParser(description="Semi-Structured Sparsity Benchmarks")
    parser.add_argument(
        "--mode",
        type=str,
        choices=[
            "nvidia-bert",
            "nvidia-fixed-k",
            "nvidia-fixed-mn",
        ],
    )
    parser.add_argument(
        "--dtype",
        type=str,
        choices=dtype_lookup.keys(),
        default="fp16",
    )
    parser.add_argument(
        "--backend", type=str, choices=["cutlass", "cusparselt"], default="cusparselt"
    )
    parser.add_argument("-contiguous", action="store_true")
    parser.add_argument("-e2e", action="store_true")
    parser.add_argument("-save", action="store_true")
    args = parser.parse_args()

    if args.e2e:
        eval_fn = test_linear
    else:
        eval_fn = test_tensor

    print(f"Started benchmark: {args.mode} | dtype: {args.dtype}")
    dtype = dtype_lookup[args.dtype]

    if args.mode == "nvidia-bert":
        bert_shapes = [
            (3072, 1024, 16384),
            (4096, 1024, 16384),
            (1024, 1024, 16384),
            (1024, 4096, 16384),
        ]
        results = (
            eval_fn(m, k, n, dtype, args.contiguous, args.backend)
            for (m, k, n) in tqdm(bert_shapes)
        )
    elif args.mode == "nvidia-fixed-k":
        mn_vals = [
            3072,
            4096,
            5120,
            6144,
            7168,
            8192,
            9216,
            10240,
            11264,
            12288,
            13312,
            14336,
            15360,
            16384,
            17408,
            18432,
            19456,
            20480,
        ]
        results = (
            eval_fn(mn, 10240, mn, dtype, args.contiguous, args.backend)
            for mn in tqdm(mn_vals)
        )
    elif args.mode == "nvidia-fixed-mn":
        k_vals = [
            2560,
            3840,
            5120,
            6400,
            7680,
            8960,
            10240,
            11520,
            12800,
            14080,
            15360,
            16640,
            17920,
            19200,
            20480,
        ]
        results = (
            eval_fn(10240, k, 10240, dtype, args.contiguous, args.backend)
            for k in tqdm(k_vals)
        )

    df = pd.DataFrame.from_records(results)
    if args.save:
        save_file = f"{args.mode}_{args.dtype}_{args.backend}.csv"
        df.to_csv(save_file)
        print(f"Finished benchmark: {args.mode} saved results to {save_file}")
    print(df)

View File

@ -87,8 +87,6 @@ void reportOutOfMemoryToProfiler(
}
}
MemoryReportingInfoBase::MemoryReportingInfoBase() = default;
void MemoryReportingInfoBase::reportOutOfMemory(
int64_t /*alloc_size*/,
size_t /*total_allocated*/,

View File

@ -157,6 +157,7 @@ inline bool operator!=(std::nullptr_t, const DataPtr& dp) noexcept {
// possible, or the raw interface will be incorrectly reported as unsupported,
// when it is actually possible.
// NOLINTNEXTLINE(cppcoreguidelines-special-member-functions)
struct C10_API Allocator {
virtual ~Allocator() = default;
@ -223,10 +224,24 @@ struct C10_API Allocator {
// allocation InefficientStdFunctionContext, on top of the dynamic
// allocation which is implied by std::function itself.
struct C10_API InefficientStdFunctionContext {
void* ptr_;
void* ptr_{nullptr};
std::function<void(void*)> deleter_;
InefficientStdFunctionContext(void* ptr, std::function<void(void*)> deleter)
: ptr_(ptr), deleter_(std::move(deleter)) {}
InefficientStdFunctionContext(const InefficientStdFunctionContext&) = delete;
InefficientStdFunctionContext(InefficientStdFunctionContext&& rhs) noexcept
: ptr_(std::exchange(rhs.ptr_, nullptr)),
deleter_(std::move(rhs.deleter_)) {}
InefficientStdFunctionContext& operator=(
const InefficientStdFunctionContext&) = delete;
// NOLINTNEXTLINE(performance-noexcept-move-constructor)
InefficientStdFunctionContext& operator=(
InefficientStdFunctionContext&& rhs) {
this->~InefficientStdFunctionContext();
ptr_ = std::exchange(rhs.ptr_, nullptr);
deleter_ = std::move(rhs.deleter_);
return *this;
}
~InefficientStdFunctionContext() {
if (deleter_) {
deleter_(ptr_);
@ -270,9 +285,6 @@ struct AllocatorRegisterer {
// An interface for reporting thread local memory usage
// per device
struct C10_API MemoryReportingInfoBase : public c10::DebugInfoBase {
MemoryReportingInfoBase();
~MemoryReportingInfoBase() override = default;
/**
* alloc_size corresponds to the size of the ptr.
*
@ -312,6 +324,7 @@ C10_API void reportOutOfMemoryToProfiler(
Device device);
// used to hold traceback information in allocators
// NOLINTNEXTLINE(cppcoreguidelines-special-member-functions)
struct GatheredContext {
virtual ~GatheredContext() = default;
};

View File

@ -75,9 +75,6 @@ ProfiledCPUMemoryReporter& profiledCPUMemoryReporter() {
template <uint32_t PreGuardBytes, uint32_t PostGuardBytes>
class DefaultMobileCPUAllocator final : public at::Allocator {
public:
DefaultMobileCPUAllocator() = default;
~DefaultMobileCPUAllocator() override = default;
static void deleter(void* const pointer) {
if (C10_UNLIKELY(!pointer)) {
return;

View File

@ -34,6 +34,8 @@ class DeviceGuard {
const impl::DeviceGuardImplInterface* impl)
: guard_(device, impl) {}
~DeviceGuard() = default;
/// Copy is disallowed
DeviceGuard(const DeviceGuard&) = delete;
DeviceGuard& operator=(const DeviceGuard&) = delete;
@ -143,6 +145,7 @@ class OptionalDeviceGuard {
const impl::DeviceGuardImplInterface* impl)
: guard_(device, impl) {}
~OptionalDeviceGuard() = default;
/// Copy is disallowed
OptionalDeviceGuard(const OptionalDeviceGuard&) = delete;
OptionalDeviceGuard& operator=(const OptionalDeviceGuard&) = delete;

View File

@ -61,6 +61,7 @@ struct C10_API GeneratorImpl : public c10::intrusive_ptr_target {
GeneratorImpl(const GeneratorImpl& other) = delete;
GeneratorImpl(GeneratorImpl&& other) = delete;
GeneratorImpl& operator=(const GeneratorImpl& other) = delete;
GeneratorImpl& operator=(GeneratorImpl&& other) = delete;
~GeneratorImpl() override = default;
c10::intrusive_ptr<GeneratorImpl> clone() const;

View File

@ -16,6 +16,10 @@ struct C10_API AutoGradMode {
AutoGradMode(bool enabled) : prev_mode(GradMode::is_enabled()) {
GradMode::set_enabled(enabled);
}
AutoGradMode(const AutoGradMode&) = delete;
AutoGradMode(AutoGradMode&&) = delete;
AutoGradMode& operator=(const AutoGradMode&) = delete;
AutoGradMode& operator=(AutoGradMode&&) = delete;
~AutoGradMode() {
GradMode::set_enabled(prev_mode);
}
@ -35,6 +39,10 @@ struct C10_API AutoFwGradMode {
: prev_mode(AutogradState::get_tls_state().get_fw_grad_mode()) {
AutogradState::get_tls_state().set_fw_grad_mode(enabled);
}
AutoFwGradMode(const AutoFwGradMode&) = delete;
AutoFwGradMode(AutoFwGradMode&&) = delete;
AutoFwGradMode& operator=(const AutoFwGradMode&) = delete;
AutoFwGradMode& operator=(AutoFwGradMode&&) = delete;
~AutoFwGradMode() {
AutogradState::get_tls_state().set_fw_grad_mode(prev_mode);
}
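These guards are strictly scope-tied, so deleting copy and move keeps the save/restore pairing exact: one constructor, one destructor, one TLS round trip. Typical usage (sketch; `x` is a placeholder tensor):

{
  c10::AutoGradMode no_grad(/*enabled=*/false);
  auto y = x * 2;  // no autograd graph is recorded in this scope
}  // destructor restores the previous grad mode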

View File

@ -73,6 +73,11 @@ struct C10_API InferenceMode {
c10::impl::_force_tls_local_dispatch_key_set(cur_keyset);
}
InferenceMode(const InferenceMode&) = delete;
InferenceMode(InferenceMode&&) = delete;
InferenceMode& operator=(const InferenceMode&) = delete;
InferenceMode& operator=(InferenceMode&&) = delete;
~InferenceMode() {
AutogradState::set_tls_state(prev_mode);
c10::impl::_force_tls_local_dispatch_key_set(prev_keyset);
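Same rationale as the grad-mode guards above: InferenceMode pairs one TLS save with one restore, which copy or move could break. Usage sketch (`a` and `b` are placeholder tensors):

{
  c10::InferenceMode guard;   // enter inference mode for this scope
  auto y = at::matmul(a, b);  // y is created as an inference tensor
}                             // previous autograd/TLS state restored here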

View File

@ -81,9 +81,11 @@ template <typename T>
struct SafePyObjectT : private SafePyObject {
SafePyObjectT(PyObject* data, c10::impl::PyInterpreter* pyinterpreter)
: SafePyObject(data, pyinterpreter) {}
~SafePyObjectT() = default;
SafePyObjectT(SafePyObjectT&& other) noexcept : SafePyObject(other) {}
SafePyObjectT(SafePyObjectT const&) = delete;
SafePyObjectT& operator=(SafePyObjectT const&) = delete;
SafePyObjectT& operator=(SafePyObjectT&&) = delete;
using SafePyObject::ptr;
using SafePyObject::pyinterpreter;

View File

@ -23,7 +23,7 @@ C10_API void warnDeprecatedDataPtr();
// Currently used only for storing a custom error message
// used when throwing an exception when data_ptr is accessed.
struct C10_API StorageExtraMeta {
c10::optional<std::string> custom_data_ptr_error_msg_ = c10::nullopt;
std::optional<std::string> custom_data_ptr_error_msg_ = c10::nullopt;
StorageExtraMeta() = default;
StorageExtraMeta(const StorageExtraMeta& other) {
if (other.custom_data_ptr_error_msg_) {
@ -283,7 +283,7 @@ struct C10_API StorageImpl : public c10::intrusive_ptr_target {
[[noreturn]] void throw_data_ptr_access_error() const;
void release_data_and_set_meta_custom_data_ptr_error_msg_(
c10::optional<std::string> s) {
std::optional<std::string> s) {
throw_on_immutable_data_ptr_ = true;
get_extra_meta().custom_data_ptr_error_msg_ = std::move(s);
refresh_has_data_ptr_check();

View File

@ -27,6 +27,7 @@ namespace c10 {
struct StreamGuard {
/// No default constructor, see Note [Omitted default constructor from RAII]
explicit StreamGuard() = delete;
~StreamGuard() = default;
/// Set the current device to the device associated with the passed stream,
/// and set the current stream on that device to the passed stream.
@ -111,6 +112,7 @@ struct OptionalStreamGuard {
// See Note [Move assignment for RAII guards is tricky]
OptionalStreamGuard& operator=(OptionalStreamGuard&& other) = delete;
~OptionalStreamGuard() = default;
/// Resets the currently set stream to the original stream and
/// the currently set device to the original device. Then,
@ -162,6 +164,7 @@ struct MultiStreamGuard {
// See Note [Move assignment for RAII guards is tricky]
MultiStreamGuard& operator=(MultiStreamGuard&& other) = delete;
~MultiStreamGuard() = default;
private:
c10::impl::InlineMultiStreamGuard<impl::VirtualGuardImpl> guard_;

View File

@ -22,7 +22,9 @@ class C10_API SymbolicShapeMeta {
bool strides_valid_ = true; // e.g. for sparse where there are no strides
SymbolicShapeMeta() = default;
~SymbolicShapeMeta() = default;
SymbolicShapeMeta(const SymbolicShapeMeta& other);
SymbolicShapeMeta(SymbolicShapeMeta&& other) = delete;
SymbolicShapeMeta& operator=(const SymbolicShapeMeta& other) = delete;
SymbolicShapeMeta& operator=(SymbolicShapeMeta&& other) = delete;

View File

@ -133,6 +133,7 @@ struct C10_API PlacementDeleteContext {
DataPtr data_ptr_;
PlacementDtor placement_dtor_;
size_t size_;
PlacementDeleteContext(
DataPtr&& data_ptr,
PlacementDtor placement_dtor,
@ -140,6 +141,11 @@ struct C10_API PlacementDeleteContext {
: data_ptr_(std::move(data_ptr)),
placement_dtor_(placement_dtor),
size_(size) {}
PlacementDeleteContext(PlacementDeleteContext&&) noexcept = delete;
PlacementDeleteContext(const PlacementDeleteContext&) = delete;
PlacementDeleteContext& operator=(const PlacementDeleteContext&) = delete;
PlacementDeleteContext& operator=(PlacementDeleteContext&&) = delete;
static DataPtr makeDataPtr(
DataPtr&& data_ptr,
PlacementDtor placement_dtor,
@ -237,6 +243,7 @@ struct C10_API ExtraMeta {
std::optional<std::string> custom_storage_error_msg_ = std::nullopt;
ExtraMeta() = default;
~ExtraMeta() = default;
ExtraMeta(const ExtraMeta& other) {
if (other.symbolic_shape_meta_) {
symbolic_shape_meta_ =

View File

@ -62,7 +62,7 @@ class InlineDeviceGuard {
// DeviceGuard which reads the current device and promises to
// restore to that device on exit. However, most cases where you
// would have written this, you probably meant to actually just
// use OptionalDeviceGuard (since you don't actually need the
// use DeviceGuard (since you don't actually need the
// restore to happen if you don't ever actually set the device).
// We remove the constructor here to encourage you to think about
// what you actually want to happen.
@ -221,6 +221,7 @@ class InlineOptionalDeviceGuard {
explicit InlineOptionalDeviceGuard()
: guard_() // See Note [Explicit initialization of optional fields]
{}
~InlineOptionalDeviceGuard() = default;
/// Set the current device to the passed Device, if it is not nullopt.
explicit InlineOptionalDeviceGuard(std::optional<Device> device_opt)
@ -286,6 +287,7 @@ class InlineOptionalDeviceGuard {
// It's in principle possible to raise an error when this occurs
// by doing some extra thread-local bookkeeping. But why bother?
// Just don't provide the constructor.
InlineOptionalDeviceGuard(const InlineOptionalDeviceGuard<T>& other) = delete;
InlineOptionalDeviceGuard(InlineOptionalDeviceGuard<T>&& other) = delete;
// Note [Move assignment for RAII guards is tricky]
@ -335,6 +337,8 @@ class InlineOptionalDeviceGuard {
//
// We could solve this with an extra thread-local variable. But no one is
// actually using move-assignment. So just get rid of it.
InlineOptionalDeviceGuard& operator=(const InlineOptionalDeviceGuard& other) =
delete;
InlineOptionalDeviceGuard& operator=(InlineOptionalDeviceGuard&& other) =
delete;
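
The reasoning above generalizes: once a guard has saved state to restore, copy and move raise the question of which owner restores what, and when. A minimal sketch of the resulting pattern, with invented names (ScopedDeviceGuard, g_current_device), shows how deleting the special members turns misuse into a compile error:

#include <iostream>

static thread_local int g_current_device = 0;  // invented stand-in for device TLS

struct ScopedDeviceGuard {
  explicit ScopedDeviceGuard(int device) : original_(g_current_device) {
    g_current_device = device;
  }
  ~ScopedDeviceGuard() { g_current_device = original_; }

  // One owner, one restore: any attempt to duplicate or transfer the
  // guard's responsibility is rejected at compile time.
  ScopedDeviceGuard(const ScopedDeviceGuard&) = delete;
  ScopedDeviceGuard(ScopedDeviceGuard&&) = delete;
  ScopedDeviceGuard& operator=(const ScopedDeviceGuard&) = delete;
  ScopedDeviceGuard& operator=(ScopedDeviceGuard&&) = delete;

 private:
  int original_;
};

int main() {
  ScopedDeviceGuard g(2);
  // ScopedDeviceGuard g2 = std::move(g);  // would not compile: move is deleted
  std::cout << g_current_device << '\n';   // prints 2; 0 is restored at exit
  return 0;
}
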

View File

@ -135,6 +135,7 @@ class InlineOptionalStreamGuard {
explicit InlineOptionalStreamGuard()
: guard_() // See Note [Explicit initialization of optional fields]
{}
~InlineOptionalStreamGuard() = default;
/// Set the current device to the device associated with the passed stream,
/// and set the current stream on that device to the passed stream,
@ -151,6 +152,9 @@ class InlineOptionalStreamGuard {
explicit InlineOptionalStreamGuard(Args&&... args)
: guard_(std::in_place, std::forward<Args>(args)...) {}
InlineOptionalStreamGuard(const InlineOptionalStreamGuard<T>& other) = delete;
InlineOptionalStreamGuard& operator=(const InlineOptionalStreamGuard& other) =
delete;
// See Note [Move construction for RAII guards is tricky]
InlineOptionalStreamGuard(InlineOptionalStreamGuard<T>&& other) = delete;

View File

@ -132,6 +132,11 @@ struct C10_API ForceDispatchKeyGuard {
updated_set.excluded_ = exclude;
c10::impl::_force_tls_local_dispatch_key_set(updated_set);
}
ForceDispatchKeyGuard(ForceDispatchKeyGuard&&) noexcept = delete;
ForceDispatchKeyGuard(const ForceDispatchKeyGuard&) = delete;
ForceDispatchKeyGuard& operator=(const ForceDispatchKeyGuard&) = delete;
ForceDispatchKeyGuard& operator=(ForceDispatchKeyGuard&&) = delete;
~ForceDispatchKeyGuard() {
c10::impl::_force_tls_local_dispatch_key_set(saved_keyset_);
}

View File

@ -15,6 +15,7 @@ struct C10_API DisablePythonDispatcher {
DisablePythonDispatcher() : old_(PythonDispatcherTLS::get_state()) {
PythonDispatcherTLS::set_state({});
}
~DisablePythonDispatcher() {
PythonDispatcherTLS::set_state(old_);
}

View File

@ -2016,6 +2016,13 @@ class DeviceCachingAllocator {
}
}
void ensureExistsAndIncrefPool(MempoolId_t mempool_id) {
// Create a PrivatePool object if it does not exist yet
// and increment its use_count
std::lock_guard<std::recursive_mutex> lock(mutex);
ensure_exists_and_incref_pool(mempool_id);
}
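
This method follows a pairing used throughout the allocator: a public entry point that takes the recursive mutex and delegates to a private helper that assumes the lock is already held, so other already-locked paths (such as beginAllocateToPool below) can reuse the helper without re-locking. A hedged, self-contained sketch of the idiom with invented types (PoolRegistry, Pool):

#include <memory>
#include <mutex>
#include <unordered_map>

class PoolRegistry {
 public:
  void ensureExistsAndIncref(int id) {
    std::lock_guard<std::recursive_mutex> lock(mutex_);
    ensure_exists_and_incref(id);  // do the work under the lock
  }
  void beginCapture(int id) {
    std::lock_guard<std::recursive_mutex> lock(mutex_);
    ensure_exists_and_incref(id);  // reuse the helper; lock already held
    // ... capture bookkeeping would go here ...
  }

 private:
  struct Pool { int use_count = 1; };

  // Precondition: mutex_ is held by the caller.
  void ensure_exists_and_incref(int id) {
    auto it = pools_.find(id);
    if (it == pools_.end()) {
      pools_.emplace(id, std::make_unique<Pool>());  // use_count starts at 1
    } else {
      it->second->use_count++;
    }
  }

  std::recursive_mutex mutex_;
  std::unordered_map<int, std::unique_ptr<Pool>> pools_;
};

int main() {
  PoolRegistry r;
  r.ensureExistsAndIncref(7);
  r.beginCapture(7);  // shares the pool: use_count is now 2
  return 0;
}
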
// See Note [Interaction with CUDA graph capture]
// Called by CUDAGraph::capture_begin
@ -2023,18 +2030,7 @@ class DeviceCachingAllocator {
MempoolId_t mempool_id,
std::function<bool(cudaStream_t)> filter) {
std::lock_guard<std::recursive_mutex> lock(mutex);
auto it = graph_pools.find(mempool_id);
if (it == graph_pools.end()) {
// mempool_id does not reference an existing pool. Make a new pool for
// this capture.
graph_pools.emplace(mempool_id, std::make_unique<PrivatePool>());
} else {
// mempool_id references an existing pool, which the current capture will
// share. Check this pool is live (at least one other capture already
// references it).
TORCH_INTERNAL_ASSERT(it->second->use_count > 0);
it->second->use_count++;
}
ensure_exists_and_incref_pool(mempool_id);
for (auto it2 = captures_underway.begin(); it2 != captures_underway.end();
++it2) {
TORCH_CHECK(
@ -2058,7 +2054,7 @@ class DeviceCachingAllocator {
false, "endAllocatePool: not currently recording to mempool_id");
}
// Called by CUDAGraph::reset
// Called by CUDAGraph::reset and MemPool::~MemPool()
void releasePool(MempoolId_t mempool_id) {
std::lock_guard<std::recursive_mutex> lock(mutex);
// The instantiated cudaGraphExec_t has been destroyed. We can't blindly
@ -2070,20 +2066,24 @@ class DeviceCachingAllocator {
// mempool. When the count reaches 0, we tell free_cached_blocks it may now
// cudaFree blocks from this graph's pool when it discovers they're unused
// (unsplit).
auto it = graph_pools.find(mempool_id);
TORCH_INTERNAL_ASSERT(it != graph_pools.end());
auto uc = --(it->second->use_count);
auto pp = get_private_pool(mempool_id);
auto uc = --(pp->use_count);
TORCH_INTERNAL_ASSERT(uc >= 0);
if (uc == 0) {
// Allows free_cached_blocks to begin cudaFreeing this pool's memory,
// and makes sure this pool wasn't somehow made freeable already.
// NOLINTNEXTLINE(clang-analyzer-deadcode.DeadStores)
bool inserted =
graph_pools_freeable.insert({mempool_id, it->second.get()}).second;
bool inserted = graph_pools_freeable.insert({mempool_id, pp}).second;
TORCH_INTERNAL_ASSERT(inserted);
}
}
int getPoolUseCount(MempoolId_t mempool_id) {
std::lock_guard<std::recursive_mutex> lock(mutex);
auto pp = get_private_pool(mempool_id);
return pp->use_count;
}
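
Taken together, ensureExistsAndIncrefPool, releasePool, and getPoolUseCount implement a reference count on each private pool: increfs keep the pool alive, and the final release moves it into the freeable set so the cache sweep may reclaim its blocks. A minimal sketch of that lifecycle, assuming invented names (incref, release, use_count, freeable):

#include <cassert>
#include <memory>
#include <set>
#include <unordered_map>

struct Pool { int use_count = 1; };

std::unordered_map<int, std::unique_ptr<Pool>> pools;
std::set<int> freeable;  // ids whose memory the sweep may now reclaim

void incref(int id) {
  auto it = pools.find(id);
  if (it == pools.end()) pools.emplace(id, std::make_unique<Pool>());
  else it->second->use_count++;
}

void release(int id) {
  auto& pool = *pools.at(id);
  int uc = --pool.use_count;
  assert(uc >= 0);
  if (uc == 0) {
    bool inserted = freeable.insert(id).second;
    assert(inserted);  // a pool must not be made freeable twice
  }
}

int use_count(int id) { return pools.at(id)->use_count; }

int main() {
  incref(1);                      // creator holds the first reference
  incref(1);                      // a capture shares the pool
  assert(use_count(1) == 2);
  release(1);                     // capture ends; pool still referenced
  assert(freeable.empty());
  release(1);                     // last reference gone
  assert(freeable.count(1) == 1); // sweep may now reclaim its blocks
  return 0;
}
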
void addPeerAccess(c10::DeviceIndex dev_to_access) {
std::lock_guard<std::recursive_mutex> lock(mutex);
if (std::find(
@ -2152,6 +2152,30 @@ class DeviceCachingAllocator {
return blocks;
}
void ensure_exists_and_incref_pool(MempoolId_t mempool_id) {
auto it = graph_pools.find(mempool_id);
if (it == graph_pools.end()) {
// mempool_id does not reference an existing pool.
// Make a new pool for CUDAGraph capture or torch.cuda.use_mem_pool
// usage. use_count starts at 1: the caller of
// ensureExistsAndIncrefPool now holds the first reference.
graph_pools.emplace(mempool_id, std::make_unique<PrivatePool>());
} else {
// mempool_id references an existing pool, which the current CUDAGraph
// capture or torch.cuda.use_mem_pool call will share. Check that the
// pool is live (at least one other user already references it), then
// increment its use_count to record the new reference.
TORCH_INTERNAL_ASSERT(it->second->use_count > 0);
it->second->use_count++;
}
}
PrivatePool* get_private_pool(MempoolId_t mempool_id) {
auto it = graph_pools.find(mempool_id);
TORCH_INTERNAL_ASSERT(it != graph_pools.end());
return it->second.get();
}
// returns the smallest possible address in any segment
// where there is enough free address space to fit size
// may be composed of free and unmapped segments
@ -3536,6 +3560,14 @@ class NativeCachingAllocator : public CUDAAllocator {
assertValidDevice(device);
device_allocator[device]->resetPeakStats();
}
void ensureExistsAndIncrefPool(
c10::DeviceIndex device,
MempoolId_t mempool_id) override {
assertValidDevice(device);
device_allocator[device]->ensureExistsAndIncrefPool(std::move(mempool_id));
}
// CUDAGraph interactions
void beginAllocateToPool(
c10::DeviceIndex device,
@ -3557,6 +3589,12 @@ class NativeCachingAllocator : public CUDAAllocator {
device_allocator[device]->releasePool(std::move(mempool_id));
}
int getPoolUseCount(c10::DeviceIndex device, MempoolId_t mempool_id)
override {
assertValidDevice(device);
return device_allocator[device]->getPoolUseCount(std::move(mempool_id));
}
void* raw_alloc(size_t nbytes) override {
if (nbytes == 0) {
return nullptr;
@ -3844,6 +3882,13 @@ MemPool::MemPool(
} else {
id_ = {uuid_++, 0};
}
device_ = c10::cuda::current_device();
CUDACachingAllocator::ensureExistsAndIncrefPool(device_, id_);
}
MemPool::~MemPool() {
TORCH_INTERNAL_ASSERT(use_count() == 1);
CUDACachingAllocator::releasePool(device_, id_);
}
MempoolId_t MemPool::id() {
@ -3854,6 +3899,17 @@ CUDACachingAllocator::CUDAAllocator* MemPool::allocator() {
return allocator_;
}
int MemPool::use_count() {
return CUDACachingAllocator::getPoolUseCount(device_, id_);
}
MempoolId_t MemPool::graph_pool_handle(bool is_user_created) {
if (is_user_created) {
return {0, uid_++};
}
return {uuid_++, 0};
}
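
The constructor and destructor above give MemPool RAII semantics over that reference count: construction registers the first reference, and destruction asserts that the destructor's own reference is the last one before releasing it. A self-contained miniature of the contract, with invented names (ToyMemPool, g_use_count):

#include <cassert>

static int g_use_count = 0;  // stand-in for the allocator-side use_count

struct ToyMemPool {
  ToyMemPool() { ++g_use_count; }  // mirrors ensureExistsAndIncrefPool
  ~ToyMemPool() {
    assert(g_use_count == 1);      // mirrors TORCH_INTERNAL_ASSERT(use_count() == 1)
    --g_use_count;                 // mirrors releasePool
  }
  int use_count() const { return g_use_count; }
};

int main() {
  {
    ToyMemPool pool;
    assert(pool.use_count() == 1);
  }  // destructor ran: the pool was released
  assert(g_use_count == 0);
  return 0;
}
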
// Note that active_mempool_ is a global variable here
// and not inside the MemPoolContext class, because on Windows we
// can't use __declspec(dllexport) and __declspec(thread)

View File

@ -224,6 +224,22 @@ class CUDAAllocator : public Allocator {
c10::DeviceIndex device,
MempoolId_t mempool_id) = 0;
virtual void releasePool(c10::DeviceIndex device, MempoolId_t mempool_id) = 0;
virtual int getPoolUseCount(c10::DeviceIndex device, MempoolId_t mempool_id) {
TORCH_CHECK(
false,
name(),
" does not yet support getPoolUseCount. "
"If you need it, please file an issue describing your use case.");
}
virtual void ensureExistsAndIncrefPool(
c10::DeviceIndex device,
MempoolId_t mempool_id) {
TORCH_CHECK(
false,
name(),
" does not yet support ensureExistsAndIncrefPool. "
"If you need it, please file an issue describing your use case.");
}
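
Both new virtuals use the same backward-compatible extension point: a default body that fails loudly, so allocator backends that never implement the method still compile while callers get an actionable error rather than silent misbehavior. A hedged sketch of the idiom with invented classes (Allocator, LegacyAllocator, PoolAwareAllocator):

#include <iostream>
#include <stdexcept>
#include <string>

struct Allocator {
  virtual ~Allocator() = default;
  virtual std::string name() const = 0;
  // Default body errors instead of being pure virtual, so existing
  // backends need no changes when the interface grows.
  virtual int getPoolUseCount(int /*mempool_id*/) {
    throw std::runtime_error(
        name() + " does not yet support getPoolUseCount.");
  }
};

struct LegacyAllocator : Allocator {
  std::string name() const override { return "legacy"; }
  // getPoolUseCount deliberately not overridden
};

struct PoolAwareAllocator : Allocator {
  std::string name() const override { return "pool-aware"; }
  int getPoolUseCount(int) override { return 1; }  // opts in to the new API
};

int main() {
  PoolAwareAllocator ok;
  std::cout << ok.getPoolUseCount(7) << '\n';  // prints 1
  LegacyAllocator legacy;
  try {
    legacy.getPoolUseCount(7);
  } catch (const std::exception& e) {
    std::cout << e.what() << '\n';  // clear "not yet supported" message
  }
  return 0;
}
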
// returns true if the allocated blocks are equal to expected live allocations
virtual bool checkPoolLiveAllocations(
c10::DeviceIndex device,
@ -427,6 +443,16 @@ inline void attachAllocatorTraceTracker(AllocatorTraceTracker tracker) {
inline void releasePool(c10::DeviceIndex device, MempoolId_t mempool_id) {
return get()->releasePool(device, mempool_id);
}
inline void ensureExistsAndIncrefPool(
c10::DeviceIndex device,
MempoolId_t mempool_id) {
get()->ensureExistsAndIncrefPool(device, mempool_id);
}
inline int getPoolUseCount(c10::DeviceIndex device, MempoolId_t mempool_id) {
return get()->getPoolUseCount(device, mempool_id);
}
// Not part of CUDA_ALLOCATOR_BACKEND_INTERFACE
inline std::shared_ptr<void> getIpcDevPtr(std::string handle) {
return get()->getIpcDevPtr(std::move(handle));
@ -472,9 +498,12 @@ struct C10_CUDA_API MemPool {
MemPool(
CUDACachingAllocator::CUDAAllocator* allocator = nullptr,
bool is_user_created = true);
~MemPool();
MempoolId_t id();
CUDACachingAllocator::CUDAAllocator* allocator();
int use_count();
static MempoolId_t graph_pool_handle(bool is_user_created = true);
private:
static std::atomic<CaptureId_t> uid_;
@ -482,6 +511,7 @@ struct C10_CUDA_API MemPool {
CUDACachingAllocator::CUDAAllocator* allocator_;
bool is_user_created_;
MempoolId_t id_;
c10::DeviceIndex device_;
};
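
For reference, a usage sketch of the API this struct now exposes. It is illustrative only and assumes an in-tree, CUDA-enabled PyTorch build; the include path and the c10::cuda namespace are assumptions from context, not shown in this diff:

#include <c10/cuda/CUDACachingAllocator.h>  // assumed include path

void mempool_lifecycle_sketch() {
  c10::cuda::MemPool pool;  // ctor now calls ensureExistsAndIncrefPool
  TORCH_INTERNAL_ASSERT(pool.use_count() == 1);
  auto id = pool.id();      // stable handle into the caching allocator
  (void)id;
}  // ~MemPool asserts sole ownership, then releasePool drops the count to 0
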
// MemPoolContext holds the currently active pool and stashes the previous

Some files were not shown because too many files have changed in this diff.