Compare commits

324 Commits

Author SHA1 Message Date
73c49ee963 Speed up fx graph iteration by implementing it in C++
ghstack-source-id: af7493f6f73baf00e30a6d5790a601729bd9c900
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128288
2024-06-08 17:12:47 -07:00
0e6c204642 [pipelining] Friendly error message when not traceable (#128276)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128276
Approved by: https://github.com/H-Huang
2024-06-08 06:36:11 +00:00
44371bd432 Revert "[dynamo][nn-modules] Trace through nn.Module dunder methods for UnspecializedNNModule (#126578)"
This reverts commit 7ede78f9f5d7e6c993faa1a70a5f0b0eaec5640d.

Reverted https://github.com/pytorch/pytorch/pull/126578 on behalf of https://github.com/anijain2305 due to pippy tests fail ([comment](https://github.com/pytorch/pytorch/pull/126578#issuecomment-2155836555))
2024-06-08 06:35:34 +00:00
6e13c7e874 Revert "[dynamo] Support if cond on UnspecializedNNModuleVariable and add inline tests (#128158)"
This reverts commit 747fc35ff54154ddec2a5ab5661f57c28d65c591.

Reverted https://github.com/pytorch/pytorch/pull/128158 on behalf of https://github.com/anijain2305 due to pippy tests fail ([comment](https://github.com/pytorch/pytorch/pull/128158#issuecomment-2155835787))
2024-06-08 06:32:28 +00:00
94165dba7b Revert "[dynamo] Inline the getattr of fx graph and proxy graph (#128172)"
This reverts commit 662a78f957fb89e53ebeba7deb880561e10ecaf6.

Reverted https://github.com/pytorch/pytorch/pull/128172 on behalf of https://github.com/anijain2305 due to pippy tests fail ([comment](https://github.com/pytorch/pytorch/pull/128172#issuecomment-2155835201))
2024-06-08 06:29:36 +00:00
8a0bc8c9ee [fsdp2] simplify fsdp_param logic with DTensorSpec (#128242)
as titled, we can use a single DTensorSpec to save the SPMD sharding
spec, plus the global shape/stride to simplify the FSDPParam logic

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128242
Approved by: https://github.com/awgu
2024-06-08 05:56:41 +00:00
cbb7e3053f View specialization (#127641)
This PR adds specialization shortcuts for converting n-d to 1-d and 1-d to 2-d views.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127641
Approved by: https://github.com/ezyang
2024-06-08 05:52:52 +00:00
310f80995b Added memory budget to partitioner (#126320)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126320
Approved by: https://github.com/shunting314
2024-06-08 05:52:40 +00:00
ffc202a1b9 Added remove_noop_ops to joint_graph_passes (#124451)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124451
Approved by: https://github.com/ezyang, https://github.com/fmassa
2024-06-08 05:48:11 +00:00
c446851829 [fsdp2] update foreach_reduce accumulate_grad (#128117)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128117
Approved by: https://github.com/awgu
2024-06-08 05:13:57 +00:00
613c7d270d [pipelining] Format doc (#128279)
- Should use two dots around `var`
- Wrap lines
- Add section cross ref
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128279
Approved by: https://github.com/H-Huang
ghstack dependencies: #128273, #128278
2024-06-08 04:59:04 +00:00
2e42671619 [pipelining] Rename to stage.py and schedules.py (#128278)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128278
Approved by: https://github.com/H-Huang
ghstack dependencies: #128273
2024-06-08 04:42:35 +00:00
0e3fe694d1 [pipelining] Restore a stage constructor for tracer path (#128273)
In case the user modified the stage module out of place, such as
mod = DDP(mod)
mod = torch.compile(mod)

They need a stage builder other than `pipe.build_stage()`.

This PR provides an API to do so:
```
def build_stage(
  stage_module,
  stage_index,
  pipe.info(),
  ...
)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128273
Approved by: https://github.com/wconstab
2024-06-08 04:42:35 +00:00
8a45cf4c64 [AOTI] align data_size of the constants (#127610)
https://github.com/pytorch/pytorch/pull/124272 set the alignment of `consts_o`, but if the `data_size` of a tensor in `consts_o` is not divisible by the alignment, the tensors that follow are no longer aligned, resulting in poor performance on CPU.
We align the `data_size` as well in this PR and pad the serialized bytes. Since the `size` of the tensor, not the `data_size`, is used when creating a tensor from the serialized bytes ([link](f4d7cdc5e6/torch/csrc/inductor/aoti_runtime/model.h (L236-L259))), there is no correctness issue. `data_size` is only used to record the [bytes_read](f4d7cdc5e6/torch/csrc/inductor/aoti_runtime/model.h (L217)).

This PR will improve the performance on CPU for 4 models in HF, 7 models in TIMM and 1 model in Torchbench.

For the unit test, I added a bias value whose original `data_size` is not divisible by the alignment to test correctness:
```
constants_info_[0].dtype = static_cast<int32_t>(at::kFloat);
constants_info_[0].data_size = 64; // was 40 before this PR
constants_info_[0].shape = {10};

constants_info_[1].dtype = static_cast<int32_t>(at::kFloat);
......
```
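
The padding rule described above can be sketched in Python. This is only an illustration under stated assumptions, not the AOTI code: `align_up` and `layout` are hypothetical helpers, and the 64-byte alignment is inferred from the change from `40` to `64` in the test snippet.

```python
from typing import List, Tuple


def align_up(nbytes: int, alignment: int = 64) -> int:
    """Round nbytes up to the next multiple of alignment (hypothetical helper)."""
    if alignment <= 0:
        raise ValueError("alignment must be positive")
    return (nbytes + alignment - 1) // alignment * alignment


def layout(sizes: List[int], alignment: int = 64) -> List[Tuple[int, int]]:
    """Return (offset, padded_size) for each constant when every size is padded."""
    offset = 0
    result = []
    for size in sizes:
        padded = align_up(size, alignment)
        result.append((offset, padded))
        offset += padded  # the next constant starts on an aligned boundary
    return result


# Ten float32 values occupy 40 bytes; padding to 64 keeps the next tensor aligned.
print(align_up(40))
print(layout([40, 40]))
```

Because every padded size is a multiple of the alignment, every offset is as well, which is the property the PR description relies on.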

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127610
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-06-08 04:31:00 +00:00
1d84c7e100 [DeviceMesh] Update get_group and add get_all_groups (#128097)
Fixes #121984

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128097
Approved by: https://github.com/wconstab, https://github.com/wanchaol
2024-06-08 04:28:56 +00:00
6e5c2a1a3b [inductor] Add missing files to torch_key (#128230)
Previously all subdirs (like torch.inductor.codegen) were not hashed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128230
Approved by: https://github.com/oulgen
2024-06-08 03:26:48 +00:00
6220602943 [torchbind] support query schema of methods (#128267)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128267
Approved by: https://github.com/angelayi
2024-06-08 03:20:44 +00:00
0ef5229569 Revert "Change lerp decomp to use aten.as_strided_copy instead of prims.copy_strided (#128030)"
This reverts commit fdf1666b20f63e4acf01798f009e478d997a7f7f.

Reverted https://github.com/pytorch/pytorch/pull/128030 on behalf of https://github.com/nWEIdia due to breaking cuda12.1 test_cuda, see HUD https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=inductor ([comment](https://github.com/pytorch/pytorch/pull/128030#issuecomment-2155764546))
2024-06-08 02:34:06 +00:00
f9508b4c1f [pipelining] Update Pipelining Docs (#128236)
----

- Bring PipelineStage/Schedule more front-and-center
- provide details on how to manually construct PipelineStage
- move tracer example and manual example below so the high-level flow
  (e2e) is closer to the top
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128236
Approved by: https://github.com/H-Huang
ghstack dependencies: #128201, #128228
2024-06-08 02:03:46 +00:00
fe74bbd6f0 init sigmoid comments (#127983)
Fixes #127913

### Description
Add docstring to `torch/onnx/symbolic_opset9.py`:`sigmoid` function

### Checklist

- [x] The issue that is being fixed is referred in the description
- [x] Only one issue is addressed in this pull request
- [x] Labels from the issue that this PR is fixing are added to this pull request
- [x] No unnecessary issues are included into this pull request

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127983
Approved by: https://github.com/xadupre
2024-06-08 01:48:00 +00:00
921aa194c7 [pipelining] Move modify_graph_op_device to _IR.py (#128241)
This part is more IR related.
Thus moving from `PipelineStage` constructor to `pipe.build_stage(..., device, ...)`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128241
Approved by: https://github.com/wconstab
ghstack dependencies: #128240
2024-06-08 01:35:07 +00:00
ad96f991a5 [pipelining] Add pipe.build_stage() (#128240)
The `PipelineStage` name has been given to the manual side.
Thus adding a method under `Pipe` to create PipelineStage.
Moved `PipeInfo` to utils.py to avoid circular dependency between `_IR` and `PipelineStage`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128240
Approved by: https://github.com/wconstab, https://github.com/H-Huang
2024-06-08 01:26:02 +00:00
5ef081031e [MPS] Include MPSGraphVenturaOps.h for complex types on macOS 12 (#127859)
Fixes this on macOS 12:

```
/Users/qqaatw/Forks/pytorch/aten/src/ATen/native/mps/operations/FastFourierTransform.mm:108:60: error: use of undeclared identifier 'MPSDataTypeComplexFloat16'; did you mean 'MPSDataTypeFloat16'?
            (inputTensor.dataType == MPSDataTypeFloat16) ? MPSDataTypeComplexFloat16 : MPSDataTypeComplexFloat32;
                                                           ^~~~~~~~~~~~~~~~~~~~~~~~~
                                                           MPSDataTypeFloat16
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127859
Approved by: https://github.com/kulinseth
2024-06-08 00:54:30 +00:00
647815049e Inductor: Allow small sizes of m for mixed mm autotuning (#127663)
For mixed mm with small sizes of m, such as in the example provided in #127056, being able to set BLOCK_M to 16 leads to better performance. This PR introduces kernel configs that are specific to mixed mm by extending the mm configs with two configs that work well for the example provided in #127056.
I am excluding configs with (BLOCK_M=16, BLOCK_K=16, BLOCK_N=64) because triton crashes when this config is used.

For the example in #127056:
- Without my changes, `skip_triton` evaluates to true, which disables autotuning. On my machine I achieve 146 GB/s.
- If autotuning is enabled, but BLOCK_M>=32, I achieve 614 GB/s.
- With the changes in this PR (i.e. autotuning enabled and BLOCK_M=16), I achieve 772 GB/s.
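
The config selection described above could look roughly like the sketch below. All names (`MMConfig`, `BASE_MM_CONFIGS`, `SMALL_M_CONFIGS`, `mixed_mm_configs`) and the block sizes are placeholders for illustration, not Inductor's real config list; only the excluded combination (BLOCK_M=16, BLOCK_K=16, BLOCK_N=64) comes from the description.

```python
from typing import List, NamedTuple


class MMConfig(NamedTuple):
    block_m: int
    block_n: int
    block_k: int


# Placeholder values for illustration only; not the real Inductor configs.
BASE_MM_CONFIGS = [MMConfig(32, 32, 16), MMConfig(64, 64, 32)]
SMALL_M_CONFIGS = [MMConfig(16, 128, 128), MMConfig(16, 64, 16)]

# The combination reported above as crashing Triton.
CRASHING = MMConfig(block_m=16, block_n=64, block_k=16)


def mixed_mm_configs() -> List[MMConfig]:
    """Extend the base configs with small-m configs, dropping the crashing one."""
    candidates = BASE_MM_CONFIGS + SMALL_M_CONFIGS
    return [c for c in candidates if c != CRASHING]


for cfg in mixed_mm_configs():
    print(cfg)
```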

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127663
Approved by: https://github.com/Chillee
2024-06-08 00:46:16 +00:00
cyy
ef2b5ed500 [4/N] Remove unused functions (#128193)
Follows #128179

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128193
Approved by: https://github.com/ezyang
2024-06-08 00:09:26 +00:00
39dd4740e6 [inductor][dynamo-inline-nn-modules] Fix test with inlining flag (#128200)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128200
Approved by: https://github.com/Skylion007
ghstack dependencies: #128001, #126578, #128158, #128172
2024-06-07 23:51:58 +00:00
bef586111a [pipelining] pipelining.rst updates (#128228)
fix some nits and add `PipelineStage` (manual)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128228
Approved by: https://github.com/wconstab
ghstack dependencies: #128201
2024-06-07 23:29:54 +00:00
09cccbc1c7 [RFC] add per-collective timeout value in flight recorder (#128190)
Summary:
Add timeout value field on every collected record.

Test Plan:
Unit tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128190
Approved by: https://github.com/wconstab
2024-06-07 23:29:35 +00:00
11f2d8e823 Move inductor cuda 124 jobs to a separate workflow that is not triggered by ciflow/inductor (#128250)
https://github.com/pytorch/pytorch/pull/127825

The majority of the g5 runner usage comes from inductor (it's something like 2x everything else).
In the past week, inductor ran 1300-ish times on PRs and 300 times on main. Inductor-periodic ran 50 times on main, so the previous move from inductor -> inductor-periodic only results in 250 fewer runs.

I was under the impression that cu124 is experimental currently and eventually we'll need to switch to it, so this will stay until we switch or inductor uses much fewer runners

Are we expected to be able to handle two versions of cuda in CI?  Because currently we cannot, at least not comfortably

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128250
Approved by: https://github.com/huydhn
2024-06-07 23:01:52 +00:00
5b3624117a update test_issue175 to handle inline_inbuilt_nn_modules (#128026)
With inlining, the output graph has more function calls; reflect that in the test that counts the number of function calls.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128026
Approved by: https://github.com/anijain2305
ghstack dependencies: #127553
2024-06-07 22:07:16 +00:00
ba81c3c290 [inductor] add cpp builder code. (take 2) (#125849)
Fully manual rebase of the code from PR: https://github.com/pytorch/pytorch/pull/124045
The old PR seems to have broken due to too many commits and too many rebases. For reference, see: https://github.com/pytorch/pytorch/pull/124045#issuecomment-2103744588

-------
It is the first step of RFC https://github.com/pytorch/pytorch/issues/124245.
Changes:
1. Add cpp builder code; the new cpp_builder supports Windows OS.
2. Add a CPU ISA checker which is cross-OS and exported from the backend cpuinfo.
3. Switch the compiler ISA checker to the new cpp builder.
4. CppCodeCache uses the new ISA checker.
5. Add a temporary `test_new_cpp_build_logical` UT to help with the transfer to the new code.
<img width="1853" alt="Image" src="https://github.com/pytorch/pytorch/assets/8433590/ce6519ab-ba92-4204-b1d6-7d15d2ba2cbe">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125849
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-06-07 20:49:58 +00:00
3a620a0f65 bug fix of dynamo_timed in cprofile (#128203)
fb-only: "Entire Frame" was missing before this change.

Before: https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/f565966006-TrainingApplication/20240527/rank_0/5_0_1/compilation_metrics_23.html
After: https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/f569854578-TrainingApplication/20240606/rank_0/0_0_0/compilation_metrics_16.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128203
Approved by: https://github.com/Chillee
2024-06-07 20:47:27 +00:00
8892ddaacc [TD] Test removal on sm86 (#127131)
Yolo

I'm excited to break CI :')
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127131
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi
2024-06-07 20:19:18 +00:00
fdf1666b20 Change lerp decomp to use aten.as_strided_copy instead of prims.copy_strided (#128030)
aten.lerp decomposition causes prims::copy_strided to appear in the graph, which is not core aten.

Internal ref: https://fb.workplace.com/groups/pytorch.edge.users/permalink/1525644288305859/
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128030
Approved by: https://github.com/Skylion007, https://github.com/zou3519
2024-06-07 20:12:52 +00:00
e647ea55a3 [pipelining] redirect README to document (#128205)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128205
Approved by: https://github.com/wconstab, https://github.com/H-Huang
2024-06-07 19:34:52 +00:00
dcb63fcedb [pipelining] Remove num_microbatches from stage (#128201)
This is similar to https://github.com/pytorch/pytorch/pull/127979, but instead of removing `num_microbatches` from schedule, we remove it from `PipelineStage`. This also means that during `PipelineSchedule` init we need to setup the buffers for the stage(s).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128201
Approved by: https://github.com/kwen2501
2024-06-07 18:56:44 +00:00
cafbcb6376 [BE]: Update ruff to 0.4.8 (#128214)
Updates ruff to 0.4.8. Some minor fixes, but notably it is 10% faster on microbenchmarks and should further reduce local and CI runtime of the linter. Also includes a few bugfixes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128214
Approved by: https://github.com/ezyang
2024-06-07 18:41:35 +00:00
8ca4cefc7d [C10D] Ensure gil is not released when calling toPyBytes (#128212)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128212
Approved by: https://github.com/Skylion007, https://github.com/XilunWu
2024-06-07 18:24:10 +00:00
0a6df4fca6 delete inductor config.trace.compile_profile (#127143)
https://fb.workplace.com/groups/257735836456307/posts/687858786777341/?comment_id=687861123443774&reply_comment_id=687865486776671

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127143
Approved by: https://github.com/Chillee
2024-06-07 18:05:50 +00:00
82d7a36a27 Added torchao nightly workflow (#128152)
Summary:
Add torchao benchmark workflow, upload the artifacts to GHA.

X-link: https://github.com/pytorch/benchmark/pull/2273

Test Plan:
```
python run_benchmark.py torchao --ci
```

Differential Revision: D58140479

Pulled By: xuzhao9

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128152
Approved by: https://github.com/jerryzh168
2024-06-07 17:52:15 +00:00
0c7f4353e5 [inductor] simplify indexing (#127661)
This is a short term fix for: https://github.com/pytorch/pytorch/issues/124002

We found that the bad perf of the int8_unpack kernel is caused by sub-optimal indexing. In this PR we introduce 2 indexing optimizations:
1. Expand FloorDiv to the entire expression when feasible. E.g. `x1 * 1024 + x2 // 2` will be transformed to `(x1 * 2048 + x2) // 2`. The motivation is that we have more chances to simplify loops for `x1 * 2048 + x2`.
2. Merge ModularIndexing pairs: `ModularIndexing(ModularIndexing(x, 1, a), 1, b)` can be simplified to `ModularIndexing(x, 1, b)` if a is a multiple of b.
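
Both rewrites can be sanity-checked on plain non-negative integers. The sketch below is not Inductor code: `modular_indexing` is a stand-in that assumes `ModularIndexing(x, divisor, modulus)` means `(x // divisor) % modulus`, and the brute-force ranges are arbitrary.

```python
def modular_indexing(x: int, divisor: int, modulus: int) -> int:
    """Plain-integer stand-in for ModularIndexing: (x // divisor) % modulus."""
    return (x // divisor) % modulus


def floordiv_expansion_holds(limit: int = 64) -> bool:
    """Check x1 * 1024 + x2 // 2 == (x1 * 2048 + x2) // 2 on a small grid."""
    return all(
        x1 * 1024 + x2 // 2 == (x1 * 2048 + x2) // 2
        for x1 in range(limit)
        for x2 in range(limit)
    )


def modular_merge_holds(a: int, b: int, limit: int = 256) -> bool:
    """Check ModularIndexing(ModularIndexing(x, 1, a), 1, b) == ModularIndexing(x, 1, b)."""
    return all(
        modular_indexing(modular_indexing(x, 1, a), 1, b) == modular_indexing(x, 1, b)
        for x in range(limit)
    )


print(floordiv_expansion_holds())   # the FloorDiv expansion
print(modular_merge_holds(12, 4))   # a is a multiple of b
print(modular_merge_holds(3, 2))    # a is not a multiple of b
```

The last call shows why the multiple-of condition matters: for `x = 5`, `(5 % 3) % 2` is `0` while `5 % 2` is `1`.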

With both indexing optimizations, we improve int8_unpack perf by 1.54x (183us -> 119us).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127661
Approved by: https://github.com/jansel
2024-06-07 17:51:30 +00:00
662a78f957 [dynamo] Inline the getattr of fx graph and proxy graph (#128172)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128172
Approved by: https://github.com/yanboliang
ghstack dependencies: #128001, #126578, #128158
2024-06-07 17:14:58 +00:00
19b31d899a Fix 'get_real_value' on placeholder nodes (#127698)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127698
Approved by: https://github.com/jansel
ghstack dependencies: #127695, #127696
2024-06-07 17:13:43 +00:00
b741819b05 Fix 'get_attr' call in dynamo 'run_node' (#127696)
Fixes #124858

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127696
Approved by: https://github.com/jansel
ghstack dependencies: #127695
2024-06-07 17:13:43 +00:00
3aa623d407 Fix assume_constant_result for UnspecializedNNModuleVariable methods (#127695)
Fixes #127509

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127695
Approved by: https://github.com/jansel
2024-06-07 17:13:43 +00:00
754e6d4ad0 Make jobs with LF runners still pass lint (#128175)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128175
Approved by: https://github.com/huydhn
2024-06-07 17:13:04 +00:00
85758fa5ae [c10d][TCPStore] make TCPStore server use libuv by default (#127957)
**Summary**
This PR switches the default TCPStore server backend to a new implementation that utilizes [`libuv`](https://github.com/libuv/libuv) for significantly lower initialization time and better scalability:
<img width="714" alt="image" src="https://github.com/pytorch/pytorch/assets/12968408/18503011-da5d-4104-8ba9-abc456438b02">

We hope users will benefit from a much shorter startup time in large-scale jobs. Eventually, we hope to fully replace the old TCPStore backend implementation with the libuv one.

**What it changes**
This PR changes the underlying TCPStore server backend to `libuv` unless users explicitly ask for the old TCPStore server. This change should not be noticeable to users, apart from significantly faster TCPStore startup for large-scale jobs.

One thing to note: for the libuv backend, we do not support the initialization approach where the user passes in a socket. We plan to support it as a next step, but we chose to disable it until it is fully tested. If you initialize TCPStore this way, see the next section for how to keep using the old TCPStore server.

**Fallback/Remain using the old TCPStore server**
For users who want to stay with the old TCPStore backend, there are 3 ways:

1. If user is directly instantiating TCPStore object, user can pass in argument `use_libuv=False` to use the old TCPStore server backend e.g. `store = torch.distributed.TCPStore(..., use_libuv=False)`.
2. Or, specify the TCPStore backend option in `init_method` when calling default ProcessGroup init, e.g. `torch.distributed.init_process_group(..., init_method="{YOUR_RENDEZVOUS_METHOD}://{YOUR_HOSTNAME}:{YOUR_PORT}?use_libuv=0")`
3. Or, user can set environment variable `USE_LIBUV` to `"0"` when launching.

These 3 approaches are listed in order of precedence. That is, if the user specifies `use_libuv=0` in `init_method` and also sets environment var `USE_LIBUV="1"`, the former takes effect and the TCPStore backend instantiated will be the old one instead of the one using libuv.
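
The precedence rule can be illustrated with a small resolver. This is a sketch and not the PyTorch implementation: `resolve_use_libuv` is a hypothetical function, and treating any value other than `"0"` as "use libuv" is an assumption made for the example.

```python
import os
from typing import Mapping, Optional
from urllib.parse import parse_qs, urlparse


def resolve_use_libuv(
    explicit: Optional[bool] = None,
    init_method: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> bool:
    """Pick the backend following: explicit argument > init_method > env var > default."""
    if explicit is not None:
        return explicit
    if init_method is not None:
        query = parse_qs(urlparse(init_method).query)
        if "use_libuv" in query:
            # Assumption: "0" selects the old server, anything else selects libuv.
            return query["use_libuv"][-1] != "0"
    env = os.environ if env is None else env
    if "USE_LIBUV" in env:
        return env["USE_LIBUV"] != "0"
    return True  # libuv is the default after this PR


# The case from the text: init_method says 0, the environment says 1.
print(resolve_use_libuv(init_method="tcp://host:29500?use_libuv=0",
                        env={"USE_LIBUV": "1"}))
```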

**Operating Systems Compatibility**
From the CI signals, we believe the new implementation has the same behavior as the old TCPStore server on all supported platforms. If you notice any behavior discrepancy, please file an issue with `oncall: distributed` label.

**Test Plan**
`pytest test/distributed/test_store.py`
<img width="2548" alt="image" src="https://github.com/pytorch/pytorch/assets/12968408/dc0aebeb-6d5a-4daa-b98c-e56bd39aa588">
note: `TestMultiThreadedWait::test_wait` is a broken test that has been there for some time.

`test/distributed/elastic/utils/distributed_test.py`
<img width="2558" alt="image" src="https://github.com/pytorch/pytorch/assets/12968408/a6a3266d-b798-41c4-94d2-152056a034f6">

**TODO**
1. Update the doc at

- https://pytorch.org/docs/stable/distributed.html#distributed-key-value-store
- https://pytorch.org/docs/stable/distributed.html#tcp-initialization

2. Make torch elastic rendezvous to use libuv TCPStore as well. See `torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py` cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k @kurman
3. Test if libuv backend is okay with initialization with socket. Change `LibUvTCPStoreTest::test_take_over_listen_socket`.

Differential Revision: [D58259591](https://our.internmc.facebook.com/intern/diff/D58259591)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127957
Approved by: https://github.com/kurman
ghstack dependencies: #127956
2024-06-07 16:53:01 +00:00
6c824cd9fb [BE][c10d] fix use of TORCH_ERROR in TCPStore libuv backend (#127956)
**Summary**
The use of TORCH_ERROR in the TCPStore libuv backend code needs to be updated.

Differential Revision: [D58259589](https://our.internmc.facebook.com/intern/diff/D58259589)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127956
Approved by: https://github.com/shuqiangzhang, https://github.com/cyyever
2024-06-07 16:53:01 +00:00
b9b89ed638 [pipelining] fix LoopedBFS (#127796)
# Issues

Currently two issues need to be fixed with LoopedBFS:
1. The wrap-around send operation to the looped-around stage blocks, which causes a hang. For some reason this doesn't surface on a single node, but on multihost it surfaces as a hang.
<img width="1311" alt="image" src="https://github.com/pytorch/pytorch/assets/14858254/210d9d18-455f-4f65-8a11-7ce2c1ec73fd">
2. When microbatches are popped off in `backward_one_chunk`, it automatically uses the `bwd_chunk_id` starting from 0. This works for interleaved 1f1b and 1f1b, but for loopedBFS we want to pop starting from `num_microbatches - 1`. Does the same need to be fixed for gpipe?

# Changes
- Update LoopedBFS implementation to share `_step_microbatches` with `Interleaved1F1B`
- Also share the tests between the two schedules for varying num_microbatches, local_stages, and world_sizes
- Update `backward_one_chunk` to optionally take a `bwd_chunk_id` argument.
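
The effect of the optional `bwd_chunk_id` argument can be pictured with a toy model. `ToyStage` below is hypothetical and unrelated to the real stage class; it only shows how a default counter starting at 0 differs from a caller that pops from `num_microbatches - 1` downwards.

```python
from typing import Dict, List, Optional


class ToyStage:
    """Toy stand-in that stores one saved item per microbatch."""

    def __init__(self, num_microbatches: int) -> None:
        self.saved: Dict[int, str] = {
            i: f"activations[{i}]" for i in range(num_microbatches)
        }
        self._next_default_id = 0

    def backward_one_chunk(self, bwd_chunk_id: Optional[int] = None) -> str:
        # Without an explicit id, fall back to the counter that starts at 0.
        if bwd_chunk_id is None:
            bwd_chunk_id = self._next_default_id
            self._next_default_id += 1
        return self.saved.pop(bwd_chunk_id)


def reversed_backward_order(num_microbatches: int) -> List[str]:
    """Pop microbatches starting from num_microbatches - 1, as wanted for LoopedBFS."""
    stage = ToyStage(num_microbatches)
    return [stage.backward_one_chunk(i) for i in reversed(range(num_microbatches))]


print(reversed_backward_order(4))
```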

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127796
Approved by: https://github.com/wconstab
2024-06-07 16:46:38 +00:00
d9696ea624 [AOTInductor] [Tooling] Update NaN and INF Checker for AOTInductor (#127574)
Summary:
1. Integrate NaN and INF checker with existing config, controllable by env var.
2. Move the injection point of the NaN & INF checker earlier; this prevents buffers from being freed before the check.
3. Inject debugging code at the kernel level, which prevents us from trying to read buffers that are fused in place into a single kernel.

Test Plan:
Debugging utility.
Test and check by existing tests with env var:
```
TORCHINDUCTOR_NAN_ASSERTS=1 TORCHINDUCTOR_MAX_AUTOTUNE=0 python test/inductor/test_aot_inductor.py -k AOTInductorTestNonABICompatibleCuda.test_seq_non_abi_compatible_cuda
```

Reviewed By: ColinPeppler

Differential Revision: D57989176

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127574
Approved by: https://github.com/chenyang78, https://github.com/desertfire
2024-06-07 16:46:26 +00:00
fc6e3ff96d [ROCm] Update triton pin to fix libtanh issue (#125396)
There were some internal build issues related to tanh when we moved to upstream triton in ROCm. These issues were fixed by the following triton commit: https://github.com/triton-lang/triton/pull/3810 . This PR moves the triton pin to incorporate that change. Added some skips for unit tests that regressed due to the triton commit bump in this PR.

Needs https://github.com/pytorch/pytorch/pull/127968 since this PR introduces a triton dependency on llnl-hatchet, which doesn't have py3.12 wheels available currently.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125396
Approved by: https://github.com/pruthvistony, https://github.com/malfet
2024-06-07 16:23:04 +00:00
128952625b Revert "Added memory budget to partitioner (#126320)"
This reverts commit 2184cdd29128a924583e4702489177f83fb8270a.

Reverted https://github.com/pytorch/pytorch/pull/126320 on behalf of https://github.com/ZainRizvi due to The new test_ac.py fails on ROCm machines ([comment](https://github.com/pytorch/pytorch/pull/126320#issuecomment-2155141886))
2024-06-07 16:15:03 +00:00
cyy
c219fa5eb9 [3/N] Remove unused functions (#128179)
Following https://github.com/pytorch/pytorch/pull/128005, this PR continues to remove unused functions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128179
Approved by: https://github.com/ezyang
2024-06-07 16:13:16 +00:00
8d16a73f0f Manipulate triton_hash_with_backend so that it doesn't contain any keywords (#128159)
Summary: See https://github.com/pytorch/pytorch/issues/127637 where "def" appears in the backend_hash and causes a problem.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128159
Approved by: https://github.com/jansel
2024-06-07 16:10:44 +00:00
852b7b4c99 [inductor] Enable subprocess-based parallel compile as the default (#126817)
Differential Revision: [D58239826](https://our.internmc.facebook.com/intern/diff/D58239826)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126817
Approved by: https://github.com/eellison
ghstack dependencies: #128037, #128086
2024-06-07 16:10:11 +00:00
ac51f782fe Revert "Complete revamp of float/promotion sympy handling (#126905)"
This reverts commit 2f7cfecd86009a9d396fdbdcdfb4ba7a005db16b.

Reverted https://github.com/pytorch/pytorch/pull/126905 on behalf of https://github.com/atalman due to Sorry need to revert - failing internally ([comment](https://github.com/pytorch/pytorch/pull/126905#issuecomment-2155118778))
2024-06-07 16:01:46 +00:00
23c156cd2d Revert "[inductor] simplify indexing (#127661)"
This reverts commit 901226ae837bd4629b34735c84a3481c4988bb5b.

Reverted https://github.com/pytorch/pytorch/pull/127661 on behalf of https://github.com/atalman due to Sorry reverting because in conflict with https://github.com/pytorch/pytorch/pull/126905 which needs to be reverted, will be relanding it ([comment](https://github.com/pytorch/pytorch/pull/127661#issuecomment-2155115388))
2024-06-07 15:58:36 +00:00
cyy
a1b664adeb Add default values to PyTorchMemEffAttention::AttentionKernel::Params members (#112215)
Default values were added to Params in order to eliminate CUDA warnings like
```
and the implicitly-defined constructor does not initialize ‘PyTorchMemEffAttention::AttentionKernel<float, cutlass::arch::Sm80, true, 64, 64, 64, true, true>::accum_t PyTorchMemEffAttention::AttentionKernel<float, cutlass::arch::Sm80, true, 64, 64, 64, true, true>::Params::scale’
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112215
Approved by: https://github.com/eqy, https://github.com/ezyang
2024-06-07 15:54:07 +00:00
3090667cf9 [pipelining] pipeline() taking microbatch as example input (#128163)
Changed the API of `pipeline()` to take microbatch instead of full batch as example args.

Main purpose is to:
- make this API more atomic;
- decouple tracing frontend from runtime info like `num_chunks`.

Side effects:
- Creates opportunity for varying `num_chunks` of schedules with the same `pipe` object.
- User has to create example microbatch input.
- Chunk spec stuff are now all moved to runtime side.
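
Since the user now has to supply an example microbatch, one simple way to derive it from a full batch is to split the batch and take the first piece. The sketch uses plain Python lists and a hypothetical helper `split_into_microbatches`; it does not call the `pipeline()` API, and it assumes the batch size is divisible by the number of chunks.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_into_microbatches(batch: Sequence[T], num_chunks: int) -> List[List[T]]:
    """Split a full batch into num_chunks equally sized microbatches."""
    if num_chunks <= 0:
        raise ValueError("num_chunks must be positive")
    if len(batch) % num_chunks != 0:
        raise ValueError("batch size must be divisible by num_chunks")
    size = len(batch) // num_chunks
    return [list(batch[i * size:(i + 1) * size]) for i in range(num_chunks)]


full_batch = list(range(8))
microbatches = split_into_microbatches(full_batch, num_chunks=4)
example_microbatch = microbatches[0]  # what would be passed as the example input
print(example_microbatch)
```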

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128163
Approved by: https://github.com/H-Huang
2024-06-07 15:51:53 +00:00
224b4339e5 Revert "Make ValueRange repr less chatty by default (#128043)"
This reverts commit f0dd11df5534ae074ad2d090e6700576a22719d6.

Reverted https://github.com/pytorch/pytorch/pull/128043 on behalf of https://github.com/atalman due to Sorry reverting because in conflict with [#126905](https://github.com/pytorch/pytorch/pull/126905) which needs to be reverted ([comment](https://github.com/pytorch/pytorch/pull/128043#issuecomment-2155091732))
2024-06-07 15:43:39 +00:00
6e75024ff0 Run TestAOTAutograd with dynamo (#128047)
My goal is to run these tests with the autograd cache on, but first I want them running with dynamo. These tests already caught an interesting issue so I thought it would be helpful to just have them.

Next up I'll have a second subclass of these tests, run them twice, and expect a cache hit the second time from autograd.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128047
Approved by: https://github.com/ezyang
2024-06-07 15:42:28 +00:00
771be55bb0 Documenting torch.onnx.operator.shape_as_tensor (#128051)
Fixes #127890

This PR adds docstring to the `torch.onnx.operator.shape_as_tensor` function.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128051
Approved by: https://github.com/xadupre
2024-06-07 15:20:18 +00:00
3f9798a4fd add docstring to masked_fill, expand, select, unsqueeze, cat fns (#128055)
Fixes #127891
Fixes #127893
Fixes #127894
Fixes #127907
Fixes #127910

## Description
Add docstring to `masked_fill`, `expand`, `select`, `unsqueeze`, and `cat` functions in torch.onnx.symbolic_opset9.py

remaining pydocstyle errors: 257

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128055
Approved by: https://github.com/xadupre
2024-06-07 15:17:22 +00:00
543a870943 [pipelining] Rename ManualPipelineStage -> PipelineStage (#128157)
Renaming ManualPipelineStage to remove the "Manual" part. I needed to replace the existing `PipelineStage` which takes in the `pipe` argument, so I have renamed that to `TracerPipelineStage`. @kwen2501 will remove this entirely in favor of adding a util to `Pipe` to just create the stage directly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128157
Approved by: https://github.com/wconstab
2024-06-07 09:24:16 +00:00
5f81265572 [Traceable FSDP2] Return early from _register_post_backward_hook when compile (#127864)
Dynamo doesn't support `RegisterPostBackwardFunction` very well yet. This PR skips it and relies on `root_post_backward_callback` under compile. We will improve `RegisterPostBackwardFunction` support in Q3.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127864
Approved by: https://github.com/awgu
2024-06-07 09:19:07 +00:00
7efaeb1494 [AOTI] docs: add suggestion to turn on freezing on CPU (#128010)
With https://github.com/pytorch/pytorch/pull/124350 landed, it is now suggested in AOTI to turn on freezing on CPU to get better performance.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128010
Approved by: https://github.com/desertfire
2024-06-07 08:57:02 +00:00
0c16800b4a [pipelining] include lifted constants in input_to_state (#128173)
Previous PR only looked at state dict to determine inputs to state, missing out on lifted tensors

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128173
Approved by: https://github.com/kwen2501
2024-06-07 08:40:54 +00:00
01601ebd41 Retire torch.distributed.pipeline (#127354)
Actually retiring module after deprecation warning for a while.
The new supported module is: torch.distributed.pipelining.
Please migrate.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127354
Approved by: https://github.com/wconstab
2024-06-07 08:11:58 +00:00
70724bdbfe Bugfix for nondeterministic torch_key (#128111)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128111
Approved by: https://github.com/oulgen
2024-06-07 07:17:39 +00:00
00c6ca4459 [compiled autograd][cudagraphs] Inputs runtime wrapper to move cpu scalars to cuda (#125382)
CPU scalars are most commonly used for the philox random seed. Right now, any CPU input causes cudagraphs to be skipped for the entire graph. We need both the traced graph and the runtime inputs to be cudaified.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125382
Approved by: https://github.com/jansel
2024-06-07 07:12:46 +00:00
190f06d468 [pipelining] Lower _configure_data_parallel_mode to stage (#127946)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127946
Approved by: https://github.com/wconstab
ghstack dependencies: #127935
2024-06-07 07:06:23 +00:00
a448b3ae95 [Traceable FSDP2] Check hasattr('fsdp_pre_all_gather') only when not compile (#127855)
Dynamo doesn't support `hasattr(inner_tensor, "fsdp_post_all_gather")` yet. We will work on this support in Q3.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127855
Approved by: https://github.com/awgu
2024-06-07 06:36:40 +00:00
2ff312359c skip hf_T5_generate in dynamic shape test (#121129)
As reported in https://github.com/pytorch/pytorch/issues/119434, `hf_T5_generate` fails with dynamic shape testing. This PR skips the dynamic batch size testing of this model.

* Error msg is
```
  File "/home/jiayisun/pytorch/torch/_dynamo/guards.py", line 705, in SHAPE_ENV
    guards = output_graph.shape_env.produce_guards(
  File "/home/jiayisun/pytorch/torch/fx/experimental/symbolic_shapes.py", line 3253, in produce_guards
    raise ConstraintViolationError(
torch.fx.experimental.symbolic_shapes.ConstraintViolationError: Constraints violated (L['inputs_tensor'].size()[0])! For more information, run with TORCH_LOGS="+dynamic".
  - Not all values of RelaxedUnspecConstraint(L['inputs_tensor'].size()[0]) are valid because L['inputs_tensor'].size()[0] was inferred to be a constant (4).
```

* Root Cause is
This error happens while creating a guard for this [model script line](https://github.com/huggingface/transformers/blob/main/src/transformers/models/t5/modeling_t5.py#L561): `scores += position_bias_masked`
I ran it with TORCH_LOGS="+dynamic" and got the key line: `I0305 00:21:00.849974 140376923287424 torch/fx/experimental/symbolic_shapes.py:3963] [6/0_1] eval Eq(s0, 4) [guard added] at miniconda3/envs/pt2/lib/python3.9/site-packages/transformers/models/t5/modeling_t5.py:561 in forward (_refs/__init__.py:403 in _broadcast_shapes)`
The batch dimension of `inputs_tensor` in the dynamic batch size test is marked as the dynamic shape `s0`, so the batch dimension of `scores`, which is produced by a series of operations on `inputs_tensor`, is also `s0`. However, the function that creates `attention_mask` runs in Python rather than in Dynamo, so the batch dimension of `attention_mask` is the real shape `4`, and the batch dimension of `position_bias_masked`, which is produced by a series of operations on `attention_mask`, is also the real shape `4` rather than the dynamic shape `s0`. The line `scores += position_bias_masked` requires a guard that checks whether the batch dimension of `scores` always equals the batch dimension of `position_bias_masked`, i.e. `Eq(s0, 4)`, and creating that guard raises the error.
So the root cause of this error is that the function that creates `attention_mask` runs in Python rather than in Dynamo. It is not traced by Dynamo because Dynamo has a graph break in this function (at this [model script line](https://github.com/huggingface/transformers/blob/main/src/transformers/generation/utils.py#L476): `is_pad_token_in_inputs = (pad_token_id is not None) and (pad_token_id in inputs)`) due to the following error:
`torch._dynamo.exc.Unsupported: Tensor.item`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121129
Approved by: https://github.com/leslie-fang-intel, https://github.com/ezyang
2024-06-07 06:28:29 +00:00
d943357a21 [XPU] Add xpu support of make triton (#126513)
This PR is to add XPU support for `make triton`.

If a user wishes to use Triton with XPU support, the user needs to install the  [intel-xpu-backend-for-triton](https://github.com/intel/intel-xpu-backend-for-triton).

This PR allows the user to easily install Triton for xpu backend support:

```
# clone the pytorch repo
export USE_XPU=1
make triton
```
The XPU version of triton will always be built from the source. It will cat the commit id from `.ci/docker/ci_commit_pins/triton-xpu.txt`, for example, `b8c64f64c18d8cac598b3adb355c21e7439c21de`.

So the final call would be like:

```
pip install --force-reinstall "git+https://github.com/intel/intel-xpu-backend-for-triton@b8c64f64c18d8cac598b3adb355c21e7439c21de#subdirectory=python"
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126513
Approved by: https://github.com/EikanWang, https://github.com/atalman
2024-06-07 06:25:47 +00:00
68cc63ae27 introduce skipIfNNModuleInlined and skip test_cpu_cuda_module_after_dynamo (#128023)
See https://github.com/pytorch/pytorch/issues/127636 for details about the issue. TL;DR:
when inlining is enabled, we create a fake tensor while tracing in dynamo and try to perform aten.add.Tensor between
two tensors on different devices (cpu and cuda:0); without inlining we do not hit that operation during tracing.
```
Failed running call_function <built-in function add>(*(FakeTensor(..., size=(20, 20), grad_fn=<AddBackward0>), FakeTensor(..., device='cuda:0', size=(20, 20))), **{}):
Unhandled FakeTensor Device Propagation for aten.add.Tensor, found two different devices cpu, cuda:0
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128023
Approved by: https://github.com/anijain2305
ghstack dependencies: #127487, #127553
2024-06-07 06:00:33 +00:00
7e48d6a497 reset dynamo in test_do_not_skip_side_effects unit test loop to avoid dynamo cache limit hit (#127487)
fix https://github.com/pytorch/pytorch/issues/127483

When nn module inlining is enabled, all recompilations are counted against the same frame, so we hit the cache limit for
test_do_not_skip_side_effects. Without inlining things are different: each time we hit a new Object Model we do not consider that a recompilation, as explained in https://github.com/pytorch/pytorch/issues/127483

For that test we do not really care about cache size, so I reset dynamo in the main loop.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127487
Approved by: https://github.com/anijain2305
2024-06-07 06:00:33 +00:00
dc8e3c2e90 [inductor] subproc parallel compile: initialize future before sending work to the pool (#128086)
Summary: I got reports of intermittent failures in CI and the logs show errors like this:
```
CRITICAL:concurrent.futures:Future 139789013754560 in unexpected state: FINISHED
```
I can't repro locally, but seems clear that we should initialize the future _before_ sending work to the subprocess pool since it could finish before we call set_running_or_notify_cancel()
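
The ordering problem can be reproduced in isolation with a bare `concurrent.futures.Future`. This is a minimal sketch, not the inductor code: the two helper functions are hypothetical and only show that marking a future as running after it has already finished raises, while marking it first is safe.

```python
from concurrent.futures import Future


def finish_then_mark_running() -> str:
    """Wrong order: the worker completes the future before it is marked running."""
    fut = Future()
    fut.set_result("compiled")  # the worker finished first
    try:
        # Too late: the future is already FINISHED, so this logs a CRITICAL
        # message and raises RuntimeError.
        fut.set_running_or_notify_cancel()
    except RuntimeError as exc:
        return f"error: {exc}"
    return "ok"


def mark_running_then_finish() -> str:
    """Right order: mark the future running before any work can complete it."""
    fut = Future()
    fut.set_running_or_notify_cancel()  # PENDING -> RUNNING
    fut.set_result("compiled")
    return fut.result()
```

The first helper triggers the same "in unexpected state: FINISHED" log line quoted above; the second mirrors the ordering this PR moves to.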

Differential Revision: [D58239829](https://our.internmc.facebook.com/intern/diff/D58239829)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128086
Approved by: https://github.com/jansel
ghstack dependencies: #128037
2024-06-07 04:17:35 +00:00
6a2bf48cfa [inductor] subproc parallel-compile: start thread last in init (#128037)
Summary: Observed on an internal workload: the helper thread started and attempted to access member variables before they were initialized.
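
A minimal sketch of the pattern, assuming a hypothetical `Worker` class rather than the real inductor helper: every member the thread touches is assigned first, and `start()` is the last statement in `__init__`.

```python
import queue
import threading


class Worker:
    """Hypothetical helper that runs queued jobs on a background thread."""

    def __init__(self):
        # Initialize everything the thread reads or writes first.
        self.jobs = queue.Queue()
        self.results = []
        # Start the thread last, so _loop never sees a half-built object.
        self.thread = threading.Thread(target=self._loop, daemon=True)
        self.thread.start()

    def _loop(self):
        while True:
            job = self.jobs.get()
            if job is None:  # sentinel: stop the thread
                break
            self.results.append(job())

    def shutdown(self):
        self.jobs.put(None)
        self.thread.join()
```

If `start()` came before `self.results = []`, `_loop` could run a job and fail on the missing attribute, which is the kind of race described above.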

Differential Revision: [D58239827](https://our.internmc.facebook.com/intern/diff/D58239827)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128037
Approved by: https://github.com/Skylion007, https://github.com/eellison
2024-06-07 04:17:35 +00:00
e8e0bdf541 [inductor] parallel-compile: call triton_key() before forking (#127639)
Summary:
A user reported a severe slowdown on a workload when using parallel compile. The issue is that in some environments, the process affinity changes after forking such that all forked subprocesses use a single logical processor. Described here: https://github.com/pytorch/pytorch/issues/99625. That requires a separate fix, but during debugging we noticed that we can at least optimize the expensive call to triton_key() by making it before forking.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127639
Approved by: https://github.com/eellison, https://github.com/anijain2305
2024-06-07 04:12:57 +00:00
96806b1777 [pipelining][doc] Add frontend description and change tracer example (#128070)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128070
Approved by: https://github.com/wconstab, https://github.com/H-Huang
2024-06-07 04:09:36 +00:00
3df53c2a8f [dtensor] directly return local_tensor under no_grad (#128145)
as titled, skip the autograd function and directly return the
local_tensor if it's under no_grad context, this would avoid creating
views

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128145
Approved by: https://github.com/awgu
ghstack dependencies: #128112
2024-06-07 04:01:47 +00:00
747fc35ff5 [dynamo] Support if cond on UnspecializedNNModuleVariable and add inline tests (#128158)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128158
Approved by: https://github.com/jansel
ghstack dependencies: #128001, #126578
2024-06-07 03:50:33 +00:00
5e5bbdb35e [DDP] Bucket handling: make first bucket size equal to bucket_cap_mb if it was set (#121640)
The first DDP bucket is always created with the size of `dist._DEFAULT_FIRST_BUCKET_BYTES` (1 MiB) by default, regardless of `bucket_cap_mb`. The proposal is to use `bucket_cap_mb` as the one main bucket size if it was supplied by the user.
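
A sketch of the proposed rule in plain Python. The function name and constants are illustrative assumptions (the 25 MiB fallback is DDP's documented default for `bucket_cap_mb`); this is not the DDP implementation.

```python
from typing import List, Optional

MIB = 1024 * 1024
DEFAULT_FIRST_BUCKET_BYTES = 1 * MIB  # 1 MiB, as stated in the commit message
DEFAULT_BUCKET_CAP_MB = 25  # assumption: the documented default for bucket_cap_mb


def bucket_size_limits(bucket_cap_mb: Optional[float]) -> List[int]:
    """Return the byte limits for the first bucket and the following buckets."""
    if bucket_cap_mb is None:
        # User did not set a cap: keep the small first bucket, then the default cap.
        return [DEFAULT_FIRST_BUCKET_BYTES, DEFAULT_BUCKET_CAP_MB * MIB]
    # User set a cap: use it for every bucket, including the first one.
    return [int(bucket_cap_mb * MIB)]
```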

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121640
Approved by: https://github.com/wanchaol
2024-06-07 03:33:33 +00:00
4d0ece8196 [pipelining] Consolidate chunk counting between stage and schedule (#127935)
We used to have two backward chunk id counting systems, one at schedule level, the other at stage level.
(Which makes safety dependent on the two advancing hand-in-hand.)

This PR consolidates the counting system to the schedule side only, which would pass `mb_index` to the following stage calls:
`forward_one_chunk`
`backward_one_chunk`
`get_bwd_send_ops`
...
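
A toy sketch of the idea, with the schedule as the only owner of the microbatch counter. `ToyStage` and `run_schedule` are hypothetical; only the method names `forward_one_chunk` and `backward_one_chunk` come from the list above, and their signatures here are assumptions.

```python
class ToyStage:
    """Hypothetical stage that keeps no chunk counter of its own."""

    def __init__(self):
        self.fwd_cache = {}
        self.log = []

    def forward_one_chunk(self, mb_index, x):
        self.fwd_cache[mb_index] = x  # saved for the matching backward
        self.log.append(("fwd", mb_index))
        return x * 2

    def backward_one_chunk(self, mb_index):
        self.log.append(("bwd", mb_index))
        return self.fwd_cache.pop(mb_index)


def run_schedule(stage, microbatches):
    """The schedule decides which microbatch each call refers to."""
    outputs = [stage.forward_one_chunk(i, mb) for i, mb in enumerate(microbatches)]
    for i in range(len(microbatches)):
        stage.backward_one_chunk(i)
    return outputs
```

Because the index is always passed in, the stage cannot drift out of step with the schedule.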

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127935
Approved by: https://github.com/H-Huang
2024-06-07 03:33:18 +00:00
476bfe6cce fix torch.compile with triton kernels under inference_mode (#124489)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124489
Approved by: https://github.com/albanD
2024-06-07 03:29:37 +00:00
50155e825b [export] provide refine function for automatically accepting dynamic shapes suggested fixes (#127436)
Summary:
Part of the work helping export's automatic dynamic shapes / dynamic shapes refining based on suggested fixes.

Introduces a util function refine_dynamic_shapes_from_suggested_fixes() that takes the message of a ConstraintViolationError containing suggested dynamic shapes fixes, along with the original dynamic shapes spec, and returns the new spec. Written so that the suggested fixes from export can be directly parsed and used.

Example usage for the automatic dynamic shapes workflow:
```
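# assumes model, inps, and dynamic_shapes are already defined
import torch
from torch.export import export
from torch.export.dynamic_shapes import refine_dynamic_shapes_from_suggested_fixes

# export, fail, parse & refine suggested fixes, re-export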
try:
    export(model, inps, dynamic_shapes=dynamic_shapes)
except torch._dynamo.exc.UserError as exc:
    new_shapes = refine_dynamic_shapes_from_suggested_fixes(exc.msg, dynamic_shapes)
    export(model, inps, dynamic_shapes=new_shapes)
```

For examples of behavior, see the added test and docstring. Will take suggestions for renaming the function to something else 😅

Test Plan: test_export tests

Differential Revision: D57409142

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127436
Approved by: https://github.com/avikchaudhuri
2024-06-07 03:29:06 +00:00
65aa16f968 Revert "Default XLA to use swap_tensors path in nn.Module._apply (#126814)" (#128170)
https://github.com/pytorch/pytorch/issues/128165 :(

This reverts commit a7b1dd82ff3063894fc665ab0c424815231c10e6.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128170
Approved by: https://github.com/drisspg, https://github.com/albanD
2024-06-07 01:44:14 +00:00
f99409903c Documenting torch.distributions.utils.clamp_probs (#128136)
Fixes https://github.com/pytorch/pytorch/issues/127889

This PR adds docstring to the `torch.distributions.utils.clamp_probs` function.

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128136
Approved by: https://github.com/janeyx99, https://github.com/svekars, https://github.com/malfet
2024-06-07 00:49:41 +00:00
740cd0559f Filter non input symexprs from codecache guards (#128052)
Summary: Dynamo lifts all symexprs that appear in the inputs to top level which means that we do not need to look at guards that contain symexprs that do not appear in the inputs. Prune them.
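
A rough sketch of the pruning step, under the assumption that each guard can be represented as its text plus the set of symbol names it mentions. `prune_guards` and this representation are hypothetical, not the codecache implementation.

```python
def prune_guards(guards, input_symbols):
    """Keep only guards whose symbols all appear in the inputs.

    guards: list of (guard_text, set_of_symbol_names) pairs (illustrative format).
    input_symbols: set of symbol names that appear in the graph inputs.
    """
    return [g for g in guards if g[1] <= input_symbols]
```

For example, with inputs mentioning `s0` and `s1`, a guard on `s0 > 1` is kept while a guard that also mentions `s2` is dropped.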

Test Plan: added two new tests

Differential Revision: D58200476

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128052
Approved by: https://github.com/ezyang, https://github.com/masnesral
2024-06-07 00:48:49 +00:00
117ab34891 Documenting the torch.utils.collect_env.get_pretty_env_info function (#128123)
Fixes #127888

This PR adds docstring to the `torch.utils.collect_env.get_pretty_env_info` function

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128123
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-06-07 00:43:18 +00:00
901226ae83 [inductor] simplify indexing (#127661)
This is a short term fix for: https://github.com/pytorch/pytorch/issues/124002

We found that the bad perf of the int8_unpack kernel is caused by sub-optimal indexing. In this PR we introduce 2 indexing optimizations:
1. expand FloorDiv to the entire expression when feasible. E.g. `x1 * 1024 + x2 // 2` will be transformed to `(x1 * 2048 + x2) // 2`. The motivation is that we have more chance to simplify loops for `x1 * 2048 + x2`.
2. merge ModularIndexing pairs: `ModularIndexing(ModularIndexing(x, 1, a), 1, b)` can be simplified to `ModularIndexing(x, 1, b)` if a is a multiple of b.

With both indexing optimizations, we improve int8_unpack perf by 1.54x (183us -> 119us).
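
Both rewrites can be sanity-checked by brute force over small non-negative integers. The sketch below assumes `ModularIndexing(x, div, mod)` means `(x // div) % mod`; the helper names are made up for this check.

```python
def modular_indexing(x, div, mod):
    """Plain-integer stand-in for ModularIndexing: (x // div) % mod."""
    return (x // div) % mod


def check_floordiv_expansion(limit=64):
    """x1 * 1024 + x2 // 2 == (x1 * 2048 + x2) // 2 for non-negative ints."""
    return all(
        x1 * 1024 + x2 // 2 == (x1 * 2048 + x2) // 2
        for x1 in range(limit)
        for x2 in range(limit)
    )


def check_modular_merge(a, b, limit=4096):
    """Nested ModularIndexing collapses to the outer modulus b for every x."""
    return all(
        modular_indexing(modular_indexing(x, 1, a), 1, b) == modular_indexing(x, 1, b)
        for x in range(limit)
    )
```

`check_modular_merge(12, 4)` holds because 12 is a multiple of 4, while `check_modular_merge(10, 4)` fails (at x = 10 the nested form gives 0 and the merged form gives 2), which is why the condition on `a` and `b` is needed.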

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127661
Approved by: https://github.com/jansel
2024-06-06 23:57:45 +00:00
7ede78f9f5 [dynamo][nn-modules] Trace through nn.Module dunder methods for UnspecializedNNModule (#126578)
Tracing through `__init__`  is important because it initializes (calls STORE_ATTR) on members. By doing that, we kick in the mutation tracking for these objects. So, things like mutating `_modules` etc is tracked automatically.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126578
Approved by: https://github.com/jansel
ghstack dependencies: #128001
2024-06-06 23:05:49 +00:00
e5b3387166 [dynamo] Bugfix for nn parameter construction (#128001)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128001
Approved by: https://github.com/jansel
2024-06-06 23:05:49 +00:00
6dfdce92ba Fixed typos in the complex numbers portion of the autograd docs (#127948)
This PR fixes several typos in the complex numbers section of the docs for autograd. Only documentation was altered.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127948
Approved by: https://github.com/soulitzer
2024-06-06 22:47:04 +00:00
56a3d276fe Handle custom op during TorchScript to ExportedProgram conversion (#127580)
#### Description
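Handle custom ops during TorchScript to ExportedProgram conversion
```python
import torch

# Assumption: `lib` is a library fragment for the "mylib" namespace;
# the original snippet did not show how it was created.
lib = torch.library.Library("mylib", "FRAGMENT")

torch.library.define(
    "mylib::foo",
    "(Tensor x) -> Tensor",
    lib=lib,
)

# PyTorch custom op implementation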
@torch.library.impl(
    "mylib::foo",
    "CompositeExplicitAutograd",
    lib=lib,
)
def foo_impl(x):
    return x + x

# Meta function of the custom op.
@torch.library.impl_abstract(
    "mylib::foo",
    lib=lib,
)
def foo_meta(x):
    return x + x

class M(torch.nn.Module):
    def forward(self, x):
        return torch.ops.mylib.foo(x)
```

#### Test Plan
* Add a test case where custom op is called and converted. `pytest test/export/test_converter.py -s -k test_ts2ep_converter_custom_op`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127580
Approved by: https://github.com/angelayi
2024-06-06 22:06:51 +00:00
80fa2778ed Update types for verbose in lr_scheduler (#127943)
I'm currently locked into jsonargparse version 4.19.0, and it complains when used in combination with LightningCLI (v2.0.8). This is because it cares about the types declared in google style docstrings. This causes a problem when it tries to parse how it should cast arguments to construct an instance of an LRScheduler class because the docstrings declare the "verbose" parameter as a bool, but the defaults recently changed to a string "deprecated". This means the type should really be `bool | str`.

This PR adds a `| str` to the docstring type in each learning rate scheduler class. This will prevent jsonargparse from complaining.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127943
Approved by: https://github.com/janeyx99
2024-06-06 21:59:22 +00:00
0a761f0627 [RFC] Provide optional switches to _dump_nccl_trace (#127651)
Summary:
Data from PyTorch distributed is mostly useful during initial stages of model development.
Provide options to reduce data sent/dumped.
`_dump_nccl_trace` takes 3 optional switches. By default, as before, it returns everything.
- `includeCollectives`: option to also include collectives: Default is True.
- `includeStacktraces`: option to include stack traces in collectives. Default is True.
- `onlyActive`: option to only send active collective work - i.e. not completed. Default is
    False (i.e. send everything)
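
A toy model of how the three switches could interact, written against a made-up record format. `dump_trace`, the `state` and `frames` keys, and the returned dict layout are all assumptions for illustration; only the switch names and defaults come from the list above.

```python
def dump_trace(entries, includeCollectives=True, includeStacktraces=True, onlyActive=False):
    """Filter hypothetical trace records according to the three switches."""
    result = {}
    if not includeCollectives:
        return result  # no collective records at all
    kept = []
    for entry in entries:
        if onlyActive and entry["state"] == "completed":
            continue  # drop finished work when only active work is wanted
        record = dict(entry)
        if not includeStacktraces:
            record.pop("frames", None)  # strip the stack trace payload
        kept.append(record)
    result["entries"] = kept
    return result
```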

Test Plan:
Unit tests

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127651
Approved by: https://github.com/wconstab
2024-06-06 21:59:09 +00:00
54fe2d0e89 [cuDNN][quantization] skip qlinear test in cuDNN v9.1.0 (#128166)
#120006 unskipped this test only 3 days ago, so we don't consider it a blocker for cuDNN v9 for now

CC @atalman

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128166
Approved by: https://github.com/atalman, https://github.com/nWEIdia
2024-06-06 21:43:29 +00:00
04272a0e12 Add docstring for the torch.ao.quantization.utils.get_combined_dict function (#128127)
Fixes: #127906

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128127
Approved by: https://github.com/jerryzh168
2024-06-06 21:22:09 +00:00
baaa914bf7 [small] test clean up (#128079)
remove unnecessary line: https://github.com/pytorch/pytorch/issues/123733
add main so test can be run `python ...`: https://github.com/pytorch/pytorch/issues/124906

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128079
Approved by: https://github.com/awgu
2024-06-06 21:21:40 +00:00
9554300436 [inductor][codegen] Codegen constexpr globals and constexpr annotated globals correctly. (#126195)
[Triton #3762](https://github.com/triton-lang/triton/pull/3762)
disallows access to globals which are not `tl.constexpr`

Triton has always treated captured globals this way, but they now
require it be explicit in user code.

Updated codegen to make sure these variables are defined before writing
the kernel source when compiling a user defined triton kernel.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126195
Approved by: https://github.com/alexbaden, https://github.com/bertmaher
2024-06-06 20:50:11 +00:00
2184cdd291 Added memory budget to partitioner (#126320)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126320
Approved by: https://github.com/shunting314
2024-06-06 20:32:29 +00:00
7e059b3c95 Add a call to validate docker images after build step is complete (#127768)
Adds validation to docker images. As discussed here: https://github.com/pytorch/pytorch/issues/125879
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127768
Approved by: https://github.com/huydhn, https://github.com/Skylion007
2024-06-06 20:25:39 +00:00
e8670f6aea [Dynamo][TVM] Support macOS and Linux/aarch64 platforms (#128124)
Fixes #128122
With this fix, I've confirmed that the repro works on the platforms below.
- macOS 14.5 (arm64)
- Ubuntu 20.04.6 LTS (GNU/Linux 5.10.120-tegra aarch64)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128124
Approved by: https://github.com/malfet
2024-06-06 19:47:11 +00:00
de4f8b9946 [BE]: Update cudnn to 9.1.0.70 (#123475)
cuDNN has managed to upload cu11 and cu12 wheels for ~~9.0.0.312~~ 9.1.0.70, so trying this out...

CC @Skylion007 @malfet

Co-authored-by: Wei Wang <weiwan@nvidia.com>
Co-authored-by: atalman <atalman@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123475
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/nWEIdia, https://github.com/atalman
2024-06-06 18:45:22 +00:00
fba21edf5b [CI] Ensure inductor/test_cpu_cpp_wrapper is actually run in inductor_cpp_wrapper_abi_compatible (#126717)
`inductor/test_cpu_cpp_wrapper` is not actually being run in `inductor_cpp_wrapper_abi_compatible` test config

The cpu device type gets removed in d28868c7e8/torch/testing/_internal/common_device_type.py (L733)

so d28868c7e8/test/inductor/test_cpu_cpp_wrapper.py (L396) returns false.

Feel free to make a PR with a different way to do this (a better RUN_CPU check?)

Add a skip for a failing test.  I am not equipped to fix it

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126717
Approved by: https://github.com/ZainRizvi
2024-06-06 18:23:52 +00:00
936225d7b2 [mergebot] Fix pending unstable jobs being viewed as failed (#128080)
https://github.com/pytorch/pytorch/pull/128038#issuecomment-2150802030

In the above, pending unstable jobs get put into the ok_failed_checks list, and because there are a lot of unstable jobs, it exceeds the threshold and merge fails.

I don't think unstable jobs should be considered in the ok failed checks threshold, only flaky and broken trunk jobs should be considered there.

Change looks big, but main thing is that unstable jobs don't get included in the check for how many flaky failures there are.  The other changes are mostly renames so things are clearer
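
A simplified sketch of the intended counting rule. `merge_allowed`, the tuple format, and the category strings are hypothetical and do not reflect the mergebot code; the point is only that unstable jobs never count toward the threshold.

```python
def merge_allowed(checks, ok_failed_threshold):
    """checks: list of (name, status, category) tuples (illustrative format).

    status is "success", "pending", or "failed";
    category is "flaky", "broken_trunk", "unstable", or None.
    """
    ignorable_failures = 0
    for _name, status, category in checks:
        if category == "unstable":
            continue  # never counted, whether pending or failed
        if status == "failed":
            if category in ("flaky", "broken_trunk"):
                ignorable_failures += 1
            else:
                return False  # a real failure blocks the merge
        elif status == "pending":
            return False  # still waiting on a required job
    return ignorable_failures <= ok_failed_threshold
```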
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128080
Approved by: https://github.com/huydhn
2024-06-06 18:22:20 +00:00
32fb68960e [FSDP2] Added experimental warning to unshard API (#128138)
There is still ongoing discussion on how this API should work.

Current approach:
- The pre-all-gather ops run in the default stream and the all-gather is called from the default stream with `async_op=True`.
- Pros:
    - The all-gather input and output tensors are allocated in the default stream, so there is no increased memory fragmentation across stream pools.
    - There is no need for additional CUDA synchronization. The API is self-contained.
- Cons:
    - The pre-all-gather ops (e.g. cast from fp32 -> bf16 and all-gather copy-in device copies) cannot overlap with other default stream compute. The biggest concern here is for CPU offloading, the H2D copies cannot overlap.

Alternative approach:
- Follow the default implicit prefetching approach, where the pre-all-gather ops and all-gather run in separate streams.
- Pros:
    - The pre-all-gather ops can overlap with default stream compute.
- Cons:
    - We require an API that should be called after the last optimizer step (namely, last op that modified sharded parameters) and before the first `unshard` call that has the all-gather streams wait for the default stream. The API is no longer self-contained and now has a complementary API.
    - The all-gather input and output tensors are allocated in separate streams (not the default stream), so there can be increased memory fragmentation across pools.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128138
Approved by: https://github.com/wanchaol
ghstack dependencies: #128100
2024-06-06 18:18:42 +00:00
78a6b0c479 update test_reformer_train test to handle nn module inlining (#127467)
The number of call nodes increases due to inlining.
before inlining:
```
 class GraphModule(torch.nn.Module):
        def forward(self, function_ctx, cat: "f32[1, s0, 512]"):
            # No stacktrace found for following nodes
            _set_grad_enabled = torch._C._set_grad_enabled(False)

            # File: /data/users/lsakka/pytorch/pytorch/test/dynamo/test_repros.py:283 in backward, code: grad_attn_output, grad_hidden_states = torch.chunk(
            chunk = torch.chunk(cat, 2, dim = -1);  cat = None
            getitem: "f32[1, s0, 256]" = chunk[0]
            getitem_1: "f32[1, s0, 256]" = chunk[1];  chunk = None

            # No stacktrace found for following nodes
            _set_grad_enabled_1 = torch._C._set_grad_enabled(True)
            return (getitem_1, None)
```

after inlining:
```
class GraphModule(torch.nn.Module):
    def forward(self, s0: "Sym(s0)", L_hidden_states_: "f32[1, s0, 256]", L_self_layers_0_weight: "f32[256, 256]", L_self_layers_0_bias: "f32[256]", L_self_layer_norm_weight: "f32[512]", L_self_layer_norm_bias: "f32[512]", L_self_layer_norm_normalized_shape_0_: "Sym(512)"):
        l_hidden_states_ = L_hidden_states_
        l_self_layers_0_weight = L_self_layers_0_weight
        l_self_layers_0_bias = L_self_layers_0_bias
        l_self_layer_norm_weight = L_self_layer_norm_weight
        l_self_layer_norm_bias = L_self_layer_norm_bias
        l_self_layer_norm_normalized_shape_0_ = L_self_layer_norm_normalized_shape_0_

        # File: /data/users/lsakka/pytorch/pytorch/test/dynamo/test_repros.py:332 in forward, code: hidden_states = torch.cat([hidden_states, hidden_states], dim=-1)
        hidden_states: "f32[1, s0, 512]" = torch.cat([l_hidden_states_, l_hidden_states_], dim = -1);  l_hidden_states_ = None

        # File: /data/users/lsakka/pytorch/pytorch/test/dynamo/test_repros.py:333 in forward, code: hidden_states = _ReversibleFunction.apply(
        function_ctx = torch.autograd.function.FunctionCtx()

        # File: /data/users/lsakka/pytorch/pytorch/test/dynamo/test_repros.py:258 in forward, code: hidden_states, attn_output = torch.chunk(hidden_states, 2, dim=-1)
        chunk = torch.chunk(hidden_states, 2, dim = -1);  hidden_states = None
        hidden_states_1: "f32[1, s0, 256]" = chunk[0]
        attn_output: "f32[1, s0, 256]" = chunk[1];  chunk = None

        # File: /data/users/lsakka/pytorch/pytorch/torch/nn/modules/linear.py:116 in forward, code: return F.linear(input, self.weight, self.bias)
        attn_output_1: "f32[1, s0, 256]" = torch._C._nn.linear(attn_output, l_self_layers_0_weight, l_self_layers_0_bias);  attn_output = l_self_layers_0_weight = l_self_layers_0_bias = None

        # File: /data/users/lsakka/pytorch/pytorch/test/dynamo/test_repros.py:272 in forward, code: ctx.save_for_backward(attn_output.detach(), hidden_states.detach())
        detach: "f32[1, s0, 256]" = attn_output_1.detach()
        detach_1: "f32[1, s0, 256]" = hidden_states_1.detach()

        # File: /data/users/lsakka/pytorch/pytorch/test/dynamo/test_repros.py:279 in forward, code: return torch.cat([attn_output, hidden_states], dim=-1)
        hidden_states_2: "f32[1, s0, 512]" = torch.cat([attn_output_1, hidden_states_1], dim = -1);  attn_output_1 = hidden_states_1 = None

        # File: /data/users/lsakka/pytorch/pytorch/torch/nn/modules/normalization.py:201 in forward, code: return F.layer_norm(
        hidden_states_3: "f32[1, s0, 512]" = torch.nn.functional.layer_norm(hidden_states_2, (l_self_layer_norm_normalized_shape_0_,), l_self_layer_norm_weight, l_self_layer_norm_bias, 1e-12);  hidden_states_2 = l_self_layer_norm_normalized_shape_0_ = l_self_layer_norm_weight = l_self_layer_norm_bias = None

        # File: /data/users/lsakka/pytorch/pytorch/test/dynamo/test_repros.py:352 in forward, code: hidden_states = torch.nn.functional.dropout(
        hidden_states_4: "f32[1, s0, 512]" = torch.nn.functional.dropout(hidden_states_3, p = 0.5, training = True);  hidden_states_3 = None
        return (hidden_states_4,)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127467
Approved by: https://github.com/anijain2305
ghstack dependencies: #126444, #127146, #127424, #127440
2024-06-06 17:56:36 +00:00
304956e1fb Switch to torch.float16 on XPU AMP mode (#127741)
# Motivation
Previously, the default dtype for AMP on XPU was aligned with the CPU. To align with other GPUs, we intend to change the default dtype for AMP to `torch.float16`. This change aims to save users the effort of converting models from `torch.float16` to `torch.bfloat16`, or vice versa when they want to run the model on different types of GPUs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127741
Approved by: https://github.com/EikanWang, https://github.com/albanD
2024-06-06 17:40:13 +00:00
1d0c1087dd Allow overriding per-dim group options via _MeshEnv.set_dim_group_options (#126599)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126599
Approved by: https://github.com/wanchaol
ghstack dependencies: #126598
2024-06-06 17:18:12 +00:00
e9c5144cbc Fix bug in update_process_group DDP API (#128092)
Fix bug in `_update_process_group` DDP API where we didn't correctly reset `local_used_map_` and a few other variables. This resulted in errors like `Encountered gradient which is undefined, but still allreduced by...`

Added a unit test as well that reproduced the issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128092
Approved by: https://github.com/awgu, https://github.com/fegin
2024-06-06 17:10:42 +00:00
2ffdf556ea Add back API that some people rely on in torch.cuda.amp.grad_scaler namespace (#128056)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128056
Approved by: https://github.com/kit1980, https://github.com/eqy
2024-06-06 17:02:32 +00:00
2d47385f0f [BE]: Enable ruff TCH rules and autofixes for better imports (#127688)
Automated fixes to put imports that are only used in type hints into TYPE_CHECKING imports. This also enables the RUFF TCH rules, which will automatically apply autofixes to move imports in and out of TYPE_CHECKING blocks as needed in the future. This will make the initial PyTorch import faster and will reduce cyclic dependencies.
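
For readers unfamiliar with the pattern, a small generic example (not taken from the PyTorch tree; `describe` is a made-up function): the import sits under `typing.TYPE_CHECKING`, so it is visible to type checkers but never executed at runtime, and the annotation is quoted so it is not evaluated.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Only evaluated by static type checkers; skipped at runtime.
    from decimal import Decimal


def describe(value: "Decimal") -> str:
    """The quoted annotation keeps the runtime from needing the import."""
    return f"value={value}"
```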

Co-authored-by: Xuehai Pan <XuehaiPan@pku.edu.cn>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127688
Approved by: https://github.com/XuehaiPan, https://github.com/ezyang, https://github.com/malfet
2024-06-06 16:55:58 +00:00
4f87f47ea1 [dtensor] reuse DTensorSpec as much as possible (#128112)
As titled: given that our DTensorSpec is immutable, we can always reuse
the spec if the input/output have the same tensor metadata. This helps in two ways:
1. We don't need to re-calculate the hash every time we produce a
   DTensorSpec, which reduces runtime operator overhead.
2. It reduces the DTensor construction overhead.

A local benchmark of clip_grad_norm over 800 parameters shows that for
foreach_norm the CPU overhead drops from 11ms to 7.8ms (around 30% improvement)
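
The reuse idea can be sketched in plain Python. `Spec` and `output_spec` below are hypothetical stand-ins, not the real `DTensorSpec` API; the sketch only assumes that a spec is never mutated after construction:

```python
from typing import Tuple


class Spec:
    """Hypothetical stand-in for an immutable sharding spec."""

    __slots__ = ("placements", "shape", "_hash")

    def __init__(self, placements: Tuple[str, ...], shape: Tuple[int, ...]):
        self.placements = placements
        self.shape = shape
        # Treated as immutable after construction, so the hash is computed once.
        self._hash = hash((placements, shape))

    def __hash__(self) -> int:
        return self._hash

    def __eq__(self, other: object) -> bool:
        return (
            isinstance(other, Spec)
            and self.placements == other.placements
            and self.shape == other.shape
        )


def output_spec(input_spec: Spec, out_shape: Tuple[int, ...]) -> Spec:
    # Same tensor metadata: return the existing object (no new hash, no new Spec).
    if out_shape == input_spec.shape:
        return input_spec
    return Spec(input_spec.placements, out_shape)


spec = Spec(("Shard(0)",), (8, 4))
same = output_spec(spec, (8, 4))    # reused object
other = output_spec(spec, (4, 4))   # metadata differs, so a new Spec is built
```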

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128112
Approved by: https://github.com/awgu
2024-06-06 16:55:50 +00:00
f0dd11df55 Make ValueRange repr less chatty by default (#128043)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128043
Approved by: https://github.com/lezcano
2024-06-06 16:42:48 +00:00
eqy
0de6d2427f Bump tolerances for inductor/test_efficient_conv_bn_eval.py::EfficientConvBNEvalCudaTests::test_basic_cuda attempt 2 (#128048)
CC @nWEIdia @huydhn @Skylion007

Same thing but also bump backward tolerances...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128048
Approved by: https://github.com/Skylion007
2024-06-06 16:17:43 +00:00
a5b86a1ec0 Revert "FP8 rowwise scaling (#125204)"
This reverts commit 5dc912822913b3d90f4938891c7eca722a057cf1.

Reverted https://github.com/pytorch/pytorch/pull/125204 on behalf of https://github.com/atalman due to Sorry need to revert this failing, on internal CI. I suggest to reimport this and try to land internally resolving all issues ([comment](https://github.com/pytorch/pytorch/pull/125204#issuecomment-2152905513))
2024-06-06 16:12:34 +00:00
a5ba9b2858 Fix for addcdiv contiguous problem (#124442)
Fixes issue number #118115
Co-authored-by: Siddharth Kotapati <skotapati@apple.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124442
Approved by: https://github.com/kulinseth
2024-06-06 16:09:18 +00:00
c58d3af3b4 Revert "Add OpInfo entry for alias_copy (#127232)"
This reverts commit 457df212e1c6e1aa4f1eb2ad6ee292052d7c07e1.

Reverted https://github.com/pytorch/pytorch/pull/127232 on behalf of https://github.com/clee2000 due to broke [onnx](https://github.com/pytorch/pytorch/actions/runs/9397057801/job/25880181144) and [mps](https://github.com/pytorch/pytorch/actions/runs/9397057805/job/25879818705) tests, [hud link](457df212e1) , base is 15 days old, the onnx test xfailed on the pr but the xfail was removed so if you rebase itll surface, mps build failed so no mps tests were run on the pr ([comment](https://github.com/pytorch/pytorch/pull/127232#issuecomment-2152848758))
2024-06-06 15:44:47 +00:00
9d849d4312 Disable py3.12 nightly wheel builds for ROCm (#127968)
The Triton commit bump PR https://github.com/pytorch/pytorch/pull/125396 was reverted due to a missing llnl-hatchet dependency for triton. The workaround is to disable py3.12 binary build jobs for ROCm on PyTorch CI until llnl-hatchet publishes py3.12 wheels on [PyPI](https://pypi.org/project/llnl-hatchet/#files)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127968
Approved by: https://github.com/atalman, https://github.com/pruthvistony
2024-06-06 15:17:35 +00:00
48a54146e7 Revert "[dynamo] Support ndarray.dtype attribute access (#124490)"
This reverts commit 4adee71155bec4e419bac32be2cbc1763bc6c98f.

Reverted https://github.com/pytorch/pytorch/pull/124490 on behalf of https://github.com/atalman due to Breaks internal builds ([comment](https://github.com/pytorch/pytorch/pull/124490#issuecomment-2152664749))
2024-06-06 14:21:29 +00:00
f08fd8e9e3 Remove redundant device guard in Resize.h (#126498)
In https://github.com/pytorch/pytorch/pull/113386 a device guard was [inserted](https://github.com/pytorch/pytorch/pull/113386/files#diff-2691af3a999b3a8f4a0f635aabcd8edf0ffeda501edfa9366648e8a89de12a90R30).

The newly inserted device guard has a clearer and more confined scope.
And it's hard to tell the exact purpose and scope of the  [old device guard](78ffe49a3f/aten/src/ATen/native/cuda/Resize.h (L41)).

Removing the guard has a negligible positive performance impact and makes the code more understandable.

Thanks

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126498
Approved by: https://github.com/eqy, https://github.com/lezcano
2024-06-06 13:01:42 +00:00
c97e3ebb96 Fix wrongly exposed variables in torch/__init__.py (#127795)
<img width="609" alt="image" src="https://github.com/pytorch/pytorch/assets/16078332/964c6707-1856-4c2c-8cd8-ce1d96d38d36">

This PR removes temporary variables in `torch/__init__.py`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127795
Approved by: https://github.com/albanD
2024-06-06 08:31:41 +00:00
457df212e1 Add OpInfo entry for alias_copy (#127232)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127232
Approved by: https://github.com/lezcano
2024-06-06 07:46:26 +00:00
f5328542b5 Allow multiple cudagraph recordings per compiled graph (#126822)
### Introduction/Problem

Today when dynamo traces a builtin nn module (nn.Linear for example) it will specially handle parameters of that module by storing them as constant attributes of the graph. This requires that dynamo guard on the ID of the NNModule because if the instance of the module changes, we need to retrace and recollect the new parameters as attributes of the graph. This creates a 1:1 compiled graph to cudagraph relationship.

With hierarchical compilation, dynamo will treat builtin nn modules like any other code. This reduces complexity and, critically, if there are multiple identical layers in a model, we only need to compile one of those layers once and can reuse the same compiled artifact for each layer. This introduces a problem for the current approach to parameter handling. Since the parameters could now change across calls to the compiled artifact, they need to be inputs to the graph instead of attributes. This in turn introduces a problem for cudagraphs: previously, cudagraphs was guaranteed that the parameters of builtin NN Modules would be constant across calls, but now, since the compiled artifact needs to be agnostic to the actual instance of the NN module being used, these parameter memory locations may vary. Previously cudagraphs simply copied varying inputs to cudagraph-owned memory, but since the parameters are quite large, this is catastrophic for performance.

### Solution
To avoid this performance cliff, this PR allows cudagraphs to re-record a new cudagraph if only parameters change. Metadata about which arguments are parameters is propagated from AOT Autograd to compile_fx, and these indices are passed to cudagraphs. If these memory locations change, a new graph is recorded, whereas previously this would have been an error (because it should not have happened). This enables a 1:many compiled graph to cudagraph relationship. Across similar modules we will re-record cudagraphs and dispatch the correct graph if parameter pointers match when the cudagraph is executed.
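
A rough sketch of the 1:many dispatch described above, in plain Python. `MultiGraphCache` and `fake_record` are hypothetical names, `id()` stands in for a tensor's data pointer, and the "graph" is an ordinary closure; none of this is the actual Inductor implementation:

```python
from typing import Any, Callable, Dict, Sequence, Tuple

Replay = Callable[..., Any]


class MultiGraphCache:
    """Keeps one recorded graph per distinct set of parameter addresses."""

    def __init__(self, record: Callable[..., Replay], static_idxs: Sequence[int]):
        self.record = record                   # hypothetical: captures and returns a replay fn
        self.static_idxs = tuple(static_idxs)  # which positional args are parameters
        self.graphs: Dict[Tuple[int, ...], Replay] = {}
        self.num_recordings = 0

    def __call__(self, *args: Any) -> Any:
        # id() stands in for a tensor's data pointer in this sketch.
        key = tuple(id(args[i]) for i in self.static_idxs)
        replay = self.graphs.get(key)
        if replay is None:
            # Parameter locations changed: record a new graph instead of copying.
            replay = self.record(*args)
            self.graphs[key] = replay
            self.num_recordings += 1
        return replay(*args)


def fake_record(weight, x):
    # Stand-in for graph capture; the "graph" just scales x by each weight.
    return lambda weight, x: [w * x for w in weight]


layer_a, layer_b = [1.0, 2.0], [3.0, 4.0]
cache = MultiGraphCache(fake_record, static_idxs=[0])
out_a = cache(layer_a, 2.0)        # first recording
out_b = cache(layer_b, 2.0)        # new parameter object, second recording
out_a_again = cache(layer_a, 2.0)  # pointer match, replays the first graph
```

Keying on the parameter addresses alone means ordinary inputs such as `x` still follow the existing copy path, while large parameters are never copied.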

### Next steps (if needed)
It is theoretically possible that a user passes Parameters that change frequently as inputs to model code - if this is a common issue this design allows for dynamo to pass metadata indicating which parameters were created in a builtin NN Module context to only permit those parameters to have the multi-cudagraph behavior, but this PR does not implement this.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126822
Approved by: https://github.com/eellison
ghstack dependencies: #126820, #126821
2024-06-06 06:39:59 +00:00
5a3bea1e88 Remove unused arg to GraphLowering (#126821)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126821
Approved by: https://github.com/eellison
ghstack dependencies: #126820
2024-06-06 06:39:59 +00:00
70ba6f0ab6 Collect static parameter metadata in aot (#126820)
Collect the indices of the static parameters to pass down to cudagraphs in order to re-record if necessary.
This location was chosen in order to allow us to restrict this (if needed) in the future by setting metadata in dynamo.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126820
Approved by: https://github.com/bdhirsh
2024-06-06 06:39:50 +00:00
c8ff1cd387 [FSDP2] Changed test_register_forward_method to use multiprocess test (#128100)
The test seems to be flaky due to multi-threaded process group. This PR converts the test to use normal multi-process `ProcessGroupNCCL` to fix the flakiness.

This PR closes https://github.com/pytorch/pytorch/issues/126851.

Interestingly, the original MTPG version passes for me on devgpu. Either way, the new version also passes on devgpu, so we can see in CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128100
Approved by: https://github.com/weifengpy
2024-06-06 06:34:02 +00:00
638f543ac2 Enable single nadam test (#128087)
https://github.com/pytorch/pytorch/issues/117150 has been fixed

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128087
Approved by: https://github.com/xmfan
2024-06-06 06:25:00 +00:00
cd42b95047 Handle aten::__contains__ during TorchScript to ExportedProgram conversion (#127544)
#### Description
Add support for converting `prim::__contains__` from TorchScript IR to ExportedProgram, e.g.,
```python
class MIn(torch.nn.Module):
    def forward(self, x: torch.Tensor):
        return x.dtype in [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```
#### Test Plan
* Add test cases covering `__contains__` IR produced from both primitive types and Tensors. `pytest test/export/test_converter.py -s -k test_ts2ep_converter_contains`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127544
Approved by: https://github.com/angelayi
2024-06-06 05:00:13 +00:00
cyy
68eb771265 [2/N] Remove unused test functions (#128005)
Following #127881, this PR continues to remove unused test functions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128005
Approved by: https://github.com/ezyang
2024-06-06 03:41:32 +00:00
2f7cfecd86 Complete revamp of float/promotion sympy handling (#126905)
At a high level, the idea behind this PR is:

* Make it clearer what the promotion and int/float rules for various Sympy operations are. Operators that previously were polymorphic over int/float are now split into separate operators for clarity. We never do mixed int/float addition/multiplication etc in sympy, instead, we always promote to the appropriate operator. (However, equality is currently not done correctly.)
* Enforce strict typing on ValueRanges: if you have a ValueRange for a float, the lower and upper MUST be floats, and so forth for integers.

The story begins in **torch/utils/_sympy/functions.py**. Here, I make some changes to how we represent certain operations in sympy expressions:

* FloorDiv now only supports integer inputs; to do float floor division, do a truediv and then a trunc. Additionally, we remove the divide out addition by gcd optimization, because sympy gcd is over fields and is willing to generate rationals (but rationals are bad for ValueRange strict typing).
* ModularIndexing, LShift, RShift now assert they are given integer inputs.
* Mod only supports integer inputs; eventually we will support FloatMod (left for later work, when we build out Sympy support for floating operations). Unfortunately, I couldn't assert integer inputs here, because of a bad interaction with sympy's inequality solver that is used by the offline solver.
* TrueDiv is split into FloatTrueDiv and IntTrueDiv. This allows us to eventually generate accurate code for Python-semantics IntTrueDiv, which is written in a special way to preserve more precision when the inputs are >= 2**53 than first coercing the integers to floats and then doing true division would.
* Trunc is split to TruncToFloat and TruncToInt.
* Round is updated to return a float, not an int, making it consistent with the round op handler in Inductor. To get Python-style conversion to int, we call TruncToInt on the result.
* RoundDecimal updated to consistently only ever return a float
* Add ToFloat for explicit coercion to float (required so we can enforce strict ValueRanges typing)
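
The precision point behind IntTrueDiv can be checked with plain Python integers, without sympy. The specific operands below are chosen for illustration; the sketch relies on Python's `int / int` being correctly rounded, while `float(a) / float(b)` rounds twice:

```python
def int_truediv(a: int, b: int) -> float:
    # Python's int / int computes the correctly rounded quotient directly.
    return a / b


def float_truediv(a: int, b: int) -> float:
    # Coercing first rounds each operand to a float, then rounds the quotient.
    return float(a) / float(b)


a = 3 * (2**53 + 1)  # too large to be represented exactly as a float
b = 3

direct = int_truediv(a, b)
coerced = float_truediv(a, b)
differs = direct != coerced  # the two strategies disagree for these operands
```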

In **torch/__init__.py**, we modify SymInt and SymFloat to appropriately call into new bindings that route to these refined sympy operations.  Also, we modify `torch.sym_min` and `torch.sym_max` to have promotion semantics (if one argument is a float, the return result is always a float), making them inconsistent with builtins.min/max, but possible to do type analysis without runtime information.
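
To illustrate the promotion semantics on plain Python numbers: `promoting_max` below is a hypothetical helper, not `torch.sym_max` itself, and only mimics the rule that a float argument forces a float result:

```python
from typing import Union

Number = Union[int, float]


def promoting_max(a: Number, b: Number) -> Number:
    result = max(a, b)
    # If either argument is a float, the result type is float,
    # so the return type is known without inspecting the values.
    if isinstance(a, float) or isinstance(b, float):
        return float(result)
    return result


builtin_result = max(2, 1.5)             # builtins.max returns the int 2
promoted_result = promoting_max(2, 1.5)  # promoted to the float 2.0
```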

We also need to introduce some new op handlers in **torch/_inductor/ops_handler.py**:

* `to_int` for truncation to int64, directly corresponding to TruncToInt; this can be implemented by trunc and dtype, but with a dedicated handler it is more convenient for roundtripping in Sympy
* `int_truediv` for Python-style integer true division, which has higher precision than casting to floats and then running `truediv`

These changes have consequences. First, we need to make some administrative changes:

* Actually wire up these Sympy functions from SymInt/SymFloat in **torch/fx/experimental/sym_node.py**, including the new promotion rules (promote2)
* Add support for new Sympy functions in **torch/utils/_sympy/interp.py**, **torch/utils/_sympy/reference.py**
  * In particular, in torch.utils._sympy.reference, we have a strong preference to NOT do nontrivial compute, instead, everything in ops handler should map to a singular sympy function
  * TODO: I chose to roundtrip mod back to our Mod function, but I think I'm going to have to deal with the C/Python inconsistency to fix tests here
* Add printer support for the Sympy functions in **torch/_inductor/codegen/common.py**, **torch/_inductor/codegen/cpp_utils.py**, **torch/_inductor/codegen/triton.py**. `int_truediv` and mixed precision equality are currently not implemented soundly, so we will lose precision in codegen for large values. TODO: The additions here are not exhaustive yet
* Update ValueRanges logic to use new sympy functions in **torch/utils/_sympy/value_ranges.py**. In general, we prefer to use the new Sympy function rather than try to roll things by hand, which is what was done previously for many VR analysis functions.

In **torch/fx/experimental/symbolic_shapes.py** we need to make some symbolic reasoning adjustments:

* Avoid generation of rational subexpressions by removing simplification of `x // y` into `floor(x / y)`. This simplification then triggers an addition simplification rule `(x + y) / c --> x / c + y / c` which is bad because x / c is a rational number now
* `_assert_bound_is_rational` is no more, we no longer generate rational bounds
* Don't intersect non-int value ranges with the `int_range`
* Support more sympy Functions for guard SYMPY_INTERP
* Assert the type of value range is consistent with the variable type

The new asserts uncovered necessary bug fixes:

* **torch/_inductor/codegen/cpp.py**, **torch/_inductor/select_algorithm.py**, **torch/_inductor/sizevars.py** - Ensure Wild/Symbol manually allocated in Inductor is marked `is_integer` so it's accepted to build expressions
* **torch/_inductor/utils.py** - make sure you actually pass in sympy.Expr to these functions
* **torch/_inductor/ir.py** - make_contiguous_strides_for takes int/SymInt, not sympy.Expr!
* **torch/export/dynamic_shapes.py** - don't use infinity to represent int ranges, instead use sys.maxsize - 1

Because of the removal of some symbolic reasoning that produced rationals, some of our symbolic reasoning has gotten worse and we are unable to simplify some guards. Check the TODO at **test/test_proxy_tensor.py**

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126905
Approved by: https://github.com/xadupre, https://github.com/lezcano
2024-06-06 02:29:45 +00:00
c1a43a69e4 [NestedTensor] Add error checks for unbind operator coverage when ragged_idx != 1 (#128058)
Summary:
Add the following error checks for the `unbind` operator on `NestedTensor`s when `ragged_idx != 1`:

- The current implementation allows the creation of `NestedTensor` instances from the class definition with an `offsets` tensor that applies to a dimension other than the jagged dimension. This diff ensures that `unbind` fails when the `offsets` exceed the length of the jagged dimension.

Test Plan:
Added the following unit tests:

`test_unbind_with_lengths_ragged_idx_equals_2_bad_dim_cpu` verifies that `unbind` fails when there is a mismatch between the offsets and the jagged dimension, for `NestedTensor`s with `lengths`.
```
test_unbind_with_lengths_ragged_idx_equals_2_bad_dim_cpu (test_nestedtensor.TestNestedTensorSubclassCPU) ... ok
```

Reviewed By: davidberard98

Differential Revision: D57989082

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128058
Approved by: https://github.com/davidberard98
2024-06-06 01:56:12 +00:00
9795c4224b Revert "[DDP] Bucket handling: make first bucket size equal to bucket_cap_mb if it was set (#121640)"
This reverts commit e98662bed99df57b7d79f9fc1cbe670afc303235.

Reverted https://github.com/pytorch/pytorch/pull/121640 on behalf of https://github.com/clee2000 due to Sorry but it looks like you're failing `distributed/_composable/test_replicate_with_compiler.py::ReplicateTest::test_bucketing_coalesced_op`. The build failed so the tests didn't run, consider rebasing, there have been a couple of PRs lately related to cudnn so you probably are either based on a bad or too old of a commit e98662bed9 https://github.com/pytorch/pytorch/actions/runs/9392731942/job/25868060913 ([comment](https://github.com/pytorch/pytorch/pull/121640#issuecomment-2151258585))
2024-06-06 01:50:18 +00:00
sdp
b4a0161449 Build SYCL kernels for ATen XPU ops on Native Windows (take 2) (#127390)
Original PR https://github.com/pytorch/pytorch/pull/126725 is closed due to bad rebase.

-------
As proposed in https://github.com/pytorch/pytorch/issues/126719, we are enabling PyTorch XPU on Native Windows on Intel GPU.

This PR  enables XPU build on Windows as the first step of #126719:

- Enable `USE_XPU` build on Windows using MSVC as host compiler. The use of MSVC as host compiler seamlessly aligns with the existing PyTorch build on Windows.
- Build oneDNN GPU library on Windows.

Co-authored-by: Yu, Guangye <guangye.yu@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127390
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/ezyang
2024-06-06 01:41:06 +00:00
6adcf21b2b Documenting the torch.cuda.nccl.version function (#128022)
Fixes #127892

This PR adds docstring to the torch.cuda.nccl.version function

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128022
Approved by: https://github.com/malfet
2024-06-06 01:13:07 +00:00
bf2c05352e Make length == stop size oblivious too (#128050)
This doesn't do anything right now (need some other PRs to activate)
but since it edits a header file it would be better to land this
earlier.

Context: https://github.com/pytorch/pytorch/pull/127693

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128050
Approved by: https://github.com/Skylion007, https://github.com/lezcano
2024-06-06 01:09:37 +00:00
80d34217c6 Typo fixes: et al. (#127811)
"et al." is short for _et alia_ and should be abbreviated with a period on the second word. Noticed this typo when reading through the SGD docs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127811
Approved by: https://github.com/janeyx99
2024-06-06 01:03:25 +00:00
d3ad84c38f Use pexpr, not texpr in Triton launch codegen (#128038)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128038
Approved by: https://github.com/Skylion007
2024-06-06 00:45:59 +00:00
8bcebc8dae Add runtime dependency on setuptools for cpp_extensions (#127921)
As per title, since this was removed from the builtin python binary in 3.12 and we use it in `torch.utils.cpp_extension.*`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127921
Approved by: https://github.com/Skylion007
2024-06-05 23:59:38 +00:00
cyy
2fd75667b4 [Caffe2]Remove Caffe2 scripts and benchmarks (#126747)
Due to removal of Caffe2.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126747
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-06-05 23:46:31 +00:00
e98662bed9 [DDP] Bucket handling: make first bucket size equal to bucket_cap_mb if it was set (#121640)
The first DDP bucket is always created with the size of `dist._DEFAULT_FIRST_BUCKET_BYTES` (1 MiB) by default, regardless of `bucket_cap_mb`. The proposal is to use `bucket_cap_mb` as the single bucket size, including for the first bucket, if it was supplied by the user.
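
A simplified sketch of how a list of size limits shapes the buckets. `assign_buckets` is a hypothetical greedy helper written for illustration and does not reproduce DDP's real bucketing logic; the parameter sizes are made up:

```python
from typing import List

MIB = 1024 * 1024


def assign_buckets(param_bytes: List[int], size_limits: List[int]) -> List[List[int]]:
    """Greedily pack parameter indices; the last limit applies to all later buckets."""
    buckets: List[List[int]] = []
    current: List[int] = []
    current_bytes = 0
    limit_idx = 0
    for idx, nbytes in enumerate(param_bytes):
        current.append(idx)
        current_bytes += nbytes
        if current_bytes >= size_limits[limit_idx]:
            buckets.append(current)
            current, current_bytes = [], 0
            limit_idx = min(limit_idx + 1, len(size_limits) - 1)
    if current:
        buckets.append(current)
    return buckets


params = [MIB] * 8   # eight 1 MiB parameters (illustrative)
cap = 4 * MIB        # user-supplied bucket cap of 4 MiB

before = assign_buckets(params, [1 * MIB, cap])  # fixed 1 MiB first bucket
after = assign_buckets(params, [cap])            # cap applies to every bucket
```

In this toy case the fixed first limit yields three buckets, while applying the cap everywhere yields two evenly sized ones.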

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121640
Approved by: https://github.com/wanchaol
2024-06-05 23:44:54 +00:00
ffaea656b5 WorkerServer: add support for binding to TCP (#127986)
This adds support for the WorkerServer binding to TCP as well as the existing unix socket support.

```py
server = _WorkerServer("", 1234)
```

Test plan:

Added unit test

```
python test/distributed/elastic/test_control_plane.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127986
Approved by: https://github.com/c-p-i-o
2024-06-05 22:56:32 +00:00
a7c596870d [BE][Eazy] remove torch.torch.xxx usages (#127800)
NB: `torch` is exposed in `torch/__init__.py`. So there can be `torch.torch.torch.xxx`.
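
The self-reference is easy to reproduce with a throwaway module object. `pkg` and `answer` are made-up names; the sketch only shows why arbitrarily long chains resolve once a module is bound inside its own namespace:

```python
import types

# A throwaway module standing in for a package's __init__ namespace.
pkg = types.ModuleType("pkg")
pkg.answer = 42

# Binding the module inside its own namespace makes it reachable from itself.
pkg.pkg = pkg

chained = pkg.pkg.pkg.answer
```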

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127800
Approved by: https://github.com/peterbell10, https://github.com/kit1980, https://github.com/malfet
2024-06-05 21:53:49 +00:00
4123323eff [ONNX] Single function for torch.onnx.export and torch.onnx.dynamo_export (#127974)
Add `dynamo: bool = True` as a switch in `torch.onnx.export` to provide users an option to try `torch.onnx.dynamo_export`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127974
Approved by: https://github.com/justinchuby
2024-06-05 21:27:46 +00:00
01694eaa56 Move cuda 12.4 jobs to periodic for both pull and inductor (#127825)
Moves 12.4 sm86/a10g jobs in pull to trunk
Moves 12.4 cuda non sm86 jobs to periodic
Moves 12.4 jobs in inductor to inductor-periodic, except inductor_timm which seems to give important signal

There has been a lot of queueing for cuda runners due to the addition of jobs for cuda 12.4, so move those jobs to other workflows that are run less often
Co-authored-by: Andrey Talman <atalman@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127825
Approved by: https://github.com/ZainRizvi, https://github.com/nWEIdia, https://github.com/atalman, https://github.com/malfet
2024-06-05 21:01:36 +00:00
8184cd85fc [fake tensor] Set _is_param for base fake tensors for views (#127823)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127823
Approved by: https://github.com/eellison, https://github.com/ezyang
ghstack dependencies: #127972
2024-06-05 20:26:52 +00:00
626dc934d1 [dynamo][pippy] Hotfix for nn_module_stack for pippy usecase (#127972)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127972
Approved by: https://github.com/ydwu4
2024-06-05 20:14:50 +00:00
72e863df27 Update _learnable_fake_quantize.py (#127993)
Remove sentence "For literature references, please see the class _LearnableFakeQuantizePerTensorOp." and add "s" to "support"

(Possibly) Fixes #99107 (But not sure, sorry)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127993
Approved by: https://github.com/jerryzh168
2024-06-05 20:02:33 +00:00
6e545392cd Move nongpu workflows from trunk to periodic (#128049)
We don't need to run them on every PR. These are used to test for graceful degradation of GPU.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128049
Approved by: https://github.com/clee2000
2024-06-05 18:31:26 +00:00
6412c6060c [reland] Refresh OpOverloadPacket if a new OpOverload gets added (#128000)
If a user accesses an OpOverloadPacket, then creates a new OpOverload,
then uses the OpOverloadPacket, the new OpOverload never gets hit. This
is because OpOverloadPacket caches OpOverloads when it is constructed.

This PR fixes the problem by "refreshing" the OpOverloadPacket if a new
OpOverload gets constructed and the OpOverloadPacket exists.
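
The stale-cache problem and the refresh fix can be sketched with a toy registry. `Packet`, `REGISTRY`, `get_packet`, and `register_overload` are hypothetical names, not the real `torch._ops` classes:

```python
from typing import Callable, Dict, List

REGISTRY: Dict[str, Dict[str, Callable]] = {}  # op name -> overload name -> fn
PACKETS: Dict[str, "Packet"] = {}              # packets that already exist


class Packet:
    def __init__(self, name: str):
        self.name = name
        # Overloads are snapshotted at construction time (the source of the bug).
        self._overloads = dict(REGISTRY.get(name, {}))

    def refresh(self) -> None:
        self._overloads = dict(REGISTRY.get(self.name, {}))

    def overloads(self) -> List[str]:
        return sorted(self._overloads)


def get_packet(name: str) -> Packet:
    if name not in PACKETS:
        PACKETS[name] = Packet(name)
    return PACKETS[name]


def register_overload(name: str, overload: str, fn: Callable) -> None:
    REGISTRY.setdefault(name, {})[overload] = fn
    packet = PACKETS.get(name)
    if packet is not None:
        packet.refresh()  # the fix: refresh a packet that already exists


register_overload("my_op", "default", lambda x: x)
packet = get_packet("my_op")                             # packet caches ["default"]
register_overload("my_op", "Scalar", lambda x, s: x * s)
visible = packet.overloads()                             # new overload is visible
```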

Test Plan:
- new tests

This is the third land attempt. The first one was reverted for breaking
internal tests, the second was reverted for being erroneously suspected
of causing a perf regression.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128000
Approved by: https://github.com/albanD
2024-06-05 17:57:09 +00:00
bb68b54be0 [BE][ptd_fb_test][1/N] Enable testslide (#127512)
This change enables Testslide, which gives us more readable output, import time, etc. This PR was previously stamped as https://github.com/pytorch/pytorch/pull/126460, but the old PR has some ghexport issue.

Differential Revision: [D57919583](https://our.internmc.facebook.com/intern/diff/D57919583/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127512
Approved by: https://github.com/wz337, https://github.com/Skylion007
2024-06-05 17:45:15 +00:00
3acbfd602e Document torch.utils.collect_env.get_env_info function (#128021)
Fixes #127911

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128021
Approved by: https://github.com/malfet
2024-06-05 17:44:47 +00:00
6454e95824 [FSDP2] enable CI for torch.compile(root Transformer) (#127832)
This CI showcases that FSDP2 works when the root model is wrapped in `torch.compile`, since FSDP1 can do the same

compiling root Transformer without AC: `pytest test/distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_multi_group`

compiling root Transformer with AC: `pytest test/distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_with_activation_checkpointing`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127832
Approved by: https://github.com/awgu
2024-06-05 17:29:46 +00:00
4adee71155 [dynamo] Support ndarray.dtype attribute access (#124490)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124490
Approved by: https://github.com/lezcano
ghstack dependencies: #125717
2024-06-05 17:20:01 +00:00
a9cc147fa1 [DSD][FSDP1] Deprecate FSDP.state_dict_type and redirect users to DSD (#127794)
Summary:
As title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127794
Approved by: https://github.com/awgu
ghstack dependencies: #127793
2024-06-05 16:55:05 +00:00
9acc19f8da [inductor] Take absolute value of strides when picking loop order (#127425)
Fixes #126860

The stride hint is found by comparing the value of the indexing expression
evaluated at `idx` set to all zeros and at `idx[dim] = 1`. This causes a problem
for padded inputs where 0 and 1 are still in the padded region.

In particular, for reflection padding this causes the stride to be negative.
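
A small sketch of why the hint goes negative. `reflect_index` and `stride_hint` are hypothetical helpers written for illustration, modelling a 1-D reflection pad rather than Inductor's actual indexing expressions:

```python
from typing import Callable, List


def reflect_index(i: int, pad: int, n: int) -> int:
    """Source index read by output position i for 1-D reflection padding."""
    j = i - pad
    if j < 0:
        return -j               # left padded region mirrors around index 0
    if j >= n:
        return 2 * (n - 1) - j  # right padded region mirrors around n - 1
    return j


def stride_hint(index_fn: Callable[[List[int]], int], ndim: int, dim: int) -> int:
    # Compare the index at all zeros with the index at idx[dim] = 1.
    zeros = [0] * ndim
    one = list(zeros)
    one[dim] = 1
    return index_fn(one) - index_fn(zeros)


# Positions 0 and 1 both fall inside the left padding when pad == 2.
hint = stride_hint(lambda idx: reflect_index(idx[0], pad=2, n=5), ndim=1, dim=0)
ordering_key = abs(hint)  # what the loop-order heuristic should compare
```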

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127425
Approved by: https://github.com/lezcano
2024-06-05 16:48:22 +00:00
22964d1007 [DSD] Deprecate submodules feature for DSD (#127793)
Summary:
Getting a partial state_dict and setting the state_dict with the type Dict[nn.Module, Dict[str, Any]] is too complicated and can confuse users. The feature can be achieved through simple pre-processing and post-processing by users. So this PR adds a deprecation warning to the feature.

The previous PR, https://github.com/pytorch/pytorch/pull/127070, assumed
no one was using the feature and removed it without a grace period. This
seems to be too aggressive and caused some concerns. This PR adds the
deprecation warning and tests.

We will remove the support in 2.5.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127793
Approved by: https://github.com/LucasLLC
2024-06-05 16:31:29 +00:00
5dc9128229 FP8 rowwise scaling (#125204)
# Summary
This pull request introduces an fp8 row-scaling kernel as an optional implementation for `scaled_mm`. The kernel selection is based on the scaling tensors of the inputs. For inputs `x` and `y` of shape `[M, K]` and `[K, N]` respectively, the following conditions must be met:
- `x`'s scale should be a 1-dimensional tensor of length `M`.
- `y`'s scale should be a 1-dimensional tensor of length `N`.

It's important to note that this kernel is not called "rowwise, columnwise" scaling because, although the scales for `y` are semantically along its columns, this implementation only supports the TN format. This means the scaling is along the faster-moving dimension, or the "row".
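
The shape conditions can be illustrated with a pure-Python reference. `rowwise_scaled_mm` is a hypothetical helper, and it assumes each scale acts as a multiplier on the corresponding row of `x` or column of `y`; the real kernel operates on fp8 tensors and is not reproduced here:

```python
from typing import List

Matrix = List[List[float]]


def rowwise_scaled_mm(x: Matrix, y: Matrix,
                      x_scale: List[float], y_scale: List[float]) -> Matrix:
    m, k, n = len(x), len(x[0]), len(y[0])
    if len(y) != k:
        raise ValueError("inner dimensions of x and y must match")
    if len(x_scale) != m or len(y_scale) != n:
        raise ValueError("x_scale must have length M and y_scale length N")
    # out[i][j] = x_scale[i] * y_scale[j] * sum_p x[i][p] * y[p][j]
    return [
        [
            x_scale[i] * y_scale[j] * sum(x[i][p] * y[p][j] for p in range(k))
            for j in range(n)
        ]
        for i in range(m)
    ]


x = [[1.0, 2.0], [3.0, 4.0]]  # shape [M, K] = [2, 2]
y = [[1.0, 0.0], [0.0, 1.0]]  # shape [K, N] = [2, 2]
out = rowwise_scaled_mm(x, y, x_scale=[2.0, 10.0], y_scale=[1.0, 0.5])
```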

The following two PRs were required to enable local builds:
- [PR #126185](https://github.com/pytorch/pytorch/pull/126185)
- [PR #125523](https://github.com/pytorch/pytorch/pull/125523)

### Todo
We still do not build our Python wheels with this architecture.

@ptrblck @malfet, should we replace `sm_90` with `sm_90a`?

The NVRTC TMA shadowing feels wrong, but I am not sure of the right way to spoof the symbol for this compilation unit:
https://github.com/pytorch/pytorch/pull/125204/files#r1586986954

#### ifdef

I tried to use : `#if !defined(USE_ROCM) && defined(CUDA_VERSION) && CUDA_VERSION >= 12000 && \
    defined(__CUDA_ARCH__) && __CUDA_ARCH__ > 900` to gate the building of the kernel. I was having a hell of a time with this.. so I am not really sure the right way to do this

Kernel Credit:
@jwfromm

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125204
Approved by: https://github.com/lw, https://github.com/malfet
2024-06-05 15:46:40 +00:00
4f9fcd7156 Handle unpacking during TorchScript to ExportedProgram conversion (#127419)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127419
Approved by: https://github.com/angelayi
2024-06-05 15:27:13 +00:00
cyy
9f2c4b9342 Replace with standard type traits in torch/csrc (#127852)
In preparation to clean up more type traits.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127852
Approved by: https://github.com/ezyang
2024-06-05 15:22:48 +00:00
cyy
3d617333e7 Simplify CMake code (#127683)
Due to the recent adoption of find(python), it is possible to further simplify some CMake code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127683
Approved by: https://github.com/ezyang
2024-06-05 15:17:31 +00:00
cyy
df75a9dc80 Remove Caffe2/onnx (#127991)
Remove Caffe2/onnx since it is not used. Other tiny fixes are also applied.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127991
Approved by: https://github.com/ezyang
2024-06-05 15:10:12 +00:00
d48c25c7d1 [BE] Fix missing-prototypes errors in Metal backend (#127994)
By declaring a bunch of functions static.
Removed `USE_PYTORCH_METAL` from the list of flags that suppress `-Werror=missing-prototypes`. This will prevent regressions like the ones reported in https://github.com/pytorch/pytorch/issues/127942 from sneaking past CI, which builds PyTorch with Metal support.
Use nested namespaces
Remove spurious semicolon after TORCH_LIBRARY declaration.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127994
Approved by: https://github.com/Skylion007, https://github.com/ZainRizvi
2024-06-05 14:58:19 +00:00
8992141dba Restore MPS testing on MacOS 13 and m2 metal (#127853)
The runners are ready now (https://github.com/organizations/pytorch/settings/actions/runners?qr=label%3Amacos-m1-13). We want to keep some MacOS 13 runners for MPS coverage until MacOS 15 is out.

This also fixes the `macos-m2-14` mistake from https://github.com/pytorch/pytorch/pull/127582.

The current `macos-m2-14` runner is on 14.2 while our `macos-m1-14` has 14.4.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127853
Approved by: https://github.com/malfet
2024-06-05 14:44:00 +00:00
879d01afcb [dynamo][numpy] Add unsigned integer dtypes (#125717)
We should support these to whatever extent we can. The corresponding
`torch.uint<w>` types are defined, so I don't see an issue with
generating the various casting rules and allowing them to trace.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125717
Approved by: https://github.com/lezcano
2024-06-05 14:33:47 +00:00
4ce5322a1f Enable UFMT on test_shape_ops.py test_show_pickle.py test_sort_and_select.py (#127165)
Fixes some files in #123062

Run lintrunner on files:
test_shape_ops.py
test_show_pickle.py
test_sort_and_select.py

```bash
$ lintrunner --take UFMT --all-files
ok No lint issues.
Successfully applied all patches.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127165
Approved by: https://github.com/ezyang
2024-06-05 14:31:26 +00:00
faabda4fc9 [Inductor] Skip model_fail_to_load and eager_fail_to_run models in inductor benchmarks test (#127210)
To align with the test-infra repo, skip `model_fail_to_load` and `eager_fail_to_run` models.
See the code logic:
d3b79778f8/torchci/rockset/inductor/__sql/compilers_benchmark_performance.sql (L57-L58)
```SQL
  WHERE
    filename LIKE '%_accuracy'
    AND filename LIKE CONCAT(
      '%_', : dtypes, '_', : mode, '_', : device,
      '_%'
    )
    AND _event_time >= PARSE_DATETIME_ISO8601(:startTime)
    AND _event_time < PARSE_DATETIME_ISO8601(:stopTime)
    AND (workflow_id = :workflowId OR :workflowId = 0)
    AND accuracy != 'model_fail_to_load'
    AND accuracy != 'eager_fail_to_run'
),
```

Comp Item | Compiler | suite | Before | After fix
-- | -- | -- | -- | --
Pass Rate | Inductor | torchbench | 96%, 80/83 | 100%, 80/80
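
As a rough illustration of the filter's effect, the sketch below recomputes the pass rate with and without the two excluded statuses. The record list is made up to match the 80/83 and 80/80 counts in the table; how the three excluded models split between the two statuses is an assumption.

```python
# Statuses excluded from the pass-rate denominator, mirroring the SQL filter above.
EXCLUDED = {"model_fail_to_load", "eager_fail_to_run"}


def pass_rate(statuses, skip_excluded):
    """Return (passed, total) over a list of accuracy statuses."""
    if skip_excluded:
        statuses = [s for s in statuses if s not in EXCLUDED]
    passed = sum(1 for s in statuses if s == "pass")
    return passed, len(statuses)


# Hypothetical data: 80 passing models plus 3 that never reached the compiler.
statuses = ["pass"] * 80 + ["model_fail_to_load"] * 2 + ["eager_fail_to_run"]

before = pass_rate(statuses, skip_excluded=False)  # counts the 3 excluded models
after = pass_rate(statuses, skip_excluded=True)    # drops them from the total
```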

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127210
Approved by: https://github.com/jansel
2024-06-05 14:23:09 +00:00
c3949b20a1 Opt model save and load (#126374)
## save&load support for OptimizedModule

[Issue Description](https://github.com/pytorch/pytorch/pull/101651)

English is not my native language; please excuse typing errors.

This pr is based on commit b9588101c4d3411b107fdc860acfa8a72c642f91\
I'll do something with the merge conflicts later

### test result for test/dynamo

Conclusion:\
It performs the same as before as far as I can see.

ENV(CPU only):\
platform linux -- Python 3.10.14, pytest-7.3.2, pluggy-1.5.0\
configfile: pytest.ini\
plugins: anyio-3.7.1, cpp-2.3.0, flakefinder-1.1.0, xdist-3.3.1, xdoctest-1.1.0, metadata-3.1.1, html-4.1.1, hypothesis-5.35.1, rerunfailures-14.0

#### before this pr:

[before](https://github.com/pytorch/pytorch/files/15329370/before.md)

#### after this pr:

[after](https://github.com/pytorch/pytorch/files/15329376/after.md)

### some changes

1. add test_save_and_load to test/dynamo/test_modules.py with & without "backend='inductor'"
2. add \_\_reduce\_\_ function to OptimizedModule and derived classes of _TorchDynamoContext for pickling & unpickling
3. change the wrappers into wrapper classes ( including convert_frame_assert, convert_frame, catch_errors_wrapper in torch/_dynamo/convert_frame.py & wrap_backend_debug in torch/_dynamo/repro/after_dynamo.py )
4. change self.output.compiler_fn into innermost_fn(self.output.compiler_fn) in torch/_dynamo/symbolic_convert.py to get the origin compiler_fn and to avoid the "compiler_fn is not eager" condition
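
To illustrate why changes 2 and 3 go together: a wrapper written as a closure cannot be pickled, whereas a wrapper class can describe how to rebuild itself via `__reduce__`. The sketch below is a minimal stand-alone illustration; `BackendWrapper` and `eager_backend` are hypothetical names, not the classes touched by this PR.

```python
def eager_backend(graph):
    """Hypothetical stand-in for a compiler backend: returns its input unchanged."""
    return graph


class BackendWrapper:
    """Hypothetical wrapper class used in place of a closure-based wrapper."""

    def __init__(self, inner):
        self.inner = inner

    def __call__(self, graph):
        return self.inner(graph)

    def __reduce__(self):
        # Tell pickle to rebuild the wrapper by calling BackendWrapper(inner).
        return (BackendWrapper, (self.inner,))


wrapper = BackendWrapper(eager_backend)
# pickle follows the same protocol: call the returned factory with the args.
factory, args = wrapper.__reduce__()
rebuilt = factory(*args)
```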

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126374
Approved by: https://github.com/msaroufim, https://github.com/jansel
2024-06-05 13:01:16 +00:00
9a8ab778d3 Revert "[BE]: Update cudnn to 9.1.0.70 (#123475)"
This reverts commit c490046693e77e254664e19d940e9b05a1da18ef.

Reverted https://github.com/pytorch/pytorch/pull/123475 on behalf of https://github.com/huydhn due to CUDA trunk jobs are pretty red after this change, and the forward fix https://github.com/pytorch/pytorch/pull/127984 does not look working ([comment](https://github.com/pytorch/pytorch/pull/123475#issuecomment-2149258430))
2024-06-05 08:59:53 +00:00
bb2de3b101 Fixed broken link and removed unfinished sentence from issue #126367 (#127938)
Fixes #126367.

## Description

Fixed a broken link in the pytorch/docs/source/torch.compiler_faq.rst doc and deleted a few words that were extra according to the issue tagged above.

## Checklist
- [X] The issue that is being fixed is referred in the description
- [X] Only one issue is addressed in this pull request
- [X] Labels from the issue that this PR is fixing are added to this pull request
- [X] No unnecessary issues are included into this pull request

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127938
Approved by: https://github.com/msaroufim
2024-06-05 07:37:32 +00:00
4a384d813b [SDPA/memeff] Backport changes from xFormers to PT (#127090)
Backporting a few fixes from xFormers:
* Bug fixes for local attention (which is not exposed in PT at the moment)
* Massively reduced memory usage on the BW pass (see also https://github.com/facebookresearch/xformers/pull/1028)

Essentially this will also make the xFormers build process much easier, as we will be able to use mem-eff from PyTorch (if the user has a recent enough version) rather than building it at xFormers install time.
The goal is to have the source of truth for these files in PT moving forward, and remove them from xFormers eventually once our users have a recent-enough version of PT.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127090
Approved by: https://github.com/drisspg
2024-06-05 07:33:27 +00:00
cyy
b054470db2 Remove unused functions (#127881)
Some unused functions detected by g++ warnings can be removed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127881
Approved by: https://github.com/zou3519
2024-06-05 05:21:24 +00:00
30788739f4 [c10d] add a simple test to demonstrate the user usage of collectives (#127665)
Summary:
While playing around with the UT, I thought it would be good to give a simple
example of a user function that can be used with different subclasses of
_ControlCollectives, and to test that the user function executes
correctly.

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127665
Approved by: https://github.com/d4l3k
2024-06-05 04:32:11 +00:00
e505132797 [export] track TORCH_DYNAMO_DO_NOT_EMIT_RUNTIME_ASSERTS for export runtime asserts (#127554)
Track TORCH_DYNAMO_DO_NOT_EMIT_RUNTIME_ASSERTS=1 in export so it doesn't omit runtime asserts.

Differential Revision: D57978699

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127554
Approved by: https://github.com/tugsbayasgalan
2024-06-05 04:16:54 +00:00
d5cb5d623a Revert "Complete revamp of float/promotion sympy handling (#126905)"
This reverts commit fb696ef3aa34e20c0fef1c0210a397abd3ea5885.

Reverted https://github.com/pytorch/pytorch/pull/126905 on behalf of https://github.com/ezyang due to internal user reported ceiling equality simplification problem, I have a plan ([comment](https://github.com/pytorch/pytorch/pull/126905#issuecomment-2148805840))
2024-06-05 03:57:58 +00:00
55a4ef80c4 [pipelining] test pipeline_order in schedule (#127559)
Add a unit test to validate the pipeline order for different `num_stages`, `num_microbatches`, `num_world_size` combinations. This doesn't actually run the schedule; it only validates that the ordering of processed microbatches is valid, and therefore doesn't require GPUs / multiple processes.

Will add more combinations and negative tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127559
Approved by: https://github.com/wconstab
ghstack dependencies: #127084, #127332
2024-06-05 03:51:27 +00:00
71e684bfae [BE][Mac] Add missing prototypes (#127988)
Really confused how CI did not catch this one, but this triggers missing-prototype errors if compiled from scratch on MacOS Sonoma using clang-15

Fixes https://github.com/pytorch/pytorch/issues/127942

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127988
Approved by: https://github.com/Skylion007, https://github.com/huydhn
2024-06-05 02:16:50 +00:00
cyy
ce4436944c Fix IOS builds (#127985)
IOS builds fail these days, fix it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127985
Approved by: https://github.com/ezyang
2024-06-05 02:14:43 +00:00
a135776307 Remove tensor subclass detection logic from weights_only unpickler (#127808)
Remove logic to auto-detect and allow subclasses that did not override certain methods from the weights_only unpickler from https://github.com/pytorch/pytorch/pull/124331 for 2.4 release

Subclasses should be loadable using `torch.serialization.add_safe_globals`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127808
Approved by: https://github.com/malfet
2024-06-05 02:14:30 +00:00
8e496046e5 Update torch-xpu-ops pin (ATen XPU implementation) (#127879)
Support AMP GradScaler.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127879
Approved by: https://github.com/EikanWang
2024-06-05 02:13:46 +00:00
6c07e2c930 fix redundant tensor (#127850)
As title.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127850
Approved by: https://github.com/mikaylagawarecki
2024-06-05 02:03:02 +00:00
8830b81208 [c10d] Add commCreateFromRanks to c10d (#127421) (#127982)
This is a duplicate of https://github.com/pytorch/pytorch/pull/127421, which we can't merge. It has already landed internally.

Summary:

`ncclCommCreateFromRanks` - described in this [document](https://docs.google.com/document/d/1QIRkAO4SAQ6eFBpxE51JmRKRAH2bwAHn8OIj69XuFqQ/edit#heading=h.5g71oqe3soez), replaces `ncclCommSplit` in NCCLX versions 2.21.5+.  The difference is that `ncclCommCreateFromRanks` is given a list of active ranks and is collective only over those ranks as opposed to `ncclCommSplit` for which you give it a color for every rank including NO_COLOR for inactive ranks and the collective is over the entire world.
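
The difference in calling convention can be modeled in a few lines of plain Python. This is only a toy model of the semantics described above, not NCCL code; `split_by_color`, `create_from_ranks`, and the value of the `NO_COLOR` sentinel are illustrative assumptions.

```python
NO_COLOR = -1  # hypothetical sentinel for ranks that join no subgroup


def split_by_color(colors):
    """Model of a split: every rank in the world supplies a color.

    Returns (participants, groups): all ranks take part in the call,
    and ranks sharing a color end up in the same subgroup.
    """
    participants = list(range(len(colors)))
    groups = {}
    for rank, color in enumerate(colors):
        if color != NO_COLOR:
            groups.setdefault(color, []).append(rank)
    return participants, groups


def create_from_ranks(active_ranks):
    """Model of create-from-ranks: only the listed ranks take part."""
    participants = sorted(active_ranks)
    return participants, participants


# World of 4 ranks; the subgroup is made of ranks 1 and 3.
split_participants, split_groups = split_by_color([NO_COLOR, 0, NO_COLOR, 0])
ranks_participants, ranks_group = create_from_ranks([1, 3])
```

In the toy model, the split involves all four ranks even though two of them join nothing, while the rank-list call involves only ranks 1 and 3.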

This diff connects `ncclCommCreateFromRanks` to `c10d`

`ncclCommSplit` will still be available at the NCCL API but, in this diff, is not used starting at version 2.21.5

Split the python test and implementation of `split()` for internal FB and external OSS builds.

The diff defines `"USE_C10D_NCCL_FBCODE"` as a compiler option. When defined, we use the version of split in the newly created `NCCLUtils.cpp` in the `fb` directory.  The `fb` directory is not *shipit*-ed to *github*.

The same API is used for `split()` in both the `ncclx` and `nccl` versions adding `ranks` to the API.  This argument is not used in the `nccl` version nor in the 2.18 `ncclx` version where `ncclCommSplit()` is used instead of `ncclCommCreateFromRanks()` in `ncclx`

This diff was squashed with D57343946 - see D57343946 for additional review comments.

Test Plan:
for 2.18.3-1 and 2.21.5-1 versions:
```
buck2 run fbcode//mode/opt -c param.use_nccl=True -c fbcode.nvcc_arch=a100 -c hpc_comms.use_ncclx="$VERSION" -c fbcode.enable_gpu_sections=true  fbcode//caffe2/test/distributed/fb:test_comm_split_subgroup_x
```

```
BUILD SUCCEEDED
...
ok

----------------------------------------------------------------------
Ran 1 test in 10.210s

OK
~/scripts
```

OSS build:
`[cmodlin@devgpu003.vll5 ~/fbsource/third-party/ncclx/v2.21.5-1 (e56338cfa)]$ ./maint/oss_build.sh`

OSS build output:
```
...
ncclCommHash 197dce9b413e2775
nccl commDesc example_pg
Dump from comm 0x4708aa0 rings: [[0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0]]
Dump from comm 0x4708aa0 commDesc: example_pg
Dump from comm 0x4708aa0 nRanks: 1
Dump from comm 0x4708aa0 nNodes: 1
Dump from comm 0x4708aa0 node: 0
Dump from comm 0x4708aa0 localRanks: 1
Dump from comm 0x4708aa0 localRank: 0
Dump from comm 0x4708aa0 rank: 0
Dump from comm 0x4708aa0 commHash: "197dce9b413e2775"

2024-05-24T09:02:54.385543 devgpu003:3040664:3040744 [0][AsyncJob]ctran/backends/ib/CtranIb.cc:143 NCCL WARN CTRAN-IB : No active device found.

2024-05-24T09:02:54.385607 devgpu003:3040664:3040744 [0][AsyncJob]ctran/mapper/CtranMapper.cc:187 NCCL WARN CTRAN: IB backend not enabled
Created NCCL_SPLIT_TYPE_NODE type splitComm 0x11c76d0, rank 0
~/fbsource/third-party/ncclx/v2.21.5-1
```

Reviewed By: wconstab, wesbland

Differential Revision: D56907877

Co-authored-by: Cory Modlin <cmodlin@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127982
Approved by: https://github.com/izaitsevfb
2024-06-05 00:19:52 +00:00
7fdfb88f03 [pipelining] rewrite interleaved 1f1b (#127332)
## Context

Interleaved 1F1B has multiple points in the schedule where communication is criss-crossed across ranks, leading to hangs due to (1) the looped nature of the schedules and (2) the batched nature of forward + backward in the 1F1B phase.

<img width="1370" alt="image" src="https://github.com/pytorch/pytorch/assets/14858254/a07c2b1d-8a99-420b-9ba3-32a0115d228b">

In the current implementation, it is difficult to fix these hangs since doing so requires a `dist.recv` from a prior point in time, but each rank operates on its own step schedule and does not know enough about other ranks' operations to perform the `recv` prior to their own `send`.

## New implementation

The new implementation is split into 2 parts:

1. Creating the pipeline order.

Each rank will create the timestep normalized ordering of all schedule actions across all ranks. This is created once during the initialization of the schedule class. The timestep between each rank is normalized as each rank can only have 1 computation action (forward or backward) during that timestep.

<img width="1065" alt="image" src="https://github.com/pytorch/pytorch/assets/14858254/196f2347-7ff4-49cf-903b-d8db97d1156f">

2. Executing the pipeline order.

Once the pipeline order is determined, execution is simple because each rank will perform its send to its peer (based on whether they did forward and backward). Now that each rank has a global understanding of the schedule, it can check its previous and next neighbor ranks to see if it needs to recv any activations/gradients from them. Therefore, during execution, each rank is aligned and executing the same time step.

## Benefits

- Implementation is faster since 1f1b computation can now be split up in two time steps, 1 for forward and 1 for backward.
- Debugging is easier since we can now determine which timestep each rank is hung on
- Testing is easier since we can just validate the pipeline order, without running the schedule. This allows us to test on a large number of ranks without actually needing the GPUs.
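
A minimal sketch of what validating a pipeline order without running it could look like. It uses a deliberately simple forward-only schedule with one stage per rank, not the interleaved 1F1B order built in this PR, and `build_forward_order` and `validate_order` are hypothetical helpers.

```python
def build_forward_order(num_ranks, num_microbatches):
    """Toy schedule: rank r runs the forward of microbatch m at timestep r + m.

    Returns {rank: [action or None per timestep]}, with actions as ("F", m).
    """
    num_steps = num_ranks + num_microbatches - 1
    order = {rank: [None] * num_steps for rank in range(num_ranks)}
    for rank in range(num_ranks):
        for mb in range(num_microbatches):
            order[rank][rank + mb] = ("F", mb)
    return order


def validate_order(order, num_microbatches):
    """Check that each rank runs every microbatch once, after the previous rank."""
    seen_at = {}  # (rank, microbatch) -> timestep
    for rank, actions in order.items():
        for step, action in enumerate(actions):
            if action is None:
                continue
            _, mb = action
            if (rank, mb) in seen_at:
                return False  # duplicate forward
            seen_at[(rank, mb)] = step
    for rank in order:
        for mb in range(num_microbatches):
            if (rank, mb) not in seen_at:
                return False  # missing forward
            if rank > 0 and seen_at[(rank - 1, mb)] >= seen_at[(rank, mb)]:
                return False  # input not yet produced by the previous rank
    return True


# No processes or GPUs are involved: the order is just data.
order = build_forward_order(num_ranks=4, num_microbatches=8)
is_valid = validate_order(order, num_microbatches=8)
```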

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127332
Approved by: https://github.com/wconstab
ghstack dependencies: #127084
2024-06-04 23:46:05 +00:00
1f67cfd437 [inductor] raise tolerance for cspdarknet (#127949)
cspdarknet was previously flaky, but after https://github.com/pytorch/pytorch/pull/127367 it fails consistently. This is probably due to a small numerical change from that PR, which makes Inductor generate different code because of different loop orders.

Raise tolerance to pass CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127949
Approved by: https://github.com/atalman, https://github.com/nWEIdia, https://github.com/eqy
2024-06-04 23:28:20 +00:00
907cb28f67 Revert "Inductor: Allow small sizes of m for mixed mm autotuning (#127663)"
This reverts commit d8d0bf264a736c7fb3cd17799a1c1aba4addf8d9.

Reverted https://github.com/pytorch/pytorch/pull/127663 on behalf of https://github.com/soulitzer due to breaks torch ao CI, see: https://github.com/pytorch/pytorch/issues/127924 ([comment](https://github.com/pytorch/pytorch/pull/127663#issuecomment-2148554128))
2024-06-04 23:06:43 +00:00
f4b05ce683 Add registry for TorchScript to ExportedProgram conversion (#127464)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127464
Approved by: https://github.com/ydwu4, https://github.com/angelayi
2024-06-04 22:53:00 +00:00
0eb9ec958a Revert "Inductor respects strides for custom ops by default (#126986)" (#127923)
This reverts commit dd64ca2a02434944ecbc8f3e186d44ba81e3cb26.

There's a silent incorrectness bug with needs_fixed_stride_order=True and
mutable custom ops, so it's better to flip the default back to avoid
silent incorrectness.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127923
Approved by: https://github.com/williamwen42
2024-06-04 22:25:45 +00:00
20f966a8e0 Ignore undocumented PipelineSchedule.step (#127955)
Ignore undocumented PipelineSchedule.step to fix doc build:

https://github.com/pytorch/pytorch/actions/runs/9372492435/job/25805861083?pr=127938#step:11:1284

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127955
Approved by: https://github.com/kit1980
2024-06-04 22:11:09 +00:00
a7b1dd82ff Default XLA to use swap_tensors path in nn.Module._apply (#126814)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126814
Approved by: https://github.com/JackCaoG, https://github.com/albanD
ghstack dependencies: #127313
2024-06-04 21:40:49 +00:00
1b704a160f Add linker script optimization flag to CMAKE rule for CUDA ARM wheel (#127514)
Original PR - https://github.com/pytorch/pytorch/pull/127220

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127514
Approved by: https://github.com/Aidyn-A, https://github.com/atalman
2024-06-04 20:51:44 +00:00
6dc0a291b9 Revert "[dynamo] Bugfix for nn parameter construction (#127806)"
This reverts commit f27c4dd862bf79f37019ef277957cd577d57b66f.

Reverted https://github.com/pytorch/pytorch/pull/127806 on behalf of https://github.com/PaliC due to causing nn tests to fail ([comment](https://github.com/pytorch/pytorch/pull/127806#issuecomment-2148393903))
2024-06-04 20:51:41 +00:00
597922ba21 Reapply "distributed debug handlers (#126601)" (#127805)
This reverts commit 7646825c3eb687030c4f873b01312be0eed80174.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127805
Approved by: https://github.com/PaliC
2024-06-04 19:44:30 +00:00
e76b28c765 [dtensor][debug] added c10d alltoall_ and alltoall_base_ to CommDebugMode (#127360)
**Summary**
Added c10d alltoall_ and alltoall_base tracing to CommDebugMode and edited test case in test_comm_mode to include added features.

**Test Plan**
pytest test/distributed/_tensor/debug/test_comm_mode.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127360
Approved by: https://github.com/wz337, https://github.com/XilunWu, https://github.com/yifuwang
ghstack dependencies: #127358
2024-06-04 18:29:48 +00:00
01e6d1cae4 [dtensor][debug] added c10d reduce_scatter_ and reduce_scatter_tensor_coalesced tracing_ to CommDebugMode (#127358)
**Summary**
Added c10d reduce_scatter_ and reduce_scatter_tensor_coalesced tracing to CommDebugMode and edited test case in test_comm_mode to include added features.

**Test Plan**
pytest test/distributed/_tensor/debug/test_comm_mode.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127358
Approved by: https://github.com/wz337, https://github.com/XilunWu, https://github.com/yifuwang
2024-06-04 18:29:48 +00:00
9a25ff77af Revert "[inductor] Enable subprocess-based parallel compile as the default (#126817)"
This reverts commit cf77e7dd9770caf65e898ac2ee82045aa0408e30.

Reverted https://github.com/pytorch/pytorch/pull/126817 on behalf of https://github.com/huydhn due to There are lots of flaky inductor failure showing up in trunk after this commit cf77e7dd97, so I am trying to revert this to see if this helps ([comment](https://github.com/pytorch/pytorch/pull/126817#issuecomment-2148143502))
2024-06-04 18:26:12 +00:00
f27c4dd862 [dynamo] Bugfix for nn parameter construction (#127806)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127806
Approved by: https://github.com/jansel
ghstack dependencies: #127785, #127802
2024-06-04 18:25:46 +00:00
569c5e72e7 [dynamo] Unspec nn module when global backward hooks are present (#127802)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127802
Approved by: https://github.com/jansel
ghstack dependencies: #127785
2024-06-04 18:25:46 +00:00
c7e936a56a [dynamo] Tensorvariable - track grad with _grad field (#127785)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127785
Approved by: https://github.com/jansel
2024-06-04 18:25:46 +00:00
3bcc3cddb5 Using scalarType instead string in function _group_tensors_by_device_and_dtype. (#127869)
Now that torch.dtype can pass through pybind11, modify the function _group_tensors_by_device_and_dtype to use the scalar type directly, without converting between torch.dtype and string on the Python and C++ sides.
@ezyang @bdhirsh
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127869
Approved by: https://github.com/ezyang
2024-06-04 18:19:33 +00:00
0ff60236ab Revert "Retire torch.distributed.pipeline (#127354)"
This reverts commit b9c058c203ee38032594f898f27cd8404f113a63.

Reverted https://github.com/pytorch/pytorch/pull/127354 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the doc build failure looks legit b9c058c203 ([comment](https://github.com/pytorch/pytorch/pull/127354#issuecomment-2148133982))
2024-06-04 18:19:31 +00:00
627d2cd87d [CI] disable td for xpu ci test by default (#127611)
Because TD has been enabled by default for the xpu CI tests, a lot of test cases (75%) have been skipped in CI. This let some failures escape the CI tests, for example issue #127539. This PR depends on PR #127595 having landed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127611
Approved by: https://github.com/etaf, https://github.com/atalman
2024-06-04 17:15:10 +00:00
36e9b71613 Enable UFMT on test/test_jit_fuser_te.py (#127759)
Part of #123062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127759
Approved by: https://github.com/ezyang
2024-06-04 16:56:03 +00:00
ff32f6c93b Use freshly traced jit-traced module to be used in export analysis (#127577)
Summary: When we export an already traced module, it seems to modify some global state, causing the traced modules to fail to run. For now, we are only logging for test cases, so it is probably OK to trace a fresh copy to be used in export.

Test Plan: CI

Differential Revision: D57983518

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127577
Approved by: https://github.com/pianpwk
2024-06-04 16:54:23 +00:00
c490046693 [BE]: Update cudnn to 9.1.0.70 (#123475)
cuDNN has managed to upload cu11 and cu12 wheels for ~~9.0.0.312~~ 9.1.0.70, so trying this out...

CC @Skylion007 @malfet

Co-authored-by: Wei Wang <weiwan@nvidia.com>
Co-authored-by: atalman <atalman@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123475
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/nWEIdia
2024-06-04 16:33:06 +00:00
97ea2b5d83 documentation for pattern_matcher.py (#127459)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127459
Approved by: https://github.com/oulgen
ghstack dependencies: #127457, #127458
2024-06-04 15:24:47 +00:00
7a60a75256 Add typing annotations to pattern_matcher.py (#127458)
Turn on `mypy: disallow-untyped-defs` in pattern_matcher.py and fix the fallout.

There are still a bunch of `type: ignore` annotations which should eventually be ironed out.

In the process, found a bug: #127457

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127458
Approved by: https://github.com/Skylion007
ghstack dependencies: #127457
2024-06-04 15:24:47 +00:00
9adfa143d7 fix post_grad pattern (#127457)
The lowering pattern built by cuda_and_enabled_mixed_mm_and_not_int8() was using ListOf() incorrectly - ListOf() is meant to represent a single repeating pattern - but cuda_and_enabled_mixed_mm_and_not_int8() was passing two patterns - I think based on the comment it's trying to build a sequence which would be represented by an actual list, not ListOf().

The behavior of the existing pattern would be to pass the second pattern as the `partial` parameter of `ListOf` which is meant to be a boolean - so it's almost certainly not what was intended.
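
The kind of mistake described here is easy to make without type annotations. The sketch below uses a hypothetical `ListOfSketch` class whose signature is assumed from the description above (a pattern followed by a boolean `partial`); it is not the real `ListOf`.

```python
class ListOfSketch:
    """Hypothetical stand-in for a matcher of one repeating pattern."""

    def __init__(self, pattern, partial=False):
        self.pattern = pattern
        self.partial = partial  # intended to be a bool


# Meant as a sequence of two patterns, but the second one lands in `partial`.
misused = ListOfSketch("pattern_a", "pattern_b")

# A plain list is what represents a sequence of patterns.
sequence = ["pattern_a", "pattern_b"]
```

Because any non-empty value is truthy, the misuse runs without error and simply behaves as if `partial=True` had been passed.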

I tried changing it to be what I thought was the intended behavior but then the resnet152 test failed accuracy - so I'm just preserving the existing behavior with the correct parameter types.

Found when adding annotations to pattern_matcher.py (#127458)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127457
Approved by: https://github.com/oulgen
2024-06-04 15:24:41 +00:00
cyy
f8c6d43524 Concat namespaces and other fixes in torch/csrc/utils (#127833)
It contains formatting and other minor fixes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127833
Approved by: https://github.com/ezyang
2024-06-04 15:12:45 +00:00
91461601b6 [TORCH_FA2_flash_api] Update total_q to the reshaped query 0th dimension (#127524)
There is a difference (and a bug) between the TORCH_FA2_flash_api:**mha_varlen_fwd** and FA2_flash_api:**mha_varlen_fwd** at the query transposition (GQA) step.

```
at::Tensor temp_q = q;
if (seqlenq_ngroups_swapped) {
        temp_q = q.reshape( ...
 ...
}
const int total_q = q.sizes()[0];
CHECK_SHAPE(temp_q, total_q, num_heads, head_size_og);
```

When doing query transposition, we need to update total_q to the 0th dimension of the reshaped query, i.e.:
```
const int total_q = temp_q.sizes()[0];
 ```

In the original FA2_flash_api:**mha_varlen_fwd** they don't introduce a new variable temp_q but overwrite the q value directly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127524
Approved by: https://github.com/drisspg
2024-06-04 14:44:45 +00:00
c209fbdc53 [inductor] Fix missing unbacked def for unbacked in input expr (#127770)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127770
Approved by: https://github.com/ezyang
2024-06-04 14:43:01 +00:00
cyy
059cae6176 [Caffe2] Remove Caffe2 proto and other files (#127655)
Remove Caffe2 proto files altogether.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127655
Approved by: https://github.com/ezyang
2024-06-04 14:22:21 +00:00
4c074a9b8b Revert "[torchbind] always fakify script object by default in non-strict export (#127116)"
This reverts commit c27882ffa8c1c7e4cf8ebc6c2f879e5b6c8814ad.

Reverted https://github.com/pytorch/pytorch/pull/127116 on behalf of https://github.com/atalman due to Failing internal tests ([comment](https://github.com/pytorch/pytorch/pull/127116#issuecomment-2147459339))
2024-06-04 12:53:19 +00:00
fb696ef3aa Complete revamp of float/promotion sympy handling (#126905)
At a high level, the idea behind this PR is:

* Make it clearer what the promotion and int/float rules for various Sympy operations are. Operators that previously were polymorphic over int/float are now split into separate operators for clarity. We never do mixed int/float addition/multiplication etc in sympy, instead, we always promote to the appropriate operator. (However, equality is currently not done correctly.)
* Enforce strict typing on ValueRanges: if you have a ValueRange for a float, the lower and upper MUST be floats, and so forth for integers.

The story begins in **torch/utils/_sympy/functions.py**. Here, I make some changes to how we represent certain operations in sympy expressions:

* FloorDiv now only supports integer inputs; to do float floor division, do a truediv and then a trunc. Additionally, we remove the divide out addition by gcd optimization, because sympy gcd is over fields and is willing to generate rationals (but rationals are bad for ValueRange strict typing).
* ModularIndexing, LShift, RShift now assert they are given integer inputs.
* Mod only supports integer inputs; eventually we will support FloatMod (left for later work, when we build out Sympy support for floating operations). Unfortunately, I couldn't assert integer inputs here, because of a bad interaction with sympy's inequality solver that is used by the offline solver
* TrueDiv is split into FloatTrueDiv and IntTrueDiv. This allows for us to eventually generate accurate code for Python semantics IntTrueDiv, which is written in a special way to preserve precision when the inputs are >= 2**53 beyond what first coercing the integer to floats and then doing true division.
* Trunc is split to TruncToFloat and TruncToInt.
* Round is updated to return a float, not an int, making it consistent with the round op handler in Inductor. To get Python-style conversion to int, we call TruncToInt on the result.
* RoundDecimal updated to consistently only ever return a float
* Add ToFloat for explicit coercion to float (required so we can enforce strict ValueRanges typing)

In **torch/__init__.py**, we modify SymInt and SymFloat to appropriately call into new bindings that route to these refined sympy operations.  Also, we modify `torch.sym_min` and `torch.sym_max` to have promotion semantics (if one argument is a float, the return result is always a float), making them inconsistent with builtins.min/max, but possible to do type analysis without runtime information.
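
The promotion rule for `torch.sym_min`/`torch.sym_max` can be contrasted with the builtins in plain Python. `promoting_min` below is a hypothetical helper written to mimic the described semantics on ordinary numbers; it is not the actual implementation.

```python
def promoting_min(a, b):
    """Return min(a, b), as a float whenever either argument is a float."""
    result = min(a, b)
    if isinstance(a, float) or isinstance(b, float):
        return float(result)
    return result


# Builtin: the result type depends on which value happens to be smaller.
builtin_result = min(1, 2.0)

# Promoting: the result type depends only on the argument types.
promoted_result = promoting_min(1, 2.0)
```

With the promoting variant, the result type is known from the input types alone, which is what makes type analysis possible without runtime values.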

We also need to introduce some new op handlers in **torch/_inductor/ops_handler.py**:

* `to_int` for truncation to int64, directly corresponding to TruncToInt; this can be implemented by trunc and dtype, but with a dedicated handler it is more convenient for roundtripping in Sympy
* `int_truediv` for Python-style integer true division, which has higher precision than casting to floats and then running `truediv`
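
The precision gap that motivates `int_truediv` is visible in plain Python, whose `/` on two ints is correctly rounded. In this sketch the operands are chosen by hand so that converting them to floats first changes the result.

```python
q = 2**54 - 2   # exactly representable as a float
b = 5
a = q * b       # needs more than 53 significant bits, so float(a) must round

exact = a / b                      # Python-style integer true division
via_floats = float(a) / float(b)   # coerce both operands first, then divide

# `exact` recovers q, while `via_floats` lands on a neighboring float.
precision_lost = exact != via_floats
```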

These changes have consequences. First, we need to make some administrative changes:

* Actually wire up these Sympy functions from SymInt/SymFloat in **torch/fx/experimental/sym_node.py**, including the new promotion rules (promote2)
* Add support for new Sympy functions in **torch/utils/_sympy/interp.py**, **torch/utils/_sympy/reference.py**
  * In particular, in torch.utils._sympy.reference, we have a strong preference to NOT do nontrivial compute, instead, everything in ops handler should map to a singular sympy function
  * TODO: I chose to roundtrip mod back to our Mod function, but I think I'm going to have to deal with the C/Python inconsistency to fix tests here
* Add printer support for the Sympy functions in **torch/_inductor/codegen/common.py**, **torch/_inductor/codegen/cpp_utils.py**, **torch/_inductor/codegen/triton.py**. `int_truediv` and mixed precision equality are currently not implemented soundly, so we will lose precision in codegen for large values. TODO: The additions here are not exhaustive yet
* Update ValueRanges logic to use new sympy functions in **torch/utils/_sympy/value_ranges.py**. In general, we prefer to use the new Sympy function rather than try to roll things by hand, which is what was done previously for many VR analysis functions.

In **torch/fx/experimental/symbolic_shapes.py** we need to make some symbolic reasoning adjustments:

* Avoid generation of rational subexpressions by removing simplification of `x // y` into `floor(x / y)`. This simplification then triggers an addition simplification rule `(x + y) / c --> x / c + y / c` which is bad because x / c is a rational number now
* `_assert_bound_is_rational` is no more, we no longer generate rational bounds
* Don't intersect non-int value ranges with the `int_range`
* Support more sympy Functions for guard SYMPY_INTERP
* Assert the type of value range is consistent with the variable type

The new asserts uncovered necessary bug fixes:

* **torch/_inductor/codegen/cpp.py**, **torch/_inductor/select_algorithm.py**, **torch/_inductor/sizevars.py** - Ensure Wild/Symbol manually allocated in Inductor is marked `is_integer` so it's accepted to build expressions
* **torch/_inductor/utils.py** - make sure you actually pass in sympy.Expr to these functions
* **torch/_inductor/ir.py** - make_contiguous_strides_for takes int/SymInt, not sympy.Expr!
* **torch/export/dynamic_shapes.py** - don't use infinity to represent int ranges, instead use sys.maxsize - 1
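
As a rough illustration of the last bullet, an integer range can stay integer-typed by substituting a large finite bound for "no upper bound". This is a sketch under assumptions: `int_range` and `INT_MAX` are hypothetical names, and only the value `sys.maxsize - 1` comes from the commit message.

```python
import sys
from typing import Optional, Tuple

# Finite stand-in for "unbounded", as described in the commit message.
INT_MAX = sys.maxsize - 1


def int_range(lower: int = 0, upper: Optional[int] = None) -> Tuple[int, int]:
    """Return (lower, upper) with both ends as ints (hypothetical helper)."""
    if upper is None:
        # Avoid float("inf"): it would make the range non-integer.
        upper = INT_MAX
    if lower > upper:
        raise ValueError(f"empty range: {lower} > {upper}")
    return lower, upper


bounds = int_range(2)
print(bounds, all(isinstance(b, int) for b in bounds))
```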

Because of the removal of some symbolic reasoning that produced rationals, some of our symbolic reasoning has gotten worse and we are unable to simplify some guards. Check the TODO at **test/test_proxy_tensor.py**

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126905
Approved by: https://github.com/xadupre, https://github.com/lezcano
2024-06-04 11:47:32 +00:00
db515b6ac7 [ROCm] Fix error in torch.cuda initialisation if amdsmi is not available (#127528)
Reported in https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/15874

When nvml_count is set via 9f73c65b8f/torch/cuda/__init__.py (L834)

If amdsmi is not available, this throws an error:
```
File "python3.10/site-packages/torch/cuda/__init__.py", line 634, in _raw_device_count_amdsmi
    except amdsmi.AmdSmiException as e:
NameError: name 'amdsmi' is not defined
```
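
The failure mode in the traceback is that the `except` clause itself names `amdsmi`, so it cannot run when the import never succeeded. A minimal sketch of one way to guard against that, assuming the optional module is passed around explicitly; `import_optional` and `raw_device_count` are hypothetical helpers, not the code from this PR:

```python
import importlib
from types import ModuleType
from typing import Optional


def import_optional(name: str) -> Optional[ModuleType]:
    """Import a module if it is installed, otherwise return None."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


def raw_device_count(smi: Optional[ModuleType]) -> int:
    """Return the device count, or -1 when it cannot be queried."""
    # Check availability first: evaluating `smi.AmdSmiException` below
    # would fail if the module had never been imported.
    if smi is None:
        return -1
    try:
        smi.amdsmi_init()
        return len(smi.amdsmi_get_processor_handles())
    except smi.AmdSmiException:
        return -1


missing = import_optional("surely_not_an_installed_module_xyz")
print(raw_device_count(missing))
```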

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127528
Approved by: https://github.com/jeffdaily, https://github.com/eqy, https://github.com/pruthvistony, https://github.com/atalman
2024-06-04 11:16:02 +00:00
49048e7f26 [FSDP2] Fixed variable shadowing of module (#127776)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127776
Approved by: https://github.com/wanchaol
ghstack dependencies: #127771
2024-06-04 10:27:34 +00:00
f325b39303 Introduce Inductor passes to micro-pipeline all-gather-matmul and matmul-reduce-scatter in certain cases (#126598)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126598
Approved by: https://github.com/wanchaol
2024-06-04 09:06:56 +00:00
cf77e7dd97 [inductor] Enable subprocess-based parallel compile as the default (#126817)
Differential Revision: [D58056502](https://our.internmc.facebook.com/intern/diff/D58056502)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126817
Approved by: https://github.com/eellison
2024-06-04 07:48:32 +00:00
b9c058c203 Retire torch.distributed.pipeline (#127354)
Actually retiring module after deprecation warning for a while.
The new supported module is: torch.distributed.pipelining.
Please migrate.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127354
Approved by: https://github.com/wconstab
2024-06-04 07:03:26 +00:00
6abca6a564 [export][unflatten] More strictly respect scope when removing inputs (#127607)
Code snippet from TorchTitan (LLaMa):
```
for layer in self.layers.values():
    h = layer(h, self.freqs_cis)
```
`self.freqs_cis` is a buffer of root module (`self`).
It is also an explicit arg in the call signature of original `layer` modules.
If not respecting scope -- `freqs_cis`'s scope only corresponds to root -- `_sink_param` can remove `freqs_cis` from `layer`'s call signature, resulting in runtime error.

There are two fixes in this PR:
1. We filter out the `inputs_to_state` corresponding to the current scope, using existing code that does prefix matching.
2. We delay the removal of param inputs from `call_module` nodes' `args` until the `_sink_param` call on that submodule returns. That call now reports which inputs the submodule actually removed, which is more accurate than just doing:
```
    for node in call_module_nodes:
        node.args = tuple(filter(lambda n: n.name not in inputs_to_state, node.args))
```
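
The scope filtering in fix 1 can be sketched as a prefix match on dotted names. This is only an illustration: the shape of `inputs_to_state` (input name mapped to a list of fully qualified state names) and the helper `filter_to_scope` are assumptions, not the actual code in the PR.

```python
from typing import Dict, List


def filter_to_scope(
    inputs_to_state: Dict[str, List[str]], scope: List[str]
) -> Dict[str, List[str]]:
    """Keep only inputs whose state lives under `scope` (prefix match on path parts)."""
    kept: Dict[str, List[str]] = {}
    for name, fqns in inputs_to_state.items():
        in_scope = [fqn for fqn in fqns if fqn.split(".")[: len(scope)] == scope]
        if in_scope:
            kept[name] = in_scope
    return kept


inputs_to_state = {
    "freqs_cis": ["freqs_cis"],             # buffer owned by the root module
    "w": ["layers.0.attention.wq.weight"],  # parameter owned by layers.0
}

root_view = filter_to_scope(inputs_to_state, [])
layer_view = filter_to_scope(inputs_to_state, ["layers", "0"])
print(sorted(root_view), sorted(layer_view))
```

With this filter, `freqs_cis` is not visible when processing `layers.0`, so it would not be dropped from that layer's call signature.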

Before the PR:
![Screenshot 2024-05-31 at 1 40 24 AM](https://github.com/pytorch/pytorch/assets/6676466/a2e06b18-44d5-40ca-b242-0edab45075b7)

After the PR:
![Screenshot 2024-05-31 at 1 43 41 AM](https://github.com/pytorch/pytorch/assets/6676466/b72afb94-cdfa-420d-b88b-29a92bf2a0c0)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127607
Approved by: https://github.com/pianpwk
2024-06-04 06:43:54 +00:00
e216df48c8 [Dynamo][TVM] Fix ignored trials argument for MetaSchedule (#127747)
Fixes #127746

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127747
Approved by: https://github.com/jansel
2024-06-04 06:13:02 +00:00
2122c9e2a9 [BE] Enabled lintrunner on torch/distributed/utils.py (#127771)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127771
Approved by: https://github.com/wanchaol, https://github.com/Skylion007
2024-06-04 06:10:33 +00:00
ef77f2ca4a [pipelining] Simple 1F1B schedule (#127673)
![Screenshot 2024-05-31 at 9 13 18 PM](https://github.com/pytorch/pytorch/assets/6676466/ecf3ca24-33a6-4188-9f7c-df6e96311caa)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127673
Approved by: https://github.com/wconstab
2024-06-04 06:09:51 +00:00
f4b77ce8e2 Masked scale meta function registration #119984 (#127389)
Fixes #119984

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127389
Approved by: https://github.com/cpuhrsch
2024-06-04 06:09:17 +00:00
cyy
e7cb43a2d2 Check unused variables in tests (#127498)
Enables unused variable checks in CMake.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127498
Approved by: https://github.com/ezyang
2024-06-04 05:35:25 +00:00
2ad0e4197d [ts-migration] support aten::__is__, aten::__isnot__, aten::__not__, profiler::_record_function_enter_new, profiler::_record_function_exit (#127656)
Support more ops in ts converter and add unit tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127656
Approved by: https://github.com/SherlockNoMad
2024-06-04 04:51:29 +00:00
8d153e0bab [Inductor] Add FlexAttention backward kernel dynamic shape tests (#127728)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127728
Approved by: https://github.com/Chillee
2024-06-04 04:32:03 +00:00
e793ae220f [Inductor][Flex-attention] Support different sequence lengths for Query and Key/Value (#127678)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127678
Approved by: https://github.com/Chillee
2024-06-04 04:27:24 +00:00
dae757c971 Specify supported OS matrix (#127816)
Windows-10 or newer
manylinux-2014
MacOS-11 or newer (but only on Apple Silicon)

Fixes https://github.com/pytorch/pytorch/issues/126679

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127816
Approved by: https://github.com/kit1980, https://github.com/huydhn
2024-06-04 04:25:41 +00:00
22368eac10 [FSDP2] Fix submesh slicing to enable 3D parallelism (#127585)
Ensures that sharded parameters are created on a
submesh that excludes the Pipeline Parallelism dimension.

Also cleans up the logic for storing placements to no longer consider the outer / global dims.  Since we store an 'spmd' submesh, we can avoid this.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127585
Approved by: https://github.com/wanchaol
2024-06-04 04:24:09 +00:00
69f5b66132 [Inductor] FlexAttention backward kernel optimization (#127208)
BWD Speedups (before this PR):
```
| Type    |   Speedup | shape             | score_mod     | dtype          |
|---------|-----------|-------------------|---------------|----------------|
| Average |     0.211 |                   |               |                |
| Max     |     0.364 | (16, 16, 512, 64) | relative_bias | torch.bfloat16 |
| Min     |     0.044 | (2, 16, 4096, 64) | causal_mask   | torch.bfloat16 |
```
BWD Speedups (after this PR, though not optimizing block size yet):
```
| Type    |   Speedup | shape              | score_mod     | dtype          |
|---------|-----------|--------------------|---------------|----------------|
| Average |     0.484 |                    |               |                |
| Max     |     0.626 | (2, 16, 512, 256)  | head_bias     | torch.bfloat16 |
| Min     |     0.355 | (8, 16, 4096, 128) | relative_bias | torch.bfloat16 |
```

A few follow-ups remain:
* Optimized default block size on A100/H100.
* Support different seqlen for Q and K/V.
* Support dynamic shapes for backward.
* Enhance unit tests to check that no grad contains a `nan` value. I think we should make some changes to `test_padded_dense_causal` because it has invalid inputs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127208
Approved by: https://github.com/Chillee
2024-06-04 04:22:41 +00:00
2498ef7490 Fix scheduler typehints (#127769)
Fixes scheduler typehints

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127769
Approved by: https://github.com/jansel
2024-06-04 04:19:06 +00:00
6580a18f86 [c10d][BE] fix test_init_pg_and_rpc_with_same_socket (#127654)
**Summary**
Fix `test_init_pg_and_rpc_with_same_socket` in `test/distributed/test_store.py`, which did not destroy the created ProcessGroup before exiting the test function. This led to an "init PG twice" error in the test.

**Test Plan**
`pytest test/distributed/test_store.py -s -k test_init_pg_and_rpc_with_same_socket`
`ciflow/periodic` since this test is included in `.ci/pytorch/multigpu-test.sh`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127654
Approved by: https://github.com/Skylion007, https://github.com/malfet
2024-06-04 04:00:28 +00:00
7e906ec9e5 [PT2][Optimus] Improve group batch fusion with same parent/users fusion enablement (#127648)
Summary:
Currently, we fuse the ops at arbitrary places. Here we enable fusion of ops that share the same parent/users, which allows potential follow-up split-cat elimination.

Context

https://docs.google.com/document/d/1MSZY23wKD2keW2Z-DfAI1DscDERHKjOJAnuB5bxa06I/edit

Test Plan:
# local reproduce

```
buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode batch-split --model_type "pm_cmf" --flow_id 559694026
```
P1386889671

Differential Revision: D58037636

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127648
Approved by: https://github.com/jackiexu1992
2024-06-04 03:41:44 +00:00
c32fe6b279 [FSDP] keep paras in torch.distributed.checkpoint.state_dict.set_optimizer_state_dict (#127644)
Fixes https://github.com/pytorch/pytorch/issues/126948
In the previous code, in the `_load_optim_state_dict` function under the `info.broadcast_from_rank0` condition, `optim_state_dict` holds the parameters based on `optim`.
The changes here aim to synchronize the differing parameters.
Unit tests are in `test_state_dict.py`, in `test_optim_state_dict_para_matching`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127644
Approved by: https://github.com/fegin
2024-06-04 03:32:22 +00:00
4d0386ce1c [torch/jit-runtime] Add explicit include of <chrono> to torch/jit/run… (#127779)
Added an explicit include to `<chrono>` in `jit/runtime/logging.h` since `std::chrono::time_point<std::chrono::high_resolution_clock>` is directly referenced in the header.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127779
Approved by: https://github.com/albanD
2024-06-04 02:12:17 +00:00
ddef7c350f Add comments about runner labels (#127827)
To distinguish between org-wide and repo-specific runners as well as highlight where they are hosted (by DevInfra, LF or various partners)

Delete unused `bm-runner`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127827
Approved by: https://github.com/huydhn
2024-06-04 02:06:43 +00:00
1208347d09 [inductor][ez] fix loop ordering test (#127807)
I didn't realize that the main block is not being run when inductor tests are being run in FBCode via remote GPUs. This is a quick fix. I've tested it in both OSS and FBCode.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127807
Approved by: https://github.com/eellison, https://github.com/jansel
2024-06-04 01:14:34 +00:00
41033a4274 PyPI: fix link to images to be rendered (#127798)
It addresses a long-pending issue on PyPI. The [package description](https://pypi.org/project/torch/2.3.0/) is the repo's README, but unlike GitHub's rendering, PyPI accepts only raw images linked via Markdown image syntax.
![image](https://github.com/pytorch/pytorch/assets/6035284/1d8e51d5-c8c1-4f92-b323-f7684879adb4)
This minor link edit points at the raw images so that they render correctly on PyPI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127798
Approved by: https://github.com/albanD
2024-06-04 00:59:58 +00:00
cyy
05fa05cbae [2/N] Change static functions in headers to inline (#127764)
Follows #127727

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127764
Approved by: https://github.com/Skylion007
2024-06-04 00:49:04 +00:00
dbf39a6e63 [inductor] fix linear_add_bias path (#127597)
Previously, the `linear_add_bias` path did not work.
This PR fixes it and adds more unit tests for it.

**TestPlan**
```
python test/inductor/test_mkldnn_pattern_matcher.py -k test_linear_add_bias
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127597
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-06-04 00:39:01 +00:00
b42cfcabc4 Lift jagged -> padded dense forward / backward kernels from fbgemm_gpu (#125946)
PyTorch can't depend on `fbgemm_gpu` as a dependency because `fbgemm_gpu` already has a dependency on PyTorch. So this PR copy / pastes kernels from `fbgemm_gpu`:
* `dense_to_jagged_forward()` as CUDA registration for new ATen op `_padded_dense_to_jagged_forward()`
* `jagged_to_padded_dense_forward()` as CUDA registration for new ATen op `_jagged_to_padded_dense_forward()`

CPU impls for these new ATen ops will be added in a follow-up PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125946
Approved by: https://github.com/davidberard98
2024-06-03 23:41:54 +00:00
eqy
ac568fc007 [CUDNN] Remove defunct cuDNN V8 API build flag (#120006)
The flag basically does nothing following #95722

Let's see if the quantization tests break

CC @malfet @atalmanagement

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120006
Approved by: https://github.com/malfet
2024-06-03 22:42:05 +00:00
0e7bd7fedd [ROCm] TunableOp improvements (#124362)
- use less memory; smaller default hipblaslt workspace size
- options to avoid cache effects
  - icache flush option
  - rotating buffers during tuning
- python APIs
- unit tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124362
Approved by: https://github.com/xw285cornell
2024-06-03 22:30:11 +00:00
0f1f0d3015 Onboard ARM bfloat16 to gemv fast path (#127484)
Summary: Used bfloat16 dot support from #127477 to write a bfloat16 transposed fast path and integrated it.

Test Plan: Ran https://github.com/malfet/llm_experiments/blob/main/benchmarks/benchmark_torch_mm.py before and after on my Apple M1 Pro.
Before:
```
mv_nt    torch.float32    6.77 usec
mv_nt    torch.float16    8.24 usec
mv_nt   torch.bfloat16  184.74 usec
mv_ta    torch.float32    5.71 usec
mv_ta    torch.float16   27.95 usec
mv_ta   torch.bfloat16   98.06 usec
notrans  torch.float32    5.55 usec
notrans  torch.float16   25.11 usec
notrans torch.bfloat16   63.55 usec
trans_a  torch.float32    5.62 usec
trans_a  torch.float16   74.48 usec
trans_a torch.bfloat16  313.19 usec
trans_b  torch.float32    5.68 usec
trans_b  torch.float16    8.18 usec
trans_b torch.bfloat16   14.96 usec
```

After:
```
mv_nt    torch.float32    5.40 usec
mv_nt    torch.float16    8.25 usec
mv_nt   torch.bfloat16   12.81 usec
mv_ta    torch.float32    5.69 usec
mv_ta    torch.float16   27.94 usec
mv_ta   torch.bfloat16   98.18 usec
notrans  torch.float32    5.60 usec
notrans  torch.float16   25.17 usec
notrans torch.bfloat16   63.22 usec
trans_a  torch.float32    5.61 usec
trans_a  torch.float16   69.32 usec
trans_a torch.bfloat16  316.62 usec
trans_b  torch.float32    5.60 usec
trans_b  torch.float16    8.09 usec
trans_b torch.bfloat16   14.61 usec
```

Note large improvement in mv_nt torch.bfloat16 case.
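
To put a number on that improvement, the bfloat16 rows of the two tables can be turned into speedup ratios. A small sketch using only the timings reported above (the dictionary names are illustrative):

```python
# bfloat16 timings in microseconds, copied from the Before/After tables above.
before_usec = {"mv_nt": 184.74, "mv_ta": 98.06, "notrans": 63.55, "trans_a": 313.19, "trans_b": 14.96}
after_usec = {"mv_nt": 12.81, "mv_ta": 98.18, "notrans": 63.22, "trans_a": 316.62, "trans_b": 14.61}

# A ratio above 1 means the case got faster after the change.
speedup = {case: before_usec[case] / after_usec[case] for case in before_usec}

for case, ratio in speedup.items():
    print(f"{case:8s} {ratio:6.2f}x")
```

Only `mv_nt` moves substantially (roughly 14x); the other cases stay within a few percent of their previous timings.
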
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127484
Approved by: https://github.com/malfet
ghstack dependencies: #127477, #127478
2024-06-03 22:14:16 +00:00
f6ca822366 Patch ARM Half use_gemv_fast_path gate to avoid kernel duplication (#127478)
Summary: The existing code didn't gate the fast path, so the fast path had to duplicate the stock kernel. Now we gate it and delete the duplicate kernel.

Test Plan: Existing tests. Flipped the TORCH_INTERNAL_ASSERT_DEBUG_ONLY to non-debug and forced to fail (locally) to make sure we had test coverage.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127478
Approved by: https://github.com/malfet
ghstack dependencies: #127477
2024-06-03 22:14:16 +00:00
6faa3d5f18 Onboard ARM bfloat16 to gemm-by-dot-product-for-gemm_transa_ infrastructure (#127477)
Summary: This gets us a baseline level of reasonable performance for
bfloat16 matrix-vector and matrix-matrix multiplication on my Apple
M1. I've intentionally left using intrinsics for future work.

Test Plan: Used
https://github.com/malfet/llm_experiments/blob/main/benchmarks/benchmark_torch_mm.py
(modified to run larger sizes) to benchmark a range of LLM-interesting
matrix-vector and matrix-matrix sizes on my Apple M1 Pro. bfloat16 performance is
improved across the board (except possibly for very small cases) and
now exceeds float32 performance (as it should) for the matrix-vector
cases.

Before:
```
Matrix-vector:
m=8, n=128, k=1
====================
trans_b  torch.float32    0.75 usec
trans_b  torch.float16    0.71 usec
trans_b torch.bfloat16    0.81 usec
m=128, n=8, k=1
====================
trans_b  torch.float32    0.75 usec
trans_b  torch.float16    0.93 usec
trans_b torch.bfloat16    0.98 usec
m=4096, n=4096, k=1
====================
trans_b  torch.float32 2194.31 usec
trans_b  torch.float16  661.27 usec
trans_b torch.bfloat16 3758.42 usec
m=11008, n=4096, k=1
====================
trans_b  torch.float32 5792.04 usec
trans_b  torch.float16 1789.98 usec
trans_b torch.bfloat16 10120.67 usec
m=4096, n=11008, k=1
====================
trans_b  torch.float32 6101.22 usec
trans_b  torch.float16 1927.34 usec
trans_b torch.bfloat16 10469.47 usec
m=32000, n=4096, k=1
====================
trans_b  torch.float32 18353.20 usec
trans_b  torch.float16 5161.06 usec
trans_b torch.bfloat16 29601.69 usec

Matrix-matrix (prompt len 4:
m=8, n=128, k=4
====================
trans_b  torch.float32    2.14 usec
trans_b  torch.float16    0.85 usec
trans_b torch.bfloat16    1.19 usec
m=128, n=8, k=4
====================
trans_b  torch.float32    1.47 usec
trans_b  torch.float16    1.85 usec
trans_b torch.bfloat16    1.75 usec
m=4096, n=4096, k=4
====================
trans_b  torch.float32 4416.40 usec
trans_b  torch.float16 2688.36 usec
trans_b torch.bfloat16 14987.33 usec
m=11008, n=4096, k=4
====================
trans_b  torch.float32 6140.24 usec
trans_b  torch.float16 7467.26 usec
trans_b torch.bfloat16 40295.52 usec
m=4096, n=11008, k=4
====================
trans_b  torch.float32 6143.10 usec
trans_b  torch.float16 7298.04 usec
trans_b torch.bfloat16 41393.43 usec
m=32000, n=4096, k=4
====================
trans_b  torch.float32 17650.72 usec
trans_b  torch.float16 21346.63 usec
trans_b torch.bfloat16 116849.98 usec

Matrix-matrix (prompt len 8:
m=8, n=128, k=8
====================
trans_b  torch.float32    1.05 usec
trans_b  torch.float16    1.03 usec
trans_b torch.bfloat16    1.69 usec
m=128, n=8, k=8
====================
trans_b  torch.float32    2.05 usec
trans_b  torch.float16    3.08 usec
trans_b torch.bfloat16    2.95 usec
m=4096, n=4096, k=8
====================
trans_b  torch.float32 2323.99 usec
trans_b  torch.float16 5265.45 usec
trans_b torch.bfloat16 29942.40 usec
m=11008, n=4096, k=8
====================
trans_b  torch.float32 6202.01 usec
trans_b  torch.float16 14677.90 usec
trans_b torch.bfloat16 80625.18 usec
m=4096, n=11008, k=8
====================
trans_b  torch.float32 6112.05 usec
trans_b  torch.float16 14340.52 usec
trans_b torch.bfloat16 82799.99 usec
m=32000, n=4096, k=8
====================
trans_b  torch.float32 17650.65 usec
trans_b  torch.float16 42551.43 usec
trans_b torch.bfloat16 236081.08 usec

Matrix-matrix (prompt len 16:
m=8, n=128, k=16
====================
trans_b  torch.float32    1.26 usec
trans_b  torch.float16    1.34 usec
trans_b torch.bfloat16    2.69 usec
m=128, n=8, k=16
====================
trans_b  torch.float32    1.60 usec
trans_b  torch.float16    5.81 usec
trans_b torch.bfloat16    5.34 usec
m=4096, n=4096, k=16
====================
trans_b  torch.float32 2328.05 usec
trans_b  torch.float16 10526.58 usec
trans_b torch.bfloat16 60028.28 usec
m=11008, n=4096, k=16
====================
trans_b  torch.float32 6243.35 usec
trans_b  torch.float16 28505.08 usec
trans_b torch.bfloat16 163670.15 usec
m=4096, n=11008, k=16
====================
trans_b  torch.float32 5870.11 usec
trans_b  torch.float16 28597.89 usec
trans_b torch.bfloat16 165404.88 usec
m=32000, n=4096, k=16
====================
trans_b  torch.float32 17746.27 usec
trans_b  torch.float16 83393.87 usec
trans_b torch.bfloat16 472313.13 usec

Matrix-matrix (prompt len 32:
m=8, n=128, k=32
====================
trans_b  torch.float32    1.35 usec
trans_b  torch.float16    2.01 usec
trans_b torch.bfloat16    4.68 usec
m=128, n=8, k=32
====================
trans_b  torch.float32    1.19 usec
trans_b  torch.float16   10.98 usec
trans_b torch.bfloat16   10.13 usec
m=4096, n=4096, k=32
====================
trans_b  torch.float32 2525.29 usec
trans_b  torch.float16 23106.71 usec
trans_b torch.bfloat16 122987.04 usec
m=11008, n=4096, k=32
====================
trans_b  torch.float32 6131.34 usec
trans_b  torch.float16 57537.41 usec
trans_b torch.bfloat16 327825.00 usec
m=4096, n=11008, k=32
====================
trans_b  torch.float32 6395.01 usec
trans_b  torch.float16 57456.33 usec
trans_b torch.bfloat16 331325.58 usec
m=32000, n=4096, k=32
====================
trans_b  torch.float32 19078.68 usec
trans_b  torch.float16 167735.08 usec
trans_b torch.bfloat16 975736.88 usec

Matrix-matrix (prompt len 128:
m=8, n=128, k=128
====================
trans_b  torch.float32    2.40 usec
trans_b  torch.float16    6.07 usec
trans_b torch.bfloat16   16.83 usec
m=128, n=8, k=128
====================
trans_b  torch.float32    1.78 usec
trans_b  torch.float16   40.35 usec
trans_b torch.bfloat16   37.21 usec
m=4096, n=4096, k=128
====================
trans_b  torch.float32 4827.60 usec
trans_b  torch.float16 84341.24 usec
trans_b torch.bfloat16 478917.75 usec
m=11008, n=4096, k=128
====================
trans_b  torch.float32 11879.96 usec
trans_b  torch.float16 226484.33 usec
trans_b torch.bfloat16 1289465.50 usec
m=4096, n=11008, k=128
====================
trans_b  torch.float32 10707.75 usec
trans_b  torch.float16 229200.58 usec
trans_b torch.bfloat16 1327416.67 usec
m=32000, n=4096, k=128
====================
trans_b  torch.float32 33306.32 usec
trans_b  torch.float16 662898.21 usec
trans_b torch.bfloat16 3815866.63 usec
```

After:
```
Matrix-vector:
m=8, n=128, k=1
====================
trans_b  torch.float32    0.77 usec
trans_b  torch.float16    0.72 usec
trans_b torch.bfloat16    0.77 usec
m=128, n=8, k=1
====================
trans_b  torch.float32    0.73 usec
trans_b  torch.float16    0.93 usec
trans_b torch.bfloat16    1.56 usec
m=4096, n=4096, k=1
====================
trans_b  torch.float32 2195.22 usec
trans_b  torch.float16  675.40 usec
trans_b torch.bfloat16 1038.29 usec
m=11008, n=4096, k=1
====================
trans_b  torch.float32 5980.27 usec
trans_b  torch.float16 1806.08 usec
trans_b torch.bfloat16 2756.46 usec
m=4096, n=11008, k=1
====================
trans_b  torch.float32 6339.95 usec
trans_b  torch.float16 1844.71 usec
trans_b torch.bfloat16 2726.52 usec
m=32000, n=4096, k=1
====================
trans_b  torch.float32 18137.17 usec
trans_b  torch.float16 6020.75 usec
trans_b torch.bfloat16 8612.89 usec

Matrix-matrix (prompt len 4:
m=8, n=128, k=4
====================
trans_b  torch.float32    2.24 usec
trans_b  torch.float16    0.91 usec
trans_b torch.bfloat16    1.07 usec
m=128, n=8, k=4
====================
trans_b  torch.float32    1.58 usec
trans_b  torch.float16    1.96 usec
trans_b torch.bfloat16    2.11 usec
m=4096, n=4096, k=4
====================
trans_b  torch.float32 4583.43 usec
trans_b  torch.float16 3014.04 usec
trans_b torch.bfloat16 4434.04 usec
m=11008, n=4096, k=4
====================
trans_b  torch.float32 6245.55 usec
trans_b  torch.float16 7513.82 usec
trans_b torch.bfloat16 11207.80 usec
m=4096, n=11008, k=4
====================
trans_b  torch.float32 6096.22 usec
trans_b  torch.float16 7688.82 usec
trans_b torch.bfloat16 11143.72 usec
m=32000, n=4096, k=4
====================
trans_b  torch.float32 17982.88 usec
trans_b  torch.float16 22001.28 usec
trans_b torch.bfloat16 32470.62 usec

Matrix-matrix (prompt len 8:
m=8, n=128, k=8
====================
trans_b  torch.float32    1.05 usec
trans_b  torch.float16    1.02 usec
trans_b torch.bfloat16    1.44 usec
m=128, n=8, k=8
====================
trans_b  torch.float32    2.07 usec
trans_b  torch.float16    3.10 usec
trans_b torch.bfloat16    3.38 usec
m=4096, n=4096, k=8
====================
trans_b  torch.float32 2245.43 usec
trans_b  torch.float16 5597.87 usec
trans_b torch.bfloat16 8775.08 usec
m=11008, n=4096, k=8
====================
trans_b  torch.float32 6227.68 usec
trans_b  torch.float16 15102.41 usec
trans_b torch.bfloat16 22457.37 usec
m=4096, n=11008, k=8
====================
trans_b  torch.float32 6082.16 usec
trans_b  torch.float16 15131.57 usec
trans_b torch.bfloat16 21860.15 usec
m=32000, n=4096, k=8
====================
trans_b  torch.float32 19659.00 usec
trans_b  torch.float16 45075.64 usec
trans_b torch.bfloat16 67746.75 usec

Matrix-matrix (prompt len 16:
m=8, n=128, k=16
====================
trans_b  torch.float32    1.31 usec
trans_b  torch.float16    1.41 usec
trans_b torch.bfloat16    2.04 usec
m=128, n=8, k=16
====================
trans_b  torch.float32    1.66 usec
trans_b  torch.float16    5.76 usec
trans_b torch.bfloat16    6.37 usec
m=4096, n=4096, k=16
====================
trans_b  torch.float32 2271.34 usec
trans_b  torch.float16 11198.46 usec
trans_b torch.bfloat16 16893.54 usec
m=11008, n=4096, k=16
====================
trans_b  torch.float32 6266.85 usec
trans_b  torch.float16 29342.49 usec
trans_b torch.bfloat16 45159.22 usec
m=4096, n=11008, k=16
====================
trans_b  torch.float32 5999.16 usec
trans_b  torch.float16 29157.43 usec
trans_b torch.bfloat16 43295.81 usec
m=32000, n=4096, k=16
====================
trans_b  torch.float32 18028.83 usec
trans_b  torch.float16 89626.88 usec
trans_b torch.bfloat16 128164.62 usec

Matrix-matrix (prompt len 32:
m=8, n=128, k=32
====================
trans_b  torch.float32    1.38 usec
trans_b  torch.float16    2.03 usec
trans_b torch.bfloat16    3.29 usec
m=128, n=8, k=32
====================
trans_b  torch.float32    1.24 usec
trans_b  torch.float16   10.58 usec
trans_b torch.bfloat16   11.97 usec
m=4096, n=4096, k=32
====================
trans_b  torch.float32 2591.56 usec
trans_b  torch.float16 21683.62 usec
trans_b torch.bfloat16 32657.68 usec
m=11008, n=4096, k=32
====================
trans_b  torch.float32 6468.43 usec
trans_b  torch.float16 57811.33 usec
trans_b torch.bfloat16 89263.21 usec
m=4096, n=11008, k=32
====================
trans_b  torch.float32 6034.74 usec
trans_b  torch.float16 59372.56 usec
trans_b torch.bfloat16 88107.85 usec
m=32000, n=4096, k=32
====================
trans_b  torch.float32 18609.27 usec
trans_b  torch.float16 167298.00 usec
trans_b torch.bfloat16 255116.37 usec

Matrix-matrix (prompt len 128:
m=8, n=128, k=128
====================
trans_b  torch.float32    2.44 usec
trans_b  torch.float16    6.11 usec
trans_b torch.bfloat16   10.92 usec
m=128, n=8, k=128
====================
trans_b  torch.float32    1.80 usec
trans_b  torch.float16   40.26 usec
trans_b torch.bfloat16   44.82 usec
m=4096, n=4096, k=128
====================
trans_b  torch.float32 4773.29 usec
trans_b  torch.float16 84458.54 usec
trans_b torch.bfloat16 131248.58 usec
m=11008, n=4096, k=128
====================
trans_b  torch.float32 12249.16 usec
trans_b  torch.float16 234411.87 usec
trans_b torch.bfloat16 351970.71 usec
m=4096, n=11008, k=128
====================
trans_b  torch.float32 11439.24 usec
trans_b  torch.float16 233347.04 usec
trans_b torch.bfloat16 354475.96 usec
m=32000, n=4096, k=128
====================
trans_b  torch.float32 33803.03 usec
trans_b  torch.float16 688157.54 usec
trans_b torch.bfloat16 1048221.42 usec
```

Also ran the stock configuration; it was unchanged, indicating that we need to integrate this path with torch.mv separately, which will come in a follow-up PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127477
Approved by: https://github.com/malfet
2024-06-03 22:14:10 +00:00
01fc22056a [BE] enable UFMT for torch/masked/ (#127715)
Part of #123062

- #123062
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127715
Approved by: https://github.com/cpuhrsch
2024-06-03 22:01:49 +00:00
406532f864 [AMD] Fix power_draw api (#127729)
Summary: average_socket_power only gives me NA. So we need to change it to current_socket_power

Test Plan: Before, `torch.cuda.power_draw` gives me NA; after, it gives me the right power reading (e.g. 441)

Differential Revision: D58047484

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127729
Approved by: https://github.com/nmacchioni, https://github.com/eqy
2024-06-03 21:46:50 +00:00
c27882ffa8 [torchbind] always fakify script object by default in non-strict export (#127116)
This diff can be risky for internal tests: any torchbind class that hasn't registered a fake class will fail and we should fix them. We've gained some confidence that this can work e2e by implementing FakeTensorQueue for TBE models in sigmoid with [D54210823](https://www.internalfb.com/diff/D54210823).

Differential Revision: [D57991002](https://our.internmc.facebook.com/intern/diff/D57991002)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127116
Approved by: https://github.com/zou3519
ghstack dependencies: #127113, #127114
2024-06-03 21:38:57 +00:00
3efac92888 [torchbind] support torch.compile with aot_eager backend (#127114)
Differential Revision: [D57991001](https://our.internmc.facebook.com/intern/diff/D57991001)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127114
Approved by: https://github.com/zou3519
ghstack dependencies: #127113
2024-06-03 21:38:57 +00:00
c6dc624690 [torchbind] remove test cases that don't fakify script objects (#127113)
As titled.

Differential Revision: [D57991003](https://our.internmc.facebook.com/intern/diff/D57991003)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127113
Approved by: https://github.com/zou3519
2024-06-03 21:38:50 +00:00
6d4ec9b2ec [RFC] Introduce Checkpointable for DCP (#127540) (#127628)
Summary:
# Introduce Checkpointable interface for DCP to support arbitrary tensor subclasses for checkpointing

**Authors:**
* zainhuda

## **Summary**
This diff adds a CheckpointableTensor interface to allow for future compatibility for any tensor subclass with DCP in a clean and maintainable way.

## **Motivation**
For TorchRec sharding migration from ShardedTensor to DTensor, we create a tensor subclass that is stored by DTensor to support TorchRec's sharding schemes (ex, empty shards, multiple shards on a rank).

## **Proposed Implementation**
View the CheckpointableTensor interface implementation, in which we introduce the minimal set of methods needed to be compatible with DCP. Any tensor subclass is expected to implement these methods, which makes it checkpointable by DCP.

## **Drawbacks**
No drawbacks, it extends functionality in a clean and maintainable way.

## **Alternatives**
The alternative design was to add code paths that check for certain attributes in tensor subclasses, which can get messy and makes it hard to maintain or to understand why the checks were there in the first place.

Test Plan:
Sandcastle

cc mrshenli pritamdamania87 zhaojuanmao satgera gqchen aazzolini osalpekar jiayisuse H-Huang kwen2501 awgu penguinwu fegin XilunWu wanchaol fduwjj wz337 tianyu-l wconstab yf225 chauhang d4l3k LucasLLC

Differential Revision: D57970603

Pulled By: iamzainhuda

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127628
Approved by: https://github.com/wz337, https://github.com/XilunWu, https://github.com/fegin
2024-06-03 21:21:55 +00:00
a4064da8ca Always simplify sympy expressions before printing. (#127543)
This is important because if a replacement has happened during inductor lowering, we may have stale symbols in sympy expressions that we need to replace away.  Do this at the very end.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127543
Approved by: https://github.com/lezcano
2024-06-03 20:36:14 +00:00
ef9451ac8d Move the build of AOTriton to base ROCM docker image. (#127012)
Mitigates #126111

AOTriton, as a math library, takes a long time to build. However, the library does not move as fast as PyTorch itself, so it is not cost-efficient to build it for every CI check.

This PR moves the build of AOTriton from PyTorch to its base docker image, which avoids duplicated, long builds.

Pre-this-PR:
* PyTorch base docker build job duration: 1.1-1.3h
* PyTorch build job duration: 1.4-1.5hr (includes AOTriton build time of 1hr6min on a linux.2xlarge node)

Post-this-PR:
* PyTorch base docker build job duration: 1.3h (includes AOTriton build time of 20min on a linux.12xlarge node)
* PyTorch build job duration: <20 min

Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127012
Approved by: https://github.com/jithunnair-amd, https://github.com/pruthvistony, https://github.com/huydhn
2024-06-03 20:35:22 +00:00
941316f821 [pipelining] Stress test schedules with multi iters (#127475)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127475
Approved by: https://github.com/wconstab
2024-06-03 20:24:07 +00:00
db9d457a3f Use sleef on macOS Apple silicon by default (#126509)
Use sleef ~~for aarch64~~ on macOS Apple silicon by default.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126509
Approved by: https://github.com/digantdesai, https://github.com/malfet
2024-06-03 19:33:06 +00:00
2fc907971a Revert "[Inductor] FlexAttention backward kernel optimization (#127208)"
This reverts commit f7171313abf14d9501a330457140b2f8a01c9985.

Reverted https://github.com/pytorch/pytorch/pull/127208 on behalf of https://github.com/yanboliang due to test_flex_attention is failing internally ([comment](https://github.com/pytorch/pytorch/pull/127208#issuecomment-2145830810))
2024-06-03 18:13:27 +00:00
3f45fa63f2 Revert "[Inductor] Add FlexAttention backward kernel dynamic shape tests (#127728)"
This reverts commit 10e3406ea5d115a54a7d753d33110762eb6c07ff.

Reverted https://github.com/pytorch/pytorch/pull/127728 on behalf of https://github.com/yanboliang due to Internal breakage of https://github.com/pytorch/pytorch/pull/127208 hence reverting ([comment](https://github.com/pytorch/pytorch/pull/127728#issuecomment-2145822667))
2024-06-03 18:10:46 +00:00
c35b65715c Revert "[Inductor][Flex-attention] Support different sequence lengths for Query and Key/Value (#127678)"
This reverts commit e2e3ca94ccce1c0abbfd75ac0368793e1756c268.

Reverted https://github.com/pytorch/pytorch/pull/127678 on behalf of https://github.com/atalman due to Internal breakage of https://github.com/pytorch/pytorch/pull/127208 hence reverting ([comment](https://github.com/pytorch/pytorch/pull/127678#issuecomment-2145821489))
2024-06-03 18:07:57 +00:00
3437177e2b Quick Fix on #126854, deepcopy lr and other possible base_parameters (#127190)
* Apply `deepcopy` to every base parameter (`initial_lr`, `max_lr`) when instantiating `LRScheduler`.

Fixes #126854
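
The aliasing problem that `deepcopy` avoids can be sketched without PyTorch. This is a minimal illustration only: `SketchScheduler` and its `use_deepcopy` flag are hypothetical names, and a mutable list stands in for a tensor-valued learning rate.

```python
import copy


class SketchScheduler:
    """Hypothetical stand-in for a scheduler; not the real LRScheduler."""

    def __init__(self, param_groups, use_deepcopy=True):
        # Record the starting learning rate of every parameter group.
        self.base_lrs = [
            copy.deepcopy(group["lr"]) if use_deepcopy else group["lr"]
            for group in param_groups
        ]


# A one-element list plays the role of a mutable (tensor-like) lr.
groups = [{"lr": [0.1]}]
aliased = SketchScheduler(groups, use_deepcopy=False)
copied = SketchScheduler(groups, use_deepcopy=True)

# An in-place update of the live lr, as an optimizer step might perform.
groups[0]["lr"][0] = 0.05
```

After the in-place update, `aliased.base_lrs` has silently followed the live value, while `copied.base_lrs` still holds the original starting point.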

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127190
Approved by: https://github.com/janeyx99
2024-06-03 18:06:31 +00:00
d8d0bf264a Inductor: Allow small sizes of m for mixed mm autotuning (#127663)
For mixed mm with small sizes of m, such as in the example provided in #127056, being able to set BLOCK_M to 16 leads to better performance. This PR introduces kernel configs that are specific to mixed mm by extending the mm configs with two configs that work well for the example provided in #127056.
I am excluding configs with (BLOCK_M=16, BLOCK_K=16, BLOCK_N=64) because triton crashes when this config is used.

For the example in #127056:
- Without my changes, skip_triton is evaluated to true which disables autotuning. On my machine I achieve 146GB/s.
- If autotuning is enabled, but BLOCK_M>=32, I achieve 614 GB/s.
- With the changes in this PR (i.e. autotuning enabled and BLOCK_M=16), I achieve 772 GB/s.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127663
Approved by: https://github.com/Chillee
2024-06-03 17:53:48 +00:00
7c3740d388 [NestedTensor] Extend coverage for unbind when ragged_idx != 1 (#127493)
Summary:
Extend coverage for the `NestedTensor` `unbind` operator to cases in which `ragged_idx != 1`.

Currently, the `unbind` operator in the `NestedTensor` class splits a tensor along the 0-th dimension, where the `ragged_idx` property, which controls the jagged dimension upon which `unbind` splits, is 1. This diff extends support for `ragged_idx != 1` in `NestedTensor`s, allowing `unbind` to split a tensor along a jagged dimension greater than 0 for `NestedTensor`s with and without the `lengths` property.
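
As a rough sketch of the idea, assuming the packed values fold the batch dimension into the ragged one (so the ragged dimension sits at position `ragged_idx - 1` of the packed data), unbinding amounts to slicing that dimension by consecutive offsets. Nested Python lists stand in for tensors here, and `slice_along` and `unbind_jagged` are hypothetical helpers, not the `NestedTensor` implementation.

```python
def slice_along(nested, dim, start, stop):
    """Slice a nested list along dimension `dim`."""
    if dim == 0:
        return nested[start:stop]
    return [slice_along(row, dim - 1, start, stop) for row in nested]


def unbind_jagged(values, offsets, ragged_idx=1):
    """Split packed `values` into one component per batch entry."""
    if ragged_idx < 1:
        # Dimension 0 is the batch dimension and cannot be the ragged one.
        raise ValueError("ragged_idx must be at least 1")
    dim = ragged_idx - 1  # position of the ragged dimension in `values`
    return [
        slice_along(values, dim, offsets[i], offsets[i + 1])
        for i in range(len(offsets) - 1)
    ]


offsets = [0, 1, 3]  # two components, of ragged length 1 and 2

# ragged_idx == 1: packed values have shape (3, 2); slice the rows.
rows = unbind_jagged([[1, 2], [3, 4], [5, 6]], offsets, ragged_idx=1)

# ragged_idx == 2: packed values have shape (2, 3); slice the columns.
cols = unbind_jagged([[1, 2, 3], [4, 5, 6]], offsets, ragged_idx=2)
```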

Test Plan:
Added the following unit tests:

`test_unbind_ragged_idx_equals_2_cpu`, `test_unbind_ragged_idx_equals_3_cpu`, and `test_unbind_ragged_idx_equals_last_dim_cpu` verify that `unbind` works for all jagged dimensions greater than 1, for `NestedTensor`s without `lengths`.
```
test_unbind_ragged_idx_equals_2_cpu (test_nestedtensor.TestNestedTensorSubclassCPU) ... ok
test_unbind_ragged_idx_equals_3_cpu (test_nestedtensor.TestNestedTensorSubclassCPU) ... ok
test_unbind_ragged_idx_equals_last_dim_cpu (test_nestedtensor.TestNestedTensorSubclassCPU) ... ok
```

`test_unbind_with_lengths_cpu` and `test_unbind_with_lengths_ragged_idx_equals_1_cpu` verify that `unbind` works when the jagged dimension is 1, for `NestedTensor`s with `lengths`.
```
test_unbind_with_lengths_cpu (test_nestedtensor.TestNestedTensorSubclassCPU) ... ok
test_unbind_with_lengths_ragged_idx_equals_1_cpu (test_nestedtensor.TestNestedTensorSubclassCPU) ... ok
```

`test_unbind_with_lengths_ragged_idx_equals_2_cpu` and `test_unbind_with_lengths_ragged_idx_equals_3_cpu` verify that `unbind` works when the jagged dimension is greater than 1, for `NestedTensor`s with `lengths`.
```
test_unbind_with_lengths_ragged_idx_equals_2_cpu (test_nestedtensor.TestNestedTensorSubclassCPU) ... ok
test_unbind_with_lengths_ragged_idx_equals_3_cpu (test_nestedtensor.TestNestedTensorSubclassCPU) ... ok
```

`test_unbind_with_lengths_ragged_idx_equals_0_cpu` verifies that `unbind` fails when the jagged dimension is 0 (the batch dimension), for `NestedTensor`s with `lengths`.
```
test_unbind_with_lengths_ragged_idx_equals_0_cpu (test_nestedtensor.TestNestedTensorSubclassCPU) ... ok
```

`test_unbind_with_lengths_ragged_idx_equals_2_bad_dim_cpu` verifies that `unbind` fails when there is a mismatch between the offsets and the jagged dimension, for `NestedTensor`s with `lengths`.
```
test_unbind_with_lengths_ragged_idx_equals_2_bad_dim_cpu (test_nestedtensor.TestNestedTensorSubclassCPU) ... ok
```

`test_unbind_with_wrong_lengths_cpu` verifies that `unbind` fails when the lengths exceed the limitations set by offsets, for `NestedTensor`s with `lengths`.

```
test_unbind_with_wrong_lengths_cpu (test_nestedtensor.TestNestedTensorSubclassCPU) ... ok
```

Differential Revision: D57942686

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127493
Approved by: https://github.com/davidberard98
2024-06-03 17:46:12 +00:00
4d32de14b6 [export] Handle serializing duplicate getitem nodes (#127633)
We ran into a graph that looks something like the following, where we have 2 getitem calls to the same index (%getitem, %getitem_2 both query topk[0]):
```
graph():
    %x : [num_users=1] = placeholder[target=x]
    %topk : [num_users=3] = call_function[target=torch.ops.aten.topk.default](args = (%x, 2), kwargs = {})
    %getitem : [num_users=1] = call_function[target=operator.getitem](args = (%topk, 0), kwargs = {})
    %getitem_1 : [num_users=1] = call_function[target=operator.getitem](args = (%topk, 1), kwargs = {})
    %getitem_2 : [num_users=1] = call_function[target=operator.getitem](args = (%topk, 0), kwargs = {})
    %mul_tensor : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%getitem, %getitem_2), kwargs = {})
    %mul : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%mul_tensor, 2), kwargs = {})
    return (mul, getitem_1)
```

The duplicate getitem call gets created during a pass, so there are a few possible solutions:

1. Change serializer to support the case of duplicate getitem calls
2. Change the pass so that it doesn’t produce duplicate getitem calls
3. Add a pass which dedups the getitem calls

As a framework, we should do 1 and 3 (through a CSE pass).

This PR implements solution 1. However, the serializer currently does some special handling for getitem nodes -- instead of directly serializing the getitem nodes, we serialize the output of the node that outputs a list of tensors (the %topk node in this example) into a list of nodes, one for each output ([%getitem, %getitem_1]). This fails when we have duplicate getitem nodes for the same index (%getitem_2), since we do not record that duplicate getitem node anywhere. So, the solution this PR takes is that the serializer deduplicates the getitem nodes (%getitem_2 will be replaced with %getitem). This results in a semantically correct graph, but one that is not necessarily node-to-node identical to the original fx graph.
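
The deduplication can be sketched in a few lines. This is an illustration only: nodes are modeled as plain `(name, source, index)` tuples, and `dedup_getitems` is a hypothetical helper rather than the serializer's actual code.

```python
def dedup_getitems(getitem_nodes):
    """Keep the first getitem per (source, index); map later ones to it.

    `getitem_nodes` is a list of (name, source, index) tuples.
    Returns (kept_names, replacements).
    """
    first_seen = {}    # (source, index) -> name of the first getitem
    kept = []          # names that survive, in original order
    replacements = {}  # duplicate name -> surviving name
    for name, source, index in getitem_nodes:
        key = (source, index)
        if key in first_seen:
            replacements[name] = first_seen[key]
        else:
            first_seen[key] = name
            kept.append(name)
    return kept, replacements


# The three getitem calls from the graph above.
nodes = [
    ("getitem", "topk", 0),
    ("getitem_1", "topk", 1),
    ("getitem_2", "topk", 0),
]
kept, replacements = dedup_getitems(nodes)
```

Every user of a duplicate would then be rewritten to read from the surviving node, so `%mul_tensor` would take `%getitem` for both arguments.
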
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127633
Approved by: https://github.com/ydwu4
2024-06-03 17:25:51 +00:00
12c4a2c297 [BE]: Apply PLR1736 fixes (unnecessary index lookup) (#127716)
Applies the PLR1736 preview rule with some more autofixes to cut down on unnecessary accesses. Added a noqa since that test is actually testing the dunder method.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127716
Approved by: https://github.com/ezyang
2024-06-03 17:22:13 +00:00
21144ce570 [dtensor] implement scatter op with simple replication (#126713)
As titled, implement the torch.scatter op with a simple replication strategy.
We need to follow up and see if we could actually support any sharding
pattern.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126713
Approved by: https://github.com/tianyu-l
ghstack dependencies: #126712
2024-06-03 16:16:28 +00:00
ded580a594 [dtensor] standardize multi mesh-dim strategy with utils (#126712)
This PR standardizes the multi mesh-dim strategy generation by unifying a
util to expand from a single mesh dim strategy to a multi mesh dim
strategy, to make strategy generation simpler.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126712
Approved by: https://github.com/tianyu-l
2024-06-03 16:16:28 +00:00
d1fad416a8 Revert "Add aten._unsafe_masked_index (#116491)"
This reverts commit f03f8bc901a6c9038308a6353e8d280f4b5628f5.

Reverted https://github.com/pytorch/pytorch/pull/116491 on behalf of https://github.com/PaliC due to breaking onnx tests ([comment](https://github.com/pytorch/pytorch/pull/116491#issuecomment-2145557724))
2024-06-03 15:51:50 +00:00
53f001c599 Revert "correct BLAS input (#126200)" (#127762)
This reverts commit ea13e9a097aaa875a2b404822579b7f8b62ea291.

Looks like this could have caused: https://github.com/pytorch/pytorch/actions/runs/9346105069/job/25722431775#step:17:984

Aarch64 tests failures:
```
+ echo 'Checking that MKLDNN is available on aarch64'
Checking that MKLDNN is available on aarch64
+ pushd /tmp
/tmp /
+ python -c 'import torch; exit(0 if torch.backends.mkldnn.is_available() else 1)'
Error: Process completed with exit code 1.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127762
Approved by: https://github.com/PaliC, https://github.com/malfet
2024-06-03 15:49:48 +00:00
8677508167 [c10d] guard gpu context during abort (#127363)
This is a mitigation for an internal out-of-memory issue on GPU0 that happened during comms abort. This PR was tested internally and fixed the out-of-memory issue.

Note: this is supposed to be a mitigation only, as the ideal fix should be within the NCCL comm libs, which should just set the right CUDA context before any CUDA call and restore it to its exact previous state.

ncclCommDestroy/ncclCommAbort -> commReclaim -> commDestroySync (https://fburl.com/code/pori1tka)

In commDestroySync, it thinks that the "current device context" is not the same as the comm's device context. It tries to:
1) save the current context
2) set the comm's device context
3) clean things up
4) restore the "previously stored context" with another cudaSetDevice.
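
The save/set/clean-up/restore sequence above is the usual device-guard pattern. A minimal sketch of that pattern follows; the module-level `_current_device` variable and the `get_device`/`set_device` helpers are hypothetical stand-ins for the real CUDA device state, not c10d or NCCL code.

```python
from contextlib import contextmanager

_current_device = 0  # hypothetical stand-in for the current CUDA device


def get_device():
    return _current_device


def set_device(device):
    global _current_device
    _current_device = device


@contextmanager
def device_guard(device):
    """Switch to `device` for the duration of the block, then restore."""
    previous = get_device()   # 1) save the current device
    set_device(device)        # 2) switch to the target device
    try:
        yield                 # 3) run the clean-up work
    finally:
        set_device(previous)  # 4) restore, even if the work raised


set_device(3)
with device_guard(1):
    inside = get_device()
after = get_device()
```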

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127363
Approved by: https://github.com/wconstab
2024-06-03 15:41:11 +00:00
430cdfc0ac [ATen][Native] fixes sparse SPMV on aarch64 (#127642)
Fixes #127491
In #127491, `result` was allocated as `result = at::empty(...)`, which does not guarantee that `result` is filled with zeros, so `torch.mv` was producing non-finite values. This happened mainly because the corner case (`beta = 0`) of `addmv` was not taken care of, as it should be just like in any other `addmv`/`addmm`:
923edef31c/aten/src/ATen/native/mkl/SparseBlasImpl.cpp (L307-L311)
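
Why `beta = 0` needs special handling can be seen with plain floats: multiplying uninitialized memory by zero does not clean it, because `0 * nan` and `0 * inf` are both `nan`. The sketch below uses a dense pure-Python `addmv` as a stand-in; it is not the sparse kernel from the fix.

```python
import math


def addmv(y, A, x, alpha=1.0, beta=1.0):
    """Return beta * y + alpha * (A @ x) for list-based inputs."""
    out = []
    for i, row in enumerate(A):
        acc = alpha * sum(a * b for a, b in zip(row, x))
        if beta != 0:
            # Only read y when it actually contributes to the result.
            acc += beta * y[i]
        out.append(acc)
    return out


# Simulate an output buffer that was allocated but never zero-filled.
garbage = [float("nan"), float("inf")]

# Scaling by zero does not sanitize the buffer.
naive = [0.0 * v for v in garbage]
naive_is_nan = all(math.isnan(v) for v in naive)

# Skipping the read when beta == 0 gives the finite, expected answer.
result = addmv(garbage, [[1.0, 2.0], [3.0, 4.0]], [1.0, 1.0], beta=0.0)
```
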
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127642
Approved by: https://github.com/malfet
2024-06-03 15:38:27 +00:00
badf898df2 Remove unstable ARC jobs (#127563)
Disable these jobs since we're no longer trying to enable ARC
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127563
Approved by: https://github.com/huydhn
2024-06-03 15:30:06 +00:00
63d7ffe121 Retry of D58015187 Move AsyncCompile to a different file (#127691)
Summary:
This is a retry of https://github.com/pytorch/pytorch/pull/127545/files
and
D58015187, fixing the internal test that also imported codecache

Test Plan: Same tests as CI in github, plus sandcastle for internal unit tests should pass now

Differential Revision: D58054611

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127691
Approved by: https://github.com/oulgen
2024-06-03 15:29:41 +00:00
3f8b8f08c8 [Split Build] Make libtorch_global_deps accessible from libtorch wheel (#127570)
Title
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127570
Approved by: https://github.com/atalman, https://github.com/malfet
2024-06-03 15:14:29 +00:00
d05cddfe23 Revert "FP8 rowwise scaling (#125204)"
This reverts commit 923edef31c7f3e98a14625724f2019b1422dcb26.

Reverted https://github.com/pytorch/pytorch/pull/125204 on behalf of https://github.com/atalman due to Broke nightlies and internal tests ([comment](https://github.com/pytorch/pytorch/pull/125204#issuecomment-2145422196))
2024-06-03 15:00:21 +00:00
f03f8bc901 Add aten._unsafe_masked_index (#116491)
To generate masked indexing operations that would generate
masked loads in triton code

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116491
Approved by: https://github.com/lezcano, https://github.com/peterbell10
2024-06-03 14:44:03 +00:00
d6963e769c Force Inductor output code to be dumped even if it fails to compile (#127700)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127700
Approved by: https://github.com/oulgen
2024-06-03 14:06:53 +00:00
f343f98710 [jit] Validate mobile module fields parsed by flatbuffer loader (#127437)
Fixes an error in the `torch.jit.load` Python API function that causes a crash in the C++ backend of PyTorch.
The mobile module is successfully parsed from the flatbuffer format, but its fields are used without any validation.

Fixes #127434

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127437
Approved by: https://github.com/davidberard98
2024-06-03 08:48:12 +00:00
e017b56c0c [dtensor] local_map UX change: keep func signature and be compatible with Tensor input (#126924)
**Summary**
This PR has 2 parts of change in `local_map`:

1. Regulates how the user can access `DeviceMesh` inside the `func` argument of `local_map`. This means `local_map` will strictly follow the `func` signature without implicitly passing any argument to `func`. If the user wants to use `DeviceMesh` inside `func`, this mesh must be explicitly passed to `func` as an argument by the user. For example,

```
def user_function(device_mesh, /, *args, **kwargs):
    USER CODE HERE

local_func = local_map(func=user_function, ...)
dtensor_out = local_func(device_mesh, dtensor_input, ...)
```

Before this PR, user code was like:
```
def user_function(device_mesh, /, *args, **kwargs):
    USER CODE HERE

local_func = local_map(func=user_function, ...)
dtensor_out = local_func(dtensor_input, ...)  # local_map passes mesh implicitly for user
```

2. `local_map` now supports mixed use of `torch.Tensor` and `DTensor` in its arguments:

- Pure torch.Tensor case: no `DTensor` argument is passed in, all tensor arguments are `torch.Tensor`. Bypass the `in_placements` check and unwrapping steps. The output will not be wrapped into `DTensor` but directly returned.
- Pure DTensor case: no `torch.Tensor` argument is passed in, all tensor arguments are `DTensor`. This follows the default rule: `in_placements` check, unwrapping arguments, pass into `func`, wrapping the `torch.Tensor` output into `DTensor` if the `out_placements` is not `None`.
- Mix of the above two: some arguments are `torch.Tensor` while some are `DTensor`. Only perform `in_placements` check and unwrapping on `DTensor` arguments. For output processing, it's the same as Pure DTensor case.

**Test**
`pytest test/distributed/_tensor/experimental/test_local_map.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126924
Approved by: https://github.com/wanchaol
2024-06-03 08:41:59 +00:00
2d1ad0c31a [CI] Add freezing for cpu inductor accuracy test in inductor CI (#124715)
This PR enables '--freezing' when running the dynamo accuracy check in CI.
Background:
Issue [#124286](https://github.com/pytorch/pytorch/issues/124286) is not captured by CI since freezing is not enabled for cpu-inductor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124715
Approved by: https://github.com/chuanqi129, https://github.com/jgong5, https://github.com/atalman, https://github.com/desertfire
2024-06-03 07:37:30 +00:00
10e3406ea5 [Inductor] Add FlexAttention backward kernel dynamic shape tests (#127728)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127728
Approved by: https://github.com/Chillee
2024-06-03 07:15:46 +00:00
6d21685b45 [DSD] Fixes various bugs for broadcast_from_rank0 (#127635)
Fixes https://github.com/pytorch/pytorch/issues/126285

Summary:
1. Fixes https://github.com/pytorch/pytorch/issues/126285
2. Broadcast one tensor at a time to avoid OOM.
3. Add some docstrings

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127635
Approved by: https://github.com/weifengpy
2024-06-03 06:35:21 +00:00
48846cd164 Update torch-xpu-ops pin (ATen XPU implementation) (#127730)
Regular bi-weekly pin update.
1. Porting operator-related PyTorch unit tests. The existing operators in torch-xpu-ops are covered by: 1) operator-specific tests, like test_binary_ufuncs.py; 2) common operator tests, like test_ops.py.
2. Bug fixing under the latest PyTorch unit test scope, https://github.com/intel/torch-xpu-ops/tree/release/2.4/test/xpu.

In total, 297 ATen operators are implemented in torch-xpu-ops. https://github.com/intel/torch-xpu-ops/blob/release/2.4/yaml/xpu_functions.yaml

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127730
Approved by: https://github.com/EikanWang
2024-06-03 05:55:00 +00:00
e2e3ca94cc [Inductor][Flex-attention] Support different sequence lengths for Query and Key/Value (#127678)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127678
Approved by: https://github.com/Chillee
2024-06-03 04:35:50 +00:00
cyy
288df042c5 [1/N] Change static functions in headers to inline (#127727)
So that it may fix some tricky linking issues.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127727
Approved by: https://github.com/ezyang
2024-06-03 04:34:36 +00:00
cyy
1b182ea0d2 Remove c10::guts::{conjunction,disjunction} (#127726)
They are not used in Pytorch OSS.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127726
Approved by: https://github.com/ezyang
2024-06-03 04:06:21 +00:00
3399ad8d9d [Inductor][CPP] Add UT for bitwise right shift (#127731)
**Summary**
Per the discussion in https://github.com/pytorch/pytorch/issues/127310, `bitwise_right_shift` failed in Torch 2.1 but passes with the latest PyTorch. This PR adds a UT to ensure correctness.

**TestPlan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_bitwise_right_shift
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127731
Approved by: https://github.com/Skylion007
2024-06-03 04:05:41 +00:00
7e97b33fbb [Dynamo] Log backward graph compilation metrics (#126629)
Fixes #125313

Compilation metric logs for the code example at #125313:
```
%s CompilationMetrics(compile_id='0/0', frame_key='1', co_name='forward', co_filename='/data/users/ybliang/debug/debug2.py', co_firstlineno=10, cache_size=0, accumulated_cache_size=0, guard_count=11, shape_env_guard_count=0, graph_op_count=1, graph_node_count=3, graph_input_count=1, start_time=1716247236.6165977, entire_frame_compile_time_s=7.926939964294434, backend_compile_time_s=7.887059926986694, inductor_compile_time_s=4.108498811721802, code_gen_time_s=3.97833514213562, fail_type=None, fail_reason=None, fail_user_frame_filename=None, fail_user_frame_lineno=None, non_compliant_ops=set(), compliant_custom_ops=set(), restart_reasons={"'skip function graph_break in file /home/ybliang/local/pytorch/torch/_dynamo/decorators.py'"}, dynamo_time_before_restart_s=0.025330543518066406, has_guarded_code=True, is_fwd=True)
%s CompilationMetrics(compile_id='1/0', frame_key='2', co_name='torch_dynamo_resume_in_forward_at_12', co_filename='/data/users/ybliang/debug/debug2.py', co_firstlineno=12, cache_size=0, accumulated_cache_size=0, guard_count=10, shape_env_guard_count=0, graph_op_count=2, graph_node_count=5, graph_input_count=1, start_time=1716247244.544928, entire_frame_compile_time_s=0.10148310661315918, backend_compile_time_s=0.08753013610839844, inductor_compile_time_s=0.03691983222961426, code_gen_time_s=0.022417306900024414, fail_type=None, fail_reason=None, fail_user_frame_filename=None, fail_user_frame_lineno=None, non_compliant_ops=set(), compliant_custom_ops=set(), restart_reasons=set(), dynamo_time_before_restart_s=0.0, has_guarded_code=True, is_fwd=True)
tensor([[-0.1622, -0.0000, -0.0000,  0.5643, -0.0000,  0.0000, -0.5087,  0.0914,
         -0.0000, -0.0421]], grad_fn=<CompiledFunctionBackward>)
%s CompilationMetrics(compile_id='1/0', frame_key=None, co_name=None, co_filename=None, co_firstlineno=None, cache_size=None, accumulated_cache_size=None, guard_count=None, shape_env_guard_count=None, graph_op_count=None, graph_node_count=None, graph_input_count=None, start_time=None, entire_frame_compile_time_s=None, backend_compile_time_s=None, inductor_compile_time_s=0.026738643646240234, code_gen_time_s=0.016446352005004883, fail_type=None, fail_reason=None, fail_user_frame_filename=None, fail_user_frame_lineno=None, non_compliant_ops=None, compliant_custom_ops=None, restart_reasons=None, dynamo_time_before_restart_s=None, has_guarded_code=None, is_fwd=False)
%s CompilationMetrics(compile_id='0/0', frame_key=None, co_name=None, co_filename=None, co_firstlineno=None, cache_size=None, accumulated_cache_size=None, guard_count=None, shape_env_guard_count=None, graph_op_count=None, graph_node_count=None, graph_input_count=None, start_time=None, entire_frame_compile_time_s=None, backend_compile_time_s=None, inductor_compile_time_s=0.14563536643981934, code_gen_time_s=0.08652091026306152, fail_type=None, fail_reason=None, fail_user_frame_filename=None, fail_user_frame_lineno=None, non_compliant_ops=None, compliant_custom_ops=None, restart_reasons=None, dynamo_time_before_restart_s=None, has_guarded_code=None, is_fwd=False)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126629
Approved by: https://github.com/ezyang
2024-06-03 03:55:33 +00:00
84776d7597 Revert "[BE]: Update mypy to 1.10.0 (#127717)"
This reverts commit 30213ab0a7b27277e76ea9dd707ce629a63d91ee.

Reverted https://github.com/pytorch/pytorch/pull/127717 on behalf of https://github.com/huydhn due to I am not sure why but the failures look legit and they are showing up in trunk 30213ab0a7 ([comment](https://github.com/pytorch/pytorch/pull/127717#issuecomment-2144183347))
2024-06-03 02:52:47 +00:00
e57f51b80f Update _dedup_save_plans.py (#126569)
To resolve https://github.com/pytorch/pytorch/issues/125740, save each tensor on the lowest rank.

Fixes #125740
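
The "lowest rank wins" rule can be sketched as follows. Save plans are modeled as a plain mapping from rank to a list of tensor keys, and `assign_to_lowest_rank` is a hypothetical helper, not the code in `_dedup_save_plans.py`.

```python
def assign_to_lowest_rank(plans):
    """Keep each key only on the lowest rank that planned to save it.

    `plans` maps rank -> list of keys that rank intends to write.
    """
    owner = {}
    for rank in sorted(plans):
        for key in plans[rank]:
            # The first (lowest) rank to mention a key becomes its owner.
            owner.setdefault(key, rank)
    return {
        rank: [key for key in keys if owner[key] == rank]
        for rank, keys in plans.items()
    }


plans = {1: ["weight", "bias"], 0: ["weight"], 2: ["bias", "step"]}
deduped = assign_to_lowest_rank(plans)
```

A deterministic rule like this makes the placement reproducible across runs, at the cost of concentrating writes on low ranks.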

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126569
Approved by: https://github.com/LucasLLC
2024-06-03 01:55:03 +00:00
fec8ef8c17 [Aten][BlasKernel] Add function prototype to fix compiler error (#127719)
Adds a prototype for function `fp16_dot_with_fp32_arith()` in `aten/src/ATen/native/BlasKernel.cpp`.

Without this patch, the build fails on Apple silicon/macOS (CPU) with the error `no previous prototype for function 'fp16_dot_with_fp32_arith' [-Werror,-Wmissing-prototypes]`.

The function cannot be marked `static` because its use is not limited to this file.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127719
Approved by: https://github.com/Skylion007
2024-06-02 23:41:43 +00:00
8b08b0f340 [BE] enable ruff rule Q from flake8-quotes (#127713)
Enable [ruff rule `Q`](https://docs.astral.sh/ruff/rules/#flake8-quotes-q) from flake8-quotes. Fixes:

- [avoidable-escaped-quote (Q003)](https://docs.astral.sh/ruff/rules/avoidable-escaped-quote/#avoidable-escaped-quote-q003)
- [unnecessary-escaped-quote (Q004)](https://docs.astral.sh/ruff/rules/unnecessary-escaped-quote/#unnecessary-escaped-quote-q004)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127713
Approved by: https://github.com/ezyang
2024-06-02 23:25:26 +00:00
139b9c6529 Avoid reference cycle in inner closure (#127711)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127711
Approved by: https://github.com/Skylion007, https://github.com/izaitsevfb
2024-06-02 21:28:46 +00:00
30213ab0a7 [BE]: Update mypy to 1.10.0 (#127717)
Updates mypy to the latest and greatest.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127717
Approved by: https://github.com/ezyang
2024-06-02 21:07:23 +00:00
fb53cd6497 [aten_cuda/flash_attn] Add typename to template argument Kernel_trait… (#127634)
Adds the `typename` keyword to the template argument `Kernel_traits::TiledMma` and `Kernel_traits::TiledMmaSdP` (which are dependent type names) when calling the template function `pytorch_flash::convert_layout_acc_Aregs`.

Without `typename` flash_attention kernels do not compile with Clang under C++20 since Clang compiles the entire .cu file in a single pass as opposed to NVCC which split compiles the host and device code. Adding `typename` seems to be OK under NVCC based on CI cuda builds succeeding.

Below is the excerpt of the compilation error:

```
third_party/py/torch/aten/src/ATen/native/transformers/cuda/flash_attn/static_switch.h:46:24: note: expanded from macro 'ALIBI_SWITCH'
   46 |   #define ALIBI_SWITCH BOOL_SWITCH
      |                        ^
third_party/py/torch/aten/src/ATen/native/transformers/cuda/flash_attn/flash_bwd_launch_template.h:132:5: note: in instantiation of function template specialization 'pytorch_flash::run_flash_bwd_seqk_parallel<pytorch_flash::Flash_bwd_ke
rnel_traits<160, 64, 64, 8, 4, 4, 4, false, true>, true>' requested here
  132 |     run_flash_bwd_seqk_parallel<Kernel_traits, Is_dropout>(params, stream);
      |     ^
third_party/py/torch/aten/src/ATen/native/transformers/cuda/flash_attn/flash_bwd_launch_template.h:280:13: note: in instantiation of function template specialization 'pytorch_flash::run_flash_bwd<pytorch_flash::Flash_bwd_kernel_traits<1
60, 64, 64, 8, 4, 4, 4, false, true>, true>' requested here
  280 |             run_flash_bwd<Flash_bwd_kernel_traits<Headdim, 64, 64, 8, 4, 4, 4, false, true, T>, Is_dropout>(params, stream);
      |             ^
third_party/py/torch/aten/src/ATen/native/transformers/cuda/flash_attn/static_switch.h:36:26: note: expanded from macro 'DROPOUT_SWITCH'
   36 |   #define DROPOUT_SWITCH BOOL_SWITCH
      |                          ^
third_party/py/torch/aten/src/ATen/native/transformers/cuda/flash_attn/kernels/flash_bwd_hdim160_fp16_sm80.cu:12:5: note: in instantiation of function template specialization 'pytorch_flash::run_mha_bwd_hdim160<cutlass::half_t>' request
ed here
   12 |     run_mha_bwd_hdim160<cutlass::half_t>(params, stream);
      |     ^
In file included from third_party/py/torch/aten/src/ATen/native/transformers/cuda/flash_attn/kernels/flash_bwd_hdim160_fp16_sm80.cu:7:
In file included from third_party/py/torch/aten/src/ATen/native/transformers/cuda/flash_attn/flash_bwd_launch_template.h:12:
third_party/py/torch/aten/src/ATen/native/transformers/cuda/flash_attn/flash_bwd_kernel.h:543:86: error: missing 'typename' prior to dependent type name 'Flash_bwd_kernel_traits<160, 64, 64, 8, 4, 4, 4, false, true>::TiledMmaSdP'
  543 |         Tensor tPrP = make_tensor(rP.data(), pytorch_flash::convert_layout_acc_Aregs<Kernel_traits::TiledMmaSdP>(rP.layout()));
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127634
Approved by: https://github.com/Skylion007
2024-06-02 16:25:02 +00:00
08653fe355 Beef up the allow_in_graph docs (#127117)
We make the following changes:
- most of the time when someone uses allow_in_graph, they actually
  wanted to make a custom op. We add a link to the custom ops landing
  page and explain the differences between allow_in_graph and custom
  ops.
- we warn people against using allow_in_graph footguns and document
  them.

Test Plan:
- tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127117
Approved by: https://github.com/jansel, https://github.com/albanD
2024-06-02 15:00:46 +00:00
e24a87ed8d [BE][Ez]: Apply PYI059 - Generic always come last (#127685)
Generic baseclass should always be last or unexpected issues can occur, especially in non-stub files (such as with MRO). Applies autofixes from the preview PYI059 rule to fix the issues in the codebase.
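
A small illustration of the ordering the rule enforces, using only the standard `typing` module (the `Base` and `Box` classes are made up for this example):

```python
from typing import Generic, TypeVar

T = TypeVar("T")


class Base:
    def describe(self):
        return "base"


# Generic[T] is listed last, after the concrete base class.
class Box(Base, Generic[T]):
    def __init__(self, item: T):
        self.item = item


box = Box(1)
mro_names = [cls.__name__ for cls in Box.__mro__]
```

With this ordering, `Base` precedes `Generic` in the method resolution order, so attribute lookups reach the concrete base class first.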

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127685
Approved by: https://github.com/ezyang
2024-06-02 13:38:58 +00:00
c2547dfcc3 [BE][Ez]: Enable ruff PYI019 (#127684)
Tells pytorch to use typing_extensions.Self when it's able to.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127684
Approved by: https://github.com/ezyang
2024-06-02 13:38:33 +00:00
67ef2683d9 [BE] wrap deprecated function/class with typing_extensions.deprecated (#127689)
Use `typing_extensions.deprecated` for deprecation annotation if possible. Otherwise, add `category=FutureWarning` to `warnings.warn("message")` if the category is missing.

Note that only warnings whose messages contain `[Dd]eprecat(ed|ion)` are updated in this PR.
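
The fallback path (an explicit `category=FutureWarning` on `warnings.warn`) can be sketched with the standard library alone. `old_api` is a made-up function for illustration, and the `typing_extensions.deprecated` decorator itself is not used here so the example stays dependency-free.

```python
import warnings


def old_api():
    """Made-up deprecated function used only for this illustration."""
    warnings.warn(
        "old_api is deprecated, use new_api instead",
        category=FutureWarning,  # explicit category instead of the default
        stacklevel=2,            # attribute the warning to the caller
    )
    return 42


# Capture the warning to inspect what callers would see.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    value = old_api()
```

Without an explicit category, `warnings.warn` defaults to `UserWarning`; `FutureWarning` is intended for deprecations that end users should see.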

Resolves #126888

- #126888

This PR is split from PR #126898.

- #126898

------

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127689
Approved by: https://github.com/Skylion007
2024-06-02 12:30:43 +00:00
c1dd3a615f Implement Graph Transform Observer (#127427)
Summary: Implement Graph Transform Observer

Differential Revision: D57887518

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127427
Approved by: https://github.com/angelayi
2024-06-02 06:49:47 +00:00
cyy
4e7f497bb3 [Submodule] Remove ios-cmake (#127694)
It has not been updated for a long time and CI iOS builds don't rely on it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127694
Approved by: https://github.com/ezyang
2024-06-02 04:40:21 +00:00
2129903aa3 Properly detect nested torch function args (#127496)
Dynamo was not detecting nested torch function classes in containers. This was due to pytree compatibility for variable trackers being removed.
Fixes https://github.com/pytorch/pytorch/issues/127174

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127496
Approved by: https://github.com/anijain2305
2024-06-02 03:43:22 +00:00
16578e8584 [symbolic shapes] if symbol not in var_ranges default to unknown range (#127681)
The purpose of this PR is to get around this error: https://github.com/pytorch/pytorch/issues/127677

Differential Revision: D58048558

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127681
Approved by: https://github.com/lezcano
2024-06-02 02:28:40 +00:00
4fd777ed59 [ONNX] Add quantized layer norm op to opset 17 (#127640)
Fixes #126160
Continue #126555

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127640
Approved by: https://github.com/justinchuby
2024-06-02 02:10:02 +00:00
c19ad112f6 [Inductor UT][Intel GPU] Skip test case which doesn't currently work on the XPU stack but newly re-enabled by community. (#127629)
The Inductor UT test/inductor/test_triton_heuristics.py:test_artificial_zgrid, which was previously skipped, was recently enabled by PR https://github.com/pytorch/pytorch/pull/127448. However, the test doesn't currently work on the XPU stack: it hangs on the GPU, so this PR skips the test for Intel GPU instead of marking it as an expected failure.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127629
Approved by: https://github.com/EikanWang, https://github.com/peterbell10
2024-06-02 01:00:33 +00:00
2cef2fc2b4 [ts migration] support aten::dim, aten::len, aten::__getitem__ (#127593)
- Add support for aten::dim, aten::len, aten::__getitem__ for torchscript to export converter.
- Add unit tests
Co-authored-by: cyy <cyyever@outlook.com>
Co-authored-by: Menglu Yu <mengluy@meta.com>
Co-authored-by: Animesh Jain <anijain@umich.edu>
Co-authored-by: Simon Fan <xmfan@meta.com>
Co-authored-by: Zain Rizvi <ZainR@meta.com>
Co-authored-by: Tugsbayasgalan (Tugsuu) Manlaibaatar <tmanlaibaatar@meta.com>
Co-authored-by: titaiwangms <titaiwang@microsoft.com>
Co-authored-by: Yueming Hao <yhao@meta.com>
Co-authored-by: IvanKobzarev <ivan.kobzarev@gmail.com>
Co-authored-by: PyTorch MergeBot <pytorchmergebot@users.noreply.github.com>
Co-authored-by: Edward Z. Yang <ezyang@meta.com>
Co-authored-by: Bin Bao <binbao@meta.com>
Co-authored-by: Feny Patel <fenypatel@meta.com>
Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Co-authored-by: xinan.lin <xinan.lin@intel.com>
Co-authored-by: Zain Huda <zainhuda@meta.com>
Co-authored-by: Chien-Chin Huang <chienchin@fb.com>
Co-authored-by: Wei Wang <weiwan@nvidia.com>
Co-authored-by: Jason Ansel <jansel@meta.com>
Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
Co-authored-by: Iris Z <31293777+wz337@users.noreply.github.com>
Co-authored-by: Wang, Eikan <eikan.wang@intel.com>
Co-authored-by: angelayi <yiangela7@gmail.com>
Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
Co-authored-by: Yanbo Liang <ybliang8@gmail.com>
Co-authored-by: Catherine Lee <csl@fb.com>
Co-authored-by: Kwanghoon An <kwanghoon@meta.com>
Co-authored-by: Brian Hirsh <hirsheybar@fb.com>
Co-authored-by: Robert Mast <rmast@live.nl>
Co-authored-by: drisspg <drisspguessous@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127593
Approved by: https://github.com/SherlockNoMad, https://github.com/malfet
2024-06-02 00:36:33 +00:00
0d9e527c4d Remove tensor storage_offset/storage_bytes from the cache key (#127319)
Summary: We observed differences in these fields and inductor does not specialize on them so it is safe to remove them from the key.

Test Plan: CI

Reviewed By: masnesral

Differential Revision: D57871276

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127319
Approved by: https://github.com/masnesral
2024-06-02 00:28:43 +00:00
eqy
2e779166eb [Functorch][cuDNN] Bump tolerances for test_vmapjvpvjp (#127355)
cuDNN can select a winograd kernel for this case which slightly affects tolerances...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127355
Approved by: https://github.com/zou3519, https://github.com/Skylion007
2024-06-01 21:22:55 +00:00
6e2e09f6cc [inductor] fix redis-related env vars in remote_cache.py (#127583)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127583
Approved by: https://github.com/oulgen
2024-06-01 19:55:25 +00:00
b505e86475 [Inductor][CI][CUDA 12.4] Update dynamic_inductor_timm_training.csv - change gluon_inception_v3 from fail_accuracy to pass (#127672)
From the HUD, most of the time the "X" is due to "improved_accuracy" for gluon_inception_v3.

![image](https://github.com/pytorch/pytorch/assets/143543872/d4f70377-2756-4921-872d-587426f00302)

https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=inductor_timm

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127672
Approved by: https://github.com/eqy, https://github.com/Skylion007
2024-06-01 19:12:43 +00:00
17dea09b15 Revert "Default XLA to use swap_tensors path in nn.Module._apply (#126814)"
This reverts commit bfdec93395f675a0e5a59e95aef9104ac8f5081a.

Reverted https://github.com/pytorch/pytorch/pull/126814 on behalf of https://github.com/izaitsevfb due to suspicious build instructions count regression, see [D58015016](https://www.internalfb.com/diff/D58015016) ([comment](https://github.com/pytorch/pytorch/pull/126814#issuecomment-2143545818))
2024-06-01 18:46:16 +00:00
82cd7a7dab Revert "Default meta device to use swap_tensors in nn.Module._apply (.to_empty and .to('meta')) (#126819)"
This reverts commit fa426b096b3635daab6ce26b44d50f3baab5a4e5.

Reverted https://github.com/pytorch/pytorch/pull/126819 on behalf of https://github.com/izaitsevfb due to suspicious build instructions count regression, see [D58015016](https://www.internalfb.com/diff/D58015016) ([comment](https://github.com/pytorch/pytorch/pull/126814#issuecomment-2143545818))
2024-06-01 18:46:16 +00:00
42312a52b3 [DSD] Adds type_check param to copy state dict utils (#127417)
[DSD] Adds type_check param to copy state dict utils.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127417
Approved by: https://github.com/fegin
2024-06-01 17:50:52 +00:00
edffb28d39 [BE][Ez]: Enable B019 - flags memory leaks through LRU cache on method (#127686)
Flags potential memory leaks through an LRU cache on a method and will hopefully make future contributors rethink this pattern, which can cause memory leaks. Adds `noqa` to the violations we currently have (these should be fixed later).
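
A minimal sketch of the pattern B019 flags. Both classes are hypothetical and are not from the PyTorch codebase: `Leaky` decorates a method with `functools.lru_cache`, so the cache (stored on the function) keeps a reference to `self`; `PerInstance` keeps its cache on the instance instead.

```python
import functools
import gc
import weakref


class Leaky:
    """Hypothetical class: lru_cache on a method keys the cache on `self`."""

    @functools.lru_cache(maxsize=None)
    def compute(self, x: int) -> int:
        return x * 2


class PerInstance:
    """Hypothetical alternative: the cache lives on the instance itself."""

    def __init__(self) -> None:
        self._cache = {}

    def compute(self, x: int) -> int:
        if x not in self._cache:
            self._cache[x] = x * 2
        return self._cache[x]


def survives_deletion(cls) -> bool:
    """Return True if an instance is still alive after its last name is deleted."""
    obj = cls()
    obj.compute(1)
    ref = weakref.ref(obj)
    del obj
    gc.collect()
    return ref() is not None


leaky_survives = survives_deletion(Leaky)
per_instance_survives = survives_deletion(PerInstance)
```

In this sketch the `Leaky` instance stays reachable through the class-level cache, while the `PerInstance` object is freed as soon as its last reference goes away.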

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127686
Approved by: https://github.com/c-p-i-o
2024-06-01 17:19:24 +00:00
22f392ba40 Revert "[easy?] Move AsyncCompile to a different file (#127235)"
This reverts commit f58fc16e8f059232f452a333f32e14ff681e12af.

Reverted https://github.com/pytorch/pytorch/pull/127235 on behalf of https://github.com/izaitsevfb due to breaking internal tests, see [D58015187](https://www.internalfb.com/diff/D58015187) ([comment](https://github.com/pytorch/pytorch/pull/127235#issuecomment-2143518610))
2024-06-01 17:16:16 +00:00
d49dc8f4b8 Revert "Add noqa to prevent lint warnings (#127545)"
This reverts commit f9937afd4f87fbb4844642ae2f587b13b5caa08c.

Reverted https://github.com/pytorch/pytorch/pull/127545 on behalf of https://github.com/izaitsevfb due to reverting to unblock the revert of #127545 ([comment](https://github.com/pytorch/pytorch/pull/127545#issuecomment-2143517711))
2024-06-01 17:12:46 +00:00
114c752b14 Revert "Improve MAGMA conditional macro in BatchLinearAlgebra.cpp (#127495)"
This reverts commit ee08cf57924a4230edad3101666890d8fe050c75.

Reverted https://github.com/pytorch/pytorch/pull/127495 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/127495#issuecomment-2143508218))
2024-06-01 16:39:06 +00:00
efcea2d2fd [dynamo] Support __getitem__ on NNModuleVariable __dict__ (#126956)
Moves further along (but still fails) for the testcase in https://github.com/pytorch/pytorch/pull/126875

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126956
Approved by: https://github.com/jansel
ghstack dependencies: #126923
2024-06-01 15:22:45 +00:00
4129c3e596 Let us find out why we wrote foreach meta regs (#127623)
Turns out it was for no reason!...well, after realizing that these ops are all CompositeExplicit, their meta impls come for free.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127623
Approved by: https://github.com/mikaylagawarecki
ghstack dependencies: #127412
2024-06-01 13:58:18 +00:00
ac60bdaf01 Allow slow foreach to run for any backend, not just CPU (#127412)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127412
Approved by: https://github.com/albanD
2024-06-01 13:58:18 +00:00
4aa7a1efcf [dynamo] Initial exception handling support (#126923)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126923
Approved by: https://github.com/williamwen42, https://github.com/jansel
2024-06-01 13:00:32 +00:00
25994a7ed1 [AOTI] Fix a bug when mutated buffer meets .to (#127671)
Summary: Before this change, the added unit test would trigger: `AssertionError: Can not find the original value for L__self____tensor_constant0_cuda0`. The reason is that GraphLowering.constant_name can rename a constant with a device suffix, but AOTI requires the new name to be registered properly.

Differential Revision: [D58047165](https://our.internmc.facebook.com/intern/diff/D58047165)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127671
Approved by: https://github.com/ColinPeppler, https://github.com/22quinn
2024-06-01 12:30:56 +00:00
c3be459f26 [inductor] fix mkldnn linear binary fusion check ut (#127296)
In this PR:

(1) Fix the unary fusion for bf16 conv/linear.
    Previously we registered the same fusion pattern for both `bf16` and `fp16`, and we did not check the dtype while matching the pattern. As a result, the `fp16` case matched the `bf16` pattern, but in the later replacement we found an unexpected float16 and did not fuse. We fix this by checking dtypes so that the `fp16` case no longer matches the `bf16` pattern.

```
  def _is_valid_computation_unary_fusion(computation_op, lowp_dtype=None):
      def fn(match):
          matched = _is_single_computation_op(computation_op, lowp_dtype)(match)  # previously lowp_dtype was not checked here

```

This was not exposed before because we only checked the match count, which was correct anyway since the pattern did match. To address this, we add a check on the number of `generated_kernel`. If the ops are not fused, there will be an additional kernel to compute the post op.

(2) Previously the unit test
```
python test/inductor/test_mkldnn_pattern_matcher.py -k test_linear_binary
```
did not check the fusion status; this PR fixes that.

(3) Extend `test_conv_binary` to also test low-precision (lp) dtypes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127296
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jansel
2024-06-01 11:10:29 +00:00
820 changed files with 21128 additions and 22895 deletions

View File

@ -91,9 +91,9 @@ _UCC_COMMIT=20eae37090a4ce1b32bcce6144ccad0b49943e0b
# configuration, so we hardcode everything here rather than do it
# from scratch
case "$image" in
pytorch-linux-focal-cuda12.4-cudnn8-py3-gcc9)
pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9)
CUDA_VERSION=12.4.0
CUDNN_VERSION=8
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
PROTOBUF=yes
@ -105,9 +105,9 @@ case "$image" in
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-focal-cuda12.1-cudnn8-py3-gcc9)
pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9)
CUDA_VERSION=12.1.1
CUDNN_VERSION=8
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
PROTOBUF=yes
@ -119,9 +119,9 @@ case "$image" in
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-focal-cuda12.4-cudnn8-py3-gcc9-inductor-benchmarks)
pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9-inductor-benchmarks)
CUDA_VERSION=12.4.0
CUDNN_VERSION=8
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
PROTOBUF=yes
@ -134,9 +134,9 @@ case "$image" in
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-focal-cuda12.1-cudnn8-py3-gcc9-inductor-benchmarks)
pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9-inductor-benchmarks)
CUDA_VERSION=12.1.1
CUDNN_VERSION=8
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
PROTOBUF=yes
@ -149,9 +149,9 @@ case "$image" in
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-focal-cuda12.1-cudnn8-py3.12-gcc9-inductor-benchmarks)
pytorch-linux-focal-cuda12.1-cudnn9-py3.12-gcc9-inductor-benchmarks)
CUDA_VERSION=12.1.1
CUDNN_VERSION=8
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.12
GCC_VERSION=9
PROTOBUF=yes
@ -164,9 +164,9 @@ case "$image" in
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-focal-cuda12.4-cudnn8-py3.12-gcc9-inductor-benchmarks)
pytorch-linux-focal-cuda12.4-cudnn9-py3.12-gcc9-inductor-benchmarks)
CUDA_VERSION=12.4.0
CUDNN_VERSION=8
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.12
GCC_VERSION=9
PROTOBUF=yes
@ -179,9 +179,9 @@ case "$image" in
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-focal-cuda11.8-cudnn8-py3-gcc9)
pytorch-linux-focal-cuda11.8-cudnn9-py3-gcc9)
CUDA_VERSION=11.8.0
CUDNN_VERSION=8
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
PROTOBUF=yes
@ -193,9 +193,9 @@ case "$image" in
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-focal-cuda12.4-cudnn8-py3-gcc9)
pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9)
CUDA_VERSION=12.4.0
CUDNN_VERSION=8
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
PROTOBUF=yes
@ -207,9 +207,9 @@ case "$image" in
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-focal-cuda12.1-cudnn8-py3-gcc9)
pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9)
CUDA_VERSION=12.1.1
CUDNN_VERSION=8
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
PROTOBUF=yes
@ -221,9 +221,9 @@ case "$image" in
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-focal-cuda12.4-cudnn8-py3-gcc9)
pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9)
CUDA_VERSION=12.4.0
CUDNN_VERSION=8
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
PROTOBUF=yes
@ -330,10 +330,10 @@ case "$image" in
DOCS=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-jammy-cuda11.8-cudnn8-py3.8-clang12)
pytorch-linux-jammy-cuda11.8-cudnn9-py3.8-clang12)
ANACONDA_PYTHON_VERSION=3.8
CUDA_VERSION=11.8
CUDNN_VERSION=8
CUDNN_VERSION=9
CLANG_VERSION=12
PROTOBUF=yes
DB=yes
@ -380,7 +380,7 @@ case "$image" in
ANACONDA_PYTHON_VERSION=3.9
CONDA_CMAKE=yes
;;
pytorch-linux-jammy-cuda11.8-cudnn8-py3.9-linter)
pytorch-linux-jammy-cuda11.8-cudnn9-py3.9-linter)
ANACONDA_PYTHON_VERSION=3.9
CUDA_VERSION=11.8
CONDA_CMAKE=yes
@ -447,7 +447,7 @@ tmp_tag=$(basename "$(mktemp -u)" | tr '[:upper:]' '[:lower:]')
#when using cudnn version 8 install it separately from cuda
if [[ "$image" == *cuda* && ${OS} == "ubuntu" ]]; then
IMAGE_NAME="nvidia/cuda:${CUDA_VERSION}-cudnn${CUDNN_VERSION}-devel-ubuntu${UBUNTU_VERSION}"
if [[ ${CUDNN_VERSION} == 8 ]]; then
if [[ ${CUDNN_VERSION} == 9 ]]; then
IMAGE_NAME="nvidia/cuda:${CUDA_VERSION}-devel-ubuntu${UBUNTU_VERSION}"
fi
fi
@ -499,7 +499,7 @@ docker build \
"$@" \
.
# NVIDIA dockers for RC releases use tag names like `11.0-cudnn8-devel-ubuntu18.04-rc`,
# NVIDIA dockers for RC releases use tag names like `11.0-cudnn9-devel-ubuntu18.04-rc`,
# for this case we will set UBUNTU_VERSION to `18.04-rc` so that the Dockerfile could
# find the correct image. As a result, here we have to replace the
# "$UBUNTU_VERSION" == "18.04-rc"

View File

@ -118,6 +118,13 @@ COPY ./common/install_cache.sh install_cache.sh
ENV PATH /opt/cache/bin:$PATH
RUN bash ./install_cache.sh && rm install_cache.sh
# Install AOTriton
COPY ci_commit_pins/aotriton.txt aotriton.txt
COPY ./common/common_utils.sh common_utils.sh
COPY ./common/install_aotriton.sh install_aotriton.sh
RUN bash ./install_aotriton.sh /opt/rocm/aotriton && rm -rf install_aotriton.sh aotriton aotriton.txt common_utils.sh
ENV AOTRITON_INSTALLED_PREFIX /opt/rocm/aotriton
# Include BUILD_ENVIRONMENT environment variable in image
ARG BUILD_ENVIRONMENT
ENV BUILD_ENVIRONMENT ${BUILD_ENVIRONMENT}

View File

@ -0,0 +1 @@
24a3fe9cb57e5cda3c923df29743f9767194cc27

View File

@ -1 +1 @@
bbe6246e37d8aa791c67daaf9d9d61b26c9ccfdc
01cbe5045a6898c9a925f01435c8277b2fe6afcc

View File

@ -1 +1 @@
aee0630e1208fdd28411e8ec2448981eb37bc83a
45fff310c891f5a92d55445adf8cc9d29df5841e

View File

@ -0,0 +1,24 @@
#!/bin/bash
set -ex
source "$(dirname "${BASH_SOURCE[0]}")/common_utils.sh"
AOTRITON_DIR="aotriton"
AOTRITON_PINNED_NAME="aotriton" # No .txt extension
AOTRITON_PINNED_COMMIT=$(get_pinned_commit ${AOTRITON_PINNED_NAME})
AOTRITON_INSTALL_PREFIX="$1"
git clone https://github.com/ROCm/aotriton.git "${AOTRITON_DIR}"
cd "${AOTRITON_DIR}"
git checkout "${AOTRITON_PINNED_COMMIT}"
git submodule sync --recursive
git submodule update --init --recursive --force --depth 1
mkdir build
cd build
cmake .. -G Ninja -DCMAKE_INSTALL_PREFIX=./install_dir -DCMAKE_BUILD_TYPE=Release -DAOTRITON_COMPRESS_KERNEL=OFF -DAOTRITON_NO_PYTHON=ON -DAOTRITON_NO_SHARED=ON
ninja install
mkdir -p "${AOTRITON_INSTALL_PREFIX}"
cp -r install_dir/* "${AOTRITON_INSTALL_PREFIX}"
find /tmp/ -mindepth 1 -delete
rm -rf ~/.triton

View File

@ -3,7 +3,7 @@
set -ex
install_ubuntu() {
# NVIDIA dockers for RC releases use tag names like `11.0-cudnn8-devel-ubuntu18.04-rc`,
# NVIDIA dockers for RC releases use tag names like `11.0-cudnn9-devel-ubuntu18.04-rc`,
# for this case we will set UBUNTU_VERSION to `18.04-rc` so that the Dockerfile could
# find the correct image. As a result, here we have to check for
# "$UBUNTU_VERSION" == "18.04"*

View File

@ -1,23 +1,18 @@
#!/bin/bash
if [[ ${CUDNN_VERSION} == 8 ]]; then
if [[ -n "${CUDNN_VERSION}" ]]; then
# cuDNN license: https://developer.nvidia.com/cudnn/license_agreement
mkdir tmp_cudnn
pushd tmp_cudnn
if [[ ${CUDA_VERSION:0:4} == "12.4" ]]; then
CUDNN_NAME="cudnn-linux-x86_64-8.9.7.29_cuda12-archive"
curl --retry 3 -OLs https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/${CUDNN_NAME}.tar.xz
elif [[ ${CUDA_VERSION:0:4} == "12.1" ]]; then
CUDNN_NAME="cudnn-linux-x86_64-8.9.2.26_cuda12-archive"
curl --retry 3 -OLs https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/${CUDNN_NAME}.tar.xz
elif [[ ${CUDA_VERSION:0:4} == "11.8" ]]; then
CUDNN_NAME="cudnn-linux-x86_64-8.7.0.84_cuda11-archive"
curl --retry 3 -OLs https://developer.download.nvidia.com/compute/redist/cudnn/v8.7.0/local_installers/11.8/${CUDNN_NAME}.tar.xz
if [[ ${CUDA_VERSION:0:2} == "12" ]]; then
CUDNN_NAME="cudnn-linux-x86_64-9.1.0.70_cuda12-archive"
elif [[ ${CUDA_VERSION:0:2} == "11" ]]; then
CUDNN_NAME="cudnn-linux-x86_64-9.1.0.70_cuda11-archive"
else
print "Unsupported CUDA version ${CUDA_VERSION}"
exit 1
fi
curl --retry 3 -OLs https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/${CUDNN_NAME}.tar.xz
tar xf ${CUDNN_NAME}.tar.xz
cp -a ${CUDNN_NAME}/include/* /usr/local/cuda/include/
cp -a ${CUDNN_NAME}/lib/* /usr/local/cuda/lib64/
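
Read as a whole, the new side of this hunk picks one cuDNN 9 archive per CUDA major version and downloads it from a single URL. The following is a rough Python sketch of that selection only; the function names are hypothetical, while the version string and URL are copied from the hunk above.

```python
# Version string and base URL copied from the hunk above.
CUDNN_FULL_VERSION = "9.1.0.70"
BASE_URL = (
    "https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64"
)


def cudnn_archive_name(cuda_version: str) -> str:
    """Hypothetical helper mirroring the `${CUDA_VERSION:0:2}` branches."""
    major = cuda_version[:2]
    if major not in ("11", "12"):
        raise ValueError(f"Unsupported CUDA version {cuda_version}")
    return f"cudnn-linux-x86_64-{CUDNN_FULL_VERSION}_cuda{major}-archive"


def cudnn_download_url(cuda_version: str) -> str:
    """Hypothetical helper building the single download URL used for all versions."""
    return f"{BASE_URL}/{cudnn_archive_name(cuda_version)}.tar.xz"
```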

View File

@ -19,7 +19,7 @@ elif [ -n "${XPU_VERSION}" ]; then
TRITON_REPO="https://github.com/intel/intel-xpu-backend-for-triton"
TRITON_TEXT_FILE="triton-xpu"
else
TRITON_REPO="https://github.com/embg/triton"
TRITON_REPO="https://github.com/openai/triton"
TRITON_TEXT_FILE="triton"
fi

View File

@ -139,7 +139,7 @@ COPY --from=pytorch/llvm:9.0.1 /opt/llvm /opt/llvm
ARG CUDNN_VERSION
ARG CUDA_VERSION
COPY ./common/install_cudnn.sh install_cudnn.sh
RUN if [ "${CUDNN_VERSION}" -eq 8 ]; then bash install_cudnn.sh; fi
RUN if [ -n "${CUDNN_VERSION}" ]; then bash install_cudnn.sh; fi
RUN rm install_cudnn.sh
# Install CUSPARSELT

View File

@ -110,6 +110,13 @@ COPY ./common/install_cache.sh install_cache.sh
ENV PATH /opt/cache/bin:$PATH
RUN bash ./install_cache.sh && rm install_cache.sh
# Install AOTriton
COPY ci_commit_pins/aotriton.txt aotriton.txt
COPY ./common/common_utils.sh common_utils.sh
COPY ./common/install_aotriton.sh install_aotriton.sh
RUN bash ./install_aotriton.sh /opt/rocm/aotriton && rm -rf install_aotriton.sh aotriton aotriton.txt common_utils.sh
ENV AOTRITON_INSTALLED_PREFIX /opt/rocm/aotriton
# Include BUILD_ENVIRONMENT environment variable in image
ARG BUILD_ENVIRONMENT
ENV BUILD_ENVIRONMENT ${BUILD_ENVIRONMENT}

View File

@ -368,7 +368,7 @@ test_inductor_cpp_wrapper_abi_compatible() {
echo "Testing Inductor cpp wrapper mode with TORCHINDUCTOR_ABI_COMPATIBLE=1"
# cpu stack allocation causes segfault and needs more investigation
python test/run_test.py --include inductor/test_cpu_cpp_wrapper
PYTORCH_TESTING_DEVICE_ONLY_FOR="" python test/run_test.py --include inductor/test_cpu_cpp_wrapper
python test/run_test.py --include inductor/test_cuda_cpp_wrapper
TORCHINDUCTOR_CPP_WRAPPER=1 python benchmarks/dynamo/timm_models.py --device cuda --accuracy --amp \
@ -565,7 +565,11 @@ test_dynamo_benchmark() {
test_single_dynamo_benchmark "dashboard" "$suite" "$shard_id" "$@"
else
if [[ "${TEST_CONFIG}" == *cpu_inductor* ]]; then
test_single_dynamo_benchmark "inference" "$suite" "$shard_id" --inference --float32 "$@"
if [[ "${TEST_CONFIG}" == *freezing* ]]; then
test_single_dynamo_benchmark "inference" "$suite" "$shard_id" --inference --float32 --freezing "$@"
else
test_single_dynamo_benchmark "inference" "$suite" "$shard_id" --inference --float32 "$@"
fi
elif [[ "${TEST_CONFIG}" == *aot_inductor* ]]; then
test_single_dynamo_benchmark "inference" "$suite" "$shard_id" --inference --bfloat16 "$@"
else

View File

@ -1,9 +1,12 @@
self-hosted-runner:
labels:
# GitHub hosted x86 Linux runners
- linux.20_04.4x
- linux.20_04.16x
- linux.large
# Repo-specific LF hosted ARC runners
- linux.large.arc
# Organization-wide AWS Linux Runners
- linux.large
- linux.2xlarge
- linux.4xlarge
- linux.12xlarge
@ -13,16 +16,34 @@ self-hosted-runner:
- linux.8xlarge.nvidia.gpu
- linux.16xlarge.nvidia.gpu
- linux.g5.4xlarge.nvidia.gpu
# Organization-wide AWS Linux Runners on Linux Foundation account
- lf.linux.large
- lf.linux.2xlarge
- lf.linux.4xlarge
- lf.linux.12xlarge
- lf.linux.24xlarge
- lf.linux.arm64.2xlarge
- lf.linux.4xlarge.nvidia.gpu
- lf.linux.8xlarge.nvidia.gpu
- lf.linux.16xlarge.nvidia.gpu
- lf.linux.g5.4xlarge.nvidia.gpu
# Repo-specific IBM hosted S390x runner
- linux.s390x
# Organization wide AWS Windows runners
- windows.4xlarge.nonephemeral
- windows.8xlarge.nvidia.gpu
- windows.8xlarge.nvidia.gpu.nonephemeral
- windows.g5.4xlarge.nvidia.gpu
- bm-runner
# Organization-wide AMD hosted MI300 runners
- linux.rocm.gpu
# Repo-specific Apple hosted runners
- macos-m1-ultra
- macos-m2-14
# Org wise AWS `mac2.metal` runners (2020 Mac mini hardware powered by Apple silicon M1 processors)
- macos-m1-stable
- macos-m1-13
- macos-m1-14
# GitHub-hosted MacOS runners
- macos-latest-xlarge
- macos-13-xlarge
- macos-14-xlarge

View File

@ -8,6 +8,7 @@ ciflow_push_tags:
- ciflow/inductor
- ciflow/inductor-perf-compare
- ciflow/inductor-micro-benchmark
- ciflow/inductor-cu124
- ciflow/linux-aarch64
- ciflow/mps
- ciflow/nightly

View File

@ -5,4 +5,4 @@ ninja=1.10.2
numpy=1.23.3
pyyaml=6.0
setuptools=68.2.2
typing-extensions=4.3.0
typing-extensions=4.9.0

View File

@ -4,4 +4,4 @@ ninja=1.10.2
numpy=1.23.3
pyyaml=6.0
setuptools=68.2.2
typing-extensions=4.3.0
typing-extensions=4.9.0

View File

@ -2,7 +2,7 @@ numpy=1.22.3
pyyaml=6.0
setuptools=61.2.0
cmake=3.22.*
typing-extensions=4.3.0
typing-extensions=4.9.0
dataclasses=0.8
pip=22.2.2
pillow=10.0.1

View File

@ -4,7 +4,7 @@ numpy=1.21.2
pyyaml=5.3
setuptools=46.0.0
cmake=3.22.*
typing-extensions=4.3.0
typing-extensions=4.9.0
dataclasses=0.8
pip=22.2.2
pillow=10.0.1

View File

@ -19,7 +19,7 @@ CUDA_ARCHES = ["11.8", "12.1", "12.4"]
CUDA_ARCHES_FULL_VERSION = {"11.8": "11.8.0", "12.1": "12.1.1", "12.4": "12.4.0"}
CUDA_ARCHES_CUDNN_VERSION = {"11.8": "8", "12.1": "8", "12.4": "8"}
CUDA_ARCHES_CUDNN_VERSION = {"11.8": "9", "12.1": "9", "12.4": "9"}
ROCM_ARCHES = ["6.0", "6.1"]
@ -42,7 +42,7 @@ PYTORCH_EXTRA_INSTALL_REQUIREMENTS = {
"nvidia-cuda-nvrtc-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | " # noqa: B950
"nvidia-cuda-runtime-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cuda-cupti-cu11==11.8.87; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cudnn-cu11==8.7.0.84; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cudnn-cu11==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cublas-cu11==11.11.3.6; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cufft-cu11==10.9.0.58; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-curand-cu11==10.3.0.86; platform_system == 'Linux' and platform_machine == 'x86_64' | "
@ -55,7 +55,7 @@ PYTORCH_EXTRA_INSTALL_REQUIREMENTS = {
"nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | " # noqa: B950
"nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | "
@ -68,7 +68,7 @@ PYTORCH_EXTRA_INSTALL_REQUIREMENTS = {
"nvidia-cuda-nvrtc-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cuda-runtime-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cuda-cupti-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cudnn-cu12==8.9.7.29; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cublas-cu12==12.4.2.65; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cufft-cu12==11.2.0.44; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-curand-cu12==10.3.5.119; platform_system == 'Linux' and platform_machine == 'x86_64' | "
@ -347,6 +347,10 @@ def generate_wheels_matrix(
for python_version in python_versions:
for arch_version in arches:
gpu_arch_type = arch_type(arch_version)
# Disable py3.12 builds for ROCm because of triton dependency
# on llnl-hatchet, which doesn't have py3.12 wheels available
if gpu_arch_type == "rocm" and python_version == "3.12":
continue
gpu_arch_version = (
""
if arch_version == "cpu"

View File

@ -773,13 +773,13 @@ class TestBypassFailures(TestCase):
# than the one on the base commit. This should still count as broken trunk
"pr_num": 104214,
"related_failure_count": 0,
"unrelated_failure_count": 1,
"flaky_or_broken_trunk": 1,
},
{
# This PR had one broken trunk failure and it used ghstack
"pr_num": 105145,
"related_failure_count": 0,
"unrelated_failure_count": 1,
"flaky_or_broken_trunk": 1,
},
{
# The failure on the merge base was retried successfully and
@ -788,20 +788,20 @@ class TestBypassFailures(TestCase):
# be used to detect broken trunk
"pr_num": 107160,
"related_failure_count": 0,
"unrelated_failure_count": 4,
"flaky_or_broken_trunk": 1,
},
{
# This PR used Dr.CI broken trunk classification
"pr_num": 111253,
"related_failure_count": 1,
"unrelated_failure_count": 2,
"flaky_or_broken_trunk": 1,
},
]
for case in test_cases:
pr_num = case["pr_num"]
related_failure_count = case["related_failure_count"]
unrelated_failure_count = case["unrelated_failure_count"]
flaky_or_broken_trunk = case["flaky_or_broken_trunk"]
pr = GitHubPR("pytorch", "pytorch", pr_num)
checks = pr.get_checkrun_conclusions()
@ -823,7 +823,7 @@ class TestBypassFailures(TestCase):
)
self.assertTrue(len(pending) == 0)
self.assertTrue(
len(failed) == unrelated_failure_count + related_failure_count
len(failed) == flaky_or_broken_trunk + related_failure_count
)
def test_ignore_current(self, *args: Any) -> None:

View File

@ -2027,10 +2027,8 @@ def categorize_checks(
pending_checks: List[Tuple[str, Optional[str], Optional[int]]] = []
failed_checks: List[Tuple[str, Optional[str], Optional[int]]] = []
# ok_failed_checks is used with ok_failed_checks_threshold while ignorable_failed_checks
# is used to keep track of all ignorable failures when saving the merge record on Rockset
ok_failed_checks: List[Tuple[str, Optional[str], Optional[int]]] = []
ignorable_failed_checks: Dict[str, List[Any]] = defaultdict(list)
# failed_checks_categorization is used to keep track of all ignorable failures when saving the merge record on Rockset
failed_checks_categorization: Dict[str, List[Any]] = defaultdict(list)
# If required_checks is not set or empty, consider all names are relevant
relevant_checknames = [
@ -2058,36 +2056,38 @@ def categorize_checks(
continue
elif not is_passing_status(check_runs[checkname].status):
target = (
ignorable_failed_checks[classification]
failed_checks_categorization[classification]
if classification
in ("IGNORE_CURRENT_CHECK", "BROKEN_TRUNK", "FLAKY", "UNSTABLE")
else failed_checks
)
target.append((checkname, url, job_id))
if classification in ("BROKEN_TRUNK", "FLAKY", "UNSTABLE"):
ok_failed_checks.append((checkname, url, job_id))
flaky_or_broken_trunk = (
failed_checks_categorization["BROKEN_TRUNK"]
+ failed_checks_categorization["FLAKY"]
)
if ok_failed_checks:
if flaky_or_broken_trunk:
warn(
f"The following {len(ok_failed_checks)} checks failed but were likely due flakiness or broken trunk: "
+ ", ".join([x[0] for x in ok_failed_checks])
f"The following {len(flaky_or_broken_trunk)} checks failed but were likely due flakiness or broken trunk: "
+ ", ".join([x[0] for x in flaky_or_broken_trunk])
+ (
f" but this is greater than the threshold of {ok_failed_checks_threshold} so merge will fail"
if ok_failed_checks_threshold is not None
and len(ok_failed_checks) > ok_failed_checks_threshold
and len(flaky_or_broken_trunk) > ok_failed_checks_threshold
else ""
)
)
if (
ok_failed_checks_threshold is not None
and len(ok_failed_checks) > ok_failed_checks_threshold
and len(flaky_or_broken_trunk) > ok_failed_checks_threshold
):
failed_checks = failed_checks + ok_failed_checks
failed_checks = failed_checks + flaky_or_broken_trunk
# The list of ignorable_failed_checks is returned so that it can be saved into the Rockset merge record
return (pending_checks, failed_checks, ignorable_failed_checks)
# The list of failed_checks_categorization is returned so that it can be saved into the Rockset merge record
return (pending_checks, failed_checks, failed_checks_categorization)
def merge(
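
The new side of this hunk can be summarized as: failures with an ignorable classification are grouped by classification, and the BROKEN_TRUNK and FLAKY groups count as real failures only when together they exceed the threshold. The sketch below is a simplified, standalone illustration of that rule; `split_failures` and its input shape are hypothetical and are not the actual `categorize_checks` signature.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

IGNORABLE = ("IGNORE_CURRENT_CHECK", "BROKEN_TRUNK", "FLAKY", "UNSTABLE")


def split_failures(
    failures: List[Tuple[str, Optional[str]]],
    threshold: Optional[int],
) -> Tuple[List[str], Dict[str, List[str]]]:
    """Hypothetical helper: `failures` holds (check name, classification) pairs."""
    failed: List[str] = []
    categorized: Dict[str, List[str]] = defaultdict(list)
    for name, classification in failures:
        if classification in IGNORABLE:
            categorized[classification].append(name)
        else:
            failed.append(name)

    # Only broken-trunk and flaky failures are counted against the threshold.
    flaky_or_broken_trunk = categorized["BROKEN_TRUNK"] + categorized["FLAKY"]
    if threshold is not None and len(flaky_or_broken_trunk) > threshold:
        failed = failed + flaky_or_broken_trunk
    return failed, dict(categorized)


example = [("a", None), ("b", "FLAKY"), ("c", "BROKEN_TRUNK"), ("d", "UNSTABLE")]
failed_strict, _ = split_failures(example, threshold=1)
failed_lenient, groups = split_failures(example, threshold=2)
```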

View File

@ -38,19 +38,19 @@ jobs:
matrix:
runner: [linux.12xlarge]
docker-image-name: [
pytorch-linux-focal-cuda12.4-cudnn8-py3-gcc9,
pytorch-linux-focal-cuda12.4-cudnn8-py3-gcc9-inductor-benchmarks,
pytorch-linux-focal-cuda12.4-cudnn8-py3.12-gcc9-inductor-benchmarks,
pytorch-linux-focal-cuda12.1-cudnn8-py3-gcc9,
pytorch-linux-focal-cuda12.1-cudnn8-py3-gcc9-inductor-benchmarks,
pytorch-linux-focal-cuda12.1-cudnn8-py3.12-gcc9-inductor-benchmarks,
pytorch-linux-focal-cuda11.8-cudnn8-py3-gcc9,
pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9,
pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9-inductor-benchmarks,
pytorch-linux-focal-cuda12.4-cudnn9-py3.12-gcc9-inductor-benchmarks,
pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9,
pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9-inductor-benchmarks,
pytorch-linux-focal-cuda12.1-cudnn9-py3.12-gcc9-inductor-benchmarks,
pytorch-linux-focal-cuda11.8-cudnn9-py3-gcc9,
pytorch-linux-focal-py3.8-clang10,
pytorch-linux-focal-py3.11-clang10,
pytorch-linux-focal-py3.12-clang10,
pytorch-linux-focal-rocm-n-1-py3,
pytorch-linux-focal-rocm-n-py3,
pytorch-linux-jammy-cuda11.8-cudnn8-py3.8-clang12,
pytorch-linux-jammy-cuda11.8-cudnn9-py3.8-clang12,
pytorch-linux-focal-py3-clang9-android-ndk-r21e,
pytorch-linux-jammy-py3.8-gcc11,
pytorch-linux-jammy-py3.8-gcc11-inductor-benchmarks,
@ -58,7 +58,7 @@ jobs:
pytorch-linux-jammy-py3-clang15-asan,
pytorch-linux-focal-py3-clang10-onnx,
pytorch-linux-focal-linter,
pytorch-linux-jammy-cuda11.8-cudnn8-py3.9-linter,
pytorch-linux-jammy-cuda11.8-cudnn9-py3.9-linter,
pytorch-linux-jammy-py3-clang12-executorch
]
include:


@ -149,3 +149,10 @@ jobs:
- name: Teardown Linux
uses: pytorch/test-infra/.github/actions/teardown-linux@main
if: always()
validate:
needs: build
uses: pytorch/builder/.github/workflows/validate-docker-images.yml@main
with:
channel: nightly
ref: main


@ -54,7 +54,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_8-cpu-aarch64
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_8-cpu-aarch64-test: # Testing
@ -162,7 +162,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_9-cpu-aarch64
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_9-cpu-aarch64-test: # Testing
@ -270,7 +270,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_10-cpu-aarch64
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_10-cpu-aarch64-test: # Testing
@ -378,7 +378,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_11-cpu-aarch64
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_11-cpu-aarch64-test: # Testing
@ -486,7 +486,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_12-cpu-aarch64
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_12-cpu-aarch64-test: # Testing
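
Note on the hunks above: each `PYTORCH_EXTRA_INSTALL_REQUIREMENTS` value is a `|`-separated list of pinned requirements with environment markers, and the two versions of each line are hard to compare by eye. The sketch below is one way to list which pins differ between two such strings, assuming the `name==version; marker | ...` layout shown in the diff. The helper names `parse_pins` and `changed_pins` are hypothetical and not part of the PyTorch tooling.

```python
from typing import Dict, Optional, Tuple


def parse_pins(requirements: str) -> Dict[str, str]:
    """Map package name to pinned version for a '|'-separated requirement string."""
    pins: Dict[str, str] = {}
    for entry in requirements.split("|"):
        # Drop the environment marker (everything after the first ';').
        spec = entry.split(";", 1)[0].strip()
        name, sep, version = spec.partition("==")
        if sep:  # skip empty or unpinned entries
            pins[name.strip()] = version.strip()
    return pins


def changed_pins(
    old: str, new: str
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return {package: (old_version, new_version)} for pins that differ."""
    old_pins, new_pins = parse_pins(old), parse_pins(new)
    names = sorted(set(old_pins) | set(new_pins))
    return {
        name: (old_pins.get(name), new_pins.get(name))
        for name in names
        if old_pins.get(name) != new_pins.get(name)
    }


# Shortened excerpts of the old and new values from the hunks above.
MARKER = "platform_system == 'Linux' and platform_machine == 'x86_64'"
old_value = (
    f"nvidia-cudnn-cu12==8.9.2.26; {MARKER} | "
    f"nvidia-nccl-cu12==2.20.5; {MARKER}"
)
new_value = (
    f"nvidia-cudnn-cu12==9.1.0.70; {MARKER} | "
    f"nvidia-nccl-cu12==2.20.5; {MARKER}"
)

diff = changed_pins(old_value, new_value)
```

Splitting on `;` before looking for `==` matters here, because the markers themselves contain `==`.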


@ -48,7 +48,7 @@ jobs:
DESIRED_PYTHON: "3.8"
build_name: manywheel-py3_8-cuda11_8
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu11==11.8.87; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu11==8.7.0.84; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu11==11.11.3.6; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu11==10.9.0.58; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu11==10.3.0.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu11==11.4.1.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu11==11.7.5.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu11==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu11==11.8.86; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu11==11.8.87; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu11==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu11==11.11.3.6; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu11==10.9.0.58; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu11==10.3.0.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu11==11.4.1.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu11==11.7.5.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu11==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu11==11.8.86; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_8-cuda11_8-test: # Testing
@ -88,7 +88,7 @@ jobs:
DESIRED_PYTHON: "3.8"
build_name: manywheel-py3_8-cuda12_1
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_8-cuda12_1-test: # Testing
@ -128,7 +128,7 @@ jobs:
DESIRED_PYTHON: "3.8"
build_name: manywheel-py3_8-cuda12_4
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.7.29; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.2.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.0.44; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.119; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.0.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.0.142; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.2.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.0.44; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.119; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.0.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.0.142; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_8-cuda12_4-test: # Testing


@ -174,7 +174,7 @@ jobs:
DESIRED_PYTHON: "3.8"
build_name: manywheel-py3_8-cuda11_8
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu11==11.8.87; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu11==8.7.0.84; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu11==11.11.3.6; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu11==10.9.0.58; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu11==10.3.0.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu11==11.4.1.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu11==11.7.5.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu11==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu11==11.8.86; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu11==11.8.87; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu11==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu11==11.11.3.6; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu11==10.9.0.58; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu11==10.3.0.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu11==11.4.1.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu11==11.7.5.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu11==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu11==11.8.86; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_8-cuda11_8-test: # Testing
@ -237,7 +237,7 @@ jobs:
DESIRED_PYTHON: "3.8"
build_name: manywheel-py3_8-cuda12_1
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_8-cuda12_1-test: # Testing
@ -300,7 +300,7 @@ jobs:
DESIRED_PYTHON: "3.8"
build_name: manywheel-py3_8-cuda12_4
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.7.29; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.2.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.0.44; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.119; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.0.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.0.142; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.2.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.0.44; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.119; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.0.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.0.142; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_8-cuda12_4-test: # Testing
@ -690,7 +690,7 @@ jobs:
DESIRED_PYTHON: "3.9"
build_name: manywheel-py3_9-cuda11_8
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu11==11.8.87; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu11==8.7.0.84; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu11==11.11.3.6; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu11==10.9.0.58; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu11==10.3.0.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu11==11.4.1.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu11==11.7.5.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu11==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu11==11.8.86; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu11==11.8.87; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu11==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu11==11.11.3.6; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu11==10.9.0.58; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu11==10.3.0.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu11==11.4.1.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu11==11.7.5.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu11==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu11==11.8.86; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_9-cuda11_8-test: # Testing
@ -753,7 +753,7 @@ jobs:
DESIRED_PYTHON: "3.9"
build_name: manywheel-py3_9-cuda12_1
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_9-cuda12_1-test: # Testing
@ -816,7 +816,7 @@ jobs:
DESIRED_PYTHON: "3.9"
build_name: manywheel-py3_9-cuda12_4
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.7.29; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.2.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.0.44; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.119; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.0.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.0.142; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.2.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.0.44; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.119; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.0.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.0.142; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_9-cuda12_4-test: # Testing
@ -1206,7 +1206,7 @@ jobs:
DESIRED_PYTHON: "3.10"
build_name: manywheel-py3_10-cuda11_8
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu11==11.8.87; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu11==8.7.0.84; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu11==11.11.3.6; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu11==10.9.0.58; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu11==10.3.0.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu11==11.4.1.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu11==11.7.5.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu11==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu11==11.8.86; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu11==11.8.87; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu11==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu11==11.11.3.6; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu11==10.9.0.58; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu11==10.3.0.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu11==11.4.1.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu11==11.7.5.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu11==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu11==11.8.86; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_10-cuda11_8-test: # Testing
@ -1269,7 +1269,7 @@ jobs:
DESIRED_PYTHON: "3.10"
build_name: manywheel-py3_10-cuda12_1
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_10-cuda12_1-test: # Testing
@ -1332,7 +1332,7 @@ jobs:
DESIRED_PYTHON: "3.10"
build_name: manywheel-py3_10-cuda12_4
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.7.29; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.2.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.0.44; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.119; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.0.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.0.142; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.2.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.0.44; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.119; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.0.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.0.142; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_10-cuda12_4-test: # Testing
@ -1722,7 +1722,7 @@ jobs:
DESIRED_PYTHON: "3.11"
build_name: manywheel-py3_11-cuda11_8
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu11==11.8.87; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu11==8.7.0.84; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu11==11.11.3.6; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu11==10.9.0.58; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu11==10.3.0.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu11==11.4.1.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu11==11.7.5.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu11==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu11==11.8.86; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu11==11.8.87; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu11==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu11==11.11.3.6; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu11==10.9.0.58; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu11==10.3.0.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu11==11.4.1.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu11==11.7.5.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu11==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu11==11.8.86; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_11-cuda11_8-test: # Testing
@ -1785,7 +1785,7 @@ jobs:
DESIRED_PYTHON: "3.11"
build_name: manywheel-py3_11-cuda12_1
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_11-cuda12_1-test: # Testing
@ -1848,7 +1848,7 @@ jobs:
DESIRED_PYTHON: "3.11"
build_name: manywheel-py3_11-cuda12_4
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.7.29; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.2.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.0.44; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.119; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.0.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.0.142; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.2.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.0.44; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.119; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.0.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.0.142; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_11-cuda12_4-test: # Testing
@ -2238,7 +2238,7 @@ jobs:
DESIRED_PYTHON: "3.12"
build_name: manywheel-py3_12-cuda11_8
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu11==11.8.87; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu11==8.7.0.84; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu11==11.11.3.6; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu11==10.9.0.58; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu11==10.3.0.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu11==11.4.1.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu11==11.7.5.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu11==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu11==11.8.86; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu11==11.8.87; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu11==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu11==11.11.3.6; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu11==10.9.0.58; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu11==10.3.0.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu11==11.4.1.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu11==11.7.5.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu11==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu11==11.8.86; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_12-cuda11_8-test: # Testing
@ -2301,7 +2301,7 @@ jobs:
DESIRED_PYTHON: "3.12"
build_name: manywheel-py3_12-cuda12_1
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_12-cuda12_1-test: # Testing
@ -2364,7 +2364,7 @@ jobs:
DESIRED_PYTHON: "3.12"
build_name: manywheel-py3_12-cuda12_4
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.7.29; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.2.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.0.44; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.119; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.0.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.0.142; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.2.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.0.44; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.119; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.0.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.0.142; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_12-cuda12_4-test: # Testing
@ -2410,209 +2410,3 @@ jobs:
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_12-rocm6_0-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: rocm6.0
GPU_ARCH_VERSION: 6.0
GPU_ARCH_TYPE: rocm
DOCKER_IMAGE: pytorch/manylinux-builder:rocm6.0-main
DESIRED_PYTHON: "3.12"
build_name: manywheel-py3_12-rocm6_0
build_environment: linux-binary-manywheel
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_12-rocm6_0-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs: manywheel-py3_12-rocm6_0-build
runs-on: linux.rocm.gpu
timeout-minutes: 240
env:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: rocm6.0
GPU_ARCH_VERSION: 6.0
GPU_ARCH_TYPE: rocm
SKIP_ALL_TESTS: 1
DOCKER_IMAGE: pytorch/manylinux-builder:rocm6.0-main
DESIRED_PYTHON: "3.12"
steps:
- name: Setup ROCm
uses: ./.github/actions/setup-rocm
- uses: actions/download-artifact@v3
name: Download Build Artifacts
with:
name: manywheel-py3_12-rocm6_0
path: "${{ runner.temp }}/artifacts/"
- name: Checkout PyTorch
uses: malfet/checkout@silent-checkout
with:
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
submodules: recursive
path: pytorch
quiet-checkout: true
- name: Clean PyTorch checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: pytorch
- name: Checkout pytorch/builder
uses: malfet/checkout@silent-checkout
with:
ref: main
submodules: recursive
repository: pytorch/builder
path: builder
quiet-checkout: true
- name: Clean pytorch/builder checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: builder
- name: ROCm set GPU_FLAG
run: |
echo "GPU_FLAG=--device=/dev/mem --device=/dev/kfd --device=/dev/dri --group-add video --group-add daemon" >> "${GITHUB_ENV}"
- name: Pull Docker image
uses: pytorch/test-infra/.github/actions/pull-docker-image@main
with:
docker-image: pytorch/manylinux-builder:rocm6.0-main
- name: Test Pytorch binary
uses: ./pytorch/.github/actions/test-pytorch-binary
- name: Teardown ROCm
uses: ./.github/actions/teardown-rocm
manywheel-py3_12-rocm6_0-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_12-rocm6_0-test
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: rocm6.0
GPU_ARCH_VERSION: 6.0
GPU_ARCH_TYPE: rocm
DOCKER_IMAGE: pytorch/manylinux-builder:rocm6.0-main
DESIRED_PYTHON: "3.12"
build_name: manywheel-py3_12-rocm6_0
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_12-rocm6_1-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: rocm6.1
GPU_ARCH_VERSION: 6.1
GPU_ARCH_TYPE: rocm
DOCKER_IMAGE: pytorch/manylinux-builder:rocm6.1-main
DESIRED_PYTHON: "3.12"
build_name: manywheel-py3_12-rocm6_1
build_environment: linux-binary-manywheel
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_12-rocm6_1-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs: manywheel-py3_12-rocm6_1-build
runs-on: linux.rocm.gpu
timeout-minutes: 240
env:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: rocm6.1
GPU_ARCH_VERSION: 6.1
GPU_ARCH_TYPE: rocm
SKIP_ALL_TESTS: 1
DOCKER_IMAGE: pytorch/manylinux-builder:rocm6.1-main
DESIRED_PYTHON: "3.12"
steps:
- name: Setup ROCm
uses: ./.github/actions/setup-rocm
- uses: actions/download-artifact@v3
name: Download Build Artifacts
with:
name: manywheel-py3_12-rocm6_1
path: "${{ runner.temp }}/artifacts/"
- name: Checkout PyTorch
uses: malfet/checkout@silent-checkout
with:
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
submodules: recursive
path: pytorch
quiet-checkout: true
- name: Clean PyTorch checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: pytorch
- name: Checkout pytorch/builder
uses: malfet/checkout@silent-checkout
with:
ref: main
submodules: recursive
repository: pytorch/builder
path: builder
quiet-checkout: true
- name: Clean pytorch/builder checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: builder
- name: ROCm set GPU_FLAG
run: |
echo "GPU_FLAG=--device=/dev/mem --device=/dev/kfd --device=/dev/dri --group-add video --group-add daemon" >> "${GITHUB_ENV}"
- name: Pull Docker image
uses: pytorch/test-infra/.github/actions/pull-docker-image@main
with:
docker-image: pytorch/manylinux-builder:rocm6.1-main
- name: Test Pytorch binary
uses: ./pytorch/.github/actions/test-pytorch-binary
- name: Teardown ROCm
uses: ./.github/actions/teardown-rocm
manywheel-py3_12-rocm6_1-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_12-rocm6_1-test
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: rocm6.1
GPU_ARCH_VERSION: 6.1
GPU_ARCH_TYPE: rocm
DOCKER_IMAGE: pytorch/manylinux-builder:rocm6.1-main
DESIRED_PYTHON: "3.12"
build_name: manywheel-py3_12-rocm6_1
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml

@ -54,7 +54,7 @@ jobs:
ALPINE_IMAGE: "docker.io/s390x/alpine"
build_name: manywheel-py3_8-cpu-s390x
build_environment: linux-s390x-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_8-cpu-s390x-test: # Testing
@ -117,7 +117,7 @@ jobs:
ALPINE_IMAGE: "docker.io/s390x/alpine"
build_name: manywheel-py3_9-cpu-s390x
build_environment: linux-s390x-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_9-cpu-s390x-test: # Testing
@ -180,7 +180,7 @@ jobs:
ALPINE_IMAGE: "docker.io/s390x/alpine"
build_name: manywheel-py3_10-cpu-s390x
build_environment: linux-s390x-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_10-cpu-s390x-test: # Testing
@ -243,7 +243,7 @@ jobs:
ALPINE_IMAGE: "docker.io/s390x/alpine"
build_name: manywheel-py3_11-cpu-s390x
build_environment: linux-s390x-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_11-cpu-s390x-test: # Testing
@ -306,7 +306,7 @@ jobs:
ALPINE_IMAGE: "docker.io/s390x/alpine"
build_name: manywheel-py3_12-cpu-s390x
build_environment: linux-s390x-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_12-cpu-s390x-test: # Testing
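
Each hunk above differs only in one pin inside the long PYTORCH_EXTRA_INSTALL_REQUIREMENTS value, which is hard to spot by eye. The following is a minimal sketch for comparing two such values, assuming the format seen in these lines: entries separated by `|`, each written as `name==version; marker`. The helper names `parse_requirements` and `changed_pins` are hypothetical and are not part of the PyTorch tooling, and the OLD/NEW strings are shortened to two entries taken from the hunks for illustration.

```python
from __future__ import annotations


def parse_requirements(value: str) -> dict[str, tuple[str, str]]:
    """Map each package name to (pinned version, environment marker)."""
    pins: dict[str, tuple[str, str]] = {}
    for entry in value.split("|"):
        entry = entry.strip()
        if not entry:
            continue
        # Split off the marker first, so the '==' inside it is not mistaken
        # for the version pin.
        spec, _, marker = entry.partition(";")
        name, _, version = spec.strip().partition("==")
        pins[name.strip()] = (version.strip(), marker.strip())
    return pins


def changed_pins(old: str, new: str) -> dict[str, tuple[str, str]]:
    """Return {package: (old version, new version)} for pins that differ."""
    old_pins = parse_requirements(old)
    new_pins = parse_requirements(new)
    return {
        name: (old_pins[name][0], new_pins[name][0])
        for name in old_pins.keys() & new_pins.keys()
        if old_pins[name][0] != new_pins[name][0]
    }


# Shortened illustration: two of the eleven entries from the hunks above.
MARKER = "platform_system == 'Linux' and platform_machine == 'x86_64'"
OLD = f"nvidia-cudnn-cu12==8.9.2.26; {MARKER} | nvidia-nccl-cu12==2.20.5; {MARKER}"
NEW = f"nvidia-cudnn-cu12==9.1.0.70; {MARKER} | nvidia-nccl-cu12==2.20.5; {MARKER}"

print(changed_pins(OLD, NEW))
# {'nvidia-cudnn-cu12': ('8.9.2.26', '9.1.0.70')}
```

Run against the full old and new values from any hunk, the same comparison should report only the nvidia-cudnn-cu12 pin, since that is the only entry that differs between the paired lines shown here.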

@ -46,7 +46,7 @@ jobs:
GPU_ARCH_TYPE: cpu
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.8"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
# For sccache access (only on non-forked PRs)
AWS_ACCESS_KEY_ID: ${{ secrets.MACOS_SCCACHE_S3_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.MACOS_SCCACHE_S3_SECRET_ACCESS_KEY }}
@ -165,7 +165,7 @@ jobs:
GPU_ARCH_TYPE: cpu
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.9"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
# For sccache access (only on non-forked PRs)
AWS_ACCESS_KEY_ID: ${{ secrets.MACOS_SCCACHE_S3_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.MACOS_SCCACHE_S3_SECRET_ACCESS_KEY }}
@ -284,7 +284,7 @@ jobs:
GPU_ARCH_TYPE: cpu
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.10"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
# For sccache access (only on non-forked PRs)
AWS_ACCESS_KEY_ID: ${{ secrets.MACOS_SCCACHE_S3_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.MACOS_SCCACHE_S3_SECRET_ACCESS_KEY }}
@ -403,7 +403,7 @@ jobs:
GPU_ARCH_TYPE: cpu
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.11"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
# For sccache access (only on non-forked PRs)
AWS_ACCESS_KEY_ID: ${{ secrets.MACOS_SCCACHE_S3_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.MACOS_SCCACHE_S3_SECRET_ACCESS_KEY }}
@ -522,7 +522,7 @@ jobs:
GPU_ARCH_TYPE: cpu
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.12"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
# For sccache access (only on non-forked PRs)
AWS_ACCESS_KEY_ID: ${{ secrets.MACOS_SCCACHE_S3_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.MACOS_SCCACHE_S3_SECRET_ACCESS_KEY }}

@ -46,7 +46,7 @@ jobs:
GPU_ARCH_TYPE: cpu
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.8"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
- name: Display EC2 information
shell: bash
@ -290,7 +290,7 @@ jobs:
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.8"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
- name: Display EC2 information
shell: bash
@ -536,7 +536,7 @@ jobs:
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.8"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
- name: Display EC2 information
shell: bash
@ -782,7 +782,7 @@ jobs:
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.8"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
- name: Display EC2 information
shell: bash
@ -1027,7 +1027,7 @@ jobs:
GPU_ARCH_TYPE: cpu
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.9"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
- name: Display EC2 information
shell: bash
@@ -1271,7 +1271,7 @@ jobs:
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.9"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
- name: Display EC2 information
shell: bash
@@ -1517,7 +1517,7 @@ jobs:
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.9"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
- name: Display EC2 information
shell: bash
@@ -1763,7 +1763,7 @@ jobs:
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.9"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
- name: Display EC2 information
shell: bash
@@ -2008,7 +2008,7 @@ jobs:
GPU_ARCH_TYPE: cpu
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.10"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
- name: Display EC2 information
shell: bash
@@ -2252,7 +2252,7 @@ jobs:
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.10"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
- name: Display EC2 information
shell: bash
@@ -2498,7 +2498,7 @@ jobs:
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.10"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
- name: Display EC2 information
shell: bash
@@ -2744,7 +2744,7 @@ jobs:
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.10"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
- name: Display EC2 information
shell: bash
@@ -2989,7 +2989,7 @@ jobs:
GPU_ARCH_TYPE: cpu
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.11"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
- name: Display EC2 information
shell: bash
@@ -3233,7 +3233,7 @@ jobs:
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.11"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
- name: Display EC2 information
shell: bash
@@ -3479,7 +3479,7 @@ jobs:
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.11"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
- name: Display EC2 information
shell: bash
@@ -3725,7 +3725,7 @@ jobs:
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.11"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
- name: Display EC2 information
shell: bash
@@ -3970,7 +3970,7 @@ jobs:
GPU_ARCH_TYPE: cpu
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.12"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
- name: Display EC2 information
shell: bash
@@ -4214,7 +4214,7 @@ jobs:
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.12"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
- name: Display EC2 information
shell: bash
@@ -4460,7 +4460,7 @@ jobs:
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.12"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
- name: Display EC2 information
shell: bash
@ -4706,7 +4706,7 @@ jobs:
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.12"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
- name: Display EC2 information
shell: bash
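
The two long `PYTORCH_EXTRA_INSTALL_REQUIREMENTS` lines in each hunk above differ in a single pin, which is hard to spot by eye. The following is a minimal Python sketch for comparing them, assuming the value is a `|`-separated list of `name==version; marker` entries, as it appears in this diff. The helper names `parse_pins` and `changed_pins` are hypothetical, and the sample strings are abbreviated to three of the eleven packages.

```python
def parse_pins(value):
    """Map package name -> pinned version for a '|'-separated requirement string."""
    pins = {}
    for entry in value.split("|"):
        # Drop the environment marker that follows the first ';'.
        requirement = entry.split(";", 1)[0].strip()
        if not requirement:
            continue
        name, _, version = requirement.partition("==")
        pins[name.strip()] = version.strip()
    return pins


def changed_pins(old_value, new_value):
    """Return {name: (old_version, new_version)} for every pin that differs."""
    old, new = parse_pins(old_value), parse_pins(new_value)
    return {
        name: (old.get(name), new.get(name))
        for name in sorted(set(old) | set(new))
        if old.get(name) != new.get(name)
    }


MARKER = "platform_system == 'Linux' and platform_machine == 'x86_64'"
# Abbreviated samples (assumption): only three packages from the diff are used.
OLD = " | ".join(f"{pin}; {MARKER}" for pin in [
    "nvidia-cuda-nvrtc-cu12==12.1.105",
    "nvidia-cudnn-cu12==8.9.2.26",
    "nvidia-nccl-cu12==2.20.5",
])
NEW = OLD.replace("nvidia-cudnn-cu12==8.9.2.26", "nvidia-cudnn-cu12==9.1.0.70")

print(changed_pins(OLD, NEW))
# {'nvidia-cudnn-cu12': ('8.9.2.26', '9.1.0.70')}
```

Run against the full lines from the hunks, the same comparison should report only the `nvidia-cudnn-cu12` pin moving from 8.9.2.26 to 9.1.0.70.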

.github/workflows/inductor-cu124.yml vendored Normal file

@ -0,0 +1,108 @@
name: inductor-cu124
on:
push:
tags:
- ciflow/inductor-cu124/*
workflow_dispatch:
schedule:
# Run every 4 hours during the week and every 12 hours on the weekend
- cron: 45 0,4,8,12,16,20 * * 1-5
- cron: 45 4,12 * * 0,6
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}
cancel-in-progress: true
permissions: read-all
jobs:
linux-focal-cuda12_4-py3_10-gcc9-inductor-build:
# Should be synced with the one in inductor.yml, but this doesn't run inductor_timm
name: cuda12.4-py3.10-gcc9-sm86
uses: ./.github/workflows/_linux-build.yml
with:
sync-tag: linux-focal-cuda12_4-py3_10-gcc9-inductor-build
build-environment: linux-focal-cuda12.4-py3.10-gcc9-sm86
docker-image-name: pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9-inductor-benchmarks
cuda-arch-list: '8.6'
test-matrix: |
{ include: [
{ config: "inductor", shard: 1, num_shards: 1, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "inductor_distributed", shard: 1, num_shards: 1, runner: "linux.g5.12xlarge.nvidia.gpu" },
{ config: "inductor_huggingface", shard: 1, num_shards: 1, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "inductor_torchbench", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "inductor_torchbench", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "dynamic_inductor_huggingface", shard: 1, num_shards: 1, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "dynamic_inductor_timm", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "dynamic_inductor_timm", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "dynamic_inductor_torchbench", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "dynamic_inductor_torchbench", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "aot_inductor_huggingface", shard: 1, num_shards: 1, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "aot_inductor_timm", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "aot_inductor_timm", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "aot_inductor_torchbench", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "aot_inductor_torchbench", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "inductor_cpp_wrapper_abi_compatible", shard: 1, num_shards: 1, runner: "linux.g5.4xlarge.nvidia.gpu" },
]}
secrets:
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
linux-focal-cuda12_4-py3_10-gcc9-inductor-test:
name: cuda12.4-py3.10-gcc9-sm86
uses: ./.github/workflows/_linux-test.yml
needs: linux-focal-cuda12_4-py3_10-gcc9-inductor-build
with:
sync-tag: linux-focal-cuda12_4-py3_10-gcc9-inductor-test
build-environment: linux-focal-cuda12.4-py3.10-gcc9-sm86
docker-image: ${{ needs.linux-focal-cuda12_4-py3_10-gcc9-inductor-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-cuda12_4-py3_10-gcc9-inductor-build.outputs.test-matrix }}
secrets:
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
linux-focal-cuda12_4-py3_10-gcc9-inductor-build-gcp:
name: cuda12.4-py3.10-gcc9-sm80
uses: ./.github/workflows/_linux-build.yml
with:
build-environment: linux-focal-cuda12.4-py3.10-gcc9-sm80
docker-image-name: pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9-inductor-benchmarks
cuda-arch-list: '8.0'
test-matrix: |
{ include: [
{ config: "inductor_torchbench_smoketest_perf", shard: 1, num_shards: 1, runner: "linux.gcp.a100" },
]}
secrets:
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
linux-focal-cuda12_4-py3_10-gcc9-inductor-test-gcp:
name: cuda12.4-py3.10-gcc9-sm80
uses: ./.github/workflows/_linux-test.yml
needs: linux-focal-cuda12_4-py3_10-gcc9-inductor-build-gcp
with:
build-environment: linux-focal-cuda12.4-py3.10-gcc9-sm80
docker-image: ${{ needs.linux-focal-cuda12_4-py3_10-gcc9-inductor-build-gcp.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-cuda12_4-py3_10-gcc9-inductor-build-gcp.outputs.test-matrix }}
use-gha: anything-non-empty-to-use-gha
secrets:
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
linux-focal-cuda12_4-py3_12-gcc9-inductor-build:
name: cuda12.4-py3.12-gcc9-sm86
uses: ./.github/workflows/_linux-build.yml
with:
build-environment: linux-focal-cuda12.4-py3.12-gcc9-sm86
docker-image-name: pytorch-linux-focal-cuda12.4-cudnn9-py3.12-gcc9-inductor-benchmarks
cuda-arch-list: '8.6'
test-matrix: |
{ include: [
{ config: "inductor", shard: 1, num_shards: 1, runner: "linux.g5.4xlarge.nvidia.gpu" },
]}
linux-focal-cuda12_4-py3_12-gcc9-inductor-test:
name: cuda12.4-py3.12-gcc9-sm86
uses: ./.github/workflows/_linux-test.yml
needs: linux-focal-cuda12_4-py3_12-gcc9-inductor-build
with:
build-environment: linux-focal-cuda12.4-py3.12-gcc9-sm86
docker-image: ${{ needs.linux-focal-cuda12_4-py3_12-gcc9-inductor-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-cuda12_4-py3_12-gcc9-inductor-build.outputs.test-matrix }}
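
Each `test-matrix` entry above names a `config`, a `shard`, and a `num_shards`, so a config with `num_shards: 2` needs one entry per shard. The following is a rough Python sketch of a consistency check for that convention. It is not part of the PyTorch CI tooling: the function name `missing_shards` is hypothetical, and the matrix is a hand-transcribed subset written as a Python dict rather than parsed from the workflow file.

```python
from collections import defaultdict


def missing_shards(matrix):
    """Return {config: [missing shard numbers]} for an include-style test matrix."""
    seen = defaultdict(set)
    expected = {}
    for entry in matrix["include"]:
        seen[entry["config"]].add(entry["shard"])
        expected[entry["config"]] = entry["num_shards"]
    problems = {}
    for config, total in expected.items():
        missing = [s for s in range(1, total + 1) if s not in seen[config]]
        if missing:
            problems[config] = missing
    return problems


# Hand-transcribed subset (assumption) of the sm86 build matrix above.
RUNNER = "linux.g5.4xlarge.nvidia.gpu"
matrix = {"include": [
    {"config": "inductor", "shard": 1, "num_shards": 1, "runner": RUNNER},
    {"config": "inductor_torchbench", "shard": 1, "num_shards": 2, "runner": RUNNER},
    {"config": "inductor_torchbench", "shard": 2, "num_shards": 2, "runner": RUNNER},
]}

print(missing_shards(matrix))  # an empty dict means every shard is listed
```

Dropping the `shard: 2` entry for `inductor_torchbench` would make the function report `{'inductor_torchbench': [2]}`.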


@ -21,7 +21,7 @@ jobs:
uses: ./.github/workflows/_linux-build.yml
with:
build-environment: linux-focal-cuda12.1-py3.10-gcc9-sm80
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn8-py3-gcc9-inductor-benchmarks
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9-inductor-benchmarks
cuda-arch-list: '8.0'
test-matrix: |
{ include: [


@ -18,7 +18,7 @@ jobs:
uses: ./.github/workflows/_linux-build.yml
with:
build-environment: linux-focal-cuda12.1-py3.10-gcc9-sm80
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn8-py3-gcc9-inductor-benchmarks
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9-inductor-benchmarks
cuda-arch-list: '8.0'
test-matrix: |
{ include: [


@ -71,7 +71,7 @@ jobs:
uses: ./.github/workflows/_linux-build.yml
with:
build-environment: linux-focal-cuda12.1-py3.10-gcc9-sm80
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn8-py3-gcc9-inductor-benchmarks
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9-inductor-benchmarks
cuda-arch-list: '8.0'
test-matrix: |
{ include: [


@ -23,7 +23,7 @@ jobs:
uses: ./.github/workflows/_linux-build.yml
with:
build-environment: linux-focal-cuda12.1-py3.10-gcc9-sm86
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn8-py3-gcc9-inductor-benchmarks
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9-inductor-benchmarks
cuda-arch-list: '8.6'
test-matrix: |
{ include: [


@ -44,7 +44,7 @@ jobs:
uses: ./.github/workflows/_linux-build.yml
with:
build-environment: linux-focal-cuda12.1-py3.10-gcc9-sm86
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn8-py3-gcc9-inductor-benchmarks
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9-inductor-benchmarks
cuda-arch-list: '8.6'
test-matrix: |
{ include: [
@ -86,7 +86,7 @@ jobs:
uses: ./.github/workflows/_linux-build.yml
with:
build-environment: linux-focal-cuda12.1-py3.10-gcc9-sm80
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn8-py3-gcc9-inductor-benchmarks
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9-inductor-benchmarks
cuda-arch-list: '8.0'
test-matrix: |
{ include: [
@ -112,7 +112,7 @@ jobs:
uses: ./.github/workflows/_linux-build.yml
with:
build-environment: linux-focal-cuda12.1-py3.12-gcc9-sm86
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn8-py3.12-gcc9-inductor-benchmarks
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn9-py3.12-gcc9-inductor-benchmarks
cuda-arch-list: '8.6'
test-matrix: |
{ include: [
@ -129,32 +129,18 @@ jobs:
test-matrix: ${{ needs.linux-focal-cuda12_1-py3_12-gcc9-inductor-build.outputs.test-matrix }}
linux-focal-cuda12_4-py3_10-gcc9-inductor-build:
# Should be synced with the one in inductor-periodic.yml but this only runs inductor_timm
name: cuda12.4-py3.10-gcc9-sm86
uses: ./.github/workflows/_linux-build.yml
with:
sync-tag: linux-focal-cuda12_4-py3_10-gcc9-inductor-build
build-environment: linux-focal-cuda12.4-py3.10-gcc9-sm86
docker-image-name: pytorch-linux-focal-cuda12.4-cudnn8-py3-gcc9-inductor-benchmarks
docker-image-name: pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9-inductor-benchmarks
cuda-arch-list: '8.6'
test-matrix: |
{ include: [
{ config: "inductor", shard: 1, num_shards: 1, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "inductor_distributed", shard: 1, num_shards: 1, runner: "linux.g5.12xlarge.nvidia.gpu" },
{ config: "inductor_huggingface", shard: 1, num_shards: 1, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "inductor_timm", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "inductor_timm", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "inductor_torchbench", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "inductor_torchbench", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "dynamic_inductor_huggingface", shard: 1, num_shards: 1, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "dynamic_inductor_timm", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "dynamic_inductor_timm", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "dynamic_inductor_torchbench", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "dynamic_inductor_torchbench", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "aot_inductor_huggingface", shard: 1, num_shards: 1, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "aot_inductor_timm", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "aot_inductor_timm", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "aot_inductor_torchbench", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "aot_inductor_torchbench", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "inductor_cpp_wrapper_abi_compatible", shard: 1, num_shards: 1, runner: "linux.g5.4xlarge.nvidia.gpu" },
]}
secrets:
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
@ -164,59 +150,13 @@ jobs:
uses: ./.github/workflows/_linux-test.yml
needs: linux-focal-cuda12_4-py3_10-gcc9-inductor-build
with:
sync-tag: linux-focal-cuda12_4-py3_10-gcc9-inductor-test
build-environment: linux-focal-cuda12.4-py3.10-gcc9-sm86
docker-image: ${{ needs.linux-focal-cuda12_4-py3_10-gcc9-inductor-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-cuda12_4-py3_10-gcc9-inductor-build.outputs.test-matrix }}
secrets:
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
linux-focal-cuda12_4-py3_10-gcc9-inductor-build-gcp:
name: cuda12.4-py3.10-gcc9-sm80
uses: ./.github/workflows/_linux-build.yml
with:
build-environment: linux-focal-cuda12.4-py3.10-gcc9-sm80
docker-image-name: pytorch-linux-focal-cuda12.4-cudnn8-py3-gcc9-inductor-benchmarks
cuda-arch-list: '8.0'
test-matrix: |
{ include: [
{ config: "inductor_torchbench_smoketest_perf", shard: 1, num_shards: 1, runner: "linux.gcp.a100" },
]}
secrets:
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
linux-focal-cuda12_4-py3_12-gcc9-inductor-build:
name: cuda12.4-py3.12-gcc9-sm86
uses: ./.github/workflows/_linux-build.yml
with:
build-environment: linux-focal-cuda12.4-py3.12-gcc9-sm86
docker-image-name: pytorch-linux-focal-cuda12.4-cudnn8-py3.12-gcc9-inductor-benchmarks
cuda-arch-list: '8.6'
test-matrix: |
{ include: [
{ config: "inductor", shard: 1, num_shards: 1, runner: "linux.g5.4xlarge.nvidia.gpu" },
]}
linux-focal-cuda12_4-py3_10-gcc9-inductor-test-gcp:
name: cuda12.4-py3.10-gcc9-sm80
uses: ./.github/workflows/_linux-test.yml
needs: linux-focal-cuda12_4-py3_10-gcc9-inductor-build-gcp
with:
build-environment: linux-focal-cuda12.4-py3.10-gcc9-sm80
docker-image: ${{ needs.linux-focal-cuda12_4-py3_10-gcc9-inductor-build-gcp.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-cuda12_4-py3_10-gcc9-inductor-build-gcp.outputs.test-matrix }}
use-gha: anything-non-empty-to-use-gha
secrets:
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
linux-focal-cuda12_4-py3_12-gcc9-inductor-test:
name: cuda12.4-py3.12-gcc9-sm86
uses: ./.github/workflows/_linux-test.yml
needs: linux-focal-cuda12_4-py3_12-gcc9-inductor-build
with:
build-environment: linux-focal-cuda12.4-py3.12-gcc9-sm86
docker-image: ${{ needs.linux-focal-cuda12_4-py3_12-gcc9-inductor-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-cuda12_4-py3_12-gcc9-inductor-build.outputs.test-matrix }}
linux-jammy-cpu-py3_8-gcc11-inductor-build:
name: linux-jammy-cpu-py3.8-gcc11-inductor
uses: ./.github/workflows/_linux-build.yml
@ -230,6 +170,11 @@ jobs:
{ config: "cpu_inductor_timm", shard: 2, num_shards: 2, runner: "linux.12xlarge" },
{ config: "cpu_inductor_torchbench", shard: 1, num_shards: 2, runner: "linux.12xlarge" },
{ config: "cpu_inductor_torchbench", shard: 2, num_shards: 2, runner: "linux.12xlarge" },
{ config: "cpu_inductor_huggingface_freezing", shard: 1, num_shards: 1, runner: "linux.12xlarge" },
{ config: "cpu_inductor_timm_freezing", shard: 1, num_shards: 2, runner: "linux.12xlarge" },
{ config: "cpu_inductor_timm_freezing", shard: 2, num_shards: 2, runner: "linux.12xlarge" },
{ config: "cpu_inductor_torchbench_freezing", shard: 1, num_shards: 2, runner: "linux.12xlarge" },
{ config: "cpu_inductor_torchbench_freezing", shard: 2, num_shards: 2, runner: "linux.12xlarge" },
{ config: "dynamic_cpu_inductor_huggingface", shard: 1, num_shards: 1, runner: "linux.12xlarge" },
{ config: "dynamic_cpu_inductor_timm", shard: 1, num_shards: 2, runner: "linux.12xlarge" },
{ config: "dynamic_cpu_inductor_timm", shard: 2, num_shards: 2, runner: "linux.12xlarge" },


@ -20,7 +20,7 @@ jobs:
with:
timeout: 120
runner: linux.2xlarge
docker-image: pytorch-linux-jammy-cuda11.8-cudnn8-py3.9-linter
docker-image: pytorch-linux-jammy-cuda11.8-cudnn9-py3.9-linter
# NB: A shallow checkout won't work here because calculate-docker-image requires a full checkout
# to run git rev-parse HEAD~:.ci/docker when a new image is needed
fetch-depth: 0
@ -36,7 +36,7 @@ jobs:
with:
timeout: 120
runner: linux.2xlarge
docker-image: pytorch-linux-jammy-cuda11.8-cudnn8-py3.9-linter
docker-image: pytorch-linux-jammy-cuda11.8-cudnn9-py3.9-linter
# NB: A shallow checkout won't work here because calculate-docker-image requires a full checkout
# to run git rev-parse HEAD~:.ci/docker when a new image is needed
fetch-depth: 0


@ -23,9 +23,12 @@ jobs:
build-generates-artifacts: true
# To match the one pre-installed in the m1 runners
python-version: 3.9.12
# The runner macos-m2-14 is not a typo, it's a custom runner that is different
# than our AWS macos-m1-14 runners
test-matrix: |
{ include: [
{ config: "mps", shard: 1, num_shards: 1, runner: "macos-m1-14" },
{ config: "mps", shard: 1, num_shards: 1, runner: "macos-m1-13" },
{ config: "mps", shard: 1, num_shards: 1, runner: "macos-m2-14" },
]}
macos-py3-arm64-mps-test:


@ -37,6 +37,59 @@ jobs:
permissions:
id-token: write
contents: read
linux-focal-cuda12_1-py3_10-gcc9-build:
name: linux-focal-cuda12.1-py3.10-gcc9
uses: ./.github/workflows/_linux-build.yml
with:
build-environment: linux-focal-cuda12.1-py3.10-gcc9
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9
test-matrix: |
{ include: [
{ config: "nogpu_AVX512", shard: 1, num_shards: 1, runner: "linux.2xlarge" },
{ config: "nogpu_NO_AVX2", shard: 1, num_shards: 1, runner: "linux.2xlarge" },
{ config: "jit_legacy", shard: 1, num_shards: 1, runner: "linux.4xlarge.nvidia.gpu" },
]}
linux-focal-cuda12_1-py3_10-gcc9-test:
name: linux-focal-cuda12.1-py3.10-gcc9
uses: ./.github/workflows/_linux-test.yml
needs:
- linux-focal-cuda12_1-py3_10-gcc9-build
- target-determination
with:
build-environment: linux-focal-cuda12.1-py3.10-gcc9
docker-image: ${{ needs.linux-focal-cuda12_1-py3_10-gcc9-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-cuda12_1-py3_10-gcc9-build.outputs.test-matrix }}
linux-focal-cuda12_4-py3_10-gcc9-build:
name: linux-focal-cuda12.4-py3.10-gcc9
uses: ./.github/workflows/_linux-build-label.yml
with:
build-environment: linux-focal-cuda12.4-py3.10-gcc9
docker-image-name: pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 5, runner: "linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 2, num_shards: 5, runner: "linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 3, num_shards: 5, runner: "linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 4, num_shards: 5, runner: "linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 5, num_shards: 5, runner: "linux.4xlarge.nvidia.gpu" },
{ config: "deploy", shard: 1, num_shards: 1, runner: "linux.4xlarge.nvidia.gpu" },
{ config: "nogpu_AVX512", shard: 1, num_shards: 1, runner: "linux.2xlarge" },
{ config: "nogpu_NO_AVX2", shard: 1, num_shards: 1, runner: "linux.2xlarge" },
{ config: "jit_legacy", shard: 1, num_shards: 1, runner: "linux.4xlarge.nvidia.gpu" },
]}
linux-focal-cuda12_4-py3_10-gcc9-test:
name: linux-focal-cuda12.4-py3.10-gcc9
uses: ./.github/workflows/_linux-test.yml
needs:
- linux-focal-cuda12_4-py3_10-gcc9-build
- target-determination
with:
timeout-minutes: 360
build-environment: linux-focal-cuda12.4-py3.10-gcc9
docker-image: ${{ needs.linux-focal-cuda12_4-py3_10-gcc9-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-cuda12_4-py3_10-gcc9-build.outputs.test-matrix }}
parallelnative-linux-jammy-py3_8-gcc11-build:
name: parallelnative-linux-jammy-py3.8-gcc11
@ -67,7 +120,7 @@ jobs:
uses: ./.github/workflows/_linux-build.yml
with:
build-environment: linux-focal-cuda11.8-py3.9-gcc9
docker-image-name: pytorch-linux-focal-cuda11.8-cudnn8-py3-gcc9
docker-image-name: pytorch-linux-focal-cuda11.8-cudnn9-py3-gcc9
cuda-arch-list: 8.6
test-matrix: |
{ include: [
@ -89,7 +142,7 @@ jobs:
uses: ./.github/workflows/_linux-build.yml
with:
build-environment: linux-focal-cuda11.8-py3.10-gcc9-debug
docker-image-name: pytorch-linux-focal-cuda11.8-cudnn8-py3-gcc9
docker-image-name: pytorch-linux-focal-cuda11.8-cudnn9-py3-gcc9
build-with-debug: true
test-matrix: |
{ include: [


@ -237,7 +237,7 @@ jobs:
uses: ./.github/workflows/_linux-build-label.yml
with:
build-environment: linux-focal-cuda11.8-py3.10-gcc9
docker-image-name: pytorch-linux-focal-cuda11.8-cudnn8-py3-gcc9
docker-image-name: pytorch-linux-focal-cuda11.8-cudnn9-py3-gcc9
test-matrix: |
{ include: [
{ config: "distributed", shard: 1, num_shards: 3, runner: "linux.8xlarge.nvidia.gpu" },
@ -262,7 +262,7 @@ jobs:
uses: ./.github/workflows/_linux-build-label.yml
with:
build-environment: linux-focal-cuda12.1-py3.10-gcc9
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn8-py3-gcc9
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 5, runner: "linux.4xlarge.nvidia.gpu" },
@ -285,34 +285,6 @@ jobs:
docker-image: ${{ needs.linux-focal-cuda12_1-py3_10-gcc9-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-cuda12_1-py3_10-gcc9-build.outputs.test-matrix }}
linux-focal-cuda12_4-py3_10-gcc9-build:
name: linux-focal-cuda12.4-py3.10-gcc9
uses: ./.github/workflows/_linux-build-label.yml
with:
build-environment: linux-focal-cuda12.4-py3.10-gcc9
docker-image-name: pytorch-linux-focal-cuda12.4-cudnn8-py3-gcc9
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 5, runner: "linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 2, num_shards: 5, runner: "linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 3, num_shards: 5, runner: "linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 4, num_shards: 5, runner: "linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 5, num_shards: 5, runner: "linux.4xlarge.nvidia.gpu" },
{ config: "deploy", shard: 1, num_shards: 1, runner: "linux.4xlarge.nvidia.gpu" },
]}
linux-focal-cuda12_4-py3_10-gcc9-test:
name: linux-focal-cuda12.4-py3.10-gcc9
uses: ./.github/workflows/_linux-test.yml
needs:
- linux-focal-cuda12_4-py3_10-gcc9-build
- target-determination
with:
timeout-minutes: 360
build-environment: linux-focal-cuda12.4-py3.10-gcc9
docker-image: ${{ needs.linux-focal-cuda12_4-py3_10-gcc9-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-cuda12_4-py3_10-gcc9-build.outputs.test-matrix }}
linux-jammy-py3-clang12-mobile-build:
name: linux-jammy-py3-clang12-mobile-build
uses: ./.github/workflows/_linux-build-label.yml
@ -325,12 +297,12 @@ jobs:
{ config: "default", shard: 1, num_shards: 1 },
]}
linux-jammy-cuda-11_8-cudnn8-py3_8-clang12-build:
name: linux-jammy-cuda11.8-cudnn8-py3.8-clang12
linux-jammy-cuda-11_8-cudnn9-py3_8-clang12-build:
name: linux-jammy-cuda11.8-cudnn9-py3.8-clang12
uses: ./.github/workflows/_linux-build-label.yml
with:
build-environment: linux-jammy-cuda11.8-cudnn8-py3.8-clang12
docker-image-name: pytorch-linux-jammy-cuda11.8-cudnn8-py3.8-clang12
build-environment: linux-jammy-cuda11.8-cudnn9-py3.8-clang12
docker-image-name: pytorch-linux-jammy-cuda11.8-cudnn9-py3.8-clang12
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 1 },
@ -389,7 +361,7 @@ jobs:
uses: ./.github/workflows/_bazel-build-test.yml
with:
build-environment: linux-focal-cuda12.1-py3.10-gcc9-bazel-test
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn8-py3-gcc9
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9
cuda-version: cpu
test-matrix: |
{ include: [
@ -401,7 +373,7 @@ jobs:
uses: ./.github/workflows/_bazel-build-test.yml
with:
build-environment: linux-focal-cuda12.1-py3.10-gcc9-bazel-test
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn8-py3-gcc9
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9
cuda-version: "12.1"
test-matrix: |
{ include: [
@ -413,7 +385,7 @@ jobs:
uses: ./.github/workflows/_bazel-build-test.yml
with:
build-environment: linux-focal-cuda12.4-py3.10-gcc9-bazel-test
docker-image-name: pytorch-linux-focal-cuda12.4-cudnn8-py3-gcc9
docker-image-name: pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9
cuda-version: "12.4"
test-matrix: |
{ include: [
@ -475,7 +447,7 @@ jobs:
uses: ./.github/workflows/_linux-build-label.yml
with:
build-environment: linux-focal-cuda12.1-py3.10-gcc9-sm86
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn8-py3-gcc9
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9
cuda-arch-list: 8.6
test-matrix: |
{ include: [
@ -497,33 +469,6 @@ jobs:
docker-image: ${{ needs.linux-focal-cuda12_1-py3_10-gcc9-sm86-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-cuda12_1-py3_10-gcc9-sm86-build.outputs.test-matrix }}
linux-focal-cuda12_4-py3_10-gcc9-sm86-build:
name: linux-focal-cuda12.4-py3.10-gcc9-sm86
uses: ./.github/workflows/_linux-build-label.yml
with:
build-environment: linux-focal-cuda12.4-py3.10-gcc9-sm86
docker-image-name: pytorch-linux-focal-cuda12.4-cudnn8-py3-gcc9
cuda-arch-list: 8.6
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 5, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 2, num_shards: 5, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 3, num_shards: 5, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 4, num_shards: 5, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 5, num_shards: 5, runner: "linux.g5.4xlarge.nvidia.gpu" },
]}
linux-focal-cuda12_4-py3_10-gcc9-sm86-test:
name: linux-focal-cuda12.4-py3.10-gcc9-sm86
uses: ./.github/workflows/_linux-test.yml
needs:
- linux-focal-cuda12_4-py3_10-gcc9-sm86-build
- target-determination
with:
build-environment: linux-focal-cuda12.4-py3.10-gcc9-sm86
docker-image: ${{ needs.linux-focal-cuda12_4-py3_10-gcc9-sm86-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-cuda12_4-py3_10-gcc9-sm86-build.outputs.test-matrix }}
linux-jammy-py3-clang12-executorch-build:
name: linux-jammy-py3-clang12-executorch
uses: ./.github/workflows/_linux-build-label.yml


@ -41,7 +41,7 @@ jobs:
uses: ./.github/workflows/_linux-build.yml
with:
build-environment: linux-focal-cuda12.1-py3-gcc9-slow-gradcheck
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn8-py3-gcc9
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9
cuda-arch-list: 8.6
test-matrix: |
{ include: [
@ -70,7 +70,7 @@ jobs:
uses: ./.github/workflows/_linux-build.yml
with:
build-environment: linux-focal-cuda12.1-py3.10-gcc9-sm86
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn8-py3-gcc9
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9
cuda-arch-list: 8.6
test-matrix: |
{ include: [


@ -26,7 +26,7 @@ jobs:
id: calculate-docker-image
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
with:
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn8-py3-gcc9
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9
working-directory: pytorch
- name: Use following to pull public copy of the image


@ -16,7 +16,7 @@ jobs:
uses: ./.github/workflows/_linux-build.yml
with:
build-environment: linux-focal-cuda12.1-py3.10-gcc9-sm80
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn8-py3-gcc9-inductor-benchmarks
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9-inductor-benchmarks
cuda-arch-list: '8.0'
test-matrix: |
{ include: [


@ -34,36 +34,39 @@ jobs:
id-token: write
contents: read
linux-focal-cuda12_1-py3_10-gcc9-build:
name: linux-focal-cuda12.1-py3.10-gcc9
uses: ./.github/workflows/_linux-build.yml
linux-focal-cuda12_4-py3_10-gcc9-sm86-build:
name: linux-focal-cuda12.4-py3.10-gcc9-sm86
uses: ./.github/workflows/_linux-build-label.yml
with:
build-environment: linux-focal-cuda12.1-py3.10-gcc9
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn8-py3-gcc9
build-environment: linux-focal-cuda12.4-py3.10-gcc9-sm86
docker-image-name: pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9
cuda-arch-list: 8.6
test-matrix: |
{ include: [
{ config: "nogpu_AVX512", shard: 1, num_shards: 1, runner: "linux.2xlarge" },
{ config: "nogpu_NO_AVX2", shard: 1, num_shards: 1, runner: "linux.2xlarge" },
{ config: "jit_legacy", shard: 1, num_shards: 1, runner: "linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 1, num_shards: 5, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 2, num_shards: 5, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 3, num_shards: 5, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 4, num_shards: 5, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 5, num_shards: 5, runner: "linux.g5.4xlarge.nvidia.gpu" },
]}
linux-focal-cuda12_1-py3_10-gcc9-test:
name: linux-focal-cuda12.1-py3.10-gcc9
linux-focal-cuda12_4-py3_10-gcc9-sm86-test:
name: linux-focal-cuda12.4-py3.10-gcc9-sm86
uses: ./.github/workflows/_linux-test.yml
needs:
- linux-focal-cuda12_1-py3_10-gcc9-build
- linux-focal-cuda12_4-py3_10-gcc9-sm86-build
- target-determination
with:
build-environment: linux-focal-cuda12.1-py3.10-gcc9
docker-image: ${{ needs.linux-focal-cuda12_1-py3_10-gcc9-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-cuda12_1-py3_10-gcc9-build.outputs.test-matrix }}
build-environment: linux-focal-cuda12.4-py3.10-gcc9-sm86
docker-image: ${{ needs.linux-focal-cuda12_4-py3_10-gcc9-sm86-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-cuda12_4-py3_10-gcc9-sm86-build.outputs.test-matrix }}
libtorch-linux-focal-cuda12_1-py3_7-gcc9-debug-build:
name: libtorch-linux-focal-cuda12.1-py3.7-gcc9-debug
uses: ./.github/workflows/_linux-build.yml
with:
build-environment: libtorch-linux-focal-cuda12.1-py3.7-gcc9
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn8-py3-gcc9
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9
build-generates-artifacts: false
runner: linux.4xlarge
test-matrix: |
@ -77,42 +80,18 @@ jobs:
uses: ./.github/workflows/_linux-build.yml
with:
build-environment: linux-focal-cuda12.1-py3.10-gcc9-no-ops
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn8-py3-gcc9
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 1 },
]}
linux-focal-cuda12_4-py3_10-gcc9-build:
name: linux-focal-cuda12.4-py3.10-gcc9
uses: ./.github/workflows/_linux-build.yml
with:
build-environment: linux-focal-cuda12.4-py3.10-gcc9
docker-image-name: pytorch-linux-focal-cuda12.4-cudnn8-py3-gcc9
test-matrix: |
{ include: [
{ config: "nogpu_AVX512", shard: 1, num_shards: 1, runner: "linux.2xlarge" },
{ config: "nogpu_NO_AVX2", shard: 1, num_shards: 1, runner: "linux.2xlarge" },
{ config: "jit_legacy", shard: 1, num_shards: 1, runner: "linux.4xlarge.nvidia.gpu" },
]}
linux-focal-cuda12_4-py3_10-gcc9-test:
name: linux-focal-cuda12.4-py3.10-gcc9
uses: ./.github/workflows/_linux-test.yml
needs:
- linux-focal-cuda12_4-py3_10-gcc9-build
- target-determination
with:
build-environment: linux-focal-cuda12.4-py3.10-gcc9
docker-image: ${{ needs.linux-focal-cuda12_4-py3_10-gcc9-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-cuda12_4-py3_10-gcc9-build.outputs.test-matrix }}
libtorch-linux-focal-cuda12_4-py3_7-gcc9-debug-build:
name: libtorch-linux-focal-cuda12.4-py3.7-gcc9-debug
uses: ./.github/workflows/_linux-build.yml
with:
build-environment: libtorch-linux-focal-cuda12.4-py3.7-gcc9
docker-image-name: pytorch-linux-focal-cuda12.4-cudnn8-py3-gcc9
docker-image-name: pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9
build-generates-artifacts: false
runner: linux.4xlarge
test-matrix: |
@ -126,7 +105,7 @@ jobs:
uses: ./.github/workflows/_linux-build.yml
with:
build-environment: linux-focal-cuda12.4-py3.10-gcc9-no-ops
docker-image-name: pytorch-linux-focal-cuda12.4-cudnn8-py3-gcc9
docker-image-name: pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 1 },
@ -172,6 +151,7 @@ jobs:
python-version: 3.9.12
test-matrix: |
{ include: [
{ config: "mps", shard: 1, num_shards: 1, runner: "macos-m1-13" },
{ config: "mps", shard: 1, num_shards: 1, runner: "macos-m1-14" },
]}

View File

@ -32,174 +32,3 @@ jobs:
echo
echo "Once the jobs are deemed stable enough (% red signal < 5% and TTS < 3h),"
echo " they can graduate and move back to pull or trunk."
#
# Experimental ARC jobs
#
llm-td:
name: before-test
uses: ./.github/workflows/llm_td_retrieval.yml
permissions:
id-token: write
contents: read
target-determination:
name: before-test
uses: ./.github/workflows/target_determination.yml
needs: llm-td
permissions:
id-token: write
contents: read
linux-jammy-py3_8-gcc11-build:
name: linux-jammy-py3.8-gcc11
uses: ./.github/workflows/_linux-build-rg.yml
with:
build-environment: linux-jammy-py3.8-gcc11
docker-image-name: pytorch-linux-jammy-py3.8-gcc11
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 3, runner: "arc-lf-linux.2xlarge.avx512" },
{ config: "default", shard: 2, num_shards: 3, runner: "arc-lf-linux.2xlarge.avx512" },
{ config: "default", shard: 3, num_shards: 3, runner: "arc-lf-linux.2xlarge.avx512" },
{ config: "docs_test", shard: 1, num_shards: 1, runner: "arc-lf-linux.2xlarge.avx512" },
{ config: "jit_legacy", shard: 1, num_shards: 1, runner: "arc-lf-linux.2xlarge.avx512" },
{ config: "backwards_compat", shard: 1, num_shards: 1, runner: "arc-lf-linux.2xlarge.avx512" },
{ config: "distributed", shard: 1, num_shards: 2, runner: "arc-lf-linux.2xlarge.avx512" },
{ config: "distributed", shard: 2, num_shards: 2, runner: "arc-lf-linux.2xlarge.avx512" },
]}
linux-jammy-py3_8-gcc11-test:
name: linux-jammy-py3.8-gcc11
uses: ./.github/workflows/_linux-test-rg.yml
needs:
- linux-jammy-py3_8-gcc11-build
- target-determination
with:
build-environment: linux-jammy-py3.8-gcc11
docker-image: ${{ needs.linux-jammy-py3_8-gcc11-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-jammy-py3_8-gcc11-build.outputs.test-matrix }}
linux-jammy-py3_8-gcc11-no-ops:
name: linux-jammy-py3.8-gcc11-no-ops
uses: ./.github/workflows/_linux-build-rg.yml
with:
build-environment: linux-jammy-py3.8-gcc11-no-ops
docker-image-name: pytorch-linux-jammy-py3.8-gcc11
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 1 },
]}
linux-jammy-py3_8-gcc11-pch:
name: linux-jammy-py3.8-gcc11-pch
uses: ./.github/workflows/_linux-build-rg.yml
with:
build-environment: linux-jammy-py3.8-gcc11-pch
docker-image-name: pytorch-linux-jammy-py3.8-gcc11
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 1 },
]}
linux-focal-py3_8-clang10-onnx-build:
name: linux-focal-py3.8-clang10-onnx
uses: ./.github/workflows/_linux-build-rg.yml
with:
build-environment: linux-focal-py3.8-clang10-onnx
docker-image-name: pytorch-linux-focal-py3-clang10-onnx
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 2, runner: "arc-lf-linux.2xlarge.avx512" },
{ config: "default", shard: 2, num_shards: 2, runner: "arc-lf-linux.2xlarge.avx512" },
]}
linux-focal-py3_8-clang10-onnx-test:
name: linux-focal-py3.8-clang10-onnx
uses: ./.github/workflows/_linux-test-rg.yml
needs:
- linux-focal-py3_8-clang10-onnx-build
- target-determination
with:
build-environment: linux-focal-py3.8-clang10-onnx
docker-image: ${{ needs.linux-focal-py3_8-clang10-onnx-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-py3_8-clang10-onnx-build.outputs.test-matrix }}
linux-jammy-py3_10-clang15-asan-build:
name: linux-jammy-py3.10-clang15-asan
uses: ./.github/workflows/_linux-build-rg.yml
with:
build-environment: linux-jammy-py3.10-clang15-asan
docker-image-name: pytorch-linux-jammy-py3-clang15-asan
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 6, runner: "linux.4xlarge" },
{ config: "default", shard: 2, num_shards: 6, runner: "linux.4xlarge" },
{ config: "default", shard: 3, num_shards: 6, runner: "linux.4xlarge" },
{ config: "default", shard: 4, num_shards: 6, runner: "linux.4xlarge" },
{ config: "default", shard: 5, num_shards: 6, runner: "linux.4xlarge" },
{ config: "default", shard: 6, num_shards: 6, runner: "linux.4xlarge" },
]}
sync-tag: asan-build-arc
linux-focal-py3_8-clang10-build:
name: linux-focal-py3.8-clang10
uses: ./.github/workflows/_linux-build-rg.yml
with:
build-environment: linux-focal-py3.8-clang10
docker-image-name: pytorch-linux-focal-py3.8-clang10
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 3, runner: "arc-lf-linux.2xlarge.avx512" },
{ config: "default", shard: 2, num_shards: 3, runner: "arc-lf-linux.2xlarge.avx512" },
{ config: "default", shard: 3, num_shards: 3, runner: "arc-lf-linux.2xlarge.avx512" },
{ config: "crossref", shard: 1, num_shards: 2, runner: "arc-lf-linux.2xlarge.avx512" },
{ config: "crossref", shard: 2, num_shards: 2, runner: "arc-lf-linux.2xlarge.avx512" },
{ config: "dynamo", shard: 1, num_shards: 3, runner: "arc-lf-linux.2xlarge.avx512" },
{ config: "dynamo", shard: 2, num_shards: 3, runner: "arc-lf-linux.2xlarge.avx512" },
{ config: "dynamo", shard: 3, num_shards: 3, runner: "arc-lf-linux.2xlarge.avx512" },
]}
linux-focal-py3_8-clang10-test:
name: linux-focal-py3.8-clang10
uses: ./.github/workflows/_linux-test-rg.yml
needs:
- linux-focal-py3_8-clang10-build
- target-determination
with:
build-environment: linux-focal-py3.8-clang10
docker-image: ${{ needs.linux-focal-py3_8-clang10-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-py3_8-clang10-build.outputs.test-matrix }}
linux-focal-py3_11-clang10-build:
name: linux-focal-py3.11-clang10
uses: ./.github/workflows/_linux-build-rg.yml
with:
build-environment: linux-focal-py3.11-clang10
docker-image-name: pytorch-linux-focal-py3.11-clang10
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 3, runner: "arc-lf-linux.2xlarge.avx512" },
{ config: "default", shard: 2, num_shards: 3, runner: "arc-lf-linux.2xlarge.avx512" },
{ config: "default", shard: 3, num_shards: 3, runner: "arc-lf-linux.2xlarge.avx512" },
{ config: "crossref", shard: 1, num_shards: 2, runner: "arc-lf-linux.2xlarge.avx512" },
{ config: "crossref", shard: 2, num_shards: 2, runner: "arc-lf-linux.2xlarge.avx512" },
{ config: "dynamo", shard: 1, num_shards: 3, runner: "arc-lf-linux.2xlarge.avx512" },
{ config: "dynamo", shard: 2, num_shards: 3, runner: "arc-lf-linux.2xlarge.avx512" },
{ config: "dynamo", shard: 3, num_shards: 3, runner: "arc-lf-linux.2xlarge.avx512" },
]}
linux-focal-py3_11-clang10-test:
name: linux-focal-py3.11-clang10
uses: ./.github/workflows/_linux-test-rg.yml
needs:
- linux-focal-py3_11-clang10-build
- target-determination
with:
build-environment: linux-focal-py3.11-clang10
docker-image: ${{ needs.linux-focal-py3_11-clang10-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-py3_11-clang10-build.outputs.test-matrix }}
#
# End of Experimental ARC jobs
#

4
.gitmodules vendored
View File

@ -18,10 +18,6 @@
ignore = dirty
path = third_party/protobuf
url = https://github.com/protocolbuffers/protobuf.git
[submodule "third_party/ios-cmake"]
ignore = dirty
path = third_party/ios-cmake
url = https://github.com/Yangqing/ios-cmake.git
[submodule "third_party/NNPACK"]
ignore = dirty
path = third_party/NNPACK

View File

@ -1072,7 +1072,6 @@ exclude_patterns = [
'test/test_jit_disabled.py',
'test/test_jit_fuser.py',
'test/test_jit_fuser_legacy.py',
'test/test_jit_fuser_te.py',
'test/test_jit_legacy.py',
'test/test_jit_llga_fuser.py',
'test/test_jit_profiling.py',
@ -1115,9 +1114,6 @@ exclude_patterns = [
'test/test_segment_reductions.py',
'test/test_serialization.py',
'test/test_set_default_mobile_cpu_allocator.py',
'test/test_shape_ops.py',
'test/test_show_pickle.py',
'test/test_sort_and_select.py',
'test/test_sparse.py',
'test/test_sparse_csr.py',
'test/test_sparse_semi_structured.py',
@ -1536,28 +1532,6 @@ exclude_patterns = [
'torch/distributed/optim/post_localSGD_optimizer.py',
'torch/distributed/optim/utils.py',
'torch/distributed/optim/zero_redundancy_optimizer.py',
'torch/distributed/pipeline/__init__.py',
'torch/distributed/pipeline/sync/__init__.py',
'torch/distributed/pipeline/sync/_balance/__init__.py',
'torch/distributed/pipeline/sync/_balance/blockpartition.py',
'torch/distributed/pipeline/sync/_balance/profile.py',
'torch/distributed/pipeline/sync/batchnorm.py',
'torch/distributed/pipeline/sync/checkpoint.py',
'torch/distributed/pipeline/sync/copy.py',
'torch/distributed/pipeline/sync/dependency.py',
'torch/distributed/pipeline/sync/microbatch.py',
'torch/distributed/pipeline/sync/phony.py',
'torch/distributed/pipeline/sync/pipe.py',
'torch/distributed/pipeline/sync/pipeline.py',
'torch/distributed/pipeline/sync/skip/__init__.py',
'torch/distributed/pipeline/sync/skip/layout.py',
'torch/distributed/pipeline/sync/skip/namespace.py',
'torch/distributed/pipeline/sync/skip/portal.py',
'torch/distributed/pipeline/sync/skip/skippable.py',
'torch/distributed/pipeline/sync/skip/tracker.py',
'torch/distributed/pipeline/sync/stream.py',
'torch/distributed/pipeline/sync/utils.py',
'torch/distributed/pipeline/sync/worker.py',
'torch/distributed/remote_device.py',
'torch/distributed/rendezvous.py',
'torch/distributed/rpc/__init__.py',
@ -1582,7 +1556,6 @@ exclude_patterns = [
'torch/distributed/tensor/parallel/input_reshard.py',
'torch/distributed/tensor/parallel/multihead_attention_tp.py',
'torch/distributed/tensor/parallel/style.py',
'torch/distributed/utils.py',
'torch/fft/__init__.py',
'torch/func/__init__.py',
'torch/functional.py',
@ -1674,18 +1647,6 @@ exclude_patterns = [
'torch/hub.py',
'torch/library.py',
'torch/linalg/__init__.py',
# UFMT causes import cycle on masked
'torch/masked/__init__.py',
'torch/masked/_docs.py',
'torch/masked/_ops.py',
'torch/masked/maskedtensor/__init__.py',
'torch/masked/maskedtensor/_ops_refs.py',
'torch/masked/maskedtensor/binary.py',
'torch/masked/maskedtensor/core.py',
'torch/masked/maskedtensor/creation.py',
'torch/masked/maskedtensor/passthrough.py',
'torch/masked/maskedtensor/reductions.py',
'torch/masked/maskedtensor/unary.py',
'torch/monitor/__init__.py',
'torch/nested/__init__.py',
'torch/nn/__init__.py',
@ -1864,8 +1825,6 @@ exclude_patterns = [
'torch/testing/_internal/distributed/nn/__init__.py',
'torch/testing/_internal/distributed/nn/api/__init__.py',
'torch/testing/_internal/distributed/nn/api/remote_module_test.py',
'torch/testing/_internal/distributed/pipe_with_ddp_test.py',
'torch/testing/_internal/distributed/pipeline/__init__.py',
'torch/testing/_internal/distributed/rpc/__init__.py',
'torch/testing/_internal/distributed/rpc/dist_autograd_test.py',
'torch/testing/_internal/distributed/rpc/dist_optimizer_test.py',
@ -2120,7 +2079,7 @@ init_command = [
'python3',
'tools/linter/adapters/pip_init.py',
'--dry-run={{DRYRUN}}',
'ruff==0.4.6',
'ruff==0.4.8',
]
is_formatter = true

View File

@ -455,7 +455,6 @@ filegroup(
name = "caffe2_core_srcs",
srcs = [
"caffe2/core/common.cc",
"caffe2/core/types.cc",
],
)
@ -488,7 +487,6 @@ filegroup(
filegroup(
name = "caffe2_utils_srcs",
srcs = [
"caffe2/utils/cpuid.cc",
"caffe2/utils/proto_wrap.cc",
"caffe2/utils/string_utils.cc",
"caffe2/utils/threadpool/ThreadPool.cc",
@ -507,12 +505,9 @@ cc_library(
name = "caffe2_for_aten_headers",
hdrs = [
"caffe2/core/common.h",
"caffe2/core/logging.h",
"caffe2/core/types.h",
"caffe2/perfkernels/common.h",
"caffe2/perfkernels/embedding_lookup.h",
"caffe2/perfkernels/embedding_lookup_idx.h",
"caffe2/utils/cpuid.h",
"caffe2/utils/fixed_divisor.h",
] + glob([
"caffe2/utils/threadpool/*.h",
@ -522,7 +517,6 @@ cc_library(
deps = [
":caffe2_core_macros",
"//c10",
"//caffe2/proto:caffe2_pb",
],
)
@ -547,7 +541,6 @@ cc_library(
deps = [
":caffe2_core_macros",
":caffe2_for_aten_headers",
"//caffe2/proto:caffe2_pb",
],
)
@ -568,7 +561,6 @@ cc_library(
":caffe2_perfkernels_avx",
":caffe2_perfkernels_avx2",
":caffe2_perfkernels_avx512",
"//caffe2/proto:caffe2_pb",
"//third_party/miniz-2.1.0:miniz",
"@com_google_protobuf//:protobuf",
"@eigen",
@ -777,6 +769,7 @@ cc_library(
":caffe2",
":torch_headers",
"@kineto",
"@cpp-httplib",
] + if_cuda([
"@cuda//:nvToolsExt",
"@cutlass",

View File

@ -242,8 +242,7 @@ option(USE_COLORIZE_OUTPUT "Colorize output during compilation" ON)
option(USE_ASAN "Use Address+Undefined Sanitizers" OFF)
option(USE_TSAN "Use Thread Sanitizer" OFF)
option(USE_CUDA "Use CUDA" ON)
cmake_dependent_option(USE_XPU "Use XPU. Only available on Linux." ON "LINUX"
OFF)
option(USE_XPU "Use XPU" ON)
cmake_dependent_option(
BUILD_LAZY_CUDA_LINALG "Build cuda linalg ops as separate library" ON
"USE_CUDA AND LINUX AND BUILD_PYTHON" OFF)
@ -540,6 +539,8 @@ option(BUILD_EXECUTORCH "Master flag to build Executorch" ON)
if(LINUX)
set(CMAKE_SHARED_LINKER_FLAGS
"${CMAKE_SHARED_LINKER_FLAGS} -Wl,--no-as-needed")
set(CMAKE_SHARED_LINKER_FLAGS
"${CMAKE_SHARED_LINKER_FLAGS} $ENV{LDFLAGS}")
endif()
if(MSVC)
@ -892,6 +893,14 @@ endif()
if(USE_SLEEF_FOR_ARM_VEC256)
string(APPEND CMAKE_CXX_FLAGS " -DAT_BUILD_ARM_VEC256_WITH_SLEEF")
add_definitions(-DAT_BUILD_ARM_VEC256_WITH_SLEEF)
endif()
# Enable sleef on macOS with Apple silicon by default
if((${CMAKE_SYSTEM_NAME} STREQUAL "Darwin") AND ("${CMAKE_SYSTEM_PROCESSOR}" STREQUAL "arm64"))
message(STATUS "Running on macOS with Apple silicon")
string(APPEND CMAKE_CXX_FLAGS " -DAT_BUILD_ARM_VEC256_WITH_SLEEF")
add_definitions(-DAT_BUILD_ARM_VEC256_WITH_SLEEF)
endif()
if(USE_XNNPACK)

View File

@ -1,4 +1,4 @@
![PyTorch Logo](https://github.com/pytorch/pytorch/blob/main/docs/source/_static/img/pytorch-logo-dark.png)
![PyTorch Logo](https://github.com/pytorch/pytorch/raw/main/docs/source/_static/img/pytorch-logo-dark.png)
--------------------------------------------------------------------------------
@ -98,7 +98,7 @@ from several research papers on this topic, as well as current and past work suc
While this technique is not unique to PyTorch, it's one of the fastest implementations of it to date.
You get the best of speed and flexibility for your crazy research.
![Dynamic graph](https://github.com/pytorch/pytorch/blob/main/docs/source/_static/img/dynamic_graph.gif)
![Dynamic graph](https://github.com/pytorch/pytorch/raw/main/docs/source/_static/img/dynamic_graph.gif)
### Python First
@ -189,7 +189,7 @@ Other potentially useful environment variables may be found in `setup.py`.
##### Intel GPU Support
If you want to compile with Intel GPU support, follow these
- [PyTorch Prerequisites for Intel GPUs](https://www.intel.com/content/www/us/en/developer/articles/tool/pytorch-prerequisites-for-intel-gpus.html) instructions.
- Intel GPU is currently supported only for Linux systems.
- Intel GPU is supported for Linux and Windows.
If you want to disable Intel GPU support, export the environment variable `USE_XPU=0`.
Other potentially useful environment variables may be found in `setup.py`.
@ -213,6 +213,7 @@ conda install -c pytorch magma-cuda121 # or the magma-cuda* that matches your C
# (optional) If using torch.compile with inductor/triton, install the matching version of triton
# Run from the pytorch directory after cloning
# For Intel GPU support, please explicitly `export USE_XPU=1` before running command.
make triton
```

View File

@ -37,6 +37,7 @@
- [TL;DR](#tldr)
- [Accelerator Software](#accelerator-software)
- [Special support cases](#special-support-cases)
- [Operating Systems](#operating-systems)
- [Submitting Tutorials](#submitting-tutorials)
- [Special Topics](#special-topics)
- [Updating submodules for a release](#updating-submodules-for-a-release)
@ -426,6 +427,15 @@ the size restrictions for publishing on PyPI so the default version that is publ
These special support cases will be handled on a case by case basis and support may be continued if current PyTorch maintainers feel as though there may still be a
need to support these particular versions of software.
## Operating Systems
Supported OS flavors are summarized in the table below:
| Operating System family | Architecture | Notes |
| --- | --- | --- |
| Linux | aarch64, x86_64 | Wheels are manylinux2014 compatible, i.e. they should be runnable on any Linux system with glibc-2.17 or above. |
| MacOS | arm64 | Builds should be compatible with MacOS 11 (Big Sur) or newer, but are actively tested against MacOS 14 (Sonoma). |
| MacOS | x86_64 | Requires MacOS Catalina or above, not supported after 2.2, see https://github.com/pytorch/pytorch/issues/114602 |
| Windows | x86_64 | Builds are compatible with Windows-10 or newer. |
# Submitting Tutorials
Tutorials in support of a release feature must be submitted to the [pytorch/tutorials](https://github.com/pytorch/tutorials) repo at least two weeks before the release date to allow for editorial and technical review. There is no cherry-pick process for tutorials. All tutorials will be merged around the release day and published at [pytorch.org/tutorials](https://pytorch.org/tutorials/).

View File

@ -168,6 +168,12 @@ new_local_repository(
path = "third_party/opentelemetry-cpp",
)
new_local_repository(
name = "cpp-httplib",
build_file = "//third_party:cpp-httplib.BUILD",
path = "third_party/cpp-httplib",
)
new_local_repository(
name = "tensorpipe",
build_file = "//third_party:tensorpipe.BUILD",

View File

@ -386,6 +386,7 @@ if(UNIX AND NOT APPLE)
endif(UNIX AND NOT APPLE)
if(UNIX)
include(CheckFunctionExists)
set(CMAKE_EXTRA_INCLUDE_FILES "sys/mman.h")
CHECK_FUNCTION_EXISTS(mmap HAVE_MMAP)
if(HAVE_MMAP)
@ -472,7 +473,6 @@ endif()
if(USE_CUDA AND NOT USE_ROCM)
list(APPEND ATen_CUDA_INCLUDE ${CMAKE_CURRENT_SOURCE_DIR}/../../../third_party/cutlass/include)
list(APPEND ATen_CUDA_INCLUDE ${CMAKE_CURRENT_SOURCE_DIR}/../../../third_party/cutlass/tools/util/include)
if($ENV{ATEN_STATIC_CUDA})
list(APPEND ATen_CUDA_DEPENDENCY_LIBS
${CUDA_LIBRARIES}

View File

@ -218,8 +218,8 @@ static inline Tensor applySlice(
? (*self_sizes)[dim]
: self.sym_size(dim);
if (!disable_slice_optimization &&
TORCH_GUARD_SIZE_OBLIVIOUS(start.sym_eq(0)) && length == stop &&
step == 1) {
TORCH_GUARD_SIZE_OBLIVIOUS(start.sym_eq(0)) &&
TORCH_GUARD_SIZE_OBLIVIOUS(length.sym_eq(stop)) && step == 1) {
return self;
}
}

View File

@ -68,7 +68,7 @@ thread_local std::array<at::ScalarType, at::COMPILE_TIME_MAX_DEVICE_TYPES>
at::kBFloat16, // XLA / TPU
at::ScalarType::Undefined, // Vulkan
at::ScalarType::Undefined, // Metal
at::kBFloat16, // XPU
at::kHalf, // XPU
at::ScalarType::Undefined, // MPS
at::ScalarType::Undefined, // Meta (tensors with no data)
at::kBFloat16, // HPU / HABANA

View File

@ -275,16 +275,6 @@ void expectOutOfPlaceMultiBoxedCallingWorks(const KernelFunction& func) {
EXPECT_TRUE(stack[1].toTensor().is_same(t2));
}
void expectBoxedCallingFailsWith(const KernelFunction& func, const char* errorMessage) {
called_with_args = c10::nullopt;
vector<IValue> stack {3, 4};
OperatorHandle dummy = makeDummyOperatorHandle();
expectThrows<c10::Error>([&] {
func.callBoxed(dummy, CPU_TEST_SET, &stack);
}, errorMessage);
}
//
// unboxed calling tests:
//

View File

@ -40,10 +40,6 @@ int64_t incrementKernel(const Tensor& tensor, int64_t input) {
return input + 1;
}
int64_t decrementKernel(const Tensor& tensor, int64_t input) {
return input - 1;
}
void expectCallsIncrement(DispatchKey dispatch_key) {
at::AutoDispatchBelowAutograd mode;
@ -55,17 +51,6 @@ void expectCallsIncrement(DispatchKey dispatch_key) {
EXPECT_EQ(6, result[0].toInt());
}
void expectCallsDecrement(DispatchKey dispatch_key) {
at::AutoDispatchBelowAutograd mode;
// assert that schema and cpu kernel are present
auto op = c10::Dispatcher::singleton().findSchema({"_test::my_op", ""});
ASSERT_TRUE(op.has_value());
auto result = callOp(*op, dummyTensor(dispatch_key), 5);
EXPECT_EQ(1, result.size());
EXPECT_EQ(4, result[0].toInt());
}
TEST(OperatorRegistrationTestLegacyFunctionBasedKernel, givenKernel_whenRegistered_thenCanBeCalled) {
auto registrar = RegisterOperators().op("_test::my_op(Tensor dummy, int input) -> int", &incrementKernel);
expectCallsIncrement(DispatchKey::CPU);

View File

@ -662,18 +662,6 @@ void expectCallsConcatUnboxed(DispatchKey dispatch_key) {
EXPECT_EQ("123", result);
}
void expectCannotCallConcatBoxed(DispatchKey dispatch_key) {
at::AutoDispatchBelowAutograd mode;
// assert that schema and cpu kernel are present
auto op = c10::Dispatcher::singleton().findSchema({"_test::my_op", ""});
ASSERT_TRUE(op.has_value());
expectThrows<c10::Error>(
[&] {callOp(*op, dummyTensor(dispatch_key), "1", "2", 3);},
"Tried to call KernelFunction::callBoxed() on a KernelFunction that can only be called with KernelFunction::call()."
);
}
TEST(OperatorRegistrationTestFunctionBasedKernel, givenKernel_whenRegistered_thenCanBeCalledUnboxed) {
auto registrar = RegisterOperators().op("_test::my_op(Tensor dummy, str a, str b, int c) -> str", RegisterOperators::options().kernel<decltype(concatKernel), &concatKernel>(DispatchKey::CPU));
expectCallsConcatUnboxed(DispatchKey::CPU);

View File

@ -51,17 +51,6 @@ void expectCallsIncrement(DispatchKey dispatch_key) {
EXPECT_EQ(6, result[0].toInt());
}
void expectCallsDecrement(DispatchKey dispatch_key) {
at::AutoDispatchBelowAutograd mode;
// assert that schema and cpu kernel are present
auto op = c10::Dispatcher::singleton().findSchema({"_test::my_op", ""});
ASSERT_TRUE(op.has_value());
auto result = callOp(*op, dummyTensor(dispatch_key), 5);
EXPECT_EQ(1, result.size());
EXPECT_EQ(4, result[0].toInt());
}
TEST(OperatorRegistrationTestFunctorBasedKernel, givenKernel_whenRegistered_thenCanBeCalled) {
auto registrar = RegisterOperators().op("_test::my_op(Tensor dummy, int input) -> int", RegisterOperators::options().kernel<IncrementKernel>(DispatchKey::CPU));
expectCallsIncrement(DispatchKey::CPU);

View File

@ -4,6 +4,21 @@
#endif
namespace at::cpu {
bool is_cpu_support_avx2() {
#if !defined(__s390x__) && !defined(__powerpc__)
return cpuinfo_initialize() && cpuinfo_has_x86_avx2();
#else
return false;
#endif
}
bool is_cpu_support_avx512() {
#if !defined(__s390x__) && !defined(__powerpc__)
return cpuinfo_initialize() && cpuinfo_has_x86_avx512f() && cpuinfo_has_x86_avx512vl() && cpuinfo_has_x86_avx512bw() && cpuinfo_has_x86_avx512dq();
#else
return false;
#endif
}
bool is_cpu_support_vnni() {
#if !defined(__s390x__) && !defined(__powerpc__)

View File

@ -4,6 +4,9 @@
namespace at::cpu {
TORCH_API bool is_cpu_support_avx2();
TORCH_API bool is_cpu_support_avx512();
// Detect if CPU support Vector Neural Network Instruction.
TORCH_API bool is_cpu_support_vnni();

View File

@ -1,3 +1,4 @@
#include <ATen/cuda/CUDAContextLight.h>
#include <ATen/cuda/Sleep.h>
#include <c10/cuda/CUDAException.h>
@ -32,4 +33,37 @@ void sleep(int64_t cycles) {
C10_CUDA_KERNEL_LAUNCH_CHECK();
}
#ifdef USE_ROCM
__global__ void flush_icache_kernel()
{
asm __volatile__("s_icache_inv \n\t"
"s_nop 0 \n\t"
"s_nop 0 \n\t"
"s_nop 0 \n\t"
"s_nop 0 \n\t"
"s_nop 0 \n\t"
"s_nop 0 \n\t"
"s_nop 0 \n\t"
"s_nop 0 \n\t"
"s_nop 0 \n\t"
"s_nop 0 \n\t"
"s_nop 0 \n\t"
"s_nop 0 \n\t"
"s_nop 0 \n\t"
"s_nop 0 \n\t"
"s_nop 0 \n\t"
"s_nop 0 \n\t" ::
:);
}
#endif
void flush_icache() {
#ifdef USE_ROCM
dim3 grid(at::cuda::getCurrentDeviceProperties()->multiProcessorCount * 60);
dim3 block(64);
flush_icache_kernel<<<grid, block, 0, c10::cuda::getCurrentCUDAStream()>>>();
C10_CUDA_KERNEL_LAUNCH_CHECK();
#endif
}
} // namespace at::cuda

View File

@ -7,4 +7,7 @@ namespace at::cuda {
// enqueues a kernel that spins for the specified number of cycles
TORCH_CUDA_CU_API void sleep(int64_t cycles);
// flushes instruction cache for ROCm; no-op for CUDA
TORCH_CUDA_CU_API void flush_icache();
} // namespace at::cuda

View File

@ -170,43 +170,6 @@ CUDA_STUB3(cuLinkComplete, CUlinkState, void **, size_t *);
CUDA_STUB3(cuFuncSetAttribute, CUfunction, CUfunction_attribute, int);
CUDA_STUB3(cuFuncGetAttribute, int*, CUfunction_attribute, CUfunction);
#if defined(CUDA_VERSION) && CUDA_VERSION >= 12000
CUresult CUDAAPI
cuTensorMapEncodeTiled(
CUtensorMap* tensorMap,
CUtensorMapDataType tensorDataType,
cuuint32_t tensorRank,
void* globalAddress,
const cuuint64_t* globalDim,
const cuuint64_t* globalStrides,
const cuuint32_t* boxDim,
const cuuint32_t* elementStrides,
CUtensorMapInterleave interleave,
CUtensorMapSwizzle swizzle,
CUtensorMapL2promotion l2Promotion,
CUtensorMapFloatOOBfill oobFill) {
auto fn = reinterpret_cast<decltype(&cuTensorMapEncodeTiled)>(
getCUDALibrary().sym(__func__));
if (!fn)
throw std::runtime_error("Can't get cuTensorMapEncodeTiled");
lazyNVRTC.cuTensorMapEncodeTiled = fn;
return fn(
tensorMap,
tensorDataType,
tensorRank,
globalAddress,
globalDim,
globalStrides,
boxDim,
elementStrides,
interleave,
swizzle,
l2Promotion,
oobFill);
}
#endif
// Irregularly shaped functions
CUresult CUDAAPI cuLaunchKernel(CUfunction f,
unsigned int gridDimX,

View File

@ -34,8 +34,8 @@ struct PhiloxCudaState {
int64_t* ptr;
};
Payload seed_;
Payload offset_;
Payload seed_{};
Payload offset_{};
uint32_t offset_intragraph_ = 0;
bool captured_ = false;
};

View File

@ -59,25 +59,16 @@ namespace at { namespace cuda {
_(cuLinkAddData) \
_(cuLinkComplete) \
_(cuFuncSetAttribute) \
_(cuFuncGetAttribute) \
#if defined(CUDA_VERSION) && CUDA_VERSION >= 12000
#define AT_FORALL_NVRTC_EXTENDED(_) \
AT_FORALL_NVRTC_BASE(_) \
_(cuTensorMapEncodeTiled)
#else
#define AT_FORALL_NVRTC_EXTENDED(_) \
AT_FORALL_NVRTC_BASE(_)
#endif
_(cuFuncGetAttribute)
#if defined(CUDA_VERSION) && CUDA_VERSION >= 11010
#define AT_FORALL_NVRTC(_) \
AT_FORALL_NVRTC_EXTENDED(_) \
AT_FORALL_NVRTC_BASE(_) \
_(nvrtcGetCUBINSize) \
_(nvrtcGetCUBIN)
#else
#define AT_FORALL_NVRTC(_) \
AT_FORALL_NVRTC_EXTENDED(_)
AT_FORALL_NVRTC_BASE(_)
#endif
#else

View File

@ -66,7 +66,7 @@ static bool NumericalCheck(ScalarType dtype, void* c, void* other_c, int64_t siz
return false;
}
else {
TUNABLE_LOG("├──verify numerics: atol=", last_succeed_atol, ", rtol=", last_succeed_rtol);
TUNABLE_LOG3("├──verify numerics: atol=", last_succeed_atol, ", rtol=", last_succeed_rtol);
}
return true;
@ -76,30 +76,54 @@ static bool NumericalCheck(ScalarType dtype, void* c, void* other_c, int64_t siz
template <typename T>
struct GemmParams : OpParams {
GemmParams() {
duplicate_inputs_ = false;
}
std::string Signature() const override {
return c10::str(transa, transb, "_", m, "_", n, "_", k);
}
GemmParams* DeepCopy() const {
size_t GetSize(bool duplicate_inputs) const {
size_t size = sizeof(T) * ldc * n;
if (duplicate_inputs) {
size += sizeof(T) * lda * ((transa == 'n' || transa == 'N') ? k : m);
size += sizeof(T) * ldb * ((transb == 'n' || transb == 'N') ? n : k);
}
return size;
}
GemmParams* DeepCopy(bool duplicate_inputs) const {
GemmParams* copy = new GemmParams;
*copy = *this;
c10::DeviceIndex device = 0;
AT_CUDA_CHECK(c10::cuda::GetDevice(&device));
size_t c_size = m * n * sizeof(T);
size_t c_size = ldc * n * sizeof(T);
copy->c = static_cast<T*>(c10::cuda::CUDACachingAllocator::raw_alloc(c_size));
AT_CUDA_CHECK(c10::cuda::CUDACachingAllocator::memcpyAsync(
copy->c, device, c, device, c_size, getCurrentCUDAStream(device), true));
if (duplicate_inputs) {
size_t a_size = sizeof(T) * lda * ((transa == 'n' || transa == 'N') ? k : m);
size_t b_size = sizeof(T) * ldb * ((transb == 'n' || transb == 'N') ? n : k);
copy->a = static_cast<const T*>(c10::cuda::CUDACachingAllocator::raw_alloc(a_size));
copy->b = static_cast<const T*>(c10::cuda::CUDACachingAllocator::raw_alloc(b_size));
copy->duplicate_inputs_ = true;
}
return copy;
}
// only call on object returned by DeepCopy
void Delete() {
c10::cuda::CUDACachingAllocator::raw_delete(c);
if (duplicate_inputs_) {
c10::cuda::CUDACachingAllocator::raw_delete(const_cast<T*>(a));
c10::cuda::CUDACachingAllocator::raw_delete(const_cast<T*>(b));
}
}
TuningStatus NumericalCheck(GemmParams<T> *other) {
auto c_dtype = c10::CppTypeToScalarType<T>::value;
return detail::NumericalCheck(c_dtype, c, other->c, m*n) ? OK : FAIL;
return detail::NumericalCheck(c_dtype, c, other->c, ldc*n) ? OK : FAIL;
}
char transa;
@ -115,15 +139,30 @@ struct GemmParams : OpParams {
at::opmath_type<T> beta;
T* c;
int64_t ldc;
private:
bool duplicate_inputs_;
};
template <typename T>
struct GemmStridedBatchedParams : OpParams {
GemmStridedBatchedParams() {
duplicate_inputs_ = false;
}
std::string Signature() const override {
return c10::str(transa, transb, "_", m, "_", n, "_", k, "_B_", batch);
}
GemmStridedBatchedParams* DeepCopy() const {
size_t GetSize(bool duplicate_inputs) const {
size_t size = sizeof(T) * stride_c * batch;
if (duplicate_inputs) {
size += sizeof(T) * stride_a * batch;
size += sizeof(T) * stride_b * batch;
}
return size;
}
GemmStridedBatchedParams* DeepCopy(bool duplicate_inputs) const {
GemmStridedBatchedParams* copy = new GemmStridedBatchedParams;
*copy = *this;
c10::DeviceIndex device = 0;
@ -132,12 +171,23 @@ struct GemmStridedBatchedParams : OpParams {
copy->c = static_cast<T*>(c10::cuda::CUDACachingAllocator::raw_alloc(c_size));
AT_CUDA_CHECK(c10::cuda::CUDACachingAllocator::memcpyAsync(
copy->c, device, c, device, c_size, getCurrentCUDAStream(device), true));
if (duplicate_inputs) {
size_t a_size = sizeof(T) * stride_a * batch;
size_t b_size = sizeof(T) * stride_b * batch;
copy->a = static_cast<const T*>(c10::cuda::CUDACachingAllocator::raw_alloc(a_size));
copy->b = static_cast<const T*>(c10::cuda::CUDACachingAllocator::raw_alloc(b_size));
copy->duplicate_inputs_ = true;
}
return copy;
}
// only call on object returned by DeepCopy
void Delete() {
c10::cuda::CUDACachingAllocator::raw_delete(c);
if (duplicate_inputs_) {
c10::cuda::CUDACachingAllocator::raw_delete(const_cast<T*>(a));
c10::cuda::CUDACachingAllocator::raw_delete(const_cast<T*>(b));
}
}
TuningStatus NumericalCheck(GemmStridedBatchedParams<T> *other) {
@ -162,33 +212,59 @@ struct GemmStridedBatchedParams : OpParams {
int64_t ldc;
int64_t stride_c;
int64_t batch;
private:
bool duplicate_inputs_;
};
template <typename T>
struct ScaledGemmParams : OpParams {
ScaledGemmParams() {
duplicate_inputs_ = false;
}
std::string Signature() const override {
return c10::str(transa, transb, "_", m, "_", n, "_", k);
}
ScaledGemmParams* DeepCopy() const {
size_t GetSize(bool duplicate_inputs) const {
size_t size = sizeof(T) * ldc * n;
if (duplicate_inputs) {
size += sizeof(T) * lda * ((transa == 'n' || transa == 'N') ? k : m);
size += sizeof(T) * ldb * ((transb == 'n' || transb == 'N') ? n : k);
}
return size;
}
ScaledGemmParams* DeepCopy(bool duplicate_inputs) const {
ScaledGemmParams* copy = new ScaledGemmParams;
*copy = *this;
c10::DeviceIndex device = 0;
AT_CUDA_CHECK(c10::cuda::GetDevice(&device));
size_t c_size = m * n * sizeof(T);
size_t c_size = ldc * n * sizeof(T);
copy->c = c10::cuda::CUDACachingAllocator::raw_alloc(c_size);
AT_CUDA_CHECK(c10::cuda::CUDACachingAllocator::memcpyAsync(
copy->c, device, c, device, c_size, getCurrentCUDAStream(device), true));
if (duplicate_inputs) {
size_t a_size = sizeof(T) * lda * ((transa == 'n' || transa == 'N') ? k : m);
size_t b_size = sizeof(T) * ldb * ((transb == 'n' || transb == 'N') ? n : k);
copy->a = c10::cuda::CUDACachingAllocator::raw_alloc(a_size);
copy->b = c10::cuda::CUDACachingAllocator::raw_alloc(b_size);
copy->duplicate_inputs_ = true;
}
return copy;
}
// only call on object returned by DeepCopy
void Delete() {
c10::cuda::CUDACachingAllocator::raw_delete(c);
if (duplicate_inputs_) {
c10::cuda::CUDACachingAllocator::raw_delete(const_cast<void*>(a));
c10::cuda::CUDACachingAllocator::raw_delete(const_cast<void*>(b));
}
}
TuningStatus NumericalCheck(ScaledGemmParams<T> *other) {
return detail::NumericalCheck(c_dtype, c, other->c, m*n) ? OK : FAIL;
return detail::NumericalCheck(c_dtype, c, other->c, ldc*n) ? OK : FAIL;
}
char transa;
@ -212,6 +288,8 @@ struct ScaledGemmParams : OpParams {
ScalarType c_dtype;
void* amax_ptr;
bool use_fast_accum;
private:
bool duplicate_inputs_;
};
} // namespace at::cuda::tunable
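
The `GetSize` and `DeepCopy` hunks above size each buffer from its leading dimension (`ldc * n`) rather than from `m * n`. A minimal standalone sketch of that arithmetic follows, assuming column-major GEMM operands. `GemmShape` and `gemm_buffer_bytes` are hypothetical names used only for illustration; they are not part of TunableOp.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the shape fields of GemmParams (illustration only).
struct GemmShape {
  char transa, transb;
  int64_t m, n, k;
  int64_t lda, ldb, ldc;
};

// Returns true when the op character means "not transposed".
inline bool is_no_trans(char op) { return op == 'n' || op == 'N'; }

// Bytes needed for C, plus A and B when the inputs are duplicated as well.
// Mirrors the formula in the diff: each matrix occupies
// (leading dimension) x (number of stored columns) elements.
template <typename T>
std::size_t gemm_buffer_bytes(const GemmShape& s, bool duplicate_inputs) {
  auto elems = [](int64_t ld, int64_t cols) {
    return static_cast<std::size_t>(ld) * static_cast<std::size_t>(cols);
  };
  std::size_t size = sizeof(T) * elems(s.ldc, s.n);
  if (duplicate_inputs) {
    size += sizeof(T) * elems(s.lda, is_no_trans(s.transa) ? s.k : s.m);
    size += sizeof(T) * elems(s.ldb, is_no_trans(s.transb) ? s.n : s.k);
  }
  return size;
}
```

When `ldc` is larger than `m`, `ldc * n` also covers the padding rows between columns, which is presumably why the diff replaces `m * n` in both the copy and the numerical check.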

View File

@ -263,19 +263,19 @@ static size_t GetHipblasltWorkspaceSize() {
// 256MB is max workspace size allowed for hipblaslt
// hipblaslt-bench uses 32MB
// recommendation from hipblaslt author was 76MB
size_t workspace_size = 2*128*1024*1024; // default 256MB
size_t workspace_size = 32*1024; // going with 32MB
if (env) {
try {
workspace_size = std::stoi(env);
} catch(std::invalid_argument const& e) {
TORCH_WARN("invalid HIPBLASLT_WORKSPACE_SIZE,",
" using default workspace size of ", workspace_size, " bytes.");
" using default workspace size of ", workspace_size, " KiB.");
} catch(std::out_of_range const& e) {
TORCH_WARN("HIPBLASLT_WORKSPACE_SIZE out of range,",
" using default workspace size of ", workspace_size, " bytes.");
" using default workspace size of ", workspace_size, " KiB.");
}
}
return workspace_size;
return workspace_size * 1024;
}
template <typename T, cublasStatus_t (*destructor)(T*)>
@ -413,12 +413,10 @@ class HipblasltGemmOp : public Callable<ParamsT> {
if (status == HIPBLAS_STATUS_SUCCESS) {
if (ret_workspace_size >= workspace_size) {
//TUNABLE_LOG("[hipBLASLt] Solution #", algo_index, " workspace too large");
return FAIL;
}
}
else {
//TUNABLE_LOG("[hipBLASLt] Solution #", algo_index, " not supported");
return FAIL;
}


@ -2,67 +2,30 @@
This directory implements a TunableOp interface.
Some operations, such as GEMMs, could be implemented using more than one library or more than one technique. For
example, a GEMM could be implemented for CUDA or ROCm using either the blas or blasLt libraries. Further, ROCm's
rocblas and hipblaslt libraries allow the user to query for all possible algorithms and then choose one. How does one
know which implementation is the fastest and should be chosen? That's what TunableOp provides.
Some operations, such as GEMMs, could be implemented using more than one library or more than one technique. For
example, a GEMM could be implemented for CUDA or ROCm using either the blas or blasLt libraries. Further, ROCm's
rocblas and hipblaslt libraries allow the user to query for all possible algorithms and then choose one. How does one
know which implementation is the fastest and should be chosen? That's what TunableOp provides.
The behavior of TunableOp is currently easily manipulated through environment variables, though you could use the C++
interface of at::cuda::tunable::getTuningContext(). A Python interface to the TuningContext does not yet exist.
## Enabling TunableOp and Tuning Separately
The TunableOp feature is enabled separately from enabling the tuning phase itself. Enabling TunableOp means that PyTorch
will replace any standard operators with their Tunable implementations. Any call to a TunableOp first checks whether it
has already been tuned for the given operator inputs. If so, it will immediately call the tuned operation; no further
tuning will take place even when the tuning setting is enabled. If instead no tuning result is found and tuning is
enabled, the TunableOp will benchmark every registered implementation of that operator for the given set of inputs and
select the fastest.
Currently only a TunableGemm for ROCm is implemented. Any call to at::cuda::blas::gemm() can optionally use the
TunableGemm. Calling gemm() for a given set of input arguments (transa, transb, m, n, k) will attempt to use the
fastest available implementation.
## File Input and Output
The first time any TunableOp is invoked, the internal database of tuned operations will be prepared by attempting to
read the results from the given file. The default filename is 'tunableop_results.csv'. To support tuning when multiple
GPUs are used across multiple processes, the GPU device ordinal is automatically inserted into the filename to avoid
multiple processes overwriting the same file.
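The ordinal insertion described above can be sketched as a small standalone function. This is only an illustration that follows the `SetFilename` change shown in this diff; the helper name `InsertDeviceOrdinal` is hypothetical and is not a PyTorch API, and the device ordinal is passed in rather than queried from the runtime.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not part of PyTorch): returns `filename` with the
// device ordinal inserted. An empty filename is returned unchanged.
std::string InsertDeviceOrdinal(std::string filename, int device_ordinal) {
  assert(device_ordinal >= 0);
  if (filename.empty()) {
    return filename;
  }
  const std::string device = std::to_string(device_ordinal);
  const std::string token = "%d";
  std::size_t found = filename.find(token);
  if (found != std::string::npos) {
    // An explicit %d marks where the ordinal should go.
    filename.replace(found, token.length(), device);
  } else if ((found = filename.rfind('.')) != std::string::npos) {
    // No %d present: place the ordinal just before the final '.'.
    filename.insert(found, device);
  } else {
    // No extension either: append the ordinal.
    filename.append(device);
  }
  return filename;
}
```

With this sketch, the default name on device 0 becomes `tunableop_results0.csv`, while a name such as `tuned_%d.csv` lets the caller choose where the ordinal lands.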
## Environment Variables
#### PYTORCH_TUNABLEOP_ENABLED
Default is 0. Set to 1 to enable.
This is the big on/off switch for all TunableOp implementations.
#### PYTORCH_TUNABLEOP_TUNING
Default is 1. Set to 0 to disable.
When enabled, if a tuned entry isn't found, run the tuning step and record the entry.
#### PYTORCH_TUNABLEOP_VERBOSE
Default is 0. Set to 1 to enable.
This will produce a lot of diagnostic messages but may be useful to see if TunableOp is being used at all.
Otherwise, TunableOp is completely silent unless there is a warning or error during its use.
#### PYTORCH_TUNABLEOP_FILENAME
Default is 'tunableop_results.csv'. If you provide a filename, the TuningContext will attempt to read it the first time
the context is used. If tuning is enabled and new tunings are discovered, it will also write out to this same filename
with all tunings, both the ones it read in at startup as well as the new ones found at runtime. This can be used, for
example, to build up a tunings file across many workloads by reusing the same file. Unsetting this variable is not
recommended but can be done, in which case the tuning results will not be saved.
#### PYTORCH_TUNABLEOP_NUMERICAL_CHECK
Default is 1. Set to 0 to disable. Compare the results of each possible solution against the default solution and reject
those with low accuracy.
#### PYTORCH_TUNABLEOP_HIPBLASLT_ENABLED
Default is 1. Set to 0 to disable hipblaslt being considered during tuning.
### Tuning Iterations
By default, each possible solution for a given operator will be run for either 100 iterations or as many iterations as
can be run within 30ms, whichever is smaller. Its average execution time will be calculated. The fastest solution is
chosen. In addition, a set of warm up iterations can optionally be run prior to the timed iterations. The following
environment variables can be used to set either the maximum number of iterations to attempt or the maximum amount of
time allowed in milliseconds, or both, in which case the smaller of the two values is used.
#### PYTORCH_TUNABLEOP_MAX_TUNING_DURATION_MS
Default is 30.
#### PYTORCH_TUNABLEOP_MAX_TUNING_ITERATIONS
Default is 100.
#### PYTORCH_TUNABLEOP_MAX_WARMUP_DURATION_MS
Default is 0, meaning it is not used.
#### PYTORCH_TUNABLEOP_MAX_WARMUP_ITERATIONS
Default is 1.
## File Output
If tuning is enabled and new tunings are discovered during the course of your workload, it will also write out to this
same filename with all tunings, both the ones it read in at startup as well as the new ones found at runtime. This can
be used, for example, to build up a tunings file across many workloads by reusing the same file. The output file is
automatically created when the application terminates. This behavior can be controlled by the C++ and Python APIs but
not the environment variables.
Assuming you specified a filename, you'll end up with a CSV file with contents like so:
@ -75,8 +38,8 @@ GemmTunableOp_float_NT,nt_25088_4096_64,1219,1.262
GemmTunableOp_float_NT,nt_4096_4096_64,1216,0.033
```
Note the "Validator" lines. If you change a library version, or rocm version, or pytorch version, TunableOp will detect
this and not load the tunings because they are likely affected by other software changes.
Note the "Validator" lines. If you change a library version, or ROCm version, or PyTorch version, TunableOp will detect
this and reject the tunings file because the prior tunings are likely affected by other software changes.
The remaining lines are the tuned solutions for each TunableOp encountered during your execution. Each line consists of
4 comma-separated fields: operator name, operator parameters, solution name, and average execution time. The execution
@ -86,3 +49,102 @@ hipBLAS or hipBLASLt libraries, if you know the specific solution index you can
selected by replacing the value. The operator name and parameters (fields 1 and 2) are internally named and should not
be modified. In the case of GemmTunableOp, field 1 indicates the datatype and whether the inputs are transposed (T) or
not (N) and field 2 indicates the M, N, K input shapes.
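As a rough illustration of the four-field layout, the following sketch splits one result line into its parts. The `TunedEntry` struct and both function names are hypothetical and exist only for this example; the split-on-comma approach and the special handling of "Validator" lines are modeled on the `ReadFile` code in this diff, not copied from it.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record for one tuned-solution line of the CSV file.
struct TunedEntry {
  std::string op_name;   // field 1, e.g. GemmTunableOp_float_NT
  std::string params;    // field 2, e.g. nt_4096_4096_64
  std::string solution;  // field 3, solution name or index
  double avg_time;       // field 4, average execution time
};

// Splits a line on commas; no quoting or escaping is handled.
std::vector<std::string> SplitCsv(const std::string& line) {
  std::vector<std::string> parts;
  std::stringstream ss(line);
  std::string field;
  while (std::getline(ss, field, ',')) {
    parts.push_back(field);
  }
  return parts;
}

// Returns true and fills `out` for a four-field result line. Returns false
// for "Validator" lines and for lines with fewer than four fields.
// Note: std::stod throws if field 4 is not a number.
bool ParseTunedEntry(const std::string& line, TunedEntry& out) {
  const std::vector<std::string> parts = SplitCsv(line);
  if (parts.size() < 4 || parts[0] == "Validator") {
    return false;
  }
  out = TunedEntry{parts[0], parts[1], parts[2], std::stod(parts[3])};
  return true;
}
```

Keeping field 3 as a string reflects that it is a solution name; for hipBLAS or hipBLASLt entries it happens to be a numeric index.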
There is an option to enable verbose output but it is only recommended for debugging purposes. This will produce a lot
of diagnostic messages but may be useful to see if TunableOp is being used at all. Otherwise, TunableOp is completely
silent, besides file output, unless there is a warning or error during its use.
## A Note on Tuning Behavior, Warmup, and Cache Effects
Tuning an operator consists of iterating through the list of registered implementations and profiling each one. The
profile is established by running a single implementation in a loop multiple times and taking the average execution
time. There is also an optional warmup phase prior to tuning that can help with reaching stable power states by the
hardware. During tuning of a workload the various hardware caches will more likely produce hits than when not tuning.
There are options for flushing the instruction cache and rotating the input tensors which might help produce a more
faithful profile of the tuned operator as if the operator were run within a larger workload instead of in a tight,
repetitive loop.
By default, each possible solution for a given operator will be run for either 100 iterations or as many iterations as
can be run within 30ms, whichever is smaller, and its average execution time will be calculated. The fastest solution among
all that were successfully profiled will be chosen. A profile might fail if the given solution doesn't achieve the same
accuracy as the default implementation or if the solution returns an error code.
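The "whichever is smaller" rule can be written down as a small calculation. The sketch below is an assumption about how such a budget could be computed, not the actual TunableOp code: the function name is hypothetical, the per-iteration time is assumed to come from a prior estimate, and the floor of one iteration is an added safeguard so that a very slow solution is still measured once.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper: number of timed iterations for one candidate solution.
// approx_ms_per_iter: estimated time of a single run, in milliseconds.
// max_iterations:     iteration cap (the README default is 100).
// max_duration_ms:    time budget in milliseconds (the README default is 30).
int PlannedTuningIterations(double approx_ms_per_iter,
                            int max_iterations,
                            int max_duration_ms) {
  assert(max_iterations >= 0 && max_duration_ms >= 0);
  if (approx_ms_per_iter <= 0.0) {
    // No usable estimate: fall back to the iteration cap.
    return std::max(1, max_iterations);
  }
  // How many runs fit in the time budget, then take the smaller limit.
  const double fit_in_budget = max_duration_ms / approx_ms_per_iter;
  const double capped =
      std::min(static_cast<double>(max_iterations), fit_in_budget);
  // Assumption: always run at least once.
  return std::max(1, static_cast<int>(capped));
}
```

For example, with the defaults a kernel that takes about 0.5 ms is limited by the time budget to 60 iterations, while a kernel that takes about 0.01 ms hits the 100-iteration cap first.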
## Current Tunable Operators
### TunableGemm for ROCm
Currently only a TunableGemm for ROCm is implemented. Note that CUDA builds of PyTorch will function correctly when
using TunableOp but the only solution available to CUDA builds is the 'Default' implementation i.e. the original cuBLAS
default, now called through TunableOp. Any call to at::cuda::blas::gemm() or ::bgemm() will be routed through TunableOp
when enabled. Calling gemm() for a given set of input arguments (transa, transb, m, n, k) will attempt to use the
fastest available implementation across both rocblas and hipblaslt.
## Tuning Context
The behavior of TunableOp is currently manipulated through environment variables, the C++ interface of
at::cuda::tunable::getTuningContext(), or the `torch.cuda.tunable` python interfaces. The environment variables take
precedence over any setting you manipulate using the C++ or Python APIs.
### Environment Variable Interface
Environment variables are cached the first time they are read. You cannot use the environment variable interface
programmatically since the settings become fixed. Use the C++ or Python APIs instead.
| Environment Variable | Description |
| -------------------- | ----------- |
| PYTORCH_TUNABLEOP_ENABLED | Default is 0. Set to 1 to enable. |
| PYTORCH_TUNABLEOP_TUNING | Default is 1. Set to 0 to disable. |
| PYTORCH_TUNABLEOP_VERBOSE | Default is 0. Set to 1 to enable basic logging. 2 for basic tuning status. 3 for full trace. |
| PYTORCH_TUNABLEOP_VERBOSE_FILENAME | Default is "err" for stderr. Set to "out" for stdout or a filename for capturing verbose logging. |
| PYTORCH_TUNABLEOP_FILENAME | Default is 'tunableop_results.csv'. |
| PYTORCH_TUNABLEOP_NUMERICAL_CHECK | Default is 0. Set to 1 to enable. |
| PYTORCH_TUNABLEOP_ROCBLAS_ENABLED | Default is 1. Set to 0 to disable rocblas being considered during tuning. |
| PYTORCH_TUNABLEOP_HIPBLASLT_ENABLED | Default is 1. Set to 0 to disable hipblaslt being considered during tuning. |
| PYTORCH_TUNABLEOP_MAX_TUNING_DURATION_MS | Default is 30. Unit is milliseconds. |
| PYTORCH_TUNABLEOP_MAX_TUNING_ITERATIONS | Default is 100. |
| PYTORCH_TUNABLEOP_MAX_WARMUP_DURATION_MS | Default is 0, meaning it is not used. Unit is milliseconds. |
| PYTORCH_TUNABLEOP_MAX_WARMUP_ITERATIONS | Default is 0, meaning it is not used. |
| PYTORCH_TUNABLEOP_ICACHE_FLUSH_ENABLED | Default is 1. Set to 0 to disable. |
| PYTORCH_TUNABLEOP_ROTATING_BUFFER_SIZE | Default is to query L2 cache size. Set to 0 to disable. Otherwise, set to the number of MiB to use for the pool of operator parameters. For example, setting this to the size of your device's memory cache will guarantee that every tuning iteration will use a cold cache. |
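To make the unit handling and precedence of `PYTORCH_TUNABLEOP_ROTATING_BUFFER_SIZE` concrete, here is a standalone sketch modeled on the `GetRotatingBufferSize` code in this diff. The function name and its parameters are hypothetical: the environment value, the value set through the API, and the L2 cache size are passed in explicitly so the logic can be read (and run) without a GPU. Treating the API value as a byte count is an assumption.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical helper: resolves the rotating buffer size in bytes.
// env_value:       contents of the environment variable, or nullptr if unset.
// api_value_bytes: value set programmatically; negative means "not set".
// l2_cache_bytes:  L2 cache size reported by the device.
int RotatingBufferBytes(const char* env_value,
                        int api_value_bytes,
                        int l2_cache_bytes) {
  assert(l2_cache_bytes >= 0);
  if (env_value != nullptr) {
    // The environment variable wins and is given in MiB; negative values
    // are clamped to 0, which disables the rotating buffer.
    constexpr int kMiB = 1024 * 1024;
    const int mib = std::atoi(env_value);
    return mib < 0 ? 0 : mib * kMiB;
  }
  // Without the environment variable, a negative (default) setting means
  // "use the L2 cache size".
  return api_value_bytes < 0 ? l2_cache_bytes : api_value_bytes;
}
```

Note that `std::atoi` returns 0 for text that is not a number, so in this sketch a malformed value behaves the same as setting the variable to 0.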
### Python Interface
All python APIs exist in the `torch.cuda.tunable` module.
| Python API | Description |
| ---------- | ----------- |
| enable(val: bool = True) -> None | |
| is_enabled() -> bool | |
| tuning_enable(val: bool = True) -> None | Default is True. |
| tuning_is_enabled() -> bool | |
| set_max_tuning_duration(duration: int) -> None | |
| get_max_tuning_duration() -> int | |
| set_max_tuning_iterations(iterations: int) -> None | |
| get_max_tuning_iterations() -> int | |
| set_filename(filename: str, insert_device_ordinal: bool = False) -> None | |
| get_filename() -> str | |
| get_results() -> Tuple[str, str, str, float] | |
| get_validators() -> Tuple[str, str] | |
| write_file_on_exit(val: bool) -> None | Default is True. |
| write_file(filename: Optional[str] = None) -> None | If filename not given, it will call get_filename(). |
| read_file(filename: Optional[str] = None) -> None | If filename not given, it will call get_filename(). |
### C++ Interface
Example:
```C++
#include <ATen/cuda/tunable/Tunable.h>
at::cuda::tunable::getTuningContext()->EnableTunableOp(true);
```
| C++ API | Description |
| ------- | ----------- |
| void EnableTunableOp(bool value); | |
| bool IsTunableOpEnabled() const; | |
| void EnableTuning(bool value); | |
| bool IsTuningEnabled() const; | |
| void SetMaxTuningDurationMs(int max_duration_ms); | |
| int GetMaxTuningDurationMs() const; | |
| void SetMaxTuningIterations(int max_iter); | |
| int GetMaxTuningIterations() const; | |
| TuningResults GetTuningResults(); | |
| void SetFilename(const std::string& filename, bool insert_device_ordinal=false); | |
| std::string GetFilename() const; | |
| void WriteFileOnExit(bool value); | |
| bool ReadFile(const std::string& filename={}); | |
| bool WriteFile(const std::string& filename={}); | |


@ -65,14 +65,14 @@ ResultEntry TuningResultsManager::Lookup(const std::string& op_signature, const
std::scoped_lock l{lock_};
auto kernel_map_it = results_.find(op_signature);
if (kernel_map_it == results_.cend()) {
TUNABLE_LOG("missing op_signature, returning null ResultEntry");
TUNABLE_LOG3("missing op_signature, returning null ResultEntry");
return ResultEntry::Null();
}
const auto& km = kernel_map_it->second;
auto it = km.find(params_signature);
if (it == km.cend()) {
TUNABLE_LOG("missing params_signature, returning null ResultEntry");
TUNABLE_LOG3("missing params_signature, returning null ResultEntry");
return ResultEntry::Null();
}
return it->second;
@ -85,14 +85,14 @@ inline void TuningResultsManager::AddImpl(const std::string& op_signature,
auto it = kernel_map.find(params_signature);
if (it != kernel_map.end()) {
if (it->second != best) {
TUNABLE_LOG(op_signature, "(", params_signature, ") already has a best kernel ",
TUNABLE_LOG1(op_signature, "(", params_signature, ") already has a best kernel ",
"id=", it->second, " selected, want to add a different best kernel ", best,
", the new kernel id will be ignored.");
}
return;
}
TUNABLE_LOG(op_signature, "(", params_signature, ") -> ", best);
TUNABLE_LOG2(op_signature, "(", params_signature, ") -> ", best);
kernel_map.emplace(params_signature, best);
}
@ -120,7 +120,7 @@ void TuningResultsManager::Delete(const std::string& op_signature, const std::st
return;
}
TUNABLE_LOG(op_signature, "(", params_signature, ")");
TUNABLE_LOG2(op_signature, "(", params_signature, ")");
it->second.erase(it2);
}
@ -131,7 +131,7 @@ inline void TuningResultsManager::DisjointMergeImpl(
auto it = results.find(op_signature);
if (it == results.end()) {
for (const auto& [param_sig, kernel_id] : kernel_map) {
TUNABLE_LOG(op_signature, "(", param_sig, ") -> ", kernel_id);
TUNABLE_LOG2(op_signature, "(", param_sig, ") -> ", kernel_id);
}
results[op_signature] = kernel_map;
return;
@ -143,7 +143,7 @@ inline void TuningResultsManager::DisjointMergeImpl(
}
void TuningResultsManager::Load(const std::unordered_map<std::string, KernelMap>& results_to_load) {
TUNABLE_LOG("Loading results");
TUNABLE_LOG1("Loading results");
std::scoped_lock l{lock_};
for (const auto& [op_signature, kernel_map] : results_to_load) {
DisjointMergeImpl(op_signature, kernel_map, results_);
@ -194,12 +194,12 @@ static bool CheckMandatoryKeys(
for (const auto& k : TuningResultsValidator::mandatory_keys) {
if (gv_funcs.find(k) == gv_funcs.end()) {
passed = false;
TUNABLE_LOG("key=\"", k, "\" is not registered for Get and Validate. ");
TUNABLE_LOG1("key=\"", k, "\" is not registered for Get and Validate. ");
}
if (to_check.find(k) == to_check.end()) {
passed = false;
TUNABLE_LOG("key=\"", k, "\" is not provided for validation. ");
TUNABLE_LOG1("key=\"", k, "\" is not provided for validation. ");
}
}
return passed;
@ -294,10 +294,14 @@ TuningContext::TuningContext() :
enable_{false},
tuning_enable_{true},
manager_initialized_{false},
write_file_on_exit_{true},
numerics_check_enable_{false},
max_tuning_duration_ms_{30},
max_tuning_iterations_{100},
max_warmup_duration_ms_{0},
max_warmup_iterations_{0},
icache_flush_{true},
rotating_buffer_size_{-1},
filename_{},
results_count_from_input_file_{0}
{
@ -311,115 +315,158 @@ TuningContext::~TuningContext() {
return;
}
auto filename = GetFilename();
if (IsTunableOpEnabled() && IsTuningEnabled() && !filename.empty()) {
if (IsTunableOpEnabled() && IsTuningEnabled() && !filename.empty() && write_file_on_exit_) {
if (results_count_from_input_file_ < GetTuningResultsManager().GetSize()) {
if (results_count_from_input_file_ > 0) {
TUNABLE_LOG("additional tuning results available, rewriting file ", filename);
TUNABLE_LOG1("additional tuning results available, rewriting file ", filename);
}
else {
TUNABLE_LOG("writing file ", filename);
TUNABLE_LOG1("writing file ", filename);
}
if (!WriteFile(filename)) {
TUNABLE_LOG("failed to write file ", filename);
TUNABLE_LOG1("failed to write file ", filename);
}
}
}
}
void TuningContext::EnableTunableOp() {
TUNABLE_LOG("Enable TunableOp");
enable_ = true;
}
void TuningContext::DisableTunableOp() {
TUNABLE_LOG("Disable TunableOp");
enable_ = false;
void TuningContext::EnableTunableOp(bool value) {
enable_ = value;
if (value) {
TUNABLE_LOG1("Enable TunableOp");
}
else {
TUNABLE_LOG1("Disable TunableOp");
}
}
bool TuningContext::IsTunableOpEnabled() const {
static const char *env = std::getenv("PYTORCH_TUNABLEOP_ENABLED");
if (env != nullptr && strcmp(env, "1") == 0) {
//TUNABLE_LOG("PYTORCH_TUNABLEOP_ENABLED=1");
return true;
}
return enable_;
}
void TuningContext::EnableTuning() {
TUNABLE_LOG("Enable Tuning for TunableOp");
tuning_enable_ = true;
}
void TuningContext::DisableTuning() {
TUNABLE_LOG("Disable Tuning for TunableOp");
tuning_enable_ = false;
void TuningContext::EnableTuning(bool value) {
tuning_enable_ = value;
if (value) {
TUNABLE_LOG1("Enable Tuning for TunableOp");
}
else {
TUNABLE_LOG1("Disable Tuning for TunableOp");
}
}
bool TuningContext::IsTuningEnabled() const {
static const char *env = std::getenv("PYTORCH_TUNABLEOP_TUNING");
if (env != nullptr && strcmp(env, "0") == 0) {
//TUNABLE_LOG("PYTORCH_TUNABLEOP_TUNING=1");
return false;
}
return tuning_enable_;
}
void TuningContext::WriteFileOnExit(bool value) {
write_file_on_exit_ = value;
}
void TuningContext::EnableNumericsCheck(bool value) {
numerics_check_enable_ = value;
}
bool TuningContext::IsNumericsCheckEnabled() const {
static const char *env = getenv("PYTORCH_TUNABLEOP_NUMERICAL_CHECK");
if (env != nullptr && strcmp(env, "0") == 0) {
return false;
}
return numerics_check_enable_;
}
void TuningContext::SetMaxTuningDurationMs(int max_duration_ms) {
max_tuning_duration_ms_ = max_duration_ms;
max_tuning_duration_ms_ = max_duration_ms < 0 ? 0 : max_duration_ms;
}
int TuningContext::GetMaxTuningDurationMs() const {
static const char *env = std::getenv("PYTORCH_TUNABLEOP_MAX_TUNING_DURATION_MS");
if (env != nullptr) {
return atoi(env);
int val = atoi(env);
return val < 0 ? 0 : val;
}
return max_tuning_duration_ms_;
}
void TuningContext::SetMaxTuningIterations(int max_iter) {
max_tuning_iterations_ = max_iter;
max_tuning_iterations_ = max_iter < 0 ? 0 : max_iter;
}
int TuningContext::GetMaxTuningIterations() const {
static const char *env = std::getenv("PYTORCH_TUNABLEOP_MAX_TUNING_ITERATIONS");
if (env != nullptr) {
return atoi(env);
int val = atoi(env);
return val < 0 ? 0 : val;
}
return max_tuning_iterations_;
}
void TuningContext::SetMaxWarmupDurationMs(int max_duration_ms) {
max_warmup_duration_ms_ = max_duration_ms;
max_warmup_duration_ms_ = max_duration_ms < 0 ? 0 : max_duration_ms;
}
int TuningContext::GetMaxWarmupDurationMs() const {
static const char *env = std::getenv("PYTORCH_TUNABLEOP_MAX_WARMUP_DURATION_MS");
if (env != nullptr) {
return atoi(env);
int val = atoi(env);
return val < 0 ? 0 : val;
}
return max_warmup_duration_ms_;
}
void TuningContext::SetMaxWarmupIterations(int max_iter) {
max_warmup_iterations_ = max_iter;
max_warmup_iterations_ = max_iter < 0 ? 0 : max_iter;
}
int TuningContext::GetMaxWarmupIterations() const {
static const char *env = std::getenv("PYTORCH_TUNABLEOP_MAX_WARMUP_ITERATIONS");
if (env != nullptr) {
return atoi(env);
int val = atoi(env);
return val < 0 ? 0 : val;
}
return max_warmup_iterations_;
}
void TuningContext::EnableTunableOpAndTuning() {
EnableTunableOp();
EnableTuning();
void TuningContext::EnableICacheFlush(bool value) {
icache_flush_ = value;
}
void TuningContext::DisableTunableOpAndTuning() {
DisableTunableOp();
DisableTuning();
bool TuningContext::IsICacheFlushEnabled() const {
static const char *env = std::getenv("PYTORCH_TUNABLEOP_ICACHE_FLUSH_ENABLED");
if (env != nullptr && strcmp(env, "0") == 0) {
return false;
}
return icache_flush_;
}
void TuningContext::SetRotatingBufferSize(int size) {
rotating_buffer_size_ = size < 0 ? 0 : size;
}
int TuningContext::GetRotatingBufferSize() const {
static const char *env = std::getenv("PYTORCH_TUNABLEOP_ROTATING_BUFFER_SIZE");
if (env != nullptr) {
constexpr int MB = 1024 * 1024;
int val = atoi(env);
return val < 0 ? 0 : val * MB; // env var is specified as MB, returned as bytes
}
else {
if (rotating_buffer_size_ < 0) {
// negative buffer size (default) means query for L2 cache size
int l2_cache_size = at::cuda::getCurrentDeviceProperties()->l2CacheSize;
return l2_cache_size;
}
else {
return rotating_buffer_size_;
}
}
}
TuningResultsManager& TuningContext::GetTuningResultsManager() {
@ -429,7 +476,7 @@ TuningResultsManager& TuningContext::GetTuningResultsManager() {
// if SetFilename() was not already called, call it now with the default or env var
const char *env = std::getenv("PYTORCH_TUNABLEOP_FILENAME");
std::string filename = (env == nullptr) ? "tunableop_results.csv" : env;
SetFilename(filename);
SetFilename(filename, true);
}
auto filename = GetFilename();
if (!filename.empty()) {
@ -461,32 +508,34 @@ TuningStatus TuningContext::LoadTuningResults(const TuningResults& tr) {
return OK;
}
void TuningContext::SetFilename(const std::string& filename) {
void TuningContext::SetFilename(const std::string& filename, bool insert_device_ordinal) {
filename_ = filename;
if (filename_.empty()) {
return;
}
// differentiate filename based on device ordinal to avoid
// use case of one process per device writing to same file
std::string device = c10::str(int(c10::cuda::current_device()));
if (insert_device_ordinal) {
// differentiate filename based on device ordinal to avoid
// use case of one process per device writing to same file
std::string device = c10::str(int(c10::cuda::current_device()));
// does filename contain %d to insert device ordinal in specific location?
const std::string TOKEN("%d");
std::size_t found = filename_.find(TOKEN);
if (found != std::string::npos) {
filename_.replace(found, TOKEN.length(), device);
}
else {
// no %d present, so append device ordinal before final '.'
found = filename_.rfind(".");
// does filename contain %d to insert device ordinal in specific location?
const std::string TOKEN("%d");
std::size_t found = filename_.find(TOKEN);
if (found != std::string::npos) {
filename_.insert(found, device);
filename_.replace(found, TOKEN.length(), device);
}
else {
// all else fails, just append
filename_.append(device);
// no %d present, so append device ordinal before final '.'
found = filename_.rfind(".");
if (found != std::string::npos) {
filename_.insert(found, device);
}
else {
// all else fails, just append
filename_.append(device);
}
}
}
}
@ -495,14 +544,15 @@ std::string TuningContext::GetFilename() const {
return filename_;
}
bool TuningContext::ReadFile(const std::string& filename) {
TUNABLE_LOG("reading tuning results from ", filename);
bool TuningContext::ReadFile(const std::string& filename_) {
std::string filename = filename_.empty() ? GetFilename() : filename_;
TUNABLE_LOG1("reading tuning results from ", filename);
ResultsMap results;
std::unordered_map<std::string, std::string> validators;
std::string line;
std::ifstream file(filename);
if (!file) {
TUNABLE_LOG("could not open ", filename, " for reading tuning results");
TUNABLE_LOG1("could not open ", filename, " for reading tuning results");
return false;
}
while (std::getline(file, line)) {
@ -517,7 +567,7 @@ bool TuningContext::ReadFile(const std::string& filename) {
}
if (parts[0] == "Validator" && parts.size() >= 3) {
validators[parts[1]] = parts[2];
TUNABLE_LOG("Validator ", parts[1], "=", parts[2]);
TUNABLE_LOG1("Validator ", parts[1], "=", parts[2]);
}
else if (parts.size() >= 4) {
results[parts[0]].emplace(parts[1], ResultEntry(parts[2], atof(parts[3].c_str())));
@ -527,7 +577,7 @@ bool TuningContext::ReadFile(const std::string& filename) {
results[parts[0]].emplace(parts[1], ResultEntry(parts[2], 0));
}
else {
TUNABLE_LOG("could not parse line: ", line);
TUNABLE_LOG1("could not parse line: ", line);
}
}
if (GetTuningResultsValidator().ValidateAll(validators) != FAIL) {
@ -535,16 +585,17 @@ bool TuningContext::ReadFile(const std::string& filename) {
results_count_from_input_file_ = manager_.GetSize();
}
else {
TUNABLE_LOG("results validator check failed");
TUNABLE_LOG1("results validator check failed");
return false;
}
return true;
}
bool TuningContext::WriteFile(const std::string& filename) {
bool TuningContext::WriteFile(const std::string& filename_) {
std::string filename = filename_.empty() ? GetFilename() : filename_;
std::ofstream file(filename, std::ios::out | std::ios::trunc);
if (!file.good()) {
TUNABLE_LOG("error opening tuning results file for writing ", filename);
TUNABLE_LOG1("error opening tuning results file for writing ", filename);
return false;
}
auto validators = GetTuningResultsValidator().GetAllValidators();


@ -11,6 +11,7 @@
#include <c10/util/CallOnce.h>
#include <fstream>
#include <functional>
#include <iostream>
#include <memory>
@ -23,27 +24,58 @@
namespace at::cuda::tunable {
static void TunableLog(const std::string& msg) {
static const char *env = getenv("PYTORCH_TUNABLEOP_VERBOSE");
if (env != nullptr && strcmp(env, "1") == 0) {
std::cerr << msg << std::endl;
namespace detail {
struct MaybeDelete {
bool owns_pointer;
void operator()(std::ostream* os) const { if (owns_pointer) delete os; }
};
using OstreamPtr = std::unique_ptr<std::ostream, MaybeDelete>;
static OstreamPtr get_stream(std::string filename) {
if (filename.compare("out") == 0) {
return OstreamPtr { &std::cout, MaybeDelete {false} };
}
else if (filename.compare("err") == 0) {
return OstreamPtr { &std::cerr, MaybeDelete {false} };
}
else {
return OstreamPtr { new std::ofstream {filename.c_str()}, MaybeDelete {true} };
}
}
#define TUNABLE_LOG(...) TunableLog(c10::str(__VA_ARGS__))
enum TuningStatus {
}
static void TunableLog(int level, const std::string& msg) {
static const char *env_file = getenv("PYTORCH_TUNABLEOP_VERBOSE_FILENAME");
static const char *env_verbose = getenv("PYTORCH_TUNABLEOP_VERBOSE");
static int level_user = env_verbose ? atoi(env_verbose) : 0;
static auto streamptr = detail::get_stream(env_file ? env_file : "err");
if (level_user >= level) {
(*streamptr) << msg <<std::endl;
}
}
#define TUNABLE_LOGV(LEVEL, ...) TunableLog(LEVEL, c10::str(__VA_ARGS__))
#define TUNABLE_LOG1(...) TUNABLE_LOGV(1, __VA_ARGS__)
#define TUNABLE_LOG2(...) TUNABLE_LOGV(2, __VA_ARGS__)
#define TUNABLE_LOG3(...) TUNABLE_LOGV(3, __VA_ARGS__)
enum TORCH_CUDA_CPP_API TuningStatus {
OK = 0,
FAIL = 1,
UNSUPPORTED = 2,
};
// Mapping from params signature to kernel id
class ResultEntry {
class TORCH_CUDA_CPP_API ResultEntry {
public:
explicit ResultEntry(const std::string& key, double time) : key_(key), time_(time) {}
bool operator==(const ResultEntry& other) { return key_ == other.key_; }
bool operator!=(const ResultEntry& other) { return key_ != other.key_; }
operator std::string () { return key_; }
std::string GetKey() const { return key_; }
double GetTime() const { return time_; }
friend std::ostream& operator<<(std::ostream& stream, const ResultEntry& entry);
static ResultEntry Null() { return ResultEntry("Null", 0.0); }
static ResultEntry Default() { return ResultEntry("Default", 0.0); }
@ -56,7 +88,7 @@ class ResultEntry {
typedef std::unordered_map<std::string, ResultEntry> KernelMap;
typedef std::unordered_map<std::string, KernelMap> ResultsMap;
struct TuningResults {
struct TORCH_CUDA_CPP_API TuningResults {
// Validates if these results are compatible with the libraries
std::unordered_map<std::string, std::string> validators;
@ -64,7 +96,7 @@ struct TuningResults {
ResultsMap results;
};
class TuningResultsManager {
class TORCH_CUDA_CPP_API TuningResultsManager {
public:
TuningResultsManager() = default;
~TuningResultsManager() = default;
@ -102,7 +134,7 @@ class TuningResultsManager {
ResultsMap results_;
};
class TuningResultsValidator {
class TORCH_CUDA_CPP_API TuningResultsValidator {
public:
using GetFunc = std::function<std::string()>;
using ValidateFunc = std::function<TuningStatus(const std::string&)>;
@ -126,7 +158,7 @@ class TuningResultsValidator {
GetValidateFuncs validators_;
};
class TuningContext {
class TORCH_CUDA_CPP_API TuningContext {
public:
TuningContext();
~TuningContext();
@ -135,14 +167,15 @@ class TuningContext {
TuningContext &operator=(TuningContext &) = delete;
TuningContext &operator=(TuningContext &&) = delete;
void EnableTunableOp();
void DisableTunableOp();
void EnableTunableOp(bool value);
bool IsTunableOpEnabled() const;
void EnableTuning();
void DisableTuning();
void EnableTuning(bool value);
bool IsTuningEnabled() const;
void EnableNumericsCheck(bool value);
bool IsNumericsCheckEnabled() const;
void SetMaxTuningDurationMs(int max_duration_ms);
int GetMaxTuningDurationMs() const;
@ -155,8 +188,11 @@ class TuningContext {
void SetMaxWarmupIterations(int max_iter);
int GetMaxWarmupIterations() const;
void EnableTunableOpAndTuning();
void DisableTunableOpAndTuning();
void EnableICacheFlush(bool value);
bool IsICacheFlushEnabled() const;
void SetRotatingBufferSize(int size);
int GetRotatingBufferSize() const;
TuningResultsManager& GetTuningResultsManager();
@ -166,21 +202,26 @@ class TuningContext {
TuningStatus LoadTuningResults(const TuningResults& tr);
void SetFilename(const std::string& filename);
void SetFilename(const std::string& filename, bool insert_device_ordinal=false);
std::string GetFilename() const;
protected:
bool ReadFile(const std::string& filename);
bool WriteFile(const std::string& filename);
void WriteFileOnExit(bool value);
bool ReadFile(const std::string& filename={});
bool WriteFile(const std::string& filename={});
private:
bool enable_;
bool tuning_enable_;
bool manager_initialized_;
bool write_file_on_exit_;
bool numerics_check_enable_;
int max_tuning_duration_ms_;
int max_tuning_iterations_;
int max_warmup_duration_ms_;
int max_warmup_iterations_;
bool icache_flush_;
int rotating_buffer_size_;
mutable TuningResultsManager manager_;
mutable c10::once_flag manager_init_once_;
TuningResultsValidator validator_;
@ -188,7 +229,7 @@ class TuningContext {
size_t results_count_from_input_file_;
};
TuningContext* getTuningContext();
TORCH_CUDA_CPP_API TuningContext* getTuningContext();
class ITimer {
public:


@ -175,6 +175,56 @@ inline std::string TypeName(c10::complex<float> v) {
return "c10::complex<float>";
}
#ifdef USE_ROCM
static void AddRocblasValidator() {
auto validators = getTuningContext()->GetTuningResultsValidator().GetAllValidators();
if (validators.find("ROCBLAS_VERSION") == validators.end()) {
std::string rocblas_version = c10::str(
XSTRINGIFY(ROCBLAS_VERSION_MAJOR), ".",
XSTRINGIFY(ROCBLAS_VERSION_MINOR), ".",
XSTRINGIFY(ROCBLAS_VERSION_PATCH), "-",
XSTRINGIFY(ROCBLAS_VERSION_TWEAK));
getTuningContext()->GetTuningResultsValidator().RegisterValidator(
"ROCBLAS_VERSION",
[rocblas_version]() { return rocblas_version; },
[rocblas_version](auto&& k) { return rocblas_version == k ? OK : FAIL; });
}
}
static void AddHipblasltValidator() {
auto validators = getTuningContext()->GetTuningResultsValidator().GetAllValidators();
if (validators.find("HIPBLASLT_VERSION") == validators.end()) {
std::string hipblaslt_version = c10::str(
XSTRINGIFY(HIPBLASLT_VERSION_MAJOR), ".",
XSTRINGIFY(HIPBLASLT_VERSION_MINOR), ".",
XSTRINGIFY(HIPBLASLT_VERSION_PATCH), "-",
XSTRINGIFY(HIPBLASLT_VERSION_TWEAK));
getTuningContext()->GetTuningResultsValidator().RegisterValidator(
"HIPBLASLT_VERSION",
[hipblaslt_version]() { return hipblaslt_version; },
[hipblaslt_version](auto&& k) { return hipblaslt_version == k ? OK : FAIL; });
}
}
static void AddRocmValidator() {
auto validators = getTuningContext()->GetTuningResultsValidator().GetAllValidators();
if (validators.find("ROCM_VERSION") == validators.end()) {
std::string rocm_version = ROCM_BUILD_INFO;
getTuningContext()->GetTuningResultsValidator().RegisterValidator(
"ROCM_VERSION",
[rocm_version]() { return rocm_version; },
[rocm_version](auto&& k) { return rocm_version == k ? OK : FAIL; });
}
if (validators.find("GCN_ARCH_NAME") == validators.end()) {
std::string gcn_arch_name = at::cuda::getCurrentDeviceProperties()->gcnArchName;
getTuningContext()->GetTuningResultsValidator().RegisterValidator(
"GCN_ARCH_NAME",
[gcn_arch_name]() { return gcn_arch_name; },
[gcn_arch_name](auto&& k) { return gcn_arch_name == k ? OK : FAIL; });
}
}
#endif
template <typename T, BlasOp ALayout, BlasOp BLayout>
class GemmTunableOp : public TunableOp<GemmParams<T>, StreamTimer> {
@ -182,45 +232,21 @@ class GemmTunableOp : public TunableOp<GemmParams<T>, StreamTimer> {
GemmTunableOp() {
this->RegisterOp(std::string("Default"), std::make_unique<DefaultGemmOp<T>>());
auto validators = getTuningContext()->GetTuningResultsValidator().GetAllValidators();
#ifdef USE_ROCM
for (auto&& [name, op] : GetRocBlasGemmTypeStringAndOps<T>()) {
this->RegisterOp(std::move(name), std::move(op));
bool rocm_validators = false;
static const char *env_rocblas = std::getenv("PYTORCH_TUNABLEOP_ROCBLAS_ENABLED");
if (env_rocblas == nullptr || strcmp(env_rocblas, "1") == 0) {
rocm_validators = true;
for (auto&& [name, op] : GetRocBlasGemmTypeStringAndOps<T>()) {
this->RegisterOp(std::move(name), std::move(op));
}
AddRocblasValidator();
}
if (validators.find("ROCM_VERSION") == validators.end()) {
std::string rocm_version = ROCM_BUILD_INFO;
getTuningContext()->GetTuningResultsValidator().RegisterValidator(
"ROCM_VERSION",
[rocm_version]() { return rocm_version; },
[rocm_version](auto&& k) { return rocm_version == k ? OK : FAIL; });
}
if (validators.find("GCN_ARCH_NAME") == validators.end()) {
std::string gcn_arch_name = at::cuda::getCurrentDeviceProperties()->gcnArchName;
getTuningContext()->GetTuningResultsValidator().RegisterValidator(
"GCN_ARCH_NAME",
[gcn_arch_name]() { return gcn_arch_name; },
[gcn_arch_name](auto&& k) { return gcn_arch_name == k ? OK : FAIL; });
}
if (validators.find("ROCBLAS_VERSION") == validators.end()) {
std::string rocblas_version = c10::str(
XSTRINGIFY(ROCBLAS_VERSION_MAJOR), ".",
XSTRINGIFY(ROCBLAS_VERSION_MINOR), ".",
XSTRINGIFY(ROCBLAS_VERSION_PATCH), "-",
XSTRINGIFY(ROCBLAS_VERSION_TWEAK));
getTuningContext()->GetTuningResultsValidator().RegisterValidator(
"ROCBLAS_VERSION",
[rocblas_version]() { return rocblas_version; },
[rocblas_version](auto&& k) { return rocblas_version == k ? OK : FAIL; });
}
#endif
#if defined(USE_ROCM)
static const char *env = std::getenv("PYTORCH_TUNABLEOP_HIPBLASLT_ENABLED");
if (env == nullptr || strcmp(env, "1") == 0) {
static const char *env_hipblaslt = std::getenv("PYTORCH_TUNABLEOP_HIPBLASLT_ENABLED");
if (env_hipblaslt == nullptr || strcmp(env_hipblaslt, "1") == 0) {
rocm_validators = true;
// disallow tuning of hipblaslt with c10::complex
if constexpr (
!std::is_same_v<T, c10::complex<float>> &&
@ -229,18 +255,11 @@ class GemmTunableOp : public TunableOp<GemmParams<T>, StreamTimer> {
this->RegisterOp(std::move(name), std::move(op));
}
}
AddHipblasltValidator();
}
if (validators.find("HIPBLASLT_VERSION") == validators.end()) {
std::string hipblaslt_version = c10::str(
XSTRINGIFY(HIPBLASLT_VERSION_MAJOR), ".",
XSTRINGIFY(HIPBLASLT_VERSION_MINOR), ".",
XSTRINGIFY(HIPBLASLT_VERSION_PATCH), "-",
XSTRINGIFY(HIPBLASLT_VERSION_TWEAK));
getTuningContext()->GetTuningResultsValidator().RegisterValidator(
"HIPBLASLT_VERSION",
[hipblaslt_version]() { return hipblaslt_version; },
[hipblaslt_version](auto&& k) { return hipblaslt_version == k ? OK : FAIL; });
}
if (rocm_validators) {
AddRocmValidator();
}
#endif
}
@ -256,45 +275,21 @@ class GemmStridedBatchedTunableOp : public TunableOp<GemmStridedBatchedParams<T>
GemmStridedBatchedTunableOp() {
this->RegisterOp(std::string("Default"), std::make_unique<DefaultGemmStridedBatchedOp<T>>());
auto validators = getTuningContext()->GetTuningResultsValidator().GetAllValidators();
#ifdef USE_ROCM
for (auto&& [name, op] : GetRocBlasGemmStridedBatchedTypeStringAndOps<T>()) {
this->RegisterOp(std::move(name), std::move(op));
bool rocm_validators = false;
static const char *env_rocblas = std::getenv("PYTORCH_TUNABLEOP_ROCBLAS_ENABLED");
if (env_rocblas == nullptr || strcmp(env_rocblas, "1") == 0) {
rocm_validators = true;
for (auto&& [name, op] : GetRocBlasGemmStridedBatchedTypeStringAndOps<T>()) {
this->RegisterOp(std::move(name), std::move(op));
}
AddRocblasValidator();
}
if (validators.find("ROCM_VERSION") == validators.end()) {
std::string rocm_version = ROCM_BUILD_INFO;
getTuningContext()->GetTuningResultsValidator().RegisterValidator(
"ROCM_VERSION",
[rocm_version]() { return rocm_version; },
[rocm_version](auto&& k) { return rocm_version == k ? OK : FAIL; });
}
if (validators.find("GCN_ARCH_NAME") == validators.end()) {
std::string gcn_arch_name = at::cuda::getCurrentDeviceProperties()->gcnArchName;
getTuningContext()->GetTuningResultsValidator().RegisterValidator(
"GCN_ARCH_NAME",
[gcn_arch_name]() { return gcn_arch_name; },
[gcn_arch_name](auto&& k) { return gcn_arch_name == k ? OK : FAIL; });
}
if (validators.find("ROCBLAS_VERSION") == validators.end()) {
std::string rocblas_version = c10::str(
XSTRINGIFY(ROCBLAS_VERSION_MAJOR), ".",
XSTRINGIFY(ROCBLAS_VERSION_MINOR), ".",
XSTRINGIFY(ROCBLAS_VERSION_PATCH), "-",
XSTRINGIFY(ROCBLAS_VERSION_TWEAK));
getTuningContext()->GetTuningResultsValidator().RegisterValidator(
"ROCBLAS_VERSION",
[rocblas_version]() { return rocblas_version; },
[rocblas_version](auto&& k) { return rocblas_version == k ? OK : FAIL; });
}
#endif
#if defined(USE_ROCM)
static const char *env = std::getenv("PYTORCH_TUNABLEOP_HIPBLASLT_ENABLED");
if (env == nullptr || strcmp(env, "1") == 0) {
static const char *env_hipblaslt = std::getenv("PYTORCH_TUNABLEOP_HIPBLASLT_ENABLED");
if (env_hipblaslt == nullptr || strcmp(env_hipblaslt, "1") == 0) {
rocm_validators = true;
// disallow tuning of hipblaslt with c10::complex
if constexpr (
!std::is_same_v<T, c10::complex<float>> &&
@ -303,18 +298,11 @@ class GemmStridedBatchedTunableOp : public TunableOp<GemmStridedBatchedParams<T>
this->RegisterOp(std::move(name), std::move(op));
}
}
AddHipblasltValidator();
}
if (validators.find("HIPBLASLT_VERSION") == validators.end()) {
std::string hipblaslt_version = c10::str(
XSTRINGIFY(HIPBLASLT_VERSION_MAJOR), ".",
XSTRINGIFY(HIPBLASLT_VERSION_MINOR), ".",
XSTRINGIFY(HIPBLASLT_VERSION_PATCH), "-",
XSTRINGIFY(HIPBLASLT_VERSION_TWEAK));
getTuningContext()->GetTuningResultsValidator().RegisterValidator(
"HIPBLASLT_VERSION",
[hipblaslt_version]() { return hipblaslt_version; },
[hipblaslt_version](auto&& k) { return hipblaslt_version == k ? OK : FAIL; });
}
if (rocm_validators) {
AddRocmValidator();
}
#endif
}
@ -336,18 +324,8 @@ class ScaledGemmTunableOp : public TunableOp<ScaledGemmParams<CT>, StreamTimer>
for (auto&& [name, op] : GetHipBlasLtScaledGemmTypeStringAndOps<AT, BT, CT, ALayout, BLayout>()) {
this->RegisterOp(std::move(name), std::move(op));
}
if (validators.find("HIPBLASLT_VERSION") == validators.end()) {
std::string hipblaslt_version = c10::str(
XSTRINGIFY(HIPBLASLT_VERSION_MAJOR), ".",
XSTRINGIFY(HIPBLASLT_VERSION_MINOR), ".",
XSTRINGIFY(HIPBLASLT_VERSION_PATCH), "-",
XSTRINGIFY(HIPBLASLT_VERSION_TWEAK));
getTuningContext()->GetTuningResultsValidator().RegisterValidator(
"HIPBLASLT_VERSION",
[hipblaslt_version]() { return hipblaslt_version; },
[hipblaslt_version](auto&& k) { return hipblaslt_version == k ? OK : FAIL; });
}
AddHipblasltValidator();
AddRocmValidator();
#endif
}
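
Review note: the hunks above move the repeated "register this validator only if its key is absent" blocks into helpers (AddRocblasValidator, AddHipblasltValidator, AddRocmValidator). The pattern can be sketched in isolation as below. This is a minimal stand-in, not the real TuningResultsValidator: ValidatorRegistry, RegisterOnce, ValidateAll and AddVersionValidator are hypothetical names, and treating a missing saved key as a failure is an assumption of the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for the validator used in the diff; all names here
// are illustrative and not part of the PyTorch API.
enum Status { OK, FAIL };

class ValidatorRegistry {
 public:
  using Getter = std::function<std::string()>;
  using Checker = std::function<Status(const std::string&)>;

  // Registers a key only once; a second registration of the same key is ignored.
  bool RegisterOnce(const std::string& key, Getter get, Checker check) {
    if (entries_.count(key) != 0) {
      return false;
    }
    entries_.emplace(key, Entry{std::move(get), std::move(check)});
    return true;
  }

  // Checks saved key/value pairs against every registered validator.
  // Assumption of this sketch: a missing saved key counts as a failure.
  Status ValidateAll(const std::unordered_map<std::string, std::string>& saved) const {
    for (const auto& kv : entries_) {
      auto it = saved.find(kv.first);
      if (it == saved.end() || kv.second.check(it->second) != OK) {
        return FAIL;
      }
    }
    return OK;
  }

  std::size_t size() const { return entries_.size(); }

 private:
  struct Entry {
    Getter get;
    Checker check;
  };
  std::unordered_map<std::string, Entry> entries_;
};

// Same shape as the helpers in the diff: capture the current version string
// and compare it with the value stored in a results file.
inline void AddVersionValidator(ValidatorRegistry& reg,
                                const std::string& key,
                                const std::string& version) {
  assert(!key.empty());
  reg.RegisterOnce(
      key,
      [version]() { return version; },
      [version](const std::string& saved) { return saved == version ? OK : FAIL; });
}
```

Because registration is idempotent, each operator constructor can call the helper unconditionally, which is what lets the diff drop the per-constructor `validators.find(...)` checks.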


@ -10,6 +10,7 @@
#pragma once
#include <ATen/cuda/tunable/Tunable.h>
#include <ATen/cuda/Sleep.h>
#include <c10/cuda/CUDACachingAllocator.h>
#ifndef _WIN32
@ -62,7 +63,7 @@ class TunableOp {
result = ResultEntry::Default();
}
if (result == ResultEntry::Null()) {
TUNABLE_LOG("no result, using default");
TUNABLE_LOG2("no result, using default");
result = ResultEntry::Default();
}
auto iter = ops_.find(result);
@ -87,88 +88,120 @@ class TunableOp {
}
private:
static void WarmUp(Callable<ParamsT> *op, ParamsT* param, size_t num_iter) {
static void WarmUp(Callable<ParamsT> *op, const std::vector<ParamsT*> &param, size_t num_iter, size_t &offset) {
TuningContext* ctx = getTuningContext();
bool do_flush = ctx->IsICacheFlushEnabled();
for (size_t i = 0; i < num_iter; i++) {
TORCH_CHECK(op->Call(param) == OK);
if (do_flush) {
at::cuda::flush_icache();
}
TORCH_CHECK(op->Call(param[(i+offset++)%param.size()]) == OK);
}
}
static double Profile(Callable<ParamsT> *op, ParamsT* param, size_t num_iter) {
static double Profile(Callable<ParamsT> *op, const std::vector<ParamsT*> &param, size_t num_iter, size_t &offset) {
TuningContext* ctx = getTuningContext();
bool do_flush = ctx->IsICacheFlushEnabled();
TimerT timer{};
timer.Start();
for (size_t i = 0; i < num_iter; i++) {
TORCH_CHECK(op->Call(param) == OK);
if (do_flush) {
at::cuda::flush_icache();
}
TORCH_CHECK(op->Call(param[(i+offset++)%param.size()]) == OK);
}
timer.End();
return timer.Duration() / num_iter;
}
protected:
bool IsNumericsCheckEnabled() {
static const char *env = getenv("PYTORCH_TUNABLEOP_NUMERICAL_CHECK");
if (env != nullptr && strcmp(env, "0") == 0) {
return false;
}
return true;
}
virtual ResultEntry FindFastest(const ParamsT* params) {
TuningContext* ctx = getTuningContext();
auto op_sig = Signature();
auto params_sig = params->Signature();
TUNABLE_LOG("finding fastest for ", op_sig, '(', params_sig, ')', " out of ", op_names_.size(), " candidates");
TUNABLE_LOG2("finding fastest for ", op_sig, '(', params_sig, ')', " out of ", op_names_.size(), " candidates");
auto min_duration_ms = std::numeric_limits<double>::infinity();
std::string id_name = "Default";
ParamsT* reference_params = nullptr;
// calculate a reference answer for numerical check
ParamsT* reference_params = params->DeepCopy();
TORCH_CHECK(ops_[ResultEntry::Default()]->Call(reference_params) == OK);
if (ctx->IsNumericsCheckEnabled()) {
reference_params = params->DeepCopy(false);
TORCH_CHECK(ops_[ResultEntry::Default()]->Call(reference_params) == OK);
}
// need a copy of params to reuse
ParamsT* reusable_params = params->DeepCopy();
// need copies of params to reuse
// make as many copies as will fill the requested rotating buffer size, if requested
// rotating_size guaranteed to be >= 0 even though GetRotatingBufferSize() returns int
size_t rotating_size = ctx->GetRotatingBufferSize();
bool use_buffer_rotation = (rotating_size > 0);
size_t param_size = params->GetSize(use_buffer_rotation);
size_t param_count = (rotating_size / param_size) + 1;
constexpr size_t MB = 1024*1024;
if (use_buffer_rotation) {
TUNABLE_LOG2("Rotating buffer ", rotating_size/MB, " MiB. ",
"Needed Size: ", param_size/MB, " MiB. ",
"Needed number of param copies: ", param_count);
}
TORCH_CHECK(param_count > 0);
std::vector<ParamsT*> reusable_params(param_count);
for (size_t i = 0; i < param_count; i++) {
reusable_params[i] = params->DeepCopy(use_buffer_rotation);
}
// for rotating buffer
size_t offset = 0;
for (size_t i = 0; i < op_names_.size(); i++) {
auto* candidate = ops_[op_names_[i]].get(); // borrow pointer
auto status = candidate->Call(reusable_params);
if (status != OK) {
TUNABLE_LOG("├──unsupported id=", i, ", ", op_sig, '(', params_sig, ") ", op_names_[i]);
continue;
}
if (IsNumericsCheckEnabled()) {
ParamsT* numerical_params = params->DeepCopy();
WarmUp(candidate, numerical_params, 1);
if (ctx->IsNumericsCheckEnabled()) {
ParamsT* numerical_params = params->DeepCopy(false);
auto status = candidate->Call(numerical_params);
if (status != OK) {
TUNABLE_LOG3("├──unsupported id=", i, ", ", op_sig, '(', params_sig, ") ", op_names_[i]);
continue;
}
status = reference_params->NumericalCheck(numerical_params);
numerical_params->Delete();
if (status != OK) {
TUNABLE_LOG("├──numerics check failed for id=", i, ", ", op_sig, '(', params_sig, ") ", op_names_[i]);
TUNABLE_LOG3("├──numerics check failed for id=", i, ", ", op_sig, '(', params_sig, ") ", op_names_[i]);
continue;
}
}
else {
auto status = candidate->Call(reusable_params[0]);
if (status != OK) {
TUNABLE_LOG3("├──unsupported id=", i, ", ", op_sig, '(', params_sig, ") ", op_names_[i]);
continue;
}
}
// collect a small profile
constexpr const int approx_num_iter = 3;
auto approx_duration = Profile(candidate, reusable_params, approx_num_iter);
auto approx_duration = Profile(candidate, reusable_params, approx_num_iter, offset);
// bail if too slow
if (approx_duration > 2 * min_duration_ms) {
TUNABLE_LOG("├──skip slow instance id=", i, ", ", op_sig, '(', params_sig, ") ", op_names_[i]);
TUNABLE_LOG3("├──skip slow instance id=", i, ", ", op_sig, '(', params_sig, ") ", op_names_[i]);
continue;
}
// for warmup does user set max duration, max iters, or both?
// warmup is allowed to be skipped by setting either iterations or duration to 0
double max_warmup_duration = ctx->GetMaxWarmupDurationMs();
int max_warmup_iter = ctx->GetMaxWarmupIterations();
int warmup_iter = 1; // default
if (max_warmup_duration > 0) {
if (max_warmup_duration >= 0) {
int duration_iters = max_warmup_duration / approx_duration;
if (max_warmup_iter > 0) {
if (max_warmup_iter >= 0) {
warmup_iter = std::min(max_warmup_iter, duration_iters);
}
else {
warmup_iter = duration_iters;
}
}
else if (max_warmup_iter > 0) {
else if (max_warmup_iter >= 0) {
warmup_iter = max_warmup_iter;
}
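
Review note: the hunk above relaxes the warmup checks from `> 0` to `>= 0`, so a limit of 0 now means "skip warmup" instead of falling back to the default of one iteration. The selection logic can be sketched as a free function. ChooseWarmupIters and its parameter names are hypothetical, and reading a negative limit as "not set" is an assumption drawn from the comparisons in the diff.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper mirroring the warmup selection in the hunk above.
// Assumption: a negative limit means "not set"; 0 means "skip warmup".
inline int ChooseWarmupIters(double max_duration_ms,
                             int max_iters,
                             double approx_duration_ms) {
  assert(approx_duration_ms > 0.0);
  int warmup_iter = 1;  // default when neither limit is set
  if (max_duration_ms >= 0) {
    // Number of iterations that fit in the duration budget.
    int duration_iters = static_cast<int>(max_duration_ms / approx_duration_ms);
    // When both limits are set, the stricter one wins.
    warmup_iter = (max_iters >= 0) ? std::min(max_iters, duration_iters)
                                   : duration_iters;
  } else if (max_iters >= 0) {
    warmup_iter = max_iters;
  }
  return warmup_iter;
}
```

With the old `> 0` comparisons, a duration limit of 0 would have fallen through to the iteration limit or the default, so warmup could not be disabled through the duration setting alone.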
@ -188,27 +221,34 @@ class TunableOp {
else if (max_tuning_iter > 0) {
tuning_iter = max_tuning_iter;
}
// tuning must run at least 1 iteration
tuning_iter = std::max(1, tuning_iter);
// do the full warmup followed by tuning
double warmup_ms = warmup_iter * approx_duration;
double tuning_ms = tuning_iter * approx_duration;
TUNABLE_LOG("├──tuning using "
TUNABLE_LOG3("├──tuning using "
"warmup iters ", warmup_iter, " [", warmup_ms, " ms] "
"and tuning iters ", tuning_iter, " [", tuning_ms, " ms] ",
"instance id=", i, ", ", op_sig, "(", params_sig, ") ", op_names_[i]);
WarmUp(candidate, reusable_params, warmup_iter);
auto duration_ms = Profile(candidate, reusable_params, tuning_iter);
TUNABLE_LOG3("├──offset at ", offset);
WarmUp(candidate, reusable_params, warmup_iter, offset);
auto duration_ms = Profile(candidate, reusable_params, tuning_iter, offset);
if (duration_ms < min_duration_ms) {
TUNABLE_LOG("├──found better instance id=", i, ". " , duration_ms, "ms. ", op_names_[i]);
TUNABLE_LOG3("├──found better instance id=", i, ". " , duration_ms, "ms. ", op_names_[i]);
min_duration_ms = duration_ms;
id_name = op_names_[i];
}
}
reusable_params->Delete();
reference_params->Delete();
for (size_t i = 0; i < reusable_params.size(); i++) {
reusable_params[i]->Delete();
}
if (reference_params) {
reference_params->Delete();
}
TUNABLE_LOG("└──found fastest for ", op_sig, '(', params_sig, ") ", id_name);
TUNABLE_LOG2("└──found fastest for ", op_sig, '(', params_sig, ") ", id_name);
return ResultEntry(id_name, min_duration_ms);
}
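
Review note: FindFastest now makes `(rotating_size / param_size) + 1` copies of the parameters, and WarmUp/Profile pick a copy with `param[(i+offset++)%param.size()]`. The sketch below reproduces only that arithmetic so the access pattern can be inspected. ParamCopyCount and VisitedCopies are hypothetical helper names, and sizes are plain byte counts rather than real buffers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Number of parameter copies, as computed in the diff:
// (rotating_size / param_size) + 1, so there is always at least one copy.
inline std::size_t ParamCopyCount(std::size_t rotating_size_bytes,
                                  std::size_t param_size_bytes) {
  assert(param_size_bytes > 0);
  return rotating_size_bytes / param_size_bytes + 1;
}

// Records which copy each iteration would use, applying the same index
// expression as the diff: (i + offset++) % count. `offset` is carried
// across calls, as it is between WarmUp and Profile.
inline std::vector<std::size_t> VisitedCopies(std::size_t num_iter,
                                              std::size_t count,
                                              std::size_t& offset) {
  assert(count > 0);
  std::vector<std::size_t> visited;
  for (std::size_t i = 0; i < num_iter; i++) {
    visited.push_back((i + offset++) % count);
  }
  return visited;
}
```

In this sketch both `i` and `offset` advance on every iteration, so the index moves in steps of two; with an even number of copies only every other copy is visited within a single call. Whether that stride is intended is not stated in the diff.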


@ -31,46 +31,6 @@ Tensor index_select_backward_hack(const Tensor& grad, IntArrayRef self_sizes, in
return at::zeros(self_sizes, grad.options()).index_add(dim, index, grad);
}
static optional<std::tuple<Tensor,int64_t>> unwrap(const Tensor& tensor) {
auto* wrapped = maybeGetTensorWrapper(tensor);
if (wrapped) {
if (wrapped->level().has_value()) {
return std::make_tuple(wrapped->value(), *wrapped->level());
}
return unwrap(wrapped->value());
}
auto* batched = maybeGetBatchedImpl(tensor);
if (batched) {
return std::make_tuple(batched->value(), batched->level());
}
return nullopt;
}
static bool can_perform_inplace(const Tensor& a, const Tensor& b) {
// TODO: generalize this to more transforms
auto a_ = unwrap(a);
auto b_ = unwrap(b);
if (!a_.has_value() && b_.has_value()) {
return false;
}
if (!a_.has_value() && !b_.has_value()) {
return true;
}
if (a_.has_value() && !b_.has_value()) {
return true;
}
TORCH_INTERNAL_ASSERT(a_.has_value() && b_.has_value());
// If b has any wrapper that a does not, then we cannot do a.inplace_(b)
if (std::get<1>(*a_) < std::get<1>(*b_)) {
return false;
}
if (std::get<1>(*a_) > std::get<1>(*b_)) {
return can_perform_inplace(std::get<0>(*a_), b);
}
return can_perform_inplace(std::get<0>(*a_), std::get<0>(*b_));
}
// TODO: linear is pretty important for performance, but I'm not sure how to work
// around the in-place.
Tensor linear_hack(const Tensor& input, const Tensor& weight, const std::optional<Tensor>& bias_opt) {


@ -1480,23 +1480,14 @@ Tensor& not_equal_(Tensor& self, const Scalar& other) { return self.ne_(other);
Tensor& logical_and_out(const Tensor& self, const Tensor& other, Tensor& result) { return comparison_op_out(result, self, other, logical_and_stub); }
Tensor logical_and(const Tensor& self, const Tensor& other) { return comparison_op(self, other, static_cast<OutFunc>(at::logical_and_out)); }
Tensor& logical_and_(Tensor& self, const Tensor& other) { return comparison_op_(self, other, static_cast<OutFunc>(at::logical_and_out)); }
static Tensor& logical_and_out(Tensor& result, const Tensor& self, const Scalar& other) { return comparison_op_out(result, self, other, static_cast<OutFunc>(at::logical_and_out)); }
static Tensor logical_and(const Tensor& self, const Scalar& other) { return comparison_op(self, other, static_cast<OutFunc>(at::logical_and_out)); }
static Tensor& logical_and_(Tensor& self, const Scalar& other) { return comparison_op_(self, other, static_cast<OutFunc>(at::logical_and_out)); }
Tensor& logical_or_out(const Tensor& self, const Tensor& other, Tensor& result) { return comparison_op_out(result, self, other, logical_or_stub); }
Tensor logical_or(const Tensor& self, const Tensor& other) { return comparison_op(self, other, static_cast<OutFunc>(at::logical_or_out)); }
Tensor& logical_or_(Tensor& self, const Tensor& other) { return comparison_op_(self, other, static_cast<OutFunc>(at::logical_or_out)); }
static Tensor& logical_or_out(Tensor& result, const Tensor& self, const Scalar& other) { return comparison_op_out(result, self, other, static_cast<OutFunc>(at::logical_or_out)); }
static Tensor logical_or(const Tensor& self, const Scalar& other) { return comparison_op(self, other, static_cast<OutFunc>(at::logical_or_out)); }
static Tensor& logical_or_(Tensor& self, const Scalar& other) { return comparison_op_(self, other, static_cast<OutFunc>(at::logical_or_out)); }
Tensor& logical_xor_out(const Tensor& self, const Tensor& other, Tensor& result) { return comparison_op_out(result, self, other, logical_xor_stub); }
Tensor logical_xor(const Tensor& self, const Tensor& other) { return comparison_op(self, other, static_cast<OutFunc>(at::logical_xor_out)); }
Tensor& logical_xor_(Tensor& self, const Tensor& other) { return comparison_op_(self, other, static_cast<OutFunc>(at::logical_xor_out)); }
static Tensor& logical_xor_out(Tensor& result, const Tensor& self, const Scalar& other) { return comparison_op_out(result, self, other, static_cast<OutFunc>(at::logical_xor_out)); }
static Tensor logical_xor(const Tensor& self, const Scalar& other) { return comparison_op(self, other, static_cast<OutFunc>(at::logical_xor_out)); }
static Tensor& logical_xor_(Tensor& self, const Scalar& other) { return comparison_op_(self, other, static_cast<OutFunc>(at::logical_xor_out)); }
// binary max, alias for maximum
Tensor& max_out(const Tensor& self, const Tensor& other, Tensor& result) {


@ -105,6 +105,28 @@ void fp16_gemv_trans(
const float beta,
float16_t* y,
const int incy);
float fp16_dot_with_fp32_arith(
const float16_t* vec1,
const float16_t* vec2,
int64_t len);
void bf16_gemv_trans(
const int m,
const int n,
const at::BFloat16 alpha,
const at::BFloat16* a,
const int lda,
const at::BFloat16* x,
const int incx,
const at::BFloat16 beta,
at::BFloat16* y,
const int incy);
float bf16_dot_with_fp32_arith(
const at::BFloat16* vec1,
const at::BFloat16* vec2,
int64_t len);
#endif
template <typename scalar_t>
@ -113,8 +135,11 @@ bool scal_use_fast_path(C10_UNUSED int64_t n, C10_UNUSED int64_t incx) {
}
template <typename scalar_t>
bool gemv_use_fast_path(C10_UNUSED int64_t m, C10_UNUSED int64_t n,
C10_UNUSED int64_t lda, C10_UNUSED int64_t incx, C10_UNUSED int64_t incy) {
bool gemv_use_fast_path(C10_UNUSED char trans, C10_UNUSED int64_t m,
C10_UNUSED int64_t n, C10_UNUSED scalar_t alpha,
C10_UNUSED int64_t lda,
C10_UNUSED int64_t incx, C10_UNUSED scalar_t beta,
C10_UNUSED int64_t incy) {
return false;
}
@ -133,7 +158,7 @@ void gemv_fast_path(C10_UNUSED const char *trans, C10_UNUSED const int *m, C10_U
#define INSTANTIATE(scalar_t) \
template bool scal_use_fast_path<scalar_t>(int64_t n, int64_t incx); \
template bool gemv_use_fast_path<scalar_t>(int64_t m, int64_t n, int64_t lda, int64_t incx, int64_t incy); \
template bool gemv_use_fast_path<scalar_t>(char trans, int64_t m, int64_t n, scalar_t alpha, int64_t lda, int64_t incx, scalar_t beta, int64_t incy); \
template void gemv_fast_path<scalar_t>(const char *trans, const int *m, const int *n, const scalar_t *alpha, const scalar_t *a, const int *lda, const scalar_t *x, const int *incx, const scalar_t *beta, scalar_t *y, const int *incy); \
template void scal_fast_path<scalar_t>(int *n, scalar_t *a, scalar_t *x, int *incx);
@ -160,15 +185,15 @@ void scal_fast_path<float>(int *n, float *a, float *x, int *incx) {
}
template <>
bool gemv_use_fast_path<float>(int64_t m, int64_t n, int64_t lda, int64_t incx, int64_t incy) {
bool gemv_use_fast_path<float>(C10_UNUSED char trans, int64_t m, int64_t n, C10_UNUSED float alpha, int64_t lda, int64_t incx, C10_UNUSED float beta, int64_t incy) {
auto intmax = std::numeric_limits<int>::max();
return (m <= intmax) && (n <= intmax) && (lda <= intmax) &&
(incx > 0) && (incx <= intmax) && (incy > 0) && (incy <= intmax);
}
template <>
bool gemv_use_fast_path<double>(int64_t m, int64_t n, int64_t lda, int64_t incx, int64_t incy) {
return gemv_use_fast_path<float>(m, n, lda, incx, incy);
bool gemv_use_fast_path<double>(C10_UNUSED char trans, int64_t m, int64_t n, C10_UNUSED double alpha, int64_t lda, int64_t incx, C10_UNUSED double beta, int64_t incy) {
return gemv_use_fast_path<float>(trans, m, n, (float)alpha, lda, incx, (float)beta, incy);
}
template <>
@ -190,7 +215,6 @@ INSTANTIATE(int8_t);
INSTANTIATE(int16_t);
INSTANTIATE(int);
INSTANTIATE(int64_t);
INSTANTIATE(c10::BFloat16);
#if defined(__aarch64__) && !defined(C10_MOBILE)
template <>
bool scal_use_fast_path<at::Half>(C10_UNUSED int64_t n, C10_UNUSED int64_t incx) {
@ -199,14 +223,32 @@ bool scal_use_fast_path<at::Half>(C10_UNUSED int64_t n, C10_UNUSED int64_t incx)
template <>
bool gemv_use_fast_path<at::Half>(
C10_UNUSED char trans,
C10_UNUSED int64_t m,
C10_UNUSED int64_t n,
at::Half alpha,
C10_UNUSED int64_t lda,
C10_UNUSED int64_t incx,
at::Half beta,
C10_UNUSED int64_t incy) {
return true;
return incx == 1 && c10::detail::fp16_from_bits(alpha.x) == 1.0f &&
c10::detail::fp16_from_bits(beta.x) == 0.0f;
}
template <>
bool gemv_use_fast_path<at::BFloat16>(
C10_UNUSED char trans,
C10_UNUSED int64_t m,
C10_UNUSED int64_t n,
at::BFloat16 alpha,
C10_UNUSED int64_t lda,
C10_UNUSED int64_t incx,
at::BFloat16 beta,
C10_UNUSED int64_t incy) {
return (trans == 'T' || trans == 't') && incx == 1 && alpha == 1.0 && beta == 0.0;
}
#ifdef __ARM_FEATURE_FP16_SCALAR_ARITHMETIC
static inline float16_t reduce(float16x4_t x) {
auto sum = vpadd_f16(x, x);
@ -379,7 +421,7 @@ static inline double reduce(float32x4_t x[kF32RegistersPerIteration]) {
return vaddvq_f32(x[0]);
}
static C10_ALWAYS_INLINE void fp16_dot_with_fp32_arith_main_inner_loop(
static C10_ALWAYS_INLINE void dot_with_fp32_arith_main_inner_loop(
const float16_t* vec1,
const float16_t* vec2,
float32x4_t sum[kF32RegistersPerIteration],
@ -392,7 +434,7 @@ static C10_ALWAYS_INLINE void fp16_dot_with_fp32_arith_main_inner_loop(
sum[2 * registerPairIndex + 1] = f32_fma_high_f16(sum[2 * registerPairIndex + 1], temp_vec1, temp_vec2);
}
static C10_ALWAYS_INLINE void fp16_dot_with_fp32_arith_vectorized_tail_inner_loop(
static C10_ALWAYS_INLINE void dot_with_fp32_arith_vectorized_tail_inner_loop(
const float16_t* vec1,
const float16_t* vec2,
float32x4_t* tailSum,
@ -402,14 +444,48 @@ static C10_ALWAYS_INLINE void fp16_dot_with_fp32_arith_vectorized_tail_inner_loo
*tailSum = f32_fma_f16(*tailSum, temp_vec1, temp_vec2);
}
float fp16_dot_with_fp32_arith(const float16_t* vec1, const float16_t* vec2, int64_t len) {
static C10_ALWAYS_INLINE float32x4_t to_bfloat16(uint16x4_t u16) {
int32x4_t shift = vdupq_n_s32(16);
return vreinterpretq_f32_u32(vshlq_u32(vmovl_u16(u16), shift));
}
static C10_ALWAYS_INLINE float32x4_t f32_fma_bf16(float32x4_t a, uint16x4_t b, uint16x4_t c) {
return f32_fma(a, to_bfloat16(b), to_bfloat16(c));
}
static C10_ALWAYS_INLINE void dot_with_fp32_arith_main_inner_loop(
const at::BFloat16* vec1,
const at::BFloat16* vec2,
float32x4_t sum[kF32RegistersPerIteration],
int registerPairIndex) {
// TODO: detect intrinsic availability, use them if they're available. __ARM_FEATURE_BF16
// Load a pair of f32 registers at a time.
const uint16x8_t temp_vec1 = vld1q_u16(reinterpret_cast<const uint16_t*>(&vec1[registerPairIndex * 2 * kF32ElementsPerRegister]));
const uint16x8_t temp_vec2 = vld1q_u16(reinterpret_cast<const uint16_t*>(&vec2[registerPairIndex * 2 * kF32ElementsPerRegister]));
sum[2 * registerPairIndex] = f32_fma_bf16(sum[2 * registerPairIndex], vget_low_u16(temp_vec1), vget_low_u16(temp_vec2));
sum[2 * registerPairIndex + 1] = f32_fma_bf16(sum[2 * registerPairIndex + 1], vget_high_u16(temp_vec1), vget_high_u16(temp_vec2));
}
static C10_ALWAYS_INLINE void dot_with_fp32_arith_vectorized_tail_inner_loop(
const at::BFloat16* vec1,
const at::BFloat16* vec2,
float32x4_t* tailSum,
int idx) {
const auto temp_vec1 = vld1_u16(reinterpret_cast<const uint16_t*>(&vec1[idx]));
const auto temp_vec2 = vld1_u16(reinterpret_cast<const uint16_t*>(&vec2[idx]));
*tailSum = f32_fma_bf16(*tailSum, temp_vec1, temp_vec2);
}
template <typename T>
float dot_with_fp32_arith(const T* vec1, const T* vec2, int64_t len) {
float32x4_t sum[kF32RegistersPerIteration] = {vdupq_n_f32(0)};
const auto len_aligned = len & ~(kF32ElementsPerIteration - 1);
for (int j = 0; j < len_aligned ; j += kF32ElementsPerIteration) {
const auto* vec1_ = vec1 + j;
const auto* vec2_ = vec2 + j;
c10::ForcedUnroll<kF32RegisterPairsPerIteration>{}([vec1_, vec2_, &sum](auto k) {
fp16_dot_with_fp32_arith_main_inner_loop(vec1_, vec2_, sum, k);
dot_with_fp32_arith_main_inner_loop(vec1_, vec2_, sum, k);
});
}
auto reducedSum = reduce(sum);
@ -420,7 +496,7 @@ float fp16_dot_with_fp32_arith(const float16_t* vec1, const float16_t* vec2, int
float32x4_t tailSum = vdupq_n_f32(0);
const auto len_aligned_4 = len & ~3;
for (int j = len_aligned; j < len_aligned_4; j += 4) {
fp16_dot_with_fp32_arith_vectorized_tail_inner_loop(vec1, vec2, &tailSum, j);
dot_with_fp32_arith_vectorized_tail_inner_loop(vec1, vec2, &tailSum, j);
}
auto reducedTail = vpaddq_f32(tailSum, tailSum);
reducedSum += vgetq_lane_f32(vpaddq_f32(reducedTail, reducedTail), 0);
@ -432,6 +508,14 @@ float fp16_dot_with_fp32_arith(const float16_t* vec1, const float16_t* vec2, int
return reducedSum;
}
float fp16_dot_with_fp32_arith(const float16_t* vec1, const float16_t* vec2, int64_t len) {
return dot_with_fp32_arith(vec1, vec2, len);
}
float bf16_dot_with_fp32_arith(const at::BFloat16* vec1, const at::BFloat16* vec2, int64_t len) {
return dot_with_fp32_arith(vec1, vec2, len);
}
// On my Apple M1 Macbook (which is ARM v8.5 and thus has the
// instructions f32_fma_{low,high}_f16 is targeting), this kernel has
// equivalent performance to the fp16-native kernel.
@ -443,6 +527,14 @@ static void fp16_gemv_trans_fp32_arith_by_dot_products(const int m, const int n,
});
}
static void bf16_gemv_trans_fp32_arith_by_dot_products(const int m, const int n, const at::BFloat16* a, const int lda, const at::BFloat16 *x, at::BFloat16* y, int incy) {
parallel_for(0, n, 1, [&](int begin, int end) {
for (int i = begin; i < end; ++i) {
y[i * incy] = bf16_dot_with_fp32_arith(x, a + lda * i, m);
}
});
}
void fp16_gemv_trans(
const int m,
const int n,
@ -454,26 +546,28 @@ void fp16_gemv_trans(
const float beta,
float16_t* y,
const int incy) {
if (incx == 1 && alpha == 1.0 && beta == 0.0) {
TORCH_INTERNAL_ASSERT_DEBUG_ONLY(incx == 1 && alpha == 1.0 && beta == 0.0);
#ifdef __ARM_FEATURE_FP16_SCALAR_ARITHMETIC
if (at::globalContext().allowFP16ReductionCPU()) {
return fp16_gemv_trans_fp16_arith_by_dot_products(m, n, a, lda, x, y, incy);
}
if (at::globalContext().allowFP16ReductionCPU()) {
return fp16_gemv_trans_fp16_arith_by_dot_products(m, n, a, lda, x, y, incy);
}
#endif
return fp16_gemv_trans_fp32_arith_by_dot_products(m, n, a, lda, x, y, incy);
}
for (const auto i : c10::irange(n)) {
float sum = 0;
const auto row_ = a + lda * i;
for (const auto j : c10::irange(m)) {
sum += x[j * incx] * row_[j];
}
if (beta == 0.0) {
y[i * incy] = alpha * sum;
} else {
y[i * incy] = beta * y[i * incy] + alpha * sum;
}
}
return fp16_gemv_trans_fp32_arith_by_dot_products(m, n, a, lda, x, y, incy);
}
void bf16_gemv_trans(
const int m,
const int n,
const at::BFloat16 alpha,
const at::BFloat16* a,
const int lda,
const at::BFloat16* x,
const int incx,
const at::BFloat16 beta,
at::BFloat16* y,
const int incy) {
TORCH_INTERNAL_ASSERT_DEBUG_ONLY(incx == 1 && alpha == 1.0 && beta == 0.0);
return bf16_gemv_trans_fp32_arith_by_dot_products(m, n, a, lda, x, y, incy);
}
@ -590,9 +684,37 @@ void gemv_fast_path<at::Half>(
*incy);
}
}
#else
template <>
void gemv_fast_path<at::BFloat16>(
const char* trans,
const int* m,
const int* n,
const at::BFloat16* alpha,
const at::BFloat16* a,
const int* lda,
const at::BFloat16* x,
const int* incx,
const at::BFloat16* beta,
at::BFloat16* y,
const int* incy) {
TORCH_INTERNAL_ASSERT_DEBUG_ONLY(trans[0] == 'T' || trans[0] == 't');
bf16_gemv_trans(
*m,
*n,
*alpha,
a,
*lda,
x,
*incx,
*beta,
y,
*incy);
}
#else // defined(__aarch64__) && !defined(C10_MOBILE)
INSTANTIATE(c10::Half);
#endif
INSTANTIATE(c10::BFloat16);
#endif // defined(__aarch64__) && !defined(C10_MOBILE)
#undef INSTANTIATE
} // namespace blas_impl
@ -623,7 +745,7 @@ void gemv(char trans, int64_t m, int64_t n, scalar_t alpha, const scalar_t *a, i
if(n == 1) lda = m;
#if AT_BUILD_WITH_BLAS()
if (blas_impl::gemv_use_fast_path<scalar_t>(m, n, lda, incx, incy)) {
if (blas_impl::gemv_use_fast_path<scalar_t>(trans, m, n, alpha, lda, incx, beta, incy)) {
TORCH_CHECK(lda >= std::max<int64_t>(1L, m), "lda should be at least max(1,", m, "), but have ", lda);
int i_m = (int)m;
int i_n = (int)n;
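
The scalar loop in the transposed GEMV hunk above computes y = alpha * A^T x + beta * y one column at a time, and skips reading y when beta is zero. A minimal standalone sketch of that reference loop, assuming plain float data and a column-major A with leading dimension lda; the name gemv_trans_reference is hypothetical and is not part of ATen:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical reference helper (not an ATen function).
// Computes y = alpha * A^T * x + beta * y for a column-major m x n matrix A.
void gemv_trans_reference(
    int m, int n, float alpha, const float* a, int lda,
    const float* x, int incx, float beta, float* y, int incy) {
  assert(lda >= m);
  for (int i = 0; i < n; ++i) {
    // Column i of A is contiguous, so A^T * x is a dot product per column.
    const float* col = a + static_cast<std::int64_t>(lda) * i;
    float sum = 0.0f;
    for (int j = 0; j < m; ++j) {
      sum += x[j * incx] * col[j];
    }
    if (beta == 0.0f) {
      // Do not read y here: stale NaN/Inf values must not leak into the result.
      y[i * incy] = alpha * sum;
    } else {
      y[i * incy] = beta * y[i * incy] + alpha * sum;
    }
  }
}
```

The explicit beta == 0 branch follows the usual BLAS convention that y need not be initialized when beta is zero.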

View File

@ -393,7 +393,7 @@ struct RegisterPRIVATEUSE1Dispatch {
// REGISTER_DISPATCH now dispatches an AVX512 kernel to nullptr but registers other dispatches.
// ALSO_REGISTER_AVX512_DISPATCH should be used for ensuring AVX512 dispatch, among others.
#ifdef CPU_CAPABILITY_AVX512
#define REGISTER_DISPATCH(name, fn) REGISTER_ARCH_DISPATCH(name, CPU_CAPABILITY, nullptr)
#define REGISTER_DISPATCH(name, fn) REGISTER_ARCH_DISPATCH(name, CPU_CAPABILITY, ((void*)(fn) ? nullptr : nullptr))
#else
#define REGISTER_DISPATCH(name, fn) REGISTER_ARCH_DISPATCH(name, CPU_CAPABILITY, fn)
#endif

View File

@ -856,7 +856,7 @@ namespace {
/**
* @brief Computes the optimal matrix chain multiplication order
*
* Follows the dynamic programming algorithm from Cormen et al,
* Follows the dynamic programming algorithm from Cormen et al.,
* "Introduction to Algorithms, Third Edition", Chapter 15.2,
* p. 370-378. Note that the book uses 1-based indexing.
*

View File

@ -2,9 +2,9 @@
// Licensed under the BSD-3-Clause license
// This is the CPU implementation of the Connectionist Temporal Loss.
// We mostly follow Graves.
// 1. Graves et al: http://www.cs.toronto.edu/~graves/icml_2006.pdf
// 1. Graves et al.: http://www.cs.toronto.edu/~graves/icml_2006.pdf
// We use the equations from above link, but note that [1] has 1-based indexing and we (of course) use 0-based.
// Graves et al call the probabilities y, we use log_probs (also calling them inputs)
// Graves et al. call the probabilities y, we use log_probs (also calling them inputs)
#define TORCH_ASSERT_ONLY_METHOD_OPERATORS
#include <ATen/core/Tensor.h>

View File

@ -499,13 +499,4 @@ Tensor nll_loss2d_symint(const Tensor & self, const Tensor & target, const std::
return std::get<0>(at::nll_loss2d_forward_symint(self, target, weight, reduction, std::move(ignore_index)));
}
// Duplicate of above code for non-symbolic ints. Kept for BC purposes and to minimize breakages.
static Tensor nll_loss2d(const Tensor & self, const Tensor & target, const std::optional<Tensor>& weight_opt, int64_t reduction, int64_t ignore_index) {
// See [Note: hacky wrapper removal for optional tensor]
c10::MaybeOwned<Tensor> weight_maybe_owned = at::borrow_from_optional_tensor(weight_opt);
const Tensor& weight = *weight_maybe_owned;
return std::get<0>(at::nll_loss2d_forward_symint(self, target, weight, reduction, ignore_index));
}
} // namespace at::native

View File

@ -508,7 +508,7 @@ static inline C10_HOST_DEVICE scalar_t calc_polygamma(scalar_t x, int n) {
/* References
* [igam1] "The Digital Library of Mathematical Functions", dlmf.nist.gov
* [igam2] Maddock et. al., "Incomplete Gamma Functions",
* [igam2] Maddock et al., "Incomplete Gamma Functions",
* https://www.boost.org/doc/libs/1_61_0/libs/math/doc/html/math_toolkit/sf_gamma/igamma.html
*/

View File

@ -28,18 +28,6 @@ Tensor empty_meta_symint(
size, dtype_opt, layout_opt, device_opt, pin_memory_opt, memory_format_opt);
}
// Kept only for BC with XLA
static Tensor empty_strided_meta(
IntArrayRef size,
IntArrayRef stride,
std::optional<ScalarType> dtype_opt,
std::optional<Layout> layout_opt,
std::optional<Device> device_opt,
std::optional<bool> pin_memory_opt
) {
return empty_strided_meta_symint(c10::fromIntArrayRefSlow(size), c10::fromIntArrayRefSlow(stride), dtype_opt, layout_opt, device_opt, pin_memory_opt);
}
Tensor empty_strided_meta_symint(
SymIntArrayRef size,
SymIntArrayRef stride,

View File

@ -802,55 +802,6 @@ TORCH_IMPL_FUNC(slow_conv_transpose2d_structured_cpu)
dilation);
}
static std::tuple<Tensor&, Tensor&, Tensor&> slow_conv_transpose2d_backward_out_cpu(const Tensor& grad_output,
const Tensor& input,
const Tensor& weight,
IntArrayRef kernel_size,
IntArrayRef stride,
IntArrayRef padding,
IntArrayRef output_padding,
IntArrayRef dilation,
Tensor& grad_input,
Tensor& grad_weight,
Tensor& grad_bias) {
if (grad_input.defined()) {
slow_conv_transpose2d_backward_out_cpu_template(
input,
grad_output,
grad_input,
weight,
kernel_size,
stride,
padding,
output_padding,
dilation);
}
if (grad_bias.defined()) {
at::sum_out(grad_bias, grad_output, IntArrayRef{0, 2, 3});
}
if (grad_weight.defined()) {
grad_weight.resize_(weight.sizes(), weight.suggest_memory_format());
grad_weight.zero_();
slow_conv_transpose2d_acc_grad_parameters_cpu(
input,
weight,
grad_output,
grad_weight,
grad_bias,
kernel_size,
stride,
padding,
output_padding,
dilation,
1);
}
return std::tuple<Tensor&, Tensor&, Tensor&>(
grad_input, grad_weight, grad_bias);
}
static std::tuple<Tensor, Tensor, Tensor> slow_conv_transpose2d_backward_cpu(
const Tensor& grad_output,
const Tensor& input,

View File

@ -871,58 +871,6 @@ Tensor slow_conv_transpose3d_cpu(
return output;
}
static std::tuple<Tensor&, Tensor&, Tensor&> slow_conv_transpose3d_backward_out_cpu(const Tensor& grad_output,
const Tensor& input,
const Tensor& weight,
IntArrayRef kernel_size,
IntArrayRef stride,
IntArrayRef padding,
IntArrayRef output_padding,
IntArrayRef dilation,
Tensor& grad_input,
Tensor& grad_weight,
Tensor& grad_bias) {
if (grad_input.defined()) {
slow_conv_transpose3d_backward_out_cpu_template(
input,
grad_output,
grad_input,
weight,
kernel_size,
stride,
padding,
output_padding,
dilation);
}
if (grad_weight.defined()) {
grad_weight.resize_(weight.sizes());
grad_weight.zero_();
}
if (grad_bias.defined()) {
grad_bias.resize_({weight.size(1)});
grad_bias.zero_();
}
if (grad_weight.defined() || grad_bias.defined()) {
slow_conv_transpose3d_acc_grad_parameters_cpu(
input,
grad_output,
grad_weight,
grad_bias,
kernel_size,
stride,
padding,
output_padding,
dilation,
1);
}
return std::tuple<Tensor&, Tensor&, Tensor&>(
grad_input, grad_weight, grad_bias);
}
static std::tuple<Tensor, Tensor, Tensor> slow_conv_transpose3d_backward_cpu(
const Tensor& grad_output,
const Tensor& input,

View File

@ -339,12 +339,6 @@ Tensor& gather_out(const Tensor& self, Dimname dim, const Tensor& index, bool sp
Tensor index_add(const Tensor& self, Dimname dim, const Tensor& index, const Tensor& source, const Scalar &alpha) {
reportNYIDimnameOverload("index_add");
}
static Tensor& index_add_(Tensor& self, Dimname dim, const Tensor& index, const Tensor& source, const Scalar &alpha) {
reportNYIDimnameOverload("index_add");
}
static Tensor& index_add_out(const Tensor& self, Dimname dim, const Tensor& index, const Tensor& source, const Scalar& alpha, Tensor& result) {
reportNYIDimnameOverload("index_add");
}
Tensor index_fill(const Tensor& self, Dimname dim, const Tensor& index, const Scalar& source) {
return at::index_fill(self, dimname_to_position(self, dim), index, source);
}
@ -372,21 +366,12 @@ Tensor index_select(const Tensor& self, Dimname dim, const Tensor& index) {
Tensor scatter(const Tensor& self, Dimname dim, const Tensor& index, const Tensor& source) {
reportNYIDimnameOverload("scatter");
}
static Tensor& scatter_(Tensor& self, Dimname dim, const Tensor& index, const Tensor& source) {
reportNYIDimnameOverload("scatter");
}
Tensor scatter(const Tensor& self, Dimname dim, const Tensor& index, const Scalar& source) {
reportNYIDimnameOverload("scatter");
}
static Tensor& scatter_(Tensor& self, Dimname dim, const Tensor& index, const Scalar& source) {
reportNYIDimnameOverload("scatter");
}
Tensor scatter_add(const Tensor& self, Dimname dim, const Tensor& index, const Tensor& source) {
reportNYIDimnameOverload("scatter_add");
}
static Tensor& scatter_add_(Tensor& self, Dimname dim, const Tensor& index, const Tensor& source) {
reportNYIDimnameOverload("scatter_add");
}
std::tuple<Tensor&, Tensor&> sort_out(const Tensor& self, std::optional<bool> stable, Dimname dim, bool keepdim, Tensor& values, Tensor& indices) {
reportNYIDimnameOverload("sort");
}

View File

@ -2276,11 +2276,6 @@ bool cpu_equal(const Tensor& self, const Tensor& other) {
return result.load();
}
static Tensor value_selecting_reduction_backward(const Tensor& grad, int64_t dim, const Tensor& indices, at::IntArrayRef sizes, bool keepdim) {
return at::native::value_selecting_reduction_backward_symint(grad, dim, indices, c10::fromIntArrayRefSlow(sizes), keepdim);
}
// max(dim), min(dim), topk(dim), mode(dim), are examples of reduction
// functions that select values. value_selecting_reduction_backward is the
// backward function for those operators; it propagates the grad to the

View File

@ -301,14 +301,6 @@ void reflection_pad2d_backward_out_template(
} // namespace
// TODO: I think this function should be removed since we implement it with
// TORCH_IMPL_FUNC below
static Tensor& reflection_pad1d_out_cpu(const Tensor& input, IntArrayRef padding,
Tensor& output) {
reflection_pad1d_kernel(kCPU, output, input, padding);
return output;
}
Tensor& reflection_pad1d_out_quantized_cpu(const Tensor& input, IntArrayRef padding,
Tensor& output) {
TORCH_CHECK(input.qscheme() == kPerTensorAffine, "Only per tensor quantization is supported");

View File

@ -231,14 +231,6 @@ TensorImpl* resize_impl_cpu_(
return _resize_impl_(self, size, stride, resize_storage);
}
static TensorImpl* resize_impl_meta_(
TensorImpl* self,
c10::SymIntArrayRef size,
at::OptionalSymIntArrayRef stride,
bool resize_storage = true) {
return _resize_impl_(self, size, stride, resize_storage);
}
template <typename T>
const Tensor& _resize_(
const Tensor& self,

View File

@ -792,12 +792,6 @@ std::tuple<Tensor, Tensor> max(const Tensor& self, Dimname dim, bool keepdim) {
std::tuple<Tensor&, Tensor&> max_out(const Tensor& self, Dimname dim, bool keepdim, Tensor& max, Tensor& max_indices) {
return at::max_out(max, max_indices, self, dimname_to_position(self, dim), keepdim);
}
static Tensor argmax(const Tensor& /*self*/, Dimname /*dim*/, bool /*keepdim*/) {
reportNYIDimnameOverload("argmax");
}
static Tensor argmin(const Tensor& /*self*/, Dimname /*dim*/, bool /*keepdim*/) {
reportNYIDimnameOverload("argmin");
}
Tensor argsort(const Tensor& /*self*/, Dimname /*dim*/, bool /*keepdim*/) {
reportNYIDimnameOverload("argsort");
}

View File

@ -24,10 +24,6 @@
namespace at::native {
static bool is_cuda(const Tensor& self) {
return self.is_cuda();
}
bool is_distributed(const Tensor& self) {
return false;
}
@ -60,18 +56,6 @@ bool is_neg(const Tensor& self) {
return self.is_neg();
}
static bool is_sparse(const Tensor& self) {
return self.is_sparse();
}
static bool is_sparse_csr(const Tensor& self) {
return self.is_sparse_csr();
}
static bool is_quantized(const Tensor& self) {
return self.is_quantized();
}
// True if `self` and `from` have compatible tensor type so that `from`'s
// TensorImpl can be copied to `self`.
bool _has_compatible_shallow_copy_type(const Tensor& self, const Tensor& from) {

View File

@ -38,6 +38,11 @@ float fp16_dot_with_fp32_arith(
const float16_t* x,
const float16_t* a,
int64_t len);
float bf16_dot_with_fp32_arith(
const at::BFloat16* x,
const at::BFloat16* a,
int64_t len);
}
#endif
@ -326,20 +331,8 @@ static float compute_dot(const at::Half* a, const at::Half* b, int64_t len) {
len);
}
static float compute_dot(const at::BFloat16* a, const at::BFloat16* b, int64_t l) {
if ((l&3) != 0) {
return sum(l, [&](int64_t i) -> float {
return float(a[i]) * float(b[i]);
});
}
float32x4_t rcv = vdupq_n_f32(0);
for (int64_t idx = 0; idx < l; idx += 4) {
float32x4_t aVec = load_as_float32x4(a + idx);
float32x4_t bVec = load_as_float32x4(b + idx);
rcv = vaddq_f32(rcv, vmulq_f32(aVec, bVec));
}
auto sum = vpaddq_f32(rcv, rcv);
return vgetq_lane_f32(vpaddq_f32(sum, sum), 0);
static float compute_dot(const at::BFloat16* a, const at::BFloat16* b, int64_t len) {
return at::native::blas_impl::bf16_dot_with_fp32_arith(a, b, len);
}
template <>
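
The hunk above replaces the hand-written NEON loop with a call to bf16_dot_with_fp32_arith, whose body is not shown in this diff. As a rough illustration of what "bf16 dot product with fp32 arithmetic" means, here is a scalar sketch that widens each bfloat16 value to float and accumulates in float. It assumes raw 16-bit bit patterns as input; both function names are hypothetical and this is not the vectorized ATen implementation:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper: widen one bfloat16 bit pattern to float.
// bfloat16 is the upper 16 bits of an IEEE-754 binary32 value.
inline float bf16_bits_to_float(std::uint16_t bits) {
  const std::uint32_t widened = static_cast<std::uint32_t>(bits) << 16;
  float out;
  std::memcpy(&out, &widened, sizeof(out));
  return out;
}

// Hypothetical scalar sketch: dot product of two bfloat16 vectors,
// with every multiply and add performed in fp32.
float bf16_dot_fp32_accum(
    const std::uint16_t* x, const std::uint16_t* a, std::int64_t len) {
  assert(len >= 0);
  float sum = 0.0f;
  for (std::int64_t i = 0; i < len; ++i) {
    sum += bf16_bits_to_float(x[i]) * bf16_bits_to_float(a[i]);
  }
  return sum;
}
```

Unlike the removed loop, this sketch has no special case for lengths that are not a multiple of 4, since it processes one element at a time.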

View File

@ -1,7 +1,3 @@
#include <cstdint>
#include <c10/util/Exception.h>
#include <c10/core/Scalar.h>
#include <c10/core/ScalarType.h>
#define TORCH_ASSERT_ONLY_METHOD_OPERATORS
#include <ATen/core/Tensor.h>
#include <ATen/core/NamedTensor.h>
@ -14,7 +10,6 @@
#include <ATen/cuda/tunable/TunableGemm.h>
#include <ATen/native/Resize.h>
#include <c10/util/MaybeOwned.h>
#include <ATen/native/cuda/RowwiseScaledMM.h>
#ifndef AT_PER_OPERATOR_HEADERS
#include <ATen/Functions.h>
@ -824,97 +819,24 @@ static bool _scaled_mm_allowed_device() {
#endif
}
namespace{
enum class ScalingType {
TensorWise,
RowWise,
Error
};
// Validates the scale tensors to scaled_mm
// And returns the type of scaling/which kernel to use
ScalingType get_scaling_type(
const c10::optional<at::Tensor>& scale_a,
const c10::optional<at::Tensor>& scale_b,
int64_t dim_m,
int64_t dim_n) {
TORCH_CHECK(
scale_a.has_value() == scale_b.has_value(),
"Both scale_a and scale_b must be present or absent.");
if (scale_a.has_value()) {
// Both Per-Tensor and Row-wise scaling expect fp32 tensors
TORCH_CHECK(
scale_a->scalar_type() == kFloat && scale_b->scalar_type() == kFloat,
"Both scale_a and scale_b must be float (fp32) tensors.");
// Check the singular scale case for per-tensor scaling
if (scale_a->numel() == 1 && scale_b->numel() == 1) {
return ScalingType::TensorWise;
} else if (scale_a->dim() == 1 && scale_a->size(0) == dim_m) {
// Check the per-row scaling case
#if !defined(USE_ROCM) && !defined(_MSC_VER) || \
(defined(USE_ROCM) && ROCM_VERSION >= 60000)
TORCH_CHECK(
scale_a->dim() == 1 && scale_b->dim() == 1,
"Both scale_a and scale_b must be 1-dimensional tensors");
TORCH_CHECK(
scale_b->size(0) == dim_n,
"For row-wise scaling, scale_b must have size ",
dim_n,
" but got ",
scale_b->size(0),
".");
TORCH_CHECK(
scale_a->is_contiguous() && scale_b->is_contiguous(),
"Both scale_a and scale_b must be contiguous.");
return ScalingType::RowWise;
#else
TORCH_CHECK(false, "Per-row scaling is not supported for this platform!");
return ScalingType::Error;
#endif // !defined(USE_ROCM) && !defined(_MSC_VER) || (defined(USE_ROCM) &&
// ROCM_VERSION >= 60000)
} else {
TORCH_CHECK(
false,
"For row-wise scaling, scale_a must be size ",
dim_m,
" but got ",
scale_a->numel(),
" and scale_b must be size ",
dim_n,
" but got ",
scale_b->numel(),
".");
// Unreachable
return ScalingType::RowWise;
}
}
return ScalingType::Error;
}
} // namespace
// Computes matrix multiply + bias while applying scaling to input and output matrices and computes amax
// Scales are only applicable when matrices are of Float8 type and assumed to be equal to 1.0 by default.
// If output matrix type is 16 or 32-bit type, neither scale_result is applied nor amax is computed.
// Known limitations:
// - Only works if mat1 is row-major and mat2 is column-major
// - Only works if matrices sizes are divisible by 32
// - If 1-dimensional tensors are used then scale_a should be size = mat1.size(0)
// and scale_b should have size = to mat2.size(1)
//
// Arguments:
// - `mat1`: the first operand of the matrix multiply, can be type `torch.float8_e4m3fn` or `torch.float8_e5m2`
// - `mat2`: the second operand of the matrix multiply, can be type `torch.float8_e4m3fn` or `torch.float8_e5m2`
// - `bias`: the bias, can be type `torch.float16` or `torch.bfloat16`
// - `out_dtype`: the output dtype, can either be a float8 or a higher precision floating point type
// - `scale_a`: a scalar or 1-dimensional tensor with the inverse scale of `mat1`, only needed if `mat1` is a float8 type
// - `scale_b`: a scalar or 1-dimensional tensor with the inverse scale of `mat2`, only needed if `mat2` is a float8 type
// - `scale_result`: a scalar tensor with the scale of the output, only utilized if the output is a float8 type
// - `scale_a`: a scalar tensor with the inverse scale of `mat1`, only needed if `mat1` is a float8 type
// - `scale_b`: a scalar tensor with the inverse scale of `mat2`, only needed if `mat2` is a float8 type
// - `scale_result`: a scalar tensor with the scale of the output, only set if the output is a float8 type
// - `use_fast_accum`: if true, enables fast float8 accumulation
// - `out`: a reference to the output tensor
// - `amax`: a reference to the amax tensor of the output, only mutated if the output is a float8 type and will be updated inplace
// - `amax`: a reference to the amax tensor of the output, only needed if the output is a float8 type and will be updated inplace
std::tuple<Tensor&, Tensor&>
_scaled_mm_out_cuda(const Tensor& mat1, const Tensor& mat2,
@ -933,11 +855,10 @@ _scaled_mm_out_cuda(const Tensor& mat1, const Tensor& mat2,
TORCH_CHECK(
mat1.sizes()[1] == mat2.sizes()[0], "mat1 and mat2 shapes cannot be multiplied (",
mat1.sizes()[0], "x", mat1.sizes()[1], " and ", mat2.sizes()[0], "x", mat2.sizes()[1], ")");
// Check what type of scaling we are doing based on inputs
ScalingType scaling_choice = get_scaling_type(scale_a, scale_b, mat1.size(0), mat2.size(1));
TORCH_INTERNAL_ASSERT(scaling_choice != ScalingType::Error, "Scaling type not supported");
TORCH_CHECK(!scale_a || (scale_a->numel() == 1 && scale_a->scalar_type() == kFloat),
"scale_a must be float scalar");
TORCH_CHECK(!scale_b || (scale_b->numel() == 1 && scale_b->scalar_type() == kFloat),
"scale_b must be a float scalar");
TORCH_CHECK(!scale_result || (scale_result->numel() == 1 && scale_result->scalar_type() == kFloat),
"scale_result must be a float scalar");
TORCH_CHECK(!bias || bias->numel() == mat2.sizes()[1], "Bias must be size ", mat2.sizes()[1],
@ -980,26 +901,12 @@ _scaled_mm_out_cuda(const Tensor& mat1, const Tensor& mat2,
{scale_result_, "scale_result", 7}};
checkAllSameGPU(__func__, targs);
}
// Validation checks have passed lets resize the output to actual size
IntArrayRef mat1_sizes = mat1.sizes();
IntArrayRef mat2_sizes = mat2.sizes();
at::native::resize_output(out, {mat1_sizes[0], mat2_sizes[1]});
at::native::resize_output(amax, {});
// We are doing row-wise scaling
if (scaling_choice == ScalingType::RowWise) {
TORCH_CHECK(out.dtype() == kBFloat16, "Only bf16 high precision output types are supported for row-wise scaling.");
at::cuda::detail::f8f8bf16_rowwise(
mat1,
mat2,
scale_a.value(),
scale_b.value(),
bias,
use_fast_accum,
out);
return {out, amax};
}
cublasCommonArgs args(mat1, mat2, out);
const auto out_dtype_ = args.result->scalar_type();
TORCH_CHECK(args.transa == 't' && args.transb == 'n', "Only multiplication of row-major and column-major matrices is supported by cuBLASLt");
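
The get_scaling_type helper that appears in this file's hunks picks between per-tensor and row-wise scaling purely from the shapes of the two scale tensors. The decision can be sketched without any ATen dependency, assuming the dtype checks have already passed. ScaleInfo and classify_scaling are hypothetical stand-ins, not ATen types, and the platform guards from the original are omitted:

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <vector>

enum class ScalingType { TensorWise, RowWise };

// Hypothetical stand-in for the shape metadata of a scale tensor.
struct ScaleInfo {
  std::vector<std::int64_t> sizes;
  std::int64_t numel() const {
    std::int64_t n = 1;  // an empty shape is a scalar with one element
    for (std::int64_t s : sizes) n *= s;
    return n;
  }
};

// Hypothetical sketch of the shape-based decision:
// one element each -> per-tensor; 1-D of length M and N -> row-wise.
ScalingType classify_scaling(
    const ScaleInfo& scale_a, const ScaleInfo& scale_b,
    std::int64_t dim_m, std::int64_t dim_n) {
  if (scale_a.numel() == 1 && scale_b.numel() == 1) {
    return ScalingType::TensorWise;
  }
  const bool a_ok = scale_a.sizes.size() == 1 && scale_a.sizes[0] == dim_m;
  const bool b_ok = scale_b.sizes.size() == 1 && scale_b.sizes[0] == dim_n;
  if (a_ok && b_ok) {
    return ScalingType::RowWise;
  }
  throw std::invalid_argument("scale shapes match neither scaling mode");
}
```

Checking the single-element case first means a 1x1 problem is treated as per-tensor scaling, which matches the order of the checks in the original helper.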

View File

@ -2,9 +2,9 @@
// Licensed under the BSD-3-Clause license
// This is the GPU implementation of the Connectionist Temporal Loss.
// We mostly follow Graves.
// 1. Graves et al: http://www.cs.toronto.edu/~graves/icml_2006.pdf
// 1. Graves et al.: http://www.cs.toronto.edu/~graves/icml_2006.pdf
// We use the equations from above link, but note that [1] has 1-based indexing and we (of course) use 0-based.
// Graves et al call the probabilities y, we use log_probs (also calling them inputs)
// Graves et al. call the probabilities y, we use log_probs (also calling them inputs)
// A few optimizations (similar to those here, but also some I didn't take) are described in
// 2. Minmin Sun: http://on-demand.gputechconf.com/gtc/2016/presentation/s6383-minmin-sun-speech-recognition.pdf
#define TORCH_ASSERT_ONLY_METHOD_OPERATORS

View File

@ -29,18 +29,10 @@ static inline void maybe_resize_storage_cuda(TensorImpl* self, size_t new_size_b
inline TensorImpl* resize_impl_cuda_(
TensorImpl* self,
IntArrayRef size,
at::OptionalIntArrayRef stride,
bool device_guard = true) {
at::OptionalIntArrayRef stride) {
if (self->sizes() == size && (!stride || self->strides() == stride)) {
return self;
}
// NB: We don't need to hold the device guard when calling from TH
cuda::OptionalCUDAGuard guard;
if (device_guard) {
guard.set_index(self->storage().device().index());
}
const auto itemsize = self->dtype().itemsize();
const auto storage_offset = self->storage_offset();
size_t storage_size = 1;

View File

@ -1,535 +0,0 @@
#define TORCH_ASSERT_ONLY_METHOD_OPERATORS
#include <ATen/Dispatch.h>
#include <ATen/core/Tensor.h>
#include <ATen/cuda/CUDAContext.h>
#include <ATen/cuda/nvrtc_stub/ATenNVRTC.h>
// Determine if the architecture supports rowwise scaled mm
#if !defined(USE_ROCM) && defined(CUDA_VERSION) && CUDA_VERSION >= 12000
#define BUILD_ROWWISE_FP8_KERNEL
#endif
#if defined(BUILD_ROWWISE_FP8_KERNEL)
// We are going to override the cuTensorMapEncodeTiled driver api with our lazy loader
static CUresult CUDAAPI nvrtc_cuTensorMapEncodeTiled(
CUtensorMap* tensorMap,
CUtensorMapDataType tensorDataType,
cuuint32_t tensorRank,
void* globalAddress,
const cuuint64_t* globalDim,
const cuuint64_t* globalStrides,
const cuuint32_t* boxDim,
const cuuint32_t* elementStrides,
CUtensorMapInterleave interleave,
CUtensorMapSwizzle swizzle,
CUtensorMapL2promotion l2Promotion,
CUtensorMapFloatOOBfill oobFill) {
return at::globalContext().getNVRTC().cuTensorMapEncodeTiled(
tensorMap,
tensorDataType,
tensorRank,
globalAddress,
globalDim,
globalStrides,
boxDim,
elementStrides,
interleave,
swizzle,
l2Promotion,
oobFill);
}
#include <cutlass/core_io.h>
#include <cutlass/cutlass.h>
#include <cutlass/gemm/device/gemm.h>
#include <cutlass/half.h>
#include <cutlass/numeric_types.h>
#include <cutlass/trace.h>
#include <cutlass/util/host_tensor.h>
// Rename the global function symbol
#define cuTensorMapEncodeTiled nvrtc_cuTensorMapEncodeTiled
#include <cute/tensor.hpp>
#undef cuTensorMapEncodeTiled
// Set everything back to normal
#include <cutlass/gemm/collective/collective_builder.hpp>
#include <cutlass/gemm/device/gemm_universal_adapter.h>
#include <cutlass/epilogue/collective/collective_builder.hpp>
#include <cute/atom/mma_atom.hpp>
#include <cutlass/gemm/dispatch_policy.hpp>
#include <cutlass/gemm/kernel/gemm_universal.hpp>
#include <cutlass/util/packed_stride.hpp>
namespace {
// Cutlass rowwise kernel
template <
int TB_M,
int TB_N,
int TB_K,
int TBS_M,
int TBS_N,
int TBS_K,
bool PONG,
bool FAST_ACCUM,
bool USE_BIAS,
typename INPUT_DTYPE,
typename BIAS_DTYPE>
void f8f8bf16_rowwise_impl(
at::Tensor XQ, // FP8
at::Tensor WQ, // FP8
at::Tensor x_scale,
at::Tensor w_scale,
c10::optional<at::Tensor> bias,
at::Tensor out) {
int M = XQ.size(0);
int N = WQ.size(1);
int K = XQ.size(1);
TORCH_CHECK(XQ.is_cuda() && XQ.is_contiguous());
TORCH_CHECK(
WQ.is_cuda() && WQ.ndimension() == 2 && WQ.stride(1) == WQ.size(0) &&
WQ.stride(0) == 1);
// auto Y = at::empty({M, N}, XQ.options().dtype(at::kBFloat16));
using ElementInputA = INPUT_DTYPE;
using LayoutInputA = cutlass::layout::RowMajor;
constexpr int AlignmentInputA = 16 / sizeof(ElementInputA);
using ElementInputB = cutlass::float_e4m3_t;
using LayoutInputB = cutlass::layout::ColumnMajor;
constexpr int AlignmentInputB = 16 / sizeof(ElementInputB);
using ElementBias = BIAS_DTYPE;
using ElementOutput = cutlass::bfloat16_t;
using LayoutOutput = cutlass::layout::RowMajor;
constexpr int AlignmentOutput = 16 / sizeof(ElementOutput);
using ElementAccumulator = float;
using ElementComputeEpilogue = float;
using ArchTag = cutlass::arch::Sm90; // Tag indicating the minimum SM that
// supports the intended feature
using OperatorClass = cutlass::arch::OpClassTensorOp;
using TileShape = cute::Shape<
cute::Int<TB_M>,
cute::Int<TB_N>,
cute::Int<TB_K>>; // Threadblock-level
// tile size
using ClusterShape = cute::Shape<
cute::Int<TBS_M>,
cute::Int<TBS_N>,
cute::Int<TBS_K>>; // Shape of the
// threadblocks in a
// cluster
using KernelSchedule = cutlass::gemm::collective::
KernelScheduleAuto; // Kernel to launch based on the default setting in
// the Collective Builder
// Implement rowwise scaling epilogue.
using XScale = cutlass::epilogue::fusion::Sm90ColBroadcast<
0,
TileShape,
ElementComputeEpilogue,
cute::Stride<cute::Int<1>, cute::Int<0>, cute::Int<0>>>;
using WScale = cutlass::epilogue::fusion::Sm90RowBroadcast<
PONG ? 2 : 1,
TileShape,
ElementComputeEpilogue,
cute::Stride<cute::Int<0>, cute::Int<1>, cute::Int<0>>>;
using Bias = cutlass::epilogue::fusion::Sm90RowBroadcast<
PONG ? 2 : 1,
TileShape,
ElementBias,
cute::Stride<cute::Int<0>, cute::Int<1>, cute::Int<0>>>;
using Accum = cutlass::epilogue::fusion::Sm90AccFetch;
using Compute0 = cutlass::epilogue::fusion::Sm90Compute<
cutlass::multiplies,
ElementComputeEpilogue, // First stage output type.
ElementComputeEpilogue, // First stage input types.
cutlass::FloatRoundStyle::round_to_nearest>;
using EVTCompute0 =
cutlass::epilogue::fusion::Sm90EVT<Compute0, WScale, Accum>;
using Compute1 = cutlass::epilogue::fusion::Sm90Compute<
cutlass::multiplies,
cute::conditional_t< // Second stage output type.
USE_BIAS,
ElementBias,
ElementOutput>,
ElementComputeEpilogue, // Second stage input types.
cutlass::FloatRoundStyle::round_to_nearest>;
using EVTCompute1 =
cutlass::epilogue::fusion::Sm90EVT<Compute1, XScale, EVTCompute0>;
using ComputeBias = cutlass::epilogue::fusion::Sm90Compute<
cutlass::plus,
ElementOutput, // Final (optional) stage output type.
ElementBias, // Final stage input types.
cutlass::FloatRoundStyle::round_to_nearest>;
using EVTComputeBias =
cutlass::epilogue::fusion::Sm90EVT<ComputeBias, Bias, EVTCompute1>;
using EpilogueEVT =
cute::conditional_t<USE_BIAS, EVTComputeBias, EVTCompute1>;
using CollectiveEpilogue =
typename cutlass::epilogue::collective::CollectiveBuilder<
cutlass::arch::Sm90,
cutlass::arch::OpClassTensorOp,
TileShape,
ClusterShape,
cutlass::epilogue::collective::EpilogueTileAuto,
ElementAccumulator,
ElementComputeEpilogue,
ElementOutput,
LayoutOutput,
AlignmentOutput,
ElementOutput,
LayoutOutput,
AlignmentOutput,
cutlass::epilogue::TmaWarpSpecialized,
EpilogueEVT>::CollectiveOp;
using DefaultSchedule = cutlass::gemm::KernelTmaWarpSpecialized;
using PongSchedule = cutlass::gemm::KernelTmaWarpSpecializedPingpong;
using FastDefaultSchedule =
cutlass::gemm::KernelTmaWarpSpecializedFP8FastAccum;
using FastPongSchedule =
cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum;
using SlowAccum = cute::conditional_t<PONG, PongSchedule, DefaultSchedule>;
using FastAccum =
cute::conditional_t<PONG, FastPongSchedule, FastDefaultSchedule>;
using MainLoopSchedule =
cute::conditional_t<FAST_ACCUM, FastAccum, SlowAccum>;
using CollectiveMainloop =
typename cutlass::gemm::collective::CollectiveBuilder<
ArchTag,
OperatorClass,
ElementInputA,
LayoutInputA,
AlignmentInputA,
ElementInputB,
LayoutInputB,
AlignmentInputB,
ElementAccumulator,
TileShape,
ClusterShape,
cutlass::gemm::collective::StageCountAutoCarveout<static_cast<int>(
sizeof(typename CollectiveEpilogue::SharedStorage))>,
MainLoopSchedule>::CollectiveOp;
using GemmKernel = cutlass::gemm::kernel::GemmUniversal<
cute::Shape<int, int, int>,
CollectiveMainloop,
CollectiveEpilogue>;
using Gemm = cutlass::gemm::device::GemmUniversalAdapter<GemmKernel>;
using StrideInputA = typename Gemm::GemmKernel::StrideA;
using StrideInputB = typename Gemm::GemmKernel::StrideB;
using StrideOutput = typename Gemm::GemmKernel::StrideC;
StrideInputA stride_a = cutlass::make_cute_packed_stride(
StrideInputA{}, cute::make_shape(M, K, 1));
StrideInputB stride_b = cutlass::make_cute_packed_stride(
StrideInputB{}, cute::make_shape(N, K, 1));
StrideOutput stride_output = cutlass::make_cute_packed_stride(
StrideOutput{}, cute::make_shape(M, N, 1));
typename Gemm::Arguments arguments{
cutlass::gemm::GemmUniversalMode::kGemm,
{M, N, K},
{reinterpret_cast<ElementInputA*>(XQ.data_ptr()),
stride_a,
reinterpret_cast<ElementInputB*>(WQ.data_ptr()),
stride_b},
{{}, // Epilogue thread we populate below.
(ElementOutput*)out.data_ptr<at::BFloat16>(),
stride_output,
(ElementOutput*)out.data_ptr<at::BFloat16>(),
stride_output}};
if constexpr (USE_BIAS) {
arguments.epilogue.thread = {
{reinterpret_cast<ElementBias*>(bias.value().data_ptr())}, // bias
// compute_1
{
{reinterpret_cast<ElementComputeEpilogue*>(
x_scale.data_ptr())}, // x_scale
// compute_0
{
{reinterpret_cast<ElementComputeEpilogue*>(
w_scale.data_ptr())}, // w_scale
{}, // Accumulator
{} // Multiplies
},
{}, // Multiplies
},
{}, // Plus
};
} else {
arguments.epilogue.thread = {
{reinterpret_cast<ElementComputeEpilogue*>(
x_scale.data_ptr())}, // x_scale
// compute_0
{
{reinterpret_cast<ElementComputeEpilogue*>(
w_scale.data_ptr())}, // w_scale
{}, // Accumulator
{} // Multiplies
},
{}, // Multiplies
};
}
Gemm gemm;
// Using the arguments, query for extra workspace required for matrix
// multiplication computation
size_t workspace_size = Gemm::get_workspace_size(arguments);
// Allocate workspace memory
cutlass::device_memory::allocation<uint8_t> workspace(workspace_size);
// Check the problem size is supported or not
cutlass::Status status = gemm.can_implement(arguments);
if (status != cutlass::Status::kSuccess) {
throw std::runtime_error("cutlass cannot implement");
}
// Initialize CUTLASS kernel with arguments and workspace pointer
status = gemm.initialize(arguments, workspace.get());
if (status != cutlass::Status::kSuccess) {
throw std::runtime_error("cutlass cannot initialize");
}
status = gemm(at::cuda::getCurrentCUDAStream());
if (status != cutlass::Status::kSuccess) {
throw std::runtime_error(
std::string("cutlass cannot run") +
cutlass::cutlassGetStatusString(status));
}
C10_CUDA_KERNEL_LAUNCH_CHECK();
}
// FP8 Rowwise Cutlass kernel dispatch.
enum class KernelMode { Small, Large, Default };
KernelMode get_kernel_mode(at::Tensor XQ, at::Tensor WQ) {
auto M = XQ.size(0);
auto K = XQ.size(1);
auto N = WQ.size(0);
// Use a large kernel if at least two shapes are large....
bool use_large_kernel =
((M >= 2048 && K >= 2048) || (M >= 2048 && N >= 2048) ||
(K >= 2048 && N >= 2048));
if (M <= 128 || N <= 128) {
return KernelMode::Small;
} else if (use_large_kernel) {
return KernelMode::Large;
} else {
return KernelMode::Default;
}
}
template <typename InputDType, bool FastAccum, bool UseBias, typename BiasDType>
void dispatch_fp8_rowwise_kernel(
at::Tensor XQ,
at::Tensor WQ,
at::Tensor x_scale,
at::Tensor w_scale,
c10::optional<at::Tensor> bias,
at::Tensor out) {
KernelMode kernel = get_kernel_mode(XQ, WQ);
if (kernel == KernelMode::Small) {
return f8f8bf16_rowwise_impl<
64,
128,
128,
2,
1,
1,
false,
FastAccum,
UseBias,
InputDType,
BiasDType>(XQ, WQ, x_scale, w_scale, bias, out);
} else if (kernel == KernelMode::Large) {
return f8f8bf16_rowwise_impl<
128,
128,
128,
2,
1,
1,
true,
FastAccum,
UseBias,
InputDType,
BiasDType>(XQ, WQ, x_scale, w_scale, bias, out);
} else {
return f8f8bf16_rowwise_impl<
128,
128,
128,
1,
2,
1,
false,
FastAccum,
UseBias,
InputDType,
BiasDType>(XQ, WQ, x_scale, w_scale, bias, out);
}
}
} // namespace
#endif // !defined(USE_ROCM)
namespace at::cuda::detail {
void f8f8bf16_rowwise(
at::Tensor XQ, // FP8
at::Tensor WQ, // FP8
at::Tensor x_scale, // FP32
at::Tensor w_scale, // FP32
c10::optional<at::Tensor> bias, // BF16
bool use_fast_accum,
at::Tensor& out) {
#if defined(BUILD_ROWWISE_FP8_KERNEL)
// Check datatypes.
TORCH_CHECK(
x_scale.dtype() == at::kFloat && w_scale.dtype() == at::kFloat,
"Scale tensors must be float32.");
if (bias.has_value()) {
TORCH_CHECK(
bias.value().dtype() == at::kFloat ||
bias.value().dtype() == at::kBFloat16,
"Bias type must be bfloat16 or float32 if provided.");
}
// Extract problem size.
int M = XQ.size(0);
int N = WQ.size(1);
int K = XQ.size(1);
bool use_bias = bias.has_value();
bool bf16_bias = use_bias && bias.value().dtype() == at::kBFloat16;
// Templatize based on input dtype.
bool use_e5m2 = XQ.dtype() == at::kFloat8_e5m2;
TORCH_CHECK(WQ.dtype() == at::kFloat8_e4m3fn, "For row-wise scaling the second input is required to be a float8_e4m3fn dtype.");
if (use_bias) {
if (bf16_bias) {
if (use_fast_accum) {
if (use_e5m2) {
return dispatch_fp8_rowwise_kernel<
cutlass::float_e5m2_t,
true,
true,
cutlass::bfloat16_t>(XQ, WQ, x_scale, w_scale, bias, out);
} else {
return dispatch_fp8_rowwise_kernel<
cutlass::float_e4m3_t,
true,
true,
cutlass::bfloat16_t>(XQ, WQ, x_scale, w_scale, bias, out);
}
} else {
if (use_e5m2) {
return dispatch_fp8_rowwise_kernel<
cutlass::float_e5m2_t,
false,
true,
cutlass::bfloat16_t>(XQ, WQ, x_scale, w_scale, bias, out);
} else {
return dispatch_fp8_rowwise_kernel<
cutlass::float_e4m3_t,
false,
true,
cutlass::bfloat16_t>(XQ, WQ, x_scale, w_scale, bias, out);
}
}
} else {
if (use_fast_accum) {
if (use_e5m2) {
return dispatch_fp8_rowwise_kernel<
cutlass::float_e5m2_t,
true,
true,
float>(XQ, WQ, x_scale, w_scale, bias, out);
} else {
return dispatch_fp8_rowwise_kernel<
cutlass::float_e4m3_t,
true,
true,
float>(XQ, WQ, x_scale, w_scale, bias, out);
}
} else {
if (use_e5m2) {
return dispatch_fp8_rowwise_kernel<
cutlass::float_e5m2_t,
false,
true,
float>(XQ, WQ, x_scale, w_scale, bias, out);
} else {
return dispatch_fp8_rowwise_kernel<
cutlass::float_e4m3_t,
false,
true,
float>(XQ, WQ, x_scale, w_scale, bias, out);
}
}
}
} else {
if (use_fast_accum) {
if (use_e5m2) {
return dispatch_fp8_rowwise_kernel<
cutlass::float_e5m2_t,
true,
false,
float>(XQ, WQ, x_scale, w_scale, bias, out);
} else {
return dispatch_fp8_rowwise_kernel<
cutlass::float_e4m3_t,
true,
false,
float>(XQ, WQ, x_scale, w_scale, bias, out);
}
} else {
if (use_e5m2) {
return dispatch_fp8_rowwise_kernel<
cutlass::float_e5m2_t,
false,
false,
float>(XQ, WQ, x_scale, w_scale, bias, out);
} else {
return dispatch_fp8_rowwise_kernel<
cutlass::float_e4m3_t,
false,
false,
float>(XQ, WQ, x_scale, w_scale, bias, out);
}
}
}
#else // BUILD_ROWWISE_FP8_KERNEL
TORCH_CHECK(false, "Rowwise scaling is not currently supported on your device");
#endif
}
} // namespace at::cuda::detail
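
The nested `if` ladder in `f8f8bf16_rowwise` maps runtime flags (`use_bias`, `bf16_bias`, `use_fast_accum`, `use_e5m2`) onto the template arguments of `dispatch_fp8_rowwise_kernel`. As a minimal sketch only, the same mapping can be written with a small helper that lifts a runtime `bool` into a compile-time constant. The names `dispatch_bool`, `kernel_name`, and `run` are hypothetical and introduced purely for illustration; they are not part of PyTorch or CUTLASS, and the sketch assumes C++17.

```cpp
#include <cassert>
#include <string>
#include <type_traits>
#include <utility>

// Hypothetical stand-in for a kernel templated on compile-time flags.
template <bool FastAccum, bool UseBias>
std::string kernel_name() {
  return std::string(FastAccum ? "fast" : "slow") + (UseBias ? "+bias" : "");
}

// Hypothetical helper: turns a runtime bool into std::true_type or
// std::false_type and forwards it to the callable. Both branches must
// return the same type, which holds for the generic lambdas used below.
template <typename F>
auto dispatch_bool(bool value, F&& f) {
  if (value) {
    return std::forward<F>(f)(std::true_type{});
  }
  return std::forward<F>(f)(std::false_type{});
}

// Selects one of the four kernel_name instantiations from two runtime flags.
std::string run(bool use_fast_accum, bool use_bias) {
  return dispatch_bool(use_fast_accum, [&](auto fast) {
    return dispatch_bool(use_bias, [&](auto bias) {
      return kernel_name<decltype(fast)::value, decltype(bias)::value>();
    });
  });
}
```

Each extra flag adds one nesting level instead of doubling the number of hand-written branches. Whether this is preferable to the explicit ladder in the source is a readability judgment; every combination is still instantiated at compile time either way.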


@ -1,15 +0,0 @@
#pragma once
#include <ATen/core/TensorBase.h>
#include <c10/util/Optional.h>
namespace at::cuda::detail {
TORCH_API void f8f8bf16_rowwise(
at::Tensor XQ, // FP8
at::Tensor WQ, // FP8
at::Tensor x_scale, // FP32
at::Tensor w_scale, // FP32
c10::optional<at::Tensor> bias, // BF16
bool use_fast_accum,
at::Tensor& out);
} // at::cuda::detail


@ -40,6 +40,8 @@
#include <magma_v2.h>
#include <ATen/cuda/detail/CUDAHooks.h>
const bool use_magma_ = true;
namespace {
struct MagmaInitializer {
MagmaInitializer() {
@ -59,6 +61,9 @@ struct MagmaInitializer {
#error "MAGMA release minor or micro version >= 10, please correct AT_MAGMA_VERSION"
#endif
#else
const bool use_magma_ = false;
#endif
namespace at::native {
@ -79,9 +84,9 @@ void magmaLdlHermitian(
magma_int_t ldda,
magma_int_t* ipiv,
magma_int_t* info) {
static_assert(
false&&sizeof(scalar_t),
"LDL decomposition is not available."
TORCH_CHECK(
false,
"LDL decomposition is not available.",
"Please rebuild with MAGMA 2.5.4+.");
}
@ -1029,13 +1034,18 @@ magma_trans_t to_magma(TransposeType trans) {
namespace {
#if AT_MAGMA_ENABLED()
template <typename scalar_t>
void apply_ldl_factor_magma(
const Tensor& A,
const Tensor& pivots,
const Tensor& info,
bool upper) {
#if !AT_MAGMA_ENABLED()
TORCH_CHECK(
false,
"torch.linalg.ldl_factor: MAGMA library not found in "
"compilation. Please rebuild with MAGMA.");
#else
auto batch_size = batchCount(A);
magma_int_t n = magma_int_cast(A.size(-2), "A.size(-2)");
magma_int_t leading_dim = magma_int_cast(A.stride(-1), "A.stride(-1)");
@ -1066,6 +1076,7 @@ void apply_ldl_factor_magma(
}
pivots.copy_(pivots_cpu);
info.copy_(info_cpu);
#endif
}
void ldl_factor_magma(
@ -1087,7 +1098,6 @@ void ldl_factor_magma(
apply_ldl_factor_magma<scalar_t>(LD, pivots, info, upper);
});
}
#endif
void ldl_factor_kernel(
const Tensor& LD,
@ -1100,10 +1110,8 @@ void ldl_factor_kernel(
case at::LinalgBackend::Cusolver:
return ldl_factor_cusolver(
LD, pivots, info, upper, hermitian);
#if AT_MAGMA_ENABLED()
case at::LinalgBackend::Magma:
return ldl_factor_magma(LD, pivots, info, upper, hermitian);
#endif
default:
// By default use cusolver if available and magma otherwise.
// If cusolver and magma 2.5.4+ are both available and hermitian=true,
@ -1147,9 +1155,12 @@ REGISTER_CUDA_DISPATCH(ldl_solve_stub, &ldl_solve_kernel)
// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ cholesky_solve ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#if AT_MAGMA_ENABLED()
template <typename scalar_t>
static void apply_cholesky_solve(Tensor& b, Tensor& A, bool upper, int64_t& info) {
#if !AT_MAGMA_ENABLED()
AT_ERROR("cholesky_solve: MAGMA library not found in "
"compilation. Please rebuild with MAGMA.");
#else
magma_uplo_t uplo = upper ? MagmaUpper : MagmaLower;
auto A_data = A.data_ptr<scalar_t>();
@ -1168,8 +1179,8 @@ static void apply_cholesky_solve(Tensor& b, Tensor& A, bool upper, int64_t& info
auto b_mat_stride = matrixStride(b);
magma_int_t batch_size = magma_int_cast(batchCount(A), "batchCount");
scalar_t** A_array = nullptr;
scalar_t** b_array = nullptr;
scalar_t** A_array;
scalar_t** b_array;
ALLOCATE_ARRAY(A_array, scalar_t*, batch_size);
ALLOCATE_ARRAY(b_array, scalar_t*, batch_size);
@ -1186,7 +1197,7 @@ static void apply_cholesky_solve(Tensor& b, Tensor& A, bool upper, int64_t& info
// Compute as many batches of 65535 possible
// The number of "mini"-batches are floor(batch_size / batch_limit)
// and these cover floor(batch_size / batch_limit) * batch_limit matrix solves
int64_t mini_batches = batch_size / batch_limit, mini_idx = 0;
int64_t mini_batches = batch_size / batch_limit, mini_idx;
for (mini_idx = 0; mini_idx < mini_batches * batch_limit; mini_idx += batch_limit) {
scalar_t** A_array_cur = &A_array[mini_idx];
scalar_t** b_array_cur = &b_array[mini_idx];
@ -1210,6 +1221,7 @@ static void apply_cholesky_solve(Tensor& b, Tensor& A, bool upper, int64_t& info
info = info_tmp;
}
#endif
}
Tensor _cholesky_solve_helper_cuda_magma(const Tensor& self, const Tensor& A, bool upper) {
@ -1222,7 +1234,6 @@ Tensor _cholesky_solve_helper_cuda_magma(const Tensor& self, const Tensor& A, bo
TORCH_CHECK(info == 0, "MAGMA cholesky_solve : invalid argument: ", -info);
return self_working_copy;
}
#endif
// Todo: cusolverDn<T>potrsBatched only supports nrhs == 1 and does not have good performance.
// Batched cholesky_solve is dispatched to magma.
@ -1232,20 +1243,14 @@ Tensor _cholesky_solve_helper_cuda(const Tensor& self, const Tensor& A, bool upp
switch (preferred_backend) {
case at::LinalgBackend::Cusolver:
return _cholesky_solve_helper_cuda_cusolver(self, A, upper);
#if AT_MAGMA_ENABLED()
case at::LinalgBackend::Magma:
return _cholesky_solve_helper_cuda_magma(self, A, upper);
#endif
default:
#if !AT_MAGMA_ENABLED()
return _cholesky_solve_helper_cuda_cusolver(self, A, upper);
#else
if (batchCount(self) == 1) {
if (batchCount(self) == 1 || !use_magma_) {
return _cholesky_solve_helper_cuda_cusolver(self, A, upper);
} else {
return _cholesky_solve_helper_cuda_magma(self, A, upper);
}
#endif
}
#else
return _cholesky_solve_helper_cuda_magma(self, A, upper);
@ -1254,9 +1259,14 @@ Tensor _cholesky_solve_helper_cuda(const Tensor& self, const Tensor& A, bool upp
// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ cholesky ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#if AT_MAGMA_ENABLED()
template <typename scalar_t>
static void apply_cholesky(const Tensor& self, bool upper, const Tensor& info) {
#if !AT_MAGMA_ENABLED()
TORCH_CHECK(
false,
"Calling torch.linalg.cholesky on a CUDA tensor requires compiling ",
"PyTorch with MAGMA. Please use PyTorch built with MAGMA support.");
#else
magma_uplo_t uplo = upper ? MagmaUpper : MagmaLower;
auto self_data = self.data_ptr<scalar_t>();
@ -1278,7 +1288,7 @@ static void apply_cholesky(const Tensor& self, bool upper, const Tensor& info) {
auto self_mat_stride = matrixStride(self);
magma_int_t batch_size = magma_int_cast(batchCount(self), "batchCount");
scalar_t** self_array = nullptr;
scalar_t** self_array;
ALLOCATE_ARRAY(self_array, scalar_t*, batch_size);
@ -1304,6 +1314,7 @@ static void apply_cholesky(const Tensor& self, bool upper, const Tensor& info) {
uplo, n, self_array_cur, lda, info_array_cur, nbatches, magma_queue);
}
}
#endif
}
void cholesky_helper_magma(const Tensor& input, bool upper, const Tensor& info) {
@ -1339,7 +1350,6 @@ void cholesky_helper_magma(const Tensor& input, bool upper, const Tensor& info)
}
}
}
#endif
static void cholesky_kernel(const Tensor& input, const Tensor& info, bool upper) {
#if defined(USE_LINALG_SOLVER) && !defined(USE_ROCM)
@ -1348,21 +1358,15 @@ static void cholesky_kernel(const Tensor& input, const Tensor& info, bool upper)
case at::LinalgBackend::Cusolver:
cholesky_helper_cusolver(input, upper, info);
break;
#if AT_MAGMA_ENABLED()
case at::LinalgBackend::Magma:
cholesky_helper_magma(input, upper, info);
break;
#endif
default:
#if !AT_MAGMA_ENABLED()
cholesky_helper_cusolver(input, upper, info);
#else
if (batchCount(input) == 1 || use_cusolver_potrf_batched_) {
if (batchCount(input) == 1 || !use_magma_ || use_cusolver_potrf_batched_) {
cholesky_helper_cusolver(input, upper, info);
} else {
cholesky_helper_magma(input, upper, info);
}
#endif
}
#else
cholesky_helper_magma(input, upper, info);
@ -1380,9 +1384,11 @@ This is an in-place routine, content of 'input' is overwritten.
MAGMA requires 'infos' to reside in CPU memory.
For more information see MAGMA's documentation for POTRS routine.
*/
#if AT_MAGMA_ENABLED()
template <typename scalar_t>
static void apply_cholesky_inverse(Tensor& input, Tensor& infos, bool upper) {
#if !AT_MAGMA_ENABLED()
TORCH_CHECK(false, "cholesky_inverse: MAGMA library not found in compilation. Please rebuild with MAGMA.");
#else
// magmaCholeskyInverse (magma_dpotri_gpu) is slow because internally
// it transfers data several times between GPU and CPU and calls lapack routine on CPU
// using magmaCholeskySolveBatched is a lot faster
@ -1412,6 +1418,7 @@ static void apply_cholesky_inverse(Tensor& input, Tensor& infos, bool upper) {
int64_t info_tmp = 0;
apply_cholesky_solve<scalar_t>(result_u, input_u, upper, info_tmp);
infos.fill_(info_tmp);
#endif
}
// This is a type dispatching helper function for 'apply_cholesky_inverse'
@ -1421,7 +1428,6 @@ Tensor& cholesky_inverse_kernel_impl_magma(Tensor &result, Tensor& infos, bool u
});
return result;
}
#endif
Tensor& cholesky_inverse_kernel_impl(Tensor &result, Tensor& infos, bool upper) {
// This function calculates the inverse matrix in-place
@ -1432,25 +1438,20 @@ Tensor& cholesky_inverse_kernel_impl(Tensor &result, Tensor& infos, bool upper)
switch (preferred_backend) {
case at::LinalgBackend::Cusolver:
return cholesky_inverse_kernel_impl_cusolver(result, infos, upper);
#if AT_MAGMA_ENABLED()
case at::LinalgBackend::Magma:
return cholesky_inverse_kernel_impl_magma(result, infos, upper);
#endif
default:
#if !AT_MAGMA_ENABLED()
return cholesky_inverse_kernel_impl_cusolver(result, infos, upper);
#else
if (batchCount(result) == 1) {
if (batchCount(result) == 1 ||
!use_magma_) {
return cholesky_inverse_kernel_impl_cusolver(result, infos, upper);
} else {
return cholesky_inverse_kernel_impl_magma(result, infos, upper);
}
#endif
}
#else
return cholesky_inverse_kernel_impl_magma(result, infos, upper);
#endif
}
REGISTER_CUDA_DISPATCH(cholesky_inverse_stub, &cholesky_inverse_kernel_impl);
@ -1525,9 +1526,14 @@ static void apply_lu_factor_looped_magma(const Tensor& input, const Tensor& pivo
For further details, please see the MAGMA documentation for magma_dgetrf_batched.
*/
#if AT_MAGMA_ENABLED()
template <typename scalar_t>
static void apply_lu_factor_batched_magma(const Tensor& input, const Tensor& pivots, const Tensor& infos, bool compute_pivots) {
#if !AT_MAGMA_ENABLED()
TORCH_CHECK(
false,
"Calling linalg.lu_factor on a CUDA tensor requires compiling ",
"PyTorch with MAGMA. Please rebuild with MAGMA.");
#else
// There is a bug in lu_factor_batched_magma in MAGMA < 2.5.2, see
// https://bitbucket.org/icl/magma/issues/13/getrf_batched-kernel-produces-nans-on
std::tuple<magma_int_t, magma_int_t, magma_int_t> version;
@ -1544,7 +1550,7 @@ static void apply_lu_factor_batched_magma(const Tensor& input, const Tensor& piv
magma_int_t n = magma_int_cast(input.size(-1), "n");
auto leading_dimension = std::max<magma_int_t>(1, m);
scalar_t** input_array = nullptr;
scalar_t** input_array;
ALLOCATE_ARRAY(input_array, scalar_t*, batch_size);
// Set up array of pointers to matrices
@ -1564,7 +1570,7 @@ static void apply_lu_factor_batched_magma(const Tensor& input, const Tensor& piv
// magmaLuBatched might not set the values for it
// see https://github.com/pytorch/pytorch/pull/53064
pivots.fill_(1);
magma_int_t** pivots_array = nullptr;
magma_int_t** pivots_array;
ALLOCATE_ARRAY(pivots_array, magma_int_t*, batch_size);
for (int64_t i = 0; i < batch_size; i++) {
pivots_array[i] = &pivots_data[i * pivots_stride];
@ -1577,6 +1583,7 @@ static void apply_lu_factor_batched_magma(const Tensor& input, const Tensor& piv
// block CPU until all operations on the queue are finished
// this explicit sync prevents garbage results from the subsequent magmaLuSolveBatched call from a different queue
magma_queue_sync(magma_queue.get_queue());
#endif
}
static void lu_factor_looped_magma(const Tensor& input, const Tensor& pivots, const Tensor& infos, bool compute_pivots) {
@ -1590,7 +1597,6 @@ static void lu_factor_batched_magma(const Tensor& input, const Tensor& pivots, c
apply_lu_factor_batched_magma<scalar_t>(input, pivots, infos, compute_pivots);
});
}
#endif
static void lu_factor(const Tensor& input, const Tensor& pivots, const Tensor& infos, bool compute_pivots) {
auto batch_size = batchCount(input);
@ -1598,7 +1604,6 @@ static void lu_factor(const Tensor& input, const Tensor& pivots, const Tensor& i
auto m = input.size(-2);
auto n = input.size(-1);
#if AT_MAGMA_ENABLED()
const auto lu_factor_magma = [batch_size](const Tensor& input, const Tensor& pivots, const Tensor& infos, const bool compute_pivots) {
if (batch_size == 1) {
lu_factor_looped_magma(input, pivots, infos, compute_pivots);
@ -1606,7 +1611,6 @@ static void lu_factor(const Tensor& input, const Tensor& pivots, const Tensor& i
lu_factor_batched_magma(input, pivots, infos, compute_pivots);
}
};
#endif
const auto preferred_backend = at::globalContext().linalgPreferredBackend();
#ifdef USE_LINALG_SOLVER
@ -1631,12 +1635,9 @@ static void lu_factor(const Tensor& input, const Tensor& pivots, const Tensor& i
lu_factor_cusolver(input, pivots, infos, compute_pivots);
} else
#endif // ifdef USE_LINALG_SOLVER
#if AT_MAGMA_ENABLED()
if (preferred_backend == at::LinalgBackend::Magma) {
lu_factor_magma(input, pivots, infos, compute_pivots);
} else
#endif
{ // preferred backend == default
} else { // preferred backend == default
#ifdef USE_LINALG_SOLVER
#if AT_MAGMA_ENABLED()
// If magma batched is buggy, we use cusolver
@ -1700,8 +1701,8 @@ AT_ERROR("triangular_solve: MAGMA library not found in "
auto A_mat_stride = matrixStride(A);
auto b_mat_stride = matrixStride(b);
scalar_t** A_array = nullptr;
scalar_t** b_array = nullptr;
scalar_t** A_array;
scalar_t** b_array;
ALLOCATE_ARRAY(A_array, scalar_t*, batch_size);
ALLOCATE_ARRAY(b_array, scalar_t*, batch_size);
@ -1719,7 +1720,7 @@ AT_ERROR("triangular_solve: MAGMA library not found in "
// The number of "mini"-batches are floor(batch_size / batch_limit)
// and these cover floor(batch_size / batch_limit) * batch_limit matrix solves
int64_t mini_batches = batch_size / batch_limit;
int64_t mini_idx = 0; // this is outside the loop because it is used for the case batch_size % batch_limit != 0
int64_t mini_idx; // this is outside the loop because it is used for the case batch_size % batch_limit != 0
for (mini_idx = 0; mini_idx < mini_batches * batch_limit; mini_idx += batch_limit) {
scalar_t** A_array_cur = &A_array[mini_idx];
scalar_t** b_array_cur = &b_array[mini_idx];
@ -1776,7 +1777,7 @@ Tensor& orgqr_kernel_impl(Tensor& result, const Tensor& tau) {
#ifdef USE_LINALG_SOLVER
return orgqr_helper_cusolver(result, tau); // cusolver
#else
static_assert(false, "Calling torch.orgqr on a CUDA tensor requires compiling ",
TORCH_CHECK(false, "Calling torch.orgqr on a CUDA tensor requires compiling ",
"PyTorch with cuSOLVER. Please use PyTorch built with cuSOLVER support.");
#endif
}
@ -1787,8 +1788,8 @@ void ormqr_kernel(const Tensor& input, const Tensor& tau, const Tensor& other, b
#ifdef USE_LINALG_SOLVER
ormqr_cusolver(input, tau, other, left, transpose);
#else
static_assert(false,
"Calling torch.ormqr on a CUDA tensor requires compiling "
TORCH_CHECK(false,
"Calling torch.ormqr on a CUDA tensor requires compiling ",
"PyTorch with cuSOLVER. Please use PyTorch built with cuSOLVER support.");
#endif
}
@ -1797,9 +1798,15 @@ REGISTER_CUDA_DISPATCH(ormqr_stub, &ormqr_kernel);
// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ qr ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#if AT_MAGMA_ENABLED()
template <typename scalar_t>
static void apply_geqrf(const Tensor& input, const Tensor& tau) {
#if !AT_MAGMA_ENABLED()
TORCH_CHECK(
false,
"Calling torch.geqrf on a CUDA tensor requires compiling ",
"PyTorch with MAGMA. Please use PyTorch built with MAGMA support.");
#else
magma_int_t m = magma_int_cast(input.size(-2), "m");
magma_int_t n = magma_int_cast(input.size(-1), "n");
@ -1826,6 +1833,7 @@ static void apply_geqrf(const Tensor& input, const Tensor& tau) {
checkMagmaInternalError(info, "geqrf");
}
tau.copy_(tau_cpu, /*non_blocking=*/true);
#endif
}
// This is a type dispatching helper function for 'apply_geqrf'
@ -1834,7 +1842,6 @@ void geqrf_magma(const Tensor& input, const Tensor& tau) {
apply_geqrf<scalar_t>(input, tau);
});
}
#endif
void geqrf_kernel(const Tensor& input, const Tensor& tau) {
#ifdef USE_LINALG_SOLVER
@ -1860,10 +1867,8 @@ void geqrf_kernel(const Tensor& input, const Tensor& tau) {
// - ?geqrf2_gpu gives correct R, but doesn't allow computation of Q via ?orgqr_gpu
// Refer to the below link for more details:
// http://icl.cs.utk.edu/magma/forum/viewtopic.php?f=2&t=1015&p=2800&hilit=geqrf_gpu#p2800
#if AT_MAGMA_ENABLED()
case at::LinalgBackend::Magma:
return geqrf_magma(input, tau);
#endif
case at::LinalgBackend::Cusolver:
default:
return geqrf_cusolver_backend(input, tau);
@ -1875,9 +1880,14 @@ void geqrf_kernel(const Tensor& input, const Tensor& tau) {
REGISTER_CUDA_DISPATCH(geqrf_stub, &geqrf_kernel);
#if AT_MAGMA_ENABLED()
template <typename scalar_t>
static void apply_magma_eigh(const Tensor& values, const Tensor& vectors, const Tensor& infos, bool upper, bool compute_eigenvectors) {
#if !AT_MAGMA_ENABLED()
TORCH_CHECK(
false,
"Calling torch.linalg.eigh/eigvalsh on a CUDA tensor requires compiling ",
"PyTorch with MAGMA. Please use PyTorch built with MAGMA support.");
#else
TORCH_INTERNAL_ASSERT_DEBUG_ONLY(values.device() == kCPU);
TORCH_INTERNAL_ASSERT_DEBUG_ONLY(infos.device() == kCPU);
@ -1897,7 +1907,7 @@ static void apply_magma_eigh(const Tensor& values, const Tensor& vectors, const
auto values_data = values.data_ptr<value_t>();
auto infos_data = infos.data_ptr<magma_int_t>();
scalar_t* wA = nullptr;
scalar_t* wA;
ALLOCATE_ARRAY(wA, scalar_t, lda * lda);
// Run once, first to get the optimum work sizes.
@ -1907,14 +1917,14 @@ static void apply_magma_eigh(const Tensor& values, const Tensor& vectors, const
magma_int_t lwork = -1;
scalar_t wkopt;
magma_int_t liwork = -1;
magma_int_t iwkopt = -1;
magma_int_t iwkopt;
magma_int_t lrwork = -1;
value_t rwkopt;
magmaSyevd<scalar_t, value_t>(jobz, uplo, n, vectors_data, lda, values_data,
wA, lda, &wkopt, lwork, &rwkopt, lrwork, &iwkopt, liwork, infos_data);
scalar_t* work = nullptr;
magma_int_t* iwork = nullptr;
scalar_t* work;
magma_int_t* iwork;
lwork = magma_int_cast(std::max<int64_t>(1, real_impl<scalar_t, value_t>(wkopt)), "work_size");
liwork = magma_int_cast(std::max<int64_t>(1, iwkopt), "iwork_size");
ALLOCATE_ARRAY(work, scalar_t, lwork);
@ -1941,6 +1951,7 @@ static void apply_magma_eigh(const Tensor& values, const Tensor& vectors, const
return;
}
}
#endif
}
// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ linalg_eigh ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@ -1979,17 +1990,14 @@ void linalg_eigh_magma(const Tensor& eigenvalues, const Tensor& eigenvectors, co
eigenvalues.copy_(eigenvalues_cpu);
}
}
#endif
void linalg_eigh_kernel(const Tensor& eigenvalues, const Tensor& eigenvectors, const Tensor& infos, bool upper, bool compute_eigenvectors) {
#if defined(USE_LINALG_SOLVER)
auto preferred_backend = at::globalContext().linalgPreferredBackend();
switch (preferred_backend) {
#if AT_MAGMA_ENABLED()
case at::LinalgBackend::Magma:
linalg_eigh_magma(eigenvalues, eigenvectors, infos, upper, compute_eigenvectors);
break;
#endif
case at::LinalgBackend::Cusolver:
default:
linalg_eigh_cusolver(eigenvalues, eigenvectors, infos, upper, compute_eigenvectors);
@ -2009,9 +2017,12 @@ This is an in-place routine, content of 'input', 'values', 'vectors' is overwrit
'infos' is an int Tensor containing error codes for each matrix in the batched input.
For more information see MAGMA's documentation for GEEV routine.
*/
#if AT_MAGMA_ENABLED()
template <typename scalar_t>
void apply_linalg_eig(Tensor& values, Tensor& vectors, Tensor& input, Tensor& infos, bool compute_eigenvectors) {
#if !AT_MAGMA_ENABLED()
TORCH_CHECK(false, "Calling torch.linalg.eig on a CUDA tensor requires compiling PyTorch with MAGMA. "
"Either transfer the tensor to the CPU before calling torch.linalg.eig or recompile with MAGMA.");
#else
TORCH_INTERNAL_ASSERT_DEBUG_ONLY(input.device() == at::kCPU);
TORCH_INTERNAL_ASSERT_DEBUG_ONLY(values.device() == at::kCPU);
TORCH_INTERNAL_ASSERT_DEBUG_ONLY(infos.device() == at::kCPU);
@ -2061,6 +2072,7 @@ void apply_linalg_eig(Tensor& values, Tensor& vectors, Tensor& input, Tensor& in
magmaEig<scalar_t, value_t>(jobvl, jobvr, n, input_working_ptr, lda, values_working_ptr,
lvectors_data, ldvl, rvectors_working_ptr, ldvr, work_data, lwork, rwork_data, info_working_ptr);
}
#endif
}
// This is a type dispatching helper function for 'apply_linalg_eig'
@ -2093,6 +2105,10 @@ static void apply_svd_magma(const Tensor& A,
const Tensor& S,
const Tensor& Vh,
const Tensor& info) {
#if !AT_MAGMA_ENABLED()
AT_ERROR("linalg.svd: MAGMA library not found in "
"compilation. Please rebuild with MAGMA.");
#else
using value_t = typename c10::scalar_value_type<scalar_t>::type;
const auto A_data = A.data_ptr<scalar_t>();
const auto U_data = compute_uv ? U.data_ptr<scalar_t>() : nullptr;
@ -2120,7 +2136,7 @@ static void apply_svd_magma(const Tensor& A,
rwork = static_cast<value_t*>(storage_rwork.mutable_data());
}
magma_int_t* iwork = nullptr;
magma_int_t* iwork;
ALLOCATE_ARRAY(iwork, magma_int_t, 8 * std::min(m, n));
// Query svd for the optimal lwork size
@ -2135,7 +2151,7 @@ static void apply_svd_magma(const Tensor& A,
&wkopt, lwork, rwork, iwork, info_data);
lwork = magma_int_cast(real_impl<scalar_t, value_t>(wkopt), "work_size");
}
scalar_t* work = nullptr;
scalar_t* work;
ALLOCATE_ARRAY(work, scalar_t, lwork);
for (int64_t i = 0; i < batchsize; i++) {
@ -2148,6 +2164,7 @@ static void apply_svd_magma(const Tensor& A,
work, lwork, rwork, iwork,
info_data + i);
}
#endif
}
void svd_magma(const Tensor& A,
@ -2189,7 +2206,6 @@ void svd_magma(const Tensor& A,
S.copy_(S_, /*non_blocking*/true);
info.copy_(info, /*non_blocking*/true);
}
#endif
void svd_kernel(const Tensor& A,
const bool full_matrices,
@ -2201,13 +2217,10 @@ void svd_kernel(const Tensor& A,
const Tensor& info) {
#ifdef USE_LINALG_SOLVER
// We always use cuSOLVER unless the user has specified they want to use MAGMA
#if AT_MAGMA_ENABLED()
bool use_magma = at::globalContext().linalgPreferredBackend() == at::LinalgBackend::Magma;
if (use_magma) {
svd_magma(A, full_matrices, compute_uv, U, S, Vh, info);
} else
#endif
{
} else {
// svd_cusolver computes V rather than Vh, so we pass a view of Vh.mT
// and then conjugate Vh in-place
svd_cusolver(A, full_matrices, compute_uv, driver, U, S, compute_uv ? Vh.mT() : Vh, info);
@ -2238,9 +2251,14 @@ REGISTER_CUDA_DISPATCH(svd_stub, &svd_kernel)
For further details, please see the MAGMA documentation for magma_dgetrs_gpu.
*/
#if AT_MAGMA_ENABLED()
template <typename scalar_t>
static void apply_lu_solve_looped_magma(const Tensor& LU, const Tensor& pivots, const Tensor& B, TransposeType transpose) {
#if !AT_MAGMA_ENABLED()
TORCH_CHECK(
false,
"Calling linalg.lu_solve on a CUDA tensor requires compiling ",
"PyTorch with MAGMA. Please rebuild with MAGMA.");
#else
auto trans = to_magma(transpose);
auto b_data = B.data_ptr<scalar_t>();
auto lu_data = LU.data_ptr<scalar_t>();
@ -2278,6 +2296,7 @@ static void apply_lu_solve_looped_magma(const Tensor& LU, const Tensor& pivots,
// so we don't need to check it all the time
TORCH_INTERNAL_ASSERT_DEBUG_ONLY(info == 0);
}
#endif
}
/*
@ -2296,6 +2315,12 @@ static void apply_lu_solve_looped_magma(const Tensor& LU, const Tensor& pivots,
*/
template <typename scalar_t>
static void apply_lu_solve_batched_magma(const Tensor& LU, const Tensor& pivots, const Tensor& B, TransposeType transpose) {
#if !AT_MAGMA_ENABLED()
TORCH_CHECK(
false,
"Calling linalg.lu_solve on a CUDA tensor requires compiling ",
"PyTorch with MAGMA. Please rebuild with MAGMA.");
#else
TORCH_INTERNAL_ASSERT(batchCount(B) == batchCount(LU), "batch_size of LU and B must be the same");
TORCH_INTERNAL_ASSERT(batchCount(LU) == batchCount(pivots.unsqueeze(-1)), "batch_size of LU and pivots must be the same");
auto trans = to_magma(transpose);
@ -2313,9 +2338,9 @@ static void apply_lu_solve_batched_magma(const Tensor& LU, const Tensor& pivots,
auto pivots_stride = pivots.size(-1);
magma_int_t batch_size = magma_int_cast(batchCount(B), "batchCount");
magma_int_t** pivots_array = nullptr;
scalar_t** lu_array = nullptr;
scalar_t** b_array = nullptr;
magma_int_t** pivots_array;
scalar_t** lu_array;
scalar_t** b_array;
ALLOCATE_ARRAY(pivots_array, magma_int_t*, batch_size);
ALLOCATE_ARRAY(lu_array, scalar_t*, batch_size);
@ -2339,7 +2364,7 @@ static void apply_lu_solve_batched_magma(const Tensor& LU, const Tensor& pivots,
scalar_t** b_array_cur = &b_array[mini_idx];
magma_int_t** pivots_array_cur = &pivots_array[mini_idx];
int info = -1;
int info;
magmaLuSolveBatched<scalar_t>(
n, nrhs, lu_array_cur, leading_dimension,
pivots_array_cur, b_array_cur, leading_dimension,
@ -2349,6 +2374,7 @@ static void apply_lu_solve_batched_magma(const Tensor& LU, const Tensor& pivots,
// so we don't need to check it all the time
TORCH_INTERNAL_ASSERT_DEBUG_ONLY(info == 0);
}
#endif
}
static void lu_solve_batched_magma(const Tensor& LU, const Tensor& pivots, const Tensor& B, TransposeType trans) {
@ -2364,7 +2390,6 @@ static void lu_solve_looped_magma(const Tensor& LU, const Tensor& pivots, const
apply_lu_solve_looped_magma<scalar_t>(LU, pivots, B, trans);
});
}
#endif
c10::MaybeOwned<Tensor> maybe_expand_lu(const Tensor& B, const Tensor& LU) {
// B and LU have the same number of dimensions
@ -2399,11 +2424,9 @@ static void lu_solve_kernel(const Tensor& LU, const Tensor& pivots, const Tensor
auto b = batchCount(B);
auto n = LU.size(-2);
auto k = B.size(-1);
#if AT_MAGMA_ENABLED()
// magma implementation of LU solve cannot handle a b tensor with last dim > 1024
// See https://bitbucket.org/icl/magma/issues/19/dgesv_batched-dgetrs_batched-fails-for
bool over_batched_magma_dim_limit = k > 1024;
#endif
// heuristics determined from tests discussed in https://github.com/pytorch/pytorch/pull/72935
// Computes X = U^{-1}L^{-1}P^T B via triangular solves
@ -2418,7 +2441,7 @@ static void lu_solve_kernel(const Tensor& LU, const Tensor& pivots, const Tensor
.set_check_mem_overlap(false)
.check_all_same_dtype(false)
.resize_outputs(false)
.declare_static_shape(pivots_->sizes(), /*squash_dims=*/pivots_->dim() - 1)
.declare_static_shape(pivots_->sizes(), /*squash_dim=*/pivots_->dim() - 1)
.add_output(perm)
.add_const_input(*pivots_)
.build();
@ -2434,7 +2457,7 @@ static void lu_solve_kernel(const Tensor& LU, const Tensor& pivots, const Tensor
// B1 = P^T @ B (must be done out-of-place as B is both source and target)
auto B1 = B.scatter(-2, inv_perm.unsqueeze(-1).expand_as(B), B);
// B = L^{-1} @ B1
at::linalg_solve_triangular_out(const_cast<Tensor&>(B), *LU_, B1, /*upper=*/false, /*left=*/true, /*unitriangular=*/true);
at::linalg_solve_triangular_out(const_cast<Tensor&>(B), *LU_, std::move(B1), /*upper=*/false, /*left=*/true, /*unitriangular=*/true);
// B = U^{-1} @ B
at::linalg_solve_triangular_out(const_cast<Tensor&>(B), *LU_, B, /*upper=*/true);
} else {
@ -2456,13 +2479,11 @@ static void lu_solve_kernel(const Tensor& LU, const Tensor& pivots, const Tensor
};
#endif
#if AT_MAGMA_ENABLED()
auto lu_solve_batched_magma_fn = [](const Tensor& LU, const Tensor& pivots, const Tensor& B, TransposeType trans) {
auto LU_ = maybe_expand_lu(B, LU);
auto pivots_ = maybe_expand_pivots(B, pivots);
lu_solve_batched_magma(*LU_, *pivots_, B, trans);
};
#endif
// Preferred Backend
@ -2477,7 +2498,6 @@ static void lu_solve_kernel(const Tensor& LU, const Tensor& pivots, const Tensor
return;
} else
#endif // ifdef USE_LINALG_SOLVER
#if AT_MAGMA_ENABLED()
if (preferred_backend == at::LinalgBackend::Magma) {
// Looped magma is very slow, but batched magma is buggy in these two cases
if (!over_batched_magma_dim_limit && trans == TransposeType::NoTranspose) {
@ -2488,7 +2508,6 @@ static void lu_solve_kernel(const Tensor& LU, const Tensor& pivots, const Tensor
}
return;
}
#endif
// Heuristic
//if (n == k) {
@ -2529,12 +2548,9 @@ static void lu_solve_kernel(const Tensor& LU, const Tensor& pivots, const Tensor
}
if (n <= 8) {
#if AT_MAGMA_ENABLED()
if (!over_batched_magma_dim_limit && trans == TransposeType::NoTranspose && k >= 256) {
if (use_magma_ && !over_batched_magma_dim_limit && trans == TransposeType::NoTranspose && k >= 256) {
lu_solve_batched_magma_fn(LU, pivots, B, trans);
} else
#endif
{
} else {
lu_solve_batched_cublas_fn(LU, pivots, B, trans);
}
} else if (n <= 64) {
@ -2567,9 +2583,12 @@ REGISTER_CUDA_DISPATCH(lu_solve_stub, &lu_solve_kernel);
// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ lstsq ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#if AT_MAGMA_ENABLED()
template <typename scalar_t>
static void apply_gels(const Tensor& a, Tensor& b, Tensor& infos) {
#if !AT_MAGMA_ENABLED()
TORCH_CHECK(false, "torch.linalg.lstsq: MAGMA library not found in "
"compilation. Please rebuild with MAGMA.");
#else
auto trans = MagmaNoTrans;
auto m = magma_int_cast(a.size(-2), "m");
auto n = magma_int_cast(a.size(-1), "n");
@ -2599,6 +2618,7 @@ static void apply_gels(const Tensor& a, Tensor& b, Tensor& infos) {
hwork_ptr, lwork, infos_working_ptr);
}
);
#endif
}
void gels_magma(const Tensor& a, Tensor& b, Tensor& infos) {
@ -2606,7 +2626,6 @@ void gels_magma(const Tensor& a, Tensor& b, Tensor& infos) {
apply_gels<scalar_t>(a, b, infos);
});
}
#endif
void linalg_lstsq_gels(const Tensor& A, const Tensor& B, const Tensor& /*infos*/) {
// The steps for using the QR decomposition for solving least squares problems
@ -2695,10 +2714,8 @@ void gels_looped(const Tensor& a, Tensor& b, Tensor& infos) {
#if defined(USE_LINALG_SOLVER) && !defined(USE_ROCM)
auto preferred_backend = at::globalContext().linalgPreferredBackend();
switch (preferred_backend) {
#if AT_MAGMA_ENABLED()
case at::LinalgBackend::Magma:
return gels_magma(a, b, infos);
#endif
case at::LinalgBackend::Cusolver:
default:
// linalg_lstsq_gels is a generic function that is implemented using
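
The hunks above contrast two ways of handling a build without MAGMA: per-call-site `#if AT_MAGMA_ENABLED()` guards, and a file-level `use_magma_` constant combined with an error raised inside the MAGMA helper itself. The following is a minimal sketch of the second shape and makes no claim about the real ATen helpers. `SKETCH_HAS_MAGMA`, `Backend`, `solve_magma`, `solve_cusolver`, and `solve` are hypothetical stand-ins, and `std::runtime_error` stands in for `TORCH_CHECK(false, ...)`.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical build flag standing in for AT_MAGMA_ENABLED().
#ifndef SKETCH_HAS_MAGMA
#define SKETCH_HAS_MAGMA 0
#endif

// File-level constant mirroring the use_magma_ flag in the diff.
#if SKETCH_HAS_MAGMA
const bool use_magma_ = true;
#else
const bool use_magma_ = false;
#endif

// The helper is always declared; without MAGMA it fails at call time.
std::string solve_magma() {
#if !SKETCH_HAS_MAGMA
  throw std::runtime_error(
      "MAGMA library not found in compilation. Please rebuild with MAGMA.");
#else
  return "magma";
#endif
}

std::string solve_cusolver() {
  return "cusolver";
}

enum class Backend { Default, Cusolver, Magma };

// The dispatcher needs no preprocessor guards: the default branch consults
// use_magma_, and an explicit Magma request surfaces the runtime error.
std::string solve(Backend preferred, int batch_count) {
  switch (preferred) {
    case Backend::Cusolver:
      return solve_cusolver();
    case Backend::Magma:
      return solve_magma();
    default:
      if (batch_count == 1 || !use_magma_) {
        return solve_cusolver();
      }
      return solve_magma();
  }
}
```

With the flag left at 0, the default path always falls back to the cuSOLVER stand-in and only an explicit `Backend::Magma` request throws. The trade-off is that a missing backend is reported when the function is called rather than when the file is compiled.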


@ -6,10 +6,9 @@
#include <torch/script.h>
namespace at {
namespace native {
namespace metal {
namespace native::metal {
at::Tensor& copy_from_metal_(at::Tensor& dst, const at::Tensor& src) {
static Tensor& copy_from_metal_(Tensor& dst, const Tensor& src) {
TORCH_INTERNAL_ASSERT(
src.device().type() == DeviceType::Metal,
"copy_from_metal input tensor's device is not metal");
@ -34,7 +33,7 @@ at::Tensor& copy_from_metal_(at::Tensor& dst, const at::Tensor& src) {
return dst;
}
at::Tensor& copy_to_metal_(at::Tensor& dst, const at::Tensor& src) {
static Tensor& copy_to_metal_(Tensor& dst, const Tensor& src) {
TORCH_INTERNAL_ASSERT(
dst.device().type() == DeviceType::Metal,
"copy_to_metal_ output tensor's device is not metal");
@ -54,7 +53,7 @@ at::Tensor& copy_to_metal_(at::Tensor& dst, const at::Tensor& src) {
return dst;
}
at::Tensor& metal_copy_impl_(at::Tensor& dst, const at::Tensor& src) {
static Tensor& metal_copy_impl_(Tensor& dst, const Tensor& src) {
if (src.device().type() == at::kMetal && dst.device().type() == at::kCPU) {
return copy_from_metal_(dst, src);
}
@ -69,7 +68,7 @@ at::Tensor& metal_copy_impl_(at::Tensor& dst, const at::Tensor& src) {
#pragma mark - ATen Ops
Tensor empty(
static Tensor empty(
c10::SymIntArrayRef sym_size,
optional<ScalarType> dtype,
optional<Layout> layout,
@ -88,7 +87,7 @@ Tensor empty(
std::move(mt), at::device(at::kMetal).dtype(dtype));
};
at::Tensor empty_strided(
static Tensor empty_strided(
IntArrayRef size,
IntArrayRef stride,
optional<ScalarType> dtype,
@ -109,8 +108,7 @@ TORCH_LIBRARY_IMPL(aten, Metal, m) {
m.impl(TORCH_SELECTIVE_NAME("aten::empty_strided"), TORCH_FN(empty_strided));
}
} // namespace metal
} // namespace native
} // namespace native::metal
struct MetalImpl : public at::metal::MetalInterface {
bool is_metal_available() const override {


@ -3,9 +3,7 @@
#include <c10/util/ArrayRef.h>
namespace at {
namespace native {
namespace metal {
namespace at::native::metal {
struct Conv2DParams final {
Conv2DParams() {}
@ -46,8 +44,6 @@ struct Conv2DParams final {
int64_t OH; // output height
};
} // namespace metal
} // namespace native
} // namespace at
} // namespace at::native::metal
#endif /* MetalConvParams_h */


@ -5,9 +5,7 @@
#include <string>
namespace at {
namespace native {
namespace metal {
namespace at::native::metal {
struct MetalDeviceInfo {
std::string name;
@ -42,8 +40,6 @@ static inline MetalDeviceInfo createDeviceInfo(id<MTLDevice> device) {
return device_info;
}
}
}
}
} // namespace at::native::metal
#endif


@ -6,9 +6,7 @@
#include <ATen/ATen.h>
namespace at {
namespace native {
namespace metal {
namespace at::native::metal {
enum class NeuronType {
None,
@ -66,8 +64,6 @@ static inline MPSNNNeuronDescriptor* neuronDescriptor(NeuronType type) {
}
}
}
}
}
} // namespace at::native::metal
#endif /* MetalNeuronType_h */

Some files were not shown because too many files have changed in this diff.