Compare commits


254 Commits

Author SHA1 Message Date
b2cbcf710b [BE] typing for decorators - _inductor/lowering (#131574)
See #131429

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131574
Approved by: https://github.com/oulgen, https://github.com/zou3519
ghstack dependencies: #131568, #131569, #131570, #131571, #131572, #131573
2024-07-25 22:24:19 +00:00
f0f20f7e97 [BE] typing for decorators - _jit_internal (#131573)
See #131429

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131573
Approved by: https://github.com/oulgen, https://github.com/zou3519
ghstack dependencies: #131568, #131569, #131570, #131571, #131572
2024-07-25 22:24:19 +00:00
bfe0079b72 [BE] typing for decorators - _meta_registrations (#131572)
See #131429
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131572
Approved by: https://github.com/oulgen, https://github.com/zou3519
ghstack dependencies: #131568, #131569, #131570, #131571
2024-07-25 22:24:19 +00:00
4b985e6f80 [BE] typing for decorators - distributed/_tensor/ops/utils (#131571)
See #131429

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131571
Approved by: https://github.com/oulgen, https://github.com/zou3519
ghstack dependencies: #131568, #131569, #131570
2024-07-25 22:24:19 +00:00
5731b486c8 [BE] typing for decorators - library (#131570)
See #131429

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131570
Approved by: https://github.com/oulgen, https://github.com/zou3519
ghstack dependencies: #131568, #131569
2024-07-25 22:24:19 +00:00
aa58af8b43 [BE] typing for decorators - masked/_ops (#131569)
See #131429
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131569
Approved by: https://github.com/oulgen, https://github.com/zou3519
ghstack dependencies: #131568
2024-07-25 22:24:19 +00:00
193f62fde9 [BE] typing for decorators - fx/_compatibility (#131568)
See #131429

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131568
Approved by: https://github.com/justinchuby, https://github.com/oulgen, https://github.com/zou3519
2024-07-25 22:24:19 +00:00
709ddf7a9d Add wrappers for synchronous GPUDirect Storage APIs (#130633)
Based in part on https://github.com/NVIDIA/apex/pull/1774

Differential Revision: [D60155434](https://our.internmc.facebook.com/intern/diff/D60155434)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130633
Approved by: https://github.com/albanD
2024-07-25 22:23:38 +00:00
0455344777 [dynamo] Turn on inline_inbuilt_nn_modules (#131275)
Known issues that are deliberately kept open and will be fixed later are tracked here - https://github.com/pytorch/pytorch/issues/131696

Training dashboard ([link](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Thu%2C%2018%20Jul%202024%2000%3A03%3A50%20GMT&stopTime=Thu%2C%2025%20Jul%202024%2000%3A03%3A50%20GMT&granularity=hour&suite=torchbench&mode=training&dtype=amp&lBranch=gh/anijain2305/435/head&lCommit=408b9358b8fca3a5d08b39741419fe8a596941aa&rBranch=gh/anijain2305/435/base&rCommit=d31f2ae904ba2cf0884bf24413ba2109c3585d51))

![image](https://github.com/user-attachments/assets/08ef081c-37d7-436d-905b-4b9e2b470644)

Inference dashboard ([link](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Thu%2C%2018%20Jul%202024%2000%3A03%3A50%20GMT&stopTime=Thu%2C%2025%20Jul%202024%2000%3A03%3A50%20GMT&granularity=hour&suite=torchbench&mode=inference&dtype=bfloat16&lBranch=gh/anijain2305/435/head&lCommit=914244fa2fe0055917e039e35183b21fa90afdc6&rBranch=gh/anijain2305/435/base&rCommit=d31f2ae904ba2cf0884bf24413ba2109c3585d51))
![image](https://github.com/user-attachments/assets/32136eff-a39e-4cde-a438-e51a665bc3c9)

Inference sees a little bit more perf degradation but we are ok with that.
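For context, a minimal sketch of how this behavior can be toggled from user code when bisecting the change (the flag name follows the PR title; treat the exact config path as an assumption):

```python
import torch
import torch._dynamo as dynamo

# Assumed config knob corresponding to this PR; setting it to False restores
# the pre-PR behavior of treating built-in nn.Modules specially.
dynamo.config.inline_inbuilt_nn_modules = True

class Tiny(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(8, 8)

    def forward(self, x):
        return torch.relu(self.linear(x))

compiled = torch.compile(Tiny())
out = compiled(torch.randn(4, 8))
```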

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131275
Approved by: https://github.com/ezyang
ghstack dependencies: #131744
2024-07-25 22:14:17 +00:00
513ce5f69a MTIA equivalent of torch.cuda.memory_stats (#131673)
Summary: Adding MTIA equivalent of `torch.cuda.memory_stats`
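For reference, the CUDA API being mirrored looks roughly like the sketch below; the MTIA-side entry point is the subject of this PR and its exact module path is not shown here.

```python
import torch

# torch.cuda.memory_stats returns a flat dict of allocator counters; the PR
# adds an MTIA equivalent with the same shape of output.
if torch.cuda.is_available():
    stats = torch.cuda.memory_stats()
    print(stats.get("allocated_bytes.all.current", 0))
    print(stats.get("reserved_bytes.all.current", 0))
```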

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131673
Approved by: https://github.com/egienvalue
2024-07-25 21:59:59 +00:00
9039131a89 TCPStore: fix remote address (#131773)
This fixes corrupt remote address logs caused by dangling pointers to addrinfo_storage inside of addrinfo.

Test plan:

Enable debug logs and verify addresses are correct

```
TORCH_CPP_LOG_LEVEL=INFO TORCH_DISABLE_SHARE_RDZV_TCP_STORE=1 TORCH_DISTRIBUTED_DEBUG=DETAIL LOGLEVEL=INFO python test/distributed/test_store.py -v
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131773
Approved by: https://github.com/kurman
2024-07-25 21:55:25 +00:00
520182dbff [reland][inductor] switch AotCodeCompiler to new cpp_builder (#130127)
Changes:
1. Switch `AotCodeCompiler` to the new cpp_builder.
2. Keep using `deprecated_cpp_compile_command` only for `fb_code`, since I can no longer debug it without Meta-internal environment access.
3. Add `TODO` comments asking a Meta employee to help continue this work.
4. Because of item 3, only the `fb_code` path still relies on `deprecated_cpp_compile_command`, so remove `validate_new_cpp_commands`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130127
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-07-25 21:45:40 +00:00
a34692c0a3 [Inductor] Added and_masks and or_masks utilities & make fully masked out rows 0 instead of nan (#131552)
Combine #131073 and #131012 and fix doc building failures.
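A rough usage sketch of the new mask-composition helpers (the import path and signatures follow the flex-attention prototype API and may differ across versions; treat them as assumptions):

```python
import torch
from torch.nn.attention.flex_attention import (  # path may vary by release
    and_masks,
    create_block_mask,
    flex_attention,
)

def causal(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx

def sliding_window(b, h, q_idx, kv_idx):
    return q_idx - kv_idx <= 128

# and_masks/or_masks combine mask_mods with a logical AND/OR.
mask_mod = and_masks(causal, sliding_window)
block_mask = create_block_mask(mask_mod, B=None, H=None, Q_LEN=1024, KV_LEN=1024)

q = k = v = torch.randn(1, 4, 1024, 64, device="cuda", dtype=torch.float16)
# With this change, fully masked-out rows produce 0 instead of NaN.
out = flex_attention(q, k, v, block_mask=block_mask)
```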

Co-authored-by: chilli <chilli@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131552
Approved by: https://github.com/Chillee
2024-07-25 21:29:46 +00:00
89bdd9c18f [kineto] populate src/dst rank for p2p (#130812)
Summary:
as title
populate src/dst rank (global rank) for p2p kernel

Differential Revision: D59794535

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130812
Approved by: https://github.com/aaronenyeshi
2024-07-25 21:10:57 +00:00
1c58aacbc8 [dtensor] move ops to private (#131211)
as titled

Differential Revision: [D60132519](https://our.internmc.facebook.com/intern/diff/D60132519)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131211
Approved by: https://github.com/XilunWu, https://github.com/wz337
ghstack dependencies: #131212
2024-07-25 20:59:55 +00:00
605dfd8fb4 Switch sync_distributed_folder to use non-reverse order (#131683)
`git` on GHA seems to use the reverse commit ordering that I see locally O_o
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131683
Approved by: https://github.com/seemethere
2024-07-25 20:44:23 +00:00
fe2e6f0c51 Revert "[reland][inductor] switch AotCodeCompiler to new cpp_builder (#130127)"
This reverts commit dfc9bfc8839ea3a0ffe933a64cd129fab5e4da75.

Reverted https://github.com/pytorch/pytorch/pull/130127 on behalf of https://github.com/atalman due to Breaks CI test_dataloader.py::TestDataLoader::test_segfault [GH job link](https://github.com/pytorch/pytorch/actions/runs/10099725941/job/27930133346) [HUD commit link](2c1851f04e) ([comment](https://github.com/pytorch/pytorch/pull/130127#issuecomment-2251360224))
2024-07-25 20:44:04 +00:00
1ad4e6f228 Refactor cudagraphs to use serializable placeholder info (#130252)
This PR refactors placeholders in cudagraphs to be serializable. We define a new PlaceholderInfo object which only has the necessary parts of placeholders for logging/debugging, and use that instead of `torch.fx.Node` directly. This allows us to then save PlaceholderInfo into the FXGraphCache/AOTAutogradCache later.
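A schematic of the idea (field names here are illustrative, not necessarily the ones the PR uses):

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class PlaceholderInfo:
    # Only the pickle-friendly bits of a torch.fx placeholder Node that are
    # needed for cudagraphs logging/debug messages.
    name: str
    stack_trace: Optional[str]
    user_names: Tuple[str, ...]

def to_placeholder_info(node) -> PlaceholderInfo:
    # node: a torch.fx.Node with op == "placeholder"
    return PlaceholderInfo(
        name=node.name,
        stack_trace=node.meta.get("stack_trace"),
        user_names=tuple(u.name for u in node.users),
    )
```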

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130252
Approved by: https://github.com/eellison, https://github.com/masnesral
ghstack dependencies: #129384
2024-07-25 20:39:37 +00:00
eqy
69d63b2318 [CUDA][Pooling] Clean up unused accscalar_t in maxpool2d forward (#131728)
maxpool forward doesn't actually do any accumulation and the second template param was just a dupe of the first

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131728
Approved by: https://github.com/mikaylagawarecki
2024-07-25 20:32:42 +00:00
fdc4d6fe96 [inductor] Refactor fusion of inplace operations (#130835)
Resubmit of #128979

`WeakDep`s force readers to have completed before a mutation overwrites the
buffer, but we want to allow fusions to occur for inplace mutations where the
same index is read and written.

Currently this is achieved by:
1. Identifying the buffers used by the mutating op in its `dep_closure`
2. Not creating `WeakDep`s for buffers in the `dep_closure`
3. Fixing up any bad fusions that might occur by an extra check in `can_fuse_vertical`

So we are first over-agressive in removing `WeakDep`, then add an ad-hoc fixup.

This PR instead emits all `WeakDep`s and adds a `fusable_weak_dep` check to
`can_fuse_vertical` which selectively allows inplace operation to fuse.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130835
Approved by: https://github.com/lezcano
2024-07-25 20:29:01 +00:00
61d7bb3e79 Migrate trunk workflows to Amazon2023 ami (#131677)
A continuation of the migration started in
- https://github.com/pytorch/pytorch/pull/131250

All migrated trunk jobs passed successfully
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131677
Approved by: https://github.com/malfet
2024-07-25 20:19:16 +00:00
a6ebd56f7b Factor out cudagraph post compile into its own function (#129384)
Moves cudagraphs stuff into a post_compile function that I can later call when loading from AOTAutogradCache. On a cache hit, we only need to save any reasons for disabling cudagraphs along with some metadata needed to run cudagraphify. The arguments to cudagraphs_post_compile should be the set of parameters I'll need to reconstruct on a warm start.

No actual behavioral change should result from this: I'm moving the behavior into separate functions, but every operation should be the same pre and post PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129384
Approved by: https://github.com/eellison
2024-07-25 20:15:44 +00:00
58b8704f28 [aot] Keep backward mutations in backward (#129130)
https://github.com/pytorch/pytorch/issues/127561

Mutations of inputs in the backward are emitted manually, after joint_fn tracing.
With the default partitioner logic they would be moved to the "forward" graph, since they are operations on forward inputs.

To keep those mutations in the backward:
- Introduce a "subgraph" node key that can be set with a context manager. When we emit a manual `copy_` on a forward input from the backward, we know it is for the backward, so we set subgraph="backward".

In the partitioner:
Introduce an optional `subgraph` argument that filters out nodes whose node_subgraph is set and differs from the requested subgraph, so they are not added to that graph.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129130
Approved by: https://github.com/Chillee
2024-07-25 20:02:25 +00:00
6c31e02971 Fixes the example for convert_conv3d_weight_memory_format (#131742)
Fixes #129158

Please let me know if changes are needed
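For reference, a minimal usage sketch of the utility whose docstring example this PR corrects (requires CUDA; the call signature is assumed from the 2D counterpart):

```python
import torch
import torch.nn as nn
from torch.nn.utils import convert_conv3d_weight_memory_format

conv = nn.Conv3d(8, 4, kernel_size=3, padding=1).cuda().half()
# Rewrites the conv weight to channels_last_3d so cuDNN can use NDHWC kernels.
convert_conv3d_weight_memory_format(conv, torch.channels_last_3d)

x = torch.randn(2, 8, 16, 16, 16, device="cuda", dtype=torch.half)
x = x.to(memory_format=torch.channels_last_3d)
out = conv(x)
```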
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131742
Approved by: https://github.com/albanD
2024-07-25 20:01:44 +00:00
fba24252bd [dynamo][frame summary] Skip frame summary for frames from inside torch/nn/modules (#131744)
This ensures that the stack trace points to the user code.

At main (no inlining)
![image](https://github.com/user-attachments/assets/bf6f1f46-2dfe-45a2-95e1-fb733cda7e50)

With inlining but without this PR

![image](https://github.com/user-attachments/assets/fcb16c4d-dd81-4e5d-a63a-391a73683deb)

With inlining and this PR

![image](https://github.com/user-attachments/assets/69f10f65-c2ed-4179-acd5-a2824615129c)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131744
Approved by: https://github.com/ezyang
2024-07-25 19:30:03 +00:00
a1fad03fa8 [ROCm] Enable cudagraph expandable segments UTs in inductor/dynamo (#131111)
Test runtimes extracted from CI logs are as follows.

"linux-focal-rocm6.1-py3.8":
"dynamo/test_cudagraphs_expandable_segments": 3.3185000000000002,
"inductor/test_cudagraph_trees_expandable_segments": 153.233,

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131111
Approved by: https://github.com/eqy, https://github.com/jataylo, https://github.com/peterbell10
2024-07-25 19:26:04 +00:00
8c4683c978 Add device argument to the large_grid unit test (#131702)
A missing device argument made this unit test run only on CPU. Two unit tests were added in the previous PR https://github.com/pytorch/pytorch/pull/127448, but only one of them used `device=self.device` to make sure the test runs on the correct device.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131702
Approved by: https://github.com/desertfire
2024-07-25 19:19:56 +00:00
bf6aae1468 Improve torch.masked.mean and torch.masked._std_var scaling (#131293)
Fixes #131292

Using `new_ones` is expensive and unnecessary.
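An illustrative before/after of the kind of scaling problem (not the exact code path in torch.masked):

```python
import torch

x = torch.randn(1024, 1024)
mask = torch.rand_like(x) > 0.5

# Before: materialize a full ones tensor just to count unmasked elements.
count_slow = torch.sum(x.new_ones(x.shape) * mask, dim=-1)

# After: count directly from the mask (or derive the count from the shape
# when there is no mask), avoiding the extra allocation and multiply.
count_fast = mask.sum(dim=-1)

mean = torch.sum(x * mask, dim=-1) / count_fast.clamp(min=1)
```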

Before:

![21232fda-366a-47ea-a017-15a35cd51d0c](https://github.com/user-attachments/assets/779830f0-0027-4fab-a9e6-b99954c80bc5)

After:

![aad2dfcc-52c9-4046-86ab-122b044fa19c](https://github.com/user-attachments/assets/810711c5-c4f0-4b6b-91dc-9a9e714f6ee0)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131293
Approved by: https://github.com/ezyang
2024-07-25 18:52:59 +00:00
2c1851f04e [export] fix output node's meta (#131706)
Summary:
This PR fixes all the places in the strict export stack where the output node's meta is not preserved correctly. However, we're getting a new error for the test we intend to fix: `buck2 run caffe2/test/quantization:test_quantization -- -r "test_re_export_preserve_handle"`:

The `get_attr` nodes have wrong metadata. There are likely more things that need to be fixed to get it working, but that's beyond the scope of this PR.

Test Plan: buck2 run caffe2/test/quantization:test_quantization -- -r "test_re_export_preserve_handle"

Differential Revision: D60198221

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131706
Approved by: https://github.com/yushangdi
2024-07-25 18:44:21 +00:00
dfc9bfc883 [reland][inductor] switch AotCodeCompiler to new cpp_builder (#130127)
Changes:
1. Switch `AotCodeCompiler` to the new cpp_builder.
2. Keep using `deprecated_cpp_compile_command` only for `fb_code`, since I can no longer debug it without Meta-internal environment access.
3. Add `TODO` comments asking a Meta employee to help continue this work.
4. Because of item 3, only the `fb_code` path still relies on `deprecated_cpp_compile_command`, so remove `validate_new_cpp_commands`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130127
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-07-25 18:34:08 +00:00
f3df7deab8 Revert "Add flag to ignore unsupported @triton.autotune args in user-written kernel compilation (#131431)"
This reverts commit e9db1b059733a02e1fb726d22a0489471044ad98.

Reverted https://github.com/pytorch/pytorch/pull/131431 on behalf of https://github.com/clee2000 due to broke internal tests D60211713 ([comment](https://github.com/pytorch/pytorch/pull/131431#issuecomment-2251091957))
2024-07-25 18:00:46 +00:00
2423d89d0c [dynamo] mirror training flag in OptimizedModule (#131546)
Fixes https://github.com/pytorch/pytorch/issues/122414.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131546
Approved by: https://github.com/yanboliang, https://github.com/anijain2305
2024-07-25 17:43:09 +00:00
c3679bed35 Revert "Fix py codegen to delete values that don't have any users (#131028)"
This reverts commit 91aba7baac3d2a079c0b13db25588842260c98cc.

Reverted https://github.com/pytorch/pytorch/pull/131028 on behalf of https://github.com/clee2000 due to broke inductor/test_triton_kernels inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_functionalize [GH job link](https://github.com/pytorch/pytorch/actions/runs/10094659640/job/27915271250) [HUD commit link](91aba7baac) ([comment](https://github.com/pytorch/pytorch/pull/131028#issuecomment-2251058374))
2024-07-25 17:42:18 +00:00
ec3829795d [3/3] 3D Composability - move tp dp tests (#129802)
pytorch (fsdp, tp, pp) -> pytorch (composable)
Move (fsdp, tp, pp) tests under pytorch into a composable folder

FSDP:
test/distributed/_composable/fsdp/test_fully_shard_training.py
-TestFullyShard2DTraining
**DP:
test/distributed/tensor/parallel/test_ddp_2d_parallel.py
TP:
test/distributed/tensor/parallel/test_fsdp_2d_parallel.py**
PP:
test/distributed/pipelining/test_composability.py

=>
**distributed/_composable/test_composability/test_2d_composability.py**
distributed/_composable/test_composability/test_pp_composability.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129802
Approved by: https://github.com/fduwjj
ghstack dependencies: #129800, #129801
2024-07-25 16:36:55 +00:00
29571c5c06 [2/3] 3D Composability - move pp tests (#129801)
pytorch (fsdp, tp, pp) -> pytorch (composable)
Move (fsdp, tp, pp) tests under pytorch into a composable folder

FSDP:
test/distributed/_composable/fsdp/test_fully_shard_training.py
-TestFullyShard2DTraining
DP:
test/distributed/tensor/parallel/test_ddp_2d_parallel.py
TP:
test/distributed/tensor/parallel/test_fsdp_2d_parallel.py
**PP:
test/distributed/pipelining/test_composability.py**

=>
distributed/_composable/test_composability/test_2d_composability.py
**distributed/_composable/test_composability/test_pp_composability.py**

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129801
Approved by: https://github.com/wconstab
ghstack dependencies: #129800
2024-07-25 16:36:55 +00:00
75c4176b05 [export][BE] consolidate export and export_for_training (#131496)
Summary: This PR consolidates the implementation of export and export_for_training to maximize code re-use. Also add some type annotations and comments in the code for better readability.

Test Plan: Existing tests.

Differential Revision: D60130515

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131496
Approved by: https://github.com/avikchaudhuri, https://github.com/pianpwk
2024-07-25 16:35:16 +00:00
6bc8db1d32 Rename is_training flag to have more information (#131618)
Summary: rename the is_training flag to dispatch_tracing_mode, which is either “make_fx” or “aot_export”

Test Plan: OSS CI

Differential Revision: D60154327

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131618
Approved by: https://github.com/ydwu4
2024-07-25 16:29:55 +00:00
f063027d54 [aoti] Fix constant inputs passed to aoti (#131594)
In cases where the program takes in a constant, export will specialize on the constant and embed it into the graph, leaving the graph with a placeholder node that has no users. However, Inductor errors out further down because, as is typical in torch.compile, these constants don't show up as inputs. Since the constants are already embedded in the graph, we just ignore these inputs while compiling with AOTI and filter out the non-tensor inputs at runtime.
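A small repro-style sketch of the situation described above (hypothetical module; the AOTI compile call itself is elided):

```python
import torch
from torch.export import export

class M(torch.nn.Module):
    def forward(self, x, scale: float):
        return x * scale

# export specializes on the float and embeds it in the graph; the
# corresponding placeholder ends up with no users. With this fix, AOTInductor
# ignores such non-tensor inputs when compiling and filters them out at runtime.
ep = export(M(), (torch.randn(4), 2.0))
print(ep.graph)
```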

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131594
Approved by: https://github.com/desertfire
2024-07-25 16:22:15 +00:00
ffc6bf8149 [dynamo] lazily guard and specialize on the symint when used in f-string. (#131529)
Fixes https://github.com/pytorch/pytorch/issues/103602.

This PR implements the idea of "if someone creates a string and then ends up not using it, we would prefer to NOT have specialized," mentioned in the above issue. Specifically, we create a lazy variable tracker instead of a ConstantVariable when we're in FORMAT_VALUE, and when the lazy variable tracker is realized (i.e. it's going to be used), we create a ConstantVariable, so the specialization/guarding happens at the time of realization.
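A user-level illustration of the intent (names and the dynamic flag are illustrative):

```python
import torch

@torch.compile(dynamic=True)
def f(x, verbose: bool = False):
    # Building the f-string formats x.shape[0], a SymInt. With the lazy
    # variable tracker, specialization/guarding only happens if the string
    # is actually realized (e.g. when verbose=True prints it).
    msg = f"running with {x.shape[0]} rows"
    if verbose:
        print(msg)
    return x * 2

f(torch.randn(4, 3))
f(torch.randn(8, 3))  # ideally no recompile, since msg was never used
```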

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131529
Approved by: https://github.com/ezyang
2024-07-25 16:16:34 +00:00
96e8df6a3a [ts_converter] Support prim::max and prim::if with multiple outputs (#131593)
Summary: As title.

Test Plan: test_converter.py

Reviewed By: angelayi

Differential Revision: D60147455

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131593
Approved by: https://github.com/ydwu4
2024-07-25 16:13:31 +00:00
cyy
b07ea91c4c [2/N] Fix clang-tidy warnings in jit (#131735)
Follows #131034

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131735
Approved by: https://github.com/ezyang
2024-07-25 15:56:53 +00:00
49a8e061b6 Revert "Support IPC for Expandable Segments (#130890)"
This reverts commit 0e71a88f9b2ca6b950c76a061791559cdd8a8870.

Reverted https://github.com/pytorch/pytorch/pull/130890 on behalf of https://github.com/zdevito due to some internal tests show shutdown issues with the change to the table that holds ipc handles ([comment](https://github.com/pytorch/pytorch/pull/130890#issuecomment-2250767280))
2024-07-25 15:54:57 +00:00
cyy
a4be5cb50e Simplify some c++ code (#131612)
The simplifications were discovered by static analysis tools.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131612
Approved by: https://github.com/ezyang
2024-07-25 15:07:37 +00:00
c3d099ddd1 [BE][Easy] Add hooks to doc for Optimizer base class (#131628)
Happened to notice this was missing from the base class (but is rendering for the other optimizers like Adam etc.) when I wanted to link the state_dict hooks for https://discuss.pytorch.org/t/global-not-per-param-optimizer-state/206769

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131628
Approved by: https://github.com/janeyx99
2024-07-25 15:07:08 +00:00
745b55d14a [CI][dashboard] Add a workflow to collect aarch64 perf (#131729)
Summary: as title
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131729
Approved by: https://github.com/huydhn
2024-07-25 14:58:47 +00:00
1eedb0a962 fix torchrun log message (#131652)
fixes https://github.com/pytorch/pytorch/issues/131461

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131652
Approved by: https://github.com/awgu
2024-07-25 14:50:10 +00:00
d0e2ab617d Migrate conda, manywheel and libtorch docker builds to pytorch/pytorch (#129022)
Migration of Docker conda builds  to pytorch/pytorch from pytorch/builder: https://github.com/pytorch/builder/blob/main/.github/workflows/build-conda-images.yml

Related to: https://github.com/pytorch/builder/issues/1849

Migrate scripts and workflows, and add logic to execute on PR and upload to ECR with the GitHub hash tag, in order to test the Docker build and nightly on PR.

Test when executing on PR, upload to ecr:
https://github.com/pytorch/pytorch/actions/runs/9799439218/job/27059691327
```
308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/conda-builder-cpu:789cf8fcd738088860056160f6e9ea7cd005972b
```

Test With-Push, upload to dockerhub:
https://github.com/pytorch/pytorch/actions/runs/9799783407/job/27060633427
```
docker.io/pytorch/conda-builder:cpu done
```
Will upload here: https://hub.docker.com/r/pytorch/conda-builder/

Test using ecr image in the nightly workflow:
https://github.com/pytorch/pytorch/actions/runs/9798428933/job/27057835235#step:16:87

Note: This is the first part, which builds the Docker image and uploads it to either Docker Hub or ECR. After merging, a follow-up PR will need to change the conda nightly workflows to use either the ECR or Docker Hub image, depending on whether we are running on a PR or from the main/release branch.

Cleanup of workflows and scripts from builder repo: https://github.com/pytorch/builder/pull/1923
Co-authored-by: atalman <atalman@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129022
Approved by: https://github.com/atalman, https://github.com/seemethere, https://github.com/malfet, https://github.com/chuanqi129
2024-07-25 14:36:15 +00:00
4a5a87168e [BE] typing for decorators - _prims_common/wrappers (#131567)
See #131429
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131567
Approved by: https://github.com/oulgen, https://github.com/zou3519
2024-07-25 14:35:13 +00:00
7260eaeca0 Fix vulkan builds with missing overrides errors (#131760)
Followup after https://github.com/pytorch/pytorch/pull/131524

Also, use `C10_DIAGNOSTIC_PUSH_AND_IGNORED_IF_DEFINED` macro to suppress existing warnings

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131760
Approved by: https://github.com/atalman
2024-07-25 14:29:44 +00:00
fddb1bcdea [CCA][Memory Snapshot] Move user_defined annotations to Native Caching Allocator (#130964)
Summary: Instead of embedding the user_defined TraceEntry inside of device_traces, which causes issues when some threads may not have the proper device id set, save them into an external_annotations field by using a RingBuffer<AnnotationEntry> called annotation_buffer owned by the NativeCachingAllocator.

Test Plan: CI, resnet run, and FBR model.

Differential Revision: D59703213

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130964
Approved by: https://github.com/zdevito
2024-07-25 14:06:52 +00:00
c88c90a897 [TS2E] Improve logging (#131711)
Serializing large outputs like ExportedProgram to text when it isn't actually needed can be costly.
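The general pattern at play is deferring expensive stringification to the logger (a generic sketch, not the exact change in this PR):

```python
import logging

log = logging.getLogger(__name__)

class ExpensiveToPrint:
    def __str__(self) -> str:
        # Stands in for something like ExportedProgram, whose textual dump
        # is large and slow to build.
        return "...very large textual dump..."

ep = ExpensiveToPrint()

# Eager: the f-string calls __str__ even when DEBUG logging is disabled.
# log.debug(f"exported program: {ep}")

# Lazy: %-style args are only formatted if the record is actually emitted.
log.debug("exported program: %s", ep)

# An explicit guard also works for extra-costly payloads.
if log.isEnabledFor(logging.DEBUG):
    log.debug("exported program: %s", ep)
```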

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131711
Approved by: https://github.com/ydwu4
2024-07-25 13:40:10 +00:00
316c0d3e6b [inductor][cpp][gemm] support k slicing for static shapes (#130821)
This PR provides the initial support for k-slicing (i.e. parallel reduction along k-dim) of CPP GEMM template. Only static shapes are supported now. When k-slicing is enabled, there would be extra temporary buffers allocated to hold the intermediate results and an extra barrier after initial GEMM compute by each thread, i.e. each thread first stores the GEMM result to temporary accumulation buffers (pointed by `local_buf_ptrs` which is an array of pointers pointing to accumulation buffers), followed by a reduction along k-slices, epilogue computes and store to the final output `Y`. In each k-slicing thread group, the reduction along k-slices and epilogue computes are conducted in parallel along M-dim. The algorithm is designed to reduce the synchronization overhead as much as possible.

The k-slicing is enabled when blocking on M and N is unable to occupy all threads. Since k-slicing doesn't always bring benefit, an extra configuration is added to enable it (disable by default). We need to identify a good heuristics in the future to enable k-slicing by default.

Performance numbers with 64x4096x64, 64x10000x64, and 64x20000x64 as examples on a 60-core SPR. As you can see, k-slicing only outperforms non-k-slicing when K is large enough.

Without k-slicing
AUTOTUNE linear_unary(64x4096, 64x4096, 64)
  cpp_packed_gemm_0 0.0108 ms 100.0%
  _linear_pointwise 0.0431 ms 25.1%

AUTOTUNE linear_unary(64x10000, 64x10000, 64)
  cpp_packed_gemm_0 0.0272 ms 100.0%
  _linear_pointwise 0.0892 ms 30.5%

AUTOTUNE linear_unary(64x20000, 64x20000, 64)
  cpp_packed_gemm_0 0.0781 ms 100.0%
  _linear_pointwise 0.1693 ms 46.1%

With k-slicing:
AUTOTUNE linear_unary(64x4096, 64x4096, 64)
  cpp_packed_gemm_0 0.0260 ms 100.0%
  _linear_pointwise 0.0444 ms 58.5%

AUTOTUNE linear_unary(64x10000, 64x10000, 64)
  cpp_packed_gemm_0 0.0275 ms 100.0%
  _linear_pointwise 0.0893 ms 30.8%

AUTOTUNE linear_unary(64x20000, 64x20000, 64)
  cpp_packed_gemm_0 0.0284 ms 100.0%
  _linear_pointwise 0.1686 ms 16.8%

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130821
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
ghstack dependencies: #131024
2024-07-25 13:36:38 +00:00
d962dba0c4 Revert "[2/3] 3D Composability - move pp tests (#129801)"
This reverts commit 84cd062fb25c6da7d33b559c28afa38420e64415.

Reverted https://github.com/pytorch/pytorch/pull/129801 on behalf of https://github.com/atalman due to Broke periodic CI: distributed/_composable/test_composability/test_pp_composability.py::ComposabilityTest::test_manual_with_data_parallel_dp_type_DDP_ScheduleClass4 [GH job link](https://github.com/pytorch/pytorch/actions/runs/10083807511/job/27882848654) [HUD commit link](544f950d14) ([comment](https://github.com/pytorch/pytorch/pull/129801#issuecomment-2250326191))
2024-07-25 13:30:56 +00:00
9c4cf866c2 Adafactor forloop basic impl (#129905)
#109581

At this point, the vanilla implementation (the default) is good.
Docs: https://docs-preview.pytorch.org/pytorch/pytorch/129905/generated/torch.optim.Adafactor.html#torch.optim.Adafactor

Specifically, the impl in this PR, which attempts to replicate the paper,
```
optim = torch.optim.Adafactor([weight])
```
is close enough to https://pytorch-optimizers.readthedocs.io/en/latest/optimizer/#pytorch_optimizer.AdaFactor
```
optim_c = AdaFactor([weight], betas=(0, 0.999), scale_parameter=False)
```
is close enough to https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adafactor
```
optim = keras.optimizers.Adafactor(learning_rate=0.01)
```

The three results respectively for the same randomly generated weights:
```
# ours
tensor([[ 0.3807594, -0.3912092],
        [ 0.0762539,  0.5377805],
        [ 0.2459473,  0.4662207]])

# pytorch-optimizer
tensor([[ 0.3807592, -0.3912172],
        [ 0.0762507,  0.5377818],
        [ 0.2459457,  0.4662213]])

# keras
array([[ 0.38076326, -0.39121315],
        [ 0.0762547 ,  0.5377859 ],
        [ 0.24594972,  0.46622536]], dtype=float32)
```

This gives me confidence to move forward in speeding up the implementation now that a baseline has been established. If you're curious about differences:
* keras assigns step_size (rho_t in their code) to `min(lr, 1 / sqrt(step)` whereas the OG impl uses a hardcoded 0.01 instead of lr. We do the same thing as keras, but our lr default is 0.01.
* We differ from the pytorch-optimizers default in that our default will not track momentum (thus `beta1=0`) and we do not apply parameter scaling.

<details>

Keras collab: https://colab.research.google.com/drive/1i3xF8ChL7TWKJGV_5v_5nMhXKnYmQQ06?usp=sharing

My script repro:

```
import torch
from pytorch_optimizer import AdaFactor
torch.set_printoptions(precision=7)

weight = torch.tensor([[ 0.37697506, -0.39500135],
        [ 0.07246649,  0.53399765],
        [ 0.24216151,  0.46243715]], dtype=torch.float32)
# bias = torch.tensor([0, 0], dtype=torch.float32)

weight.grad = torch.tensor([[-0.5940447, -0.7743838],
        [-0.5940447, -0.7743838],
        [-0.5940447, -0.7743838]], dtype=torch.float32)
# bias.grad = torch.tensor([-2.5027974,  1.5422692], dtype=torch.float32)

weight_c = weight.clone()
weight_c.grad = weight.grad.clone()

optim = torch.optim.Adafactor([weight])
optim.step()
print(weight)

optim_c = AdaFactor([weight_c], betas=(0, 0.999), scale_parameter=False)
optim_c.step()
print(weight_c)
```

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129905
Approved by: https://github.com/albanD
2024-07-25 13:17:19 +00:00
e8956c9fe6 Allow cpu scalar to be moved to HPU in masked_fill_decomposition (#127871)
Extension of the condition allowing the cpu scalar to be moved to specific devices.

This fixes an HPU specific error:
`torch._dynamo.exc.BackendCompilerFailed: backend='aot_hpu_training_backend' raised:
RuntimeError: Expected `value` to be on same device as `a` While executing %masked_fill : [num_users=1] = call_method[target=masked_fill](args = (%matmul, %expand_as, %tensor), kwargs = {})`

On the HPU in eager mode the problem doesn't occur because PyTorch's implementation is not used there.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127871
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-07-25 13:04:55 +00:00
91aba7baac Fix py codegen to delete values that don't have any users (#131028)
Fixes #131025

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131028
Approved by: https://github.com/ezyang
2024-07-25 13:04:23 +00:00
2784b3f1b7 [inductor] Fix split-scan interaction with multi-kernel (#131044)
This fixes a couple errors that come up when multi-kernel is used with
split-scan.
1. The split-scan was being marked as a persistent kernel, which allowed
   a multi-kernel to be created but this isn't supported. Fix is to
   never mark split-scan as persistent.
2. Benchmark codegen was not handling WorkspaceArg, and would raise a
   KeyError during codegen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131044
Approved by: https://github.com/shunting314
2024-07-25 11:36:36 +00:00
c04f70bb30 [BE] enable UFMT for torch/ao/ (#128864)
Part of #123062

- #123062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128864
Approved by: https://github.com/ezyang
2024-07-25 11:30:14 +00:00
434f60ce33 Refactor nightly checkout tool (#131134)
Changes:

- Add `-C REPO` in `git` commands to allow the tool can be run everywhere not only the repo dir
- Use `pathlib.Path` as many as possible
- Replace `subprocess.run(..., check=True)` with `subprocess.check_{call,output}(...)`
- Add `encoding='utf-8'` for files
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131134
Approved by: https://github.com/ezyang
2024-07-25 11:20:43 +00:00
054d214c50 [BE][tests] show local variables on failure in tests (#131151)
------

As per the title, add argument `--locals` for `unittest` and `--showlocals --tb=long` for `pytest` in CI.

Some failures cannot be reproduced on the local machine but exist on cloud CI. This change allows us to investigate the test failure more easily.

Example output: https://github.com/pytorch/pytorch/actions/runs/9961546996/job/27523888353?pr=130710#step:20:3361

```text
/opt/conda/envs/py_3.8/lib/python3.8/site-packages/sympy/core/function.py:307:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

cls = FloorDiv, base = -1.00000000000000, divisor = -1.00000000000000

    @classmethod
    def eval(cls, base, divisor):
        # python test/test_dynamic_shapes.py -k TestDimConstraints.test_dim_constraints_solve_full
        # Assert triggered by inequality solver
        # assert base.is_integer, base
        # assert divisor.is_integer, divisor

        # We don't provide the same error message as in Python because SymPy
        # makes it difficult to check the types.
        if divisor.is_zero:
            raise ZeroDivisionError("division by zero")
        if base in (int_oo, -int_oo, sympy.oo, -sympy.oo) and divisor in (
            int_oo,
            -int_oo,
            sympy.oo,
            -sympy.oo,
        ):
            return sympy.nan
        if base is sympy.nan or divisor is sympy.nan:
            return sympy.nan

        if base.is_zero:
            return sympy.S.Zero
        if base.is_integer and divisor == 1:
            return base
        if base.is_integer and divisor == -1:
            return sympy.Mul(base, -1)
        if (
            isinstance(base, sympy.Number)
            and isinstance(divisor, sympy.Number)
            and (
                base in (int_oo, -int_oo, sympy.oo, -sympy.oo)
                or divisor in (int_oo, -int_oo, sympy.oo, -sympy.oo)
            )
        ):
            r = float(base) / float(divisor)
            if r == math.inf:
                return int_oo
            elif r == -math.inf:
                return -int_oo
            elif math.isnan(r):
                return sympy.nan
            else:
                return sympy.Integer(math.floor(r))
        if isinstance(base, sympy.Integer) and isinstance(divisor, sympy.Integer):
            return sympy.Integer(int(base) // int(divisor))
        if isinstance(base, FloorDiv):
            return FloorDiv(base.args[0], base.args[1] * divisor)

        # Expands (x + y) // b into x // b + y // b.
        # This only works if floor is an identity, i.e. x / b is an integer.
        for term in sympy.Add.make_args(base):
            quotient = term / divisor
            if quotient.is_integer and isinstance(divisor, sympy.Integer):
                # NB: this is correct even if the divisor is not an integer, but it
                # creates rational expressions that cause problems with dynamic
                # shapes.
                return FloorDiv(base - term, divisor) + quotient

        try:
            gcd = sympy.gcd(base, divisor)
            if gcd != 1:
>               return FloorDiv(
                    sympy.simplify(base / gcd), sympy.simplify(divisor / gcd)
                )

base       = -1.00000000000000
cls        = FloorDiv
divisor    = -1.00000000000000
gcd        = 1.00000000000000
quotient   = 1.00000000000000
term       = -1.00000000000000

/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/utils/_sympy/functions.py:159:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

args = (FloorDiv, -1.00000000000000, -1.00000000000000), kwargs = {}

    @wraps(func)
    def wrapper(*args, **kwargs):
        try:
>           retval = cfunc(*args, **kwargs)
E           RecursionError: maximum recursion depth exceeded in comparison
E
E           To execute this test, run the following from the base repo dir:
E               python test/test_sympy_utils.py -k TestValueRanges.test_binary_ref_fn_floordiv_dtype_float
E
E           This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

args       = (FloorDiv, -1.00000000000000, -1.00000000000000)
cfunc      = <functools._lru_cache_wrapper object at 0x7fc5303173a0>
func       = <function Function.__new__ at 0x7fc530317280>
kwargs     = {}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131151
Approved by: https://github.com/ezyang
2024-07-25 10:10:58 +00:00
c4bf4005d1 [dtensor][debug] adding new noise level which allows users to only print operations with dtensors (#131592)
**Summary**
I have added a new noise level between the existing levels 1 and 2, so that the noise level controls are now:

0. prints module-level collective counts
1. prints DTensor operations not included in trivial operations (new noise level)
2. prints operations not included in trivial operations
3. prints all operations

This gives the user more flexibility in controlling what information they want to use. The noise levels are used both for creating the console/file log and the json dump. In the example file, I have changed the module_tracing examples to noise level 0 and have changed my transformer examples to show off the new noise level.

**Test Plan**
1. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_json_dump

2. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_operation_tracing

3. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_module_tracing

4. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_module_tracing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131592
Approved by: https://github.com/XilunWu
ghstack dependencies: #131419, #130996
2024-07-25 06:54:57 +00:00
41e9f9cb7c [inductor] Fix flaky tests in test_select_algorithm.py (#131709)
Summary: Same as [#131699](https://github.com/pytorch/pytorch/pull/131699), but in `test_select_algorithm.py`.

Test Plan: Tested internally.

Differential Revision: D60202778

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131709
Approved by: https://github.com/eellison
2024-07-25 06:42:57 +00:00
3afdbecb23 [inductor] Fix flaky tests in test_debug_trace.py (#131722)
Summary:
When run internally in multiple parallel processes, the `test_debug_trace` test hits the cache and skips writing all the expected outputs. Here we force-disable the inductor cache to circumvent the problem. Ideally, we should switch to using a cleaner `fresh_inductor_cache` decorator approach, but it doesn't work at the moment.

Additionally, the debug trace dir is now generated by `tempfile.mkdtemp` to avoid a (rather unlikely) race condition.

Test Plan: Tested internally.

Differential Revision: D60207586

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131722
Approved by: https://github.com/eellison
2024-07-25 05:56:01 +00:00
059f9fb30b [BE][inductor] Type annotate codecache.py and config.py (#131427)
As title.

Checked/ Referred to the raw json file for runtime types . (and tried to cover all the missing annotations listed in the .json) this time.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131427
Approved by: https://github.com/eellison, https://github.com/oulgen
2024-07-25 05:54:38 +00:00
ace6decc99 Fix static py::object dangling pointer with py::gil_safe_call_once_and_store (#130341)
Fix static `py::object`s with `py::gil_safe_call_once_and_store`.

The following code will leak a `py::object` which will call its destructor when shutdown the program. The destructor will call `Py_DECREF(obj.m_ptr)` which may raise a segmentation fault.

```c++
void func() {
    static py::object obj = py::module_::import("foo").attr("bar");

    ...
}
```

The correct code is to use raw pointers rather than the instance.

```c++
void func() {
    static py::object* obj_ptr = new py::object{py::module_::import("foo").attr("bar")};
    py::object obj = *obj_ptr;

    ...
}
```

This PR uses the `py::gil_safe_call_once_and_store` function from `pybind11`, which can run arbitrary initialization code only once under the Python GIL thread safely.

```c++
void func() {
    PYBIND11_CONSTINIT static py::gil_safe_call_once_and_store<py::object> storage;
    py::object obj = storage
                         .call_once_and_store_result(
                             []() -> py::object {
                                 return py::module_::import("foo").attr("bar");
                             }
                         )
                         .get_stored();

    ...
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130341
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-07-25 05:53:09 +00:00
59ef88ea5b [inductor] Fix flaky tests in test_pad_mm (#131699)
Summary: When run internally, some tests in `test_pad_mm.py` that require a big enough GPU to run with `max_autotune=True` fail because they get a smaller GPU than they need. Here we add `skipTest`s to skip the tests under these (rare) circumstances.

Differential Revision: D60192586

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131699
Approved by: https://github.com/chenyang78, https://github.com/shunting314, https://github.com/eellison
2024-07-25 05:46:45 +00:00
ee996cd63c [inductor] Fix flaky tests in test_benchmark_fusion.py (#131733)
Summary: Same as [#131699](https://github.com/pytorch/pytorch/pull/131699), but in `test_benchmark_fusion.py`.

Test Plan: Tested internally.

Differential Revision: D60211793

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131733
Approved by: https://github.com/oulgen
2024-07-25 05:39:14 +00:00
42a4df9447 Support CUDA nightly package in tools/nightly.py (#131133)
Add a new option `--cuda` to `tools/nightly.py` to pull the nightly packages with CUDA support.

```bash
# installs pytorch-nightly with cpuonly
tools/nightly.py pull

# The following only available on Linux and Windows
# installs pytorch-nightly with latest CUDA we support
tools/nightly.py pull --cuda

# installs pytorch-nightly with CUDA 12.1
tools/nightly.py pull --cuda 12.1
```

Also add targets in `Makefile` and instructions in constribution guidelines.

```bash
# setup conda environment with pytorch-nightly
make setup-env

# setup conda environment with pytorch-nightly with CUDA support
make setup-env-cuda
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131133
Approved by: https://github.com/ezyang
2024-07-25 05:33:52 +00:00
ceab3121de [inductor] Fix flaky tests in test_memory_planning.py (#131703)
Summary: Internally, the ABI-compatible mode is [enabled by default](eb54ca7abe/torch/_inductor/config.py (L53)). As a result, when the `abi_compatible: False` flag is not specified explicitly in tests that assume the non-ABI-compatible C++ codegen, those tests fail internally. Here we fix one such test in `test_memory_planning.py`.

Test Plan: Tested internally.

Differential Revision: D60197327

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131703
Approved by: https://github.com/eellison
2024-07-25 05:09:08 +00:00
cyy
35bb0d3638 Fix unsigned type bug in CUDACachingAllocator.cpp (#131464)
curr_block->size and block_state.size are both size_t, so whenever they are not equal a split will happen. According to the comment, it's better to use '>'.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131464
Approved by: https://github.com/eqy, https://github.com/ezyang
2024-07-25 04:48:05 +00:00
5f3f14e5e4 [BE] Annotate subgraph_lowering (#131545)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131545
Approved by: https://github.com/anijain2305, https://github.com/zou3519
2024-07-25 04:35:26 +00:00
00e19ae97a [MTIA] Support module.mtia() (#131499)
Summary: Following other device backends' implementation to support module.mtia() API.

Test Plan: OSS and Internal CIs.

Differential Revision: D60076584

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131499
Approved by: https://github.com/mikaylagawarecki
2024-07-25 04:23:48 +00:00
2ce734cee9 [BE] enable UFMT for torch/ao/quantization/ (#128863)
Part of #123062

- #123062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128863
Approved by: https://github.com/ezyang
ghstack dependencies: #128861, #128862
2024-07-25 04:17:54 +00:00
a2f6eb33d0 Register buffer in static input test (#131686)
Previously, without nn module inlining, dynamo would lift all tensor attributes on an nn module to be constants on the graph. With nn module inlining, these need to be registered as buffers explicitly.
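A minimal sketch of the adjustment being described (module and attribute names are illustrative):

```python
import torch
import torch.nn as nn

class Mod(nn.Module):
    def __init__(self):
        super().__init__()
        # With inlining, a plain tensor attribute is no longer lifted to a
        # graph constant, so register it as a buffer explicitly.
        # self.scale = torch.ones(8)            # old pattern
        self.register_buffer("scale", torch.ones(8))

    def forward(self, x):
        return x * self.scale

out = torch.compile(Mod())(torch.randn(4, 8))
```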

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131686
Approved by: https://github.com/anijain2305
2024-07-25 03:47:56 +00:00
cyy
62704db5c3 [Distributed] [10/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d/control_plane (#131671)
Follows #130109

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131671
Approved by: https://github.com/zou3519
2024-07-25 03:46:55 +00:00
2d7c135757 Bump setuptools from 69.5.1 to 70.0.0 in /tools/build/bazel (#130893)
Bumps [setuptools](https://github.com/pypa/setuptools) from 69.5.1 to 70.0.0.
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a href="https://github.com/pypa/setuptools/blob/main/NEWS.rst">setuptools's changelog</a>.</em></p>
<blockquote>
<h1>v70.0.0</h1>
<h2>Features</h2>
<ul>
<li>Emit a warning when <code>[tools.setuptools]</code> is present in <code>pyproject.toml</code> and will be ignored. -- by :user:<code>SnoopJ</code> (<a href="https://redirect.github.com/pypa/setuptools/issues/4150">#4150</a>)</li>
<li>Improved <code>AttributeError</code> error message if <code>pkg_resources.EntryPoint.require</code> is called without extras or distribution
Gracefully &quot;do nothing&quot; when trying to activate a <code>pkg_resources.Distribution</code> with a <code>None</code> location, rather than raising a <code>TypeError</code>
-- by :user:<code>Avasam</code> (<a href="https://redirect.github.com/pypa/setuptools/issues/4262">#4262</a>)</li>
<li>Typed the dynamically defined variables from <code>pkg_resources</code> -- by :user:<code>Avasam</code> (<a href="https://redirect.github.com/pypa/setuptools/issues/4267">#4267</a>)</li>
<li>Modernized and refactored VCS handling in package_index. (<a href="https://redirect.github.com/pypa/setuptools/issues/4332">#4332</a>)</li>
</ul>
<h2>Bugfixes</h2>
<ul>
<li>In install command, use super to call the superclass methods. Avoids race conditions when monkeypatching from _distutils_system_mod occurs late. (<a href="https://redirect.github.com/pypa/setuptools/issues/4136">#4136</a>)</li>
<li>Fix finder template for lenient editable installs of implicit nested namespaces
constructed by using <code>package_dir</code> to reorganise directory structure. (<a href="https://redirect.github.com/pypa/setuptools/issues/4278">#4278</a>)</li>
<li>Fix an error with <code>UnicodeDecodeError</code> handling in <code>pkg_resources</code> when trying to read files in UTF-8 with a fallback -- by :user:<code>Avasam</code> (<a href="https://redirect.github.com/pypa/setuptools/issues/4348">#4348</a>)</li>
</ul>
<h2>Improved Documentation</h2>
<ul>
<li>Uses RST substitution to put badges in 1 line. (<a href="https://redirect.github.com/pypa/setuptools/issues/4312">#4312</a>)</li>
</ul>
<h2>Deprecations and Removals</h2>
<ul>
<li>
<p>Further adoption of UTF-8 in <code>setuptools</code>.
This change regards mostly files produced and consumed during the build process
(e.g. metadata files, script wrappers, automatically updated config files, etc..)
Although precautions were taken to minimize disruptions, some edge cases might
be subject to backwards incompatibility.</p>
<p>Support for <code>&quot;locale&quot;</code> encoding is now <strong>deprecated</strong>. (<a href="https://redirect.github.com/pypa/setuptools/issues/4309">#4309</a>)</p>
</li>
<li>
<p>Remove <code>setuptools.convert_path</code> after long deprecation period.
This function was never defined by <code>setuptools</code> itself, but rather a
side-effect of an import for internal usage. (<a href="https://redirect.github.com/pypa/setuptools/issues/4322">#4322</a>)</p>
</li>
<li>
<p>Remove fallback for customisations of <code>distutils</code>' <code>build.sub_command</code> after long
deprecated period.
Users are advised to import <code>build</code> directly from <code>setuptools.command.build</code>. (<a href="https://redirect.github.com/pypa/setuptools/issues/4322">#4322</a>)</p>
</li>
<li>
<p>Removed <code>typing_extensions</code> from vendored dependencies -- by :user:<code>Avasam</code> (<a href="https://redirect.github.com/pypa/setuptools/issues/4324">#4324</a>)</p>
</li>
<li>
<p>Remove deprecated <code>setuptools.dep_util</code>.
The provided alternative is <code>setuptools.modified</code>. (<a href="https://redirect.github.com/pypa/setuptools/issues/4360">#4360</a>)</p>
</li>
</ul>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a href="5cbf12a9b6"><code>5cbf12a</code></a> Workaround for release error in v70</li>
<li><a href="9c1bcc3417"><code>9c1bcc3</code></a> Bump version: 69.5.1 → 70.0.0</li>
<li><a href="4dc0c31644"><code>4dc0c31</code></a> Remove deprecated <code>setuptools.dep_util</code> (<a href="https://redirect.github.com/pypa/setuptools/issues/4360">#4360</a>)</li>
<li><a href="6c1ef5748d"><code>6c1ef57</code></a> Remove xfail now that test passes. Ref <a href="https://redirect.github.com/pypa/setuptools/issues/4371">#4371</a>.</li>
<li><a href="d14fa0162c"><code>d14fa01</code></a> Add all site-packages dirs when creating simulated environment for test_edita...</li>
<li><a href="6b7f7a18af"><code>6b7f7a1</code></a> Prevent <code>bin</code> folders to be taken as extern packages when vendoring (<a href="https://redirect.github.com/pypa/setuptools/issues/4370">#4370</a>)</li>
<li><a href="69141f69f8"><code>69141f6</code></a> Add doctest for vendorised bin folder</li>
<li><a href="2a53cc1200"><code>2a53cc1</code></a> Prevent 'bin' folders to be taken as extern packages</li>
<li><a href="720862807d"><code>7208628</code></a> Replace call to deprecated <code>validate_pyproject</code> command (<a href="https://redirect.github.com/pypa/setuptools/issues/4363">#4363</a>)</li>
<li><a href="96d681aa40"><code>96d681a</code></a> Remove call to deprecated validate_pyproject command</li>
<li>Additional commits viewable in <a href="https://github.com/pypa/setuptools/compare/v69.5.1...v70.0.0">compare view</a></li>
</ul>
</details>
<br />

[![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=setuptools&package-manager=pip&previous-version=69.5.1&new-version=70.0.0)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`.

[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)

---

<details>
<summary>Dependabot commands and options</summary>
<br />

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
You can disable automated security fix PRs for this repo from the [Security Alerts page](https://github.com/pytorch/pytorch/network/alerts).

</details>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130893
Approved by: https://github.com/kit1980
2024-07-25 03:32:08 +00:00
d6115439be [MPS] Add SDPA implementation (#131362)
This work is based off @malfet's #119200
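A quick usage sketch on an MPS device (requires a Mac build that includes this change):

```python
import torch
import torch.nn.functional as F

if torch.backends.mps.is_available():
    q, k, v = (torch.randn(1, 8, 128, 64, device="mps") for _ in range(3))
    # Dispatches to the new MPS SDPA kernel instead of the fallback path.
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    print(out.shape)  # torch.Size([1, 8, 128, 64])
```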

Co-authored-by: Nikita Shulga <nikita.shulga@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131362
Approved by: https://github.com/kimishpatel
2024-07-25 03:24:37 +00:00
cyy
d98d00487d [2/N] Remove unused variables (#131468)
Follows #122496

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131468
Approved by: https://github.com/eqy, https://github.com/ezyang
2024-07-25 03:08:07 +00:00
cyy
538258bc13 [1/N] Fix clang-tidy warnings in jit (#131034)
Some some tidy warnings
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131034
Approved by: https://github.com/ezyang
2024-07-25 03:03:46 +00:00
cyy
46e42ae85d [4/N] Fix Wunused-parameter warnings (#131291)
Follows #131271

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131291
Approved by: https://github.com/ezyang
2024-07-25 02:59:22 +00:00
03979a599e [BE] enable UFMT for torch/ao/pruning/ (#128862)
Part of #123062

- #123062
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128862
Approved by: https://github.com/ezyang
ghstack dependencies: #128861
2024-07-25 02:49:35 +00:00
973a1362b9 [BE] enable UFMT for torch/ao/nn/ (#128861)
Part of #123062

- #123062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128861
Approved by: https://github.com/ezyang
2024-07-25 02:49:19 +00:00
c047bddbca [easy][dynamo] Update test for inline_inbuilt_n_modules (#131718)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131718
Approved by: https://github.com/williamwen42, https://github.com/mlazos
ghstack dependencies: #131694
2024-07-25 02:49:16 +00:00
01bc2a8165 [inline-inbuilt-nn-modules] Skip mobilenet_v2 test for cpu inductor (#131694)
Related issue https://github.com/pytorch/pytorch/issues/131693

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131694
Approved by: https://github.com/eellison
2024-07-25 02:49:16 +00:00
b5c006acac [BE][Easy] enable UFMT for torch/nn/ (#128865)
Part of #123062

- #123062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128865
Approved by: https://github.com/ezyang
2024-07-25 02:48:42 +00:00
cyy
8ea4c72eb2 [1/N] Fix clang-tidy warnings in aten/src/ATen/native/*.{cpp,h} (#130798)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130798
Approved by: https://github.com/ezyang
2024-07-25 02:36:43 +00:00
ab609d6aa6 [ts_convert] Update conversion for aten.tensor (#131549)
Fixes aten::tensor issues in edgeml models
P1492137675
| suite   |   #models |   #has_ts_model |   #has_sample_inputs |   #ts_can_run |   #can_convert |   #ep_result_correct |   #can_package |   #sigmoid_can_run |   #sigmoid_result_correct |
|---------|-----------|-----------------|----------------------|---------------|----------------|----------------------|----------------|--------------------|---------------------------|
| EDGEML  |        34 |              25 |                   23 |            21 |              2 |                    2 |              2 |                  2 |                         2 |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131549
Approved by: https://github.com/jiashenC, https://github.com/SherlockNoMad
2024-07-25 01:11:03 +00:00
e20fb5e975 [PTD][c10d] Include PG status into flight recorder (#131268)
We are considering consolidating the data sources for logging and the flight recorder so that we don't build multiple paths for debugging information. Before we do any merging, we first want to ensure that the PG status is also included in the flight recorder. We can also leverage this information to validate our FR dump; because the dump is not synced, we might see some variance in it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131268
Approved by: https://github.com/shuqiangzhang
2024-07-25 01:01:00 +00:00
c3fe9075a9 [ROCM] Use hipblaslt version from hipblaslt runtime instead of header for tunableops validator (#131078)
Summary:
When tunable ops load selected kernels from a CSV file, the validator checks the hipblaslt version defined in hipblaslt-version.h.

This PR changes the validator to fetch the hipblaslt version and revision from the hipblaslt runtime instead of the header file, as in our environment we might roll out a new version of the runtime prior to updating the header file fleet-wide.

Test Plan:
Verified generated tunableops kernel selection has the correct hipblaslt version from runtime:

```
Validator,PT_VERSION,2.5.0
Validator,ROCBLAS_VERSION,4.0.0-72e57364-dirty
Validator,HIPBLASLT_VERSION,800-bf2c3184
Validator,ROCM_VERSION,6.0.0.0-12969-1544e39
Validator,GCN_ARCH_NAME,gfx942:sramecc+:xnack-
GemmTunableOp_BFloat16_TN,tn_8192_2_3584,Gemm_Hipblaslt_TN_572,0.0240676
GemmTunableOp_BFloat16_TN,tn_7168_2_8192,Gemm_Hipblaslt_TN_482,0.0359019
GemmTunableOp_BFloat16_TN,tn_8192_2_1024,Default,0.0173723
GemmTunableOp_BFloat16_TN,tn_1280_2_8192,Gemm_Hipblaslt_TN_491,0.0191047
```

Differential Revision: D59889043

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131078
Approved by: https://github.com/jeffdaily, https://github.com/xw285cornell
2024-07-25 00:54:07 +00:00
cyy
803c5b8640 [CMake] Fix private compile options for CUDA code (#130546)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130546
Approved by: https://github.com/ezyang
2024-07-25 00:22:18 +00:00
7a42470bcb Annotate all InstructionTranslator (#131509)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131509
Approved by: https://github.com/zou3519
2024-07-24 23:45:53 +00:00
7535b23a25 [export] Fix set_grad hoo if output is empty (#131511)
Fixes https://fb.workplace.com/groups/1075192433118967/permalink/1467707973867409/

Test Plan: CI

Differential Revision: D60135531

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131511
Approved by: https://github.com/ydwu4
2024-07-24 23:17:20 +00:00
29c9f8c782 [export] Fix graph_break log registration error when importing export/_trace.py (#131523)
Summary:
When importing `_trace.py`, putting `torch._dynamo.exc.Unsupported` in the global variable `_ALLOW_LIST` can cause the import of `export/_trace.py` to fail with the error:

ValueError: Artifact name: 'graph_breaks' not registered, please call register_artifact('graph_breaks') in torch._logging.registrations.

The error is raised directly on the line `graph_breaks_log = torch._logging.getArtifactLogger(__name__, "graph_breaks")` in `_dynamo/exc.py`. I've checked that `register_artifact('graph_breaks')` already exists in torch._logging.registrations.

Explicitly calling `import torch._logging` doesn't fix the issue.

(see T196719676)

We move `_ALLOW_LIST` to be a local variable.

Test Plan:
buck2 test 'fbcode//mode/opt' fbcode//aiplatform/modelstore/publish/utils/tests:fc_transform_utils_test -- --exact 'aiplatform/modelstore/publish/utils/tests:fc_transform_utils_test - test_serialized_model_for_disagg_acc (aiplatform.modelstore.publish.utils.tests.fc_transform_utils_test.PrepareSerializedModelTest)'

buck2 test 'fbcode//mode/opt' fbcode//aiplatform/modelstore/publish/utils/tests:fc_transform_utils_test -- --exact 'aiplatform/modelstore/publish/utils/tests:fc_transform_utils_test - test_serialized_test_dsnn_module (aiplatform.modelstore.publish.utils.tests.fc_transform_utils_test.PrepareSerializedModelTest)'

Differential Revision: D60136706

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131523
Approved by: https://github.com/zhxchen17
2024-07-24 22:40:24 +00:00
236e06f9f9 Revert "Ensure staticmethods can be allowed in graph (#130882)"
This reverts commit 93fdd0237dcfe8cb4c65f3596aef123417b760a1.

Reverted https://github.com/pytorch/pytorch/pull/130882 on behalf of https://github.com/clee2000 due to torchrec test still broken internally D59945836 ([comment](https://github.com/pytorch/pytorch/pull/130882#issuecomment-2249003059))
2024-07-24 22:32:41 +00:00
5db5865614 Revert "Annotate all InstructionTranslator (#131509)"
This reverts commit eafbd20f23746aa6b9090d989a4ccb059f45297e.

Reverted https://github.com/pytorch/pytorch/pull/131509 on behalf of https://github.com/clee2000 due to sorry need to revert this to revert something else, I think you only need to rebase and remerge ([comment](https://github.com/pytorch/pytorch/pull/131509#issuecomment-2249000843))
2024-07-24 22:29:49 +00:00
a7e20ef7e4 [BE] Get rid of missing destructor override warning (#131204)
Regression introduced by https://github.com/pytorch/pytorch/pull/126376

Before this change, compiling torch_cpu on my MacBook prints tons of warnings every time HooksInterface is included
```
In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/src/optim/adamw.cpp:1:
In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/include/torch/optim/adamw.h:3:
In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/include/torch/nn/module.h:3:
In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/include/torch/nn/modules/container/any_module_holder.h:3:
In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/include/torch/nn/modules/container/any_value.h:3:
In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/include/torch/detail/static.h:4:
In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/include/torch/types.h:3:
In file included from /Users/nshulga/git/pytorch/pytorch/aten/src/ATen/ATen.h:7:
In file included from /Users/nshulga/git/pytorch/pytorch/aten/src/ATen/Context.h:13:
/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/detail/HIPHooksInterface.h:27:11: warning: '~HIPHooksInterface' overrides a destructor but is not marked 'override' [-Winconsistent-missing-destructor-override]
  virtual ~HIPHooksInterface() = default;
          ^
/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/detail/AcceleratorHooksInterface.h:16:11: note: overridden virtual function is here
  virtual ~AcceleratorHooksInterface() = default;
          ^
1 warning generated.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131204
Approved by: https://github.com/albanD, https://github.com/seemethere
2024-07-24 22:29:31 +00:00
b56939dae1 Annotate more InstructionTranslator (#131680)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131680
Approved by: https://github.com/zou3519
ghstack dependencies: #131676
2024-07-24 22:14:29 +00:00
f9322c26b2 Remove _export/exported_program.py (#131597)
Summary:
We removed references to _export/exported_program.py in executorch
in D60052318. Now we can remove this file.

Update the pin to executorch.

Test Plan: contbuild & OSS CI:

Differential Revision: D60072980

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131597
Approved by: https://github.com/avikchaudhuri
2024-07-24 22:04:17 +00:00
eb54ca7abe Revert "[BE] Get rid of missing destructor override warning (#131204)"
This reverts commit 8a890b72dc3e4dcd501060c2a2fee139c235a8b8.

Reverted https://github.com/pytorch/pytorch/pull/131204 on behalf of https://github.com/atalman due to sorry @malfet need to revert to make CI green, lets reland with ciflow/periodic label on ([comment](https://github.com/pytorch/pytorch/pull/131204#issuecomment-2248898033))
2024-07-24 21:08:49 +00:00
544f950d14 [BE] Improve error message when there are internal changes (#131547)
Fixes https://github.com/pytorch/test-infra/issues/4988
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131547
Approved by: https://github.com/xuzhao9, https://github.com/malfet, https://github.com/atalman
2024-07-24 20:38:08 +00:00
7f61324268 Add sparse block to flex_decoding kernel (#130884)
fix typo

Finish flex_decoding block sparse

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130884
Approved by: https://github.com/drisspg
2024-07-24 20:30:25 +00:00
b90aa18569 [aoti] Add initial custom op support (#127034)
Re-land of https://github.com/pytorch/pytorch/pull/125242

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127034
Approved by: https://github.com/malfet
2024-07-24 20:29:55 +00:00
44fdf24967 [BE] typing for decorators - jit/_decompositions (#131566)
See #131429
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131566
Approved by: https://github.com/oulgen, https://github.com/zou3519
2024-07-24 20:28:28 +00:00
2b83e4f8d7 [ROCm] Enable flex decoding unit tests (#131048)
Flex decoding tests are passing with upstream pytorch on MI300X/MI2XX.
Only flex attention unit tests have issues.

[result_mi250.log](https://github.com/user-attachments/files/16286954/result_mi250.log)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131048
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony, https://github.com/malfet
2024-07-24 20:25:34 +00:00
84cd062fb2 [2/3] 3D Composability - move pp tests (#129801)
pytorch (fsdp, tp, pp) -> pytorch (composable)
Move (fsdp, tp, pp) tests under pytorch into a composable folder

FSDP:
test/distributed/_composable/fsdp/test_fully_shard_training.py
-TestFullyShard2DTraining
DP:
test/distributed/tensor/parallel/test_ddp_2d_parallel.py
TP:
test/distributed/tensor/parallel/test_fsdp_2d_parallel.py
**PP:
test/distributed/pipelining/test_composability.py**

=>
distributed/_composable/test_composability/test_2d_composability.py
**distributed/_composable/test_composability/test_pp_composability.py**

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129801
Approved by: https://github.com/wconstab
ghstack dependencies: #129800
2024-07-24 20:17:54 +00:00
a9e6356271 [ONNX] Update torch.onnx.export API (#131501)
- Add a `kwargs` option; add the `dynamic_shapes` option so users can supply it directly to `torch.export`.
- Make the options keyword-only arguments (bc-breaking)
- Deprecate the `training` and `operator_export_type` options and include a warning message. The exact time for removal is TBD but the message should discourage users from using the options.
- Deprecate two functions `exportable_ops` and pretty print export

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131501
Approved by: https://github.com/titaiwangms
2024-07-24 20:03:17 +00:00
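A hedged sketch of the updated call shape the commit above describes. The `kwargs` and `dynamic_shapes` option names come from the commit text; the toy module, the `dynamo=True` flag, and the shape spec are illustrative assumptions rather than the PR's own example.

```python
import torch

class M(torch.nn.Module):
    def forward(self, x, *, bias):
        return x + bias

batch = torch.export.Dim("batch")
torch.onnx.export(
    M(),
    (torch.randn(2, 3),),                            # positional args
    "model.onnx",
    kwargs={"bias": torch.randn(3)},                 # keyword inputs, per the new option
    dynamic_shapes={"x": {0: batch}, "bias": None},  # forwarded to torch.export
    dynamo=True,                                     # assumed: the torch.export-based path
)
```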
9db567f17d [ONNX] Set dump_exported_program to True in bench (#131670)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131670
Approved by: https://github.com/titaiwangms
2024-07-24 20:02:03 +00:00
85fa66be04 Add rerun_disabled_tests for inductor (#131681)
Test in prod?

This also turns on the mem leak check

Briefly checked that
```
 python3 ".github/scripts/filter_test_configs.py" \
    --workflow "inductor" \
    --job-name "cuda12.1-py3.10-gcc9-sm86 / build" \
    --test-matrix "{ include: [
    { config: "inductor", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "inductor", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "inductor_distributed", shard: 1, num_shards: 1, runner: "linux.g5.12xlarge.nvidia.gpu" },
    { config: "inductor_huggingface", shard: 1, num_shards: 1, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "inductor_timm", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "inductor_timm", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "inductor_torchbench", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "inductor_torchbench", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "dynamic_inductor_huggingface", shard: 1, num_shards: 1, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "dynamic_inductor_timm", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "dynamic_inductor_timm", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "dynamic_inductor_torchbench", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "dynamic_inductor_torchbench", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "aot_inductor_huggingface", shard: 1, num_shards: 1, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "aot_inductor_timm", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "aot_inductor_timm", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "aot_inductor_torchbench", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "aot_inductor_torchbench", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "inductor_cpp_wrapper_abi_compatible", shard: 1, num_shards: 1, runner: "linux.g5.4xlarge.nvidia.gpu" },
  ]}
  " \
    --selected-test-configs "" \
    --pr-number "${PR_NUMBER}" \
    --tag "${TAG}" \
    --event-name "schedule" \
    --schedule "29 8 * * *" \
    --branch "${HEAD_BRANCH}"
```
has rerun disabled tests option in the test matrix

I don't think all these things need to run but I'm not sure which ones (probably just inductor?)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131681
Approved by: https://github.com/zou3519
2024-07-24 19:56:00 +00:00
65ce2bf465 Allow setting PYTHON_LIB_REL_PATH via environment variable (#128419)
This allows builds to customize the location where caffe2's Python modules are installed.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128419
Approved by: https://github.com/PaliC, https://github.com/d4l3k, https://github.com/malfet
2024-07-24 19:49:06 +00:00
074b46b7d9 [1/3] 3D Composability - move fsdp tests (#129800)
pytorch (fsdp, tp, pp) -> pytorch (composable)
Move (fsdp, tp, pp) tests under pytorch into a composable folder

**FSDP:
test/distributed/_composable/fsdp/test_fully_shard_training.py
-TestFullyShard2DTraining**
DP:
test/distributed/tensor/parallel/test_ddp_2d_parallel.py
TP:
test/distributed/tensor/parallel/test_fsdp_2d_parallel.py
PP:
test/distributed/pipelining/test_composability.py

=>
**distributed/_composable/test_composability/test_2d_composability.py**
distributed/_composable/test_composability/test_pp_composability.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129800
Approved by: https://github.com/awgu
2024-07-24 19:47:34 +00:00
e0f1bf14a4 Fully type torch/utils/_config_module.py (#131676)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131676
Approved by: https://github.com/zou3519
2024-07-24 19:36:09 +00:00
05681b6838 Migrate missed experimental jobs to Amazon2023 AMI (#131485)
Adding in a few jobs that got missed in https://github.com/pytorch/pytorch/pull/131250

Those jobs have passed with the new AMI:
https://github.com/pytorch/pytorch/actions/runs/10063808680/job/27820050195?pr=131485
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131485
Approved by: https://github.com/atalman, https://github.com/malfet
2024-07-24 19:33:02 +00:00
05064f2827 [CI] Move all ROCm jobs to periodic frequency (#131637)
`inductor` and `rocm` workflows are the major contributors to the CI load on ROCm CI at the moment, resulting in huge backlogs: https://github.com/pytorch/pytorch/pull/131489#issue-2425804464

* Move rocm.yml to cron frequency
* Move ROCm CI jobs from inductor.yml to inductor-rocm.yml
* Introduce `ciflow/inductor-rocm` as PR label to manually invoke inductor jobs for ROCm (no automatic invoking to limit CI load)
* After this PR, only `trunk` workflow jobs for ROCm will run on every commit and PR merge; since they take about 45min*3 on average, I decided to leave them as-is, as they provide some basic insulation against ROCm breakage.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131637
Approved by: https://github.com/clee2000, https://github.com/atalman, https://github.com/huydhn
2024-07-24 19:26:58 +00:00
8aff6caf67 [CI][dashboard] Rename cpu-x86 to cpu_x86 (#131658)
Summary: '-' is used as a special separator by upload_dynamo_perf_stats.py, so switch to '_' instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131658
Approved by: https://github.com/huydhn
2024-07-24 19:16:52 +00:00
3ce6f61416 [AOTI] Support fallback ops not in inductor_fallback_ops (#131247)
Summary: For aten ops that are not listed in inductor_fallback_ops, AOTI will use the proxy executor to execute them instead of erroring out due to a missing C shim implementation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131247
Approved by: https://github.com/angelayi
2024-07-24 19:16:43 +00:00
aeca9845a6 Migrate Lint jobs to Amazon 2023 AMI (#131514)
Continuing in the same vein as https://github.com/pytorch/pytorch/pull/131250, migrate all self-hosted lint.yml jobs to use the new Amazon 2023 AMI
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131514
Approved by: https://github.com/clee2000, https://github.com/malfet, https://github.com/huydhn
2024-07-24 19:11:02 +00:00
b98b3127f7 [easy][pytorch][counters] Move WaitCounter in c10/util (#131021)
Summary: Since the WaitCounter frontend itself has minimal dependencies, it's fine to move it into c10. Specific backends can be registered/linked separately.

Test Plan: unit test

Reviewed By: jamesperng, asiab4, c-p-i-o

Differential Revision: D59842868

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131021
Approved by: https://github.com/asiab4
2024-07-24 18:38:33 +00:00
7718024d2b [3.13] support 3.13 multiline traces in munge_exc (#131207)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131207
Approved by: https://github.com/jansel, https://github.com/anijain2305
ghstack dependencies: #131206
2024-07-24 18:22:30 +00:00
f0378912a0 [3.13, dynamo] fix test/dynamo/test_bytecode_utils.py (#131206)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131206
Approved by: https://github.com/jansel, https://github.com/anijain2305
2024-07-24 18:22:30 +00:00
a86909d251 [inductor] Type annotate constant_folding.py (#131364)
Summary: Type annotate constant_folding.py

Test Plan: mypy

Reviewed By: angelayi

Differential Revision: D60063872

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131364
Approved by: https://github.com/angelayi
2024-07-24 18:20:06 +00:00
8fe5b93667 support zb1p and zb2p algorithms (#130752)
Previously, we had proved that ZB2P is not truly zero-bubble when num_local_stages exceeds 4, so only ZB1P was supported.

We made a few tweaks to ZB2P to make it truly zero-bubble. The algorithm and proof are attached.
[zero_bubble.pdf](https://github.com/user-attachments/files/16238738/zero_bubble.pdf)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130752
Approved by: https://github.com/H-Huang
2024-07-24 17:58:46 +00:00
5e6cfb7db5 Add an extra shard for distributed periodic jobs (#131498)
Fixes issue of timeouts being observed in ROCm periodic workflow for distributed runs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131498
Approved by: https://github.com/jithunnair-amd, https://github.com/pruthvistony, https://github.com/clee2000
2024-07-24 16:44:53 +00:00
106c6a49f5 [dynamo] limit number of compiles per frame (#130891)
Fixes https://github.com/pytorch/pytorch/issues/130776

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130891
Approved by: https://github.com/anijain2305
2024-07-24 16:43:40 +00:00
abcd329359 [BE] typing for decorators - onnx/symbolic_helper (#131565)
See #131429
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131565
Approved by: https://github.com/justinchuby, https://github.com/oulgen, https://github.com/zou3519, https://github.com/titaiwangms
2024-07-24 16:39:47 +00:00
0e71a88f9b Support IPC for Expandable Segments (#130890)
This reapplication commit is the same as before except it resolves a build error in an internal build where `handle` was shadowed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130890
Approved by: https://github.com/dsjohns2
2024-07-24 15:45:40 +00:00
eb5883f8aa Add new runner labels to actionlint (#131525)
Adding the labels corresponding to the Amazon2023 ami
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131525
Approved by: https://github.com/atalman
2024-07-24 15:28:59 +00:00
72d17d95d7 [inductor] Enable dynamo for Windows. RC1 (#131286)
Changes:
1. Enable Windows in `check_if_inductor_supported`.
2. Disable Windows in `AotCodeCompiler`.
3. Force Windows inductor to `c++20` to support `std::enable_if_t`.
4. Disable the `test_x86inductor_quantizer` UT on `Windows` temporarily; it still has some issues that need to be fixed: https://github.com/pytorch/pytorch/pull/131308

Based on this PR, I have successfully run the first model, `resnet18`, with the Windows inductor.
<img width="1036" alt="image" src="https://github.com/user-attachments/assets/2642bda1-1845-417a-aaba-39bdf22e65d6">

TODO:
1. Upgrade pytorch Windows build to `c++20`.
2. Fix and re-enable `test_x86inductor_quantizer` UT on `Windows`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131286
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-07-24 15:26:55 +00:00
4c7f22dee2 [BE] remove unnecessary _dispatch_sqrt by using ** 0.5 (#131358)
Based on the discussion here, where `** 0.5` is not slower than `math.sqrt`: https://github.com/pytorch/pytorch/pull/129905#discussion_r1675605075

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131358
Approved by: https://github.com/albanD
2024-07-24 14:58:57 +00:00
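A quick illustrative micro-benchmark (not from the PR) of the comparison referenced above; absolute numbers vary by machine, but `** 0.5` is generally on par with `math.sqrt` for Python floats.

```python
import math
import timeit

x = 1.2345
print("math.sqrt:", timeit.timeit(lambda: math.sqrt(x), number=1_000_000))
print("** 0.5   :", timeit.timeit(lambda: x ** 0.5, number=1_000_000))
```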
98984422eb [triton_op] fix autotuning (#131363)
The problem was we were shoving SymInts into the constant_args side
table. The root problem is that torch.fx.node.base_types, which we use
to determine what can be put in the graph, doesn't actually have SymInt
in it. This PR fixes base_types to include SymInt.

Test Plan:
- tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131363
Approved by: https://github.com/oulgen, https://github.com/justinchuby
2024-07-24 14:03:37 +00:00
bc938184de [FSDP2] Added set_reduce_scatter_divide_factor (#129286)
This PR adds an API `FSDPModule.set_reduce_scatter_divide_factor` to allow setting a custom gradient divide factor for reduce-scatter. This can be useful when using parallelisms in combination with FSDP (e.g. expert parallelism), where gradients need to be divided by a custom factor (e.g. an extra `EP` factor).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129286
Approved by: https://github.com/weifengpy
2024-07-24 12:42:35 +00:00
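A minimal usage sketch of the API named in the commit above, assuming a torchrun-launched process with an initialized process group and CUDA available; the model and the custom factor are illustrative.

```python
import torch
import torch.distributed as dist
from torch.distributed._composable.fsdp import fully_shard

dist.init_process_group("nccl")
model = torch.nn.Linear(1024, 1024, device="cuda")
fully_shard(model)  # model now exposes FSDPModule methods

custom_factor = 4.0  # hypothetical factor, e.g. an extra expert-parallel degree
model.set_reduce_scatter_divide_factor(custom_factor)
```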
8ffd109a00 Revert "Fix py codegen to delete values that don't have any users (#131028)"
This reverts commit 466c167b71e6021f8eadcfbae1d9156a375663ce.

Reverted https://github.com/pytorch/pytorch/pull/131028 on behalf of https://github.com/atalman due to breaks CI ([comment](https://github.com/pytorch/pytorch/pull/131028#issuecomment-2247771530))
2024-07-24 12:21:43 +00:00
451462dbff [1/N] Add missing constructors or assignment operators (#131077)
Just mark them as deleted in most cases.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131077
Approved by: https://github.com/ezyang
2024-07-24 12:09:39 +00:00
0c6f1ca064 Introduce torch._dynamo.config.enable_compiler_collectives for syncing compilation across ranks (#130935)
This PR implements an opt-in configuration option for synchronizing compilation across all ranks at the end of Dynamo tracing (and potentially, other places in the future). There are two pieces to this PR:

1. Implementing infrastructure for compiler collectives (DistributedState/LocalState, the actual collective)
2. Using this infrastructure to synchronize automatic dynamic choices across all ranks

The infrastructure in part one can be used for other purposes, just add more (serializable) fields to LocalState.

Here is how automatic dynamic synchronization works:

1. Preflight in "torch/_dynamo/variables/builder.py": On the first Dynamo trace run, we trace without automatic dynamic at all; we assume all Tensor inputs that are not otherwise marked are static. This run is purely to collect all Tensor input sizes in the program.
2. torch/_dynamo/output_graph.py: At the end of the first Dynamo trace run, we perform a compiler collective to distribute all Tensor input sizes to all ranks. Then, we restart Dynamo
3. Apply the updates in "torch/_dynamo/variables/builder.py": Now that we have all sizes for every rank, we now update frame state with the observed sizes for all ranks, in rank order. Under the assumption that frame state is consistent on all ranks, this series of updates will preserve consistency.

For future work, it would be safer if we force a consistent hint on all ranks; this is more involved as we have to interpose in fakification.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130935
Approved by: https://github.com/jansel
2024-07-24 11:24:11 +00:00
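A minimal opt-in sketch for the feature above. The config name comes from the commit title; the process-group backend, model, and shapes are illustrative, and a real run launches one process per rank.

```python
import torch
import torch.distributed as dist

torch._dynamo.config.enable_compiler_collectives = True

dist.init_process_group("nccl")
model = torch.nn.Linear(16, 16).cuda()
compiled = torch.compile(model)

# On the first trace Dynamo collects tensor input sizes, performs a compiler
# collective to share them across ranks, then restarts tracing so every rank
# makes the same automatic-dynamic choices.
out = compiled(torch.randn(8, 16, device="cuda"))
```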
85d3ee1d67 [micro_pipeline_tp] refactor all-gather and reduce-scatter pattern matchers to be more flexible and testable (#131409)
High level goals:
- Cover the all-gather and reduce-scatter pattern matchers with unit tests
- Make it easier to exclude certain collectives as async-tp candidates
- Make it easier to match other all-gather and reduce-scatter variants (e.g. fp8 collectives)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131409
Approved by: https://github.com/weifengpy
2024-07-24 11:16:27 +00:00
89d5391bbf [inductor] Kill mark_node_as_mutating (#130834)
Resubmit of #129346

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130834
Approved by: https://github.com/lezcano
ghstack dependencies: #130832, #130833
2024-07-24 11:11:19 +00:00
6415c45da5 [inductor] Use multiple outputs for flex-attention (#130833)
Resubmit of #129344

This fixes the DCE issue for attention output

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130833
Approved by: https://github.com/lezcano
ghstack dependencies: #130832
2024-07-24 11:11:19 +00:00
95c248751b [inductor] Make UserDefinedTritonKernel a multi-output operation (#130832)
Resubmit of #129325

Previously each mutation was represented by a `MutationOutput` operation which
was a new scheduler node that must be scheduled immediately afterwards.

Now we have a single scheduler node, which produces multiple `MutationOutput`
buffers as its output.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130832
Approved by: https://github.com/lezcano
2024-07-24 11:11:14 +00:00
a4c3f29047 [ONNX][BE] Remove ruff skips in torch/onnx (#131368)
Remove all ruff skips for torch/onnx since we do not do runtime type checking anymore.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131368
Approved by: https://github.com/titaiwangms, https://github.com/Skylion007
2024-07-24 10:56:43 +00:00
62e566b345 [BE] Remove suppression of inconsistent missing overrides (#131524)
This should prevent regressions like the ones fixed by https://github.com/pytorch/pytorch/pull/131204

- Remove global `-Wno-error=inconsistent-missing-override`
- Wrap offending includes (protobuf and asmjit) with `C10_DIAGNOSTIC_PUSH_AND_IGNORE` and `C10_DIAGNOSTIC_POP_AND_IGNORED`
- Add `override` keyword to `at::namespace::tunable::StreamTimer` and `LLVMCodeGenImpl`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131524
Approved by: https://github.com/atalman
2024-07-24 10:07:36 +00:00
83d19620f6 kill tmp _is_executorch flag (#131488)
Test Plan: existing tests

Differential Revision: D60126186

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131488
Approved by: https://github.com/ydwu4
2024-07-24 08:51:37 +00:00
1e34870796 [CI][dashboard][reland] Collect PT2 cpu perf nightly (#131560)
Summary: Add a workflow similar to inductor-perf-test-nightly.yml but use x86 metal instances for perf measurement. The data processing and dashboard update will come next.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131560
Approved by: https://github.com/huydhn
2024-07-24 08:50:33 +00:00
276b5238ef [bug] Add is_compiling check for optimizers to avoid untracked tensor during graph tracing (#130909)
Hey folks, I was using the `stateless_func` [here](7c45476d38/torch/distributed/_spmd/api.py (L435)), which worked well before [this commit](https://github.com/pytorch/pytorch/pull/111084), but that change introduced a `_tensor_constant0` and made this func non-stateless. Since there is no way to retrieve this constant tensor before compilation and performance is not an issue when tracing a graph, I think it might be good to fall back to the other branch.
![image](https://github.com/user-attachments/assets/6ee4487d-456b-47e0-8c1d-66cb5a641d47)

![image](https://github.com/user-attachments/assets/1ed46502-e50e-45c4-9751-49aa5a4590ae)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130909
Approved by: https://github.com/mlazos
2024-07-24 08:29:27 +00:00
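A generic illustration of the `is_compiling` guard the commit above relies on; the function is illustrative, not the actual optimizer code.

```python
import torch

@torch.compile
def f(x):
    if torch.compiler.is_compiling():
        # branch taken while Dynamo traces the function
        return x + 1
    return x - 1

print(f(torch.ones(2)))               # tensor([2., 2.]) from the traced branch
print(torch.compiler.is_compiling())  # False in plain eager code
```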
cyy
41189b0da4 Simplify THPEvent_get_device (#131466)
Because self->event.device() always returns Device.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131466
Approved by: https://github.com/albanD
2024-07-24 08:24:01 +00:00
e782918b8e [NestedTensor] Add example NestedTensor objects with inner dimension of size 1 to tests reducing along jagged dimension for NestedTensor (#131516)
Add example `NestedTensor`s with inner dimension of size `1` to `_get_example_tensor_lists` with `include_inner_dim_size_1=True`. This diff creates `NestedTensor`s of sizes `(B, *, 1)` and `(B, *, 5, 1)`, ensuring that the current implementations of jagged reductions for `sum` and `mean` hold for tensors of effective shape `(B, *)` and `(B, *, 5)`.

Differential Revision: [D59846023](https://our.internmc.facebook.com/intern/diff/D59846023/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131516
Approved by: https://github.com/davidberard98
2024-07-24 07:01:39 +00:00
e9db1b0597 Add flag to ignore unsupported @triton.autotune args in user-written kernel compilation (#131431)
Summary: We currently don't support some of the `@triton.autotune` arguments when compiling user-written Triton kernels with PT2. In this PR, we're adding a flag to circumvent it. This is to unblock internal compilation in some cases. The flag is supplied with the docs mentioning why it is not a good idea to set it.

Test Plan:

```
python test/inductor/test_triton_kernels.py -k test_triton_kernel_
autotune_with_unsupported_args
...
----------------------------------------------------------------------
Ran 3 tests in 3.636s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131431
Approved by: https://github.com/oulgen, https://github.com/zou3519
2024-07-24 05:37:09 +00:00
eafbd20f23 Annotate all InstructionTranslator (#131509)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131509
Approved by: https://github.com/zou3519
2024-07-24 05:31:01 +00:00
5772c13f56 Dont wrap negative indexing in scatter reduce (#131503)
Fix for https://github.com/pytorch/pytorch/issues/131321

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131503
Approved by: https://github.com/shunting314
2024-07-24 04:01:32 +00:00
9f96d4b61b Disable inlining on cudagraph fallback tests (#131557)
The cudagraph fallback tests should only run without nn module inlining. The [rerecord limit](fc3d2b26cd/torch/_inductor/cudagraph_trees.py (L1922)) is ignored if nn module inlining is disabled. Arguably it should just be higher, but this PR addresses the failures and allows inlining to be on by default on main.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131557
Approved by: https://github.com/anijain2305
ghstack dependencies: #131556
2024-07-24 04:00:02 +00:00
9575b1afad Ensure tensor dict is populated with compiled autograd (#131556)
The issue addressed is that compiled autograd changes the calling convention of the FX graph to only have a single placeholder which contains a list of inputs. In this case, the meta of the tensor input nodes don't contain the `tensor_dict` meta. This adds them.

The context is that `tensor_dict` is used to convey if a tensor is an input with a static address.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131556
Approved by: https://github.com/anijain2305
2024-07-24 04:00:02 +00:00
dffbd3a1e2 Add mypy typing to pattern_matcher (#131506)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131506
Approved by: https://github.com/zou3519
2024-07-24 02:55:43 +00:00
7124efa81b Include _native.h for structured_native_functions (#131208)
In gen.py, the code for generating CompositeViewCopyKernels.cpp includes *_native.h headers for "view_groups" but not "structured_native_functions". However, this results in the TORCH_API in the headers being ineffective and prevents such functions from being used outside libtorch_cpu.so.

This patch ensures that gen.py includes the native headers for "structured_native_functions" in the same way as for "view_groups".

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131208
Approved by: https://github.com/bdhirsh
2024-07-24 02:55:36 +00:00
31da9ee711 Use explain function to provide more meaningful information when conversion failed. (#131214)
Summary: In the script for testing different families of models, when the conversion fails, we switch to using the output of the explain function to provide more meaningful information.

Test Plan:
Manual testing with attatched log information.

```
buck2 run mode/dev-nosan sigmoid/inference/ts_migration:main -- --mode test_all --test_suites ads_merge --model_id 440779101
```

```
Processing 440779101_5455.predictor.disagg.gpu.merge

    model_name: 440779101_5455.predictor.disagg.gpu.merge
    has_ts_model: True
    has_sample_inputs: True
    ops_maybe_missing_meta: set()
    ts_can_run: True
    ts_run_exception: None
    can_convert: False
    convert_exception: Unsupported nodes are found in the following list:

        0. prim::Loop [%14259 : int = prim::Loop(%14258, %1129, %1126), scope: torch.fx.graph_module.GraphModule:: # <torch_package_1>.caffe2/torch/fb/predictor/modules/tensors_to_device_module.py:100:19]

        1. prim::Loop [%14326 : int = prim::Loop(%1115, %1129, %14259), scope: torch.fx.graph_module.GraphModule:: # <torch_package_1>.caffe2/torch/fb/predictor/modules/tensors_to_device_module.py:100:19]
    ep_result_correct: None
    ep_run_exception: None
    can_package: None
    package_exception: None
    sigmoid_can_run: None
    sigmoid_run_exception: None
    sigmoid_result_correct: None
```

Reviewed By: SherlockNoMad

Differential Revision: D59971446

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131214
Approved by: https://github.com/angelayi
2024-07-24 02:42:18 +00:00
0ceaabaf71 [easy][inline-inbuilt-nn-modules] Update test (#131563)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131563
Approved by: https://github.com/mlazos
ghstack dependencies: #131347, #131367, #131378, #131389, #131405, #131480, #131512
2024-07-24 02:32:19 +00:00
0e780a7d69 [BE] Remove some mypy allow-untyped-decorators that are no longer needed (#131564)
See #131429
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131564
Approved by: https://github.com/oulgen
2024-07-24 02:00:08 +00:00
abb313b466 [torch.mtia] Noop set_rng_state and get_rng_state APIs (#130873)
Summary: As title

Test Plan: CI tests

Reviewed By: joebos

Differential Revision: D59036602

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130873
Approved by: https://github.com/hanzlfs
2024-07-24 01:52:21 +00:00
aa1c78c7e9 [PTD][c10d][EZ] LOG error for nccl error rather than info (#131483)
Summary: As title; when we get an NCCL exception, we should log it as an error, not info.

Test Plan: CI

Reviewed By: csmodlin, rmiao

Differential Revision: D60123773

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131483
Approved by: https://github.com/fegin
2024-07-24 01:08:00 +00:00
466c167b71 Fix py codegen to delete values that don't have any users (#131028)
Fixes #131025

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131028
Approved by: https://github.com/ezyang
2024-07-24 01:03:56 +00:00
14495ce288 [BE][MPS] Use isOperatingSystemAtLeastVersion: (#131513)
Instead of trying to come up with different checks for classes resonding to selectors

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131513
Approved by: https://github.com/atalman
2024-07-24 00:54:25 +00:00
76f7b3e560 [inductor][cpp][gemm] improve thread blocking heuristics (#131024)
This PR improves the thread blocking heuristics to favor full occupancy as much as possible. Also, the "m x n" block size is made as square as possible for better data reuse.

Take the shape M=200000, N=64, K=128 as an example: the original heuristics couldn't use up all the threads when the number of threads is large, say 60:
AUTOTUNE linear_unary(200000x128, 64x128, 64)
  _linear_pointwise 0.1010 ms 100.0%
  cpp_packed_gemm_0 0.8303 ms 12.2%
V0722 02:26:39.220660 302553 torch/_inductor/codegen/cpp_gemm_template.py:503] [0/0] Register blocking: GemmBlocking(block_m=32, block_n=32, block_k=32)
V0722 02:26:39.221042 302553 torch/_inductor/codegen/cpp_gemm_template.py:507] [0/0] Cache blocking: GemmBlocking(block_m=625, block_n=1, block_k=4)
V0722 02:26:39.221118 302553 torch/_inductor/codegen/cpp_gemm_template.py:509] [0/0] Thread blocking: GemmBlocking(block_m=625, block_n=1, block_k=4)
V0722 02:26:39.221252 302553 torch/_inductor/codegen/cpp_gemm_template.py:526] [0/0] Number of threads: 60, occupancy: (10, 2, 1)

After this PR:
AUTOTUNE linear_unary(200000x128, 64x128, 64)
  _linear_pointwise 0.1143 ms 100.0%
  cpp_packed_gemm_0 0.1228 ms 93.1%
V0722 02:29:49.261794 304201 torch/_inductor/codegen/cpp_gemm_template.py:309] [0/0] Register blocking: GemmBlocking(block_m=32, block_n=32, block_k=32)
V0722 02:29:49.262860 304201 torch/_inductor/codegen/cpp_gemm_template.py:313] [0/0] Cache blocking: GemmBlocking(block_m=64, block_n=1, block_k=8)
V0722 02:29:49.262951 304201 torch/_inductor/codegen/cpp_gemm_template.py:315] [0/0] Thread blocking: GemmBlocking(block_m=69, block_n=79, block_k=8)
V0722 02:29:49.263075 304201 torch/_inductor/codegen/cpp_gemm_template.py:332] [0/0] Number of threads: 60, occupancy: (15, 4, 1)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131024
Approved by: https://github.com/leslie-fang-intel, https://github.com/chunyuan-w
2024-07-24 00:36:29 +00:00
fdc9a1404e Remove _BLACK_LISTED_OPS (#131361)
Summary: remove _BLACK_LISTED_OPS after https://github.com/pytorch/pytorch/pull/100749

Test Plan: contbuild & OSS CI

Differential Revision: D60056130

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131361
Approved by: https://github.com/angelayi
2024-07-24 00:15:27 +00:00
2cf220956a [inductor] fix CacheBase.get_system on AMD (#131365)
Summary: CacheBase.get_system on AMD is missing device name and hip version, fix that

Test Plan:
on AMD:
```
buck run fbcode//mode/opt-amd-gpu scripts/nmacchioni/repros/amd_cache_key:repro
{'device': {'name': 'gfx942:sramecc+:xnack-'}, 'version': {'triton': '3.0.006965bceb379c60d8184a4166f502457952938167bfb69592ebf48abebfb0ce9-4856d26164925fd955c779d8f67ecf47cc5754052b008714b3a580d708b13dd8-06965bceb379c60d8184a4166f502457952938167bfb69592ebf48abebfb0ce9-23d635e690d670bf61798e1259674b78c0ed5ba222ab6a455f329f27a758fc2d-e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855-166fbf4e6f8845f354611638861a2a9e1dc2654224c278e10b566f09549dae7e-ccd93feaad4c82c8c1604557340de15fda0a3c84fe83f5a4d1e12a07a77bf3f4-cf28658fa328f7f283ec4e6ccc6c48d7c2a8ddbdf5134d3eb35c9b38ce4ace44-b9d80690b3109c2aaf5ece450d62e93b37eb6ab38552089794b3bb36e36a22b3-36130a37af1b19a0dec569aa08d30b00c74c8f02b6b632999d86dea169146792-4a620da64e0c263067f0dbf6c721f5214a5ac315625a07dd98520502ddf7e22f-6ace95666f6a4ecd2b1a7fc7ae865d1a9239608bd020cb6e4b8d15233c2dd9b3', 'hip': '6.0.32830'}, 'hash': 'c4db04316e15953dda8648f5a43a3f208f2c0ba454666cc7d78e40527aab85ec'}
```

on Nvidia:
```
buck run fbcode//mode/opt scripts/nmacchioni/repros/amd_cache_key:repro
{'device': {'name': 'NVIDIA PG509-210'}, 'version': {'triton': '6de41ec76ecad84e618d692e6793a4ebe707ae68a0c033a222105daa72214d7c', 'cuda': '12.0.0'}, 'hash': 'b58d0aa37d80fc2932c1b7576ca876b77aa1258db1c14e27d1f201bd15376faf'}
```

Differential Revision: D60062972

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131365
Approved by: https://github.com/eellison
2024-07-24 00:11:59 +00:00
480ae51f85 [pytree] Only import optree if it's used (#131478)
torch.utils._pytree imports optree if it's available. Instead, we change it so that optree is only imported when it actually gets used. The motivation for this is better isolation.

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131478
Approved by: https://github.com/albanD
2024-07-24 00:10:49 +00:00
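A generic lazy-import sketch of the isolation idea described above; it is illustrative, not the actual torch.utils._pytree code, and the helper names are made up.

```python
def _import_optree():
    # Defer the optional dependency to first use instead of module import time.
    try:
        import optree
    except ImportError as e:
        raise RuntimeError("optree is required for the C++ pytree backend") from e
    return optree

def cxx_tree_flatten(tree):
    optree = _import_optree()
    return optree.tree_flatten(tree)  # returns (leaves, treespec)
```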
6850e42266 [dynamo][exception] Remove older specialization for StopIteration (#131512)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131512
Approved by: https://github.com/yanboliang
ghstack dependencies: #131347, #131367, #131378, #131389, #131405, #131480
2024-07-24 00:06:53 +00:00
e2b941a1b4 [dynamo] Rename TENSOR_ALIASING to OBJECT_ALIASING. Permit OBJECT_ALIASING for dict guards (#131480)
Fixes https://github.com/pytorch/pytorch/issues/129667

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131480
Approved by: https://github.com/williamwen42
ghstack dependencies: #131347, #131367, #131378, #131389, #131405
2024-07-24 00:06:53 +00:00
e39f136c35 [debug][dtensor] implemented activation checkpointing differentiation (#130996)
**Summary**
While trying to integrate CommDebugMode with TorchTitan, I realized that the forward_hooks were being registered even though it was in the backward pass. After investigating, I realized that it was activation checkpointing that was causing this. In order to prevent users from being confused, I edited CommDebugMode so that it could differentiate between backward pass operations and activation checkpointing operations. I have also added an example case showing that CommDebugMode is able to successfully differentiate between the backward pass and activation checkpointing. The output for the example can be seen below.

**Test Case**
torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e activation_checkpointing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130996
Approved by: https://github.com/XilunWu
ghstack dependencies: #131419
2024-07-23 23:44:56 +00:00
7b375c3682 [dtensor][debug] changed which module tracker I inherited from to fix bug with activation checkpointing (#131419)
**Summary**
I switched the module tracker I inherit from, replacing PyTorch's all-purpose one with the one written by Sanket in the distributed tools folder. I did this because the original one messed up activation checkpointing by adding itself to the parent set in the backward_pre_hook and then again in the forward_pre_hook for activation checkpointing.

**Test Case**
pytest test/distributed/_tensor/debug/test_comm_mode_features.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131419
Approved by: https://github.com/XilunWu
2024-07-23 23:44:56 +00:00
161c18ed0b SymmetricMemory-based, low contention intra-node all-gather and reduce-scatter (#130583)
```python
# NOTE [low-contention collectives]
# When a collective is overlapped with abundant compute, it makes sense to
# prioritize reducing the contention between the collective and the overlapped
# compute, even at the cost of a slightly slower collective.
#
# Common collective implementations (e.g., NCCL without user buffer
# registration) optimize for throughput with no ambient compute. However, such
# implementations may not be optimal when they are overlapped with compute:
# - These implementations typically fuse the entire collective into a single
# kernel and reserve SM resources based on the most demanding portion of the
# collective, even when a large portion of the collective does not require this
# much resource.
# - These implementations often use SM-based P2P copy as opposed to copy
# engine-based P2P copy. Copy engine-based P2P copy may not have a significant
# advantage when there's no ambient compute. However, it may significantly
# improve overall resource utilization in the presence of ambient compute.
#
# When overlapped with intensive compute (e.g., persistent matmul kernels), the
# SM-usage of a collective can lead to inefficient overlapping.
#
# Low-contention collectives achieve their goals with the following strategies:
# - Use copy engine-based copy whenever possible.
# - Break down portions of a collective with different resource requirements
# into multiple kernels. This improves the overlapping efficiency at the cost
# of additional launching overhead.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130583
Approved by: https://github.com/weifengpy
2024-07-23 23:37:48 +00:00
1930698140 Fix fake tensor SymInt caching when there's a SymInt storage_offset (#131500)
Test Plan: Internal unit tests failed before and succeeded after.

Differential Revision: D60131273

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131500
Approved by: https://github.com/clee2000
2024-07-23 23:37:04 +00:00
fc3d2b26cd Use fake PG for test_compute_comm_reordering.py unit tests (#131415)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131415
Approved by: https://github.com/yifuwang
2024-07-23 22:53:23 +00:00
980bb54361 [BE][Inductor] fix failures in test_padding.py (#131417)
The failure only happens [internally](https://www.internalfb.com/tasks/?t=195598864) because the main block was not executed when the tests are run internally.

Differential Revision: [D60083954](https://our.internmc.facebook.com/intern/diff/D60083954)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131417
Approved by: https://github.com/eellison
2024-07-23 21:53:59 +00:00
53f1f75061 [BE][Inductor] fix do_bench test (#131402)
The test fail internally [T195592444](https://www.internalfb.com/intern/tasks/?t=195592444) (This is meta internal link). But we don't see the failure in OSS.

It turns out that there are 2 issues:
1. `run_test('cuda')` is improperly handled since it tries to import a module named 'cuda' if cuda is available. Since the import fails, all tests in the file are skipped. This hides the failure in OSS. The failure is exposed in internal tests since the main block which runs `run_test('cuda')` is skipped sometimes.
2. fix the real issue that incompatible inputs are provided to `do_bench`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131402
Approved by: https://github.com/eellison
2024-07-23 21:52:35 +00:00
5a0068cc69 [BE] mypy: disallow untyped decorators (#131428)
Untyped decorators strip the types from their decorated function, so even if the underlying function is fully typed, callers to it don't get any benefit from type annotations.

Step 1 - Enable the error and override in all the offending files.

#131429

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131428
Approved by: https://github.com/justinchuby, https://github.com/oulgen
2024-07-23 21:50:55 +00:00
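A generic illustration (not PyTorch code) of why this matters: a decorator typed with ParamSpec/TypeVar preserves the wrapped signature for type checkers, whereas an untyped decorator erases it.

```python
from typing import Callable, ParamSpec, TypeVar

P = ParamSpec("P")
R = TypeVar("R")

def logged(fn: Callable[P, R]) -> Callable[P, R]:
    def wrapper(*args: P.args, **kwargs: P.kwargs) -> R:
        print(f"calling {fn.__name__}")
        return fn(*args, **kwargs)
    return wrapper

@logged
def add(x: int, y: int) -> int:
    return x + y

result: int = add(1, 2)  # mypy still sees (x: int, y: int) -> int
```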
e3ca4e79e1 Fix mypy errors introduced by #131400 (#131522)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131522
Approved by: https://github.com/zou3519, https://github.com/eellison
2024-07-23 21:25:21 +00:00
c9e74449f3 bump executorch commit pin. (#131486)
Summary: as title. Target commit: 6153b1bf7b

Test Plan: CI

Differential Revision: D60125590

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131486
Approved by: https://github.com/huydhn
2024-07-23 21:25:07 +00:00
8a890b72dc [BE] Get rid of missing destructor override warning (#131204)
Regression introduced by https://github.com/pytorch/pytorch/pull/126376

Before this change, compiling torch_cpu on my MacBook prints tons of warnings every time HooksInterface is included
```
In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/src/optim/adamw.cpp:1:
In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/include/torch/optim/adamw.h:3:
In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/include/torch/nn/module.h:3:
In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/include/torch/nn/modules/container/any_module_holder.h:3:
In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/include/torch/nn/modules/container/any_value.h:3:
In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/include/torch/detail/static.h:4:
In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/include/torch/types.h:3:
In file included from /Users/nshulga/git/pytorch/pytorch/aten/src/ATen/ATen.h:7:
In file included from /Users/nshulga/git/pytorch/pytorch/aten/src/ATen/Context.h:13:
/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/detail/HIPHooksInterface.h:27:11: warning: '~HIPHooksInterface' overrides a destructor but is not marked 'override' [-Winconsistent-missing-destructor-override]
  virtual ~HIPHooksInterface() = default;
          ^
/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/detail/AcceleratorHooksInterface.h:16:11: note: overridden virtual function is here
  virtual ~AcceleratorHooksInterface() = default;
          ^
1 warning generated.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131204
Approved by: https://github.com/albanD, https://github.com/seemethere
2024-07-23 21:02:14 +00:00
4eee2e7a6d [operator_benchmark] Remove TARGETS from broken benchmarks (#131460)
Summary:
Remove operator_benchmark caffe2 build due to the removal of caffe2: 2fd75667b4

Plus, we are deleting the TARGETS file from broken benchmarks that we do not intend to maintain.

Test Plan: Sandcastle CI

Differential Revision: D60086216

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131460
Approved by: https://github.com/vmpuri
2024-07-23 20:06:08 +00:00
8497930766 Revert "[CI][dashboard] Collect PT2 cpu perf nightly (#131369)"
This reverts commit 9851c7313d118517d21a112960044e0fdbf560b1.

Reverted https://github.com/pytorch/pytorch/pull/131369 on behalf of https://github.com/atalman due to Sorry need to revert looks like , please run ciflow/inductor looks like this caused failure in [pytorch/pytorch/actions/runs/10058412015/job/27802257096](https://github.com/pytorch/pytorch/actions/runs/10058412015/job/27802257096) ([comment](https://github.com/pytorch/pytorch/pull/131369#issuecomment-2246142022))
2024-07-23 19:41:49 +00:00
d4e3fd613c Revert "[CI] Relax config name matching for cpu inductor tests (#131467)"
This reverts commit aa54bcb6d25fc7c9ac23b82b74ea45f03033c8b2.

Reverted https://github.com/pytorch/pytorch/pull/131467 on behalf of https://github.com/atalman due to Sorry need to revert looks like https://github.com/pytorch/pytorch/pull/131369 broke inductor tests ([comment](https://github.com/pytorch/pytorch/pull/131467#issuecomment-2246136839))
2024-07-23 19:38:35 +00:00
7b82ed2d59 Delete very old misleading info from .ci README (#131502)
I think there is no way to salvage that by updating, so deleting.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131502
Approved by: https://github.com/seemethere, https://github.com/atalman
2024-07-23 19:27:36 +00:00
93fdd0237d Ensure staticmethods can be allowed in graph (#130882)
Fixes https://github.com/pytorch/pytorch/issues/124735

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130882
Approved by: https://github.com/anijain2305, https://github.com/williamwen42
2024-07-23 18:59:19 +00:00
faddb0f30c [NestedTensor] Integrate the mean operator along the jagged dimension into NestedTensor (#131132)
Summary:
Modify the existing `mean` operator in PyTorch, invoked by `torch.mean`, to allow for reductions along the jagged dimension of a nested tensor. The function originally had a basic implementation for reducing along 1 non-ragged dimension. This diff enables PyTorch users to invoke `torch.mean` on a nested tensor when reducing along the ragged dimension, e.g. `*` in a `(B, *, M)` nested tensor.

Parametrize unit tests from `sum` to verify the accuracy of the ragged reduction implementation for `torch.mean`. Add unit tests and parametrize `sum` unit tests to verify error handling for unsupported features in `NestedTensor` `torch.mean`.

Test Plan:
Verify that the new unit test passes via the following command:
```
buck2 run mode/{opt,inplace} //caffe2/test:nested -- --regex test_mean
```

```
buck2 run mode/{opt,inplace} //caffe2/test:nested -- --regex test_jagged_op
```

Differential Revision: D59654668

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131132
Approved by: https://github.com/davidberard98, https://github.com/jbschlosser
2024-07-23 18:48:34 +00:00
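A hedged usage sketch of the capability described above: reducing a (B, *, M) jagged NestedTensor along its ragged dimension. Shapes are illustrative, and the exact dim/keepdim handling follows the tests referenced in the commit.

```python
import torch

nt = torch.nested.nested_tensor(
    [torch.randn(3, 8), torch.randn(5, 8)], layout=torch.jagged
)
means = torch.mean(nt, dim=1)  # dim=1 is the ragged "*" dimension
print(means.shape)             # expected: torch.Size([2, 8])
```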
120ca23a1f Fix IMAs in Flash-Attention splitkv kernel (#131277)
# Summary

While debugging CI failures for flash_attention tests I stumbled across 2 IMAs for the split-kv variant of flash attention.
1. Illegal global memory writes during the writing of softmax_lse_accum. This was pinpointed to the temporary lifetime of out_accum and softmax_lse_accum. These were likely getting their refcount dropped **before** the kernel launch that used them, allowing them to potentially be reused for other allocations.
2. After debugging this, there were illegal writes in the combine kernel. I was able to pinpoint this to the write of the reduced LSE. From my understanding, it was assuming that kBlockM evenly divided the global number of rows and wasn't masking out these writes.

### History
My line of thinking for this:

We create the temporary split accum + LSE stats tensors to store the data for each split. We then launch a follow up kernel to do the reduction.

Under ordinary, non-roofline memory usage the CUDA caching allocator will keep these allocations alive even though the tensors were created within a temporary scope and no longer have any live references.

On CI we often run near max memory usage. We change/add tests and suddenly we get close to the OOM threshold. The memory allocator will reap these segments and we get write-after-free errors.

After that fix I got further past the splitkv_flash kernel and then hit the following error:

``` Shell
❯ TORCH_DISABLE_ADDR2LINE=1 PYTORCH_NO_CUDA_MEMORY_CACHING=1  compute-sanitizer --show-backtrace=device --tool memcheck --log-file ima.txt python ima.py

softmax_lseaccum_ptr =0x7f5ebb208a00
oaccum_ptr =0x7f5ebb208c00
softmax_lse_ptr = 0x7f5ebb208800
❯
❯ head ima.txt -n 10
========= COMPUTE-SANITIZER
========= Invalid __global__ write of size 4 bytes
=========     at void pytorch_flash::flash_fwd_splitkv_combine_kernel<pytorch_flash::Flash_fwd_kernel_traits<(int)32, (int)64, (int)256, (int)4, (bool)0, (bool)0, cutlass::bfloat16_t, pytorch_flash::Flash_kernel_traits<(int)32, (int)64, (int)256, (int)4, cutlass::bfloat16_t>>, (int)16, (int)1, (bool)1>(pytorch_flash::Flash_fwd_params)+0x630
=========     by thread (2,0,0) in block (0,0,0)
=========     Address 0x7f5ebb208804 is out of bounds
=========     and is 1 bytes after the nearest allocation at 0x7f5ebb208800 of size 4 bytes
```

Okay, I looked at the address and it looks like we are writing consecutive bytes past softmax_lse_ptr from the combine function. I tried padding out softmax_lse to q_padded, and there were no more illegal memory errors on my repro:
```
========= COMPUTE-SANITIZER
========= ERROR SUMMARY: 0 errors
```

Fixes https://github.com/pytorch/pytorch/issues/131240
Fixes https://github.com/pytorch/pytorch/issues/131227
Fixes https://github.com/pytorch/pytorch/issues/131221

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131277
Approved by: https://github.com/malfet
2024-07-23 18:26:49 +00:00
f75d724482 Updating Types in torch/_dynamo/utils.py (#131001)
Adds some type annotations to the torch/_dynamo/utils.py file.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131001
Approved by: https://github.com/aorenste
2024-07-23 18:25:52 +00:00
aa54bcb6d2 [CI] Relax config name matching for cpu inductor tests (#131467)
Summary: Matching *cpu* instead of *cpu_inductor* should be sufficient. This fixes torchbench test failures in https://github.com/pytorch/pytorch/pull/131369.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131467
Approved by: https://github.com/zou3519
2024-07-23 18:24:29 +00:00
94f22eb6b2 refactor post-trace fakification in strict (#131421)
Summary:
Previously it was unclear what `_convert_input_to_fake` actually does (used in strict), and in particular how it is different from `make_fake_inputs` (used in non-strict).

This PR splits that function to work purely on user inputs, then renames it to `extract_fake_inputs` and adds a comment clarifying what it does—namely, it extracts fake inputs from a given graph module instead of "converting inputs to fake inputs" (as suggested by the current name) or "making fake inputs" (as happens in non-strict, where no tracing has taken place yet).

The remainder of that function used to also fakify params and buffers. It turns out that this part is identical to what happens in non-strict, hence we also pull `make_fake_inputs` out from `non_strict_utils` into `_trace`, merge it with another util, and make both modes call it.

Test Plan: existing tests

Differential Revision: D60084442

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131421
Approved by: https://github.com/zhxchen17
2024-07-23 18:23:03 +00:00
f85c35872b Remove GraphModuleOpUpgrader in _export.serde.upgrade.py (#131373)
Summary: Remove GraphModuleOpUpgrader from _export.serde.upgrade.py and delete the file itself.

Test Plan: contbuild & OSS CI

Differential Revision: D60067937

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131373
Approved by: https://github.com/angelayi
2024-07-23 18:09:44 +00:00
22906be8f0 Do not abort on SPARSE_STATUS_INVALID_VALUE (#130382)
Summary:
Newer versions of the MKL library return `SPARSE_STATUS_INVALID_VALUE` when badly formed non-triangular matrices are passed to the `mkl_sparse_?_trsv`/`mkl_sparse_?_mrsv` functions. This would start aborting (badly written) tests that worked with the old version which just filled the result tensor with `-NaN` instead of returning an error status.

This changes the code to fill the result tensor with `-NaN` on `SPARSE_STATUS_INVALID_VALUE` so we get the same behavior regardless of the MKL version in use.

Test Plan: `buck2 test 'fbcode//mode/opt' fbcode//caffe2/test:sparse -- --run-disabled`

Differential Revision: D59542023

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130382
Approved by: https://github.com/malfet
2024-07-23 18:09:36 +00:00
cfb9ccab6c [export] Filter errors by exception type, add case name (#131327)
Summary:
-  Log export errors to Scuba and mark them with "classified" and "unclassified"
- Classify errors by exception type (ALLOW_LIST) and a `case_name` attribute
- Add `case_name` for some exceptions.

Test Plan:
Running the code below logs a classified error to `torch_export_usage` table in Scuba.

```
import torch

from torch._export.db.case import SupportLevel

class TorchSymMin(torch.nn.Module):
    """
    torch.sym_min operator is not supported in export.
    """

    def forward(self, x):
        return x.sum() + torch.sym_min(x.size(0), 100)

example_args = (torch.randn(3, 2),)
tags = {"torch.operator"}
support_level = SupportLevel.NOT_SUPPORTED_YET
model = TorchSymMin()

torch.export.export(model, example_args)
```

Differential Revision: D59981459

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131327
Approved by: https://github.com/zhxchen17
2024-07-23 18:01:13 +00:00
6b8ec2b371 Revert "[triton_op] fix autotuning (#131363)"
This reverts commit 154f27455a62314dfb689f1fe13c0cfd52490339.

Reverted https://github.com/pytorch/pytorch/pull/131363 on behalf of https://github.com/ZainRizvi due to This was a tricky one, but looking at the code it's the change to torch/fx/node.py that triggered the type violation errors. Reverting since this is now breaking trunk ([comment](https://github.com/pytorch/pytorch/pull/131363#issuecomment-2245899858))
2024-07-23 18:01:09 +00:00
3fe72e0c2e [4/N] Non-Tensor: Support layout, device and dtype for aten operations (#125897)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125897
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-07-23 17:50:17 +00:00
68c725a094 [custom ops] Add register_vmap for custom ops (#130589)
Fixes #130284
Fixes #130653

- Add `torch.library.register_vmap` to custom ops (see the sketch below)
- Add `register_vmap` for operators in custom_op_db.
- Make `torch.autograd.Function` support kwarg-only arguments for vmap
- Test operators in op_db with `tests/test_vmap`.
- Change `test_vmap` to allow a custom `out_dim` and allow "None" in `out_dim` when testing.
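A minimal sketch of the new registration API (assuming the vmap rule receives `(info, in_dims, *args)` and returns `(output, out_dims)`; the op itself is illustrative, not from the PR):
```python
import torch
from torch import Tensor

@torch.library.custom_op("mylib::numpy_mul", mutates_args=())
def numpy_mul(x: Tensor, scale: float) -> Tensor:
    return torch.from_numpy(x.numpy() * scale)

def numpy_mul_vmap(info, in_dims, x, scale):
    # The batched input keeps its batch dim; report where it lives in the output.
    result = numpy_mul(x, scale)
    return result, in_dims[0]

torch.library.register_vmap(numpy_mul, numpy_mul_vmap)

x = torch.randn(4, 3)
out = torch.vmap(lambda t: numpy_mul(t, 2.0))(x)  # vmaps over dim 0
```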

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130589
Approved by: https://github.com/zou3519
2024-07-23 17:48:38 +00:00
404d640c39 [1/2] PT2 Inductor ComboKernels - Foreach cases (#124969)
Summary:
A ComboKernel combines independent Inductor Triton kernels into a single one.
Consolidation with Foreach kernel:
1) For the scheduler node, the logic is consolidated into ForeachKernelSchedulerNode
2) The backend kernel is consolidated into ComboKernel.

(Note: this is part 1 which only deals with the 1st case above.)

Details:

1. ComboKernel can be viewed as an extension of the Foreach kernel (see the examples below). The main differences are: 1) the block size is tunable (but currently shared by the sub-kernels); 2) it supports multiple kernel types, like pointwise and reduction, and may be extended to matmul as well (it doesn't support mixed 1D and 2D kernels yet, but can be extended for that case); 3) the blocks are interleaved among the sub-kernels (this can be extended to other arrangements); 4) it is designed to be general enough to combine kernels without dependencies and doesn't rely on specific patterns; 5) it doesn't support dynamic sizes yet but can easily be extended to do so.

2. ComboKernel is used in two cases: 1) for existing foreach kernels, combo kernels are used as the backend kernel while the front-end kernel generation logic remains the same; 2) an extra optimization phase is added at the end of the scheduler to generate additional combo kernels if combo_kernels is True in config.py.

3. The combo kernel generation in the added optimization phase is done in two steps: 1) in the front end, inside the scheduler, it topologically sorts the schedule nodes to find all the nodes with no data dependencies and creates a front-end schedule node for them. We currently limit the maximum number of sub-nodes for each combo kernel to 8 (but we still need to determine the optimal number); 2) these sub-nodes are then combined in the codegen phase to generate the combo kernel code for them based on a few rules. For example, 1D and 2D kernels are separated into different combo kernels, as mixing them is not supported yet. Note that the algorithms we provide are very basic, and users can register their own customized combo kernel generation algorithms for both steps.

4. Performance-wise, combining small kernels almost always yields a performance gain. However, combining very large kernels may not see any perf gain, and sometimes even a regression, possibly due to improper block sizes. Thus, a benchmark function is implemented to avoid such perf regressions, and it is recommended to turn it on by setting benchmark_combo_kernels to True whenever combo_kernels is True.

Example:
- element-wise kernels
Original PyTorch function:
```
 def test_activations(a, b, c):
     a1 = torch.nn.functional.relu(a)
     b1 = torch.nn.functional.sigmoid(b)
     c1 = torch.nn.functional.tanh(c)
     return a1, b1, c1
```
Generated combo kernel:
```
@triton_heuristics.pointwise(
    size_hints=[512], tile_hint=TileHint.DEFAULT,
    filename=__file__,
    triton_meta={'signature': {0: '*fp32', 1: '*fp32', 2: '*fp32', 3: '*fp32', 4: '*fp32', 5: '*fp32'}, 'device': 0, 'device_type': 'cuda', 'constants': {}, 'configs': [AttrsDescriptor(divisible_by_16=(0, 1, 2, 3, 4, 5), equal_to_1=())]},
    inductor_meta={'kernel_name': 'triton_poi_fused_0', 'mutated_arg_names': []}
)
@triton.jit
def triton_(in_ptr0, in_ptr1, in_ptr2, out_ptr0, out_ptr1, out_ptr2, XBLOCK : tl.constexpr):
    pid = tl.program_id(0)
    if pid % 3 == 0:
        pid_offset = pid // 3
        xnumel = 100
        rnumel = 1
        xoffset = pid_offset * XBLOCK
        xindex = xoffset + tl.arange(0, XBLOCK)[:]
        xmask = xindex < xnumel
        x0 = xindex
        tmp0 = tl.load(in_ptr0 + (x0), xmask)
        tmp1 = triton_helpers.maximum(0, tmp0)
        tl.store(out_ptr0 + (x0), tmp1, xmask)
    elif pid % 3 == 1:
        pid_offset = pid // 3
        xnumel = 400
        rnumel = 1
        xoffset = pid_offset * XBLOCK
        xindex = xoffset + tl.arange(0, XBLOCK)[:]
        xmask = xindex < xnumel
        x1 = xindex
        tmp2 = tl.load(in_ptr1 + (x1), xmask)
        tmp3 = tl.sigmoid(tmp2)
        tl.store(out_ptr1 + (x1), tmp3, xmask)
    elif pid % 3 == 2:
        pid_offset = pid // 3
        xnumel = 100
        rnumel = 1
        xoffset = pid_offset * XBLOCK
        xindex = xoffset + tl.arange(0, XBLOCK)[:]
        xmask = xindex < xnumel
        x2 = xindex
        tmp4 = tl.load(in_ptr2 + (x2), xmask)
        tmp5 = libdevice.tanh(tmp4)
        tl.store(out_ptr2 + (x2), tmp5, xmask)
    else:
        pass
```
- reduction kernels
Original PyTorch function:
```
def test_reduce(a, b, c):
     a1 = torch.sum(a, dim=0)
     b1 = torch.max(b, dim=0)
     c1 = torch.min(c, dim=0)
     return a1, b1, c1
```
Generated combo kernel:
```
 @triton_heuristics.persistent_reduction(
     size_hints=[32, 32],
     reduction_hint=ReductionHint.DEFAULT,
     filename=__file__,
     triton_meta={'signature': {0: '*fp32', 1: '*fp32', 2: '*fp32', 3: '*fp32', 4: '*i64', 5: '*fp32', 6: '*i64', 7: '*fp32'}, 'device': 0, 'device_type': 'cuda', 'constants': {}, 'configs': [AttrsDescriptor(divisible_by_16=(0, 1, 2, 3, 4, 5, 6, 7), equal_to_1=())]},
     inductor_meta={'kernel_name': 'triton_per_fused_0', 'mutated_arg_names': []}
 )
 @triton.jit
 def triton_(in_ptr0, in_ptr1, in_ptr2, out_ptr0, out_ptr1, out_ptr2, out_ptr3, out_ptr4, XBLOCK : tl.constexpr):
     pid = tl.program_id(0)
     if pid % 3 == 0:
         pid_offset = pid // 3
         xnumel = 20
         rnumel = 20
         RBLOCK_0: tl.constexpr = 32
         xoffset = pid_offset * XBLOCK
         xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
         xmask = xindex < xnumel
         rindex = tl.arange(0, RBLOCK_0)[None, :]
         roffset = 0
         rmask = rindex < rnumel
         r1 = rindex
         x0 = xindex
         tmp0 = tl.load(in_ptr0 + (x0 + (20*r1)), rmask & xmask, other=0.0)
         tmp1 = tl.broadcast_to(tmp0, [XBLOCK, RBLOCK_0])
         tmp3 = tl.where(rmask & xmask, tmp1, float("-inf"))
         tmp4 = triton_helpers.max2(tmp3, 1)[:, None]
         tmp6 = tl.broadcast_to(rindex, tmp3.shape)
         _, tmp5_tmp = triton_helpers.max_with_index(tmp3, tmp6, 1)
         tmp5 = tmp5_tmp[:, None]
         tl.store(out_ptr0 + (x0), tmp4, xmask)
         tl.store(out_ptr1 + (x0), tmp5, xmask)
     elif pid % 3 == 1:
         pid_offset = pid // 3
         xnumel = 10
         rnumel = 10
         RBLOCK_1: tl.constexpr = 16
         xoffset = pid_offset * XBLOCK
         xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
         xmask = xindex < xnumel
         rindex = tl.arange(0, RBLOCK_1)[None, :]
         roffset = 0
         rmask = rindex < rnumel
         r3 = rindex
         x2 = xindex
         tmp7 = tl.load(in_ptr1 + (x2 + (10*r3)), rmask & xmask, other=0.0)
         tmp8 = tl.broadcast_to(tmp7, [XBLOCK, RBLOCK_1])
         tmp10 = tl.where(rmask & xmask, tmp8, float("inf"))
         tmp11 = triton_helpers.min2(tmp10, 1)[:, None]
         tmp13 = tl.broadcast_to(rindex, tmp10.shape)
         _, tmp12_tmp = triton_helpers.min_with_index(tmp10, tmp13, 1)
         tmp12 = tmp12_tmp[:, None]
         tl.store(out_ptr2 + (x2), tmp11, xmask)
         tl.store(out_ptr3 + (x2), tmp12, xmask)
     elif pid % 3 == 2:
         pid_offset = pid // 3
         xnumel = 10
         rnumel = 10
         RBLOCK_2: tl.constexpr = 16
         xoffset = pid_offset * XBLOCK
         xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
         xmask = xindex < xnumel
         rindex = tl.arange(0, RBLOCK_2)[None, :]
         roffset = 0
         rmask = rindex < rnumel
         r5 = rindex
         x4 = xindex
         tmp14 = tl.load(in_ptr2 + (x4 + (10*r5)), rmask & xmask, other=0.0)
         tmp15 = tl.broadcast_to(tmp14, [XBLOCK, RBLOCK_2])
         tmp17 = tl.where(rmask & xmask, tmp15, 0)
         tmp18 = tl.sum(tmp17, 1)[:, None]
         tl.store(out_ptr4 + (x4), tmp18, xmask)
     else:
         pass
```

Note: ComboKernels use masks to allow combining kernels that work with tensors of different sizes.

Test Plan:
```
buck2 test mode/dev-nosan caffe2/test/inductor:foreach
```
```
buck2 test mode/dev-nosan caffe2/test/inductor:combo_kernels
```

Differential Revision: D54134695

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124969
Approved by: https://github.com/mlazos
2024-07-23 17:34:28 +00:00
979429ca89 [inductor]Add DtypeView to avoid memory leak and unnecessary kernel generations (#128883)
Fixes #126338
## Issue Summary

When torchinductor compiles the combination `functional_collective -> view.dtype -> wait`, a memory leak occurs. This happens because `view.dtype` is compiled into an out-of-place Triton kernel that copies the input data to a new tensor, even if the data collection hasn't completed via the wait operation. The tensor used by the `collective` is only freed when the `wait` operation triggers the garbage collector, see [~WorkRegistry](https://github.com/pytorch/pytorch/blob/main/torch/csrc/distributed/c10d/Functional.cpp#L41). However, since `wait` now waits on a new tensor, the previous one is never freed. `view.dtype` should only change the metadata instead of creating a new tensor. The current lowering goes against its semantics and causes memory leaks.

See more great discussions in the #126338

This kind of lowering also generates unnecessary triton kernels for `view.dtype` when it can't be fused with other operations.

## Fix
The function `aten.view.dtype` is a CPU operation that changes the metadata of its input. After discussions with @eellison and @bdhirsh, we decided to change the lowering of `aten.view.dtype` to ensure it falls back properly to the real `aten.view.dtype` instead of generating a Triton kernel in some cases. This approach also preserves the semantics of the view operation.
When the model calls `aten.view.dtype` with a data type whose bit width matches the input's original data type, we lower it to the newly added `DtypeView` in IR, acting like a `ReinterpretView`. When the operation can be fused, its `make_loader` is called to maintain the correct type conversion for each load instruction. When the operation can't be fused, it falls back to `aten.view.dtype` to avoid Triton kernel generation.

## Example

```python
@torch.compile
def fn(x, y):
    x = x.view(torch.float16)
    y = y.view(torch.float16) + 1
    return x @ y

x = torch.randn((2, 2), device=self.device, dtype=torch.bfloat16)
y = torch.randn((2, 2), device=self.device, dtype=torch.bfloat16)
fn(x, y)
```
The output code generated before this fix is like the following.
```python
triton_poi_fused_add_view_0...
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 4
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (x0), xmask).to(tl.float32)
    tmp1 = tmp0.to(tl.bfloat16).to(tl.float32, bitcast=True).to(tl.float32)
    tl.store(out_ptr0 + (x0), tmp1, xmask)

triton_poi_fused_add_view_1...
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 4
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (x0), xmask).to(tl.float32)
    tmp1 = tmp0.to(tl.bfloat16).to(tl.float32, bitcast=True).to(tl.float32)
    tmp2 = 1.0
    tmp3 = tmp1 + tmp2
    tl.store(out_ptr0 + (x0), tmp3, xmask)

def call(args):
...
        triton_poi_fused_view_0.run(arg0_1, buf0, 4, grid=grid(4), stream=stream0)
        del arg0_1
        buf1 = empty_strided_cuda((2, 2), (2, 1), torch.float16)
        # Source Nodes: [view_1, y], Original ATen: [aten.add, aten.view]
        triton_poi_fused_add_view_1.run(arg1_1, buf1, 4, grid=grid(4), stream=stream0)
        del arg1_1
        buf2 = empty_strided_cuda((2, 2), (2, 1), torch.float16)
        # Source Nodes: [matmul, view_1, x, y], Original ATen: [aten.add, aten.mm, aten.view]
        extern_kernels.mm(buf0, buf1, out=buf2)
```
As you can see, the two `view` operations are compiled into two kernels, `triton_poi_fused_view_0` and `triton_poi_fused_add_view_1`. Both of them have the line `tmp1 = tmp0.to(tl.bfloat16).to(tl.float32, bitcast=True).to(tl.float32)`, which does the type conversion.

The main issue is that the first `view` operation doesn't do anything to the actual data, yet it generates a Triton kernel with a new output tensor. Another small issue is that this Triton kernel can't be compiled, because `bitcast=True` only supports type conversion between types of the same bit width.

The following are output code generated after this PR.

```python
triton_poi_fused_add_0...
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 4
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (x0), xmask).to(tl.float32)
    tmp1 = tmp0.to(tl.bfloat16).to(tl.float32)
    tmp2 = 1.0
    tmp3 = tmp1 + tmp2
    tl.store(out_ptr0 + (x0), tmp3, xmask)
def call(args):
...
        triton_poi_fused_add_0.run(arg1_1, buf0, 4, grid=grid(4), stream=stream0)
        del arg1_1
        buf1 = empty_strided_cuda((2, 2), (2, 1), torch.float16)
        # Source Nodes: [matmul, y], Original ATen: [aten.add, aten.mm]
        extern_kernels.mm(aten.view.dtype(arg0_1, torch.float16), buf0, out=buf1)
```
The first `view` operation has been replaced with a direct `aten.view.dtype` call that is passed as an argument to the matmul. The second one is still there because it is fused with the following add operation. The invalid bitcast operation is removed too.

The following two code snippets are for the upcasts and downcasts. For dtypes in `torch.float16, torch.bfloat16`, each load will be upcast to float32, then downcast to its original dtype to ensure values are used with the right precision.

7bda23ef84/torch/_inductor/codegen/triton.py (L1725-L1726)
7bda23ef84/torch/_inductor/codegen/triton.py (L629-L642)

Huge thanks to @eellison, @bdhirsh, @shunting314, and @desertfire .

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128883
Approved by: https://github.com/eellison
2024-07-23 17:31:39 +00:00
f93a6a4d31 Add mypy typing to torch_version.py (#131447)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131447
Approved by: https://github.com/angelayi
ghstack dependencies: #131434
2024-07-23 17:31:07 +00:00
eab1595ce2 [dynamo] Delete wrong assertion in bind_args (#131405)
Fix - https://github.com/pytorch/pytorch/issues/130537

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131405
Approved by: https://github.com/williamwen42, https://github.com/yanboliang
ghstack dependencies: #131347, #131367, #131378, #131389
2024-07-23 17:28:05 +00:00
e4b5645f83 Revert "Add wrappers for synchronous GPUDirect Storage APIs (#130633)"
This reverts commit 5b5e0698a5f560decb9bbdd150ed7b0622eb7777.

Reverted https://github.com/pytorch/pytorch/pull/130633 on behalf of https://github.com/clee2000 due to breaking a lot of jobs and build rules internally D60085885, possibly needs to update some bazel build? ([comment](https://github.com/pytorch/pytorch/pull/130633#issuecomment-2245806738))
2024-07-23 17:19:34 +00:00
f7754c6dc5 Run pull jobs with new AMI (#131250)
Migrate all pull jobs to the new Amazon 2023 AMI runner type.

Exceptions:
- Distributed tests are still on the old AMI since they had some weird [test failures](https://github.com/pytorch/pytorch/actions/runs/10047579686/job/27770963175). Will debug those separately.
- Ported over a couple trunk and slow jobs that had `sync-tag`s set with the pull jobs and so needed to be on the same AMI

Revert plan, in case something starts breaking when we run these new AMIs at a larger scale:
- If specific jobs start failing consistently, we bring those jobs back to the old AMI
- If the failure is more widespread, revert this PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131250
Approved by: https://github.com/malfet, https://github.com/atalman
2024-07-23 17:17:12 +00:00
5f0b65bee7 Revert "Replace manual parsing of "TMPDIR", "TMP", "TEMP" and "TEMPDIR" with std::filesystem::temp_directory_path() (#130842)"
This reverts commit d33804f8b6e2ea38f8446826a16be13ce4f9b71e.

Reverted https://github.com/pytorch/pytorch/pull/130842 on behalf of https://github.com/clee2000 due to breaking some builds internally D60085710, Im not sure what the logs mean but I think its something about build size ([comment](https://github.com/pytorch/pytorch/pull/130842#issuecomment-2245799309))
2024-07-23 17:15:06 +00:00
4ca8705035 Add mypy typing to fx node (#131434)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131434
Approved by: https://github.com/zou3519
2024-07-23 17:00:31 +00:00
ded5bdb0de Use inductor TestCase for test_replicate_with_compiler.py (#131053)
Summary: `test/distributed/_composable/test_replicate_with_compiler.py` uses torch.compile. This change introduces a version of MultiProcessTestCase that derives from the inductor TestCase class to make sure we always get a clean cache dir.

Test Plan: `python test/distributed/_composable/test_replicate_with_compiler.py`

Differential Revision: [D59925519](https://our.internmc.facebook.com/intern/diff/D59925519)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131053
Approved by: https://github.com/eellison
2024-07-23 16:59:55 +00:00
a5ad02d05d Remove MacOS M2 14 runner from MacMPS job (#131465)
As it's been dead for 2+ weeks and causing queuing issues

<img width="760" alt="image" src="https://github.com/user-attachments/assets/4e806cae-3a67-4acb-b84f-1a9131d2a859">
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131465
Approved by: https://github.com/seemethere, https://github.com/atalman
2024-07-23 16:51:42 +00:00
c1ef214046 Print ExportedProgram without color by default (#131399)
Summary:
Without a plugin, the colored ExportedProgram output is not really readable.

![image](https://github.com/user-attachments/assets/319920a9-bb4b-4ad2-bcac-0c4f76973b11)

Test Plan: CI

Differential Revision: D60074481

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131399
Approved by: https://github.com/angelayi
2024-07-23 16:41:55 +00:00
db376fb643 Ensure non-contiguous indices are handled (#131430)
The unaligned-inputs checker was built on the assumption that static indices form a contiguous range (i.e. 0, 1, 2), but the new changes with nn module inlining break this assumption.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131430
Approved by: https://github.com/anijain2305
2024-07-23 16:37:55 +00:00
4f0497c747 Divorce triton and pt2 remote caching (#131345)
Now that remote caching has evolved into various parts of PT2, we want to separate triton and pt2 caching as changes to one have caused SEVs to the other.

Differential Revision: D60047752

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131345
Approved by: https://github.com/aorenste
2024-07-23 16:28:12 +00:00
154f27455a [triton_op] fix autotuning (#131363)
The problem was we were shoving SymInts into the constant_args side
table. The root problem is that torch.fx.node.base_types, which we use
to determine what can be put in the graph, doesn't actually have SymInt
in it. This PR fixes base_types to include SymInt.

Test Plan:
- tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131363
Approved by: https://github.com/oulgen
2024-07-23 16:15:00 +00:00
3aa45cae77 [export] Removed deprecated dialect field from EP schema. [2/2] (#131344)
Summary: Not landable until we've updated the pin of executorch.

Test Plan: CI

Reviewed By: SherlockNoMad

Differential Revision: D59759620

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131344
Approved by: https://github.com/SherlockNoMad, https://github.com/ydwu4
2024-07-23 16:05:10 +00:00
b61600f6cc [pytorch] fix the leak for pinned memory when using _create_cpu_state… (#131270)
When pin_memory and share_memory are both set to True in _create_cpu_state_dict, the memory is pinned using cudaHostRegister but is never unpinned. So, once a tensor is created and freed, and a new tensor is then created, the caching allocator allocates the same memory. This fails with the error below.

```
obj = <[RuntimeError('CUDA error: part or all of the requested memory range is already mapped\nCUDA kernel errors might be a...pile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.\n') raised in repr()] Tensor object at 0x7f0028a4d6c0> pg = None, device = None, _ = None
```

This PR fixes this by attaching a hook that unregisters the memory when the tensor is freed.
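A rough sketch of the idea (not the PR's code; the `torch.cuda.cudart()` binding names are an assumption here):
```python
import weakref
import torch

def _pin_shared_tensor(t: torch.Tensor) -> torch.Tensor:
    cudart = torch.cuda.cudart()
    ptr, nbytes = t.data_ptr(), t.numel() * t.element_size()
    cudart.cudaHostRegister(ptr, nbytes, 0)  # pin the shared-memory buffer
    # Unpin when the tensor is garbage collected, so the caching allocator can
    # later hand out an overlapping range without hitting the already-mapped error.
    weakref.finalize(t, cudart.cudaHostUnregister, ptr)
    return t
```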

This is easily reproducible with xlformers checkpointing unit tests and the fix is verified with the same.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131270
Approved by: https://github.com/LucasLLC
2024-07-23 15:47:21 +00:00
1e86387871 Revert "Support IPC for Expandable Segments (#130890)"
This reverts commit 32c2f84e349ad6e34b8559d3f1f9c27020ae702f.

Reverted https://github.com/pytorch/pytorch/pull/130890 on behalf of https://github.com/zdevito due to variable shadowing broke internal tests ([comment](https://github.com/pytorch/pytorch/pull/130890#issuecomment-2245456085))
2024-07-23 14:46:28 +00:00
f064dac588 [CI] change xpu ci build runner type to reduce build time (#130922)
The current XPU build sometimes needs 2+ hours; change the build runner to `linux.12xlarge` to reduce build time. Works for #114850.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130922
Approved by: https://github.com/atalman
2024-07-23 14:45:30 +00:00
6bbef2a06b [dynamo] Support set on KeysView (#131389)
Fixes https://github.com/pytorch/pytorch/issues/129664

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131389
Approved by: https://github.com/mlazos
ghstack dependencies: #131347, #131367, #131378
2024-07-23 14:15:26 +00:00
e7c5e06772 [dynamo] Support __contains__ on __dict__ on UserDefinedClassVariable (#131378)
Fixes https://github.com/pytorch/pytorch/issues/129665

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131378
Approved by: https://github.com/mlazos
ghstack dependencies: #131347, #131367
2024-07-23 14:15:26 +00:00
0bc5e26067 [dynamo] Support dict conversion of objects derived from MutableMapping (#131367)
Fixes - https://github.com/pytorch/pytorch/issues/129662
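A minimal sketch of the kind of pattern this enables (the `Config` class is illustrative, not from the issue):
```python
import torch
from collections.abc import MutableMapping

class Config(MutableMapping):
    def __init__(self, **kwargs):
        self._data = dict(kwargs)
    def __getitem__(self, key):
        return self._data[key]
    def __setitem__(self, key, value):
        self._data[key] = value
    def __delitem__(self, key):
        del self._data[key]
    def __iter__(self):
        return iter(self._data)
    def __len__(self):
        return len(self._data)

cfg = Config(scale=2.0)

@torch.compile
def fn(x):
    # dict() conversion of a MutableMapping-derived object inside a compiled region
    return x * dict(cfg)["scale"]

print(fn(torch.ones(3)))
```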

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131367
Approved by: https://github.com/williamwen42
ghstack dependencies: #131347
2024-07-23 14:15:20 +00:00
a944cce5b8 [dynamo] Support if callable on list (#131347)
Fixes https://github.com/pytorch/pytorch/issues/130720

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131347
Approved by: https://github.com/williamwen42, https://github.com/mlazos
2024-07-23 14:15:15 +00:00
250cdb2ac7 Fix cuda_half_test.cu (#131416)
$\mathrm{atanh}(1.0)$ is $\infty$ (see https://www.mathworks.com/help/matlab/ref/atanh.html ), and the difference between two infinities is NaN, which is neither greater than, nor less than, nor equal to any reasonable threshold.

Fix the test by checking that atanh of 0.5 is equal for float and half, and that atanh of 1.0 is equal to infinity.
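A small Python illustration of the failure mode (using float/double as stand-ins for the test's float/half pair):
```python
import torch

a = torch.atanh(torch.tensor(1.0))                        # inf
b = torch.atanh(torch.tensor(1.0, dtype=torch.float64))   # also inf
print(a - b)                           # nan: inf - inf, so |a - b| <= eps can never hold
print(torch.isinf(a), torch.isinf(b))  # the property the fixed test asserts instead
```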

Fixes https://github.com/pytorch/pytorch/issues/131401

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131416
Approved by: https://github.com/atalman, https://github.com/albanD
2024-07-23 14:10:20 +00:00
4ac77fc6bd [HOP] Don't send HOPs to torch_dispatch (#131370)
I regretted the decision in
https://github.com/pytorch/pytorch/pull/130606. Most user
torch_dispatchs don't have enough to actually handle the HOP correctly,
so for now I'd prefer that users explicitly define the interaction
between the HOP and their torch_dispatch class.

An example is FlopCounterMode: if we allow HOPs to get passed to it, it
will ignore auto_functionalized(mm) by default but it will record flops
for mm, which is weird.

Test Plan:
- tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131370
Approved by: https://github.com/ydwu4
2024-07-23 13:41:08 +00:00
027f35d9e6 [Inductor] Allow customize decompositions for fwd_only trace function (#131329)
Summary:

Inductor will aggressively try to decompose and lower ops into a smaller opset. However, sometimes this may not align with kernel coverage (or perf preferences) on different backends (e.g. Inductor will decompose Gelu into primitive ops, but certain backends already have a Gelu op). Therefore, we need a mechanism to allow customizing the decompositions for the trace function so that Inductor will simply pass such an op through.

Test Plan:

Reviewers:
@eellison

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131329
Approved by: https://github.com/eellison
2024-07-23 13:10:48 +00:00
eb146b10db Only depend on sympy 1.12 for conda (no 3.13 there anyways) (#131355)
Fixing nightly after https://github.com/pytorch/pytorch/pull/130895
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131355
Approved by: https://github.com/atalman
2024-07-23 12:19:58 +00:00
9851c7313d [CI][dashboard] Collect PT2 cpu perf nightly (#131369)
Summary: Add a workflow similar to inductor-perf-test-nightly.yml but use x86 metal instances for perf measurement. The data processing and dashboard update will come next.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131369
Approved by: https://github.com/huydhn
2024-07-23 11:55:39 +00:00
3f3b226ffc Fixes for the extension backend tests (#130933)
There were some miscellaneous issues I found:

* The WrapperCodeGen subclass constructors don't accept any arguments, which doesn't mesh with how Inductor can try to construct them.
* A DeviceInterface subclass for Triton doesn't implement `triton_supported() == True`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130933
Approved by: https://github.com/eellison, https://github.com/jansel
2024-07-23 10:46:32 +00:00
d8e2e1fe50 [aoti] use reshape instead of view for flattening tensors for the nan checker (#131302)
For some non-contiguous tensors, tensor.view would trigger the following
runtime error:

"RuntimeError: view size is not compatible with input tensor’s size and stride
(at least one dimension spans across two contiguous subspaces).
Use .reshape(…) instead"

So, let's use reshape instead.
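For illustration, a tiny reproducer of the difference:
```python
import torch

x = torch.randn(4, 4).t()    # transposed view: non-contiguous
try:
    x.view(-1)               # raises the "view size is not compatible" RuntimeError
except RuntimeError as e:
    print(e)
flat = x.reshape(-1)         # works: reshape copies only when a view is impossible
```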
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131302
Approved by: https://github.com/muchulee8, https://github.com/desertfire
2024-07-23 10:15:28 +00:00
16247987a1 Add decomposition for t_copy (#130939)
* Extracted from #128416

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130939
Approved by: https://github.com/peterbell10
2024-07-23 08:29:19 +00:00
16a2a1aad3 Annotate graph.py (#131400)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131400
Approved by: https://github.com/shunting314
2024-07-23 07:04:12 +00:00
102d8e5a63 MPS LSTM backward kernel workaround on MacOS 14.4+ (#130038)
The bug causing the correctness problem will be fixed in a future OS release. The root cause is a bug in an optimization of the MPSGraph reshape operation in macOS 14.4 that results in a correctness issue with the shapes the LSTM gradient operation has when num_layers > 2.

Addresses the silent correctness issue in #125803.
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130038
Approved by: https://github.com/malfet
2024-07-23 06:30:40 +00:00
29e2e2afb6 Revert D59561509: Multisect successfully blamed "D59561509: [FX][export] DCE pass, check schema for node impurity (#130395)" for one test failure (#131341)
Summary:
This diff reverts D59561509
D59561509: [FX][export] DCE pass, check schema for node impurity (#130395) by yushangdi causes the following test failure:

Tests affected:
- [cogwheel:cogwheel_mtia_cmf_m5_shrunk_test#test_flow_with_verification](https://www.internalfb.com/intern/test/844425041436985/)

Here's the Multisect link:
https://www.internalfb.com/multisect/6533402
Here are the tasks that are relevant to this breakage:
T191383430: 10+ tests unhealthy for ads_mtia_inference

The backout may land if someone accepts it.

If this diff has been generated in error, you can Commandeer and Abandon it.

Test Plan: NA

Differential Revision: D60029318

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131341
Approved by: https://github.com/angelayi
2024-07-23 05:23:47 +00:00
b2ad16f01d avoid OpOverloadPacket.__getattr__ calls in inductor lowering (#131348)
We have seen stacktrace samples showing that a lot of compilation time is spent in exceptions raised in `OpOverloadPacket.__getattr__`. It's not entirely clear why/how this happens, but I spot-checked a few places in `_inductor.graph.py` where we previously may have been calling `hasattr(OpOverloadPacket, ...)` that can be avoided (hasattr goes through getattr, which, for OpOverloadPacket, does a lookup in the dispatch table for all overload names of the packet).
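A small illustration of the cost being avoided (not the PR's exact change):
```python
import torch

packet = torch.ops.aten.add            # an OpOverloadPacket
hasattr(packet, "Tensor")              # True, but resolved via __getattr__
hasattr(packet, "not_an_overload")     # False, and pays for a raised AttributeError
# Cheaper when the overload names are needed anyway:
print("Tensor" in packet.overloads())  # overloads() lists the packet's overload names
```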

Test Plan: CI

Differential Revision: D60048270

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131348
Approved by: https://github.com/davidberard98
2024-07-23 04:30:04 +00:00
99d9b369f4 [Optim] Support tensor lr for all optimizers and check it is 1-element (#131065)
Fixes: #130980
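A minimal sketch of what this enables (assuming a plain eager CPU run):
```python
import torch

model = torch.nn.Linear(4, 2)
# A 1-element tensor learning rate is now accepted by every optimizer.
opt = torch.optim.SGD(model.parameters(), lr=torch.tensor(0.01))
model(torch.randn(8, 4)).sum().backward()
opt.step()
# A multi-element lr, e.g. torch.tensor([0.01, 0.02]), is expected to be rejected
# by the new 1-element check.
```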
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131065
Approved by: https://github.com/janeyx99
2024-07-23 04:27:05 +00:00
781189f25d Add nvjitlink to the list of loadable global deps (#131295)
To fix the cusparse dependency resolution in CUDA-12.x, that has nvJitLink dependency:
```
$ ldd -r /usr/local/cuda-11.8/lib64/libcusparse.so.11.7.5.86
	linux-vdso.so.1 (0x00007ffea6f51000)
	libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fb13306f000)
	librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fb133065000)
	libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fb13305f000)
	libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fb132f10000)
	libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fb132eeb000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fb132cf7000)
	/lib64/ld-linux-x86-64.so.2 (0x00007fb143db7000)
$ ldd -r /usr/local/cuda-12.1/lib64/libcusparse.so.12.1.0.106
	linux-vdso.so.1 (0x00007ffc41909000)
	libnvJitLink.so.12 => /usr/local/cuda-12.1/lib64/libnvJitLink.so.12 (0x00007f3916b38000)
	libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f3916aea000)
	librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f3916ae0000)
	libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f3916ada000)
	libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f391698b000)
	libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f3916964000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f3916772000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f3929a8c000)
```
Fixes #131284
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131295
Approved by: https://github.com/malfet
2024-07-23 04:26:33 +00:00
02cd4dbcf4 [BE][CI] Get rid of duplicated code (#131406)
Followup after https://github.com/pytorch/pytorch/pull/131061 Define `run_if_exists` function that runs cpp test if it exists and prints a warning otherwise.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131406
Approved by: https://github.com/kit1980, https://github.com/atalman
2024-07-23 04:01:13 +00:00
35a0e0f018 [tp] improve SequenceParallel and its documentation (#131346)
The SequenceParallel style assumes the input torch.Tensor is ALREADY sharded on
the sequence dimension when a DTensor is not passed in. Since this causes some
user confusion in the documentation, this PR:

1. for the case where the input passed in is already a DTensor, checks the input placements and redistributes if it is not sharded on the sequence dimension
2. updates the doc to be more explicit about the cases where the user passes in a torch.Tensor versus a DTensor

This would fix https://github.com/pytorch/pytorch/issues/129355
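A minimal sketch of the intended usage (the mesh size and module name are illustrative; distributed initialization is assumed to happen elsewhere):
```python
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import SequenceParallel, parallelize_module

mesh = init_device_mesh("cuda", (8,))
model = torch.nn.TransformerEncoderLayer(d_model=64, nhead=4)
# Plain torch.Tensor inputs are assumed to already be sharded on the sequence dim;
# DTensor inputs are redistributed to sequence-dim sharding if needed.
model = parallelize_module(model, mesh, {"norm1": SequenceParallel()})
```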

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131346
Approved by: https://github.com/awgu
2024-07-23 03:57:01 +00:00
12434504a2 [c10d] remove non-necessary tests (#131212)
As titled: the comm tensor is no longer actively used, since we adopted
functional collectives as our collective tracing approach.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131212
Approved by: https://github.com/XilunWu
2024-07-23 03:48:55 +00:00
8a591da3e7 [CI] Enable AOT inductor in cpu performance smoke test (#130097)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130097
Approved by: https://github.com/chuanqi129, https://github.com/desertfire
2024-07-23 03:44:13 +00:00
6cbb1437c1 Revert "Add sparse block to flex_decoding kernel (#130884)"
This reverts commit 0bf59db6cc076468f44197f0d7ee41f6204c47c2.

Reverted https://github.com/pytorch/pytorch/pull/130884 on behalf of https://github.com/atalman due to Sorry reverting test_causal_full_mask_vs_sdpa constantly failing on trunk ([comment](https://github.com/pytorch/pytorch/pull/130884#issuecomment-2244113663))
2024-07-23 02:10:14 +00:00
28b0ad4f46 [PT2] Minor fix in signpost (#131332)
Summary: compile_id is a NamedTuple. We want to log signposts.

Test Plan:
Run e2e job.
Confirm this shows up correctly.
{F1767320364}

Differential Revision: D60045020

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131332
Approved by: https://github.com/oulgen
2024-07-23 01:56:00 +00:00
b435d84261 Revert "[custom ops] Add register_vmap for custom ops (#130589)"
This reverts commit 074b42064195c45471912f851e94c753992a9a1f.

Reverted https://github.com/pytorch/pytorch/pull/130589 on behalf of https://github.com/atalman due to Please fix lint and reland ([comment](https://github.com/pytorch/pytorch/pull/130589#issuecomment-2244092174))
2024-07-23 01:44:44 +00:00
8963623494 Re-implement pin_memory to be device-agnostic by leveraging the Accelerator concept (#126376)
This PR re-implements pin memory aiming to get rid of the optional `device` argument and makes all related APIs to be device-agnostic. We add two new abstract APIs in [AcceleratorHooksInterface](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/detail/AcceleratorHooksInterface.h#L12) and redefine pin memory as: "Pin memory is always pinned for the current accelerator device". In detail, it uses [getAcceleratorHooksInterface](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/Context.h#L61) in pin_memory/is_pinned to get an appropriate device and invoke the corresponding overridden interfaces, instead of using BackendSelect and then dispatching to CUDA or other specific backends' implement methods.

Note: For new backends that want to implement and use pin memory, just inherit from AcceleratorHooksInterface and override the `isPinnedPtr` and `getPinnedMemoryAllocator` methods.

Additional context: To avoid BC-breaking, this PR just preserves the `device` arg of related APIs and would throw a deprecation warning if `device` arg is passed. Another PR will be submitted to update all PT callers (`Tensor.is_pinned()`, `Tensor.pin_memory()`...) not to pass this arg based on this PR. In future, `device` arg will be actually removed.

Relates #124908
Relates #14560
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126376
Approved by: https://github.com/albanD
2024-07-23 01:44:15 +00:00
074b420641 [custom ops] Add register_vmap for custom ops (#130589)
Fixes #130284
Fixes #130653

- Add `torch.library.register_vmap` to custom ops
- Add `register_vmap` for operators in custom_op_db.
- Make `torch.autograd.Function` support kwarg-only arguments for vmap
- Test operators in op_db with `tests/test_vmap`.
- Change `test_vmap` to allow a custom `out_dim` and allow "None" in `out_dim` when testing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130589
Approved by: https://github.com/zou3519
2024-07-23 00:54:52 +00:00
1e5ecc4277 move save/load from _export to export (#131353)
Test Plan: existing tests

Differential Revision: D60053905

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131353
Approved by: https://github.com/angelayi
2024-07-23 00:48:28 +00:00
26f7dd286b [export] Allow non-CIA ops to be preserved (#131075)
I feel like the semantics of `run_decompositions(preserve_ops,...)` should be that we should always preserve whatever operator is put into `preserve_ops`, even if it's not CIA?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131075
Approved by: https://github.com/bdhirsh
2024-07-23 00:41:48 +00:00
69b1999586 TunableOp size hotfix (#130800)
Fixes #130727.  GetSize calculation was incorrect for strided batched gemm.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130800
Approved by: https://github.com/xw285cornell
2024-07-22 23:42:26 +00:00
8ae1963a61 [Autograd] Cond Higher-Order Operation (#126911)
This is an updated PR to equip cond with the autograd feature and replaces the old [PR](https://github.com/pytorch/pytorch/pull/126007)

@ydwu4 I tried to incorporate your requests already.

Currently there are two problems that I struggle with solving:

1. There seems to be an import issue when trying to import cond in `torch/__init__.py`, see [here](8a704035c9/torch/__init__.py (L1914-L1916)). Therefore, I had to comment out those lines, which resolved the import issue, but I believe cond is then not properly exposed as torch.cond.
2. I am not entirely sure how to deal with the OpInfo test in `hop_db.py`.
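A minimal sketch of what the autograd support enables (assuming `torch.cond` is exposed; otherwise `torch._higher_order_ops.cond` can be used):
```python
import torch

def true_fn(x):
    return x.sin()

def false_fn(x):
    return x.cos()

x = torch.randn(4, requires_grad=True)
out = torch.cond(x.sum() > 0, true_fn, false_fn, (x,))
out.sum().backward()   # gradients flow through the branch that was taken
print(x.grad)
```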

Co-authored-by: Yidi Wu <yidi@meta.com>
Co-authored-by: Xuehai Pan <XuehaiPan@outlook.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126911
Approved by: https://github.com/ydwu4
2024-07-22 23:18:19 +00:00
c74396e890 Revert "[c10d] remove non-necessary tests (#131212)"
This reverts commit 0c074352ab62acba22265d8f19ea95851ae61d0f.

Reverted https://github.com/pytorch/pytorch/pull/131212 on behalf of https://github.com/atalman due to sorry need to revert breaks OSS CI, module 'test_c10d_common' has no attribute 'CompilerTest' ([comment](https://github.com/pytorch/pytorch/pull/131212#issuecomment-2243961785))
2024-07-22 23:11:44 +00:00
f8f41dcb24 Revert "[inductor] Make UserDefinedTritonKernel a multi-output operation (#130832)"
This reverts commit deacc543f13067ab22e8fb2ab714a20dd60bb056.

Reverted https://github.com/pytorch/pytorch/pull/130832 on behalf of https://github.com/atalman due to broke periodic test ([comment](https://github.com/pytorch/pytorch/pull/130832#issuecomment-2243894772))
2024-07-22 22:10:02 +00:00
15eb10df02 Revert "[inductor] Use multiple outputs for flex-attention (#130833)"
This reverts commit 9df8ea1cf2d62bfe21b46188faea6ef2e29e5210.

Reverted https://github.com/pytorch/pytorch/pull/130833 on behalf of https://github.com/atalman due to broke periodic https://github.com/pytorch/pytorch/pull/130832 ([comment](https://github.com/pytorch/pytorch/pull/130833#issuecomment-2243890944))
2024-07-22 22:07:06 +00:00
f8875e8277 Revert "[inductor] Kill mark_node_as_mutating (#130834)"
This reverts commit 33f036a6f71b386d4ccb9a756ed892c144ec6a5f.

Reverted https://github.com/pytorch/pytorch/pull/130834 on behalf of https://github.com/atalman due to broke periodic https://github.com/pytorch/pytorch/pull/130832 ([comment](https://github.com/pytorch/pytorch/pull/130834#issuecomment-2243886215))
2024-07-22 22:02:43 +00:00
d33804f8b6 Replace manual parsing of "TMPDIR", "TMP", "TEMP" and "TEMPDIR" with std::filesystem::temp_directory_path() (#130842)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130842
Approved by: https://github.com/fegin
2024-07-22 21:49:33 +00:00
a136a7d623 [Functional Collective] enable custom work registration from python (#130354)
This PR does two things:
- Allow tensor -> work registration in Python via `torch._C._distributed_c10d.register_work`. Calling `torch.ops._c10d_functional.wait_tensor` on a tensor would trigger `.wait()` on the registered work object.
- Allow user-defined work object in Python to work with functional collectives.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130354
Approved by: https://github.com/wanchaol, https://github.com/fegin, https://github.com/wconstab
2024-07-22 21:45:19 +00:00
a3922acc06 [TD] More synonyms, new heuristic for test_public_bindings (#130397)
test_public_bindings should be run on anything that changes the public API. We still need to figure out what counts as the public API; currently I'm using anything in torch/.

flex_attention should be run on anything involving autograd
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130397
Approved by: https://github.com/malfet
2024-07-22 21:42:54 +00:00
0bf59db6cc Add sparse block to flex_decoding kernel (#130884)
fix typo

Finish flex_decoding block sparse

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130884
Approved by: https://github.com/drisspg
2024-07-22 21:29:43 +00:00
83b355bad5 [aoti] forward fix of D60006838, add back test_multiple_output_alias (#131331) (#131356)
Summary:

Forward fix of D60006838.

The unit test test_multiple_output_alias passed in OSS CI, but failing internally. So adding it back to skip list.

Test Plan: ci

Differential Revision: D60044926

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131356
Approved by: https://github.com/kit1980, https://github.com/malfet
2024-07-22 20:17:21 +00:00
e3eaa22126 [torchbench][multisect] Run accuracy check at Diff time (#131266)
Summary:
X-link: https://github.com/pytorch/benchmark/pull/2388

We can enable accuracy checks at Diff time since it is not a performance metric.

* Refactor the existing diff time test to use the new PT2 Benchmark Runner.
* Deprecate the speedup tests and enable the accuracy tests only. We rely on ServiceLab to perform performance testing and regression detection.

Test Plan:
Sandcastle CI

Or buck test command:

```
buck2 test 'fbcode//mode/opt' fbcode//pytorch/benchmark/fb/test_gpu:run_test_gpu -- test_training_resnet50_accuracy
```

Test UI: https://www.internalfb.com/intern/testinfra/testrun/1688850102375429

Reviewed By: oulgen

Differential Revision: D59825601

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131266
Approved by: https://github.com/oulgen
2024-07-22 20:14:28 +00:00
0c074352ab [c10d] remove non-necessary tests (#131212)
As titled: the comm tensor is no longer actively used, since we adopted
functional collectives as our collective tracing approach.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131212
Approved by: https://github.com/XilunWu
2024-07-22 19:52:44 +00:00
781a33f5d8 Enable dynamic rollout for Linux trunk workflows (#131325)
Enables dynamic migration of jobs to the LF AWS account for the Linux trunk workflow.

The new runners are only given to people specified in this issue: https://github.com/pytorch/test-infra/issues/5132

This closes pytorch/ci-infra#250.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131325
Approved by: https://github.com/ZainRizvi
2024-07-22 19:43:24 +00:00
406f510f89 [c10d] add bfloat16 support for NAN check (#131131)
Summary:
Need another dispatcher macro to support more data types.
Test Plan:
```
(sqzhang_1) [sqzhang@devgpu009.cln1 ~/pytorch (86fcae11)]$ python
test/distributed/test_c10d_nccl.py -k test_nan_assert_bfloat16
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:18:
checkForNaN: block: [0,0,0], thread: [85,0,0] Assertion
`!isnan(data[i])` failed.
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:18:
checkForNaN: block: [0,0,0], thread: [18,0,0] Assertion
`!isnan(data[i])` failed.
NCCL version 2.21.5+cuda12.0

devgpu009:1193787:1193787 [0] init.cc:1773 NCCL WARN Cuda failure
'device-side assert triggered'
.
----------------------------------------------------------------------
Ran 1 test in 9.416s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131131
Approved by: https://github.com/wconstab, https://github.com/XilunWu
2024-07-22 19:41:19 +00:00
9e753d1f20 [AMD] catch exception when other processes belong to other users (#131018)
Summary:
It is a long-known pain point that, if other users are running things on the GPU, the call to `torch.cuda.memory.list_gpu_processes()` will error out:
```
  torch.cuda.memory.list_gpu_processes()
  File "torch/cuda/memory.py", line 647, in list_gpu_processes
    procs = amdsmi.amdsmi_get_gpu_process_list(handle)  # type: ignore[attr-defined]
  File "amdsmi/py_interface/amdsmi_interface.py", line 1946, in amdsmi_get_gpu_process_list
    _check_res(
  File "amdsmi/py_interface/amdsmi_interface.py", line 510, in _check_res
    raise AmdSmiLibraryException(ret_code)
amdsmi.py_interface.amdsmi_exception.AmdSmiLibraryException: Error code:
	10 | AMDSMI_STATUS_NO_PERM - Permission Denied

```

So just catch this error

Test Plan: torch.cuda.memory.list_gpu_processes() no longer fails

Differential Revision: D59901053

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131018
Approved by: https://github.com/eqy, https://github.com/clee2000
2024-07-22 19:38:51 +00:00
861 changed files with 33143 additions and 14883 deletions

View File

@@ -1,4 +1,4 @@
# Docker images for GitHub CI
# Docker images for GitHub CI and CD
This directory contains everything needed to build the Docker images
that are used in our CI.
@@ -12,7 +12,7 @@ each image as the `BUILD_ENVIRONMENT` environment variable.
See `build.sh` for valid build environments (it's the giant switch).
## Contents
## Docker CI builds
* `build.sh` -- dispatch script to launch all builds
* `common` -- scripts used to execute individual Docker build stages
@@ -21,6 +21,12 @@ See `build.sh` for valid build environments (it's the giant switch).
* `ubuntu-rocm` -- Dockerfile for Ubuntu image with ROCm support
* `ubuntu-xpu` -- Dockerfile for Ubuntu image with XPU support
### Docker CD builds
* `conda` - Dockerfile and build.sh to build Docker images used in nightly conda builds
* `manywheel` - Dockerfile and build.sh to build Docker images used in nightly manywheel builds
* `libtorch` - Dockerfile and build.sh to build Docker images used in nightly libtorch builds
## Usage
```bash

View File

@@ -407,6 +407,22 @@ case "$image" in
# from pytorch/llvm:9.0.1 is x86 specific
SKIP_LLVM_SRC_BUILD_INSTALL=yes
;;
pytorch-linux-jammy-aarch64-py3.10-gcc11-inductor-benchmarks)
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=11
ACL=yes
PROTOBUF=yes
DB=yes
VISION=yes
CONDA_CMAKE=yes
# snadampal: skipping sccache due to the following issue
# https://github.com/pytorch/pytorch/issues/121559
SKIP_SCCACHE_INSTALL=yes
# snadampal: skipping llvm src build install because the current version
# from pytorch/llvm:9.0.1 is x86 specific
SKIP_LLVM_SRC_BUILD_INSTALL=yes
INDUCTOR_BENCHMARKS=yes
;;
*)
# Catch-all for builds that are not hardcoded.
PROTOBUF=yes

View File

@@ -1 +1 @@
9d859653ae916d0a72f6b2b5c5925bed38832140
48da61aa34b73ea8e2ee815a6a79eea817e361db

View File

@@ -0,0 +1,5 @@
0.6b
manylinux_2_17
rocm6.1
04b5df8c8123f90cba3ede7e971e6fbc6040d506
77c29fa3f3b614e187d7213d745e989a92708cee2bc6020419ab49019af399d1

View File

@@ -0,0 +1,20 @@
#!/bin/bash
# Script used only in CD pipeline
set -ex
# Anaconda
# Latest anaconda is using openssl-3 which is incompatible with all currently published versions of git
# Which are using openssl-1.1.1, see https://anaconda.org/anaconda/git/files?version=2.40.1 for example
MINICONDA_URL=https://repo.anaconda.com/miniconda/Miniconda3-py311_23.5.2-0-Linux-x86_64.sh
wget -q $MINICONDA_URL
# NB: Manually invoke bash per https://github.com/conda/conda/issues/10431
bash $(basename "$MINICONDA_URL") -b -p /opt/conda
rm $(basename "$MINICONDA_URL")
export PATH=/opt/conda/bin:$PATH
# See https://github.com/pytorch/builder/issues/1473
# Pin conda to 23.5.2 as it's the last one compatible with openssl-1.1.1
conda install -y conda=23.5.2 conda-build anaconda-client git ninja
# The cmake version here needs to match with the minimum version of cmake
# supported by PyTorch (3.18). There is only 3.18.2 on anaconda
/opt/conda/bin/pip3 install cmake==3.18.2
conda remove -y --force patchelf

View File

@@ -0,0 +1,95 @@
#!/bin/bash
# Script used only in CD pipeline
set -uex -o pipefail
PYTHON_DOWNLOAD_URL=https://www.python.org/ftp/python
PYTHON_DOWNLOAD_GITHUB_BRANCH=https://github.com/python/cpython/archive/refs/heads
GET_PIP_URL=https://bootstrap.pypa.io/get-pip.py
# Python versions to be installed in /opt/$VERSION_NO
CPYTHON_VERSIONS=${CPYTHON_VERSIONS:-"3.8.1 3.9.0 3.10.1 3.11.0 3.12.0 3.13.0"}
function check_var {
if [ -z "$1" ]; then
echo "required variable not defined"
exit 1
fi
}
function do_cpython_build {
local py_ver=$1
local py_folder=$2
check_var $py_ver
check_var $py_folder
tar -xzf Python-$py_ver.tgz
pushd $py_folder
local prefix="/opt/_internal/cpython-${py_ver}"
mkdir -p ${prefix}/lib
if [[ -n $(which patchelf) ]]; then
local shared_flags="--enable-shared"
else
local shared_flags="--disable-shared"
fi
if [[ -z "${WITH_OPENSSL+x}" ]]; then
local openssl_flags=""
else
local openssl_flags="--with-openssl=${WITH_OPENSSL} --with-openssl-rpath=auto"
fi
# -Wformat added for https://bugs.python.org/issue17547 on Python 2.6
CFLAGS="-Wformat" ./configure --prefix=${prefix} ${openssl_flags} ${shared_flags} > /dev/null
make -j40 > /dev/null
make install > /dev/null
if [[ "${shared_flags}" == "--enable-shared" ]]; then
patchelf --set-rpath '$ORIGIN/../lib' ${prefix}/bin/python3
fi
popd
rm -rf $py_folder
# Some python's install as bin/python3. Make them available as
# bin/python.
if [ -e ${prefix}/bin/python3 ]; then
ln -s python3 ${prefix}/bin/python
fi
${prefix}/bin/python get-pip.py
if [ -e ${prefix}/bin/pip3 ] && [ ! -e ${prefix}/bin/pip ]; then
ln -s pip3 ${prefix}/bin/pip
fi
${prefix}/bin/pip install wheel==0.34.2
local abi_tag=$(${prefix}/bin/python -c "from wheel.pep425tags import get_abbr_impl, get_impl_ver, get_abi_tag; print('{0}{1}-{2}'.format(get_abbr_impl(), get_impl_ver(), get_abi_tag()))")
ln -s ${prefix} /opt/python/${abi_tag}
}
function build_cpython {
local py_ver=$1
check_var $py_ver
check_var $PYTHON_DOWNLOAD_URL
local py_ver_folder=$py_ver
if [ "$py_ver" = "3.13.0" ]; then
PY_VER_SHORT="3.13"
check_var $PYTHON_DOWNLOAD_GITHUB_BRANCH
wget $PYTHON_DOWNLOAD_GITHUB_BRANCH/$PY_VER_SHORT.tar.gz -O Python-$py_ver.tgz
do_cpython_build $py_ver cpython-$PY_VER_SHORT
else
wget -q $PYTHON_DOWNLOAD_URL/$py_ver_folder/Python-$py_ver.tgz
do_cpython_build $py_ver Python-$py_ver
fi
rm -f Python-$py_ver.tgz
}
function build_cpythons {
check_var $GET_PIP_URL
curl -sLO $GET_PIP_URL
for py_ver in $@; do
build_cpython $py_ver
done
rm -f get-pip.py
}
mkdir -p /opt/python
mkdir -p /opt/_internal
build_cpythons $CPYTHON_VERSIONS

View File

@@ -0,0 +1,239 @@
#!/bin/bash
set -ex
NCCL_VERSION=v2.21.5-1
CUDNN_VERSION=9.1.0.70
function install_cusparselt_040 {
# cuSparseLt license: https://docs.nvidia.com/cuda/cusparselt/license.html
mkdir tmp_cusparselt && pushd tmp_cusparselt
wget -q https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-x86_64/libcusparse_lt-linux-x86_64-0.4.0.7-archive.tar.xz
tar xf libcusparse_lt-linux-x86_64-0.4.0.7-archive.tar.xz
cp -a libcusparse_lt-linux-x86_64-0.4.0.7-archive/include/* /usr/local/cuda/include/
cp -a libcusparse_lt-linux-x86_64-0.4.0.7-archive/lib/* /usr/local/cuda/lib64/
popd
rm -rf tmp_cusparselt
}
function install_cusparselt_052 {
# cuSparseLt license: https://docs.nvidia.com/cuda/cusparselt/license.html
mkdir tmp_cusparselt && pushd tmp_cusparselt
wget -q https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-x86_64/libcusparse_lt-linux-x86_64-0.5.2.1-archive.tar.xz
tar xf libcusparse_lt-linux-x86_64-0.5.2.1-archive.tar.xz
cp -a libcusparse_lt-linux-x86_64-0.5.2.1-archive/include/* /usr/local/cuda/include/
cp -a libcusparse_lt-linux-x86_64-0.5.2.1-archive/lib/* /usr/local/cuda/lib64/
popd
rm -rf tmp_cusparselt
}
function install_118 {
echo "Installing CUDA 11.8 and cuDNN ${CUDNN_VERSION} and NCCL ${NCCL_VERSION} and cuSparseLt-0.4.0"
rm -rf /usr/local/cuda-11.8 /usr/local/cuda
# install CUDA 11.8.0 in the same container
wget -q https://developer.download.nvidia.com/compute/cuda/11.8.0/local_installers/cuda_11.8.0_520.61.05_linux.run
chmod +x cuda_11.8.0_520.61.05_linux.run
./cuda_11.8.0_520.61.05_linux.run --toolkit --silent
rm -f cuda_11.8.0_520.61.05_linux.run
rm -f /usr/local/cuda && ln -s /usr/local/cuda-11.8 /usr/local/cuda
# cuDNN license: https://developer.nvidia.com/cudnn/license_agreement
mkdir tmp_cudnn && cd tmp_cudnn
wget -q https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/cudnn-linux-x86_64-${CUDNN_VERSION}_cuda11-archive.tar.xz -O cudnn-linux-x86_64-${CUDNN_VERSION}_cuda11-archive.tar.xz
tar xf cudnn-linux-x86_64-${CUDNN_VERSION}_cuda11-archive.tar.xz
cp -a cudnn-linux-x86_64-${CUDNN_VERSION}_cuda11-archive/include/* /usr/local/cuda/include/
cp -a cudnn-linux-x86_64-${CUDNN_VERSION}_cuda11-archive/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf tmp_cudnn
# NCCL license: https://docs.nvidia.com/deeplearning/nccl/#licenses
# Follow build: https://github.com/NVIDIA/nccl/tree/master?tab=readme-ov-file#build
git clone -b $NCCL_VERSION --depth 1 https://github.com/NVIDIA/nccl.git
cd nccl && make -j src.build
cp -a build/include/* /usr/local/cuda/include/
cp -a build/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf nccl
install_cusparselt_040
ldconfig
}
function install_121 {
echo "Installing CUDA 12.1 and cuDNN ${CUDNN_VERSION} and NCCL ${NCCL_VERSION} and cuSparseLt-0.5.2"
rm -rf /usr/local/cuda-12.1 /usr/local/cuda
# install CUDA 12.1.1 in the same container
wget -q https://developer.download.nvidia.com/compute/cuda/12.1.1/local_installers/cuda_12.1.1_530.30.02_linux.run
chmod +x cuda_12.1.1_530.30.02_linux.run
./cuda_12.1.1_530.30.02_linux.run --toolkit --silent
rm -f cuda_12.1.1_530.30.02_linux.run
rm -f /usr/local/cuda && ln -s /usr/local/cuda-12.1 /usr/local/cuda
# cuDNN license: https://developer.nvidia.com/cudnn/license_agreement
mkdir tmp_cudnn && cd tmp_cudnn
wget -q https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz -O cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz
tar xf cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz
cp -a cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive/include/* /usr/local/cuda/include/
cp -a cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf tmp_cudnn
# NCCL license: https://docs.nvidia.com/deeplearning/nccl/#licenses
# Follow build: https://github.com/NVIDIA/nccl/tree/master?tab=readme-ov-file#build
git clone -b $NCCL_VERSION --depth 1 https://github.com/NVIDIA/nccl.git
cd nccl && make -j src.build
cp -a build/include/* /usr/local/cuda/include/
cp -a build/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf nccl
install_cusparselt_052
ldconfig
}
function install_124 {
echo "Installing CUDA 12.4 and cuDNN ${CUDNN_VERSION} and NCCL ${NCCL_VERSION} and cuSparseLt-0.5.2"
rm -rf /usr/local/cuda-12.4 /usr/local/cuda
# install CUDA 12.4.0 in the same container
wget -q https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda_12.4.0_550.54.14_linux.run
chmod +x cuda_12.4.0_550.54.14_linux.run
./cuda_12.4.0_550.54.14_linux.run --toolkit --silent
rm -f cuda_12.4.0_550.54.14_linux.run
rm -f /usr/local/cuda && ln -s /usr/local/cuda-12.4 /usr/local/cuda
# cuDNN license: https://developer.nvidia.com/cudnn/license_agreement
mkdir tmp_cudnn && cd tmp_cudnn
wget -q https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz -O cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz
tar xf cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz
cp -a cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive/include/* /usr/local/cuda/include/
cp -a cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf tmp_cudnn
# NCCL license: https://docs.nvidia.com/deeplearning/nccl/#licenses
# Follow build: https://github.com/NVIDIA/nccl/tree/master?tab=readme-ov-file#build
git clone -b $NCCL_VERSION --depth 1 https://github.com/NVIDIA/nccl.git
cd nccl && make -j src.build
cp -a build/include/* /usr/local/cuda/include/
cp -a build/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf nccl
install_cusparselt_052
ldconfig
}
function prune_118 {
echo "Pruning CUDA 11.8 and cuDNN"
#####################################################################################
# CUDA 11.8 prune static libs
#####################################################################################
export NVPRUNE="/usr/local/cuda-11.8/bin/nvprune"
export CUDA_LIB_DIR="/usr/local/cuda-11.8/lib64"
export GENCODE="-gencode arch=compute_35,code=sm_35 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"
export GENCODE_CUDNN="-gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"
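# (Illustrative note, not part of the original script) Each -gencode pair keeps device
# code for one SM architecture; nvprune below rewrites the static libs so that only
# these architectures remain, shrinking the CUDA toolkit footprint in the image.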
if [[ -n "$OVERRIDE_GENCODE" ]]; then
export GENCODE=$OVERRIDE_GENCODE
fi
# all CUDA libs except CuDNN and CuBLAS (cudnn and cublas need arch 3.7 included)
ls $CUDA_LIB_DIR/ | grep "\.a" | grep -v "culibos" | grep -v "cudart" | grep -v "cudnn" | grep -v "cublas" | grep -v "metis" \
| xargs -I {} bash -c \
"echo {} && $NVPRUNE $GENCODE $CUDA_LIB_DIR/{} -o $CUDA_LIB_DIR/{}"
# prune CuDNN and CuBLAS
$NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublas_static.a -o $CUDA_LIB_DIR/libcublas_static.a
$NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublasLt_static.a -o $CUDA_LIB_DIR/libcublasLt_static.a
#####################################################################################
# CUDA 11.8 prune visual tools
#####################################################################################
export CUDA_BASE="/usr/local/cuda-11.8/"
rm -rf $CUDA_BASE/libnvvp $CUDA_BASE/nsightee_plugins $CUDA_BASE/nsight-compute-2022.3.0 $CUDA_BASE/nsight-systems-2022.4.2/
}
function prune_121 {
echo "Pruning CUDA 12.1"
#####################################################################################
# CUDA 12.1 prune static libs
#####################################################################################
export NVPRUNE="/usr/local/cuda-12.1/bin/nvprune"
export CUDA_LIB_DIR="/usr/local/cuda-12.1/lib64"
export GENCODE="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"
export GENCODE_CUDNN="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"
if [[ -n "$OVERRIDE_GENCODE" ]]; then
export GENCODE=$OVERRIDE_GENCODE
fi
# all CUDA libs except CuDNN and CuBLAS
ls $CUDA_LIB_DIR/ | grep "\.a" | grep -v "culibos" | grep -v "cudart" | grep -v "cudnn" | grep -v "cublas" | grep -v "metis" \
| xargs -I {} bash -c \
"echo {} && $NVPRUNE $GENCODE $CUDA_LIB_DIR/{} -o $CUDA_LIB_DIR/{}"
# prune CuDNN and CuBLAS
$NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublas_static.a -o $CUDA_LIB_DIR/libcublas_static.a
$NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublasLt_static.a -o $CUDA_LIB_DIR/libcublasLt_static.a
#####################################################################################
# CUDA 12.1 prune visual tools
#####################################################################################
export CUDA_BASE="/usr/local/cuda-12.1/"
rm -rf $CUDA_BASE/libnvvp $CUDA_BASE/nsightee_plugins $CUDA_BASE/nsight-compute-2023.1.0 $CUDA_BASE/nsight-systems-2023.1.2/
}
function prune_124 {
echo "Pruning CUDA 12.4"
#####################################################################################
# CUDA 12.4 prune static libs
#####################################################################################
export NVPRUNE="/usr/local/cuda-12.4/bin/nvprune"
export CUDA_LIB_DIR="/usr/local/cuda-12.4/lib64"
export GENCODE="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"
export GENCODE_CUDNN="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"
if [[ -n "$OVERRIDE_GENCODE" ]]; then
export GENCODE=$OVERRIDE_GENCODE
fi
if [[ -n "$OVERRIDE_GENCODE_CUDNN" ]]; then
export GENCODE_CUDNN=$OVERRIDE_GENCODE_CUDNN
fi
# all CUDA libs except CuDNN and CuBLAS
ls $CUDA_LIB_DIR/ | grep "\.a" | grep -v "culibos" | grep -v "cudart" | grep -v "cudnn" | grep -v "cublas" | grep -v "metis" \
| xargs -I {} bash -c \
"echo {} && $NVPRUNE $GENCODE $CUDA_LIB_DIR/{} -o $CUDA_LIB_DIR/{}"
# prune CuDNN and CuBLAS
$NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublas_static.a -o $CUDA_LIB_DIR/libcublas_static.a
$NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublasLt_static.a -o $CUDA_LIB_DIR/libcublasLt_static.a
#####################################################################################
# CUDA 12.4 prune visual tools
#####################################################################################
export CUDA_BASE="/usr/local/cuda-12.4/"
rm -rf $CUDA_BASE/libnvvp $CUDA_BASE/nsightee_plugins $CUDA_BASE/nsight-compute-2024.1.0 $CUDA_BASE/nsight-systems-2023.4.4/
}
# idiomatic parameter and option handling in sh
while test $# -gt 0
do
case "$1" in
11.8) install_118; prune_118
;;
12.1) install_121; prune_121
;;
12.4) install_124; prune_124
;;
*) echo "bad argument $1"; exit 1
;;
esac
shift
done
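# (Hypothetical usage, for illustration only)
#   bash ./install_cuda.sh 12.4        # runs install_124 followed by prune_124
#   bash ./install_cuda.sh 11.8 12.4   # installs and prunes both toolkits in one pass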


@ -0,0 +1,93 @@
#!/bin/bash
# Script used only in CD pipeline
set -ex
NCCL_VERSION=v2.21.5-1
function install_cusparselt_052 {
# cuSparseLt license: https://docs.nvidia.com/cuda/cusparselt/license.html
mkdir tmp_cusparselt && pushd tmp_cusparselt
wget -q https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-sbsa/libcusparse_lt-linux-sbsa-0.5.2.1-archive.tar.xz
tar xf libcusparse_lt-linux-sbsa-0.5.2.1-archive.tar.xz
cp -a libcusparse_lt-linux-sbsa-0.5.2.1-archive/include/* /usr/local/cuda/include/
cp -a libcusparse_lt-linux-sbsa-0.5.2.1-archive/lib/* /usr/local/cuda/lib64/
popd
rm -rf tmp_cusparselt
}
function install_124 {
echo "Installing CUDA 12.4 and cuDNN 9.1 and NCCL ${NCCL_VERSION} and cuSparseLt-0.5.2"
rm -rf /usr/local/cuda-12.4 /usr/local/cuda
# install CUDA 12.4.0 in the same container
wget -q https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda_12.4.0_550.54.14_linux_sbsa.run
chmod +x cuda_12.4.0_550.54.14_linux_sbsa.run
./cuda_12.4.0_550.54.14_linux_sbsa.run --toolkit --silent
rm -f cuda_12.4.0_550.54.14_linux_sbsa.run
rm -f /usr/local/cuda && ln -s /usr/local/cuda-12.4 /usr/local/cuda
# cuDNN license: https://developer.nvidia.com/cudnn/license_agreement
mkdir tmp_cudnn && cd tmp_cudnn
wget -q https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-sbsa/cudnn-linux-sbsa-9.1.0.70_cuda12-archive.tar.xz -O cudnn-linux-sbsa-9.1.0.70_cuda12-archive.tar.xz
tar xf cudnn-linux-sbsa-9.1.0.70_cuda12-archive.tar.xz
cp -a cudnn-linux-sbsa-9.1.0.70_cuda12-archive/include/* /usr/local/cuda/include/
cp -a cudnn-linux-sbsa-9.1.0.70_cuda12-archive/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf tmp_cudnn
# NCCL license: https://docs.nvidia.com/deeplearning/nccl/#licenses
# Follow build: https://github.com/NVIDIA/nccl/tree/master?tab=readme-ov-file#build
git clone -b ${NCCL_VERSION} --depth 1 https://github.com/NVIDIA/nccl.git
cd nccl && make -j src.build
cp -a build/include/* /usr/local/cuda/include/
cp -a build/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf nccl
install_cusparselt_052
ldconfig
}
function prune_124 {
echo "Pruning CUDA 12.4"
#####################################################################################
# CUDA 12.4 prune static libs
#####################################################################################
export NVPRUNE="/usr/local/cuda-12.4/bin/nvprune"
export CUDA_LIB_DIR="/usr/local/cuda-12.4/lib64"
export GENCODE="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"
export GENCODE_CUDNN="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"
if [[ -n "$OVERRIDE_GENCODE" ]]; then
export GENCODE=$OVERRIDE_GENCODE
fi
# all CUDA libs except CuDNN and CuBLAS
ls $CUDA_LIB_DIR/ | grep "\.a" | grep -v "culibos" | grep -v "cudart" | grep -v "cudnn" | grep -v "cublas" | grep -v "metis" \
| xargs -I {} bash -c \
"echo {} && $NVPRUNE $GENCODE $CUDA_LIB_DIR/{} -o $CUDA_LIB_DIR/{}"
# prune CuDNN and CuBLAS
$NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublas_static.a -o $CUDA_LIB_DIR/libcublas_static.a
$NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublasLt_static.a -o $CUDA_LIB_DIR/libcublasLt_static.a
#####################################################################################
# CUDA 12.4 prune visual tools
#####################################################################################
export CUDA_BASE="/usr/local/cuda-12.4/"
rm -rf $CUDA_BASE/libnvvp $CUDA_BASE/nsightee_plugins $CUDA_BASE/nsight-compute-2024.1.0 $CUDA_BASE/nsight-systems-2023.4.4/
}
# idiomatic parameter and option handling in sh
while test $# -gt 0
do
case "$1" in
12.4) install_124; prune_124
;;
*) echo "bad argument $1"; exit 1
;;
esac
shift
done


@ -0,0 +1,23 @@
#!/bin/bash
# Script used only in CD pipeline
set -ex
LIBPNG_VERSION=1.6.37
mkdir -p libpng
pushd libpng
wget http://download.sourceforge.net/libpng/libpng-$LIBPNG_VERSION.tar.gz
tar -xvzf libpng-$LIBPNG_VERSION.tar.gz
pushd libpng-$LIBPNG_VERSION
./configure
make
make install
popd
popd
rm -rf libpng


@ -0,0 +1,29 @@
#!/usr/bin/env bash
# Script used only in CD pipeline
set -eou pipefail
MAGMA_VERSION="2.5.2"
function do_install() {
cuda_version=$1
cuda_version_nodot=${1/./}
MAGMA_VERSION="2.6.1"
magma_archive="magma-cuda${cuda_version_nodot}-${MAGMA_VERSION}-1.tar.bz2"
cuda_dir="/usr/local/cuda-${cuda_version}"
(
set -x
tmp_dir=$(mktemp -d)
pushd ${tmp_dir}
curl -OLs https://anaconda.org/pytorch/magma-cuda${cuda_version_nodot}/${MAGMA_VERSION}/download/linux-64/${magma_archive}
tar -xvf "${magma_archive}"
mkdir -p "${cuda_dir}/magma"
mv include "${cuda_dir}/magma/include"
mv lib "${cuda_dir}/magma/lib"
popd
)
}
do_install $1
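# (Illustrative example, not part of the original script)
#   bash ./install_magma.sh 12.1
# would fetch magma-cuda121-2.6.1-1.tar.bz2 from the pytorch anaconda channel and place
# headers and libs under /usr/local/cuda-12.1/magma/{include,lib}.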


@ -0,0 +1,134 @@
#!/bin/bash
# Script used only in CD pipeline
set -ex
ROCM_VERSION=$1
if [[ -z $ROCM_VERSION ]]; then
echo "missing ROCM_VERSION"
exit 1;
fi
# To make version comparison easier, create an integer representation.
save_IFS="$IFS"
IFS=. ROCM_VERSION_ARRAY=(${ROCM_VERSION})
IFS="$save_IFS"
if [[ ${#ROCM_VERSION_ARRAY[@]} == 2 ]]; then
ROCM_VERSION_MAJOR=${ROCM_VERSION_ARRAY[0]}
ROCM_VERSION_MINOR=${ROCM_VERSION_ARRAY[1]}
ROCM_VERSION_PATCH=0
elif [[ ${#ROCM_VERSION_ARRAY[@]} == 3 ]]; then
ROCM_VERSION_MAJOR=${ROCM_VERSION_ARRAY[0]}
ROCM_VERSION_MINOR=${ROCM_VERSION_ARRAY[1]}
ROCM_VERSION_PATCH=${ROCM_VERSION_ARRAY[2]}
else
echo "Unhandled ROCM_VERSION ${ROCM_VERSION}"
exit 1
fi
ROCM_INT=$(($ROCM_VERSION_MAJOR * 10000 + $ROCM_VERSION_MINOR * 100 + $ROCM_VERSION_PATCH))
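# (Illustration, example values only) ROCM_VERSION=5.7.1 splits into 5/7/1, so
# ROCM_INT = 5*10000 + 7*100 + 1 = 50701; ROCM_VERSION=6.1 yields 60100.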
# Install custom MIOpen + COMgr for ROCm >= 4.0.1
if [[ $ROCM_INT -lt 40001 ]]; then
echo "ROCm version < 4.0.1; will not install custom MIOpen"
exit 0
fi
# Function to retry functions that sometimes timeout or have flaky failures
retry () {
$* || (sleep 1 && $*) || (sleep 2 && $*) || (sleep 4 && $*) || (sleep 8 && $*)
}
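# (Hypothetical usage, for illustration) e.g.:
#   retry wget -q https://example.com/some-archive.tar.gz
# makes up to five attempts with 1/2/4/8 second back-off between retries.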
# Build custom MIOpen to use comgr for offline compilation.
## Need a sanitized ROCM_VERSION without patchlevel; patchlevel version 0 must be added to paths.
ROCM_DOTS=$(echo ${ROCM_VERSION} | tr -d -c '.' | wc -c)
if [[ ${ROCM_DOTS} == 1 ]]; then
ROCM_VERSION_NOPATCH="${ROCM_VERSION}"
ROCM_INSTALL_PATH="/opt/rocm-${ROCM_VERSION}.0"
else
ROCM_VERSION_NOPATCH="${ROCM_VERSION%.*}"
ROCM_INSTALL_PATH="/opt/rocm-${ROCM_VERSION}"
fi
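# (Illustration, example values only) ROCM_VERSION=6.1 -> ROCM_VERSION_NOPATCH=6.1 and
# ROCM_INSTALL_PATH=/opt/rocm-6.1.0; ROCM_VERSION=5.7.1 -> ROCM_VERSION_NOPATCH=5.7 and
# ROCM_INSTALL_PATH=/opt/rocm-5.7.1.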
# MIOPEN_USE_HIP_KERNELS is a workaround for COMgr issues
MIOPEN_CMAKE_COMMON_FLAGS="
-DMIOPEN_USE_COMGR=ON
-DMIOPEN_BUILD_DRIVER=OFF
"
# Pull the MIOpen repo and set MIOPEN_EMBED_DB based on the ROCm version
if [[ $ROCM_INT -ge 60100 ]] && [[ $ROCM_INT -lt 60200 ]]; then
echo "ROCm 6.1 MIOpen does not need any patches, do not build from source"
exit 0
elif [[ $ROCM_INT -ge 60000 ]] && [[ $ROCM_INT -lt 60100 ]]; then
echo "ROCm 6.0 MIOpen does not need any patches, do not build from source"
exit 0
elif [[ $ROCM_INT -ge 50700 ]] && [[ $ROCM_INT -lt 60000 ]]; then
echo "ROCm 5.7 MIOpen does not need any patches, do not build from source"
exit 0
elif [[ $ROCM_INT -ge 50600 ]] && [[ $ROCM_INT -lt 50700 ]]; then
MIOPEN_BRANCH="release/rocm-rel-5.6-staging"
elif [[ $ROCM_INT -ge 50500 ]] && [[ $ROCM_INT -lt 50600 ]]; then
MIOPEN_BRANCH="release/rocm-rel-5.5-gfx11"
elif [[ $ROCM_INT -ge 50400 ]] && [[ $ROCM_INT -lt 50500 ]]; then
MIOPEN_CMAKE_DB_FLAGS="-DMIOPEN_EMBED_DB=gfx900_56;gfx906_60;gfx90878;gfx90a6e;gfx1030_36 -DMIOPEN_USE_MLIR=Off"
MIOPEN_BRANCH="release/rocm-rel-5.4-staging"
elif [[ $ROCM_INT -ge 50300 ]] && [[ $ROCM_INT -lt 50400 ]]; then
MIOPEN_CMAKE_DB_FLAGS="-DMIOPEN_EMBED_DB=gfx900_56;gfx906_60;gfx90878;gfx90a6e;gfx1030_36 -DMIOPEN_USE_MLIR=Off"
MIOPEN_BRANCH="release/rocm-rel-5.3-staging"
elif [[ $ROCM_INT -ge 50200 ]] && [[ $ROCM_INT -lt 50300 ]]; then
MIOPEN_CMAKE_DB_FLAGS="-DMIOPEN_EMBED_DB=gfx900_56;gfx906_60;gfx90878;gfx90a6e;gfx1030_36 -DMIOPEN_USE_MLIR=Off"
MIOPEN_BRANCH="release/rocm-rel-5.2-staging"
elif [[ $ROCM_INT -ge 50100 ]] && [[ $ROCM_INT -lt 50200 ]]; then
MIOPEN_CMAKE_DB_FLAGS="-DMIOPEN_EMBED_DB=gfx900_56;gfx906_60;gfx90878;gfx90a6e;gfx1030_36"
MIOPEN_BRANCH="release/rocm-rel-5.1-staging"
elif [[ $ROCM_INT -ge 50000 ]] && [[ $ROCM_INT -lt 50100 ]]; then
MIOPEN_CMAKE_DB_FLAGS="-DMIOPEN_EMBED_DB=gfx900_56;gfx906_60;gfx90878;gfx90a6e;gfx1030_36"
MIOPEN_BRANCH="release/rocm-rel-5.0-staging"
else
echo "Unhandled ROCM_VERSION ${ROCM_VERSION}"
exit 1
fi
yum remove -y miopen-hip
git clone https://github.com/ROCm/MIOpen -b ${MIOPEN_BRANCH}
pushd MIOpen
# remove .git to save disk space since CI runner was running out
rm -rf .git
# Don't build MLIR to save docker build time
# since we are disabling MLIR backend for MIOpen anyway
if [[ $ROCM_INT -ge 50400 ]] && [[ $ROCM_INT -lt 50500 ]]; then
sed -i '/rocMLIR/d' requirements.txt
elif [[ $ROCM_INT -ge 50200 ]] && [[ $ROCM_INT -lt 50400 ]]; then
sed -i '/llvm-project-mlir/d' requirements.txt
fi
## MIOpen minimum requirements
cmake -P install_deps.cmake --minimum
# clean up since CI runner was running out of disk space
rm -rf /tmp/*
yum clean all
rm -rf /var/cache/yum
rm -rf /var/lib/yum/yumdb
rm -rf /var/lib/yum/history
## Build MIOpen
mkdir -p build
cd build
PKG_CONFIG_PATH=/usr/local/lib/pkgconfig CXX=${ROCM_INSTALL_PATH}/llvm/bin/clang++ cmake .. \
${MIOPEN_CMAKE_COMMON_FLAGS} \
${MIOPEN_CMAKE_DB_FLAGS} \
-DCMAKE_PREFIX_PATH="${ROCM_INSTALL_PATH}/hip;${ROCM_INSTALL_PATH}"
make MIOpen -j $(nproc)
# Build MIOpen package
make -j $(nproc) package
# clean up since CI runner was running out of disk space
rm -rf /usr/local/cget
yum install -y miopen-*.rpm
popd
rm -rf MIOpen


@ -0,0 +1,16 @@
#!/bin/bash
set -ex
# MKL
MKL_VERSION=2024.2.0
MKLROOT=/opt/intel
mkdir -p ${MKLROOT}
pushd /tmp
python3 -mpip install wheel
python3 -mpip download -d . mkl-static==${MKL_VERSION}
python3 -m wheel unpack mkl_static-${MKL_VERSION}-py2.py3-none-manylinux1_x86_64.whl
python3 -m wheel unpack mkl_include-${MKL_VERSION}-py2.py3-none-manylinux1_x86_64.whl
mv mkl_static-${MKL_VERSION}/mkl_static-${MKL_VERSION}.data/data/lib ${MKLROOT}
mv mkl_include-${MKL_VERSION}/mkl_include-${MKL_VERSION}.data/data/include ${MKLROOT}
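# (Illustrative note, not part of the original script) After the two mv commands the static
# MKL libraries live under /opt/intel/lib and the headers under /opt/intel/include, matching
# the MKLROOT=/opt/intel layout assumed elsewhere in these images.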


@ -0,0 +1,13 @@
#!/bin/bash
# Script used only in CD pipeline
set -ex
mkdir -p /usr/local/mnist/
cd /usr/local/mnist
for img in train-images-idx3-ubyte.gz train-labels-idx1-ubyte.gz t10k-images-idx3-ubyte.gz t10k-labels-idx1-ubyte.gz; do
wget -q https://ossci-datasets.s3.amazonaws.com/mnist/$img
gzip -d $img
done


@ -0,0 +1,22 @@
#!/bin/bash
# Script used only in CD pipeline
set -ex
cd /
git clone https://github.com/OpenMathLib/OpenBLAS.git -b v0.3.25 --depth 1 --shallow-submodules
OPENBLAS_BUILD_FLAGS="
NUM_THREADS=128
USE_OPENMP=1
NO_SHARED=0
DYNAMIC_ARCH=1
TARGET=ARMV8
CFLAGS=-O3
"
OPENBLAS_CHECKOUT_DIR="OpenBLAS"
make -j8 ${OPENBLAS_BUILD_FLAGS} -C ${OPENBLAS_CHECKOUT_DIR}
make -j8 ${OPENBLAS_BUILD_FLAGS} install -C ${OPENBLAS_CHECKOUT_DIR}


@ -0,0 +1,16 @@
#!/bin/bash
# Script used only in CD pipeline
set -ex
# Pin the version to the latest release 0.17.2; building a newer commit starts
# to fail on the current image
git clone -b 0.17.2 --single-branch https://github.com/NixOS/patchelf
cd patchelf
sed -i 's/serial/parallel/g' configure.ac
./bootstrap.sh
./configure
make
make install
cd ..
rm -rf patchelf


@ -0,0 +1,150 @@
#!/bin/bash
# Script used only in CD pipeline
###########################
### prereqs
###########################
# Install Python packages depending on the base OS
ID=$(grep -oP '(?<=^ID=).+' /etc/os-release | tr -d '"')
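# (Illustration) On Ubuntu /etc/os-release contains a line like ID=ubuntu, so the grep
# above leaves ID="ubuntu" and the case statement below picks the apt branch.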
case "$ID" in
ubuntu)
apt-get update -y
apt-get install -y libpciaccess-dev pkg-config
apt-get clean
;;
centos)
yum install -y libpciaccess-devel pkgconfig
;;
*)
echo "Unable to determine OS..."
exit 1
;;
esac
python3 -m pip install meson ninja
###########################
### clone repo
###########################
GIT_SSL_NO_VERIFY=true git clone https://gitlab.freedesktop.org/mesa/drm.git
pushd drm
###########################
### patch
###########################
patch -p1 <<'EOF'
diff --git a/amdgpu/amdgpu_asic_id.c b/amdgpu/amdgpu_asic_id.c
index a5007ffc..13fa07fc 100644
--- a/amdgpu/amdgpu_asic_id.c
+++ b/amdgpu/amdgpu_asic_id.c
@@ -22,6 +22,13 @@
*
*/
+#define _XOPEN_SOURCE 700
+#define _LARGEFILE64_SOURCE
+#define _FILE_OFFSET_BITS 64
+#include <ftw.h>
+#include <link.h>
+#include <limits.h>
+
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
@@ -34,6 +41,19 @@
#include "amdgpu_drm.h"
#include "amdgpu_internal.h"
+static char *amdgpuids_path = NULL;
+static const char* amdgpuids_path_msg = NULL;
+
+static int check_for_location_of_amdgpuids(const char *filepath, const struct stat *info, const int typeflag, struct FTW *pathinfo)
+{
+ if (typeflag == FTW_F && strstr(filepath, "amdgpu.ids")) {
+ amdgpuids_path = strdup(filepath);
+ return 1;
+ }
+
+ return 0;
+}
+
static int parse_one_line(struct amdgpu_device *dev, const char *line)
{
char *buf, *saveptr;
@@ -113,10 +133,46 @@ void amdgpu_parse_asic_ids(struct amdgpu_device *dev)
int line_num = 1;
int r = 0;
+ // attempt to find typical location for amdgpu.ids file
fp = fopen(AMDGPU_ASIC_ID_TABLE, "r");
+
+ // if it doesn't exist, search
+ if (!fp) {
+
+ char self_path[ PATH_MAX ];
+ ssize_t count;
+ ssize_t i;
+
+ count = readlink( "/proc/self/exe", self_path, PATH_MAX );
+ if (count > 0) {
+ self_path[count] = '\0';
+
+ // remove '/bin/python' from self_path
+ for (i=count; i>0; --i) {
+ if (self_path[i] == '/') break;
+ self_path[i] = '\0';
+ }
+ self_path[i] = '\0';
+ for (; i>0; --i) {
+ if (self_path[i] == '/') break;
+ self_path[i] = '\0';
+ }
+ self_path[i] = '\0';
+
+ if (1 == nftw(self_path, check_for_location_of_amdgpuids, 5, FTW_PHYS)) {
+ fp = fopen(amdgpuids_path, "r");
+ amdgpuids_path_msg = amdgpuids_path;
+ }
+ }
+
+ }
+ else {
+ amdgpuids_path_msg = AMDGPU_ASIC_ID_TABLE;
+ }
+
+ // both hard-coded location and search have failed
if (!fp) {
- fprintf(stderr, "%s: %s\n", AMDGPU_ASIC_ID_TABLE,
- strerror(errno));
+ fprintf(stderr, "amdgpu.ids: No such file or directory\n");
return;
}
@@ -132,7 +188,7 @@ void amdgpu_parse_asic_ids(struct amdgpu_device *dev)
continue;
}
- drmMsg("%s version: %s\n", AMDGPU_ASIC_ID_TABLE, line);
+ drmMsg("%s version: %s\n", amdgpuids_path_msg, line);
break;
}
@@ -150,7 +206,7 @@ void amdgpu_parse_asic_ids(struct amdgpu_device *dev)
if (r == -EINVAL) {
fprintf(stderr, "Invalid format: %s: line %d: %s\n",
- AMDGPU_ASIC_ID_TABLE, line_num, line);
+ amdgpuids_path_msg, line_num, line);
} else if (r && r != -EAGAIN) {
fprintf(stderr, "%s: Cannot parse ASIC IDs: %s\n",
__func__, strerror(-r));
EOF
###########################
### build
###########################
meson builddir --prefix=/opt/amdgpu
pushd builddir
ninja install
popd
popd


@ -1,7 +1,11 @@
#!/bin/bash
# Script used in CI and CD pipeline
set -ex
MKLROOT=${MKLROOT:-/opt/conda/envs/py_$ANACONDA_PYTHON_VERSION}
# "install" hipMAGMA into /opt/rocm/magma by copying after build
git clone https://bitbucket.org/icl/magma.git
pushd magma
@ -11,7 +15,10 @@ git checkout a1625ff4d9bc362906bd01f805dbbe12612953f6
cp make.inc-examples/make.inc.hip-gcc-mkl make.inc
echo 'LIBDIR += -L$(MKLROOT)/lib' >> make.inc
echo 'LIB += -Wl,--enable-new-dtags -Wl,--rpath,/opt/rocm/lib -Wl,--rpath,$(MKLROOT)/lib -Wl,--rpath,/opt/rocm/magma/lib' >> make.inc
if [[ -f "${MKLROOT}/lib/libmkl_core.a" ]]; then
echo 'LIB = -Wl,--start-group -lmkl_gf_lp64 -lmkl_gnu_thread -lmkl_core -Wl,--end-group -lpthread -lstdc++ -lm -lgomp -lhipblas -lhipsparse' >> make.inc
fi
echo 'LIB += -Wl,--enable-new-dtags -Wl,--rpath,/opt/rocm/lib -Wl,--rpath,$(MKLROOT)/lib -Wl,--rpath,/opt/rocm/magma/lib -ldl' >> make.inc
echo 'DEVCCFLAGS += --gpu-max-threads-per-block=256' >> make.inc
export PATH="${PATH}:/opt/rocm/bin"
if [[ -n "$PYTORCH_ROCM_ARCH" ]]; then
@ -25,7 +32,7 @@ done
# hipcc with openmp flag may cause isnan() on __device__ not to be found; depending on context, compiler may attempt to match with host definition
sed -i 's/^FOPENMP/#FOPENMP/g' make.inc
make -f make.gen.hipMAGMA -j $(nproc)
LANG=C.UTF-8 make lib/libmagma.so -j $(nproc) MKLROOT=/opt/conda/envs/py_$ANACONDA_PYTHON_VERSION
make testing/testing_dgemm -j $(nproc) MKLROOT=/opt/conda/envs/py_$ANACONDA_PYTHON_VERSION
LANG=C.UTF-8 make lib/libmagma.so -j $(nproc) MKLROOT="${MKLROOT}"
make testing/testing_dgemm -j $(nproc) MKLROOT="${MKLROOT}"
popd
mv magma /opt/rocm


@ -1,6 +1,6 @@
#!/bin/bash
set -xe
# Script used in CI and CD pipeline
# Intel® software for general purpose GPU capabilities.
# Refer to https://www.intel.com/content/www/us/en/developer/articles/tool/pytorch-prerequisites-for-intel-gpus.html
@ -8,19 +8,23 @@ set -xe
# Users should update to the latest version as it becomes available
function install_ubuntu() {
. /etc/os-release
if [[ ! " jammy " =~ " ${VERSION_CODENAME} " ]]; then
echo "Ubuntu version ${VERSION_CODENAME} not supported"
exit
fi
apt-get update -y
apt-get install -y gpg-agent wget
# Set up the repository. To do this, download the key to the system keyring
# To add the online network package repository for the GPU Driver LTS releases
wget -qO - https://repositories.intel.com/gpu/intel-graphics.key \
| gpg --dearmor --output /usr/share/keyrings/intel-graphics.gpg
wget -qO - https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB \
| gpg --dearmor --output /usr/share/keyrings/intel-for-pytorch-gpu-dev-keyring.gpg
# Add the signed entry to APT sources and configure the APT client to use the Intel repository
| gpg --yes --dearmor --output /usr/share/keyrings/intel-graphics.gpg
echo "deb [arch=amd64 signed-by=/usr/share/keyrings/intel-graphics.gpg] \
https://repositories.intel.com/gpu/ubuntu jammy/lts/2350 unified" \
| tee /etc/apt/sources.list.d/intel-gpu-jammy.list
https://repositories.intel.com/gpu/ubuntu ${VERSION_CODENAME}/lts/2350 unified" \
| tee /etc/apt/sources.list.d/intel-gpu-${VERSION_CODENAME}.list
# To add the online network package repository for the Intel Support Packages
wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB \
| gpg --dearmor > /usr/share/keyrings/intel-for-pytorch-gpu-dev-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/intel-for-pytorch-gpu-dev-keyring.gpg] \
https://apt.repos.intel.com/intel-for-pytorch-gpu-dev all main" \
| tee /etc/apt/sources.list.d/intel-for-pytorch-gpu-dev.list
@ -97,6 +101,86 @@ EOF
rm -rf /var/lib/yum/history
}
function install_rhel() {
. /etc/os-release
if [[ "${ID}" == "rhel" ]]; then
if [[ ! " 8.6 8.8 8.9 9.0 9.2 9.3 " =~ " ${VERSION_ID} " ]]; then
echo "RHEL version ${VERSION_ID} not supported"
exit
fi
elif [[ "${ID}" == "almalinux" ]]; then
# Workaround for almalinux8, which is used by quay.io/pypa/manylinux_2_28_x86_64
VERSION_ID="8.6"
fi
dnf install -y 'dnf-command(config-manager)'
# To add the online network package repository for the GPU Driver LTS releases
dnf config-manager --add-repo \
https://repositories.intel.com/gpu/rhel/${VERSION_ID}/lts/2350/unified/intel-gpu-${VERSION_ID}.repo
# To add the online network package repository for the Intel Support Packages
tee > /etc/yum.repos.d/intel-for-pytorch-gpu-dev.repo << EOF
[intel-for-pytorch-gpu-dev]
name=Intel for Pytorch GPU dev repository
baseurl=https://yum.repos.intel.com/intel-for-pytorch-gpu-dev
enabled=1
gpgcheck=1
repo_gpgcheck=1
gpgkey=https://yum.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB
EOF
# The xpu-smi packages
dnf install -y xpu-smi
# Compute and Media Runtimes
dnf install -y \
intel-opencl intel-media intel-mediasdk libmfxgen1 libvpl2\
level-zero intel-level-zero-gpu mesa-dri-drivers mesa-vulkan-drivers \
mesa-vdpau-drivers libdrm mesa-libEGL mesa-libgbm mesa-libGL \
mesa-libxatracker libvpl-tools intel-metrics-discovery \
intel-metrics-library intel-igc-core intel-igc-cm \
libva libva-utils intel-gmmlib libmetee intel-gsc intel-ocloc
# Development packages
dnf install -y --refresh \
intel-igc-opencl-devel level-zero-devel intel-gsc-devel libmetee-devel \
level-zero-devel
# Install Intel Support Packages
yum install -y intel-for-pytorch-gpu-dev intel-pti-dev
# Cleanup
dnf clean all
rm -rf /var/cache/yum
rm -rf /var/lib/yum/yumdb
rm -rf /var/lib/yum/history
}
function install_sles() {
. /etc/os-release
VERSION_SP=${VERSION_ID//./sp}
if [[ ! " 15sp4 15sp5 " =~ " ${VERSION_SP} " ]]; then
echo "SLES version ${VERSION_ID} not supported"
exit
fi
# To add the online network package repository for the GPU Driver LTS releases
zypper addrepo -f -r \
https://repositories.intel.com/gpu/sles/${VERSION_SP}/lts/2350/unified/intel-gpu-${VERSION_SP}.repo
rpm --import https://repositories.intel.com/gpu/intel-graphics.key
# To add the online network package repository for the Intel Support Packages
zypper addrepo https://yum.repos.intel.com/intel-for-pytorch-gpu-dev intel-for-pytorch-gpu-dev
rpm --import https://yum.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB
# The xpu-smi packages
zypper install -y lsb-release flex bison xpu-smi
# Compute and Media Runtimes
zypper install -y intel-level-zero-gpu level-zero intel-gsc intel-opencl intel-ocloc \
intel-media-driver libigfxcmrt7 libvpl2 libvpl-tools libmfxgen1 libmfx1
# Development packages
zypper install -y libigdfcl-devel intel-igc-cm libigfxcmrt-devel level-zero-devel
# Install Intel Support Packages
zypper install -y intel-for-pytorch-gpu-dev intel-pti-dev
}
# The installation depends on the base OS
ID=$(grep -oP '(?<=^ID=).+' /etc/os-release | tr -d '"')
@ -107,6 +191,12 @@ case "$ID" in
centos)
install_centos
;;
rhel|almalinux)
install_rhel
;;
sles)
install_sles
;;
*)
echo "Unable to determine OS..."
exit 1

.ci/docker/conda/Dockerfile

@ -0,0 +1,101 @@
ARG CUDA_VERSION=10.2
ARG BASE_TARGET=cuda${CUDA_VERSION}
FROM centos:7 as base
ENV LC_ALL en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LANGUAGE en_US.UTF-8
ARG DEVTOOLSET_VERSION=9
RUN sed -i s/mirror.centos.org/vault.centos.org/g /etc/yum.repos.d/*.repo
RUN sed -i s/^#.*baseurl=http/baseurl=http/g /etc/yum.repos.d/*.repo
RUN sed -i s/^mirrorlist=http/#mirrorlist=http/g /etc/yum.repos.d/*.repo
RUN yum update -y
RUN yum install -y wget curl perl util-linux xz bzip2 git patch which unzip
# Just add everything as a safe.directory for git since these will be used in multiple places with git
RUN git config --global --add safe.directory '*'
RUN yum install -y yum-utils centos-release-scl
RUN yum-config-manager --enable rhel-server-rhscl-7-rpms
RUN sed -i s/mirror.centos.org/vault.centos.org/g /etc/yum.repos.d/*.repo
RUN sed -i s/^#.*baseurl=http/baseurl=http/g /etc/yum.repos.d/*.repo
RUN sed -i s/^mirrorlist=http/#mirrorlist=http/g /etc/yum.repos.d/*.repo
RUN yum install -y devtoolset-${DEVTOOLSET_VERSION}-gcc devtoolset-${DEVTOOLSET_VERSION}-gcc-c++ devtoolset-${DEVTOOLSET_VERSION}-gcc-gfortran devtoolset-${DEVTOOLSET_VERSION}-binutils
# EPEL for cmake
RUN wget http://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm && \
rpm -ivh epel-release-latest-7.noarch.rpm && \
rm -f epel-release-latest-7.noarch.rpm
# cmake
RUN yum install -y cmake3 && \
ln -s /usr/bin/cmake3 /usr/bin/cmake
ENV PATH=/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/bin:$PATH
ENV LD_LIBRARY_PATH=/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/lib64:/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/lib:$LD_LIBRARY_PATH
RUN yum install -y autoconf aclocal automake make sudo
RUN rm -rf /usr/local/cuda-*
FROM base as patchelf
# Install patchelf
ADD ./common/install_patchelf.sh install_patchelf.sh
RUN bash ./install_patchelf.sh && rm install_patchelf.sh && cp $(which patchelf) /patchelf
FROM base as openssl
# Install openssl
ADD ./common/install_openssl.sh install_openssl.sh
RUN bash ./install_openssl.sh && rm install_openssl.sh
FROM base as conda
# Install Anaconda
ADD ./common/install_conda_docker.sh install_conda.sh
RUN bash ./install_conda.sh && rm install_conda.sh
# Install CUDA
FROM base as cuda
ARG CUDA_VERSION=10.2
RUN rm -rf /usr/local/cuda-*
ADD ./common/install_cuda.sh install_cuda.sh
ENV CUDA_HOME=/usr/local/cuda-${CUDA_VERSION}
# Preserve CUDA_VERSION for the builds
ENV CUDA_VERSION=${CUDA_VERSION}
# Make things in our path by default
ENV PATH=/usr/local/cuda-${CUDA_VERSION}/bin:$PATH
FROM cuda as cuda11.8
RUN bash ./install_cuda.sh 11.8
ENV DESIRED_CUDA=11.8
FROM cuda as cuda12.1
RUN bash ./install_cuda.sh 12.1
ENV DESIRED_CUDA=12.1
FROM cuda as cuda12.4
RUN bash ./install_cuda.sh 12.4
ENV DESIRED_CUDA=12.4
# Install MNIST test data
FROM base as mnist
ADD ./common/install_mnist.sh install_mnist.sh
RUN bash ./install_mnist.sh
FROM base as all_cuda
COPY --from=cuda11.8 /usr/local/cuda-11.8 /usr/local/cuda-11.8
COPY --from=cuda12.1 /usr/local/cuda-12.1 /usr/local/cuda-12.1
COPY --from=cuda12.4 /usr/local/cuda-12.4 /usr/local/cuda-12.4
# Final step
FROM ${BASE_TARGET} as final
COPY --from=openssl /opt/openssl /opt/openssl
COPY --from=patchelf /patchelf /usr/local/bin/patchelf
COPY --from=conda /opt/conda /opt/conda
# Add jni.h for java host build.
COPY ./common/install_jni.sh install_jni.sh
COPY ./java/jni.h jni.h
RUN bash ./install_jni.sh && rm install_jni.sh
ENV PATH /opt/conda/bin:$PATH
COPY --from=mnist /usr/local/mnist /usr/local/mnist
RUN rm -rf /usr/local/cuda
RUN chmod o+rw /usr/local
RUN touch /.condarc && \
chmod o+rw /.condarc && \
chmod -R o+rw /opt/conda

.ci/docker/conda/build.sh

@ -0,0 +1,76 @@
#!/usr/bin/env bash
# Script used only in CD pipeline
set -eou pipefail
image="$1"
shift
if [ -z "${image}" ]; then
echo "Usage: $0 IMAGE"
exit 1
fi
DOCKER_IMAGE_NAME="pytorch/${image}"
export DOCKER_BUILDKIT=1
TOPDIR=$(git rev-parse --show-toplevel)
CUDA_VERSION=${CUDA_VERSION:-12.1}
case ${CUDA_VERSION} in
cpu)
BASE_TARGET=base
DOCKER_TAG=cpu
;;
all)
BASE_TARGET=all_cuda
DOCKER_TAG=latest
;;
*)
BASE_TARGET=cuda${CUDA_VERSION}
DOCKER_TAG=cuda${CUDA_VERSION}
;;
esac
(
set -x
docker build \
--target final \
--progress plain \
--build-arg "BASE_TARGET=${BASE_TARGET}" \
--build-arg "CUDA_VERSION=${CUDA_VERSION}" \
--build-arg "DEVTOOLSET_VERSION=9" \
-t ${DOCKER_IMAGE_NAME} \
$@ \
-f "${TOPDIR}/.ci/docker/conda/Dockerfile" \
${TOPDIR}/.ci/docker/
)
if [[ "${DOCKER_TAG}" =~ ^cuda* ]]; then
# Test that we're using the right CUDA compiler
(
set -x
docker run --rm "${DOCKER_IMAGE_NAME}" nvcc --version | grep "cuda_${CUDA_VERSION}"
)
fi
GITHUB_REF=${GITHUB_REF:-$(git symbolic-ref -q HEAD || git describe --tags --exact-match)}
GIT_BRANCH_NAME=${GITHUB_REF##*/}
GIT_COMMIT_SHA=${GITHUB_SHA:-$(git rev-parse HEAD)}
DOCKER_IMAGE_BRANCH_TAG=${DOCKER_IMAGE_NAME}-${GIT_BRANCH_NAME}
DOCKER_IMAGE_SHA_TAG=${DOCKER_IMAGE_NAME}-${GIT_COMMIT_SHA}
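# (Hypothetical example, image name assumed) For image="conda-builder:cuda12.1" built from
# the main branch, the extra tags pushed below would be pytorch/conda-builder:cuda12.1-main
# and pytorch/conda-builder:cuda12.1-<commit sha>.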
if [[ "${WITH_PUSH:-}" == true ]]; then
(
set -x
docker push "${DOCKER_IMAGE_NAME}"
if [[ -n ${GITHUB_REF} ]]; then
docker tag ${DOCKER_IMAGE_NAME} ${DOCKER_IMAGE_BRANCH_TAG}
docker tag ${DOCKER_IMAGE_NAME} ${DOCKER_IMAGE_SHA_TAG}
docker push "${DOCKER_IMAGE_BRANCH_TAG}"
docker push "${DOCKER_IMAGE_SHA_TAG}"
fi
)
fi


@ -0,0 +1,107 @@
ARG BASE_TARGET=base
ARG GPU_IMAGE=ubuntu:20.04
FROM ${GPU_IMAGE} as base
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get clean && apt-get update
RUN apt-get install -y curl locales g++ git-all autoconf automake make cmake wget unzip sudo
# Just add everything as a safe.directory for git since these will be used in multiple places with git
RUN git config --global --add safe.directory '*'
RUN locale-gen en_US.UTF-8
ENV LC_ALL en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LANGUAGE en_US.UTF-8
# Install openssl
FROM base as openssl
ADD ./common/install_openssl.sh install_openssl.sh
RUN bash ./install_openssl.sh && rm install_openssl.sh
# Install python
FROM base as python
ADD common/install_cpython.sh install_cpython.sh
RUN apt-get update -y && \
apt-get install build-essential gdb lcov libbz2-dev libffi-dev \
libgdbm-dev liblzma-dev libncurses5-dev libreadline6-dev \
libsqlite3-dev libssl-dev lzma lzma-dev tk-dev uuid-dev zlib1g-dev -y && \
bash ./install_cpython.sh && \
rm install_cpython.sh && \
apt-get clean
FROM base as conda
ADD ./common/install_conda_docker.sh install_conda.sh
RUN bash ./install_conda.sh && rm install_conda.sh
FROM base as cpu
# Install Anaconda
COPY --from=conda /opt/conda /opt/conda
# Install python
COPY --from=python /opt/python /opt/python
COPY --from=python /opt/_internal /opt/_internal
ENV PATH=/opt/conda/bin:/usr/local/cuda/bin:$PATH
# Install MKL
ADD ./common/install_mkl.sh install_mkl.sh
RUN bash ./install_mkl.sh && rm install_mkl.sh
FROM cpu as cuda
ADD ./common/install_cuda.sh install_cuda.sh
ADD ./common/install_magma.sh install_magma.sh
ENV CUDA_HOME /usr/local/cuda
FROM cuda as cuda11.8
RUN bash ./install_cuda.sh 11.8
RUN bash ./install_magma.sh 11.8
RUN ln -sf /usr/local/cuda-11.8 /usr/local/cuda
FROM cuda as cuda12.1
RUN bash ./install_cuda.sh 12.1
RUN bash ./install_magma.sh 12.1
RUN ln -sf /usr/local/cuda-12.1 /usr/local/cuda
FROM cuda as cuda12.4
RUN bash ./install_cuda.sh 12.4
RUN bash ./install_magma.sh 12.4
RUN ln -sf /usr/local/cuda-12.4 /usr/local/cuda
FROM cpu as rocm
ARG PYTORCH_ROCM_ARCH
ENV PYTORCH_ROCM_ARCH ${PYTORCH_ROCM_ARCH}
ENV MKLROOT /opt/intel
# Add the ROCM_PATH env var so that LoadHip.cmake (even with logic updated for ROCm 6.0)
# can find HIP for ROCm 5.7. Not needed for ROCm 6.0 and above.
# Remove below when ROCm 5.7 is no longer in the support matrix.
ENV ROCM_PATH /opt/rocm
# No need to install ROCm as base docker image should have full ROCm install
#ADD ./common/install_rocm.sh install_rocm.sh
ADD ./common/install_rocm_drm.sh install_rocm_drm.sh
ADD ./common/install_rocm_magma.sh install_rocm_magma.sh
# gfortran and python needed for building magma from source for ROCm
RUN apt-get update -y && \
apt-get install gfortran -y && \
apt-get install python -y && \
apt-get clean
RUN bash ./install_rocm_drm.sh && rm install_rocm_drm.sh
RUN bash ./install_rocm_magma.sh && rm install_rocm_magma.sh
# Install AOTriton
COPY ./common/common_utils.sh common_utils.sh
COPY ./common/aotriton_version.txt aotriton_version.txt
COPY ./common/install_aotriton.sh install_aotriton.sh
RUN bash ./install_aotriton.sh /opt/rocm && rm install_aotriton.sh aotriton_version.txt
ENV AOTRITON_INSTALLED_PREFIX /opt/rocm/aotriton
FROM ${BASE_TARGET} as final
COPY --from=openssl /opt/openssl /opt/openssl
# Install patchelf
ADD ./common/install_patchelf.sh install_patchelf.sh
RUN bash ./install_patchelf.sh && rm install_patchelf.sh
# Install Anaconda
COPY --from=conda /opt/conda /opt/conda
# Install python
COPY --from=python /opt/python /opt/python
COPY --from=python /opt/_internal /opt/_internal
ENV PATH=/opt/conda/bin:/usr/local/cuda/bin:$PATH

.ci/docker/libtorch/build.sh

@ -0,0 +1,93 @@
#!/usr/bin/env bash
# Script used only in CD pipeline
set -eou pipefail
image="$1"
shift
if [ -z "${image}" ]; then
echo "Usage: $0 IMAGE"
exit 1
fi
DOCKER_IMAGE="pytorch/${image}"
TOPDIR=$(git rev-parse --show-toplevel)
GPU_ARCH_TYPE=${GPU_ARCH_TYPE:-cpu}
GPU_ARCH_VERSION=${GPU_ARCH_VERSION:-}
WITH_PUSH=${WITH_PUSH:-}
DOCKER=${DOCKER:-docker}
case ${GPU_ARCH_TYPE} in
cpu)
BASE_TARGET=cpu
DOCKER_TAG=cpu
GPU_IMAGE=ubuntu:20.04
DOCKER_GPU_BUILD_ARG=""
;;
cuda)
BASE_TARGET=cuda${GPU_ARCH_VERSION}
DOCKER_TAG=cuda${GPU_ARCH_VERSION}
GPU_IMAGE=ubuntu:20.04
DOCKER_GPU_BUILD_ARG=""
;;
rocm)
BASE_TARGET=rocm
DOCKER_TAG=rocm${GPU_ARCH_VERSION}
GPU_IMAGE=rocm/dev-ubuntu-20.04:${GPU_ARCH_VERSION}-complete
PYTORCH_ROCM_ARCH="gfx900;gfx906;gfx908;gfx90a;gfx1030;gfx1100"
ROCM_REGEX="([0-9]+)\.([0-9]+)[\.]?([0-9]*)"
if [[ $GPU_ARCH_VERSION =~ $ROCM_REGEX ]]; then
ROCM_VERSION_INT=$((${BASH_REMATCH[1]}*10000 + ${BASH_REMATCH[2]}*100 + ${BASH_REMATCH[3]:-0}))
else
echo "ERROR: rocm regex failed"
exit 1
fi
if [[ $ROCM_VERSION_INT -ge 60000 ]]; then
PYTORCH_ROCM_ARCH+=";gfx942"
fi
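# (Illustration, example values only) GPU_ARCH_VERSION=6.0 gives ROCM_VERSION_INT=60000,
# which satisfies the >= 60000 check above, so gfx942 is appended to PYTORCH_ROCM_ARCH.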
DOCKER_GPU_BUILD_ARG="--build-arg PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH}"
;;
*)
echo "ERROR: Unrecognized GPU_ARCH_TYPE: ${GPU_ARCH_TYPE}"
exit 1
;;
esac
(
set -x
DOCKER_BUILDKIT=1 ${DOCKER} build \
--target final \
${DOCKER_GPU_BUILD_ARG} \
--build-arg "GPU_IMAGE=${GPU_IMAGE}" \
--build-arg "BASE_TARGET=${BASE_TARGET}" \
-t "${DOCKER_IMAGE}" \
$@ \
-f "${TOPDIR}/.ci/docker/libtorch/Dockerfile" \
"${TOPDIR}/.ci/docker/"
)
GITHUB_REF=${GITHUB_REF:-$(git symbolic-ref -q HEAD || git describe --tags --exact-match)}
GIT_BRANCH_NAME=${GITHUB_REF##*/}
GIT_COMMIT_SHA=${GITHUB_SHA:-$(git rev-parse HEAD)}
DOCKER_IMAGE_BRANCH_TAG=${DOCKER_IMAGE}-${GIT_BRANCH_NAME}
DOCKER_IMAGE_SHA_TAG=${DOCKER_IMAGE}-${GIT_COMMIT_SHA}
if [[ "${WITH_PUSH}" == true ]]; then
(
set -x
${DOCKER} push "${DOCKER_IMAGE}"
if [[ -n ${GITHUB_REF} ]]; then
${DOCKER} tag ${DOCKER_IMAGE} ${DOCKER_IMAGE_BRANCH_TAG}
${DOCKER} tag ${DOCKER_IMAGE} ${DOCKER_IMAGE_SHA_TAG}
${DOCKER} push "${DOCKER_IMAGE_BRANCH_TAG}"
${DOCKER} push "${DOCKER_IMAGE_SHA_TAG}"
fi
)
fi


@ -0,0 +1,203 @@
# syntax = docker/dockerfile:experimental
ARG ROCM_VERSION=3.7
ARG BASE_CUDA_VERSION=11.8
ARG GPU_IMAGE=centos:7
FROM centos:7 as base
ENV LC_ALL en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LANGUAGE en_US.UTF-8
ARG DEVTOOLSET_VERSION=9
# Note: This patch is required since CentOS has reached EOL,
# otherwise any yum install step will fail
RUN sed -i s/mirror.centos.org/vault.centos.org/g /etc/yum.repos.d/*.repo
RUN sed -i s/^#.*baseurl=http/baseurl=http/g /etc/yum.repos.d/*.repo
RUN sed -i s/^mirrorlist=http/#mirrorlist=http/g /etc/yum.repos.d/*.repo
RUN yum install -y wget curl perl util-linux xz bzip2 git patch which perl zlib-devel
# Just add everything as a safe.directory for git since these will be used in multiple places with git
RUN git config --global --add safe.directory '*'
RUN yum install -y yum-utils centos-release-scl
RUN yum-config-manager --enable rhel-server-rhscl-7-rpms
# Note: After running yum-config-manager --enable rhel-server-rhscl-7-rpms
# the patch is required once again. Somehow this step adds mirror.centos.org back
RUN sed -i s/mirror.centos.org/vault.centos.org/g /etc/yum.repos.d/*.repo
RUN sed -i s/^#.*baseurl=http/baseurl=http/g /etc/yum.repos.d/*.repo
RUN sed -i s/^mirrorlist=http/#mirrorlist=http/g /etc/yum.repos.d/*.repo
RUN yum install -y devtoolset-${DEVTOOLSET_VERSION}-gcc devtoolset-${DEVTOOLSET_VERSION}-gcc-c++ devtoolset-${DEVTOOLSET_VERSION}-gcc-gfortran devtoolset-${DEVTOOLSET_VERSION}-binutils
ENV PATH=/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/bin:$PATH
ENV LD_LIBRARY_PATH=/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/lib64:/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/lib:$LD_LIBRARY_PATH
RUN wget http://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm && \
rpm -ivh epel-release-latest-7.noarch.rpm && \
rm -f epel-release-latest-7.noarch.rpm
# cmake-3.18.4 from pip
RUN yum install -y python3-pip && \
python3 -mpip install cmake==3.18.4 && \
ln -s /usr/local/bin/cmake /usr/bin/cmake
RUN yum install -y autoconf aclocal automake make sudo
FROM base as openssl
# Install openssl (this must precede `build python` step)
# (In order to have a proper SSL module, Python is compiled
# against a recent openssl [see env vars above], which is linked
# statically. We delete openssl afterwards.)
ADD ./common/install_openssl.sh install_openssl.sh
RUN bash ./install_openssl.sh && rm install_openssl.sh
# EPEL for cmake
FROM base as patchelf
# Install patchelf
ADD ./common/install_patchelf.sh install_patchelf.sh
RUN bash ./install_patchelf.sh && rm install_patchelf.sh
RUN cp $(which patchelf) /patchelf
FROM patchelf as python
# build python
COPY manywheel/build_scripts /build_scripts
ADD ./common/install_cpython.sh /build_scripts/install_cpython.sh
RUN bash build_scripts/build.sh && rm -r build_scripts
FROM base as cuda
ARG BASE_CUDA_VERSION=10.2
# Install CUDA
ADD ./common/install_cuda.sh install_cuda.sh
RUN bash ./install_cuda.sh ${BASE_CUDA_VERSION} && rm install_cuda.sh
FROM base as intel
# MKL
ADD ./common/install_mkl.sh install_mkl.sh
RUN bash ./install_mkl.sh && rm install_mkl.sh
FROM base as magma
ARG BASE_CUDA_VERSION=10.2
# Install magma
ADD ./common/install_magma.sh install_magma.sh
RUN bash ./install_magma.sh ${BASE_CUDA_VERSION} && rm install_magma.sh
FROM base as jni
# Install java jni header
ADD ./common/install_jni.sh install_jni.sh
ADD ./java/jni.h jni.h
RUN bash ./install_jni.sh && rm install_jni.sh
FROM base as libpng
# Install libpng
ADD ./common/install_libpng.sh install_libpng.sh
RUN bash ./install_libpng.sh && rm install_libpng.sh
FROM ${GPU_IMAGE} as common
RUN sed -i s/mirror.centos.org/vault.centos.org/g /etc/yum.repos.d/*.repo
RUN sed -i s/^#.*baseurl=http/baseurl=http/g /etc/yum.repos.d/*.repo
RUN sed -i s/^mirrorlist=http/#mirrorlist=http/g /etc/yum.repos.d/*.repo
ENV LC_ALL en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LANGUAGE en_US.UTF-8
RUN yum install -y \
aclocal \
autoconf \
automake \
bison \
bzip2 \
curl \
diffutils \
file \
git \
make \
patch \
perl \
unzip \
util-linux \
wget \
which \
xz \
yasm
RUN yum install -y \
https://repo.ius.io/ius-release-el7.rpm \
https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
RUN yum swap -y git git236-core
# git236+ would refuse to run git commands in repos owned by other users,
# which causes the version check to fail, as the pytorch repo is bind-mounted into the image
# Override this behaviour by treating every folder as safe
# For more details see https://github.com/pytorch/pytorch/issues/78659#issuecomment-1144107327
RUN git config --global --add safe.directory "*"
ENV SSL_CERT_FILE=/opt/_internal/certs.pem
# Install LLVM version
COPY --from=openssl /opt/openssl /opt/openssl
COPY --from=python /opt/python /opt/python
COPY --from=python /opt/_internal /opt/_internal
COPY --from=python /opt/python/cp39-cp39/bin/auditwheel /usr/local/bin/auditwheel
COPY --from=intel /opt/intel /opt/intel
COPY --from=patchelf /usr/local/bin/patchelf /usr/local/bin/patchelf
COPY --from=jni /usr/local/include/jni.h /usr/local/include/jni.h
COPY --from=libpng /usr/local/bin/png* /usr/local/bin/
COPY --from=libpng /usr/local/bin/libpng* /usr/local/bin/
COPY --from=libpng /usr/local/include/png* /usr/local/include/
COPY --from=libpng /usr/local/include/libpng* /usr/local/include/
COPY --from=libpng /usr/local/lib/libpng* /usr/local/lib/
COPY --from=libpng /usr/local/lib/pkgconfig /usr/local/lib/pkgconfig
FROM common as cpu_final
ARG BASE_CUDA_VERSION=10.1
ARG DEVTOOLSET_VERSION=9
RUN sed -i s/mirror.centos.org/vault.centos.org/g /etc/yum.repos.d/*.repo
RUN sed -i s/^#.*baseurl=http/baseurl=http/g /etc/yum.repos.d/*.repo
RUN sed -i s/^mirrorlist=http/#mirrorlist=http/g /etc/yum.repos.d/*.repo
RUN yum install -y yum-utils centos-release-scl
RUN yum-config-manager --enable rhel-server-rhscl-7-rpms
RUN sed -i s/mirror.centos.org/vault.centos.org/g /etc/yum.repos.d/*.repo
RUN sed -i s/^#.*baseurl=http/baseurl=http/g /etc/yum.repos.d/*.repo
RUN sed -i s/^mirrorlist=http/#mirrorlist=http/g /etc/yum.repos.d/*.repo
RUN yum install -y devtoolset-${DEVTOOLSET_VERSION}-gcc devtoolset-${DEVTOOLSET_VERSION}-gcc-c++ devtoolset-${DEVTOOLSET_VERSION}-gcc-gfortran devtoolset-${DEVTOOLSET_VERSION}-binutils
ENV PATH=/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/bin:$PATH
ENV LD_LIBRARY_PATH=/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/lib64:/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/lib:$LD_LIBRARY_PATH
# cmake is already installed inside the rocm base image, so remove if present
RUN rpm -e cmake || true
# cmake-3.18.4 from pip
RUN yum install -y python3-pip && \
python3 -mpip install cmake==3.18.4 && \
ln -s /usr/local/bin/cmake /usr/bin/cmake
# ninja
RUN yum install -y ninja-build
FROM cpu_final as cuda_final
RUN rm -rf /usr/local/cuda-${BASE_CUDA_VERSION}
COPY --from=cuda /usr/local/cuda-${BASE_CUDA_VERSION} /usr/local/cuda-${BASE_CUDA_VERSION}
COPY --from=magma /usr/local/cuda-${BASE_CUDA_VERSION} /usr/local/cuda-${BASE_CUDA_VERSION}
RUN ln -sf /usr/local/cuda-${BASE_CUDA_VERSION} /usr/local/cuda
ENV PATH=/usr/local/cuda/bin:$PATH
FROM cpu_final as rocm_final
ARG ROCM_VERSION=3.7
ARG PYTORCH_ROCM_ARCH
ENV PYTORCH_ROCM_ARCH ${PYTORCH_ROCM_ARCH}
# Add the ROCM_PATH env var so that LoadHip.cmake (even with logic updated for ROCm 6.0)
# can find HIP for ROCm 5.7. Not needed for ROCm 6.0 and above.
# Remove below when ROCm 5.7 is no longer in the support matrix.
ENV ROCM_PATH /opt/rocm
ENV MKLROOT /opt/intel
# No need to install ROCm as base docker image should have full ROCm install
#ADD ./common/install_rocm.sh install_rocm.sh
#RUN ROCM_VERSION=${ROCM_VERSION} bash ./install_rocm.sh && rm install_rocm.sh
ADD ./common/install_rocm_drm.sh install_rocm_drm.sh
RUN bash ./install_rocm_drm.sh && rm install_rocm_drm.sh
# cmake3 is needed for the MIOpen build
RUN ln -sf /usr/local/bin/cmake /usr/bin/cmake3
ADD ./common/install_rocm_magma.sh install_rocm_magma.sh
RUN bash ./install_rocm_magma.sh && rm install_rocm_magma.sh
ADD ./common/install_miopen.sh install_miopen.sh
RUN bash ./install_miopen.sh ${ROCM_VERSION} && rm install_miopen.sh
# Install AOTriton
COPY ./common/common_utils.sh common_utils.sh
COPY ./common/aotriton_version.txt aotriton_version.txt
COPY ./common/install_aotriton.sh install_aotriton.sh
RUN bash ./install_aotriton.sh /opt/rocm && rm install_aotriton.sh aotriton_version.txt
ENV AOTRITON_INSTALLED_PREFIX /opt/rocm/aotriton


@ -0,0 +1,152 @@
# syntax = docker/dockerfile:experimental
ARG ROCM_VERSION=3.7
ARG BASE_CUDA_VERSION=10.2
ARG GPU_IMAGE=nvidia/cuda:${BASE_CUDA_VERSION}-devel-centos7
FROM quay.io/pypa/manylinux2014_x86_64 as base
ENV LC_ALL en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LANGUAGE en_US.UTF-8
RUN sed -i s/mirror.centos.org/vault.centos.org/g /etc/yum.repos.d/*.repo
RUN sed -i s/^#.*baseurl=http/baseurl=http/g /etc/yum.repos.d/*.repo
RUN sed -i s/^mirrorlist=http/#mirrorlist=http/g /etc/yum.repos.d/*.repo
RUN yum install -y wget curl perl util-linux xz bzip2 git patch which perl zlib-devel
RUN yum install -y yum-utils centos-release-scl sudo
RUN yum-config-manager --enable rhel-server-rhscl-7-rpms
RUN yum install -y devtoolset-7-gcc devtoolset-7-gcc-c++ devtoolset-7-gcc-gfortran devtoolset-7-binutils
ENV PATH=/opt/rh/devtoolset-7/root/usr/bin:$PATH
ENV LD_LIBRARY_PATH=/opt/rh/devtoolset-7/root/usr/lib64:/opt/rh/devtoolset-7/root/usr/lib:$LD_LIBRARY_PATH
# cmake
RUN yum install -y cmake3 && \
ln -s /usr/bin/cmake3 /usr/bin/cmake
FROM base as openssl
# Install openssl (this must precede `build python` step)
# (In order to have a proper SSL module, Python is compiled
# against a recent openssl [see env vars above], which is linked
# statically. We delete openssl afterwards.)
ADD ./common/install_openssl.sh install_openssl.sh
RUN bash ./install_openssl.sh && rm install_openssl.sh
# remove unnecessary python versions
RUN rm -rf /opt/python/cp26-cp26m /opt/_internal/cpython-2.6.9-ucs2
RUN rm -rf /opt/python/cp26-cp26mu /opt/_internal/cpython-2.6.9-ucs4
RUN rm -rf /opt/python/cp33-cp33m /opt/_internal/cpython-3.3.6
RUN rm -rf /opt/python/cp34-cp34m /opt/_internal/cpython-3.4.6
FROM base as cuda
ARG BASE_CUDA_VERSION=10.2
# Install CUDA
ADD ./common/install_cuda.sh install_cuda.sh
RUN bash ./install_cuda.sh ${BASE_CUDA_VERSION} && rm install_cuda.sh
FROM base as intel
# MKL
ADD ./common/install_mkl.sh install_mkl.sh
RUN bash ./install_mkl.sh && rm install_mkl.sh
FROM base as magma
ARG BASE_CUDA_VERSION=10.2
# Install magma
ADD ./common/install_magma.sh install_magma.sh
RUN bash ./install_magma.sh ${BASE_CUDA_VERSION} && rm install_magma.sh
FROM base as jni
# Install java jni header
ADD ./common/install_jni.sh install_jni.sh
ADD ./java/jni.h jni.h
RUN bash ./install_jni.sh && rm install_jni.sh
FROM base as libpng
# Install libpng
ADD ./common/install_libpng.sh install_libpng.sh
RUN bash ./install_libpng.sh && rm install_libpng.sh
FROM ${GPU_IMAGE} as common
RUN sed -i s/mirror.centos.org/vault.centos.org/g /etc/yum.repos.d/*.repo
RUN sed -i s/^#.*baseurl=http/baseurl=http/g /etc/yum.repos.d/*.repo
RUN sed -i s/^mirrorlist=http/#mirrorlist=http/g /etc/yum.repos.d/*.repo
ENV LC_ALL en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LANGUAGE en_US.UTF-8
RUN yum install -y \
aclocal \
autoconf \
automake \
bison \
bzip2 \
curl \
diffutils \
file \
git \
make \
patch \
perl \
unzip \
util-linux \
wget \
which \
xz \
yasm
RUN yum install -y \
https://repo.ius.io/ius-release-el7.rpm \
https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
RUN yum swap -y git git236-core
# git236+ would refuse to run git commands in repos owned by other users,
# which causes the version check to fail, as the pytorch repo is bind-mounted into the image
# Override this behaviour by treating every folder as safe
# For more details see https://github.com/pytorch/pytorch/issues/78659#issuecomment-1144107327
RUN git config --global --add safe.directory "*"
ENV SSL_CERT_FILE=/opt/_internal/certs.pem
# Install LLVM version
COPY --from=openssl /opt/openssl /opt/openssl
COPY --from=base /opt/python /opt/python
COPY --from=base /opt/_internal /opt/_internal
COPY --from=base /usr/local/bin/auditwheel /usr/local/bin/auditwheel
COPY --from=intel /opt/intel /opt/intel
COPY --from=base /usr/local/bin/patchelf /usr/local/bin/patchelf
COPY --from=libpng /usr/local/bin/png* /usr/local/bin/
COPY --from=libpng /usr/local/bin/libpng* /usr/local/bin/
COPY --from=libpng /usr/local/include/png* /usr/local/include/
COPY --from=libpng /usr/local/include/libpng* /usr/local/include/
COPY --from=libpng /usr/local/lib/libpng* /usr/local/lib/
COPY --from=libpng /usr/local/lib/pkgconfig /usr/local/lib/pkgconfig
COPY --from=jni /usr/local/include/jni.h /usr/local/include/jni.h
FROM common as cpu_final
ARG BASE_CUDA_VERSION=10.2
RUN yum install -y yum-utils centos-release-scl
RUN yum-config-manager --enable rhel-server-rhscl-7-rpms
RUN yum install -y devtoolset-7-gcc devtoolset-7-gcc-c++ devtoolset-7-gcc-gfortran devtoolset-7-binutils
ENV PATH=/opt/rh/devtoolset-7/root/usr/bin:$PATH
ENV LD_LIBRARY_PATH=/opt/rh/devtoolset-7/root/usr/lib64:/opt/rh/devtoolset-7/root/usr/lib:$LD_LIBRARY_PATH
# cmake
RUN yum install -y cmake3 && \
ln -s /usr/bin/cmake3 /usr/bin/cmake
# ninja
RUN yum install -y http://repo.okay.com.mx/centos/7/x86_64/release/okay-release-1-1.noarch.rpm
RUN yum install -y ninja-build
FROM cpu_final as cuda_final
RUN rm -rf /usr/local/cuda-${BASE_CUDA_VERSION}
COPY --from=cuda /usr/local/cuda-${BASE_CUDA_VERSION} /usr/local/cuda-${BASE_CUDA_VERSION}
COPY --from=magma /usr/local/cuda-${BASE_CUDA_VERSION} /usr/local/cuda-${BASE_CUDA_VERSION}
FROM common as rocm_final
ARG ROCM_VERSION=3.7
# Install ROCm
ADD ./common/install_rocm.sh install_rocm.sh
RUN bash ./install_rocm.sh ${ROCM_VERSION} && rm install_rocm.sh
# cmake is already installed inside the rocm base image, but both 2 and 3 exist
# cmake3 is needed for the later MIOpen custom build, so that step is last.
RUN yum install -y cmake3 && \
rm -f /usr/bin/cmake && \
ln -s /usr/bin/cmake3 /usr/bin/cmake
ADD ./common/install_miopen.sh install_miopen.sh
RUN bash ./install_miopen.sh ${ROCM_VERSION} && rm install_miopen.sh


@@ -0,0 +1,153 @@
# syntax = docker/dockerfile:experimental
ARG ROCM_VERSION=3.7
ARG BASE_CUDA_VERSION=11.8
ARG GPU_IMAGE=amd64/almalinux:8
FROM quay.io/pypa/manylinux_2_28_x86_64 as base
ENV LC_ALL en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LANGUAGE en_US.UTF-8
ARG DEVTOOLSET_VERSION=11
RUN yum install -y sudo wget curl perl util-linux xz bzip2 git patch which perl zlib-devel yum-utils gcc-toolset-${DEVTOOLSET_VERSION}-toolchain
ENV PATH=/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/bin:$PATH
ENV LD_LIBRARY_PATH=/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/lib64:/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/lib:$LD_LIBRARY_PATH
# cmake-3.18.4 from pip
RUN yum install -y python3-pip && \
python3 -mpip install cmake==3.18.4 && \
ln -s /usr/local/bin/cmake /usr/bin/cmake3
FROM base as openssl
# Install openssl (this must precede `build python` step)
# (In order to have a proper SSL module, Python is compiled
# against a recent openssl [see env vars above], which is linked
# statically. We delete openssl afterwards.)
ADD ./common/install_openssl.sh install_openssl.sh
RUN bash ./install_openssl.sh && rm install_openssl.sh
# remove unnecessary python versions
RUN rm -rf /opt/python/cp26-cp26m /opt/_internal/cpython-2.6.9-ucs2
RUN rm -rf /opt/python/cp26-cp26mu /opt/_internal/cpython-2.6.9-ucs4
RUN rm -rf /opt/python/cp33-cp33m /opt/_internal/cpython-3.3.6
RUN rm -rf /opt/python/cp34-cp34m /opt/_internal/cpython-3.4.6
FROM base as cuda
ARG BASE_CUDA_VERSION=11.8
# Install CUDA
ADD ./common/install_cuda.sh install_cuda.sh
RUN bash ./install_cuda.sh ${BASE_CUDA_VERSION} && rm install_cuda.sh
FROM base as intel
# MKL
ADD ./common/install_mkl.sh install_mkl.sh
RUN bash ./install_mkl.sh && rm install_mkl.sh
FROM base as magma
ARG BASE_CUDA_VERSION=10.2
# Install magma
ADD ./common/install_magma.sh install_magma.sh
RUN bash ./install_magma.sh ${BASE_CUDA_VERSION} && rm install_magma.sh
FROM base as jni
# Install java jni header
ADD ./common/install_jni.sh install_jni.sh
ADD ./java/jni.h jni.h
RUN bash ./install_jni.sh && rm install_jni.sh
FROM base as libpng
# Install libpng
ADD ./common/install_libpng.sh install_libpng.sh
RUN bash ./install_libpng.sh && rm install_libpng.sh
FROM ${GPU_IMAGE} as common
ARG DEVTOOLSET_VERSION=11
ENV LC_ALL en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LANGUAGE en_US.UTF-8
RUN yum -y install epel-release
RUN yum -y update
RUN yum install -y \
autoconf \
automake \
bison \
bzip2 \
curl \
diffutils \
file \
git \
make \
patch \
perl \
unzip \
util-linux \
wget \
which \
xz \
gcc-toolset-${DEVTOOLSET_VERSION}-toolchain \
glibc-langpack-en
RUN yum install -y \
https://repo.ius.io/ius-release-el7.rpm \
https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
RUN yum swap -y git git236-core
# git236+ would refuse to run git commands in repos owned by other users
# Which causes version check to fail, as pytorch repo is bind-mounted into the image
# Override this behaviour by treating every folder as safe
# For more details see https://github.com/pytorch/pytorch/issues/78659#issuecomment-1144107327
RUN git config --global --add safe.directory "*"
ENV SSL_CERT_FILE=/opt/_internal/certs.pem
# Install LLVM version
COPY --from=openssl /opt/openssl /opt/openssl
COPY --from=base /opt/python /opt/python
COPY --from=base /opt/_internal /opt/_internal
COPY --from=base /usr/local/bin/auditwheel /usr/local/bin/auditwheel
COPY --from=intel /opt/intel /opt/intel
COPY --from=base /usr/local/bin/patchelf /usr/local/bin/patchelf
COPY --from=libpng /usr/local/bin/png* /usr/local/bin/
COPY --from=libpng /usr/local/bin/libpng* /usr/local/bin/
COPY --from=libpng /usr/local/include/png* /usr/local/include/
COPY --from=libpng /usr/local/include/libpng* /usr/local/include/
COPY --from=libpng /usr/local/lib/libpng* /usr/local/lib/
COPY --from=libpng /usr/local/lib/pkgconfig /usr/local/lib/pkgconfig
COPY --from=jni /usr/local/include/jni.h /usr/local/include/jni.h
FROM common as cpu_final
ARG BASE_CUDA_VERSION=11.8
ARG DEVTOOLSET_VERSION=11
# Ensure the expected devtoolset is used
ENV PATH=/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/bin:$PATH
ENV LD_LIBRARY_PATH=/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/lib64:/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/lib:$LD_LIBRARY_PATH
# cmake-3.18.4 from pip
RUN yum install -y python3-pip && \
python3 -mpip install cmake==3.18.4 && \
ln -s /usr/local/bin/cmake /usr/bin/cmake3
FROM cpu_final as cuda_final
RUN rm -rf /usr/local/cuda-${BASE_CUDA_VERSION}
COPY --from=cuda /usr/local/cuda-${BASE_CUDA_VERSION} /usr/local/cuda-${BASE_CUDA_VERSION}
COPY --from=magma /usr/local/cuda-${BASE_CUDA_VERSION} /usr/local/cuda-${BASE_CUDA_VERSION}
FROM common as rocm_final
ARG ROCM_VERSION=3.7
# Install ROCm
ADD ./common/install_rocm.sh install_rocm.sh
RUN bash ./install_rocm.sh ${ROCM_VERSION} && rm install_rocm.sh
# cmake is already installed inside the rocm base image, but both 2 and 3 exist
# cmake3 is needed for the later MIOpen custom build, so that step is last.
RUN yum install -y cmake3 && \
rm -f /usr/bin/cmake && \
ln -s /usr/bin/cmake3 /usr/bin/cmake
ADD ./common/install_miopen.sh install_miopen.sh
RUN bash ./install_miopen.sh ${ROCM_VERSION} && rm install_miopen.sh
FROM cpu_final as xpu_final
# cmake-3.28.4 from pip
RUN python3 -m pip install --upgrade pip && \
python3 -mpip install cmake==3.28.4
ADD ./common/install_xpu.sh install_xpu.sh
RUN bash ./install_xpu.sh && rm install_xpu.sh
RUN pushd /opt/_internal && tar -xJf static-libs-for-embedding-only.tar.xz && popd


@@ -0,0 +1,57 @@
FROM quay.io/pypa/manylinux_2_28_aarch64 as base
# Graviton needs GCC 10 or above for the build. GCC12 is the default version in almalinux-8.
ARG GCCTOOLSET_VERSION=11
# Language variables
ENV LC_ALL=en_US.UTF-8
ENV LANG=en_US.UTF-8
ENV LANGUAGE=en_US.UTF-8
# Install needed OS packages. This is to support all
# the binary builds (torch, vision, audio, text, data)
RUN yum -y install epel-release
RUN yum -y update
RUN yum install -y \
autoconf \
automake \
bison \
bzip2 \
curl \
diffutils \
file \
git \
less \
libffi-devel \
libgomp \
make \
openssl-devel \
patch \
perl \
unzip \
util-linux \
wget \
which \
xz \
yasm \
zstd \
sudo \
gcc-toolset-${GCCTOOLSET_VERSION}-toolchain
# Ensure the expected devtoolset is used
ENV PATH=/opt/rh/gcc-toolset-${GCCTOOLSET_VERSION}/root/usr/bin:$PATH
ENV LD_LIBRARY_PATH=/opt/rh/gcc-toolset-${GCCTOOLSET_VERSION}/root/usr/lib64:/opt/rh/gcc-toolset-${GCCTOOLSET_VERSION}/root/usr/lib:$LD_LIBRARY_PATH
# git236+ would refuse to run git commands in repos owned by other users
# Which causes version check to fail, as pytorch repo is bind-mounted into the image
# Override this behaviour by treating every folder as safe
# For more details see https://github.com/pytorch/pytorch/issues/78659#issuecomment-1144107327
RUN git config --global --add safe.directory "*"
FROM base as final
# remove unnecessary python versions
RUN rm -rf /opt/python/cp26-cp26m /opt/_internal/cpython-2.6.9-ucs2
RUN rm -rf /opt/python/cp26-cp26mu /opt/_internal/cpython-2.6.9-ucs4
RUN rm -rf /opt/python/cp33-cp33m /opt/_internal/cpython-3.3.6
RUN rm -rf /opt/python/cp34-cp34m /opt/_internal/cpython-3.4.6


@@ -0,0 +1,94 @@
FROM quay.io/pypa/manylinux2014_aarch64 as base
# Graviton needs GCC 10 for the build
ARG DEVTOOLSET_VERSION=10
# Language variables
ENV LC_ALL=en_US.UTF-8
ENV LANG=en_US.UTF-8
ENV LANGUAGE=en_US.UTF-8
# Install needed OS packages. This is to support all
# the binary builds (torch, vision, audio, text, data)
RUN yum -y install epel-release
RUN yum -y update
RUN yum install -y \
autoconf \
automake \
bison \
bzip2 \
curl \
diffutils \
file \
git \
make \
patch \
perl \
unzip \
util-linux \
wget \
which \
xz \
yasm \
less \
zstd \
libgomp \
sudo \
devtoolset-${DEVTOOLSET_VERSION}-gcc \
devtoolset-${DEVTOOLSET_VERSION}-gcc-c++ \
devtoolset-${DEVTOOLSET_VERSION}-gcc-gfortran \
devtoolset-${DEVTOOLSET_VERSION}-binutils
# Ensure the expected devtoolset is used
ENV PATH=/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/bin:$PATH
ENV LD_LIBRARY_PATH=/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/lib64:/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/lib:$LD_LIBRARY_PATH
# git236+ would refuse to run git commands in repos owned by other users
# Which causes version check to fail, as pytorch repo is bind-mounted into the image
# Override this behaviour by treating every folder as safe
# For more details see https://github.com/pytorch/pytorch/issues/78659#issuecomment-1144107327
RUN git config --global --add safe.directory "*"
###############################################################################
# libgfortran.a hack
#
# libgfortran.a from quay.io/pypa/manylinux2014_aarch64 is not compiled with -fPIC.
# This causes __stack_chk_guard@@GLIBC_2.17 on pytorch build. To solve, get
# ubuntu's libgfortran.a which is compiled with -fPIC
# NOTE: Need a better way to get this library, as Ubuntu's package can be removed or changed by the vendor
###############################################################################
RUN cd ~/ \
&& curl -L -o ~/libgfortran-10-dev.deb http://ports.ubuntu.com/ubuntu-ports/pool/universe/g/gcc-10/libgfortran-10-dev_10.5.0-1ubuntu1_arm64.deb \
&& ar x ~/libgfortran-10-dev.deb \
&& tar --use-compress-program=unzstd -xvf data.tar.zst -C ~/ \
&& cp -f ~/usr/lib/gcc/aarch64-linux-gnu/10/libgfortran.a /opt/rh/devtoolset-10/root/usr/lib/gcc/aarch64-redhat-linux/10/
# install cmake
RUN yum install -y cmake3 && \
ln -s /usr/bin/cmake3 /usr/bin/cmake
FROM base as openssl
# Install openssl (this must precede `build python` step)
# (In order to have a proper SSL module, Python is compiled
# against a recent openssl [see env vars above], which is linked
# statically. We delete openssl afterwards.)
ADD ./common/install_openssl.sh install_openssl.sh
RUN bash ./install_openssl.sh && rm install_openssl.sh
ENV SSL_CERT_FILE=/opt/_internal/certs.pem
FROM base as openblas
# Install openblas
ADD ./common/install_openblas.sh install_openblas.sh
RUN bash ./install_openblas.sh && rm install_openblas.sh
FROM openssl as final
# remove unnecessary python versions
RUN rm -rf /opt/python/cp26-cp26m /opt/_internal/cpython-2.6.9-ucs2
RUN rm -rf /opt/python/cp26-cp26mu /opt/_internal/cpython-2.6.9-ucs4
RUN rm -rf /opt/python/cp33-cp33m /opt/_internal/cpython-3.3.6
RUN rm -rf /opt/python/cp34-cp34m /opt/_internal/cpython-3.4.6
COPY --from=openblas /opt/OpenBLAS/ /opt/OpenBLAS/
ENV LD_LIBRARY_PATH=/opt/OpenBLAS/lib:$LD_LIBRARY_PATH


@@ -0,0 +1,91 @@
FROM quay.io/pypa/manylinux_2_28_aarch64 as base
# Cuda ARM build needs gcc 11
ARG DEVTOOLSET_VERSION=11
# Language variables
ENV LC_ALL=en_US.UTF-8
ENV LANG=en_US.UTF-8
ENV LANGUAGE=en_US.UTF-8
# Install needed OS packages. This is to support all
# the binary builds (torch, vision, audio, text, data)
RUN yum -y install epel-release
RUN yum -y update
RUN yum install -y \
autoconf \
automake \
bison \
bzip2 \
curl \
diffutils \
file \
git \
make \
patch \
perl \
unzip \
util-linux \
wget \
which \
xz \
yasm \
less \
zstd \
libgomp \
sudo \
gcc-toolset-${DEVTOOLSET_VERSION}-toolchain
# Ensure the expected devtoolset is used
ENV PATH=/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/bin:$PATH
ENV LD_LIBRARY_PATH=/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/lib64:/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/lib:$LD_LIBRARY_PATH
# git236+ would refuse to run git commands in repos owned by other users
# Which causes version check to fail, as pytorch repo is bind-mounted into the image
# Override this behaviour by treating every folder as safe
# For more details see https://github.com/pytorch/pytorch/issues/78659#issuecomment-1144107327
RUN git config --global --add safe.directory "*"
FROM base as openssl
# Install openssl (this must precede `build python` step)
# (In order to have a proper SSL module, Python is compiled
# against a recent openssl [see env vars above], which is linked
# statically. We delete openssl afterwards.)
ADD ./common/install_openssl.sh install_openssl.sh
RUN bash ./install_openssl.sh && rm install_openssl.sh
ENV SSL_CERT_FILE=/opt/_internal/certs.pem
FROM openssl as final
# remove unnecessary python versions
RUN rm -rf /opt/python/cp26-cp26m /opt/_internal/cpython-2.6.9-ucs2
RUN rm -rf /opt/python/cp26-cp26mu /opt/_internal/cpython-2.6.9-ucs4
RUN rm -rf /opt/python/cp33-cp33m /opt/_internal/cpython-3.3.6
RUN rm -rf /opt/python/cp34-cp34m /opt/_internal/cpython-3.4.6
FROM base as cuda
ARG BASE_CUDA_VERSION
# Install CUDA
ADD ./common/install_cuda_aarch64.sh install_cuda_aarch64.sh
RUN bash ./install_cuda_aarch64.sh ${BASE_CUDA_VERSION} && rm install_cuda_aarch64.sh
FROM base as magma
ARG BASE_CUDA_VERSION
# Install magma
ADD ./common/install_magma.sh install_magma.sh
RUN bash ./install_magma.sh ${BASE_CUDA_VERSION} && rm install_magma.sh
FROM base as openblas
# Install openblas
ADD ./common/install_openblas.sh install_openblas.sh
RUN bash ./install_openblas.sh && rm install_openblas.sh
FROM final as cuda_final
ARG BASE_CUDA_VERSION
RUN rm -rf /usr/local/cuda-${BASE_CUDA_VERSION}
COPY --from=cuda /usr/local/cuda-${BASE_CUDA_VERSION} /usr/local/cuda-${BASE_CUDA_VERSION}
COPY --from=magma /usr/local/cuda-${BASE_CUDA_VERSION} /usr/local/cuda-${BASE_CUDA_VERSION}
COPY --from=openblas /opt/OpenBLAS/ /opt/OpenBLAS/
RUN ln -sf /usr/local/cuda-${BASE_CUDA_VERSION} /usr/local/cuda
ENV PATH=/usr/local/cuda/bin:$PATH
ENV LD_LIBRARY_PATH=/opt/OpenBLAS/lib:$LD_LIBRARY_PATH


@@ -0,0 +1,71 @@
FROM centos:8 as base
ENV LC_ALL en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LANGUAGE en_US.UTF-8
ENV PATH /opt/rh/gcc-toolset-11/root/bin/:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
# change to a valid repo
RUN sed -i 's|#baseurl=http://mirror.centos.org|baseurl=http://vault.centos.org|g' /etc/yum.repos.d/CentOS-Linux-*.repo
# enable the PowerTools repo so that ninja-build can be installed
RUN sed -i 's|enabled=0|enabled=1|g' /etc/yum.repos.d/CentOS-Linux-PowerTools.repo
RUN yum -y update
RUN yum install -y wget curl perl util-linux xz bzip2 git patch which zlib-devel sudo
RUN yum install -y autoconf automake make cmake gdb gcc-toolset-11-gcc-c++
FROM base as openssl
ADD ./common/install_openssl.sh install_openssl.sh
RUN bash ./install_openssl.sh && rm install_openssl.sh
# Install python
FROM base as python
RUN yum install -y openssl-devel zlib-devel bzip2-devel ncurses-devel sqlite-devel readline-devel tk-devel gdbm-devel libpcap-devel xz-devel libffi-devel
ADD common/install_cpython.sh install_cpython.sh
RUN bash ./install_cpython.sh && rm install_cpython.sh
FROM base as conda
ADD ./common/install_conda_docker.sh install_conda.sh
RUN bash ./install_conda.sh && rm install_conda.sh
RUN /opt/conda/bin/conda install -y cmake
FROM base as intel
# Install MKL
COPY --from=python /opt/python /opt/python
COPY --from=python /opt/_internal /opt/_internal
COPY --from=conda /opt/conda /opt/conda
ENV PATH=/opt/conda/bin:$PATH
ADD ./common/install_mkl.sh install_mkl.sh
RUN bash ./install_mkl.sh && rm install_mkl.sh
FROM base as patchelf
ADD ./common/install_patchelf.sh install_patchelf.sh
RUN bash ./install_patchelf.sh && rm install_patchelf.sh
RUN cp $(which patchelf) /patchelf
FROM base as jni
ADD ./common/install_jni.sh install_jni.sh
ADD ./java/jni.h jni.h
RUN bash ./install_jni.sh && rm install_jni.sh
FROM base as libpng
ADD ./common/install_libpng.sh install_libpng.sh
RUN bash ./install_libpng.sh && rm install_libpng.sh
FROM base as final
COPY --from=openssl /opt/openssl /opt/openssl
COPY --from=python /opt/python /opt/python
COPY --from=python /opt/_internal /opt/_internal
COPY --from=intel /opt/intel /opt/intel
COPY --from=conda /opt/conda /opt/conda
COPY --from=patchelf /usr/local/bin/patchelf /usr/local/bin/patchelf
COPY --from=jni /usr/local/include/jni.h /usr/local/include/jni.h
COPY --from=libpng /usr/local/bin/png* /usr/local/bin/
COPY --from=libpng /usr/local/bin/libpng* /usr/local/bin/
COPY --from=libpng /usr/local/include/png* /usr/local/include/
COPY --from=libpng /usr/local/include/libpng* /usr/local/include/
COPY --from=libpng /usr/local/lib/libpng* /usr/local/lib/
COPY --from=libpng /usr/local/lib/pkgconfig /usr/local/lib/pkgconfig
RUN yum install -y ninja-build


@@ -0,0 +1,73 @@
FROM --platform=linux/s390x docker.io/ubuntu:24.04 as base
# Language variables
ENV LC_ALL=C.UTF-8
ENV LANG=C.UTF-8
ENV LANGUAGE=C.UTF-8
# Install needed OS packages. This is to support all
# the binary builds (torch, vision, audio, text, data)
RUN apt update ; apt upgrade -y
RUN apt install -y \
build-essential \
autoconf \
automake \
bzip2 \
curl \
diffutils \
file \
git \
make \
patch \
perl \
unzip \
util-linux \
wget \
which \
xz-utils \
less \
zstd \
cmake \
python3 \
python3-dev \
python3-setuptools \
python3-yaml \
python3-typing-extensions \
libblas-dev \
libopenblas-dev \
liblapack-dev \
libatlas-base-dev
# git236+ would refuse to run git commands in repos owned by other users
# Which causes version check to fail, as pytorch repo is bind-mounted into the image
# Override this behaviour by treating every folder as safe
# For more details see https://github.com/pytorch/pytorch/issues/78659#issuecomment-1144107327
RUN git config --global --add safe.directory "*"
FROM base as openssl
# Install openssl (this must precede `build python` step)
# (In order to have a proper SSL module, Python is compiled
# against a recent openssl [see env vars above], which is linked
# statically. We delete openssl afterwards.)
ADD ./common/install_openssl.sh install_openssl.sh
RUN bash ./install_openssl.sh && rm install_openssl.sh
ENV SSL_CERT_FILE=/opt/_internal/certs.pem
# EPEL for cmake
FROM base as patchelf
# Install patchelf
ADD ./common/install_patchelf.sh install_patchelf.sh
RUN bash ./install_patchelf.sh && rm install_patchelf.sh
RUN cp $(which patchelf) /patchelf
FROM patchelf as python
# build python
COPY manywheel/build_scripts /build_scripts
ADD ./common/install_cpython.sh /build_scripts/install_cpython.sh
RUN bash build_scripts/build.sh && rm -r build_scripts
FROM openssl as final
COPY --from=python /opt/python /opt/python
COPY --from=python /opt/_internal /opt/_internal
COPY --from=python /opt/python/cp39-cp39/bin/auditwheel /usr/local/bin/auditwheel
COPY --from=patchelf /usr/local/bin/patchelf /usr/local/bin/patchelf

.ci/docker/manywheel/build.sh Executable file

@@ -0,0 +1,154 @@
#!/usr/bin/env bash
# Script used only in CD pipeline
set -eou pipefail
TOPDIR=$(git rev-parse --show-toplevel)
image="$1"
shift
if [ -z "${image}" ]; then
echo "Usage: $0 IMAGE"
exit 1
fi
DOCKER_IMAGE="pytorch/${image}"
DOCKER_REGISTRY="${DOCKER_REGISTRY:-docker.io}"
GPU_ARCH_TYPE=${GPU_ARCH_TYPE:-cpu}
GPU_ARCH_VERSION=${GPU_ARCH_VERSION:-}
MANY_LINUX_VERSION=${MANY_LINUX_VERSION:-}
DOCKERFILE_SUFFIX=${DOCKERFILE_SUFFIX:-}
WITH_PUSH=${WITH_PUSH:-}
case ${GPU_ARCH_TYPE} in
cpu)
TARGET=cpu_final
DOCKER_TAG=cpu
GPU_IMAGE=centos:7
DOCKER_GPU_BUILD_ARG=" --build-arg DEVTOOLSET_VERSION=9"
;;
cpu-manylinux_2_28)
TARGET=cpu_final
DOCKER_TAG=cpu
GPU_IMAGE=amd64/almalinux:8
DOCKER_GPU_BUILD_ARG=" --build-arg DEVTOOLSET_VERSION=11"
MANY_LINUX_VERSION="2_28"
;;
cpu-aarch64)
TARGET=final
DOCKER_TAG=cpu-aarch64
GPU_IMAGE=arm64v8/centos:7
DOCKER_GPU_BUILD_ARG=" --build-arg DEVTOOLSET_VERSION=10"
MANY_LINUX_VERSION="aarch64"
;;
cpu-aarch64-2_28)
TARGET=final
DOCKER_TAG=cpu-aarch64
GPU_IMAGE=arm64v8/almalinux:8
DOCKER_GPU_BUILD_ARG=" --build-arg DEVTOOLSET_VERSION=11"
MANY_LINUX_VERSION="2_28_aarch64"
;;
cpu-cxx11-abi)
TARGET=final
DOCKER_TAG=cpu-cxx11-abi
GPU_IMAGE=""
DOCKER_GPU_BUILD_ARG=" --build-arg DEVTOOLSET_VERSION=9"
MANY_LINUX_VERSION="cxx11-abi"
;;
cpu-s390x)
TARGET=final
DOCKER_TAG=cpu-s390x
GPU_IMAGE=redhat/ubi9
DOCKER_GPU_BUILD_ARG=""
MANY_LINUX_VERSION="s390x"
;;
cuda)
TARGET=cuda_final
DOCKER_TAG=cuda${GPU_ARCH_VERSION}
# Keep this up to date with the minimum version of CUDA we currently support
GPU_IMAGE=centos:7
DOCKER_GPU_BUILD_ARG="--build-arg BASE_CUDA_VERSION=${GPU_ARCH_VERSION} --build-arg DEVTOOLSET_VERSION=9"
;;
cuda-manylinux_2_28)
TARGET=cuda_final
DOCKER_TAG=cuda${GPU_ARCH_VERSION}
GPU_IMAGE=amd64/almalinux:8
DOCKER_GPU_BUILD_ARG="--build-arg BASE_CUDA_VERSION=${GPU_ARCH_VERSION} --build-arg DEVTOOLSET_VERSION=11"
MANY_LINUX_VERSION="2_28"
;;
cuda-aarch64)
TARGET=cuda_final
DOCKER_TAG=cuda${GPU_ARCH_VERSION}
GPU_IMAGE=arm64v8/centos:7
DOCKER_GPU_BUILD_ARG="--build-arg BASE_CUDA_VERSION=${GPU_ARCH_VERSION} --build-arg DEVTOOLSET_VERSION=11"
MANY_LINUX_VERSION="aarch64"
DOCKERFILE_SUFFIX="_cuda_aarch64"
;;
rocm)
TARGET=rocm_final
DOCKER_TAG=rocm${GPU_ARCH_VERSION}
GPU_IMAGE=rocm/dev-centos-7:${GPU_ARCH_VERSION}-complete
PYTORCH_ROCM_ARCH="gfx900;gfx906;gfx908;gfx90a;gfx1030;gfx1100"
ROCM_REGEX="([0-9]+)\.([0-9]+)[\.]?([0-9]*)"
if [[ $GPU_ARCH_VERSION =~ $ROCM_REGEX ]]; then
ROCM_VERSION_INT=$((${BASH_REMATCH[1]}*10000 + ${BASH_REMATCH[2]}*100 + ${BASH_REMATCH[3]:-0}))
else
echo "ERROR: rocm regex failed"
exit 1
fi
if [[ $ROCM_VERSION_INT -ge 60000 ]]; then
PYTORCH_ROCM_ARCH+=";gfx942"
fi
DOCKER_GPU_BUILD_ARG="--build-arg ROCM_VERSION=${GPU_ARCH_VERSION} --build-arg PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH} --build-arg DEVTOOLSET_VERSION=9"
;;
xpu)
TARGET=xpu_final
DOCKER_TAG=xpu
GPU_IMAGE=amd64/almalinux:8
DOCKER_GPU_BUILD_ARG=" --build-arg DEVTOOLSET_VERSION=11"
MANY_LINUX_VERSION="2_28"
;;
*)
echo "ERROR: Unrecognized GPU_ARCH_TYPE: ${GPU_ARCH_TYPE}"
exit 1
;;
esac
IMAGES=''
if [[ -n ${MANY_LINUX_VERSION} && -z ${DOCKERFILE_SUFFIX} ]]; then
DOCKERFILE_SUFFIX=_${MANY_LINUX_VERSION}
fi
(
set -x
DOCKER_BUILDKIT=1 docker build \
${DOCKER_GPU_BUILD_ARG} \
--build-arg "GPU_IMAGE=${GPU_IMAGE}" \
--target "${TARGET}" \
-t "${DOCKER_IMAGE}" \
"$@" \
-f "${TOPDIR}/.ci/docker/manywheel/Dockerfile${DOCKERFILE_SUFFIX}" \
"${TOPDIR}/.ci/docker/"
)
GITHUB_REF=${GITHUB_REF:-$(git symbolic-ref -q HEAD || git describe --tags --exact-match)}
GIT_BRANCH_NAME=${GITHUB_REF##*/}
GIT_COMMIT_SHA=${GITHUB_SHA:-$(git rev-parse HEAD)}
DOCKER_IMAGE_BRANCH_TAG=${DOCKER_IMAGE}-${GIT_BRANCH_NAME}
DOCKER_IMAGE_SHA_TAG=${DOCKER_IMAGE}-${GIT_COMMIT_SHA}
if [[ "${WITH_PUSH}" == true ]]; then
(
set -x
docker push "${DOCKER_IMAGE}"
if [[ -n ${GITHUB_REF} ]]; then
docker tag ${DOCKER_IMAGE} ${DOCKER_IMAGE_BRANCH_TAG}
docker tag ${DOCKER_IMAGE} ${DOCKER_IMAGE_SHA_TAG}
docker push "${DOCKER_IMAGE_BRANCH_TAG}"
docker push "${DOCKER_IMAGE_SHA_TAG}"
fi
)
fi
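
The workflow files later in this compare invoke this script with GPU_ARCH_TYPE/GPU_ARCH_VERSION in the environment and the image name as the single positional argument. As a rough local sketch (the image tag below is illustrative only, not an exact CI value), a manylinux_2_28 CUDA build without a push would look like:

# Illustrative invocation; the tag "manylinux2_28-builder:cuda12.1" is an assumption for this sketch.
GPU_ARCH_TYPE=cuda-manylinux_2_28 GPU_ARCH_VERSION=12.1 WITH_PUSH= \
    .ci/docker/manywheel/build.sh manylinux2_28-builder:cuda12.1

For the rocm) branch, the regex/arithmetic above turns "major.minor[.patch]" into an integer before the gfx942 check, e.g. "6.1" -> 6*10000 + 1*100 + 0 = 60100 (>= 60000, so gfx942 is appended), while "5.7.1" -> 5*10000 + 7*100 + 1 = 50701 (< 60000, so it is not).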


@@ -0,0 +1,131 @@
#!/bin/bash
# Top-level build script called from Dockerfile
# Script used only in CD pipeline
# Stop at any error, show all commands
set -ex
# openssl version to build, with expected sha256 hash of .tar.gz
# archive
OPENSSL_ROOT=openssl-1.1.1l
OPENSSL_HASH=0b7a3e5e59c34827fe0c3a74b7ec8baef302b98fa80088d7f9153aa16fa76bd1
DEVTOOLS_HASH=a8ebeb4bed624700f727179e6ef771dafe47651131a00a78b342251415646acc
PATCHELF_HASH=d9afdff4baeacfbc64861454f368b7f2c15c44d245293f7587bbf726bfe722fb
CURL_ROOT=curl-7.73.0
CURL_HASH=cf34fe0b07b800f1c01a499a6e8b2af548f6d0e044dca4a29d88a4bee146d131
AUTOCONF_ROOT=autoconf-2.69
AUTOCONF_HASH=954bd69b391edc12d6a4a51a2dd1476543da5c6bbf05a95b59dc0dd6fd4c2969
# Get build utilities
MY_DIR=$(dirname "${BASH_SOURCE[0]}")
source $MY_DIR/build_utils.sh
if [ "$(uname -m)" != "s390x" ] ; then
# Dependencies for compiling Python that we want to remove from
# the final image after compiling Python
PYTHON_COMPILE_DEPS="zlib-devel bzip2-devel ncurses-devel sqlite-devel readline-devel tk-devel gdbm-devel db4-devel libpcap-devel xz-devel libffi-devel"
# Libraries that are allowed as part of the manylinux1 profile
MANYLINUX1_DEPS="glibc-devel libstdc++-devel glib2-devel libX11-devel libXext-devel libXrender-devel mesa-libGL-devel libICE-devel libSM-devel ncurses-devel"
# Development tools and libraries
yum -y install bzip2 make git patch unzip bison yasm diffutils \
automake which file cmake28 \
kernel-devel-`uname -r` \
${PYTHON_COMPILE_DEPS}
else
# Dependencies for compiling Python that we want to remove from
# the final image after compiling Python
PYTHON_COMPILE_DEPS="zlib1g-dev libbz2-dev libncurses-dev libsqlite3-dev libdb-dev libpcap-dev liblzma-dev libffi-dev"
# Libraries that are allowed as part of the manylinux1 profile
MANYLINUX1_DEPS="libglib2.0-dev libX11-dev libncurses-dev"
# Development tools and libraries
apt install -y bzip2 make git patch unzip diffutils \
automake which file cmake \
linux-headers-virtual \
${PYTHON_COMPILE_DEPS}
fi
# Install newest autoconf
build_autoconf $AUTOCONF_ROOT $AUTOCONF_HASH
autoconf --version
# Compile the latest Python releases.
# (In order to have a proper SSL module, Python is compiled
# against a recent openssl [see env vars above], which is linked
# statically. We delete openssl afterwards.)
build_openssl $OPENSSL_ROOT $OPENSSL_HASH
/build_scripts/install_cpython.sh
PY39_BIN=/opt/python/cp39-cp39/bin
# Our openssl doesn't know how to find the system CA trust store
# (https://github.com/pypa/manylinux/issues/53)
# And it's not clear how up-to-date that is anyway
# So let's just use the same one pip and everyone uses
$PY39_BIN/pip install certifi
ln -s $($PY39_BIN/python -c 'import certifi; print(certifi.where())') \
/opt/_internal/certs.pem
# If you modify this line you also have to modify the versions in the
# Dockerfiles:
export SSL_CERT_FILE=/opt/_internal/certs.pem
# Install newest curl
build_curl $CURL_ROOT $CURL_HASH
rm -rf /usr/local/include/curl /usr/local/lib/libcurl* /usr/local/lib/pkgconfig/libcurl.pc
hash -r
curl --version
curl-config --features
# Install patchelf (latest with unreleased bug fixes)
curl -sLOk https://nixos.org/releases/patchelf/patchelf-0.10/patchelf-0.10.tar.gz
# check_sha256sum patchelf-0.9njs2.tar.gz $PATCHELF_HASH
tar -xzf patchelf-0.10.tar.gz
(cd patchelf-0.10 && ./configure && make && make install)
rm -rf patchelf-0.10.tar.gz patchelf-0.10
# Install latest pypi release of auditwheel
$PY39_BIN/pip install auditwheel
ln -s $PY39_BIN/auditwheel /usr/local/bin/auditwheel
# Clean up development headers and other unnecessary stuff for
# final image
if [ "$(uname -m)" != "s390x" ] ; then
yum -y erase wireless-tools gtk2 libX11 hicolor-icon-theme \
avahi freetype bitstream-vera-fonts \
${PYTHON_COMPILE_DEPS} || true > /dev/null 2>&1
yum -y install ${MANYLINUX1_DEPS}
yum -y clean all > /dev/null 2>&1
yum list installed
else
apt purge -y ${PYTHON_COMPILE_DEPS} || true > /dev/null 2>&1
fi
# we don't need libpython*.a, and they're many megabytes
find /opt/_internal -name '*.a' -print0 | xargs -0 rm -f
# Strip what we can -- and ignore errors, because this just attempts to strip
# *everything*, including non-ELF files:
find /opt/_internal -type f -print0 \
| xargs -0 -n1 strip --strip-unneeded 2>/dev/null || true
# We do not need the Python test suites, or indeed the precompiled .pyc and
# .pyo files. Partially cribbed from:
# https://github.com/docker-library/python/blob/master/3.4/slim/Dockerfile
find /opt/_internal \
\( -type d -a -name test -o -name tests \) \
-o \( -type f -a -name '*.pyc' -o -name '*.pyo' \) \
-print0 | xargs -0 rm -f
for PYTHON in /opt/python/*/bin/python; do
# Smoke test to make sure that our Pythons work, and do indeed detect as
# being manylinux compatible:
$PYTHON $MY_DIR/manylinux1-check.py
# Make sure that SSL cert checking works
$PYTHON $MY_DIR/ssl-check.py
done
# Fix libc headers to remain compatible with C99 compilers.
find /usr/include/ -type f -exec sed -i 's/\bextern _*inline_*\b/extern __inline __attribute__ ((__gnu_inline__))/g' {} +
# Now we can delete our built SSL
rm -rf /usr/local/ssl


@@ -0,0 +1,91 @@
#!/bin/bash
# Helper utilities for build
# Script used only in CD pipeline
OPENSSL_DOWNLOAD_URL=https://www.openssl.org/source/old/1.1.1/
CURL_DOWNLOAD_URL=https://curl.askapache.com/download
AUTOCONF_DOWNLOAD_URL=https://ftp.gnu.org/gnu/autoconf
function check_var {
if [ -z "$1" ]; then
echo "required variable not defined"
exit 1
fi
}
function do_openssl_build {
./config no-ssl2 no-shared -fPIC --prefix=/usr/local/ssl > /dev/null
make > /dev/null
make install > /dev/null
}
function check_sha256sum {
local fname=$1
check_var ${fname}
local sha256=$2
check_var ${sha256}
echo "${sha256} ${fname}" > ${fname}.sha256
sha256sum -c ${fname}.sha256
rm -f ${fname}.sha256
}
function build_openssl {
local openssl_fname=$1
check_var ${openssl_fname}
local openssl_sha256=$2
check_var ${openssl_sha256}
check_var ${OPENSSL_DOWNLOAD_URL}
curl -sLO ${OPENSSL_DOWNLOAD_URL}/${openssl_fname}.tar.gz
check_sha256sum ${openssl_fname}.tar.gz ${openssl_sha256}
tar -xzf ${openssl_fname}.tar.gz
(cd ${openssl_fname} && do_openssl_build)
rm -rf ${openssl_fname} ${openssl_fname}.tar.gz
}
function do_curl_build {
LIBS=-ldl ./configure --with-ssl --disable-shared > /dev/null
make > /dev/null
make install > /dev/null
}
function build_curl {
local curl_fname=$1
check_var ${curl_fname}
local curl_sha256=$2
check_var ${curl_sha256}
check_var ${CURL_DOWNLOAD_URL}
curl -sLO ${CURL_DOWNLOAD_URL}/${curl_fname}.tar.bz2
check_sha256sum ${curl_fname}.tar.bz2 ${curl_sha256}
tar -jxf ${curl_fname}.tar.bz2
(cd ${curl_fname} && do_curl_build)
rm -rf ${curl_fname} ${curl_fname}.tar.bz2
}
function do_standard_install {
./configure > /dev/null
make > /dev/null
make install > /dev/null
}
function build_autoconf {
local autoconf_fname=$1
check_var ${autoconf_fname}
local autoconf_sha256=$2
check_var ${autoconf_sha256}
check_var ${AUTOCONF_DOWNLOAD_URL}
curl -sLO ${AUTOCONF_DOWNLOAD_URL}/${autoconf_fname}.tar.gz
check_sha256sum ${autoconf_fname}.tar.gz ${autoconf_sha256}
tar -zxf ${autoconf_fname}.tar.gz
(cd ${autoconf_fname} && do_standard_install)
rm -rf ${autoconf_fname} ${autoconf_fname}.tar.gz
}


@@ -0,0 +1,60 @@
# Logic copied from PEP 513
def is_manylinux1_compatible():
# Only Linux, and only x86-64 / i686
from distutils.util import get_platform
if get_platform() not in ["linux-x86_64", "linux-i686", "linux-s390x"]:
return False
# Check for presence of _manylinux module
try:
import _manylinux
return bool(_manylinux.manylinux1_compatible)
except (ImportError, AttributeError):
# Fall through to heuristic check below
pass
# Check glibc version. CentOS 5 uses glibc 2.5.
return have_compatible_glibc(2, 5)
def have_compatible_glibc(major, minimum_minor):
import ctypes
process_namespace = ctypes.CDLL(None)
try:
gnu_get_libc_version = process_namespace.gnu_get_libc_version
except AttributeError:
# Symbol doesn't exist -> therefore, we are not linked to
# glibc.
return False
# Call gnu_get_libc_version, which returns a string like "2.5".
gnu_get_libc_version.restype = ctypes.c_char_p
version_str = gnu_get_libc_version()
# py2 / py3 compatibility:
if not isinstance(version_str, str):
version_str = version_str.decode("ascii")
# Parse string and check against requested version.
version = [int(piece) for piece in version_str.split(".")]
assert len(version) == 2
if major != version[0]:
return False
if minimum_minor > version[1]:
return False
return True
import sys
if is_manylinux1_compatible():
print(f"{sys.executable} is manylinux1 compatible")
sys.exit(0)
else:
print(f"{sys.executable} is NOT manylinux1 compatible")
sys.exit(1)


@ -0,0 +1,35 @@
# cf. https://github.com/pypa/manylinux/issues/53
GOOD_SSL = "https://google.com"
BAD_SSL = "https://self-signed.badssl.com"
import sys
print("Testing SSL certificate checking for Python:", sys.version)
if sys.version_info[:2] < (2, 7) or (sys.version_info[0] == 3 and sys.version_info[:2] < (3, 4)):
print("This version never checks SSL certs; skipping tests")
sys.exit(0)
if sys.version_info[0] >= 3:
from urllib.request import urlopen
EXC = OSError
else:
from urllib import urlopen
EXC = IOError
print(f"Connecting to {GOOD_SSL} should work")
urlopen(GOOD_SSL)
print("...it did, yay.")
print(f"Connecting to {BAD_SSL} should fail")
try:
urlopen(BAD_SSL)
# If we get here then we failed:
print("...it DIDN'T!!!!!11!!1one!")
sys.exit(1)
except EXC:
print("...it did, yay.")


@@ -1,42 +1 @@
This directory contains scripts for our continuous integration.
One important thing to keep in mind when reading the scripts here is
that they are all based off of Docker images, which we build for each of
the various system configurations we want to run on Jenkins. This means
it is very easy to run these tests yourself:
1. Figure out what Docker image you want. The general template for our
images looks like:
``registry.pytorch.org/pytorch/pytorch-$BUILD_ENVIRONMENT:$DOCKER_VERSION``,
where ``$BUILD_ENVIRONMENT`` is one of the build environments
enumerated in
[pytorch-dockerfiles](https://github.com/pytorch/pytorch/blob/master/.ci/docker/build.sh). The Dockerfile used by Jenkins can be found under the `.ci` [directory](https://github.com/pytorch/pytorch/blob/master/.ci/docker)
2. Run ``docker run -it -u jenkins $DOCKER_IMAGE``, clone PyTorch and
run one of the scripts in this directory.
The Docker images are designed so that any "reasonable" build commands
will work; if you look in [build.sh](build.sh) you will see that it is a
very simple script. This is intentional. Idiomatic build instructions
should work inside all of our Docker images. You can tweak the commands
however you need (e.g., in case you want to rebuild with DEBUG, or rerun
the build with higher verbosity, etc.).
We have to do some work to make this so. Here is a summary of the
mechanisms we use:
- We install binaries to directories like `/usr/local/bin` which
are automatically part of your PATH.
- We add entries to the PATH using Docker ENV variables (so
they apply when you enter Docker) and `/etc/environment` (so they
continue to apply even if you sudo), instead of modifying
`PATH` in our build scripts.
- We use `/etc/ld.so.conf.d` to register directories containing
shared libraries, instead of modifying `LD_LIBRARY_PATH` in our
build scripts.
- We reroute well known paths like `/usr/bin/gcc` to alternate
implementations with `update-alternatives`, instead of setting
`CC` and `CXX` in our implementations.
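As a rough sketch of steps 1–2 above (the clone URL is real, but the script path is an assumption for illustration; substitute a concrete $BUILD_ENVIRONMENT and $DOCKER_VERSION from build.sh):

# Hypothetical walk-through; image tag components are placeholders from the template above.
docker pull registry.pytorch.org/pytorch/pytorch-$BUILD_ENVIRONMENT:$DOCKER_VERSION
docker run -it -u jenkins registry.pytorch.org/pytorch/pytorch-$BUILD_ENVIRONMENT:$DOCKER_VERSION bash
# inside the container:
git clone https://github.com/pytorch/pytorch.git && cd pytorch
.ci/pytorch/build.sh    # or any other script from this directory (path assumed)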


@@ -44,16 +44,15 @@ time python test/run_test.py --verbose -i distributed/_tensor/test_dtensor_compi
time python test/run_test.py --verbose -i distributed/test_device_mesh
# DTensor/TP tests
time python test/run_test.py --verbose -i distributed/tensor/parallel/test_ddp_2d_parallel
time python test/run_test.py --verbose -i distributed/tensor/parallel/test_fsdp_2d_parallel
time python test/run_test.py --verbose -i distributed/tensor/parallel/test_tp_examples
time python test/run_test.py --verbose -i distributed/tensor/parallel/test_tp_random_state
# FSDP2 tests
time python test/run_test.py --verbose -i distributed/_composable/fsdp/test_fully_shard_training -- -k test_2d_mlp_with_nd_mesh
# Pipelining composability tests
time python test/run_test.py --verbose -i distributed/pipelining/test_composability.py
# ND composability tests
time python test/run_test.py --verbose -i distributed/_composable/test_composability/test_2d_composability
time python test/run_test.py --verbose -i distributed/_composable/test_composability/test_pp_composability
# Other tests
time python test/run_test.py --verbose -i test_cuda_primary_ctx


@@ -316,11 +316,9 @@ test_inductor_distributed() {
python test/run_test.py -i inductor/test_aot_inductor.py -k test_replicate_on_devices --verbose
python test/run_test.py -i distributed/test_c10d_functional_native.py --verbose
python test/run_test.py -i distributed/_tensor/test_dtensor_compile.py --verbose
python test/run_test.py -i distributed/tensor/parallel/test_fsdp_2d_parallel.py --verbose
python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_comm.py --verbose
python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_multi_group --verbose
python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_with_activation_checkpointing --verbose
python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_2d_mlp --verbose
python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_hsdp --verbose
python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_2d_transformer_checkpoint_resume --verbose
python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_training.py -k test_gradient_accumulation --verbose
@@ -405,7 +403,7 @@ if [[ "${TEST_CONFIG}" == *dynamic* ]]; then
DYNAMO_BENCHMARK_FLAGS+=(--dynamic-shapes --dynamic-batch-only)
fi
if [[ "${TEST_CONFIG}" == *cpu_inductor* || "${TEST_CONFIG}" == *cpu_aot_inductor* ]]; then
if [[ "${TEST_CONFIG}" == *cpu* ]]; then
DYNAMO_BENCHMARK_FLAGS+=(--device cpu)
else
DYNAMO_BENCHMARK_FLAGS+=(--device cuda)
@@ -429,6 +427,19 @@ test_perf_for_dashboard() {
# TODO: All the accuracy tests can be skipped once the CI accuracy checking is stable enough
local targets=(accuracy performance)
local device=cuda
local taskset=""
if [[ "${TEST_CONFIG}" == *cpu* ]]; then
if [[ "${TEST_CONFIG}" == *cpu_x86* ]]; then
device=cpu_x86
elif [[ "${TEST_CONFIG}" == *cpu_aarch64* ]]; then
device=cpu_aarch64
fi
test_inductor_set_cpu_affinity
end_core=$(( $(test_inductor_get_core_number)-1 ))
taskset="taskset -c 0-$end_core"
fi
for mode in "${modes[@]}"; do
if [[ "$mode" == "inference" ]]; then
dtype=bfloat16
@@ -444,56 +455,56 @@
fi
if [[ "$DASHBOARD_TAG" == *default-true* ]]; then
python "benchmarks/dynamo/$suite.py" \
$taskset python "benchmarks/dynamo/$suite.py" \
"${target_flag[@]}" --"$mode" --"$dtype" --backend "$backend" --disable-cudagraphs "$@" \
--output "$TEST_REPORTS_DIR/${backend}_no_cudagraphs_${suite}_${dtype}_${mode}_cuda_${target}.csv"
--output "$TEST_REPORTS_DIR/${backend}_no_cudagraphs_${suite}_${dtype}_${mode}_${device}_${target}.csv"
fi
if [[ "$DASHBOARD_TAG" == *cudagraphs-true* ]]; then
python "benchmarks/dynamo/$suite.py" \
$taskset python "benchmarks/dynamo/$suite.py" \
"${target_flag[@]}" --"$mode" --"$dtype" --backend "$backend" "$@" \
--output "$TEST_REPORTS_DIR/${backend}_with_cudagraphs_${suite}_${dtype}_${mode}_cuda_${target}.csv"
--output "$TEST_REPORTS_DIR/${backend}_with_cudagraphs_${suite}_${dtype}_${mode}_${device}_${target}.csv"
fi
if [[ "$DASHBOARD_TAG" == *dynamic-true* ]]; then
python "benchmarks/dynamo/$suite.py" \
$taskset python "benchmarks/dynamo/$suite.py" \
"${target_flag[@]}" --"$mode" --"$dtype" --backend "$backend" --dynamic-shapes \
--dynamic-batch-only "$@" \
--output "$TEST_REPORTS_DIR/${backend}_dynamic_${suite}_${dtype}_${mode}_cuda_${target}.csv"
--output "$TEST_REPORTS_DIR/${backend}_dynamic_${suite}_${dtype}_${mode}_${device}_${target}.csv"
fi
if [[ "$DASHBOARD_TAG" == *cppwrapper-true* ]] && [[ "$mode" == "inference" ]]; then
TORCHINDUCTOR_CPP_WRAPPER=1 python "benchmarks/dynamo/$suite.py" \
TORCHINDUCTOR_CPP_WRAPPER=1 $taskset python "benchmarks/dynamo/$suite.py" \
"${target_flag[@]}" --"$mode" --"$dtype" --backend "$backend" --disable-cudagraphs "$@" \
--output "$TEST_REPORTS_DIR/${backend}_cpp_wrapper_${suite}_${dtype}_${mode}_cuda_${target}.csv"
--output "$TEST_REPORTS_DIR/${backend}_cpp_wrapper_${suite}_${dtype}_${mode}_${device}_${target}.csv"
fi
if [[ "$DASHBOARD_TAG" == *freezing_cudagraphs-true* ]] && [[ "$mode" == "inference" ]]; then
python "benchmarks/dynamo/$suite.py" \
$taskset python "benchmarks/dynamo/$suite.py" \
"${target_flag[@]}" --"$mode" --"$dtype" --backend "$backend" "$@" --freezing \
--output "$TEST_REPORTS_DIR/${backend}_with_cudagraphs_freezing_${suite}_${dtype}_${mode}_cuda_${target}.csv"
--output "$TEST_REPORTS_DIR/${backend}_with_cudagraphs_freezing_${suite}_${dtype}_${mode}_${device}_${target}.csv"
fi
if [[ "$DASHBOARD_TAG" == *freeze_autotune_cudagraphs-true* ]] && [[ "$mode" == "inference" ]]; then
TORCHINDUCTOR_MAX_AUTOTUNE=1 python "benchmarks/dynamo/$suite.py" \
TORCHINDUCTOR_MAX_AUTOTUNE=1 $taskset python "benchmarks/dynamo/$suite.py" \
"${target_flag[@]}" --"$mode" --"$dtype" --backend "$backend" "$@" --freezing \
--output "$TEST_REPORTS_DIR/${backend}_with_cudagraphs_freezing_autotune_${suite}_${dtype}_${mode}_cuda_${target}.csv"
--output "$TEST_REPORTS_DIR/${backend}_with_cudagraphs_freezing_autotune_${suite}_${dtype}_${mode}_${device}_${target}.csv"
fi
if [[ "$DASHBOARD_TAG" == *aotinductor-true* ]] && [[ "$mode" == "inference" ]]; then
TORCHINDUCTOR_ABI_COMPATIBLE=1 python "benchmarks/dynamo/$suite.py" \
TORCHINDUCTOR_ABI_COMPATIBLE=1 $taskset python "benchmarks/dynamo/$suite.py" \
"${target_flag[@]}" --"$mode" --"$dtype" --export-aot-inductor --disable-cudagraphs "$@" \
--output "$TEST_REPORTS_DIR/${backend}_aot_inductor_${suite}_${dtype}_${mode}_cuda_${target}.csv"
--output "$TEST_REPORTS_DIR/${backend}_aot_inductor_${suite}_${dtype}_${mode}_${device}_${target}.csv"
fi
if [[ "$DASHBOARD_TAG" == *maxautotune-true* ]]; then
TORCHINDUCTOR_MAX_AUTOTUNE=1 python "benchmarks/dynamo/$suite.py" \
TORCHINDUCTOR_MAX_AUTOTUNE=1 $taskset python "benchmarks/dynamo/$suite.py" \
"${target_flag[@]}" --"$mode" --"$dtype" --backend "$backend" "$@" \
--output "$TEST_REPORTS_DIR/${backend}_max_autotune_${suite}_${dtype}_${mode}_cuda_${target}.csv"
--output "$TEST_REPORTS_DIR/${backend}_max_autotune_${suite}_${dtype}_${mode}_${device}_${target}.csv"
fi
if [[ "$DASHBOARD_TAG" == *cudagraphs_low_precision-true* ]] && [[ "$mode" == "inference" ]]; then
# TODO: This has a new dtype called quant and the benchmarks script needs to be updated to support this.
# The tentative command is as follows. It doesn't work now, but it's ok because we only need mock data
# to fill the dashboard.
python "benchmarks/dynamo/$suite.py" \
$taskset python "benchmarks/dynamo/$suite.py" \
"${target_flag[@]}" --"$mode" --quant --backend "$backend" "$@" \
--output "$TEST_REPORTS_DIR/${backend}_cudagraphs_low_precision_${suite}_quant_${mode}_cuda_${target}.csv" || true
--output "$TEST_REPORTS_DIR/${backend}_cudagraphs_low_precision_${suite}_quant_${mode}_${device}_${target}.csv" || true
# Copy cudagraph results as mock data, easiest choice?
cp "$TEST_REPORTS_DIR/${backend}_with_cudagraphs_${suite}_${dtype}_${mode}_cuda_${target}.csv" \
"$TEST_REPORTS_DIR/${backend}_cudagraphs_low_precision_${suite}_quant_${mode}_cuda_${target}.csv"
cp "$TEST_REPORTS_DIR/${backend}_with_cudagraphs_${suite}_${dtype}_${mode}_${device}_${target}.csv" \
"$TEST_REPORTS_DIR/${backend}_cudagraphs_low_precision_${suite}_quant_${mode}_${device}_${target}.csv"
fi
done
done
@@ -574,7 +585,7 @@ test_dynamo_benchmark() {
elif [[ "${TEST_CONFIG}" == *perf* ]]; then
test_single_dynamo_benchmark "dashboard" "$suite" "$shard_id" "$@"
else
if [[ "${TEST_CONFIG}" == *cpu_inductor* || "${TEST_CONFIG}" == *cpu_aot_inductor* ]]; then
if [[ "${TEST_CONFIG}" == *cpu* ]]; then
local dt="float32"
if [[ "${TEST_CONFIG}" == *amp* ]]; then
dt="amp"
@@ -645,10 +656,11 @@ test_inductor_torchbench_smoketest_perf() {
done
}
test_inductor_torchbench_cpu_smoketest_perf(){
TEST_REPORTS_DIR=$(pwd)/test/test-reports
mkdir -p "$TEST_REPORTS_DIR"
test_inductor_get_core_number() {
echo $(($(lscpu | grep 'Socket(s):' | awk '{print $2}') * $(lscpu | grep 'Core(s) per socket:' | awk '{print $4}')))
}
test_inductor_set_cpu_affinity(){
#set jemalloc
JEMALLOC_LIB="/usr/lib/x86_64-linux-gnu/libjemalloc.so.2"
IOMP_LIB="$(dirname "$(which python)")/../lib/libiomp5.so"
@@ -656,32 +668,39 @@ test_inductor_torchbench_cpu_smoketest_perf(){
export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:-1,muzzy_decay_ms:-1"
export KMP_AFFINITY=granularity=fine,compact,1,0
export KMP_BLOCKTIME=1
CORES=$(lscpu | grep Core | awk '{print $4}')
export OMP_NUM_THREADS=$CORES
end_core=$(( CORES-1 ))
cores=$(test_inductor_get_core_number)
export OMP_NUM_THREADS=$cores
}
test_inductor_torchbench_cpu_smoketest_perf(){
TEST_REPORTS_DIR=$(pwd)/test/test-reports
mkdir -p "$TEST_REPORTS_DIR"
test_inductor_set_cpu_affinity
end_core=$(( $(test_inductor_get_core_number)-1 ))
MODELS_SPEEDUP_TARGET=benchmarks/dynamo/expected_ci_speedup_inductor_torchbench_cpu.csv
grep -v '^ *#' < "$MODELS_SPEEDUP_TARGET" | while IFS=',' read -r -a model_cfg
do
local model_name=${model_cfg[0]}
local data_type=${model_cfg[1]}
local speedup_target=${model_cfg[4]}
if [[ ${model_cfg[3]} == "cpp" ]]; then
local data_type=${model_cfg[2]}
local speedup_target=${model_cfg[5]}
local backend=${model_cfg[1]}
if [[ ${model_cfg[4]} == "cpp" ]]; then
export TORCHINDUCTOR_CPP_WRAPPER=1
else
unset TORCHINDUCTOR_CPP_WRAPPER
fi
local output_name="$TEST_REPORTS_DIR/inductor_inference_${model_cfg[0]}_${model_cfg[1]}_${model_cfg[2]}_${model_cfg[3]}_cpu_smoketest.csv"
if [[ ${model_cfg[2]} == "dynamic" ]]; then
if [[ ${model_cfg[3]} == "dynamic" ]]; then
taskset -c 0-"$end_core" python benchmarks/dynamo/torchbench.py \
--inference --performance --"$data_type" -dcpu -n50 --only "$model_name" --dynamic-shapes \
--dynamic-batch-only --freezing --timeout 9000 --backend=inductor --output "$output_name"
--dynamic-batch-only --freezing --timeout 9000 --"$backend" --output "$output_name"
else
taskset -c 0-"$end_core" python benchmarks/dynamo/torchbench.py \
--inference --performance --"$data_type" -dcpu -n50 --only "$model_name" \
--freezing --timeout 9000 --backend=inductor --output "$output_name"
--freezing --timeout 9000 --"$backend" --output "$output_name"
fi
cat "$output_name"
# The threshold value needs to be actively maintained to make this check useful.
@@ -1267,7 +1286,7 @@ elif [[ "${TEST_CONFIG}" == *timm* ]]; then
id=$((SHARD_NUMBER-1))
test_dynamo_benchmark timm_models "$id"
elif [[ "${TEST_CONFIG}" == *torchbench* ]]; then
if [[ "${TEST_CONFIG}" == *cpu_inductor* || "${TEST_CONFIG}" == *cpu_aot_inductor* ]]; then
if [[ "${TEST_CONFIG}" == *cpu* ]]; then
install_torchaudio cpu
else
install_torchaudio cuda
@@ -1284,7 +1303,7 @@ elif [[ "${TEST_CONFIG}" == *torchbench* ]]; then
elif [[ "${TEST_CONFIG}" == *inductor_torchbench_cpu_smoketest_perf* ]]; then
checkout_install_torchbench timm_vision_transformer phlippe_densenet basic_gnn_gcn \
llama_v2_7b_16h resnet50 timm_efficientnet mobilenet_v3_large timm_resnest \
shufflenet_v2_x1_0 hf_GPT2
shufflenet_v2_x1_0 hf_GPT2 yolov3 mobilenet_v2 resnext50_32x4d hf_T5_base
PYTHONPATH=$(pwd)/torchbench test_inductor_torchbench_cpu_smoketest_perf
elif [[ "${TEST_CONFIG}" == *torchbench_gcp_smoketest* ]]; then
checkout_install_torchbench
@@ -1293,7 +1312,7 @@ elif [[ "${TEST_CONFIG}" == *torchbench* ]]; then
checkout_install_torchbench
# Do this after checkout_install_torchbench to ensure we clobber any
# nightlies that torchbench may pull in
if [[ "${TEST_CONFIG}" != *cpu_inductor* && "${TEST_CONFIG}" != *cpu_aot_inductor* ]]; then
if [[ "${TEST_CONFIG}" != *cpu* ]]; then
install_torchrec_and_fbgemm
fi
PYTHONPATH=$(pwd)/torchbench test_dynamo_benchmark torchbench "$id"


@@ -52,13 +52,6 @@ if [[ "\$python_nodot" = *39* ]]; then
NUMPY_PIN=">=1.20"
fi
if [[ "\$python_nodot" = *38* ]]; then
# sympy 1.12.1 is the last version that supports Python 3.8
SYMPY_PIN="==1.12.1"
else
SYMPY_PIN=">=1.13.0"
fi
# Move debug wheels out of the package dir so they don't get installed
mkdir -p /tmp/debug_final_pkgs
mv /final_pkgs/debug-*.zip /tmp/debug_final_pkgs || echo "no debug packages to move"
@@ -88,7 +81,7 @@ if [[ "$PACKAGE_TYPE" == conda ]]; then
"numpy\${NUMPY_PIN}" \
mkl>=2018 \
ninja \
"sympy\${SYMPY_PIN}" \
sympy>=1.12 \
typing-extensions \
${PROTOBUF_PACKAGE}
if [[ "$DESIRED_CUDA" == 'cpu' ]]; then


@@ -5,7 +5,7 @@ git submodule sync
git submodule update --init --recursive
# This takes some time
make setup_lint
make setup-lint
# Add CMAKE_PREFIX_PATH to bashrc
echo 'export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}' >> ~/.bashrc


@@ -9,14 +9,16 @@ self-hosted-runner:
- linux.large
- linux.2xlarge
- linux.4xlarge
- linux.9xlarge.ephemeral
- linux.12xlarge
- linux.12xlarge.ephemeral
- linux.24xlarge
- linux.arm64.2xlarge
- linux.4xlarge.nvidia.gpu
- linux.8xlarge.nvidia.gpu
- linux.16xlarge.nvidia.gpu
- linux.g5.4xlarge.nvidia.gpu
# Organization-wide AWS Linux Runners on Linux Foundation account
# Pytorch/pytorch AWS Linux Runners on Linux Foundation account
- lf.linux.large
- lf.linux.2xlarge
- lf.linux.4xlarge
@@ -27,6 +29,28 @@
- lf.linux.8xlarge.nvidia.gpu
- lf.linux.16xlarge.nvidia.gpu
- lf.linux.g5.4xlarge.nvidia.gpu
# Organization-wide AWS Linux Runners with new Amazon 2023 AMI
- amz2023.linux.large
- amz2023.linux.2xlarge
- amz2023.linux.4xlarge
- amz2023.linux.12xlarge
- amz2023.linux.24xlarge
- amz2023.linux.arm64.2xlarge
- amz2023.linux.4xlarge.nvidia.gpu
- amz2023.linux.8xlarge.nvidia.gpu
- amz2023.linux.16xlarge.nvidia.gpu
- amz2023.linux.g5.4xlarge.nvidia.gpu
# Pytorch/pytorch AWS Linux Runners with the new Amazon 2023 AMI on Linux Foundation account
- amz2023.lf.linux.large
- amz2023.lf.linux.2xlarge
- amz2023.lf.linux.4xlarge
- amz2023.lf.linux.12xlarge
- amz2023.lf.linux.24xlarge
- amz2023.lf.linux.arm64.2xlarge
- amz2023.lf.linux.4xlarge.nvidia.gpu
- amz2023.lf.linux.8xlarge.nvidia.gpu
- amz2023.lf.linux.16xlarge.nvidia.gpu
- amz2023.lf.linux.g5.4xlarge.nvidia.gpu
# Repo-specific IBM hosted S390x runner
- linux.s390x
# Organization wide AWS Windows runners


@@ -6,6 +6,7 @@ ciflow_push_tags:
- ciflow/binaries_libtorch
- ciflow/binaries_wheel
- ciflow/inductor
- ciflow/inductor-rocm
- ciflow/inductor-perf-compare
- ciflow/inductor-micro-benchmark
- ciflow/inductor-cu124


@@ -13,7 +13,7 @@ git checkout "$SYNC_BRANCH"
# Using a hardcoded SHA here is a massive speedup as we can skip the entire history of the pytorch GitHub repo.
# This specific SHA was chosen as it was before the "branch point" of the stable branch
for SHA in $(git log ba3b05fdf37ddbc3c301294d6a560a816335e717..origin/main --pretty="%h" --reverse -- torch/distributed torch/csrc/distributed test/distributed test/cpp/c10d benchmarks/distributed)
for SHA in $(git log ba3b05fdf37ddbc3c301294d6a560a816335e717..origin/main --pretty="%h" -- torch/distributed torch/csrc/distributed test/distributed test/cpp/c10d benchmarks/distributed)
do
# `git merge-base --is-ancestor` exits with code 0 if the given SHA is an ancestor, and non-0 otherwise
if git merge-base --is-ancestor $SHA HEAD || [[ $(git log --grep="(cherry picked from commit $SHA") ]]


@@ -1459,7 +1459,7 @@ def find_matching_merge_rule(
if not skip_internal_checks and pr.has_internal_changes():
raise RuntimeError(
"This PR has internal changes and must be landed via Phabricator"
"This PR has internal changes and must be landed via Phabricator! Please try reimporting/rexporting the PR!"
)
# Categorize all checks when skip_mandatory_checks (force merge) is set. Do it here


@@ -0,0 +1,64 @@
name: Build conda docker images
on:
workflow_dispatch:
push:
branches:
- main
- release/*
tags:
# NOTE: Binary build pipelines should only get triggered on release candidate or nightly builds
# Release candidate tags look like: v1.11.0-rc1
- v[0-9]+.[0-9]+.[0-9]+-rc[0-9]+
paths:
- conda/Dockerfile
- 'common/*'
- .github/workflows/build-conda-images.yml
pull_request:
paths:
- conda/Dockerfile
- 'common/*'
- .github/workflows/build-conda-images.yml
env:
DOCKER_REGISTRY: "docker.io"
DOCKER_BUILDKIT: 1
DOCKER_ID: ${{ secrets.DOCKER_ID }}
DOCKER_TOKEN: ${{ secrets.DOCKER_TOKEN }}
WITH_PUSH: ${{ github.event_name == 'push' && (github.ref == 'refs/heads/main' || startsWith(github.ref, 'refs/heads/release')) }}
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }}
cancel-in-progress: true
jobs:
build-docker:
runs-on: linux.9xlarge.ephemeral
strategy:
matrix:
cuda_version: ["11.8", "12.1", "12.4", "cpu"]
env:
CUDA_VERSION: ${{ matrix.cuda_version }}
steps:
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
with:
submodules: false
- name: Calculate docker image
if: env.WITH_PUSH == 'false'
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
with:
docker-image-name: conda-builder${{ matrix.cuda_version == 'cpu' && '-' || '-cuda' }}${{matrix.cuda_version}}
docker-build-dir: .ci/docker/conda
always-rebuild: true
push: true
- name: Authenticate if WITH_PUSH
if: env.WITH_PUSH == 'true'
run: |
if [[ "${WITH_PUSH}" == true ]]; then
echo "${DOCKER_TOKEN}" | docker login -u "${DOCKER_ID}" --password-stdin
fi
- name: Build Docker Image
if: env.WITH_PUSH == 'true'
run: |
.ci/docker/conda/build.sh conda-builder${{ matrix.cuda_version == 'cpu' && ':' || ':cuda' }}${{matrix.cuda_version}}


@@ -0,0 +1,120 @@
name: Build libtorch docker images
on:
push:
branches:
- main
- release/*
tags:
# NOTE: Binary build pipelines should only get triggered on release candidate or nightly builds
# Release candidate tags look like: v1.11.0-rc1
- v[0-9]+.[0-9]+.[0-9]+-rc[0-9]+
paths:
- '.ci/docker/libtorch/*'
- '.ci/docker/common/*'
- .github/workflows/build-libtorch-images.yml
pull_request:
paths:
- '.ci/docker/libtorch/*'
- '.ci/docker/common/*'
- .github/workflows/build-libtorch-images.yml
env:
DOCKER_REGISTRY: "docker.io"
DOCKER_BUILDKIT: 1
DOCKER_ID: ${{ secrets.DOCKER_ID }}
DOCKER_TOKEN: ${{ secrets.DOCKER_TOKEN }}
WITH_PUSH: ${{ github.event_name == 'push' && (github.ref == 'refs/heads/main' || startsWith(github.ref, 'refs/heads/release')) }}
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }}
cancel-in-progress: true
jobs:
build-docker-cuda:
runs-on: linux.9xlarge.ephemeral
strategy:
matrix:
cuda_version: ["12.4", "12.1", "11.8"]
env:
GPU_ARCH_TYPE: cuda
GPU_ARCH_VERSION: ${{ matrix.cuda_version }}
steps:
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
with:
submodules: false
- name: Calculate docker image
if: env.WITH_PUSH == 'false'
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
with:
docker-image-name: libtorch-cxx11-builder-cuda${{matrix.cuda_version}}
docker-build-dir: .ci/docker/libtorch
always-rebuild: true
push: true
- name: Authenticate if WITH_PUSH
if: env.WITH_PUSH == 'true'
run: |
if [[ "${WITH_PUSH}" == true ]]; then
echo "${DOCKER_TOKEN}" | docker login -u "${DOCKER_ID}" --password-stdin
fi
- name: Build Docker Image
if: env.WITH_PUSH == 'true'
run: |
.ci/docker/libtorch/build.sh libtorch-cxx11-builder:cuda${{matrix.cuda_version}}
build-docker-rocm:
runs-on: linux.9xlarge.ephemeral
strategy:
matrix:
rocm_version: ["6.0", "6.1"]
env:
GPU_ARCH_TYPE: rocm
GPU_ARCH_VERSION: ${{ matrix.rocm_version }}
steps:
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
with:
submodules: false
- name: Calculate docker image
if: env.WITH_PUSH == 'false'
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
with:
docker-image-name: libtorch-cxx11-builder-rocm${{matrix.rocm_version}}
docker-build-dir: .ci/docker/libtorch
always-rebuild: true
push: true
- name: Authenticate if WITH_PUSH
if: env.WITH_PUSH == 'true'
run: |
if [[ "${WITH_PUSH}" == true ]]; then
echo "${DOCKER_TOKEN}" | docker login -u "${DOCKER_ID}" --password-stdin
fi
- name: Build Docker Image
if: env.WITH_PUSH == 'true'
run: |
.ci/docker/libtorch/build.sh libtorch-cxx11-builder:rocm${{matrix.rocm_version}}
build-docker-cpu:
runs-on: linux.9xlarge.ephemeral
steps:
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
with:
submodules: false
- name: Calculate docker image
if: env.WITH_PUSH == 'false'
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
with:
docker-image-name: libtorch-cxx11-builder-cpu
docker-build-dir: .ci/docker/libtorch
always-rebuild: true
push: true
- name: Authenticate if WITH_PUSH
if: env.WITH_PUSH == 'true'
run: |
if [[ "${WITH_PUSH}" == true ]]; then
echo "${DOCKER_TOKEN}" | docker login -u "${DOCKER_ID}" --password-stdin
fi
- name: Build Docker Image
if: env.WITH_PUSH == 'true'
run: |
.ci/docker/libtorch/build.sh libtorch-cxx11-builder:cpu
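Every job in this file repeats the same push gate: WITH_PUSH is computed once at the workflow level from the event name and ref, each push-only step carries "if: env.WITH_PUSH == 'true'", and the login step re-checks the variable in bash before piping DOCKER_TOKEN into docker login. A rough Python sketch of the predicate, assuming only the event name and ref are available:

# Sketch of the WITH_PUSH condition (the real check is the GitHub Actions
# expression in the env block above; this mirrors its logic for illustration).
def with_push(event_name, ref):
    return event_name == "push" and (
        ref == "refs/heads/main" or ref.startswith("refs/heads/release")
    )

assert with_push("push", "refs/heads/main")
assert with_push("push", "refs/heads/release/2.4")
assert not with_push("pull_request", "refs/heads/main")   # PR runs only calculate the image
assert not with_push("push", "refs/tags/v2.4.0-rc1")      # tag pushes do not publish here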


@ -0,0 +1,322 @@
name: Build manywheel docker images
on:
workflow_dispatch:
push:
branches:
- main
- release/*
tags:
# NOTE: Binary build pipelines should only get triggered on release candidate or nightly builds
# Release candidate tags look like: v1.11.0-rc1
- v[0-9]+.[0-9]+.[0-9]+-rc[0-9]+
paths:
- '.ci/docker/manywheel/*'
- '.ci/docker/common/*'
- .github/workflows/build-manywheel-images.yml
pull_request:
paths:
- '.ci/docker/manywheel/*'
- '.ci/docker/common/*'
- .github/workflows/build-manywheel-images.yml
env:
DOCKER_REGISTRY: "docker.io"
DOCKER_BUILDKIT: 1
DOCKER_ID: ${{ secrets.DOCKER_ID }}
DOCKER_TOKEN: ${{ secrets.DOCKER_TOKEN }}
WITH_PUSH: ${{ github.event_name == 'push' && (github.ref == 'refs/heads/main' || startsWith(github.ref, 'refs/heads/release')) }}
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }}
cancel-in-progress: true
jobs:
build-docker-cuda:
runs-on: linux.9xlarge.ephemeral
strategy:
matrix:
cuda_version: ["12.4", "12.1", "11.8"]
env:
GPU_ARCH_TYPE: cuda
GPU_ARCH_VERSION: ${{ matrix.cuda_version }}
steps:
- name: Purge tools folder (free space for build)
run: rm -rf /opt/hostedtoolcache
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
with:
submodules: false
- name: Calculate docker image
if: env.WITH_PUSH == 'false'
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
with:
docker-image-name: manylinux-builder-cuda${{matrix.cuda_version}}
docker-build-dir: .ci/docker/manywheel
always-rebuild: true
push: true
- name: Authenticate if WITH_PUSH
if: env.WITH_PUSH == 'true'
run: |
if [[ "${WITH_PUSH}" == true ]]; then
echo "${DOCKER_TOKEN}" | docker login -u "${DOCKER_ID}" --password-stdin
fi
- name: Build Docker Image
if: env.WITH_PUSH == 'true'
run: |
.ci/docker/manywheel/build.sh manylinux-builder:cuda${{matrix.cuda_version}}
# NOTE: manylinux_2_28 are still experimental, see https://github.com/pytorch/pytorch/issues/123649
build-docker-cuda-manylinux_2_28:
runs-on: linux.9xlarge.ephemeral
strategy:
matrix:
cuda_version: ["12.4", "12.1", "11.8"]
env:
GPU_ARCH_TYPE: cuda-manylinux_2_28
GPU_ARCH_VERSION: ${{ matrix.cuda_version }}
steps:
- name: Purge tools folder (free space for build)
run: rm -rf /opt/hostedtoolcache
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
with:
submodules: false
- name: Calculate docker image
if: env.WITH_PUSH == 'false'
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
with:
docker-image-name: manylinux2_28-builder-cuda${{matrix.cuda_version}}
docker-build-dir: .ci/docker/manywheel
always-rebuild: true
push: true
- name: Authenticate if WITH_PUSH
if: env.WITH_PUSH == 'true'
run: |
if [[ "${WITH_PUSH}" == true ]]; then
echo "${DOCKER_TOKEN}" | docker login -u "${DOCKER_ID}" --password-stdin
fi
- name: Build Docker Image
if: env.WITH_PUSH == 'true'
run: |
.ci/docker/manywheel/build.sh manylinux2_28-builder:cuda${{matrix.cuda_version}}
build-docker-cuda-aarch64:
runs-on: linux.arm64.2xlarge
strategy:
matrix:
cuda_version: ["12.4"]
env:
GPU_ARCH_TYPE: cuda-aarch64
GPU_ARCH_VERSION: ${{ matrix.cuda_version }}
steps:
- name: Checkout PyTorch
uses: actions/checkout@v3
- name: Calculate docker image
if: env.WITH_PUSH == 'false'
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
with:
docker-image-name: manylinuxaarch64-builder-cuda${{matrix.cuda_version}}
docker-build-dir: .ci/docker/manywheel
always-rebuild: true
push: true
- name: Authenticate if WITH_PUSH
if: env.WITH_PUSH == 'true'
run: |
if [[ "${WITH_PUSH}" == true ]]; then
echo "${DOCKER_TOKEN}" | docker login -u "${DOCKER_ID}" --password-stdin
fi
- name: Build Docker Image
if: env.WITH_PUSH == 'true'
run: |
.ci/docker/manywheel/build.sh manylinuxaarch64-builder:cuda${{matrix.cuda_version}}
build-docker-rocm:
runs-on: linux.9xlarge.ephemeral
strategy:
matrix:
rocm_version: ["6.0", "6.1"]
env:
GPU_ARCH_TYPE: rocm
GPU_ARCH_VERSION: ${{ matrix.rocm_version }}
steps:
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
with:
submodules: false
- name: Calculate docker image
if: env.WITH_PUSH == 'false'
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
with:
docker-image-name: manylinux-builder-rocm${{matrix.rocm_version}}
docker-build-dir: .ci/docker/manywheel
always-rebuild: true
push: true
- name: Authenticate if WITH_PUSH
if: env.WITH_PUSH == 'true'
run: |
if [[ "${WITH_PUSH}" == true ]]; then
echo "${DOCKER_TOKEN}" | docker login -u "${DOCKER_ID}" --password-stdin
fi
- name: Build Docker Image
if: env.WITH_PUSH == 'true'
run: |
.ci/docker/manywheel/build.sh manylinux-builder:rocm${{matrix.rocm_version}}
build-docker-cpu:
runs-on: linux.9xlarge.ephemeral
steps:
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
with:
submodules: false
- name: Calculate docker image
if: env.WITH_PUSH == 'false'
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
with:
docker-image-name: manylinux-builder-cpu
docker-build-dir: .ci/docker/manywheel
always-rebuild: true
push: true
- name: Authenticate if WITH_PUSH
if: env.WITH_PUSH == 'true'
run: |
if [[ "${WITH_PUSH}" == true ]]; then
echo "${DOCKER_TOKEN}" | docker login -u "${DOCKER_ID}" --password-stdin
fi
- name: Build Docker Image
if: env.WITH_PUSH == 'true'
run: |
.ci/docker/manywheel/build.sh manylinux-builder:cpu
build-docker-cpu-manylinux_2_28:
runs-on: linux.9xlarge.ephemeral
env:
GPU_ARCH_TYPE: cpu-manylinux_2_28
steps:
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
with:
submodules: false
- name: Calculate docker image
if: env.WITH_PUSH == 'false'
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
with:
docker-image-name: manylinux2_28-builder-cpu
docker-build-dir: .ci/docker/manywheel
always-rebuild: true
push: true
- name: Authenticate if WITH_PUSH
if: env.WITH_PUSH == 'true'
run: |
if [[ "${WITH_PUSH}" == true ]]; then
echo "${DOCKER_TOKEN}" | docker login -u "${DOCKER_ID}" --password-stdin
fi
- name: Build Docker Image
if: env.WITH_PUSH == 'true'
run: |
.ci/docker/manywheel/build.sh manylinux2_28-builder:cpu
build-docker-cpu-aarch64:
runs-on: linux.arm64.2xlarge
env:
GPU_ARCH_TYPE: cpu-aarch64
steps:
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
with:
submodules: false
- name: Calculate docker image
if: env.WITH_PUSH == 'false'
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
with:
docker-image-name: manylinuxaarch64-builder-cpu-aarch64
docker-build-dir: .ci/docker/manywheel
always-rebuild: true
push: true
- name: Authenticate if WITH_PUSH
if: env.WITH_PUSH == 'true'
run: |
if [[ "${WITH_PUSH}" == true ]]; then
echo "${DOCKER_TOKEN}" | docker login -u "${DOCKER_ID}" --password-stdin
fi
- name: Build Docker Image
if: env.WITH_PUSH == 'true'
run: |
.ci/docker/manywheel/build.sh manylinuxaarch64-builder:cpu-aarch64
build-docker-cpu-aarch64-2_28:
runs-on: linux.arm64.2xlarge
env:
GPU_ARCH_TYPE: cpu-aarch64-2_28
steps:
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
with:
submodules: false
- name: Calculate docker image
if: env.WITH_PUSH == 'false'
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
with:
docker-image-name: manylinux2_28_aarch64-builder-cpu-aarch64
docker-build-dir: .ci/docker/manywheel
always-rebuild: true
push: true
- name: Authenticate if WITH_PUSH
if: env.WITH_PUSH == 'true'
run: |
if [[ "${WITH_PUSH}" == true ]]; then
echo "${DOCKER_TOKEN}" | docker login -u "${DOCKER_ID}" --password-stdin
fi
- name: Build Docker Image
if: env.WITH_PUSH == 'true'
run: |
.ci/docker/manywheel/build.sh manylinux2_28_aarch64-builder:cpu-aarch64
build-docker-cpu-cxx11-abi:
runs-on: linux.9xlarge.ephemeral
env:
GPU_ARCH_TYPE: cpu-cxx11-abi
steps:
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
with:
submodules: false
- name: Calculate docker image
if: env.WITH_PUSH == 'false'
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
with:
docker-image-name: manylinuxcxx11-abi-builder-cpu-cxx11-abi
docker-build-dir: .ci/docker/manywheel
always-rebuild: true
push: true
- name: Authenticate if WITH_PUSH
if: env.WITH_PUSH == 'true'
run: |
if [[ "${WITH_PUSH}" == true ]]; then
echo "${DOCKER_TOKEN}" | docker login -u "${DOCKER_ID}" --password-stdin
fi
- name: Build Docker Image
if: env.WITH_PUSH == 'true'
run: |
.ci/docker/manywheel/build.sh manylinuxcxx11-abi-builder:cpu-cxx11-abi
build-docker-xpu:
runs-on: linux.9xlarge.ephemeral
env:
GPU_ARCH_TYPE: xpu
steps:
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
with:
submodules: false
- name: Calculate docker image
if: env.WITH_PUSH == 'false'
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
with:
docker-image-name: manylinux2_28-builder-xpu
docker-build-dir: .ci/docker/manywheel
always-rebuild: true
push: true
- name: Authenticate if WITH_PUSH
if: env.WITH_PUSH == 'true'
run: |
if [[ "${WITH_PUSH}" == true ]]; then
echo "${DOCKER_TOKEN}" | docker login -u "${DOCKER_ID}" --password-stdin
fi
- name: Build Docker Image
if: env.WITH_PUSH == 'true'
run: |
.ci/docker/manywheel/build.sh manylinux2_28-builder:xpu
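Each job above produces one flavor of the manywheel builder image: the repository part of the tag encodes the base image family (manylinux, manylinux2_28, manylinuxaarch64, manylinux2_28_aarch64, manylinuxcxx11-abi) and the tag part encodes the accelerator. A hedged Python sketch of the mapping, copied from the build.sh invocations above (the "cpu" key is an assumption, since that job sets no GPU_ARCH_TYPE):

# Illustrative mapping from GPU_ARCH_TYPE (plus version) to the image tag
# passed to .ci/docker/manywheel/build.sh. Hypothetical helper.
def manywheel_image(gpu_arch_type, version=""):
    table = {
        "cuda": f"manylinux-builder:cuda{version}",
        "cuda-manylinux_2_28": f"manylinux2_28-builder:cuda{version}",
        "cuda-aarch64": f"manylinuxaarch64-builder:cuda{version}",
        "rocm": f"manylinux-builder:rocm{version}",
        "cpu": "manylinux-builder:cpu",
        "cpu-manylinux_2_28": "manylinux2_28-builder:cpu",
        "cpu-aarch64": "manylinuxaarch64-builder:cpu-aarch64",
        "cpu-aarch64-2_28": "manylinux2_28_aarch64-builder:cpu-aarch64",
        "cpu-cxx11-abi": "manylinuxcxx11-abi-builder:cpu-cxx11-abi",
        "xpu": "manylinux2_28-builder:xpu",
    }
    return table[gpu_arch_type]

print(manywheel_image("cuda", "12.4"))   # manylinux-builder:cuda12.4
print(manywheel_image("xpu"))            # manylinux2_28-builder:xpu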


@ -65,6 +65,8 @@ jobs:
include:
- docker-image-name: pytorch-linux-jammy-aarch64-py3.10-gcc11
runner: linux.arm64.2xlarge
- docker-image-name: pytorch-linux-jammy-aarch64-py3.10-gcc11-inductor-benchmarks
runner: linux.arm64.2xlarge
runs-on: [self-hosted, "${{ matrix.runner }}"]
env:
DOCKER_IMAGE_BASE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/${{ matrix.docker-image-name }}
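The two-line addition above registers a second aarch64 image (the inductor-benchmarks variant) in the matrix include list, pinned to the same linux.arm64.2xlarge runner; the job then fans out one build per entry, each resolving its own ECR path from DOCKER_IMAGE_BASE. A small illustrative sketch of that fan-out:

# Sketch of how the include entries expand: one job per entry, each with its
# own runner label and ECR image path (registry ID as shown above).
entries = [
    {"docker-image-name": "pytorch-linux-jammy-aarch64-py3.10-gcc11",
     "runner": "linux.arm64.2xlarge"},
    {"docker-image-name": "pytorch-linux-jammy-aarch64-py3.10-gcc11-inductor-benchmarks",
     "runner": "linux.arm64.2xlarge"},
]
for e in entries:
    base = ("308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/"
            + e["docker-image-name"])
    print(f"runs-on=[self-hosted, {e['runner']}]  DOCKER_IMAGE_BASE={base}")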


@ -9,6 +9,7 @@ on:
# Run every 4 hours during the week and every 12 hours on the weekend
- cron: 45 0,4,8,12,16,20 * * 1-5
- cron: 45 4,12 * * 0,6
- cron: 29 8 * * * # about 1:29am PDT, for mem leak check and rerun disabled tests
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}
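The extra cron above (29 8 * * *, about 1:29am PDT) exists so a single workflow can carry several schedules and let individual jobs branch on which one fired via github.event.schedule; the new entry is reserved for the mem-leak check and the rerun of disabled tests. A minimal Python sketch of that branching (the flag names are hypothetical):

# Illustrative branching on which cron fired; jobs compare github.event.schedule
# the same way. Flag names here are made up for the example.
MEM_LEAK_CRON = "29 8 * * *"
REGULAR_CRONS = {"45 0,4,8,12,16,20 * * 1-5", "45 4,12 * * 0,6"}

def run_flags(schedule):
    return {
        "mem_leak_check": schedule == MEM_LEAK_CRON,
        "rerun_disabled_tests": schedule == MEM_LEAK_CRON,
        "regular_ci": schedule in REGULAR_CRONS,
    }

print(run_flags("29 8 * * *"))
# {'mem_leak_check': True, 'rerun_disabled_tests': True, 'regular_ci': False}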


@ -0,0 +1,106 @@
name: inductor-perf-nightly-aarch64
on:
schedule:
# - cron: 0 7 * * 1-6
# - cron: 0 7 * * 0
# Does not perform max_autotune on CPU, so skip the weekly run setup
- cron: 0 7 * * *
# NB: GitHub has an upper limit of 10 inputs here
workflow_dispatch:
inputs:
training:
# CPU for training is not typical, but leave the option open here
description: Run training (off by default)?
required: false
type: boolean
default: false
inference:
description: Run inference (on by default)?
required: false
type: boolean
default: true
default:
description: Run inductor_default?
required: false
type: boolean
default: true
dynamic:
description: Run inductor_dynamic_shapes?
required: false
type: boolean
default: false
aotinductor:
description: Run aot_inductor for inference?
required: false
type: boolean
default: false
benchmark_configs:
description: The list of configs used by the benchmark
required: false
type: string
default: inductor_huggingface_perf_cpu_aarch64,inductor_timm_perf_cpu_aarch64,inductor_torchbench_perf_cpu_aarch64
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}-${{ github.event_name == 'schedule' }}
cancel-in-progress: true
permissions: read-all
jobs:
linux-jammy-aarch64-py3_10-inductor-build:
name: linux-jammy-aarch64-py3.10-inductor
uses: ./.github/workflows/_linux-build.yml
with:
build-environment: linux-jammy-aarch64-py3.10
docker-image-name: pytorch-linux-jammy-aarch64-py3.10-gcc11-inductor-benchmarks
test-matrix: |
{ include: [
{ config: "inductor_huggingface_perf_cpu_aarch64", shard: 1, num_shards: 3, runner: "linux.arm64.m7g.metal" },
{ config: "inductor_huggingface_perf_cpu_aarch64", shard: 2, num_shards: 3, runner: "linux.arm64.m7g.metal" },
{ config: "inductor_huggingface_perf_cpu_aarch64", shard: 3, num_shards: 3, runner: "linux.arm64.m7g.metal" },
{ config: "inductor_timm_perf_cpu_aarch64", shard: 1, num_shards: 5, runner: "linux.arm64.m7g.metal" },
{ config: "inductor_timm_perf_cpu_aarch64", shard: 2, num_shards: 5, runner: "linux.arm64.m7g.metal" },
{ config: "inductor_timm_perf_cpu_aarch64", shard: 3, num_shards: 5, runner: "linux.arm64.m7g.metal" },
{ config: "inductor_timm_perf_cpu_aarch64", shard: 4, num_shards: 5, runner: "linux.arm64.m7g.metal" },
{ config: "inductor_timm_perf_cpu_aarch64", shard: 5, num_shards: 5, runner: "linux.arm64.m7g.metal" },
{ config: "inductor_torchbench_perf_cpu_aarch64", shard: 1, num_shards: 4, runner: "linux.arm64.m7g.metal" },
{ config: "inductor_torchbench_perf_cpu_aarch64", shard: 2, num_shards: 4, runner: "linux.arm64.m7g.metal" },
{ config: "inductor_torchbench_perf_cpu_aarch64", shard: 3, num_shards: 4, runner: "linux.arm64.m7g.metal" },
{ config: "inductor_torchbench_perf_cpu_aarch64", shard: 4, num_shards: 4, runner: "linux.arm64.m7g.metal" },
]}
selected-test-configs: ${{ inputs.benchmark_configs }}
secrets:
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
linux-jammy-aarch64-py3_10-inductor-test-nightly:
name: linux-jammy-aarch64-py3.10-inductor
uses: ./.github/workflows/_linux-test.yml
needs: linux-jammy-aarch64-py3_10-inductor-build
if: github.event.schedule == '0 7 * * *'
with:
build-environment: linux-jammy-aarch64-py3.10
dashboard-tag: training-false-inference-true-default-true-dynamic-true-aotinductor-true
docker-image: ${{ needs.linux-jammy-aarch64-py3_10-inductor-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-jammy-aarch64-py3_10-inductor-build.outputs.test-matrix }}
use-gha: anything-non-empty-to-use-gha
timeout-minutes: 720
secrets:
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
linux-jammy-aarch64-py3_10-inductor-test:
name: linux-jammy-aarch64-py3.10-inductor
uses: ./.github/workflows/_linux-test.yml
needs: linux-jammy-aarch64-py3_10-inductor-build
if: github.event_name == 'workflow_dispatch'
with:
build-environment: linux-jammy-aarch64-py3.10
dashboard-tag: training-${{ inputs.training }}-inference-${{ inputs.inference }}-default-${{ inputs.default }}-dynamic-${{ inputs.dynamic }}-aotinductor-${{ inputs.aotinductor }}
docker-image: ${{ needs.linux-jammy-aarch64-py3_10-inductor-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-jammy-aarch64-py3_10-inductor-build.outputs.test-matrix }}
use-gha: anything-non-empty-to-use-gha
timeout-minutes: 720
secrets:
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
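The test-matrix above spells out every (config, shard) pair by hand: 3 shards of the HuggingFace suite, 5 of TIMM, and 4 of TorchBench, all on linux.arm64.m7g.metal. A small Python sketch that generates the same structure (just an illustration of the layout, not how the workflow actually builds it):

import json

# Generates entries equivalent to the hand-written matrix above.
def shard_entries(config, num_shards, runner):
    return [{"config": config, "shard": s, "num_shards": num_shards, "runner": runner}
            for s in range(1, num_shards + 1)]

runner = "linux.arm64.m7g.metal"
matrix = {"include":
          shard_entries("inductor_huggingface_perf_cpu_aarch64", 3, runner)
          + shard_entries("inductor_timm_perf_cpu_aarch64", 5, runner)
          + shard_entries("inductor_torchbench_perf_cpu_aarch64", 4, runner)}
print(json.dumps(matrix, indent=2))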


@ -0,0 +1,106 @@
name: inductor-perf-nightly-x86
on:
schedule:
# - cron: 0 7 * * 1-6
# - cron: 0 7 * * 0
# Does not perform max_autotune on CPU, so skip the weekly run setup
- cron: 0 7 * * *
# NB: GitHub has an upper limit of 10 inputs here
workflow_dispatch:
inputs:
training:
# CPU for training is not typical, but leave the option open here
description: Run training (off by default)?
required: false
type: boolean
default: false
inference:
description: Run inference (on by default)?
required: false
type: boolean
default: true
default:
description: Run inductor_default?
required: false
type: boolean
default: true
dynamic:
description: Run inductor_dynamic_shapes?
required: false
type: boolean
default: false
aotinductor:
description: Run aot_inductor for inference?
required: false
type: boolean
default: false
benchmark_configs:
description: The list of configs used by the benchmark
required: false
type: string
default: inductor_huggingface_perf_cpu_x86,inductor_timm_perf_cpu_x86,inductor_torchbench_perf_cpu_x86
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}-${{ github.event_name == 'schedule' }}
cancel-in-progress: true
permissions: read-all
jobs:
linux-jammy-cpu-py3_8-gcc11-inductor-build:
name: linux-jammy-cpu-py3.8-gcc11-inductor
uses: ./.github/workflows/_linux-build.yml
with:
build-environment: linux-jammy-py3.8-gcc11-build
docker-image-name: pytorch-linux-jammy-py3.8-gcc11-inductor-benchmarks
test-matrix: |
{ include: [
{ config: "inductor_huggingface_perf_cpu_x86", shard: 1, num_shards: 3, runner: "linux.24xl.spr-metal" },
{ config: "inductor_huggingface_perf_cpu_x86", shard: 2, num_shards: 3, runner: "linux.24xl.spr-metal" },
{ config: "inductor_huggingface_perf_cpu_x86", shard: 3, num_shards: 3, runner: "linux.24xl.spr-metal" },
{ config: "inductor_timm_perf_cpu_x86", shard: 1, num_shards: 5, runner: "linux.24xl.spr-metal" },
{ config: "inductor_timm_perf_cpu_x86", shard: 2, num_shards: 5, runner: "linux.24xl.spr-metal" },
{ config: "inductor_timm_perf_cpu_x86", shard: 3, num_shards: 5, runner: "linux.24xl.spr-metal" },
{ config: "inductor_timm_perf_cpu_x86", shard: 4, num_shards: 5, runner: "linux.24xl.spr-metal" },
{ config: "inductor_timm_perf_cpu_x86", shard: 5, num_shards: 5, runner: "linux.24xl.spr-metal" },
{ config: "inductor_torchbench_perf_cpu_x86", shard: 1, num_shards: 4, runner: "linux.24xl.spr-metal" },
{ config: "inductor_torchbench_perf_cpu_x86", shard: 2, num_shards: 4, runner: "linux.24xl.spr-metal" },
{ config: "inductor_torchbench_perf_cpu_x86", shard: 3, num_shards: 4, runner: "linux.24xl.spr-metal" },
{ config: "inductor_torchbench_perf_cpu_x86", shard: 4, num_shards: 4, runner: "linux.24xl.spr-metal" },
]}
selected-test-configs: ${{ inputs.benchmark_configs }}
secrets:
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
linux-jammy-cpu-py3_8-gcc11-inductor-test-nightly:
name: linux-jammy-cpu-py3.8-gcc11-inductor
uses: ./.github/workflows/_linux-test.yml
needs: linux-jammy-cpu-py3_8-gcc11-inductor-build
if: github.event.schedule == '0 7 * * *'
with:
build-environment: linux-jammy-py3.8-gcc11-build
dashboard-tag: training-false-inference-true-default-true-dynamic-true-aotinductor-true
docker-image: ${{ needs.linux-jammy-cpu-py3_8-gcc11-inductor-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-jammy-cpu-py3_8-gcc11-inductor-build.outputs.test-matrix }}
use-gha: anything-non-empty-to-use-gha
timeout-minutes: 720
secrets:
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
linux-jammy-cpu-py3_8-gcc11-inductor-test:
name: linux-jammy-cpu-py3.8-gcc11-inductor
uses: ./.github/workflows/_linux-test.yml
needs: linux-jammy-cpu-py3_8-gcc11-inductor-build
if: github.event_name == 'workflow_dispatch'
with:
build-environment: linux-jammy-py3.8-gcc11-build
dashboard-tag: training-${{ inputs.training }}-inference-${{ inputs.inference }}-default-${{ inputs.default }}-dynamic-${{ inputs.dynamic }}-aotinductor-${{ inputs.aotinductor }}
docker-image: ${{ needs.linux-jammy-cpu-py3_8-gcc11-inductor-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-jammy-cpu-py3_8-gcc11-inductor-build.outputs.test-matrix }}
use-gha: anything-non-empty-to-use-gha
timeout-minutes: 720
secrets:
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}

.github/workflows/inductor-rocm.yml

@ -0,0 +1,47 @@
name: inductor-rocm
on:
schedule:
# We have several schedules so jobs can check github.event.schedule to activate only for a fraction of the runs.
# Also run less frequently on weekends.
- cron: 45 0,4,8,12,16,20 * * 1-5
- cron: 45 4,12 * * 0,6
- cron: 29 8 * * * # about 1:29am PDT, for mem leak check and rerun disabled tests
push:
branches:
# - main
- release/*
tags:
- ciflow/inductor-rocm/*
workflow_dispatch:
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}-${{ github.event_name == 'schedule' }}-${{ github.event.schedule }}
cancel-in-progress: true
permissions: read-all
jobs:
linux-focal-rocm6_1-py3_8-inductor-build:
name: rocm6.1-py3.8-inductor
uses: ./.github/workflows/_linux-build.yml
with:
build-environment: linux-focal-rocm6.1-py3.8
docker-image-name: pytorch-linux-focal-rocm-n-py3
test-matrix: |
{ include: [
{ config: "inductor", shard: 1, num_shards: 2, runner: "linux.rocm.gpu.2" },
{ config: "inductor", shard: 2, num_shards: 2, runner: "linux.rocm.gpu.2" },
]}
linux-focal-rocm6_1-py3_8-inductor-test:
permissions:
id-token: write
contents: read
name: rocm6.1-py3.8-inductor
uses: ./.github/workflows/_rocm-test.yml
needs: linux-focal-rocm6_1-py3_8-inductor-build
with:
build-environment: linux-focal-rocm6.1-py3.8
docker-image: ${{ needs.linux-focal-rocm6_1-py3_8-inductor-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-rocm6_1-py3_8-inductor-build.outputs.test-matrix }}
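The concurrency group above folds the workflow name, PR number or ref, branch SHA, a workflow_dispatch flag, a schedule flag, and the specific cron string into one key, so a re-trigger only cancels runs of the same variant. A rough Python sketch of the key with placeholder context values (GitHub renders the booleans as lowercase strings; the helper only approximates that rendering):

# Approximation of the concurrency group key from the workflow above.
def concurrency_group(ctx):
    pr_or_ref = ctx.get("pull_request_number") or ctx["ref_name"]
    branch_sha = ctx["sha"] if ctx["ref_type"] == "branch" else "false"
    is_dispatch = str(ctx["event_name"] == "workflow_dispatch").lower()
    is_schedule = str(ctx["event_name"] == "schedule").lower()
    return f"{ctx['workflow']}-{pr_or_ref}-{branch_sha}-{is_dispatch}-{is_schedule}-{ctx.get('schedule', '')}"

print(concurrency_group({
    "workflow": "inductor-rocm", "pull_request_number": None, "ref_name": "main",
    "ref_type": "branch", "sha": "abc123", "event_name": "schedule",
    "schedule": "45 0,4,8,12,16,20 * * 1-5",
}))
# inductor-rocm-main-abc123-false-true-45 0,4,8,12,16,20 * * 1-5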


@ -16,30 +16,6 @@ concurrency:
permissions: read-all
jobs:
linux-focal-rocm6_1-py3_8-inductor-build:
name: rocm6.1-py3.8-inductor
uses: ./.github/workflows/_linux-build.yml
with:
build-environment: linux-focal-rocm6.1-py3.8
docker-image-name: pytorch-linux-focal-rocm-n-py3
test-matrix: |
{ include: [
{ config: "inductor", shard: 1, num_shards: 2, runner: "linux.rocm.gpu.2" },
{ config: "inductor", shard: 2, num_shards: 2, runner: "linux.rocm.gpu.2" },
]}
linux-focal-rocm6_1-py3_8-inductor-test:
permissions:
id-token: write
contents: read
name: rocm6.1-py3.8-inductor
uses: ./.github/workflows/_rocm-test.yml
needs: linux-focal-rocm6_1-py3_8-inductor-build
with:
build-environment: linux-focal-rocm6.1-py3.8
docker-image: ${{ needs.linux-focal-rocm6_1-py3_8-inductor-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-rocm6_1-py3_8-inductor-build.outputs.test-matrix }}
linux-focal-cuda12_1-py3_10-gcc9-inductor-build:
name: cuda12.1-py3.10-gcc9-sm86
uses: ./.github/workflows/_linux-build.yml


@ -19,7 +19,7 @@ jobs:
uses: pytorch/test-infra/.github/workflows/linux_job.yml@main
with:
timeout: 120
runner: linux.2xlarge
runner: amz2023.linux.2xlarge
docker-image: pytorch-linux-jammy-cuda11.8-cudnn9-py3.9-linter
# NB: A shallow checkout won't work here because calculate-docker-image requires a full checkout
# to run git rev-parse HEAD~:.ci/docker when a new image is needed
@ -35,7 +35,7 @@ jobs:
uses: pytorch/test-infra/.github/workflows/linux_job.yml@main
with:
timeout: 120
runner: linux.2xlarge
runner: amz2023.linux.2xlarge
docker-image: pytorch-linux-jammy-cuda11.8-cudnn9-py3.9-linter
# NB: A shallow checkout won't work here because calculate-docker-image requires a full checkout
# to run git rev-parse HEAD~:.ci/docker when a new image is needed
@ -49,7 +49,7 @@ jobs:
quick-checks:
uses: pytorch/test-infra/.github/workflows/linux_job.yml@main
with:
runner: linux.2xlarge
runner: amz2023.linux.2xlarge
docker-image: pytorch-linux-focal-linter
fetch-depth: 0
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
@ -83,7 +83,7 @@ jobs:
pr-sanity-checks:
name: pr-sanity-checks
runs-on: [self-hosted, linux.large]
runs-on: [self-hosted, amz2023.linux.large]
# Only run this on pull requests. This check is simple enough to be done without a Docker image
if: github.event_name == 'pull_request' && !contains(github.event.pull_request.labels.*.name, 'skip-pr-sanity-checks')
steps:
@ -103,7 +103,7 @@ jobs:
workflow-checks:
uses: pytorch/test-infra/.github/workflows/linux_job.yml@main
with:
runner: linux.2xlarge
runner: amz2023.linux.2xlarge
docker-image: pytorch-linux-focal-linter
fetch-depth: -1
submodules: true
@ -139,7 +139,7 @@ jobs:
toc:
uses: pytorch/test-infra/.github/workflows/linux_job.yml@main
with:
runner: linux.2xlarge
runner: amz2023.linux.2xlarge
docker-image: pytorch-linux-focal-linter
fetch-depth: 0
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
@ -177,7 +177,7 @@ jobs:
if: ${{ github.repository == 'pytorch/pytorch' }}
uses: pytorch/test-infra/.github/workflows/linux_job.yml@main
with:
runner: linux.2xlarge
runner: amz2023.linux.2xlarge
docker-image: pytorch-linux-focal-linter
fetch-depth: 0
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}


@ -28,7 +28,7 @@ jobs:
test-matrix: |
{ include: [
{ config: "mps", shard: 1, num_shards: 1, runner: "macos-m1-13" },
{ config: "mps", shard: 1, num_shards: 1, runner: "macos-m2-14" },
{ config: "mps", shard: 1, num_shards: 1, runner: "macos-m1-14" },
]}
macos-py3-arm64-mps-test:


@ -277,8 +277,9 @@ jobs:
docker-image-name: pytorch-linux-focal-rocm-n-py3
test-matrix: |
{ include: [
{ config: "distributed", shard: 1, num_shards: 2, runner: "linux.rocm.gpu" },
{ config: "distributed", shard: 2, num_shards: 2, runner: "linux.rocm.gpu" },
{ config: "distributed", shard: 1, num_shards: 3, runner: "linux.rocm.gpu" },
{ config: "distributed", shard: 2, num_shards: 3, runner: "linux.rocm.gpu" },
{ config: "distributed", shard: 3, num_shards: 3, runner: "linux.rocm.gpu" },
]}
linux-focal-rocm6_1-py3_8-test:
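The hunk above widens the ROCm distributed config from two shards to three on the same runner pool; each entry still carries its shard index and the total so the test harness can deterministically take its slice of the test list. A hypothetical illustration of that kind of slicing (the real harness may bucket tests differently, e.g. by recorded timings, so this positional split is only a sketch):

# Hypothetical shard/num_shards slicing: shard i of n takes every n-th test.
def shard_slice(tests, shard, num_shards):
    return [t for i, t in enumerate(sorted(tests)) if i % num_shards == shard - 1]

tests = [f"test_distributed_{i:02d}" for i in range(10)]
for shard in (1, 2, 3):
    print(shard, shard_slice(tests, shard, 3))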


@ -48,20 +48,20 @@ jobs:
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge"
runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.2xlarge"
build-environment: linux-jammy-py3.8-gcc11
docker-image-name: pytorch-linux-jammy-py3.8-gcc11
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "default", shard: 2, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "default", shard: 3, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "default", shard: 4, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "docs_test", shard: 1, num_shards: 1, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "jit_legacy", shard: 1, num_shards: 1, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "backwards_compat", shard: 1, num_shards: 1, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "distributed", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "distributed", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "default", shard: 1, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.2xlarge" },
{ config: "default", shard: 2, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.2xlarge" },
{ config: "default", shard: 3, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.2xlarge" },
{ config: "default", shard: 4, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.2xlarge" },
{ config: "docs_test", shard: 1, num_shards: 1, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.2xlarge" },
{ config: "jit_legacy", shard: 1, num_shards: 1, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.2xlarge" },
{ config: "backwards_compat", shard: 1, num_shards: 1, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.2xlarge" },
{ config: "distributed", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.2xlarge" },
{ config: "distributed", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.2xlarge" },
]}
linux-jammy-py3_8-gcc11-test:
@ -88,7 +88,7 @@ jobs:
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge"
runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.2xlarge"
build-environment: linux-jammy-py3.8-gcc11-no-ops
docker-image-name: pytorch-linux-jammy-py3.8-gcc11
test-matrix: |
@ -101,7 +101,7 @@ jobs:
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge"
runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.2xlarge"
build-environment: linux-jammy-py3.8-gcc11-pch
docker-image-name: pytorch-linux-jammy-py3.8-gcc11
test-matrix: |
@ -115,17 +115,17 @@ jobs:
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge"
runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.2xlarge"
build-environment: linux-jammy-py3.10-clang15-asan
docker-image-name: pytorch-linux-jammy-py3-clang15-asan
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 6, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "default", shard: 2, num_shards: 6, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "default", shard: 3, num_shards: 6, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "default", shard: 4, num_shards: 6, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "default", shard: 5, num_shards: 6, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "default", shard: 6, num_shards: 6, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "default", shard: 1, num_shards: 6, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.4xlarge" },
{ config: "default", shard: 2, num_shards: 6, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.4xlarge" },
{ config: "default", shard: 3, num_shards: 6, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.4xlarge" },
{ config: "default", shard: 4, num_shards: 6, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.4xlarge" },
{ config: "default", shard: 5, num_shards: 6, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.4xlarge" },
{ config: "default", shard: 6, num_shards: 6, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.4xlarge" },
]}
sync-tag: asan-build
@ -147,13 +147,13 @@ jobs:
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge"
runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.2xlarge"
build-environment: linux-focal-py3.8-clang10-onnx
docker-image-name: pytorch-linux-focal-py3-clang10-onnx
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "default", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "default", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.2xlarge" },
{ config: "default", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.2xlarge" },
]}
linux-focal-py3_8-clang10-onnx-test:
@ -172,20 +172,20 @@ jobs:
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge"
runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.2xlarge"
build-environment: linux-focal-py3.8-clang10
docker-image-name: pytorch-linux-focal-py3.8-clang10
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "default", shard: 2, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "default", shard: 3, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "default", shard: 4, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "crossref", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "crossref", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "dynamo", shard: 1, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "dynamo", shard: 2, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "dynamo", shard: 3, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "default", shard: 1, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.2xlarge" },
{ config: "default", shard: 2, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.2xlarge" },
{ config: "default", shard: 3, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.2xlarge" },
{ config: "default", shard: 4, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.2xlarge" },
{ config: "crossref", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.2xlarge" },
{ config: "crossref", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.2xlarge" },
{ config: "dynamo", shard: 1, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.2xlarge" },
{ config: "dynamo", shard: 2, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.2xlarge" },
{ config: "dynamo", shard: 3, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.2xlarge" },
]}
linux-focal-py3_8-clang10-test:
name: linux-focal-py3.8-clang10
@ -203,20 +203,20 @@ jobs:
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge"
runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.2xlarge"
build-environment: linux-focal-py3.11-clang10
docker-image-name: pytorch-linux-focal-py3.11-clang10
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "default", shard: 2, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "default", shard: 3, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "default", shard: 4, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "crossref", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "crossref", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "dynamo", shard: 1, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "dynamo", shard: 2, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "dynamo", shard: 3, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "default", shard: 1, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.2xlarge" },
{ config: "default", shard: 2, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.2xlarge" },
{ config: "default", shard: 3, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.2xlarge" },
{ config: "default", shard: 4, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.2xlarge" },
{ config: "crossref", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.2xlarge" },
{ config: "crossref", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.2xlarge" },
{ config: "dynamo", shard: 1, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.2xlarge" },
{ config: "dynamo", shard: 2, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.2xlarge" },
{ config: "dynamo", shard: 3, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.2xlarge" },
]}
linux-focal-py3_11-clang10-test:
@ -235,18 +235,18 @@ jobs:
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge"
runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.2xlarge"
build-environment: linux-focal-py3.12-clang10
docker-image-name: pytorch-linux-focal-py3.12-clang10
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "default", shard: 2, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "default", shard: 3, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "default", shard: 4, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "dynamo", shard: 1, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "dynamo", shard: 2, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "dynamo", shard: 3, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "default", shard: 1, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.2xlarge" },
{ config: "default", shard: 2, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.2xlarge" },
{ config: "default", shard: 3, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.2xlarge" },
{ config: "default", shard: 4, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.2xlarge" },
{ config: "dynamo", shard: 1, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.2xlarge" },
{ config: "dynamo", shard: 2, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.2xlarge" },
{ config: "dynamo", shard: 3, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.2xlarge" },
]}
linux-focal-py3_12-clang10-test:
@ -264,7 +264,7 @@ jobs:
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge"
runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.2xlarge"
build-environment: linux-focal-cuda11.8-py3.10-gcc9
docker-image-name: pytorch-linux-focal-cuda11.8-cudnn9-py3-gcc9
test-matrix: |
@ -291,16 +291,16 @@ jobs:
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge"
runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.2xlarge"
build-environment: linux-focal-cuda12.1-py3.10-gcc9
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 2, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 3, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 4, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 5, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 1, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 2, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 3, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 4, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 5, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.4xlarge.nvidia.gpu" },
]}
linux-focal-cuda12_1-py3_10-gcc9-test:
@ -320,7 +320,7 @@ jobs:
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge"
runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.2xlarge"
build-environment: linux-jammy-py3-clang12-mobile-build
docker-image-name: pytorch-linux-jammy-py3-clang15-asan
build-generates-artifacts: false
@ -334,7 +334,7 @@ jobs:
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge"
runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.2xlarge"
build-environment: linux-jammy-cuda11.8-cudnn9-py3.8-clang12
docker-image-name: pytorch-linux-jammy-cuda11.8-cudnn9-py3.8-clang12
test-matrix: |
@ -347,7 +347,7 @@ jobs:
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge"
runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.2xlarge"
build-environment: linux-focal-py3-clang9-mobile-custom-build-static
docker-image-name: pytorch-linux-focal-py3-clang9-android-ndk-r21e
build-generates-artifacts: false
@ -361,12 +361,12 @@ jobs:
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge"
runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.2xlarge"
build-environment: linux-focal-py3.8-clang9-xla
docker-image-name: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/xla_base:v1.1-lite
test-matrix: |
{ include: [
{ config: "xla", shard: 1, num_shards: 1, runner: "${{ needs.get-label-type.outputs.label-type }}linux.12xlarge" },
{ config: "xla", shard: 1, num_shards: 1, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.12xlarge" },
]}
linux-focal-py3_8-clang9-xla-test:
@ -401,13 +401,13 @@ jobs:
uses: ./.github/workflows/_bazel-build-test.yml
needs: get-label-type
with:
runner: "${{ needs.get-label-type.outputs.label-type }}linux.large"
runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.large"
build-environment: linux-focal-cuda12.1-py3.10-gcc9-bazel-test
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9
cuda-version: cpu
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 1, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "default", shard: 1, num_shards: 1, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.4xlarge" },
]}
linux-focal-cuda12_1-py3_10-gcc9-bazel-test:
@ -415,13 +415,13 @@ jobs:
uses: ./.github/workflows/_bazel-build-test.yml
needs: get-label-type
with:
runner: "${{ needs.get-label-type.outputs.label-type }}linux.large"
runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.large"
build-environment: linux-focal-cuda12.1-py3.10-gcc9-bazel-test
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9
cuda-version: "12.1"
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 1, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 1, num_shards: 1, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.4xlarge.nvidia.gpu" },
]}
linux-focal-cuda12_4-py3_10-gcc9-bazel-test:
@ -429,13 +429,13 @@ jobs:
uses: ./.github/workflows/_bazel-build-test.yml
needs: get-label-type
with:
runner: "${{ needs.get-label-type.outputs.label-type }}linux.large"
runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.large"
build-environment: linux-focal-cuda12.4-py3.10-gcc9-bazel-test
docker-image-name: pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9
cuda-version: "12.4"
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 1, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 1, num_shards: 1, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.4xlarge.nvidia.gpu" },
]}
linux-focal-py3-clang9-android-ndk-r21e-gradle-custom-build-single:
@ -465,7 +465,7 @@ jobs:
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge"
runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.2xlarge"
build-environment: linux-jammy-py3.8-gcc111-mobile-lightweight-dispatch-build
docker-image-name: pytorch-linux-jammy-py3.8-gcc11
build-generates-artifacts: false
@ -481,7 +481,7 @@ jobs:
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge"
runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.2xlarge"
build-environment: linux-focal-rocm6.1-py3.8
docker-image-name: pytorch-linux-focal-rocm-n-py3
sync-tag: rocm-build
@ -497,17 +497,17 @@ jobs:
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge"
runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.2xlarge"
build-environment: linux-focal-cuda12.1-py3.10-gcc9-sm86
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9
cuda-arch-list: 8.6
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 2, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 3, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 4, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 5, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 1, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 2, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 3, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 4, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 5, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.g5.4xlarge.nvidia.gpu" },
]}
linux-focal-cuda12_1-py3_10-gcc9-sm86-test:
@ -526,12 +526,12 @@ jobs:
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge"
runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.2xlarge"
build-environment: linux-jammy-py3-clang12-executorch
docker-image-name: pytorch-linux-jammy-py3-clang12-executorch
test-matrix: |
{ include: [
{ config: "executorch", shard: 1, num_shards: 1, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "executorch", shard: 1, num_shards: 1, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.2xlarge" },
]}
linux-jammy-py3-clang12-executorch-test:
@ -548,17 +548,17 @@ jobs:
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge"
runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.2xlarge"
use_split_build: true
build-environment: linux-focal-cuda12.1-py3.10-gcc9
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 2, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 3, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 4, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 5, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 1, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 2, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 3, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 4, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 5, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.4xlarge.nvidia.gpu" },
]}
linux-focal-cuda12_1-py3_10-gcc9-experimental-split-build-test:
@ -576,18 +576,20 @@ jobs:
linux-focal-py3_12-clang10-experimental-split-build:
name: linux-focal-py3.12-clang10-experimental-split-build
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.2xlarge"
use_split_build: True
build-environment: linux-focal-py3.12-clang10
docker-image-name: pytorch-linux-focal-py3.12-clang10
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 3, runner: "linux.2xlarge" },
{ config: "default", shard: 2, num_shards: 3, runner: "linux.2xlarge" },
{ config: "default", shard: 3, num_shards: 3, runner: "linux.2xlarge" },
{ config: "dynamo", shard: 1, num_shards: 3, runner: "linux.2xlarge" },
{ config: "dynamo", shard: 2, num_shards: 3, runner: "linux.2xlarge" },
{ config: "dynamo", shard: 3, num_shards: 3, runner: "linux.2xlarge" },
{ config: "default", shard: 1, num_shards: 3, runner: "amz2023.linux.2xlarge" },
{ config: "default", shard: 2, num_shards: 3, runner: "amz2023.linux.2xlarge" },
{ config: "default", shard: 3, num_shards: 3, runner: "amz2023.linux.2xlarge" },
{ config: "dynamo", shard: 1, num_shards: 3, runner: "amz2023.linux.2xlarge" },
{ config: "dynamo", shard: 2, num_shards: 3, runner: "amz2023.linux.2xlarge" },
{ config: "dynamo", shard: 3, num_shards: 3, runner: "amz2023.linux.2xlarge" },
]}
linux-focal-py3_12-clang10-experimental-split-build-test:
name: linux-focal-py3.12-clang10-experimental-split-build
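The recurring change in this file swaps plain linux.* runner labels for "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.*": an optional fleet prefix emitted by the get-label-type job, an "amz2023." AMI prefix, and the original instance size. A small sketch of that composition; the concrete label-type values are not visible in this diff, so the examples below are assumptions:

# Illustrative composition of the runner labels used throughout pull.yml.
# label_type is whatever get-label-type outputs (assumed "" or a fleet prefix).
def runner_label(label_type, size):
    return f"{label_type}amz2023.{size}"

print(runner_label("", "linux.2xlarge"))                   # amz2023.linux.2xlarge
print(runner_label("lf.", "linux.g5.4xlarge.nvidia.gpu"))  # assumed prefix: lf.amz2023.linux.g5.4xlarge.nvidia.gpu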


@ -3,12 +3,18 @@ name: rocm
on:
push:
branches:
- main
# - main
- release/*
tags:
- ciflow/rocm/*
workflow_dispatch:
schedule:
# We have several schedules so jobs can check github.event.schedule to activate only for a fraction of the runs.
# Also run less frequently on weekends.
- cron: 45 0,8,16 * * 1-5
- cron: 45 4 * * 0,6
- cron: 45 4,12,20 * * 1-5
- cron: 45 12 * * 0,6
- cron: 29 8 * * * # about 1:29am PDT
concurrency:


@ -152,14 +152,14 @@ jobs:
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge"
runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.2xlarge"
build-environment: linux-jammy-py3.10-clang15-asan
docker-image-name: pytorch-linux-jammy-py3-clang15-asan
test-matrix: |
{ include: [
{ config: "slow", shard: 1, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "slow", shard: 2, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "slow", shard: 3, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "slow", shard: 1, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.4xlarge" },
{ config: "slow", shard: 2, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.4xlarge" },
{ config: "slow", shard: 3, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.4xlarge" },
]}
sync-tag: asan-build


@ -46,17 +46,19 @@ jobs:
linux-focal-cuda12_4-py3_10-gcc9-sm86-build:
name: linux-focal-cuda12.4-py3.10-gcc9-sm86
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.2xlarge"
build-environment: linux-focal-cuda12.4-py3.10-gcc9-sm86
docker-image-name: pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9
cuda-arch-list: 8.6
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 5, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 2, num_shards: 5, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 3, num_shards: 5, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 4, num_shards: 5, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 5, num_shards: 5, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 1, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 2, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 3, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 4, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 5, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.g5.4xlarge.nvidia.gpu" },
]}
linux-focal-cuda12_4-py3_10-gcc9-sm86-test:
@ -73,11 +75,12 @@ jobs:
libtorch-linux-focal-cuda12_1-py3_7-gcc9-debug-build:
name: libtorch-linux-focal-cuda12.1-py3.7-gcc9-debug
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
build-environment: libtorch-linux-focal-cuda12.1-py3.7-gcc9
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9
build-generates-artifacts: false
runner: linux.4xlarge
runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.4xlarge"
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 1 },
@ -87,7 +90,9 @@ jobs:
linux-focal-cuda12_1-py3_10-gcc9-no-ops-build:
name: linux-focal-cuda12.1-py3.10-gcc9-no-ops
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.2xlarge"
build-environment: linux-focal-cuda12.1-py3.10-gcc9-no-ops
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9
test-matrix: |
@ -98,11 +103,12 @@ jobs:
libtorch-linux-focal-cuda12_4-py3_7-gcc9-debug-build:
name: libtorch-linux-focal-cuda12.4-py3.7-gcc9-debug
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
build-environment: libtorch-linux-focal-cuda12.4-py3.7-gcc9
docker-image-name: pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9
build-generates-artifacts: false
runner: linux.4xlarge
runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.4xlarge"
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 1 },
@ -112,7 +118,9 @@ jobs:
linux-focal-cuda12_4-py3_10-gcc9-no-ops-build:
name: linux-focal-cuda12.4-py3.10-gcc9-no-ops
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.2xlarge"
build-environment: linux-focal-cuda12.4-py3.10-gcc9-no-ops
docker-image-name: pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9
test-matrix: |
@ -128,7 +136,7 @@ jobs:
docker-image-name: pytorch-linux-focal-py3-clang9-android-ndk-r21e
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 1, runner: "linux.2xlarge" },
{ config: "default", shard: 1, num_shards: 1, runner: "amz2023.linux.2xlarge" },
]}
macos-py3-arm64-build:
@ -217,7 +225,7 @@ jobs:
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge"
runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.2xlarge"
build-environment: linux-focal-rocm6.1-py3.8
docker-image-name: pytorch-linux-focal-rocm-n-py3
sync-tag: rocm-build
@ -246,20 +254,22 @@ jobs:
linux-focal-cuda12_4-py3_10-gcc9-experimental-split-build:
name: linux-focal-cuda12.4-py3.10-gcc9-experimental-split-build
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.2xlarge"
use_split_build: true
build-environment: linux-focal-cuda12.4-py3.10-gcc9
docker-image-name: pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9
test-matrix: |
{ include: [
{ config: "nogpu_AVX512", shard: 1, num_shards: 1, runner: "linux.2xlarge" },
{ config: "nogpu_NO_AVX2", shard: 1, num_shards: 1, runner: "linux.2xlarge" },
{ config: "jit_legacy", shard: 1, num_shards: 1, runner: "linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 1, num_shards: 5, runner: "linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 2, num_shards: 5, runner: "linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 3, num_shards: 5, runner: "linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 4, num_shards: 5, runner: "linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 5, num_shards: 5, runner: "linux.4xlarge.nvidia.gpu" },
{ config: "nogpu_AVX512", shard: 1, num_shards: 1, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.2xlarge" },
{ config: "nogpu_NO_AVX2", shard: 1, num_shards: 1, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.2xlarge" },
{ config: "jit_legacy", shard: 1, num_shards: 1, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 1, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 2, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 3, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 4, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 5, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.4xlarge.nvidia.gpu" },
]}
linux-focal-cuda12_4-py3_10-gcc9-experimental-split-build-test:
@ -276,15 +286,17 @@ jobs:
linux-focal-cuda11_8-py3_10-gcc9-experimental-split-build:
name: linux-focal-cuda11.8-py3.10-gcc9-experimental-split-build
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.2xlarge"
use_split_build: true
build-environment: linux-focal-cuda11.8-py3.10-gcc9
docker-image-name: pytorch-linux-focal-cuda11.8-cudnn9-py3-gcc9
test-matrix: |
{ include: [
{ config: "distributed", shard: 1, num_shards: 3, runner: "linux.8xlarge.nvidia.gpu" },
{ config: "distributed", shard: 2, num_shards: 3, runner: "linux.8xlarge.nvidia.gpu" },
{ config: "distributed", shard: 3, num_shards: 3, runner: "linux.8xlarge.nvidia.gpu" },
{ config: "distributed", shard: 1, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.8xlarge.nvidia.gpu" },
{ config: "distributed", shard: 2, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.8xlarge.nvidia.gpu" },
{ config: "distributed", shard: 3, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.8xlarge.nvidia.gpu" },
]}
linux-focal-cuda11_8-py3_10-gcc9-experimental-split-build-test:


@ -2,7 +2,7 @@ name: Upload torch dynamo performance stats
on:
workflow_run:
workflows: [inductor-A100-perf-nightly]
workflows: [inductor-A100-perf-nightly, inductor-perf-nightly-aarch64, inductor-perf-nightly-x86]
types:
- completed


@ -17,7 +17,7 @@ jobs:
with:
build-environment: linux-jammy-xpu-py3.8
docker-image-name: pytorch-linux-jammy-xpu-2024.0-py3
runner: linux.2xlarge
runner: linux.12xlarge
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 4, runner: "linux.idc.xpu" },


@ -417,6 +417,7 @@ exclude_patterns = [
'aten/src/ATen/native/vulkan/api/vk_mem_alloc.h',
'test/cpp/jit/upgrader_models/*.ptl',
'test/cpp/jit/upgrader_models/*.ptl.ff',
'.ci/docker/common/install_rocm_drm.sh',
'.lintrunner.toml',
]
command = [
@ -1191,196 +1192,6 @@ exclude_patterns = [
'torch/_export/trace.py',
'torch/_export/verifier.py',
'torch/_vendor/**',
'torch/ao/__init__.py',
'torch/ao/nn/__init__.py',
'torch/ao/nn/intrinsic/__init__.py',
'torch/ao/nn/intrinsic/modules/__init__.py',
'torch/ao/nn/intrinsic/modules/fused.py',
'torch/ao/nn/intrinsic/qat/__init__.py',
'torch/ao/nn/intrinsic/qat/modules/__init__.py',
'torch/ao/nn/intrinsic/qat/modules/conv_fused.py',
'torch/ao/nn/intrinsic/qat/modules/linear_fused.py',
'torch/ao/nn/intrinsic/qat/modules/linear_relu.py',
'torch/ao/nn/intrinsic/quantized/__init__.py',
'torch/ao/nn/intrinsic/quantized/dynamic/__init__.py',
'torch/ao/nn/intrinsic/quantized/dynamic/modules/__init__.py',
'torch/ao/nn/intrinsic/quantized/dynamic/modules/linear_relu.py',
'torch/ao/nn/intrinsic/quantized/modules/__init__.py',
'torch/ao/nn/intrinsic/quantized/modules/bn_relu.py',
'torch/ao/nn/intrinsic/quantized/modules/conv_add.py',
'torch/ao/nn/intrinsic/quantized/modules/conv_relu.py',
'torch/ao/nn/intrinsic/quantized/modules/linear_relu.py',
'torch/ao/nn/qat/__init__.py',
'torch/ao/nn/qat/dynamic/__init__.py',
'torch/ao/nn/qat/dynamic/modules/__init__.py',
'torch/ao/nn/qat/dynamic/modules/linear.py',
'torch/ao/nn/qat/modules/__init__.py',
'torch/ao/nn/qat/modules/conv.py',
'torch/ao/nn/qat/modules/embedding_ops.py',
'torch/ao/nn/qat/modules/linear.py',
'torch/ao/nn/quantizable/__init__.py',
'torch/ao/nn/quantizable/modules/__init__.py',
'torch/ao/nn/quantizable/modules/activation.py',
'torch/ao/nn/quantizable/modules/rnn.py',
'torch/ao/nn/quantized/__init__.py',
'torch/ao/nn/quantized/dynamic/__init__.py',
'torch/ao/nn/quantized/dynamic/modules/__init__.py',
'torch/ao/nn/quantized/dynamic/modules/conv.py',
'torch/ao/nn/quantized/dynamic/modules/linear.py',
'torch/ao/nn/quantized/dynamic/modules/rnn.py',
'torch/ao/nn/quantized/functional.py',
'torch/ao/nn/quantized/modules/__init__.py',
'torch/ao/nn/quantized/modules/activation.py',
'torch/ao/nn/quantized/modules/batchnorm.py',
'torch/ao/nn/quantized/modules/conv.py',
'torch/ao/nn/quantized/modules/dropout.py',
'torch/ao/nn/quantized/modules/embedding_ops.py',
'torch/ao/nn/quantized/modules/functional_modules.py',
'torch/ao/nn/quantized/modules/linear.py',
'torch/ao/nn/quantized/modules/normalization.py',
'torch/ao/nn/quantized/modules/rnn.py',
'torch/ao/nn/quantized/modules/utils.py',
'torch/ao/nn/quantized/reference/__init__.py',
'torch/ao/nn/quantized/reference/modules/__init__.py',
'torch/ao/nn/quantized/reference/modules/conv.py',
'torch/ao/nn/quantized/reference/modules/linear.py',
'torch/ao/nn/quantized/reference/modules/rnn.py',
'torch/ao/nn/quantized/reference/modules/sparse.py',
'torch/ao/nn/quantized/reference/modules/utils.py',
'torch/ao/nn/sparse/__init__.py',
'torch/ao/nn/sparse/quantized/__init__.py',
'torch/ao/nn/sparse/quantized/dynamic/__init__.py',
'torch/ao/nn/sparse/quantized/dynamic/linear.py',
'torch/ao/nn/sparse/quantized/linear.py',
'torch/ao/nn/sparse/quantized/utils.py',
'torch/ao/ns/__init__.py',
'torch/ao/ns/_numeric_suite.py',
'torch/ao/ns/_numeric_suite_fx.py',
'torch/ao/ns/fx/__init__.py',
'torch/ao/ns/fx/graph_matcher.py',
'torch/ao/ns/fx/graph_passes.py',
'torch/ao/ns/fx/mappings.py',
'torch/ao/ns/fx/n_shadows_utils.py',
'torch/ao/ns/fx/ns_types.py',
'torch/ao/ns/fx/pattern_utils.py',
'torch/ao/ns/fx/qconfig_multi_mapping.py',
'torch/ao/ns/fx/utils.py',
'torch/ao/ns/fx/weight_utils.py',
'torch/ao/pruning/__init__.py',
'torch/ao/pruning/_experimental/__init__.py',
'torch/ao/pruning/_experimental/activation_sparsifier/__init__.py',
'torch/ao/pruning/_experimental/activation_sparsifier/activation_sparsifier.py',
'torch/ao/pruning/_experimental/data_scheduler/__init__.py',
'torch/ao/pruning/_experimental/data_scheduler/base_data_scheduler.py',
'torch/ao/pruning/_experimental/data_sparsifier/__init__.py',
'torch/ao/pruning/_experimental/data_sparsifier/base_data_sparsifier.py',
'torch/ao/pruning/_experimental/data_sparsifier/benchmarks/dlrm_utils.py',
'torch/ao/pruning/_experimental/data_sparsifier/benchmarks/evaluate_disk_savings.py',
'torch/ao/pruning/_experimental/data_sparsifier/benchmarks/evaluate_forward_time.py',
'torch/ao/pruning/_experimental/data_sparsifier/benchmarks/evaluate_model_metrics.py',
'torch/ao/pruning/_experimental/data_sparsifier/data_norm_sparsifier.py',
'torch/ao/pruning/_experimental/data_sparsifier/lightning/__init__.py',
'torch/ao/pruning/_experimental/data_sparsifier/lightning/callbacks/__init__.py',
'torch/ao/pruning/_experimental/data_sparsifier/lightning/callbacks/_data_sparstity_utils.py',
'torch/ao/pruning/_experimental/data_sparsifier/lightning/callbacks/data_sparsity.py',
'torch/ao/pruning/_experimental/data_sparsifier/lightning/tests/test_callbacks.py',
'torch/ao/pruning/_experimental/data_sparsifier/quantization_utils.py',
'torch/ao/pruning/_experimental/pruner/__init__.py',
'torch/ao/pruning/_experimental/pruner/base_structured_sparsifier.py',
'torch/ao/pruning/_experimental/pruner/lstm_saliency_pruner.py',
'torch/ao/pruning/_experimental/pruner/match_utils.py',
'torch/ao/pruning/_experimental/pruner/parametrization.py',
'torch/ao/pruning/_experimental/pruner/prune_functions.py',
'torch/ao/pruning/_experimental/pruner/saliency_pruner.py',
'torch/ao/pruning/_mappings.py',
'torch/ao/pruning/scheduler/__init__.py',
'torch/ao/pruning/scheduler/base_scheduler.py',
'torch/ao/pruning/scheduler/cubic_scheduler.py',
'torch/ao/pruning/scheduler/lambda_scheduler.py',
'torch/ao/pruning/sparsifier/__init__.py',
'torch/ao/pruning/sparsifier/base_sparsifier.py',
'torch/ao/pruning/sparsifier/nearly_diagonal_sparsifier.py',
'torch/ao/pruning/sparsifier/utils.py',
'torch/ao/pruning/sparsifier/weight_norm_sparsifier.py',
'torch/ao/quantization/__init__.py',
'torch/ao/quantization/_correct_bias.py',
'torch/ao/quantization/_equalize.py',
'torch/ao/quantization/_learnable_fake_quantize.py',
'torch/ao/quantization/backend_config/__init__.py',
'torch/ao/quantization/backend_config/_common_operator_config_utils.py',
'torch/ao/quantization/backend_config/_qnnpack_pt2e.py',
'torch/ao/quantization/backend_config/_x86_inductor_pt2e.py',
'torch/ao/quantization/backend_config/backend_config.py',
'torch/ao/quantization/backend_config/executorch.py',
'torch/ao/quantization/backend_config/fbgemm.py',
'torch/ao/quantization/backend_config/native.py',
'torch/ao/quantization/backend_config/observation_type.py',
'torch/ao/quantization/backend_config/onednn.py',
'torch/ao/quantization/backend_config/qnnpack.py',
'torch/ao/quantization/backend_config/tensorrt.py',
'torch/ao/quantization/backend_config/utils.py',
'torch/ao/quantization/backend_config/x86.py',
'torch/ao/quantization/experimental/APoT_tensor.py',
'torch/ao/quantization/experimental/apot_utils.py',
'torch/ao/quantization/experimental/fake_quantize.py',
'torch/ao/quantization/experimental/fake_quantize_function.py',
'torch/ao/quantization/experimental/linear.py',
'torch/ao/quantization/experimental/observer.py',
'torch/ao/quantization/experimental/qconfig.py',
'torch/ao/quantization/experimental/quantizer.py',
'torch/ao/quantization/fake_quantize.py',
'torch/ao/quantization/fuse_modules.py',
'torch/ao/quantization/fuser_method_mappings.py',
'torch/ao/quantization/fx/__init__.py',
'torch/ao/quantization/fx/_decomposed.py',
'torch/ao/quantization/fx/_equalize.py',
'torch/ao/quantization/fx/_lower_to_native_backend.py',
'torch/ao/quantization/fx/_model_report/__init__.py',
'torch/ao/quantization/fx/_model_report/detector.py',
'torch/ao/quantization/fx/_model_report/model_report.py',
'torch/ao/quantization/fx/_model_report/model_report_observer.py',
'torch/ao/quantization/fx/_model_report/model_report_visualizer.py',
'torch/ao/quantization/fx/convert.py',
'torch/ao/quantization/fx/custom_config.py',
'torch/ao/quantization/fx/fuse.py',
'torch/ao/quantization/fx/fuse_handler.py',
'torch/ao/quantization/fx/graph_module.py',
'torch/ao/quantization/fx/lower_to_fbgemm.py',
'torch/ao/quantization/fx/lower_to_qnnpack.py',
'torch/ao/quantization/fx/lstm_utils.py',
'torch/ao/quantization/fx/match_utils.py',
'torch/ao/quantization/fx/pattern_utils.py',
'torch/ao/quantization/fx/prepare.py',
'torch/ao/quantization/fx/qconfig_mapping_utils.py',
'torch/ao/quantization/fx/quantize_handler.py',
'torch/ao/quantization/fx/tracer.py',
'torch/ao/quantization/fx/utils.py',
'torch/ao/quantization/observer.py',
'torch/ao/quantization/pt2e/__init__.py',
'torch/ao/quantization/pt2e/_propagate_annotation.py',
'torch/ao/quantization/pt2e/graph_utils.py',
'torch/ao/quantization/pt2e/prepare.py',
'torch/ao/quantization/pt2e/qat_utils.py',
'torch/ao/quantization/pt2e/quantizer/__init__.py',
'torch/ao/quantization/pt2e/quantizer/composable_quantizer.py',
'torch/ao/quantization/pt2e/quantizer/embedding_quantizer.py',
'torch/ao/quantization/pt2e/quantizer/qnnpack_quantizer.py',
'torch/ao/quantization/pt2e/quantizer/quantizer.py',
'torch/ao/quantization/pt2e/quantizer/utils.py',
'torch/ao/quantization/pt2e/quantizer/x86_inductor_quantizer.py',
'torch/ao/quantization/pt2e/representation/__init__.py',
'torch/ao/quantization/pt2e/representation/rewrite.py',
'torch/ao/quantization/pt2e/utils.py',
'torch/ao/quantization/qconfig.py',
'torch/ao/quantization/qconfig_mapping.py',
'torch/ao/quantization/quant_type.py',
'torch/ao/quantization/quantization_mappings.py',
'torch/ao/quantization/quantize.py',
'torch/ao/quantization/quantize_fx.py',
'torch/ao/quantization/quantize_jit.py',
'torch/ao/quantization/quantize_pt2e.py',
'torch/ao/quantization/stubs.py',
'torch/ao/quantization/utils.py',
'torch/compiler/__init__.py',
'torch/contrib/__init__.py',
'torch/contrib/_tensorboard_vis.py',
@ -1476,59 +1287,6 @@ exclude_patterns = [
'torch/linalg/__init__.py',
'torch/monitor/__init__.py',
'torch/nested/__init__.py',
'torch/nn/intrinsic/__init__.py',
'torch/nn/intrinsic/modules/__init__.py',
'torch/nn/intrinsic/modules/fused.py',
'torch/nn/intrinsic/qat/__init__.py',
'torch/nn/intrinsic/qat/modules/__init__.py',
'torch/nn/intrinsic/qat/modules/conv_fused.py',
'torch/nn/intrinsic/qat/modules/linear_fused.py',
'torch/nn/intrinsic/qat/modules/linear_relu.py',
'torch/nn/intrinsic/quantized/__init__.py',
'torch/nn/intrinsic/quantized/dynamic/__init__.py',
'torch/nn/intrinsic/quantized/dynamic/modules/__init__.py',
'torch/nn/intrinsic/quantized/dynamic/modules/linear_relu.py',
'torch/nn/intrinsic/quantized/modules/__init__.py',
'torch/nn/intrinsic/quantized/modules/bn_relu.py',
'torch/nn/intrinsic/quantized/modules/conv_relu.py',
'torch/nn/intrinsic/quantized/modules/linear_relu.py',
'torch/nn/qat/__init__.py',
'torch/nn/qat/dynamic/__init__.py',
'torch/nn/qat/dynamic/modules/__init__.py',
'torch/nn/qat/dynamic/modules/linear.py',
'torch/nn/qat/modules/__init__.py',
'torch/nn/qat/modules/conv.py',
'torch/nn/qat/modules/embedding_ops.py',
'torch/nn/qat/modules/linear.py',
'torch/nn/quantizable/__init__.py',
'torch/nn/quantizable/modules/__init__.py',
'torch/nn/quantizable/modules/activation.py',
'torch/nn/quantizable/modules/rnn.py',
'torch/nn/quantized/__init__.py',
'torch/nn/quantized/_reference/__init__.py',
'torch/nn/quantized/_reference/modules/__init__.py',
'torch/nn/quantized/_reference/modules/conv.py',
'torch/nn/quantized/_reference/modules/linear.py',
'torch/nn/quantized/_reference/modules/rnn.py',
'torch/nn/quantized/_reference/modules/sparse.py',
'torch/nn/quantized/_reference/modules/utils.py',
'torch/nn/quantized/dynamic/__init__.py',
'torch/nn/quantized/dynamic/modules/__init__.py',
'torch/nn/quantized/dynamic/modules/conv.py',
'torch/nn/quantized/dynamic/modules/linear.py',
'torch/nn/quantized/dynamic/modules/rnn.py',
'torch/nn/quantized/functional.py',
'torch/nn/quantized/modules/__init__.py',
'torch/nn/quantized/modules/activation.py',
'torch/nn/quantized/modules/batchnorm.py',
'torch/nn/quantized/modules/conv.py',
'torch/nn/quantized/modules/dropout.py',
'torch/nn/quantized/modules/embedding_ops.py',
'torch/nn/quantized/modules/functional_modules.py',
'torch/nn/quantized/modules/linear.py',
'torch/nn/quantized/modules/normalization.py',
'torch/nn/quantized/modules/rnn.py',
'torch/nn/quantized/modules/utils.py',
'torch/signal/__init__.py',
'torch/signal/windows/__init__.py',
'torch/signal/windows/windows.py',


@ -1041,10 +1041,6 @@ if(NOT MSVC)
append_cxx_flag_if_supported("-Wno-error=pedantic" CMAKE_CXX_FLAGS)
append_cxx_flag_if_supported("-Wno-error=old-style-cast" CMAKE_CXX_FLAGS)
append_cxx_flag_if_supported("-Wno-error=inconsistent-missing-override"
CMAKE_CXX_FLAGS)
append_cxx_flag_if_supported(
"-Wno-error=inconsistent-missing-destructor-override" CMAKE_CXX_FLAGS)
append_cxx_flag_if_supported("-Wconstant-conversion" CMAKE_CXX_FLAGS)
append_cxx_flag_if_supported("-Wno-invalid-partial-specialization"
CMAKE_CXX_FLAGS)


@ -11,6 +11,7 @@ aspects of contributing to PyTorch.
<!-- toc -->
- [Developing PyTorch](#developing-pytorch)
- [Setup the development environment](#setup-the-development-environment)
- [Tips and Debugging](#tips-and-debugging)
- [Nightly Checkout & Pull](#nightly-checkout--pull)
- [Codebase structure](#codebase-structure)
@ -64,8 +65,24 @@ aspects of contributing to PyTorch.
<!-- tocstop -->
## Developing PyTorch
Follow the instructions for [installing PyTorch from source](https://github.com/pytorch/pytorch#from-source). If you get stuck when developing PyTorch on your machine, check out the [tips and debugging](#tips-and-debugging) section below for common solutions.
### Setup the development environment
First, you need to [fork the PyTorch project on GitHub](https://github.com/pytorch/pytorch/fork) and follow the instructions at [Connecting to GitHub with SSH](https://docs.github.com/en/authentication/connecting-to-github-with-ssh) to set up your SSH authentication credentials.
Then clone the PyTorch project and set up the development environment:
```bash
git clone git@github.com:<USERNAME>/pytorch.git
cd pytorch
git remote add upstream git@github.com:pytorch/pytorch.git
make setup-env # or make setup-env-cuda for pre-built CUDA binaries
conda activate pytorch-deps
```
### Tips and Debugging
* If you want to have no-op incremental rebuilds (which are fast), see [Make no-op build fast](#make-no-op-build-fast) below.
@ -175,6 +192,13 @@ the regular environment parameters (`--name` or `--prefix`):
conda activate my-env
```
To install the nightly binaries built with CUDA, you can pass in the flag `--cuda`:
```bash
./tools/nightly.py checkout -b my-nightly-branch --cuda
conda activate pytorch-deps
```
You can also use this tool to pull the nightly commits into the current branch:
```bash
@ -325,7 +349,7 @@ command runs tests such as `TestNN.test_BCELoss` and
Install all prerequisites by running
```bash
make setup_lint
make setup-lint
```
You can now run the same linting steps that are used in CI locally via `make`:


@ -1,6 +1,7 @@
# This makefile does nothing but delegating the actual building to cmake.
PYTHON = python3
PIP = pip3
PIP = $(PYTHON) -m pip
NIGHTLY_TOOL_OPTS := pull
all:
@mkdir -p build && cd build && cmake .. $(shell $(PYTHON) ./scripts/get_python_cmake_flags.py) && $(MAKE)
@ -22,10 +23,27 @@ linecount:
echo "Cloc is not available on the machine. You can install cloc with " && \
echo " sudo apt-get install cloc"
setup_lint:
ensure-branch-clean:
@if [ -n "$(shell git status --porcelain)" ]; then \
echo "Please commit or stash all changes before running this script"; \
exit 1; \
fi
setup-env: ensure-branch-clean
$(PYTHON) tools/nightly.py $(NIGHTLY_TOOL_OPTS)
setup-env-cuda:
$(MAKE) setup-env PYTHON="$(PYTHON)" NIGHTLY_TOOL_OPTS="$(NIGHTLY_TOOL_OPTS) --cuda"
setup_env: setup-env
setup_env_cuda: setup-env-cuda
setup-lint:
$(PIP) install lintrunner
lintrunner init
setup_lint: setup-lint
lint:
lintrunner


@ -98,7 +98,7 @@ void CPUGeneratorImpl::set_current_seed(uint64_t seed) {
* Sets the offset of RNG state.
* See Note [Acquire lock when using random generators]
*/
void CPUGeneratorImpl::set_offset(uint64_t offset) {
void CPUGeneratorImpl::set_offset(uint64_t offset [[maybe_unused]]) {
TORCH_CHECK(false, "CPU Generator does not use offset");
}


@ -73,6 +73,8 @@ class TORCH_API Context {
return at::detail::getPrivateUse1Hooks();
} else if (device_type == at::kMTIA) {
return at::detail::getMTIAHooks();
} else if (device_type == at::kHIP) {
return at::detail::getHIPHooks();
} else {
AT_ERROR(
c10::DeviceTypeName(device_type), " device type not an accelerator.");
@ -94,8 +96,22 @@ class TORCH_API Context {
AT_ERROR(c10::DeviceTypeName(device_type), " device type not enabled.");
}
}
static bool isPinnedPtr(const void* data) {
return detail::getCUDAHooks().isPinnedPtr(data);
bool isPinnedPtr(
const void* data,
std::optional<c10::DeviceType> device_type = std::nullopt) {
auto opt_device_type =
device_type.has_value() ? device_type.value() : at::getAccelerator();
if (!opt_device_type.has_value() || // there is no accelerator
!at::isAccelerator(
opt_device_type.value())) { // passed device not an accelerator
return false;
}
return getAcceleratorHooksInterface(opt_device_type.value())
.isPinnedPtr(data);
}
Allocator* getPinnedMemoryAllocator(
std::optional<c10::DeviceType> device_type = std::nullopt) {
return getAcceleratorHooksInterface(device_type).getPinnedMemoryAllocator();
}
static bool hasOpenMP();
static bool hasMKL();
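
The hunk above makes the pinned-pointer query device-generic: with no explicit device type it falls back to at::getAccelerator() and dispatches through the matching hooks interface. A minimal caller sketch, assuming only the standard at::globalContext() accessor (the helper names below are illustrative, not part of this diff):

```cpp
#include <ATen/Context.h>

// Illustrative helper: returns false when no accelerator backend is
// available, exactly as the new isPinnedPtr() overload does.
bool is_host_ptr_pinned(const void* p) {
  // Device type omitted -> resolved via at::getAccelerator().
  return at::globalContext().isPinnedPtr(p);
}

// Or query one backend explicitly (assumes CUDA was built in):
bool is_cuda_pinned(const void* p) {
  return at::globalContext().isPinnedPtr(p, at::kCUDA);
}
```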


@ -295,7 +295,7 @@ DLManagedTensor* toDLPack(const Tensor& src) {
}
Tensor fromDLPack(DLManagedTensor* src) {
auto deleter = [src](void* self) {
auto deleter = [src](void* self [[maybe_unused]]) {
if (src->deleter) {
src->deleter(src);
}


@ -2,7 +2,7 @@
#include <ATen/DeviceAccelerator.h>
namespace at {
C10_API std::optional<DeviceType> getAccelerator(bool checked) {
std::optional<c10::DeviceType> getAccelerator(bool checked) {
#define DETECT_AND_ASSIGN_ACCELERATOR(device_name) \
if (at::has##device_name()) { \
device_type = k##device_name; \
@ -20,11 +20,13 @@ C10_API std::optional<DeviceType> getAccelerator(bool checked) {
// first.
return kPrivateUse1;
}
std::optional<DeviceType> device_type = std::nullopt;
std::optional<c10::DeviceType> device_type = std::nullopt;
bool is_accelerator_detected = false;
DETECT_AND_ASSIGN_ACCELERATOR(CUDA)
DETECT_AND_ASSIGN_ACCELERATOR(MTIA)
DETECT_AND_ASSIGN_ACCELERATOR(XPU)
DETECT_AND_ASSIGN_ACCELERATOR(HIP)
DETECT_AND_ASSIGN_ACCELERATOR(MPS)
if (checked) {
TORCH_CHECK(
device_type, "Cannot access accelerator device when none is available.")
@ -34,4 +36,18 @@ C10_API std::optional<DeviceType> getAccelerator(bool checked) {
#undef DETECT_AND_ASSIGN_ACCELERATOR
}
bool isAccelerator(c10::DeviceType d) {
switch (d) {
case at::kCUDA:
case at::kMTIA:
case at::kXPU:
case at::kHIP:
case at::kMPS:
case at::kPrivateUse1:
return true;
default:
return false;
}
}
} // namespace at


@ -13,9 +13,7 @@
// - It provides a set of common APIs as defined by AcceleratorHooksInterface
//
// As of today, accelerator devices are (in no particular order):
// CUDA, MTIA, XPU, PrivateUse1
// We want to add once all the proper APIs are supported and tested:
// HIP, MPS
// CUDA, MTIA, XPU, HIP, MPS, PrivateUse1
namespace at {
@ -24,4 +22,6 @@ namespace at {
// When checked is true, the returned optional always has a value.
TORCH_API std::optional<c10::DeviceType> getAccelerator(bool checked = false);
TORCH_API bool isAccelerator(c10::DeviceType d);
} // namespace at
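
A short usage sketch for the two declarations above (illustrative only; the logging placeholder in the comment is an assumption):

```cpp
#include <optional>
#include <c10/core/DeviceType.h>
#include <ATen/DeviceAccelerator.h>

void describe_accelerator() {
  // checked=false: returns std::nullopt instead of throwing when no
  // accelerator (CUDA, MTIA, XPU, HIP, MPS, PrivateUse1) is detected.
  std::optional<c10::DeviceType> dev = at::getAccelerator(/*checked=*/false);
  if (dev.has_value() && at::isAccelerator(*dev)) {
    // c10::DeviceTypeName(*dev) would name the detected backend here.
  }
}
```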


@ -444,8 +444,7 @@ TensorBase empty_strided_symint_meta(
SymIntArrayRef stride,
std::optional<ScalarType> dtype_opt,
std::optional<Layout> layout_opt,
std::optional<Device> device_opt,
std::optional<bool> pin_memory_opt) {
std::optional<Device> device_opt) {
TORCH_INTERNAL_ASSERT_DEBUG_ONLY(device_or_default(device_opt).type() == DeviceType::Meta);
TORCH_INTERNAL_ASSERT_DEBUG_ONLY(layout_or_default(layout_opt) == Layout::Strided);
@ -462,8 +461,7 @@ TensorBase empty_strided_symint_meta(
stride,
optTypeMetaToScalarType(options.dtype_opt()),
options.layout_opt(),
options.device_opt(),
options.pinned_memory_opt());
options.device_opt());
}
} // namespace at::detail


@ -156,8 +156,7 @@ TORCH_API TensorBase empty_strided_symint_meta(
SymIntArrayRef stride,
std::optional<ScalarType> dtype_opt,
std::optional<Layout> layout_opt,
std::optional<Device> device_opt,
std::optional<bool> pin_memory_opt);
std::optional<Device> device_opt);
TORCH_API TensorBase empty_strided_symint_meta(
SymIntArrayRef size,


@ -19,6 +19,7 @@ namespace at::jit {
struct TemplateEnv {
TemplateEnv() = default;
TemplateEnv(TemplateEnv& parent) : parent(&parent) {}
TemplateEnv& operator=(const TemplateEnv& parent) = delete;
using string_list = std::vector<std::string>;


@ -70,7 +70,7 @@ static std::string getSchemaInputTypesString(const FunctionSchema& schema) {
return input_types.str();
}
std::string ClassType::getForwardPreHookErrorMessage(int pre_hook_idx) const {
std::string ClassType::getForwardPreHookErrorMessage(size_t pre_hook_idx) const {
const std::string& pre_hook_name = forward_pre_hooks_[pre_hook_idx]->name();
const FunctionSchema& forward_schema = getMethod("forward").getSchema();
std::string input_types = getSchemaInputTypesString(forward_schema);
@ -98,7 +98,7 @@ std::string ClassType::getForwardPreHookErrorMessage(int pre_hook_idx) const {
return return_string;
}
std::string ClassType::getForwardHookErrorMessage(int hook_idx) const {
std::string ClassType::getForwardHookErrorMessage(size_t hook_idx) const {
const std::string& hook_name = forward_hooks_[hook_idx]->name();
const FunctionSchema& forward_schema = getMethod("forward").getSchema();
std::string input_types = getSchemaInputTypesString(forward_schema);
@ -190,7 +190,7 @@ static void checkForwardHookInputArguments(
}
void ClassType::checkForwardPreHookSchema(
int pre_hook_idx,
size_t pre_hook_idx,
const FunctionSchema& pre_hook_schema) const {
const torch::jit::Function* pre_hook = forward_pre_hooks_[pre_hook_idx];
std::string hook_id =
@ -287,7 +287,7 @@ void ClassType::checkForwardPreHookSchema(
}
void ClassType::checkForwardHookSchema(
int hook_idx,
size_t hook_idx,
const FunctionSchema& hook_schema) const {
const torch::jit::Function* hook = forward_hooks_[hook_idx];
std::string hook_id =
@ -451,8 +451,7 @@ bool ClassType::isSubtypeOfExt(const Type& rhs, std::ostream* why_not) const {
return false;
}
if (!self_method->getSchema().isSubtypeOf(
// NOLINTNEXTLINE(bugprone-argument-comment)
schema, /*is_method=*/true, why_not)) {
schema, /*as_method=*/true, why_not)) {
if (why_not) {
*why_not << "Method on class '" << repr_str()
<< "' (1) is not compatible with interface '"


@ -341,10 +341,10 @@ struct TORCH_API ClassType : public NamedType {
const std::vector<torch::jit::Function*>& getForwardPreHooks() const;
void checkForwardPreHookSchema(
int pre_hook_idx,
size_t pre_hook_idx,
const FunctionSchema& pre_hook_schema) const;
void checkForwardHookSchema(
int hook_idx,
size_t hook_idx,
const FunctionSchema& hook_schema) const;
void addMethod(torch::jit::Function* method);
@ -396,8 +396,8 @@ struct TORCH_API ClassType : public NamedType {
}
void addAttribute(ClassAttribute classAttribute);
std::string getForwardPreHookErrorMessage(int pre_hook_idx) const;
std::string getForwardHookErrorMessage(int hook_idx) const;
std::string getForwardPreHookErrorMessage(size_t pre_hook_idx) const;
std::string getForwardHookErrorMessage(size_t hook_idx) const;
// Mapping of attribute names -> their type.
// NOTE: this does not contain methods, which are stored in the module


@ -127,7 +127,7 @@ constexpr bool allowlist_contains(string_view allowlist, string_view item) {
// Returns true iff the given op name is on the allowlist
// and should be registered
constexpr bool op_allowlist_check(string_view op_name) {
constexpr bool op_allowlist_check(string_view op_name [[maybe_unused]]) {
assert(op_name.find("::") != string_view::npos);
// Use assert() instead of throw() due to a gcc bug. See:
// https://stackoverflow.com/questions/34280729/throw-in-constexpr-function


@ -363,16 +363,16 @@ public:
return _mm512_castsi512_pd(_mm512_mask_set1_epi64(zero_vector, mask,
0xFFFFFFFFFFFFFFFF));
}
Vectorized<c10::complex<double>> operator<(const Vectorized<c10::complex<double>>& other) const {
Vectorized<c10::complex<double>> operator<(const Vectorized<c10::complex<double>>& other [[maybe_unused]]) const {
TORCH_CHECK(false, "not supported for complex numbers");
}
Vectorized<c10::complex<double>> operator<=(const Vectorized<c10::complex<double>>& other) const {
Vectorized<c10::complex<double>> operator<=(const Vectorized<c10::complex<double>>& other [[maybe_unused]]) const {
TORCH_CHECK(false, "not supported for complex numbers");
}
Vectorized<c10::complex<double>> operator>(const Vectorized<c10::complex<double>>& other) const {
Vectorized<c10::complex<double>> operator>(const Vectorized<c10::complex<double>>& other [[maybe_unused]]) const {
TORCH_CHECK(false, "not supported for complex numbers");
}
Vectorized<c10::complex<double>> operator>=(const Vectorized<c10::complex<double>>& other) const {
Vectorized<c10::complex<double>> operator>=(const Vectorized<c10::complex<double>>& other [[maybe_unused]]) const {
TORCH_CHECK(false, "not supported for complex numbers");
}


@ -864,16 +864,16 @@ public:
auto mask = _mm512_cmp_ps_mask(values, other.values, _CMP_NEQ_UQ);
return _mm512_castsi512_ps(_mm512_mask_set1_epi32(zero_vector, mask, 0xFFFFFFFF));
}
Vectorized<c10::complex<float>> operator<(const Vectorized<c10::complex<float>>& other) const {
Vectorized<c10::complex<float>> operator<(const Vectorized<c10::complex<float>>& other [[maybe_unused]]) const {
TORCH_CHECK(false, "not supported for complex numbers");
}
Vectorized<c10::complex<float>> operator<=(const Vectorized<c10::complex<float>>& other) const {
Vectorized<c10::complex<float>> operator<=(const Vectorized<c10::complex<float>>& other [[maybe_unused]]) const {
TORCH_CHECK(false, "not supported for complex numbers");
}
Vectorized<c10::complex<float>> operator>(const Vectorized<c10::complex<float>>& other) const {
Vectorized<c10::complex<float>> operator>(const Vectorized<c10::complex<float>>& other [[maybe_unused]]) const {
TORCH_CHECK(false, "not supported for complex numbers");
}
Vectorized<c10::complex<float>> operator>=(const Vectorized<c10::complex<float>>& other) const {
Vectorized<c10::complex<float>> operator>=(const Vectorized<c10::complex<float>>& other [[maybe_unused]]) const {
TORCH_CHECK(false, "not supported for complex numbers");
}


@ -72,12 +72,13 @@ __m512i pack_saturate_and_clamp(
template <>
inline __m512i pack_saturate_and_clamp<int32_t>(
__m512i first,
__m512i second,
int32_t min_val,
int32_t max_val) {
__m512i first [[maybe_unused]],
__m512i second [[maybe_unused]],
int32_t min_val [[maybe_unused]],
int32_t max_val [[maybe_unused]]) {
// This function is for linkage only, will not be used
AT_ERROR("pack_saturate_and_clamp<int32_t> is not supported");
return __m512i{};
}
template <>
@ -342,7 +343,7 @@ struct Vectorized<c10::qint32> : public Vectorizedqi {
const float_vec_return_type& rhs,
float scale,
int32_t zero_point,
float inverse_scale) {
float inverse_scale [[maybe_unused]]) {
Vectorized<c10::qint32> retval;
auto rhs_data = (__m512)rhs[0];
at::native::quantize_vec<c10::qint32, /*precision=*/32>(
@ -955,7 +956,7 @@ struct VectorizedQuantizedConverter {
float_vec_return_type dequantize(
Vectorized<float> scale,
Vectorized<float> zero_point,
Vectorized<float> scale_zp_premul) const {
Vectorized<float> scale_zp_premul [[maybe_unused]]) const {
float_vec_return_type rv;
for (const auto i : c10::irange(float_num_vecs())) {
float tmp_vals[16];
@ -1039,7 +1040,7 @@ struct Vectorized<c10::qint32> : public VectorizedQuantizedConverter<
const float_vec_return_type& rhs,
float scale,
int32_t zero_point,
float inverse_scale) {
float inverse_scale [[maybe_unused]]) {
std::array<value_type, size()> qvals;
std::array<float, float_num_vecs() * 16> float_vals;
@ -1183,7 +1184,7 @@ struct Vectorized<c10::qint8> : public VectorizedQuantizedConverter<
const float_vec_return_type& rhs,
float scale,
int32_t zero_point,
float inverse_scale) {
float inverse_scale [[maybe_unused]]) {
std::array<value_type, size()> qvals;
std::array<float, float_num_vecs() * 16> float_vals;
@ -1315,7 +1316,7 @@ struct Vectorized<c10::quint8> : public VectorizedQuantizedConverter<
const float_vec_return_type& rhs,
float scale,
int32_t zero_point,
float inverse_scale) {
float inverse_scale [[maybe_unused]]) {
std::array<value_type, size()> qvals;
std::array<float, float_num_vecs() * 16> float_vals;


@ -1,32 +0,0 @@
#include <ATen/cuda/PinnedMemoryAllocator.h>
#include <ATen/Context.h>
#include <ATen/Config.h>
#include <ATen/TensorUtils.h>
#include <c10/core/Storage.h>
#include <ATen/ATen.h>
#include <ATen/CPUFunctions.h>
namespace at::native {
bool is_pinned_cuda(const Tensor& self, std::optional<Device> device) {
TORCH_INTERNAL_ASSERT_DEBUG_ONLY(!device.has_value() || device->is_cuda());
// TODO: unhook this
return detail::getCUDAHooks().isPinnedPtr(self.storage().data());
}
Tensor _pin_memory_cuda(const Tensor& self, std::optional<Device> device) {
TORCH_INTERNAL_ASSERT_DEBUG_ONLY(!device.has_value() || device->is_cuda());
auto* allocator = at::cuda::getPinnedMemoryAllocator();
auto storage = Storage(
Storage::use_byte_size_t(),
detail::computeStorageNbytes(
self.sizes(), self.strides(), self.dtype().itemsize()),
allocator,
/*resizable=*/false);
auto tensor = at::cpu::empty({0}, self.options()).set_(storage, 0, self.sizes(), self.strides());
tensor.copy_(self);
return tensor;
}
} // namespace at::native


@ -45,7 +45,7 @@ struct DeviceThreadHandlePool : public std::enable_shared_from_this<DeviceThread
// unordered_map<int, vector<unique_ptr<Handle>>> created_handles;
Handle(const Handle& rhs) = delete;
// Following https://stackoverflow.com/questions/3279543/what-is-the-copy-and-swap-idiom
Handle(Handle&& rhs) : Handle() { std::swap(handle, rhs.handle); }
Handle(Handle&& rhs) noexcept : Handle() { std::swap(handle, rhs.handle); }
// operator= takes argument by value
Handle& operator=(Handle rhs) { std::swap(handle, rhs.handle); return *this; }
~Handle() {


@ -166,15 +166,15 @@ struct GemmStridedBatchedParams : OpParams {
}
size_t GetSizeA() const {
return sizeof(T) * lda * ((transa == 'n' || transa == 'N') ? k : m) * batch;
return sizeof(T) * std::min(lda, stride_a) * ((transa == 'n' || transa == 'N') ? k : m) * batch;
}
size_t GetSizeB() const {
return sizeof(T) * ldb * ((transb == 'n' || transb == 'N') ? n : k) * batch;
return sizeof(T) * std::min(ldb, stride_b) * ((transb == 'n' || transb == 'N') ? n : k) * batch;
}
size_t GetSizeC() const {
return sizeof(T) * ldc * n * batch;
return sizeof(T) * std::min(ldc, stride_c) * n * batch;
}
size_t GetSize(bool duplicate_inputs) const {


@ -18,7 +18,7 @@ namespace at::cuda::tunable {
class StreamTimer : public ITimer {
public:
StreamTimer();
virtual ~StreamTimer();
virtual ~StreamTimer() override;
void Start() override;


@ -194,15 +194,19 @@ static void AddRocblasValidator() {
static void AddHipblasltValidator() {
auto validators = getTuningContext()->GetTuningResultsValidator().GetAllValidators();
if (validators.find("HIPBLASLT_VERSION") == validators.end()) {
std::string hipblaslt_version = c10::str(
XSTRINGIFY(HIPBLASLT_VERSION_MAJOR), ".",
XSTRINGIFY(HIPBLASLT_VERSION_MINOR), ".",
XSTRINGIFY(HIPBLASLT_VERSION_PATCH), "-",
XSTRINGIFY(HIPBLASLT_VERSION_TWEAK));
int version;
std::string revision(128, '\0');
auto handle = at::cuda::getCurrentCUDABlasLtHandle();
hipblasLtGetVersion(handle, &version);
hipblasLtGetGitRevision(handle, revision.data());
std::string hipblaslt_version =
c10::str(version, "-", revision.c_str());
getTuningContext()->GetTuningResultsValidator().RegisterValidator(
"HIPBLASLT_VERSION",
[hipblaslt_version]() { return hipblaslt_version; },
[hipblaslt_version](auto&& k) { return hipblaslt_version == k ? OK : FAIL; });
[hipblaslt_version](auto&& k) {
return hipblaslt_version == k ? OK : FAIL;
});
}
}
@ -318,8 +322,6 @@ class ScaledGemmTunableOp : public TunableOp<ScaledGemmParams<CT>, StreamTimer>
ScaledGemmTunableOp() {
this->RegisterOp(std::string("Default"), std::make_unique<DefaultScaledGemmOp<CT>>());
auto validators = getTuningContext()->GetTuningResultsValidator().GetAllValidators();
#if defined(USE_ROCM)
for (auto&& [name, op] : GetHipBlasLtScaledGemmTypeStringAndOps<AT, BT, CT, ALayout, BLayout>()) {
this->RegisterOp(std::move(name), std::move(op));


@ -2,6 +2,7 @@
#include <c10/core/Device.h>
#include <c10/core/Stream.h>
#include <c10/core/Allocator.h>
C10_DIAGNOSTIC_PUSH_AND_IGNORED_IF_DEFINED("-Wunused-parameter")
namespace at {
@ -40,6 +41,15 @@ struct TORCH_API AcceleratorHooksInterface {
TORCH_CHECK(false, "Backend doesn't support maybeExchangeDevice()");
return -1;
}
virtual bool isPinnedPtr(const void* data) const {
return false;
}
virtual Allocator* getPinnedMemoryAllocator() const {
TORCH_CHECK(false, "Backend doesn't support getPinnedMemoryAllocator()");
return nullptr;
}
};
} // namespace at
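
With isPinnedPtr() and getPinnedMemoryAllocator() lifted onto the base interface as shown above, a backend that wants generic pinned-memory support only has to override them. A hypothetical backend sketch (every my_backend_* name is an assumption, not a real API):

```cpp
#include <ATen/detail/AcceleratorHooksInterface.h>

// Hypothetical backend entry points, assumed for this sketch only.
bool my_backend_is_host_registered(const void* p);
c10::Allocator* my_backend_pinned_allocator();

struct MyBackendHooks final : at::AcceleratorHooksInterface {
  bool hasPrimaryContext(c10::DeviceIndex /*device_index*/) const override {
    return true;  // assume a live context for the sketch
  }
  bool isPinnedPtr(const void* data) const override {
    return my_backend_is_host_registered(data);
  }
  c10::Allocator* getPinnedMemoryAllocator() const override {
    return my_backend_pinned_allocator();
  }
};
```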


@ -77,7 +77,7 @@ struct TORCH_API CUDAHooksInterface : AcceleratorHooksInterface {
TORCH_CHECK(false, "Cannot get device of pointer on CUDA without ATen_cuda library. ", CUDA_HELP);
}
virtual bool isPinnedPtr(const void* /*data*/) const {
virtual bool isPinnedPtr(const void* data) const override {
return false;
}
@ -121,7 +121,7 @@ struct TORCH_API CUDAHooksInterface : AcceleratorHooksInterface {
return -1;
}
virtual Allocator* getPinnedMemoryAllocator() const {
virtual Allocator* getPinnedMemoryAllocator() const override {
TORCH_CHECK(false, "Pinned memory requires CUDA. ", CUDA_HELP);
}


@ -6,6 +6,8 @@
#include <c10/util/Registry.h>
#include <ATen/detail/AcceleratorHooksInterface.h>
#include <memory>
namespace at {
@ -19,10 +21,10 @@ namespace at {
// which we may want to call into from CPU code (and thus must be dynamically
// dispatched, to allow for separate compilation of HIP code). See
// CUDAHooksInterface for more detailed motivation.
struct TORCH_API HIPHooksInterface {
struct TORCH_API HIPHooksInterface : AcceleratorHooksInterface {
// This should never actually be implemented, but it is used to
// squelch -Werror=non-virtual-dtor
virtual ~HIPHooksInterface() = default;
virtual ~HIPHooksInterface() override = default;
// Initialize the HIP library state
virtual void initHIP() const {
@ -41,7 +43,11 @@ struct TORCH_API HIPHooksInterface {
return -1;
}
virtual Allocator* getPinnedMemoryAllocator() const {
virtual bool isPinnedPtr(const void* data) const override {
return false;
}
virtual Allocator* getPinnedMemoryAllocator() const override {
AT_ERROR("Pinned memory requires HIP.");
}
@ -52,6 +58,10 @@ struct TORCH_API HIPHooksInterface {
virtual int getNumGPUs() const {
return 0;
}
virtual bool hasPrimaryContext(DeviceIndex device_index) const override {
AT_ERROR("Cannot check primary context without ATen_hip library.");
}
};
// NB: dummy argument to suppress "ISO C++11 requires at least one argument


@ -94,6 +94,12 @@ struct TORCH_API MPSHooksInterface : AcceleratorHooksInterface {
bool hasPrimaryContext(DeviceIndex device_index) const override {
FAIL_MPSHOOKS_FUNC(__func__);
}
virtual bool isPinnedPtr(const void* data) const override {
return false;
}
virtual Allocator* getPinnedMemoryAllocator() const override {
FAIL_MPSHOOKS_FUNC(__func__);
}
#undef FAIL_MPSHOOKS_FUNC
};


@ -6,6 +6,9 @@
#include <c10/core/Stream.h>
#include <c10/util/Registry.h>
#include <c10/core/Allocator.h>
#include <c10/util/python_stub.h>
#include <ATen/detail/AcceleratorHooksInterface.h>
#include <string>
@ -15,7 +18,6 @@ class Context;
}
namespace at {
constexpr const char* MTIA_HELP =
"The MTIA backend requires MTIA extension for PyTorch;"
"this error has occurred because you are trying "
@ -88,6 +90,20 @@ struct TORCH_API MTIAHooksInterface : AcceleratorHooksInterface {
virtual void setCurrentStream(const c10::Stream& stream) const {
FAIL_MTIAHOOKS_FUNC(__func__);
}
virtual bool isPinnedPtr(const void* data) const override {
return false;
}
virtual Allocator* getPinnedMemoryAllocator() const override {
FAIL_MTIAHOOKS_FUNC(__func__);
return nullptr;
}
virtual PyObject* memoryStats(DeviceIndex device) const {
FAIL_MTIAHOOKS_FUNC(__func__);
return nullptr;
}
};
struct TORCH_API MTIAHooksArgs {};


@ -24,7 +24,11 @@ struct TORCH_API PrivateUse1HooksInterface : AcceleratorHooksInterface {
"You should register `PrivateUse1HooksInterface` for PrivateUse1 before call `getDeviceFromPtr`.");
}
virtual Allocator* getPinnedMemoryAllocator() const {
virtual bool isPinnedPtr(const void* data) const override {
return false;
}
virtual Allocator* getPinnedMemoryAllocator() const override {
TORCH_CHECK(
false,
"You should register `PrivateUse1HooksInterface` for PrivateUse1 before call `getPinnedMemoryAllocator`.");


@ -58,15 +58,15 @@ struct TORCH_API XPUHooksInterface : AcceleratorHooksInterface{
TORCH_CHECK(false, "Cannot synchronize XPU device without ATen_xpu library.");
}
virtual Allocator* getPinnedMemoryAllocator() const {
virtual Allocator* getPinnedMemoryAllocator() const override {
TORCH_CHECK(false, "Cannot get XPU pinned memory allocator without ATen_xpu library.");
}
virtual bool isPinnedPtr(const void* /*data*/) const {
virtual bool isPinnedPtr(const void* data) const override {
return false;
}
virtual bool hasPrimaryContext(DeviceIndex /*device_index*/) const override{
virtual bool hasPrimaryContext(DeviceIndex device_index) const override {
TORCH_CHECK(false, "Cannot query primary context without ATen_xpu library.");
}
};


@ -861,21 +861,10 @@ static std::tuple<at::Tensor,at::Tensor,at::Tensor> _native_batch_norm_legit_no_
return at::native_batch_norm(self, weight_opt, bias_opt, Tensor(), Tensor(), train, momentum, eps);
}
static std::tuple<at::Tensor,at::Tensor,at::Tensor,at::Tensor> _batch_norm_with_update_batch(
const Tensor& self, const std::optional<Tensor>& weight_opt, const std::optional<Tensor>& bias_opt,
Tensor& running_mean, Tensor& running_var, double momentum, double eps) {
at::Tensor output, save_mean, save_var;
std::tie(output, save_mean, save_var) =
at::native_batch_norm(self, weight_opt, bias_opt, running_mean, running_var, true/*train*/, momentum, eps);
at::Tensor reserve = at::empty({0}, self.options().dtype(kByte));
return std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor>(output, save_mean, save_var, reserve);
}
TORCH_LIBRARY_IMPL(aten, FuncTorchBatched, m) {
VMAP_SUPPORT(native_batch_norm, NATIVE_BATCH_NORM_BATCH_RULE(native_batch_norm));
VMAP_SUPPORT(cudnn_batch_norm, CUDNN_BATCH_NORM_BATCH_RULE(cudnn_batch_norm));
VMAP_SUPPORT(miopen_batch_norm, MIOPEN_BATCH_NORM_BATCH_RULE(miopen_batch_norm));
m.impl("_batch_norm_with_update", _batch_norm_with_update_batch);
m.impl("_native_batch_norm_legit", _native_batch_norm_legit_batch);
m.impl("_native_batch_norm_legit.no_stats", _native_batch_norm_legit_no_stats_batch);
m.impl("native_batch_norm_backward", NATIVE_BATCH_NORM_BACKWARD_BATCH_RULE(native_batch_norm_backward));


@ -55,3 +55,14 @@ static inline const char* _mklGetErrorString(sparse_status_t status) {
at::mkl::sparse::_mklGetErrorString(__err), \
" when calling `" #EXPR "`"); \
} while (0)
#define TORCH_MKLSPARSE_CHECK_SUCCESS_OR_INVALID(status, function_name) \
do { \
sparse_status_t __status = (status); \
TORCH_CHECK( \
__status == SPARSE_STATUS_SUCCESS || \
__status == SPARSE_STATUS_INVALID_VALUE, \
"MKL error: ", \
at::mkl::sparse::_mklGetErrorString(__status), \
" when calling `" function_name "`"); \
} while (0)


@ -278,48 +278,60 @@ void spmmd<c10::complex<double>>(MKL_SPARSE_SPMMD_ARGTYPES(c10::complex<double>)
}
template <>
void trsv<float>(MKL_SPARSE_TRSV_ARGTYPES(float)) {
TORCH_MKLSPARSE_CHECK(mkl_sparse_s_trsv(operation, alpha, A, descr, x, y));
sparse_status_t trsv<float>(MKL_SPARSE_TRSV_ARGTYPES(float)) {
sparse_status_t status = mkl_sparse_s_trsv(operation, alpha, A, descr, x, y);
TORCH_MKLSPARSE_CHECK_SUCCESS_OR_INVALID(status, "mkl_sparse_s_trsv");
return status;
}
template <>
void trsv<double>(MKL_SPARSE_TRSV_ARGTYPES(double)) {
TORCH_MKLSPARSE_CHECK(mkl_sparse_d_trsv(operation, alpha, A, descr, x, y));
sparse_status_t trsv<double>(MKL_SPARSE_TRSV_ARGTYPES(double)) {
sparse_status_t status = mkl_sparse_d_trsv(operation, alpha, A, descr, x, y);
TORCH_MKLSPARSE_CHECK_SUCCESS_OR_INVALID(status, "mkl_sparse_d_trsv");
return status;
}
template <>
void trsv<c10::complex<float>>(MKL_SPARSE_TRSV_ARGTYPES(c10::complex<float>)) {
TORCH_MKLSPARSE_CHECK(mkl_sparse_c_trsv(
sparse_status_t trsv<c10::complex<float>>(MKL_SPARSE_TRSV_ARGTYPES(c10::complex<float>)) {
sparse_status_t status = mkl_sparse_c_trsv(
operation,
to_mkl_complex<float, MKL_Complex8>(alpha),
A,
descr,
reinterpret_cast<const MKL_Complex8*>(x),
reinterpret_cast<MKL_Complex8*>(y)));
reinterpret_cast<MKL_Complex8*>(y));
TORCH_MKLSPARSE_CHECK_SUCCESS_OR_INVALID(status, "mkl_sparse_c_trsv");
return status;
}
template <>
void trsv<c10::complex<double>>(
sparse_status_t trsv<c10::complex<double>>(
MKL_SPARSE_TRSV_ARGTYPES(c10::complex<double>)) {
TORCH_MKLSPARSE_CHECK(mkl_sparse_z_trsv(
sparse_status_t status = mkl_sparse_z_trsv(
operation,
to_mkl_complex<double, MKL_Complex16>(alpha),
A,
descr,
reinterpret_cast<const MKL_Complex16*>(x),
reinterpret_cast<MKL_Complex16*>(y)));
reinterpret_cast<MKL_Complex16*>(y));
TORCH_MKLSPARSE_CHECK_SUCCESS_OR_INVALID(status, "mkl_sparse_z_trsv");
return status;
}
template <>
void trsm<float>(MKL_SPARSE_TRSM_ARGTYPES(float)) {
TORCH_MKLSPARSE_CHECK(mkl_sparse_s_trsm(
operation, alpha, A, descr, layout, x, columns, ldx, y, ldy));
sparse_status_t trsm<float>(MKL_SPARSE_TRSM_ARGTYPES(float)) {
sparse_status_t status = mkl_sparse_s_trsm(
operation, alpha, A, descr, layout, x, columns, ldx, y, ldy);
TORCH_MKLSPARSE_CHECK_SUCCESS_OR_INVALID(status, "mkl_sparse_s_trsm");
return status;
}
template <>
void trsm<double>(MKL_SPARSE_TRSM_ARGTYPES(double)) {
TORCH_MKLSPARSE_CHECK(mkl_sparse_d_trsm(
operation, alpha, A, descr, layout, x, columns, ldx, y, ldy));
sparse_status_t trsm<double>(MKL_SPARSE_TRSM_ARGTYPES(double)) {
sparse_status_t status = mkl_sparse_d_trsm(
operation, alpha, A, descr, layout, x, columns, ldx, y, ldy);
TORCH_MKLSPARSE_CHECK_SUCCESS_OR_INVALID(status, "mkl_sparse_d_trsm");
return status;
}
template <>
void trsm<c10::complex<float>>(MKL_SPARSE_TRSM_ARGTYPES(c10::complex<float>)) {
TORCH_MKLSPARSE_CHECK(mkl_sparse_c_trsm(
sparse_status_t trsm<c10::complex<float>>(MKL_SPARSE_TRSM_ARGTYPES(c10::complex<float>)) {
sparse_status_t status = mkl_sparse_c_trsm(
operation,
to_mkl_complex<float, MKL_Complex8>(alpha),
A,
@ -329,12 +341,14 @@ void trsm<c10::complex<float>>(MKL_SPARSE_TRSM_ARGTYPES(c10::complex<float>)) {
columns,
ldx,
reinterpret_cast<MKL_Complex8*>(y),
ldy));
ldy);
TORCH_MKLSPARSE_CHECK_SUCCESS_OR_INVALID(status, "mkl_sparse_c_trsm");
return status;
}
template <>
void trsm<c10::complex<double>>(
sparse_status_t trsm<c10::complex<double>>(
MKL_SPARSE_TRSM_ARGTYPES(c10::complex<double>)) {
TORCH_MKLSPARSE_CHECK(mkl_sparse_z_trsm(
sparse_status_t status = mkl_sparse_z_trsm(
operation,
to_mkl_complex<double, MKL_Complex16>(alpha),
A,
@ -344,7 +358,9 @@ void trsm<c10::complex<double>>(
columns,
ldx,
reinterpret_cast<MKL_Complex16*>(y),
ldy));
ldy);
TORCH_MKLSPARSE_CHECK_SUCCESS_OR_INVALID(status, "mkl_sparse_z_trsm");
return status;
}
} // namespace at::mkl::sparse
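
Because the wrappers above now return the MKL status and only treat SPARSE_STATUS_SUCCESS or SPARSE_STATUS_INVALID_VALUE as non-fatal, a caller can detect the invalid-value case and fall back. A hedged caller sketch, assuming the ATen/mkl/SparseBlas.h header used by ATen and argument names matching MKL_SPARSE_TRSV_ARGTYPES (the fallback itself is assumed, not shown in this diff):

```cpp
#include <ATen/mkl/SparseBlas.h>

// Sketch: tolerate SPARSE_STATUS_INVALID_VALUE and fall back; any other
// failure status was already raised inside the wrapper via
// TORCH_MKLSPARSE_CHECK_SUCCESS_OR_INVALID.
void trsv_with_fallback(const sparse_operation_t operation, float alpha,
                        const sparse_matrix_t A, const matrix_descr descr,
                        const float* x, float* y) {
  sparse_status_t status =
      at::mkl::sparse::trsv<float>(operation, alpha, A, descr, x, y);
  if (status == SPARSE_STATUS_INVALID_VALUE) {
    // e.g. dispatch to a reference triangular solve here (assumption)
  }
}
```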


@ -184,7 +184,7 @@ void spmmd<c10::complex<double>>(
const scalar_t *x, scalar_t *y
template <typename scalar_t>
inline void trsv(MKL_SPARSE_TRSV_ARGTYPES(scalar_t)) {
inline sparse_status_t trsv(MKL_SPARSE_TRSV_ARGTYPES(scalar_t)) {
TORCH_INTERNAL_ASSERT(
false,
"at::mkl::sparse::trsv: not implemented for ",
@ -192,13 +192,13 @@ inline void trsv(MKL_SPARSE_TRSV_ARGTYPES(scalar_t)) {
}
template <>
void trsv<float>(MKL_SPARSE_TRSV_ARGTYPES(float));
sparse_status_t trsv<float>(MKL_SPARSE_TRSV_ARGTYPES(float));
template <>
void trsv<double>(MKL_SPARSE_TRSV_ARGTYPES(double));
sparse_status_t trsv<double>(MKL_SPARSE_TRSV_ARGTYPES(double));
template <>
void trsv<c10::complex<float>>(MKL_SPARSE_TRSV_ARGTYPES(c10::complex<float>));
sparse_status_t trsv<c10::complex<float>>(MKL_SPARSE_TRSV_ARGTYPES(c10::complex<float>));
template <>
void trsv<c10::complex<double>>(MKL_SPARSE_TRSV_ARGTYPES(c10::complex<double>));
sparse_status_t trsv<c10::complex<double>>(MKL_SPARSE_TRSV_ARGTYPES(c10::complex<double>));
#define MKL_SPARSE_TRSM_ARGTYPES(scalar_t) \
const sparse_operation_t operation, const scalar_t alpha, \
@ -207,7 +207,7 @@ void trsv<c10::complex<double>>(MKL_SPARSE_TRSV_ARGTYPES(c10::complex<double>));
const MKL_INT ldx, scalar_t *y, const MKL_INT ldy
template <typename scalar_t>
inline void trsm(MKL_SPARSE_TRSM_ARGTYPES(scalar_t)) {
inline sparse_status_t trsm(MKL_SPARSE_TRSM_ARGTYPES(scalar_t)) {
TORCH_INTERNAL_ASSERT(
false,
"at::mkl::sparse::trsm: not implemented for ",
@ -215,12 +215,12 @@ inline void trsm(MKL_SPARSE_TRSM_ARGTYPES(scalar_t)) {
}
template <>
void trsm<float>(MKL_SPARSE_TRSM_ARGTYPES(float));
sparse_status_t trsm<float>(MKL_SPARSE_TRSM_ARGTYPES(float));
template <>
void trsm<double>(MKL_SPARSE_TRSM_ARGTYPES(double));
sparse_status_t trsm<double>(MKL_SPARSE_TRSM_ARGTYPES(double));
template <>
void trsm<c10::complex<float>>(MKL_SPARSE_TRSM_ARGTYPES(c10::complex<float>));
sparse_status_t trsm<c10::complex<float>>(MKL_SPARSE_TRSM_ARGTYPES(c10::complex<float>));
template <>
void trsm<c10::complex<double>>(MKL_SPARSE_TRSM_ARGTYPES(c10::complex<double>));
sparse_status_t trsm<c10::complex<double>>(MKL_SPARSE_TRSM_ARGTYPES(c10::complex<double>));
} // namespace at::mkl::sparse


@ -3,8 +3,6 @@
#include <ATen/CPUFunctions.h>
#include <ATen/EmptyTensor.h>
#include <ATen/mps/MPSAllocator.h>
#include <ATen/ops/_pin_memory_native.h>
#include <ATen/ops/is_pinned_native.h>
#include <c10/core/Allocator.h>
#include <c10/core/Storage.h>
@ -860,31 +858,12 @@ IMPSAllocator* getIMPSAllocator(bool sharedAllocator) {
return nullptr;
}
} // namespace at::mps
namespace at::native {
// torch.is_pinned() implementation
// Pinned memory will be helpful on Apple Silicon Macs with Unified memory as we
// will be able to use SharedStorageMode for MTLBuffer allocations. This will
// avoid extra copies on DataLoading operations.
bool is_pinned_mps(const Tensor& self, std::optional<Device> device) {
TORCH_INTERNAL_ASSERT_DEBUG_ONLY(!device.has_value() || device->is_mps());
return at::mps::_getSharedAllocator().isSharedBuffer(self.storage().data());
bool isMPSPinnedPtr(const void* data) {
return at::mps::_getSharedAllocator().isSharedBuffer(data);
}
// torch.pin_memory() implementation
Tensor _pin_memory_mps(const Tensor& self, std::optional<Device> device) {
TORCH_INTERNAL_ASSERT_DEBUG_ONLY(!device.has_value() || device->is_mps());
auto* shared_allocator = at::mps::getIMPSAllocator(true);
TORCH_CHECK(shared_allocator, "unable to pin memory on a non-unified memory device");
const size_t storage_size = at::detail::computeStorageNbytes(self.sizes(), self.strides(), self.dtype().itemsize());
std::cerr << "Pinning memory of size " << storage_size / 1024UL << " KB\n";
auto storage = Storage(Storage::use_byte_size_t(), storage_size, shared_allocator, false);
auto tensor = at::cpu::empty({0}, self.options()).set_(storage, 0, self.sizes(), self.strides());
tensor.copy_(self);
return tensor;
}
} // namespace at::native
} // namespace at::mps


@ -59,4 +59,6 @@ C10_DECLARE_REGISTRY(MPSAllocatorCallbacksRegistry, IMpsAllocatorCallback);
IMPSAllocator* getIMPSAllocator(bool sharedAllocator = false);
bool isMPSPinnedPtr(const void* data);
} // namespace at::mps


@ -31,6 +31,7 @@ enum class MacOSVersion : uint32_t {
MACOS_VER_13_2_PLUS,
MACOS_VER_13_3_PLUS,
MACOS_VER_14_0_PLUS,
MACOS_VER_14_4_PLUS,
};
//-----------------------------------------------------------------


@ -105,19 +105,19 @@ MPSDevice::MPSDevice() : _mtl_device(nil), _mtl_indexing_library(nil) {
bool MPSDevice::isMacOS13Plus(MacOSVersion version) const {
id mpsCD = NSClassFromString(@"MPSGraph");
static auto compileOptions = [[[MTLCompileOptions alloc] init] autorelease];
static bool _macos_13_0_plus = [mpsCD instancesRespondToSelector:@selector(cumulativeSumWithTensor:
axis:name:)] == YES;
static bool _macos_13_1_plus =
[mpsCD instancesRespondToSelector:@selector
(sampleGridWithSourceTensor:
coordinateTensor:layout:normalizeCoordinates:relativeCoordinates:alignCorners:paddingMode
:samplingMode:constantValue:name:)] == YES;
static bool _macos_13_2_plus =
[mpsCD instancesRespondToSelector:@selector(convolution3DWithSourceTensor:weightsTensor:descriptor:name:)] == YES;
static bool _macos_13_3_plus = [compileOptions respondsToSelector:@selector(maxTotalThreadsPerThreadgroup)] == YES;
static bool _macos_14_0_plus = [mpsCD instancesRespondToSelector:@selector(conjugateWithTensor:name:)] == YES;
auto is_os_version_at_least = [](int major, int minor) {
@autoreleasepool {
NSProcessInfo* processInfo = [[NSProcessInfo alloc] init];
return [processInfo
isOperatingSystemAtLeastVersion:{.majorVersion = major, .minorVersion = minor, .patchVersion = 0}];
}
};
static bool _macos_13_0_plus = is_os_version_at_least(13, 0);
static bool _macos_13_1_plus = is_os_version_at_least(13, 1);
static bool _macos_13_2_plus = is_os_version_at_least(13, 2);
static bool _macos_13_3_plus = is_os_version_at_least(13, 3);
static bool _macos_14_0_plus = is_os_version_at_least(14, 0);
static bool _macos_14_4_plus = is_os_version_at_least(14, 4);
switch (version) {
case MacOSVersion::MACOS_VER_13_0_PLUS:
@ -130,6 +130,8 @@ bool MPSDevice::isMacOS13Plus(MacOSVersion version) const {
return _macos_13_3_plus;
case MacOSVersion::MACOS_VER_14_0_PLUS:
return _macos_14_0_plus;
case MacOSVersion::MACOS_VER_14_4_PLUS:
return _macos_14_4_plus;
default:
return false;
}

Some files were not shown because too many files have changed in this diff.