Compare commits

...

922 Commits

Author SHA1 Message Date
b2cbcf710b [BE] typing for decorators - _inductor/lowering (#131574)
See #131429

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131574
Approved by: https://github.com/oulgen, https://github.com/zou3519
ghstack dependencies: #131568, #131569, #131570, #131571, #131572, #131573
2024-07-25 22:24:19 +00:00
f0f20f7e97 [BE] typing for decorators - _jit_internal (#131573)
See #131429

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131573
Approved by: https://github.com/oulgen, https://github.com/zou3519
ghstack dependencies: #131568, #131569, #131570, #131571, #131572
2024-07-25 22:24:19 +00:00
bfe0079b72 [BE] typing for decorators - _meta_registrations (#131572)
See #131429
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131572
Approved by: https://github.com/oulgen, https://github.com/zou3519
ghstack dependencies: #131568, #131569, #131570, #131571
2024-07-25 22:24:19 +00:00
4b985e6f80 [BE] typing for decorators - distributed/_tensor/ops/utils (#131571)
See #131429

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131571
Approved by: https://github.com/oulgen, https://github.com/zou3519
ghstack dependencies: #131568, #131569, #131570
2024-07-25 22:24:19 +00:00
5731b486c8 [BE] typing for decorators - library (#131570)
See #131429

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131570
Approved by: https://github.com/oulgen, https://github.com/zou3519
ghstack dependencies: #131568, #131569
2024-07-25 22:24:19 +00:00
aa58af8b43 [BE] typing for decorators - masked/_ops (#131569)
See #131429
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131569
Approved by: https://github.com/oulgen, https://github.com/zou3519
ghstack dependencies: #131568
2024-07-25 22:24:19 +00:00
193f62fde9 [BE] typing for decorators - fx/_compatibility (#131568)
See #131429

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131568
Approved by: https://github.com/justinchuby, https://github.com/oulgen, https://github.com/zou3519
2024-07-25 22:24:19 +00:00
709ddf7a9d Add wrappers for synchronous GPUDirect Storage APIs (#130633)
Based in part on https://github.com/NVIDIA/apex/pull/1774

Differential Revision: [D60155434](https://our.internmc.facebook.com/intern/diff/D60155434)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130633
Approved by: https://github.com/albanD
2024-07-25 22:23:38 +00:00
0455344777 [dynamo] Turn on inline_inbuilt_nn_modules (#131275)
Known issues that are deliberately kept open and will be fixed later are tracked here - https://github.com/pytorch/pytorch/issues/131696

Training dashboard ([link](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Thu%2C%2018%20Jul%202024%2000%3A03%3A50%20GMT&stopTime=Thu%2C%2025%20Jul%202024%2000%3A03%3A50%20GMT&granularity=hour&suite=torchbench&mode=training&dtype=amp&lBranch=gh/anijain2305/435/head&lCommit=408b9358b8fca3a5d08b39741419fe8a596941aa&rBranch=gh/anijain2305/435/base&rCommit=d31f2ae904ba2cf0884bf24413ba2109c3585d51))

![image](https://github.com/user-attachments/assets/08ef081c-37d7-436d-905b-4b9e2b470644)

Inference dashboard ([link](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Thu%2C%2018%20Jul%202024%2000%3A03%3A50%20GMT&stopTime=Thu%2C%2025%20Jul%202024%2000%3A03%3A50%20GMT&granularity=hour&suite=torchbench&mode=inference&dtype=bfloat16&lBranch=gh/anijain2305/435/head&lCommit=914244fa2fe0055917e039e35183b21fa90afdc6&rBranch=gh/anijain2305/435/base&rCommit=d31f2ae904ba2cf0884bf24413ba2109c3585d51))
![image](https://github.com/user-attachments/assets/32136eff-a39e-4cde-a438-e51a665bc3c9)

Inference sees a little bit more perf degradation but we are ok with that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131275
Approved by: https://github.com/ezyang
ghstack dependencies: #131744
2024-07-25 22:14:17 +00:00
513ce5f69a MTIA equivalent of torch.cuda.memory_stats (#131673)
Summary: Adding MTIA equivalent of `torch.cuda.memory_stats`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131673
Approved by: https://github.com/egienvalue
2024-07-25 21:59:59 +00:00
9039131a89 TCPStore: fix remote address (#131773)
This fixes corrupt remote address logs caused by dangling pointers to addrinfo_storage inside of addrinfo.

Test plan:

Enable debug logs and verify addresses are correct

```
TORCH_CPP_LOG_LEVEL=INFO TORCH_DISABLE_SHARE_RDZV_TCP_STORE=1 TORCH_DISTRIBUTED_DEBUG=DETAIL LOGLEVEL=INFO python test/distributed/test_store.py -v
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131773
Approved by: https://github.com/kurman
2024-07-25 21:55:25 +00:00
520182dbff [reland][inductor] switch AotCodeCompiler to new cpp_builder (#130127)
Changes:
1. Switch `AotCodeCompiler` to the new cpp_builder.
2. Keep using `deprecated_cpp_compile_command` only for `fb_code`, since I can no longer debug it without Meta-internal environment access.
3. Add `TODO` comments so that a Meta employee can continue this work.
4. Because of item 3, the only remaining use of `deprecated_cpp_compile_command` that still needs fixing is the `fb_code` path, so let's remove `validate_new_cpp_commands`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130127
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-07-25 21:45:40 +00:00
a34692c0a3 [Inductor] Added and_masks and or_masks utilities & make fully masked out rows 0 instead of nan (#131552)
Combine #131073 and #131012 and fix doc building failures.

Co-authored-by: chilli <chilli@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131552
Approved by: https://github.com/Chillee
2024-07-25 21:29:46 +00:00
89bdd9c18f [kineto] populate src/dst rank for p2p (#130812)
Summary:
as title
populate src/dst rank (global rank) for p2p kernel

Differential Revision: D59794535

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130812
Approved by: https://github.com/aaronenyeshi
2024-07-25 21:10:57 +00:00
1c58aacbc8 [dtensor] move ops to private (#131211)
as titled

Differential Revision: [D60132519](https://our.internmc.facebook.com/intern/diff/D60132519)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131211
Approved by: https://github.com/XilunWu, https://github.com/wz337
ghstack dependencies: #131212
2024-07-25 20:59:55 +00:00
605dfd8fb4 Switch sync_distributed_folder to use non-reverse order (#131683)
`git` on GHA seems to use the reverse of the commit ordering that I see locally O_o
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131683
Approved by: https://github.com/seemethere
2024-07-25 20:44:23 +00:00
fe2e6f0c51 Revert "[reland][inductor] switch AotCodeCompiler to new cpp_builder (#130127)"
This reverts commit dfc9bfc8839ea3a0ffe933a64cd129fab5e4da75.

Reverted https://github.com/pytorch/pytorch/pull/130127 on behalf of https://github.com/atalman due to Breaks CI test_dataloader.py::TestDataLoader::test_segfault [GH job link](https://github.com/pytorch/pytorch/actions/runs/10099725941/job/27930133346) [HUD commit link](2c1851f04e) ([comment](https://github.com/pytorch/pytorch/pull/130127#issuecomment-2251360224))
2024-07-25 20:44:04 +00:00
1ad4e6f228 Refactor cudagraphs to use serializable placeholder info (#130252)
This PR refactors placeholders in cudagraphs to be serializable. We define a new PlaceholderInfo object which only has the necessary parts of placeholders for logging/debugging, and use that instead of `torch.fx.Node` directly. This allows us to then save PlaceholderInfo into the FXGraphCache/AOTAutogradCache later.
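
For illustration, a minimal sketch of what such a serializable record could look like; the field names and helper below are assumptions, not the actual `PlaceholderInfo` definition.

```python
# Illustrative sketch only: the fields and helper are assumed, not the real PlaceholderInfo.
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class PlaceholderInfo:
    """Serializable subset of an fx placeholder, enough for logging/debugging."""
    name: str                   # node.name
    stack_trace: Optional[str]  # node.stack_trace, if one was recorded
    users: Tuple[str, ...]      # names of the user nodes, not the nodes themselves


def placeholder_info_from_node(node) -> PlaceholderInfo:
    # Keep only plain strings so the result can be pickled into a cache entry.
    return PlaceholderInfo(
        name=node.name,
        stack_trace=getattr(node, "stack_trace", None),
        users=tuple(user.name for user in node.users),
    )
```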

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130252
Approved by: https://github.com/eellison, https://github.com/masnesral
ghstack dependencies: #129384
2024-07-25 20:39:37 +00:00
eqy
69d63b2318 [CUDA][Pooling] Clean up unused accscalar_t in maxpool2d forward (#131728)
maxpool forward doesn't actually do any accumulation and the second template param was just a dupe of the first

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131728
Approved by: https://github.com/mikaylagawarecki
2024-07-25 20:32:42 +00:00
fdc4d6fe96 [inductor] Refactor fusion of inplace operations (#130835)
Resubmit of #128979

`WeakDep`s force readers to have completed before a mutation overwrites the
buffer, but we want to allow fusions to occur for inplace mutations where the
same index is read and written.

Currently this is achieved by:
1. Identifying the buffers used by the mutating op in its `dep_closure`
2. Not creating `WeakDep`s for buffers in the `dep_closure`
3. Fixing up any bad fusions that might occur by an extra check in `can_fuse_vertical`

So we are first over-aggressive in removing `WeakDep`s, then add an ad-hoc fixup.

This PR instead emits all `WeakDep`s and adds a `fusable_weak_dep` check to
`can_fuse_vertical` which selectively allows inplace operations to fuse.
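
A rough, self-contained sketch of the idea; the simplified `Dep` type and function shapes below are assumptions for illustration, not the scheduler's real data structures.

```python
# Conceptual sketch only: Inductor's real WeakDep/MemoryDep types carry more state.
from dataclasses import dataclass
from typing import List


@dataclass
class Dep:
    name: str    # buffer being read or written
    index: str   # symbolic indexing expression, e.g. "c0"


def fusable_weak_dep(weak_read: Dep, mutation_write: Dep) -> bool:
    # Allow fusion only when the mutation writes the same buffer at the same
    # index that is read, i.e. a genuine in-place update rather than an overwrite.
    return weak_read.name == mutation_write.name and weak_read.index == mutation_write.index


def can_fuse_vertical(weak_reads: List[Dep], mutation_writes: List[Dep]) -> bool:
    writes = {w.name: w for w in mutation_writes}
    return all(
        r.name not in writes or fusable_weak_dep(r, writes[r.name])
        for r in weak_reads
    )
```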

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130835
Approved by: https://github.com/lezcano
2024-07-25 20:29:01 +00:00
61d7bb3e79 Migrate trunk workflows to Amazon2023 ami (#131677)
A continuation of the migration started in
- https://github.com/pytorch/pytorch/pull/131250

All migrated trunk jobs passed successfully
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131677
Approved by: https://github.com/malfet
2024-07-25 20:19:16 +00:00
a6ebd56f7b Factor out cudagraph post compile into its own function (#129384)
Moves cudagraphs stuff into a post_compile function that I can later call when loading from AOTAutogradCache. On a cache hit, we only need to save any reasons for disabling cudagraphs along with some metadata needed to run cudagraphify. The arguments to cudagraphs_post_compile should be the set of parameters I'll need to reconstruct on a warm start.

No actual behavioral change should result from this: I'm moving the behavior into separate functions, but every operation should be the same pre and post PR.
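
A hedged sketch of the shape such a post-compile hook can take; the entry fields and the `cudagraphify` callback are assumptions for illustration rather than the actual signature.

```python
# Illustrative only: field names and the cudagraphify callback are assumed.
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class CudagraphCacheEntry:
    disable_reasons: List[str] = field(default_factory=list)    # why cudagraphs were skipped, if at all
    static_input_idxs: List[int] = field(default_factory=list)  # metadata needed to re-cudagraphify


def cudagraphs_post_compile(
    compiled_fn: Callable,
    entry: CudagraphCacheEntry,
    cudagraphify: Callable,
) -> Callable:
    # On a warm start only the serialized entry survives; either re-wrap the
    # compiled function with cudagraphs or return it untouched.
    if entry.disable_reasons:
        return compiled_fn  # cudagraphs were disabled at compile time
    return cudagraphify(compiled_fn, static_input_idxs=entry.static_input_idxs)
```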

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129384
Approved by: https://github.com/eellison
2024-07-25 20:15:44 +00:00
58b8704f28 [aot] Keep backward mutations in backward (#129130)
https://github.com/pytorch/pytorch/issues/127561

Mutations of inputs in backward are emitted manually, after joint_fn tracing.
With the default partitioner logic they will be moved to the "forward" graph, as they are operations on forward inputs.

To keep those mutations in backward:
- Introduce a "subgraph" node key that can be specified with a context manager. When we do a manual `copy_` in backward on a forward input, we know that this is for backward, so we set subgraph="backward".

In the partitioner:
Introduce an optional argument `subgraph` to filter out nodes with the specified subgraph tag (node_subgraph) and not add them to a subgraph if node_subgraph is different.
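
A small sketch of the tagging/filtering idea; the context manager, meta key, and helper names are illustrative assumptions, not the exact AOTAutograd code.

```python
# Illustrative sketch: the real implementation threads this through tracing and
# the partitioner; only the idea of a "subgraph" tag is shown here.
from contextlib import contextmanager

_current_subgraph = None


@contextmanager
def set_subgraph(name):
    global _current_subgraph
    prev, _current_subgraph = _current_subgraph, name
    try:
        yield
    finally:
        _current_subgraph = prev


def tag_node(node):
    # Called while emitting nodes; a copy_ emitted under set_subgraph("backward")
    # is marked so the partitioner keeps it out of the forward graph.
    if _current_subgraph is not None:
        node.meta["subgraph"] = _current_subgraph


def keep_in_partition(node, subgraph):
    # Partitioner-side filter: untagged nodes follow the default placement,
    # tagged nodes only go into their own subgraph.
    tag = node.meta.get("subgraph")
    return tag is None or tag == subgraph
```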

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129130
Approved by: https://github.com/Chillee
2024-07-25 20:02:25 +00:00
6c31e02971 Fixes the example for convert_conv3d_weight_memory_format (#131742)
Fixes #129158

Please let me know if changes are needed
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131742
Approved by: https://github.com/albanD
2024-07-25 20:01:44 +00:00
fba24252bd [dynamo][frame summary] Skip frame summary for frames from inside torch/nn/modules (#131744)
This ensures that the stack trace points to the user code.

At main (no inlining)
![image](https://github.com/user-attachments/assets/bf6f1f46-2dfe-45a2-95e1-fb733cda7e50)

With inlining but without this PR

![image](https://github.com/user-attachments/assets/fcb16c4d-dd81-4e5d-a63a-391a73683deb)

With inlining and this PR

![image](https://github.com/user-attachments/assets/69f10f65-c2ed-4179-acd5-a2824615129c)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131744
Approved by: https://github.com/ezyang
2024-07-25 19:30:03 +00:00
a1fad03fa8 [ROCm] Enable cudagraph expandable segments UTs in inductor/dynamo (#131111)
Test runtimes extracted from CI logs are as follows.

"linux-focal-rocm6.1-py3.8":
"dynamo/test_cudagraphs_expandable_segments": 3.3185000000000002,
"inductor/test_cudagraph_trees_expandable_segments": 153.233,

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131111
Approved by: https://github.com/eqy, https://github.com/jataylo, https://github.com/peterbell10
2024-07-25 19:26:04 +00:00
8c4683c978 Add device argument to the large_grid unit test (#131702)
A missing device argument makes this unit test run only on CPUs. Two unit tests were added in the previous PR https://github.com/pytorch/pytorch/pull/127448, but only one of them used `device=self.device` to make sure the tests run on the correct device.
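
A minimal illustration of the fix, assuming the usual device-generic test setup where `self.device` is provided by the test base class; shapes are made up.

```python
import torch

# Minimal illustration; `self.device` comes from the device-generic test base class,
# so the same test runs on CPU and on the accelerator under test.
def test_large_grid(self):
    x = torch.randn(16, device=self.device)   # device argument present
    y = torch.zeros(16, device=self.device)   # without device=..., this tensor would silently stay on CPU
    self.assertEqual(x.device, y.device)
```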

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131702
Approved by: https://github.com/desertfire
2024-07-25 19:19:56 +00:00
bf6aae1468 Improve torch.masked.mean and torch.masked._std_var scaling (#131293)
Fixes #131292

Using `new_ones` is expensive and unnecessary.

Before:

![21232fda-366a-47ea-a017-15a35cd51d0c](https://github.com/user-attachments/assets/779830f0-0027-4fab-a9e6-b99954c80bc5)

After:

![aad2dfcc-52c9-4046-86ab-122b044fa19c](https://github.com/user-attachments/assets/810711c5-c4f0-4b6b-91dc-9a9e714f6ee0)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131293
Approved by: https://github.com/ezyang
2024-07-25 18:52:59 +00:00
2c1851f04e [export] fix output node's meta (#131706)
Summary:
This PR fixes all the places in the strict export stack where the output node's meta is not preserved correctly. However, we're getting a new error for the test we intend to fix: `buck2 run caffe2/test/quantization:test_quantization -- -r "test_re_export_preserve_handle"`:

The `get_attr` nodes have wrong metadata. I guess more things need to be fixed to get it working, but that's beyond the scope of this PR.

Test Plan: buck2 run caffe2/test/quantization:test_quantization -- -r "test_re_export_preserve_handle"

Differential Revision: D60198221

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131706
Approved by: https://github.com/yushangdi
2024-07-25 18:44:21 +00:00
dfc9bfc883 [reland][inductor] switch AotCodeCompiler to new cpp_builder (#130127)
Changes:
1. Switch `AotCodeCompiler` to the new cpp_builder.
2. Keep using `deprecated_cpp_compile_command` only for `fb_code`, since I can no longer debug it without Meta-internal environment access.
3. Add `TODO` comments so that a Meta employee can continue this work.
4. Because of item 3, the only remaining use of `deprecated_cpp_compile_command` that still needs fixing is the `fb_code` path, so let's remove `validate_new_cpp_commands`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130127
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-07-25 18:34:08 +00:00
f3df7deab8 Revert "Add flag to ignore unsupported @triton.autotune args in user-written kernel compilation (#131431)"
This reverts commit e9db1b059733a02e1fb726d22a0489471044ad98.

Reverted https://github.com/pytorch/pytorch/pull/131431 on behalf of https://github.com/clee2000 due to broke internal tests D60211713 ([comment](https://github.com/pytorch/pytorch/pull/131431#issuecomment-2251091957))
2024-07-25 18:00:46 +00:00
2423d89d0c [dynamo] mirror training flag in OptimizedModule (#131546)
Fixes https://github.com/pytorch/pytorch/issues/122414.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131546
Approved by: https://github.com/yanboliang, https://github.com/anijain2305
2024-07-25 17:43:09 +00:00
c3679bed35 Revert "Fix py codegen to delete values that don't have any users (#131028)"
This reverts commit 91aba7baac3d2a079c0b13db25588842260c98cc.

Reverted https://github.com/pytorch/pytorch/pull/131028 on behalf of https://github.com/clee2000 due to broke inductor/test_triton_kernels inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_functionalize [GH job link](https://github.com/pytorch/pytorch/actions/runs/10094659640/job/27915271250) [HUD commit link](91aba7baac) ([comment](https://github.com/pytorch/pytorch/pull/131028#issuecomment-2251058374))
2024-07-25 17:42:18 +00:00
ec3829795d [3/3] 3D Composability - move tp dp tests (#129802)
pytorch (fsdp, tp, pp) -> pytorch (composable)
Move (fsdp, tp, pp) tests under pytorch into a composable folder

FSDP:
test/distributed/_composable/fsdp/test_fully_shard_training.py
-TestFullyShard2DTraining
**DP:
test/distributed/tensor/parallel/test_ddp_2d_parallel.py
TP:
test/distributed/tensor/parallel/test_fsdp_2d_parallel.py**
PP:
test/distributed/pipelining/test_composability.py

=>
**distributed/_composable/test_composability/test_2d_composability.py**
distributed/_composable/test_composability/test_pp_composability.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129802
Approved by: https://github.com/fduwjj
ghstack dependencies: #129800, #129801
2024-07-25 16:36:55 +00:00
29571c5c06 [2/3] 3D Composability - move pp tests (#129801)
pytorch (fsdp, tp, pp) -> pytorch (composable)
Move (fsdp, tp, pp) tests under pytorch into a composable folder

FSDP:
test/distributed/_composable/fsdp/test_fully_shard_training.py
-TestFullyShard2DTraining
DP:
test/distributed/tensor/parallel/test_ddp_2d_parallel.py
TP:
test/distributed/tensor/parallel/test_fsdp_2d_parallel.py
**PP:
test/distributed/pipelining/test_composability.py**

=>
distributed/_composable/test_composability/test_2d_composability.py
**distributed/_composable/test_composability/test_pp_composability.py**

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129801
Approved by: https://github.com/wconstab
ghstack dependencies: #129800
2024-07-25 16:36:55 +00:00
75c4176b05 [export][BE] consolidate export and export_for_training (#131496)
Summary: This PR consolidates the implementation of export and export_for_training to maximize code re-use. Also add some type annotations and comments in the code for better readability.

Test Plan: Existing tests.

Differential Revision: D60130515

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131496
Approved by: https://github.com/avikchaudhuri, https://github.com/pianpwk
2024-07-25 16:35:16 +00:00
6bc8db1d32 Rename is_training flag to have more information (#131618)
Summary: rename is_training flag into dispatch_tracing_mode = “make_fx” or “aot_export”

Test Plan: OSS CI

Differential Revision: D60154327

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131618
Approved by: https://github.com/ydwu4
2024-07-25 16:29:55 +00:00
f063027d54 [aoti] Fix constant inputs passed to aoti (#131594)
In cases where the program takes in a constant, export will specialize on the constant and embed it into the graph, leaving a placeholder node with no users. However, Inductor errors out further down because, in torch.compile, these constants typically don't show up as inputs. Since the constants are already embedded in the graph, we just ignore these inputs when compiling with AOTI and filter out the non-tensor inputs at runtime.
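
A rough sketch of the runtime-side filtering described above; the helper name is made up, and the real logic lives inside the AOTInductor runtime rather than user code.

```python
import torch

# Rough sketch: constants were already specialized into the graph at export time,
# so only real tensors need to be forwarded to the compiled artifact at run time.
def filter_aoti_runtime_inputs(args):
    return [a for a in args if isinstance(a, torch.Tensor)]


# e.g. for a program exported as ep = torch.export.export(M(), (torch.randn(2), 3.0)),
# the float 3.0 is dropped here rather than passed to the compiled model.
```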

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131594
Approved by: https://github.com/desertfire
2024-07-25 16:22:15 +00:00
ffc6bf8149 [dynamo] lazily guard and specialize on the symint when used in f-string. (#131529)
Fixes https://github.com/pytorch/pytorch/issues/103602.

This PR implements the idea of "if someone creates a string and then ends up not using it, we would prefer to NOT have specialized" mentioned in the issue above. Specifically, we create a lazy variable tracker instead of a ConstantVariable when we're in FORMAT_VALUE; when the lazy variable tracker is realized (i.e. it's going to be used), we create a ConstantVariable, and the specialization/guarding happens at the time of realization.
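
A user-level illustration of the intended behavior (assumed from the description above, not a guaranteed outcome): formatting the symbolic size into a string that is never used should no longer force specialization.

```python
import torch


@torch.compile(dynamic=True)
def fn(x):
    # The f-string is realized lazily; since `msg` is never used, building it
    # should not guard on the concrete value of x.shape[0].
    msg = f"batch size is {x.shape[0]}"
    return x * 2


fn(torch.randn(4, 3))
fn(torch.randn(8, 3))  # ideally reuses the dynamic-shape graph instead of recompiling
```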

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131529
Approved by: https://github.com/ezyang
2024-07-25 16:16:34 +00:00
96e8df6a3a [ts_converter] Support prim::max and prim::if with multiple outputs (#131593)
Summary: As title.

Test Plan: test_converter.py

Reviewed By: angelayi

Differential Revision: D60147455

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131593
Approved by: https://github.com/ydwu4
2024-07-25 16:13:31 +00:00
cyy
b07ea91c4c [2/N] Fix clang-tidy warnings in jit (#131735)
Follows #131034

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131735
Approved by: https://github.com/ezyang
2024-07-25 15:56:53 +00:00
49a8e061b6 Revert "Support IPC for Expandable Segments (#130890)"
This reverts commit 0e71a88f9b2ca6b950c76a061791559cdd8a8870.

Reverted https://github.com/pytorch/pytorch/pull/130890 on behalf of https://github.com/zdevito due to some internal tests show shutdown issues with the change to the table that holds ipc handles ([comment](https://github.com/pytorch/pytorch/pull/130890#issuecomment-2250767280))
2024-07-25 15:54:57 +00:00
cyy
a4be5cb50e Simplify some c++ code (#131612)
The simplifications were discovered by static analysis tools.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131612
Approved by: https://github.com/ezyang
2024-07-25 15:07:37 +00:00
c3d099ddd1 [BE][Easy] Add hooks to doc for Optimizer base class (#131628)
Happened to notice this was missing from the base class (but is rendering for the other optimizers like Adam etc.) when I wanted to link the state_dict hooks for https://discuss.pytorch.org/t/global-not-per-param-optimizer-state/206769

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131628
Approved by: https://github.com/janeyx99
2024-07-25 15:07:08 +00:00
745b55d14a [CI][dashboard] Add a workflow to collect aarch64 perf (#131729)
Summary: as title
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131729
Approved by: https://github.com/huydhn
2024-07-25 14:58:47 +00:00
1eedb0a962 fix torchrun log message (#131652)
fixes https://github.com/pytorch/pytorch/issues/131461

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131652
Approved by: https://github.com/awgu
2024-07-25 14:50:10 +00:00
d0e2ab617d Migrate conda, manywheel and libtorch docker builds to pytorch/pytorch (#129022)
Migration of Docker conda builds  to pytorch/pytorch from pytorch/builder: https://github.com/pytorch/builder/blob/main/.github/workflows/build-conda-images.yml

Related to: https://github.com/pytorch/builder/issues/1849

Migrates scripts and workflows, and adds logic to execute on PR and upload to ECR with the GitHub hash tag, in order to test the Docker build and nightly on PR.

Test when executing on PR, upload to ecr:
https://github.com/pytorch/pytorch/actions/runs/9799439218/job/27059691327
```
308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/conda-builder-cpu:789cf8fcd738088860056160f6e9ea7cd005972b
```

Test With-Push, upload to dockerhub:
https://github.com/pytorch/pytorch/actions/runs/9799783407/job/27060633427
```
docker.io/pytorch/conda-builder:cpu done
```
Will upload here: https://hub.docker.com/r/pytorch/conda-builder/

Test using ecr image in the nightly workflow:
https://github.com/pytorch/pytorch/actions/runs/9798428933/job/27057835235#step:16:87

Note: This is the first part, which builds the Docker images and uploads them to either Docker Hub or ECR. After merging, a follow-up PR will need to change the conda nightly workflows to use either the ECR image or the Docker Hub image, depending on whether we are running on a PR or from the main/release branch.

Cleanup of workflows and scripts from builder repo: https://github.com/pytorch/builder/pull/1923
Co-authored-by: atalman <atalman@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129022
Approved by: https://github.com/atalman, https://github.com/seemethere, https://github.com/malfet, https://github.com/chuanqi129
2024-07-25 14:36:15 +00:00
4a5a87168e [BE] typing for decorators - _prims_common/wrappers (#131567)
See #131429
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131567
Approved by: https://github.com/oulgen, https://github.com/zou3519
2024-07-25 14:35:13 +00:00
7260eaeca0 Fix vulkan builds with missing overrides errors (#131760)
Followup after https://github.com/pytorch/pytorch/pull/131524

Also, use `C10_DIAGNOSTIC_PUSH_AND_IGNORED_IF_DEFINED` macro to suppress existing warnings

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131760
Approved by: https://github.com/atalman
2024-07-25 14:29:44 +00:00
fddb1bcdea [CCA][Memory Snapshot] Move user_defined annotations to Native Caching Allocator (#130964)
Summary: Instead of embedding the user_defined TraceEntry inside of device_traces, which causes issues when some threads may not have the proper device id set, save them into an external_annotations field by using a RingBuffer<AnnotationEntry> called annotation_buffer owned by the NativeCachingAllocator.

Test Plan: CI, resnet run, and FBR model.

Differential Revision: D59703213

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130964
Approved by: https://github.com/zdevito
2024-07-25 14:06:52 +00:00
c88c90a897 [TS2E] Improve logging (#131711)
Serializing the text when it isn't needed can be costly for large outputs like ExportedProgram.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131711
Approved by: https://github.com/ydwu4
2024-07-25 13:40:10 +00:00
316c0d3e6b [inductor][cpp][gemm] support k slicing for static shapes (#130821)
This PR provides the initial support for k-slicing (i.e. parallel reduction along k-dim) of CPP GEMM template. Only static shapes are supported now. When k-slicing is enabled, there would be extra temporary buffers allocated to hold the intermediate results and an extra barrier after initial GEMM compute by each thread, i.e. each thread first stores the GEMM result to temporary accumulation buffers (pointed by `local_buf_ptrs` which is an array of pointers pointing to accumulation buffers), followed by a reduction along k-slices, epilogue computes and store to the final output `Y`. In each k-slicing thread group, the reduction along k-slices and epilogue computes are conducted in parallel along M-dim. The algorithm is designed to reduce the synchronization overhead as much as possible.

K-slicing is enabled when blocking on M and N is unable to occupy all threads. Since k-slicing doesn't always bring a benefit, an extra configuration is added to enable it (disabled by default). We need to identify a good heuristic in the future to enable k-slicing by default.
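
A small Python sketch of the k-slicing idea (per-slice partial accumulation buffers followed by a reduction); the generated kernel is C++ and runs the slices and the reduction in parallel, with a barrier in between.

```python
import torch

# Conceptual sketch of k-slicing; the real implementation is a generated C++ GEMM
# template where the k-slices run on different threads with a barrier before reduction.
def gemm_k_sliced(A, B, num_k_slices):
    M, K = A.shape
    _, N = B.shape
    bounds = torch.linspace(0, K, num_k_slices + 1).to(torch.int64)
    partials = torch.zeros(num_k_slices, M, N, dtype=A.dtype)  # temporary accumulation buffers
    for i in range(num_k_slices):  # each iteration corresponds to one k-slicing thread
        k0, k1 = int(bounds[i]), int(bounds[i + 1])
        partials[i] = A[:, k0:k1] @ B[k0:k1, :]
    return partials.sum(dim=0)  # reduction along k-slices (parallelized over M in the kernel)


A, B = torch.randn(64, 4096), torch.randn(4096, 64)
torch.testing.assert_close(gemm_k_sliced(A, B, 4), A @ B, rtol=1e-4, atol=1e-4)
```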

Performance numbers with 64x4096x64, 64x10000x64 and 64x20000x64 as examples on a 60-core SPR. As you can see, the perf of k-slicing is only better than non-k-slicing when K is large enough.

Without k-slicing
AUTOTUNE linear_unary(64x4096, 64x4096, 64)
  cpp_packed_gemm_0 0.0108 ms 100.0%
  _linear_pointwise 0.0431 ms 25.1%

AUTOTUNE linear_unary(64x10000, 64x10000, 64)
  cpp_packed_gemm_0 0.0272 ms 100.0%
  _linear_pointwise 0.0892 ms 30.5%

AUTOTUNE linear_unary(64x20000, 64x20000, 64)
  cpp_packed_gemm_0 0.0781 ms 100.0%
  _linear_pointwise 0.1693 ms 46.1%

With k-slicing:
AUTOTUNE linear_unary(64x4096, 64x4096, 64)
  cpp_packed_gemm_0 0.0260 ms 100.0%
  _linear_pointwise 0.0444 ms 58.5%

AUTOTUNE linear_unary(64x10000, 64x10000, 64)
  cpp_packed_gemm_0 0.0275 ms 100.0%
  _linear_pointwise 0.0893 ms 30.8%

AUTOTUNE linear_unary(64x20000, 64x20000, 64)
  cpp_packed_gemm_0 0.0284 ms 100.0%
  _linear_pointwise 0.1686 ms 16.8%

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130821
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
ghstack dependencies: #131024
2024-07-25 13:36:38 +00:00
d962dba0c4 Revert "[2/3] 3D Composability - move pp tests (#129801)"
This reverts commit 84cd062fb25c6da7d33b559c28afa38420e64415.

Reverted https://github.com/pytorch/pytorch/pull/129801 on behalf of https://github.com/atalman due to Broke periodic CI: distributed/_composable/test_composability/test_pp_composability.py::ComposabilityTest::test_manual_with_data_parallel_dp_type_DDP_ScheduleClass4 [GH job link](https://github.com/pytorch/pytorch/actions/runs/10083807511/job/27882848654) [HUD commit link](544f950d14) ([comment](https://github.com/pytorch/pytorch/pull/129801#issuecomment-2250326191))
2024-07-25 13:30:56 +00:00
9c4cf866c2 Adafactor forloop basic impl (#129905)
#109581

At this point, the vanilla implementation (the default) is good.
Docs: https://docs-preview.pytorch.org/pytorch/pytorch/129905/generated/torch.optim.Adafactor.html#torch.optim.Adafactor

Specifically, the impl in this PR, which attempts to replicate the paper,
```
optim = torch.optim.Adafactor([weight])
```
is close enough to https://pytorch-optimizers.readthedocs.io/en/latest/optimizer/#pytorch_optimizer.AdaFactor
```
optim_c = AdaFactor([weight], betas=(0, 0.999), scale_parameter=False)
```
is close enough to https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adafactor
```
optim = keras.optimizers.Adafactor(learning_rate=0.01)
```

The three results respectively for the same randomly generated weights:
```
# ours
tensor([[ 0.3807594, -0.3912092],
        [ 0.0762539,  0.5377805],
        [ 0.2459473,  0.4662207]])

# pytorch-optimizer
tensor([[ 0.3807592, -0.3912172],
        [ 0.0762507,  0.5377818],
        [ 0.2459457,  0.4662213]])

# keras
array([[ 0.38076326, -0.39121315],
        [ 0.0762547 ,  0.5377859 ],
        [ 0.24594972,  0.46622536]], dtype=float32)
```

This gives me confidence to move forward in speeding up the implementation now that a baseline has been established. If you're curious about differences:
* keras assigns step_size (rho_t in their code) to `min(lr, 1 / sqrt(step))` whereas the OG impl uses a hardcoded 0.01 instead of lr. We do the same thing as keras, but our lr default is 0.01.
* We differ from the pytorch-optimizers default in that our default will not track momentum (thus `beta1=0`) and we do not apply parameter scaling.

<details>

Keras collab: https://colab.research.google.com/drive/1i3xF8ChL7TWKJGV_5v_5nMhXKnYmQQ06?usp=sharing

My script repro:

```
import torch
from pytorch_optimizer import AdaFactor
torch.set_printoptions(precision=7)

weight = torch.tensor([[ 0.37697506, -0.39500135],
        [ 0.07246649,  0.53399765],
        [ 0.24216151,  0.46243715]], dtype=torch.float32)
# bias = torch.tensor([0, 0], dtype=torch.float32)

weight.grad = torch.tensor([[-0.5940447, -0.7743838],
        [-0.5940447, -0.7743838],
        [-0.5940447, -0.7743838]], dtype=torch.float32)
# bias.grad = torch.tensor([-2.5027974,  1.5422692], dtype=torch.float32)

weight_c = weight.clone()
weight_c.grad = weight.grad.clone()

optim = torch.optim.Adafactor([weight])
optim.step()
print(weight)

optim_c = AdaFactor([weight_c], betas=(0, 0.999), scale_parameter=False)
optim_c.step()
print(weight_c)
```

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129905
Approved by: https://github.com/albanD
2024-07-25 13:17:19 +00:00
e8956c9fe6 Allow cpu scalar to be moved to HPU in masked_fill_decomposition (#127871)
Extension of the condition allowing the cpu scalar to be moved to specific devices.

This fixes an HPU specific error:
`torch._dynamo.exc.BackendCompilerFailed: backend='aot_hpu_training_backend' raised:
RuntimeError: Expected `value` to be on same device as `a`While executing %masked_fill : [num_users=1] = call_method[target=masked_fill](args = (%matmul, %expand_as, %tensor), kwargs = {})`

On the HPU in eager mode the problem doesn't occur because PyTorch's implementation is not used there.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127871
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-07-25 13:04:55 +00:00
91aba7baac Fix py codegen to delete values that don't have any users (#131028)
Fixes #131025

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131028
Approved by: https://github.com/ezyang
2024-07-25 13:04:23 +00:00
2784b3f1b7 [inductor] Fix split-scan interaction with multi-kernel (#131044)
This fixes a couple errors that come up when multi-kernel is used with
split-scan.
1. The split-scan was being marked as a persistent kernel, which allowed
   a multi-kernel to be created but this isn't supported. Fix is to
   never mark split-scan as persistent.
2. Benchmark codegen was not handling WorkspaceArg, and would raise a
   KeyError during codegen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131044
Approved by: https://github.com/shunting314
2024-07-25 11:36:36 +00:00
c04f70bb30 [BE] enable UFMT for torch/ao/ (#128864)
Part of #123062

- #123062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128864
Approved by: https://github.com/ezyang
2024-07-25 11:30:14 +00:00
434f60ce33 Refactor nightly checkout tool (#131134)
Changes:

- Add `-C REPO` to `git` commands so the tool can be run from anywhere, not only inside the repo dir (see the sketch after this list)
- Use `pathlib.Path` as much as possible
- Replace `subprocess.run(..., check=True)` with `subprocess.check_{call,output}(...)`
- Add `encoding='utf-8'` when opening files
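
A tiny sketch combining these points; the helper below is illustrative, not the actual function in `tools/nightly.py`.

```python
import subprocess
from pathlib import Path

# Illustrative helper: `-C repo` lets the tool run from any working directory,
# and check_output raises on a non-zero exit status like check=True would.
def git_output(repo: Path, *args: str) -> str:
    return subprocess.check_output(["git", "-C", str(repo), *args], encoding="utf-8")

# e.g. git_output(Path.home() / "pytorch", "rev-parse", "HEAD")
```
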
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131134
Approved by: https://github.com/ezyang
2024-07-25 11:20:43 +00:00
054d214c50 [BE][tests] show local variables on failure in tests (#131151)
------

As per the title, add argument `--locals` for `unittest` and `--showlocals --tb=long` for `pytest` in CI.

Some failures cannot be reproduced on the local machine but exist on cloud CI. This change allows us to investigate the test failure more easily.

Example output: https://github.com/pytorch/pytorch/actions/runs/9961546996/job/27523888353?pr=130710#step:20:3361

```text
/opt/conda/envs/py_3.8/lib/python3.8/site-packages/sympy/core/function.py:307:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

cls = FloorDiv, base = -1.00000000000000, divisor = -1.00000000000000

    @classmethod
    def eval(cls, base, divisor):
        # python test/test_dynamic_shapes.py -k TestDimConstraints.test_dim_constraints_solve_full
        # Assert triggered by inequality solver
        # assert base.is_integer, base
        # assert divisor.is_integer, divisor

        # We don't provide the same error message as in Python because SymPy
        # makes it difficult to check the types.
        if divisor.is_zero:
            raise ZeroDivisionError("division by zero")
        if base in (int_oo, -int_oo, sympy.oo, -sympy.oo) and divisor in (
            int_oo,
            -int_oo,
            sympy.oo,
            -sympy.oo,
        ):
            return sympy.nan
        if base is sympy.nan or divisor is sympy.nan:
            return sympy.nan

        if base.is_zero:
            return sympy.S.Zero
        if base.is_integer and divisor == 1:
            return base
        if base.is_integer and divisor == -1:
            return sympy.Mul(base, -1)
        if (
            isinstance(base, sympy.Number)
            and isinstance(divisor, sympy.Number)
            and (
                base in (int_oo, -int_oo, sympy.oo, -sympy.oo)
                or divisor in (int_oo, -int_oo, sympy.oo, -sympy.oo)
            )
        ):
            r = float(base) / float(divisor)
            if r == math.inf:
                return int_oo
            elif r == -math.inf:
                return -int_oo
            elif math.isnan(r):
                return sympy.nan
            else:
                return sympy.Integer(math.floor(r))
        if isinstance(base, sympy.Integer) and isinstance(divisor, sympy.Integer):
            return sympy.Integer(int(base) // int(divisor))
        if isinstance(base, FloorDiv):
            return FloorDiv(base.args[0], base.args[1] * divisor)

        # Expands (x + y) // b into x // b + y // b.
        # This only works if floor is an identity, i.e. x / b is an integer.
        for term in sympy.Add.make_args(base):
            quotient = term / divisor
            if quotient.is_integer and isinstance(divisor, sympy.Integer):
                # NB: this is correct even if the divisor is not an integer, but it
                # creates rational expressions that cause problems with dynamic
                # shapes.
                return FloorDiv(base - term, divisor) + quotient

        try:
            gcd = sympy.gcd(base, divisor)
            if gcd != 1:
>               return FloorDiv(
                    sympy.simplify(base / gcd), sympy.simplify(divisor / gcd)
                )

base       = -1.00000000000000
cls        = FloorDiv
divisor    = -1.00000000000000
gcd        = 1.00000000000000
quotient   = 1.00000000000000
term       = -1.00000000000000

/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/utils/_sympy/functions.py:159:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

args = (FloorDiv, -1.00000000000000, -1.00000000000000), kwargs = {}

    @wraps(func)
    def wrapper(*args, **kwargs):
        try:
>           retval = cfunc(*args, **kwargs)
E           RecursionError: maximum recursion depth exceeded in comparison
E
E           To execute this test, run the following from the base repo dir:
E               python test/test_sympy_utils.py -k TestValueRanges.test_binary_ref_fn_floordiv_dtype_float
E
E           This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

args       = (FloorDiv, -1.00000000000000, -1.00000000000000)
cfunc      = <functools._lru_cache_wrapper object at 0x7fc5303173a0>
func       = <function Function.__new__ at 0x7fc530317280>
kwargs     = {}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131151
Approved by: https://github.com/ezyang
2024-07-25 10:10:58 +00:00
c4bf4005d1 [dtensor][debug] adding new noise level which allows users to only print operations with dtensors (#131592)
**Summary**
I have added a new noise level between the existing levels of 1 and 2, such that the noise level controls are now:
0. prints module-level collective counts
1. prints dTensor operations not included in trivial operations (new noise level)
2. prints operations not included in trivial operations
3. prints all operations

This gives the user more flexibility in controlling what information they want to use. The noise levels are used both for creating the console/file log and the json dump. In the example file, I have changed the module_tracing examples to noise level 0 and have changed my transformer examples to show off the new noise level.

**Test Plan**
1. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_json_dump

2. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_operation_tracing

3. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_module_tracing

4. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_module_tracing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131592
Approved by: https://github.com/XilunWu
ghstack dependencies: #131419, #130996
2024-07-25 06:54:57 +00:00
41e9f9cb7c [inductor] Fix flaky tests in test_select_algorithm.py (#131709)
Summary: Same as [#131699](https://github.com/pytorch/pytorch/pull/131699), but in `test_select_algorithm.py`.

Test Plan: Tested internally.

Differential Revision: D60202778

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131709
Approved by: https://github.com/eellison
2024-07-25 06:42:57 +00:00
3afdbecb23 [inductor] Fix flaky tests in test_debug_trace.py (#131722)
Summary:
When run internally in multiple parallel processes, the `test_debug_trace` hits the cache and skips writing all the expected outputs. Here we force-disable inductor cache to circumvent the problem. Ideally, we should switch to using a cleaner `fresh_inductor_cache` decorator approach, but it doesn't work at the moment.

Additionally, the debug trace dir is now generated by `tempfile.mkdtemp` to avoid a (rather unlikely) race condition.

Test Plan: Tested internally.

Differential Revision: D60207586

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131722
Approved by: https://github.com/eellison
2024-07-25 05:56:01 +00:00
059f9fb30b [BE][inductor] Type annotate codecache.py and config.py (#131427)
As title.

Checked/ Referred to the raw json file for runtime types . (and tried to cover all the missing annotations listed in the .json) this time.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131427
Approved by: https://github.com/eellison, https://github.com/oulgen
2024-07-25 05:54:38 +00:00
ace6decc99 Fix static py::object dangling pointer with py::gil_safe_call_once_and_store (#130341)
Fix static `py::object`s with `py::gil_safe_call_once_and_store`.

The following code will leak a `py::object` whose destructor is called when the program shuts down. The destructor will call `Py_DECREF(obj.m_ptr)`, which may cause a segmentation fault.

```c++
void func() {
    static py::object obj = py::module_::import("foo").attr("bar");

    ...
}
```

The correct code is to use raw pointers rather than the instance.

```c++
void func() {
    static py::object* obj_ptr = new py::object{py::module_::import("foo").attr("bar")};
    py::object obj = *obj_ptr;

    ...
}
```

This PR uses the `py::gil_safe_call_once_and_store` function from `pybind11`, which can run arbitrary initialization code only once under the Python GIL thread safely.

```c++
void func() {
    PYBIND11_CONSTINIT static py::gil_safe_call_once_and_store<py::object> storage;
    py::object obj = storage
                         .call_once_and_store_result(
                             []() -> py::object {
                                 return py::module_::import("foo").attr("bar");
                             }
                         )
                         .get_stored();

    ...
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130341
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-07-25 05:53:09 +00:00
59ef88ea5b [inductor] Fix flaky tests in test_pad_mm (#131699)
Summary: When run internally, some tests in `test_pad_mm.py` that require a big enough GPU to run `max_autotune=True` fail, as they're getting a smaller GPU than they need. Here we add `skipTest`s to skip these tests in those (rare) circumstances.
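
A hedged sketch of what such a guard can look like; the memory threshold and test body are made up for illustration.

```python
import unittest
import torch

# Hedged sketch only: the 32 GiB threshold and the test body are made up.
class PadMMTest(unittest.TestCase):
    def test_pad_mm_max_autotune(self):
        if not torch.cuda.is_available():
            self.skipTest("requires CUDA")
        if torch.cuda.get_device_properties(0).total_memory < 32 * 1024**3:
            self.skipTest("max_autotune=True needs a bigger GPU than this runner provides")
        ...  # the actual padded-matmul autotuning checks
```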

Differential Revision: D60192586

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131699
Approved by: https://github.com/chenyang78, https://github.com/shunting314, https://github.com/eellison
2024-07-25 05:46:45 +00:00
ee996cd63c [inductor] Fix flaky tests in test_benchmark_fusion.py (#131733)
Summary: Same as [#131699](https://github.com/pytorch/pytorch/pull/131699), but in `test_benchmark_fusion.py`.

Test Plan: Tested internally.

Differential Revision: D60211793

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131733
Approved by: https://github.com/oulgen
2024-07-25 05:39:14 +00:00
42a4df9447 Support CUDA nightly package in tools/nightly.py (#131133)
Add a new option `--cuda` to `tools/nightly.py` to pull the nightly packages with CUDA support.

```bash
# installs pytorch-nightly with cpuonly
tools/nightly.py pull

# The following only available on Linux and Windows
# installs pytorch-nightly with latest CUDA we support
tools/nightly.py pull --cuda

# installs pytorch-nightly with CUDA 12.1
tools/nightly.py pull --cuda 12.1
```

Also add targets in the `Makefile` and instructions in the contribution guidelines.

```bash
# setup conda environment with pytorch-nightly
make setup-env

# setup conda environment with pytorch-nightly with CUDA support
make setup-env-cuda
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131133
Approved by: https://github.com/ezyang
2024-07-25 05:33:52 +00:00
ceab3121de [inductor] Fix flaky tests in test_memory_planning.py (#131703)
Summary: Internally, the ABI-compatible mode is [enabled by default](eb54ca7abe/torch/_inductor/config.py (L53)). As a result, when the `abi_compatible: False` flag is not specified explicitly in tests that assume non-ABI-compatible C++ codegen, those tests fail internally. Here we fix one such test in `test_memory_planning.py`.

Test Plan: Tested internally.

Differential Revision: D60197327

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131703
Approved by: https://github.com/eellison
2024-07-25 05:09:08 +00:00
cyy
35bb0d3638 Fix unsigned type bug in CUDACachingAllocator.cpp (#131464)
curr_block->size and block_state.size are both size_t, so whenever they are not equal a split will happen. According to the comment, it's better to use '>'.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131464
Approved by: https://github.com/eqy, https://github.com/ezyang
2024-07-25 04:48:05 +00:00
5f3f14e5e4 [BE] Annotate subgraph_lowering (#131545)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131545
Approved by: https://github.com/anijain2305, https://github.com/zou3519
2024-07-25 04:35:26 +00:00
00e19ae97a [MTIA] Support module.mtia() (#131499)
Summary: Following other device backends' implementation to support module.mtia() API.

Test Plan: OSS and Internal CIs.

Differential Revision: D60076584

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131499
Approved by: https://github.com/mikaylagawarecki
2024-07-25 04:23:48 +00:00
2ce734cee9 [BE] enable UFMT for torch/ao/quantization/ (#128863)
Part of #123062

- #123062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128863
Approved by: https://github.com/ezyang
ghstack dependencies: #128861, #128862
2024-07-25 04:17:54 +00:00
a2f6eb33d0 Register buffer in static input test (#131686)
Previously, without nn module inlining, dynamo would lift all tensor attributes on an nn module to be constants on the graph. With nn module inlining, these need to be registered as buffers explicitly.
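
A minimal illustration of the difference; the module and tensor names are made up.

```python
import torch
import torch.nn as nn

# Minimal illustration; names are made up. A plain attribute like
# `self.scale = torch.ones(3)` is no longer lifted as a graph constant under
# nn-module inlining, so register it as a buffer instead.
class M(nn.Module):
    def __init__(self):
        super().__init__()
        self.register_buffer("scale", torch.ones(3))

    def forward(self, x):
        return x * self.scale


out = torch.compile(M())(torch.randn(3))
```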

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131686
Approved by: https://github.com/anijain2305
2024-07-25 03:47:56 +00:00
cyy
62704db5c3 [Distributed] [10/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d/control_plane (#131671)
Follows #130109

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131671
Approved by: https://github.com/zou3519
2024-07-25 03:46:55 +00:00
2d7c135757 Bump setuptools from 69.5.1 to 70.0.0 in /tools/build/bazel (#130893)
Bumps [setuptools](https://github.com/pypa/setuptools) from 69.5.1 to 70.0.0.
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a href="https://github.com/pypa/setuptools/blob/main/NEWS.rst">setuptools's changelog</a>.</em></p>
<blockquote>
<h1>v70.0.0</h1>
<h2>Features</h2>
<ul>
<li>Emit a warning when <code>[tools.setuptools]</code> is present in <code>pyproject.toml</code> and will be ignored. -- by :user:<code>SnoopJ</code> (<a href="https://redirect.github.com/pypa/setuptools/issues/4150">#4150</a>)</li>
<li>Improved <code>AttributeError</code> error message if <code>pkg_resources.EntryPoint.require</code> is called without extras or distribution
Gracefully &quot;do nothing&quot; when trying to activate a <code>pkg_resources.Distribution</code> with a <code>None</code> location, rather than raising a <code>TypeError</code>
-- by :user:<code>Avasam</code> (<a href="https://redirect.github.com/pypa/setuptools/issues/4262">#4262</a>)</li>
<li>Typed the dynamically defined variables from <code>pkg_resources</code> -- by :user:<code>Avasam</code> (<a href="https://redirect.github.com/pypa/setuptools/issues/4267">#4267</a>)</li>
<li>Modernized and refactored VCS handling in package_index. (<a href="https://redirect.github.com/pypa/setuptools/issues/4332">#4332</a>)</li>
</ul>
<h2>Bugfixes</h2>
<ul>
<li>In install command, use super to call the superclass methods. Avoids race conditions when monkeypatching from _distutils_system_mod occurs late. (<a href="https://redirect.github.com/pypa/setuptools/issues/4136">#4136</a>)</li>
<li>Fix finder template for lenient editable installs of implicit nested namespaces
constructed by using <code>package_dir</code> to reorganise directory structure. (<a href="https://redirect.github.com/pypa/setuptools/issues/4278">#4278</a>)</li>
<li>Fix an error with <code>UnicodeDecodeError</code> handling in <code>pkg_resources</code> when trying to read files in UTF-8 with a fallback -- by :user:<code>Avasam</code> (<a href="https://redirect.github.com/pypa/setuptools/issues/4348">#4348</a>)</li>
</ul>
<h2>Improved Documentation</h2>
<ul>
<li>Uses RST substitution to put badges in 1 line. (<a href="https://redirect.github.com/pypa/setuptools/issues/4312">#4312</a>)</li>
</ul>
<h2>Deprecations and Removals</h2>
<ul>
<li>
<p>Further adoption of UTF-8 in <code>setuptools</code>.
This change regards mostly files produced and consumed during the build process
(e.g. metadata files, script wrappers, automatically updated config files, etc..)
Although precautions were taken to minimize disruptions, some edge cases might
be subject to backwards incompatibility.</p>
<p>Support for <code>&quot;locale&quot;</code> encoding is now <strong>deprecated</strong>. (<a href="https://redirect.github.com/pypa/setuptools/issues/4309">#4309</a>)</p>
</li>
<li>
<p>Remove <code>setuptools.convert_path</code> after long deprecation period.
This function was never defined by <code>setuptools</code> itself, but rather a
side-effect of an import for internal usage. (<a href="https://redirect.github.com/pypa/setuptools/issues/4322">#4322</a>)</p>
</li>
<li>
<p>Remove fallback for customisations of <code>distutils</code>' <code>build.sub_command</code> after long
deprecated period.
Users are advised to import <code>build</code> directly from <code>setuptools.command.build</code>. (<a href="https://redirect.github.com/pypa/setuptools/issues/4322">#4322</a>)</p>
</li>
<li>
<p>Removed <code>typing_extensions</code> from vendored dependencies -- by :user:<code>Avasam</code> (<a href="https://redirect.github.com/pypa/setuptools/issues/4324">#4324</a>)</p>
</li>
<li>
<p>Remove deprecated <code>setuptools.dep_util</code>.
The provided alternative is <code>setuptools.modified</code>. (<a href="https://redirect.github.com/pypa/setuptools/issues/4360">#4360</a>)</p>
</li>
</ul>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a href="5cbf12a9b6"><code>5cbf12a</code></a> Workaround for release error in v70</li>
<li><a href="9c1bcc3417"><code>9c1bcc3</code></a> Bump version: 69.5.1 → 70.0.0</li>
<li><a href="4dc0c31644"><code>4dc0c31</code></a> Remove deprecated <code>setuptools.dep_util</code> (<a href="https://redirect.github.com/pypa/setuptools/issues/4360">#4360</a>)</li>
<li><a href="6c1ef5748d"><code>6c1ef57</code></a> Remove xfail now that test passes. Ref <a href="https://redirect.github.com/pypa/setuptools/issues/4371">#4371</a>.</li>
<li><a href="d14fa0162c"><code>d14fa01</code></a> Add all site-packages dirs when creating simulated environment for test_edita...</li>
<li><a href="6b7f7a18af"><code>6b7f7a1</code></a> Prevent <code>bin</code> folders to be taken as extern packages when vendoring (<a href="https://redirect.github.com/pypa/setuptools/issues/4370">#4370</a>)</li>
<li><a href="69141f69f8"><code>69141f6</code></a> Add doctest for vendorised bin folder</li>
<li><a href="2a53cc1200"><code>2a53cc1</code></a> Prevent 'bin' folders to be taken as extern packages</li>
<li><a href="720862807d"><code>7208628</code></a> Replace call to deprecated <code>validate_pyproject</code> command (<a href="https://redirect.github.com/pypa/setuptools/issues/4363">#4363</a>)</li>
<li><a href="96d681aa40"><code>96d681a</code></a> Remove call to deprecated validate_pyproject command</li>
<li>Additional commits viewable in <a href="https://github.com/pypa/setuptools/compare/v69.5.1...v70.0.0">compare view</a></li>
</ul>
</details>
<br />

[![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=setuptools&package-manager=pip&previous-version=69.5.1&new-version=70.0.0)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130893
Approved by: https://github.com/kit1980
2024-07-25 03:32:08 +00:00
d6115439be [MPS] Add SDPA implentation (#131362)
This work is based off @malfet's #119200

Co-authored-by: Nikita Shulga <nikita.shulga@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131362
Approved by: https://github.com/kimishpatel
2024-07-25 03:24:37 +00:00
cyy
d98d00487d [2/N] Remove unused variables (#131468)
Follows #122496

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131468
Approved by: https://github.com/eqy, https://github.com/ezyang
2024-07-25 03:08:07 +00:00
cyy
538258bc13 [1/N] Fix clang-tidy warnings in jit (#131034)
Some some tidy warnings
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131034
Approved by: https://github.com/ezyang
2024-07-25 03:03:46 +00:00
cyy
46e42ae85d [4/N] Fix Wunused-parameter warnings (#131291)
Follows #131271

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131291
Approved by: https://github.com/ezyang
2024-07-25 02:59:22 +00:00
03979a599e [BE] enable UFMT for torch/ao/pruning/ (#128862)
Part of #123062

- #123062
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128862
Approved by: https://github.com/ezyang
ghstack dependencies: #128861
2024-07-25 02:49:35 +00:00
973a1362b9 [BE] enable UFMT for torch/ao/nn/ (#128861)
Part of #123062

- #123062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128861
Approved by: https://github.com/ezyang
2024-07-25 02:49:19 +00:00
c047bddbca [easy][dynamo] Update test for inline_inbuilt_n_modules (#131718)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131718
Approved by: https://github.com/williamwen42, https://github.com/mlazos
ghstack dependencies: #131694
2024-07-25 02:49:16 +00:00
01bc2a8165 [inline-inbuilt-nn-modules] Skip mobilenet_v2 test for cpu inductor (#131694)
Related issue https://github.com/pytorch/pytorch/issues/131693

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131694
Approved by: https://github.com/eellison
2024-07-25 02:49:16 +00:00
b5c006acac [BE][Easy] enable UFMT for torch/nn/ (#128865)
Part of #123062

- #123062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128865
Approved by: https://github.com/ezyang
2024-07-25 02:48:42 +00:00
cyy
8ea4c72eb2 [1/N] Fix clang-tidy warnings in aten/src/ATen/native/*.{cpp,h} (#130798)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130798
Approved by: https://github.com/ezyang
2024-07-25 02:36:43 +00:00
ab609d6aa6 [ts_convert] Update conversion for aten.tensor (#131549)
Fixes aten::tensor issues in edgeml models
P1492137675
| suite   |   #models |   #has_ts_model |   #has_sample_inputs |   #ts_can_run |   #can_convert |   #ep_result_correct |   #can_package |   #sigmoid_can_run |   #sigmoid_result_correct |
|---------|-----------|-----------------|----------------------|---------------|----------------|----------------------|----------------|--------------------|---------------------------|
| EDGEML  |        34 |              25 |                   23 |            21 |              2 |                    2 |              2 |                  2 |                         2 |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131549
Approved by: https://github.com/jiashenC, https://github.com/SherlockNoMad
2024-07-25 01:11:03 +00:00
e20fb5e975 [PTD][c10d] Include PG status into flight recorder (#131268)
We are considering consolidating the data sources for logging and the flight recorder so that we don't build multiple paths for debugging information. Before we do any merging, we want to first ensure that the PG status is also included in the flight recorder. We can also leverage this information to validate our FR dump, because the dump is not synced and we might see some variance in it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131268
Approved by: https://github.com/shuqiangzhang
2024-07-25 01:01:00 +00:00
c3fe9075a9 [ROCM] Use hipblaslt version from hipblaslt runtime instead of header for tunableops validator (#131078)
Summary:
When tunable ops loads selected kernels from a csv file, it validates the hipblaslt version defined in hipblaslt-version.h.

This PR changes the validator to fetch the hipblaslt version and revision from the hipblaslt runtime instead of the header file, as in our environment we might roll out a new version of the runtime prior to updating the header file fleet-wide.

Test Plan:
Verified generated tunableops kernel selection has the correct hipblaslt version from runtime:

```
Validator,PT_VERSION,2.5.0
Validator,ROCBLAS_VERSION,4.0.0-72e57364-dirty
Validator,HIPBLASLT_VERSION,800-bf2c3184
Validator,ROCM_VERSION,6.0.0.0-12969-1544e39
Validator,GCN_ARCH_NAME,gfx942:sramecc+:xnack-
GemmTunableOp_BFloat16_TN,tn_8192_2_3584,Gemm_Hipblaslt_TN_572,0.0240676
GemmTunableOp_BFloat16_TN,tn_7168_2_8192,Gemm_Hipblaslt_TN_482,0.0359019
GemmTunableOp_BFloat16_TN,tn_8192_2_1024,Default,0.0173723
GemmTunableOp_BFloat16_TN,tn_1280_2_8192,Gemm_Hipblaslt_TN_491,0.0191047
```

Differential Revision: D59889043

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131078
Approved by: https://github.com/jeffdaily, https://github.com/xw285cornell
2024-07-25 00:54:07 +00:00
cyy
803c5b8640 [CMake] Fix private compile options for CUDA code (#130546)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130546
Approved by: https://github.com/ezyang
2024-07-25 00:22:18 +00:00
7a42470bcb Annotate all InstructionTranslator (#131509)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131509
Approved by: https://github.com/zou3519
2024-07-24 23:45:53 +00:00
7535b23a25 [export] Fix set_grad hoo if output is empty (#131511)
Fixes https://fb.workplace.com/groups/1075192433118967/permalink/1467707973867409/

Test Plan: CI

Differential Revision: D60135531

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131511
Approved by: https://github.com/ydwu4
2024-07-24 23:17:20 +00:00
29c9f8c782 [export] Fix graph_break log registration error when importing export/_trace.py (#131523)
Summary:
When importing `_trace.py`, putting `torch._dynamo.exc.Unsupported` in the global variable ``_ALLOW_LIST`` can cause the import of ``export/_trace.py`` to fail with the error:

ValueError: Artifact name: 'graph_breaks' not registered, please call register_artifact('graph_breaks') in torch._logging.registrations.

The error is raised directly on the line `graph_breaks_log = torch._logging.getArtifactLogger(__name__, "graph_breaks")` in `_dynamo/exc.py`. I've checked that ``register_artifact('graph_breaks')`` does already exist in torch._logging.registrations.

Explicitly calling `import torch._logging` doesn't fix the issue.

(see T196719676)

We move ``_ALLOW_LIST`` to be a local variable.

Test Plan:
buck2 test 'fbcode//mode/opt' fbcode//aiplatform/modelstore/publish/utils/tests:fc_transform_utils_test -- --exact 'aiplatform/modelstore/publish/utils/tests:fc_transform_utils_test - test_serialized_model_for_disagg_acc (aiplatform.modelstore.publish.utils.tests.fc_transform_utils_test.PrepareSerializedModelTest)'

buck2 test 'fbcode//mode/opt' fbcode//aiplatform/modelstore/publish/utils/tests:fc_transform_utils_test -- --exact 'aiplatform/modelstore/publish/utils/tests:fc_transform_utils_test - test_serialized_test_dsnn_module (aiplatform.modelstore.publish.utils.tests.fc_transform_utils_test.PrepareSerializedModelTest)'

Differential Revision: D60136706

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131523
Approved by: https://github.com/zhxchen17
2024-07-24 22:40:24 +00:00
236e06f9f9 Revert "Ensure staticmethods can be allowed in graph (#130882)"
This reverts commit 93fdd0237dcfe8cb4c65f3596aef123417b760a1.

Reverted https://github.com/pytorch/pytorch/pull/130882 on behalf of https://github.com/clee2000 due to torchrec test still broken internally D59945836 ([comment](https://github.com/pytorch/pytorch/pull/130882#issuecomment-2249003059))
2024-07-24 22:32:41 +00:00
5db5865614 Revert "Annotate all InstructionTranslator (#131509)"
This reverts commit eafbd20f23746aa6b9090d989a4ccb059f45297e.

Reverted https://github.com/pytorch/pytorch/pull/131509 on behalf of https://github.com/clee2000 due to sorry need to revert this to revert something else, I think you only need to rebase and remerge ([comment](https://github.com/pytorch/pytorch/pull/131509#issuecomment-2249000843))
2024-07-24 22:29:49 +00:00
a7e20ef7e4 [BE] Get rid of missing destructor override warning (#131204)
Regression introduced by https://github.com/pytorch/pytorch/pull/126376

Before this change, compiling torch_cpu on my MacBook prints tons of warnings every time HooksInterface is included
```
In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/src/optim/adamw.cpp:1:
In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/include/torch/optim/adamw.h:3:
In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/include/torch/nn/module.h:3:
In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/include/torch/nn/modules/container/any_module_holder.h:3:
In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/include/torch/nn/modules/container/any_value.h:3:
In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/include/torch/detail/static.h:4:
In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/include/torch/types.h:3:
In file included from /Users/nshulga/git/pytorch/pytorch/aten/src/ATen/ATen.h:7:
In file included from /Users/nshulga/git/pytorch/pytorch/aten/src/ATen/Context.h:13:
/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/detail/HIPHooksInterface.h:27:11: warning: '~HIPHooksInterface' overrides a destructor but is not marked 'override' [-Winconsistent-missing-destructor-override]
  virtual ~HIPHooksInterface() = default;
          ^
/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/detail/AcceleratorHooksInterface.h:16:11: note: overridden virtual function is here
  virtual ~AcceleratorHooksInterface() = default;
          ^
1 warning generated.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131204
Approved by: https://github.com/albanD, https://github.com/seemethere
2024-07-24 22:29:31 +00:00
b56939dae1 Annotate more InstructionTranslator (#131680)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131680
Approved by: https://github.com/zou3519
ghstack dependencies: #131676
2024-07-24 22:14:29 +00:00
f9322c26b2 Remove _export/exported_program.py (#131597)
Summary:
We removed references to _export/exported_program.py in executorch
in D60052318. Now we can remove this file.

Update the pin to executorch.

Test Plan: contbuild & OSS CI:

Differential Revision: D60072980

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131597
Approved by: https://github.com/avikchaudhuri
2024-07-24 22:04:17 +00:00
eb54ca7abe Revert "[BE] Get rid of missing destructor override warning (#131204)"
This reverts commit 8a890b72dc3e4dcd501060c2a2fee139c235a8b8.

Reverted https://github.com/pytorch/pytorch/pull/131204 on behalf of https://github.com/atalman due to sorry @malfet need to revert to make CI green, lets reland with ciflow/periodic label on ([comment](https://github.com/pytorch/pytorch/pull/131204#issuecomment-2248898033))
2024-07-24 21:08:49 +00:00
544f950d14 [BE] Improve error message when there are internal changes (#131547)
Fixes https://github.com/pytorch/test-infra/issues/4988
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131547
Approved by: https://github.com/xuzhao9, https://github.com/malfet, https://github.com/atalman
2024-07-24 20:38:08 +00:00
7f61324268 Add sparse block to flex_decoding kernel (#130884)
fix typo

Finish flex_decoding block sparse

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130884
Approved by: https://github.com/drisspg
2024-07-24 20:30:25 +00:00
b90aa18569 [aoti] Add initial custom op support (#127034)
Re-land of https://github.com/pytorch/pytorch/pull/125242

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127034
Approved by: https://github.com/malfet
2024-07-24 20:29:55 +00:00
44fdf24967 [BE] typing for decorators - jit/_decompositions (#131566)
See #131429
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131566
Approved by: https://github.com/oulgen, https://github.com/zou3519
2024-07-24 20:28:28 +00:00
2b83e4f8d7 [ROCm] Enable flex decoding unit tests (#131048)
Flex decoding tests are passing with upstream pytorch on MI300X/MI2XX.
Only flex attention unit tests have issues.

[result_mi250.log](https://github.com/user-attachments/files/16286954/result_mi250.log)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131048
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony, https://github.com/malfet
2024-07-24 20:25:34 +00:00
84cd062fb2 [2/3] 3D Composability - move pp tests (#129801)
pytorch (fsdp, tp, pp) -> pytorch (composable)
Move (fsdp, tp, pp) tests under pytorch into a composable folder

FSDP:
test/distributed/_composable/fsdp/test_fully_shard_training.py
-TestFullyShard2DTraining
DP:
test/distributed/tensor/parallel/test_ddp_2d_parallel.py
TP:
test/distributed/tensor/parallel/test_fsdp_2d_parallel.py
**PP:
test/distributed/pipelining/test_composability.py**

=>
distributed/_composable/test_composability/test_2d_composability.py
**distributed/_composable/test_composability/test_pp_composability.py**

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129801
Approved by: https://github.com/wconstab
ghstack dependencies: #129800
2024-07-24 20:17:54 +00:00
a9e6356271 [ONNX] Update torch.onnx.export API (#131501)
- Add a `kwargs` option; add the `dynamic_shapes` option so users can supply it directly to `torch.export`.
- Make the options keyword-only arguments (bc-breaking)
- Deprecate the `training` and `operator_export_type` options and include a warning message. The exact time for removal is TBD but the message should discourage users from using the options.
- Deprecate two functions `exportable_ops` and pretty print export

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131501
Approved by: https://github.com/titaiwangms
2024-07-24 20:03:17 +00:00
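As a rough illustration of the reworked call shape (a sketch only: the `kwargs` and `dynamic_shapes` option names come from this commit's description, while the model, output file name, and `torch.export.Dim` usage are assumptions for the example):

```python
import torch

class Scaler(torch.nn.Module):
    def forward(self, x, *, scale=1.0):
        return x * scale

model = Scaler()
x = torch.randn(2, 3)

# Options are keyword-only after this change.
torch.onnx.export(
    model,
    (x,),                                   # positional example inputs
    "scaler.onnx",                          # output path (illustrative)
    kwargs={"scale": 2.0},                  # new: keyword inputs for the model
    dynamic_shapes={"x": {0: torch.export.Dim("batch")}},  # forwarded to torch.export
)
```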
9db567f17d [ONNX] Set dump_exported_program to True in bench (#131670)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131670
Approved by: https://github.com/titaiwangms
2024-07-24 20:02:03 +00:00
85fa66be04 Add rerun_disabled_tests for inductor (#131681)
Test in prod?

This also turns on the mem leak check

Briefly checked that
```
 python3 ".github/scripts/filter_test_configs.py" \
    --workflow "inductor" \
    --job-name "cuda12.1-py3.10-gcc9-sm86 / build" \
    --test-matrix "{ include: [
    { config: "inductor", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "inductor", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "inductor_distributed", shard: 1, num_shards: 1, runner: "linux.g5.12xlarge.nvidia.gpu" },
    { config: "inductor_huggingface", shard: 1, num_shards: 1, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "inductor_timm", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "inductor_timm", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "inductor_torchbench", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "inductor_torchbench", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "dynamic_inductor_huggingface", shard: 1, num_shards: 1, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "dynamic_inductor_timm", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "dynamic_inductor_timm", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "dynamic_inductor_torchbench", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "dynamic_inductor_torchbench", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "aot_inductor_huggingface", shard: 1, num_shards: 1, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "aot_inductor_timm", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "aot_inductor_timm", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "aot_inductor_torchbench", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "aot_inductor_torchbench", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "inductor_cpp_wrapper_abi_compatible", shard: 1, num_shards: 1, runner: "linux.g5.4xlarge.nvidia.gpu" },
  ]}
  " \
    --selected-test-configs "" \
    --pr-number "${PR_NUMBER}" \
    --tag "${TAG}" \
    --event-name "schedule" \
    --schedule "29 8 * * *" \
    --branch "${HEAD_BRANCH}"
```
has rerun disabled tests option in the test matrix

I don't think all these things need to run but I'm not sure which ones (probably just inductor?)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131681
Approved by: https://github.com/zou3519
2024-07-24 19:56:00 +00:00
65ce2bf465 Allow setting PYTHON_LIB_REL_PATH via environment variable (#128419)
This allows builds to customize the location where caffe2's Python modules are installed to.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128419
Approved by: https://github.com/PaliC, https://github.com/d4l3k, https://github.com/malfet
2024-07-24 19:49:06 +00:00
074b46b7d9 [1/3] 3D Composability - move fsdp tests (#129800)
pytorch (fsdp, tp, pp) -> pytorch (composable)
Move (fsdp, tp, pp) tests under pytorch into a composable folder

**FSDP:
test/distributed/_composable/fsdp/test_fully_shard_training.py
-TestFullyShard2DTraining**
DP:
test/distributed/tensor/parallel/test_ddp_2d_parallel.py
TP:
test/distributed/tensor/parallel/test_fsdp_2d_parallel.py
PP:
test/distributed/pipelining/test_composability.py

=>
**distributed/_composable/test_composability/test_2d_composability.py**
distributed/_composable/test_composability/test_pp_composability.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129800
Approved by: https://github.com/awgu
2024-07-24 19:47:34 +00:00
e0f1bf14a4 Fully type torch/utils/_config_module.py (#131676)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131676
Approved by: https://github.com/zou3519
2024-07-24 19:36:09 +00:00
05681b6838 Migrate missed experimental jobs to Amazon2023 AMI (#131485)
Adding in a few jobs that got missed in https://github.com/pytorch/pytorch/pull/131250

Those jobs have passed with the new AMI:
https://github.com/pytorch/pytorch/actions/runs/10063808680/job/27820050195?pr=131485
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131485
Approved by: https://github.com/atalman, https://github.com/malfet
2024-07-24 19:33:02 +00:00
05064f2827 [CI] Move all ROCm jobs to periodic frequency (#131637)
`inductor` and `rocm` workflows are the major contributors to the CI load on ROCm CI at the moment, resulting in huge backlogs: https://github.com/pytorch/pytorch/pull/131489#issue-2425804464

* Move rocm.yml to cron frequency
* Move ROCm CI jobs from inductor.yml to inductor-rocm.yml
* Introduce `ciflow/inductor-rocm` as PR label to manually invoke inductor jobs for ROCm (no automatic invoking to limit CI load)
* After this PR, only `trunk` workflow jobs for ROCm will run on every commit and PR merge, but since they take about 45 min x 3 on average, I decided to leave them as-is, since they provide some basic insulation against ROCm breakage.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131637
Approved by: https://github.com/clee2000, https://github.com/atalman, https://github.com/huydhn
2024-07-24 19:26:58 +00:00
8aff6caf67 [CI][dashboard] Rename cpu-x86 to cpu_x86 (#131658)
Summary: '-' is used as a special separator by upload_dynamo_perf_stats.py, so switch to '_' instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131658
Approved by: https://github.com/huydhn
2024-07-24 19:16:52 +00:00
3ce6f61416 [AOTI] Support fallback ops not in inductor_fallback_ops (#131247)
Summary: For aten ops that are not listed in inductor_fallback_ops, AOTI will use the proxy executor to execute them instead of erroring out due to a missing C shim implementation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131247
Approved by: https://github.com/angelayi
2024-07-24 19:16:43 +00:00
aeca9845a6 Migrate Lint jobs to Amazon 2023 AMI (#131514)
Continuing in the same vein as https://github.com/pytorch/pytorch/pull/131250, migrate all self-hosted lint.yml jobs to use the new Amazon 2023 AMI
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131514
Approved by: https://github.com/clee2000, https://github.com/malfet, https://github.com/huydhn
2024-07-24 19:11:02 +00:00
b98b3127f7 [easy][pytorch][counters] Move WaitCounter in c10/util (#131021)
Summary: Since the WaitCounter frontend itself has minimal dependencies, it's fine to move it into c10. Specific backends can be registered/linked separately.

Test Plan: unit test

Reviewed By: jamesperng, asiab4, c-p-i-o

Differential Revision: D59842868

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131021
Approved by: https://github.com/asiab4
2024-07-24 18:38:33 +00:00
7718024d2b [3.13] support 3.13 multiline traces in munge_exc (#131207)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131207
Approved by: https://github.com/jansel, https://github.com/anijain2305
ghstack dependencies: #131206
2024-07-24 18:22:30 +00:00
f0378912a0 [3.13, dynamo] fix test/dynamo/test_bytecode_utils.py (#131206)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131206
Approved by: https://github.com/jansel, https://github.com/anijain2305
2024-07-24 18:22:30 +00:00
a86909d251 [inductor] Type annotate constant_folding.py (#131364)
Summary: Type annotate constant_folding.py

Test Plan: mypy

Reviewed By: angelayi

Differential Revision: D60063872

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131364
Approved by: https://github.com/angelayi
2024-07-24 18:20:06 +00:00
8fe5b93667 support zb1p and zb2p algorithms (#130752)
Previously, we had proved that ZB2P is not truly zero bubble when num_local_stages exceeds 4, so only ZB1P was supported.

We made a few tweaks to ZB2P to make it truly zero bubble. The algorithm and proof are attached.
[zero_bubble.pdf](https://github.com/user-attachments/files/16238738/zero_bubble.pdf)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130752
Approved by: https://github.com/H-Huang
2024-07-24 17:58:46 +00:00
5e6cfb7db5 Add an extra shard for distributed periodic jobs (#131498)
Fixes issue of timeouts being observed in ROCm periodic workflow for distributed runs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131498
Approved by: https://github.com/jithunnair-amd, https://github.com/pruthvistony, https://github.com/clee2000
2024-07-24 16:44:53 +00:00
106c6a49f5 [dynamo] limit number of compiles per frame (#130891)
Fixes https://github.com/pytorch/pytorch/issues/130776

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130891
Approved by: https://github.com/anijain2305
2024-07-24 16:43:40 +00:00
abcd329359 [BE] typing for decorators - onnx/symbolic_helper (#131565)
See #131429
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131565
Approved by: https://github.com/justinchuby, https://github.com/oulgen, https://github.com/zou3519, https://github.com/titaiwangms
2024-07-24 16:39:47 +00:00
0e71a88f9b Support IPC for Expandable Segments (#130890)
This reapplication commit is the same as before except it resolves a build error in an internal build where `handle` was shadowed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130890
Approved by: https://github.com/dsjohns2
2024-07-24 15:45:40 +00:00
eb5883f8aa Add new runner labels to actionlint (#131525)
Adding the labels corresponding to the Amazon2023 ami
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131525
Approved by: https://github.com/atalman
2024-07-24 15:28:59 +00:00
72d17d95d7 [inductor] Enable dynamo for Windows. RC1 (#131286)
Changes:
1. Enable Windows in `check_if_inductor_supported`.
2. Disable Windows in `AotCodeCompiler`.
3. Force Windows inductor to `c++20` to support `std::enable_if_t`.
4. Disable the `test_x86inductor_quantizer` UT on `Windows` temporarily; some issues still need to be fixed: https://github.com/pytorch/pytorch/pull/131308

Based on this PR, I have run the first model, `resnet18`, on Windows inductor successfully.
<img width="1036" alt="image" src="https://github.com/user-attachments/assets/2642bda1-1845-417a-aaba-39bdf22e65d6">

TODO:
1. Upgrade pytorch Windows build to `c++20`.
2. Fix and re-enable `test_x86inductor_quantizer` UT on `Windows`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131286
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-07-24 15:26:55 +00:00
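For reference, a minimal sketch of the kind of smoke test described above (torchvision's resnet18 is an assumption; any small model exercises the same inductor path):

```python
import torch
import torchvision

model = torchvision.models.resnet18().eval()
compiled = torch.compile(model, backend="inductor")

with torch.no_grad():
    out = compiled(torch.randn(1, 3, 224, 224))
print(out.shape)  # torch.Size([1, 1000])
```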
4c7f22dee2 [BE] remove unnecessary _dispatch_sqrt by using ** 0.5 (#131358)
Based on the discussion here where ** 0.5 is not slower than math.sqrt. https://github.com/pytorch/pytorch/pull/129905#discussion_r1675605075

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131358
Approved by: https://github.com/albanD
2024-07-24 14:58:57 +00:00
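A quick sketch of the equivalence being relied on here:

```python
import math
import torch

x = 2.0
assert abs(x ** 0.5 - math.sqrt(x)) < 1e-12  # same value for scalars

# Unlike math.sqrt, ** 0.5 also works elementwise on tensors.
t = torch.tensor([1.0, 4.0, 9.0])
print(t ** 0.5)  # tensor([1., 2., 3.])
```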
98984422eb [triton_op] fix autotuning (#131363)
The problem was we were shoving SymInts into the constant_args side
table. The root problem is that torch.fx.node.base_types, which we use
to determine what can be put in the graph, doesn't actually have SymInt
in it. This PR fixes base_types to include SymInt.

Test Plan:
- tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131363
Approved by: https://github.com/oulgen, https://github.com/justinchuby
2024-07-24 14:03:37 +00:00
bc938184de [FSDP2] Added set_reduce_scatter_divide_factor (#129286)
This PR adds an API `FSDPModule.set_reduce_scatter_divide_factor` to allow setting a custom gradient divide factor for reduce-scatter. This can be useful when using parallelisms in combination with FSDP (e.g. expert parallelism), where gradients need to be divided by a custom factor (e.g. an extra `EP` factor).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129286
Approved by: https://github.com/weifengpy
2024-07-24 12:42:35 +00:00
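A hedged sketch of how the new method might be called (the method name comes from this PR; the `fully_shard` setup, import path, and the `EP` factor are assumptions for the example, and a real run needs torchrun plus an initialized process group):

```python
import torch
from torch.distributed._composable.fsdp import fully_shard

model = torch.nn.Linear(16, 16)
fully_shard(model)  # model now exposes the FSDPModule API

# With expert parallelism of degree EP, reduce-scattered gradients may need
# an extra divide factor beyond the default world-size division.
EP = 4
model.set_reduce_scatter_divide_factor(EP)
```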
8ffd109a00 Revert "Fix py codegen to delete values that don't have any users (#131028)"
This reverts commit 466c167b71e6021f8eadcfbae1d9156a375663ce.

Reverted https://github.com/pytorch/pytorch/pull/131028 on behalf of https://github.com/atalman due to breaks CI ([comment](https://github.com/pytorch/pytorch/pull/131028#issuecomment-2247771530))
2024-07-24 12:21:43 +00:00
451462dbff [1/N] Add missing constructors or assignment operators (#131077)
Just mark them as deleted in most cases.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131077
Approved by: https://github.com/ezyang
2024-07-24 12:09:39 +00:00
0c6f1ca064 Introduce torch._dynamo.config.enable_compiler_collectives for syncing compilation across ranks (#130935)
This PR implements an opt-in configuration option for synchronizing compilation across all ranks at the end of Dynamo tracing (and potentially, other places in the future). There are two pieces to this PR:

1. Implementing infrastructure for compiler collectives (DistributedState/LocalState, the actual collective)
2. Using this infrastructure to synchronize automatic dynamic choices across all ranks

The infrastructure in part one can be used for other purposes, just add more (serializable) fields to LocalState.

Here is how automatic dynamic synchronization works:

1. Preflight in "torch/_dynamo/variables/builder.py": On the first Dynamo trace run, we trace without automatic dynamic at all; we assume all Tensor inputs that are not otherwise marked are static. This run is purely to collect all Tensor input sizes in the program.
2. torch/_dynamo/output_graph.py: At the end of the first Dynamo trace run, we perform a compiler collective to distribute all Tensor input sizes to all ranks. Then, we restart Dynamo
3. Apply the updates in "torch/_dynamo/variables/builder.py": Now that we have all sizes for every rank, we now update frame state with the observed sizes for all ranks, in rank order. Under the assumption that frame state is consistent on all ranks, this series of updates will preserve consistency.

For future work, it would be safer if we force a consistent hint on all ranks; this is more involved as we have to interpose in fakification.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130935
Approved by: https://github.com/jansel
2024-07-24 11:24:11 +00:00
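A minimal sketch of opting in (the config name is from this PR's title; the toy function and the multi-rank launch via torchrun are assumed and not shown):

```python
import torch

# Opt in: at the end of Dynamo tracing, ranks exchange observed tensor input
# sizes so that automatic-dynamic decisions stay consistent across ranks.
torch._dynamo.config.enable_compiler_collectives = True

@torch.compile
def step(x):
    return x * 2

out = step(torch.randn(8))
```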
85d3ee1d67 [micro_pipeline_tp] refactor all-gather and reduce-scatter pattern matchers to be more flexible and testable (#131409)
High level goals:
- Cover the all-gather and reduce-scatter pattern matchers with unit tests
- Make it easier to exclude certain collectives as async-tp candidates
- Make it easier to match other all-gather and reduce-scatter variants (e.g. fp8 collectives)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131409
Approved by: https://github.com/weifengpy
2024-07-24 11:16:27 +00:00
89d5391bbf [inductor] Kill mark_node_as_mutating (#130834)
Resubmit of #129346

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130834
Approved by: https://github.com/lezcano
ghstack dependencies: #130832, #130833
2024-07-24 11:11:19 +00:00
6415c45da5 [inductor] Use multiple outputs for flex-attention (#130833)
Resubmit of #129344

This fixes the DCE issue for attention output

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130833
Approved by: https://github.com/lezcano
ghstack dependencies: #130832
2024-07-24 11:11:19 +00:00
95c248751b [inductor] Make UserDefinedTritonKernel a multi-output operation (#130832)
Resubmit of #129325

Previously each mutation was represented by a `MutationOutput` operation which
was a new scheduler node that must be scheduled immediately afterwards.

Now we have a single scheduler node, which produces multiple `MutationOutput`
buffers as its output.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130832
Approved by: https://github.com/lezcano
2024-07-24 11:11:14 +00:00
a4c3f29047 [ONNX][BE] Remove ruff skips in torch/onnx (#131368)
Remove all ruff skips for torch/onnx since we do not do runtime type checking anymore.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131368
Approved by: https://github.com/titaiwangms, https://github.com/Skylion007
2024-07-24 10:56:43 +00:00
62e566b345 [BE] Remove suppression of inconsistent missing overrides (#131524)
This should prevent regressions like the ones fixed by https://github.com/pytorch/pytorch/pull/131204

- Remove global `-Wno-error=inconsistent-missing-override`
- Wrap offending includes (protobuf and asmjit) with `C10_DIAGNOSTIC_PUSH_AND_IGNORE` and `C10_DIAGNOSTIC_POP_AND_IGNORED`
- Add `override` keyword to `at::namespace::tunable::StreamTimer` and `LLVMCodeGenImpl`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131524
Approved by: https://github.com/atalman
2024-07-24 10:07:36 +00:00
83d19620f6 kill tmp _is_executorch flag (#131488)
Test Plan: existing tests

Differential Revision: D60126186

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131488
Approved by: https://github.com/ydwu4
2024-07-24 08:51:37 +00:00
1e34870796 [CI][dashboard][reland] Collect PT2 cpu perf nightly (#131560)
Summary: Add a workflow similar to inductor-perf-test-nightly.yml but use x86 metal instances for perf measurement. The data processing and dashboard update will come next.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131560
Approved by: https://github.com/huydhn
2024-07-24 08:50:33 +00:00
276b5238ef [bug] Add is_compiling check for optimizers to avoid untracked tensor during graph tracing (#130909)
Hey folks, I was using the `stateless_func` [here](7c45476d38/torch/distributed/_spmd/api.py (L435)), which worked well before [this commit](https://github.com/pytorch/pytorch/pull/111084); that commit introduced a `_tensor_constant0` and made this func non-stateless. Since there is no way to retrieve this constant tensor before compilation, and performance is not an issue when tracing a graph, I think it might be good to fall back to the other branch.
![image](https://github.com/user-attachments/assets/6ee4487d-456b-47e0-8c1d-66cb5a641d47)

![image](https://github.com/user-attachments/assets/1ed46502-e50e-45c4-9751-49aa5a4590ae)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130909
Approved by: https://github.com/mlazos
2024-07-24 08:29:27 +00:00
cyy
41189b0da4 Simplify THPEvent_get_device (#131466)
Because self->event.device() always returns Device.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131466
Approved by: https://github.com/albanD
2024-07-24 08:24:01 +00:00
e782918b8e [NestedTensor] Add example NestedTensor objects with inner dimension of size 1 to tests reducing along jagged dimension for NestedTensor (#131516)
Add example `NestedTensor`s with inner dimension of size `1` to `_get_example_tensor_lists` with `include_inner_dim_size_1=True`. This diff creates `NestedTensor`s of sizes `(B, *, 1)` and `(B, *, 5, 1)`, ensuring that the current implementations of jagged reductions for `sum` and `mean` hold for tensors of effective shape `(B, *)` and `(B, *, 5)`.

Differential Revision: [D59846023](https://our.internmc.facebook.com/intern/diff/D59846023/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131516
Approved by: https://github.com/davidberard98
2024-07-24 07:01:39 +00:00
e9db1b0597 Add flag to ignore unsupported @triton.autotune args in user-written kernel compilation (#131431)
Summary: We currently don't support some of the `@triton.autotune` arguments when compiling user-written Triton kernels with PT2. In this PR, we're adding a flag to circumvent that. This is to unblock internal compilation in some cases. The flag's docs mention why it is not a good idea to set it.

Test Plan:

```
python test/inductor/test_triton_kernels.py -k test_triton_kernel_autotune_with_unsupported_args
...
----------------------------------------------------------------------
Ran 3 tests in 3.636s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131431
Approved by: https://github.com/oulgen, https://github.com/zou3519
2024-07-24 05:37:09 +00:00
eafbd20f23 Annotate all InstructionTranslator (#131509)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131509
Approved by: https://github.com/zou3519
2024-07-24 05:31:01 +00:00
5772c13f56 Dont wrap negative indexing in scatter reduce (#131503)
Fix for https://github.com/pytorch/pytorch/issues/131321

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131503
Approved by: https://github.com/shunting314
2024-07-24 04:01:32 +00:00
9f96d4b61b Disable inlining on cudagraph fallback tests (#131557)
The cudagraph fallback tests should only run without nn module inlining. The [rerecord limit](fc3d2b26cd/torch/_inductor/cudagraph_trees.py (L1922)) is ignored if nn module inlining is disabled. Arguably it should just be higher, but this PR addresses the failures and allows inlining to be on by default on main.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131557
Approved by: https://github.com/anijain2305
ghstack dependencies: #131556
2024-07-24 04:00:02 +00:00
9575b1afad Ensure tensor dict is populated with compiled autograd (#131556)
The issue addressed is that compiled autograd changes the calling convention of the FX graph to only have a single placeholder which contains a list of inputs. In this case, the meta of the tensor input nodes don't contain the `tensor_dict` meta. This adds them.

The context is that `tensor_dict` is used to convey if a tensor is an input with a static address.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131556
Approved by: https://github.com/anijain2305
2024-07-24 04:00:02 +00:00
dffbd3a1e2 Add mypy typing to pattern_matcher (#131506)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131506
Approved by: https://github.com/zou3519
2024-07-24 02:55:43 +00:00
7124efa81b Include _native.h for structured_native_functions (#131208)
In gen.py, the code for generating CompositeViewCopyKernels.cpp includes *_native.h headers for "view_groups" but not "structured_native_functions". However, this results in the TORCH_API in the headers being ineffective and prevents such functions from being used outside libtorch_cpu.so

This patch ensures that gen.py includes the native headers for "structured_native_functions" in the same way as for "view_groups".

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131208
Approved by: https://github.com/bdhirsh
2024-07-24 02:55:36 +00:00
31da9ee711 Use explain function to provide more meaningful information when conversion failed. (#131214)
Summary: In the script for testing different families of models, when the conversion fails, we switch to using the output of the explain function to provide more meaningful information.

Test Plan:
Manual testing with attached log information.

```
buck2 run mode/dev-nosan sigmoid/inference/ts_migration:main -- --mode test_all --test_suites ads_merge --model_id 440779101
```

```
Processing 440779101_5455.predictor.disagg.gpu.merge

    model_name: 440779101_5455.predictor.disagg.gpu.merge
    has_ts_model: True
    has_sample_inputs: True
    ops_maybe_missing_meta: set()
    ts_can_run: True
    ts_run_exception: None
    can_convert: False
    convert_exception: Unsupported nodes are found in the following list:

        0. prim::Loop [%14259 : int = prim::Loop(%14258, %1129, %1126), scope: torch.fx.graph_module.GraphModule:: # <torch_package_1>.caffe2/torch/fb/predictor/modules/tensors_to_device_module.py:100:19]

        1. prim::Loop [%14326 : int = prim::Loop(%1115, %1129, %14259), scope: torch.fx.graph_module.GraphModule:: # <torch_package_1>.caffe2/torch/fb/predictor/modules/tensors_to_device_module.py:100:19]
    ep_result_correct: None
    ep_run_exception: None
    can_package: None
    package_exception: None
    sigmoid_can_run: None
    sigmoid_run_exception: None
    sigmoid_result_correct: None
```

Reviewed By: SherlockNoMad

Differential Revision: D59971446

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131214
Approved by: https://github.com/angelayi
2024-07-24 02:42:18 +00:00
0ceaabaf71 [easy][inline-inbuilt-nn-modules] Update test (#131563)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131563
Approved by: https://github.com/mlazos
ghstack dependencies: #131347, #131367, #131378, #131389, #131405, #131480, #131512
2024-07-24 02:32:19 +00:00
0e780a7d69 [BE] Remove some mypy allow-untyped-decorators that are no longer needed (#131564)
See #131429
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131564
Approved by: https://github.com/oulgen
2024-07-24 02:00:08 +00:00
abb313b466 [torch.mtia] Noop set_rng_state and get_rng_state APIs (#130873)
Summary: As title

Test Plan: CI tests

Reviewed By: joebos

Differential Revision: D59036602

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130873
Approved by: https://github.com/hanzlfs
2024-07-24 01:52:21 +00:00
aa1c78c7e9 [PTD][c10d][EZ] LOG error for nccl error rather than info (#131483)
Summary: As title, when we get nccl exception we should log it as error not info.

Test Plan: CI

Reviewed By: csmodlin, rmiao

Differential Revision: D60123773

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131483
Approved by: https://github.com/fegin
2024-07-24 01:08:00 +00:00
466c167b71 Fix py codegen to delete values that don't have any users (#131028)
Fixes #131025

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131028
Approved by: https://github.com/ezyang
2024-07-24 01:03:56 +00:00
14495ce288 [BE][MPS] Use isOperatingSystemAtLeastVersion: (#131513)
Instead of trying to come up with different checks for classes resonding to selectors

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131513
Approved by: https://github.com/atalman
2024-07-24 00:54:25 +00:00
76f7b3e560 [inductor][cpp][gemm] improve thread blocking heuristics (#131024)
This PR improves the thread blocking heuristics to favor full occupancy as much as possible. Also, the "m x n" block size is made as square as possible for better data reuse.

Take the shape M=20000, N=64, K=128 as an example: the original heuristics couldn't use up all the threads when the number of threads is large, say 60:
AUTOTUNE linear_unary(200000x128, 64x128, 64)
  _linear_pointwise 0.1010 ms 100.0%
  cpp_packed_gemm_0 0.8303 ms 12.2%
0722 02:26:39.220660 302553 torch/_inductor/codegen/cpp_gemm_template.py:503] [0/0] Register blocking: GemmBlocking(block_m=32, block_n=32, block_k=32)
V0722 02:26:39.221042 302553 torch/_inductor/codegen/cpp_gemm_template.py:507] [0/0] Cache blocking: GemmBlocking(block_m=625, block_n=1, block_k=4)
V0722 02:26:39.221118 302553 torch/_inductor/codegen/cpp_gemm_template.py:509] [0/0] Thread blocking: GemmBlocking(block_m=625, block_n=1, block_k=4)
V0722 02:26:39.221252 302553 torch/_inductor/codegen/cpp_gemm_template.py:526] [0/0] Number of threads: 60, occupancy: (10, 2, 1)

After this PR:
AUTOTUNE linear_unary(200000x128, 64x128, 64)
  _linear_pointwise 0.1143 ms 100.0%
  cpp_packed_gemm_0 0.1228 ms 93.1%
V0722 02:29:49.261794 304201 torch/_inductor/codegen/cpp_gemm_template.py:309] [0/0] Register blocking: GemmBlocking(block_m=32, block_n=32, block_k=32)
V0722 02:29:49.262860 304201 torch/_inductor/codegen/cpp_gemm_template.py:313] [0/0] Cache blocking: GemmBlocking(block_m=64, block_n=1, block_k=8)
V0722 02:29:49.262951 304201 torch/_inductor/codegen/cpp_gemm_template.py:315] [0/0] Thread blocking: GemmBlocking(block_m=69, block_n=79, block_k=8)
V0722 02:29:49.263075 304201 torch/_inductor/codegen/cpp_gemm_template.py:332] [0/0] Number of threads: 60, occupancy: (15, 4, 1)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131024
Approved by: https://github.com/leslie-fang-intel, https://github.com/chunyuan-w
2024-07-24 00:36:29 +00:00
fdc9a1404e Remove _BLACK_LISTED_OPS (#131361)
Summary: remove _BLACK_LISTED_OPS after https://github.com/pytorch/pytorch/pull/100749

Test Plan: contbuild & OSS CI

Differential Revision: D60056130

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131361
Approved by: https://github.com/angelayi
2024-07-24 00:15:27 +00:00
2cf220956a [inductor] fix CacheBase.get_system on AMD (#131365)
Summary: CacheBase.get_system on AMD is missing device name and hip version, fix that

Test Plan:
on AMD:
```
buck run fbcode//mode/opt-amd-gpu scripts/nmacchioni/repros/amd_cache_key:repro
{'device': {'name': 'gfx942:sramecc+:xnack-'}, 'version': {'triton': '3.0.006965bceb379c60d8184a4166f502457952938167bfb69592ebf48abebfb0ce9-4856d26164925fd955c779d8f67ecf47cc5754052b008714b3a580d708b13dd8-06965bceb379c60d8184a4166f502457952938167bfb69592ebf48abebfb0ce9-23d635e690d670bf61798e1259674b78c0ed5ba222ab6a455f329f27a758fc2d-e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855-166fbf4e6f8845f354611638861a2a9e1dc2654224c278e10b566f09549dae7e-ccd93feaad4c82c8c1604557340de15fda0a3c84fe83f5a4d1e12a07a77bf3f4-cf28658fa328f7f283ec4e6ccc6c48d7c2a8ddbdf5134d3eb35c9b38ce4ace44-b9d80690b3109c2aaf5ece450d62e93b37eb6ab38552089794b3bb36e36a22b3-36130a37af1b19a0dec569aa08d30b00c74c8f02b6b632999d86dea169146792-4a620da64e0c263067f0dbf6c721f5214a5ac315625a07dd98520502ddf7e22f-6ace95666f6a4ecd2b1a7fc7ae865d1a9239608bd020cb6e4b8d15233c2dd9b3', 'hip': '6.0.32830'}, 'hash': 'c4db04316e15953dda8648f5a43a3f208f2c0ba454666cc7d78e40527aab85ec'}
```

on Nvidia:
```
buck run fbcode//mode/opt scripts/nmacchioni/repros/amd_cache_key:repro
{'device': {'name': 'NVIDIA PG509-210'}, 'version': {'triton': '6de41ec76ecad84e618d692e6793a4ebe707ae68a0c033a222105daa72214d7c', 'cuda': '12.0.0'}, 'hash': 'b58d0aa37d80fc2932c1b7576ca876b77aa1258db1c14e27d1f201bd15376faf'}
```

Differential Revision: D60062972

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131365
Approved by: https://github.com/eellison
2024-07-24 00:11:59 +00:00
480ae51f85 [pytree] Only import optree if it's used (#131478)
torch.utils._pytree imports optree if it's available. Instead, we change
it to if it gets used. The motivation for this is better isolation.

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131478
Approved by: https://github.com/albanD
2024-07-24 00:10:49 +00:00
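As a sketch of the general pattern (generic names, not the actual torch.utils._pytree internals):

```python
_optree = None

def _import_optree():
    # Deferred import: only callers that actually need the optree backend
    # pay the import cost (and the hard dependency).
    global _optree
    if _optree is None:
        import optree
        _optree = optree
    return _optree

def tree_flatten_cxx(tree):
    return _import_optree().tree_flatten(tree)
```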
6850e42266 [dynamo][exception] Remove older specialization for StopIteration (#131512)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131512
Approved by: https://github.com/yanboliang
ghstack dependencies: #131347, #131367, #131378, #131389, #131405, #131480
2024-07-24 00:06:53 +00:00
e2b941a1b4 [dynamo] Rename TENSOR_ALIASING to OBJECT_ALIASING. Permit OBJECT_ALIASING for dict guards (#131480)
Fixes https://github.com/pytorch/pytorch/issues/129667

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131480
Approved by: https://github.com/williamwen42
ghstack dependencies: #131347, #131367, #131378, #131389, #131405
2024-07-24 00:06:53 +00:00
e39f136c35 [debug][dtensor] implemented activation checkpointing differentiation (#130996)
**Summary**
While trying to integrate CommDebugMode with TorchTitan, I realized that the forward_hooks were being registered even though it was in the backward pass. After investigating, I realized that it was activation checkpointing that was causing this. In order to prevent users from being confused, I edited CommDebugMode so that it could differentiate between backward pass operations and activation checkpointing operations. I have also added an example case showing that CommDebugMode is able to successfully differentiate between the backward pass and activation checkpointing. The output for the example can be seen below.

**Test Case**
torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e activation_checkpointing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130996
Approved by: https://github.com/XilunWu
ghstack dependencies: #131419
2024-07-23 23:44:56 +00:00
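For context, a rough sketch of how CommDebugMode is typically driven (the import path and method name here are assumptions based on the `_tensor.debug` module; the DTensor model setup itself is omitted):

```python
import torch
from torch.distributed._tensor.debug import CommDebugMode

comm_mode = CommDebugMode()
with comm_mode:
    # run a forward/backward (possibly with activation checkpointing)
    # that issues collectives on DTensors
    ...
print(comm_mode.get_comm_counts())  # collective counts gathered under the mode
```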
7b375c3682 [dtensor][debug] changed which module tracker I inherited from to fix bug with activation checkpointing (#131419)
**Summary**
I switched the module tracker I had been inheriting from: instead of PyTorch's all-purpose one, I now use the one written by Sanket in the distributed tools folder. I did this because the original one messed up activation checkpointing by adding itself to the parent set in the backward_pre_hook and then again in the forward_pre_hook for the activation checkpointing.

**Test Case**
pytest test/distributed/_tensor/debug/test_comm_mode_features.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131419
Approved by: https://github.com/XilunWu
2024-07-23 23:44:56 +00:00
161c18ed0b SymmetricMemory-based, low contention intra-node all-gather and reduce-scatter (#130583)
```python
# NOTE [low-contention collectives]
# When a collective is overlapped with abundant compute, it makes sense to
# prioritize reducing the contention between the collective and the overlapped
# compute, even at the cost of a slightly slower collective.
#
# Common collective implementations (e.g., NCCL without user buffer
# registration) optimize for throughput with no ambient compute. However, such
# implementations may not be optimal when they are overlapped with compute:
# - These implementations typically fuse the entire collective into a single
# kernel and reserve SM resources based on the most demanding portion of the
# collective, even when a large portion of the collective does not require this
# much resource.
# - These implementations often use SM-based P2P copy as opposed to copy
# engine-based P2P copy. Copy engine-based P2P copy may not have a significant
# advantage when there's no ambient compute. However, it may significantly
# improve overall resource utilization in the presence of ambient compute.
#
# When overlapped with intensive compute (e.g., persistent matmul kernels), the
# SM-usage of a collective can lead to inefficient overlapping.
#
# Low-contention collectives achieve their goals with the following strategies:
# - Use copy engine-based copy whenever possible.
# - Break down portions of a collective with different resource requirements
# into multiple kernels. This improves the overlapping efficiency at the cost
# of additional launching overhead.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130583
Approved by: https://github.com/weifengpy
2024-07-23 23:37:48 +00:00
1930698140 Fix fake tensor SymInt caching when there's a SymInt storage_offset (#131500)
Test Plan: Internal unit tests failed before and succeeded after.

Differential Revision: D60131273

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131500
Approved by: https://github.com/clee2000
2024-07-23 23:37:04 +00:00
fc3d2b26cd Use fake PG for test_compute_comm_reordering.py unit tests (#131415)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131415
Approved by: https://github.com/yifuwang
2024-07-23 22:53:23 +00:00
980bb54361 [BE][Inductor] fix failures in test_padding.py (#131417)
The failure only happens [internally](https://www.internalfb.com/tasks/?t=195598864) because the main block was not executed when the tests are run internally.

Differential Revision: [D60083954](https://our.internmc.facebook.com/intern/diff/D60083954)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131417
Approved by: https://github.com/eellison
2024-07-23 21:53:59 +00:00
53f1f75061 [BE][Inductor] fix do_bench test (#131402)
The test fails internally [T195592444](https://www.internalfb.com/intern/tasks/?t=195592444) (this is a Meta-internal link), but we don't see the failure in OSS.

It turns out that there are 2 issues:
1. `run_test('cuda')` is improperly handled since it tries to import a module named 'cuda' if cuda is available. Since the import fails, all tests in the file are skipped. This hides the failure in OSS. The failure is exposed in internal tests since the main block which runs `run_test('cuda')` is skipped sometimes.
2. fix the real issue that incompatible inputs are provided to `do_bench`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131402
Approved by: https://github.com/eellison
2024-07-23 21:52:35 +00:00
5a0068cc69 [BE] mypy: disallow untyped decorators (#131428)
Untyped decorators strip the types from their decorated function, so even if the underlying function is fully typed, callers to it don't get any benefit from type annotations.

Step 1 - Enable the error and override in all the offending files.

#131429

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131428
Approved by: https://github.com/justinchuby, https://github.com/oulgen
2024-07-23 21:50:55 +00:00
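To illustrate the point (a generic sketch, not code from this PR): a decorator typed with `ParamSpec` preserves the wrapped signature, which is exactly what untyped decorators throw away.

```python
from typing import Callable, TypeVar
from typing_extensions import ParamSpec

P = ParamSpec("P")
R = TypeVar("R")

def logged(fn: Callable[P, R]) -> Callable[P, R]:
    def wrapper(*args: P.args, **kwargs: P.kwargs) -> R:
        print(f"calling {fn.__name__}")
        return fn(*args, **kwargs)
    return wrapper

@logged
def add(x: int, y: int) -> int:
    return x + y

# Under mypy, add is still Callable[[int, int], int]; with an untyped
# decorator it would degrade to Any and callers would lose all checking.
result: int = add(1, 2)
```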
e3ca4e79e1 Fix mypy errors introduced by #131400 (#131522)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131522
Approved by: https://github.com/zou3519, https://github.com/eellison
2024-07-23 21:25:21 +00:00
c9e74449f3 bump executorch commit pin. (#131486)
Summary: as title. Target commit: 6153b1bf7b

Test Plan: CI

Differential Revision: D60125590

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131486
Approved by: https://github.com/huydhn
2024-07-23 21:25:07 +00:00
8a890b72dc [BE] Get rid of missing destructor override warning (#131204)
Regression introduced by https://github.com/pytorch/pytorch/pull/126376

Before this change, compiling torch_cpu on my MacBook prints tons of warnings every time HooksInterface is included
```
In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/src/optim/adamw.cpp:1:
In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/include/torch/optim/adamw.h:3:
In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/include/torch/nn/module.h:3:
In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/include/torch/nn/modules/container/any_module_holder.h:3:
In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/include/torch/nn/modules/container/any_value.h:3:
In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/include/torch/detail/static.h:4:
In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/include/torch/types.h:3:
In file included from /Users/nshulga/git/pytorch/pytorch/aten/src/ATen/ATen.h:7:
In file included from /Users/nshulga/git/pytorch/pytorch/aten/src/ATen/Context.h:13:
/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/detail/HIPHooksInterface.h:27:11: warning: '~HIPHooksInterface' overrides a destructor but is not marked 'override' [-Winconsistent-missing-destructor-override]
  virtual ~HIPHooksInterface() = default;
          ^
/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/detail/AcceleratorHooksInterface.h:16:11: note: overridden virtual function is here
  virtual ~AcceleratorHooksInterface() = default;
          ^
1 warning generated.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131204
Approved by: https://github.com/albanD, https://github.com/seemethere
2024-07-23 21:02:14 +00:00
4eee2e7a6d [operator_benchmark] Remove TARGETS from broken benchmarks (#131460)
Summary:
Remove operator_benchmark caffe2 build due to the removal of caffe2: 2fd75667b4

Plus, we are deleting the TARGETS file from broken benchmarks that we do not intend to maintain.

Test Plan: Sandcastle CI

Differential Revision: D60086216

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131460
Approved by: https://github.com/vmpuri
2024-07-23 20:06:08 +00:00
8497930766 Revert "[CI][dashboard] Collect PT2 cpu perf nightly (#131369)"
This reverts commit 9851c7313d118517d21a112960044e0fdbf560b1.

Reverted https://github.com/pytorch/pytorch/pull/131369 on behalf of https://github.com/atalman due to Sorry need to revert looks like , please run ciflow/inductor looks like this caused failure in [pytorch/pytorch/actions/runs/10058412015/job/27802257096](https://github.com/pytorch/pytorch/actions/runs/10058412015/job/27802257096) ([comment](https://github.com/pytorch/pytorch/pull/131369#issuecomment-2246142022))
2024-07-23 19:41:49 +00:00
d4e3fd613c Revert "[CI] Relax config name matching for cpu inductor tests (#131467)"
This reverts commit aa54bcb6d25fc7c9ac23b82b74ea45f03033c8b2.

Reverted https://github.com/pytorch/pytorch/pull/131467 on behalf of https://github.com/atalman due to Sorry need to revert looks like https://github.com/pytorch/pytorch/pull/131369 broke inductor tests ([comment](https://github.com/pytorch/pytorch/pull/131467#issuecomment-2246136839))
2024-07-23 19:38:35 +00:00
7b82ed2d59 Delete very old misleading info from .ci README (#131502)
I think there is no way to salvage that by updating, so deleting.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131502
Approved by: https://github.com/seemethere, https://github.com/atalman
2024-07-23 19:27:36 +00:00
93fdd0237d Ensure staticmethods can be allowed in graph (#130882)
Fixes https://github.com/pytorch/pytorch/issues/124735

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130882
Approved by: https://github.com/anijain2305, https://github.com/williamwen42
2024-07-23 18:59:19 +00:00
faddb0f30c [NestedTensor] Integrate the mean operator along the jagged dimension into NestedTensor (#131132)
Summary:
Modify the existing `mean` operator in PyTorch, invoked by `torch.mean`, to allow for reductions along the jagged dimension of a nested tensor. The function originally had a basic implementation for reducing along 1 non-ragged dimension. This diff enables PyTorch users to invoke `torch.mean` on a nested tensor when reducing along the ragged dimension, e.g. `*` in a `(B, *, M)` nested tensor.

Parametrize unit tests from `sum` to verify the accuracy of the ragged reduction implementation for `torch.mean`. Add unit tests and parametrize `sum` unit tests to verify error handling for unsupported features in `NestedTensor` `torch.mean`.

Test Plan:
Verify that the new unit test passes via the following command:
```
buck2 run mode/{opt,inplace} //caffe2/test:nested -- --regex test_mean
```

```
buck2 run mode/{opt,inplace} //caffe2/test:nested -- --regex test_jagged_op
```

Differential Revision: D59654668

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131132
Approved by: https://github.com/davidberard98, https://github.com/jbschlosser
2024-07-23 18:48:34 +00:00
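A small sketch of the behavior this enables (assuming the jagged-layout NestedTensor API; the exact dim semantics follow the `(B, *, M)` description above):

```python
import torch

nt = torch.nested.nested_tensor(
    [torch.randn(3, 8), torch.randn(5, 8)],  # (B, *, M) with ragged *
    layout=torch.jagged,
)

# Reduce along the ragged dimension; each batch element averages over its
# own number of rows (3 and 5 here), yielding a regular (B, M) tensor.
out = nt.mean(dim=1)
print(out.shape)  # expected: torch.Size([2, 8])
```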
120ca23a1f Fix IMAs in Flash-Attention splitkv kernel (#131277)
# Summary

While debugging CI failures for flash_attention tests I stumbled across 2 IMAs for the split-kv variant of flash attention.
1. Illegal global memory writes during the writing of softmax_lse_accum. This was pinpointed to the temporary lifetime of out_accum and softmax_lse_accum. These were likely getting their refcount dropped **before** the kernel launch that used them, allowing them to potentially get reused for other allocations.
2. After debugging this, there were illegal writes in the combine kernel. I was able to pinpoint this to the writing of the reduced LSE. From my understanding, it was assuming that kBlockM evenly divided the global number of rows and wasn't masking out these writes.

### History
My line of thinking for this:

We create the temporary split accum + LSE stats tensors to store the data for each split. We then launch a follow up kernel to do the reduction.

Under ordinary, non-roofline memory usage, the CUDA caching memory allocator will keep these allocations alive even though the tensors were created within a temporary scope and no longer have any live references.

On CI we often run near max memory usage. We change/add tests and suddenly we get close to the OOM threshold. The memory allocator will reap these segments and we get write-after-free errors.

After that fix I got further, past the splitkv_flash kernel, and then got the following error:

``` Shell
❯ TORCH_DISABLE_ADDR2LINE=1 PYTORCH_NO_CUDA_MEMORY_CACHING=1  compute-sanitizer --show-backtrace=device --tool memcheck --log-file ima.txt python ima.py

softmax_lseaccum_ptr =0x7f5ebb208a00
oaccum_ptr =0x7f5ebb208c00
softmax_lse_ptr = 0x7f5ebb208800
❯
❯ head ima.txt -n 10
========= COMPUTE-SANITIZER
========= Invalid __global__ write of size 4 bytes
=========     at void pytorch_flash::flash_fwd_splitkv_combine_kernel<pytorch_flash::Flash_fwd_kernel_traits<(int)32, (int)64, (int)256, (int)4, (bool)0, (bool)0, cutlass::bfloat16_t, pytorch_flash::Flash_kernel_traits<(int)32, (int)64, (int)256, (int)4, cutlass::bfloat16_t>>, (int)16, (int)1, (bool)1>(pytorch_flash::Flash_fwd_params)+0x630
=========     by thread (2,0,0) in block (0,0,0)
=========     Address 0x7f5ebb208804 is out of bounds
=========     and is 1 bytes after the nearest allocation at 0x7f5ebb208800 of size 4 bytes
```

Okay, I looked at the address and it looks like we are writing consecutive bytes past softmax_lse_ptr from the combine func. I tried padding out softmax_lse to q_padded, and there were no more illegal memory errors on my repro:
```
========= COMPUTE-SANITIZER
========= ERROR SUMMARY: 0 errors
```

Fixes https://github.com/pytorch/pytorch/issues/131240
Fixes https://github.com/pytorch/pytorch/issues/131227
Fixes https://github.com/pytorch/pytorch/issues/131221

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131277
Approved by: https://github.com/malfet
2024-07-23 18:26:49 +00:00
f75d724482 Updating Types in torch/_dynamo/utils.py (#131001)
Adds some type annotations to the torch/_dynamo/utils.py file.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131001
Approved by: https://github.com/aorenste
2024-07-23 18:25:52 +00:00
aa54bcb6d2 [CI] Relax config name matching for cpu inductor tests (#131467)
Summary: Matching *cpu* instead of *cpu_inductor* should be sufficient. This fixes torchbench test failures in https://github.com/pytorch/pytorch/pull/131369.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131467
Approved by: https://github.com/zou3519
2024-07-23 18:24:29 +00:00
94f22eb6b2 refactor post-trace fakification in strict (#131421)
Summary:
Previously it was unclear what `_convert_input_to_fake` actually does (used in strict), and in particular how it is different from `make_fake_inputs` (used in non-strict).

This PR splits that function to work purely on user inputs, then renames it to `extract_fake_inputs` and adds a comment clarifying what it does—namely, it extracts fake inputs from a given graph module instead of "converting inputs to fake inputs" (as suggested by the current name) or "making fake inputs" (as happens in non-strict, where no tracing has taken place yet).

The remainder of that function used to also fakify params and buffers. It turns out that this part is identical to what happens in non-strict, hence we also pull `make_fake_inputs` out from `non_strict_utils` into `_trace`, merge it with another util, and make both modes call it.

Test Plan: existing tests

Differential Revision: D60084442

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131421
Approved by: https://github.com/zhxchen17
2024-07-23 18:23:03 +00:00
f85c35872b Remove GraphModuleOpUpgrader in _export.serde.upgrade.py (#131373)
Summary: Remove GraphModuleOpUpgrader in _export.serde.upgrade.py, and remove the file itself.

Test Plan: contbuild & OSS CI

Differential Revision: D60067937

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131373
Approved by: https://github.com/angelayi
2024-07-23 18:09:44 +00:00
22906be8f0 Do not abort on SPARSE_STATUS_INVALID_VALUE (#130382)
Summary:
Newer versions of the MKL library return `SPARSE_STATUS_INVALID_VALUE` when badly formed non-triangular matrices are passed to the `mkl_sparse_?_trsv`/`mkl_sparse_?_mrsv` functions. This would start aborting (badly written) tests that worked with the old version, which just filled the result tensor with `-NaN` instead of returning an error status.

This changes the code to fill the result tensor with `-NaN` on `SPARSE_STATUS_INVALID_VALUE` so we get the same behavior regardless of the MKL version in use.

Test Plan: `buck2 test 'fbcode//mode/opt' fbcode//caffe2/test:sparse -- --run-disabled`

Differential Revision: D59542023

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130382
Approved by: https://github.com/malfet
2024-07-23 18:09:36 +00:00
cfb9ccab6c [export] Filter errors by exception type, add case name (#131327)
Summary:
-  Log export errors to Scuba and mark them with "classified" and "unclassified"
- Classify errors by exception type (ALLOW_LIST) and a `case_name` attribute
- Add `case_name` for some exceptions.

Test Plan:
Running the code below logs a classified error to `torch_export_usage` table in Scuba.

```
import torch

from torch._export.db.case import SupportLevel

class TorchSymMin(torch.nn.Module):
    """
    torch.sym_min operator is not supported in export.
    """

    def forward(self, x):
        return x.sum() + torch.sym_min(x.size(0), 100)

example_args = (torch.randn(3, 2),)
tags = {"torch.operator"}
support_level = SupportLevel.NOT_SUPPORTED_YET
model = TorchSymMin()

torch.export.export(model, example_args)
```

Differential Revision: D59981459

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131327
Approved by: https://github.com/zhxchen17
2024-07-23 18:01:13 +00:00
6b8ec2b371 Revert "[triton_op] fix autotuning (#131363)"
This reverts commit 154f27455a62314dfb689f1fe13c0cfd52490339.

Reverted https://github.com/pytorch/pytorch/pull/131363 on behalf of https://github.com/ZainRizvi due to This was a tricky one, but looking at the code it's the change to torch/fx/node.py that triggered the type violation errors. Reverting since this is now breaking trunk ([comment](https://github.com/pytorch/pytorch/pull/131363#issuecomment-2245899858))
2024-07-23 18:01:09 +00:00
3fe72e0c2e [4/N] Non-Tensor: Support layout, device and dtype for aten operations (#125897)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125897
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-07-23 17:50:17 +00:00
68c725a094 [custom ops] Add register_vmap for custom ops (#130589)
Fixes #130284
Fixes #130653

- Add `torch.library.register_vmap` to custom ops (see the usage sketch below)
- Add `register_vmap` for the operators in custom_op_db.
- Make `torch.autograd.Function` support kwarg-only kwargs for vmap
- test operators in op_db with `tests/test_vmap`.
- change `test_vmap` to allow custom `out_dim` and allow "None" in `out_dim` when testing.
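
A rough usage sketch of the new registration API (the `mylib::scale` op below is hypothetical; the exact signature should be taken from the `torch.library` docs):

```python
import torch
from torch import Tensor

@torch.library.custom_op("mylib::scale", mutates_args=())
def scale(x: Tensor, factor: float) -> Tensor:
    return x * factor

@scale.register_fake
def _(x, factor):
    return torch.empty_like(x)

def scale_vmap(info, in_dims, x, factor):
    # Elementwise in x: pass the batched tensor through and keep its batch dim.
    return scale(x, factor), in_dims[0]

torch.library.register_vmap("mylib::scale", scale_vmap)

xs = torch.randn(4, 3)
out = torch.vmap(lambda x: scale(x, 2.0))(xs)
```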

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130589
Approved by: https://github.com/zou3519
2024-07-23 17:48:38 +00:00
404d640c39 [1/2] PT2 Inductor ComboKernels - Foreach cases (#124969)
Summary:
A ComboKernel combines independent Inductor Triton kernels into a single one.
Consolidation with Foreach kernel:
1) For the scheduler node, the logic is consolidated into ForeachKernelSchedulerNode
2) The backend kernel is consolidated into ComboKernel.

(Note: this is part 1 which only deals with the 1st case above.)

Details:

1. ComboKernel can be viewed as an extension of the Foreach kernel (see the examples below). The main differences are: 1) the block size is tunable (but currently shared by the sub-kernels); 2) it supports multiple kernel types, like pointwise and reduction, and may extend to matmul as well (it doesn't support mixing 1d and 2d kernels yet, but can be extended for such cases); 3) the blocks are interleaved among the sub-kernels (this can be extended to other arrangements); 4) it is designed to be general enough to combine kernels without dependencies and doesn't rely on specific patterns; 5) it doesn't support dynamic sizes yet but can easily be extended to do so.

2. ComboKernel is used in two cases: 1) for existing foreach kernels, combo kernels are used as the backend kernel, and the front-end kernel generation logic remains the same; 2) an extra optimization phase is added at the end of the scheduler to generate extra combo kernels if combo_kernels is True in config.py.

3. Combo kernel generation in the added optimization phase is done in two steps: 1) in the front end, inside the scheduler, it topologically sorts the scheduler nodes to find all the nodes with no data dependencies and creates a front-end scheduler node for them (we currently limit the maximum number of sub-nodes per combo kernel to 8, though the optimal number is still to be determined); 2) these sub-nodes are then combined in the codegen phase to generate the combo kernel code based on a few rules. For example, 1d and 2d kernels are separated into different combo kernels, as mixing them is not supported yet. Note that the provided algorithms are very basic, and users can register their own customized combo kernel generation algorithms for both steps.

4. Performance-wise, combining small kernels almost always yields a performance gain. However, combining very large kernels may not see any gain, and can sometimes even regress, possibly due to improper block sizes. Thus, a benchmark function is implemented to avoid such regressions, and it is recommended to turn it on by setting benchmark_combo_kernels to True whenever combo_kernels is True.

Example:
- element wise kernels
original Pytorch function:
```
 def test_activations(a, b, c):
     a1 = torch.nn.functional.relu(a)
     b1 = torch.nn.functional.sigmoid(b)
     c1 = torch.nn.functional.tanh(c)
     return a1, b1, c1
```
Generated ComboKernel:
```
@triton_heuristics.pointwise(
    size_hints=[512], tile_hint=TileHint.DEFAULT,
    filename=__file__,
    triton_meta={'signature': {0: '*fp32', 1: '*fp32', 2: '*fp32', 3: '*fp32', 4: '*fp32', 5: '*fp32'}, 'device': 0, 'device_type': 'cuda', 'constants': {}, 'configs': [AttrsDescriptor(divisible_by_16=(0, 1, 2, 3, 4, 5), equal_to_1=())]},
    inductor_meta={'kernel_name': 'triton_poi_fused_0', 'mutated_arg_names': []}
)
@triton.jit
def triton_(in_ptr0, in_ptr1, in_ptr2, out_ptr0, out_ptr1, out_ptr2, XBLOCK : tl.constexpr):
    pid = tl.program_id(0)
    if pid % 3 == 0:
        pid_offset = pid // 3
        xnumel = 100
        rnumel = 1
        xoffset = pid_offset * XBLOCK
        xindex = xoffset + tl.arange(0, XBLOCK)[:]
        xmask = xindex < xnumel
        x0 = xindex
        tmp0 = tl.load(in_ptr0 + (x0), xmask)
        tmp1 = triton_helpers.maximum(0, tmp0)
        tl.store(out_ptr0 + (x0), tmp1, xmask)
    elif pid % 3 == 1:
        pid_offset = pid // 3
        xnumel = 400
        rnumel = 1
        xoffset = pid_offset * XBLOCK
        xindex = xoffset + tl.arange(0, XBLOCK)[:]
        xmask = xindex < xnumel
        x1 = xindex
        tmp2 = tl.load(in_ptr1 + (x1), xmask)
        tmp3 = tl.sigmoid(tmp2)
        tl.store(out_ptr1 + (x1), tmp3, xmask)
    elif pid % 3 == 2:
        pid_offset = pid // 3
        xnumel = 100
        rnumel = 1
        xoffset = pid_offset * XBLOCK
        xindex = xoffset + tl.arange(0, XBLOCK)[:]
        xmask = xindex < xnumel
        x2 = xindex
        tmp4 = tl.load(in_ptr2 + (x2), xmask)
        tmp5 = libdevice.tanh(tmp4)
        tl.store(out_ptr2 + (x2), tmp5, xmask)
    else:
        pass
```
- reduction kernels
Original Pytorch function:
```
def test_reduce(a, b, c):
     a1 = torch.sum(a, dim=0)
     b1 = torch.max(b, dim=0)
     c1 = torch.min(c, dim=0)
     return a1, b1, c1
```
Generated ComboKernel:
```
 @triton_heuristics.persistent_reduction(
     size_hints=[32, 32],
     reduction_hint=ReductionHint.DEFAULT,
     filename=__file__,
     triton_meta={'signature': {0: '*fp32', 1: '*fp32', 2: '*fp32', 3: '*fp32', 4: '*i64', 5: '*fp32', 6: '*i64', 7: '*fp32'}, 'device': 0, 'device_type': 'cuda', 'constants': {}, 'configs': [AttrsDescriptor(divisible_by_16=(0, 1, 2, 3, 4, 5, 6, 7), equal_to_1=())]},
     inductor_meta={'kernel_name': 'triton_per_fused_0', 'mutated_arg_names': []}
 )
 @triton.jit
 def triton_(in_ptr0, in_ptr1, in_ptr2, out_ptr0, out_ptr1, out_ptr2, out_ptr3, out_ptr4, XBLOCK : tl.constexpr):
     pid = tl.program_id(0)
     if pid % 3 == 0:
         pid_offset = pid // 3
         xnumel = 20
         rnumel = 20
         RBLOCK_0: tl.constexpr = 32
         xoffset = pid_offset * XBLOCK
         xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
         xmask = xindex < xnumel
         rindex = tl.arange(0, RBLOCK_0)[None, :]
         roffset = 0
         rmask = rindex < rnumel
         r1 = rindex
         x0 = xindex
         tmp0 = tl.load(in_ptr0 + (x0 + (20*r1)), rmask & xmask, other=0.0)
         tmp1 = tl.broadcast_to(tmp0, [XBLOCK, RBLOCK_0])
         tmp3 = tl.where(rmask & xmask, tmp1, float("-inf"))
         tmp4 = triton_helpers.max2(tmp3, 1)[:, None]
         tmp6 = tl.broadcast_to(rindex, tmp3.shape)
         _, tmp5_tmp = triton_helpers.max_with_index(tmp3, tmp6, 1)
         tmp5 = tmp5_tmp[:, None]
         tl.store(out_ptr0 + (x0), tmp4, xmask)
         tl.store(out_ptr1 + (x0), tmp5, xmask)
     elif pid % 3 == 1:
         pid_offset = pid // 3
         xnumel = 10
         rnumel = 10
         RBLOCK_1: tl.constexpr = 16
         xoffset = pid_offset * XBLOCK
         xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
         xmask = xindex < xnumel
         rindex = tl.arange(0, RBLOCK_1)[None, :]
         roffset = 0
         rmask = rindex < rnumel
         r3 = rindex
         x2 = xindex
         tmp7 = tl.load(in_ptr1 + (x2 + (10*r3)), rmask & xmask, other=0.0)
         tmp8 = tl.broadcast_to(tmp7, [XBLOCK, RBLOCK_1])
         tmp10 = tl.where(rmask & xmask, tmp8, float("inf"))
         tmp11 = triton_helpers.min2(tmp10, 1)[:, None]
         tmp13 = tl.broadcast_to(rindex, tmp10.shape)
         _, tmp12_tmp = triton_helpers.min_with_index(tmp10, tmp13, 1)
         tmp12 = tmp12_tmp[:, None]
         tl.store(out_ptr2 + (x2), tmp11, xmask)
         tl.store(out_ptr3 + (x2), tmp12, xmask)
     elif pid % 3 == 2:
         pid_offset = pid // 3
         xnumel = 10
         rnumel = 10
         RBLOCK_2: tl.constexpr = 16
         xoffset = pid_offset * XBLOCK
         xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
         xmask = xindex < xnumel
         rindex = tl.arange(0, RBLOCK_2)[None, :]
         roffset = 0
         rmask = rindex < rnumel
         r5 = rindex
         x4 = xindex
         tmp14 = tl.load(in_ptr2 + (x4 + (10*r5)), rmask & xmask, other=0.0)
         tmp15 = tl.broadcast_to(tmp14, [XBLOCK, RBLOCK_2])
         tmp17 = tl.where(rmask & xmask, tmp15, 0)
         tmp18 = tl.sum(tmp17, 1)[:, None]
         tl.store(out_ptr4 + (x4), tmp18, xmask)
     else:
         pass
```

Note: ComboKernel uses masks to allow combining kernels that work on tensors of different sizes.

Test Plan:
```
buck2 test mode/dev-nosan caffe2/test/inductor:foreach
```
```
buck2 test mode/dev-nosan caffe2/test/inductor:combo_kernels
```

Differential Revision: D54134695

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124969
Approved by: https://github.com/mlazos
2024-07-23 17:34:28 +00:00
979429ca89 [inductor]Add DtypeView to avoid memory leak and unnecessary kernel generations (#128883)
Fixes #126338
## Issue Summary

When torchinductor compiles the combination `functional_collective -> view.dtype -> wait`, a memory leak occurs. This happens because `view.dtype` is compiled into an out-of-place Triton kernel that copies the input data to a new tensor, even if the collective producing the data hasn't been waited on yet. The tensor used by the collective is only freed when the `wait` operation triggers the garbage collector, see [~WorkRegistry](https://github.com/pytorch/pytorch/blob/main/torch/csrc/distributed/c10d/Functional.cpp#L41). However, since `wait` now waits on the new tensor, the previous one is never freed. `view.dtype` should only change the metadata instead of creating a new tensor; the current lowering goes against its semantics and causes memory leaks.

See more great discussions in the #126338

This kind of lowering also generates unnecessary triton kernels for `view.dtype` when it can't be fused with other operations.

## Fix
The function `aten.view.dtype` is a CPU operation that changes the metadata of its input. After discussions with @eellison and @bdhirsh, we decided to change the lowering of `aten.view.dtype` to ensure it falls back properly to `aten.view.dtype` instead of generating a Triton kernel in some cases. This approach also preserves the semantics of the view operation.
When the model calls `aten.view.dtype` with a data type whose bit width matches the input's original data type, we lower it to the newly added `DtypeView` in IR, acting like a `ReinterpretView`. When the operation can be fused, its `make_loader` is called to maintain the correct type conversion for each load instruction. When the operation can't be fused, it falls back to `aten.view.dtype` to avoid Triton kernel generation.

## Example

```python
@torch.compile
def fn(x, y):
    x = x.view(torch.float16)
    y = y.view(torch.float16) + 1
    return x @ y

x = torch.randn((2, 2), device=self.device, dtype=torch.bfloat16)
y = torch.randn((2, 2), device=self.device, dtype=torch.bfloat16)
fn(x, y)
```
The output code generated before this fix is like the following.
```python
triton_poi_fused_add_view_0...
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 4
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (x0), xmask).to(tl.float32)
    tmp1 = tmp0.to(tl.bfloat16).to(tl.float32, bitcast=True).to(tl.float32)
    tl.store(out_ptr0 + (x0), tmp1, xmask)

triton_poi_fused_add_view_1...
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 4
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (x0), xmask).to(tl.float32)
    tmp1 = tmp0.to(tl.bfloat16).to(tl.float32, bitcast=True).to(tl.float32)
    tmp2 = 1.0
    tmp3 = tmp1 + tmp2
    tl.store(out_ptr0 + (x0), tmp3, xmask)

def call(args):
...
        triton_poi_fused_view_0.run(arg0_1, buf0, 4, grid=grid(4), stream=stream0)
        del arg0_1
        buf1 = empty_strided_cuda((2, 2), (2, 1), torch.float16)
        # Source Nodes: [view_1, y], Original ATen: [aten.add, aten.view]
        triton_poi_fused_add_view_1.run(arg1_1, buf1, 4, grid=grid(4), stream=stream0)
        del arg1_1
        buf2 = empty_strided_cuda((2, 2), (2, 1), torch.float16)
        # Source Nodes: [matmul, view_1, x, y], Original ATen: [aten.add, aten.mm, aten.view]
        extern_kernels.mm(buf0, buf1, out=buf2)
```
As you can see, the two `view` operations are compiled into two kernels, `triton_poi_fused_view_0` and `triton_poi_fused_add_view_1`. Both of them have a line `tmp1 = tmp0.to(tl.bfloat16).to(tl.float32, bitcast=True).to(tl.float32)` which does the type conversion.

The main issue is that the first `view` operation doesn't do anything to the actual data, yet it generates a Triton kernel with a new output tensor. Another small issue is that this Triton kernel can't be compiled, because `bitcast=True` only supports type conversion between types of the same bit width.

The following is the output code generated after this PR.

```python
triton_poi_fused_add_0...
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 4
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (x0), xmask).to(tl.float32)
    tmp1 = tmp0.to(tl.bfloat16).to(tl.float32)
    tmp2 = 1.0
    tmp3 = tmp1 + tmp2
    tl.store(out_ptr0 + (x0), tmp3, xmask)
def call(args):
...
        triton_poi_fused_add_0.run(arg1_1, buf0, 4, grid=grid(4), stream=stream0)
        del arg1_1
        buf1 = empty_strided_cuda((2, 2), (2, 1), torch.float16)
        # Source Nodes: [matmul, y], Original ATen: [aten.add, aten.mm]
        extern_kernels.mm(aten.view.dtype(arg0_1, torch.float16), buf0, out=buf1)
```
The first `view` operation has been replaced with `aten.view.dtype`, and its result is passed directly as an argument. The second one is still there because it is fused with the following add operation. The invalid bitcast operation is removed too.

The following two code snippets are for the upcasts and downcasts. For dtypes in `torch.float16, torch.bfloat16`, each load is upcast to float32, then downcast to its original dtype to ensure values are used with the right precision.

7bda23ef84/torch/_inductor/codegen/triton.py (L1725-L1726)
7bda23ef84/torch/_inductor/codegen/triton.py (L629-L642)

Huge thanks to @eellison, @bdhirsh, @shunting314, and @desertfire .

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128883
Approved by: https://github.com/eellison
2024-07-23 17:31:39 +00:00
f93a6a4d31 Add mypy typing to torch_version.py (#131447)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131447
Approved by: https://github.com/angelayi
ghstack dependencies: #131434
2024-07-23 17:31:07 +00:00
eab1595ce2 [dynamo] Delete wrong assertion in bind_args (#131405)
Fix - https://github.com/pytorch/pytorch/issues/130537

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131405
Approved by: https://github.com/williamwen42, https://github.com/yanboliang
ghstack dependencies: #131347, #131367, #131378, #131389
2024-07-23 17:28:05 +00:00
e4b5645f83 Revert "Add wrappers for synchronous GPUDirect Storage APIs (#130633)"
This reverts commit 5b5e0698a5f560decb9bbdd150ed7b0622eb7777.

Reverted https://github.com/pytorch/pytorch/pull/130633 on behalf of https://github.com/clee2000 due to breaking a lot of jobs and build rules internally D60085885, possibly needs to update some bazel build? ([comment](https://github.com/pytorch/pytorch/pull/130633#issuecomment-2245806738))
2024-07-23 17:19:34 +00:00
f7754c6dc5 Run pull jobs with new AMI (#131250)
Migrate all pull jobs to the new Amazon 2023 AMI runner type.

Exceptions:
- Distributed tests are still on the old AMI since they had some weird [test failures](https://github.com/pytorch/pytorch/actions/runs/10047579686/job/27770963175). Will debug those separately.
- Ported over a couple trunk and slow jobs that had `sync-tag`s set with the pull jobs and so needed to be on the same AMI

Revert plan, in case something starts breaking when we run these new AMIs at a larger scale:
- If specific jobs start failing consistently, we bring those jobs back to the old AMI
- If the failure is more widespread, revert this PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131250
Approved by: https://github.com/malfet, https://github.com/atalman
2024-07-23 17:17:12 +00:00
5f0b65bee7 Revert "Replace manual parsing of "TMPDIR", "TMP", "TEMP" and "TEMPDIR" with std::filesystem::temp_directory_path() (#130842)"
This reverts commit d33804f8b6e2ea38f8446826a16be13ce4f9b71e.

Reverted https://github.com/pytorch/pytorch/pull/130842 on behalf of https://github.com/clee2000 due to breaking some builds internally D60085710, Im not sure what the logs mean but I think its something about build size ([comment](https://github.com/pytorch/pytorch/pull/130842#issuecomment-2245799309))
2024-07-23 17:15:06 +00:00
4ca8705035 Add mypy typing to fx node (#131434)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131434
Approved by: https://github.com/zou3519
2024-07-23 17:00:31 +00:00
ded5bdb0de Use inductor TestCase for test_replicate_with_compiler.py (#131053)
Summary: `test/distributed/_composable/test_replicate_with_compiler.py` uses `torch.compile`. This change introduces a version of MultiProcessTestCase that derives from the inductor TestCase class to make sure we always get a clean cache dir.

Test Plan: `python test/distributed/_composable/test_replicate_with_compiler.py`

Differential Revision: [D59925519](https://our.internmc.facebook.com/intern/diff/D59925519)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131053
Approved by: https://github.com/eellison
2024-07-23 16:59:55 +00:00
a5ad02d05d Remove MacOS M2 14 runner from MacMPS job (#131465)
As it's been dead for 2+ weeks and causing queuing issues

<img width="760" alt="image" src="https://github.com/user-attachments/assets/4e806cae-3a67-4acb-b84f-1a9131d2a859">
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131465
Approved by: https://github.com/seemethere, https://github.com/atalman
2024-07-23 16:51:42 +00:00
c1ef214046 Print ExportedProgram without color by default (#131399)
Summary:
Without plugin, colored ExportedProgram is not really readable.

![image](https://github.com/user-attachments/assets/319920a9-bb4b-4ad2-bcac-0c4f76973b11)

Test Plan: CI

Differential Revision: D60074481

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131399
Approved by: https://github.com/angelayi
2024-07-23 16:41:55 +00:00
db376fb643 Ensure non-contiguous indices are handled (#131430)
The unaligned-inputs checker was built on the assumption that static indices form a contiguous range (i.e. 0, 1, 2),
but the new nn-module-inlining changes break this assumption.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131430
Approved by: https://github.com/anijain2305
2024-07-23 16:37:55 +00:00
4f0497c747 Divorce triton and pt2 remote caching (#131345)
Now that remote caching has evolved into various parts of PT2, we want to separate triton and pt2 caching as changes to one have caused SEVs to the other.

Differential Revision: D60047752

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131345
Approved by: https://github.com/aorenste
2024-07-23 16:28:12 +00:00
154f27455a [triton_op] fix autotuning (#131363)
The problem was we were shoving SymInts into the constant_args side
table. The root problem is that torch.fx.node.base_types, which we use
to determine what can be put in the graph, doesn't actually have SymInt
in it. This PR fixes base_types to include SymInt.

Test Plan:
- tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131363
Approved by: https://github.com/oulgen
2024-07-23 16:15:00 +00:00
3aa45cae77 [export] Removed deprecated dialect field from EP schema. [2/2] (#131344)
Summary: Not landable until we've updated the pin of executorch.

Test Plan: CI

Reviewed By: SherlockNoMad

Differential Revision: D59759620

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131344
Approved by: https://github.com/SherlockNoMad, https://github.com/ydwu4
2024-07-23 16:05:10 +00:00
b61600f6cc [pytorch] fix the leak for pinned memory when using _create_cpu_state… (#131270)
When pin_memory and share_memory are both set to True in _create_cpu_state_dict, the memory is pinned using cudaHostRegister but never unpinned. So, once the tensor is created and freed, the caching allocator allocates the same memory for a new tensor, which fails with the error below.

```
obj = <[RuntimeError('CUDA error: part or all of the requested memory range is already mapped\nCUDA kernel errors might be a...pile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.\n') raised in repr()] Tensor object at 0x7f0028a4d6c0> pg = None, device = None, _ = None
```

This PR fixes that by attaching a hook that unregisters the memory when the tensor is freed.

This is easily reproducible with xlformers checkpointing unit tests and the fix is verified with the same.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131270
Approved by: https://github.com/LucasLLC
2024-07-23 15:47:21 +00:00
1e86387871 Revert "Support IPC for Expandable Segments (#130890)"
This reverts commit 32c2f84e349ad6e34b8559d3f1f9c27020ae702f.

Reverted https://github.com/pytorch/pytorch/pull/130890 on behalf of https://github.com/zdevito due to variable shadowing broke internal tests ([comment](https://github.com/pytorch/pytorch/pull/130890#issuecomment-2245456085))
2024-07-23 14:46:28 +00:00
f064dac588 [CI] change xpu ci build runner type to reduce build time (#130922)
The current XPU build sometimes needs 2+ hours; change the build runner to `linux.12xlarge` to reduce build time. Works for #114850
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130922
Approved by: https://github.com/atalman
2024-07-23 14:45:30 +00:00
6bbef2a06b [dynamo] Support set on KeysView (#131389)
Fixes https://github.com/pytorch/pytorch/issues/129664

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131389
Approved by: https://github.com/mlazos
ghstack dependencies: #131347, #131367, #131378
2024-07-23 14:15:26 +00:00
e7c5e06772 [dynamo] Support __contains__ on __dict__ on UserDefinedClassVariable (#131378)
Fixes https://github.com/pytorch/pytorch/issues/129665

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131378
Approved by: https://github.com/mlazos
ghstack dependencies: #131347, #131367
2024-07-23 14:15:26 +00:00
0bc5e26067 [dynamo] Support dict conversion of objects derived from MutableMapping (#131367)
Fixes - https://github.com/pytorch/pytorch/issues/129662

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131367
Approved by: https://github.com/williamwen42
ghstack dependencies: #131347
2024-07-23 14:15:20 +00:00
a944cce5b8 [dynamo] Support if callable on list (#131347)
Fixes https://github.com/pytorch/pytorch/issues/130720

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131347
Approved by: https://github.com/williamwen42, https://github.com/mlazos
2024-07-23 14:15:15 +00:00
250cdb2ac7 Fix cuda_half_test.cu (#131416)
$\operatorname{atanh}(1.0)$ is $\infty$ (see https://www.mathworks.com/help/matlab/ref/atanh.html) and the difference between two infinities is NaN, which is neither greater than, less than, nor equal to any reasonable threshold.

Fix the test by checking that atanh of 0.5 is equal for float and half, and that atanh of 1.0 is equal to infinity.
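
The arithmetic behind the failure, as a quick Python sketch (the actual test is CUDA C++):

```python
import torch

inf = torch.atanh(torch.tensor(1.0))  # +inf
diff = inf - inf                      # nan: inf - inf is undefined
# NaN fails every ordered comparison, so no threshold check can pass.
print(diff < 1e-3, diff > 1e-3, diff == 0.0)  # all False
```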

Fixes https://github.com/pytorch/pytorch/issues/131401

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131416
Approved by: https://github.com/atalman, https://github.com/albanD
2024-07-23 14:10:20 +00:00
4ac77fc6bd [HOP] Don't send HOPs to torch_dispatch (#131370)
I regretted the decision in
https://github.com/pytorch/pytorch/pull/130606. Most user
torch_dispatches aren't equipped to handle the HOP correctly,
so for now I'd prefer that users explicitly define the interaction
between the HOP and their torch_dispatch class.

An example is FlopCounterMode: if we allow HOPs to get passed to it, it
will ignore auto_functionalized(mm) by default but it will record flops
for mm, which is weird.

Test Plan:
- tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131370
Approved by: https://github.com/ydwu4
2024-07-23 13:41:08 +00:00
027f35d9e6 [Inductor] Allow customize decompositions for fwd_only trace function (#131329)
Summary:

Inductor will aggressively try to decompose and lower ops into a smaller opset. However, this may not align with kernel coverage (or perf preference) on different backends (e.g. Inductor will decompose Gelu into primitive ops, but certain backends already have a Gelu op). Therefore, we need a mechanism to allow customizing the decompositions for the trace function so that Inductor will simply pass such ops through.

Test Plan:

Reviewers:
@eellison
Subscribers:

Tasks:

Tags:

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131329
Approved by: https://github.com/eellison
2024-07-23 13:10:48 +00:00
eb146b10db Only depend on sympy 1.12 for conda (no 3.13 there anyways) (#131355)
Fixing nightly after https://github.com/pytorch/pytorch/pull/130895
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131355
Approved by: https://github.com/atalman
2024-07-23 12:19:58 +00:00
9851c7313d [CI][dashboard] Collect PT2 cpu perf nightly (#131369)
Summary: Add a workflow similar to inductor-perf-test-nightly.yml but use x86 metal instances for perf measurement. The data processing and dashboard update will come next.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131369
Approved by: https://github.com/huydhn
2024-07-23 11:55:39 +00:00
3f3b226ffc Fixes for the extension backend tests (#130933)
There were some miscellaneous issues I found:

* The WrapperCodeGen subclass constructors don't accept any arguments, which doesn't mesh with how Inductor can try to construct them.
* A DeviceInterface subclass for Triton doesn't implement `triton_supported() == True`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130933
Approved by: https://github.com/eellison, https://github.com/jansel
2024-07-23 10:46:32 +00:00
d8e2e1fe50 [aoti] use reshape instead of view for flattening tensors for the nan checker (#131302)
For some non-contiguous tensors, tensor.view would trigger the following
runtime error:

"RuntimeError: view size is not compatible with input tensor’s size and stride
(at least one dimension spans across two contiguous subspaces).
Use .reshape(…) instead"

So, let's use reshape instead.
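
For reference, a minimal repro of the difference (a sketch, not the AOTI code itself):

```python
import torch

t = torch.randn(4, 6).t()      # transposed -> non-contiguous
flat = t.reshape(-1)           # fine: reshape copies when a view isn't possible
try:
    t.view(-1)                 # raises: view size is not compatible with ...
except RuntimeError as e:
    print(e)
```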
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131302
Approved by: https://github.com/muchulee8, https://github.com/desertfire
2024-07-23 10:15:28 +00:00
16247987a1 Add decomposition for t_copy (#130939)
* Extracted from #128416

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130939
Approved by: https://github.com/peterbell10
2024-07-23 08:29:19 +00:00
16a2a1aad3 Annotate graph.py (#131400)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131400
Approved by: https://github.com/shunting314
2024-07-23 07:04:12 +00:00
102d8e5a63 MPS LSTM backward kernel workaround on MacOS 14.4+ (#130038)
The bug causing the correctness problem will be fixed in a future OS release. The root cause is a bug in an optimization of the MPSGraph reshape operation in macOS 14.4 that results in a correctness issue for the shapes used by the LSTM gradient operation when num_layers > 2.

Fixes the silent correctness issue reported in #125803.
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130038
Approved by: https://github.com/malfet
2024-07-23 06:30:40 +00:00
29e2e2afb6 Revert D59561509: Multisect successfully blamed "D59561509: [FX][export] DCE pass, check schema for node impurity (#130395)" for one test failure (#131341)
Summary:
This diff reverts D59561509
D59561509: [FX][export] DCE pass, check schema for node impurity (#130395) by yushangdi causes the following test failure:

Tests affected:
- [cogwheel:cogwheel_mtia_cmf_m5_shrunk_test#test_flow_with_verification](https://www.internalfb.com/intern/test/844425041436985/)

Here's the Multisect link:
https://www.internalfb.com/multisect/6533402
Here are the tasks that are relevant to this breakage:
T191383430: 10+ tests unhealthy for ads_mtia_inference

The backout may land if someone accepts it.

If this diff has been generated in error, you can Commandeer and Abandon it.

Test Plan: NA

Differential Revision: D60029318

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131341
Approved by: https://github.com/angelayi
2024-07-23 05:23:47 +00:00
b2ad16f01d avoid OpOverloadPacket.__getattr__ calls in inductor lowering (#131348)
We have seen stacktrace samples showing that a lot of compilation time is spent in exceptions raised in `OpOverloadPacket.__getattr__`. It's not entirely clear why/how this happens, but I spot-checked a few places in `_inductor.graph.py` where we previously may have been calling `hasattr(OpOverloadPacket, ...)` and where that can be avoided (hasattr goes through getattr, which, for OpOverloadPacket, does a lookup in the dispatch table for all overload names of the packet).
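
For context, a small sketch of the pattern being avoided:

```python
import torch

pkt = torch.ops.aten.add  # an OpOverloadPacket
# hasattr goes through OpOverloadPacket.__getattr__, which consults the
# dispatcher and raises (and swallows) an AttributeError on a miss.
print(hasattr(pkt, "Tensor"))                    # True: a real overload name
print(hasattr(pkt, "definitely_not_an_overload"))  # False, but only after a raised exception
```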

Test Plan: CI

Differential Revision: D60048270

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131348
Approved by: https://github.com/davidberard98
2024-07-23 04:30:04 +00:00
99d9b369f4 [Optim] Support tensor lr for all optimizers and check it is 1-element (#131065)
Fixes: #130980
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131065
Approved by: https://github.com/janeyx99
2024-07-23 04:27:05 +00:00
781189f25d Add nvjitlink to the list of loadable global deps (#131295)
To fix the cusparse dependency resolution in CUDA-12.x, that has nvJitLink dependency:
```
$ ldd -r /usr/local/cuda-11.8/lib64/libcusparse.so.11.7.5.86
	linux-vdso.so.1 (0x00007ffea6f51000)
	libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fb13306f000)
	librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fb133065000)
	libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fb13305f000)
	libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fb132f10000)
	libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fb132eeb000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fb132cf7000)
	/lib64/ld-linux-x86-64.so.2 (0x00007fb143db7000)
$ ldd -r /usr/local/cuda-12.1/lib64/libcusparse.so.12.1.0.106
	linux-vdso.so.1 (0x00007ffc41909000)
	libnvJitLink.so.12 => /usr/local/cuda-12.1/lib64/libnvJitLink.so.12 (0x00007f3916b38000)
	libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f3916aea000)
	librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f3916ae0000)
	libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f3916ada000)
	libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f391698b000)
	libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f3916964000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f3916772000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f3929a8c000)
```
Fixes #131284
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131295
Approved by: https://github.com/malfet
2024-07-23 04:26:33 +00:00
02cd4dbcf4 [BE][CI] Get rid of duplicated code (#131406)
Followup after https://github.com/pytorch/pytorch/pull/131061 Define `run_if_exists` function that runs cpp test if it exists and prints a warning otherwise.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131406
Approved by: https://github.com/kit1980, https://github.com/atalman
2024-07-23 04:01:13 +00:00
35a0e0f018 [tp] improve SequenceParallel and its documentation (#131346)
The SequenceParallel style assumes the input torch.Tensor is ALREADY sharded on
the sequence dimension if a DTensor is not passed in. Since this causes some
user confusion with the documentation, this PR:

1. for the case where the input passed in is already a DTensor, checks the
   input placements and redistributes if it's not sharded on the sequence
   dimension
2. updates the doc to be more explicit about the cases where the user
   passes in a torch.Tensor vs. a DTensor

This would fix https://github.com/pytorch/pytorch/issues/129355
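
A minimal usage sketch (assumes a 2-GPU job launched with torchrun; the module and sizes are illustrative only):

```python
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import parallelize_module, SequenceParallel

mesh = init_device_mesh("cuda", (2,))
norm = nn.LayerNorm(64)
# A plain torch.Tensor input must already be sharded on the sequence dim;
# a DTensor input is redistributed to that sharding if needed.
sp_norm = parallelize_module(norm, mesh, SequenceParallel())
```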

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131346
Approved by: https://github.com/awgu
2024-07-23 03:57:01 +00:00
12434504a2 [c10d] remove non-necessary tests (#131212)
As titled: comm tensor is not being actively used anymore, since we adopted
functional collectives as our collective tracing approach.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131212
Approved by: https://github.com/XilunWu
2024-07-23 03:48:55 +00:00
8a591da3e7 [CI] Enable AOT inductor in cpu performance smoke test (#130097)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130097
Approved by: https://github.com/chuanqi129, https://github.com/desertfire
2024-07-23 03:44:13 +00:00
6cbb1437c1 Revert "Add sparse block to flex_decoding kernel (#130884)"
This reverts commit 0bf59db6cc076468f44197f0d7ee41f6204c47c2.

Reverted https://github.com/pytorch/pytorch/pull/130884 on behalf of https://github.com/atalman due to Sorry reverting test_causal_full_mask_vs_sdpa constantly failing on trunk ([comment](https://github.com/pytorch/pytorch/pull/130884#issuecomment-2244113663))
2024-07-23 02:10:14 +00:00
28b0ad4f46 [PT2] Minor fix in signpost (#131332)
Summary: compile_id is a named Tuple. We want to log signposts.

Test Plan:
Run e2e job.
Confirm this shows up correctly.
{F1767320364}

Differential Revision: D60045020

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131332
Approved by: https://github.com/oulgen
2024-07-23 01:56:00 +00:00
b435d84261 Revert "[custom ops] Add register_vmap for custom ops (#130589)"
This reverts commit 074b42064195c45471912f851e94c753992a9a1f.

Reverted https://github.com/pytorch/pytorch/pull/130589 on behalf of https://github.com/atalman due to Please fix lint and reland ([comment](https://github.com/pytorch/pytorch/pull/130589#issuecomment-2244092174))
2024-07-23 01:44:44 +00:00
8963623494 Re-implement pin_memory to be device-agnostic by leveraging the Accelerator concept (#126376)
This PR re-implements pin memory, aiming to get rid of the optional `device` argument and make all related APIs device-agnostic. We add two new abstract APIs in [AcceleratorHooksInterface](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/detail/AcceleratorHooksInterface.h#L12) and redefine pin memory as: "Pin memory is always pinned for the current accelerator device". In detail, it uses [getAcceleratorHooksInterface](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/Context.h#L61) in pin_memory/is_pinned to get an appropriate device and invoke the corresponding overridden interfaces, instead of using BackendSelect and then dispatching to CUDA or other specific backends' implementation methods.

Note: new backends that want to implement and use pin memory just need to inherit from AcceleratorHooksInterface and override the `isPinnedPtr` and `getPinnedMemoryAllocator` methods.

Additional context: to avoid BC-breaking, this PR preserves the `device` arg of the related APIs and throws a deprecation warning if it is passed. Another PR will be submitted to update all PT callers (`Tensor.is_pinned()`, `Tensor.pin_memory()`, ...) to stop passing this arg. In the future, the `device` arg will actually be removed.
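
A small sketch of the user-visible behavior (assumes a machine with a CUDA or other accelerator device):

```python
import torch

x = torch.randn(1024)
pinned = x.pin_memory()     # pinned for the current accelerator, no device arg needed
print(pinned.is_pinned())   # True
# Per this PR, passing `device` still works but emits a deprecation warning:
pinned2 = x.pin_memory(device="cuda")
```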

Relates #124908
Relates #14560
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126376
Approved by: https://github.com/albanD
2024-07-23 01:44:15 +00:00
074b420641 [custom ops] Add register_vmap for custom ops (#130589)
Fixes #130284
Fixes #130653

- Add `torch.library.register_vmap` to custom ops
- Add `register_vmap` for the operators in custom_op_db.
- Make `torch.autograd.Function` support kwarg-only kwargs for vmap
- test operators in op_db with `tests/test_vmap`.
- change `test_vmap` to allow custom `out_dim` and allow "None" in `out_dim` when testing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130589
Approved by: https://github.com/zou3519
2024-07-23 00:54:52 +00:00
1e5ecc4277 move save/load from _export to export (#131353)
Test Plan: existing tests

Differential Revision: D60053905

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131353
Approved by: https://github.com/angelayi
2024-07-23 00:48:28 +00:00
26f7dd286b [export] Allow non-CIA ops to be preserved (#131075)
I feel like the semantics of `run_decompositions(preserve_ops, ...)` should be that we always preserve whatever operator is put into `preserve_ops`, even if it's not CIA?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131075
Approved by: https://github.com/bdhirsh
2024-07-23 00:41:48 +00:00
69b1999586 TunableOp size hotfix (#130800)
Fixes #130727.  GetSize calculation was incorrect for strided batched gemm.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130800
Approved by: https://github.com/xw285cornell
2024-07-22 23:42:26 +00:00
8ae1963a61 [Autograd] Cond Higher-Order Operation (#126911)
This is an updated PR to equip cond with the autograd feature and replaces the old [PR](https://github.com/pytorch/pytorch/pull/126007)

@ydwu4 I tried to incorporate your requests already.

Currently there are two problems that I struggle with solving:

1. There seems to be an import issue when trying to import cond in `torch/__init__.py`, see [here](8a704035c9/torch/__init__.py (L1914-L1916)). Therefore, I had to comment out those lines, which resolved the import issue, but I believe cond is then not properly exposed as torch.cond.
2. I am not entirely sure how to deal with the opinfo test in `hop_db.py`
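
A sketch of the intended user-facing behavior (assuming cond ends up exposed as torch.cond; exact invocation constraints, e.g. eager vs. under torch.compile, may differ while the problems above are resolved):

```python
import torch

def true_fn(x):
    return x.sin()

def false_fn(x):
    return x.cos()

x = torch.randn(4, requires_grad=True)
y = torch.cond(x.sum() > 0, true_fn, false_fn, (x,))
y.sum().backward()  # with this PR, gradients flow through the taken branch
```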

Co-authored-by: Yidi Wu <yidi@meta.com>
Co-authored-by: Xuehai Pan <XuehaiPan@outlook.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126911
Approved by: https://github.com/ydwu4
2024-07-22 23:18:19 +00:00
c74396e890 Revert "[c10d] remove non-necessary tests (#131212)"
This reverts commit 0c074352ab62acba22265d8f19ea95851ae61d0f.

Reverted https://github.com/pytorch/pytorch/pull/131212 on behalf of https://github.com/atalman due to sorry need to revert breaks OSS CI, module 'test_c10d_common' has no attribute 'CompilerTest' ([comment](https://github.com/pytorch/pytorch/pull/131212#issuecomment-2243961785))
2024-07-22 23:11:44 +00:00
f8f41dcb24 Revert "[inductor] Make UserDefinedTritonKernel a multi-output operation (#130832)"
This reverts commit deacc543f13067ab22e8fb2ab714a20dd60bb056.

Reverted https://github.com/pytorch/pytorch/pull/130832 on behalf of https://github.com/atalman due to broke periodic test ([comment](https://github.com/pytorch/pytorch/pull/130832#issuecomment-2243894772))
2024-07-22 22:10:02 +00:00
15eb10df02 Revert "[inductor] Use multiple outputs for flex-attention (#130833)"
This reverts commit 9df8ea1cf2d62bfe21b46188faea6ef2e29e5210.

Reverted https://github.com/pytorch/pytorch/pull/130833 on behalf of https://github.com/atalman due to broke periodic https://github.com/pytorch/pytorch/pull/130832 ([comment](https://github.com/pytorch/pytorch/pull/130833#issuecomment-2243890944))
2024-07-22 22:07:06 +00:00
f8875e8277 Revert "[inductor] Kill mark_node_as_mutating (#130834)"
This reverts commit 33f036a6f71b386d4ccb9a756ed892c144ec6a5f.

Reverted https://github.com/pytorch/pytorch/pull/130834 on behalf of https://github.com/atalman due to broke periodic https://github.com/pytorch/pytorch/pull/130832 ([comment](https://github.com/pytorch/pytorch/pull/130834#issuecomment-2243886215))
2024-07-22 22:02:43 +00:00
d33804f8b6 Replace manual parsing of "TMPDIR", "TMP", "TEMP" and "TEMPDIR" with std::filesystem::temp_directory_path() (#130842)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130842
Approved by: https://github.com/fegin
2024-07-22 21:49:33 +00:00
a136a7d623 [Functional Collective] enable custom work registration from python (#130354)
This PR does two things:
- Allow tensor -> work registration in Python via `torch._C._distributed_c10d.register_work`. Calling `torch.ops._c10d_functional.wait_tensor` on a tensor would trigger `.wait()` on the registered work object.
- Allow user-defined work object in Python to work with functional collectives.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130354
Approved by: https://github.com/wanchaol, https://github.com/fegin, https://github.com/wconstab
2024-07-22 21:45:19 +00:00
a3922acc06 [TD] More synonyms, new heuristic for test_public_bindings (#130397)
test_public_bindings should be run on anything that changes the public API. We still need to figure out what counts as the public API; currently I'm using anything in torch/.

flex_attention should be run on anything involving autograd
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130397
Approved by: https://github.com/malfet
2024-07-22 21:42:54 +00:00
0bf59db6cc Add sparse block to flex_decoding kernel (#130884)
fix typo

Finish flex_decoding block sparse

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130884
Approved by: https://github.com/drisspg
2024-07-22 21:29:43 +00:00
83b355bad5 [aoti] forward fix of D60006838, add back test_multiple_output_alias (#131331) (#131356)
Summary:

Forward fix of D60006838.

The unit test test_multiple_output_alias passed in OSS CI, but failing internally. So adding it back to skip list.

Test Plan: ci

Differential Revision: D60044926

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131356
Approved by: https://github.com/kit1980, https://github.com/malfet
2024-07-22 20:17:21 +00:00
e3eaa22126 [torchbench][multisect] Run accuracy check at Diff time (#131266)
Summary:
X-link: https://github.com/pytorch/benchmark/pull/2388

We can enable accuracy checks at Diff time since it is not a performance metric.

* Refactor the existing diff time test to use the new PT2 Benchmark Runner.
* Deprecate the speedup tests and enable the accuracy tests only. We rely on ServiceLab to perform performance testing and regression detection.

Test Plan:
Sandcastle CI

Or buck test command:

```
buck2 test 'fbcode//mode/opt' fbcode//pytorch/benchmark/fb/test_gpu:run_test_gpu -- test_training_resnet50_accuracy
```

Test UI: https://www.internalfb.com/intern/testinfra/testrun/1688850102375429

Reviewed By: oulgen

Differential Revision: D59825601

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131266
Approved by: https://github.com/oulgen
2024-07-22 20:14:28 +00:00
0c074352ab [c10d] remove non-necessary tests (#131212)
As titled: comm tensor is not being actively used anymore, since we adopted
functional collectives as our collective tracing approach.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131212
Approved by: https://github.com/XilunWu
2024-07-22 19:52:44 +00:00
781a33f5d8 Enable dynamic rollout for Linux trunk workflows (#131325)
Enables dynamic migration of jobs to the LF AWS account for the Linux trunk workflow.

The new runners are only given to people specified in this issue: https://github.com/pytorch/test-infra/issues/5132

This closes pytorch/ci-infra#250.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131325
Approved by: https://github.com/ZainRizvi
2024-07-22 19:43:24 +00:00
406f510f89 [c10d] add bfloat16 support for NAN check (#131131)
Summary:
Need another dispatcher macro to support more data types
Test Plan:
(sqzhang_1) [sqzhang@devgpu009.cln1 ~/pytorch (86fcae11)]$ python
test/distributed/test_c10d_nccl.py -k test_nan_assert_bfloat16
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:18:
checkForNaN: block: [0,0,0], thread: [85,0,0] Assertion
`!isnan(data[i])` failed.
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:18:
checkForNaN: block: [0,0,0], thread: [18,0,0] Assertion
`!isnan(data[i])` failed.
NCCL version 2.21.5+cuda12.0

devgpu009:1193787:1193787 [0] init.cc:1773 NCCL WARN Cuda failure
'device-side assert triggered'
.
----------------------------------------------------------------------
Ran 1 test in 9.416s

OK
Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131131
Approved by: https://github.com/wconstab, https://github.com/XilunWu
2024-07-22 19:41:19 +00:00
9e753d1f20 [AMD] catch exception when other processes belong to other users (#131018)
Summary:
It is a long-known pain point that, if other users are running things, the call to `torch.cuda.memory.list_gpu_processes()` will error out:
```
  torch.cuda.memory.list_gpu_processes()
  File "torch/cuda/memory.py", line 647, in list_gpu_processes
    procs = amdsmi.amdsmi_get_gpu_process_list(handle)  # type: ignore[attr-defined]
  File "amdsmi/py_interface/amdsmi_interface.py", line 1946, in amdsmi_get_gpu_process_list
    _check_res(
  File "amdsmi/py_interface/amdsmi_interface.py", line 510, in _check_res
    raise AmdSmiLibraryException(ret_code)
amdsmi.py_interface.amdsmi_exception.AmdSmiLibraryException: Error code:
	10 | AMDSMI_STATUS_NO_PERM - Permission Denied

```

So just catch this error

Test Plan: torch.cuda.memory.list_gpu_processes() no longer fails

Differential Revision: D59901053

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131018
Approved by: https://github.com/eqy, https://github.com/clee2000
2024-07-22 19:38:51 +00:00
23ae6e2eb3 [FSDP2] Removed state dict error for HSDP (#131320)
Fixes https://github.com/pytorch/torchtitan/issues/441#issuecomment-2241288906.

This PR avoids raising the 2D state dict error for HSDP, which does not depend on strided sharding.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131320
Approved by: https://github.com/wanchaol, https://github.com/weifengpy
2024-07-22 19:23:17 +00:00
d3556786b8 Blocklist certain modules for weights_only load (#131259)
Also bold certain text in the error message as suggested
<img width="3000" alt="Screenshot 2024-07-19 at 5 56 48 PM" src="https://github.com/user-attachments/assets/378f20c5-c6b2-4e53-8eaf-0bd26c3a6b60">

With a GLOBAL like `os.execv`, the error message is now as follows:

```python
File "/data/users/mg1998/pytorch/torch/serialization.py", line 1256, in load
    raise pickle.UnpicklingError(_get_wo_message(str(e))) from None
_pickle.UnpicklingError: Weights only load failed. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
Trying to load unsupported GLOBAL posix.execv whose module posix is blocked.

Check the documentation of torch.load to learn more about types accepted by default with weights_only https://pytorch.org/docs/stable/generated/torch.load.html.
```
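
For reference, a small sketch of the default-allowed path (plain tensors and containers still load fine under `weights_only=True`; only disallowed GLOBALs such as `os.execv` are refused):

```python
import io
import torch

buf = io.BytesIO()
torch.save({"weight": torch.ones(2, 2)}, buf)
buf.seek(0)
state = torch.load(buf, weights_only=True)  # succeeds: nothing outside the allowlist
```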

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131259
Approved by: https://github.com/malfet, https://github.com/albanD
2024-07-22 18:23:21 +00:00
93ef2e53f8 [3.13, dynamo] support FORMAT_SIMPLE/FORMAT_SPEC (#130751)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130751
Approved by: https://github.com/Skylion007
ghstack dependencies: #130566, #130567, #130568, #130569
2024-07-22 18:07:40 +00:00
375a4d7e9e [3.13, dynamo] decompose fused load/store instructions (#130569)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130569
Approved by: https://github.com/jansel
ghstack dependencies: #130566, #130567, #130568
2024-07-22 18:07:40 +00:00
157f38bc4d [3.13, dynamo] support STORE_FAST_LOAD_FAST (#130568)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130568
Approved by: https://github.com/jansel
ghstack dependencies: #130566, #130567
2024-07-22 18:07:35 +00:00
1e116c7a1e [3.13, dynamo] fix END_FOR (#130567)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130567
Approved by: https://github.com/jansel
ghstack dependencies: #130566
2024-07-22 18:07:32 +00:00
4319147ca9 [3.13, dynamo] fix closures, MAKE_FUNCTION, LOAD_CLOSURE; support SET_FUNCTION_ATTRIBUTE (#130566)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130566
Approved by: https://github.com/jansel
2024-07-22 18:07:28 +00:00
44e689d947 Revert "[TD] More synonyms, new heuristic for test_public_bindings (#130397)"
This reverts commit d8a35d57220cdd5ed2fe52c02bb1f78cc0b3c75b.

Reverted https://github.com/pytorch/pytorch/pull/130397 on behalf of https://github.com/clee2000 due to broke lint, probably a landrace ([comment](https://github.com/pytorch/pytorch/pull/130397#issuecomment-2243518651))
2024-07-22 18:03:22 +00:00
56bb047449 [pt2] Increase dynamo/inductor default log level to info (#131311)
Summary: Avoid the logs being too verbose

Test Plan: CI

Differential Revision: D60028647

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131311
Approved by: https://github.com/oulgen
2024-07-22 17:33:29 +00:00
d8a35d5722 [TD] More synonyms, new heuristic for test_public_bindings (#130397)
test_public_bindings should be run on anything that changes the public API. We still need to figure out what counts as the public API; currently I'm using anything in torch/.

flex_attention should be run on anything involving autograd
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130397
Approved by: https://github.com/malfet
2024-07-22 17:06:00 +00:00
b9912f31ef Revert "[export] fix zero arg export in training_ir (#130990)"
This reverts commit 50436d5bdb5d2e29307a0c0bcfcce8d7e2da82c0.

Reverted https://github.com/pytorch/pytorch/pull/130990 on behalf of https://github.com/clee2000 due to failing some executorch and torchrec tests internally D60006710 ([comment](https://github.com/pytorch/pytorch/pull/130990#issuecomment-2243395316))
2024-07-22 16:49:25 +00:00
32c2f84e34 Support IPC for Expandable Segments (#130890)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130890
Approved by: https://github.com/dsjohns2
ghstack dependencies: #130888, #130889
2024-07-22 16:15:01 +00:00
0246b28510 [aoti] refactor aoti_torch__scaled_mm and skip aoti fp8 test for some cases (#130868)
Continuing https://github.com/pytorch/pytorch/pull/128683 and https://github.com/pytorch/pytorch/pull/130582.

The API of `_scaled_mm` has changed; for example, there is only one return value now. So change the AOTI API as well.

Also, ran the fp8 tests offline. test_fp8_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface would fail with `error: use of undeclared identifier 'float8_e4m3fn'` and `error: use of undeclared identifier 'half'`, so they are skipped for now.

The reason this wasn't known earlier is probably because the CI doesn't use H100.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130868
Approved by: https://github.com/drisspg, https://github.com/chenyang78, https://github.com/desertfire
2024-07-22 15:24:20 +00:00
5b5e0698a5 Add wrappers for synchronous GPUDirect Storage APIs (#130633)
Based in part on https://github.com/NVIDIA/apex/pull/1774

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130633
Approved by: https://github.com/albanD
2024-07-22 14:51:24 +00:00
5c78581fc9 Fix documentation for tensor.repeat. (#131195)
Fixes #130930.

Adjusts the documentation which used `sizes` instead of `repeats`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131195
Approved by: https://github.com/mikaylagawarecki, https://github.com/soulitzer
2024-07-22 14:48:18 +00:00
26383a6cc0 Revert "Added and_masks and or_masks utilities (#131073)"
This reverts commit 92bb323d36adca097c44a2fc8d9f0d574214d801.

Reverted https://github.com/pytorch/pytorch/pull/131073 on behalf of https://github.com/albanD due to The docs build fails here and in trunk ([comment](https://github.com/pytorch/pytorch/pull/131073#issuecomment-2242997958))
2024-07-22 13:44:55 +00:00
3eb9fa5d58 Add support for using LF Canary runners (#131188)
The script is updated such that if a canary build is detected and the label_type is an LF runner, it will run on an LF Canary runner.

Closes pytorch/ci-infra#245.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131188
Approved by: https://github.com/ZainRizvi
2024-07-22 13:26:46 +00:00
eqy
69e2590490 Fix MKLDNN check in test_aot_inductor.py (#130982)
`torch.ops.mkldnn._is_mkldnn_bf16_supported()` assumes MKLDNN is on the system, which isn't the case for, e.g., some ARM system configurations.
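
A minimal sketch of the kind of guard this implies (the actual check in test_aot_inductor.py may differ):

```python
import torch

# Only probe bf16 support when MKLDNN is actually available; otherwise treat it
# as unsupported. The short-circuit avoids touching torch.ops.mkldnn on systems
# without MKLDNN.
mkldnn_bf16_supported = (
    torch.backends.mkldnn.is_available()
    and torch.ops.mkldnn._is_mkldnn_bf16_supported()
)
print(mkldnn_bf16_supported)
```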

CC @tinglvv @nWEIdia

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130982
Approved by: https://github.com/malfet
2024-07-22 11:58:18 +00:00
92bb323d36 Added and_masks and or_masks utilities (#131073)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131073
Approved by: https://github.com/drisspg
ghstack dependencies: #130871, #130904
2024-07-22 11:48:03 +00:00
68df24f9b6 [xla hash update] update the pinned xla hash (#126672)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126672
Approved by: https://github.com/pytorchbot
2024-07-22 11:35:36 +00:00
6d65a2c3f4 [3/N] Non-Tensor: Support string parameter for aten operations (#125831)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125831
Approved by: https://github.com/jansel, https://github.com/jgong5
2024-07-22 09:42:35 +00:00
8da19fec60 [Inductor] Support store SPIR-V binary file output from Intel Triton. (#130849)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130849
Approved by: https://github.com/peterbell10, https://github.com/EikanWang
2024-07-22 05:59:03 +00:00
2820e1d9f8 Update CPython support policy (#130989)
Update as specified in the RFC that was accepted: https://github.com/pytorch/rfcs/blob/master/RFC-0038-cpython-support.md
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130989
Approved by: https://github.com/seemethere
2024-07-22 05:29:07 +00:00
1614891946 [Profiler] exclude gpu_user_annotation when accumulating cuda time total (#130733)
Fixes #[130730](https://github.com/pytorch/pytorch/issues/130730)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130733
Approved by: https://github.com/aaronenyeshi
2024-07-22 04:35:21 +00:00
c2425a3b57 [BE] Use _linux-build.yml instead of -linux-build-label.yml flavor (#130762)
It was also introduced during the ARC experiment and was supposed to be a temporary thing.
Fix `use_split_build` option handling in `_linux_build.yml`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130762
Approved by: https://github.com/Skylion007, https://github.com/atalman, https://github.com/jeanschmidt
2024-07-21 23:17:17 +00:00
500cbb5b90 Add decomposition for view_copy (#130938)
* Extracted from #128416
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130938
Approved by: https://github.com/peterbell10
ghstack dependencies: #130937
2024-07-21 20:39:24 +00:00
f628813066 Fix out_wrapper, _make_copy_from_view to handle all signatures (#130937)
* See #128416 and #129476
* Simplify xskip lists in test/functorch/test_ops.py
* Add supports_out=True to OpInfos for copy ops
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130937
Approved by: https://github.com/peterbell10
2024-07-21 20:39:24 +00:00
b193894b94 FakeTensor cache SymInt support (#127596)
Adds support for SymInts in the FakeTensor cache.

A couple notes:
1. When a SymInt is present in the input key for a FakeTensor operation we cache on the ShapeEnv instead of using the FakeTensorMode cache. This is necessary so we don't have to remember and check the guards. It reduces the cache hits but there's diminishing return on how much work we can do before the cache becomes more of a burden than a gain.
2. We need to be careful that when we cache an output SymInt that is a direct copy from the input that when we have a cache-hit we copy the SymNode from the input to the output. This is important because the fx-graph building code actually uses SymNode ids in the process of building the graph so constructing a same-content-but-different-id SymNode will fail.
3. In the cache key we store SymInts as a _PySymInputStub. These represent a SymInt (and friends) but support `__hash__` and `__eq__` (which SymInt does not); see the sketch after this list.
4. In the cache entry we store SymInts as a _SymIntOutputStub.
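
A minimal sketch of the idea behind such an input stub (class and attribute names here are assumptions, not the actual `_PySymInputStub`):

```python
import torch

class SymIntKeyStub:
    """Hashable wrapper so a SymInt can participate in a dict-based cache key."""

    def __init__(self, sym: torch.SymInt):
        self.node = sym.node          # keep the SymNode so it can be copied to outputs on a hit
        self._expr = sym.node.expr    # sympy expression used for hashing/equality

    def __hash__(self) -> int:
        return hash(self._expr)

    def __eq__(self, other: object) -> bool:
        return isinstance(other, SymIntKeyStub) and self._expr == other._expr
```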

Perf example:
```
python benchmarks/dynamo/timm_models.py --ci --accuracy --timing
--explain --inductor --dynamic-shapes --dynamic-batch-only --device cuda
--training --amp --total-partitions 2 --partition-id 0 --output
/tmp/training_timm_models.csv --filter crossvit_9_240
```
fake tensor cache before:
```
INFO: FakeTensor cache stats:
INFO:   cache_hits: 68137
INFO:   cache_misses: 837
INFO:   cache_bypasses:
INFO:     symbolic shape:            48224
INFO:     CompositeImplicitAutograd: 917
INFO:     non-fake tensor:           70
INFO:     non-FakeTensor output:     62
INFO:     non-builtin:               8
INFO:     dynamic output shape:      1
```
and after:
```
INFO: FakeTensor cache stats:
INFO:   cache_hits: 88187
INFO:   cache_misses: 14233
INFO:   cache_bypasses:
INFO:     CompositeImplicitAutograd: 1037
INFO:     non-FakeTensor output:     602
INFO:     non-fake tensor:           70
INFO:     unsafe view:               36
INFO:     non-builtin:               8
INFO:     dynamic output shape:      1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127596
Approved by: https://github.com/eellison
ghstack dependencies: #131014, #129780
2024-07-21 19:26:38 +00:00
ebce85172e FakeTensor cache SymInt support: flatten cache key (#129780)
This is part of #127596, pulled out to make reviewing a little easier.

Flatten the FakeTensor cache key - so it's a list of singular elements and pointing at one requires a single index rather than a PyTree path.  This is used in the next PR to allow us to have the cache entry refer to an input SymInt that it needs to copy directly into the output.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129780
Approved by: https://github.com/oulgen, https://github.com/eellison
ghstack dependencies: #131014
2024-07-21 19:26:38 +00:00
f3562e2cdc backport dataclass(slots=True) (#131014)
Python 3.10 adds `@dataclass(slots=True)` to auto-build the `__slots__` for a dataclass. This is really useful but we can't use it until 3.10 becomes our minimum version.

Copied the code for that functionality from python into a new decorator and ported it to use 3.8 syntax (removed use of `match`).

Usage:
```
@dataclass_slots
@dataclass
class X:
  pass
```
is the same as (in py3.10):
```
@dataclass(slots=True)
class X:
  pass
```
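
A rough sketch of the approach (illustrative only; this is not the helper added in this PR, though it is modeled on CPython's own slots handling):

```python
import dataclasses

def dataclass_slots(cls):
    """Recreate a dataclass with __slots__ built from its fields (minimal sketch)."""
    field_names = tuple(f.name for f in dataclasses.fields(cls))
    cls_dict = dict(cls.__dict__)
    cls_dict["__slots__"] = field_names
    for name in field_names:
        # Defaults live on the class; drop them so they don't clash with the slot descriptors.
        # The generated __init__ already carries the default values.
        cls_dict.pop(name, None)
    cls_dict.pop("__dict__", None)      # instances get slots instead of a __dict__
    cls_dict.pop("__weakref__", None)
    return type(cls)(cls.__name__, cls.__bases__, cls_dict)

@dataclass_slots
@dataclasses.dataclass
class Point:
    x: int = 0
    y: int = 0

p = Point(1, 2)
assert not hasattr(p, "__dict__")  # attributes are stored in slots now
```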

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131014
Approved by: https://github.com/oulgen, https://github.com/eellison
2024-07-21 19:26:31 +00:00
1439bd3c9c [Easy][pytree] enable CXX pytree under torch::deploy (#130144)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130144
Approved by: https://github.com/zou3519
ghstack dependencies: #130895, #130139
2024-07-21 07:36:22 +00:00
ddde9dd25c [dynamo][automatic_dynamic] Trigger dynamism on stride changes (#130232)
Fixes https://github.com/pytorch/pytorch/issues/129798

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130232
Approved by: https://github.com/ezyang
2024-07-21 03:45:54 +00:00
e506dfa640 [dynamo] Add a JK kill switch for disabling compile (#131258)
Summary: The JK disables dynamo by passing None to set_eval_frame.

Test Plan:
Ran buck test mode/opt caffe2/test/dynamo:test_dynamo

Buck UI: https://www.internalfb.com/buck2/1fec33b4-c95a-4bdf-b47b-7c0b8ab9e24a
Test UI: https://www.internalfb.com/intern/testinfra/testrun/2814750010105363
Network: Up: 0B  Down: 0B
Jobs completed: 9596. Time elapsed: 28:54.5s.
Tests finished: Pass 4796. Fail 0. Fatal 0. Skip 17. Build failure 0

Also manually wrote a small local test with torch.compile and toggled the code to see if PT2 can be disabled. Validated by running the test and observing the log.

PT2 enabled: P1486847242. Can see dynamo log about graph breaks.
PT2 disabled: P1486847727. No dynamo log. The newly added warning printed.

Reviewed By: ezyang

Differential Revision: D59968925

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131258
Approved by: https://github.com/c00w
2024-07-21 01:22:31 +00:00
cyy
1d1d074072 [3/N] Fix Wunused-parameter warnings (#131271)
Follows #131170

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131271
Approved by: https://github.com/ezyang
2024-07-20 23:31:03 +00:00
d57af32e63 Fix undefined tensor error in _copy_from_and_resize when fallback to cpu. (#130237)
1) Skip undefined tensors in the CPU fallback when calling _copy_from_and_resize;
2) Modify the to_cpu function to support optional tensors;
3) Copy back to the original optional tensor when alias_info's isWrite is true.

@ezyang @bdhirsh

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130237
Approved by: https://github.com/ezyang
2024-07-20 23:12:17 +00:00
13283fb4bc [distributed] test_store: remove flaky bind test (#131262)
Fixes https://github.com/pytorch/pytorch/issues/131084

There's no good way to fix this since some test environments can bind the protected range. Removing the test since its value is relatively low: it only tests error messages.

Test plan:

```
python test/distributed/test_store.py -v -k address
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131262
Approved by: https://github.com/mori360, https://github.com/XilunWu
2024-07-20 23:04:31 +00:00
407c87a32c [debug][dtensor] fixed updating current module (#130995)
**Summary**
Fixed issues with updating the current module when transitioning from a child module back to the parent module and in the backward pass. The first issue happened because the prehook is not called again when we go back to the parent module, and because the hook being used was a register_module_forward_hook, which runs before the register_module_hook used in redistribute, causing the collective call to be assigned to the incorrect module. To address this, I update the current module to be the parent module in a register_forward_hook in the module tracker. The second issue was caused by the parent set in the module tracker I inherit from being incorrect; I fixed it by saving the parents of each module and using them in the collective counter instead of the incorrect set. I have updated the example in module_operation_tracing to reflect the correct output. In addition, I changed the test cases that used the incompatible old CommDebugMode.

**Test Case**
1. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_operation_tracing

2. pytest test/distributed/_tensor/debug/test_comm_mode_features.py -s -k test_transformer_module_tracing

3. python test/distributed/_composable/fsdp/test_fully_shard_training.py -k TestFullyShardGradientAccumulation.test_gradient_accumulation

4. python test/distributed/_tensor/test_math_ops.py -k DistMathOpsTest.test_layer_norm_bwd

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130995
Approved by: https://github.com/XilunWu
ghstack dependencies: #130410
2024-07-20 20:57:29 +00:00
33f036a6f7 [inductor] Kill mark_node_as_mutating (#130834)
Resubmit of #129346

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130834
Approved by: https://github.com/lezcano
ghstack dependencies: #130831, #130832, #130833
2024-07-20 18:53:33 +00:00
fccbe85475 [BE] Improve CUDA UpSample error message (#131252)
`Expected grad_output.numel() <= std::numeric_limits<int32_t>::max() to be true` is not very helpful; it's better to mention the method name as well as the actual tensor size.

This error was reported in https://github.com/pytorch/pytorch/issues/131185

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131252
Approved by: https://github.com/albanD
2024-07-20 16:49:34 +00:00
a7a951a4ae [executorch hash update] update the pinned executorch hash (#130001)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Co-authored-by: Huy Do <huydhn@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130001
Approved by: https://github.com/pytorchbot
2024-07-20 16:44:07 +00:00
b6d477fd56 [BE][Easy][16/19] enforce style for empty lines in import segments in torch/_i*/ (#129768)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129768
Approved by: https://github.com/jansel
2024-07-20 16:20:58 +00:00
8e478d4fb1 Add Alban and Piotr into Core Maintainers (#130903)
See official announcement here: https://dev-discuss.pytorch.org/t/alban-desmaison-and-piotr-bialecki-are-now-pytorch-core-maintainers/2280

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130903
Approved by: https://github.com/albanD, https://github.com/Skylion007
2024-07-20 16:02:42 +00:00
637ab85e7f fix for launching kernel invalid config error when calling embedding … (#130994)
…with large index

Fixes #130806
When an output size of 2147483648 (=131072*16384) is expected in the above issue, it threw the following error:
RuntimeError: HIP error: invalid configuration argument

What happened was that the second parameter passed to hipLaunchKernel was an absurd {2147483648,1,1}.
Two issues were found in Indexing.cu:

1: ptrdiff_t was used, but it is a signed int; outTotalSize >= 2147483648 can cause overflow when doing [this](39493aa934/aten/src/ATen/native/cuda/Indexing.cu (L1367)):
2: On ROCm, std::min -> ::min did not work as expected when outTotalSize >= 2147483648

As a result, 2147483648 was passed to hipLaunchKernel as the number of threads per block, which the GPU does not support. The original code intended to set 128 threads per block; whether that is optimal for the latest, more powerful GPUs is debatable (perhaps a TODO item for perf), but at least it does not cause the `invalid configuration argument` error.
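
For reference, a sketch of the kind of call that produces such an output (sizes follow the linked issue; the embedding table size here is arbitrary, and a GPU with enough memory is needed to actually run it):

```python
import torch

# 131072 indices into an embedding of dim 16384 -> output numel = 131072 * 16384 = 2147483648,
# which overflows a 32-bit signed computation of the total output size.
emb = torch.nn.Embedding(32, 16384, dtype=torch.bfloat16, device="cuda")
idx = torch.randint(0, 32, (131072,), device="cuda")
out = emb(idx)
print(out.shape, out.dim(), out.numel())  # torch.Size([131072, 16384]) 2 2147483648
```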

[Test]
Ran the same code snippet from the [issue](https://github.com/pytorch/pytorch/issues/130806) and printed the output, its dim, and its numel(), which now looks like this:
```
output=tensor([[ 0.4044, -0.0244, -0.6865,  ..., -0.7800,  0.1175,  1.6726],
        [-1.0866, -0.1609,  0.3538,  ...,  1.9105,  0.7882,  1.1583],
        [-2.2079,  0.3736,  0.3610,  ..., -0.2658, -0.0459,  1.3077],
        ...,
        [ 0.8753, -0.7482, -0.1978,  ...,  0.9016,  1.1501, -0.5178],
        [-1.5845, -0.6277,  1.4520,  ...,  0.5733, -2.1198, -0.0915],
        [-0.6310, -1.0239, -0.1910,  ...,  0.4309,  0.1630,  0.3239]],
       device='cuda:0'), dim=2, numel=2147483648
```

Added a large tensor unit test too.
```
/pytorch# pytest test/nn/test_embedding.py -k test_large_tensors
================================================================================== test session starts ===================================================================================
platform linux -- Python 3.9.19, pytest-7.3.2, pluggy-1.4.0
rootdir: /dockerx/development/pytorch
configfile: pytest.ini
plugins: flakefinder-1.1.0, rerunfailures-14.0, xdist-3.3.1, xdoctest-1.1.0, cpp-2.3.0, hypothesis-5.35.1
collected 288 items / 287 deselected / 1 selected
Running 1 items in this shard

test/nn/test_embedding.py .                                                                                                                                                        [100%]

=========================================================================== 1 passed, 287 deselected in 3.16s ============================================================================
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130994
Approved by: https://github.com/jeffdaily, https://github.com/xw285cornell
2024-07-20 08:33:29 +00:00
a8319698b3 [inductor] [cpp] improve cache blocking with CPU info (#129348)
## Description
For the single-thread case, this PR improves the cache blocking in the CPP GEMM template using CPU info (the L1 and L2 cache sizes). `Mc_blocks` and `Kc_blocks` are calculated based on the conditions below (a rough sketch follows the list):
     - size_of_B < L1
     - size_of_A < 0.5 * L2
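
A rough sketch of deriving the blocks from these two conditions (the variable names, rounding, and example cache sizes are assumptions for illustration, not the PR's actual heuristic):

```python
def cache_blocks(M, N, K, l1_bytes, l2_bytes, dtype_size=2):
    # size_of_B = Kc * N * dtype_size should fit in L1
    Kc = max(1, min(K, l1_bytes // (N * dtype_size)))
    # size_of_A = Mc * Kc * dtype_size should use at most half of L2
    Mc = max(1, min(M, (l2_bytes // 2) // (Kc * dtype_size)))
    return Mc, Kc

# e.g. a bf16 GEMM with a 48 KiB L1 and a 2 MiB L2
print(cache_blocks(M=512, N=64, K=768, l1_bytes=48 * 1024, l2_bytes=2 * 1024 * 1024))  # (512, 384)
```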

For multi-thread, we need to tune the task decomposition among threads together with cache blocking. We disabled the cache blocking change for now and will submit a follow-up PR for multi-thread optimizations.

## Performance
No regressions. Models with > 3% performance speedup are listed below:

### BF16 single thread (measured on CPU with AMX support)
- static shape

| Model Family | Model Name | Speedup |
|--------------|------------|---------|
| torchbench | detectron2_fasterrcnn_r_101_dc5 | 4% |

- dynamic shape

| Model Family | Model Name | Speedup |
|--------------|------------|---------|
| torchbench | detectron2_fasterrcnn_r_101_dc5 | 4% |

### FP32 single thread (measured on Ice Lake)
- static shape

| Model Family | Model Name | Speedup |
|--------------|------------|---------|
| torchbench | basic_gnn_edgecnn | 10% |

- dynamic shape

| Model Family | Model Name | Speedup |
|--------------|------------|---------|
| torchbench | basic_gnn_edgecnn | 10% |

### Next step
The E2E level improvement is limited due to the below reasons:

- For several HF models, we observe a performance improvement at the kernel level for the gemm template kernel, but since that kernel is either still slower than the ATen kernel (and thus won't be selected during autotune) or only improves from slower than ATen to similar to ATen, we don't see an E2E-level performance change.

- There are models where the gemm template kernel gets a > 10% performance improvement with this PR, but since the kernel time is only about 3% of the E2E time, we don't observe a significant E2E-level improvement.

We will continue to find possible optimizations in the gemm template kernel in follow-up PRs.

Co-authored-by: Jiong Gong <jiong.gong@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129348
Approved by: https://github.com/jgong5, https://github.com/jansel
ghstack dependencies: #130675, #130690
2024-07-20 06:53:31 +00:00
0b44e1a74c [inductor][cpp][gemm] optimize arbitrary N in packed gemm template (#130690)
Currently we require `n % register_block_n == 0`, which typically brings good perf when `n` is a multiple of 8, 16, 32, etc., while falling back to the reference micro-gemm otherwise (where `register_block_n == 1`). This PR optimizes this by padding `n` to a multiple of `register_block_n` (8, 16, 32, etc.) for the packed weight, so the micro-gemm can work as-is on the padded `n`. When the weight is padded, we use a local accumulation buffer to collect the result from the micro-gemm and then unpad (slice) it before storing back to the output buffer. A small sketch of the padding arithmetic is shown below.
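
The helper name below is illustrative, not the template's actual code:

```python
import math

def pad_n(n: int, register_block_n: int) -> int:
    # Round n up to the next multiple of the register block size.
    return math.ceil(n / register_block_n) * register_block_n

n, block_n = 3073, 16
n_padded = pad_n(n, block_n)          # 3073 -> 3088, a multiple of 16
print(n_padded, n_padded % block_n)   # 3088 0
# The micro-gemm accumulates into an (m, n_padded) local buffer, which is then
# sliced back to (m, n) before being written to the output buffer.
```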

Performance numbers measured on "Intel (R) Xeon (R) CPU Max 9480", single core, bf16.

Before
AUTOTUNE linear_unary(512x768, 3073x768, 3073)
  _linear_pointwise 2.3563 ms 100.0%
  cpp_packed_gemm_0 710.5902 ms 0.3%

After
AUTOTUNE linear_unary(512x768, 3073x768, 3073)
  cpp_packed_gemm_0 1.8909 ms 100.0%
  _linear_pointwise 2.1016 ms 90.0%

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130690
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
ghstack dependencies: #130675
2024-07-20 06:30:15 +00:00
ebc012ace6 Add hooks for execution on intel gaudi devices - 1 (#128584)
## Motivation
This is a follow-up to PR https://github.com/pytorch/pytorch/pull/126970 to support Gaudi devices for PyTorch UT execution.

## Changes
We are adding additional hooks to:
1. Add dtype exceptions for Gaudi/HPU
2. Extend the onlyNativeDevices decorator functionality to add additional devices

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128584
Approved by: https://github.com/albanD
2024-07-20 05:03:36 +00:00
d31f2ae904 Ensure invariant that all inputs have tensor dict (#131249)
There was a path with freezing enabled that violated the invariant that all inputs have the "tensor_dict" meta. This ensures that `register_attr_or_module` also sets tensor_dict meta.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131249
Approved by: https://github.com/anijain2305
2024-07-20 04:40:58 +00:00
37337ef5c3 add some description on create_block_mask and mask mods (#131209)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131209
Approved by: https://github.com/joydddd
2024-07-20 04:40:48 +00:00
168c0e24a5 [IntraNodeComm] Fix some issues in two-shot all-reduce (#131244)
Two issues:
- Similar to https://github.com/pytorch/pytorch/pull/129501, two-shot all-reduce's reduction order was different across ranks. This PR fixes it.
- When migrating to use SymmetricMemory, I accidentally used `get_buffer_ptrs_dev` instead of `get_buffer_ptrs` (the former is an on-device array). This PR fixes it (for https://github.com/pytorch/pytorch/issues/131215).

The failing snippet provided by https://github.com/pytorch/pytorch/issues/131215 now works.
```python
import os
import torch
import torch.distributed as dist

def _get_global_rank() -> int:
    return int(os.environ.get("LOCAL_RANK", "0"))

def is_local():
    return _get_global_rank() == 0

def _get_world_size() -> int:
    return int(os.environ.get("LOCAL_WORLD_SIZE", "1"))

global_rank = _get_global_rank()
world_size = _get_world_size()
torch.cuda.set_device(global_rank)
dist.init_process_group(backend="nccl")
global_group = dist.group.WORLD
draft_group = dist.new_group([0,1])

inp = torch.full((128, 1, 4096), global_rank, dtype=torch.bfloat16, device="cuda")
dist.all_reduce(inp, group=global_group)
expect = sum(range(world_size))
assert inp.eq(expect).all()

if 0 <= global_rank < 2:
    inp = torch.full((128, 1, 2048), global_rank, dtype=torch.bfloat16, device="cuda")
    dist.all_reduce(inp, group=draft_group)
    expect = sum(range(2))
    assert inp.eq(expect).all()

torch.cuda.synchronize()
print("success")
dist.destroy_process_group()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131244
Approved by: https://github.com/weifengpy, https://github.com/Chillee
2024-07-20 02:51:45 +00:00
d2bd9acabd [BE] bump optree version to 0.12.1 (#130139)
0.12.0 Major Updates:

- Add context manager to temporarily set the dictionary sorting mode
- Add accessor APIs
- Use `stable` tag for `pybind11` for Python 3.13 support
- Fix potential segmentation fault for pickling support

0.12.1 Updates:

- Fix warning regression during import when launch with strict warning filters

Closes #130155
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130139
Approved by: https://github.com/zou3519
ghstack dependencies: #130895
2024-07-20 02:41:10 +00:00
50436d5bdb [export] fix zero arg export in training_ir (#130990)
Fixed TrainingIRToRunDecomp failures for test_tensor_attribute_zero_args, and also a few re-traceability failures caused by run_decomposition doing a retrace.

**edit:** also removed the eliminate_dead_code() in _unlift because of one onnx test failure:
a constant tensor attribute was lifted as a constant_tensor input but is not used in the graph after aot_autograd due to a shortcut in its decomposition. This caused the setattr to be removed by eliminate_dead_code while the graph signature still contained the name of that buffer, leaving an inconsistency between the transformed graph and the ep's original signature after _unlift. It seems this has happened a few times: some nodes get accidentally removed and we end up in an inconsistent state.

The alternative to removing it would be: every time we call eliminate_dead_code, verify the consistency of the graph against 1. the graph before the transformation and 2. all the metadata, but I think that deserves a complete design.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130990
Approved by: https://github.com/pianpwk
2024-07-20 02:35:13 +00:00
3c43fe068f [inductor] parallel compile: Create new pipes for subproc communication (#131194)
Summary: Rather than using stdin/stdout for IPC, we can create new pipes and pass the descriptors to the subproc via the cmd line. https://github.com/pytorch/pytorch/issues/131070 reports an issue where the combination of deepspeed and onnxruntime-training causes _something_ in the subproc to write to stdout and corrupt the IPC. The current implementation was already brittle; we can just create new pipes specifically for the IPC. A rough sketch of the idea is below.
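
The sketch below uses a hypothetical `worker.py` that reads the fd numbers from its argv; it illustrates the approach, not the actual inductor code:

```python
import os
import subprocess
import sys

# Dedicated pipes for parent<->child IPC, so stray writes to the child's
# stdout/stderr cannot corrupt the protocol.
child_read_fd, parent_write_fd = os.pipe()   # parent -> child
parent_read_fd, child_write_fd = os.pipe()   # child -> parent

proc = subprocess.Popen(
    [sys.executable, "worker.py", str(child_read_fd), str(child_write_fd)],
    pass_fds=(child_read_fd, child_write_fd),  # hand the descriptors to the subprocess
)

# Close the child's ends in the parent and talk over the parent's ends.
os.close(child_read_fd)
os.close(child_write_fd)
os.write(parent_write_fd, b"ping")            # request to the worker
# reply = os.read(parent_read_fd, 4)          # response, independent of stdout/stderr
```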

Test Plan: I was able to repro the MemoryError in https://github.com/pytorch/pytorch/issues/131070 by installing deepspeed and onnxruntime-training. Verified that this PR fixes it.

Differential Revision: [D59968362](https://our.internmc.facebook.com/intern/diff/D59968362)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131194
Approved by: https://github.com/malfet, https://github.com/eellison, https://github.com/atalman
2024-07-20 02:23:01 +00:00
9df8ea1cf2 [inductor] Use multiple outputs for flex-attention (#130833)
Resubmit of #129344

This fixes the DCE issue for attention output

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130833
Approved by: https://github.com/lezcano
ghstack dependencies: #130831, #130832
2024-07-20 02:05:10 +00:00
deacc543f1 [inductor] Make UserDefinedTritonKernel a multi-output operation (#130832)
Resubmit of #129325

Previously each mutation was represented by a `MutationOutput` operation, which
was a new scheduler node that had to be scheduled immediately afterwards.

Now we have a single scheduler node, which produces multiple `MutationOutput`
buffers as its output.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130832
Approved by: https://github.com/lezcano
ghstack dependencies: #130831
2024-07-20 02:05:10 +00:00
27c2a0d63b [inductor] Separate Buffer and Operation into two concepts (#130831)
Resubmit of #128893

Currently a buffer represents both a tensor with physical storage and a
computation that produces the tensor as a result.

This PR attempts to split these into two different concepts in the scheduler.
This should allow us to have multiple outputs from a single operation.

Differential Revision: [D59876059](https://our.internmc.facebook.com/intern/diff/D59876059)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130831
Approved by: https://github.com/lezcano
2024-07-20 02:05:07 +00:00
bb4251213b Add decomposition for channel_shuffle (#118775)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118775
Approved by: https://github.com/peterbell10
2024-07-20 01:24:41 +00:00
f0075c179b Pin sympy >= 1.13.0 (#130895)
------

The opposite of #130836. Pin `sympy >= 1.13.0` for Python >= 3.9 and `sympy == 1.12.1` for Python 3.8.

- #130836

See the PR description of #130836 for more details.

`sympy` 1.13.0 introduces some breaking changes which break our tests. More specifically:

- Ref [Backwards compatibility breaks and deprecations](https://github.com/sympy/sympy/wiki/release-notes-for-1.13.0#backwards-compatibility-breaks-and-deprecations)

> BREAKING CHANGE: Float and Integer/Rational no longer compare equal with a == b. From now on Float(2.0) != Integer(2). Previously expressions involving Float would compare unequal e.g. x*2.0 != x*2 but an individual Float would compare equal to an Integer. In SymPy 1.7 a Float will always compare unequal to an Integer even if they have the same "value". Use sympy.numbers.int_valued(number) to test if a number is a concrete number with no decimal part. ([#25614](https://github.com/sympy/sympy/pull/25614) by [@smichr](https://github.com/smichr))
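
A quick illustration of the change described in the release note above (assuming sympy >= 1.13 is installed):

```python
import sympy

# On sympy >= 1.13, a Float no longer compares equal to an Integer of the same
# value (per the quoted release note); sympy 1.12.x returned True here.
print(sympy.Float(2.0) == sympy.Integer(2))  # False on >= 1.13
```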

`sympy >= 1.13.0` is required to enable Python 3.13 support. This should be part of #130689.

- #130689

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130895
Approved by: https://github.com/ezyang
2024-07-20 00:59:24 +00:00
30d1826b2b Revert "[executorch hash update] update the pinned executorch hash (#130001)"
This reverts commit 4821f72457afd7b1b5c61c1c8c3c49105c1bd22d.

Reverted https://github.com/pytorch/pytorch/pull/130001 on behalf of https://github.com/clee2000 due to the test_sympy_utils failure is real, Dr. CI is wrong https://github.com/pytorch/pytorch/actions/runs/10015433275/job/27687163560 4821f72457 ([comment](https://github.com/pytorch/pytorch/pull/130001#issuecomment-2240807631))
2024-07-20 00:56:14 +00:00
cyy
cd8bbdc71a [2/N] Fix Wunused-parameter warnings (#131170)
Follows #130924
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131170
Approved by: https://github.com/mikaylagawarecki
2024-07-19 23:58:56 +00:00
207fb96155 [functorch] saved tensor hooks error should only apply to grad, vjp transforms. (#131191)
There's no reason to ban them for vmap or jvp, because without the
{grad, vjp} transforms those just act above PyTorch autograd, which will
end up saving regular Tensors.

Test Plan:
- some tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131191
Approved by: https://github.com/drisspg
2024-07-19 23:16:27 +00:00
4821f72457 [executorch hash update] update the pinned executorch hash (#130001)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130001
Approved by: https://github.com/pytorchbot
2024-07-19 23:10:20 +00:00
7c299b46ca Revert "Invalidate StorageImpl instances when tensor is overwritten with cudagraphs (#125264)"
This reverts commit 8390843eba6271dcdbec7d048e9fa4e56d4479d8.

Reverted https://github.com/pytorch/pytorch/pull/125264 on behalf of https://github.com/izaitsevfb due to breaks internal tests ([comment](https://github.com/pytorch/pytorch/pull/125264#issuecomment-2240516202))
2024-07-19 22:58:51 +00:00
35bf05561c [Inductor] B2B-GEMM performance tuning with test (#130778)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130778
Approved by: https://github.com/eellison
2024-07-19 22:53:57 +00:00
6657b14a64 [inductor] Fix the method for checking the variable type of entry.numel (#131026)
The data type of numel in the IterationRangesEntry class is sympy.Expr. To determine if it's an integer, we need to use sympy.Integer.
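
A small illustration of the distinction (constructed values are illustrative only; no IterationRangesEntry is built here):

```python
import sympy

numel = sympy.Integer(128)                 # numel is a sympy.Expr in IterationRangesEntry
print(isinstance(numel, int))              # False: a sympy Integer is not a Python int
print(isinstance(numel, sympy.Integer))    # True: this is the check to use
print(isinstance(sympy.Symbol("s0"), sympy.Integer))  # False for a symbolic numel
```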

Co-authored-by: peterbell10 <peterbell10@live.co.uk>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131026
Approved by: https://github.com/peterbell10
2024-07-19 22:51:11 +00:00
0e72baddf0 Revert "[easy][pytorch][counters] Move WaitCounter in c10/util (#131021)"
This reverts commit 0ca7b6ddd91192ebffd3c88bf314d07ba6cddf50.

Reverted https://github.com/pytorch/pytorch/pull/131021 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/131021#issuecomment-2240280827))
2024-07-19 21:56:09 +00:00
4aef5a1134 [c10] add an option to pg_config split share (#130877)
Summary:
context is: #129865
We want to give users an option to not share comms resources so that
comm ops can overlap
Test Plan:
Augmented UT

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130877
Approved by: https://github.com/fduwjj
2024-07-19 21:11:26 +00:00
0ca7b6ddd9 [easy][pytorch][counters] Move WaitCounter in c10/util (#131021)
Summary: Since the WaitCounter frontend itself has minimal dependencies, it's fine to move it into c10. Specific backends can be registered/linked separately.

Test Plan: unit test

Reviewed By: jamesperng, asiab4, c-p-i-o

Differential Revision: D59842868

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131021
Approved by: https://github.com/asiab4
2024-07-19 20:58:32 +00:00
c64ad2403c LF runners: Add new runner types for Amazon2023 AMIs (#131246)
Add new LF runner types with the Amazon2023 ami, matching the change done in https://github.com/pytorch/test-infra/pull/5487
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131246
Approved by: https://github.com/malfet
2024-07-19 20:30:41 +00:00
85ca88a2bb [Distributed][PP export] update tracing to handle autocast inclusion (#130998)
Fixes https://github.com/pytorch/pytorch/issues/128394

This updates PP export tracing to use a no_grad() context and to avoid predispatch.
This enables tracing for HF llama models that currently fail because the use of autocast in the Rope embeddings is not handled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130998
Approved by: https://github.com/fduwjj
2024-07-19 20:08:00 +00:00
ceee87df2e [export] modify export code owners (#130894)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130894
Approved by: https://github.com/zhxchen17
2024-07-19 19:49:34 +00:00
5f981388ec Revert "[ROCm] Enable ROCm support for inductor's dynamic_rblock_scaling (#129663)"
This reverts commit d7a78ec8b938a61297221912464f5afef288b823.

Reverted https://github.com/pytorch/pytorch/pull/129663 on behalf of https://github.com/atalman due to Breaks internal builds ([comment](https://github.com/pytorch/pytorch/pull/129663#issuecomment-2240011143))
2024-07-19 19:46:26 +00:00
125be005eb [Docs] Fix fake tensor doc (#131205)
Fix this: `# AttributeError: 'FakeTensorMode' object has no attribute 'from_real_tensor'`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131205
Approved by: https://github.com/eellison
2024-07-19 17:59:45 +00:00
e49c0acc39 [dynamo] Revert https://github.com/pytorch/pytorch/pull/130416 (#131058)
All the changes brought by the original PR have been addressed in alternative ways in this stack. Explaining why the original PR has to be reverted requires more effort; there is some bad interaction with export.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131058
Approved by: https://github.com/williamwen42
2024-07-19 17:26:24 +00:00
042be441ba [aoti] Unskip some aot inductor tests (#130973)
Trying to unskip some tests, and if they are still broken, add reasons.

## example testing command
```
pytest -v test/inductor/test_aot_inductor.py -k test_add_complex
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130973
Approved by: https://github.com/ColinPeppler
2024-07-19 17:19:35 +00:00
9b5c70878b [Fix] Missing parameter happens when retracing an already jit.scripted module (#129787)
#### Issue
Model parameters sometimes do not appear in the `named_parameters()` function, for example when trying to jit.trace an already jit.scripted model. This PR fixes that by relying on `state_dict` to get both parameters (`requires_grad=True`) and buffers.

#### Test Plan
* `pytest test/export/test_converter.py -s -k test_convert_retrace_nested_scripted_modules`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129787
Approved by: https://github.com/angelayi
2024-07-19 16:58:48 +00:00
abb3f2822c [aotinductor] Support additional lifted constants supplied to const folding. (#130743)
Summary:
In the export workflow, we always have a lifted graph which doesn't fetch constants through get_attr nodes. This causes compatibility issues when we're trying to use inductor's split_const_gm function with a lifted graph.

This diff makes an additive change to split_const_gm's interface such that, when the pass sees a placeholder node present in the lifted_constants table, it will also use that as a source of constness.

This change won't break existing code, and the lifted_constants table can be used orthogonally to the existing const folding mechanisms.

Also, as required by the MTIA team, we introduce a small callback function used to skip certain nodes during const folding.

For the internal followup counterpart, see D59685145

Test Plan: buck run mode/opt caffe2/test:test_export -- -r split_const_gm

Differential Revision: D59692790

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130743
Approved by: https://github.com/desertfire, https://github.com/SherlockNoMad
2024-07-19 16:48:56 +00:00
31e79aae6a Another follow up to #130260 (#130993)
Another followup to https://github.com/pytorch/pytorch/pull/130260
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130993
Approved by: https://github.com/huydhn
2024-07-19 16:43:54 +00:00
d4a79d4a7c Fix an example: Resolve broadcasting error in attn_bias and attn_mask… (#130209)
… addition, fix device assignment for newly created variables in method

Fix an example: Resolve broadcasting error in attn_bias and attn_mask addition, fix device assignment for newly created variables in method

1. `attn_bias += attn_mask` would cause a broadcasting error. Because the shape of `attn_bias` is (L, S), the output is also expected to have shape (L, S). When the shape of the input is (N, num_heads, L, S), broadcasting is triggered and the output shape becomes (N, num_heads, L, S), which is unexpected (see the small repro below).
2. `attn_bias` is a newly created variable in the method and is not assigned a device.
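
A small repro of point 1 (shapes here are illustrative):

```python
import torch

L, S, N, num_heads = 4, 6, 2, 8
attn_bias = torch.zeros(L, S)
attn_mask = torch.zeros(N, num_heads, L, S)

# In-place add fails: the broadcast result (N, num_heads, L, S) cannot be
# written back into the (L, S) tensor.
try:
    attn_bias += attn_mask
except RuntimeError as e:
    print("in-place add failed:", e)

attn_bias = attn_bias + attn_mask  # out-of-place broadcast works
print(attn_bias.shape)             # torch.Size([2, 8, 4, 6])
```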

**This is my retry of #130200 .** I used a wrong account in that pr.

Co-authored-by: mikaylagawarecki <mikaylagawarecki@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130209
Approved by: https://github.com/mikaylagawarecki
2024-07-19 15:23:22 +00:00
451fc029fe docs: note transposed weight initialisations (#130122)
Fixes #129834

Co-authored-by: mikaylagawarecki <mikaylagawarecki@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130122
Approved by: https://github.com/mikaylagawarecki
2024-07-19 15:23:03 +00:00
5f3d8b8788 Revert "[c10] add an option to pg_config split share (#130877)"
This reverts commit 367213a608528ee74e67e03bf11f775e263ef480.

Reverted https://github.com/pytorch/pytorch/pull/130877 on behalf of https://github.com/atalman due to breaks internal build ([comment](https://github.com/pytorch/pytorch/pull/130877#issuecomment-2239298810))
2024-07-19 14:24:50 +00:00
25d8a0480b [lint] Remove unnecessary BUCKRESTRICTEDSYNTAX suppressions
Differential Revision: D59935630

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131187
2024-07-19 07:19:11 -07:00
a6a2cd6257 Typo fix (#131037)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131037
Approved by: https://github.com/albanD, https://github.com/Skylion007
2024-07-19 13:17:54 +00:00
1b72cf0b09 Add hasattr for tensor variable (#131008)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131008
Approved by: https://github.com/anijain2305
ghstack dependencies: #131007
2024-07-19 12:43:27 +00:00
1f961ad495 Runs aten cuda cpp tests in CI (#131061)
It seems like these tests are never run because https://github.com/pytorch/pytorch/pull/99956 got rid of the `pushd $1` which would make the if conditions true in CUDA builds.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131061
Approved by: https://github.com/malfet, https://github.com/eqy
2024-07-19 12:35:33 +00:00
d7a78ec8b9 [ROCm] Enable ROCm support for inductor's dynamic_rblock_scaling (#129663)
As of ROCm 6.1, [hipDeviceProp_t::regsPerMultiprocessor](https://rocm.docs.amd.com/projects/HIP/en/latest/doxygen/html/structhip_device_prop__t.html#a7390d5b180d63978c81aa971060270b4) is now available, allowing us to enable this attribute on ROCm.
```
>>> torch.cuda.get_device_properties(0)
_CudaDeviceProperties(name='AMD Instinct MI250X/MI250', major=9, minor=0, gcnArchName='gfx90a:sramecc+:xnack-', total_memory=65520MB, multi_processor_count=104)
>>> torch.cuda.get_device_properties(0).regs_per_multiprocessor
65536
```

With https://github.com/triton-lang/triton/pull/3962 we can extract n_regs and n_spells from a Triton binary with the AMD backend, allowing us to enable inductor's dynamic_rblock_scaling on ROCm, initially implemented in https://github.com/pytorch/pytorch/pull/115094

Leaving this in draft until the following PRs have landed:
- https://github.com/pytorch/pytorch/pull/129361 to bump the triton commit pin
- https://github.com/pytorch/pytorch/pull/128449 to allow us to grab warp_size from device properties instead of hard coding 64 on ROCm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129663
Approved by: https://github.com/jansel, https://github.com/shunting314
2024-07-19 09:45:03 +00:00
cyy
feef057691 [1/N] Fix Wunused-parameter warnings (#130924)
Before we can turn Wunused-parameter into an error
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130924
Approved by: https://github.com/ezyang
2024-07-19 06:14:51 +00:00
eee76c86a8 Write trace_structured events to scuba (#130955)
Summary: https://fb.workplace.com/groups/1286739428954016/posts/1287192258908733

Test Plan: Run test with tlparse and inspect https://www.internalfb.com/intern/scuba/query/?dataset=pt2_trace_structured_events

Differential Revision: D59866096

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130955
Approved by: https://github.com/ezyang
2024-07-19 06:02:47 +00:00
982309b501 Initial commit of flight recorder trace (#130764)
Summary:
`fr_trace.py` is used to analyze flight recorder dump files.
This script was taken from @wconstab and @zdevito.
The only minor changes made were to keep the linter happy and to add a few new fields that I introduced in version `2.2` of the collector portions.

Test Plan:
Tested manually on some flight recorder data and it seems to run.

TODO:
Address the 15-odd `#type: ignore` comments that I put in there to keep the linter happy for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130764
Approved by: https://github.com/fduwjj
2024-07-19 06:00:54 +00:00
fd4899bc58 [ONNX] Run ruff pyupgrade to update type annotations (#130657)
Use the newest syntax for type annotations
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130657
Approved by: https://github.com/titaiwangms
2024-07-19 05:09:44 +00:00
4f60a2e39c Set correct output dtype for dequantize op during convert_pt2e in decomposed mode (#128953)
Earlier, the signature of dequantize ops for decomposed quantized Tensors was changed for wider use-cases where the output dtype can be different from torch.float and needs to be passed during dequantization.
Please refer to https://github.com/pytorch/pytorch/pull/121450

However, setting the correct output dtype for dequantize ops was still missing in the convert_pt2e flow.

This change enables the users to use PT2E quantization flow with non torch.float unquantized dtype, such as torch.bfloat16.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128953
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
2024-07-19 04:58:02 +00:00
d59803fb67 Refactored flexattention kernel (#130904)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130904
Approved by: https://github.com/drisspg
ghstack dependencies: #130871
2024-07-19 04:56:32 +00:00
ac76dd606f [dynamo] Alternative way to skip empty hooks guards on inbuilt nn modules (#131057)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131057
Approved by: https://github.com/williamwen42, https://github.com/jansel
ghstack dependencies: #131056
2024-07-19 04:42:38 +00:00
00e54e74ff [dynamo][cpp-guards] Fix bug in dict tags (#131056)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131056
Approved by: https://github.com/williamwen42, https://github.com/jansel
2024-07-19 04:42:38 +00:00
3c622fbcd3 [inductor] Fix var_to_range in IndexPropagation (#130984)
The current code assumes that indirect variables will be created by the
same `IndexPropagation` instance, however that isn't true in the case of
masked sub-blocks where we take in variables from the parent block.

This fixes the issue by moving the var range information up to the
`LoopBody` object where it can be shared by all sub-blocks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130984
Approved by: https://github.com/lezcano
2024-07-19 03:08:00 +00:00
b556d31586 Update torch-xpu-ops pin (ATen XPU implementation) (#131015)
Regular update.
1. 90 new ATen operators and their variants are supported for XPU.
2. Bug fixes: a. fix an out-of-bound memory access in the index_put kernel; b. fix a debug build error.
3. Binary change: split the device AOT code of SYCL kernels into multiple libraries to avoid linkage failures.
4. torch-xpu-ops test case enhancements: a. hook PyTorch testing ob_db to align opInfo configuration with CUDA; b. hook _check_arg_device2 and freeze_rng_state to make XPU happy.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131015
Approved by: https://github.com/EikanWang
2024-07-19 02:18:55 +00:00
52cb9abb1d Add deterministic support in nn.functional.interpolate for XPU (#129864)
For both CUDA and XPU, there is no native deterministic implementation of `aten::upsample_bilinear` and `aten::replication_pad`. CUDA leverages an operator decomposition path in the frontend hook `nn.functional.interpolate` as its deterministic implementation. The XPU backend uses the same solution in this PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129864
Approved by: https://github.com/dvrogozh, https://github.com/albanD, https://github.com/EikanWang
2024-07-19 02:15:42 +00:00
39493aa934 [inductor][cpp][gemm] move bias add to epilogue (#130675)
Speed up the bias-add compute by moving it to the epilogue. Performance numbers measured on "Intel (R) Xeon (R) CPU Max 9480", single core, bf16.
Before
AUTOTUNE linear_unary(512x768, 3072x768, 3072)
  cpp_packed_gemm_0 1.9200 ms 100.0%
  _linear_pointwise 1.9345 ms 99.3%

After
AUTOTUNE linear_unary(512x768, 3072x768, 3072)
  cpp_packed_gemm_0 1.8321 ms 100.0%
  _linear_pointwise 1.9246 ms 95.2%

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130675
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
2024-07-19 01:16:34 +00:00
5a6a806b19 [Inductor UT] Generalize device-bias code in case TestFxGraphCache.test_inductor_counters. (#131006)
[Inductor UT] Generalize device-bias code in case `TestFxGraphCache.test_inductor_counters`.
Fix #131005

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131006
Approved by: https://github.com/masnesral
2024-07-19 01:14:22 +00:00
208dffa702 [Compiled DDP] DDP + AC unit test (#130981)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130981
Approved by: https://github.com/fegin
2024-07-19 01:07:41 +00:00
cyy
3cc6183ce1 Fix getAugOp error (#131033)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131033
Approved by: https://github.com/ezyang
2024-07-19 01:07:24 +00:00
6e7b9ee8a0 [inductor] adapte windows file path (#130713)
This PR depends on https://github.com/pytorch/pytorch/pull/130132 being landed successfully.
The detailed log: https://github.com/pytorch/pytorch/issues/124245#issuecomment-2211889758

After the file path was adapted for Windows, the first Windows inductor case ran successfully.

```python
import torch

def foo(x, y):
    a = torch.sin(x)
    b = torch.cos(x)
    return a + b
opt_foo1 = torch.compile(foo)
print(opt_foo1(torch.randn(10, 10), torch.randn(10, 10)))
```

Result:
![image](https://github.com/user-attachments/assets/4944df47-e74d-476b-8eb5-1d1fd5abeb41)

Co-authored-by: Jiong Gong <jiong.gong@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130713
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/desertfire
2024-07-18 23:19:38 +00:00
e880cb2fe0 [ONNX] Remove beartype usage (#130484)
beartype has served us well in identifying type errors and ensuring we call internal functions with the correct arguments (thanks!). However, the value of having beartype is diminished because of the following:

1. When beartype improved support for Dict[] type checking, it discovered typing mistakes in some functions that were previously uncaught. This caused the exporter to fail with newer versions of beartype where it used to succeed. Since we cannot fix PyTorch and release a new version just because of this, it creates confusion for users who have beartype in their environment when they use torch.onnx.
2. beartype adds an additional call line in the traceback, which makes the already thick dynamo stack even larger, affecting readability when users diagnose errors with the traceback.
3. Since the typing annotations need to be evaluated, we cannot use new syntaxes like `|` because we need to maintain compatibility with Python 3.8. We don't want to wait for PyTorch to take py310 as the lowest supported Python before using the new typing syntaxes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130484
Approved by: https://github.com/titaiwangms
2024-07-18 22:07:40 +00:00
fb3674b1f4 Revert "[Autograd] Cond Higher-Order Operation (#126911)"
This reverts commit f7058b735e52a1d876912f8c96a594673a495007.

Reverted https://github.com/pytorch/pytorch/pull/126911 on behalf of https://github.com/clee2000 due to broke lint and functorch/test_aotdispatch f7058b735e Probably a landrace since both the test and lint passed on PR ([comment](https://github.com/pytorch/pytorch/pull/126911#issuecomment-2237703182))
2024-07-18 22:06:40 +00:00
686b7f046a [Fix]: TSConverter handles call ops with multiple outputs (#129294)
#### Issue
* Current call-op handling does not support IR with multiple outputs. If an op has multiple outputs, we add an implicit unpack to map the outputs. E.g.,
```
%5 : Tensor, %6 : Tensor = aten::max(%x.1, %3, %4), scope: export.test_converter.M:: # /data/users/jiashenc/pytorch/test/export/test_converter.py:774:20
```
* There are some cases where `prim::If` sub-blocks do not return any outputs. E.g.,
```
%9 : bool = aten::gt(%8, %3), scope: export.test_converter.M::/torch.nn.modules.pooling.AdaptiveMaxPool2d::pool # <string>:5:9
   = prim::If(%9), scope: export.test_converter.M::/torch.nn.modules.pooling.AdaptiveMaxPool2d::pool # <string>:5:2
    block0():
      -> ()
    block1():
       = prim::RaiseException(%5, %4), scope: export.test_converter.M::/torch.nn.modules.pooling.AdaptiveMaxPool2d::pool # <string>:5:2
      -> ()
```

#### Test Plan
We did an exhaustive search of all torch APIs that can return multiple outputs. We sampled some of the common ones and added new test cases based on those.
* `pytest test/export/test_converter.py -s -k test_ts2ep_multi_outputs_on_call_ops`

#### Appendix
* aten ops that return multiple outputs.
```
aten._batch_norm_impl_index
aten._batch_norm_no_update
aten._batch_norm_with_update
aten._batch_norm_with_update_functional
aten._cudnn_rnn
aten._efficient_attention_backward
aten._efficient_attention_forward
aten._embedding_bag
aten._embedding_bag_forward_only
aten._flash_attention_backward
aten._flash_attention_forward
aten._fused_adam
aten._fused_dropout
aten._fused_moving_avg_obs_fq_helper
aten._linalg_det
aten._linalg_eigh
aten._linalg_slogdet
aten._linalg_solve_ex
aten._linalg_svd
aten._native_batch_norm_legit
aten._native_batch_norm_legit_functional
aten._native_batch_norm_legit_no_training
aten._pack_padded_sequence
aten._prelu_kernel_backward
aten._scaled_dot_product_efficient_attention
aten._scaled_dot_product_efficient_attention_backward
aten._scaled_dot_product_flash_attention
aten._scaled_dot_product_flash_attention_backward
aten._scaled_dot_product_flash_attention_for_cpu
aten._scaled_dot_product_flash_attention_for_cpu_backward
aten._thnn_fused_lstm_cell
aten._thnn_fused_lstm_cell_backward_impl
aten._unique2
aten._weight_norm_interface
aten.adaptive_max_pool2d
aten.adaptive_max_pool3d
aten.aminmax
aten.batch_norm_backward
aten.convolution_backward
aten.cudnn_batch_norm
aten.cudnn_batch_norm_backward
aten.cummax
aten.cummin
aten.fractional_max_pool2d
aten.frexp
aten.grid_sampler_2d_backward
aten.grid_sampler_3d_backward
aten.gru
aten.linalg_cholesky_ex
aten.linalg_eig
aten.linalg_inv_ex
aten.linalg_ldl_factor_ex
aten.linalg_lu
aten.linalg_lu_factor_ex
aten.linalg_qr
aten.linear_backward
aten.log_sigmoid_forward
aten.lstm
aten.lu_unpack
aten.max
aten.max_pool2d_with_indices
aten.max_pool3d_with_indices
aten.median
aten.min
aten.miopen_batch_norm
aten.miopen_batch_norm_backward
aten.mkldnn_rnn_layer
aten.mkldnn_rnn_layer_backward
aten.mode
aten.multilabel_margin_loss_forward
aten.nanmedian
aten.native_batch_norm
aten.native_batch_norm_backward
aten.native_dropout
aten.native_group_norm
aten.native_group_norm_backward
aten.native_layer_norm
aten.native_layer_norm_backward
aten.nll_loss2d_forward
aten.nll_loss_forward
aten.quantized_gru
aten.quantized_lstm
aten.rnn_relu
aten.rnn_tanh
aten.sort
aten.std_mean
aten.topk
aten.triangular_solve
aten.unique_dim
aten.var_mean
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129294
Approved by: https://github.com/angelayi
2024-07-18 21:55:18 +00:00
7f1cda1533 Autoheuristic: Do not store choices as metadata (#130304)
While for optimizations like pad_mm there are always only two possible choices, for other decision procedures, like kernel choice selection, the set of "available" choices depends on the input. Instead of storing the choices as metadata, we can look at all choices for which we have collected data (i.e. `df[CHOICE_COL].unique()`).

In this PR, I also try to replace "choice" and "feedback" with global constants CHOICE_COL and FEEDBACK_COL.
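
A minimal sketch of deriving the available choices from the collected data rather than from stored metadata (the column-name constants here are assumptions):

```python
import pandas as pd

CHOICE_COL = "choice"      # assumed values of the global constants
FEEDBACK_COL = "feedback"

df = pd.DataFrame({
    CHOICE_COL:   ["pad", "no_pad", "pad", "no_pad"],
    FEEDBACK_COL: [1.20, 0.95, 1.15, 0.97],
})

choices = sorted(df[CHOICE_COL].unique())  # derive the choice set from the data itself
print(choices)  # ['no_pad', 'pad']
```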

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130304
Approved by: https://github.com/eellison
2024-07-18 21:39:42 +00:00
4d9f2a6d56 Small expandable segments refactor. (#130889)
Makes next PRs that will export/import segment handles easier to write.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130889
Approved by: https://github.com/dsjohns2
ghstack dependencies: #130888
2024-07-18 21:34:38 +00:00
d8fed480ef Move handle-creation logic into cudacaching allocator. (#130888)
A later PR will then make the handle abstract and able to use
either cudaMalloc or expandable segments.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130888
Approved by: https://github.com/dsjohns2
2024-07-18 21:34:38 +00:00
3e9cf1cc80 Fix potential segfault during deletion (#131036)
Summary: See comment in code

Test Plan: code reading

Reviewed By: albanD

Differential Revision: D59872819

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131036
Approved by: https://github.com/Skylion007, https://github.com/albanD
2024-07-18 21:18:31 +00:00
f7058b735e [Autograd] Cond Higher-Order Operation (#126911)
This is an updated PR to equip cond with the autograd feature and replaces the old [PR](https://github.com/pytorch/pytorch/pull/126007)

@ydwu4 I tried to incorporate your requests already.

Currently there are two problems that I struggle with solving:

1. There seems to be an import issue when trying to import cond in `torch/__init__.py`, see [here](8a704035c9/torch/__init__.py (L1914-L1916)). Therefore, I had to comment out those lines, which resolved the import issues, but I believe cond is not properly exposed as torch.cond.
2. I am not entirely sure how to deal with the opinfo test in `hop_db.py`

Co-authored-by: Yidi Wu <yidi@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126911
Approved by: https://github.com/ydwu4
2024-07-18 21:09:09 +00:00
24467ba2ec Update pin (#130896)
Test the XLA pin update
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130896
Approved by: https://github.com/anijain2305
2024-07-18 21:04:30 +00:00
793b17ebcb Add numeric_debugger top level APIs (#130643)
Summary:
Add three top-level APIs for the numeric debugger in the pt2e flow that can log intermediate outputs in the model
and calculate a summary of metric comparisons between nodes in two graphs:

* `prepare_for_propagation_comparison`
* `extract_results_from_loggers`
* `compare_results`

Test Plan:
python test/test_quantization.py -k test_prepare_for_propagation_comparison
python test/test_quantization.py -k test_extract_results_from_loggers

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130643
Approved by: https://github.com/dulinriley, https://github.com/tarun292
2024-07-18 20:54:18 +00:00
726b9268d2 Revert "Re-implement pin_memory to be device-agnostic by leveraging the Accelerator concept (#126376)"
This reverts commit c986aeea2d7d9403be702119e3dd4dcb18134fc2.

Reverted https://github.com/pytorch/pytorch/pull/126376 on behalf of https://github.com/atalman due to Failing internal builds ([comment](https://github.com/pytorch/pytorch/pull/126376#issuecomment-2237496633))
2024-07-18 20:25:20 +00:00
e7f7c5c3f8 [inductor] Avoid fallback case for custom scan op lowering (#130936)
We currently can't generate split scans when there are multiple scan
values, so we normally fall back to ATen. However, for the higher order
scan op, we can't fall back, so it makes sense to just generate the slower
kernel anyway. This avoids having special shapes where we fail to
codegen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130936
Approved by: https://github.com/lezcano
2024-07-18 19:53:47 +00:00
367213a608 [c10] add an option to pg_config split share (#130877)
Summary:
Context: #129865
We want to give users an option to not share comm resources so that
comm ops can overlap.
Test Plan:
Augmented UT

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130877
Approved by: https://github.com/fduwjj
2024-07-18 19:03:00 +00:00
c015e5b9e3 Make sure that TransformGetItemToIndex for all graph replay (#131003)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131003
Approved by: https://github.com/Chillee
ghstack dependencies: #130871
2024-07-18 18:32:21 +00:00
82242a258a rm duplicate index_dtype arg (#130803)
- Remove duplicate `index_dtype` argument for `_test_meta_sparse_compressed` operation.
- Also remove unused `y_v_numel` variable.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130803
Approved by: https://github.com/soulitzer
2024-07-18 18:30:13 +00:00
6d9f74f0af Add flex decoding benchmark (#130850)
ghstack-source-id: b4f26fb66ed47907b11580c8c853737959c58811
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130788

Add benchmark for flex decoding.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130850
Approved by: https://github.com/Chillee, https://github.com/drisspg
2024-07-18 18:09:25 +00:00
fff92d4f18 Revert "Use inductor TestCase for test_replicate_with_compiler.py (#129494)"
This reverts commit 9f392f8294e928aec49599ad649aa899e1356102.

Reverted https://github.com/pytorch/pytorch/pull/129494 on behalf of https://github.com/atalman due to broke internal tests ([comment](https://github.com/pytorch/pytorch/pull/129494#issuecomment-2237147504))
2024-07-18 17:42:05 +00:00
745324e487 [export] turn on hybrid symints by default (#130775)
Sets `prefer_deferred_runtime_asserts_over_guards=True` for export, so any guards emitted from `SymNode.expect_true` (for example, guards that are implicitly required to be true for an op to succeed) won't lead to constraint violations. Instead these should appear in the graph as runtime asserts, or potentially as replacement expressions for placeholder shapes.

For example, this reshape op should emit s0 * s1 = s2, deferred as a runtime assert.
```
x = torch.randn(4, 8)  # [s0, s1]
y = torch.randn(32)  # [s2]
out = x.reshape(-1) + y
# this emits Eq(s0 * s1, s2), and we represent y's shape as [s0*s1] in the graph.
```

However, other complex guards can still cause export to fail, for instance guards emitted from `SymNode.guard_bool/guard_size_oblivious` (e.g. explicit if-else conditions in user code or lower-level op implementations hit during tracing) can still raise constraint violations. These can be deferred with `allow_complex_guards_as_runtime_asserts=True`. We don't yet make this default, because while this makes export more likely to succeed, it results in non-trivial asserts being emitted that often represent specialization to a variant of the op, or checks related to 0/1 specialization.

We also remove forced specializations for export and kill the `_disable_forced_specializations` flag - now any guard we can't express with Dims/DerivedDims either are handled with Hybrid SymInts, or should be resolved with rewriting or deferring.

Follow up:
Currently, `ShapeEnv._set_replacement()` is called for complex equality expressions (e.g. s2 -> s0*s1 in the example above), and the ExportedProgram stores `s0*s1` in the input placeholder. This isn't checked for validity when the program is run, so an option is to avoid replacement and/or runtime assert on equality.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130775
Approved by: https://github.com/avikchaudhuri
2024-07-18 17:40:58 +00:00
22388ffe03 Graph break on tostring for numpy remapping (#131007)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131007
Approved by: https://github.com/williamwen42
2024-07-18 17:23:41 +00:00
8bf0be7c78 [CUDAGraph] Add operator.mul to skip list for find_input_mutations (#130986)
The #130912 error happens since `operator.mul` does not have `_schema`.

So why do we have `operator.mul` and why is it not dispatched to `torch.ops.aten.mul`? This op comes from %mul_3.

    %mul_3 : [num_users=50] = call_function[target=operator.mul](args = (%arg689_1, 4096), kwargs = {})

`%arg689_1` is a placeholder with `meta['val'] = s0`. It comes from dynamic shapes and represents the batch size, since it's also used in many other nodes such as:

    %view_1 : [num_users=1] = call_function[target=torch.ops.aten.view.default](args = (%mm, [%arg689_1, 4096, 320]), kwargs = {})
and

    %native_group_norm_2 : [num_users=1] = call_function[target=torch.ops.aten.native_group_norm.default](args = (%div_1, %arg16_1, %arg17_1, %arg689_1, 320, 4096, 32, 1e-06), kwargs = {})

To fix the issue, we can add `operator.mul` to the skip list.
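
For intuition, a small sketch (not the original repro) of how such a node can arise from dynamic shapes:

```python
import torch

@torch.compile(backend="eager", dynamic=True)
def f(x):
    # With dynamic shapes, x.size(0) is a SymInt; arithmetic on it is traced as a
    # call_function node whose target is the Python builtin operator.mul, not an aten op.
    return x.new_zeros(x.size(0) * 4096)

f(torch.randn(2, 8))
```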

Fixes #130912

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130986
Approved by: https://github.com/eellison
2024-07-18 17:11:39 +00:00
5979014059 DSD for TorchTune LoRA (#129635)
Fixes #128745
Solves the issue with conflicts when users use full_state_dict while the model is wrapped with FSDP.

Currently solves the issue for `full_state_dict=True`, which previously failed with the error
`'aten.copy_.default: got mixed torch.Tensor and DTensor, need to convert all torch.Tensor to DTensor before calling distributed operators!',).`

TODO: for `broadcast_from_rank0=True, full_state_dict=True`, the error is
`NotImplementedError: c10d::broadcast_: attempted to run this operator with Meta tensors`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129635
Approved by: https://github.com/fegin
2024-07-18 17:00:35 +00:00
5484c86021 [export] Fully support extension op in serialization/deserialization. (#130851)
Summary: Finishing up the mechanism to "register" certain types of operators to a registry so that the serializer can handle them correctly. This is expected to be firstly used by executorch.

Test Plan: buck run mode/opt caffe2/test:test_export -- -r test_export_with_extension_op_serialization

Differential Revision: D59825148

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130851
Approved by: https://github.com/angelayi
2024-07-18 16:47:53 +00:00
85451b2cde [DTensor] Fix shard_dim_alltoall fake tensor return (#129945)
The shard_dim_alltoall op has a Tensor return type in its schema (here: https://github.com/pytorch/pytorch/blob/main/torch/csrc/distributed/c10d/Functional.cpp#L628),
but its FakeTensor implementation returns a list of tensors (see the chunk() call here: https://github.com/pytorch/pytorch/blob/main/torch/distributed/_tensor/_collective_utils.py#L33).
So it would error out when device="meta".

This PR fixes the fake tensor mode return type for 1d mesh and adds a test to compare shape with non-meta tensor case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129945
Approved by: https://github.com/wanchaol
2024-07-18 16:43:40 +00:00
16aaff7783 Fix mm pad regression - more conservative estimation of plannable inputs (#128909)
- More conservative estimation of plannable inputs
- Consider constant_pad_nd as pointwise node in concat lowering
- Use aten.cat instead of constant_pad_nd when padding just a single dimension because it can be memory-planned away

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128909
Approved by: https://github.com/Chillee
2024-07-18 16:42:30 +00:00
27ded03545 [FX][export] DCE pass, check schema for node impurity (#130395)
Change the default DCE pass to check node schema for impure nodes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130395
Approved by: https://github.com/angelayi, https://github.com/jgong5
2024-07-18 16:31:40 +00:00
32ff04d30a [dtensor][debug] adding functionality to control noisiness of the debug output (#130410)
**Summary**
Currently, the output of CommDebugMode contains a lot of noise: operations such as aten.detach.default usually do not tell the user much. I have created a set of these trivial operations and added a noise_level argument so that users can choose how much information they want to receive.

- noise_level = 1 prints module-level collective counts
- noise_level = 2 prints operations not included in the trivial-operation set, plus module information
- noise_level = 3 prints all operations

In addition, I have removed the generate_module_tracing_table since noise_level = 1 essentially replaces it. Finally, I have updated the examples and test cases.

**Test Plan**
1. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_json_dump

2. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_json_dump

3. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_operation_tracing

4. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_operation_tracing

5. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_module_tracing

6. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_module_tracing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130410
Approved by: https://github.com/XilunWu
2024-07-18 16:12:59 +00:00
8ea03372a1 [MPS] Store philox counter as part of the RNG state (#130662)
Fixes #130613

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130662
Approved by: https://github.com/malfet
2024-07-18 15:57:28 +00:00
cyy
7c90a82970 [Reland] [5/N] Change static functions in headers to inline (#131010)
Reland of #130673

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131010
Approved by: https://github.com/Skylion007
2024-07-18 15:53:48 +00:00
d6ae8bbf16 Revert "[export] Add print_readable to unflattener (#128617)"
This reverts commit 9fee87e4cd9efb55ee5427a8e6b3c57de7c599f9.

Reverted https://github.com/pytorch/pytorch/pull/128617 on behalf of https://github.com/clee2000 due to broke inductor/test_flex_attention https://github.com/pytorch/pytorch/actions/runs/9984688318/job/27595182606 433ef4e444 Not run on PR due to bad TD ([comment](https://github.com/pytorch/pytorch/pull/128617#issuecomment-2236867975))
2024-07-18 15:31:51 +00:00
120fdf7ee2 Revert "[aota] Needs autograd if an input requires_grad, agnostic to enable_grad (#128890)"
This reverts commit e98135d1ad2f999fec649ecd21b35f3d5676be43.

Reverted https://github.com/pytorch/pytorch/pull/128890 on behalf of https://github.com/zou3519 due to broke trunk tests, probably a landrace ([comment](https://github.com/pytorch/pytorch/pull/128890#issuecomment-2236790805))
2024-07-18 14:58:25 +00:00
5a90ed3523 Reinplacing should ignore copy_ nodes where the mutated arg is not read (#130866)
Might fix #127660, need to test some more cases.

We update the reinplacing pass. If we have something like the following,
where "sin" is a custom op (this situation should also apply to triton
kernels)
```py
def graph(x):
    y = sin(x)
    z = sin(y)
    x.copy_(z)
```
then the reinplacer used to produce the following:
```py
"""step 1: reinplaces the first sin"""
def graph(x):
    x_clone = x.clone()
    sin_out(x, out=x_clone)
    z = sin(x_clone)
    x.copy_(z)

"""step 2: reinplaces the second sin"""
def graph(x):
    x_clone = x.clone()
    sin_out(x, out=x_clone)
    sin_out(x_clone, out=x_clone)
    x.copy_(x_clone)
```
However, the first clone is unnecessary. It is safe to reinplace
the first sin into the following:
```py
def graph(x):
    sin_out(x, out=x)
    z = sin(x)
    x.copy_(z)
```
because there are no users of `x`'s original value (the copy_ node
doesn't actually use the original value of x!)

This PR updates the reinplacing pass to ignore copy_ when computing whether
the original value of the mutated argument is still needed.

NB: this also applies to triton kernels, but it was easier for me to
reason about custom ops (and my repros were all for custom ops).

Test Plan:
- new tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130866
Approved by: https://github.com/oulgen
2024-07-18 13:47:54 +00:00
dd39dca034 Removing some cruft and updating signatures for consistency (#130871)
# Summary

- This removes a bunch of example score mods that were primarily used for testing and places them directly in the test file. We should follow up with merging test_flex_decode and test_flash when the velocity slows down a little
- Fixes a bug with indexing on block mask
- Adds some doc strings to helper funcs and fixes some misc typing things
- Forces functions passed to `create_block_mask` to be mask_mods and updates test files

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130871
Approved by: https://github.com/joydddd, https://github.com/Chillee
2024-07-18 13:32:11 +00:00
9f6db5d0e2 Revert "Ensure staticmethods can be allowed in graph (#130882)"
This reverts commit b0387449db41c90fb4226baea97a8d889a0951c4.

Reverted https://github.com/pytorch/pytorch/pull/130882 on behalf of https://github.com/atalman due to failing torchrec tests internally, please fix and reland ([comment](https://github.com/pytorch/pytorch/pull/130882#issuecomment-2236528473))
2024-07-18 13:31:30 +00:00
63a0a65df9 Define 'zero-preserving unary functions' in docs (#130804)
Make explicit the definition of 'zero-preserving unary functions' in the sparse tensors documentation.
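
For context, a one-line illustration of the property being defined (not part of the docs change itself): a unary function f is zero-preserving when f(0) == 0, which is what lets it be applied to only the specified values of a sparse tensor.

```python
import torch

assert torch.sin(torch.tensor(0.0)) == 0  # zero-preserving: sin(0) == 0
assert torch.cos(torch.tensor(0.0)) == 1  # not zero-preserving: cos(0) == 1
```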

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130804
Approved by: https://github.com/soulitzer
2024-07-18 13:30:29 +00:00
eqy
1b07d42171 Add @syed-ahmed to CUDA CODEOWNERS paths (#130971)
CC @ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130971
Approved by: https://github.com/soulitzer
2024-07-18 11:55:10 +00:00
c986aeea2d Re-implement pin_memory to be device-agnostic by leveraging the Accelerator concept (#126376)
This PR re-implements pin memory, aiming to get rid of the optional `device` argument and make all related APIs device-agnostic. We add two new abstract APIs in [AcceleratorHooksInterface](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/detail/AcceleratorHooksInterface.h#L12) and redefine pin memory as: "Pin memory is always pinned for the current accelerator device". In detail, pin_memory/is_pinned use [getAcceleratorHooksInterface](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/Context.h#L61) to get an appropriate device and invoke the corresponding overridden interfaces, instead of using BackendSelect and then dispatching to CUDA or other specific backends' implementations.

Note: New backends that want to implement and use pin memory just need to inherit from AcceleratorHooksInterface and override the `isPinnedPtr` and `getPinnedMemoryAllocator` methods.

Additional context: To avoid BC-breaking, this PR just preserves the `device` arg of related APIs and would throw a deprecation warning if `device` arg is passed. Another PR will be submitted to update all PT callers (`Tensor.is_pinned()`, `Tensor.pin_memory()`...) not to pass this arg based on this PR. In future, `device` arg will be actually removed.
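
For reference, a small sketch of the device-agnostic usage described above (guarded so it only runs when an accelerator is present; the deprecation behavior is as described in this PR):

```python
import torch

if torch.cuda.is_available():  # or any other accelerator backend that implements the hooks
    x = torch.randn(4)
    pinned = x.pin_memory()    # no device argument: the current accelerator decides
    assert pinned.is_pinned()
    # Passing an explicit device, e.g. x.pin_memory(device="cuda"), now emits a
    # deprecation warning per this PR and will eventually be removed.
```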

Relates #124908
Relates #14560
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126376
Approved by: https://github.com/albanD
2024-07-18 11:54:14 +00:00
38b7d89aa4 Uses context pointer for deleter to enable multiple CUDAPluggableAllocator usage (#130472)
We should be able to create multiple CUDAPluggableAllocators in the same pytorch program (see https://github.com/pytorch/pytorch/issues/124807, https://github.com/pytorch/pytorch/pull/125722 for context). When mixing CUDAPluggableAllocators in the same pytorch program, we need to make sure that the deleter passed in through the CUDAPluggableAllocator gets "attached" to the data_ptr and persists until program exit (when it's called to free the memory).

Currently, CUDAPluggableAllocator maintains a global `current_custom_allocator`. When creating the `DataPtr`, `raw_deleter` attaches `custom_raw_deleter` to the DataPtr which calls  `current_custom_allocator->raw_delete(...)`. This approach is fine when using only one allocator, however for multiple allocator use case, DataPtr would be using the deleter of whatever is in the `current_custom_allocator`. For example, if allocation 1 was done with `cudaMalloc` and allocation 2 was done with `ncclMemAlloc`, and if `current_custom_allocator` is currently pointing to the CUDAPluggableAllocator with `ncclMemAlloc` - when cleaning up the allocation 1, we'd be using `ncclMemFree` instead of `cudaFree`.

In this PR, we solve the above problem by remembering the `free_fn_` using a deleter context. Hence, there is no need to go through an allocator object to find the deleter.

CC: @zdevito @ptrblck @eqy
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130472
Approved by: https://github.com/eqy, https://github.com/ezyang
2024-07-18 11:33:21 +00:00
28a74b9fa4 [NestedTensor] Integrate sum along the jagged dimension into NestedTensor (#130425)
Summary: Modify the existing `sum` operator in PyTorch, invoked by `torch.sum`, to allow for reductions along the ragged dimension of a nested tensor. This diff enables PyTorch users to invoke `torch.sum` on a nested tensor with `dim=1`, where `ragged_idx=1`.

Functions modified in `caffe2/torch/nested/_internal/ops.py`:
- `sum_dim_IntList()`: The function assumes that `ragged_idx=1`; in the case that `dim=1` as well, where `dim` is the dimension on which we reduce, this diff invokes the PyTorch benchmark found in D58423489. Specifically, this diff pads a nested tensor, e.g. of logical shape `(B, *, M)`, using [`torch.ops.aten._jagged_to_padded_dense_forward`](https://www.internalfb.com/code/fbsource/[92c2a067ab04e3eebc999254fed4ae2fbea6def3]/fbcode/deeplearning/fbgemm/fbgemm_gpu/fb/inductor_lowerings/elementwise_ops.py?lines=26), then reduces across the `*` dimension (`dim == 1`) to a `(B, M)` output tensor.
- `_wrap_jagged_dims()`: This diff adds special handling to allow for the case where `dim` contains `1` and not `0`, but to continue disallowing the case where `dim` contains `0` and not `1`. In this function's creation, I created a helper function, `_get_condition_for_invalid_jagged_reductions()`, which makes it clearer which conditions apply to which operators. Specifically, operators which are enabled with jagged reductions are specified at the top of the file in `SUPPORTED_JAGGED_REDUCTIONS` and have a different set of conditions that need to be tested, as reducing along `dim == 1` without `dim == 0` is now possible.

Functions modified in `caffe2/test/test_nestedtensor.py`:
- `test_sum_int_DimList()`: This diff adds special handling in the `sum` unit test to allow for the case where `dim` contains `1` and not `0`, but to continue disallowing the case where `dim` contains `0` and not `1`.
- `test_sum_int_DimList_ragged_dim_1()`: This diff adds a new unit test which verifies the accuracy and feasibility of reducing along the jagged dimension of a nested tensor.

Notes:
- This diff solely adds functionality for the case in which we reduce only along the ragged dimension. Cases in which we reduce along both the ragged and another dimension, like `dim == (1, 2)`, are not permitted, as this set of diffs focuses primarily on the former.
- The `sum` operator is the only operator which uses the function `_wrap_jagged_dims()`; all other operators use `_wrap_jagged_dim()`. I would like to later look into why this is the case and if we can consolidate this!
- I modified some of the comments in the `sum` function as well as the unit tests for more clarity.
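
For illustration, a minimal sketch of the reduction this enables (assumes a build that contains this change; shapes are made up):

```python
import torch

# Two variable-length rows, each with M=3 features: logical shape (B=2, *, 3), ragged_idx=1.
nt = torch.nested.nested_tensor(
    [torch.randn(4, 3), torch.randn(7, 3)], layout=torch.jagged
)

# Reduce along the ragged dimension (dim=1); the result is a dense (B, M) tensor.
out = nt.sum(dim=1)
print(out.shape)  # torch.Size([2, 3])
```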

Test Plan:
Verify that existing (`test_sum_int_DimList`) and new (`test_sum_int_DimList_ragged_dim_1`) unit tests pass via the following command:

```
buck2 run mode/{opt,inplace} //caffe2/test:nested -- --regex test_sum_int_DimList
```

Differential Revision: D59571209

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130425
Approved by: https://github.com/davidberard98
2024-07-18 10:48:18 +00:00
e98135d1ad [aota] Needs autograd if an input requires_grad, agnostic to enable_grad (#128890)
Reland of:  https://github.com/pytorch/pytorch/pull/128016

Summary from previous PR:
We assume only two possible, mutually exclusive scenarios:

1. Running the compiled region for training (any input has requires_grad): produced differentiable outputs should have requires_grad.
2. Running the compiled region for inference (no input has requires_grad): no output has requires_grad.

Even if the user runs the region under no_grad() but has an input Tensor with requires_grad, we go with training scenario (1).

With the current state that means:
1. needs_autograd should not check torch.is_grad_enabled(), only whether any input requires_grad
2. if needs_autograd => trace_joint (we are in training scenario 1) => always run the compiled region under torch.enable_grad()

Why the changes in the partitioner?

Inference and training graphs differed in their return container (list vs. tuple). The partitioner changes unify this so that a tuple is always returned. As a result, some graph contents in test_aotdispatch.py change from list to tuple.

Why was it reverted?

There was a regression of hf_Reformer model on inference.
```
TORCHINDUCTOR_FX_GRAPH_CACHE=0 python benchmarks/dynamo/torchbench.py --performance --inference --bfloat16 --backend inductor --device cuda --only hf_Reformer --cold-start-latency --use-eval-mode
```

One of the compiled graphs contained outputs that are aliases of inputs that are nn.Parameter(requires_grad=True).

Even though the torchbench inference benchmarks run inside `torch.no_grad()`, alias ops (specifically expand, for hf_Reformer) preserve requires_grad.

As a result, we started compiling a training graph instead of an inference graph.

Fix for view ops:

If we have outputs that are aliases of inputs that require grad, those outputs requiring grad is not a reason to generate a training graph.

This is handled in aot_autograd.py, where output_and_mutation_safe is calculated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128890
Approved by: https://github.com/bdhirsh
2024-07-18 08:27:53 +00:00
cf3f4285a8 Add recursive metadata guard test (#131002)
Ensures that nested tensors subclasses are guarded properly. It turns out this case is already handled [here](d77af49380/torch/_dynamo/variables/builder.py (L1496)) which will recursively wrap inner tensors adding metadata guards for them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131002
Approved by: https://github.com/bdhirsh
2024-07-18 08:24:43 +00:00
134bc4fc34 [BE][Easy][12/19] enforce style for empty lines in import segments in test/i*/ (#129763)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129763
Approved by: https://github.com/jansel
2024-07-18 07:49:19 +00:00
dfc3347c4a [pytorch][counters] Make WaitCounter backend pluggable (#130934)
Summary:
This diff introduces a much more flexible model for WaitCounter backend:
1. Backend can be installed dynamically (even if not linked with pytorch) instead of relying on macros and swapping implementation at compile time
2. Multiple backends are supported at the same time.

Test Plan: unit test

Reviewed By: jamesperng

Differential Revision: D59795863

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130934
Approved by: https://github.com/asiab4
2024-07-18 07:23:55 +00:00
b732b52f1e Revert "[BE][Easy][12/19] enforce style for empty lines in import segments in test/i*/ (#129763)"
This reverts commit aecc746fccc4495313167e3a7f94210daf457e1d.

Reverted https://github.com/pytorch/pytorch/pull/129763 on behalf of https://github.com/XuehaiPan due to need reland after rerunning lintrunner on main ([comment](https://github.com/pytorch/pytorch/pull/129763#issuecomment-2235736732))
2024-07-18 06:39:58 +00:00
6c2c8ee15b [export] Remove preserved ops from decomp list (#130970)
Fixes https://fb.workplace.com/groups/1075192433118967/permalink/1466016147369925/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130970
Approved by: https://github.com/bdhirsh
2024-07-18 05:15:22 +00:00
aecc746fcc [BE][Easy][12/19] enforce style for empty lines in import segments in test/i*/ (#129763)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129763
Approved by: https://github.com/jansel
2024-07-18 05:13:41 +00:00
740fb22966 [BE][Easy][4/19] enforce style for empty lines in import segments in functorch/ (#129755)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129755
Approved by: https://github.com/zou3519
ghstack dependencies: #129752
2024-07-18 05:08:03 +00:00
a085acd7d6 [dynamo] Revert back changes to UnspecializedBuiltinNNModuleVariable (#130991)
xref - https://fb.workplace.com/groups/1075192433118967/permalink/1466525440652329/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130991
Approved by: https://github.com/williamwen42, https://github.com/mlazos
2024-07-18 05:01:46 +00:00
9f392f8294 Use inductor TestCase for test_replicate_with_compiler.py (#129494)
Summary: `test/distributed/_composable/test_replicate_with_compiler.py` exercises inductor. This change introduces a version of MultiProcessTestCase that derives from the inductor TestCase class to make sure we always get a clean cache dir.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129494
Approved by: https://github.com/eellison
2024-07-18 03:08:32 +00:00
433ef4e444 Revert "[FX][export] DCE pass, check schema for node impurity (#130395)"
This reverts commit e22b0acc766db4a853fe8fd73e919b4adf0e3148.

Reverted https://github.com/pytorch/pytorch/pull/130395 on behalf of https://github.com/yushangdi due to breaking tests, need to rebase and fix ([comment](https://github.com/pytorch/pytorch/pull/130395#issuecomment-2235192986))
2024-07-18 02:46:03 +00:00
bd56bcf0ab [TEST] Fix _scaled_mm tests (#130897)
This PR resolves several sets of `_scaled_mm` test failures:
- `scale_a` and `scale_b` are now required arguments, so the function `sample_inputs_scaled_mm` must supply them
- `_scaled_mm` does not support `"meta"` device, so it should be skipped in `test_meta.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130897
Approved by: https://github.com/drisspg
2024-07-18 02:15:00 +00:00
9fee87e4cd [export] Add print_readable to unflattener (#128617)
Taking inspiration from `GraphModule.print_readable` (aka I copied its [code](17b45e905a/torch/fx/graph_module.py (L824))), I added a `print_readable` to the unflattened module, because it's kind of nontrivial to print the contents of this module.

Example print from `python test/export/test_unflatten.py -k test_unflatten_nested`
```
class UnflattenedModule(torch.nn.Module):
    def forward(self, x: "f32[2, 3]"):
        # No stacktrace found for following nodes
        rootparam: "f32[2, 3]" = self.rootparam

        # File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:99 in forward, code: x = x * self.rootparam
        mul: "f32[2, 3]" = torch.ops.aten.mul.Tensor(x, rootparam);  x = rootparam = None

        # No stacktrace found for following nodes
        foo: "f32[2, 3]" = self.foo(mul);  mul = None
        bar: "f32[2, 3]" = self.bar(foo);  foo = None
        return (bar,)

    class foo(torch.nn.Module):
        def forward(self, mul: "f32[2, 3]"):
            # No stacktrace found for following nodes
            child1param: "f32[2, 3]" = self.child1param
            nested: "f32[2, 3]" = self.nested(mul);  mul = None

            # File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:79 in forward, code: return x + self.child1param
            add: "f32[2, 3]" = torch.ops.aten.add.Tensor(nested, child1param);  nested = child1param = None
            return add

        class nested(torch.nn.Module):
            def forward(self, mul: "f32[2, 3]"):
                # File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:67 in forward, code: return x / x
                div: "f32[2, 3]" = torch.ops.aten.div.Tensor(mul, mul);  mul = None
                return div

    class bar(torch.nn.Module):
        def forward(self, add: "f32[2, 3]"):
            # No stacktrace found for following nodes
            child2buffer: "f32[2, 3]" = self.child2buffer

            # File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:87 in forward, code: return x - self.child2buffer
            sub: "f32[2, 3]" = torch.ops.aten.sub.Tensor(add, child2buffer);  add = child2buffer = None
            return sub
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128617
Approved by: https://github.com/zhxchen17, https://github.com/pianpwk
2024-07-18 01:36:01 +00:00
cyy
a0ae77b25b Simplify cub::unique_by_key code (#130907)
It removes an unused parameter.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130907
Approved by: https://github.com/ezyang
2024-07-18 01:12:00 +00:00
d818c3319f Autoheuristic: add config options for specifying optimizations to collect data for and use heuristics (#130245)
Previously, it was only possible to either collect data or use a heuristic globally, regardless of which optimization autoheuristic is used for. This PR makes it possible to collect data for some optimizations while using a learned heuristic for other optimizations.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130245
Approved by: https://github.com/shunting314
2024-07-18 01:04:36 +00:00
051971ab32 Reorder MIOpen conditions so getCUDAHooks only called when CUDA input (#130867)
See post for more details: [fb.workplace.com/groups/1405155842844877/permalink/8719141948112860](https://fb.workplace.com/groups/1405155842844877/permalink/8719141948112860/)
Function getCUDAHooks() returns a reference to an object without checking if the object is null. In the AutoMOS QE, which runs a ML model in Messenger Android, we are getting native crashes because of this reason: [internalfb.com/code/fbsource/[b7f8e18320f9d5d8347c3428c67301f20c3c81d2]/xplat/caffe2/aten/src/ATen/native/Convolution.cpp?lines=504](https://www.internalfb.com/code/fbsource/%5Bb7f8e18320f9d5d8347c3428c67301f20c3c81d2%5D/xplat/caffe2/aten/src/ATen/native/Convolution.cpp?lines=504), crash [fburl.com/logview/xi4w7jk4](https://fburl.com/logview/xi4w7jk4)

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130867
Approved by: https://github.com/albanD
2024-07-18 00:59:33 +00:00
e22b0acc76 [FX][export] DCE pass, check schema for node impurity (#130395)
Change the default DCE pass to check node schema for impure nodes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130395
Approved by: https://github.com/angelayi, https://github.com/jgong5
2024-07-18 00:55:20 +00:00
cyy
73d0f484b3 [structural binding][11/N] Replace std::tie with structural binding (#130830)
Follows  #130784

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130830
Approved by: https://github.com/janeyx99
2024-07-18 00:45:06 +00:00
e14d1d10ef Unwrap Identity in prepare indexing (#130967)
We wrap indexing calculation in the concat kernel in `Identity` so that we do not expand int32 intermediates to int64. This was causing an issue where the index simplified to an integer and would not hit an intended [path](752c817898/torch/_inductor/codegen/triton.py (L1554)) which would do wrapping with tl.full.

I couldn't generate a minimal repro to add as a test, but I have a repro you can check here: P1483831261. There is already a test that we don't expand the int32 intermediates to int64.

Differential Revision: [D59871850](https://our.internmc.facebook.com/intern/diff/D59871850)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130967
Approved by: https://github.com/Chillee, https://github.com/jansel
2024-07-18 00:43:53 +00:00
d77af49380 [Traceable FSDP2] Preserve fsdp.set_ op through lowering; Add unit test for multiple .set_ into same primal; Add unit test for FSDP2 module layer reuse (#130786)
Test commands:
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_nested_fully_shard_fullgraph_backend_inductor`
- `pytest -rA test/functorch/test_aotdispatch.py::TestAOTAutograd::test_input_mutation_fsdp_set__into_same_input`
- `PYTORCH_TEST_WITH_CROSSREF=1 python test/functorch/test_aotdispatch.py -k TestAOTAutogradWithCache.test_input_mutation_fsdp_set__into_same_input`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130786
Approved by: https://github.com/bdhirsh
ghstack dependencies: #129773
2024-07-17 23:25:42 +00:00
fc3dbcd1c3 [Traceable FSDP2][Inductor] Re-inplace all_gather_into_tensor (#129773)
FSDP2 eager pre-allocates the output buffer for AllGather and the AllGather just writes into that buffer. However, under compile, by default we use out-of-place AllGather, which means in Traceable FSDP2 case we will be unnecessarily using more memory than eager. We want to re-inplace that AllGather instead.

This PR adds a post_grad pass to re-inplace all_gather_into_tensor (i.e. changing it from `all_gather_into_tensor.default` out-of-place op to `all_gather_into_tensor_out.default` out-variant op).

One thing to note is that since with this pass we are introducing a mutable op into the post_grad FX graph, we must do this pass after `reinplace_inplaceable_ops` (at which point we are okay again with having mutable ops in the graph). To facilitate this, this PR adds a `post_grad_custom_post_reinplace_pass` extension point to allow user-defined post-reinplace FX passes.

---

Test commands:
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_fullgraph_backend_inductor`

---

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129773
Approved by: https://github.com/eellison
2024-07-17 22:51:20 +00:00
442bfa7fc4 Fix mypy error (#130992)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130992
Approved by: https://github.com/izaitsevfb
2024-07-17 22:49:23 +00:00
a0da1265c5 Define key in codecache (#130979)
Test Plan:
```
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/dynamo:test_dynamo -- --exact 'caffe2/test/dynamo:test_dynamo - test_misc.py::InlineInbuiltNNModulesMiscTests::test_auto_functionalize_can_with_none_return_inline_inbuilt_nn_modules'
```

Differential Revision: D59875657

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130979
Approved by: https://github.com/jamesjwu
2024-07-17 22:44:50 +00:00
31e3330040 [Reland][FSDP2] Allowed List[nn.Module] as arg (#130949)
This PR allows `fully_shard`'s first argument to be `List[nn.Module]` instead of strictly `nn.Module`. This allows more flexible grouping of modules/parameters for communication, which can lead to memory savings and/or more efficient communication.

**Approach**
At a high level, we can think of a model as a tree of modules. Previously, we could only select specific module nodes in this tree as representing one FSDP parameter group. With this PR, we can select a group of module nodes, effectively becoming a single super node.

To implement the runtime schedule, we define new forward hooks that run based on the following semantics:
- If a module is the first to run the pre-hook, actually run the given pre-hook. Otherwise, the pre-hook is no-op.
- If a module is the last to run the post-hook, actually run the given post-hook. Otherwise, the post-hook is a no-op.
- First and last are determined by scoreboarding against a set of the modules.
- This set must get cleared at the end of backward in the case that >=1 module in the list is never used, in which case we still want the forward hooks to run in the next forward after this backward.

Beyond these new forward hooks, everything else is some simple generalization from `Module` to `List[Module]` or `Tuple[Module, ...]`.

**Examples**
This PR enables wrapping Llama models more efficiently by grouping the final norm and output linear together: https://github.com/pytorch/torchtitan/pull/382.
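
For illustration, a minimal sketch of the new call pattern (a toy model of my own, not the Llama case above; assumes at least 2 CUDA GPUs, NCCL, and a torchrun launch):

```python
# torchrun --standalone --nproc-per-node=2 fsdp2_group_sketch.py
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed._composable.fsdp import fully_shard

dist.init_process_group("nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

class Block(nn.Module):
    def __init__(self, dim: int = 16):
        super().__init__()
        self.lin = nn.Linear(dim, dim)

    def forward(self, x):
        return self.lin(x).relu()

class Model(nn.Module):
    def __init__(self, dim: int = 16, vocab: int = 32):
        super().__init__()
        self.blocks = nn.ModuleList(Block(dim) for _ in range(2))
        self.norm = nn.LayerNorm(dim)
        self.output = nn.Linear(dim, vocab)

    def forward(self, x):
        for block in self.blocks:
            x = block(x)
        return self.output(self.norm(x))

model = Model().cuda()
for block in model.blocks:
    fully_shard(block)                   # one parameter group per block, as before
fully_shard([model.norm, model.output])  # new: norm + output share one parameter group
fully_shard(model)                       # root group for any remaining parameters

out = model(torch.randn(4, 16, device="cuda"))
out.sum().backward()
dist.destroy_process_group()
```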

If at least one of the modules in the list does not run forward before backward, then there will be a warning message like:
```
1 of the 2 modules passed to fully_shard did not run forward before backward, which is error-prone since FSDP post-forward/pre-backward logic will not run for these modules. We recommend passing only modules that run forward together. Modules that did not run forward: [FSDPLinear(in_features=1, out_features=1, bias=True)]
```

---

**Changes for reland:** none since breakage was from PR below

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130949
Approved by: https://github.com/weifengpy
ghstack dependencies: #130947
2024-07-17 22:40:14 +00:00
ff7e021e94 [Reland][PT-D] Relaxed contract to allow Sequence[nn.Module] (#127773) (#130947)
This PR relaxes `@contract` to allow the 1st argument to be `Sequence[nn.Module]` instead of strictly `nn.Module`. This is required for the next PR, which allows `fully_shard` to take in `List[nn.Module]`.

---

**Changes for reland:**
- The previous PR assumed that any `func` decorated with `@contract` would return the same input `module` as output (which is true for PT-D composable APIs).
- However, TorchRec `shard` returns a different module as output (though that module _does_ satisfy the `@contract` FQN check).
- This PR removes the assumption and instead only enforces the FQN check following the input module order. In other words, if calling `func([x1, ..., xN])` for `N` modules `x1, ..., xN` that returns `[y1, ..., yM]` for `M` modules, we require that `N = M` and that FQNs are preserved coordinate-wise: `xi` and `yi` have same FQNs for all `i = 1, ..., N`.

Differential Revision: [D59863438](https://our.internmc.facebook.com/intern/diff/D59863438)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130947
Approved by: https://github.com/weifengpy, https://github.com/atalman
2024-07-17 22:40:13 +00:00
90105a4f3e [ts-migration] Support RaiseException, prim::Unitialized, prim::Enter, and prim::Exit (#129416)
- Support raising exceptions. Its behavior now matches non-strict export, thanks to @ydwu4's [PR](https://github.com/pytorch/pytorch/pull/128709).
- Support prim::Unitialized, prim::Enter, and prim::Exit
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129416
Approved by: https://github.com/angelayi
2024-07-17 21:59:52 +00:00
874bbc53c9 Revert "Define key in codecache (#130979)"
This reverts commit 4112f687831fb6f3554ff659a0be45909a1b4639.

Reverted https://github.com/pytorch/pytorch/pull/130979 on behalf of https://github.com/clee2000 due to broke lint on torch/_inductor/codecache.py https://github.com/pytorch/pytorch/actions/runs/9981737836/job/27586013811 f0faecd291 ([comment](https://github.com/pytorch/pytorch/pull/130979#issuecomment-2234392332))
2024-07-17 21:59:19 +00:00
43a6d20883 Add decomposition for reflection_pad{1,2,3}d_backward (#130299)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130299
Approved by: https://github.com/lezcano
ghstack dependencies: #130130
2024-07-17 21:56:00 +00:00
0eb43ed189 Revert "[ts-migration] Support RaiseException, prim::Unitialized, prim::Enter, and prim::Exit (#129416)"
This reverts commit f0faecd2915d73e56917922cc995237cef064e50.

Reverted https://github.com/pytorch/pytorch/pull/129416 on behalf of https://github.com/clee2000 due to broke lint, but for for torch/_inductor/codecache.py this time https://github.com/pytorch/pytorch/actions/runs/9981737836/job/27586013811 f0faecd291 ([comment](https://github.com/pytorch/pytorch/pull/129416#issuecomment-2234387254))
2024-07-17 21:55:48 +00:00
ebdfc7e37d [BE] Rename ISORT_WHITELIST to ISORT_SKIPLIST (#130987)
To better represent what this list is doing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130987
Approved by: https://github.com/seemethere, https://github.com/ZainRizvi
2024-07-17 21:52:56 +00:00
df5919393c [ROCm] std::clamp work-around for hip-clang compiler (#127812)
Fixes #127666.

Other std math functions are replaced with those in the global namespace during hipify, but HIP does not claim to support every function in the C++ standard library. std::clamp is not yet supported, and we have been relying on the std implementation. For Fedora 40 + gcc 14, the std implementation uses a host-side assert, which is not supported. Work around this by replacing std::clamp with min and max. Using #ifndef USE_ROCM to differentiate between CUDA (std::clamp) and the ROCm replacement broke Windows builds, and since the replacement generates the same PTX as std::clamp, it is now used unconditionally. See https://godbolt.org/z/Wde9KW3v4 for a sample.

Original patch comes from @lamikr. Modified to improve efficiency.

https://github.com/lamikr/rocm_sdk_builder/pull/37

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127812
Approved by: https://github.com/hongxiayang, https://github.com/malfet
2024-07-17 21:31:17 +00:00
f0faecd291 [ts-migration] Support RaiseException, prim::Unitialized, prim::Enter, and prim::Exit (#129416)
- Support raising exceptions. Its behavior now matches non-strict export, thanks to @ydwu4's [PR](https://github.com/pytorch/pytorch/pull/128709).
- Support prim::Unitialized, prim::Enter, and prim::Exit
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129416
Approved by: https://github.com/angelayi
2024-07-17 21:27:45 +00:00
4112f68783 Define key in codecache (#130979)
Test Plan:
```
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/dynamo:test_dynamo -- --exact 'caffe2/test/dynamo:test_dynamo - test_misc.py::InlineInbuiltNNModulesMiscTests::test_auto_functionalize_can_with_none_return_inline_inbuilt_nn_modules'
```

Differential Revision: D59875657

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130979
Approved by: https://github.com/jamesjwu
2024-07-17 21:19:13 +00:00
0b134c15cd Revert "Relax constraints for creating a GenericContextWrappingVariable (#129091)"
This reverts commit 882fd9186924b4632fba65033717d97d15ad3339.

Reverted https://github.com/pytorch/pytorch/pull/129091 on behalf of https://github.com/clee2000 due to test_jit started failing on main after this stack https://github.com/pytorch/pytorch/actions/runs/9980754603/job/27583474357 a8bd2933d9 ([comment](https://github.com/pytorch/pytorch/pull/129091#issuecomment-2234269541))
2024-07-17 20:59:40 +00:00
c49f909aab Revert "wrap self.call_function(...) in try finally block to undo changes to self.kw_names (#130490)"
This reverts commit a8bd2933d9eaf24ec9582001efa844de499d9e93.

Reverted https://github.com/pytorch/pytorch/pull/130490 on behalf of https://github.com/clee2000 due to test_jit started failing on main after this stack https://github.com/pytorch/pytorch/actions/runs/9980754603/job/27583474357 a8bd2933d9 ([comment](https://github.com/pytorch/pytorch/pull/129091#issuecomment-2234269541))
2024-07-17 20:59:40 +00:00
65b4163bd2 [dynamo][nn-module] Make slice getitem on nn module container sourceless (#130852)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130852
Approved by: https://github.com/mlazos
ghstack dependencies: #130773
2024-07-17 20:17:08 +00:00
a8bd2933d9 wrap self.call_function(...) in try finally block to undo changes to self.kw_names (#130490)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130490
Approved by: https://github.com/williamwen42, https://github.com/zou3519
ghstack dependencies: #129091
2024-07-17 20:07:06 +00:00
882fd91869 Relax constraints for creating a GenericContextWrappingVariable (#129091)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129091
Approved by: https://github.com/yanboliang, https://github.com/zou3519
2024-07-17 20:07:06 +00:00
41f5d5dcaf Revert "[inductor] adapte windows file path (#130713)"
This reverts commit e51e971a8675826e517a78bf2a97f8e2df5f4abd.

Reverted https://github.com/pytorch/pytorch/pull/130713 on behalf of https://github.com/clee2000 due to sorry but I think its still failing, this time on windows CUDA https://github.com/pytorch/pytorch/actions/runs/9971126834/job/27552761451 bb62e9d7c3.  It was not run on PR due to being on the periodic workflow, which isnt usually run on PRs due to capacity issues for windows CUDA machines.  I will add ciflow/periodic to the PR to ensure the test gets run ([comment](https://github.com/pytorch/pytorch/pull/130713#issuecomment-2234092078))
2024-07-17 19:37:16 +00:00
1bf4a44b33 Revert "[ts-migration] Support RaiseException, prim::Unitialized, prim::Enter, and prim::Exit (#129416)"
This reverts commit ef0511245a92bae7057c195dcae2efc237b96f16.

Reverted https://github.com/pytorch/pytorch/pull/129416 on behalf of https://github.com/clee2000 due to broke lint for test/export/test_converter.py https://github.com/pytorch/pytorch/actions/runs/9979009143/job/27577181982 ef0511245a.  Probably a landrace ([comment](https://github.com/pytorch/pytorch/pull/129416#issuecomment-2234067407))
2024-07-17 19:21:52 +00:00
b0387449db Ensure staticmethods can be allowed in graph (#130882)
Fixes https://github.com/pytorch/pytorch/issues/124735

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130882
Approved by: https://github.com/anijain2305, https://github.com/williamwen42
2024-07-17 19:18:30 +00:00
e4f9d01cd9 Add test for dataclass field accesses (#130848)
Fixes https://github.com/pytorch/pytorch/issues/120108

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130848
Approved by: https://github.com/williamwen42, https://github.com/anijain2305
2024-07-17 19:14:23 +00:00
470f07c840 Add guard override capability for tensor subclass metadata (#130780)
Fixes https://github.com/pytorch/pytorch/issues/114405

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130780
Approved by: https://github.com/anijain2305, https://github.com/bdhirsh
ghstack dependencies: #130779
2024-07-17 19:13:53 +00:00
bea6762c01 Add guards on subclass metadata (#130779)
This PR adds guards in dynamo which verify the equality of tensor subclass metadata along with tests verifying the expected recompile behavior. The next PR adds the capability to override the guard behavior to possibly perform the check in a less expensive manner.

Toward fixing https://github.com/pytorch/pytorch/issues/114405

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130779
Approved by: https://github.com/anijain2305, https://github.com/bdhirsh
2024-07-17 19:13:52 +00:00
752c817898 [AOTI][refactor] Unify UserDefinedTritonKernel.codegen (#130796)
Summary: Unify the argument codegen logic between the python wrapper and the cpp wrapper.

Differential Revision: [D59809273](https://our.internmc.facebook.com/intern/diff/D59809273)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130796
Approved by: https://github.com/oulgen
2024-07-17 18:37:23 +00:00
efefea52e0 renamed inductor kernel args in flexattention properly (#130869)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130869
Approved by: https://github.com/drisspg, https://github.com/joydddd
ghstack dependencies: #130809, #130818
2024-07-17 18:36:03 +00:00
480a5bd881 Renamed mask_fn to mask_mod (#130818)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130818
Approved by: https://github.com/drisspg
ghstack dependencies: #130809
2024-07-17 18:36:03 +00:00
d96c80649f [export] constants & non-persistent buffers for training IR (#130864)
Summary: Uses original ExportedProgram constants and graph signature to inform decompositions, so that constant tensors and non-persistent buffers are respected for training IR. Removes 7 test failures for training IR.

Test Plan: test_export

Differential Revision: D59820909

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130864
Approved by: https://github.com/angelayi
2024-07-17 18:27:53 +00:00
ef0511245a [ts-migration] Support RaiseException, prim::Unitialized, prim::Enter, and prim::Exit (#129416)
- Support raising exceptions. Its behavior now matches non-strict export, thanks to @ydwu4's [PR](https://github.com/pytorch/pytorch/pull/128709).
- Support prim::Unitialized, prim::Enter, and prim::Exit
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129416
Approved by: https://github.com/angelayi
2024-07-17 17:48:36 +00:00
d552e5c3d5 Fix ciflow/nightly triggering commit hash update workflow (#130570)
Move the if statement higher so people don't hit the failure shown below
![image](https://github.com/user-attachments/assets/e9be7d7c-6400-4f80-880f-d58dcb4b5495)
like https://togithub.com/pytorch/pytorch/pull/130465
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130570
Approved by: https://github.com/ZainRizvi
2024-07-17 17:13:50 +00:00
db3290846e [BE][Easy][10/19] enforce style for empty lines in import segments in test/d*/ (#129761)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129761
Approved by: https://github.com/fegin
2024-07-17 16:57:39 +00:00
1e13cb2f28 Log cache state to structured logs (#130845)
https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpRm4MaD/0_0_0/fx_graph_cache_hash_4.json

Differential Revision: D59795574

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130845
Approved by: https://github.com/jamesjwu
2024-07-17 16:45:45 +00:00
af0b5ee924 Reduce number of samples in {svd,pca}_lowrank OpInfos (#127199)
We don't need to generate so many samples for these very expensive ops.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127199
Approved by: https://github.com/peterbell10, https://github.com/zou3519
2024-07-17 16:29:36 +00:00
6e916f112f [inductor] skip fx remote cache for 2 tests in test_metrics.py (#130853)
Summary: `collect_defined_kernels()` is essentially patching deep inside to see if a specific codegen is happening. We could also patch somewhere in the cache path to make sure it's called, but I'm not sure that's really testing anything interesting. I suggest it's better to just disable the remote cache here.

Test Plan: `buck2 test -j 18 'fbcode//mode/opt' fbcode//caffe2/test/inductor:metrics -- --exact 'caffe2/test/inductor:metrics - test_kernel_args_num_gb (caffe2.test.inductor.test_metrics.TestMetrics)' --run-disabled --stress-runs 10`

Differential Revision: D59825899

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130853
Approved by: https://github.com/oulgen
2024-07-17 16:17:43 +00:00
1fb572289b [BE][c10d] Add a warning messages in the comment about cuda hang (#130844)
Add comments to warn users about a potential hang for the CUDA event query in NCCLPG.

Co-authored-by: Andrew Gu <31054793+awgu@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130844
Approved by: https://github.com/wconstab
2024-07-17 15:51:19 +00:00
b7d2abd766 Fix vectorized ops.masked (#130130)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130130
Approved by: https://github.com/jgong5, https://github.com/lezcano
2024-07-17 14:55:11 +00:00
b29b23137c [Easy] Fix argument name collision in dispatched functions (#129562)
Use positional-only argument to avoid naming collision with aten ops arguments that are named "self".

```python
In [1]: def foo(self, *args, **kwargs):
   ...:     print(self, args, kwargs)
   ...:

In [2]: def bar(self, /, *args, **kwargs):
   ...:     print(self, args, kwargs)
   ...:

In [3]: foo(1, 2, self=3)
TypeError: foo() got multiple values for argument 'self'

In [4]: bar(1, 2, self=3)
1
(2,)
{'self': 3}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129562
Approved by: https://github.com/zou3519, https://github.com/fegin
2024-07-17 14:39:56 +00:00
c0ed38e644 [BE][Easy][3/19] enforce style for empty lines in import segments in benchmarks/ (#129754)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129754
Approved by: https://github.com/ezyang
2024-07-17 14:34:42 +00:00
32995dec28 Add support for XPU accumulate type (#128579)
Provide an accumulate type interface specifically for XPU, similar to what was done for MPS.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128579
Approved by: https://github.com/EikanWang, https://github.com/albanD
2024-07-17 14:33:53 +00:00
76169cf691 [BE][Easy][9/19] enforce style for empty lines in import segments in test/[e-h]*/ (#129760)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129760
Approved by: https://github.com/ezyang
2024-07-17 14:25:29 +00:00
cbf274d4a7 [aoti] Add packaging solution (#129895)
In this PR, I added support for packaging the AOTI generated files into a zipfile, and loading it in python.

`compile_so` takes the path to the package, a device, and a desired so_path location; it compiles the package into a .so and saves it to the specified location.
`load_package` takes a path to the package and a device, calls _extract_so, and then creates a callable to run the compiled model.

The zipfile generated looks like the following:
```
|- version
|- archive_format
|- data
   |- aotinductor
      |- cbtnafqaqrhvwztv7xudlal4xs6sofxa5oxccyuaqtrt6aozaklx.cubin  # AOTI cuda generated cubin files
      |- cskkqtna23bty2v3aq7g2q37cxrgufehlkuaaolhlgug5zg6fuwe.cpp  # AOTI generated cpp file
      |- cskkqtna23bty2v3aq7g2q37cxrgufehlkuaaolhlgug5zg6fuwe_compile_flags  # Flags for compiling the .o
      |- c6qqtnpgwfi3dv5nb76ai773kt45ezoxfwdmd7q37lvq6fs2tnoi.o  # AOTI saved const.o
      |- cskkqtna23bty2v3aq7g2q37cxrgufehlkuaaolhlgug5zg6fuwe_linker_flags  # Flags for linking the files to form the .so
   |- constants
      |- constants.pt  # Constants saved using torch.save, can be loaded using mmap
```

The workflow is something like:
```
with torch.no_grad():
    ep = torch.export.export(
        model,
        example_inputs,
        dynamic_shapes=dynamic_shapes,
        strict=False,
    )
    gm = ep.module()
    package_path = torch._inductor.aot_compile(
        gm,
        example_inputs,
        options= {
              "aot_inductor.output_path": "my_path.pt2",  # or a directory
              "aot_inductor.package": True,
        }
    )
compiled_model = torch._inductor.package.load_package(package_path, device)
return compiled_model
```

I tried turning on loading the weights using mmap by default, but ran into some issues with it, so that is left as a TODO.
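For completeness, a minimal sketch of calling `compile_so` directly, based only on the description above (the `torch._inductor.package` location of `compile_so` and its argument order are assumptions):

```python
# Hypothetical sketch; the argument names/order for compile_so are assumed.
from torch._inductor import package as aoti_package

so_path = aoti_package.compile_so("my_path.pt2", "cuda", "model.so")  # package -> .so
compiled_model = aoti_package.load_package("my_path.pt2", "cuda")     # package -> callable
out = compiled_model(*example_inputs)
```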

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129895
Approved by: https://github.com/malfet
2024-07-17 13:56:58 +00:00
94a910b43b Revert "Renamed mask_fn to mask_mod (#130818)"
This reverts commit 1a97bcf93b2ac98505ef6ff011ccb3565e456596.

Reverted https://github.com/pytorch/pytorch/pull/130818 on behalf of https://github.com/atalman due to Failing internally ([comment](https://github.com/pytorch/pytorch/pull/130818#issuecomment-2233367318))
2024-07-17 13:47:08 +00:00
d027aef8f8 Revert "Removed q_num_blocks from constructor (#130819)"
This reverts commit 03c660468eb57772e82c1034613f5ff8781c775a.

Reverted https://github.com/pytorch/pytorch/pull/130819 on behalf of https://github.com/atalman due to Internal problem with previous PR in stack https://github.com/pytorch/pytorch/pull/130818 ([comment](https://github.com/pytorch/pytorch/pull/130819#issuecomment-2233359569))
2024-07-17 13:43:35 +00:00
4b7ff35622 Fix flex_attention import in score_mod (#130906)
torch.nn.attention._flex_attention has been renamed to torch.nn.attention.flex_attention, so the import does not work currently.
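The corrected import after the rename:

```python
# New public module location (previously torch.nn.attention._flex_attention)
from torch.nn.attention.flex_attention import flex_attention
```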
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130906
Approved by: https://github.com/Chillee
2024-07-17 13:37:08 +00:00
e1b2d8f975 Revert "[cuDNN][SDPA] Support attn_bias in cuDNN (#130482)"
This reverts commit de177b50f89e45a57ac056ee64a64d7775b450ff.

Reverted https://github.com/pytorch/pytorch/pull/130482 on behalf of https://github.com/atalman due to failing internally ([comment](https://github.com/pytorch/pytorch/pull/130482#issuecomment-2233309217))
2024-07-17 13:21:50 +00:00
d3a11a0198 [Inductor] Handle device_put op in constant folding. (#130824)
Fix #130823

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130824
Approved by: https://github.com/eellison, https://github.com/EikanWang
ghstack dependencies: #130817
2024-07-17 10:13:36 +00:00
2af2d26562 [Inductor UT] Generalize device-bias code in test_triton_kernels.py and test_torchinductor.py (#130817)
[Inductor UT] Generalize newly introduced device-bias code in test_triton_kernels.py::test_add_kernel and test_torchinductor.py::test_ctr_not_moved_to_cuda_when_used_in_index_put
Fix #130814 , #130838

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130817
Approved by: https://github.com/zou3519
2024-07-17 10:13:36 +00:00
2300bb2a88 [3.13, dynamo] support TO_BOOL (#130565)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130565
Approved by: https://github.com/jansel
ghstack dependencies: #130459, #130460, #130461, #130564
2024-07-17 09:47:58 +00:00
539acf7656 [3.13, dynamo] support CALL_KW (#130564)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130564
Approved by: https://github.com/jansel
ghstack dependencies: #130459, #130460, #130461
2024-07-17 09:47:58 +00:00
e2365c05d7 [3.13, dynamo] fix instruction line numbers (#130461)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130461
Approved by: https://github.com/jansel
ghstack dependencies: #130459, #130460
2024-07-17 09:47:58 +00:00
82b2e7a253 [3.13, dynamo] fix CALL_FUNCTION_EX in symbolic_convert (#130460)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130460
Approved by: https://github.com/jansel
ghstack dependencies: #130459
2024-07-17 09:47:58 +00:00
8c9a996091 [3.13, dynamo] support LOAD_FAST_LOAD_FAST and STORE_FAST_STORE_FAST (#130459)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130459
Approved by: https://github.com/jansel
2024-07-17 09:47:58 +00:00
bb62e9d7c3 Avoid autocast deprecation warning in DataParallel (#130660)
Fixes #130659

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130660
Approved by: https://github.com/guangyey, https://github.com/fegin, https://github.com/albanD
2024-07-17 08:32:19 +00:00
f6838d521a [BE][Easy][5/19] enforce style for empty lines in import segments in tools/ and torchgen/ (#129756)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129756
Approved by: https://github.com/ezyang
2024-07-17 06:44:35 +00:00
ba48cf6535 [BE][Easy][6/19] enforce style for empty lines in import segments in test/ (#129757)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129757
Approved by: https://github.com/ezyang
2024-07-17 06:42:37 +00:00
e51e971a86 [inductor] adapte windows file path (#130713)
This PR depends on https://github.com/pytorch/pytorch/pull/130132 landing successfully.
The detailed log: https://github.com/pytorch/pytorch/issues/124245#issuecomment-2211889758

After the file path was adapted for Windows, the first Windows inductor case ran successfully.

```python
import torch

def foo(x, y):
    a = torch.sin(x)
    b = torch.cos(x)
    return a + b
opt_foo1 = torch.compile(foo)
print(opt_foo1(torch.randn(10, 10), torch.randn(10, 10)))
```

Result:
![image](https://github.com/user-attachments/assets/4944df47-e74d-476b-8eb5-1d1fd5abeb41)

Co-authored-by: Jiong Gong <jiong.gong@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130713
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/desertfire
2024-07-17 06:36:11 +00:00
7c45476d38 [pytorch][counters] WaitCounter cleanup (#130664)
Summary:
This diff does a minor cleanup of WaitCounters:
1. Fixes some singleton use to ensure one instance of WaitCounterImpl per counter per process
2. Updates API to enable measuring duration of individual wait operations

Test Plan: unit test

Differential Revision: D59709324

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130664
Approved by: https://github.com/c-p-i-o, https://github.com/asiab4
2024-07-17 04:42:35 +00:00
419b8df0b6 [inductor][easy] add debug logs for inlining constants (#130799)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130799
Approved by: https://github.com/chenyang78
2024-07-17 04:21:08 +00:00
f2552dcc3d refactor cached tensor more generic (#129359)
# Motivation
Solve https://github.com/pytorch/pytorch/issues/129027 by refactoring the cached tensor to be generic.

# Additional Context
No API name change; it only decouples from the CUDA build option.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129359
Approved by: https://github.com/eqy, https://github.com/EikanWang, https://github.com/albanD
2024-07-17 03:00:08 +00:00
c6aa03bd4e Add allow_xpu to enable XPU UTs (#130312)
# Motivation
Enable UTs under the folder test/xpu/.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130312
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/albanD
2024-07-17 02:40:28 +00:00
fc238db62a Separate AOTI Eager utils as a single file (#125819)
The key change is code movement: we just moved the AOTI eager-related code from `torch._inductor.utils` to `torch._inductor.aoti_eager`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125819
Approved by: https://github.com/jansel, https://github.com/jgong5, https://github.com/desertfire
2024-07-17 02:27:11 +00:00
d1c4e6b55f [BE]: Enable a few additional ruff rules (#130700)
Enables a few extra ruff rules, most of which do not have any violations since I already cleaned them up with earlier PRs; this just turns them on to enforce them. Adds one noqa because we want the suboptimal lambda generation + call kept as a test. Also enables the test in flake8.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130700
Approved by: https://github.com/justinchuby, https://github.com/ezyang
2024-07-17 02:06:04 +00:00
c24c50da92 fix tensor print behavior for XPU (#130523)
# Motivation
Some XPU devices don't support the `double` data type, so we have to use `tensor.to(torch.float)` if it is an XPU tensor.
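A rough sketch of the idea (the helper name here is hypothetical, not the actual `_tensor_str` change):

```python
import torch

def _upcast_for_print(t: torch.Tensor) -> torch.Tensor:
    # Fall back to float32 on XPU devices that lack fp64 support, instead of
    # the usual upcast to double used when formatting tensors for printing.
    if t.device.type == "xpu":
        return t.to(torch.float)
    return t.to(torch.double)
```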

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130523
Approved by: https://github.com/gujinghui, https://github.com/EikanWang, https://github.com/albanD
2024-07-17 02:03:32 +00:00
aa95fb99af On advice of James March, log pid instead of tid (#130679)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130679
Approved by: https://github.com/jmarchfb
2024-07-17 02:00:10 +00:00
e9023d57b0 [ROCm] Return correct AMDSMI socket_power metric (#130331)
Extending on the change in https://github.com/pytorch/pytorch/pull/127729

Depending on the gcnArch, the API that returns socket power changes based on the underlying gpu_metrics. This PR handles both cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130331
Approved by: https://github.com/jeffdaily, https://github.com/eqy, https://github.com/malfet
2024-07-17 01:58:58 +00:00
03c660468e Removed q_num_blocks from constructor (#130819)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130819
Approved by: https://github.com/drisspg
ghstack dependencies: #130809, #130818
2024-07-17 01:41:20 +00:00
1a97bcf93b Renamed mask_fn to mask_mod (#130818)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130818
Approved by: https://github.com/drisspg
ghstack dependencies: #130809
2024-07-17 01:41:20 +00:00
6024fea0f8 Compute q_num_blocks from kv_num_blocks if q_num_blocks is not passed in (#130809)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130809
Approved by: https://github.com/drisspg
2024-07-17 01:41:15 +00:00
ef9d9be236 TCPStoreLibUvBackend: log port on error (#130797)
Adds better error messages when a socket fails to bind in libuv.

New format:
```
The server socket has failed to bind. port: 1, useIpv6: 0, code: -13, name: EACCES, message: permission denied
```

Old format:

```
The server socket has failed to listen on any local network address. useIpv6: 0, code: -98, name: EADDRINUSE, message: address already in use
```

Test plan:

Added test in `test_store.py`

```
python test/distributed/test_store.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130797
Approved by: https://github.com/kurman
2024-07-17 01:34:15 +00:00
25cb4426d3 [inductor] Add num_matches_for_scatter_upon_const_tensor to list of cached metrics (#130843)
Summary: test/inductor:scatter_optimization is using this counter and fails with remote caching enabled

Test Plan: `buck2 test -j 18 'fbcode//mode/opt' fbcode//caffe2/test/inductor:scatter_optimization -- --exact 'caffe2/test/inductor:scatter_optimization - test_cross_entropy_loss (caffe2.test.inductor.test_scatter_optimization.TestScatterOpt)' --run-disabled --stress-runs 10`

Differential Revision: D59817406

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130843
Approved by: https://github.com/oulgen
2024-07-17 00:41:22 +00:00
8458dc8966 Revert "Use inductor TestCase for distributed tests (#129494)"
This reverts commit 3cd2ae331a5ed6839456bb0025c729a1ee50bc84.

Reverted https://github.com/pytorch/pytorch/pull/129494 on behalf of https://github.com/masnesral due to fbcode failures ([comment](https://github.com/pytorch/pytorch/pull/129494#issuecomment-2232063690))
2024-07-17 00:32:48 +00:00
d7a8e8f7c5 Revert "[PT-D] Relaxed contract to allow Sequence[nn.Module] (#127773)"
This reverts commit b27695791e9cc4eedb1b713b1be20398bfeb911b.

Reverted https://github.com/pytorch/pytorch/pull/127773 on behalf of https://github.com/atalman due to failing internally ([comment](https://github.com/pytorch/pytorch/pull/127773#issuecomment-2232004006))
2024-07-16 23:48:09 +00:00
9a6d81b178 Fix pytorch JIT build for LLVM 18+ (#130661)
Summary: LLVM upstream (https://github.com/llvm/llvm-project/pull/97824) changed `getHostCPUFeatures` to return a StringMap. Fix this to unblock T195389358

Test Plan:
```
buck2 build mode/opt-clang-thinlto --upload-all-actions -c unicorn.hfsort="1" -c cxx.extra_cxxflags="-gpubnames -w -Wno-enum-constexpr-conversion -Wno-missing-template-arg-list-after-template-kw -Wno-c++11-narrowing -Wno-c++11-narrowing-const-reference -ferror-limit=0" -c cxx.extra_cflags="-gpubnames -w -Wno-enum-constexpr-conversion -Wno-missing-template-arg-list-after-template-kw -Wno-c++11-narrowing -Wno-c++11-narrowing-const-reference" -c cxx.profile="fbcode//fdo/autofdo/unicorn/topaggr/top_aggregator_server:autofdo" unicorn/topaggr:top_aggregator_server
```

Differential Revision: D59708722

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130661
Approved by: https://github.com/Skylion007
2024-07-16 23:47:48 +00:00
eqy
de177b50f8 [cuDNN][SDPA] Support attn_bias in cuDNN (#130482)
CC @drisspg
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130482
Approved by: https://github.com/drisspg
2024-07-16 23:45:21 +00:00
4f40a7078e Revert "[FSDP2] Allowed List[nn.Module] as arg (#127786)"
This reverts commit d3ab8cecedd7843b8caed5946404704a18479811.

Reverted https://github.com/pytorch/pytorch/pull/127786 on behalf of https://github.com/atalman due to bottom pr from the stack is failing on internal error ([comment](https://github.com/pytorch/pytorch/pull/127786#issuecomment-2231999178))
2024-07-16 23:45:17 +00:00
7919f0b952 Add buffer static input tests to cudagraph trees (#130402)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130402
Approved by: https://github.com/eellison
ghstack dependencies: #130393
2024-07-16 22:12:38 +00:00
415d5e53ae Propagate buffer and parameter indices through AOT (#130393)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130393
Approved by: https://github.com/bdhirsh
2024-07-16 22:12:38 +00:00
5f3c356a56 Revert "[inductor] adapte windows file path (#130713)"
This reverts commit 69e99172450e40536bf2e6c110183d34a0e283e2.

Reverted https://github.com/pytorch/pytorch/pull/130713 on behalf of https://github.com/clee2000 due to broke functorch\test_eager_transforms.py on windows https://github.com/pytorch/pytorch/actions/runs/9958208725/job/27530132704 69e9917245.  Test failure on PR is real, possibly force merged to get around lint error? ([comment](https://github.com/pytorch/pytorch/pull/130713#issuecomment-2231901793))
2024-07-16 22:07:55 +00:00
2eec02523b [autograd] Support GradientEdge as output for torch.autograd.grad (#127766)
This is useful for splitting grad to run in two parts while preserving intermediates:

<details>

<summary>
Click to see code
</summary>

```python
import collections
import weakref
from typing import Deque

import torch
from torch.autograd.graph import GradientEdge

def _get_grad_fn_or_grad_acc(t):
    if t.requires_grad and t.grad_fn is None:
        return t.view_as(t).grad_fn.next_functions[0][0]
    else:
        return t.grad_fn

def reverse_closure(roots, target_nodes):
    # Recurse until we reach a target node
    closure = set()
    actual_target_nodes = set()
    q: Deque = collections.deque()
    for node in roots:
        if node is not None and node not in closure:
            closure.add(node)
            q.append(node)
    while q:
        node = q.popleft()
        reverse_edges = node.metadata.get("reverse_edges", [])
        for holder_ref, idx in reverse_edges:
            ref = holder_ref()
            if ref is None:
                raise RuntimeError("Reverse graph is no longer alive")
            fn = ref.node
            if fn in closure or fn is None:
                continue
            if fn in target_nodes:
                actual_target_nodes.add(fn)
                continue
            closure.add(fn)
            q.append(fn)
    return closure, actual_target_nodes

# Enable weak pointer
class Holder():
    def __init__(self, node):
        self.node = node

# TODO: use weak references to avoid reference cycle
def construct_reverse_graph(roots):
    q: Deque = collections.deque()
    root_seen = set()
    reverse_graph_refs = []
    for node in roots:
        if node is not None and node not in root_seen:
            q.append(node)
            root_seen.add(node)
    while q:
        node = q.popleft()
        for fn, idx in node.next_functions:
            if fn is not None:
                # Don't necessarily need to store on the graph
                reverse_edges = fn.metadata.get("reverse_edges", [])
                if len(reverse_edges) == 0:
                    q.append(fn)
                holder = Holder(node)
                holder_ref = weakref.ref(holder)
                reverse_graph_refs.append(holder)
                reverse_edges.append((holder_ref, idx))
                fn.metadata["reverse_edges"] = reverse_edges
    return reverse_graph_refs

def get_param_groups(inputs, params):
    inputs_closure, _ = reverse_closure(inputs, set())
    param_groups = dict()  # keyed on intermediates
    for i, param in enumerate(params):
        closure, intersected = reverse_closure([param], inputs_closure)
        param_group = {
            "params": set([param]),
            "intermediates": set(intersected),
        }
        for input_node in intersected:
            existing = param_groups.get(input_node, None)
            if existing is not None:
                existing["params"] = existing["params"].union(param_group["params"])
                existing["intermediates"] = existing["intermediates"].union(param_group["intermediates"])
                param_group = existing
            else:
                param_groups[input_node] = param_group

    # Sanity check: union of all param_groups params should be equal to all params
    union_params = set()
    seen_ids = set()
    unique_param_groups = []
    for param_group in param_groups.values():
        if id(param_group) not in seen_ids:
            seen_ids.add(id(param_group))
            unique_param_groups.append(param_group)
            union_params = union_params.union(param_group["params"])
    assert union_params == set(params)

    return unique_param_groups

def compute_grads_only_inputs2(roots, inps, weights):
    root_grad_fns = list(map(_get_grad_fn_or_grad_acc, roots))
    inp_grad_fns = list(map(_get_grad_fn_or_grad_acc, inps))
    weight_grad_fns = list(map(_get_grad_fn_or_grad_acc, weights))

    reverse_graph_refs = construct_reverse_graph(root_grad_fns)
    param_groups = get_param_groups(inp_grad_fns, weight_grad_fns)
    del reverse_graph_refs

    for param_group in param_groups:
        for i, intermediate in enumerate(param_group["intermediates"]):
            def get_hook(param_group, i):
                def hook(grad_inputs):
                    if param_group.get("grads", None) is None:
                        param_group["grads"] = [None] * len(param_group["intermediates"])
                    param_group["grads"][i] = grad_inputs
                return hook
            # These are always "split" nodes that we need to recompute, so
            # save their inputs.
            intermediate.register_prehook(get_hook(param_group, i))

    dinputs = torch.autograd.grad((out,), inputs=tuple(inps), grad_outputs=(torch.ones_like(out),), retain_graph=True)
    return dinputs, param_groups

def compute_grads_only_weights2(user_weights, param_groups):
    all_dweights = dict()
    for param_group in param_groups:
        # TODO: Handle case where intermediate can have multiple outputs
        intermediate_edges = tuple(GradientEdge(i, 0) for i in param_group["intermediates"])
        weights_edges = tuple(GradientEdge(w, 0) for w in param_group["params"])

        assert all(len(g) == 1 for g in param_group["grads"])
        # [NEW!] Able to pass a GradientEdge to autograd.grad as output
        # We do not need to retain_graph because... guarantee no overlap?
        print("trying to execute: ", intermediate_edges, weights_edges)
        dweights = torch.autograd.grad(intermediate_edges, weights_edges, grad_outputs=sum(param_group["grads"], tuple()))
        for w, dw in zip(param_group["params"], dweights):
            all_dweights[w] = dw
    # return grads in the original order weights were provided in
    out = []
    for w in user_weights:
        grad_acc = _get_grad_fn_or_grad_acc(w)
        out.append(all_dweights[grad_acc])
    return tuple(out)

```

</details>

```python
import torch
import torch.nn as nn

# Setup
mod1 = nn.Linear(10, 10)
mod2 = nn.Linear(10, 10)

a = torch.rand(10, requires_grad=True)

weights = tuple(mod1.parameters()) + tuple(mod2.parameters())
inps = (a,)

out = mod2(mod1(a))

class LoggingTensorMode(torch.utils._python_dispatch.TorchDispatchMode):
    def __torch_dispatch__(self, func, types, args=(), kwargs=None):
        if kwargs is None:
            kwargs = {}
        rs = func(*args, **kwargs)
        print(f"{func.__module__}.{func.__name__}")
        return rs

print(" -- SPLIT -- ")
# Compute gradients in two parts
with LoggingTensorMode():
    print("PART 1")
    dinputs, state = compute_grads_only_inputs2((out,), inps, weights)
    print("PART 2")
    dweights = compute_grads_only_weights2(weights, state)

out = mod2(mod1(a))

print(" -- REF -- ")

# Compare with reference
with LoggingTensorMode():
    ref_all_gradients = torch.autograd.grad(out, inputs=tuple(inps) + weights, grad_outputs=(torch.ones_like(out),))

for actual, ref in zip(dinputs + dweights, ref_all_gradients):
    print(torch.allclose(actual, ref))

```

<img width="598" alt="image" src="https://github.com/pytorch/pytorch/assets/13428986/3681b8a7-3ab4-4d1d-a836-abef6913e671">

```
PART 1
torch._ops.aten.view.default
torch._ops.aten.view.default
torch._ops.aten.view.default
torch._ops.aten.view.default
torch._ops.aten.view.default
torch._ops.aten.ones_like.default
V0603 10:17:21.590878 8300067520 torch/autograd/graph.py:751] Executing: <ViewBackward0 object at 0x12a1ee160> with grad_outputs: [f32[10]]
torch._ops.aten.view.default
V0603 10:17:21.591204 8300067520 torch/autograd/graph.py:751] Executing: <AddmmBackward0 object at 0x12a1ee0d0> with grad_outputs: [f32[1, 10]]
torch._ops.aten.t.default
torch._ops.aten.mm.default
V0603 10:17:21.591578 8300067520 torch/autograd/graph.py:751] Executing: <ViewBackward0 object at 0x100d7ae50> with grad_outputs: [f32[1, 10]]
torch._ops.aten.view.default
V0603 10:17:21.591747 8300067520 torch/autograd/graph.py:751] Executing: <ViewBackward0 object at 0x12a1e4a60> with grad_outputs: [f32[10]]
torch._ops.aten.view.default
V0603 10:17:21.591834 8300067520 torch/autograd/graph.py:751] Executing: <AddmmBackward0 object at 0x12a1e4bb0> with grad_outputs: [f32[1, 10]]
torch._ops.aten.t.default
torch._ops.aten.mm.default
V0603 10:17:21.591922 8300067520 torch/autograd/graph.py:751] Executing: <ViewBackward0 object at 0x12a1e4a90> with grad_outputs: [f32[1, 10]]
torch._ops.aten.view.default
PART 2
trying to execute:  (GradientEdge(node=<AddmmBackward0 object at 0x12a1e4bb0>, output_nr=0),) (GradientEdge(node=<AccumulateGrad object at 0x12a21b130>, output_nr=0), GradientEdge(node=<AccumulateGrad object at 0x12a21b7c0>, output_nr=0))
V0603 10:17:21.592223 8300067520 torch/autograd/graph.py:751] Executing: <AddmmBackward0 object at 0x12a1e4bb0> with grad_outputs: [f32[1, 10]]
torch._ops.aten.t.default
torch._ops.aten.mm.default
torch._ops.aten.t.default
torch._ops.aten.sum.dim_IntList
torch._ops.aten.view.default
V0603 10:17:21.592421 8300067520 torch/autograd/graph.py:751] Executing: <TBackward0 object at 0x12a1cad60> with grad_outputs: [f32[10, 10]]
torch._ops.aten.t.default
trying to execute:  (GradientEdge(node=<AddmmBackward0 object at 0x12a1ee0d0>, output_nr=0),) (GradientEdge(node=<AccumulateGrad object at 0x12a1e41c0>, output_nr=0), GradientEdge(node=<AccumulateGrad object at 0x12a21b670>, output_nr=0))
V0603 10:17:21.593481 8300067520 torch/autograd/graph.py:751] Executing: <AddmmBackward0 object at 0x12a1ee0d0> with grad_outputs: [f32[1, 10]]
torch._ops.aten.t.default
torch._ops.aten.mm.default
torch._ops.aten.t.default
torch._ops.aten.sum.dim_IntList
torch._ops.aten.view.default
V0603 10:17:21.593750 8300067520 torch/autograd/graph.py:751] Executing: <TBackward0 object at 0x12a21b2b0> with grad_outputs: [f32[10, 10]]
torch._ops.aten.t.default
torch._ops.aten.view.default
torch._ops.aten.view.default
torch._ops.aten.view.default
torch._ops.aten.view.default

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127766
Approved by: https://github.com/albanD
2024-07-16 21:46:19 +00:00
c1e7e40f24 Revert "[Traceable FSDP2][Inductor] Re-inplace all_gather_into_tensor (#129773)"
This reverts commit f2f31027ce8dc3985663bf6eaa66f3c5559b724a.

Reverted https://github.com/pytorch/pytorch/pull/129773 on behalf of https://github.com/clee2000 due to failed inductor/test_torchinductor_dynamic_shapes.py on mac https://github.com/pytorch/pytorch/actions/runs/9963396991/job/27530249256 f2f31027ce.  The build failed on PR so test jobs didn't run ([comment](https://github.com/pytorch/pytorch/pull/129773#issuecomment-2231808437))
2024-07-16 20:54:14 +00:00
4e479568df [PT2] Log compile ID in the signpost event (#130801)
Summary:
We should log compile ID as well for easier comparison.

Currently going through some of this data, I think we should make few more changes as well.

Reland for D59725870

Test Plan: Sandcastle and Pytorch

Differential Revision: D59789110

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130801
Approved by: https://github.com/oulgen
2024-07-16 20:47:36 +00:00
2ceade37c5 [SymmetricMemory] put socket files in /tmp (#130757)
Currently the socket files are put in the current directory, which may not be writable in all environments.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130757
Approved by: https://github.com/Chillee
ghstack dependencies: #130756
2024-07-16 20:21:05 +00:00
0468f2616a [SymmetricMemory] make sure different subgroups with the same name use different store prefixes (#130756)
This fixes a race condition in which different subgroups with the same name on the same host would use the same store.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130756
Approved by: https://github.com/Chillee
2024-07-16 20:21:05 +00:00
f2f31027ce [Traceable FSDP2][Inductor] Re-inplace all_gather_into_tensor (#129773)
FSDP2 eager pre-allocates the output buffer for AllGather and the AllGather just writes into that buffer. However, under compile, by default we use out-of-place AllGather, which means in Traceable FSDP2 case we will be unnecessarily using more memory than eager. We want to re-inplace that AllGather instead.

This PR adds a post_grad pass to re-inplace all_gather_into_tensor (i.e. changing it from `all_gather_into_tensor.default` out-of-place op to `all_gather_into_tensor_out.default` out-variant op).

One thing to note is that since with this pass we are introducing a mutable op into the post_grad FX graph, we must do this pass after `reinplace_inplaceable_ops` (at which point we are okay again with having mutable ops in the graph). To facilitate this, this PR adds a `post_grad_custom_post_reinplace_pass` extension point to allow user-defined post-reinplace FX passes.
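A hedged sketch of how the extension point might be used; treating it as an attribute on `torch._inductor.config` (like the existing post_grad custom pass hooks) is an assumption:

```python
# Assumption: the extension point is exposed on torch._inductor.config.
import torch
import torch._inductor.config as inductor_config

def my_post_reinplace_pass(graph: torch.fx.Graph) -> None:
    # Runs after reinplace_inplaceable_ops, so mutable ops are allowed again.
    for node in graph.nodes:
        pass  # inspect or rewrite nodes here

inductor_config.post_grad_custom_post_reinplace_pass = my_post_reinplace_pass
```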

---

Test commands:
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_fullgraph_backend_inductor`

---

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129773
Approved by: https://github.com/eellison
2024-07-16 20:07:41 +00:00
156b99cfb1 [inductor] Handle inductor counters in fx graph cache (#130635)
Summary: Similar to the handling of metrics, save inductor counter deltas in the FX graph cache entry and increment the counters appropriately on a cache hit
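A simplified illustration of the delta idea (not the actual cache code):

```python
# Capture counter deltas around a compile so a cache hit can replay them later.
from torch._dynamo.utils import counters

before = {name: dict(counter) for name, counter in counters.items()}
# ... compile and populate the FX graph cache entry ...
delta = {
    name: {k: counters[name][k] - before.get(name, {}).get(k, 0) for k in counters[name]}
    for name in counters
}
# On a cache hit, the stored delta would be added back into `counters`.
```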

Test Plan: new unit test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130635
Approved by: https://github.com/eellison
2024-07-16 20:07:16 +00:00
d548417d95 [NJT] throw an exception if nested_tensor_from_jagged is fx-traced without being fx.wrapped (#130702)
The NJT constructor can't be fx-traced safely due to the dummy nt used:

774ca93fd2/torch/nested/_internal/nested_tensor.py (L501-L508)

The error doesn't appear immediately, but appears if you try to move a module with an fx-traced NJT constructor onto a different device, or try to serialize it. Let's throw an error if we try to fx-trace the NJT constructor so users know to wrap the call.
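A hedged sketch of the wrapping that the error points users toward (the use of `torch.fx.wrap` here is an assumption about the intended remedy):

```python
import torch.fx
from torch.nested._internal.nested_tensor import nested_tensor_from_jagged

# Register the constructor as an fx leaf so symbolic tracing does not recurse
# into it (and into the dummy-NT construction).
torch.fx.wrap("nested_tensor_from_jagged")

def build_njt(values, offsets):
    return nested_tensor_from_jagged(values, offsets)
```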

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130702
Approved by: https://github.com/jbschlosser, https://github.com/soulitzer
2024-07-16 19:21:10 +00:00
0851de5b16 Revert "[ONNX] Remove beartype usage (#130484)"
This reverts commit 1794c35912025aa44b0d70f67ff664b4f7bd1014.

Reverted https://github.com/pytorch/pytorch/pull/130484 on behalf of https://github.com/clee2000 due to test_sympy_utils failure is real https://github.com/pytorch/pytorch/actions/runs/9961499559/job/27523758780 1794c35912.  Dr CI is matching with commits in current commit? ([comment](https://github.com/pytorch/pytorch/pull/130484#issuecomment-2231575577))
2024-07-16 18:41:51 +00:00
09b1b113f5 Cache min / max seq len for torch.nested.as_nested_tensor(t) (#130766)
For the `torch.nested.as_nested_tensor(t)` constructor, computing min / max seq len is trivial since the sequence lengths are all the same. Might as well cache them during construction.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130766
Approved by: https://github.com/YuqingJ, https://github.com/soulitzer
2024-07-16 18:32:47 +00:00
408c921d96 Make hashing a SymInt raise an error again (#130548)
See https://github.com/pytorch/pytorch/issues/130547

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130548
Approved by: https://github.com/Skylion007, https://github.com/albanD, https://github.com/lezcano
2024-07-16 18:30:30 +00:00
1d8baa4df2 [torchbench][servicelab] Fix servicelab test failures (#130781)
Fix servicelab test failures
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130781
Approved by: https://github.com/desertfire
2024-07-16 17:35:13 +00:00
1794c35912 [ONNX] Remove beartype usage (#130484)
beartype has served us well in identifying type errors and ensuring we call internal functions with the correct arguments (thanks!). However, the value of having beartype is diminished because of the following:

1. When beartype improved support for better Dict[] type checking, it discovered typing mistakes in some functions that were previously uncaught. This caused the exporter to fail with newer versions of beartype when it used to succeed. Since we cannot fix PyTorch and release a new version just because of this, it creates confusion for users that have beartype in their environment when using torch.onnx.
2. beartype adds an additional call line in the traceback, which makes the already thick dynamo stack even larger, affecting readability when users diagnose errors with the traceback.
3. Since the typing annotations need to be evaluated, we cannot use new syntaxes like `|` because we need to maintain compatibility with Python 3.8. We don't want to wait for PyTorch to take py310 as the lowest supported Python before using the new typing syntaxes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130484
Approved by: https://github.com/titaiwangms
2024-07-16 17:34:36 +00:00
67e22d6c61 [Fix]: Convert operator that does specialization to its symbolic counterpart (#129578)
#### Issue
During conversion, use the symbolic counterpart of an operator when one exists.

#### Test Plan
`pytest test/export/test_converter.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129578
Approved by: https://github.com/angelayi
2024-07-16 17:19:57 +00:00
e8998d68c8 [export] add non-strict training IR (#130062)
Summary: Adds non-strict implementation of training IR export. Any expected non-strict training IR failures are also either existing strict training IR or non-strict failures (no new failures added). 4 strict training IR failures also resolved.

Refraining from unifying export/export_for_training, per @ydwu4's feedback :)

Test Plan: added test_export_training_ir_to_run_decomp_non_strict.py for non-strict training IR

Differential Revision: D59349454

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130062
Approved by: https://github.com/ydwu4, https://github.com/zhxchen17
2024-07-16 17:08:00 +00:00
d2f44eabe7 [Export] Support aten.full.default and aten.full_like.default (#130639)
Summary: Add operator tests for full & full_like operators

Test Plan:
Rerun kernel test using
```
buck2 run //glow/fba/tests:run_kernel mode/dev -- --kernel splat --config "input=1;dtype=fp32;fill_value=42.0"  -tl_time
```
{F1752274071}

Operator tests
```
buck2 run mode/{opt,inplace} //caffe2/torch/fb/test_library:afg_operator_test -- -k __full__
```
{F1752340913}

Differential Revision: D59593849

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130639
Approved by: https://github.com/StellarrZ
2024-07-16 16:50:04 +00:00
f272e0ab4a [inductor] support unbacked symint divisors in vars_and_sizes (#130595)
Scenario:
```
>>> nodes
IterationRangesEntry(
    x2,
    divisor=192*u0 + 192576,
    length=s1,
    (xindex//(192*u0 + 192576)),
    {x0: 192, x1: u0 + 1003, x2: s1, x3: 192*s1*u0 + 192576*s1, x4: 192*u0 + 192576})
IterationRangesEntry(
    x1,
    divisor=192,
    length=u0 + 1003,
    ModularIndexing(xindex, 192, u0 + 1003),
    {x0: 192, x1: u0 + 1003, x2: s1, x3: 192*s1*u0 + 192576*s1, x4: 192*u0 + 192576})
IterationRangesEntry(
    x0,
    divisor=1,
    length=192,
    ModularIndexing(xindex, 1, 192),
    {x0: 192, x1: u0 + 1003, x2: s1, x3: 192*s1*u0 + 192576*s1, x4: 192*u0 + 192576})
```

Think about whether using the fallback is safe here. I think it's safe because the divisor of one IterationRangesEntry should be the product of the lengths of the preceding IterationRangesEntries, unless one of the lengths divides by an unbacked symint?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130595
Approved by: https://github.com/aakhundov, https://github.com/ezyang
2024-07-16 16:21:38 +00:00
2b43d339fe Make FlexAttention API public (#130755)
# Summary

Makes the prototype API flex_attention public

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130755
Approved by: https://github.com/Chillee
2024-07-16 16:21:25 +00:00
cbda8be537 Revert "Propagate buffer and parameter indices through AOT (#130393)"
This reverts commit 69a77389e2c4052834c89a25757cdbf5f83b6208.

Reverted https://github.com/pytorch/pytorch/pull/130393 on behalf of https://github.com/clee2000 due to broke lint for torch/_functorch/_aot_autograd/subclass_utils.py https://github.com/pytorch/pytorch/actions/runs/9948630877/job/27483551649 80236dca90 lint was green on PR, probably a landrace ([comment](https://github.com/pytorch/pytorch/pull/130393#issuecomment-2231263753))
2024-07-16 15:43:34 +00:00
9cb23ba85b Revert "Add buffer static input tests to cudagraph trees (#130402)"
This reverts commit 80236dca90b0874cb2b6f9c9fa5f159c55726401.

Reverted https://github.com/pytorch/pytorch/pull/130402 on behalf of https://github.com/clee2000 due to broke lint for torch/_functorch/_aot_autograd/subclass_utils.py https://github.com/pytorch/pytorch/actions/runs/9948630877/job/27483551649 80236dca90 lint was green on PR, probably a landrace ([comment](https://github.com/pytorch/pytorch/pull/130393#issuecomment-2231263753))
2024-07-16 15:43:34 +00:00
c509319210 [inductor] Disable remote fx graph cache in test_snode_runtime (#130655)
Summary: Unfortunately we can't save / restore metrics.metrics.node_runtimes in the cache entries because these contain objects that don't pickle: `TypeError: cannot pickle 'PyCapsule' object`.

Test Plan: `buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:snode_runtime -- --exact 'caffe2/test/inductor:snode_runtime - test_mm (caffe2.test.inductor.test_snode_runtime.ComputeBoundedTests)' --run-disabled --jobs 18 --stress-runs 10`

Differential Revision: D59705654

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130655
Approved by: https://github.com/oulgen
2024-07-16 15:11:17 +00:00
aa4ad711ef [CCA][Memory Snapshot] Create TraceEntryRingBuffer class for alloc_trace logic (#130741)
Summary:
Move the alloc_trace logic into a separate class to reduce the risk of deadlocks when mixing with CCA's lock. Switch to a std::mutex instead of std::recursive_mutex.

This lets us re-use the logic in the TraceEntryRingBuffer class for later diffs.

Test Plan: CI, resnet run, and FBR model.

Differential Revision: D59690408

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130741
Approved by: https://github.com/davidberard98
2024-07-16 15:01:48 +00:00
e11c41035c Directly use empty strided in cudagraph copy (#130777)
We had an issue with the `-1` somehow ending up in a negative number of required elements. Not sure why the original didn't work; we should land if CI is green.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130777
Approved by: https://github.com/BoyuanFeng
2024-07-16 14:37:30 +00:00
4c3348932c typing: convert_frame (#130670)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130670
Approved by: https://github.com/Skylion007
ghstack dependencies: #130669
2024-07-16 14:31:35 +00:00
ea25febfab typing: storage (#130669)
This isn't a full typing of the file - it just fixes some uses of unbound 'T' (if you use a TypeVar as an output it also needs to be an input).
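For illustration, the rule being fixed:

```python
from typing import TypeVar

T = TypeVar("T")

def bad() -> T:     # unbound: T appears only in the return type
    ...

def ok(x: T) -> T:  # bound: T is tied to an input, so the output type can be inferred
    ...
```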

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130669
Approved by: https://github.com/oulgen, https://github.com/Skylion007
2024-07-16 14:31:35 +00:00
8390843eba Invalidate StorageImpl instances when tensor is overwritten with cudagraphs (#125264)
Fixes #104435

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125264
Approved by: https://github.com/ezyang
2024-07-16 14:29:29 +00:00
1fbfb3202d [docs][TorchScript] document c10::AliasAnalysisKind::CONSERVATIVE (#130765)
I spent a while trying to search this to remember what this was called. Adding it to the OVERVIEW.md docs so it's easier to search
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130765
Approved by: https://github.com/nmacchioni, https://github.com/eellison, https://github.com/aaronenyeshi
2024-07-16 14:20:31 +00:00
69e9917245 [inductor] adapte windows file path (#130713)
This PR depends on https://github.com/pytorch/pytorch/pull/130132 landing successfully.
The detailed log: https://github.com/pytorch/pytorch/issues/124245#issuecomment-2211889758

After the file path was adapted for Windows, the first Windows inductor case ran successfully.

```python
import torch

def foo(x, y):
    a = torch.sin(x)
    b = torch.cos(x)
    return a + b
opt_foo1 = torch.compile(foo)
print(opt_foo1(torch.randn(10, 10), torch.randn(10, 10)))
```

Result:
![image](https://github.com/user-attachments/assets/4944df47-e74d-476b-8eb5-1d1fd5abeb41)

Co-authored-by: Jiong Gong <jiong.gong@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130713
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/desertfire
2024-07-16 13:53:39 +00:00
53e5b8ac5b [BE]: Update flake8-comprehensions and enable C420 (#130699)
Uses `dict.fromkeys` whenever possible, as covered by flake8-comprehensions rule C420. While the ruff rule RUF025 is still in preview, flake8-comprehensions has added a new rule which covers this. Using `dict.fromkeys` is faster when the value being added to the dictionary is the same at every iteration and is immutable, and it also removes an unnecessary dict comprehension.
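For example, the pattern C420 flags and the preferred replacement:

```python
keys = ["a", "b", "c"]

d1 = {k: None for k in keys}  # flagged by C420: constant, immutable value
d2 = dict.fromkeys(keys)      # preferred: same result, no comprehension needed
d3 = dict.fromkeys(keys, 0)   # also works with an explicit constant value
```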

This rule will be enabled with our current ruleset in RUF in 0.6 as C420.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130699
Approved by: https://github.com/lezcano, https://github.com/ezyang
2024-07-16 13:47:49 +00:00
213685ba97 [torchao][pt2 benchmark runner] Run performance test non-alternately (#130136)
Summary:
By default, performance tests (speedup experiments) will run the baseline and test backend alternately.

However, this does not work for the torchao backend, which changes the model in-place; as a result, the baseline run would also use the torchao backend since the model has already been quantized.

Add a new experiment "latency_experiment" to run performance tests non-alternately (first run baseline for a few iterations, then run the test backend).

Test Plan:
```
buck2 run mode/opt //pytorch/benchmark:pt2 -- --only AlbertForMaskedLM --quantization noquant --performance --inference --bfloat16
```

```
buck2 run mode/opt //pytorch/benchmark:pt2 -- --only AlbertForMaskedLM --quantization autoquant --performance --inference --bfloat16 --inductor-compile-mode max-autotune
```

Differential Revision: D59332736

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130136
Approved by: https://github.com/jerryzh168
2024-07-16 13:38:17 +00:00
67c6941b4e Update torch.cat decomp for 0-dim (#130763)
Fix for https://github.com/pytorch/pytorch/issues/130615

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130763
Approved by: https://github.com/Skylion007, https://github.com/mlazos
2024-07-16 13:34:01 +00:00
705da70f2c [inductor][cpp] align dtype convert cache between vec and scalar kernels (#130677)
The conversion cache used for fixing https://github.com/pytorch/pytorch/issues/115260 depended on "store", which might be removed and ignored. This could lead to inconsistent code generated between the vec and scalar kernels, since we generate the scalar kernel first, followed by the vector kernel, and the store buffer might be removed by the scalar kernel, which impacts the vector kernel codegen. This PR moves the caching from "store" to the "to_dtype" calls, which are not impacted by the removed buffers.

`pytest -k test_consistent_remove_buffers test/inductor/test_cpu_repro.py`

before
```c++
extern "C"  void kernel(const bfloat16* in_ptr0,
                       bfloat16* out_ptr1)
{
    {
        for(long x0=static_cast<long>(0L); x0<static_cast<long>(64L); x0+=static_cast<long>(16L))
        {
            auto tmp0 = at::vec::Vectorized<bfloat16>::loadu(in_ptr0 + static_cast<long>(x0), 16);
            auto tmp1 = at::vec::convert<float>(tmp0);
            auto tmp2 = tmp1 + tmp1;
            auto tmp3 = at::vec::convert<bfloat16>(tmp2);
            auto tmp4 = at::vec::convert<float>(tmp3);
            auto tmp5 = tmp1 + tmp4;
            auto tmp6 = at::vec::convert<bfloat16>(tmp5);
            tmp6.store(out_ptr1 + static_cast<long>(x0), 16);
        }
        #pragma omp simd simdlen(8)
        for(long x0=static_cast<long>(64L); x0<static_cast<long>(65L); x0+=static_cast<long>(1L))
        {
            auto tmp0 = in_ptr0[static_cast<long>(x0)];
            auto tmp1 = c10::convert<float>(tmp0);
            auto tmp2 = decltype(tmp1)(tmp1 + tmp1);
            auto tmp3 = c10::convert<bfloat16>(tmp2);
            auto tmp4 = decltype(tmp1)(tmp1 + tmp2);
            auto tmp5 = c10::convert<bfloat16>(tmp4);
            out_ptr1[static_cast<long>(x0)] = tmp5;
        }
    }
}
```

after
```c++
extern "C"  void kernel(const bfloat16* in_ptr0,
                       bfloat16* out_ptr1)
{
    {
        for(long x0=static_cast<long>(0L); x0<static_cast<long>(64L); x0+=static_cast<long>(16L))
        {
            auto tmp0 = at::vec::Vectorized<bfloat16>::loadu(in_ptr0 + static_cast<long>(x0), 16);
            auto tmp1 = at::vec::convert<float>(tmp0);
            auto tmp2 = tmp1 + tmp1;
            auto tmp3 = at::vec::convert<bfloat16>(tmp2);
            auto tmp4 = tmp1 + tmp2;
            auto tmp5 = at::vec::convert<bfloat16>(tmp4);
            tmp5.store(out_ptr1 + static_cast<long>(x0), 16);
        }
        #pragma omp simd simdlen(8)
        for(long x0=static_cast<long>(64L); x0<static_cast<long>(65L); x0+=static_cast<long>(1L))
        {
            auto tmp0 = in_ptr0[static_cast<long>(x0)];
            auto tmp1 = c10::convert<float>(tmp0);
            auto tmp2 = decltype(tmp1)(tmp1 + tmp1);
            auto tmp3 = c10::convert<bfloat16>(tmp2);
            auto tmp4 = decltype(tmp1)(tmp1 + tmp2);
            auto tmp5 = c10::convert<bfloat16>(tmp4);
            out_ptr1[static_cast<long>(x0)] = tmp5;
        }
    }
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130677
Approved by: https://github.com/leslie-fang-intel
2024-07-16 13:25:05 +00:00
68a4f2a3df Revert "Tighten torch.library.infer_schema input types (#130705)"
This reverts commit ca2d424c6e5358f9fee8dc9ee7477de76b50f848.

Reverted https://github.com/pytorch/pytorch/pull/130705 on behalf of https://github.com/atalman due to Failing internal CI ([comment](https://github.com/pytorch/pytorch/pull/130705#issuecomment-2230821876))
2024-07-16 12:57:11 +00:00
dee0f43fde Add a CI job to check runner det sync (#129746)
Add a new CI job that runs only when the runner determinator files are modified. The job checks that the runner_determinator.py script is in sync with the version embedded in _runner-determinator.yaml.

Fixes TBD

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129746
Approved by: https://github.com/zxiiro, https://github.com/ZainRizvi, https://github.com/jeanschmidt
2024-07-16 11:44:55 +00:00
e57101d927 Add testing regarding SparseAdam state_dicts (#130645)
Summary:
- Updated SparseAdam to run test_state_dict_deterministic unit test.
- Made gradients sparse while keeping weights dense in the above test.

Test Plan:
- Ran test_optim.py locally.

Fixes #116507

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130645
Approved by: https://github.com/janeyx99
2024-07-16 11:29:22 +00:00
cyy
168e41009b [structural binding][10/N] Replace std::tie with structural binding (#130784)
Follows  #130404

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130784
Approved by: https://github.com/malfet
2024-07-16 10:28:14 +00:00
747b38c131 [BE][Easy][2/19] enforce style for empty lines in import segments in .ci/ and .github/ (#129753)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129753
Approved by: https://github.com/malfet
ghstack dependencies: #129752
2024-07-16 09:40:00 +00:00
096dc444ce Keep zero check be compatible with different sympy versions (#130729)
# Motivation
I found a difference between sympy 1.12 and 1.13.
```python
# for 1.12
>>> import sympy
>>> a = sympy.Number(0.0)
>>> a == 0
True
```
```python
# for 1.13
>>> import sympy
>>> a = sympy.Number(0.0)
>>> a == 0
False
```
This difference in behavior impacts the result of [safe_mul](6beec34b1c/torch/utils/_sympy/value_ranges.py (L521-L528)), producing an incorrect result when `a = sympy.Number(0.0)` and `b = inf`: the result is `nan` with sympy 1.13, while the expected result is **0**.
```python
def safe_mul(a, b):
    # Make unknown() * wrap(0.0) == wrap(0.0)
    if a == 0.0:
        return a
    elif b == 0.0:
        return b
    else:
        return a * b
```

Across different sympy versions, `sympy.Number(0)` always behaves the same: it equals 0.0.
```python
>>> import sympy
>>> a = sympy.Number(0)
>>> a == 0.0
True # for different sympy versions
```
So, use 0.0 when checking for zero in safe_mul to stay compatible with different sympy versions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130729
Approved by: https://github.com/lezcano, https://github.com/EikanWang
2024-07-16 08:39:00 +00:00
fedae41c57 [dynamo] Do not mark nn.module containers as BuiltinNNModuleVariable (#130773)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130773
Approved by: https://github.com/williamwen42, https://github.com/mlazos
2024-07-16 06:55:46 +00:00
83eedf66b9 Update libfmt submodule to 11.0.1 (#130628)
Update libfmt to 11.0.1 (reopen of https://github.com/pytorch/pytorch/pull/129962). Requires a kineto update; fmt::join moved into a separate include, so it was added where necessary.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130628
Approved by: https://github.com/aaronenyeshi
2024-07-16 06:12:11 +00:00
c549629696 [CD] Fix xpu nightly wheel test failure (#130742)
The XPU nightly wheel test hit a permission issue on the `linux.idc.xpu` runner because those runners are onboarded with the `jenkins` user while the binary test runs in the Docker container directly as `root`, so the temp files can't be deleted; see https://github.com/pytorch/pytorch/actions/runs/9935452320/job/27448053625#step:8:91
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130742
Approved by: https://github.com/atalman
2024-07-16 05:31:20 +00:00
cyy
95dbbf713e [Distributed] [9/N] Fix clang-tidy warnings in torch/csrc/distributed/rpc (#130109)
Follows #125102

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130109
Approved by: https://github.com/ezyang
2024-07-16 04:23:42 +00:00
7b2e802f31 [dtensor] add a few dunder methods to pointwise ops (#130754)
fixes https://github.com/pytorch/pytorch/issues/130671

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130754
Approved by: https://github.com/Skylion007, https://github.com/awgu, https://github.com/msaroufim
ghstack dependencies: #130753
2024-07-16 02:53:35 +00:00
2b2671a7b1 [dtensor] fix foreach_norm when ord is 2 (#130753)
As titled, this fixes a case where, when ord is passed as 2 (the default value), op dispatching does not receive the default value.

We simply check whether the args schema receives an `ord` field or not.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130753
Approved by: https://github.com/awgu
2024-07-16 02:53:35 +00:00
a29052a0bf [BE][Ez]: Update ruff to 0.5.2 (#130698)
Update ruff to 0.5.2, which brings bug fixes and performance improvements.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130698
Approved by: https://github.com/ezyang
2024-07-16 01:31:30 +00:00
ad314a2f05 Pass torch.load(weights_only=) internally to avoid FutureWarning (#130663)
Fixes #130658
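The warning is avoided by passing the flag explicitly, e.g. (sketch; the path is a placeholder, and the value passed internally may differ per call site):

```python
import torch

# Passing weights_only explicitly avoids the FutureWarning about the upcoming
# default change in torch.load.
state_dict = torch.load("checkpoint.pt", weights_only=True)
```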

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130663
Approved by: https://github.com/malfet, https://github.com/LucasLLC
2024-07-16 01:24:38 +00:00
3cd2ae331a Use inductor TestCase for distributed tests (#129494)
Summary: At least some of the tests deriving from MultiProcessTestCase exercise inductor. Using the inductor TestCase class makes sure we always get a clean cache dir.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129494
Approved by: https://github.com/eellison
2024-07-16 01:24:35 +00:00
39eeaac4e5 inductor: avoiding moving constructor to cuda when it would cause h2d sync in index_put_ fallback (#130338)
My attempt at a fix for https://github.com/pytorch/pytorch/issues/130335; see the issue for more details / internal xref. Any feedback from inductor folks is appreciated. I attempted to make the move-constructors-to-cuda pass a bit less aggressive by detecting when the movement would incur an H2D sync for `aten.index_put_`. I'm not sure if there are any other ops that inductor falls back to eager on that may or may not incur an H2D sync if we change any of their inputs from cpu to cuda.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130338
Approved by: https://github.com/eellison
2024-07-16 00:48:58 +00:00
93a03edcf9 Update error message in meta__convert_weight_to_int4pack (#130707)
This PR fixes the error message in https://github.com/pytorch/pytorch/pull/129940.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130707
Approved by: https://github.com/lezcano, https://github.com/malfet
2024-07-16 00:44:35 +00:00
a3abfa5cb5 [BE][Easy][1/19] enforce style for empty lines in import segments (#129752)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129752
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-07-16 00:42:56 +00:00
eqy
5e617d7ef5 [CUDA] Actually bump tolerances for test_grad_pca_lowrank (#130770)
Fixes change in #129902 to actually bump pca rather than svd, thanks @ptrblck for the catch

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130770
Approved by: https://github.com/Skylion007
2024-07-16 00:41:10 +00:00
80236dca90 Add buffer static input tests to cudagraph trees (#130402)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130402
Approved by: https://github.com/eellison
ghstack dependencies: #130391, #130392, #130503, #130393
2024-07-16 00:25:38 +00:00
69a77389e2 Propagate buffer and parameter indices through AOT (#130393)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130393
Approved by: https://github.com/bdhirsh
ghstack dependencies: #130391, #130392, #130503
2024-07-16 00:25:38 +00:00
200d3d0a89 Remove static param counting if inlining NN modules (#130503)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130503
Approved by: https://github.com/bdhirsh
ghstack dependencies: #130391, #130392
2024-07-16 00:25:34 +00:00
0d0c09702a Update mark_static_address for inlining NN modules (#130392)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130392
Approved by: https://github.com/anijain2305
ghstack dependencies: #130391
2024-07-16 00:25:29 +00:00
d8616eb66a Mark nn_module params and buffers as static in dynamo (#130391)
This PR marks all buffers and parameters of an NNModule as static using the `mark_static_address` API. As a result, when tensors are passed to AOT, the `tensor_dict` metadata of placeholder nodes will contain the `static_address_type` key, indicating which graph argument positions are static for cudagraphs.
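A hedged sketch of doing this manually with the underlying API (the PR applies it automatically to NNModule parameters and buffers):

```python
import torch
import torch._dynamo

mod = torch.nn.Linear(8, 8)
for t in list(mod.parameters()) + list(mod.buffers()):
    # Mark the tensor's address as static so the corresponding placeholder
    # nodes carry static_address_type metadata for cudagraphs.
    torch._dynamo.mark_static_address(t)
```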

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130391
Approved by: https://github.com/anijain2305
2024-07-16 00:25:23 +00:00
9ab8d47f9d Constant folding for dynamic shape node (#129686)
Extend constant folding to dynamic-shape nodes; only pointwise ops and some restricted ops are supported.

We support dynamic shapes by limiting constant folding of ops that are guaranteed to have uniform values (full, pointwise ops, and views) and running these operators with tensors of shape 1. This also eliminates the possibility of memory overhead of constant folding.

Taken over from https://github.com/pytorch/pytorch/pull/128937

joint work with @imzhuhl

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129686
Approved by: https://github.com/Chillee
ghstack dependencies: #130367
2024-07-16 00:17:11 +00:00
ea4f310ff1 [Nested Tensor][easy] Add softmax backward support (#130602)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130602
Approved by: https://github.com/davidberard98, https://github.com/jbschlosser
2024-07-16 00:07:42 +00:00
d3ab8ceced [FSDP2] Allowed List[nn.Module] as arg (#127786)
This PR allows `fully_shard`'s first argument to be `List[nn.Module]` instead of strictly `nn.Module`. This allows more flexible grouping of modules/parameters for communication, which can lead to memory savings and/or more efficient communication.

**Approach**
At a high level, we can think of a model as a tree of modules. Previously, we could only select specific module nodes in this tree as representing one FSDP parameter group. With this PR, we can select a group of module nodes, effectively becoming a single super node.

To implement the runtime schedule, we define new forward hooks that run based on the following semantics:
- If a module is the first to run the pre-hook, actually run the given pre-hook. Otherwise, the pre-hook is a no-op.
- If a module is the last to run the post-hook, actually run the given post-hook. Otherwise, the post-hook is a no-op.
- First and last are determined by scoreboarding against a set of the modules.
- This set must get cleared at the end of backward in the case that >=1 module in the list is never used, in which case we still want the forward hooks to run in the next forward after this backward.

Beyond these new forward hooks, everything else is some simple generalization from `Module` to `List[Module]` or `Tuple[Module, ...]`.

**Examples**
This PR enables wrapping Llama models more efficiently by grouping the final norm and output linear together: https://github.com/pytorch/torchtitan/pull/382.
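For instance, a sketch of the Llama-style grouping described above (the composable `fully_shard` import path and the toy module are assumptions for illustration):

```python
import torch.nn as nn
from torch.distributed._composable.fsdp import fully_shard  # assumed import path

class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(16, 16) for _ in range(2)])
        self.norm = nn.LayerNorm(16)
        self.output = nn.Linear(16, 16)

model = Toy()
for block in model.layers:
    fully_shard(block)
fully_shard([model.norm, model.output])  # one group: norm + output communicated together
fully_shard(model)
```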

If at least one of the modules in the list does not run forward before backward, then there will be a warning message like:
```
1 of the 2 modules passed to fully_shard did not run forward before backward, which is error-prone since FSDP post-forward/pre-backward logic will not run for these modules. We recommend passing only modules that run forward together. Modules that did not run forward: [FSDPLinear(in_features=1, out_features=1, bias=True)]
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127786
Approved by: https://github.com/yf225, https://github.com/weifengpy
ghstack dependencies: #127773
2024-07-15 23:54:10 +00:00
b27695791e [PT-D] Relaxed contract to allow Sequence[nn.Module] (#127773)
This PR relaxes `@contract` to allow the 1st argument to be `Sequence[nn.Module]` instead of strictly `nn.Module`. This is required for the next PR, which allows `fully_shard` to take in `List[nn.Module]`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127773
Approved by: https://github.com/weifengpy
2024-07-15 23:54:10 +00:00
54a932b0ac Support for expandable segments with cuda graph trees (#128068)
This PR adds support for using expandable segments with private memory pools, which should unblock using them with CUDA graphs and cuda graph trees. Currently, the allocator silently avoids using expandable segments when allocating in a private pool because checkpoint saving/restoring does not mesh well with how we keep track of unmapped blocks.

The PR itself is pretty short, most of the logic for checkpointing and reapplying state for non-expandable segments transfers over without much work.

Expandable segments reserve a virtual address space of size equal to the amount of physical memory on the GPU. Every time we want to `malloc()` or `free()` memory in a memory pool with expandable segments turned on, we map/unmap pages of physical GPU memory under the hood to create a new block that we return to the caller. This is beneficial due to the fact that each memory pool functions as a single segment of memory with a contiguous block of memory addresses that can grow and shrink as needed, avoiding fragmentation from allocating multiple non-contiguous segments that may not be merged together.

The caching allocator handles this by creating an unmapped block for the entire reserved virtual address space at init, which is treated similarly to an unallocated block in a free pool. When callers call `malloc()`, it's split and mapped to create allocated blocks, and calling `free()` similarly caches and merges free blocks in a free pool to be used later. Expandable blocks are unmapped and returned to CUDA when they are cleaned up, or when we hit an OOM and the allocator attempts to remap cached free blocks. The code paths to map, free, and unmap blocks in expandable segments are similar to those for normal blocks and do all the same work of updating stats on memory usage, moving blocks between active and free pools, and returning memory to CUDA.

With Cuda Graph Trees and private memory pools, we need the ability to take checkpoints of the current state of the memory allocator after each graph capture as well as reapplying the state before capturing a new graph after replaying a captured graph so that the new cuda graph capture has access to the state of the allocator at the point after replaying a previously captured graph so it can reuse empty blocks and allocate new ones.

As mentioned in a comment below, memory in a private pool is cached until the private pool is destroyed, and allocations can only grow from extra graph captures; any freeing of memory would result in invalid memory addresses and would break CUDA graphs.

One implementation detail to note for expandable segments is that unmapped blocks are tracked in a member variable `unmapped` of a `BlockPool`. `unmapped` is *not* part of the checkpointed state of the caching allocator and isn't restored when reapplying checkpoints, since we never free/unmap memory back to CUDA; it persists across graph captures / replays.

Checkpointing the current state of the memory allocator works as expected with expandable segments. Checkpointing grabs the first block of every segment in the active and free pools of the private pool and traverses the linked list of blocks in the segment to capture the state of every segment, which is then saved and kept for when it is needed to be reapplied. For expandable blocks, the last block in every segment will be an unallocated unmapped block containing the remaining amount of unmapped memory at graph capture time, and this too is saved in the checkpoint.

Reapplying the checkpoints works by freeing all allocated blocks and merging them into a single block per segment, then for each segment, we manually split and allocate all blocks from the checkpoint and then free the blocks marked as unallocated in the checkpoint state. For expandable segments, we need to make some modifications to not split unmapped blocks and avoid manually mapping then freeing unmapped blocks.
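A minimal sketch of exercising both features together (an assumed workflow, not the PR's test plan): the allocator setting has to be in place before the first CUDA allocation, and cudagraph trees are reached through `torch.compile(mode="reduce-overhead")`, which captures into a private pool.

```python
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"  # must precede CUDA init

import torch

@torch.compile(mode="reduce-overhead")  # cudagraph trees record/replay this function
def step(x):
    return (x @ x).relu()

x = torch.randn(1024, 1024, device="cuda")
for _ in range(5):
    out = step(x)   # first calls warm up and record; later calls replay the captured graph
```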

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128068
Approved by: https://github.com/eqy, https://github.com/eellison
2024-07-15 23:23:23 +00:00
006020ff6e Fix the cudagraph capture of SDPA (#130712)
Summary: The scalar tensor is on CPU by default, which caused the CUDA graph capture to fail. To fix the issue, we put the scalar tensor on the GPU.

Test Plan: buck2 test 'fbcode//mode/opt' fbcode//gen_ai/llm_inference/fb/tests:test_llama2_multimodal_generator -- --exact 'gen_ai/llm_inference/fb/tests:test_llama2_multimodal_generator - gen_ai.llm_inference.fb.tests.test_llama2_multimodal_generator.TestGenerator: test_multimodal_decode_gen2'

Differential Revision: D59740639

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130712
Approved by: https://github.com/Skylion007, https://github.com/chenyang78
2024-07-15 23:05:48 +00:00
50ef099ad0 Learn a heuristic to decide whether to pad before mm (#128643)
This PR introduces AutoHeuristic, a framework to collect results from autotuning, learn a heuristic as a machine learning model (a regression tree), and then ship the learned heuristic by generating the regression tree to code.

The heuristics have been learned on artificial/random data that has been collected with the `gen_data_pad_mm.py` script. The `gen_pad_mm_a100.sh` scripts can then be used to learn a heuristic and generate it to code.

The best model is decided by doing a grid search over various values for `max_depth` and `min_samples_leaf` and choosing the model with the highest number of correct predictions on the validation set.

The heuristic can return "unsure" which means that it is not sure which choice is the best choice and as a result autotuning will happen.

On A100 only tensors where each dimension is >= 512 are considered. For smaller tensors the heuristics that I learned returned "unsure" too often.

The results for randomly generated data and huggingface look as follows:
`max_wrong_speedup` is max(`wrong_speedups`) where `wrong_speedups` contains all the speedups one could have achieved for those examples where the heuristic made a wrong choice, i.e. a `max_wrong_speedup` of 1.37 means that the heuristic selected a choice, but the other choice would have been 1.37x faster. `gman_wrong_speedup` is the geomean of `wrong_speedups`.

The heuristic is learned as a regression tree that returns higher values for better choices. The threshold decides how much better the better choice has to be for it to be returned, i.e. on A100, if the better choice is less than 1.702530x better than the other choice, "unsure" will be returned. This threshold is determined using the validation set.
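A conceptual sketch of that decision rule (an assumed interface where the learned tree yields a relative score per choice; the real heuristic is code generated from the regression tree):

```python
def choose(score_pad: float, score_no_pad: float, threshold: float = 1.702530) -> str:
    better, worse = max(score_pad, score_no_pad), min(score_pad, score_no_pad)
    if better < threshold * worse:
        return "unsure"                      # not confident enough -> fall back to autotuning
    return "pad" if score_pad > score_no_pad else "no_pad"

print(choose(1.9, 1.0))  # "pad": clears the 1.70253x margin
print(choose(1.3, 1.0))  # "unsure": below the margin, so autotune instead
```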

A100
```
       max_depth  min_samples_leaf dataset  correct  wrong  unsure  total  max_wrong_speedup  gman_wrong_speedup  threshold
15         5.0                10     train     2730      4    3023   5757           1.372220            1.193873   1.702530
16         5.0                10       val      878      0    1042   1920                NaN                 NaN   1.702530
17         5.0                10      test      925      2     993   1920           1.741708            1.354954   1.702530
18         5.0                10  hf-train       14      0      22     36                NaN                 NaN   1.702530
19         5.0                10    hf-inf        7      0       1      8                NaN                 NaN   1.702530
```

The numbers for huggingface only include tensors where each dim is >= 512. If all tensors had been included, there would have been the following number of matmuls where at least one dimension is unaligned:
A100 hf-train: 60
A100 hf-inf: 10

## Results on running huggingface locally
This only includes models where the learned heuristic made at least one decision. For the examples here, it takes around 0.25-0.3 seconds to perform autotuning for the padded and unpadded version, so each decision that the heuristic makes saves around 0.25-0.3 seconds.
#pad_mm_autotuning is the number of times autotuning happened in pad_mm and #heuristic_made_decision is the number of times the heuristic made a decision (i.e. it didn't return "unsure").

I ran huggingface locally, each model 5 times and took the median speedup and compilation_latency.
Results on huggingface training
```
                          name speedup_heuristic speedup_baseline  speedup_diff compilation_latency_heuristic compilation_latency_baseline  compilation_latency_diff  comp_latency_reduction%  #pad_mm_autotuning  #heuristic_made_decision
               BartForCausalLM   1.19 (+/- 0.00)  1.19 (+/- 0.00)         -0.00              40.33 (+/- 1.13)             40.95 (+/- 0.78)                     -0.62                     1.52                   3                         2
  BartForConditionalGeneration   1.53 (+/- 0.06)  1.47 (+/- 0.05)          0.06              81.93 (+/- 5.20)             82.23 (+/- 1.92)                     -0.30                     0.36                   3                         1
    BlenderbotSmallForCausalLM   1.86 (+/- 0.04)  1.86 (+/- 0.00)          0.00              36.76 (+/- 0.49)             37.62 (+/- 1.33)                     -0.87                     2.31                   3                         2
                     CamemBert   2.36 (+/- 0.01)  2.35 (+/- 0.01)          0.01              97.60 (+/- 1.91)             98.69 (+/- 1.35)                     -1.09                     1.11                   2                         1
                   DistillGPT2   2.57 (+/- 0.01)  2.57 (+/- 0.01)          0.00              57.33 (+/- 0.77)             58.26 (+/- 1.41)                     -0.93                     1.59                   3                         2
             PLBartForCausalLM   2.07 (+/- 0.01)  2.06 (+/- 0.01)          0.01              32.54 (+/- 0.83)             34.65 (+/- 0.71)                     -2.11                     6.10                   3                         2
PLBartForConditionalGeneration   1.87 (+/- 0.00)  1.88 (+/- 0.00)         -0.01              58.45 (+/- 1.24)             58.95 (+/- 1.92)                     -0.50                     0.85                   3                         1
            RobertaForCausalLM   2.39 (+/- 0.01)  2.40 (+/- 0.01)         -0.01              97.38 (+/- 1.52)             97.69 (+/- 1.18)                     -0.31                     0.32                   2                         1
              TrOCRForCausalLM   1.70 (+/- 0.00)  1.70 (+/- 0.00)         -0.00              44.79 (+/- 1.33)             45.25 (+/- 1.08)                     -0.46                     1.01                   3                         2

Mean difference in speedup: 0.01
Mean compilation latency saved: -0.80s
Mean compilation latency reduction: 1.68%
```

Results on huggingface inference
```
                          name speedup_heuristic speedup_baseline  speedup_diff compilation_latency_heuristic compilation_latency_baseline  compilation_latency_diff  comp_latency_reduction%  #pad_mm_autotuning  #heuristic_made_decision
               BartForCausalLM   1.11 (+/- 0.00)  1.11 (+/- 0.00)          0.00              19.02 (+/- 0.28)             19.40 (+/- 0.35)                     -0.38                     1.95                   3                         2
  BartForConditionalGeneration   1.26 (+/- 0.01)  1.23 (+/- 0.03)          0.03              36.84 (+/- 0.40)             36.55 (+/- 0.75)                      0.30                    -0.81                   3                         1
    BlenderbotSmallForCausalLM   1.87 (+/- 0.02)  1.87 (+/- 0.01)          0.00              17.53 (+/- 0.31)             18.03 (+/- 0.43)                     -0.49                     2.74                   3                         2
                   DistillGPT2   2.50 (+/- 0.02)  2.50 (+/- 0.01)          0.00              16.16 (+/- 0.29)             16.40 (+/- 0.18)                     -0.24                     1.46                   3                         2
             PLBartForCausalLM   1.93 (+/- 0.01)  1.94 (+/- 0.01)         -0.00              15.30 (+/- 0.22)             16.01 (+/- 0.71)                     -0.71                     4.43                   3                         2
PLBartForConditionalGeneration   1.98 (+/- 0.01)  1.98 (+/- 0.01)          0.00              25.90 (+/- 0.32)             26.58 (+/- 0.62)                     -0.67                     2.53                   3                         1
              TrOCRForCausalLM   1.61 (+/- 0.00)  1.62 (+/- 0.00)         -0.01              21.38 (+/- 0.37)             21.85 (+/- 0.16)                     -0.47                     2.16                   3                         2

Mean difference in speedup: 0.00
Mean compilation latency saved: -0.38s
Mean compilation latency reduction: 2.07%
```

For now, the heuristic can only be applied to decide whether to pad for mm. One could also learn heuristics for bmm and addmm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128643
Approved by: https://github.com/Chillee, https://github.com/eellison
2024-07-15 23:04:06 +00:00
9a5204dc2d [inductor] Remove "spawn" as an option for parallel compile method (#130746)
Summary: Looks like "spawn" is broken. Since we have "subprocess", I don't think we need it any more, so just remove it as an option.

Test Plan: Verified that we get: `AssertionError: Invalid start method: spawn`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130746
Approved by: https://github.com/Skylion007
2024-07-15 22:55:54 +00:00
3f031b96c6 [Fix] Correctly identifying arguments for sub-blocks with renaming logic during TorchScript to ExportedProgram conversion (#128386)
#### Issue
Fix two issues related to inputs lifting when there are sub-blocks.
* Some inputs may appear only in nested sub-blocks, which requires a recursive search to identify which arguments need to be lifted / passed into the top-level block.
* Some inputs to the sub-blocks are intermediate results, meaning their names are just numbers. This causes issues during code generation (i.e., invalid argument names). We rename those to valid names.

#### Test Plan
* `pytest test/export/test_converter.py -s -k test_convert_nn_module_with_nested_if_and_param`
* `pytest test/export/test_converter.py -s -k test_hidden_input_name`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128386
Approved by: https://github.com/angelayi
2024-07-15 22:48:13 +00:00
b893aa71ca Rename generate_numeric_debug_handle to numeric_debugger (#130590)
Summary: as titled

Test Plan:
CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130590
Approved by: https://github.com/dulinriley, https://github.com/tarun292
2024-07-15 22:42:27 +00:00
535016967a Enable UFMT on all of torch/sparse (#130545)
Partially addresses #123062
Ran lintrunner on:
- torch/sparse

Detail:
```
$ lintrunner -a --take UFMT --all-files
ok No lint issues.
Successfully applied all patches.
```

@ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130545
Approved by: https://github.com/ezyang
2024-07-15 22:35:52 +00:00
7d4f50de19 dynamo add support for defaultdict(set) (#130745)
Fixes #130554
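A minimal illustration of the pattern this enables (an assumed repro shape, based only on the linked issue number):

```python
import torch
from collections import defaultdict

@torch.compile
def f(x):
    d = defaultdict(set)       # previously unsupported inside dynamo-traced code
    d["ids"].add(0)
    d["ids"].add(1)
    return x * len(d["ids"])   # len == 2

print(f(torch.ones(3)))        # tensor([2., 2., 2.])
```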

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130745
Approved by: https://github.com/Skylion007
2024-07-15 22:23:33 +00:00
3928ca2ab6 [dynamo] update call map to allow multiple input parameters (#130748)
Fixes https://github.com/pytorch/pytorch/issues/128072.

Commandeering https://github.com/pytorch/pytorch/pull/128282 since the issue is now hi pri.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130748
Approved by: https://github.com/Skylion007, https://github.com/anijain2305
2024-07-15 22:16:49 +00:00
eqy
6f32dc0c7b Don't pass error message as places in assertGreaterAlmostEqual (#130648)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130648
Approved by: https://github.com/awgu
2024-07-15 22:14:49 +00:00
dff9d68f18 Revert "Fix names conflict when lifting (#129817)"
This reverts commit 53cf46b8c602f8512d49a5c30bca7fcf5411e25c.

Reverted https://github.com/pytorch/pytorch/pull/129817 on behalf of https://github.com/clee2000 due to Failing inductor/test_flex_attention.py https://github.com/pytorch/pytorch/actions/runs/9940532858/job/27478084137 74da2a467f Sorry for the churn, possibly a landrace? ([comment](https://github.com/pytorch/pytorch/pull/129817#issuecomment-2229519886))
2024-07-15 22:08:45 +00:00
78799e82b0 Revert "Invalidate StorageImpl instances when tensor is overwritten with cudagraphs (#125264)"
This reverts commit 1bc390c5f5ac065c156f55f4eceed267ecc67b41.

Reverted https://github.com/pytorch/pytorch/pull/125264 on behalf of https://github.com/jithunnair-amd due to test test/inductor/test_cudagraph_trees.py::CudaGraphTreeTests::test_fallback_to_eager_if_recompiling_too_many_times is failing https://github.com/pytorch/pytorch/actions/runs/9933628108/job/27477785946 1bc390c5f5. Test was introduced by fa5f572748 which is before the merge base ([comment](https://github.com/pytorch/pytorch/pull/125264#issuecomment-2229508737))
2024-07-15 21:59:46 +00:00
db3a641b71 Implement operator for micro-pipelined all-gather -> _scaled_mm (#129289)
This PR implements `torch.ops.symm_mem.fused_all_gather_scaled_matmul`. It's similar to `torch.ops.symm_mem.fused_all_gather_matmul`, except that it takes scales and calls ` _scaled_mm`.

[Profiling Trace vs. Baseline](https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_tmp0gmg1f2_) (FB internal only)

Co-authored-by: Will Feng <yf225@cornell.edu>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129289
Approved by: https://github.com/Chillee, https://github.com/weifengpy, https://github.com/drisspg
2024-07-15 21:48:35 +00:00
77fb5b0e23 [c10d] a new Pytorch API (split_group) to create a process group (#130507)
This is the implementation following the RFC: https://github.com/pytorch/pytorch/issues/130407

ncclCommSplit
Summary:
In current Pytorch/c10d, the new_group API is used to create a new
process group from the default pg.  When device_id is specified in
init_process_group and nccl is used as the backend, the new_group call
will use ncclCommSplit to create the nccl communicators to save
communicator resources. It has a few drawbacks:

Redundant calls
Suppose the default group has 256 ranks and we need 32 child PGs, each with 8 ranks. In this case, each rank needs to call new_group and ncclCommSplit 32 times because of how the new_group API is implemented and because of the collective requirement of ncclCommSplit. For a specific global rank, 31 of those ncclCommSplit calls would be no-color splits and only 1 would be a colored split. With the proposed split_group API, only 1 call of split_group/ncclCommSplit is needed per rank in the above example.

new_group can only split from default_pg
Ideally, a new pg should be able to be split from any pg

With the new split_group API, users can create new PGs using ncclCommSplit with fewer calls and initialize the PG eagerly. This is also useful when creating many P2P communicators.
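A hypothetical usage sketch based on the RFC (the exact signature lives in the linked PR; here we assume split_group takes the parent PG plus a full partition of its ranks and returns only the subgroup containing the calling rank):

```python
import os
import torch
import torch.distributed as dist

local_rank = int(os.environ["LOCAL_RANK"])          # run under torchrun
dist.init_process_group("nccl", device_id=torch.device("cuda", local_rank))

# 8 ranks -> 2 child PGs of 4 ranks each, with a single eager ncclCommSplit per rank
split_ranks = [[0, 1, 2, 3], [4, 5, 6, 7]]
child_pg = dist.split_group(parent_pg=dist.group.WORLD, split_ranks=split_ranks)
```
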
Test Plan:
New UTs:
e.g., python test/distributed/test_c10d_nccl.py -k
test_comm_split_group_larger_scale
Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130507
Approved by: https://github.com/wconstab
2024-07-15 21:26:43 +00:00
ac3e2cb64a [BE] Delete unused -rg.yml workflow (#130759)
As well as `_linux-test-label.yml` as ARC experiment is dead
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130759
Approved by: https://github.com/ZainRizvi
2024-07-15 20:41:59 +00:00
ee6f0ab190 [DeviceMesh][Reland] Only include the real thread_id in DeviceMesh hash under threaded backend (#130495) (#130685)
Summary:
As a follow-up to https://github.com/pytorch/pytorch/pull/130454, users are hitting the cross-mesh operation error because the DeviceMesh thread ID differs between the saved and the loaded DTensor.

This is a hot fix to only include the real thread_id in the DeviceMesh hash under the threaded backend, and to set it to None in all other cases.

As a follow-up, we need to look at the following test failures to better root-cause DeviceMesh failures related to MTPG when thread_id is not included in the hash.
```
test/distributed/_composable/fsdp/test_fully_shard_training.py::TestFullyShardRegisteredParams::test_param_registration_after_forward
test/distributed/_tensor/test_dtensor_ops.py::TestDTensorOpsCPU::test_dtensor_op_db_column_stack_cpu_float32
```

We also add an is_initialized() check, since APF has a test that mocks the backend without a PG initialized and the check avoids that test failure; in real use cases a PG is initialized before the get_backend() check. Not sure if we want to add this specifically for the test, but we temporarily add it to unblock APF conveyor runs.

Test Plan:
```
[irisz@devgpu051.cln3 /data/users/irisz/fbsource/fbcode (38e4a0a3b)]$ buck2 test 'fbcode//mode/opt' fbcode//apf/distributed/tests:pipeline_parallel_test_cpu -- --exact 'apf/distributed/tests:pipeline_parallel_test_cpu - apf.distributed.tests.pipeline_parallel_test_cpu.PipelineParallelContextTestCPU: test_stage_pg_creation_with_different_backends'
```

Reviewed By: gag1jain

Differential Revision: D59725924

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130685
Approved by: https://github.com/gag1jain
2024-07-15 20:05:26 +00:00
27322355de Added some more documentation to block mask creation (#130649)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130649
Approved by: https://github.com/drisspg
ghstack dependencies: #130626
2024-07-15 19:48:42 +00:00
0e79e1f958 [NJT+SDPA]Fix flash_attention output when batch_size=1 and seq_len=1 (#130652)
Fixes issue #130196.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130652
Approved by: https://github.com/Skylion007, https://github.com/drisspg, https://github.com/jbschlosser
2024-07-15 19:44:04 +00:00
074a5c0c9b Revert "[BE] bump optree version to 0.12.1 (#130139)"
This reverts commit 8fcb156e8b5697a8f292db6db2a1803c5f4ce2d7.

Reverted https://github.com/pytorch/pytorch/pull/130139 on behalf of https://github.com/clee2000 due to broke inductor/test_torchinductor_codegen_dynamic_shapes.py and test_sympy_utils.py 8fcb156e8b ([comment](https://github.com/pytorch/pytorch/pull/130139#issuecomment-2229248447))
2024-07-15 19:42:11 +00:00
f1456c74a0 Fix mkl-static issue for Windows. (#130697)
Background:
We found the pytorch Windows release/2.4 performance regression: https://github.com/pytorch/pytorch/issues/130619

After some debugging, I found that the PyTorch Windows static MKL build options are wrong:
<img width="1049" alt="image" src="https://github.com/user-attachments/assets/38692142-bfca-4c98-8092-6e105c82bb13">
1. The thread lib is wrong.
2. The `openmp` lib and config are missing.
> Debug history: https://github.com/pytorch/pytorch/issues/130619#issuecomment-2226782504 and https://github.com/pytorch/pytorch/issues/130619#issuecomment-2226418611

This PR will fix `mkl-static` build options issue.
<img width="863" alt="image" src="https://github.com/user-attachments/assets/834f6cee-7e6d-4d74-b2bc-8a270f05e429">

Reference:
<img width="482" alt="image" src="https://github.com/user-attachments/assets/8184dadb-f230-4062-a49f-51df1d7285f5">

https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl-link-line-advisor.html#gs.c6izlg

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130697
Approved by: https://github.com/jgong5, https://github.com/atalman
2024-07-15 19:28:11 +00:00
a7cfe40c9b [dtensor] Improve from_local API with run_check (#130289)
as titled, this PR:
1. switch `run_check` to default to False and add extra docs/comments about the correctness guarantee. Since so many call sites forget to pass run_check=False, we simply switch to not performing the metadata check by default and make the documentation explicit (see the sketch after this list)
2. Implement metadata check by picking up the changes from https://github.com/pytorch/pytorch/pull/115229
3. Improve the from_local documentation
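A single-rank sketch of the new default (assuming the import paths of this release; the metadata check now has to be requested explicitly via `run_check=True`):

```python
import os
import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed._tensor import DTensor, Shard

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)   # 1-rank world keeps the example self-contained

mesh = init_device_mesh("cpu", (1,))
local = torch.arange(4.0)

dt = DTensor.from_local(local, mesh, [Shard(0)])                          # no metadata check by default now
dt_checked = DTensor.from_local(local, mesh, [Shard(0)], run_check=True)  # opt back in to the check

dist.destroy_process_group()
```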

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130289
Approved by: https://github.com/awgu, https://github.com/wz337
ghstack dependencies: #130286, #130287, #130288
2024-07-15 18:52:55 +00:00
3342f3aa4e [dtensor] simplify sdpa strategies (#130288)
as titled, this PR simplifies both flash and efficient attention op
strategy generation paths

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130288
Approved by: https://github.com/tianyu-l
ghstack dependencies: #130286, #130287
2024-07-15 18:52:55 +00:00
7d82dc2c23 [dtensor] slice_backward to use op strategy (#130287)
as titled. slice_backward currently forwards the sharding unconditionally, which is mathematically wrong. This PR switches it to an op strategy and only allows replication.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130287
Approved by: https://github.com/awgu
ghstack dependencies: #130286
2024-07-15 18:52:49 +00:00
53cf46b8c6 Fix names conflict when lifting (#129817)
## Bug description
When pending args that may need to be lifted [here](58f346c874/torch/_dynamo/output_graph.py (L1866)) have the same base name, like `contiguous` and `contiguous_1`, the call into [create_graph_input](58f346c874/torch/_dynamo/output_graph.py (L2081)) can end up creating a name ([here](58f346c874/torch/fx/graph.py (L1008))) that overwrites an arg to lift, producing a wrong graph output.

## Reproducing
Below is a reproducible example:
```python
import logging
from typing import List

import torch
from functorch.compile import aot_module_simplified, make_boxed_func

@torch.library.custom_op("mylib::somefunc_forward", mutates_args=())
def somefunc_forward(
    input_: torch.Tensor,
    weight: torch.Tensor,
    shape: List[int],
) -> torch.Tensor:
    return torch.ones_like(input_)

@somefunc_forward.register_fake
def _(input_, shape, weight):
    return torch.empty_like(input_)

@torch.library.custom_op("mylib::somefunc_backward", mutates_args=())
def somefunc_backward(
    grad_output: torch.Tensor,
    input_: torch.Tensor,
    weight: torch.Tensor,
    shape: List[int],
) -> torch.Tensor:
    print(f"backward.{grad_output.shape=}")
    print(f"backward.{input_.shape=}")
    print(f"backward.{weight.shape=}")
    print(f"backward.{shape=}")
    assert list(weight.shape) == shape
    return torch.ones_like(weight)

@somefunc_backward.register_fake
def _(grad_output, input_, weight, shape):
    return torch.empty_like(weight)

def a_func(grad_output, input_, weight_, shape):
    return torch.ones_like(input_.sum() * weight_)

class SomeFunc(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input, weight, normalized_shape):
        ctx.normalized_shape = normalized_shape
        input_ = input.contiguous()
        weight_ = weight.contiguous()
        output = somefunc_forward(input_, weight_, ctx.normalized_shape)
        ctx.save_for_backward(input_, weight_)
        return output

    @staticmethod
    def backward(ctx, grad_output):
        input_, weight_ = ctx.saved_tensors
        # grad_weight = a_func(grad_output, input_, weight_, ctx.normalized_shape)
        grad_weight = somefunc_backward(
            grad_output.contiguous(),
            input_,
            weight_,
            ctx.normalized_shape,
        )
        return None, grad_weight, None

class MyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.ones(7))

    def forward(self, x):
        return SomeFunc.apply(x, self.weight, [7])

model = MyModel()
torch._logging.set_logs(dynamo=logging.DEBUG, aot=logging.DEBUG, graph_code=True)

def aot_print_backend(gm, sample_inputs):
    # Forward compiler capture
    def fw(gm, sample_inputs):
        print(f"----- fw")
        gm.print_readable()
        return make_boxed_func(gm.forward)

    # Backward compiler capture
    def bw(gm, sample_inputs):
        print(f"----- bw")
        gm.print_readable()
        return make_boxed_func(gm.forward)

    # Call AOTAutograd
    gm_forward = aot_module_simplified(
        gm, sample_inputs, fw_compiler=fw, bw_compiler=bw
    )
    return gm_forward

model = torch.compile(
    model,
    backend=aot_print_backend,
    dynamic=False,
)
out = model(torch.rand((128, 4, 7)))
out.mean().backward()
```

The log shows the calls into `create_graph_input`:
```log
V0629 02:08:46.839914 8200981504 torch/_dynamo/output_graph.py:2042] [0/0] create_graph_input contiguous (none)
V0629 02:08:46.839998 8200981504 torch/_dynamo/output_graph.py:2042] [0/0] create_graph_input contiguous_1 (none)
```

The generated backward graph looks like:
```log
class GraphModule(torch.nn.Module):
    def forward(self, function_ctx, somefunc_forward_default: "f32[128, 4, 7]", contiguous: "f32[128, 4, 7]", contiguous_1: "f32[7]"):
        contiguous_1 = contiguous
        contiguous_2 = contiguous_1

        # No stacktrace found for following nodes
        _set_grad_enabled = torch._C._set_grad_enabled(False)

         # File: /Users/bytedance/testtorch/test_custom_op_bug.py:61 in backward, code: grad_output.contiguous(),
        contiguous: "f32[128, 4, 7]" = somefunc_forward_default.contiguous();  somefunc_forward_default = None

         # File: /opt/tiger/pytorch/torch/_library/custom_ops.py:506 in __call__, code: return self._opoverload(*args, **kwargs)
        somefunc_backward_default: "f32[7]" = torch.ops.mylib.somefunc_backward.default(contiguous, contiguous_1, contiguous_2, [7]);  contiguous = contiguous_1 = contiguous_2 = None

        # No stacktrace found for following nodes
        _set_grad_enabled_1 = torch._C._set_grad_enabled(True)
        return (None, somefunc_backward_default)
```

The original `somefunc_backward` takes the inputs `grad_output`, `input_`, `weight` and `shape`, where `weight` should have shape `torch.Size([7])`. However, in the graph, `contiguous_1` and `contiguous_2` are assigned from `contiguous`, which leads to the assertion failure I added in `somefunc_backward`.

## Environment
```log
Collecting environment information...
PyTorch version: 2.5.0a0+git0b7e8df
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: macOS 14.5 (arm64)
GCC version: Could not collect
Clang version: 15.0.0 (clang-1500.3.9.4)
CMake version: version 3.26.4
Libc version: N/A

Python version: 3.9.19 (main, May  6 2024, 14:39:30)  [Clang 14.0.6 ] (64-bit runtime)
Python platform: macOS-14.5-arm64-arm-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Apple M3 Pro

Versions of relevant libraries:
[pip3] numpy==2.0.0
[pip3] optree==0.11.0
[pip3] torch==2.5.0a0+git0b7e8df
[pip3] torchgraph==0.0.1
[conda] numpy                     2.0.0                    pypi_0    pypi
[conda] optree                    0.11.0                   pypi_0    pypi
[conda] torch                     2.5.0a0+git0b7e8df           dev_0    <develop>
[conda] torchgraph                0.0.1                     dev_0    <develop>
```

## How to fix?

I put in a naive fix that adds the potential args to lift into `used_names`. It touches private variables; I will fix that if this issue makes sense to you.

@zou3519 @oulgen

Co-authored-by: rzou <zou3519@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129817
Approved by: https://github.com/zou3519
2024-07-15 18:49:12 +00:00
b4b64f76e5 Ensure tensors devices match on torch.index_put batch rule impl (#130479)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130479
Approved by: https://github.com/zou3519
2024-07-15 18:16:31 +00:00
00d71b3e86 Tweak tolerances for test_vjp_linalg_tensorsolve_cuda_float32 to pass in Windows / debug builds (#130449)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130449
Approved by: https://github.com/zou3519, https://github.com/malfet
ghstack dependencies: #128238, #130360
2024-07-15 17:35:34 +00:00
9e161af179 Revert "Increase tolerance for tensorsolve tests (#130620)"
This reverts commit 103b6ccab2bd025dfacc8c8a91f71f3d68e50426.

Reverted https://github.com/pytorch/pytorch/pull/130620 on behalf of https://github.com/clee2000 due to didn't work, test is still failing on this PR and on main, reverting in favor of https://github.com/pytorch/pytorch/pull/130449 instead ([comment](https://github.com/pytorch/pytorch/pull/130620#issuecomment-2229036418))
2024-07-15 17:35:04 +00:00
8fcb156e8b [BE] bump optree version to 0.12.1 (#130139)
0.12.0 Major Updates:

- Add context manager to temporarily set the dictionary sorting mode
- Add accessor APIs
- Use `stable` tag for `pybind11` for Python 3.13 support
- Fix potential segmentation fault for pickling support

0.12.1 Updates:

- Fix warning regression during import when launch with strict warning filters

Closes #130155
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130139
Approved by: https://github.com/zou3519
2024-07-15 17:27:07 +00:00
1e897a0ca4 Revert "Fix names conflict when lifting (#129817)"
This reverts commit 74da2a467f166e00316aee82ba24835ca563ed87.

Reverted https://github.com/pytorch/pytorch/pull/129817 on behalf of https://github.com/clee2000 due to broke dynamo/test_inline_inbuilt_nn_modules.py https://github.com/pytorch/pytorch/actions/runs/9940532858/job/27461141919 74da2a467f.  Test passed on PR, possibly a landrace? ([comment](https://github.com/pytorch/pytorch/pull/129817#issuecomment-2228993570))
2024-07-15 17:09:52 +00:00
0099e15b47 Also put unbacked symbols in symbol_to_node in split_module pass (#130535)
This is not a complete fix, but it is a simple one; the full fix is tracked in https://github.com/pytorch/pytorch/issues/130534.

Internal xref:
https://fb.workplace.com/groups/6829516587176185/posts/7510238679103969/

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130535
Approved by: https://github.com/malfet
2024-07-15 16:56:01 +00:00
ca2d424c6e Tighten torch.library.infer_schema input types (#130705)
Made the following changes:
- mutates_args is now keyword-only and mandatory. This is to align with
  torch.library.custom_op (which makes it mandatory because it's easy to
  miss)
- op_name is now keyword-only. This helps the readability of the API
- updated all usages of infer_schema

This change is not BC-breaking because we introduced
torch.library.infer_schema a couple of days ago.
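A small example of the updated calling convention (the printed schema string is approximate):

```python
import torch
from torch.library import infer_schema

def mysin(x: torch.Tensor) -> torch.Tensor:
    return torch.sin(x)

# mutates_args (and op_name) are keyword-only; mutates_args is now mandatory
print(infer_schema(mysin, mutates_args=()))   # e.g. "(Tensor x) -> Tensor"
```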

Test Plan:
- tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130705
Approved by: https://github.com/yushangdi
2024-07-15 16:43:57 +00:00
9df4bc6a0d Revert "Constant folding for dynamic shape node (#129686)"
This reverts commit b7d287fbec0a05a3d4c9524006e6bfd1de6a71a0.

Reverted https://github.com/pytorch/pytorch/pull/129686 on behalf of https://github.com/atalman due to Failing internally.  Test: https://github.com/pytorch/ao/blob/main/test/prototype/mx_formats/test_mx_linear.py ([comment](https://github.com/pytorch/pytorch/pull/129686#issuecomment-2228755295))
2024-07-15 15:19:24 +00:00
7cd48df2da Refine the logic of device construction when only device index is given (#129119)
# Motivation
Before this PR, device construction returned a `cuda`-type device when only a device index was given. It returns a `PrivateUse1`-type device instead if a `PrivateUse1` backend is registered.
```bash
>>> import torch
>>> device = torch.device(0)
>>> device.type
'cuda'
>>> a = torch.tensor([1, 2])
>>> b = a.to(0)
>>> b
tensor([1, 2], device='cuda:0')
```
It works well on a CUDA GPU, but it raises a confusing error when running on XPU.
```bash
>>> import torch
>>> device = torch.device(0)
>>> device.type
'cuda'
>>> a = torch.tensor([1, 2])
>>> b = a.to(0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/xxx/pytorch/torch/cuda/__init__.py", line 302, in _lazy_init
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled
```
With this PR, the logic is refined to use the currently available device type instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129119
Approved by: https://github.com/albanD, https://github.com/gujinghui, https://github.com/EikanWang
ghstack dependencies: #129463, #129205, #129363
2024-07-15 14:34:29 +00:00
9cae2160f5 Introduce the concept of Accelerators to PyTorch doc (#129363)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129363
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/albanD
ghstack dependencies: #129463, #129205
2024-07-15 14:24:46 +00:00
74da2a467f Fix names conflict when lifting (#129817)
## Bug description
When pending args that may need to be lifted [here](58f346c874/torch/_dynamo/output_graph.py (L1866)) have the same base name, like `contiguous` and `contiguous_1`, the call into [create_graph_input](58f346c874/torch/_dynamo/output_graph.py (L2081)) can end up creating a name ([here](58f346c874/torch/fx/graph.py (L1008))) that overwrites an arg to lift, producing a wrong graph output.

## Reproducing
Below is a reproducible example:
```python
import logging
from typing import List

import torch
from functorch.compile import aot_module_simplified, make_boxed_func

@torch.library.custom_op("mylib::somefunc_forward", mutates_args=())
def somefunc_forward(
    input_: torch.Tensor,
    weight: torch.Tensor,
    shape: List[int],
) -> torch.Tensor:
    return torch.ones_like(input_)

@somefunc_forward.register_fake
def _(input_, shape, weight):
    return torch.empty_like(input_)

@torch.library.custom_op("mylib::somefunc_backward", mutates_args=())
def somefunc_backward(
    grad_output: torch.Tensor,
    input_: torch.Tensor,
    weight: torch.Tensor,
    shape: List[int],
) -> torch.Tensor:
    print(f"backward.{grad_output.shape=}")
    print(f"backward.{input_.shape=}")
    print(f"backward.{weight.shape=}")
    print(f"backward.{shape=}")
    assert list(weight.shape) == shape
    return torch.ones_like(weight)

@somefunc_backward.register_fake
def _(grad_output, input_, weight, shape):
    return torch.empty_like(weight)

def a_func(grad_output, input_, weight_, shape):
    return torch.ones_like(input_.sum() * weight_)

class SomeFunc(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input, weight, normalized_shape):
        ctx.normalized_shape = normalized_shape
        input_ = input.contiguous()
        weight_ = weight.contiguous()
        output = somefunc_forward(input_, weight_, ctx.normalized_shape)
        ctx.save_for_backward(input_, weight_)
        return output

    @staticmethod
    def backward(ctx, grad_output):
        input_, weight_ = ctx.saved_tensors
        # grad_weight = a_func(grad_output, input_, weight_, ctx.normalized_shape)
        grad_weight = somefunc_backward(
            grad_output.contiguous(),
            input_,
            weight_,
            ctx.normalized_shape,
        )
        return None, grad_weight, None

class MyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.ones(7))

    def forward(self, x):
        return SomeFunc.apply(x, self.weight, [7])

model = MyModel()
torch._logging.set_logs(dynamo=logging.DEBUG, aot=logging.DEBUG, graph_code=True)

def aot_print_backend(gm, sample_inputs):
    # Forward compiler capture
    def fw(gm, sample_inputs):
        print(f"----- fw")
        gm.print_readable()
        return make_boxed_func(gm.forward)

    # Backward compiler capture
    def bw(gm, sample_inputs):
        print(f"----- bw")
        gm.print_readable()
        return make_boxed_func(gm.forward)

    # Call AOTAutograd
    gm_forward = aot_module_simplified(
        gm, sample_inputs, fw_compiler=fw, bw_compiler=bw
    )
    return gm_forward

model = torch.compile(
    model,
    backend=aot_print_backend,
    dynamic=False,
)
out = model(torch.rand((128, 4, 7)))
out.mean().backward()
```

The log shows the calls into `create_graph_input`:
```log
V0629 02:08:46.839914 8200981504 torch/_dynamo/output_graph.py:2042] [0/0] create_graph_input contiguous (none)
V0629 02:08:46.839998 8200981504 torch/_dynamo/output_graph.py:2042] [0/0] create_graph_input contiguous_1 (none)
```

The generated backward graph looks like:
```log
class GraphModule(torch.nn.Module):
    def forward(self, function_ctx, somefunc_forward_default: "f32[128, 4, 7]", contiguous: "f32[128, 4, 7]", contiguous_1: "f32[7]"):
        contiguous_1 = contiguous
        contiguous_2 = contiguous_1

        # No stacktrace found for following nodes
        _set_grad_enabled = torch._C._set_grad_enabled(False)

         # File: /Users/bytedance/testtorch/test_custom_op_bug.py:61 in backward, code: grad_output.contiguous(),
        contiguous: "f32[128, 4, 7]" = somefunc_forward_default.contiguous();  somefunc_forward_default = None

         # File: /opt/tiger/pytorch/torch/_library/custom_ops.py:506 in __call__, code: return self._opoverload(*args, **kwargs)
        somefunc_backward_default: "f32[7]" = torch.ops.mylib.somefunc_backward.default(contiguous, contiguous_1, contiguous_2, [7]);  contiguous = contiguous_1 = contiguous_2 = None

        # No stacktrace found for following nodes
        _set_grad_enabled_1 = torch._C._set_grad_enabled(True)
        return (None, somefunc_backward_default)
```

The original `somefunc_backward` takes the inputs `grad_output`, `input_`, `weight` and `shape`, where `weight` should have shape `torch.Size([7])`. However, in the graph, `contiguous_1` and `contiguous_2` are assigned from `contiguous`, which leads to the assertion failure I added in `somefunc_backward`.

## Environment
```log
Collecting environment information...
PyTorch version: 2.5.0a0+git0b7e8df
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: macOS 14.5 (arm64)
GCC version: Could not collect
Clang version: 15.0.0 (clang-1500.3.9.4)
CMake version: version 3.26.4
Libc version: N/A

Python version: 3.9.19 (main, May  6 2024, 14:39:30)  [Clang 14.0.6 ] (64-bit runtime)
Python platform: macOS-14.5-arm64-arm-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Apple M3 Pro

Versions of relevant libraries:
[pip3] numpy==2.0.0
[pip3] optree==0.11.0
[pip3] torch==2.5.0a0+git0b7e8df
[pip3] torchgraph==0.0.1
[conda] numpy                     2.0.0                    pypi_0    pypi
[conda] optree                    0.11.0                   pypi_0    pypi
[conda] torch                     2.5.0a0+git0b7e8df           dev_0    <develop>
[conda] torchgraph                0.0.1                     dev_0    <develop>
```

## How to fix?

I put in a naive fix that adds the potential args to lift into `used_names`. It touches private variables; I will fix that if this issue makes sense to you.

@zou3519 @oulgen

Co-authored-by: rzou <zou3519@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129817
Approved by: https://github.com/zou3519
2024-07-15 13:41:46 +00:00
ee039c0614 [custom_op] triton_op API V0 (#130637)
This is the initial version of an API to create custom operators whose
implementations are backed by triton kernels. While user-defined triton
kernels work out-of-the-box with triton kernels, you may wish to
construct a custom operator if you need to compose with other PyTorch
subsystems, like Tensor subclasses or vmap.

I'm hoping to get design feedback on this and ship it so that we can
begin experimenting with customers.

Test Plan:
- new tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130637
Approved by: https://github.com/albanD
2024-07-15 13:00:54 +00:00
cyy
6beec34b1c [structural binding][9/N] Replace std::tie with structural binding (#130404)
Follows  #130544

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130404
Approved by: https://github.com/janeyx99
2024-07-15 10:14:52 +00:00
ac28ae18dc [BE][Ez]: Update pybind11 submodule to v2.13.1 (#129827)
Updates pybind11 submodule to v2.13.1
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129827
Approved by: https://github.com/XuehaiPan, https://github.com/atalman, https://github.com/albanD
2024-07-15 08:58:56 +00:00
1d983bbb28 [easy][inline-inbuilt-nn-module] Update test output (#130681)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130681
Approved by: https://github.com/zou3519, https://github.com/jansel
ghstack dependencies: #130654, #130420
2024-07-15 06:19:53 +00:00
1a266def4f [dynamo][unsoundness but very controlled] Skip guards on inbuilt nn module hooks (#130420)
Reduces the guard overhead from 2.1k units to 1k units. Compared to no-inlining (0.4k units), this reduces the slowdown from 5x to 2.5x.

This introduces unsoundness, but only for hooks for inbuilt nn modules (user defined nn module hooks are fine).

Each builtin nn module adds 4 empty ordered dict checks in the check_fn. This blows up for models with large numbers of builtin nn modules. With this PR, we skip those guards. There is no other easy way I can think of right now to control the guard overhead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130420
Approved by: https://github.com/jansel
ghstack dependencies: #130654
2024-07-15 06:19:53 +00:00
dc7725cc16 [halide-backend] Random number generation (#130211)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130211
Approved by: https://github.com/jansel
2024-07-15 05:03:24 +00:00
1bc390c5f5 Invalidate StorageImpl instances when tensor is overwritten with cudagraphs (#125264)
Fixes #104435

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125264
Approved by: https://github.com/ezyang
2024-07-15 04:16:17 +00:00
a3c0bab502 [inductor] [cpp] use non-temporal tile load for A (#129455)
Use non-temporal tile load `_tile_stream_loadd` for A to keep B in L1.
Verified AMP with static and dynamic shapes on CPUs with AMX support; there is no obvious end-to-end performance boost (and no regression either). We expect a performance gain when adding https://github.com/pytorch/pytorch/pull/129348 (also in this ghstack) on top of this PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129455
Approved by: https://github.com/jgong5
2024-07-15 04:07:29 +00:00
c547b2e871 Fix python detection in cuda.cmake (#130651)
If Python package has not been detected previously, call it here

This fixes regression introduced by https://github.com/pytorch/pytorch/pull/128801 that results in annoying, but harmless warning reported in https://github.com/pytorch/pytorch/issues/129777

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130651
Approved by: https://github.com/Skylion007
2024-07-15 03:45:31 +00:00
c0897919da Revert " [5/N] Change static functions in headers to inline (#130673)"
This reverts commit 4410c44ae6fd8eb36f2358ac76f7d988ca7537c5.

Reverted https://github.com/pytorch/pytorch/pull/130673 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it causes CUDA build 12.1/12.4 to timeout in trunk, I am not sure what I am looking at yet, so attempt to revert to see if it fixes trunk.  Plz keep in mind that a cancelled job is counted as a failure ([comment](https://github.com/pytorch/pytorch/pull/130673#issuecomment-2227641368))
2024-07-15 03:27:11 +00:00
cyy
28f6ae2718 [9/N] Replace c10::optional with std::optional (#130674)
Follows  #130509

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130674
Approved by: https://github.com/Skylion007
2024-07-15 00:48:43 +00:00
774ca93fd2 Added zb1p schedule (#130210)
Adds the ZB1P schedule in https://arxiv.org/pdf/2401.10241.

The ZB2P schedule might not be zero bubble when pp_group_size > 4. Proof:

![image](https://github.com/pytorch/pytorch/assets/13212964/fac4a738-c323-47c7-bcaa-c6cdd1cf20d7)

Since ZB2P generates longer schedules in some cases, and we might need a fault-tolerance all-reduce collective at the end of every iteration for Llama 4, we are holding off on implementing the fancier ZBV schedule for now unless it turns out to be useful.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130210
Approved by: https://github.com/H-Huang
2024-07-14 17:32:59 +00:00
cyy
5fe9515d35 [structural binding][8/N] Replace std::tie with structural binding (#130544)
Follows #130216
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130544
Approved by: https://github.com/ezyang
2024-07-14 13:23:20 +00:00
81322aee74 [Inductor][CPP] Support more than one LocalBuffer (#129121)
**Summary**
Support more than one local buffer in an outer-loop fused node, and also the case where multiple global buffers share the same local buffer.

**TestPlan**
```
python -u -m pytest -s -v inductor/test_cpu_repro.py -k test_two_local_buffers_in_outer_loop_fusion
python -u -m pytest -s -v inductor/test_cpu_repro.py -k test_share_local_buffers_in_outer_loop_fusion
```

**Next Step**

- [✓] Support more than one Local Buffer/Global Buffer

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129121
Approved by: https://github.com/jgong5, https://github.com/peterbell10
ghstack dependencies: #126967
2024-07-14 11:31:14 +00:00
adaa0fea5a [Inductor][CPP] Enable Local Buffer for Outer loop fusion (#126967)
**Summary**
Currently, the Inductor CPP backend [generated code](https://gist.github.com/leslie-fang-intel/98f91d43dabed581a1ffe23daf133a65#file-bf16-softmax-generated-code-wo-local-buffer-py) for `Softmax` with BF16 data type is significantly slower than the [ATen Implementation](9a2beb862d/aten/src/ATen/native/cpu/SoftMaxKernel.cpp (L149)). Upon comparing the generated code with ATen, the performance bottleneck appears to be related to the usage of [local buffer in ATen](9a2beb862d/aten/src/ATen/native/cpu/SoftMaxKernel.cpp (L159-L160)).

In the current implementation, Inductor uses the output buffer of the kernel group args to store and load temporary results (such as `exp`), since this buffer corresponds to a `SchedulerNode`. Each thread accesses a portion of this output buffer via indexing. However, since this buffer (take `exp` as an example) is only used internally within the decomposed `softmax`, it can be replaced with a thread-local buffer, similar to ATen's approach.

In this PR, we introduce the `LocalBuffer` optimization. Following this enhancement, the [new generated Inductor code with local buffer](https://gist.github.com/leslie-fang-intel/98f91d43dabed581a1ffe23daf133a65#file-bf16-softmax-generated-code-w-local-buffer-py) for BF16 `Softmax` demonstrates significantly improved performance. Running the benchmark [here](https://gist.github.com/leslie-fang-intel/37d81441237b5139c8295f5e6c4cd31a) to test this BF16 `Softmax` case on an 8480 Xeon server shows similar performance between the Inductor CPP backend and the ATen implementation.

**TestPlan**
```
python -u -m pytest -s -v inductor/test_cpu_repro.py -k test_local_buffer_in_outer_loop_fusion
```

**Next Step**

- [ ] Support more than one Local Buffer/Global Buffer

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126967
Approved by: https://github.com/jgong5, https://github.com/peterbell10
2024-07-14 11:28:10 +00:00
dcaa111dc8 support intersection by polyfill (#130672)
Fixes https://github.com/pytorch/pytorch/issues/130557

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130672
Approved by: https://github.com/anijain2305
2024-07-14 10:44:26 +00:00
4d7bf72d93 [BE][Easy] fix ruff rule needless-bool (SIM103) (#130206)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130206
Approved by: https://github.com/malfet
2024-07-14 08:17:52 +00:00
fa5f572748 [cudagraph] fallback to eager if re-record too many times (#129349)
Summary:
CUDAGraph Trees previously relied on the assumption that static inputs (parameters and buffers) do not change tensor addresses across multiple function invocations. This assumption can be used to reduce the number of tensor copies and improve performance. We also use `check_static_inputs_are_stable()` to check whether this assumption holds at runtime.

While this assumption holds in most cases, we recently observed a few cases where it is not valid:
- [Inline inbuilt nn modules](https://github.com/pytorch/pytorch/pull/126822): the same function (a nn module) is used in multiple places and different parameters and buffers are passed to this function with different tensor addresses
- Some user code changes tensor addresses of parameters/buffers. See [internal example]( https://www.internalfb.com/mlhub/pipelines/runs/mast/sw-935450288-OfflineTraining_08ba1cf0?job_attempt=1&version=0&env=PRODUCTION)
- Compiled Autograd may also pass parameters/buffers with different tensor addresses across runs.

The previous PR [#126822](https://github.com/pytorch/pytorch/pull/126822) (by @mlazos) allows detecting static tensor address changes at runtime and re-recording a cudagraph when that happens. However, if the same function is re-recorded too many times, it may introduce large overhead and hurt performance. This PR adds `torch._inductor.config.triton.cudagraph_max_recording` (=5) to fall back to eager if a function has been recorded more than `cudagraph_max_recording` times for a specific node in the CUDAGraph Trees.
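A sketch of adjusting the knob named above (assuming it is reachable at the config path this PR describes):

```python
import torch._inductor.config as inductor_config

# Allow a few more re-records per CUDAGraph-Trees node before falling back to eager
# (the PR's stated default is 5).
inductor_config.triton.cudagraph_max_recording = 8
```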

A summary on how static tensor address changes are handled now:

- For each child node, check the assumption via `check_invariants`. If this holds, execute node with the assumption.
- If the assumption does not hold for all child nodes, re-record if the function_id has not been recorded too many times for the current_node.
- If the function_id has been re-recorded too many times, fall back to the eager function and emit a warning.

Test Plan: CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129349
Approved by: https://github.com/eellison
2024-07-14 04:17:24 +00:00
cyy
4410c44ae6 [5/N] Change static functions in headers to inline (#130673)
Follows #128286

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130673
Approved by: https://github.com/ezyang
2024-07-14 03:15:28 +00:00
6f275ae4d0 Add kwinputs to Kineto Traces (#130373)
Summary: On the autograd side of things, we are currently saving the kwinputs but we aren't doing anything with them on the profiler side. This diff enables the use of the kwinputs for both FunctionEvents and Chrome Traces.

Test Plan: Added unit tests for both Chrome traces and FunctionEvents. Used RecordFunctionFast to test kwinputs, since that test already had kwargs being passed in but not tested.

Differential Revision: D59472345

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130373
Approved by: https://github.com/davidberard98
2024-07-14 00:40:59 +00:00
f9f85bfc0b [Inductor] FlexAttention supports partial masking (#130415) (#130626)
This is the new version of https://github.com/pytorch/pytorch/pull/130415

Updated test script: https://gist.github.com/yanboliang/7c34a82df611d4ea8869cb9e041bfbfc
Updated perf numbers:
```
(pt) [ybliang@devgpu002.ash8 ~/local/debug]$ CUDA_VISIBLE_DEVICES=4 python debug7.py
fwd speedup: 0.7166695598192317
bwd speedup: 0.7142133867805904
(pt) [ybliang@devgpu002.ash8 ~/local/debug]$ CUDA_VISIBLE_DEVICES=4 python debug7.py --partial-mask
fwd speedup: 0.8428246087169973
bwd speedup: 0.8486261278030254
```
Approved by: https://github.com/Chillee

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130626
Approved by: https://github.com/drisspg, https://github.com/yanboliang
2024-07-14 00:37:26 +00:00
cbb7e26acd [3.13, dynamo] fix jump target offset calculation (#130458)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130458
Approved by: https://github.com/jansel
ghstack dependencies: #130383, #130384, #130385
2024-07-13 23:32:06 +00:00
0b5792c0ae [3.13, dynamo] fix NULL ordering in symbolic_convert CALL (#130385)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130385
Approved by: https://github.com/jansel
ghstack dependencies: #130383, #130384
2024-07-13 23:32:05 +00:00
87b406d7e5 [3.13, dynamo] codegen TO_BOOL before conditional jump (#130384)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130384
Approved by: https://github.com/jansel
ghstack dependencies: #130383
2024-07-13 23:32:02 +00:00
92ac9ee83c [3.13, dynamo] swap null and pop_null in codegen (#130383)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130383
Approved by: https://github.com/jansel
2024-07-13 23:31:57 +00:00
97cfc65dbc Back out "[DeviceMesh] Only include the real thread_id in DeviceMesh hash under threaded backend (#130495)" (#130676)
Summary:
Original commit changeset: 80c2ca639146

Original Phabricator Diff: D59612200

Test Plan: buck2 test 'fbcode//mode/opt' fbcode//apf/distributed/tests:pipeline_parallel_test_cpu -- --exact 'apf/distributed/tests:pipeline_parallel_test_cpu - apf.distributed.tests.pipeline_parallel_test_cpu.PipelineParallelContextTestCPU: test_stage_pg_creation_with_different_backends'

Differential Revision: D59719562

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130676
Approved by: https://github.com/xunnanxu
2024-07-13 23:19:22 +00:00
e5de25896f Fixed CUDA randint generation for large ranges. (#126066)
Fixes #125224

For large ranges, calls to CUDA `randint` use a different `unroll_factor` to generate random ints. This `unroll_factor` was not considered correctly in the calculation of the Philox offsets. Thus, some of the random states were reused, resulting in lower entropy (see #125224).

This also affects multiple other random functions, such as `torch.rand` and `torch.randn`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126066
Approved by: https://github.com/eqy, https://github.com/lezcano
2024-07-13 21:42:27 +00:00
1f162a5fce Revert "[Inductor][CPP] Support vectorization of remainder (#129849)"
This reverts commit 5bc18ec0a181fac0994522fefaf664f917d64b86.

Reverted https://github.com/pytorch/pytorch/pull/129849 on behalf of https://github.com/izaitsevfb due to fails the compilation of executorch benchmark internally ([comment](https://github.com/pytorch/pytorch/pull/129849#issuecomment-2227054413))
2024-07-13 19:28:34 +00:00
8714b7fc69 [dynamo][cpp-guards] Use dict tags to skip guards on immutable dict getitems (#130654)
Reduces the guard overhead from 3.7k units to 2.1k units.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130654
Approved by: https://github.com/jansel
2024-07-13 15:31:10 +00:00
cyy
7c83f5f7d5 [8/N] Replace c10::optional with std::optional (#130509)
Follows #130510

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130509
Approved by: https://github.com/ezyang
2024-07-13 13:05:36 +00:00
0effcb70ef Revert "[ONNX] Remove beartype usage (#130484)"
This reverts commit f44739cf42e22a569bd1bdb0c113f8a069c17a41.

Reverted https://github.com/pytorch/pytorch/pull/130484 on behalf of https://github.com/huydhn due to Sorry for reverting your change but those failures show up in trunk after the commit landed f44739cf42, I am reverting it to see if it fix trunk ([comment](https://github.com/pytorch/pytorch/pull/130484#issuecomment-2226812311))
2024-07-13 07:52:59 +00:00
567482973d typing fake_tensor.py (#128041)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128041
Approved by: https://github.com/eellison
ghstack dependencies: #129182
2024-07-13 06:07:40 +00:00
1ad0f38a37 Fix IMAs in FlexAttention + autotuning (#130352)
# Summary

Makes error message better for non divisible sequence lengths.

Update: this PR was blocked due to two IMAs.
- The first is that when the kv indices end up being an 'arange' (i.e., there are no sparse blocks), we end up loading off of kv_indices + 1.
- The second I don't really have a clear answer for. We were hitting an IMA here:
9f401187c7/torch/_inductor/kernel/flex_attention.py (L846)
I noticed that for our inputs (2048 and q_blocksize = 128) we were again exactly at 16. Something felt fishy. I suspect we launch one extra sparse_q block, but why only during autotuning...

### Repro:
https://gist.github.com/drisspg/f312a66426f3440b7756c6c0cc037f4c
### After this change:
```
========= COMPUTE-SANITIZER
AUTOTUNE flex_attention(1x1x2048x64, 1x1x2048x64, 1x1x2048x64, 1x1x2048, 1x1x16, 1x1x16x16)
  triton_flex_attention_0 2.1118 ms 100.0% BLOCK_DMODEL=64, BLOCK_M=128, BLOCK_N=128, OUTPUT_LOGSUMEXP=True, PRESCALE_QK=False, ROWS_GUARANTEED_SAFE=False, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, num_stages=3, num_warps=4
  triton_flex_attention_3 2.4306 ms 86.9% BLOCK_DMODEL=64, BLOCK_M=64, BLOCK_N=128, OUTPUT_LOGSUMEXP=True, PRESCALE_QK=False, ROWS_GUARANTEED_SAFE=False, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, num_stages=3, num_warps=4
  triton_flex_attention_1 2.5729 ms 82.1% BLOCK_DMODEL=64, BLOCK_M=128, BLOCK_N=64, OUTPUT_LOGSUMEXP=True, PRESCALE_QK=False, ROWS_GUARANTEED_SAFE=False, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, num_stages=3, num_warps=4
  triton_flex_attention_4 2.8035 ms 75.3% BLOCK_DMODEL=64, BLOCK_M=64, BLOCK_N=64, OUTPUT_LOGSUMEXP=True, PRESCALE_QK=False, ROWS_GUARANTEED_SAFE=False, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, num_stages=3, num_warps=4
  triton_flex_attention_2 2.8837 ms 73.2% BLOCK_DMODEL=64, BLOCK_M=128, BLOCK_N=128, OUTPUT_LOGSUMEXP=True, PRESCALE_QK=False, ROWS_GUARANTEED_SAFE=False, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, num_stages=3, num_warps=4
SingleProcess AUTOTUNE benchmarking takes 0.7225 seconds and 1.5218 seconds precompiling
AUTOTUNE flex_attention_backward(1x1x2048x64, 1x1x2048x64, 1x1x2048x64, 1x1x2048, 1x1x2048, 1x1x2048x64, 1x1x2048x64, 1x1x2048x64, 1x1x16, 1x1x16x16, 1x1x16, 1x1x16x16)
  triton_flex_attention_backward_30 2.7763 ms 100.0% BLOCK_DMODEL=64, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, PRESCALE_QK=False, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, num_stages=1, num_warps=4
  triton_flex_attention_backward_15 3.1404 ms 88.4% BLOCK_DMODEL=64, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, PRESCALE_QK=False, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, num_stages=3, num_warps=4
  triton_flex_attention_backward_14 3.2604 ms 85.2% BLOCK_DMODEL=64, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, PRESCALE_QK=False, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, num_stages=1, num_warps=4
  triton_flex_attention_backward_7 3.4176 ms 81.2% BLOCK_DMODEL=64, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, PRESCALE_QK=False, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, num_stages=3, num_warps=4
  triton_flex_attention_backward_8 3.4182 ms 81.2% BLOCK_DMODEL=64, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, PRESCALE_QK=False, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, num_stages=4, num_warps=4
  triton_flex_attention_backward_34 3.4939 ms 79.5% BLOCK_DMODEL=64, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, PRESCALE_QK=False, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, num_stages=1, num_warps=8
  triton_flex_attention_backward_6 3.6517 ms 76.0% BLOCK_DMODEL=64, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, PRESCALE_QK=False, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, num_stages=1, num_warps=4
  triton_flex_attention_backward_26 3.7000 ms 75.0% BLOCK_DMODEL=64, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, PRESCALE_QK=False, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, num_stages=1, num_warps=8
  triton_flex_attention_backward_22 4.0120 ms 69.2% BLOCK_DMODEL=64, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, PRESCALE_QK=False, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, num_stages=1, num_warps=4
  triton_flex_attention_backward_18 4.5052 ms 61.6% BLOCK_DMODEL=64, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, PRESCALE_QK=False, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, num_stages=1, num_warps=8
SingleProcess AUTOTUNE benchmarking takes 6.6558 seconds and 6.3567 seconds precompiling
torch.Size([1, 1, 2048, 64])
Test completed successfully!
========= ERROR SUMMARY: 0 errors
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130352
Approved by: https://github.com/Skylion007, https://github.com/Chillee
2024-07-13 05:27:39 +00:00
c03e667276 [Inductor][PatternMatcher] Always prevent match across mutations (#130584)
Preventing match across mutations should always be the safe thing to do. This will be especially important for Traceable FSDP2 because in that case we do have mutation ops (`.set_` and `.resize_(0)`) in the middle of the graph for both joint-graph and post-grad graph, so making sure the pattern matcher passes work well with middle-of-graph mutation ops is important.

Q: Why can't we move these mutation ops to the end of graph, to make pass writing easier?
A: We attempted to do that in https://github.com/pytorch/pytorch/pull/129852, but the custom FX passes (in `torch/_functorch/_aot_autograd/fx_passes.py`) for the re-functionalization are complicated to maintain, and the changes to the partitioner (in `torch/_functorch/partitioners.py`) also feel hacky. Hence we want to preserve these mutation ops in the middle of the graph to avoid the complexity.

Test commands:
- `pytest -rA test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_uint4x2_mixed_mm`
- `pytest -rA test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_serialized_patterns_up_to_date`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130584
Approved by: https://github.com/jansel
2024-07-13 03:39:21 +00:00
3710a79622 Flex Attention HOP: Add support for flex decoding (#129415)
# Flex Decoding
tl;dr This PR adds `flex_decoding` kernel to higher-order-op: `flex_attention` as the backend for multi-head attention decoding.

The higher-order op `flex_attention` was introduced in https://github.com/pytorch/pytorch/pull/121845 to accept a user-defined score modification callable (`score_mod`) and, through `torch.compile`, create an efficient fused flash attention kernel instantiation. The `flex_attention` kernel is efficient for long-query (>512 tokens) attention. This PR introduces the `flex_decoding` kernel as an alternative backend for the `flex_attention` HOP to handle LLM inference, where short queries (<32 tokens) attend to long key/value sequences.

### Details

LLM decoding iteratively attends each newly generated token (query length = 1) to a long key/value context (up to 132k). The `flex_attention` kernel only parallelizes attention along the query length (M), batch size (B), and number of heads (H) dimensions. LLM decoding lacks enough parallelism in the M dimension to fill up all SMs on modern GPUs.

`flex_decoding` adds parallelization along the key/value sequence length (N). The key/value cache of a single head is split into multiple blocks and the query tokens attend to them in parallel. The results for the same head are then reduced across KV blocks to generate a global output.

## Examples

Consider a Group Query Attention (GQA) decoding case, where a query token with 16 query heads (Hq) attends to 2 kv heads (Hkv). Assume a batch size of 2 (B=2) and a kv cache length of 4096 (N=4096). The attention kernel iteratively attends to the newly generated query token (Mq = 1).

We transform this problem into a Multi-Head Attention (MHA) problem by assuming a query length equal to the number of query heads per kv head, i.e. M = Hq//Hkv.
The inputs to the `flex_attention` HOP are thus a query of shape (B=2, H=Hkv=2, M=Hq//Hkv=8, D=64) and key/value of shape (B=2, H=Hkv=2, N=4096, D=64), which lead to an intermediate attention score matrix of shape (2, 2, 8, 4096) and an output of shape (2, 2, 8, 64).

```Python
import torch
from torch.nn.attention._flex_attention import _flex_attention as flex_attention

torch.manual_seed(0)

# Lets create some input tensors
# query of shape (B, Hkv, Hq//Hkv, D)
# key/value of shape (B, Hkv, N, D)
query = torch.randn(2, 2, 8, 64, device="cuda", dtype=torch.float32)
key = torch.randn(2, 2, 4096, 64, device="cuda", dtype=torch.float32)
value = torch.randn(2, 2, 4096, 64, device="cuda", dtype=torch.float32)

# Lets create a new score_modification checkerboard.
def checkerboard(score, batch, head, token_q, token_kv):
    score = torch.where(torch.abs(token_kv - token_q) == 1, score * 0.5, score)
    score = torch.where(torch.abs(token_kv - token_q) == 2, score * 2.0, score)
    return score

# Lets call flex_attention with this new score modification for decoding.
# The flex_attention HOP will choose flex_decoding as its backend since our query length (M) is only 8.
output = flex_attention(query, key, value, score_mod=checkerboard)

compiled_flex_attention = torch.compile(flex_attention)
out_compiled = compiled_flex_attention (query, key, value, score_mod=checkerboard)

torch.testing.assert_close(output, out_compiled, atol=2e-2, rtol=2e-2)
```

## Future Plans
- This PR does not implement a load mask for the score_mod function. This means that if the score_mod function takes a captured buffer along the M dimension, it must be padded to a q length of 16, or to the next power of 2 of the query length if q_len > 16. For example:
```python
q_scale = torch.randn(Hq//Hkv, device="cuda")
q_scale = torch.nn.functional.pad(q_scale, (0, 16-Hq//Hkv)) # Pad captured buffer
def bias_mod(score, batch, head, q, kv):
    score = score + q_scale[q]
    return score
```
- The backward path for short queries (<128 tokens) currently does not work because the `flex_attention_backward` kernel lacks mask support and only accepts query lengths that are a multiple of 128.
- Dynamic shapes and max-autotune are currently not working
- Add block sparse mask support (#129216 is a draft for flex_attention kernel)
- Add explicit GQA support. (#130076 is a draft for GQA support on flex_attention kernel)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129415
Approved by: https://github.com/Chillee
2024-07-13 00:41:48 +00:00
f44739cf42 [ONNX] Remove beartype usage (#130484)
beartype has served us well in identifying type errors and ensuring we call internal functions with the correct arguments (thanks!). However, the value of having beartype is diminished because of the following:

1. When beartype improved its support for Dict[] type checking, it discovered previously uncaught typing mistakes in some functions. This caused the exporter to fail with newer versions of beartype where it used to succeed. Since we cannot fix PyTorch and release a new version just because of this, it creates confusion for users that have beartype in their environment when they use torch.onnx.
2. beartype adds an additional call line in the traceback, which makes the already thick dynamo stack even larger, affecting readability when users diagnose errors with the traceback.
3. Since the typing annotations need to be evaluated, we cannot use new syntaxes like `|` because we need to maintain compatibility with Python 3.8. We don't want to wait for PyTorch to take py310 as the lowest supported Python before using the new typing syntaxes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130484
Approved by: https://github.com/titaiwangms
2024-07-13 00:08:25 +00:00
a7f54c7f8a [dynamo] add meta fn for aten.kthvalue.default (#130562)
I saw
```
torch._dynamo.exc.Unsupported: unsupported operator: aten.kthvalue.default
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130562
Approved by: https://github.com/jingsh, https://github.com/zou3519
2024-07-12 23:48:31 +00:00
634b62f111 typing proxy_tensor.py (#129182)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129182
Approved by: https://github.com/Chillee
2024-07-12 23:17:09 +00:00
ea78b0c177 Revert "Fix static py::object dangling pointer with py::gil_safe_call_once_and_store (#130341)"
This reverts commit a17d1e5322229a31f868d98987996a04736933a6.

Reverted https://github.com/pytorch/pytorch/pull/130341 on behalf of https://github.com/izaitsevfb due to internal needs pybind update ([comment](https://github.com/pytorch/pytorch/pull/130341#issuecomment-2226499397))
2024-07-12 23:07:37 +00:00
f422027fce fix torch.linalg.lstsq input check (#130612)
Fixes [#117236 ](https://github.com/pytorch/pytorch/issues/117236)
The current case does not meet the vector scenario requirements, and it lacks sufficient checks (relying solely on `dim_diff` is insufficient). Consequently, it triggers an internal assertion error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130612
Approved by: https://github.com/lezcano
2024-07-12 23:06:52 +00:00
06ebf87a1e Fix and improve reorder_compute_for_overlap (#130573)
Since the raise_comms and sink_waits passes are also scheduling-based, we can now implement reorder_compute_for_overlap as an optional step in the same pass. Merging them into the same pass greatly simplifies the logic and makes it easier to reason about the synergy between different passes.

- The unit tests are now fixed and re-enabled.
- Verified that the pass produces good scheduling w/ Llama3 70B in torchtitan (the scheduling was sub-optimal before this PR).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130573
Approved by: https://github.com/Chillee
ghstack dependencies: #129980
2024-07-12 22:25:49 +00:00
619029e892 [easy] Small rendering fix in Tensor.module_load doc (#130489)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130489
Approved by: https://github.com/janeyx99
2024-07-12 22:12:53 +00:00
95046c86e3 [HOP] add HOP x torch_dispatch interaction (#130606)
This involved beefing up the Python dispatcher to handle torch_dispatch.
Given a HOP and a torch_dispatch Tensor subclass:
- the HOP will show up in the subclass's `__torch_dispatch__`
- you can also use HOP.py_impl to register a rule for the HOP x
  subclass interaction
- (coming soon) we'll offer a way to open register HOP x subclass
  interaction without needing to touch the subclass's
  `__torch_dispatch__` or the HOP's .py_impl.

Test Plan:
- new tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130606
Approved by: https://github.com/ydwu4
2024-07-12 21:51:36 +00:00
f093cd4086 Fix custom ops warning during export (#130623)
Fixes https://github.com/pytorch/pytorch/issues/130588

The problem was we were warning on all custom ops, not just ones marked
as CompositeImplicitAutograd. This PR changes the warning to just warn
on CompositeImplicitAutograd ops.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130623
Approved by: https://github.com/williamwen42
2024-07-12 21:34:29 +00:00
7c289c2a5c Add torch.serialization.safe_globals context manager (#127939)
Add context manager mentioned in https://github.com/pytorch/pytorch/pull/127808#pullrequestreview-2096298486
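A minimal usage sketch; the class and file name below are hypothetical, and `safe_globals` temporarily allow-lists globals for `weights_only` loading:

```python
import torch
from torch.serialization import safe_globals

class MyConfig:  # hypothetical user-defined class stored inside a checkpoint
    pass

# Allow-list MyConfig only for the duration of this load.
with safe_globals([MyConfig]):
    ckpt = torch.load("checkpoint.pt", weights_only=True)
```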

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127939
Approved by: https://github.com/albanD
2024-07-12 20:38:43 +00:00
f0d7164cb9 Revert "[inductor] switch AotCodeCompiler to new cpp_builder (#130127)"
This reverts commit 2abc7cc21b8a215f000ac037c316ca178e9ade81.

Reverted https://github.com/pytorch/pytorch/pull/130127 on behalf of https://github.com/izaitsevfb due to breaks meta-internal tests ([comment](https://github.com/pytorch/pytorch/pull/130127#issuecomment-2226313943))
2024-07-12 20:36:00 +00:00
103b6ccab2 Increase tolerance for tensorsolve tests (#130620)
Fix current failure in periodic trunk https://hud.pytorch.org/failure?name=periodic%20%2F%20linux-focal-cuda11.8-py3.10-gcc9-debug%20%2F%20test%20(default%2C%204%2C%205%2C%20linux.4xlarge.nvidia.gpu)&jobName=undefined&failureCaptures=%5B%22functorch%2Ftest_ops.py%3A%3ATestOperatorsCUDA%3A%3Atest_vjp_linalg_tensorsolve_cuda_float32%22%5D

Since it appeared with https://github.com/pytorch/pytorch/pull/128238, which only updates the random seed for the test, I expect this is just bad luck of the draw. Thus, increasing the tolerance like we do for other tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130620
Approved by: https://github.com/lezcano, https://github.com/atalman, https://github.com/malfet
2024-07-12 20:08:18 +00:00
af4da0799c [PyTorch] Half: don't disable direct conversion to/from float on mobile (#130465)
As far as I can tell, `FCVT` (https://developer.arm.com/documentation/ddi0602/2024-06/SIMD-FP-Instructions/FCVT--Floating-point-convert-precision--scalar--?lang=en)
is part of the base aarch64 instruction set, so it should work fine on mobile.

Differential Revision: [D59589733](https://our.internmc.facebook.com/intern/diff/D59589733/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130465
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-07-12 19:46:30 +00:00
d727e2f2d1 add total wall time in calculate_time_spent (#130611)
Fixes #ISSUE_NUMBER

The actual wall time is fwd_entire_frame_time + bwd_inductor_compile. `calculate_time_spent` is accessed internally for monitoring use (https://fburl.com/code/iiurj5m6). However, summing the values up loses the fwd/bwd breakdown.

This PR adds a new key of `total_wall_time` without affecting dynamo counters.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130611
Approved by: https://github.com/oulgen, https://github.com/Yuzhen11
2024-07-12 19:32:44 +00:00
eqy
60fc01d0ab [CUDA] Don't double-destroy CUDA graph when debug dump is used (#130401)
Repro from @eellison

Could have sworn we had another PR with this fix floating around somewhere but I couldn't find it...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130401
Approved by: https://github.com/Skylion007, https://github.com/eellison
2024-07-12 18:57:07 +00:00
43b98fa521 Add debug repr to SymNode (#129925)
Fixes #129403

Create a separate printing function to debug SymNode, since we can't easily change `__repr__`, which is used by GraphModule.recompile() to create a pythonic version of a graph.

This is my first contribution; please let me know if there is anything I should look into in further detail.

Thank you for your guidance! 🙏 I hope to contribute more in the future!

@aorenste
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129925
Approved by: https://github.com/aorenste
2024-07-12 18:31:23 +00:00
2c4303c1d1 [ROCm] [BUGFIX] Re-enable rocm-specific tuning parameters (#130617)
Small bug fix - https://github.com/pytorch/pytorch/pull/124592 replaced the torch.version.hip with device_props but made a mistake in porting the original logic.

The original code was:
```
if torch.version.hip is not None:
```

Which was incorrectly replaced by:
```
if self.device_props.type != "hip":
```
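Presumably the intended port keeps the original semantics, i.e. something like the following (inferred from the description above, not quoted from the PR):
```
if self.device_props.type == "hip":
```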

Perhaps we need to write some unit tests here in the future.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130617
Approved by: https://github.com/masnesral
2024-07-12 18:29:59 +00:00
741c1710e8 [cond] inlining into one of the branches when pred is a python constant (#130493)
Reland https://github.com/pytorch/pytorch/pull/128709.

When the input predicate is a Python constant, we specialize into one of the branches and warn users that torch.cond is not preserving the dynamism. The previous behavior was that we baked True/False into the cond operator, which can be confusing. In this PR, we change it to specialize into one of the branches when the predicate is constant.

We additionally change the naming of the cond operator to the default one without overriding its name. This allows better testing on the de-serialized graph.
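A rough sketch of the behavior described above; the shapes, threshold, and compile entry point are illustrative assumptions:

```python
import torch

def true_fn(x):
    return x.sin()

def false_fn(x):
    return x.cos()

def f(x):
    # With static shapes this predicate is a plain Python bool, so per this PR
    # compile/export warns and specializes f into the taken branch instead of
    # baking a constant True/False into a cond() node.
    pred = x.shape[0] > 4
    return torch.cond(pred, true_fn, false_fn, (x,))

out = torch.compile(f)(torch.randn(8, 3))
```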

Test Plan:
The predicate in some existing tests is the result of a shape comparison. When no dynamic shape is involved, the predicate is a Python bool. To fix them, we either change the predicate to be some data-dependent tensor or change the test to check that cond is specialized into one of the branches.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130493
Approved by: https://github.com/BoyuanFeng
2024-07-12 18:02:09 +00:00
0bf9a091ec [torchbind] add tracing_mode support (#129586)
Sometimes it can be difficult to write a fake class, e.g. when the original implementation uses third-party libraries, or users are certain that the class is safe to trace with the real object.

This PR allows users to specify their intention by implementing a "safe_to_trace_with_real_obj" method on their script class.

Test Plan:
`pytest test/export/test_torchbind.py -k safe`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129586
Approved by: https://github.com/zou3519
2024-07-12 18:01:47 +00:00
c3e77d144e [3.12, 3.13, dynamo] simplified construction for frame f_locals/localsplus (#129185)
Construct frame localsplus in 3.12+ using our own simplified way rather than copypasting from CPython.

This is necessary for 3.13 since we can no longer generate frame `f_locals` before executing the interpreter frame.
We also enable this for 3.12 since the `f_locals` construction between 3.12 and 3.13 is the same, so we can test for correctness with 3.12.

This is also one of the first steps to completing https://github.com/pytorch/pytorch/issues/93753 - we will implement simplified f_locals generation of previous Python versions in the future.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129185
Approved by: https://github.com/jansel
2024-07-12 17:56:38 +00:00
b0a597fcb4 Fix #121334: graph break on constant method call (#130158)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130158
Approved by: https://github.com/lezcano
2024-07-12 17:34:46 +00:00
4865c6425c Add new control plane handler (#129712)
Summary:
Add a new control plane handler to retrieve flight recorder data as
JSON.

Test Plan:
Unit tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129712
Approved by: https://github.com/wconstab
2024-07-12 17:32:01 +00:00
55dc82bef9 [EZ] Make test_pytree_inputs actually run tests on CUDA (#130593)
Right now it only runs on CPU even when `self.device` is set to CUDA

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130593
Approved by: https://github.com/angelayi
2024-07-12 17:17:28 +00:00
988ed4d5db [export] clean up allow_complex_guards_as_runtime_asserts flag (#130596)
Summary: removes underscore, cleans up dead code in DimConstraints

Test Plan: existing export tests

Reviewed By: angelayi

Differential Revision: D59612746

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130596
Approved by: https://github.com/angelayi
2024-07-12 17:17:11 +00:00
dafef3ff35 [CP] Make CP loss curve on par with TP (#129515)
Summary:
This PR changes two implementations to make the CP (CP8) loss curve be on par with TP (TP8).

1. Make key and value contiguous before doing ring attention. It is unclear why this is a requirement, as SDPA does not have it.

2. Use the out, grad_out, and softmax_lse passed by autograd to do the backward. This implementation is similar to the one in Transformer Engine. The original implementation reruns SDPA to get the output and logsumexp and uses those recalculated results to infer the corrected softmax_lse, but that does not give better accuracy or a better loss curve; instead, it converges more slowly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129515
Approved by: https://github.com/d4l3k, https://github.com/wanchaol
ghstack dependencies: #129512, #129514
2024-07-12 16:55:28 +00:00
c35f12c67c [EZ] Add formatting changes to .git-blame-ignore-revs (#130627)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130627
Approved by: https://github.com/izaitsevfb, https://github.com/clee2000
2024-07-12 16:37:46 +00:00
22fd89c904 [TEST][Inductor] Fix scaled_mm call (#130582)
`_scaled_mm` no longer returns `amax` (see #128683)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130582
Approved by: https://github.com/drisspg
2024-07-12 16:25:18 +00:00
34e57025e1 Add unsigned int types to torch/types.h (#130616)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130616
Approved by: https://github.com/NicolasHug, https://github.com/albanD
2024-07-12 16:24:29 +00:00
2b1df24877 Revert "Make hashing a SymInt raise an error again (#130548)"
This reverts commit 3100455b8eeebdfbc3428ff9454579ac50666faf.

Reverted https://github.com/pytorch/pytorch/pull/130548 on behalf of https://github.com/clee2000 due to broke inductor/test_triton_kernels.py https://github.com/pytorch/pytorch/actions/runs/9908970127/job/27377960411 3100455b8e. Not run on PR due to bad TD ([comment](https://github.com/pytorch/pytorch/pull/130548#issuecomment-2225912018))
2024-07-12 16:20:12 +00:00
2a1f22e57f Change BN to eval before QAT Convert phase (#130598)
**Summary**
In the QAT convert phase, we fold BN into conv and apply DCE to the BN node. We should change `torch.ops.aten._native_batch_norm_legit.default` to `torch.ops.aten._native_batch_norm_legit_no_training.default` for a safe DCE.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130598
Approved by: https://github.com/jgong5, https://github.com/yushangdi
2024-07-12 16:03:56 +00:00
18418a7dbb [ONNX] Fix torch_onnx patch accuracy bug in benchmark (#130586)
The ONNX related compilers have another route of accuracy check, and this PR brings torch_onnx compiler to the right measurement.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130586
Approved by: https://github.com/justinchuby
2024-07-12 15:47:59 +00:00
e5657024b5 Fix loss_parallel with BF16 logits (#130550)
Fixes #130549

This PR uses the specific dtype for the `grad_input` buffer and fixes the error

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130550
Approved by: https://github.com/tianyu-l
2024-07-12 15:47:38 +00:00
ea4b80e6d6 [FX][export] strict DCE pass, check schema for node impurity (#130552)
Fixes the failure in `test/export/test_export_training_ir_to_run_decomp.py` caused by dead code elimination removing a node with side effects.

For background, in export, we may want to export higher-level IRs that are not functional, so we need to check for side effects more carefully.

 A call_function node is impure if it has at least one mutable argument.
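A rough sketch of that schema-based check; the helper name is hypothetical and the real pass may differ:

```python
import torch

def call_function_is_impure(node: torch.fx.Node) -> bool:
    """Treat a call_function node as impure if its schema marks any argument as mutated."""
    if node.op != "call_function":
        return False
    schema = getattr(node.target, "_schema", None)
    if schema is None:
        return False
    return any(
        arg.alias_info is not None and arg.alias_info.is_write
        for arg in schema.arguments
    )
```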

Fixed the tests below:

test_to_module_with_mutated_buffer_multiple_update_sub_later
test_export_input_mutation_static_shape
test_buffer_util

Another attempt at modifying the original DCE pass was made in PR #130395, but it breaks some other tests, so here we add a flag and use it for export only.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130552
Approved by: https://github.com/pianpwk
2024-07-12 15:43:27 +00:00
febadda107 [MPS] Fix torch.[all|any] for 5+D tensors (#130542)
Work around a bug in `reductionAndWithTensor:` that kills the app with the
following assert if a 5+D tensor is passed as input
```
Assertion failed: (0 <= mpsAxis && mpsAxis < 4 && "Runtime canonicalization must simplify reduction axes to minor 4 dimensions."), function encodeNDArrayOp, file GPUReductionOps.mm, line 76.
```
by reshaping the tensor to a 2D/3D one before running the reduction.
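A minimal repro sketch of the previously failing case; the sizes are illustrative and an MPS device is required:

```python
import torch

x = torch.randn(2, 2, 2, 2, 2, device="mps")   # 5D input used to trip the assert
print(torch.all(x > 0), torch.any(x > 0))      # now handled by reshaping internally
```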

Refactored common code into `all_any_common_impl_mps`, as both `reductionOrWithTensor:` and `reductionAndWithTensor:` suffer from the same issue.

Enabled `test_reduction_ops_5D` and added a regression test to it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130542
Approved by: https://github.com/Skylion007, https://github.com/albanD
ghstack dependencies: #130541
2024-07-12 15:06:22 +00:00
d443fbc025 [inductor] Cache precompilation functions based on configs (#130350)
Summary: If we attempt to precompile sets of different choices (e.g. Triton vs Cutlass) that have the same key, the cached pool of futures doesn't work, since it only includes the first set of configs.  Add the config's hashes to the key to avoid this problem.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130350
Approved by: https://github.com/eellison
2024-07-12 14:21:49 +00:00
9c69684af8 [custom_ops] expose torch.library.register_torch_dispatch (#130261)
This is the API for defining the interaction between a torch_dispatch
class and a custom op. Taking API bikeshedding.
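A rough usage sketch; the op name, the mode class, and the argument names of the registered rule below are assumptions for illustration, not taken from the PR:

```python
import torch
from torch.utils._python_dispatch import TorchDispatchMode

@torch.library.custom_op("mylib::foo", mutates_args=())
def foo(x: torch.Tensor) -> torch.Tensor:
    return x.clone()

class MyMode(TorchDispatchMode):
    def __torch_dispatch__(self, func, types, args=(), kwargs=None):
        return func(*args, **(kwargs or {}))

# Define how mylib::foo behaves when called under MyMode
# (the rule signature here is assumed).
@torch.library.register_torch_dispatch("mylib::foo", MyMode)
def _(mode, func, types, args, kwargs):
    (x,) = args
    return x + 1
```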

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130261
Approved by: https://github.com/albanD
ghstack dependencies: #130064
2024-07-12 14:13:01 +00:00
ba941769b5 Add API for open registration between operators and subclasses (and modes) (#130064)
We add torch.library.Library._register_torch_dispatch_rule. Here, a user
can provide us a specific rule to run for a specific
(torch_dispatch_class, operator) pair. The motivation is that a user
might want to extend a subclass/mode but may not have access to the
source code of the subclass/mode.

I'll make this public in a follow-up PR if we think the approach and API
is good.

Keep in mind that many subclasses will likely deliver their own open
registration solution (DTensor has register_sharding_prop_rule and NJT
has register_jagged_op); _register_torch_dispatch_rule is meant as a
catch-all open registration mechanism for when the subclass hasn't
provided anything more specific.

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130064
Approved by: https://github.com/albanD
2024-07-12 14:13:01 +00:00
ae3ac9cb64 Only test _is_param if doing instance check on Parameter base (#130578)
Fixes https://github.com/pytorch/pytorch/issues/111348

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130578
Approved by: https://github.com/Skylion007
2024-07-12 13:55:13 +00:00
6f54e961ea Add trace_shape_events artifact tracing for ShapeEnv events (#130473)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130473
Approved by: https://github.com/lezcano
2024-07-12 13:50:25 +00:00
3100455b8e Make hashing a SymInt raise an error again (#130548)
See https://github.com/pytorch/pytorch/issues/130547

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130548
Approved by: https://github.com/Skylion007, https://github.com/albanD
2024-07-12 13:49:56 +00:00
b75cc70875 [Pipelining] add looped schedules to fsdp/ddp test (#130563)
It feels like an oversight that these were not tested, especially since the test case already handles multi-schedules specially, yet no multi-schedules were actually being tested.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130563
Approved by: https://github.com/H-Huang
2024-07-12 13:39:47 +00:00
da030e7add Revert "[Inductor] FlexAttention supports partial masking (#130415)"
This reverts commit 207564bab1c4fe42750931765734ee604032fb69.

Reverted https://github.com/pytorch/pytorch/pull/130415 on behalf of https://github.com/janeyx99 due to Windows trunk test_proxy_tensor test failures look relevant  ([comment](https://github.com/pytorch/pytorch/pull/130415#issuecomment-2225575622))
2024-07-12 13:20:18 +00:00
207564bab1 [Inductor] FlexAttention supports partial masking (#130415)
This is the new version of #130235

Updated test script: https://gist.github.com/yanboliang/7c34a82df611d4ea8869cb9e041bfbfc
Updated perf numbers:
```
(pt) [ybliang@devgpu002.ash8 ~/local/debug]$ CUDA_VISIBLE_DEVICES=4 python debug7.py
fwd speedup: 0.7166695598192317
bwd speedup: 0.7142133867805904
(pt) [ybliang@devgpu002.ash8 ~/local/debug]$ CUDA_VISIBLE_DEVICES=4 python debug7.py --partial-mask
fwd speedup: 0.8428246087169973
bwd speedup: 0.8486261278030254
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130415
Approved by: https://github.com/Chillee
2024-07-12 07:19:28 +00:00
e568c91a7b [CP] Fix the incorrect ring schedule in the fwd and bwd (#129514)
Summary:
1. The argument order for all_to_all_single is "block, output_split_size, input_split_sizes, pg".
2. Uses the correct ring order for the grad_kv.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129514
Approved by: https://github.com/d4l3k, https://github.com/drisspg, https://github.com/wanchaol
ghstack dependencies: #129512
2024-07-12 07:05:36 +00:00
0d8dedb01b [dtensor] Add dtensor to TORCH_LOGS (#129512)
Summary:
Add basic logging for the DTensor dispatcher.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129512
Approved by: https://github.com/wanchaol, https://github.com/XilunWu
2024-07-12 06:50:53 +00:00
b6215f44ef DCP checkpoint_dist_client integration (#130452)
Summary:
Integrate scope tracking with `checkpoint/fb/logging_handlers.py`.

Add a map of uuid -> tracker context manager. when logging handler has following events:
* `start`: create scope_tracker object, call `__enter__`, add to map with uuid
* `end`: retrieve scope_tracker object by uuid, call `__exit__`.
* `exception`: retrieve scope_tracker object by uuid, call `__exit__` with current exception info.

Test Plan:
Test with bento notebook (attached).
with a runtime_error in finish_checkpoint method.

scuba records:
https://fburl.com/scuba/workflow_signpost/ddttgmv2

Differential Revision: D56654417

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130452
Approved by: https://github.com/LucasLLC
2024-07-12 06:01:56 +00:00
ff25dfca5a Save quantization_tag in export graph serialization (#127473)
Summary: `quantization_tag` is first-class metadata in quantization flows and is preserved by them. As we'll want to store quantized exported graphs, we also need to preserve this metadata since it's used in later flows. Only JSON-serializable metadata will be allowed to be serialized.

Test Plan: Added test case

Differential Revision: D57939282

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127473
Approved by: https://github.com/angelayi
2024-07-12 05:06:40 +00:00
b7d287fbec Constant folding for dynamic shape node (#129686)
Extend constant folding to dynamic-shape nodes; only pointwise ops and some restricted ops are supported.

We support dynamic shapes by limiting constant folding to ops that are guaranteed to have uniform values (full, pointwise ops, and views) and running these operators with tensors of shape 1. This also eliminates the memory overhead of constant folding.

Taken over from https://github.com/pytorch/pytorch/pull/128937

joint work with @imzhuhl

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129686
Approved by: https://github.com/Chillee
ghstack dependencies: #130367
2024-07-12 03:44:29 +00:00
ae0edadea0 [SDPA] Replace masked_fill_ with aten::where (#130281)
Summary:
full context in D59385876

Based on the offline discussion with PT2 folks, we changed the SDPA implementation to mitigate the AOTI lowering issue.
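A rough illustration of the kind of rewrite described; the tensor names and shapes here are made up:

```python
import torch

scores = torch.randn(2, 4, 8, 8)
mask = torch.rand(2, 4, 8, 8) > 0.5

# In-place form that was problematic for AOTI lowering:
#   scores.masked_fill_(mask, float("-inf"))
# Out-of-place aten::where equivalent:
scores = torch.where(mask, torch.full_like(scores, float("-inf")), scores)
```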

Test Plan: PYTORCH_TEST_FBCODE=1 buck2 run  mode/opt -c=python.package_style=inplace -c fbcode.enable_gpu_sections=true -c fbcode.platform=platform010 -c fbcode.split-dwarf=true caffe2/test/inductor:test_inductor -- -r test_sdpa_inference_mode_aot_compile

Differential Revision: D59495634

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130281
Approved by: https://github.com/drisspg, https://github.com/zou3519, https://github.com/Skylion007, https://github.com/justinchuby
2024-07-12 03:04:31 +00:00
c16e90fe06 The device_suffix in a test_name is "privateuse1" sometimes. (#130091)
When running some test cases on the privateuse1 device, the device suffix in a test name is sometimes 'privateuse1'.
For example, a test name should be 'test_Dropout1d_npu', but it sometimes shows up as 'test_Dropout1d_privateuse1'.
When setUpClass() doesn't set it, the device suffix defaults to "privateuse1".

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130091
Approved by: https://github.com/zou3519
2024-07-12 02:51:40 +00:00
9ae40c6bc0 Fix and improve raise_comms and sink_waits (#129980)
The tests for `raise_comms` and `sink_waits` passes were not enabled in CI. The passes are now broken due to functional collective v2 and possibly other changes.

Correctness issues:
- The original passes did not take mutation into consideration and may yield semantically different scheduling order. This may be due to the recent changes to how mutations are expressed in Inductor IR (e.g., MutationOutput).

Effectiveness issues:
- The original passes only moved the comm/wait nodes themselves. However, comm nodes can come with prologues (e.g., clone for all_reduce_, split-cat for non-zero dim all-gather). Whenever there are any prologues, the comms won't be raised at all.
- The prologues are often horizontally fused with other pointwise nodes. This can severely delay the scheduling of the comm node.

This PR:
- Make the passes handle mutation correctly.
- Instead of moving individual comm/wait nodes, schedule all nodes using a scored method. This way the comm nodes can be optimally raised even in the presence of prologues.
- The horizontal fusion of prologues often severely delays the scheduling of the comm node. Horizontally fusing this clone can almost never out-perform scheduling the comm node earlier. Also, in most cases, this clone is eliminated via in-place reuse. Therefore, we tell the scheduler not to fuse it.
- Enable the tests in CI.

Co-authored-by: Will Feng <yf225@cornell.edu>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129980
Approved by: https://github.com/yf225
2024-07-12 01:55:47 +00:00
c6a676add4 [Traceable FSDP2][Inductor] Add GroupedSchedulerNode to contain nodes that must be scheduled together (#128568)
As discussed with @mlazos and @Chillee in the Inductor group chat, we need the concept of `GroupedSchedulerNode` to be able to express nodes that must be scheduled together one-after-another (i.e. no other node is allowed to fuse into them or schedule in-between them).

This is particularly important for comm reordering and fine-grained control of peak memory. For Traceable FSDP2, there are two very important requirements:
- At any time, there must be only one AllGather in flight. However, our existing comm reordering pass will naturally raise **all** AllGather ops to the beginning of the graph, which will clearly blow up memory usage. Instead, we leverage GroupedSchedulerNode, which provides simple connection points to build the "chaining" on, i.e. we use it to express the schedule `(copyin + AllGather1) -> (AllGather1Wait+copyout) -> (copyin + AllGather2) -> (AllGather2Wait+copyout) ...` by setting up fake deps between the GroupedSchedulerNodes, which is a very clean and easy-to-understand way to express this schedule.
- The "comms" in FSDP2 are not just comms, but a combination of compute and comm. We must prevent other nodes from being scheduled in-between that set of nodes, otherwise we are artificially delaying the release of comm buffer memory which makes the peak memory usage quite bad. This is particularly pronounced for `AllGatherWait+copyout`.

From these two requirements, we derive the behavior of `GroupedSchedulerNode`: it contains nodes that must be scheduled together one-after-another (i.e. no other node is allowed to fuse into them or schedule in-between them).

----

Q: Can we leverage `ir.Subgraph`?
A: I looked into the possibility of using `ir.Subgraph` to implement this, but realized that:
1. `ir.Subgraph` requires defining the subgraph in FX IR.
2. There is no guarantee that the Inductor IR nodes that we want to group together will all have a corresponding FX IR node, because some of those Inductor IR nodes can potentially be dynamically generated by a custom pass in the scheduler (e.g. for merging multiple all-gathers into one big all-gather, and later we want to group that big all-gather with some other op). Dynamically generated Inductor IR node doesn't have a corresponding upstream FX IR node.
3. For the above reasons, we can't use the `ir.Subgraph`, and need to define a new (and more lightweight) concept of `GroupedSchedulerNode` to achieve the behavior we need (this PR).

----

Test commands:
- `pytest -rA  test/distributed/test_compute_comm_reordering.py::TestComputeCommReorderingMultiProc::test_grouped_scheduler_node`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128568
Approved by: https://github.com/eellison, https://github.com/mlazos
2024-07-12 01:42:38 +00:00
c101c4517a Add python type for list iterators (#130511)
Fixes https://github.com/pytorch/pytorch/issues/117026

Also not sure why this was missing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130511
Approved by: https://github.com/williamwen42, https://github.com/yanboliang, https://github.com/anijain2305
2024-07-12 01:14:18 +00:00
536b5b19b5 Revert "Simplify c10::string_view (#130009)"
This reverts commit 10c7f037fe3271cb3865816c216007ba403f5347.

Reverted https://github.com/pytorch/pytorch/pull/130009 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/130009#issuecomment-2224223526))
2024-07-12 00:46:49 +00:00
7f2436014e add MTIA as valid device type for prof averages (#130340)
Summary: Add MTIA as valid device option for getting profile averages

Test Plan: Tested with auto-trace on MTIA

Differential Revision: D59486392

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130340
Approved by: https://github.com/aaronenyeshi
2024-07-12 00:39:01 +00:00
7ce5b5767c Revert "Make c10::string_view an alias of std::string_view (#130417)"
This reverts commit c9551a3f50efc8163d8508a3c2189536528577ac.

Reverted https://github.com/pytorch/pytorch/pull/130417 on behalf of https://github.com/izaitsevfb due to depends on #130009 which needs to be reverted ([comment](https://github.com/pytorch/pytorch/pull/130417#issuecomment-2224212227))
2024-07-12 00:37:04 +00:00
b5b91b418d [Easy] Update record_function Comment (#130561)
Summary: Users have been confused about why user annotations on GPU tracks do not show up when doing GPU-only tracing. This comment should help users understand that to use this function they need to have CPU activities enabled.
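For example, a minimal sketch (assuming a CUDA device is available): include CPU activities so the `record_function` annotation is captured at all.

```python
import torch
from torch.profiler import ProfilerActivity, profile, record_function

a = torch.randn(1024, 1024, device="cuda")
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    with record_function("my_region"):  # the annotation is emitted from the CPU side
        a @ a
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=5))
```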

Test Plan: N/A it is just updating a comment

Differential Revision: D59649390

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130561
Approved by: https://github.com/aaronenyeshi
2024-07-11 23:51:25 +00:00
18b7633bfb [export] fix kwargs in run_decompositions() for training IR (#130553)
Re-exporting GraphModule expects all inputs to be in args, though not in pytree-flattened format. This avoids failing when we run with a fx.Interpreter subclass in [AOTAutograd tracing](973037be6a/torch/_functorch/_aot_autograd/traced_function_transforms.py (L760-L762)).

Removes 7 test failures for training IR export.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130553
Approved by: https://github.com/zhxchen17, https://github.com/ydwu4
2024-07-11 22:53:18 +00:00
26c2b92525 [export] make with_effect mark op has_effect to prevent them from DCEed. (#129680)
Before this PR, custom ops that don't return outputs would get eliminated after calling `.module()` because the effect_token that keeps the operator alive is removed in the remove_effect_token pass. We want to remove the effect_token because we don't want the token to be part of the inputs. However, this causes the DCE calls in remove_effect_token itself, as well as the DCE calls in unlift, to remove the custom op from the graph, leading to an error in the exported graph.

This PR calls has_side_effect in with_effect to make sure graph.eliminate_dead_code doesn't remove the calls by accident.

Test Plan:
Add a new test pytest test/export/test_torchbind.py -k test_export_inplace_custom_op

Differential Revision: [D59498728](https://our.internmc.facebook.com/intern/diff/D59498728)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129680
Approved by: https://github.com/angelayi
2024-07-11 22:46:21 +00:00
9c6c0deadc Add eager_compile_backwards_failure to tlparse (#130434)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130434
Approved by: https://github.com/albanD
2024-07-11 22:35:33 +00:00
d97d962082 Revert "Add decompositions for copy variants of view ops (#128416)"
This reverts commit 68751799b85aa7f659420801bdbb8451f01ab09a.

Reverted https://github.com/pytorch/pytorch/pull/128416 on behalf of https://github.com/izaitsevfb due to breaks test_qs8_permute_copy test in executorch ([comment](https://github.com/pytorch/pytorch/pull/128416#issuecomment-2224023423))
2024-07-11 22:09:23 +00:00
a2f630a9a4 Revert "Decompose expand_copy and permute_copy (#129476)"
This reverts commit 7d4cb2109823f1c4001dff62b461bb9eda07ca17.

Reverted https://github.com/pytorch/pytorch/pull/129476 on behalf of https://github.com/izaitsevfb due to depends on #128416 which needs to be reverted ([comment](https://github.com/pytorch/pytorch/pull/129476#issuecomment-2224019720))
2024-07-11 22:06:15 +00:00
fc872e98f3 Infer prim tags from equivalent aten ones (#130367)
Take the intersection of all the tags of the corresponding aten op overloads. Previously, some of the rng ops not having tags caused issues with constant folding (they should get decomposed, but that's a separate issue).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130367
Approved by: https://github.com/ezyang
2024-07-11 20:53:52 +00:00
726a287271 [export] Expand verifier to be multiple on ExportedProgram (#130364)
Summary: This diff updates the ExportedProgram class in PyTorch to allow for multiple verifiers to be attached to it. This is done by adding a new field to the ExportedProgram schema called "verifiers" which is a list of strings representing the names of the verifiers to be attached to the program. The verifiers are loaded using the "load_verifier" function which is defined in the "torch._export.serde.serialize" module. The "exported_program.dialect" field is also deprecated in favor of the "verifiers" field.

Test Plan: CI

Differential Revision: D59408546

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130364
Approved by: https://github.com/angelayi, https://github.com/ydwu4
2024-07-11 20:34:49 +00:00
5c6edd29ec Turn on splitShare=1 to make the optimization of comm_split effective. (#129929)
Fixes #129865
Currently, new_group will call ncclCommSplit in some cases. In theory, ncclCommSplit brings performance and memory benefits. However, the config parameter of the ncclCommSplit function in PyTorch does not set "splitShare=1", which results in the ncclCommSplit optimization being turned off and the benefits being lost.
This PR turns on splitShare=1 to make the comm_split optimization effective.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129929
Approved by: https://github.com/shuqiangzhang
2024-07-11 20:14:58 +00:00
c50b189280 Move trunk windows builds to CUDA-12.1 (#130446)
That should catch build regressions that were previously only detectable during the nightly builds
Win + CUDA-11.8 builds and tests are still run as part of periodic workflow
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130446
Approved by: https://github.com/atalman
2024-07-11 19:50:57 +00:00
bc18863713 Corner-case fix for upscale_histogram in the new HistogramObserver (#130316)
Summary: Small fix to the bucketize function that caused a run-time error in some corner cases.

Test Plan: Unit tests

Differential Revision: D59508432

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130316
Approved by: https://github.com/jerryzh168
2024-07-11 19:49:21 +00:00
cd9bae30de Allow kwargs in _remove_effect_tokens_pass (#130491)
Summary: Previously, the remove_effect_tokens pass didn't pass kwargs to the internal nodes. This PR fixes it and adds a test for it.

Test Plan: buck2 run caffe2/test:test_export -- -r test_remove_effect_token_kwargs

Reviewed By: angelayi

Differential Revision: D59603147

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130491
Approved by: https://github.com/angelayi
2024-07-11 19:03:19 +00:00
578388bed8 Revert "Support for expandable segments with cuda graph trees (#128068)"
This reverts commit fdc83610f272610ce50d1a6f5b6354f2df1baabb.

Reverted https://github.com/pytorch/pytorch/pull/128068 on behalf of https://github.com/janeyx99 due to Reverting for breaking ROCm tests on trunk, I think the tests need to be qualified with @onlyCUDA ([comment](https://github.com/pytorch/pytorch/pull/128068#issuecomment-2223672381))
2024-07-11 18:58:13 +00:00
1cae60a87e Caching attr_proxy for nn_module attribute to fix guard check failure (#130280)
Fixes https://github.com/pytorch/pytorch/issues/129939

Differential Revision: [D59594605](https://our.internmc.facebook.com/intern/diff/D59594605)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130280
Approved by: https://github.com/anijain2305
2024-07-11 18:21:35 +00:00
0a4fe2ff86 [DSD] Use no_grad() to make some operations faster and avoid possible memory leakage (#130355)
Use no_grad() to make some operations faster and avoid possible memory leakage

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130355
Approved by: https://github.com/wz337
2024-07-11 18:18:08 +00:00
973037be6a [BE][Easy] apply autofix for ruff rules unnecessary-collection-call (C408): list() / tuple() / dict() (#130199)
This PR changes the empty collection factory call to Python literals:

- `list()` -> `[]`
- `tuple()` -> `()`
- `dict()` -> `{}`

The Python literals are more performant and safer. For example, the bytecode for building an empty dictionary:

```bash
$ python3 -m dis - <<EOS
import collections

d1 = {}
d2 = dict()

dict = collections.OrderedDict
d3 = dict()
EOS
```

```text
  0           0 RESUME                   0

  1           2 LOAD_CONST               0 (0)
              4 LOAD_CONST               1 (None)
              6 IMPORT_NAME              0 (collections)
              8 STORE_NAME               0 (collections)

  3          10 BUILD_MAP                0
             12 STORE_NAME               1 (d1)

  4          14 PUSH_NULL
             16 LOAD_NAME                2 (dict)
             18 CALL                     0
             26 STORE_NAME               3 (d2)

  6          28 LOAD_NAME                0 (collections)
             30 LOAD_ATTR                8 (OrderedDict)
             50 STORE_NAME               2 (dict)

  7          52 PUSH_NULL
             54 LOAD_NAME                2 (dict)
             56 CALL                     0
             64 STORE_NAME               5 (d3)
             66 RETURN_CONST             1 (None)
```

The dict literal `{}` only has one bytecode `BUILD_MAP`, while the factory call `dict()` has three `PUSH_NULL + LOAD_NAME + CALL`. Also, the factory call is not safe if users override the `dict` name in `locals` or `globals` (see the example of replacing with `OrderedDict` above).
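A tiny runnable illustration of the shadowing hazard mentioned above:

```python
from collections import OrderedDict

dict = OrderedDict   # shadowing the builtin changes what dict() constructs
d_call = dict()      # -> OrderedDict()
d_literal = {}       # -> always a plain dict, via a single BUILD_MAP
print(type(d_call), type(d_literal))
```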

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130199
Approved by: https://github.com/malfet
2024-07-11 17:30:28 +00:00
492de213e2 Revert "Change deprecated warning on dispatch_on_subclass to warn once (#130047)"
This reverts commit f21a21828ac6e16d903ee88f726fdb2278c04782.

Reverted https://github.com/pytorch/pytorch/pull/130047 on behalf of https://github.com/albanD due to The failure on the PR are valid, they should not have been ignored ([comment](https://github.com/pytorch/pytorch/pull/130047#issuecomment-2223488933))
2024-07-11 17:24:02 +00:00
f21a21828a Change deprecated warning on dispatch_on_subclass to warn once (#130047)
Summary:
Right now the deprecated warning fires on every operator that calls into torch_function. Changing it to TORCH_WARN_ONCE instead.

More context in https://fb.workplace.com/groups/260102303573409/permalink/445299188387052/

Test Plan: Sandcastle

Differential Revision: D59338775

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130047
Approved by: https://github.com/XilunWu
2024-07-11 17:02:26 +00:00
3896ba3260 [DeviceMesh] Only include the real thread_id in DeviceMesh hash under threaded backend (#130495)
Fixes #ISSUE_NUMBER

As a follow-up to https://github.com/pytorch/pytorch/pull/130454, users are hitting the cross-mesh operation error because the DeviceMesh thread ID differs between the saved and the loaded DTensor.

This is a hotfix that only includes the real thread_id in the DeviceMesh hash under the threaded backend and sets it to None in all other cases.

As a follow-up, we need to look at the following test failures to better root-cause DeviceMesh failures related to MTPG when thread_id is not included in the hash.
```
test/distributed/_composable/fsdp/test_fully_shard_training.py::TestFullyShardRegisteredParams::test_param_registration_after_forward
test/distributed/_tensor/test_dtensor_ops.py::TestDTensorOpsCPU::test_dtensor_op_db_column_stack_cpu_float32
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130495
Approved by: https://github.com/awgu, https://github.com/wanchaol
2024-07-11 17:02:18 +00:00
72d9135679 increase tensor size to force out of memory exception on the latest generations of GPUs (#130334)
This PR fixes profiler/test_profiler.py::TestProfiler::test_oom_tracing.
The test expects an OOM by allocating a huge tensor, but MI300X has enough memory to allocate such a tensor.
This PR increases the tensor size by a large margin to force an OutOfMemory exception on MI300X and future GPU generations.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130334
Approved by: https://github.com/jeffdaily, https://github.com/janeyx99
2024-07-11 16:59:40 +00:00
9c1ba5ac10 [BE] Cleanup unused vars in MPS (#130541)
Also move `using namespace mps` outside of every function, as there is no
need to repeat it.
Use `getTensorsStringKey` instead of the explicit
`getMPSShapeString(getMPSShape(t)) + getMPSDataTypeString(t)`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130541
Approved by: https://github.com/Skylion007
2024-07-11 16:48:03 +00:00
68ad3eb722 Do not set hints for mark_unbacked quantities (#130483)
Fixes https://github.com/pytorch/pytorch/issues/130456

When we mark_unbacked a size, we actually DO have a hint for it
(because we have a real input tensor), and previously we were
accidentally putting it into the hint field of SymNode. If the marked
unbacked size is zero or one, this can lead to an inconsistency between
hint compute and static evaluation compute under size-oblivious guards,
since that is the whole point of being size oblivious. The answer is to scrub out
hints on mark_unbacked ints.
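
A minimal sketch of the setup this protects against (assuming the private `torch._dynamo.mark_unbacked` helper; illustrative only):

```python
import torch

@torch.compile(dynamic=True)
def f(x):
    # size-oblivious reasoning must not specialize on x.size(0) being 0 or 1
    return x * 2

x = torch.randn(1)                  # the real input happens to have size 1 (a potential hint)
torch._dynamo.mark_unbacked(x, 0)   # dim 0 is treated as unbacked, so the hint must be scrubbed
print(f(x))
```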

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130483
Approved by: https://github.com/lezcano
2024-07-11 15:51:00 +00:00
ca023f77bc [CD] Add pytorch xpu wheel build in nightly (#129560)
Add the PyTorch XPU wheel build to the nightlies, now that the XPU build-image enabling PR https://github.com/pytorch/builder/pull/1879 has merged.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129560
Approved by: https://github.com/atalman
2024-07-11 15:49:04 +00:00
fb9bc6d74a [custom op] add doc for CustomOpDef.set_kernel_enabled (#130406)
<img width="1067" alt="Screenshot 2024-07-09 at 6 14 55 PM" src="https://github.com/pytorch/pytorch/assets/22356083/941751f8-8e12-43cb-8477-c739476e0096">
<img width="965" alt="Screenshot 2024-07-09 at 6 14 59 PM" src="https://github.com/pytorch/pytorch/assets/22356083/aa9be099-f26c-45a3-8a14-742a2bb7c28b">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130406
Approved by: https://github.com/zou3519
2024-07-11 15:47:35 +00:00
5ed72ff5f5 Reduce all tensors to their metadata in AOTAutogradCache; add tests (#128583)
This PR makes it so that all tensors are reduced to their metadata in AOTAutogradCache. Because dynamo always embeds constant tensors into the FX graph directly, there's no risk of losing a constant tensor whose values are semantically important. AOTAutograd itself may take a constant tensor and set it as an attribute on an FX graph for inductor, but Dynamo never does this.

One other thing that this diff does is add `[pickler.fast](https://docs.python.org/3/library/pickle.html#pickle.Pickler.fast)` to our pickling algorithm for cache key generation. Pickle will often memoize/intern strings when pickling, leading to false cache misses due to inconsistent memoization. Turning on pickler.fast removes this behavior.

Technically `fast` is a "deprecated" feature according to the Python docs, but it's still supported in Python 3.8-3.12, and if it is ever removed, the only downside will be a few more cache misses, so it's worth adding here (and removing later as needed).
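
A minimal illustration of the `pickler.fast` behavior described above (hypothetical helper, not the actual cache-key code):

```python
import hashlib
import io
import pickle

def cache_key(obj, fast=True):
    buf = io.BytesIO()
    p = pickle.Pickler(buf)
    p.fast = fast  # disable the memo so equal strings always pickle to the same bytes
    p.dump(obj)
    return hashlib.sha256(buf.getvalue()).hexdigest()

a = "aten::convolution(Tensor input, Tensor weight)"
b = "".join(["aten::convolution", "(Tensor input, Tensor weight)"])  # equal value, distinct object
print(cache_key([a, a]) == cache_key([a, b]))  # True; with fast=False the memo can make these differ
```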
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128583
Approved by: https://github.com/oulgen
ghstack dependencies: #128335
2024-07-11 15:39:09 +00:00
be7bf20234 Add JK to enable fx graph cache for amd (#130463)
Test Plan: ad hoc testing

Differential Revision: D59593961

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130463
Approved by: https://github.com/nmacchioni, https://github.com/mxz297
2024-07-11 15:28:38 +00:00
6f662e9575 update the input weight of _convert_weight_to_int4pack to [n][k / 2] uint8 (#129940)
This PR updates the input `weight` of `_convert_weight_to_int4pack` from `[n][k] int32` to `[n][k / 2] uint8` for CPU, CUDA, and MPS, which helps decouple int4 model checkpoints from specific ISAs and platforms in `gpt-fast`. The advantage is that an int4 model checkpoint can be shared across different test machines without re-generating it on one particular platform. Meanwhile, the size of the input `weight` is reduced to `1 / 8`.

Before this PR, the packed weight was stored in a CUDA-specific layout: `[n/8][k/(InnerKTiles*16)][32][InnerKTiles/2]`, dtype int32, where InnerKTiles = 2, 4, 8. The CPU packed weight was viewed as the SAME shape but stored in a different layout: `[n/64][k][32]`, dtype uint8. The weight was strongly coupled to platforms (CPU/CUDA) and ISAs (AVX512/AVX2/scalar), and users could not reuse a generated weight on a different ISA or platform, because the compute format differs when the weight is loaded onto a device.
![image](https://github.com/pytorch/pytorch/assets/61222868/64971c4b-29b9-42cf-9aeb-ffa01cea93dd)

Now we use a common serialized layout (`[n][k/2] uint8`) across devices and ISAs as the input `weight` of `_convert_weight_to_int4pack`, and each backend chooses how to interpret it as its compute layout.
![image](https://github.com/pytorch/pytorch/assets/61222868/c7990761-c723-417b-aca2-7c60db7785c7)
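
A rough sketch of the packing implied by the serialized layout (assuming two int4 values per byte, low nibble first; each backend still picks its own compute layout):

```python
import torch

def pack_int4_rows(w):                 # w: [n, k] int32 with values in [0, 15]
    w = w.to(torch.uint8)
    low, high = w[:, 0::2], w[:, 1::2]
    return (high << 4) | low           # [n, k // 2] uint8

w = torch.randint(0, 16, (4, 8), dtype=torch.int32)
print(pack_int4_rows(w).shape)         # torch.Size([4, 4])
```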

### Performance
Intel (R) Xeon (R) CPU Max 9480, single socket (56 cores)
There is no obvious regression from this PR.
![image](https://github.com/pytorch/pytorch/assets/61222868/6046dcf4-920b-4c63-9ca3-1c8c3cafebde)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129940
Approved by: https://github.com/jgong5, https://github.com/lezcano, https://github.com/mingfeima
2024-07-11 15:26:48 +00:00
cyy
c4a2b6a943 [2/N] Fix NVCC warnings (#130214)
Follows #130191
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130214
Approved by: https://github.com/ezyang
2024-07-11 14:46:53 +00:00
a833582dbb [dynamo][tuple] Optimize guard for small tuples - helps conv2d guards (#130400)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130400
Approved by: https://github.com/yanboliang, https://github.com/jansel
ghstack dependencies: #130285, #130368, #130416
2024-07-11 14:13:24 +00:00
f7d7b94017 [dynamo][unspecialized-nn-module] Distinguish between user-defined and builtin nn module (#130416)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130416
Approved by: https://github.com/jansel
ghstack dependencies: #130285, #130368
2024-07-11 14:13:24 +00:00
fed8b0055f [dynamo][bugfix] Fix the value for key manager (#130368)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130368
Approved by: https://github.com/jansel
ghstack dependencies: #130285
2024-07-11 14:13:19 +00:00
9c612df504 [dynamo][cpp-guards][QOL] Print NO_TENSOR_ALIASING guard once (#130285)
NO_TENSOR_ALIASING guard lists all tensors. Printing it on every occurrence is ugly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130285
Approved by: https://github.com/jansel
2024-07-11 14:13:14 +00:00
bac10cdd6f [DCP] Fix duplicated logging messages when enabling both c10d and dcp logger (#130423)

Fixes #129951. Would you take a moment to review it? @LucasLLC

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130423
Approved by: https://github.com/Skylion007
2024-07-11 13:43:39 +00:00
0d66ccaf23 [IntraNodeComm] fix an issue where input check fails when running all-reduce on sub groups (#130492)
Tested against the following snippet with `ENABLE_INTRA_NODE_COMM=1`.

```python
import os
import torch
import torch.distributed as dist

def main():
    rank = int(os.environ["RANK"])
    local_rank = int(os.environ["LOCAL_RANK"])
    world_size = int(os.environ["WORLD_SIZE"])

    torch.cuda.set_device(f"cuda:{local_rank}")
    dist.init_process_group("nccl")

    draft_group = dist.new_group([0, 1, 2, 3])
    target_group = dist.new_group([4, 5, 6, 7])

    inp = torch.full((128, 128), rank, dtype=torch.bfloat16, device="cuda")
    dist.all_reduce(inp)
    expect = sum(range(world_size))
    assert inp.eq(expect).all()

    if 0 <= rank < 4:
        inp = torch.full((128, 128), rank, dtype=torch.bfloat16, device="cuda")
        dist.all_reduce(inp, group=draft_group)
        expect = sum(range(4))
        assert inp.eq(expect).all()
    else:
        inp = torch.full((128, 128), rank, dtype=torch.bfloat16, device="cuda")
        dist.all_reduce(inp, group=target_group)
        expect = sum(range(4, 8))
        assert inp.eq(expect).all()

    torch.cuda.synchronize()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130492
Approved by: https://github.com/Chillee
2024-07-11 13:39:14 +00:00
f261c6ebe8 Revert "[halide-backend] Update CI pin (#130258)"
This reverts commit 4fcfd475bea24b832da32a0c4d464dd87c73a2a9.

Reverted https://github.com/pytorch/pytorch/pull/130258 on behalf of https://github.com/albanD due to Seems to have broken trunk pretty bad 4fcfd475be ([comment](https://github.com/pytorch/pytorch/pull/130258#issuecomment-2222935064))
2024-07-11 13:26:01 +00:00
354edb232a Make public binding test only consider files that are packaged in the wheels (#130497)
In particular, when creating the PyTorch wheel, we use setuptools `find_packages` (551b3c6dca/setup.py (L1055)), which explicitly skips packages without `__init__.py` files (namespace packages): https://setuptools.pypa.io/en/latest/userguide/package_discovery.html#finding-simple-packages

So this PR reverts the change that stopped skipping these namespace packages: even though they are in the codebase, they are not in the published binaries, so we are OK relaxing the public API and importability rules for them.
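
A minimal sketch of why namespace packages drop out of the wheel (run from a source checkout; illustrative only):

```python
from setuptools import find_packages

# find_packages() only returns directories that contain an __init__.py,
# so namespace packages (no __init__.py) never make it into the wheel.
pkgs = find_packages(include=["torch", "torch.*"])
print(len(pkgs), "packages with __init__.py will ship in the wheel")
```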

A manual diff of the two traversal methods:
```
torch._inductor.kernel.bmm
torch._inductor.kernel.conv
torch._inductor.kernel.flex_attention
torch._inductor.kernel.mm
torch._inductor.kernel.mm_common
torch._inductor.kernel.mm_plus_mm
torch._inductor.kernel.unpack_mixed_mm
torch._strobelight.examples.cli_function_profiler_example
torch._strobelight.examples.compile_time_profile_example
torch.ao.pruning._experimental.data_sparsifier.benchmarks.dlrm_utils
torch.ao.pruning._experimental.data_sparsifier.benchmarks.evaluate_disk_savings
torch.ao.pruning._experimental.data_sparsifier.benchmarks.evaluate_forward_time
torch.ao.pruning._experimental.data_sparsifier.benchmarks.evaluate_model_metrics
torch.ao.pruning._experimental.data_sparsifier.lightning.tests.test_callbacks
torch.ao.quantization.experimental.APoT_tensor
torch.ao.quantization.experimental.adaround_fake_quantize
torch.ao.quantization.experimental.adaround_loss
torch.ao.quantization.experimental.adaround_optimization
torch.ao.quantization.experimental.apot_utils
torch.ao.quantization.experimental.fake_quantize
torch.ao.quantization.experimental.fake_quantize_function
torch.ao.quantization.experimental.linear
torch.ao.quantization.experimental.observer
torch.ao.quantization.experimental.qconfig
torch.ao.quantization.experimental.quantizer
torch.csrc.jit.tensorexpr.codegen_external
torch.csrc.jit.tensorexpr.scripts.bisect
torch.csrc.lazy.test_mnist
torch.distributed._tensor.examples.checkpoint_example
torch.distributed._tensor.examples.comm_mode_features_example
torch.distributed._tensor.examples.comm_mode_features_example_argparser
torch.distributed._tensor.examples.convnext_example
torch.distributed._tensor.examples.torchrec_sharding_example
torch.distributed._tensor.examples.visualize_sharding_example
torch.distributed.benchmarks.benchmark_ddp_rpc
torch.distributed.checkpoint.examples.async_checkpointing_example
torch.distributed.checkpoint.examples.fsdp_checkpoint_example
torch.distributed.checkpoint.examples.stateful_example
torch.distributed.examples.memory_tracker_example
torch.fx.experimental.shape_inference.infer_shape
torch.fx.experimental.shape_inference.infer_symbol_values
torch.include.fp16.avx
torch.include.fp16.avx2
torch.onnx._internal.fx.analysis.unsupported_nodes
torch.onnx._internal.fx.passes._utils
torch.onnx._internal.fx.passes.decomp
torch.onnx._internal.fx.passes.functionalization
torch.onnx._internal.fx.passes.modularization
torch.onnx._internal.fx.passes.readability
torch.onnx._internal.fx.passes.type_promotion
torch.onnx._internal.fx.passes.virtualization
torch.utils._strobelight.examples.cli_function_profiler_example
torch.utils.benchmark.examples.sparse.compare
torch.utils.benchmark.examples.sparse.fuzzer
torch.utils.benchmark.examples.sparse.op_benchmark
torch.utils.tensorboard._convert_np
torch.utils.tensorboard._embedding
torch.utils.tensorboard._onnx_graph
torch.utils.tensorboard._proto_graph
torch.utils.tensorboard._pytorch_graph
torch.utils.tensorboard._utils
torch.utils.tensorboard.summary
torch.utils.tensorboard.writer

```

These are all either namespace packages (which we want to remove) or packages that are not importable (and tagged as such in the test).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130497
Approved by: https://github.com/aorenste
2024-07-11 13:22:04 +00:00
215013daad [cuDNN][SDPA] Limit cuDNN SDPA head-dim to 128 (#130494)
Limit cuDNN SDPA to head-dim 128 globally. Apparently the support for 256 is only for the forward on sm90+, which would be clunky to maintain as it would mean dispatching differently for forward and backward.

CC @drisspg

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130494
Approved by: https://github.com/drisspg, https://github.com/Skylion007
2024-07-11 13:21:18 +00:00
cyy
9822fdc354 [7/N] Replace c10::optional with std::optional (#130510)
Follows #130438

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130510
Approved by: https://github.com/janeyx99
2024-07-11 13:21:05 +00:00
f52b2ee90f Modularize aten parameter parser and checker (#125308)
In this PR, we abstract the different types of aten operation parameters as `ParameterMetadata`. This structure is intended to represent and store the metadata of each aten operation parameter. Currently, it only supports `Tensor`, `TensorList`, and `Scalar`.

```C++
using ParameterMetadataValue = std::variant<TensorMetadata, std::vector<TensorMetadata>, c10::Scalar>;
```

With this PR, we can extend support to other parameter types in a more modular way, such as `string`, `int`, and `double`.

Differential Revision: [D59399546](https://our.internmc.facebook.com/intern/diff/D59399546)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125308
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/atalman
2024-07-11 13:17:25 +00:00
2a51ccc77e When translation validation is enabled, assert that hint is consistent (#130478)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130478
Approved by: https://github.com/lezcano
2024-07-11 13:02:31 +00:00
cyy
c9551a3f50 Make c10::string_view an alias of std::string_view (#130417)
Follows #130009 to further facilitate the migration from c10::string_view to std::string_view. The old c10::string_view was renamed to c10::string_view_ext.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130417
Approved by: https://github.com/ezyang
2024-07-11 12:31:06 +00:00
cyy
c5b66c3fe1 Enable -Werror=pedantic on torch targets (#130319)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130319
Approved by: https://github.com/ezyang
2024-07-11 12:27:32 +00:00
5db9bd467e Skip test_nnc_correctness for new op _unsafe_masked_index (#130375)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130375
Approved by: https://github.com/lezcano
2024-07-11 08:17:16 +00:00
b1942a1af4 [fbgemm_gpu] Break up fbgemm_cuda_utils.cuh, pt 10 (#130468)
Summary:
X-link: https://github.com/pytorch/FBGEMM/pull/2814

X-link: https://github.com/facebookresearch/FBGEMM/pull/19

- Break up `fbgemm_cuda_utils.cuh`, pt 10

Test Plan:
```
buck2 targets //deeplearning/fbgemm/fbgemm_gpu/test/jagged/... | grep -v '-' | xargs -I % sh -c 'buck2 run @//mode/opt -c fbcode.nvcc_arch=v100 -c fbcode.platform=platform010 % || exit 255'

buck2 targets //deeplearning/fbgemm/fbgemm_gpu/test/tbe/... | grep -v '-' | xargs -I % sh -c 'buck2 run @//mode/opt -c fbcode.nvcc_arch=v100 -c fbcode.platform=platform010 % || exit 255'

buck2 targets //deeplearning/fbgemm/fbgemm_gpu/test/sparse/... | grep -v '-' | xargs -I % sh -c 'buck2 run @//mode/opt -c fbcode.nvcc_arch=v100 -c fbcode.platform=platform010 % || exit 255'

buck2 build --config fbcode.enable_gpu_sections=true --flagfile fbcode//mode/dev-nosan-amd-gpu fbcode//smart/inference_platform_sp/llm_predictor_amd:service

buck2 build --flagfile fbcode//mode/amd-gpu fbcode//hpc/ops:sparse_ops

buck2 build --flagfile fbcode//mode/dev-nosan-amd-gpu fbcode//caffe2/benchmarks/operator_benchmark/pt:add_test
```

Reviewed By: spcyppt

Differential Revision: D59545097

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130468
Approved by: https://github.com/ezyang
2024-07-11 07:10:27 +00:00
79c41bb58a [inductor] switch CppCodeCache to new cpp_builder. (#130132)
Changes:
1. switch CppCodeCache to new cpp_builder.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130132
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-07-11 07:03:43 +00:00
75ab027fbb [dtensor] move bernoulli to op strategy (#130286)
as titled

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130286
Approved by: https://github.com/awgu, https://github.com/yifuwang
2024-07-11 06:43:11 +00:00
fdc83610f2 Support for expandable segments with cuda graph trees (#128068)
This PR adds support for using expandable segments with private memory pools, which should unblock using them with cuda graphs and cuda graph trees. Currently, the allocator silently avoids using expandable segments when allocating in a private pool, due to checkpoint saving/restoring not meshing well with how we keep track of unmapped blocks.

The PR itself is pretty short, most of the logic for checkpointing and reapplying state for non-expandable segments transfers over without much work.

Expandable segments reserve a virtual address space of size equal to the amount of physical memory on the GPU. Every time we want to `malloc()` or `free()` memory in a memory pool with expandable segments turned on, we map/unmap pages of physical GPU memory under the hood to create a new block that we return to the caller. This is beneficial due to the fact that each memory pool functions as a single segment of memory with a contiguous block of memory addresses that can grow and shrink as needed, avoiding fragmentation from allocating multiple non-contiguous segments that may not be merged together.

The caching allocator handles this by creating an unmapped block for the entire reserved virtual address space at init, which is treated similarly to an unallocated block in a free pool. When callers call `malloc()`, it's split and mapped to create allocated blocks, and calling `free()` similarly caches and merges free blocks in a free pool to be used later. Expandable blocks are unmapped and returned to Cuda when they are cleaned up, or when we hit an OOM and the allocator attempts to remap cached free blocks. The code paths to map, free, and unmap blocks in expandable segments are similar to those for normal blocks and do all the same work of updating stats on memory usage, moving blocks between active and free pools, and returning memory to Cuda.

With Cuda Graph Trees and private memory pools, we need the ability to take checkpoints of the current state of the memory allocator after each graph capture as well as reapplying the state before capturing a new graph after replaying a captured graph so that the new cuda graph capture has access to the state of the allocator at the point after replaying a previously captured graph so it can reuse empty blocks and allocate new ones.

As mentioned in a below comment, memory in a private pool is cached until the private pool is destroyed and allocations can only grow from extra graph captures, any freeing of memory would result in invalid memory addresses and would break cuda graphs.

One implementation detail to note for unmapped blocks with expandable segments is that unmapped blocks are tracked in a member variable `unmapped` of a `BlockPool`. `unmapped` is *not* part of the checkpointed state of the caching allocator and isn't restored when reapplying checkpoints, since we never free/unmap memory back to cuda, and it persists across graph captures / replays.

Checkpointing the current state of the memory allocator works as expected with expandable segments. Checkpointing grabs the first block of every segment in the active and free pools of the private pool and traverses the linked list of blocks in the segment to capture the state of every segment, which is then saved and kept for when it is needed to be reapplied. For expandable blocks, the last block in every segment will be an unallocated unmapped block containing the remaining amount of unmapped memory at graph capture time, and this too is saved in the checkpoint.

Reapplying the checkpoints works by freeing all allocated blocks and merging them into a single block per segment, then for each segment, we manually split and allocate all blocks from the checkpoint and then free the blocks marked as unallocated in the checkpoint state. For expandable segments, we need to make some modifications to not split unmapped blocks and avoid manually mapping then freeing unmapped blocks.
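
A hedged end-to-end sketch of the newly supported combination (the allocator config must be set before CUDA is initialized; the pool bookkeeping itself is internal to the allocator):

```python
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"  # before CUDA init

import torch

x = torch.randn(1024, device="cuda")
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):   # capture allocates from a private memory pool
    y = (x * 2).relu()
g.replay()                  # replays against the checkpointed pool state
torch.cuda.synchronize()
```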

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128068
Approved by: https://github.com/zdevito, https://github.com/eqy
2024-07-11 05:33:09 +00:00
da24823e06 [BE][EZ] Migrate to new dcp save and load APIs (#130475)
While playing with DCP for distributed inference, I found that we are still using deprecated DCP APIs, even in unit tests. This PR switches to the new API with the unified lowercase "dcp" naming.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130475
Approved by: https://github.com/wz337
2024-07-11 04:13:39 +00:00
5835ff1ed5 [Easy][Inductor] Add comment for .min_order and .max_order (#130390)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130390
Approved by: https://github.com/anijain2305
2024-07-11 03:58:03 +00:00
a4576dad34 [reland][custom ops] infer schema (#130079)
Fixes #129617

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130079
Approved by: https://github.com/zou3519
2024-07-11 03:39:07 +00:00
9f401187c7 [pipelining] Refactor test_schedule to fix "-k" (#130294)
This is kind of a short-sighted workaround and we should actually come
up with a way to fix this in general, but I got annoyed that I can't use
-k to filter tests in test_schedule, and realized it's because we jam
tests using the new MultiProcContinuousTest fixture together with
old-style tests.

For now I separate the two types of tests so -k works again.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130294
Approved by: https://github.com/H-Huang
2024-07-11 03:18:02 +00:00
dfd1d1971e Fix warning when pickle.load torch.Storage (#130246)
Fixes https://github.com/pytorch/pytorch/issues/130242

Since `torch.save` does not use pickle for storages, the `torch.load` in `_load_from_bytes` should not ever be called when `torch.load`-ing a checkpoint. Setting weights_only=False explicitly in `_load_from_bytes` to avoid the weights_only warning when using the pickle module

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130246
Approved by: https://github.com/albanD
2024-07-11 02:40:29 +00:00
4fcfd475be [halide-backend] Update CI pin (#130258)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130258
Approved by: https://github.com/eellison
2024-07-11 02:26:16 +00:00
df9d1b44e7 Preserve _numeric_debug_handle through deepcopy and re-export (#129287)
Summary:
* Added support for preserving it during deepcopy; we need to remap the args since _numeric_debug_handle refers to nodes in the graph

TODO: need to fully support re-export; currently the metadata for the output node is not preserved

Test Plan:
python test/test_quantization.py -k test_deepcopy_preserve_handle
python test/test_quantization.py -k test_copy_preserve_handle

all related tests:
python test/test_quantization.py -k TestGenerateNumericDebugHandle

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129287
Approved by: https://github.com/zhxchen17
2024-07-11 02:19:41 +00:00
a205a53c50 Make sym_node log more useful (#130436)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130436
Approved by: https://github.com/Skylion007
2024-07-11 01:42:53 +00:00
79e34800c3 Suppress guards generated by empty_strided in ir_node_to_tensor (#130431)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130431
Approved by: https://github.com/IvanKobzarev
2024-07-11 01:19:11 +00:00
cyy
798b9652f7 [6/N] Replace c10::optional with std::optional (#130438)
Follows #130408

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130438
Approved by: https://github.com/janeyx99
2024-07-11 01:15:37 +00:00
5bc18ec0a1 [Inductor][CPP] Support vectorization of remainder (#129849)
**Summary**
When checking the vectorization status across the three test suites, we found that some operators disabled vectorization with the message `Disabled vectorization: op: remainder`. In this PR, we add vectorization support for this op.

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_vec_remainder
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_int_div_vec
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129849
Approved by: https://github.com/jgong5, https://github.com/lezcano
ghstack dependencies: #130405
2024-07-11 00:50:50 +00:00
6adc725157 doc - fix the max_norm value in a note (#129687)
`max_norm=True` is currently written in the note, but `max_norm` can be a `float`, NOT a `bool` (as the [docstring](ec284d3a74/torch/nn/modules/sparse.py (L30)) says).
That note was created in #45595

The current pull request cleans it up.
The value `True` in the note can confuse users into thinking it can be a boolean.

In fact, counter-intuitive behavior occurs if users try to set it to `False`:
it is interpreted as 0, so the embedding values become 0 - not what users expect when setting it to `False`.
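
A small example of the intended usage (and of the pitfall the note warns about):

```python
import torch
import torch.nn as nn

emb = nn.Embedding(10, 4, max_norm=1.0)   # max_norm is a float threshold, not a bool
out = emb(torch.tensor([1, 2]))
print(out.norm(dim=1))                     # accessed rows are renormalized to <= 1.0

# max_norm=False would be interpreted as 0, zeroing the accessed embedding vectors.
```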

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129687
Approved by: https://github.com/mikaylagawarecki, https://github.com/malfet
2024-07-11 00:01:17 +00:00
358da54be5 [inductor] Better messaging when triton version is too old (#130403)
Summary:
If triton is available, but we can't import triton.compiler.compiler.triton_key, then we see some annoying behavior:
1) If we don't actually need to compile triton, the subprocess pool will still spew error messages about the import failure; it's unclear to users if this is an actual problem.
2) If we do need to compile triton, we a) see the error messages from above and b) get a vanilla import exception without the helpful "RuntimeError: Cannot find a working triton installation ..."
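
A minimal sketch of the kind of check involved (not Inductor's actual code):

```python
def has_usable_triton() -> bool:
    try:
        # the symbol Inductor relies on; too-old Triton installs may lack it
        from triton.compiler.compiler import triton_key  # noqa: F401
        return True
    except ImportError:
        return False
```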

Test Plan: Ran with and without torch.compile for a) recent version of triton, b) triton 2.2, and c) no triton. In all cases, verified expected output (success or meaningful error message)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130403
Approved by: https://github.com/eellison
2024-07-10 23:45:50 +00:00
ceedee23ec [DTensor] Included meshes in cross-mesh error msg (#130454)
The current error message is not actionable since we do not know which meshes are involved. Including the `__repr__` of each mesh in the error helps but is not always sufficient.

7d4cb21098/torch/distributed/device_mesh.py (L395-L408)

The problem is that `DeviceMesh.__eq__` is actually pretty involved, and we cannot see all parts of the `__eq__` criteria just from the `__repr__` (e.g. the thread ID).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130454
Approved by: https://github.com/wz337, https://github.com/wanchaol
2024-07-10 22:40:57 +00:00
2abc7cc21b [inductor] switch AotCodeCompiler to new cpp_builder (#130127)
Changes:
1. Switch `AotCodeCompiler` to new cpp_builder.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130127
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-07-10 22:28:29 +00:00
551b3c6dca Use irange to avoid -Wsign-compare errors (#130388)
Fixes meta-internal errors after importing #128753

(see [D59498679](https://www.internalfb.com/diff/D59498679))
```
fbcode/caffe2/aten/src/ATen/Context.cpp:286:34: error: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long') [-Werror,-Wsign-compare]
      for (auto index = 0; index < at::getNumGPUs(); index++) {
                           ~~~~~ ^ ~~~~~~~~~~~~~~~~
1 error generated.
```
Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130388
Approved by: https://github.com/Skylion007, https://github.com/malfet
2024-07-10 22:07:51 +00:00
ce499eee0c Revert "Add API for open registration between operators and subclasses (and modes) (#130064)"
This reverts commit c23d103afae65588772cb30037ea4110f01f6f41.

Reverted https://github.com/pytorch/pytorch/pull/130064 on behalf of https://github.com/izaitsevfb due to fails internal builds, see [D59553526](https://www.internalfb.com/diff/D59553526) ([comment](https://github.com/pytorch/pytorch/pull/130064#issuecomment-2221587575))
2024-07-10 21:50:32 +00:00
83c95c48f7 Flight recorder data as JSON (#129505)
Summary:
Provide a new API to retrieve flight recorder data as JSON.
The one minor difference between flight recorder as Pickle v/s JSON is
that the JSON API does not retrieve stack traces at the moment.
This ends up being far too much data.

Test Plan:
unit test

Differential Revision: [D59536460](https://our.internmc.facebook.com/intern/diff/D59536460)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129505
Approved by: https://github.com/wconstab, https://github.com/d4l3k
2024-07-10 21:50:27 +00:00
86bca69c5f Revert "[custom_ops] expose torch.library.register_torch_dispatch (#130261)"
This reverts commit bb9a73f767526e0d23c60360db5212b6bed0e8bc.

Reverted https://github.com/pytorch/pytorch/pull/130261 on behalf of https://github.com/izaitsevfb due to depends on #130064 which needs to be reverted ([comment](https://github.com/pytorch/pytorch/pull/130261#issuecomment-2221569707))
2024-07-10 21:43:28 +00:00
e14a0f45ed Revert "[reland][custom ops] infer schema (#130079)"
This reverts commit bef085bdfa62cc14589c70279de17108b2c2089f.

Reverted https://github.com/pytorch/pytorch/pull/130079 on behalf of https://github.com/izaitsevfb due to depends on #130064 which needs to be reverted ([comment](https://github.com/pytorch/pytorch/pull/130079#issuecomment-2221561483))
2024-07-10 21:40:16 +00:00
46c52661bc Use a better cherry-pick strategy for stable pytorch w/ distribute changes (#129987)
1. Update the branch name based on internal feedback
2. Only cherry-pick the changes to these folders
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129987
Approved by: https://github.com/seemethere
2024-07-10 20:55:36 +00:00
80a421a54d [TD] Pin numpy to 1.26.0 in indexer (#130442)
Temporarily pin 1.26.0 to get the workflow working while I go sort out which dependencies need to be updated

Succeeding run: https://github.com/pytorch/pytorch/actions/runs/9877733366/job/27280052419?pr=130442

Tested by adding my branch to the trust relationship for the policy and removing the environment
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130442
Approved by: https://github.com/atalman, https://github.com/malfet
2024-07-10 20:52:24 +00:00
cd2638be09 Revert "[pipelining] Refactor test_schedule to fix "-k" (#130294)"
This reverts commit 1352f13f7827cd1862a6e0507fb17dccddf73dc2.

Reverted https://github.com/pytorch/pytorch/pull/130294 on behalf of https://github.com/clee2000 due to broke lint https://github.com/pytorch/pytorch/actions/runs/9879591538/job/27286156803 ([comment](https://github.com/pytorch/pytorch/pull/130294#issuecomment-2221376073))
2024-07-10 20:26:58 +00:00
b81767161e Revert "[aota] Needs autograd if an input requires_grad, agnostic to enable_grad (#128890)"
This reverts commit 08d5423d339ac4b302f8ae6b63b334e032104753.

Reverted https://github.com/pytorch/pytorch/pull/128890 on behalf of https://github.com/clee2000 due to broke inductor/test_flex_attention https://github.com/pytorch/pytorch/actions/runs/9879109008/job/27286339304 08d5423d33 test was not run on PR due to bad TD ([comment](https://github.com/pytorch/pytorch/pull/128890#issuecomment-2221368245))
2024-07-10 20:22:24 +00:00
1b3b4c2fb9 [runtime asserts] deduplicate runtime asserts & CSE (#128599) (#130380)
original PR: https://github.com/pytorch/pytorch/pull/128599 (re-created after revert + poisoned diff train)

Summary:
This PR adds deduplication and CSE for runtime asserts. Existing size computation in the graph is CSE'd along with added runtime asserts, and redundant asserts are removed. Shape calls on intermediate tensors are also turned into compute on input sizes if possible, allowing intermediate tensors to be freed earlier. For example:
```
z = torch.cat([x, x], dim=0)  # 2*s0
w = z.repeat(y.shape[0])  # 2*s0*s1
_w = w.shape[0]

s0 = x.shape[0]
s1 = y.shape[0]
_w0 = 2 * s0
_w = _w0 * s1
```

Additionally, constrain_range calls are deduplicated. Single-symbol bound checks for unbacked symbols (e.g. u0 >= 0, u0 <= 5) and sym_constrain_range.default calls are also removed, since they accumulate range info in the ShapeEnv, and are replaced with two _assert_scalar.default calls that check the min/max bounds. For example:
```
torch.sym_constrain_range_for_size(n, min=2, max=16)
torch.sym_constrain_range(n, min=4, max=20)
torch._check(n >= 0)
torch._check(n >= 3)
torch._check(n <= 14)

torch.sym_constrain_range_for_size(n)
torch._check(n >= 4)
torch._check(n <= 14)
```

Test Plan:
contbuild & OSS CI, see 940e4477ab

Original Phabricator Test Plan:
Imported from GitHub, without a `Test Plan:` line.

Differential Revision: D59543603

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130380
Approved by: https://github.com/izaitsevfb
2024-07-10 19:23:37 +00:00
1352f13f78 [pipelining] Refactor test_schedule to fix "-k" (#130294)
This is kind of a short-sighted workaround and we should actually come
up with a way to fix this in general, but I got annoyed that I can't use
-k to filter tests in test_schedule, and realized it's because we jam
tests using the new MultiProcContinuousTest fixture together with
old-style tests.

For now I separate the two types of tests so -k works again.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130294
Approved by: https://github.com/H-Huang
2024-07-10 18:32:51 +00:00
cf090e222e Update torch-xpu-ops pin (ATen XPU implementation) (#130333)
1. Fix a compilation error due to a PyTorch update: the prototype of the helper function `checkIndexTensorTypes` changed.
2. Fix a compilation error due to a PyTorch update: PyTorch now forces -Werror=unused-function.
3. Fix an inductor case failure due to a CUDA-biased implementation in the test case. https://github.com/pytorch/pytorch/issues/130426

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130333
Approved by: https://github.com/EikanWang, https://github.com/atalman
2024-07-10 18:10:53 +00:00
4b7ee51260 [BE][MPS] Cleanup optimizers code (#130453)
- Fix C++20 forward compatibility warnings, namely
```
warning: use of function template name with no prior declaration in function call with explicit template arguments is a C++20 extension [-Wc++20-extensions]
  multi_tensor_apply_for_fused_optimizer<2, 512>(kernel_name,
```
- Use nested namespaces
- Do not explicitly specify `at::` namespace for functions already implemented inside of that namespace
- Use more convenience methods (rather than call by hand)
- Use C++14 `return f();` for void functions

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130453
Approved by: https://github.com/Skylion007
2024-07-10 18:00:05 +00:00
08d5423d33 [aota] Needs autograd if an input requires_grad, agnostic to enable_grad (#128890)
Reland of:  https://github.com/pytorch/pytorch/pull/128016

Summary from previous PR:
We assume only two possible mutually exclusive scenarios:

1. Running the compiled region for training (any of the inputs has requires_grad):
   produced differentiable outputs should have requires_grad.
2. Running the compiled region for inference (none of the inputs has requires_grad):
   all outputs do not have requires_grad.
Even if the user runs the region under no_grad(), if an input Tensor has requires_grad we go with the training scenario (1).

With the current state that means:
1/ needs_autograd should not check torch.is_grad_enabled(), only whether any of the inputs requires grad
2/ if needs_autograd => trace_joint (we are in training scenario 1) => always run the compiled region under torch.enable_grad()

Changes in the partitioner?

Inference and training graphs differed in their return container (list vs. tuple).
The partitioner changes unify this so that a tuple is always returned.
As a result, there are some changes in test_aotdispatch.py for graph contents (list -> tuple).

Why was it reverted?

There was a regression of the hf_Reformer model on inference.
```
TORCHINDUCTOR_FX_GRAPH_CACHE=0 python benchmarks/dynamo/torchbench.py --performance --inference --bfloat16 --backend inductor --device cuda --only hf_Reformer --cold-start-latency --use-eval-mode
```

Because one of the compiled graphs contained outputs that are aliases of inputs which are nn.Parameter(requires_grad=True).

Even though the inference torchbench benchmarks run inside `torch.no_grad()`, alias ops (specifically expand, for hf_Reformer) preserve requires_grad.

As a result we started compiling a training graph instead of an inference graph.

Fix for view ops:

If we have outputs that are aliases of inputs that require grad, those outputs requiring grad is not a reason to generate a training graph.

This is handled in aot_autograd.py, where output_and_mutation_safe is calculated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128890
Approved by: https://github.com/bdhirsh
2024-07-10 17:56:32 +00:00
0beeac35fa Revert "[cond] inlining into one of the branches when pred is a python constant (#128709)"
This reverts commit fe3e6878c4bb2a6001045c179fd7fa9838242558.

Reverted https://github.com/pytorch/pytorch/pull/128709 on behalf of https://github.com/ydwu4 due to causing error on truck due to a land racing: fe3e6878c4 ([comment](https://github.com/pytorch/pytorch/pull/128709#issuecomment-2221104043))
2024-07-10 17:47:19 +00:00
b4b7477d3f Fix CPU Annotation Overlapping with Python Events (#129599)
Summary:
Currently we have an issue where CPU User annotations can overlap with python events in the event that a python event calls step() within the function itself. To combat this, we can move the left side of the user annotation to the beginning of the parent python function. We do this because when instantiating the profiler we already start on step 0.
To implement this, we start by collecting all instances of ProfilerStep during post processing. Since TorchOps and Python events are sorted already, we can easily check if the current python event partially overlaps with the current ProfilerStep and, if so, alter the start time of the current ProfilerStep. We then move to the next ProfilerStep and continue iterating through all the python events. This keeps the time complexity of adding events to 'out' at O(s + n) -> O(n) post sorting, where "s" is the number of ProfilerSteps and "n" is the length of all events.
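
A minimal repro sketch of the scenario (step() invoked inside the profiled function):

```python
import torch
from torch.profiler import ProfilerActivity, profile

def train_step(prof, x):
    y = x @ x
    prof.step()            # step() called mid-function: the overlap case described above
    return y.relu()

x = torch.randn(128, 128)
with profile(activities=[ProfilerActivity.CPU]) as prof:
    for _ in range(3):
        train_step(prof, x)
```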

Test Plan:
Added unit test in which step() is called midway through a function. Afterwards, we print out a trace and then load the json to check that there are no overlaps. Also make sure that there is no regression in performance.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129599
Approved by: https://github.com/aaronenyeshi
2024-07-10 17:33:56 +00:00
6b3460ae0d fix discrepancy from the export of #126601 (#130296)
#126601 (internally [D58103182](https://www.internalfb.com/diff/D58103182)) was exported missing one class definition. This PR brings github repo in sync with fbcode.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130296
Approved by: https://github.com/kit1980, https://github.com/seemethere, https://github.com/malfet
2024-07-10 17:26:44 +00:00
7d4cb21098 Decompose expand_copy and permute_copy (#129476)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129476
Approved by: https://github.com/amjames, https://github.com/lezcano
2024-07-10 17:12:01 +00:00
a7aa066b09 Fix link to dynamo in torch/fx readme (#130233)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130233
Approved by: https://github.com/janeyx99
2024-07-10 17:00:49 +00:00
a09910d3a9 add strobelight profile links to tlparse (#129703)
Summary: title.

Test Plan:
TORCH_TRACE=~/my_trace_log_dir buck2 run  @//mode/inplace  @//mode/opt  //caffe2/fb/strobelight:compile_time_profiler_example

tlparse ~/my_trace_log_dir

result
https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpBrQJcL/index.html
 {F1726980413}

Differential Revision: D59130581

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129703
Approved by: https://github.com/aorenste
2024-07-10 16:53:21 +00:00
fe3e6878c4 [cond] inlining into one of the branches when pred is a python constant (#128709)
When the input predicate is a python constant, we now specialize into one of the branches and warn users that torch.cond is not preserving the dynamism. The previous behavior was to bake True/False into the cond operator, which could be confusing. In this PR, we change it to specialize into one of the branches when the predicate is a constant.

We additionally change the naming of the cond operator to the default one without overriding its name. This allows better testing on the de-serialized graph.
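
A small sketch of the constant-predicate case (names are illustrative):

```python
import torch

def true_fn(x):
    return x.sin()

def false_fn(x):
    return x.cos()

@torch.compile
def f(x):
    pred = x.shape[0] == 4  # static shape -> plain Python bool
    # with a constant predicate, cond now specializes into one branch (and warns)
    return torch.cond(pred, true_fn, false_fn, (x,))

print(f(torch.randn(4, 3)))
```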

Test Plan:
The predicate in some existing tests is the result of a shape comparison. When no dynamic shape is involved, the predicate is a python bool. To fix them, we either change the predicate to be some data-dependent tensor or change the test to check that cond is specialized as one of the branches.

Differential Revision: [D59589709](https://our.internmc.facebook.com/intern/diff/D59589709)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128709
Approved by: https://github.com/zou3519
2024-07-10 16:44:27 +00:00
9d94b122f0 Fix usage of USE_ROCM when calling cudaFuncGetAttributes (#130441)
This fixes MSVC build regression introduced by https://github.com/pytorch/pytorch/pull/129710 as VC++ fails to unroll nested defines in the specific order and fails with
```
C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cuda\int4mm.cu(984): error: "#" not expected here
    do { const cudaError_t __err = cudaFuncGetAttributes( &funcAttr, #if defined(USE_ROCM) (void *)func #else func #endif ); c10::cuda::c10_cuda_check_implementation( static_cast<int32_t>(__err), "C:\\actions-runner\\_work\\pytorch\\pytorch\\builder\\windows\\pytorch\\aten\\src\\ATen\\native\\cuda\\int4mm.cu", __func__, static_cast<uint32_t>(991), true); } while (0);
```

Fixes https://github.com/pytorch/pytorch/issues/130437

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130441
Approved by: https://github.com/Skylion007, https://github.com/malfet
2024-07-10 16:30:43 +00:00
ae73489b7d [codemod] Use C++17 [[fallthrough]] in 1 file inc caffe2/aten/src/ATen/native/cuda/DistributionTemplates.h (#130433)
Test Plan: Sandcastle

Reviewed By: meyering

Differential Revision: D59528276

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130433
Approved by: https://github.com/malfet
2024-07-10 16:30:37 +00:00
bef085bdfa [reland][custom ops] infer schema (#130079)
Fixes #129617

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130079
Approved by: https://github.com/zou3519
2024-07-10 16:18:36 +00:00
ce4d95143f Add scale kwarg to FlexAttention (and some changes that get FlexAttention numerics to be as accurate as FA2) (#130250)
After this PR, our numerical error is within 3% of FA2 for the forward and gradients. Previously, for `dq` our numerical error was 30% higher. I also added a `PRESCALE_QK` kernel option that increases perf by about 3-4% but incurs about 20-30% more numerical error.

![image](https://github.com/pytorch/pytorch/assets/6355099/7b5ff44e-219b-4a05-8a1b-2a0182c01ab2)
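
A hedged usage sketch of the new `scale` kwarg (the import path and defaults may differ across versions):

```python
import torch
from torch.nn.attention.flex_attention import flex_attention  # path may vary by version

q, k, v = (torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.float16) for _ in range(3))

def score_mod(score, b, h, q_idx, kv_idx):
    return score  # identity; plug bias/masking logic in here

# `scale` overrides the default 1/sqrt(head_dim) softmax scaling
out = flex_attention(q, k, v, score_mod=score_mod, scale=0.125)
```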

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130250
Approved by: https://github.com/drisspg
ghstack dependencies: #130227
2024-07-10 16:14:45 +00:00
a7715e36de Add block mask utility support for batches and heads > 1 (#130227)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130227
Approved by: https://github.com/yanboliang
2024-07-10 16:14:45 +00:00
c83b941141 [export] add dynamic shapes argument and infer from graph nodes (#129928)
Fixes the example in #118304 for `torch._functorch.aot_autograd.aot_export_module` and `torch.export.export`.

On a high level, the issue is caused by not detecting fake_mode when there's no input.

Change plan:

1) we add a  `dynamic_shapes: Union[bool, None] = None` arg to `aot_export_module` and `_aot_export_function`.

2) if the input is not a graph module, then we can only rely on this `dynamic_shapes` input arg.

3) If the input is a graph module, then we can traverse the graph and check.

4) So we check if the input mod is a graph module or just a module, and do 2) or 3) depending on the type.

Fixes #129927

Bug source: dynamo's fake_mode is not detected correctly in `_convert_input_to_fake` in `_traced.py` when there's no input to the graph. So in `_strict_export_lower_to_aten_ir`, we create another fake_mode; this `dynamo_fake_mode` is not the same as the fake_mode used by dynamo.

Change plan:
check `gm_torch_level` graph's node meta "example_value" for fake mode in addition.
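
A hedged sketch of the new argument (private API; the exact signature is inferred from the description above):

```python
import torch
from torch._functorch.aot_autograd import aot_export_module

class NoInput(torch.nn.Module):
    def forward(self):
        return torch.ones(3) + 1  # no inputs, so fake_mode cannot be inferred from args

# with no inputs and no graph to inspect, the explicit flag decides dynamism
gm, signature = aot_export_module(NoInput(), [], trace_joint=False, dynamic_shapes=False)
print(gm)
```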

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129928
Approved by: https://github.com/angelayi
2024-07-10 15:51:05 +00:00
cyy
d31f866b33 [BE] [CMake] Remove AT_CORE_STATIC_WINDOWS option (#130409)
AT_CORE_STATIC_WINDOWS was inherited from torch and is not used anymore.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130409
Approved by: https://github.com/malfet
2024-07-10 15:50:47 +00:00
81ea298600 Wrap the test func with try/except to always call destroy_process_group (#124961)
This avoids the PG warning about not calling destroy_process_group.
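
A minimal sketch of the wrapping pattern (illustrative, not the actual test-harness code):

```python
import torch.distributed as dist

def run_with_cleanup(test_fn, *args, **kwargs):
    try:
        return test_fn(*args, **kwargs)
    finally:
        if dist.is_initialized():
            dist.destroy_process_group()  # always clean up, silencing the warning
```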

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124961
Approved by: https://github.com/wanchaol, https://github.com/wz337
2024-07-10 15:36:38 +00:00
81df076bfd Fix Apple crash when running PyTorch with Metal API validation turned on (#130377)
Fixes #130376 (at least, for my usage)

There may be other places in the code base where `-setBytes:length:` is called with a length of 0 besides this, but this is the case that has triggered for me. Please let me know if there are any specific tests I should run.
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130377
Approved by: https://github.com/malfet
2024-07-10 15:07:47 +00:00
417c83e7cf [ROCm] Unskip scaled_dot_product_attention tests on ROCm (#127966)
The needle has moved quite a bit on the ROCm backend front. This PR is intended to examine the tests referenced in the following issue: https://github.com/pytorch/pytorch/issues/96560

This is a follow-up PR to https://github.com/pytorch/pytorch/pull/125069

This unskips the next batch of tests referenced by the aforementioned issue. No explicit source changes were needed, as the tests worked immediately after unskipping.

The tests previously marked with xfail have been modified to not expect a failure only when running on ROCm, as they now pass there. Behavior is unchanged for them on other architectures.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127966
Approved by: https://github.com/malfet
2024-07-10 14:53:41 +00:00
b38de2f9e2 [decomps] Fix aten._to_copy decomp (#130381)
`aten._to_copy` can receive a python number as input. This occurs in
torch.compile support for vmap (see #130188). Previously, this would
raise an assertion error. This PR changes it so that if we see a python
number, we call torch.scalar_tensor on it first (h/t @bdhirsh).
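
A minimal sketch of the fix's idea (not the literal decomposition code):

```python
import torch

def to_copy_like(x, dtype=None, device=None):
    if not isinstance(x, torch.Tensor):
        x = torch.scalar_tensor(x)  # python int/float -> 0-d tensor
    return x.to(dtype=dtype, device=device, copy=True)

print(to_copy_like(3, dtype=torch.float32))  # tensor(3.)
```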

Fixes #130362

Fixes #130188

Test Plan:
- new tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130381
Approved by: https://github.com/Chillee
2024-07-10 14:34:28 +00:00
cyy
bd3452f431 [5/N] Change #include <c10/util/Optional.h> to #include <optional> (#130408)
Follows  #130329

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130408
Approved by: https://github.com/malfet
2024-07-10 14:29:43 +00:00
99967e1119 [MPS][TYPE_PROMOTION] Fix Clamp (#130226)
Summary:
1. Fixed #130201 by adding type promotion.
2. Added proper tests.
3. Found torch's type promotion is different from numpy as follows:

```python
import torch
import numpy as np
np.clip(np.array([1], dtype=np.float32), np.array([1], dtype=np.int32), None).dtype  # dtype('float64')
torch.clamp(torch.tensor([1], dtype=torch.float32), torch.tensor([1], dtype=torch.int32)).dtype  # torch.float32
```

~Not sure of the proper way to handle it; it causes numpy ref tests to fail.~
Reason here, so I think I'm going to xfail it:
3c1cf03fde/test/test_ops.py (L260-L264)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130226
Approved by: https://github.com/malfet
2024-07-10 14:27:39 +00:00
6ce0bd7d3b [HOP] Use user directed names for variables where possible (#130271)
Afaict the previous check was too strict. Removing it passes all the
mutation tests (mutation checks happen via the TensorVariable's mutable_local).

Test Plan:
- tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130271
Approved by: https://github.com/Chillee, https://github.com/ydwu4
2024-07-10 13:59:20 +00:00
637cc8d27f Revert "update the input weight of _convert_weight_to_int4pack to [n][k / 2] uint8 (#129940)"
This reverts commit 6367f02a0e136ced05c665301bcdaa4d76690457.

Reverted https://github.com/pytorch/pytorch/pull/129940 on behalf of https://github.com/albanD due to Broke rocm tests on main 6367f02a0e ([comment](https://github.com/pytorch/pytorch/pull/129940#issuecomment-2220554681))
2024-07-10 13:48:32 +00:00
a1590e16df Add single Python 3.10, single Cuda 12.1 build with dependencies included (#130349)
Build a large wheel for Python 3.10 / CUDA 12.1 that will be used in Colab. Build name: ``manywheel-py3_11-cuda12_1-full-build``

We still have all code to support the full build in builder repo, here:
https://github.com/pytorch/builder/blob/main/manywheel/build_cuda.sh#L151

Test:
```
import sys
import torch

print(torch.__version__)   # 2.3.0+cu121
print(sys.version_info)    # sys.version_info(major=3, minor=10, micro=12, releaselevel='final', serial=0)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130349
Approved by: https://github.com/malfet
2024-07-10 12:57:39 +00:00
cb2bce98de [MPS][BE] Reduce the number of parameters encoded for no momentum fused SGD (#130131)
Summary:

1. Reduce the number of parameters encoded for no momentum fused SGD
2. Use convenience functions `mtl_setBuffer` and `mtl_setBytes`.

Just a BE, no significant performance difference is observed.

Test plan: Relying on CI signals
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130131
Approved by: https://github.com/janeyx99, https://github.com/malfet
2024-07-10 07:58:38 +00:00
6367f02a0e update the input weight of _convert_weight_to_int4pack to [n][k / 2] uint8 (#129940)
This PR updates the input `weight` of `_convert_weight_to_int4pack` from `[n][k] int32` to `[n][k / 2] uint8` for CPU, CUDA, and MPS, which helps decouple int4 model checkpoints from specific ISAs and platforms in `gpt-fast`. The advantage is that an int4 model checkpoint can be shared across different test machines without re-generating it on one particular platform. Meanwhile, the size of the input `weight` is reduced to `1 / 8`.

Before this PR, the packed weight was stored in a CUDA-specific layout: `[n/8][k/(InnerKTiles*16)][32][InnerKTiles/2]`, dtype int32, where InnerKTiles = 2, 4, 8. The CPU packed weight was viewed as the SAME shape but stored in a different layout: `[n/64][k][32]`, dtype uint8. The weight was strongly coupled to platforms (CPU/CUDA) and ISAs (AVX512/AVX2/scalar), and users could not reuse a generated weight on a different ISA or platform, because the compute format differs when the weight is loaded onto a device.
![image](https://github.com/pytorch/pytorch/assets/61222868/64971c4b-29b9-42cf-9aeb-ffa01cea93dd)

Now we use a common serialized layout (`[n][k/2] uint8`) across devices and ISAs as the input `weight` of `_convert_weight_to_int4pack`, and each backend chooses how to interpret it as its compute layout.
![image](https://github.com/pytorch/pytorch/assets/61222868/c7990761-c723-417b-aca2-7c60db7785c7)

### Performance
Intel (R) Xeon (R) CPU Max 9480, single socket (56 cores)
There is no obvious regression from this PR.
![image](https://github.com/pytorch/pytorch/assets/61222868/6046dcf4-920b-4c63-9ca3-1c8c3cafebde)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129940
Approved by: https://github.com/jgong5, https://github.com/lezcano, https://github.com/mingfeima
2024-07-10 07:38:42 +00:00
e29657efb6 [Inductor][CPP] Fix typo in merge rules (#130405)
**Summary**
There is a typo in the `CPU Inductor` group in `merge_rules.yaml`: it should be `test/inductor/test_cpu_repro.py` instead of `test/inductor/test_cpu_repo.py`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130405
Approved by: https://github.com/jgong5, https://github.com/lezcano
2024-07-10 07:13:03 +00:00
cyy
10c7f037fe Simplify c10::string_view (#130009)
Make c10::basic_string_view a subclass of std::basic_string_view for easier replacement in the future.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130009
Approved by: https://github.com/ezyang
2024-07-10 05:02:16 +00:00
a17d1e5322 Fix static py::object dangling pointer with py::gil_safe_call_once_and_store (#130341)
Fix static `py::object`s with `py::gil_safe_call_once_and_store`.

The following code will leak a `py::object`, whose destructor runs when the program shuts down. The destructor will call `Py_DECREF(obj.m_ptr)`, which may raise a segmentation fault.

```c++
void func() {
    static py::object obj = py::module_::import("foo").attr("bar");

    ...
}
```

The correct approach is to use a raw pointer rather than a static instance.

```c++
void func() {
    static py::object* obj_ptr = new py::object{py::module_::import("foo").attr("bar")};
    py::object obj = *obj_ptr;

    ...
}
```

This PR uses the `py::gil_safe_call_once_and_store` function from `pybind11`, which can run arbitrary initialization code only once under the Python GIL thread safely.

```c++
void func() {
    PYBIND11_CONSTINIT static py::gil_safe_call_once_and_store<py::object> storage;
    py::object obj = storage
                         .call_once_and_store_result(
                             []() -> py::object {
                                 return py::module_::import("foo").attr("bar");
                             }
                         )
                         .get_stored();

    ...
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130341
Approved by: https://github.com/ezyang
2024-07-10 04:23:37 +00:00
5abe7ebd41 Add new (private) capture_triton API (#130178)
When applied to a triton kernel, capture_triton allows the triton kernel
to be captured when tracing with make_fx. It does this by transforming the
call to the triton kernel into a call to the
triton_kernel_wrapper_mutation HOP, which can actually be traced into a
graph via make_fx.

We have two main use cases for this:
- non-strict export doesn't use Dynamo, but people want to use
  non-strict export to export programs with triton kernels.
  non-strict export uses make_fx tracing, so this is a necessary step in
  that direction.
- People want to write inductor passes that replace a sequence of
  operators with a call to a function that may contain a triton kernel.
  The way these passes work today is that we have a FX graph and want to
  replace a subgraph of it with a new subgraph. We obtain said subgraph
  from calling make_fx on the function; this won't work on raw triton
  kernels but will work if one uses capture_triton.
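
A rough usage sketch of the intent described above (requires triton and a CUDA device; the import path of the private API and its exact call shape are assumptions based on this description, not verbatim from the PR):

```python
import torch
import triton
import triton.language as tl
from torch.fx.experimental.proxy_tensor import make_fx
# Assumed location of the private API; it may differ between versions.
from torch._library import capture_triton

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    tl.store(out_ptr + offs,
             tl.load(x_ptr + offs, mask=mask) + tl.load(y_ptr + offs, mask=mask),
             mask=mask)

def add(x, y):
    out = torch.empty_like(x)
    grid = (triton.cdiv(x.numel(), 1024),)
    # Wrapping the kernel turns the launch into a triton_kernel_wrapper_mutation HOP call.
    capture_triton(add_kernel)[grid](x, y, out, x.numel(), BLOCK=1024)
    return out

if torch.cuda.is_available():
    x = torch.randn(8, device="cuda")
    make_fx(add)(x, x).print_readable()
```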

Test Plan:
- I wrote some manual tests to run make_fx over two of the triton
  kernels in test_triton_kernels. It would be nice to be able to run
  make_fx through all of the tests in the file but I'm not sure how to
  do that refactor right now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130178
Approved by: https://github.com/oulgen
ghstack dependencies: #130177
2024-07-10 03:09:29 +00:00
99c68f7bea Refactor TritonKernelVariable's logic so it can be shared (#130177)
TritonKernelVariable's logic tells us how to go from a user-defined
triton kernel and a grid to a call to the triton_kernel_wrapper_mutation
HOP. We want to re-use this in a setting without Dynamo; in the next PR
up, we create a new decorator (capture_triton) that, when applied to a
triton kernel, transforms a call to the triton kernel into a call
to the triton_kernel_wrapper_mutation HOP.

Test Plan:
- existing tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130177
Approved by: https://github.com/oulgen, https://github.com/ydwu4
2024-07-10 03:09:29 +00:00
868d9a4f12 [cpu][flash attention] fix nan issue (#130014)
Fixes #127055.

NaNs are generated in flash attention because of the `std::exp((-inf) - (-inf))` and `+/-inf * 0` computations in the lazy softmax. We fix the issue by avoiding these calculations.
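
A minimal illustration of the two NaN sources mentioned above:

```python
import math

neg_inf = float("-inf")
print(neg_inf - neg_inf)            # nan: the (-inf) - (-inf) term in the lazy softmax
print(math.exp(neg_inf - neg_inf))  # nan propagates through exp
print(math.inf * 0.0)               # nan: the +/-inf * 0 term
# Skipping these expressions for fully-masked rows (so their contribution stays 0)
# avoids the NaNs, which is the spirit of the fix.
```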

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130014
Approved by: https://github.com/jgong5, https://github.com/drisspg
2024-07-10 02:33:26 +00:00
68751799b8 Add decompositions for copy variants of view ops (#128416)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128416
Approved by: https://github.com/amjames, https://github.com/lezcano
2024-07-10 01:39:09 +00:00
cyy
007e75958f [4/N] Change #include <c10/util/Optional.h> to #include <optional> (#130329)
Follows #130300

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130329
Approved by: https://github.com/ezyang
2024-07-10 01:26:50 +00:00
9912209743 check if the input fx graph of aot_compile return tuple (#129824)
Fixes https://github.com/pytorch/pytorch/issues/129719

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129824
Approved by: https://github.com/angelayi, https://github.com/yushangdi
2024-07-10 01:18:55 +00:00
cyy
85b8503621 [Caffe2] Remove Caffe2 documentation (#130089)
Due to the removal of Caffe2 code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130089
Approved by: https://github.com/r-barnes, https://github.com/albanD
2024-07-10 00:52:16 +00:00
cyy
7a3ab1fe79 [structural binding][7/N] Replace std::tie with structural binding (#130216)
Follows #120353

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130216
Approved by: https://github.com/albanD
2024-07-10 00:52:04 +00:00
fb696bf264 Revert "Add block mask utility support for batches and heads > 1 (#130227)"
This reverts commit 64139987c0588f2eef198a0b9fd6904783b37b2c.

Reverted https://github.com/pytorch/pytorch/pull/130227 on behalf of https://github.com/izaitsevfb due to breaks internal builds, please see D59498662 ([comment](https://github.com/pytorch/pytorch/pull/130227#issuecomment-2218842579))
2024-07-09 22:34:39 +00:00
44815ed67e Revert "Add scale kwarg to FlexAttention (and some changes that get FlexAttention numerics to be as accurate as FA2) (#130250)"
This reverts commit 3e48d927332915e1ecbd3c7f2c6b9680428f181e.

Reverted https://github.com/pytorch/pytorch/pull/130250 on behalf of https://github.com/izaitsevfb due to depends on #130227 which needs to be reverted ([comment](https://github.com/pytorch/pytorch/pull/130250#issuecomment-2218840674))
2024-07-09 22:32:54 +00:00
5b5a1f5202 Add on to Mark some test_decomp tests as slow on win #130260 (#130337)
An add on to https://github.com/pytorch/pytorch/pull/130260

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130337
Approved by: https://github.com/malfet
2024-07-09 22:30:53 +00:00
fd43a2ba27 Forward fix for test_compare_cpu_cuda_float32 (#130360)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130360
Approved by: https://github.com/malfet
ghstack dependencies: #128238
2024-07-09 22:28:39 +00:00
3be4922a9d Revert "[HOP] Use user directed names for variables where possible (#130271)"
This reverts commit adb65682affdfc37f724c02ea8c8930d3925fc07.

Reverted https://github.com/pytorch/pytorch/pull/130271 on behalf of https://github.com/clee2000 due to broke inductor/test_flex_attention https://github.com/pytorch/pytorch/actions/runs/9863205414/job/27236960046 adb65682af Test not run on PR due to bad TD ([comment](https://github.com/pytorch/pytorch/pull/130271#issuecomment-2218832643))
2024-07-09 22:24:39 +00:00
37d4d04309 [torchscript] Add logging for model id. (#130118)
Summary: as title.

Test Plan: CI

Reviewed By: angelayi

Differential Revision: D59348256

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130118
Approved by: https://github.com/BoyuanFeng
2024-07-09 22:24:16 +00:00
fb5cb17fbe [torch][fx] Add normalize_args constructor argument to FxGraphDrawer (#130348)
Summary:
When writing out Graphviz files for graphs, sometimes the arguments are all
in a row and it's unclear which is which. Like for `aten.conv2d`, someone might not
remember the stride, padding, dilation order.

Add an option `normalize_args` (defaults to False) to normalize all args into kwargs.
This should help the readability of a graph.
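
A hedged usage sketch (requires `pydot`/Graphviz; only the `normalize_args` argument is taken from this description, the rest is illustrative):

```python
import torch
from torch.fx import symbolic_trace
from torch.fx.passes.graph_drawer import FxGraphDrawer

class M(torch.nn.Module):
    def forward(self, x, w):
        # positional stride/padding/dilation are hard to tell apart in a drawing
        return torch.conv2d(x, w, None, 2, 1, 1)

gm = symbolic_trace(M())
drawer = FxGraphDrawer(gm, "conv_example", normalize_args=True)  # args rendered as kwargs
drawer.get_dot_graph().write_svg("conv_example.svg")
```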

Differential Revision: D59529417

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130348
Approved by: https://github.com/mcremon-meta
2024-07-09 22:16:54 +00:00
df83142131 [CCA][Memory Snapshot] Stop duplicating annotations to all device_traces (#130315)
Summary: This diff fixes a bug where record_annotations would save a TraceEntry to every device_trace. Instead, we should only save annotations to the device_trace of the device used by the thread calling the native allocator's recordAnnotation.

Test Plan: CI and ran workloads on MVAI WPR FBR.

Reviewed By: zdevito

Differential Revision: D59477339

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130315
Approved by: https://github.com/zdevito
2024-07-09 21:38:47 +00:00
bb9a73f767 [custom_ops] expose torch.library.register_torch_dispatch (#130261)
This is the API for defining the interaction between a torch_dispatch
class and a custom op. Taking API bikeshedding.

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130261
Approved by: https://github.com/albanD
ghstack dependencies: #130064
2024-07-09 21:11:27 +00:00
c23d103afa Add API for open registration between operators and subclasses (and modes) (#130064)
We add torch.library.Library._register_torch_dispatch_rule. Here, a user
can provide us a specific rule to run for a specific
(torch_dispatch_class, operator) pair. The motivation is that a user
might want to extend a subclass/mode but may not have access to the
source code of the subclass/mode.

I'll make this public in a follow-up PR if we think the approach and API
is good.

Keep in mind that many subclasses will likely deliver their own open
registration solution (DTensor has register_sharding_prop_rule and NJT
has register_jagged_op); _register_torch_dispatch_rule is meant as a
catch-all open registration mechanism for when the subclass hasn't
provided anything more specific.

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130064
Approved by: https://github.com/albanD
2024-07-09 21:11:27 +00:00
9c9744c3ac Revert "[runtime asserts] deduplicate runtime asserts & CSE (#128599)"
This reverts commit 940e4477ab0b81eea25051447cf5f599080c903f.

Reverted https://github.com/pytorch/pytorch/pull/128599 on behalf of https://github.com/izaitsevfb due to breaking internal APS tests, see D59498864 ([comment](https://github.com/pytorch/pytorch/pull/128599#issuecomment-2218724762))
2024-07-09 21:03:49 +00:00
f85bda8bdd c10d/Handlers: expose running handlers from Python (#130149)
This adds a `_run_handler` method that will invoke a specific handler.

Test plan:

```
python test/distributed/elastic/test_control_plane.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130149
Approved by: https://github.com/kurman, https://github.com/c-p-i-o
2024-07-09 20:20:59 +00:00
1d93367cfa Fix typo (#130305)
Fixes #130241

This is a reopened PR of #130244, to possibly fix the failed job
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130305
Approved by: https://github.com/Skylion007
2024-07-09 20:02:00 +00:00
721a798886 add bits16 to graph dtype_abbrs (#130339)
As title, patch the dtype in torch.fx.graph
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130339
Approved by: https://github.com/angelayi
2024-07-09 19:58:51 +00:00
42f647219a [ROCm] Add int4 support (#129710)
- Add AMD support for int4 kernel
  - Only supports CDNA2 and CDNA3 gpus for now
  - Uses `mfma_f32_16x16x16bf16` instruction for matrix multiply
  - Uses `v_and_or_b32` instruction and `__hfma2` intrinsic for unpacking bf16 values
  - Enable hipify for `__nv_bfloat16` and `__nv_bfloat162` data types
- Enable int4 unit tests for CDNA2 and CDNA3 AMD gpus
- Fix torchscript issues due to hipify for `__nv_bfloat16` type
  - TorchScript has its own implementation for bfloat16 type
    - Implemented in the `__nv_bfloat16` structure at [resource_strings.h](https://github.com/pytorch/pytorch/blob/main/torch/csrc/jit/codegen/fuser/cuda/resource_strings.h)
    - So, we shouldn't hipify any reference of `__nv_bfloat16` in the torchscript implementation
    - Hence moved the `__nv_bfloat16` direct references in `codegen.cpp` and `cuda_codegen.cpp` to `resource_strings.h` which is already exempted from hipify

Fixes #124699
Fixes pytorch-labs/gpt-fast/issues/154

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129710
Approved by: https://github.com/malfet
2024-07-09 19:49:12 +00:00
adb65682af [HOP] Use user directed names for variables where possible (#130271)
Afaict the previous check was too strict. Removing it passes all the
mutation tests (mutation checks happen via the TensorVariable's mutable_local).

Test Plan:
- tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130271
Approved by: https://github.com/Chillee, https://github.com/ydwu4
ghstack dependencies: #130255, #130268
2024-07-09 19:42:52 +00:00
cyy
a6345d3477 [CMake] [3/N] Remove unused code (#130322)
Some functions used by Caffe2 were removed along with some outdated checks. Follows #130006.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130322
Approved by: https://github.com/r-barnes
2024-07-09 19:33:33 +00:00
3477ee38e4 fix the use of initial learning rate in the OneCycleLR example (#130306)
Fixes #127649
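
For context, a minimal sketch (values assumed, not the doc example itself) of the behavior the corrected example should reflect: the schedule's starting learning rate is `max_lr / div_factor`, not the `lr` passed to the optimizer.

```python
import torch

model = torch.nn.Linear(4, 4)
optimizer = torch.optim.SGD(model.parameters(), lr=0.5)  # this lr gets overridden
scheduler = torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr=0.1, total_steps=30)
# div_factor defaults to 25, so the starting lr is 0.1 / 25 = 0.004
print(optimizer.param_groups[0]["lr"])  # 0.004
```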

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130306
Approved by: https://github.com/janeyx99
2024-07-09 18:58:07 +00:00
3689471ea4 [inductor] Add FileCheck to flex attention epilogue test (#129343)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129343
Approved by: https://github.com/lezcano
2024-07-09 18:15:55 +00:00
c6cce976b2 Fix an issue where ENABLE_INTRA_NODE_COMM=1 + multiple process groups leads to failure (#130269)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130269
Approved by: https://github.com/Chillee
2024-07-09 17:42:09 +00:00
cb4bec311a Fix nodes has more than one output users after replace_set_grad_with_hop pass (#129716)
Summary: Previously, when we inlined subgraphs that don't have a different require_grad environment, we didn't clean up the nodes' users in the subgraph and directly used them to replace the output of the call_modules. This recorded dead dependencies in node.users. This PR fixes that.

Test Plan:
Added a new test.

Also see the torchrec tests:
Step 1:
buck run mode/dev-nosan //aimp/experimental/pt2:pt2_export -- --model-entity-id 934687114 --output /tmp/934687114.zip --use-torchrec-eager-mp --use-manifold

Step 2:
buck run mode/opt -c python.package_style=inplace -c fbcode.enable_gpu_sections=true aimp/cli:cli --  --platform=aps --template=disagg_gpu_aps_pt2 --pt2 --model-entity-id=934687114 non-request-only-tagging torchrec-shard-and-quantize gpu-disagg-split assign-device materialize-weights script-and-save

Differential Revision: D59132214

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129716
Approved by: https://github.com/angelayi
2024-07-09 17:04:03 +00:00
e4c51d22c5 [cuDNN] Cleanup < 8.5 #ifdefs (#130283)
We've said cuDNN 8.5 is the minimum supported version for a bit now

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130283
Approved by: https://github.com/Skylion007
2024-07-09 16:35:39 +00:00
cab90b0049 [custom ops] disable kernel temporarily (#130190)
Fixes #128621

Sometimes we want to disable the backend implementation for testing/benchmarking purposes.

For example:

```python
@custom_op("mylib::f", mutates_args=())
def f(x: Tensor) -> Tensor:
    return torch.zeros(1)

print(f(torch.randn(1))) # tensor([0.])

@f.register_kernel("cpu")
def _(x):
    return torch.ones(1)

print(f(torch.randn(1)))  # tensor([1.])

with f.set_kernel_enabled("cpu", enabled=False):
    print(f(torch.randn(1)))  # tensor([0.])
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130190
Approved by: https://github.com/williamwen42, https://github.com/zou3519
2024-07-09 16:13:50 +00:00
edf273edf4 Revert some PRs (#130303)
Summary:
Revert https://github.com/pytorch/pytorch/pull/129346 thru
https://github.com/pytorch/pytorch/pull/128893

For S430832

Test Plan: Tests

Differential Revision: D59503843

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130303
Approved by: https://github.com/bdhirsh
2024-07-09 14:46:00 +00:00
cyy
71efbf701d [3/N] Change #include <c10/util/Optional.h> to #include <optional> (#130300)
Follows #130236

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130300
Approved by: https://github.com/ezyang
2024-07-09 13:32:57 +00:00
a5f816df18 Add more dtypes to __cuda_array_interface__ (#129621)
`__cuda_array_interface__` was missing some unsigned integer dtypes as well as BF16.

numba doesn't support BF16 so I skip tests for that one.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129621
Approved by: https://github.com/lezcano
2024-07-09 10:47:19 +00:00
3e48d92733 Add scale kwarg to FlexAttention (and some changes that get FlexAttention numerics to be as accurate as FA2) (#130250)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130250
Approved by: https://github.com/drisspg
ghstack dependencies: #130160, #130106, #130224, #130227
2024-07-09 09:24:06 +00:00
eqy
86fb76e871 [SDPA] Clean up print in test/test_transformers.py (#130302)
Left this in #125343, oops...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130302
Approved by: https://github.com/awgu
2024-07-09 09:20:52 +00:00
953c6476bd [CMAKE] Look for Development.Module instead of Development (#129669)
Based on the [cmake issue](https://gitlab.kitware.com/cmake/cmake/-/issues/23716) and [manylinux issue](https://github.com/pypa/manylinux/issues/1347), when building a python module CMake should look for the `Development.Module` component rather than `Development`, which includes both `Development.Module` and `Development.Embed` and therefore expects the shared python library. After this PR and before #124613, pytorch could be built with a static libpython (e.g. in manylinux).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129669
Approved by: https://github.com/malfet
2024-07-09 09:16:43 +00:00
b139b5090f [pytorch] Name threads in thread pools for better debugging (#130270)
Threads inside the thread pools are not named, so they inherit the main process name or the name of the thread that created them. In our case, since we set `pt_main_thread` as the thread name when a thread does `import torch`, that name is inherited by all the threads in the pools it creates.

This PR names the threads in the pools I was able to find. There are other pools created, like OpenMP ones, and we need to follow up on those.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130270
Approved by: https://github.com/d4l3k, https://github.com/albanD
2024-07-09 08:03:47 +00:00
312652c325 [RFC] Add support for device extension autoloading (#127074)
Fixes #122468

- Load device extensions at the end of `torch/__init__.py`
- Enabled by default, or you can disable it with `TORCH_DEVICE_BACKEND_AUTOLOAD=0`

run test:

```python
python test/run_test.py -i test_autoload_enable
python test/run_test.py -i test_autoload_disable
```

doc:

https://docs-preview.pytorch.org/pytorch/pytorch/127074/miscellaneous_environment_variables.html
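
For backend authors, a packaging sketch of how an out-of-tree device extension can opt into autoloading (the `torch.backends` entry-point group and the `_autoload` function name are assumptions based on the autoload design, not verbatim from this PR):

```python
# setup.py of a hypothetical out-of-tree backend package
from setuptools import setup

setup(
    name="torch_foo",
    version="0.1",
    packages=["torch_foo"],
    entry_points={
        # torch/__init__.py discovers and calls this entry point at import time,
        # unless TORCH_DEVICE_BACKEND_AUTOLOAD=0 is set.
        "torch.backends": ["torch_foo = torch_foo:_autoload"],
    },
)
```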

co-author:  @jgong5 @bsochack @bkowalskiINTEL @jczaja @FFFrog @hipudding

Co-authored-by: albanD <desmaison.alban@gmail.com>
Co-authored-by: Jiong Gong <jiong.gong@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127074
Approved by: https://github.com/albanD, https://github.com/jgong5
2024-07-09 06:14:13 +00:00
6c4efd4e95 [Memory Snapshot][BE] Clean up record function callback scope (#130265)
Summary: We can directly set the scope to at::RecordScope::USER_SCOPE for the at::RecordFunctionCallback object, rather than performing a check inside of the callback.

Test Plan:
Ran locally, works fine.

https://www.internalfb.com/pytorch_memory_visualizer/mvai_gpu_traces/tree/gpu_snapshot/fire-aaronshi-20240704-1709-7a80b83b/0/rank-0_itrn-1503.Jul_04_17_24_02.3577.snapshot.pickle

Differential Revision: D59477046

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130265
Approved by: https://github.com/davidberard98
2024-07-09 05:23:48 +00:00
ded469cfbd [issue scrubbing] Fix imports in test_memory_planning.py to work with pytest (#130275)
Summary: I actually don't grok why this pattern works; I guess pytest expects a different import syntax for these relative imports?? But this pattern is used in many other tests here (notably `test_aot_inductor.py`), so it must be right ;)

Test Plan:
Ran both ways:
* `python test/inductor/test_memory_planning.py`
* `pytest test/inductor/test_memory_planning.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130275
Approved by: https://github.com/zou3519
2024-07-09 05:20:56 +00:00
e235db98c9 [Inductor] Add aot_mode UT to new cpp_builder. (#130105)
Changes:
1. Add `aot_mode` parameter to `validate_new_cpp_commands` UT.
2. Switch AotCodeCompiler vec isa command gen to new cpp_builder.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130105
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-07-09 04:08:35 +00:00
31df1d235e Support tensor stride (#129297)
Summary:
X-link: https://github.com/facebookresearch/param/pull/126

Support tensor stride for execution trace.

Test Plan: buck2 test mode/opt caffe2/test:test_profiler_cuda profiler.test_execution_trace.TestExecutionTrace

Differential Revision: D58900476

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129297
Approved by: https://github.com/sanrise, https://github.com/izaitsevfb
2024-07-09 03:55:46 +00:00
e836ee1955 Enhancements to recompiles logs (#130043)
----

- We now record on the CacheEntry which compile id populated it, so we can say why a specific frame was rejected
- Add a structured log for recompiles under the artifact name "recompile_reasons". As it stands, it's not terribly structured, but this was the easiest thing I could do to start
- Slightly reformat multi-reason printing; since we only report one guard failure, it seems better to have it as a single line

Example output:

```
V0703 10:34:13.273000 140345997743104 torch/_dynamo/guards.py:2590] [0/1] [__recompiles] Recompiling function f in /data/users/ezyang/a/pytorch/b.py:3
V0703 10:34:13.273000 140345997743104 torch/_dynamo/guards.py:2590] [0/1] [__recompiles]     triggered by the following guard failure(s):
V0703 10:34:13.273000 140345997743104 torch/_dynamo/guards.py:2590] [0/1] [__recompiles]     - 0/0: tensor 'L['x']' size mismatch at index 0. expected 4, actual 5
```
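
A minimal script that produces output of this shape when run with `TORCH_LOGS="recompiles"` (same size-4-then-5 scenario as above):

```python
import torch

@torch.compile
def f(x):
    return x * 2

f(torch.randn(4))
f(torch.randn(5))  # size mismatch on dim 0 triggers the recompile logged above
```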

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130043
Approved by: https://github.com/anijain2305
2024-07-09 03:40:56 +00:00
cyy
29861779ce [2/N] Change #include <c10/util/Optional.h> to #include <optional> (#130236)
Follows  #128301. The changes were made by grep and sed

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130236
Approved by: https://github.com/ezyang
2024-07-09 03:17:24 +00:00
d1e0653fad [fx][easy] print_readable should recursively apply options (#130268)
For example, print_readable(colored=True) should also print submodules
with colors.

Test Plan:
- tested locally
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130268
Approved by: https://github.com/Chillee
ghstack dependencies: #130255
2024-07-09 02:50:20 +00:00
f2c9f0c0db [HOP] improve naming for subgraph inputs (#130255)
Previously, subgraph input names were whatever the input proxies were,
which were confusing. This PR changes those names to be
whatever the names of the arguments the functions being
speculate_subgraph'ed are. This is best-effort: if we can't figure it
out then we go back to the previous strategy.

Test Plan:
- existing expecttests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130255
Approved by: https://github.com/ydwu4
2024-07-09 02:46:40 +00:00
abe81d5d05 Fix the rest of foreach flakers (#130277)
Reenable foreach tests on non-sm86 machines. I believe I've fixed the flakes that are caused when TORCH_SHOW_CPP_STACKTRACES=1, though I know @clee2000 had also just landed https://github.com/pytorch/pytorch/pull/129004 for the same effect.

Regardless, this makes the foreach tests more robust against future disruptions. The fix is similar in flavor to https://github.com/pytorch/pytorch/pull/129003

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130277
Approved by: https://github.com/soulitzer
2024-07-09 02:08:21 +00:00
d44c30e2f9 Revert "Add API for open registration between operators and subclasses (and modes) (#130064)"
This reverts commit 922d2737d5e0ad22ee1dcf91c48ab09d641de840.

Reverted https://github.com/pytorch/pytorch/pull/130064 on behalf of https://github.com/huydhn due to Sorry for reverting your change but test_profiler_tree is failing in trunk after this lands 922d2737d5, maybe a landrace ([comment](https://github.com/pytorch/pytorch/pull/130064#issuecomment-2216135497))
2024-07-09 01:48:38 +00:00
75fa10066d Mark some test_decomp tests as slow on win (#130260)
Auto slow test detection keeps marking and then unmarking these as slow, so permanently mark them as slow on Windows.

These tests take >500s on Windows.

This is part of the reason why test_decomp keeps failing on Windows (ex da66e50e6e)

The other part is something to do with reruns + thresholds that I am still investigating
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130260
Approved by: https://github.com/huydhn, https://github.com/malfet
2024-07-09 00:16:31 +00:00
7f08d3d9a0 [C10D] Fix corrupt log due to uint_8 printing as char (#130184)
Previously, jobs would log lines like this due to interpreting an int8 value as a signed char when streaming it out.

"ProcessGroupNCCL created ncclComm_ 0x94960120 on CUDA device: ^@"

We need a better solution for avoiding this systematically, but at least
for now fix the spot we know about.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130184
Approved by: https://github.com/eeggl, https://github.com/Skylion007
2024-07-08 23:37:50 +00:00
4c19623800 Change numeric_debug_handle to store per-node id (#129811)
Summary:
Previously we stored an edge id in numeric_debug_handle to support operator fusion and operator decomposition throughout the stack.
However, according to feedback from customers, people prefer the simpler per-node id; they are fine without the additional
support for numerical debugging of inputs and are willing to hack around it to achieve that.

This PR changes the structure of numeric_debug_handle to store unique_id for each node instead.

e.g.
graph:
```
node = op(input_node, weight_node)
```
Before:
```
node.meta[NUMERIC_DEBUG_HANDLE_KEY] = {input_node: id1, weight_node: id2, "output": id3}
```

After:
```
node.meta[NUMERIC_DEBUG_HANDLE_KEY] = id1
```

Test Plan:
python test/test_quantization.py -k TestGenerateNumericDebugHandle

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129811
Approved by: https://github.com/tarun292
2024-07-08 23:36:19 +00:00
a28bb3268d [Pipelining] Reorder _Action from F1_1 to 1F1 (#129786)
Also steers away from accessing _Action via positional unpacking since
that is error-prone

Co-authored-by: Howard Huang <howardhuang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129786
Approved by: https://github.com/H-Huang
2024-07-08 23:07:51 +00:00
60d9f3f7d9 Set the epoch timestamp when uploading data to dynamoDB (#130273)
This is to move away from Rockset's `_event_time` field, which we cannot use when reimporting the data
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130273
Approved by: https://github.com/clee2000
2024-07-08 22:58:32 +00:00
b4cc25f126 [custom_op]Fix self in mutation_args (#130179)
Fixes #124933

## Issue Summary
If users list `self` in mutates_args, an error occurs: `TypeError: AutoFunctionalized.__call__() got multiple values for argument 'self'`. For the following example, the schema for mutates_args is parsed as {"self": FakeTensor} at 6df963a2c8/torch/_higher_order_ops/auto_functionalize.py (L234)
At that line it is unwrapped as `self=FakeTensor`, which passes the wrong argument because `self` is the implicit first parameter of methods of a class, as in https://github.com/pytorch/pytorch/compare/main...findhao/fix-self-custom-ops#diff-9453b6b52a54783beec3dd1c60248620f61c3a524d404a188af17bbdf6be3d9eR292 .
```python
import torch

@torch.library.custom_op("mylib::foo", mutates_args={"self"})
def foo(self: torch.Tensor) -> None:
    self.sin_()

x = torch.randn(3)

@torch.compile(backend="inductor", fullgraph=True)
def f(x):
    foo(x)

f(x)
```
## Fix
This PR changes all related default `self` arguments to `self_`, following the existing convention in 6fc771d19b/torch/_ops.py (L667)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130179
Approved by: https://github.com/zou3519
2024-07-08 22:55:50 +00:00
17ca0d0edf Add linux manywheel python 3.13 binary workflows (#130030)
Test with passing linux manywheel workflows is here: https://github.com/pytorch/pytorch/pull/121979
Builder PR already merged: https://github.com/pytorch/builder/pull/1910

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130030
Approved by: https://github.com/albanD
2024-07-08 22:50:15 +00:00
00335a27b4 Accept min / max sequence length in nested_tensor_from_jagged() constructor (#130175)
This PR updates the public API for NJT construction `torch.nested.nested_tensor_from_jagged()` to accept values for min / max sequence length. It's useful to provide these ahead of time to avoid GPU -> CPU syncs from on-demand computation later on.

NB: The test changes are extensive because I reworked the existing `_validate_nt()` helper function used throughout our NJT construction tests to verify more (specifically: expected cached min / max seq len and contiguity).

API design question: should we additionally provide an option to compute these from `offsets` at construction time? I can think of three possible cases during construction:
1. Min / max seq len has already been obtained from *somewhere* (manual calculation, static values, etc.) and they should be used in the cache
2. Min / max seq len should be computed immediately at construction time for use in the cache (ideally, the caller wouldn't have to do this computation manually)
3. Min / max seq len are not needed at all (i.e. SDPA isn't ever called) and computation should be skipped
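
A usage sketch for case 1 (the kwarg names `min_seqlen` / `max_seqlen` are assumed from the API description above):

```python
import torch

values = torch.randn(10, 8)            # packed values: total length 10, feature dim 8
offsets = torch.tensor([0, 3, 7, 10])  # three sequences of lengths 3, 4, 3
nt = torch.nested.nested_tensor_from_jagged(
    values, offsets, min_seqlen=3, max_seqlen=4
)
# The cached bounds avoid a GPU -> CPU sync when e.g. SDPA later needs them.
```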
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130175
Approved by: https://github.com/davidberard98, https://github.com/soulitzer
2024-07-08 22:14:52 +00:00
922d2737d5 Add API for open registration between operators and subclasses (and modes) (#130064)
We add torch.library.Library._register_torch_dispatch_rule. Here, a user
can provide us a specific rule to run for a specific
(torch_dispatch_class, operator) pair. The motivation is that a user
might want to extend a subclass/mode but may not have access to the
source code of the subclass/mode.

I'll make this public in a follow-up PR if we think the approach and API
is good.

Keep in mind that many subclasses will likely deliver their own open
registration solution (DTensor has register_sharding_prop_rule and NJT
has register_jagged_op); _register_torch_dispatch_rule is meant as a
catch-all open registration mechanism for when the subclass hasn't
provided anything more specific.

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130064
Approved by: https://github.com/albanD
2024-07-08 22:13:05 +00:00
44a773c121 Revert "[custom ops] infer schema (#130079)"
This reverts commit 3fe324ffb612c8712f6af7639c1e7bcec5f3b4fd.

Reverted https://github.com/pytorch/pytorch/pull/130079 on behalf of https://github.com/huydhn due to The test_public_bindings failure looks legit 3fe324ffb6 ([comment](https://github.com/pytorch/pytorch/pull/130079#issuecomment-2215420957))
2024-07-08 22:02:29 +00:00
f9bb258892 Revert "[Inductor] Add aot_mode UT to new cpp_builder. (#130105)"
This reverts commit 21eeedb4554edab22b42bcb2f75f19e85652b72e.

Reverted https://github.com/pytorch/pytorch/pull/130105 on behalf of https://github.com/izaitsevfb due to Breaks 46 tests internally at meta with: OSError: CUDA_HOME environment variable is not set ([comment](https://github.com/pytorch/pytorch/pull/130105#issuecomment-2215392198))
2024-07-08 21:40:03 +00:00
5e467604c3 Revert "[inductor] switch AotCodeCompiler to new cpp_builder (#130127)"
This reverts commit dc5f37193f8d144d3de8525bf64eb1775d91e932.

Reverted https://github.com/pytorch/pytorch/pull/130127 on behalf of https://github.com/izaitsevfb due to Depends on #130105 which has to be reverted ([comment](https://github.com/pytorch/pytorch/pull/130127#issuecomment-2215355259))
2024-07-08 21:25:28 +00:00
09d57f577b Revert "[inductor] switch CppCodeCache to new cpp_builder. (#130132)"
This reverts commit 3957b3b34976896e0b13e1d09cf19e1da5b8292e.

Reverted https://github.com/pytorch/pytorch/pull/130132 on behalf of https://github.com/izaitsevfb due to Depends on  #130105 which has to be reverted ([comment](https://github.com/pytorch/pytorch/pull/130132#issuecomment-2215352180))
2024-07-08 21:22:39 +00:00
856fe230c7 [AOTI] better approach to generating runtime checks for symbolic dimensions (#130220)
Previously, we only handled cases where the symbolic dimension is a plain
Symbol. We should use bound_sympy, which handles more general expressions for us.
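
Roughly, `bound_sympy` propagates known value ranges through a whole expression rather than only a bare `Symbol` (a small sketch; the expression and ranges are made up):

```python
import sympy
from torch.utils._sympy.value_ranges import ValueRanges, bound_sympy

s0 = sympy.Symbol("s0", positive=True, integer=True)
expr = 2 * s0 + 3
# With s0 known to lie in [4, 16], the whole expression is bounded to [11, 35].
print(bound_sympy(expr, {s0: ValueRanges(4, 16)}))
```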

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130220
Approved by: https://github.com/aakhundov
2024-07-08 20:46:38 +00:00
3fe324ffb6 [custom ops] infer schema (#130079)
Fixes #129617

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130079
Approved by: https://github.com/zou3519
2024-07-08 20:46:23 +00:00
1e61cb8c87 Revert "[3.12, 3.13, dynamo] simplified construction for frame f_locals/localsplus (#129185)"
This reverts commit b428f1ad77aedfd150e920c8b0d23b7e6393ad6f.

Reverted https://github.com/pytorch/pytorch/pull/129185 on behalf of https://github.com/huydhn due to dr ci categorization is wrong, the test_linalg xsuccess is real, theres also a test_jit failure https://github.com/pytorch/pytorch/actions/runs/9844339391/job/27178009798 b428f1ad77 ([comment](https://github.com/pytorch/pytorch/pull/129185#issuecomment-2215230345))
2024-07-08 20:37:07 +00:00
f059201e0d [dtensor][debug] added deviceMesh for relevant operations and module parameter sharding and module fqn (#130072)
**Summary**
In order to give users more information, I have added the deviceMesh for operations with DTensor inputs, as well as module parameter sharding and module FQNs. These changes only appear in the operation tracing log. In the future, I plan to have a single logging function with an argument controlling how detailed the log should be, and to remove the module tracing log function. This information has also been added to the JSON dump and can be seen in the browser visual. I have also edited the test case file, since the module_depth dictionary has been replaced with module_helper_dict, and updated the example output for the MLP operation tracing, which can be seen below:

**Test Plan**
1. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_json_dump

2. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_json_dump

3. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_operation_tracing

4. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_operation_tracing

5. pytest test/distributed/_tensor/debug/test_comm_mode_features.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130072
Approved by: https://github.com/XilunWu
ghstack dependencies: #129994
2024-07-08 20:12:52 +00:00
3e53cae0fc Release 2.4 matrix update. Future releases dates (#130267)
Added Release Compatibility Matrix for release 2.4
Updated future release dates for 2.6-2.9
Updated possible patch release date for 2.4

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130267
Approved by: https://github.com/malfet, https://github.com/albanD
2024-07-08 20:09:17 +00:00
36e2608783 [Quant][PT2E] enable qlinear post op fusion for dynamic quant & qat (#122667)
**Description**
Add fusion path for dynamic quant and for QAT.
The following patterns can be matched for static quant with QAT cases:
`qx -> qlinear -> add -> optional relu -> optional type convert -> optional quant`

The following patterns can be matched for dynamic quant cases:
`qx -> qlinear -> add -> optional relu`

**Test plan**
python test/inductor/test_mkldnn_pattern_matcher.py -k test_qlinear
python test/inductor/test_cpu_cpp_wrapper.py -k test_qlinear
python test/test_quantization.py -k test_linear_unary
python test/test_quantization.py -k test_linear_binary

Differential Revision: [D57655830](https://our.internmc.facebook.com/intern/diff/D57655830)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122667
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel
2024-07-08 20:04:39 +00:00
a8985a97f9 elastic/store: use wait instead of get for barrier (#130148)
Summary: We call `.get` in the elastic store barrier operation but we don't need the result. This switches it to use `.wait` instead which eliminates one network round trip as `get` internally does a wait first.
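
A minimal sketch of the difference (standalone `TCPStore` for illustration, not the elastic rendezvous code itself):

```python
from datetime import timedelta
import torch.distributed as dist

store = dist.TCPStore("127.0.0.1", 29500, world_size=1, is_master=True,
                      timeout=timedelta(seconds=30), wait_for_workers=False)
store.set("barrier/rank0", "done")
store.wait(["barrier/rank0"])   # blocks until the key exists; no value is sent back
_ = store.get("barrier/rank0")  # get() waits and then also fetches the value -> extra round trip
```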

Test Plan:

CI + existing tests -- no behavior change

Differential Revision: D59396199

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130148
Approved by: https://github.com/kurman, https://github.com/wconstab
2024-07-08 19:53:42 +00:00
22c809aa73 [FSDP] Runtime Error on Checkpoint Loading for optimizer state (#129110)
When loading the optimizer state from a checkpoint, tensors are created on CUDA even when other backends are used. This is because, by default, a torch.device() constructed from a single device ordinal is treated as a cuda device.

In _alloc_tensor, empty tensors are created using device = cast(torch.device, _get_device_module(device_type).current_device()). This returns only the index, which creates the empty tensor on CUDA by default. So, change it to use torch.device(device_type, device_module(device_type).current_device()) to get a device with both type and index.
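
A small sketch of the difference described above:

```python
import torch

# A bare ordinal is interpreted as a CUDA device by default:
print(torch.device(0))  # device(type='cuda', index=0)

# The fix constructs the device from both the backend type and its current index
# (here device_type is hard-coded; in FSDP it comes from the checkpoint / process group):
device_type = "cuda"
if torch.cuda.is_available():
    device_module = getattr(torch, device_type)
    print(torch.device(device_type, device_module.current_device()))
```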

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129110
Approved by: https://github.com/fegin
2024-07-08 18:52:13 +00:00
9158bb7837 Ignore functional tensor wrapper when caching (#128335)
This PR makes it so that we don't try to serialize FunctionalTensorWrappers. FunctionalTensorWrappers don't pickle well because they have no underlying storage. This should be fixable at a later point, but I might not be the right author for implementing the serialization for it. If there's a way to avoid actually saving the FunctionalTensorWrappers themselves and just saving the ViewMetadata so we can replay it, that would also work.

To do this, we disable view_replay_input_mutations when using AOTAutogradCache, and then only keep the functional tensor in the ViewAndMutationMeta if we need it for view_replay_input_mutations (i.e. the cache is off).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128335
Approved by: https://github.com/bdhirsh
2024-07-08 18:39:20 +00:00
6dc64026cb Restrict fusions in foreach if there are dependencies on multiple subkernels (#130046)
In https://www.internalfb.com/intern/sevmanager/view/s/429861/, a downstream consuming buffer `buf486_buf526` had two read dependencies; `buf373` and `buf394`, both of which were at separate indices of the upstream foreach op. `buf486_buf526` was fused into `buf373` because in the usual fused case, this is completely fine if all dependencies are met in the upstream fused buffer. However in the foreach case and this case specifically it is possible for foreach ops to be partitioned if there are many arguments in order to stay under CUDA driver arg limits. As a result, this large foreach op was split into two, and the latter had `buf394` in its node schedule for allocation, while the earlier split did not, even though `buf486_buf526` uses the `buf394`, as a result we would hit the unbound local error.

@eellison provided this repro to help debug the issue (https://www.internalfb.com/phabricator/paste/view/P1453035092)

To fix this, we no longer return a valid producer subnode if there are multiple producer subnodes for a downstream consuming op. In short we should not fuse if there are dependencies on multiple foreach subkernels because 1) their execution order is non-deterministic and 2) (this issue) we may not properly handle dependencies in the presence of foreach partitioning.

Co-authored-by: David Berard <dberard@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130046
Approved by: https://github.com/eellison
2024-07-08 18:25:16 +00:00
64139987c0 Add block mask utility support for batches and heads > 1 (#130227)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130227
Approved by: https://github.com/yanboliang
ghstack dependencies: #130160, #130106, #130224
2024-07-08 18:15:35 +00:00
cd683212a2 Fix indexing twice with score_mod (#130224)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130224
Approved by: https://github.com/yanboliang
ghstack dependencies: #130160, #130106
2024-07-08 18:15:35 +00:00
e16276b9bf [ROCm] Check supported archs before setting preferred blas backend to hipblasLT (#128753)
This PR is needed to resolve usability issues with PyTorch ROCm nightly wheels on non-gfx90a/gf94x architectures as a result of https://github.com/pytorch/pytorch/pull/127944.

Addresses https://github.com/pytorch/pytorch/issues/119081#issuecomment-2166504992

### With this PR's changes, I get the following on a gfx908 (unsupported by hipblasLT) architecture:
_Using setter function:_
```
>>> torch.backends.cuda.preferred_blas_library(backend="cublaslt")
[W617 19:58:58.286088851 Context.cpp:280] Warning: torch.backends.cuda.preferred_blas_library is an experimental feature. If you see any error or unexpected behavior when this flag is set please file an issue on GitHub. (function operator())
[W617 19:59:02.125161985 Context.cpp:291] Warning: Attempting to use hipBLASLt on an unsupported architecture! Overriding blas backend to hipblas (function operator())
<_BlasBackend.Cublas: 0>
```

_Using `TORCH_BLAS_PREFER_HIPBLASLT` env var:_
```
root@9d47bf40d4d4:/tmp/pytorch# TORCH_BLAS_PREFER_CUBLASLT=1 python
>>> import torch
>>> torch.backends.cuda.preferred_blas_library()
[W619 06:14:11.627715807 Context.cpp:274] Warning: Attempting to use hipBLASLt on an unsupported architecture! Overriding blas backend to hipblas (function operator())
<_BlasBackend.Cublas: 0>
```

### and the following on a gfx90a (supported by hipblasLT) architecture:
_Using setter function:_
```
>>> import torch
>>> torch.backends.cuda.preferred_blas_library()
<_BlasBackend.Cublaslt: 1>
>>> torch.backends.cuda.preferred_blas_library(backend="cublas")
<_BlasBackend.Cublas: 0>
>>> torch.backends.cuda.preferred_blas_library(backend="cublaslt")
[W620 18:38:29.404265518 Context.cpp:293] Warning: torch.backends.cuda.preferred_blas_library is an experimental feature. If you see any error or unexpected behavior when this flag is set please file an issue on GitHub. (function operator())
<_BlasBackend.Cublaslt: 1>
```

_Using `TORCH_BLAS_PREFER_HIPBLASLT` env var:_
```
root@9d47bf40d4d4:/tmp/pytorch# TORCH_BLAS_PREFER_HIPBLASLT=1 python
>>> import torch
>>> torch.backends.cuda.preferred_blas_library()
<_BlasBackend.Cublaslt: 1>
```
(Same result for _Using `TORCH_BLAS_PREFER_CUBLASLT` env var:_)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128753
Approved by: https://github.com/malfet
2024-07-08 17:43:41 +00:00
b428f1ad77 [3.12, 3.13, dynamo] simplified construction for frame f_locals/localsplus (#129185)
Construct frame localsplus in 3.12+ using our own simplified way rather than copypasting from CPython.

This is necessary for 3.13 since we can no longer generate frame `f_locals` before executing the interpreter frame.
We also enable this for 3.12 since the `f_locals` construction between 3.12 and 3.13 is the same, so we can test for correctness with 3.12.

This is also one of the first steps to completing https://github.com/pytorch/pytorch/issues/93753 - we will implement simplified f_locals generation of previous Python versions in the future.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129185
Approved by: https://github.com/jansel
2024-07-08 17:39:05 +00:00
d325aaef39 [halide-backend] Use get_reduction_combine_fn for reduction ops (#130212)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130212
Approved by: https://github.com/eellison
2024-07-08 17:23:32 +00:00
a18568f293 [dtensor][debug] Added functionality to convert log into a json file (#129994)
**Summary**
Currently, users have two options to view the tracing data. The first is through the console, where colored text is used to help users read the information. The second is to log the information to a text file, which is useful when the log is too long to fit in the console. However, depending on the model complexity, these logs can go on for thousands of lines, making it difficult for the user to find specific information. To fix this, I have added the functionality to convert the log into a JSON file, which is used to create a tree view in a browser, allowing the user to collapse parts of the log that are not useful to them. I have given the user the option to pass their own file path, with a default used when none is provided. The expected output of the beginning of the JSON file and the browser view for the MLP model are shown below:

<img width="542" alt="Screenshot 2024-07-02 at 3 40 41 PM" src="https://github.com/pytorch/pytorch/assets/50644008/b9570540-e1d2-4777-b643-db4801b60ed8">

<img width="777" alt="Screenshot 2024-07-02 at 3 41 43 PM" src="https://github.com/pytorch/pytorch/assets/50644008/9296e255-c3ae-48a4-8be7-4273f69ee178">

**Test Plan**
1. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_json_dump

2. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_json_dump

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129994
Approved by: https://github.com/XilunWu
2024-07-08 17:15:34 +00:00
61017eb77b Add missing mapping between DLDevice and ATenDevice for MAIA (#129615)
This PR adds missing mapping between the `DLDevice `and `ATenDevice `for MAIA device. These changes are necessary for `dlpack `support for `maia `tensors.

[MAIA is added to the DldeviceType enum in the dlpack repo](bbd2f4d324/include/dlpack/dlpack.h (L120)) already.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129615
Approved by: https://github.com/albanD
2024-07-08 17:08:39 +00:00
63743b223c [AO] catch qparam mismatch for cat (#123769)
Summary:
use &= instead of |= since |= ignores incorrect scale/zp
change scale to use float comparison, instead of int comparison

Issue warning instead of error for backward compatibility: ex: P1204628034

Test Plan: see warning in: P1204628034

Reviewed By: jerryzh168

Differential Revision: D55699212

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123769
Approved by: https://github.com/jerryzh168
2024-07-08 16:47:14 +00:00
f4774d64bf Skip test_profile_memory on windows (#130037)
The test was introduced in https://github.com/pytorch/pytorch/pull/128743
It is failing on windows cuda a9a744e442/1 (it is skipped on cpu jobs)

After talking with the author and Aaron, I have been advised to skip it on windows, as windows support for kineto is not a high priority
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130037
Approved by: https://github.com/huydhn, https://github.com/aaronenyeshi
2024-07-08 16:11:51 +00:00
d7b7f8b79f Revert "[ROCm] Add int4 support (#129710)"
This reverts commit d0ad13fa42fc2e9935bd3bda2937a3491276d274.

Reverted https://github.com/pytorch/pytorch/pull/129710 on behalf of https://github.com/jeffdaily due to original ROCm PR did not have ciflow/rocm, missed signal ([comment](https://github.com/pytorch/pytorch/pull/129710#issuecomment-2214558368))
2024-07-08 16:07:53 +00:00
c8ab2e8b63 Set seed per sample for OpInfo tests + support for restricting to a single sample input (#128238)
This PR:
* Sets a random seed before generating each sample for an OpInfo test. It does this by intercepting the sample input iterator via `TrackedInputIter`, optionally setting the seed to a test-name-specific seed before each iterator call (the default is to set the seed).
    * Some quick and dirty benchmarking shows (hopefully) negligible overhead from setting the random seed before each sample input generation. For a trivial (single assert) test that uses `@ops`, see the before/after timings below.
* Uncovered a bunch of test issues:
    * Test breakdown (>100 total)
        * A lot of tolerance issues (tweaked tolerance values to fix)
        * 1 broken OpInfo (`sample_inputs_masked_fill` was generating a sample of the wrong dtype)
        * 3 actually broken semantics (for masked tensor; added xfails)
        * 4 Jacobian mismatches (added xfails)
        * 2 nan results (skip for now, need fixing)
        * 3 results too far from reference result (add xfails)
* Skips MPS tests for now (there are so many failures!). Those will default to the old behavior.

**before (no seed setting):**
```
real	0m21.306s
user	0m19.053s
sys	0m5.192s
```

**after (with seed setting):**
```
real	0m21.905s
user	0m19.578s
sys	0m5.390s
```

* Utilizing the above for reproducible sample input generation, adds support for restricting the iterator to a single sample input. This is done via an env var `PYTORCH_OPINFO_SAMPLE_INPUT_INDEX` and its usage is included in the repro command.

```
======================================================================
ERROR: test_bar_add_cuda_uint8 (__main__.TestFooCUDA.test_bar_add_cuda_uint8)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jbschlosser/branches/testing_updates/torch/testing/_internal/common_device_type.py", line 971, in test_wrapper
    return test(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/jbschlosser/branches/testing_updates/test/test_ops.py", line 2671, in test_bar
    self.assertFalse(True)
AssertionError: True is not false

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/jbschlosser/branches/testing_updates/torch/testing/_internal/common_utils.py", line 2816, in wrapper
    method(*args, **kwargs)
  File "/home/jbschlosser/branches/testing_updates/torch/testing/_internal/common_utils.py", line 2816, in wrapper
    method(*args, **kwargs)
  File "/home/jbschlosser/branches/testing_updates/torch/testing/_internal/common_device_type.py", line 419, in instantiated_test
    result = test(self, **param_kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jbschlosser/branches/testing_updates/torch/testing/_internal/common_utils.py", line 1426, in wrapper
    fn(*args, **kwargs)
  File "/home/jbschlosser/branches/testing_updates/torch/testing/_internal/common_device_type.py", line 982, in test_wrapper
    raise new_e from e
Exception: Caused by sample input at index 3: SampleInput(input=Tensor[size=(10, 5), device="cuda:0", dtype=torch.uint8], args=TensorList[Tensor[size=(), device="cuda:0", dtype=torch.uint8]], kwargs={}, broadcasts_input=False, name='')

To execute this test, run the following from the base repo dir:
    PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=3 python test/test_ops.py -k TestFooCUDA.test_bar_add_cuda_uint8

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

----------------------------------------------------------------------
Ran 1 test in 0.037s

FAILED (errors=1)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128238
Approved by: https://github.com/janeyx99, https://github.com/justinchuby
2024-07-08 16:06:38 +00:00
acf9e31cf8 adding MTIA to supported activities (#130052)
Summary: Put the hasMTIA block in the if condition as well to let MTIA activities be added to supported activities

Test Plan: Tested with auto-trace

Differential Revision: D59280848

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130052
Approved by: https://github.com/aaronenyeshi
2024-07-08 15:20:05 +00:00
16d53cb7d5 Only run mixed_mm heuristic if shapes are static (#130081)
If we have dynamic shapes, the heuristic in mixed_mm will cause a crash, because it cannot compare m, k and n to integer values. This PR makes it so that the heuristic only runs if we have static shapes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130081
Approved by: https://github.com/Chillee
2024-07-08 14:20:55 +00:00
010009e642 [compiled autograd] c++ autograd function saved_data: lift tensors (#130057)
Avoid recompiles when custom C++ autograd functions use ctx->saved_data to save tensors.

iv.toTensor can return a reference for `after(iv.toTensor())`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130057
Approved by: https://github.com/jansel
2024-07-08 07:42:07 +00:00
cyy
f4dcf2ae93 [1/N] Change #include <c10/util/Optional.h> to #include <optional> (#128301)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128301
Approved by: https://github.com/ezyang, https://github.com/r-barnes
2024-07-08 07:03:53 +00:00
f053be2a97 [dynamo] Graph break on random_ op (#130222)
Fixes https://github.com/pytorch/pytorch/issues/121621

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130222
Approved by: https://github.com/jansel
2024-07-08 06:10:24 +00:00
31bb65de19 [Inductor] Fix conditional codegen (#129492)
Summary:
We have a cache to guarantee each `sym` is codegen'd only once; see the following code:
```
def ensure_size_computed(self, sym: sympy.Symbol):
    if isinstance(sym, sympy.Symbol) and symbol_is_type(sym, SymT.PRECOMPUTED_SIZE):
        if sym in self.computed_sizes:
            return
        self.computed_sizes.add(sym)
        expr = V.graph.sizevars.inv_precomputed_replacements[sym]
        self.writeline(
            f"{self.declare}{sym} = {self.expr_printer(expr)}{self.ending}"
        )
```
However, we didn't consider the case where the same `sym` needs to be codegen'd in both branches of a condition (true branch and false branch), which caused `undefined symbols` issues: P1441378833

To fix the issue, we use a stack to capture the state before codegen-ing each branch of the condition and restore that state after the codegen.

Test Plan:
TORCH_LOGS="+inductor" buck2 run mode/dev-nosan -c fbcode.nvcc_arch=h100 -c fbcode.enable_gpu_sections=true --config 'cxx.extra_cxxflags=-g1' -c fbcode.platform010_cuda_version=12 //scripts/hhh:repro_cond_torch_compile

PYTORCH_TEST_FBCODE=1 TORCH_COMPILE_DEBUG=1 buck2 run  mode/opt -c=python.package_style=inplace -c fbcode.enable_gpu_sections=true -c fbcode.platform=platform010 -c fbcode.split-dwarf=true //caffe2/test/inductor:control_flow -- -r test_cond_control_flow_with_precomputed_size

Differential Revision: D58973730

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129492
Approved by: https://github.com/aakhundov
2024-07-08 05:33:47 +00:00
c5c9dbece1 [dynamo][user-defined] Simplify and improve scope of UserDefinedObject var_getattr (#130169)
Fixes https://github.com/pytorch/pytorch/issues/122649

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130169
Approved by: https://github.com/jansel
ghstack dependencies: #118448, #130159
2024-07-08 04:10:56 +00:00
d0ad13fa42 [ROCm] Add int4 support (#129710)
Add AMD support for int4 kernel using mfma_f32_16x16x16bf16 instruction.
Only supports CDNA2 and CDNA3 gpus for now.
Fixes #124699

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129710
Approved by: https://github.com/malfet
2024-07-07 23:54:22 +00:00
d1b832e739 [inductor][mkl][inline-inbuilt-nn-modules] Change assertion (#130219)
Fixes the test in the next PR - `python test/inductor/test_mkldnn_pattern_matcher.py -k TestDynamicPatternMatcher.test_conv3d_unary_dynamic_shapes`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130219
Approved by: https://github.com/leslie-fang-intel
2024-07-07 21:32:07 +00:00
940e4477ab [runtime asserts] deduplicate runtime asserts & CSE (#128599)
This PR adds deduplication and CSE for runtime asserts. Existing size computation in the graph is CSE'd along with added runtime asserts, and redundant asserts are removed. Shape calls on intermediate tensors are also turned into compute on input sizes if possible, allowing intermediate tensors to be freed earlier. For example:
```
z = torch.cat([x, x], dim=0)  # 2*s0
w = z.repeat(y.shape[0])  # 2*s0*s1
_w = w.shape[0]
# something with _w ...

# turns into ->
s0 = x.shape[0]
s1 = y.shape[0]
_w0 = 2 * s0
_w = _w0 * s1
```

Additionally, constrain_range calls are deduplicated. Single-symbol bound checks for unbacked symbols (e.g. u0 >= 0, u0 <= 5) and sym_constrain_range.default calls are also removed, since they accumulate range info in the ShapeEnv, and are replaced with two _assert_scalar.default calls that check the min/max bounds. For example:
```
torch.sym_constrain_range_for_size(n, min=2, max=16)
torch.sym_constrain_range(n, min=4, max=20)
torch._check(n >= 0)
torch._check(n >= 3)
torch._check(n <= 14)

# turns into
torch.sym_constrain_range_for_size(n)
torch._check(n >= 4)
torch._check(n <= 14)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128599
Approved by: https://github.com/ezyang
2024-07-07 20:10:14 +00:00
0c44684901 [Typo] Fix typo in DispatchKeyExtractor.h (#130221)
Summary: typo_helper

Test Plan: ci

Differential Revision: D59424671

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130221
Approved by: https://github.com/Skylion007
2024-07-07 19:43:31 +00:00
e423224546 Revert "[Inductor][CPP] Enable Local Buffer for Outer loop fusion (#126967)"
This reverts commit 98929ceae3873f18f4747b88cdff708fde107aa7.

Reverted https://github.com/pytorch/pytorch/pull/126967 on behalf of https://github.com/leslie-fang-intel due to Broken trunk and need rebase ([comment](https://github.com/pytorch/pytorch/pull/126967#issuecomment-2212337926))
2024-07-07 06:16:32 +00:00
1b57dce35f Revert "[Inductor][CPP] Support more than one LocalBuffer (#129121)"
This reverts commit f794cf59bd0891ff4a4337e0d919ee68ba1f0472.

Reverted https://github.com/pytorch/pytorch/pull/129121 on behalf of https://github.com/leslie-fang-intel due to Broken trunk and need rebase ([comment](https://github.com/pytorch/pytorch/pull/129121#issuecomment-2212337590))
2024-07-07 06:13:40 +00:00
f794cf59bd [Inductor][CPP] Support more than one LocalBuffer (#129121)
**Summary**
Support more than one local buffer in an outer-loop-fused node, as well as the case where multiple global buffers share the same local buffer.

**TestPlan**
```
python -u -m pytest -s -v inductor/test_cpu_repro.py -k test_two_local_buffers_in_outer_loop_fusion
python -u -m pytest -s -v inductor/test_cpu_repro.py -k test_share_local_buffers_in_outer_loop_fusion
```

**Next Step**

- [✓] Support more than one Local Buffer/Global Buffer

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129121
Approved by: https://github.com/jgong5, https://github.com/peterbell10
ghstack dependencies: #126967
2024-07-07 05:43:08 +00:00
98929ceae3 [Inductor][CPP] Enable Local Buffer for Outer loop fusion (#126967)
**Summary**
Currently, the Inductor CPP backend [generated code](https://gist.github.com/leslie-fang-intel/98f91d43dabed581a1ffe23daf133a65#file-bf16-softmax-generated-code-wo-local-buffer-py) for `Softmax` with BF16 data type is significantly slower than the [ATen Implementation](9a2beb862d/aten/src/ATen/native/cpu/SoftMaxKernel.cpp (L149)). Upon comparing the generated code with ATen, the performance bottleneck appears to be related to the usage of [local buffer in ATen](9a2beb862d/aten/src/ATen/native/cpu/SoftMaxKernel.cpp (L159-L160)).

In the current implementation, Inductor uses the output buffer of the kernel group args to store and load temporary results (such as `exp`), since this buffer corresponds to a `SchedulerNode`. Each thread accesses a portion of this output buffer via indexing. However, since this buffer (take `exp` as an example) is only used internally within the decomposed `softmax`, it can be replaced with a thread-local buffer, similar to ATen's approach.

In this PR, we introduce the `LocalBuffer` optimization. With this enhancement, the [new generated Inductor code with local buffer](https://gist.github.com/leslie-fang-intel/98f91d43dabed581a1ffe23daf133a65#file-bf16-softmax-generated-code-w-local-buffer-py) for BF16 `Softmax` shows significantly improved performance. Running the benchmark [here](https://gist.github.com/leslie-fang-intel/37d81441237b5139c8295f5e6c4cd31a) for this BF16 `Softmax` case on an 8480 Xeon server shows comparable performance between the Inductor CPP backend and the ATen implementation.
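
As a rough mental model (a plain NumPy/threads sketch, not the Inductor-generated C++ and not this PR's actual code), the `exp` intermediate never escapes the kernel, so each worker can reuse a small thread-local scratch buffer instead of indexing into a shared buffer sized like the whole output:
```
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def softmax_rows(x, num_workers=4):
    out = np.empty_like(x)

    def work(rows):
        scratch = np.empty(x.shape[1], dtype=x.dtype)   # thread-local buffer
        for i in rows:
            np.exp(x[i] - x[i].max(), out=scratch)      # exp result stays in the local scratch
            out[i] = scratch / scratch.sum()

    chunks = np.array_split(np.arange(x.shape[0]), num_workers)
    with ThreadPoolExecutor(num_workers) as pool:
        list(pool.map(work, chunks))
    return out

x = np.random.rand(8, 16).astype(np.float32)
ref = np.exp(x - x.max(axis=1, keepdims=True))
ref /= ref.sum(axis=1, keepdims=True)
print(np.allclose(softmax_rows(x), ref))  # True
```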

**TestPlan**
```
python -u -m pytest -s -v inductor/test_cpu_repro.py -k test_local_buffer_in_outer_loop_fusion
```

**Next Step**

- [ ] Support more than one Local Buffer/Global Buffer

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126967
Approved by: https://github.com/jgong5, https://github.com/peterbell10
2024-07-07 05:34:57 +00:00
a3ce9eddd6 [BE][Easy] apply autofix for ruff rule unnecessary-literal-set (C405) and unnecessary-map (C417) (#130198)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130198
Approved by: https://github.com/Skylion007
2024-07-07 00:58:22 +00:00
9983242c8e [inductor] support adding a new inductor backend using PrivateUse1 (#129953)
Add handling of custom devices registered via PrivateUse1 in the init_backend_registration() function.

Fixes #129952

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129953
Approved by: https://github.com/jansel
2024-07-06 21:15:40 +00:00
3d138af943 [Inductor] First implementation of the B2B-GEMM pass with tests (#129995)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129995
Approved by: https://github.com/eellison
2024-07-06 19:10:22 +00:00
3957b3b349 [inductor] switch CppCodeCache to new cpp_builder. (#130132)
Changes:
1. switch CppCodeCache to new cpp_builder.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130132
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-07-06 18:57:44 +00:00
dc5f37193f [inductor] switch AotCodeCompiler to new cpp_builder (#130127)
Changes:
1. Switch `AotCodeCompiler` to new cpp_builder.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130127
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-07-06 18:44:13 +00:00
cyy
dfe3534134 [1/N] Fix NVCC warnings (#130191)
Fixes NVCC warnings as a required step toward enabling Werror for CUDA builds.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130191
Approved by: https://github.com/Skylion007
2024-07-06 18:25:04 +00:00
3f50e197c4 [BE] annotate torch.autograd.graph (#129558)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129558
Approved by: https://github.com/soulitzer
2024-07-06 18:14:16 +00:00
01ec03bac6 [inductor] switch HalideCodeCache to new cpp_builder. (#130146)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130146
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-07-06 17:35:17 +00:00
cyy
2f219f7d79 Enforce unused-{variable/function} checks to all torch targets (#130189)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130189
Approved by: https://github.com/ezyang
2024-07-06 16:03:01 +00:00
cyy
096eca2f9a [2/N] Replace exceptions with static_assert(false) in some templates (#130116)
Follows #127371

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130116
Approved by: https://github.com/eqy, https://github.com/ezyang
2024-07-06 13:23:05 +00:00
520a4642bf [CI] Enable build with asserts (#129924)
Not a standard CMake config, as far as I can tell, but it introduces the important concept of an optimized build without `NDEBUG`. Test by running `python -c "import torch; torch._C._crash_if_debug_asserts_fail(424242)"`, which is a no-op unless debug asserts are enabled.

Add the recently added `_unsafe_masked_index`/`_unsafe_masked_index_put_accumulate` to DONT_ENFORCE_SAME_TENSOR_IMPL_OR_STORAGE to avoid all tests involving those ops failing with an internal assert.
Suppress a number of internal asserts to make CI green; see https://github.com/pytorch/pytorch/issues/130073

Fixes https://github.com/pytorch/pytorch/issues/102105

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129924
Approved by: https://github.com/atalman, https://github.com/albanD
2024-07-06 13:14:32 +00:00
da66e50e6e Added compile option to create_block_mask (#130106)
Compiling the `create_block_mask` function allows us to "materialize" extremely large masks. The 2^20 x 2^20 mask in the example below would have been a roughly 1 *trillion* element tensor if fully materialized.

```
print(do_bench(lambda: create_block_mask(causal_mask, 1, 1, 2**20, 2**20, _compiled=True)))
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130106
Approved by: https://github.com/yanboliang
ghstack dependencies: #130160
2024-07-06 08:09:56 +00:00
963f430d13 Revert "[runtime asserts] deduplicate runtime asserts & CSE (#128599)"
This reverts commit 0267b2ddcb58aa66b2b62336216da7df4f9939d8.

Reverted https://github.com/pytorch/pytorch/pull/128599 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to cause a landrace and fails inductor/test_cudagraph_trees in trunk 0267b2ddcb ([comment](https://github.com/pytorch/pytorch/pull/128599#issuecomment-2211690518))
2024-07-06 07:20:05 +00:00
aa4899eee9 [CCA][Memory Snapshot] Fix race on alloc_trace vector - S430480 (#130180)
Summary:
Multiple threads can be mutating the alloc_trace std::vector concurrently, which can result in SIGSEGVs when entries are double freed, accessed after free, or inserted by two threads at the same time.

We need to take a lock when inserting, accessing, or removing a TraceEntry in alloc_trace.
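
A minimal sketch of the pattern in Python terms (the real fix guards the C++ std::vector with a mutex inside the caching allocator; the class below is purely illustrative):
```
import threading

class TraceBuffer:
    """Every insert/read/clear of the shared trace goes through one lock."""
    def __init__(self):
        self._lock = threading.Lock()
        self._entries = []

    def record(self, entry):
        with self._lock:
            self._entries.append(entry)

    def snapshot(self):
        with self._lock:
            return list(self._entries)   # copy taken under the lock

    def clear(self):
        with self._lock:
            self._entries.clear()
```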

Test Plan:
This is a rare crash that was exposed when we introduced recordAnnotations, which saves record_function annotations into the snapshot files. Saving a lot of annotations can trigger this bug. A few jobs crashed this way before, and this diff fixes them.

Differential Revision: D59380507

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130180
Approved by: https://github.com/eqy, https://github.com/kit1980
2024-07-06 06:14:54 +00:00
e019540c9e Revert "Fix the SDPA AOT export issue (#130164)"
This reverts commit 1927c406844affbfe3496d5cbc31d4ebe11c8bfb.

Reverted https://github.com/pytorch/pytorch/pull/130164 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is breaking ExecuTorch tests in trunk 1927c40684 ([comment](https://github.com/pytorch/pytorch/pull/130164#issuecomment-2211667777))
2024-07-06 05:59:49 +00:00
bf609630ae Fix a bunch of stride issues with FlexAttention (#130160)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130160
Approved by: https://github.com/yanboliang
2024-07-06 03:58:14 +00:00
10c831567b Make sympify'ing SymInt/etc produce their sympy expression (#130166)
There is one huge problem this fixes: today, sympify(symint)
produces a float(!!), because sympify attempts to coerce its argument
to float as a fallback, and of course this works on SymInt.

However, this also has another nontrivial effect: anywhere in Inductor
where sympy expressions are passed around, it is also valid to pass
around a SymInt now.  I'm ambivalent about this: it's currently a
mistake to be passing around a SymInt when a sympy expression is
expected.  But maybe this is fine?
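
For illustration, the float coercion can be reproduced with plain sympy and a stand-in class (this is not the real SymInt implementation, just a sketch of the mechanism and of how returning the underlying expression avoids it):
```
import sympy

class FakeSymInt:
    """Stand-in for SymInt: wraps a sympy expr and is coercible to float."""
    def __init__(self, expr, hint=3.0):
        self.expr = expr
        self.hint = hint
    def __float__(self):
        return self.hint              # sympify's fallback happily uses this

class FixedFakeSymInt(FakeSymInt):
    def _sympy_(self):                # hook sympify checks before float coercion
        return self.expr

s0 = sympy.Symbol("s0", integer=True)
print(sympy.sympify(FakeSymInt(s0)))       # a sympy Float (3.0...), the symbol is lost
print(sympy.sympify(FixedFakeSymInt(s0)))  # s0, the underlying expression
```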

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130166
Approved by: https://github.com/yf225
2024-07-06 03:56:45 +00:00
acd03ca2d9 [halide-backend] Support scan kernels (#129035)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129035
Approved by: https://github.com/shunting314, https://github.com/eellison
ghstack dependencies: #130129
2024-07-06 03:49:50 +00:00
c5110f6388 [halide-backend] Use 0D scalar inputs/outputs (#130129)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130129
Approved by: https://github.com/shunting314
2024-07-06 03:49:50 +00:00
0267b2ddcb [runtime asserts] deduplicate runtime asserts & CSE (#128599)
This PR adds deduplication and CSE for runtime asserts. Existing size computation in the graph is CSE'd along with added runtime asserts, and redundant asserts are removed. Shape calls on intermediate tensors are also turned into compute on input sizes if possible, allowing intermediate tensors to be freed earlier. For example:
```
z = torch.cat([x, x], dim=0)  # 2*s0
w = z.repeat(y.shape[0])  # 2*s0*s1
_w = w.shape[0]
# something with _w ...

# turns into ->
s0 = x.shape[0]
s1 = y.shape[0]
_w0 = 2 * s0
_w = _w0 * s1
```

Additionally, constrain_range calls are deduplicated. Single-symbol bound checks for unbacked symbols (e.g. u0 >= 0, u0 <= 5) and sym_constrain_range.default calls are also removed, since they accumulate range info in the ShapeEnv, and are replaced with two _assert_scalar.default calls that check the min/max bounds. For example:
```
torch.sym_constrain_range_for_size(n, min=2, max=16)
torch.sym_constrain_range(n, min=4, max=20)
torch._check(n >= 0)
torch._check(n >= 3)
torch._check(n <= 14)

# turns into
torch.sym_constrain_range_for_size(n)
torch._check(n >= 4)
torch._check(n <= 14)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128599
Approved by: https://github.com/ezyang
2024-07-06 03:44:49 +00:00
7c43f59a45 [audio hash update] update the pinned audio hash (#129429)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129429
Approved by: https://github.com/pytorchbot
2024-07-06 03:34:12 +00:00
bd0252fb98 [dynamo][user-defined] Support method descriptors (#130159)
Fixes https://github.com/pytorch/pytorch/issues/120650

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130159
Approved by: https://github.com/jansel
ghstack dependencies: #118448
2024-07-06 02:03:09 +00:00
a1a2023eb8 Back out "Pass device to is_pinned call inside TensorProperties.create_from_tensor" (#129972)
Summary:
It turns out the device passed as a parameter to is_pinned is meant to be the accelerator device with respect to which pinning is checked. Passing 'cpu' always makes the return value false, regardless of whether the tensor is actually a CPU tensor pinned for CUDA.

Besides, PR https://github.com/pytorch/pytorch/pull/126376 is about to be merged; it automatically uses the correct accelerator device, which obviates the need for users to pass any explicit device and avoids creating a CUDA context for pure CPU tensors.

Note: the test https://www.internalfb.com/intern/test/844425019931542?ref_report_id=0 is expected to be broken by this diff, but it should be fixed forward by https://github.com/pytorch/pytorch/pull/126376

Test Plan: Sandcastle.

Differential Revision: D59283190

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129972
Approved by: https://github.com/LucasLLC
2024-07-06 01:07:32 +00:00
1927c40684 Fix the SDPA AOT export issue (#130164)
Summary:
## Context
TL;DR: aot_export fails for the SDPA memory-efficient backend when using `inference_mode`.

The CMF AOTI lowering started to fail on trunk. We have a script (https://fburl.com/code/kfk64i5s) to reproduce the issue quickly (log: P1469307638). By bisecting the stack, we found that the issue starts with D58701607.

## Root Cause
Under `inference_mode()`,
`aten::scaled_dot_product_attention` is not decomposed before functionalization. Since the op itself is an out-of-place op, functionalization makes no change; the op is then decomposed into `masked_fill_`, which further decomposes into `copy_`.
So it's `aten::sdpa` --- (functionalization) ---> `aten::sdpa` --- (decompose) ---> `masked_fill_` --- (decompose) ---> `copy_` ---> failure

Under `torch.no_grad()`,
`aten::sdpa` is decomposed before functionalization, so the story is
`aten::sdpa` --- (decompose) ---> `masked_fill_` --- (functionalization) ---> `masked_fill` --- (decompose) ---> `out-of-place ops` ---> good
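
The ordering effect can be sketched with a toy pass pipeline (pure Python with made-up decomposition tables, not the real ones):
```
DECOMPOSE = {"sdpa": ["matmul", "masked_fill_", "softmax"],  # toy table only
             "masked_fill_": ["copy_"]}                      # copy_ mutates its input
FUNCTIONALIZE = {"masked_fill_": "masked_fill"}              # in-place -> out-of-place

def decompose(graph):
    return [new_op for op in graph for new_op in DECOMPOSE.get(op, [op])]

def functionalize(graph):
    return [FUNCTIONALIZE.get(op, op) for op in graph]

# inference_mode-like order: functionalize only sees the out-of-place sdpa,
# later decomposition reintroduces masked_fill_ -> copy_ (a mutation survives).
print(decompose(decompose(functionalize(["sdpa"]))))
# no_grad-like order: sdpa is decomposed first, functionalize rewrites
# masked_fill_ -> masked_fill, and no mutating op remains.
print(decompose(functionalize(decompose(["sdpa"]))))
```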

## How to fix
Long-term:
The issue is tracked in https://github.com/pytorch/pytorch/issues/129418. The long-term fix could be to do one more round of `functionalization` after the `decompose`, like

`aten::sdpa` --- (functionalization) ---> `aten::sdpa` --- (decompose) ---> `masked_fill_` --- (functionalization) ---> `masked_fill` ---> good

Short-term:
That would be a big change, I guess. To unblock the production use case, I marked `aten::sdpa` to be decomposed in this diff.

Test Plan:
local repro works now

buck run mode/opt scripts/sijiac/prototypes:sdpa_aoti

Differential Revision: D59385876

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130164
Approved by: https://github.com/zou3519
2024-07-06 00:57:47 +00:00
c5ede865c4 [pt2-bench] raise tolerance for squeezenet1_1 (#130165)
The training accuracy for this model has started to regress. It does not show up on the weekly run yet, but:
1. it shows up in my MA runs [here](https://hud.pytorch.org/benchmark/torchbench/inductor_max_autotune?dashboard=torchinductor&startTime=Fri,%2028%20Jun%202024%2006:53:45%20GMT&stopTime=Fri,%2005%20Jul%202024%2006:53:45%20GMT&granularity=hour&mode=training&dtype=amp&lBranch=gh/shunting314/162/head&lCommit=cb236e8c198b54901e4fb19698f91be786f72e25&rBranch=main&rCommit=4ee1cb9b955fcc5d75a421b19393998122136f2c)
2. I can repro it locally

Command:
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/torchbench.py --accuracy --training --amp --backend
 inductor --device cuda --only squeezenet1_1
```

Raise the tolerance to fix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130165
Approved by: https://github.com/jansel
ghstack dependencies: #129996, #129941, #130005, #130163
2024-07-06 00:49:15 +00:00
0fcbca9adb [pt2-bench] use eval mode for vision_maskrcnn (#130163)
Try to fix https://github.com/pytorch/pytorch/issues/130161

`--accuracy` works because it uses eval mode, while `--training` does not work because it uses training mode and TorchBench does not return target tensors; in training mode, vision_maskrcnn requires target tensors.

I fix that by always using eval mode for vision_maskrcnn, even for the training benchmark.

With the fix, I start see a segfault: https://gist.github.com/shunting314/5a70df3463b2a4421b2c34aa88e78d1f

I'm not sure if that's due to my local setup but I think the fix in this PR is something we need any way. We can check the dashboard after the PR is in.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130163
Approved by: https://github.com/jansel
ghstack dependencies: #129996, #129941, #130005
2024-07-06 00:49:15 +00:00
cyy
e5841bb8d5 [3/N] Enforce unused-function and unused-variable checks (#130084)
Follows #129878.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130084
Approved by: https://github.com/ezyang
2024-07-05 23:56:00 +00:00
126796d239 [c10d] fixing an UT after a change in eager mode new group (#130167)
Summary:
After https://github.com/pytorch/pytorch/pull/129284, new_group is now eager when device_id is specified, which broke one unit test.
This PR fixes it.

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130167
Approved by: https://github.com/wconstab
2024-07-05 23:18:30 +00:00
d1d0a7080f [torchgen] reference generated comment to actual location of the generator and template (#130020)
As per title.

```diff
# torch/_VF.pyi

- # @generated from torch/_C/_VariableFunctions.pyi.in
+ # @generated by tools/pyi/gen_pyi.py from torch/_C/_VariableFunctions.pyi.in
```

```diff
# torch/return_types.pyi

- # @generated from torch/_C/return_types.pyi
+ # @generated by tools/pyi/gen_pyi.py from torch/_C/return_types.pyi.in
```

```diff
# torch/_C/__init__.pyi

- # @generated from torch/_C/__init__.pyi.in
+ # @generated by tools/pyi/gen_pyi.py from torch/_C/__init__.pyi.in
```

```diff
# torch/_C/_nn.pyi

+ # @generated by tools/pyi/gen_pyi.py from torch/_C/_nn.pyi.in
```

```diff
# torch/_C/_VariableFunctions.pyi

- # @generated from torch/_C/_VariableFunctions.pyi.in
+ # @generated by tools/pyi/gen_pyi.py from torch/_C/_VariableFunctions.pyi.in
```

```diff
# torch/nn/functional.pyi

+ # @generated by tools/pyi/gen_pyi.py from torch/nn/functional.pyi.in
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130020
Approved by: https://github.com/ezyang
2024-07-05 21:47:14 +00:00
6fc771d19b Revert "Change depreacate warning on dispatch_on_subclass to warn once (#130047)"
This reverts commit 8ff243bcf190bab62348310693f0ad2f90061c89.

Reverted https://github.com/pytorch/pytorch/pull/130047 on behalf of https://github.com/clee2000 due to broke test_overrides.py::TestTorchFunctionWarning::test_warn_on_invalid_torch_function on multiple jobs 8ff243bcf1 https://github.com/pytorch/pytorch/actions/runs/9812489165/job/27097342443.  Dr CI is doing something weird about the unstable failures ([comment](https://github.com/pytorch/pytorch/pull/130047#issuecomment-2211409090))
2024-07-05 21:03:36 +00:00
df50452279 Pin optree==0.11.0 on windows CI (#130155)
Fixes #ISSUE_NUMBER

Affected tests: doctests, test_testing

Failing run has 0.12.0 https://github.com/pytorch/pytorch/actions/runs/9804335516/job/27072891998
Succeeding run has 0.11.0 https://github.com/pytorch/pytorch/actions/runs/9798330845/job/27057359554

It is already pinned for mac and linux
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130155
Approved by: https://github.com/huydhn, https://github.com/atalman
2024-07-05 20:28:58 +00:00
18e75c098b [DCP] Adds Checkpointing Team (dcp) to merge rules (#129582)
[DCP] Adds the Checkpointing Team (dcp) to the merge rules. Please comment on this PR if you think you should be added as well!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129582
Approved by: https://github.com/fegin
2024-07-05 20:09:31 +00:00
739fc01ac9 [NCCL] Make sure current device is correct in torch.distributed.barrier()'s streamSynchronize (#129908)
The real root cause of the issue is that the current stream on a given CUDA device may be the legacy default stream, which doesn't seem to have a device associated with it. If the current CUDA device as reported by `cudaGetDevice` doesn't match the intended legacy default stream's device (this happens if a user runs distributed code without e.g. `torch.cuda.set_device(mylocalrank)`), then the stream synchronize will not have the intended effect. The previous stream-sync code here correctly inserted a `DeviceGuard` to ensure that this legacy-default-stream sync with a mismatched current device didn't happen, but that check is elided here. The simplest fix is to just use the `CUDAStream` wrapper's `synchronize()` call, which already correctly uses a `DeviceGuard` internally:
a21d4363d2/c10/cuda/CUDAStream.h (L132)

OUTDATED below:
The current behavior of `barrier`'s `synchronizeInternal` seems to be a bit counterintuitive, as it is synchronizing on a device's current `CUDAStream` rather than the one used for the actual `allreduce` (the `ncclStream`). In practice this results in a script like the following:
```
import logging
import os
import time
import torch
import torch.distributed as dist

def main():
    logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(message)s")

    backend = 'nccl'
    group = torch.distributed.init_process_group(backend=backend)
    rank = torch.distributed.get_rank(group=group)

    for i in range(4):
        time.sleep(rank)
        logging.info(f"Rank {rank}: enter barrier {i}")
        dist.barrier()
        logging.info(f"Rank {rank}: exit barrier {i}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```
appearing to show that ranks can exit barrier(s) before other ranks have entered. Note that the device-side ordering should still be correct in this case, but the host is free to run ahead.

The issue can be worked-around by adding a `torch.cuda.synchronize(rank)` after the `barrier`, but this seems to be against the spirit of the stream synchronization which deliberately tried to avoid a device synchronization.

This PR does a sync on the `allreduce`'s stream so that a device synchronization is not needed to align the host's output with the device.

CC @wujingyue @Aidyn-A @ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129908
Approved by: https://github.com/kwen2501
2024-07-05 19:53:54 +00:00
faebaef089 [EZ] Fix typo in upload stats OIDC rolename (#130168)
My mistake from https://github.com/pytorch/pytorch/pull/129544
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130168
Approved by: https://github.com/kit1980, https://github.com/malfet, https://github.com/atalman
2024-07-05 19:38:24 +00:00
3d56673b24 [Split Build][BE] remove extraneous .py, .a, and .so files (#130053)
Removes extraneous .a, .so, and .py files from the split build. From here we can also clean up the builder script that produces the binary; that PR is https://github.com/pytorch/builder/pull/1912

Verification:

The wheel built with BUILD_LIBTORCH_WHL=1 contains only the following files with .a, .so, or .py extensions:

```
sahanp@devgpu086 ~/p/dist (viable/strict)> pwd                                                                                                                                                                                                                            (pytorch-3.10)
/home/sahanp/pytorch/dist
sahanp@devgpu086 ~/p/dist (viable/strict)> find . -type f \( -name "*.py" -o -name "*.a" -o -name "*.so" \)                                                                                                                                                               (pytorch-3.10)
./torch/__init__.py
./torch/lib/libbackend_with_compiler.so
./torch/lib/libc10.so
./torch/lib/libjitbackend_test.so
./torch/lib/libtorch.so
./torch/lib/libtorch_cpu.so
./torch/lib/libtorch_global_deps.so
./torch/lib/libtorchbind_test.so
sahanp@devgpu086 ~/p/dist (viable/strict)>
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130053
Approved by: https://github.com/atalman
2024-07-05 19:05:32 +00:00
8ff243bcf1 Change depreacate warning on dispatch_on_subclass to warn once (#130047)
Summary:
Right now the deprecation warning fires on every operator that calls into torch_function. This changes it to TORCH_WARN_ONCE instead.

More context in https://fb.workplace.com/groups/260102303573409/permalink/445299188387052/

Test Plan: Sandcastle

Differential Revision: D59338775

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130047
Approved by: https://github.com/XilunWu
2024-07-05 18:52:49 +00:00
784e3b4123 Revert "Change numeric_debug_handle to store per-node id (#129811)"
This reverts commit a9a744e442975cfbc6f4b26a532e5c1b3d9d5692.

Reverted https://github.com/pytorch/pytorch/pull/129811 on behalf of https://github.com/kit1980 due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/129811#issuecomment-2211245852))
2024-07-05 18:14:02 +00:00
889ed48a22 Fix missing id-token write in upload stats (#130153)
Fix the mistake from https://github.com/pytorch/pytorch/pull/129544
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130153
Approved by: https://github.com/clee2000
2024-07-05 18:05:46 +00:00
7c5f3cd049 Add explain function to TSConverter. (#129968)
Summary: The explain function does a conversion dry run to give users feedback on which operators are not supported or fail the conversion.

Test Plan: * `pytest test/export/test_converter.py`

Differential Revision: D59251934

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129968
Approved by: https://github.com/angelayi
2024-07-05 18:04:29 +00:00
7ea8a3c9b8 [dynamo] Validate check_fn (#118448)
Fixes - https://github.com/pytorch/pytorch/issues/128090

Tracker issue here - https://github.com/pytorch/pytorch/issues/129937

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118448
Approved by: https://github.com/jansel, https://github.com/ezyang
2024-07-05 18:04:12 +00:00
7192ee0735 Default to input tensor device for as_nested_tensor(t) (#130050)
Fixes #129647
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130050
Approved by: https://github.com/YuqingJ
2024-07-05 17:50:08 +00:00
a33ee73a28 Upload perf stats to both Rockset and dynamoDB (#129544)
To avoid an outage on HUD, I plan to migrate perf stats to dynamoDB as follows:

1. Upload perf stats to both Rockset and dynamoDB
2. Copy all the existing content from Rockset to dynamoDB
3. Create new Rockset tables to map to dynamoDB
4. Switch HUD to use the new Rockset tables (temporarily)
5. Delete the existing tables

This depends on https://github.com/pytorch-labs/pytorch-gha-infra/pull/422

### Testing

```
python3 -m tools.stats.upload_dynamo_perf_stats --workflow-run-id 9770217910 --workflow-run-attempt 1 --repo "pytorch/pytorch" --head-branch "gh/shunting314/162/head" --rockset-collection torch_dynamo_perf_stats --rockset-workspace inductor --dynamodb-table torchci-dynamo-perf-stats --match-filename "^inductor_"
...
Writing 1607 documents to DynamoDB torchci-dynamo-perf-stats
```

And confirm the same number of documents is on the table

![Screenshot 2024-07-03 at 18 10 35](https://github.com/pytorch/pytorch/assets/475357/6c055c96-00ca-4cb3-bbe5-fe4914f9da9b)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129544
Approved by: https://github.com/clee2000
2024-07-05 16:31:49 +00:00
e7ab7b83bc Have torch_key hash entire torch directory (#129250)
Summary:
Title. This way, both FXGraphCache and AOTAutogradCache use the same torch_key, and we don't need to only hash specific files.

There's an argument to be made to only hash *.py and *.cpp files. Maybe we can fix the glob to do that.

We use a buck_filegroup because otherwise $SRCs gets too large. By using `$(location :torch_sources)`, we make the genrule implicitly depend on all files globbed by torch_sources.
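
For intuition, hashing a whole source tree is just a deterministic walk plus a running digest; a minimal sketch (not the actual torch_key implementation) could look like:
```
import hashlib
from pathlib import Path

def source_tree_hash(root, suffixes=None):
    """Digest file paths + contents; pass suffixes={'.py', '.cpp'} to narrow the glob."""
    h = hashlib.sha256()
    for p in sorted(Path(root).rglob("*")):
        if p.is_file() and (suffixes is None or p.suffix in suffixes):
            h.update(str(p.relative_to(root)).encode())
            h.update(p.read_bytes())
    return h.hexdigest()

# Any change to any hashed file changes the digest:
# print(source_tree_hash("torch"))
# print(source_tree_hash("torch", suffixes={".py", ".cpp"}))
```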

Test Plan:
Unit tests still pass on OSS
For torch_key:

```
buck2 build caffe2:src_hash.txt -v 2 --show-output
```
See the output, then make any change to any torch file. See that the hash changes.

Reviewed By: oulgen

Differential Revision: D58875785

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129250
Approved by: https://github.com/oulgen
2024-07-05 15:37:16 +00:00
2656 changed files with 72940 additions and 39945 deletions

View File

@ -1,4 +1,4 @@
# Docker images for GitHub CI
# Docker images for GitHub CI and CD
This directory contains everything needed to build the Docker images
that are used in our CI.
@ -12,7 +12,7 @@ each image as the `BUILD_ENVIRONMENT` environment variable.
See `build.sh` for valid build environments (it's the giant switch).
## Contents
## Docker CI builds
* `build.sh` -- dispatch script to launch all builds
* `common` -- scripts used to execute individual Docker build stages
@ -21,6 +21,12 @@ See `build.sh` for valid build environments (it's the giant switch).
* `ubuntu-rocm` -- Dockerfile for Ubuntu image with ROCm support
* `ubuntu-xpu` -- Dockerfile for Ubuntu image with XPU support
### Docker CD builds
* `conda` - Dockerfile and build.sh to build Docker images used in nightly conda builds
* `manywheel` - Dockerfile and build.sh to build Docker images used in nightly manywheel builds
* `libtorch` - Dockerfile and build.sh to build Docker images used in nightly libtorch builds
## Usage
```bash

View File

@ -407,6 +407,22 @@ case "$image" in
# from pytorch/llvm:9.0.1 is x86 specific
SKIP_LLVM_SRC_BUILD_INSTALL=yes
;;
pytorch-linux-jammy-aarch64-py3.10-gcc11-inductor-benchmarks)
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=11
ACL=yes
PROTOBUF=yes
DB=yes
VISION=yes
CONDA_CMAKE=yes
# snadampal: skipping sccache due to the following issue
# https://github.com/pytorch/pytorch/issues/121559
SKIP_SCCACHE_INSTALL=yes
# snadampal: skipping llvm src build install because the current version
# from pytorch/llvm:9.0.1 is x86 specific
SKIP_LLVM_SRC_BUILD_INSTALL=yes
INDUCTOR_BENCHMARKS=yes
;;
*)
# Catch-all for builds that are not hardcoded.
PROTOBUF=yes

View File

@ -1 +1 @@
c572f9e509b5ec5d56f4d218271e36269bba244f
48da61aa34b73ea8e2ee815a6a79eea817e361db

View File

@ -0,0 +1,5 @@
0.6b
manylinux_2_17
rocm6.1
04b5df8c8123f90cba3ede7e971e6fbc6040d506
77c29fa3f3b614e187d7213d745e989a92708cee2bc6020419ab49019af399d1

View File

@ -85,7 +85,7 @@ fi
else
CONDA_COMMON_DEPS="astunparse pyyaml mkl=2021.4.0 mkl-include=2021.4.0 setuptools"
if [ "$ANACONDA_PYTHON_VERSION" = "3.11" ] || [ "$ANACONDA_PYTHON_VERSION" = "3.12" ]; then
if [ "$ANACONDA_PYTHON_VERSION" = "3.11" ] || [ "$ANACONDA_PYTHON_VERSION" = "3.12" ] || [ "$ANACONDA_PYTHON_VERSION" = "3.13" ]; then
conda_install numpy=1.26.0 ${CONDA_COMMON_DEPS}
else
conda_install numpy=1.21.2 ${CONDA_COMMON_DEPS}

View File

@ -0,0 +1,20 @@
#!/bin/bash
# Script used only in CD pipeline
set -ex
# Anaconda
# Latest anaconda is using openssl-3 which is incompatible with all currently published versions of git
# Which are using openssl-1.1.1, see https://anaconda.org/anaconda/git/files?version=2.40.1 for example
MINICONDA_URL=https://repo.anaconda.com/miniconda/Miniconda3-py311_23.5.2-0-Linux-x86_64.sh
wget -q $MINICONDA_URL
# NB: Manually invoke bash per https://github.com/conda/conda/issues/10431
bash $(basename "$MINICONDA_URL") -b -p /opt/conda
rm $(basename "$MINICONDA_URL")
export PATH=/opt/conda/bin:$PATH
# See https://github.com/pytorch/builder/issues/1473
# Pin conda to 23.5.2 as it's the last one compatible with openssl-1.1.1
conda install -y conda=23.5.2 conda-build anaconda-client git ninja
# The cmake version here needs to match with the minimum version of cmake
# supported by PyTorch (3.18). There is only 3.18.2 on anaconda
/opt/conda/bin/pip3 install cmake==3.18.2
conda remove -y --force patchelf

View File

@ -0,0 +1,95 @@
#!/bin/bash
# Script used only in CD pipeline
set -uex -o pipefail
PYTHON_DOWNLOAD_URL=https://www.python.org/ftp/python
PYTHON_DOWNLOAD_GITHUB_BRANCH=https://github.com/python/cpython/archive/refs/heads
GET_PIP_URL=https://bootstrap.pypa.io/get-pip.py
# Python versions to be installed in /opt/$VERSION_NO
CPYTHON_VERSIONS=${CPYTHON_VERSIONS:-"3.8.1 3.9.0 3.10.1 3.11.0 3.12.0 3.13.0"}
function check_var {
if [ -z "$1" ]; then
echo "required variable not defined"
exit 1
fi
}
function do_cpython_build {
local py_ver=$1
local py_folder=$2
check_var $py_ver
check_var $py_folder
tar -xzf Python-$py_ver.tgz
pushd $py_folder
local prefix="/opt/_internal/cpython-${py_ver}"
mkdir -p ${prefix}/lib
if [[ -n $(which patchelf) ]]; then
local shared_flags="--enable-shared"
else
local shared_flags="--disable-shared"
fi
if [[ -z "${WITH_OPENSSL+x}" ]]; then
local openssl_flags=""
else
local openssl_flags="--with-openssl=${WITH_OPENSSL} --with-openssl-rpath=auto"
fi
# -Wformat added for https://bugs.python.org/issue17547 on Python 2.6
CFLAGS="-Wformat" ./configure --prefix=${prefix} ${openssl_flags} ${shared_flags} > /dev/null
make -j40 > /dev/null
make install > /dev/null
if [[ "${shared_flags}" == "--enable-shared" ]]; then
patchelf --set-rpath '$ORIGIN/../lib' ${prefix}/bin/python3
fi
popd
rm -rf $py_folder
# Some python's install as bin/python3. Make them available as
# bin/python.
if [ -e ${prefix}/bin/python3 ]; then
ln -s python3 ${prefix}/bin/python
fi
${prefix}/bin/python get-pip.py
if [ -e ${prefix}/bin/pip3 ] && [ ! -e ${prefix}/bin/pip ]; then
ln -s pip3 ${prefix}/bin/pip
fi
${prefix}/bin/pip install wheel==0.34.2
local abi_tag=$(${prefix}/bin/python -c "from wheel.pep425tags import get_abbr_impl, get_impl_ver, get_abi_tag; print('{0}{1}-{2}'.format(get_abbr_impl(), get_impl_ver(), get_abi_tag()))")
ln -s ${prefix} /opt/python/${abi_tag}
}
function build_cpython {
local py_ver=$1
check_var $py_ver
check_var $PYTHON_DOWNLOAD_URL
local py_ver_folder=$py_ver
if [ "$py_ver" = "3.13.0" ]; then
PY_VER_SHORT="3.13"
check_var $PYTHON_DOWNLOAD_GITHUB_BRANCH
wget $PYTHON_DOWNLOAD_GITHUB_BRANCH/$PY_VER_SHORT.tar.gz -O Python-$py_ver.tgz
do_cpython_build $py_ver cpython-$PY_VER_SHORT
else
wget -q $PYTHON_DOWNLOAD_URL/$py_ver_folder/Python-$py_ver.tgz
do_cpython_build $py_ver Python-$py_ver
fi
rm -f Python-$py_ver.tgz
}
function build_cpythons {
check_var $GET_PIP_URL
curl -sLO $GET_PIP_URL
for py_ver in $@; do
build_cpython $py_ver
done
rm -f get-pip.py
}
mkdir -p /opt/python
mkdir -p /opt/_internal
build_cpythons $CPYTHON_VERSIONS

View File

@ -0,0 +1,239 @@
#!/bin/bash
set -ex
NCCL_VERSION=v2.21.5-1
CUDNN_VERSION=9.1.0.70
function install_cusparselt_040 {
# cuSparseLt license: https://docs.nvidia.com/cuda/cusparselt/license.html
mkdir tmp_cusparselt && pushd tmp_cusparselt
wget -q https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-x86_64/libcusparse_lt-linux-x86_64-0.4.0.7-archive.tar.xz
tar xf libcusparse_lt-linux-x86_64-0.4.0.7-archive.tar.xz
cp -a libcusparse_lt-linux-x86_64-0.4.0.7-archive/include/* /usr/local/cuda/include/
cp -a libcusparse_lt-linux-x86_64-0.4.0.7-archive/lib/* /usr/local/cuda/lib64/
popd
rm -rf tmp_cusparselt
}
function install_cusparselt_052 {
# cuSparseLt license: https://docs.nvidia.com/cuda/cusparselt/license.html
mkdir tmp_cusparselt && pushd tmp_cusparselt
wget -q https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-x86_64/libcusparse_lt-linux-x86_64-0.5.2.1-archive.tar.xz
tar xf libcusparse_lt-linux-x86_64-0.5.2.1-archive.tar.xz
cp -a libcusparse_lt-linux-x86_64-0.5.2.1-archive/include/* /usr/local/cuda/include/
cp -a libcusparse_lt-linux-x86_64-0.5.2.1-archive/lib/* /usr/local/cuda/lib64/
popd
rm -rf tmp_cusparselt
}
function install_118 {
echo "Installing CUDA 11.8 and cuDNN ${CUDNN_VERSION} and NCCL ${NCCL_VERSION} and cuSparseLt-0.4.0"
rm -rf /usr/local/cuda-11.8 /usr/local/cuda
# install CUDA 11.8.0 in the same container
wget -q https://developer.download.nvidia.com/compute/cuda/11.8.0/local_installers/cuda_11.8.0_520.61.05_linux.run
chmod +x cuda_11.8.0_520.61.05_linux.run
./cuda_11.8.0_520.61.05_linux.run --toolkit --silent
rm -f cuda_11.8.0_520.61.05_linux.run
rm -f /usr/local/cuda && ln -s /usr/local/cuda-11.8 /usr/local/cuda
# cuDNN license: https://developer.nvidia.com/cudnn/license_agreement
mkdir tmp_cudnn && cd tmp_cudnn
wget -q https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/cudnn-linux-x86_64-${CUDNN_VERSION}_cuda11-archive.tar.xz -O cudnn-linux-x86_64-${CUDNN_VERSION}_cuda11-archive.tar.xz
tar xf cudnn-linux-x86_64-${CUDNN_VERSION}_cuda11-archive.tar.xz
cp -a cudnn-linux-x86_64-${CUDNN_VERSION}_cuda11-archive/include/* /usr/local/cuda/include/
cp -a cudnn-linux-x86_64-${CUDNN_VERSION}_cuda11-archive/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf tmp_cudnn
# NCCL license: https://docs.nvidia.com/deeplearning/nccl/#licenses
# Follow build: https://github.com/NVIDIA/nccl/tree/master?tab=readme-ov-file#build
git clone -b $NCCL_VERSION --depth 1 https://github.com/NVIDIA/nccl.git
cd nccl && make -j src.build
cp -a build/include/* /usr/local/cuda/include/
cp -a build/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf nccl
install_cusparselt_040
ldconfig
}
function install_121 {
echo "Installing CUDA 12.1 and cuDNN ${CUDNN_VERSION} and NCCL ${NCCL_VERSION} and cuSparseLt-0.5.2"
rm -rf /usr/local/cuda-12.1 /usr/local/cuda
# install CUDA 12.1.0 in the same container
wget -q https://developer.download.nvidia.com/compute/cuda/12.1.1/local_installers/cuda_12.1.1_530.30.02_linux.run
chmod +x cuda_12.1.1_530.30.02_linux.run
./cuda_12.1.1_530.30.02_linux.run --toolkit --silent
rm -f cuda_12.1.1_530.30.02_linux.run
rm -f /usr/local/cuda && ln -s /usr/local/cuda-12.1 /usr/local/cuda
# cuDNN license: https://developer.nvidia.com/cudnn/license_agreement
mkdir tmp_cudnn && cd tmp_cudnn
wget -q https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz -O cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz
tar xf cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz
cp -a cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive/include/* /usr/local/cuda/include/
cp -a cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf tmp_cudnn
# NCCL license: https://docs.nvidia.com/deeplearning/nccl/#licenses
# Follow build: https://github.com/NVIDIA/nccl/tree/master?tab=readme-ov-file#build
git clone -b $NCCL_VERSION --depth 1 https://github.com/NVIDIA/nccl.git
cd nccl && make -j src.build
cp -a build/include/* /usr/local/cuda/include/
cp -a build/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf nccl
install_cusparselt_052
ldconfig
}
function install_124 {
echo "Installing CUDA 12.4 and cuDNN ${CUDNN_VERSION} and NCCL ${NCCL_VERSION} and cuSparseLt-0.5.2"
rm -rf /usr/local/cuda-12.4 /usr/local/cuda
# install CUDA 12.4.0 in the same container
wget -q https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda_12.4.0_550.54.14_linux.run
chmod +x cuda_12.4.0_550.54.14_linux.run
./cuda_12.4.0_550.54.14_linux.run --toolkit --silent
rm -f cuda_12.4.0_550.54.14_linux.run
rm -f /usr/local/cuda && ln -s /usr/local/cuda-12.4 /usr/local/cuda
# cuDNN license: https://developer.nvidia.com/cudnn/license_agreement
mkdir tmp_cudnn && cd tmp_cudnn
wget -q https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz -O cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz
tar xf cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz
cp -a cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive/include/* /usr/local/cuda/include/
cp -a cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf tmp_cudnn
# NCCL license: https://docs.nvidia.com/deeplearning/nccl/#licenses
# Follow build: https://github.com/NVIDIA/nccl/tree/master?tab=readme-ov-file#build
git clone -b $NCCL_VERSION --depth 1 https://github.com/NVIDIA/nccl.git
cd nccl && make -j src.build
cp -a build/include/* /usr/local/cuda/include/
cp -a build/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf nccl
install_cusparselt_052
ldconfig
}
function prune_118 {
echo "Pruning CUDA 11.8 and cuDNN"
#####################################################################################
# CUDA 11.8 prune static libs
#####################################################################################
export NVPRUNE="/usr/local/cuda-11.8/bin/nvprune"
export CUDA_LIB_DIR="/usr/local/cuda-11.8/lib64"
export GENCODE="-gencode arch=compute_35,code=sm_35 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"
export GENCODE_CUDNN="-gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"
if [[ -n "$OVERRIDE_GENCODE" ]]; then
export GENCODE=$OVERRIDE_GENCODE
fi
# all CUDA libs except CuDNN and CuBLAS (cudnn and cublas need arch 3.7 included)
ls $CUDA_LIB_DIR/ | grep "\.a" | grep -v "culibos" | grep -v "cudart" | grep -v "cudnn" | grep -v "cublas" | grep -v "metis" \
| xargs -I {} bash -c \
"echo {} && $NVPRUNE $GENCODE $CUDA_LIB_DIR/{} -o $CUDA_LIB_DIR/{}"
# prune CuDNN and CuBLAS
$NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublas_static.a -o $CUDA_LIB_DIR/libcublas_static.a
$NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublasLt_static.a -o $CUDA_LIB_DIR/libcublasLt_static.a
#####################################################################################
# CUDA 11.8 prune visual tools
#####################################################################################
export CUDA_BASE="/usr/local/cuda-11.8/"
rm -rf $CUDA_BASE/libnvvp $CUDA_BASE/nsightee_plugins $CUDA_BASE/nsight-compute-2022.3.0 $CUDA_BASE/nsight-systems-2022.4.2/
}
function prune_121 {
echo "Pruning CUDA 12.1"
#####################################################################################
# CUDA 12.1 prune static libs
#####################################################################################
export NVPRUNE="/usr/local/cuda-12.1/bin/nvprune"
export CUDA_LIB_DIR="/usr/local/cuda-12.1/lib64"
export GENCODE="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"
export GENCODE_CUDNN="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"
if [[ -n "$OVERRIDE_GENCODE" ]]; then
export GENCODE=$OVERRIDE_GENCODE
fi
# all CUDA libs except CuDNN and CuBLAS
ls $CUDA_LIB_DIR/ | grep "\.a" | grep -v "culibos" | grep -v "cudart" | grep -v "cudnn" | grep -v "cublas" | grep -v "metis" \
| xargs -I {} bash -c \
"echo {} && $NVPRUNE $GENCODE $CUDA_LIB_DIR/{} -o $CUDA_LIB_DIR/{}"
# prune CuDNN and CuBLAS
$NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublas_static.a -o $CUDA_LIB_DIR/libcublas_static.a
$NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublasLt_static.a -o $CUDA_LIB_DIR/libcublasLt_static.a
#####################################################################################
# CUDA 12.1 prune visual tools
#####################################################################################
export CUDA_BASE="/usr/local/cuda-12.1/"
rm -rf $CUDA_BASE/libnvvp $CUDA_BASE/nsightee_plugins $CUDA_BASE/nsight-compute-2023.1.0 $CUDA_BASE/nsight-systems-2023.1.2/
}
function prune_124 {
echo "Pruning CUDA 12.4"
#####################################################################################
# CUDA 12.4 prune static libs
#####################################################################################
export NVPRUNE="/usr/local/cuda-12.4/bin/nvprune"
export CUDA_LIB_DIR="/usr/local/cuda-12.4/lib64"
export GENCODE="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"
export GENCODE_CUDNN="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"
if [[ -n "$OVERRIDE_GENCODE" ]]; then
export GENCODE=$OVERRIDE_GENCODE
fi
if [[ -n "$OVERRIDE_GENCODE_CUDNN" ]]; then
export GENCODE_CUDNN=$OVERRIDE_GENCODE_CUDNN
fi
# all CUDA libs except CuDNN and CuBLAS
ls $CUDA_LIB_DIR/ | grep "\.a" | grep -v "culibos" | grep -v "cudart" | grep -v "cudnn" | grep -v "cublas" | grep -v "metis" \
| xargs -I {} bash -c \
"echo {} && $NVPRUNE $GENCODE $CUDA_LIB_DIR/{} -o $CUDA_LIB_DIR/{}"
# prune CuDNN and CuBLAS
$NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublas_static.a -o $CUDA_LIB_DIR/libcublas_static.a
$NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublasLt_static.a -o $CUDA_LIB_DIR/libcublasLt_static.a
#####################################################################################
# CUDA 12.1 prune visual tools
#####################################################################################
export CUDA_BASE="/usr/local/cuda-12.4/"
rm -rf $CUDA_BASE/libnvvp $CUDA_BASE/nsightee_plugins $CUDA_BASE/nsight-compute-2024.1.0 $CUDA_BASE/nsight-systems-2023.4.4/
}
# idiomatic parameter and option handling in sh
while test $# -gt 0
do
case "$1" in
11.8) install_118; prune_118
;;
12.1) install_121; prune_121
;;
12.4) install_124; prune_124
;;
*) echo "bad argument $1"; exit 1
;;
esac
shift
done

View File

@ -0,0 +1,93 @@
#!/bin/bash
# Script used only in CD pipeline
set -ex
NCCL_VERSION=v2.21.5-1
function install_cusparselt_052 {
# cuSparseLt license: https://docs.nvidia.com/cuda/cusparselt/license.html
mkdir tmp_cusparselt && pushd tmp_cusparselt
wget -q https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-sbsa/libcusparse_lt-linux-sbsa-0.5.2.1-archive.tar.xz
tar xf libcusparse_lt-linux-sbsa-0.5.2.1-archive.tar.xz
cp -a libcusparse_lt-linux-sbsa-0.5.2.1-archive/include/* /usr/local/cuda/include/
cp -a libcusparse_lt-linux-sbsa-0.5.2.1-archive/lib/* /usr/local/cuda/lib64/
popd
rm -rf tmp_cusparselt
}
function install_124 {
echo "Installing CUDA 12.4 and cuDNN 9.1 and NCCL ${NCCL_VERSION} and cuSparseLt-0.5.2"
rm -rf /usr/local/cuda-12.4 /usr/local/cuda
# install CUDA 12.4.0 in the same container
wget -q https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda_12.4.0_550.54.14_linux_sbsa.run
chmod +x cuda_12.4.0_550.54.14_linux_sbsa.run
./cuda_12.4.0_550.54.14_linux_sbsa.run --toolkit --silent
rm -f cuda_12.4.0_550.54.14_linux_sbsa.run
rm -f /usr/local/cuda && ln -s /usr/local/cuda-12.4 /usr/local/cuda
# cuDNN license: https://developer.nvidia.com/cudnn/license_agreement
mkdir tmp_cudnn && cd tmp_cudnn
wget -q https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-sbsa/cudnn-linux-sbsa-9.1.0.70_cuda12-archive.tar.xz -O cudnn-linux-sbsa-9.1.0.70_cuda12-archive.tar.xz
tar xf cudnn-linux-sbsa-9.1.0.70_cuda12-archive.tar.xz
cp -a cudnn-linux-sbsa-9.1.0.70_cuda12-archive/include/* /usr/local/cuda/include/
cp -a cudnn-linux-sbsa-9.1.0.70_cuda12-archive/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf tmp_cudnn
# NCCL license: https://docs.nvidia.com/deeplearning/nccl/#licenses
# Follow build: https://github.com/NVIDIA/nccl/tree/master?tab=readme-ov-file#build
git clone -b ${NCCL_VERSION} --depth 1 https://github.com/NVIDIA/nccl.git
cd nccl && make -j src.build
cp -a build/include/* /usr/local/cuda/include/
cp -a build/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf nccl
install_cusparselt_052
ldconfig
}
function prune_124 {
echo "Pruning CUDA 12.4"
#####################################################################################
# CUDA 12.4 prune static libs
#####################################################################################
export NVPRUNE="/usr/local/cuda-12.4/bin/nvprune"
export CUDA_LIB_DIR="/usr/local/cuda-12.4/lib64"
export GENCODE="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"
export GENCODE_CUDNN="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"
if [[ -n "$OVERRIDE_GENCODE" ]]; then
export GENCODE=$OVERRIDE_GENCODE
fi
# all CUDA libs except CuDNN and CuBLAS
ls $CUDA_LIB_DIR/ | grep "\.a" | grep -v "culibos" | grep -v "cudart" | grep -v "cudnn" | grep -v "cublas" | grep -v "metis" \
| xargs -I {} bash -c \
"echo {} && $NVPRUNE $GENCODE $CUDA_LIB_DIR/{} -o $CUDA_LIB_DIR/{}"
# prune CuDNN and CuBLAS
$NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublas_static.a -o $CUDA_LIB_DIR/libcublas_static.a
$NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublasLt_static.a -o $CUDA_LIB_DIR/libcublasLt_static.a
#####################################################################################
# CUDA 12.1 prune visual tools
#####################################################################################
export CUDA_BASE="/usr/local/cuda-12.4/"
rm -rf $CUDA_BASE/libnvvp $CUDA_BASE/nsightee_plugins $CUDA_BASE/nsight-compute-2024.1.0 $CUDA_BASE/nsight-systems-2023.4.4/
}
# idiomatic parameter and option handling in sh
while test $# -gt 0
do
case "$1" in
12.4) install_124; prune_124
;;
*) echo "bad argument $1"; exit 1
;;
esac
shift
done

View File

@ -0,0 +1,23 @@
#!/bin/bash
# Script used only in CD pipeline
set -ex
LIBPNG_VERSION=1.6.37
mkdir -p libpng
pushd libpng
wget http://download.sourceforge.net/libpng/libpng-$LIBPNG_VERSION.tar.gz
tar -xvzf libpng-$LIBPNG_VERSION.tar.gz
pushd libpng-$LIBPNG_VERSION
./configure
make
make install
popd
popd
rm -rf libpng

View File

@ -0,0 +1,29 @@
#!/usr/bin/env bash
# Script used only in CD pipeline
set -eou pipefail
MAGMA_VERSION="2.5.2"
function do_install() {
cuda_version=$1
cuda_version_nodot=${1/./}
MAGMA_VERSION="2.6.1"
magma_archive="magma-cuda${cuda_version_nodot}-${MAGMA_VERSION}-1.tar.bz2"
cuda_dir="/usr/local/cuda-${cuda_version}"
(
set -x
tmp_dir=$(mktemp -d)
pushd ${tmp_dir}
curl -OLs https://anaconda.org/pytorch/magma-cuda${cuda_version_nodot}/${MAGMA_VERSION}/download/linux-64/${magma_archive}
tar -xvf "${magma_archive}"
mkdir -p "${cuda_dir}/magma"
mv include "${cuda_dir}/magma/include"
mv lib "${cuda_dir}/magma/lib"
popd
)
}
do_install $1

View File

@ -0,0 +1,134 @@
#!/bin/bash
# Script used only in CD pipeline
set -ex
ROCM_VERSION=$1
if [[ -z $ROCM_VERSION ]]; then
echo "missing ROCM_VERSION"
exit 1;
fi
# To make version comparison easier, create an integer representation.
save_IFS="$IFS"
IFS=. ROCM_VERSION_ARRAY=(${ROCM_VERSION})
IFS="$save_IFS"
if [[ ${#ROCM_VERSION_ARRAY[@]} == 2 ]]; then
ROCM_VERSION_MAJOR=${ROCM_VERSION_ARRAY[0]}
ROCM_VERSION_MINOR=${ROCM_VERSION_ARRAY[1]}
ROCM_VERSION_PATCH=0
elif [[ ${#ROCM_VERSION_ARRAY[@]} == 3 ]]; then
ROCM_VERSION_MAJOR=${ROCM_VERSION_ARRAY[0]}
ROCM_VERSION_MINOR=${ROCM_VERSION_ARRAY[1]}
ROCM_VERSION_PATCH=${ROCM_VERSION_ARRAY[2]}
else
echo "Unhandled ROCM_VERSION ${ROCM_VERSION}"
exit 1
fi
ROCM_INT=$(($ROCM_VERSION_MAJOR * 10000 + $ROCM_VERSION_MINOR * 100 + $ROCM_VERSION_PATCH))
# Install custom MIOpen + COMgr for ROCm >= 4.0.1
if [[ $ROCM_INT -lt 40001 ]]; then
echo "ROCm version < 4.0.1; will not install custom MIOpen"
exit 0
fi
# Function to retry functions that sometimes timeout or have flaky failures
retry () {
$* || (sleep 1 && $*) || (sleep 2 && $*) || (sleep 4 && $*) || (sleep 8 && $*)
}
# Build custom MIOpen to use comgr for offline compilation.
## Need a sanitized ROCM_VERSION without patchlevel; patchlevel version 0 must be added to paths.
ROCM_DOTS=$(echo ${ROCM_VERSION} | tr -d -c '.' | wc -c)
if [[ ${ROCM_DOTS} == 1 ]]; then
ROCM_VERSION_NOPATCH="${ROCM_VERSION}"
ROCM_INSTALL_PATH="/opt/rocm-${ROCM_VERSION}.0"
else
ROCM_VERSION_NOPATCH="${ROCM_VERSION%.*}"
ROCM_INSTALL_PATH="/opt/rocm-${ROCM_VERSION}"
fi
# MIOPEN_USE_HIP_KERNELS is a Workaround for COMgr issues
MIOPEN_CMAKE_COMMON_FLAGS="
-DMIOPEN_USE_COMGR=ON
-DMIOPEN_BUILD_DRIVER=OFF
"
# Pull MIOpen repo and set DMIOPEN_EMBED_DB based on ROCm version
if [[ $ROCM_INT -ge 60100 ]] && [[ $ROCM_INT -lt 60200 ]]; then
echo "ROCm 6.1 MIOpen does not need any patches, do not build from source"
exit 0
elif [[ $ROCM_INT -ge 60000 ]] && [[ $ROCM_INT -lt 60100 ]]; then
echo "ROCm 6.0 MIOpen does not need any patches, do not build from source"
exit 0
elif [[ $ROCM_INT -ge 50700 ]] && [[ $ROCM_INT -lt 60000 ]]; then
echo "ROCm 5.7 MIOpen does not need any patches, do not build from source"
exit 0
elif [[ $ROCM_INT -ge 50600 ]] && [[ $ROCM_INT -lt 50700 ]]; then
MIOPEN_BRANCH="release/rocm-rel-5.6-staging"
elif [[ $ROCM_INT -ge 50500 ]] && [[ $ROCM_INT -lt 50600 ]]; then
MIOPEN_BRANCH="release/rocm-rel-5.5-gfx11"
elif [[ $ROCM_INT -ge 50400 ]] && [[ $ROCM_INT -lt 50500 ]]; then
MIOPEN_CMAKE_DB_FLAGS="-DMIOPEN_EMBED_DB=gfx900_56;gfx906_60;gfx90878;gfx90a6e;gfx1030_36 -DMIOPEN_USE_MLIR=Off"
MIOPEN_BRANCH="release/rocm-rel-5.4-staging"
elif [[ $ROCM_INT -ge 50300 ]] && [[ $ROCM_INT -lt 50400 ]]; then
MIOPEN_CMAKE_DB_FLAGS="-DMIOPEN_EMBED_DB=gfx900_56;gfx906_60;gfx90878;gfx90a6e;gfx1030_36 -DMIOPEN_USE_MLIR=Off"
MIOPEN_BRANCH="release/rocm-rel-5.3-staging"
elif [[ $ROCM_INT -ge 50200 ]] && [[ $ROCM_INT -lt 50300 ]]; then
MIOPEN_CMAKE_DB_FLAGS="-DMIOPEN_EMBED_DB=gfx900_56;gfx906_60;gfx90878;gfx90a6e;gfx1030_36 -DMIOPEN_USE_MLIR=Off"
MIOPEN_BRANCH="release/rocm-rel-5.2-staging"
elif [[ $ROCM_INT -ge 50100 ]] && [[ $ROCM_INT -lt 50200 ]]; then
MIOPEN_CMAKE_DB_FLAGS="-DMIOPEN_EMBED_DB=gfx900_56;gfx906_60;gfx90878;gfx90a6e;gfx1030_36"
MIOPEN_BRANCH="release/rocm-rel-5.1-staging"
elif [[ $ROCM_INT -ge 50000 ]] && [[ $ROCM_INT -lt 50100 ]]; then
MIOPEN_CMAKE_DB_FLAGS="-DMIOPEN_EMBED_DB=gfx900_56;gfx906_60;gfx90878;gfx90a6e;gfx1030_36"
MIOPEN_BRANCH="release/rocm-rel-5.0-staging"
else
echo "Unhandled ROCM_VERSION ${ROCM_VERSION}"
exit 1
fi
yum remove -y miopen-hip
git clone https://github.com/ROCm/MIOpen -b ${MIOPEN_BRANCH}
pushd MIOpen
# remove .git to save disk space since CI runner was running out
rm -rf .git
# Don't build MLIR to save docker build time
# since we are disabling MLIR backend for MIOpen anyway
if [[ $ROCM_INT -ge 50400 ]] && [[ $ROCM_INT -lt 50500 ]]; then
sed -i '/rocMLIR/d' requirements.txt
elif [[ $ROCM_INT -ge 50200 ]] && [[ $ROCM_INT -lt 50400 ]]; then
sed -i '/llvm-project-mlir/d' requirements.txt
fi
## MIOpen minimum requirements
cmake -P install_deps.cmake --minimum
# clean up since CI runner was running out of disk space
rm -rf /tmp/*
yum clean all
rm -rf /var/cache/yum
rm -rf /var/lib/yum/yumdb
rm -rf /var/lib/yum/history
## Build MIOpen
mkdir -p build
cd build
PKG_CONFIG_PATH=/usr/local/lib/pkgconfig CXX=${ROCM_INSTALL_PATH}/llvm/bin/clang++ cmake .. \
${MIOPEN_CMAKE_COMMON_FLAGS} \
${MIOPEN_CMAKE_DB_FLAGS} \
-DCMAKE_PREFIX_PATH="${ROCM_INSTALL_PATH}/hip;${ROCM_INSTALL_PATH}"
make MIOpen -j $(nproc)
# Build MIOpen package
make -j $(nproc) package
# clean up since CI runner was running out of disk space
rm -rf /usr/local/cget
yum install -y miopen-*.rpm
popd
rm -rf MIOpen

View File

@ -0,0 +1,16 @@
#!/bin/bash
set -ex
# MKL
MKL_VERSION=2024.2.0
MKLROOT=/opt/intel
mkdir -p ${MKLROOT}
pushd /tmp
python3 -mpip install wheel
python3 -mpip download -d . mkl-static==${MKL_VERSION}
python3 -m wheel unpack mkl_static-${MKL_VERSION}-py2.py3-none-manylinux1_x86_64.whl
python3 -m wheel unpack mkl_include-${MKL_VERSION}-py2.py3-none-manylinux1_x86_64.whl
mv mkl_static-${MKL_VERSION}/mkl_static-${MKL_VERSION}.data/data/lib ${MKLROOT}
mv mkl_include-${MKL_VERSION}/mkl_include-${MKL_VERSION}.data/data/include ${MKLROOT}
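# After the two moves above, the static MKL libraries live under ${MKLROOT}/lib and the
# headers under ${MKLROOT}/include, which is the layout consumers of MKLROOT elsewhere in
# these images (e.g. the magma build) expect.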

View File

@ -0,0 +1,13 @@
#!/bin/bash
# Script used only in CD pipeline
set -ex
mkdir -p /usr/local/mnist/
cd /usr/local/mnist
for img in train-images-idx3-ubyte.gz train-labels-idx1-ubyte.gz t10k-images-idx3-ubyte.gz t10k-labels-idx1-ubyte.gz; do
wget -q https://ossci-datasets.s3.amazonaws.com/mnist/$img
gzip -d $img
done

View File

@ -0,0 +1,22 @@
#!/bin/bash
# Script used only in CD pipeline
set -ex
cd /
git clone https://github.com/OpenMathLib/OpenBLAS.git -b v0.3.25 --depth 1 --shallow-submodules
OPENBLAS_BUILD_FLAGS="
NUM_THREADS=128
USE_OPENMP=1
NO_SHARED=0
DYNAMIC_ARCH=1
TARGET=ARMV8
CFLAGS=-O3
"
OPENBLAS_CHECKOUT_DIR="OpenBLAS"
make -j8 ${OPENBLAS_BUILD_FLAGS} -C ${OPENBLAS_CHECKOUT_DIR}
make -j8 ${OPENBLAS_BUILD_FLAGS} install -C ${OPENBLAS_CHECKOUT_DIR}

View File

@ -0,0 +1,16 @@
#!/bin/bash
# Script used only in CD pipeline
set -ex
# Pin the version to the latest release, 0.17.2; building a newer commit starts
# to fail on the current image
git clone -b 0.17.2 --single-branch https://github.com/NixOS/patchelf
cd patchelf
sed -i 's/serial/parallel/g' configure.ac
./bootstrap.sh
./configure
make
make install
cd ..
rm -rf patchelf

View File

@ -0,0 +1,150 @@
#!/bin/bash
# Script used only in CD pipeline
###########################
### prereqs
###########################
# Install Python packages depending on the base OS
ID=$(grep -oP '(?<=^ID=).+' /etc/os-release | tr -d '"')
case "$ID" in
ubuntu)
apt-get update -y
apt-get install -y libpciaccess-dev pkg-config
apt-get clean
;;
centos)
yum install -y libpciaccess-devel pkgconfig
;;
*)
echo "Unable to determine OS..."
exit 1
;;
esac
python3 -m pip install meson ninja
###########################
### clone repo
###########################
GIT_SSL_NO_VERIFY=true git clone https://gitlab.freedesktop.org/mesa/drm.git
pushd drm
###########################
### patch
###########################
patch -p1 <<'EOF'
diff --git a/amdgpu/amdgpu_asic_id.c b/amdgpu/amdgpu_asic_id.c
index a5007ffc..13fa07fc 100644
--- a/amdgpu/amdgpu_asic_id.c
+++ b/amdgpu/amdgpu_asic_id.c
@@ -22,6 +22,13 @@
*
*/
+#define _XOPEN_SOURCE 700
+#define _LARGEFILE64_SOURCE
+#define _FILE_OFFSET_BITS 64
+#include <ftw.h>
+#include <link.h>
+#include <limits.h>
+
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
@@ -34,6 +41,19 @@
#include "amdgpu_drm.h"
#include "amdgpu_internal.h"
+static char *amdgpuids_path = NULL;
+static const char* amdgpuids_path_msg = NULL;
+
+static int check_for_location_of_amdgpuids(const char *filepath, const struct stat *info, const int typeflag, struct FTW *pathinfo)
+{
+ if (typeflag == FTW_F && strstr(filepath, "amdgpu.ids")) {
+ amdgpuids_path = strdup(filepath);
+ return 1;
+ }
+
+ return 0;
+}
+
static int parse_one_line(struct amdgpu_device *dev, const char *line)
{
char *buf, *saveptr;
@@ -113,10 +133,46 @@ void amdgpu_parse_asic_ids(struct amdgpu_device *dev)
int line_num = 1;
int r = 0;
+ // attempt to find typical location for amdgpu.ids file
fp = fopen(AMDGPU_ASIC_ID_TABLE, "r");
+
+ // if it doesn't exist, search
+ if (!fp) {
+
+ char self_path[ PATH_MAX ];
+ ssize_t count;
+ ssize_t i;
+
+ count = readlink( "/proc/self/exe", self_path, PATH_MAX );
+ if (count > 0) {
+ self_path[count] = '\0';
+
+ // remove '/bin/python' from self_path
+ for (i=count; i>0; --i) {
+ if (self_path[i] == '/') break;
+ self_path[i] = '\0';
+ }
+ self_path[i] = '\0';
+ for (; i>0; --i) {
+ if (self_path[i] == '/') break;
+ self_path[i] = '\0';
+ }
+ self_path[i] = '\0';
+
+ if (1 == nftw(self_path, check_for_location_of_amdgpuids, 5, FTW_PHYS)) {
+ fp = fopen(amdgpuids_path, "r");
+ amdgpuids_path_msg = amdgpuids_path;
+ }
+ }
+
+ }
+ else {
+ amdgpuids_path_msg = AMDGPU_ASIC_ID_TABLE;
+ }
+
+ // both hard-coded location and search have failed
if (!fp) {
- fprintf(stderr, "%s: %s\n", AMDGPU_ASIC_ID_TABLE,
- strerror(errno));
+ fprintf(stderr, "amdgpu.ids: No such file or directory\n");
return;
}
@@ -132,7 +188,7 @@ void amdgpu_parse_asic_ids(struct amdgpu_device *dev)
continue;
}
- drmMsg("%s version: %s\n", AMDGPU_ASIC_ID_TABLE, line);
+ drmMsg("%s version: %s\n", amdgpuids_path_msg, line);
break;
}
@@ -150,7 +206,7 @@ void amdgpu_parse_asic_ids(struct amdgpu_device *dev)
if (r == -EINVAL) {
fprintf(stderr, "Invalid format: %s: line %d: %s\n",
- AMDGPU_ASIC_ID_TABLE, line_num, line);
+ amdgpuids_path_msg, line_num, line);
} else if (r && r != -EAGAIN) {
fprintf(stderr, "%s: Cannot parse ASIC IDs: %s\n",
__func__, strerror(-r));
EOF
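# In short, the patch above makes libdrm fall back to locating amdgpu.ids by walking the
# directory tree next to the running executable (via nftw) when the hard-coded
# AMDGPU_ASIC_ID_TABLE path does not exist, and reports whichever path was actually used.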
###########################
### build
###########################
meson builddir --prefix=/opt/amdgpu
pushd builddir
ninja install
popd
popd

View File

@ -1,7 +1,11 @@
#!/bin/bash
# Script used in CI and CD pipeline
set -ex
MKLROOT=${MKLROOT:-/opt/conda/envs/py_$ANACONDA_PYTHON_VERSION}
# "install" hipMAGMA into /opt/rocm/magma by copying after build
git clone https://bitbucket.org/icl/magma.git
pushd magma
@ -11,7 +15,10 @@ git checkout a1625ff4d9bc362906bd01f805dbbe12612953f6
cp make.inc-examples/make.inc.hip-gcc-mkl make.inc
echo 'LIBDIR += -L$(MKLROOT)/lib' >> make.inc
echo 'LIB += -Wl,--enable-new-dtags -Wl,--rpath,/opt/rocm/lib -Wl,--rpath,$(MKLROOT)/lib -Wl,--rpath,/opt/rocm/magma/lib' >> make.inc
if [[ -f "${MKLROOT}/lib/libmkl_core.a" ]]; then
echo 'LIB = -Wl,--start-group -lmkl_gf_lp64 -lmkl_gnu_thread -lmkl_core -Wl,--end-group -lpthread -lstdc++ -lm -lgomp -lhipblas -lhipsparse' >> make.inc
fi
echo 'LIB += -Wl,--enable-new-dtags -Wl,--rpath,/opt/rocm/lib -Wl,--rpath,$(MKLROOT)/lib -Wl,--rpath,/opt/rocm/magma/lib -ldl' >> make.inc
echo 'DEVCCFLAGS += --gpu-max-threads-per-block=256' >> make.inc
export PATH="${PATH}:/opt/rocm/bin"
if [[ -n "$PYTORCH_ROCM_ARCH" ]]; then
@ -25,7 +32,7 @@ done
# hipcc with openmp flag may cause isnan() on __device__ not to be found; depending on context, compiler may attempt to match with host definition
sed -i 's/^FOPENMP/#FOPENMP/g' make.inc
make -f make.gen.hipMAGMA -j $(nproc)
LANG=C.UTF-8 make lib/libmagma.so -j $(nproc) MKLROOT=/opt/conda/envs/py_$ANACONDA_PYTHON_VERSION
make testing/testing_dgemm -j $(nproc) MKLROOT=/opt/conda/envs/py_$ANACONDA_PYTHON_VERSION
LANG=C.UTF-8 make lib/libmagma.so -j $(nproc) MKLROOT="${MKLROOT}"
make testing/testing_dgemm -j $(nproc) MKLROOT="${MKLROOT}"
popd
mv magma /opt/rocm

View File

@ -1,6 +1,6 @@
#!/bin/bash
set -xe
# Script used in CI and CD pipeline
# Intel® software for general purpose GPU capabilities.
# Refer to https://www.intel.com/content/www/us/en/developer/articles/tool/pytorch-prerequisites-for-intel-gpus.html
@ -8,19 +8,23 @@ set -xe
# Users should update to the latest version as it becomes available
function install_ubuntu() {
. /etc/os-release
if [[ ! " jammy " =~ " ${VERSION_CODENAME} " ]]; then
echo "Ubuntu version ${VERSION_CODENAME} not supported"
exit
fi
apt-get update -y
apt-get install -y gpg-agent wget
# Set up the repository. To do this, download the key to the system keyring
# To add the online network package repository for the GPU Driver LTS releases
wget -qO - https://repositories.intel.com/gpu/intel-graphics.key \
| gpg --dearmor --output /usr/share/keyrings/intel-graphics.gpg
wget -qO - https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB \
| gpg --dearmor --output /usr/share/keyrings/intel-for-pytorch-gpu-dev-keyring.gpg
# Add the signed entry to APT sources and configure the APT client to use the Intel repository
| gpg --yes --dearmor --output /usr/share/keyrings/intel-graphics.gpg
echo "deb [arch=amd64 signed-by=/usr/share/keyrings/intel-graphics.gpg] \
https://repositories.intel.com/gpu/ubuntu jammy/lts/2350 unified" \
| tee /etc/apt/sources.list.d/intel-gpu-jammy.list
https://repositories.intel.com/gpu/ubuntu ${VERSION_CODENAME}/lts/2350 unified" \
| tee /etc/apt/sources.list.d/intel-gpu-${VERSION_CODENAME}.list
# To add the online network package repository for the Intel Support Packages
wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB \
| gpg --dearmor > /usr/share/keyrings/intel-for-pytorch-gpu-dev-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/intel-for-pytorch-gpu-dev-keyring.gpg] \
https://apt.repos.intel.com/intel-for-pytorch-gpu-dev all main" \
| tee /etc/apt/sources.list.d/intel-for-pytorch-gpu-dev.list
@ -97,6 +101,86 @@ EOF
rm -rf /var/lib/yum/history
}
function install_rhel() {
. /etc/os-release
if [[ "${ID}" == "rhel" ]]; then
if [[ ! " 8.6 8.8 8.9 9.0 9.2 9.3 " =~ " ${VERSION_ID} " ]]; then
echo "RHEL version ${VERSION_ID} not supported"
exit
fi
elif [[ "${ID}" == "almalinux" ]]; then
# Workaround for almalinux8, which is used by quay.io/pypa/manylinux_2_28_x86_64
VERSION_ID="8.6"
fi
dnf install -y 'dnf-command(config-manager)'
# To add the online network package repository for the GPU Driver LTS releases
dnf config-manager --add-repo \
https://repositories.intel.com/gpu/rhel/${VERSION_ID}/lts/2350/unified/intel-gpu-${VERSION_ID}.repo
# To add the online network package repository for the Intel Support Packages
tee > /etc/yum.repos.d/intel-for-pytorch-gpu-dev.repo << EOF
[intel-for-pytorch-gpu-dev]
name=Intel for Pytorch GPU dev repository
baseurl=https://yum.repos.intel.com/intel-for-pytorch-gpu-dev
enabled=1
gpgcheck=1
repo_gpgcheck=1
gpgkey=https://yum.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB
EOF
# The xpu-smi packages
dnf install -y xpu-smi
# Compute and Media Runtimes
dnf install -y \
intel-opencl intel-media intel-mediasdk libmfxgen1 libvpl2\
level-zero intel-level-zero-gpu mesa-dri-drivers mesa-vulkan-drivers \
mesa-vdpau-drivers libdrm mesa-libEGL mesa-libgbm mesa-libGL \
mesa-libxatracker libvpl-tools intel-metrics-discovery \
intel-metrics-library intel-igc-core intel-igc-cm \
libva libva-utils intel-gmmlib libmetee intel-gsc intel-ocloc
# Development packages
dnf install -y --refresh \
intel-igc-opencl-devel level-zero-devel intel-gsc-devel libmetee-devel \
level-zero-devel
# Install Intel Support Packages
yum install -y intel-for-pytorch-gpu-dev intel-pti-dev
# Cleanup
dnf clean all
rm -rf /var/cache/yum
rm -rf /var/lib/yum/yumdb
rm -rf /var/lib/yum/history
}
function install_sles() {
. /etc/os-release
VERSION_SP=${VERSION_ID//./sp}
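# e.g. VERSION_ID=15.5 becomes VERSION_SP=15sp5, matching the " 15sp4 15sp5 " list below.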
if [[ ! " 15sp4 15sp5 " =~ " ${VERSION_SP} " ]]; then
echo "SLES version ${VERSION_ID} not supported"
exit
fi
# To add the online network package repository for the GPU Driver LTS releases
zypper addrepo -f -r \
https://repositories.intel.com/gpu/sles/${VERSION_SP}/lts/2350/unified/intel-gpu-${VERSION_SP}.repo
rpm --import https://repositories.intel.com/gpu/intel-graphics.key
# To add the online network package repository for the Intel Support Packages
zypper addrepo https://yum.repos.intel.com/intel-for-pytorch-gpu-dev intel-for-pytorch-gpu-dev
rpm --import https://yum.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB
# The xpu-smi packages
zypper install -y lsb-release flex bison xpu-smi
# Compute and Media Runtimes
zypper install -y intel-level-zero-gpu level-zero intel-gsc intel-opencl intel-ocloc \
intel-media-driver libigfxcmrt7 libvpl2 libvpl-tools libmfxgen1 libmfx1
# Development packages
zypper install -y libigdfcl-devel intel-igc-cm libigfxcmrt-devel level-zero-devel
# Install Intel Support Packages
zypper install -y intel-for-pytorch-gpu-dev intel-pti-dev
}
# The installation depends on the base OS
ID=$(grep -oP '(?<=^ID=).+' /etc/os-release | tr -d '"')
@ -107,6 +191,12 @@ case "$ID" in
centos)
install_centos
;;
rhel|almalinux)
install_rhel
;;
sles)
install_sles
;;
*)
echo "Unable to determine OS..."
exit 1

101
.ci/docker/conda/Dockerfile Normal file
View File

@ -0,0 +1,101 @@
ARG CUDA_VERSION=10.2
ARG BASE_TARGET=cuda${CUDA_VERSION}
FROM centos:7 as base
ENV LC_ALL en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LANGUAGE en_US.UTF-8
ARG DEVTOOLSET_VERSION=9
RUN sed -i s/mirror.centos.org/vault.centos.org/g /etc/yum.repos.d/*.repo
RUN sed -i s/^#.*baseurl=http/baseurl=http/g /etc/yum.repos.d/*.repo
RUN sed -i s/^mirrorlist=http/#mirrorlist=http/g /etc/yum.repos.d/*.repo
RUN yum update -y
RUN yum install -y wget curl perl util-linux xz bzip2 git patch which unzip
# Just add everything as a safe.directory for git since these will be used in multiple places with git
RUN git config --global --add safe.directory '*'
RUN yum install -y yum-utils centos-release-scl
RUN yum-config-manager --enable rhel-server-rhscl-7-rpms
RUN sed -i s/mirror.centos.org/vault.centos.org/g /etc/yum.repos.d/*.repo
RUN sed -i s/^#.*baseurl=http/baseurl=http/g /etc/yum.repos.d/*.repo
RUN sed -i s/^mirrorlist=http/#mirrorlist=http/g /etc/yum.repos.d/*.repo
RUN yum install -y devtoolset-${DEVTOOLSET_VERSION}-gcc devtoolset-${DEVTOOLSET_VERSION}-gcc-c++ devtoolset-${DEVTOOLSET_VERSION}-gcc-gfortran devtoolset-${DEVTOOLSET_VERSION}-binutils
# EPEL for cmake
RUN wget http://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm && \
rpm -ivh epel-release-latest-7.noarch.rpm && \
rm -f epel-release-latest-7.noarch.rpm
# cmake
RUN yum install -y cmake3 && \
ln -s /usr/bin/cmake3 /usr/bin/cmake
ENV PATH=/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/bin:$PATH
ENV LD_LIBRARY_PATH=/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/lib64:/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/lib:$LD_LIBRARY_PATH
RUN yum install -y autoconf aclocal automake make sudo
RUN rm -rf /usr/local/cuda-*
FROM base as patchelf
# Install patchelf
ADD ./common/install_patchelf.sh install_patchelf.sh
RUN bash ./install_patchelf.sh && rm install_patchelf.sh && cp $(which patchelf) /patchelf
FROM base as openssl
# Install openssl
ADD ./common/install_openssl.sh install_openssl.sh
RUN bash ./install_openssl.sh && rm install_openssl.sh
FROM base as conda
# Install Anaconda
ADD ./common/install_conda_docker.sh install_conda.sh
RUN bash ./install_conda.sh && rm install_conda.sh
# Install CUDA
FROM base as cuda
ARG CUDA_VERSION=10.2
RUN rm -rf /usr/local/cuda-*
ADD ./common/install_cuda.sh install_cuda.sh
ENV CUDA_HOME=/usr/local/cuda-${CUDA_VERSION}
# Preserve CUDA_VERSION for the builds
ENV CUDA_VERSION=${CUDA_VERSION}
# Make things in our path by default
ENV PATH=/usr/local/cuda-${CUDA_VERSION}/bin:$PATH
FROM cuda as cuda11.8
RUN bash ./install_cuda.sh 11.8
ENV DESIRED_CUDA=11.8
FROM cuda as cuda12.1
RUN bash ./install_cuda.sh 12.1
ENV DESIRED_CUDA=12.1
FROM cuda as cuda12.4
RUN bash ./install_cuda.sh 12.4
ENV DESIRED_CUDA=12.4
# Install MNIST test data
FROM base as mnist
ADD ./common/install_mnist.sh install_mnist.sh
RUN bash ./install_mnist.sh
FROM base as all_cuda
COPY --from=cuda11.8 /usr/local/cuda-11.8 /usr/local/cuda-11.8
COPY --from=cuda12.1 /usr/local/cuda-12.1 /usr/local/cuda-12.1
COPY --from=cuda12.4 /usr/local/cuda-12.4 /usr/local/cuda-12.4
# Final step
FROM ${BASE_TARGET} as final
COPY --from=openssl /opt/openssl /opt/openssl
COPY --from=patchelf /patchelf /usr/local/bin/patchelf
COPY --from=conda /opt/conda /opt/conda
# Add jni.h for java host build.
COPY ./common/install_jni.sh install_jni.sh
COPY ./java/jni.h jni.h
RUN bash ./install_jni.sh && rm install_jni.sh
ENV PATH /opt/conda/bin:$PATH
COPY --from=mnist /usr/local/mnist /usr/local/mnist
RUN rm -rf /usr/local/cuda
RUN chmod o+rw /usr/local
RUN touch /.condarc && \
chmod o+rw /.condarc && \
chmod -R o+rw /opt/conda

76
.ci/docker/conda/build.sh Executable file
View File

@ -0,0 +1,76 @@
#!/usr/bin/env bash
# Script used only in CD pipeline
set -eou pipefail
image="$1"
shift
if [ -z "${image}" ]; then
echo "Usage: $0 IMAGE"
exit 1
fi
DOCKER_IMAGE_NAME="pytorch/${image}"
export DOCKER_BUILDKIT=1
TOPDIR=$(git rev-parse --show-toplevel)
CUDA_VERSION=${CUDA_VERSION:-12.1}
case ${CUDA_VERSION} in
cpu)
BASE_TARGET=base
DOCKER_TAG=cpu
;;
all)
BASE_TARGET=all_cuda
DOCKER_TAG=latest
;;
*)
BASE_TARGET=cuda${CUDA_VERSION}
DOCKER_TAG=cuda${CUDA_VERSION}
;;
esac
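# Hypothetical invocation (the image name is chosen only for illustration):
#   CUDA_VERSION=12.4 ./build.sh conda-builder:cuda12.4
# builds the "final" stage of .ci/docker/conda/Dockerfile with BASE_TARGET=cuda12.4 and
# tags it as pytorch/conda-builder:cuda12.4.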
(
set -x
docker build \
--target final \
--progress plain \
--build-arg "BASE_TARGET=${BASE_TARGET}" \
--build-arg "CUDA_VERSION=${CUDA_VERSION}" \
--build-arg "DEVTOOLSET_VERSION=9" \
-t ${DOCKER_IMAGE_NAME} \
$@ \
-f "${TOPDIR}/.ci/docker/conda/Dockerfile" \
${TOPDIR}/.ci/docker/
)
if [[ "${DOCKER_TAG}" =~ ^cuda* ]]; then
# Test that we're using the right CUDA compiler
(
set -x
docker run --rm "${DOCKER_IMAGE_NAME}" nvcc --version | grep "cuda_${CUDA_VERSION}"
)
fi
GITHUB_REF=${GITHUB_REF:-$(git symbolic-ref -q HEAD || git describe --tags --exact-match)}
GIT_BRANCH_NAME=${GITHUB_REF##*/}
GIT_COMMIT_SHA=${GITHUB_SHA:-$(git rev-parse HEAD)}
DOCKER_IMAGE_BRANCH_TAG=${DOCKER_IMAGE_NAME}-${GIT_BRANCH_NAME}
DOCKER_IMAGE_SHA_TAG=${DOCKER_IMAGE_NAME}-${GIT_COMMIT_SHA}
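# For illustration: with GITHUB_REF=refs/heads/main and image=conda-builder:cuda12.4,
# GIT_BRANCH_NAME=main and the extra branch tag is pytorch/conda-builder:cuda12.4-main;
# the SHA tag is formed the same way from the commit hash.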
if [[ "${WITH_PUSH:-}" == true ]]; then
(
set -x
docker push "${DOCKER_IMAGE_NAME}"
if [[ -n ${GITHUB_REF} ]]; then
docker tag ${DOCKER_IMAGE_NAME} ${DOCKER_IMAGE_BRANCH_TAG}
docker tag ${DOCKER_IMAGE_NAME} ${DOCKER_IMAGE_SHA_TAG}
docker push "${DOCKER_IMAGE_BRANCH_TAG}"
docker push "${DOCKER_IMAGE_SHA_TAG}"
fi
)
fi

View File

@ -0,0 +1,107 @@
ARG BASE_TARGET=base
ARG GPU_IMAGE=ubuntu:20.04
FROM ${GPU_IMAGE} as base
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get clean && apt-get update
RUN apt-get install -y curl locales g++ git-all autoconf automake make cmake wget unzip sudo
# Just add everything as a safe.directory for git since these will be used in multiple places with git
RUN git config --global --add safe.directory '*'
RUN locale-gen en_US.UTF-8
ENV LC_ALL en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LANGUAGE en_US.UTF-8
# Install openssl
FROM base as openssl
ADD ./common/install_openssl.sh install_openssl.sh
RUN bash ./install_openssl.sh && rm install_openssl.sh
# Install python
FROM base as python
ADD common/install_cpython.sh install_cpython.sh
RUN apt-get update -y && \
apt-get install build-essential gdb lcov libbz2-dev libffi-dev \
libgdbm-dev liblzma-dev libncurses5-dev libreadline6-dev \
libsqlite3-dev libssl-dev lzma lzma-dev tk-dev uuid-dev zlib1g-dev -y && \
bash ./install_cpython.sh && \
rm install_cpython.sh && \
apt-get clean
FROM base as conda
ADD ./common/install_conda_docker.sh install_conda.sh
RUN bash ./install_conda.sh && rm install_conda.sh
FROM base as cpu
# Install Anaconda
COPY --from=conda /opt/conda /opt/conda
# Install python
COPY --from=python /opt/python /opt/python
COPY --from=python /opt/_internal /opt/_internal
ENV PATH=/opt/conda/bin:/usr/local/cuda/bin:$PATH
# Install MKL
ADD ./common/install_mkl.sh install_mkl.sh
RUN bash ./install_mkl.sh && rm install_mkl.sh
FROM cpu as cuda
ADD ./common/install_cuda.sh install_cuda.sh
ADD ./common/install_magma.sh install_magma.sh
ENV CUDA_HOME /usr/local/cuda
FROM cuda as cuda11.8
RUN bash ./install_cuda.sh 11.8
RUN bash ./install_magma.sh 11.8
RUN ln -sf /usr/local/cuda-11.8 /usr/local/cuda
FROM cuda as cuda12.1
RUN bash ./install_cuda.sh 12.1
RUN bash ./install_magma.sh 12.1
RUN ln -sf /usr/local/cuda-12.1 /usr/local/cuda
FROM cuda as cuda12.4
RUN bash ./install_cuda.sh 12.4
RUN bash ./install_magma.sh 12.4
RUN ln -sf /usr/local/cuda-12.4 /usr/local/cuda
FROM cpu as rocm
ARG PYTORCH_ROCM_ARCH
ENV PYTORCH_ROCM_ARCH ${PYTORCH_ROCM_ARCH}
ENV MKLROOT /opt/intel
# Add the ROCM_PATH env var so that LoadHip.cmake (even with logic updated for ROCm6.0)
# can still find HIP for ROCm5.7. Not needed for ROCm6.0 and above.
# Remove below when ROCm5.7 is not in the support matrix anymore.
ENV ROCM_PATH /opt/rocm
# No need to install ROCm as base docker image should have full ROCm install
#ADD ./common/install_rocm.sh install_rocm.sh
ADD ./common/install_rocm_drm.sh install_rocm_drm.sh
ADD ./common/install_rocm_magma.sh install_rocm_magma.sh
# gfortran and python needed for building magma from source for ROCm
RUN apt-get update -y && \
apt-get install gfortran -y && \
apt-get install python -y && \
apt-get clean
RUN bash ./install_rocm_drm.sh && rm install_rocm_drm.sh
RUN bash ./install_rocm_magma.sh && rm install_rocm_magma.sh
# Install AOTriton
COPY ./common/common_utils.sh common_utils.sh
COPY ./common/aotriton_version.txt aotriton_version.txt
COPY ./common/install_aotriton.sh install_aotriton.sh
RUN bash ./install_aotriton.sh /opt/rocm && rm install_aotriton.sh aotriton_version.txt
ENV AOTRITON_INSTALLED_PREFIX /opt/rocm/aotriton
FROM ${BASE_TARGET} as final
COPY --from=openssl /opt/openssl /opt/openssl
# Install patchelf
ADD ./common/install_patchelf.sh install_patchelf.sh
RUN bash ./install_patchelf.sh && rm install_patchelf.sh
# Install Anaconda
COPY --from=conda /opt/conda /opt/conda
# Install python
COPY --from=python /opt/python /opt/python
COPY --from=python /opt/_internal /opt/_internal
ENV PATH=/opt/conda/bin:/usr/local/cuda/bin:$PATH

93
.ci/docker/libtorch/build.sh Executable file
View File

@ -0,0 +1,93 @@
#!/usr/bin/env bash
# Script used only in CD pipeline
set -eou pipefail
image="$1"
shift
if [ -z "${image}" ]; then
echo "Usage: $0 IMAGE"
exit 1
fi
DOCKER_IMAGE="pytorch/${image}"
TOPDIR=$(git rev-parse --show-toplevel)
GPU_ARCH_TYPE=${GPU_ARCH_TYPE:-cpu}
GPU_ARCH_VERSION=${GPU_ARCH_VERSION:-}
WITH_PUSH=${WITH_PUSH:-}
DOCKER=${DOCKER:-docker}
case ${GPU_ARCH_TYPE} in
cpu)
BASE_TARGET=cpu
DOCKER_TAG=cpu
GPU_IMAGE=ubuntu:20.04
DOCKER_GPU_BUILD_ARG=""
;;
cuda)
BASE_TARGET=cuda${GPU_ARCH_VERSION}
DOCKER_TAG=cuda${GPU_ARCH_VERSION}
GPU_IMAGE=ubuntu:20.04
DOCKER_GPU_BUILD_ARG=""
;;
rocm)
BASE_TARGET=rocm
DOCKER_TAG=rocm${GPU_ARCH_VERSION}
GPU_IMAGE=rocm/dev-ubuntu-20.04:${GPU_ARCH_VERSION}-complete
PYTORCH_ROCM_ARCH="gfx900;gfx906;gfx908;gfx90a;gfx1030;gfx1100"
ROCM_REGEX="([0-9]+)\.([0-9]+)[\.]?([0-9]*)"
if [[ $GPU_ARCH_VERSION =~ $ROCM_REGEX ]]; then
ROCM_VERSION_INT=$((${BASH_REMATCH[1]}*10000 + ${BASH_REMATCH[2]}*100 + ${BASH_REMATCH[3]:-0}))
else
echo "ERROR: rocm regex failed"
exit 1
fi
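# e.g. GPU_ARCH_VERSION=6.1 matches with an empty patch group, giving
# ROCM_VERSION_INT=6*10000 + 1*100 + 0 = 60100, so gfx942 is appended by the check below.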
if [[ $ROCM_VERSION_INT -ge 60000 ]]; then
PYTORCH_ROCM_ARCH+=";gfx942"
fi
DOCKER_GPU_BUILD_ARG="--build-arg PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH}"
;;
*)
echo "ERROR: Unrecognized GPU_ARCH_TYPE: ${GPU_ARCH_TYPE}"
exit 1
;;
esac
(
set -x
DOCKER_BUILDKIT=1 ${DOCKER} build \
--target final \
${DOCKER_GPU_BUILD_ARG} \
--build-arg "GPU_IMAGE=${GPU_IMAGE}" \
--build-arg "BASE_TARGET=${BASE_TARGET}" \
-t "${DOCKER_IMAGE}" \
$@ \
-f "${TOPDIR}/.ci/docker/libtorch/Dockerfile" \
"${TOPDIR}/.ci/docker/"
)
GITHUB_REF=${GITHUB_REF:-$(git symbolic-ref -q HEAD || git describe --tags --exact-match)}
GIT_BRANCH_NAME=${GITHUB_REF##*/}
GIT_COMMIT_SHA=${GITHUB_SHA:-$(git rev-parse HEAD)}
DOCKER_IMAGE_BRANCH_TAG=${DOCKER_IMAGE}-${GIT_BRANCH_NAME}
DOCKER_IMAGE_SHA_TAG=${DOCKER_IMAGE}-${GIT_COMMIT_SHA}
if [[ "${WITH_PUSH}" == true ]]; then
(
set -x
${DOCKER} push "${DOCKER_IMAGE}"
if [[ -n ${GITHUB_REF} ]]; then
${DOCKER} tag ${DOCKER_IMAGE} ${DOCKER_IMAGE_BRANCH_TAG}
${DOCKER} tag ${DOCKER_IMAGE} ${DOCKER_IMAGE_SHA_TAG}
${DOCKER} push "${DOCKER_IMAGE_BRANCH_TAG}"
${DOCKER} push "${DOCKER_IMAGE_SHA_TAG}"
fi
)
fi

View File

@ -0,0 +1,203 @@
# syntax = docker/dockerfile:experimental
ARG ROCM_VERSION=3.7
ARG BASE_CUDA_VERSION=11.8
ARG GPU_IMAGE=centos:7
FROM centos:7 as base
ENV LC_ALL en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LANGUAGE en_US.UTF-8
ARG DEVTOOLSET_VERSION=9
# Note: This patch is required since CentOS has reached EOL;
# otherwise any yum install step will fail
RUN sed -i s/mirror.centos.org/vault.centos.org/g /etc/yum.repos.d/*.repo
RUN sed -i s/^#.*baseurl=http/baseurl=http/g /etc/yum.repos.d/*.repo
RUN sed -i s/^mirrorlist=http/#mirrorlist=http/g /etc/yum.repos.d/*.repo
RUN yum install -y wget curl perl util-linux xz bzip2 git patch which perl zlib-devel
# Just add everything as a safe.directory for git since these will be used in multiple places with git
RUN git config --global --add safe.directory '*'
RUN yum install -y yum-utils centos-release-scl
RUN yum-config-manager --enable rhel-server-rhscl-7-rpms
# Note: After running yum-config-manager --enable rhel-server-rhscl-7-rpms,
# the patch is required once again. Somehow this step adds mirror.centos.org
RUN sed -i s/mirror.centos.org/vault.centos.org/g /etc/yum.repos.d/*.repo
RUN sed -i s/^#.*baseurl=http/baseurl=http/g /etc/yum.repos.d/*.repo
RUN sed -i s/^mirrorlist=http/#mirrorlist=http/g /etc/yum.repos.d/*.repo
RUN yum install -y devtoolset-${DEVTOOLSET_VERSION}-gcc devtoolset-${DEVTOOLSET_VERSION}-gcc-c++ devtoolset-${DEVTOOLSET_VERSION}-gcc-gfortran devtoolset-${DEVTOOLSET_VERSION}-binutils
ENV PATH=/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/bin:$PATH
ENV LD_LIBRARY_PATH=/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/lib64:/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/lib:$LD_LIBRARY_PATH
RUN wget http://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm && \
rpm -ivh epel-release-latest-7.noarch.rpm && \
rm -f epel-release-latest-7.noarch.rpm
# cmake-3.18.4 from pip
RUN yum install -y python3-pip && \
python3 -mpip install cmake==3.18.4 && \
ln -s /usr/local/bin/cmake /usr/bin/cmake
RUN yum install -y autoconf aclocal automake make sudo
FROM base as openssl
# Install openssl (this must precede `build python` step)
# (In order to have a proper SSL module, Python is compiled
# against a recent openssl [see env vars above], which is linked
# statically. We delete openssl afterwards.)
ADD ./common/install_openssl.sh install_openssl.sh
RUN bash ./install_openssl.sh && rm install_openssl.sh
# EPEL for cmake
FROM base as patchelf
# Install patchelf
ADD ./common/install_patchelf.sh install_patchelf.sh
RUN bash ./install_patchelf.sh && rm install_patchelf.sh
RUN cp $(which patchelf) /patchelf
FROM patchelf as python
# build python
COPY manywheel/build_scripts /build_scripts
ADD ./common/install_cpython.sh /build_scripts/install_cpython.sh
RUN bash build_scripts/build.sh && rm -r build_scripts
FROM base as cuda
ARG BASE_CUDA_VERSION=10.2
# Install CUDA
ADD ./common/install_cuda.sh install_cuda.sh
RUN bash ./install_cuda.sh ${BASE_CUDA_VERSION} && rm install_cuda.sh
FROM base as intel
# MKL
ADD ./common/install_mkl.sh install_mkl.sh
RUN bash ./install_mkl.sh && rm install_mkl.sh
FROM base as magma
ARG BASE_CUDA_VERSION=10.2
# Install magma
ADD ./common/install_magma.sh install_magma.sh
RUN bash ./install_magma.sh ${BASE_CUDA_VERSION} && rm install_magma.sh
FROM base as jni
# Install java jni header
ADD ./common/install_jni.sh install_jni.sh
ADD ./java/jni.h jni.h
RUN bash ./install_jni.sh && rm install_jni.sh
FROM base as libpng
# Install libpng
ADD ./common/install_libpng.sh install_libpng.sh
RUN bash ./install_libpng.sh && rm install_libpng.sh
FROM ${GPU_IMAGE} as common
RUN sed -i s/mirror.centos.org/vault.centos.org/g /etc/yum.repos.d/*.repo
RUN sed -i s/^#.*baseurl=http/baseurl=http/g /etc/yum.repos.d/*.repo
RUN sed -i s/^mirrorlist=http/#mirrorlist=http/g /etc/yum.repos.d/*.repo
ENV LC_ALL en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LANGUAGE en_US.UTF-8
RUN yum install -y \
aclocal \
autoconf \
automake \
bison \
bzip2 \
curl \
diffutils \
file \
git \
make \
patch \
perl \
unzip \
util-linux \
wget \
which \
xz \
yasm
RUN yum install -y \
https://repo.ius.io/ius-release-el7.rpm \
https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
RUN yum swap -y git git236-core
# git236+ would refuse to run git commands in repos owned by other users
# Which causes version check to fail, as pytorch repo is bind-mounted into the image
# Override this behaviour by treating every folder as safe
# For more details see https://github.com/pytorch/pytorch/issues/78659#issuecomment-1144107327
RUN git config --global --add safe.directory "*"
ENV SSL_CERT_FILE=/opt/_internal/certs.pem
# Install LLVM version
COPY --from=openssl /opt/openssl /opt/openssl
COPY --from=python /opt/python /opt/python
COPY --from=python /opt/_internal /opt/_internal
COPY --from=python /opt/python/cp39-cp39/bin/auditwheel /usr/local/bin/auditwheel
COPY --from=intel /opt/intel /opt/intel
COPY --from=patchelf /usr/local/bin/patchelf /usr/local/bin/patchelf
COPY --from=jni /usr/local/include/jni.h /usr/local/include/jni.h
COPY --from=libpng /usr/local/bin/png* /usr/local/bin/
COPY --from=libpng /usr/local/bin/libpng* /usr/local/bin/
COPY --from=libpng /usr/local/include/png* /usr/local/include/
COPY --from=libpng /usr/local/include/libpng* /usr/local/include/
COPY --from=libpng /usr/local/lib/libpng* /usr/local/lib/
COPY --from=libpng /usr/local/lib/pkgconfig /usr/local/lib/pkgconfig
FROM common as cpu_final
ARG BASE_CUDA_VERSION=10.1
ARG DEVTOOLSET_VERSION=9
RUN sed -i s/mirror.centos.org/vault.centos.org/g /etc/yum.repos.d/*.repo
RUN sed -i s/^#.*baseurl=http/baseurl=http/g /etc/yum.repos.d/*.repo
RUN sed -i s/^mirrorlist=http/#mirrorlist=http/g /etc/yum.repos.d/*.repo
RUN yum install -y yum-utils centos-release-scl
RUN yum-config-manager --enable rhel-server-rhscl-7-rpms
RUN sed -i s/mirror.centos.org/vault.centos.org/g /etc/yum.repos.d/*.repo
RUN sed -i s/^#.*baseurl=http/baseurl=http/g /etc/yum.repos.d/*.repo
RUN sed -i s/^mirrorlist=http/#mirrorlist=http/g /etc/yum.repos.d/*.repo
RUN yum install -y devtoolset-${DEVTOOLSET_VERSION}-gcc devtoolset-${DEVTOOLSET_VERSION}-gcc-c++ devtoolset-${DEVTOOLSET_VERSION}-gcc-gfortran devtoolset-${DEVTOOLSET_VERSION}-binutils
ENV PATH=/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/bin:$PATH
ENV LD_LIBRARY_PATH=/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/lib64:/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/lib:$LD_LIBRARY_PATH
# cmake is already installed inside the rocm base image, so remove if present
RUN rpm -e cmake || true
# cmake-3.18.4 from pip
RUN yum install -y python3-pip && \
python3 -mpip install cmake==3.18.4 && \
ln -s /usr/local/bin/cmake /usr/bin/cmake
# ninja
RUN yum install -y ninja-build
FROM cpu_final as cuda_final
RUN rm -rf /usr/local/cuda-${BASE_CUDA_VERSION}
COPY --from=cuda /usr/local/cuda-${BASE_CUDA_VERSION} /usr/local/cuda-${BASE_CUDA_VERSION}
COPY --from=magma /usr/local/cuda-${BASE_CUDA_VERSION} /usr/local/cuda-${BASE_CUDA_VERSION}
RUN ln -sf /usr/local/cuda-${BASE_CUDA_VERSION} /usr/local/cuda
ENV PATH=/usr/local/cuda/bin:$PATH
FROM cpu_final as rocm_final
ARG ROCM_VERSION=3.7
ARG PYTORCH_ROCM_ARCH
ENV PYTORCH_ROCM_ARCH ${PYTORCH_ROCM_ARCH}
# Add the ROCM_PATH env var so that LoadHip.cmake (even with logic updated for ROCm6.0)
# can still find HIP for ROCm5.7. Not needed for ROCm6.0 and above.
# Remove below when ROCm5.7 is not in the support matrix anymore.
ENV ROCM_PATH /opt/rocm
ENV MKLROOT /opt/intel
# No need to install ROCm as base docker image should have full ROCm install
#ADD ./common/install_rocm.sh install_rocm.sh
#RUN ROCM_VERSION=${ROCM_VERSION} bash ./install_rocm.sh && rm install_rocm.sh
ADD ./common/install_rocm_drm.sh install_rocm_drm.sh
RUN bash ./install_rocm_drm.sh && rm install_rocm_drm.sh
# cmake3 is needed for the MIOpen build
RUN ln -sf /usr/local/bin/cmake /usr/bin/cmake3
ADD ./common/install_rocm_magma.sh install_rocm_magma.sh
RUN bash ./install_rocm_magma.sh && rm install_rocm_magma.sh
ADD ./common/install_miopen.sh install_miopen.sh
RUN bash ./install_miopen.sh ${ROCM_VERSION} && rm install_miopen.sh
# Install AOTriton
COPY ./common/common_utils.sh common_utils.sh
COPY ./common/aotriton_version.txt aotriton_version.txt
COPY ./common/install_aotriton.sh install_aotriton.sh
RUN bash ./install_aotriton.sh /opt/rocm && rm install_aotriton.sh aotriton_version.txt
ENV AOTRITON_INSTALLED_PREFIX /opt/rocm/aotriton

View File

@ -0,0 +1,152 @@
# syntax = docker/dockerfile:experimental
ARG ROCM_VERSION=3.7
ARG BASE_CUDA_VERSION=10.2
ARG GPU_IMAGE=nvidia/cuda:${BASE_CUDA_VERSION}-devel-centos7
FROM quay.io/pypa/manylinux2014_x86_64 as base
ENV LC_ALL en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LANGUAGE en_US.UTF-8
RUN sed -i s/mirror.centos.org/vault.centos.org/g /etc/yum.repos.d/*.repo
RUN sed -i s/^#.*baseurl=http/baseurl=http/g /etc/yum.repos.d/*.repo
RUN sed -i s/^mirrorlist=http/#mirrorlist=http/g /etc/yum.repos.d/*.repo
RUN yum install -y wget curl perl util-linux xz bzip2 git patch which perl zlib-devel
RUN yum install -y yum-utils centos-release-scl sudo
RUN yum-config-manager --enable rhel-server-rhscl-7-rpms
RUN yum install -y devtoolset-7-gcc devtoolset-7-gcc-c++ devtoolset-7-gcc-gfortran devtoolset-7-binutils
ENV PATH=/opt/rh/devtoolset-7/root/usr/bin:$PATH
ENV LD_LIBRARY_PATH=/opt/rh/devtoolset-7/root/usr/lib64:/opt/rh/devtoolset-7/root/usr/lib:$LD_LIBRARY_PATH
# cmake
RUN yum install -y cmake3 && \
ln -s /usr/bin/cmake3 /usr/bin/cmake
FROM base as openssl
# Install openssl (this must precede `build python` step)
# (In order to have a proper SSL module, Python is compiled
# against a recent openssl [see env vars above], which is linked
# statically. We delete openssl afterwards.)
ADD ./common/install_openssl.sh install_openssl.sh
RUN bash ./install_openssl.sh && rm install_openssl.sh
# remove unnecessary python versions
RUN rm -rf /opt/python/cp26-cp26m /opt/_internal/cpython-2.6.9-ucs2
RUN rm -rf /opt/python/cp26-cp26mu /opt/_internal/cpython-2.6.9-ucs4
RUN rm -rf /opt/python/cp33-cp33m /opt/_internal/cpython-3.3.6
RUN rm -rf /opt/python/cp34-cp34m /opt/_internal/cpython-3.4.6
FROM base as cuda
ARG BASE_CUDA_VERSION=10.2
# Install CUDA
ADD ./common/install_cuda.sh install_cuda.sh
RUN bash ./install_cuda.sh ${BASE_CUDA_VERSION} && rm install_cuda.sh
FROM base as intel
# MKL
ADD ./common/install_mkl.sh install_mkl.sh
RUN bash ./install_mkl.sh && rm install_mkl.sh
FROM base as magma
ARG BASE_CUDA_VERSION=10.2
# Install magma
ADD ./common/install_magma.sh install_magma.sh
RUN bash ./install_magma.sh ${BASE_CUDA_VERSION} && rm install_magma.sh
FROM base as jni
# Install java jni header
ADD ./common/install_jni.sh install_jni.sh
ADD ./java/jni.h jni.h
RUN bash ./install_jni.sh && rm install_jni.sh
FROM base as libpng
# Install libpng
ADD ./common/install_libpng.sh install_libpng.sh
RUN bash ./install_libpng.sh && rm install_libpng.sh
FROM ${GPU_IMAGE} as common
RUN sed -i s/mirror.centos.org/vault.centos.org/g /etc/yum.repos.d/*.repo
RUN sed -i s/^#.*baseurl=http/baseurl=http/g /etc/yum.repos.d/*.repo
RUN sed -i s/^mirrorlist=http/#mirrorlist=http/g /etc/yum.repos.d/*.repo
ENV LC_ALL en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LANGUAGE en_US.UTF-8
RUN yum install -y \
aclocal \
autoconf \
automake \
bison \
bzip2 \
curl \
diffutils \
file \
git \
make \
patch \
perl \
unzip \
util-linux \
wget \
which \
xz \
yasm
RUN yum install -y \
https://repo.ius.io/ius-release-el7.rpm \
https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
RUN yum swap -y git git236-core
# git236+ would refuse to run git commands in repos owned by other users
# Which causes version check to fail, as pytorch repo is bind-mounted into the image
# Override this behaviour by treating every folder as safe
# For more details see https://github.com/pytorch/pytorch/issues/78659#issuecomment-1144107327
RUN git config --global --add safe.directory "*"
ENV SSL_CERT_FILE=/opt/_internal/certs.pem
# Install LLVM version
COPY --from=openssl /opt/openssl /opt/openssl
COPY --from=base /opt/python /opt/python
COPY --from=base /opt/_internal /opt/_internal
COPY --from=base /usr/local/bin/auditwheel /usr/local/bin/auditwheel
COPY --from=intel /opt/intel /opt/intel
COPY --from=base /usr/local/bin/patchelf /usr/local/bin/patchelf
COPY --from=libpng /usr/local/bin/png* /usr/local/bin/
COPY --from=libpng /usr/local/bin/libpng* /usr/local/bin/
COPY --from=libpng /usr/local/include/png* /usr/local/include/
COPY --from=libpng /usr/local/include/libpng* /usr/local/include/
COPY --from=libpng /usr/local/lib/libpng* /usr/local/lib/
COPY --from=libpng /usr/local/lib/pkgconfig /usr/local/lib/pkgconfig
COPY --from=jni /usr/local/include/jni.h /usr/local/include/jni.h
FROM common as cpu_final
ARG BASE_CUDA_VERSION=10.2
RUN yum install -y yum-utils centos-release-scl
RUN yum-config-manager --enable rhel-server-rhscl-7-rpms
RUN yum install -y devtoolset-7-gcc devtoolset-7-gcc-c++ devtoolset-7-gcc-gfortran devtoolset-7-binutils
ENV PATH=/opt/rh/devtoolset-7/root/usr/bin:$PATH
ENV LD_LIBRARY_PATH=/opt/rh/devtoolset-7/root/usr/lib64:/opt/rh/devtoolset-7/root/usr/lib:$LD_LIBRARY_PATH
# cmake
RUN yum install -y cmake3 && \
ln -s /usr/bin/cmake3 /usr/bin/cmake
# ninja
RUN yum install -y http://repo.okay.com.mx/centos/7/x86_64/release/okay-release-1-1.noarch.rpm
RUN yum install -y ninja-build
FROM cpu_final as cuda_final
RUN rm -rf /usr/local/cuda-${BASE_CUDA_VERSION}
COPY --from=cuda /usr/local/cuda-${BASE_CUDA_VERSION} /usr/local/cuda-${BASE_CUDA_VERSION}
COPY --from=magma /usr/local/cuda-${BASE_CUDA_VERSION} /usr/local/cuda-${BASE_CUDA_VERSION}
FROM common as rocm_final
ARG ROCM_VERSION=3.7
# Install ROCm
ADD ./common/install_rocm.sh install_rocm.sh
RUN bash ./install_rocm.sh ${ROCM_VERSION} && rm install_rocm.sh
# cmake is already installed inside the rocm base image, but both 2 and 3 exist
# cmake3 is needed for the later MIOpen custom build, so that step is last.
RUN yum install -y cmake3 && \
rm -f /usr/bin/cmake && \
ln -s /usr/bin/cmake3 /usr/bin/cmake
ADD ./common/install_miopen.sh install_miopen.sh
RUN bash ./install_miopen.sh ${ROCM_VERSION} && rm install_miopen.sh

View File

@ -0,0 +1,153 @@
# syntax = docker/dockerfile:experimental
ARG ROCM_VERSION=3.7
ARG BASE_CUDA_VERSION=11.8
ARG GPU_IMAGE=amd64/almalinux:8
FROM quay.io/pypa/manylinux_2_28_x86_64 as base
ENV LC_ALL en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LANGUAGE en_US.UTF-8
ARG DEVTOOLSET_VERSION=11
RUN yum install -y sudo wget curl perl util-linux xz bzip2 git patch which perl zlib-devel yum-utils gcc-toolset-${DEVTOOLSET_VERSION}-toolchain
ENV PATH=/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/bin:$PATH
ENV LD_LIBRARY_PATH=/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/lib64:/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/lib:$LD_LIBRARY_PATH
# cmake-3.18.4 from pip
RUN yum install -y python3-pip && \
python3 -mpip install cmake==3.18.4 && \
ln -s /usr/local/bin/cmake /usr/bin/cmake3
FROM base as openssl
# Install openssl (this must precede `build python` step)
# (In order to have a proper SSL module, Python is compiled
# against a recent openssl [see env vars above], which is linked
# statically. We delete openssl afterwards.)
ADD ./common/install_openssl.sh install_openssl.sh
RUN bash ./install_openssl.sh && rm install_openssl.sh
# remove unnecessary python versions
RUN rm -rf /opt/python/cp26-cp26m /opt/_internal/cpython-2.6.9-ucs2
RUN rm -rf /opt/python/cp26-cp26mu /opt/_internal/cpython-2.6.9-ucs4
RUN rm -rf /opt/python/cp33-cp33m /opt/_internal/cpython-3.3.6
RUN rm -rf /opt/python/cp34-cp34m /opt/_internal/cpython-3.4.6
FROM base as cuda
ARG BASE_CUDA_VERSION=11.8
# Install CUDA
ADD ./common/install_cuda.sh install_cuda.sh
RUN bash ./install_cuda.sh ${BASE_CUDA_VERSION} && rm install_cuda.sh
FROM base as intel
# MKL
ADD ./common/install_mkl.sh install_mkl.sh
RUN bash ./install_mkl.sh && rm install_mkl.sh
FROM base as magma
ARG BASE_CUDA_VERSION=10.2
# Install magma
ADD ./common/install_magma.sh install_magma.sh
RUN bash ./install_magma.sh ${BASE_CUDA_VERSION} && rm install_magma.sh
FROM base as jni
# Install java jni header
ADD ./common/install_jni.sh install_jni.sh
ADD ./java/jni.h jni.h
RUN bash ./install_jni.sh && rm install_jni.sh
FROM base as libpng
# Install libpng
ADD ./common/install_libpng.sh install_libpng.sh
RUN bash ./install_libpng.sh && rm install_libpng.sh
FROM ${GPU_IMAGE} as common
ARG DEVTOOLSET_VERSION=11
ENV LC_ALL en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LANGUAGE en_US.UTF-8
RUN yum -y install epel-release
RUN yum -y update
RUN yum install -y \
autoconf \
automake \
bison \
bzip2 \
curl \
diffutils \
file \
git \
make \
patch \
perl \
unzip \
util-linux \
wget \
which \
xz \
gcc-toolset-${DEVTOOLSET_VERSION}-toolchain \
glibc-langpack-en
RUN yum install -y \
https://repo.ius.io/ius-release-el7.rpm \
https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
RUN yum swap -y git git236-core
# git236+ would refuse to run git commands in repos owned by other users
# Which causes version check to fail, as pytorch repo is bind-mounted into the image
# Override this behaviour by treating every folder as safe
# For more details see https://github.com/pytorch/pytorch/issues/78659#issuecomment-1144107327
RUN git config --global --add safe.directory "*"
ENV SSL_CERT_FILE=/opt/_internal/certs.pem
# Install LLVM version
COPY --from=openssl /opt/openssl /opt/openssl
COPY --from=base /opt/python /opt/python
COPY --from=base /opt/_internal /opt/_internal
COPY --from=base /usr/local/bin/auditwheel /usr/local/bin/auditwheel
COPY --from=intel /opt/intel /opt/intel
COPY --from=base /usr/local/bin/patchelf /usr/local/bin/patchelf
COPY --from=libpng /usr/local/bin/png* /usr/local/bin/
COPY --from=libpng /usr/local/bin/libpng* /usr/local/bin/
COPY --from=libpng /usr/local/include/png* /usr/local/include/
COPY --from=libpng /usr/local/include/libpng* /usr/local/include/
COPY --from=libpng /usr/local/lib/libpng* /usr/local/lib/
COPY --from=libpng /usr/local/lib/pkgconfig /usr/local/lib/pkgconfig
COPY --from=jni /usr/local/include/jni.h /usr/local/include/jni.h
FROM common as cpu_final
ARG BASE_CUDA_VERSION=11.8
ARG DEVTOOLSET_VERSION=11
# Ensure the expected devtoolset is used
ENV PATH=/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/bin:$PATH
ENV LD_LIBRARY_PATH=/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/lib64:/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/lib:$LD_LIBRARY_PATH
# cmake-3.18.4 from pip
RUN yum install -y python3-pip && \
python3 -mpip install cmake==3.18.4 && \
ln -s /usr/local/bin/cmake /usr/bin/cmake3
FROM cpu_final as cuda_final
RUN rm -rf /usr/local/cuda-${BASE_CUDA_VERSION}
COPY --from=cuda /usr/local/cuda-${BASE_CUDA_VERSION} /usr/local/cuda-${BASE_CUDA_VERSION}
COPY --from=magma /usr/local/cuda-${BASE_CUDA_VERSION} /usr/local/cuda-${BASE_CUDA_VERSION}
FROM common as rocm_final
ARG ROCM_VERSION=3.7
# Install ROCm
ADD ./common/install_rocm.sh install_rocm.sh
RUN bash ./install_rocm.sh ${ROCM_VERSION} && rm install_rocm.sh
# cmake is already installed inside the rocm base image, but both 2 and 3 exist
# cmake3 is needed for the later MIOpen custom build, so that step is last.
RUN yum install -y cmake3 && \
rm -f /usr/bin/cmake && \
ln -s /usr/bin/cmake3 /usr/bin/cmake
ADD ./common/install_miopen.sh install_miopen.sh
RUN bash ./install_miopen.sh ${ROCM_VERSION} && rm install_miopen.sh
FROM cpu_final as xpu_final
# cmake-3.28.4 from pip
RUN python3 -m pip install --upgrade pip && \
python3 -mpip install cmake==3.28.4
ADD ./common/install_xpu.sh install_xpu.sh
RUN bash ./install_xpu.sh && rm install_xpu.sh
RUN pushd /opt/_internal && tar -xJf static-libs-for-embedding-only.tar.xz && popd

View File

@ -0,0 +1,57 @@
FROM quay.io/pypa/manylinux_2_28_aarch64 as base
# Graviton needs GCC 10 or above for the build. GCC12 is the default version in almalinux-8.
ARG GCCTOOLSET_VERSION=11
# Language variables
ENV LC_ALL=en_US.UTF-8
ENV LANG=en_US.UTF-8
ENV LANGUAGE=en_US.UTF-8
# Install needed OS packages. This is to support all
# the binary builds (torch, vision, audio, text, data)
RUN yum -y install epel-release
RUN yum -y update
RUN yum install -y \
autoconf \
automake \
bison \
bzip2 \
curl \
diffutils \
file \
git \
less \
libffi-devel \
libgomp \
make \
openssl-devel \
patch \
perl \
unzip \
util-linux \
wget \
which \
xz \
yasm \
zstd \
sudo \
gcc-toolset-${GCCTOOLSET_VERSION}-toolchain
# Ensure the expected devtoolset is used
ENV PATH=/opt/rh/gcc-toolset-${GCCTOOLSET_VERSION}/root/usr/bin:$PATH
ENV LD_LIBRARY_PATH=/opt/rh/gcc-toolset-${GCCTOOLSET_VERSION}/root/usr/lib64:/opt/rh/gcc-toolset-${GCCTOOLSET_VERSION}/root/usr/lib:$LD_LIBRARY_PATH
# git236+ would refuse to run git commands in repos owned by other users
# Which causes version check to fail, as pytorch repo is bind-mounted into the image
# Override this behaviour by treating every folder as safe
# For more details see https://github.com/pytorch/pytorch/issues/78659#issuecomment-1144107327
RUN git config --global --add safe.directory "*"
FROM base as final
# remove unnecessary python versions
RUN rm -rf /opt/python/cp26-cp26m /opt/_internal/cpython-2.6.9-ucs2
RUN rm -rf /opt/python/cp26-cp26mu /opt/_internal/cpython-2.6.9-ucs4
RUN rm -rf /opt/python/cp33-cp33m /opt/_internal/cpython-3.3.6
RUN rm -rf /opt/python/cp34-cp34m /opt/_internal/cpython-3.4.6

View File

@ -0,0 +1,94 @@
FROM quay.io/pypa/manylinux2014_aarch64 as base
# Graviton needs GCC 10 for the build
ARG DEVTOOLSET_VERSION=10
# Language variables
ENV LC_ALL=en_US.UTF-8
ENV LANG=en_US.UTF-8
ENV LANGUAGE=en_US.UTF-8
# Install needed OS packages. This is to support all
# the binary builds (torch, vision, audio, text, data)
RUN yum -y install epel-release
RUN yum -y update
RUN yum install -y \
autoconf \
automake \
bison \
bzip2 \
curl \
diffutils \
file \
git \
make \
patch \
perl \
unzip \
util-linux \
wget \
which \
xz \
yasm \
less \
zstd \
libgomp \
sudo \
devtoolset-${DEVTOOLSET_VERSION}-gcc \
devtoolset-${DEVTOOLSET_VERSION}-gcc-c++ \
devtoolset-${DEVTOOLSET_VERSION}-gcc-gfortran \
devtoolset-${DEVTOOLSET_VERSION}-binutils
# Ensure the expected devtoolset is used
ENV PATH=/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/bin:$PATH
ENV LD_LIBRARY_PATH=/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/lib64:/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/lib:$LD_LIBRARY_PATH
# git236+ would refuse to run git commands in repos owned by other users
# Which causes version check to fail, as pytorch repo is bind-mounted into the image
# Override this behaviour by treating every folder as safe
# For more details see https://github.com/pytorch/pytorch/issues/78659#issuecomment-1144107327
RUN git config --global --add safe.directory "*"
###############################################################################
# libgfortran.a hack
#
# libgfortran.a from quay.io/pypa/manylinux2014_aarch64 is not compiled with -fPIC.
# This causes __stack_chk_guard@@GLIBC_2.17 link errors during the pytorch build. To solve, get
# ubuntu's libgfortran.a, which is compiled with -fPIC
# NOTE: Need a better way to get this library as Ubuntu's package can be removed by the vendor, or changed
###############################################################################
RUN cd ~/ \
&& curl -L -o ~/libgfortran-10-dev.deb http://ports.ubuntu.com/ubuntu-ports/pool/universe/g/gcc-10/libgfortran-10-dev_10.5.0-1ubuntu1_arm64.deb \
&& ar x ~/libgfortran-10-dev.deb \
&& tar --use-compress-program=unzstd -xvf data.tar.zst -C ~/ \
&& cp -f ~/usr/lib/gcc/aarch64-linux-gnu/10/libgfortran.a /opt/rh/devtoolset-10/root/usr/lib/gcc/aarch64-redhat-linux/10/
# install cmake
RUN yum install -y cmake3 && \
ln -s /usr/bin/cmake3 /usr/bin/cmake
FROM base as openssl
# Install openssl (this must precede `build python` step)
# (In order to have a proper SSL module, Python is compiled
# against a recent openssl [see env vars above], which is linked
# statically. We delete openssl afterwards.)
ADD ./common/install_openssl.sh install_openssl.sh
RUN bash ./install_openssl.sh && rm install_openssl.sh
ENV SSL_CERT_FILE=/opt/_internal/certs.pem
FROM base as openblas
# Install openblas
ADD ./common/install_openblas.sh install_openblas.sh
RUN bash ./install_openblas.sh && rm install_openblas.sh
FROM openssl as final
# remove unnecessary python versions
RUN rm -rf /opt/python/cp26-cp26m /opt/_internal/cpython-2.6.9-ucs2
RUN rm -rf /opt/python/cp26-cp26mu /opt/_internal/cpython-2.6.9-ucs4
RUN rm -rf /opt/python/cp33-cp33m /opt/_internal/cpython-3.3.6
RUN rm -rf /opt/python/cp34-cp34m /opt/_internal/cpython-3.4.6
COPY --from=openblas /opt/OpenBLAS/ /opt/OpenBLAS/
ENV LD_LIBRARY_PATH=/opt/OpenBLAS/lib:$LD_LIBRARY_PATH

View File

@ -0,0 +1,91 @@
FROM quay.io/pypa/manylinux_2_28_aarch64 as base
# Cuda ARM build needs gcc 11
ARG DEVTOOLSET_VERSION=11
# Language variables
ENV LC_ALL=en_US.UTF-8
ENV LANG=en_US.UTF-8
ENV LANGUAGE=en_US.UTF-8
# Install needed OS packages. This is to support all
# the binary builds (torch, vision, audio, text, data)
RUN yum -y install epel-release
RUN yum -y update
RUN yum install -y \
autoconf \
automake \
bison \
bzip2 \
curl \
diffutils \
file \
git \
make \
patch \
perl \
unzip \
util-linux \
wget \
which \
xz \
yasm \
less \
zstd \
libgomp \
sudo \
gcc-toolset-${DEVTOOLSET_VERSION}-toolchain
# Ensure the expected devtoolset is used
ENV PATH=/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/bin:$PATH
ENV LD_LIBRARY_PATH=/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/lib64:/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/lib:$LD_LIBRARY_PATH
# git236+ would refuse to run git commands in repos owned by other users
# Which causes version check to fail, as pytorch repo is bind-mounted into the image
# Override this behaviour by treating every folder as safe
# For more details see https://github.com/pytorch/pytorch/issues/78659#issuecomment-1144107327
RUN git config --global --add safe.directory "*"
FROM base as openssl
# Install openssl (this must precede `build python` step)
# (In order to have a proper SSL module, Python is compiled
# against a recent openssl [see env vars above], which is linked
# statically. We delete openssl afterwards.)
ADD ./common/install_openssl.sh install_openssl.sh
RUN bash ./install_openssl.sh && rm install_openssl.sh
ENV SSL_CERT_FILE=/opt/_internal/certs.pem
FROM openssl as final
# remove unnecessary python versions
RUN rm -rf /opt/python/cp26-cp26m /opt/_internal/cpython-2.6.9-ucs2
RUN rm -rf /opt/python/cp26-cp26mu /opt/_internal/cpython-2.6.9-ucs4
RUN rm -rf /opt/python/cp33-cp33m /opt/_internal/cpython-3.3.6
RUN rm -rf /opt/python/cp34-cp34m /opt/_internal/cpython-3.4.6
FROM base as cuda
ARG BASE_CUDA_VERSION
# Install CUDA
ADD ./common/install_cuda_aarch64.sh install_cuda_aarch64.sh
RUN bash ./install_cuda_aarch64.sh ${BASE_CUDA_VERSION} && rm install_cuda_aarch64.sh
FROM base as magma
ARG BASE_CUDA_VERSION
# Install magma
ADD ./common/install_magma.sh install_magma.sh
RUN bash ./install_magma.sh ${BASE_CUDA_VERSION} && rm install_magma.sh
FROM base as openblas
# Install openblas
ADD ./common/install_openblas.sh install_openblas.sh
RUN bash ./install_openblas.sh && rm install_openblas.sh
FROM final as cuda_final
ARG BASE_CUDA_VERSION
RUN rm -rf /usr/local/cuda-${BASE_CUDA_VERSION}
COPY --from=cuda /usr/local/cuda-${BASE_CUDA_VERSION} /usr/local/cuda-${BASE_CUDA_VERSION}
COPY --from=magma /usr/local/cuda-${BASE_CUDA_VERSION} /usr/local/cuda-${BASE_CUDA_VERSION}
COPY --from=openblas /opt/OpenBLAS/ /opt/OpenBLAS/
RUN ln -sf /usr/local/cuda-${BASE_CUDA_VERSION} /usr/local/cuda
ENV PATH=/usr/local/cuda/bin:$PATH
ENV LD_LIBRARY_PATH=/opt/OpenBLAS/lib:$LD_LIBRARY_PATH

View File

@ -0,0 +1,71 @@
FROM centos:8 as base
ENV LC_ALL en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LANGUAGE en_US.UTF-8
ENV PATH /opt/rh/gcc-toolset-11/root/bin/:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
# change to a valid repo
RUN sed -i 's|#baseurl=http://mirror.centos.org|baseurl=http://vault.centos.org|g' /etc/yum.repos.d/CentOS-Linux-*.repo
# enable to install ninja-build
RUN sed -i 's|enabled=0|enabled=1|g' /etc/yum.repos.d/CentOS-Linux-PowerTools.repo
RUN yum -y update
RUN yum install -y wget curl perl util-linux xz bzip2 git patch which zlib-devel sudo
RUN yum install -y autoconf automake make cmake gdb gcc-toolset-11-gcc-c++
FROM base as openssl
ADD ./common/install_openssl.sh install_openssl.sh
RUN bash ./install_openssl.sh && rm install_openssl.sh
# Install python
FROM base as python
RUN yum install -y openssl-devel zlib-devel bzip2-devel ncurses-devel sqlite-devel readline-devel tk-devel gdbm-devel libpcap-devel xz-devel libffi-devel
ADD common/install_cpython.sh install_cpython.sh
RUN bash ./install_cpython.sh && rm install_cpython.sh
FROM base as conda
ADD ./common/install_conda_docker.sh install_conda.sh
RUN bash ./install_conda.sh && rm install_conda.sh
RUN /opt/conda/bin/conda install -y cmake
FROM base as intel
# Install MKL
COPY --from=python /opt/python /opt/python
COPY --from=python /opt/_internal /opt/_internal
COPY --from=conda /opt/conda /opt/conda
ENV PATH=/opt/conda/bin:$PATH
ADD ./common/install_mkl.sh install_mkl.sh
RUN bash ./install_mkl.sh && rm install_mkl.sh
FROM base as patchelf
ADD ./common/install_patchelf.sh install_patchelf.sh
RUN bash ./install_patchelf.sh && rm install_patchelf.sh
RUN cp $(which patchelf) /patchelf
FROM base as jni
ADD ./common/install_jni.sh install_jni.sh
ADD ./java/jni.h jni.h
RUN bash ./install_jni.sh && rm install_jni.sh
FROM base as libpng
ADD ./common/install_libpng.sh install_libpng.sh
RUN bash ./install_libpng.sh && rm install_libpng.sh
FROM base as final
COPY --from=openssl /opt/openssl /opt/openssl
COPY --from=python /opt/python /opt/python
COPY --from=python /opt/_internal /opt/_internal
COPY --from=intel /opt/intel /opt/intel
COPY --from=conda /opt/conda /opt/conda
COPY --from=patchelf /usr/local/bin/patchelf /usr/local/bin/patchelf
COPY --from=jni /usr/local/include/jni.h /usr/local/include/jni.h
COPY --from=libpng /usr/local/bin/png* /usr/local/bin/
COPY --from=libpng /usr/local/bin/libpng* /usr/local/bin/
COPY --from=libpng /usr/local/include/png* /usr/local/include/
COPY --from=libpng /usr/local/include/libpng* /usr/local/include/
COPY --from=libpng /usr/local/lib/libpng* /usr/local/lib/
COPY --from=libpng /usr/local/lib/pkgconfig /usr/local/lib/pkgconfig
RUN yum install -y ninja-build

View File

@ -0,0 +1,73 @@
FROM --platform=linux/s390x docker.io/ubuntu:24.04 as base
# Language variables
ENV LC_ALL=C.UTF-8
ENV LANG=C.UTF-8
ENV LANGUAGE=C.UTF-8
# Install needed OS packages. This is to support all
# the binary builds (torch, vision, audio, text, data)
RUN apt update ; apt upgrade -y
RUN apt install -y \
build-essential \
autoconf \
automake \
bzip2 \
curl \
diffutils \
file \
git \
make \
patch \
perl \
unzip \
util-linux \
wget \
which \
xz-utils \
less \
zstd \
cmake \
python3 \
python3-dev \
python3-setuptools \
python3-yaml \
python3-typing-extensions \
libblas-dev \
libopenblas-dev \
liblapack-dev \
libatlas-base-dev
# git236+ would refuse to run git commands in repos owned by other users
# Which causes version check to fail, as pytorch repo is bind-mounted into the image
# Override this behaviour by treating every folder as safe
# For more details see https://github.com/pytorch/pytorch/issues/78659#issuecomment-1144107327
RUN git config --global --add safe.directory "*"
FROM base as openssl
# Install openssl (this must precede `build python` step)
# (In order to have a proper SSL module, Python is compiled
# against a recent openssl [see env vars above], which is linked
# statically. We delete openssl afterwards.)
ADD ./common/install_openssl.sh install_openssl.sh
RUN bash ./install_openssl.sh && rm install_openssl.sh
ENV SSL_CERT_FILE=/opt/_internal/certs.pem
# EPEL for cmake
FROM base as patchelf
# Install patchelf
ADD ./common/install_patchelf.sh install_patchelf.sh
RUN bash ./install_patchelf.sh && rm install_patchelf.sh
RUN cp $(which patchelf) /patchelf
FROM patchelf as python
# build python
COPY manywheel/build_scripts /build_scripts
ADD ./common/install_cpython.sh /build_scripts/install_cpython.sh
RUN bash build_scripts/build.sh && rm -r build_scripts
FROM openssl as final
COPY --from=python /opt/python /opt/python
COPY --from=python /opt/_internal /opt/_internal
COPY --from=python /opt/python/cp39-cp39/bin/auditwheel /usr/local/bin/auditwheel
COPY --from=patchelf /usr/local/bin/patchelf /usr/local/bin/patchelf

154
.ci/docker/manywheel/build.sh Executable file
View File

@ -0,0 +1,154 @@
#!/usr/bin/env bash
# Script used only in CD pipeline
set -eou pipefail
TOPDIR=$(git rev-parse --show-toplevel)
image="$1"
shift
if [ -z "${image}" ]; then
echo "Usage: $0 IMAGE"
exit 1
fi
DOCKER_IMAGE="pytorch/${image}"
DOCKER_REGISTRY="${DOCKER_REGISTRY:-docker.io}"
GPU_ARCH_TYPE=${GPU_ARCH_TYPE:-cpu}
GPU_ARCH_VERSION=${GPU_ARCH_VERSION:-}
MANY_LINUX_VERSION=${MANY_LINUX_VERSION:-}
DOCKERFILE_SUFFIX=${DOCKERFILE_SUFFIX:-}
WITH_PUSH=${WITH_PUSH:-}
case ${GPU_ARCH_TYPE} in
cpu)
TARGET=cpu_final
DOCKER_TAG=cpu
GPU_IMAGE=centos:7
DOCKER_GPU_BUILD_ARG=" --build-arg DEVTOOLSET_VERSION=9"
;;
cpu-manylinux_2_28)
TARGET=cpu_final
DOCKER_TAG=cpu
GPU_IMAGE=amd64/almalinux:8
DOCKER_GPU_BUILD_ARG=" --build-arg DEVTOOLSET_VERSION=11"
MANY_LINUX_VERSION="2_28"
;;
cpu-aarch64)
TARGET=final
DOCKER_TAG=cpu-aarch64
GPU_IMAGE=arm64v8/centos:7
DOCKER_GPU_BUILD_ARG=" --build-arg DEVTOOLSET_VERSION=10"
MANY_LINUX_VERSION="aarch64"
;;
cpu-aarch64-2_28)
TARGET=final
DOCKER_TAG=cpu-aarch64
GPU_IMAGE=arm64v8/almalinux:8
DOCKER_GPU_BUILD_ARG=" --build-arg DEVTOOLSET_VERSION=11"
MANY_LINUX_VERSION="2_28_aarch64"
;;
cpu-cxx11-abi)
TARGET=final
DOCKER_TAG=cpu-cxx11-abi
GPU_IMAGE=""
DOCKER_GPU_BUILD_ARG=" --build-arg DEVTOOLSET_VERSION=9"
MANY_LINUX_VERSION="cxx11-abi"
;;
cpu-s390x)
TARGET=final
DOCKER_TAG=cpu-s390x
GPU_IMAGE=redhat/ubi9
DOCKER_GPU_BUILD_ARG=""
MANY_LINUX_VERSION="s390x"
;;
cuda)
TARGET=cuda_final
DOCKER_TAG=cuda${GPU_ARCH_VERSION}
# Keep this up to date with the minimum version of CUDA we currently support
GPU_IMAGE=centos:7
DOCKER_GPU_BUILD_ARG="--build-arg BASE_CUDA_VERSION=${GPU_ARCH_VERSION} --build-arg DEVTOOLSET_VERSION=9"
;;
cuda-manylinux_2_28)
TARGET=cuda_final
DOCKER_TAG=cuda${GPU_ARCH_VERSION}
GPU_IMAGE=amd64/almalinux:8
DOCKER_GPU_BUILD_ARG="--build-arg BASE_CUDA_VERSION=${GPU_ARCH_VERSION} --build-arg DEVTOOLSET_VERSION=11"
MANY_LINUX_VERSION="2_28"
;;
cuda-aarch64)
TARGET=cuda_final
DOCKER_TAG=cuda${GPU_ARCH_VERSION}
GPU_IMAGE=arm64v8/centos:7
DOCKER_GPU_BUILD_ARG="--build-arg BASE_CUDA_VERSION=${GPU_ARCH_VERSION} --build-arg DEVTOOLSET_VERSION=11"
MANY_LINUX_VERSION="aarch64"
DOCKERFILE_SUFFIX="_cuda_aarch64"
;;
rocm)
TARGET=rocm_final
DOCKER_TAG=rocm${GPU_ARCH_VERSION}
GPU_IMAGE=rocm/dev-centos-7:${GPU_ARCH_VERSION}-complete
PYTORCH_ROCM_ARCH="gfx900;gfx906;gfx908;gfx90a;gfx1030;gfx1100"
ROCM_REGEX="([0-9]+)\.([0-9]+)[\.]?([0-9]*)"
if [[ $GPU_ARCH_VERSION =~ $ROCM_REGEX ]]; then
ROCM_VERSION_INT=$((${BASH_REMATCH[1]}*10000 + ${BASH_REMATCH[2]}*100 + ${BASH_REMATCH[3]:-0}))
else
echo "ERROR: rocm regex failed"
exit 1
fi
if [[ $ROCM_VERSION_INT -ge 60000 ]]; then
PYTORCH_ROCM_ARCH+=";gfx942"
fi
DOCKER_GPU_BUILD_ARG="--build-arg ROCM_VERSION=${GPU_ARCH_VERSION} --build-arg PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH} --build-arg DEVTOOLSET_VERSION=9"
;;
xpu)
TARGET=xpu_final
DOCKER_TAG=xpu
GPU_IMAGE=amd64/almalinux:8
DOCKER_GPU_BUILD_ARG=" --build-arg DEVTOOLSET_VERSION=11"
MANY_LINUX_VERSION="2_28"
;;
*)
echo "ERROR: Unrecognized GPU_ARCH_TYPE: ${GPU_ARCH_TYPE}"
exit 1
;;
esac
IMAGES=''
if [[ -n ${MANY_LINUX_VERSION} && -z ${DOCKERFILE_SUFFIX} ]]; then
DOCKERFILE_SUFFIX=_${MANY_LINUX_VERSION}
fi
(
set -x
DOCKER_BUILDKIT=1 docker build \
${DOCKER_GPU_BUILD_ARG} \
--build-arg "GPU_IMAGE=${GPU_IMAGE}" \
--target "${TARGET}" \
-t "${DOCKER_IMAGE}" \
$@ \
-f "${TOPDIR}/.ci/docker/manywheel/Dockerfile${DOCKERFILE_SUFFIX}" \
"${TOPDIR}/.ci/docker/"
)
GITHUB_REF=${GITHUB_REF:-$(git symbolic-ref -q HEAD || git describe --tags --exact-match)}
GIT_BRANCH_NAME=${GITHUB_REF##*/}
GIT_COMMIT_SHA=${GITHUB_SHA:-$(git rev-parse HEAD)}
DOCKER_IMAGE_BRANCH_TAG=${DOCKER_IMAGE}-${GIT_BRANCH_NAME}
DOCKER_IMAGE_SHA_TAG=${DOCKER_IMAGE}-${GIT_COMMIT_SHA}
if [[ "${WITH_PUSH}" == true ]]; then
(
set -x
docker push "${DOCKER_IMAGE}"
if [[ -n ${GITHUB_REF} ]]; then
docker tag ${DOCKER_IMAGE} ${DOCKER_IMAGE_BRANCH_TAG}
docker tag ${DOCKER_IMAGE} ${DOCKER_IMAGE_SHA_TAG}
docker push "${DOCKER_IMAGE_BRANCH_TAG}"
docker push "${DOCKER_IMAGE_SHA_TAG}"
fi
)
fi
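A minimal sketch of how this script can be invoked (the image tags and env values are illustrative; in CI they come from the workflow matrix):

```bash
# CPU build of the default manylinux image
GPU_ARCH_TYPE=cpu .ci/docker/manywheel/build.sh manylinux-builder:cpu

# ROCm 6.1 build: the regex above parses "6.1" into
# ROCM_VERSION_INT = 6*10000 + 1*100 + 0 = 60100 >= 60000, so gfx942 is
# appended to PYTORCH_ROCM_ARCH before it is passed as a --build-arg.
GPU_ARCH_TYPE=rocm GPU_ARCH_VERSION=6.1 \
  .ci/docker/manywheel/build.sh manylinux-builder:rocm6.1
```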


@ -0,0 +1,131 @@
#!/bin/bash
# Top-level build script called from Dockerfile
# Script used only in CD pipeline
# Stop at any error, show all commands
set -ex
# openssl version to build, with expected sha256 hash of .tar.gz
# archive
OPENSSL_ROOT=openssl-1.1.1l
OPENSSL_HASH=0b7a3e5e59c34827fe0c3a74b7ec8baef302b98fa80088d7f9153aa16fa76bd1
DEVTOOLS_HASH=a8ebeb4bed624700f727179e6ef771dafe47651131a00a78b342251415646acc
PATCHELF_HASH=d9afdff4baeacfbc64861454f368b7f2c15c44d245293f7587bbf726bfe722fb
CURL_ROOT=curl-7.73.0
CURL_HASH=cf34fe0b07b800f1c01a499a6e8b2af548f6d0e044dca4a29d88a4bee146d131
AUTOCONF_ROOT=autoconf-2.69
AUTOCONF_HASH=954bd69b391edc12d6a4a51a2dd1476543da5c6bbf05a95b59dc0dd6fd4c2969
# Get build utilities
MY_DIR=$(dirname "${BASH_SOURCE[0]}")
source $MY_DIR/build_utils.sh
if [ "$(uname -m)" != "s390x" ] ; then
# Dependencies for compiling Python that we want to remove from
# the final image after compiling Python
PYTHON_COMPILE_DEPS="zlib-devel bzip2-devel ncurses-devel sqlite-devel readline-devel tk-devel gdbm-devel db4-devel libpcap-devel xz-devel libffi-devel"
# Libraries that are allowed as part of the manylinux1 profile
MANYLINUX1_DEPS="glibc-devel libstdc++-devel glib2-devel libX11-devel libXext-devel libXrender-devel mesa-libGL-devel libICE-devel libSM-devel ncurses-devel"
# Development tools and libraries
yum -y install bzip2 make git patch unzip bison yasm diffutils \
automake which file cmake28 \
kernel-devel-`uname -r` \
${PYTHON_COMPILE_DEPS}
else
# Dependencies for compiling Python that we want to remove from
# the final image after compiling Python
PYTHON_COMPILE_DEPS="zlib1g-dev libbz2-dev libncurses-dev libsqlite3-dev libdb-dev libpcap-dev liblzma-dev libffi-dev"
# Libraries that are allowed as part of the manylinux1 profile
MANYLINUX1_DEPS="libglib2.0-dev libX11-dev libncurses-dev"
# Development tools and libraries
apt install -y bzip2 make git patch unzip diffutils \
automake which file cmake \
linux-headers-virtual \
${PYTHON_COMPILE_DEPS}
fi
# Install newest autoconf
build_autoconf $AUTOCONF_ROOT $AUTOCONF_HASH
autoconf --version
# Compile the latest Python releases.
# (In order to have a proper SSL module, Python is compiled
# against a recent openssl [see env vars above], which is linked
# statically. We delete openssl afterwards.)
build_openssl $OPENSSL_ROOT $OPENSSL_HASH
/build_scripts/install_cpython.sh
PY39_BIN=/opt/python/cp39-cp39/bin
# Our openssl doesn't know how to find the system CA trust store
# (https://github.com/pypa/manylinux/issues/53)
# And it's not clear how up-to-date that is anyway
# So let's just use the same one pip and everyone uses
$PY39_BIN/pip install certifi
ln -s $($PY39_BIN/python -c 'import certifi; print(certifi.where())') \
/opt/_internal/certs.pem
# If you modify this line you also have to modify the versions in the
# Dockerfiles:
export SSL_CERT_FILE=/opt/_internal/certs.pem
# Install newest curl
build_curl $CURL_ROOT $CURL_HASH
rm -rf /usr/local/include/curl /usr/local/lib/libcurl* /usr/local/lib/pkgconfig/libcurl.pc
hash -r
curl --version
curl-config --features
# Install patchelf (latest with unreleased bug fixes)
curl -sLOk https://nixos.org/releases/patchelf/patchelf-0.10/patchelf-0.10.tar.gz
# check_sha256sum patchelf-0.9njs2.tar.gz $PATCHELF_HASH
tar -xzf patchelf-0.10.tar.gz
(cd patchelf-0.10 && ./configure && make && make install)
rm -rf patchelf-0.10.tar.gz patchelf-0.10
# Install latest pypi release of auditwheel
$PY39_BIN/pip install auditwheel
ln -s $PY39_BIN/auditwheel /usr/local/bin/auditwheel
# Clean up development headers and other unnecessary stuff for
# final image
if [ "$(uname -m)" != "s390x" ] ; then
yum -y erase wireless-tools gtk2 libX11 hicolor-icon-theme \
avahi freetype bitstream-vera-fonts \
${PYTHON_COMPILE_DEPS} || true > /dev/null 2>&1
yum -y install ${MANYLINUX1_DEPS}
yum -y clean all > /dev/null 2>&1
yum list installed
else
apt purge -y ${PYTHON_COMPILE_DEPS} || true > /dev/null 2>&1
fi
# we don't need libpython*.a, and they're many megabytes
find /opt/_internal -name '*.a' -print0 | xargs -0 rm -f
# Strip what we can -- and ignore errors, because this just attempts to strip
# *everything*, including non-ELF files:
find /opt/_internal -type f -print0 \
| xargs -0 -n1 strip --strip-unneeded 2>/dev/null || true
# We do not need the Python test suites, or indeed the precompiled .pyc and
# .pyo files. Partially cribbed from:
# https://github.com/docker-library/python/blob/master/3.4/slim/Dockerfile
find /opt/_internal \
\( -type d -a -name test -o -name tests \) \
-o \( -type f -a -name '*.pyc' -o -name '*.pyo' \) \
-print0 | xargs -0 rm -f
for PYTHON in /opt/python/*/bin/python; do
# Smoke test to make sure that our Pythons work, and do indeed detect as
# being manylinux compatible:
$PYTHON $MY_DIR/manylinux1-check.py
# Make sure that SSL cert checking works
$PYTHON $MY_DIR/ssl-check.py
done
# Fix libc headers to remain compatible with C99 compilers.
find /usr/include/ -type f -exec sed -i 's/\bextern _*inline_*\b/extern __inline __attribute__ ((__gnu_inline__))/g' {} +
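# Note: glibc's legacy headers declare functions as GNU89-style "extern inline";
# rewriting them with the __gnu_inline__ attribute preserves those semantics when
# the headers are later compiled by C99/C11 compilers.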
# Now we can delete our built SSL
rm -rf /usr/local/ssl


@ -0,0 +1,91 @@
#!/bin/bash
# Helper utilities for build
# Script used only in CD pipeline
OPENSSL_DOWNLOAD_URL=https://www.openssl.org/source/old/1.1.1/
CURL_DOWNLOAD_URL=https://curl.askapache.com/download
AUTOCONF_DOWNLOAD_URL=https://ftp.gnu.org/gnu/autoconf
function check_var {
if [ -z "$1" ]; then
echo "required variable not defined"
exit 1
fi
}
function do_openssl_build {
./config no-ssl2 no-shared -fPIC --prefix=/usr/local/ssl > /dev/null
make > /dev/null
make install > /dev/null
}
function check_sha256sum {
local fname=$1
check_var ${fname}
local sha256=$2
check_var ${sha256}
echo "${sha256} ${fname}" > ${fname}.sha256
sha256sum -c ${fname}.sha256
rm -f ${fname}.sha256
}
function build_openssl {
local openssl_fname=$1
check_var ${openssl_fname}
local openssl_sha256=$2
check_var ${openssl_sha256}
check_var ${OPENSSL_DOWNLOAD_URL}
curl -sLO ${OPENSSL_DOWNLOAD_URL}/${openssl_fname}.tar.gz
check_sha256sum ${openssl_fname}.tar.gz ${openssl_sha256}
tar -xzf ${openssl_fname}.tar.gz
(cd ${openssl_fname} && do_openssl_build)
rm -rf ${openssl_fname} ${openssl_fname}.tar.gz
}
function do_curl_build {
LIBS=-ldl ./configure --with-ssl --disable-shared > /dev/null
make > /dev/null
make install > /dev/null
}
function build_curl {
local curl_fname=$1
check_var ${curl_fname}
local curl_sha256=$2
check_var ${curl_sha256}
check_var ${CURL_DOWNLOAD_URL}
curl -sLO ${CURL_DOWNLOAD_URL}/${curl_fname}.tar.bz2
check_sha256sum ${curl_fname}.tar.bz2 ${curl_sha256}
tar -jxf ${curl_fname}.tar.bz2
(cd ${curl_fname} && do_curl_build)
rm -rf ${curl_fname} ${curl_fname}.tar.bz2
}
function do_standard_install {
./configure > /dev/null
make > /dev/null
make install > /dev/null
}
function build_autoconf {
local autoconf_fname=$1
check_var ${autoconf_fname}
local autoconf_sha256=$2
check_var ${autoconf_sha256}
check_var ${AUTOCONF_DOWNLOAD_URL}
curl -sLO ${AUTOCONF_DOWNLOAD_URL}/${autoconf_fname}.tar.gz
check_sha256sum ${autoconf_fname}.tar.gz ${autoconf_sha256}
tar -zxf ${autoconf_fname}.tar.gz
(cd ${autoconf_fname} && do_standard_install)
rm -rf ${autoconf_fname} ${autoconf_fname}.tar.gz
}
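A short sketch of how these helpers are combined by the top-level build script above (the variables are the ones it defines, not new values):

```bash
source build_utils.sh
build_autoconf "$AUTOCONF_ROOT" "$AUTOCONF_HASH"   # download, verify sha256, configure/make/install
build_openssl  "$OPENSSL_ROOT"  "$OPENSSL_HASH"    # built statically (no-shared) into /usr/local/ssl
build_curl     "$CURL_ROOT"     "$CURL_HASH"
```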


@ -0,0 +1,60 @@
# Logic copied from PEP 513
def is_manylinux1_compatible():
# Only Linux, and only x86-64 / i686
from distutils.util import get_platform
if get_platform() not in ["linux-x86_64", "linux-i686", "linux-s390x"]:
return False
# Check for presence of _manylinux module
try:
import _manylinux
return bool(_manylinux.manylinux1_compatible)
except (ImportError, AttributeError):
# Fall through to heuristic check below
pass
# Check glibc version. CentOS 5 uses glibc 2.5.
return have_compatible_glibc(2, 5)
def have_compatible_glibc(major, minimum_minor):
import ctypes
process_namespace = ctypes.CDLL(None)
try:
gnu_get_libc_version = process_namespace.gnu_get_libc_version
except AttributeError:
# Symbol doesn't exist -> therefore, we are not linked to
# glibc.
return False
# Call gnu_get_libc_version, which returns a string like "2.5".
gnu_get_libc_version.restype = ctypes.c_char_p
version_str = gnu_get_libc_version()
# py2 / py3 compatibility:
if not isinstance(version_str, str):
version_str = version_str.decode("ascii")
# Parse string and check against requested version.
version = [int(piece) for piece in version_str.split(".")]
assert len(version) == 2
if major != version[0]:
return False
if minimum_minor > version[1]:
return False
return True
import sys
if is_manylinux1_compatible():
print(f"{sys.executable} is manylinux1 compatible")
sys.exit(0)
else:
print(f"{sys.executable} is NOT manylinux1 compatible")
sys.exit(1)


@ -0,0 +1,35 @@
# cf. https://github.com/pypa/manylinux/issues/53
GOOD_SSL = "https://google.com"
BAD_SSL = "https://self-signed.badssl.com"
import sys
print("Testing SSL certificate checking for Python:", sys.version)
if sys.version_info[:2] < (2, 7) or sys.version_info[:2] < (3, 4):
print("This version never checks SSL certs; skipping tests")
sys.exit(0)
if sys.version_info[0] >= 3:
from urllib.request import urlopen
EXC = OSError
else:
from urllib import urlopen
EXC = IOError
print(f"Connecting to {GOOD_SSL} should work")
urlopen(GOOD_SSL)
print("...it did, yay.")
print(f"Connecting to {BAD_SSL} should fail")
try:
urlopen(BAD_SSL)
# If we get here then we failed:
print("...it DIDN'T!!!!!11!!1one!")
sys.exit(1)
except EXC:
print("...it did, yay.")


@ -134,9 +134,9 @@ opt-einsum==3.3
#Pinned versions: 3.3
#test that import: test_linalg.py
optree==0.11.0
optree==0.12.1
#Description: A library for tree manipulation
#Pinned versions: 0.11.0
#Pinned versions: 0.12.1
#test that import: test_vmap.py, test_aotdispatch.py, test_dynamic_shapes.py,
#test_pytree.py, test_ops.py, test_control_flow.py, test_modules.py,
#common_utils.py, test_eager_transforms.py, test_python_dispatch.py,


@ -1,42 +1 @@
This directory contains scripts for our continuous integration.
One important thing to keep in mind when reading the scripts here is
that they are all based off of Docker images, which we build for each of
the various system configurations we want to run on Jenkins. This means
it is very easy to run these tests yourself:
1. Figure out what Docker image you want. The general template for our
images looks like:
``registry.pytorch.org/pytorch/pytorch-$BUILD_ENVIRONMENT:$DOCKER_VERSION``,
where ``$BUILD_ENVIRONMENT`` is one of the build environments
enumerated in
[pytorch-dockerfiles](https://github.com/pytorch/pytorch/blob/master/.ci/docker/build.sh). The dockerfile used by jenkins can be found under the `.ci` [directory](https://github.com/pytorch/pytorch/blob/master/.ci/docker)
2. Run ``docker run -it -u jenkins $DOCKER_IMAGE``, clone PyTorch and
run one of the scripts in this directory.
The Docker images are designed so that any "reasonable" build commands
will work; if you look in [build.sh](build.sh) you will see that it is a
very simple script. This is intentional. Idiomatic build instructions
should work inside all of our Docker images. You can tweak the commands
however you need (e.g., in case you want to rebuild with DEBUG, or rerun
the build with higher verbosity, etc.).
We have to do some work to make this so. Here is a summary of the
mechanisms we use:
- We install binaries to directories like `/usr/local/bin` which
are automatically part of your PATH.
- We add entries to the PATH using Docker ENV variables (so
they apply when you enter Docker) and `/etc/environment` (so they
continue to apply even if you sudo), instead of modifying
`PATH` in our build scripts.
- We use `/etc/ld.so.conf.d` to register directories containing
shared libraries, instead of modifying `LD_LIBRARY_PATH` in our
build scripts.
- We reroute well known paths like `/usr/bin/gcc` to alternate
implementations with `update-alternatives`, instead of setting
`CC` and `CXX` in our implementations.
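A hypothetical walk-through of the two steps above (the image tag is a placeholder, not a real build environment):

```bash
DOCKER_IMAGE=registry.pytorch.org/pytorch/pytorch-linux-focal-py3-gcc9:latest   # placeholder tag
docker run -it -u jenkins "$DOCKER_IMAGE" bash
# inside the container:
git clone https://github.com/pytorch/pytorch.git && cd pytorch
.ci/pytorch/build.sh   # or any other script in this directory
```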


@ -230,6 +230,10 @@ if [[ "${BUILD_ENVIRONMENT}" != *android* && "${BUILD_ENVIRONMENT}" != *cuda* ]]
export BUILD_STATIC_RUNTIME_BENCHMARK=ON
fi
if [[ "$BUILD_ENVIRONMENT" == *-debug* ]]; then
export CMAKE_BUILD_TYPE=RelWithAssert
fi
# Do not change workspace permissions for ROCm CI jobs
# as it can leave workspace with bad permissions for cancelled jobs
if [[ "$BUILD_ENVIRONMENT" != *rocm* ]]; then


@ -6,6 +6,7 @@ from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import rsa
from cryptography.x509.oid import NameOID
temp_dir = mkdtemp()
print(temp_dir)


@ -18,6 +18,7 @@ time python test/run_test.py --verbose -i distributed/test_c10d_gloo
time python test/run_test.py --verbose -i distributed/test_c10d_nccl
time python test/run_test.py --verbose -i distributed/test_c10d_spawn_gloo
time python test/run_test.py --verbose -i distributed/test_c10d_spawn_nccl
time python test/run_test.py --verbose -i distributed/test_compute_comm_reordering
time python test/run_test.py --verbose -i distributed/test_store
time python test/run_test.py --verbose -i distributed/test_symmetric_memory
time python test/run_test.py --verbose -i distributed/test_pg_wrapper
@ -43,16 +44,15 @@ time python test/run_test.py --verbose -i distributed/_tensor/test_dtensor_compi
time python test/run_test.py --verbose -i distributed/test_device_mesh
# DTensor/TP tests
time python test/run_test.py --verbose -i distributed/tensor/parallel/test_ddp_2d_parallel
time python test/run_test.py --verbose -i distributed/tensor/parallel/test_fsdp_2d_parallel
time python test/run_test.py --verbose -i distributed/tensor/parallel/test_tp_examples
time python test/run_test.py --verbose -i distributed/tensor/parallel/test_tp_random_state
# FSDP2 tests
time python test/run_test.py --verbose -i distributed/_composable/fsdp/test_fully_shard_training -- -k test_2d_mlp_with_nd_mesh
# Pipelining composability tests
time python test/run_test.py --verbose -i distributed/pipelining/test_composability.py
# ND composability tests
time python test/run_test.py --verbose -i distributed/_composable/test_composability/test_2d_composability
time python test/run_test.py --verbose -i distributed/_composable/test_composability/test_pp_composability
# Other tests
time python test/run_test.py --verbose -i test_cuda_primary_ctx


@ -3,6 +3,7 @@ import json
import math
import sys
parser = argparse.ArgumentParser()
parser.add_argument(
"--test-name", dest="test_name", action="store", required=True, help="test name"


@ -3,6 +3,7 @@ import sys
import numpy
sample_data_list = sys.argv[1:]
sample_data_list = [float(v.strip()) for v in sample_data_list]


@ -1,6 +1,7 @@
import json
import sys
data_file_path = sys.argv[1]
commit_hash = sys.argv[2]


@ -1,5 +1,6 @@
import sys
log_file_path = sys.argv[1]
with open(log_file_path) as f:


@ -249,9 +249,7 @@ fi
# This tests that the debug asserts are working correctly.
if [[ "$BUILD_ENVIRONMENT" == *-debug* ]]; then
echo "We are in debug mode: $BUILD_ENVIRONMENT. Expect the python assertion to fail"
# TODO: Enable the check after we setup the build to run debug asserts without having
# to do a full (and slow) debug build
# (cd test && ! get_exit_code python -c "import torch; torch._C._crash_if_debug_asserts_fail(424242)")
(cd test && ! get_exit_code python -c "import torch; torch._C._crash_if_debug_asserts_fail(424242)")
elif [[ "$BUILD_ENVIRONMENT" != *-bazel-* ]]; then
# Noop when debug is disabled. Skip bazel jobs because torch isn't available there yet.
echo "We are not in debug mode: $BUILD_ENVIRONMENT. Expect the assertion to pass"
@ -318,14 +316,13 @@ test_inductor_distributed() {
python test/run_test.py -i inductor/test_aot_inductor.py -k test_replicate_on_devices --verbose
python test/run_test.py -i distributed/test_c10d_functional_native.py --verbose
python test/run_test.py -i distributed/_tensor/test_dtensor_compile.py --verbose
python test/run_test.py -i distributed/tensor/parallel/test_fsdp_2d_parallel.py --verbose
python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_comm.py --verbose
python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_multi_group --verbose
python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_with_activation_checkpointing --verbose
python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_2d_mlp --verbose
python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_hsdp --verbose
python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_2d_transformer_checkpoint_resume --verbose
python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_training.py -k test_gradient_accumulation --verbose
python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_state_dict.py -k test_dp_state_dict_save_load --verbose
python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_frozen.py --verbose
python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_mixed_precision.py -k test_compute_dtype --verbose
python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_mixed_precision.py -k test_reduce_dtype --verbose
@ -406,7 +403,7 @@ if [[ "${TEST_CONFIG}" == *dynamic* ]]; then
DYNAMO_BENCHMARK_FLAGS+=(--dynamic-shapes --dynamic-batch-only)
fi
if [[ "${TEST_CONFIG}" == *cpu_inductor* || "${TEST_CONFIG}" == *cpu_aot_inductor* ]]; then
if [[ "${TEST_CONFIG}" == *cpu* ]]; then
DYNAMO_BENCHMARK_FLAGS+=(--device cpu)
else
DYNAMO_BENCHMARK_FLAGS+=(--device cuda)
@ -430,6 +427,19 @@ test_perf_for_dashboard() {
# TODO: All the accuracy tests can be skipped once the CI accuracy checking is stable enough
local targets=(accuracy performance)
local device=cuda
local taskset=""
if [[ "${TEST_CONFIG}" == *cpu* ]]; then
if [[ "${TEST_CONFIG}" == *cpu_x86* ]]; then
device=cpu_x86
elif [[ "${TEST_CONFIG}" == *cpu_aarch64* ]]; then
device=cpu_aarch64
fi
test_inductor_set_cpu_affinity
end_core=$(( $(test_inductor_get_core_number)-1 ))
taskset="taskset -c 0-$end_core"
fi
for mode in "${modes[@]}"; do
if [[ "$mode" == "inference" ]]; then
dtype=bfloat16
@ -445,56 +455,56 @@ test_perf_for_dashboard() {
fi
if [[ "$DASHBOARD_TAG" == *default-true* ]]; then
python "benchmarks/dynamo/$suite.py" \
$taskset python "benchmarks/dynamo/$suite.py" \
"${target_flag[@]}" --"$mode" --"$dtype" --backend "$backend" --disable-cudagraphs "$@" \
--output "$TEST_REPORTS_DIR/${backend}_no_cudagraphs_${suite}_${dtype}_${mode}_cuda_${target}.csv"
--output "$TEST_REPORTS_DIR/${backend}_no_cudagraphs_${suite}_${dtype}_${mode}_${device}_${target}.csv"
fi
if [[ "$DASHBOARD_TAG" == *cudagraphs-true* ]]; then
python "benchmarks/dynamo/$suite.py" \
$taskset python "benchmarks/dynamo/$suite.py" \
"${target_flag[@]}" --"$mode" --"$dtype" --backend "$backend" "$@" \
--output "$TEST_REPORTS_DIR/${backend}_with_cudagraphs_${suite}_${dtype}_${mode}_cuda_${target}.csv"
--output "$TEST_REPORTS_DIR/${backend}_with_cudagraphs_${suite}_${dtype}_${mode}_${device}_${target}.csv"
fi
if [[ "$DASHBOARD_TAG" == *dynamic-true* ]]; then
python "benchmarks/dynamo/$suite.py" \
$taskset python "benchmarks/dynamo/$suite.py" \
"${target_flag[@]}" --"$mode" --"$dtype" --backend "$backend" --dynamic-shapes \
--dynamic-batch-only "$@" \
--output "$TEST_REPORTS_DIR/${backend}_dynamic_${suite}_${dtype}_${mode}_cuda_${target}.csv"
--output "$TEST_REPORTS_DIR/${backend}_dynamic_${suite}_${dtype}_${mode}_${device}_${target}.csv"
fi
if [[ "$DASHBOARD_TAG" == *cppwrapper-true* ]] && [[ "$mode" == "inference" ]]; then
TORCHINDUCTOR_CPP_WRAPPER=1 python "benchmarks/dynamo/$suite.py" \
TORCHINDUCTOR_CPP_WRAPPER=1 $taskset python "benchmarks/dynamo/$suite.py" \
"${target_flag[@]}" --"$mode" --"$dtype" --backend "$backend" --disable-cudagraphs "$@" \
--output "$TEST_REPORTS_DIR/${backend}_cpp_wrapper_${suite}_${dtype}_${mode}_cuda_${target}.csv"
--output "$TEST_REPORTS_DIR/${backend}_cpp_wrapper_${suite}_${dtype}_${mode}_${device}_${target}.csv"
fi
if [[ "$DASHBOARD_TAG" == *freezing_cudagraphs-true* ]] && [[ "$mode" == "inference" ]]; then
python "benchmarks/dynamo/$suite.py" \
$taskset python "benchmarks/dynamo/$suite.py" \
"${target_flag[@]}" --"$mode" --"$dtype" --backend "$backend" "$@" --freezing \
--output "$TEST_REPORTS_DIR/${backend}_with_cudagraphs_freezing_${suite}_${dtype}_${mode}_cuda_${target}.csv"
--output "$TEST_REPORTS_DIR/${backend}_with_cudagraphs_freezing_${suite}_${dtype}_${mode}_${device}_${target}.csv"
fi
if [[ "$DASHBOARD_TAG" == *freeze_autotune_cudagraphs-true* ]] && [[ "$mode" == "inference" ]]; then
TORCHINDUCTOR_MAX_AUTOTUNE=1 python "benchmarks/dynamo/$suite.py" \
TORCHINDUCTOR_MAX_AUTOTUNE=1 $taskset python "benchmarks/dynamo/$suite.py" \
"${target_flag[@]}" --"$mode" --"$dtype" --backend "$backend" "$@" --freezing \
--output "$TEST_REPORTS_DIR/${backend}_with_cudagraphs_freezing_autotune_${suite}_${dtype}_${mode}_cuda_${target}.csv"
--output "$TEST_REPORTS_DIR/${backend}_with_cudagraphs_freezing_autotune_${suite}_${dtype}_${mode}_${device}_${target}.csv"
fi
if [[ "$DASHBOARD_TAG" == *aotinductor-true* ]] && [[ "$mode" == "inference" ]]; then
TORCHINDUCTOR_ABI_COMPATIBLE=1 python "benchmarks/dynamo/$suite.py" \
TORCHINDUCTOR_ABI_COMPATIBLE=1 $taskset python "benchmarks/dynamo/$suite.py" \
"${target_flag[@]}" --"$mode" --"$dtype" --export-aot-inductor --disable-cudagraphs "$@" \
--output "$TEST_REPORTS_DIR/${backend}_aot_inductor_${suite}_${dtype}_${mode}_cuda_${target}.csv"
--output "$TEST_REPORTS_DIR/${backend}_aot_inductor_${suite}_${dtype}_${mode}_${device}_${target}.csv"
fi
if [[ "$DASHBOARD_TAG" == *maxautotune-true* ]]; then
TORCHINDUCTOR_MAX_AUTOTUNE=1 python "benchmarks/dynamo/$suite.py" \
TORCHINDUCTOR_MAX_AUTOTUNE=1 $taskset python "benchmarks/dynamo/$suite.py" \
"${target_flag[@]}" --"$mode" --"$dtype" --backend "$backend" "$@" \
--output "$TEST_REPORTS_DIR/${backend}_max_autotune_${suite}_${dtype}_${mode}_cuda_${target}.csv"
--output "$TEST_REPORTS_DIR/${backend}_max_autotune_${suite}_${dtype}_${mode}_${device}_${target}.csv"
fi
if [[ "$DASHBOARD_TAG" == *cudagraphs_low_precision-true* ]] && [[ "$mode" == "inference" ]]; then
# TODO: This has a new dtype called quant and the benchmarks script needs to be updated to support this.
# The tentative command is as follows. It doesn't work now, but it's ok because we only need mock data
# to fill the dashboard.
python "benchmarks/dynamo/$suite.py" \
$taskset python "benchmarks/dynamo/$suite.py" \
"${target_flag[@]}" --"$mode" --quant --backend "$backend" "$@" \
--output "$TEST_REPORTS_DIR/${backend}_cudagraphs_low_precision_${suite}_quant_${mode}_cuda_${target}.csv" || true
--output "$TEST_REPORTS_DIR/${backend}_cudagraphs_low_precision_${suite}_quant_${mode}_${device}_${target}.csv" || true
# Copy cudagraph results as mock data, easiest choice?
cp "$TEST_REPORTS_DIR/${backend}_with_cudagraphs_${suite}_${dtype}_${mode}_cuda_${target}.csv" \
"$TEST_REPORTS_DIR/${backend}_cudagraphs_low_precision_${suite}_quant_${mode}_cuda_${target}.csv"
cp "$TEST_REPORTS_DIR/${backend}_with_cudagraphs_${suite}_${dtype}_${mode}_${device}_${target}.csv" \
"$TEST_REPORTS_DIR/${backend}_cudagraphs_low_precision_${suite}_quant_${mode}_${device}_${target}.csv"
fi
done
done
@ -575,7 +585,7 @@ test_dynamo_benchmark() {
elif [[ "${TEST_CONFIG}" == *perf* ]]; then
test_single_dynamo_benchmark "dashboard" "$suite" "$shard_id" "$@"
else
if [[ "${TEST_CONFIG}" == *cpu_inductor* || "${TEST_CONFIG}" == *cpu_aot_inductor* ]]; then
if [[ "${TEST_CONFIG}" == *cpu* ]]; then
local dt="float32"
if [[ "${TEST_CONFIG}" == *amp* ]]; then
dt="amp"
@ -646,10 +656,11 @@ test_inductor_torchbench_smoketest_perf() {
done
}
test_inductor_torchbench_cpu_smoketest_perf(){
TEST_REPORTS_DIR=$(pwd)/test/test-reports
mkdir -p "$TEST_REPORTS_DIR"
test_inductor_get_core_number() {
echo $(($(lscpu | grep 'Socket(s):' | awk '{print $2}') * $(lscpu | grep 'Core(s) per socket:' | awk '{print $4}')))
}
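# (Illustrative) on a machine where lscpu reports "Socket(s): 2" and
# "Core(s) per socket: 28" this echoes 56; the callers below then pin work to
# the physical cores with `taskset -c 0-55`.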
test_inductor_set_cpu_affinity(){
#set jemalloc
JEMALLOC_LIB="/usr/lib/x86_64-linux-gnu/libjemalloc.so.2"
IOMP_LIB="$(dirname "$(which python)")/../lib/libiomp5.so"
@ -657,32 +668,39 @@ test_inductor_torchbench_cpu_smoketest_perf(){
export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:-1,muzzy_decay_ms:-1"
export KMP_AFFINITY=granularity=fine,compact,1,0
export KMP_BLOCKTIME=1
CORES=$(lscpu | grep Core | awk '{print $4}')
export OMP_NUM_THREADS=$CORES
end_core=$(( CORES-1 ))
cores=$(test_inductor_get_core_number)
export OMP_NUM_THREADS=$cores
}
test_inductor_torchbench_cpu_smoketest_perf(){
TEST_REPORTS_DIR=$(pwd)/test/test-reports
mkdir -p "$TEST_REPORTS_DIR"
test_inductor_set_cpu_affinity
end_core=$(( $(test_inductor_get_core_number)-1 ))
MODELS_SPEEDUP_TARGET=benchmarks/dynamo/expected_ci_speedup_inductor_torchbench_cpu.csv
grep -v '^ *#' < "$MODELS_SPEEDUP_TARGET" | while IFS=',' read -r -a model_cfg
do
local model_name=${model_cfg[0]}
local data_type=${model_cfg[1]}
local speedup_target=${model_cfg[4]}
if [[ ${model_cfg[3]} == "cpp" ]]; then
local data_type=${model_cfg[2]}
local speedup_target=${model_cfg[5]}
local backend=${model_cfg[1]}
if [[ ${model_cfg[4]} == "cpp" ]]; then
export TORCHINDUCTOR_CPP_WRAPPER=1
else
unset TORCHINDUCTOR_CPP_WRAPPER
fi
local output_name="$TEST_REPORTS_DIR/inductor_inference_${model_cfg[0]}_${model_cfg[1]}_${model_cfg[2]}_${model_cfg[3]}_cpu_smoketest.csv"
if [[ ${model_cfg[2]} == "dynamic" ]]; then
if [[ ${model_cfg[3]} == "dynamic" ]]; then
taskset -c 0-"$end_core" python benchmarks/dynamo/torchbench.py \
--inference --performance --"$data_type" -dcpu -n50 --only "$model_name" --dynamic-shapes \
--dynamic-batch-only --freezing --timeout 9000 --backend=inductor --output "$output_name"
--dynamic-batch-only --freezing --timeout 9000 --"$backend" --output "$output_name"
else
taskset -c 0-"$end_core" python benchmarks/dynamo/torchbench.py \
--inference --performance --"$data_type" -dcpu -n50 --only "$model_name" \
--freezing --timeout 9000 --backend=inductor --output "$output_name"
--freezing --timeout 9000 --"$backend" --output "$output_name"
fi
cat "$output_name"
# The threshold value needs to be actively maintained to make this check useful.
@ -1268,7 +1286,7 @@ elif [[ "${TEST_CONFIG}" == *timm* ]]; then
id=$((SHARD_NUMBER-1))
test_dynamo_benchmark timm_models "$id"
elif [[ "${TEST_CONFIG}" == *torchbench* ]]; then
if [[ "${TEST_CONFIG}" == *cpu_inductor* || "${TEST_CONFIG}" == *cpu_aot_inductor* ]]; then
if [[ "${TEST_CONFIG}" == *cpu* ]]; then
install_torchaudio cpu
else
install_torchaudio cuda
@ -1285,7 +1303,7 @@ elif [[ "${TEST_CONFIG}" == *torchbench* ]]; then
elif [[ "${TEST_CONFIG}" == *inductor_torchbench_cpu_smoketest_perf* ]]; then
checkout_install_torchbench timm_vision_transformer phlippe_densenet basic_gnn_gcn \
llama_v2_7b_16h resnet50 timm_efficientnet mobilenet_v3_large timm_resnest \
shufflenet_v2_x1_0 hf_GPT2
shufflenet_v2_x1_0 hf_GPT2 yolov3 mobilenet_v2 resnext50_32x4d hf_T5_base
PYTHONPATH=$(pwd)/torchbench test_inductor_torchbench_cpu_smoketest_perf
elif [[ "${TEST_CONFIG}" == *torchbench_gcp_smoketest* ]]; then
checkout_install_torchbench
@ -1294,7 +1312,7 @@ elif [[ "${TEST_CONFIG}" == *torchbench* ]]; then
checkout_install_torchbench
# Do this after checkout_install_torchbench to ensure we clobber any
# nightlies that torchbench may pull in
if [[ "${TEST_CONFIG}" != *cpu_inductor* && "${TEST_CONFIG}" != *cpu_aot_inductor* ]]; then
if [[ "${TEST_CONFIG}" != *cpu* ]]; then
install_torchrec_and_fbgemm
fi
PYTHONPATH=$(pwd)/torchbench test_dynamo_benchmark torchbench "$id"


@ -4,6 +4,7 @@ import os
import subprocess
import sys
COMMON_TESTS = [
(
"Checking that torch is available",


@ -5,6 +5,7 @@ import sys
import yaml
# Need to import modules that lie on an upward-relative path
sys.path.append(os.path.join(sys.path[0], ".."))


@ -46,14 +46,12 @@ if [[ "\$python_nodot" = *310* ]]; then
PROTOBUF_PACKAGE="protobuf>=3.19.0"
fi
if [[ "\$python_nodot" = *39* ]]; then
if [[ "\$python_nodot" = *39* ]]; then
# There's an issue with conda channel priority where it'll randomly pick 1.19 over 1.20
# we set a lower boundary here just to be safe
NUMPY_PIN=">=1.20"
fi
# Move debug wheels out of the package dir so they don't get installed
mkdir -p /tmp/debug_final_pkgs
mv /final_pkgs/debug-*.zip /tmp/debug_final_pkgs || echo "no debug packages to move"
@ -83,7 +81,7 @@ if [[ "$PACKAGE_TYPE" == conda ]]; then
"numpy\${NUMPY_PIN}" \
mkl>=2018 \
ninja \
sympy \
sympy>=1.12 \
typing-extensions \
${PROTOBUF_PACKAGE}
if [[ "$DESIRED_CUDA" == 'cpu' ]]; then
@ -118,9 +116,18 @@ if [[ "$PACKAGE_TYPE" == libtorch ]]; then
cd /tmp/libtorch
fi
if [[ "$GPU_ARCH_TYPE" == xpu ]]; then
# Workaround for __mkl_tmp_MOD unbound variable issue, refer https://github.com/pytorch/pytorch/issues/130543
set +u
source /opt/intel/oneapi/pytorch-gpu-dev-0.5/oneapi-vars.sh
fi
# Test the package
/builder/check_binary.sh
# Clean temp files
cd /builder && git clean -ffdx
# =================== The above code will be executed inside Docker container ===================
EOL
echo


@ -100,6 +100,20 @@ if [[ "$PACKAGE_TYPE" =~ .*wheel.* && -n "$PYTORCH_BUILD_VERSION" && "$PYTORCH_B
fi
fi
# Set triton via PYTORCH_EXTRA_INSTALL_REQUIREMENTS for triton xpu package
if [[ "$PACKAGE_TYPE" =~ .*wheel.* && -n "$PYTORCH_BUILD_VERSION" && "$PYTORCH_BUILD_VERSION" =~ .*xpu.* && $(uname) == "Linux" ]]; then
TRITON_REQUIREMENT="pytorch-triton-xpu==${TRITON_VERSION}"
if [[ -n "$PYTORCH_BUILD_VERSION" && "$PYTORCH_BUILD_VERSION" =~ .*dev.* ]]; then
TRITON_SHORTHASH=$(cut -c1-10 $PYTORCH_ROOT/.ci/docker/ci_commit_pins/triton-xpu.txt)
TRITON_REQUIREMENT="pytorch-triton-xpu==${TRITON_VERSION}+${TRITON_SHORTHASH}"
fi
if [[ -z "${PYTORCH_EXTRA_INSTALL_REQUIREMENTS:-}" ]]; then
export PYTORCH_EXTRA_INSTALL_REQUIREMENTS="${TRITON_REQUIREMENT}"
else
export PYTORCH_EXTRA_INSTALL_REQUIREMENTS="${PYTORCH_EXTRA_INSTALL_REQUIREMENTS} | ${TRITON_REQUIREMENT}"
fi
fi
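# Hypothetical example (version and hash are placeholders, not real pins): for a
# dev xpu wheel with TRITON_VERSION=3.0.0 and a triton-xpu pin starting with
# "1b2f15840e", the block above yields
#   TRITON_REQUIREMENT="pytorch-triton-xpu==3.0.0+1b2f15840e"
# which is then appended to PYTORCH_EXTRA_INSTALL_REQUIREMENTS (separated by '|').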
cat >"$envfile" <<EOL
# =================== The following code will be executed inside Docker container ===================
export TZ=UTC


@ -29,6 +29,11 @@ if [[ "${USE_SPLIT_BUILD:-false}" == "true" ]]; then
UPLOAD_SUBFOLDER="${UPLOAD_SUBFOLDER}_pypi_pkg"
fi
# this is special build with all dependencies packaged
if [[ ${BUILD_NAME} == *-full* ]]; then
UPLOAD_SUBFOLDER="${UPLOAD_SUBFOLDER}_full"
fi
# Sleep 2 minutes between retries for conda upload
retry () {
"$@" || (sleep 5m && "$@") || (sleep 5m && "$@") || (sleep 5m && "$@") || (sleep 5m && "$@")


@ -8,6 +8,7 @@ import time
import requests
AZURE_PIPELINE_BASE_URL = "https://aiinfra.visualstudio.com/PyTorch/"
AZURE_DEVOPS_PAT_BASE64 = os.environ.get("AZURE_DEVOPS_PAT_BASE64_SECRET", "")
PIPELINE_ID = "911"


@ -5,7 +5,7 @@ git submodule sync
git submodule update --init --recursive
# This takes some time
make setup_lint
make setup-lint
# Add CMAKE_PREFIX_PATH to bashrc
echo 'export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}' >> ~/.bashrc


@ -2,7 +2,7 @@
# NOTE: **Mirror any changes** to this file the [tool.ruff] config in pyproject.toml
# before we can fully move to use ruff
enable-extensions = G
select = B,C,E,F,G,P,SIM1,T4,W,B9,TOR0,TOR1,TOR2,TOR9
select = B,C,E,F,G,P,SIM1,SIM911,T4,W,B9,TOR0,TOR1,TOR2,TOR9
max-line-length = 120
# C408 ignored because we like the dict keyword argument syntax
# E501 is not flexible enough, we're using B950 instead


@ -40,3 +40,7 @@ e6ec0efaf87703c5f889cfc20b29be455885d58d
a53cda1ddc15336dc1ff0ce1eff2a49cdc5f882e
# 2024-01-02 clangformat: fused adam #116583
9dc68d1aa9e554d09344a10fff69f7b50b2d23a0
# 2024-06-28 enable UFMT in `torch/storage.py`
d80939e5e9337e8078f11489afefec59fd42f93b
# 2024-06-28 enable UFMT in `torch.utils.data`
7cf0b90e49689d45be91aa539fdf54cf2ea8a9a3


@ -9,14 +9,16 @@ self-hosted-runner:
- linux.large
- linux.2xlarge
- linux.4xlarge
- linux.9xlarge.ephemeral
- linux.12xlarge
- linux.12xlarge.ephemeral
- linux.24xlarge
- linux.arm64.2xlarge
- linux.4xlarge.nvidia.gpu
- linux.8xlarge.nvidia.gpu
- linux.16xlarge.nvidia.gpu
- linux.g5.4xlarge.nvidia.gpu
# Organization-wide AWS Linux Runners on Linux Foundation account
# Pytorch/pytorch AWS Linux Runners on Linux Foundation account
- lf.linux.large
- lf.linux.2xlarge
- lf.linux.4xlarge
@ -27,6 +29,28 @@ self-hosted-runner:
- lf.linux.8xlarge.nvidia.gpu
- lf.linux.16xlarge.nvidia.gpu
- lf.linux.g5.4xlarge.nvidia.gpu
# Organization-wide AWS Linux Runners with new Amazon 2023 AMI
- amz2023.linux.large
- amz2023.linux.2xlarge
- amz2023.linux.4xlarge
- amz2023.linux.12xlarge
- amz2023.linux.24xlarge
- amz2023.linux.arm64.2xlarge
- amz2023.linux.4xlarge.nvidia.gpu
- amz2023.linux.8xlarge.nvidia.gpu
- amz2023.linux.16xlarge.nvidia.gpu
- amz2023.linux.g5.4xlarge.nvidia.gpu
# Pytorch/pytorch AWS Linux Runners with the new Amazon 2023 AMI on Linux Foundation account
- amz2023.lf.linux.large
- amz2023.lf.linux.2xlarge
- amz2023.lf.linux.4xlarge
- amz2023.lf.linux.12xlarge
- amz2023.lf.linux.24xlarge
- amz2023.lf.linux.arm64.2xlarge
- amz2023.lf.linux.4xlarge.nvidia.gpu
- amz2023.lf.linux.8xlarge.nvidia.gpu
- amz2023.lf.linux.16xlarge.nvidia.gpu
- amz2023.lf.linux.g5.4xlarge.nvidia.gpu
# Repo-specific IBM hosted S390x runner
- linux.s390x
# Organization wide AWS Windows runners
@ -47,3 +71,5 @@ self-hosted-runner:
- macos-latest-xlarge
- macos-13-xlarge
- macos-14-xlarge
# Organization-wide Intel hosted XPU runners
- linux.idc.xpu


@ -36,7 +36,8 @@ runs:
"${DOCKER_IMAGE}"
)
if [[ "${GPU_ARCH_TYPE}" != "rocm" && "${BUILD_ENVIRONMENT}" != "linux-aarch64-binary-manywheel" && "${BUILD_ENVIRONMENT}" != "linux-s390x-binary-manywheel" ]]; then
echo "CONTAINER_NAME=${container_name}" >> "$GITHUB_ENV"
if [[ "${GPU_ARCH_TYPE}" != "rocm" && "${BUILD_ENVIRONMENT}" != "linux-aarch64-binary-manywheel" && "${BUILD_ENVIRONMENT}" != "linux-s390x-binary-manywheel" && "${GPU_ARCH_TYPE}" != "xpu" ]]; then
# Propagate download.pytorch.org IP to container. This is only needed on non-aarch64 Linux runners
grep download.pytorch.org /etc/hosts | docker exec -i "${container_name}" bash -c "/bin/cat >> /etc/hosts"
fi
@ -47,10 +48,9 @@ runs:
docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash -x /run.sh"
- name: Cleanup docker
if: always() && env.BUILD_ENVIRONMENT == 'linux-s390x-binary-manywheel'
if: always() && (env.BUILD_ENVIRONMENT == 'linux-s390x-binary-manywheel' || env.GPU_ARCH_TYPE == 'xpu')
shell: bash
run: |
# on s390x stop the container for clean worker stop
# ignore expansion of "docker ps -q" since it could be empty
# on s390x or xpu stop the container for clean worker stop
# shellcheck disable=SC2046
docker stop $(docker ps -q) || true
docker stop "${{ env.CONTAINER_NAME }}" || true


@ -1 +1 @@
b829e936f7cc61b48149f5f957a451a38bf2a178
69b2a0adc2ec03ab99990d7e8be3d4510438c148


@ -1 +1 @@
6f0b61e5d782913a0fc7743812f2a8e522189111
5ea4535f0699f366adb554183a65ebf7dc34a8be


@ -152,3 +152,130 @@ runner_types:
is_ephemeral: false
max_available: 250
os: windows
### Setup runner types to test the Amazon Linux 2023 AMI
lf.c.amz2023.linux.12xlarge:
disk_size: 200
instance_type: c5.12xlarge
is_ephemeral: false
max_available: 1000
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.amz2023.linux.24xl.spr-metal:
disk_size: 200
instance_type: c7i.metal-24xl
is_ephemeral: false
max_available: 30
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.amz2023.linux.16xlarge.spr:
disk_size: 200
instance_type: c7i.16xlarge
is_ephemeral: false
max_available: 30
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.amz2023.linux.12xlarge.ephemeral:
disk_size: 200
instance_type: c5.12xlarge
is_ephemeral: true
max_available: 300
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.amz2023.linux.16xlarge.nvidia.gpu:
disk_size: 150
instance_type: g3.16xlarge
is_ephemeral: false
max_available: 30
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.amz2023.linux.24xlarge:
disk_size: 150
instance_type: c5.24xlarge
is_ephemeral: false
max_available: 250
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.amz2023.linux.2xlarge:
disk_size: 150
instance_type: c5.2xlarge
is_ephemeral: false
max_available: 3120
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.amz2023.linux.4xlarge:
disk_size: 150
instance_type: c5.4xlarge
is_ephemeral: false
max_available: 1000
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.amz2023.linux.4xlarge.nvidia.gpu:
disk_size: 150
instance_type: g3.4xlarge
is_ephemeral: false
max_available: 520
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.amz2023.linux.8xlarge.nvidia.gpu:
disk_size: 150
instance_type: g3.8xlarge
is_ephemeral: false
max_available: 400
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.amz2023.linux.g4dn.12xlarge.nvidia.gpu:
disk_size: 150
instance_type: g4dn.12xlarge
is_ephemeral: false
max_available: 50
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.amz2023.linux.g4dn.metal.nvidia.gpu:
disk_size: 150
instance_type: g4dn.metal
is_ephemeral: false
max_available: 30
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.amz2023.linux.g5.48xlarge.nvidia.gpu:
disk_size: 150
instance_type: g5.48xlarge
is_ephemeral: false
max_available: 20
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.amz2023.linux.g5.12xlarge.nvidia.gpu:
disk_size: 150
instance_type: g5.12xlarge
is_ephemeral: false
max_available: 150
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.amz2023.linux.g5.4xlarge.nvidia.gpu:
disk_size: 150
instance_type: g5.4xlarge
is_ephemeral: false
max_available: 1200
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.amz2023.linux.large:
disk_size: 15
instance_type: c5.large
is_ephemeral: false
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.amz2023.linux.arm64.2xlarge:
disk_size: 256
instance_type: t4g.2xlarge
is_ephemeral: false
max_available: 200
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.amz2023.linux.arm64.m7g.2xlarge:
disk_size: 256
instance_type: m7g.2xlarge
is_ephemeral: false
max_available: 20
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64


@ -152,3 +152,130 @@ runner_types:
is_ephemeral: false
max_available: 250
os: windows
### Setup runner types to test the Amazon Linux 2023 AMI
lf.amz2023.linux.12xlarge:
disk_size: 200
instance_type: c5.12xlarge
is_ephemeral: false
max_available: 1000
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.amz2023.linux.24xl.spr-metal:
disk_size: 200
instance_type: c7i.metal-24xl
is_ephemeral: false
max_available: 30
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.amz2023.linux.16xlarge.spr:
disk_size: 200
instance_type: c7i.16xlarge
is_ephemeral: false
max_available: 30
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.amz2023.linux.12xlarge.ephemeral:
disk_size: 200
instance_type: c5.12xlarge
is_ephemeral: true
max_available: 300
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.amz2023.linux.16xlarge.nvidia.gpu:
disk_size: 150
instance_type: g3.16xlarge
is_ephemeral: false
max_available: 30
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.amz2023.linux.24xlarge:
disk_size: 150
instance_type: c5.24xlarge
is_ephemeral: false
max_available: 250
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.amz2023.linux.2xlarge:
disk_size: 150
instance_type: c5.2xlarge
is_ephemeral: false
max_available: 3120
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.amz2023.linux.4xlarge:
disk_size: 150
instance_type: c5.4xlarge
is_ephemeral: false
max_available: 1000
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.amz2023.linux.4xlarge.nvidia.gpu:
disk_size: 150
instance_type: g3.4xlarge
is_ephemeral: false
max_available: 520
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.amz2023.linux.8xlarge.nvidia.gpu:
disk_size: 150
instance_type: g3.8xlarge
is_ephemeral: false
max_available: 400
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.amz2023.linux.g4dn.12xlarge.nvidia.gpu:
disk_size: 150
instance_type: g4dn.12xlarge
is_ephemeral: false
max_available: 50
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.amz2023.linux.g4dn.metal.nvidia.gpu:
disk_size: 150
instance_type: g4dn.metal
is_ephemeral: false
max_available: 30
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.amz2023.linux.g5.48xlarge.nvidia.gpu:
disk_size: 150
instance_type: g5.48xlarge
is_ephemeral: false
max_available: 20
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.amz2023.linux.g5.12xlarge.nvidia.gpu:
disk_size: 150
instance_type: g5.12xlarge
is_ephemeral: false
max_available: 150
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.amz2023.linux.g5.4xlarge.nvidia.gpu:
disk_size: 150
instance_type: g5.4xlarge
is_ephemeral: false
max_available: 1200
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.amz2023.linux.large:
disk_size: 15
instance_type: c5.large
is_ephemeral: false
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.amz2023.linux.arm64.2xlarge:
disk_size: 256
instance_type: t4g.2xlarge
is_ephemeral: false
max_available: 200
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.amz2023.linux.arm64.m7g.2xlarge:
disk_size: 256
instance_type: m7g.2xlarge
is_ephemeral: false
max_available: 20
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64


@ -286,6 +286,7 @@
- test/cpp/dist_autograd/**
- test/cpp/rpc/**
approved_by:
- wconstab
- mrshenli
- pritamdamania87
- zhaojuanmao
@ -312,6 +313,25 @@
- Lint
- pull
- name: DCP
patterns:
- torch/distributed/checkpoint/**
approved_by:
- LucasLLC
- fegin
- wz337
- saumishr
- daulet-askarov
- pradeepdfb
- kirtiteja
- mhorowitz
- saiteja64
mandatory_checks_name:
- EasyCLA
- Lint
- pull
- name: IDEEP
patterns:
- third_party/ideep
@ -387,7 +407,7 @@
- torch/_inductor/codegen/cpp_template.py
- torch/_inductor/codegen/cpp_gemm_template.py
- test/inductor/test_mkldnn_pattern_matcher.py
- test/inductor/test_cpu_repo.py
- test/inductor/test_cpu_repro.py
- test/inductor/test_cpu_cpp_wrapper.py
- test/inductor/test_cpu_select_algorithm.py
- aten/src/ATen/cpu/**


@ -6,6 +6,7 @@ ciflow_push_tags:
- ciflow/binaries_libtorch
- ciflow/binaries_wheel
- ciflow/inductor
- ciflow/inductor-rocm
- ciflow/inductor-perf-compare
- ciflow/inductor-micro-benchmark
- ciflow/inductor-cu124


@ -1,4 +1,4 @@
# iOS simulator requirements
coremltools==5.0b5
protobuf==3.20.2
optree==0.11.0
optree==0.12.1


@ -17,16 +17,16 @@ pytest-xdist==3.3.1
pytest-rerunfailures==10.3
pytest-flakefinder==1.1.0
scipy==1.10.1
sympy==1.11.1
sympy==1.12.1 ; python_version == "3.8"
sympy>=1.13.0 ; python_version >= "3.9"
unittest-xml-reporting<=3.2.0,>=2.0.0
xdoctest==1.1.0
filelock==3.6.0
sympy==1.11.1
pytest-cpp==2.3.0
rockset==1.0.3
z3-solver==4.12.2.0
tensorboard==2.13.0
optree==0.11.0
optree==0.12.1
# NB: test_hparams_* from test_tensorboard is failing with protobuf 5.26.0 in
# which the stringify metadata is wrong when escaping double quote
protobuf==3.20.2


@ -1,4 +1,5 @@
#!/usr/bin/env python3
import os
import shutil
import sys
@ -7,6 +8,7 @@ from subprocess import check_call
from tempfile import TemporaryDirectory
from typing import Optional
SCRIPT_DIR = Path(__file__).parent
REPO_DIR = SCRIPT_DIR.parent.parent


@ -5,7 +5,6 @@ import sys
from typing import Any
from github_utils import gh_delete_comment, gh_post_pr_comment
from gitutils import get_git_remote_name, get_git_repo_dir, GitRepo
from label_utils import has_required_labels, is_label_err_comment, LABEL_ERR_MSG
from trymerge import GitHubPR


@ -4,11 +4,9 @@ import json
import os
import re
from typing import Any, cast, Dict, List, Optional
from urllib.error import HTTPError
from github_utils import gh_fetch_url, gh_post_pr_comment, gh_query_issues_by_labels
from gitutils import get_git_remote_name, get_git_repo_dir, GitRepo
from trymerge import get_pr_commit_sha, GitHubPR


@ -10,6 +10,7 @@ import requests
import rockset # type: ignore[import]
from gitutils import retries_decorator
LOGS_QUERY = """
with
shas as (


@ -1,10 +1,12 @@
#!/usr/bin/env python3
import sys
from pathlib import Path
from typing import Any, cast, Dict, List, Set
import yaml
GITHUB_DIR = Path(__file__).parent.parent


@ -1,7 +1,6 @@
import json
import subprocess
import sys
from enum import Enum
from pathlib import Path
from typing import NamedTuple, Optional


@ -9,6 +9,7 @@ from typing import Any, Callable, Dict, List, Set
from github_utils import gh_fetch_json_dict, gh_graphql
from gitutils import GitRepo
SEC_IN_DAY = 24 * 60 * 60
CLOSED_PR_RETENTION = 30 * SEC_IN_DAY
NO_PR_RETENTION = 1.5 * 365 * SEC_IN_DAY


@ -1,7 +1,6 @@
#!/usr/bin/env python3
import sys
from pathlib import Path
import yaml


@ -14,7 +14,6 @@ import json
from typing import Any
import boto3 # type: ignore[import]
from label_utils import gh_get_labels


@ -15,6 +15,7 @@ from urllib.request import Request, urlopen
import yaml
REENABLE_TEST_REGEX = "(?i)(Close(d|s)?|Resolve(d|s)?|Fix(ed|es)?) (#|https://github.com/pytorch/pytorch/issues/)([0-9]+)"
PREFIX = "test-config/"


@ -8,11 +8,13 @@ architectures:
* CPU
* Latest CUDA
* Latest ROCM
* Latest XPU
"""
import os
from typing import Dict, List, Optional, Tuple
CUDA_ARCHES = ["11.8", "12.1", "12.4"]
@ -24,6 +26,7 @@ CUDA_ARCHES_CUDNN_VERSION = {"11.8": "9", "12.1": "9", "12.4": "9"}
ROCM_ARCHES = ["6.0", "6.1"]
XPU_ARCHES = ["xpu"]
CPU_CXX11_ABI_ARCH = ["cpu-cxx11-abi"]
@ -132,6 +135,8 @@ def arch_type(arch_version: str) -> str:
return "cuda"
elif arch_version in ROCM_ARCHES:
return "rocm"
elif arch_version in XPU_ARCHES:
return "xpu"
elif arch_version in CPU_CXX11_ABI_ARCH:
return "cpu-cxx11-abi"
elif arch_version in CPU_AARCH64_ARCH:
@ -156,6 +161,7 @@ WHEEL_CONTAINER_IMAGES = {
gpu_arch: f"pytorch/manylinux-builder:rocm{gpu_arch}-{DEFAULT_TAG}"
for gpu_arch in ROCM_ARCHES
},
"xpu": f"pytorch/manylinux2_28-builder:xpu-{DEFAULT_TAG}",
"cpu": f"pytorch/manylinux-builder:cpu-{DEFAULT_TAG}",
"cpu-cxx11-abi": f"pytorch/manylinuxcxx11-abi-builder:cpu-cxx11-abi-{DEFAULT_TAG}",
"cpu-aarch64": f"pytorch/manylinuxaarch64-builder:cpu-aarch64-{DEFAULT_TAG}",
@ -221,6 +227,7 @@ def translate_desired_cuda(gpu_arch_type: str, gpu_arch_version: str) -> str:
"cuda": f"cu{gpu_arch_version.replace('.', '')}",
"cuda-aarch64": "cu124",
"rocm": f"rocm{gpu_arch_version}",
"xpu": "xpu",
}.get(gpu_arch_type, gpu_arch_version)
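# For example (illustrative): ("cuda", "12.1") -> "cu121", ("rocm", "6.1") -> "rocm6.1",
# ("xpu", "xpu") -> "xpu"; an unknown gpu_arch_type falls back to gpu_arch_version unchanged.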
@ -325,13 +332,13 @@ def generate_wheels_matrix(
package_type = "manywheel"
if python_versions is None:
python_versions = FULL_PYTHON_VERSIONS
python_versions = FULL_PYTHON_VERSIONS + ["3.13"]
if arches is None:
# Define default compute architectures
arches = ["cpu"]
if os == "linux":
arches += CPU_CXX11_ABI_ARCH + CUDA_ARCHES + ROCM_ARCHES
arches += CPU_CXX11_ABI_ARCH + CUDA_ARCHES + ROCM_ARCHES + XPU_ARCHES
elif os == "windows":
arches += CUDA_ARCHES
elif os == "linux-aarch64":
@ -354,9 +361,16 @@ def generate_wheels_matrix(
or arch_version == "cpu-aarch64"
or arch_version == "cpu-s390x"
or arch_version == "cuda-aarch64"
or arch_version == "xpu"
else arch_version
)
# TODO: Enable python 3.13 on rocm, xpu, aarch64, windows
if (
gpu_arch_type in ["rocm", "xpu"] or os != "linux"
) and python_version == "3.13":
continue
# 12.1 linux wheels require PYTORCH_EXTRA_INSTALL_REQUIREMENTS to install
if (
arch_version in ["12.4", "12.1", "11.8"]
@ -396,9 +410,7 @@ def generate_wheels_matrix(
gpu_arch_type, gpu_arch_version
),
"use_split_build": "True",
"devtoolset": (
"cxx11-abi" if arch_version == "cuda-aarch64" else ""
),
"devtoolset": "",
"container_image": WHEEL_CONTAINER_IMAGES[arch_version],
"package_type": package_type,
"pytorch_extra_install_requirements": (
@ -411,6 +423,26 @@ def generate_wheels_matrix(
),
}
)
# Special build for use on Colab: Python 3.10 with CUDA 12.1
if python_version == "3.10" and arch_version == "12.1":
ret.append(
{
"python_version": python_version,
"gpu_arch_type": gpu_arch_type,
"gpu_arch_version": gpu_arch_version,
"desired_cuda": translate_desired_cuda(
gpu_arch_type, gpu_arch_version
),
"use_split_build": "False",
"devtoolset": "",
"container_image": WHEEL_CONTAINER_IMAGES[arch_version],
"package_type": package_type,
"pytorch_extra_install_requirements": "",
"build_name": f"{package_type}-py{python_version}-{gpu_arch_type}{gpu_arch_version}-full".replace( # noqa: B950
".", "_"
),
}
)
else:
ret.append(
{
@ -421,7 +453,9 @@ def generate_wheels_matrix(
gpu_arch_type, gpu_arch_version
),
"devtoolset": (
"cxx11-abi" if arch_version == "cpu-cxx11-abi" else ""
"cxx11-abi"
if arch_version in ["cpu-cxx11-abi", "xpu"]
else ""
),
"container_image": WHEEL_CONTAINER_IMAGES[arch_version],
"package_type": package_type,


@ -8,9 +8,9 @@ from typing import Dict, Iterable, List, Literal, Set
from typing_extensions import TypedDict # Python 3.11+
import generate_binary_build_matrix # type: ignore[import]
import jinja2
Arch = Literal["windows", "linux", "macos"]
GITHUB_DIR = Path(__file__).resolve().parent.parent


@ -16,6 +16,7 @@ from typing import Dict, List
import generate_binary_build_matrix
DOCKER_IMAGE_TYPES = ["runtime", "devel"]


@ -4,11 +4,11 @@ import argparse
import os
import re
import subprocess
from datetime import datetime
from distutils.util import strtobool
from pathlib import Path
LEADING_V_PATTERN = re.compile("^v")
TRAILING_RC_PATTERN = re.compile("-rc[0-9]*$")
LEGACY_BASE_VERSION_SUFFIX_PATTERN = re.compile("a0$")


@ -11,7 +11,6 @@ import sys
import time
import urllib
import urllib.parse
from typing import Any, Callable, Dict, List, Optional, Tuple
from urllib.request import Request, urlopen


@ -3,7 +3,6 @@
import json
import os
import warnings
from dataclasses import dataclass
from typing import Any, Callable, cast, Dict, List, Optional, Tuple, Union
from urllib.error import HTTPError


@ -19,6 +19,7 @@ from typing import (
Union,
)
T = TypeVar("T")
RE_GITHUB_URL_MATCH = re.compile("^https://.*@?github.com/(.+)/(.+)$")


@ -1,12 +1,12 @@
"""GitHub Label Utilities."""
import json
from functools import lru_cache
from typing import Any, List, Tuple, TYPE_CHECKING, Union
from github_utils import gh_fetch_url_and_headers, GitHubComment
# TODO: this is a temp workaround to avoid circular dependencies,
# and should be removed once GitHubPR is refactored out of trymerge script.
if TYPE_CHECKING:


@ -9,6 +9,7 @@ from pytest_caching_utils import (
upload_pytest_cache,
)
TEMP_DIR = "./tmp" # a backup location in case one isn't provided


@ -14,6 +14,7 @@ from file_io_utils import (
zip_folder,
)
PYTEST_CACHE_KEY_PREFIX = "pytest_cache"
PYTEST_CACHE_DIR_NAME = ".pytest_cache"
BUCKET = "gha-artifacts"


@ -12,6 +12,7 @@ from github.Issue import Issue
WORKFLOW_LABEL_META = "" # use meta runners
WORKFLOW_LABEL_LF = "lf." # use runners from the linux foundation
WORKFLOW_LABEL_LF_CANARY = "lf.c." # use canary runners from the linux foundation
GITHUB_OUTPUT = os.getenv("GITHUB_OUTPUT", "")
GH_OUTPUT_KEY_LABEL_TYPE = "label-type"
@ -203,6 +204,10 @@ def main() -> None:
)
label_type = WORKFLOW_LABEL_META
# For Canary builds use canary runners
if args.github_repo == "pytorch/pytorch-canary" and label_type == WORKFLOW_LABEL_LF:
label_type = WORKFLOW_LABEL_LF_CANARY
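# i.e. (illustrative) a job on pytorch/pytorch-canary that would have used an "lf."
# runner is redirected to the matching "lf.c." canary runner instead.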
set_github_output(GH_OUTPUT_KEY_LABEL_TYPE, label_type)


@ -2,7 +2,7 @@
set -eoux pipefail
SYNC_BRANCH=fbcode/pytorch-stable-prototype
SYNC_BRANCH=pytorch-stable-prototype
git config user.email "fake@example.com"
git config user.name "PyTorch Stable Bot"
@ -11,7 +11,9 @@ git fetch origin main
git fetch origin "$SYNC_BRANCH"
git checkout "$SYNC_BRANCH"
for SHA in $(git log 4333e122d4b74cdf84351ed2907045c6a767b4cd..origin/main --pretty="%h" --reverse -- torch/distributed torch/csrc/distributed test/distributed test/cpp/c10d benchmarks/distributed)
# Using a hardcoded SHA here is a massive speedup as we can skip the entire history of the pytorch GitHub repo.
# This specific SHA was chosen as it was before the "branch point" of the stable branch
for SHA in $(git log ba3b05fdf37ddbc3c301294d6a560a816335e717..origin/main --pretty="%h" -- torch/distributed torch/csrc/distributed test/distributed test/cpp/c10d benchmarks/distributed)
do
# `git merge-base --is-ancestor` exits with code 0 if the given SHA is an ancestor, and non-0 otherwise
if git merge-base --is-ancestor $SHA HEAD || [[ $(git log --grep="(cherry picked from commit $SHA") ]]
@ -20,7 +22,12 @@ do
continue
fi
echo "Copying $SHA"
git cherry-pick -x "$SHA"
git cherry-pick -x "$SHA" -X theirs
git reset --soft HEAD~1
git add torch/distributed torch/csrc/distributed test/distributed test/cpp/c10d benchmarks/distributed
git checkout .
git commit --reuse-message=HEAD@{1}
git clean -f
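# (Summary of the sequence above, for context) cherry-pick with "-X theirs", then
# soft-reset and re-stage only the distributed/ paths so any non-distributed changes
# from the original commit are dropped, and commit reusing the original message.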
done
if [[ "${WITH_PUSH}" == true ]]; then


@ -41,7 +41,7 @@ def main() -> None:
)
options = parser.parse_args()
tagged_images: Dict[str, bool] = dict()
tagged_images: Dict[str, bool] = {}
platform_images = [
generate_binary_build_matrix.WHEEL_CONTAINER_IMAGES,
generate_binary_build_matrix.LIBTORCH_CONTAINER_IMAGES,
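The one-line change above swaps `dict()` for the literal `{}`; both produce an empty dictionary, the literal is simply the idiomatic spelling and skips the global name lookup:

```python
from typing import Dict

# Equivalent results; `{}` avoids the `dict` name lookup at runtime.
tagged_images_old: Dict[str, bool] = dict()
tagged_images_new: Dict[str, bool] = {}
assert tagged_images_old == tagged_images_new
```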

View File

@@ -7,6 +7,7 @@ cd llm-target-determinator
pip install -q -r requirements.txt
cd ../codellama
pip install -e .
pip install numpy==1.26.0
# Run indexer
cd ../llm-target-determinator

View File

@@ -17,9 +17,7 @@ from unittest import main, mock, skip, TestCase
from urllib.error import HTTPError
from github_utils import gh_graphql
from gitutils import get_git_remote_name, get_git_repo_dir, GitRepo
from trymerge import (
categorize_checks,
DRCI_CHECKRUN_NAME,
@@ -39,6 +37,7 @@ from trymerge import (
validate_revert,
)
if "GIT_REMOTE_URL" not in os.environ:
os.environ["GIT_REMOTE_URL"] = "https://github.com/pytorch/pytorch"
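The guarded assignment above only fills in GIT_REMOTE_URL when it is missing from the environment; `os.environ.setdefault` is an equivalent, more compact form of the same default:

```python
import os

# Equivalent to the guarded assignment above: only sets the variable
# when it is not already present in the environment.
os.environ.setdefault("GIT_REMOTE_URL", "https://github.com/pytorch/pytorch")
```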

View File

@@ -45,7 +45,6 @@ from github_utils import (
gh_update_pr_state,
GitHubComment,
)
from gitutils import (
are_ghstack_branches_in_sync,
get_git_remote_name,
@@ -62,6 +61,7 @@ from label_utils import (
)
from trymerge_explainer import get_revert_message, TryMergeExplainer
# labels
MERGE_IN_PROGRESS_LABEL = "merging"
MERGE_COMPLETE_LABEL = "merged"
@@ -1459,7 +1459,7 @@ def find_matching_merge_rule(
if not skip_internal_checks and pr.has_internal_changes():
raise RuntimeError(
"This PR has internal changes and must be landed via Phabricator"
"This PR has internal changes and must be landed via Phabricator! Please try reimporting/rexporting the PR!"
)
# Categorize all checks when skip_mandatory_checks (force merge) is set. Do it here

View File

@@ -11,6 +11,7 @@ from github_utils import gh_post_pr_comment as gh_post_comment
from gitutils import get_git_remote_name, get_git_repo_dir, GitRepo
from trymerge import GitHubPR
SAME_SHA_ERROR = (
"\n```\nAborting rebase because rebasing the branch resulted in the same sha as the target branch.\n"
+ "This usually happens because the PR has already been merged. Please rebase locally and push.\n```"

View File

@@ -81,7 +81,7 @@ jobs:
!{{ config["build_name"] }}-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs: !{{ config["build_name"] }}-build
{%- if config["gpu_arch_type"] != "rocm" %}
{%- if config["gpu_arch_type"] not in ["rocm", "xpu"] %}
uses: ./.github/workflows/_binary-test-linux.yml
with:!{{ upload.binary_env_as_input(config) }}
build_name: !{{ config["build_name"] }}
@@ -101,6 +101,40 @@ jobs:
{%- endif %}
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
{%- elif config["gpu_arch_type"] == "xpu" %}
runs-on: linux.idc.xpu
timeout-minutes: !{{ common.timeout_minutes }}
!{{ upload.binary_env(config) }}
permissions:
id-token: write
contents: read
steps:
- name: Setup XPU
uses: ./.github/actions/setup-xpu
- name: configure aws credentials
id: aws_creds
uses: aws-actions/configure-aws-credentials@v1.7.0
with:
role-to-assume: arn:aws:iam::308535385114:role/gha_workflow_s3_and_ecr_read_only
aws-region: us-east-1
- name: Login to Amazon ECR
id: login-ecr
uses: aws-actions/amazon-ecr-login@v2
- uses: !{{ common.download_artifact_action }}
name: Download Build Artifacts
with:
name: !{{ config["build_name"] }}
path: "${{ runner.temp }}/artifacts/"
!{{ common.checkout(deep_clone=False, directory="pytorch") }}
!{{ common.checkout(deep_clone=False, directory="builder", repository=common.builder_repo, branch=common.builder_branch) }}
- name: Pull Docker image
uses: pytorch/test-infra/.github/actions/pull-docker-image@main
with:
docker-image: !{{ config["container_image"] }}
- name: Test Pytorch binary
uses: ./pytorch/.github/actions/test-pytorch-binary
- name: Teardown XPU
uses: ./.github/actions/teardown-xpu
{%- else %}
runs-on: linux.rocm.gpu
timeout-minutes: !{{ common.timeout_minutes }}

View File

@@ -1,105 +0,0 @@
name: linux-build-rg
on:
workflow_call:
inputs:
build-environment:
required: true
type: string
description: Top-level label for what's being built/tested.
docker-image-name:
required: true
type: string
description: Name of the base docker image to build with.
build-generates-artifacts:
required: false
type: boolean
default: true
description: If set, upload generated build artifacts.
build-with-debug:
required: false
type: boolean
default: false
description: If set, build in debug mode.
sync-tag:
required: false
type: string
default: ""
description: |
If this is set, our linter will use this to make sure that every other
job with the same `sync-tag` is identical.
cuda-arch-list:
required: false
type: string
default: "5.2"
description: |
List of CUDA architectures CI build should target.
runner-group:
required: false
type: string
default: "arc-lf-linux.2xlarge"
description: Runner group to select group type
test-matrix:
required: false
type: string
description: |
An option JSON description of what test configs to run later on. This
is moved here from the Linux test workflow so that we can apply filter
logic using test-config labels earlier and skip unnecessary builds
s3-bucket:
description: S3 bucket to download artifact
required: false
type: string
default: "gha-artifacts"
aws-role-to-assume:
description: role to assume for downloading artifacts
required: false
type: string
default: ""
secrets:
HUGGING_FACE_HUB_TOKEN:
required: false
description: |
HF Auth token to avoid rate limits when downloading models or datasets from hub
outputs:
docker-image:
value: ${{ jobs.build.outputs.docker-image }}
description: The docker image containing the built PyTorch.
test-matrix:
value: ${{ jobs.build.outputs.test-matrix }}
description: An optional JSON description of what test configs to run later on.
jobs:
build:
# Don't run on forked repos
if: github.repository_owner == 'pytorch'
runs-on:
group: ${{ inputs.runner-group }}
timeout-minutes: 240
outputs:
docker-image: ${{ steps.linux-build.outputs.docker-image }}
test-matrix: ${{ steps.linux-build.outputs.test-matrix }}
steps:
# [pytorch repo ref]
# Use a pytorch/pytorch reference instead of a reference to the local
# checkout because when we run this action we don't *have* a local
# checkout. In other cases you should prefer a local checkout.
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
- name: Linux Build
id: linux-build
uses: ./.github/actions/linux-build
with:
build-environment: ${{ inputs.build-environment }}
docker-image-name: ${{ inputs.docker-image-name }}
build-generates-artifacts: ${{ inputs.build-generates-artifacts }}
build-with-debug: ${{ inputs.build-with-debug }}
sync-tag: ${{ inputs.sync-tag }}
cuda-arch-list: ${{ inputs.cuda-arch-list }}
test-matrix: ${{ inputs.test-matrix }}
s3-bucket: ${{ inputs.s3-bucket }}
aws-role-to-assume: ${{ inputs.aws-role-to-assume }}
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}

View File

@@ -228,7 +228,7 @@ jobs:
- name: Store PyTorch Build Artifacts on S3
uses: seemethere/upload-artifact-s3@v5
if: inputs.build-generates-artifacts && steps.build.outcome != 'skipped' && inputs.use_split_build != 'true'
if: inputs.build-generates-artifacts && steps.build.outcome != 'skipped' && !inputs.use_split_build
with:
name: ${{ inputs.build-environment }}
retention-days: 14
@@ -236,9 +236,9 @@
path: artifacts.zip
s3-bucket: ${{ inputs.s3-bucket }}
- name: Store PyTorch Build Artifacts on S3
- name: Store PyTorch Build Artifacts on S3 for split build
uses: seemethere/upload-artifact-s3@v5
if: inputs.build-generates-artifacts && steps.build.outcome != 'skipped' && inputs.use_split_build == 'true'
if: inputs.build-generates-artifacts && steps.build.outcome != 'skipped' && inputs.use_split_build
with:
name: ${{ inputs.build-environment }}-experimental-split-build
retention-days: 14

View File

@@ -1,85 +0,0 @@
name: linux-test-rg
on:
workflow_call:
inputs:
build-environment:
required: true
type: string
description: Top-level label for what's being built/tested.
test-matrix:
required: true
type: string
description: JSON description of what test configs to run.
docker-image:
required: true
type: string
description: Docker image to run in.
sync-tag:
required: false
type: string
default: ""
description: |
If this is set, our linter will use this to make sure that every other
job with the same `sync-tag` is identical.
timeout-minutes:
required: false
type: number
default: 240
description: |
Set the maximum (in minutes) how long the workflow should take to finish
use-gha:
required: false
type: string
default: ""
description: If set to any value, upload to GHA. Otherwise upload to S3.
dashboard-tag:
required: false
type: string
default: ""
s3-bucket:
description: S3 bucket to download artifact
required: false
type: string
default: "gha-artifacts"
aws-role-to-assume:
description: role to assume for downloading artifacts
required: false
type: string
default: ""
secrets:
HUGGING_FACE_HUB_TOKEN:
required: false
description: |
HF Auth token to avoid rate limits when downloading models or datasets from hub
env:
GIT_DEFAULT_BRANCH: ${{ github.event.repository.default_branch }}
jobs:
test:
# Don't run on forked repos or empty test matrix
if: github.repository_owner == 'pytorch' && toJSON(fromJSON(inputs.test-matrix).include) != '[]'
strategy:
matrix: ${{ fromJSON(inputs.test-matrix) }}
fail-fast: false
runs-on: ${{ matrix.runner }}
timeout-minutes: ${{ matrix.mem_leak_check == 'mem_leak_check' && 600 || inputs.timeout-minutes }}
steps:
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
- name: Linux Test
id: linux-test
uses: ./.github/actions/linux-test
with:
build-environment: ${{ inputs.build-environment }}
test-matrix: ${{ inputs.test-matrix }}
docker-image: ${{ inputs.docker-image }}
sync-tag: ${{ inputs.sync-tag }}
use-gha: ${{ inputs.use-gha }}
dashboard-tag: ${{ inputs.dashboard-tag }}
s3-bucket: ${{ inputs.s3-bucket }}
aws-role-to-assume: ${{ inputs.aws-role-to-assume }}
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

View File

@@ -1,86 +0,0 @@
name: linux-test-label
on:
workflow_call:
inputs:
build-environment:
required: true
type: string
description: Top-level label for what's being built/tested.
test-matrix:
required: true
type: string
description: JSON description of what test configs to run.
docker-image:
required: true
type: string
description: Docker image to run in.
sync-tag:
required: false
type: string
default: ""
description: |
If this is set, our linter will use this to make sure that every other
job with the same `sync-tag` is identical.
timeout-minutes:
required: false
type: number
default: 240
description: |
Set the maximum (in minutes) how long the workflow should take to finish
use-gha:
required: false
type: string
default: ""
description: If set to any value, upload to GHA. Otherwise upload to S3.
dashboard-tag:
required: false
type: string
default: ""
s3-bucket:
description: S3 bucket to download artifact
required: false
type: string
default: "gha-artifacts"
aws-role-to-assume:
description: role to assume for downloading artifacts
required: false
type: string
default: ""
secrets:
HUGGING_FACE_HUB_TOKEN:
required: false
description: |
HF Auth token to avoid rate limits when downloading models or datasets from hub
env:
GIT_DEFAULT_BRANCH: ${{ github.event.repository.default_branch }}
jobs:
test:
# Don't run on forked repos or empty test matrix
if: github.repository_owner == 'pytorch' && toJSON(fromJSON(inputs.test-matrix).include) != '[]'
strategy:
matrix: ${{ fromJSON(inputs.test-matrix) }}
fail-fast: false
runs-on:
group: ${{ matrix.runner }}
timeout-minutes: ${{ matrix.mem_leak_check == 'mem_leak_check' && 600 || inputs.timeout-minutes }}
steps:
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
- name: Linux Test
id: linux-test
uses: ./.github/actions/linux-test
with:
build-environment: ${{ inputs.build-environment }}
test-matrix: ${{ inputs.test-matrix }}
docker-image: ${{ inputs.docker-image }}
sync-tag: ${{ inputs.sync-tag }}
use-gha: ${{ inputs.use-gha }}
dashboard-tag: ${{ inputs.dashboard-tag }}
s3-bucket: ${{ inputs.s3-bucket }}
aws-role-to-assume: ${{ inputs.aws-role-to-assume }}
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

View File

@@ -54,6 +54,7 @@ jobs:
# Hardcoding below is temporary for testing ALI runners
# This file below should match the script found in .github/scripts/runner_determinator.py
- name: Hardcode runner-determinator script
id: hardcode-script
run: |
cat <<EOF > runner_determinator.py
# flake8: noqa: G004
@@ -70,6 +71,7 @@ jobs:
WORKFLOW_LABEL_META = "" # use meta runners
WORKFLOW_LABEL_LF = "lf." # use runners from the linux foundation
WORKFLOW_LABEL_LF_CANARY = "lf.c." # use canary runners from the linux foundation
GITHUB_OUTPUT = os.getenv("GITHUB_OUTPUT", "")
GH_OUTPUT_KEY_LABEL_TYPE = "label-type"
@@ -261,8 +263,11 @@
)
label_type = WORKFLOW_LABEL_META
set_github_output(GH_OUTPUT_KEY_LABEL_TYPE, label_type)
# For Canary builds use canary runners
if args.github_repo == "pytorch/pytorch-canary" and label_type == WORKFLOW_LABEL_LF:
label_type = WORKFLOW_LABEL_LF_CANARY
set_github_output(GH_OUTPUT_KEY_LABEL_TYPE, label_type)
if __name__ == "__main__":
main()

View File

@@ -188,7 +188,7 @@ jobs:
run: |
pushd "${PYTORCH_FINAL_PACKAGE_DIR}"
# shellcheck disable=SC2046,SC2102
python3 -mpip install $(echo *.whl)[opt-einsum,optree]
python3 -mpip install $(echo *.whl)[opt-einsum,optree] optree==0.12.1
popd
.ci/pytorch/win-test.sh

View File

@@ -0,0 +1,64 @@
name: Build conda docker images
on:
workflow_dispatch:
push:
branches:
- main
- release/*
tags:
# NOTE: Binary build pipelines should only get triggered on release candidate or nightly builds
# Release candidate tags look like: v1.11.0-rc1
- v[0-9]+.[0-9]+.[0-9]+-rc[0-9]+
paths:
- conda/Dockerfile
- 'common/*'
- .github/workflows/build-conda-images.yml
pull_request:
paths:
- conda/Dockerfile
- 'common/*'
- .github/workflows/build-conda-images.yml
env:
DOCKER_REGISTRY: "docker.io"
DOCKER_BUILDKIT: 1
DOCKER_ID: ${{ secrets.DOCKER_ID }}
DOCKER_TOKEN: ${{ secrets.DOCKER_TOKEN }}
WITH_PUSH: ${{ github.event_name == 'push' && (github.ref == 'refs/heads/main' || startsWith(github.ref, 'refs/heads/release')) }}
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }}
cancel-in-progress: true
jobs:
build-docker:
runs-on: linux.9xlarge.ephemeral
strategy:
matrix:
cuda_version: ["11.8", "12.1", "12.4", "cpu"]
env:
CUDA_VERSION: ${{ matrix.cuda_version }}
steps:
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
with:
submodules: false
- name: Calculate docker image
if: env.WITH_PUSH == 'false'
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
with:
docker-image-name: conda-builder${{ matrix.cuda_version == 'cpu' && '-' || '-cuda' }}${{matrix.cuda_version}}
docker-build-dir: .ci/docker/conda
always-rebuild: true
push: true
- name: Authenticate if WITH_PUSH
if: env.WITH_PUSH == 'true'
run: |
if [[ "${WITH_PUSH}" == true ]]; then
echo "${DOCKER_TOKEN}" | docker login -u "${DOCKER_ID}" --password-stdin
fi
- name: Build Docker Image
if: env.WITH_PUSH == 'true'
run: |
.ci/docker/conda/build.sh conda-builder${{ matrix.cuda_version == 'cpu' && ':' || ':cuda' }}${{matrix.cuda_version}}
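For readability, a small sketch of the image-name expression used in the workflow above ("conda-builder-cpu" for the CPU build, otherwise "conda-builder-cuda<version>"); illustrative only:

```python
# Mirrors the workflow expression
# conda-builder${{ matrix.cuda_version == 'cpu' && '-' || '-cuda' }}${{ matrix.cuda_version }}
def conda_builder_image(cuda_version: str) -> str:
    suffix = "-" if cuda_version == "cpu" else "-cuda"
    return f"conda-builder{suffix}{cuda_version}"


assert conda_builder_image("cpu") == "conda-builder-cpu"
assert conda_builder_image("12.4") == "conda-builder-cuda12.4"
```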

Some files were not shown because too many files have changed in this diff.