Compare commits

624 Commits

Author SHA1 Message Date
66340e6751 Fix numerical instability for norm (#129352)
Fixes #123645
When the reduction size is large, reducing directly may exceed the range that FP32 can represent, producing incorrect results.
Reducing in groups and using double as the intermediate accumulation type avoids exceeding the representable range.
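
The overflow and the fix can be illustrated outside PyTorch. The following is a minimal pure-Python sketch, not the kernel changed in this PR: `to_f32`, `naive_norm_f32`, and `grouped_norm` are hypothetical helpers, FP32 is emulated with `struct`, and the group size of 16 is an arbitrary assumption.

```python
import math
import struct


def to_f32(x: float) -> float:
    """Round a Python float (a double) to FP32; out-of-range values become inf."""
    try:
        return struct.unpack("f", struct.pack("f", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


def naive_norm_f32(values):
    """L2 norm with every intermediate result rounded to FP32."""
    acc = 0.0
    for v in values:
        acc = to_f32(acc + to_f32(v * v))
    return to_f32(math.sqrt(acc))


def grouped_norm(values, group_size=16):
    """L2 norm reduced group by group, accumulating in double precision."""
    total = 0.0
    for start in range(0, len(values), group_size):
        partial = 0.0
        for v in values[start:start + group_size]:
            partial += v * v
        total += partial
    # Only the final result is rounded back to FP32.
    return to_f32(math.sqrt(total))


# Each square is about 1e38, close to the FP32 maximum of about 3.4e38.
values = [to_f32(1e19)] * 100
print(naive_norm_f32(values))
print(grouped_norm(values))
```

With these inputs the FP32-only accumulation overflows to `inf` after a few elements, while the grouped double-precision accumulation stays close to the true norm of about 1e20, which FP32 can represent.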

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129352
Approved by: https://github.com/jgong5, https://github.com/peterbell10
2024-09-27 00:51:31 +00:00
adc77a9b7f [lintrunner] auto apply formatting changes as suggestions (#136239)
(Remove spurious cc)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136239
Approved by: https://github.com/huydhn, https://github.com/eqy

Co-authored-by: Huy Do <huydhn@gmail.com>
2024-09-27 00:51:21 +00:00
faedee12fa [test] enable test_triton_wrapper again (#136721)
Summary:
Re-enable the `test_triton_wrapper.py` test.

# Why

We want this to run internally

# What

- fix python path issue on the test
- reenable the test

# Background

It appears that the parent process does not pass its entire path down to the child process. Namely, if some setup makes the effective `sys.path` differ from, say, PYTHONPATH, the child will not inherit that setup. To avoid having to keep track of specific setups, we pass the effective `sys.path` from the parent to the child through the PYTHONPATH environment variable.
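
The idea described above can be sketched with the standard library alone. This is an illustration rather than the code from the test: `env_with_parent_sys_path` and `run_child` are hypothetical helper names.

```python
import os
import subprocess
import sys


def env_with_parent_sys_path():
    """Copy the current environment and set PYTHONPATH to the parent's effective sys.path."""
    env = dict(os.environ)
    # Skip empty entries such as "" (which stands for the current directory).
    env["PYTHONPATH"] = os.pathsep.join(p for p in sys.path if p)
    return env


def run_child(code: str) -> str:
    """Run `code` in a fresh interpreter that sees the parent's sys.path entries."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env_with_parent_sys_path(),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

Any directory appended to `sys.path` in the parent at runtime then becomes importable in the child, regardless of how the parent's path was originally set up.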

Test Plan: buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:triton_wrapper

Differential Revision: D63438186

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136721
Approved by: https://github.com/henrylhtsang
2024-09-27 00:44:40 +00:00
22a4129a76 Generalization of FSDP common for non-cuda execution (#133209)
## Motivation
The FSDP common code for FSDP UT execution is mostly written with the CUDA device in mind. However, other devices such as Intel Gaudi support most of the functionality. We are generalizing the base content so that the UT content can be used for non-CUDA device execution.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133209
Approved by: https://github.com/kwen2501
2024-09-27 00:38:10 +00:00
a619ced5ed Revert "Update run_test.py"
This reverts commit 193073b4914a7f80758541d391eacbe21194ecdf.
2024-09-26 17:34:52 -07:00
193073b491 Update run_test.py 2024-09-26 16:56:29 -07:00
aa56f80ec1 Dont pairwise check unfusable nodes in scheduler (#136682)
Gives 8% wall time speedup on n=1000 benchmark in https://github.com/pytorch/pytorch/pull/136429

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136682
Approved by: https://github.com/ezyang, https://github.com/jansel, https://github.com/shunting314
2024-09-26 23:46:52 +00:00
0b62ebfeaa [CI] Populate JOB_ID for MPS tests (#136791)
Move `get-job-id` steps before running the tests and copy-n-paste environment variables from `_mac-test.yml` added in https://github.com/pytorch/pytorch/pull/113099

Should fix the following warning during MPS test run:
```
/Users/ec2-user/runner/_work/pytorch/pytorch/tools/stats/upload_metrics.py:147: UserWarning: Not emitting metrics for td_test_failure_stats_v2. Missing job_id. Please set the JOB_ID environment variable to pass in this value.
  warn(f"Not emitting metrics for {metric_name}. {e}")
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136791
Approved by: https://github.com/albanD, https://github.com/izaitsevfb
2024-09-26 23:00:52 +00:00
da5c7b6f4e [AOTI] Set CUDA device for torch._export.aot_load (#136715)
Summary: Fixes https://github.com/pytorch/pytorch/issues/136369. When a CUDA device with index is specified when calling torch._export.aot_load, we need to specify the CUDA device when running model.so.

Differential Revision: [D63438335](https://our.internmc.facebook.com/intern/diff/D63438335)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136715
Approved by: https://github.com/angelayi
2024-09-26 22:20:12 +00:00
991f8f8ec3 Bias gradient calculation for NJT linear backward (#136660)
Previously NYI - @mikaylagawarecki needs it for Transformers.

Fixes #136652
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136660
Approved by: https://github.com/soulitzer
2024-09-26 21:38:10 +00:00
eqy
c0e98a485b [FP8][CUDA] Fix stale expected error message (#136581)
CC @nWEIdia as I think we have seen internal failures on this

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136581
Approved by: https://github.com/mikaylagawarecki
2024-09-26 20:57:38 +00:00
5789f8d5dc [MPS] Add regression test for large inputs to F.linear (#136084)
This PR adds a regression test for the issue reported in #122045. I was not able to reproduce on macOS > 13.

~Expect the first iteration of the tests to fail for macOS 13, but pass for 14 and 15.~
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136084
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-09-26 20:46:14 +00:00
9656a603b2 Fix lint (#136781)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136781
Approved by: https://github.com/clee2000, https://github.com/ZainRizvi, https://github.com/malfet
2024-09-26 19:13:56 +00:00
c878ea2c4e Add info about "release tracker" label for cherry-picking bot (#136777)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136777
Approved by: https://github.com/seemethere, https://github.com/atalman
2024-09-26 18:45:45 +00:00
851b9732aa Download pre-compiled AOTriton from GitHub unless AOTRITON_INSTALL_FROM_SOURCE=1 is set (#136603)
PyTorch community members have reported issues with building PyTorch from source for ROCm in an environment that doesn't have aotriton pre-installed, because aotriton is only installed in the [CI](a8ed873ba2/.ci/docker/manywheel/Dockerfile (L197)) docker images. Building aotriton from source can take ~45 minutes.

This PR fixes the issue by downloading the aotriton tarball in such scenarios, *unless the user explicitly wants to build aotriton from source using the AOTRITON_INSTALL_FROM_SOURCE=1 env var*

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136603
Approved by: https://github.com/atalman

Co-authored-by: Xinya Zhang <Xinya.Zhang@amd.com>
2024-09-26 18:05:51 +00:00
f0a92541fe [export] fix lifted constants order for 0-input graphs (#136658)
Summary:
With empty graphs, the `graph.inserting_before(first_user_input = None)` call turns into a `graph.inserting_after(root)` call, inverting the order of constant input nodes being inserted.

This fixes the issue by initializing to the first node in the graph (still valid if not a user input - only used for insertion).

Test Plan: test_export

Differential Revision: D63403514

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136658
Approved by: https://github.com/avikchaudhuri
2024-09-26 17:44:24 +00:00
40c825d773 [reland] [torchelastic][c10d] Fix store prefix race in rendezvous (#136768)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136768
Approved by: https://github.com/kwen2501, https://github.com/atalman
2024-09-26 17:37:07 +00:00
da09984c0d [AOTI][Tooling][9/n] Add debug printer support for cpp kernel type (#136465)
Summary:

As title.

Cpp kernel has a different codegen path: https://www.internalfb.com/code/fbsource/[6df946858879dd9bcefa18710dd79095a957f0dd]/fbcode/caffe2/torch/_inductor/codegen/cpp.py?lines=4643
Previously it was not wrapped/supported by the debug printer manager. This diff adds that support.
It can be useful for cpu models. See this for a use case: https://www.internalfb.com/phabricator/paste/view/P1598561051?lines=927

Test Plan:
```
AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=2 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1  TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+graph, inductor, +schedule, output_code" buck2 run 'fbcode//mode/opt' fbcode//accelerators/workloads/models/slimdsnn:slimdsnn -- aot --batch-size 1
```

Differential Revision: D63053101

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136465
Approved by: https://github.com/hl475
2024-09-26 17:30:43 +00:00
e4e83a4ac4 Remove aten.item hack (#136663)
Summary: Title

Test Plan: CI

Differential Revision: D63404353

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136663
Approved by: https://github.com/bdhirsh
2024-09-26 17:14:48 +00:00
2421344d8f Update current maintainers (#136672)
This file hasn't had an overhaul in a few years, so this is long overdue. Most of the credit goes to @orionr for gathering all of this info.

The main rules we followed:
- No code contributor is removed; they are all listed as emeritus
- Break down overly large categories so that this document is useful for knowing who to ping
- No category where the code is still in the codebase is removed
- We did not rework the categories (for example to be closer to module: labels) and leave that for later
- All non-emeritus names are ordered by their number of comments on issues related to their topic
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136672
Approved by: https://github.com/eqy, https://github.com/ezyang, https://github.com/seemethere, https://github.com/malfet
2024-09-26 17:13:16 +00:00
beb46de342 Correctly convert Python float to float64 when passing argument as Tensor (#136413)
I can't actually test the Dynamo codegen fix as it is impossible to
directly use the Tensor at the moment.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136413
Approved by: https://github.com/bobrenjc93
ghstack dependencies: #136599
2024-09-26 16:50:13 +00:00
11fd55827d Make CLOSURE_VARS construction lazy (#136599)
This makes us less likely to hit import cycle problems with torch

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136599
Approved by: https://github.com/anijain2305
2024-09-26 16:50:13 +00:00
ff2360c733 [FlexAttention] Reduce expensive test time by 10x (#136677)
Now that we support sequence lengths that are not divisible by 128, this cuts the runtime of the expensive tests by roughly 10x.
Before:
```Shell
46.32s call     test/inductor/test_flex_attention.py::TestFlexAttention::test_aot_eager_gradcheck_score_mod1
45.61s call     test/inductor/test_flex_attention.py::TestFlexAttention::test_aot_eager_gradcheck_score_mod2
44.45s call     test/inductor/test_flex_attention.py::TestFlexAttention::test_aot_eager_gradcheck_score_mod3
43.61s call     test/inductor/test_flex_attention.py::TestFlexAttention::test_aot_eager_gradcheck_score_mod0
```

After:
```Shell
4.25s call     test/inductor/test_flex_attention.py::TestFlexAttention::test_aot_eager_gradcheck_score_mod5
4.20s call     test/inductor/test_flex_attention.py::TestFlexAttention::test_aot_eager_gradcheck_score_mod4
4.19s call     test/inductor/test_flex_attention.py::TestFlexAttention::test_aot_eager_gradcheck_score_mod1
4.04s call     test/inductor/test_flex_attention.py::TestFlexAttention::test_aot_eager_gradcheck_score_mod2
3.99s call     test/inductor/test_flex_attention.py::TestFlexAttention::test_aot_eager_gradcheck_score_mod0
3.98s call     test/inductor/test_flex_attention.py::TestFlexAttention::test_aot_eager_gradcheck_score_mod3
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136677
Approved by: https://github.com/Chillee
ghstack dependencies: #136673
2024-09-26 16:40:21 +00:00
840c6b7a68 [FlexAttention] Add Better error message for cpu tensors (#136673)
Partially address: #136525

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136673
Approved by: https://github.com/Chillee
2024-09-26 16:40:21 +00:00
ddab704b28 Use wildcard for portion of AMI version number (#136764)
Rather than specifying a specific version number for the AMIs, use wildcards for the date section.

Issue: pytorch/pytorch#136762

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136764
Approved by: https://github.com/ZainRizvi
2024-09-26 16:39:25 +00:00
cyy
59e8f8228f [3/N] Fix clang-tidy warnings in torch/csrc/lazy (#136705)
Follows #136634
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136705
Approved by: https://github.com/Skylion007
2024-09-26 16:29:43 +00:00
31c0467594 Add Triton CPU as an Inductor backend (#133408)
The goal is to use Inductor-generated kernels to stress test the new Triton CPU backend.

Differential Revision: [D63298968](https://our.internmc.facebook.com/intern/diff/D63298968)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133408
Approved by: https://github.com/jansel, https://github.com/blaine-rister, https://github.com/malfet
2024-09-26 15:35:26 +00:00
68579ef665 [EZ][MPS] Extend arange to bfloat16 (#136754)
RangeFactories class is the only one that uses `AT_DISPATCH_MPS_TYPES`

Fixes https://github.com/pytorch/pytorch/issues/136624
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136754
Approved by: https://github.com/Skylion007
2024-09-26 15:33:45 +00:00
73ec76ed50 [MPS] Implement isposinf and isneginf (#136689)
Not sure why `isinf` is a composite op, but these need to be implemented by hand.

Implementation is a trivial call to
```objc
[mpsGraph equalWithPrimaryTensor:input
                 secondaryTensor:[mpsGraph constantWithScalar:std::numeric_limits<T>::infinity()
                                                     dataType:input.dataType]]
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136689
Approved by: https://github.com/Skylion007
2024-09-26 15:33:20 +00:00
d05645841e Update get_device_properties to take in optional device (#136683)
Aligns behavior with the rest of cuda's device info query methods

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136683
Approved by: https://github.com/eqy
2024-09-26 15:07:31 +00:00
d5e4a20c17 Revert "Introduce _ArglessActivation base class for parameterless activation functions (#136296)"
This reverts commit dda0e4de32b29098f25f9b2889423c9446680cc1.

Reverted https://github.com/pytorch/pytorch/pull/136296 on behalf of https://github.com/atalman due to Breaks Internal CI. Error: Too many arguments [19]: Call `nn.modules.activation._ArglessActivation.__init__` expects 0 positional arguments, 1 was provided. ([comment](https://github.com/pytorch/pytorch/pull/136296#issuecomment-2377091280))
2024-09-26 14:12:12 +00:00
4150ab44a4 Fix composite op redispatch for NJT in inference mode (#134683)
Prior to this PR, calling `reshape()` under `inference_mode()` would throw a `NotImplementedError`. This is because `inference_mode()` disables autograd key dispatch, incidentally preventing the decomposition of reshape for NJT.

This PR fixes this by redispatching on the `CompositeImplicitAutogradNestedTensor` key whenever a composite implicit op is encountered in `NJT.__torch_dispatch__()`. This fixes reshape and any other composite implicit ops underneath `inference_mode()`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134683
Approved by: https://github.com/soulitzer, https://github.com/albanD
ghstack dependencies: #136566
2024-09-26 14:10:53 +00:00
f8debd5d83 Fix wrapper subclass reentrant dispatch + TorchDispatchMode (#136566)
Fixes #136565

This PR makes the python fallback robust to the case where there are no active modes & no tensors with the Python key. In this case, simply redispatch with the Python key disabled.

This was found when trying to use reentrant dispatch for NJT to get decompositions under `inference_mode()` when the autograd key is disabled.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136566
Approved by: https://github.com/bdhirsh
2024-09-26 14:06:51 +00:00
963e793e1b [Inductor][CPP] Optimize WOQ INT8 wgt dequant in AMX GEMM template (#136630)
**Summary**
Optimize the WOQ int8 AMX performance by changing the int8 -> bf16 conversion.
Earlier, 16 int8 elements were being loaded at a time & converted to 16 BF16 elements.
With this change, 32 int8 elements will be loaded at a time, and converted to a cache-line of 32 BF16 elements more efficiently.

Performance before
```
AUTOTUNE _weight_int8pack_mm(4096x4096, 4096x4096, 4096)
  cpp_packed_gemm_0 38.0439 ms 100.0%
  _weight_int8pack_mm 50.2524 ms 75.7%
SingleProcess AUTOTUNE benchmarking takes 1.1087 seconds and 1.9791 seconds precompiling
AUTOTUNE _weight_int8pack_mm(4096x4096, 11008x4096, 11008)
  cpp_packed_gemm_4 78.2038 ms 100.0%
  _weight_int8pack_mm 119.1962 ms 65.6%
SingleProcess AUTOTUNE benchmarking takes 1.9274 seconds and 1.9949 seconds precompiling
AUTOTUNE _weight_int8pack_mm(4096x11008, 4096x11008, 4096)
  cpp_packed_gemm_6 79.2368 ms 100.0%
  _weight_int8pack_mm 118.3212 ms 67.0%
SingleProcess AUTOTUNE benchmarking takes 1.9200 seconds and 2.0015 seconds precompiling
AUTOTUNE _weight_int8pack_mm(4096x4096, 32000x4096, 32000)
  cpp_packed_gemm_224 225.7201 ms 100.0%
  _weight_int8pack_mm 388.5588 ms 58.1%
```

Performance after this PR
```
AUTOTUNE _weight_int8pack_mm(4096x4096, 4096x4096, 4096)
  cpp_packed_gemm_0 11.0086 ms 100.0%
  _weight_int8pack_mm 50.2918 ms 21.9%
SingleProcess AUTOTUNE benchmarking takes 1.0837 seconds and 2.0301 seconds precompiling
AUTOTUNE _weight_int8pack_mm(4096x4096, 11008x4096, 11008)
  cpp_packed_gemm_4 24.3528 ms 100.0%
  _weight_int8pack_mm 119.8492 ms 20.3%
SingleProcess AUTOTUNE benchmarking takes 1.8303 seconds and 1.8195 seconds precompiling
AUTOTUNE _weight_int8pack_mm(4096x11008, 4096x11008, 4096)
  cpp_packed_gemm_6 24.6148 ms 100.0%
  _weight_int8pack_mm 119.1908 ms 20.7%
SingleProcess AUTOTUNE benchmarking takes 1.8315 seconds and 1.8352 seconds precompiling
AUTOTUNE _weight_int8pack_mm(4096x4096, 32000x4096, 32000)
  cpp_packed_gemm_224 78.1369 ms 100.0%
  _weight_int8pack_mm 387.6289 ms 20.2%
SingleProcess AUTOTUNE benchmarking takes 4.5059 seconds and 1.8010 seconds precompiling
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136630
Approved by: https://github.com/jgong5
ghstack dependencies: #136353
2024-09-26 08:41:58 +00:00
77fba0c407 [PT2][Optimus] Fix a group batch fusion corner case (#136650)
Summary:
We have a user report that the BA model raised "AttributeError: 'SymFloat' object has no attribute 'shape'", so we add a type check for the meta node.

See more context in the post
https://fb.workplace.com/groups/1075192433118967/permalink/1510477489590457/

Test Plan:
# local reproduce

```
CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode split-batch-decompose --flow_id 646303196
```

P1609807876

# E2E

before fix

f646303196

after fix

Differential Revision: D63399959

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136650
Approved by: https://github.com/ezyang
2024-09-26 06:35:11 +00:00
d1bb8e828f Add deterministic path for CUDA cumsum (#136224)
Change `cumsum` to call its decomposition when `use_deterministic_algorithms(True)` and input is CUDA.

Fixes #89492

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136224
Approved by: https://github.com/ezyang, https://github.com/justinchuby
2024-09-26 04:52:05 +00:00
b408591b53 Revert "[Flex Attention] fix block size order (#136657)"
This reverts commit 529b6ab0bb9f8800ed795ec8e4fa1f0e8042bb0a.

Reverted https://github.com/pytorch/pytorch/pull/136657 on behalf of https://github.com/huydhn due to Sorry for reverting your change but some test_flex_attention is failing in trunk after this change 529b6ab0bb ([comment](https://github.com/pytorch/pytorch/pull/136657#issuecomment-2375824802))
2024-09-26 04:06:41 +00:00
cyy
3c542ce831 [Reland] Check function declarations of COREML code (#136070)
Reland of #135467 by fixing periodic workflows.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136070
Approved by: https://github.com/ezyang
2024-09-26 03:52:06 +00:00
042af7ec53 [BE] [MPS] Use validation helper for input tensors (#134609)
Small refactor to use already existing helper with equivalent behavior.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134609
Approved by: https://github.com/malfet
2024-09-26 03:47:30 +00:00
e4d32d2194 Improve data-dependent-output meta kernel error message (#136671)
Test Plan:
- code reading
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136671
Approved by: https://github.com/williamwen42
2024-09-26 03:46:04 +00:00
190e09d8b6 [Inductor UT] Generalize device-bias code introduced from #134874 and (#136596)
[Inductor UT] Generalize device-bias code introduced from #134874 and fix unexpected success test cases.
Fix #136595

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136596
Approved by: https://github.com/EikanWang, https://github.com/jansel

Co-authored-by: Yu, Guangye <guangye.yu@intel.com>
2024-09-26 02:56:59 +00:00
dda0e4de32 Introduce _ArglessActivation base class for parameterless activation functions (#136296)
Fixes #133683
Fixes #133684
Fixes #133688

This PR introduces a new base class `_ArglessActivation` and refactors five existing activation functions to inherit from it. This change aims to improve documentation consistency and also API consistency with other activation functions that do have parameters and explicitly call `super().__init__()`

Key changes and considerations:
1. Added new class `_ArglessActivation`:
2. Refactored the following classes to inherit from `_ArglessActivation`:
   - Sigmoid
   - Tanh
   - Softsign
   - Tanhshrink
   - Softmax2d
3. Performance consideration:
   - This change introduces a slight overhead for creating a new stack frame and handling an additional function call on every instance creation
   - The impact is expected to be minimal in most use cases

Docs view before:
<img width="425" alt="Screen Shot 2024-09-18 at 3 00 22 PM" src="https://github.com/user-attachments/assets/ca0d1000-44c5-4c52-b344-68f7e170bafe">

Docs view after:
<img width="431" alt="Screen Shot 2024-09-18 at 3 00 52 PM" src="https://github.com/user-attachments/assets/f7ceb8f3-a2a2-4fd6-a2b8-39105a02bcbd">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136296
Approved by: https://github.com/mikaylagawarecki
2024-09-26 02:45:05 +00:00
d0456b4274 noop on torch.library APIs under torch::deploy (multipy) (#136645)
Fixes https://github.com/pytorch/pytorch/issues/136177

The motivation is that torch::deploy doesn't handle this well. The
workaround for users is to use C++ custom ops.

All torch.library APIs ultimately go through the torch.library.Library
object, so we add checks to noop for torch::deploy there.

Test Plan:
- new test
- going to test this internally and hope nothing breaks.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136645
Approved by: https://github.com/ezyang
2024-09-26 02:34:34 +00:00
5c78c6b05a [CI] Switch aarch64 dashboard run back to nightly (#136643)
Summary: Reduce the frequency of the aarch64 dashboard CI run since we don't need to monitor its instability anymore.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136643
Approved by: https://github.com/huydhn
2024-09-26 01:26:05 +00:00
141cae2eb8 [pipelining] Fix more leaks and check leaks in tests (#136584)
Fix two more leaks of the same variety as #136507 (see that PR desc and attached gdoc for debug details).

This time, also add a test-time check that helped discover the new leaks and ensures we won't accidentally regress.

Adds `check_tensor_leak` util which internally asserts no tensors are being kept alive by other objects involved in py ref cycles.
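
The general technique of finding objects kept alive only by reference cycles can be sketched with the standard `gc` module. This is not the `check_tensor_leak` implementation: `FakeTensor`, `count_cycle_garbage`, `leaky`, and `clean` are hypothetical names, and a stand-in class replaces `torch.Tensor` so the sketch runs without PyTorch.

```python
import gc


class FakeTensor:
    """Stand-in for torch.Tensor so the sketch runs without PyTorch."""


def count_cycle_garbage(fn, kind=FakeTensor):
    """Run fn and count `kind` objects that only the cycle collector could free."""
    gc.collect()  # free any cycles left over from earlier code
    gc.garbage.clear()
    old_flags = gc.get_debug()
    # With DEBUG_SAVEALL, unreachable objects are kept in gc.garbage instead of freed.
    gc.set_debug(gc.DEBUG_SAVEALL)
    try:
        fn()
        gc.collect()
        return sum(isinstance(obj, kind) for obj in gc.garbage)
    finally:
        gc.set_debug(old_flags)
        gc.garbage.clear()


def leaky():
    holder = {}
    tensor = FakeTensor()
    tensor.holder = holder     # tensor -> holder
    holder["tensor"] = tensor  # holder -> tensor closes the reference cycle


def clean():
    tensor = FakeTensor()
    del tensor  # freed immediately by reference counting, never reaches the collector
```

Objects released by plain reference counting never show up in `gc.garbage`, so a non-zero count points at a reference cycle that is holding the object alive.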

Uses objgraph for a nice debug utility when a leak is found.

Credit to @H-Huang for pointing out objgraph and helping debug the `param_group["intermediates"]` leak.

I manually confirmed that all 3 of the leaks identified/fixed so far are caught by the unit test and checker.

Sample output, if I re-introduce a leak by commenting out `del param_group["intermediates"]` in _backward.py,
and run `python test/distributed/pipelining/test_schedule_multiproc.py -k test_schedule_with_native_zero_bubble`:

```
warnings.warn(
/data/users/whc/pytorch/torch/testing/_internal/common_utils.py:5341: UserWarning: 34 tensors were found in the garbage. Did you introduce a reference cycle?
warnings.warn(
/data/users/whc/pytorch/torch/testing/_internal/common_utils.py:5347: UserWarning: Dumping first 1 objgraphs of leaked tensors rendered to png
Graph written to /tmp/objgraph-ztz642h3.dot (19 nodes)
Graph viewer (xdot) not found, generating a png instead
Image generated as /tmp/objgraph-ztz642h3.png
```

rendering of ` /tmp/objgraph-ztz642h3.png`:
<img width="1671" alt="image" src="https://github.com/user-attachments/assets/9098ff29-224c-4533-935b-83c210ac2e22">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136584
Approved by: https://github.com/kwen2501, https://github.com/H-Huang
ghstack dependencies: #136507

Co-authored-by: Howard Huang <howardhuang@fb.com>
2024-09-26 01:10:40 +00:00
e8f1dd6ba0 Fix hardcoded ROCm paths in Caffe2Targets.cmake (#136283)
Fixes #131701

Use CMake imported targets more consistently to eliminate hardcoded paths.

Here are the relevant new sections of Caffe2Targets.cmake:
```
set_target_properties(c10_hip PROPERTIES
  INTERFACE_INCLUDE_DIRECTORIES "${_IMPORT_PREFIX}/include"
  INTERFACE_LINK_LIBRARIES "c10;hip::amdhip64"
)
```

```
set_target_properties(torch_hip PROPERTIES
  INTERFACE_COMPILE_DEFINITIONS "USE_C10D_NCCL"
  INTERFACE_COMPILE_OPTIONS "-fPIC;-D__HIP_PLATFORM_AMD__=1;-DCUDA_HAS_FP16=1;-DUSE_ROCM;-D__HIP_NO_HALF_OPERATORS__=1;-D__HIP_NO_HALF_CONVERSIONS__=1;-DTORCH_HIP_VERSION=602;-Wno-shift-count-negative;-Wno-shift-count-overflow;-Wno-duplicate-decl-specifier;-DCAFFE2_USE_MIOPEN;-DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_HIP;-std=c++17;-DHIPBLAS_V2;-DHIP_NEW_TYPE_ENUMS"
  INTERFACE_INCLUDE_DIRECTORIES "${_IMPORT_PREFIX}/include"
  INTERFACE_LINK_LIBRARIES "c10_hip;torch_cpu_library;hip::amdhip64;MIOpen;hiprtc::hiprtc;roc::hipblaslt;roc::hipblas;hip::hipfft;hip::hiprand;roc::hipsparse;roc::hipsolver"
)
```

The HIPCUB dependency was not actually used, so it is removed here, as the imported target had undesirable side effects.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136283
Approved by: https://github.com/jeffdaily, https://github.com/Skylion007, https://github.com/jithunnair-amd, https://github.com/atalman
2024-09-26 00:34:43 +00:00
f3dd1721f4 [Update] Update note for Getting Started with PyTorch on Intel GPUs (#129946)
Remove the hardware and software prerequisites and the environment setup part.
Keep the prerequisites section and link to the PyTorch prerequisites for Intel GPUs for driver installation, Intel support package installation, and environment setup:
https://www.intel.com/content/www/us/en/developer/articles/tool/pytorch-prerequisites-for-intel-gpus.html
Update the support for Intel Client GPU MTL-H.
Update inference and training examples.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129946
Approved by: https://github.com/seemethere
2024-09-26 00:22:05 +00:00
9223c16208 Revert "Fix constant propagation in builtins and UserClasses (#131354)"
This reverts commit dd4a51b39aa02cba23b3a387b41c5026770d9220.

Reverted https://github.com/pytorch/pytorch/pull/131354 on behalf of https://github.com/atalman due to Breaks torchrec tests ([comment](https://github.com/pytorch/pytorch/pull/131354#issuecomment-2375417145))
2024-09-25 23:01:03 +00:00
ecc15c4f89 [AOTI] Fix a missing aoti_torch_check symbol issue (#136669)
Summary: When Inductor generates cpp kernels, they should be pure cpp loops that are as independent of libtorch as possible.

Differential Revision: D63403473

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136669
Approved by: https://github.com/henrylhtsang
2024-09-25 22:56:10 +00:00
b7a5c7d331 Do not XFAIL test_segfault in fbcode (#136661)
https://github.com/pytorch/pytorch/pull/136252 silenced the failure in OSS, but the test actually passes in fbcode [T202241133](https://www.internalfb.com/intern/tasks/?t=202241133)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136661
Approved by: https://github.com/malfet
2024-09-25 22:26:24 +00:00
8d65d9f11b Constraint setuptools to 72.1.0 or older in requirements.txt (#136489)
FIXES: https://github.com/pytorch/pytorch/issues/136541

Setuptools>=74.0.0 has deprecated support for some functions in distutils, so builds run into errors such as ```AttributeError: module 'distutils' has no attribute '_msvccompiler'```. Also, the PyTorch builds pin setuptools to 72.1.0 according to these PRs: https://github.com/pytorch/builder/pull/1995 and 89d9a8cf6f. So, until the function usage is updated to match the latest setuptools, we stay on version 72.1.0, which works fine.

Also observed in CI jobs: https://github.com/pytorch/pytorch/actions/runs/10979326524
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136489
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-09-25 22:06:05 +00:00
c9d12f6360 [inductor][memory] add signpost event for memory pass (#136538)
Add logging to scuba table for internal models.

For verification, I triggered a sample workflow internally and checked the scuba table logging to make sure the `Paramaters` column has the expected loggings, see [here](https://fburl.com/scuba/workflow_signpost/39h7qo9s).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136538
Approved by: https://github.com/yf225
2024-09-25 21:47:46 +00:00
b5c2a657ae Add zou3519 to CODEOWNERS for HOPs (#136679)
There are some tricky things that I want to guard against
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136679
Approved by: https://github.com/Chillee
2024-09-25 21:29:48 +00:00
289df45cee Revert "[Dynamo] Trace enter/exit of TorchFunctionModes (#135422)" (#136590)
This reverts commit 7743149b2be4a9eba7e0997ccdc6abe552bec266.

Reverts
* https://github.com/pytorch/pytorch/pull/135503
* https://github.com/pytorch/pytorch/pull/135502
* https://github.com/pytorch/pytorch/pull/135422

This passes this test. Earlier, the getitem would stay as a getitem in the FX graph. But now fake tensor propagation fails, saying that `.item` is called. It seems that torch function is not getting triggered during fake tensor propagation.

```
import torch
from torch.nn.attention.flex_attention import BlockMask, _mask_mod_signature, _score_mod_signature, flex_attention
from torch._inductor.lowering import make_pointwise, register_lowering
from torch._inductor.virtualized import ops
from torch.nn.attention.flex_attention import create_block_mask

torch.set_default_device('cuda')

flex_attention = torch.compile(flex_attention, dynamic=False)

prefix_lengths = torch.arange(8)
def prefix_lm(b, h, q, kv):
    return prefix_lengths[b] >= kv

mask = create_block_mask(prefix_lm, 8, None, 512, 512, _compile=True)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136590
Approved by: https://github.com/Chillee
2024-09-25 21:10:43 +00:00
529b6ab0bb [Flex Attention] fix block size order (#136657)
`create_block_mask` currently gives wrong BLOCK_SIZE and shape when using non-default block size `(128,128)`.
This PR fixes the issue by using BLOCK_SIZE order `(Q_BLOCK_SIZE, KV_BLOCK_SIZE)`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136657
Approved by: https://github.com/Chillee, https://github.com/drisspg
2024-09-25 21:08:40 +00:00
76b044d7cb Don't actually import module when checking if its valid (#136548)
Summary: If you actually import the module, you might end up with some import cycle situation where a module is imported too early and accesses things that are not initialized yet.
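
One standard-library way to check for a module without executing it is `importlib.util.find_spec`. Whether this PR uses that exact call is not stated here, so treat the following as a sketch; `module_is_available` is a hypothetical helper name.

```python
import importlib.util


def module_is_available(name: str) -> bool:
    """Return True if `name` could be imported, without executing the module itself."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # ImportError (including ModuleNotFoundError): a parent package is missing
        # or failed to import.
        # ValueError: the name is in sys.modules but has no usable __spec__.
        return False
```

Note that for a dotted name such as `pkg.sub`, `find_spec` still imports the parent package `pkg` to locate the submodule, so only the leaf module is guaranteed not to run.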

Test Plan:
sandcastle and ossci

```
TORCH_LOGS=+torch._inductor.codecache buck run mode/opt caffe2/benchmarks/dynamo:torchbench
```

Differential Revision: D63330224

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136548
Approved by: https://github.com/Skylion007
2024-09-25 20:47:32 +00:00
11c5f9ac3b Use amazon linux 2023 runners for Docker builds (#136544)
Migrate these builds to linux 2023. We want to build and test the Docker images in CD.

Looks like we are hitting this issue: https://github.com/docker/buildx/issues/379 when trying to build Docker on Amazon Linux 2023.

The Conda Docker build is timing out, while the Manywheel build runs but fails because BUILDKIT is turned off: https://github.com/pytorch/pytorch/actions/runs/11036043157/job/30653543264?pr=136544

The proposed solution is to fix it in user_data. Please see: https://github.com/pytorch/test-infra/issues/5712

I see docker builds are executed successfully here: https://github.com/pytorch/pytorch/actions/runs/11040149229/job/30667448668?pr=136544

Work around the timeout problem (reported in https://bugzilla.redhat.com/show_bug.cgi?id=1537564) by setting the number of open files per container to 1048576.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136544
Approved by: https://github.com/ZainRizvi

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-09-25 20:39:56 +00:00
13b0baf2a1 [FX] Update _inline_module util function to work with both args and kwargs (#136631)
Summary: Previously the `_inline_module` helper function only worked with submodules that had args specified. This diff updates the util function to look up input arguments in the submodule kwargs first, using placeholder node names, and then fall back to the list of args if the node name is not found.
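The lookup order described above can be sketched in plain Python. This is not the FX implementation: `resolve_submodule_inputs` is a hypothetical helper, and consuming positional args in order as the fallback is an assumption about how "fall back to the list of args" behaves.

```python
from typing import Any, Dict, List, Sequence


def resolve_submodule_inputs(
    placeholder_names: Sequence[str],
    args: Sequence[Any],
    kwargs: Dict[str, Any],
) -> List[Any]:
    """Map each placeholder name to the value the caller passed (hypothetical)."""
    resolved: List[Any] = []
    next_positional = 0
    for name in placeholder_names:
        if name in kwargs:
            # Preferred path: match the placeholder by its node name.
            resolved.append(kwargs[name])
        elif next_positional < len(args):
            # Fallback (assumed): take the next unused positional argument.
            resolved.append(args[next_positional])
            next_positional += 1
        else:
            raise ValueError(f"no input found for placeholder {name!r}")
    return resolved


# Mixed call: `x` arrives positionally, `y` by keyword.
print(resolve_submodule_inputs(["x", "y"], ("x_value",), {"y": "y_value"}))
```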

Test Plan:
```
buck2 run @//mode/{opt,mtia,inplace} //glow/fb/fx/fba/tests:test_fba_inductor -- -r test_connected_fusions
```

Differential Revision: D63347675

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136631
Approved by: https://github.com/jfix71
2024-09-25 20:20:57 +00:00
a8ed873ba2 Add missing input "eps" to adam docs (#135191)
Minor fix for missing input argument in the Adam optimizer docs page.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135191
Approved by: https://github.com/janeyx99
2024-09-25 20:17:23 +00:00
cyy
6aa6bd4ca5 [Distributed] [12/N] Fix clang-tidy warnings in torch/csrc/distributed/ (#136528)
Follows #136439. A dangling reference to qualifiedName was found and fixed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136528
Approved by: https://github.com/kwen2501
2024-09-25 20:12:08 +00:00
5a29a06aa3 [AMD][inductor] do not use float64 on AMD internally (#136441)
Summary:
Internal AMD triton seems to have issue with float64 constant:

```
### Most recent error lines found on the logs:
E0920 13:23:56.391000 2026 torch/_inductor/runtime/triton_heuristics.py:446] [2/2]                ^
E0920 13:23:56.391000 2026 torch/_inductor/runtime/triton_heuristics.py:446] [2/2]         tmp8 = tl.broadcast_to((libdevice.llrint((tl.full([1], 1.00000000000000, tl.float64))*(ks3.to(tl.float64)))) / ks1, [XBLOCK, RBLOCK])
E0920 13:23:56.391000 2026 torch/_inductor/runtime/triton_heuristics.py:446] [2/2]         tmp7 = tmp5 + tmp6
E0920 13:23:56.391000 2026 torch/_inductor/runtime/triton_heuristics.py:446] [2/2]         tmp6 = 0.5
E0920 13:23:56.391000 2026 torch/_inductor/runtime/triton_heuristics.py:446] [2/2]         tmp5 = tmp4.to(tl.float32)
E0920 13:23:56.391000 2026 torch/_inductor/runtime/triton_heuristics.py:446] [2/2]         tmp4 = (((r3 + (x0*((17 + (16*ks0*ks1)) // 18))) % ks2) // ks0) % ks1
E0920 13:23:56.391000 2026 torch/_inductor/runtime/triton_heuristics.py:446] [2/2]         tmp3 = tmp2.to(tl.int1)
E0920 13:23:56.391000 2026 torch/_inductor/runtime/triton_heuristics.py:446] [2/2]         tmp2 = tmp0 < tmp1
E0920 13:23:56.391000 2026 torch/_inductor/runtime/triton_heuristics.py:446] [2/2]         tmp1 = 16*ks0*ks1
E0920 13:23:56.391000 2026 torch/_inductor/runtime/triton_heuristics.py:446] [2/2]         tmp0 = r3 + (x0*((17 + (16*ks0*ks1)) // 18))
E0920 13:23:56.391000 2026 torch/_inductor/runtime/triton_heuristics.py:446] [2/2]         r3 = rindex
E0920 13:23:56.391000 2026 torch/_inductor/runtime/triton_heuristics.py:446] [2/2]         rmask = rindex < rnumel
E0920 13:23:56.391000 2026 torch/_inductor/runtime/triton_heuristics.py:446] [2/2]         rindex = roffset + rbase
E0920 13:23:56.391000 2026 torch/_inductor/runtime/triton_heuristics.py:446] [2/2] triton.compiler.errors.CompilationError: at 26:15:
E0920 13:23:56.391000 2026 torch/_inductor/runtime/triton_heuristics.py:446] [2/2]     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns)
```

Bisecting shows this error was introduced by D62465575.

This diff avoids converting constants to float64 on AMD; with it, the emu1.4 predictor can now run on AMD with rocm6.0.

Test Plan:
rocm6.0 can work
```
TORCHINDUCTOR_AUTOTUNE_REMOTE_CACHE=1 HIP_FORCE_DEV_KERNARG=1 HIP_GRAPH=--use-cuda-graph PYTORCH_MIOPEN_SUGGEST_NHWC=1 TORCHINDUCTOR_LAYOUT_OPTIMIZATION=1 CUDA_VISIBLE_DEVICES="2" TORCH_LOGS="recompiles,cudagraphs" buck2 run @//mode/opt-amd-gpu -c fbcode.rocm_ck_rtz=true -m rocm60 fblearner/predictor/py/applications/photogen:ip_python_predictor_photogen_cm -- --model=photogen_v1p4_9b --thrift_server_port=15008 --max_predict_calls=1 --enable_tunable_op --load_from_torch_package=genai:937233660_1
```

emu1.4 predictor on AMD fails with rocm6.2 with some other triton errors (https://www.internalfb.com/phabricator/paste/view/P1603842354)

Differential Revision: D63263806

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136441
Approved by: https://github.com/houseroad
2024-09-25 19:13:17 +00:00
37f340c1e5 [EZ] Remove remaining amz2023 runner variant references (#136540)
Validated no jobs use the amz2023 runner variant anymore ([proof](https://github.com/search?type=code&q=org%3Apytorch+%2F%5Cbamz2023%5Cb%2F+&p=1)) so removing all references to it

Explicit references to the amz2023 runner type variants were removed in the following PRs:
- https://github.com/pytorch/ignite/pull/3285
- https://github.com/pytorch/ao/pull/887
- https://github.com/pytorch/fbscribelogger/pull/1
- https://github.com/pytorch/pytorch/pull/134355

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136540
Approved by: https://github.com/huydhn, https://github.com/malfet
2024-09-25 19:01:00 +00:00
9c2c61d2dd [inductor] ELEMENTS_PER_WARP_32 -> ONE_ELEMENT_PER_THREAD (#136472)
AMD devices can have 64 threads per warp; this PR makes the handling of "ELEMENTS_PER_WARP_32" generic and uses DeviceProperties.warp_size to determine the warp size instead of hard-coding it as 32. It also renames the enum value. Added a unit test for this.

Note: I left the old enum option (ELEMENTS_PER_WARP_32) as is instead of renaming it. I'm not sure whether we should expect caches to get invalidated here; if this concern is valid, then there's a risk that this would get updated while some model still uses cached inductor code that references "ELEMENTS_PER_WARP_32", which would no longer exist.
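As a rough illustration of why the warp size matters here, the sketch below picks a warp count so that each thread handles one element. The function name and the formula are assumptions for illustration only and are not copied from inductor's heuristics.

```python
def warps_for_one_element_per_thread(numel: int, warp_size: int) -> int:
    """Number of warps so that numel == num_warps * warp_size (hypothetical).

    Clamped to at least one warp for small inputs.
    """
    return max(numel // warp_size, 1)


# The same problem size needs half as many warps when the warp is twice as wide.
print(warps_for_one_element_per_thread(1024, warp_size=32))
print(warps_for_one_element_per_thread(1024, warp_size=64))
```

Hard-coding 32 in such a calculation would launch twice as many threads as elements on a device whose warp size is 64.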

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136472
Approved by: https://github.com/jansel
2024-09-25 18:21:09 +00:00
cyy
a259fbf72c [2/N] Fix clang-tidy warnings in torch/csrc/lazy (#136634)
Follows #134655
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136634
Approved by: https://github.com/Skylion007
2024-09-25 18:08:29 +00:00
0b38fa154a Fix meta registry in export (#136492)
Summary: Title

Test Plan: CI

This fixes some breaking tests in executorch. I think the root cause is that when we have aten::matmul, which we are not preserving, we register the meta implementation from the C++ side. It seems the C++ kernel doesn't work well with a mix of FakeTensor and real tensors. This PR sidesteps the problem by always preferring the Python CIA decomp over the C++ CIA decomp.

Differential Revision: D63297050

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136492
Approved by: https://github.com/bdhirsh
2024-09-25 17:53:02 +00:00
8582835499 [ONNX] Remove the operators test (#136335)
The tests are obsolete and hard to maintain.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136335
Approved by: https://github.com/xadupre, https://github.com/cyyever

Co-authored-by: Edward Z. Yang <ezyang@meta.com>
2024-09-25 17:44:18 +00:00
7cb6d31567 Dump partially traced make_fx graph in event of error to tlparse (#136508)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136508
Approved by: https://github.com/zou3519, https://github.com/bdhirsh, https://github.com/malfet
ghstack dependencies: #136533
2024-09-25 17:44:15 +00:00
9409274bc1 Fix bug in functional tensor decomp (#136600)
Summary: Previously we had a very bad bug where we didn't allow any decomp on CIA. This never mattered before because we never had to actually push the CIA decomp to the Python key level in export.

Test Plan: CI

Differential Revision: D63363749

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136600
Approved by: https://github.com/bdhirsh
2024-09-25 17:37:50 +00:00
5d7ed02f52 [user-written triton kernels] specialize exprs if they are expected to be tl.constexpr (#136512)
Fixes #136504

If you have a tl.constexpr parameter to a triton kernel, and you pass in a SymNode, then, right now, you run into failures (see under 'constants'):

```
  File "/tmp/torchinductor_dberard/na/cnax67r5zmslz7bvdfizteaepj7fajpjallb3bu2gyetjcdqtbzj.py", line 14, in <module>
    triton_meta={'signature': {0: '*fp32', 1: '*fp32'}, 'device': DeviceProperties(type='cuda', index=0, cc=90, major=9, regs_per_multiprocessor=65536, max_threads_per_multi_processor=2048, multi_processor_count=132, warp_size=32), 'constants': {2: s0, 3: 256}, 'configs': [AttrsDescriptor(divisible_by_16=(0, 1), equal_to_1=())]},
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
NameError: name 's0' is not defined
```

To fix this, we specialize on the value during dynamo tracing, so that we have a real integer when we do codegen.

Alternatives: specialize somewhere else (e.g. inductor); or figure out how to actually pass the value dynamically into the user-written kernel. However, if we try to pass a dynamic value, then we wouldn't be able to precompile the triton kernels in inductor or use AOTI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136512
Approved by: https://github.com/oulgen, https://github.com/jansel, https://github.com/eellison
2024-09-25 17:12:11 +00:00
7c6d543a5b [export] fix _get_non_persistent_buffers for duplicates (#136552)
Summary: Export's method _get_non_persistent_buffers doesn't check duplicate submodules, so we run into state_dict related issues if non-persistent buffers exist on shared submodules.

Test Plan: test_export

Differential Revision: D63332976

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136552
Approved by: https://github.com/avikchaudhuri, https://github.com/tugsbayasgalan
2024-09-25 16:46:31 +00:00
aa80b82cea [hygiene] Delete dead alerting code (#136583)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136583
Approved by: https://github.com/clee2000
2024-09-25 15:44:46 +00:00
0232278b33 Fix comment posting permissions for check-labels.yml (#136610)
Currently it fails with

Error fetching https://api.github.com/repos/pytorch/pytorch/issues/136607/comments HTTP Error 403: Forbidden

(see https://github.com/pytorch/pytorch/actions/runs/11026434368/job/30622960113?pr=136607)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136610
Approved by: https://github.com/malfet
2024-09-25 15:43:19 +00:00
34711fe8c9 Fix test_skip_data_serialization pickle exception match (#136617)
The test is failing in trunk atm with the following error:

```
test_serialization.py::TestSerialization::test_skip_data_serialization_materialize_fake_False - AssertionError: "Can't pickle local object 'WeakValueDictionary.__init__.<locals>.remove'" does not match "Can't get local object 'WeakValueDictionary.__init__.<locals>.remove'"
```

for example, 36f0e61166

This comes from this cpython commit a3076c734d, and manifests in python 3.12.5 currently used in CI.  The failure doesn't happen when I try it out with 3.12.3 and 3.12.4.  Looking at the commit logs of https://github.com/python/cpython/commits/main/Lib/pickle.py, it looks like the exception message is changing back and forth, so I guess a regex match would capture both.
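A small sketch of the regex idea, using only the two message variants quoted above; the pattern name is hypothetical and the exact expression used in the test may differ.

```python
import re

# Accept either wording of the pickle error for local objects.
PICKLE_LOCAL_OBJECT_ERROR = re.compile(r"Can't (get|pickle) local object")

messages = [
    "Can't pickle local object 'WeakValueDictionary.__init__.<locals>.remove'",
    "Can't get local object 'WeakValueDictionary.__init__.<locals>.remove'",
]

for message in messages:
    print(bool(PICKLE_LOCAL_OBJECT_ERROR.search(message)))
```

The same pattern string can be passed to `unittest`'s `assertRaisesRegex`, which uses `re.search` semantics.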
2024-09-25 08:35:46 -07:00
deb820602a viable/strict update: log push to s3 (#136470)
As stated in https://github.com/pytorch/test-infra/pull/5686, I cannot figure out a way to determine the push time from webhooks (other than when the webhook was sent, but that isn't super accurate either).  Instead, manually save a json file to s3 that contains information for the sha and date so that we can still get this information.

Relies on https://github.com/pytorch/test-infra/pull/5690

tested in https://github.com/pytorch/pytorch/pull/136387 (but I squashed so it's kinda hard to find now)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136470
Approved by: https://github.com/huydhn
2024-09-25 15:28:53 +00:00
e3b89ca124 Revert "Add deterministic path for CUDA cumsum (#136224)"
This reverts commit b1a02bf70824a4802411ddd5be1d3610e7a2e269.

Reverted https://github.com/pytorch/pytorch/pull/136224 on behalf of https://github.com/ezyang due to Failing internal CI ([comment](https://github.com/pytorch/pytorch/pull/136224#issuecomment-2374201626))
2024-09-25 14:11:01 +00:00
20a855bf01 [AOTI] Move stack_allocation logic from PythonWrapperCodegen (#136463)
Summary: Move stack_allocation logic from PythonWrapperCodegen to CppWrapperCpuArrayRef

Differential Revision: [D63319970](https://our.internmc.facebook.com/intern/diff/D63319970)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136463
Approved by: https://github.com/chenyang78
ghstack dependencies: #136062, #136461, #136462
2024-09-25 14:06:33 +00:00
5171b0e3c6 Revert "[ONNX] Remove the operators test (#136335)"
This reverts commit 9629835b1ccce8e72fc93bf95be13e3d53cb4871.

Reverted https://github.com/pytorch/pytorch/pull/136335 on behalf of https://github.com/ezyang due to I'll reland this, bear with me ([comment](https://github.com/pytorch/pytorch/pull/136335#issuecomment-2374183435))
2024-09-25 14:06:03 +00:00
070952aca5 [AOTI] Move stack_allocation logic from CppWrapperCpu (#136462)
Summary: Move stack_allocation logic from CppWrapperCpu to CppWrapperCpuArrayRef

Differential Revision: [D63300359](https://our.internmc.facebook.com/intern/diff/D63300359)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136462
Approved by: https://github.com/chenyang78
ghstack dependencies: #136062, #136461
2024-09-25 14:03:03 +00:00
5ad5f40283 [AOTI][reland] Create another wrapper class to handle ArrayRef (#136461)
Summary: Create another wrapper codegen class to handle ArrayRef for CPU. The goal is to simplify the regular cpp wrapper codegen logic and the generated cpp code.

Test Plan: CI

Differential Revision: [D63300361](https://our.internmc.facebook.com/intern/diff/D63300361)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136461
Approved by: https://github.com/angelayi, https://github.com/chenyang78
ghstack dependencies: #136062
2024-09-25 14:00:09 +00:00
25ab87c09b Add lint rule META_NO_CREATE_UNBACKED (#135870)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135870
Approved by: https://github.com/albanD
2024-09-25 13:33:56 +00:00
dd4a51b39a Fix constant propagation in builtins and UserClasses (#131354)
* Fixes https://github.com/pytorch/pytorch/issues/118675
* Replaces https://github.com/pytorch/pytorch/pull/118994

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131354
Approved by: https://github.com/jansel, https://github.com/anijain2305
2024-09-25 13:03:40 +00:00
a0c76ea853 Make test_skip_data_serialization regex more flexible (#136580)
Some CI machines seem to throw "Can't get local object" rather than
"Can't pickle local object".
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136580
Approved by: https://github.com/mikaylagawarecki
2024-09-25 11:27:23 +00:00
370c1c4297 [aotd] Fix rrelu compilation (#136008)
Issues:
https://github.com/pytorch/pytorch/issues/135083
https://github.com/pytorch/pytorch/issues/120292

The rrelu decomposition contains a mutation (copy_). Decompositions are executed below Functionalization, so AOT produces a non-functional graph.

Also, that decomposition is registered as a python_dispatch kernel for AutogradCUDA.
Autograd dispatch happens above Functionalization, so registering it for Autograd (to handle all backends) makes Functionalization run after it.

Testing:
```
python test/functorch/test_aotdispatch.py -k test_rrelu
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136008
Approved by: https://github.com/bdhirsh
2024-09-25 11:26:19 +00:00
c3fdf587b5 [inductor] [cpp] fix the check of template_buffer_has_other_users if no epilogue_nodes (#136518)
The `template_buffer_has_other_users` function checks the case where there are epilogue nodes and the template output has users other than these epilogue nodes. When there are no epilogue nodes, the function can return `False` directly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136518
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
ghstack dependencies: #136418
2024-09-25 10:25:07 +00:00
cabfbef6cf [pytorch][PR] [inductor] More fixes on the keys of constants and signature dictionaries (#136514)
Summary: The previous PR forgot to change two other places that also create `constants` and `signature`.

Test Plan:
Imported from GitHub, without a `Test Plan:` line.

Differential Revision: D63027728

Pulled By: Myrthan

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136514
Approved by: https://github.com/jansel

Co-authored-by: Jokeren <robinho364@gmail.com>
2024-09-25 09:34:14 +00:00
2e30c160ef [inductor] [cpp] fix max-autotune for single-thread dynamic shapes (#136418)
Fixes the compilation error of max-autotune for `maml_omniglot` (AMP and FP32) and `soft_actor_critic` (AMP) in Torchbench for single-thread dynamic shapes case:

```
/tmp/torchinductor_user/uv/cuvq6wenwp7us423onuvntkfx4cspmagha5beiknob7tiebzhupa.cpp: In function ‘void kernel(const bfloat16*, const bfloat16*, const bfloat16*, bfloat16*, int64_t)’:
/tmp/torchinductor_user/uv/cuvq6wenwp7us423onuvntkfx4cspmagha5beiknob7tiebzhupa.cpp:279:41: error: the value of ‘Mr_blocks’ is not usable in a constant expression
  279 |         constexpr int64_t m_block_end = Mr_blocks;
      |                                         ^~~~~~~~~
/tmp/torchinductor_user/uv/cuvq6wenwp7us423onuvntkfx4cspmagha5beiknob7tiebzhupa.cpp:237:19: note: ‘Mr_blocks’ was not initialized with a constant expression
  237 |     const int64_t Mr_blocks = (M + Mr - 1) / Mr;
      |                   ^~~~~~~~~
```

The PR also updates the UT to add a test for `BS`=512 in the single-thread case.
The previous case has `BS`=1024, equal to the `K` and `N` values. The generated code then has no symbolic shapes and thus fails to capture the above issue.
By adding a case with `BS`=512, the generated code has a symbolic shape for the M dim and is able to reproduce the issue that this PR addresses.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136418
Approved by: https://github.com/jgong5
2024-09-25 09:24:05 +00:00
a0a1873148 [Inductor] Fix Triton tests after updating pybind11 to 2.13.6 (#136280)
https://github.com/pytorch/pytorch/pull/136087 updated pybind11 to 2.13.6, and that release adds a feature exposed through [a new function](https://pybind11.readthedocs.io/en/latest/changelog.html#version-2-13-6-september-13-2024) `_pybind11_conduit_v1_`. The presence of this function breaks the serialization mechanisms used by Triton and in PyTorch itself.

Possible errors that have been noticed due to this change:

<details>
<summary> the first error </summary>

```bash
_________ KernelTests.test_layout_constraint_needs_fixed_stride_order __________
Traceback (most recent call last):
  File "/runner/_work/intel-xpu-backend-for-triton/intel-xpu-backend-for-triton/pytorch/test/inductor/test_triton_kernels.py", line 1072, in test_layout_constraint_needs_fixed_stride_order
    eager_out = f(x)
  File "/runner/_work/intel-xpu-backend-for-triton/intel-xpu-backend-for-triton/pytorch/test/inductor/test_triton_kernels.py", line 1068, in f
    arange_out(x, y)
  File "/runner/_work/intel-xpu-backend-for-triton/intel-xpu-backend-for-triton/pytorch/test/inductor/test_triton_kernels.py", line 1059, in arange_out
    kernel[grid](x, out, n_elements, BLOCK_SIZE=4)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/triton/runtime/jit.py", line 330, in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/triton/runtime/jit.py", line 657, in run
    kernel = self.compile(
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/triton/compiler/compiler.py", line 315, in compile
    metadata_group[metadata_filename] = fn_cache_manager.put(json.dumps(metadata, default=vars), metadata_filename,
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/json/__init__.py", line 234, in dumps
    return cls(
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/json/encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/json/encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
TypeError: vars() argument must have __dict__ attribute
```
</details>

<details>
<summary> the second error </summary>

```bash
________________ TestTritonWrapper.test_wrapper_using_gpu_seed _________________
Traceback (most recent call last):
  File "/cache/pytorch-c5e9d03a2da4b93481737594cbe2f5931fa569aa833f206a638189cad2c36d3c-11/test/inductor/test_triton_wrapper.py", line 40, in test_wrapper_using_gpu_seed
    out = f(x, y)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py", line 465, in _fn
    return fn(*args, **kwargs)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 1292, in __call__
    return self._torchdynamo_orig_callable(
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 1087, in __call__
    result = self._inner_convert(
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 530, in __call__
    return _compile(
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 933, in _compile
    guarded_code = compile_inner(code, one_graph, hooks, transform)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 675, in compile_inner
    return _compile_inner(code, one_graph, hooks, transform)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_utils_internal.py", line 87, in wrapper_function
    return function(*args, **kwargs)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 708, in _compile_inner
    out_code = transform_code_object(code, transform)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/bytecode_transformation.py", line 1322, in transform_code_object
    transformations(instructions, code_options)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 220, in _fn
    return fn(*args, **kwargs)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 643, in transform
    tracer.run()
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 2776, in run
    super().run()
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 979, in run
    while self.step():
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 891, in step
    self.dispatch_table[inst.opcode](self, inst)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 2967, in RETURN_VALUE
    self._return(inst)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 2952, in _return
    self.output.compile_subgraph(
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/output_graph.py", line 1117, in compile_subgraph
    self.compile_and_call_fx_graph(tx, list(reversed(stack_values)), root)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/output_graph.py", line 1369, in compile_and_call_fx_graph
    compiled_fn = self.call_user_compiler(gm)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/output_graph.py", line 1416, in call_user_compiler
    return self._call_user_compiler(gm)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/output_graph.py", line 1465, in _call_user_compiler
    raise BackendCompilerFailed(self.compiler_fn, e).with_traceback(
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/output_graph.py", line 1446, in _call_user_compiler
    compiled_fn = compiler_fn(gm, self.example_inputs())
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/repro/after_dynamo.py", line 130, in __call__
    compiled_gm = compiler_fn(gm, example_inputs)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/__init__.py", line 2235, in __call__
    return compile_fx(model_, inputs_, config_patches=self.config)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 1528, in compile_fx
    return aot_autograd(
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/backends/common.py", line 72, in __call__
    cg = aot_module_simplified(gm, example_inputs, **self.kwargs)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_functorch/aot_autograd.py", line 1071, in aot_module_simplified
    compiled_fn = dispatch_and_compile()
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_functorch/aot_autograd.py", line 1056, in dispatch_and_compile
    compiled_fn, _ = create_aot_dispatcher_function(
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_functorch/aot_autograd.py", line 522, in create_aot_dispatcher_function
    return _create_aot_dispatcher_function(
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_functorch/aot_autograd.py", line 759, in _create_aot_dispatcher_function
    compiled_fn, fw_metadata = compiler_fn(
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py", line 179, in aot_dispatch_base
    compiled_fw = compiler(fw_module, updated_flat_args)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 1357, in fw_compiler_base
    return _fw_compiler_base(model, example_inputs, is_inference)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 1428, in _fw_compiler_base
    return inner_compile(
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 479, in compile_fx_inner
    return wrap_compiler_debug(_compile_fx_inner, compiler_name="inductor")(
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/repro/after_aot.py", line 85, in debug_wrapper
    inner_compiled_fn = compiler_fn(gm, example_inputs)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 665, in _compile_fx_inner
    compiled_graph = FxGraphCache.load(
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/codecache.py", line 1341, in load
    compiled_graph = compile_fx_fn(
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 574, in codegen_and_compile
    compiled_graph = fx_codegen_and_compile(gm, example_inputs, **fx_kwargs)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 882, in fx_codegen_and_compile
    compiled_fn = graph.compile_to_fn()
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/graph.py", line 1952, in compile_to_fn
    return self.compile_to_module().call
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/graph.py", line 1878, in compile_to_module
    return self._compile_to_module()
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/graph.py", line 1906, in _compile_to_module
    mod = PyCodeCache.load_by_key_path(
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/codecache.py", line 2866, in load_by_key_path
    mod = _reload_python_module(key, path)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/runtime/compile_tasks.py", line 45, in _reload_python_module
    exec(code, mod.__dict__, mod.__dict__)
  File "/tmp/tmps59zkbew/kg/ckgkb4gt5fs5pll4o7fqawppsmdezu5h52cq6nmrvi3yy6j7ddq4.py", line 45, in <module>
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/async_compile.py", line 198, in triton
    kernel = TritonCodeCache.load(kernel_name, source_code)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/codecache.py", line 2916, in load
    return _module_to_triton_kernel(PyCodeCache.load(source_code), kernel_name)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/codecache.py", line 2853, in load
    return cls.load_by_key_path(key, path, linemap, attrs)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/codecache.py", line 2866, in load_by_key_path
    mod = _reload_python_module(key, path)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/runtime/compile_tasks.py", line 39, in _reload_python_module
    raise RuntimeError(
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
RuntimeError: Failed to import /tmp/tmps59zkbew/g3/cg3zgxsidsjhdlz2lzvajvubdq6kg2x2hzd2kznfj43qwvlv33du.py
SyntaxError: invalid syntax (cg3zgxsidsjhdlz2lzvajvubdq6kg2x2hzd2kznfj43qwvlv33du.py, line 14)
```
</details>
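The first traceback ends in `json.dumps(metadata, default=vars)` raising `TypeError`. The sketch below reproduces only that failure mode with a stand-in class that has no `__dict__`; it does not involve pybind11, and `NoDict` and `safe_default` are hypothetical names that are not the fix applied by this PR.

```python
import json


class NoDict:
    """Stand-in for an object without a __dict__ (hypothetical)."""

    __slots__ = ("value",)

    def __init__(self, value):
        self.value = value


def safe_default(obj):
    """Use vars() when possible, otherwise fall back to the type name."""
    try:
        return vars(obj)
    except TypeError:
        return type(obj).__name__


metadata = {"kernel": "arange_out", "extra": NoDict(1)}

try:
    json.dumps(metadata, default=vars)
except TypeError as exc:
    # vars() cannot handle objects that lack a __dict__.
    print("default=vars failed:", exc)

print(json.dumps(metadata, default=safe_default))
```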

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136280
Approved by: https://github.com/etaf, https://github.com/jansel, https://github.com/EikanWang

Co-authored-by: Henry Schreiner <HenrySchreinerIII@gmail.com>
2024-09-25 08:09:46 +00:00
1cb265fafa [AILab][attempt2] Add TryExcept when decoding healthcheck port (#136574)
Summary:
## Context
The first attempt had a lint error in OSS: https://hud.pytorch.org/pr/pytorch/pytorch/136438#30553902641
## This Diff
Fix the error below with a try/except.
Error Message:
```
  File "/packages/aps_models.examples.dlrm.lite/dlrm_train_app-inplace#link-tree/torch/distributed/elastic/agent/server/local_elastic_agent.py", line 224, in _setup_healthcheck
    port=int(healthcheck_port),
ValueError: invalid literal for int() with base 10: \'%port.thrift%\'
```
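
A minimal sketch of the guard this diff describes, assuming the port arrives as a string that may still hold an unexpanded placeholder such as `%port.thrift%`. The helper name `parse_healthcheck_port` and the fallback to `None` are illustrative assumptions, not the code in `local_elastic_agent.py`.

```python
from typing import Optional


def parse_healthcheck_port(raw: Optional[str]) -> Optional[int]:
    """Return the port as an int, or None when it cannot be decoded.

    Hypothetical helper: the name and the None fallback are assumptions.
    """
    if raw is None:
        return None
    try:
        return int(raw)
    except ValueError:
        # An unexpanded template such as "%port.thrift%" lands here.
        return None
```

A caller would then skip healthcheck setup when the helper returns `None` instead of crashing on the `int()` conversion.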

Test Plan:
```
arc lint
```

Reviewed By: felixsu2006

Differential Revision: D63343041

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136574
Approved by: https://github.com/atalman
2024-09-25 04:43:51 +00:00
561cd5a0a6 [BE] Use C++17 convention methods in CUDA kernels (#136575)
- `std::is_same<X, Y>::value` -> `std::is_same_v<X, Y>`
- `std::enable_if<C, T>::type` -> `std::enable_if_t<C, T>`
- And so on

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136575
Approved by: https://github.com/Skylion007, https://github.com/eqy
2024-09-25 04:30:01 +00:00
5340feb8aa Disable iOS workflow (#136571)
See https://github.com/pytorch/pytorch/issues/136284
It's been broken for more than a week and it does not seem like anyone cares about fixing it.
Once it's landed I'll reassign the issue to `oncall: mobile`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136571
Approved by: https://github.com/huydhn, https://github.com/kit1980
2024-09-25 04:29:34 +00:00
1c9a1a2a19 [AOTI] Support MKL linear ops in cpp wrapper (#134974)
Summary: Similar to https://github.com/pytorch/pytorch/pull/134475, support mkl linear in the ABI-compatible mode for cpp-wrapper Inductor.

Differential Revision: [D63322202](https://our.internmc.facebook.com/intern/diff/D63322202)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134974
Approved by: https://github.com/chenyang78, https://github.com/leslie-fang-intel

Co-authored-by: leslie-fang-intel <leslie.fang@intel.com>
2024-09-25 03:53:11 +00:00
0200ad3457 Turn on unique kernel names (#136503)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136503
Approved by: https://github.com/ezyang, https://github.com/eellison
ghstack dependencies: #136509
2024-09-25 03:39:45 +00:00
482fe186b9 Add ROCm documentation to libtorch (C++) reST. (#136378)
Fixes #126640

Added ROCm support section to libtorch (C++) reST.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136378
Approved by: https://github.com/ezyang
2024-09-25 02:30:56 +00:00
3c7edf1ec0 [Inductor][CPP] Fix int8 cvt half (#136353)
Fix the correctness issue of https://github.com/pytorch/ao/pull/884/. The current implementation for converting between `Half/BFloat16` and `int8/uint8` incorrectly assumes that 1/4 of the int8/uint8 vector lane maps to 1/2 of the Half/BFloat16 vector lane. This assumption leads to accuracy issues after the full bit-width vectorization of the Half data type was introduced. When converting between int8 weights and the half data type, the generated code is as follows:
```
#include "/tmp/torchinductor_leslie/xw/cxww3s7wxrujoyxna7mlcjktid2uu6nntixqwm542xfkd756gl3x.h"
extern "C"  void kernel(const int8_t* in_ptr0,
                       half* out_ptr0)
{
    {
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(2048L); x0+=static_cast<int64_t>(32L))
        {
            auto tmp0 = at::vec::Vectorized<int8_t>::loadu(in_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(32));
            auto tmp1 = at::vec::convert<half>(tmp0);
            tmp1.store(out_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(32));
        }
    }
}
```

In this PR, we address the issue by changing the implementation to convert 1/2 of the int8/uint8 vector lane into a full vector lane of Half/BFloat16.
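
The lane arithmetic behind this change can be checked with a few lines of Python. This is only a sketch: the 512-bit vector width is an assumption (it matches the 32-element `half` load in the generated code above), and `lanes` is a hypothetical helper, not part of `at::vec`.

```python
def lanes(vector_bits: int, element_bits: int) -> int:
    """Number of elements of a given width that fit in one vector register."""
    return vector_bits // element_bits


VECTOR_BITS = 512  # assumption: AVX512-sized registers

int8_lanes = lanes(VECTOR_BITS, 8)    # 64 int8 elements per vector
half_lanes = lanes(VECTOR_BITS, 16)   # 32 Half/BFloat16 elements per vector

# Old assumption: 1/4 of the int8 vector feeds 1/2 of the Half vector.
old_elements = int8_lanes // 4        # 16 elements, only half a Half vector
# New mapping: 1/2 of the int8 vector fills a full Half vector.
new_elements = int8_lanes // 2        # 32 elements
```

Under this assumption the old mapping converts only 16 elements per step, while a full-width Half vector holds 32, which is the mismatch the PR describes.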

**TestPlan**
* AO: `python test/integration/test_integration.py -k test_int8_weight_only_quant_subclass_api`
* `python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_convert_int8_to_half_vec`
* Due to the CPP backend legalization pass, we are unable to create a unit test to simulate the conversion from `Half` to `int8`. Instead, we rely on a C++ test case.
  * `./build/bin/vec_test_all_types_AVX512 --gtest_filter="VecConvertTestsReducedFloat/*.ConvertReduced"`
  * `./build/bin/vec_test_all_types_AVX2 --gtest_filter="VecConvertTestsReducedFloat/*.ConvertReduced"`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136353
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
2024-09-25 02:23:43 +00:00
eqy
8225e7706e [CUDA][Expandable Segments] Account for non-gc'able memory in expandable segments tests (#136496)
Seems like some other tests are holding onto memory that is not gc'able (e.g., cuBLAS workspaces), so these tests, while working in isolation, fail when run as, e.g., `python test/test_cuda.py -k able`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136496
Approved by: https://github.com/ezyang
2024-09-25 01:14:45 +00:00
5233b5a448 Update PyTorch/XLA CI image to Python 3.10 (#135278)
The old image used Python 3.8. Corresponding XLA PR: https://github.com/pytorch/xla/pull/7953

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135278
Approved by: https://github.com/JackCaoG, https://github.com/atalman
2024-09-25 00:53:39 +00:00
eqy
670d64a802 [SDPA][Nested Tensor] Bump grad_query fudge factor for small GPUs (#135715)
Similar to #135711, here we see a ~1/1000 mismatch with absolute value ~0.0016 when 0.001 is allowed

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135715
Approved by: https://github.com/drisspg
2024-09-25 00:36:10 +00:00
8f2a4cc4b1 Tune bsr_dense_addmm for int8 inputs on A100 (#136088)
As in the title. The tuning is done for dimensions 1280 and 5120 that are used in Vit-H.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136088
Approved by: https://github.com/cpuhrsch
2024-09-25 00:24:12 +00:00
9629835b1c [ONNX] Remove the operators test (#136335)
The tests are obsolete and hard to maintain.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136335
Approved by: https://github.com/xadupre
2024-09-24 23:08:48 +00:00
b57d67e263 Add isuruf to core reviewers (#136554)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136554
Approved by: https://github.com/Skylion007, https://github.com/malfet
2024-09-24 23:06:46 +00:00
210b136c07 [export] Add experimental swap API (#136190)
Prototyped the following API, which takes in an ExportedProgram and a dictionary of fqn to modules to swap, and returns an (unlifted) GraphModule:
```
_swap_modules(
    ep: ExportedProgram, modules_to_swap: Dict[str, torch.nn.Module]
) -> torch.fx.GraphModule:
```

Differential Revision: [D62879819](https://our.internmc.facebook.com/intern/diff/D62879819)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136190
Approved by: https://github.com/avikchaudhuri
2024-09-24 22:50:44 +00:00
706eda5cd8 Revert "[RFC][torchelastic][c10d] Fix store prefix race in rendezvous (#135957)"
This reverts commit 5033a1ca0dd22dae34a8939add33dbebfe0fd31d.

Reverted https://github.com/pytorch/pytorch/pull/135957 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/135957#issuecomment-2372493186))
2024-09-24 22:24:26 +00:00
ae80bce496 [dynamo] refactor resume_execution.py to use bytecode templates (#136483)
Use bytecode from template instead of hardcoding bytecode in resume_execution.py. Gets rid of a lot of Python-version dependent bytecode generation. Also makes resume_execution.py easier to support in future Python version updates.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136483
Approved by: https://github.com/jansel, https://github.com/anijain2305
2024-09-24 22:20:26 +00:00
36f0e61166 [BE] Use nested namespace in ATen/native/cuda (#136570)
It's a nice C++17 feature
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136570
Approved by: https://github.com/Skylion007
2024-09-24 22:19:10 +00:00
1d3af68202 [ROCm] install_miopen.sh exit for ROCm >= 6.3 (#136436)
Follow up to #132555.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136436
Approved by: https://github.com/jithunnair-amd, https://github.com/pruthvistony, https://github.com/atalman
2024-09-24 22:15:26 +00:00
780f4debdb [ONNX] Remove _optimize_graph from public init (#136279)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136279
Approved by: https://github.com/xadupre
ghstack dependencies: #136281
2024-09-24 22:00:55 +00:00
00bc17555a Don't try to evaluate sympy.Eq in replacement; we knew this wouldn't simplify since we are here (#136533)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136533
Approved by: https://github.com/isuruf, https://github.com/pianpwk
2024-09-24 21:52:25 +00:00
b1a02bf708 Add deterministic path for CUDA cumsum (#136224)
Change `cumsum` to call its decomposition when `use_deterministic_algorithms(True)` and input is CUDA.

Fixes #89492

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136224
Approved by: https://github.com/ezyang, https://github.com/justinchuby
2024-09-24 21:34:43 +00:00
0133fbcfe7 Revert "Correctly convert Python float to float64 when passing argument as Tensor (#136413)"
This reverts commit f0f79dd8f1df6cf6342c9c23ae3a9be0f74eb9f5.

Reverted https://github.com/pytorch/pytorch/pull/136413 on behalf of https://github.com/ezyang due to forward fix is stuck, revert this ([comment](https://github.com/pytorch/pytorch/pull/136413#issuecomment-2372404873))
2024-09-24 21:20:37 +00:00
95c0f7493f [Inductor] Rename WrapperCodeGen to PythonWrapperCodegen (#136062)
Summary: Rename WrapperCodeGen to PythonWrapperCodegen to make its meaning more explicit.

Differential Revision: [D63300358](https://our.internmc.facebook.com/intern/diff/D63300358)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136062
Approved by: https://github.com/angelayi, https://github.com/chenyang78
2024-09-24 21:02:51 +00:00
da1560c49f [SymmetricMemory] add support for cuStreamWriteValue32 (#136488)
cuStreamWriteValue efficiently combines the issuing of a system-level fence with the update of a single memory location. It is highly suitable for inter-stream progress sharing (e.g., all_gather_with_progress).

Exposing it via SymmetricMemory allows users to more easily implement efficient progress-aware matmuls in triton ([xformers example](https://github.com/facebookresearch/xformers/blob/main/xformers/ops/_triton/sequence_parallel_fused_kernels.py)).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136488
Approved by: https://github.com/eqy, https://github.com/Chillee
2024-09-24 20:56:29 +00:00
7c777dd587 [ONNX] Unify ONNXProgram and remove the old one (#136281)
## Note

`test_fx_to_onnx_with_onnxruntime.py` is removed for now (it has a lot of xfails anyways). A better version will be added back.

Fixes https://github.com/pytorch/pytorch/issues/136274

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136281
Approved by: https://github.com/xadupre, https://github.com/albanD
2024-09-24 20:52:19 +00:00
dbc3356655 [pipelining] fix py ref cycle in stage_backward (#136507)
TLDR; found forward activation tensors were being kept alive "forever"
(or until GC ran), and tracked it down to a cycle involving
`stage_backward.<locals>.extract_tensors_with_grads`.

The reference cycle in question is below.  (constructed using gc.get_referrers after doing a gc.collect in gc debug mode)

tensor is kept alive by
`[(<class 'cell'>, '0x7f7360234400')]`

tuple of cell objects
`(<cell at 0x7f73602343d0: function object at 0x7f734fff0ee0>, <cell at 0x7f7360234400: list object at 0x7f734e4d9a80>, <cell at 0x7f73602a4190: list object at 0x7f734eff8b00>)`
is kept alive by
`[(<class 'function'>, '0x7f734fff0ee0')]`

`<function stage_backward.<locals>.extract_tensors_with_grads at 0x7f734fff0ee0>`
is kept alive by
`[(<class 'cell'>, '0x7f73602343d0')]`

Put into more plain terms,

```

def stage_backward(...):
    ...
    stage_output_tensors = []

    # a cell object will exist that contains the variables defined in stage_backward and used by
    # both stage_backward and nested functions
    # in this case, the cell object contains 'stage_output_tensors' but

    # this function object will hold a reference to a 'cell' that contains any vars from
    # the parent scope not explicitly passed into the function as args.
    def extract_tensors_with_grads(...):
        ...
            # extract_tensors_with_grads refers to stage_output_tensors, so stage_output_tensors
            # is in the cell
            stage_output_tensors.append(output_val)
        ...
            # but extract_tensors_with_grads ALSO refers to itself (extract_tensors_with_grads),
            # so `extract_tensors_with_grads` will be in the cell
            extract_tensors_with_grads(...)
```
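
The cycle can be reproduced without PyTorch. The sketch below is illustrative only: `Payload` stands in for an activation tensor, all function names are hypothetical, and passing the function to itself as an argument is just one way to break the cycle, not necessarily the fix applied in this PR.

```python
import gc
import weakref


class Payload:
    """Stand-in for a large activation tensor (hypothetical)."""


def leaky():
    captured = []

    def visit(depth):
        captured.append(Payload())
        if depth > 0:
            # Referring to `visit` by name puts the function in its own
            # closure: function -> closure cells -> function.
            visit(depth - 1)

    visit(1)
    return weakref.ref(captured[0])


def cycle_free():
    captured = []

    def visit(depth, recurse):
        captured.append(Payload())
        if depth > 0:
            # The callee is passed in as an argument, so `visit` is not
            # captured in its own closure and no cycle is formed.
            recurse(depth - 1, recurse)

    visit(1, visit)
    return weakref.ref(captured[0])
```

With the garbage collector disabled, the weak reference returned by `leaky()` stays alive until `gc.collect()` runs, while the one returned by `cycle_free()` is already dead on return (CPython reference counting assumed).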

More debug details:
https://docs.google.com/document/d/1QPH1Lz0tnieIFPM2tyHrjVB-bjlnHuDgjx1p2am3cmE/edit?usp=sharing

In pdb:
```
gc.collect()
g = gc.garbage
g[-1]
[rank0]:(Pdb) [rank0]:<function
stage_backward.<locals>.extract_tensors_with_grads at 0x7fee5c3392d0>
g[-2]
[rank0]:(Pdb) [rank0]:(<cell at 0x7fee7abbcf40: function object at
0x7fee5c3392d0>, <cell at 0x7fee7abbcf70: list object at
0x7fee7ab68940>, <cell at 0x7fee5c3210c0: list object at 0x7fee5e1
d6340>)
g[-3]
[rank0]:(Pdb) [rank0]:[tensor([[[-4.1127e-06, -3.3826e-06,  2.6226e-06,
...,  6.4969e-06,
[rank0]:          -4.4405e-06, -4.7684e-06],
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136507
Approved by: https://github.com/awgu, https://github.com/kwen2501
2024-09-24 20:46:37 +00:00
7ff8e66140 Fix flexattention sympy expr printer issue (#136509)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136509
Approved by: https://github.com/yanboliang
2024-09-24 20:10:29 +00:00
02ef5dd327 [inductor][test] Check if mkl dnn bf16 is supported when using bf16 (#136290)
Sometimes the test is run on an older CPU, e.g. Intel(R) Xeon(R) CPU E5-2680 v4. If we inspect its `lscpu` output, the flags do not include `avx512_bf16`. That probably means bf16 is not supported on such hardware, and hence the unit test can fail. So we add the check in the code.

Context: https://github.com/pytorch/pytorch/pull/135038

Differential Revision: D62984129

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136290
Approved by: https://github.com/XuehaiPan, https://github.com/chenyang78
2024-09-24 19:32:48 +00:00
888744bd36 NJT binary pointwise broadcasting support via jagged <-> padded dense conversion (#133021)
Related: #132695

This PR uses padded dense <-> jagged conversions to handle binary pointwise broadcasting of (NT, T) and (T, NT). This includes:
* `(B, j0, D) + (1, 1, 1)`
* `(B, j0, D) + (B, 1, 1)`
* `(B, j0, D) + (B, 1, D)`
* etc.

This PR also adds (hacky) support for bool inputs to the jagged <-> padded dense conversions. The underlying CUDA kernels do not support integer / bool inputs; so the following workaround is employed: `convert input -> half, run conversion kernel, convert output -> bool`. Note that this bool support is needed specifically for the backward formula of `fmax`, and likely others.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133021
Approved by: https://github.com/cpuhrsch
2024-09-24 19:11:49 +00:00
8ecc5f1a8f [TorchScript][tensorexpr] imbue locale for IRPrinter (#136458)
We had an internal report where the NNC-generated CUDA code had thousands separators in integer literals. Although I wasn't able to cleanly repro, I did come up with a hacky repro and verified that this fix works (see #136459).

Differential Revision: [D63278771](https://our.internmc.facebook.com/intern/diff/D63278771)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136458
Approved by: https://github.com/eellison
2024-09-24 19:00:57 +00:00
c6192f32f1 [MPS] Add upsample_bicubic2d as Metal op (#136123)
More or less literal copy-n-paste of c33b0580e6/aten/src/ATen/native/cuda/UpSampleBicubic2d.cu (L24)
and
c33b0580e6/aten/src/ATen/native/cuda/UpSampleBicubic2d.cu (L99)
Missing `uint8` implementation mimics CUDA behavior
Initial version coded live in https://www.youtube.com/watch?v=shi6Kb5xxvk
Later refinements:
 - Switch from 2D dispatch to 1D one (to match CUDA behavior)
 - Added batch + channel loops
 - Fixed scale computation to match align corners behavior
 - Added backward implementation

The backward implementation again mimics CUDA, so it has precision issues for `torch.half`, as well as a somewhat slow simulation of atomic adds using atomic compare-and-exchange on the pair of adjacent values, i.e.
```metal
template <typename T>
static inline void atomic_add_helper(
    device atomic<int>* data,
    long offset,
    float value) {
  auto ptr = data + (offset >> 1);
  auto old = atomic_load_explicit(ptr, memory_order_relaxed);
  union {
    int i;
    T t[2];
  } val;
  do {
    val.i = old;
    val.t[offset & 1] += static_cast<T>(value);
  } while (!atomic_compare_exchange_weak_explicit(
      ptr, &old, val.i, memory_order_relaxed, memory_order_relaxed));
}
```
Bump basic Metal language version to 3.0, as it's supported on macOS 13 and that's the first version that has `atomic_float`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136123
Approved by: https://github.com/albanD
2024-09-24 18:58:11 +00:00
dacf0c4884 [dynamo] Do not treat user defined nn module attributes static for dynamic shape infra (#136516)
Fixes https://github.com/pytorch/pytorch/issues/136254

The regression was introduced in https://github.com/pytorch/pytorch/pull/132736, where we were originally trying to fix another regression. This PR and the offending PR together say: "treat user defined nn module attributes as automatic dynamic, but for cudagraphs they will be considered static". This avoids recompilations. This can lead to a cudagraph recording, which is ok. This also maintains the state from before the inline_inbuilt_nn_modules flag was introduced.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136516
Approved by: https://github.com/williamwen42
2024-09-24 18:26:12 +00:00
1028cedf71 [inductor] Enable parallel compile by default in fbcode (#136246)
Summary: Now that we have subprocess parallel compile on by default, we can change the internal compile_threads default to > 1 with a killswitch. Some jankiness so we can avoid evaluating the justknob at import.

Test Plan: Ran codecache tests with JK on, then canaried locally with JK off

Differential Revision: D62913998

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136246
Approved by: https://github.com/eellison
2024-09-24 18:10:01 +00:00
9abdc62065 Allow fx graph caching higher order operators (opt-in) (#135877)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135877
Approved by: https://github.com/zou3519
2024-09-24 17:23:09 +00:00
efed357ef5 Add dtypes support in opinfo for Intel Gaudi (#132840)
## Motivation
This follows up on the changes introduced in https://github.com/pytorch/pytorch/pull/128584.
We are adding the dtype information to be picked up while executing the UTs for Intel Gaudi/HPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132840
Approved by: https://github.com/albanD
2024-09-24 17:17:15 +00:00
064093a4d6 Revert "Increase update_hint_regression problem size to 1000 (#136434)"
This reverts commit 3116fbda0fcf9af0c3dfe1280fb7e05e30e6ad5f.

Reverted https://github.com/pytorch/pytorch/pull/136434 on behalf of https://github.com/ezyang due to whoops, this is too slow ([comment](https://github.com/pytorch/pytorch/pull/136434#issuecomment-2371847842))
2024-09-24 17:05:20 +00:00
ebfcbe0822 Move print_export_warning so lru_cache works (#136491)
Summary:
as title

move print_export_warning() out of the function so `lru_cache` actually works
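
Why the move matters can be seen in a small sketch. The function names below are hypothetical and the warning body is a placeholder; only the `functools.lru_cache` behavior is the point: a cached function defined inside another function is rebuilt, with an empty cache, on every call.

```python
from functools import lru_cache

emitted = []  # records every time a warning body actually runs


def warn_nested(message):
    # The decorated function is re-created on each call, so its cache
    # never gets a hit and the warning fires every time.
    @lru_cache(None)
    def _warn(msg):
        emitted.append(("nested", msg))

    _warn(message)


@lru_cache(None)
def _warn_once(msg):
    # Defined once at module level, so the cache persists across calls.
    emitted.append(("module", msg))


def warn_module_level(message):
    _warn_once(message)
```

Calling `warn_nested("x")` twice records two entries, while calling `warn_module_level("x")` twice records one.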

Test Plan: CI

Differential Revision: D63297083

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136491
Approved by: https://github.com/pianpwk
2024-09-24 16:52:22 +00:00
44ec706789 add tolerance changes for test_sdpa_autocast in test_nestedtensor.py (#136485)
Upstreaming minor unit test fix from nvidia internal CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136485
Approved by: https://github.com/soulitzer
2024-09-24 16:31:32 +00:00
eac04fe72a Increase bf32 tolerances for some cdist tests in test_torch (#136315)
- Set the new tolerances to ~= N * eps(bfloat16), where N is the inner dimension of the matmul. This should be a comfortable upper bound for the tolerances.

Logic behind choice of tolerance:

The maximum error of the summation of a series of N numbers in bfloat16 should be `N * epsilon(bfloat16)`. I confirmed by sampling different random seeds that the maximum observed error doesn't exceed this value and is usually much less.
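
As a quick arithmetic sketch of the bound, assuming `epsilon(bfloat16)` means machine epsilon (`2**-7`, the value reported by `torch.finfo(torch.bfloat16).eps`): the helper name and the example `n` are hypothetical, since the inner dimension used by the tests is not stated here.

```python
# bfloat16 stores 7 explicit mantissa bits, so machine epsilon is 2**-7.
BFLOAT16_EPS = 2.0 ** -7


def summation_tolerance(n: int, eps: float = BFLOAT16_EPS) -> float:
    """Upper bound N * eps on the error of summing n values (hypothetical helper)."""
    return n * eps
```

For example, `summation_tolerance(4)` gives `0.03125`.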

Fixes test failures on Arm® Neoverse™ V1 ( not raised as an issue as this hardware type is not currently covered by linux-aarch64 workflow )

```
Traceback (most recent call last):
  File "/var/lib/jenkins/workspace/test/test_torch.py", line 2478, in test_cdist_large
    self.assertEqual(expected, actual)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3885, in assertEqual
    raise error_metas.pop()[0].to_error(
AssertionError: Tensor-likes are not close!

Mismatched elements: 134118 / 1000000 (13.4%)
Greatest absolute difference: 0.03829193115234375 at index (291, 726) (up to 0.005 allowed)
Greatest relative difference: 0.03519868478178978 at index (291, 726) (up to 1.3e-06 allowed)
```

@malfet @jondea

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136315
Approved by: https://github.com/albanD
2024-09-24 16:10:11 +00:00
0b667c073e Disable compiled autograd for re-entrant autograd (#135795)
Fixes #135298

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135795
Approved by: https://github.com/xmfan
2024-09-24 15:09:16 +00:00
33e10803c8 Fix ut in internal distributed_test.py (#136251)
The test case **test_new_subgroups_by_enumeration_input_rank_exceeds_world_size** failed for me, and it passes with this small change. The expected exception is supposed to be "ValueError" rather than "RuntimeError" according to the [code](https://github.com/pytorch/pytorch/blob/v2.4.1/torch/distributed/distributed_c10d.py#L4190).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136251
Approved by: https://github.com/kwen2501
2024-09-24 15:06:20 +00:00
58274e4655 Remove onnx imports in dynamo (#136334)
Remove imports of the ``torch.onnx.operators`` module in dynamo. Since ONNX depends on dynamo, this import line causes a circular dependency. Judging from the source they are not actually needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136334
Approved by: https://github.com/xadupre, https://github.com/jansel, https://github.com/titaiwangms
2024-09-24 14:54:23 +00:00
2a178a6982 Avoid changing FTZ/DAZ flags in CPP builder (#136466)
Fixes https://github.com/pytorch/pytorch/issues/136273
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136466
Approved by: https://github.com/ezyang
2024-09-24 14:39:17 +00:00
6300eb1dc7 tf32 off for test_noncontiguous_samples in test_ops.py (#136484)
Upstreaming minor unit test fix from nvidia internal CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136484
Approved by: https://github.com/soulitzer
2024-09-24 14:26:47 +00:00
47ebb5856e Make avoid_device_init() aware of hpu device (#136194)
Added hpu to devices handled by avoid_device_init() in FakeTensorMode.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136194
Approved by: https://github.com/eellison
2024-09-24 14:13:45 +00:00
54fc4f56ff [Docs fix] fix syntax error in docs :torch.blackman_window (#136354)
Fixes #ISSUE_NUMBER
https://pytorch.org/docs/stable/generated/torch.blackman_window.html

The error is in: `equal to torch.blackman_window(L + 1, periodic=False)[:-1]).`
The last `)` should be deleted.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136354
Approved by: https://github.com/soulitzer
2024-09-24 14:00:26 +00:00
9fc721d22b Add cache logs + other minor caching cleanup (#136456)
Summary:
- Added TORCH_LOGS=cache to dump cache stats on exit - supported by RemoteCache.
- Split REMOTE_CACHE_VERSION - it was used for both JKs fx_graph_memcache_version and autotune_memcache_version but they really should be separate (just in case we need to change one but not the other)
- Prepare `_ManifoldCache` for use with other subpath keys
- Move create_cache to be more public and use it in codecache
- Add _InductorMetaTy alias (still just a dict)
- Cleaned up some common cached_autotune calls in triton_heuristics

Test Plan: unit tests

Reviewed By: oulgen

Differential Revision: D62648249

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136456
Approved by: https://github.com/oulgen
2024-09-24 14:00:23 +00:00
342c031f0e [aotd] Fix freezing API for subclasses (#136265)
Original issue:
https://github.com/pytorch/ao/issues/890

The problem:

`TracingContext.flat_params` contains the original params, with subclasses not yet desugared,
while the `inductor.freezing` API works on AOT graphs, in which subclasses are already desugared.

`flat_params` is used only for this logic, so storing desugared subclasses in it fixes the issue.

Testing:
```
python test/functorch/test_aotdispatch.py -k test_inductor_freezing_with_subclasses
```
Torch AO original failure:
```
python test/integration/test_integration.py -k test_int8_weight_only_quant_with_freeze
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136265
Approved by: https://github.com/bdhirsh
2024-09-24 13:15:01 +00:00
cyy
f048569c24 [Distributed] [11/N] Fix clang-tidy warnings in torch/csrc/distributed/ (#136439)
Follows #131671

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136439
Approved by: https://github.com/kwen2501
2024-09-24 13:05:15 +00:00
538ee7bf60 Revert "Fix tensor.data_ptr() representation overflow (#135567)"
This reverts commit 2e8d431a8fbfdbdb07448195f16afa9e101188ac.

Reverted https://github.com/pytorch/pytorch/pull/135567 on behalf of https://github.com/etaf due to Block XPU, let's re-land with triton update. ([comment](https://github.com/pytorch/pytorch/pull/135567#issuecomment-2371200549))
2024-09-24 12:59:14 +00:00
32727b9859 Add types to _dynamo/testing.py (#136402)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136402
Approved by: https://github.com/jansel
2024-09-24 10:23:54 +00:00
73c10a04f6 [dynamo][easy] support sys.intern (#136081)
Closes #134023

- #134023

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136081
Approved by: https://github.com/anijain2305
2024-09-24 09:12:34 +00:00
1266be21f4 deprecated datetime.utcnow() fix and _RendezvousJoinOp module initiation bug fix (#136141)
Fix to #136140
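
`datetime.utcnow()` is deprecated as of Python 3.12 and returns a naive datetime. A common replacement, shown here as a sketch and not necessarily the exact change made in this PR, is to request a timezone-aware value; `utc_now` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def utc_now() -> datetime:
    """Timezone-aware replacement for the deprecated datetime.utcnow()."""
    return datetime.now(timezone.utc)
```

Note that aware and naive datetimes cannot be compared or subtracted directly, so call sites that mix the two need to be updated together.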

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136141
Approved by: https://github.com/kwen2501
2024-09-24 07:26:10 +00:00
0a35986cdb Add option to configure reduced precision math backend for SDPA (#135964)
Summary: Address https://github.com/pytorch/pytorch/issues/135778 by adding a global flag to configure whether the math backend of SDPA uses high or low precision.

Test Plan: buck2 run mode/opt //scripts/feikou/llm:run_attn_kernels

Differential Revision: D62625515

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135964
Approved by: https://github.com/jbschlosser
2024-09-24 07:11:38 +00:00
44c871c34b [inductor] [cpp] add index check when fusing epilogue with GEMM template (#135661)
## Description
Fixes the accuracy failure of FP32 `jx_nest_base` of max-autotune.

The current epilogue fusion implementation in GEMM template assumes that the read of template buffer and the write of epilogue output in the epilogue node have the same index (the layout could be different but the index should be the same).

If the condition is not satisfied, the computation is wrong, leading to a correctness issue for FP32 `jx_nest_base`.

This PR disables the epilogue fusion with GEMM template when the above condition is not satisfied.

### Unsupported epilogue:
`buf1` is the template buffer and `buf2` is the epilogue output buffer.
The store of `buf2`:
401408 * d0 + 100352 * d1 + **7168 * d2** + **1792 * d3** + 128 * d4 + d5

The load of `buf1` in the epilogue node:
401408 * d0 + 100352 * d1 + **1792 * d2** + **25088 * d3** + 128 * d4 + d5

The above two indexes are different.

```
CppTemplateBuffer(name='buf1', layout=FixedLayout('cpu', torch.float32, size=[25088, 128], stride=[128, 1]))
ComputedBuffer(name='buf2', layout=FixedLayout('cpu', torch.float32, size=[8, 4, 14, 4, 14, 128], stride=[401408, 100352, 7168, 1792, 128, 1]), data=Pointwise(
  'cpu',
  torch.float32,
  def inner_fn(index):
      i0, i1, i2, i3, i4, i5 = index
      tmp0 = ops.load(arg5_1, i5 + 128 * i4 + 1792 * i2 + 25088 * i3 + 100352 * i1 + 401408 * i0)
      tmp1 = ops.load(buf0, i5 + 128 * i4 + 1792 * i2 + 25088 * i3 + 100352 * i1 + 401408 * i0)
      tmp2 = tmp0 + tmp1
      tmp3 = ops.load(buf1, i5 + 128 * i4 + 1792 * i2 + 25088 * i3 + 100352 * i1 + 401408 * i0)
      tmp4 = tmp2 + tmp3
      return tmp4
  ,
  ranges=[8, 4, 14, 4, 14, 128],
  origin_node=clone,
  origins=OrderedSet([clone])
))
```

### Supported epilogue:
`buf1` is the template buffer and `buf2` is the epilogue output buffer.
The store of `buf2`:
d0 + 576 * d1 + 32 * d2

The load of `buf1` in the epilogue node:
d0 + 576 * d1 + 32 * d2

The above two indexes are the same.

The layouts of `buf2` and `buf1` are different, though, which is handled by the reindexer:
`buf1`: `size=[324, 32], stride=[32, 1]`
`buf2`: `size=[1, 32, 18, 18], stride=[10368, 1, 576, 32]`

```
CppTemplateBuffer(name='buf1', layout=FixedLayout('cpu', torch.bfloat16, size=[324, 32], stride=[32, 1]))
ComputedBuffer(name='buf2', layout=FixedLayout('cpu', torch.bfloat16, size=[1, 32, 18, 18], stride=[10368, 1, 576, 32]), data=Pointwise(
  'cpu',
  torch.bfloat16,
  def inner_fn(index):
      _, i1, i2, i3 = index
      tmp0 = ops.load(buf1, i1 + 32 * i3 + 576 * i2)
      tmp1 = ops.to_dtype(tmp0, torch.float32, src_dtype=torch.bfloat16)
      tmp2 = ops.load(_frozen_param4, i1)
      tmp3 = tmp1 * tmp2
      tmp4 = ops.load(arg7_1, i1 + 32 * i3 + 576 * i2)
      tmp5 = tmp3 + tmp4
      tmp6 = ops.to_dtype(tmp5, torch.bfloat16, src_dtype=torch.float32)
      return tmp6
  ,
  ranges=[1, 32, 18, 18],
  origin_node=convert_element_type_4,
  origins=OrderedSet([add, mul, convert_element_type_4])
))
```
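
The two cases above can be sketched as a comparison of per-dimension index coefficients. This is a simplified illustration, not Inductor's implementation: the helper names are hypothetical, and the real check presumably operates on symbolic index expressions instead of plain tuples.

```python
def linear_index(coeffs, point):
    """Evaluate sum(coeff_k * d_k) for one iteration point."""
    return sum(c * d for c, d in zip(coeffs, point))


def can_fuse_epilogue(store_coeffs, load_coeffs):
    """Fusion is allowed only when the store and load indexes are identical."""
    return tuple(store_coeffs) == tuple(load_coeffs)


# Unsupported case: coefficients of d2 and d3 differ between store and load.
unsupported_store = (401408, 100352, 7168, 1792, 128, 1)
unsupported_load = (401408, 100352, 1792, 25088, 128, 1)

# Supported case: d0 + 576 * d1 + 32 * d2 on both sides.
supported_store = (1, 576, 32)
supported_load = (1, 576, 32)
```

For the unsupported pair, the point `(0, 0, 1, 0, 0, 0)` already maps to different offsets (7168 for the store, 1792 for the load), so the fused epilogue would read the wrong element of the template buffer.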

## TODO
Add support for fusions when the indexes are different in a follow-up PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135661
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
2024-09-24 05:25:28 +00:00
7283530db2 [ROCm][Inductor][CK] FP8 gemm (#136337)
At the moment, this lowers `torch._scaled_mm` with tensorwise scaling or rowwise scaling applied to both A and B.

We probably also want to support either combination of tensorwise and rowwise for A and B, as well as bias support

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136337
Approved by: https://github.com/chenyang78
2024-09-24 05:19:45 +00:00
7f98781f84 Fix autodeps from D62049222 that pyfmt broke (#136455)
Summary: `arc lint` changed the formatting which then caused autodeps to be confused.

Test Plan:
this passes:
```
arc lint --skip AUTODEPS
fbpython fbcode/tools/build/buck/linters/lint_autoformat.py --linter=autodeps --default-exec-timeout=1800 -- fbcode/caffe2/test/inductor/test_memory_planning.py
```

Differential Revision: D63277059

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136455
Approved by: https://github.com/bobrenjc93, https://github.com/oulgen
2024-09-24 05:06:12 +00:00
797c7e2802 [Quant][PT2E]change flatten recipe for X86InductorQuantizer (#136298)
This PR modifies the flatten recipe: if none of the users of the flatten node are quantizable ops, int8 flatten will be disabled to avoid unnecessary dtype conversions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136298
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
2024-09-24 04:30:12 +00:00
3be150653c [torch][ao] Add customizable loss function to NodeAccuracySummary (#136282)
Summary:
Add a customizable loss function callback to NodeAccuracySummary to
allow users to pass in their own loss function.

Also, fix some type errors and propagate better exception messages when
unexpected tensor comparisons occur. Finally, enhance the robustness of
`generate_numeric_debug_handle` in the case where it is called multiple
times on the same model, by avoiding reuse of the same IDs.

Test Plan: Added a test for this case in `test_numeric_debugger`.

Reviewed By: jerryzh168

Differential Revision: D62898297

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136282
Approved by: https://github.com/jerryzh168
2024-09-24 03:28:12 +00:00
e09c5b6046 Remove vt argument in raise_observed_exception (#136037)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136037
Approved by: https://github.com/zou3519
2024-09-24 02:36:57 +00:00
9372692c7b [FR] Make OSS fr_trace function available for internal script and improve pg filtering (#136473)
Differential Revision: [D63287384](https://our.internmc.facebook.com/intern/diff/D63287384/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136473
Approved by: https://github.com/c-p-i-o
2024-09-24 02:34:43 +00:00
4fd16dd8aa Clarify that libtorch API is C++17 compatible (#136471)
As it relies on some common C++17 primitives, such as `std::optional`
Replace all docs references from C++14 to C++17

Fixes https://github.com/pytorch/pytorch/issues/133205

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136471
Approved by: https://github.com/kit1980, https://github.com/atalman
2024-09-24 02:03:33 +00:00
e4d294221b [inductor] Log precompilation time (#136395)
This has been useful for diagnosing the long compile time issues I've seen in the Triton CPU backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136395
Approved by: https://github.com/eellison
2024-09-24 01:47:54 +00:00
802ba79121 Inherit all secrets to inductor workflow (#135354)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135354
Approved by: https://github.com/desertfire, https://github.com/atalman, https://github.com/malfet
2024-09-24 01:30:40 +00:00
06909803cc Existing mypy issues (#136236)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136236
Approved by: https://github.com/bobrenjc93, https://github.com/Skylion007
2024-09-24 01:02:07 +00:00
a14f57b126 fix the inductor tests (#136474)
Fixes https://github.com/pytorch/pytorch/issues/136464 introduced in https://github.com/pytorch/pytorch/pull/134874

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136474
Approved by: https://github.com/malfet
2024-09-24 00:59:22 +00:00
9d9bc65b5e Make FlashAttentionKernel.cpp compilable for SVE with GCC-11 (#136477)
Extends https://github.com/pytorch/pytorch/pull/132434 to all minor revisions of GCC-11, as they are all likely affected by https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95528

Hat tip to @abhishek-iitmadras for the investigation

Fixes https://github.com/pytorch/pytorch/issues/136432

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136477
Approved by: https://github.com/atalman, https://github.com/kit1980
2024-09-24 00:54:26 +00:00
e0f84f40f7 [Pipelining] Allow non-0 stages to accept kwargs (#136416)
To support a use case in torchchat:
all non-0 stages require `input_pos` and `cache_lane`.
```
kwargs = {"input_pos": input_pos, "cache_lane": lane}

if pp_rank == first_pp_rank:
    output = decorder.step(new_token, **kwargs)
elif pp_rank == last_pp_rank:
    output = decorder.step(**kwargs)
else:  # middle pp ranks
    decorder.step(**kwargs)
```

The `forward_one_chunk` code today hard-codes `{}` as the kwargs for non-0 stages, and hence cannot support the above use case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136416
Approved by: https://github.com/wconstab
2024-09-23 23:50:59 +00:00
52c917b0ba Optimize dict reconstruct to not codegen untouched values (#134876)
This PR changes how `reconstruct` is done for a ConstDict. As of today, it works as follows:
(1) codegen(...) each pair of key/value
(2) create a new dictionary to hold the new items
(3) clear the original dictionary
(4) update the original dict with the one created in (2)

We do a micro optimization in the generated bytecode to:
- Only codegen the items that changed.
- Only clear the original dictionary if a key was removed.
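
The two rules above can be illustrated with a small runtime analogue. This is only a sketch: the actual change is in the bytecode Dynamo generates, and `reconstruct_dict` and its return value are hypothetical names introduced here for illustration.

```python
_MISSING = object()


def reconstruct_dict(original: dict, updated: dict) -> list:
    """Mutate `original` in place so it matches `updated`.

    Returns the keys that were actually written, to show how much work was done.
    """
    if any(key not in updated for key in original):
        # A key was removed: clear and rebuild, as the old path always did.
        original.clear()
        original.update(updated)
        return list(updated)

    written = []
    for key, value in updated.items():
        # Identity check: an untouched value is the very same object.
        if original.get(key, _MISSING) is not value:
            original[key] = value
            written.append(key)
    return written
```

When no key is removed, only new or replaced entries are written and the dictionary is never cleared; the full clear-and-update path is taken only when a key disappears.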

Fixes: #133487

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134876
Approved by: https://github.com/zou3519
2024-09-23 21:45:44 +00:00
5033a1ca0d [RFC][torchelastic][c10d] Fix store prefix race in rendezvous (#135957)
1. We take option 3 as discussed in https://github.com/pytorch/pytorch/issues/135712: on every retry, we first create a new TCPStore server, so that we don't need to append the attempt count as a prefix and we avoid an eventual TCPStore sync failure. (This applies only when TCPStore sharing is enabled.)
2. We start a new server bound to an ephemeral port (i.e. 0) so that it gets assigned a free port. We then pass that downstream (trainer or c10d). This way, TCPStore is managed by the elastic agent rather than racing to bind to a specific port in the trainer.
3. The port is then broadcast for dynamic_rendezvous.
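
The ephemeral-port idea in step 2 can be sketched with plain sockets. This is not the TCPStore code: `start_server_on_free_port` and `assigned_port` are hypothetical helpers that only show how binding to port 0 lets the OS choose a free port, which can then be read back and passed downstream.

```python
import socket


def start_server_on_free_port(host: str = "127.0.0.1") -> socket.socket:
    """Bind a listening socket to port 0 so the OS assigns a free port."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind((host, 0))
    server.listen()
    return server


def assigned_port(server: socket.socket) -> int:
    """Read back the port the OS actually chose."""
    return server.getsockname()[1]
```

Because the server is already listening when the port number is handed out, no other process can win a race to bind that same port.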

One open question: what do we do about the store created by `_create_tcp_store` in torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py? Are we OK with creating a duplicate TCPStore server?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135957
Approved by: https://github.com/d4l3k, https://github.com/c-p-i-o
2024-09-23 20:32:24 +00:00
fd182b90a7 Revert "Add deterministic path for CUDA cumsum (#136224)"
This reverts commit d45b0151e5d9a9358368b9fbd7fa454edd5d9709.

Reverted https://github.com/pytorch/pytorch/pull/136224 on behalf of https://github.com/atalman due to Failing internal CI ([comment](https://github.com/pytorch/pytorch/pull/136224#issuecomment-2369244135))
2024-09-23 19:57:13 +00:00
08dba25775 [BE] Do not use deprecated APIs in SparseCsrTensorMath.cu (#136449)
- `Tensor::type()` -> `Tensor::scalar_type()`
- `Tensor::data<T>()` -> `Tensor::data_ptr<T>()`

Should fix the following warnings during compilation:
```
caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/transformers/cuda/mem_eff_attention/kernels/cutlassB_f32_notaligned_k128_dropout.cu.o
 /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu: In function ‘void at::native::_GLOBAL__N__496f0b0c_22_SparseCsrTensorMath_cu_868dd545::_apply_sparse_csr_linear_solve(const at::Tensor&, const at::Tensor&, bool, const at::Tensor&)’:
 /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu:739:36: error: ‘T* at::Tensor::data() const [with T = int]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Werror=deprecated-declarations]
   739 |   int* rowOffsets = crow.data<int>();
       |                                    ^
 /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:247:1: note: declared here
   247 |   T * data() const {
       | ^ ~~
 /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu:740:35: error: ‘T* at::Tensor::data() const [with T = int]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Werror=deprecated-declarations]
   740 |   int* colIndices = col.data<int>();
       |                                   ^
 /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:247:1: note: declared here
   247 |   T * data() const {
       | ^ ~~
 /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu: In lambda function:
 /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu:753:44: error: ‘at::DeprecatedTypeProperties& at::Tensor::type() const’ is deprecated: Tensor.type() is deprecated. Instead use Tensor.options(), which in many cases (e.g. in a constructor) is a drop-in replacement. If you were using data from type(), that is now available from Tensor itself, so instead of tensor.type().scalar_type(), use tensor.scalar_type() instead and instead of tensor.type().backend() use tensor.device(). [-Werror=deprecated-declarations]
   753 |   AT_DISPATCH_FLOATING_TYPES(values.type(), "create_matrix", ([&] {
       |                                            ^
 /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:225:1: note: declared here
   225 |   DeprecatedTypeProperties & type() const {
       | ^ ~~
 /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu:753:159: error: ‘c10::ScalarType detail::scalar_type(const at::DeprecatedTypeProperties&)’ is deprecated: passing at::DeprecatedTypeProperties to an AT_DISPATCH macro is deprecated, pass an at::ScalarType instead [-Werror=deprecated-declarations]
   753 |   AT_DISPATCH_FLOATING_TYPES(values.type(), "create_matrix", ([&] {
       |                                                                                                                                                               ^
 /var/lib/jenkins/workspace/aten/src/ATen/Dispatch.h:109:1: note: declared here
   109 | inline at::ScalarType scalar_type(const at::DeprecatedTypeProperties& t) {
       | ^~~~~~~~~~~
 /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu:753:159: error: ‘c10::ScalarType detail::scalar_type(const at::DeprecatedTypeProperties&)’ is deprecated: passing at::DeprecatedTypeProperties to an AT_DISPATCH macro is deprecated, pass an at::ScalarType instead [-Werror=deprecated-declarations]
   753 |   AT_DISPATCH_FLOATING_TYPES(values.type(), "create_matrix", ([&] {
       |                                                                                                                                                               ^
 /var/lib/jenkins/workspace/aten/src/ATen/Dispatch.h:109:1: note: declared here
   109 | inline at::ScalarType scalar_type(const at::DeprecatedTypeProperties& t) {
       | ^~~~~~~~~~~
 /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu: In lambda function:
 /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu:753:1014: error: ‘T* at::Tensor::data() const [with T = double]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Werror=deprecated-declarations]
   753 |   AT_DISPATCH_FLOATING_TYPES(values.type(), "create_matrix", ([&] {
       |
 /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:247:1: note: declared here
   247 |   T * data() const {
       | ^ ~~
 /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu:753:1054: error: ‘T* at::Tensor::data() const [with T = double]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Werror=deprecated-declarations]
   753 |   AT_DISPATCH_FLOATING_TYPES(values.type(), "create_matrix", ([&] {
       |
 /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:247:1: note: declared here
   247 |   T * data() const {
       | ^ ~~
 /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu:753:1094: error: ‘T* at::Tensor::data() const [with T = double]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Werror=deprecated-declarations]
   753 |   AT_DISPATCH_FLOATING_TYPES(values.type(), "create_matrix", ([&] {
       |
 /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:247:1: note: declared here
   247 |   T * data() const {
       | ^ ~~
 /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu: In lambda function:
 /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu:753: error: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Werror=deprecated-declarations]
   753 |   AT_DISPATCH_FLOATING_TYPES(values.type(), "create_matrix", ([&] {
       |
 /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:247:1: note: declared here
   247 |   T * data() const {
       | ^ ~~
 /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu:753: error: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Werror=deprecated-declarations]
   753 |   AT_DISPATCH_FLOATING_TYPES(values.type(), "create_matrix", ([&] {
       |
 /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:247:1: note: declared here
   247 |   T * data() const {
       | ^ ~~
 /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu:753: error: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Werror=deprecated-declarations]
   753 |   AT_DISPATCH_FLOATING_TYPES(values.type(), "create_matrix", ([&] {
       |
 /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:247:1: note: declared here
   247 |   T * data() const {
       | ^ ~~
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136449
Approved by: https://github.com/huydhn
2024-09-23 19:20:34 +00:00
9a1dc41de7 [AMD] Skipping 0 byte send/recv for AMD GPU (#136362)
Summary: We found jobs getting stuck when sending/receiving zero bytes with RDMA on AMD GPUs, so we just skip such operations.

Reviewed By: danzimm

Differential Revision: D63075000

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136362
Approved by: https://github.com/malfet, https://github.com/houseroad
2024-09-23 19:14:12 +00:00
3116fbda0f Increase update_hint_regression problem size to 1000 (#136434)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136434
Approved by: https://github.com/laithsakka
2024-09-23 18:51:44 +00:00
274883083d Revert "[AOTI] Create another wrapper class to handle ArrayRef (#136318)"
This reverts commit d21841d077b00350d5e621e7b74dace71849c701.

Reverted https://github.com/pytorch/pytorch/pull/136318 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/136318#issuecomment-2368957264))
2024-09-23 17:47:49 +00:00
d859fcbc61 s390x: build s390x binaries on each pull request (#125399)
Ensure that s390x keeps building for each PR
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125399
Approved by: https://github.com/huydhn
2024-09-23 17:39:48 +00:00
83a3ee0699 Support embedding_bag() with NJT input (#135888)
Fixes #93843

`EmbeddingBag()` / `embedding_bag()` support 1D inputs with offsets to handle raggedness. NJT is a natural fit here as it already maintains offsets of the same form. This PR updates the python-side to support NJT and adds corresponding OpInfo-based NJT tests.
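
To make the "1D input with offsets" layout concrete, here is a minimal pure-Python sketch. It assumes sum pooling and offsets that mark only bag starts; `embedding_bag_sum` is a hypothetical helper, not the PyTorch implementation.

```python
def embedding_bag_sum(weight, indices, offsets):
    """Sum-pool rows of `weight` per bag.

    `indices` is a flat 1D list; `offsets[b]` is where bag `b` starts in it.
    """
    dim = len(weight[0])
    bags = []
    for b, start in enumerate(offsets):
        # A bag ends where the next one starts (or at the end of `indices`).
        end = offsets[b + 1] if b + 1 < len(offsets) else len(indices)
        pooled = [0] * dim
        for idx in indices[start:end]:
            pooled = [acc + v for acc, v in zip(pooled, weight[idx])]
        bags.append(pooled)
    return bags
```

A ragged batch is thus fully described by one flat buffer of values plus offsets, which is the same kind of data an NJT carries.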
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135888
Approved by: https://github.com/cpuhrsch
2024-09-23 17:35:19 +00:00
4649aeaebf Make AOTAutogradCache support remote FXGraphCache (#136173)
Summary:
After the previous refactor, we can now call load_with_key directly from AOTAutogradCache to use the remote FXGraphCache.

This does *not* implement a remote AOTAutogradCache. It just allows AOTAutogradCache to work with remote FXGraphCache.

Test Plan: (Meta only tests)

Reviewed By: aorenste

Differential Revision: D62384944

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136173
Approved by: https://github.com/oulgen
2024-09-23 17:24:27 +00:00
c3e678382b Fix addmm silent correctness on aarch64 (#136371)
Do not dispatch to fast gemmv functions when alpha is not equal to 1

Add regression test to address the problem
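
The dispatch rule can be sketched in pure Python. This is an illustration only: `fast_gemv`, `general_gemv` and `addmv` are hypothetical stand-ins for the aarch64 kernels, assuming the fast kernel computes a plain matrix-vector product with no `alpha` scaling.

```python
def fast_gemv(matrix, vector):
    """Stand-in for a specialized kernel that computes only matrix @ vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def general_gemv(alpha, matrix, vector):
    """Stand-in for the general path that honours alpha."""
    return [alpha * sum(a * b for a, b in zip(row, vector)) for row in matrix]


def addmv(bias, matrix, vector, beta=1.0, alpha=1.0):
    """Compute beta * bias + alpha * (matrix @ vector)."""
    if alpha == 1:
        # The fast kernel is only correct when no alpha scaling is needed.
        product = fast_gemv(matrix, vector)
    else:
        # Otherwise take the general path instead of silently dropping alpha.
        product = general_gemv(alpha, matrix, vector)
    return [beta * b + p for b, p in zip(bias, product)]
```

A regression test in this spirit would call the function with `alpha != 1` and compare against the reference formula.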

Fixes https://github.com/pytorch/pytorch/issues/136299

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136371
Approved by: https://github.com/swolchok
2024-09-23 17:10:34 +00:00
f0f79dd8f1 Correctly convert Python float to float64 when passing argument as Tensor (#136413)
I can't actually test the Dynamo codegen fix as it is impossible to
directly use the Tensor at the moment.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136413
Approved by: https://github.com/bobrenjc93
2024-09-23 16:48:08 +00:00
637d5c4b7e [DSD] Fix loading uneven full tensor into sharded state dict (#136365)
Fix #136228.

This is a follow-up to https://github.com/pytorch/pytorch/pull/135725. We need to pass the shape and stride from the original dtensor, since in the uneven case `from_local` would calculate the shape and stride from the local tensor, assuming the tensor is evenly sharded.
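
The arithmetic behind the uneven case can be sketched as follows. Both helpers are hypothetical and assume chunk-style sharding along one dimension, where every shard takes the ceiling-sized chunk until the data runs out.

```python
def chunk_sizes(dim_size: int, num_shards: int) -> list:
    """Per-rank sizes along the sharded dimension under chunk-style sharding."""
    full = -(-dim_size // num_shards)  # ceiling division
    sizes, remaining = [], dim_size
    for _ in range(num_shards):
        take = min(full, remaining)
        sizes.append(take)
        remaining -= take
    return sizes


def inferred_global_size(local_size: int, num_shards: int) -> int:
    """What an 'evenly sharded' assumption infers from one local shard."""
    return local_size * num_shards
```

For a dimension of size 10 over 4 ranks, the local sizes are 3, 3, 3 and 1, so the even-sharding assumption infers 12 on the first three ranks and 4 on the last. None of these equals 10, which is why the original shape and stride have to be passed explicitly.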

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136365
Approved by: https://github.com/fegin
2024-09-23 16:35:58 +00:00
da51fe1c42 [FR] Fix errors in all2all check, improve some log output (#136399)
We found that our script output shows the hashed pg name, which is not user-friendly.
We also found a bug in our all2all check, and we changed a number of error messages to make them more readable.

Differential Revision: [D63206469](https://our.internmc.facebook.com/intern/diff/D63206469)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136399
Approved by: https://github.com/c-p-i-o
2024-09-23 16:31:31 +00:00
df6a8fa1eb Revert "[aotd] Fix freezing API for subclasses (#136265)"
This reverts commit cdef760560049ebda5fb7e30b1703f345fe05cfa.

Reverted https://github.com/pytorch/pytorch/pull/136265 on behalf of https://github.com/atalman due to Breaks internal CI sorry, need to revert ([comment](https://github.com/pytorch/pytorch/pull/136265#issuecomment-2368772574))
2024-09-23 16:25:05 +00:00
9992084f38 [FSDP2] Fixed test_all_gather_extensions_monkey_patch (#136130)
I messed up the test before. The extensions were not running :/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136130
Approved by: https://github.com/weifengpy
ghstack dependencies: #136129
2024-09-23 15:12:44 +00:00
b9f53c0dce [FSDP2] Added module, mp policy to fsdp_pre_all_gather (#136129)
- Sometimes having access to the `MixedPrecisionPolicy` in the `fsdp_pre_all_gather` is useful. See [here](https://github.com/pytorch/ao/pull/748/files#r1760375325) in the torchao INT8 mixed precision training PR.
- Sometimes having access to the owning `nn.Module` allows for using it for saving state. See [here](https://github.com/pytorch/pytorch/issues/114299#issuecomment-2298692762) for an example.

The major pain point here is how to deal with backward compatibility. For now, we use `inspect.signature` to check whether the user subclass follows the old or the new signature. However, for the new signature, the `param_dtype` in the post-all-gather is redundant: if the user needed it, the user could now save it from the `mp_policy` passed in the pre-all-gather.
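
A signature-based compatibility check of this kind might look like the sketch below. The parameter lists of the two example classes are simplified assumptions, and `uses_new_pre_all_gather_signature` is a hypothetical helper rather than the FSDP2 code.

```python
import inspect


def uses_new_pre_all_gather_signature(hook) -> bool:
    """Return True if the hook declares the newly added parameters."""
    params = inspect.signature(hook).parameters
    return "module" in params and "mp_policy" in params


class OldStyleParam:
    # Simplified stand-in for a subclass written against the old signature.
    def fsdp_pre_all_gather(self, mesh):
        return (), None


class NewStyleParam:
    # Simplified stand-in for a subclass written against the new signature.
    def fsdp_pre_all_gather(self, mesh, module, mp_policy):
        return (), None
```

The caller can then decide which arguments to pass, so existing subclasses keep working without modification.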

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136129
Approved by: https://github.com/weifengpy
2024-09-23 15:12:36 +00:00
d21841d077 [AOTI] Create another wrapper class to handle ArrayRef (#136318)
Summary: Create another wrapper codegen class to handle ArrayRef for CPU. The goal is to simplify the regular cpp wrapper codegen logic and the generated cpp code.

Test Plan: CI

Differential Revision: D62961885

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136318
Approved by: https://github.com/frank-wei
2024-09-23 15:10:27 +00:00
0e19522122 Revert "Adds support for accelerated sorting with x86-simd-sort (#127936)"
This reverts commit 239a9ad65eebf93dcf9bb108a5129d4160b12c86.

Reverted https://github.com/pytorch/pytorch/pull/127936 on behalf of https://github.com/atalman due to test/test_sort_and_select.py::TestSortAndSelectCPU::test_sort_discontiguous_slow_cpu_float32 [GH job link](https://github.com/pytorch/pytorch/actions/runs/10994904767/job/30525578456) [HUD commit link](239a9ad65e) ([comment](https://github.com/pytorch/pytorch/pull/127936#issuecomment-2368522316))
2024-09-23 14:52:23 +00:00
bae427e4b1 Refactor maybe_evaluate_static into a worker function off of ShapeEnv (#135107)
By refactoring this way, I can put a non-expiring LRU cache here.
Splitting also will make it easier for me to tell who is using up all
the time.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135107
Approved by: https://github.com/aorenste
2024-09-23 14:39:20 +00:00
e9bfbf78d5 Revert "Allow fx graph caching higher order operators (opt-in) (#135877)"
This reverts commit 66d5eb64e0be91680a8531ccb24f098554610d46.

Reverted https://github.com/pytorch/pytorch/pull/135877 on behalf of https://github.com/jeanschmidt due to seems to have introduced regressions on rocm signals ([comment](https://github.com/pytorch/pytorch/pull/135877#issuecomment-2367616653))
2024-09-23 09:04:24 +00:00
cyy
75f141be62 Avoid unnecessary CMake warnings on Windows (#136393)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136393
Approved by: https://github.com/ezyang
2024-09-23 06:42:59 +00:00
663e760065 add unittest for OOM message (#129671)
Add unittest for the bug in #123984
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129671
Approved by: https://github.com/eqy
2024-09-23 04:48:01 +00:00
068fdd602f [export] enable custom tag metadata re-export test (#136048)
Improves and enables a commented out test originally introduced in #131912

In `test_custom_tag_metadata_re_export()`, we check that the "custom" metadata added to given nodes is preserved and not copied to other nodes after re-exporting
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136048
Approved by: https://github.com/zhxchen17
2024-09-23 04:37:58 +00:00
66d5eb64e0 Allow fx graph caching higher order operators (opt-in) (#135877)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135877
Approved by: https://github.com/zou3519
2024-09-23 04:33:27 +00:00
cyy
a38e4c5e1e Enable clang-tidy warnings on aten/src/ATen/cuda/*.cpp (#134547)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134547
Approved by: https://github.com/ezyang
2024-09-23 03:44:55 +00:00
f276da7f98 Remove prims.slice_in_dim and prims.slice (#136150)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136150
Approved by: https://github.com/ezyang
2024-09-23 01:27:22 +00:00
3406ac24d9 [BE] fix circular import in torch/distributed/utils.py (#136286)
**Summary**
Fix circular import in `torch/distributed/utils.py` found when running internal test, see D62901023. Curious why this wasn't causing any issue. Is this relevant code deprecated and no longer used?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136286
Approved by: https://github.com/Skylion007
2024-09-22 20:54:12 +00:00
3bc073d728 [aoti] Fix workspace generation for triton (#135552)
Fixes #131337

- add `arg_type` for workspace_arg, the type is consistent with the type in `generate_workspace_allocation()`.
- do not generate example tensors for `workspace`, and use `generate_workspace_allocation()` instead.
- add workspace allocation generation code to `kernel_autotune_calls`. e.g.
```python
    workspace = empty_strided_cuda((1280, ), (1, ), torch.uint8)
    workspace.zero_()
    .....
    triton_spl_fused_add_cumprod_0.run(buf2, arg0_1, arg1_1, workspace, 1, 10000, grid=split_scan_grid(1, 10000), stream=stream0)
    del buf2, arg0_1, arg1_1, workspace
```
-  add `empty_strided_cuda = torch._C._dynamo.guards._empty_strided_cuda` to the header of triton autotune code.

The generated cpp has lines like the ones below, so we also implement `zero_()` for `AtenTensorHandle`.

```cpp
    static constexpr int64_t int_array_0[] = {1280L, };
    static constexpr int64_t int_array_1[] = {1L, };
    AtenTensorHandle workspace_handle;
    AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_empty_strided(1, int_array_0, int_array_1, cached_torch_dtype_uint8, cached_torch_device_type_cuda,  0, &workspace_handle));

        RAIIAtenTensorHandle workspace(workspace_handle);
        workspace.zero_();
```

- Fix the handling of grid_fn for grid computation: pass "RBLOCK" to `split_scan_grid`.
- Fix dynamic shapes:
Without the fix, we generate code like `workspace = empty_strided_cuda((32*((255 + s0) // 256), ), (1, ), torch.uint8)` when doing triton autotune, where `s0` is not defined.

The solution is to use `V.graph.sizevars.size_hint(nbytes)` to realize the workspace size for triton autotune. Note that we only realize it for the triton autotune code, not for the cpp cuda code.

- We also generate slightly different cpp code depending on whether `abi_compatible` is turned on.
```cpp
RAIIAtenTensorHandle workspace(workspace_handle);
AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_zero_(workspace.get()));
```
vs

```cpp
    at::Tensor workspace = at::detail::empty_strided_cuda({8L*(c10::div_floor_integer(static_cast<int64_t>((255L + s0)), static_cast<int64_t>(256L))), }, {1L, }, at::kByte, c10::DeviceType::CUDA);
    workspace.zero_();
```
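
The dynamic-shape fix described above amounts to substituting a concrete hint for the symbol before the autotune code is rendered. A minimal sketch, with hypothetical helper names and an assumed hint value:

```python
def workspace_nbytes(s0: int) -> int:
    """The symbolic workspace size from the message, as a function of s0."""
    return 32 * ((255 + s0) // 256)


def autotune_alloc_line(s0_hint: int) -> str:
    """Render the allocation line with the size made concrete from a hint."""
    nbytes = workspace_nbytes(s0_hint)
    return f"workspace = empty_strided_cuda(({nbytes}, ), (1, ), torch.uint8)"
```

With an assumed hint of `s0 = 10000`, the expression evaluates to 1280 bytes, so the rendered line no longer refers to the undefined name `s0`. The C++ wrapper keeps the symbolic expression because `s0` is defined there at run time.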

Test Plan:

```
TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCHINDUCTOR_CPP_WRAPPER=1  python test/inductor/test_torchinductor.py -k GPUTests.test_consecutive_split_cumprod_cuda
python test/inductor/test_cuda_cpp_wrapper.py TestCudaWrapper.test_consecutive_split_cumprod_cuda_cuda_wrapper
python test/inductor/test_cuda_cpp_wrapper.py DynamicShapesCudaWrapperCudaTests.test_consecutive_split_cumprod_cuda_dynamic_shapes_cuda_wrapper
TORCHINDUCTOR_ABI_COMPATIBLE=1 python test/inductor/test_cuda_cpp_wrapper.py TestCudaWrapper.test_consecutive_split_cumprod_cuda_cuda_wrapper
TORCHINDUCTOR_CPP_WRAPPER=1  python test/inductor/test_torchinductor.py -k GPUTests.test_consecutive_split_cumprod_cuda
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135552
Approved by: https://github.com/desertfire
2024-09-22 04:51:37 +00:00
35532fc477 [Partitioner] Reuse partition to check whether nodes exist (#135317)
Checking whether a node is in a NodeList takes O(n) time. Reuse the partition instead to speed this up, since partition.nodes is a hash table and holds the same elements.
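
The difference between the two lookups can be sketched in plain Python. The function names are hypothetical; the point is only that the same membership test costs O(n) on a list and O(1) on average on a dict keyed by node.

```python
def external_users_list_lookup(users, nodes_list):
    # `in` on a list scans it: O(len(nodes_list)) per lookup.
    return [u for u in users if u not in nodes_list]


def external_users_dict_lookup(users, nodes_dict):
    # `in` on a dict hashes the key: O(1) on average per lookup.
    return [u for u in users if u not in nodes_dict]
```

Both return the same result when `nodes_dict = dict.fromkeys(nodes_list)`, so an existing hash-based container can be substituted without changing behavior.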

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135317
Approved by: https://github.com/ezyang
2024-09-21 23:52:02 +00:00
cyy
e4cdc31227 [14/N] Fix clang-tidy warnings in aten/src/ATen (#133988)
Follows  #133807
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133988
Approved by: https://github.com/ezyang
2024-09-21 22:41:40 +00:00
9731ccb9e0 Type _dynamo/variables/lazy.py (#136376)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136376
Approved by: https://github.com/Skylion007
2024-09-21 22:18:02 +00:00
09715638ab Add _dynamo.config.suppress_errors logging (#136379)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136379
Approved by: https://github.com/ezyang
2024-09-21 21:00:26 +00:00
3176966732 update cache tests (#136215)
Summary:
- Clean up cache test code a bit.
- Removed patch_fbcode(): it turned out to cause flaky issues (imagine if it set fbcode=False and then loaded, for the first time, a module that had a top-level fbcode check).

Test Plan: unit tests

Reviewed By: oulgen

Differential Revision: D62648248

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136215
Approved by: https://github.com/bobrenjc93
2024-09-21 20:36:22 +00:00
be4b7e8131 Param fixes in docstring (#136097)
Fixes wrong param names in docstrings. cc: @kit1980

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136097
Approved by: https://github.com/ezyang
2024-09-21 18:56:34 +00:00
b6ffa381e1 [BE]: Add half CUDA support nextafter (#136373)
Making CUDA support match CPU support for nextafter
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136373
Approved by: https://github.com/ezyang
2024-09-21 17:13:45 +00:00
cc17d58809 Revert "S390x update builder image (#132983)"
This reverts commit 080a249fc2290602402e01bf5864d9d9a416e5b6.

Reverted https://github.com/pytorch/pytorch/pull/132983 on behalf of https://github.com/atalman due to Authenticate With PUSH is failing. Error: no registries found in registries.conf, a registry must be provided. Error: Process completed with exit code 125. ([comment](https://github.com/pytorch/pytorch/pull/132983#issuecomment-2365249249))
2024-09-21 16:46:54 +00:00
03957efa5d [inductor][scheduler] reorder scheduler nodes after fusion to reduce peak memory (#134874)
**Motivations**:
A topological order of the scheduler nodes that optimizes the liveness of buffers can reduce the peak memory utilization. This has been observed and studied e.g., [here](https://arxiv.org/pdf/1910.02653) and [here](https://proceedings.mlr.press/v202/steiner23a/steiner23a.pdf).

**Solutions**:
1. implement a peak memory estimator via liveness analysis
2. implement a few memory aware topological sorting algorithms and pick the one with the lowest peak memory
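
The two steps above can be sketched on a toy graph. This is a simplified model, not the Inductor implementation: each node is assumed to produce one buffer, a buffer is freed right after its last consumer runs, and `estimate_peak_memory` and `pick_order` are hypothetical names.

```python
def estimate_peak_memory(order, out_size, inputs):
    """Peak live bytes when nodes run in `order`.

    out_size: node -> size of the buffer it produces
    inputs:   node -> list of nodes whose buffers it reads
    """
    position = {name: i for i, name in enumerate(order)}
    # A buffer with no consumers dies right after it is produced.
    last_use = dict(position)
    for name in order:
        for dep in inputs.get(name, ()):
            last_use[dep] = max(last_use[dep], position[name])

    frees_at = [[] for _ in order]
    for name, step in last_use.items():
        frees_at[step].append(name)

    live = peak = 0
    for step, name in enumerate(order):
        live += out_size[name]
        peak = max(peak, live)
        live -= sum(out_size[dead] for dead in frees_at[step])
    return peak


def pick_order(baseline, candidates, out_size, inputs):
    """Keep the baseline unless a candidate has a strictly lower estimate."""
    return min(
        [baseline, *candidates],
        key=lambda order: estimate_peak_memory(order, out_size, inputs),
    )
```

Because `min` returns the first minimal element and the baseline is listed first, a candidate order is adopted only when its estimated peak is strictly lower.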

**Results**:
On some models we can reduce the peak memory significantly:
|             model             | batch size | peak_memory baseline | peak_memory new | ratio |
|:-----------------------------:|:----------:|:--------------------:|:---------------:|:-----:|
| alexnet                       | 128        |         1.17         |       0.99      | 1.19  |
| vgg16                         | 64         |         4.10         |       3.57      | 1.15  |
| DebertaV2ForQuestionAnswering | 1          |         11.60        |      10.56      | 1.10  |

In the presence of compiler based AC, peak memory can be further reduced:
|              model             | batch size | peak_memory baseline | peak_memory new | ratio |
|:------------------------------:|:----------:|:--------------------:|:---------------:|:-----:|
| AlbertForMaskedLM              | 4          |         6.87         |       6.43      | 1.07  |
| AlbertForQuestionAnswering     | 4          |         8.69         |       7.76      | 1.12  |
| MobileBertForQuestionAnswering | 128        |         4.67         |       3.90      | 1.20  |

[Here](https://fb.workplace.com/groups/1075192433118967/posts/1499920537312819/?comment_id=1499938843977655&reply_comment_id=1499951630643043) is an internal use case.

**Other infos:**
* neutral model runtime, because the reordering happens after fusion. So the memory saving comes _for free_.
* minimal compile time overhead, as the algorithm is linear in the number of edges of the inductor graph. For all huggingface benchmark models, the additional compile time is less than 1 second.
* no peak memory regression, since we only adopt a new order if the peak memory is reduced based on the estimator. However, the model is unaware of operators' working memory; for large models, the working memory should be negligible. We haven't observed any significant regressions in any of our tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134874
Approved by: https://github.com/yf225
2024-09-21 16:28:38 +00:00
fb4670a1f9 fix mean_out: op does not update parameter out for BF16/FP16 dtype on CPU (#135174)
Fixes #134848

For BF16/FP16, when a tensor is passed as the `out` parameter of mean, the mean kernel should write its output into that tensor's storage. That doesn't happen: an `at::to` in the current code allocates new storage, and the `out` tensor's storage is never updated, so it does not hold the mean output.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135174
Approved by: https://github.com/soulitzer
2024-09-21 14:21:42 +00:00
ea737e4e5d [Pipelining] Make PipelineStage support meta initialization (#136243)
Avoid allocating memory or dry-running the submodule during stage init.

Save user-provided input/output metadata during stage init, to allow
lazily initializing the buffers before the first step call.

Later, we plan to build on top of this to add lazy shape inference
(#130856) so that no input/output shapes are required at stage init.

For now, we require input/output tensors for stage init, but these
should be on meta device and stage should not allocate any real memory.

Note: this needs more thorough testing and review, but it worked on the
torchtitan 3d test.

TODO:
- delete 'device' arg from PipelineStage ctor? (infer it from the
  args tensors passed to the first step call?) separate PR.
- delete 'output_args' from PipelineStage ctor? we don't actually need
  it, but we use it to do shape validation, which is why I didn't remove
  it in this PR.  Proposal: leave it until we add lazy shape inference?

Fixes #136225, #136226

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136243
Approved by: https://github.com/H-Huang, https://github.com/kwen2501
2024-09-21 09:47:22 +00:00
cyy
c459430558 Pass Werror to CUDA host compiler (#130213)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130213
Approved by: https://github.com/ezyang
2024-09-21 08:01:06 +00:00
e18439113e [PT2][Inductor][Optmus] fix test_pad_mm_bf16 and reland to fix long computation kernel (#136349)
Summary: see D62220158

Test Plan:
```
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:pad_mm -- --exact 'caffe2/test/inductor:pad_mm - test_pad_mm_bf16 (caffe2.test.inductor.test_pad_mm.PadMMTest)' --run-disabled
```

### H100

Buck UI: https://www.internalfb.com/buck2/e5d85802-cab7-41a5-aacc-95f541796a99
Test UI: https://www.internalfb.com/intern/testinfra/testrun/9570149258587374
Network: Up: 9.1KiB  Down: 0B  (reSessionID-b339b51b-6a0e-4347-9414-1ba38f26a5d0)
Jobs completed: 9. Time elapsed: 1:15.7s.
Cache hits: 0%. Commands: 3 (cached: 0, remote: 0, local: 3)
Tests finished: Pass 1. Fail 0. Fatal 0. Skip 1. Build failure 0

### A100

Buck UI: https://www.internalfb.com/buck2/1082ad6e-56b0-4eb5-8092-ce507ca9a70e
Test UI: https://www.internalfb.com/intern/testinfra/testrun/8444249533824784
Network: Up: 9.2KiB  Down: 0B  (reSessionID-2b3056ac-f29e-4de4-b6f5-9d994acf566b)
Jobs completed: 9. Time elapsed: 1:36.9s.
Cache hits: 0%. Commands: 3 (cached: 0, remote: 0, local: 3)
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0

# E2E

see D62220158

Differential Revision: D63040455

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136349
Approved by: https://github.com/dshi7
2024-09-21 06:35:50 +00:00
cyy
02871461f7 Fix clang-tidy warnings in torch/csrc/lazy (#134655)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134655
Approved by: https://github.com/ezyang
2024-09-21 02:59:35 +00:00
0b91e7e2dc Remove duplicate line (#136383)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136383
Approved by: https://github.com/kit1980, https://github.com/malfet
2024-09-21 01:35:13 +00:00
eqy
29f7b8d483 [TF32] Account for TF32 in test_conv_double_backward (#135716)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135716
Approved by: https://github.com/Skylion007
2024-09-21 01:06:22 +00:00
7936584a88 Fix Vectorized<double>::next_after SVE compilation (#136388)
Should have called [`Sleef_nextafterdx_sve`](https://sleef.org/2-references/libm/aarch64#vectorized-double-precision-function-for-obtaining-the-next-representable-fp-value) rather than [`Sleef_nextafterfx_sve`](https://sleef.org/2-references/libm/aarch64#vectorized-single-precision-function-for-obtaining-the-next-representable-fp-value) to get vectorized `nextafter` for double precision rather than single precision values

This fixes a compilation issue introduced by https://github.com/pytorch/pytorch/pull/119571 and exposed by https://github.com/pytorch/pytorch/pull/133339
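
To see why the single-precision and double-precision variants are not interchangeable, here is a small stdlib-only sketch (assuming Python 3.9+ for `math.nextafter`). The helper `nextafter_f32` is hypothetical and only handles positive, finite inputs that are exactly representable as float32.

```python
import math
import struct


def nextafter_f32(x: float) -> float:
    """Next representable float32 above a positive, finite float32 value."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits + 1))[0]


# Python floats are doubles, so math.nextafter steps by one double ulp.
step_f64 = math.nextafter(1.0, 2.0) - 1.0
# One float32 ulp at 1.0 is much larger.
step_f32 = nextafter_f32(1.0) - 1.0
```

At 1.0 the double-precision step is `2**-52` while the single-precision step is `2**-23`, so using the float routine on double lanes would give the wrong "next" value even if it compiled.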

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136388
Approved by: https://github.com/kit1980
2024-09-20 23:54:17 +00:00
067d203b22 Upgrade pybind11 API calls for 3.13t (#136370)
This is a modified version of https://github.com/pytorch/pytorch/pull/130341 that preserves support for older pybind11 versions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136370
Approved by: https://github.com/Skylion007, https://github.com/malfet
2024-09-20 23:09:55 +00:00
1a10751731 [AOTI][Tooling] Filter out kernels based off lowercase names (#135395)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135395
Approved by: https://github.com/YUNQIUGUO
2024-09-20 21:56:08 +00:00
0c936c3ecb Add decomps for max_unpool (#133146)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133146
Approved by: https://github.com/amjames, https://github.com/eellison
2024-09-20 21:35:25 +00:00
293fccf86d add TORCH_CUDA_CPP_API for AutoNcclGroup (#130012)
`torch::cuda::nccl` is an option for developers who want to depend only on torch and not on nccl. However, using `torch::cuda::nccl::send`/`torch::cuda::nccl::recv` requires `ncclGroupStart()`/`ncclGroupEnd()`, for which `torch::cuda::nccl::AutoNcclGroup` can be used. But `torch::cuda::nccl::AutoNcclGroup` is not exported and is a LOCAL symbol, so it can't be used from outside of libtorch.

<img width="1618" alt="image" src="https://github.com/pytorch/pytorch/assets/1913192/25b0bd54-2da6-480f-876d-b05acfecfe62">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130012
Approved by: https://github.com/kwen2501, https://github.com/eqy
2024-09-20 21:20:25 +00:00
239a9ad65e Adds support for accelerated sorting with x86-simd-sort (#127936)
Adds x86-simd-sort as a submodule to accelerate sorting for 32-bit and 64-bit datatypes when AVX2 or AVX512 are available.

For contiguous data, this can be over a 10x speedup for large arrays. For discontiguous data, it can give over a 4x speedup with larger arrays. These benchmarks were gathered on a Skylake system (7900x), limited to 8 threads.

<details>
<summary><b>Contiguous Benchmarks</b></summary>

```
float32, normally distributed (in microseconds)
size           Default        AVX2           AVX512         Default/AVX2   Default/AVX512
16             7.150844336    6.886271477    7.132277489    1.038420335    1.002603214
128            9.208030939    8.478154898    7.846915245    1.086089019    1.173458697
1024           37.79037627    23.60707456    16.44122627    1.600807257    2.298513241
10000          714.7355628    203.9921844    105.5683001    3.503739934    6.770361577
100000         8383.074408    721.6333354    465.3709247    11.61680593    18.01374766
1000000        97124.31945    5632.054572    3920.148401    17.24491803    24.77567416
10000000       1161974.907    86070.48988    71533.82301    13.50027063    16.24371323

int32_t, uniformly distributed (in microseconds)
size           Default        AVX2           AVX512         Default/AVX2   Default/AVX512
16             7.203208685    6.92212224     7.014458179    1.040606975    1.026908779
128            8.972388983    8.195516348    7.592543125    1.094792396    1.18173698
1024           32.77489477    23.6874548     15.36617105    1.383639359    2.132925285
10000          607.8824128    193.3402024    99.25090471    3.144107667    6.124703997
100000         523.9384684    608.1836536    442.3166784    0.861480682    1.184532472
1000000        5211.348627    5271.598405    3518.861883    0.988570871    1.480975611
10000000       133853.6263    81463.05084    67852.97394    1.643120714    1.972700952
```

</details>

Note that the int32_t sort is accelerated by FBGEMM's radix sort for larger arrays, but this only handles contiguous data and in one sorting direction.

<details>
<summary><b>Discontiguous Benchmarks</b></summary>

```
float32, normally distributed, discontiguous in sorted dimension (in microseconds)
size           Default        AVX2           AVX512         Default/AVX2   Default/AVX512
16             3.836543679    4.011214256    3.84376061     0.956454439    0.99812243
128            5.755310194    5.755723127    4.820394962    0.999928257    1.193949923
1024           49.46946019    24.78790785    15.47874362    1.995709379    3.195960952
10000          665.2505291    236.6165959    143.9490662    2.811512551    4.621429974
100000         4328.002203    1329.001212    818.3516414    3.256582586    5.288682743
1000000        47651.5018     16693.72045    11827.39551    2.854456677    4.028909133
10000000       556655.1288    236252.6258    184215.9828    2.356185998    3.021752621

int32_t, uniformly distributed, discontiguous in sorted dimension  (in microseconds)
size           Default        AVX2           AVX512         Default/AVX2   Default/AVX512
16             3.817994356    3.878117442    3.770039797    0.984496837    1.012719908
128            5.578731397    5.577152082    4.716770534    1.000283176    1.182743862
1024           43.3412619     23.61275801    14.55446819    1.835501887    2.977866408
10000          634.3997478    224.4322851    133.9518324    2.826686667    4.736028889
100000         4084.358152    1292.363303    781.7867576    3.16037924     5.22438902
1000000        46262.20465    16608.35284    11367.51817    2.785478192    4.06968381
10000000       541231.9104    235185.1861    180249.9294    2.301301028    3.002674742
```

</details>
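
The ratio columns in the tables above are plain `Default / AVX2` and `Default / AVX512` time ratios, so values above 1 mean the new path is faster. A small sketch that recomputes a few of them from the contiguous tables (the dictionary layout and the `speedups` helper are illustrative only):

```python
# (default_us, avx2_us, avx512_us), copied from the contiguous benchmarks
contiguous_float32 = {
    100000: (8383.074408, 721.6333354, 465.3709247),
    1000000: (97124.31945, 5632.054572, 3920.148401),
}
contiguous_int32 = {
    100000: (523.9384684, 608.1836536, 442.3166784),
}


def speedups(default_us: float, avx2_us: float, avx512_us: float):
    """Return (Default/AVX2, Default/AVX512) for one table row."""
    return default_us / avx2_us, default_us / avx512_us


float32_100k = speedups(*contiguous_float32[100000])
float32_1m = speedups(*contiguous_float32[1000000])
int32_100k = speedups(*contiguous_int32[100000])
```

The int32 row at size 100000 gives a `Default/AVX2` ratio below 1, i.e. the default path is faster there, which is consistent with the note that large contiguous int32 sorts already go through FBGEMM's radix sort.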

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127936
Approved by: https://github.com/jgong5, https://github.com/peterbell10
2024-09-20 21:19:33 +00:00
cyy
d2455b99fb Use cpython declaration of _PyWeakref_ClearRef (#136300)
To avoid the DLL inconsistency warning by MSVC:
```
torch/csrc/utils/python_compat.h(38): warning C4273: '_PyWeakref_ClearRef': inconsistent dll linkage
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136300
Approved by: https://github.com/Skylion007
2024-09-20 18:58:58 +00:00
7f9c06462f fix mypi in utils/_sympy/functions.py (#136339)
Signed-off-by: Bob Ren <bobren@fb.com>

Turns out older versions of Python, in particular 3.8, show errors that 3.12 doesn't. For posterity, these are the steps I took to reproduce:

```
conda create -n py38 python=3.8
conda activate py38
pip install -r requirements.txt
lintrunner init
dmypy restart && lintrunner --all-files --take MYPY
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136339
Approved by: https://github.com/Skylion007
ghstack dependencies: #136205
2024-09-20 18:39:16 +00:00
f53a0f9cc1 [Inductor] Fix test_profiler_mark_wrapper_call_cuda_cuda_wrapper (#136356)
Summary: Internal profiler behaves differently after turning on triton.autotune_at_compile_time. Needs more investigation but turning it off for this test for now.

Reviewed By: henrylhtsang

Differential Revision: D63035855

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136356
Approved by: https://github.com/henrylhtsang
2024-09-20 18:35:09 +00:00
5997354151 Add more distributed examples (#130427)
1. Add `gather` example
2. Add device to `scatter` example
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130427
Approved by: https://github.com/kwen2501
2024-09-20 18:27:27 +00:00
df1eef9779 Revert "[torch][ao] Add customizable loss function to NodeAccuracySummary (#136282)"
This reverts commit f3c54ccf8f6139807f4623037c0174964a286652.

Reverted https://github.com/pytorch/pytorch/pull/136282 on behalf of https://github.com/huydhn due to This breaks OSS, let revert it and land the revert internally then ([comment](https://github.com/pytorch/pytorch/pull/136282#issuecomment-2364219252))
2024-09-20 17:49:06 +00:00
15dba021bb [ROCm][CI] upgrade CI to ROCm 6.2 (#132555)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132555
Approved by: https://github.com/pruthvistony, https://github.com/malfet
2024-09-20 17:39:31 +00:00
29affa6b95 return instead of using skipTest (#136244)
Summary:
Return from functions instead of using `skipTest`.
This is mostly to make our test report happier.
Skipped tests still show up in our Broken test report.

```
OK (skipped=1)
I0917 16:14:24.749060 1018907 StorageDemandControl.cpp:572] Flushing Demand Control ODS counters

Skipped: Store doesn't support extended APIs
```
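
A minimal stdlib sketch of the difference between the two styles. The capability flag and the test class names are hypothetical, not the tests touched by this PR.

```python
import unittest

STORE_SUPPORTS_EXTENDED_API = False  # hypothetical capability flag


class WithSkip(unittest.TestCase):
    def test_extended_api(self):
        if not STORE_SUPPORTS_EXTENDED_API:
            self.skipTest("Store doesn't support extended APIs")
        self.assertTrue(STORE_SUPPORTS_EXTENDED_API)


class WithReturn(unittest.TestCase):
    def test_extended_api(self):
        if not STORE_SUPPORTS_EXTENDED_API:
            return  # reported as passed, not skipped
        self.assertTrue(STORE_SUPPORTS_EXTENDED_API)


def run(case: type) -> unittest.TestResult:
    """Run one TestCase class and return the collected result."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(case)
    result = unittest.TestResult()
    suite.run(result)
    return result


skip_result = run(WithSkip)
return_result = run(WithReturn)
```

Both variants are successful runs, but only the first one is recorded under `skipped`. The trade-off is that an early `return` hides the fact that the body never ran.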

Test Plan:
Tested locally.
Test shows up as passed instead of skipped.

```
Cache hits: 99%. Commands: 125048 (cached: 124961, remote: 10, local: 77)
Tests finished: Pass 1. Fail 0. Fatal 0. Skip 0. Build failure 0
```

Differential Revision: D62912065

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136244
Approved by: https://github.com/XilunWu
2024-09-20 17:36:28 +00:00
d7a6980078 [inductor] Make DtypeView work with cpp_wrapper without abi_compatible (#136233)
Fixes #136159

Prior to this PR, using cpp_wrapper without abi_compatible could result in incorrect dtypes.

The following block of code implements cpp_wrapper codegen for reinterpret_view for abi_compatible mode, but not for non-abi_compatible mode.

f6f1504d39/torch/_inductor/codegen/cpp_wrapper_cpu.py (L1678-L1814)

Added a test that verifies that we keep the view behavior, but returned tensors also have correct dtypes.
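
As a loose, torch-free analogy for the property the test checks (same underlying memory, but the reinterpreted dtype is what the result reports), here is a sketch using `memoryview.cast` from the standard library. It assumes a platform where C `float` and `int` are both 4 bytes; it does not use or model the inductor codegen.

```python
import array

buf = array.array("f", [1.0, 2.0])

# Reinterpret the float32 bytes as 32-bit ints without copying.
as_int = memoryview(buf).cast("B").cast("i")

first_bits = as_int[0]      # bit pattern of 1.0f
view_format = as_int.format  # the view reports the new element type

# Writing through the view changes the original buffer: it is a view, not a copy.
as_int[1] = 0
second_value = buf[1]
```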

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136233
Approved by: https://github.com/FindHao, https://github.com/eellison, https://github.com/jansel
2024-09-20 17:30:35 +00:00
080a249fc2 S390x update builder image (#132983)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132983
Approved by: https://github.com/huydhn, https://github.com/malfet
2024-09-20 17:26:26 +00:00
783c5ba80a Revert "[PT2/Profiler] Add Context Info to Torch-Compiled Regions (#132765)"
This reverts commit 0b81f700aa7eb20d4b9f20e9627dd1208e50ea58.

Reverted https://github.com/pytorch/pytorch/pull/132765 on behalf of https://github.com/ezyang due to implementation is not correct, needs full rewrite ([comment](https://github.com/pytorch/pytorch/pull/132765#issuecomment-2364160452))
2024-09-20 17:10:27 +00:00
cdef760560 [aotd] Fix freezing API for subclasses (#136265)
Original issue:
https://github.com/pytorch/ao/issues/890

The problem:

TracingContext.flat_params contains the original params, with Subclasses not desugared,
while the inductor.freezing API works on aot graphs, where Subclasses are already desugared.

flat_params is used only for this logic, so storing desugared subclasses in it fixes the issue.

Testing:
```
python test/functorch/test_aotdispatch.py -k test_inductor_freezing_with_subclasses
```
Torch AO original failure:
```
python test/integration/test_integration.py -k test_int8_weight_only_quant_with_freeze
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136265
Approved by: https://github.com/bdhirsh
2024-09-20 16:32:49 +00:00
4842f0fac6 Enable torch build with SLEEF on ARM by default (#133339)
**Scope:** Enable PyTorch build with SLEEF on Arm by default. Enable codegen kernels compilation with SLEEF on ARM platform.

Enabling the build with SLEEF by default and setting `AT_BUILD_ARM_VEC256_WITH_SLEEF` as the default for Arm improves performance for some models. I have benchmarked several networks on `Neoverse-V1` using `torch.compile` with the `inductor` backend.
On models like `hf_Bert_Large` and `hf_GPT_fast`, we're seeing a **~1.2x speedup** (with 16 threads).

The results below were run with `Batch_Size=1` and `Cores=8, 16`

![Screenshot 2024-08-27 at 17 04 23](https://github.com/user-attachments/assets/319c7ef7-1202-4145-a51a-7a80dfd5f1f6)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133339
Approved by: https://github.com/malfet, https://github.com/kimishpatel

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-09-20 16:02:32 +00:00
f3c54ccf8f [torch][ao] Add customizable loss function to NodeAccuracySummary (#136282)
Summary:
Add a customizable loss function callback to NodeAccuracySummary to
allow users to pass in their own loss function.

Also, fix some type errors and propagate better exception messages when
unexpected tensor comparisons occur. Finally, enhance the robustness of
`generate_numeric_debug_handle` in the case where it is called multiple
times on the same model, by avoiding reuse of the same IDs.

Test Plan: Added a test for this case in `test_numeric_debugger`.

Reviewed By: jerryzh168

Differential Revision: D62898297

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136282
Approved by: https://github.com/jerryzh168
2024-09-20 07:34:52 +00:00
687e5cf8c5 [inductor] Relax the conditions for loop split (#135335)
Summary:
This PR relaxes the conditions for loop split to support dynamic shape cases.
Now the conditions that need to be met to apply loop split optimization are as follows:

1. No reduction and no modular index for all nodes.
2. The indexing_exprs of all nodes contain only one (or more, but all the same) division, where the divisor is an integer, the dividend is one of the iter_vars, and this var, i.e. the dimension that needs to be split, is contiguous in all other indexing_exprs.

Example:
```
import torch
import torch.nn as nn

class GN(torch.nn.Module):
    def __init__(self, num_groups, num_channels):
        super(GN, self).__init__()
        self.gn = nn.GroupNorm(num_groups, num_channels)

    def forward(self, x):
        return self.gn(x)

input = torch.randn(2, 960, 96, 96).to(memory_format=torch.channels_last)
m = GN(32, 960).eval()
compiled_m = torch.compile(m, dynamic=True)

with torch.no_grad():
    compiled_m(input)
```

Before loop split, the node's var_ranges: `{z0: s0, z1: s2, z2: s2, z3: 960}` and indexing_exprs: `{'index0': 960*s2**2*z0 + 960*s2*z1 + 960*z2 + z3, 'index1': 32*z0 + (z3//30), 'index2': 30*s2**2, 'index3': z3, 'index4': 960*s2*z0*((s2**2//s2)) + 960*z1*((s2**2//s2)) + 960*z2 + z3}`. After loop split `z3` will split to `30*z3 + z4`, then the node's var_ranges will be changed to `{z0: s0, z1: s2, z2: s2, z3: 32, z4: 30}` and indexing_exprs will be changed to `{'index0': 960*s2**2*z0 + 960*s2*z1 + 960*z2 + 30*z3 + z4, 'index1': 32*z0 + z3, 'index2': 30*s2**2, 'index3': 30*z3 + z4, 'index4': 960*s2*z0*((s2**2//s2)) + 960*z1*((s2**2//s2)) + 960*z2 + 30*z3 + z4}`
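
The index rewrite above can be checked with a few lines of plain Python. This is only a sanity check of the arithmetic for the example's sizes (960 channels, 32 groups, so a split factor of 30); the helper name `split_index` is hypothetical.

```python
CHANNELS = 960
FACTOR = 30  # divisor of the only division in the indexing expressions


def split_index(z3_old: int, factor: int = FACTOR):
    """Split the old iteration variable into (z3, z4) with z3_old == factor*z3 + z4."""
    return divmod(z3_old, factor)


def index1_before(z0: int, z3_old: int) -> int:
    return 32 * z0 + (z3_old // FACTOR)


def index1_after(z0: int, z3: int) -> int:
    return 32 * z0 + z3


# Every old index maps to exactly one (z3, z4) pair and back.
all_match = all(
    FACTOR * z3 + z4 == z3_old and index1_before(5, z3_old) == index1_after(5, z3)
    for z3_old in range(CHANNELS)
    for z3, z4 in [split_index(z3_old)]
)
new_ranges = (CHANNELS // FACTOR, FACTOR)  # ranges of (z3, z4)
```

After the split the division disappears from `index1`, and the inner variable `z4` is contiguous in the remaining expressions, which is what allows the second kernel loop to be vectorized in the generated code.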

Generated code:

- Before:
```
cpp_fused_native_group_norm_0 = async_compile.cpp_pybinding(['const float*', 'const float*', 'const float*', 'float*', 'float*', 'float*', 'const int64_t', 'const int64_t'], '''
#include "/tmp/torchinductor_jiayisun/32/c32dcqa3qidvmunis4lucp3dhoicleq5qjfjfgvpiadbbzfp6ofy.h"
extern "C"  void kernel(const float* in_ptr0,
                       const float* in_ptr1,
                       const float* in_ptr2,
                       float* out_ptr0,
                       float* out_ptr1,
                       float* out_ptr2,
                       const int64_t ks0,
                       const int64_t ks1)
{
    #pragma omp parallel num_threads(112)
    {
        int tid = omp_get_thread_num();
        {
            #pragma omp for collapse(2)
            for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(ks0); x0+=static_cast<int64_t>(1L))
            {
                for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(32L); x1+=static_cast<int64_t>(1L))
                {
                    {
                        Welford<float> tmp_acc0 = Welford<float>();
                        Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                        Welford<at::vec::Vectorized<float>> masked_tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                        static WeightRecp<at::vec::Vectorized<float>> wrecps0(static_cast<int64_t>(c10::div_floor_integer(static_cast<int64_t>((15L*(static_cast<int64_t>(ks1*ks1)))), static_cast<int64_t>(8L))));
                        for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(static_cast<int64_t>(ks1*ks1)); x2+=static_cast<int64_t>(1L))
                        {
                            for(int64_t x3=static_cast<int64_t>(0L); x3<static_cast<int64_t>(16L); x3+=static_cast<int64_t>(16L))
                            {
                                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x3 + (30L*x1) + (960L*x2) + (960L*x0*(static_cast<int64_t>(ks1*ks1)))), static_cast<int64_t>(16));
                                tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0, &wrecps0);
                            }
                            for(int64_t x3=static_cast<int64_t>(16L); x3<static_cast<int64_t>(30L); x3+=static_cast<int64_t>(14L))
                            {
                                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x3 + (30L*x1) + (960L*x2) + (960L*x0*(static_cast<int64_t>(ks1*ks1)))), static_cast<int64_t>(14L));
                                masked_tmp_acc0_vec = welford_combine(masked_tmp_acc0_vec, tmp0, static_cast<int64_t>(14L), &wrecps0);
                            }
                        }
                        tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(masked_tmp_acc0_vec));
                        tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(tmp_acc0_vec));
                        out_ptr0[static_cast<int64_t>(x1 + (32L*x0))] = static_cast<float>(tmp_acc0.mean);
                        out_ptr1[static_cast<int64_t>(x1 + (32L*x0))] = static_cast<float>(tmp_acc0.m2);
                    }
                }
            }
        }
        {
            #pragma omp for collapse(2)
            for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(ks0); x0+=static_cast<int64_t>(1L))
            {
                for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(ks1); x1+=static_cast<int64_t>(1L))
                {
                    #pragma GCC ivdep
                    for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(ks1); x2+=static_cast<int64_t>(1L))
                    {
                        #pragma GCC ivdep
                        for(int64_t x3=static_cast<int64_t>(0L); x3<static_cast<int64_t>(960L); x3+=static_cast<int64_t>(1L))
                        {
                            auto tmp0 = in_ptr0[static_cast<int64_t>(x3 + (960L*x2) + (960L*ks1*x1) + (960L*x0*(static_cast<int64_t>(ks1*ks1))))];
                            auto tmp1 = out_ptr0[static_cast<int64_t>((32L*x0) + (c10::div_floor_integer(static_cast<int64_t>(x3), static_cast<int64_t>(30L))))];
                            auto tmp3 = out_ptr1[static_cast<int64_t>((32L*x0) + (c10::div_floor_integer(static_cast<int64_t>(x3), static_cast<int64_t>(30L))))];
                            auto tmp11 = in_ptr1[static_cast<int64_t>(x3)];
                            auto tmp13 = in_ptr2[static_cast<int64_t>(x3)];
                            auto tmp2 = decltype(tmp0)(tmp0 - tmp1);
                            auto tmp4 = 30L*(static_cast<int64_t>(ks1*ks1));
                            auto tmp5 = c10::convert<float>(tmp4);
                            auto tmp6 = tmp3 / tmp5;
                            auto tmp7 = static_cast<float>(1e-05);
                            auto tmp8 = decltype(tmp6)(tmp6 + tmp7);
                            auto tmp9 = 1 / std::sqrt(tmp8);
                            auto tmp10 = decltype(tmp2)(tmp2 * tmp9);
                            auto tmp12 = decltype(tmp10)(tmp10 * tmp11);
                            auto tmp14 = decltype(tmp12)(tmp12 + tmp13);
                            out_ptr2[static_cast<int64_t>(x3 + (960L*x2) + (960L*x1*(c10::div_floor_integer(static_cast<int64_t>((static_cast<int64_t>(ks1*ks1))), static_cast<int64_t>(ks1)))) + (960L*ks1*x0*(c10::div_floor_integer(static_cast<int64_t>((static_cast<int64_t>(ks1*ks1))), static_cast<int64_t>(ks1)))))] = tmp14;
                        }
                    }
                }
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    arg0_1, arg1_1, arg2_1, arg3_1, arg4_1 = args
    args.clear()
    s0 = arg2_1
    s2 = arg3_1
    assert_size_stride(arg0_1, (960, ), (1, ))
    assert_size_stride(arg1_1, (960, ), (1, ))
    assert_size_stride(arg4_1, (s0, 960, s2, s2), (960*(s2*s2), 1, 960*s2, 960))
    buf0 = empty_strided_cpu((s0, 32, 1, 1), (32, 1, 32*s0, 32*s0), torch.float32)
    buf1 = empty_strided_cpu((s0, 32, 1, 1), (32, 1, 32*s0, 32*s0), torch.float32)
    buf3 = empty_strided_cpu((s0, 960, s2, s2), (960*s2*((s2*s2) // s2), 1, 960*((s2*s2) // s2), 960), torch.float32)
    cpp_fused_native_group_norm_0(arg4_1, arg0_1, arg1_1, buf0, buf1, buf3, s0, s2)
    del arg0_1
    del arg1_1
    del arg4_1
    return (buf3, )
```

- After:
```
cpp_fused_native_group_norm_0 = async_compile.cpp_pybinding(['const float*', 'const float*', 'const float*', 'float*', 'float*', 'float*', 'const int64_t', 'const int64_t'], '''
#include "/tmp/torchinductor_jiayisun/32/c32dcqa3qidvmunis4lucp3dhoicleq5qjfjfgvpiadbbzfp6ofy.h"
extern "C"  void kernel(const float* in_ptr0,
                       const float* in_ptr1,
                       const float* in_ptr2,
                       float* out_ptr0,
                       float* out_ptr1,
                       float* out_ptr2,
                       const int64_t ks0,
                       const int64_t ks1)
{
    #pragma omp parallel num_threads(112)
    {
        int tid = omp_get_thread_num();
        {
            #pragma omp for collapse(2)
            for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(ks0); x0+=static_cast<int64_t>(1L))
            {
                for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(32L); x1+=static_cast<int64_t>(1L))
                {
                    {
                        Welford<float> tmp_acc0 = Welford<float>();
                        Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                        Welford<at::vec::Vectorized<float>> masked_tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                        static WeightRecp<at::vec::Vectorized<float>> wrecps0(static_cast<int64_t>(c10::div_floor_integer(static_cast<int64_t>((15L*(static_cast<int64_t>(ks1*ks1)))), static_cast<int64_t>(8L))));
                        for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(static_cast<int64_t>(ks1*ks1)); x2+=static_cast<int64_t>(1L))
                        {
                            for(int64_t x3=static_cast<int64_t>(0L); x3<static_cast<int64_t>(16L); x3+=static_cast<int64_t>(16L))
                            {
                                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x3 + (30L*x1) + (960L*x2) + (960L*x0*(static_cast<int64_t>(ks1*ks1)))), static_cast<int64_t>(16));
                                tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0, &wrecps0);
                            }
                            for(int64_t x3=static_cast<int64_t>(16L); x3<static_cast<int64_t>(30L); x3+=static_cast<int64_t>(14L))
                            {
                                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x3 + (30L*x1) + (960L*x2) + (960L*x0*(static_cast<int64_t>(ks1*ks1)))), static_cast<int64_t>(14L));
                                masked_tmp_acc0_vec = welford_combine(masked_tmp_acc0_vec, tmp0, static_cast<int64_t>(14L), &wrecps0);
                            }
                        }
                        tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(masked_tmp_acc0_vec));
                        tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(tmp_acc0_vec));
                        out_ptr0[static_cast<int64_t>(x1 + (32L*x0))] = static_cast<float>(tmp_acc0.mean);
                        out_ptr1[static_cast<int64_t>(x1 + (32L*x0))] = static_cast<float>(tmp_acc0.m2);
                    }
                }
            }
        }
        {
            #pragma omp for collapse(2)
            for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(ks0); x0+=static_cast<int64_t>(1L))
            {
                for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(ks1); x1+=static_cast<int64_t>(1L))
                {
                    #pragma GCC ivdep
                    for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(ks1); x2+=static_cast<int64_t>(1L))
                    {
                        #pragma GCC ivdep
                        for(int64_t x3=static_cast<int64_t>(0L); x3<static_cast<int64_t>(32L); x3+=static_cast<int64_t>(1L))
                        {
                            for(int64_t x4=static_cast<int64_t>(0L); x4<static_cast<int64_t>(16L); x4+=static_cast<int64_t>(16L))
                            {
                                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x4 + (30L*x3) + (960L*x2) + (960L*ks1*x1) + (960L*x0*(static_cast<int64_t>(ks1*ks1)))), static_cast<int64_t>(16));
                                auto tmp1 = out_ptr0[static_cast<int64_t>(x3 + (32L*x0))];
                                auto tmp4 = out_ptr1[static_cast<int64_t>(x3 + (32L*x0))];
                                auto tmp13 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<int64_t>(x4 + (30L*x3)), static_cast<int64_t>(16));
                                auto tmp15 = at::vec::Vectorized<float>::loadu(in_ptr2 + static_cast<int64_t>(x4 + (30L*x3)), static_cast<int64_t>(16));
                                auto tmp2 = at::vec::Vectorized<float>(tmp1);
                                auto tmp3 = tmp0 - tmp2;
                                auto tmp5 = 30L*(static_cast<int64_t>(ks1*ks1));
                                auto tmp6 = c10::convert<float>(tmp5);
                                auto tmp7 = tmp4 / tmp6;
                                auto tmp8 = static_cast<float>(1e-05);
                                auto tmp9 = decltype(tmp7)(tmp7 + tmp8);
                                auto tmp10 = 1 / std::sqrt(tmp9);
                                auto tmp11 = at::vec::Vectorized<float>(tmp10);
                                auto tmp12 = tmp3 * tmp11;
                                auto tmp14 = tmp12 * tmp13;
                                auto tmp16 = tmp14 + tmp15;
                                tmp16.store(out_ptr2 + static_cast<int64_t>(x4 + (30L*x3) + (960L*x2) + (960L*x1*(c10::div_floor_integer(static_cast<int64_t>((static_cast<int64_t>(ks1*ks1))), static_cast<int64_t>(ks1)))) + (960L*ks1*x0*(c10::div_floor_integer(static_cast<int64_t>((static_cast<int64_t>(ks1*ks1))), static_cast<int64_t>(ks1))))));
                            }
                            for(int64_t x4=static_cast<int64_t>(16L); x4<static_cast<int64_t>(30L); x4+=static_cast<int64_t>(14L))
                            {
                                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x4 + (30L*x3) + (960L*x2) + (960L*ks1*x1) + (960L*x0*(static_cast<int64_t>(ks1*ks1)))), static_cast<int64_t>(14L));
                                auto tmp1 = out_ptr0[static_cast<int64_t>(x3 + (32L*x0))];
                                auto tmp4 = out_ptr1[static_cast<int64_t>(x3 + (32L*x0))];
                                auto tmp13 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<int64_t>(x4 + (30L*x3)), static_cast<int64_t>(14L));
                                auto tmp15 = at::vec::Vectorized<float>::loadu(in_ptr2 + static_cast<int64_t>(x4 + (30L*x3)), static_cast<int64_t>(14L));
                                auto tmp2 = at::vec::Vectorized<float>(tmp1);
                                auto tmp3 = tmp0 - tmp2;
                                auto tmp5 = 30L*(static_cast<int64_t>(ks1*ks1));
                                auto tmp6 = c10::convert<float>(tmp5);
                                auto tmp7 = tmp4 / tmp6;
                                auto tmp8 = static_cast<float>(1e-05);
                                auto tmp9 = decltype(tmp7)(tmp7 + tmp8);
                                auto tmp10 = 1 / std::sqrt(tmp9);
                                auto tmp11 = at::vec::Vectorized<float>(tmp10);
                                auto tmp12 = tmp3 * tmp11;
                                auto tmp14 = tmp12 * tmp13;
                                auto tmp16 = tmp14 + tmp15;
                                tmp16.store(out_ptr2 + static_cast<int64_t>(x4 + (30L*x3) + (960L*x2) + (960L*x1*(c10::div_floor_integer(static_cast<int64_t>((static_cast<int64_t>(ks1*ks1))), static_cast<int64_t>(ks1)))) + (960L*ks1*x0*(c10::div_floor_integer(static_cast<int64_t>((static_cast<int64_t>(ks1*ks1))), static_cast<int64_t>(ks1))))), static_cast<int64_t>(14L));
                            }
                        }
                    }
                }
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    arg0_1, arg1_1, arg2_1, arg3_1, arg4_1 = args
    args.clear()
    s0 = arg2_1
    s2 = arg3_1
    assert_size_stride(arg0_1, (960, ), (1, ))
    assert_size_stride(arg1_1, (960, ), (1, ))
    assert_size_stride(arg4_1, (s0, 960, s2, s2), (960*(s2*s2), 1, 960*s2, 960))
    buf0 = empty_strided_cpu((s0, 32, 1, 1), (32, 1, 32*s0, 32*s0), torch.float32)
    buf1 = empty_strided_cpu((s0, 32, 1, 1), (32, 1, 32*s0, 32*s0), torch.float32)
    buf3 = empty_strided_cpu((s0, 960, s2, s2), (960*s2*((s2*s2) // s2), 1, 960*((s2*s2) // s2), 960), torch.float32)
    cpp_fused_native_group_norm_0(arg4_1, arg0_1, arg1_1, buf0, buf1, buf3, s0, s2)
    del arg0_1
    del arg1_1
    del arg4_1
    return (buf3, )
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135335
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jansel
2024-09-20 05:42:52 +00:00
cf31724db7 Fixes and improvements toward 3.13t (#136319)
Small part of https://github.com/pytorch/pytorch/pull/130689
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136319
Approved by: https://github.com/malfet, https://github.com/Skylion007
2024-09-20 04:22:18 +00:00
e3ea5429f2 Implement GetAttrVariable.as_python_constant() (#134216)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134216
Approved by: https://github.com/amjames, https://github.com/williamwen42
2024-09-20 03:44:43 +00:00
d9aca9914b Remove duplicated words in library.rst (#136340)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136340
Approved by: https://github.com/svekars
2024-09-20 03:30:54 +00:00
fe0e9fb385 Fix flaky SIGSEGV crash in test_profile_memory (#136304)
Fixes https://github.com/pytorch/pytorch/issues/132331

We need another barrier here to ensure that the main thread doesn't stop the profiler while other threads are still using it (and crash).  I can reliably reproduce the issue with `pytest -v test/profiler/test_cpp_thread.py -k test_profile_memory --flake-finder`.
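
The ordering problem described above can be sketched in plain Python. This is only an illustration under stated assumptions: `FakeProfiler`, `run`, and the worker count are hypothetical stand-ins, not the code of the actual C++ test. The second barrier guarantees that every worker has finished recording before the main thread calls `stop()`.

```python
import threading


class FakeProfiler:
    """Hypothetical stand-in for a shared resource that must not be used after stop()."""

    def __init__(self):
        self.running = True
        self.used_after_stop = False
        self.events = 0
        self._lock = threading.Lock()

    def record(self):
        with self._lock:
            if not self.running:
                self.used_after_stop = True
            self.events += 1

    def stop(self):
        with self._lock:
            self.running = False


def run(num_workers=4):
    profiler = FakeProfiler()
    # Each barrier has one slot per worker plus one for the main thread.
    start = threading.Barrier(num_workers + 1)
    done = threading.Barrier(num_workers + 1)

    def worker():
        start.wait()       # all threads begin together
        profiler.record()  # use the shared profiler
        done.wait()        # signal that this worker no longer needs it

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    start.wait()
    done.wait()            # second barrier: no worker is still recording
    profiler.stop()
    for t in threads:
        t.join()
    return profiler
```

Without the `done` barrier, the main thread could reach `stop()` while a worker is still inside `record()`, which is the race this commit describes.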

### Testing

`pytest -v test/profiler/test_cpp_thread.py --flake-finder` all passes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136304
Approved by: https://github.com/briancoutinho
2024-09-20 02:56:49 +00:00
d45b0151e5 Add deterministic path for CUDA cumsum (#136224)
Change `cumsum` to call its decomposition when `use_deterministic_algorithms(True)` and input is CUDA.

Fixes #89492

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136224
Approved by: https://github.com/ezyang, https://github.com/justinchuby
2024-09-20 02:41:56 +00:00
1dfa07e885 passing FileTimerRequests.to_json() to log_debug_info_for_expired_timers for a better debugging experience (#135913)
Summary: This change passes the expired timers to `log_debug_info_for_expired_timers` after `to_json()` has been applied, which gives the user a better debugging experience.

Test Plan: unit tests

Reviewed By: gag1jain

Differential Revision: D62408767

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135913
Approved by: https://github.com/gag1jain
2024-09-20 00:54:02 +00:00
bebf5302ba TCPStoreLibUvBackend: trace operations (#136320)
Summary:
This logs all operations when the tracing log level is enabled for the `TCPStoreLibUvBackend`. This is very useful for debugging collective operations when issues occur, as it logs all hosts and the keys that they're modifying. To minimize total data, we only log the keys and not the values.

This changes the C10D_* macros to be much more efficient. Previously we would always format the log string even if it would never be printed, which is very wasteful for detailed tracing. The macros are now gated with an if statement, which gives the same behavior with no overhead when the level is disabled.
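
The gating idea can be illustrated with a Python analogy. This is not the C++ macro code; the logger name and the helpers `describe`, `log_ungated`, and `log_gated` are hypothetical. The point is that the level check happens before the message is built.

```python
import logging

logger = logging.getLogger("c10d_sketch")  # hypothetical logger name

# Records every time a message is actually formatted.
format_calls = []


def describe(key, address):
    """Stand-in for an expensive string-formatting step."""
    format_calls.append((key, address))
    return f"set key:{key} address:{address}"


def log_ungated(key, address):
    # The message is built eagerly, even if DEBUG records are discarded.
    logger.debug(describe(key, address))


def log_gated(key, address):
    # The level check comes first, so formatting is skipped when disabled.
    if logger.isEnabledFor(logging.DEBUG):
        logger.debug(describe(key, address))
```

With the logger at `WARNING`, `log_ungated` still pays for `describe`, while `log_gated` does not.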

Test Plan:
```
TORCH_DISTRIBUTED_DEBUG=DETAIL torchrun --nnodes 1 --nproc_per_node 1 --no-python /bin/bash -c "echo foo"
```

```
I0919 09:26:52.352013 34271 TCPStore.cpp:285] [c10d - debug] The server has started on port = 29500.
I0919 09:26:52.352246 34271 socket.cpp:783] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (127.0.0.1, 29500).
I0919 09:26:52.352241 36903 TCPStoreLibUvBackend.cpp:1173] [c10d - debug] Uv main loop running
I0919 09:26:52.352308 34271 socket.cpp:854] [c10d - trace] The client socket is attempting to connect to [localhost]:29500.
I0919 09:26:52.353633 34271 socket.cpp:945] [c10d] The client socket has connected to [localhost]:29500 on SocketImpl(fd=41, addr=[localhost]:45646, remote=[localhost]:29500).
I0919 09:26:52.354422 34271 TCPStore.cpp:321] [c10d - debug] TCP client connected to host 127.0.0.1:29500
I0919 09:26:52.354558 36903 TCPStoreLibUvBackend.cpp:774] [c10d - trace] validate magic:1015412686 address:[localhost]:45646
I0919 09:26:52.354638 36903 TCPStoreLibUvBackend.cpp:789] [c10d - trace] ping nonce:34271 address:[localhost]:45646
I0919 09:26:52.356122 36903 TCPStoreLibUvBackend.cpp:866] [c10d - trace] add key:init/ val:1 address:[localhost]:45646
I0919 09:26:52.356308 36903 TCPStoreLibUvBackend.cpp:930] [c10d - trace] wait key_count:1 address:[localhost]:45646
I0919 09:26:52.356410 36903 TCPStoreLibUvBackend.cpp:846] [c10d - trace] get key:init/ address:[localhost]:45646
I0919 09:26:52.358688 36903 TCPStoreLibUvBackend.cpp:808] [c10d - trace] set key:/none/torchelastic/role_info/0 address:[localhost]:45646
I0919 09:26:52.360177 36903 TCPStoreLibUvBackend.cpp:930] [c10d - trace] wait key_count:1 address:[localhost]:45646
I0919 09:26:52.360296 36903 TCPStoreLibUvBackend.cpp:1004] [c10d - trace] multi_get key_count:1 address:[localhost]:45646
I0919 09:26:52.362076 36903 TCPStoreLibUvBackend.cpp:1036] [c10d - trace] multi_set key_count:1 address:[localhost]:45646
I0919 09:26:52.364001 36903 TCPStoreLibUvBackend.cpp:930] [c10d - trace] wait key_count:1 address:[localhost]:45646
I0919 09:26:52.364091 36903 TCPStoreLibUvBackend.cpp:846] [c10d - trace] get key:/none/torchelastic/assigned_ranks/0 address:[localhost]:45646
```

Differential Revision: D62924454

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136320
Approved by: https://github.com/c-p-i-o, https://github.com/XilunWu
2024-09-20 00:53:21 +00:00
9b424aac1d [CI][CUSPARSELT] Extend cusparselt installation script to support cuda 12.6 (#136321)
To prepare for future cuda updates.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136321
Approved by: https://github.com/Skylion007, https://github.com/eqy
2024-09-19 23:45:57 +00:00
172ecf78b7 DTensor: dont hash symint tensor input in propagate_tensor_meta (#136266)
This fixes a subset of issues for dynamic shapes + DTensor.

It's pretty easy to run into other issues - it's likely that we need https://github.com/pytorch/pytorch/pull/125941 to land for DTensor + dynamic shapes to work more generally. I ended up writing a test that had dynamic shape inputs but not dynamic shape outputs in order to properly test this fix

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136266
Approved by: https://github.com/ezyang, https://github.com/yf225
2024-09-19 20:39:36 +00:00
cyy
7bbdf87517 [22/N] Fix clang-tidy warnings in jit (#134829)
Follows  #134537

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134829
Approved by: https://github.com/ezyang
2024-09-19 19:24:42 +00:00
b71802fa79 add basic_modules_ListOfLinears_inductor_gpu_force_shape_pad (#136175)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136175
Approved by: https://github.com/ezyang
2024-09-19 19:15:50 +00:00
8cba0ec958 [AOTI][Tooling][8/n] Add option to pinpoint kernel names in debug printer (#136182)
Summary:
Add a third mode where we only print kernel names without dumping any intermediate actual tensor value info.

It can be helpful in quickly identifying the troublesome kernels in CUDA IMA issues.

Thanks to ColinPeppler and henrylhtsang for this "feature request".

Test Plan:
The output can look like this if you set `AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=3`:

{F1871629091}

Differential Revision: D62791371

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136182
Approved by: https://github.com/henrylhtsang
2024-09-19 18:51:57 +00:00
49723a8ff3 fix stride compare failed when size value equal to one in ForeachUtils.h (#134546)
When a size value equals one, the stride for that dimension should be skipped in the comparison.
@ezyang
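
A minimal sketch of the rule, assuming a hypothetical helper `strides_match` (this is not the code in ForeachUtils.h). A dimension of size one is only ever indexed at 0, so its stride never affects which element is addressed and can be ignored.

```python
from typing import Sequence


def strides_match(
    sizes: Sequence[int],
    strides_a: Sequence[int],
    strides_b: Sequence[int],
) -> bool:
    """Compare two stride tuples, ignoring dimensions whose size is one."""
    if not (len(sizes) == len(strides_a) == len(strides_b)):
        return False
    for size, stride_a, stride_b in zip(sizes, strides_a, strides_b):
        # Only dimensions that can be indexed past 0 need matching strides.
        if size != 1 and stride_a != stride_b:
            return False
    return True
```

For example, sizes `(2, 1, 3)` with strides `(3, 3, 1)` and `(3, 99, 1)` describe the same layout, because the middle dimension has size one.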
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134546
Approved by: https://github.com/janeyx99
2024-09-19 18:43:41 +00:00
ccca3de0cd [ROCm] Enable Flex attention tests on AMD gpus (#136245)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136245
Approved by: https://github.com/malfet
2024-09-19 18:02:41 +00:00
8d9c42735a Type _sympy/functions.py [1/n] (#136205)
Signed-off-by: Bob Ren <bobren@fb.com>

I was chatting with @jamesjwu about strategies to learn the code and he suggested adding types to some files. This stack of PRs adds types to _sympy/functions.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136205
Approved by: https://github.com/Skylion007, https://github.com/jamesjwu
2024-09-19 17:15:53 +00:00
803ce507f1 Log structured logging overhead to dynamo compile (kinda) (#136142)
Summary:
X-link: https://github.com/pytorch/benchmark/pull/2454

This adds structured logging overhead at a per compile basis to compilation metrics.

To do so, we track the frame_id_frame_compile_id that trace_structured uses to categorize compiles, and use that as the key in our timing table.

Implementation notes:
- If there are times we call trace_structured without a compile id, the time won't be measured. There is not really a good way around that today, given the compile id framework of compilation metrics. Strobelight is still the best way to measure on a per job basis.
- We don't actually measure the time it takes to log the compilation metrics itself. Fundamentally, it's not possible to log this properly if we're storing the logging number *in* compilation metrics, since there's no way to measure it before we do it (unless we want discrepancies between dynamo_compile and tlparse, which seems suboptimal). Hopefully for a large job, the cost of structured_logging compilation metrics itself is small.
- I wanted to use frame_phase_timing here, but there's a bunch of ids to iron out, and I don't really want to deal with that headache. compilation_time_metrics is sort of what I want, but that isn't by frame/compile id, so it's also a bit off. Putting it into torch.logging as a separate thing so logging tracks its own overhead seems fine, though.
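
The per-compile bookkeeping described in these notes can be sketched as follows. This is an illustration only: `overhead_s`, `timed_logging`, and `fake_trace_structured` are hypothetical names, and the example key format is an assumption rather than what Dynamo actually uses.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Hypothetical timing table: compile-id key -> accumulated seconds.
overhead_s = defaultdict(float)


@contextmanager
def timed_logging(compile_key):
    """Add the elapsed time of the wrapped block to the entry for compile_key."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Calls without a compile id are not attributed to any compile.
        if compile_key is not None:
            overhead_s[compile_key] += time.perf_counter() - start


def fake_trace_structured(payload, compile_key):
    """Stand-in for a structured-logging call; repr() plays the role of serialization."""
    with timed_logging(compile_key):
        return repr(payload)
```

Calls that pass `None` as the key leave the table untouched, which mirrors the first note above: logging without a compile id is not measured.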

Test Plan:
Run benchmarks/nanogpt and staging logger. See that the new compilation metric is logged to the staged dynamo_compile table:

https://fburl.com/scuba/logger_staging_jjwu_30582a48f1ff9cf5f4ac50a4c40af/xazjg5xq

Note that sum(structured_logging_overhead_s) / sum(entire_frame_compile_time) = 8.387 / 124.278 ≈ 6.7%, which seems reasonable as the overhead for a small compilation like this.

You can also look at samples for a more detailed log of this.

Reviewed By: oulgen

Differential Revision: D62643611

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136142
Approved by: https://github.com/bobrenjc93
2024-09-19 16:11:38 +00:00
65df26f615 [FSDP2] Fixed 2D mismatched grad placements (#136237)
```
CUDA_VISIBLE_DEVICES=2,3,6,7 pytest test/distributed/_composable/test_composability/test_2d_composability.py -k test_train_parity_2d_transformer
```

Differential Revision: [D62964658](https://our.internmc.facebook.com/intern/diff/D62964658)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136237
Approved by: https://github.com/weifengpy
2024-09-19 14:35:15 +00:00
4ea741d24f Revert "Reland D62220158 (#136213)"
This reverts commit 083c9149b75cd918b6fb2795050d7173923a3629.

Reverted https://github.com/pytorch/pytorch/pull/136213 on behalf of https://github.com/jeanschmidt due to Seems to have introduced regressions in rocm signals ([comment](https://github.com/pytorch/pytorch/pull/136213#issuecomment-2360885064))
2024-09-19 12:44:54 +00:00
bce52d0b60 [CODEMOD][caffe2] use npt.NDArray instead of np.ndarray in type annotations (#136288)
Summary:
To facilitate the PSS-2 upgrade, this uses `npt.NDArray` instead of `np.ndarray` in type annotations. In Numpy-1.19 (PSS-1) it's an alias to `np.ndarray`, so the change is a no-op there.
In Numpy-1.24, `npt.NDArray` is a proper generic type, and without this change uses of `np.ndarray` generate this Pyre type error:
```counterexample
 Invalid type parameters [24]: Generic type `np.ndarray` expects 2 type parameters.
```

Test Plan: Sandcastle plus visual inspection

Differential Revision: D62977370

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136288
Approved by: https://github.com/kit1980
2024-09-19 12:40:36 +00:00
908a5689eb Return unsafe_view instead of view from matmul when folding occurs (#134568)
When tensor folding occurs during a matmul operation, the returned tensor is a view. This can cause issues when matmul is used inside a custom function and the view is then returned as output: it cannot be modified in place, which causes errors.
This is especially problematic when an in-place allreduce is performed after such a function.
The issue is resolved by returning unsafe_view from matmul instead. This aligns the matmul decomposition with the eager implementation, so that a non-view tensor is returned.

Test included in this PR reproduces the issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134568
Approved by: https://github.com/zou3519
2024-09-19 11:52:16 +00:00
db80b98ec4 XFAIL test_segfault (#136252)
Fixes https://github.com/pytorch/pytorch/issues/128551

This test has been failing in trunk for a while, and there is no owner yet to fix it properly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136252
Approved by: https://github.com/andrewkho
2024-09-19 04:17:06 +00:00
775517693a Add type checks for Tensor.add_ (#135864)
Fixes  #127049

There's already a meta func in `meta_registrations.py` for the `add_` and `sub_` methods. I added a second meta function for error checking, i.e., `int.add/sub_(float)` and `bool.add/sub_(other types)`.
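
The kind of check involved can be sketched in plain Python. This is an assumption about the general rule, not the meta function added in this PR: an in-place op cannot store a result whose dtype category is higher than that of the tensor being modified. The category ranking and the name `check_inplace_dtype` are hypothetical, and other cases (for example subtraction between two bool tensors) are not modeled.

```python
# Hypothetical ranking of dtype categories, lowest to highest.
CATEGORY = {"bool": 0, "int": 1, "float": 2, "complex": 3}


def check_inplace_dtype(self_kind: str, other_kind: str) -> None:
    """Raise if an in-place op would need to write a higher-category result into self."""
    if CATEGORY[other_kind] > CATEGORY[self_kind]:
        raise RuntimeError(
            f"result type {other_kind} can't be cast to "
            f"the desired output type {self_kind}"
        )
```

Under this sketch, `check_inplace_dtype("int", "float")` and `check_inplace_dtype("bool", "int")` raise, while `check_inplace_dtype("float", "int")` passes.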

The corresponding test now also passes with Dynamo, so `@xfailIfTorchDynamo` was removed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135864
Approved by: https://github.com/williamwen42
2024-09-19 03:09:36 +00:00
e037bb326f [dynamo] fix crash in InspectSignatureVariable (#136010)
Fix crash that was happening in https://github.com/pytorch/pytorch/issues/128095, because we were trying to extract a constant incorrectly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136010
Approved by: https://github.com/yanboliang, https://github.com/anijain2305, https://github.com/jansel
2024-09-19 00:23:00 +00:00
f2b0fc89f2 Add uint16 support for observer (#136238)
Summary:
att

Test Plan:
python test/test_quantization.py -k TestObserver

Differential Revision: [D62909821](https://our.internmc.facebook.com/intern/diff/D62909821)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136238
Approved by: https://github.com/tarun292
2024-09-18 23:52:18 +00:00
068c80e6b6 [BE][MPS] Fix deprecation warnings on MacOS 15.0 (#136292)
[reverseSquareRootWithTensor:](https://developer.apple.com/documentation/metalperformanceshadersgraph/mpsgraph/reversesquareroot(with:name:)?changes=__8&language=objc) was deprecated in favor of [reciprocalSquareRootWithTensor:](https://developer.apple.com/documentation/metalperformanceshadersgraph/mpsgraph/reciprocalsquareroot(_:name:)?changes=__8&language=objc)

Without this change, the following warnings are generated when compiling on the recently released macOS Sequoia:
```
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/Normalization.mm:720:35: warning: 'reverseSquareRootWithTensor:name:' is deprecated: first deprecated in macOS 15.0 [-Wdeprecated-declarations]
  720 |           rsqrtTensor = [mpsGraph reverseSquareRootWithTensor:varianceEpsTensor name:nil];
      |                                   ^~~~~~~~~~~~~~~~~~~~~~~~~~~
      |                                   reciprocalSquareRootWithTensor
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/c++/v1/__type_traits/invoke.h:341:10: note: in instantiation of function template specialization 'at::native::batch_norm_backward_mps(const Tensor &, const Tensor &, const std::optional<Tensor> &, const std::optional<Tensor> &, const std::optional<Tensor> &, const std::optional<Tensor> &, const std::optional<Tensor> &, bool, double, std::array<bool, 3>)::(anonymous class)::operator()<MPSGraph *, CachedGraph *>' requested here
  341 | decltype(std::declval<_Fp>()(std::declval<_Args>()...))
      |          ^
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/c++/v1/__type_traits/invoke.h:351:19: note: while substituting deduced template arguments into function template '__invoke' [with _Fp = (lambda at /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/Normalization.mm:623:68) &, _Args = <MPSGraph *, CachedGraph *>]
  351 |   static decltype(std::__invoke(std::declval<_XFp>(), std::declval<_XArgs>()...)) __try_call(int);
      |                   ^
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/c++/v1/__type_traits/invoke.h:357:28: note: while substituting deduced template arguments into function template '__try_call' [with _XFp = (lambda at /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/Normalization.mm:623:68) &, _XArgs = (no value)]
  357 |   using _Result = decltype(__try_call<_Fp, _Args...>(0));
      |                            ^
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/c++/v1/__type_traits/conjunction.h:27:32: note: in instantiation of template class 'std::__invokable_r<void, (lambda at /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/Normalization.mm:623:68) &, MPSGraph *, CachedGraph *>' requested here
   27 | __expand_to_true<__enable_if_t<_Pred::value>...> __and_helper(int);
      |                                ^
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/c++/v1/__type_traits/conjunction.h:38:39: note: while substituting explicitly-specified template arguments into function template '__and_helper'
   38 | using _And _LIBCPP_NODEBUG = decltype(std::__and_helper<_Pred...>(0));
      |                                       ^
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/c++/v1/__functional/function.h:828:20: note: (skipping 1 context in backtrace; use -ftemplate-backtrace-limit=0 to see all)
  828 |             bool = _And< _IsNotSame<__remove_cvref_t<_Fp>, function>, __invokable<_Fp, _ArgTypes...> >::value>
      |                    ^
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/c++/v1/__functional/function.h:841:49: note: in instantiation of default argument for '__callable<(lambda at /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/Normalization.mm:623:68) &>' required here
  841 |   using _EnableIfLValueCallable = __enable_if_t<__callable<_Fp&>::value>;
      |                                                 ^~~~~~~~~~~~~~~~
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/c++/v1/__functional/function.h:851:32: note: in instantiation of template type alias '_EnableIfLValueCallable' requested here
  851 |   template <class _Fp, class = _EnableIfLValueCallable<_Fp>>
      |                                ^
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/c++/v1/__functional/function.h:852:25: note: in instantiation of default argument for 'function<(lambda at /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/Normalization.mm:623:68)>' required here
  852 |   _LIBCPP_HIDE_FROM_ABI function(_Fp);
      |                         ^~~~~~~~~~~~~
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/Normalization.mm:623:68: note: while substituting deduced template arguments into function template 'function' [with _Fp = (lambda at /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/Normalization.mm:623:68), $1 = (no value)]
  623 |     auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
      |                                                                    ^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/Normalization.mm:623:24: note: while substituting deduced template arguments into function template 'LookUpOrCreateCachedGraph' [with T = CachedGraph]
  623 |     auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
      |                        ^
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/System/Library/Frameworks/MetalPerformanceShadersGraph.framework/Headers/MPSGraphArithmeticOps.h:123:1: note: 'reverseSquareRootWithTensor:name:' has been explicitly marked deprecated here
  123 | -(MPSGraphTensor *) reverseSquareRootWithTensor:(MPSGraphTensor *) tensor
      | ^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/Normalization.mm:745:37: warning: 'reverseSquareRootWithTensor:name:' is deprecated: first deprecated in macOS 15.0 [-Wdeprecated-declarations]
  745 |             rsqrtTensor = [mpsGraph reverseSquareRootWithTensor:varianceEpsTensor name:nil];
      |                                     ^~~~~~~~~~~~~~~~~~~~~~~~~~~
      |                                     reciprocalSquareRootWithTensor
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/System/Library/Frameworks/MetalPerformanceShadersGraph.framework/Headers/MPSGraphArithmeticOps.h:123:1: note: 'reverseSquareRootWithTensor:name:' has been explicitly marked deprecated here
  123 | -(MPSGraphTensor *) reverseSquareRootWithTensor:(MPSGraphTensor *) tensor
      | ^
2 warnings generated.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136292
Approved by: https://github.com/kit1980
2024-09-18 23:38:31 +00:00
b9a197df77 [BE][MPS] Delete duplicated code in View.mm (#136295)
After https://github.com/pytorch/pytorch/pull/135706 `getGatherScatterScalarType` returns exactly the same results as `scalarToMetalTypeString`, so delete the function and call `scalarToMetalTypeString`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136295
Approved by: https://github.com/kit1980
2024-09-18 22:44:43 +00:00
f1ad680818 [dynamo]Remove stream hardcoding in dynamo VariableBuilder (#131763)
A recent change in PR #123487 used `torch.cuda.Stream` directly, which causes failures for other backends. This PR generalizes the stream handling for all backends, such as cuda/hpu/xpu.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131763
Approved by: https://github.com/yanboliang, https://github.com/yf225
2024-09-18 22:32:34 +00:00
bc9597b7d8 [Traceable FSDP2] Minor refactor to traceable FSDP2 unit tests (#136219)
Changes in this PR:
- Monkey-patching `F.scaled_dot_product_attention` with a lambda seems to not work in some cases. This PR avoids using a lambda.
- Running `fullgraph=True` and `fullgraph=False` in the same unit test seems to cause the two cases to interfere with each other and causes errors. This PR splits them into two separate unit tests.
- The checks in the unit tests might not work with compile cache. This PR turns off the cache in order to have a more predictable compile behavior to do unit test on.

Test commands:
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_nested_fully_shard_backend_inductor_fullgraph_True`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_nested_fully_shard_backend_inductor_fullgraph_False`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_backend_inductor_fullgraph_True`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_backend_inductor_fullgraph_False`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136219
Approved by: https://github.com/yifuwang
2024-09-18 22:30:23 +00:00
1a86d8aa29 Fix calling Add._from_args and Mul._from_args (#136143)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136143
Approved by: https://github.com/ezyang
2024-09-18 20:51:04 +00:00
aae68e2976 Add wait counter for nccl abort (#136067)
Summary:
Quite a few times, we have seen the NCCL PG abort take too long. There's no easy way to measure this, so let's add a counter that measures it across the stack.

This will help us measure how much time the NCCL abort takes.

Test Plan:
Unit tests

Reviewed By: c-p-i-o

Differential Revision: D62675010

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136067
Approved by: https://github.com/fduwjj
2024-09-18 20:14:10 +00:00
eqy
68a7246f13 [cuDNN][conv][A100] Bump tolerances for vmap_autograd_grad conv2d on A100 (#136178)
Likely due to a cuDNN heuristics update.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136178
Approved by: https://github.com/Skylion007
2024-09-18 19:42:13 +00:00
5a6ddbcc3b Extending the Pytorch vec backend for SVE (ARM) (#119571)
**Motivation:**
In PyTorch, ATen vectorization supports multiple platforms, including x86 and Arm, as well as multiple data types. It provides a generic implementation of the Vector (Vec) type that allows the programmer to write code packing various primitives (such as floats) within 256-bit and 512-bit registers. It can be extended to support other ISAs easily by adding more VecISA sub-classes.

**Reference Link:** https://github.com/pytorch/pytorch/tree/main/aten/src/ATen/cpu/vec

**This PR:**

* Our goal with this contribution is to add an SVE backend for Vec to the ATen vectorization for the CPU backend, which can benefit any ARM CPU that supports SVE.

* More about SVE ISA for ARM: [https://developer.arm.com/Architectures/Scalable Vector Extensions](https://developer.arm.com/Architectures/Scalable%20Vector%20Extensions)

* We are using the ARM C Language Extensions for SVE (https://developer.arm.com/documentation/102699/0100/Optimizing-with-intrinsics) to accelerate performance for various operators in the SVE backend for Vec.

* Currently we are adding support only for the SVE ISA with a vector length of 256 bits (SVE 256). In the future, we plan to extend this SVE support to other vector lengths as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119571
Approved by: https://github.com/malfet, https://github.com/snadampal

Co-authored-by: Divya Kotadiya <divya.kotadiya@fujitsu.com>
2024-09-18 18:59:10 +00:00
bad69044d8 [ROCm] upgrade ROCm CI builds to py3.10 (#134108)
Upgrade ROCm CI builds to py3.10

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134108
Approved by: https://github.com/jeffdaily, https://github.com/jithunnair-amd, https://github.com/atalman
2024-09-18 17:39:34 +00:00
3efaa016b1 [c10d] Make test compatible for new pytest (#136158)
Temporary fix to the issue in https://github.com/pytorch/pytorch/issues/127517.

Short-term fix following CPython: 51aefc5bf9/Lib/unittest/case.py (L419-L426)

Differential Revision: [D62878083](https://our.internmc.facebook.com/intern/diff/D62878083)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136158
Approved by: https://github.com/fegin
2024-09-18 17:10:55 +00:00
605f2d802a [PyTorch] Remove unnecessary include of c10/util/Exception.h in irange.h (#136202)
Manually audited and can't figure out why this would be needed.

Differential Revision: [D62879500](https://our.internmc.facebook.com/intern/diff/D62879500/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136202
Approved by: https://github.com/malfet
2024-09-18 16:57:15 +00:00
6a6f5b20c5 Add _addmm_activation to lower precision cast policy on AutocastCPU (#135936)
Fixes #132613.
Add `_addmm_activation` to lower precision cast policy on AutocastCPU.
`_addmm_activation` (https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/transformers/transformer.cpp#L39) in `transformer_encoder_layer_forward` may throw `RuntimeError: mat1 and mat2 must have the same dtype, but got BFloat16 and Float` when autocast is enabled. `_native_multi_head_attention` was put in the lower data type cast policy in https://github.com/pytorch/pytorch/pull/107674, so `_addmm_activation` may encounter mixed data types.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135936
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-09-18 16:31:27 +00:00
c8d152cb0e Fix fast_expand recursion error (#136163)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136163
Approved by: https://github.com/ezyang
2024-09-18 13:58:45 +00:00
701ba5203f [Inductor] Increase multiplier to 3 for Inductor AMP FP16 benchmark correctness check (#135932)
Fix https://github.com/pytorch/pytorch/issues/135657.
Aligned with AMP BF16, using multiplier 3 for Inductor AMP FP16 benchmark correctness check

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135932
Approved by: https://github.com/CaoE, https://github.com/jgong5, https://github.com/jansel
2024-09-18 13:03:45 +00:00
b5be4d8c05 Fix ROCm skip decorator for test_ddp_tp and multiprocess UTs (#136161)
skip_if_rocm should be used only in the multiprocess case (when the UT test class is a child of MultiProcessTestCase), where each individual process can exit with a skip code. If it is used for a single-process UT, the UT fails because the process returns a non-zero exit code. Use skipIfRocm in single-process UTs.

To avoid the above confusion, this PR renamed skip_if_rocm to skip_if_rocm_multiprocess.
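
The difference between the two skip styles can be sketched with the standard library alone. Everything here is a hypothetical stand-in: `SKIP_EXIT_CODE`, `ON_ROCM`, and `skip_if_rocm_multiprocess_sketch` are illustrative names and values, not PyTorch's actual decorators or exit codes.

```python
import functools
import sys
import unittest

SKIP_EXIT_CODE = 75  # hypothetical value; a parent process would map it to "skipped"
ON_ROCM = True       # hypothetical flag standing in for a real platform check


def skip_if_rocm_multiprocess_sketch(fn):
    """Multiprocess style: the child process exits with a special skip code."""

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        if ON_ROCM:
            # Meaningful only when a parent process interprets the exit code.
            sys.exit(SKIP_EXIT_CODE)
        return fn(*args, **kwargs)

    return wrapper


class SingleProcessTest(unittest.TestCase):
    # Single-process style: the test runner itself records the skip.
    @unittest.skipIf(ON_ROCM, "not supported on ROCm")
    def test_feature(self):
        self.fail("should have been skipped")
```

In a single-process run nobody translates the exit code back into a skip, so the exit-code style shows up as a failure. That is the confusion the rename is meant to prevent.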

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136161
Approved by: https://github.com/jithunnair-amd, https://github.com/kwen2501, https://github.com/fegin
2024-09-18 11:01:23 +00:00
083c9149b7 Reland D62220158 (#136213)
Summary: We fix the unit test test_pad_mm and reland the diff

Test Plan: See in D62220158

Differential Revision: D62891584

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136213
Approved by: https://github.com/dshi7
2024-09-18 07:33:41 +00:00
a0207c8471 [dynamo] Fix support for classmethod(property(...)) (#134968)
Fixes #134451

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134968
Approved by: https://github.com/yanboliang
2024-09-18 04:47:51 +00:00
9aa22eabe7 [CI] Make linux-aarch64 shards actually running different tests (#136208)
Non-functional sharding was introduced in https://github.com/pytorch/pytorch/pull/125255, but each shard in that case was running the same tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136208
Approved by: https://github.com/seemethere, https://github.com/ZainRizvi, https://github.com/atalman
2024-09-18 03:10:21 +00:00
8895f69d12 [torch/numpy][numpy2.0 compat] Additional changes for tests to run under numpy-2.0 (#136152)
Continuation of https://github.com/pytorch/pytorch/pull/131909. This PR makes numpy tests compatible with numpy>=2.0.0. Specifically it deals with APIs that have been removed from numpy-2.0.

Changes in this PR:
1. Use `numpy.exceptions.ComplexWarning` if `numpy.exceptions` namespace is present. In numpy-2.0 `numpy.ComplexWarning` has been removed in favor of using `numpy.exceptions.ComplexWarning` (see [numpy-2.0 migration guide](https://numpy.org/devdocs/numpy_2_0_migration_guide.html#changes-to-namespaces)). Note that `numpy.exceptions` was introduced in numpy-1.25.0 hence does not exist in numpy<=1.24.x.
2. Do the same for `numpy.exceptions.VisibleDeprecationWarning`
3. Use `np.sort(..., axis=0)` over `np.msort()` (`np.msort()` was removed in numpy-2.0)
4. Use `np.pad()` over `np.lib.pad()` (`np.lib` removed in numpy-2.0)
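
A minimal sketch of the compatibility pattern described in the list above, assuming only that numpy is installed. The `_exc` helper name and the sample arrays are illustrative and are not taken from the PR.

```python
import numpy as np

# `numpy.exceptions` exists from numpy 1.25 onward; on older releases the
# warning classes live in the top-level namespace, so fall back to `np`.
_exc = getattr(np, "exceptions", np)
ComplexWarning = _exc.ComplexWarning
VisibleDeprecationWarning = _exc.VisibleDeprecationWarning

a = np.array([[3, 1], [2, 4]])
sorted_rows = np.sort(a, axis=0)       # replacement for np.msort(a)
padded = np.pad(np.array([1, 2]), 1)   # replacement for np.lib.pad(...)
```

Resolving the namespace once keeps the rest of the test code identical across numpy versions.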

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136152
Approved by: https://github.com/atalman
2024-09-18 02:11:22 +00:00
6682327c75 [BE] Make NestedTensorTransformerFunctions.cu compilable without warnings (#136222)
Before the change, compilation produced the following warnings:
```
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cu: In function ‘std::tuple<dim3, dim3, at::native::StackArray<long int> > at::native::check_shape_and_partition_(const at::Tensor&, const std::vector<at::Tensor>&, const at::Tensor&)’:
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cu:584:22: warning: comparison of integer expressions of different signedness: ‘const int’ and ‘const size_t’ {aka ‘const long unsigned int’} [-Wsign-compare]
  584 |   TORCH_CHECK(num_jagged_dim <= kStackArrayMaxDims);
      |       ~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cu: In lambda function:
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cu:1224:1061: warning: comparison of integer expressions of different signedness: ‘long unsigned int’ and ‘int’ [-Wsign-compare]
 1224 |   AT_DISPATCH_INDEX_TYPES(
      |
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cu: In lambda function:
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cu:1224:1985: warning: comparison of integer expressions of different signedness: ‘long unsigned int’ and ‘int’ [-Wsign-compare]
 1224 |   AT_DISPATCH_INDEX_TYPES(
      |
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cu: In instantiation of ‘void at::native::jagged_dense_elementwise_jagged_output_opt_(const at::Tensor&, const std::vector<at::Tensor>&, const at::Tensor&, const at::Tensor&, F) [with scalar_t = c10::Half; F = __nv_dl_wrapper_t<__nv_dl_trailing_return_tag<at::Tensor (*)(const at::Tensor&, c10::ArrayRef<at::Tensor>, std::optional<c10::SymInt>), at::native::_fbgemm_dense_to_jagged_forward_symint, c10::Half, 1> >]’:
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cu:1515:1:   required from here
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cu:1336:2006: warning: comparison of integer expressions of different signedness: ‘size_t’ {aka ‘long unsigned int’} and ‘int’ [-Wsign-compare]
 1336 |     AT_DISPATCH_INDEX_TYPES(
      |
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cu:1336:2113: warning: comparison of integer expressions of different signedness: ‘size_t’ {aka ‘long unsigned int’} and ‘int’ [-Wsign-compare]
 1336 |     AT_DISPATCH_INDEX_TYPES(
      |
```
After the change, it compiles without warnings.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136222
Approved by: https://github.com/PaliC, https://github.com/kit1980
2024-09-18 01:24:05 +00:00
b18ba9419e [AO][Inductor] Enable WOQ fusion pattern with permute (#135928)
**Summary**
Fix https://github.com/pytorch/pytorch/issues/135831 and https://github.com/pytorch/ao/issues/890. The root cause of the numerical failure was that the customized woq-int8 kernel was not triggered due to changes in the pattern. After re-adding the fusion pattern, the accuracy check now passes. I will open a separate TorchAO PR to enable these unit tests in TorchAO.

**Test Plan**
```
python test/inductor/test_mkldnn_pattern_matcher.py -k test_woq_int8
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135928
Approved by: https://github.com/jgong5, https://github.com/eellison
2024-09-18 00:56:16 +00:00
cccf500193 [c10d] remove sleep from watchdogHandler (#135760)
Summary:
Remove sleep from the `watchdogHandler` function. This sleep unnecessarily slows things down during an NCCL timeout.
Flight recorder is configured to take a minute, at most, to dump out its buffer.
This sleep ends up waiting for `8` minutes before destroy is called.

Test Plan: Unit tests.

Differential Revision: D62529875

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135760
Approved by: https://github.com/fduwjj, https://github.com/shuqiangzhang
2024-09-18 00:55:01 +00:00
f6f1504d39 [MPS] Fix 5D+ reductions over negative dimentions (#136198)
This fixes a bug introduced by https://github.com/pytorch/pytorch/pull/99856, which attempts to speed up reductions for 5D+ tensors when the trailing dimensions are all ones, but introduces crashes/off-by-one errors for wrapped (negative) dimensions.

Added a regression test case to `TestMPS.test_sum`
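
As a rough illustration of why negative dimensions must be wrapped before any index arithmetic, here is a small pure-Python sketch. Both helpers are hypothetical and only mimic the kind of check described above; they are not the MPS implementation.

```python
def wrap_dim(dim: int, ndim: int) -> int:
    """Map a possibly negative dimension index into the range [0, ndim)."""
    if not -ndim <= dim < ndim:
        raise IndexError(f"dim {dim} out of range for a {ndim}-D tensor")
    return dim + ndim if dim < 0 else dim


def trailing_dims_are_ones(shape, reduce_dims):
    """Hypothetical check: is every dim after the last reduced dim equal to 1?"""
    ndim = len(shape)
    # Wrap first, so that -1 and ndim - 1 are treated as the same dimension.
    last = max(wrap_dim(d, ndim) for d in reduce_dims)
    return all(size == 1 for size in shape[last + 1:])
```

Comparing or offsetting raw negative indices (for example `max(-1, 2)`) silently selects the wrong dimension, which is the class of off-by-one error the commit refers to.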

Fixes https://github.com/pytorch/pytorch/issues/136132

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136198
Approved by: https://github.com/albanD
2024-09-17 21:53:31 +00:00
a575ce0dc6 [PyTorch Pinned Allocator] Add support of background thread to process events (#135524)
Summary: Currently we process events in the regular allocation path, where we call cudaEventQuery to check on the events, and this path can take some locks in the libcuda driver. It is not strictly necessary to process events in the allocation path; we can move this to a background thread that keeps processing events regularly and puts freed blocks on the free list.
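
The idea can be sketched in plain Python with a worker thread. Everything here is a toy model under stated assumptions: `PinnedPoolSketch` is a hypothetical class, and the `is_done` callable stands in for a `cudaEventQuery` call.

```python
import threading
import time
from collections import deque


class PinnedPoolSketch:
    """Toy host-memory pool whose pending events are drained off the allocation path."""

    def __init__(self, poll_interval=0.001):
        self._lock = threading.Lock()
        self._free = deque()
        self._pending = deque()  # (is_done callable, block) pairs
        self._stop = threading.Event()
        self._poll_interval = poll_interval
        self._worker = threading.Thread(target=self._process_events, daemon=True)
        self._worker.start()

    def record(self, is_done, block):
        """Register a block that becomes reusable once is_done() returns True."""
        with self._lock:
            self._pending.append((is_done, block))

    def allocate(self):
        """Allocation path: only inspects the free list, never queries events."""
        with self._lock:
            return self._free.popleft() if self._free else None

    def _process_events(self):
        while not self._stop.is_set():
            # Take a snapshot so the event queries run without holding the lock.
            with self._lock:
                batch, self._pending = self._pending, deque()
            done, waiting = [], []
            for is_done, block in batch:
                (done if is_done() else waiting).append((is_done, block))
            with self._lock:
                self._free.extend(block for _, block in done)
                # Put unfinished entries back at the front, in original order.
                self._pending.extendleft(reversed(waiting))
            time.sleep(self._poll_interval)

    def close(self):
        self._stop.set()
        self._worker.join()
```

The allocation path now touches only the free list, at the cost of a block becoming reusable up to one polling interval later than it otherwise could.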

Differential Revision: D62396585

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135524
Approved by: https://github.com/zyan0
2024-09-17 21:08:10 +00:00
48d18fbd4c [PyTorch CUDA Allocator] Allow reuse of non-split blocks with better rounding (#136174)
Summary:
This diff adds an option to round the non-split blocks in caching allocator so that they can be reused without causing lots of fragmentation for large memory segments.

For example, if we specify the max_split memory size as 400MB, then allocations larger than 400MB will not be split. Let's say we allocated some 1024MB blocks and these are cached in the allocator. If we request a new 500MB block, we round it to the nearest power-2 division, that is 512MB, and add the default kLargeBuffer of 20MB, which gives 532MB. Since 532MB is less than the existing 1024MB block, the 1024MB block will not be used for this allocation; instead a new 512MB block will be created. This diff makes the kLargeBuffer used for rounding configurable: if 512MB + max_non_split_rounding_size is greater than 1024MB, we use the 1024MB block and do not create a new 512MB block using cudaMalloc. This option is added so that we can pre-allocate some large blocks and reuse them as much as possible, without stalling on calls to cudaMalloc.
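
The arithmetic in this example can be checked with a short sketch. The rounding is simplified to "next power of two" and the function names are hypothetical; this is not the allocator's actual code.

```python
MB = 1024 * 1024
K_LARGE_BUFFER = 20 * MB  # default slack quoted in the example above


def round_up_to_power_of_two(size: int) -> int:
    """Simplified rounding (assumption): smallest power of two >= size."""
    return 1 << (size - 1).bit_length()


def can_reuse(cached_block: int, request: int, max_rounding: int = K_LARGE_BUFFER) -> bool:
    """A cached non-split block is reused only if it is smaller than rounded + slack."""
    rounded = round_up_to_power_of_two(request)
    return rounded <= cached_block < rounded + max_rounding
```

With the default 20MB slack, `can_reuse(1024 * MB, 500 * MB)` is false because 1024MB is not below 532MB; raising the slack above 512MB makes the cached 1024MB block eligible.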

Differential Revision: D62758758

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136174
Approved by: https://github.com/zyan0
2024-09-17 19:08:44 +00:00
eqy
e3aa5e2f64 [NCCL] Don't override waitUntilInitialized's setting of comm->initialized_ (#136155)
#133630 sets `initialized_` to `true`, which causes previous wait codepaths to skip necessary waits; see also https://github.com/pytorch/pytorch/issues/136151

CC @shuqiangzhang @wconstab

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136155
Approved by: https://github.com/fduwjj, https://github.com/kwen2501, https://github.com/c-p-i-o, https://github.com/shuqiangzhang
2024-09-17 18:50:12 +00:00
a4e9a1c90b [TorchRec][PT2 IR][APF] short circuit the flatten/unflatten between EBC and KTRegroupAsDict modules (#136045)
Summary:
# context
* for the root cause and background please refer to this [post](https://fb.workplace.com/groups/1028545332188949/permalink/1042204770823005/)
* the basic idea of this diff is to **short circuit the pytree flatten-unflatten function pairs** between two preserved modules, i.e., EBC/fpEBC and KTRegroupAsDict.
NOTE: There could be multiple EBCs and one single KTRegroupAsDict as shown in the [pic](https://fburl.com/gslide/lcyt8eh3) {F1864810545}
* short-circuiting the EBC-KTRegroupAsDict pairs is a special case, and it is required in most cases because of the EBC key-order issue with distributed table lookup.
* hide all the operations behind a control flag `short_circuit_pytree_ebc_regroup` to the torchrec main api call `decapsulate_ir_modules`, which should only be visible to the infra layer, not to the users.

# details
* The `_short_circuit_pytree_ebc_regroup` function finds all the EBC/fpEBC and KTRegroupAsDict modules in an unflattened module, retrieves their fqns, and sorts them into in_fqns (regroup_fqns) and out_fqns (ebc_fqns). Because the fpEBC is currently swapped as a whole, some extra fqn logic filters out any EBC that belongs to an up-level fpEBC.
* a util function `prune_pytree_flatten_unflatten` removes the in-coming and out-going pytree flatten/unflatten function calls in the graph module, based on the given fqns.

WARNING: The flag `short_circuit_pytree_ebc_regroup` should be turned on if EBCs are used and EBC sharding is needed. Assertions are also added for the cases where a `KTRegroupAsDict` module cannot be found or `finalize_interpreter_modules` is not `True`.

# additional changes
* absorb the `finalize_interpreter_modules` process inside the torchrec main api `decapsulate_ir_modules`.
* set `graph.owning_module` in export.unflatten as required by the graph modification
* add one more layer of `sparse_module` to closely mimic the APF model structure.

Test Plan:
# run test
* serializer
```
buck2 run fbcode//mode/opt fbcode//torchrec/ir/tests:test_serializer
```
* apf
```
buck2 run fbcode//mode/opt fbcode//aps_models/ads/gmp/tests/ne/e2e_deterministic_tests:gmp_e2e_ne_tests -- --filter-text 'test_mtml_instagram_model_562438350_single_gpu_with_ir'
```
* local mp run
```
==== Finished E2E deterministic test for mtml_instagram_model_gmp_474023725_non_kjt_unary ====
finished
  test_mtml_instagram_model_562438350_single_gpu_with_ir
Imports took: 6.0s! Profile with --import-profiler.            --_ |""---__
Executed 1 example in 203.1s:                               |'.|  ||  .    """|
  Successful: 1                                             | ||  || /|\""-.  |
  Failed: 0                                                 | ||  ||  |    |  |
  Skipped: 0                                                | ||  ||  |   \|/ |
  Not executed: 8                                           |."|  ||  --"" '__|
https://testslide.readthedocs.io/                              --" |__---"""
```

Differential Revision: D62606738

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136045
Approved by: https://github.com/angelayi
2024-09-17 18:42:56 +00:00
ea10c072f3 [export] Deserialize args with python keyword names (#136036)
Currently, when we deserialize inputs to nodes, we deserialize arguments with default values as kwargs. So deserializing `aten.uniform`, which has the signature `uniform(Tensor(a!) self, float from=0, float to=1, *, Generator? generator=None) -> Tensor(a!)`, will become `uniform(x, from=0, to=1)`. However, this fails when running in Python because `from` is a Python keyword. So the solution here is to not deserialize it as a kwarg.
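
One way to picture the rule is the following standalone sketch. The `Arg` tuple and `split_call_args` helper are hypothetical and do not reflect the export deserializer's real data structures; the sketch also assumes `values` holds every argument that ends up in the positional prefix.

```python
import keyword
from typing import NamedTuple


class Arg(NamedTuple):
    name: str
    has_default: bool = False
    kwarg_only: bool = False


def split_call_args(schema, values):
    """Split deserialized values into (args, kwargs) for a Python call."""
    positional = [a for a in schema if not a.kwarg_only]
    # The positional prefix must cover every required argument and every
    # argument whose name cannot be written as `name=value` in Python.
    cut = 0
    for i, arg in enumerate(positional):
        if not arg.has_default or keyword.iskeyword(arg.name):
            cut = i + 1
    args = [values[a.name] for a in positional[:cut]]
    passed_positionally = {a.name for a in positional[:cut]}
    kwargs = {
        a.name: values[a.name]
        for a in schema
        if a.name in values and a.name not in passed_positionally
    }
    return args, kwargs


UNIFORM_SCHEMA = [
    Arg("self"),
    Arg("from", has_default=True),
    Arg("to", has_default=True),
    Arg("generator", has_default=True, kwarg_only=True),
]
args, kwargs = split_call_args(UNIFORM_SCHEMA, {"self": "x", "from": 0.0, "to": 1.0})
```

Here `from` is passed positionally while `to` can still be passed by name, so the resulting call has the shape `uniform(x, 0.0, to=1.0)`.
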
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136036
Approved by: https://github.com/zhxchen17
2024-09-17 18:13:14 +00:00
a8382847f4 Support rms_norm() for NJT (#135872)
`rms_norm()` is a nice-to-have for ViT :)

This PR:
* SymInt-ifies `rms_norm()`, allowing NJT to use the same decomp.
* Adds torch_function-based input validation logic for nested-specific stuff (no normalization supported over the ragged dim for now) on the python NJT side.
* Adds multi-dim support (on non-ragged, non-batch dims) to `mean()` for NJT.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135872
Approved by: https://github.com/mikaylagawarecki
ghstack dependencies: #125947
2024-09-17 18:09:20 +00:00
785e98783b Delete links to non-existing run_plan_mpi.cc (#136204)
The file was deleted by https://github.com/pytorch/pytorch/pull/125092

Fixes https://github.com/pytorch/pytorch/issues/136199

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136204
Approved by: https://github.com/albanD, https://github.com/seemethere
2024-09-17 17:51:56 +00:00
cc365fdd7b [MTIA] Support torch.cuda.get_device_capability equivalent API on MTIA (#135889)
Summary:
Mirror `get_device_capability` on MTIA per https://fburl.com/gdoc/p4lo5avn

At the moment, both the major and minor version are just 0

Test Plan:
Unit test: `buck2 test //mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api`

https://www.internalfb.com/intern/testinfra/testconsole/testrun/1688850109958190/

Differential Revision: D62595296

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135889
Approved by: https://github.com/egienvalue
2024-09-17 17:42:56 +00:00
8e5bb356e0 [PT2] Port merge_concats_pass to PT2 pre_grad passes (#135527)
Summary: as title

Test Plan: new UT

Differential Revision: D62398390

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135527
Approved by: https://github.com/frank-wei
2024-09-17 17:26:53 +00:00
63dc5dff10 [Fix]: Update CPUINFO submodule to fix support for NON-SVE ARM Hardware (#135857)
Regression PR : https://github.com/pytorch/cpuinfo/pull/255

Change-Id: I56cec061072be11ec33ccb661114360b979fc7aa

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135857
Approved by: https://github.com/digantdesai, https://github.com/malfet
2024-09-17 16:50:17 +00:00
67b14ce8bd [ONNX] Fix numpy method to return the correct type (#136162)
The previous implementation of the `numpy()` method returned `fp64` when the tensor was `fp32`. This is unexpected, but it seems to be caused by calling `__array__(dtype=None)` on the numpy array. I updated the implementation to implement the `numpy()` method explicitly and added tests to guard the behavior.

This needs to be cherry-picked into torch 2.5
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136162
Approved by: https://github.com/gramalingam, https://github.com/xadupre
2024-09-17 15:51:00 +00:00
ece8267d2c Add back optim type hints that were lost when *.pyi files were removed (#136185)
When stub files (`*.pyi`) were removed from `optim` (#125556, #125452), some types that existed are no longer available. This pull request adds them back.

Just for reference, these types are used in `pytorch-lightning`'s `LightningCLI`. Command line interfaces are created automatically, and having type hints makes them nicer.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136185
Approved by: https://github.com/janeyx99
2024-09-17 15:45:15 +00:00
913f97e878 Don't run reshape pattern match on dynamic shape size tensor (#136100)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136100
Approved by: https://github.com/mengluy0125
2024-09-17 15:08:55 +00:00
462b727d1e Revert "Add decomposition for permute_copy (#130944)"
This reverts commit ab9a7eadd34aee59fc67e29237610b7562cc4ff0.

Reverted https://github.com/pytorch/pytorch/pull/130944 on behalf of https://github.com/jeanschmidt due to Broke internal signal executorch.backends.xnnpack.test.ops.permute.TestPermute, more details on D62737086. @eellison could you please help get this PR merged to main? ([comment](https://github.com/pytorch/pytorch/pull/130944#issuecomment-2355846394))
2024-09-17 13:42:55 +00:00
2c4ae81494 Revert "Add decomposition for squeeze_copy (#130941)"
This reverts commit c33b0580e6a702be0cd5be691b3b465da012aa34.

Reverted https://github.com/pytorch/pytorch/pull/130941 on behalf of https://github.com/jeanschmidt due to Need to revert in order to be able to revert https://github.com/pytorch/pytorch/pull/130944, after fixing any merge conflicts, feel free to merge it back ([comment](https://github.com/pytorch/pytorch/pull/130941#issuecomment-2355831480))
2024-09-17 13:39:07 +00:00
3b5e2689a1 Revert "Optimize dict reconstruct to not codegen untouched values (#134876)"
This reverts commit a1a57a424dc992f4dc2d44bdc1e4e7e500881a9c.

Reverted https://github.com/pytorch/pytorch/pull/134876 on behalf of https://github.com/jeanschmidt due to the newly introduced test test_reconstruct.py::ReconstructTest::test_functional_call_reconstruct breaking internally. @zou3519 could you help get those changes merged back to main? ([comment](https://github.com/pytorch/pytorch/pull/134876#issuecomment-2355697685))
2024-09-17 13:00:01 +00:00
e248c1d7eb Update real device in FSDP state_dict_utils (#134994)
## Motivation
The device returned by `tensor.device`, for both sharded and non-sharded tensors, is set to cuda by default. As a result, the FSDP UTs fail with the following errors on non-cuda devices. This change updates the device to the actual device type of the created tensor.

```
[rank3]   File "/root/repos/pytorch-training-tests/tests/pytorch/v2.4.0/distributed_hpu/fsdp/test_fsdp_dtensor_state_dict.py", line 143, in test_dtensor_sharded_tensor_state_dict_identical
[rank3]     sharded_tensor_sd = ref_model.state_dict()
[rank3]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1944, in state_dict
[rank3]     hook_result = hook(self, destination, prefix, local_metadata)
[rank3]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank3]     return func(*args, **kwargs)
[rank3]   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/fsdp/_state_dict_utils.py", line 752, in _post_state_dict_hook
[rank3]     tensor.device,
[rank3]   File "/usr/local/lib/python3.10/dist-packages/typing_extensions.py", line 2853, in wrapper
[rank3]     return arg(*args, **kwargs)
[rank3]   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/_shard/sharded_tensor/api.py", line 1152, in __torch_function__
[rank3]     return dispatch(st_instance, func)
[rank3]   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/_shard/sharded_tensor/api.py", line 1134, in dispatch
[rank3]     return _SHARDED_OPS[func](types, args, kwargs, st._process_group)
[rank3]   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/_shard/op_registry_utils.py", line 33, in wrapper
[rank3]     return wrapped_func(types, args, kwargs, process_group)
[rank3]   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/_shard/sharded_tensor/_ops/tensor_ops.py", line 52, in tensor_device
[rank3]     dev = torch.device(torch.cuda.current_device())
[rank3]   File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 878, in current_device
[rank3]     _lazy_init()
[rank3]   File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 305, in _lazy_init
[rank3]     raise AssertionError("Torch not compiled with CUDA enabled")
[rank3] AssertionError: Torch not compiled with CUDA enabled
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134994
Approved by: https://github.com/fegin
2024-09-17 04:39:08 +00:00
408fe41a45 [DSD][EZ] Minor update in _state_dict_utils.py (#136165)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136165
Approved by: https://github.com/kwen2501
ghstack dependencies: #135725, #135763
2024-09-17 04:32:43 +00:00
dc82d274e6 make view.dtype always return an alias (#136074)
Fixes https://github.com/pytorch/pytorch/issues/136064

In the linked repro, the issue was that there was some code like this:
```
# x has dtype torch.float32
def f(x):
    y = x.view(torch.float32)
    y.copy_(...)
```

Where because `view.dtype` is implemented today to potentially directly return its input, we would end up directly clobbering the proxy for our graph input (replacing its FX proxy value from `arg0_1` to `view_1`). This is not desirable, because we have careful assertions in AOTDispatcher that mutations only ever happen on graph inputs - but this clobbering caused the mutation to appear, from the perspective of the FX graph, like it was happening on a view of the input.

Why is this normally not a problem? Ordinarily, the `ADInplaceOrView` kernel for `view.dtype` will take the output of the view kernel, [and detach() it](https://github.com/pytorch/pytorch/blob/main/tools/autograd/gen_inplace_or_view_type.py#L466) (properly creating a fresh `TensorImpl`).

This does **not** happen, though, if you are executing the kernel from within a `__torch_dispatch__` region: the `ADInplaceOrView` logic has already run above you, so that key will be in the TLS exclude set.

This PR changes eager behavior - at first I considered trying to only change behavior under compile. But this problem isn't technically specific to PT2: if you ever rely on tensor identity from inside of a __torch_dispatch__ call, then we need to make sure the raw `view.dtype` kernel doesn't directly return the input.

I am also making the assumption that "`view.dtype` no-op'ing when the dtype is the same" is not a case worth optimizing in eager mode, and that the overhead of the `TensorImpl` creation is relatively negligible.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136074
Approved by: https://github.com/Skylion007, https://github.com/ezyang, https://github.com/albanD
ghstack dependencies: #136041
2024-09-17 03:40:54 +00:00
d463a81c27 inductor: dont use default_dtype during rng functionalization (#136041)
Fixes https://github.com/pytorch/pytorch/issues/119162

See context at https://github.com/pytorch/pytorch/issues/119162#issuecomment-2349849469

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136041
Approved by: https://github.com/eellison
2024-09-17 03:40:54 +00:00
3f74310784 Back out "Flip triton kernel default layout constraint to "needs_fixed_stride_order" (#135581)" (#136160)
Test Plan: make train-hstu-cint-publish-bf16-tgif-local

Differential Revision: D62766335

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136160
Approved by: https://github.com/muchulee8
2024-09-17 01:06:10 +00:00
37a08b33bb Revert "fix compiled_autograd deadlock throw (#135795)"
This reverts commit 00dc7d435652ad66e9d2feb2660928b632281a98.

Reverted https://github.com/pytorch/pytorch/pull/135795 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/135795#issuecomment-2354233619))
2024-09-16 23:59:56 +00:00
071da87cd7 use csv extention for test report in order for it to be uploaded to s3 (#136128)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136128
Approved by: https://github.com/clee2000
2024-09-16 21:47:46 +00:00
c12536b3c0 [ONNX] Treat CompositeImplicitAutograd ops as normal ops in decomp (#136153)
Since https://github.com/pytorch/pytorch/pull/135080, the CompositeImplicitAutograd (CIA) ops are only decomposed when a decomp function is provided in a table. There is no longer a need to distinguish CIA ops like Upsample and preserve them explicitly. On the ONNX Script torchlib side I will unregister some ops from the following list to make sure some CIA ops are still decomposed.

```
<OpOverload(op='aten.__and__', overload='Scalar')>,
 <OpOverload(op='aten.__and__', overload='Tensor')>,
 <OpOverload(op='aten.__or__', overload='Scalar')>,
 <OpOverload(op='aten.__or__', overload='Tensor')>,
 <OpOverload(op='aten.__xor__', overload='Scalar')>,
 <OpOverload(op='aten.__xor__', overload='Tensor')>,
 <OpOverload(op='aten._add_batch_dim', overload='default')>,
 <OpOverload(op='aten._assert_tensor_metadata', overload='default')>,
 <OpOverload(op='aten._backward', overload='default')>,
 <OpOverload(op='aten._batch_norm_impl_index_backward', overload='default')>,
 <OpOverload(op='aten._cast_Byte', overload='default')>,
 <OpOverload(op='aten._cast_Char', overload='default')>,
 <OpOverload(op='aten._cast_Double', overload='default')>,
 <OpOverload(op='aten._cast_Float', overload='default')>,
 <OpOverload(op='aten._cast_Half', overload='default')>,
 <OpOverload(op='aten._cast_Int', overload='default')>,
 <OpOverload(op='aten._cast_Long', overload='default')>,
 <OpOverload(op='aten._cast_Short', overload='default')>,
 <OpOverload(op='aten._choose_qparams_per_tensor', overload='default')>,
 <OpOverload(op='aten._convolution', overload='deprecated')>,
 <OpOverload(op='aten._convolution_double_backward', overload='default')>,
 <OpOverload(op='aten._convolution_mode', overload='default')>,
 <OpOverload(op='aten._cufft_clear_plan_cache', overload='default')>,
 <OpOverload(op='aten._cufft_get_plan_cache_max_size', overload='default')>,
 <OpOverload(op='aten._cufft_get_plan_cache_size', overload='default')>,
 <OpOverload(op='aten._cufft_set_plan_cache_max_size', overload='default')>,
 <OpOverload(op='aten._debug_has_internal_overlap', overload='default')>,
 <OpOverload(op='aten._dim_arange', overload='default')>,
 <OpOverload(op='aten._embedding_bag_sparse_backward', overload='default')>,
 <OpOverload(op='aten._gather_sparse_backward', overload='default')>,
 <OpOverload(op='aten._grid_sampler_2d_cpu_fallback_backward', overload='default')>,
 <OpOverload(op='aten._has_compatible_shallow_copy_type', overload='default')>,
 <OpOverload(op='aten._is_zerotensor', overload='default')>,
 <OpOverload(op='aten._lu_with_info', overload='default')>,
 <OpOverload(op='aten._nnpack_available', overload='default')>,
 <OpOverload(op='aten._pack_padded_sequence_backward', overload='default')>,
 <OpOverload(op='aten._pad_circular', overload='default')>,
 <OpOverload(op='aten._pad_enum', overload='default')>,
 <OpOverload(op='aten._pad_packed_sequence', overload='default')>,
 <OpOverload(op='aten._propagate_xla_data', overload='default')>,
 <OpOverload(op='aten._remove_batch_dim', overload='default')>,
 <OpOverload(op='aten._reshape_from_tensor', overload='default')>,
 <OpOverload(op='aten._rowwise_prune', overload='default')>,
 <OpOverload(op='aten._saturate_weight_to_fp16', overload='default')>,
 <OpOverload(op='aten._scaled_dot_product_attention_math', overload='default')>,
 <OpOverload(op='aten._shape_as_tensor', overload='default')>,
 <OpOverload(op='aten._sobol_engine_draw', overload='default')>,
 <OpOverload(op='aten._sparse_bsc_tensor_unsafe', overload='default')>,
 <OpOverload(op='aten._sparse_bsr_tensor_unsafe', overload='default')>,
 <OpOverload(op='aten._sparse_compressed_tensor_unsafe', overload='default')>,
 <OpOverload(op='aten._sparse_coo_tensor_unsafe', overload='default')>,
 <OpOverload(op='aten._sparse_csc_tensor_unsafe', overload='default')>,
 <OpOverload(op='aten._sparse_csr_tensor_unsafe', overload='default')>,
 <OpOverload(op='aten._sparse_log_softmax', overload='Dimname')>,
 <OpOverload(op='aten._sparse_log_softmax', overload='int')>,
 <OpOverload(op='aten._sparse_mm', overload='default')>,
 <OpOverload(op='aten._sparse_mm', overload='reduce')>,
 <OpOverload(op='aten._sparse_softmax', overload='Dimname')>,
 <OpOverload(op='aten._sparse_softmax', overload='int')>,
 <OpOverload(op='aten._sparse_sum', overload='default')>,
 <OpOverload(op='aten._sparse_sum', overload='dim_dtype')>,
 <OpOverload(op='aten._sparse_sum', overload='dtype')>,
 <OpOverload(op='aten._test_ambiguous_defaults', overload='a')>,
 <OpOverload(op='aten._test_ambiguous_defaults', overload='b')>,
 <OpOverload(op='aten._test_autograd_multiple_dispatch', overload='ntonly')>,
 <OpOverload(op='aten._test_check_tensor', overload='default')>,
 <OpOverload(op='aten._test_serialization_subcmul', overload='default')>,
 <OpOverload(op='aten._test_string_default', overload='default')>,
 <OpOverload(op='aten._thnn_differentiable_gru_cell_backward', overload='default')>,
 <OpOverload(op='aten._thnn_differentiable_lstm_cell_backward', overload='default')>,
 <OpOverload(op='aten._thnn_fused_lstm_cell_backward', overload='default')>,
 <OpOverload(op='aten._to_cpu', overload='default')>,
 <OpOverload(op='aten._upsample_bicubic2d_aa', overload='vec')>,
 <OpOverload(op='aten._upsample_bilinear2d_aa', overload='vec')>,
 <OpOverload(op='aten._upsample_nearest_exact1d', overload='default')>,
 <OpOverload(op='aten._upsample_nearest_exact1d', overload='vec')>,
 <OpOverload(op='aten._upsample_nearest_exact2d', overload='default')>,
 <OpOverload(op='aten._upsample_nearest_exact2d', overload='vec')>,
 <OpOverload(op='aten._upsample_nearest_exact3d', overload='default')>,
 <OpOverload(op='aten._upsample_nearest_exact3d', overload='vec')>,
 <OpOverload(op='aten._use_cudnn_rnn_flatten_weight', overload='default')>,
 <OpOverload(op='aten._validate_sparse_bsc_tensor_args', overload='default')>,
 <OpOverload(op='aten._validate_sparse_bsr_tensor_args', overload='default')>,
 <OpOverload(op='aten._validate_sparse_compressed_tensor_args', overload='default')>,
 <OpOverload(op='aten._validate_sparse_coo_tensor_args', overload='default')>,
 <OpOverload(op='aten._validate_sparse_csc_tensor_args', overload='default')>,
 <OpOverload(op='aten._validate_sparse_csr_tensor_args', overload='default')>,
 <OpOverload(op='aten._version', overload='default')>,
 <OpOverload(op='aten._weight_norm', overload='default')>,
 <OpOverload(op='aten._weight_norm_differentiable_backward', overload='default')>,
 <OpOverload(op='aten.absolute', overload='default')>,
 <OpOverload(op='aten.adaptive_avg_pool1d', overload='default')>,
 <OpOverload(op='aten.adaptive_avg_pool2d', overload='default')>,
 <OpOverload(op='aten.adaptive_avg_pool3d', overload='default')>,
 <OpOverload(op='aten.adaptive_max_pool1d', overload='default')>,
 <OpOverload(op='aten.affine_grid_generator_backward', overload='default')>,
 <OpOverload(op='aten.align_as', overload='default')>,
 <OpOverload(op='aten.align_tensors', overload='default')>,
 <OpOverload(op='aten.all', overload='dimname')>,
 <OpOverload(op='aten.any', overload='dimname')>,
 <OpOverload(op='aten.arccos', overload='default')>,
 <OpOverload(op='aten.arccosh', overload='default')>,
 <OpOverload(op='aten.arcsin', overload='default')>,
 <OpOverload(op='aten.arcsinh', overload='default')>,
 <OpOverload(op='aten.arctan', overload='default')>,
 <OpOverload(op='aten.arctan2', overload='default')>,
 <OpOverload(op='aten.arctanh', overload='default')>,
 <OpOverload(op='aten.argsort', overload='default')>,
 <OpOverload(op='aten.argsort', overload='dimname')>,
 <OpOverload(op='aten.argsort', overload='stable')>,
 <OpOverload(op='aten.argwhere', overload='default')>,
 <OpOverload(op='aten.atleast_1d', overload='Sequence')>,
 <OpOverload(op='aten.atleast_2d', overload='Sequence')>,
 <OpOverload(op='aten.atleast_3d', overload='Sequence')>,
 <OpOverload(op='aten.avg_pool1d', overload='default')>,
 <OpOverload(op='aten.bilinear', overload='default')>,
 <OpOverload(op='aten.broadcast_tensors', overload='default')>,
 <OpOverload(op='aten.can_cast', overload='default')>,
 <OpOverload(op='aten.cat', overload='names')>,
 <OpOverload(op='aten.cdist', overload='default')>,
 <OpOverload(op='aten.chain_matmul', overload='default')>,
 <OpOverload(op='aten.chalf', overload='default')>,
 <OpOverload(op='aten.choose_qparams_optimized', overload='default')>,
 <OpOverload(op='aten.clip', overload='Tensor')>,
 <OpOverload(op='aten.clip', overload='default')>,
 <OpOverload(op='aten.column_stack', overload='default')>,
 <OpOverload(op='aten.combinations', overload='default')>,
 <OpOverload(op='aten.concat', overload='default')>,
 <OpOverload(op='aten.concat', overload='names')>,
 <OpOverload(op='aten.concatenate', overload='default')>,
 <OpOverload(op='aten.concatenate', overload='names')>,
 <OpOverload(op='aten.conv1d', overload='default')>,
 <OpOverload(op='aten.conv1d', overload='padding')>,
 <OpOverload(op='aten.conv2d', overload='default')>,
 <OpOverload(op='aten.conv2d', overload='padding')>,
 <OpOverload(op='aten.conv3d', overload='default')>,
 <OpOverload(op='aten.conv3d', overload='padding')>,
 <OpOverload(op='aten.conv_tbc_backward', overload='default')>,
 <OpOverload(op='aten.conv_transpose1d', overload='default')>,
 <OpOverload(op='aten.conv_transpose2d', overload='input')>,
 <OpOverload(op='aten.conv_transpose3d', overload='input')>,
 <OpOverload(op='aten.corrcoef', overload='default')>,
 <OpOverload(op='aten.cosine_embedding_loss', overload='default')>,
 <OpOverload(op='aten.cosine_similarity', overload='default')>,
 <OpOverload(op='aten.cov', overload='default')>,
 <OpOverload(op='aten.cross', overload='default')>,
 <OpOverload(op='aten.cross_entropy_loss', overload='default')>,
 <OpOverload(op='aten.ctc_loss', overload='IntList')>,
 <OpOverload(op='aten.ctc_loss', overload='Tensor')>,
 <OpOverload(op='aten.cudnn_is_acceptable', overload='default')>,
 <OpOverload(op='aten.cummax', overload='dimname')>,
 <OpOverload(op='aten.cummaxmin_backward', overload='default')>,
 <OpOverload(op='aten.cummin', overload='dimname')>,
 <OpOverload(op='aten.cumprod', overload='dimname')>,
 <OpOverload(op='aten.cumprod_backward', overload='default')>,
 <OpOverload(op='aten.cumsum', overload='dimname')>,
 <OpOverload(op='aten.cumulative_trapezoid', overload='dx')>,
 <OpOverload(op='aten.cumulative_trapezoid', overload='x')>,
 <OpOverload(op='aten.data', overload='default')>,
 <OpOverload(op='aten.det', overload='default')>,
 <OpOverload(op='aten.diag', overload='default')>,
 <OpOverload(op='aten.diagflat', overload='default')>,
 <OpOverload(op='aten.diff', overload='default')>,
 <OpOverload(op='aten.divide', overload='Scalar')>,
 <OpOverload(op='aten.divide', overload='Scalar_mode')>,
 <OpOverload(op='aten.divide', overload='Tensor')>,
 <OpOverload(op='aten.divide', overload='Tensor_mode')>,
 <OpOverload(op='aten.dstack', overload='default')>,
 <OpOverload(op='aten.einsum', overload='default')>,
 <OpOverload(op='aten.embedding_backward', overload='default')>,
 <OpOverload(op='aten.embedding_bag', overload='default')>,
 <OpOverload(op='aten.embedding_bag', overload='padding_idx')>,
 <OpOverload(op='aten.embedding_sparse_backward', overload='default')>,
 <OpOverload(op='aten.fake_quantize_per_channel_affine', overload='default')>,
 <OpOverload(op='aten.fake_quantize_per_channel_affine_cachemask_backward', overload='default')>,
 <OpOverload(op='aten.fake_quantize_per_tensor_affine', overload='default')>,
 <OpOverload(op='aten.fake_quantize_per_tensor_affine', overload='tensor_qparams')>,
 <OpOverload(op='aten.fake_quantize_per_tensor_affine_cachemask_backward', overload='default')>,
 <OpOverload(op='aten.fbgemm_linear_fp16_weight', overload='default')>,
 <OpOverload(op='aten.fbgemm_linear_fp16_weight_fp32_activation', overload='default')>,
 <OpOverload(op='aten.fbgemm_linear_int8_weight', overload='default')>,
 <OpOverload(op='aten.fbgemm_linear_int8_weight_fp32_activation', overload='default')>,
 <OpOverload(op='aten.fbgemm_linear_quantize_weight', overload='default')>,
 <OpOverload(op='aten.fbgemm_pack_gemm_matrix_fp16', overload='default')>,
 <OpOverload(op='aten.fbgemm_pack_quantized_matrix', overload='KN')>,
 <OpOverload(op='aten.fbgemm_pack_quantized_matrix', overload='default')>,
 <OpOverload(op='aten.fft_fft', overload='default')>,
 <OpOverload(op='aten.fft_fft2', overload='default')>,
 <OpOverload(op='aten.fft_fftn', overload='default')>,
 <OpOverload(op='aten.fft_fftshift', overload='default')>,
 <OpOverload(op='aten.fft_hfft', overload='default')>,
 <OpOverload(op='aten.fft_hfft2', overload='default')>,
 <OpOverload(op='aten.fft_hfftn', overload='default')>,
 <OpOverload(op='aten.fft_ifft', overload='default')>,
 <OpOverload(op='aten.fft_ifft2', overload='default')>,
 <OpOverload(op='aten.fft_ifftn', overload='default')>,
 <OpOverload(op='aten.fft_ifftshift', overload='default')>,
 <OpOverload(op='aten.fft_ihfft', overload='default')>,
 <OpOverload(op='aten.fft_ihfft2', overload='default')>,
 <OpOverload(op='aten.fft_ihfftn', overload='default')>,
 <OpOverload(op='aten.fft_irfft', overload='default')>,
 <OpOverload(op='aten.fft_irfft2', overload='default')>,
 <OpOverload(op='aten.fft_irfftn', overload='default')>,
 <OpOverload(op='aten.fft_rfft', overload='default')>,
 <OpOverload(op='aten.fft_rfft2', overload='default')>,
 <OpOverload(op='aten.fft_rfftn', overload='default')>,
 <OpOverload(op='aten.fix', overload='default')>,
 <OpOverload(op='aten.flatten_dense_tensors', overload='default')>,
 <OpOverload(op='aten.fliplr', overload='default')>,
 <OpOverload(op='aten.flipud', overload='default')>,
 <OpOverload(op='aten.float_power', overload='Scalar')>,
 <OpOverload(op='aten.float_power', overload='Tensor_Scalar')>,
 <OpOverload(op='aten.float_power', overload='Tensor_Tensor')>,
 <OpOverload(op='aten.frobenius_norm', overload='dim')>,
 <OpOverload(op='aten.gather', overload='dimname')>,
 <OpOverload(op='aten.gather_backward', overload='default')>,
 <OpOverload(op='aten.ger', overload='default')>,
 <OpOverload(op='aten.gradient', overload='array')>,
 <OpOverload(op='aten.gradient', overload='scalararray')>,
 <OpOverload(op='aten.gradient', overload='scalarint')>,
 <OpOverload(op='aten.gradient', overload='scalarrayarray')>,
 <OpOverload(op='aten.gradient', overload='scalarrayint')>,
 <OpOverload(op='aten.gradient', overload='tensorarray')>,
 <OpOverload(op='aten.gradient', overload='tensorarrayint')>,
 <OpOverload(op='aten.greater', overload='Scalar')>,
 <OpOverload(op='aten.greater', overload='Tensor')>,
 <OpOverload(op='aten.greater_equal', overload='Scalar')>,
 <OpOverload(op='aten.greater_equal', overload='Tensor')>,
 <OpOverload(op='aten.grid_sampler', overload='default')>,
 <OpOverload(op='aten.group_norm', overload='default')>,
 <OpOverload(op='aten.gru', overload='data')>,
 <OpOverload(op='aten.gru', overload='input')>,
 <OpOverload(op='aten.gru_cell', overload='default')>,
 <OpOverload(op='aten.hinge_embedding_loss', overload='default')>,
 <OpOverload(op='aten.histogramdd', overload='TensorList_bins')>,
 <OpOverload(op='aten.histogramdd', overload='default')>,
 <OpOverload(op='aten.histogramdd', overload='int_bins')>,
 <OpOverload(op='aten.hstack', overload='default')>,
 <OpOverload(op='aten.index_add', overload='dimname')>,
 <OpOverload(op='aten.index_copy', overload='dimname')>,
 <OpOverload(op='aten.index_fill', overload='Dimname_Scalar')>,
 <OpOverload(op='aten.index_fill', overload='Dimname_Tensor')>,
 <OpOverload(op='aten.index_select', overload='dimname')>,
 <OpOverload(op='aten.index_select_backward', overload='default')>,
 <OpOverload(op='aten.infinitely_differentiable_gelu_backward', overload='default')>,
 <OpOverload(op='aten.inner', overload='default')>,
 <OpOverload(op='aten.instance_norm', overload='default')>,
 <OpOverload(op='aten.inverse', overload='default')>,
 <OpOverload(op='aten.is_complex', overload='default')>,
 <OpOverload(op='aten.is_conj', overload='default')>,
 <OpOverload(op='aten.is_distributed', overload='default')>,
 <OpOverload(op='aten.is_floating_point', overload='default')>,
 <OpOverload(op='aten.is_inference', overload='default')>,
 <OpOverload(op='aten.is_leaf', overload='default')>,
 <OpOverload(op='aten.is_neg', overload='default')>,
 <OpOverload(op='aten.is_nonzero', overload='default')>,
 <OpOverload(op='aten.is_signed', overload='default')>,
 <OpOverload(op='aten.is_vulkan_available', overload='default')>,
 <OpOverload(op='aten.isclose', overload='default')>,
 <OpOverload(op='aten.isfinite', overload='default')>,
 <OpOverload(op='aten.isreal', overload='default')>,
 <OpOverload(op='aten.istft', overload='default')>,
 <OpOverload(op='aten.item', overload='default')>,
 <OpOverload(op='aten.kl_div', overload='default')>,
 <OpOverload(op='aten.kron', overload='default')>,
 <OpOverload(op='aten.kthvalue', overload='dimname')>,
 <OpOverload(op='aten.l1_loss', overload='default')>,
 <OpOverload(op='aten.layer_norm', overload='default')>,
 <OpOverload(op='aten.ldexp', overload='Tensor')>,
 <OpOverload(op='aten.less', overload='Scalar')>,
 <OpOverload(op='aten.less', overload='Tensor')>,
 <OpOverload(op='aten.less_equal', overload='Scalar')>,
 <OpOverload(op='aten.less_equal', overload='Tensor')>,
 <OpOverload(op='aten.linalg_cholesky', overload='default')>,
 <OpOverload(op='aten.linalg_cond', overload='default')>,
 <OpOverload(op='aten.linalg_cond', overload='p_str')>,
 <OpOverload(op='aten.linalg_det', overload='default')>,
 <OpOverload(op='aten.linalg_eigh', overload='default')>,
 <OpOverload(op='aten.linalg_eigvals', overload='default')>,
 <OpOverload(op='aten.linalg_eigvalsh', overload='default')>,
 <OpOverload(op='aten.linalg_inv', overload='default')>,
 <OpOverload(op='aten.linalg_ldl_factor', overload='default')>,
 <OpOverload(op='aten.linalg_lu_factor', overload='default')>,
 <OpOverload(op='aten.linalg_matmul', overload='default')>,
 <OpOverload(op='aten.linalg_matrix_norm', overload='default')>,
 <OpOverload(op='aten.linalg_matrix_norm', overload='str_ord')>,
 <OpOverload(op='aten.linalg_matrix_power', overload='default')>,
 <OpOverload(op='aten.linalg_matrix_rank', overload='atol_rtol_float')>,
 <OpOverload(op='aten.linalg_matrix_rank', overload='atol_rtol_tensor')>,
 <OpOverload(op='aten.linalg_matrix_rank', overload='default')>,
 <OpOverload(op='aten.linalg_matrix_rank', overload='tol_tensor')>,
 <OpOverload(op='aten.linalg_multi_dot', overload='default')>,
 <OpOverload(op='aten.linalg_norm', overload='default')>,
 <OpOverload(op='aten.linalg_norm', overload='ord_str')>,
 <OpOverload(op='aten.linalg_pinv', overload='atol_rtol_float')>,
 <OpOverload(op='aten.linalg_pinv', overload='default')>,
 <OpOverload(op='aten.linalg_pinv', overload='rcond_tensor')>,
 <OpOverload(op='aten.linalg_slogdet', overload='default')>,
 <OpOverload(op='aten.linalg_solve', overload='default')>,
 <OpOverload(op='aten.linalg_solve_ex', overload='default')>,
 <OpOverload(op='aten.linalg_svd', overload='default')>,
 <OpOverload(op='aten.linalg_svdvals', overload='default')>,
 <OpOverload(op='aten.linalg_tensorinv', overload='default')>,
 <OpOverload(op='aten.linalg_tensorsolve', overload='default')>,
 <OpOverload(op='aten.linalg_vander', overload='default')>,
 <OpOverload(op='aten.linalg_vecdot', overload='default')>,
 <OpOverload(op='aten.linear', overload='default')>,
 <OpOverload(op='aten.log_sigmoid', overload='default')>,
 <OpOverload(op='aten.log_softmax', overload='Dimname')>,
 <OpOverload(op='aten.log_softmax', overload='int')>,
 <OpOverload(op='aten.logcumsumexp', overload='dimname')>,
 <OpOverload(op='aten.logdet', overload='default')>,
 <OpOverload(op='aten.logsumexp', overload='names')>,
 <OpOverload(op='aten.lstm', overload='data')>,
 <OpOverload(op='aten.lstm', overload='input')>,
 <OpOverload(op='aten.lstm_cell', overload='default')>,
 <OpOverload(op='aten.lu_solve', overload='default')>,
 <OpOverload(op='aten.margin_ranking_loss', overload='default')>,
 <OpOverload(op='aten.masked_select_backward', overload='default')>,
 <OpOverload(op='aten.matmul', overload='default')>,
 <OpOverload(op='aten.matrix_exp', overload='default')>,
 <OpOverload(op='aten.matrix_exp_backward', overload='default')>,
 <OpOverload(op='aten.matrix_power', overload='default')>,
 <OpOverload(op='aten.max', overload='names_dim')>,
 <OpOverload(op='aten.max', overload='other')>,
 <OpOverload(op='aten.max_pool1d', overload='default')>,
 <OpOverload(op='aten.max_pool1d_with_indices', overload='default')>,
 <OpOverload(op='aten.max_pool2d', overload='default')>,
 <OpOverload(op='aten.max_pool3d', overload='default')>,
 <OpOverload(op='aten.mean', overload='names_dim')>,
 <OpOverload(op='aten.median', overload='names_dim')>,
 <OpOverload(op='aten.meshgrid', overload='default')>,
 <OpOverload(op='aten.meshgrid', overload='indexing')>,
 <OpOverload(op='aten.min', overload='names_dim')>,
 <OpOverload(op='aten.min', overload='other')>,
 <OpOverload(op='aten.mish_backward', overload='default')>,
 <OpOverload(op='aten.mode', overload='dimname')>,
 <OpOverload(op='aten.msort', overload='default')>,
 <OpOverload(op='aten.multilabel_margin_loss', overload='default')>,
 <OpOverload(op='aten.multiply', overload='Scalar')>,
 <OpOverload(op='aten.multiply', overload='Tensor')>,
 <OpOverload(op='aten.nanmean', overload='default')>,
 <OpOverload(op='aten.nanmedian', overload='names_dim')>,
 <OpOverload(op='aten.nanquantile', overload='default')>,
 <OpOverload(op='aten.nanquantile', overload='scalar')>,
 <OpOverload(op='aten.native_channel_shuffle', overload='default')>,
 <OpOverload(op='aten.negative', overload='default')>,
 <OpOverload(op='aten.nested_to_padded_tensor', overload='default')>,
 <OpOverload(op='aten.nll_loss', overload='default')>,
 <OpOverload(op='aten.nll_loss2d', overload='default')>,
 <OpOverload(op='aten.nll_loss_nd', overload='default')>,
 <OpOverload(op='aten.nonzero_numpy', overload='default')>,
 <OpOverload(op='aten.norm', overload='names_ScalarOpt_dim')>,
 <OpOverload(op='aten.norm', overload='names_ScalarOpt_dim_dtype')>,
 <OpOverload(op='aten.norm_except_dim', overload='default')>,
 <OpOverload(op='aten.not_equal', overload='Scalar')>,
 <OpOverload(op='aten.not_equal', overload='Tensor')>,
 <OpOverload(op='aten.nuclear_norm', overload='default')>,
 <OpOverload(op='aten.nuclear_norm', overload='dim')>,
 <OpOverload(op='aten.one_hot', overload='default')>,
 <OpOverload(op='aten.orgqr', overload='default')>,
 <OpOverload(op='aten.outer', overload='default')>,
 <OpOverload(op='aten.output_nr', overload='default')>,
 <OpOverload(op='aten.pad', overload='default')>,
 <OpOverload(op='aten.pad_sequence', overload='default')>,
 <OpOverload(op='aten.pairwise_distance', overload='default')>,
 <OpOverload(op='aten.pdist', overload='default')>,
 <OpOverload(op='aten.pinverse', overload='default')>,
 <OpOverload(op='aten.poisson_nll_loss', overload='default')>,
 <OpOverload(op='aten.prelu', overload='default')>,
 <OpOverload(op='aten.prod', overload='dim_Dimname')>,
 <OpOverload(op='aten.promote_types', overload='default')>,
 <OpOverload(op='aten.qr', overload='default')>,
 <OpOverload(op='aten.quantile', overload='default')>,
 <OpOverload(op='aten.quantile', overload='scalar')>,
 <OpOverload(op='aten.quantized_gru_cell', overload='default')>,
 <OpOverload(op='aten.quantized_lstm_cell', overload='default')>,
 <OpOverload(op='aten.quantized_rnn_relu_cell', overload='default')>,
 <OpOverload(op='aten.quantized_rnn_tanh_cell', overload='default')>,
 <OpOverload(op='aten.relu6', overload='default')>,
 <OpOverload(op='aten.repeat_interleave', overload='self_Tensor')>,
 <OpOverload(op='aten.repeat_interleave', overload='self_int')>,
 <OpOverload(op='aten.result_type', overload='Scalar')>,
 <OpOverload(op='aten.result_type', overload='Scalar_Scalar')>,
 <OpOverload(op='aten.result_type', overload='Scalar_Tensor')>,
 <OpOverload(op='aten.result_type', overload='Tensor')>,
 <OpOverload(op='aten.retains_grad', overload='default')>,
 <OpOverload(op='aten.rms_norm', overload='default')>,
 <OpOverload(op='aten.rnn_relu', overload='data')>,
 <OpOverload(op='aten.rnn_relu', overload='input')>,
 <OpOverload(op='aten.rnn_relu_cell', overload='default')>,
 <OpOverload(op='aten.rnn_tanh', overload='data')>,
 <OpOverload(op='aten.rnn_tanh', overload='input')>,
 <OpOverload(op='aten.rnn_tanh_cell', overload='default')>,
 <OpOverload(op='aten.row_stack', overload='default')>,
 <OpOverload(op='aten.rrelu', overload='default')>,
 <OpOverload(op='aten.scaled_dot_product_attention', overload='default')>,
 <OpOverload(op='aten.scatter', overload='dimname_src')>,
 <OpOverload(op='aten.scatter', overload='dimname_value')>,
 <OpOverload(op='aten.scatter_add', overload='dimname')>,
 <OpOverload(op='aten.selu', overload='default')>,
 <OpOverload(op='aten.silu_backward', overload='default')>,
 <OpOverload(op='aten.size', overload='Dimname')>,
 <OpOverload(op='aten.size', overload='int')>,
 <OpOverload(op='aten.slogdet', overload='default')>,
 <OpOverload(op='aten.slow_conv3d', overload='default')>,
 <OpOverload(op='aten.smm', overload='default')>,
 <OpOverload(op='aten.softmax', overload='Dimname')>,
 <OpOverload(op='aten.softmax', overload='int')>,
 <OpOverload(op='aten.sort', overload='dimname')>,
 <OpOverload(op='aten.sort', overload='dimname_stable')>,
 <OpOverload(op='aten.sparse_bsc_tensor', overload='ccol_row_value')>,
 <OpOverload(op='aten.sparse_bsc_tensor', overload='ccol_row_value_size')>,
 <OpOverload(op='aten.sparse_bsr_tensor', overload='crow_col_value')>,
 <OpOverload(op='aten.sparse_bsr_tensor', overload='crow_col_value_size')>,
 <OpOverload(op='aten.sparse_coo_tensor', overload='indices')>,
 <OpOverload(op='aten.sparse_coo_tensor', overload='indices_size')>,
 <OpOverload(op='aten.sparse_csc_tensor', overload='ccol_row_value')>,
 <OpOverload(op='aten.sparse_csc_tensor', overload='ccol_row_value_size')>,
 <OpOverload(op='aten.sparse_csr_tensor', overload='crow_col_value')>,
 <OpOverload(op='aten.sparse_csr_tensor', overload='crow_col_value_size')>,
 <OpOverload(op='aten.special_digamma', overload='default')>,
 <OpOverload(op='aten.special_erf', overload='default')>,
 <OpOverload(op='aten.special_erfc', overload='default')>,
 <OpOverload(op='aten.special_erfinv', overload='default')>,
 <OpOverload(op='aten.special_exp2', overload='default')>,
 <OpOverload(op='aten.special_expit', overload='default')>,
 <OpOverload(op='aten.special_expm1', overload='default')>,
 <OpOverload(op='aten.special_gammainc', overload='default')>,
 <OpOverload(op='aten.special_gammaincc', overload='default')>,
 <OpOverload(op='aten.special_gammaln', overload='default')>,
 <OpOverload(op='aten.special_i0', overload='default')>,
 <OpOverload(op='aten.special_log1p', overload='default')>,
 <OpOverload(op='aten.special_log_softmax', overload='default')>,
 <OpOverload(op='aten.special_logit', overload='default')>,
 <OpOverload(op='aten.special_logsumexp', overload='default')>,
 <OpOverload(op='aten.special_multigammaln', overload='default')>,
 <OpOverload(op='aten.special_ndtr', overload='default')>,
 <OpOverload(op='aten.special_polygamma', overload='default')>,
 <OpOverload(op='aten.special_psi', overload='default')>,
 <OpOverload(op='aten.special_round', overload='default')>,
 <OpOverload(op='aten.special_sinc', overload='default')>,
 <OpOverload(op='aten.special_softmax', overload='default')>,
 <OpOverload(op='aten.special_xlogy', overload='default')>,
 <OpOverload(op='aten.special_xlogy', overload='other_scalar')>,
 <OpOverload(op='aten.special_xlogy', overload='self_scalar')>,
 <OpOverload(op='aten.square', overload='default')>,
 <OpOverload(op='aten.sspaddmm', overload='default')>,
 <OpOverload(op='aten.std', overload='correction_names')>,
 <OpOverload(op='aten.std', overload='default')>,
 <OpOverload(op='aten.std', overload='dim')>,
 <OpOverload(op='aten.std', overload='names_dim')>,
 <OpOverload(op='aten.std_mean', overload='correction_names')>,
 <OpOverload(op='aten.std_mean', overload='default')>,
 <OpOverload(op='aten.std_mean', overload='dim')>,
 <OpOverload(op='aten.std_mean', overload='names_dim')>,
 <OpOverload(op='aten.stft', overload='center')>,
 <OpOverload(op='aten.stft', overload='default')>,
 <OpOverload(op='aten.stride', overload='Dimname')>,
 <OpOverload(op='aten.stride', overload='int')>,
 <OpOverload(op='aten.subtract', overload='Scalar')>,
 <OpOverload(op='aten.subtract', overload='Tensor')>,
 <OpOverload(op='aten.sum', overload='dim_DimnameList')>,
 <OpOverload(op='aten.sum_to_size', overload='default')>,
 <OpOverload(op='aten.svd', overload='default')>,
 <OpOverload(op='aten.sym_size', overload='int')>,
 <OpOverload(op='aten.sym_stride', overload='int')>,
 <OpOverload(op='aten.take_along_dim', overload='default')>,
 <OpOverload(op='aten.tensordot', overload='default')>,
 <OpOverload(op='aten.thnn_conv2d', overload='default')>,
 <OpOverload(op='aten.tile', overload='default')>,
 <OpOverload(op='aten.to_dense', overload='default')>,
 <OpOverload(op='aten.to_dense_backward', overload='default')>,
 <OpOverload(op='aten.to_mkldnn_backward', overload='default')>,
 <OpOverload(op='aten.to_sparse', overload='default')>,
 <OpOverload(op='aten.to_sparse', overload='sparse_dim')>,
 <OpOverload(op='aten.to_sparse_bsc', overload='default')>,
 <OpOverload(op='aten.to_sparse_bsr', overload='default')>,
 <OpOverload(op='aten.to_sparse_csc', overload='default')>,
 <OpOverload(op='aten.to_sparse_csr', overload='default')>,
 <OpOverload(op='aten.trace_backward', overload='default')>,
 <OpOverload(op='aten.trapezoid', overload='dx')>,
 <OpOverload(op='aten.trapezoid', overload='x')>,
 <OpOverload(op='aten.trapz', overload='dx')>,
 <OpOverload(op='aten.trapz', overload='x')>,
 <OpOverload(op='aten.triplet_margin_loss', overload='default')>,
 <OpOverload(op='aten.true_divide', overload='Scalar')>,
 <OpOverload(op='aten.true_divide', overload='Tensor')>,
 <OpOverload(op='aten.type_as', overload='default')>,
 <OpOverload(op='aten.unflatten_dense_tensors', overload='default')>,
 <OpOverload(op='aten.upsample_bicubic2d', overload='vec')>,
 <OpOverload(op='aten.upsample_bilinear2d', overload='vec')>,
 <OpOverload(op='aten.upsample_linear1d', overload='vec')>,
 <OpOverload(op='aten.upsample_nearest1d', overload='default')>,
 <OpOverload(op='aten.upsample_nearest1d', overload='vec')>,
 <OpOverload(op='aten.upsample_nearest2d', overload='default')>,
 <OpOverload(op='aten.upsample_nearest2d', overload='vec')>,
 <OpOverload(op='aten.upsample_nearest3d', overload='default')>,
 <OpOverload(op='aten.upsample_nearest3d', overload='vec')>,
 <OpOverload(op='aten.upsample_trilinear3d', overload='vec')>,
 <OpOverload(op='aten.value_selecting_reduction_backward', overload='default')>,
 <OpOverload(op='aten.vander', overload='default')>,
 <OpOverload(op='aten.var', overload='correction_names')>,
 <OpOverload(op='aten.var', overload='default')>,
 <OpOverload(op='aten.var', overload='dim')>,
 <OpOverload(op='aten.var', overload='names_dim')>,
 <OpOverload(op='aten.var_mean', overload='correction_names')>,
 <OpOverload(op='aten.var_mean', overload='default')>,
 <OpOverload(op='aten.var_mean', overload='dim')>,
 <OpOverload(op='aten.var_mean', overload='names_dim')>,
 <OpOverload(op='aten.vstack', overload='default')>,
 <OpOverload(op='aten.where', overload='Scalar')>,
 <OpOverload(op='aten.where', overload='ScalarOther')>,
 <OpOverload(op='aten.where', overload='ScalarSelf')>,
 <OpOverload(op='aten.where', overload='default')>,
 <OpOverload(op='aten.wrapped_linear_prepack', overload='default')>,
 <OpOverload(op='aten.wrapped_quantized_linear_prepacked', overload='default')>
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136153
Approved by: https://github.com/xadupre, https://github.com/gramalingam
2024-09-16 21:28:54 +00:00
b76d1b79e6 Add scaling arguments to bsr_dense_addmm (#136104)
As in the title.

Tackles https://github.com/pytorch/ao/pull/821/files#r1759821413

The PR assumes that the existing tuning parameters remain good when scaling arguments are used. This needs to be verified in a follow-up task.

Also, this PR redefines triton-contiguous tensors: the tensor must have strides not larger than 1. This now allows zero strides, which previously triggered a `contiguous` call even though the underlying memory buffer was contiguous.

Re: "a considerable slow-down occurs because tensor data is copied element-wise rather than chunk-wise" - this note should refer to the code (torch or triton?) that implements the element/chunk-wise copy, so that we can verify that allowing zero strides indeed does not trigger element-wise copies. At the moment, the performance increase in ViT-H benchmarks (which involve 0 strides) is evidence that allowing zero strides does not lead to slow-downs.
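
The triton-contiguity rule described above can be sketched on plain stride tuples. This is a minimal illustration and not the code from the PR: `is_triton_contiguous` is a hypothetical helper, and it reads "strides not larger than 1" as a condition on the smallest stride, which is an assumption.

```python
from typing import Sequence


def is_triton_contiguous(strides: Sequence[int]) -> bool:
    """Hypothetical check: True when no copy would be needed under the new rule.

    Assumption: the rule applies to the smallest stride, so a stride of 0
    (an expanded/broadcast dimension) or 1 (a dense dimension) qualifies.
    """
    # A 0-dim tensor has no strides; treat it as trivially contiguous.
    return len(strides) == 0 or min(strides) <= 1


# Row-major 3x4 buffer: the innermost stride is 1.
dense = is_triton_contiguous((4, 1))
# Expanded tensor: every stride is 0, so there is nothing to copy element-wise.
expanded = is_triton_contiguous((0, 0))
# Every stride is larger than 1: a `contiguous` call would still be needed.
strided = is_triton_contiguous((8, 2))
```

Under this reading, only the last case would still pay for a `contiguous` copy; the all-zero-stride case is the one the PR newly accepts.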

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136104
Approved by: https://github.com/cpuhrsch
2024-09-16 20:26:54 +00:00
bfbcdf4967 Revert "[dynamo] Fix support for classmethod(property(...)) (#134968)"
This reverts commit c64ae601ba9eb3ad2cd3402a14f6ac83c0ab7eba.

Reverted https://github.com/pytorch/pytorch/pull/134968 on behalf of https://github.com/jeanschmidt due to Breaking internal signals, we need to skip the new tests on py3.10 ([comment](https://github.com/pytorch/pytorch/pull/134968#issuecomment-2353909010))
2024-09-16 20:26:35 +00:00
3c97b0ab00 Use ncclAlltoAllv and ncclAlltoAll API when supported (#134499)
NCCL does not have an API for ncclAllToAll and ncclAllToAllv, so PyTorch does point-to-point send/recv instead. Expose this API if it is supported.

Differential Revision: [D61683836](https://our.internmc.facebook.com/intern/diff/D61683836/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134499
Approved by: https://github.com/shuqiangzhang, https://github.com/eqy
2024-09-16 20:08:06 +00:00
abd16a8c64 [torch/multiprocessing] Use multiprocessing.reduction.register ForkingPickler.register to register custom tensor and storage reductions (#135030)
Right now `multiprocessing.reduction.register()` is simply an alias of `multiprocessing.reduction.ForkingPickler.register()`
(https://github.com/python/cpython/blame/main/Lib/multiprocessing/reduction.py#L56), but the top-level `register()` function exposes less of the internal details of the `multiprocessing.reduction` module.
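
As a minimal sketch of what registering a custom reduction through the top-level helper looks like, using only the standard library: `SharedBuffer` and `reduce_shared_buffer` are hypothetical names, and the reducer rebuilds a plain `dict` for brevity, whereas a real reducer (such as the tensor and storage reductions this PR touches) would return a function that reconstructs the original object.

```python
import pickle
from multiprocessing import reduction


class SharedBuffer:
    """Hypothetical object that should not be pickled field by field."""

    def __init__(self, name: str, size: int) -> None:
        self.name = name
        self.size = size


def reduce_shared_buffer(buf: SharedBuffer):
    # A reducer returns (callable, args); unpickling calls callable(*args).
    # For brevity the receiver gets a plain dict describing the buffer.
    return (dict, ({"name": buf.name, "size": buf.size},))


# Top-level helper; no need to reach into ForkingPickler directly.
reduction.register(SharedBuffer, reduce_shared_buffer)

# ForkingPickler is the pickler multiprocessing uses for objects it sends.
payload = reduction.ForkingPickler.dumps(SharedBuffer("shm-0", 1024))
restored = pickle.loads(payload)
```

The registration only affects `ForkingPickler`; a plain `pickle.dumps` of the same object would not go through the custom reducer.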
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135030
Approved by: https://github.com/albanD
2024-09-16 20:07:29 +00:00
a0c7029a75 [c10d][Reland] Remove Option for ProcessGroup and Expose backend Options to reflect the correct code structure (#132931) (#135653)
We introduced the dispatchable backend for a ProcessGroup and collective in https://github.com/pytorch/pytorch/issues/86225. This PR is a follow-up that cleans up the options of a ProcessGroup and asks users to either set the timeout or backend later on, or directly create the backend after creating a PG.

Also, PGNCCL is using the option class from ProcessGroup, but we should actually use the Option from the backend class. So this PR aligns the type and name with what we are doing on the cpp side. I don't change the signature of the public APIs, so they still use args named "pg_options".

The tests need to be updated to align with this change.

This is an attempt to reland D62008954 by fixing internal errors.

Differential Revision: [D62483294](https://our.internmc.facebook.com/intern/diff/D62483294/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135653
Approved by: https://github.com/wz337, https://github.com/H-Huang
2024-09-16 19:56:42 +00:00
7537f74277 Refactor FxGraphCache.load into separate functions, so that AOTAutogradCache may access it correctly later (#135491)
Summary:
We refactor FxGraphCache.load into three phases:
- prepare_key, which checks that an inductor input is cacheable and bypasses otherwise
- load_with_key, which tries to look up the key in the cache
- post compile, where we do some logging and run post compile steps

Splitting it along these lines will allow AOTAutogradCache to use load_with_key and still get access to all of the observability + remote cache logic when accessing FxGraphCache, without needing to pass key components, etc.
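
The three phases above can be sketched with a toy cache. This is an illustration under stated assumptions, not the actual `FxGraphCache` code: `ToyGraphCache`, `BypassCache`, `fake_compile`, the string-based "graph", and all signatures are hypothetical; only the phase names come from the summary.

```python
import hashlib
from typing import Callable, Dict, List, Optional


class BypassCache(Exception):
    """Raised when an input must not be cached (hypothetical)."""


class ToyGraphCache:
    """Hypothetical stand-in; not the real FxGraphCache."""

    _store: Dict[str, str] = {}

    @staticmethod
    def prepare_key(graph_src: str, cacheable: bool = True) -> str:
        # Phase 1: reject uncacheable inputs, otherwise derive a key.
        if not cacheable:
            raise BypassCache(graph_src)
        return hashlib.sha256(graph_src.encode()).hexdigest()

    @classmethod
    def load_with_key(cls, key: str) -> Optional[str]:
        # Phase 2: plain lookup; another cache layer could call this directly.
        return cls._store.get(key)

    @staticmethod
    def post_compile(artifact: str, log: List[str]) -> str:
        # Phase 3: logging and any post-compile steps.
        log.append(f"ready: {artifact}")
        return artifact

    @classmethod
    def load(
        cls,
        graph_src: str,
        compile_fn: Callable[[str], str],
        log: List[str],
        cacheable: bool = True,
    ) -> str:
        try:
            key = cls.prepare_key(graph_src, cacheable)
        except BypassCache:
            log.append("bypass")
            return compile_fn(graph_src)
        artifact = cls.load_with_key(key)
        if artifact is None:
            log.append("miss")
            artifact = compile_fn(graph_src)
            cls._store[key] = artifact
        else:
            log.append("hit")
        return cls.post_compile(artifact, log)


log: List[str] = []
compile_calls: List[str] = []


def fake_compile(src: str) -> str:
    # Stand-in for compilation: record the call and return a fake artifact.
    compile_calls.append(src)
    return src.upper()


first = ToyGraphCache.load("x + y", fake_compile, log)
second = ToyGraphCache.load("x + y", fake_compile, log)
```

The second call is served from the store, so `fake_compile` runs only once. Keeping the lookup in its own method is what lets a caller that already holds a key skip the first phase.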

Differential Revision: D62314862

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135491
Approved by: https://github.com/oulgen
2024-09-16 19:48:08 +00:00
31715be72a [BE]: Update mypy to 1.11.2 (#133816)
Updates mypy to 1.11.2 to improve type inference

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133816
Approved by: https://github.com/ezyang
2024-09-16 19:44:11 +00:00
38caf10411 [EZ] Fix spelling typo (#136157)
s/toosl/tools/ (spotted by @louie-tsai)
Also, capitalize CUDA

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136157
Approved by: https://github.com/kit1980
2024-09-16 19:30:30 +00:00
c977bb7d03 [Distributed] fix FileSystemWriter __init__ (#136135)
Fixes #135608.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136135
Approved by: https://github.com/Skylion007
2024-09-16 19:11:08 +00:00
717fca2cac Drop outdated section 'Running clang-tidy' in CONTRIBUTING.md (#136146)
Fixes #125920

The [Running clang-tidy](https://github.com/pytorch/pytorch/blob/main/CONTRIBUTING.md#running-clang-tidy) section is misleading and outdated. C++ linting is done with lintrunner and is covered in the [local-linting](https://github.com/pytorch/pytorch/blob/main/CONTRIBUTING.md#local-linting) section.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136146
Approved by: https://github.com/janeyx99
2024-09-16 19:02:21 +00:00
f89ce4dfbb torch.nn.MultiheadAttention: docs: improvement (#136111)
`torch.nn.MultiheadAttention`: docs: improvement
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136111
Approved by: https://github.com/janeyx99
2024-09-16 18:52:20 +00:00
d3647d15e6 Remove accidentally committed code (#136154)
Accidentally left out during rebase

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136154
Approved by: https://github.com/kit1980, https://github.com/albanD
2024-09-16 18:34:20 +00:00
d0cebedb31 Revert "Add Triton CPU as an Inductor backend (#133408)"
This reverts commit e498b02b472e45cfd6b7a08db0d6c1babec655c5.

Reverted https://github.com/pytorch/pytorch/pull/133408 on behalf of https://github.com/jeanschmidt due to Broke internal signals, see D62737208 for more details ([comment](https://github.com/pytorch/pytorch/pull/133408#issuecomment-2353623816))
2024-09-16 18:33:33 +00:00
7fe004f7cf Revert "Add CI for Triton CPU backend (#135342)"
This reverts commit 426580a67db15ec17b2b861a09667bf59927e033.

Reverted https://github.com/pytorch/pytorch/pull/135342 on behalf of https://github.com/jeanschmidt due to Broke internal signals, see D62737208 for more details ([comment](https://github.com/pytorch/pytorch/pull/133408#issuecomment-2353623816))
2024-09-16 18:33:33 +00:00
23c0d2689e [BE][Ez]: Fix missing float16 coverage for adaptive_pool3d_cpu (#136091)
Testing if op info coverage has issues

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136091
Approved by: https://github.com/ezyang
2024-09-16 18:22:16 +00:00
5193f23469 [Pytorch] Cleanup Strobelight URL and shorten for readability (#136102)
Summary:
- Converted strobelight URL prefix to more readable and editable json
- Dump shortened URLs when possible for easier readability

Test Plan:
```
python ./torch/_strobelight/examples/compile_time_profile_example.py
python torch/_strobelight/examples/cli_function_profiler_example.py
```

Differential Revision: D62690292

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136102
Approved by: https://github.com/laithsakka
2024-09-16 18:10:33 +00:00
0199fd4d7e Revert "[inductor] More fixes on the keys of constants and signature dictionaries (#135406)"
This reverts commit e54b559e8860e343692bb5534777b2384a57a613.

Reverted https://github.com/pytorch/pytorch/pull/135406 on behalf of https://github.com/jeanschmidt due to Reverting as it is breaking triton_mtia internal signals @jansel could you have a look and help get those changes merged? ([comment](https://github.com/pytorch/pytorch/pull/135406#issuecomment-2353557481))
2024-09-16 17:58:02 +00:00
b491e2974c [BE][Ez]: Add full half/bfloat16 dtype for unique and isin (#136114)
Fixes #136090

* Add support for isin to tensor half dtypes for CPU (just add a few extra dispatches).
* Seems like the CUDA implementation for bfloat16 was mostly compiled and available all along (it just calls sort internally AND unique). To enable it, we just need to remove an assert to access it (since sort's functionality was updated since the assert was added) and add missing dtype support to unique.
* This unlocks more GPU functionality with minimal code bloat. I also added CPU kernels for the dtypes for parity.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136114
Approved by: https://github.com/malfet
2024-09-16 17:49:12 +00:00
0aa41eb52f [ONNX] Run type promotion test in CI and update the table (#135915)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135915
Approved by: https://github.com/gramalingam, https://github.com/xadupre
2024-09-16 16:46:13 +00:00
090046b936 [effects] Turn off dtype promotion for with_effects lowering (#136039)
By default inductor promotes arguments to the common highest dtype.
Having empty token with dtype=torch.float32 results in dtype promotion for effectful ops during lowering of with_effects.

Disabling dtype promotion for this lowering.

Removing previous workaround making token dtype torch.bool.

Testing:

```
python test/distributed/test_c10d_functional_native.py -k test_inductor_dtypeview_memory_lea
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136039
Approved by: https://github.com/bdhirsh, https://github.com/eellison, https://github.com/zou3519
2024-09-16 16:14:05 +00:00
c33b0580e6 Add decomposition for squeeze_copy (#130941)
* Extracted from #128416

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130941
Approved by: https://github.com/amjames, https://github.com/eellison
2024-09-16 15:46:57 +00:00
13bd1256f9 Delete stable prototype (#135911)
This project ended up going in an entirely different direction, so we can close out all this
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135911
Approved by: https://github.com/izaitsevfb, https://github.com/malfet
2024-09-16 15:32:17 +00:00
d833f49602 [reland][Inductor] Rename cpp_wrapper_cuda.py as cpp_wrapper_gpu.py (#136046)
Summary: Reland https://github.com/pytorch/pytorch/pull/135313 after fixing internal build issues

Test Plan: CI

Differential Revision: D62658837

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136046
Approved by: https://github.com/chenyang78, https://github.com/etaf, https://github.com/jansel
2024-09-16 14:35:19 +00:00
a803cb0531 [AOTI] Refactor how cpp_wrapper specific options are set (#136035)
Summary:
1) When cpp-wrapper is turned on, certain triton specific options need to be set, both for forward and backward. This PR consolidates the settings in one place.
2) Change config.triton.autotune_at_compile_time to default to None. If the flag is not explicitly set by user, default it to True for cpp-wrapper.

Differential Revision: [D62689940](https://our.internmc.facebook.com/intern/diff/D62689940)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136035
Approved by: https://github.com/chenyang78
2024-09-16 14:32:13 +00:00
bbc3fdbbde Add python 3.13.0t build to Docker images (#136001)
Adds 3.13t python to Docker images
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136001
Approved by: https://github.com/albanD
2024-09-16 12:49:36 +00:00
3117f2cf67 Revert "[BE]: Update mypy to 1.11.2 (#133816)"
This reverts commit 55299cfc223fa838aadd8d6d6fa3ed541fa5acd1.

Reverted https://github.com/pytorch/pytorch/pull/133816 on behalf of https://github.com/jeanschmidt due to seems to have broken https://github.com/pytorch/pytorch/actions/runs/10865710499/job/30155699792 on main ([comment](https://github.com/pytorch/pytorch/pull/133816#issuecomment-2352377684))
2024-09-16 09:11:16 +00:00
951c21d679 [dynamo] simplify implementation for builtins.sum (#133779)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133779
Approved by: https://github.com/jansel, https://github.com/anijain2305
ghstack dependencies: #133778
2024-09-16 04:53:06 +00:00
9961aaa601 [dynamo] simplify implementation for functools.reduce (#133778)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133778
Approved by: https://github.com/jansel, https://github.com/anijain2305
2024-09-16 04:53:06 +00:00
d2207c57f7 [Distributed] add pack-check method for float8_e5m2 (#136115)
Add support for Float8_e5m2, following similar algorithm used for Float8_e4m3fn (i.e. overflow check).

Made `HasNanFP8x8` a template so that it is extendable based on dtype.
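
As a rough illustration of the overflow-style check described above, here is a Python sketch. It is not the C++ `HasNanFP8x8` template itself; the function names and the packing of 8 bytes into one integer are assumptions made for illustration. It relies only on the bit layouts: in e5m2 (1 sign, 5 exponent, 2 mantissa bits) the sign-cleared encodings 0x7D..0x7F are NaN and 0x7C is infinity, while in e4m3fn only the sign-cleared encoding 0x7F is NaN.

```python
# Illustrative sketch only; names are hypothetical, not PyTorch APIs.
LOW7 = 0x7F7F7F7F7F7F7F7F  # clears the sign bit of each of the 8 packed bytes
HIGH = 0x8080808080808080  # bit 7 of each packed byte


def pack8(values):
    """Pack 8 raw FP8 bytes into one 64-bit integer (little-endian)."""
    assert len(values) == 8
    return int.from_bytes(bytes(values), "little")


def has_nan_e5m2_x8(packed):
    # Adding 0x03 per byte pushes exactly the NaN encodings (0x7D..0x7F)
    # into bit 7; infinity (0x7C) becomes 0x7F and stays below it.
    # No byte can carry into its neighbour, since 0x7F + 0x03 < 0x100.
    return (((packed & LOW7) + 0x0303030303030303) & HIGH) != 0


def has_nan_e4m3fn_x8(packed):
    # e4m3fn has a single NaN magnitude (0x7F), so adding 0x01 suffices.
    return (((packed & LOW7) + 0x0101010101010101) & HIGH) != 0


if __name__ == "__main__":
    print(has_nan_e5m2_x8(pack8([0x00, 0x3C, 0x7C, 0xFC, 0x7B, 0x01, 0x80, 0x40])))
    print(has_nan_e5m2_x8(pack8([0, 0, 0, 0x7D, 0, 0, 0, 0])))
```

The only difference between the two dtypes in this sketch is the per-byte constant, which is the kind of variation a dtype-templated check can absorb.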

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136115
Approved by: https://github.com/Skylion007
ghstack dependencies: #135891, #135961
2024-09-15 21:37:43 +00:00
e501ed71d4 Update link in distributed.tensor.parallel.rst (#136103)
dtensor folder was moved

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136103
Approved by: https://github.com/kwen2501, https://github.com/fegin
2024-09-15 19:36:29 +00:00
ab9a7eadd3 Add decomposition for permute_copy (#130944)
* Extracted from #129476

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130944
Approved by: https://github.com/amjames, https://github.com/eellison
2024-09-15 19:35:14 +00:00
a141c6bb0d [pytorch][monitoring] Dynamic backend for WaitCounter (#135967)
Summary: This implements a default backend proxy that tries to look up a backend via dlsym. What this enables is dynamically loading a module with a backend implementation without having it statically linked with the application.
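
A minimal Python sketch of the lookup-with-fallback idea, assuming a POSIX system where `ctypes.CDLL(None)` opens the running process (the equivalent of `dlopen(NULL)`) and attribute access performs the `dlsym` lookup. The symbol name and both backend classes are hypothetical; the actual change is in C++ and may be structured differently.

```python
import ctypes


class NoopBackend:
    """Fallback used when no backend symbol is present in the process."""

    def start(self, key):
        return None

    def stop(self, key):
        return None


class DynamicBackend:
    """Holds the resolved symbol; calling it is left out of this sketch."""

    def __init__(self, factory):
        self.factory = factory


def lookup_symbol(name):
    """dlsym-style lookup in the already-loaded process image (POSIX only)."""
    process = ctypes.CDLL(None)
    try:
        return getattr(process, name)
    except AttributeError:
        # ctypes raises AttributeError when the symbol cannot be found.
        return None


def get_backend(symbol="hypothetical_wait_counter_backend_factory"):
    factory = lookup_symbol(symbol)
    if factory is None:
        return NoopBackend()
    return DynamicBackend(factory)


if __name__ == "__main__":
    print(type(get_backend()).__name__)
```

Because the lookup happens at run time, a module that exports the symbol can be loaded later without the application linking against it.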

Differential Revision: D62549295

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135967
Approved by: https://github.com/c-p-i-o
2024-09-15 18:07:49 +00:00
dec3403b24 Add some doc for export_for_training (#135918)
Differential Revision: [D62610491](https://our.internmc.facebook.com/intern/diff/D62610491)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135918
Approved by: https://github.com/avikchaudhuri
ghstack dependencies: #135080, #135912
2024-09-15 17:08:12 +00:00
1904b09e61 Create export_for_inference API and expose core_aten as public facing API (#135912)
Differential Revision: [D62606908](https://our.internmc.facebook.com/intern/diff/D62606908)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135912
Approved by: https://github.com/avikchaudhuri
ghstack dependencies: #135080
2024-09-15 17:05:07 +00:00
382fad58b3 Deprecate _preserve_ops and consolidate with decomp_table (#135080)
In this PR, we deprecate the _preserve_ops feature of the run_decompositions API. We can't remove it completely yet because the Executorch team depends on it. Since syncing between the two repos is non-trivial, I am leaving this argument as deprecated for now and will remove it in the next PR.

After this PR, run_decompositions will only decompose what's inside the decomp table and preserve the rest by default. Note that this feature is only rolled out to OSS for now. Old code path is protected under IS_FBCODE flag.

Differential Revision: [D62163161](https://our.internmc.facebook.com/intern/diff/D62163161/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135080
Approved by: https://github.com/justinchuby, https://github.com/avikchaudhuri, https://github.com/bdhirsh
2024-09-15 17:01:58 +00:00
357b7fb579 Revert "[Pytorch] Consolidate Strobelight compile time profiler between OSS and fbcode (#135953)"
This reverts commit b8637503c036abb898f6b880b325aeffe6f09c03.

Reverted https://github.com/pytorch/pytorch/pull/135953 on behalf of https://github.com/kollasb due to Broke internal module factory compatibility, revert from Phabricator failed ([comment](https://github.com/pytorch/pytorch/pull/135953#issuecomment-2351381777))
2024-09-15 05:32:38 +00:00
cyy
31e42a45dd Fix redundant move warnings by g++ (#134987)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134987
Approved by: https://github.com/ezyang
2024-09-15 05:28:19 +00:00
e1abd346a3 [audio hash update] update the pinned audio hash (#136106)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136106
Approved by: https://github.com/pytorchbot
2024-09-15 04:31:35 +00:00
386884e553 [Traceable FSDP2] Ignore FSDP2 forward hook side-effects in AC; Support FSDP2 + AC (#134997)
> Ignore FSDP2 forward hook side-effects in AC

Under AC, FSDP2 does not rely on forward hook to all-gather weights to do recomputation, instead it relies on pre-backward hook to do this job:
451eaf0ff2/torch/distributed/_composable/fsdp/_fsdp_state.py (L219-L220)

So when we use `speculate_subgraph` to trace the utils.checkpoint AC region, we don't actually need to worry about FSDP2 forward hook's side effects and can safely ignore it, because we are not and we don't expect to re-run the FSDP2 forward hook during backward recomputation.

----

Test commands:
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_nested_fully_shard_backend_inductor`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_backend_inductor`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134997
Approved by: https://github.com/zou3519
ghstack dependencies: #135727
2024-09-15 02:00:17 +00:00
8072ebc36c SKIP llama for dynamic size testing (#135960)
Running Torchbench llama with dynamic size failed with
```
  File "/localdisk/leslie/torch_inductor_community/pytorch/torch/fx/experimental/symbolic_shapes.py", line 4182, in produce_guards
    raise ConstraintViolationError(
torch.fx.experimental.symbolic_shapes.ConstraintViolationError: Constraints violated (L['inputs'][0].size()[0])! For more information, run with TORCH_LOGS="+dynamic".
  - Not all values of RelaxedUnspecConstraint(L['inputs'][0].size()[0]) are valid because L['inputs'][0].size()[0] was inferred to be a constant (32).
```
Skip this model for marking dynamic dim.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135960
Approved by: https://github.com/ezyang
2024-09-15 00:06:49 +00:00
a1a57a424d Optimize dict reconstruct to not codegen untouched values (#134876)
This PR changes how `reconstruct` is done for a ConstDict. As of today, it works as follows:
(1) codegen(...) each pair of key/value
(2) create a new dictionary to hold the new items
(3) clear the original dictionary
(4) update the original dict with the one created in (2)

We do a micro optimization in the generated bytecode to:
- Only codegen the items that changed.
- Only clear the original dictionary if a key was removed.
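
The effect of the two optimizations can be sketched in plain Python. This is only an illustration on ordinary dicts, not the bytecode Dynamo emits; the function name is hypothetical, and treating a value as "changed" when it is not the same object is an assumption of this sketch.

```python
def reconstruct_in_place(original, traced):
    """Mutate `original` to match `traced`; return the keys that were written.

    `traced` stands for the dict contents after tracing.
    """
    removed = any(key not in traced for key in original)
    if removed:
        # A key disappeared, so clear and rebuild every entry.
        original.clear()
        original.update(traced)
        return list(traced)

    # No removals: write only entries that are new or hold a different object.
    written = [
        key
        for key, value in traced.items()
        if key not in original or original[key] is not value
    ]
    for key in written:
        original[key] = traced[key]
    return written


if __name__ == "__main__":
    a, b = [1], [2]
    d = {"a": a, "b": b}
    print(reconstruct_in_place(d, {"a": a, "b": [3], "c": [4]}), d)
```

In the no-removal path the untouched entry `"a"` is never rewritten, which is the saving the optimization targets.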

Fixes: #133487

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134876
Approved by: https://github.com/zou3519
2024-09-14 23:25:28 +00:00
a5eb43d8b4 Add TensorReferenceAnalysis and some tests (#135886)
Split out and modified from https://github.com/pytorch/pytorch/pull/130228. There were a bunch of subtle bugs, e.g. sometimes we need to use torch.ops.aten.{operator}.Tensor and other times torch.ops.aten.{operator}.default, and in the case of pow we need to use Tensor_Tensor. I figured it'd be easier to split out adding TensorReferenceAnalysis with some tests and do the actual integration in a separate diff.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135886
Approved by: https://github.com/ezyang
2024-09-14 23:09:40 +00:00
391f2d6d50 use a fast expand algorithm (#135999)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135999
Approved by: https://github.com/ezyang
2024-09-14 23:09:34 +00:00
5b21d91197 Fix dividing Mul by factor (#136079)
Fixes https://github.com/pytorch/pytorch/issues/136032

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136079
Approved by: https://github.com/ezyang
2024-09-14 22:14:27 +00:00
426580a67d Add CI for Triton CPU backend (#135342)
Where possible, I have marked failing tests (which we intend to fix or triage) as `@xfail_if_triton_cpu`. This will help us track progress of the Triton CPU backend over time. Tests that I don't think we need to address, or that are flaky, have been marked as skips.

Successful CI run: https://github.com/pytorch/pytorch/actions/runs/10822238062/job/30028284549

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135342
Approved by: https://github.com/jansel
ghstack dependencies: #133408
2024-09-14 21:45:19 +00:00
e498b02b47 Add Triton CPU as an Inductor backend (#133408)
The goal is to use Inductor-generated kernels to stress test the new Triton CPU backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133408
Approved by: https://github.com/jansel
2024-09-14 21:45:19 +00:00
55299cfc22 [BE]: Update mypy to 1.11.2 (#133816)
Updates mypy to 1.11.2 to improve type inference

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133816
Approved by: https://github.com/ezyang
2024-09-14 21:40:36 +00:00
c64ae601ba [dynamo] Fix support for classmethod(property(...)) (#134968)
Fixes #134451

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134968
Approved by: https://github.com/yanboliang
2024-09-14 21:00:41 +00:00
7f5abb44af [BE][Ez]: Update pybind11 to 2.13.6. Exposes new conduit cross-compat API (#136087)
Updates pybind11 submodule. The major patchnote is an experimental new function, cpp_conduit, that is added to all pybind11 objects and will make them more compatible across pybind11 versions, settings, and frameworks (such as nanobind). No code changes needed on our end except to update
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136087
Approved by: https://github.com/malfet
2024-09-14 20:48:44 +00:00
8df01c8258 [Dynamo] Remove ignored modes from torch function mode stack guard (#135503)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135503
Approved by: https://github.com/anijain2305
ghstack dependencies: #134732, #133137, #135443, #135444, #135422, #135502
2024-09-14 18:52:22 +00:00
860838e9be [Dynamo] Remove ignored modes workaround (#135502)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135502
Approved by: https://github.com/anijain2305
ghstack dependencies: #134732, #133137, #135443, #135444, #135422
2024-09-14 18:52:22 +00:00
1b9daeb240 [Dynamo] Trace enter/exit of TorchFunctionModes (#135422)
This PR implements tracing of with contexts with TorchFunction modes which have the default enter/exit behavior (ie pushing/popping the mode)

Typically the bytecode for a context manager looks like this during a graph break:
1. graph call
2. enter context
3. unsupported code
4. exit context
5. resume call

resume fn structure:
1. enter context
2. jump
...
3. exit context

The issue with torch function modes is that side effects will replay any mutations to the torch function stack performed during tracing. So, we do not need to enter and exit around the unsupported code in the original function (doing so would result in a duplicate torch function mode entry during execution of the unsupported code), and we don't need to enter again in the resume function (the mode that was pushed from the side effects bytecode would still be on the stack).

So for torch function modes the structure of our output code is this:

1. graph call
2. mutate tf mode stack to replay mutations
3. unsupported code
4. on exception restore stack
5. resume function

Then our resume fn looks like this:

1. no-op enter torch function mode
2. jump
3. exit tf mode

To implement the no-op enter of the torch function mode I added torch function mode in polyfill which no-op enters, but normally exits. This is needed because we still want to trace the with context in the resume function, and exit properly (the exit instructions will still be in the function, so we need to generate instructions to set up the context).
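
The enter-once / exit-once structure can be illustrated with a toy model in plain Python. Everything here is a stand-in: `mode_stack` plays the role of the torch function mode stack, and `Mode`, `NoEnterMode`, `replay_mutations` and `resume_fn` are hypothetical names, not the polyfill added in this PR.

```python
mode_stack = []  # stand-in for the global torch function mode stack


class Mode:
    """A context manager with the default push-on-enter / pop-on-exit behavior."""

    def __init__(self, name):
        self.name = name

    def __enter__(self):
        mode_stack.append(self)
        return self

    def __exit__(self, *exc):
        mode_stack.pop()
        return False


class NoEnterMode:
    """Wraps a mode that is already on the stack: enter is a no-op, exit is normal."""

    def __init__(self, mode):
        self.mode = mode

    def __enter__(self):
        return self.mode  # deliberately does not push again

    def __exit__(self, *exc):
        return self.mode.__exit__(*exc)


def replay_mutations(mode):
    # Models the side-effects bytecode: the push is replayed once, with no
    # paired exit around the unsupported code.
    mode_stack.append(mode)


def resume_fn(mode):
    with NoEnterMode(mode):
        depth_inside = len(mode_stack)
    return depth_inside, len(mode_stack)


if __name__ == "__main__":
    m = Mode("m")
    replay_mutations(m)
    print(resume_fn(m))
```

Using the ordinary `Mode` in `resume_fn` instead would push the mode a second time, which is the duplicate entry the text above warns about.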

Separately from the bytecode, dynamo also tracks contexts on the block stack, which is how the SETUP_* instructions are implemented. Naturally, at a graph break we exit these block stacks to properly reset the contexts entirely, so that we can re-enter around the unsupported code soundly. However, once again, in the torch function mode case we do not want to perform any exit side effects in the event of a graph break, because we want to preserve the state of the mode stack as is so that we properly update the stack with the bytecode mentioned in the first section. If we exited here, dynamo would pop the mode off of the symbolic stack and not update the true python torch function mode stack with the suffix bytecode. All in all, for torch function modes we enter exactly once, update the global torch function mode stack with side effects bytecode, re-read this stack when compiling the resume function, and exit exactly once in the resume function. This matches the semantics of eager exactly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135422
Approved by: https://github.com/williamwen42
ghstack dependencies: #134732, #133137, #135443, #135444
2024-09-14 18:52:22 +00:00
06caa2d560 [Dynamo] Simplify torch function mode stack guard (#135444)
The semantics of ignored modes previously had edge cases; this eliminates them by, in essence, filtering any ignored modes out of both the ref stack and the current torch function mode stack. This is purely to fix complexity in #135422. The ignored modes handling will be removed in a future PR after https://github.com/pytorch/pytorch/pull/135422 lands, since we will then trace through DeviceContexts instead of inserting them into the graph, which needed these extra workarounds for correctness.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135444
Approved by: https://github.com/anijain2305, https://github.com/williamwen42
ghstack dependencies: #134732, #133137, #135443
2024-09-14 18:52:22 +00:00
14cabdf626 [Dynamo] Support thread local setattr (#135443)
In preparation for tracing through DeviceContext (defb515306/torch/utils/_device.py (L66))
This PR adds support for calling the setattr of thread local objects. These objects have a slots impl, and since this doesn't appear to have any side effects, we call this setattr impl when replaying mutations, since calling `object.__setattr__` on these objects results in a type error.
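
A small Python sketch of the difference, using `threading.local` from the standard library. The helper name `replay_setattr` is hypothetical and this is not Dynamo's side-effect replay code; it only contrasts the generic setter with the type's own setter.

```python
import threading


def replay_setattr(obj, name, value):
    """Replay an attribute mutation through the object's type-level __setattr__."""
    type(obj).__setattr__(obj, name, value)


local = threading.local()

try:
    # The generic setter, which the text above reports as failing here.
    object.__setattr__(local, "device", "cpu")
except TypeError as exc:
    print("object.__setattr__ rejected:", exc)

# The type's own setter works for thread-local objects.
replay_setattr(local, "device", "cuda")
print(local.device)
```

For a subclass that overrides `__setattr__` in Python, `type(obj).__setattr__` would run user code, so this sketch is a simplification.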

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135443
Approved by: https://github.com/anijain2305
ghstack dependencies: #134732, #133137
2024-09-14 18:52:22 +00:00
5c5c33ac32 [Dynamo] Trace torch function modes entered outside of torch.compile (#133137)
This PR adds initial tracing for torch function modes.

Details:
In essence, this adds tracing into the torch function of modes entered outside of the torch.compile call.
This does not yet support tracing enter/exit of a torch function mode/ tracing set_default_device properly using the new mode infra (this will be a very good stress test for modes). I am adding more PRs to this stack to support these. The overall plan is to support tracing enter/exit and handling graph breaks like we do other torch.* context managers.

Previously landed:
https://github.com/pytorch/pytorch/pull/133135
https://github.com/pytorch/pytorch/pull/133136
https://github.com/pytorch/pytorch/pull/133134
https://github.com/pytorch/pytorch/pull/133133
https://github.com/pytorch/pytorch/pull/133132
https://github.com/pytorch/pytorch/pull/133131
https://github.com/pytorch/pytorch/pull/133729
https://github.com/pytorch/pytorch/pull/133130

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133137
Approved by: https://github.com/jansel, https://github.com/zou3519
ghstack dependencies: #134732
2024-09-14 18:52:22 +00:00
228760b945 [Dynamo] Use custom backend to reenter metadata tf mode when tracing while/cond (#134732)
For tracing cond/while in eager, we trace the HOP with the eager backend with metadata torchfunction mode enabled. HOPs disallow the mutation that occurs in this torch function mode, so it is not able to be traced. As a result, we use a custom backend which enters this mode for tracing these HOPs. Thanks to @ydwu4 for the help with implementing this

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134732
Approved by: https://github.com/ydwu4
2024-09-14 18:52:22 +00:00
b4c84c3167 [AOTI] Fix a fallback op returning None issue (#135997)
Summary: Fixes https://github.com/pytorch/pytorch/issues/135781. In some cases, a fallback can return None in the place of a tensor.

Differential Revision: [D62659039](https://our.internmc.facebook.com/intern/diff/D62659039)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135997
Approved by: https://github.com/chenyang78
2024-09-14 18:12:06 +00:00
b82122beef Only keep ListOfLinears module in basic_modules_benchmarks and add gpu version. (#135730)
All of the previous benchmarks are similar, so ListOfLinears should be representative enough.
I copied the previous benchmarks from unit tests without any particular intention; I was just trying to create a large
number of benchmarks to better observe noise.

This PR keeps only one, we can add more as we see value and regressions in the future.
Also this diff adds a GPU version.
```
collecting compile time instruction count for basic_modules_ListOfLinears_eager
compile time instruction count for iteration 0 is 6479525851
compile time instruction count for iteration 1 is 1024432680
compile time instruction count for iteration 2 is 1019417317
compile time instruction count for iteration 3 is 1013603566
compile time instruction count for iteration 4 is 1008853980
compile time instruction count for iteration 5 is 1009541481
compile time instruction count for iteration 6 is 1005025533
compile time instruction count for iteration 7 is 1004116323
compile time instruction count for iteration 8 is 1000828633
compile time instruction count for iteration 9 is 999788323
collecting compile time instruction count for basic_modules_ListOfLinears_inductor
compile time instruction count for iteration 0 is 40837529730
compile time instruction count for iteration 1 is 18411921909
compile time instruction count for iteration 2 is 18383665161
compile time instruction count for iteration 3 is 18348983522
compile time instruction count for iteration 4 is 18349276590
compile time instruction count for iteration 5 is 18353046274
compile time instruction count for iteration 6 is 18346818581
compile time instruction count for iteration 7 is 18340057998
compile time instruction count for iteration 8 is 18331267320
compile time instruction count for iteration 9 is 18328381338
collecting compile time instruction count for basic_modules_ListOfLinears_inductor_gpu
compile time instruction count for iteration 0 is 15408870979
compile time instruction count for iteration 1 is 10949520859
compile time instruction count for iteration 2 is 11058786167
compile time instruction count for iteration 3 is 11003606719
compile time instruction count for iteration 4 is 10896406770
compile time instruction count for iteration 5 is 10982875189
compile time instruction count for iteration 6 is 10931848275
compile time instruction count for iteration 7 is 10956345008
compile time instruction count for iteration 8 is 11045384499
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135730
Approved by: https://github.com/ezyang, https://github.com/anijain2305
2024-09-14 16:45:52 +00:00
b8637503c0 [Pytorch] Consolidate Strobelight compile time profiler between OSS and fbcode (#135953)
Summary:
Move towards consolidating strobelight profiler implementations between OSS and fbcode. This change is a first step towards that.

- Created a new function to abstract out compile time profiling enablement. This function allows profiler to switch between different function profilers (e.g. Thrift based or CLI based)
- Both OSS and Fbcode now use one compile time profiler in torch/_strobelight

Test Plan:
Tested OSS with following commands:
```
python torch/_strobelight/examples/compile_time_profile_example.py
python torch/_strobelight/examples/cli_function_profiler_example.py

TORCH_COMPILE_STROBELIGHT=TRUE TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 python benchmarks/dynamo/huggingface.py --ci --accuracy --timing --explain --inductor --device cuda --training --amp  --only XLNetLMHeadModel
```

See test commands for fbcode in comments.

Differential Revision: D62444551

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135953
Approved by: https://github.com/laithsakka
2024-09-14 16:35:22 +00:00
f97cccf62a [3.13] fix 3.13 pickle error in torch/package (#136049)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136049
Approved by: https://github.com/albanD
ghstack dependencies: #136034
2024-09-14 14:28:09 +00:00
db393fb95e Add Half support for reflection and replication padding on CPU (#135931)
Fixes #135680

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135931
Approved by: https://github.com/Skylion007
2024-09-14 14:18:55 +00:00
23dec79cef Revert "[Dynamo] Use custom backend to reenter metadata tf mode when tracing while/cond (#134732)"
This reverts commit 731b178b56c83966d6e8cdfb0015d22d8f91b4d2.

Reverted https://github.com/pytorch/pytorch/pull/134732 on behalf of https://github.com/mlazos due to broke python test/quantization/pt2e/test_numeric_debugger.py TestNumericDebugger.test_re_export_preserve_handle modified yesterday ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2350937008))
2024-09-14 10:02:55 +00:00
8c8a3086a7 Revert "[Dynamo] Trace torch function modes entered outside of torch.compile (#133137)"
This reverts commit 4528777e034b157a8329d1879daf52290eea199a.

Reverted https://github.com/pytorch/pytorch/pull/133137 on behalf of https://github.com/mlazos due to broke python test/quantization/pt2e/test_numeric_debugger.py TestNumericDebugger.test_re_export_preserve_handle modified yesterday ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2350937008))
2024-09-14 10:02:55 +00:00
46f5037007 Revert "[Dynamo] Support thread local setattr (#135443)"
This reverts commit 149d0b716173787df4543186ff74b605aca54e3e.

Reverted https://github.com/pytorch/pytorch/pull/135443 on behalf of https://github.com/mlazos due to broke python test/quantization/pt2e/test_numeric_debugger.py TestNumericDebugger.test_re_export_preserve_handle modified yesterday ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2350937008))
2024-09-14 10:02:55 +00:00
7975ec3a29 Revert "[Dynamo] Simplify torch function mode stack guard (#135444)"
This reverts commit ce3c74f2744cbc134b95cf8bd53ae5e3fbc67c29.

Reverted https://github.com/pytorch/pytorch/pull/135444 on behalf of https://github.com/mlazos due to broke python test/quantization/pt2e/test_numeric_debugger.py TestNumericDebugger.test_re_export_preserve_handle modified yesterday ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2350937008))
2024-09-14 10:02:55 +00:00
f3180f0088 Revert "[Dynamo] Trace enter/exit of TorchFunctionModes (#135422)"
This reverts commit 7743149b2be4a9eba7e0997ccdc6abe552bec266.

Reverted https://github.com/pytorch/pytorch/pull/135422 on behalf of https://github.com/mlazos due to broke python test/quantization/pt2e/test_numeric_debugger.py TestNumericDebugger.test_re_export_preserve_handle modified yesterday ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2350937008))
2024-09-14 10:02:55 +00:00
838c912502 Revert "[Dynamo] Remove ignored modes workaround (#135502)"
This reverts commit 5c67cf180ee53d696f95d7c45dd99a35399e4450.

Reverted https://github.com/pytorch/pytorch/pull/135502 on behalf of https://github.com/mlazos due to broke python test/quantization/pt2e/test_numeric_debugger.py TestNumericDebugger.test_re_export_preserve_handle modified yesterday ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2350937008))
2024-09-14 10:02:55 +00:00
72b868d034 Revert "[Dynamo] Remove ignored modes from torch function mode stack guard (#135503)"
This reverts commit e77bd0ebd20e96990ccd40518e68bbcfe7fda855.

Reverted https://github.com/pytorch/pytorch/pull/135503 on behalf of https://github.com/mlazos due to broke python test/quantization/pt2e/test_numeric_debugger.py TestNumericDebugger.test_re_export_preserve_handle modified yesterday ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2350937008))
2024-09-14 10:02:54 +00:00
41b58a1bec OpenReg: Fix issue when copying on the same device (#135956)
Current copy gets wrong value when src and dst are both openreg.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135956
Approved by: https://github.com/albanD
2024-09-14 09:57:45 +00:00
f96a073c9d Use _amp_foreach_non_finite_check_and_unscale_ for CPU grads of ShardedGradScaler (#135232)
Use `_amp_foreach_non_finite_check_and_unscale_` instead of the fallback version for CPU grads of `ShardedGradScaler`, as `_amp_foreach_non_finite_check_and_unscale_` is supported on CPU (https://github.com/pytorch/pytorch/pull/109281).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135232
Approved by: https://github.com/ezyang
2024-09-14 09:53:17 +00:00
a815611db9 [Traceable FSDP2][Partitioner] Must save AC output if output has a backward hook (#135727)
If node is AC region output and has a backward hook on it, we intentionally choose to save it.
This is to work around circular dependencies in Traceable FSDP2+AC.
Example:
```
out = fully_shard(utils.checkpoint(module))(x)
norm_out = layer_norm(out)
```
and there is a circular dependency:
1. In backward, grad_input of layer_norm aka. `out_grad` is actually dependent on `out`.
2. `out` depends on `out`'s backward hook created by FSDP2 (which does all-gather for `module` weights) in order to be recomputed.
3. `out`'s FSDP2 backward hook, as is the case for all eager backward hooks, depends on `out_grad`  -> circular dependency with (1)!

Solution: check whether `out` has a backward hook, and if so, intentionally save `out` in forward graph outputs. With this, we can break the above circular dependency.

----

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135727
Approved by: https://github.com/Chillee
2024-09-14 08:45:58 +00:00
3352c9ac94 Add higher order operator name to the cache bypass exception (#135876)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135876
Approved by: https://github.com/jamesjwu, https://github.com/zou3519
2024-09-14 07:05:29 +00:00
5a2be192d1 [Traceable FSDP2] Don't register RegisterPostBackwardFunction if user intends to use Traceable FSDP2, and assert that compiled autograd is not used when entering RegisterPostBackwardFunction (#135824)
During enablement of Traceable FSDP2 on internal models, sometimes the user only applies torch.compile to some of the FSDP2 instances but not all of them. Such a mixed usage pattern is not supported by compiled autograd. Here we try to catch this pattern and throw an error, so that the user can fix the usage.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135824
Approved by: https://github.com/awgu
2024-09-14 06:30:12 +00:00
a9bef85263 [CI] Increase open file handles limit to 16K on MacOS (#136061)
Maybe this will help with the flaky failures tracked in https://github.com/pytorch/pytorch/issues/135885

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136061
Approved by: https://github.com/clee2000, https://github.com/kit1980, https://github.com/huydhn, https://github.com/ZainRizvi
2024-09-14 06:16:12 +00:00
44dd218a61 Disable garbage collection during compile_time_instructions count in benchmark base by default. (#135768)
When we measure compile time instruction count, in most cases we probably do not want to measure GC instructions, so this disables GC here by default.
If it is needed, we can add an option to allow it, or someone can use the regular total instruction count instead of the compile time instruction count.
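
Excluding GC work from a measurement can be sketched roughly as below. This is a minimal illustration using only the standard `gc` module; the helper `gc_disabled` and the function `measured_work` are hypothetical names and are not taken from the benchmark base.

```python
import gc
from contextlib import contextmanager


@contextmanager
def gc_disabled():
    """Temporarily disable the cyclic garbage collector (hypothetical helper)."""
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        yield
    finally:
        # Restore the previous state instead of unconditionally re-enabling.
        if was_enabled:
            gc.enable()


def measured_work():
    # Stand-in for the code whose cost is being measured.
    return sum(i * i for i in range(1000))


with gc_disabled():
    inside = gc.isenabled()  # False while the measurement runs
    result = measured_work()
```

Restoring the earlier collector state, rather than always calling `gc.enable()`, keeps the helper safe to use when the caller had already disabled GC.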

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135768
Approved by: https://github.com/ezyang, https://github.com/anijain2305
2024-09-14 06:15:28 +00:00
1a67e2b680 [MPS] Add native im2col (#135706)
It's called from `torch.unfold` and is one of the few remaining vestiges in `MPSFallback.mm`

Strongly inspired by CUDA implementation from 09519eb195/aten/src/ATen/native/cuda/im2col.cuh (L40-L61)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135706
Approved by: https://github.com/albanD
2024-09-14 06:09:36 +00:00
b9b6094793 [ROCm] Skip pointwise associative scan tests due to regression (#135995)
https://github.com/pytorch/pytorch/pull/133012 caused a regression on ROCm causing pointwise scan tests to fail

```
ERROR: test_pointwise_associative_scan_tuple_reverse_True_combine_mode_pointwise_cuda
ERROR: test_pointwise_associative_scan_tuple_reverse_False_combine_mode_pointwise_cuda
ERROR: test_pointwise_associative_scan_complex_pytree_reverse_True_combine_mode_pointwise_cuda
ERROR: test_pointwise_associative_scan_complex_pytree_reverse_False_combine_mode_pointwise_cuda
ERROR: test_pointwise_associative_scan_binary_operator_reverse_True_combine_mode_pointwise_cuda
ERROR: test_pointwise_associative_scan_binary_operator_reverse_False_combine_mode_pointwise_cuda
```

Skipping temporarily while triage is underway.

Full log: https://ossci-raw-job-status.s3.amazonaws.com/log/30067645445

```
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/_inductor/graph.py", line 1020, in call_function
    out = lowerings[target](*args, **kwargs)  # type: ignore[index]
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/_inductor/lowering.py", line 363, in wrapped
    out = decomp_fn(*args, **kwargs)
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/_inductor/lowering.py", line 6245, in associative_scan
    raise RuntimeError("Unable to generate code for associative_scan op")
torch._inductor.exc.LoweringException: RuntimeError: Unable to generate code for associative_scan op
```

NOTE: even "eager" backend fails
```
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/_higher_order_ops/associative_scan.py", line 338, in associative_scan_op_dense
    raise NotImplementedError("associative_scan is not implemented for eager")
NotImplementedError: associative_scan is not implemented for eager
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135995
Approved by: https://github.com/malfet
2024-09-14 05:40:10 +00:00
911a43f930 [TCPStore] Remove deprecated constructor (#136004)
While looking at the TCPStore code again, I found it confusing that we still keep the deprecated constructor for TCPStore in cpp even though we no longer expose it in python via pybind. I checked both internal and external usage: all use cases in cpp (aside from the unit test fixed in this PR) have already moved to using options. So let's remove this legacy constructor to avoid confusion.

Differential Revision: [D62653634](https://our.internmc.facebook.com/intern/diff/D62653634)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136004
Approved by: https://github.com/Skylion007, https://github.com/XilunWu
2024-09-14 04:25:47 +00:00
e77bd0ebd2 [Dynamo] Remove ignored modes from torch function mode stack guard (#135503)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135503
Approved by: https://github.com/anijain2305
ghstack dependencies: #134732, #133137, #135443, #135444, #135422, #135502
2024-09-14 02:41:16 +00:00
5c67cf180e [Dynamo] Remove ignored modes workaround (#135502)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135502
Approved by: https://github.com/anijain2305
ghstack dependencies: #134732, #133137, #135443, #135444, #135422
2024-09-14 02:41:16 +00:00
7743149b2b [Dynamo] Trace enter/exit of TorchFunctionModes (#135422)
This PR implements tracing of with contexts with TorchFunction modes which have the default enter/exit behavior (ie pushing/popping the mode)

Typically the bytecode for a context manager looks like this during a graph break:
1. graph call
2. enter context
3. unsupported code
4. exit context
5. resume call

resume fn structure:
1. enter context
2. jump
...
3. exit context

The issue with torch function modes is that side effects will replay any mutations to the torch function stack performed during tracing. So, we do not need to enter and exit around the unsupported code in the original function (doing so would result in a duplicate torch function mode entry during execution of the unsupported code), and we don't need to enter again in the resume function (the mode that was pushed from the side effects bytecode would still be on the stack).

So for torch function modes the structure of our output code is this:

1. graph call
2. mutate tf mode stack to replay mutations
3. unsupported code
4. on exception restore stack
5. resume function

Then our resume fn looks like this:

1. no-op enter torch function mode
2. jump
3.  exit tf mode

To implement the no-op enter of the torch function mode I added torch function mode in polyfill which no-op enters, but normally exits. This is needed because we still want to trace the with context in the resume function, and exit properly (the exit instructions will still be in the function, so we need to generate instructions to set up the context).

Separately from the bytecode, dynamo also tracks contexts on the block stack, which is how the SETUP_* instructions are implemented. Naturally, at a graph break we exit these block stacks to properly reset the contexts entirely, so that we can re-enter around the unsupported code soundly. However, once again, in the torch function mode case we do not want to perform any exit side effects in the event of a graph break, because we want to preserve the state of the mode stack as is so that we will properly update the stack with the bytecode mentioned in the first section. If we exited here, dynamo would pop the mode off of the symbolic stack, and not update the true python torch function mode stack with the suffix bytecode. All in all, for torch function modes we enter exactly once, update the global torch function mode stack with side effects bytecode, re-read this stack when compiling the resume function, and exit exactly once in the resume function. This matches the semantics of eager exactly.
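
The "enter exactly once, exit exactly once" behavior can be sketched with a plain Python list standing in for the torch function mode stack. This is only a toy model under stated assumptions: `mode_stack`, `eager_mode`, `noop_enter_mode`, and `run_with_graph_break` are hypothetical names, and nothing here is dynamo's actual implementation.

```python
from contextlib import contextmanager

mode_stack = []  # stand-in for the global torch function mode stack


@contextmanager
def eager_mode(mode):
    """Ordinary context: push on enter, pop on exit."""
    mode_stack.append(mode)
    try:
        yield mode
    finally:
        mode_stack.pop()


@contextmanager
def noop_enter_mode(mode):
    """Resume-function context: enter is a no-op, exit pops as usual."""
    try:
        yield mode
    finally:
        mode_stack.pop()


def run_eager(mode):
    with eager_mode(mode):
        inside = list(mode_stack)
    return [inside, list(mode_stack)]


def run_with_graph_break(mode):
    trace = []
    # Side-effects bytecode replays the push performed during tracing.
    mode_stack.append(mode)
    trace.append(list(mode_stack))      # state seen by the unsupported code
    # Resume function: the mode is already on the stack, so do not push again.
    with noop_enter_mode(mode):
        trace.append(list(mode_stack))  # state seen by the resumed code
    trace.append(list(mode_stack))      # state after the with block exits
    return trace
```

In this model the mode appears on the stack exactly once both during the unsupported code and inside the resumed `with` block, and the stack is empty afterwards, which mirrors what `run_eager` observes.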

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135422
Approved by: https://github.com/williamwen42
ghstack dependencies: #134732, #133137, #135443, #135444
2024-09-14 02:41:08 +00:00
ce3c74f274 [Dynamo] Simplify torch function mode stack guard (#135444)
The semantics of ignored modes previously had edge cases. This PR eliminates them by, in essence, filtering any ignored modes out of both the ref stack and the current torch function mode stack. This is purely to fix complexity in #135422.  The ignored modes handling will be removed in a future PR after https://github.com/pytorch/pytorch/pull/135422 lands, since we will then trace through DeviceContexts vs inserting them into the graph which needed these extra workarounds for correctness.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135444
Approved by: https://github.com/anijain2305, https://github.com/williamwen42
ghstack dependencies: #134732, #133137, #135443
2024-09-14 02:40:59 +00:00
149d0b7161 [Dynamo] Support thread local setattr (#135443)
In preparation for tracing through DeviceContext (defb515306/torch/utils/_device.py (L66))
This PR adds support for calling the setattr of thread local objects. These objects have a slots impl, and since this doesn't appear to have any side effects, we call this setattr impl when replaying mutations, since calling `object.__setattr__` on these objects results in a type error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135443
Approved by: https://github.com/anijain2305
ghstack dependencies: #134732, #133137
2024-09-14 02:40:52 +00:00
4528777e03 [Dynamo] Trace torch function modes entered outside of torch.compile (#133137)
This PR adds initial tracing for torch function modes.

Details:
In essence, this adds tracing into the torch function of modes entered outside of the torch.compile call.
This does not yet support tracing enter/exit of a torch function mode/ tracing set_default_device properly using the new mode infra (this will be a very good stress test for modes). I am adding more PRs to this stack to support these. The overall plan is to support tracing enter/exit and handling graph breaks like we do other torch.* context managers.

Previously landed:
https://github.com/pytorch/pytorch/pull/133135
https://github.com/pytorch/pytorch/pull/133136
https://github.com/pytorch/pytorch/pull/133134
https://github.com/pytorch/pytorch/pull/133133
https://github.com/pytorch/pytorch/pull/133132
https://github.com/pytorch/pytorch/pull/133131
https://github.com/pytorch/pytorch/pull/133729
https://github.com/pytorch/pytorch/pull/133130

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133137
Approved by: https://github.com/jansel, https://github.com/zou3519
ghstack dependencies: #134732
2024-09-14 02:40:43 +00:00
731b178b56 [Dynamo] Use custom backend to reenter metadata tf mode when tracing while/cond (#134732)
For tracing cond/while in eager, we trace the HOP with the eager backend with metadata torchfunction mode enabled. HOPs disallow the mutation that occurs in this torch function mode, so it is not able to be traced. As a result, we use a custom backend which enters this mode for tracing these HOPs. Thanks to @ydwu4 for the help with implementing this

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134732
Approved by: https://github.com/ydwu4
2024-09-14 02:40:32 +00:00
1786a17fed Revert "Use _amp_foreach_non_finite_check_and_unscale_ for CPU grads of ShardedGradScaler (#135232)"
This reverts commit 51c52061339069a2162e921e5b464fad5a411522.

Reverted https://github.com/pytorch/pytorch/pull/135232 on behalf of https://github.com/CaoE due to wrong commit ([comment](https://github.com/pytorch/pytorch/pull/135232#issuecomment-2350792806))
2024-09-14 02:31:06 +00:00
51c5206133 Use _amp_foreach_non_finite_check_and_unscale_ for CPU grads of ShardedGradScaler (#135232)
Use `_amp_foreach_non_finite_check_and_unscale_` instead of the fallback version for CPU grads of `ShardedGradScaler`, as `_amp_foreach_non_finite_check_and_unscale_` is supported on CPU (see https://github.com/pytorch/pytorch/pull/109281).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135232
Approved by: https://github.com/ezyang
2024-09-14 02:20:58 +00:00
2e8d431a8f Fix tensor.data_ptr() representation overflow (#135567)
# Motivation
fix https://github.com/pytorch/pytorch/issues/135550
In PyTorch, [`tensor.data_ptr()`](e889252493/tools/autograd/templates/python_variable_methods.cpp (L204)) is reinterpreted by a [signed int64](e889252493/torch/csrc/autograd/utils/wrap_outputs.h (L50)) data type, which could result in an **overflow issue**, like below:
```python
import torch
a = torch.randn(2).to('xpu')
a.data_ptr()
# one possible output is
-23453392437248
# this is inconsistent with storage.data_ptr()
a.untyped_storage().data_ptr()
# one possible output is
18446720620317114368
```
This PR aims to fix this representation overflow issue to make `tensor.data_ptr()` consistent with [`tensor.untyped_storage().data_ptr()`](c0d2f991b1/torch/csrc/StorageMethods.cpp (L62)). With this PR, the output will become:
```python
import torch
a = torch.randn(2).to('xpu')
a.data_ptr()
# one possible output is
18446720620317114368
# this is consistent with storage.data_ptr()
a.untyped_storage().data_ptr()
# one possible output is
18446720620317114368
```

# Solution
Use `PyLong_FromVoidPtr` to prevent the overflow issue and fit the semantic of `wrap`.
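
The relationship between the two values in the example above can be checked without any device. The sketch below uses only the standard `struct` module to reinterpret the same 64 bits as signed and unsigned; it illustrates the overflow itself and is not the code changed by this PR.

```python
import struct

address = 18446720620317114368  # unsigned value from the example above

# Reinterpret the same 64 bits as a signed integer.
as_signed = struct.unpack("<q", struct.pack("<Q", address))[0]

# Recover the unsigned address from the signed representation.
as_unsigned = as_signed % (1 << 64)

print(as_signed)    # -23453392437248
print(as_unsigned)  # 18446720620317114368
```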
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135567
Approved by: https://github.com/dvrogozh, https://github.com/EikanWang, https://github.com/albanD
2024-09-14 01:52:04 +00:00
95496e4855 [CI] Check that PyTorch is built with OpenMP (#136060)
The restriction to x86-only builds should have been removed a long time ago

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136060
Approved by: https://github.com/clee2000, https://github.com/kit1980, https://github.com/ZainRizvi
2024-09-14 01:51:36 +00:00
5de4cb8cd8 [Inductor UT] Generalize inductor UT for intel GPU (Part 3) (#135827)
[Inductor UT] Reuse Inductor test case for Intel GPU.
Reuse `test/inductor/test_compiled_autograd.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135827
Approved by: https://github.com/etaf, https://github.com/desertfire
2024-09-14 01:43:05 +00:00
06bc717410 Fix sum() forward for NJT (#131945)
This PR solves two problems with `sum()` support in NJT:
* `sum()` over a dim with `keepdim=True` returns the wrong shape (i.e. it'll keep the wrong dim). This is a long-standing bug from way back in #112519.
* Historically, we've only supported `sum()` over a dim and not a full reduction. This PR adds the full reduction form (forward only, backward still fails).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131945
Approved by: https://github.com/davidberard98, https://github.com/jananisriram
2024-09-14 00:58:03 +00:00
081c4a966d [BE] Use squeeze/unsqueeze in im2col (#136006)
And move unsqueeze out of the dispatch, as it's dtype agnostic
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136006
Approved by: https://github.com/Skylion007, https://github.com/eqy
2024-09-14 00:35:37 +00:00
4237592b8f [Distributed] add pack-check method for float8_e4m3fn (#135961)
We check 8 x FP8 simultaneously, at size of 8 bytes.
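
One way to check eight packed values at once is sketched below, assuming the check in question looks for NaN. In `float8_e4m3fn` the NaN encodings are `0x7F` and `0xFF` (all exponent and mantissa bits set), so a byte is NaN exactly when its low 7 bits equal `0x7F`. The function names and the bit trick are illustrative assumptions, not the kernel added by this PR.

```python
LOW7 = 0x7F7F7F7F7F7F7F7F  # clears the sign bit of every byte
ONES = 0x0101010101010101
HIGH = 0x8080808080808080


def any_nan_e4m3fn(packed: int) -> bool:
    """Return True if any of the 8 bytes in `packed` encodes a float8_e4m3fn NaN."""
    magnitude = packed & LOW7
    # Each byte is at most 0x7F, so adding 1 cannot carry into the next byte;
    # the high bit ends up set only where the byte was exactly 0x7F.
    return ((magnitude + ONES) & HIGH) != 0


def pack8(values: bytes) -> int:
    """Pack exactly 8 raw FP8 bytes into one 64-bit integer (little endian)."""
    if len(values) != 8:
        raise ValueError("expected exactly 8 bytes")
    return int.from_bytes(values, "little")
```

For instance, `any_nan_e4m3fn(pack8(bytes([0x00, 0x38, 0x7E, 0x80, 0x01, 0x40, 0x7F, 0x10])))` is true because of the single `0x7F` byte, while replacing that byte with `0x7E` (the largest finite value) makes it false.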

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135961
Approved by: https://github.com/yifuwang, https://github.com/Skylion007
ghstack dependencies: #135891
2024-09-14 00:32:27 +00:00
a00faf4408 [3.13] fix 3.13 pickle error in serialization.py (#136034)
Error encountered when adding dynamo 3.13 support.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136034
Approved by: https://github.com/albanD
2024-09-14 00:02:40 +00:00
b608ff3bea [Easy] Dont match to mm_plus_mm if not in max autotune (#135929)
It's only an optimization when we tune the triton template.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135929
Approved by: https://github.com/FindHao
2024-09-13 23:38:02 +00:00
b8eef500a6 Fix attr check for quantization spec (#135736)
Summary:
Previously we only checked dtype and is_dynamic to decide whether two quantization specs are equivalent.
This may not work in some cases, e.g. when people use a different qscheme or quant_min/quant_max.

This PR added checks for other fields as well
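
The difference between the two comparisons can be sketched with a small dataclass. `SpecSketch` and `equal_by_dtype_and_dynamic` are hypothetical stand-ins that only borrow the field names mentioned above; they are not the real quantization spec class or the code changed here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SpecSketch:
    """Hypothetical stand-in for a quantization spec."""
    dtype: str
    is_dynamic: bool = False
    quant_min: Optional[int] = None
    quant_max: Optional[int] = None
    qscheme: Optional[str] = None


def equal_by_dtype_and_dynamic(a: SpecSketch, b: SpecSketch) -> bool:
    # The narrower check: only two fields are compared.
    return a.dtype == b.dtype and a.is_dynamic == b.is_dynamic


a = SpecSketch("int8", False, -128, 127, "per_tensor_affine")
b = SpecSketch("int8", False, -127, 127, "per_tensor_affine")

narrow = equal_by_dtype_and_dynamic(a, b)  # True: quant_min is ignored
full = a == b                              # False: dataclass equality compares every field
```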

Test Plan:
regression tests

Differential Revision: [D62530974](https://our.internmc.facebook.com/intern/diff/D62530974)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135736
Approved by: https://github.com/sxu
2024-09-13 23:01:22 +00:00
aad556a0b5 [PT2][Inductor][Optimus] Fix a corner case in remove_split_with_size_one (#135962)
Summary: see context in https://fb.workplace.com/groups/1075192433118967/permalink/1501768230461383/

Test Plan:
# local reproduce
```
CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode batch-split --model_type "mai" --flow_id 642153776
```
P1586356950

# e2e

before fix

f642153776

after fix

Differential Revision: D62625318

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135962
Approved by: https://github.com/jackiexu1992
2024-09-13 22:53:08 +00:00
3c5d44dda5 Cleanup unused runner variants (#136058)
Cleaning up unused runner variants, leaving behind only the few that are actually referenced by workflows

For more details see description in the PR that generated these code changes:
- https://github.com/pytorch/test-infra/pull/5665
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136058
Approved by: https://github.com/wdvr, https://github.com/malfet
2024-09-13 22:50:07 +00:00
e2d3af405f [ONNX] Remove logging apis from public (#133825)
Remove

- torch.onnx.enable_log
- torch.onnx.disable_log
- torch.onnx.set_log_stream
- torch.onnx.log

Because they are not meant for public consumption and have been marked for deprecation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133825
Approved by: https://github.com/titaiwangms
2024-09-13 22:19:52 +00:00
baff86dafb [MTIA tensor] allow shallow copy between CPU and MTIA tensors (#135871)
Reviewed By: egienvalue, hanzlfs

Differential Revision: D61662214

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135871
Approved by: https://github.com/egienvalue, https://github.com/nautsimon
2024-09-13 22:13:58 +00:00
db5e1b44d2 Fix inductor-micro-benchmark results upload (take 2) (#136052)
I had a brain freeze when I wrote the original fix.  The parameters were in the wrong order.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136052
Approved by: https://github.com/clee2000, https://github.com/kit1980, https://github.com/malfet
2024-09-13 22:05:10 +00:00
a30d5ba16c Fix bug in split-build workflows codegen (#136043)
By just deleting a few rogue lines left out in https://github.com/pytorch/pytorch/pull/135510
If a file in the workflows folder does not have a `.yml` extension, it will not be launched at all, will it?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136043
Approved by: https://github.com/kit1980, https://github.com/atalman
2024-09-13 21:29:06 +00:00
46935c8241 Reduce default iterations to 5 . (#135773)
Running all benchmarks takes around 15 mins right now; this is the data:
https://www.internalfb.com/phabricator/paste/view/P1583590240
The data looks mostly stable, and 5 iterations should be good, especially with our 1.5% threshold.
That said, the diff also adds a way to increase the number of iterations for a specific benchmark.

Results after the change:
https://www.internalfb.com/phabricator/paste/view/P1583618969
Time is down to half (7 mins).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135773
Approved by: https://github.com/ezyang, https://github.com/anijain2305
2024-09-13 21:16:38 +00:00
4f407c1884 Only measure compile time instruction count for sum_floordiv benchmark (#135785)
There was recent strange noise (+5%, -5%).
Using only compile time:
1) avoids GC time.
2) avoids other operations that are not what we are trying to measure, so noise is less likely.
```
collecting compile time instruction count for sum_floordiv_regression
compile time instruction count for iteration 0 is 8899290248
compile time instruction count for iteration 1 is 1188830489
compile time instruction count for iteration 2 is 1180579615
compile time instruction count for iteration 3 is 1176263131
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135785
Approved by: https://github.com/avikchaudhuri, https://github.com/anijain2305
2024-09-13 21:14:10 +00:00
2e461e54e8 Add gpu and gpu_dynamic versions of add_loop (#135809)
I am thinking maybe 3 iterations are enough for this one?
- I am keeping eager and inductor, since inductor takes 2X the eager time.
- Eager dynamic is 2X eager, so keeping this as well.
- Inductor has three tests (dynamic gpu, gpu and cpu).
I am unsure whether I am over-profiling here; happy to trim if anyone has suggestions.
```
collecting compile time instruction count for add_loop_eager
compile time instruction count for iteration 0 is 8213664211
compile time instruction count for iteration 1 is 2798628246
compile time instruction count for iteration 2 is 2796811362
compile time instruction count for iteration 3 is 2794438188
compile time instruction count for iteration 4 is 2794634117
collecting compile time instruction count for add_loop_eager_dynamic
compile time instruction count for iteration 0 is 5724108021
compile time instruction count for iteration 1 is 5499908609
compile time instruction count for iteration 2 is 5569101366
compile time instruction count for iteration 3 is 5493806364
compile time instruction count for iteration 4 is 5493169851
collecting compile time instruction count for add_loop_inductor
compile time instruction count for iteration 0 is 49789381222
compile time instruction count for iteration 1 is 25769347393
compile time instruction count for iteration 2 is 25772594322
compile time instruction count for iteration 3 is 25768695952
compile time instruction count for iteration 4 is 25768032314
collecting compile time instruction count for add_loop_inductor_gpu
compile time instruction count for iteration 0 is 23966942581
compile time instruction count for iteration 1 is 23771950919
compile time instruction count for iteration 2 is 23770784286
compile time instruction count for iteration 3 is 23780160875
compile time instruction count for iteration 4 is 23774634465
collecting compile time instruction count for add_loop_inductor_dynamic_gpu
compile time instruction count for iteration 0 is 41505055086
compile time instruction count for iteration 1 is 41293654089
compile time instruction count for iteration 2 is 41301016100
compile time instruction count for iteration 3 is 41306056207
compile time instruction count for iteration 4 is 41308171566
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135809
Approved by: https://github.com/ezyang, https://github.com/anijain2305
2024-09-13 20:42:31 +00:00
a3d827a28c Use python 3.11 for Large Wheel build (#136042)
Use Python 3.11 in nightly Large wheel builds. Required for Colab testing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136042
Approved by: https://github.com/kit1980, https://github.com/malfet

Co-authored-by: Sergii Dymchenko <kit1980@gmail.com>
2024-09-13 20:27:11 +00:00
4312794b92 [reland][export] fix re-export custom metadata (#135720)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/134778

The previous D62304294 broke some executorch tests. It has already been reverted.

In this diff, `_collect_param_buffer_metadata()` is modified so that when a `call_function` node is encountered and its input nodes include `get_attr`, we skip the fields that have been collected previously and only collect the rest of the fields. This prevents overwriting.
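
The "skip what is already collected" behavior can be sketched with plain dictionaries. `collect_missing_fields` and the example field names are hypothetical and only illustrate the idea of not overwriting earlier metadata; the real `_collect_param_buffer_metadata()` operates on graph nodes.

```python
def collect_missing_fields(collected: dict, candidate: dict) -> dict:
    """Return `collected` extended with fields from `candidate` that it lacks.

    Fields already present in `collected` keep their earlier values.
    """
    merged = dict(collected)
    for key, value in candidate.items():
        # setdefault only writes when the key has not been collected yet.
        merged.setdefault(key, value)
    return merged


earlier = {"custom": {"handle": 1}}
later = {"custom": {"handle": 2}, "stack_trace": "frame info"}
merged = collect_missing_fields(earlier, later)
```

Here `merged["custom"]` keeps the earlier value while `stack_trace` is filled in from the later node.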

Test Plan:
```
buck2 test 'fbcode//mode/dev-nosan' fbcode//executorch/backends/xnnpack/test:test_xnnpack_ops

buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_re_export_preserve_handle

buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_run_decompositions_preserve_handle
```

Differential Revision: D62514208

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135720
Approved by: https://github.com/zhxchen17, https://github.com/jerryzh168
2024-09-13 20:15:15 +00:00
b856f3539b Fix script name in the comments (#135507)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135507
Approved by: https://github.com/atalman
2024-09-13 19:59:47 +00:00
835e7bb077 fix requirements.txt installation failure issue on Windows (#134567)
Fixes #134564

Root cause:

The `lintrunner` wheel released on [pypi.org](https://pypi.org/project/lintrunner/#files) only supports Windows 32bit and Linux 64bit. Since compilation of pytorch requires a 64bit env, on Windows `lintrunner` has to be compiled from the source distribution. `Rust` is a dependency for this compilation, as indicated in the error message. In addition, a Visual Studio environment is needed for linking libraries.

![image](https://github.com/user-attachments/assets/180cd899-8886-43b5-b42f-031f41e81683)

Issue when performing `pip install lintrunner` without a Visual Studio environment activated is shown below.

```bash
>python -m pip install lintrunner
Collecting lintrunner
  Downloading lintrunner-0.12.5.tar.gz (62 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: lintrunner
  Building wheel for lintrunner (pyproject.toml) ... error
  error: subprocess-exited-with-error

  × Building wheel for lintrunner (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [137 lines of output]
      Running `maturin pep517 build-wheel -i C:\Users\\miniforge3\envs\py310\python.exe --compatibility off`
      📡 Using build options bindings from pyproject.toml
         Compiling proc-macro2 v1.0.79
         Compiling unicode-ident v1.0.12
         Compiling version_check v0.9.4
         Compiling windows_x86_64_msvc v0.52.4
         Compiling winapi v0.3.9
         Compiling serde v1.0.197
         Compiling autocfg v1.2.0
         Compiling syn v1.0.109
         Compiling lazy_static v1.4.0
         Compiling libc v0.2.153
         Compiling equivalent v1.0.1
         Compiling hashbrown v0.14.3
         Compiling memchr v2.7.2
         Compiling yansi v1.0.1
         Compiling unicode-width v0.1.11
         Compiling regex-syntax v0.8.3
         Compiling encode_unicode v0.3.6
         Compiling cfg-if v1.0.0
         Compiling winnow v0.6.5
         Compiling cc v1.0.92
      error: could not compile `windows_x86_64_msvc` (build script) due to 2 previous errors
      warning: build failed, waiting for other jobs to finish...
      error: could not compile `serde` (build script) due to 2 previous errors
      error: could not compile `proc-macro2` (build script) due to 2 previous errors
      error: could not compile `syn` (build script) due to 2 previous errors
      error: could not compile `libc` (build script) due to 2 previous errors
      error: could not compile `winapi` (build script) due to 2 previous errors
      💥 maturin failed
        Caused by: Failed to build a native library through cargo
        Caused by: Cargo build finished with "exit code: 101": `cargo rustc --manifest-path Cargo.toml --message-format json --release --bins --`
      📦 Including license file "LICENSE"
      🔗 Found bin bindings
      error: linker `link.exe` not found
        |
        = note: program not found

      note: the msvc targets depend on the msvc linker but `link.exe` was not found

      note: please ensure that Visual Studio 2017 or later, or Build Tools for Visual Studio were installed with the Visual C++ option.

      note: VS Code is a different product, and is not sufficient.

      error: aborting due to 1 previous error

      error: linker `link.exe` not found
        |
        = note: program not found

      note: the msvc targets depend on the msvc linker but `link.exe` was not found

      note: please ensure that Visual Studio 2017 or later, or Build Tools for Visual Studio were installed with the Visual C++ option.

      note: VS Code is a different product, and is not sufficient.

      error: aborting due to 1 previous error

      error: linker `link.exe` not found
        |
        = note: program not found

      note: the msvc targets depend on the msvc linker but `link.exe` was not found

      note: please ensure that Visual Studio 2017 or later, or Build Tools for Visual Studio were installed with the Visual C++ option.

      note: VS Code is a different product, and is not sufficient.

      error: aborting due to 1 previous error

      error: linker `link.exe` not found
        |
        = note: program not found

      note: the msvc targets depend on the msvc linker but `link.exe` was not found

      note: please ensure that Visual Studio 2017 or later, or Build Tools for Visual Studio were installed with the Visual C++ option.

      note: VS Code is a different product, and is not sufficient.

      error: aborting due to 1 previous error

      error: linker `link.exe` not found
        |
        = note: program not found

      note: the msvc targets depend on the msvc linker but `link.exe` was not found

      note: please ensure that Visual Studio 2017 or later, or Build Tools for Visual Studio were installed with the Visual C++ option.

      note: VS Code is a different product, and is not sufficient.

      error: aborting due to 1 previous error

      error: linker `link.exe` not found
        |
        = note: program not found

      note: the msvc targets depend on the msvc linker but `link.exe` was not found

      note: please ensure that Visual Studio 2017 or later, or Build Tools for Visual Studio were installed with the Visual C++ option.

      note: VS Code is a different product, and is not sufficient.

      error: aborting due to 1 previous error

      Error: command ['maturin', 'pep517', 'build-wheel', '-i', 'C:\\Users\\\\miniforge3\\envs\\py310\\python.exe', '--compatibility', 'off'] returned non-zero exit status 1
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for lintrunner
Failed to build lintrunner
ERROR: ERROR: Failed to build installable wheels for some pyproject.toml based projects (lintrunner)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134567
Approved by: https://github.com/malfet
2024-09-13 18:43:55 +00:00
b6d6aa49b8 Revert "Validate input types for torch.nn.Linear and torch.nn.Bilinear (#135596)"
This reverts commit e157ce3ebbb3f30d008c15914e82eb74217562f0.

Reverted https://github.com/pytorch/pytorch/pull/135596 on behalf of https://github.com/malfet due to It's too restrictive, should allow other int-like types, such as `numpy.int64` ([comment](https://github.com/pytorch/pytorch/pull/135596#issuecomment-2349714104))
2024-09-13 18:06:56 +00:00
deee21cb78 Revert "[Inductor] Rename cpp_wrapper_cuda.py as cpp_wrapper_gpu.py (#135313)"
This reverts commit 16b37b309f64ddd4e498c57a99191e1d9b3dfdac.

Reverted https://github.com/pytorch/pytorch/pull/135313 on behalf of https://github.com/izaitsevfb due to breaks internal builds ([comment](https://github.com/pytorch/pytorch/pull/135313#issuecomment-2349662091))
2024-09-13 17:53:21 +00:00
3f69410976 [gpu-profiler] Expose active and repeat in os env var (#135757)
Summary: https://fb.workplace.com/groups/ai.efficiency.tools.users/permalink/1855136444971825/

Test Plan:
`buck2 test mode/opt caffe2/test:profiler -- -r test_kineto_profiler_api `

eyes

Differential Revision: D62529249

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135757
Approved by: https://github.com/Yuzhen11
2024-09-13 17:48:27 +00:00
18f9331e5d Revert "[aoti] Fix workspace generation for triton (#135552)"
This reverts commit d3833253928f29ed760b2dccac2b730028a868ca.

Reverted https://github.com/pytorch/pytorch/pull/135552 on behalf of https://github.com/izaitsevfb due to blocks revert of #135313, internal failures, see D62511427 ([comment](https://github.com/pytorch/pytorch/pull/135552#issuecomment-2349641372))
2024-09-13 17:47:36 +00:00
bc0f330169 [trymerge] Manually close merged PR when Github fails (#135890)
Manually close merged PR when Github fails to do it.

Consequences of current design:
Sleeping for 1 min uses up the machine, might result in race conditions, results in the merging label being removed a bit later, and the PR is still left open if this API fails too (i.e. there is no async clean up job)

Tested in https://github.com/malfet/deleteme/pull/92 by removing the part of the commit message that has "resolved #pr num"
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135890
Approved by: https://github.com/malfet, https://github.com/huydhn
2024-09-13 17:29:24 +00:00
7834c0bb2c [AOTI][Tooling] Add stats summary (mean/min/max, etc) for jit inductor tensor value printing (#135887)
Summary:
As title. Follow up to add stats summary (mean/min/max, etc) for jit inductor tensor value printing as well.

The inductor python wrapper code level printing would look something like this:

 {F1859224287}

Test Plan: CI

Reviewed By: chenyang78

Differential Revision: D62415575

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135887
Approved by: https://github.com/chenyang78
2024-09-13 17:19:25 +00:00
6ef49fe8f1 Revert "Pass ideep:lowp_kind to matmul_forward::compute on cache misses (#135058)"
This reverts commit 3d2431380999252d5401f83d5010b398a32e7597.

Reverted https://github.com/pytorch/pytorch/pull/135058 on behalf of https://github.com/malfet due to It regresses x86 performance ([comment](https://github.com/pytorch/pytorch/pull/135058#issuecomment-2349480861))
2024-09-13 17:09:45 +00:00
a15774563b [ROCm] Enable ROCm support for inductor's dynamic_rblock_scaling (#129663)
As of ROCm 6.1 [hipDeviceProp_t::regsPerMultiprocessor](https://rocm.docs.amd.com/projects/HIP/en/latest/doxygen/html/structhip_device_prop__t.html#a7390d5b180d63978c81aa971060270b4) is now available allowing us to enable this attribute on ROCm.
```
>>> torch.cuda.get_device_properties(0)
_CudaDeviceProperties(name='AMD Instinct MI250X/MI250', major=9, minor=0, gcnArchName='gfx90a:sramecc+:xnack-', total_memory=65520MB, multi_processor_count=104)
>>> torch.cuda.get_device_properties(0).regs_per_multiprocessor
65536
```

With https://github.com/triton-lang/triton/pull/3962 we can extract n_regs and n_spills from a triton binary with the AMD backend, allowing us to enable inductor's dynamic_rblock_scaling on ROCm, initially implemented in https://github.com/pytorch/pytorch/pull/115094

Leaving this in draft until following PRs have landed:
- https://github.com/pytorch/pytorch/pull/129361 to bump the triton commit pin
- https://github.com/pytorch/pytorch/pull/128449 to allow us to grab warp_size from device properties instead of hard coding 64 on ROCm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129663
Approved by: https://github.com/jansel, https://github.com/shunting314
2024-09-13 16:45:39 +00:00
564d00f364 Revert "Fix clang-tidy warnings in Caffe2 code (#134935)"
This reverts commit 7cfd23636c8fa6fcbb8bf3ea34e15b847ec9ad9d.

Reverted https://github.com/pytorch/pytorch/pull/134935 on behalf of https://github.com/izaitsevfb due to breaks internal builds, caffe2 is still used internally ([comment](https://github.com/pytorch/pytorch/pull/134935#issuecomment-2349368152))
2024-09-13 16:42:37 +00:00
ae02d663cd [FlexAttention] Fix output layout (#135882)
We previously only supported identical v_head and qk_head dims. When support for different head dims was added, I accidentally kept the query strides for the output. This PR fixes that bug and also ensures that we always produce output in the same stride order as the input query.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135882
Approved by: https://github.com/yanboliang, https://github.com/Chillee
2024-09-13 16:36:05 +00:00
ad2f0e9f81 Add remote cache time saved to compilation metrics (#135490)
Summary:
Record remote cache time saved via frame_phase_timing

We add to the "phase" when remote cache hits and saves us time, so that we have a 1:1 correspondence between a frame and time saved.

Test Plan:
Internally run benchmark, see that it's populated in sandbox table after previous diff lands and logger config is actualized.

Show that column exists in table:

https://fburl.com/scuba/logger_staging_jjwu_30582a48f1ff9cf5f4ac50a4c40af/fp2te0ff

Note that an earlier version of D62105258 had the column as a string, so the staging table is a bit messed up. But you can see the most recent samples have the column populated as a float.

Reviewed By: aorenste

Differential Revision: D62106921

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135490
Approved by: https://github.com/aorenste
2024-09-13 16:35:51 +00:00
21ffa18ad1 Fix "expand: SymIntArrayRef expected to contain only concrete integers" in AOTInductor (#135933)
Internal xref:
https://fb.workplace.com/groups/1075192433118967/permalink/1501860707118802/

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135933
Approved by: https://github.com/angelayi
2024-09-13 15:23:42 +00:00
2519e5a8de [CUDA][FP8] Skip rowwise scaling test on sm89 (#135718)
Same reason as https://github.com/pytorch/pytorch/pull/133612: the rowwise scaling implementation is sm90+ specific (e.g., it uses TMA)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135718
Approved by: https://github.com/Skylion007
2024-09-13 15:07:20 +00:00
ba6e0f31ab Remove cycle dependency by localizing the import. (#135926)
Summary:
Since https://www.internalfb.com/diff/D62215095 landed, there have been many silent errors due to the dependency between functional_tensor and config.

```
 File "/tmp/torch_deploy_zip5YRJC1/torch_python_modules.zip/torch/export/__init__.py", line 64, in <module>
  File "/tmp/torch_deploy_zip5YRJC1/torch_python_modules.zip/torch/export/dynamic_shapes.py", line 23, in <module>
  File "/tmp/torch_deploy_zip5YRJC1/torch_python_modules.zip/torch/export/exported_program.py", line 26, in <module>
  File "/tmp/torch_deploy_zip5YRJC1/torch_python_modules.zip/torch/_higher_order_ops/__init__.py", line 1, in <module>
  File "/tmp/torch_deploy_zip5YRJC1/torch_python_modules.zip/torch/_higher_order_ops/cond.py", line 6, in <module>
  File "/tmp/torch_deploy_zip5YRJC1/torch_python_modules.zip/torch/_subclasses/functional_tensor.py", line 9, in <module>
  File "/tmp/torch_deploy_zip5YRJC1/torch_python_modules.zip/torch/_inductor/config.py", line 44, in <module>
```

https://fburl.com/logarithm/ol5kx0ee
The log above complains about a cyclic dependency.

This change fixes it by making the import local.
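
As a rough illustration of the general technique (not the actual change), the sketch below builds two throwaway modules whose names, `demo_config` and `demo_tensor`, are hypothetical stand-ins. `demo_tensor` imports `demo_config` inside the function that needs it, so neither module depends on the other being fully initialized at import time. With a module-level `from demo_config import VALUE` in `demo_tensor` instead, importing `demo_config` first would hit a partially initialized module.

```python
import importlib
import sys
import tempfile
import textwrap
from pathlib import Path

# Hypothetical modules standing in for the two sides of the cycle.
CONFIG_SRC = """
from demo_tensor import read_config_value  # config depends on tensor

VALUE = 42
"""

TENSOR_SRC = """
def read_config_value():
    # Localized import: resolved at call time, after both modules
    # have finished initializing.
    import demo_config
    return demo_config.VALUE
"""


def run_demo():
    """Write both modules to a temp dir, import them, and call through the cycle."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "demo_config.py").write_text(textwrap.dedent(CONFIG_SRC))
        Path(tmp, "demo_tensor.py").write_text(textwrap.dedent(TENSOR_SRC))
        sys.path.insert(0, tmp)
        try:
            tensor = importlib.import_module("demo_tensor")
            return tensor.read_config_value()
        finally:
            # Clean up so the demo leaves no trace in the interpreter.
            sys.path.remove(tmp)
            sys.modules.pop("demo_config", None)
            sys.modules.pop("demo_tensor", None)
```

The trade-off is that the import cost moves to the first call of the function rather than module load.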

Test Plan: buck test multipy/runtime:test_deploy_embedded_cuda_interp_without_cuda_available -- --run-disabled TorchpyTest.AcquireMultipleSessionsInDifferentPackages

Reviewed By: aorenste

Differential Revision: D62616765

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135926
Approved by: https://github.com/aorenste, https://github.com/oulgen, https://github.com/Skylion007
2024-09-13 15:05:41 +00:00
7ed0563cad Revert "[Dynamo] Use custom backend to reenter metadata tf mode when tracing while/cond (#134732)"
This reverts commit e504fb70693d4a3741c3380b6a989d441e84f737.

Reverted https://github.com/pytorch/pytorch/pull/134732 on behalf of https://github.com/albanD due to Broke tests on main ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2348886378))
2024-09-13 12:52:58 +00:00
eb7dd91dd1 Revert "[Dynamo] Trace torch function modes entered outside of torch.compile (#133137)"
This reverts commit fafdd588f27e1d56090c6d260d0382c255eaf9eb.

Reverted https://github.com/pytorch/pytorch/pull/133137 on behalf of https://github.com/albanD due to Broke tests on main ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2348886378))
2024-09-13 12:52:58 +00:00
3f30360d05 Revert "[Dynamo] Support thread local setattr (#135443)"
This reverts commit 30b007bea329f512af3dc4fd4e6c7d145e807b71.

Reverted https://github.com/pytorch/pytorch/pull/135443 on behalf of https://github.com/albanD due to Broke tests on main ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2348886378))
2024-09-13 12:52:58 +00:00
4734e356d6 Revert "[Dynamo] Simplify torch function mode stack guard (#135444)"
This reverts commit 0c080cb2c78a85a5320fbeadbbb9a2cc640fd89d.

Reverted https://github.com/pytorch/pytorch/pull/135444 on behalf of https://github.com/albanD due to Broke tests on main ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2348886378))
2024-09-13 12:52:57 +00:00
ac169795a9 Revert "[Dynamo] Trace enter/exit of TorchFunctionModes (#135422)"
This reverts commit 2af3b8ffd84e36b91279174e9106f84b2d2a11f2.

Reverted https://github.com/pytorch/pytorch/pull/135422 on behalf of https://github.com/albanD due to Broke tests on main ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2348886378))
2024-09-13 12:52:57 +00:00
fca58bfda1 Revert "[Dynamo] Remove ignored modes workaround (#135502)"
This reverts commit 7d5e0dd4b1a8d20fc8624b3085a6f5ddedd89a2e.

Reverted https://github.com/pytorch/pytorch/pull/135502 on behalf of https://github.com/albanD due to Broke tests on main ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2348886378))
2024-09-13 12:52:57 +00:00
dc71e7a7d4 Revert "[Dynamo] Remove ignored modes from torch function mode stack guard (#135503)"
This reverts commit c56728b643e2b7d796abd7ec45803319e1c5967d.

Reverted https://github.com/pytorch/pytorch/pull/135503 on behalf of https://github.com/albanD due to Broke tests on main ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2348886378))
2024-09-13 12:52:57 +00:00
1cdf658f4a Revert "[PT2][inductor][Optimus] Add pad_aten_mm_pass pattern to resolve long computation kernel in LCE (#135167)"
This reverts commit eb0fe029337b31bcb3d4b2d1e539895393975d68.

Reverted https://github.com/pytorch/pytorch/pull/135167 on behalf of https://github.com/jithunnair-amd due to Broke ROCm CI eg. https://github.com/pytorch/pytorch/actions/runs/10845542664/job/30097957154 ([comment](https://github.com/pytorch/pytorch/pull/135167#issuecomment-2348847595))
2024-09-13 12:35:05 +00:00
b5c52e96e8 Revert "[dynamo] Fix support for classmethod(property(...)) (#134968)"
This reverts commit bf68e16e94fc05f10d434cdc162a14d02c6ad23c.

Reverted https://github.com/pytorch/pytorch/pull/134968 on behalf of https://github.com/jithunnair-amd due to Broke ROCm CI: eg. https://github.com/pytorch/pytorch/actions/runs/10845542664/job/30097956613 ([comment](https://github.com/pytorch/pytorch/pull/134968#issuecomment-2348837553))
2024-09-13 12:29:03 +00:00
ea2ecab15b [AOTI][reland] Fix assert_function call in cpu autotune template (#135920)
Summary: Reland https://github.com/pytorch/pytorch/pull/135086. In the ABI-compatible mode, assert_function should be AOTI_TORCH_CHECK.

Test Plan: CI

Differential Revision: D62500592

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135920
Approved by: https://github.com/chenyang78
2024-09-13 12:21:57 +00:00
2f53d570fe Update document for autocast on CPU (#135299)
Update document for autocast on CPU due to the support of float16 and changes in the operator list.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135299
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/svekars
2024-09-13 09:11:47 +00:00
31007cf200 [Distributed] add FP8 support to NaN checker (#135891)
Adding support for `torch.float8_e4m3fn` and `torch.float8_e5m2`
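
For context, the two formats encode NaN differently, which is why each needs its own handling. The following is only a bit-level sketch in plain Python of what "is NaN" means for each 8-bit encoding; the function names are made up for illustration and this is not how the NaN checker itself is implemented.

```python
def is_nan_e4m3fn(byte):
    """float8_e4m3fn has no infinities; NaN is exponent and mantissa all ones."""
    return (byte & 0x7F) == 0x7F


def is_nan_e5m2(byte):
    """float8_e5m2 is IEEE-like: exponent all ones with a non-zero mantissa."""
    exponent_all_ones = (byte & 0x7C) == 0x7C
    mantissa_nonzero = (byte & 0x03) != 0
    return exponent_all_ones and mantissa_nonzero
```

Note that `0x7C` is infinity in e5m2 but an ordinary finite value in e4m3fn.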

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135891
Approved by: https://github.com/wconstab
2024-09-13 08:43:54 +00:00
c56728b643 [Dynamo] Remove ignored modes from torch function mode stack guard (#135503)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135503
Approved by: https://github.com/anijain2305
ghstack dependencies: #134732, #133137, #135443, #135444, #135422, #135502
2024-09-13 08:41:32 +00:00
7d5e0dd4b1 [Dynamo] Remove ignored modes workaround (#135502)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135502
Approved by: https://github.com/anijain2305
ghstack dependencies: #134732, #133137, #135443, #135444, #135422
2024-09-13 08:41:32 +00:00
2af3b8ffd8 [Dynamo] Trace enter/exit of TorchFunctionModes (#135422)
This PR implements tracing of `with` contexts for TorchFunction modes that have the default enter/exit behavior (i.e., pushing/popping the mode).

Typically the bytecode for a context manager looks like this during a graph break:
1. graph call
2. enter context
3. unsupported code
4. exit context
5. resume call

resume fn structure:
1. enter context
2. jump
...
3. exit context

The issue with torch function modes is that side effects will replay any mutations to the torch function stack performed during tracing. So, we do not need to enter and exit around the unsupported code in the original function (doing so would result in a duplicate torch function mode entry during execution of the unsupported code), and we don't need to enter again in the resume function (the mode that was pushed from the side effects bytecode would still be on the stack).

So for torch function modes the structure of our output code is this:

1. graph call
2. mutate tf mode stack to replay mutations
3. unsupported code
4. on exception restore stack
5. resume function

Then our resume fn looks like this:

1. no-op enter torch function mode
2. jump
3. exit tf mode

To implement the no-op enter of the torch function mode, I added a torch function mode context to polyfill whose enter is a no-op but whose exit behaves normally. This is needed because we still want to trace the with context in the resume function and exit properly (the exit instructions will still be in the function, so we need to generate instructions to set up the context).

Separately from the bytecode, dynamo also tracks contexts on the block stack, which is how the SETUP_* instructions are implemented. Naturally, at a graph break we exit these block stacks to properly reset the contexts entirely, so that we can re-enter around the unsupported code soundly. However, in the torch function mode case, in the event of a graph break we do not want to perform any exit side effects, because we want to preserve the state of the mode stack as is so that we will properly update the stack with the bytecode mentioned in the first section. If we exited here, dynamo would pop the mode off of the symbolic stack and not update the true python torch function mode stack with the suffix bytecode. All in all, for torch function modes we enter exactly once, update the global torch function mode stack with side effects bytecode, re-read this stack when compiling the resume function, and exit exactly once in the resume function. This matches the semantics of eager exactly.
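
The enter-once/exit-once structure described above can be modeled in plain Python. This is a toy sketch only: `MODE_STACK`, `noop_enter_mode`, `original_function`, and `resume_function` are hypothetical names, and no real bytecode or dynamo API is involved.

```python
from contextlib import contextmanager

# Hypothetical stand-in for the global torch function mode stack.
MODE_STACK = []


@contextmanager
def noop_enter_mode(name):
    """Enter does nothing (the mode is already on the stack); exit pops as usual."""
    assert MODE_STACK and MODE_STACK[-1] == name
    try:
        yield
    finally:
        MODE_STACK.pop()


def resume_function(log):
    # The resume function still contains the `with`, but entering is a no-op.
    with noop_enter_mode("mode"):
        log.append(("resume", list(MODE_STACK)))


def original_function(log):
    # Graph call omitted; replay the stack mutation recorded during tracing.
    MODE_STACK.append("mode")
    try:
        # Unsupported code runs with the mode pushed exactly once.
        log.append(("unsupported", list(MODE_STACK)))
    except Exception:
        MODE_STACK.pop()  # on exception, restore the stack
        raise
    resume_function(log)


def run():
    log = []
    original_function(log)
    return log, list(MODE_STACK)
```

In this model the mode appears on the stack once during both the unsupported code and the resumed code, and the stack is empty afterwards.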

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135422
Approved by: https://github.com/williamwen42
ghstack dependencies: #134732, #133137, #135443, #135444
2024-09-13 08:41:24 +00:00
0c080cb2c7 [Dynamo] Simplify torch function mode stack guard (#135444)
The semantics of ignored modes previously had edge cases; this eliminates them by, in essence, filtering any ignored modes out of both the ref stack and the current torch function mode stack. This is purely to reduce complexity in #135422. The ignored modes handling will be removed in a future PR after https://github.com/pytorch/pytorch/pull/135422 lands, since we will then trace through DeviceContexts instead of inserting them into the graph, which needed these extra workarounds for correctness.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135444
Approved by: https://github.com/anijain2305, https://github.com/williamwen42
ghstack dependencies: #134732, #133137, #135443
2024-09-13 08:41:17 +00:00
30b007bea3 [Dynamo] Support thread local setattr (#135443)
In preparation for tracing through DeviceContext (defb515306/torch/utils/_device.py (L66))
This PR adds support for calling the setattr of thread local objects. These objects have a slots impl, and since this doesn't appear to have any side effects, we call this setattr impl when replaying mutations, since calling `object.__setattr__` on these objects results in a type error.
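
A small sketch of the idea using only the standard library: when replaying an attribute mutation, dispatch to the thread-local type's own `__setattr__` rather than `object.__setattr__`. The helper name `replay_setattr` and the `Plain` class are hypothetical and only illustrate the dispatch, not dynamo's side-effect replay code.

```python
import threading


def replay_setattr(obj, name, value):
    """Replay a recorded attribute mutation on obj."""
    if isinstance(obj, threading.local):
        # Thread-local objects define their own C-level __setattr__,
        # so go through the type's implementation.
        type(obj).__setattr__(obj, name, value)
    else:
        object.__setattr__(obj, name, value)


class Plain:
    """Ordinary object used for comparison."""


tls = threading.local()
replay_setattr(tls, "device", "cuda")

plain = Plain()
replay_setattr(plain, "device", "cpu")
```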

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135443
Approved by: https://github.com/anijain2305
ghstack dependencies: #134732, #133137
2024-09-13 08:41:07 +00:00
fafdd588f2 [Dynamo] Trace torch function modes entered outside of torch.compile (#133137)
This PR adds initial tracing for torch function modes.

Details:
In essence, this adds tracing into the torch function of modes entered outside of the torch.compile call.
This does not yet support tracing enter/exit of a torch function mode, or tracing set_default_device properly using the new mode infra (this will be a very good stress test for modes). I am adding more PRs to this stack to support these. The overall plan is to support tracing enter/exit and to handle graph breaks like we do for other torch.* context managers.

Previously landed:
https://github.com/pytorch/pytorch/pull/133135
https://github.com/pytorch/pytorch/pull/133136
https://github.com/pytorch/pytorch/pull/133134
https://github.com/pytorch/pytorch/pull/133133
https://github.com/pytorch/pytorch/pull/133132
https://github.com/pytorch/pytorch/pull/133131
https://github.com/pytorch/pytorch/pull/133729
https://github.com/pytorch/pytorch/pull/133130

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133137
Approved by: https://github.com/jansel, https://github.com/zou3519
ghstack dependencies: #134732
2024-09-13 08:41:00 +00:00
e504fb7069 [Dynamo] Use custom backend to reenter metadata tf mode when tracing while/cond (#134732)
For tracing cond/while in eager, we trace the HOP with the eager backend with metadata torchfunction mode enabled. HOPs disallow the mutation that occurs in this torch function mode, so it is not able to be traced. As a result, we use a custom backend which enters this mode for tracing these HOPs. Thanks to @ydwu4 for the help with implementing this

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134732
Approved by: https://github.com/ydwu4
2024-09-13 08:40:50 +00:00
b346e99376 remove fast_flush arguments (#135387)
I've removed them from upstream Triton in https://github.com/triton-lang/triton/pull/4485. It looks like most places in the code use the default value of `fast_flush=True` anyway, though there are two PRs from @pearu that use `False`. To my knowledge, there's no reason to use the `False` value.

Differential Revision: [D62325778](https://our.internmc.facebook.com/intern/diff/D62325778)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135387
Approved by: https://github.com/nmacchioni, https://github.com/jansel
2024-09-13 08:13:46 +00:00
7dc1788396 [inductor] Remove the batch fusion passes from being a default (#135922)
The Ads team runs an internal search to figure out which fusion passes to use.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135922
Approved by: https://github.com/eellison, https://github.com/yanboliang
ghstack dependencies: #135819
2024-09-13 06:07:33 +00:00
9fd54d787d [Inductor UT] Generalize device-bias code in test_triton_kernels.py introduced in #135530 (#135656)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135656
Approved by: https://github.com/EikanWang, https://github.com/zou3519
2024-09-13 05:27:56 +00:00
b38be727eb [Inductor UT] Generalize inductor UT for intel GPU (Part 2) (#134556)
[Inductor UT] Reuse Inductor test case for Intel GPU.
Reuse `test/inductor/test_torchinductor_opinfo.py`
Reuse `test/inductor/test_minifier_isolate.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134556
Approved by: https://github.com/etaf, https://github.com/eellison
2024-09-13 05:16:28 +00:00
e54b559e88 [inductor] More fixes on the keys of constants and signature dictionaries (#135406)
The previous PR, https://github.com/pytorch/pytorch/pull/135170, forgot to change two other places that also create `constants` and `signature`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135406
Approved by: https://github.com/jansel
2024-09-13 04:10:41 +00:00
eea5e6ff0f [DCP][DSD] Add a test case to demonstrate the workaround to load full state dict into a 2D model (#135763)
Fix https://github.com/pytorch/pytorch/issues/134095

This is a workaround for loading a full state dict into an FSDP1+TP 2D model.
Since named_parameters() in FSDP1 does not return DTensor, we don't have the information to shard the full_state_dict and load it directly into the 2D model. In order to load a full state dict into an FSDP1+TP 2D model, we need to do:
- load the full state dict into a 1D FSDP model
- dcp.save the full/shard state dict into storage
- initialize a 2D FSDP1+TP model
- get the default sharded state dict for the 2D model (full_state_dict=False)
- dcp.load the state dict from storage
- load the state dict into the 2D model
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135763
Approved by: https://github.com/fegin
ghstack dependencies: #135725
2024-09-13 03:51:14 +00:00
6df91b5917 real tensor prop for composite ops (#135717)
Fixes #135632

Adds real tensor propagation for decompositions, checking any symbols on their outputs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135717
Approved by: https://github.com/ezyang
2024-09-13 03:35:16 +00:00
0cdc6a8dcd [DSD] Fix distributed state dict full_state_dict option hang during set_state_dict (#135725)
Fix https://github.com/pytorch/pytorch/issues/134095
This fixes the hang in the distributed state dict full_state_dict option during set_state_dict. We switch `_distribute_tensors` in _state_dict_utils.py to use `DTensor.from_local` instead of `distribute_tensor` to support the FSDP2+TP 2D strided sharding use case, as `distribute_tensor` cannot handle strided sharding yet. `distribute_tensor` incurs a scatter behind the scenes, while `DTensor.from_local` takes the local slice from the full tensor on each rank to create the DTensor (no collective). This means it's the user's responsibility to make sure the full_tensor from the full_state_dict is the same across all ranks.
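
To make the "local slice, no collective" point concrete, here is a toy model in plain Python using lists instead of tensors. It assumes simple contiguous sharding along one dimension (not the strided case), and `local_shard` is a hypothetical helper, not a PyTorch API.

```python
def local_shard(full, rank, world_size):
    """Return this rank's contiguous chunk of `full`, computed without communication."""
    chunk = -(-len(full) // world_size)  # ceiling division
    return full[rank * chunk:(rank + 1) * chunk]


full_tensor = list(range(10))
shards = [local_shard(full_tensor, rank, 4) for rank in range(4)]
```

Each rank slices its own copy independently, which is why the copies must be identical on every rank for the shards to form a consistent whole.
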
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135725
Approved by: https://github.com/fegin
2024-09-13 03:26:36 +00:00
6cdc70bccd [ROCm] skip test_fp8_cast_and_t on non-MI300 machines (#135917)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135917
Approved by: https://github.com/malfet
2024-09-13 02:46:48 +00:00
e6b68359d7 Fix xpu memory stats error (#135818)
# Motivation
Fix https://github.com/pytorch/pytorch/issues/135726
When two free blocks were merged, the active memory size was decreased by the wrong amount. It should be decreased by the original block size, not the merged block size.
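
The accounting rule can be sketched with a toy allocator in plain Python. `Block` and `ToyAllocator` are hypothetical and far simpler than the real XPU caching allocator; the point is only that the size is captured before merging and that this captured size is what gets subtracted.

```python
class Block:
    def __init__(self, size, allocated):
        self.size = size
        self.allocated = allocated


class ToyAllocator:
    """Toy model: blocks are adjacent in address order; tracks active bytes."""

    def __init__(self, sizes):
        self.blocks = [Block(size, True) for size in sizes]
        self.active_bytes = sum(sizes)

    def free(self, index):
        block = self.blocks[index]
        original_size = block.size  # capture before any merge grows the block
        block.allocated = False
        # Merge with a free right neighbor.
        if index + 1 < len(self.blocks) and not self.blocks[index + 1].allocated:
            block.size += self.blocks.pop(index + 1).size
        # Merge into a free left neighbor.
        if index > 0 and not self.blocks[index - 1].allocated:
            self.blocks[index - 1].size += block.size
            self.blocks.pop(index)
        # Subtract the original size, not the merged size.
        self.active_bytes -= original_size


allocator = ToyAllocator([256, 512, 1024])
allocator.free(0)
allocator.free(1)
```

Subtracting the merged size (768 in the second `free`) would undercount the active bytes by the size of the neighbor that was already free.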

# Additional Context
Add a UT to guard this scenario.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135818
Approved by: https://github.com/EikanWang
2024-09-13 02:41:21 +00:00
1c04cbfba6 [BE] Use C10_UNUSED (#135914)
Instead of `(void)foo; // Suppress unused variable`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135914
Approved by: https://github.com/huydhn, https://github.com/eqy
2024-09-13 02:27:07 +00:00
062681a0ed [Profiler] Torch Profiler distributed info is not JSON serializable (#135548)
Summary: To fix https://github.com/pytorch/pytorch/issues/133308 we must create an encoder for numpy values so we can serialize the distributed metadata to JSON.
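
A minimal sketch of such an encoder, assuming duck typing on `tolist()` (which numpy scalars and arrays both provide) so the example runs without numpy installed. The class name `NumpyFriendlyEncoder` is hypothetical and is not the encoder added in this PR.

```python
import json


class NumpyFriendlyEncoder(json.JSONEncoder):
    """Convert numpy-like values to builtin Python types before serializing."""

    def default(self, o):
        # numpy scalars and arrays expose tolist(), which returns plain
        # Python ints, floats, or (nested) lists.
        tolist = getattr(o, "tolist", None)
        if callable(tolist):
            return tolist()
        # Anything else falls back to the standard TypeError.
        return super().default(o)
```

Usage would look like `json.dumps(metadata, cls=NumpyFriendlyEncoder)`, where `metadata` may contain values such as `np.int64(3)`.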

Test Plan: Added unit test to check that numpy values can be serialized

Differential Revision: D62411619

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135548
Approved by: https://github.com/aaronenyeshi, https://github.com/albanD
2024-09-13 02:22:33 +00:00
8c356ce3da Fix lint errors in fbcode (#135614)
Summary: Fixed a bunch of fbcode imports that happened to work but confused autodeps.  After this autodeps still suggests "improvements" to TARGETS (which breaks our builds) but at least it can find all the imports.

Test Plan:
```
fbpython fbcode/tools/build/buck/linters/lint_autoformat.py --linter=autodeps --default-exec-timeout=1800 -- fbcode/caffe2/TARGETS fbcode/caffe2/test/TARGETS
```
Before:
```
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/testing.py:229) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https://fbur$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_export.py:87) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https://fburl$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/test_serdes.py:9) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https://fb$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_serdes.py:10) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https://fburl$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_retraceability.py:7) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https:$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/test_retraceability.py:6) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See ht$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_export_nonstrict.py:7) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See http$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/test_export_nonstrict.py:6) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See $
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/test_export_training_ir_to_run_decomp.py:8) when processing rule "test_export". Please make sure it's listed in the srcs parameter of an$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_export_training_ir_to_run_decomp.py:10) when processing rule "test_export". Please make sure it's listed in the srcs parameter of anoth$
ERROR while processing caffe2/test/TARGETS: Found "//python/typeshed_internal:typeshed_internal_library" owner for "cv2" but it is protected by visibility rules: [] (from caffe2/test/test_bundled_images.py:7) when processing rule "test_bundled_$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "caffe2.test.profiler_test_cpp_thread_lib" (from caffe2/test/profiler/test_cpp_thread.py:29) when processing rule "profiler_test_cpp_thread". Please make sure it's listed in t$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "torch._utils_internal.get_file_path_2" (from caffe2/test/test_custom_ops.py:23) when processing rule "custom_ops". Please make sure it's listed in the srcs parameter of anoth$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "torch._utils_internal.get_file_path_2" (from caffe2/test/test_public_bindings.py:13) when processing rule "public_bindings". Please make sure it's listed in the srcs paramete$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "torch._C._profiler.symbolize_tracebacks" (from caffe2/test/test_cuda.py:3348) when processing rule "test_cuda". Please make sure it's listed in the srcs parameter of another $
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "torch._C._profiler.gather_traceback" (from caffe2/test/test_cuda.py:3348) when processing rule "test_cuda". Please make sure it's listed in the srcs parameter of another rule$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for include <torch/csrc/autograd/profiler_kineto.h> (from caffe2/test/profiler/test_cpp_thread.cpp:2) when processing profiler_test_cpp_thread_lib.  Some things to try:
```

Differential Revision: D62049222

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135614
Approved by: https://github.com/oulgen, https://github.com/laithsakka
2024-09-13 02:04:34 +00:00
bf68e16e94 [dynamo] Fix support for classmethod(property(...)) (#134968)
Fixes #134451

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134968
Approved by: https://github.com/yanboliang
2024-09-13 01:14:18 +00:00
d732df7e56 [Inductor] Disable TF32 in test_slice_scatter_reinplace (#135709)
TF32 linear/matmul numerics seem unrelated to test functionality so disabling it here to abate noisy failures

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135709
Approved by: https://github.com/eellison
2024-09-13 00:30:45 +00:00
c9de2efde6 [Docs] fix inconsistent docs in conv1d, conv2d, and conv3d (#135894)
Addresses https://github.com/pytorch/pytorch/issues/135880
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135894
Approved by: https://github.com/mikaylagawarecki, https://github.com/malfet
2024-09-13 00:19:42 +00:00
1f15c0c7a5 [fx] Replace _snake_case with a regexp (#135822)
~2x speedup on this function, though saves <0.5s overall
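
For illustration, a regexp-based snake-case conversion could look like the sketch below. The pattern and the name `snake_case` are assumptions for this example, not necessarily what the PR uses.

```python
import re

# Match an uppercase letter that directly follows a lowercase letter.
_BOUNDARY = re.compile(r"(?<=[a-z])([A-Z])")


def snake_case(name):
    """Insert "_" at lower-to-upper boundaries, then lowercase everything."""
    return _BOUNDARY.sub(r"_\1", name).lower()
```

A single compiled substitution replaces a character-by-character Python loop, which is where a speedup of this kind would come from.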

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135822
Approved by: https://github.com/oulgen
ghstack dependencies: #135787, #135788, #135820, #135821
2024-09-13 00:18:41 +00:00
a72124add9 [fx] Minor optimization in create_arg (#135821)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135821
Approved by: https://github.com/oulgen
ghstack dependencies: #135787, #135788, #135820
2024-09-13 00:18:41 +00:00
10ca4c0564 [inductor] Use TracerBase directly in LoopBody (#135820)
This skips some unneeded work in the subclass.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135820
Approved by: https://github.com/oulgen
ghstack dependencies: #135787, #135788
2024-09-13 00:18:41 +00:00
d3aab9642b [inductor] Optimize can_fuse_vertical() (#135788)
An O(n^2) to O(n) improvement by not comparing all pairs of deps.

Before:
![image](https://github.com/user-attachments/assets/797cd1bd-5d53-4374-8e76-ffce4232d7f9)

After:
![image](https://github.com/user-attachments/assets/1e61bf29-adba-41a4-839e-f028130fa979)
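
The general shape of this kind of improvement, hashing one side once instead of comparing all pairs, can be sketched as follows. The dependency tuples and function names are hypothetical and this is not the actual `can_fuse_vertical()` logic.

```python
def unmatched_reads_quadratic(reads, writes):
    """Compare every read against every write: O(n * m)."""
    return [r for r in reads if not any(r == w for w in writes)]


def unmatched_reads_linear(reads, writes):
    """Hash the writes once, then probe once per read: O(n + m) expected."""
    write_set = set(writes)
    return [r for r in reads if r not in write_set]


reads = [("buf0", 0), ("buf1", 1), ("buf2", 2)]
writes = [("buf1", 1), ("buf0", 0)]
```

Both functions return the same result; only the number of comparisons differs.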

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135788
Approved by: https://github.com/oulgen
ghstack dependencies: #135787
2024-09-13 00:18:41 +00:00
67a929eea8 [inductor] Remove unused check (#135787)
I think this is unreachable code because mode is always None on reads.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135787
Approved by: https://github.com/oulgen
2024-09-13 00:18:41 +00:00
f576960bbc do not expand in replace/simplify if no changes (#135863)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135863
Approved by: https://github.com/ezyang
2024-09-13 00:12:01 +00:00
1aba224cfd Update nightly PyTorch version to 2.6.0 (#135916)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135916
Approved by: https://github.com/kit1980
2024-09-13 00:08:52 +00:00
d383325392 [aoti] Fix workspace generation for triton (#135552)
Fixes #131337

- add `arg_type` for workspace_arg, the type is consistent with the type in `generate_workspace_allocation()`.
- do not generate example tensors for `workspace`, and use `generate_workspace_allocation()` instead.
- add workspace allocation generation code to `kernel_autotune_calls`. e.g.
```python
    workspace = empty_strided_cuda((1280, ), (1, ), torch.uint8)
    workspace.zero_()
    .....
    triton_spl_fused_add_cumprod_0.run(buf2, arg0_1, arg1_1, workspace, 1, 10000, grid=split_scan_grid(1, 10000), stream=stream0)
    del buf2, arg0_1, arg1_1, workspace
```
-  add `empty_strided_cuda = torch._C._dynamo.guards._empty_strided_cuda` to the header of triton autotune code.

The generated cpp has lines like below, so we also implement a `zero_()` for `AtenTensorHandle`.

```cpp
    static constexpr int64_t int_array_0[] = {1280L, };
    static constexpr int64_t int_array_1[] = {1L, };
    AtenTensorHandle workspace_handle;
    AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_empty_strided(1, int_array_0, int_array_1, cached_torch_dtype_uint8, cached_torch_device_type_cuda,  0, &workspace_handle));

        RAIIAtenTensorHandle workspace(workspace_handle);
        workspace.zero_();
```

- Fix handling of grid_fn for grid computation: pass "RBLOCK" to `split_scan_grid`.
- Fix dynamic shapes:
Without the fix, triton autotune generates code like `workspace = empty_strided_cuda((32*((255 + s0) // 256), ), (1, ), torch.uint8)`, where `s0` is not defined.

The solution is to use `V.graph.sizevars.size_hint(nbytes)` to realize the workspace size for triton autotune. Note that we only realize it for triton autotune code, but not for the cpp cuda code.

- We also generate slightly different cpp code depending on if `abi_compatible` is turned on.
```cpp
RAIIAtenTensorHandle workspace(workspace_handle);
AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_zero_(workspace.get()));
```
vs

```cpp
    at::Tensor workspace = at::detail::empty_strided_cuda({8L*(c10::div_floor_integer(static_cast<int64_t>((255L + s0)), static_cast<int64_t>(256L))), }, {1L, }, at::kByte, c10::DeviceType::CUDA);
    workspace.zero_();
```
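
For the dynamic-shapes fix described above, the effect of realizing the workspace size from a hint can be pictured in plain Python. This is only a sketch: `workspace_nbytes` and `hint_s0` are hypothetical names, not Inductor APIs, and the expression is the one quoted in the description.

```python
def workspace_nbytes(s0: int) -> int:
    """Evaluate the symbolic size 32*((255 + s0) // 256) for a concrete s0."""
    return 32 * ((255 + s0) // 256)


# Assumed hint for illustration, taken from the split_scan_grid(1, 10000) example.
hint_s0 = 10000
print(workspace_nbytes(hint_s0))  # 1280
```

With this hint the result is the same `1280` that appears in the autotune example, so the autotune code no longer needs `s0` to be defined.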

Test Plan:

```
TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCHINDUCTOR_CPP_WRAPPER=1  python test/inductor/test_torchinductor.py -k GPUTests.test_consecutive_split_cumprod_cuda
python test/inductor/test_cuda_cpp_wrapper.py TestCudaWrapper.test_consecutive_split_cumprod_cuda_cuda_wrapper
python test/inductor/test_cuda_cpp_wrapper.py DynamicShapesCudaWrapperCudaTests.test_consecutive_split_cumprod_cuda_dynamic_shapes_cuda_wrapper
TORCHINDUCTOR_ABI_COMPATIBLE=1 python test/inductor/test_cuda_cpp_wrapper.py TestCudaWrapper.test_consecutive_split_cumprod_cuda_cuda_wrapper
TORCHINDUCTOR_CPP_WRAPPER=1  python test/inductor/test_torchinductor.py -k GPUTests.test_consecutive_split_cumprod_cuda
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135552
Approved by: https://github.com/desertfire
2024-09-12 23:53:09 +00:00
00dc7d4356 fix compiled_autograd deadlock throw (#135795)
Fixes #135298

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135795
Approved by: https://github.com/xmfan
2024-09-12 23:24:57 +00:00
1760bbc259 [FlexAttention] Ensure q/k/v and block_mask on exactly the same device (#135823)
Fixes #134739

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135823
Approved by: https://github.com/BoyuanFeng
2024-09-12 23:11:01 +00:00
fb9d8e3248 [ROCm] Use ieee precision for fp32 in flex attention (#135702)
Commit 3bebc09be9 brought in a change to flex_attention to allow TF32 precision. TF32 largely lacks support on the ROCm side, so we should use ieee there.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135702
Approved by: https://github.com/jeffdaily, https://github.com/drisspg
2024-09-12 23:00:48 +00:00
aaabfc8930 [Easy] Check if quant registered in constant folding (#135875)
Belated fix for https://github.com/pytorch/pytorch/issues/110904

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135875
Approved by: https://github.com/shunting314
2024-09-12 22:16:39 +00:00
63d6cd351a [dynamo] support torch.nn.attention.sdpa_kernel context manager (#135404)
Fixes https://github.com/pytorch/pytorch/issues/134608

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135404
Approved by: https://github.com/jansel, https://github.com/drisspg
2024-09-12 22:04:48 +00:00
3de9e474df Revert "Check function declarations of Core ML code (#135467)"
This reverts commit bc1b8f094d24de27432f4c29f0729e85a6b5ba63.

Reverted https://github.com/pytorch/pytorch/pull/135467 on behalf of https://github.com/malfet due to This breaks ios periodic jobs, see https://github.com/pytorch/pytorch/actions/runs/10797026668/job/29947377532 ([comment](https://github.com/pytorch/pytorch/pull/135467#issuecomment-2347322784))
2024-09-12 22:04:35 +00:00
3e1a4ea132 Revert "[DSD] Fix distributed state dict full_state_dict option hang during set_state_dict (#135725)"
This reverts commit 83c594ebd6dfa517fdd67ae23929cc60d5fa325d.

Reverted https://github.com/pytorch/pytorch/pull/135725 on behalf of https://github.com/ZainRizvi due to This is breaking lint. See [GH job link](https://github.com/pytorch/pytorch/actions/runs/10835983999/job/30068709508) [HUD commit link](83c594ebd6) ([comment](https://github.com/pytorch/pytorch/pull/135725#issuecomment-2347303272))
2024-09-12 21:47:38 +00:00
e157ce3ebb Validate input types for torch.nn.Linear and torch.nn.Bilinear (#135596)
Adds validation checks for the input types and displays clearer error messages when they are invalid.
Fixes https://github.com/pytorch/pytorch/issues/135463

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135596
Approved by: https://github.com/malfet
2024-09-12 21:28:37 +00:00
b897ab0540 [export] ignore mark_dynamic() in export (#135536)
Previously we were accommodating `torch._dynamo.mark_dynamic()` for export's dynamic shapes. Here we clean things up and ignore it, requiring users to specify an export input for `dynamic_shapes`.

Note: there are 4 decorators relevant to export: `mark_dynamic`, `maybe_mark_dynamic`, `mark_static`, and `mark_unbacked`. User calls that involve export have only been `mark_dynamic()`, and we use `maybe_mark_dynamic` under the hood for `Dim.AUTO`, but we could start using the others. One reason I decided not to warn and to just silently ignore it is that these decorators cause the tensors to carry dynamic info, so it would be hard to tell whether the markers come from export or from user calls when re-exporting with the same inputs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135536
Approved by: https://github.com/avikchaudhuri
2024-09-12 21:22:19 +00:00
3d24313809 Pass ideep:lowp_kind to matmul_forward::compute on cache misses (#135058)
Optimized dynamic quantization for aarch64 was enabled by #126687 and #134897

This PR fixes an issue for aarch64 where on a [cache miss](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/quantized/cpu/qlinear_dynamic.cpp#L592) (e.g. if input dimensions change) [ideep::matmul_forward::compute ](https://github.com/intel/ideep/blob/pytorch-rls-v3.5.3-2/include/ideep/operators/matmul.hpp#L160) (wrongly) runs with the [default lowp_kind (u8s8)](https://github.com/intel/ideep/blob/pytorch-rls-v3.5.3-2/include/ideep/operators/matmul.hpp#L174) which is not supported by oneDNN+ACL (Arm Compute Library), causing the workload to fall back to a much slower oneDNN gemm:jit kernel

Example:
```python
import torch

DIM = 4096
INPUT_SIZE1 = 32
INPUT_SIZE2 = 16

class LinearNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = torch.nn.Linear(DIM, DIM, bias=False)

    def forward(self, x):
        x = self.fc1(x)
        return x

input1 = torch.randn(size=(INPUT_SIZE1, DIM))
input2 = torch.randn(size=(INPUT_SIZE2, DIM))

with torch.no_grad():
    model = LinearNet()
    model = torch.ao.quantization.quantize_dynamic(model, {torch.nn.Linear})

    model(input1)   # this goes to ACL lowp_gemm
    print("="*50)
    model(input2)   # this goes to gemm:jit without this PR, and to ACL with this PR
```
In the code snippet above:
- The matmul from `model(input1)` goes to oneDNN+ACL (in both cases, with and without the PR)
- The matmul from `model(input2)`: **Without this PR**: there's a cache miss (different input shapes) and matmul_forward::compute is run with the default lowp_kind (u8s8). Hence the matmul falls back to gemm:jit in oneDNN. However, **With this PR** the matmul goes to oneDNN+ACL which is around 10x faster than oneDNN+jit.
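
The failure mode described above (a cache-miss path that does not forward a configured value and so picks up a default) can be modeled with a toy function. Everything here is a hypothetical stand-in rather than the ideep or oneDNN API; only the default name `u8s8` comes from the description.

```python
DEFAULT_LOWP_KIND = "u8s8"  # the default named in the description above


def matmul_compute(shape, lowp_kind=DEFAULT_LOWP_KIND):
    """Toy stand-in for a compute call: report which kind it would run with."""
    return lowp_kind


def run_on_cache_miss(shape, configured_kind, forward_kind):
    """Model the cache-miss path with and without forwarding the configured kind."""
    if forward_kind:
        return matmul_compute(shape, configured_kind)
    # The kind is not forwarded, so the default is picked up silently.
    return matmul_compute(shape)


configured = "kind_supported_by_backend"  # placeholder, not a real ideep value
print(run_on_cache_miss((16, 4096), configured, forward_kind=False))  # u8s8
print(run_on_cache_miss((16, 4096), configured, forward_kind=True))   # kind_supported_by_backend
```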

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135058
Approved by: https://github.com/jondea, https://github.com/malfet
2024-09-12 20:30:20 +00:00
cd472bb1e3 [torch][fx] Add new replacement_callback to materialize a replacement just in time (#135553)
Summary:
Sometimes we only want to generate a replacement for a matched pattern
once we know some information about the nodes in the pattern.

So far, we have found this the most useful to do matches based on specific
shapes of tensors flowing into functions.
Use a callback function similar to `match_filters`. By default this isn't used.

Had to make `replacement` a None-able parameter because Callable was
already used to detect a case where a graph needed to be traced.
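
As a rough picture of the idea (not the FX API), a callback lets the replacement be chosen per match instead of being fixed up front. The function name `apply_replacements`, the dict-shaped matches, and the `by_shape` rule are all hypothetical and exist only for this sketch.

```python
from typing import Callable, Optional


def apply_replacements(
    matches: list,
    replacement: Optional[str] = None,
    replacement_callback: Optional[Callable[[dict], str]] = None,
) -> list:
    """Pick a replacement per match; the callback, if given, decides just in time."""
    results = []
    for match in matches:
        if replacement_callback is not None:
            results.append(replacement_callback(match))
        elif replacement is not None:
            results.append(replacement)
        else:
            raise ValueError("provide either replacement or replacement_callback")
    return results


def by_shape(match: dict) -> str:
    # Hypothetical rule: choose a kernel name from the matched input shape.
    return "small_kernel" if match["shape"][0] <= 32 else "large_kernel"


matches = [{"shape": (16, 4096)}, {"shape": (128, 4096)}]
print(apply_replacements(matches, replacement_callback=by_shape))
# ['small_kernel', 'large_kernel']
```

Making `replacement` optional mirrors the note above: when only the callback is supplied, there is no static replacement to pass.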

Differential Revision: D62412628

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135553
Approved by: https://github.com/SherlockNoMad
2024-09-12 18:52:14 +00:00
f032135bbf Add batching rule for torch.scatter_reduce (#135547)
Fixes #134797

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135547
Approved by: https://github.com/zou3519
2024-09-12 18:51:21 +00:00
525bec804c NJT <-> padded dense conversions (#125947)
This PR:
* Implements the pre-existing `nt.to_padded_tensor(padding_val)` ATen op via the FBGEMM kernel + appropriate view gymnastics (since that kernel only handles 2D values)
* Introduces a new `_nested_from_padded_tensor` op for the reverse conversion, implemented via the reverse FBGEMM kernel + view gymnastics
    * Note: there is currently no public API for this; design booted to a future PR

TODO:
* ~~Propagate min / max sequence length via the new factory function `_nested_from_padded_tensor`~~
* ~~Verify that Inductor does computation fusion via test logic~~

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125947
Approved by: https://github.com/soulitzer
2024-09-12 17:54:25 +00:00
83c594ebd6 [DSD] Fix distributed state dict full_state_dict option hang during set_state_dict (#135725)
Fix https://github.com/pytorch/pytorch/issues/134095
This fixes the distributed state dict `full_state_dict` option hang during `set_state_dict`. We switch `_distribute_tensors` in _state_dict_utils.py to use `DTensor.from_local` instead of `distribute_tensor` to support the FSDP2+TP 2D strided sharding use case, as `distribute_tensor` cannot handle strided sharding yet. `distribute_tensor` incurs a scatter behind the scenes, while `DTensor.from_local` takes the local slice from the full tensor on each rank to create the DTensor (no collective). This means it is the user's responsibility to make sure the full_tensor from the full_state_dict is the same across all ranks.
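
The "local slice, no collective" behavior and the resulting user responsibility can be sketched with plain lists. This is an analogy under simplifying assumptions (1-D data, even contiguous chunks, no strided sharding); `local_shard` is a hypothetical helper, not a DTensor API.

```python
def local_shard(full, rank, world_size):
    """Slice this rank's shard out of its own copy of the full tensor (no communication)."""
    chunk = -(-len(full) // world_size)  # ceiling division
    return full[rank * chunk:(rank + 1) * chunk]


full = list(range(8))
shards = [local_shard(full, r, 4) for r in range(4)]
print(shards)  # [[0, 1], [2, 3], [4, 5], [6, 7]]

# If one rank holds a different "full" tensor, its shard silently disagrees.
stale = [0] * 8
print(local_shard(stale, 1, 4))  # [0, 0]
```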
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135725
Approved by: https://github.com/fegin
2024-09-12 17:43:57 +00:00
c1277945d3 [AOTI][Tooling] Support debug printing for inductor level extern kernel call such as externkernel.addmm, bmm, etc. (#135731)
Summary:
As title.

Effect after merging this diff would look something like this:

```
        print('inductor: before_launch - triton_poi_fused_0 - buf0', buf0)
        triton_poi_fused_0.run(buf0, 6, grid=grid(6), stream=stream0)
        print('inductor: after_launch - triton_poi_fused_0 - buf0', buf0)
        buf1 = empty_strided_cuda((16, 6), (6, 1), torch.float32)
        # Topologically Sorted Source Nodes: [linear], Original ATen: [aten.addmm]
        print('inductor: before_launch - extern_kernels.addmm - buf0', buf0)
        extern_kernels.addmm(buf0, reinterpret_tensor(arg2_1, (16, 16), (16, 1), 0), reinterpret_tensor(L__self___weight, (16, 6), (1, 16), 0), alpha=1, beta=1, out=buf1)
        print('inductor: after_launch - extern_kernels.addmm - buf0', buf0)
```

Context: D62272588 only supports major triton kernel jit inductor debug printing codegen

Test Plan: CI & OSS CI

Reviewed By: chenyang78, ColinPeppler

Differential Revision: D62397017

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135731
Approved by: https://github.com/ColinPeppler
2024-09-12 17:31:10 +00:00
dab7d646d5 Use a better decomposition for split_with_sizes (#135728)
This decomposition has fewer checks and improves the performance
of torch.compile.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135728
Approved by: https://github.com/ezyang
2024-09-12 16:38:51 +00:00
7647c398ff Allow optional positional arguments for torch.func.functional_call (#134643)
This PR resolves #134408. It adds a test, which passes locally.

Do you think we should add a post-check to ensure `args` and `kwargs` are not both `None`? It seems to be possible to have modules without inputs.

This PR does not include any such post-check.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134643
Approved by: https://github.com/zou3519
2024-09-12 15:22:06 +00:00
d67cc58181 [ONNX] Fix symbolic values and numpy implementation (#135786)
1. Remove `__eq__` to make `SymbolicTensor` hashable and test for that
2. Update the `__array__` method so that it works for tensor on GPU
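
The first item relies on a general Python rule: a class that defines `__eq__` without also defining `__hash__` gets `__hash__` set to `None`, so its instances are unhashable. A minimal illustration with two hypothetical classes (unrelated to the real `SymbolicTensor`):

```python
class WithEq:
    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        return isinstance(other, WithEq) and self.name == other.name


class WithoutEq:
    def __init__(self, name):
        self.name = name


print(WithEq.__hash__ is None)  # True: defining __eq__ removed the default hash

try:
    hash(WithEq("x"))
except TypeError as exc:
    print("unhashable:", exc)

# Without __eq__, the default identity-based hash is kept, so this works.
lookup = {WithoutEq("x"): 1}
```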

Fixes https://github.com/pytorch/pytorch/issues/135700
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135786
Approved by: https://github.com/titaiwangms
2024-09-12 14:24:43 +00:00
dddaadac6c [dynamo] Dont graph break on inner torch.compile (#135819)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135819
Approved by: https://github.com/jansel
2024-09-12 11:39:09 +00:00
02169364e1 [inductor] Split reduction loops when there is no shared reads (#134307)
Fixes #129102

![image](https://github.com/user-attachments/assets/0d00f75b-2bb9-4ce6-a0d9-2daceaff539c)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134307
Approved by: https://github.com/shunting314
2024-09-12 09:45:08 +00:00
c30042fbeb [GPT-fast] Update compilation time target for Llama & Mixtral (#135817)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135817
Approved by: https://github.com/xmfan, https://github.com/huydhn
2024-09-12 07:13:44 +00:00
6700175531 [Inductor] simplify indexing_exprs in LoopBody._init_with_copy (#135574)
This PR uses `var_ranges` information to simplify `indexing_exprs` in `LoopBody._init_with_copy` to reduce occurrences of `FloorDiv` and `ModularIndexing` in the `indexing_exprs`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135574
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
2024-09-12 06:56:34 +00:00
de8a8653c0 [dtensor][BE] replace compute_local_shape with compute_local_shape_and_global_offset (#135554)
**Summary**
1. This PR removes the public API `compute_local_shape` and replaces its use with the more general API `compute_local_shape_and_global_offset`.
2. To keep `compute_local_shape_and_global_offset` consistent with `compute_local_shape` on empty shards, it now returns local tensor shape `(0,)` for empty shards which is more aligned with DTensor's semantics on non-participating ranks.
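
To make the empty-shard case concrete, here is a 1-D toy version of "local shape plus global offset". It is a sketch under assumptions: even chunking by ceiling division, a single sharded dimension, and an offset convention for empty shards chosen arbitrarily here. `local_shape_and_offset` is a hypothetical name, not the DTensor function.

```python
def local_shape_and_offset(global_size, num_shards, rank):
    """Return ((local_len,), (global_start,)) for a 1-D tensor split into chunks."""
    chunk = -(-global_size // num_shards)  # ceiling division
    start = min(rank * chunk, global_size)
    end = min(start + chunk, global_size)
    return (end - start,), (start,)


for rank in range(4):
    print(rank, local_shape_and_offset(5, 4, rank))
# rank 3 owns no elements, so its local shape is (0,)
```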

**Test**
`pytest test/distributed/_tensor/test_dtensor.py`
`pytest test/distributed/_tensor/test_init.py`
`pytest test/distributed/_tensor/test_tensor_ops.py`

Differential Revision: [D62415591](https://our.internmc.facebook.com/intern/diff/D62415591)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135554
Approved by: https://github.com/tianyu-l, https://github.com/wz337
2024-09-12 06:30:09 +00:00
86335e9135 [reland 3/3][fx] Bypass custom __setattr__ in Node.__init__ (#135735)
Relands #135079 which was reverted by #135562

I broke this up into three parts to test internally.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135735
Approved by: https://github.com/oulgen
2024-09-12 05:50:39 +00:00
14e3f3c062 [aoti] Remove nlohmann/json.hpp from header (#135765)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135765
Approved by: https://github.com/malfet
2024-09-12 05:38:51 +00:00
9852c6d236 xpu: fix 3rd party builds on systems with cmake<3.25 (#135767)
The CMake LINUX variable is only available starting from CMake 3.25. It is better to use CMAKE_SYSTEM_NAME instead to relax the CMake version requirement.

See: https://cmake.org/cmake/help/v3.25/variable/LINUX.html
Fixes: #135766
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135767
Approved by: https://github.com/malfet, https://github.com/guangyey
2024-09-12 05:31:01 +00:00
6354271178 [inductor] Skip unused call to get_estimated_runtime() (#135776)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135776
Approved by: https://github.com/oulgen
ghstack dependencies: #135445, #135446
2024-09-12 05:22:23 +00:00
12902f6ecf [inductor] Cache get_operation_names/get_buffer_names (#135446)
Before:
![image](https://github.com/user-attachments/assets/db5b6fce-d849-4512-a21d-7a09efc72311)

After:
![image](https://github.com/user-attachments/assets/097e340c-03b2-491e-ad36-132350b37892)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135446
Approved by: https://github.com/oulgen
ghstack dependencies: #135445
2024-09-12 05:22:23 +00:00
3decb676aa [inductor] Optimize cache_on_self (#135445)
This is a small compile time win, but also makes profiles more readable.
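
For readers unfamiliar with the pattern, a decorator of this kind caches the result of a zero-argument method on the instance itself. The version below is a minimal sketch of the general idea, not Inductor's actual (optimized) implementation; the `Graph` class is hypothetical.

```python
import functools


def cache_on_self(fn):
    """Cache the result of a no-argument method as an attribute on the instance."""
    key = f"_{fn.__name__}_cache"

    @functools.wraps(fn)
    def wrapper(self):
        if not hasattr(self, key):
            setattr(self, key, fn(self))
        return getattr(self, key)

    return wrapper


class Graph:
    def __init__(self):
        self.calls = 0

    @cache_on_self
    def buffer_names(self):
        self.calls += 1  # count how often the body really runs
        return ["buf0", "buf1"]


g = Graph()
g.buffer_names()
g.buffer_names()
print(g.calls)  # 1
```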

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135445
Approved by: https://github.com/oulgen
2024-09-12 05:22:23 +00:00
8d68a02905 OpenReg: Split the daemon into driver/executor (#135646)
Split the daemon into a proper user-process driver vs device-process executor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135646
Approved by: https://github.com/albanD
2024-09-12 05:03:46 +00:00
28330a8a39 [reland 1/3][fx] Bypass custom __setattr__ in Node.__init__ (#135733)
Relands #135079 which was reverted by #135562

I broke this up into three parts to test internally.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135733
Approved by: https://github.com/oulgen
2024-09-12 04:29:37 +00:00
eaba287adb [dynamo] Bug fix for _torchdynamo_inline source handling (#135612)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135612
Approved by: https://github.com/drisspg
2024-09-12 04:05:08 +00:00
cyy
f5f1d0a753 Fix build warnings for torch_python (#134981)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134981
Approved by: https://github.com/ezyang
2024-09-12 03:59:34 +00:00
5bc238c73e torch.hub: add get_dir/set_dir type hints (#134906)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134906
Approved by: https://github.com/Skylion007
2024-09-12 03:53:29 +00:00
79223114db Avoid inserting extra transpose when the input to group norm is NHWC (#135575)
When the input format for group norm is NHWC and the device is privateuseone, it introduces an additional transpose operation. To avoid this issue, a check for the privateuseone device needs to be added here.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135575
Approved by: https://github.com/ezyang
2024-09-12 03:36:05 +00:00
cyy
7cfd23636c Fix clang-tidy warnings in Caffe2 code (#134935)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134935
Approved by: https://github.com/ezyang
2024-09-12 03:27:09 +00:00
0d1d69fd25 Update torch-xpu-ops pin (ATen XPU implementation) (#135647)
Release cycle for PyTorch 2.5
1. Fixing runtime error on Windows: Fail to load torch_xpu_ops_unary_binary_kernels.dll as the bin size is large.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135647
Approved by: https://github.com/EikanWang
2024-09-12 03:16:08 +00:00
21a64d57b1 [BE] typing for decorators - masked/_ops (#135108)
Differential Revision: D62184735

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135108
Approved by: https://github.com/Skylion007
2024-09-12 01:34:09 +00:00
1a74952925 "Remove BLOCK_LIST" (#135729)
Summary:
Skip test_prepare_qat_conv_bn_fusion_getitem_placeholder when we use training ir, since it's only for bn-getitem pattern, but the pattern doesn't exist in training ir.

Remove BLOCK_LIST since it's empty.
Now all internal unittests will use training ir.

Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan'  caffe2/test/quantization:test_quantization -- -r test_prepare_qat_conv_bn_fusion_getitem_placeholder
buck2 run 'fbcode//mode/dev-nosan'  caffe2/test:quantization_pt2e_qat -- -r test_prepare_qat_conv_bn_fusion_getitem_placeholder
```

Differential Revision: D62387987

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135729
Approved by: https://github.com/tugsbayasgalan
2024-09-12 01:22:06 +00:00
a130ed828a Fix the upload of x86 micro benchmark results (#135780)
The upload stats workflow currently skips this job: https://github.com/pytorch/pytorch/actions/runs/10807251335/job/29977650639. This was missed in https://github.com/pytorch/pytorch/pull/135042. So the workflow is running, but nothing has been uploaded yet.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135780
Approved by: https://github.com/atalman
2024-09-12 01:16:38 +00:00
eb0fe02933 [PT2][inductor][Optimus] Add pad_aten_mm_pass pattern to resolve long computation kernel in LCE (#135167)
Summary:
We observed another long computation issue for the OBA_AFOC pyper model, so we add a pattern to avoid the perf regression.

- Only happens on A100
- We do not want to use force_shape_pad since it pads all GEMMs, which may not be optimal. An Optimus pass has more flexibility to customize the GEMM shape and do the corresponding padding
- To enable it, we pass the pass to the config, where "k_threshold_to_pad" can be customized

inductor_config.patch(post_grad_fusion_options={"pad_aten_mm_pass": {"k_threshold_to_pad" : 8388608}})

Test Plan:
# unit test

```
buck2 test mode/opt //caffe2/test/inductor:pad_mm
```
Buck UI: https://www.internalfb.com/buck2/58b0f272-f405-45be-bc8d-aec2dc4d5841
Test UI: https://www.internalfb.com/intern/testinfra/testrun/10133099209954651
Network: Up: 9.0KiB  Down: 142B  (reSessionID-8eb71a37-a5ca-4aff-a4f1-93ade3e47e4e)
Jobs completed: 9. Time elapsed: 3:18.0s.
Cache hits: 0%. Commands: 3 (cached: 0, remote: 0, local: 3)
Tests finished: Pass 17. Fail 0. Fatal 0. Skip 0. Build failure 0

# e2e test
see [D62388582](https://www.internalfb.com/diff/D62388582)

Differential Revision: D62220158

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135167
Approved by: https://github.com/jackiexu1992
2024-09-12 00:51:34 +00:00
d270e2d240 [FSDP2] better error msg for cpu offloading (#135156)
When cpu offloading is enabled and the user loads a gpu state dict, FSDP2 throws a less obvious error in backward:
```
RuntimeError: attempting to assign a gradient with device type 'cpu' to a tensor with device type 'cuda'. Please ensure that the gradient and the tensor are on the same device
```

This PR throws the error more explicitly by specifying which parameters should be moved because of cpu offloading:

```
FSDP parameters should be materialized on cpu when enabling cpu offloading. For example, load cpu state dict or call module.to_empty(device="cpu"). Found following parameters on non-cpu device: ['0.weight']
```

`pytest -s test/distributed/_composable/fsdp/test_fully_shard_state_dict.py -k test_dp_state_dict_cpu_offload`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135156
Approved by: https://github.com/awgu
2024-09-12 00:05:07 +00:00
16b37b309f [Inductor] Rename cpp_wrapper_cuda.py as cpp_wrapper_gpu.py (#135313)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135313
Approved by: https://github.com/jansel, https://github.com/desertfire
ghstack dependencies: #135312
2024-09-11 23:59:54 +00:00
13ee85ca5e [Inductor] Generalize cuda cpp wrapper as common triton based GPU cpp wrapper, will be reused by xpu in next PR. (#135312)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135312
Approved by: https://github.com/jansel, https://github.com/desertfire, https://github.com/eellison
2024-09-11 23:59:54 +00:00
94d2471d1f [Traceable FSDP2] Use .copy_ instead of .set_ for unsharded_param inplace update; Replace unsharded_param graph input usage with graph intermediate; Support FSDP2+LoRA (#133730)
Using `fsdp.set_` for the unsharded_param inplace update causes difficult-to-debug errors when enabling Traceable FSDP2 on TorchTune models. In this PR, we change it to use `fsdp.copy_`, which fixes the error and also strictly follows eager semantics (i.e. if the user explicitly stores an alias of the unsharded_param during execution of the user's module code, that alias will get updated correctly when the unsharded_param is copied into via copy_; whereas if we just swap out the unsharded_param storage via set_, that user-saved alias will not get updated, which is not good).

This PR also implements the graph pass to remove the resizes and copy if there is a resize_(full) -> copy_ -> resize_(0) pattern.
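
The aliasing argument for `copy_` over `set_` can be pictured with a plain-Python analogy: rebinding a container leaves old references behind, while writing into it in place keeps them in sync. `ToyParam` is a hypothetical stand-in and does not reproduce tensor storage semantics exactly.

```python
class ToyParam:
    def __init__(self, values):
        self.storage = list(values)

    def set_(self, values):
        # Swap in a new list: anything holding the old list is left behind.
        self.storage = list(values)

    def copy_(self, values):
        # Write into the existing list: aliases observe the new values.
        self.storage[:] = values


p = ToyParam([0, 0])
alias_p = p.storage
p.set_([1, 2])
print(alias_p)  # [0, 0]

q = ToyParam([0, 0])
alias_q = q.storage
q.copy_([1, 2])
print(alias_q)  # [1, 2]
```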

------

Test commands:
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_backend_inductor`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_nested_fully_shard_backend_inductor`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_trace_fsdp_copy_`
- `pytest -rA test/dynamo/test_repros.py::ReproTests::test_partitioner_cse_respects_mutation_boundaries`
- `pytest -rA test/dynamo/test_repros.py::ReproTests::test_fsdp_set_input_mutation_applied_when_input_gets_no_gradients`
- `pytest -rA test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mutation_op_matching`
- `python test/inductor/test_distributed_patterns.py DistributedPatternTests.test_fake_distributed_aot_eager`
- `PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=1 PYTORCH_TEST_WITH_CROSSREF=1 python test/functorch/test_aotdispatch.py TestEagerFusionOpInfoCPU.test_aot_autograd_exhaustive_norm_cpu_float32`
- `python test/distributed/test_inductor_collectives.py TestCollectivesInductor.test_backwards`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133730
Approved by: https://github.com/bdhirsh
2024-09-11 23:01:05 +00:00
5ca46be15e Fix/torch cat doc attr (#135698)
The `torch.cat` attr name for tensors in the docs differs from the method signature, unlike other methods.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135698
Approved by: https://github.com/albanD

Co-authored-by: Alexander Jipa <azzhipa@amazon.com>
2024-09-11 22:32:55 +00:00
9a04cfbeff fix for fp16 (#134106)
This PR is a replacement for https://github.com/pytorch/pytorch/pull/133085 for pushing a quick fix for RMSNorm.
The original author is @kkontny

Previous PR summary:
Since FP16 has quite a small dynamic range, it is very easy to overflow while computing `at::pow(input, 2)`, and this happens in real-world computation.

I tried to use the fused `nn.RMSNorm` implementation instead of `LlamaRMSNorm` inside the `transformers` implementation of Llama (`src/transformers/models/llama/modeling_llama.py`). It started to give wrong answers in FP16 while still giving good ones in FP32. I figured out that this happens due to overflow while computing the square of the input tensor.

The original `LlamaRMSNorm` implementation upcasts the input to fp32 to prevent this and give better numerical stability.

```python
import torch
from torch import nn


class LlamaRMSNorm(nn.Module):
    def __init__(self, hidden_size, eps=1e-6):
        """
        LlamaRMSNorm is equivalent to T5LayerNorm
        """
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.variance_epsilon = eps

    def forward(self, hidden_states):
        input_dtype = hidden_states.dtype
        # Upcast to fp32 so that squaring the input cannot overflow in fp16.
        hidden_states = hidden_states.to(torch.float32)
        variance = hidden_states.pow(2).mean(-1, keepdim=True)
        hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
        return self.weight * hidden_states.to(input_dtype)
```

The proposed commit fixes the issue. FP16 in RMSNorm has to be treated in a special way to be usable in real-world implementations.
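
The overflow itself is easy to reproduce without PyTorch. The sketch below uses the standard library's half-precision `struct` format to round squares to fp16; it is a simplified model (only the squaring step is rounded, and Python floats stand in for the upcast path), and `to_fp16` and `rms_norm` are hypothetical helpers.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round-trip a value through IEEE half precision; overflow becomes +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except (OverflowError, struct.error):
        return math.copysign(math.inf, x)


def rms_norm(xs, square, eps=1e-6):
    """RMS-normalize xs, using the supplied function to square each element."""
    variance = sum(square(v) for v in xs) / len(xs)
    return [v / math.sqrt(variance + eps) for v in xs]


xs = [300.0, 1.0]
fp16_out = rms_norm(xs, square=lambda v: to_fp16(v * v))  # squares rounded to fp16
upcast_out = rms_norm(xs, square=lambda v: v * v)         # squares kept in high precision

print(to_fp16(300.0 * 300.0))  # inf: 90000 exceeds the fp16 maximum of 65504
print(fp16_out)                # [0.0, 0.0]
print(upcast_out)
```

An input of only 300 is enough to overflow, because its square already exceeds the largest finite fp16 value; the infinite variance then drives every output to zero.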

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134106
Approved by: https://github.com/mikaylagawarecki, https://github.com/eqy
2024-09-11 22:02:07 +00:00
66db61f0d1 [ONNX] Update fake mode usage in onnx docs (#135512)
Update fake mode usage in onnx docs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135512
Approved by: https://github.com/justinchuby

Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
2024-09-11 21:29:04 +00:00
c025f7becc Revert "[Partitioner] Reuse partition to check whether nodes exist (#135317)"
This reverts commit e004d539da3335d97a8134c9081245628f18eb67.

Reverted https://github.com/pytorch/pytorch/pull/135317 on behalf of https://github.com/izaitsevfb due to BC-breaking, breaks executorch and internal meta builds ([comment](https://github.com/pytorch/pytorch/pull/135317#issuecomment-2344730294))
2024-09-11 21:27:53 +00:00
8c4e1148b8 Refactoring byte_order (#135558)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135558
Approved by: https://github.com/mikaylagawarecki
2024-09-11 21:06:43 +00:00
e20ee39558 Expand bitwise ops to unsigned types (#135525)
Fixes https://github.com/pytorch/pytorch/issues/135436

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135525
Approved by: https://github.com/ezyang
2024-09-11 20:48:52 +00:00
74fd1bf965 [ROCm] Update to AOTriton 0.7b (#134498)
Notable changes:
1. Enable CudaGraph related tests
2. Fix UT problems
3. EXPERIMENTAL Navi31 support. User should enable Navi31 support with Env Var `TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1`

Known Problem:
1. `test/test_transformers.py` will have massive failures and/or NaN outputs with `--use-pytest`
    + Update: Confirmed that skipping `class TestSDPAPrivateUse1Only` fixes the problem with `--use-pytest`

Note:
AOTriton 0.7b adds support for nested tensors + SDPA, but more work (and consequently a separate PR) is needed to enable it.

Fixes #133540

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134498
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily, https://github.com/malfet
2024-09-11 20:34:01 +00:00
5d964a5eb7 [Export] Fix SDPA decomposition (#135297)
Summary: Update the SDPA decomposition to match the updated strides from D62009189, which aligns strides with `aten._scaled_dot_product_attention_math.default` and makes `t.permute().contiguous().permute()` no longer necessary.

Test Plan: CI

Differential Revision: D62278378

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135297
Approved by: https://github.com/drisspg
2024-09-11 20:21:59 +00:00
118d7e1480 [Inductor] add _dynamo.reset to test_cat_slice_cat_cuda (#135694)
Summary: test_cat_slice_cat_cuda runs inductor multiple times and checks counters["inductor"] in between, so we need to reset properly.

Differential Revision: D62500331

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135694
Approved by: https://github.com/masnesral
2024-09-11 20:07:11 +00:00
dd47f6f623 Simplify expr before getting implications in _maybe_evaluate_static (#135499)
Fixes #134268

Previously we weren't simplifying these expressions before calling get_implications, resulting in inconsistent application of FloorDiv/CleanDiv. See #134268  for more details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135499
Approved by: https://github.com/ezyang
2024-09-11 19:48:29 +00:00
e05ea2b179 Add decomposition for transpose_copy (#130943)
* Extracted from #128416
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130943
Approved by: https://github.com/amjames, https://github.com/eellison
2024-09-11 19:45:22 +00:00
ad75b09d89 Replace capture_pre_autograd_graph with export_for_training in torch tests (#135623)
Summary: as title

Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export -- -r test_conv_dynamic
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:fx -- -r matcher
 buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r x86
```

CI

Differential Revision: D62448302

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135623
Approved by: https://github.com/tugsbayasgalan
2024-09-11 19:23:08 +00:00
a2cb9b7331 Flip triton kernel default layout constraint to "needs_fixed_stride_order" (#135581)
This is to match the default layout constraint for custom operators. By
default, Inductor should match the stride order of inputs to a triton
kernel.

Test Plan:
- existing tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135581
Approved by: https://github.com/eellison
ghstack dependencies: #135530
2024-09-11 18:43:18 +00:00
451eaf0ff2 Log full exception trace when error raised in Dynamo (#135697)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135697
Approved by: https://github.com/Skylion007
2024-09-11 18:14:33 +00:00
09519eb195 Support rolling over a percentage of workflows (#134816)
In order to support adding a rollover percentage, this ended up being a complete rewrite of runner_determinator.py.

Details of the new format are in the comments up top.

On the plus side, this now includes some unit tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134816
Approved by: https://github.com/PaliC, https://github.com/zxiiro
2024-09-11 18:01:26 +00:00
5314ae2660 Don't use exception chaining for BackendCompilerFailed (#135545)
Commandeered from https://github.com/pytorch/pytorch/pull/135496 as I'm now helping @ezyang ship dynamic float arguments in PT2.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135545
Approved by: https://github.com/ezyang
2024-09-11 17:49:18 +00:00
da587de9cb [ROCm] [BUGFIX] Re-enable rocm-specific tuning parameters v2 (#133852)
Small bug fix - https://github.com/pytorch/pytorch/pull/124592 replaced the torch.version.hip with device_props but made a mistake in porting the original logic.

The original code was:
`if torch.version.hip is not None:`

Which was incorrectly replaced by:
`if self.device_props.type != "hip":`

Another occurrence of https://github.com/pytorch/pytorch/pull/130617
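
To make the inversion concrete, here is a minimal sketch. `DeviceProps` and both helper functions are hypothetical stand-ins (not Inductor's real classes), and the `== "hip"` form is an assumption about what a faithful port of the original check looks like.

```python
from dataclasses import dataclass


@dataclass
class DeviceProps:
    # Hypothetical stand-in for the device properties object; only `type` matters here.
    type: str


def use_rocm_tuning_buggy(props: DeviceProps) -> bool:
    # The mistaken port: true on every device *except* ROCm.
    return props.type != "hip"


def use_rocm_tuning_ported(props: DeviceProps) -> bool:
    # Assumed faithful port of `torch.version.hip is not None`: true only on ROCm.
    return props.type == "hip"


for kind in ("hip", "cuda"):
    props = DeviceProps(type=kind)
    print(kind, use_rocm_tuning_buggy(props), use_rocm_tuning_ported(props))
```

With the inverted comparison, the ROCm-specific tuning parameters would be applied on non-ROCm devices and skipped on ROCm itself.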

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133852
Approved by: https://github.com/masnesral, https://github.com/malfet
2024-09-11 17:21:40 +00:00
82a4df2d5f [CI] [ROCm] Run rocm workflow on every push to main branch (#135644)
Dial the frequency back up from https://github.com/pytorch/pytorch/pull/131637

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135644
Approved by: https://github.com/huydhn
2024-09-11 17:21:05 +00:00
18a9030952 [CI] Fix update slow tests (#135390)
* Add pytorchbot to list of approvers for file
* Add labels to the auto created PR

The auto generated PR is currently not merging due to some failing tests on slow workflow that were supposed to be moved back to normal

idk if this has much value, clearly we've been managing without the update
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135390
Approved by: https://github.com/ZainRizvi
2024-09-11 17:02:17 +00:00
03f23d07b4 Optimize ShapeEnv.replace (#135652)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135652
Approved by: https://github.com/ezyang
ghstack dependencies: #135621, #135622
2024-09-11 16:50:59 +00:00
8c738c9270 Improve performance of sympy_generic_le (#135622)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135622
Approved by: https://github.com/ezyang
ghstack dependencies: #135621
2024-09-11 16:20:03 +00:00
7ddacaf40a Improve performance of canonicalize_bool_expr (#135621)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135621
Approved by: https://github.com/ezyang
2024-09-11 16:20:03 +00:00
183c32fd3b Revert "[Dynamo] Trace torch function modes entered outside of torch.compile (#133137)"
This reverts commit 0d15122092c27fec1143b800bab7c996d126b547.

Reverted https://github.com/pytorch/pytorch/pull/133137 on behalf of https://github.com/clee2000 due to something in this stack broke functorch/test_control_flow.py::TestControlFlow::test_scan_simple_graph [GH job link](https://github.com/pytorch/pytorch/actions/runs/10804912306/job/29980571390) [HUD commit link](444b52ff40), newly added test yesterday ([comment](https://github.com/pytorch/pytorch/pull/133137#issuecomment-2344054339))
2024-09-11 15:57:00 +00:00
3ab12e2596 Revert "[Dynamo] Support thread local setattr (#135443)"
This reverts commit 160c228a4bd60ceffa62b045a6b0a6f9413835c5.

Reverted https://github.com/pytorch/pytorch/pull/135443 on behalf of https://github.com/clee2000 due to something in this stack broke functorch/test_control_flow.py::TestControlFlow::test_scan_simple_graph [GH job link](https://github.com/pytorch/pytorch/actions/runs/10804912306/job/29980571390) [HUD commit link](444b52ff40), newly added test yesterday ([comment](https://github.com/pytorch/pytorch/pull/135443#issuecomment-2344042800))
2024-09-11 15:53:55 +00:00
596e93b506 Revert "[dynamo] Bug fix for _torchdynamo_inline source handling (#135612)"
This reverts commit 5c3d0a2dedbc0e85f3b256ce56ac674078a5fae1.

Reverted https://github.com/pytorch/pytorch/pull/135612 on behalf of https://github.com/clee2000 due to broke inductor/test_cpu_select_algorithm.py::TestSelectAlgorithmCPU::test_linear_input_transpose_bias_True_cpu_float32 [GH job link](https://github.com/pytorch/pytorch/actions/runs/10805518363/job/29982386304) [HUD commit link](5c3d0a2ded), bad TD ([comment](https://github.com/pytorch/pytorch/pull/135612#issuecomment-2344039370))
2024-09-11 15:51:12 +00:00
f96e8041b1 Revert "[Dynamo] Simplify torch function mode stack guard (#135444)"
This reverts commit 444b52ff40cf4afce7bc3fdcf021a88eab3b954c.

Reverted https://github.com/pytorch/pytorch/pull/135444 on behalf of https://github.com/clee2000 due to something in this stack broke functorch/test_control_flow.py::TestControlFlow::test_scan_simple_graph [GH job link](https://github.com/pytorch/pytorch/actions/runs/10804912306/job/29980571390) [HUD commit link](444b52ff40), newly added test yesterday ([comment](https://github.com/pytorch/pytorch/pull/135444#issuecomment-2344036843))
2024-09-11 15:48:27 +00:00
7cf9c81918 Revert "[Dynamo] Use custom backend to reenter metadata tf mode when tracing while/cond (#134732)"
This reverts commit 6a3edfcc1e474e6ebd0c06624000a6d6bf1a0dee.

Reverted https://github.com/pytorch/pytorch/pull/134732 on behalf of https://github.com/clee2000 due to broke functorch/test_control_flow.py::TestControlFlow::test_scan_simple_graph [GH job link](https://github.com/pytorch/pytorch/actions/runs/10804912306/job/29980571390) [HUD commit link](444b52ff40), newly added test yesterday ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2344016694))
2024-09-11 15:39:21 +00:00
49e0b88aab Fix test_triton_kernel_float64_constant (#135583)
Summary: Landed https://github.com/pytorch/pytorch/pull/135260 too soon and the test in that PR doesn't do exactly what I tested (actually test different dtypes).

Test Plan: `python test/inductor/test_triton_kernels.py -k float64_constant`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135583
Approved by: https://github.com/isuruf, https://github.com/eellison, https://github.com/Skylion007
2024-09-11 15:16:23 +00:00
ee8c5cc1cc For S444023: Back out "deprecate search_autotune_cache (#133628)" (#135186)
Summary: For S444023

Test Plan:
Revert prevented the NaN errors - f639391901
Training job ran for 7767 iterations. NaN errors show up within the first 1k.

Reviewed By: nmacchioni

Differential Revision: D62224747

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135186
Approved by: https://github.com/kit1980
2024-09-11 14:08:40 +00:00
ce4d146f56 ATen | Fix MPSCNNNeuron creation on Mac Catalyst. (#135595)
Summary:
These are still utilized directly when using relu/sigmoid/tanh tensors directly from here: https://fburl.com/code/k6n7ofzd
However, on Mac Catalyst we were always returning `nil`, which in most cases rendered the entire graph useless and most often left stray `MPSTemporaryImage` references that were never written into.

This fixes the issue completely by making sure that we always return the valid kernels back, so they can be executed.

Test Plan: Test with segmentation net that uses a combination of relu and other tensors together - run this via Mac Catalyst build - it works! {F1858576745}

Reviewed By: MichaelTay

Differential Revision: D62430010

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135595
Approved by: https://github.com/MichaelTay
2024-09-11 11:12:23 +00:00
0226fcaacf Disable cuda specific restrictions in _scaled_mm for other devices (#135579)
Fixes #135576

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135579
Approved by: https://github.com/drisspg
2024-09-11 11:05:38 +00:00
4cde5096c4 [Inductor][FlexAttention] Supports dynamic shapes with block mask (#135629)
Fixes #134560 and #135206

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135629
Approved by: https://github.com/drisspg
2024-09-11 08:10:50 +00:00
443c015393 [Distributed] Improve efficiency of NaN checker (#135414)
Some customers would like to run the NaN checks on the fly, so we are improving its efficiency.

## Benchmarking
Allreduce 2G floats. `TORCH_NCCL_NAN_CHECK=1`
Red kernel: ncclAllreduce
Blue kernel: Nan check

<img width="1093" alt="Screenshot 2024-09-06 at 10 00 05 PM" src="https://github.com/user-attachments/assets/5501bc31-024f-4115-adb2-dd66eb4025d3">

## Comparison with torch ops:
Let's say a user manually checks for NaNs with the following torch ops before all-reduce:
```
torch.any(torch.isnan(x))
```
<img width="1091" alt="Screenshot 2024-09-06 at 10 14 53 PM" src="https://github.com/user-attachments/assets/1f8b5f63-c955-4612-bb96-241b6c69959b">

So our perf is on-par with torch ops.

## Changes
- Load from vidmem using "big packs" of 16 bytes
- Bump `blockDim.x` from 256 to 512
- Separate loads and checks into two loops, each of 8 iterations
- Unroll the loops
- Templated functions for checking NaN in a "big pack" based on dtype

Special thanks to @jbachan from NCCL!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135414
Approved by: https://github.com/wconstab
2024-09-11 07:53:42 +00:00
4ae6d7c18f Back out "[pytorch][PR] [export] fix re-export custom metadata" (#135634)
Summary: Broke some tests. Revert this diff

Test Plan: CI

Differential Revision: D62474337

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135634
Approved by: https://github.com/tugsbayasgalan
2024-09-11 06:16:26 +00:00
3084b7b5c0 [cuDNN][SDPA] Support attn_bias in cuDNN (#130482)
CC @drisspg

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130482
Approved by: https://github.com/drisspg, https://github.com/Skylion007, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-09-11 05:59:25 +00:00
5c3d0a2ded [dynamo] Bug fix for _torchdynamo_inline source handling (#135612)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135612
Approved by: https://github.com/drisspg
ghstack dependencies: #135588
2024-09-11 05:23:42 +00:00
c608b17f60 [PTD][BE][c10d] Add some code documents for TCPStore code and cosmetic changes to libUVStore code (#130496)
While designing something else that needs TCPStore, I spent some time digging into the TCPStore codebase and found the code a little challenging to understand without proper documentation. Although people from the OSS community must be smarter than me, I still want to document my findings in the code so that devs and users can use them as a reference down the road.

Also, for libuv, private variables need to be marked with a "_", so this is a pure renaming of private variables such as `tcpServer`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130496
Approved by: https://github.com/wconstab
2024-09-11 04:42:25 +00:00
444b52ff40 [Dynamo] Simplify torch function mode stack guard (#135444)
The semantics of ignored modes previously had edge cases; this eliminates them by, in essence, filtering any ignored modes out of both the ref stack and the current torch function mode stack. This is purely to fix complexity in #135422.  The ignored modes handling will be removed in a future PR after https://github.com/pytorch/pytorch/pull/135422 lands, since we will then trace through DeviceContexts instead of inserting them into the graph, which needed these extra workarounds for correctness.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135444
Approved by: https://github.com/anijain2305, https://github.com/williamwen42
ghstack dependencies: #134732, #133137, #135443
2024-09-11 04:18:22 +00:00
160c228a4b [Dynamo] Support thread local setattr (#135443)
In preparation for tracing through DeviceContext (defb515306/torch/utils/_device.py (L66))
This PR adds support for calling the setattr of thread local objects. These objects have a slots impl, and since this doesn't appear to have any side effects, we call this setattr impl when replaying mutations, since calling `object.__setattr__` on these objects results in a type error.
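
A minimal standalone sketch of the idea, outside Dynamo: `replay_setattr` is a hypothetical helper, not the function Dynamo uses. It tries the generic `object.__setattr__` first and falls back to the setattr implementation of the object's own type when that is rejected.

```python
import threading


def replay_setattr(obj, name, value):
    """Set an attribute, falling back to the type's own __setattr__ if needed."""
    try:
        # Generic path. The commit message reports a TypeError for thread-local
        # objects; AttributeError is caught too in case an interpreter differs.
        object.__setattr__(obj, name, value)
    except (TypeError, AttributeError):
        # Use the setattr implementation defined by the object's own type.
        type(obj).__setattr__(obj, name, value)


tls = threading.local()
replay_setattr(tls, "device", "cuda:1")
print(tls.device)
```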

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135443
Approved by: https://github.com/anijain2305
ghstack dependencies: #134732, #133137
2024-09-11 04:18:22 +00:00
0d15122092 [Dynamo] Trace torch function modes entered outside of torch.compile (#133137)
This PR adds initial tracing for torch function modes.

Details:
In essence, this adds tracing into the torch function of modes entered outside of the torch.compile call.
This does not yet support tracing enter/exit of a torch function mode/ tracing set_default_device properly using the new mode infra (this will be a very good stress test for modes). I am adding more PRs to this stack to support these. The overall plan is to support tracing enter/exit and handling graph breaks like we do other torch.* context managers.

Previously landed:
https://github.com/pytorch/pytorch/pull/133135
https://github.com/pytorch/pytorch/pull/133136
https://github.com/pytorch/pytorch/pull/133134
https://github.com/pytorch/pytorch/pull/133133
https://github.com/pytorch/pytorch/pull/133132
https://github.com/pytorch/pytorch/pull/133131
https://github.com/pytorch/pytorch/pull/133729
https://github.com/pytorch/pytorch/pull/133130

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133137
Approved by: https://github.com/jansel, https://github.com/zou3519
ghstack dependencies: #134732
2024-09-11 04:18:22 +00:00
6a3edfcc1e [Dynamo] Use custom backend to reenter metadata tf mode when tracing while/cond (#134732)
For tracing cond/while in eager, we trace the HOP with the eager backend with metadata torchfunction mode enabled. HOPs disallow the mutation that occurs in this torch function mode, so it is not able to be traced. As a result, we use a custom backend which enters this mode for tracing these HOPs. Thanks to @ydwu4 for the help with implementing this

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134732
Approved by: https://github.com/ydwu4
2024-09-11 04:18:22 +00:00
356f14e7b7 Fix the output of FileCheck when not run and add unit tests (#135345)
When FileCheck is destructed without execution, it should output all rules.
For example:
```
>>> fc = FileCheck().check("test")
>>> del fc
You have not run this instance of FileCheck!
FileCheck checks:
        CHECK: test
```

Additionally, unit tests for the Python interface of FileCheck will be added.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135345
Approved by: https://github.com/eellison
2024-09-11 04:13:24 +00:00
34dc8f69a1 Adding entry-point based support for out-of-tree rendezvous plugins (#132633)
Fixes #127519

Currently in torchrun rendezvous, there are only two rendezvous backends supported out of the box: `C10d` and `Etcd`. The changes in this PR enable distributed elastic users to bring their own out-of-tree rendezvous backend implementations as Python packages.

#### AUTHORING NEW PLUGIN
Any new plugin will be a python package exposing entry-points. For example, the structure of redis plugin is as follows:

```
plugin_root
|_ pyproject.toml
|_ src
   |_ redis
      |_ __init__.py
      |_ redis_store.py
      |_ redis_backend.py
```

The contents of the `pyproject.toml` should indicate that this package exposes a torchrun entry-point by mentioning the group name `torchrun.plugins`. The `pyproject.toml` for the redis plugin would be as follows:

```
[project]
name = "redis"
version = "0.0.1"

[project.entry-points.'torchrun.plugins']
redis = 'redis'
```
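
For illustration only, this is roughly how a host program can discover packages registered under an entry-point group using the standard library. `discover_plugins` is a hypothetical helper and not torchrun's actual loading code; only the group name `torchrun.plugins` comes from this PR.

```python
from importlib import metadata


def discover_plugins(group: str = "torchrun.plugins") -> dict:
    """Return {entry point name: entry point} for every installed package in `group`."""
    eps = metadata.entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+ returns an object with a select() filter.
        selected = eps.select(group=group)
    else:
        # Older versions return a mapping from group name to entry points.
        selected = eps.get(group, [])
    return {ep.name: ep for ep in selected}


plugins = discover_plugins()
print(sorted(plugins))  # would include "redis" once the example package is installed
# A host could then call plugins["redis"].load() to import the plugin module.
```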

The `src/redis/__init__.py` file would contain functions that return the plugin name and plugin handler. The contents of `__init__.py` for redis would be as follows:

```
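from torch.distributed.elastic.rendezvous import RendezvousParameters
from torch.distributed.elastic.rendezvous.dynamic_rendezvous import create_handler


def getPluginHandler():
    # Returns a factory that builds the rendezvous handler for the redis backend.
    def _create_redis_handler(params: RendezvousParameters):
        # Imported lazily so the plugin's dependencies load only when it is used.
        from redis_rendezvous_backend import create_backend
        backend, store = create_backend(params)
        return create_handler(store, backend, params)
    return _create_redis_handler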

The files `redis_store` and `redis_backend` contain the implementation of [Store](41189b0da4/torch/_C/_distributed_c10d.pyi (L171)) and [RendezvousBackend](e782918b8e/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py (L61)) respectively.

#### USER EXPERIENCE
Before using the plugin for the first time, the user has to install the plugin package. For example, a published package can be installed using `pip3 install <plugin-name>`, and a plugin in the local file system can be installed using `pip3 install -e <plugin-location>`.

Once installed, the new backend can be used in torchrun as follows:

```
torchrun --rdzv-backend=redis --rdzv-endpoint=redis-container:6379 --nnodes=3 --nproc-per-node=1 --max-restarts=3 --rdzv-id=1 test.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132633
Approved by: https://github.com/fduwjj
2024-09-11 03:35:02 +00:00
cd9ee49a69 [aoti] Add cpp loader (#135374)
* Added a cpp loader, AOTIModelPackageLoader, which can load the .pt2, build the .so, and create a runner. The python-facing API is that users can directly call the `run` function, whereas in cpp users can directly access the `runner_` if they are more familiar with that. I couldn't figure out how to bind the `get_runner()` function to python...
* Added a new config, `aot_inductor.package_cpp_only` which will **not** package the so. This means that whenever the package is loaded, we will need to build the so. This is turned off by default so that new environments do not need to rebuild their so. The `package_cpp_only` is a feature which torchchat intends to use to provide flexibility to users.
* Added a new config, `aot_inductor.metadata` which stores user-provided metadata, serialized to the pt2 as a json file. It also stores the device used when exporting, "cuda" or "cpu", so that during load time, we can use that data to determine which AOTIModelContainerRunner to use. The metadata can be accessed through `loader.get_metadata()`. TODO is to move this metadata to the toplevel `package_aoti` function so that we can remove the metadata as a config.
* Separated out `package_aoti` as a standalone function, instead of it automatically being called in inductor. This is to prepare for the case where users will compile multiple models, and want to bundle it in one package. The specific use case is in torchchat, where we want to package the separately-exported encoder and decoder layers. An example of how to use this is in `test_multiple_methods`.
* `load_package` will load a singular model, given the model name.
* The loader doesn't support windows for now, I think I need to add some more casing to make the build commands work on windows?

Differential Revision: [D62329906](https://our.internmc.facebook.com/intern/diff/D62329906)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135374
Approved by: https://github.com/desertfire, https://github.com/malfet
2024-09-11 03:00:01 +00:00
26e5572dd2 Bump triton xpu pin and release version (#135638)
Similar to https://github.com/pytorch/pytorch/pull/135627

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135638
Approved by: https://github.com/atalman
2024-09-11 00:56:15 +00:00
693897df42 [dynamo] Missing guard source keys for corner case of NNModuleVariabl… (#135041)
Potentially fixes - https://fb.workplace.com/groups/1286739428954016/permalink/1319662695661689/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135041
Approved by: https://github.com/ezyang
2024-09-11 00:43:26 +00:00
3bf6be457d [MPS] Add missing dispatch to rshift.Tensor (#135607)
Missed it while working on https://github.com/pytorch/pytorch/pull/131813
Test plan: `python -c "import torch;print(torch.randint(100, 500, (64,), device='mps') >> torch.tensor([3,], device='mps'))"`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135607
Approved by: https://github.com/manuelcandales
2024-09-11 00:20:53 +00:00
492f064f15 [ONNX] Add assertion nodes to ignoring list (#135591)
Fixes #135419

PS: there are 104 empty output nodes, I suggest we add them one by one when we run into them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135591
Approved by: https://github.com/justinchuby
2024-09-11 00:18:17 +00:00
29408ea81a Add option to tweak inductor stride settings for user-defined triton kernels (#135530)
Previously, Inductor was allowed to modify the stride/storage_offset
(layout) for inputs to user-defined triton kernels. This can cause
silent incorrectness because most triton kernels are written for a
specific striding pattern (usually contiguous).

This PR adds a config to allow the user to choose Inductor's behavior on
this. The options are:
- "flexible_layout" (default): Inductor can modify the layout for inputs
  to user-defined triton kernels as much as it wants.
- "needs_fixed_stride_order": Inductor must preserve the stride order
  (when compared to tracing) for inputs to user-defined triton kernels.

This matches our handling for custom operators. In the future, we'll
want a "needs_exact_strides" option (this is the safest option).
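
As a toy illustration of the silent incorrectness described above (plain Python lists, no Triton or Inductor involved; all function names are made up for this sketch), a "kernel" that hard-codes contiguous strides returns wrong values without raising when the same logical matrix is stored with a different layout:

```python
def read_2d(buffer, shape, strides):
    """Read a 2-D view out of a flat buffer using explicit strides."""
    rows, cols = shape
    return [[buffer[r * strides[0] + c * strides[1]] for c in range(cols)]
            for r in range(rows)]


def kernel_assuming_contiguous(buffer, shape):
    """Toy 'kernel' that hard-codes row-major strides, as many hand-written kernels do."""
    rows, cols = shape
    return read_2d(buffer, shape, (cols, 1))


# Logical 2x3 matrix [[0, 1, 2], [3, 4, 5]] stored two ways.
row_major = [0, 1, 2, 3, 4, 5]       # strides (3, 1)
column_major = [0, 3, 1, 4, 2, 5]    # strides (1, 2)

expected = read_2d(row_major, (2, 3), (3, 1))
assert read_2d(column_major, (2, 3), (1, 2)) == expected  # correct strides: same matrix

print(kernel_assuming_contiguous(row_major, (2, 3)))
print(kernel_assuming_contiguous(column_major, (2, 3)))   # wrong values, no error raised
```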

Test Plan:
- new test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135530
Approved by: https://github.com/FindHao, https://github.com/oulgen
2024-09-11 00:11:17 +00:00
02dcb07765 Add boolean support in pack segments ops for both cpu and cuda impls (#132897) (#135620)
Summary:

Same as int types, forward only.

bypass-github-export-checks diff has been synced to github

Test Plan:
buck test mode/dev-nosan //caffe2/torch/fb/sparsenn:test -- test_pack_segments
https://www.internalfb.com/intern/testinfra/testconsole/testrun/16888498646804437/

Reviewed By: garroud

Differential Revision: D60785563

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135620
Approved by: https://github.com/kit1980

Co-authored-by: Haoming Lu <haominglu@meta.com>
2024-09-11 00:03:17 +00:00
5c38aa72c0 [dynamo][dicts][nv-embed] Support update with kwargs (#135588)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135588
Approved by: https://github.com/yanboliang
2024-09-10 23:50:23 +00:00
5134ba7458 Bump triton pin and release version (#135627)
Update the pin and release version to sync with https://github.com/triton-lang/triton/tree/release/3.1.x

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135627
Approved by: https://github.com/Chillee, https://github.com/drisspg, https://github.com/malfet
2024-09-10 23:46:36 +00:00
e48ee2cf50 [ONNX] Fix scaled_dot_product_attention with float scale (#135594)
Fixes #125158

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135594
Approved by: https://github.com/justinchuby
2024-09-10 23:04:02 +00:00
eb38ee21ba [ROCm] slow torch.sum optimization by increasing max_values_per_thread in reduce config (#135397)
Fixes #132964

This change is to optimize torch.sum() performance by increasing max_values_per_thread in setReduceConfig() for ROCm platform.
With this parameter increased, the reduction uses fewer threadblocks, which improves performance.

Test:
Tested on MI300x and H100, and now the MI300x perf improved to 3205GByte/s from ~1690GByte/s for the test case and is slightly better than H100 (3136GByte/s).

Also tested with tensors of other sizes and saw perf improvements there as well.

```python
import torch
from triton.testing import do_bench

x = torch.randn(2**30, device='cuda')

ms = do_bench(lambda: x.sum(dim=-1))

bandwidth_gbyte = x.numel() * x.dtype.itemsize / (10**9)

time_s = ms / 1000

bw_per_second = bandwidth_gbyte / time_s

print(bw_per_second)
```

Co-author: @carlobertolli

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135397
Approved by: https://github.com/eqy, https://github.com/malfet
2024-09-10 21:03:01 +00:00
8057b72763 [ez][inductor] don't benchmark cloning if there are no mutated args (#135533)
When a kernel does not have mutated args (this is quite common?), benchmarking the cost of cloning actually benchmarks a no-op. This still takes >100ms since triton.testing.do_bench will allocate 100 ms budget to run the kernel.
Skipping this benchmarking can save quite some compilation time if the code path is hit multiple times. Let's say, if the code path is hit 100 times when the graph is large, we would save >10s.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135533
Approved by: https://github.com/jansel
ghstack dependencies: #135531
2024-09-10 20:54:31 +00:00
7b17918dc9 [inductor] fix a device sync issue for benchmarking fusion (#135531)
Fix https://github.com/pytorch/pytorch/issues/134768 .

When we benchmark the latency for a fused node set, we do benchmarking twice:
1. benchmark the latency of the kernel including cloning mutated args
2. benchmark the latency of cloning mutated args without running the kernel

We subtract result 2 from result 1 to get the latency of the kernel itself.

But when the tensors are not on cuda device 0, we get equal numbers for result 1 and result 2 no matter how much work the kernel does. The root cause is that, in `triton.testing.do_bench`, the `torch.cuda.synchronize` call syncs the current cuda device (which is device 0 if it's not overridden). But since the tensors and kernels are located on another device, the sync actually does nothing (unless there happen to be other kernels on device 0).

The fix is to set the correct current device in our benchmarking code.
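
The two-step measurement above can be sketched as follows. This is not Inductor's benchmarking code: `kernel_latency_ms` and its parameters are hypothetical, the timer is injected so the example runs without a GPU, and the device-selection fix itself is not shown. The early return reflects the related shortcut of skipping the clone benchmark when nothing is mutated.

```python
def kernel_latency_ms(bench, run_kernel_with_clone, clone_only, has_mutated_args):
    """Estimate kernel-only latency as (kernel + clone) minus (clone only).

    `bench` is any callable that times a function and returns milliseconds.
    """
    total_ms = bench(run_kernel_with_clone)
    if not has_mutated_args:
        # Nothing is cloned, so timing the clone step would only time a no-op.
        return total_ms
    return total_ms - bench(clone_only)


# Fake timer so the example is deterministic: kernel + clone "takes" 5 ms, clone 2 ms.
timings = {"kernel_with_clone": 5.0, "clone_only": 2.0}
calls = []


def fake_bench(fn):
    calls.append(fn.__name__)
    return timings[fn.__name__]


def kernel_with_clone():
    pass


def clone_only():
    pass


print(kernel_latency_ms(fake_bench, kernel_with_clone, clone_only, has_mutated_args=True))
```

If both measurements come back equal, as in the bug described above, the subtraction reports a kernel latency of zero regardless of how much work the kernel does.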

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135531
Approved by: https://github.com/jansel
2024-09-10 20:54:31 +00:00
66c45f3ed9 [export] fix re-export custom metadata (#135282)
Fixes #134778

When a model is exported and debug handles are added to the "custom" field of non-placeholder and non-output nodes in the graph, re-exporting it will change the metadata of placeholder nodes (the "custom" field will be added or copied to these nodes, depending whether `ExportedProgram` or `ExportedProgram.module()` is passed to `generate_numeric_debug_handle()`).

This occurs because when we re-export the model, `placeholder` nodes are unlifted to `get_attr` nodes. These nodes remain as `get_attr` after being exported to `gm_torch_level`.  Their metadata are modified [here](https://github.com/pytorch/pytorch/blob/main/torch/export/_trace.py#L1347) based on `params_buffers_to_node_meta` which is collected [here](https://github.com/pytorch/pytorch/blob/main/torch/export/_trace.py#L1312).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135282
Approved by: https://github.com/jerryzh168, https://github.com/zhxchen17, https://github.com/tugsbayasgalan
2024-09-10 20:15:02 +00:00
0a9d55d2ee Revert "[AOTI] Fix assert_function call in cpu autotune template (#135086)"
This reverts commit 16c3b8f87cfa9cb5acee8104820baa389e7ee2bd.

Reverted https://github.com/pytorch/pytorch/pull/135086 on behalf of https://github.com/izaitsevfb due to breaks internal tests, see D62405818 ([comment](https://github.com/pytorch/pytorch/pull/135086#issuecomment-2341889428))
2024-09-10 19:51:16 +00:00
4ca65d3323 [CI] Increase sharding for jobs that are timing out (#135582)
Increase sharding for
* slow grad check
* slow cuda tests slow / linux-focal-cuda12.1-py3.10-gcc9-sm86 / test
* avx

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135582
Approved by: https://github.com/huydhn, https://github.com/malfet
2024-09-10 19:45:13 +00:00
c932b39739 [FSDP2] Added _set_unshard_async_op (#135523)
This PR adds a private API `_set_unshard_async_op` that allows for running pre-forward and pre-backward all-gathers using the `async_op=True` path so that all-gather allocations happen in the default stream to avoid inter-stream fragmentation.

If using this option, forward requires explicit prefetching e.g. via the `unshard(async_op=True)` API for overlap. fp32 -> bf16 casts and the all-gather copy-in will not overlap with compute.

Differential Revision: [D62401551](https://our.internmc.facebook.com/intern/diff/D62401551)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135523
Approved by: https://github.com/weifengpy
2024-09-10 19:28:02 +00:00
1f15973657 [AOTI][Tooling][7/n] Add debug printing support for JIT inductor codegen path as well (#135285)
Summary:
1.  Add the debug printer call to a level lower for triton kernel python wrapper codegen path
2. Add `torch.save()` for jit inductor as well
3. This also fixes the issue introduced in D61949020 (at python wrapper code level for triton kernel not printing)

Test Plan:
```
AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=1  TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+graph, inductor, +schedule, output_code" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_addmm_abi_compatible_cuda
```

Differential Revision: D62272588

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135285
Approved by: https://github.com/chenyang78
2024-09-10 19:24:58 +00:00
fc88ba260f [amdsmi][torch] Update amdsmi API usages (#135504)
Summary: In ROCm 6.2.0 there were API name changes-- we check if the new APIs exist and use them in this diff; see 7b2463abe0 for the changes

Test Plan: CI

Differential Revision: D62325661

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135504
Approved by: https://github.com/eqy, https://github.com/houseroad
2024-09-10 19:15:39 +00:00
bf8d0e3107 [inductor] Enable subprocess parallel compile internally with killswitch (#132467)
Differential Revision: [D60629630](https://our.internmc.facebook.com/intern/diff/D60629630)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132467
Approved by: https://github.com/eellison
2024-09-10 19:05:46 +00:00
3a1239a248 [Profiler] Harden Record Function Kwargs (#135365)
Summary:
In S445839, we had HTA break because of the "stream" parameter that was added to gpu traces. This brought up discussions regarding hardening our post processing of said inputs as to not break JSON schema as well as downstream tools. For this reason, this diff does the following.

1. Only allow int, double, bool and string values to be processed as kwinputs for JSON output. We can handle lists if needed in the future.
2. Make sure that any boolean is lowercase when serialized as a string so that the JSON does not break when parsing it
3. Force stream parameter to be an int
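
The three rules can be sketched in Python as below. The real change lives in the profiler's own code; `sanitize_kwinputs` is a hypothetical name, and dropping a `stream` value that cannot be converted is an assumption made for this sketch.

```python
import json


def sanitize_kwinputs(kwargs: dict) -> dict:
    """Keep only int, float, bool and str values; coerce `stream` to int."""
    clean = {}
    for key, value in kwargs.items():
        if key == "stream":
            # Assumption: anything int() accepts is fine; other values are dropped.
            try:
                clean[key] = int(value)
            except (TypeError, ValueError):
                continue
        elif isinstance(value, (bool, int, float, str)):
            clean[key] = value
        # Lists, dicts and other objects are skipped.
    return clean


raw = {"stream": "7", "sync": True, "name": "copy", "scale": 0.5, "shape": [2, 3]}
print(json.dumps(sanitize_kwinputs(raw)))
```

Going through a JSON serializer emits booleans as lowercase `true`/`false`, which is what rule 2 asks for.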

Test Plan: Added unit tests to ensure that the list of requirements above is true for kwargs only.

Differential Revision: D62304843

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135365
Approved by: https://github.com/aaronenyeshi
2024-09-10 18:44:05 +00:00
4f9f1775d8 Fix flaky TestCudaWrapper.test_randint_cuda_cuda_wrapper (#135370)
Summary: This test is flaky when run after `test_dynamic_shapes_persistent_reduction_mixed_x_dim_cuda_cuda_wrapper` because the TestCase sets config options globally in its setUp() that stick around for subsequent tests. For test isolation, we use a contextlib.ExitStack pattern in other tests to patch the config options and restore them in tearDown(). Update all TestCases in `test/inductor/test_combo_kernels.py` to use that pattern.
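
A minimal sketch of the `contextlib.ExitStack` pattern mentioned above, using a plain dict as a hypothetical stand-in for the global config (the real tests patch Inductor config options, which are not shown here):

```python
import contextlib
import unittest
from unittest import mock

# Hypothetical stand-in for a global config module.
CONFIG = {"combo_kernels": False}


class ComboKernelTests(unittest.TestCase):
    def setUp(self):
        self._stack = contextlib.ExitStack()
        # Patch the option for this test only; the stack remembers how to undo it.
        self._stack.enter_context(mock.patch.dict(CONFIG, {"combo_kernels": True}))

    def tearDown(self):
        # Restores every patched option, so later tests see the original values.
        self._stack.close()

    def test_option_is_enabled(self):
        self.assertTrue(CONFIG["combo_kernels"])


suite = unittest.defaultTestLoader.loadTestsFromTestCase(ComboKernelTests)
result = unittest.TextTestRunner().run(suite)
print(CONFIG["combo_kernels"])  # the original value is back after the test run
```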

Test Plan:
```
python test/inductor/test_combo_kernels.py
python test/inductor/test_cuda_cpp_wrapper.py TestCudaWrapper.test_dynamic_shapes_persistent_reduction_mixed_x_dim_cuda_cuda_wrapper TestCudaWrapper.test_randint_cuda_cuda_wrapper
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135370
Approved by: https://github.com/jansel
2024-09-10 18:43:14 +00:00
5e0788befb Migrate remaining jobs to use runner determinator (#134867)
At this point all self-hosted runner jobs should be using the runner determinator to switch between LF and Meta runners. This change updates the remaining jobs that have not yet been migrated over.

Issue: https://lf-pytorch.atlassian.net/browse/PC-25

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134867
Approved by: https://github.com/ZainRizvi
2024-09-10 18:14:00 +00:00
440f8f57af Revert "[fx] Bypass custom __setattr__ in Node.__init__ (#135079)" (#135562)
This reverts commit 66da3b3b2acacb116a9b23e91b24934830eaf6b8.

#135079 breaks internal tests and needs to be reverted. Reverting with mergebot does not work because this PR is technically part of a stack, but, according to @jansel, it should be possible to revert it individually.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135562
Approved by: https://github.com/jansel, https://github.com/seemethere
2024-09-10 18:07:11 +00:00
e004d539da [Partitioner] Reuse partition to check whether nodes exist (#135317)
Checking whether a node is in a NodeList takes O(n) time. Reuse the partition to speed this up, since partition.nodes is a hash table and contains the same elements.
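
The difference can be illustrated with a small Python sketch. The `Partition` class and both helper functions below are hypothetical simplifications, not the partitioner's real code; they only contrast a linear list scan with a hash lookup over the same elements.

```python
class Partition:
    """Hypothetical minimal partition: nodes are stored as dict keys (an ordered set)."""

    def __init__(self, nodes):
        self.nodes = dict.fromkeys(nodes)


def users_outside_slow(node_list, users):
    # Each `in` on a list scans it linearly: O(len(node_list)) per user.
    return [u for u in users if u not in node_list]


def users_outside_fast(partition, users):
    # Each `in` on a dict is a hash lookup: O(1) on average per user.
    return [u for u in users if u not in partition.nodes]


node_list = ["a", "b", "c"]
partition = Partition(node_list)
users = ["b", "d", "c", "e"]
slow_result = users_outside_slow(node_list, users)
fast_result = users_outside_fast(partition, users)
```

Both functions return the same users; only the cost of each membership check differs.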

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135317
Approved by: https://github.com/ezyang
2024-09-10 17:45:29 +00:00
c4b84a46a9 Add more logging to TunableOp validators (#135396)
Summary: Add more logging to TunableOp validators

Test Plan:
Verified additional logging when loading kernel selections:
```
ROCBLAS_VERSION validation: expect 4.0.0-72e57364-dirty to match 4.0.0-72e57364-dirty
GCN_ARCH_NAME validation: expect gfx942:sramecc+:xnack- to match gfx942:sramecc+:xnack-
HIPBLASLT_VERSION validation: expect 800-a15e4178 to match 800-a15e4178
ROCM_VERSION validation: expect 6.0.0.0-12969-1544e39 to match 6.0.0.0-12969-1544e39
PT_VERSION validation: expect 2.5.0 to match 2.5.0
```

```
[qizixi@devgpu039.atn3 /data/users/qizixi/fbsource/fbcode (f9305317d|remote/master)]$ PYTORCH_TUNABLEOP_VERBOSE=1 buck2 run mode/{opt,amd-gpu} -c fbcode.enable_gpu_sections=true //scripts/xdwang/example:fc_llama -- --enable-tuning
File changed: fbcode//hipblas_tuning_pt_llama0.csv
Buck UI: https://www.internalfb.com/buck2/1ed2fac4-743e-49ef-805f-7fb6b9300022
Network: Up: 0B  Down: 0B
Jobs completed: 4189. Time elapsed: 0.2s.
BUILD SUCCEEDED
Enabled tuning
- Run Linear (matmul) 2 x 1280 x 8192, dtype = torch.bfloat16
INFO:2024-09-06 14:38:07 2834864:2835138 CuptiActivityProfiler.cpp:260] HIP versions. Roctracer: 4.1; Runtime: 60032830; Driver: 60032830
INFO:2024-09-06 14:38:07 2834864:2836083 DynoConfigLoader.cpp:61] Setting communication fabric enabled = 0
reading tuning results from hipblas_tuning_pt_llama0.csv
Validator PT_VERSION=2.5.0
Validator ROCM_VERSION=6.0.0.0-12969-1544e39
Validator HIPBLASLT_VERSION=800-a15e4178
Validator GCN_ARCH_NAME=gfx942:sramecc+:xnack-
Validator ROCBLAS_VERSION=4.0.0-72e57364-dirty
ROCBLAS_VERSION validation: expect 4.0.0-72e57364-dirty to match 4.0.0-72e57364-dirty
GCN_ARCH_NAME validation: expect gfx942:sramecc+:xnack- to match gfx942:sramecc+:xnack-
HIPBLASLT_VERSION validation: expect 800-a15e4178 to match 800-a15e4178
ROCM_VERSION validation: expect 6.0.0.0-12969-1544e39 to match 6.0.0.0-12969-1544e39
PT_VERSION validation: expect 2.5.0 to match 2.5.0
Loading results
Avg time: 13.165860176086426 us, Achieved 3.19 TFLOPS, 1598.24 GB/s

- Run Linear (matmul) 2 x 8192 x 1024, dtype = torch.bfloat16
Avg time: 13.230760097503662 us, Achieved 2.54 TFLOPS, 1271.14 GB/s

- Run Linear (matmul) 2 x 7168 x 8192, dtype = torch.bfloat16
Avg time: 26.804399490356445 us, Achieved 8.76 TFLOPS, 4384.90 GB/s

- Run Linear (matmul) 2 x 8192 x 3584, dtype = torch.bfloat16
Avg time: 13.407809734344482 us, Achieved 8.76 TFLOPS, 4384.14 GB/s

2x1280x8192-torch.bfloat16,13.165860176086426,3.18574247630113,1598.237845349412
2x8192x1024-torch.bfloat16,13.230760097503662,2.536092541374924,1271.1420867780075
2x7168x8192-torch.bfloat16,26.804399490356445,8.762778814892096,4384.9040543618985
2x8192x3584-torch.bfloat16,13.407809734344482,8.759112362638383,4384.138585247748
```

Reviewed By: leitian

Differential Revision: D62322830

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135396
Approved by: https://github.com/eqy
2024-09-10 17:20:59 +00:00
cyy
bc1b8f094d Check function declarations of Core ML code (#135467)
Relax the restrictions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135467
Approved by: https://github.com/ezyang
2024-09-10 16:05:22 +00:00
f65a564fa2 [inductor] Flip custom_op_default_layout_constraint (#135239)
By default, Inductor should respect the stride order of input Tensors to
custom operators.

Test Plan:
- new tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135239
Approved by: https://github.com/albanD
ghstack dependencies: #135391
2024-09-10 14:27:43 +00:00
386b313028 Handle KeyError for compiler collective in scalars too (#135385)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135385
Approved by: https://github.com/jansel
2024-09-10 12:33:04 +00:00
6d7cbc20d2 Add dynamo itertools.pairwise support (#135416)
Fixes #133766

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135416
Approved by: https://github.com/XuehaiPan, https://github.com/jansel

Co-authored-by: Xuehai Pan <XuehaiPan@pku.edu.cn>
2024-09-10 11:37:59 +00:00
ca16956b20 [Inductor] Generalize device guard codegen for cpp_wrapper mode. (#134761)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134761
Approved by: https://github.com/jansel, https://github.com/EikanWang
ghstack dependencies: #134693
2024-09-10 10:11:52 +00:00
67735d1ee8 [Inductor] Generalize is_cuda to specific device_type to make cpp_wrapper mode be extensible (#134693)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134693
Approved by: https://github.com/ezyang, https://github.com/EikanWang, https://github.com/jansel
2024-09-10 10:11:13 +00:00
6e13f5eb38 [FlexAttention] Add broadcast support for kv batch dimension (#135505)
This PR adds broadcast support for KV batch dimension.

## Details
Consider Q of shape `[Bq, Hq, Q_LEN, D]`, and K, V of shape `[Bkv, Hkv, KV_LEN, D]`. Prior to this diff, we required `Bq == Bkv`. However, for some use cases, we may have `Bkv < Bq`. For example, in paged attention, we provide K, V of shape `[1, Hkv, MAX_LEN, D]`, while still providing Q of shape `[Bq, Hq, Q_LEN, D]`. Here, MAX_LEN is the maximum number of tokens supported by paged attention.

This PR relaxes this requirement to `Bq == Bkv or (Bq > 1 and Bkv == 1)`. This support covers flex decoding as well as flex attention forward and backward.
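
As a rough illustration of the relaxed check, consider the paged-attention case above, where K and V have a batch dimension of 1 and every query batch reads that single entry. The function names below are hypothetical and are not part of the FlexAttention API; this is a sketch of the condition, not the kernel code.

```python
def kv_batch_is_compatible(bq: int, bkv: int) -> bool:
    """Either the batch sizes match, or a single K/V batch is broadcast over Bq > 1."""
    return bq == bkv or (bq > 1 and bkv == 1)


def kv_batch_index(q_batch_idx: int, bkv: int) -> int:
    """Map a query batch index to the K/V batch entry it should read."""
    # With a single K/V batch entry, every query batch reads entry 0.
    return 0 if bkv == 1 else q_batch_idx
```

With `Bq = 4`, a `Bkv` of 4 or 1 is accepted, while a `Bkv` of 2 is rejected because it neither matches `Bq` nor broadcasts from a single entry.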

## Benchmark
GPU: H100

We see a negligible (1% to 2%) performance change from this PR when `Bq == Bkv`.

```
python benchmarks/transformer/score_mod.py --calculate-bwd
```
### Perf before this PR

**FWD**

| Type    |   Speedup | score_mod     | mask_mod   | dtype          | shape(B,Hq,M,Hkv,N,D)        |
|---------|-----------|---------------|------------|----------------|------------------------------|
| Average |     0.743 |               |            |                |                              |
| Max     |     0.955 | head_bias     | None       | torch.bfloat16 | (2, 16, 1024, 2, 1024, 64)   |
| Min     |     0.548 | relative_bias | None       | torch.bfloat16 | (16, 16, 1024, 2, 1024, 128) |

**BWD**

| Type    |   Speedup | score_mod   | mask_mod   | dtype          | shape(B,Hq,M,Hkv,N,D)       |
|---------|-----------|-------------|------------|----------------|-----------------------------|
| Average |     0.834 |             |            |                |                             |
| Max     |     1.261 | head_bias   | None       | torch.bfloat16 | (8, 16, 512, 16, 512, 64)   |
| Min     |     0.456 | None        | causal     | torch.bfloat16 | (2, 16, 1024, 2, 1024, 128) |

<details>
<summary> Full performance sweep </summary>

| score_mod     | mask_mod   | dtype          | shape(B,Hq,M,Hkv,N,D)         |   fwd_eager_time |   fwd_compiled_time |   bwd_eager_time |   bwd_compiled_time |   fwd_speedup |   bwd_speedup |
|---------------|------------|----------------|-------------------------------|------------------|---------------------|------------------|---------------------|---------------|---------------|
| None          | None       | torch.bfloat16 | (2, 16, 512, 16, 512, 64)     |           15.264 |              17.184 |          107.040 |             140.800 |         0.888 |         0.760 |
| None          | causal     | torch.bfloat16 | (2, 16, 512, 16, 512, 64)     |           15.840 |              19.744 |          112.576 |             140.064 |         0.802 |         0.804 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 512, 16, 512, 64)     |           15.232 |              17.344 |           87.744 |             142.496 |         0.878 |         0.616 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 512, 16, 512, 64)     |           15.264 |              17.184 |          108.192 |             143.328 |         0.888 |         0.755 |
| None          | None       | torch.bfloat16 | (2, 16, 512, 16, 512, 128)    |           19.904 |              22.400 |          106.432 |             136.512 |         0.889 |         0.780 |
| None          | causal     | torch.bfloat16 | (2, 16, 512, 16, 512, 128)    |           19.424 |              26.752 |           91.712 |             106.688 |         0.726 |         0.860 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 512, 16, 512, 128)    |           19.808 |              22.432 |           89.024 |             101.920 |         0.883 |         0.873 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 512, 16, 512, 128)    |           19.840 |              22.272 |           88.896 |             102.592 |         0.891 |         0.867 |
| None          | None       | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64)   |           30.240 |              32.416 |          116.768 |             112.256 |         0.933 |         1.040 |
| None          | causal     | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64)   |           29.536 |              37.024 |          113.664 |             102.688 |         0.798 |         1.107 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64)   |           30.656 |              32.800 |          116.992 |             127.008 |         0.935 |         0.921 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64)   |           30.592 |              32.480 |          116.928 |             112.160 |         0.942 |         1.043 |
| None          | None       | torch.bfloat16 | (2, 16, 1024, 16, 1024, 128)  |           40.448 |              61.920 |          198.656 |             204.512 |         0.653 |         0.971 |
| None          | causal     | torch.bfloat16 | (2, 16, 1024, 16, 1024, 128)  |           37.760 |              62.528 |          189.536 |             170.624 |         0.604 |         1.111 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 1024, 16, 1024, 128)  |           40.896 |              62.368 |          198.304 |             205.824 |         0.656 |         0.963 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 1024, 16, 1024, 128)  |           40.448 |              61.952 |          198.432 |             203.648 |         0.653 |         0.974 |
| None          | None       | torch.bfloat16 | (2, 16, 4096, 16, 4096, 64)   |          318.528 |             355.904 |          947.232 |            1162.496 |         0.895 |         0.815 |
| None          | causal     | torch.bfloat16 | (2, 16, 4096, 16, 4096, 64)   |          199.776 |             252.128 |          677.792 |             813.184 |         0.792 |         0.834 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 4096, 16, 4096, 64)   |          316.512 |             363.328 |          947.712 |            1361.984 |         0.871 |         0.696 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 4096, 16, 4096, 64)   |          317.984 |             356.864 |          947.264 |            1165.024 |         0.891 |         0.813 |
| None          | None       | torch.bfloat16 | (2, 16, 4096, 16, 4096, 128)  |          446.656 |             734.656 |         1664.288 |            2172.960 |         0.608 |         0.766 |
| None          | causal     | torch.bfloat16 | (2, 16, 4096, 16, 4096, 128)  |          278.688 |             467.648 |         1182.624 |            1339.296 |         0.596 |         0.883 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 4096, 16, 4096, 128)  |          447.872 |             744.096 |         1662.944 |            2196.544 |         0.602 |         0.757 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 4096, 16, 4096, 128)  |          448.128 |             732.928 |         1663.072 |            2156.800 |         0.611 |         0.771 |
| None          | None       | torch.bfloat16 | (2, 16, 512, 2, 512, 64)      |           15.648 |              16.640 |          107.520 |             143.008 |         0.940 |         0.752 |
| None          | causal     | torch.bfloat16 | (2, 16, 512, 2, 512, 64)      |           15.776 |              18.240 |          129.056 |             141.920 |         0.865 |         0.909 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 512, 2, 512, 64)      |           15.168 |              16.640 |          103.616 |             139.648 |         0.912 |         0.742 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 512, 2, 512, 64)      |           15.616 |              16.640 |          128.608 |             164.448 |         0.938 |         0.782 |
| None          | None       | torch.bfloat16 | (2, 16, 512, 2, 512, 128)     |           19.776 |              21.952 |          125.344 |             170.304 |         0.901 |         0.736 |
| None          | causal     | torch.bfloat16 | (2, 16, 512, 2, 512, 128)     |           19.776 |              23.712 |          104.288 |             196.896 |         0.834 |         0.530 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 512, 2, 512, 128)     |           19.072 |              21.952 |          102.080 |             177.056 |         0.869 |         0.577 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 512, 2, 512, 128)     |           19.648 |              21.920 |          109.920 |             170.848 |         0.896 |         0.643 |
| None          | None       | torch.bfloat16 | (2, 16, 1024, 2, 1024, 64)    |           30.464 |              31.936 |          127.808 |             228.832 |         0.954 |         0.559 |
| None          | causal     | torch.bfloat16 | (2, 16, 1024, 2, 1024, 64)    |           29.472 |              33.856 |          113.152 |             215.072 |         0.871 |         0.526 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 1024, 2, 1024, 64)    |           30.496 |              32.160 |          116.576 |             231.744 |         0.948 |         0.503 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 1024, 2, 1024, 64)    |           30.464 |              31.904 |          116.320 |             229.824 |         0.955 |         0.506 |
| None          | None       | torch.bfloat16 | (2, 16, 1024, 2, 1024, 128)   |           40.480 |              61.440 |          176.448 |             345.312 |         0.659 |         0.511 |
| None          | causal     | torch.bfloat16 | (2, 16, 1024, 2, 1024, 128)   |           38.304 |              59.424 |          169.312 |             371.360 |         0.645 |         0.456 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 1024, 2, 1024, 128)   |           40.960 |              61.760 |          176.512 |             358.912 |         0.663 |         0.492 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 1024, 2, 1024, 128)   |           40.352 |              61.696 |          176.512 |             344.928 |         0.654 |         0.512 |
| None          | None       | torch.bfloat16 | (2, 16, 4096, 2, 4096, 64)    |          316.224 |             357.728 |          905.728 |            1668.448 |         0.884 |         0.543 |
| None          | causal     | torch.bfloat16 | (2, 16, 4096, 2, 4096, 64)    |          199.904 |             248.416 |          636.544 |            1109.088 |         0.805 |         0.574 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 4096, 2, 4096, 64)    |          314.880 |             363.616 |          906.304 |            1658.176 |         0.866 |         0.547 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 4096, 2, 4096, 64)    |          316.160 |             354.368 |          906.080 |            1649.024 |         0.892 |         0.549 |
| None          | None       | torch.bfloat16 | (2, 16, 4096, 2, 4096, 128)   |          446.912 |             739.840 |         1555.808 |            2521.952 |         0.604 |         0.617 |
| None          | causal     | torch.bfloat16 | (2, 16, 4096, 2, 4096, 128)   |          279.776 |             463.904 |         1068.928 |            1849.888 |         0.603 |         0.578 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 4096, 2, 4096, 128)   |          446.080 |             748.960 |         1553.504 |            2629.888 |         0.596 |         0.591 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 4096, 2, 4096, 128)   |          446.208 |             740.608 |         1558.880 |            2524.960 |         0.602 |         0.617 |
| None          | None       | torch.bfloat16 | (8, 16, 512, 16, 512, 64)     |           33.568 |              41.280 |          170.016 |             147.584 |         0.813 |         1.152 |
| None          | causal     | torch.bfloat16 | (8, 16, 512, 16, 512, 64)     |           30.688 |              43.040 |          159.552 |             146.720 |         0.713 |         1.087 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 512, 16, 512, 64)     |           34.112 |              41.504 |          170.112 |             152.672 |         0.822 |         1.114 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 512, 16, 512, 64)     |           34.240 |              41.152 |          170.272 |             134.976 |         0.832 |         1.261 |
| None          | None       | torch.bfloat16 | (8, 16, 512, 16, 512, 128)    |           48.672 |              76.416 |          295.296 |             263.648 |         0.637 |         1.120 |
| None          | causal     | torch.bfloat16 | (8, 16, 512, 16, 512, 128)    |           45.088 |              72.576 |          281.920 |             237.664 |         0.621 |         1.186 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 512, 16, 512, 128)    |           48.032 |              76.672 |          295.520 |             265.248 |         0.626 |         1.114 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 512, 16, 512, 128)    |           48.096 |              76.096 |          295.456 |             262.112 |         0.632 |         1.127 |
| None          | None       | torch.bfloat16 | (8, 16, 1024, 16, 1024, 64)   |           93.920 |             111.232 |          401.568 |             382.944 |         0.844 |         1.049 |
| None          | causal     | torch.bfloat16 | (8, 16, 1024, 16, 1024, 64)   |           68.192 |              95.232 |          338.752 |             326.816 |         0.716 |         1.037 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 1024, 16, 1024, 64)   |           93.984 |             111.840 |          401.856 |             444.224 |         0.840 |         0.905 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 1024, 16, 1024, 64)   |           94.176 |             110.496 |          401.600 |             383.136 |         0.852 |         1.048 |
| None          | None       | torch.bfloat16 | (8, 16, 1024, 16, 1024, 128)  |          131.488 |             227.040 |          727.424 |             739.712 |         0.579 |         0.983 |
| None          | causal     | torch.bfloat16 | (8, 16, 1024, 16, 1024, 128)  |           95.616 |             169.760 |          616.864 |             574.112 |         0.563 |         1.074 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 1024, 16, 1024, 128)  |          131.680 |             228.672 |          727.616 |             746.048 |         0.576 |         0.975 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 1024, 16, 1024, 128)  |          131.104 |             225.696 |          727.904 |             735.392 |         0.581 |         0.990 |
| None          | None       | torch.bfloat16 | (8, 16, 4096, 16, 4096, 64)   |         1227.296 |            1386.656 |         3720.192 |            4539.904 |         0.885 |         0.819 |
| None          | causal     | torch.bfloat16 | (8, 16, 4096, 16, 4096, 64)   |          691.360 |             831.712 |         2515.872 |            3067.808 |         0.831 |         0.820 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 4096, 16, 4096, 64)   |         1228.192 |            1403.136 |         3715.520 |            5309.280 |         0.875 |         0.700 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 4096, 16, 4096, 64)   |         1229.024 |            1384.992 |         3715.904 |            4550.368 |         0.887 |         0.817 |
| None          | None       | torch.bfloat16 | (8, 16, 4096, 16, 4096, 128)  |         1784.832 |            2865.888 |         6539.840 |            8460.224 |         0.623 |         0.773 |
| None          | causal     | torch.bfloat16 | (8, 16, 4096, 16, 4096, 128)  |         1017.408 |            1660.480 |         4369.824 |            5056.992 |         0.613 |         0.864 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 4096, 16, 4096, 128)  |         1792.448 |            2904.864 |         6546.080 |            8537.024 |         0.617 |         0.767 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 4096, 16, 4096, 128)  |         1795.552 |            2856.864 |         6544.672 |            8400.160 |         0.629 |         0.779 |
| None          | None       | torch.bfloat16 | (8, 16, 512, 2, 512, 64)      |           34.240 |              38.880 |          148.832 |             179.936 |         0.881 |         0.827 |
| None          | causal     | torch.bfloat16 | (8, 16, 512, 2, 512, 64)      |           31.168 |              38.080 |          138.528 |             167.552 |         0.818 |         0.827 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 512, 2, 512, 64)      |           34.240 |              39.168 |          148.512 |             181.248 |         0.874 |         0.819 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 512, 2, 512, 64)      |           34.240 |              38.784 |          148.864 |             180.224 |         0.883 |         0.826 |
| None          | None       | torch.bfloat16 | (8, 16, 512, 2, 512, 128)     |           48.832 |              76.352 |          253.632 |             295.968 |         0.640 |         0.857 |
| None          | causal     | torch.bfloat16 | (8, 16, 512, 2, 512, 128)     |           45.760 |              65.792 |          239.040 |             290.752 |         0.696 |         0.822 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 512, 2, 512, 128)     |           48.768 |              76.576 |          253.312 |             304.032 |         0.637 |         0.833 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 512, 2, 512, 128)     |           48.768 |              76.192 |          253.600 |             296.096 |         0.640 |         0.856 |
| None          | None       | torch.bfloat16 | (8, 16, 1024, 2, 1024, 64)    |           93.728 |             109.728 |          357.696 |             498.912 |         0.854 |         0.717 |
| None          | causal     | torch.bfloat16 | (8, 16, 1024, 2, 1024, 64)    |           68.704 |              92.288 |          295.616 |             386.240 |         0.744 |         0.765 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 1024, 2, 1024, 64)    |           93.632 |             111.392 |          357.408 |             512.448 |         0.841 |         0.697 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 1024, 2, 1024, 64)    |           93.280 |             109.952 |          357.696 |             501.440 |         0.848 |         0.713 |
| None          | None       | torch.bfloat16 | (8, 16, 1024, 2, 1024, 128)   |          131.392 |             230.496 |          612.224 |             807.552 |         0.570 |         0.758 |
| None          | causal     | torch.bfloat16 | (8, 16, 1024, 2, 1024, 128)   |           96.512 |             165.184 |          502.624 |             672.384 |         0.584 |         0.748 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 1024, 2, 1024, 128)   |          131.360 |             232.608 |          612.064 |             832.320 |         0.565 |         0.735 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 1024, 2, 1024, 128)   |          131.008 |             230.528 |          612.640 |             804.320 |         0.568 |         0.762 |
| None          | None       | torch.bfloat16 | (8, 16, 4096, 2, 4096, 64)    |         1227.968 |            1377.408 |         3477.920 |            5324.384 |         0.892 |         0.653 |
| None          | causal     | torch.bfloat16 | (8, 16, 4096, 2, 4096, 64)    |          695.264 |             824.544 |         2268.224 |            3210.208 |         0.843 |         0.707 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 4096, 2, 4096, 64)    |         1228.640 |            1404.576 |         3476.832 |            5463.456 |         0.875 |         0.636 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 4096, 2, 4096, 64)    |         1228.416 |            1378.752 |         3478.048 |            5367.712 |         0.891 |         0.648 |
| None          | None       | torch.bfloat16 | (8, 16, 4096, 2, 4096, 128)   |         1788.736 |            2867.712 |         6039.520 |            8616.256 |         0.624 |         0.701 |
| None          | causal     | torch.bfloat16 | (8, 16, 4096, 2, 4096, 128)   |         1021.952 |            1653.824 |         3866.208 |            5306.848 |         0.618 |         0.729 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 4096, 2, 4096, 128)   |         1786.752 |            2896.352 |         6044.128 |            8871.360 |         0.617 |         0.681 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 4096, 2, 4096, 128)   |         1786.080 |            2868.672 |         6040.160 |            8550.144 |         0.623 |         0.706 |
| None          | None       | torch.bfloat16 | (16, 16, 512, 16, 512, 64)    |           57.504 |              71.552 |          312.768 |             255.040 |         0.804 |         1.226 |
| None          | causal     | torch.bfloat16 | (16, 16, 512, 16, 512, 64)    |           49.472 |              71.104 |          285.696 |             243.520 |         0.696 |         1.173 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 512, 16, 512, 64)    |           58.112 |              72.896 |          312.768 |             288.256 |         0.797 |         1.085 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 512, 16, 512, 64)    |           57.952 |              71.680 |          312.768 |             255.552 |         0.808 |         1.224 |
| None          | None       | torch.bfloat16 | (16, 16, 512, 16, 512, 128)   |           82.336 |             144.256 |          580.128 |             500.160 |         0.571 |         1.160 |
| None          | causal     | torch.bfloat16 | (16, 16, 512, 16, 512, 128)   |           76.160 |             123.712 |          552.544 |             447.648 |         0.616 |         1.234 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 512, 16, 512, 128)   |           82.400 |             145.184 |          580.032 |             504.032 |         0.568 |         1.151 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 512, 16, 512, 128)   |           82.368 |             143.904 |          580.192 |             499.936 |         0.572 |         1.161 |
| None          | None       | torch.bfloat16 | (16, 16, 1024, 16, 1024, 64)  |          177.216 |             209.568 |          787.872 |             747.712 |         0.846 |         1.054 |
| None          | causal     | torch.bfloat16 | (16, 16, 1024, 16, 1024, 64)  |          121.984 |             168.256 |          651.968 |             628.256 |         0.725 |         1.038 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 1024, 16, 1024, 64)  |          177.088 |             211.488 |          788.320 |             864.352 |         0.837 |         0.912 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 1024, 16, 1024, 64)  |          177.440 |             208.576 |          787.424 |             749.120 |         0.851 |         1.051 |
| None          | None       | torch.bfloat16 | (16, 16, 1024, 16, 1024, 128) |          249.472 |             441.376 |         1405.440 |            1431.648 |         0.565 |         0.982 |
| None          | causal     | torch.bfloat16 | (16, 16, 1024, 16, 1024, 128) |          172.960 |             312.064 |         1172.064 |            1096.448 |         0.554 |         1.069 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 1024, 16, 1024, 128) |          249.632 |             446.336 |         1405.408 |            1448.480 |         0.559 |         0.970 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 1024, 16, 1024, 128) |          250.944 |             440.128 |         1406.624 |            1421.952 |         0.570 |         0.989 |
| None          | None       | torch.bfloat16 | (16, 16, 4096, 16, 4096, 64)  |         2418.720 |            2747.936 |         7330.432 |            9023.712 |         0.880 |         0.812 |
| None          | causal     | torch.bfloat16 | (16, 16, 4096, 16, 4096, 64)  |         1353.696 |            1608.480 |         4941.696 |            6078.752 |         0.842 |         0.813 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 4096, 16, 4096, 64)  |         2427.456 |            2746.816 |         7329.792 |           10539.968 |         0.884 |         0.695 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 4096, 16, 4096, 64)  |         2426.688 |            2763.168 |         7336.256 |            9057.536 |         0.878 |         0.810 |
| None          | None       | torch.bfloat16 | (16, 16, 4096, 16, 4096, 128) |         3554.240 |            5634.400 |        12919.872 |           16843.489 |         0.631 |         0.767 |
| None          | causal     | torch.bfloat16 | (16, 16, 4096, 16, 4096, 128) |         2003.648 |            3250.784 |         8610.144 |           10015.424 |         0.616 |         0.860 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 4096, 16, 4096, 128) |         3582.080 |            5710.944 |        12923.328 |           17011.871 |         0.627 |         0.760 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 4096, 16, 4096, 128) |         3581.920 |            5618.144 |        12934.528 |           16745.888 |         0.638 |         0.772 |
| None          | None       | torch.bfloat16 | (16, 16, 512, 2, 512, 64)     |           57.120 |              71.232 |          269.760 |             295.680 |         0.802 |         0.912 |
| None          | causal     | torch.bfloat16 | (16, 16, 512, 2, 512, 64)     |           49.408 |              65.312 |          242.304 |             253.952 |         0.756 |         0.954 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 512, 2, 512, 64)     |           57.504 |              72.544 |          269.632 |             298.976 |         0.793 |         0.902 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 512, 2, 512, 64)     |           57.760 |              71.040 |          269.600 |             296.640 |         0.813 |         0.909 |
| None          | None       | torch.bfloat16 | (16, 16, 512, 2, 512, 128)    |           82.336 |             147.168 |          466.080 |             487.456 |         0.559 |         0.956 |
| None          | causal     | torch.bfloat16 | (16, 16, 512, 2, 512, 128)    |           76.704 |             115.040 |          435.392 |             453.248 |         0.667 |         0.961 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 512, 2, 512, 128)    |           81.856 |             147.424 |          465.920 |             499.552 |         0.555 |         0.933 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 512, 2, 512, 128)    |           81.760 |             146.656 |          466.176 |             485.984 |         0.557 |         0.959 |
| None          | None       | torch.bfloat16 | (16, 16, 1024, 2, 1024, 64)   |          176.608 |             206.976 |          678.080 |             866.976 |         0.853 |         0.782 |
| None          | causal     | torch.bfloat16 | (16, 16, 1024, 2, 1024, 64)   |          121.664 |             164.768 |          538.240 |             636.160 |         0.738 |         0.846 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 1024, 2, 1024, 64)   |          176.608 |             209.664 |          677.696 |             883.424 |         0.842 |         0.767 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 1024, 2, 1024, 64)   |          177.440 |             207.840 |          677.248 |             868.288 |         0.854 |         0.780 |
| None          | None       | torch.bfloat16 | (16, 16, 1024, 2, 1024, 128)  |          250.272 |             449.536 |         1163.424 |            1420.832 |         0.557 |         0.819 |
| None          | causal     | torch.bfloat16 | (16, 16, 1024, 2, 1024, 128)  |          173.472 |             305.376 |          929.408 |            1104.544 |         0.568 |         0.841 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 1024, 2, 1024, 128)  |          249.376 |             454.976 |         1163.648 |            1455.296 |         0.548 |         0.800 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 1024, 2, 1024, 128)  |          250.368 |             450.144 |         1163.520 |            1409.984 |         0.556 |         0.825 |
| None          | None       | torch.bfloat16 | (16, 16, 4096, 2, 4096, 64)   |         2416.576 |            2726.208 |         6835.520 |           10442.784 |         0.886 |         0.655 |
| None          | causal     | torch.bfloat16 | (16, 16, 4096, 2, 4096, 64)   |         1357.440 |            1590.752 |         4433.664 |            5975.296 |         0.853 |         0.742 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 4096, 2, 4096, 64)   |         2427.360 |            2747.040 |         6853.056 |           10670.784 |         0.884 |         0.642 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 4096, 2, 4096, 64)   |         2441.120 |            2718.944 |         6836.640 |           10433.792 |         0.898 |         0.655 |
| None          | None       | torch.bfloat16 | (16, 16, 4096, 2, 4096, 128)  |         3555.392 |            5620.960 |        11944.000 |           16504.801 |         0.633 |         0.724 |
| None          | causal     | torch.bfloat16 | (16, 16, 4096, 2, 4096, 128)  |         2010.848 |            3241.152 |         7636.064 |            9870.464 |         0.620 |         0.774 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 4096, 2, 4096, 128)  |         3557.440 |            5688.352 |        11935.744 |           17090.496 |         0.625 |         0.698 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 4096, 2, 4096, 128)  |         3562.720 |            5630.432 |        11939.168 |           16392.033 |         0.633 |         0.728 |

</details>

### Perf after this PR

**FWD**

| Type    |   Speedup | score_mod     | mask_mod   | dtype          | shape(B,Hq,M,Hkv,N,D)      |
|---------|-----------|---------------|------------|----------------|----------------------------|
| Average |     0.776 |               |            |                |                            |
| Max     |     1.006 | None          | None       | torch.bfloat16 | (2, 16, 1024, 2, 1024, 64) |
| Min     |     0.566 | relative_bias | None       | torch.bfloat16 | (16, 16, 512, 2, 512, 128) |

**BWD**

| Type    |   Speedup | score_mod   | mask_mod   | dtype          | shape(B,Hq,M,Hkv,N,D)       |
|---------|-----------|-------------|------------|----------------|-----------------------------|
| Average |     0.817 |             |            |                |                             |
| Max     |     1.150 | None        | causal     | torch.bfloat16 | (16, 16, 512, 16, 512, 128) |
| Min     |     0.454 | None        | causal     | torch.bfloat16 | (2, 16, 1024, 2, 1024, 128) |

<details>
<summary> Full performance sweep </summary>

| score_mod     | mask_mod   | dtype          | shape(B,Hq,M,Hkv,N,D)         |   fwd_eager_time |   fwd_compiled_time |   bwd_eager_time |   bwd_compiled_time |   fwd_speedup |   bwd_speedup |
|---------------|------------|----------------|-------------------------------|------------------|---------------------|------------------|---------------------|---------------|---------------|
| None          | None       | torch.bfloat16 | (2, 16, 512, 16, 512, 64)     |           15.680 |              17.056 |           64.544 |              73.376 |         0.919 |         0.880 |
| None          | causal     | torch.bfloat16 | (2, 16, 512, 16, 512, 64)     |           15.712 |              19.872 |           65.408 |              72.864 |         0.791 |         0.898 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 512, 16, 512, 64)     |           16.160 |              17.280 |           64.896 |              73.888 |         0.935 |         0.878 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 512, 16, 512, 64)     |           16.192 |              17.120 |           64.896 |              75.424 |         0.946 |         0.860 |
| None          | None       | torch.bfloat16 | (2, 16, 512, 16, 512, 128)    |           19.648 |              22.496 |           89.184 |              82.592 |         0.873 |         1.080 |
| None          | causal     | torch.bfloat16 | (2, 16, 512, 16, 512, 128)    |           20.320 |              26.816 |           91.264 |              82.880 |         0.758 |         1.101 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 512, 16, 512, 128)    |           20.096 |              22.528 |           89.184 |              83.776 |         0.892 |         1.065 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 512, 16, 512, 128)    |           19.680 |              22.432 |           89.184 |             120.096 |         0.877 |         0.743 |
| None          | None       | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64)   |           32.384 |              32.512 |          119.232 |             128.960 |         0.996 |         0.925 |
| None          | causal     | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64)   |           30.176 |              37.248 |          113.664 |             119.520 |         0.810 |         0.951 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64)   |           32.512 |              32.928 |          119.264 |             131.456 |         0.987 |         0.907 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64)   |           32.448 |              32.704 |          119.200 |             128.352 |         0.992 |         0.929 |
| None          | None       | torch.bfloat16 | (2, 16, 1024, 16, 1024, 128)  |           41.952 |              62.176 |          199.040 |             214.304 |         0.675 |         0.929 |
| None          | causal     | torch.bfloat16 | (2, 16, 1024, 16, 1024, 128)  |           39.744 |              62.880 |          189.504 |             179.968 |         0.632 |         1.053 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 1024, 16, 1024, 128)  |           41.472 |              62.784 |          199.136 |             217.664 |         0.661 |         0.915 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 1024, 16, 1024, 128)  |           42.048 |              61.952 |          199.168 |             214.496 |         0.679 |         0.929 |
| None          | None       | torch.bfloat16 | (2, 16, 4096, 16, 4096, 64)   |          341.184 |             357.632 |          980.256 |            1328.896 |         0.954 |         0.738 |
| None          | causal     | torch.bfloat16 | (2, 16, 4096, 16, 4096, 64)   |          212.576 |             252.960 |          673.888 |             824.864 |         0.840 |         0.817 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 4096, 16, 4096, 64)   |          340.000 |             363.296 |          980.768 |            1375.808 |         0.936 |         0.713 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 4096, 16, 4096, 64)   |          340.768 |             356.832 |          980.960 |            1326.272 |         0.955 |         0.740 |
| None          | None       | torch.bfloat16 | (2, 16, 4096, 16, 4096, 128)  |          459.392 |             737.120 |         1678.240 |            2205.248 |         0.623 |         0.761 |
| None          | causal     | torch.bfloat16 | (2, 16, 4096, 16, 4096, 128)  |          292.672 |             468.096 |         1178.016 |            1371.584 |         0.625 |         0.859 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 4096, 16, 4096, 128)  |          462.144 |             745.312 |         1680.000 |            2252.512 |         0.620 |         0.746 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 4096, 16, 4096, 128)  |          462.112 |             736.576 |         1679.008 |            2216.480 |         0.627 |         0.758 |
| None          | None       | torch.bfloat16 | (2, 16, 512, 2, 512, 64)      |           16.064 |              16.704 |          105.120 |             120.768 |         0.962 |         0.870 |
| None          | causal     | torch.bfloat16 | (2, 16, 512, 2, 512, 64)      |           15.552 |              18.144 |          107.136 |             121.696 |         0.857 |         0.880 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 512, 2, 512, 64)      |           16.096 |              16.768 |          102.688 |             120.864 |         0.960 |         0.850 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 512, 2, 512, 64)      |           16.032 |              16.576 |          104.736 |             124.672 |         0.967 |         0.840 |
| None          | None       | torch.bfloat16 | (2, 16, 512, 2, 512, 128)     |           19.392 |              21.952 |          104.736 |             174.656 |         0.883 |         0.600 |
| None          | causal     | torch.bfloat16 | (2, 16, 512, 2, 512, 128)     |           20.128 |              23.712 |          105.216 |             199.008 |         0.849 |         0.529 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 512, 2, 512, 128)     |           19.904 |              21.888 |          103.744 |             179.520 |         0.909 |         0.578 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 512, 2, 512, 128)     |           19.968 |              21.952 |          104.640 |             177.312 |         0.910 |         0.590 |
| None          | None       | torch.bfloat16 | (2, 16, 1024, 2, 1024, 64)    |           32.096 |              31.904 |          118.720 |             231.968 |         1.006 |         0.512 |
| None          | causal     | torch.bfloat16 | (2, 16, 1024, 2, 1024, 64)    |           30.528 |              33.952 |          112.480 |             218.304 |         0.899 |         0.515 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 1024, 2, 1024, 64)    |           32.160 |              32.224 |          118.752 |             237.312 |         0.998 |         0.500 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 1024, 2, 1024, 64)    |           32.128 |              32.032 |          118.240 |             233.120 |         1.003 |         0.507 |
| None          | None       | torch.bfloat16 | (2, 16, 1024, 2, 1024, 128)   |           41.312 |              61.280 |          177.408 |             350.688 |         0.674 |         0.506 |
| None          | causal     | torch.bfloat16 | (2, 16, 1024, 2, 1024, 128)   |           39.552 |              59.360 |          168.832 |             371.488 |         0.666 |         0.454 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 1024, 2, 1024, 128)   |           41.984 |              61.696 |          177.376 |             360.416 |         0.680 |         0.492 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 1024, 2, 1024, 128)   |           41.312 |              61.760 |          177.184 |             355.744 |         0.669 |         0.498 |
| None          | None       | torch.bfloat16 | (2, 16, 4096, 2, 4096, 64)    |          339.744 |             357.888 |          939.712 |            1665.376 |         0.949 |         0.564 |
| None          | causal     | torch.bfloat16 | (2, 16, 4096, 2, 4096, 64)    |          212.608 |             248.832 |          633.280 |            1122.848 |         0.854 |         0.564 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 4096, 2, 4096, 64)    |          339.712 |             363.232 |          940.448 |            1689.440 |         0.935 |         0.557 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 4096, 2, 4096, 64)    |          341.056 |             355.264 |          940.128 |            1641.152 |         0.960 |         0.573 |
| None          | None       | torch.bfloat16 | (2, 16, 4096, 2, 4096, 128)   |          460.736 |             741.024 |         1569.824 |            2559.552 |         0.622 |         0.613 |
| None          | causal     | torch.bfloat16 | (2, 16, 4096, 2, 4096, 128)   |          293.856 |             464.192 |         1066.240 |            1840.416 |         0.633 |         0.579 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 4096, 2, 4096, 128)   |          460.704 |             753.152 |         1570.112 |            2641.088 |         0.612 |         0.594 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 4096, 2, 4096, 128)   |          460.832 |             745.536 |         1570.144 |            2602.560 |         0.618 |         0.603 |
| None          | None       | torch.bfloat16 | (8, 16, 512, 16, 512, 64)     |           35.680 |              41.280 |          171.840 |             158.176 |         0.864 |         1.086 |
| None          | causal     | torch.bfloat16 | (8, 16, 512, 16, 512, 64)     |           31.360 |              42.976 |          158.912 |             139.264 |         0.730 |         1.141 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 512, 16, 512, 64)     |           35.168 |              41.600 |          171.648 |             161.344 |         0.845 |         1.064 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 512, 16, 512, 64)     |           35.136 |              41.152 |          171.808 |             158.336 |         0.854 |         1.085 |
| None          | None       | torch.bfloat16 | (8, 16, 512, 16, 512, 128)    |           48.832 |              76.384 |          295.680 |             277.696 |         0.639 |         1.065 |
| None          | causal     | torch.bfloat16 | (8, 16, 512, 16, 512, 128)    |           45.632 |              72.512 |          281.760 |             250.752 |         0.629 |         1.124 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 512, 16, 512, 128)    |           49.504 |              76.608 |          295.584 |             279.712 |         0.646 |         1.057 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 512, 16, 512, 128)    |           48.864 |              75.904 |          295.456 |             277.568 |         0.644 |         1.064 |
| None          | None       | torch.bfloat16 | (8, 16, 1024, 16, 1024, 64)   |           99.392 |             111.232 |          408.640 |             442.656 |         0.894 |         0.923 |
| None          | causal     | torch.bfloat16 | (8, 16, 1024, 16, 1024, 64)   |           71.392 |              95.168 |          338.784 |             341.760 |         0.750 |         0.991 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 1024, 16, 1024, 64)   |           99.808 |             112.256 |          408.608 |             456.160 |         0.889 |         0.896 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 1024, 16, 1024, 64)   |          100.032 |             110.816 |          408.512 |             444.192 |         0.903 |         0.920 |
| None          | None       | torch.bfloat16 | (8, 16, 1024, 16, 1024, 128)  |          135.040 |             226.112 |          726.880 |             774.176 |         0.597 |         0.939 |
| None          | causal     | torch.bfloat16 | (8, 16, 1024, 16, 1024, 128)  |           99.904 |             169.696 |          616.448 |             607.104 |         0.589 |         1.015 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 1024, 16, 1024, 128)  |          135.488 |             228.384 |          727.776 |             782.368 |         0.593 |         0.930 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 1024, 16, 1024, 128)  |          135.744 |             225.664 |          728.000 |             773.600 |         0.602 |         0.941 |
| None          | None       | torch.bfloat16 | (8, 16, 4096, 16, 4096, 64)   |         1324.192 |            1387.808 |         3866.944 |            5217.184 |         0.954 |         0.741 |
| None          | causal     | torch.bfloat16 | (8, 16, 4096, 16, 4096, 64)   |          738.464 |             832.608 |         2507.392 |            3146.688 |         0.887 |         0.797 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 4096, 16, 4096, 64)   |         1326.016 |            1404.256 |         3867.872 |            5382.624 |         0.944 |         0.719 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 4096, 16, 4096, 64)   |         1326.144 |            1386.688 |         3867.552 |            5203.264 |         0.956 |         0.743 |
| None          | None       | torch.bfloat16 | (8, 16, 4096, 16, 4096, 128)  |         1847.488 |            2866.336 |         6612.704 |            8597.696 |         0.645 |         0.769 |
| None          | causal     | torch.bfloat16 | (8, 16, 4096, 16, 4096, 128)  |         1066.592 |            1660.640 |         4357.696 |            5174.016 |         0.642 |         0.842 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 4096, 16, 4096, 128)  |         1850.464 |            2905.408 |         6616.928 |            8793.280 |         0.637 |         0.752 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 4096, 16, 4096, 128)  |         1848.896 |            2834.720 |         6623.872 |            8637.920 |         0.652 |         0.767 |
| None          | None       | torch.bfloat16 | (8, 16, 512, 2, 512, 64)      |           36.384 |              38.656 |          150.336 |             182.624 |         0.941 |         0.823 |
| None          | causal     | torch.bfloat16 | (8, 16, 512, 2, 512, 64)      |           31.360 |              38.112 |          137.664 |             171.840 |         0.823 |         0.801 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 512, 2, 512, 64)      |           36.608 |              39.040 |          150.528 |             183.872 |         0.938 |         0.819 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 512, 2, 512, 64)      |           36.064 |              38.656 |          150.560 |             183.520 |         0.933 |         0.820 |
| None          | None       | torch.bfloat16 | (8, 16, 512, 2, 512, 128)     |           49.344 |              76.352 |          253.920 |             301.440 |         0.646 |         0.842 |
| None          | causal     | torch.bfloat16 | (8, 16, 512, 2, 512, 128)     |           46.720 |              65.824 |          239.424 |             296.384 |         0.710 |         0.808 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 512, 2, 512, 128)     |           49.248 |              76.416 |          253.728 |             307.808 |         0.644 |         0.824 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 512, 2, 512, 128)     |           49.376 |              76.288 |          253.728 |             304.736 |         0.647 |         0.833 |
| None          | None       | torch.bfloat16 | (8, 16, 1024, 2, 1024, 64)    |           99.264 |             110.144 |          364.960 |             503.072 |         0.901 |         0.725 |
| None          | causal     | torch.bfloat16 | (8, 16, 1024, 2, 1024, 64)    |           71.136 |              92.384 |          294.432 |             393.056 |         0.770 |         0.749 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 1024, 2, 1024, 64)    |           99.200 |             111.360 |          365.152 |             512.640 |         0.891 |         0.712 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 1024, 2, 1024, 64)    |           99.264 |             110.240 |          365.088 |             504.224 |         0.900 |         0.724 |
| None          | None       | torch.bfloat16 | (8, 16, 1024, 2, 1024, 128)   |          135.680 |             230.336 |          613.472 |             816.896 |         0.589 |         0.751 |
| None          | causal     | torch.bfloat16 | (8, 16, 1024, 2, 1024, 128)   |          100.256 |             165.088 |          502.144 |             676.480 |         0.607 |         0.742 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 1024, 2, 1024, 128)   |          135.008 |             232.480 |          613.184 |             836.672 |         0.581 |         0.733 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 1024, 2, 1024, 128)   |          135.232 |             230.624 |          613.536 |             827.136 |         0.586 |         0.742 |
| None          | None       | torch.bfloat16 | (8, 16, 4096, 2, 4096, 64)    |         1324.064 |            1378.688 |         3631.808 |            5308.384 |         0.960 |         0.684 |
| None          | causal     | torch.bfloat16 | (8, 16, 4096, 2, 4096, 64)    |          731.776 |             826.688 |         2263.168 |            3241.344 |         0.885 |         0.698 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 4096, 2, 4096, 64)    |         1316.128 |            1403.200 |         3625.088 |            5550.688 |         0.938 |         0.653 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 4096, 2, 4096, 64)    |         1311.904 |            1378.880 |         3616.320 |            5353.696 |         0.951 |         0.675 |
| None          | None       | torch.bfloat16 | (8, 16, 4096, 2, 4096, 128)   |         1837.856 |            2887.392 |         6121.632 |            8586.656 |         0.637 |         0.713 |
| None          | causal     | torch.bfloat16 | (8, 16, 4096, 2, 4096, 128)   |         1066.976 |            1654.368 |         3843.136 |            5291.040 |         0.645 |         0.726 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 4096, 2, 4096, 128)   |         1854.208 |            2896.832 |         6130.112 |            8745.984 |         0.640 |         0.701 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 4096, 2, 4096, 128)   |         1860.512 |            2889.344 |         6135.648 |            8750.592 |         0.644 |         0.701 |
| None          | None       | torch.bfloat16 | (16, 16, 512, 16, 512, 64)    |           60.640 |              71.552 |          315.968 |             296.512 |         0.847 |         1.066 |
| None          | causal     | torch.bfloat16 | (16, 16, 512, 16, 512, 64)    |           50.784 |              71.040 |          284.288 |             258.880 |         0.715 |         1.098 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 512, 16, 512, 64)    |           61.312 |              72.704 |          315.680 |             302.016 |         0.843 |         1.045 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 512, 16, 512, 64)    |           60.800 |              71.776 |          316.320 |             297.152 |         0.847 |         1.065 |
| None          | None       | torch.bfloat16 | (16, 16, 512, 16, 512, 128)   |           84.576 |             144.416 |          580.576 |             535.936 |         0.586 |         1.083 |
| None          | causal     | torch.bfloat16 | (16, 16, 512, 16, 512, 128)   |           76.064 |             123.648 |          553.344 |             481.376 |         0.615 |         1.150 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 512, 16, 512, 128)   |           84.160 |             145.248 |          581.024 |             540.000 |         0.579 |         1.076 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 512, 16, 512, 128)   |           84.512 |             143.552 |          581.088 |             535.776 |         0.589 |         1.085 |
| None          | None       | torch.bfloat16 | (16, 16, 1024, 16, 1024, 64)  |          189.152 |             209.408 |          798.400 |             868.704 |         0.903 |         0.919 |
| None          | causal     | torch.bfloat16 | (16, 16, 1024, 16, 1024, 64)  |          127.552 |             168.800 |          650.816 |             663.328 |         0.756 |         0.981 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 1024, 16, 1024, 64)  |          189.376 |             211.360 |          798.080 |             895.552 |         0.896 |         0.891 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 1024, 16, 1024, 64)  |          189.440 |             208.576 |          797.888 |             873.152 |         0.908 |         0.914 |
| None          | None       | torch.bfloat16 | (16, 16, 1024, 16, 1024, 128) |          257.536 |             441.760 |         1408.960 |            1514.720 |         0.583 |         0.930 |
| None          | causal     | torch.bfloat16 | (16, 16, 1024, 16, 1024, 128) |          179.328 |             312.096 |         1170.368 |            1177.472 |         0.575 |         0.994 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 1024, 16, 1024, 128) |          259.264 |             446.944 |         1408.768 |            1530.400 |         0.580 |         0.921 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 1024, 16, 1024, 128) |          258.080 |             440.480 |         1408.864 |            1514.144 |         0.586 |         0.930 |
| None          | None       | torch.bfloat16 | (16, 16, 4096, 16, 4096, 64)  |         2595.808 |            2771.456 |         7616.704 |           10405.248 |         0.937 |         0.732 |
| None          | causal     | torch.bfloat16 | (16, 16, 4096, 16, 4096, 64)  |         1435.744 |            1610.336 |         4927.520 |            6220.000 |         0.892 |         0.792 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 4096, 16, 4096, 64)  |         2595.264 |            2745.056 |         7611.232 |           10631.392 |         0.945 |         0.716 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 4096, 16, 4096, 64)  |         2576.256 |            2735.456 |         7626.400 |           10346.976 |         0.942 |         0.737 |
| None          | None       | torch.bfloat16 | (16, 16, 4096, 16, 4096, 128) |         3679.744 |            5634.816 |        13077.056 |           17182.528 |         0.653 |         0.761 |
| None          | causal     | torch.bfloat16 | (16, 16, 4096, 16, 4096, 128) |         2099.360 |            3250.176 |         8589.664 |           10236.672 |         0.646 |         0.839 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 4096, 16, 4096, 128) |         3676.800 |            5716.288 |        13073.088 |           17311.071 |         0.643 |         0.755 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 4096, 16, 4096, 128) |         3679.136 |            5570.496 |        13070.720 |           17192.863 |         0.660 |         0.760 |
| None          | None       | torch.bfloat16 | (16, 16, 512, 2, 512, 64)     |           61.600 |              71.008 |          272.320 |             300.000 |         0.868 |         0.908 |
| None          | causal     | torch.bfloat16 | (16, 16, 512, 2, 512, 64)     |           50.176 |              65.344 |          241.568 |             258.912 |         0.768 |         0.933 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 512, 2, 512, 64)     |           61.120 |              72.512 |          272.672 |             305.408 |         0.843 |         0.893 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 512, 2, 512, 64)     |           61.248 |              71.136 |          272.640 |             301.120 |         0.861 |         0.905 |
| None          | None       | torch.bfloat16 | (16, 16, 512, 2, 512, 128)    |           83.872 |             146.784 |          466.912 |             496.832 |         0.571 |         0.940 |
| None          | causal     | torch.bfloat16 | (16, 16, 512, 2, 512, 128)    |           76.704 |             115.072 |          435.584 |             462.112 |         0.667 |         0.943 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 512, 2, 512, 128)    |           83.392 |             147.392 |          466.656 |             504.448 |         0.566 |         0.925 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 512, 2, 512, 128)    |           83.360 |             146.688 |          466.656 |             499.040 |         0.568 |         0.935 |
| None          | None       | torch.bfloat16 | (16, 16, 1024, 2, 1024, 64)   |          189.024 |             207.584 |          684.768 |             873.568 |         0.911 |         0.784 |
| None          | causal     | torch.bfloat16 | (16, 16, 1024, 2, 1024, 64)   |          126.944 |             164.288 |          536.192 |             645.984 |         0.773 |         0.830 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 1024, 2, 1024, 64)   |          188.768 |             209.760 |          684.096 |             897.504 |         0.900 |         0.762 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 1024, 2, 1024, 64)   |          189.408 |             207.776 |          685.024 |             876.384 |         0.912 |         0.782 |
| None          | None       | torch.bfloat16 | (16, 16, 1024, 2, 1024, 128)  |          259.168 |             449.536 |         1167.936 |            1433.280 |         0.577 |         0.815 |
| None          | causal     | torch.bfloat16 | (16, 16, 1024, 2, 1024, 128)  |          180.000 |             305.312 |          928.000 |            1113.920 |         0.590 |         0.833 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 1024, 2, 1024, 128)  |          258.464 |             455.136 |         1167.808 |            1462.848 |         0.568 |         0.798 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 1024, 2, 1024, 128)  |          257.824 |             450.208 |         1167.744 |            1448.000 |         0.573 |         0.806 |
| None          | None       | torch.bfloat16 | (16, 16, 4096, 2, 4096, 64)   |         2598.368 |            2729.120 |         7134.400 |           10381.632 |         0.952 |         0.687 |
| None          | causal     | torch.bfloat16 | (16, 16, 4096, 2, 4096, 64)   |         1435.456 |            1591.040 |         4424.768 |            6035.808 |         0.902 |         0.733 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 4096, 2, 4096, 64)   |         2594.752 |            2725.952 |         7128.384 |           10822.496 |         0.952 |         0.659 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 4096, 2, 4096, 64)   |         2597.888 |            2716.960 |         7101.568 |           10385.440 |         0.956 |         0.684 |
| None          | None       | torch.bfloat16 | (16, 16, 4096, 2, 4096, 128)  |         3647.648 |            5581.632 |        12089.952 |           16667.233 |         0.654 |         0.725 |
| None          | causal     | torch.bfloat16 | (16, 16, 4096, 2, 4096, 128)  |         2093.952 |            3241.440 |         7579.392 |            9847.936 |         0.646 |         0.770 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 4096, 2, 4096, 128)  |         3650.528 |            5650.688 |        12105.568 |           16963.680 |         0.646 |         0.714 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 4096, 2, 4096, 128)  |         3680.064 |            5585.312 |        12117.504 |           16935.040 |         0.659 |         0.716 |

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135505
Approved by: https://github.com/Chillee
2024-09-10 09:30:02 +00:00
23b1486185 [MPS] Allow nan mean reduction in nll_loss (#135434)
This PR allows results from `nll_loss` to be `nan`, which is the same behavior as with CUDA and CPU https://github.com/pytorch/pytorch/pull/64572#issuecomment-926504162.

Fixes #134431

Ref #64572 #119108
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135434
Approved by: https://github.com/malfet
2024-09-10 08:37:59 +00:00
9902b349cb [Inductor] Make static_input_idxs a set for faster lookup (#135314)
`static_input_idxs` is only used for lookups. With large models, this is a large list. This takes over a millisecond in some cases.
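
A minimal sketch of why a set helps when a container is only used for membership tests. The names `static_idxs_list` and `static_idxs_set` are hypothetical stand-ins, not Inductor code, and the sizes are made up for illustration:

```python
# Illustrative only: these containers are hypothetical stand-ins for the
# real `static_input_idxs`; the sizes are arbitrary.
import timeit

n = 10_000
static_idxs_list = list(range(n))
static_idxs_set = set(static_idxs_list)


def count_hits(container, probes):
    """Count how many probe indices are present in `container`."""
    return sum(1 for i in probes if i in container)


# 200 probes are present (n-200 .. n-1) and 200 are absent (n .. n+199).
probes = range(n - 200, n + 200)

# Both containers give the same answer; only the lookup cost differs:
# `in` on a list scans linearly, `in` on a set is a hash lookup.
list_hits = count_hits(static_idxs_list, probes)
set_hits = count_hits(static_idxs_set, probes)

list_time = timeit.timeit(lambda: count_hits(static_idxs_list, probes), number=3)
set_time = timeit.timeit(lambda: count_hits(static_idxs_set, probes), number=3)
print(f"list: {list_time:.4f}s  set: {set_time:.4f}s")
```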

Profile before change:
<img width="824" alt="image" src="https://github.com/user-attachments/assets/002a0775-fd2f-4d27-8cf2-812b502d7d5e">

Profile after change: gaps are smaller, 1ms speedup before launching the cuda graph
<img width="794" alt="image" src="https://github.com/user-attachments/assets/12a0a0b9-2cc1-4d53-ac87-9bd5140a46f5">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135314
Approved by: https://github.com/oulgen
2024-09-10 07:27:55 +00:00
5a9ac83e94 Fix doc (#135551)
Differential Revision: [D62412667](https://our.internmc.facebook.com/intern/diff/D62412667/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135551
Approved by: https://github.com/yushangdi
ghstack dependencies: #135549
2024-09-10 07:18:44 +00:00
1adf28a5c0 [inductor] print triton float64 constants correctly (#135260)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135260
Approved by: https://github.com/jansel
2024-09-10 07:05:02 +00:00
c18052da0e Add some minor doc improvement and ban using training IR for unflattener (#135549)
Title

Differential Revision: [D62412490](https://our.internmc.facebook.com/intern/diff/D62412490/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135549
Approved by: https://github.com/yushangdi
2024-09-10 06:48:42 +00:00
c0d2f991b1 Increase TRITON_MAX_BLOCK['X'] (#135181)
Fixes #135028

As title, increase `TRITON_MAX_BLOCK['X']` to 4096 and fix an error, thanks to @Chillee: https://github.com/pytorch/pytorch/pull/133300/files#r1744706189

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135181
Approved by: https://github.com/jansel
2024-09-10 05:54:37 +00:00
e889252493 Implementation of scan (#134102)
This operation is intended as the counterpart to `associative_scan`, but it can also operate on non-associative functions.
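
For readers unfamiliar with the idea, here is a plain-Python sketch of a sequential scan with a carried state. The function name `sequential_scan` and its signature are assumptions made for illustration; they are not the signature of the op added by this PR.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

C = TypeVar("C")  # carry type
X = TypeVar("X")  # input element type
Y = TypeVar("Y")  # output element type


def sequential_scan(
    step: Callable[[C, X], Tuple[C, Y]], init: C, xs: Iterable[X]
) -> Tuple[C, List[Y]]:
    """Apply `step` left to right, threading a carry through every element.

    Hypothetical helper: because elements are processed strictly in order,
    `step` does not need to be associative.
    """
    carry = init
    ys: List[Y] = []
    for x in xs:
        carry, y = step(carry, x)
        ys.append(y)
    return carry, ys


def running_difference(carry: int, x: int) -> Tuple[int, int]:
    """Subtraction is not associative, so evaluation order matters."""
    new_carry = carry - x
    return new_carry, new_carry


final, outputs = sequential_scan(running_difference, 10, [1, 2, 3])
print(final, outputs)
```

An associative scan may regroup the operations to run them in parallel; a sequential scan like this one cannot, which is the price of accepting arbitrary step functions.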

@ydwu4

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134102
Approved by: https://github.com/ydwu4
2024-09-10 04:51:16 +00:00
6546c6186d do not raise when flatten_fn_with_keys not found when suggesting fixes (#135518)
Test Plan: added test

Differential Revision: D62395371

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135518
Approved by: https://github.com/zhxchen17
2024-09-10 03:47:36 +00:00
1d9fefff19 [DCP] Fixes the stateless optimizer issue of distributed state_dict (#135535)
Some optimizers don't have states, which can cause get_state_dict/set_state_dict to behave incorrectly. This PR fixes the issue.

fixes: https://github.com/pytorch/pytorch/issues/133415

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135535
Approved by: https://github.com/wz337
2024-09-10 03:10:00 +00:00
7ec17b49cf Fix dynamo benchmark skip logic for cpu device (#135193)
Fixes #132380. Adjusts the torchbench and huggingface skip-model lists so that `--no-skip` can be removed when running benchmarks on the 3 suites.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135193
Approved by: https://github.com/chuanqi129, https://github.com/jansel
2024-09-10 03:02:19 +00:00
146921007a [inductor] [cpp] fix the input contiguous check in max-autotune (#134982)
## Description
Fixes the FP32 accuracy failure of `resmlp_12_224` and BF16 accuracy failure of `volo_d1_224` in timm.

In this PR, we check whether the input is contiguous as follows:
If it has `FixedLayout`, we know the accurate strides. For `FlexibleLayout`, if its data is a `ComputedBuffer`, we could get the fill order of the buffer to decide whether it's contiguous. For the other cases, we won't use GEMM template as we can't infer whether it's contiguous.

## Additional context
The current GEMM template only supports this case: `input.get_stride()[-1] == 1`. In `resmlp_12_224`, when we run into this check, the layout of `input` is a `FlexibleLayout`. The reason is that when realizing the input which is a `View` IR, the `convert_to_reinterpret_view` call fails:
d14fe3ffed/torch/_inductor/ir.py (L4712-L4715)

And it finally runs into this `copy_input` and returns a `FlexibleLayout`.
d14fe3ffed/torch/_inductor/ir.py (L4722)

When checking its stride, this `FlexibleLayout` indeed satisfies `input.get_stride()[-1] == 1`, but it is later decided as a `FixedLayout` with `size = (3072, 196), stride = (1, 3072)`, which is not supported by the GEMM template, thus causing an accuracy issue in this model.
The `FlexibleLayout` is converted to `FixedLayout` during [CppPackedGemmTemplate.add_choices](d14fe3ffed/torch/_inductor/mkldnn_lowerings.py (L1051)) which calls [slice_nd](d14fe3ffed/torch/_inductor/codegen/cpp_template_kernel.py (L150)) when rendering the kernel (`slice_nd(X)`). When creating the `SliceView` IR, [as_storage_and_layout](d14fe3ffed/torch/_inductor/ir.py (L2288)) invokes
[decide_layout](d14fe3ffed/torch/_inductor/ir.py (L2135)) and converts it to a `FixedLayout` with `size = (3072, 196), stride = (1, 3072)`.
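
The stride condition discussed above can be illustrated with plain tuples. This is a sketch only: `row_major_strides` and `last_dim_is_unit_stride` are hypothetical helpers, not Inductor functions, and they only mirror the `input.get_stride()[-1] == 1` check quoted in the description.

```python
from typing import Sequence, Tuple


def row_major_strides(size: Sequence[int]) -> Tuple[int, ...]:
    """Strides (in elements) of a contiguous, row-major tensor of `size`."""
    strides = []
    running = 1
    for dim in reversed(size):
        strides.append(running)
        running *= dim
    return tuple(reversed(strides))


def last_dim_is_unit_stride(stride: Sequence[int]) -> bool:
    """Mirror of the `stride[-1] == 1` condition from the description."""
    return len(stride) > 0 and stride[-1] == 1


size = (3072, 196)
contiguous_stride = row_major_strides(size)  # layout the template expects
decided_stride = (1, 3072)                   # layout reported in this PR

print(contiguous_stride, last_dim_is_unit_stride(contiguous_stride))
print(decided_stride, last_dim_is_unit_stride(decided_stride))
```

The second layout is column-major for the same size, so a check made before the layout is fixed can pass while the layout that is eventually decided fails it.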

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134982
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
2024-09-10 02:47:38 +00:00
a71e5509bc [inductor]Add profiler to operatorbench (#135515)
Add profiling to operatorbench. The new argument `--profile` is added and the profiling trace is like the following figure.
<img width="954" alt="image" src="https://github.com/user-attachments/assets/5b00d6e3-4905-4a77-a5e9-9f62620a5fd5">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135515
Approved by: https://github.com/shunting314
2024-09-10 02:33:30 +00:00
136e28f616 Enable forward AD in functional.affine_grid (#135494)
Fixes #121411
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135494
Approved by: https://github.com/zou3519, https://github.com/soulitzer
2024-09-10 00:07:07 +00:00
39a61795e3 remove amax_ptr from scaled_gemm (#135421)
amax was removed from _scaled_mm by #128683. Remove it from the internal at::cuda::blas::scaled_gemm, as well.  This allows hipBLASLt to find additional solutions rather than forcing amax to be used and then discarding the result.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135421
Approved by: https://github.com/drisspg, https://github.com/eqy
2024-09-09 23:04:36 +00:00
b4feec9782 [xplat][XNNPACK] don't prefer static linkage in xplat for main target (#135529)
Building XNNPACK as a static library has some issues because of multiple global params floating around.

Let's try to get rid of it in xplat and see how it fares.

Differential Revision: [D60776152](https://our.internmc.facebook.com/intern/diff/D60776152/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D60776152/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135529
Approved by: https://github.com/kimishpatel, https://github.com/mcr229, https://github.com/kirklandsign
2024-09-09 22:47:01 +00:00
d81731615f [Dynamo] Adding CallFunctionNoArgsSource and (#135425)
CallFunctionNoArgsGuardAccessor to support torch.cuda.current_device()

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135425
Approved by: https://github.com/anijain2305
2024-09-09 22:46:00 +00:00
e2f9a83b85 [ONNX] Drop final None values as inputs for nodes in exporter graph (#135520)
When the value for an optional input is not provided, it defaults to `None`, which gets translated to "" in the ONNX graph. To avoid this, if we have a list of inputs and the final few are all `None`, strip them in the graph.
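
A minimal sketch of the stripping rule, assuming node inputs are held in a plain Python list. The helper name `strip_trailing_none` is hypothetical and is not the exporter's actual function:

```python
from typing import List, Optional, Sequence


def strip_trailing_none(inputs: Sequence[Optional[str]]) -> List[Optional[str]]:
    """Drop only the run of `None` values at the end of `inputs`.

    Hypothetical helper: interior `None` entries are kept because inputs
    are positional, so removing them would shift later inputs.
    """
    end = len(inputs)
    while end > 0 and inputs[end - 1] is None:
        end -= 1
    return list(inputs[:end])


stripped = strip_trailing_none(["x", None, "w", None, None])
print(stripped)
```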
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135520
Approved by: https://github.com/justinchuby

Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
2024-09-09 22:28:41 +00:00
70a65a8bd5 Revert "NJT <-> padded dense conversions (#125947)"
This reverts commit 09a5e88bef04d5485b70d8f65f46a675aaa52942.

Reverted https://github.com/pytorch/pytorch/pull/125947 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing dynamo test 09a5e88bef, maybe a landrace ([comment](https://github.com/pytorch/pytorch/pull/125947#issuecomment-2339228570))
2024-09-09 22:01:09 +00:00
689d278543 Revert "Add __init__.py to shape inference folder. (#135461)"
This reverts commit dced0d6d9f05f0962f74a3c6227f774111c15715.

Reverted https://github.com/pytorch/pytorch/pull/135461 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it exposes some public function without appropriate doc. I will reopen the issue with hi-prio so that it can be fixed properly ([comment](https://github.com/pytorch/pytorch/pull/135461#issuecomment-2339218382))
2024-09-09 21:55:13 +00:00
9b764491e3 Use upload-artifact@v4.4.0 for create_release.yml (#135528)
Fixes failure: https://github.com/pytorch/pytorch/actions/runs/10780281005/job/29895846007

Due to broken sync between
```
actions/upload-artifact@v2
and
actions/download-artifact@v4.1.7
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135528
Approved by: https://github.com/kit1980, https://github.com/malfet
2024-09-09 20:48:52 +00:00
cbc6b30a24 Fix broken E2E tests on Linux machines (#135394)
Summary:
I'm not entirely sure why this is failing with an `ImportError` (according to lastnameye a super class of `ModuleNotFoundError`s), but on our E2E tests on Linux machines (but not Macs?), we're seeing the import failure not getting caught --
`ImportError: cannot import name 'parutil' from 'libfb.py' (/data/sandcastle/boxes/eden-trunk-hg-full-fbsource/buck-out/v2/gen/fbsource/d0c916ec8d40ce11/arvr/libraries/ctrl/studies/replay/__ctrl-r__/ctrl-r#link-tree/libfb/py/__init__.py)` from this test run https://www.internalfb.com/sandcastle/workflow/2522015791331601269, an instance of this job:  https://www.internalfb.com/intern/test/844425085172858?ref_report_id=0 is the overall job

Test Plan:
`arc skycastle schedule tools/skycastle/workflows2/ctrl/js_tests.sky:test_js_e2e_replay_tests --sandcastle-spec-overrides '{"type": "fbcode", "unicastle_size": "I1_MEDIUM"}'`
->
https://www.internalfb.com/sandcastle/workflow/256705178764255769

Differential Revision: D62321167

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135394
Approved by: https://github.com/laithsakka
2024-09-09 20:18:08 +00:00
5b368de7f7 Revert "[ONNX] Update fake mode usage in onnx docs (#135512)"
This reverts commit a13c118994b4f118388d97a35abcb91a396cd437.

Reverted https://github.com/pytorch/pytorch/pull/135512 on behalf of https://github.com/davidberard98 due to failing test  https://github.com/pytorch/pytorch/actions/runs/10778813316/job/29891679127 ([comment](https://github.com/pytorch/pytorch/pull/135512#issuecomment-2338999090))
2024-09-09 20:15:12 +00:00
09a5e88bef NJT <-> padded dense conversions (#125947)
This PR:
* Implements the pre-existing `nt.to_padded_tensor(padding_val)` ATen op via the FBGEMM kernel + appropriate view gymnastics (since that kernel only handles 2D values)
* Introduces a new `_nested_from_padded_tensor` op for the reverse conversion, implemented via the reverse FBGEMM kernel + view gymnastics
    * Note: there is currently no public API for this; design booted to a future PR

TODO:
* ~~Propagate min / max sequence length via the new factory function `_nested_from_padded_tensor`~~
* ~~Verify that Inductor does computation fusion via test logic~~
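
The two conversions described above can be sketched with plain Python lists. This is only an illustration of the semantics under the assumption of one ragged dimension; `to_padded` and `from_padded` are hypothetical names and do not use the FBGEMM kernels or the ATen ops mentioned in this PR.

```python
from typing import List, Sequence


def to_padded(rows: Sequence[Sequence[float]], padding_value: float) -> List[List[float]]:
    """Pad every ragged row on the right up to the longest row's length."""
    max_len = max((len(r) for r in rows), default=0)
    return [list(r) + [padding_value] * (max_len - len(r)) for r in rows]


def from_padded(padded: Sequence[Sequence[float]], lengths: Sequence[int]) -> List[List[float]]:
    """Reverse conversion: trim each padded row back to its true length."""
    return [list(row[:n]) for row, n in zip(padded, lengths)]


ragged = [[1.0, 2.0, 3.0], [4.0], [5.0, 6.0]]
lengths = [len(r) for r in ragged]

dense = to_padded(ragged, padding_value=0.0)
restored = from_padded(dense, lengths)
print(dense)
print(restored)
```

The reverse direction needs the original lengths, since padding values cannot be told apart from real data once the tensor is dense.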

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125947
Approved by: https://github.com/soulitzer
2024-09-09 19:37:32 +00:00
a4e6a0b240 [split build] move periodic split builds into own concurrency group (#135510)
To avoid nightly workflows cancelling each other
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135510
Approved by: https://github.com/clee2000, https://github.com/huydhn, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-09-09 19:35:57 +00:00
4ab232d0c4 Fix symbolic number's type and tensor's dtype mismatch bug in Tensor ctor (#135433)
Fixes #135432

In the current implementation, if we try to store a symbolic number in Tensor's constructor, it assumes that the tensor's dtype and the symbolic number's type are matched, which is not the case.

In other words, if we try to store a `SymInt`, current implementation assumes tensor's dtype is `torch.int32`, `torch.int64` or something. And if we try to store a `SymFloat`, it assumes tensor's dtype is `torch.float32` or `torch.float64`. However, the tensor's dtype could also be `torch.float32` or something else when we try to store `SymInt`, which would be wrong.

This PR stores symbolic numbers by the tensor's scalar type, by wrapping the guarded number of a `SymInt` or `SymFloat` into a PyObject.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135433
Approved by: https://github.com/ezyang
2024-09-09 19:32:18 +00:00
2032f107d7 Don't try to tag s390x docker images (#135509)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135509
Approved by: https://github.com/atalman
2024-09-09 19:07:48 +00:00
5f7d956362 Fix bugs blocking flipping the default layout constraint for custom ops (#135391)
Fixes two things:
- For regular PyTorch ops, the default layout constraint tag is always
flexible_layout. This was a bug with #135238
- Mark the new quantized _wrapped_linear_prepack ops as flexible_layout.
  The metas for these are incorrect, I didn't want to fix them (and
  changing the default requires the metas actually be correct).

Test Plan:
- The next PR up in the stack. The PRs are split because the next one is
  riskier.

foo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135391
Approved by: https://github.com/albanD
2024-09-09 18:24:21 +00:00
a13c118994 [ONNX] Update fake mode usage in onnx docs (#135512)
Update fake mode usage in onnx docs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135512
Approved by: https://github.com/justinchuby
2024-09-09 18:10:37 +00:00
21241bfeee [CP] Extend CP to support load-balancing shards (#132442)
This PR extends the current ring attention to support load-balancing shards -- the context/sequence is divided into `2 * world_size` shards and each rank gets `rank` and `(world_size * 2 - rank - 1)` shards. The data re-shuffling is done in the `context_parallel` API.
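
The shard assignment can be sketched as follows. `load_balanced_shards` is a hypothetical helper, not the `context_parallel` API, and the cost model (query shard `i` attends to key shards `0..i` under a causal mask) is a simplifying assumption used only to show why the pairing balances work:

```python
from typing import Tuple


def load_balanced_shards(rank: int, world_size: int) -> Tuple[int, int]:
    """Shard indices owned by `rank` when the sequence is cut into
    2 * world_size shards, following the formula in the description."""
    return rank, 2 * world_size - rank - 1


def causal_cost(shard: int) -> int:
    """Assumed cost model: shard i attends to shards 0..i, i.e. i + 1 blocks."""
    return shard + 1


world_size = 4
assignment = [load_balanced_shards(r, world_size) for r in range(world_size)]
costs = [sum(causal_cost(s) for s in shards) for shards in assignment]

print(assignment)
print(costs)
```

Pairing an early shard with a late one gives every rank the same total under this model, whereas contiguous sharding would leave the last rank with the most work.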

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132442
Approved by: https://github.com/wconstab
2024-09-09 18:04:38 +00:00
73a6fc6e30 Revert "[Inductor] Make static_input_idxs a set for faster lookup (#135314)"
This reverts commit 011cae9570fb3c44b7f6f0c8004c470579ed21da.

Reverted https://github.com/pytorch/pytorch/pull/135314 on behalf of https://github.com/ZainRizvi due to Lint is failing on this file in trunk. See [GH job link](https://github.com/pytorch/pytorch/actions/runs/10777258770/job/29885960050) [HUD commit link](011cae9570) ([comment](https://github.com/pytorch/pytorch/pull/135314#issuecomment-2338678219))
2024-09-09 17:33:01 +00:00
09287e3af4 [MPS] Add regression test for fft.fftfreq (#135440)
The issue reported in #135223 was already solved in #128393. This PR adds a regression test for it.

Fixes #135223

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135440
Approved by: https://github.com/ezyang
2024-09-09 17:12:36 +00:00
16c3b8f87c [AOTI] Fix assert_function call in cpu autotune template (#135086)
Summary: In the ABI-compatible mode, assert_function should be AOTI_TORCH_CHECK.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135086
Approved by: https://github.com/chenyang78, https://github.com/angelayi
ghstack dependencies: #134857
2024-09-09 16:54:12 +00:00
9c6dff4941 [AOTI] Add C shim for aten.mkldnn_rnn_layer in cpp wrapper (#134857)
Summary: Support aten.mkldnn_rnn_layer in the ABI-compatible mode. Because aten.mkldnn_rnn_layer is an aten op, it is easier to add a C shim function for it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134857
Approved by: https://github.com/angelayi
2024-09-09 16:54:12 +00:00
0eb425a563 [Release] Apply Release changes scripts after release 2.4 (#135495)
Based on additional changes required for https://github.com/pytorch/pytorch/pull/128347
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135495
Approved by: https://github.com/kit1980
2024-09-09 16:49:04 +00:00
011cae9570 [Inductor] Make static_input_idxs a set for faster lookup (#135314)
`static_input_idxs` is only used for lookups. With large models, this is a large list. This takes over a millisecond in some cases.

Profile before change:
<img width="824" alt="image" src="https://github.com/user-attachments/assets/002a0775-fd2f-4d27-8cf2-812b502d7d5e">

Profile after change: gaps are smaller, 1ms speedup before launching the cuda graph
<img width="794" alt="image" src="https://github.com/user-attachments/assets/12a0a0b9-2cc1-4d53-ac87-9bd5140a46f5">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135314
Approved by: https://github.com/oulgen
2024-09-09 16:24:58 +00:00
dfb2b661f7 Use float data type for Half var_sum in batchnorm stats updating on CPU (#126525)
Using float data type for Half `var_sum` in batchnorm stats updating on CPU to avoid `var_sum` overflow since the representation range of Half is small.
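
The overflow can be reproduced without PyTorch by rounding an accumulator to IEEE half precision after every addition. This is a sketch with made-up values: `round_to` and `accumulate` are hypothetical helpers built on the standard `struct` module, not the CPU kernel changed by this PR.

```python
import math
import struct


def round_to(fmt: str, x: float) -> float:
    """Round `x` to the precision of struct format `fmt` ('<e' = half,
    '<f' = float32). Values too large for the format become infinity."""
    try:
        return struct.unpack(fmt, struct.pack(fmt, x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


def accumulate(values, fmt: str) -> float:
    """Sum `values`, rounding the running total to `fmt` after each add."""
    total = 0.0
    for v in values:
        total = round_to(fmt, total + v)
    return total


# Made-up squared deviations; their true sum (100000) exceeds the largest
# finite half value (65504).
squared_deviations = [100.0] * 1000

half_sum = accumulate(squared_deviations, "<e")
float_sum = accumulate(squared_deviations, "<f")
print(half_sum, float_sum)
```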
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126525
Approved by: https://github.com/jgong5, https://github.com/peterbell10
2024-09-09 15:31:38 +00:00
5a69e0ebbe [MPS] Update decorator comments with issue ref (#135448)
Updating the comments with references to better places for context now that the bugs have been identified.

xref #135442 #135447 #134184

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135448
Approved by: https://github.com/ezyang
2024-09-09 15:18:52 +00:00
5e145861f2 [ONNX] Improves documentation of ONNX exporter (#135372)
The PR updates the documentation to reflect the changes introduced in PyTorch 2.5 that relate to the ONNX exporter.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135372
Approved by: https://github.com/justinchuby

Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
2024-09-09 15:09:01 +00:00
c35b953531 Fix wrong error msg (#135423)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135423
Approved by: https://github.com/ezyang
2024-09-09 13:28:31 +00:00
dced0d6d9f Add __init__.py to shape inference folder. (#135461)
Fixes #135196

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135461
Approved by: https://github.com/ezyang
2024-09-09 13:27:58 +00:00
c0436c5701 [inductor][cpp][gemm] fix perf regression xcit_large_24_p8_224 (#134686) (#135438)
Fix #134686.

PR https://github.com/pytorch/pytorch/pull/132729 makes GEMM template faster for one of the GEMMs in xcit_large_24_p8_224:
SingleProcess AUTOTUNE benchmarking takes 1.7088 seconds and 1.9207 seconds precompiling
AUTOTUNE linear_unary(12544x3072, 768x3072, 768)
  cpp_packed_gemm_2 2.9371 ms 100.0%
  _linear_pointwise 3.1584 ms 93.0%

But it is slower than ATen in the e2e run due to different cache behavior. Access to the input data (12544x3072) is LLC-latency bound, and bottlenecks appear due to memory synchronization (data transfers and coherence updates across processors). This PR tries to mitigate the problem by cooperatively loading different chunks of the input data from the different processors that share it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135438
Approved by: https://github.com/leslie-fang-intel
2024-09-09 05:16:02 +00:00
cyy
60e8dc4374 Check function declarations in Caffe2 code (#134925)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134925
Approved by: https://github.com/ezyang
2024-09-09 05:03:29 +00:00
e6c3f58584 Fix example: Address broadcasting error in the addition of `attn_bias… (#135427)
…` and `attn_mask`, and correct device assignment for newly created variables in the method.

Fix example: Address broadcasting error in the addition of `attn_bias` and `attn_mask`, and correct device assignment for newly created variables in the method.

1. Adding `attn_bias += attn_mask` results in a broadcasting error. The expected shape of `attn_bias` is (L, S), so the output should also have the shape (L, S). However, when the input shape is (N, num_heads, L, S), broadcasting occurs, leading to an output shape of (N, num_heads, L, S), which is not desired.
2. `attn_bias` is a newly created variable within the method, but it is not assigned to the correct device.
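
The shape problem in point 1 can be checked with a small shape-only sketch. `broadcast_shape` and `can_add_in_place` are hypothetical helpers that follow the usual right-aligned broadcasting rule; they do not call PyTorch, and the concrete sizes are made up:

```python
from typing import Sequence, Tuple


def broadcast_shape(a: Sequence[int], b: Sequence[int]) -> Tuple[int, ...]:
    """Result shape of broadcasting `a` with `b` (dimensions aligned from the right)."""
    result = []
    for i in range(1, max(len(a), len(b)) + 1):
        da = a[-i] if i <= len(a) else 1
        db = b[-i] if i <= len(b) else 1
        if da != db and da != 1 and db != 1:
            raise ValueError(f"shapes {tuple(a)} and {tuple(b)} do not broadcast")
        result.append(db if da == 1 else da)
    return tuple(reversed(result))


def can_add_in_place(target: Sequence[int], other: Sequence[int]) -> bool:
    """An in-place add must leave the target's shape unchanged."""
    return broadcast_shape(target, other) == tuple(target)


L, S = 4, 6
N, num_heads = 2, 3

attn_bias_shape = (L, S)
attn_mask_shape = (N, num_heads, L, S)

out_of_place = broadcast_shape(attn_bias_shape, attn_mask_shape)
in_place_ok = can_add_in_place(attn_bias_shape, attn_mask_shape)
print(out_of_place, in_place_ok)
```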

**This is my retry of PR #130209 . The PR has been merged into commit `d4a79d4a7c746068d25fe5cf9333495561f4ce1f`, but the modifications were overwritten by subsequent commits.**

Co-authored-by: mikaylagawarecki <mikaylagawarecki@gmail.com>
@mikaylagawarecki  provided a more elegant implementation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135427
Approved by: https://github.com/ezyang
2024-09-09 03:47:34 +00:00
90e12cf63d Fix return type of nansum example. (#135435)
One of the examples in the documentation of `torch.nansum` contains a wrong return type. This fixes it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135435
Approved by: https://github.com/ezyang
2024-09-09 03:34:52 +00:00
44c08f4984 [Partitioner] Query whether nodes exist in graph faster (#135316)
Checking whether a node exists in graph.nodes (a linked list) takes too long. Use graph._find_nodes_lookup_table (a hash table) instead to speed this up.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135316
Approved by: https://github.com/ezyang
2024-09-09 03:34:02 +00:00
b6186353c6 enable lazy_init for hpu (#135203)
enables lazy_init for hpu device
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135203
Approved by: https://github.com/ezyang
2024-09-09 03:32:20 +00:00
1209 changed files with 30046 additions and 28210 deletions

View File

@ -1,5 +1,5 @@
0.6b
0.7b
manylinux_2_17
rocm6.2
7f07e8a1cb1f99627eb6d77f5c0e9295c775f3c7
e4ab195d2bd19e939c675a13280c29714c6ef9f2cf420690da150fa0cac043b1
9be04068c3c0857a4cfd17d7e39e71d0423ebac2
3e9e1959d23b93d78a08fcc5f868125dc3854dece32fd9458be9ef4467982291

View File

@ -286,18 +286,7 @@ case "$image" in
TRITON=yes
;;
pytorch-linux-focal-rocm-n-1-py3)
ANACONDA_PYTHON_VERSION=3.8
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
ROCM_VERSION=6.0
NINJA_VERSION=1.9.0
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-focal-rocm-n-py3)
ANACONDA_PYTHON_VERSION=3.8
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
PROTOBUF=yes
DB=yes
@ -307,6 +296,17 @@ case "$image" in
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-focal-rocm-n-py3)
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
ROCM_VERSION=6.2
NINJA_VERSION=1.9.0
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-jammy-xpu-2024.0-py3)
ANACONDA_PYTHON_VERSION=3.9
GCC_VERSION=11
@ -379,6 +379,7 @@ case "$image" in
GCC_VERSION=11
CONDA_CMAKE=yes
HALIDE=yes
TRITON=yes
;;
pytorch-linux-focal-linter)
# TODO: Use 3.9 here because of this issue https://github.com/python/mypy/issues/13627.

View File

@ -1 +1 @@
cc981feba10a3f4c2e46f3fe368e8fcf5f5643df
91b14bf5593cf58a8541f3e6b9125600a867d4ef

View File

@ -1 +1 @@
757b6a61e7df814ba806f498f8bb3160f84b120c
5fe38ffd73c2ac6ed6323b554205186696631c6f

View File

@ -4,12 +4,12 @@ set -ex
source "$(dirname "${BASH_SOURCE[0]}")/common_utils.sh"
TARBALL='aotriton.tar.bz2'
TARBALL='aotriton.tar.gz'
# This read command alwasy returns with exit code 1
read -d "\n" VER MANYLINUX ROCMBASE PINNED_COMMIT SHA256 < aotriton_version.txt || true
ARCH=$(uname -m)
AOTRITON_INSTALL_PREFIX="$1"
AOTRITON_URL="https://github.com/ROCm/aotriton/releases/download/${VER}/aotriton-${VER}-${MANYLINUX}_${ARCH}-${ROCMBASE}-shared.tar.bz2"
AOTRITON_URL="https://github.com/ROCm/aotriton/releases/download/${VER}/aotriton-${VER}-${MANYLINUX}_${ARCH}-${ROCMBASE}-shared.tar.gz"
cd "${AOTRITON_INSTALL_PREFIX}"
# Must use -L to follow redirects

View File

@ -7,7 +7,7 @@ PYTHON_DOWNLOAD_GITHUB_BRANCH=https://github.com/python/cpython/archive/refs/hea
GET_PIP_URL=https://bootstrap.pypa.io/get-pip.py
# Python versions to be installed in /opt/$VERSION_NO
CPYTHON_VERSIONS=${CPYTHON_VERSIONS:-"3.8.1 3.9.0 3.10.1 3.11.0 3.12.0 3.13.0"}
CPYTHON_VERSIONS=${CPYTHON_VERSIONS:-"3.8.1 3.9.0 3.10.1 3.11.0 3.12.0 3.13.0 3.13.0t"}
function check_var {
if [ -z "$1" ]; then
@ -22,6 +22,13 @@ function do_cpython_build {
check_var $py_ver
check_var $py_folder
tar -xzf Python-$py_ver.tgz
local additional_flags=""
if [ "$py_ver" == "3.13.0t" ]; then
additional_flags=" --disable-gil"
mv cpython-3.13/ cpython-3.13t/
fi
pushd $py_folder
local prefix="/opt/_internal/cpython-${py_ver}"
@ -37,8 +44,10 @@ function do_cpython_build {
local openssl_flags="--with-openssl=${WITH_OPENSSL} --with-openssl-rpath=auto"
fi
# -Wformat added for https://bugs.python.org/issue17547 on Python 2.6
CFLAGS="-Wformat" ./configure --prefix=${prefix} ${openssl_flags} ${shared_flags} > /dev/null
CFLAGS="-Wformat" ./configure --prefix=${prefix} ${openssl_flags} ${shared_flags} ${additional_flags} > /dev/null
make -j40 > /dev/null
make install > /dev/null
@ -69,7 +78,14 @@ function build_cpython {
check_var $py_ver
check_var $PYTHON_DOWNLOAD_URL
local py_ver_folder=$py_ver
if [ "$py_ver" = "3.13.0" ]; then
if [ "$py_ver" = "3.13.0t" ]; then
PY_VER_SHORT="3.13"
PYT_VER_SHORT="3.13t"
check_var $PYTHON_DOWNLOAD_GITHUB_BRANCH
wget $PYTHON_DOWNLOAD_GITHUB_BRANCH/$PY_VER_SHORT.tar.gz -O Python-$py_ver.tgz
do_cpython_build $py_ver cpython-$PYT_VER_SHORT
elif [ "$py_ver" = "3.13.0" ]; then
PY_VER_SHORT="3.13"
check_var $PYTHON_DOWNLOAD_GITHUB_BRANCH
wget $PYTHON_DOWNLOAD_GITHUB_BRANCH/$PY_VER_SHORT.tar.gz -O Python-$py_ver.tgz

View File

@ -5,7 +5,7 @@ set -ex
# cuSPARSELt license: https://docs.nvidia.com/cuda/cusparselt/license.html
mkdir tmp_cusparselt && cd tmp_cusparselt
if [[ ${CUDA_VERSION:0:4} =~ ^12\.[2-4]$ ]]; then
if [[ ${CUDA_VERSION:0:4} =~ ^12\.[2-6]$ ]]; then
arch_path='sbsa'
export TARGETARCH=${TARGETARCH:-$(uname -m)}
if [ ${TARGETARCH} = 'amd64' ] || [ "${TARGETARCH}" = 'x86_64' ]; then

View File

@ -10,6 +10,21 @@ if [[ -z $ROCM_VERSION ]]; then
exit 1;
fi
IS_UBUNTU=0
ID=$(grep -oP '(?<=^ID=).+' /etc/os-release | tr -d '"')
case "$ID" in
ubuntu)
IS_UBUNTU=1
;;
centos)
IS_UBUNTU=0
;;
*)
echo "Unable to determine OS..."
exit 1
;;
esac
# To make version comparison easier, create an integer representation.
save_IFS="$IFS"
IFS=. ROCM_VERSION_ARRAY=(${ROCM_VERSION})
@ -57,9 +72,11 @@ MIOPEN_CMAKE_COMMON_FLAGS="
-DMIOPEN_BUILD_DRIVER=OFF
"
# Pull MIOpen repo and set DMIOPEN_EMBED_DB based on ROCm version
if [[ $ROCM_INT -ge 60200 ]] && [[ $ROCM_INT -lt 60300 ]]; then
echo "ROCm 6.2 MIOpen does not need any patches, do not build from source"
if [[ $ROCM_INT -ge 60300 ]]; then
echo "ROCm 6.3+ MIOpen does not need any patches, do not build from source"
exit 0
elif [[ $ROCM_INT -ge 60200 ]] && [[ $ROCM_INT -lt 60300 ]]; then
MIOPEN_BRANCH="release/rocm-rel-6.2-staging"
elif [[ $ROCM_INT -ge 60100 ]] && [[ $ROCM_INT -lt 60200 ]]; then
echo "ROCm 6.1 MIOpen does not need any patches, do not build from source"
exit 0
@ -93,12 +110,21 @@ else
exit 1
fi
yum remove -y miopen-hip
if [[ ${IS_UBUNTU} == 1 ]]; then
apt-get remove -y miopen-hip
else
yum remove -y miopen-hip
fi
git clone https://github.com/ROCm/MIOpen -b ${MIOPEN_BRANCH}
pushd MIOpen
# remove .git to save disk space since CI runner was running out
rm -rf .git
# Don't build CK to save docker build time
if [[ $ROCM_INT -ge 60200 ]]; then
sed -i '/composable_kernel/d' requirements.txt
fi
# Don't build MLIR to save docker build time
# since we are disabling MLIR backend for MIOpen anyway
if [[ $ROCM_INT -ge 50400 ]] && [[ $ROCM_INT -lt 50500 ]]; then
@ -111,10 +137,15 @@ cmake -P install_deps.cmake --minimum
# clean up since CI runner was running out of disk space
rm -rf /tmp/*
yum clean all
rm -rf /var/cache/yum
rm -rf /var/lib/yum/yumdb
rm -rf /var/lib/yum/history
if [[ ${IS_UBUNTU} == 1 ]]; then
apt-get autoclean && apt-get clean
rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
else
yum clean all
rm -rf /var/cache/yum
rm -rf /var/lib/yum/yumdb
rm -rf /var/lib/yum/history
fi
## Build MIOpen
mkdir -p build
@ -131,7 +162,11 @@ make -j $(nproc) package
# clean up since CI runner was running out of disk space
rm -rf /usr/local/cget
yum install -y miopen-*.rpm
if [[ ${IS_UBUNTU} == 1 ]]; then
sudo dpkg -i miopen-hip*.deb
else
yum install -y miopen-*.rpm
fi
popd
rm -rf MIOpen

View File

@ -37,6 +37,12 @@ esac
(
set -x
# TODO: Remove LimitNOFILE=1048576 patch once https://github.com/pytorch/test-infra/issues/5712
# is resolved. This patch is required in order to fix timing out of Docker build on Amazon Linux 2023.
sudo sed -i s/LimitNOFILE=infinity/LimitNOFILE=1048576/ /usr/lib/systemd/system/docker.service
sudo systemctl daemon-reload
sudo systemctl restart docker
docker build \
--target final \
--progress plain \

View File

@ -10,6 +10,7 @@ ENV LANG en_US.UTF-8
ENV LANGUAGE en_US.UTF-8
ARG DEVTOOLSET_VERSION=9
# Note: This is required patch since CentOS have reached EOL
# otherwise any yum install setp will fail
RUN sed -i s/mirror.centos.org/vault.centos.org/g /etc/yum.repos.d/*.repo

View File

@ -124,7 +124,14 @@ if [[ -n ${MANY_LINUX_VERSION} && -z ${DOCKERFILE_SUFFIX} ]]; then
fi
(
set -x
DOCKER_BUILDKIT=1 docker build \
# TODO: Remove LimitNOFILE=1048576 patch once https://github.com/pytorch/test-infra/issues/5712
# is resolved. This patch is required in order to fix timing out of Docker build on Amazon Linux 2023.
sudo sed -i s/LimitNOFILE=infinity/LimitNOFILE=1048576/ /usr/lib/systemd/system/docker.service
sudo systemctl daemon-reload
sudo systemctl restart docker
DOCKER_BUILDKIT=1 docker build \
${DOCKER_GPU_BUILD_ARG} \
--build-arg "GPU_IMAGE=${GPU_IMAGE}" \
--target "${TARGET}" \

@ -90,7 +90,7 @@ librosa>=0.6.2 ; python_version < "3.11"
#Pinned versions:
#test that import:
mypy==1.10.0
mypy==1.11.2
# Pin MyPy version because new errors are likely to appear with each release
#Description: linter
#Pinned versions: 1.10.0
@ -337,3 +337,8 @@ onnxscript==0.1.0.dev20240817
#Description: Required by mypy and test_public_bindings.py when checking torch.onnx._internal
#Pinned versions:
#test that import:
parameterized==0.8.1
#Description: Parameterizes unittests, both the tests themselves and the entire testing class
#Pinned versions:
#test that import:

@ -1 +1 @@
3.0.0
3.1.0

@ -68,6 +68,8 @@ RUN rm install_rocm.sh
COPY ./common/install_rocm_magma.sh install_rocm_magma.sh
RUN bash ./install_rocm_magma.sh
RUN rm install_rocm_magma.sh
ADD ./common/install_miopen.sh install_miopen.sh
RUN bash ./install_miopen.sh ${ROCM_VERSION} && rm install_miopen.sh
ENV ROCM_PATH /opt/rocm
ENV PATH /opt/rocm/bin:$PATH
ENV PATH /opt/rocm/hcc/bin:$PATH
@ -121,5 +123,8 @@ RUN bash ./install_cache.sh && rm install_cache.sh
ARG BUILD_ENVIRONMENT
ENV BUILD_ENVIRONMENT ${BUILD_ENVIRONMENT}
# Install LLVM dev version (Defined in the pytorch/builder github repository)
COPY --from=pytorch/llvm:9.0.1 /opt/llvm /opt/llvm
USER jenkins
CMD ["bash"]

@ -49,13 +49,8 @@ if [[ ${BUILD_ENVIRONMENT} == *"parallelnative"* ]]; then
fi
# Enable LLVM dependency for TensorExpr testing
if [[ "$BUILD_ENVIRONMENT" == *rocm* ]]; then
export USE_LLVM=/opt/rocm/llvm
export LLVM_DIR=/opt/rocm/llvm/lib/cmake/llvm
else
export USE_LLVM=/opt/llvm
export LLVM_DIR=/opt/llvm/lib/cmake/llvm
fi
export USE_LLVM=/opt/llvm
export LLVM_DIR=/opt/llvm/lib/cmake/llvm
if [[ "$BUILD_ENVIRONMENT" == *executorch* ]]; then
# To build test_edge_op_registration
@ -237,7 +232,7 @@ fi
# Do not change workspace permissions for ROCm CI jobs
# as it can leave workspace with bad permissions for cancelled jobs
if [[ "$BUILD_ENVIRONMENT" != *rocm* ]]; then
if [[ "$BUILD_ENVIRONMENT" != *rocm* && "$BUILD_ENVIRONMENT" != *s390x* ]]; then
# Workaround for dind-rootless userid mapping (https://github.com/pytorch/ci-infra/issues/96)
WORKSPACE_ORIGINAL_OWNER_ID=$(stat -c '%u' "/var/lib/jenkins/workspace")
cleanup_workspace() {
@ -283,6 +278,7 @@ else
# set only when building other architectures
# or building non-XLA tests.
if [[ "$BUILD_ENVIRONMENT" != *rocm* &&
"$BUILD_ENVIRONMENT" != *s390x* &&
"$BUILD_ENVIRONMENT" != *xla* ]]; then
if [[ "$BUILD_ENVIRONMENT" != *py3.8* ]]; then
# Install numpy-2.0.2 for builds which are backward compatible with 1.X
@ -345,11 +341,11 @@ else
CUSTOM_OP_BUILD="${CUSTOM_TEST_ARTIFACT_BUILD_DIR}/custom-op-build"
CUSTOM_OP_TEST="$PWD/test/custom_operator"
python --version
SITE_PACKAGES="$(python -c 'from distutils.sysconfig import get_python_lib; print(get_python_lib())')"
SITE_PACKAGES="$(python -c 'import site; print(";".join([x for x in site.getsitepackages()] + [x + "/torch" for x in site.getsitepackages()]))')"
mkdir -p "$CUSTOM_OP_BUILD"
pushd "$CUSTOM_OP_BUILD"
cmake "$CUSTOM_OP_TEST" -DCMAKE_PREFIX_PATH="$SITE_PACKAGES/torch;$SITE_PACKAGES" -DPython_EXECUTABLE="$(which python)" \
cmake "$CUSTOM_OP_TEST" -DCMAKE_PREFIX_PATH="$SITE_PACKAGES" -DPython_EXECUTABLE="$(which python)" \
-DCMAKE_MODULE_PATH="$CUSTOM_TEST_MODULE_PATH" -DUSE_ROCM="$CUSTOM_TEST_USE_ROCM"
make VERBOSE=1
popd
@ -359,10 +355,10 @@ else
JIT_HOOK_BUILD="${CUSTOM_TEST_ARTIFACT_BUILD_DIR}/jit-hook-build"
JIT_HOOK_TEST="$PWD/test/jit_hooks"
python --version
SITE_PACKAGES="$(python -c 'from distutils.sysconfig import get_python_lib; print(get_python_lib())')"
SITE_PACKAGES="$(python -c 'import site; print(";".join([x for x in site.getsitepackages()] + [x + "/torch" for x in site.getsitepackages()]))')"
mkdir -p "$JIT_HOOK_BUILD"
pushd "$JIT_HOOK_BUILD"
cmake "$JIT_HOOK_TEST" -DCMAKE_PREFIX_PATH="$SITE_PACKAGES/torch;$SITE_PACKAGES" -DPython_EXECUTABLE="$(which python)" \
cmake "$JIT_HOOK_TEST" -DCMAKE_PREFIX_PATH="$SITE_PACKAGES" -DPython_EXECUTABLE="$(which python)" \
-DCMAKE_MODULE_PATH="$CUSTOM_TEST_MODULE_PATH" -DUSE_ROCM="$CUSTOM_TEST_USE_ROCM"
make VERBOSE=1
popd
@ -374,7 +370,7 @@ else
python --version
mkdir -p "$CUSTOM_BACKEND_BUILD"
pushd "$CUSTOM_BACKEND_BUILD"
cmake "$CUSTOM_BACKEND_TEST" -DCMAKE_PREFIX_PATH="$SITE_PACKAGES/torch;$SITE_PACKAGES" -DPython_EXECUTABLE="$(which python)" \
cmake "$CUSTOM_BACKEND_TEST" -DCMAKE_PREFIX_PATH="$SITE_PACKAGES" -DPython_EXECUTABLE="$(which python)" \
-DCMAKE_MODULE_PATH="$CUSTOM_TEST_MODULE_PATH" -DUSE_ROCM="$CUSTOM_TEST_USE_ROCM"
make VERBOSE=1
popd
@ -407,6 +403,6 @@ fi
# snadampal: skipping it till sccache support added for aarch64
# https://github.com/pytorch/pytorch/issues/121559
if [[ "$BUILD_ENVIRONMENT" != *aarch64* ]]; then
if [[ "$BUILD_ENVIRONMENT" != *aarch64* && "$BUILD_ENVIRONMENT" != *s390x* ]]; then
print_sccache_stats
fi
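
The `SITE_PACKAGES` one-liner in the hunks above replaces the `distutils` lookup with `site.getsitepackages()` and emits a ready-made, semicolon-separated `CMAKE_PREFIX_PATH`. A minimal sketch of the same string construction, unrolled for readability; the helper name `cmake_prefix_path` and its optional `dirs` parameter are illustrative assumptions, not part of the build script:

```python
import site


def cmake_prefix_path(dirs=None):
    """Build a ';'-separated prefix list: every site-packages directory
    first, then the torch subdirectory of each one (hypothetical helper)."""
    if dirs is None:
        # Same lookup the shell one-liner performs.
        dirs = site.getsitepackages()
    dirs = list(dirs)
    return ";".join(dirs + [d + "/torch" for d in dirs])


# Prints the value the script would pass as -DCMAKE_PREFIX_PATH on this interpreter.
print(cmake_prefix_path())
```

Because the list already contains the `/torch` entries, the `cmake` invocations can pass `$SITE_PACKAGES` directly instead of appending `/torch` themselves, which would not work once the variable holds more than one directory.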

@ -1,4 +1,4 @@
from datetime import datetime, timedelta
from datetime import datetime, timedelta, timezone
from tempfile import mkdtemp
from cryptography import x509
@ -42,10 +42,10 @@ def create_cert(path, C, ST, L, O, key):
.issuer_name(issuer)
.public_key(key.public_key())
.serial_number(x509.random_serial_number())
.not_valid_before(datetime.utcnow())
.not_valid_before(datetime.now(timezone.utc))
.not_valid_after(
# Our certificate will be valid for 10 days
datetime.utcnow()
datetime.now(timezone.utc)
+ timedelta(days=10)
)
.add_extension(
@ -88,10 +88,10 @@ def sign_certificate_request(path, csr_cert, ca_cert, private_ca_key):
.issuer_name(ca_cert.subject)
.public_key(csr_cert.public_key())
.serial_number(x509.random_serial_number())
.not_valid_before(datetime.utcnow())
.not_valid_before(datetime.now(timezone.utc))
.not_valid_after(
# Our certificate will be valid for 10 days
datetime.utcnow()
datetime.now(timezone.utc)
+ timedelta(days=10)
# Sign our certificate with our private key
)
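
The change above swaps `datetime.utcnow()`, which returns a naive datetime and is deprecated as of Python 3.12, for the timezone-aware `datetime.now(timezone.utc)`. A minimal sketch of the resulting validity window; the helper name `validity_window` is an assumption made for illustration and does not appear in the script:

```python
from datetime import datetime, timedelta, timezone


def validity_window(days=10):
    """Return (not_before, not_after) as timezone-aware UTC datetimes
    (hypothetical helper mirroring the 10-day validity in the diff)."""
    not_before = datetime.now(timezone.utc)
    not_after = not_before + timedelta(days=days)
    return not_before, not_after


start, end = validity_window()
# Both values carry tzinfo, so they compare safely with other aware datetimes.
print(start.isoformat(), end.isoformat())
```

Note that the diff itself calls `datetime.now(timezone.utc)` twice, so its two timestamps differ by a few microseconds; computing the end from the start, as in this sketch, is a small design variation rather than what the script does.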

@ -9,15 +9,13 @@ if [[ -n "$CONDA_ENV" ]]; then
export PATH="$CONDA_ENV/bin":$PATH
fi
# Test that OpenMP is enabled for non-arm64 build
if [[ ${BUILD_ENVIRONMENT} != *arm64* ]]; then
pushd test
if [[ ! $(python -c "import torch; print(int(torch.backends.openmp.is_available()))") == "1" ]]; then
echo "Build should have OpenMP enabled, but torch.backends.openmp.is_available() is False"
exit 1
fi
popd
# Test that OpenMP is enabled
pushd test
if [[ ! $(python -c "import torch; print(int(torch.backends.openmp.is_available()))") == "1" ]]; then
echo "Build should have OpenMP enabled, but torch.backends.openmp.is_available() is False"
exit 1
fi
popd
setup_test_python() {
# The CircleCI worker hostname doesn't resolve to an address.
@ -27,8 +25,9 @@ setup_test_python() {
echo "Ninja version: $(ninja --version)"
echo "Python version: $(which python) ($(python --version))"
# Increase default limit on open file handles from 256 to 1024
ulimit -n 1024
# Set the limit on open file handles to 16384, which
# might help with intermittent compiler test failures
ulimit -n 16384
}
test_python_all() {

@ -401,9 +401,9 @@ pr_time_benchmarks() {
TEST_REPORTS_DIR=$(pwd)/test/test-reports
mkdir -p "$TEST_REPORTS_DIR"
PYTHONPATH=$(pwd)/benchmarks/dynamo/pr_time_benchmarks source benchmarks/dynamo/pr_time_benchmarks/benchmark_runner.sh "$TEST_REPORTS_DIR/pr_time_benchmarks_after.txt" "benchmarks/dynamo/pr_time_benchmarks/benchmarks"
PYTHONPATH=$(pwd)/benchmarks/dynamo/pr_time_benchmarks source benchmarks/dynamo/pr_time_benchmarks/benchmark_runner.sh "$TEST_REPORTS_DIR/pr_time_benchmarks_results.csv" "benchmarks/dynamo/pr_time_benchmarks/benchmarks"
echo "benchmark results on current PR: "
cat "$TEST_REPORTS_DIR/pr_time_benchmarks_after.txt"
cat "$TEST_REPORTS_DIR/pr_time_benchmarks_results.csv"
}
@ -1383,14 +1383,16 @@ test_executorch() {
assert_git_not_dirty
}
test_linux_aarch64(){
test_linux_aarch64() {
python test/run_test.py --include test_modules test_mkldnn test_mkldnn_fusion test_openmp test_torch test_dynamic_shapes \
test_transformers test_multiprocessing test_numpy_interop --verbose
test_transformers test_multiprocessing test_numpy_interop \
--shard "$SHARD_NUMBER" "$NUM_TEST_SHARDS" --verbose
# Dynamo tests
python test/run_test.py --include dynamo/test_compile dynamo/test_backends dynamo/test_comptime dynamo/test_config \
dynamo/test_functions dynamo/test_fx_passes_pre_grad dynamo/test_interop dynamo/test_model_output dynamo/test_modules \
dynamo/test_optimizers dynamo/test_recompile_ux dynamo/test_recompiles --verbose
dynamo/test_optimizers dynamo/test_recompile_ux dynamo/test_recompiles \
--shard "$SHARD_NUMBER" "$NUM_TEST_SHARDS" --verbose
# Inductor tests
python test/run_test.py --include inductor/test_torchinductor inductor/test_benchmark_fusion inductor/test_codecache \
@ -1400,7 +1402,8 @@ test_linux_aarch64(){
inductor/test_max_autotune inductor/test_memory_planning inductor/test_metrics inductor/test_multi_kernel inductor/test_pad_mm \
inductor/test_pattern_matcher inductor/test_perf inductor/test_profiler inductor/test_select_algorithm inductor/test_smoke \
inductor/test_split_cat_fx_passes inductor/test_standalone_compile inductor/test_torchinductor \
inductor/test_torchinductor_codegen_dynamic_shapes inductor/test_torchinductor_dynamic_shapes --verbose
inductor/test_torchinductor_codegen_dynamic_shapes inductor/test_torchinductor_dynamic_shapes inductor/test_memory \
--shard "$SHARD_NUMBER" "$NUM_TEST_SHARDS" --verbose
}
if ! [[ "${BUILD_ENVIRONMENT}" == *libtorch* || "${BUILD_ENVIRONMENT}" == *-bazel-* ]]; then

@ -43,6 +43,9 @@ python -m pip install z3-solver==4.12.2.0
# Install tlparse for test\dynamo\test_structured_trace.py UTs.
python -m pip install tlparse==0.3.25
# Install parameterized
python -m pip install parameterized==0.8.1
run_tests() {
# Run nvidia-smi if available
for path in '/c/Program Files/NVIDIA Corporation/NVSMI/nvidia-smi.exe' /c/Windows/System32/nvidia-smi.exe; do

@ -32,30 +32,6 @@ self-hosted-runner:
- lf.linux.8xlarge.nvidia.gpu
- lf.linux.16xlarge.nvidia.gpu
- lf.linux.g5.4xlarge.nvidia.gpu
# Organization-wide AWS Linux Runners with new Amazon 2023 AMI
- amz2023.linux.large
- amz2023.linux.2xlarge
- amz2023.linux.4xlarge
- amz2023.linux.12xlarge
- amz2023.linux.24xlarge
- amz2023.linux.arm64.2xlarge
- amz2023.linux.arm64.m7g.4xlarge
- amz2023.linux.arm64.m7g.4xlarge.ephemeral
- amz2023.linux.4xlarge.nvidia.gpu
- amz2023.linux.8xlarge.nvidia.gpu
- amz2023.linux.16xlarge.nvidia.gpu
- amz2023.linux.g5.4xlarge.nvidia.gpu
# Pytorch/pytorch AWS Linux Runners with the new Amazon 2023 AMI on Linux Foundation account
- amz2023.lf.linux.large
- amz2023.lf.linux.2xlarge
- amz2023.lf.linux.4xlarge
- amz2023.lf.linux.12xlarge
- amz2023.lf.linux.24xlarge
- amz2023.lf.linux.arm64.2xlarge
- amz2023.lf.linux.4xlarge.nvidia.gpu
- amz2023.lf.linux.8xlarge.nvidia.gpu
- amz2023.lf.linux.16xlarge.nvidia.gpu
- amz2023.lf.linux.g5.4xlarge.nvidia.gpu
# Repo-specific IBM hosted S390x runner
- linux.s390x
# Organization wide AWS Windows runners

@ -1 +1 @@
97ed7b36b7a741253d4e41e4da3c901d83294503
ba696ea3dfec4cbe693bf06a84c75dc196077f5b

@ -7,10 +7,14 @@
# runners. Runners listed here will be available as self hosted
# runners, configuration is directly pulled from the main branch.
#
# NOTE (Apr, 5, 2021): Linux runners are currently all an amazonlinux2
#
# NOTE (Jan 5, 2021): Linux runners are all non-ephemeral to reduce the amount of CreateInstaces calls
# to avoid RequestLimitExceeded issues
# NOTES:
# - Linux runners are by default non-ephemeral to reduce the amount of CreateInstaces calls
# to avoid RequestLimitExceeded issues
# - When updating this file, run the following command to validate the YAML and to generate
# corresponding versions of scale-config for the pytorch/pytorch repo and merge the
# pytorch/pytorch changes before merging these changes.
# `python .github/scripts/validate_scale_config.py --test-infra-repo-root [path_to_test-infra_root] --pytorch-repo-root [path_to_pytorch_root]`
#
# TODO: Add some documentation on how the auto-scaling works
#
@ -31,58 +35,36 @@ runner_types:
is_ephemeral: false
max_available: 1000
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
variants:
amz2023:
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
am2:
ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.c.linux.10xlarge.avx2:
disk_size: 200
instance_type: m4.10xlarge
is_ephemeral: false
max_available: 450
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
variants:
amz2023:
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
am2:
ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.c.linux.24xl.spr-metal:
disk_size: 200
instance_type: c7i.metal-24xl
is_ephemeral: false
max_available: 150
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
variants:
amz2023:
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
am2:
ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.c.linux.16xlarge.spr:
disk_size: 200
instance_type: c7i.16xlarge
is_ephemeral: false
max_available: 150
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
variants:
amz2023:
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
am2:
ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.c.linux.9xlarge.ephemeral:
disk_size: 200
instance_type: c5.9xlarge
is_ephemeral: true
max_available: 50
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
variants:
amz2023:
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
am2:
ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs
lf.c.linux.12xlarge.ephemeral:
@ -91,240 +73,140 @@ runner_types:
is_ephemeral: true
max_available: 300
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
variants:
amz2023:
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
am2:
ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.c.linux.16xlarge.nvidia.gpu:
disk_size: 150
instance_type: g3.16xlarge
is_ephemeral: false
max_available: 150
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
variants:
amz2023:
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
am2:
ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.c.linux.24xlarge:
disk_size: 150
instance_type: c5.24xlarge
is_ephemeral: false
max_available: 500
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
variants:
amz2023:
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
am2:
ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.c.linux.24xlarge.ephemeral:
disk_size: 150
instance_type: c5.24xlarge
is_ephemeral: true
max_available: 200
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
variants:
amz2023:
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
am2:
ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.c.linux.2xlarge:
disk_size: 150
instance_type: c5.2xlarge
is_ephemeral: false
max_available: 3120
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
variants:
amz2023:
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
am2:
ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.c.linux.4xlarge:
disk_size: 150
instance_type: c5.4xlarge
is_ephemeral: false
max_available: 1000
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
variants:
amz2023:
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
am2:
ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.c.linux.4xlarge.nvidia.gpu:
disk_size: 150
instance_type: g3.4xlarge
is_ephemeral: false
max_available: 1000
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
variants:
amz2023:
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
am2:
ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.c.linux.8xlarge.nvidia.gpu:
disk_size: 150
instance_type: g3.8xlarge
is_ephemeral: false
max_available: 400
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
variants:
amz2023:
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
am2:
ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.c.linux.g4dn.12xlarge.nvidia.gpu:
disk_size: 150
instance_type: g4dn.12xlarge
is_ephemeral: false
max_available: 250
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
variants:
amz2023:
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
am2:
ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.c.linux.g4dn.metal.nvidia.gpu:
disk_size: 150
instance_type: g4dn.metal
is_ephemeral: false
max_available: 300
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
variants:
amz2023:
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
am2:
ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.c.linux.g5.48xlarge.nvidia.gpu:
disk_size: 150
instance_type: g5.48xlarge
is_ephemeral: false
max_available: 200
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
variants:
amz2023:
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
am2:
ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.c.linux.g5.12xlarge.nvidia.gpu:
disk_size: 150
instance_type: g5.12xlarge
is_ephemeral: false
max_available: 150
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
variants:
amz2023:
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
am2:
ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.c.linux.g5.4xlarge.nvidia.gpu:
disk_size: 150
instance_type: g5.4xlarge
is_ephemeral: false
max_available: 2400
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
variants:
amz2023:
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
am2:
ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.c.linux.g6.4xlarge.experimental.nvidia.gpu:
disk_size: 150
instance_type: g6.4xlarge
is_ephemeral: false
max_available: 50
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
variants:
amz2023:
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
am2:
ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.c.linux.large:
max_available: 1200
disk_size: 15
instance_type: c5.large
is_ephemeral: false
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
variants:
amz2023:
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
am2:
ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.c.linux.arm64.2xlarge:
disk_size: 256
instance_type: t4g.2xlarge
is_ephemeral: false
max_available: 200
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-arm64
variants:
amz2023:
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-arm64
am2:
ami: amzn2-ami-hvm-2.0.20240306.2-arm64-gp2
ami: al2023-ami-2023.5.202*-kernel-6.1-arm64
lf.c.linux.arm64.m7g.4xlarge:
disk_size: 256
instance_type: m7g.4xlarge
is_ephemeral: false
max_available: 200
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-arm64
variants:
amz2023:
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-arm64
am2:
ami: amzn2-ami-hvm-2.0.20240306.2-arm64-gp2
ami: al2023-ami-2023.5.202*-kernel-6.1-arm64
lf.c.linux.arm64.2xlarge.ephemeral:
disk_size: 256
instance_type: t4g.2xlarge
is_ephemeral: true
max_available: 200
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-arm64
variants:
amz2023:
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-arm64
am2:
ami: amzn2-ami-hvm-2.0.20240306.2-arm64-gp2
ami: al2023-ami-2023.5.202*-kernel-6.1-arm64
lf.c.linux.arm64.m7g.4xlarge.ephemeral:
disk_size: 256
instance_type: m7g.4xlarge
is_ephemeral: true
max_available: 200
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-arm64
variants:
amz2023:
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-arm64
am2:
ami: amzn2-ami-hvm-2.0.20240306.2-arm64-gp2
ami: al2023-ami-2023.5.202*-kernel-6.1-arm64
lf.c.linux.arm64.m7g.metal:
disk_size: 256
instance_type: m7g.metal
is_ephemeral: false
max_available: 100
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-arm64
variants:
amz2023:
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-arm64
am2:
ami: amzn2-ami-hvm-2.0.20240306.2-arm64-gp2
ami: al2023-ami-2023.5.202*-kernel-6.1-arm64
lf.c.windows.g4dn.xlarge:
disk_size: 256
instance_type: g4dn.xlarge

@ -7,10 +7,14 @@
# runners. Runners listed here will be available as self hosted
# runners, configuration is directly pulled from the main branch.
#
# NOTE (Apr, 5, 2021): Linux runners are currently all an amazonlinux2
#
# NOTE (Jan 5, 2021): Linux runners are all non-ephemeral to reduce the amount of CreateInstaces calls
# to avoid RequestLimitExceeded issues
# NOTES:
# - Linux runners are by default non-ephemeral to reduce the amount of CreateInstaces calls
# to avoid RequestLimitExceeded issues
# - When updating this file, run the following command to validate the YAML and to generate
# corresponding versions of scale-config for the pytorch/pytorch repo and merge the
# pytorch/pytorch changes before merging these changes.
# `python .github/scripts/validate_scale_config.py --test-infra-repo-root [path_to_test-infra_root] --pytorch-repo-root [path_to_pytorch_root]`
#
# TODO: Add some documentation on how the auto-scaling works
#
@ -31,58 +35,36 @@ runner_types:
is_ephemeral: false
max_available: 1000
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
variants:
amz2023:
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
am2:
ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.linux.10xlarge.avx2:
disk_size: 200
instance_type: m4.10xlarge
is_ephemeral: false
max_available: 450
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
variants:
amz2023:
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
am2:
ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.linux.24xl.spr-metal:
disk_size: 200
instance_type: c7i.metal-24xl
is_ephemeral: false
max_available: 150
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
variants:
amz2023:
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
am2:
ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.linux.16xlarge.spr:
disk_size: 200
instance_type: c7i.16xlarge
is_ephemeral: false
max_available: 150
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
variants:
amz2023:
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
am2:
ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.linux.9xlarge.ephemeral:
disk_size: 200
instance_type: c5.9xlarge
is_ephemeral: true
max_available: 50
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
variants:
amz2023:
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
am2:
ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs
lf.linux.12xlarge.ephemeral:
@ -91,240 +73,140 @@ runner_types:
is_ephemeral: true
max_available: 300
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
variants:
amz2023:
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
am2:
ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.linux.16xlarge.nvidia.gpu:
disk_size: 150
instance_type: g3.16xlarge
is_ephemeral: false
max_available: 150
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
variants:
amz2023:
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
am2:
ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.linux.24xlarge:
disk_size: 150
instance_type: c5.24xlarge
is_ephemeral: false
max_available: 500
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
variants:
amz2023:
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
am2:
ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.linux.24xlarge.ephemeral:
disk_size: 150
instance_type: c5.24xlarge
is_ephemeral: true
max_available: 200
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
variants:
amz2023:
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
am2:
ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.linux.2xlarge:
disk_size: 150
instance_type: c5.2xlarge
is_ephemeral: false
max_available: 3120
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
variants:
amz2023:
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
am2:
ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.linux.4xlarge:
disk_size: 150
instance_type: c5.4xlarge
is_ephemeral: false
max_available: 1000
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
variants:
amz2023:
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
am2:
ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.linux.4xlarge.nvidia.gpu:
disk_size: 150
instance_type: g3.4xlarge
is_ephemeral: false
max_available: 1000
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
variants:
amz2023:
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
am2:
ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.linux.8xlarge.nvidia.gpu:
disk_size: 150
instance_type: g3.8xlarge
is_ephemeral: false
max_available: 400
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
variants:
amz2023:
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
am2:
ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.linux.g4dn.12xlarge.nvidia.gpu:
disk_size: 150
instance_type: g4dn.12xlarge
is_ephemeral: false
max_available: 250
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
variants:
amz2023:
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
am2:
ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.linux.g4dn.metal.nvidia.gpu:
disk_size: 150
instance_type: g4dn.metal
is_ephemeral: false
max_available: 300
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
variants:
amz2023:
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
am2:
ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.linux.g5.48xlarge.nvidia.gpu:
disk_size: 150
instance_type: g5.48xlarge
is_ephemeral: false
max_available: 200
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
variants:
amz2023:
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
am2:
ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.linux.g5.12xlarge.nvidia.gpu:
disk_size: 150
instance_type: g5.12xlarge
is_ephemeral: false
max_available: 150
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
variants:
amz2023:
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
am2:
ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.linux.g5.4xlarge.nvidia.gpu:
disk_size: 150
instance_type: g5.4xlarge
is_ephemeral: false
max_available: 2400
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
variants:
amz2023:
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
am2:
ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.linux.g6.4xlarge.experimental.nvidia.gpu:
disk_size: 150
instance_type: g6.4xlarge
is_ephemeral: false
max_available: 50
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
variants:
amz2023:
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
am2:
ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.linux.large:
max_available: 1200
disk_size: 15
instance_type: c5.large
is_ephemeral: false
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
variants:
amz2023:
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
am2:
ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.linux.arm64.2xlarge:
disk_size: 256
instance_type: t4g.2xlarge
is_ephemeral: false
max_available: 200
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-arm64
variants:
amz2023:
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-arm64
am2:
ami: amzn2-ami-hvm-2.0.20240306.2-arm64-gp2
ami: al2023-ami-2023.5.202*-kernel-6.1-arm64
lf.linux.arm64.m7g.4xlarge:
disk_size: 256
instance_type: m7g.4xlarge
is_ephemeral: false
max_available: 200
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-arm64
variants:
amz2023:
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-arm64
am2:
ami: amzn2-ami-hvm-2.0.20240306.2-arm64-gp2
ami: al2023-ami-2023.5.202*-kernel-6.1-arm64
lf.linux.arm64.2xlarge.ephemeral:
disk_size: 256
instance_type: t4g.2xlarge
is_ephemeral: true
max_available: 200
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-arm64
variants:
amz2023:
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-arm64
am2:
ami: amzn2-ami-hvm-2.0.20240306.2-arm64-gp2
ami: al2023-ami-2023.5.202*-kernel-6.1-arm64
lf.linux.arm64.m7g.4xlarge.ephemeral:
disk_size: 256
instance_type: m7g.4xlarge
is_ephemeral: true
max_available: 200
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-arm64
variants:
amz2023:
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-arm64
am2:
ami: amzn2-ami-hvm-2.0.20240306.2-arm64-gp2
ami: al2023-ami-2023.5.202*-kernel-6.1-arm64
lf.linux.arm64.m7g.metal:
disk_size: 256
instance_type: m7g.metal
is_ephemeral: false
max_available: 100
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-arm64
variants:
amz2023:
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-arm64
am2:
ami: amzn2-ami-hvm-2.0.20240306.2-arm64-gp2
ami: al2023-ami-2023.5.202*-kernel-6.1-arm64
lf.windows.g4dn.xlarge:
disk_size: 256
instance_type: g4dn.xlarge
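
Both scale-config diffs above replace the pinned AMI name with a wildcard (`2023.5.202*`) and drop the per-runner `variants`. How the autoscaler resolves that pattern is not shown in this diff; the sketch below assumes shell-style glob semantics, where `*` matches any run of characters, and tests the pattern against AMI names that appear in the hunks:

```python
from fnmatch import fnmatchcase

# Wildcard pattern introduced by the diff for x86_64 runners.
AMI_PATTERN = "al2023-ami-2023.5.202*-kernel-6.1-x86_64"

# Names taken from the old and new lines of the diff.
candidates = [
    "al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64",  # previously pinned AL2023 image
    "al2023-ami-2023.5.20240701.0-kernel-6.1-arm64",   # arm64 image, different suffix
    "amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs",         # removed Amazon Linux 2 variant
]

# Keep only the names the glob accepts (case-sensitive match).
matches = [name for name in candidates if fnmatchcase(name, AMI_PATTERN)]
print(matches)
```

Under that assumption, the wildcard accepts any AL2023 2023.5 x86_64 image whose date stamp starts with `202`, so a newer image of the same release line would be picked up without editing the file.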

@ -86,6 +86,18 @@
- pull
- inductor
- name: OSS CI / pytorchbot / slow tests
patterns:
- test/slow_tests.json
approved_by:
- pytorchbot
ignore_flaky_failures: false
mandatory_checks_name:
- EasyCLA
- Lint
- pull
- slow
- name: OSS CI /pytorchbot / Executorch
patterns:
- .ci/docker/ci_commit_pins/executorch.txt
@ -532,6 +544,7 @@
- anijain2305
- bdhirsh
- zou3519
- isuruf
mandatory_checks_name:
- EasyCLA
- Lint

@ -31,3 +31,4 @@ optree==0.12.1
# NB: test_hparams_* from test_tensorboard is failing with protobuf 5.26.0 in
# which the stringify metadata is wrong when escaping double quote
protobuf==3.20.2
parameterized==0.8.1

@ -412,8 +412,8 @@ def generate_wheels_matrix(
),
}
)
# Special build building to use on Colab. PyThon 3.10 for 12.1 CUDA
if python_version == "3.10" and arch_version == "12.1":
# Special build for use on Colab: Python 3.11 with CUDA 12.1
if python_version == "3.11" and arch_version == "12.1":
ret.append(
{
"python_version": python_version,

@ -70,17 +70,15 @@ class BinaryBuildWorkflow:
)
else:
self.build_environment = f"{self.os}-binary-{self.package_type}"
if self.use_split_build:
# added to distinguish concurrency groups
self.build_environment += "-split"
def generate_workflow_file(self, workflow_template: jinja2.Template) -> None:
output_file_path = (
GITHUB_DIR
/ f"workflows/generated-{self.build_environment}-{self.branches}.yml"
)
if self.use_split_build:
output_file_path = (
GITHUB_DIR
/ f"workflows/generated-{self.build_environment}-{self.branches}-split.yml"
)
with open(output_file_path, "w") as output_file:
GENERATED = "generated" # Note: keep the variable GENERATED; otherwise Phabricator will hide the whole file
output_file.writelines([f"# @{GENERATED} DO NOT EDIT MANUALLY\n"])
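
The hunk above moves the `-split` suffix out of the output file path and into `build_environment`, so the suffix now also distinguishes concurrency groups. A minimal sketch of the resulting file naming, covering only the `else` branch visible in the diff; the function name `workflow_file_name` and the example argument values are assumptions for illustration:

```python
def workflow_file_name(os_name, package_type, branches, use_split_build):
    """Derive the generated workflow file name the way the patched code does
    (hypothetical standalone helper; the real logic lives on BinaryBuildWorkflow)."""
    build_environment = f"{os_name}-binary-{package_type}"
    if use_split_build:
        # Suffix is part of the build environment, not tacked onto the file name.
        build_environment += "-split"
    return f"generated-{build_environment}-{branches}.yml"


# Example values are illustrative only.
print(workflow_file_name("linux", "manywheel", "nightly", use_split_build=True))
print(workflow_file_name("linux", "manywheel", "nightly", use_split_build=False))
```

With the old code the suffix came after the branch name (`...-nightly-split.yml`); with the new code it comes before it (`...-split-nightly.yml`), so previously generated split workflow files would be renamed.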

@ -168,6 +168,14 @@ def gh_post_commit_comment(
)
def gh_close_pr(org: str, repo: str, pr_num: int, dry_run: bool = False) -> None:
url = f"{GITHUB_API_URL}/repos/{org}/{repo}/pulls/{pr_num}"
if dry_run:
print(f"Dry run closing PR {pr_num}")
else:
gh_fetch_url(url, method="PATCH", data={"state": "closed"})
def gh_delete_comment(org: str, repo: str, comment_id: int) -> None:
url = f"{GITHUB_API_URL}/repos/{org}/{repo}/issues/comments/{comment_id}"
gh_fetch_url(url, method="DELETE")
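The dry-run branch in gh_close_pr above can be illustrated in isolation. The following is a minimal sketch, not the repository's implementation: build_close_pr_request is a hypothetical helper, the GITHUB_API_URL value is an assumption, and the standard library's urllib is used in place of gh_fetch_url so the example can run without network access or credentials.

```python
import json
import urllib.request
from typing import Optional

# Assumed value; the real constant is defined elsewhere in github_utils.py.
GITHUB_API_URL = "https://api.github.com"


def build_close_pr_request(
    org: str, repo: str, pr_num: int, dry_run: bool = False
) -> Optional[urllib.request.Request]:
    """Hypothetical helper: prepare (but do not send) the request that closes a PR."""
    url = f"{GITHUB_API_URL}/repos/{org}/{repo}/pulls/{pr_num}"
    if dry_run:
        # Mirror the diff: report what would happen and touch nothing.
        print(f"Dry run closing PR {pr_num}")
        return None
    body = json.dumps({"state": "closed"}).encode("utf-8")
    # The request is only constructed here; sending it is left to the caller.
    return urllib.request.Request(url, data=body, method="PATCH")
```

Keeping the dry-run check next to the URL construction means callers such as the merge bot can pass their own dry_run flag straight through without guarding the call site.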

View File

@ -17,6 +17,11 @@ if [[ -d "${CACHE_DIRECTORY}" ]]; then
cp -r "${CACHE_DIRECTORY}" . || true
fi
# if lintrunner is not installed, install it
if ! command -v lintrunner &> /dev/null; then
python3 -m pip install lintrunner==0.12.5
fi
# This has already been cached in the docker image
lintrunner init 2> /dev/null
@ -33,7 +38,7 @@ python3 torch/utils/data/datapipes/gen_pyi.py
RC=0
# Run lintrunner on all files
if ! lintrunner --force-color --all-files --tee-json=lint.json ${ADDITIONAL_LINTRUNNER_ARGS} 2> /dev/null; then
if ! lintrunner --force-color --tee-json=lint.json ${ADDITIONAL_LINTRUNNER_ARGS} 2> /dev/null; then
echo ""
echo -e "\e[1m\e[36mYou can reproduce these results locally by using \`lintrunner -m origin/main\`. (If you don't get the same results, run \'lintrunner init\' to update your local linter)\e[0m"
echo -e "\e[1m\e[36mSee https://github.com/pytorch/pytorch/wiki/lintrunner for setup instructions.\e[0m"

View File

@ -3,49 +3,94 @@
"""
This runner determinator is used to determine which set of runners to run a
GitHub job on. It uses the first comment of a GitHub issue (by default
https://github.com/pytorch/test-infra/issues/5132) as a user list to determine
which users will get their jobs to run on experimental runners. This user list
is also a comma separated list of additional features or experiments which the
user could be opted in to.
https://github.com/pytorch/test-infra/issues/5132) to define the configuration
of which runners should be used to run which job.
The configuration has two parts, the settings and a list of opted-in users,
separated by a line containing "---". If the line is not present, the
settings are considered to be empty with only the second part, the user
list, defined.
The first part is a YAML block that defines the rollout settings. This can be
used to define any settings that are needed to determine which runners to use.
Its fields are defined by the Settings class below.
The second part is a list of users who are explicitly opted in to the LF fleet.
The user list is also a comma separated list of additional features or
experiments which the user could be opted in to.
The user list has the following rules:
- Users are GitHub usernames with the @ prefix
- If the first line is a "*" then all users will use the new runners
- If the first line is a "!" then all users will use the old runners
- Users are GitHub usernames, which must start with the @ prefix
- Each user is also a comma-separated list of features/experiments to enable
- A "#" prefix indicates the user is opted out of the new runners but is opting
into features/experiments.
- A "#" prefix opts the user out of all experiments
Example user list:
Example config:
# A list of experiments that can be opted into.
# This defines the behavior they'll induce when opted into.
# Expected syntax is:
# [experiment_name]: # Name of the experiment. Also used for the label prefix.
# rollout_perc: [int] # % of workflows to run with this experiment when users are not opted in.
@User1
@User2,amz2023
#@UserOptOutOfNewRunner,amz2023
experiments:
lf:
rollout_perc: 25
---
# Opt-ins:
# Users can opt into the LF fleet by adding their GitHub username to this list
# and specifying experiments to enable in a comma-separated list.
# Experiments should be from the above list.
@User1,lf,split_build
@User2,lf
@User3,split_build
"""
import logging
import os
import random
from argparse import ArgumentParser
from logging import LogRecord
from typing import Any, Iterable
from typing import Any, Dict, Iterable, List, NamedTuple, Tuple
import yaml
from github import Auth, Github
from github.Issue import Issue
WORKFLOW_LABEL_META = "" # use meta runners
DEFAULT_LABEL_PREFIX = "" # use meta runners
WORKFLOW_LABEL_LF = "lf." # use runners from the linux foundation
WORKFLOW_LABEL_LF_CANARY = "lf.c." # use canary runners from the linux foundation
RUNNER_AMI_LEGACY = ""
RUNNER_AMI_AMZ2023 = "amz2023"
GITHUB_OUTPUT = os.getenv("GITHUB_OUTPUT", "")
GH_OUTPUT_KEY_AMI = "runner-ami"
GH_OUTPUT_KEY_LABEL_TYPE = "label-type"
SETTING_EXPERIMENTS = "experiments"
LF_FLEET_EXPERIMENT = "lf"
CANARY_FLEET_SUFFIX = ".c"
class Experiment(NamedTuple):
rollout_perc: float = (
0 # Percentage of workflows to experiment on when user is not opted-in.
)
# Add more fields as needed
class Settings(NamedTuple):
"""
Settings for the experiments that can be opted into.
"""
experiments: Dict[str, Experiment] = {}
class ColorFormatter(logging.Formatter):
"""Color codes the log messages based on the log level"""
@ -172,85 +217,180 @@ def is_exception_branch(branch: str) -> bool:
return branch.split("/")[0] in {"main", "nightly", "release", "landchecks"}
def get_fleet(rollout_state: str, workflow_requestors: Iterable[str]) -> str:
"""
Determines if the job should run on the LF fleet or the Meta fleet
Returns:
The appropriate label prefix for the runner, corresponding to the fleet to use.
This gets prefixed to the very start of the runner label.
"""
def load_yaml(yaml_text: str) -> Any:
try:
if rollout_state[0] == "!":
log.info("LF Workflows are disabled for everyone. Using meta runners.")
return WORKFLOW_LABEL_META
elif rollout_state[0] == "*":
log.info("LF Workflows are enabled for everyone. Using LF runners.")
return WORKFLOW_LABEL_LF
else:
all_opted_in_users = {
usr_raw.strip("\n\t@ ").split(",")[0]
for usr_raw in rollout_state.split()
}
opted_in_requestors = {
usr for usr in workflow_requestors if usr in all_opted_in_users
}
if opted_in_requestors:
log.info(
f"LF Workflows are enabled for {', '.join(opted_in_requestors)}. Using LF runners."
)
return WORKFLOW_LABEL_LF
else:
log.info(
f"LF Workflows are disabled for {', '.join(workflow_requestors)}. Using meta runners."
)
return WORKFLOW_LABEL_META
except Exception as e:
log.error(
f"Failed to get determine workflow type. Falling back to meta runners. Exception: {e}"
)
return WORKFLOW_LABEL_META
data = yaml.safe_load(yaml_text)
return data
except yaml.YAMLError as exc:
log.exception("Error loading YAML")
raise
def get_optin_feature(
rollout_state: str, workflow_requestors: Iterable[str], feature: str, fallback: str
def extract_settings_user_opt_in_from_text(rollout_state: str) -> Tuple[str, str]:
"""
Extracts the text with settings, if any, and the opted in users from the rollout state.
If the issue body contains "---" then the text above that is the settings
and the text below is the list of opted in users.
If it doesn't contain "---" then the settings are empty and the rest is the users.
"""
rollout_state_parts = rollout_state.split("---")
if len(rollout_state_parts) >= 2:
return rollout_state_parts[0], rollout_state_parts[1]
else:
return "", rollout_state
class UserOptins(Dict[str, List[str]]):
"""
Dictionary of users with a list of features they have opted into
"""
def parse_user_opt_in_from_text(user_optin_text: str) -> UserOptins:
"""
Parse the user opt-in text into a key value pair of username and the list of features they have opted into
Users are GitHub usernames with the @ prefix. Each user is also a comma-separated list of features/experiments to enable.
- Example line: "@User1,lf,split_build"
- A "#" prefix indicates the user is opted out of all experiments
"""
optins = UserOptins()
for user in user_optin_text.split("\n"):
user = user.strip("\r\n\t -")
if not user or not user.startswith("@"):
# Not a valid user. Skip
continue
if user:
usr_name = user.split(",")[0].strip("@")
optins[usr_name] = [exp.strip(" ") for exp in user.split(",")[1:]]
return optins
def parse_settings_from_text(settings_text: str) -> Settings:
"""
Parse the experiments from the issue body into a Settings object
"""
try:
if settings_text:
# Escape the backtick as well so that we can have the settings in a code block on the GH issue
# for easy reading
# Note: Using ascii for the backtick so that the cat step in _runner-determinator.yml doesn't choke on
# the backtick character in shell commands.
backtick = chr(96) # backtick character
settings_text = settings_text.strip(f"\r\n\t{backtick} ")
settings = load_yaml(settings_text)
# For now we just load experiments. We can expand this if/when we add more settings
experiments = {}
for exp_name, exp_settings in settings.get(SETTING_EXPERIMENTS).items():
valid_settings = {}
for setting in exp_settings:
if setting not in Experiment._fields:
log.warning(
f"Unexpected setting in experiment: {setting} = {exp_settings[setting]}"
)
else:
valid_settings[setting] = exp_settings[setting]
experiments[exp_name] = Experiment(**valid_settings)
return Settings(experiments)
except Exception:
log.exception("Failed to parse settings")
return Settings()
def parse_settings(rollout_state: str) -> Settings:
"""
Parse settings, if any, from the rollout state.
If the issue body contains "---" then the text above that is the settings
and the text below is the list of opted in users.
If it doesn't contain "---" then the settings are empty and the default values are used.
"""
settings_text, _ = extract_settings_user_opt_in_from_text(rollout_state)
return parse_settings_from_text(settings_text)
def parse_users(rollout_state: str) -> UserOptins:
"""
Parse users from the rollout state.
"""
_, users_text = extract_settings_user_opt_in_from_text(rollout_state)
return parse_user_opt_in_from_text(users_text)
def is_user_opted_in(user: str, user_optins: UserOptins, experiment_name: str) -> bool:
"""
Check if a user is opted into an experiment
"""
return experiment_name in user_optins.get(user, [])
def get_runner_prefix(
rollout_state: str, workflow_requestors: Iterable[str], is_canary: bool = False
) -> str:
"""
Used to dynamically opt in jobs to specific runner-type variants.
settings = parse_settings(rollout_state)
user_optins = parse_users(rollout_state)
Returns:
The runner-type's variant name if the user has opted in to the feature, otherwise returns an empty string.
This variant name is prefixed to the runner-type in the label.
"""
try:
userlist = {u.lstrip("#").strip("\n\t@ ") for u in rollout_state.split()}
all_opted_in_users = set()
for user in userlist:
for i in user.split(","):
if i == feature:
all_opted_in_users.add(user.split(",")[0])
opted_in_requestors = {
usr for usr in workflow_requestors if usr in all_opted_in_users
}
fleet_prefix = ""
prefixes = []
for experiment_name, experiment_settings in settings.experiments.items():
enabled = False
if opted_in_requestors:
# Is any workflow_requestor opted in to this experiment?
opted_in_users = [
requestor
for requestor in workflow_requestors
if is_user_opted_in(requestor, user_optins, experiment_name)
]
if opted_in_users:
log.info(
f"Feature {feature} is enabled for {', '.join(opted_in_requestors)}. Using feature {feature}."
f"{', '.join(opted_in_users)} have opted into experiment {experiment_name}."
)
return feature
else:
log.info(
f"Feature {feature} is disabled for {', '.join(workflow_requestors)}. Using fallback \"{fallback}\"."
)
return fallback
enabled = True
elif experiment_settings.rollout_perc:
# If no user is opted in, then we randomly enable the experiment based on the rollout percentage
if random.uniform(0, 100) <= experiment_settings.rollout_perc:
log.info(
f"Based on rollout percentage of {experiment_settings.rollout_perc}%, enabling experiment {experiment_name}."
)
enabled = True
except Exception as e:
if enabled:
label = experiment_name
if experiment_name == LF_FLEET_EXPERIMENT:
# We give some special treatment to the "lf" experiment since it determines the fleet we use
# - If it's enabled, then we always list its prefix first
# - If we're in the canary branch, then we append ".c" to the lf prefix
if is_canary:
label += CANARY_FLEET_SUFFIX
fleet_prefix = label
else:
prefixes.append(label)
if len(prefixes) > 1:
log.error(
f'Failed to determine if user has opted-in to feature {feature}. Using fallback "{fallback}". Exception: {e}'
f"Only a fleet and one other experiment can be enabled for a job at any time. Enabling {prefixes[0]} and ignoring the rest, which are {', '.join(prefixes[1:])}"
)
return fallback
prefixes = prefixes[:1]
# Fleet always comes first
if fleet_prefix:
prefixes.insert(0, fleet_prefix)
return ".".join(prefixes) + "." if prefixes else ""
def get_rollout_state_from_issue(github_token: str, repo: str, issue_num: int) -> str:
@ -268,9 +408,10 @@ def main() -> None:
args = parse_args()
if args.github_ref_type == "branch" and is_exception_branch(args.github_branch):
log.info(f"Exception branch: '{args.github_branch}', using meta runners")
label_type = WORKFLOW_LABEL_META
runner_ami = RUNNER_AMI_LEGACY
log.info(
f"Exception branch: '{args.github_branch}', using Meta runners and no experiments."
)
runner_label_prefix = DEFAULT_LABEL_PREFIX
else:
try:
rollout_state = get_rollout_state_from_issue(
@ -285,35 +426,18 @@ def main() -> None:
args.github_branch,
)
label_type = get_fleet(
rollout_state,
(
args.github_issue_owner,
username,
),
)
runner_ami = get_optin_feature(
rollout_state=rollout_state,
workflow_requestors=(
args.github_issue_owner,
username,
),
feature=RUNNER_AMI_AMZ2023,
fallback=RUNNER_AMI_LEGACY,
is_canary = args.github_repo == "pytorch/pytorch-canary"
runner_label_prefix = get_runner_prefix(
rollout_state, (args.github_issue_owner, username), is_canary
)
except Exception as e:
log.error(
f"Failed to get issue. Falling back to meta runners. Exception: {e}"
f"Failed to get issue. Defaulting to Meta runners and no experiments. Exception: {e}"
)
label_type = WORKFLOW_LABEL_META
runner_ami = RUNNER_AMI_LEGACY
# For Canary builds use canary runners
if args.github_repo == "pytorch/pytorch-canary" and label_type == WORKFLOW_LABEL_LF:
label_type = WORKFLOW_LABEL_LF_CANARY
set_github_output(GH_OUTPUT_KEY_LABEL_TYPE, label_type)
set_github_output(GH_OUTPUT_KEY_AMI, runner_ami)
set_github_output(GH_OUTPUT_KEY_LABEL_TYPE, runner_label_prefix)
if __name__ == "__main__":
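The ordering rules in get_runner_prefix (fleet first, at most one other experiment, ".c" appended on canary) can be sketched separately from the opt-in and rollout logic. This is a simplified illustration under the assumption that the list of enabled experiments has already been decided; assemble_prefix is a hypothetical name and does not exist in the script.

```python
from typing import List

LF_FLEET_EXPERIMENT = "lf"
CANARY_FLEET_SUFFIX = ".c"


def assemble_prefix(enabled_experiments: List[str], is_canary: bool = False) -> str:
    """Hypothetical helper: join already-enabled experiment names into a label prefix."""
    fleet_prefix = ""
    others: List[str] = []
    for name in enabled_experiments:
        if name == LF_FLEET_EXPERIMENT:
            # The fleet experiment is tracked separately so it can be listed first.
            fleet_prefix = name + (CANARY_FLEET_SUFFIX if is_canary else "")
        else:
            others.append(name)
    # Only one non-fleet experiment is kept; any further ones are dropped.
    others = others[:1]
    if fleet_prefix:
        others.insert(0, fleet_prefix)
    # A trailing "." lets the prefix be prepended directly to a runner label.
    return ".".join(others) + "." if others else ""
```

For example, assemble_prefix(["otherExp", "lf"]) puts the fleet first regardless of input order, which matches what test_lf_prefix_always_comes_first expects from the real function.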

View File

@ -1,35 +0,0 @@
#!/bin/bash
set -eoux pipefail
SYNC_BRANCH=pytorch-stable-prototype
git config user.email "fake@example.com"
git config user.name "PyTorch Stable Bot"
git fetch origin main
git fetch origin "$SYNC_BRANCH"
git checkout "$SYNC_BRANCH"
# Using a hardcoded SHA here is a massive speedup as we can skip the entire history of the pytorch GitHub repo.
# This specific SHA was chosen as it was before the "branch point" of the stable branch
for SHA in $(git log ba3b05fdf37ddbc3c301294d6a560a816335e717..origin/main --pretty="%h" -- torch/distributed torch/csrc/distributed test/distributed test/cpp/c10d benchmarks/distributed)
do
# `git merge-base --is-ancestor` exits with code 0 if the given SHA is an ancestor, and non-0 otherwise
if git merge-base --is-ancestor $SHA HEAD || [[ $(git log --grep="(cherry picked from commit $SHA") ]]
then
echo "Skipping $SHA"
continue
fi
echo "Copying $SHA"
git cherry-pick -x "$SHA" -X theirs
git reset --soft HEAD~1
git add torch/distributed torch/csrc/distributed test/distributed test/cpp/c10d benchmarks/distributed
git checkout .
git commit --reuse-message=HEAD@{1}
git clean -f
done
if [[ "${WITH_PUSH}" == true ]]; then
git push
fi

View File

@ -51,6 +51,8 @@ def main() -> None:
for platform_image in platform_images: # type: ignore[attr-defined]
for arch in platform_image.keys(): # type: ignore[attr-defined]
if arch == "cpu-s390x":
continue
tag_image(
platform_image[arch], # type: ignore[index]
default_tag,

View File

@ -0,0 +1,237 @@
from unittest import main, TestCase
from unittest.mock import Mock, patch
import runner_determinator as rd
class TestRunnerDeterminatorIssueParser(TestCase):
def test_parse_settings(self) -> None:
settings_text = """
experiments:
lf:
rollout_perc: 25
otherExp:
rollout_perc: 0
---
Users:
@User1,lf
@User2,lf,otherExp
"""
settings = rd.parse_settings(settings_text)
self.assertTupleEqual(
rd.Experiment(rollout_perc=25),
settings.experiments["lf"],
"lf settings not parsed correctly",
)
self.assertTupleEqual(
rd.Experiment(rollout_perc=0),
settings.experiments["otherExp"],
"otherExp settings not parsed correctly",
)
def test_parse_settings_in_code_block(self) -> None:
settings_text = """
```
experiments:
lf:
rollout_perc: 25
otherExp:
rollout_perc: 0
```
---
Users:
@User1,lf
@User2,lf,otherExp
"""
settings = rd.parse_settings(settings_text)
self.assertTupleEqual(
rd.Experiment(rollout_perc=25),
settings.experiments["lf"],
"lf settings not parsed correctly",
)
self.assertTupleEqual(
rd.Experiment(rollout_perc=0),
settings.experiments["otherExp"],
"otherExp settings not parsed correctly",
)
def test_parse_users(self) -> None:
settings_text = """
experiments:
lf:
rollout_perc: 0
otherExp:
rollout_perc: 0
---
Users:
@User1,lf
@User2,lf,otherExp
"""
users = rd.parse_users(settings_text)
self.assertDictEqual(
{"User1": ["lf"], "User2": ["lf", "otherExp"]},
users,
"Users not parsed correctly",
)
def test_parse_users_without_settings(self) -> None:
settings_text = """
@User1,lf
@User2,lf,otherExp
"""
users = rd.parse_users(settings_text)
self.assertDictEqual(
{"User1": ["lf"], "User2": ["lf", "otherExp"]},
users,
"Users not parsed correctly",
)
class TestRunnerDeterminatorGetRunnerPrefix(TestCase):
def test_opted_in_user(self) -> None:
settings_text = """
experiments:
lf:
rollout_perc: 0
otherExp:
rollout_perc: 0
---
Users:
@User1,lf
@User2,lf,otherExp
"""
prefix = rd.get_runner_prefix(settings_text, ["User1"])
self.assertEqual("lf.", prefix, "Runner prefix not correct for User1")
def test_opted_in_user_two_experiments(self) -> None:
settings_text = """
experiments:
lf:
rollout_perc: 0
otherExp:
rollout_perc: 0
---
Users:
@User1,lf
@User2,lf,otherExp
"""
prefix = rd.get_runner_prefix(settings_text, ["User2"])
self.assertEqual("lf.otherExp.", prefix, "Runner prefix not correct for User2")
@patch("random.uniform", return_value=50)
def test_opted_out_user(self, mock_uniform: Mock) -> None:
settings_text = """
experiments:
lf:
rollout_perc: 25
otherExp:
rollout_perc: 25
---
Users:
@User1,lf
@User2,lf,otherExp
"""
prefix = rd.get_runner_prefix(settings_text, ["User3"])
self.assertEqual("", prefix, "Runner prefix not correct for user")
@patch("random.uniform", return_value=10)
def test_opted_out_user_was_pulled_in_by_rollout(self, mock_uniform: Mock) -> None:
settings_text = """
experiments:
lf:
rollout_perc: 25
otherExp:
rollout_perc: 25
---
Users:
@User1,lf
@User2,lf,otherExp
"""
# User3 is not opted in, but the mocked random draw of 10 falls within the 25% rollout, so both experiments are enabled
prefix = rd.get_runner_prefix(settings_text, ["User3"])
self.assertEqual("lf.otherExp.", prefix, "Runner prefix not correct for user")
def test_lf_prefix_always_comes_first(self) -> None:
settings_text = """
experiments:
otherExp:
rollout_perc: 0
lf:
rollout_perc: 0
---
Users:
@User1,lf
@User2,otherExp,lf
"""
prefix = rd.get_runner_prefix(settings_text, ["User2"])
self.assertEqual("lf.otherExp.", prefix, "Runner prefix not correct for user")
def test_ignores_commented_users(self) -> None:
settings_text = """
experiments:
lf:
rollout_perc: 0
otherExp:
rollout_perc: 0
---
Users:
#@User1,lf
@User2,lf,otherExp
"""
prefix = rd.get_runner_prefix(settings_text, ["User1"])
self.assertEqual("", prefix, "Runner prefix not correct for user")
def test_ignores_extra_experiments(self) -> None:
settings_text = """
experiments:
lf:
rollout_perc: 0
otherExp:
rollout_perc: 0
foo:
rollout_perc: 0
---
Users:
@User1,lf,otherExp,foo
"""
prefix = rd.get_runner_prefix(settings_text, ["User1"])
self.assertEqual("lf.otherExp.", prefix, "Runner prefix not correct for user")
if __name__ == "__main__":
main()
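The issue-body format exercised by the tests above (optional settings, a "---" separator, then one "@user,exp1,exp2" entry per line) can be parsed with the standard library alone. The sketch below follows the same rules as the diff but is a simplified stand-in: split_rollout_state and parse_optins are hypothetical names, and the YAML settings half is returned as raw text rather than loaded.

```python
from typing import Dict, List, Tuple


def split_rollout_state(text: str) -> Tuple[str, str]:
    """Return (settings_text, users_text); settings are empty without a '---' line."""
    parts = text.split("---")
    if len(parts) >= 2:
        return parts[0], parts[1]
    return "", text


def parse_optins(user_text: str) -> Dict[str, List[str]]:
    """Map each '@user' line to the list of experiments that follow the username."""
    optins: Dict[str, List[str]] = {}
    for line in user_text.split("\n"):
        line = line.strip("\r\n\t -")
        if not line.startswith("@"):
            # Headings, blank lines, and '#'-prefixed (opted-out) users are skipped.
            continue
        name, *experiments = line.split(",")
        optins[name.strip("@")] = [exp.strip() for exp in experiments]
    return optins
```

Because a "#" prefix fails the startswith("@") check, commenting out a user removes them from every experiment, which is the behavior test_ignores_commented_users relies on.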

View File

@ -36,6 +36,7 @@ from warnings import warn
import yaml
from github_utils import (
gh_close_pr,
gh_fetch_json_list,
gh_fetch_merge_base,
gh_fetch_url,
@ -1174,11 +1175,11 @@ class GitHubPR:
for pr in additional_merged_prs:
pr.add_numbered_label(MERGE_COMPLETE_LABEL, dry_run)
if comment_id and self.pr_num:
# When the merge process reaches this part, we can assume that the commit
# has been successfully pushed to trunk
merge_commit_sha = repo.rev_parse(name=REMOTE_MAIN_BRANCH)
# When the merge process reaches this part, we can assume that the commit
# has been successfully pushed to trunk
merge_commit_sha = repo.rev_parse(name=self.default_branch())
if comment_id and self.pr_num:
# Finally, upload the record to Rockset. The list of pending and failed
# checks are at the time of the merge
save_merge_record(
@ -1203,6 +1204,17 @@ class GitHubPR:
else:
print("Missing comment ID or PR number, couldn't upload to Rockset")
# Usually Github will see that the commit has "resolves <pr_num>" in the
# commit message and close the PR, but sometimes it doesn't, leading to
# confusion. When it doesn't, we close it manually.
time.sleep(60) # Give Github some time to close the PR
manually_close_merged_pr(
pr=self,
additional_merged_prs=additional_merged_prs,
merge_commit_sha=merge_commit_sha,
dry_run=dry_run,
)
def merge_changes(
self,
repo: GitRepo,
@ -1503,6 +1515,34 @@ def checks_to_markdown_bullets(
]
def manually_close_merged_pr(
pr: GitHubPR,
additional_merged_prs: List[GitHubPR],
merge_commit_sha: str,
dry_run: bool,
) -> None:
def _comment_and_close(pr: GitHubPR, comment: str) -> None:
pr = GitHubPR(pr.org, pr.project, pr.pr_num) # Refresh the PR
if not pr.is_closed():
gh_post_pr_comment(pr.org, pr.project, pr.pr_num, comment, dry_run)
gh_close_pr(pr.org, pr.project, pr.pr_num, dry_run)
message = (
f"This PR (#{pr.pr_num}) was merged in {merge_commit_sha} but it is still open, likely due to a Github bug, "
"so mergebot is closing it manually. If you think this is a mistake, please feel free to reopen and contact Dev Infra."
)
_comment_and_close(pr, message)
for additional_pr in additional_merged_prs:
message = (
f"This PR (#{additional_pr.pr_num}) was merged as part of PR #{pr.pr_num} in the stack under {merge_commit_sha} "
"but it is still open, likely due to a Github bug, so mergebot is closing it manually. "
"If you think this is a mistake, please feel free to reopen and contact Dev Infra."
)
_comment_and_close(additional_pr, message)
print(f"PR {pr.pr_num} and all additional PRs in the stack have been closed.")
@retries_decorator()
def save_merge_record(
comment_id: int,

View File

@ -109,6 +109,7 @@ jobs:
steps:
- name: Setup SSH (Click me for login details)
uses: pytorch/test-infra/.github/actions/setup-ssh@main
if: inputs.build-environment != 'linux-s390x-binary-manywheel'
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
@ -118,13 +119,16 @@ jobs:
# checkout. In other cases you should prefer a local checkout.
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
with:
no-sudo: ${{ inputs.build-environment == 'linux-s390x-binary-manywheel' }}
- name: Setup Linux
uses: ./.github/actions/setup-linux
if: inputs.build-environment != 'linux-s390x-binary-manywheel'
- name: configure aws credentials
uses: aws-actions/configure-aws-credentials@v3
if: ${{ inputs.aws-role-to-assume != '' }}
if: ${{ inputs.aws-role-to-assume != '' && inputs.build-environment != 'linux-s390x-binary-manywheel' }}
with:
role-to-assume: ${{ inputs.aws-role-to-assume }}
role-session-name: gha-linux-build
@ -133,11 +137,13 @@ jobs:
- name: Calculate docker image
id: calculate-docker-image
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
if: inputs.build-environment != 'linux-s390x-binary-manywheel'
with:
docker-image-name: ${{ inputs.docker-image-name }}
- name: Use following to pull public copy of the image
id: print-ghcr-mirror
if: inputs.build-environment != 'linux-s390x-binary-manywheel'
env:
ECR_DOCKER_IMAGE: ${{ steps.calculate-docker-image.outputs.docker-image }}
shell: bash
@ -147,6 +153,7 @@ jobs:
- name: Pull docker image
uses: pytorch/test-infra/.github/actions/pull-docker-image@main
if: inputs.build-environment != 'linux-s390x-binary-manywheel'
with:
docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }}
@ -174,6 +181,7 @@ jobs:
- name: Download pytest cache
uses: ./.github/actions/pytest-cache-download
continue-on-error: true
if: inputs.build-environment != 'linux-s390x-binary-manywheel'
with:
cache_dir: .pytest_cache
job_identifier: ${{ github.workflow }}_${{ inputs.build-environment }}
@ -195,6 +203,7 @@ jobs:
PR_LABELS: ${{ toJson(github.event.pull_request.labels.*.name) }}
TORCH_CUDA_ARCH_LIST: ${{ inputs.cuda-arch-list }}
DOCKER_IMAGE: ${{ steps.calculate-docker-image.outputs.docker-image }}
DOCKER_IMAGE_S390X: ${{ inputs.docker-image-name }}
XLA_CUDA: ${{ contains(inputs.build-environment, 'xla') && '0' || '' }}
DEBUG: ${{ inputs.build-with-debug && '1' || '0' }}
OUR_GITHUB_JOB_ID: ${{ steps.get-job-id.outputs.job-id }}
@ -202,7 +211,21 @@ jobs:
SCRIBE_GRAPHQL_ACCESS_TOKEN: ${{ secrets.SCRIBE_GRAPHQL_ACCESS_TOKEN }}
USE_SPLIT_BUILD: ${{ inputs.use_split_build }}
run: |
if [[ ${BUILD_ENVIRONMENT} == *"s390x"* ]]; then
JENKINS_USER=
USED_IMAGE="${DOCKER_IMAGE_S390X}"
# since some steps are skipped on s390x, if they are necessary, run them here
env | grep '^GITHUB' >> "/tmp/github_env_${GITHUB_RUN_ID}"
env | grep '^CI' >> "/tmp/github_env_${GITHUB_RUN_ID}"
else
JENKINS_USER="--user jenkins"
USED_IMAGE="${DOCKER_IMAGE}"
fi
# detached container should get cleaned up by teardown_ec2_linux
# Used for JENKINS_USER, which can be empty
# shellcheck disable=SC2086
container_name=$(docker run \
-e BUILD_ENVIRONMENT \
-e MAX_JOBS="$(nproc --ignore=2)" \
@ -225,10 +248,10 @@ jobs:
--cap-add=SYS_PTRACE \
--tty \
--detach \
--user jenkins \
${JENKINS_USER} \
-v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \
-w /var/lib/jenkins/workspace \
"${DOCKER_IMAGE}"
"${USED_IMAGE}"
)
docker exec -t "${container_name}" sh -c '.ci/pytorch/build.sh'
@ -239,7 +262,7 @@ jobs:
- name: Store PyTorch Build Artifacts on S3
uses: seemethere/upload-artifact-s3@v5
if: inputs.build-generates-artifacts && steps.build.outcome != 'skipped' && !inputs.use_split_build
if: inputs.build-generates-artifacts && steps.build.outcome != 'skipped' && !inputs.use_split_build && inputs.build-environment != 'linux-s390x-binary-manywheel'
with:
name: ${{ inputs.build-environment }}
retention-days: 14
@ -249,7 +272,7 @@ jobs:
- name: Store PyTorch Build Artifacts on S3 for split build
uses: seemethere/upload-artifact-s3@v5
if: inputs.build-generates-artifacts && steps.build.outcome != 'skipped' && inputs.use_split_build
if: inputs.build-generates-artifacts && steps.build.outcome != 'skipped' && inputs.use_split_build && inputs.build-environment != 'linux-s390x-binary-manywheel'
with:
name: ${{ inputs.build-environment }}-experimental-split-build
retention-days: 14
@ -257,8 +280,26 @@ jobs:
path: artifacts.zip
s3-bucket: ${{ inputs.s3-bucket }}
- name: Store PyTorch Build Artifacts for s390x
uses: actions/upload-artifact@v3
if: inputs.build-generates-artifacts && steps.build.outcome != 'skipped' && !inputs.use_split_build && inputs.build-environment == 'linux-s390x-binary-manywheel'
with:
name: ${{ inputs.build-environment }}
retention-days: 14
if-no-files-found: error
path: artifacts.zip
- name: Store PyTorch Build Artifacts for s390x for split build
uses: actions/upload-artifact@v3
if: inputs.build-generates-artifacts && steps.build.outcome != 'skipped' && inputs.use_split_build && inputs.build-environment == 'linux-s390x-binary-manywheel'
with:
name: ${{ inputs.build-environment }}-experimental-split-build
retention-days: 14
if-no-files-found: error
path: artifacts.zip
- name: Upload sccache stats
if: steps.build.outcome != 'skipped'
if: steps.build.outcome != 'skipped' && inputs.build-environment != 'linux-s390x-binary-manywheel'
uses: seemethere/upload-artifact-s3@v5
with:
s3-prefix: |
@ -270,4 +311,13 @@ jobs:
- name: Teardown Linux
uses: pytorch/test-infra/.github/actions/teardown-linux@main
if: always()
if: always() && inputs.build-environment != 'linux-s390x-binary-manywheel'
- name: Cleanup docker
if: always() && inputs.build-environment == 'linux-s390x-binary-manywheel'
shell: bash
run: |
# on s390x stop the container for clean worker stop
# ignore expansion of "docker ps -q" since it could be empty
# shellcheck disable=SC2046
docker stop $(docker ps -q) || true

View File

@ -88,6 +88,13 @@ jobs:
environment-file: .github/requirements/conda-env-${{ runner.os }}-${{ runner.arch }}
pip-requirements-file: .github/requirements/pip-requirements-${{ runner.os }}.txt
- name: Get workflow job id
id: get-job-id
uses: ./.github/actions/get-workflow-job-id
if: always()
with:
github-token: ${{ secrets.GITHUB_TOKEN }}
- name: Install PyTorch and run MPS tests
id: test
env:
@ -103,6 +110,14 @@ jobs:
NO_TEST_TIMEOUT: ${{ needs.filter.outputs.ci-no-test-timeout }}
NO_TD: ${{ needs.filter.outputs.ci-no-td }}
PIP_REQUIREMENTS_FILE: .github/requirements/pip-requirements-${{ runner.os }}.txt
GITHUB_REPOSITORY: ${{ github.repository }}
GITHUB_WORKFLOW: ${{ github.workflow }}
GITHUB_JOB: ${{ github.job }}
GITHUB_RUN_ID: ${{ github.run_id }}
GITHUB_RUN_NUMBER: ${{ github.run_number }}
GITHUB_RUN_ATTEMPT: ${{ github.run_attempt }}
JOB_ID: ${{ steps.get-job-id.outputs.job-id }}
JOB_NAME: ${{ steps.get-job-id.outputs.job-name }}
REENABLED_ISSUES: ${{ needs.filter.outputs.reenabled-issues }}
run: |
# shellcheck disable=SC1090
@ -144,13 +159,6 @@ jobs:
run: |
cat test/**/*_toprint.log || true
- name: Get workflow job id
id: get-job-id
uses: ./.github/actions/get-workflow-job-id
if: always()
with:
github-token: ${{ secrets.GITHUB_TOKEN }}
- name: Upload test artifacts
uses: ./.github/actions/upload-test-artifacts
if: always() && steps.test.conclusion && steps.test.conclusion != 'skipped'

View File

@ -62,49 +62,94 @@ jobs:
"""
This runner determinator is used to determine which set of runners to run a
GitHub job on. It uses the first comment of a GitHub issue (by default
https://github.com/pytorch/test-infra/issues/5132) as a user list to determine
which users will get their jobs to run on experimental runners. This user list
is also a comma separated list of additional features or experiments which the
user could be opted in to.
https://github.com/pytorch/test-infra/issues/5132) to define the configuration
of which runners should be used to run which job.
The configuration has two parts, the settings and a list of opted-in users,
separated by a line containing "---". If the line is not present, the
settings are considered to be empty with only the second part, the user
list, defined.
The first part is a YAML block that defines the rollout settings. This can be
used to define any settings that are needed to determine which runners to use.
Its fields are defined by the Settings class below.
The second part is a list of users who are explicitly opted in to the LF fleet.
The user list is also a comma separated list of additional features or
experiments which the user could be opted in to.
The user list has the following rules:
- Users are GitHub usernames with the @ prefix
- If the first line is a "*" then all users will use the new runners
- If the first line is a "!" then all users will use the old runners
- Users are GitHub usernames, which must start with the @ prefix
- Each user is also a comma-separated list of features/experiments to enable
- A "#" prefix indicates the user is opted out of the new runners but is opting
into features/experiments.
- A "#" prefix opts the user out of all experiments
Example user list:
Example config:
# A list of experiments that can be opted into.
# This defines the behavior they'll induce when opted into.
# Expected syntax is:
# [experiment_name]: # Name of the experiment. Also used for the label prefix.
# rollout_perc: [int] # % of workflows to run with this experiment when users are not opted in.
@User1
@User2,amz2023
#@UserOptOutOfNewRunner,amz2023
experiments:
lf:
rollout_perc: 25
---
# Opt-ins:
# Users can opt into the LF fleet by adding their GitHub username to this list
# and specifying experiments to enable in a comma-separated list.
# Experiments should be from the above list.
@User1,lf,split_build
@User2,lf
@User3,split_build
"""
import logging
import os
import random
from argparse import ArgumentParser
from logging import LogRecord
from typing import Any, Iterable
from typing import Any, Dict, Iterable, List, NamedTuple, Tuple
import yaml
from github import Auth, Github
from github.Issue import Issue
WORKFLOW_LABEL_META = "" # use meta runners
DEFAULT_LABEL_PREFIX = "" # use meta runners
WORKFLOW_LABEL_LF = "lf." # use runners from the linux foundation
WORKFLOW_LABEL_LF_CANARY = "lf.c." # use canary runners from the linux foundation
RUNNER_AMI_LEGACY = ""
RUNNER_AMI_AMZ2023 = "amz2023"
GITHUB_OUTPUT = os.getenv("GITHUB_OUTPUT", "")
GH_OUTPUT_KEY_AMI = "runner-ami"
GH_OUTPUT_KEY_LABEL_TYPE = "label-type"
SETTING_EXPERIMENTS = "experiments"
LF_FLEET_EXPERIMENT = "lf"
CANARY_FLEET_SUFFIX = ".c"
class Experiment(NamedTuple):
rollout_perc: float = (
0 # Percentage of workflows to experiment on when user is not opted-in.
)
# Add more fields as needed
class Settings(NamedTuple):
"""
Settings for the experiments that can be opted into.
"""
experiments: Dict[str, Experiment] = {}
class ColorFormatter(logging.Formatter):
"""Color codes the log messages based on the log level"""
@ -231,85 +276,180 @@ jobs:
return branch.split("/")[0] in {"main", "nightly", "release", "landchecks"}
def get_fleet(rollout_state: str, workflow_requestors: Iterable[str]) -> str:
"""
Determines if the job should run on the LF fleet or the Meta fleet
Returns:
The appropriate label prefix for the runner, corresponding to the fleet to use.
This gets prefixed to the very start of the runner label.
"""
def load_yaml(yaml_text: str) -> Any:
try:
if rollout_state[0] == "!":
log.info("LF Workflows are disabled for everyone. Using meta runners.")
return WORKFLOW_LABEL_META
elif rollout_state[0] == "*":
log.info("LF Workflows are enabled for everyone. Using LF runners.")
return WORKFLOW_LABEL_LF
else:
all_opted_in_users = {
usr_raw.strip("\n\t@ ").split(",")[0]
for usr_raw in rollout_state.split()
}
opted_in_requestors = {
usr for usr in workflow_requestors if usr in all_opted_in_users
}
if opted_in_requestors:
log.info(
f"LF Workflows are enabled for {', '.join(opted_in_requestors)}. Using LF runners."
)
return WORKFLOW_LABEL_LF
else:
log.info(
f"LF Workflows are disabled for {', '.join(workflow_requestors)}. Using meta runners."
)
return WORKFLOW_LABEL_META
except Exception as e:
log.error(
f"Failed to determine workflow type. Falling back to meta runners. Exception: {e}"
)
return WORKFLOW_LABEL_META
data = yaml.safe_load(yaml_text)
return data
except yaml.YAMLError as exc:
log.exception("Error loading YAML")
raise
def get_optin_feature(
rollout_state: str, workflow_requestors: Iterable[str], feature: str, fallback: str
def extract_settings_user_opt_in_from_text(rollout_state: str) -> Tuple[str, str]:
"""
Extracts the text with settings, if any, and the opted in users from the rollout state.
If the issue body contains "---" then the text above that is the settings
and the text below is the list of opted in users.
If it doesn't contain "---" then the settings are empty and the rest is the users.
"""
rollout_state_parts = rollout_state.split("---")
if len(rollout_state_parts) >= 2:
return rollout_state_parts[0], rollout_state_parts[1]
else:
return "", rollout_state
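Reviewer note: the splitting rule described in the docstring above can be exercised in isolation. The following is a minimal sketch, not the script itself; `split_rollout_state` is a hypothetical stand-in name for `extract_settings_user_opt_in_from_text`, and the sample issue text is made up for illustration.

```python
from typing import Tuple


def split_rollout_state(rollout_state: str) -> Tuple[str, str]:
    """Return (settings_text, users_text).

    Hypothetical stand-in for extract_settings_user_opt_in_from_text:
    text above the first "---" is settings, text below it is the user list.
    Without a "---" the settings are empty and everything is the user list.
    """
    parts = rollout_state.split("---")
    if len(parts) >= 2:
        return parts[0], parts[1]
    return "", rollout_state


# Made-up issue body with a settings block and one opted-in user.
with_settings = "experiments:\n  lf:\n    rollout_perc: 25\n---\n@User1,lf\n"
settings_text, users_text = split_rollout_state(with_settings)

# Made-up issue body that only contains a user list.
no_settings = split_rollout_state("@User1\n")
```

One consequence of splitting this way is that only the first two parts are returned, so any text after a second "---" line would be ignored.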
class UserOptins(Dict[str, List[str]]):
"""
Dictionary of users with a list of features they have opted into
"""
def parse_user_opt_in_from_text(user_optin_text: str) -> UserOptins:
"""
Parse the user opt-in text into a key value pair of username and the list of features they have opted into
Users are GitHub usernames with the @ prefix. Each user is also a comma-separated list of features/experiments to enable.
- Example line: "@User1,lf,split_build"
- A "#" prefix indicates the user is opted out of all experiments
"""
optins = UserOptins()
for user in user_optin_text.split("\n"):
user = user.strip("\r\n\t -")
if not user or not user.startswith("@"):
# Not a valid user. Skip
continue
if user:
usr_name = user.split(",")[0].strip("@")
optins[usr_name] = [exp.strip(" ") for exp in user.split(",")[1:]]
return optins
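Reviewer note: a small self-contained sketch of the user list rules, assuming the same line format as the docstring ("@User1,lf,split_build"). `parse_optins` is a hypothetical name used only here, and a plain dict stands in for `UserOptins`.

```python
from typing import Dict, List


def parse_optins(text: str) -> Dict[str, List[str]]:
    """Map each GitHub username (without "@") to the experiments it opted into."""
    optins: Dict[str, List[str]] = {}
    for line in text.split("\n"):
        # Drop surrounding whitespace and markdown list dashes.
        line = line.strip("\r\n\t -")
        if not line.startswith("@"):
            # Comments, blank lines and "#"-prefixed (opted-out) users are skipped.
            continue
        name, *experiments = line.split(",")
        optins[name.strip("@")] = [exp.strip(" ") for exp in experiments]
    return optins


sample = "# Opt-ins:\n@User1,lf,split_build\n- @User2, lf\n#@User3,lf\n"
optins = parse_optins(sample)
```

In this sketch `User3` is absent from the result because the "#" prefix keeps the line from starting with "@", which matches the "opted out of all experiments" rule.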
def parse_settings_from_text(settings_text: str) -> Settings:
"""
Parse the experiments from the issue body into a Settings object
"""
try:
if settings_text:
# Escape the backtick as well so that we can have the settings in a code block on the GH issue
# for easy reading
# Note: Using ascii for the backtick so that the cat step in _runner-determinator.yml doesn't choke on
# the backtick character in shell commands.
backtick = chr(96) # backtick character
settings_text = settings_text.strip(f"\r\n\t{backtick} ")
settings = load_yaml(settings_text)
# For now we just load experiments. We can expand this if/when we add more settings
experiments = {}
for exp_name, exp_settings in settings.get(SETTING_EXPERIMENTS).items():
valid_settings = {}
for setting in exp_settings:
if setting not in Experiment._fields:
log.warning(
f"Unexpected setting in experiment: {setting} = {exp_settings[setting]}"
)
else:
valid_settings[setting] = exp_settings[setting]
experiments[exp_name] = Experiment(**valid_settings)
return Settings(experiments)
except Exception:
log.exception("Failed to parse settings")
return Settings()
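Reviewer note: the field filtering above means a misspelled setting is logged and dropped rather than applied. The sketch below illustrates that behavior on an already-loaded dict so that it does not depend on PyYAML; `build_experiments` and the sample data are hypothetical.

```python
import logging
from typing import Any, Dict, NamedTuple

log = logging.getLogger(__name__)


class Experiment(NamedTuple):
    # Percentage of workflows to experiment on when the user is not opted in.
    rollout_perc: float = 0


def build_experiments(raw: Dict[str, Dict[str, Any]]) -> Dict[str, Experiment]:
    """Keep only settings that are real Experiment fields; warn about the rest."""
    experiments = {}
    for name, fields in raw.items():
        known = {k: v for k, v in fields.items() if k in Experiment._fields}
        for unknown in fields.keys() - known.keys():
            log.warning("Unexpected setting in experiment %s: %s", name, unknown)
        experiments[name] = Experiment(**known)
    return experiments


# "rollout_percent" is not a field of Experiment, so it falls back to the default.
raw_settings = {"lf": {"rollout_perc": 25}, "split_build": {"rollout_percent": 10}}
experiments = build_experiments(raw_settings)
```

Under these assumptions `split_build` ends up with a rollout of 0, so a typo in the issue text would silently disable the percentage-based rollout apart from the warning in the log.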
def parse_settings(rollout_state: str) -> Settings:
"""
Parse settings, if any, from the rollout state.
If the issue body contains "---" then the text above that is the settings
and the text below is the list of opted in users.
If it doesn't contain "---" then the settings are empty and the default values are used.
"""
settings_text, _ = extract_settings_user_opt_in_from_text(rollout_state)
return parse_settings_from_text(settings_text)
def parse_users(rollout_state: str) -> UserOptins:
"""
Parse users from the rollout state.
"""
_, users_text = extract_settings_user_opt_in_from_text(rollout_state)
return parse_user_opt_in_from_text(users_text)
def is_user_opted_in(user: str, user_optins: UserOptins, experiment_name: str) -> bool:
"""
Check if a user is opted into an experiment
"""
return experiment_name in user_optins.get(user, [])
def get_runner_prefix(
rollout_state: str, workflow_requestors: Iterable[str], is_canary: bool = False
) -> str:
"""
Used to dynamically opt in jobs to specific runner-type variants.
settings = parse_settings(rollout_state)
user_optins = parse_users(rollout_state)
Returns:
The runner-type's variant name if the user has opted in to the feature, otherwise returns an empty string.
This variant name is prefixed to the runner-type in the label.
"""
try:
userlist = {u.lstrip("#").strip("\n\t@ ") for u in rollout_state.split()}
all_opted_in_users = set()
for user in userlist:
for i in user.split(","):
if i == feature:
all_opted_in_users.add(user.split(",")[0])
opted_in_requestors = {
usr for usr in workflow_requestors if usr in all_opted_in_users
}
fleet_prefix = ""
prefixes = []
for experiment_name, experiment_settings in settings.experiments.items():
enabled = False
if opted_in_requestors:
# Is any workflow_requestor opted in to this experiment?
opted_in_users = [
requestor
for requestor in workflow_requestors
if is_user_opted_in(requestor, user_optins, experiment_name)
]
if opted_in_users:
log.info(
f"Feature {feature} is enabled for {', '.join(opted_in_requestors)}. Using feature {feature}."
f"{', '.join(opted_in_users)} have opted into experiment {experiment_name}."
)
return feature
else:
log.info(
f"Feature {feature} is disabled for {', '.join(workflow_requestors)}. Using fallback \"{fallback}\"."
)
return fallback
enabled = True
elif experiment_settings.rollout_perc:
# If no user is opted in, then we randomly enable the experiment based on the rollout percentage
if random.uniform(0, 100) <= experiment_settings.rollout_perc:
log.info(
f"Based on rollout percentage of {experiment_settings.rollout_perc}%, enabling experiment {experiment_name}."
)
enabled = True
except Exception as e:
if enabled:
label = experiment_name
if experiment_name == LF_FLEET_EXPERIMENT:
# We give some special treatment to the "lf" experiment since it determines the fleet we use
# - If it's enabled, then we always list its prefix first
# - If we're in the canary branch, then we append ".c" to the lf prefix
if is_canary:
label += CANARY_FLEET_SUFFIX
fleet_prefix = label
else:
prefixes.append(label)
if len(prefixes) > 1:
log.error(
f'Failed to determine if user has opted-in to feature {feature}. Using fallback "{fallback}". Exception: {e}'
f"Only a fleet and one other experiment can be enabled for a job at any time. Enabling {prefixes[0]} and ignoring the rest, which are {', '.join(prefixes[1:])}"
)
return fallback
prefixes = prefixes[:1]
# Fleet always comes first
if fleet_prefix:
prefixes.insert(0, fleet_prefix)
return ".".join(prefixes) + "." if prefixes else ""
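Reviewer note: the prefix composition rules (fleet first, ".c" suffix on canary, at most one additional experiment) are easier to check with the random draw made injectable. This is a simplified sketch under that assumption; `runner_prefix`, its `draw` parameter, and the sample data are hypothetical and not part of the script.

```python
import random
from typing import Callable, Dict, Iterable, List

CANARY_SUFFIX = ".c"
FLEET_EXPERIMENT = "lf"


def runner_prefix(
    rollout_percs: Dict[str, float],
    optins: Dict[str, List[str]],
    requestors: Iterable[str],
    is_canary: bool = False,
    draw: Callable[[], float] = lambda: random.uniform(0, 100),
) -> str:
    """Compose the runner label prefix from the enabled experiments."""
    requestors = list(requestors)
    fleet = ""
    others: List[str] = []
    for name, perc in rollout_percs.items():
        # An explicit opt-in wins; otherwise fall back to the rollout percentage.
        opted_in = any(name in optins.get(user, []) for user in requestors)
        enabled = opted_in or (perc > 0 and draw() <= perc)
        if not enabled:
            continue
        if name == FLEET_EXPERIMENT:
            fleet = name + (CANARY_SUFFIX if is_canary else "")
        else:
            others.append(name)
    # Fleet always comes first, followed by at most one other experiment.
    labels = ([fleet] if fleet else []) + others[:1]
    return ".".join(labels) + "." if labels else ""


percs = {"lf": 0, "split_build": 0, "amz2023": 0}
optins = {"User1": ["lf", "split_build"], "User2": ["amz2023"]}
prefix_user1 = runner_prefix(percs, optins, ["User1"])
prefix_canary = runner_prefix(percs, optins, ["User1"], is_canary=True)
prefix_nobody = runner_prefix(percs, optins, ["User3"])
prefix_rollout = runner_prefix({"lf": 25}, {}, ["User3"], draw=lambda: 10.0)
```

Passing a fixed `draw` makes the percentage branch deterministic, which may also be a convenient way to unit test the real function if it were given a similar seam.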
def get_rollout_state_from_issue(github_token: str, repo: str, issue_num: int) -> str:
@ -327,9 +467,10 @@ jobs:
args = parse_args()
if args.github_ref_type == "branch" and is_exception_branch(args.github_branch):
log.info(f"Exception branch: '{args.github_branch}', using meta runners")
label_type = WORKFLOW_LABEL_META
runner_ami = RUNNER_AMI_LEGACY
log.info(
f"Exception branch: '{args.github_branch}', using Meta runners and no experiments."
)
runner_label_prefix = DEFAULT_LABEL_PREFIX
else:
try:
rollout_state = get_rollout_state_from_issue(
@ -344,35 +485,18 @@ jobs:
args.github_branch,
)
label_type = get_fleet(
rollout_state,
(
args.github_issue_owner,
username,
),
)
runner_ami = get_optin_feature(
rollout_state=rollout_state,
workflow_requestors=(
args.github_issue_owner,
username,
),
feature=RUNNER_AMI_AMZ2023,
fallback=RUNNER_AMI_LEGACY,
is_canary = args.github_repo == "pytorch/pytorch-canary"
runner_label_prefix = get_runner_prefix(
rollout_state, (args.github_issue_owner, username), is_canary
)
except Exception as e:
log.error(
f"Failed to get issue. Falling back to meta runners. Exception: {e}"
f"Failed to get issue. Defaulting to Meta runners and no experiments. Exception: {e}"
)
label_type = WORKFLOW_LABEL_META
runner_ami = RUNNER_AMI_LEGACY
# For Canary builds use canary runners
if args.github_repo == "pytorch/pytorch-canary" and label_type == WORKFLOW_LABEL_LF:
label_type = WORKFLOW_LABEL_LF_CANARY
set_github_output(GH_OUTPUT_KEY_LABEL_TYPE, label_type)
set_github_output(GH_OUTPUT_KEY_AMI, runner_ami)
set_github_output(GH_OUTPUT_KEY_LABEL_TYPE, runner_label_prefix)
if __name__ == "__main__":

@ -32,7 +32,7 @@ concurrency:
jobs:
build-docker:
environment: ${{ (github.ref == 'refs/heads/main' || startsWith(github.event.ref, 'refs/tags/v')) && 'docker-build' || '' }}
runs-on: am2.linux.9xlarge.ephemeral
runs-on: linux.9xlarge.ephemeral
strategy:
matrix:
cuda_version: ["11.8", "12.1", "12.4", "cpu"]

@ -29,9 +29,19 @@ concurrency:
cancel-in-progress: true
jobs:
get-label-type:
name: get-label-type
uses: ./.github/workflows/_runner-determinator.yml
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
curr_branch: ${{ github.head_ref || github.ref_name }}
curr_ref_type: ${{ github.ref_type }}
build-docker-cuda:
environment: ${{ (github.ref == 'refs/heads/main' || startsWith(github.event.ref, 'refs/tags/v')) && 'docker-build' || '' }}
runs-on: linux.9xlarge.ephemeral
needs: get-label-type
runs-on: "${{ needs.get-label-type.outputs.label-type }}linux.9xlarge.ephemeral"
strategy:
matrix:
cuda_version: ["12.4", "12.1", "11.8"]
@ -66,7 +76,8 @@ jobs:
.ci/docker/libtorch/build.sh libtorch-cxx11-builder:cuda${{matrix.cuda_version}}
build-docker-rocm:
environment: ${{ (github.ref == 'refs/heads/main' || startsWith(github.event.ref, 'refs/tags/v')) && 'docker-build' || '' }}
runs-on: linux.9xlarge.ephemeral
needs: get-label-type
runs-on: "${{ needs.get-label-type.outputs.label-type }}linux.9xlarge.ephemeral"
strategy:
matrix:
rocm_version: ["6.1", "6.2"]
@ -101,7 +112,8 @@ jobs:
.ci/docker/libtorch/build.sh libtorch-cxx11-builder:rocm${{matrix.rocm_version}}
build-docker-cpu:
environment: ${{ (github.ref == 'refs/heads/main' || startsWith(github.event.ref, 'refs/tags/v')) && 'docker-build' || '' }}
runs-on: linux.9xlarge.ephemeral
needs: get-label-type
runs-on: "${{ needs.get-label-type.outputs.label-type }}linux.9xlarge.ephemeral"
steps:
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
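Reviewer note: the `runs-on` expressions in this workflow simply concatenate the determinator's `label-type` output with the base runner type. A minimal Python sketch of that composition follows; `runs_on` is a hypothetical helper, and the prefixes shown are the ones the determinator script can produce for the "lf" experiment.

```python
def runs_on(label_prefix: str, runner_type: str) -> str:
    """Mirror "${{ needs.get-label-type.outputs.label-type }}<runner_type>"."""
    return f"{label_prefix}{runner_type}"


meta_label = runs_on("", "linux.9xlarge.ephemeral")
lf_label = runs_on("lf.", "linux.9xlarge.ephemeral")
lf_canary_label = runs_on("lf.c.", "linux.9xlarge.ephemeral")
```

An empty prefix leaves the label unchanged, which is why exception branches and failure paths that emit an empty `label-type` keep using the default runners.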

@ -33,9 +33,19 @@ concurrency:
cancel-in-progress: true
jobs:
get-label-type:
name: get-label-type
uses: ./.github/workflows/_runner-determinator.yml
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
curr_branch: ${{ github.head_ref || github.ref_name }}
curr_ref_type: ${{ github.ref_type }}
build-docker-cuda:
environment: ${{ (github.ref == 'refs/heads/main' || startsWith(github.event.ref, 'refs/tags/v')) && 'docker-build' || '' }}
runs-on: am2.linux.9xlarge.ephemeral
needs: get-label-type
runs-on: "${{ needs.get-label-type.outputs.label-type }}linux.9xlarge.ephemeral"
strategy:
matrix:
cuda_version: ["12.4", "12.1", "11.8"]
@ -73,7 +83,8 @@ jobs:
# NOTE: manylinux_2_28 are still experimental, see https://github.com/pytorch/pytorch/issues/123649
build-docker-cuda-manylinux_2_28:
environment: ${{ (github.ref == 'refs/heads/main' || startsWith(github.event.ref, 'refs/tags/v')) && 'docker-build' || '' }}
runs-on: linux.9xlarge.ephemeral
needs: get-label-type
runs-on: "${{ needs.get-label-type.outputs.label-type }}linux.9xlarge.ephemeral"
strategy:
matrix:
cuda_version: ["12.4", "12.1", "11.8"]
@ -110,7 +121,8 @@ jobs:
.ci/docker/manywheel/build.sh manylinux2_28-builder:cuda${{matrix.cuda_version}}
build-docker-cuda-aarch64:
environment: ${{ (github.ref == 'refs/heads/main' || startsWith(github.event.ref, 'refs/tags/v')) && 'docker-build' || '' }}
runs-on: linux.arm64.2xlarge.ephemeral
needs: get-label-type
runs-on: "${{ needs.get-label-type.outputs.label-type }}linux.arm64.2xlarge.ephemeral"
strategy:
matrix:
cuda_version: ["12.4"]
@ -143,7 +155,8 @@ jobs:
.ci/docker/manywheel/build.sh manylinuxaarch64-builder:cuda${{matrix.cuda_version}}
build-docker-rocm:
environment: ${{ (github.ref == 'refs/heads/main' || startsWith(github.event.ref, 'refs/tags/v')) && 'docker-build' || '' }}
runs-on: am2.linux.9xlarge.ephemeral
needs: get-label-type
runs-on: "${{ needs.get-label-type.outputs.label-type }}linux.9xlarge.ephemeral"
strategy:
matrix:
rocm_version: ["6.1", "6.2"]
@ -178,7 +191,8 @@ jobs:
.ci/docker/manywheel/build.sh manylinux-builder:rocm${{matrix.rocm_version}}
build-docker-cpu:
environment: ${{ (github.ref == 'refs/heads/main' || startsWith(github.event.ref, 'refs/tags/v')) && 'docker-build' || '' }}
runs-on: am2.linux.9xlarge.ephemeral
needs: get-label-type
runs-on: "${{ needs.get-label-type.outputs.label-type }}linux.9xlarge.ephemeral"
steps:
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
@ -207,7 +221,8 @@ jobs:
.ci/docker/manywheel/build.sh manylinux-builder:cpu
build-docker-cpu-manylinux_2_28:
environment: ${{ (github.ref == 'refs/heads/main' || startsWith(github.event.ref, 'refs/tags/v')) && 'docker-build' || '' }}
runs-on: linux.9xlarge.ephemeral
needs: get-label-type
runs-on: "${{ needs.get-label-type.outputs.label-type }}linux.9xlarge.ephemeral"
env:
GPU_ARCH_TYPE: cpu-manylinux_2_28
steps:
@ -238,7 +253,8 @@ jobs:
.ci/docker/manywheel/build.sh manylinux2_28-builder:cpu
build-docker-cpu-aarch64:
environment: ${{ (github.ref == 'refs/heads/main' || startsWith(github.event.ref, 'refs/tags/v')) && 'docker-build' || '' }}
runs-on: linux.arm64.2xlarge.ephemeral
needs: get-label-type
runs-on: "${{ needs.get-label-type.outputs.label-type }}linux.arm64.2xlarge.ephemeral"
env:
GPU_ARCH_TYPE: cpu-aarch64
steps:
@ -269,7 +285,8 @@ jobs:
.ci/docker/manywheel/build.sh manylinuxaarch64-builder:cpu-aarch64
build-docker-cpu-aarch64-2_28:
environment: ${{ (github.ref == 'refs/heads/main' || startsWith(github.event.ref, 'refs/tags/v')) && 'docker-build' || '' }}
runs-on: linux.arm64.2xlarge.ephemeral
needs: get-label-type
runs-on: "${{ needs.get-label-type.outputs.label-type }}linux.arm64.2xlarge.ephemeral"
env:
GPU_ARCH_TYPE: cpu-aarch64-2_28
steps:
@ -303,7 +320,8 @@ jobs:
.ci/docker/manywheel/build.sh manylinux2_28_aarch64-builder:cpu-aarch64
build-docker-cpu-cxx11-abi:
environment: ${{ (github.ref == 'refs/heads/main' || startsWith(github.event.ref, 'refs/tags/v')) && 'docker-build' || '' }}
runs-on: linux.9xlarge.ephemeral
needs: get-label-type
runs-on: "${{ needs.get-label-type.outputs.label-type }}linux.9xlarge.ephemeral"
env:
GPU_ARCH_TYPE: cpu-cxx11-abi
steps:
@ -334,7 +352,8 @@ jobs:
.ci/docker/manywheel/build.sh manylinuxcxx11-abi-builder:cpu-cxx11-abi
build-docker-xpu:
environment: ${{ (github.ref == 'refs/heads/main' || startsWith(github.event.ref, 'refs/tags/v')) && 'docker-build' || '' }}
runs-on: linux.9xlarge.ephemeral
needs: get-label-type
runs-on: "${{ needs.get-label-type.outputs.label-type }}linux.9xlarge.ephemeral"
env:
GPU_ARCH_TYPE: xpu
steps:

@ -27,9 +27,19 @@ concurrency:
cancel-in-progress: true
jobs:
get-label-type:
name: get-label-type
uses: ./.github/workflows/_runner-determinator.yml
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
curr_branch: ${{ github.head_ref || github.ref_name }}
curr_ref_type: ${{ github.ref_type }}
build-wheel:
name: "Build Triton Wheel"
runs-on: [self-hosted, linux.4xlarge]
needs: get-label-type
runs-on: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge"
strategy:
fail-fast: false
matrix:
@ -199,7 +209,8 @@ jobs:
build-conda:
name: "Build Triton Conda"
runs-on: [self-hosted, linux.2xlarge]
needs: get-label-type
runs-on: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge"
strategy:
fail-fast: false
matrix:

@ -30,6 +30,9 @@ concurrency:
jobs:
check-labels:
permissions:
contents: read
pull-requests: write
name: Check labels
if: github.repository_owner == 'pytorch'
runs-on: linux.20_04.4x

@ -16,6 +16,15 @@ on:
paths: [.github/workflows/create_release.yml]
jobs:
get-label-type:
name: get-label-type
uses: ./.github/workflows/_runner-determinator.yml
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
curr_branch: ${{ github.head_ref || github.ref_name }}
curr_ref_type: ${{ github.ref_type }}
release:
if: ${{ github.repository == 'pytorch/pytorch' }}
name: Create Release
@ -63,7 +72,7 @@ jobs:
files: ${{env.PT_RELEASE_FILE}}
- name: Upload source distribution to GHA artifacts for release tags
if: ${{ github.event_name == 'push' && startsWith(github.ref, 'refs/tags/v') && contains(github.ref, 'rc') }}
uses: actions/upload-artifact@v2
uses: actions/upload-artifact@v4.4.0
with:
name: ${{ env.PT_RELEASE_FILE }}
path: ${{ env.PT_RELEASE_FILE }}
@ -73,12 +82,14 @@ jobs:
upload_source_code_to_s3:
if: ${{ github.repository == 'pytorch/pytorch' && github.event_name == 'push' && startsWith(github.ref, 'refs/tags/v') && contains(github.ref, 'rc') }}
runs-on: linux.2xlarge
runs-on: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge"
environment: sourcecode-upload
name: Upload source code to S3 for release tags
permissions:
id-token: write
needs: release
needs:
- get-label-type
- release
steps:
- uses: actions/download-artifact@v4.1.7
with:

@ -30,8 +30,18 @@ env:
permissions: read-all
jobs:
get-label-type:
name: get-label-type
uses: ./.github/workflows/_runner-determinator.yml
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
curr_branch: ${{ github.head_ref || github.ref_name }}
curr_ref_type: ${{ github.ref_type }}
docker-build:
environment: ${{ (github.ref == 'refs/heads/main' || startsWith(github.event.ref, 'refs/tags/v')) && 'docker-build' || '' }}
needs: get-label-type
timeout-minutes: 240
strategy:
fail-fast: false
@ -68,7 +78,7 @@ jobs:
- docker-image-name: pytorch-linux-jammy-aarch64-py3.10-gcc11-inductor-benchmarks
runner: linux.arm64.m7g.4xlarge
timeout-minutes: 600
runs-on: [self-hosted, "${{ matrix.runner }}"]
runs-on: "${{ needs.get-label-type.outputs.label-type }}${{ matrix.runner }}"
env:
DOCKER_IMAGE_BASE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/${{ matrix.docker-image-name }}
steps:

@ -34,9 +34,19 @@ env:
permissions: read-all
jobs:
get-label-type:
name: get-label-type
uses: ./.github/workflows/_runner-determinator.yml
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
curr_branch: ${{ github.head_ref || github.ref_name }}
curr_ref_type: ${{ github.ref_type }}
generate-matrix:
if: github.repository_owner == 'pytorch'
runs-on: [self-hosted, linux.large]
needs: get-label-type
runs-on: "${{ needs.get-label-type.outputs.label-type }}linux.large"
outputs:
matrix: ${{ steps.generate-matrix.outputs.matrix }}
steps:
@ -54,10 +64,12 @@ jobs:
build:
if: ${{ github.repository == 'pytorch/pytorch' }}
runs-on: [self-hosted, linux.2xlarge]
runs-on: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge"
environment: ${{ (github.ref == 'refs/heads/nightly' || startsWith(github.event.ref, 'refs/tags/v')) && 'docker-build' || '' }}
timeout-minutes: 240
needs: generate-matrix
needs:
- generate-matrix
- get-label-type
strategy:
matrix: ${{ fromJson(needs.generate-matrix.outputs.matrix) }}
fail-fast: false

@ -1010,76 +1010,6 @@ jobs:
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_10-cuda12_1-full-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu121
GPU_ARCH_VERSION: 12.1
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.1-main
use_split_build: False
DESIRED_PYTHON: "3.10"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_10-cuda12_1-full
build_environment: linux-binary-manywheel
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_10-cuda12_1-full-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs:
- manywheel-py3_10-cuda12_1-full-build
- get-label-type
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu121
GPU_ARCH_VERSION: 12.1
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.1-main
use_split_build: False
DESIRED_PYTHON: "3.10"
build_name: manywheel-py3_10-cuda12_1-full
build_environment: linux-binary-manywheel
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.4xlarge.nvidia.gpu
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_10-cuda12_1-full-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_10-cuda12_1-full-test
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu121
GPU_ARCH_VERSION: 12.1
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.1-main
use_split_build: False
DESIRED_PYTHON: "3.10"
build_name: manywheel-py3_10-cuda12_1-full
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_10-cuda12_4-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
@ -1766,6 +1696,76 @@ jobs:
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_11-cuda12_1-full-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu121
GPU_ARCH_VERSION: 12.1
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.1-main
use_split_build: False
DESIRED_PYTHON: "3.11"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_11-cuda12_1-full
build_environment: linux-binary-manywheel
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_11-cuda12_1-full-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs:
- manywheel-py3_11-cuda12_1-full-build
- get-label-type
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu121
GPU_ARCH_VERSION: 12.1
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.1-main
use_split_build: False
DESIRED_PYTHON: "3.11"
build_name: manywheel-py3_11-cuda12_1-full
build_environment: linux-binary-manywheel
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.4xlarge.nvidia.gpu
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_11-cuda12_1-full-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_11-cuda12_1-full-test
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu121
GPU_ARCH_VERSION: 12.1
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.1-main
use_split_build: False
DESIRED_PYTHON: "3.11"
build_name: manywheel-py3_11-cuda12_1-full
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_11-cuda12_4-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml

@ -2,7 +2,7 @@
# Template is at: .github/templates/linux_binary_build_workflow.yml.j2
# Generation script: .github/scripts/generate_ci_workflows.py
name: linux-binary-manywheel
name: linux-binary-manywheel-split
on:
@ -19,7 +19,7 @@ env:
ANACONDA_USER: pytorch
AWS_DEFAULT_REGION: us-east-1
BINARY_ENV_FILE: /tmp/env
BUILD_ENVIRONMENT: linux-binary-manywheel
BUILD_ENVIRONMENT: linux-binary-manywheel-split
BUILDER_ROOT: /builder
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
PR_NUMBER: ${{ github.event.pull_request.number }}
@ -28,7 +28,7 @@ env:
SHA1: ${{ github.event.pull_request.head.sha || github.sha }}
SKIP_ALL_TESTS: 0
concurrency:
group: linux-binary-manywheel-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}
group: linux-binary-manywheel-split-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}
cancel-in-progress: true
jobs:
@ -58,7 +58,7 @@ jobs:
DESIRED_PYTHON: "3.9"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_9-cuda11_8
build_environment: linux-binary-manywheel
build_environment: linux-binary-manywheel-split
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu11==11.8.87; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu11==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu11==11.11.3.6; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu11==10.9.0.58; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu11==10.3.0.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu11==11.4.1.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu11==11.7.5.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu11==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu11==11.8.86; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -81,7 +81,7 @@ jobs:
use_split_build: True
DESIRED_PYTHON: "3.9"
build_name: manywheel-py3_9-cuda11_8
build_environment: linux-binary-manywheel
build_environment: linux-binary-manywheel-split
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.4xlarge.nvidia.gpu
secrets:
@ -105,7 +105,7 @@ jobs:
DESIRED_PYTHON: "3.9"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_9-cuda12_1
build_environment: linux-binary-manywheel
build_environment: linux-binary-manywheel-split
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -128,7 +128,7 @@ jobs:
use_split_build: True
DESIRED_PYTHON: "3.9"
build_name: manywheel-py3_9-cuda12_1
build_environment: linux-binary-manywheel
build_environment: linux-binary-manywheel-split
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.4xlarge.nvidia.gpu
secrets:
@ -152,7 +152,7 @@ jobs:
DESIRED_PYTHON: "3.9"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_9-cuda12_4
build_environment: linux-binary-manywheel
build_environment: linux-binary-manywheel-split
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.5.8; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.1.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -175,7 +175,7 @@ jobs:
use_split_build: True
DESIRED_PYTHON: "3.9"
build_name: manywheel-py3_9-cuda12_4
build_environment: linux-binary-manywheel
build_environment: linux-binary-manywheel-split
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.4xlarge.nvidia.gpu
secrets:


@ -2,7 +2,7 @@
# Template is at: .github/templates/linux_binary_build_workflow.yml.j2
# Generation script: .github/scripts/generate_ci_workflows.py
name: linux-binary-manywheel
name: linux-binary-manywheel-split
on:
@ -24,7 +24,7 @@ env:
ANACONDA_USER: pytorch
AWS_DEFAULT_REGION: us-east-1
BINARY_ENV_FILE: /tmp/env
BUILD_ENVIRONMENT: linux-binary-manywheel
BUILD_ENVIRONMENT: linux-binary-manywheel-split
BUILDER_ROOT: /builder
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
PR_NUMBER: ${{ github.event.pull_request.number }}
@ -33,7 +33,7 @@ env:
SHA1: ${{ github.event.pull_request.head.sha || github.sha }}
SKIP_ALL_TESTS: 0
concurrency:
group: linux-binary-manywheel-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}
group: linux-binary-manywheel-split-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}
cancel-in-progress: true
jobs:
@ -63,7 +63,7 @@ jobs:
DESIRED_PYTHON: "3.9"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_9-cuda11_8
build_environment: linux-binary-manywheel
build_environment: linux-binary-manywheel-split
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu11==11.8.87; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu11==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu11==11.11.3.6; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu11==10.9.0.58; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu11==10.3.0.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu11==11.4.1.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu11==11.7.5.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu11==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu11==11.8.86; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -86,7 +86,7 @@ jobs:
use_split_build: True
DESIRED_PYTHON: "3.9"
build_name: manywheel-py3_9-cuda11_8
build_environment: linux-binary-manywheel
build_environment: linux-binary-manywheel-split
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.4xlarge.nvidia.gpu
secrets:
@ -134,7 +134,7 @@ jobs:
DESIRED_PYTHON: "3.9"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_9-cuda12_1
build_environment: linux-binary-manywheel
build_environment: linux-binary-manywheel-split
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -157,7 +157,7 @@ jobs:
use_split_build: True
DESIRED_PYTHON: "3.9"
build_name: manywheel-py3_9-cuda12_1
build_environment: linux-binary-manywheel
build_environment: linux-binary-manywheel-split
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.4xlarge.nvidia.gpu
secrets:
@ -205,7 +205,7 @@ jobs:
DESIRED_PYTHON: "3.9"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_9-cuda12_4
build_environment: linux-binary-manywheel
build_environment: linux-binary-manywheel-split
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.5.8; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.1.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -228,7 +228,7 @@ jobs:
use_split_build: True
DESIRED_PYTHON: "3.9"
build_name: manywheel-py3_9-cuda12_4
build_environment: linux-binary-manywheel
build_environment: linux-binary-manywheel-split
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.4xlarge.nvidia.gpu
secrets:
@ -275,7 +275,7 @@ jobs:
DESIRED_PYTHON: "3.9"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_9-cpu
build_environment: linux-binary-manywheel
build_environment: linux-binary-manywheel-split
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_9-cpu-test: # Testing
@ -296,7 +296,7 @@ jobs:
use_split_build: True
DESIRED_PYTHON: "3.9"
build_name: manywheel-py3_9-cpu
build_environment: linux-binary-manywheel
build_environment: linux-binary-manywheel-split
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.4xlarge
secrets:
@ -343,7 +343,7 @@ jobs:
DESIRED_PYTHON: "3.10"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_10-cuda11_8
build_environment: linux-binary-manywheel
build_environment: linux-binary-manywheel-split
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu11==11.8.87; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu11==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu11==11.11.3.6; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu11==10.9.0.58; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu11==10.3.0.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu11==11.4.1.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu11==11.7.5.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu11==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu11==11.8.86; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -366,7 +366,7 @@ jobs:
use_split_build: True
DESIRED_PYTHON: "3.10"
build_name: manywheel-py3_10-cuda11_8
build_environment: linux-binary-manywheel
build_environment: linux-binary-manywheel-split
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.4xlarge.nvidia.gpu
secrets:
@ -414,7 +414,7 @@ jobs:
DESIRED_PYTHON: "3.10"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_10-cuda12_1
build_environment: linux-binary-manywheel
build_environment: linux-binary-manywheel-split
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -437,7 +437,7 @@ jobs:
use_split_build: True
DESIRED_PYTHON: "3.10"
build_name: manywheel-py3_10-cuda12_1
build_environment: linux-binary-manywheel
build_environment: linux-binary-manywheel-split
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.4xlarge.nvidia.gpu
secrets:
@ -467,76 +467,6 @@ jobs:
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_10-cuda12_1-full-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu121
GPU_ARCH_VERSION: 12.1
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.1-main
use_split_build: True
DESIRED_PYTHON: "3.10"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_10-cuda12_1-full
build_environment: linux-binary-manywheel
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_10-cuda12_1-full-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs:
- manywheel-py3_10-cuda12_1-full-build
- get-label-type
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu121
GPU_ARCH_VERSION: 12.1
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.1-main
use_split_build: True
DESIRED_PYTHON: "3.10"
build_name: manywheel-py3_10-cuda12_1-full
build_environment: linux-binary-manywheel
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.4xlarge.nvidia.gpu
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_10-cuda12_1-full-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_10-cuda12_1-full-test
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu121
GPU_ARCH_VERSION: 12.1
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.1-main
use_split_build: True
DESIRED_PYTHON: "3.10"
build_name: manywheel-py3_10-cuda12_1-full
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_10-cuda12_4-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
@ -555,7 +485,7 @@ jobs:
DESIRED_PYTHON: "3.10"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_10-cuda12_4
build_environment: linux-binary-manywheel
build_environment: linux-binary-manywheel-split
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.5.8; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.1.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -578,7 +508,7 @@ jobs:
use_split_build: True
DESIRED_PYTHON: "3.10"
build_name: manywheel-py3_10-cuda12_4
build_environment: linux-binary-manywheel
build_environment: linux-binary-manywheel-split
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.4xlarge.nvidia.gpu
secrets:
@ -625,7 +555,7 @@ jobs:
DESIRED_PYTHON: "3.10"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_10-cpu
build_environment: linux-binary-manywheel
build_environment: linux-binary-manywheel-split
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_10-cpu-test: # Testing
@ -646,7 +576,7 @@ jobs:
use_split_build: True
DESIRED_PYTHON: "3.10"
build_name: manywheel-py3_10-cpu
build_environment: linux-binary-manywheel
build_environment: linux-binary-manywheel-split
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.4xlarge
secrets:
@ -693,7 +623,7 @@ jobs:
DESIRED_PYTHON: "3.11"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_11-cuda11_8
build_environment: linux-binary-manywheel
build_environment: linux-binary-manywheel-split
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu11==11.8.87; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu11==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu11==11.11.3.6; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu11==10.9.0.58; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu11==10.3.0.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu11==11.4.1.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu11==11.7.5.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu11==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu11==11.8.86; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -716,7 +646,7 @@ jobs:
use_split_build: True
DESIRED_PYTHON: "3.11"
build_name: manywheel-py3_11-cuda11_8
build_environment: linux-binary-manywheel
build_environment: linux-binary-manywheel-split
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.4xlarge.nvidia.gpu
secrets:
@ -764,7 +694,7 @@ jobs:
DESIRED_PYTHON: "3.11"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_11-cuda12_1
build_environment: linux-binary-manywheel
build_environment: linux-binary-manywheel-split
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -787,7 +717,7 @@ jobs:
use_split_build: True
DESIRED_PYTHON: "3.11"
build_name: manywheel-py3_11-cuda12_1
build_environment: linux-binary-manywheel
build_environment: linux-binary-manywheel-split
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.4xlarge.nvidia.gpu
secrets:
@ -817,6 +747,76 @@ jobs:
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_11-cuda12_1-full-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu121
GPU_ARCH_VERSION: 12.1
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.1-main
use_split_build: True
DESIRED_PYTHON: "3.11"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_11-cuda12_1-full
build_environment: linux-binary-manywheel-split
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_11-cuda12_1-full-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs:
- manywheel-py3_11-cuda12_1-full-build
- get-label-type
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu121
GPU_ARCH_VERSION: 12.1
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.1-main
use_split_build: True
DESIRED_PYTHON: "3.11"
build_name: manywheel-py3_11-cuda12_1-full
build_environment: linux-binary-manywheel-split
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.4xlarge.nvidia.gpu
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_11-cuda12_1-full-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_11-cuda12_1-full-test
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu121
GPU_ARCH_VERSION: 12.1
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.1-main
use_split_build: True
DESIRED_PYTHON: "3.11"
build_name: manywheel-py3_11-cuda12_1-full
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_11-cuda12_4-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
@ -835,7 +835,7 @@ jobs:
DESIRED_PYTHON: "3.11"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_11-cuda12_4
build_environment: linux-binary-manywheel
build_environment: linux-binary-manywheel-split
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.5.8; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.1.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -858,7 +858,7 @@ jobs:
use_split_build: True
DESIRED_PYTHON: "3.11"
build_name: manywheel-py3_11-cuda12_4
build_environment: linux-binary-manywheel
build_environment: linux-binary-manywheel-split
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.4xlarge.nvidia.gpu
secrets:
@ -905,7 +905,7 @@ jobs:
DESIRED_PYTHON: "3.11"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_11-cpu
build_environment: linux-binary-manywheel
build_environment: linux-binary-manywheel-split
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_11-cpu-test: # Testing
@ -926,7 +926,7 @@ jobs:
use_split_build: True
DESIRED_PYTHON: "3.11"
build_name: manywheel-py3_11-cpu
build_environment: linux-binary-manywheel
build_environment: linux-binary-manywheel-split
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.4xlarge
secrets:
@ -973,7 +973,7 @@ jobs:
DESIRED_PYTHON: "3.12"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_12-cuda11_8
build_environment: linux-binary-manywheel
build_environment: linux-binary-manywheel-split
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu11==11.8.87; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu11==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu11==11.11.3.6; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu11==10.9.0.58; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu11==10.3.0.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu11==11.4.1.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu11==11.7.5.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu11==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu11==11.8.86; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -996,7 +996,7 @@ jobs:
use_split_build: True
DESIRED_PYTHON: "3.12"
build_name: manywheel-py3_12-cuda11_8
build_environment: linux-binary-manywheel
build_environment: linux-binary-manywheel-split
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.4xlarge.nvidia.gpu
secrets:
@ -1044,7 +1044,7 @@ jobs:
DESIRED_PYTHON: "3.12"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_12-cuda12_1
build_environment: linux-binary-manywheel
build_environment: linux-binary-manywheel-split
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -1067,7 +1067,7 @@ jobs:
use_split_build: True
DESIRED_PYTHON: "3.12"
build_name: manywheel-py3_12-cuda12_1
build_environment: linux-binary-manywheel
build_environment: linux-binary-manywheel-split
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.4xlarge.nvidia.gpu
secrets:
@ -1115,7 +1115,7 @@ jobs:
DESIRED_PYTHON: "3.12"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_12-cuda12_4
build_environment: linux-binary-manywheel
build_environment: linux-binary-manywheel-split
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.5.8; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.1.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -1138,7 +1138,7 @@ jobs:
use_split_build: True
DESIRED_PYTHON: "3.12"
build_name: manywheel-py3_12-cuda12_4
build_environment: linux-binary-manywheel
build_environment: linux-binary-manywheel-split
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.4xlarge.nvidia.gpu
secrets:
@ -1185,7 +1185,7 @@ jobs:
DESIRED_PYTHON: "3.12"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_12-cpu
build_environment: linux-binary-manywheel
build_environment: linux-binary-manywheel-split
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_12-cpu-test: # Testing
@ -1206,7 +1206,7 @@ jobs:
use_split_build: True
DESIRED_PYTHON: "3.12"
build_name: manywheel-py3_12-cpu
build_environment: linux-binary-manywheel
build_environment: linux-binary-manywheel-split
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.4xlarge
secrets:
@ -1253,7 +1253,7 @@ jobs:
DESIRED_PYTHON: "3.13"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_13-cuda11_8
build_environment: linux-binary-manywheel
build_environment: linux-binary-manywheel-split
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu11==11.8.87; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu11==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu11==11.11.3.6; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu11==10.9.0.58; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu11==10.3.0.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu11==11.4.1.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu11==11.7.5.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu11==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu11==11.8.86; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -1276,7 +1276,7 @@ jobs:
use_split_build: True
DESIRED_PYTHON: "3.13"
build_name: manywheel-py3_13-cuda11_8
build_environment: linux-binary-manywheel
build_environment: linux-binary-manywheel-split
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.4xlarge.nvidia.gpu
secrets:
@ -1324,7 +1324,7 @@ jobs:
DESIRED_PYTHON: "3.13"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_13-cuda12_1
build_environment: linux-binary-manywheel
build_environment: linux-binary-manywheel-split
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -1347,7 +1347,7 @@ jobs:
use_split_build: True
DESIRED_PYTHON: "3.13"
build_name: manywheel-py3_13-cuda12_1
build_environment: linux-binary-manywheel
build_environment: linux-binary-manywheel-split
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.4xlarge.nvidia.gpu
secrets:
@ -1395,7 +1395,7 @@ jobs:
DESIRED_PYTHON: "3.13"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_13-cuda12_4
build_environment: linux-binary-manywheel
build_environment: linux-binary-manywheel-split
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.5.8; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.1.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -1418,7 +1418,7 @@ jobs:
use_split_build: True
DESIRED_PYTHON: "3.13"
build_name: manywheel-py3_13-cuda12_4
build_environment: linux-binary-manywheel
build_environment: linux-binary-manywheel-split
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.4xlarge.nvidia.gpu
secrets:
@ -1465,7 +1465,7 @@ jobs:
DESIRED_PYTHON: "3.13"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_13-cpu
build_environment: linux-binary-manywheel
build_environment: linux-binary-manywheel-split
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13-cpu-test: # Testing
@ -1486,7 +1486,7 @@ jobs:
use_split_build: True
DESIRED_PYTHON: "3.13"
build_name: manywheel-py3_13-cpu
build_environment: linux-binary-manywheel
build_environment: linux-binary-manywheel-split
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.4xlarge
secrets:

@ -18,11 +18,22 @@ concurrency:
permissions: read-all
jobs:
get-label-type:
name: get-label-type
uses: ./.github/workflows/_runner-determinator.yml
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
curr_branch: ${{ github.head_ref || github.ref_name }}
curr_ref_type: ${{ github.ref_type }}
linux-focal-cuda12_4-py3_10-gcc9-inductor-build:
# Should be synced with the one in inductor.yml, but this doesn't run inductor_timm
name: cuda12.4-py3.10-gcc9-sm86
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
sync-tag: linux-focal-cuda12_4-py3_10-gcc9-inductor-build
build-environment: linux-focal-cuda12.4-py3.10-gcc9-sm86
docker-image-name: pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9-inductor-benchmarks

@ -16,10 +16,21 @@ concurrency:
permissions: read-all
jobs:
get-label-type:
name: get-label-type
uses: ./.github/workflows/_runner-determinator.yml
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
curr_branch: ${{ github.head_ref || github.ref_name }}
curr_ref_type: ${{ github.ref_type }}
linux-focal-cuda12_1-py3_10-gcc9-inductor-micro-benchmark-build:
name: cuda12.1-py3.10-gcc9-sm80
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build-environment: linux-focal-cuda12.1-py3.10-gcc9-sm80
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9-inductor-benchmarks
cuda-arch-list: '8.0'

@ -13,10 +13,21 @@ concurrency:
permissions: read-all
jobs:
get-label-type:
name: get-label-type
uses: ./.github/workflows/_runner-determinator.yml
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
curr_branch: ${{ github.head_ref || github.ref_name }}
curr_ref_type: ${{ github.ref_type }}
linux-focal-cuda12_1-py3_10-gcc9-inductor-build:
name: cuda12.1-py3.10-gcc9-sm80
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build-environment: linux-focal-cuda12.1-py3.10-gcc9-sm80
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9-inductor-benchmarks
cuda-arch-list: '8.0'

@ -68,10 +68,21 @@ concurrency:
permissions: read-all
jobs:
get-label-type:
name: get-label-type
uses: ./.github/workflows/_runner-determinator.yml
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
curr_branch: ${{ github.head_ref || github.ref_name }}
curr_ref_type: ${{ github.ref_type }}
linux-focal-cuda12_1-py3_10-gcc9-inductor-build:
name: cuda12.1-py3.10-gcc9-sm80
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build-environment: linux-focal-cuda12.1-py3.10-gcc9-sm80
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9-inductor-benchmarks
cuda-arch-list: '8.0'

@ -5,9 +5,7 @@ on:
# - cron: 0 7 * * 1-6
# - cron: 0 7 * * 0
# Does not perform max_autotune on CPU, so skip the weekly run setup
# Run 6 times every day to see if perf instability can be reproduced
# Will change this back
- cron: 0 */4 * * *
- cron: 0 7 * * *
# NB: GitHub has an upper limit of 10 inputs here
workflow_dispatch:
inputs:
@ -50,10 +48,21 @@ concurrency:
permissions: read-all
jobs:
get-label-type:
name: get-label-type
uses: ./.github/workflows/_runner-determinator.yml
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
curr_branch: ${{ github.head_ref || github.ref_name }}
curr_ref_type: ${{ github.ref_type }}
linux-jammy-aarch64-py3_10-inductor-build:
name: linux-jammy-aarch64-py3.10-inductor
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runner: linux.arm64.m7g.4xlarge
build-environment: linux-jammy-aarch64-py3.10
docker-image-name: pytorch-linux-jammy-aarch64-py3.10-gcc11-inductor-benchmarks
@ -105,7 +114,7 @@ jobs:
name: linux-jammy-aarch64-py3.10-inductor
uses: ./.github/workflows/_linux-test.yml
needs: linux-jammy-aarch64-py3_10-inductor-build
if: github.event.schedule == '0 */4 * * *'
if: github.event.schedule == '0 7 * * *'
with:
build-environment: linux-jammy-aarch64-py3.10
# Turn off dynamic-shapes and aotinductor tests for now, to have faster iteration for debugging perf instability.
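
Note on the hunks above: the schedule entry and the job's `if:` condition are changed together, because `github.event.schedule` carries the cron string that triggered the run and the comparison is a plain string match. The following is a minimal sketch of a consistency check for that pairing. The function name and the regex-based parsing are assumptions made for illustration and are not part of the PyTorch repository; a real check would more likely parse the YAML.

```python
import re

# Active "- cron: ..." entries; a trailing "# comment" is ignored.
CRON_RE = re.compile(r"^[ \t]*-[ \t]*cron:[ \t]*(.+?)[ \t]*(?:#.*)?$", re.MULTILINE)
# String literals compared against github.event.schedule in job conditions.
COND_RE = re.compile(r"github\.event\.schedule\s*==\s*'([^']*)'")


def unmatched_schedule_conditions(workflow_text: str) -> list[str]:
    """Return schedule strings used in conditions that no active cron entry defines."""
    # Drop fully commented-out lines so disabled crons and conditions are ignored.
    active = "\n".join(
        line for line in workflow_text.splitlines()
        if not line.lstrip().startswith("#")
    )
    crons = {c.strip("'\"") for c in CRON_RE.findall(active)}
    conditions = set(COND_RE.findall(active))
    return sorted(conditions - crons)
```

An empty result means every schedule string referenced in a condition is also declared under `schedule:`. A non-empty result points at jobs that would never run on a scheduled trigger.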

@ -48,10 +48,21 @@ concurrency:
permissions: read-all
jobs:
get-label-type:
name: get-label-type
uses: ./.github/workflows/_runner-determinator.yml
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
curr_branch: ${{ github.head_ref || github.ref_name }}
curr_ref_type: ${{ github.ref_type }}
linux-jammy-cpu-py3_9-gcc11-inductor-build:
name: linux-jammy-cpu-py3.9-gcc11-inductor
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build-environment: linux-jammy-py3.9-gcc11-build
docker-image-name: pytorch-linux-jammy-py3.9-gcc11-inductor-benchmarks
test-matrix: |

@ -66,10 +66,21 @@ concurrency:
permissions: read-all
jobs:
get-label-type:
name: get-label-type
uses: ./.github/workflows/_runner-determinator.yml
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
curr_branch: ${{ github.head_ref || github.ref_name }}
curr_ref_type: ${{ github.ref_type }}
linux-focal-cuda12_1-py3_10-gcc9-inductor-build:
name: cuda12.1-py3.10-gcc9-sm80
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build-environment: linux-focal-cuda12.1-py3.10-gcc9-sm80
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9-inductor-benchmarks
cuda-arch-list: '8.0'

@ -18,10 +18,21 @@ concurrency:
permissions: read-all
jobs:
get-label-type:
name: get-label-type
uses: ./.github/workflows/_runner-determinator.yml
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
curr_branch: ${{ github.head_ref || github.ref_name }}
curr_ref_type: ${{ github.ref_type }}
linux-focal-cuda12_1-py3_10-gcc9-periodic-dynamo-benchmarks-build:
name: cuda12.1-py3.10-gcc9-sm86-periodic-dynamo-benchmarks
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build-environment: linux-focal-cuda12.1-py3.10-gcc9-sm86
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9-inductor-benchmarks
cuda-arch-list: '8.6'
@ -60,7 +71,9 @@ jobs:
linux-focal-cuda12_1-py3_10-gcc9-inductor-build-gcp:
name: cuda12.1-py3.10-gcc9-sm80
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build-environment: linux-focal-cuda12.1-py3.10-gcc9-sm80
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9-inductor-benchmarks
cuda-arch-list: '8.0'

@ -22,11 +22,22 @@ concurrency:
permissions: read-all
jobs:
linux-focal-rocm6_1-py3_8-inductor-build:
name: rocm6.1-py3.8-inductor
uses: ./.github/workflows/_linux-build.yml
get-label-type:
name: get-label-type
uses: ./.github/workflows/_runner-determinator.yml
with:
build-environment: linux-focal-rocm6.1-py3.8
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
curr_branch: ${{ github.head_ref || github.ref_name }}
curr_ref_type: ${{ github.ref_type }}
linux-focal-rocm6_2-py3_10-inductor-build:
name: rocm6.2-py3.10-inductor
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build-environment: linux-focal-rocm6.2-py3.10
docker-image-name: pytorch-linux-focal-rocm-n-py3
test-matrix: |
{ include: [
@ -34,14 +45,14 @@ jobs:
{ config: "inductor", shard: 2, num_shards: 2, runner: "linux.rocm.gpu.2" },
]}
linux-focal-rocm6_1-py3_8-inductor-test:
linux-focal-rocm6_2-py3_10-inductor-test:
permissions:
id-token: write
contents: read
name: rocm6.1-py3.8-inductor
name: rocm6.2-py3.10-inductor
uses: ./.github/workflows/_rocm-test.yml
needs: linux-focal-rocm6_1-py3_8-inductor-build
needs: linux-focal-rocm6_2-py3_10-inductor-build
with:
build-environment: linux-focal-rocm6.1-py3.8
docker-image: ${{ needs.linux-focal-rocm6_1-py3_8-inductor-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-rocm6_1-py3_8-inductor-build.outputs.test-matrix }}
build-environment: linux-focal-rocm6.2-py3.10
docker-image: ${{ needs.linux-focal-rocm6_2-py3_10-inductor-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-rocm6_2-py3_10-inductor-build.outputs.test-matrix }}

@ -58,8 +58,7 @@ jobs:
{ config: "aot_inductor_torchbench", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" },
{ config: "inductor_cpp_wrapper_abi_compatible", shard: 1, num_shards: 1, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" },
]}
secrets:
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
secrets: inherit
linux-focal-cuda12_1-py3_10-gcc9-inductor-test:
name: cuda12.1-py3.10-gcc9-sm86
@ -69,8 +68,7 @@ jobs:
build-environment: linux-focal-cuda12.1-py3.10-gcc9-sm86
docker-image: ${{ needs.linux-focal-cuda12_1-py3_10-gcc9-inductor-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-cuda12_1-py3_10-gcc9-inductor-build.outputs.test-matrix }}
secrets:
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
secrets: inherit
linux-focal-cuda12_1-py3_12-gcc9-inductor-build:
name: cuda12.1-py3.12-gcc9-sm86
@ -86,6 +84,7 @@ jobs:
{ config: "inductor", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" },
{ config: "inductor", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" },
]}
secrets: inherit
linux-focal-cuda12_1-py3_12-gcc9-inductor-test:
name: cuda12.1-py3.12-gcc9-sm86
@ -95,6 +94,7 @@ jobs:
build-environment: linux-focal-cuda12.1-py3.12-gcc9-sm86
docker-image: ${{ needs.linux-focal-cuda12_1-py3_12-gcc9-inductor-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-cuda12_1-py3_12-gcc9-inductor-build.outputs.test-matrix }}
secrets: inherit
linux-jammy-cpu-py3_12-inductor-halide-build:
name: linux-jammy-cpu-py3.12-gcc11-inductor-halide
@ -108,6 +108,7 @@ jobs:
{ include: [
{ config: "inductor-halide", shard: 1, num_shards: 1, runner: "${{ needs.get-label-type.outputs.label-type }}linux.12xlarge" },
]}
secrets: inherit
linux-jammy-cpu-py3_12-inductor-halide-test:
name: linux-jammy-cpu-py3.12-gcc11-inductor-halide
@ -117,6 +118,7 @@ jobs:
build-environment: linux-jammy-py3.12-gcc11
docker-image: ${{ needs.linux-jammy-cpu-py3_12-inductor-halide-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-jammy-cpu-py3_12-inductor-halide-build.outputs.test-matrix }}
secrets: inherit
linux-focal-cuda12_4-py3_10-gcc9-inductor-build:
# Should be synced with the one in inductor-periodic.yml but this only runs inductor_timm
@ -134,8 +136,7 @@ jobs:
{ config: "inductor_timm", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" },
{ config: "inductor_timm", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" },
]}
secrets:
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
secrets: inherit
linux-focal-cuda12_4-py3_10-gcc9-inductor-test:
name: cuda12.4-py3.10-gcc9-sm86
@ -146,8 +147,7 @@ jobs:
build-environment: linux-focal-cuda12.4-py3.10-gcc9-sm86
docker-image: ${{ needs.linux-focal-cuda12_4-py3_10-gcc9-inductor-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-cuda12_4-py3_10-gcc9-inductor-build.outputs.test-matrix }}
secrets:
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
secrets: inherit
linux-jammy-cpu-py3_9-gcc11-inductor-build:
name: linux-jammy-cpu-py3.9-gcc11-inductor
@ -201,8 +201,7 @@ jobs:
{ config: "cpu_inductor_freezing_avx2_timm", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.10xlarge.avx2" },
{ config: "cpu_inductor_freezing_avx2_timm", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.10xlarge.avx2" },
]}
secrets:
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
secrets: inherit
linux-jammy-cpu-py3_9-gcc11-inductor-test:
name: linux-jammy-cpu-py3.9-gcc11-inductor
@ -212,5 +211,4 @@ jobs:
build-environment: linux-jammy-py3.9-gcc11-build
docker-image: ${{ needs.linux-jammy-cpu-py3_9-gcc11-inductor-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-jammy-cpu-py3_9-gcc11-inductor-build.outputs.test-matrix }}
secrets:
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
secrets: inherit

.github/workflows/lint-autoformat.yml (new file)

@ -0,0 +1,44 @@
name: Apply lint suggestions
on:
pull_request:
types: [opened, synchronize, reopened]
jobs:
lintrunner-autoformat:
permissions:
contents: read
pull-requests: write
runs-on: lf.linux.2xlarge
continue-on-error: true
if: ${{ github.repository_owner == 'pytorch' }}
steps:
- name: Checkout pytorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
with:
submodules: true
fetch-depth: 0
- name: Setup miniconda
uses: pytorch/test-infra/.github/actions/setup-miniconda@main
with:
python-version: "3.10"
- name: Run lintrunner (nonretryable)
continue-on-error: true
# We can't run on all files here because only changes near the diff are shown in the PR UI
run: |
export ADDITIONAL_LINTRUNNER_ARGS="format"
bash .github/scripts/lintrunner.sh
- name: Check for changes
id: git-check
run: |
git diff --exit-code || echo "changes=true" >> "$GITHUB_OUTPUT"
- name: Suggest changes
if: steps.git-check.outputs.changes == 'true'
uses: parkerbxyz/suggest-changes@v1
with:
comment: "Please commit the suggested changes from pytorch's linter."
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }}
cancel-in-progress: true
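
Note on the "Check for changes" step in the new workflow above: `git diff --exit-code` exits with 0 when the working tree is clean and 1 when the formatter modified files, and the shell `||` writes `changes=true` on any non-zero exit status, which would also include a git error. The sketch below restates that logic in Python. The function names are hypothetical and chosen only for illustration; the workflow itself uses the one-line shell command shown above.

```python
import subprocess


def step_output_for(diff_exit_code: int) -> str:
    """Return the line to append to the step-output file, or '' when nothing changed.

    Mirrors `git diff --exit-code || echo "changes=true"`: any non-zero
    exit status (differences found, or a git failure) sets the flag.
    """
    return "changes=true\n" if diff_exit_code != 0 else ""


def check_for_changes(repo_dir: str, output_path: str) -> bool:
    """Run `git diff --exit-code` in repo_dir and record the result in output_path."""
    result = subprocess.run(
        ["git", "diff", "--exit-code"],
        cwd=repo_dir,
        stdout=subprocess.DEVNULL,  # the diff text itself is not needed here
    )
    line = step_output_for(result.returncode)
    if line:
        with open(output_path, "a", encoding="utf-8") as fh:
            fh.write(line)
    return bool(line)
```

In the workflow, the later "Suggest changes" step runs only when this flag is `'true'`, so a clean tree produces no review comment.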

@ -36,7 +36,7 @@ jobs:
submodules: true
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
script: |
export ADDITIONAL_LINTRUNNER_ARGS="--take CLANGTIDY,CLANGFORMAT"
export ADDITIONAL_LINTRUNNER_ARGS="--take CLANGTIDY,CLANGFORMAT --all-files"
export CLANG=1
.github/scripts/lintrunner.sh
@ -53,7 +53,7 @@ jobs:
submodules: true
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
script: |
export ADDITIONAL_LINTRUNNER_ARGS="--skip CLANGTIDY,CLANGFORMAT"
export ADDITIONAL_LINTRUNNER_ARGS="--skip CLANGTIDY,CLANGFORMAT --all-files"
.github/scripts/lintrunner.sh
quick-checks:
@ -278,4 +278,4 @@ jobs:
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }}
cancel-in-progress: true
cancel-in-progress: true

@ -57,8 +57,10 @@ jobs:
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9
test-matrix: |
{ include: [
{ config: "nogpu_AVX512", shard: 1, num_shards: 1, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "nogpu_NO_AVX2", shard: 1, num_shards: 1, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "nogpu_AVX512", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "nogpu_AVX512", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "nogpu_NO_AVX2", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "nogpu_NO_AVX2", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "jit_legacy", shard: 1, num_shards: 1, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge.nvidia.gpu" },
]}
linux-focal-cuda12_1-py3_10-gcc9-test:
@ -87,8 +89,10 @@ jobs:
{ config: "default", shard: 3, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 4, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 5, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge.nvidia.gpu" },
{ config: "nogpu_AVX512", shard: 1, num_shards: 1, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "nogpu_NO_AVX2", shard: 1, num_shards: 1, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "nogpu_AVX512", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "nogpu_AVX512", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "nogpu_NO_AVX2", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "nogpu_NO_AVX2", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "jit_legacy", shard: 1, num_shards: 1, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge.nvidia.gpu" },
]}
@ -214,7 +218,9 @@ jobs:
# TODO: Figure out how to migrate this job to M1 runner
ios-build-test:
name: ios-build-test
if: github.event_name != 'schedule' || github.event.schedule == '45 0,8,16 * * 1-5' || github.event.schedule == '45 4 * * 0,6' || github.event.schedule == '29 8 * * *'
# Has been broken for a while, see https://github.com/pytorch/pytorch/issues/136284
# if: github.event_name != 'schedule' || github.event.schedule == '45 0,8,16 * * 1-5' || github.event.schedule == '45 4 * * 0,6' || github.event.schedule == '29 8 * * *'
if: false
uses: ./.github/workflows/_ios-build-test.yml
with:
trigger-event: ${{ github.event_name }}
@ -293,13 +299,13 @@ jobs:
docker-image: ${{ needs.linux-vulkan-focal-py3_11-clang10-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-vulkan-focal-py3_11-clang10-build.outputs.test-matrix }}
linux-focal-rocm6_1-py3_8-build:
name: linux-focal-rocm6.1-py3.8
linux-focal-rocm6_2-py3_10-build:
name: linux-focal-rocm6.2-py3.10
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build-environment: linux-focal-rocm6.1-py3.8
build-environment: linux-focal-rocm6.2-py3.10
docker-image-name: pytorch-linux-focal-rocm-n-py3
test-matrix: |
{ include: [
@ -308,19 +314,19 @@ jobs:
{ config: "distributed", shard: 3, num_shards: 3, runner: "linux.rocm.gpu" },
]}
linux-focal-rocm6_1-py3_8-test:
linux-focal-rocm6_2-py3_10-test:
permissions:
id-token: write
contents: read
name: linux-focal-rocm6.1-py3.8
name: linux-focal-rocm6.2-py3.10
uses: ./.github/workflows/_rocm-test.yml
needs:
- linux-focal-rocm6_1-py3_8-build
- linux-focal-rocm6_2-py3_10-build
- target-determination
with:
build-environment: linux-focal-rocm6.1-py3.8
docker-image: ${{ needs.linux-focal-rocm6_1-py3_8-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-rocm6_1-py3_8-build.outputs.test-matrix }}
build-environment: linux-focal-rocm6.2-py3.10
docker-image: ${{ needs.linux-focal-rocm6_2-py3_10-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-rocm6_2-py3_10-build.outputs.test-matrix }}
linux-focal-cuda12_1-py3_10-gcc9-experimental-split-build:
name: linux-focal-cuda12.1-py3.10-gcc9-experimental-split-build
@ -333,8 +339,10 @@ jobs:
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9
test-matrix: |
{ include: [
{ config: "nogpu_AVX512", shard: 1, num_shards: 1, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "nogpu_NO_AVX2", shard: 1, num_shards: 1, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "nogpu_AVX512", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "nogpu_AVX512", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "nogpu_NO_AVX2", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "nogpu_NO_AVX2", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "jit_legacy", shard: 1, num_shards: 1, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge.nvidia.gpu" },
]}
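
Note on the test-matrix hunks above: raising `num_shards` from 1 to 2 requires one matrix entry per shard, each with the same `num_shards` value and a distinct `shard` index, and every runner label is prefixed with the `get-label-type` output. A minimal sketch of how such entries could be generated follows. The helper name and the `"lf."` prefix used in the example are assumptions for illustration; in the workflows these entries are written out by hand, and the actual prefix comes from `_runner-determinator.yml`.

```python
def shard_entries(config: str, num_shards: int, runner: str,
                  label_prefix: str = "") -> list[dict]:
    """Build one test-matrix entry per shard for a given config.

    label_prefix stands in for the value of
    needs.get-label-type.outputs.label-type (may be empty).
    """
    return [
        {
            "config": config,
            "shard": shard,              # 1-based shard index
            "num_shards": num_shards,    # identical across all entries of a config
            "runner": f"{label_prefix}{runner}",
        }
        for shard in range(1, num_shards + 1)
    ]


# Example: the two entries that replace the single nogpu_AVX512 entry.
for entry in shard_entries("nogpu_AVX512", 2, "linux.2xlarge", label_prefix="lf."):
    print(entry)
```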

@ -383,7 +383,7 @@ jobs:
with:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build-environment: linux-focal-py3.9-clang9-xla
docker-image-name: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/xla_base:v1.1-lite
docker-image-name: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/xla_base:v1.3-lite
test-matrix: |
{ include: [
{ config: "xla", shard: 1, num_shards: 1, runner: "${{ needs.get-label-type.outputs.label-type }}linux.12xlarge" },
@ -503,15 +503,15 @@ jobs:
]}
secrets: inherit
linux-focal-rocm6_1-py3_8-build:
linux-focal-rocm6_2-py3_10-build:
# don't run build twice on main
if: github.event_name == 'pull_request'
name: linux-focal-rocm6.1-py3.8
name: linux-focal-rocm6.2-py3.10
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build-environment: linux-focal-rocm6.1-py3.8
build-environment: linux-focal-rocm6.2-py3.10
docker-image-name: pytorch-linux-focal-rocm-n-py3
sync-tag: rocm-build
test-matrix: |

@ -3,18 +3,12 @@ name: rocm
on:
push:
branches:
# - main
- main
- release/*
tags:
- ciflow/rocm/*
workflow_dispatch:
schedule:
# We have several schedules so jobs can check github.event.schedule to activate only for a fraction of the runs.
# Also run less frequently on weekends.
- cron: 45 0,8,16 * * 1-5
- cron: 45 4 * * 0,6
- cron: 45 4,12,20 * * 1-5
- cron: 45 12 * * 0,6
- cron: 29 8 * * * # about 1:29am PDT
concurrency:
@ -31,11 +25,11 @@ jobs:
id-token: write
contents: read
linux-focal-rocm6_1-py3_8-build:
name: linux-focal-rocm6.1-py3.8
linux-focal-rocm6_2-py3_10-build:
name: linux-focal-rocm6.2-py3.10
uses: ./.github/workflows/_linux-build.yml
with:
build-environment: linux-focal-rocm6.1-py3.8
build-environment: linux-focal-rocm6.2-py3.10
docker-image-name: pytorch-linux-focal-rocm-n-py3
sync-tag: rocm-build
test-matrix: |
@ -48,16 +42,16 @@ jobs:
{ config: "default", shard: 6, num_shards: 6, runner: "linux.rocm.gpu.2" },
]}
linux-focal-rocm6_1-py3_8-test:
linux-focal-rocm6_2-py3_10-test:
permissions:
id-token: write
contents: read
name: linux-focal-rocm6.1-py3.8
name: linux-focal-rocm6.2-py3.10
uses: ./.github/workflows/_rocm-test.yml
needs:
- linux-focal-rocm6_1-py3_8-build
- linux-focal-rocm6_2-py3_10-build
- target-determination
with:
build-environment: linux-focal-rocm6.1-py3.8
docker-image: ${{ needs.linux-focal-rocm6_1-py3_8-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-rocm6_1-py3_8-build.outputs.test-matrix }}
build-environment: linux-focal-rocm6.2-py3.10
docker-image: ${{ needs.linux-focal-rocm6_2-py3_10-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-rocm6_2-py3_10-build.outputs.test-matrix }}

@ -56,12 +56,14 @@ jobs:
cuda-arch-list: 8.6
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 6, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 2, num_shards: 6, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 3, num_shards: 6, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 4, num_shards: 6, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 5, num_shards: 6, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 6, num_shards: 6, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 1, num_shards: 8, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 2, num_shards: 8, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 3, num_shards: 8, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 4, num_shards: 8, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 5, num_shards: 8, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 6, num_shards: 8, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 7, num_shards: 8, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 8, num_shards: 8, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" },
]}
linux-focal-cuda12_1-py3-gcc9-slow-gradcheck-test:
@ -87,8 +89,9 @@ jobs:
cuda-arch-list: 8.6
test-matrix: |
{ include: [
{ config: "slow", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" },
{ config: "slow", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" },
{ config: "slow", shard: 1, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" },
{ config: "slow", shard: 2, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" },
{ config: "slow", shard: 3, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" },
]}
linux-focal-cuda12_1-py3_10-gcc9-sm86-test:
@ -127,13 +130,13 @@ jobs:
docker-image: ${{ needs.linux-focal-py3_9-clang10-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-py3_9-clang10-build.outputs.test-matrix }}
linux-focal-rocm6_1-py3_8-build:
name: linux-focal-rocm6.1-py3.8
linux-focal-rocm6_2-py3_10-build:
name: linux-focal-rocm6.2-py3.10
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build-environment: linux-focal-rocm6.1-py3.8
build-environment: linux-focal-rocm6.2-py3.10
docker-image-name: pytorch-linux-focal-rocm-n-py3
test-matrix: |
{ include: [
@ -141,19 +144,19 @@ jobs:
{ config: "slow", shard: 2, num_shards: 2, runner: "linux.rocm.gpu" },
]}
linux-focal-rocm6_1-py3_8-test:
linux-focal-rocm6_2-py3_10-test:
permissions:
id-token: write
contents: read
name: linux-focal-rocm6.1-py3.8
name: linux-focal-rocm6.2-py3.10
uses: ./.github/workflows/_rocm-test.yml
needs:
- linux-focal-rocm6_1-py3_8-build
- linux-focal-rocm6_2-py3_10-build
- target-determination
with:
build-environment: linux-focal-rocm6.1-py3.8
docker-image: ${{ needs.linux-focal-rocm6_1-py3_8-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-rocm6_1-py3_8-build.outputs.test-matrix }}
build-environment: linux-focal-rocm6.2-py3.10
docker-image: ${{ needs.linux-focal-rocm6_2-py3_10-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-rocm6_2-py3_10-build.outputs.test-matrix }}
linux-jammy-py3_10-clang15-asan-build:
name: linux-jammy-py3.10-clang15-asan

@ -1,30 +0,0 @@
name: Sync Distributed Folder
on:
#push:
# branches:
# - 'main'
# paths:
# - 'torch/distributed/**'
workflow_dispatch:
pull_request:
paths:
- '.github/scripts/sync_distributed_folder_prototype.sh'
- '.github/workflows/sync_distributed_folder_prototype.yml'
env:
WITH_PUSH: ${{ github.event_name == 'push' && github.ref == 'refs/heads/main' }}
permissions:
contents: write
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }}
cancel-in-progress: true
jobs:
sync:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- run: .github/scripts/sync_distributed_folder_prototype.sh

@ -10,8 +10,18 @@ permissions:
contents: read
jobs:
get-label-type:
name: get-label-type
uses: ./.github/workflows/_runner-determinator.yml
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
curr_branch: ${{ github.head_ref || github.ref_name }}
curr_ref_type: ${{ github.ref_type }}
index:
runs-on: linux.g5.4xlarge.nvidia.gpu # 1 GPU A10G 24GB each
needs: get-label-type
runs-on: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" # 1 GPU A10G 24GB each
environment: target-determinator-env
steps:
- name: Clone PyTorch

View File

@ -11,10 +11,21 @@ concurrency:
cancel-in-progress: true
jobs:
get-label-type:
name: get-label-type
uses: ./.github/workflows/_runner-determinator.yml
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
curr_branch: ${{ github.head_ref || github.ref_name }}
curr_ref_type: ${{ github.ref_type }}
linux-focal-cuda12_1-py3_10-gcc9-torchbench-build-gcp:
name: cuda12.1-py3.10-gcc9-sm80
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build-environment: linux-focal-cuda12.1-py3.10-gcc9-sm80
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9-inductor-benchmarks
cuda-arch-list: '8.0'

View File

@ -223,13 +223,13 @@ jobs:
cuda-version: "12.1"
runner: "${{ needs.get-label-type.outputs.label-type }}windows.4xlarge.nonephemeral"
linux-focal-rocm6_1-py3_8-build:
name: linux-focal-rocm6.1-py3.8
linux-focal-rocm6_2-py3_10-build:
name: linux-focal-rocm6.2-py3.10
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build-environment: linux-focal-rocm6.1-py3.8
build-environment: linux-focal-rocm6.2-py3.10
docker-image-name: pytorch-linux-focal-rocm-n-py3
sync-tag: rocm-build
test-matrix: |
@ -240,19 +240,19 @@ jobs:
]}
secrets: inherit
linux-focal-rocm6_1-py3_8-test:
linux-focal-rocm6_2-py3_10-test:
permissions:
id-token: write
contents: read
name: linux-focal-rocm6.1-py3.8
name: linux-focal-rocm6.2-py3.10
uses: ./.github/workflows/_rocm-test.yml
needs:
- linux-focal-rocm6_1-py3_8-build
- linux-focal-rocm6_2-py3_10-build
- target-determination
with:
build-environment: linux-focal-rocm6.1-py3.8
docker-image: ${{ needs.linux-focal-rocm6_1-py3_8-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-rocm6_1-py3_8-build.outputs.test-matrix }}
build-environment: linux-focal-rocm6.2-py3.10
docker-image: ${{ needs.linux-focal-rocm6_2-py3_10-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-rocm6_2-py3_10-build.outputs.test-matrix }}
tests-to-include: "test_nn test_torch test_cuda test_ops test_unary_ufuncs test_binary_ufuncs test_autograd inductor/test_torchinductor distributed/test_c10d_common distributed/test_c10d_nccl"
linux-focal-cuda12_4-py3_10-gcc9-experimental-split-build:
@ -266,8 +266,10 @@ jobs:
docker-image-name: pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9
test-matrix: |
{ include: [
{ config: "nogpu_AVX512", shard: 1, num_shards: 1, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "nogpu_NO_AVX2", shard: 1, num_shards: 1, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "nogpu_AVX512", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "nogpu_AVX512", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "nogpu_NO_AVX2", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "nogpu_NO_AVX2", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "jit_legacy", shard: 1, num_shards: 1, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 1, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 2, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge.nvidia.gpu" },
@ -314,3 +316,11 @@ jobs:
build-environment: linux-focal-cuda11.8-py3.10-gcc9-experimental-split-build
docker-image: ${{ needs.linux-focal-cuda11_8-py3_10-gcc9-experimental-split-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-cuda11_8-py3_10-gcc9-experimental-split-build.outputs.test-matrix }}
linux-manylinux-2_28-py3-cpu-s390x-build:
name: linux-manylinux-2_28-py3-cpu-s390x
uses: ./.github/workflows/_linux-build.yml
with:
build-environment: linux-s390x-binary-manywheel
docker-image-name: pytorch/manylinuxs390x-builder:cpu-s390x-main
runner: linux.s390x

View File

@ -11,15 +11,39 @@ concurrency:
jobs:
do_update_viablestrict:
permissions:
id-token: write
if: ${{ github.repository_owner == 'pytorch' }}
runs-on: ubuntu-20.04
environment: ${{ (github.event_name == 'schedule') && 'mergebot' || '' }}
steps:
- name: Update viable/strict
uses: pytorch/test-infra/.github/actions/update-viablestrict@main
id: update_viablestrict
with:
repository: pytorch/pytorch
stable-branch: viable/strict
requires: '[\"pull\", \"trunk\", \"lint\", \"linux-binary\"]'
secret-bot-token: ${{ secrets.MERGEBOT_TOKEN }}
rockset-api-key: ${{ secrets.ROCKSET_API_KEY }}
- name: Authenticate to AWS with OIDC
uses: aws-actions/configure-aws-credentials@v4
with:
role-to-assume: arn:aws:iam::308535385114:role/upload_to_ossci_raw_job_status
aws-region: us-east-1
- name: Print sha
env:
LATEST_SHA: ${{ steps.update_viablestrict.outputs.latest_viable_sha }}
PUSH_RESULT: ${{ steps.update_viablestrict.outputs.push_result }}
TIME: ${{ steps.update_viablestrict.outputs.time }}
run: |
echo "${PUSH_RESULT}"
if [ "$PUSH_RESULT" = "Everything up-to-date" ]; then
echo "No update pushed"
else
echo "{\"sha\": \"${LATEST_SHA}\", \"repository\":\"pytorch/pytorch\", \"timestamp\": ${TIME}}" > "/tmp/${LATEST_SHA}.json"
pip install awscli==1.29.40
aws s3 cp "/tmp/${LATEST_SHA}.json" "s3://ossci-raw-job-status/stable_pushes/pytorch/pytorch/${LATEST_SHA}.json"
fi
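
The shell step above writes a small JSON record for each successful viable/strict push and uploads it under a per-SHA key. As a rough sketch of the same record-building logic in Python (the function names `build_stable_push_record` and `should_upload` are hypothetical and not part of the workflow; the actual step uses `echo` and `aws s3 cp`):

```python
import json


def should_upload(push_result: str) -> bool:
    # Mirrors the shell check: skip the upload when git reports nothing was pushed.
    return push_result != "Everything up-to-date"


def build_stable_push_record(latest_sha: str, timestamp: int,
                             repository: str = "pytorch/pytorch"):
    # Returns (object key, JSON body). The key layout follows the path used in
    # the workflow step: stable_pushes/<repository>/<sha>.json
    body = json.dumps(
        {"sha": latest_sha, "repository": repository, "timestamp": timestamp}
    )
    key = f"stable_pushes/{repository}/{latest_sha}.json"
    return key, body
```

Keying each record by SHA means a re-run for the same commit would overwrite the same object rather than add a new one.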

View File

@ -1,55 +0,0 @@
# upload alerts every 10 minutes
name: Upload Alerts to AWS/Rockset
on:
schedule:
- cron: '*/10 * * * *'
pull_request:
paths:
- 'tools/alerts/create_alerts.py'
- '.github/workflows/upload-alerts.yml'
jobs:
upload-alerts:
if: ${{ github.repository_owner == 'pytorch' }}
runs-on: ubuntu-22.04
environment: upload-stats
steps:
- name: Checkout repo
uses: actions/checkout@v3
with:
fetch-depth: 1
- uses: actions/setup-python@v4
with:
python-version: '3.11'
cache: pip
- name: Install Python Packages
run: |
pip3 install rockset==1.0.3 boto3==1.19.12 requests==2.32.2
- name: Create alerts
run: |
output=$(PYTHONPATH=$PYTHONPATH:$(pwd) python3 "tools/alerts/create_alerts.py")
echo "uploading following alerts"
echo "$output"
echo "script-output=$output" >> "$GITHUB_OUTPUT"
id: alert_creation_step
- name: Upload alerts
env:
ROCKSET_API_KEY: ${{ secrets.ROCKSET_API_KEY }}
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
uses: pytorch/test-infra/.github/actions/upload-alerts@main
with:
alerts: '${{ steps.alert_creation_step.outputs.script-output }}'
organization: "pytorch"
repo: "pytorch"
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }}
cancel-in-progress: true

View File

@ -96,7 +96,7 @@ jobs:
python3 -m tools.stats.check_disabled_tests --workflow-run-id "${WORKFLOW_RUN_ID}" --workflow-run-attempt "${WORKFLOW_RUN_ATTEMPT}" --repo "${REPO_FULLNAME}"
- name: Upload gpt-fast benchmark results to Rockset
if: steps.upload-s3.outcome && steps.upload-s3.outcome == 'success' && github.event.workflow_run.name == 'inductor-micro-benchmark'
if: steps.upload-s3.outcome && steps.upload-s3.outcome == 'success' && contains(github.event.workflow_run.name, 'inductor-micro-benchmark')
env:
ROCKSET_API_KEY: ${{ secrets.ROCKSET_API_KEY }}
WORKFLOW_RUN_ID: ${{ github.event.workflow_run.id }}

View File

@ -139,7 +139,7 @@ init_command = [
'numpy==1.24.3 ; python_version == "3.8"',
'numpy==1.26.0 ; python_version >= "3.9"',
'expecttest==0.2.1',
'mypy==1.10.0',
'mypy==1.11.2',
'sympy==1.12.1 ; python_version == "3.8"',
'sympy==1.13.0 ; python_version >= "3.9"',
'types-requests==2.27.25',
@ -195,6 +195,7 @@ include_patterns = [
# and excluding most sub-directories for now.
'aten/src/ATen/*.h',
'aten/src/ATen/*.cpp',
'aten/src/ATen/cuda/*.cpp',
'aten/src/ATen/cpu/*.h',
'aten/src/ATen/cpu/*.cpp',
'aten/src/ATen/core/*.h',
@ -224,7 +225,6 @@ exclude_patterns = [
# CUDA files are also excluded.
'**/fb/**',
'**/*pb.h',
'aten/**/cuda/*pp',
'c10/xpu/**/*.h',
'c10/xpu/**/*.cpp',
'c10/cuda/CUDAAlgorithm.h',
@ -1585,6 +1585,27 @@ command = [
]
is_formatter = true
[[linter]]
code = 'META_NO_CREATE_UNBACKED'
include_patterns = [
"torch/_meta_registrations.py"
]
command = [
'python3',
'tools/linter/adapters/grep_linter.py',
'--pattern=create_unbacked',
'--linter-name=META_NO_CREATE_UNBACKED',
'--error-name=no create_unbacked in meta registrations',
"""--error-description=\
Data-dependent operators should have their meta \
registration in torch/_subclasses/fake_impls.py, \
not torch/_meta_registrations.py
""",
'--',
'@{{PATHSFILE}}'
]
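
The linter entry above flags any occurrence of `create_unbacked` in `torch/_meta_registrations.py`. A minimal sketch of what a grep-style check of this kind does is shown below. It is an illustration only, not the implementation or output format of `tools/linter/adapters/grep_linter.py`; `LintMessage` and `grep_lint` are hypothetical names.

```python
import re
from typing import List, NamedTuple


class LintMessage(NamedTuple):
    path: str
    line: int
    name: str
    description: str


def grep_lint(path: str, text: str, pattern: str,
              error_name: str, error_description: str) -> List[LintMessage]:
    # Report one message per line of `text` that matches `pattern`.
    regex = re.compile(pattern)
    return [
        LintMessage(path, lineno, error_name, error_description)
        for lineno, line in enumerate(text.splitlines(), start=1)
        if regex.search(line)
    ]
```

For example, calling `grep_lint` with `pattern="create_unbacked"` on a file whose second line contains `create_unbacked_symint()` would return a single message pointing at line 2.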
[[linter]]
code = 'ATEN_CPU_GPU_AGNOSTIC'
include_patterns = [

View File

@ -305,7 +305,6 @@ if(NOT DEFINED USE_VULKAN)
cmake_dependent_option(USE_VULKAN "Use Vulkan GPU backend" ON "ANDROID" OFF)
endif()
option(USE_SLEEF_FOR_ARM_VEC256 "Use sleef for arm" OFF)
option(USE_SOURCE_DEBUG_ON_MOBILE "Enable" ON)
option(USE_LITE_INTERPRETER_PROFILER "Enable" ON)
cmake_dependent_option(
@ -369,7 +368,7 @@ cmake_dependent_option(
USE_C10D_MPI "USE C10D MPI" ON "USE_DISTRIBUTED;USE_MPI" OFF)
cmake_dependent_option(
USE_TENSORPIPE "Use TensorPipe. Only available if USE_DISTRIBUTED is on." ON
"USE_DISTRIBUTED" OFF)
"USE_DISTRIBUTED AND NOT WIN32" OFF)
option(ONNX_ML "Enable traditional ONNX ML API." ON)
option(HAVE_SOVERSION "Whether to add SOVERSION to the shared objects" OFF)
option(BUILD_LIBTORCH_CPU_WITH_DEBUG
@ -912,11 +911,6 @@ if(USE_PYTORCH_QNNPACK)
string(APPEND CMAKE_CXX_FLAGS " -DUSE_PYTORCH_QNNPACK")
endif()
if(USE_SLEEF_FOR_ARM_VEC256)
string(APPEND CMAKE_CXX_FLAGS " -DAT_BUILD_ARM_VEC256_WITH_SLEEF")
add_definitions(-DAT_BUILD_ARM_VEC256_WITH_SLEEF)
endif()
# Enable sleef on macOS with Apple silicon by default
if((${CMAKE_SYSTEM_NAME} STREQUAL "Darwin") AND ("${CMAKE_SYSTEM_PROCESSOR}" STREQUAL "arm64"))
message(STATUS "Running on macOS with Apple silicon")
@ -924,6 +918,14 @@ if((${CMAKE_SYSTEM_NAME} STREQUAL "Darwin") AND ("${CMAKE_SYSTEM_PROCESSOR}" STR
add_definitions(-DAT_BUILD_ARM_VEC256_WITH_SLEEF)
endif()
# Enable sleef on Arm(R) architecture by default (except Android)
if((NOT ${CMAKE_SYSTEM_NAME} STREQUAL "Android")
AND("${CMAKE_SYSTEM_PROCESSOR}" MATCHES "aarch64"))
string(APPEND CMAKE_CXX_FLAGS " -DAT_BUILD_ARM_VEC256_WITH_SLEEF")
add_definitions(-DAT_BUILD_ARM_VEC256_WITH_SLEEF)
endif()
if(USE_XNNPACK)
string(APPEND CMAKE_CXX_FLAGS " -DUSE_XNNPACK")
endif()

View File

@ -98,6 +98,10 @@ test/test_type_promotion.py @mruberry
test/functorch/test_ops.py @zou3519 @chillee @kshitij12345
test/functorch/test_vmap.py @zou3519 @chillee @kshitij12345
# HOPs
torch/_higher_order_ops/*.py @zou3519
torch/_dynamo/variables/higher_order_ops.py @zou3519
# torch MPS
test/test_mps.py @kulinseth @malfet
aten/src/ATen/mps/ @kulinseth @malfet

View File

@ -50,7 +50,6 @@ aspects of contributing to PyTorch.
- [Windows development tips](#windows-development-tips)
- [Known MSVC (and MSVC with NVCC) bugs](#known-msvc-and-msvc-with-nvcc-bugs)
- [Building on legacy code and CUDA](#building-on-legacy-code-and-cuda)
- [Running clang-tidy](#running-clang-tidy)
- [Pre-commit tidy/linting hook](#pre-commit-tidylinting-hook)
- [Building PyTorch with ASAN](#building-pytorch-with-asan)
- [Getting `ccache` to work](#getting-ccache-to-work)
@ -1132,38 +1131,6 @@ CUDA, MSVC, and PyTorch versions are interdependent; please install matching ver
Note: There's a [compilation issue](https://github.com/oneapi-src/oneDNN/issues/812) in several Visual Studio 2019 versions since 16.7.1, so please make sure your Visual Studio 2019 version is not in 16.7.1 ~ 16.7.5
## Running clang-tidy
[Clang-Tidy](https://clang.llvm.org/extra/clang-tidy/index.html) is a C++
linter and static analysis tool based on the clang compiler. We run clang-tidy
in our CI to make sure that new C++ code is safe, sane and efficient. See the
[`clang-tidy` job in our GitHub Workflow's
lint.yml file](https://github.com/pytorch/pytorch/blob/main/.github/workflows/lint.yml)
for the simple commands we use for this.
To run clang-tidy locally, follow these steps:
1. Install clang-tidy.
We provide custom built binaries which have additional checks enabled. You can install it by running:
```bash
python3 -m tools.linter.clang_tidy.generate_build_files
```
We currently only support Linux and MacOS (x86).
2. Install clang-tidy driver script dependencies
```bash
pip3 install -r tools/linter/clang_tidy/requirements.txt
```
3. Run clang-tidy
```bash
# Run clang-tidy on the entire codebase
make clang-tidy
# Run clang-tidy only on your changes
make clang-tidy CHANGED_ONLY=--changed-only
```
This internally invokes our driver script and closely mimics how clang-tidy is run on CI.
## Pre-commit tidy/linting hook
We use clang-tidy to perform additional

View File

@ -27,8 +27,8 @@ Our trunk health (Continuous Integration signals) can be found at [hud.pytorch.o
- [NVIDIA CUDA Support](#nvidia-cuda-support)
- [AMD ROCm Support](#amd-rocm-support)
- [Intel GPU Support](#intel-gpu-support)
- [Install Dependencies](#install-dependencies)
- [Get the PyTorch Source](#get-the-pytorch-source)
- [Install Dependencies](#install-dependencies)
- [Install PyTorch](#install-pytorch)
- [Adjust Build Options (Optional)](#adjust-build-options-optional)
- [Docker Image](#docker-image)
@ -161,9 +161,34 @@ They require JetPack 4.2 and above, and [@dusty-nv](https://github.com/dusty-nv)
#### Prerequisites
If you are installing from source, you will need:
- Python 3.8 or later (for Linux, Python 3.8.1+ is needed)
- A compiler that fully supports C++17, such as clang or gcc (gcc 9.4.0 or newer is required)
- A compiler that fully supports C++17, such as clang or gcc (on Linux, gcc 9.4.0 or newer is required)
- Visual Studio or Visual Studio Build Tool on Windows
We highly recommend installing an [Anaconda](https://www.anaconda.com/download) environment. You will get a high-quality BLAS library (MKL) and you get controlled dependency versions regardless of your Linux distro.
\* PyTorch CI uses Visual C++ BuildTools, which come with Visual Studio Enterprise,
Professional, or Community Editions. You can also install the build tools from
https://visualstudio.microsoft.com/visual-cpp-build-tools/. The build tools *do not*
come with Visual Studio Code by default.
\* We highly recommend installing an [Anaconda](https://www.anaconda.com/download) environment. You will get a high-quality BLAS library (MKL) and you get controlled dependency versions regardless of your Linux distro.
An example of environment setup is shown below:
* Linux:
```bash
$ source <CONDA_INSTALL_DIR>/bin/activate
$ conda create -y -n <CONDA_NAME>
$ conda activate <CONDA_NAME>
```
* Windows:
```cmd
call <CONDA_INSTALL_DIR>\Scripts\activate.bat
conda create -y -n <CONDA_NAME>
conda activate <CONDA_NAME>
call "C:\Program Files\Microsoft Visual Studio\<VERSION>\Community\VC\Auxiliary\Build\vcvarsall.bat" x64
```
##### NVIDIA CUDA Support
If you want to compile with CUDA support, [select a supported version of CUDA from our support matrix](https://pytorch.org/get-started/locally/), then install the following:
@ -194,12 +219,23 @@ If you want to compile with Intel GPU support, follow these
If you want to disable Intel GPU support, export the environment variable `USE_XPU=0`.
Other potentially useful environment variables may be found in `setup.py`.
#### Get the PyTorch Source
```bash
git clone --recursive https://github.com/pytorch/pytorch
cd pytorch
# if you are updating an existing checkout
git submodule sync
git submodule update --init --recursive
```
#### Install Dependencies
**Common**
```bash
conda install cmake ninja
# Run this command on native Windows
conda install rust
# Run this command from the PyTorch directory after cloning the source code using the "Get the PyTorch Source" section above
pip install -r requirements.txt
```
@ -235,15 +271,6 @@ pip install mkl-static mkl-include
conda install -c conda-forge libuv=1.39
```
#### Get the PyTorch Source
```bash
git clone --recursive https://github.com/pytorch/pytorch
cd pytorch
# if you are updating an existing checkout
git submodule sync
git submodule update --init --recursive
```
#### Install PyTorch
**On Linux**
@ -284,13 +311,6 @@ python3 setup.py develop
**On Windows**
Choose Correct Visual Studio Version.
PyTorch CI uses Visual C++ BuildTools, which come with Visual Studio Enterprise,
Professional, or Community Editions. You can also install the build tools from
https://visualstudio.microsoft.com/visual-cpp-build-tools/. The build tools *do not*
come with Visual Studio Code by default.
If you want to build legacy python code, please refer to [Building on legacy code and CUDA](https://github.com/pytorch/pytorch/blob/main/CONTRIBUTING.md#building-on-legacy-code-and-cuda)
**CPU-only builds**
@ -298,7 +318,6 @@ If you want to build legacy python code, please refer to [Building on legacy cod
In this mode PyTorch computations will run on your CPU, not your GPU
```cmd
conda activate
python setup.py develop
```

View File

@ -48,16 +48,16 @@
Following is the Release Compatibility Matrix for PyTorch releases:
| PyTorch version | Python | Stable CUDA | Experimental CUDA | Stable ROCm |
| --- | --- | --- | --- | --- |
| 2.5 | >=3.9, <=3.12, (3.13 experimental) | CUDA 11.8, CUDA 12.1, CUDA 12.4, CUDNN 9.1.0.70 | None | ROCm 6.2 |
| 2.4 | >=3.8, <=3.12 | CUDA 11.8, CUDA 12.1, CUDNN 9.1.0.70 | CUDA 12.4, CUDNN 9.1.0.70 | ROCm 6.1 |
| 2.3 | >=3.8, <=3.11, (3.12 experimental) | CUDA 11.8, CUDNN 8.7.0.84 | CUDA 12.1, CUDNN 8.9.2.26 | ROCm 6.0 |
| 2.2 | >=3.8, <=3.11, (3.12 experimental) | CUDA 11.8, CUDNN 8.7.0.84 | CUDA 12.1, CUDNN 8.9.2.26 | ROCm 5.7 |
| 2.1 | >=3.8, <=3.11 | CUDA 11.8, CUDNN 8.7.0.84 | CUDA 12.1, CUDNN 8.9.2.26 | ROCm 5.6 |
| 2.0 | >=3.8, <=3.11 | CUDA 11.7, CUDNN 8.5.0.96 | CUDA 11.8, CUDNN 8.7.0.84 | ROCm 5.4 |
| 1.13 | >=3.7, <=3.10 | CUDA 11.6, CUDNN 8.3.2.44 | CUDA 11.7, CUDNN 8.5.0.96 | ROCm 5.2 |
| 1.12 | >=3.7, <=3.10 | CUDA 11.3, CUDNN 8.3.2.44 | CUDA 11.6, CUDNN 8.3.2.44 | ROCm 5.0 |
| PyTorch version | Python | C++ | Stable CUDA | Experimental CUDA | Stable ROCm |
| --- | --- | --- | --- | --- | --- |
| 2.5 | >=3.9, <=3.12, (3.13 experimental) | C++17 | CUDA 11.8, CUDA 12.1, CUDA 12.4, CUDNN 9.1.0.70 | None | ROCm 6.2 |
| 2.4 | >=3.8, <=3.12 | C++17 | CUDA 11.8, CUDA 12.1, CUDNN 9.1.0.70 | CUDA 12.4, CUDNN 9.1.0.70 | ROCm 6.1 |
| 2.3 | >=3.8, <=3.11, (3.12 experimental) | C++17 | CUDA 11.8, CUDNN 8.7.0.84 | CUDA 12.1, CUDNN 8.9.2.26 | ROCm 6.0 |
| 2.2 | >=3.8, <=3.11, (3.12 experimental) | C++17 | CUDA 11.8, CUDNN 8.7.0.84 | CUDA 12.1, CUDNN 8.9.2.26 | ROCm 5.7 |
| 2.1 | >=3.8, <=3.11 | C++17 | CUDA 11.8, CUDNN 8.7.0.84 | CUDA 12.1, CUDNN 8.9.2.26 | ROCm 5.6 |
| 2.0 | >=3.8, <=3.11 | C++14 | CUDA 11.7, CUDNN 8.5.0.96 | CUDA 11.8, CUDNN 8.7.0.84 | ROCm 5.4 |
| 1.13 | >=3.7, <=3.10 | C++14 | CUDA 11.6, CUDNN 8.3.2.44 | CUDA 11.7, CUDNN 8.5.0.96 | ROCm 5.2 |
| 1.12 | >=3.7, <=3.10 | C++14 | CUDA 11.3, CUDNN 8.3.2.44 | CUDA 11.6, CUDNN 8.3.2.44 | ROCm 5.0 |
## Release Cadence
@ -234,7 +234,7 @@ Typically, within a release cycle fixes are necessary for regressions, test fixe
For fixes that are to go into a release after the release branch has been cut, we typically use a cherry-pick tracker.
An example of this would look like:
* https://github.com/pytorch/pytorch/issues/51886
* https://github.com/pytorch/pytorch/issues/128436
Please also make sure to add a milestone target to the PR/issue, especially if it needs to be considered for inclusion in the dot release.
@ -243,7 +243,9 @@ Please also make sure to add milestone target to the PR/issue, especially if it
#### How to do Cherry Picking
You can now use `pytorchbot` to cherry pick a PyTorch PR that has been committed
to the main branch using `@pytorchbot cherry-pick` command as follows.
to the main branch using the `@pytorchbot cherry-pick` command as follows (make sure
that the cherry-pick tracker issue for the target release is labelled as "release tracker" -
this will allow the bot to find it and post comments).
```
usage: @pytorchbot cherry-pick --onto ONTO [--fixes FIXES] -c
@ -380,7 +382,7 @@ Patch release process takes around 4-5 weeks to complete.
### Issue Tracker for Patch releases
For patch releases, an issue tracker needs to be created. We require all cherry-pick changes to link to either a high-priority GitHub issue or a CI failure from the previous RC. An example of this would look like:
* https://github.com/pytorch/pytorch/issues/51886
* https://github.com/pytorch/pytorch/issues/128436
Only the following issues are accepted:
1. Fixes to regressions against previous major version (e.g. regressions introduced in 1.13.0 from 1.12.0 are pickable for 1.13.1)

View File

@ -54,7 +54,7 @@ if(NOT BUILD_LITE_INTERPRETER)
endif()
EXCLUDE(ATen_CORE_SRCS "${ATen_CORE_SRCS}" ${ATen_CORE_TEST_SRCS})
file(GLOB base_h "*.h" "detail/*.h" "cpu/*.h" "cpu/vec/vec512/*.h" "cpu/vec/vec256/*.h" "cpu/vec/vec256/vsx/*.h" "cpu/vec/vec256/zarch/*.h" "cpu/vec/*.h" "quantized/*.h" "functorch/*.h")
file(GLOB base_h "*.h" "detail/*.h" "cpu/*.h" "cpu/vec/vec512/*.h" "cpu/vec/vec256/*.h" "cpu/vec/vec256/vsx/*.h" "cpu/vec/vec256/zarch/*.h" "cpu/vec/sve/*.h" "cpu/vec/*.h" "quantized/*.h" "functorch/*.h")
file(GLOB base_cpp "*.cpp" "detail/*.cpp" "cpu/*.cpp" "functorch/*.cpp")
file(GLOB cuda_h "cuda/*.h" "cuda/detail/*.h" "cuda/*.cuh" "cuda/detail/*.cuh" "cuda/tunable/*.cuh" "cuda/tunable/*.h")
file(GLOB cuda_cpp "cuda/*.cpp" "cuda/detail/*.cpp" "cuda/tunable/*.cpp")

View File

@ -145,6 +145,14 @@ void Context::setSDPUseMath(bool e) {
enabled_mathSDP = e;
}
bool Context::allowFP16BF16ReductionMathSDP() const {
return allow_fp16_bf16_reduction_mathSDP;
}
void Context::setAllowFP16BF16ReductionMathSDP(bool e) {
allow_fp16_bf16_reduction_mathSDP = e;
}
bool Context::userEnabledCuDNNSDP() const {
return enabled_cudnnSDP;
}

View File

@ -234,6 +234,9 @@ class TORCH_API Context {
void setSDPUseCuDNN(bool);
bool userEnabledCuDNNSDP() const;
void setAllowFP16BF16ReductionMathSDP(bool);
bool allowFP16BF16ReductionMathSDP() const;
void setSDPUseOverrideable(bool);
bool userEnabledOverrideableSDP() const;
@ -390,6 +393,7 @@ class TORCH_API Context {
bool enabled_mathSDP = true;
bool enabled_cudnnSDP = true;
bool enabled_overrideable = true;
bool allow_fp16_bf16_reduction_mathSDP = false;
#ifdef USE_ROCM
bool benchmark_cudnn = true;
#else

View File

@ -299,6 +299,15 @@ inline void deprecated_AT_DISPATCH_ALL_TYPES_AND_HALF_AND_COMPLEX() {}
AT_DISPATCH_CASE(SCALARTYPE3, __VA_ARGS__) \
AT_DISPATCH_CASE(SCALARTYPE4, __VA_ARGS__)
#define AT_DISPATCH_CASE_FLOATING_TYPES_AND5( \
SCALARTYPE1, SCALARTYPE2, SCALARTYPE3, SCALARTYPE4, SCALARTYPE5, ...) \
AT_DISPATCH_CASE_FLOATING_TYPES(__VA_ARGS__) \
AT_DISPATCH_CASE(SCALARTYPE1, __VA_ARGS__) \
AT_DISPATCH_CASE(SCALARTYPE2, __VA_ARGS__) \
AT_DISPATCH_CASE(SCALARTYPE3, __VA_ARGS__) \
AT_DISPATCH_CASE(SCALARTYPE4, __VA_ARGS__) \
AT_DISPATCH_CASE(SCALARTYPE5, __VA_ARGS__)
#define AT_DISPATCH_FLOATING_TYPES_AND4( \
SCALARTYPE1, SCALARTYPE2, SCALARTYPE3, SCALARTYPE4, TYPE, NAME, ...) \
AT_DISPATCH_SWITCH( \
@ -307,6 +316,26 @@ inline void deprecated_AT_DISPATCH_ALL_TYPES_AND_HALF_AND_COMPLEX() {}
AT_DISPATCH_CASE_FLOATING_TYPES_AND4( \
SCALARTYPE1, SCALARTYPE2, SCALARTYPE3, SCALARTYPE4, __VA_ARGS__))
#define AT_DISPATCH_FLOATING_TYPES_AND5( \
SCALARTYPE1, \
SCALARTYPE2, \
SCALARTYPE3, \
SCALARTYPE4, \
SCALARTYPE5, \
TYPE, \
NAME, \
...) \
AT_DISPATCH_SWITCH( \
TYPE, \
NAME, \
AT_DISPATCH_CASE_FLOATING_TYPES_AND5( \
SCALARTYPE1, \
SCALARTYPE2, \
SCALARTYPE3, \
SCALARTYPE4, \
SCALARTYPE5, \
__VA_ARGS__))
#define AT_DISPATCH_CASE_COMPLEX_TYPES(...) \
AT_DISPATCH_CASE(at::ScalarType::ComplexDouble, __VA_ARGS__) \
AT_DISPATCH_CASE(at::ScalarType::ComplexFloat, __VA_ARGS__)

View File

@ -105,6 +105,11 @@ std::string get_cpu_capability() {
return "DEFAULT";
case native::CPUCapability::ZVECTOR:
return "Z VECTOR";
#elif defined(HAVE_SVE_CPU_DEFINITION)
case native::CPUCapability::DEFAULT:
return "DEFAULT";
case native::CPUCapability::SVE256:
return "SVE256";
#else
case native::CPUCapability::DEFAULT:
return "NO AVX";

View File

@ -336,6 +336,7 @@ TORCH_LIBRARY_IMPL(aten, AutocastCPU, m) {
KERNEL_CPU(linalg_vecdot, lower_precision_fp)
KERNEL_CPU(baddbmm, lower_precision_fp)
KERNEL_CPU(addmm, lower_precision_fp)
KERNEL_CPU(_addmm_activation, lower_precision_fp)
KERNEL_CPU(addbmm, lower_precision_fp)
KERNEL_CPU(linear, lower_precision_fp)
KERNEL_CPU(_convolution, deprecated, lower_precision_fp)

View File

@ -1,4 +1,6 @@
#include <c10/core/Allocator.h>
#include <c10/core/thread_pool.h>
#include <c10/util/CallOnce.h>
#include <c10/util/flat_hash_map.h>
#include <c10/util/llvmMathExtras.h>
#include <optional>
@ -109,6 +111,17 @@ template <
typename E,
typename B = HostBlock<S>>
struct CachingHostAllocatorImpl {
CachingHostAllocatorImpl() {
// Launch the background thread and process events in a loop.
if (pinned_use_background_threads()) {
getBackgroundThreadPool()->run([&]() {
while (true) {
process_events();
std::this_thread::sleep_for(std::chrono::microseconds(100));
}
});
}
}
virtual ~CachingHostAllocatorImpl() = default;
public:
@ -118,17 +131,34 @@ struct CachingHostAllocatorImpl {
return {nullptr, nullptr};
}
process_events();
// First, try to allocate from the free list
auto* block = get_free_block(size);
if (block) {
return {block->ptr_, reinterpret_cast<void*>(block)};
// If we are using background threads, we can process events in the
// background.
if (!pinned_use_background_threads()) {
process_events();
}
// Round up the allocation to the nearest power of two to improve reuse.
// These power of two sizes are also used to index into the free list.
size_t roundSize = c10::llvm::PowerOf2Ceil(size);
// First, try to allocate from the free list
auto* block = get_free_block(roundSize);
if (block) {
return {block->ptr_, reinterpret_cast<void*>(block)};
}
// Check in the recently freed blocks with pending events to see if we
// can reuse them. Call get_free_block again after processing events
if (pinned_use_background_threads()) {
process_events_for_specific_size(roundSize);
block = get_free_block(roundSize);
if (block) {
return {block->ptr_, reinterpret_cast<void*>(block)};
}
}
// Slow path: if we can't allocate from the cached free list, we need
// to create a new block.
void* ptr = nullptr;
allocate_host_memory(roundSize, &ptr);
@ -237,6 +267,10 @@ struct CachingHostAllocatorImpl {
return c10::llvm::Log2_64_Ceil(size);
}
virtual bool pinned_use_background_threads() {
return false;
}
virtual void copy_data(void* dest [[maybe_unused]], const void* src [[maybe_unused]], std::size_t count [[maybe_unused]]) const {
TORCH_CHECK_NOT_IMPLEMENTED(false, "Not implemented for copy_data");
}
@ -261,6 +295,21 @@ struct CachingHostAllocatorImpl {
}
virtual void process_events() {
// Process all events until an unready event is reached, regardless of block size.
process_events_for_specific_size(-1);
}
// If size is -1, process events starting from the back of the queue until an
// unready event is encountered. Otherwise, process only events whose block
// matches the given size; once the first ready block is found, add it to the
// free list and return.
virtual void process_events_for_specific_size(int64_t size) {
size_t event_count = 0;
size_t max_events = 0;
{
std::lock_guard<std::mutex> g(events_mutex_);
max_events = events_.size();
}
while (true) {
// Avoid calling cudaEventDestroy while holding a mutex, so move
// intermediate events out of the lock into this object.
@ -278,6 +327,25 @@ struct CachingHostAllocatorImpl {
return;
}
if (size != -1) {
if (event_count++ > max_events) {
{
std::lock_guard<std::mutex> g(events_mutex_);
events_.push_front(std::move(*processed));
}
return;
}
if (size != (int64_t)processed->second->size_) {
// if we are processing a specific size, and the size of the block
// doesn't match, we can't use it.
{
std::lock_guard<std::mutex> g(events_mutex_);
events_.push_front(std::move(*processed));
}
continue;
}
}
// otherwise, query the event
{
// now, see if we can handle this element
@ -286,9 +354,14 @@ struct CachingHostAllocatorImpl {
// push the event onto the back if it's not ready.
{
std::lock_guard<std::mutex> g(events_mutex_);
events_.push_back(std::move(*processed));
if (size == -1) {
events_.push_back(std::move(*processed));
return;
} else {
events_.push_front(std::move(*processed));
continue;
}
}
return;
}
}
@ -309,46 +382,54 @@ struct CachingHostAllocatorImpl {
auto index = size_index(block->size_);
std::lock_guard<std::mutex> g(free_list_[index].mutex_);
free_list_[index].list_.push_back(block);
if (size != -1) {
return;
}
}
}
}
/* These following functions are runtime-related. */
// Allocate page-locked memory on the host.
virtual void allocate_host_memory(size_t size, void** ptr) {
TORCH_CHECK_NOT_IMPLEMENTED(
false, "Not implemented for allocate_host_memory");
TaskThreadPool* getBackgroundThreadPool() {
static TaskThreadPool* pool = new TaskThreadPool(1);
return pool;
}
// Free block and release the pointer contained in block.
virtual void free_block(B* block) {
TORCH_CHECK_NOT_IMPLEMENTED(false, "Not implemented for free_block");
}
/* These following functions are runtime-related. */
// Record an event on stream and store event into events.
virtual void record_stream(std::optional<std::vector<E>>& events, S stream) {
TORCH_CHECK_NOT_IMPLEMENTED(false, "Not implemented for record_stream");
}
// Allocate page-locked memory on the host.
virtual void allocate_host_memory(size_t size, void** ptr) {
TORCH_CHECK_NOT_IMPLEMENTED(
false, "Not implemented for allocate_host_memory");
}
// Query event if it is completed.
virtual bool query_event(E& event) {
TORCH_CHECK_NOT_IMPLEMENTED(false, "Not implemented for query_event");
}
// Free block and release the pointer contained in block.
virtual void free_block(B* block) {
TORCH_CHECK_NOT_IMPLEMENTED(false, "Not implemented for free_block");
}
alignas(64) std::mutex blocks_mutex_;
ska::flat_hash_set<B*> blocks_; // block list
ska::flat_hash_map<void*, B*> ptr_to_block_;
// Record an event on stream and store event into events.
virtual void record_stream(std::optional<std::vector<E>>& events, S stream) {
TORCH_CHECK_NOT_IMPLEMENTED(false, "Not implemented for record_stream");
}
// We keep free list as a vector of free lists, one for each power of two
// size. This allows us to quickly find a free block of the right size.
// We use deque to store per size free list and guard the list with its own
// mutex.
alignas(64) std::vector<FreeBlockList<B>> free_list_ = std::vector<FreeBlockList<B>>(MAX_SIZE_INDEX);
// Query event if it is completed.
virtual bool query_event(E& event) {
TORCH_CHECK_NOT_IMPLEMENTED(false, "Not implemented for query_event");
}
alignas(64) std::mutex events_mutex_;
std::deque<std::pair<E, B*>> events_; // event queue paired with block
};
alignas(64) std::mutex blocks_mutex_;
ska::flat_hash_set<B*> blocks_; // block list
ska::flat_hash_map<void*, B*> ptr_to_block_;
// We keep free list as a vector of free lists, one for each power of two
// size. This allows us to quickly find a free block of the right size.
// We use deque to store per size free list and guard the list with its own
// mutex.
alignas(64) std::vector<FreeBlockList<B>> free_list_ = std::vector<FreeBlockList<B>>(MAX_SIZE_INDEX);
alignas(64) std::mutex events_mutex_;
std::deque<std::pair<E, B*>> events_; // event queue paired with block
};
template <typename T>
struct CachingHostAllocatorInterface : public at::Allocator {
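
Reviewer note: the free-list comment in this hunk describes one bucket per power-of-two size, each guarded by its own mutex. A minimal sketch of that layout follows, under stated assumptions: `Block`, `FreeBucket`, `BucketedFreeList`, and `size_index` are hypothetical names for illustration, not the types in this change, and block sizes are assumed to be rounded up to a power of two before they are cached.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <mutex>
#include <vector>

// Hypothetical stand-in for the allocator's block type B.
struct Block {
  std::size_t size; // assumed to be rounded up to a power of two
  void* ptr;
};

// One bucket per size class: a deque of free blocks plus its own mutex.
struct FreeBucket {
  std::mutex mutex;
  std::deque<Block*> blocks;
};

// ceil(log2(size)): index of the smallest power of two that is >= size.
inline std::size_t size_index(std::size_t size) {
  assert(size > 0);
  std::size_t index = 0;
  // The first condition keeps the shift below the width of size_t.
  while (index + 1 < sizeof(std::size_t) * 8 &&
         (std::size_t{1} << index) < size) {
    ++index;
  }
  return index;
}

class BucketedFreeList {
 public:
  explicit BucketedFreeList(std::size_t num_buckets) : buckets_(num_buckets) {}

  // Return a block to the bucket that matches its size.
  void put(Block* block) {
    FreeBucket& bucket = bucket_for(block->size);
    std::lock_guard<std::mutex> guard(bucket.mutex);
    bucket.blocks.push_back(block);
  }

  // Pop a cached block from the bucket for `size`, or nullptr if it is empty.
  Block* take(std::size_t size) {
    FreeBucket& bucket = bucket_for(size);
    std::lock_guard<std::mutex> guard(bucket.mutex);
    if (bucket.blocks.empty()) {
      return nullptr;
    }
    Block* block = bucket.blocks.back();
    bucket.blocks.pop_back();
    return block;
  }

 private:
  FreeBucket& bucket_for(std::size_t size) {
    const std::size_t index = size_index(size);
    assert(index < buckets_.size());
    return buckets_[index];
  }

  std::vector<FreeBucket> buckets_;
};
```

Because each bucket has its own lock, requests for different size classes do not contend with each other. A lookup only touches the one bucket whose power-of-two capacity covers the request.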

@ -45,7 +45,7 @@ private:
c10::impl::LocalDispatchKeySet saved_;
};
void pythonFallback(const c10::OperatorHandle& op, torch::jit::Stack* stack) {
void pythonFallback(const c10::OperatorHandle& op, c10::DispatchKeySet dispatch_keys, torch::jit::Stack* stack) {
TORCH_INTERNAL_ASSERT(tls_on_entry.has_value());
// c10::impl::ForceDispatchKeyGuard dispatcher_guard(tls_on_entry.value());
// StashTLSOnEntryGuard stash_guard;
@ -68,12 +68,20 @@ void pythonFallback(const c10::OperatorHandle& op, torch::jit::Stack* stack) {
// we actually run dispatch(), we will take out PyObjects in the context
// of that interpreter, and this will ensure that everyone is on the same
// interpreter.
bool tensors_with_python_key_present = false;
c10::impl::PyInterpreter* interpreter = nullptr;
for (const auto& ivalue : torch::jit::last(*stack, num_arguments)) {
if (ivalue.isTensor()) {
auto* interpreter = ivalue.unsafeToTensorImpl()->pyobj_slot()->pyobj_interpreter();
if (interpreter) {
(*interpreter)->dispatch(op, stack);
return;
auto* t = ivalue.unsafeToTensorImpl();
if (t->key_set().has(c10::DispatchKey::Python)) {
tensors_with_python_key_present = true;
}
if (!interpreter) {
auto* t_interpreter = t->pyobj_slot()->pyobj_interpreter();
if (t_interpreter) {
interpreter = t_interpreter;
}
}
} else if (ivalue.isTensorList() || ivalue.isOptionalTensorList()) {
// NB: use toListRef as it doesn't induce refcount bumps (toTensorListRef
@ -82,14 +90,43 @@ void pythonFallback(const c10::OperatorHandle& op, torch::jit::Stack* stack) {
if (nv.isNone()) {
continue;
}
auto* interpreter = nv.unsafeToTensorImpl()->pyobj_slot()->pyobj_interpreter();
if (interpreter) {
(*interpreter)->dispatch(op, stack);
return;
auto* t = nv.unsafeToTensorImpl();
if (t->key_set().has(c10::DispatchKey::Python)) {
tensors_with_python_key_present = true;
}
if (!interpreter) {
auto* t_interpreter = t->pyobj_slot()->pyobj_interpreter();
if (t_interpreter) {
interpreter = t_interpreter;
}
}
}
}
}
if (interpreter) {
if (tensors_with_python_key_present) {
(*interpreter)->dispatch(op, stack);
} else {
// At this point, there are no modes in the stack and no tensors with the Python key,
// so disable the Python key before redispatching.
// See https://github.com/pytorch/pytorch/issues/136565
c10::DispatchKeySet keyset = dispatch_keys.remove(c10::DispatchKey::Python);
// Remove Python key from the included set as well (modes add it there).
c10::impl::LocalDispatchKeySet local_keyset = c10::impl::tls_local_dispatch_key_set();
c10::impl::ForceDispatchKeyGuard no_python_guard(
local_keyset.included_.remove(c10::DispatchKey::Python),
local_keyset.excluded_
);
op.redispatchBoxed(keyset, stack);
}
return;
}
TORCH_INTERNAL_ASSERT(0, "Hit Python dispatch key but no arguments had PyInterpreter (no tensor args?)");
}
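
Reviewer note: the new control flow in `pythonFallback` can be summarized as "scan every tensor argument first, then decide once". The sketch below models only that decision, under stated assumptions: `FakeTensor`, `Route`, `Decision`, `decide`, and the bit constants are hypothetical stand-ins, and a plain 64-bit mask takes the place of `c10::DispatchKeySet`.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical bit positions standing in for dispatch keys.
constexpr std::uint64_t kPythonKey = 1ull << 0;
constexpr std::uint64_t kCpuKey = 1ull << 1;

// Hypothetical stand-in for a tensor argument.
struct FakeTensor {
  std::uint64_t key_set;   // keys carried by the tensor itself
  const void* interpreter; // non-null when a Python interpreter is attached
};

enum class Route { Interpreter, RedispatchWithoutPython, Error };

struct Decision {
  Route route;
  std::uint64_t keys; // key set to use when redispatching
};

inline Decision decide(const std::vector<FakeTensor>& args,
                       std::uint64_t dispatch_keys) {
  bool python_key_present = false;
  const void* interpreter = nullptr;
  // Scan all arguments instead of returning at the first interpreter found.
  for (const FakeTensor& t : args) {
    if (t.key_set & kPythonKey) {
      python_key_present = true;
    }
    if (!interpreter && t.interpreter) {
      interpreter = t.interpreter;
    }
  }
  if (!interpreter) {
    return {Route::Error, dispatch_keys};
  }
  if (python_key_present) {
    return {Route::Interpreter, dispatch_keys};
  }
  // No tensor carries the Python key: strip it and fall through to the
  // next kernel rather than calling back into Python.
  const std::uint64_t keys = dispatch_keys & ~kPythonKey;
  assert((keys & kPythonKey) == 0);
  return {Route::RedispatchWithoutPython, keys};
}
```

The middle case is the one this change adds: a tensor that has an interpreter attached but no Python key no longer forces a Python dispatch.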

@ -78,7 +78,7 @@ struct VecReduceAllSIMD<float, Op> {
#endif // defined(CPU_CAPABILITY_AVX512)
#endif // defined(__GNUC__) && (__GNUC__ > 5) && !defined(_MSC_VER) && !defined(C10_MOBILE)
#if defined(__aarch64__) && !defined(C10_MOBILE) && !defined(__CUDACC__)
#if defined(__aarch64__) && !defined(C10_MOBILE) && !defined(__CUDACC__) && !defined(CPU_CAPABILITY_SVE)
template <typename Op>
struct VecReduceAllSIMD<float, Op> {
static inline float apply(const Op& vec_fun, const Vectorized<float>& acc_vec) {

@ -5,6 +5,10 @@
#elif defined(__clang__) && (defined(__ARM_NEON__) || defined(__aarch64__))
/* Clang-compatible compiler, targeting arm neon */
#include <arm_neon.h>
#if defined(__ARM_FEATURE_SVE)
/* Clang-compatible compiler, targeting ARM with SVE */
#include <arm_sve.h>
#endif
#elif defined(_MSC_VER)
/* Microsoft C/C++-compatible compiler */
#include <intrin.h>
@ -17,6 +21,10 @@
#elif defined(__GNUC__) && (defined(__ARM_NEON__) || defined(__aarch64__))
/* GCC-compatible compiler, targeting ARM with NEON */
#include <arm_neon.h>
#if defined(__ARM_FEATURE_SVE)
/* GCC-compatible compiler, targeting ARM with SVE */
#include <arm_sve.h>
#endif
#if defined (MISSING_ARM_VLD1)
#include <ATen/cpu/vec/vec256/missing_vld1_neon.h>
#elif defined (MISSING_ARM_VST1)

@ -0,0 +1,63 @@
#pragma once
#include <ATen/cpu/vec/intrinsics.h>
#include <ATen/cpu/vec/vec_base.h>
#if defined(CPU_CAPABILITY_SVE)
// Define the VLS (vector-length specific) data types.
typedef svbool_t vls_pred_t __attribute__((arm_sve_vector_bits(VECTOR_WIDTH * 8)));
typedef svint8_t vls_int8_t __attribute__((arm_sve_vector_bits(VECTOR_WIDTH * 8)));
typedef svint16_t vls_int16_t __attribute__((arm_sve_vector_bits(VECTOR_WIDTH * 8)));
typedef svint32_t vls_int32_t __attribute__((arm_sve_vector_bits(VECTOR_WIDTH * 8)));
typedef svint64_t vls_int64_t __attribute__((arm_sve_vector_bits(VECTOR_WIDTH * 8)));
typedef svuint8_t vls_uint8_t __attribute__((arm_sve_vector_bits(VECTOR_WIDTH * 8)));
typedef svuint16_t vls_uint16_t __attribute__((arm_sve_vector_bits(VECTOR_WIDTH * 8)));
typedef svuint32_t vls_uint32_t __attribute__((arm_sve_vector_bits(VECTOR_WIDTH * 8)));
typedef svuint64_t vls_uint64_t __attribute__((arm_sve_vector_bits(VECTOR_WIDTH * 8)));
typedef svfloat16_t vls_float16_t __attribute__((arm_sve_vector_bits(VECTOR_WIDTH * 8)));
typedef svfloat32_t vls_float32_t __attribute__((arm_sve_vector_bits(VECTOR_WIDTH * 8)));
typedef svfloat64_t vls_float64_t __attribute__((arm_sve_vector_bits(VECTOR_WIDTH * 8)));
#define ptrue svptrue_b8()
#define ZERO_S8 svdup_n_s8(0)
#define ZERO_S16 svdup_n_s16(0)
#define ZERO_S32 svdup_n_s32(0)
#define ZERO_S64 svdup_n_s64(0)
#define ZERO_U8 svdup_n_u8(0)
#define ZERO_U16 svdup_n_u16(0)
#define ZERO_U32 svdup_n_u32(0)
#define ZERO_U64 svdup_n_u64(0)
#define ZERO_F16 svdup_n_f16(0.f)
#define ZERO_F32 svdup_n_f32(0.f)
#define ZERO_F64 svdup_n_f64(0.0)
#define ONE_S8 svdup_n_s8(1)
#define ONE_S16 svdup_n_s16(1)
#define ONE_S32 svdup_n_s32(1)
#define ONE_S64 svdup_n_s64(1)
#define ONE_U8 svdup_n_u8(1)
#define ONE_U16 svdup_n_u16(1)
#define ONE_U32 svdup_n_u32(1)
#define ONE_U64 svdup_n_u64(1)
#define ONE_F16 svdup_n_f16(1.f)
#define ONE_F32 svdup_n_f32(1.f)
#define ONE_F64 svdup_n_f64(1.0)
#define ALL_S8_TRUE_MASK svdup_n_s8(0xff)
#define ALL_S8_FALSE_MASK svdup_n_s8(0x0)
#define ALL_S16_TRUE_MASK svdup_n_s16(0xffff)
#define ALL_S16_FALSE_MASK svdup_n_s16(0x0)
#define ALL_S32_TRUE_MASK svdup_n_s32(0xffffffff)
#define ALL_S32_FALSE_MASK svdup_n_s32(0x0)
#define ALL_S64_TRUE_MASK svdup_n_s64(0xffffffffffffffff)
#define ALL_S64_FALSE_MASK svdup_n_s64(0x0)
#define ALL_U8_TRUE_MASK svdup_n_u8(0x01)
#define ALL_U8_FALSE_MASK svdup_n_u8(0x00)
#define ALL_F16_TRUE_MASK svreinterpret_f16_s16(ALL_S16_TRUE_MASK)
#define ALL_F16_FALSE_MASK svreinterpret_f16_s16(ALL_S16_FALSE_MASK)
#define ALL_F32_TRUE_MASK svreinterpret_f32_s32(ALL_S32_TRUE_MASK)
#define ALL_F32_FALSE_MASK svreinterpret_f32_s32(ALL_S32_FALSE_MASK)
#define ALL_F64_TRUE_MASK svreinterpret_f64_s64(ALL_S64_TRUE_MASK)
#define ALL_F64_FALSE_MASK svreinterpret_f64_s64(ALL_S64_FALSE_MASK)
#endif // defined(CPU_CAPABILITY_SVE)

@ -0,0 +1,176 @@
#pragma once
// DO NOT DEFINE STATIC DATA IN THIS HEADER!
// See Note [Do not compile initializers with SVE]
#include <ATen/cpu/vec/intrinsics.h>
#include <ATen/cpu/vec/vec_base.h>
#include <ATen/cpu/vec/sve/sve_helper.h>
#if defined(CPU_CAPABILITY_SVE)
#include <ATen/cpu/vec/sve/vec_float.h>
#include <ATen/cpu/vec/sve/vec_double.h>
#include <ATen/cpu/vec/sve/vec_int.h>
#include <ATen/cpu/vec/sve/vec_qint.h>
#endif
namespace at {
namespace vec {
// Note [CPU_CAPABILITY namespace]
// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
// This header, and all of its subheaders, will be compiled with
// different architecture flags for each supported set of vector
// intrinsics. So we need to make sure they aren't inadvertently
// linked together. We do this by declaring objects in an `inline
// namespace` which changes the name mangling, but can still be
// accessed as `at::vec`.
inline namespace CPU_CAPABILITY {
#if defined(CPU_CAPABILITY_SVE)
// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ CAST ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
template<>
inline Vectorized<float> cast<float, double>(const Vectorized<double>& src) {
return svreinterpret_f32_f64(src);
}
template<>
inline Vectorized<double> cast<double, float>(const Vectorized<float>& src) {
return svreinterpret_f64_f32(src);
}
#define DEFINE_FLOAT_INT_CAST(int_t, int_bit, float_t, float_bit) \
template<> \
inline Vectorized<int_t> cast<int_t, float_t>(const Vectorized<float_t>& src) { \
return svreinterpret_s##int_bit##_f##float_bit(src); \
} \
template<> \
inline Vectorized<float_t> cast<float_t, int_t>(const Vectorized<int_t>& src) { \
return svreinterpret_f##float_bit##_s##int_bit(src); \
}
DEFINE_FLOAT_INT_CAST(int64_t, 64, double, 64)
DEFINE_FLOAT_INT_CAST(int32_t, 32, double, 64)
DEFINE_FLOAT_INT_CAST(int16_t, 16, double, 64)
DEFINE_FLOAT_INT_CAST(int64_t, 64, float, 32)
DEFINE_FLOAT_INT_CAST(int32_t, 32, float, 32)
DEFINE_FLOAT_INT_CAST(int16_t, 16, float, 32)
// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ GATHER ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
template<int64_t scale = 1>
std::enable_if_t<scale == 1 || scale == 2 || scale == 4 || scale == 8, Vectorized<double>>
inline gather(const double* base_addr, const Vectorized<int64_t>& vindex_) {
svint64_t vindex = svasrd_n_s64_x(ptrue, svmul_s64_x(ptrue, vindex_, svdup_n_s64(scale)), 3);
return svld1_gather_s64index_f64(ptrue, base_addr, vindex);
}
template<int64_t scale = 1>
std::enable_if_t<scale == 1 || scale == 2 || scale == 4 || scale == 8, Vectorized<float>>
inline gather(const float* base_addr, const Vectorized<int32_t>& vindex_) {
svint32_t vindex = svasrd_n_s32_x(ptrue, svmul_s32_x(ptrue, vindex_, svdup_n_s32(scale)), 2);
return svld1_gather_s32index_f32(ptrue, base_addr, vindex);
}
// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ MASK GATHER ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
template<int64_t scale = 1>
std::enable_if_t<scale == 1 || scale == 2 || scale == 4 || scale == 8, Vectorized<double>>
inline mask_gather(const Vectorized<double>& src, const double* base_addr,
const Vectorized<int64_t>& vindex_, const Vectorized<double>& mask_) {
svbool_t mask = svcmpeq_s64(ptrue, svreinterpret_s64_f64(mask_),
ALL_S64_TRUE_MASK);
svint64_t vindex = svasrd_n_s64_x(ptrue, svmul_s64_x(ptrue, vindex_, svdup_n_s64(scale)), 3);
return svsel_f64(mask, svld1_gather_s64index_f64(mask, base_addr, vindex), src);
}
template<int64_t scale = 1>
std::enable_if_t<scale == 1 || scale == 2 || scale == 4 || scale == 8, Vectorized<float>>
inline mask_gather(const Vectorized<float>& src, const float* base_addr,
const Vectorized<int32_t>& vindex_, const Vectorized<float>& mask_) {
svbool_t mask = svcmpeq_s32(ptrue, svreinterpret_s32_f32(mask_),
ALL_S32_TRUE_MASK);
svint32_t vindex = svasrd_n_s32_x(ptrue, svmul_s32_x(ptrue, vindex_, svdup_n_s32(scale)), 2);
return svsel_f32(mask, svld1_gather_s32index_f32(mask, base_addr, vindex), src);
}
// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ CONVERT ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
// Only works for inputs in the range: [-2^51, 2^51]
// From: https://stackoverflow.com/a/41148578
template<>
Vectorized<int64_t>
inline convert_to_int_of_same_size<double>(const Vectorized<double> &src) {
svfloat64_t x = svadd_f64_x(ptrue, src, svdup_n_f64(0x0018000000000000));
return svsub_s64_x(ptrue,
svreinterpret_s64_f64(x),
svreinterpret_s64_f64(svdup_n_f64(0x0018000000000000)));
}
template<>
Vectorized<int32_t>
inline convert_to_int_of_same_size<float>(const Vectorized<float> &src) {
return svcvt_s32_f32_x(ptrue, src);
}
// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ INTERLEAVE ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
template <>
std::pair<Vectorized<double>, Vectorized<double>>
inline interleave2<double>(const Vectorized<double>& a, const Vectorized<double>& b) {
// inputs:
// a = {a0, a1, a2, a3}
// b = {b0, b1, b2, b3}
// group cols crossing lanes:
// return {a0, b0, a1, b1}
// {a2, b2, a3, b3}
return std::make_pair(Vectorized<double>(svzip1_f64(a, b)),
Vectorized<double>(svzip2_f64(a, b)));
}
template <>
std::pair<Vectorized<float>, Vectorized<float>>
inline interleave2<float>(const Vectorized<float>& a, const Vectorized<float>& b) {
// inputs:
// a = {a0, a1, a2, a3, a4, a5, a6, a7}
// b = {b0, b1, b2, b3, b4, b5, b6, b7}
// group cols crossing lanes:
// return {a0, b0, a1, b1, a2, b2, a3, b3}
// {a4, b4, a5, b5, a6, b6, a7, b7}
return std::make_pair(Vectorized<float>(svzip1_f32(a, b)),
Vectorized<float>(svzip2_f32(a, b)));
}
// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ DEINTERLEAVE ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
template <>
std::pair<Vectorized<double>, Vectorized<double>>
inline deinterleave2<double>(const Vectorized<double>& a, const Vectorized<double>& b) {
// inputs:
// a = {a0, b0, a1, b1}
// b = {a2, b2, a3, b3}
// swap lanes:
// return {a0, a1, a2, a3}
// {b0, b1, b2, b3}
return std::make_pair(Vectorized<double>(svuzp1_f64(a, b)),
Vectorized<double>(svuzp2_f64(a, b)));
}
template <>
std::pair<Vectorized<float>, Vectorized<float>>
inline deinterleave2<float>(const Vectorized<float>& a, const Vectorized<float>& b) {
// inputs:
// a = {a0, b0, a1, b1, a2, b2, a3, b3}
// b = {a4, b4, a5, b5, a6, b6, a7, b7}
// swap lanes:
// return {a0, a1, a2, a3, a4, a5, a6, a7}
// {b0, b1, b2, b3, b4, b5, b6, b7}
return std::make_pair(Vectorized<float>(svuzp1_f32(a, b)),
Vectorized<float>(svuzp2_f32(a, b)));
}
#endif // defined(CPU_CAPABILITY_SVE)
}}}
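
Reviewer note: the `convert_to_int_of_same_size<double>` specialization above relies on a magic-constant trick. A scalar sketch of the same arithmetic follows, assuming the default round-to-nearest-even floating-point mode; `double_to_int64_magic` is a hypothetical name used only for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Scalar version of the magic-number conversion, valid for |x| <= 2^51.
inline std::int64_t double_to_int64_magic(double x) {
  assert(x >= -2251799813685248.0 && x <= 2251799813685248.0);
  // 0x0018000000000000 read as an integer is 2^52 + 2^51.
  const double magic = 6755399441055744.0;
  // In [2^52, 2^53) doubles are spaced exactly 1 apart, so the addition
  // rounds x to an integer and stores it in the low mantissa bits.
  const double shifted = x + magic;
  std::int64_t shifted_bits = 0;
  std::int64_t magic_bits = 0;
  std::memcpy(&shifted_bits, &shifted, sizeof(shifted));
  std::memcpy(&magic_bits, &magic, sizeof(magic));
  // Subtracting the bit patterns cancels the exponent and the 2^51 offset.
  return shifted_bits - magic_bits;
}
```

Under these assumptions the result is rounded to the nearest integer with ties going to even, so `2.5` maps to `2` and `3.5` maps to `4`.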

@ -0,0 +1,505 @@
#pragma once
#include <ATen/cpu/vec/intrinsics.h>
#include <ATen/cpu/vec/vec_base.h>
#include <ATen/cpu/vec/sve/sve_helper.h>
#include <cmath>
#if defined(__aarch64__) && defined(AT_BUILD_ARM_VEC256_WITH_SLEEF)
#include <sleef.h>
#define USE_SLEEF(sleef_code, non_sleef_code) sleef_code
#else
#define USE_SLEEF(sleef_code, non_sleef_code) non_sleef_code
#endif
namespace at {
namespace vec {
// Note [CPU_CAPABILITY namespace]
// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
// This header, and all of its subheaders, will be compiled with
// different architecture flags for each supported set of vector
// intrinsics. So we need to make sure they aren't inadvertently
// linked together. We do this by declaring objects in an `inline
// namespace` which changes the name mangling, but can still be
// accessed as `at::vec`.
inline namespace CPU_CAPABILITY {
#if defined(CPU_CAPABILITY_SVE)
template <> class Vectorized<double> {
private:
vls_float64_t values;
public:
using value_type = double;
using size_type = int;
static constexpr size_type size() {
return VECTOR_WIDTH / sizeof(double);
}
Vectorized() {}
Vectorized(svfloat64_t v) : values(v) {}
Vectorized(double val) {
values = svdup_n_f64(val);
}
template<typename... Args,
typename = std::enable_if_t<(sizeof...(Args) == size())>>
Vectorized(Args... vals) {
__at_align__ double buffer[size()] = { vals... };
values = svld1_f64(ptrue, buffer);
}
operator svfloat64_t() const {
return values;
}
static Vectorized<double> blendv(const Vectorized<double>& a, const Vectorized<double>& b,
const Vectorized<double>& mask_) {
svbool_t mask = svcmpeq_s64(ptrue, svreinterpret_s64_f64(mask_),
ALL_S64_TRUE_MASK);
return svsel_f64(mask, b, a);
}
template<typename step_t>
static Vectorized<double> arange(double base = 0., step_t step = static_cast<step_t>(1)) {
__at_align__ double buffer[size()];
for (int64_t i = 0; i < size(); i++) {
buffer[i] = base + i * step;
}
return svld1_f64(ptrue, buffer);
}
static Vectorized<double> set(const Vectorized<double>& a, const Vectorized<double>& b,
int64_t count = size()) {
if (count == 0) {
return a;
} else if (count < size()) {
return svsel_f64(svwhilelt_b64(0ull, count), b, a);
}
return b;
}
static Vectorized<double> loadu(const void* ptr, int64_t count = size()) {
if (count == size())
return svld1_f64(ptrue, reinterpret_cast<const double*>(ptr));
svbool_t pg = svwhilelt_b64(0ull, count);
return svld1_f64(pg, reinterpret_cast<const double*>(ptr));
}
void store(void* ptr, int64_t count = size()) const {
if (count == size()) {
svst1_f64(ptrue, reinterpret_cast<double*>(ptr), values);
} else {
svbool_t pg = svwhilelt_b64(0ull, count);
svst1_f64(pg, reinterpret_cast<double*>(ptr), values);
}
}
const double& operator[](int idx) const = delete;
double& operator[](int idx) = delete;
int64_t zero_mask() const {
// Returns an integer mask in which bit i is 1 if element i is zero and 0 otherwise.
int64_t mask = 0;
__at_align__ int64_t mask_array[size()];
svbool_t svbool_mask = svcmpeq_f64(ptrue, values, ZERO_F64);
svst1_s64(ptrue, mask_array, svsel_s64(svbool_mask,
ALL_S64_TRUE_MASK,
ALL_S64_FALSE_MASK));
for (int64_t i = 0; i < size(); ++i) {
if (mask_array[i]) mask |= (1ull << i);
}
return mask;
}
Vectorized<double> isnan() const {
// NaN check
svbool_t mask = svcmpuo_f64(ptrue, values, ZERO_F64);
return svsel_f64(mask, ALL_F64_TRUE_MASK, ALL_F64_FALSE_MASK);
}
bool has_inf_nan() const {
return svptest_any(ptrue, svcmpuo_f64(ptrue, svsub_f64_x(ptrue, values, values), ZERO_F64));
}
Vectorized<double> map(double (*f)(double)) const {
__at_align__ double tmp[size()];
store(tmp);
for (int64_t i = 0; i < size(); ++i) {
tmp[i] = f(tmp[i]);
}
return loadu(tmp);
}
Vectorized<double> abs() const {
return svabs_f64_x(ptrue, values);
}
Vectorized<double> angle() const {
const auto nan_vec = svdup_n_f64(NAN);
const auto nan_mask = svcmpuo_f64(ptrue, values, ZERO_F64);
const auto pi = svdup_n_f64(c10::pi<double>);
const auto neg_mask = svcmplt_f64(ptrue, values, ZERO_F64);
auto angle = svsel_f64(neg_mask, pi, ZERO_F64);
angle = svsel_f64(nan_mask, nan_vec, angle);
return angle;
}
Vectorized<double> real() const {
return *this;
}
Vectorized<double> imag() const {
return Vectorized<double>(0.0);
}
Vectorized<double> conj() const {
return *this;
}
Vectorized<double> acos() const {
return USE_SLEEF(Vectorized<double>(Sleef_acosdx_u10sve(values)),map(std::acos));
}
Vectorized<double> acosh() const {
return USE_SLEEF( Vectorized<double>(Sleef_acoshdx_u10sve(values)),map(std::acosh));
}
Vectorized<double> asin() const {
return USE_SLEEF(Vectorized<double>(Sleef_asindx_u10sve(values)),map(std::asin));
}
Vectorized<double> atan() const {
return USE_SLEEF(Vectorized<double>(Sleef_atandx_u10sve(values)),map(std::atan));
}
Vectorized<double> atanh() const {
return USE_SLEEF(Vectorized<double>(Sleef_atanhdx_u10sve(values)),map(std::atanh));
}
Vectorized<double> atan2(const Vectorized<double> &b) const {
USE_SLEEF({return Vectorized<double>(Sleef_atan2dx_u10sve(values, b));},
{
__at_align__ double tmp[size()];
__at_align__ double tmp_b[size()];
store(tmp);
b.store(tmp_b);
for (int64_t i = 0; i < size(); i++) {
tmp[i] = std::atan2(tmp[i], tmp_b[i]);
}
return loadu(tmp);
}
)
}
Vectorized<double> copysign(const Vectorized<double> &sign) const {
USE_SLEEF( {return Vectorized<double>(Sleef_copysigndx_sve(values, sign));},
{
__at_align__ double tmp[size()];
__at_align__ double tmp_sign[size()];
store(tmp);
sign.store(tmp_sign);
for (int64_t i = 0; i < size(); i++) {
tmp[i] = std::copysign(tmp[i], tmp_sign[i]);
}
return loadu(tmp);
}
)
}
Vectorized<double> erf() const {
return USE_SLEEF(Vectorized<double>(Sleef_erfdx_u10sve(values)),map(std::erf));
}
Vectorized<double> erfc() const {
return USE_SLEEF(Vectorized<double>(Sleef_erfcdx_u15sve(values)),map(std::erfc));
}
Vectorized<double> erfinv() const {
return map(calc_erfinv);
}
Vectorized<double> exp() const {
return USE_SLEEF(Vectorized<double>(Sleef_expdx_u10sve(values)),map(std::exp));
}
Vectorized<double> exp2() const {
return USE_SLEEF(Vectorized<double>(Sleef_exp2dx_u10sve(values)),map(std::exp2));
}
Vectorized<double> expm1() const {
return USE_SLEEF(Vectorized<double>(Sleef_expm1dx_u10sve(values)),map(std::expm1));
}
Vectorized<double> exp_u20() const {
return exp();
}
Vectorized<double> fmod(const Vectorized<double>& q) const {
USE_SLEEF({return Vectorized<double>(Sleef_fmoddx_sve(values, q));},
{
__at_align__ double tmp[size()];
__at_align__ double tmp_q[size()];
store(tmp);
q.store(tmp_q);
for (int64_t i = 0; i < size(); i++) {
tmp[i] = std::fmod(tmp[i], tmp_q[i]);
}
return loadu(tmp);
}
)
}
Vectorized<double> hypot(const Vectorized<double> &b) const {
USE_SLEEF({return Vectorized<double>(Sleef_hypotdx_u05sve(values, b));},
{
__at_align__ double tmp[size()];
__at_align__ double tmp_b[size()];
store(tmp);
b.store(tmp_b);
for (int64_t i = 0; i < size(); i++) {
tmp[i] = std::hypot(tmp[i], tmp_b[i]);
}
return loadu(tmp);
})
}
Vectorized<double> i0() const {
return map(calc_i0);
}
Vectorized<double> i0e() const {
return map(calc_i0e);
}
Vectorized<double> digamma() const {
return map(calc_digamma);
}
Vectorized<double> igamma(const Vectorized<double> &x) const {
__at_align__ double tmp[size()];
__at_align__ double tmp_x[size()];
store(tmp);
x.store(tmp_x);
for (int64_t i = 0; i < size(); i++) {
tmp[i] = calc_igamma(tmp[i], tmp_x[i]);
}
return loadu(tmp);
}
Vectorized<double> igammac(const Vectorized<double> &x) const {
__at_align__ double tmp[size()];
__at_align__ double tmp_x[size()];
store(tmp);
x.store(tmp_x);
for (int64_t i = 0; i < size(); i++) {
tmp[i] = calc_igammac(tmp[i], tmp_x[i]);
}
return loadu(tmp);
}
Vectorized<double> nextafter(const Vectorized<double> &b) const {
USE_SLEEF(
{
return Vectorized<double>(Sleef_nextafterdx_sve(values, b));
},
{
__at_align__ double tmp[size()];
__at_align__ double tmp_b[size()];
store(tmp);
b.store(tmp_b);
for (int64_t i = 0; i < size(); ++i) {
tmp[i] = std::nextafter(tmp[i], tmp_b[i]);
}
return loadu(tmp);
}
)
}
Vectorized<double> log() const {
return USE_SLEEF(Vectorized<double>(Sleef_logdx_u10sve(values)),map(std::log));
}
Vectorized<double> log2() const {
return USE_SLEEF(Vectorized<double>(Sleef_log2dx_u10sve(values)),map(std::log2));
}
Vectorized<double> log10() const {
return USE_SLEEF(Vectorized<double>(Sleef_log10dx_u10sve(values)),map(std::log10));
}
Vectorized<double> log1p() const {
return USE_SLEEF(Vectorized<double>(Sleef_log1pdx_u10sve(values)),map(std::log1p));
}
Vectorized<double> frac() const;
Vectorized<double> sin() const {
return USE_SLEEF( Vectorized<double>(Sleef_sindx_u10sve(values)),map(std::sin));
}
Vectorized<double> sinh() const {
return USE_SLEEF(Vectorized<double>(Sleef_sinhdx_u10sve(values)),map(std::sinh));
}
Vectorized<double> cos() const {
return USE_SLEEF(Vectorized<double>(Sleef_cosdx_u10sve(values)),map(std::cos));
}
Vectorized<double> cosh() const {
return USE_SLEEF( Vectorized<double>(Sleef_coshdx_u10sve(values)),map(std::cosh));
}
Vectorized<double> ceil() const {
return svrintp_f64_x(ptrue, values);
}
Vectorized<double> floor() const {
return svrintm_f64_x(ptrue, values);
}
Vectorized<double> neg() const {
return svneg_f64_x(ptrue, values);
}
Vectorized<double> round() const {
return svrinti_f64_x(ptrue, values);
}
Vectorized<double> tan() const {
return USE_SLEEF( Vectorized<double>(Sleef_tandx_u10sve(values)),map(std::tan));
}
Vectorized<double> tanh() const {
return USE_SLEEF( Vectorized<double>(Sleef_tanhdx_u10sve(values)),map(std::tanh));
}
Vectorized<double> trunc() const {
return svrintz_f64_x(ptrue, values);
}
Vectorized<double> lgamma() const {
return USE_SLEEF( Vectorized<double>(Sleef_lgammadx_u10sve(values)),map(std::lgamma));
}
Vectorized<double> sqrt() const {
return svsqrt_f64_x(ptrue, values);
}
Vectorized<double> reciprocal() const {
return svdivr_f64_x(ptrue, values, ONE_F64);
}
Vectorized<double> rsqrt() const {
return svdivr_f64_x(ptrue, svsqrt_f64_x(ptrue, values), ONE_F64);
}
Vectorized<double> pow(const Vectorized<double> &b) const {
USE_SLEEF( {return Vectorized<double>(Sleef_powdx_u10sve(values, b));},
{
__at_align__ double tmp[size()];
__at_align__ double tmp_b[size()];
store(tmp);
b.store(tmp_b);
for (int64_t i = 0; i < size(); i++) {
tmp[i] = std::pow(tmp[i], tmp_b[i]);
}
return loadu(tmp);
}
)
}
// Comparison using the _CMP_**_OQ predicate.
// `O`: get false if an operand is NaN
// `Q`: do not raise if an operand is NaN
Vectorized<double> operator==(const Vectorized<double>& other) const {
svbool_t mask = svcmpeq_f64(ptrue, values, other);
return svsel_f64(mask, ALL_F64_TRUE_MASK, ALL_F64_FALSE_MASK);
}
Vectorized<double> operator!=(const Vectorized<double>& other) const {
svbool_t mask = svcmpne_f64(ptrue, values, other);
return svsel_f64(mask, ALL_F64_TRUE_MASK, ALL_F64_FALSE_MASK);
}
Vectorized<double> operator<(const Vectorized<double>& other) const {
svbool_t mask = svcmplt_f64(ptrue, values, other);
return svsel_f64(mask, ALL_F64_TRUE_MASK, ALL_F64_FALSE_MASK);
}
Vectorized<double> operator<=(const Vectorized<double>& other) const {
svbool_t mask = svcmple_f64(ptrue, values, other);
return svsel_f64(mask, ALL_F64_TRUE_MASK, ALL_F64_FALSE_MASK);
}
Vectorized<double> operator>(const Vectorized<double>& other) const {
svbool_t mask = svcmpgt_f64(ptrue, values, other);
return svsel_f64(mask, ALL_F64_TRUE_MASK, ALL_F64_FALSE_MASK);
}
Vectorized<double> operator>=(const Vectorized<double>& other) const {
svbool_t mask = svcmpge_f64(ptrue, values, other);
return svsel_f64(mask, ALL_F64_TRUE_MASK, ALL_F64_FALSE_MASK);
}
Vectorized<double> eq(const Vectorized<double>& other) const;
Vectorized<double> ne(const Vectorized<double>& other) const;
Vectorized<double> gt(const Vectorized<double>& other) const;
Vectorized<double> ge(const Vectorized<double>& other) const;
Vectorized<double> lt(const Vectorized<double>& other) const;
Vectorized<double> le(const Vectorized<double>& other) const;
};
template <>
Vectorized<double> inline operator+(const Vectorized<double>& a, const Vectorized<double>& b) {
return svadd_f64_x(ptrue, a, b);
}
template <>
Vectorized<double> inline operator-(const Vectorized<double>& a, const Vectorized<double>& b) {
return svsub_f64_x(ptrue, a, b);
}
template <>
Vectorized<double> inline operator*(const Vectorized<double>& a, const Vectorized<double>& b) {
return svmul_f64_x(ptrue, a, b);
}
template <>
Vectorized<double> inline operator/(const Vectorized<double>& a, const Vectorized<double>& b) {
return svdiv_f64_x(ptrue, a, b);
}
// frac. Implement this here so we can use subtraction
Vectorized<double> inline Vectorized<double>::frac() const {
return *this - this->trunc();
}
// Implements the IEEE 754 201X `maximum` operation, which propagates NaN if
// either input is a NaN.
template <>
Vectorized<double> inline maximum(const Vectorized<double>& a, const Vectorized<double>& b) {
return svmax_f64_x(ptrue, a, b);
}
// Implements the IEEE 754 201X `minimum` operation, which propagates NaN if
// either input is a NaN.
template <>
Vectorized<double> inline minimum(const Vectorized<double>& a, const Vectorized<double>& b) {
return svmin_f64_x(ptrue, a, b);
}
template <>
Vectorized<double> inline clamp(const Vectorized<double>& a, const Vectorized<double>& min, const Vectorized<double>& max) {
return svmin_f64_x(ptrue, max, svmax_f64_x(ptrue, min, a));
}
template <>
Vectorized<double> inline clamp_max(const Vectorized<double>& a, const Vectorized<double>& max) {
return svmin_f64_x(ptrue, max, a);
}
template <>
Vectorized<double> inline clamp_min(const Vectorized<double>& a, const Vectorized<double>& min) {
return svmax_f64_x(ptrue, min, a);
}
template <>
Vectorized<double> inline operator&(const Vectorized<double>& a, const Vectorized<double>& b) {
return svreinterpret_f64_s64(svand_s64_x(ptrue, svreinterpret_s64_f64(a), svreinterpret_s64_f64(b)));
}
template <>
Vectorized<double> inline operator|(const Vectorized<double>& a, const Vectorized<double>& b) {
return svreinterpret_f64_s64(svorr_s64_x(ptrue, svreinterpret_s64_f64(a), svreinterpret_s64_f64(b)));
}
template <>
Vectorized<double> inline operator^(const Vectorized<double>& a, const Vectorized<double>& b) {
return svreinterpret_f64_s64(sveor_s64_x(ptrue, svreinterpret_s64_f64(a), svreinterpret_s64_f64(b)));
}
Vectorized<double> inline Vectorized<double>::eq(const Vectorized<double>& other) const {
return (*this == other) & Vectorized<double>(1.0);
}
Vectorized<double> inline Vectorized<double>::ne(const Vectorized<double>& other) const {
return (*this != other) & Vectorized<double>(1.0);
}
Vectorized<double> inline Vectorized<double>::gt(const Vectorized<double>& other) const {
return (*this > other) & Vectorized<double>(1.0);
}
Vectorized<double> inline Vectorized<double>::ge(const Vectorized<double>& other) const {
return (*this >= other) & Vectorized<double>(1.0);
}
Vectorized<double> inline Vectorized<double>::lt(const Vectorized<double>& other) const {
return (*this < other) & Vectorized<double>(1.0);
}
Vectorized<double> inline Vectorized<double>::le(const Vectorized<double>& other) const {
return (*this <= other) & Vectorized<double>(1.0);
}
template <>
inline void convert(const double* src, double* dst, int64_t n) {
const int64_t fraction = n % Vectorized<double>::size();
#pragma unroll
for (int64_t i = 0; i < n - fraction; i += Vectorized<double>::size()) {
svst1_f64(ptrue, dst + i, svldnt1_f64(ptrue, src + i));
}
#pragma unroll
for (int64_t i = n - fraction; i < n; i += Vectorized<double>::size()) {
svbool_t pg = svwhilelt_b64(i, n);
svst1_f64(pg, dst + i, svldnt1_f64(pg, src + i));
}
}
template <>
Vectorized<double> inline fmadd(const Vectorized<double>& a, const Vectorized<double>& b, const Vectorized<double>& c) {
return svmad_f64_x(ptrue, a, b, c);
}
#endif // defined(CPU_CAPABILITY_SVE)
}}}
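
Reviewer note: the comparison operators above return an all-ones or all-zeros bit pattern per lane, and `eq()`, `ne()`, and the related methods turn that pattern into `1.0` or `0.0` by AND-ing it with `Vectorized<double>(1.0)`. A one-lane scalar sketch of that step follows; `bits_of`, `from_bits`, `compare_mask_eq`, and `eq_as_number` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Reinterpret a double as its raw 64-bit pattern and back.
inline std::uint64_t bits_of(double v) {
  std::uint64_t b = 0;
  std::memcpy(&b, &v, sizeof(v));
  return b;
}

inline double from_bits(std::uint64_t b) {
  double v = 0.0;
  std::memcpy(&v, &b, sizeof(b));
  return v;
}

// One lane of operator==: all bits set when equal, all bits clear otherwise.
inline std::uint64_t compare_mask_eq(double a, double b) {
  return a == b ? ~std::uint64_t{0} : std::uint64_t{0};
}

// One lane of eq(): AND the mask with the bit pattern of 1.0.
inline double eq_as_number(double a, double b) {
  const double result = from_bits(compare_mask_eq(a, b) & bits_of(1.0));
  assert(result == 0.0 || result == 1.0);
  return result;
}
```

An all-ones mask leaves the bits of `1.0` untouched and an all-zeros mask clears them to `+0.0`, which is why a single bitwise AND is enough to produce a numeric boolean.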

@ -0,0 +1,570 @@
#pragma once
#include <ATen/cpu/vec/intrinsics.h>
#include <ATen/cpu/vec/vec_base.h>
#include <ATen/cpu/vec/sve/sve_helper.h>
#include <cmath>
#if defined(__aarch64__) && defined(AT_BUILD_ARM_VEC256_WITH_SLEEF)
#include <sleef.h>
#define USE_SLEEF(sleef_code, non_sleef_code) sleef_code
#else
#define USE_SLEEF(sleef_code, non_sleef_code) non_sleef_code
#endif
namespace at {
namespace vec {
// Note [CPU_CAPABILITY namespace]
// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
// This header, and all of its subheaders, will be compiled with
// different architecture flags for each supported set of vector
// intrinsics. So we need to make sure they aren't inadvertently
// linked together. We do this by declaring objects in an `inline
// namespace` which changes the name mangling, but can still be
// accessed as `at::vec`.
inline namespace CPU_CAPABILITY {
#if defined(CPU_CAPABILITY_SVE)
template <> class Vectorized<float> {
private:
vls_float32_t values;
public:
using value_type = float;
using size_type = int;
static constexpr size_type size() {
return VECTOR_WIDTH / sizeof(float);
}
Vectorized() {}
Vectorized(svfloat32_t v) : values(v) {}
Vectorized(float val) {
values = svdup_n_f32(val);
}
template<typename... Args,
typename = std::enable_if_t<(sizeof...(Args) == size())>>
Vectorized(Args... vals) {
__at_align__ float buffer[size()] = { vals... };
values = svld1_f32(ptrue, buffer);
}
operator svfloat32_t() const {
return values;
}
static Vectorized<float> blendv(const Vectorized<float>& a, const Vectorized<float>& b,
const Vectorized<float>& mask_) {
svbool_t mask = svcmpeq_s32(ptrue, svreinterpret_s32_f32(mask_),
ALL_S32_TRUE_MASK);
return svsel_f32(mask, b, a);
}
template<typename step_t>
static Vectorized<float> arange(float base = 0.f, step_t step = static_cast<step_t>(1)) {
__at_align__ float buffer[size()];
for (int64_t i = 0; i < size(); i++) {
buffer[i] = base + i * step;
}
return svld1_f32(ptrue, buffer);
}
static Vectorized<float> set(const Vectorized<float>& a, const Vectorized<float>& b,
int64_t count = size()) {
if (count == 0) {
return a;
} else if (count < size()) {
return svsel_f32(svwhilelt_b32(0ull, count), b, a);
}
return b;
}
static Vectorized<float> loadu(const void* ptr, int64_t count = size()) {
if (count == size())
return svld1_f32(ptrue, reinterpret_cast<const float*>(ptr));
svbool_t pg = svwhilelt_b32(0ull, count);
return svld1_f32(pg, reinterpret_cast<const float*>(ptr));
}
void store(void* ptr, int64_t count = size()) const {
if (count == size()) {
svst1_f32(ptrue, reinterpret_cast<float*>(ptr), values);
} else {
svbool_t pg = svwhilelt_b32(0ull, count);
svst1_f32(pg, reinterpret_cast<float*>(ptr), values);
}
}
const float& operator[](int idx) const = delete;
float& operator[](int idx) = delete;
int64_t zero_mask() const {
// Returns an integer bitmask: bit i is set if element i equals zero and clear otherwise.
int64_t mask = 0;
__at_align__ int32_t mask_array[size()];
svbool_t svbool_mask = svcmpeq_f32(ptrue, values, ZERO_F32);
svst1_s32(ptrue, mask_array, svsel_s32(svbool_mask,
ALL_S32_TRUE_MASK,
ALL_S32_FALSE_MASK));
for (int64_t i = 0; i < size(); ++i) {
if (mask_array[i]) mask |= (1ull << i);
}
return mask;
}
Vectorized<float> isnan() const {
// NaN check: an unordered compare is true only in lanes where an operand is NaN.
svbool_t mask = svcmpuo_f32(ptrue, values, ZERO_F32);
return svsel_f32(mask, ALL_F32_TRUE_MASK, ALL_F32_FALSE_MASK);
}
bool has_inf_nan() const {
// x - x is NaN when x is Inf or NaN, so one unordered compare flags both cases.
return svptest_any(ptrue, svcmpuo_f32(ptrue, svsub_f32_x(ptrue, values, values), ZERO_F32));
}
Vectorized<float> map(float (*f)(float)) const {
__at_align__ float tmp[size()];
store(tmp);
for (int64_t i = 0; i < size(); ++i) {
tmp[i] = f(tmp[i]);
}
return loadu(tmp);
}
Vectorized<float> abs() const {
return svabs_f32_x(ptrue, values);
}
Vectorized<float> angle() const {
const auto nan_vec = svdup_n_f32(NAN);
const auto nan_mask = svcmpuo_f32(ptrue, values, ZERO_F32);
const auto pi = svdup_n_f32(c10::pi<float>);
const auto neg_mask = svcmplt_f32(ptrue, values, ZERO_F32);
auto angle = svsel_f32(neg_mask, pi, ZERO_F32);
angle = svsel_f32(nan_mask, nan_vec, angle);
return angle;
}
Vectorized<float> real() const {
return values;
}
Vectorized<float> imag() const {
return Vectorized<float>(0.f);
}
Vectorized<float> conj() const {
return values;
}
Vectorized<float> acos() const {
return USE_SLEEF(Vectorized<float>(Sleef_acosfx_u10sve(values)),map(std::acos));
}
Vectorized<float> acosh() const {
return USE_SLEEF(Vectorized<float>(Sleef_acoshfx_u10sve(values)),map(std::acosh));
}
Vectorized<float> asin() const {
return USE_SLEEF(Vectorized<float>(Sleef_asinfx_u10sve(values)),map(std::asin));
}
Vectorized<float> atan() const {
return USE_SLEEF(Vectorized<float>(Sleef_atanfx_u10sve(values)),map(std::atan));
}
Vectorized<float> atanh() const {
return USE_SLEEF(Vectorized<float>(Sleef_atanhfx_u10sve(values)),map(std::atanh));
}
Vectorized<float> atan2(const Vectorized<float> &b) const {
USE_SLEEF({return Vectorized<float>(Sleef_atan2fx_u10sve(values, b));},
{
__at_align__ float tmp[size()];
__at_align__ float tmp_b[size()];
store(tmp);
b.store(tmp_b);
for (int64_t i = 0; i < size(); i++){
tmp[i] = std::atan2(tmp[i], tmp_b[i]);
}
return loadu(tmp);
}
)
}
Vectorized<float> copysign(const Vectorized<float> &sign) const {
USE_SLEEF({return Vectorized<float>(Sleef_copysignfx_sve(values, sign));},
{
__at_align__ float tmp[size()];
__at_align__ float tmp_sign[size()];
store(tmp);
sign.store(tmp_sign);
for (int64_t i = 0; i < size(); ++i) {
tmp[i] = std::copysign(tmp[i], tmp_sign[i]);
}
return loadu(tmp);
})
}
Vectorized<float> erf() const {
return USE_SLEEF(Vectorized<float>(Sleef_erffx_u10sve(values)),map(std::erf));
}
Vectorized<float> erfc() const {
return USE_SLEEF(Vectorized<float>(Sleef_erfcfx_u15sve(values)),map(std::erfc));
}
Vectorized<float> erfinv() const {
return map(calc_erfinv);
}
Vectorized<float> exp() const {
return USE_SLEEF(Vectorized<float>(Sleef_expfx_u10sve(values)),map(std::exp));
}
Vectorized<float> exp2() const {
return USE_SLEEF(Vectorized<float>(Sleef_exp2fx_u10sve(values)),map(std::exp2));
}
Vectorized<float> expm1() const {
return USE_SLEEF(Vectorized<float>(Sleef_expm1fx_u10sve(values)),map(std::expm1));
}
Vectorized<float> exp_u20() const {
return exp();
}
Vectorized<float> fmod(const Vectorized<float>& q) const {
USE_SLEEF({return Vectorized<float>(Sleef_fmodfx_sve(values, q));},
{
__at_align__ float tmp[size()];
__at_align__ float tmp_q[size()];
store(tmp);
q.store(tmp_q);
for (int64_t i = 0; i < size(); ++i) {
tmp[i] = std::fmod(tmp[i], tmp_q[i]);
}
return loadu(tmp);
})
}
Vectorized<float> hypot(const Vectorized<float> &b) const {
USE_SLEEF( {return Vectorized<float>(Sleef_hypotfx_u05sve(values, b));},
{
__at_align__ float tmp[size()];
__at_align__ float tmp_b[size()];
store(tmp);
b.store(tmp_b);
for (int64_t i = 0; i < size(); i++) {
tmp[i] = std::hypot(tmp[i], tmp_b[i]);
}
return loadu(tmp);
}
)
}
Vectorized<float> i0() const {
return map(calc_i0);
}
Vectorized<float> i0e() const {
return map(calc_i0e);
}
Vectorized<float> digamma() const {
return map(calc_digamma);
}
Vectorized<float> igamma(const Vectorized<float> &x) const {
__at_align__ float tmp[size()];
__at_align__ float tmp_x[size()];
store(tmp);
x.store(tmp_x);
for (int64_t i = 0; i < size(); i++) {
tmp[i] = calc_igamma(tmp[i], tmp_x[i]);
}
return loadu(tmp);
}
Vectorized<float> igammac(const Vectorized<float> &x) const {
__at_align__ float tmp[size()];
__at_align__ float tmp_x[size()];
store(tmp);
x.store(tmp_x);
for (int64_t i = 0; i < size(); i++) {
tmp[i] = calc_igammac(tmp[i], tmp_x[i]);
}
return loadu(tmp);
}
Vectorized<float> nextafter(const Vectorized<float> &b) const {
USE_SLEEF(
{
return Vectorized<float>(Sleef_nextafterfx_sve(values, b));
},
{
__at_align__ float tmp[size()];
__at_align__ float tmp_b[size()];
store(tmp);
b.store(tmp_b);
for (int64_t i = 0; i < size(); ++i) {
tmp[i] = std::nextafter(tmp[i], tmp_b[i]);
}
return loadu(tmp);
}
)
}
Vectorized<float> log() const {
return USE_SLEEF(Vectorized<float>(Sleef_logfx_u10sve(values)),map(std::log));
}
Vectorized<float> log2() const {
return USE_SLEEF(Vectorized<float>(Sleef_log2fx_u10sve(values)),map(std::log2));
}
Vectorized<float> log10() const {
return USE_SLEEF(Vectorized<float>(Sleef_log10fx_u10sve(values)),map(std::log10));
}
Vectorized<float> log1p() const {
return USE_SLEEF(Vectorized<float>(Sleef_log1pfx_u10sve(values)),map(std::log1p));
}
Vectorized<float> frac() const;
Vectorized<float> sin() const {
return USE_SLEEF(Vectorized<float>(Sleef_sinfx_u10sve(values)),map(std::sin));
}
Vectorized<float> sinh() const {
return USE_SLEEF(Vectorized<float>(Sleef_sinhfx_u10sve(values)),map(std::sinh));
}
Vectorized<float> cos() const {
return USE_SLEEF(Vectorized<float>(Sleef_cosfx_u10sve(values)),map(std::cos));
}
Vectorized<float> cosh() const {
return USE_SLEEF(Vectorized<float>(Sleef_coshfx_u10sve(values)),map(std::cosh));
}
Vectorized<float> ceil() const {
return svrintp_f32_x(ptrue, values);
}
Vectorized<float> floor() const {
return svrintm_f32_x(ptrue, values);
}
Vectorized<float> neg() const {
return svneg_f32_x(ptrue, values);
}
Vectorized<float> round() const {
// svrinti rounds to an integral value using the current rounding mode.
return svrinti_f32_x(ptrue, values);
}
Vectorized<float> tan() const {
return USE_SLEEF(Vectorized<float>(Sleef_tanfx_u10sve(values)),map(std::tan));
}
Vectorized<float> tanh() const {
return USE_SLEEF(Vectorized<float>(Sleef_tanhfx_u10sve(values)),map(std::tanh));
}
Vectorized<float> trunc() const {
return svrintz_f32_x(ptrue, values);
}
Vectorized<float> lgamma() const {
return USE_SLEEF(Vectorized<float>(Sleef_lgammafx_u10sve(values)),map(std::lgamma));
}
Vectorized<float> sqrt() const {
return svsqrt_f32_x(ptrue, values);
}
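Vectorized<float> reciprocal() const {
// svdivr is the reversed divide: this computes ONE_F32 / values.
return svdivr_f32_x(ptrue, values, ONE_F32);
}
Vectorized<float> rsqrt() const {
// Reversed divide again: this computes ONE_F32 / sqrt(values).
return svdivr_f32_x(ptrue, svsqrt_f32_x(ptrue, values), ONE_F32);
}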
Vectorized<float> pow(const Vectorized<float> &b) const {
USE_SLEEF( {return Vectorized<float>(Sleef_powfx_u10sve(values, b));},
{
__at_align__ float tmp[size()];
__at_align__ float tmp_b[size()];
store(tmp);
b.store(tmp_b);
for (int64_t i = 0; i < size(); i++) {
tmp[i] = std::pow(tmp[i], tmp_b[i]);
}
return loadu(tmp);
}
)
}
// Comparisons. Each operator returns ALL_F32_TRUE_MASK in lanes where the
// comparison holds and ALL_F32_FALSE_MASK elsewhere.
// The ordered comparisons (==, <, <=, >, >=) are false in any lane where an
// operand is NaN.
Vectorized<float> operator==(const Vectorized<float>& other) const {
svbool_t mask = svcmpeq_f32(ptrue, values, other);
return svsel_f32(mask, ALL_F32_TRUE_MASK, ALL_F32_FALSE_MASK);
}
Vectorized<float> operator!=(const Vectorized<float>& other) const {
svbool_t mask = svcmpne_f32(ptrue, values, other);
return svsel_f32(mask, ALL_F32_TRUE_MASK, ALL_F32_FALSE_MASK);
}
Vectorized<float> operator<(const Vectorized<float>& other) const {
svbool_t mask = svcmplt_f32(ptrue, values, other);
return svsel_f32(mask, ALL_F32_TRUE_MASK, ALL_F32_FALSE_MASK);
}
Vectorized<float> operator<=(const Vectorized<float>& other) const {
svbool_t mask = svcmple_f32(ptrue, values, other);
return svsel_f32(mask, ALL_F32_TRUE_MASK, ALL_F32_FALSE_MASK);
}
Vectorized<float> operator>(const Vectorized<float>& other) const {
svbool_t mask = svcmpgt_f32(ptrue, values, other);
return svsel_f32(mask, ALL_F32_TRUE_MASK, ALL_F32_FALSE_MASK);
}
Vectorized<float> operator>=(const Vectorized<float>& other) const {
svbool_t mask = svcmpge_f32(ptrue, values, other);
return svsel_f32(mask, ALL_F32_TRUE_MASK, ALL_F32_FALSE_MASK);
}
Vectorized<float> eq(const Vectorized<float>& other) const;
Vectorized<float> ne(const Vectorized<float>& other) const;
Vectorized<float> gt(const Vectorized<float>& other) const;
Vectorized<float> ge(const Vectorized<float>& other) const;
Vectorized<float> lt(const Vectorized<float>& other) const;
Vectorized<float> le(const Vectorized<float>& other) const;
};
template <>
Vectorized<float> inline operator+(const Vectorized<float>& a, const Vectorized<float>& b) {
return svadd_f32_x(ptrue, a, b);
}
template <>
Vectorized<float> inline operator-(const Vectorized<float>& a, const Vectorized<float>& b) {
return svsub_f32_x(ptrue, a, b);
}
template <>
Vectorized<float> inline operator*(const Vectorized<float>& a, const Vectorized<float>& b) {
return svmul_f32_x(ptrue, a, b);
}
template <>
Vectorized<float> inline operator/(const Vectorized<float>& a, const Vectorized<float>& b) {
return svdiv_f32_x(ptrue, a, b);
}
// frac() is defined here, after operator-, so that it can use subtraction.
Vectorized<float> inline Vectorized<float>::frac() const {
return *this - this->trunc();
}
// Implements the IEEE 754-2019 `maximum` operation, which propagates NaN if
// either input is a NaN.
template <>
Vectorized<float> inline maximum(const Vectorized<float>& a, const Vectorized<float>& b) {
return svmax_f32_x(ptrue, a, b);
}
// Implements the IEEE 754-2019 `minimum` operation, which propagates NaN if
// either input is a NaN.
template <>
Vectorized<float> inline minimum(const Vectorized<float>& a, const Vectorized<float>& b) {
return svmin_f32_x(ptrue, a, b);
}
template <>
Vectorized<float> inline clamp(const Vectorized<float>& a, const Vectorized<float>& min, const Vectorized<float>& max) {
return svmin_f32_x(ptrue, max, svmax_f32_x(ptrue, min, a));
}
template <>
Vectorized<float> inline clamp_max(const Vectorized<float>& a, const Vectorized<float>& max) {
return svmin_f32_x(ptrue, max, a);
}
template <>
Vectorized<float> inline clamp_min(const Vectorized<float>& a, const Vectorized<float>& min) {
return svmax_f32_x(ptrue, min, a);
}
template <>
Vectorized<float> inline operator&(const Vectorized<float>& a, const Vectorized<float>& b) {
return svreinterpret_f32_s32(svand_s32_x(ptrue, svreinterpret_s32_f32(a), svreinterpret_s32_f32(b)));
}
template <>
Vectorized<float> inline operator|(const Vectorized<float>& a, const Vectorized<float>& b) {
return svreinterpret_f32_s32(svorr_s32_x(ptrue, svreinterpret_s32_f32(a), svreinterpret_s32_f32(b)));
}
template <>
Vectorized<float> inline operator^(const Vectorized<float>& a, const Vectorized<float>& b) {
return svreinterpret_f32_s32(sveor_s32_x(ptrue, svreinterpret_s32_f32(a), svreinterpret_s32_f32(b)));
}
Vectorized<float> inline Vectorized<float>::eq(const Vectorized<float>& other) const {
return (*this == other) & Vectorized<float>(1.0f);
}
Vectorized<float> inline Vectorized<float>::ne(const Vectorized<float>& other) const {
return (*this != other) & Vectorized<float>(1.0f);
}
Vectorized<float> inline Vectorized<float>::gt(const Vectorized<float>& other) const {
return (*this > other) & Vectorized<float>(1.0f);
}
Vectorized<float> inline Vectorized<float>::ge(const Vectorized<float>& other) const {
return (*this >= other) & Vectorized<float>(1.0f);
}
Vectorized<float> inline Vectorized<float>::lt(const Vectorized<float>& other) const {
return (*this < other) & Vectorized<float>(1.0f);
}
Vectorized<float> inline Vectorized<float>::le(const Vectorized<float>& other) const {
return (*this <= other) & Vectorized<float>(1.0f);
}
template <>
inline void convert(const float* src, float* dst, int64_t n) {
const int64_t fraction = n % Vectorized<float>::size();
#pragma unroll
for (int64_t i = 0; i < n - fraction; i += Vectorized<float>::size()) {
svst1_f32(ptrue, dst + i, svldnt1_f32(ptrue, src + i));
}
#pragma unroll
for (int64_t i = n - fraction; i < n; i += Vectorized<float>::size()) {
svbool_t pg = svwhilelt_b32(i, n);
svst1_f32(pg, dst + i, svldnt1_f32(pg, src + i));
}
}
template <>
inline void convert(const float *src, at::Half *dst, int64_t n) {
const int64_t fraction = n % Vectorized<float>::size();
svbool_t pg_16 = svwhilelt_b16(0ull, Vectorized<float>::size());
svbool_t pg_32 = svwhilelt_b32(0ull, Vectorized<float>::size());
#pragma unroll
for (int64_t i = 0; i < n - fraction; i += Vectorized<float>::size()) {
svfloat16_t src_vec = svuzp1_f16(svcvt_f16_f32_x(ptrue, svldnt1_f32(pg_32, src + i)),
ZERO_F16);
svst1_f16(pg_16, reinterpret_cast<float16_t*>(dst) + i, src_vec);
}
#pragma unroll
for (int64_t i = n - fraction; i < n; i += Vectorized<float>::size()) {
pg_16 = svwhilelt_b16(i, n);
pg_32 = svwhilelt_b32(i, n);
svfloat16_t src_vec = svuzp1_f16(svcvt_f16_f32_x(ptrue, svldnt1_f32(pg_32, src + i)),
ZERO_F16);
svst1_f16(pg_16, reinterpret_cast<float16_t*>(dst) + i, src_vec);
}
}
template <>
inline void convert(const at::Half *src, float *dst, int64_t n) {
const int64_t fraction = n % Vectorized<float>::size();
svbool_t pg_16 = svwhilelt_b16(0ull, Vectorized<float>::size());
svbool_t pg_32 = svwhilelt_b32(0ull, Vectorized<float>::size());
#pragma unroll
for (int64_t i = 0; i < n - fraction; i += Vectorized<float>::size()) {
svfloat16_t src_vec = svzip1_f16(svldnt1_f16(pg_16, reinterpret_cast<const float16_t*>(src) + i),
ZERO_F16);
svst1_f32(pg_32, dst + i, svcvt_f32_f16_x(ptrue, src_vec));
}
#pragma unroll
for (int64_t i = n - fraction; i < n; i += Vectorized<float>::size()) {
pg_16 = svwhilelt_b16(i, n);
pg_32 = svwhilelt_b32(i, n);
svfloat16_t src_vec = svzip1_f16(svldnt1_f16(pg_16, reinterpret_cast<const float16_t*>(src) + i),
ZERO_F16);
svst1_f32(pg_32, dst + i, svcvt_f32_f16_x(ptrue, src_vec));
}
}
template <>
inline void convert(const bool *src, float *dst, int64_t n) {
const int64_t fraction = n % Vectorized<float>::size();
svbool_t pg_8 = svwhilelt_b8(0ull, Vectorized<float>::size());
svbool_t pg_32 = svwhilelt_b32(0ull, Vectorized<float>::size());
#pragma unroll
for (int64_t i = 0; i < n - fraction; i += Vectorized<float>::size()) {
svuint8_t src_vec_u8 = svldnt1_u8(pg_8, reinterpret_cast<const uint8_t*>(src) + i);
svuint32_t src_vec_u32 = svunpklo_u32(svunpklo_u16(src_vec_u8));
svbool_t mask = svcmpne_u32(pg_32, src_vec_u32, ZERO_U32);
svst1_f32(pg_32, dst + i, svsel_f32(mask, ONE_F32, ZERO_F32));
}
#pragma unroll
for (int64_t i = n - fraction; i < n; i += Vectorized<float>::size()) {
pg_8 = svwhilelt_b8(i, n);
pg_32 = svwhilelt_b32(i, n);
svuint8_t src_vec_u8 = svldnt1_u8(pg_8, reinterpret_cast<const uint8_t*>(src) + i);
svuint32_t src_vec_u32 = svunpklo_u32(svunpklo_u16(src_vec_u8));
svbool_t mask = svcmpne_u32(pg_32, src_vec_u32, ZERO_U32);
svst1_f32(pg_32, dst + i, svsel_f32(mask, ONE_F32, ZERO_F32));
}
}
template <>
Vectorized<float> inline fmadd(const Vectorized<float>& a, const Vectorized<float>& b, const Vectorized<float>& c) {
// svmad computes a * b + c as a fused multiply-add.
return svmad_f32_x(ptrue, a, b, c);
}
#endif // defined(CPU_CAPABILITY_SVE)
}}}
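
The `convert` specializations above share one loop shape: a main loop over full vectors, then a tail loop whose predicate comes from `svwhilelt`. The block below is a minimal portable sketch of that shape in scalar C++, not the SVE code itself. The names `kLanes`, `whilelt`, `masked_copy`, and `convert_model` are hypothetical, and the lane count of 8 is an assumption standing in for `Vectorized<float>::size()`.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Assumed lane count; stands in for Vectorized<float>::size().
constexpr int64_t kLanes = 8;

// Scalar model of svwhilelt_b32(i, n): lane k is active while i + k < n.
inline std::array<bool, kLanes> whilelt(int64_t i, int64_t n) {
  std::array<bool, kLanes> pg{};
  for (int64_t k = 0; k < kLanes; ++k) {
    pg[k] = (i + k) < n;
  }
  return pg;
}

// Scalar model of a predicated load followed by a predicated store:
// inactive lanes are neither read nor written.
inline void masked_copy(const std::array<bool, kLanes>& pg,
                        const float* src, float* dst) {
  for (int64_t k = 0; k < kLanes; ++k) {
    if (pg[k]) {
      dst[k] = src[k];
    }
  }
}

// Same structure as the float -> float convert: full vectors first,
// then one partially active vector for the remaining n % kLanes elements.
inline void convert_model(const float* src, float* dst, int64_t n) {
  assert(n >= 0);
  const int64_t fraction = n % kLanes;
  std::array<bool, kLanes> all_true;
  all_true.fill(true);
  for (int64_t i = 0; i < n - fraction; i += kLanes) {
    masked_copy(all_true, src + i, dst + i);
  }
  for (int64_t i = n - fraction; i < n; i += kLanes) {
    masked_copy(whilelt(i, n), src + i, dst + i);
  }
}
```

Because the tail predicate disables every lane at or beyond `n`, the sketch never touches memory past the end of either buffer, which is the property the predicated SVE loads and stores provide in the real code.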


@@ -0,0 +1,410 @@
#pragma once
#include <ATen/cpu/vec/intrinsics.h>
#include <ATen/cpu/vec/vec_base.h>
#include <ATen/cpu/vec/sve/sve_helper.h>
namespace at {
namespace vec {
// Note [CPU_CAPABILITY namespace]
// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
// This header, and all of its subheaders, will be compiled with
// different architecture flags for each supported set of vector
// intrinsics. So we need to make sure they aren't inadvertently
// linked together. We do this by declaring objects in an `inline
// namespace` which changes the name mangling, but can still be
// accessed as `at::vec`.
inline namespace CPU_CAPABILITY {
#if defined(CPU_CAPABILITY_SVE)
#define VEC_INT_SVE_TEMPLATE(vl, bit) \
template <> class Vectorized<int##bit##_t> { \
private: \
vls_int##bit##_t values; \
public: \
using value_type = int##bit##_t; \
using size_type = int; \
static constexpr size_type size() { \
return vl; \
} \
Vectorized() {} \
Vectorized(svint##bit##_t v) : values(v) {} \
Vectorized(int##bit##_t val) { \
values = svdup_n_s##bit(val); \
} \
template<typename... Args, \
typename = std::enable_if_t<(sizeof...(Args) == size())>> \
Vectorized(Args... vals) { \
__at_align__ int##bit##_t buffer[size()] = { vals... }; \
values = svld1_s##bit(ptrue, buffer); \
} \
operator svint##bit##_t() const { \
return values; \
} \
static Vectorized<int##bit##_t> blendv(const Vectorized<int##bit##_t>& a, \
const Vectorized<int##bit##_t>& b, \
const Vectorized<int##bit##_t>& mask_) { \
svbool_t mask = svcmpeq_s##bit(ptrue, mask_, ALL_S##bit##_TRUE_MASK); \
return svsel_s##bit(mask, b, a); \
} \
/* step sometimes requires a higher precision type (e.g., T=int, step_t=double) */ \
template <typename step_t> \
static Vectorized<int##bit##_t> arange(int##bit##_t base = 0, step_t step = static_cast<step_t>(1)) { \
__at_align__ int##bit##_t buffer[size()]; \
for (int64_t i = 0; i < size(); i++) { \
buffer[i] = base + i * step; \
} \
return svld1_s##bit(ptrue, buffer); \
} \
static Vectorized<int##bit##_t> set(const Vectorized<int##bit##_t>& a, \
const Vectorized<int##bit##_t>& b, \
int##bit##_t count = size()) { \
if (count == 0) { \
return a; \
} else if (count < size()) { \
return svsel_s##bit(svwhilelt_b##bit(0ull, count), b, a); \
} \
return b; \
} \
static Vectorized<int##bit##_t> loadu(const void* ptr, int64_t count = size()) { \
if (count == size()) \
return svld1_s##bit(ptrue, reinterpret_cast<const int##bit##_t*>(ptr)); \
svbool_t pg = svwhilelt_b##bit(0ull, count); \
return svld1_s##bit(pg, reinterpret_cast<const int##bit##_t*>(ptr)); \
} \
void store(void* ptr, int64_t count = size()) const { \
if (count == size()) { \
svst1_s##bit(ptrue, reinterpret_cast<int##bit##_t*>(ptr), values); \
} else { \
svbool_t pg = svwhilelt_b##bit(0ull, count); \
svst1_s##bit(pg, reinterpret_cast<int##bit##_t*>(ptr), values); \
} \
} \
const int##bit##_t& operator[](int idx) const = delete; \
int##bit##_t& operator[](int idx) = delete; \
Vectorized<int##bit##_t> abs() const { \
return svabs_s##bit##_x(ptrue, values); \
} \
Vectorized<int##bit##_t> real() const { \
return values; \
} \
Vectorized<int##bit##_t> imag() const { \
return svdup_n_s##bit(0); \
} \
Vectorized<int##bit##_t> conj() const { \
return values; \
} \
Vectorized<int##bit##_t> frac() const; \
Vectorized<int##bit##_t> neg() const { \
return svneg_s##bit##_x(ptrue, values); \
} \
Vectorized<int##bit##_t> operator==(const Vectorized<int##bit##_t>& other) const { \
svbool_t mask = svcmpeq_s##bit(ptrue, values, other); \
return svsel_s##bit(mask, ALL_S##bit##_TRUE_MASK, ALL_S##bit##_FALSE_MASK); \
} \
Vectorized<int##bit##_t> operator!=(const Vectorized<int##bit##_t>& other) const { \
svbool_t mask = svcmpne_s##bit(ptrue, values, other); \
return svsel_s##bit(mask, ALL_S##bit##_TRUE_MASK, ALL_S##bit##_FALSE_MASK); \
} \
Vectorized<int##bit##_t> operator<(const Vectorized<int##bit##_t>& other) const { \
svbool_t mask = svcmplt_s##bit(ptrue, values, other); \
return svsel_s##bit(mask, ALL_S##bit##_TRUE_MASK, ALL_S##bit##_FALSE_MASK); \
} \
Vectorized<int##bit##_t> operator<=(const Vectorized<int##bit##_t>& other) const { \
svbool_t mask = svcmple_s##bit(ptrue, values, other); \
return svsel_s##bit(mask, ALL_S##bit##_TRUE_MASK, ALL_S##bit##_FALSE_MASK); \
} \
Vectorized<int##bit##_t> operator>(const Vectorized<int##bit##_t>& other) const { \
svbool_t mask = svcmpgt_s##bit(ptrue, values, other); \
return svsel_s##bit(mask, ALL_S##bit##_TRUE_MASK, ALL_S##bit##_FALSE_MASK); \
} \
Vectorized<int##bit##_t> operator>=(const Vectorized<int##bit##_t>& other) const { \
svbool_t mask = svcmpge_s##bit(ptrue, values, other); \
return svsel_s##bit(mask, ALL_S##bit##_TRUE_MASK, ALL_S##bit##_FALSE_MASK); \
} \
Vectorized<int##bit##_t> eq(const Vectorized<int##bit##_t>& other) const; \
Vectorized<int##bit##_t> ne(const Vectorized<int##bit##_t>& other) const; \
Vectorized<int##bit##_t> gt(const Vectorized<int##bit##_t>& other) const; \
Vectorized<int##bit##_t> ge(const Vectorized<int##bit##_t>& other) const; \
Vectorized<int##bit##_t> lt(const Vectorized<int##bit##_t>& other) const; \
Vectorized<int##bit##_t> le(const Vectorized<int##bit##_t>& other) const; \
}; \
template <> \
Vectorized<int##bit##_t> inline operator+(const Vectorized<int##bit##_t>& a, \
const Vectorized<int##bit##_t>& b) { \
return svadd_s##bit##_x(ptrue, a, b); \
} \
template <> \
Vectorized<int##bit##_t> inline operator-(const Vectorized<int##bit##_t>& a, \
const Vectorized<int##bit##_t>& b) { \
return svsub_s##bit##_x(ptrue, a, b); \
} \
template <> \
Vectorized<int##bit##_t> inline operator*(const Vectorized<int##bit##_t>& a, \
const Vectorized<int##bit##_t>& b) { \
return svmul_s##bit##_x(ptrue, a, b); \
} \
template <> \
Vectorized<int##bit##_t> inline maximum(const Vectorized<int##bit##_t>& a, \
const Vectorized<int##bit##_t>& b) { \
return svmax_s##bit##_x(ptrue, a, b); \
} \
template <> \
Vectorized<int##bit##_t> inline minimum(const Vectorized<int##bit##_t>& a, \
const Vectorized<int##bit##_t>& b) { \
return svmin_s##bit##_x(ptrue, a, b); \
} \
template <> \
Vectorized<int##bit##_t> inline clamp(const Vectorized<int##bit##_t>& a, \
const Vectorized<int##bit##_t>& min, \
const Vectorized<int##bit##_t>& max) { \
return svmin_s##bit##_x(ptrue, max, svmax_s##bit##_x(ptrue, min, a)); \
} \
template <> \
Vectorized<int##bit##_t> inline clamp_max(const Vectorized<int##bit##_t>& a, \
const Vectorized<int##bit##_t>& max) { \
return svmin_s##bit##_x(ptrue, max, a); \
} \
template <> \
Vectorized<int##bit##_t> inline clamp_min(const Vectorized<int##bit##_t>& a, \
const Vectorized<int##bit##_t>& min) { \
return svmax_s##bit##_x(ptrue, min, a); \
} \
template <> \
Vectorized<int##bit##_t> inline operator&(const Vectorized<int##bit##_t>& a, \
const Vectorized<int##bit##_t>& b) { \
return svand_s##bit##_x(ptrue, a, b); \
} \
template <> \
Vectorized<int##bit##_t> inline operator|(const Vectorized<int##bit##_t>& a, \
const Vectorized<int##bit##_t>& b) { \
return svorr_s##bit##_x(ptrue, a, b); \
} \
template <> \
Vectorized<int##bit##_t> inline operator^(const Vectorized<int##bit##_t>& a, \
const Vectorized<int##bit##_t>& b) { \
return sveor_s##bit##_x(ptrue, a, b); \
} \
template <> \
inline Vectorized<int##bit##_t> operator~(const Vectorized<int##bit##_t>& a) { \
return sveor_s##bit##_x(ptrue, a, svdup_n_s##bit(-1)); \
} \
Vectorized<int##bit##_t> inline Vectorized<int##bit##_t>::eq(const Vectorized<int##bit##_t>& other) const { \
return (*this == other) & Vectorized<int##bit##_t>(1); \
} \
Vectorized<int##bit##_t> inline Vectorized<int##bit##_t>::ne(const Vectorized<int##bit##_t>& other) const { \
return (*this != other) & Vectorized<int##bit##_t>(1); \
} \
Vectorized<int##bit##_t> inline Vectorized<int##bit##_t>::gt(const Vectorized<int##bit##_t>& other) const { \
return (*this > other) & Vectorized<int##bit##_t>(1); \
} \
Vectorized<int##bit##_t> inline Vectorized<int##bit##_t>::ge(const Vectorized<int##bit##_t>& other) const { \
return (*this >= other) & Vectorized<int##bit##_t>(1); \
} \
Vectorized<int##bit##_t> inline Vectorized<int##bit##_t>::lt(const Vectorized<int##bit##_t>& other) const { \
return (*this < other) & Vectorized<int##bit##_t>(1); \
} \
Vectorized<int##bit##_t> inline Vectorized<int##bit##_t>::le(const Vectorized<int##bit##_t>& other) const { \
return (*this <= other) & Vectorized<int##bit##_t>(1); \
}
VEC_INT_SVE_TEMPLATE(VECTOR_WIDTH / sizeof(int64_t), 64)
VEC_INT_SVE_TEMPLATE(VECTOR_WIDTH / sizeof(int32_t), 32)
VEC_INT_SVE_TEMPLATE(VECTOR_WIDTH / sizeof(int16_t), 16)
VEC_INT_SVE_TEMPLATE(VECTOR_WIDTH / sizeof(int8_t), 8)
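// Scalar fallback for integer division. SVE provides a vector divide only for
// 32-bit and 64-bit lanes, so the 8-bit and 16-bit operator/ specializations
// below use this element-by-element loop instead.
template <typename T>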
Vectorized<T> inline intdiv_nosve(const Vectorized<T>& a, const Vectorized<T>& b) {
T values_a[Vectorized<T>::size()];
T values_b[Vectorized<T>::size()];
a.store(values_a);
b.store(values_b);
for (int i = 0; i != Vectorized<T>::size(); i++) {
values_a[i] /= values_b[i];
}
return Vectorized<T>::loadu(values_a);
}
template <>
Vectorized<int64_t> inline operator/(const Vectorized<int64_t>& a, const Vectorized<int64_t>& b) {
return svdiv_s64_x(ptrue, a, b);
}
template <>
Vectorized<int32_t> inline operator/(const Vectorized<int32_t>& a, const Vectorized<int32_t>& b) {
return svdiv_s32_x(ptrue, a, b);
}
template <>
Vectorized<int16_t> inline operator/(const Vectorized<int16_t>& a, const Vectorized<int16_t>& b) {
return intdiv_nosve(a, b);
}
template <>
Vectorized<int8_t> inline operator/(const Vectorized<int8_t>& a, const Vectorized<int8_t>& b) {
return intdiv_nosve(a, b);
}
template <>
inline void convert(const int32_t *src, int64_t *dst, int64_t n) {
const int64_t fraction = n % Vectorized<int64_t>::size();
svbool_t pg_32 = svwhilelt_b32(0ull, Vectorized<int64_t>::size());
svbool_t pg_64 = svwhilelt_b64(0ull, Vectorized<int64_t>::size());
#pragma unroll
for (int64_t i = 0; i < n - fraction; i += Vectorized<int64_t>::size())
svst1_s64(pg_64, dst + i, svunpklo_s64(svldnt1_s32(pg_32, src + i)));
#pragma unroll
for (int64_t i = n - fraction; i < n; i += Vectorized<int64_t>::size()) {
pg_32 = svwhilelt_b32(i, n);
pg_64 = svwhilelt_b64(i, n);
svst1_s64(pg_64, dst + i, svunpklo_s64(svldnt1_s32(pg_32, src + i)));
}
}
template <>
inline void convert(const int64_t *src, float *dst, int64_t n) {
const int64_t fraction = n % Vectorized<int64_t>::size();
svbool_t pg_32 = svwhilelt_b32(0ull, Vectorized<int64_t>::size());
svbool_t pg_64 = svwhilelt_b64(0ull, Vectorized<int64_t>::size());
#pragma unroll
for (int64_t i = 0; i < n - fraction; i += Vectorized<int64_t>::size()) {
svint64_t src_vec_s64 = svldnt1_s64(pg_64, src + i);
svfloat32_t src_vec_f32 = svuzp1_f32(svcvt_f32_s64_x(pg_64, src_vec_s64), ZERO_F32);
svst1_f32(pg_32, dst + i, src_vec_f32);
}
#pragma unroll
for (int64_t i = n - fraction; i < n; i += Vectorized<int64_t>::size()) {
pg_32 = svwhilelt_b32(i, n);
pg_64 = svwhilelt_b64(i, n);
svint64_t src_vec_s64 = svldnt1_s64(pg_64, src + i);
svfloat32_t src_vec_f32 = svuzp1_f32(svcvt_f32_s64_x(pg_64, src_vec_s64), ZERO_F32);
svst1_f32(pg_32, dst + i, src_vec_f32);
}
}
template <>
inline void convert(const int32_t *src, float *dst, int64_t n) {
const int64_t fraction = n % Vectorized<int32_t>::size();
svbool_t pg = svwhilelt_b32(0ull, Vectorized<int32_t>::size());
#pragma unroll
for (int64_t i = 0; i < n - fraction; i += Vectorized<int32_t>::size()) {
svint32_t src_vec = svldnt1_s32(pg, src + i);
svst1_f32(pg, dst + i, svcvt_f32_s32_x(pg, src_vec));
}
#pragma unroll
for (int64_t i = n - fraction; i < n; i += Vectorized<int32_t>::size()) {
pg = svwhilelt_b32(i, n);
svint32_t src_vec = svldnt1_s32(pg, src + i);
svst1_f32(pg, dst + i, svcvt_f32_s32_x(pg, src_vec));
}
}
template <>
inline void convert(const bool *src, int64_t *dst, int64_t n) {
const int64_t fraction = n % Vectorized<int64_t>::size();
svbool_t pg_8 = svwhilelt_b8(0ull, Vectorized<int64_t>::size());
svbool_t pg_64 = svwhilelt_b64(0ull, Vectorized<int64_t>::size());
#pragma unroll
for (int64_t i = 0; i < n - fraction; i += Vectorized<int64_t>::size()) {
svuint8_t src_vec_u8 = svldnt1_u8(pg_8, reinterpret_cast<const uint8_t*>(src) + i);
svuint64_t src_vec_u64 = svunpklo_u64(svunpklo_u32(svunpklo_u16(src_vec_u8)));
svbool_t mask = svcmpne_u64(pg_64, src_vec_u64, ZERO_U64);
svst1_s64(pg_64, dst + i, svsel_s64(mask, ONE_S64, ZERO_S64));
}
#pragma unroll
for (int64_t i = n - fraction; i < n; i += Vectorized<int64_t>::size()) {
pg_8 = svwhilelt_b8(i, n);
pg_64 = svwhilelt_b64(i, n);
svuint8_t src_vec_u8 = svldnt1_u8(pg_8, reinterpret_cast<const uint8_t*>(src) + i);
svuint64_t src_vec_u64 = svunpklo_u64(svunpklo_u32(svunpklo_u16(src_vec_u8)));
svbool_t mask = svcmpne_u64(pg_64, src_vec_u64, ZERO_U64);
svst1_s64(pg_64, dst + i, svsel_s64(mask, ONE_S64, ZERO_S64));
}
}
template <>
inline void convert(const bool *src, int32_t *dst, int64_t n) {
const int64_t fraction = n % Vectorized<int32_t>::size();
svbool_t pg_8 = svwhilelt_b8(0ull, Vectorized<int32_t>::size());
svbool_t pg_32 = svwhilelt_b32(0ull, Vectorized<int32_t>::size());
#pragma unroll
for (int64_t i = 0; i < n - fraction; i += Vectorized<int32_t>::size()) {
svuint8_t src_vec_u8 = svldnt1_u8(pg_8, reinterpret_cast<const uint8_t*>(src) + i);
svuint32_t src_vec_u32 = svunpklo_u32(svunpklo_u16(src_vec_u8));
svbool_t mask = svcmpne_u32(pg_32, src_vec_u32, ZERO_U32);
svst1_s32(pg_32, dst + i, svsel_s32(mask, ONE_S32, ZERO_S32));
}
#pragma unroll
for (int64_t i = n - fraction; i < n; i += Vectorized<int32_t>::size()) {
pg_8 = svwhilelt_b8(i, n);
pg_32 = svwhilelt_b32(i, n);
svuint8_t src_vec_u8 = svldnt1_u8(pg_8, reinterpret_cast<const uint8_t*>(src) + i);
svuint32_t src_vec_u32 = svunpklo_u32(svunpklo_u16(src_vec_u8));
svbool_t mask = svcmpne_u32(pg_32, src_vec_u32, ZERO_U32);
svst1_s32(pg_32, dst + i, svsel_s32(mask, ONE_S32, ZERO_S32));
}
}
template <>
inline void convert(const uint8_t *src, bool *dst, int64_t n) {
const int64_t fraction = n % Vectorized<uint8_t>::size();
svbool_t pg = svwhilelt_b8(0ull, Vectorized<uint8_t>::size());
#pragma unroll
for (int64_t i = 0; i < n - fraction; i += Vectorized<uint8_t>::size()) {
svbool_t mask = svcmpne_u8(pg, svldnt1_u8(pg, src + i), ZERO_U8);
svst1_u8(pg, reinterpret_cast<uint8_t*>(dst) + i,
svsel_u8(mask, ALL_U8_TRUE_MASK, ALL_U8_FALSE_MASK));
}
#pragma unroll
for (int64_t i = n - fraction; i < n; i += Vectorized<uint8_t>::size()) {
pg = svwhilelt_b8(i, n);
svbool_t mask = svcmpne_u8(pg, svldnt1_u8(pg, src + i), ZERO_U8);
svst1_u8(pg, reinterpret_cast<uint8_t*>(dst) + i,
svsel_u8(mask, ALL_U8_TRUE_MASK, ALL_U8_FALSE_MASK));
}
}
template <>
Vectorized<int64_t> inline operator<<(const Vectorized<int64_t>& a, const Vectorized<int64_t>& b) {
return svlsl_s64_x(ptrue, a, svreinterpret_u64_s64(b));
}
template <>
Vectorized<int32_t> inline operator<<(const Vectorized<int32_t>& a, const Vectorized<int32_t>& b) {
return svlsl_s32_x(ptrue, a, svreinterpret_u32_s32(b));
}
template <>
Vectorized<int16_t> inline operator<<(const Vectorized<int16_t>& a, const Vectorized<int16_t>& b) {
return svlsl_s16_x(ptrue, a, svreinterpret_u16_s16(b));
}
template <>
Vectorized<int8_t> inline operator<<(const Vectorized<int8_t>& a, const Vectorized<int8_t>& b) {
return svlsl_s8_x(ptrue, a, svreinterpret_u8_s8(b));
}
template <>
Vectorized<int64_t> inline operator>>(const Vectorized<int64_t>& a, const Vectorized<int64_t>& b) {
return svasr_s64_x(ptrue, a, svreinterpret_u64_s64(b));
}
template <>
Vectorized<int32_t> inline operator>>(const Vectorized<int32_t>& a, const Vectorized<int32_t>& b) {
return svasr_s32_x(ptrue, a, svreinterpret_u32_s32(b));
}
template <>
Vectorized<int16_t> inline operator>>(const Vectorized<int16_t>& a, const Vectorized<int16_t>& b) {
return svasr_s16_x(ptrue, a, svreinterpret_u16_s16(b));
}
template <>
Vectorized<int8_t> inline operator>>(const Vectorized<int8_t>& a, const Vectorized<int8_t>& b) {
return svasr_s8_x(ptrue, a, svreinterpret_u8_s8(b));
}
#endif // defined(CPU_CAPABILITY_SVE)
}}}
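
The integer comparison operators above return a per-lane mask, while `eq`, `ne`, `gt`, `ge`, `lt`, and `le` AND that mask with 1 to produce a 0/1 result, and `blendv` selects `b` only in lanes whose mask equals the true mask exactly. The block below is a minimal single-lane sketch of that convention in portable C++. It assumes a true lane has all bits set (the definition of `ALL_S32_TRUE_MASK` is not shown in this diff), and `kTrueLane`, `kFalseLane`, `cmp_eq_lane`, `eq_lane`, and `blendv_lane` are hypothetical names.

```cpp
#include <cstdint>

// Assumption: a "true" lane is all ones and a "false" lane is all zeros.
constexpr int32_t kTrueLane = -1;
constexpr int32_t kFalseLane = 0;

// Models one lane of operator==: returns a full-width mask.
constexpr int32_t cmp_eq_lane(int32_t a, int32_t b) {
  return a == b ? kTrueLane : kFalseLane;
}

// Models one lane of eq(): the mask ANDed with 1, giving 0 or 1.
constexpr int32_t eq_lane(int32_t a, int32_t b) {
  return cmp_eq_lane(a, b) & 1;
}

// Models one lane of blendv: b is chosen only when the mask lane is
// exactly the true mask; any other value keeps a.
constexpr int32_t blendv_lane(int32_t a, int32_t b, int32_t mask) {
  return mask == kTrueLane ? b : a;
}

static_assert(cmp_eq_lane(3, 3) == kTrueLane, "operator== yields a mask");
static_assert(eq_lane(3, 3) == 1, "eq() yields 1 for equal lanes");
static_assert(eq_lane(3, 4) == 0, "eq() yields 0 for unequal lanes");
```

One consequence of this convention is that the 0/1 output of `eq()` is not itself a valid `blendv` mask: in the sketch, `blendv_lane(a, b, 1)` returns `a`, so callers should pass the result of the comparison operators instead.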


@@ -0,0 +1,567 @@
#pragma once
// DO NOT DEFINE STATIC DATA IN THIS HEADER!
// See Note [Do not compile initializers with SVE]
#include <ATen/cpu/vec/intrinsics.h>
#include <ATen/cpu/vec/vec_base.h>
#include <ATen/native/quantized/AffineQuantizerBase.h>
#include <c10/util/qint32.h>
#include <c10/util/qint8.h>
#include <c10/util/quint8.h>
#include <array>
// This file defines Vectorized<> for the quantized types.
//
// Currently, we simply use these classes as efficient converters between
// the quantized types and Vectorized<float>, usually in bandwidth-bound cases
// where doing the arithmetic in full precision is acceptable (e.g.
// elementwise operators).
//
// Conversions are as follows:
// Vectorized<qint8> -> 4x Vectorized<float>
// Vectorized<quint8> -> 4x Vectorized<float>
// Vectorized<qint32> -> 1x Vectorized<float>
//
// The number of float vectors returned is given by the constexpr
// function float_num_vecs(). The type of the value returned
// from dequantize (and expected as an argument to quantize) is
// specified by float_vec_return_type.
//
// When writing kernels with these vectors, it is expected that floating-
// point operations will be carried out in a loop over
// Vectorized<T>::float_num_vecs() iterations.
namespace at {
namespace vec {
// Note [CPU_CAPABILITY namespace]
// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
// This header, and all of its subheaders, will be compiled with
// different architecture flags for each supported set of vector
// intrinsics. So we need to make sure they aren't inadvertently
// linked together. We do this by declaring objects in an `inline
// namespace` which changes the name mangling, but can still be
// accessed as `at::vec`.
inline namespace CPU_CAPABILITY {
#if defined(CPU_CAPABILITY_SVE)
// NOTE: These are scalar, low-performance implementations that are used
// when building with SVE. They go through plain arrays and loop element
// by element, so they act as a reference implementation rather than an
// optimized SVE path.
//
// If quantized kernels become performance-critical on SVE, we should
// probably revisit these implementations.
template <
typename T,
typename float_vec_return_type_,
typename int_vec_return_type_,
int size_>
struct VectorizedQuantizedConverter {
using size_type = int;
static constexpr size_type size() {
return size_;
}
static constexpr int float_num_vecs() {
return size() / Vectorized<float>::size();
}
static constexpr int int_num_vecs() {
return size() / Vectorized<int32_t>::size();
}
using float_vec_return_type = float_vec_return_type_;
using int_vec_return_type = int_vec_return_type_;
using value_type = typename T::underlying;
std::array<value_type, size_> vals;
VectorizedQuantizedConverter(T val) {
for (size_t i = 0; i < size(); ++i) {
vals[i] = val.val_;
}
}
VectorizedQuantizedConverter(const void* ptr) {
memcpy(vals.data(), ptr, sizeof(value_type) * size());
}
void store(void* ptr, int count = size()) const {
memcpy(ptr, vals.data(), count * sizeof(value_type));
}
float_vec_return_type dequantize(
Vectorized<float> scale,
Vectorized<float> zero_point,
Vectorized<float> scale_zp_premul) const {
float_vec_return_type rv;
float tmp_scale[Vectorized<float>::size()];
float tmp_zero_point[Vectorized<float>::size()];
scale.store(tmp_scale);
zero_point.store(tmp_zero_point);
for (int i = 0; i < float_num_vecs(); ++i) {
float tmp_vals[Vectorized<float>::size()];
for (int j = 0; j < Vectorized<float>::size(); ++j) {
tmp_vals[j] =
at::native::dequantize_val<T>(tmp_scale[j], tmp_zero_point[j], T(vals[Vectorized<float>::size() * i + j]));
}
rv[i] = Vectorized<float>::loadu(tmp_vals);
}
return rv;
}
float_vec_return_type dequantize(
Vectorized<float> scale,
Vectorized<float> zero_point) const {
float_vec_return_type rv;
float tmp_scale[Vectorized<float>::size()];
float tmp_zero_point[Vectorized<float>::size()];
scale.store(tmp_scale);
zero_point.store(tmp_zero_point);
for (int i = 0; i < float_num_vecs(); ++i) {
float tmp_vals[Vectorized<float>::size()];
for (int j = 0; j < Vectorized<float>::size(); ++j) {
tmp_vals[j] =
at::native::dequantize_val<T>(tmp_scale[j], tmp_zero_point[j], T(vals[Vectorized<float>::size() * i + j]));
}
rv[i] = Vectorized<float>::loadu(tmp_vals);
}
return rv;
}
protected:
VectorizedQuantizedConverter() {}
};
template <>
struct Vectorized<c10::qint32> : public VectorizedQuantizedConverter<
c10::qint32,
std::array<Vectorized<float>, 1>,
std::array<Vectorized<c10::qint32>, 1>,
VECTOR_WIDTH / 4> {
Vectorized()
: VectorizedQuantizedConverter<
c10::qint32,
std::array<Vectorized<float>, 1>,
std::array<Vectorized<c10::qint32>, 1>,
VECTOR_WIDTH / 4>() {}
Vectorized(c10::qint32 val)
: VectorizedQuantizedConverter<
c10::qint32,
std::array<Vectorized<float>, 1>,
std::array<Vectorized<c10::qint32>, 1>,
VECTOR_WIDTH / 4>(val) {}
Vectorized(const void* ptr)
: VectorizedQuantizedConverter<
c10::qint32,
std::array<Vectorized<float>, 1>,
std::array<Vectorized<c10::qint32>, 1>,
VECTOR_WIDTH / 4>(ptr) {}
#if 1
static Vectorized<c10::qint32> loadu(const void* ptr) {
return Vectorized<c10::qint32>(ptr);
}
static Vectorized<c10::qint32> loadu(const void* ptr, int64_t count) {
__at_align__ value_type tmp_values[size()];
// Ensure uninitialized memory does not change the output value. See
// https://github.com/pytorch/pytorch/issues/32502 for more details. We do
// not initialize arrays to zero using "={0}" because gcc would compile it
// to two instructions while a loop would be compiled to one instruction.
for (const auto i : c10::irange(size())) {
tmp_values[i] = 0;
}
std::memcpy(tmp_values, reinterpret_cast<const value_type*>(ptr), count * sizeof(value_type));
return loadu(tmp_values);
}
#else
static Vectorized<c10::qint32> loadu(const void* ptr, int64_t count = size()) {
if (count == size())
return svld1_s32(ptrue, reinterpret_cast<const int32_t*>(ptr));
svbool_t pg = svwhilelt_b32(0ull, count);
return svld1_s32(pg, reinterpret_cast<const int32_t*>(ptr));
}
#endif
static Vectorized<c10::qint32> quantize(
const float_vec_return_type& rhs,
float scale,
int32_t zero_point,
float inverse_scale) {
std::array<value_type, size()> qvals;
std::array<float, float_num_vecs() * Vectorized<float>::size()> float_vals;
for (int i = 0; i < float_num_vecs(); ++i) {
rhs[i].store(&float_vals[i * Vectorized<float>::size()], Vectorized<float>::size());
}
at::native::quantize_vec<c10::qint32, /*precision=*/32>(
scale,
zero_point,
float_vals.data(),
(c10::qint32*)qvals.data(),
Vectorized<float>::size() * float_num_vecs());
return Vectorized<c10::qint32>::loadu(qvals.data());
}
Vectorized<c10::qint32> maximum(Vectorized<c10::qint32> b) const {
Vectorized<c10::qint32> retval;
for (size_t i = 0; i < size(); ++i) {
retval.vals[i] = std::max<value_type>(vals[i], b.vals[i]);
}
return retval;
}
Vectorized<c10::qint32> minimum(Vectorized<c10::qint32> b) const {
Vectorized<c10::qint32> retval;
for (size_t i = 0; i < size(); ++i) {
retval.vals[i] = std::min<value_type>(vals[i], b.vals[i]);
}
return retval;
}
Vectorized<c10::qint32> relu(Vectorized<c10::qint32> zero_point) const {
return maximum(zero_point);
}
Vectorized<c10::qint32> relu6(
Vectorized<c10::qint32> zero_point,
Vectorized<c10::qint32> q_six) {
Vectorized<c10::qint32> retval;
for (size_t i = 0; i < size(); ++i) {
retval.vals[i] = std::min<value_type>(
std::max<value_type>(vals[i], zero_point.vals[i]), q_six.vals[i]);
}
return retval;
}
int_vec_return_type widening_subtract(Vectorized<c10::qint32> b) const {
int_vec_return_type retval;
for (size_t i = 0; i < size(); ++i) {
retval[0].vals[i] = vals[i] - b.vals[i];
}
return retval;
}
static Vectorized<c10::qint32> requantize_from_int(
const int_vec_return_type& inp,
float multiplier,
int32_t zero_point) {
Vectorized<c10::qint32> retval;
for (size_t i = 0; i < size(); ++i) {
retval.vals[i] =
nearbyint(static_cast<float>(inp[0].vals[i]) * multiplier) +
zero_point;
}
return retval;
}
};
template <>
Vectorized<c10::qint32> inline maximum(const Vectorized<c10::qint32>& a, const Vectorized<c10::qint32>& b) {
return a.maximum(b);
}
template <>
Vectorized<c10::qint32> inline operator*(
const Vectorized<c10::qint32>& a,
const Vectorized<c10::qint32>& b) {
Vectorized<c10::qint32> retval;
for (size_t i = 0; i < std::decay_t<decltype(a)>::size(); ++i) {
retval.vals[i] = a.vals[i] * b.vals[i];
}
return retval;
}
template <>
Vectorized<c10::qint32> inline operator+(
const Vectorized<c10::qint32>& a,
const Vectorized<c10::qint32>& b) {
Vectorized<c10::qint32> retval;
for (size_t i = 0; i < std::decay_t<decltype(a)>::size(); ++i) {
retval.vals[i] = a.vals[i] + b.vals[i];
}
return retval;
}
template <>
struct Vectorized<c10::qint8> : public VectorizedQuantizedConverter<
c10::qint8,
std::array<Vectorized<float>, 4>,
std::array<Vectorized<c10::qint32>, 4>,
VECTOR_WIDTH> {
Vectorized()
: VectorizedQuantizedConverter<
c10::qint8,
std::array<Vectorized<float>, 4>,
std::array<Vectorized<c10::qint32>, 4>,
VECTOR_WIDTH>() {}
Vectorized(c10::qint8 val)
: VectorizedQuantizedConverter<
c10::qint8,
std::array<Vectorized<float>, 4>,
std::array<Vectorized<c10::qint32>, 4>,
VECTOR_WIDTH>(val) {}
Vectorized(const void* ptr)
: VectorizedQuantizedConverter<
c10::qint8,
std::array<Vectorized<float>, 4>,
std::array<Vectorized<c10::qint32>, 4>,
VECTOR_WIDTH>(ptr) {}
static Vectorized<c10::qint8> loadu(const void* ptr) {
return Vectorized<c10::qint8>(ptr);
}
static Vectorized<c10::qint8> loadu(const void* ptr, int64_t count) {
__at_align__ value_type tmp_values[size()];
// Ensure uninitialized memory does not change the output value. See
// https://github.com/pytorch/pytorch/issues/32502 for more details. We do
// not initialize arrays to zero using "={0}" because gcc would compile it
// to two instructions while a loop would be compiled to one instruction.
for (const auto i : c10::irange(size())) {
tmp_values[i] = 0;
}
std::memcpy(tmp_values, reinterpret_cast<const value_type*>(ptr), count * sizeof(value_type));
return loadu(tmp_values);
}
static Vectorized<c10::qint8> quantize(
const float_vec_return_type& rhs,
float scale,
int32_t zero_point,
float inverse_scale) {
std::array<value_type, size()> qvals;
std::array<float, float_num_vecs() * Vectorized<float>::size()> float_vals;
for (int i = 0; i < float_num_vecs(); ++i) {
rhs[i].store(&float_vals[i * Vectorized<float>::size()], Vectorized<float>::size());
}
at::native::quantize_vec<c10::qint8>(
scale,
zero_point,
float_vals.data(),
(c10::qint8*)qvals.data(),
Vectorized<float>::size() * float_num_vecs());
return Vectorized<c10::qint8>::loadu(qvals.data());
}
Vectorized<c10::qint8> maximum(Vectorized<c10::qint8> b) const {
Vectorized<c10::qint8> retval;
for (size_t i = 0; i < size(); ++i) {
retval.vals[i] = std::max<value_type>(vals[i], b.vals[i]);
}
return retval;
}
Vectorized<c10::qint8> minimum(Vectorized<c10::qint8> b) const {
Vectorized<c10::qint8> retval;
for (size_t i = 0; i < size(); ++i) {
retval.vals[i] = std::min<value_type>(vals[i], b.vals[i]);
}
return retval;
}
Vectorized<c10::qint8> relu(Vectorized<c10::qint8> zero_point) const {
return maximum(zero_point);
}
Vectorized<c10::qint8> relu6(
Vectorized<c10::qint8> zero_point,
Vectorized<c10::qint8> q_six) {
Vectorized<c10::qint8> retval;
for (size_t i = 0; i < size(); ++i) {
retval.vals[i] = std::min<value_type>(
std::max<value_type>(vals[i], zero_point.vals[i]), q_six.vals[i]);
}
return retval;
}
int_vec_return_type widening_subtract(Vectorized<c10::qint8> b) const {
int_vec_return_type retval;
constexpr int elem_per_int_vec = size() / int_num_vecs();
for (size_t i = 0; i < int_num_vecs(); ++i) {
for (size_t j = 0; j < elem_per_int_vec; ++j) {
retval[i].vals[j] =
static_cast<int32_t>(vals[i * elem_per_int_vec + j]) -
static_cast<int32_t>(b.vals[i * elem_per_int_vec + j]);
}
}
return retval;
}
static Vectorized<c10::qint8> requantize_from_int(
const int_vec_return_type& inp,
float multiplier,
int32_t zero_point) {
constexpr int elem_per_int_vec = size() / int_num_vecs();
constexpr auto min_val = std::numeric_limits<value_type>::min();
constexpr auto max_val = std::numeric_limits<value_type>::max();
Vectorized<c10::qint8> retval;
for (size_t i = 0; i < int_num_vecs(); ++i) {
for (size_t j = 0; j < elem_per_int_vec; ++j) {
int32_t rounded =
nearbyint(static_cast<float>(inp[i].vals[j]) * multiplier) +
zero_point;
retval.vals[i * elem_per_int_vec + j] =
std::min<int32_t>(std::max<int32_t>(rounded, min_val), max_val);
}
}
return retval;
}
};
template <>
Vectorized<c10::qint8> inline maximum(const Vectorized<c10::qint8>& a, const Vectorized<c10::qint8>& b) {
return a.maximum(b);
}
template <>
struct Vectorized<c10::quint8> : public VectorizedQuantizedConverter<
c10::quint8,
std::array<Vectorized<float>, 4>,
std::array<Vectorized<c10::qint32>, 4>,
VECTOR_WIDTH> {
Vectorized()
: VectorizedQuantizedConverter<
c10::quint8,
std::array<Vectorized<float>, 4>,
std::array<Vectorized<c10::qint32>, 4>,
VECTOR_WIDTH>() {}
Vectorized(c10::quint8 val)
: VectorizedQuantizedConverter<
c10::quint8,
std::array<Vectorized<float>, 4>,
std::array<Vectorized<c10::qint32>, 4>,
VECTOR_WIDTH>(val) {}
Vectorized(const void* ptr)
: VectorizedQuantizedConverter<
c10::quint8,
std::array<Vectorized<float>, 4>,
std::array<Vectorized<c10::qint32>, 4>,
VECTOR_WIDTH>(ptr) {}
#if 1
static Vectorized<c10::quint8> loadu(const void* ptr) {
return Vectorized<c10::quint8>(ptr);
}
static Vectorized<c10::quint8> loadu(const void* ptr, int64_t count) {
__at_align__ value_type tmp_values[size()];
// Ensure uninitialized memory does not change the output value. See
// https://github.com/pytorch/pytorch/issues/32502 for more details. We do
// not initialize arrays to zero using "={0}" because gcc would compile it
// to two instructions while a loop would be compiled to one instruction.
for (const auto i : c10::irange(size())) {
tmp_values[i] = 0;
}
std::memcpy(tmp_values, reinterpret_cast<const value_type*>(ptr), count * sizeof(value_type));
return loadu(tmp_values);
}
#else
static Vectorized<c10::quint8> loadu(const void* ptr, int64_t count = size()) {
if (count == size())
return svld1_u8(ptrue, reinterpret_cast<const uint8_t*>(ptr));
svbool_t pg = svwhilelt_b8(0ull, count);
return svld1_u8(pg, reinterpret_cast<const uint8_t*>(ptr));
}
#endif
static Vectorized<c10::quint8> quantize(
const float_vec_return_type& rhs,
float scale,
int32_t zero_point,
float inverse_scale) {
std::array<value_type, size()> qvals;
std::array<float, float_num_vecs() * Vectorized<float>::size()> float_vals;
for (int i = 0; i < float_num_vecs(); ++i) {
rhs[i].store(&float_vals[i * Vectorized<float>::size()], Vectorized<float>::size());
}
at::native::quantize_vec<c10::quint8>(
scale,
zero_point,
float_vals.data(),
(c10::quint8*)qvals.data(),
Vectorized<float>::size() * float_num_vecs());
return Vectorized<c10::quint8>::loadu(qvals.data());
}
Vectorized<c10::quint8> maximum(Vectorized<c10::quint8> b) const {
Vectorized<c10::quint8> retval;
for (size_t i = 0; i < size(); ++i) {
retval.vals[i] = std::max<value_type>(vals[i], b.vals[i]);
}
return retval;
}
Vectorized<c10::quint8> minimum(Vectorized<c10::quint8> b) const {
Vectorized<c10::quint8> retval;
for (size_t i = 0; i < size(); ++i) {
retval.vals[i] = std::min<value_type>(vals[i], b.vals[i]);
}
return retval;
}
Vectorized<c10::quint8> relu(Vectorized<c10::quint8> zero_point) const {
return maximum(zero_point);
}
Vectorized<c10::quint8> relu6(
Vectorized<c10::quint8> zero_point,
Vectorized<c10::quint8> q_six) {
Vectorized<c10::quint8> retval;
for (size_t i = 0; i < size(); ++i) {
retval.vals[i] = std::min<value_type>(
std::max<value_type>(vals[i], zero_point.vals[i]), q_six.vals[i]);
}
return retval;
}
int_vec_return_type widening_subtract(Vectorized<c10::quint8> b) const {
int_vec_return_type retval;
constexpr int elem_per_int_vec = size() / int_num_vecs();
for (size_t i = 0; i < int_num_vecs(); ++i) {
for (size_t j = 0; j < elem_per_int_vec; ++j) {
retval[i].vals[j] =
static_cast<int32_t>(vals[i * elem_per_int_vec + j]) -
static_cast<int32_t>(b.vals[i * elem_per_int_vec + j]);
}
}
return retval;
}
static Vectorized<c10::quint8> requantize_from_int(
const int_vec_return_type& inp,
float multiplier,
int32_t zero_point) {
constexpr int elem_per_int_vec = size() / int_num_vecs();
constexpr auto min_val = std::numeric_limits<value_type>::min();
constexpr auto max_val = std::numeric_limits<value_type>::max();
Vectorized<c10::quint8> retval;
for (size_t i = 0; i < int_num_vecs(); ++i) {
for (size_t j = 0; j < elem_per_int_vec; ++j) {
int32_t rounded =
nearbyint(static_cast<float>(inp[i].vals[j]) * multiplier) +
zero_point;
retval.vals[i * elem_per_int_vec + j] =
std::min<int32_t>(std::max<int32_t>(rounded, min_val), max_val);
}
}
return retval;
}
};
template <>
Vectorized<c10::quint8> inline maximum(const Vectorized<c10::quint8>& a, const Vectorized<c10::quint8>& b) {
return a.maximum(b);
}
#endif // defined(CPU_CAPABILITY_SVE)
}}}
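
The header comment of this file says kernels are expected to loop over `float_num_vecs` float vectors per quantized vector. The following standalone sketch shows that pattern with plain arrays, assuming the usual affine scheme `q = clamp(round(x / scale) + zero_point)` and hypothetical sizes (8 float lanes, 32 quantized lanes). All names here (`quantize_one`, `dequantize_one`, `requantize_one`, `kFloatLanes`, and so on) are illustrative and are not part of ATen.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical sizes for illustration; the real ones follow from VECTOR_WIDTH.
constexpr int kFloatLanes = 8;                            // lanes per float vector
constexpr int kQuantLanes = 32;                           // 8-bit lanes per quantized vector
constexpr int kFloatNumVecs = kQuantLanes / kFloatLanes;  // plays the role of float_num_vecs()

using FloatVec = std::array<float, kFloatLanes>;
using FloatVecs = std::array<FloatVec, kFloatNumVecs>;
using QuantVec = std::array<std::uint8_t, kQuantLanes>;

// Assumed affine scheme: q = clamp(round(x / scale) + zero_point, 0, 255).
inline std::uint8_t quantize_one(float x, float scale, std::int32_t zero_point) {
  assert(scale > 0.0f);
  const std::int64_t q =
      static_cast<std::int64_t>(std::nearbyint(x / scale)) + zero_point;
  return static_cast<std::uint8_t>(
      std::min<std::int64_t>(std::max<std::int64_t>(q, 0), 255));
}

// Inverse mapping: x = (q - zero_point) * scale.
inline float dequantize_one(std::uint8_t q, float scale, std::int32_t zero_point) {
  return static_cast<float>(static_cast<std::int32_t>(q) - zero_point) * scale;
}

// One quantized vector expands into kFloatNumVecs float vectors.
inline FloatVecs dequantize(const QuantVec& q, float scale, std::int32_t zero_point) {
  FloatVecs out{};
  for (int i = 0; i < kFloatNumVecs; ++i) {
    for (int j = 0; j < kFloatLanes; ++j) {
      out[i][j] = dequantize_one(q[i * kFloatLanes + j], scale, zero_point);
    }
  }
  return out;
}

// kFloatNumVecs float vectors collapse back into one quantized vector.
inline QuantVec quantize(const FloatVecs& v, float scale, std::int32_t zero_point) {
  QuantVec out{};
  for (int i = 0; i < kFloatNumVecs; ++i) {
    for (int j = 0; j < kFloatLanes; ++j) {
      out[i * kFloatLanes + j] = quantize_one(v[i][j], scale, zero_point);
    }
  }
  return out;
}

// One lane of the requantize_from_int pattern: scale a widened
// (32-bit) difference, add the zero point, then clamp to the 8-bit range.
inline std::uint8_t requantize_one(
    std::int32_t widened, float multiplier, std::int32_t zero_point) {
  const std::int32_t rounded =
      static_cast<std::int32_t>(
          std::nearbyint(static_cast<float>(widened) * multiplier)) +
      zero_point;
  return static_cast<std::uint8_t>(
      std::min<std::int32_t>(std::max<std::int32_t>(rounded, 0), 255));
}
```

The widening step matters because the difference of two 8-bit values can fall outside the 8-bit range; doing the subtraction in 32 bits and clamping only at the end keeps the intermediate exact.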


@ -7,9 +7,13 @@
#include <ATen/cpu/vec/vec_base.h>
#if !(defined(__VSX__) || defined(CPU_CAPABILITY_VSX) || defined(CPU_CAPABILITY_ZVECTOR))
#include <ATen/cpu/vec/vec256/vec256_float.h>
#if defined(CPU_CAPABILITY_SVE256)
#include <ATen/cpu/vec/sve/vec_common_sve.h>
#else
#include <ATen/cpu/vec/vec256/vec256_float_neon.h>
#include <ATen/cpu/vec/vec256/vec256_half_neon.h>
#endif
#include <ATen/cpu/vec/vec256/vec256_float.h>
#include <ATen/cpu/vec/vec256/vec256_bfloat16.h>
#include <ATen/cpu/vec/vec256/vec256_double.h>
#include <ATen/cpu/vec/vec256/vec256_int.h>


@ -1097,7 +1097,7 @@ inline Vectorized<type> convert_float_##name(const Vectorized<float>& a, const V
return Vectorized<type>::loadu(arr2); \
}
CONVERT_NON_VECTORIZED_INIT(BFloat16, bfloat16);
-#if defined(__aarch64__) && !defined(C10_MOBILE) && !defined(__CUDACC__)
+#if defined(__aarch64__) && !defined(C10_MOBILE) && !defined(__CUDACC__) && !defined(CPU_CAPABILITY_SVE256)
inline std::tuple<Vectorized<float>, Vectorized<float>> convert_half_float(const Vectorized<Half>& a) {
static_assert(Vectorized<Half>::size() == 2 * Vectorized<float>::size());
#if defined(__ARM_FEATURE_FP16_VECTOR_ARITHMETIC)


@ -208,8 +208,27 @@ struct VecConvert<
(is_reduced_floating_point_v<src_t> && is_8bit_integer_v<dst_t>),
void>> {
static inline VectorizedN<dst_t, 1> apply(const VectorizedN<src_t, 1>& src) {
-VectorizedN<float, 1> tmp_fp32 = VecConvert<float, 1, src_t, 1>::apply(src);
-return VecConvert<dst_t, 1, float, 1>::apply(tmp_fp32);
+VectorizedN<float, 2> tmp_fp32 = VecConvert<float, 2, src_t, 1>::apply(src);
+return VecConvert<dst_t, 1, float, 2>::apply(tmp_fp32);
}
};
template <typename dst_t>
struct VecConvert<
dst_t,
1,
float,
2,
typename std::enable_if_t<is_8bit_integer_v<dst_t>,
void>> {
static inline VectorizedN<dst_t, 1> apply(const VectorizedN<float, 2>& src) {
at::vec::Vectorized<dst_t> vec1 = convert_float_to_int8<dst_t>(src[0]);
at::vec::Vectorized<dst_t> vec2 = convert_float_to_int8<dst_t>(src[1]);
__m128 lane2 = _mm256_castps256_ps128(_mm256_castsi256_ps(vec2));
__m256 combined = _mm256_insertf128_ps(_mm256_castsi256_ps(vec1), lane2, 1);
// Swap the two middle 64-bit lanes so that bits [191:128] of combined land in bits [127:64] of result
__m256i result = _mm256_permute4x64_epi64(_mm256_castps_si256(combined), 0b11011000);
return at::vec::Vectorized<dst_t>(result);
}
};
@ -226,6 +245,25 @@ struct VecConvert<
}
};
template <typename src_t>
struct VecConvert<
float,
2,
src_t,
1,
typename std::enable_if_t<is_8bit_integer_v<src_t>,
void>> {
static inline VectorizedN<float, 2> apply(const VectorizedN<src_t, 1>& src) {
// Swap the two middle 64-bit lanes so that bits [127:64] of src[0] land in bits [191:128] of shuffled
__m256i shuffled = _mm256_permute4x64_epi64(src[0], 0b11011000);
__m256i src2 = _mm256_castsi128_si256(
_mm_castps_si128(
_mm256_extractf128_ps(_mm256_castsi256_ps(shuffled), 1) // Extract the second 128-bit lane
)
);
return VectorizedN<float, 2>(convert_int8_to_float<src_t>(src[0]), convert_int8_to_float<src_t>(src2));
}
};
template <typename dst_t>
struct VecConvert<
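
The AVX2 conversions above rely on `_mm256_permute4x64_epi64` with the control byte `0b11011000`. The following is a small scalar sketch of what that control does, modelling a 256-bit register as four 64-bit lanes (lowest first). `Lanes`, `permute4x64`, and `pack_low_lanes` are hypothetical helper names used only for illustration; no intrinsics are required to run it.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Lanes = std::array<std::uint64_t, 4>;  // four 64-bit lanes, lowest first

// Scalar model of the lane selection done by _mm256_permute4x64_epi64:
// output lane i takes the input lane named by control bits [2*i+1 : 2*i].
inline Lanes permute4x64(const Lanes& a, unsigned control) {
  assert(control <= 0xFFu);
  Lanes out{};
  for (unsigned i = 0; i < 4; ++i) {
    out[i] = a[(control >> (2 * i)) & 0x3u];
  }
  return out;
}

// Model of the packing step in the float -> int8 conversion: each input
// holds its useful bytes in lane 0, and the result places them side by side.
inline Lanes pack_low_lanes(const Lanes& vec1, const Lanes& vec2) {
  // Equivalent of inserting vec2's lower 128 bits into the upper half of vec1.
  const Lanes combined{vec1[0], vec1[1], vec2[0], vec2[1]};
  // 0b11011000 selects lanes 0, 2, 1, 3: the two middle lanes are swapped.
  return permute4x64(combined, 0b11011000u);
}
```

Reading the control from the low bits upward gives the lane order 0, 2, 1, 3, so the same constant serves both directions: it gathers the two low lanes for packing and spreads them apart again before the int8-to-float conversion.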


@ -843,7 +843,7 @@ Vectorized<c10::quint8> inline maximum(const Vectorized<c10::quint8>& a, const V
return a.maximum(b);
}
-#else
+#elif !defined(CPU_CAPABILITY_SVE256)
// NOTE: These are low-performance implementations that we fall back on
// if we are not building with AVX2. This may not be an issue, because


@ -209,8 +209,25 @@ struct VecConvert<
(is_reduced_floating_point_v<src_t> && is_8bit_integer_v<dst_t>),
void>> {
static inline VectorizedN<dst_t, 1> apply(const VectorizedN<src_t, 1>& src) {
-VectorizedN<float, 1> tmp_fp32 = VecConvert<float, 1, src_t, 1>::apply(src);
-return VecConvert<dst_t, 1, float, 1>::apply(tmp_fp32);
+VectorizedN<float, 2> tmp_fp32 = VecConvert<float, 2, src_t, 1>::apply(src);
+return VecConvert<dst_t, 1, float, 2>::apply(tmp_fp32);
}
};
template <typename dst_t>
struct VecConvert<
dst_t,
1,
float,
2,
typename std::enable_if_t<is_8bit_integer_v<dst_t>,
void>> {
static inline VectorizedN<dst_t, 1> apply(const VectorizedN<float, 2>& src) {
at::vec::Vectorized<dst_t> vec1 = convert_float_to_int8<dst_t>(src[0]);
at::vec::Vectorized<dst_t> vec2 = convert_float_to_int8<dst_t>(src[1]);
__m128 lane2 = _mm512_castps512_ps128(_mm512_castsi512_ps(vec2));
__m512 result = _mm512_insertf32x4(_mm512_castsi512_ps(vec1), lane2, 1); // Insert lane2 into the second 128-bit lane
return at::vec::Vectorized<dst_t>(_mm512_castps_si512(result));
}
};
@ -227,6 +244,24 @@ struct VecConvert<
}
};
template <typename src_t>
struct VecConvert<
float,
2,
src_t,
1,
typename std::enable_if_t<is_8bit_integer_v<src_t>,
void>> {
static inline VectorizedN<float, 2> apply(const VectorizedN<src_t, 1>& src) {
__m512i src2 = _mm512_castsi128_si512(
_mm_castps_si128(
_mm512_extractf32x4_ps(_mm512_castsi512_ps(src[0]), 1) // Extract the second 128-bit lane
)
);
return VectorizedN<float, 2>(convert_int8_to_float<src_t>(src[0]), convert_int8_to_float<src_t>(src2));
}
};
template <typename src_t>
struct VecConvert<
float,


@ -990,7 +990,7 @@ inline mask_gather(const Vectorized<T>& src, T const* base_addr,
buffer[i] = src_arr[i];
}
}
-mask = Vectorized<T>(); // "zero out" mask
+mask = Vectorized<T>(static_cast<T>(0)); // "zero out" mask
return Vectorized<T>::loadu(static_cast<void*>(buffer));
}


@ -1408,7 +1408,6 @@ void scaled_gemm(
const void *result_scale_ptr,
int64_t result_ld,
ScalarType result_dtype,
-void* amax_ptr,
bool use_fast_accum) {
#if CUDA_VERSION >= 11080 || defined(USE_ROCM)
const auto computeType = CUBLAS_COMPUTE_32F;
@ -1421,13 +1420,9 @@ void scaled_gemm(
computeDesc.setAttribute(CUBLASLT_MATMUL_DESC_TRANSB, _cublasOpFromChar(transb));
computeDesc.setAttribute(CUBLASLT_MATMUL_DESC_A_SCALE_POINTER, mat1_scale_ptr);
computeDesc.setAttribute(CUBLASLT_MATMUL_DESC_B_SCALE_POINTER, mat2_scale_ptr);
-computeDesc.setAttribute(CUBLASLT_MATMUL_DESC_D_SCALE_POINTER, result_scale_ptr);
-#if !defined(USE_ROCM) || (defined(USE_ROCM) && ROCM_VERSION >= 60200)
-// Amax support in ROCm as of 6.2
-if (isFloat8Type(result_dtype)) {
-computeDesc.setAttribute(CUBLASLT_MATMUL_DESC_AMAX_D_POINTER, amax_ptr);
+if (result_scale_ptr != nullptr) {
+computeDesc.setAttribute(CUBLASLT_MATMUL_DESC_D_SCALE_POINTER, result_scale_ptr);
}
-#endif
#ifndef USE_ROCM
computeDesc.setAttribute(CUBLASLT_MATMUL_DESC_FAST_ACCUM, fastAccuMode);
#endif

Some files were not shown because too many files have changed in this diff.