Compare commits

...

1086 Commits

Author SHA1 Message Date
ba96cfd2f7 amend 2024-01-11 17:33:10 +00:00
7af4f58380 amend 2024-01-10 18:00:11 +00:00
4352ebe8bc amend 2024-01-10 12:12:42 +00:00
d52fcc6d33 Merge branch 'main' into tensordict_integration 2024-01-10 11:12:49 +00:00
19e93b85b9 Fixes last_dim stride check for singleton dimensions (#117001)
Fixes #116333

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117001
Approved by: https://github.com/cpuhrsch
2024-01-10 04:46:49 +00:00
8bcdde5058 Support uint{16,32,64} deterministic empty fill and scalar Python binding handling (#116807)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116807
Approved by: https://github.com/albanD
ghstack dependencies: #116805, #116806
2024-01-10 02:17:23 +00:00
43a23a704a Support uint{16,32,64} copy (#116806)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116806
Approved by: https://github.com/albanD
ghstack dependencies: #116805
2024-01-10 02:17:23 +00:00
2e983fcfd3 Support unsigned int for randint, item, equality, fill, iinfo, tensor (#116805)
These are some basic utilities that are often used for testing.
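A minimal sketch of these utilities with the new unsigned dtypes (assuming a build that includes this stack of PRs):

```python
import torch

# randint, item, iinfo, fill and equality with the new unsigned dtypes
t = torch.randint(0, 1000, (4,), dtype=torch.uint16)
print(t[0].item())                          # scalar extraction
print(torch.iinfo(torch.uint16).max)        # 65535
x = torch.full((2,), 7, dtype=torch.uint32) # fill
print(bool((x == 7).all()))                 # equality
```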

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116805
Approved by: https://github.com/albanD
2024-01-10 02:17:23 +00:00
4a10e9eed4 update build guide to use mkl-static. (#116946)
# Background:
We found that the current build guide uses MKL dynamic linking, which may trigger an MKL link issue.

Details:
In the build environment, libtorch_cpu.so dynamically links to the system MKL binaries by default.
If users install another version of the MKL library, this may lead to MKL symbol conflicts.

I also checked the released PyTorch binaries; they use static MKL linking. The build script shows it: https://github.com/pytorch/builder/blob/main/common/install_mkl.sh#L10

# Solution:
Update the build guide to use static MKL linking, aligning it with the build script.

Conda install command docs:
https://anaconda.org/intel/mkl-static
https://anaconda.org/intel/mkl-include

# Validation
No MKL shared-library dependencies remain after using `conda install intel::mkl-static intel::mkl-include`.
## Windows
![image](https://github.com/pytorch/pytorch/assets/8433590/cc554ded-d827-4de5-81c6-cc3039155580)

## Linux
<img width="959" alt="image" src="https://github.com/pytorch/pytorch/assets/8433590/79766ad8-4ba2-4ff1-adc9-63affd8d419a">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116946
Approved by: https://github.com/jgong5, https://github.com/malfet
2024-01-10 01:35:02 +00:00
b4f1ab4505 Docs: fix docstring errors in ddp_comm_hooks (#116866)
Reopens #115272
Fixes ddp_comm_hooks errors in https://github.com/pytorch/pytorch/issues/112644

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116866
Approved by: https://github.com/awgu
2024-01-10 01:24:06 +00:00
16d69290c6 Use view name instead of view_copy name for functional inverses (#117056)
Ex: `unsqueeze_copy_inverse()` -> `unsqueeze_inverse()`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117056
Approved by: https://github.com/bdhirsh
2024-01-10 00:52:36 +00:00
fdfdba7c13 [BE] Use __builtin_overflow_sub when available (#117015)
This is faster than the ternary approach.

The following script
```python
import torch
from timeit import default_timer

global_setup = """
"""
setup = """
c10::SymInt a = c10::SymInt(123);
"""
code = """
-a;
"""

from torch.utils.benchmark import Timer

t = Timer(stmt=code, setup=setup, global_setup=global_setup, language="c++", timer=default_timer)

print(t.blocked_autorange())
```

reports a median time of 4.17 ns before and 3.61 ns after on x86_64 Linux, and 2.02 ns before and 1.91 ns after on Apple M1.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117015
Approved by: https://github.com/albanD
2024-01-10 00:50:09 +00:00
a6325ad86c Fix cuInit test on Windows (#117055)
By changing library name from `libcuda.so.1` to `nvcuda.dll` on Windows

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117055
Approved by: https://github.com/Skylion007, https://github.com/huydhn, https://github.com/atalman
2024-01-10 00:45:18 +00:00
907e80239d Fix broken lint after #117052 (#117080)
https://hud.pytorch.org/pr/pytorch/pytorch/117052#20318344490 breaks lint, forward fixing with `lintrunner -a`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117080
Approved by: https://github.com/atalman, https://github.com/clee2000, https://github.com/Skylion007
2024-01-10 00:44:19 +00:00
d9fc438083 [cpu][vec512][double] unsigned left shift for mask (#117021)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117021
Approved by: https://github.com/leslie-fang-intel
2024-01-10 00:32:15 +00:00
0b72ce1bd1 Add at::sparse::full_coo_indices utility function. (#116352)
As in the title.

`full_coo_indices(shape)` should be used instead of `ones(shape).nonzero().T`, as `full_coo_indices` avoids materializing a dense tensor and is therefore far more efficient.
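For illustration, here is the dense-based pattern the new utility replaces (the utility itself is a C++ `at::sparse` helper, so it is not necessarily exposed to Python); the output shows what full COO indices of a shape look like:

```python
import torch

shape = (2, 3)
# The old pattern: materialize a dense tensor of ones just to enumerate all indices.
idx = torch.ones(shape).nonzero().T
print(idx)
# tensor([[0, 0, 0, 1, 1, 1],
#         [0, 1, 2, 0, 1, 2]])
```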

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116352
Approved by: https://github.com/cpuhrsch
ghstack dependencies: #116206
2024-01-10 00:07:09 +00:00
152bde6e27 [MPS][BE] Move kernel_index_offset to HistogramKernel (#117037)
As it has almost nothing in common with the rest of the indexing primitives other than the name.
Also, use `mtl_dispatch1DJob` to dispatch the work and check that the tensor size is less than 4 GB, as this function would not work with larger tensors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117037
Approved by: https://github.com/kulinseth
ghstack dependencies: #116903, #116904, #116915, #116940, #116942
2024-01-10 00:02:14 +00:00
8918ce4087 Add TORCH_LOGS_OUT to direct TORCH_LOGS output (#117005)
Twice now while debugging accuracy bugs, I have gotten dynamo logs that are 100k lines long, and it is impossible to read them in the terminal. Let's add an option to write them to a file.
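A hypothetical usage sketch (the path and toy function are made up, and it assumes `TORCH_LOGS_OUT` is read from the environment the same way `TORCH_LOGS` is):

```python
import os

# Must be set before torch's logging is configured, i.e. before importing torch.
os.environ["TORCH_LOGS"] = "+dynamo"
os.environ["TORCH_LOGS_OUT"] = "/tmp/dynamo.log"  # hypothetical output path

import torch

@torch.compile
def f(x):
    return x.sin() + 1

f(torch.randn(8))
# The (potentially 100k-line) dynamo logs now land in /tmp/dynamo.log
# instead of scrolling by on the terminal.
```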

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117005
Approved by: https://github.com/ezyang, https://github.com/zou3519
ghstack dependencies: #116894
2024-01-09 23:46:22 +00:00
b6028acfa4 Add _assert_scalar and teach Inductor to codegen it (#114148)
Inductor codegen for `_assert_async` is currently disabled because we don't really understand how to codegen `scalar_to_tensor` on a Sympy expression. I initially tried to see if I could get this to work, but I got into some weird problem involving stride sorting, so I decided to fix it properly by not going through a tensor.

So we introduce an `_assert_scalar` which takes a scalar as an argument, avoiding needing to turn a SymBool into a tensor before asserting on it. I also add `_functional_assert_scalar` for good luck, although this doesn't do anything right now because https://github.com/pytorch/pytorch/pull/104203 still hasn't been landed.

I need to customize the codegen for this operator, so I decide to directly implement it in Inductor, rather than trying to treat it as a generic ExternKernel. This leads to the new AssertScalar IR node. This is written carefully so that it doesn't get DCE'd by Inductor.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114148
Approved by: https://github.com/jansel
2024-01-09 23:21:26 +00:00
d2033a0639 [quant][pt2e][xnnpack_quantizer] add support for linear_relu (#117052)
Add support for linear_relu annotation in XNNPACKQuantizer; this allows the input to linear and the output of relu to share the same quantization parameters.

Differential Revision: [D52574086](https://our.internmc.facebook.com/intern/diff/D52574086/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117052
Approved by: https://github.com/jerryzh168, https://github.com/digantdesai
2024-01-09 23:19:52 +00:00
4f3d698cac Impl. call_hasattr for BaseUserFunctionVariable (#116049)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116049
Approved by: https://github.com/zou3519
2024-01-09 22:58:58 +00:00
8a6c43fbe5 add predispatch_pass to hold pass functions to be run when config.is_predispatch is true (#116788)
Summary:
config.is_predispatch is a config to instruct inductor to enable predispatch
tracing (high level pre-dispatch IR).  Currently, there is no dedicated pass
for this config.

In this commit, for better pass function management, we created
`predispatch_pass` to hold the pass functions to be run on the high level
pre-dispatch IR-based graphs.

Test Plan: CI

Differential Revision: D52491332

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116788
Approved by: https://github.com/frank-wei
2024-01-09 22:42:24 +00:00
39ae4d8cd7 Revert "[inductor] Add support for tl.make_block_ptr (#116079)"
This reverts commit d527df707acce59bd432763c94399aa7b3fe38cf.

Reverted https://github.com/pytorch/pytorch/pull/116079 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the new test is failing on ROCm ([comment](https://github.com/pytorch/pytorch/pull/116079#issuecomment-1883890254))
2024-01-09 22:19:57 +00:00
848cfe8d45 [reland] unflatten_tensor on compute stream for DTensorExtension (#117020)
reland of https://github.com/pytorch/pytorch/pull/116559, which was reverted by internal.

The underlying reason for the revert is that torch._dynamo.disable can't be used inside the
PyTorch codebase, as it conflicts with torch.deploy; although the latter only runs some
inference, it somehow takes a weird dependency on FSDP.

We have seen this issue with our functional collectives: we can't
use any dynamo components, otherwise torch.deploy complains.

Verified internally that after removing torch._dynamo.disable the test
passes again.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117020
Approved by: https://github.com/awgu
2024-01-09 21:25:15 +00:00
1dd4813328 [BE][dynamo]: Add operator is and is not tests to dynamo tests (#116397)
Adds tests for operators (`is` and `is not`) that were not unit-tested in our test suite, improving coverage. Inspired by looking into https://github.com/pytorch/pytorch/pull/116397 after @XuehaiPan brought up some issues with builtins in #116389

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116397
Approved by: https://github.com/albanD, https://github.com/jansel
2024-01-09 21:13:22 +00:00
5866284d4a Make not passing use_reentrant back to warning instead of erroring and clarify docs (#116710)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116710
Approved by: https://github.com/albanD
ghstack dependencies: #116523
2024-01-09 20:58:49 +00:00
4e666ba011 Update torch.autograd.graph logging to not print out grad_output (#116523)
Instead of printing the tensor's data print the dtype and shape metadata of the tensor.
```
Executing: <VarMeanBackward0 object at 0x1352d0e20> with grad_outputs: [None,f32[]]
```
This is important in order to avoid doing a cuda sync and also useful to reduce verbosity.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116523
Approved by: https://github.com/albanD
2024-01-09 20:40:02 +00:00
29ae4f22bf Enables private_use_one lazy_init by PrivateUse1HooksInterface (#115067)
Fixes https://github.com/pytorch/pytorch/issues/112369

In my last PR, https://github.com/pytorch/pytorch/pull/113343, I wanted to implement lazy_init for other devices through `REGISTER_LAZY_INIT`, but this might be too big of a change.

Recently, my team found that `torch.load` without `lazy_init` also results in the same error.
bbd5b935e4/torch/csrc/Storage.cpp (L319-L321)
bbd5b935e4/torch/csrc/Storage.cpp (L334-L335)

So, I want to use `PrivateUse1HooksInterface` to implement lazy_init for `PrivateUse1`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115067
Approved by: https://github.com/ezyang
2024-01-09 20:12:08 +00:00
ab1ac43752 [pytree] extend pytree operations with is_leaf prediction function (#116419)
Add an extra `is_leaf` predicate function to pytree operations.
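A minimal sketch of what such a predicate enables, assuming the private `torch.utils._pytree` API; the predicate tells the flattening routines to stop descending into a subtree:

```python
import torch.utils._pytree as pytree

tree = {"a": [1, 2], "b": {"c": 3}}

# Default behaviour: descend into every container, so only scalars are leaves.
leaves, _ = pytree.tree_flatten(tree)
print(leaves)  # [1, 2, 3]

# With is_leaf, treat the inner dict as an opaque leaf and keep it intact.
leaves, _ = pytree.tree_flatten(tree, is_leaf=lambda x: isinstance(x, dict) and "c" in x)
print(leaves)  # [1, 2, {'c': 3}]
```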

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116419
Approved by: https://github.com/zou3519
2024-01-09 19:50:08 +00:00
suo
902807a86d enable pytree tests in fbcode (#116787)
these were not runnable before

Differential Revision: [D52547846](https://our.internmc.facebook.com/intern/diff/D52547846/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116787
Approved by: https://github.com/zou3519
2024-01-09 19:12:43 +00:00
b4eb97a072 Revert "[C10D] Add GIL checker to NCCL watchdog monitor (#116798)"
This reverts commit 830ace33bcc0291e5c615ad1727799b1d04067cd.

Reverted https://github.com/pytorch/pytorch/pull/116798 on behalf of https://github.com/osalpekar due to This seems to crash torchrec inference unittests: [D52583939](https://www.internalfb.com/diff/D52583939) ([comment](https://github.com/pytorch/pytorch/pull/116798#issuecomment-1883624022))
2024-01-09 19:09:02 +00:00
b8374314cc [AOTI] Update AOTI runner util (#116971)
Summary: Update the runner used in integration tests after https://github.com/pytorch/torchrec/pull/1604

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116971
Approved by: https://github.com/chenyang78
2024-01-09 19:07:54 +00:00
d527df707a [inductor] Add support for tl.make_block_ptr (#116079)
On A100 this is a small regression:
![image](https://github.com/pytorch/pytorch/assets/533820/b30eee9d-c0fe-4123-99da-d554fc5d0171)

So I will leave it disabled by default.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116079
Approved by: https://github.com/shunting314
ghstack dependencies: #116078
2024-01-09 19:06:51 +00:00
94363cee41 [inductor] Indexing refactors (#116078)
Perf differences seem to be noise:
![image](https://github.com/pytorch/pytorch/assets/533820/d7a36574-0388-46e4-bd4d-b274d37cab2b)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116078
Approved by: https://github.com/aakhundov
2024-01-09 19:06:51 +00:00
84b04e42a1 [ROCm] Enable aot_inductor tests (#116713)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116713
Approved by: https://github.com/jithunnair-amd, https://github.com/desertfire
2024-01-09 19:05:44 +00:00
ad22bd2fa1 [export][refactor][6/n] Remove equality_constraints (#116979)
Through the new dynamic_shapes API and using torch.export.Dim, dimensions that are equal will now be represented by the same symbol, so we no longer need to store `equality_constraints`.
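A small sketch of the `torch.export.Dim` pattern referred to above: reusing one `Dim` object for both inputs makes the two dimensions share a single symbol, so no separate equality constraint is needed.

```python
import torch
from torch.export import Dim, export

class Add(torch.nn.Module):
    def forward(self, x, y):
        return x + y

batch = Dim("batch")
ep = export(
    Add(),
    (torch.randn(4, 3), torch.randn(4, 3)),
    # The same Dim object on both inputs => one shared symbol for dim 0.
    dynamic_shapes={"x": {0: batch}, "y": {0: batch}},
)
print(ep)
```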

Differential Revision: D52351705

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116979
Approved by: https://github.com/avikchaudhuri
2024-01-09 19:04:47 +00:00
bdeaaad70c [CPU] _vec_log_softmax_lastdim: fix CHUNK_SIZE to avoid unnecessarily large allocation (#116990)
Given input shape of `[outer_size, dim_size]`, `_vec_log_softmax_lastdim` sets `CHUNK_SIZE` as
```cpp
int64_t CHUNK_SIZE = std::max<int64_t>(
      1,
      at::internal::GRAIN_SIZE / (sizeof(scalar_t) * dim_size));
```
where `at::internal::GRAIN_SIZE / (sizeof(scalar_t) * dim_size)` computes the maximum number of rows that can fit into the L1d cache size (`GRAIN_SIZE`).

Fix `CHUNK_SIZE` to be the minimum of `CHUNK_SIZE` and `outer_size`, avoiding an unnecessarily large `CHUNK_SIZE` and unnecessarily large allocations for the `max` and `tmp_sum` buffers.
```cpp
auto tmp_sum_scalar = std::make_unique<scalar_t[]>(CHUNK_SIZE);
auto max_input_arr = std::make_unique<scalar_t[]>(CHUNK_SIZE);
```

### Performance

Perf data collected for `dim_size` in range [2^0, 2^9] and `outer_size` in range [2^0, 2^3]. To measure the benefit from avoiding unnecessarily large allocation, values of `outer_size` were chosen such that `outer_size` is less than `at::internal::GRAIN_SIZE / (sizeof(scalar_t) * dim_size)` for all values of `dim_size`.

Tested on 28 physical cores/socket, 1 socket on Skylake.

| **dim_size** 	| **at::internal::GRAIN_SIZE / (sizeof(scalar_t)   * dim_size)** 	| **input shape: (outer_size, dim_size)** 	| **Baseline (original implementation)** 	| **Optimized** 	| **Speedup Ratio (Baseline/Optimized)** 	|
|--------------	|----------------------------------------------------------------	|-----------------------------------------	|----------------------------------------	|---------------	|----------------------------------------	|
| 1            	| 8192                                                           	| (1, 1)                                  	| 0.006070137                            	| 0.003378391   	| **1.796754**                           	|
|              	|                                                                	| (2, 1)                                  	| 0.006327629                            	| 0.00361681    	| **1.749506**                           	|
|              	|                                                                	| (4, 1)                                  	| 0.006246567                            	| 0.00379324    	| **1.646763**                           	|
|              	|                                                                	| (8, 1)                                  	| 0.006320477                            	| 0.003941059   	| **1.603751**                           	|
| 2            	| 4096                                                           	| (1, 2)                                  	| 0.004889965                            	| 0.003342628   	| **1.46291**                            	|
|              	|                                                                	| (2, 2)                                  	| 0.005021095                            	| 0.003380775   	| **1.48519**                            	|
|              	|                                                                	| (4, 2)                                  	| 0.004897118                            	| 0.003535748   	| **1.38503**                            	|
|              	|                                                                	| (8, 2)                                  	| 0.005195141                            	| 0.003790855   	| **1.37044**                            	|
| 4            	| 2048                                                           	| (1, 4)                                  	| 0.004477501                            	| 0.003364086   	| **1.330971**                           	|
|              	|                                                                	| (2, 4)                                  	| 0.004198551                            	| 0.003452301   	| **1.21616**                            	|
|              	|                                                                	| (4, 4)                                  	| 0.004312992                            	| 0.003650188   	| **1.181581**                           	|
|              	|                                                                	| (8, 4)                                  	| 0.004432201                            	| 0.00399828    	| **1.108527**                           	|
| 8            	| 1024                                                           	| (1, 8)                                  	| 0.004155636                            	| 0.0035429     	| **1.172948**                           	|
|              	|                                                                	| (2, 8)                                  	| 0.003905296                            	| 0.003569126   	| **1.094188**                           	|
|              	|                                                                	| (4, 8)                                  	| 0.004405975                            	| 0.003864765   	| **1.140037**                           	|
|              	|                                                                	| (8, 8)                                  	| 0.004785061                            	| 0.004456043   	| **1.073836**                           	|
| 16           	| 512                                                            	| (1, 16)                                 	| 0.003867149                            	| 0.003504753   	| **1.103401**                           	|
|              	|                                                                	| (2, 16)                                 	| 0.003743172                            	| 0.003340244   	| **1.120628**                           	|
|              	|                                                                	| (4, 16)                                 	| 0.003614426                            	| 0.003519058   	| 1.0271                                 	|
|              	|                                                                	| (8, 16)                                 	| 0.00395298                             	| 0.003488064   	| **1.133288**                           	|
| 32           	| 256                                                            	| (1, 32)                                 	| 0.003900528                            	| 0.003421307   	| **1.14007**                            	|
|              	|                                                                	| (2, 32)                                 	| 0.003569126                            	| 0.003511906   	| 1.016293                               	|
|              	|                                                                	| (4, 32)                                 	| 0.003736019                            	| 0.003590584   	| 1.040505                               	|
|              	|                                                                	| (8, 32)                                 	| 0.003845692                            	| 0.003662109   	| **1.05013**                            	|
| 64           	| 128                                                            	| (1, 64)                                 	| 0.003652573                            	| 0.003437996   	| **1.062413**                           	|
|              	|                                                                	| (2, 64)                                 	| 0.003700256                            	| 0.003516674   	| **1.052203**                           	|
|              	|                                                                	| (4, 64)                                 	| 0.003783703                            	| 0.003638268   	| 1.039974                               	|
|              	|                                                                	| (8, 64)                                 	| 0.003993511                            	| 0.003809929   	| 1.048185                               	|
| 128          	| 64                                                             	| (1, 128)                                	| 0.003848076                            	| 0.003600121   	| **1.068874**                           	|
|              	|                                                                	| (2, 128)                                	| 0.003979206                            	| 0.003826618   	| 1.039875                               	|
|              	|                                                                	| (4, 128)                                	| 0.004360676                            	| 0.004224777   	| 1.032167                               	|
|              	|                                                                	| (8, 128)                                	| 0.005149841                            	| 0.004999638   	| 1.030043                               	|
| 256          	| 32                                                             	| (1, 256)                                	| 0.003943443                            	| 0.003738403   	| **1.054847**                           	|
|              	|                                                                	| (2, 256)                                	| 0.00420332                             	| 0.00408411    	| 1.029189                               	|
|              	|                                                                	| (4, 256)                                	| 0.004820824                            	| 0.00474453    	| 1.01608                                	|
|              	|                                                                	| (8, 256)                                	| 0.006194115                            	| 0.006067753   	| 1.020825                               	|
| 512          	| 16                                                             	| (1, 512)                                	| 0.004277229                            	| 0.004253387   	| 1.005605                               	|
|              	|                                                                	| (2, 512)                                	| 0.004863739                            	| 0.004782677   	| 1.016949                               	|
|              	|                                                                	| (4, 512)                                	| 0.006172657                            	| 0.00607729    	| 1.015692                               	|
|              	|                                                                	| (8, 512)                                	| 0.011193752                            	| 0.010819435   	| 1.034597                               	|

Bolded speedup ratios indicate a speedup greater than 5%, which we treat as significant. We observe significant speedups especially for smaller `dim_size` (1, 2, 4, 8, 16): the smaller the `dim_size`, the larger `at::internal::GRAIN_SIZE / (sizeof(scalar_t) * dim_size)` becomes, and hence the larger the unnecessary allocation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116990
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-01-09 18:43:02 +00:00
75968e2f94 Optimize operator (#117017)
As the title states.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117017
Approved by: https://github.com/Skylion007
2024-01-09 18:37:22 +00:00
0dd5deeced Bring docstring to .pyi file (#114705)
Fixes #37762

Since the original issue hasn't made progress in more than 3 years, I am attempting this PR to at least move things forward.

This PR attempts to add docstrings to the `.pyi` files. The docstrings are read from [`_torch_docs`](https://github.com/pytorch/pytorch/blob/main/torch/_torch_docs.py) by mocking [`_add_docstr`](9f073ae304/torch/csrc/Module.cpp (L329)), which is the only function used to add docstrings.

Luckily, `_torch_docs` has no dependencies on other components of PyTorch, and can be imported with `_add_docstr` mocked, without compiling `torch._C`.

The generated `.pyi` file looks something like the following:

[_VariableFunctions.pyi.txt](https://github.com/pytorch/pytorch/files/13494263/_VariableFunctions.pyi.txt)

<img width="787" alt="image" src="https://github.com/pytorch/pytorch/assets/6421097/73c2e884-f06b-4529-8301-0ca0b9de173c">

And the docstring can be picked up by VSCode:

<img width="839" alt="image" src="https://github.com/pytorch/pytorch/assets/6421097/1999dc89-a591-4c7a-80ac-aa3456672af4">

<img width="908" alt="image" src="https://github.com/pytorch/pytorch/assets/6421097/ecf3fa92-9822-4a3d-9263-d224d87ac288">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114705
Approved by: https://github.com/albanD
2024-01-09 18:37:16 +00:00
cfd0728b24 Feature: cudnn convolution out (#116759)
Fixes #115611

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116759
Approved by: https://github.com/albanD
2024-01-09 17:51:29 +00:00
0ef1266bc6 [BE] Fix CUDA build warnings (#117023)
After https://github.com/pytorch/pytorch/pull/116595/files compiling every .cu file results in
```
/home/nshulga/git/pytorch/pytorch/c10/util/Half.h(450): warning #173-D: floating-point value does not fit in required integral type
           -static_cast<uint64_t>(f) > static_cast<uint64_t>(limit::max()));
                                                   ^
          detected during:
            instantiation of "std::enable_if_t<<expression>, __nv_bool> c10::overflows<To,From>(From, __nv_bool) [with To=float, From=int64_t]" at line 159 of /home/nshulga/git/pytorch/pytorch/c10/util/TypeCast.h
            instantiation of "To c10::checked_convert<To,From>(From, const char *) [with To=float, From=int64_t]" at line 122 of /home/nshulga/git/pytorch/pytorch/c10/core/Scalar.h

Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"

/home/nshulga/git/pytorch/pytorch/c10/util/Half.h(450): warning #173-D: floating-point value does not fit in required integral type
           -static_cast<uint64_t>(f) > static_cast<uint64_t>(limit::max()));
                                                   ^
          detected during:
            instantiation of "std::enable_if_t<<expression>, __nv_bool> c10::overflows<To,From>(From, __nv_bool) [with To=float, From=uint64_t]" at line 159 of /home/nshulga/git/pytorch/pytorch/c10/util/TypeCast.h
            instantiation of "To c10::checked_convert<To,From>(From, const char *) [with To=float, From=uint64_t]" at line 122 of /home/nshulga/git/pytorch/pytorch/c10/core/Scalar.h

/home/nshulga/git/pytorch/pytorch/c10/util/Half.h(450): warning #173-D: floating-point value does not fit in required integral type
           -static_cast<uint64_t>(f) > static_cast<uint64_t>(limit::max()));
                                                   ^
          detected during:
            instantiation of "std::enable_if_t<<expression>, __nv_bool> c10::overflows<To,From>(From, __nv_bool) [with To=double, From=int64_t]" at line 159 of /home/nshulga/git/pytorch/pytorch/c10/util/TypeCast.h
            instantiation of "To c10::checked_convert<To,From>(From, const char *) [with To=double, From=int64_t]" at line 122 of /home/nshulga/git/pytorch/pytorch/c10/core/Scalar.h

/home/nshulga/git/pytorch/pytorch/c10/util/Half.h(450): warning #173-D: floating-point value does not fit in required integral type
           -static_cast<uint64_t>(f) > static_cast<uint64_t>(limit::max()));
                                                   ^
          detected during:
            instantiation of "std::enable_if_t<<expression>, __nv_bool> c10::overflows<To,From>(From, __nv_bool) [with To=double, From=uint64_t]" at line 159 of /home/nshulga/git/pytorch/pytorch/c10/util/TypeCast.h
            instantiation of "To c10::checked_convert<To,From>(From, const char *) [with To=double, From=uint64_t]" at line 122 of /home/nshulga/git/pytorch/pytorch/c10/core/Scalar.h

/home/nshulga/git/pytorch/pytorch/c10/util/Half.h(450): warning #173-D: floating-point value does not fit in required integral type
           -static_cast<uint64_t>(f) > static_cast<uint64_t>(limit::max()));
                                                   ^
          detected during:
            instantiation of "std::enable_if_t<<expression>, __nv_bool> c10::overflows<To,From>(From, __nv_bool) [with To=c10::complex<float>, From=int64_t]" at line 159 of /home/nshulga/git/pytorch/pytorch/c10/util/TypeCast.h
            instantiation of "To c10::checked_convert<To,From>(From, const char *) [with To=c10::complex<float>, From=int64_t]" at line 122 of /home/nshulga/git/pytorch/pytorch/c10/core/Scalar.h

/home/nshulga/git/pytorch/pytorch/c10/util/Half.h(450): warning #173-D: floating-point value does not fit in required integral type
           -static_cast<uint64_t>(f) > static_cast<uint64_t>(limit::max()));
                                                   ^
          detected during:
            instantiation of "std::enable_if_t<<expression>, __nv_bool> c10::overflows<To,From>(From, __nv_bool) [with To=c10::complex<float>, From=uint64_t]" at line 159 of /home/nshulga/git/pytorch/pytorch/c10/util/TypeCast.h
            instantiation of "To c10::checked_convert<To,From>(From, const char *) [with To=c10::complex<float>, From=uint64_t]" at line 122 of /home/nshulga/git/pytorch/pytorch/c10/core/Scalar.h

/home/nshulga/git/pytorch/pytorch/c10/util/Half.h(450): warning #173-D: floating-point value does not fit in required integral type
           -static_cast<uint64_t>(f) > static_cast<uint64_t>(limit::max()));
                                                   ^
          detected during:
            instantiation of "std::enable_if_t<<expression>, __nv_bool> c10::overflows<To,From>(From, __nv_bool) [with To=c10::complex<double>, From=int64_t]" at line 159 of /home/nshulga/git/pytorch/pytorch/c10/util/TypeCast.h
            instantiation of "To c10::checked_convert<To,From>(From, const char *) [with To=c10::complex<double>, From=int64_t]" at line 122 of /home/nshulga/git/pytorch/pytorch/c10/core/Scalar.h

/home/nshulga/git/pytorch/pytorch/c10/util/Half.h(450): warning #173-D: floating-point value does not fit in required integral type
           -static_cast<uint64_t>(f) > static_cast<uint64_t>(limit::max()));
                                                   ^
          detected during:
            instantiation of "std::enable_if_t<<expression>, __nv_bool> c10::overflows<To,From>(From, __nv_bool) [with To=c10::complex<double>, From=uint64_t]" at line 159 of /home/nshulga/git/pytorch/pytorch/c10/util/TypeCast.h
            instantiation of "To c10::checked_convert<To,From>(From, const char *) [with To=c10::complex<double>, From=uint64_t]" at line 122 of /home/nshulga/git/pytorch/pytorch/c10/core/Scalar.h
```
Fix it by using `if constexpr` to avoid calling `static_cast<uint64_t>` for any floating-point type.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117023
Approved by: https://github.com/albanD
2024-01-09 17:40:10 +00:00
b6962208b8 [CI] Add initial ci test workflow for XPU based on IDC runners (#116554)
Add initial CI test for XPU based on IDC self-hosted runners with label `linux.idc.xpu`, which will be triggered by label `ciflow/xpu` for current stage.

Works for RFC https://github.com/pytorch/pytorch/issues/114850

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116554
Approved by: https://github.com/EikanWang, https://github.com/atalman
2024-01-09 17:00:35 +00:00
6784030df4 [MPS] Add support for 64-bit index operations (#116942)
But enable it only if `iter.can_use_32bit_indexing()` is False. Add a test for index_select, but enable it only on Sonoma, as all attempts to create a 4 GB+ tensor on Ventura and older fail.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116942
Approved by: https://github.com/Skylion007, https://github.com/kulinseth
ghstack dependencies: #116903, #116904, #116915, #116940
2024-01-09 16:56:49 +00:00
81b7a09d27 [CI] Test that cuInit is not called during import (#117010)
By making a driver API call in subprocess and expecting it to return `CUDA_ERROR_NOT_INITIALIZED`

Test Plan: run it on nightlies from before https://github.com/pytorch/pytorch/pull/116201 got reverted and observe the failure

This is very important for lots of distributed launchers

Fixes https://github.com/pytorch/pytorch/issues/116276
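A hedged sketch of that kind of check (not the actual test added here): in a fresh process, import torch, then poke the driver directly via ctypes; if importing torch did not call `cuInit`, the driver reports `CUDA_ERROR_NOT_INITIALIZED` (error code 3).

```python
import ctypes
import torch  # importing torch must NOT initialize the CUDA driver

libcuda = ctypes.CDLL("libcuda.so.1")  # "nvcuda.dll" on Windows
count = ctypes.c_int(0)
rc = libcuda.cuDeviceGetCount(ctypes.byref(count))

CUDA_ERROR_NOT_INITIALIZED = 3
assert rc == CUDA_ERROR_NOT_INITIALIZED, f"cuInit was called during torch import (rc={rc})"
```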

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117010
Approved by: https://github.com/albanD
2024-01-09 14:44:22 +00:00
db79ceb110 [ROCm] Enabling additional UTs on ROCm (#115738)
Unskips mostly for dynamo/inductor UT.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115738
Approved by: https://github.com/jithunnair-amd, https://github.com/malfet
2024-01-09 08:36:07 +00:00
f0bbc2fcf5 [AOTInductor] Small refactor so both Meta internal and OSS can deal with misplaced args and kwargs for Extern Fallback kernels (#116779)
Summary:
In torch/_inductor/lowering.py (https://fburl.com/code/jd58vxpw), we are using
```
fallback_cumsum(x, dim=axis, dtype=dtype)
```
so this will treat `x` as args, `dim` and `dtype` as kwargs from https://fburl.com/code/cikchxp9

The issue has been fixed by D52530506 for OSS but not for Meta internal. This diff addresses the Meta internal issue with some refactoring so that both Meta internal and OSS can use the same helper function. The diff also adds some debug logging.

Test Plan:
before
```
aoti_torch_proxy_executor_call_function(proxy_executor, 2, 1, std::vector<int64_t>{torch.int64}.data(), 2, std::vector<AtenTensorHandle>{buf702, buf708}.data());
```
after
```
aoti_torch_proxy_executor_call_function(proxy_executor, 2, 1, std::vector<int64_t>{0}.data(), 2, std::vector<AtenTensorHandle>{buf702, buf708}.data());
```
so `torch.int64` changed to `0`

Differential Revision: D52532031

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116779
Approved by: https://github.com/desertfire, https://github.com/chenyang78
2024-01-09 07:57:46 +00:00
6e2f879d7f [ROCm] hipify mapping for cudaDevAttrMaxSharedMemoryPerBlockOptin (#116984)
Summary: Map `cudaDevAttrMaxSharedMemoryPerBlockOptin` to `hipDeviceAttributeMaxSharedMemoryPerBlock` to make it work for AMD GPUs.

Test Plan: CI

Differential Revision: D52558076

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116984
Approved by: https://github.com/jeffdaily
2024-01-09 07:38:20 +00:00
d78776e2e6 Stop unconditionally applying hermetic mode (#116996)
When originally authored, it was not necessary to unconditionally apply
hermetic mode, but I chose to apply it in eager mode to help catch bugs.
Well, multipy is kind of dead, and hermetic mode is causing real
implementation problems for people who want to do fancy Python stuff
from the dispatcher.  So let's yank this mode for now.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116996
Approved by: https://github.com/jansel
2024-01-09 05:55:08 +00:00
6cf1fc66e3 [cuda][easy] cosmetic and small syntax changes to layer_norm_kernel.cu (#116920)
Used `auto` and `const` where needed; replaced a CUDA specific `__syncwarp` with device agnostic `WARP_SYNC`; added more comments.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116920
Approved by: https://github.com/malfet
2024-01-09 04:44:57 +00:00
104a23e4f5 [cpu][vec512] improve int load/store/with mask (#116964)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116964
Approved by: https://github.com/leslie-fang-intel
ghstack dependencies: #116961, #116962, #116963
2024-01-09 04:37:44 +00:00
4e54a70451 [cpu][vec512] improve double load/store with mask (#116963)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116963
Approved by: https://github.com/leslie-fang-intel
ghstack dependencies: #116961, #116962
2024-01-09 04:37:44 +00:00
428807f9bc [cpu][vec512] improve fp32 load/store with mask (#116962)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116962
Approved by: https://github.com/leslie-fang-intel
ghstack dependencies: #116961
2024-01-09 04:32:22 +00:00
a0bd7dfec1 [cpu][vec512] improve bf16/fp16 load/store with mask for inductor (#116961)
Improve perf of vec512 bfloat16 (and also float16) load and store with partial vector lanes by using masked load/store instead of going through `memcpy` with an aux buffer. In the inductor CPU backend, we load/store half (16) of the vector lanes for bfloat16 and float16.

Using the following micro-benchmark script for `layernorm + add`:
```python
import torch
import torch.nn as nn
from benchmark_helper import time_with_torch_timer

class AddLayernorm(nn.Module):
    def __init__(self, hidden_size):
        super().__init__()
        self.ln = nn.LayerNorm(hidden_size)

    def forward(self, hidden_states):
        return hidden_states + self.ln(hidden_states)

hidden_states = torch.randn(1, 512, 1024).to(torch.bfloat16)
add_ln = AddLayernorm(hidden_states.size(-1))  # instantiation assumed; missing from the original snippet

with torch.no_grad():
    compiled_add_ln = torch.compile(add_ln)
    print(time_with_torch_timer(compiled_add_ln, hidden_states, iters=10000))
```

Measured on single-core `Intel(R) Xeon(R) Platinum 8358 CPU`.
Before: 1.39 ms
After: 498.66 us

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116961
Approved by: https://github.com/sanchitintel, https://github.com/leslie-fang-intel
2024-01-09 04:18:33 +00:00
bac0de160c [ROCm] Add minimal inductor test to rocm-test workflow (#115425)
Adds the `inductor/test_torchinductor` to tests-to-include so we can have some PR-level test coverage for inductor tests on ROCm. This should help catch issues before merging (e.g. https://github.com/pytorch/pytorch/pull/114772)

This unit test takes ~6 minutes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115425
Approved by: https://github.com/jithunnair-amd, https://github.com/huydhn, https://github.com/malfet
2024-01-09 03:54:25 +00:00
4c0d63180a Support NNModules as dict keys (#116723)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116723
Approved by: https://github.com/lezcano
2024-01-09 03:32:47 +00:00
92cf7ba36b [vision hash update] update the pinned vision hash (#117002)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117002
Approved by: https://github.com/pytorchbot
2024-01-09 03:21:43 +00:00
ff0a3f35a4 [audio hash update] update the pinned audio hash (#116954)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116954
Approved by: https://github.com/pytorchbot
2024-01-09 03:16:00 +00:00
14be2ee271 Inductor qlinear int8_bf16 with bmm (#116604)
**Summary**
Fix issue https://github.com/pytorch/pytorch/issues/116492: `linear` will be decomposed into `bmm` when the input dim exceeds 2 and the input is not contiguous. Fix this by converting the pattern back into `qlinear`. This PR focuses on the int8_bf16 case, following https://github.com/pytorch/pytorch/pull/116599.

**Test Plan**
```
python -u -m pytest -s -v test_mkldnn_pattern_matcher.py -k test_qlinear_int8_mixed_bf16_input_dim_exceeds_2_and_not_contiguous
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116604
Approved by: https://github.com/jgong5
ghstack dependencies: #116937, #116599
2024-01-09 01:36:27 +00:00
153b3a0996 Inductor qlinear int8_fp32 with bmm (#116599)
**Summary**
Fix issue https://github.com/pytorch/pytorch/issues/116492: `linear` will be decomposed into `bmm` when the input dim exceeds 2 and the input is not contiguous. Fix this by converting the pattern back into `qlinear`. This PR focuses on the int8_fp32 case; a follow-up PR will handle the int8_bf16 case.
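For context, a sketch (not taken from this PR) of the kind of input that triggers the decomposition described above: a 3-D, non-contiguous activation fed to a linear layer.

```python
import torch
import torch.nn as nn

lin = nn.Linear(16, 8)
x = torch.randn(4, 2, 16).transpose(0, 1)  # 3-D and non-contiguous
assert x.dim() > 2 and not x.is_contiguous()
y = lin(x)  # this is the case that gets decomposed into bmm as described above
```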

**Test Plan**
```
python -u -m pytest -s -v test_mkldnn_pattern_matcher.py -k test_qlinear_input_dim_exceeds_2_and_not_contiguous
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116599
Approved by: https://github.com/jgong5
ghstack dependencies: #116937
2024-01-09 01:33:46 +00:00
6ca31ae1d3 [CI] Add inductor workflow for rocm (#110544)
This PR creates a separate CI job for inductor UTs on ROCm. You will need to add the `ciflow/inductor` tag on PRs to trigger this job; however, the job will run on its own for any commit merged into main. This job takes around 1.5 hours and runs in parallel to other ROCm jobs. It runs only on the MI210 CI runners to ensure maximum inductor functionality is tested.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110544
Approved by: https://github.com/jithunnair-amd, https://github.com/jansel, https://github.com/huydhn
2024-01-09 01:32:15 +00:00
227579d6a0 [Inductor] [Quant] Add remaining user check for qconv binary fusion (#115809)
**Summary**
Similar to https://github.com/pytorch/pytorch/pull/115153: when we do the `qconv_binary` fusion with post-op sum, we also need to ensure that all users of the extra input in this pattern are ancestor nodes of the compute node, except for the binary node connected to the compute node.

Also, this diff renames some variables:

- Change name of `qconv2d_node_after_weight_prepack` to `compute_node`
- Change name of `extra_input_node` to `extra_input_of_binary_node`

**Test Plan**
```
python -u -m pytest -s -v test_mkldnn_pattern_matcher.py -k test_qconv2d_add_3
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115809
Approved by: https://github.com/jgong5
ghstack dependencies: #115153
2024-01-09 01:26:50 +00:00
33d90cfd16 Allow for [-oo, oo] ranges for bools (#114362)
This fixes a problem in Seamless M4T in fairseq2; repro
instructions are at https://docs.google.com/document/d/1PVy4KibfljirQDoijOwyHCV97B67r_iElWqFh7h1Acc/edit

I tried extracting a minimal repro but I couldn't actually manage it!

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114362
Approved by: https://github.com/Skylion007
2024-01-09 01:08:34 +00:00
f26ed0a71d [dynamo] Move graph breaks in for/while->skip after logging (#116981)
We were losing critical graph break info if the graph break came from a for or while loop.

Given:

```
def foo(x, y):
    z = x * y
    for i in range(10):
        z = z * y
        print(z)
    return z

a = torch.randn([2, 2])
b = torch.randn([2, 2])

foo = torch._dynamo.optimize('eager')(foo)

foo(a, b)
```

Before:

```
$ TORCH_LOGS=+graph_breaks python x.py
tensor([[-0.1046, -0.1597],
        [-0.0006, -0.1327]])
tensor([[-4.2091e-02,  6.3045e-02],
        [-1.6759e-05,  4.0366e-02]])
tensor([[-1.6929e-02, -2.4892e-02],
        [-4.8690e-07, -1.2281e-02]])
tensor([[-6.8091e-03,  9.8278e-03],
        [-1.4146e-08,  3.7363e-03]])
tensor([[-2.7387e-03, -3.8803e-03],
        [-4.1097e-10, -1.1367e-03]])
tensor([[-1.1015e-03,  1.5320e-03],
        [-1.1940e-11,  3.4584e-04]])
tensor([[-4.4304e-04, -6.0488e-04],
        [-3.4688e-13, -1.0522e-04]])
tensor([[-1.7820e-04,  2.3882e-04],
        [-1.0078e-14,  3.2012e-05]])
tensor([[-7.1672e-05, -9.4293e-05],
        [-2.9279e-16, -9.7392e-06]])
tensor([[-2.8827e-05,  3.7229e-05],
        [-8.5063e-18,  2.9630e-06]])
```
After:

```
$ TORCH_LOGS=+graph_breaks python x.py
[2024-01-08 11:14:49,372] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] Graph break: call_function BuiltinVariable(print) [TensorVariable()] {} from user code at:
[2024-01-08 11:14:49,372] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]   File "/data/users/voz/pytorch/x.py", line 32, in foo
[2024-01-08 11:14:49,372] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]     print(z)
[2024-01-08 11:14:49,372] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]
tensor([[ 0.2065,  0.0766],
        [-2.0600,  1.8425]])
tensor([[-0.0617, -0.0698],
        [-3.5799,  2.2167]])
tensor([[ 0.0184,  0.0636],
        [-6.2212,  2.6669]])
tensor([[-5.5031e-03, -5.7971e-02],
        [-1.0811e+01,  3.2085e+00]])
tensor([[ 1.6437e-03,  5.2837e-02],
        [-1.8788e+01,  3.8601e+00]])
tensor([[-4.9093e-04, -4.8157e-02],
        [-3.2650e+01,  4.6441e+00]])
tensor([[ 1.4663e-04,  4.3891e-02],
        [-5.6741e+01,  5.5872e+00]])
tensor([[-4.3796e-05, -4.0004e-02],
        [-9.8605e+01,  6.7220e+00]])
tensor([[ 1.3081e-05,  3.6461e-02],
        [-1.7136e+02,  8.0871e+00]])
tensor([[-3.9070e-06, -3.3231e-02],
        [-2.9779e+02,  9.7296e+00]])
````

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116981
Approved by: https://github.com/ezyang
2024-01-09 00:39:03 +00:00
e728ebb66d Small docstring fix (#116947)
Fix a small typo in the docstring of checkpoint function.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116947
Approved by: https://github.com/Skylion007, https://github.com/kit1980
2024-01-08 23:51:59 +00:00
28e2e12b2a [quant][be] enable xnnpack_quantizer tests to run in internal CI (#116911)
Summary: fixed an import problem for test_xnnpack_quantizer so that it can run in CI

Test Plan:
internal CI
sanity check: buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- --exact 'caffe2/test/quantization:test_quantization - test_conv2d (caffe2.test.quantization.pt2e.test_xnnpack_quantizer.TestXNNPACKQuantizer)'

Differential Revision: D52576449

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116911
Approved by: https://github.com/mcr229
2024-01-08 23:43:47 +00:00
534c73d478 Fix NaN bug in torch.signal.windows.kaiser (#116470)
Fixes #115595

As an aside, there are currently no tests checking the output of `torch.signal.windows.kaiser` against the output of scipy's implementation, which is what is done with `torch.kaiser_window`. The same goes for the other window functions in `torch.signal.windows`. I did some tests on my end, but I'm not sure what the best practice is, so I haven't included them for now.
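For reference, a hedged sketch of what such a cross-check could look like (assuming SciPy is available; the window lengths, betas and tolerances are arbitrary):

```python
import numpy as np
import torch
from scipy.signal import windows

for M in (8, 65, 128):
    for beta in (0.0, 8.6, 12.0):
        expected = windows.kaiser(M, beta, sym=True)
        actual = torch.signal.windows.kaiser(M, beta=beta, sym=True, dtype=torch.float64)
        np.testing.assert_allclose(actual.numpy(), expected, rtol=1e-6, atol=1e-8)
```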

@gchanan @mruberry
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116470
Approved by: https://github.com/ezyang
2024-01-08 22:24:52 +00:00
d006cae2a8 Update documentation for unsigned int types (#116804)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116804
Approved by: https://github.com/albanD
ghstack dependencies: #116595, #116803
2024-01-08 22:02:10 +00:00
fd0c071969 Add tolist support for unsigned types (#116803)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116803
Approved by: https://github.com/albanD
ghstack dependencies: #116595
2024-01-08 22:02:10 +00:00
f4e35e2c3d Proposed mechanism for handling uint64_t in Scalar (#116595)
Here's the problem: if we support unsigned integer types, and in particular if we support uint64_t, we need a way to represent these integers in Scalar. However, Scalar currently stores all integral values inside int64_t, which is not wide enough to accommodate all possible uint64_t values. So we need to do something to Scalar to support it.

The obvious thing to do is add a uint64_t field to the union and use it in some situations. But when should we use it? The proposal is that we use it if and only if the integer in question is not representable in int64_t. The historical precedent for this is our handling for uint8_t. Because this type is representable inside int64_t, we have historically stored it inside Scalar as an int64_t. In general, the concept behind Scalar is that it doesn't know the signedness/unsignedness/bitwidth of its input; in particular, we typically construct Scalar from Python int, which doesn't have any concept of how wide the integer is! So it doesn't make any sense to allow for a small integer like 255 to be representable under both the HAS_i tag and the HAS_u tag. So we forbid the latter case.

Although I have proposed this, the PR as currently written just chokes when you pass it a uint64_t that's too big. There's some more logic that would have to be written out for this. I'm putting this out to start to get some agreement that this is the way to do it.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116595
Approved by: https://github.com/albanD
2024-01-08 22:02:03 +00:00
7073dc604e Merge merging rules of CPU inductor and x86 CPU quantization (#116937)
**Summary**
Following the discussion at https://github.com/pytorch/pytorch/pull/116599#issuecomment-1878757581, due to the limitation of the current merging rules that prevent cross-checking all files among different merge groups, it is proposed to merge the groups `x86 CPU quantization` and `CPU inductor` since they are closely related.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116937
Approved by: https://github.com/jgong5, https://github.com/atalman
2024-01-08 15:32:03 +00:00
a2d73e21d1 follow up #115078, broken distributed tests (#116217)
ROCm distributed tests started failing after #115078.  This skips the new tests if the number of GPUs available isn't sufficient.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116217
Approved by: https://github.com/malfet, https://github.com/huydhn
2024-01-08 15:26:54 +00:00
cyy
ad507789d1 [Reland] [11/N] Enable clang-tidy warnings on c10/util/*.h (#116751)
Reland of #116353 with C++ diagnostic macros restored.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116751
Approved by: https://github.com/albanD
2024-01-08 11:07:58 +00:00
e780213340 [xla hash update] update the pinned xla hash (#116958)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116958
Approved by: https://github.com/pytorchbot
2024-01-08 11:00:59 +00:00
6173386fc4 [MPS][BE] Remove unused nOffsets parameter (#116940)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116940
Approved by: https://github.com/Skylion007
ghstack dependencies: #116903, #116904, #116915
2024-01-08 04:55:35 +00:00
f663935935 [MPS] Fix boundary checks in generateKernelOffsets (#116915)
`TORCH_CHECK(i < UINT32_MAX)` never fails (the loop index is always below `UINT32_MAX`), so the check is vacuous; it should be `TORCH_CHECK(iterShape[i] < UINT32_MAX)`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116915
Approved by: https://github.com/Skylion007, https://github.com/kulinseth
ghstack dependencies: #116903, #116904
2024-01-08 04:55:35 +00:00
aa718065b2 [MPS][BE] Refactor common code (#116904)
Into `generateKernelDataOffsets` which was repeated character by character in BinaryKernel, CrossKernel and Indexing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116904
Approved by: https://github.com/Skylion007
ghstack dependencies: #116903
2024-01-08 04:55:35 +00:00
57491d2046 Add bfloat16 + fp16 support to fractional_max_pool for CUDA and CPU (#116950)
Adds bfloat16 to fractional_max_pool. If an op supports fp32 and fp16, it really should support bf16 for the most part. Most but not all ops satisfy this, so I am adding support for the few that do not.
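A minimal usage sketch of the newly covered dtype (shapes and sizes are arbitrary):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 16, 16, dtype=torch.bfloat16)  # add device="cuda" for the CUDA path
out = F.fractional_max_pool2d(x, kernel_size=2, output_size=(8, 8))
print(out.dtype, out.shape)  # torch.bfloat16 torch.Size([1, 3, 8, 8])
```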

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116950
Approved by: https://github.com/lezcano
2024-01-08 03:54:29 +00:00
7d61fa23df Add float16 support to CUDA logaddexp2 (#116948)
float16 is already supported on CPU for this op and on GPU for `logaddexp`, so let's expand support to the base-2 variant as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116948
Approved by: https://github.com/lezcano
2024-01-08 03:37:07 +00:00
2fe90e4d47 [vision hash update] update the pinned vision hash (#116908)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116908
Approved by: https://github.com/pytorchbot
2024-01-08 03:24:41 +00:00
6c32cd05a3 [executorch hash update] update the pinned executorch hash (#116936)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116936
Approved by: https://github.com/pytorchbot
2024-01-08 03:18:18 +00:00
376f036570 Add bfloat16 CUDA support to multinomial (#116951)
Add bfloat16 support to multinomial. Only a few methods in torch support fp32 and fp16 but not bfloat16, so let's go and finish implementing them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116951
Approved by: https://github.com/lezcano
2024-01-08 01:43:16 +00:00
8257b867d8 Add bfloat16 CUDA support to binomial distribution (#116932)
Now all distributions support bfloat16 as input.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116932
Approved by: https://github.com/malfet
2024-01-07 19:50:10 +00:00
4a37f57c69 Add batched sparse CSR/CSC/BSR/BSC to sparse COO conversion support (#116206)
As in the title.

Fixes https://github.com/pytorch/pytorch/issues/104868

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116206
Approved by: https://github.com/amjames, https://github.com/lezcano, https://github.com/cpuhrsch
2024-01-07 19:42:02 +00:00
cyy
4b74bb6c34 [Exception] [2/N] Remove THPUtils_assert (#116772)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116772
Approved by: https://github.com/albanD
2024-01-07 14:21:43 +00:00
3c7f358c91 Update the expected accuracy value for demucs (#116944)
Update the expected value with `python benchmarks/dynamo/ci_expected_accuracy/update_expected.py b847290ddd9c6a5a598c70f8b660ee2b1e71dc95` as this is now failing in trunk after 95041829c8

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116944
Approved by: https://github.com/voznesenskym
2024-01-07 13:34:51 +00:00
de005b14ab [dynamo] fix more broken dict tests (#116943)
Forward fixing after #111196

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116943
Approved by: https://github.com/huydhn
2024-01-07 08:00:16 +00:00
8ddac14a15 Add unsigned integer dtypes to PyTorch (#116594)
The dtypes are largely unusable right now (not even fill works), but this makes torch.uint16, uint32 and uint64 available as dtypes.

Towards https://github.com/pytorch/pytorch/issues/58734
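As a sketch of what is exposed at this point (per the note above most kernels are still unimplemented, so only allocation and dtype metadata are assumed to work here):

```python
import torch

for dt in (torch.uint16, torch.uint32, torch.uint64):
    t = torch.empty(4, dtype=dt)          # allocation only; most ops not yet supported
    print(dt, t.dtype, t.element_size())  # 2, 4 and 8 bytes respectively
```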

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116594
Approved by: https://github.com/albanD
ghstack dependencies: #116698, #116693
2024-01-07 07:40:49 +00:00
8e273e23b5 Refactor promoteType to no longer use shifting strategy (#116693)
Instead of manually fixing the indices (extremely error-prone when new
dtypes are added), we just set up a lookup table to map ScalarType to the
offsets table.
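A hypothetical Python sketch of the idea (the real code is C++ and covers all ScalarTypes; the names and the dtype subset below are made up for illustration):

```python
# Map each dtype to an explicit offset instead of relying on enum ordering,
# so appending a new dtype cannot silently shift every index.
index_of = {"bool": 0, "uint8": 1, "int8": 2, "float32": 3}  # hypothetical subset

promote_table = [
    # bool       uint8      int8       float32
    ["bool",     "uint8",   "int8",    "float32"],  # bool
    ["uint8",    "uint8",   "int16",   "float32"],  # uint8
    ["int8",     "int16",   "int8",    "float32"],  # int8
    ["float32",  "float32", "float32", "float32"],  # float32
]

def promote(a: str, b: str) -> str:
    return promote_table[index_of[a]][index_of[b]]

assert promote("uint8", "int8") == "int16"
```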

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116693
Approved by: https://github.com/albanD
ghstack dependencies: #116698
2024-01-07 07:40:49 +00:00
c5e6485d14 Add AT_DISPATCH_V2 (#116698)
See top-level comment on Dispatch_v2.h for motivation.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116698
Approved by: https://github.com/albanD, https://github.com/malfet
2024-01-07 07:40:49 +00:00
9557b63c85 [MPS][BE] Do not crash if Metal function can not be found (#116938)
As [`newFunctionWithName:`](https://developer.apple.com/documentation/metal/mtllibrary/1515524-newfunctionwithname) does not accept an error argument, do not attempt to print it, as it is guaranteed to be `nil` at that point; doing so results in a classic null-pointer dereference when `TORCH_CHECK` attempts to construct a `std::string` from it. See the backtrace below for an example:
```
 thread #1, queue = 'metal gpu stream', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
    frame #0: 0x000000018a316dc4 libsystem_platform.dylib`_platform_strlen + 4
    frame #1: 0x00000001471011bc libtorch_cpu.dylib`std::__1::__constexpr_strlen[abi:v160006](__str=0x0000000000000000) at cstring:114:10
    frame #2: 0x0000000147100c24 libtorch_cpu.dylib`std::__1::char_traits<char>::length(__s=0x0000000000000000) at char_traits.h:220:12
  * frame #3: 0x0000000147100bf0 libtorch_cpu.dylib`std::__1::basic_ostream<char, std::__1::char_traits<char>>& std::__1::operator<<[abi:v160006]<std::__1::char_traits<char>>(__os=0x000000016fdfb3a0, __str=0x0000000000000000) at ostream:901:57
    frame #4: 0x0000000147100bb4 libtorch_cpu.dylib`std::__1::basic_ostream<char, std::__1::char_traits<char>>& c10::detail::_str<char const*>(ss=0x000000016fdfb3a0, t=0x000000016fdfb5d0) at StringUtil.h:55:6
    frame #5: 0x00000001471007ac libtorch_cpu.dylib`std::__1::basic_ostream<char, std::__1::char_traits<char>>& c10::detail::_str<char const*, char const*>(ss=0x000000016fdfb3a0, t=0x000000016fdfb4f8, args=0x000000016fdfb5d0) at StringUtil.h:68:10
    frame #6: 0x0000000147101444 libtorch_cpu.dylib`std::__1::basic_ostream<char, std::__1::char_traits<char>>& c10::detail::_str<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, char const*, char const*>(ss=0x000000016fdfb3a0, t="index_select_32bit_idx32", args=0x000000016fdfb4f8, args=0x000000016fdfb5d0) at StringUtil.h:68:10
    frame #7: 0x0000000147101404 libtorch_cpu.dylib`std::__1::basic_ostream<char, std::__1::char_traits<char>>& c10::detail::_str<char const*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, char const*, char const*>(ss=0x000000016fdfb3a0, t=0x000000016fdfb500, args="index_select_32bit_idx32", args=0x000000016fdfb4f8, args=0x000000016fdfb5d0) at StringUtil.h:68:10
    frame #8: 0x000000014710137c libtorch_cpu.dylib`c10::detail::_str_wrapper<char const*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, char const*, char const* const&>::call(args=0x000000016fdfb500, args="index_select_32bit_idx32", args=0x000000016fdfb4f8, args=0x000000016fdfb5d0) at StringUtil.h:75:5
    frame #9: 0x0000000147101310 libtorch_cpu.dylib`decltype(auto) c10::str<char [53], std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, char [10], char const*>(args={a\xcb\xa7H\x01\0\0\0}, args="index_select_32bit_idx32", args={\x96\xcb\xa7H\x01\0\0\0}, args=0x000000016fdfb5d0) at StringUtil.h:111:10
    frame #10: 0x0000000147100210 libtorch_cpu.dylib`decltype(auto) c10::detail::torchCheckMsgImpl<char [53], std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, char [10], char const*>((null)="Expected indexFunction to be true, but got false.  (Could this error message be improved?  If so, please report an enhancement request to PyTorch.)", args={a\xcb\xa7H\x01\0\0\0}, args="index_select_32bit_idx32", args={\x96\xcb\xa7H\x01\0\0\0}, args=0x000000016fdfb5d0) at Exception.h:453:10
    frame #11: 0x00000001470fffe8 libtorch_cpu.dylib`at::mps::MPSDevice::metalIndexingPSO(this=0x0000600000381670, kernel="index_select_32bit_idx32") at MPSDevice.mm:62:3
```

This was introduced by https://github.com/pytorch/pytorch/pull/99855 that replaced `newFunctionWithName:constantValues:error:` with `newFunctionWithName:`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116938
Approved by: https://github.com/Skylion007
2024-01-07 07:08:54 +00:00
20c2ec9a15 [CPU] Add flash attention mask version (#115913)
Add a masked-version flash attention for CPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115913
Approved by: https://github.com/jgong5, https://github.com/drisspg
2024-01-07 04:58:23 +00:00
b847290ddd Back out "[2d] unflatten_tensor on compute stream for DTensorExtension (#116559)" (#116939)
Summary:
Original commit changeset: 65298112f3db

Original Phabricator Diff: D52530451

Differential Revision: D52583345

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116939
Approved by: https://github.com/842974287
2024-01-07 03:53:40 +00:00
4b5b8f8a75 Add bfloat16 CUDA support to smoothl1loss (#116933)
Gradually ensuring that all CUDA ops that support float16 also support bfloat16 if possible

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116933
Approved by: https://github.com/malfet
2024-01-07 02:42:49 +00:00
a7902571be Add bfloat16 CUDA support to gamma unary functions (#116929)
Add bfloat16 support to unary gamma functions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116929
Approved by: https://github.com/malfet
2024-01-07 02:07:55 +00:00
8e1119f7b2 Fix typo in CUDA Macro (#116930)
Found while grepping for remaining _AND macros in CUDA subfolder

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116930
Approved by: https://github.com/malfet
2024-01-07 01:49:32 +00:00
83e8a0721d Reland #111196 (take 4) "Support tensors as Dict keys" (#116934)
Fixes #ISSUE_NUMBER

See that PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116934
Approved by: https://github.com/ezyang, https://github.com/huydhn
2024-01-07 01:37:26 +00:00
95041829c8 Add bfloat16 CUDA support to RNN (#116927)
Fixes #116925
Fixes #116763

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116927
Approved by: https://github.com/malfet
2024-01-06 22:55:34 +00:00
a5b86847ef Fix compiler warnings in cuda code (#116921)
Fixes compiler warnings about comparison between signed and unsigned data types

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116921
Approved by: https://github.com/Skylion007
2024-01-06 21:25:19 +00:00
65da4e1ba2 [CI] Use jemalloc for CUDA builds (#116900)
According to @ptrblck it'll likely mitigate a non-deterministic NVCC bug
See https://github.com/pytorch/pytorch/issues/116289 for more detail

Test plan: ssh into one of the cuda builds and make sure that `LD_PRELOAD` is set for the top-level make command

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116900
Approved by: https://github.com/atalman
2024-01-06 21:03:02 +00:00
c05dd2aaf0 [EZ][MPS] Use dispatch with rethrow for indexing (#116903)
Otherwise any assert within a sync block will cause an unrecoverable abort rather than a structured exception
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116903
Approved by: https://github.com/Skylion007
2024-01-06 20:36:47 +00:00
9519c8afd4 [export] Remove hacks for passing pinned version test. (#116871)
Summary: nature will heal itself.

Test Plan: CI

Reviewed By: angelayi

Differential Revision: D52566227

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116871
Approved by: https://github.com/angelayi
2024-01-06 18:09:27 +00:00
2dca3e99eb Revert "Support tensors as Dict keys Re-PR of #111196 (#116785)"
This reverts commit 1badad9ce9694ef70f6a3dc01000f2cf310c4c11.

Reverted https://github.com/pytorch/pytorch/pull/116785 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/116785#issuecomment-1879592261))
2024-01-06 08:22:33 +00:00
88197f2202 Rename experimental API (#116895)
Summary: Title

Test Plan: CI

Differential Revision: D52571286

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116895
Approved by: https://github.com/zhxchen17
2024-01-06 08:01:09 +00:00
830ace33bc [C10D] Add GIL checker to NCCL watchdog monitor (#116798)
Whenever the monitor thread kills the watchdog thread for being stuck,
we do so to save cluster time and get a faster failure signal, but we
want to know more about why it got stuck.

One possible reason for watchdog stuckness is GIL contention, which
could be ruled out or observed by making an attempt to acquire the GIL
at exit time.

If we cannot acquire the GIL within a short time window (1s) we abort
the attempt and report GIL contention, otherwise we report that GIL was
acquired successfully.
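The actual change lives in the C++ watchdog; a generic Python sketch of the same diagnostic pattern (a timed lock acquisition to distinguish contention from other hangs) could look roughly like this:

```python
import threading

def report_contention(lock: threading.Lock, timeout: float = 1.0) -> str:
    # Try to grab the contended resource within a short window; report either way.
    if lock.acquire(timeout=timeout):
        lock.release()
        return "lock acquired quickly; contention unlikely to be the cause"
    return f"could not acquire lock within {timeout:.1f}s; contention is a suspect"
```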

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116798
Approved by: https://github.com/zdevito
2024-01-06 05:13:43 +00:00
f24bba1624 [executorch hash update] update the pinned executorch hash (#116800)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116800
Approved by: https://github.com/pytorchbot
2024-01-06 04:10:52 +00:00
78c3098470 cmake: Include CheckCXXCompilerFlag where it is used (#113028)
Move the `include(CheckCXXCompilerFlag)` above the `append_cxx_flag_if_supported` function that uses it to avoid depending on the caller to have it already included.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113028
Approved by: https://github.com/malfet
2024-01-06 04:05:45 +00:00
1badad9ce9 Support tensors as Dict keys Re-PR of #111196 (#116785)
This prepares the PR where we implement sets in terms of dicts.
Rather than internally storing a dictionary that maps literals to
VariableTrackers, it now stores (pretty much) a dictionary from VTs to VTs.
To do so, keys are wrapped in an opaque internal class _Hashable.
The _Hashable class is opaque on purpose so that it fails hard
if it inadvertently leaks back into user code.
We also found and fixed a number of latent bugs and inconsistencies
in the way dynamo checked what can be a dict key. More generally, we
make much clearer what needs to be modified to add
a new supported key type to Dicts.

Fixes [#107595](https://www.internalfb.com/tasks?t=107595)
Fixes [#111603](https://www.internalfb.com/tasks?t=111603)

Re-PR of https://github.com/pytorch/pytorch/pull/111196 sadly due to reverts, we could not reuse @lezcano's original PR.
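A user-level sketch of what this enables (the function and shapes are illustrative; previously such keys were not supported by Dynamo):

```python
import torch

def f(x: torch.Tensor, y: torch.Tensor):
    cache = {x: x.sin(), y: y.cos()}  # dict keyed by tensors inside traced code
    return cache[x] + cache[y]

out = torch.compile(f)(torch.randn(4), torch.randn(4))
print(out.shape)
```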

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116785
Approved by: https://github.com/mlazos
2024-01-06 03:35:35 +00:00
ff0f79d3c7 [MPS] Mark torch.[all|any] as working with complex on MacOS14 (#116907)
It was enabled by https://github.com/pytorch/pytorch/pulls/116457, but at the time the PR landed, Sonoma testing was still not enabled

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116907
Approved by: https://github.com/osalpekar, https://github.com/kit1980
2024-01-06 01:10:11 +00:00
0b0c76bace Support squeeze.dim for jagged NT (#116891)
As title. Needed for `rev_view_func()` of `unsqueeze()`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116891
Approved by: https://github.com/soulitzer
ghstack dependencies: #115894, #116512
2024-01-06 01:00:53 +00:00
8894a97707 [Dynamo] Fix source for autograd.function default value (#116894)
Before this PR, the source guard would emit
```
globals()['Gradient'].__class__.forward.__defaults__[0]
```
which is incorrect

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116894
Approved by: https://github.com/zou3519, https://github.com/yanboliang
2024-01-06 00:36:00 +00:00
5323b2daa5 [docs] add mode="reduce-overhead" into torch.compile to enable cuda graph (#116529)
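A minimal sketch of the documented flag (the toy function below is an assumption):

```python
import torch

def fn(x):
    return torch.nn.functional.relu(x) * 2

# mode="reduce-overhead" enables CUDA graphs to cut per-call launch overhead.
compiled = torch.compile(fn, mode="reduce-overhead")
x = torch.randn(32, device="cuda" if torch.cuda.is_available() else "cpu")
print(compiled(x).sum())
```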

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116529
Approved by: https://github.com/eellison
2024-01-05 22:54:20 +00:00
2753960177 markDynamoStrictTest most of test/lazy/.* (#116893)
[codemod] markDynamoStrictTest lazy/test_step_closures
[codemod] markDynamoStrictTest lazy/test_reuse_ir
[codemod] markDynamoStrictTest lazy/test_meta_kernel
[codemod] markDynamoStrictTest lazy/test_generator
[codemod] markDynamoStrictTest lazy/test_functionalization
[codemod] markDynamoStrictTest lazy/test_extract_compiled_graph
[codemod] markDynamoStrictTest lazy/test_debug_util
[codemod] markDynamoStrictTest lazy/test_bindings
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116893
Approved by: https://github.com/Skylion007
ghstack dependencies: #116879, #116880, #116881, #116892
2024-01-05 22:29:35 +00:00
af2ded23eb [export] Exempt autograd ops for predispatch export (#116527)
Summary:
We intend to preserve autograd ops for predispatch export. Therefore, we
need to exempt the autograd ops in some places, e.g. verifier and
proxy_tensor.py.

Test Plan:
python test/export/test_export.py -k test_predispatch_export_with_autograd_op
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116527
Approved by: https://github.com/tugsbayasgalan
ghstack dependencies: #116339
2024-01-05 22:28:57 +00:00
9431798521 [export] Error grad mode op in export API (#116339)
Summary:
As current export doesn't support training, so grad mode ops doesn't
make sense. To avoid the confusion, we choose to early error if there
exist grad mode ops.

Test Plan:
python test/export/test_safeguard.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116339
Approved by: https://github.com/tugsbayasgalan
2024-01-05 22:28:57 +00:00
8fd4efacb4 markDynamoStrictTest most test/functorch/* (#116892)
[codemod] markDynamoStrictTest functorch/test_rearrange
[codemod] markDynamoStrictTest functorch/test_parsing
[codemod] markDynamoStrictTest functorch/test_minifier
[codemod] markDynamoStrictTest functorch/test_memory_efficient_fusion
[codemod] markDynamoStrictTest functorch/test_logging
[codemod] markDynamoStrictTest functorch/test_eager_transforms
[codemod] markDynamoStrictTest functorch/test_dims
[codemod] markDynamoStrictTest functorch/test_control_flow
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116892
Approved by: https://github.com/Skylion007
ghstack dependencies: #116879, #116880, #116881
2024-01-05 22:26:20 +00:00
e5f2ac18da [codemod] markDynamoStrictTest batch 12 (#116881)
[codemod] markDynamoStrictTest distributions/test_distributions
[codemod] markDynamoStrictTest distributions/test_constraints
[codemod] markDynamoStrictTest benchmark_utils/test_benchmark_utils
[codemod] markDynamoStrictTest backends/xeon/test_launch
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116881
Approved by: https://github.com/voznesenskym
ghstack dependencies: #116879, #116880
2024-01-05 21:59:40 +00:00
7562a00946 Make TORCH_LOGS="dist_ddp" include DDPOptimizer logs (#116794)
Note: ddp_graphs is still 'separate' from log components since it is an
artifact.  Not sure it's possible to enable it by default when dist_ddp
is selected.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116794
Approved by: https://github.com/fduwjj
2024-01-05 21:31:42 +00:00
5377b994da [aot_inductor] Retrieve original FQNs for weights (#116157)
Differential Revision: D52303882

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116157
Approved by: https://github.com/frank-wei
2024-01-05 21:30:36 +00:00
521dbbfaff Remove cpp/tensorexpr benchmarks (#116868)
Summary: These refer to a deprecated backend of torchscript which is no longer built in releases, and require llvm to be built.

Test Plan:
```
python setup.py develop
```

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116868
Approved by: https://github.com/hl475, https://github.com/chenyang78, https://github.com/eellison, https://github.com/mikekgfb
2024-01-05 21:23:30 +00:00
99ef47098d Use smaller shapes in lstm test to fix the CI timeout (#116453)
Fixes https://github.com/pytorch/pytorch/issues/108824 by using smaller shapes while keeping the same test scope

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116453
Approved by: https://github.com/huydhn, https://github.com/jgong5
2024-01-05 21:19:56 +00:00
499ca71e49 [codemod] markDynamoStrictTest batch 11 (#116880)
[codemod] markDynamoStrictTest nn/test_pruning
[codemod] markDynamoStrictTest nn/test_pooling
[codemod] markDynamoStrictTest nn/test_parametrization
[codemod] markDynamoStrictTest nn/test_packed_sequence
[codemod] markDynamoStrictTest nn/test_multihead_attention
[codemod] markDynamoStrictTest nn/test_module_hooks
[codemod] markDynamoStrictTest nn/test_lazy_modules
[codemod] markDynamoStrictTest nn/test_init
[codemod] markDynamoStrictTest nn/test_embedding
[codemod] markDynamoStrictTest nn/test_dropout
[codemod] markDynamoStrictTest nn/test_convolution
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116880
Approved by: https://github.com/voznesenskym
ghstack dependencies: #116879
2024-01-05 21:17:43 +00:00
ef7abdbd1a [C10] Mark Complex::imag as C10_HOST_DEVICE (#116877)
It feels weird that `real` is marked as such, but `imag` is not

Found while working on https://github.com/pytorch/pytorch/issues/116628

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116877
Approved by: https://github.com/Skylion007
2024-01-05 21:17:05 +00:00
c72d9f5de3 [no ci] Add pytorch-dev-infra as owners of .ci folder (#116901)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116901
Approved by: https://github.com/huydhn
2024-01-05 21:15:47 +00:00
0f0020d76f [GHF] Add support for new style stacks (#116873)
Where the base of the stack targets the default branch rather than the base branch. But as
the default branch has likely advanced since the PR was made, search for
the merge base before determining whether `base`..`head` is in sync with the `orig` branch.
Also, rather than hardcoding the default branch name, fetch it from `GitHubPR.default_branch()`

Test Plan: https://github.com/malfet/deleteme/pull/77

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116873
Approved by: https://github.com/ezyang
2024-01-05 20:32:24 +00:00
71d8fe690f Replace recursive stable_topological_sort() with iterative. (#116761)
Summary:
A graph with a deep set of nodes caused stable_topological_sort() to recurse and
overflow the stack. Rewrite it to be iterative and avoid recursion.

Fixes #115506
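A generic sketch of the rewrite (not the exact torch internals): a depth-first post-order walk with an explicit stack instead of recursion, so deep graphs cannot overflow the interpreter stack.

```python
def topological_sort(roots, deps):
    """deps maps a node to the nodes it depends on; assumes an acyclic graph."""
    order, visited = [], set()
    for root in roots:
        stack = [root]
        while stack:
            node = stack[-1]
            if node in visited:
                stack.pop()
                continue
            pending = [d for d in deps.get(node, ()) if d not in visited]
            if pending:
                stack.extend(pending)      # visit dependencies first
            else:
                visited.add(node)          # all deps emitted; emit this node
                order.append(node)
                stack.pop()
    return order

print(topological_sort(["a"], {"a": ["b", "c"], "b": ["d"], "c": ["d"]}))
# ['d', 'c', 'b', 'a']  (dependencies always precede dependents)
```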

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116761
Approved by: https://github.com/jansel, https://github.com/oulgen, https://github.com/Skylion007
2024-01-05 20:13:49 +00:00
476e9d5f77 [codemod] markDynamoStrictTest batch 10 (#116879)
[codemod] markDynamoStrictTest test_cpp_extensions_aot_no_ninja
[codemod] markDynamoStrictTest test_cpp_extensions_aot_ninja
[codemod] markDynamoStrictTest test_cpp_api_parity
[codemod] markDynamoStrictTest test_complex
[codemod] markDynamoStrictTest test_compile_benchmark_util
[codemod] markDynamoStrictTest test_comparison_utils
[codemod] markDynamoStrictTest test_bundled_inputs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116879
Approved by: https://github.com/voznesenskym
2024-01-05 19:46:55 +00:00
764a18016d VSX: Fix vectorized abs function for complex tensors (#116859)
Use a similar approach with Sleef as in #99550
to improve the precision and extremal value handling of the `abs` function for complex tensors.

This fixes
- test_reference_numerics_extremal__refs_abs_cpu_float64
- test_reference_numerics_extremal__refs_abs_cpu_float128

which failed on PPC.
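For reference, the extremal-value convention being tested follows C99 hypot semantics: an infinite component dominates a NaN one.

```python
import torch

z = torch.tensor(
    [complex(float("inf"), float("nan")), complex(float("nan"), float("-inf"))],
    dtype=torch.complex128,
)
print(z.abs())  # expected: tensor([inf, inf])
```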

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116859
Approved by: https://github.com/lezcano
2024-01-05 19:24:42 +00:00
63ee35c4e0 BugFix: Fix F632 bug in dynamo (if statement is always false) (#116867)
This was flagged by a preview ruff check because the if statement always evaluates to false. Likely a typo between `is` and `in`. I also micro-optimized some list construction into tuple construction, which is semantically identical but faster.
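A tiny illustration of the bug class (the values are made up):

```python
x = [1, 2]
print(x is [1, 2])  # always False: the literal builds a brand-new object
print(x == [1, 2])  # True: value equality
print(1 in x)       # True: membership, which is usually what was intended
```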

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116867
Approved by: https://github.com/lezcano, https://github.com/albanD, https://github.com/yanboliang
2024-01-05 19:15:05 +00:00
d455c33cca [ez][td] Pipe TD logs to log file (#116796)
It is a bit annoying to have them come up when searching through the logs.  They're also surprisingly long.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116796
Approved by: https://github.com/huydhn
2024-01-05 19:05:12 +00:00
ebedce24ab [FSDP] enable autograd in forward prefetching (#116792)
**problem**
when prefetching for the next forward, the current forward may be annotated with
`@torch.no_grad`. `param.grad_fn` stays None during prefetching, so
`_post_backward_hook` never gets triggered.

repro
```pytest test/distributed/fsdp/test_fsdp_freezing_weights.py```

**solution**
this PR enables autograd during prefetching (`_use_unsharded_views`), so
`param.grad_fn` is properly assigned for the next forward.

a longer-term fix would be to move `_use_unsharded_views` out of
`_prefetch_handle` and put it in `_pre_forward_unshard`.
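A generic sketch of the idea behind the fix (names and shapes are illustrative, not FSDP internals): re-enable grad while rebuilding the views so they pick up a grad_fn even when the caller is under no_grad.

```python
import torch

def make_views(flat_param: torch.Tensor):
    with torch.enable_grad():           # override an outer no_grad during prefetch
        return flat_param.split(2)

flat = torch.randn(4, requires_grad=True)
with torch.no_grad():                   # e.g. the current forward runs under no_grad
    views = make_views(flat)
print([v.grad_fn is not None for v in views])  # [True, True]
```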

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116792
Approved by: https://github.com/awgu
2024-01-05 18:44:27 +00:00
7f124167b5 [BE][Easy]: Update libfmt submodule to 10.2.1 (#116864)
Follow up to #116363. There was an update and 10.2.1 was released that fixes an accidental ABI change in 10.2 with libfmt on windows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116864
Approved by: https://github.com/albanD
2024-01-05 18:32:23 +00:00
4b6961a629 [no ci] Fix spelling (#116872)
s/initization/initialization/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116872
Approved by: https://github.com/albanD, https://github.com/Skylion007, https://github.com/atalman
2024-01-05 18:04:36 +00:00
0a0209e8a1 [ROCm] Use MI210 CI runners for all trunk commits (#116797)
As a follow-up to https://github.com/pytorch/pytorch/pull/115981

To make sure we catch any regressions/breakages related to flash attention/inductor/etc. functionality that is only enabled for MI210s, we would like to switch the trunk commit CI jobs to always run on MI210 runners. This should help us accurately identify the breaking commits for ROCm CI on the HUD.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116797
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony
2024-01-05 17:46:38 +00:00
9ac0e6971a Revert "[1/4] Intel GPU Runtime Upstreaming for Device (#116019)"
This reverts commit b4cebe2c34242ceee3a1bc285f426662942a29ac.

Reverted https://github.com/pytorch/pytorch/pull/116019 on behalf of https://github.com/malfet due to Broke internal and periodic buck builds, see https://github.com/pytorch/pytorch/actions/runs/7414664129/job/20176215868 ([comment](https://github.com/pytorch/pytorch/pull/116019#issuecomment-1879030285))
2024-01-05 17:36:39 +00:00
7956ca16e6 Enable reverse view_funcs by default for python subclasses (#116512)
Part 3 of implementation for general [subclass view fake-ification](https://docs.google.com/document/d/1C5taWiplmX7nKiURXDOAZG2W5VNJ2iV0fQFq92H0Cxw).

Changes codegen to generate `view_func()` / `rev_view_func()` by default for python subclasses. With `view_func()` existing more often now, the lazy view rebase logic [here](f10c3f4184/torch/csrc/autograd/variable.cpp (L665-L695)) causes some slight behavior changes for in-place ops on views:
* Additional view nodes are inserted into output graphs, changing their string representation, although they are functionally the same. The extra nodes are removed in AOTAutograd's DCE pass.
* When `t` is a `FunctionalTensor`, calling `t.grad_fn` will now invoke `view_func()`; we need to make sure we're operating in a `FunctionalTensorMode` so the view op calls succeed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116512
Approved by: https://github.com/bdhirsh, https://github.com/soulitzer
ghstack dependencies: #115894
2024-01-05 16:48:12 +00:00
3c21264c9b Introduce reverse view_funcs (#115894)
Part 2 of implementation for general [subclass view fake-ification](https://docs.google.com/document/d/1C5taWiplmX7nKiURXDOAZG2W5VNJ2iV0fQFq92H0Cxw).

Details:
* Codegen `rev_view_func()` alongside `view_func()`
    * Reverse view_func gives you a "base" from a "view": `rev_view_func(new_view) -> new_base` AKA it plays the original view backwards
* Utilizes the functional inverses defined in `FunctionalInverses.cpp`, passing `InverseReturnMode::AlwaysView`
* Manually implements functional inverses for `narrow()` and `chunk()`
* **NB: Multi-output views now set view_func() / rev_view_func() for each of the output views!**
    * Due to this, the `as_view()` overload that operates on a list of views is scrapped in favor of iteration via codegen

Example codegen in `ADInplaceOrViewTypeN.cpp`:
```cpp
at::Tensor narrow(c10::DispatchKeySet ks, const at::Tensor & self, int64_t dim, c10::SymInt start, c10::SymInt length) {
  auto _tmp = ([&]() {
    at::AutoDispatchBelowADInplaceOrView guard;
    return at::_ops::narrow::redispatch(ks & c10::after_ADInplaceOrView_keyset, self, dim, start, length);
  })();
  std::function<at::Tensor(const at::Tensor&)> func=nullptr;
  std::function<at::Tensor(const at::Tensor&)> rev_func=nullptr;
  if (false || !self.unsafeGetTensorImpl()->support_as_strided() ||
      c10::AutogradState::get_tls_state().get_view_replay_enabled()) {
    func = [=](const at::Tensor& input_base) {
      return at::_ops::narrow::call(input_base, dim, start, length);
    };
    rev_func = [=](const at::Tensor& input_view) {
      // NB: args from narrow() signature are passed along to the inverse
      return at::functionalization::FunctionalInverses::narrow_copy_inverse(self, input_view, at::functionalization::InverseReturnMode::AlwaysView, dim, start, length);
    };
  }
  auto result = as_view(/* base */ self, /* output */ _tmp, /* is_bw_differentiable */ true, /* is_fw_differentiable */ true, /* view_func */ func, /* rev_view_func */ rev_func, /* creation_meta */ InferenceMode::is_enabled() ? CreationMeta::INFERENCE_MODE : (at::GradMode::is_enabled() ? CreationMeta::DEFAULT : CreationMeta::NO_GRAD_MODE));
  return result;
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115894
Approved by: https://github.com/soulitzer
2024-01-05 16:48:12 +00:00
053b15c596 [codemod] markDynamoStrictTest batch 9 (#116836)
[codemod] markDynamoStrictTest test_datapipe
[codemod] markDynamoStrictTest test_cuda_trace
[codemod] markDynamoStrictTest test_cuda_sanitizer
[codemod] markDynamoStrictTest test_cuda_primary_ctx
[codemod] markDynamoStrictTest test_cuda_nvml_based_avail
[codemod] markDynamoStrictTest test_cuda_multigpu
[codemod] markDynamoStrictTest test_cuda_expandable_segments
[codemod] markDynamoStrictTest test_cuda
[codemod] markDynamoStrictTest test_cpp_extensions_open_device_registration
[codemod] markDynamoStrictTest test_cpp_extensions_jit
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116836
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116802, #116827, #116829, #116834
2024-01-05 16:40:40 +00:00
ee07260337 [codemod] markDynamoStrictTest batch 8 (#116834)
[codemod] markDynamoStrictTest test_flop_counter
[codemod] markDynamoStrictTest test_fake_tensor
[codemod] markDynamoStrictTest test_expanded_weights
[codemod] markDynamoStrictTest test_dynamic_shapes
[codemod] markDynamoStrictTest test_dlpack
[codemod] markDynamoStrictTest test_dispatch
[codemod] markDynamoStrictTest test_deploy
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116834
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116802, #116827, #116829
2024-01-05 16:40:24 +00:00
c0da5a4c68 [codemod] markDynamoStrictTest batch 7 (#116829)
[codemod] markDynamoStrictTest test_license
[codemod] markDynamoStrictTest test_itt
[codemod] markDynamoStrictTest test_import_stats
[codemod] markDynamoStrictTest test_hub
[codemod] markDynamoStrictTest test_futures
[codemod] markDynamoStrictTest test_functionalization_of_rng_ops
[codemod] markDynamoStrictTest test_functionalization
[codemod] markDynamoStrictTest test_functional_autograd_benchmark
[codemod] markDynamoStrictTest test_function_schema
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116829
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116802, #116827
2024-01-05 16:33:20 +00:00
6747d1383f [codemod] markDynamoStrictTest batch 6 (#116827)
[codemod] markDynamoStrictTest test_model_exports_to_core_aten
[codemod] markDynamoStrictTest test_model_dump
[codemod] markDynamoStrictTest test_mobile_optimizer
[codemod] markDynamoStrictTest test_mkldnn_verbose
[codemod] markDynamoStrictTest test_mkldnn_fusion
[codemod] markDynamoStrictTest test_mkldnn
[codemod] markDynamoStrictTest test_mkl_verbose
[codemod] markDynamoStrictTest test_meta
[codemod] markDynamoStrictTest test_matmul_cuda
[codemod] markDynamoStrictTest test_logging
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116827
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116802
2024-01-05 16:33:20 +00:00
9543caadc8 [codemod] markDynamoStrictTest batch 5 (#116802)
[codemod] markDynamoStrictTest test_openmp
[codemod] markDynamoStrictTest test_numpy_interop
[codemod] markDynamoStrictTest test_numba_integration
[codemod] markDynamoStrictTest test_nn
[codemod] markDynamoStrictTest test_nestedtensor
[codemod] markDynamoStrictTest test_native_mha
[codemod] markDynamoStrictTest test_native_functions
[codemod] markDynamoStrictTest test_multiprocessing_spawn
[codemod] markDynamoStrictTest test_multiprocessing
[codemod] markDynamoStrictTest test_monitor
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116802
Approved by: https://github.com/bdhirsh
2024-01-05 16:33:13 +00:00
0159e3abbd [dynamo] add a handler for itertools_chain_from_iterable and test (#116849)
1. add a handler for itertools_chain_from_iterable
2. add a test for itertools_chain_from_iterable

Fixes #116463

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116849
Approved by: https://github.com/ezyang
2024-01-05 15:14:18 +00:00
0249c4a785 Add config toggle suggestions for data-dependent/dynamic output shape (#114337)
Fixes https://github.com/pytorch/pytorch/issues/114220

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114337
Approved by: https://github.com/aakhundov
2024-01-05 14:01:01 +00:00
53f8d17d1e Specialize SymNodeVariable when used as module index (#114377)
Fixes #114171

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114377
Approved by: https://github.com/Skylion007
2024-01-05 13:51:52 +00:00
0e8698c3b6 Prevent unbacked symbol reallocation by forcing unification for unbacked symbol def sites (#114368)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114368
Approved by: https://github.com/aakhundov
2024-01-05 13:51:36 +00:00
f692fc9e7f fix typo (#116828)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116828
Approved by: https://github.com/Skylion007
2024-01-05 12:35:33 +00:00
5f5405f809 I have seen this deprecation and I am curious if this is the fix (#116714)
Let's see what CI/CD says

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116714
Approved by: https://github.com/awgu, https://github.com/wanchaol
2024-01-05 07:02:58 +00:00
79ba39710e [AOTI] Forward fix a Windows build failure (#116790)
Summary: forward fix https://github.com/pytorch/pytorch/pull/116269
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116790
Approved by: https://github.com/khabinov, https://github.com/huydhn
2024-01-05 06:00:58 +00:00
2ccc7af028 Revert "[CPU] Add flash attention mask version (#115913)"
This reverts commit 76a3fbb7092d25638a046c1994030fc8108e5fbf.

Reverted https://github.com/pytorch/pytorch/pull/115913 on behalf of https://github.com/zou3519 due to broke transformer test on dynamo shard ([comment](https://github.com/pytorch/pytorch/pull/115913#issuecomment-1878043389))
2024-01-05 02:39:12 +00:00
bbfd81f513 [codemod] markDynamoStrictTest batch (#116791)
[codemod] markDynamoStrictTest test_sympy_utils
[codemod] markDynamoStrictTest test_serialization
[codemod] markDynamoStrictTest test_segment_reductions
[codemod] markDynamoStrictTest test_schema_check
[codemod] markDynamoStrictTest test_scatter_gather_ops
[codemod] markDynamoStrictTest test_pytree
[codemod] markDynamoStrictTest test_pruning_op
[codemod] markDynamoStrictTest test_per_overload_api
[codemod] markDynamoStrictTest test_out_dtype_op
[codemod] markDynamoStrictTest test_optim
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116791
Approved by: https://github.com/voznesenskym
ghstack dependencies: #116735, #116736, #116739, #116740, #116742, #116743, #116744, #116745
2024-01-05 02:22:53 +00:00
6d9b837c27 Graphbreak when creating a map with unsupported keys (#116460)
As per title. With this, https://github.com/pytorch/pytorch/issues/93697
does not choke, but spits out many of these:
```
[ERROR] Name: "L['self']"
[ERROR]     Source: local
[ERROR]     Create Function: NN_MODULE
[ERROR]     Guard Types: ['ID_MATCH']
[ERROR]     Code List: ["___check_obj_id(L['self'], 139962171127504)"]
[ERROR]     Object Weakref: <weakref at 0x7f4b72f7c9a0; to
'ActorCriticPolicy' at 0x7f4b7b7df6d0>
[ERROR]     Guarded Class Weakref: <weakref at 0x7f4afbd08b30; to
'ABCMeta' at 0x56463a727840 (ActorCriticPolicy)>
[ERROR] Created at:
[ERROR]   File
"/home/lezcano/git/pytorch/pytorch/torch/_dynamo/variables/builder.py",
line 248, in __call__
[ERROR]     vt = self._wrap(value)
[ERROR]   File
"/home/lezcano/git/pytorch/pytorch/torch/_dynamo/variables/builder.py",
line 474, in _wrap
[ERROR]     return self.wrap_module(value)
[ERROR]   File
"/home/lezcano/git/pytorch/pytorch/torch/_dynamo/variables/builder.py",
line 941, in wrap_module
[ERROR]     return self.tx.output.register_attr_or_module(
[ERROR]   File
"/home/lezcano/git/pytorch/pytorch/torch/_dynamo/output_graph.py", line
735, in register_attr_or_module
[ERROR]     install_guard(source.make_guard(GuardBuilder.NN_MODULE))
[ERROR] Error while creating guard:
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116460
Approved by: https://github.com/jansel
ghstack dependencies: #116459
2024-01-05 01:48:07 +00:00
7c8f38700a [dynamo] Fix np.issubdtype (#116459)
Fixes the issue described at https://github.com/pytorch/pytorch/issues/93697#issuecomment-1828346590

This doesn't fix the full issue yet, now we hit
```python
  File
  "/home/lezcano/git/pytorch/pytorch/torch/_dynamo/symbolic_convert.py",
  line 744, in step
  getattr(self, inst.opname)(inst)
  File
  "/home/lezcano/git/pytorch/pytorch/torch/_dynamo/symbolic_convert.py",
  line 1366, in BUILD_MAP
      assert (
      AssertionError
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116459
Approved by: https://github.com/peterbell10
2024-01-05 01:48:07 +00:00
76a3fbb709 [CPU] Add flash attention mask version (#115913)
Add a masked-version flash attention for CPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115913
Approved by: https://github.com/jgong5, https://github.com/drisspg
2024-01-05 01:27:36 +00:00
6413511713 [export][refactor][4/n] Make equality_constraints optional (#116233)
Summary: needed to remove equality_constraints eventually :P

Test Plan: CI

Differential Revision: D52351709

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116233
Approved by: https://github.com/tugsbayasgalan
2024-01-05 00:50:52 +00:00
db69956feb [Dynamo] Catch ImportError when tracing_rules load objects (#116783)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116783
Approved by: https://github.com/angelayi
2024-01-05 00:26:17 +00:00
b0393ebe9b [MPS] Make test_mps.py passable on Sonoma (#116764)
- Enable Sonoma testing on M2 machines
- Add 70+ ops to the list of supported ones on MacOS Sonoma
- Enable nn.functional.
- Add explicit `TORCH_CHECK` to mark scatter/gather, index_select and linalg ops as not yet supporting Complex, as an attempt to call those will crash with various MPS asserts such as:
```
(mpsFileLoc): /AppleInternal/Library/BuildRoots/0032d1ee-80fd-11ee-8227-6aecfccc70fe/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:96:0: error: 'mps.reduction_min' op operand #0 must be tensor of MPS type values or memref of MPS type values, but got 'tensor<5x5xcomplex<f32>>'
(mpsFileLoc): /AppleInternal/Library/BuildRoots/0032d1ee-80fd-11ee-8227-6aecfccc70fe/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:96:0: note: see current operation: %3 = "mps.reduction_min"(%1, %2) <{keep_dims}> : (tensor<5x5xcomplex<f32>>, tensor<2xsi32>) -> tensor<1x1xcomplex<f32>>
```
- Treat bools as int8 to fix regression re-surfaced in `index_fill` (used to be broken in Monterey, then fixed in Ventura and broken in Sonoma again)
- `nn.functional.max_pool2d` results now match CPU output for uint8 dtype in Sonoma

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116764
Approved by: https://github.com/kulinseth, https://github.com/seemethere
2024-01-05 00:25:47 +00:00
d0cf2182ea Fix TransformerEncoderLayer for bias=False (#116760)
Fixes https://github.com/pytorch/pytorch/issues/116385

Don't call `torch._transformer_encoder_layer_fwd` when `bias=False`

`bias=False` was not something that `torch._transformer_encoder_layer_fwd`  was meant to work with; it was my bad that this wasn't tested when I approved https://github.com/pytorch/pytorch/pull/101687.

`bias=False` was causing the `tensor_args` in [`TransformerEncoder`](a17de2d645/torch/nn/modules/transformer.py (L663-L677)) to contain `None`s and error on checks for the fastpath like `t.requires_grad for t in tensor_args`.

Alternative fix would be to
1) Pass `torch.zeros_like({*}.weight)` to the kernel when `bias=False` and filter `tensor_args` as appropriate
2) Fix `torch._transformer_encoder_layer_fwd` to take `Optional<Tensor>` for biases and fix the kernels as appropriate

Let me know if these approaches are preferable
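A minimal repro sketch of the affected configuration (the dimensions are arbitrary):

```python
import torch
from torch import nn

layer = nn.TransformerEncoderLayer(d_model=16, nhead=4, bias=False, batch_first=True)
layer.eval()                          # eval + no_grad is what hits the fastpath checks
with torch.no_grad():
    out = layer(torch.randn(2, 5, 16))
print(out.shape)
```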

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116760
Approved by: https://github.com/jbschlosser
2024-01-05 00:13:10 +00:00
e3ca7346ce Re-add initial Flash Attention support on ROCM (#115981)
Note about the Updates:

This PR:
1. skips more flash attention related UTs on MI200
2. Fix additional ATen compiling errors after hipification
3. Fix the author "root" of a specific commit
4. Includes the patch from Nikita in favor of block level static initialization.

CAVEAT: This revised PR has a commit that modifies the CI to force its running on MI200 nodes. That specific commit must be reverted before merge.

Original PR (https://github.com/pytorch/pytorch/pull/114309) Note:

This pull requests add initial Flash Attention support for AMD/ROCM platform. It added a specialized Triton repository/branch as a compile-time dependency for Flash Attention math library on AMD/ROCM. This triton submodule is not used at runtime and will not be shipped to the final pytorch package. We have the plan to release this specialized Triton as a separate project.

Know limitations:

- Only supports MI200 series GPUs (i.e., `gcnArchName == gfx90a:sramecc+:xnack-`).
- Only supports power of two sequence lengths.
- No support for varlen APIs.
- Only support head dimension 16,32,64,128.
- Performance is still being optimized.

Fixes #112997

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115981
Approved by: https://github.com/malfet
2024-01-04 22:21:31 +00:00
8195a0aaa7 Move array_of helper to c10/util (#116749)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116749
Approved by: https://github.com/drisspg, https://github.com/Skylion007
ghstack dependencies: #116685
2024-01-04 21:58:32 +00:00
5ac57a06eb [export] Refactor ExportPassBase. (#116778)
Summary:
X-link: https://github.com/pytorch/executorch/pull/1532

as title. This diff decouples the pass base library from torch export and exir, so that different layers can evolve in their own fashion, and we have more headroom to divide and conquer in the future.

Test Plan: CI

Reviewed By: angelayi

Differential Revision: D52514517

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116778
Approved by: https://github.com/angelayi
2024-01-04 21:32:14 +00:00
e7d741b0fd [C10D] Dump cpp stacktraces on heartbeat monitor timeout (#116717)
Summary:
If heartbeat monitor times out and kills the process, we want to know why.

It's convenient to use an internal tool for this, but we plan to later
integrate with torchelastic to call into pyspy or something else, which will be
both better (including py stacks) and compatible with OSS.

Test Plan: tested manually, observed c++ stacktraces were dumped

Reviewed By: fduwjj

Differential Revision: D52370243

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116717
Approved by: https://github.com/zdevito
2024-01-04 21:11:47 +00:00
cyy
d23972df00 Update libfmt submodule to 10.2.0 (#116363)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116363
Approved by: https://github.com/ezyang
2024-01-04 19:25:40 +00:00
70f3a530d7 [AOTI] Add pybind for AOTIModelContainerRunnerCpu and AOTIModelContainerRunnerCuda (#116269)
Summary: Now we can allocate an AOTIModelContainerRunner object instead of relying on torch.utils.cpp_extension.load_inline. Also renamed AOTInductorModelRunner to AOTIRunnerUtil in this PR.

Test Plan: CI

Reviewed By: khabinov

Differential Revision: D52339116

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116269
Approved by: https://github.com/khabinov
2024-01-04 18:58:24 +00:00
56d7a47806 [BE] Use precompiled headers to speedup clang-tidy (#116780)
This brings the time down by 30% (from [30](https://github.com/pytorch/pytorch/actions/runs/7412899917/job/20170674075#step:11:64) min to [20](https://github.com/pytorch/pytorch/actions/runs/7413082213/job/20171286833?pr=116780#step:11:64) min)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116780
Approved by: https://github.com/kit1980, https://github.com/Skylion007
2024-01-04 18:37:44 +00:00
39f8853313 [inductor] Use max sm clock when calculating device tflops (#116754)
See openai/triton#2801

Current SM clocks may fluctuate at runtime and change the result of
`get_device_tflops`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116754
Approved by: https://github.com/lezcano
2024-01-04 17:38:21 +00:00
6793b99107 [BugFix] Fix SegFault when torch.all/any dispatched to mps or other backends (#116457)
The old implementation will result in an infinite recursive loop, leading to a stack overflow and segfault.

If TORCH_SHOW_DISPATCH_TRACE is on, with a debug build of PyTorch, we can see the following endless output in the terminal:
```
[call] op=[aten::quantize_per_tensor], key=[AutogradCPU]
  [redispatch] op=[aten::quantize_per_tensor], key=[CPU]
 [call] op=[aten::any.dims], key=[AutogradCPU]
  [redispatch] op=[aten::any.dims], key=[QuantizedCPU]
   [call] op=[aten::empty.memory_format], key=[BackendSelect]
    [redispatch] op=[aten::empty.memory_format], key=[CPU]
   [call] op=[aten::any.dims_out], key=[QuantizedCPU]
    [call] op=[aten::any.dims], key=[QuantizedCPU]
     [call] op=[aten::empty.memory_format], key=[BackendSelect]
      [redispatch] op=[aten::empty.memory_format], key=[CPU]
     [call] op=[aten::any.dims_out], key=[QuantizedCPU]
      [call] op=[aten::any.dims], key=[QuantizedCPU]
       [call] op=[aten::empty.memory_format], key=[BackendSelect]
        [redispatch] op=[aten::empty.memory_format], key=[CPU]
       [call] op=[aten::any.dims_out], key=[QuantizedCPU]
        [call] op=[aten::any.dims], key=[QuantizedCPU]
         [call] op=[aten::empty.memory_format], key=[BackendSelect]
          [redispatch] op=[aten::empty.memory_format], key=[CPU]
         [call] op=[aten::any.dims_out], key=[QuantizedCPU]
          [call] op=[aten::any.dims], key=[QuantizedCPU]
           [call] op=[aten::empty.memory_format], key=[BackendSelect]
            [redispatch] op=[aten::empty.memory_format], key=[CPU]
           [call] op=[aten::any.dims_out], key=[QuantizedCPU]
            [call] op=[aten::any.dims], key=[QuantizedCPU]
             [call] op=[aten::empty.memory_format], key=[BackendSelect]
              [redispatch] op=[aten::empty.memory_format], key=[CPU]
             [call] op=[aten::any.dims_out], key=[QuantizedCPU]
              [call] op=[aten::any.dims], key=[QuantizedCPU]
               [call] op=[aten::empty.memory_format], key=[BackendSelect]
                [redispatch] op=[aten::empty.memory_format], key=[CPU]
               [call] op=[aten::any.dims_out], key=[QuantizedCPU]
                [call] op=[aten::any.dims], key=[QuantizedCPU]
                 [call] op=[aten::empty.memory_format], key=[BackendSelect]
                  [redispatch] op=[aten::empty.memory_format], key=[CPU]
                 [call] op=[aten::any.dims_out], key=[QuantizedCPU]
                  [call] op=[aten::any.dims], key=[QuantizedCPU]
.....
.....
.....
```

Fixes #116452
Fixes #116451

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116457
Approved by: https://github.com/malfet
2024-01-04 17:37:17 +00:00
b4cebe2c34 [1/4] Intel GPU Runtime Upstreaming for Device (#116019)
# Motivation
As mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), the first runtime component we would like to upstream is `Device`, which contains the device management functions of Intel GPU's runtime. To facilitate the code review, we split the code changes into 4 PRs. This is one of the 4 PRs and covers the changes under `c10`.

# Design
An Intel GPU device is a wrapper around a SYCL device on which kernels can be executed. In our design, we maintain a SYCL device pool containing all the GPU devices of the current machine, and PyTorch manages the status of the device pool. Thread-local safety is considered in this design. The corresponding C++ files related to `Device` will be placed in the c10/xpu folder. And we provide the c10 device runtime APIs, like
  - `c10::xpu::device_count`
  - `c10::xpu::set_device`
  - ...

# Additional Context
In our plan, 4 PRs should be submitted to PyTorch for `Device`:
1. for c10
2. for aten
3. for python frontend
4. for lazy initialization shared with CUDA

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116019
Approved by: https://github.com/gujinghui, https://github.com/jgong5, https://github.com/EikanWang, https://github.com/malfet
2024-01-04 17:35:04 +00:00
43fb1b671c [export] Improve verifier to not specialize on dialect. (#116705)
Summary:
Currently we have a very ugly specialization on edge dialect in verifier like the following:
```
 # TODO Remove this branch.
            if ep.dialect == "EDGE":  # !!! Don't change this allowlist. !!!
                pass
            else:
                raise e
```
In this diff we do some additional work to make signature checking also work in exir. We decouple the transformation stack in torch export and exir so that different layers of the stack can evolve in their own fashion and the team can divide and conquer them separately.

Test Plan: CI

Differential Revision: D52499225

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116705
Approved by: https://github.com/tugsbayasgalan
2024-01-04 17:17:23 +00:00
f1a393c029 [codemod] markDynamoStrictTest batch (#116745)
- test_show_pickle
- test_show_pickle
- test_set_default_mobile_cpu_allocator
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116745
Approved by: https://github.com/Skylion007, https://github.com/voznesenskym
ghstack dependencies: #116728, #116729, #116730, #116731, #116732, #116733, #116734, #116735, #116736, #116739, #116740, #116742, #116743, #116744
2024-01-04 15:04:18 +00:00
311548b79c [codemod] markDynamoStrictTest test_sort_and_select (#116744)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116744
Approved by: https://github.com/voznesenskym
ghstack dependencies: #116728, #116729, #116730, #116731, #116732, #116733, #116734, #116735, #116736, #116739, #116740, #116742, #116743
2024-01-04 15:04:18 +00:00
30f0a05207 [codemod] markDynamoStrictTest test_stateless (#116743)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116743
Approved by: https://github.com/Skylion007, https://github.com/voznesenskym
ghstack dependencies: #116728, #116729, #116730, #116731, #116732, #116733, #116734, #116735, #116736, #116739, #116740, #116742
2024-01-04 15:03:21 +00:00
46b44fb246 [codemod] markDynamoStrictTest test_subclass (#116742)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116742
Approved by: https://github.com/voznesenskym
ghstack dependencies: #116728, #116729, #116730, #116731, #116732, #116733, #116734, #116735, #116736, #116739, #116740
2024-01-04 15:02:46 +00:00
c2174974ae [codemod] markDynamoStrictTest test_tensor_creation_ops (#116740)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116740
Approved by: https://github.com/voznesenskym
ghstack dependencies: #116728, #116729, #116730, #116731, #116732, #116733, #116734, #116735, #116736, #116739
2024-01-04 15:02:03 +00:00
7c5704fc00 [codemod] markDynamoStrictTest test_tensorboard (#116739)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116739
Approved by: https://github.com/voznesenskym
ghstack dependencies: #116728, #116729, #116730, #116731, #116732, #116733, #116734, #116735, #116736
2024-01-04 15:01:25 +00:00
caa33e1eb1 [codemod] markDynamoStrictTest test_testing (#116736)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116736
Approved by: https://github.com/voznesenskym
ghstack dependencies: #116728, #116729, #116730, #116731, #116732, #116733, #116734, #116735
2024-01-04 15:01:07 +00:00
882d1f4ea6 [codemod] markDynamoStrictTest test_transformers (#116735)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116735
Approved by: https://github.com/voznesenskym
ghstack dependencies: #116728, #116729, #116730, #116731, #116732, #116733, #116734
2024-01-04 15:00:23 +00:00
eb958d7552 Fix bug in unflatten pytree (#116750)
Summary: Title

Test Plan: CI

Differential Revision: D52529088

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116750
Approved by: https://github.com/zhxchen17
2024-01-04 14:23:40 +00:00
75dae4f691 Revert "[dynamo] Fix np.issubdtype (#116459)"
This reverts commit b5c33ccdb3198a48a354e21a4fdace0ec6d04146.

Reverted https://github.com/pytorch/pytorch/pull/116459 on behalf of https://github.com/zou3519 due to Broke CI, seems to be a landrace ([comment](https://github.com/pytorch/pytorch/pull/116459#issuecomment-1877135999))
2024-01-04 14:00:11 +00:00
3a0f6897c5 Revert "Graphbreak when creating a map with unsupported keys (#116460)"
This reverts commit c2a020a2184982361a712bbb1e9766caba26dba6.

Reverted https://github.com/pytorch/pytorch/pull/116460 on behalf of https://github.com/zou3519 due to I think the bottom PR broke CI ([comment](https://github.com/pytorch/pytorch/pull/116460#issuecomment-1877132374))
2024-01-04 13:56:57 +00:00
c2a020a218 Graphbreak when creating a map with unsupported keys (#116460)
As per title. With this, https://github.com/pytorch/pytorch/issues/93697
does not choke, but spits out many of these:
```
[ERROR] Name: "L['self']"
[ERROR]     Source: local
[ERROR]     Create Function: NN_MODULE
[ERROR]     Guard Types: ['ID_MATCH']
[ERROR]     Code List: ["___check_obj_id(L['self'], 139962171127504)"]
[ERROR]     Object Weakref: <weakref at 0x7f4b72f7c9a0; to
'ActorCriticPolicy' at 0x7f4b7b7df6d0>
[ERROR]     Guarded Class Weakref: <weakref at 0x7f4afbd08b30; to
'ABCMeta' at 0x56463a727840 (ActorCriticPolicy)>
[ERROR] Created at:
[ERROR]   File
"/home/lezcano/git/pytorch/pytorch/torch/_dynamo/variables/builder.py",
line 248, in __call__
[ERROR]     vt = self._wrap(value)
[ERROR]   File
"/home/lezcano/git/pytorch/pytorch/torch/_dynamo/variables/builder.py",
line 474, in _wrap
[ERROR]     return self.wrap_module(value)
[ERROR]   File
"/home/lezcano/git/pytorch/pytorch/torch/_dynamo/variables/builder.py",
line 941, in wrap_module
[ERROR]     return self.tx.output.register_attr_or_module(
[ERROR]   File
"/home/lezcano/git/pytorch/pytorch/torch/_dynamo/output_graph.py", line
735, in register_attr_or_module
[ERROR]     install_guard(source.make_guard(GuardBuilder.NN_MODULE))
[ERROR] Error while creating guard:
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116460
Approved by: https://github.com/jansel
ghstack dependencies: #116459
2024-01-04 12:36:31 +00:00
81f98f1082 Experimental non-strict mode (#114658)
This is a proof-of-concept implementation of how people can use a marker `mark_strict` to enable torchdynamo while exporting under non-strict mode. The main idea is that `mark_strict` will turn into an HOO which then utilizes dynamo to do correctness analysis in the same way torch.cond works today. There are some notable limitations:
1. This API is not meant for public use yet
2. Strict region can't work with arbitrary container inputs
3. We don't preserve `nn_module_stack` and other node metadata for the strict region.
4. strict_mode HOO will show up in the final graph. This is undesirable in the long term, but for short term experiments, it should be good enough. Will fix this in the follow up PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114658
Approved by: https://github.com/ydwu4
2024-01-04 12:24:58 +00:00
cyy
91bbcf8c71 [1/N] replace THPUtils_assert with TORCH_CHECK (#116675)
This PR replaces THPUtils_assert with TORCH_CHECK.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116675
Approved by: https://github.com/albanD
2024-01-04 11:15:33 +00:00
faea6f2c7a [C10D] Make heartbeat_ atomic (#116702)
Summary:
Currently, the code is working. We know this because we observe heartbeat
timeouts.

However, there is a chance that if the code were refactored, the compiler could
optimize away the load of heartbeat_ inside heartbeatMonitor, and we wouldn't
know.

Using an atomic here is not really for thread synchronization, but more to ensure
compiler optimizations (hoisting the read outside the loop) can never be
allowed to happen.  Again, we know this isn't currently happening, because if it
were, it would not be an intermittent failure, it would be a consistent failure
(at least with a fixed compiler/platform).

I previously avoided an atomic because we didn't want shared locks between the heartbeat
monitor and the watchdog thread.  Why? If the watchdog held the lock and hung, the monitor
could also hang.  However, this really can't happen (AFAIK) when using an
atomic.

Test Plan: existing CI tests

Differential Revision: D52378257

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116702
Approved by: https://github.com/fduwjj, https://github.com/zdevito
2024-01-04 06:06:32 +00:00
2bdc2a68cb [ez][td] Fix for emit metrics can't find JOB_NAME (#116748)
After #113884
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116748
Approved by: https://github.com/malfet, https://github.com/huydhn
2024-01-04 05:31:25 +00:00
670e7992fd [Easy] Document AGGRESSIVE_RECOMPUTATION flag in min-cut partitioner (#114007)
As titled.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114007
Approved by: https://github.com/wanchaol
2024-01-04 05:05:08 +00:00
a8a9695047 Move promoteTypes to cpp file (#116685)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116685
Approved by: https://github.com/albanD
2024-01-04 04:42:14 +00:00
f071687ef1 Clean up macOS x86 CI build and test jobs (#116725)
We're ready to pull the plug on macOS x86 build and test jobs on CI.

* [ ] https://github.com/pytorch/pytorch/pull/116725
* [ ] https://github.com/pytorch/pytorch/pull/116726

More details are at https://github.com/pytorch/pytorch/issues/114602
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116725
Approved by: https://github.com/malfet, https://github.com/seemethere
2024-01-04 04:26:32 +00:00
9b88354b80 [executorch hash update] update the pinned executorch hash (#116668)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116668
Approved by: https://github.com/pytorchbot
2024-01-04 04:12:25 +00:00
b5c33ccdb3 [dynamo] Fix np.issubdtype (#116459)
Fixes the issue described at https://github.com/pytorch/pytorch/issues/93697#issuecomment-1828346590

This doesn't fix the full issue yet; now we hit
```python
  File "/home/lezcano/git/pytorch/pytorch/torch/_dynamo/symbolic_convert.py", line 744, in step
    getattr(self, inst.opname)(inst)
  File "/home/lezcano/git/pytorch/pytorch/torch/_dynamo/symbolic_convert.py", line 1366, in BUILD_MAP
    assert (
AssertionError
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116459
Approved by: https://github.com/peterbell10
2024-01-04 03:55:50 +00:00
e2359f72c8 [BE]: Update ruff to 0.1.11 (#116704)
Updates ruff to 0.1.11
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116704
Approved by: https://github.com/malfet
2024-01-04 03:35:45 +00:00
e70dfe07f6 [audio hash update] update the pinned audio hash (#116747)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116747
Approved by: https://github.com/pytorchbot
2024-01-04 03:27:48 +00:00
c14a0b6c84 [codemod] markDynamoStrictTest batch (#116734)
- test_type_promotion
- test_type_info
- test_type_hints
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116734
Approved by: https://github.com/voznesenskym
ghstack dependencies: #116728, #116729, #116730, #116731, #116732, #116733
2024-01-04 03:18:06 +00:00
bfb9df3684 [codemod] markDynamoStrictTest batch (#116733)
- test_weak
- test_view_ops
- test_typing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116733
Approved by: https://github.com/voznesenskym
ghstack dependencies: #116728, #116729, #116730, #116731, #116732
2024-01-04 03:18:06 +00:00
a308a25fb7 [codemod] markDynamoStrictTest batch (#116732)
- torch_np/numpy_tests/core/test_getlimits
- torch_np/numpy_tests/core/test_einsum
- torch_np/numpy_tests/core/test_dtype
- torch_np/numpy_tests/core/test_dlpack
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116732
Approved by: https://github.com/voznesenskym
ghstack dependencies: #116728, #116729, #116730, #116731
2024-01-04 03:17:57 +00:00
9255f55767 [codemod] markDynamoStrictTest batch (#116731)
- torch_np/numpy_tests/core/test_numerictypes
- torch_np/numpy_tests/core/test_numeric
- torch_np/numpy_tests/core/test_indexing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116731
Approved by: https://github.com/voznesenskym
ghstack dependencies: #116728, #116729, #116730
2024-01-04 03:17:47 +00:00
1f7badd856 [codemod] markDynamoStrictTest batch (#116730)
- torch_np/numpy_tests/core/test_scalarinherit
- torch_np/numpy_tests/core/test_scalar_methods
- torch_np/numpy_tests/core/test_scalar_ctors
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116730
Approved by: https://github.com/voznesenskym
ghstack dependencies: #116728, #116729
2024-01-04 03:17:39 +00:00
d1d6b90a1b [codemod] markDynamoStrictTest torch_np/numpy_tests/core/test_scalarmath (#116729)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116729
Approved by: https://github.com/voznesenskym
ghstack dependencies: #116728
2024-01-04 03:17:29 +00:00
3ba35548c3 [codemod] markDynamoStrictTest torch_np/numpy_tests/core/test_shape_base (#116728)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116728
Approved by: https://github.com/voznesenskym
2024-01-04 03:17:22 +00:00
3acb7972b0 [BE] Test CrossEntropyLoss for torch.half (#116681)
To test it on MPS and CUDA devices
Also, move some float64 skip-tests for MPS to xfail, same as CPU tests for torch.half
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116681
Approved by: https://github.com/xuzhao9, https://github.com/mikaylagawarecki
2024-01-04 02:16:09 +00:00
6fece41e9a [codemod][lowrisk] Remove extra semi colon from caffe2/c10/util/Float8_e5m2.h (#115761)
Summary:
`-Wextra-semi` or `-Wextra-semi-stmt`

If the code compiles, this is safe to land.

Test Plan: Sandcastle

Reviewed By: palmje

Differential Revision: D51995078

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115761
Approved by: https://github.com/Skylion007
2024-01-04 02:02:26 +00:00
5395331644 Avoid GIL during exit (#116709)
Stacks recorded when tensors are being freed during exit could
try to acquire the GIL. Py_IsInitialized can be used to check if we
are post Python exit and should not attempt to acquire the GIL.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116709
Approved by: https://github.com/aaronenyeshi
2024-01-04 01:56:44 +00:00
4926146537 [Inductor] Fix Conv Binary Inplace Fusion issue (#115153)
**Summary**
Take this Pattern as example
```
  #      ReLU
  #     /    \
  #  Conv1    \
  #    |       \
  #  Conv2      \
  #     \       /
  #       Add
```
The current `ConvBinaryInplace` check will fail to perform Inplace fusion (using outplace fusion instead) due to `ReLU` having 2 users. However, if all users of `ReLU` are ancestor nodes of `Conv2`, we should be able to proceed with the `ConvBinaryInplace` fusion. This diff relaxes the `ConvBinaryInplace` check accordingly.

**TestPlan**
```
python -m pytest test_mkldnn_pattern_matcher.py -k test_conv2d_binary_inplace_fusion_pass_cpu
python -m pytest test_mkldnn_pattern_matcher.py -k test_conv2d_binary_inplace_fusion_failed_cpu
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115153
Approved by: https://github.com/CaoE, https://github.com/jgong5
2024-01-04 01:06:27 +00:00
ce2df3f690 [HigherOrderOp] set set_subgraph_inputs to flatten_manual for map (#115853)
We change manually_set_subgraph_inputs to three modes: manual, automatic and flatten_manual. The flatten_manual mode will first flatten the sub_args and then recursively call set_subgraph_inputs = "manual". This allows us to control the order in which the placeholders show up in the graph, which is necessary for map, where we want to keep the mapped arguments before the rest of the positional arguments.

Right now, map only takes a single tensor as the mapped argument, but it becomes pretty easy to match the subgraph inputs to the original proxies once we have a "flatten_manual" option.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115853
Approved by: https://github.com/zou3519
2024-01-04 00:27:07 +00:00
a2f3770b24 [BE] Remove arch -arch arm64 (#116724)
It was needed back in the day when there were no arm64 runner daemon binaries, so this trick was needed to execute native arm64 tests when invoked from an x86 runner daemon.

Followup after  https://github.com/pytorch/pytorch/pull/116680

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116724
Approved by: https://github.com/huydhn
2024-01-03 23:59:53 +00:00
4e330882da [inductor] Add ABI shim function for torch.scatter_reduce (#116700)
Ran into the following exception during C++ file compilation.
```
error: use of undeclared identifier 'aoti_torch_scatter_reduce_out'
    aoti_torch_scatter_reduce_out(buf12, buf12,0,buf13,buf14, "sum",1);
    ^
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116700
Approved by: https://github.com/aakhundov
2024-01-03 23:43:44 +00:00
a75b587803 [codemod] markDynamoStrictTest torch_np/numpy_tests/fft/test_helper (#116654)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116654
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116638, #116673, #116639, #116640, #116641, #116642, #116643, #116644, #116645, #116646, #116647, #116648, #116649, #116650, #116651, #116652, #116653
2024-01-03 23:03:06 +00:00
f3e2661555 [codemod] markDynamoStrictTest torch_np/numpy_tests/fft/test_pocketfft (#116653)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116653
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116638, #116673, #116639, #116640, #116641, #116642, #116643, #116644, #116645, #116646, #116647, #116648, #116649, #116650, #116651, #116652
2024-01-03 23:03:06 +00:00
bf4c1a3d66 [codemod] markDynamoStrictTest torch_np/numpy_tests/lib/test_arraypad (#116652)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116652
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116638, #116673, #116639, #116640, #116641, #116642, #116643, #116644, #116645, #116646, #116647, #116648, #116649, #116650, #116651
2024-01-03 23:03:06 +00:00
f4168c0e2e [codemod] markDynamoStrictTest torch_np/numpy_tests/lib/test_arraysetops (#116651)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116651
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116638, #116673, #116639, #116640, #116641, #116642, #116643, #116644, #116645, #116646, #116647, #116648, #116649, #116650
2024-01-03 23:03:06 +00:00
dab1599d81 [codemod] markDynamoStrictTest torch_np/numpy_tests/lib/test_function_base (#116650)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116650
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116638, #116673, #116639, #116640, #116641, #116642, #116643, #116644, #116645, #116646, #116647, #116648, #116649
2024-01-03 23:03:06 +00:00
8a76c07b98 [threaded pg] add devices to avoid seeing warnings (#116678)
This PR adds devices to register_backend of the multithreaded pg, to avoid
seeing tons of warnings.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116678
Approved by: https://github.com/awgu, https://github.com/XilunWu
ghstack dependencies: #116426, #116559, #116573
2024-01-03 23:01:19 +00:00
b10cb168a7 [tp] disable some assertion temporarily for torch.compile (#116573)
Disable some runtime assertions for now, as they do not work properly with
torch.compile. I'll have a follow-up fix in dynamo and will then re-enable
this check.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116573
Approved by: https://github.com/awgu, https://github.com/XilunWu
ghstack dependencies: #116426, #116559
2024-01-03 23:01:19 +00:00
7309f6fdf0 Remove hardcoding arch to arm64 (#116680)
https://github.com/pytorch/pytorch/pull/116627 hardcodes the arch to arm64 and it's failing on x86 GitHub runners (yup, they are still there on periodic, we haven't pulled the plug yet).

https://github.com/pytorch/pytorch/actions/runs/7392059632/job/20112760709#step:2:12 is an example failure.

There is no need to set the arch here because it has already been set earlier in the workflow https://github.com/pytorch/pytorch/blob/main/.github/workflows/_mac-test.yml#L47

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116680
Approved by: https://github.com/seemethere
2024-01-03 22:42:14 +00:00
f6be25bae6 [inductor] Add shape checks to ExpandView (#113839)
Currently `ExpandView` doesn't check that the expanded shape is valid, which may
allow bugs to slip through and cause silent correctness issues.
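
For reference, eager `Tensor.expand` already enforces the rule the new check mirrors: only size-1 dimensions may be expanded (a small eager sketch, not the Inductor code):

```python
import torch

x = torch.randn(1, 4)
ok = x.expand(3, 4)       # valid: the size-1 dim is broadcast to 3
try:
    x.expand(3, 5)        # invalid: a non-1 dim must match the target size
except RuntimeError as e:
    print("expand rejected:", e)
```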

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113839
Approved by: https://github.com/ezyang
2024-01-03 22:31:43 +00:00
1c69d0bdb5 Revert "[11/N] Enable clang-tidy warnings on c10/util/*.h (#116353)"
This reverts commit 37aae5932c26c3729d68b6ebdf00e618fe229b1c.

Reverted https://github.com/pytorch/pytorch/pull/116353 on behalf of https://github.com/izaitsevfb due to Reverting, breaks internal builds: error: implicit conversion from 'long long' to 'float' may lose precision [-Werror,-Wimplicit-int-float-conversion] ([comment](https://github.com/pytorch/pytorch/pull/116353#issuecomment-1876045800))
2024-01-03 22:22:11 +00:00
0aa50909f3 Revert "[12/N] Apply clang-tidy and fix warnings in headers of torch/csrc (#116486)"
This reverts commit 5aa258eb09d5ecd62aea4d2bd02bbfa5eda0d554.

Reverted https://github.com/pytorch/pytorch/pull/116486 on behalf of https://github.com/izaitsevfb due to Reverting, as it depends on https://github.com/pytorch/pytorch/pull/116353, which has to be reverted ([comment](https://github.com/pytorch/pytorch/pull/116486#issuecomment-1876042948))
2024-01-03 22:18:54 +00:00
791db94c62 Revert "[13/N] Enable clang-tidy on headers of torch/csrc (#116560)"
This reverts commit b0629cdd67ea5dd264250262e0af75579ed26952.

Reverted https://github.com/pytorch/pytorch/pull/116560 on behalf of https://github.com/izaitsevfb due to Reverting, as it depends on #116353, which has to be reverted ([comment](https://github.com/pytorch/pytorch/pull/116560#issuecomment-1876033363))
2024-01-03 22:08:40 +00:00
71523c2289 Add 116583 to .git-blame-ignore-revs (#116676)
since #116583 is purely cosmetic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116676
Approved by: https://github.com/janeyx99
2024-01-03 19:37:31 +00:00
9693b3740b [easy] [c10d] Add documentation for the device_id parameter for init_process_group (#116222)
Follow-up to add missing docs for #114916
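
A minimal usage sketch of the documented parameter, assuming a typical torchrun-style launch where the rendezvous environment variables are already set (names here are illustrative):

```python
import torch
import torch.distributed as dist

def init(rank: int, world_size: int) -> None:
    # Bind this process group to a specific device up front; device_id takes a
    # torch.device (not a bare index) and lets the NCCL backend set up eagerly.
    torch.cuda.set_device(rank)
    dist.init_process_group(
        backend="nccl",
        rank=rank,
        world_size=world_size,
        device_id=torch.device(f"cuda:{rank}"),
    )
```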

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116222
Approved by: https://github.com/kwen2501, https://github.com/fduwjj
2024-01-03 19:32:18 +00:00
f543093e06 [ONNX] Fix output mismatch issue of repeat_interleave when dim is None (#116689)
'input' is introduced but it's mixed up with 'self' in repeat_interleave, which causes the output mismatch issue.
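
For context, this is the `dim=None` behaviour the export has to match: the input is flattened before repeating (a small eager sketch, not the ONNX symbolic itself):

```python
import torch

x = torch.tensor([[1, 2], [3, 4]])
# dim=None: flatten first, then repeat each element
print(torch.repeat_interleave(x, 2))         # tensor([1, 1, 2, 2, 3, 3, 4, 4])
# explicit dim: repeat along that dimension, other dims are preserved
print(torch.repeat_interleave(x, 2, dim=1))  # tensor([[1, 1, 2, 2], [3, 3, 4, 4]])
```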

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116689
Approved by: https://github.com/thiagocrepaldi
2024-01-03 18:38:00 +00:00
68105da229 Revert "[Dynamo] Trace autograd.function in dynamo when inputs require grad (#116358)"
This reverts commit 97891b184c12763f335fbe1ff63fab843edafab5.

Reverted https://github.com/pytorch/pytorch/pull/116358 on behalf of https://github.com/izaitsevfb due to Breaks internal accuracy test, see D52491095, pytorch/benchmark/fb/test_gpu:run_test_gpu - test_train_ig_feed_over_inductor_accuracy  ([comment](https://github.com/pytorch/pytorch/pull/116358#issuecomment-1875779697))
2024-01-03 18:20:51 +00:00
68b77311ad Fix bug in non-strict input processor (#116674)
Summary: Title

Test Plan: CI

Reviewed By: zhxchen17

Differential Revision: D52499932

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116674
Approved by: https://github.com/tugsbayasgalan
2024-01-03 18:13:25 +00:00
1429c204f8 Increase hub download chunk size (#116536)
This PR increases the read size for the `hub.download_url_to_file` function from 8,192 bytes to 131,072 bytes (128 * 1,024), as reading in larger chunks should be more efficient. The size could probably be larger still, at the expense of the progress bar not getting updated as often.

It re-introduces use of the `READ_DATA_CHUNK` constant that was originally used for this purpose in 4a3baec961 and since forgotten.
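
A minimal sketch of the chunked read loop this tunes (illustrative only; the real `hub.download_url_to_file` also handles hashing, progress reporting, and temp files):

```python
import urllib.request

READ_DATA_CHUNK = 128 * 1024  # 131,072 bytes, the new read size

def download_to_file(url: str, dst: str) -> None:
    with urllib.request.urlopen(url) as response, open(dst, "wb") as f:
        while True:
            chunk = response.read(READ_DATA_CHUNK)  # far fewer reads than 8 KiB chunks
            if not chunk:
                break
            f.write(chunk)
```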

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116536
Approved by: https://github.com/NicolasHug
2024-01-03 17:38:45 +00:00
c919935cb7 [export] Update schema versioning format. (#116462)
Summary: Update the old versioning scheme to a major and minor version.

Test Plan: CI

Differential Revision: D52431963

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116462
Approved by: https://github.com/tugsbayasgalan
2024-01-03 17:34:58 +00:00
2ae55e99fe [release] Add Launch Execution XFN meeting process to release runbook (#116701)
Make sure we have this process documented in the runbook.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116701
Approved by: https://github.com/seemethere
2024-01-03 17:16:18 +00:00
d2fc00d2cc [codemod] markDynamoStrictTest torch_np/numpy_tests/lib/test_histograms (#116649)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116649
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116638, #116673, #116639, #116640, #116641, #116642, #116643, #116644, #116645, #116646, #116647, #116648
2024-01-03 17:00:32 +00:00
2d1011d84f [codemod] markDynamoStrictTest torch_np/numpy_tests/lib/test_index_tricks (#116648)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116648
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116638, #116673, #116639, #116640, #116641, #116642, #116643, #116644, #116645, #116646, #116647
2024-01-03 17:00:32 +00:00
c47ab693ff [codemod] markDynamoStrictTest torch_np/numpy_tests/lib/test_shape_base_ (#116647)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116647
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116638, #116673, #116639, #116640, #116641, #116642, #116643, #116644, #116645, #116646
2024-01-03 17:00:23 +00:00
6a300bd1c6 [codemod] markDynamoStrictTest torch_np/numpy_tests/lib/test_twodim_base (#116646)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116646
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116638, #116673, #116639, #116640, #116641, #116642, #116643, #116644, #116645
2024-01-03 17:00:13 +00:00
34a8c64c92 [codemod] markDynamoStrictTest torch_np/numpy_tests/lib/test_type_check (#116645)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116645
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116638, #116673, #116639, #116640, #116641, #116642, #116643, #116644
2024-01-03 17:00:07 +00:00
fe287af812 [codemod] markDynamoStrictTest torch_np/numpy_tests/linalg/test_linalg (#116644)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116644
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116638, #116673, #116639, #116640, #116641, #116642, #116643
2024-01-03 16:59:59 +00:00
28a8e4bdb6 [codemod] markDynamoStrictTest torch_np/test_basic (#116643)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116643
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116638, #116673, #116639, #116640, #116641, #116642
2024-01-03 16:59:50 +00:00
146426a0df [codemod] markDynamoStrictTest torch_np/test_binary_ufuncs (#116642)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116642
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116638, #116673, #116639, #116640, #116641
2024-01-03 16:59:41 +00:00
efe3b7f457 [codemod] markDynamoStrictTest torch_np/test_dtype (#116641)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116641
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116638, #116673, #116639, #116640
2024-01-03 16:59:32 +00:00
d760014b9f [codemod] markDynamoStrictTest torch_np/test_function_base (#116640)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116640
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116638, #116673, #116639
2024-01-03 16:59:25 +00:00
efee9e689e [codemod] markDynamoStrictTest torch_np/test_ndarray_methods (#116639)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116639
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116638, #116673
2024-01-03 16:59:19 +00:00
608091e4d1 [codemod] markDynamoStrictTest torch_np/numpy_tests/core/test_multiarray (#116673)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116673
Approved by: https://github.com/voznesenskym
ghstack dependencies: #116638
2024-01-03 16:59:12 +00:00
70eb53505b [export] Update range constraints to runtime_var_to_range (#115427)
Updated range_constraints to be the union of shape_env.var_to_range and shape_env.runtime_var_to_range, with shape_env.runtime_var_to_range taking priority.

Due to 0/1 specialization, if we bound an unbacked symint to be less than 5, the range of possible values for this symint is actually recorded as [2, 5] in shape_env.var_to_range. To fix this so that users will be able to see a more understandable range of [0, 5], shape_env.runtime_var_to_range was created to store the range of [0, 5]. Since range_constraints is a user-facing attribute to query the ranges of certain symints, we want to use shape_env.runtime_var_to_range to get the unbacked symints ranges, rather than shape_env.var_to_range.

Additionally, run_decompositions() has an issue where it will always add assertions to the graph, even if a previous run has already added the assertions. So, I added a part to the AddRuntimeAssertionsForInlineConstraints which will store which assertions have already been added.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115427
Approved by: https://github.com/zhxchen17
2024-01-03 16:55:04 +00:00
f081c45a34 Add out_dtype support for sparse semi-structured CUTLASS back-end (#116519)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116519
Approved by: https://github.com/cpuhrsch
2024-01-03 16:23:17 +00:00
ba06951c66 [BE] [cuDNN] Always build assuming cuDNN >= 8.1 (#95722)
<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at 27084ed</samp>

This pull request simplifies and cleans up the code that uses the cuDNN library for convolution, batch normalization, CTC loss, and quantized operations. It removes the unnecessary checks and conditions for older cuDNN versions and the experimental cuDNN v8 API, and ~~replaces them with the stable `cudnn_frontend` API that requires cuDNN v8 or higher. It also adds the dependency and configuration for the `cudnn_frontend` library in the cmake and bazel files.~~ Correction: The v7 API will still be available with this PR, and can still be used, without any changes to the defaults. This change simply always _builds_ the v8 API, and removes the case where _only_ the v7 API is built.

This is a re-land of https://github.com/pytorch/pytorch/pull/91527

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95722
Approved by: https://github.com/malfet, https://github.com/atalman
2024-01-03 15:41:28 +00:00
3407541b0c add cpu inductor merge rule (#116679)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116679
Approved by: https://github.com/huydhn
2024-01-03 15:09:36 +00:00
b57d473091 [codemod] markDynamoStrictTest torch_np/test_nep50_examples (#116638)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116638
Approved by: https://github.com/bdhirsh
2024-01-03 14:45:43 +00:00
49de03f0fd adapt to other acceleration devices (#116682)
Fixes #116504

When this API is invoked, a runtime error occurs: when an NPU acceleration device is used, the input tensor is not processed in one branch, so some input tensors end up on the CPU while others are on the NPU, and an error is reported.
Here, I adapt the code to other acceleration devices and move tensors that are on the acceleration device to the CPU. It's tested and works.

The details are in the issue: #116504

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116682
Approved by: https://github.com/lezcano
2024-01-03 12:41:19 +00:00
c1b88723f8 Fix buck build after recent clang-tidy updates (#116669)
Broken after either https://github.com/pytorch/pytorch/pull/116486 or https://github.com/pytorch/pytorch/pull/116353 I think.  Here is an example build failure 0bc21c6a6b
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116669
Approved by: https://github.com/Skylion007
2024-01-03 09:02:58 +00:00
2a87ab4508 Refactor some tests by using TEST_CUDA & TEST_MULTIGPU instead (#116083)
As https://github.com/pytorch/pytorch/pull/116014#discussion_r1430510759 stated, refactor some related tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116083
Approved by: https://github.com/fduwjj
2024-01-03 08:53:59 +00:00
d9c0e37bab [2d] unflatten_tensor on compute stream for DTensorExtension (#116559)
Context: the existing FSDPExtension has a bug in the case where the
unflatten-tensor logic involves compute/communication on a cuda stream.
Currently the FSDPExtension unflatten-tensor logic runs in the
unshard stream, which makes the runtime lose sync with the compute stream;
if there are dependencies between the compute stream and the
unflatten-tensor logic, the sync point is lost, which could
possibly lead to NaNs.

This PR makes the FSDPExtension record the compute stream and lets the
DTensorExtension directly use the compute stream for unflatten_tensor.

In the long term we might want to make the FSDP runtime logic only
perform the unshard in the unshard stream, and have the unshard views
happen in the compute stream. For now we fix this in the Extension
directly, as this is the simplest thing to do without affecting the FSDP
runtime logic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116559
Approved by: https://github.com/awgu, https://github.com/fduwjj, https://github.com/yifuwang
ghstack dependencies: #116426
2024-01-03 07:29:08 +00:00
29674b8e1d [dtensor] fix dtensor _to_copy op for mix precision (#116426)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116426
Approved by: https://github.com/fduwjj
2024-01-03 07:29:08 +00:00
b0749bce6c [export] Allow None as the meta value for tensor output. (#116664)
Summary: Sometimes we will get a None value from ops that return a Tensor type in the schema. Allow this case during serialization.

Test Plan: test__scaled_dot_product_flash_attention

Differential Revision: D52491668

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116664
Approved by: https://github.com/SherlockNoMad
2024-01-03 07:07:39 +00:00
3fe437b24b [BE]: Update flake8 to v6.1.0 and fix lints (#116591)
Updates flake8 to v6.1.0 and fixes a few lints using sed and some ruff tooling.
- Replace `assert(0)` with `raise AssertionError()`
- Remove extraneous parenthesis i.e.
  - `assert(a == b)` -> `assert a == b`
  - `if(x > y or y < z):`->`if x > y or y < z:`
  - And `return('...')` -> `return '...'`

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116591
Approved by: https://github.com/albanD, https://github.com/malfet
2024-01-03 06:04:44 +00:00
09ee96b69d [MPS] Fix CrossEntropyLoss for float16 (#116597)
Looks like neither [`divisionNoNaNWithPrimaryTensor:`](https://developer.apple.com/documentation/metalperformanceshadersgraph/mpsgraph/3675593-divisionnonanwithprimarytensor) nor `oneHotWithIndicesTensor:` works for `MPSDataTypeFloat16`, so provide an explicit cast for one-hot tensor and alternative implementation using the formula from the official doc, i.e.
> `resultTensor = select(secondaryTensor, primaryTensor / secondaryTensor, 0)`
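
In eager-mode terms, the cited formula amounts to the following sketch (a hypothetical helper, not the Metal graph code):

```python
import torch

def division_no_nan(primary: torch.Tensor, secondary: torch.Tensor) -> torch.Tensor:
    # Divide everywhere, then select 0 wherever the divisor is zero,
    # mirroring select(secondaryTensor, primaryTensor / secondaryTensor, 0).
    return torch.where(secondary != 0, primary / secondary, torch.zeros_like(primary))

p = torch.tensor([1.0, 2.0, 3.0], dtype=torch.float16)
s = torch.tensor([2.0, 0.0, 4.0], dtype=torch.float16)
print(division_no_nan(p, s))  # [0.5, 0.0, 0.75] in float16
```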

Alas, at the moment  it can not be tested via `test_modules.py` as it runs only `torch.float32` and `torch.float64` tests (and `torch.half` implementation is not available for CPU)

Fixes https://github.com/pytorch/pytorch/issues/116095

TODO: Enable testing via TestModules, but will do in separate PR
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116597
Approved by: https://github.com/kulinseth
2024-01-03 05:58:26 +00:00
75359934bd [C10D] Improve Heartbeat Monitor exit logs (#116268) (#116661)
Summary:

- add workMetaList_.size() so we know how many outstanding works there
  were when killing
- Print our first log before debuginfo dump instead of after, since it
  is clearer when reading the logs that we time out and then dump
- Organize the log strings- put them near where they are used

cc mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse H-Huang kwen2501 awgu penguinwu fegin XilunWu wanchaol fduwjj wz337 tianyu-l yf225

imported-using-ghimport

Test Plan: Imported from OSS

Reviewed By: fduwjj

Differential Revision: D52369167

Pulled By: wconstab

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116661
Approved by: https://github.com/fduwjj
2024-01-03 05:35:06 +00:00
1ae39a372e Inductor cpp wrapper: fix cumsum codegen (#116171)
Fixes https://github.com/pytorch/pytorch/issues/115829

For `cumsum(Tensor self, int dim, *, ScalarType? dtype=None) -> Tensor`, `dim` is not a `kwarg_only` argument, but it could be provided as a kwarg when calling this OP.
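
Both of the following spellings are valid calls that the cpp wrapper codegen needs to handle identically (a minimal repro-style sketch):

```python
import torch

x = torch.randn(3, 4)
# `dim` is positional in the schema, but callers may still pass it by keyword.
a = torch.cumsum(x, 1)
b = torch.cumsum(x, dim=1)
assert torch.equal(a, b)
```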

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116171
Approved by: https://github.com/jgong5, https://github.com/desertfire, https://github.com/jansel
2024-01-03 05:33:17 +00:00
ef98987017 Fix user input mutations for run_decompositions (#116382)
Fixes #115106

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116382
Approved by: https://github.com/angelayi
2024-01-03 05:04:22 +00:00
c5bd88b56a [export] Improve serialization of union types. (#116511)
Summary:
Making union types harder to use wrong:
1. Initialize unset fields still with None, but we don't assert on the uniqueness of the non-None field, since it's possible to set a real field to None.
2. Raise error on unset fields in union, reducing the error surface and enforcing type safety.
3. Serialize union type with only tag and omit all the unset fields, this makes the serialized model more readable and debuggable.

Test Plan:
buck test mode/opt caffe2/test:test_export
buck test mode/opt executorch/exir/...
buck test mode/opt mode/inplace aps_models/ads/icvr/tests:export_test

Differential Revision: D52446586

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116511
Approved by: https://github.com/angelayi
2024-01-03 04:58:59 +00:00
ca4df16fdd [c10d] Make DebugInfoWriter Singleton across all PG objects (#116489)
Previously, we had the writer registered to each NCCL PG (backend), so for every pg there is a NCCL PG instance; if a customized writer is used while multiple sub-PGs exist, the user has to register the writer for every backend, which is bad UX. Furthermore, the debug info is global, so it does not make sense to have a writer per instance. We even have a static mutex in `dumpDebuggingInfo` to serialize the writes, which makes it more obvious that we can make the writer a singleton so that there is only one writer instance for all PG instances.

Although the rationale is clear, the implementation may vary a lot. So this PR is RFC for now to see if this implementation makes sense or not.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116489
Approved by: https://github.com/kwen2501
2024-01-03 03:42:54 +00:00
41f265b06a [quant][pt2e] Preserve numeric_debug_handle in quantization flows (#116477)
Summary:
We introduced `node.meta["numeric_debug_handle"]` in https://github.com/pytorch/pytorch/pull/114315 to
indicate the numeric debug handle for values in the graph. In this PR we support preserving this field
in prepare and convert so that we can use it for numerical debugging.

Next: we also want to preserve these in deepcopy of GraphModule as well

Test Plan:
python test/test_quantization.py -k test_quantize_pt2e_preserve_handle

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116477
Approved by: https://github.com/tugsbayasgalan
2024-01-03 03:39:00 +00:00
f73b1b9388 [EZ] Update lxml dependency to 5.0.0 (#116657)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116657
Approved by: https://github.com/atalman
2024-01-03 02:57:31 +00:00
6e9ca2f220 Enable eye on CPU for bfloat16 dtype (#116616)
Fixes https://github.com/pytorch/pytorch/issues/116609
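
With this change the following now works on CPU (previously it raised because no bfloat16 kernel was registered for `eye`):

```python
import torch

i = torch.eye(3, dtype=torch.bfloat16)  # identity matrix in bfloat16 on CPU
print(i.dtype)  # torch.bfloat16
```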

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116616
Approved by: https://github.com/Skylion007
2024-01-03 02:53:27 +00:00
5005f36c12 Clean up files under fb/vulkan/... (#116665)
Remove files accidentally imported in https://github.com/pytorch/pytorch/pull/114712
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116665
Approved by: https://github.com/izaitsevfb, https://github.com/seemethere
2024-01-03 01:55:32 +00:00
3ac0aaf478 [codemod] markDynamoStrictTest torch_np/test_random (#116637)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116637
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116632, #116634, #116635, #116636
2024-01-03 00:51:36 +00:00
884e449753 [codemod] markDynamoStrictTest torch_np/test_reductions (#116636)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116636
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116632, #116634, #116635
2024-01-03 00:51:36 +00:00
8ec606d4c5 [codemod] markDynamoStrictTest torch_np/test_scalars_0D_arrays (#116635)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116635
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116632, #116634
2024-01-03 00:51:36 +00:00
9b27fcf65a [codemod] markDynamoStrictTest torch_np/test_ufuncs_basic (#116634)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116634
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116632
2024-01-03 00:51:36 +00:00
0ce32ce409 [codemod] markDynamoStrictTest torch_np/test_unary_ufuncs (#116632)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116632
Approved by: https://github.com/bdhirsh
2024-01-03 00:51:36 +00:00
a1191ce4bf optimize (u)int8 vectorized operator* (#116235)
Summary: optimize (u)int8 vectorized operator*

Test Plan: sandcastle github

Differential Revision: D52318192

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116235
Approved by: https://github.com/hl475, https://github.com/malfet
2024-01-03 00:50:23 +00:00
0f6f582c0d Add config to disable TransformerEncoder/MHA fastpath (#112212)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112212
Approved by: https://github.com/jbschlosser
2024-01-02 23:59:30 +00:00
9dc68d1aa9 clangformat: fused adam (#116583)
Apply clangformat to fused adam/adamw files.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116583
Approved by: https://github.com/janeyx99
2024-01-02 22:30:23 +00:00
3ff4572fe7 delete sharded tensor from fsdp/tp tests (#116244)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116244
Approved by: https://github.com/awgu, https://github.com/wz337, https://github.com/fduwjj
ghstack dependencies: #116122
2024-01-02 22:11:36 +00:00
dfccaac31b [2d] Ensure gradient clear out pending AsyncCollectiveTensor in FSDP Extension (#116122)
As titled, this PR adds a gradient hook to the FSDP DTensor extension to check whether there are gradients that are AsyncCollectiveTensors; if there are, we eagerly wait on them there.

This is needed because a parameter's gradient might still be pending as an AsyncCollectiveTensor; if we feed it to FSDP directly, FSDP would use the ACT's storage to do the reduce_scatter, which is wrong.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116122
Approved by: https://github.com/awgu, https://github.com/fduwjj
2024-01-02 22:11:36 +00:00
a2061ceefe ci: Output runner OS / HW for macOS (#116627)
It's difficult to debug these since there's no record of what OS / HW we're
running CI on, so output it so we can have a better understanding here.

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116627
Approved by: https://github.com/janeyx99
2024-01-02 22:05:53 +00:00
640d46f823 [inductor] Control the cpp_wrapper mode with an env variable (#116615)
Summary: also add one model test for the cpp_wrapper mode on CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116615
Approved by: https://github.com/angelayi
2024-01-02 21:50:25 +00:00
295bdaafb7 [codemod] markDynamoStrictTest test_module_init (#116625)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116625
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116618, #116619, #116621, #116622, #116624
2024-01-02 20:55:48 +00:00
074dfc2648 [codemod] markDynamoStrictTest test_linalg (#116624)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116624
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116618, #116619, #116621, #116622
2024-01-02 20:55:48 +00:00
5d8e066f6b [codemod] markDynamoStrictTest test_indexing (#116622)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116622
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116618, #116619, #116621
2024-01-02 20:55:39 +00:00
fc7546e9db [codemod] markDynamoStrictTest test_functional_optim (#116621)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116621
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116618, #116619
2024-01-02 20:55:31 +00:00
88d1638139 [codemod] markDynamoStrictTest test_autograd_fallback (#116619)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116619
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116618
2024-01-02 20:55:21 +00:00
39339df8d7 [codemod] markDynamoStrictTest test_autocast (#116618)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116618
Approved by: https://github.com/bdhirsh
2024-01-02 20:54:24 +00:00
0bc21c6a6b [C10d] Fix Log Prefix in NCCLPG so that each instance gets its own prefix (#116520)
Somehow the log prefix only ever shows ProcessGroup 0 rank [global rank]. This does not give the expected result, since the comment says it should be "a prefix that is unique to this process group and rank". So this PR fixes it and makes the prefix different for different subPGs.

The reason is that we made the prefix static, so it is shared across all NCCLPG instances, and whoever calls this function first sets `rank_` and `uid_` in the prefix. We always initialize PG 0 first, which is why we always see PG[0] + global ranks for all subPGs.

<img width="484" alt="image" src="https://github.com/pytorch/pytorch/assets/6937752/7fbb0226-7e25-4306-9cee-22e17b00bc8e">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116520
Approved by: https://github.com/wconstab
ghstack dependencies: #116218
2024-01-02 20:23:58 +00:00
6d8d3c1334 add a DTensor test for weight tying (#116475)
Weight tying is useful when we'd like to share weights (and their gradients) between two modules, e.g. the word/token embedding module and the output linear module in language models. This test demonstrates that with DTensor it can be achieved just as with normal tensor, e.g. using `model.fc.weight = model.embedding.weight`.
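
For reference, this is the plain-module version of the tying pattern the test exercises; with DTensor the same attribute assignment applies, only the parameters are DTensors (a sketch with made-up module names):

```python
import torch.nn as nn

class TinyLM(nn.Module):
    def __init__(self, vocab: int = 100, dim: int = 16):
        super().__init__()
        self.embedding = nn.Embedding(vocab, dim)
        self.fc = nn.Linear(dim, vocab, bias=False)
        # Tie the output projection to the embedding: both modules now share
        # a single parameter and therefore a single gradient.
        self.fc.weight = self.embedding.weight

model = TinyLM()
assert model.fc.weight is model.embedding.weight
```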

To test: `python test/distributed/tensor/parallel/test_tp_examples.py -k test_weight_tying`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116475
Approved by: https://github.com/wanchaol, https://github.com/fduwjj
2024-01-02 20:19:36 +00:00
fb5a9f2f5c Fix implicit conversion to double (#116614)
Summary:
Forward fix for https://github.com/pytorch/pytorch/pull/116185 / D52390113

Error:
```
xplat/caffe2/c10/util/order_preserving_flat_hash_map.h:602:23: error: implicit conversion from 'uint64_t' (aka 'unsigned long long') to 'double' may lose precision [-Werror,-Wimplicit-int-float-conversion]
[CONTEXT]             std::ceil(num_elements / static_cast<double>(_max_load_factor))));
[CONTEXT]                       ^~~~~~~~~~~~ ~
xplat/caffe2/c10/util/order_preserving_flat_hash_map.h:923:22: error: implicit conversion from 'uint64_t' (aka 'unsigned long long') to 'double' may lose precision [-Werror,-Wimplicit-int-float-conversion]
[CONTEXT]         num_elements + 1 >
[CONTEXT]         ~~~~~~~~~~~~~^~~ ~
xplat/caffe2/c10/util/order_preserving_flat_hash_map.h:924:34: error: implicit conversion from 'uint64_t' (aka 'unsigned long long') to 'double' may lose precision [-Werror,-Wimplicit-int-float-conversion]
[CONTEXT]             (num_slots_minus_one + 1) * static_cast<double>(_max_load_factor)) {
[CONTEXT]              ~~~~~~~~~~~~~~~~~~~~^~~  ~
xplat/caffe2/c10/util/order_preserving_flat_hash_map.h:923:22: error: implicit conversion from 'uint64_t' (aka 'unsigned long long') to 'double' may lose precision [-Werror,-Wimplicit-int-float-conversion]
[CONTEXT]         num_elements + 1 >
[CONTEXT]         ~~~~~~~~~~~~~^~~ ~
xplat/caffe2/c10/util/order_preserving_flat_hash_map.h:924:34: error: implicit conversion from 'uint64_t' (aka 'unsigned long long') to 'double' may lose precision [-Werror,-Wimplicit-int-float-conversion]
[CONTEXT]             (num_slots_minus_one + 1) * static_cast<double>(_max_load_factor)) {
[CONTEXT]              ~~~~~~~~~~~~~~~~~~~~^~~  ~
xplat/caffe2/c10/util/order_preserving_flat_hash_map.h:923:22: error: implicit conversion from 'uint64_t' (aka 'unsigned long long') to 'double' may lose precision [-Werror,-Wimplicit-int-float-conversion]
[CONTEXT]         num_elements + 1 >
[CONTEXT]         ~~~~~~~~~~~~~^~~ ~
xplat/caffe2/c10/util/order_preserving_flat_hash_map.h:924:34: error: implicit conversion from 'uint64_t' (aka 'unsigned long long') to 'double' may lose precision [-Werror,-Wimplicit-int-float-conversion]
[CONTEXT]             (num_slots_minus_one + 1) * static_cast<double>(_max_load_factor)) {
```

Fixed by casting int parts to double explicitly.

Test Plan: SC

Differential Revision: D52482968

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116614
Approved by: https://github.com/jeanschmidt, https://github.com/seemethere
2024-01-02 20:08:51 +00:00
77d979f748 Autograd attaches logging hooks only in debug level (#116522)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116522
Approved by: https://github.com/albanD
2024-01-02 20:06:18 +00:00
b18d8d4595 Add a wrapper to transform a NumPy function into a PyTorch function (#114610)
A less general version of this wrapper was used in the keynote on
`torch.compile(numpy)`. We expose a generic version of the wrapper
that works seamlessly with `torch.compile`.
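
A short usage sketch, assuming the wrapper is exposed as `torch.compiler.wrap_numpy` (the decorated function is written against `np.ndarray`s but is fed and returns `torch.Tensor`s at the call site):

```python
import numpy as np
import torch

@torch.compile
@torch.compiler.wrap_numpy
def numpy_fn(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    return np.sum(x * y, axis=-1)

x = torch.randn(4, 8)
y = torch.randn(4, 8)
out = numpy_fn(x, y)  # traced through torch.compile, returns a torch.Tensor
```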

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114610
Approved by: https://github.com/albanD
2024-01-02 18:35:29 +00:00
be455921f5 Fix missing words in README.md (#116606)
minor fix to wording

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116606
Approved by: https://github.com/Skylion007
2024-01-02 18:24:58 +00:00
95a86ed9ca [Quant] Add int8 linear op gelu for quantization PT2E with Inductor. input is an int8 CPU tensor; weight is an int8 MkldnnCPU tensor (#114852)
**Summary**
Enable Int8 Linear Gelu post operator fusions for Stock PyTorch Inductor. The input is an int8 CPU tensor and weight is an int8 MkldnnCPU tensor.

**Test plan**
python test/test_quantization.py -k test_qlinear_gelu_pt2e

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114852
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
2024-01-02 08:11:26 +00:00
a81edf9f23 [inductor] Fix cpp_wrapper codegen for ir.ComplexView (#116481)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116481
Approved by: https://github.com/htyu
2024-01-02 05:38:58 +00:00
cyy
b0629cdd67 [13/N] Enable clang-tidy on headers of torch/csrc (#116560)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116560
Approved by: https://github.com/Skylion007, https://github.com/albanD
2024-01-02 05:33:04 +00:00
1ed8efa9b3 [MPS] Speedup addmm (#116548)
- Do not copy bias to the output
- Skip the respective multiplication op if either alpha or beta is equal to 1.0
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116548
Approved by: https://github.com/albanD
ghstack dependencies: #116547
2024-01-02 00:43:37 +00:00
abd80cbb15 [Inductor] Decompose bmm if batch2's last dim size is 1 and coordinate_descent_tuning is enabled (#116582)
We found this perf optimization opportunity at https://github.com/pytorch-labs/gpt-fast/pull/71. This would bring 5%+ perf gain for Mixtral 8x7B on gpt-fast.
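
The mathematical identity the decomposition relies on (a sketch of the equivalence, not Inductor's actual lowering): when `batch2` has a trailing dimension of 1, the bmm collapses to a broadcasted multiply plus a reduction.

```python
import torch

b, m, k = 8, 16, 32
A = torch.randn(b, m, k)
B = torch.randn(b, k, 1)

out_bmm = torch.bmm(A, B)
# Equivalent pointwise multiply + sum over k, which avoids a degenerate GEMM.
out_decomposed = (A * B.transpose(-1, -2)).sum(-1, keepdim=True)
torch.testing.assert_close(out_bmm, out_decomposed)
```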

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116582
Approved by: https://github.com/lezcano
2024-01-01 21:24:02 +00:00
4ffe1fb7f4 [BE]: Improve typing to respect ruff PYI058 (#116588)
Tried out rule PYI058 and it flagged one typing recommendation in our codebase that would be better to fix.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116588
Approved by: https://github.com/malfet, https://github.com/kit1980
2024-01-01 20:49:55 +00:00
cf618452d3 [BE]: Fix F821 error in torch/fx/experimental (#116587)
Fix F821 error in torch/fx/experimental. Fixes a bug I did not fix in #116579
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116587
Approved by: https://github.com/kit1980
2024-01-01 19:45:49 +00:00
035e55822a vulkan: fix gcc build errors (#115976)
Fixes #96617

There was already an attempt to fix this build issue - see #96618. One commit is reused from that attempt (@zboszor) with adjustments to the commit message. Another one differs and takes into account the provided review feedback (@ezyang).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115976
Approved by: https://github.com/ezyang
2024-01-01 11:10:42 +00:00
4451ca068c [xla hash update] update the pinned xla hash (#116388)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116388
Approved by: https://github.com/pytorchbot
2024-01-01 10:30:59 +00:00
bd10fea79a [BE]: Enable F821 and fix bugs (#116579)
Fixes #112371

I tried to fix as many of the bugs as I could; for a few of them I could not figure out the proper fix, so I left them with noqas.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116579
Approved by: https://github.com/ezyang
2024-01-01 08:40:46 +00:00
6c02520466 Remove unneeded comment and link for BuildExtension (#115496)
`BuildExtension` is no longer derived from `object`, but from `build_ext`. Py2 is also deprecated, so this comment wouldn't be required anyway.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115496
Approved by: https://github.com/Skylion007
2024-01-01 08:29:48 +00:00
db752f2f1a Pin the version of expecttest to 0.1.6 in requirements.txt (#116238)
Version 0.2.0 of expecttest removed the `ACCEPT` variable in this [PR](https://github.com/ezyang/expecttest/pull/11), so when someone installs the python dependencies using `pip install -r PyTorch_Root/requirements.txt`, the latest version of expecttest gets installed, which causes failures in some PyTorch tests. So pinning the version of expecttest to 0.1.6, like [this](db35ccf463/.ci/docker/requirements-ci.txt (L28)), is needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116238
Approved by: https://github.com/ezyang
2024-01-01 05:25:39 +00:00
60844ccc4f [MPS][BE] Refactor common code (#116566)
Introduce `mtl_setBuffer` and `mps_dispatch1DJob` and use them to bind a
Tensor to a metal kernel as well as dispatch the Metal job.

This avoids potential typos/bugs when one tries to bind a tensor to a
Metal kernel but forgets about the storage offset.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116566
Approved by: https://github.com/Skylion007
2024-01-01 04:58:18 +00:00
aec4377257 Optimize batch_norm_cpu_collect_stats_channels_last_impl when N <= num_threads (#113619)
Currently `batch_norm_cpu_collect_stats_channels_last_impl` uses a two-path reduction to vertically reduce from shape `{NHW, C}` to `{C}`. The first path is a reduction from `{NHW, C}` to an intermediate buffer `{num_threads, C}`. The second path is a reduction from `{num_threads, C}` to `{C}`.

Optimization is as follows:
- Add if/else path.

1. if `NHW > num_threads`, do the two-path reduction.
2. else `NHW <= num_threads`, do single-path reduction -- `NHW` is small enough that there is no need to first reduce to intermediate buffer.

- Moreover when `NHW <= num_threads`, use two methods, Method 1 and Method 2.
[Method 1](https://github.com/pytorch/pytorch/pull/113619/files#diff-e39a21a7125ac201b766a585b57ebf8429a7ac28cd723b09930aceb198fd25b0R372-R397): parallel on C, vertical reduce `{NHW, C} => {C}`
[Method 2](https://github.com/pytorch/pytorch/pull/113619/files#diff-e39a21a7125ac201b766a585b57ebf8429a7ac28cd723b09930aceb198fd25b0R325-R370): parallel on tiles of C, vectorized vertical reduce on each tile `{NHW, TILE_SIZE} => {TILE_SIZE}`

1. if `(num_threads == 1 || (C <= TILE_SIZE || C > THRESHOLD))`, use Method 2.
2. else, use Method 1.

- When `num_threads == 1`, there is no thread synchronization overhead, so it is better to use Method 2 than Method 1.
- When `C > THRESHOLD`, C is large enough that the benefit from tiling and vectorization outweigh the synchronization overhead.
- When `C <= TILE_SIZE`, the problem size is small enough (`C <= TILE_SIZE && NHW <= num_threads`) that it's better to launch single thread with vectorization than C threads without vectorization.
- `TILE_SIZE` is set to `16`.
- `THRESHOLD` is set to `2048`, it is an empirically found threshold to tile on C or not.

See comments for details.

### Performance

Perf data collected for C in range [2^1, 2^20], and (N,H,W) = (1,2,14) for all values of C. The values (N,H,W)=(1,2,14) were arbitrarily chosen to satisfy the condition NHW <= num_threads = 28.
Tested on 28 physical cores/socket, 1 socket on Skylake.

| **(N, H, W) = (1, 2, 14)** 	|                                                            	|               	|                                        	|
|----------------------------	|------------------------------------------------------------	|---------------	|----------------------------------------	|
|                            	| **Avg Latency (ms)**                                       	|               	|                                        	|
| **n_channel**              	| **Baseline (original implementation): two-path reduction** 	| **Optimized** 	| **Speedup Ratio (Optimized/Baseline)** 	|
| 1048576                    	| 13.67034435                                                	| 3.059654236   	| 4.467937649                            	|
| 524288                     	| 5.230793953                                                	| 0.840408802   	| 6.224106578                            	|
| 262144                     	| 2.131233215                                                	| 0.353398323   	| 6.030682876                            	|
| 131072                     	| 0.990390778                                                	| 0.213630199   	| 4.636005491                            	|
| 65536                      	| 0.422859192                                                	| 0.107388496   	| 3.937658186                            	|
| 32768                      	| 0.224406719                                                	| 0.075747967   	| 2.962544459                            	|
| 16384                      	| 0.143647194                                                	| 0.049884319   	| 2.879606175                            	|
| 8192                       	| 0.10917902                                                 	| 0.031619072   	| 3.452948273                            	|
| 4096                       	| 0.08869648                                                 	| 0.024063587   	| 3.685920935                            	|
| 2048                       	| 0.075721741                                                	| 0.022127628   	| 3.422045038                            	|
| 1024                       	| 0.06685257                                                 	| 0.018239021   	| 3.665359477                            	|
| 512                        	| 0.051283836                                                	| 0.017580986   	| 2.917005696                            	|
| 256                        	| 0.043172836                                                	| 0.020868778   	| 2.06877642                             	|
| 128                        	| 0.042669773                                                	| 0.018148422   	| 2.351156069                            	|
| 64                         	| 0.038774014                                                	| 0.015704632   	| 2.468954                               	|
| 32                         	| 0.038630962                                                	| 0.013871193   	| 2.784977656                            	|
| 16                         	| 0.027766228                                                	| 0.008444786   	| 3.287972897                            	|
| 8                          	| 0.019891262                                                	| 0.007579327   	| 2.624410192                            	|
| 4                          	| 0.018217564                                                	| 0.008151531   	| 2.234863995                            	|
| 2                          	| 0.017716885                                                	| 0.008127689   	| 2.179818128                            	|

### Single Thread Performance
Perf data collected for C in range [2^1, 2^20], and (N,H,W) = (1,1,1) for all values of C. Values of (N,H,W)=(1,1,1) were chosen to satisfy the condition NHW <= num_threads = 1 for single thread performance.
Tested on 1 physical core/socket, 1 socket on Skylake.

| **(N, H, W) = (1, 1, 1)** 	|                                                            	|               	|                                        	|
|---------------------------	|------------------------------------------------------------	|---------------	|----------------------------------------	|
|                           	| **Avg Latency (ms)**                                       	|               	|                                        	|
| **n_channel**             	| **Baseline (original implementation): two-path reduction** 	| **Optimized** 	| **Speedup Ratio (Optimized/Baseline)** 	|
| 1048576                   	| 10.97419                                                   	| 8.390961      	| 1.307859                               	|
| 524288                    	| 4.860618                                                   	| 4.128075      	| 1.177454                               	|
| 262144                    	| 2.782302                                                   	| 1.981447      	| 1.404177                               	|
| 131072                    	| 2.105565                                                   	| 1.073592      	| 1.961234                               	|
| 65536                     	| 0.857651                                                   	| 0.523462      	| 1.63842                                	|
| 32768                     	| 0.309389                                                   	| 0.247979      	| 1.24764                                	|
| 16384                     	| 0.13869                                                    	| 0.098376      	| 1.409796                               	|
| 8192                      	| 0.072258                                                   	| 0.050876      	| 1.420263                               	|
| 4096                      	| 0.038414                                                   	| 0.027308      	| 1.40667                                	|
| 2048                      	| 0.021684                                                   	| 0.015688      	| 1.382219                               	|
| 1024                      	| 0.013294                                                   	| 0.009842      	| 1.350775                               	|
| 512                       	| 0.008659                                                   	| 0.006645      	| 1.303193                               	|
| 256                       	| 0.006964                                                   	| 0.005393      	| 1.291335                               	|
| 128                       	| 0.005918                                                   	| 0.00464       	| 1.275437                               	|
| 64                        	| 0.005324                                                   	| 0.004292      	| 1.240556                               	|
| 32                        	| 0.004981                                                   	| 0.004163      	| 1.196449                               	|
| 16                        	| 0.004833                                                   	| 0.003943      	| 1.225514                               	|
| 8                         	| 0.004768                                                   	| 0.003896      	| 1.22399                                	|
| 4                         	| 0.004828                                                   	| 0.003955      	| 1.220615                               	|
| 2                         	| 0.004776                                                   	| 0.003934      	| 1.213939                               	|

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113619
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-01-01 04:43:42 +00:00
fc5fda14bc Try creating a bf16 tensor as a last resort of is_bf16_supported(). (#115924)
Fix: #115900 https://github.com/pytorch/xla/issues/6085

This PR adds a last resort for testing for BF16 support on CUDA. This is necessary on GPUs
such as the RTX 2060, where `torch.cuda.is_bf16_supported()` returns False, but we can
successfully create a BF16 tensor on CUDA.

Before this PR:

```python
>>> torch.cuda.is_bf16_supported()
False
>>> torch.tensor([1.], dtype=torch.bfloat16, device="cuda")
tensor([...], device='cuda:0', dtype=torch.bfloat16)
```

After this PR:

```python
>>> torch.cuda.is_bf16_supported()
True
>>> torch.tensor([1.], dtype=torch.bfloat16, device="cuda")
tensor([...], device='cuda:0', dtype=torch.bfloat16)
```
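
The probe itself is simple; below is a minimal sketch of the idea (the helper name and the caught exception type are illustrative, not the exact code added here):

```python
import torch

def bf16_supported_with_fallback() -> bool:
    # Illustrative last-resort probe (not the actual torch.cuda code):
    # if the capability checks are inconclusive, try to allocate a tiny
    # bfloat16 tensor on the current CUDA device and see whether it works.
    if not torch.cuda.is_available():
        return False
    try:
        torch.tensor([1.0], dtype=torch.bfloat16, device="cuda")
        return True
    except RuntimeError:
        return False
```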

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115924
Approved by: https://github.com/jansel
2024-01-01 01:15:30 +00:00
127812efee [BE]: Further improve pathlib checks in torch serialization (#116577)
Follow-up to #116564. `os.path` functions can accept an `os.PathLike` object too.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116577
Approved by: https://github.com/malfet
2023-12-31 20:24:40 +00:00
4bfaa6bc25 [MPS] Fix addmm (#116547)
Remove the odd logic that designates matrices as transposed whenever their sizes match (which is always true when square matrices are multiplied with each other), and which resulted in `torch.addmm` returning a transposed matrix compared to `torch.mm`, see below:
```
% python -c "import torch;torch.set_default_device('mps');a=torch.eye(2);b=torch.arange(4.0).reshape(2, 2);print(a@b);print(torch.addmm(torch.zeros(2, 2), a,b))"
tensor([[0., 1.],
        [2., 3.]], device='mps:0')
tensor([[0., 2.],
        [1., 3.]], device='mps:0')
```

Fixes introduced to `torch.mm` in https://github.com/pytorch/pytorch/pull/77462 suggest that this is not needed.

Modify `sample_inputs_addmm` to test `torch.addmm` with square matrices, but skip this config for `test_autograd_dense_output_addmm`, see https://github.com/pytorch/pytorch/issues/116565

TODO: probably tweak tolerances, as `test_output_match_addmm_cpu_float16` fails with 2x2 matrices, but passes using 3x3 ones with errors slightly exceeding the tolerance
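
As a sanity check, the post-fix behavior on square matrices can be verified with a small script like the one below (a hedged sketch, assuming an MPS-capable machine; not part of this PR's test suite):

```python
import torch

# Assumes an Apple-silicon machine where the MPS backend is available.
torch.set_default_device("mps")
a = torch.eye(2)
b = torch.arange(4.0).reshape(2, 2)
bias = torch.zeros(2, 2)
# After the fix, addmm on square matrices agrees with mm plus the bias.
assert torch.allclose(torch.addmm(bias, a, b), a @ b + bias)
```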

Fixes https://github.com/pytorch/pytorch/issues/116331
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116547
Approved by: https://github.com/albanD, https://github.com/Skylion007
2023-12-31 02:28:59 +00:00
aef06c316b [BE]: Add better handling of pathlib.Path with os calls (#116564)
Builds on #116562 and extends it to the rest of the pathlib usages in PyTorch.
* Uses the more generic `os.PathLike` and `os.fspath` calls where appropriate
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116564
Approved by: https://github.com/malfet
2023-12-31 01:46:03 +00:00
86cd6655a1 [BE]: Use exist_ok arg for os.makedirs calls (#116561)
Optimize os.makedirs calls to use exist_ok parameter when possible to avoid unnecessary checks.
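
For illustration, the pattern being replaced looks roughly like this (hypothetical call site, not a specific file in the repo):

```python
import os

path = "checkpoints/run1"  # illustrative directory

# Before: guard the call with an explicit existence check.
if not os.path.exists(path):
    os.makedirs(path)

# After: let makedirs tolerate an already-existing directory.
os.makedirs(path, exist_ok=True)
```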

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116561
Approved by: https://github.com/malfet
2023-12-30 21:12:53 +00:00
4f9858a902 [BE]: Use os.fspath and os.PathLike in torch serialization (#116562)
Use the proper `os.fspath` call to convert an `os.PathLike` object to a path.
Replace `pathlib.Path` with `os.PathLike`, which is more generic and more correct for typing; `pathlib.Path` is an instance of `os.PathLike`.
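
A minimal sketch of the pattern (illustrative helper, not the actual serialization code):

```python
import os
from pathlib import Path

def normalize_filename(f):
    # os.fspath accepts str, bytes, and any os.PathLike (e.g. pathlib.Path),
    # so callers are not restricted to plain strings or Path objects.
    return os.fspath(f)

print(normalize_filename(Path("model.pt")))  # model.pt
print(normalize_filename("model.pt"))        # model.pt
```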
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116562
Approved by: https://github.com/malfet
2023-12-30 20:53:10 +00:00
5aa258eb09 [12/N] Apply clang-tidy and fix warnings in headers of torch/csrc (#116486)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116486
Approved by: https://github.com/albanD
2023-12-30 18:38:53 +00:00
37aae5932c [11/N] Enable clang-tidy warnings on c10/util/*.h (#116353)
This PR enables clang-tidy coverage on c10/util/*.h
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116353
Approved by: https://github.com/albanD
2023-12-30 14:38:39 +00:00
97891b184c [Dynamo] Trace autograd.function in dynamo when inputs require grad (#116358)
For training graphs (when inputs require grad), we would previously speculate the forward and backward graphs to determine whether there are any graph breaks, side effects, etc., but we would not actually use these speculated graphs. We would just insert a call_function node in the graph and later rely on autograd's tracing.

This approach does not work for more general graphs, such as graphs that include user-defined Triton kernels, because autograd is not able to do the higher-order function conversion.

This PR speculates the forward and backward functions and emits them in a HOF (higher-order function) that later gets used via the templating mechanism.

While working on this PR, I exposed some bugs in the current tracing caused by trampoline functions losing source information, which resulted in incorrect graphs being produced. I have fixed these source information bugs and killed the trampolines.
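
For context, this is the kind of user code affected: a custom `autograd.Function` compiled while its inputs require grad (a toy example for illustration, not taken from the PR):

```python
import torch

class Scale(torch.autograd.Function):
    # Toy custom autograd.Function; with this change Dynamo speculates both
    # the forward and the backward instead of emitting an opaque call.
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x * 2

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        return grad_out * 2

@torch.compile
def f(x):
    return Scale.apply(x).sum()

x = torch.randn(4, requires_grad=True)
f(x).backward()
```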

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116358
Approved by: https://github.com/jansel
2023-12-30 01:51:30 +00:00
c5d9173d04 [BE]: Enable readability-redundant-function-ptr-dereference check (#116538)
Enable an additional clang-tidy check to remove redundant function ptr dereferences to help make the code more readable.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116538
Approved by: https://github.com/malfet
2023-12-30 01:15:35 +00:00
5e58be678c Make collect env BC compatible (#116532)
To avoid errors like the one in https://github.com/pytorch/pytorch/issues/116531 when the user tries to run collect_env
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116532
Approved by: https://github.com/malfet
2023-12-30 01:13:37 +00:00
bd7d26bb96 [CI] Fix docker builds (#116549)
By pinning lxml to 4.9.4 as 5.0.0 is missing Python-3.9 binaries, see https://pypi.org/project/lxml/5.0.0/#files
<img width="568" alt="image" src="https://github.com/pytorch/pytorch/assets/2453524/fbd64512-b788-4bf6-9c1f-084dcedfd082">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116549
Approved by: https://github.com/houseroad, https://github.com/aakhundov
2023-12-30 00:38:14 +00:00
961fbbe967 [CI] Add initial ci build test for XPU (#116100)
Add initial CI build test for XPU, which will be triggered by the label `ciflow/xpu` at the current stage.

Works for RFC #114850

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116100
Approved by: https://github.com/EikanWang, https://github.com/huydhn, https://github.com/atalman
2023-12-29 23:44:46 +00:00
de4d48df34 [c10d] Fix timeout dump path write path overlap when there are multiple PGs (#116218)
Basically we observed that if there are multiple PGs and the timeout happens on one of the sub-PGs, we somehow use the local rank in the dump file. We realized that:
1. For setting the timeout signal in the store, any watchdog thread from any PG can do that.
2. For checking and dumping, only the watchdog thread of the default PG (which we always create and which contains all ranks, so there is no file name conflict) is needed here, because the store signal and the dumped debug info are all global.
3. Since the dump is global, we want to avoid the case where ranks from a sub-PG pollute logs from global ranks (local rank 0 vs. global rank 0), so we use global ranks here to initialize the debug info writer. (Down the road, we are thinking about making it a singleton so that the user only registers it once for the multi-PG case.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116218
Approved by: https://github.com/wconstab
2023-12-29 21:58:25 +00:00
db2b4078b9 Add missing cstdint includes (#116458)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116458
Approved by: https://github.com/Skylion007
2023-12-29 18:30:26 +00:00
71ec3edbf7 Enhance Opinfo to support privateuse1 (#116417)
Fix OpInfo not supporting third-party devices when the current test framework instantiation method is privateuse1.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116417
Approved by: https://github.com/albanD
2023-12-29 13:43:29 +00:00
e01e00fba8 fix code spell (#116530)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116530
Approved by: https://github.com/albanD
2023-12-29 12:58:38 +00:00
afadfa0175 [c10d] Add stream info during nccl comm abort call (#116076)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116076
Approved by: https://github.com/XilunWu
2023-12-29 06:58:26 +00:00
e8a9d088c6 [DevX] Add tool and doc on partial debug builds (#116521)
Turned the command sequence mentioned in https://dev-discuss.pytorch.org/t/how-to-get-a-fast-debug-build/1597 and in various discussions into a tool that I use almost daily to debug crashes or correctness issues in the codebase.

Essentially it allows one to turn this:
```
Process 87729 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
    frame #0: 0x00000001023d55a8 libtorch_python.dylib`at::indexing::impl::applySelect(at::Tensor const&, long long, c10::SymInt, long long, c10::Device const&, std::__1::optional<c10::ArrayRef<c10::SymInt>> const&)
libtorch_python.dylib`at::indexing::impl::applySelect:
->  0x1023d55a8 <+0>:  sub    sp, sp, #0xd0
    0x1023d55ac <+4>:  stp    x24, x23, [sp, #0x90]
    0x1023d55b0 <+8>:  stp    x22, x21, [sp, #0xa0]
    0x1023d55b4 <+12>: stp    x20, x19, [sp, #0xb0]
```
into this
```
Process 87741 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
    frame #0: 0x00000001024e2628 libtorch_python.dylib`at::indexing::impl::applySelect(self=0x00000001004ee8a8, dim=0, index=(data_ = 3), real_dim=0, (null)=0x000000016fdfe535, self_sizes= Has Value=true ) at TensorIndexing.h:239:7
   236 	    const at::Device& /*self_device*/,
   237 	    const c10::optional<SymIntArrayRef>& self_sizes) {
   238 	  // See NOTE [nested tensor size for indexing]
-> 239 	  if (self_sizes.has_value()) {
   240 	    auto maybe_index = index.maybe_as_int();
   241 	    if (maybe_index.has_value()) {
   242 	      TORCH_CHECK_INDEX(
```
while retaining good performance for the rest of the codebase
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116521
Approved by: https://github.com/atalman
2023-12-29 05:15:35 +00:00
df85a920cf [Inductor][Observability] Add logging for split cat pass (#116442)
Summary: Add logging in both the pre-grad and post-grad passes

Test Plan:
```
buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode split_batch
```
[2023-12-26 17:14:24,203] [0/0] torch._inductor.fx_passes.post_grad: [INFO] counters of inductor dict after apply the split cat in the post grad pass: Counter({'pattern_matcher_nodes': 4076, 'pattern_matcher_count': 2917, 'remove_split_with_size_one': 1322, 'split_cat_norm': 461, 'consecutive_split_merged': 371, 'scmerge_cat_removed': 41, 'scmerge_cat_added': 32, 'scmerge_split_removed': 28, 'getitem_cat_merged': 11, 'batch_fusion': 7, 'scmerge_split_sections_removed': 3, 'scmerge_split_added': 2, 'split_squeeze_replaced': 2})

[2023-12-26 17:16:28,437] torch._inductor.fx_passes.post_grad: [INFO] counters of inductor dict after apply the split cat in the post grad pass: Counter({'pattern_matcher_nodes': 4122, 'pattern_matcher_count': 2935, 'remove_split_with_size_one': 1322, 'split_cat_norm': 461, 'consecutive_split_merged': 371, 'scmerge_cat_removed': 41, 'batch_fusion': 39, 'scmerge_cat_added': 32, 'scmerge_split_removed': 28, 'getitem_cat_merged': 11, 'scmerge_split_sections_removed': 3, 'scmerge_split_added': 2, 'split_squeeze_replaced': 2})

Differential Revision: D52425400

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116442
Approved by: https://github.com/yanboliang
2023-12-29 05:10:45 +00:00
8deaa13417 [EZ][Distributed] Add 'c10d' to distributed TORCH_LOG comment (#116526)
Address the comment in https://github.com/pytorch/pytorch/pull/116434, which I got confused about at the beginning. Let's add c10d to the comment.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116526
Approved by: https://github.com/XilunWu
2023-12-29 04:40:37 +00:00
ef94499ad7 [executorch hash update] update the pinned executorch hash (#116474)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116474
Approved by: https://github.com/pytorchbot
2023-12-29 03:13:51 +00:00
240121587a [vision hash update] update the pinned vision hash (#116524)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116524
Approved by: https://github.com/pytorchbot
2023-12-29 03:08:39 +00:00
cab79ceb51 [Inductor Intel GPU backend Upstream] Step 2: Register and add Intel GPU Inductor backend (#116330)
Right after the first PR https://github.com/pytorch/pytorch/pull/116020, this PR focuses on generalizing the device-biased runtime code used in the basic workflow, including Triton kernel generation, codecache, and autotuning.

 Feature request: #114856

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116330
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/desertfire
2023-12-29 02:49:37 +00:00
8173d98c57 [quant][be] Skip conv-bn folding when there are no batchnorm ops (#116440)
Summary:
`_fold_conv_bn_qat` currently takes a long time, so skip it when it's not necessary;
we can have follow-up fixes to actually reduce the patterns or cache them if possible
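
The early exit amounts to a single scan of the graph; a rough sketch of the idea (the predicate name and the exact op matching are illustrative, not the code in this PR):

```python
import torch

def _graph_has_batch_norm(gm: torch.fx.GraphModule) -> bool:
    # Illustrative check: only run the expensive conv-bn folding patterns
    # if at least one batch norm call is present in the graph.
    for node in gm.graph.nodes:
        if node.op == "call_function" and "batch_norm" in str(node.target):
            return True
    return False
```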

Test Plan:
uncomment the print in `test_speed`, run

python test/test_quantization.py -k test_speed

and make sure the convert time is low, e.g. 0.1s instead of 8-9 seconds

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116440
Approved by: https://github.com/andrewor14
2023-12-28 23:33:21 +00:00
33917150d3 Cleanup scope ref properly (#116169)
Fixes https://github.com/pytorch/pytorch/issues/116143

See test in PR for a case where this happens. Discovered while debugging optimizers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116169
Approved by: https://github.com/janeyx99, https://github.com/williamwen42, https://github.com/jansel
2023-12-28 23:29:37 +00:00
4371939751 Removing HTA documentation (#116513)
Removing HTA documentation

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116513
Approved by: https://github.com/aaronenyeshi, https://github.com/malfet, https://github.com/atalman
2023-12-28 23:04:23 +00:00
8220d5c66d Support pathlib.Path as input to torch.load when mmap=True (#116104)
Fixes #116103

This now works:

```py
import torch
from pathlib import Path

file = Path("example.pt")
torch.save(torch.rand(5, 3), file)
torch.load(file, mmap=True)   # works!
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116104
Approved by: https://github.com/mikaylagawarecki
2023-12-28 22:54:11 +00:00
02e2158e75 Fix for out of bounds read in mobile interpreter INTERFACE_CALL opcode handler (#110301)
Summary:
The INTERFACE_CALL opcode for the mobile TorchScript interpreter contained an out of bounds read issue leading to memory corruption.

This change adds an explicit check that the number of inputs passed to the format method called when handling the INTERFACE_CALL opcode is valid and within the bounds of the stack.

Test Plan: contbuild + OSS signals

Differential Revision: D49739450

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110301
Approved by: https://github.com/dbort
2023-12-28 22:09:03 +00:00
7e12e722af [Dynamo][12/N] Remove allowed_functions.py (#116401)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116401
Approved by: https://github.com/angelayi
2023-12-28 21:26:06 +00:00
439f2a6c1f [RelEng] Missing signal for release branches (#116516)
Run slow/periodic and inductor workflows on push to release branches

Right now there is no signal from those jobs on release branches at all.
This will run periodic jobs on every commit to a release branch, which is fine, as they are short-lived and have much lower traffic than regular jobs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116516
Approved by: https://github.com/clee2000
2023-12-28 20:19:55 +00:00
4af1c27fa8 Migrate repr, deterministic state_dict test to OptimizerInfo (#116496)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116496
Approved by: https://github.com/albanD
ghstack dependencies: #116471
2023-12-28 19:49:04 +00:00
f3c4395358 [BE] Add helper in common_optimizers to get all optim inputs (#116471)
This will be a common utility in test_optim.py. Printing out the optimizer inputs when using this helper looks reasonable:

For local test plan, click below.
<details>

```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (d186986c)]$ python test/test_optim.py -vv -k test_step_is_noop_when_params_have_no_grad
test_step_is_noop_when_params_have_no_grad_ASGD_cpu_float32 (__main__.TestOptimRenewedCPU) ... params=None, kwargs={'foreach': False, 'differentiable': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.02, 'foreach': False, 'differentiable': False}, desc=non-default lr
params=None, kwargs={'lr': 0.02, 'foreach': True, 'differentiable': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.02, 'foreach': False, 'differentiable': True}, desc=non-default lr & differentiable
params=None, kwargs={'t0': 100, 'foreach': False, 'differentiable': False}, desc=t0
params=None, kwargs={'t0': 100, 'foreach': True, 'differentiable': False}, desc=t0 & foreach
params=None, kwargs={'t0': 100, 'foreach': False, 'differentiable': True}, desc=t0 & differentiable
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=nonzero weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': False}, desc=maximize
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': True, 'differentiable': False}, desc=maximize & foreach
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': True}, desc=maximize & differentiable
ok
test_step_is_noop_when_params_have_no_grad_Adadelta_cpu_float32 (__main__.TestOptimRenewedCPU) ... params=None, kwargs={'foreach': False, 'differentiable': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': False}, desc=non-default lr
params=None, kwargs={'lr': 0.01, 'foreach': True, 'differentiable': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': True}, desc=non-default lr & differentiable
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=nonzero weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': False}, desc=maximize
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': True, 'differentiable': False}, desc=maximize & foreach
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': True}, desc=maximize & differentiable
params=None, kwargs={'rho': 0.95, 'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=rho
params=None, kwargs={'rho': 0.95, 'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=rho & foreach
params=None, kwargs={'rho': 0.95, 'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=rho & differentiable
ok
test_step_is_noop_when_params_have_no_grad_Adagrad_cpu_float32 (__main__.TestOptimRenewedCPU) ... params=None, kwargs={'foreach': False, 'differentiable': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True}, desc=default & differentiable
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=nonzero weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': False}, desc=maximize
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': True, 'differentiable': False}, desc=maximize & foreach
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': True}, desc=maximize & differentiable
params=None, kwargs={'initial_accumulator_value': 0.1, 'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=initial_accumulator_value
params=None, kwargs={'initial_accumulator_value': 0.1, 'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=initial_accumulator_value & foreach
params=None, kwargs={'initial_accumulator_value': 0.1, 'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=initial_accumulator_value & differentiable
params=None, kwargs={'lr': 0.1, 'lr_decay': 0.5, 'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=lr_decay
params=None, kwargs={'lr': 0.1, 'lr_decay': 0.5, 'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=lr_decay & foreach
params=None, kwargs={'lr': 0.1, 'lr_decay': 0.5, 'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=lr_decay & differentiable
ok
test_step_is_noop_when_params_have_no_grad_AdamW_cpu_float32 (__main__.TestOptimRenewedCPU) ... params=None, kwargs={'foreach': False, 'differentiable': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': False}, desc=non-default lr
params=None, kwargs={'lr': 0.01, 'foreach': True, 'differentiable': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': True}, desc=non-default lr & differentiable
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=nonzero weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': False}, desc=maximize
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': True, 'differentiable': False}, desc=maximize & foreach
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': True}, desc=maximize & differentiable
params=None, kwargs={'weight_decay': 0.9, 'amsgrad': True, 'foreach': False, 'differentiable': False}, desc=amsgrad
params=None, kwargs={'weight_decay': 0.9, 'amsgrad': True, 'foreach': True, 'differentiable': False}, desc=amsgrad & foreach
params=None, kwargs={'weight_decay': 0.9, 'amsgrad': True, 'foreach': False, 'differentiable': True}, desc=amsgrad & differentiable
ok
test_step_is_noop_when_params_have_no_grad_Adam_cpu_float32 (__main__.TestOptimRenewedCPU) ... params=None, kwargs={'foreach': False, 'differentiable': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': False}, desc=non-default lr
params=None, kwargs={'lr': 0.01, 'foreach': True, 'differentiable': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': True}, desc=non-default lr & differentiable
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=nonzero weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': False}, desc=maximize
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': True, 'differentiable': False}, desc=maximize & foreach
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': True}, desc=maximize & differentiable
params=None, kwargs={'weight_decay': 0.9, 'amsgrad': True, 'foreach': False, 'differentiable': False}, desc=amsgrad
params=None, kwargs={'weight_decay': 0.9, 'amsgrad': True, 'foreach': True, 'differentiable': False}, desc=amsgrad & foreach
params=None, kwargs={'weight_decay': 0.9, 'amsgrad': True, 'foreach': False, 'differentiable': True}, desc=amsgrad & differentiable
ok
test_step_is_noop_when_params_have_no_grad_Adamax_cpu_float32 (__main__.TestOptimRenewedCPU) ... params=None, kwargs={'foreach': False, 'differentiable': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.001, 'foreach': False, 'differentiable': False}, desc=non-default lr
params=None, kwargs={'lr': 0.001, 'foreach': True, 'differentiable': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.001, 'foreach': False, 'differentiable': True}, desc=non-default lr & differentiable
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=nonzero weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': False}, desc=maximize
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': True, 'differentiable': False}, desc=maximize & foreach
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': True}, desc=maximize & differentiable
ok
test_step_is_noop_when_params_have_no_grad_LBFGS_cpu_float32 (__main__.TestOptimRenewedCPU) ... ok
test_step_is_noop_when_params_have_no_grad_NAdam_cpu_float32 (__main__.TestOptimRenewedCPU) ... params=None, kwargs={'foreach': False, 'differentiable': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.001, 'foreach': False, 'differentiable': False}, desc=non-default lr
params=None, kwargs={'lr': 0.001, 'foreach': True, 'differentiable': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.001, 'foreach': False, 'differentiable': True}, desc=non-default lr & differentiable
params=None, kwargs={'momentum_decay': 0.006, 'foreach': False, 'differentiable': False}, desc=non-zero momentum_decay
params=None, kwargs={'momentum_decay': 0.006, 'foreach': True, 'differentiable': False}, desc=non-zero momentum_decay & foreach
params=None, kwargs={'momentum_decay': 0.006, 'foreach': False, 'differentiable': True}, desc=non-zero momentum_decay & differentiable
params=None, kwargs={'weight_decay': 0.9, 'momentum_decay': 0.006, 'foreach': False, 'differentiable': False}, desc=weight_decay
params=None, kwargs={'weight_decay': 0.9, 'momentum_decay': 0.006, 'foreach': True, 'differentiable': False}, desc=weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'momentum_decay': 0.006, 'foreach': False, 'differentiable': True}, desc=weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.9, 'momentum_decay': 0.006, 'decoupled_weight_decay': True, 'foreach': False, 'differentiable': False}, desc=decoupled_weight_decay
params=None, kwargs={'weight_decay': 0.9, 'momentum_decay': 0.006, 'decoupled_weight_decay': True, 'foreach': True, 'differentiable': False}, desc=decoupled_weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'momentum_decay': 0.006, 'decoupled_weight_decay': True, 'foreach': False, 'differentiable': True}, desc=decoupled_weight_decay & differentiable
ok
test_step_is_noop_when_params_have_no_grad_RAdam_cpu_float32 (__main__.TestOptimRenewedCPU) ... params=None, kwargs={'foreach': False, 'differentiable': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.002, 'foreach': False, 'differentiable': False}, desc=non-default lr
params=None, kwargs={'lr': 0.002, 'foreach': True, 'differentiable': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.002, 'foreach': False, 'differentiable': True}, desc=non-default lr & differentiable
params=None, kwargs={'eps': 1e-06, 'foreach': False, 'differentiable': False}, desc=non-default eps
params=None, kwargs={'eps': 1e-06, 'foreach': True, 'differentiable': False}, desc=non-default eps & foreach
params=None, kwargs={'eps': 1e-06, 'foreach': False, 'differentiable': True}, desc=non-default eps & differentiable
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=nonzero weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.9, 'decoupled_weight_decay': True, 'foreach': False, 'differentiable': False}, desc=decoupled_weight_decay
params=None, kwargs={'weight_decay': 0.9, 'decoupled_weight_decay': True, 'foreach': True, 'differentiable': False}, desc=decoupled_weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'decoupled_weight_decay': True, 'foreach': False, 'differentiable': True}, desc=decoupled_weight_decay & differentiable
ok
test_step_is_noop_when_params_have_no_grad_RMSprop_cpu_float32 (__main__.TestOptimRenewedCPU) ... params=None, kwargs={'foreach': False, 'differentiable': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.001, 'foreach': False, 'differentiable': False}, desc=non-default lr
params=None, kwargs={'lr': 0.001, 'foreach': True, 'differentiable': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.001, 'foreach': False, 'differentiable': True}, desc=non-default lr & differentiable
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=nonzero weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.9, 'centered': True, 'foreach': False, 'differentiable': False}, desc=centered
params=None, kwargs={'weight_decay': 0.9, 'centered': True, 'foreach': True, 'differentiable': False}, desc=centered & foreach
params=None, kwargs={'weight_decay': 0.9, 'centered': True, 'foreach': False, 'differentiable': True}, desc=centered & differentiable
params=None, kwargs={'weight_decay': 0.9, 'centered': True, 'momentum': 0.1, 'foreach': False, 'differentiable': False}, desc=momentum
params=None, kwargs={'weight_decay': 0.9, 'centered': True, 'momentum': 0.1, 'foreach': True, 'differentiable': False}, desc=momentum & foreach
params=None, kwargs={'weight_decay': 0.9, 'centered': True, 'momentum': 0.1, 'foreach': False, 'differentiable': True}, desc=momentum & differentiable
params=None, kwargs={'weight_decay': 0.9, 'centered': True, 'momentum': 0.1, 'maximize': True, 'foreach': False, 'differentiable': False}, desc=maximize
params=None, kwargs={'weight_decay': 0.9, 'centered': True, 'momentum': 0.1, 'maximize': True, 'foreach': True, 'differentiable': False}, desc=maximize & foreach
params=None, kwargs={'weight_decay': 0.9, 'centered': True, 'momentum': 0.1, 'maximize': True, 'foreach': False, 'differentiable': True}, desc=maximize & differentiable
ok
test_step_is_noop_when_params_have_no_grad_Rprop_cpu_float32 (__main__.TestOptimRenewedCPU) ... params=None, kwargs={'foreach': False, 'differentiable': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.0002, 'foreach': False, 'differentiable': False}, desc=non-default lr
params=None, kwargs={'lr': 0.0002, 'foreach': True, 'differentiable': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.0002, 'foreach': False, 'differentiable': True}, desc=non-default lr & differentiable
params=None, kwargs={'etas': (0.5, 1.5), 'foreach': False, 'differentiable': False}, desc=non-default etas
params=None, kwargs={'etas': (0.5, 1.5), 'foreach': True, 'differentiable': False}, desc=non-default etas & foreach
params=None, kwargs={'etas': (0.5, 1.5), 'foreach': False, 'differentiable': True}, desc=non-default etas & differentiable
params=None, kwargs={'step_sizes': (2e-06, 100), 'foreach': False, 'differentiable': False}, desc=non-default step_sizes
params=None, kwargs={'step_sizes': (2e-06, 100), 'foreach': True, 'differentiable': False}, desc=non-default step_sizes & foreach
params=None, kwargs={'step_sizes': (2e-06, 100), 'foreach': False, 'differentiable': True}, desc=non-default step_sizes & differentiable
params=None, kwargs={'maximize': True, 'foreach': False, 'differentiable': False}, desc=maximize
params=None, kwargs={'maximize': True, 'foreach': True, 'differentiable': False}, desc=maximize & foreach
params=None, kwargs={'maximize': True, 'foreach': False, 'differentiable': True}, desc=maximize & differentiable
ok
test_step_is_noop_when_params_have_no_grad_SGD_cpu_float32 (__main__.TestOptimRenewedCPU) ... params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': False}, desc=default
params=None, kwargs={'lr': 0.01, 'foreach': True, 'differentiable': False}, desc=default & foreach
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'foreach': False, 'differentiable': False}, desc=momentum
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'foreach': True, 'differentiable': False}, desc=momentum & foreach
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'foreach': False, 'differentiable': True}, desc=momentum & differentiable
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'dampening': 0.5, 'foreach': False, 'differentiable': False}, desc=dampening
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'dampening': 0.5, 'foreach': True, 'differentiable': False}, desc=dampening & foreach
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'dampening': 0.5, 'foreach': False, 'differentiable': True}, desc=dampening & differentiable
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=non-zero weight_decay
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=non-zero weight_decay & foreach
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=non-zero weight_decay & differentiable
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'nesterov': True, 'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=nesterov
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'nesterov': True, 'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=nesterov & foreach
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'nesterov': True, 'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=nesterov & differentiable
params=None, kwargs={'lr': 0.01, 'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': False}, desc=maximize
params=None, kwargs={'lr': 0.01, 'weight_decay': 0.9, 'maximize': True, 'foreach': True, 'differentiable': False}, desc=maximize & foreach
params=None, kwargs={'lr': 0.01, 'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': True}, desc=maximize & differentiable
ok
test_step_is_noop_when_params_have_no_grad_SparseAdam_cpu_float32 (__main__.TestOptimRenewedCPU) ... ok
test_step_is_noop_when_params_have_no_grad_ASGD_cuda_float32 (__main__.TestOptimRenewedCUDA) ... params=None, kwargs={'foreach': False, 'differentiable': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.02, 'foreach': False, 'differentiable': False}, desc=non-default lr
params=None, kwargs={'lr': 0.02, 'foreach': True, 'differentiable': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.02, 'foreach': False, 'differentiable': True}, desc=non-default lr & differentiable
params=None, kwargs={'t0': 100, 'foreach': False, 'differentiable': False}, desc=t0
params=None, kwargs={'t0': 100, 'foreach': True, 'differentiable': False}, desc=t0 & foreach
params=None, kwargs={'t0': 100, 'foreach': False, 'differentiable': True}, desc=t0 & differentiable
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=nonzero weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': False}, desc=maximize
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': True, 'differentiable': False}, desc=maximize & foreach
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': True}, desc=maximize & differentiable
ok
test_step_is_noop_when_params_have_no_grad_Adadelta_cuda_float32 (__main__.TestOptimRenewedCUDA) ... params=None, kwargs={'foreach': False, 'differentiable': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': False}, desc=non-default lr
params=None, kwargs={'lr': 0.01, 'foreach': True, 'differentiable': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': True}, desc=non-default lr & differentiable
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=nonzero weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': False}, desc=maximize
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': True, 'differentiable': False}, desc=maximize & foreach
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': True}, desc=maximize & differentiable
params=None, kwargs={'rho': 0.95, 'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=rho
params=None, kwargs={'rho': 0.95, 'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=rho & foreach
params=None, kwargs={'rho': 0.95, 'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=rho & differentiable
ok
test_step_is_noop_when_params_have_no_grad_Adagrad_cuda_float32 (__main__.TestOptimRenewedCUDA) ... params=None, kwargs={'foreach': False, 'differentiable': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True}, desc=default & differentiable
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=nonzero weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': False}, desc=maximize
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': True, 'differentiable': False}, desc=maximize & foreach
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': True}, desc=maximize & differentiable
params=None, kwargs={'initial_accumulator_value': 0.1, 'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=initial_accumulator_value
params=None, kwargs={'initial_accumulator_value': 0.1, 'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=initial_accumulator_value & foreach
params=None, kwargs={'initial_accumulator_value': 0.1, 'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=initial_accumulator_value & differentiable
params=None, kwargs={'lr': 0.1, 'lr_decay': 0.5, 'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=lr_decay
params=None, kwargs={'lr': 0.1, 'lr_decay': 0.5, 'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=lr_decay & foreach
params=None, kwargs={'lr': 0.1, 'lr_decay': 0.5, 'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=lr_decay & differentiable
ok
test_step_is_noop_when_params_have_no_grad_AdamW_cuda_float32 (__main__.TestOptimRenewedCUDA) ... params=None, kwargs={'foreach': False, 'differentiable': False, 'fused': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False, 'fused': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True, 'fused': False}, desc=default & differentiable
params=None, kwargs={'foreach': False, 'differentiable': False, 'fused': True}, desc=default & fused
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': False, 'fused': False}, desc=non-default lr
params=None, kwargs={'lr': 0.01, 'foreach': True, 'differentiable': False, 'fused': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': True, 'fused': False}, desc=non-default lr & differentiable
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': False, 'fused': True}, desc=non-default lr & fused
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': False, 'fused': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.9, 'foreach': True, 'differentiable': False, 'fused': False}, desc=nonzero weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': True, 'fused': False}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': False, 'fused': True}, desc=nonzero weight_decay & fused
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': False, 'fused': False}, desc=maximize
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': True, 'differentiable': False, 'fused': False}, desc=maximize & foreach
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': True, 'fused': False}, desc=maximize & differentiable
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': False, 'fused': True}, desc=maximize & fused
params=None, kwargs={'weight_decay': 0.9, 'amsgrad': True, 'foreach': False, 'differentiable': False, 'fused': False}, desc=amsgrad
params=None, kwargs={'weight_decay': 0.9, 'amsgrad': True, 'foreach': True, 'differentiable': False, 'fused': False}, desc=amsgrad & foreach
params=None, kwargs={'weight_decay': 0.9, 'amsgrad': True, 'foreach': False, 'differentiable': True, 'fused': False}, desc=amsgrad & differentiable
params=None, kwargs={'weight_decay': 0.9, 'amsgrad': True, 'foreach': False, 'differentiable': False, 'fused': True}, desc=amsgrad & fused
ok
test_step_is_noop_when_params_have_no_grad_Adam_cuda_float32 (__main__.TestOptimRenewedCUDA) ... params=None, kwargs={'foreach': False, 'differentiable': False, 'fused': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False, 'fused': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True, 'fused': False}, desc=default & differentiable
params=None, kwargs={'foreach': False, 'differentiable': False, 'fused': True}, desc=default & fused
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': False, 'fused': False}, desc=non-default lr
params=None, kwargs={'lr': 0.01, 'foreach': True, 'differentiable': False, 'fused': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': True, 'fused': False}, desc=non-default lr & differentiable
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': False, 'fused': True}, desc=non-default lr & fused
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': False, 'fused': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.9, 'foreach': True, 'differentiable': False, 'fused': False}, desc=nonzero weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': True, 'fused': False}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': False, 'fused': True}, desc=nonzero weight_decay & fused
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': False, 'fused': False}, desc=maximize
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': True, 'differentiable': False, 'fused': False}, desc=maximize & foreach
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': True, 'fused': False}, desc=maximize & differentiable
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': False, 'fused': True}, desc=maximize & fused
params=None, kwargs={'weight_decay': 0.9, 'amsgrad': True, 'foreach': False, 'differentiable': False, 'fused': False}, desc=amsgrad
params=None, kwargs={'weight_decay': 0.9, 'amsgrad': True, 'foreach': True, 'differentiable': False, 'fused': False}, desc=amsgrad & foreach
params=None, kwargs={'weight_decay': 0.9, 'amsgrad': True, 'foreach': False, 'differentiable': True, 'fused': False}, desc=amsgrad & differentiable
params=None, kwargs={'weight_decay': 0.9, 'amsgrad': True, 'foreach': False, 'differentiable': False, 'fused': True}, desc=amsgrad & fused
ok
test_step_is_noop_when_params_have_no_grad_Adamax_cuda_float32 (__main__.TestOptimRenewedCUDA) ... params=None, kwargs={'foreach': False, 'differentiable': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.001, 'foreach': False, 'differentiable': False}, desc=non-default lr
params=None, kwargs={'lr': 0.001, 'foreach': True, 'differentiable': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.001, 'foreach': False, 'differentiable': True}, desc=non-default lr & differentiable
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=nonzero weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': False}, desc=maximize
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': True, 'differentiable': False}, desc=maximize & foreach
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': True}, desc=maximize & differentiable
ok
test_step_is_noop_when_params_have_no_grad_LBFGS_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_step_is_noop_when_params_have_no_grad_NAdam_cuda_float32 (__main__.TestOptimRenewedCUDA) ... params=None, kwargs={'foreach': False, 'differentiable': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.001, 'foreach': False, 'differentiable': False}, desc=non-default lr
params=None, kwargs={'lr': 0.001, 'foreach': True, 'differentiable': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.001, 'foreach': False, 'differentiable': True}, desc=non-default lr & differentiable
params=None, kwargs={'momentum_decay': 0.006, 'foreach': False, 'differentiable': False}, desc=non-zero momentum_decay
params=None, kwargs={'momentum_decay': 0.006, 'foreach': True, 'differentiable': False}, desc=non-zero momentum_decay & foreach
params=None, kwargs={'momentum_decay': 0.006, 'foreach': False, 'differentiable': True}, desc=non-zero momentum_decay & differentiable
params=None, kwargs={'weight_decay': 0.9, 'momentum_decay': 0.006, 'foreach': False, 'differentiable': False}, desc=weight_decay
params=None, kwargs={'weight_decay': 0.9, 'momentum_decay': 0.006, 'foreach': True, 'differentiable': False}, desc=weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'momentum_decay': 0.006, 'foreach': False, 'differentiable': True}, desc=weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.9, 'momentum_decay': 0.006, 'decoupled_weight_decay': True, 'foreach': False, 'differentiable': False}, desc=decoupled_weight_decay
params=None, kwargs={'weight_decay': 0.9, 'momentum_decay': 0.006, 'decoupled_weight_decay': True, 'foreach': True, 'differentiable': False}, desc=decoupled_weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'momentum_decay': 0.006, 'decoupled_weight_decay': True, 'foreach': False, 'differentiable': True}, desc=decoupled_weight_decay & differentiable
ok
test_step_is_noop_when_params_have_no_grad_RAdam_cuda_float32 (__main__.TestOptimRenewedCUDA) ... params=None, kwargs={'foreach': False, 'differentiable': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.002, 'foreach': False, 'differentiable': False}, desc=non-default lr
params=None, kwargs={'lr': 0.002, 'foreach': True, 'differentiable': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.002, 'foreach': False, 'differentiable': True}, desc=non-default lr & differentiable
params=None, kwargs={'eps': 1e-06, 'foreach': False, 'differentiable': False}, desc=non-default eps
params=None, kwargs={'eps': 1e-06, 'foreach': True, 'differentiable': False}, desc=non-default eps & foreach
params=None, kwargs={'eps': 1e-06, 'foreach': False, 'differentiable': True}, desc=non-default eps & differentiable
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=nonzero weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.9, 'decoupled_weight_decay': True, 'foreach': False, 'differentiable': False}, desc=decoupled_weight_decay
params=None, kwargs={'weight_decay': 0.9, 'decoupled_weight_decay': True, 'foreach': True, 'differentiable': False}, desc=decoupled_weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'decoupled_weight_decay': True, 'foreach': False, 'differentiable': True}, desc=decoupled_weight_decay & differentiable
ok
test_step_is_noop_when_params_have_no_grad_RMSprop_cuda_float32 (__main__.TestOptimRenewedCUDA) ... params=None, kwargs={'foreach': False, 'differentiable': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.001, 'foreach': False, 'differentiable': False}, desc=non-default lr
params=None, kwargs={'lr': 0.001, 'foreach': True, 'differentiable': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.001, 'foreach': False, 'differentiable': True}, desc=non-default lr & differentiable
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=nonzero weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.9, 'centered': True, 'foreach': False, 'differentiable': False}, desc=centered
params=None, kwargs={'weight_decay': 0.9, 'centered': True, 'foreach': True, 'differentiable': False}, desc=centered & foreach
params=None, kwargs={'weight_decay': 0.9, 'centered': True, 'foreach': False, 'differentiable': True}, desc=centered & differentiable
params=None, kwargs={'weight_decay': 0.9, 'centered': True, 'momentum': 0.1, 'foreach': False, 'differentiable': False}, desc=momentum
params=None, kwargs={'weight_decay': 0.9, 'centered': True, 'momentum': 0.1, 'foreach': True, 'differentiable': False}, desc=momentum & foreach
params=None, kwargs={'weight_decay': 0.9, 'centered': True, 'momentum': 0.1, 'foreach': False, 'differentiable': True}, desc=momentum & differentiable
params=None, kwargs={'weight_decay': 0.9, 'centered': True, 'momentum': 0.1, 'maximize': True, 'foreach': False, 'differentiable': False}, desc=maximize
params=None, kwargs={'weight_decay': 0.9, 'centered': True, 'momentum': 0.1, 'maximize': True, 'foreach': True, 'differentiable': False}, desc=maximize & foreach
params=None, kwargs={'weight_decay': 0.9, 'centered': True, 'momentum': 0.1, 'maximize': True, 'foreach': False, 'differentiable': True}, desc=maximize & differentiable
ok
test_step_is_noop_when_params_have_no_grad_Rprop_cuda_float32 (__main__.TestOptimRenewedCUDA) ... params=None, kwargs={'foreach': False, 'differentiable': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.0002, 'foreach': False, 'differentiable': False}, desc=non-default lr
params=None, kwargs={'lr': 0.0002, 'foreach': True, 'differentiable': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.0002, 'foreach': False, 'differentiable': True}, desc=non-default lr & differentiable
params=None, kwargs={'etas': (0.5, 1.5), 'foreach': False, 'differentiable': False}, desc=non-default etas
params=None, kwargs={'etas': (0.5, 1.5), 'foreach': True, 'differentiable': False}, desc=non-default etas & foreach
params=None, kwargs={'etas': (0.5, 1.5), 'foreach': False, 'differentiable': True}, desc=non-default etas & differentiable
params=None, kwargs={'step_sizes': (2e-06, 100), 'foreach': False, 'differentiable': False}, desc=non-default step_sizes
params=None, kwargs={'step_sizes': (2e-06, 100), 'foreach': True, 'differentiable': False}, desc=non-default step_sizes & foreach
params=None, kwargs={'step_sizes': (2e-06, 100), 'foreach': False, 'differentiable': True}, desc=non-default step_sizes & differentiable
params=None, kwargs={'maximize': True, 'foreach': False, 'differentiable': False}, desc=maximize
params=None, kwargs={'maximize': True, 'foreach': True, 'differentiable': False}, desc=maximize & foreach
params=None, kwargs={'maximize': True, 'foreach': False, 'differentiable': True}, desc=maximize & differentiable
ok
test_step_is_noop_when_params_have_no_grad_SGD_cuda_float32 (__main__.TestOptimRenewedCUDA) ... params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': False}, desc=default
params=None, kwargs={'lr': 0.01, 'foreach': True, 'differentiable': False}, desc=default & foreach
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'foreach': False, 'differentiable': False}, desc=momentum
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'foreach': True, 'differentiable': False}, desc=momentum & foreach
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'foreach': False, 'differentiable': True}, desc=momentum & differentiable
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'dampening': 0.5, 'foreach': False, 'differentiable': False}, desc=dampening
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'dampening': 0.5, 'foreach': True, 'differentiable': False}, desc=dampening & foreach
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'dampening': 0.5, 'foreach': False, 'differentiable': True}, desc=dampening & differentiable
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=non-zero weight_decay
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=non-zero weight_decay & foreach
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=non-zero weight_decay & differentiable
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'nesterov': True, 'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=nesterov
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'nesterov': True, 'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=nesterov & foreach
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'nesterov': True, 'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=nesterov & differentiable
params=None, kwargs={'lr': 0.01, 'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': False}, desc=maximize
params=None, kwargs={'lr': 0.01, 'weight_decay': 0.9, 'maximize': True, 'foreach': True, 'differentiable': False}, desc=maximize & foreach
params=None, kwargs={'lr': 0.01, 'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': True}, desc=maximize & differentiable
ok
test_step_is_noop_when_params_have_no_grad_SparseAdam_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok

----------------------------------------------------------------------
Ran 26 tests in 19.089s

OK
```

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116471
Approved by: https://github.com/albanD
2023-12-28 19:49:04 +00:00
577529daec [Dynamo] Implement a simple mutation tracker for user defined triton kernels (#116466)
This PR adds a very simple mutation tracking mechanism to dynamo which can later be improved to be more thorough. Currently it allows tensors to be in tl.load but if it sees a tensor used anywhere else (including a tl.load), it bails out.

One question about the method: is `ast.NodeVisitor` the best thing to use here? Having to detect mutations this way is kind of ugly, since you need to keep setting state at each transition.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116466
Approved by: https://github.com/aakhundov
2023-12-28 18:59:44 +00:00
f10c3f4184 Fix module pre bw hooks when input doesn't req grad but gradients are changed by the user (#116454)
As per title.

FYI @vkuzo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116454
Approved by: https://github.com/mikaylagawarecki
2023-12-28 18:32:50 +00:00
fb91acd33b [release] Add specific section about building and testing final rc (#116476)
Formalize process of building and testing final rc. To avoid having missing PRs in the release, similar to this: https://github.com/pytorch/pytorch/pull/114197

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116476
Approved by: https://github.com/huydhn
2023-12-28 15:25:08 +00:00
b5e83b8c50 Fix edge case for size 1 channels dim in AdaptiveMaxPool (#116482)
Fixes https://github.com/pytorch/pytorch/issues/107842

Unlike `AdaptiveAvgPool`, `AdaptiveMaxPool` does not have a CUDA kernel for ChannelsLast. We work around this by calling `contiguous()` on the input. However, there is an edge case when the channels dimension has size 1.

```python
>>> t = torch.randn(2, 1, 3, 3)
>>> t.stride()
(9, 9, 3, 1)
>>> t_c =  t.to(memory_format=torch.channels_last)
>>> t_c.stride()
(9, 1, 3, 1)  # (CHW, 1, CW, C)
>>> t_c.is_contiguous()
True  # contiguity check doesn't check strides for singleton dimensions
```

Since the CUDA kernel treats the batch, `B`, and channels, `C`, dimensions as implicitly flattened and increments the data pointer for `input` to the start of the next plane using

669b182d33/aten/src/ATen/native/cuda/AdaptiveMaxPooling2d.cu (L67)

If our input falls into the aforementioned edge case, the `data_ptr` will not be incremented correctly. The simple fix for this is to calculate the stride for the channels dimension using $\prod_{i > 1}size(i)$

Analogous fix for the 3D case.
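
As a minimal illustration of the idea (plain Python, not the CUDA kernel itself), the plane stride the kernel should advance by can be derived from the sizes rather than from the reported stride of the singleton channels dimension:

```python
import math
import torch

t = torch.randn(2, 1, 3, 3).to(memory_format=torch.channels_last)
print(t.stride(1))             # 1 -- misleading for the flattened B*C plane loop
print(math.prod(t.shape[2:]))  # 9 == H * W, the stride the kernel should advance by
```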

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116482
Approved by: https://github.com/albanD
2023-12-28 15:02:29 +00:00
dfc898ede4 Don't decompose functional ops in predispatch functionalization (#116383)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116383
Approved by: https://github.com/bdhirsh
ghstack dependencies: #115188, #115210
2023-12-28 11:54:04 +00:00
80c07df659 Update doc for the constraints of FractionalMaxPool2d (#116261)
Fixes [#115531 ](https://github.com/pytorch/pytorch/issues/115531)
Update doc for the constraints of FractionalMaxPool2d.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116261
Approved by: https://github.com/mikaylagawarecki
2023-12-28 06:55:36 +00:00
d791074c81 Clean up PyTorch op BC check list (#116468)
Summary: Remove the expired items.

Test Plan: CI

Differential Revision: D52435764

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116468
Approved by: https://github.com/feikou
2023-12-28 06:05:59 +00:00
6243dbb5c0 [DTensor][BE] unify PlacementStrategy print function (#116428)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116428
Approved by: https://github.com/wanchaol
ghstack dependencies: #115683, #115689
2023-12-28 01:10:20 +00:00
87fea086aa [DTensor] remove experimental DTensor op backward layer norm (#115689)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115689
Approved by: https://github.com/wanchaol, https://github.com/yoyoyocmu
ghstack dependencies: #115683
2023-12-28 01:10:20 +00:00
575f17ebd4 [DTensor] add layer norm backward support (#115683)
**Summary**
This PR adds DTensor implementation for ATen op `native_layer_norm_backward`.

**Test Plan**
pytest test/distributed/_tensor/test_math_ops.py -s -k layer_norm
pytest test/distributed/_tensor/test_dtensor_ops.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115683
Approved by: https://github.com/wanchaol
2023-12-28 01:10:10 +00:00
b3f7fdbf0a Add decomp for pad_sequence (#116285)
Summary: currently pad_sequence causes unintended symbolic shape specialization in export. Adding a decomp avoids the C++ kernel that caused the specialization.
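
A minimal sketch of the idea (illustrative only; not necessarily the decomposition that landed) is to express pad_sequence in terms of `F.pad` and `stack`, so the shape-specializing C++ kernel is never hit:

```python
import torch
import torch.nn.functional as F

def pad_sequence_decomp(sequences, batch_first=False, padding_value=0.0):
    # Pad each sequence up to the max length along dim 0, then stack.
    max_len = max(s.size(0) for s in sequences)
    padded = [
        F.pad(s, (0, 0) * (s.dim() - 1) + (0, max_len - s.size(0)), value=padding_value)
        for s in sequences
    ]
    return torch.stack(padded, dim=0 if batch_first else 1)

seqs = [torch.ones(3, 5), torch.ones(1, 5)]
print(pad_sequence_decomp(seqs, batch_first=True).shape)  # torch.Size([2, 3, 5])
```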

Test Plan: buck test mode/opt caffe2/test:test_export -- -r pad_sequence

Reviewed By: SherlockNoMad

Differential Revision: D52345667

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116285
Approved by: https://github.com/peterbell10, https://github.com/lezcano
2023-12-27 23:56:51 +00:00
d59350cc1c [Dynamo] Consolidate common constant types (#116366)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116366
Approved by: https://github.com/Skylion007
2023-12-27 23:54:35 +00:00
6375eb15ef [Dynamo][11/N] allow_in_graph/disallow_in_graph decorator refactor (#116365)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116365
Approved by: https://github.com/jansel
2023-12-27 23:50:35 +00:00
53e32d12c4 [c10] Use nested namespace in c10/cuda (#116464)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116464
Approved by: https://github.com/Skylion007
2023-12-27 23:14:00 +00:00
93b86bf531 [GHF] Implement stacked revert (#116447)
By adding `get_ghstack_dependent_prs` that using `git branch --contains`
finds all PRs containing stacked branch, selecting longest one (in
terms of distance between origin and default branch) and skipping all
open PRs

Please note that reverts should be applied in the reverse of the order in which the PRs were originally landed.

Use a bit of defensive programming, i.e. revert a single PR if the attempt to fetch its dependencies fails for some reason.

Test plan:
 - Lint
 -  ```
    >>> from trymerge import GitRepo, GitHubPR, get_ghstack_prs, get_ghstack_dependent_prs
    >>> pr=GitHubPR("pytorch", "pytorch", 115188)
    >>> pr1=GitHubPR("pytorch", "pytorch", 115210)
    >>> repo=GitRepo("/Users/nshulga/git/pytorch/pytorch")
    >>> get_ghstack_dependent_prs(repo, pr1)
    [('22742d93a5357c9b5b45a74f91a6dc5599c9c266', <trymerge.GitHubPR object at 0x100f32f40>)]
    >>> get_ghstack_dependent_prs(repo, pr)
    [('22742d93a5357c9b5b45a74f91a6dc5599c9c266', <trymerge.GitHubPR object at 0x10102eaf0>), ('76b1d44d576c20be79295810904c589241ca1bd2', <trymerge.GitHubPR object at 0x10102eb50>)]
    >>> rc=get_ghstack_dependent_prs(repo, pr)
    >>> rc[0][1].pr_num
    115210
    >>> rc[1][1].pr_num
    115188
    ```
 - see: https://github.com/malfet/deleteme/pull/59#issuecomment-1869904714 and https://github.com/malfet/deleteme/pull/74#issuecomment-1870542702

Fixes https://github.com/pytorch/test-infra/issues/4845
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116447
Approved by: https://github.com/huydhn
ghstack dependencies: #116446
2023-12-27 23:01:16 +00:00
5fcc2519f5 [GHF] Refactors (#116446)
Prep change for allowing stacked reverts

This is a no-op that factors out some helper functions that will be useful later:
 - `get_pr_commit_sha` finds a committed sha for a given PR
 - `_revlist_to_prs` converts a revlist to GitHubPRs conditionally
   filtering some out
 - `do_revert_prs` reverts multiple PRs in a batch, but so far is
   invoked with only one PR
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116446
Approved by: https://github.com/huydhn, https://github.com/seemethere
2023-12-27 23:01:16 +00:00
85628c0e57 Revert "[export] Update range constraints to runtime_var_to_range (#115427)"
This reverts commit f8ad664cf267bcbdd8f8f85e27ad3a6e7d9fa86f.

Reverted https://github.com/pytorch/pytorch/pull/115427 on behalf of https://github.com/angelayi due to failing internal tests ([comment](https://github.com/pytorch/pytorch/pull/115427#issuecomment-1870671728))
2023-12-27 22:44:45 +00:00
a17069684c Improve nn.modules.activation and batchnorm docs (#113531)
Fixes #112602

For some reason, I could not get the same output when running pycodestyle command as indicated in the issue. I manually ran ruff checks fixing the following issues  `D202`, `D204`,  `D205`, `D207`, `D400` and `D401`.

### Requested output

nn.modules.activation:
before: 135
after: 79

nn.modules.batchnorm
before: 21
after: 3
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113531
Approved by: https://github.com/mikaylagawarecki
2023-12-27 21:06:47 +00:00
3149e4a667 [dynamo] fix sum() function with start argument (#116389)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116389
Approved by: https://github.com/Skylion007, https://github.com/malfet
2023-12-27 20:42:27 +00:00
83502feabe [BE]: Enable readability-simplify-subscript-expr clang-tidy check (#116356)
[BE]: enable clang-tidy check for readability-simplify-subscript-expr which looks for unnecessarily complex subscripting of the underlying data array of STL types.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116356
Approved by: https://github.com/lezcano
2023-12-27 20:22:20 +00:00
8d84b5041c [pt-vulkan] Address CLANGTIDY warnings in api, graph, and impl folders (#116431)
## Context

**Currently, `*.h` and `*.cpp` produces many lint warnings/errors from `clang-tidy` in the Meta internal Phabricator mirror**. These changes address all the lint warnings in the `api`, `graph`, and `impl` folders in preparation for upcoming planned work.

## Review Guide

* Most changes are the result of automatically applied patches from `clang-tidy`
  * However, some warnings had to be manually addressed
  * There should be no functional changes
* Many of the `clang-tidy` warnings arose from the `facebook-hte-BadMemberName` rule which checks for compliance with variable naming rules from Meta's internal C++ style guide
  * However, the rest of the ATen codebase does not conform to this rule, and PyTorch Vulkan was written to be consistent with ATen's naming conventions; thus, to stay consistent with the rest of ATen, this rule is disabled wherever relevant using `// @lint-ignore-every CLANGTIDY facebook-hte-BadMemberName`
* Lint was disabled entirely for `vulkan_api_test.cpp` since there are too many warnings to address at the moment. Addressing all of them will be a small project of its own; thus, in the interim lint will be disabled to reduce distracting signals for developers.

Internal:

## Notes for Internal Reviewers

This diff was largely created with

```
cd ~/fbsource/xplat/caffe2/aten/src/ATen/native/vulkan
arc lint -e extra -a --take CLANGTIDY * 2>&1 | tee ~/scratch/lint.txt
```

The above command automatically applied patches suggested by `clang-tidy`, and the rest of the warnings were addressed manually.

To disable `facebook-hte-BadMemberName`, I found that disabling it via a `.clang-tidy` file didn't work with `arc lint`, and the only way that worked was through the adding a comment

```
// @lint-ignore-every CLANGTIDY facebook-hte-BadMemberName
```

Differential Revision: [D50336057](https://our.internmc.facebook.com/intern/diff/D50336057/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116431
Approved by: https://github.com/GregoryComer, https://github.com/kirklandsign
2023-12-27 19:29:18 +00:00
bbe3261dd3 [BE]: Use iterable.chain.from_iterable where possible (#116376)
This is more readable and more efficient when dealing with lots of sequences to chain together.
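
For illustration, the two spellings are equivalent for a concrete list of lists, but `from_iterable` avoids unpacking every sequence into one argument tuple up front:

```python
from itertools import chain

lists = [[1, 2], [3], [4, 5, 6]]
flat_a = list(chain(*lists))               # builds an argument tuple of all sequences
flat_b = list(chain.from_iterable(lists))  # consumes the sequences lazily, one at a time
assert flat_a == flat_b == [1, 2, 3, 4, 5, 6]
```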

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116376
Approved by: https://github.com/albanD
2023-12-27 19:20:07 +00:00
e0e90bc0d4 Revert "[dynamo] fix sum() function with start argument (#116389)"
This reverts commit 3c9076f070fab5b27eae3b7846755c98b7c97a1a.

Reverted https://github.com/pytorch/pytorch/pull/116389 on behalf of https://github.com/kit1980 due to Breaks Meta-internal tests, but the issue could have been caught on GitHub ([comment](https://github.com/pytorch/pytorch/pull/116389#issuecomment-1870556927))
2023-12-27 19:05:55 +00:00
5c9464fb51 add CALL_FINALLY opcode (#116159)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116159
Approved by: https://github.com/yanboliang
2023-12-27 19:01:08 +00:00
f657b2b1f8 [Dynamo][10/N] Remove TorchVariable and is_allowed (#116312)
After this refactor:
* ```TorchVariable``` definition and all references are removed.
* All ```is_allowed``` references except one are removed.
  - The only one left is in ```torch/_dynamo/decorators:_disallow_in_graph_helper```. It is called when users put the ```disallow_in_graph``` decorator on a function. Since we use the lists in ```trace_rules``` to decide a function's trace rule, the decorator would only be applied to user-defined functions rather than torch functions. I'll defer this to a separate decorator refactor PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116312
Approved by: https://github.com/jansel
2023-12-27 18:47:05 +00:00
87da0e1d23 [GHF] Fix gh_get_labels for small repos (#116444)
Not sure if this is a recent API change or what, but `gh_get_labels('malfet', 'deleteme')` used to raise an exception (see https://github.com/malfet/deleteme/actions/runs/7334535266/job/19971328673#step:6:37 )
```
  File "/home/runner/work/deleteme/deleteme/.github/scripts/label_utils.py", line 50, in get_last_page_num_from_header
    link_info[link_info.rindex(prefix) + len(prefix) : link_info.rindex(suffix)]
AttributeError: 'NoneType' object has no attribute 'rindex'
```

And with this fix it returns the expected list
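
A hedged sketch of the kind of guard that avoids calling `rindex` on a missing `Link` header (the parameter name and exact fix here are assumptions, not the landed code):

```python
def get_last_page_num_from_header(header: dict) -> int:
    # Small repos can return no "Link" header at all, which means the results
    # fit on a single page.
    link_info = header.get("link")
    if link_info is None:
        return 1
    prefix = "&page="
    suffix = ">"
    return int(link_info[link_info.rindex(prefix) + len(prefix): link_info.rindex(suffix)])
```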

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116444
Approved by: https://github.com/huydhn
2023-12-27 15:50:42 +00:00
e14026bc2a [CUDNN] RNNv6 API deprecation support (#115719)
The cuDNN RNNv6 API has been deprecated and support will be dropped in an upcoming release; this PR migrates to the newer API to support newer cuDNN versions that would otherwise break the build.

Note that it may not be tested yet in upstream CI if the upstream CI cuDNN version is less than 8.9.7.

CC @ptrblck @malfet

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115719
Approved by: https://github.com/albanD, https://github.com/malfet
2023-12-27 09:31:08 +00:00
0aa5b751bb [executorch hash update] update the pinned executorch hash (#116438)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116438
Approved by: https://github.com/pytorchbot
2023-12-27 09:13:54 +00:00
924f1b841a [optim] Allow torch.float64 scalars for forloop + foreach implementations (#115841)
Should allow for uses cases mentioned in #110940

This would allow scalars to also be float64s in the foreach implementation. The fused implementation would still create a float32 step on Adam and AdamW. This PR also does NOT worry about performance and is mainly for enablement.

Next steps:
- Relax the constraint on fused adam(w) and allow torch.float64 scalars there
- Allow _performant_ mixed dtypes in foreach (a bigger project in itself).

This PR will conflict with my other PRs, I will figure out a landing order

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115841
Approved by: https://github.com/albanD
2023-12-27 09:13:49 +00:00
1d13086492 [BE] force DTensorTestBase.build_device_mesh to use world_size rather than NUM_DEVICES constant (#116439)
**Test**:
`python test/distributed/fsdp/test_shard_utils.py -k test_create_chunk_dtensor`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116439
Approved by: https://github.com/wanchaol
2023-12-27 07:37:07 +00:00
6b91e6907e Add setUserEnabledNNPACK config (#116152)
When exporting a model with a convolution kernel on cpu, if mkldnn is disabled and nnpack is enabled, export will go down the nnpack optimized convolution kernel for certain shapes (code pointer: cd449e260c/aten/src/ATen/native/Convolution.cpp (L542-L552)). This means that we will automatically create a guard on that certain shape. If users want to export without any restrictions, one option is to disable nnpack. However, no config function exists for this, so this PR is adding a config function, similar to the `set_mkldnn_enabled` function.

Original context is in https://fb.workplace.com/groups/1075192433118967/posts/1349589822345892/?comment_id=1349597102345164&reply_comment_id=1349677642337110.

To test the flag, the following script runs successfully:
```
import os

import torch
from torchvision.models import ResNet18_Weights, resnet18

torch.set_float32_matmul_precision("high")

model = resnet18(weights=ResNet18_Weights.DEFAULT)
model.eval()

with torch.no_grad():
    # device = "cuda" if torch.cuda.is_available() else "cpu"
    torch.backends.mkldnn.set_flags(False)
    torch.backends.nnpack.set_flags(False)   # <--- Added config
    device = "cpu"
    model = model.to(device=device)
    example_inputs = (torch.randn(2, 3, 224, 224, device=device),)
    batch_dim = torch.export.Dim("batch", min=2, max=32)
    so_path = torch._export.aot_compile(
        model,
        example_inputs,
        # Specify the first dimension of the input x as dynamic
        dynamic_shapes={"x": {0: batch_dim}},
        # Specify the generated shared library path
        options={
            "aot_inductor.output_path": os.path.join(os.getcwd(), "resnet18_pt2.so"),
            "max_autotune": True,
        },
    )

```

I'm not sure who to add as reviewer, so please feel free to add whoever is relevant!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116152
Approved by: https://github.com/malfet
2023-12-27 06:00:16 +00:00
9c3ae37fc4 [Distributed] Add finer granularity tag for distributed submodule (#116434)
This PR is the start of integrating PyTorch distributed logs into TORCH_LOGS. We already have one tag "distributed" for all distributed components, but distributed is a very large component and we want some hierarchy, giving users the option to only turn on logs for certain submodules. So we also added tags starting with "dist_*" for each submodule. (This PR only adds some of them and we are going to add more down the road.)

Related discussions can be found here: https://github.com/pytorch/pytorch/issues/113544

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116434
Approved by: https://github.com/awgu, https://github.com/wanchaol
2023-12-27 04:09:34 +00:00
2c89e5a5e5 [inductor] Sort unbacked symbols before iterating on them (#116421)
get_unbacked_symbol_defs and get_unbacked_symbol_uses inconsistently return dicts vs. sets. The majority of the use cases of these methods use them for set membership, which is deterministic, but set iteration is non deterministic. Therefore, in the one place where we iterate through unbacked symbols, we sort by the symbol name before iterating to preserve determinism.

Another approach would be to have these functions consistently return dictionaries, where the key of the dictionary is the name of the symbol. I'm happy to do that approach if we think it's likely future code will forget to sort before iteration.
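
A schematic of the chosen approach (the names below are illustrative stand-ins, not the Inductor internals):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Sym:            # stand-in for a sympy Symbol
    name: str

unbacked_syms = {Sym("u2"), Sym("u0"), Sym("u1")}
# Membership checks against the set stay as they are; only the one place that
# iterates needs a stable order, so sort by the symbol's name first.
for sym in sorted(unbacked_syms, key=lambda s: s.name):
    print(sym.name)   # deterministic: u0, u1, u2
```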

Fixes #113130

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116421
Approved by: https://github.com/oulgen, https://github.com/aakhundov
2023-12-27 03:35:58 +00:00
362bc6d7cb Fixed a segfault issue when passing an empty kernel to quantized_max_pool1d (#116342)

Fixes #116323.

Reused the same check as for `max_pool1d`.
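
A hypothetical repro based on the linked issue (the exact call that crashed is an assumption); with the check in place, the call should raise instead of segfaulting:

```python
import torch

x = torch.quantize_per_tensor(torch.randn(1, 4, 8), scale=0.1, zero_point=0, dtype=torch.quint8)
try:
    torch.quantized_max_pool1d(x, kernel_size=[])   # empty kernel
except RuntimeError as e:
    print("raised as expected:", e)
```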

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116342
Approved by: https://github.com/jerryzh168
2023-12-27 01:22:49 +00:00
d0395239c1 [DTensor] allow OpStrategy to represent ops whose return type is a tuple (#115682)
**Summary**:
Ops like `native_layer_norm_backward` return a tuple of optional torch.Tensor.
This PR allows to use OpStrategy to represent `native_layer_norm_backward`'s
return value sharding.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115682
Approved by: https://github.com/wanchaol
2023-12-27 00:44:11 +00:00
44b98c09ca [BE] migrate all assertRaises tests to OptimizerInfo test_errors (#116315)
Removes a part of the sparse adam test and the following three tests: `test_fused_optimizer_raises`, `test_duplicate_params_across_param_groups`, `test_duplicate_params_in_one_param_group`

```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (d2d129de)]$ python test/test_optim.py -k test_fused_optimizer_raises -k test_duplicate_params_across_param_groups -k test_duplicate_params_in_one_param_group
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
...
----------------------------------------------------------------------
Ran 3 tests in 0.023s

OK
```

Increases coverage by running the duplicate-param tests on ALL the optims instead of just one each. Also fixes a SparseAdam bug where a tensor was accidentally passed through `list(...)` (which effectively calls torch.unbind) instead of being put into a list. This bug was caught by migrating the weird warning handling to a single warning context manager, which checks that nothing else gets raised.

The new test_errors does not run slower than before, overhead is still king:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (d2d129de)]$ python test/test_optim.py -k test_errors
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
..........................
----------------------------------------------------------------------
Ran 26 tests in 10.337s

OK
```

Compared to test_errors BEFORE my commit :p
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (b47aa696)]$ python test/test_optim.py -k test_errors
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
.............sssssssssssss
----------------------------------------------------------------------
Ran 26 tests in 11.980s

OK (skipped=13)
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (b47aa696)]$
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116315
Approved by: https://github.com/mikaylagawarecki
2023-12-27 00:08:31 +00:00
8abeacda6f Refactor user defined triton kernel tests (#116425)
I will be adding more triton tests of different types, so I'm moving them to a brand new file. While doing this, I also cleaned up some flake linting opt outs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116425
Approved by: https://github.com/aakhundov
2023-12-26 23:54:26 +00:00
3b709d7c1e Revert "[Dynamo][10/N] Remove TorchVariable and is_allowed (#116312)"
This reverts commit 015bd0e0a189f929e469c6bc75fe1541c18a014d.

Reverted https://github.com/pytorch/pytorch/pull/116312 on behalf of https://github.com/kit1980 due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/116312#issuecomment-1869825506))
2023-12-26 23:47:15 +00:00
13505898c9 Revert "[Dynamo][11/N] allow_in_graph/disallow_in_graph decorator refactor (#116365)"
This reverts commit 951da38800f66e2d2bb2bb8e87e12218d1e28b8c.

Reverted https://github.com/pytorch/pytorch/pull/116365 on behalf of https://github.com/kit1980 due to Need to revert this because of https://github.com/pytorch/pytorch/pull/116312 ([comment](https://github.com/pytorch/pytorch/pull/116365#issuecomment-1869824468))
2023-12-26 23:43:45 +00:00
0aa185f394 [BE] Make torch.cuda.has_magma a build time check (#116299)
Perhaps originally one needed to query about GPU capability, but right now it's a simple check for a build time flag: 52f0457d7d/aten/src/ATen/cuda/detail/CUDAHooks.cpp (L165-L171)

Alternative, to avoid `at::hasMAGMA()` call  one can implement it as follows:
```cpp
  const auto use_magma = caffe2::GetBuildOptions().at("USE_MAGMA");
  return PyBool_FromLong(use_magma == "1");
```

Make this check very similar to `_has_mkldnn`
0978482afa/torch/csrc/Module.cpp (L1793-L1794)

Test plan:
 Run `lldb -- python3 -c "import torch;print(torch.cuda.has_magma)"` and make sure it returns True and that `cuInit` is not called

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116299
Approved by: https://github.com/seemethere, https://github.com/albanD
2023-12-26 23:37:23 +00:00
0edc348788 Revert "[Dynamo] Consolidate common constant types (#116366)"
This reverts commit 36dccc2aba61a2637aa5d42f38b6fd1fe10dcbdc.

Reverted https://github.com/pytorch/pytorch/pull/116366 on behalf of https://github.com/kit1980 due to Need to revert this because of https://github.com/pytorch/pytorch/pull/116312 ([comment](https://github.com/pytorch/pytorch/pull/116366#issuecomment-1869821625))
2023-12-26 23:36:52 +00:00
e86636266f [Quantized] Fixed equal_quantized_cpu for QUInt4 (#116307)
- Return false if scalar_type is different (because QInt8 and QUInt8 have identical item_size but shouldn't be compared by comparing raw data); see the sketch after this list
- Compute data_size correctly for QUInt4x2 and QUInt2x4 dtypes
- Add regression test
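
A hedged sketch of the scenario the first bullet describes (illustrative only; it assumes `torch.equal` dispatches to `equal_quantized_cpu` for quantized CPU tensors):

```python
import torch

x = torch.zeros(4)
a = torch.quantize_per_tensor(x, 0.1, 0, torch.qint8)
b = torch.quantize_per_tensor(x, 0.1, 0, torch.quint8)
# The raw 1-byte representations match, but the scalar types differ, so
# equality must report False rather than comparing the bytes alone.
print(torch.equal(a, b))  # False once scalar_type is checked first
```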

Fixes https://github.com/pytorch/pytorch/issues/116087

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116307
Approved by: https://github.com/jerryzh168
2023-12-26 21:52:28 +00:00
e5bcfe205e [inductor] fix cpp_wrapper inputs mismatch (#116197)
Summary: fixes https://github.com/pytorch/pytorch/issues/115035, where in the cpp_wrapper JIT inductor, the input args should contain the lifted parameters.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116197
Approved by: https://github.com/jansel
2023-12-26 21:41:47 +00:00
7571511af9 [inductor] More tweaks to fusion logs (#115084)
I think it's more useful to print out actual fusions rather than
possible fusions.

I also updated `speedup_by_fusion`'s logs to include the node names in
the log output.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115084
Approved by: https://github.com/jansel, https://github.com/aakhundov
2023-12-26 20:25:57 +00:00
6051f9f404 multiply int8/uint8 for AVX512 (#116346)
Summary: multiply int8/uint8 for AVX512

Test Plan: sandcastle, github

Differential Revision: D52393918

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116346
Approved by: https://github.com/jgong5
2023-12-26 19:44:05 +00:00
51eef859eb min, max, clamp* for AVX2 hosts (#116236)
Summary: min, max, clamp* for AVX2 hosts

Test Plan: sandcastle, github

Differential Revision: D52353148

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116236
Approved by: https://github.com/alexsamardzic, https://github.com/malfet
2023-12-26 19:43:43 +00:00
427ecc61c0 [Easy][BE]: Fix none type comparison (#116399)
Simplifies the type comparison: it is unnecessary since None is a singleton, so any variable set to None refers to the same object and an identity check suffices.
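
For illustration, the simplification amounts to replacing an indirect type comparison with an identity check:

```python
x = None

# Indirect: compares types, relying on type(None) being NoneType.
if type(x) == type(None):
    print("none via type comparison")

# Direct: None is a singleton, so identity is the idiomatic (and cheaper) test.
if x is None:
    print("none via identity")
```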
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116399
Approved by: https://github.com/XuehaiPan, https://github.com/lezcano, https://github.com/malfet
2023-12-26 19:36:34 +00:00
0978482afa Revert "Implement aten::upsample_linear1d on mps (#115031)"
This reverts commit c6969cb8a93a7dfd3f1bf17716470174bb973076.

Reverted https://github.com/pytorch/pytorch/pull/115031 on behalf of https://github.com/malfet due to Broke lint, will fwd fix and re-land ([comment](https://github.com/pytorch/pytorch/pull/115031#issuecomment-1869693081))
2023-12-26 18:01:49 +00:00
f4230ec9fd [inductor] Remove the float16 restriction for cpu cpp_wrapper (#116205)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116205
Approved by: https://github.com/jgong5, https://github.com/chunyuan-w, https://github.com/jansel
2023-12-26 16:01:20 +00:00
Kai
c6969cb8a9 Implement aten::upsample_linear1d on mps (#115031)
Related to #77764

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115031
Approved by: https://github.com/malfet
2023-12-26 15:44:21 +00:00
4c6e842496 [inductor][cpp] load as scalar for the index invariant in the vector range (#116387)
For the test `test_expr_vec_non_contiguous`, the index_expr `31L + (63L*(c10::div_floor_integer(x1, 32L))) + (c10::div_floor_integer(x2, 32L))` is invariant under the vector range of `x2`.
Before change
```c++
            #pragma omp for
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(4L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(1024L); x1+=static_cast<long>(16L))
                {
                    {
                        #pragma omp declare reduction(max:at::vec::Vectorized<float>:omp_out = at::vec::maximum(omp_out, omp_in)) initializer(omp_priv={at::vec::Vectorized<float>(-std::numeric_limits<float>::infinity())})
                        float tmp_acc0 = -std::numeric_limits<float>::infinity();
                        at::vec::Vectorized<float> tmp_acc0_vec = at::vec::Vectorized<float>(-std::numeric_limits<float>::infinity());
                        for(long x2=static_cast<long>(0L); x2<static_cast<long>(1024L); x2+=static_cast<long>(1L))
                        {
                            auto tmp0 =
                            [&]
                            {
                                __at_align__ std::array<int, 16> tmpbuf;
                                #pragma GCC unroll 16
                                for (long x1_inner = 0; x1_inner < 16; x1_inner++)
                                {
                                    tmpbuf[x1_inner] = static_cast<long>(31L + (63L*(c10::div_floor_integer((x1 + x1_inner), 32L))) + (c10::div_floor_integer(x2, 32L)));
                                }
                                return at::vec::Vectorized<int>::loadu(tmpbuf.data());
                            }
                            ()
                            ;
                            auto tmp1 = static_cast<int>(2048);
                            auto tmp2 = at::vec::Vectorized<int>(tmp1);
                            auto tmp3 = to_float_mask(tmp0 < tmp2);
                            auto tmp4 = [&]
                            {
                                auto tmp5 =
                                [&]
                                {
                                    __at_align__ std::array<float, 16> tmpbuf;
                                    #pragma GCC unroll 16
                                    for (long x1_inner = 0; x1_inner < 16; x1_inner++)
                                    {
                                        if (vector_lane_mask_check(tmp3, x1_inner))
                                        {
                                            tmpbuf[x1_inner] = in_ptr0[static_cast<long>(31L + (63L*(c10::div_floor_integer((x1 + x1_inner), 32L))) + (2048L*(static_cast<long>((x1 + x1_inner)) % static_cast<long>(32L))) + (65536L*x0) + (c10::div_floor_integer(x2, 32L)))];
                                        }
                                    }
                                    return at::vec::Vectorized<float>::loadu(tmpbuf.data());
                                }
                                ()
                                ;
                                return tmp5;
                            }
                            ;
                            auto tmp6 =
                            [&]
                            {
                                if (all_zero(to_float_mask(tmp3)))
                                {
                                    return at::vec::Vectorized<float>(static_cast<float>(0.0));
                                }
                                else
                                {
                                    return decltype(tmp4())::blendv(at::vec::Vectorized<float>(static_cast<float>(0.0)), tmp4(), to_float_mask(tmp3));
                                }
                            }
                            ()
                            ;
                            tmp_acc0_vec = at::vec::maximum(tmp_acc0_vec, tmp6);
                        }
                        tmp_acc0_vec.store(out_ptr0 + static_cast<long>(x1 + (1024L*x0)));
                    }
                }
            }
        }
```
After change
```c++
            #pragma omp for
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(4L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(1024L); x1+=static_cast<long>(16L))
                {
                    {
                        #pragma omp declare reduction(max:at::vec::Vectorized<float>:omp_out = at::vec::maximum(omp_out, omp_in)) initializer(omp_priv={at::vec::Vectorized<float>(-std::numeric_limits<float>::infinity())})
                        float tmp_acc0 = -std::numeric_limits<float>::infinity();
                        at::vec::Vectorized<float> tmp_acc0_vec = at::vec::Vectorized<float>(-std::numeric_limits<float>::infinity());
                        for(long x2=static_cast<long>(0L); x2<static_cast<long>(1024L); x2+=static_cast<long>(1L))
                        {
                            auto tmp0 = c10::convert<int>(31L + (63L*(c10::div_floor_integer(x1, 32L))) + (c10::div_floor_integer(x2, 32L)));
                            auto tmp1 = static_cast<int>(2048);
                            auto tmp2 = tmp0 < tmp1;
                            auto tmp3 = [&]
                            {
                                auto tmp4 =
                                [&]
                                {
                                    __at_align__ std::array<float, 16> tmpbuf;
                                    #pragma GCC unroll 16
                                    for (long x1_inner = 0; x1_inner < 16; x1_inner++)
                                    {
                                        if (tmp2 != 0)
                                        {
                                            tmpbuf[x1_inner] = in_ptr0[static_cast<long>(31L + (63L*(c10::div_floor_integer((x1 + x1_inner), 32L))) + (2048L*(static_cast<long>((x1 + x1_inner)) % static_cast<long>(32L))) + (65536L*x0) + (c10::div_floor_integer(x2, 32L)))];
                                        }
                                    }
                                    return at::vec::Vectorized<float>::loadu(tmpbuf.data());
                                }
                                ()
                                ;
                                return tmp4;
                            }
                            ;
                            auto tmp5 =
                            [&]
                            {
                                if (all_zero(to_float_mask(tmp2)))
                                {
                                    return at::vec::Vectorized<float>(static_cast<float>(0.0));
                                }
                                else
                                {
                                    return decltype(tmp3())::blendv(at::vec::Vectorized<float>(static_cast<float>(0.0)), tmp3(), to_float_mask(tmp2));
                                }
                            }
                            ()
                            ;
                            tmp_acc0_vec = at::vec::maximum(tmp_acc0_vec, tmp5);
                        }
                        tmp_acc0_vec.store(out_ptr0 + static_cast<long>(x1 + (1024L*x0)));
                    }
                }
            }
        }
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116387
Approved by: https://github.com/EikanWang, https://github.com/lezcano
ghstack dependencies: #114545
2023-12-26 08:45:04 +00:00
3c9076f070 [dynamo] fix sum() function with start argument (#116389)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116389
Approved by: https://github.com/Skylion007
2023-12-26 06:37:55 +00:00
cyy
bb2a1e9941 Enable readability-redundant-smartptr-get in clang-tidy (#116381)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116381
Approved by: https://github.com/Skylion007
2023-12-26 06:05:15 +00:00
ffe6f9ac91 [inductor cpp] support vectorization for index_expr that depends on tiling itervar or with indirect indexing (#114545)
As the title says, this PR enables vectorization for the situation where the index_expr depends on the vectorized itervar. There are two cases here:
1. The vectorized itervar has constant stride in the index_expr. We vectorize the index_expr with `Vectorized<int32>::arange` for this case.
2. Otherwise, we load the index_expr vector in a non-contiguous way with a loop.

Below is the generated code for the first case, from the test `test_concat_inner_vec`. Here the index_expr is `x1`, which is itself the vectorized itervar and has constant stride 1, so we vectorize it with `arange`. We use `all_zero` to implement a short-cut for masks, avoiding unnecessary execution of nested masked regions which are invalid.
Before:
```c++
            #pragma omp for  collapse(2)
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(32L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(155L); x1+=static_cast<long>(1L))
                {
                    auto tmp0 = c10::convert<long>(x1);
                    auto tmp1 = static_cast<long>(0);
                    auto tmp2 = tmp0 >= tmp1;
                    auto tmp3 = static_cast<long>(35);
                    auto tmp4 = tmp0 < tmp3;
                    auto tmp5 = [&]
                    {
                        auto tmp6 = in_ptr0[static_cast<long>(x1 + (35L*x0))];
                        return tmp6;
                    }
                    ;
                    auto tmp7 = tmp4 ? tmp5() : static_cast<decltype(tmp5())>(0.0);
                    auto tmp8 = tmp0 >= tmp3;
                    auto tmp9 = static_cast<long>(155);
                    auto tmp10 = tmp0 < tmp9;
                    auto tmp11 = [&]
                    {
                        auto tmp12 = in_ptr1[static_cast<long>((-35L) + x1 + (120L*x0))];
                        return tmp12;
                    }
                    ;
...
```
After:
```c++
            #pragma omp for
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(32L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(144L); x1+=static_cast<long>(16L))
                {
                    auto tmp0 = c10::convert<int>(x1);
                    auto tmp1 = at::vec::Vectorized<int32_t>::arange(tmp0, 1);
                    auto tmp2 = static_cast<int>(0);
                    auto tmp3 = at::vec::Vectorized<int>(tmp2);
                    auto tmp4 = to_float_mask(tmp1 >= tmp3);
                    auto tmp5 = static_cast<int>(35);
                    auto tmp6 = at::vec::Vectorized<int>(tmp5);
                    auto tmp7 = to_float_mask(tmp1 < tmp6);
                    auto tmp8 = [&]
                    {
                        auto tmp9 = masked_load(in_ptr0 + static_cast<long>(x1 + (35L*x0)), to_float_mask(tmp7));
                        return tmp9;
                    }
                    ;
                    auto tmp10 =
                    [&]
                    {
                        if (all_zero(to_float_mask(tmp7)))
                        {
                            return at::vec::Vectorized<float>(static_cast<float>(0.0));
                        }
                        else
                        {
                            return decltype(tmp8())::blendv(at::vec::Vectorized<float>(static_cast<float>(0.0)), tmp8(), to_float_mask(tmp7));
                        }
                    }
                    ()
                    ;
...
```

Below is the generated code for the second case from the test case `test_expr_vec_non_contiguous`. Here, the index_expr is `31L + (63L*(c10::div_floor_integer(x1, 32L))) + (c10::div_floor_integer(x2, 32L))` which depends on the vectorized itervar `x2` and doesn't have constant stride. So, we load the index_expr vector with a loop. (In fact, this can be further optimized since the index_expr is invariant with the data points in the range [x2, x2+16). So it can be regarded as a scalar. This will be optimized in the follow-up PR.) The code uses `vector_lane_mask_check` to implement the masked version of non-contiguous load.
Before:
```c++
            #pragma omp for  collapse(2)
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(4L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(1024L); x1+=static_cast<long>(1L))
                {
                    {
                        float tmp_acc0 = -std::numeric_limits<float>::infinity();
                        for(long x2=static_cast<long>(0L); x2<static_cast<long>(1024L); x2+=static_cast<long>(1L))
                        {
                            auto tmp0 = c10::convert<long>(31L + (63L*(c10::div_floor_integer(x1, 32L))) + (c10::div_floor_integer(x2, 32L)));
                            auto tmp1 = static_cast<long>(2048);
                            auto tmp2 = tmp0 < tmp1;
                            auto tmp3 = [&]
                            {
                                auto tmp4 = in_ptr0[static_cast<long>(31L + (63L*(c10::div_floor_integer(x1, 32L))) + (2048L*(static_cast<long>(x1) % static_cast<long>(32L))) + (65536L*x0) + (c10::div_floor_integer(x2, 32L)))];
                                return tmp4;
                            }
                            ;
                            auto tmp5 = tmp2 ? tmp3() : static_cast<decltype(tmp3())>(0.0);
                            tmp_acc0 = max_propagate_nan(tmp_acc0, tmp5);
                        }
                        out_ptr0[static_cast<long>(x1 + (1024L*x0))] = tmp_acc0;
                    }
                }
            }
```
After:
```c++
            #pragma omp for
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(4L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(1024L); x1+=static_cast<long>(16L))
                {
                    {
                        #pragma omp declare reduction(max:at::vec::Vectorized<float>:omp_out = at::vec::maximum(omp_out, omp_in)) initializer(omp_priv={at::vec::Vectorized<float>(-std::numeric_limits<float>::infinity())})
                        float tmp_acc0 = -std::numeric_limits<float>::infinity();
                        at::vec::Vectorized<float> tmp_acc0_vec = at::vec::Vectorized<float>(-std::numeric_limits<float>::infinity());
                        for(long x2=static_cast<long>(0L); x2<static_cast<long>(1024L); x2+=static_cast<long>(1L))
                        {
                            auto tmp0 =
                            [&]
                            {
                                __at_align__ std::array<int, 16> tmpbuf;
                                #pragma GCC unroll 16
                                for (long x1_inner = 0; x1_inner < 16; x1_inner++)
                                {
                                    tmpbuf[x1_inner] = static_cast<long>(31L + (63L*(c10::div_floor_integer((x1 + x1_inner), 32L))) + (c10::div_floor_integer(x2, 32L)));
                                }
                                return at::vec::Vectorized<int>::loadu(tmpbuf.data());
                            }
                            ()
                            ;
                            auto tmp1 = static_cast<int>(2048);
                            auto tmp2 = at::vec::Vectorized<int>(tmp1);
                            auto tmp3 = to_float_mask(tmp0 < tmp2);
                            auto tmp4 = [&]
                            {
                                auto tmp5 =
                                [&]
                                {
                                    __at_align__ std::array<float, 16> tmpbuf;
                                    #pragma GCC unroll 16
                                    for (long x1_inner = 0; x1_inner < 16; x1_inner++)
                                    {
                                        if (vector_lane_mask_check(tmp3, x1_inner))
                                        {
                                            tmpbuf[x1_inner] = in_ptr0[static_cast<long>(31L + (63L*(c10::div_floor_integer((x1 + x1_inner), 32L))) + (2048L*(static_cast<long>((x1 + x1_inner)) % static_cast<long>(32L))) + (65536L*x0) + (c10::div_floor_integer(x2, 32L)))];
                                        }
                                    }
                                    return at::vec::Vectorized<float>::loadu(tmpbuf.data());
                                }
                                ()
                                ;
                                return tmp5;
                            }
                            ;
                            auto tmp6 =
                            [&]
                            {
                                if (all_zero(to_float_mask(tmp3)))
                                {
                                    return at::vec::Vectorized<float>(static_cast<float>(0.0));
                                }
                                else
                                {
                                    return decltype(tmp4())::blendv(at::vec::Vectorized<float>(static_cast<float>(0.0)), tmp4(), to_float_mask(tmp3));
                                }
                            }
                            ()
                            ;
                            tmp_acc0_vec = at::vec::maximum(tmp_acc0_vec, tmp6);
                        }
                        tmp_acc0_vec.store(out_ptr0 + static_cast<long>(x1 + (1024L*x0)));
                    }
                }
            }
        }
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114545
Approved by: https://github.com/lezcano
2023-12-26 05:36:39 +00:00
a254fbfd61 Initialize variable for all codepaths in dynamo benchmarks (#116260)
Sometimes, the first statement that sets this variable in the try block fails due to out of memory issues and the finally block tries to delete this variable, but it was not written to in the first place.
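
A minimal sketch of the failure mode and the fix (names here are illustrative, not the benchmark harness's actual variables):

```python
def run_once(fn):
    result = None                  # initialize on every codepath up front
    try:
        result = fn()              # may raise (e.g. CUDA OOM) before assigning
        return result
    finally:
        # Without the initialization above, an early failure meant `result`
        # was never bound and this cleanup itself raised.
        del result
```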

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116260
Approved by: https://github.com/lezcano
2023-12-26 05:15:39 +00:00
f6dfbffb3b [c10d] Add hashing as a debug feature for before and after NCCL collective call (#113238)
For now, we use `TORCH_DISTRIBUTED_DEBUG = DETAIL` to turn on a debug feature which calculates a hash of the input tensors and output results of c10d collectives in NCCL. This is a debugging feature so that we can rule out bugs at the c10d level.

<img width="840" alt="image" src="https://github.com/pytorch/pytorch/assets/6937752/cdc70b0b-ae3c-4efd-86ff-adc5c5ba505f">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113238
Approved by: https://github.com/wconstab, https://github.com/fegin
2023-12-25 22:25:38 +00:00
039fbeb016 [dynamo] fix functools.reduce() function with None as initial (#116398)
The `initial` argument in `functools.reduce` can be `None`.

```python
initial_missing = object()

def reduce(function, iterable, initial=initial_missing, /):
    it = iter(iterable)
    if initial is initial_missing:
        value = next(it)
    else:
        value = initial
    for element in it:
        value = function(value, element)
    return value
```

Reference:

- python/cpython#102759

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116398
Approved by: https://github.com/Skylion007
2023-12-25 21:23:28 +00:00
c7e9c15102 Ignore SIGINT in codecache workers (#116380)
Fixes #116379
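
A minimal sketch of the pattern (the initializer wiring below is an assumption about how such a worker pool can be set up, not codecache's exact code):

```python
import signal
from concurrent.futures import ProcessPoolExecutor

def _ignore_sigint():
    # Compilation workers should not die on Ctrl-C; the parent process handles
    # the interrupt and tears the pool down cleanly.
    signal.signal(signal.SIGINT, signal.SIG_IGN)

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=4, initializer=_ignore_sigint) as pool:
        print(pool.submit(sum, [1, 2, 3]).result())
```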

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116380
Approved by: https://github.com/Skylion007
2023-12-25 08:59:54 +00:00
951da38800 [Dynamo][11/N] allow_in_graph/disallow_in_graph decorator refactor (#116365)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116365
Approved by: https://github.com/jansel
2023-12-25 07:15:09 +00:00
22742d93a5 Expose functional IR to capture_pre_autograd (#115210)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115210
Approved by: https://github.com/zhxchen17
ghstack dependencies: #115188
2023-12-25 04:51:21 +00:00
76b1d44d57 pre_dispatch aot_export (#115188)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115188
Approved by: https://github.com/bdhirsh
2023-12-25 04:51:21 +00:00
36dccc2aba [Dynamo] Consolidate common constant types (#116366)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116366
Approved by: https://github.com/Skylion007
2023-12-24 22:58:01 +00:00
199e07f108 [pytree][BE] update treespec num_children access (#116370)
Change `len(treespec.children_specs)` -> `treespec.num_children`.
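
For illustration (this uses the private `torch.utils._pytree` module, so treat the exact import as an assumption):

```python
import torch.utils._pytree as pytree

leaves, spec = pytree.tree_flatten({"a": 1, "b": (2, 3)})
# Before: len(spec.children_specs); after: the dedicated property.
print(spec.num_children)  # 2 -- one child spec per top-level dict entry
```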

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116370
Approved by: https://github.com/Skylion007
2023-12-24 20:54:32 +00:00
81cebca3d2 [Inductor] [Quant] Fix QConv Binary Inplace Layout Issue (#115613)
This pull request primarily addresses two issues to resolve the `QConvPointWiseBinaryPT2E` layout problem:

- Following the changes made in 611a7457ca, for `QConvPointWiseBinaryPT2E` with post-op `sum`, we should also utilize `NoneLayout` and return `accum` instead of `QConvPointWiseBinaryPT2E`.

- Additionally, this pull request fixes an issue in the `_quantized_convolution_onednn` implementation. Given that we expect `accum` to be inplace changed, we should avoid copying `accum` by changing the memory format or data type inside the kernel implementation. Instead, we have moved the necessary changes of memory format or data type to the lowering of `QConvPointWiseBinaryPT2E`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115613
Approved by: https://github.com/jgong5, https://github.com/oulgen
ghstack dependencies: #116172
2023-12-24 08:04:29 +00:00
dfb6815170 [Reland] [PT2] [Quant] Change the QConv2d Binary post op name from add to sum (#116172)
**Summary**
Re-land https://github.com/pytorch/pytorch/pull/115329. Open a new PR since the origin branch has been deleted.
Change the QConv2d Binary fusion post op name from `add` to `sum`, since we are actually using OneDNN `post op sum` instead of `Binary_Add` for now.

**TestPlan**
```
python -m pytest test_quantized_op.py -k test_qconv2d_sum_pt2e
python -m pytest test_quantized_op.py -k test_qconv2d_sum_relu_pt2e
python -m pytest test_quantized_op.py -k test_qconv2d_sum_relu_float_output_pt2e
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116172
Approved by: https://github.com/kit1980
2023-12-24 08:00:21 +00:00
7cdbdc789d [executorch hash update] update the pinned executorch hash (#116362)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116362
Approved by: https://github.com/pytorchbot
2023-12-24 05:02:05 +00:00
f1cdb39da3 [dynamo] Fix handling of one_hot (#116338)
Fixes #115817

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116338
Approved by: https://github.com/yanboliang
2023-12-24 04:55:35 +00:00
dbbe8485b4 Fake Tensor refactors part 2 (#116345)
This should help trace time a bit.
This refactors `op_implementations` (which requires O(n) checks per op) to mostly use a dict with O(1) cost per op.
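A hedged sketch of the dispatch pattern described above, with illustrative names rather than the actual FakeTensor internals:

```python
import torch

_op_handlers = {}  # op -> handler, giving O(1) lookup per op

def register_op_impl(op):
    def decorator(fn):
        _op_handlers[op] = fn
        return fn
    return decorator

@register_op_impl(torch.ops.aten.add.Tensor)
def _add_impl(*args, **kwargs):
    ...  # hypothetical handler body

def dispatch(op, *args, **kwargs):
    handler = _op_handlers.get(op)  # dict lookup instead of scanning a list of checks
    if handler is None:
        raise NotImplementedError(op)
    return handler(*args, **kwargs)
```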

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116345
Approved by: https://github.com/yanboliang
2023-12-24 04:54:50 +00:00
6c419a0efd Fixed a segfault when calling topk on a quantized scalar tensor. (#116337)
Fixes #116324.

Added an extra check for empty sizes (=scalars) when running `topk` on quantized tensors. Added a test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116337
Approved by: https://github.com/Skylion007
2023-12-23 23:21:12 +00:00
3a4fe835cc Fixed segfault when trying to permute empty tensor (#116335)
Fixes #116325.

Fixed unchecked access to first element of `dims` when permuting an empty tensor. Added test to prevent regressions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116335
Approved by: https://github.com/Skylion007
2023-12-23 23:14:28 +00:00
015bd0e0a1 [Dynamo][10/N] Remove TorchVariable and is_allowed (#116312)
After this refactor:
* ```TorchVariable``` definition and all references are removed.
* All ```is_allowed``` references except one are removed.
  - The only one left is in ```torch/_dynamo/decorators:_disallow_in_graph_helper```, which is called when users put the ```disallow_in_graph``` decorator on a function. Since we use the lists in ```trace_rules``` to decide a function's trace rule, the decorator would only be applied to custom user functions rather than torch functions. I'll defer this to a separate decorator refactor PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116312
Approved by: https://github.com/jansel
2023-12-23 09:44:09 +00:00
4912922297 Fake Tensor refactors part 1 (#116344)
These are mostly small performance optimizations to move constant list construction into global scope and replace O(n) `x in list` checks with O(1) `x in dict` checks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116344
Approved by: https://github.com/yanboliang
2023-12-23 08:38:26 +00:00
08b404e3a2 [Dynamo] Remove ExecutionRecorder.MOD_EXCLUDES during replay & record (#116347)
Remove ```ExecutionRecorder.MOD_EXCLUDES``` since torch Python modules are now wrapped as ```PythonModuleVariable``` after #115724.
This was reported from Meta-internal use cases, where it triggered failures when replay & record was enabled. However, the feature was only enabled via ```TORCH_COMPILE_DEBUG=1``` rather than because it was actually needed; according to conversations with the team, they are not using it. We don't maintain replay & record well, so we can probably remove it from the codebase to avoid such issues in the future.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116347
Approved by: https://github.com/jansel
2023-12-23 08:13:14 +00:00
cyy
7663ffb673 [10/N] Fixes clang-tidy warnings in c10/util/*.h (#116326)
Still a continued work for clean up c10/util/*.h
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116326
Approved by: https://github.com/Skylion007
2023-12-23 04:59:55 +00:00
84b2a32359 [executorch hash update] update the pinned executorch hash (#115599)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115599
Approved by: https://github.com/huydhn
2023-12-23 04:07:23 +00:00
60f4114769 Support nn_module_stack in non_strict mode (#116309)
Summary: Title

Test Plan: CI

Differential Revision: D52382672

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116309
Approved by: https://github.com/zhxchen17
2023-12-23 03:34:58 +00:00
0931170a13 [vision hash update] update the pinned vision hash (#116343)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116343
Approved by: https://github.com/pytorchbot
2023-12-23 03:16:06 +00:00
4f4b931aba [inductor] Do variance calculation in opmath type (#115181)
Fixes #114903

Previously large split variance reductions stored the intermediates as float16
precision, which may lead to overflow as the intermediate result is
unnormalized.

In #114903 we see two different `num_split` decisions made based on the
hardware capabilities, one of which has large enough intermediates to cause
overflows.
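A toy illustration of the overflow (not the Inductor kernel itself): an unnormalized sum of squares easily exceeds float16's maximum of ~65504.

```python
import torch

x = torch.full((100_000,), 10.0, dtype=torch.float16)
print((x * x).sum(dtype=torch.float16))  # inf: the ~1e7 intermediate does not fit in fp16
print((x * x).sum(dtype=torch.float32))  # tensor(10000000.) with a wider accumulator
```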

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115181
Approved by: https://github.com/shunting314
2023-12-23 01:06:43 +00:00
65c5eed01d [sigmoid] Remove workaround for constant output. (#116288)
Summary: no more workaround_export_bug_constant_buffer_output

Test Plan:
buck2 run mode/dev-nosan //scripts/ads_pt2_inference:pt2_cli -- --src_model manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/473164617/6/gpu_lowering/input.predictor.disagg.gpu.merge

buck2 run mode/opt caffe2/torch/fb/model_transform/fx2trt/packaging:generate_merge_net_file -- --action=generate --lower_backend=aot_inductor_ep --input_file=/data/users/zhxchen17/fbsource/fbcode/input.predictor.disagg.gpu.merge --output_file=/tmp/409501788_66.predictor.disagg.gpu.merge

buck2 run mode/opt -c fbcode.nvcc_arch=a100 caffe2/torch/fb/model_transform/fx2trt/packaging:load_merge_net_predictor -- --loadMode=Normal --inputMergeNetFile=/tmp/409501788_66.predictor.disagg.gpu.merge --pytorch_predictor_sigmoid_enabled=true

Reviewed By: khabinov, SherlockNoMad

Differential Revision: D52210429

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116288
Approved by: https://github.com/tugsbayasgalan
2023-12-22 20:33:09 +00:00
3f9e9ecfe4 Fix torch.detach doc-string (#115850)
Fixes https://github.com/pytorch/pytorch/issues/98976

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115850
Approved by: https://github.com/albanD
2023-12-22 20:04:33 +00:00
b940fa2fce Delete unused global variable (#116228)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116228
Approved by: https://github.com/angelayi
ghstack dependencies: #116225, #116226
2023-12-22 19:07:59 +00:00
f08c4da86d Add a decomposition for take() (#114813)
Presumably this can close https://github.com/pytorch/pytorch/pull/109784

Also related to https://github.com/pytorch/pytorch/issues/93757 (though `take` is not listed there).

There's no bounds checking here (out of bounds indices cause a segfault or undefined behavior). Should that be added somehow?
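For reference, `torch.take` indexes the flattened input, so valid indices range over `numel()`; a minimal example:

```python
import torch

src = torch.tensor([[4, 3, 5],
                    [6, 7, 8]])
idx = torch.tensor([0, 2, 5])
print(torch.take(src, idx))  # tensor([4, 5, 8])
```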

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114813
Approved by: https://github.com/peterbell10, https://github.com/lezcano
2023-12-22 18:14:57 +00:00
341c4227a8 Update F32 sparse semi-structured support for CUTLASS back-end (#116017)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116017
Approved by: https://github.com/jcaip
2023-12-22 16:53:04 +00:00
0b9146bf5d [BE][Easy]: Update ruff to 0.1.9 (#116290)
Updates the ruff linter with lots of bugfixes, speed improvements, and fix improvements.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116290
Approved by: https://github.com/janeyx99, https://github.com/malfet
2023-12-22 15:26:02 +00:00
0e39f4db92 Disables denormal floating numbers on ARM CPU (#115184)
**Motivation:**
Denormal numbers are used to store extremely small numbers close to 0, and operating on them can incur extra computational cost. To address the performance problems they cause, PyTorch supports flushing denormal numbers to zero via its flush-denormal mode.

Currently set_flush_denormal() is only supported on x86 architectures with SSE3 (https://pytorch.org/docs/stable/generated/torch.set_flush_denormal.html); this PR extends the functionality to the ARM architecture.

**This PR:**
- Supports set_flush_denormal() on ARM.
- Datatypes supported and tested: FP64, FP32, BFloat16
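Typical usage of the API being extended here; `torch.set_flush_denormal` returns True only if the CPU supports flush-to-zero:

```python
import torch

if torch.set_flush_denormal(True):
    x = torch.tensor([1e-323], dtype=torch.float64)  # a float64 denormal
    print(x)  # tensor([0.], dtype=torch.float64): flushed to zero
```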

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115184
Approved by: https://github.com/jgong5
2023-12-22 13:56:46 +00:00
cyy
9a0c217a0a [9/N] Fixes clang-tidy warnings in c10/util/*.h (#116185)
Continued work to clean headers in c10/util.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116185
Approved by: https://github.com/Skylion007
2023-12-22 09:35:44 +00:00
c7514ccc8c Delete unused API again (#116226)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116226
Approved by: https://github.com/angelayi
ghstack dependencies: #116225
2023-12-22 09:30:00 +00:00
7a6cb9fdfb [Inductor Intel GPU backend Upstream] Step 1/3: Generalize device-bias code in code generation. (#116020)
As the [RFC](https://github.com/pytorch/pytorch/issues/114856) mentions, this is the step 1 to add Intel GPU backend as an alternative inductor backend.

### Design
Typically, in order to integrate Intel GPU backend into Inductor, we need to inherit from `WrapperCodegen` and `TritonScheduling` and implement the corresponding subclasses respectively. However, since `WrapperCodegen` and `TritonScheduling` have some device-bias code generation **scattered** in their methods, overriding them in subclasses would introduce a lot of duplicated parent class code.
For example:
2a44034895/torch/_inductor/codegen/wrapper.py (L487)

2a44034895/torch/_inductor/codegen/triton.py (L1996)

So we abstract the device-biased code scattered across `WrapperCodegen` and `TritonScheduling` into a unified interface, `DeviceOpOverrides`. This way, when integrating a new backend, we can maximize reuse of the `WrapperCodegen` and `TritonScheduling` code by inheriting from and implementing this interface.

Currently `DeviceOpOverrides` only covers Python wrapper code generation. We can further extend it to cover C++ wrapper code generation on demand.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116020
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/jansel
2023-12-22 08:42:51 +00:00
7d0ad6e870 Make native c10d_functional ops work with AOTInductor (#113735)
Summary:
- Revised `c10d_functional` ops to conform to https://github.com/pytorch/pytorch/tree/main/aten/src/ATen/native#func
- Modified `get_cpp_op_schema()` to handle mutable args and aliasing returns

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113735
Approved by: https://github.com/desertfire
ghstack dependencies: #113438
2023-12-22 08:12:13 +00:00
718b576e2c Port all_to_all_single to native c10d_functional (#113438)
Summary:
- Ported `all_to_all_single` to native c10d_functional
- Added Inductor support for the native `all_to_all_single` via the new collective IR's `create_out_of_place()`
- Since the new collective IR derives from `FallbackKernel` which implements a generic `free_unbacked_symbols`, no additional unbacked symbol handling for all_to_all_single is required

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113438
Approved by: https://github.com/yf225, https://github.com/ezyang
2023-12-22 08:12:13 +00:00
cb489e769c Delete unused API (#116225)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116225
Approved by: https://github.com/angelayi
2023-12-22 06:38:47 +00:00
b6473065c6 [AMD] Fix build for intra_node_comm (#116291)
Summary: amd build is broken

Test Plan:
```
buck-out/v2/gen/fbcode/75c2b50d9f8b18d8/caffe2/__fb_libtorch_hipify_gen_eqsb_torch/csrc/distributed/c10d/intra_node_comm.hip__/out/torch/csrc/distributed/c10d/intra_node_comm.hip:37:1: error: non-void function does not return a value [-Werror,-Wreturn-type]
}
^
1 error generated when compiling for gfx90a.
```

Now it's gone

Differential Revision: D52373348

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116291
Approved by: https://github.com/yifuwang
2023-12-22 05:51:50 +00:00
b342286646 adds async save, makes checkpointer private (#116293)
Adds Async Save and also makes `Checkpointer` classes private.

The original PR was here: https://github.com/pytorch/pytorch/pull/115864

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116293
Approved by: https://github.com/fegin
2023-12-22 05:22:39 +00:00
ad3c0b2c00 [torch.export] fixes for unlifting lifted tensor constants (#116266)
Summary: lifted tensor constants were not being treated the same way as named buffers when unlifting, i.e. they did not get the name correction that converts "." in FQNs to "_" to produce proper names. Additionally, future torchbind object support will allow objects to be registered, so only call register_buffer for lifted constants if the value is a tensor.

Differential Revision: D52367846

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116266
Approved by: https://github.com/angelayi
2023-12-22 04:46:25 +00:00
cyy
764b4cd44e Remove outdated string function wrapper for Android and Caffe2 (#116186)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116186
Approved by: https://github.com/janeyx99
2023-12-22 04:31:56 +00:00
b47aa69685 [c10d] Fix the hang issue in store.check(TIMEOUT_DUMP) (#116297)
Summary:
We have found that the root cause of the hang is NOT the destruction of stores. The hang in check() only happens when the store is a FileStore.

The file held by each FileStore was a temp file created with Python's tempfile, which deletes the file by default when it is closed.

Note that the file is opened and closed by every check() in the watchdog and in the constructor of FileStore.

So when check() tried to open the already-deleted file again, open() would only fail after the timeout (5 mins by default), hence the hang.

The fix is simple: avoid the default deletion when the file is closed.
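A hedged illustration of the failure mode using the public Python APIs (not the c10d fix itself): `NamedTemporaryFile` removes its file on close by default, so a later re-open of that path, as check() does, finds nothing; `delete=False` keeps the file alive.

```python
import tempfile
import torch.distributed as dist

tmp = tempfile.NamedTemporaryFile(delete=False)  # keep the file after close
store = dist.FileStore(tmp.name, 1)              # world_size=1 for this toy example
store.set("key", "value")
print(store.get("key"))                          # b'value'
```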
Test Plan:

1. We first reproduce the hang in check() in the existing unit test:
   test_init_process_group_for_all_backends by enabling the
   DumpOnTimeOut and making the main thread sleep for 2s, to give enough time for tempfile
   to be deleted
2. Adding log to check ref count of fileStore and also the sequence of
   file opening and closing
3. With the repro, an exception will be thrown as "no such file or
   directory' and unit test would fail
4. Verify the tests now passes with the above knob change
5. add an unit test in test_c10d_nccl to cover the fileStore check() code path
python test/distributed/test_c10d_common.py ProcessGroupWithDispatchedCollectivesTests
python test/distributed/test_c10d_nccl.py ProcessGroupNCCLTest.test_file_store_check
Reviewers:

Subscribers:

Tasks:
T173200093
Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116297
Approved by: https://github.com/fduwjj
ghstack dependencies: #116296
2023-12-22 04:04:30 +00:00
94f3781145 Fixed bug with unpickling integers > 64-bits (#115264)
Fixes #115234

Currently, the unpickling code does not support integers larger than 64 bits in size. However, such integers are supported by Python's own unpickling code.

See `pickle.py` in CPython:
```
def decode_long(data):
    r"""Decode a long from a two's complement little-endian binary string.

    >>> decode_long(b'')
    0
    >>> decode_long(b"\xff\x00")
    255
    >>> decode_long(b"\xff\x7f")
    32767
    >>> decode_long(b"\x00\xff")
    -256
    >>> decode_long(b"\x00\x80")
    -32768
    >>> decode_long(b"\x80")
    -128
    >>> decode_long(b"\x7f")
    127
    """
    return int.from_bytes(data, byteorder='little', signed=True)
```

E.g.:
```
>>> int.from_bytes(bytearray(b'\xff\xff\xff\xff\xff\xff\xff\xff\x00'), byteorder='little', signed=True)
18446744073709551615
```

This PR makes it so that integers of arbitrary size are supported with JS BigNums.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115264
Approved by: https://github.com/zdevito
2023-12-22 03:17:34 +00:00
9736deae76 [vision hash update] update the pinned vision hash (#109957)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109957
Approved by: https://github.com/pytorchbot
2023-12-22 03:12:23 +00:00
db25462ffd [quant][pt2e] Relax constraints on dtype and qscheme to allow for customizations (#116287)
Summary:
att

Test Plan:
CI

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116287
Approved by: https://github.com/kimishpatel
2023-12-22 03:12:04 +00:00
fdf8718225 Update reviewes for PyTorch Distributed (#116296)
Summary:
Add shuqiangzhang as a reviewer
Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116296
Approved by: https://github.com/fduwjj
2023-12-22 02:49:51 +00:00
4b97ed2ed8 [SparseCompressed] support csc layout for add sparse/dense. (#115433)
`add`, when passed one sparse and one dense argument, will error if the sparse argument does not have CSR layout. This PR modifies the underlying algorithm to be generic over the compressed dimension, handling both CSR and CSC. The functions are renamed to use the `sparse_compressed` qualifier rather than `sparse_csr`.

Fixes: #114807
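A hedged example of the newly supported pattern, adding a CSC-layout sparse tensor to a dense tensor:

```python
import torch

sparse_csc = torch.eye(3).to_sparse_csc()  # sparse argument in CSC layout
dense = torch.randn(3, 3)
print(torch.add(sparse_csc, dense))
```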

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115433
Approved by: https://github.com/cpuhrsch, https://github.com/pearu
ghstack dependencies: #115432
2023-12-22 01:47:55 +00:00
910baa3a03 [SparseCompressed] Support add(sparse_compressed, dense) (#115432)
Addition involving sparse compressed and dense arguments was implemented requiring that the dense tensor be on the LHS. This change adds support for the other pattern, `sparse + dense`, by permuting the arguments.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115432
Approved by: https://github.com/cpuhrsch, https://github.com/pearu
2023-12-22 01:47:55 +00:00
suo
d2d129de65 [sigmoid] replace unflatten with upstream version (#115468)
as title

Differential Revision: [D52000213](https://our.internmc.facebook.com/intern/diff/D52000213/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115468
Approved by: https://github.com/zhxchen17
2023-12-22 00:56:19 +00:00
127cae7ec8 [C10D] Increase TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC (#116267)
Change default from 2 min to 10 min.

Why? Many cases of heartbeat timeout were reported, but increasing
timeout led to the same job hanging in a different place, suggesting
heartbeat kill was working well and not a false positive.  However, some
others reported jobs running fine with increased timeouts.  One such
case was investigated below, and suggests that indeed a 2 min timeout is
too aggressive.  While we have not fully root caused the issue, it
is better to avoid killing jobs that would otherwise complete.

Current theory is that watchdog is not totally deadlocked, but is slowed
down in its processing of work objs due to some intermittent resource
contention.  Hence, allowing more time is more of a workaround than a
fix.

Debug/Analysis:
https://docs.google.com/document/d/1NMNWoTB86ZpP9bqYLZ_EVA9byOlEfxw0wynMVEMlXwM

Differential Revision: [D52368791](https://our.internmc.facebook.com/intern/diff/D52368791)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116267
Approved by: https://github.com/fduwjj
2023-12-22 00:47:45 +00:00
d6de2df6b6 Improve the error message when a PR lacks the necessary approvals (#116161)
The error message from https://github.com/pytorch/pytorch/pull/115329#issuecomment-1857135047 is pretty confusing because it lists some random `pytorch/metamates` folks from `superuser` merge rule.  My attempt here is to make the error message clearer by pointing out:

* All the matching merge rules and
* Their list of approvers

The message will now become:

```
Approvers from one of the follow rules are needed:
- Core Reviewers (1, 2, 3, 4, 5, ...)
- Core Maintainers (1, 2)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116161
Approved by: https://github.com/malfet, https://github.com/PaliC, https://github.com/atalman, https://github.com/ZainRizvi
2023-12-22 00:22:43 +00:00
99f7e721fe [inductor] make inductor work with new triton compile interface (#115878)
Two recent Triton PRs (https://github.com/openai/triton/pull/2701, https://github.com/openai/triton/pull/2756) change the interface of triton.compile; this PR adds the necessary changes on the Inductor side to work with both the old and the new compile API.

There is also some simplification between the compilation call in the subprocess and the one in the main process:
- Previously we passed warm_cache_only=True if the compilation happened in a subprocess, but Triton never uses that argument in the currently pinned version, so it has been removed.
- Previously we only passed compute_capability if the compilation happened in a subprocess. This PR changes that to always pass compute_capability to triton.compile, regardless of whether compilation happens in the main or the sub process.

Updated:
There are more interface changes on the Triton side, e.g.:
- tl.math.{min, max} now requires a propagate_nan argument.
- JITFunction.run now requires a warmup argument. This affects the benchmarking phase of matmul max-autotune; on the other hand, JITFunction.run no longer accepts a stream argument. Simply not passing it when benchmarking matmul Triton kernels works for both the old and new versions of Triton.
- The Triton Autotuner renamed its attributes from 'warmup' to 'num_warmup' and from 'rep' to 'num_rep'. This caused Dynamo to fail to handle Triton Autotuner objects, since Dynamo's TritonKernelVariable makes assumptions about the attribute names. This is exercised by test cases in which a model calls the Triton Autotuner directly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115878
Approved by: https://github.com/jansel
2023-12-22 00:09:29 +00:00
247f9c3de4 Preserve strides of custom Triton kernel args (#116219)
Summary: Currently, we [`clone`](19207b9183/torch/_inductor/lowering.py (L5273)) every `TensorBox` argument of custom Triton kernels while lowering them to the Inductor IR, during which the stride information of the kernel inputs is lost. This is problematic in the common case when the strides of a `torch.Tensor` argument are passed as scalars to a custom Triton kernel alongside the tensor itself (due to the underlying Triton code interpreting the tensors as raw pointers, so the contained stride semantics of the `torch.Tensor` is lost).

In this PR, we add an extended version of the existing [`clone` lowering](19207b9183/torch/_inductor/lowering.py (L2289))---`clone_preserve_reinterpret_view`---which carries over the `ir.ReinterpretView` layers (if any) from the source `TensorBox` to the cloned one. The rationale behind adding a new function (and switching to it only in `triton_kernel_wrap` for now), as opposed to extending the existing `clone`, is to keep the semantics of the latter untouched, as it is a lowering of `torch.clone` (albeit incomplete, as the `memory_format` is currently ignored). Changing the existing `clone` would change its semantics, which is not necessarily desirable in general. Open to suggestions, though.
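A toy Triton kernel (unrelated to the code this PR touches) showing why strides must travel alongside the raw pointer:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def copy_rows(in_ptr, out_ptr, stride_in, stride_out, n_cols, BLOCK: tl.constexpr):
    # Inside the kernel the tensor is only a pointer, so addressing uses the
    # stride that the caller passed in as a plain scalar.
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK)
    mask = cols < n_cols
    vals = tl.load(in_ptr + row * stride_in + cols, mask=mask)
    tl.store(out_ptr + row * stride_out + cols, vals, mask=mask)

x = torch.randn(16, 8, device="cuda")[:, :4]   # non-contiguous view, strides (8, 1)
out = torch.empty(16, 4, device="cuda")
copy_rows[(16,)](x, out, x.stride(0), out.stride(0), 4, BLOCK=4)
assert torch.equal(out, x)
```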

Test Plan:

```
$ python test/dynamo/test_functions.py -k test_triton_kernel_strided_input
...
----------------------------------------------------------------------
Ran 1 test in 5.568s

OK
```

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116219
Approved by: https://github.com/jansel
2023-12-21 22:46:32 +00:00
a27ed4d364 [dynamo / DDP] Add optimize_ddp_lazy_compile config to control lazy compile for DDPOptimizer (False by default) (#116292)
We want to enable `optimize_ddp_lazy_compile` by default as soon as possible, because it will fix stride mismatch errors (see motivation: https://github.com/pytorch/pytorch/pull/114154).

However, lazy compile currently causes shape mismatch in other cases (`test_graph_split_inductor_transpose`) and we need to fix them before we can enable it by default.
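A hedged usage sketch; the flag named in the title is assumed to live in Dynamo's config module and to default to False per this commit:

```python
import torch._dynamo

torch._dynamo.config.optimize_ddp_lazy_compile = True  # opt in explicitly for now
```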

Differential Revision: D52373445

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116292
Approved by: https://github.com/williamwen42, https://github.com/wconstab
2023-12-21 22:34:24 +00:00
1e834e0e50 Fix bug in mem_eff kernel with attention mask and MQA (#116234)
# Summary

Found using the repros mentioned in this issue: #112577

After many go rounds with compute-sanitizer and eventual printf debugging I feel pretty confident that this was the underlying issue

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116234
Approved by: https://github.com/malfet, https://github.com/danthe3rd, https://github.com/atalman
2023-12-21 21:52:21 +00:00
52f0457d7d Support view returns for functional inverses on narrowing views (#115893)
Part 1 of implementation for general [subclass view fake-ification](https://docs.google.com/document/d/1C5taWiplmX7nKiURXDOAZG2W5VNJ2iV0fQFq92H0Cxw).

The following functional inverses are currently implemented scatter-style and thus never return views:
* `as_strided_copy_inverse()`
* `diagonal_copy_inverse()`
* `expand_copy_inverse()`
* `select_copy_int_inverse()`
* `slice_copy_Tensor_inverse()`
* `split_copy_Tensor_inverse()`
* `split_with_sizes_copy_inverse()`
* `unbind_copy_int_inverse()`
* `unfold_copy_inverse()`

We need to get actual views for the introduction of reverse view funcs coming next.

Details:
* Use `as_strided()` to implement actual view inverses for the above
    * Assumes we're given a mutated_view that is actually part of a bigger storage; this isn't really the case for functionalization
* Introduce `InverseReturnMode` enum for customization of functional inverses
    * `AlwaysView` - always return an actual view; needed for reverse view_funcs()
    * `NeverView` - always do a copy; useful for certain functionalization use cases (e.g. XLA, executorch)
    * `ViewOrScatterInverse` - return an actual view in most cases, but prefer scatter inverses when they exist. this avoids the need to implement `as_strided()` for subclasses, which can be difficult or impossible
* Make sure functionalization works as before
    * Use `ViewOrScatterInverse` when reapply_views TLS is True or `NeverView` otherwise
    * Adds tests to ensure old behavior for above inverses **in functionalization**
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115893
Approved by: https://github.com/bdhirsh
2023-12-21 21:39:22 +00:00
suo
b5c866db13 [export] Add FlatArgsAdapter to unflatten (#115467)
This is the final divergence between our internal/external unflatteners.

Differential Revision: [D52001135](https://our.internmc.facebook.com/intern/diff/D52001135/)

@diff-train-skip-merge
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115467
Approved by: https://github.com/zhxchen17
ghstack dependencies: #115466, #115795
2023-12-21 20:52:36 +00:00
suo
01ec3d1113 [export] upstream some final fixes to OSS unflatten (#115795)
as title

Differential Revision: [D52141387](https://our.internmc.facebook.com/intern/diff/D52141387/)

@diff-train-skip-merge
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115795
Approved by: https://github.com/zhxchen17
ghstack dependencies: #115466
2023-12-21 20:52:36 +00:00
suo
bc3ef1684e [export] refactor unflatten.py to be a top-level API (#115466)
This is in preparation for the merging of the internal and external versions of
the unflattener. Unflatten needs to be its own API because we are adding more
options to it in forthcoming diffs.

Differential Revision: [D52001133](https://our.internmc.facebook.com/intern/diff/D52001133/)

@diff-train-skip-merge
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115466
Approved by: https://github.com/zhxchen17
2023-12-21 20:52:29 +00:00
497777e302 Revert "Mark set_ as an inplace view op (#115769)"
This reverts commit cd449e260c830c9ce0f06ed4833b46aa638f1529.

Reverted https://github.com/pytorch/pytorch/pull/115769 on behalf of https://github.com/jeanschmidt due to breaking landing signals internally, more details on the diff, author is tagged ([comment](https://github.com/pytorch/pytorch/pull/115769#issuecomment-1866846607))
2023-12-21 19:53:32 +00:00
0e63837ec7 [dynamo] Skip some tests using scipy.kstest (#116263)
These tests are failing in CI with this error
```
  File "/opt/conda/envs/py_3.11/lib/python3.11/site-packages/torch/_dynamo/variables/builder.py", line 1126, in wrap_numpy_ndarray
    value.flags.writeable = True
    ^^^^^^^^^^^^^^^^^^^^^
torch._dynamo.exc.InternalTorchDynamoError: cannot set WRITEABLE flag to True of this array
```

And it may be related to a `SIGKILL` exception being raised shortly after the
failure.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116263
Approved by: https://github.com/lezcano
2023-12-21 18:08:29 +00:00
199b04fdbd Back out "Implement pass-through state_dict and load_state_dict for dynamo OptimizedModule (#113423)" (#116243)
Summary:
Original commit changeset: 2a9588cfd51b

Original Phabricator Diff: D52062368

Test Plan: In investigating S386328 and S382826, we found checkpoint loading succeed after backout D52062368: S386328_backout_1220_193648

Differential Revision: D52356011

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116243
Approved by: https://github.com/voznesenskym
2023-12-21 17:57:05 +00:00
ed03834693 Revert "Expose functional IR to capture_pre_autograd (#115210)"
This reverts commit 4b59b4dffba633f638f3d7ccffff2abc2e53f25e.

Reverted https://github.com/pytorch/pytorch/pull/115210 on behalf of https://github.com/malfet due to This should fix test_export_constraints_error_non_strict failures, see https://github.com/pytorch/pytorch/issues/116273 ([comment](https://github.com/pytorch/pytorch/pull/115210#issuecomment-1866706302))
2023-12-21 17:49:43 +00:00
a357a0f315 Back out "[Kineto] Initialize libkineto profilers during torch init process during pybind set-up (#112623)" (#116201)
Summary:
This diff needs to be backed out because TorchBench llama_v2_7b_16h has a cublas init error.
https://github.com/pytorch/benchmark/actions/runs/7266269668/job/19797677485?pr=2095

Test Plan: CI

Differential Revision: D52339142

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116201
Approved by: https://github.com/xuzhao9
2023-12-21 16:32:19 +00:00
ff4aac109a [BE][Easy]: Enable clang-tidy check readability-misplaced-array-index (#116210)
Enable clang-tidy check readability which checks for a bizarre C++ construct that is usually indicative of an error: https://clang.llvm.org/extra/clang-tidy/checks/readability/misplaced-array-index.html (indexing a number by a pointer, which surprisingly inverts the operands).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116210
Approved by: https://github.com/albanD, https://github.com/malfet
2023-12-21 15:09:10 +00:00
cc2c2c6ca9 [Easy][BE]: Enable clang-tidy check for duplicate includes (#116193)
Adds a clang-tidy check to flag duplicate include files
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116193
Approved by: https://github.com/albanD, https://github.com/malfet
2023-12-21 14:58:12 +00:00
2dce364634 [AOTI][refactor] Remove model_container_runner_cuda.cpp (#116113)
Differential Revision: [D52301272](https://our.internmc.facebook.com/intern/diff/D52301272)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116113
Approved by: https://github.com/khabinov
ghstack dependencies: #116047
2023-12-21 14:56:25 +00:00
f71d302c63 Revert "[Easy][BE]: Enable clang-tidy check for duplicate includes (#116193)"
This reverts commit 71cb13869b4eced76589f47e26bd64cdc2d54aa2.

Reverted https://github.com/pytorch/pytorch/pull/116193 on behalf of https://github.com/jeanschmidt due to Breaking internal test (bolt_nn_espresso_operator_test_eureka-scheduler) and job (build-rdk-diff-windows-debug-cuda11) @malfet and @albanD, please help the author get this PR merged by providing more information ([comment](https://github.com/pytorch/pytorch/pull/116193#issuecomment-1866391726))
2023-12-21 14:43:07 +00:00
348cb2f8f9 Revert "[BE][Easy]: Enable clang-tidy check readability-misplaced-array-index (#116210)"
This reverts commit 5d5ef016a622c8259b328e8b6f8fa7ffcf3c80dc.

Reverted https://github.com/pytorch/pytorch/pull/116210 on behalf of https://github.com/jeanschmidt due to unfortunately, It is required to revert this PR in order to properly revert https://github.com/pytorch/pytorch/pull/116193 ([comment](https://github.com/pytorch/pytorch/pull/116210#issuecomment-1866380974))
2023-12-21 14:37:41 +00:00
ec6c4fed3f Revert "Support nn_module_stack in torch.export(strict=False) (#115454)"
This reverts commit 6730b5bcb41e0519572759d9ad9852a113d0a7e4.

Reverted https://github.com/pytorch/pytorch/pull/115454 on behalf of https://github.com/jeanschmidt due to Breaking internal tests recycle_bin_citadel and executorch, check internal diff to see more details ([comment](https://github.com/pytorch/pytorch/pull/115454#issuecomment-1866315233))
2023-12-21 14:05:43 +00:00
0567f71ac6 Revert " pre_dispatch aot_export (#115188)"
This reverts commit a267d6735051a4714fa2ac1c163315b650118744.

Reverted https://github.com/pytorch/pytorch/pull/115188 on behalf of https://github.com/jeanschmidt due to sadly, it is required to revert this commit in order to revert https://github.com/pytorch/pytorch/pull/115454 ([comment](https://github.com/pytorch/pytorch/pull/115188#issuecomment-1866310014))
2023-12-21 14:03:18 +00:00
f170d6665c [DCP] Add a profiler function for benchmarking save and load (#116007)
Many of the operations involved in DCP's save and load are executed on the CPU, so we can easily profile them with cProfile. This PR adds the ability to profile save() and load().

One follow-up for this PR is to integrate the feature with the distributed logging flags.

Differential Revision: [D52245434](https://our.internmc.facebook.com/intern/diff/D52245434/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116007
Approved by: https://github.com/LucasLLC, https://github.com/wz337
ghstack dependencies: #116006
2023-12-21 08:03:07 +00:00
a548ff40de [DCP][BE] Remove unused function (#116006)
As title

Differential Revision: [D52245433](https://our.internmc.facebook.com/intern/diff/D52245433/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116006
Approved by: https://github.com/wz337
2023-12-21 07:20:08 +00:00
4b59b4dffb Expose functional IR to capture_pre_autograd (#115210)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115210
Approved by: https://github.com/zhxchen17
ghstack dependencies: #115188
2023-12-21 07:16:07 +00:00
8fd1963ae2 [dynamo][collective_op] Use the value of the wrappered attribute async_op in dynamo when checking supported or not (#115921)
I found that regardless of whether the attribute `async_op` in collective ops is explicitly set to `True` or `False` by users, it always leads to a graph break, because the argument `async_op` is wrapped as a `ConstantVariable(bool)` in Dynamo. So we need to use its `value` when making the support decision.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115921
Approved by: https://github.com/jansel, https://github.com/wconstab
2023-12-21 03:27:57 +00:00
74e8cfc9a0 Forward fix torch package bug - dont depend on dynam in fsdp directly (#116229)
Differential Revision: [D52350752](https://our.internmc.facebook.com/intern/diff/D52350752)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116229
Approved by: https://github.com/janeyx99, https://github.com/zou3519
2023-12-21 03:10:22 +00:00
db35ccf463 Revert "[innductor] make inductor work with new triton compile interface (#115878)"
This reverts commit bbded928b3556cf5678edf8fa41109d418312bcc.

Reverted https://github.com/pytorch/pytorch/pull/115878 on behalf of https://github.com/kit1980 due to Broke ROCm https://github.com/pytorch/pytorch/actions/runs/7282149837/job/19844618618 ([comment](https://github.com/pytorch/pytorch/pull/115878#issuecomment-1865369349))
2023-12-21 02:00:17 +00:00
65d3dde665 Fix allowed dtypes for mem_eff attention (#116026)
# Summary

Fix issue bug in detecting mem eff capability for cuda devices less than sm80:
https://github.com/pytorch-labs/gpt-fast/issues/49

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116026
Approved by: https://github.com/janeyx99
2023-12-21 01:56:38 +00:00
c1d960aadd [Quant] [Inductor] add input shape check for quantized conv binary lowering (#115247)
Add an input shape check for quantized conv binary lowering, since qconv2d_pointwise.binary does not yet support inputs with broadcasting shapes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115247
Approved by: https://github.com/leslie-fang-intel, https://github.com/eellison
2023-12-21 01:36:49 +00:00
be9de33240 [Dynamo][9/N] Make SkipFilesVariable wrap functions only (#115963)
Make ```SkipFilesVariable``` handle only function types, and route skipped classes to ```UserDefinedClassVariable```. The reasons behind this are:
* We'd like to remove ```is_allowed```, so the allowed/disallowed torch classes need a proper place to be handled. Under the current architecture we could put them in either ```SkipFilesVariable``` or ```UserDefinedClassVariable```, but it's confusing to have two places doing one thing.
   - Going forward, ```SkipFilesVariable``` will only handle functions, and I'll probably rename it to ```SkippedFunctionVariable``` in the following PRs.
   - Dispatch will be done by the value's type, and all torch class handling will go to ```UserDefinedClassVariable``` in the next PR.
* We'll merge the in_graph/skip/inline trace decisions into the same API, ```trace_rules.lookup```, so we have to limit its input to functions only in order to better organize the ```VariableBuilder._wrap``` logic.
   - As the next step, I'll merge ```skipfiles.check``` into ```trace_rules.lookup``` and do the skipfile check before wrapping values into the correct variable tracker.
   - Though ```TorchCtxManagerClassVariable``` is decided by ```trace_rules.lookup```, I'll refactor it out in the following PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115963
Approved by: https://github.com/jansel
2023-12-21 01:35:07 +00:00
a734085a63 [ONNX][Dort] Fix bug preventing running with OrtValueVector (#116124)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116124
Approved by: https://github.com/justinchuby, https://github.com/titaiwangms
ghstack dependencies: #115945
2023-12-21 01:20:46 +00:00
259b0af367 [ONNX] Add copy before export for perf bench to avoid mutating base model (#115945)
Otherwise the base model might be mutated, which affects the measured performance.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115945
Approved by: https://github.com/justinchuby, https://github.com/titaiwangms
2023-12-21 01:20:46 +00:00
feafbcf437 [AOTI][refactor] Refactor model runner API (#116047)
Summary: 1) make the proxy executor a private member; 2) use std::string instead of char*

Differential Revision: [D52301106](https://our.internmc.facebook.com/intern/diff/D52301106)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116047
Approved by: https://github.com/khabinov
2023-12-21 01:05:37 +00:00
9502fa8d84 add a transformer suite in TP/SP tests (#115530)
This is to address issue #115309.

Test plan
`python test/distributed/tensor/parallel/test_tp_examples.py -k test_transformer_training_is_seq_parallel_False`
`python test/distributed/tensor/parallel/test_tp_examples.py -k test_transformer_training_is_seq_parallel_True`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115530
Approved by: https://github.com/wanchaol
2023-12-21 01:04:36 +00:00
7ca6e0d38f [EZ] Add CUSPARSELT to build variables (#116213)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116213
Approved by: https://github.com/Skylion007, https://github.com/kit1980, https://github.com/atalman
ghstack dependencies: #116212
2023-12-21 01:02:11 +00:00
74119a3482 [EZ] Fix typo in USE_GLOO var (#116212)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116212
Approved by: https://github.com/Skylion007, https://github.com/kit1980
2023-12-21 01:02:11 +00:00
f206e31e2f Swap slots if slots match in swap_tensor (#116128)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116128
Approved by: https://github.com/albanD
2023-12-21 00:43:30 +00:00
8aae46f843 [ROCm] fix nightly 5.6 build (#116029)
ROCm 5.6 nightly wheel build broken by #114329.  This fixes it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116029
Approved by: https://github.com/huydhn, https://github.com/jithunnair-amd, https://github.com/atalman
2023-12-21 00:22:42 +00:00
be90b757d9 Enable compiled Adam in the benchmarks (#116093)
Commit b697bcc583 of mlazos/compiled-adam2 at https://hud.pytorch.org/benchmark/compilers
is an initial benchmark run

Increases compile time by 20s for torchbench and HF, and 30s for TIMM

I expect the compile time to come down significantly with fake tensor prop caching

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116093
Approved by: https://github.com/janeyx99
2023-12-21 00:17:36 +00:00
bbded928b3 [innductor] make inductor work with new triton compile interface (#115878)
Two recent Triton PRs (https://github.com/openai/triton/pull/2701, https://github.com/openai/triton/pull/2756) change the interface of triton.compile; this PR adds the necessary changes on the Inductor side to work with both the old and the new compile API.

There is also some simplification between the compilation call in the subprocess and the one in the main process:
- Previously we passed warm_cache_only=True if the compilation happened in a subprocess, but Triton never uses that argument in the currently pinned version, so it has been removed.
- Previously we only passed compute_capability if the compilation happened in a subprocess. This PR changes that to always pass compute_capability to triton.compile, regardless of whether compilation happens in the main or the sub process.

Updated:
There are more interface changes on the Triton side, e.g.:
- tl.math.{min, max} now requires a propagate_nan argument.
- JITFunction.run now requires a warmup argument. This affects the benchmarking phase of matmul max-autotune; on the other hand, JITFunction.run no longer accepts a stream argument. Simply not passing it when benchmarking matmul Triton kernels works for both the old and new versions of Triton.
- The Triton Autotuner renamed its attributes from 'warmup' to 'num_warmup' and from 'rep' to 'num_rep'. This caused Dynamo to fail to handle Triton Autotuner objects, since Dynamo's TritonKernelVariable makes assumptions about the attribute names. This is exercised by test cases in which a model calls the Triton Autotuner directly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115878
Approved by: https://github.com/jansel
2023-12-21 00:03:38 +00:00
5d5ef016a6 [BE][Easy]: Enable clang-tidy check readability-misplaced-array-index (#116210)
Enable clang-tidy check readability which checks for a bizarre C++ construct that is usually indicative of an error: https://clang.llvm.org/extra/clang-tidy/checks/readability/misplaced-array-index.html (indexing a number by a pointer, which surprisingly inverts the operands).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116210
Approved by: https://github.com/albanD, https://github.com/malfet
2023-12-21 00:00:20 +00:00
897600eb35 [inductor] Some tests have both CPU and CUDA variants running with CPU tensors (#116131)
I don't think that's intended.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116131
Approved by: https://github.com/jansel
2023-12-21 00:00:15 +00:00
7c7208a9e7 Forward fix to remove xfails for vmap NT tests in Dynamo (#116216)
Resolves land race between #116111 and #114523.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116216
Approved by: https://github.com/kit1980
2023-12-20 22:55:08 +00:00
edf1ea622d Move step is noop tests (#115299)
As stated. I do notice there is perhaps opportunity to abstract, but the tests as written are also super understandable and more abstraction might not be desirable.

This PR _increases coverage_. The original tests each tested 12 default configs (leaving out Rprop). Now the tests cover ~80 configs, plus foreach + fused on top of that! Test time increases over 10-fold, but this test is tiny so we are not worried:

Old:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (5ca9672c)]$ python test/test_optim.py -k test_step_is_noop_when_params_have_no_grad
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
.
----------------------------------------------------------------------
Ran 1 test in 0.028s

OK
```

New (includes the old test):
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (5ca9672c)]$ python test/test_optim.py -k test_step_is_noop_when_params_have_no_grad
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
...........................
----------------------------------------------------------------------
Ran 27 tests in 0.456s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115299
Approved by: https://github.com/albanD
ghstack dependencies: #114802, #115023, #115025
2023-12-20 22:49:44 +00:00
8f3a0594e9 Move tests depending on listed configs to OptimizerInfo (#115025)
Removing 4 tests:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (7539011b)]$ python test/test_optim.py -v -k test_fused_optimizers_with_large_tensors -k test_fused_optimizers_with_varying_tensors -k test_multi_tensor_optimizers_with_large_tensors -k test_multi_tensor_optimizers_with_varying_tensors
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
test_fused_optimizers_with_large_tensors (optim.test_optim.TestOptim) ... ok
test_fused_optimizers_with_varying_tensors (optim.test_optim.TestOptim) ... ok
test_multi_tensor_optimizers_with_large_tensors (optim.test_optim.TestOptim) ... ok
test_multi_tensor_optimizers_with_varying_tensors (optim.test_optim.TestOptim) ... ok

----------------------------------------------------------------------
Ran 4 tests in 22.731s

OK
```

For the same 4 but more granular:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (7539011b)]$ python test/test_optim.py  -v -k test_fused_large_tensor -k test_fused_mixed_device_dtype -k test_foreach_large_tensor -k test_foreach_mixed_device_dtype
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
test_foreach_large_tensor_ASGD_cpu_float16 (__main__.TestOptimRenewedCPU) ... skipped 'Only runs on cuda'
....
test_fused_mixed_device_dtype_Adam_cpu_float32 (__main__.TestOptimRenewedCPU) ... skipped 'Only runs on cuda'
test_foreach_large_tensor_ASGD_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_large_tensor_Adadelta_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_large_tensor_Adagrad_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_large_tensor_AdamW_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_large_tensor_Adam_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_large_tensor_NAdam_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_large_tensor_RAdam_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_large_tensor_RMSprop_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_large_tensor_Rprop_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_large_tensor_SGD_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_ASGD_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_Adadelta_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_Adagrad_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_AdamW_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_Adam_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_Adamax_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_NAdam_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_RAdam_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_RMSprop_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_Rprop_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_SGD_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_fused_large_tensor_AdamW_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_fused_large_tensor_Adam_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_fused_mixed_device_dtype_AdamW_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_fused_mixed_device_dtype_Adam_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok

----------------------------------------------------------------------
Ran 50 tests in 50.785s

OK (skipped=25)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115025
Approved by: https://github.com/albanD
ghstack dependencies: #114802, #115023
2023-12-20 22:49:44 +00:00
05d60931b3 Migrate test_peak_mem_multi_tensor_optimizers to OptimizerInfo (#115023)
Replace the following:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (1bbf1c6f)]$ python test/test_optim.py -k test_peak_mem_multi_tensor_optimizers
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
.
----------------------------------------------------------------------
Ran 1 test in 38.599s

OK
```

with 11 tests (one for each foreach optim :))
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (1bbf1c6f)]$ python test/test_optim.py -k TestOptimRenewedCUDA.test_foreach_memory
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
...........
----------------------------------------------------------------------
Ran 11 tests in 39.293s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115023
Approved by: https://github.com/albanD
ghstack dependencies: #114802
2023-12-20 22:49:44 +00:00
4fb92b591d [BE] remove redundant _test_derived_optimizers by migrating more to OptimizerInfo (#114802)
New tests look like:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (af8fca04)]$ python test/test_optim.py -v -k TestOptimRenewedCUDA.test_fused
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
test_fused_AdamW_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_fused_Adam_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok

----------------------------------------------------------------------
Ran 2 tests in 34.591s

OK
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (af8fca04)]$ python test/test_optim.py
-v -k test_set_default_dtype_works_with_foreach
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
test_set_default_dtype_works_with_foreach_ASGD_cpu_float64 (__main__.TestOptimRenewedCPU) ... skipped 'Only runs on cuda'
...
test_set_default_dtype_works_with_foreach_ASGD_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_Adadelta_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_Adagrad_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_AdamW_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_Adam_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_Adamax_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_NAdam_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_RAdam_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_RMSprop_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_Rprop_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_SGD_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok

----------------------------------------------------------------------
Ran 22 tests in 32.915s

OK (skipped=11)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114802
Approved by: https://github.com/albanD
2023-12-20 22:49:44 +00:00
0fae3dfef7 Add convenient things for Dynamo testing (#116173)
- added a way to easily add a skip
- added a way to easily turn markDynamoStrictTest on by default for a
  particular test file
- added an envvar to turn markDynamoStrictTest on by default
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116173
Approved by: https://github.com/voznesenskym
2023-12-20 22:49:26 +00:00
19207b9183 Allow more backend worker threads with each using a separate cuda stream (#116190)
Added a `--num_workers` option to `server.py` that allows more than 1 worker in the `ThreadPoolWorker` used for model predictions. Each worker uses its own `cuda.Stream()` that is created when the worker thread is initialized.

Ran benchmark for 2-4 workers with `compile=False` (since compile is not thread-safe)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116190
Approved by: https://github.com/albanD
ghstack dependencies: #115286, #116187, #116188, #116189
2023-12-20 22:08:29 +00:00
0dd64174bd Do H2D/D2H of input/result on separate threads/cuda.Streams (#116189)
Added two `ThreadPoolExecutor`s with 1 worker each for D2H and H2D copies. Each uses its own `cuda.Stream`. The purpose is to try to overlap D2H and H2D with compute and allow the worker handling prediction to launch compute kernels without being blocked by D2H/H2D.
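A hedged sketch of the pattern (not the benchmark's actual code): one single-worker executor per copy direction, each pinned to its own CUDA stream, so copies can overlap with compute launched from other threads.

```python
from concurrent.futures import ThreadPoolExecutor
import torch

h2d_stream, d2h_stream = torch.cuda.Stream(), torch.cuda.Stream()
h2d_pool = ThreadPoolExecutor(max_workers=1)
d2h_pool = ThreadPoolExecutor(max_workers=1)

def to_gpu(cpu_tensor):
    with torch.cuda.stream(h2d_stream):
        return cpu_tensor.to("cuda", non_blocking=True)

def to_cpu(gpu_tensor):
    with torch.cuda.stream(d2h_stream):
        return gpu_tensor.to("cpu", non_blocking=True)

# e.g. overlap the H2D copy of the next batch with the current prediction:
future_input = h2d_pool.submit(to_gpu, torch.randn(64, 3, 224, 224).pin_memory())
```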

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116189
Approved by: https://github.com/albanD
ghstack dependencies: #115286, #116187, #116188
2023-12-20 22:08:29 +00:00
3793ad6a7e Fix bugs in metrics calculation in inference benchmark and rerun baseline (#116188)
Before this PR, each `request_time` was separated by the time for a `torch.randn(...)` to create the fake `data` tensor on CPU. This meant that the gap between `request_times` **scaled with the batch_size**. So the latency comparisons across batch sizes were inaccurate. In this PR we generate all the fake data outside the loop to avoid this.

Other bug fixes:
- Only start polling GPU utilization after warmup event is complete
- Correct calculation of throughput: previously `(num_batches * batch_size) / sum(response_times)`, should have been `(num_batches * batch_size) / (last_response_time - first_request_time)` (see the sketch after this list)
- Make sure that response sent back to frontend is on CPU
- Use a lock to ensure writing to `metrics_dict` in `metrics_thread` and `gpu_utilization_thread` in a thread-safe manner
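A toy recomputation of the corrected throughput formula referenced in the list above (numbers are illustrative):

```python
batch_size = 32
request_times = [0.0, 0.1, 0.2, 0.3]   # when each batch was submitted
response_times = [0.5, 0.6, 0.7, 0.8]  # when each batch finished
num_batches = len(response_times)

# Old (buggy): dividing by the sum of response timestamps has no physical meaning.
# throughput = (num_batches * batch_size) / sum(response_times)

# Fixed: items processed divided by the wall-clock span of the whole run.
throughput = (num_batches * batch_size) / (response_times[-1] - request_times[0])
print(throughput)  # 160.0 items/second
```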

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116188
Approved by: https://github.com/albanD
ghstack dependencies: #115286, #116187
2023-12-20 22:08:22 +00:00
75a4b10d56 [easy] Add option for profiling backend in inference benchmark (#116187)
Some miscellaneous fixes; also added an option to set an experiment name that is added to the result table.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116187
Approved by: https://github.com/albanD
ghstack dependencies: #115286
2023-12-20 22:08:11 +00:00
31f21e033e Run inference in an Executor (#115286)
Experiment: run model predictions in the backend in a ThreadPoolExecutor so that each model prediction does not block reading requests from the queue

The baseline is reset in the PR above, which fixes several bugs in the metrics calculations, but I kept the metrics here anyway.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115286
Approved by: https://github.com/albanD
2023-12-20 22:08:02 +00:00
b72127cd4b [inductor] Support sym exprs in lowering constant promotion (#116196)
Follow-up to https://github.com/pytorch/pytorch/pull/115920

This PR fixes the error with symbolic expression in aten.div:
```python
import torch
aten = torch.ops.aten

def func(x, a):
    return aten.div(x * 0.5, a, rounding_mode=None)

cfunc = torch.compile(func, dynamic=True, fullgraph=True)
device = "cpu"
x = 124
a = 33
out = cfunc(x, a)
expected = func(x, a)
torch.testing.assert_close(out, expected)
```
Error message:
```
  File "/pytorch/torch/_inductor/graph.py", line 700, in call_function
    out = lowerings[target](*args, **kwargs)
  File "/pytorch/torch/_inductor/lowering.py", line 293, in wrapped
    out = decomp_fn(*args, **kwargs)
  File "/pytorch/torch/_inductor/lowering.py", line 4823, in div_mode
    return div(a, b)
  File "/pytorch/torch/_inductor/lowering.py", line 293, in wrapped
    out = decomp_fn(*args, **kwargs)
  File "/pytorch/torch/_inductor/lowering.py", line 4857, in div
    a, b = promote_constants(
  File "/pytorch/torch/_inductor/lowering.py", line 368, in promote_constants
    ex = next(x for x in inputs if isinstance(x, (TensorBox, ExpandView)))
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
LoweringException: StopIteration:
  target: aten.div.Tensor_mode
  args[0]: 1.0*s0
  args[1]: s1
  kwargs: {'rounding_mode': None}

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116196
Approved by: https://github.com/peterbell10
2023-12-20 21:59:51 +00:00
a267d67350 pre_dispatch aot_export (#115188)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115188
Approved by: https://github.com/bdhirsh
2023-12-20 21:36:25 +00:00
4afe2687d5 Reland "Serve multistream graph captures from correct pool (#114647)" (#116199)
Fixes a variable shadowing problem that broke internal builds.

This reverts commit fe156456194ed64bdf8b086d469b3643515a2baf.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116199
Approved by: https://github.com/eellison
2023-12-20 21:22:34 +00:00
199bacaf77 [Dynamo] Fix broken trunk and re-enable test_torch_name_rule_map_updated (#116146)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116146
Approved by: https://github.com/williamwen42
2023-12-20 21:22:29 +00:00
6e2c9be501 [Easy][BE]: Enable RUF008 and RUF016 checks (#116195)
Enables a few more static linting checks for mutable defaults in dataclasses and for detecting a common type error.
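An illustrative example of the kind of issue RUF008 flags (not a call site from this PR): a mutable default on a dataclass field, and the usual `default_factory` fix.

```python
from dataclasses import dataclass, field

@dataclass
class Config:
    # tags: list = []                          # RUF008 would flag this mutable default
    tags: list = field(default_factory=list)   # preferred fix
```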

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116195
Approved by: https://github.com/malfet
2023-12-20 21:16:49 +00:00
bc0d8649a4 Fix missing dependency in torch.utils.tensorboard (#115598)
Fixes #114591

The `version` package was removed in #114108 but was still used in `torch.utils.tensorboard`, causing import errors. The fix removes the import and uses a simpler check.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115598
Approved by: https://github.com/malfet
2023-12-20 21:11:52 +00:00
1d5a9a1c1a [Easy][BE]: remove itertools.accumulate Python 2 shim and apply UFMT (#116192)
Removes an unnecessary duplicated utility function and relies on itertools instead. Since the file is low traffic, I also added the modified files to the UFMT'd list and formatted them.
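A small reminder of the stdlib behavior the removed shim duplicated (generic example, not the deleted code):

```python
import itertools

print(list(itertools.accumulate([1, 2, 3, 4])))       # [1, 3, 6, 10] (running sum)
print(list(itertools.accumulate([3, 1, 4, 1], max)))  # [3, 3, 4, 4]  (running max)
```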
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116192
Approved by: https://github.com/malfet
2023-12-20 20:36:59 +00:00
602abf6b55 [ROCm] more 6.0 changes (#115946)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115946
Approved by: https://github.com/pruthvistony, https://github.com/huydhn, https://github.com/malfet
2023-12-20 20:19:29 +00:00
ea3a5f8ddc Add chunk for jagged layout NT (#115842)
Nice to have for the [SDPA tutorial](https://pytorch.org/tutorials/intermediate/scaled_dot_product_attention_tutorial.html)
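A hedged sketch of the new capability; this assumes chunking along the trailing regular dim, the exact supported dims may differ:

```python
import torch

nt = torch.nested.nested_tensor(
    [torch.randn(3, 8), torch.randn(5, 8)], layout=torch.jagged
)
left, right = nt.chunk(2, dim=-1)  # two jagged nested tensors, last dim 4 each
```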
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115842
Approved by: https://github.com/soulitzer
ghstack dependencies: #115192, #116111
2023-12-20 20:13:20 +00:00
29b198dcf8 Add markDynamoStrictTest to NT tests (#116111)
Decorates all NT tests with `@markDynamoStrictTest` to ensure we get the correct signal. Adds xfails where needed to get things passing.

Includes a fix in meta_utils.py for a bug that was breaking several python 3.11 tests. In particular, a dense tensor graph input that is a view of a strided NT would slip past Dynamo's check and break in meta-ification.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116111
Approved by: https://github.com/soulitzer, https://github.com/zou3519
ghstack dependencies: #115192
2023-12-20 20:13:20 +00:00
f2c1fb3ee4 Fix crash in SymInt unary minus (#116160)
Before this change `-SymInt(std::numeric_limits<int64_t>::min()) == 0` would reliably crash with null pointer dereference, as `data_` of the SymInt returned by `operator-` would be `0x8000000000000000`, because of the carry/overflow flags set by `negq`.

Before the change x86_64 assembly generated for
4f02cc0670/c10/core/SymInt.cpp (L137)
looked as follows:
```
   0x7ffff7f2f490 <+115>: movq   %rax, %rdx
    0x7ffff7f2f493 <+118>: negq   %rdx
    0x7ffff7f2f496 <+121>: movq   %rdx, (%rbp)
    0x7ffff7f2f49a <+125>: movabsq $0x4000000000000000, %rdx ; imm = 0x4000000000000000
    0x7ffff7f2f4a4 <+135>: cmpq   %rdx, %rax
    0x7ffff7f2f4a7 <+138>: jle    0x7ffff7f2f520            ; <+259> at SymInt.cpp:141:1
```
`negq %rdx` corresponds to the unary minus, and the `cmpq` against `0x4000000000000000` is the inverted `check_range`
b6d0d0819a/c10/core/SymInt.h (L247-L249)
Flags raised by `negq` affect the result of `cmpq`, and as a result the value would not be allocated on the heap but rather left as `nullptr`.

Not sure if it's worth benchmarking, but perhaps using `__builtin_sub_overflow` would be faster, as it does not require an extra comparison and just guarantees that the overflow flag is cleared after the op.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116160
Approved by: https://github.com/Skylion007, https://github.com/colesbury
2023-12-20 20:12:57 +00:00
f8ad664cf2 [export] Update range constraints to runtime_var_to_range (#115427)
Updated range_constraints to be the union of shape_env.var_to_range and shape_env.runtime_var_to_range, with shape_env.runtime_var_to_range taking priority.

Due to 0/1 specialization, if we bound an unbacked symint to be less than 5, the range of possible values for this symint is actually recorded as [2, 5] in shape_env.var_to_range. To fix this so that users will be able to see a more understandable range of [0, 5], shape_env.runtime_var_to_range was created to store the range of [0, 5]. Since range_constraints is a user-facing attribute to query the ranges of certain symints, we want to use shape_env.runtime_var_to_range to get the unbacked symints ranges, rather than shape_env.var_to_range.

Additionally, run_decompositions() has an issue where it always adds assertions to the graph, even if a previous run has already added them. So, I added logic to the AddRuntimeAssertionsForInlineConstraints pass that records which assertions have already been added.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115427
Approved by: https://github.com/zhxchen17
2023-12-20 20:00:41 +00:00
1be6a070bc Add support for torch.cond in vmap (#114523)
Fixes: https://github.com/pytorch/pytorch/issues/114136

The patch enables conversion of a BatchedTensor into a FakeTensor and implements torch.cond vmap support using torch.where.
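A hedged sketch of what this enables: vmapping a function that calls torch.cond over a batch of predicates (illustrative only):

```python
import torch

def f(pred, x):
    return torch.cond(pred, lambda t: t.sin(), lambda t: t.cos(), (x,))

preds = torch.tensor([True, False, True])
xs = torch.randn(3, 4)
out = torch.vmap(f)(preds, xs)  # per-example branch selection, lowered via torch.where
```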

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114523
Approved by: https://github.com/zou3519
2023-12-20 19:54:38 +00:00
06ae9b79ed [mtia] add module exporter to net minimizer (#115687)
Summary: add module exporter to net minimizer

Reviewed By: amylittleyang

Differential Revision: D52086699

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115687
Approved by: https://github.com/jfix71
2023-12-20 19:36:23 +00:00
6de28e92d2 [BE]: Apply FURB118 (prev): replaces unnecessary lambdas with operator. (#116027)
This replaces a bunch of unnecessary lambdas with the operator package. This is semantically equivalent, but the operator package is faster, and arguably more readable. When the FURB rules are taken out of preview, I will enable it as a ruff check.
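An illustrative before/after of the kind of rewrite FURB118 suggests (not a specific call site from this PR):

```python
import operator

pairs = [(2, "b"), (1, "a")]
by_key_lambda = sorted(pairs, key=lambda p: p[0])       # before
by_key_op = sorted(pairs, key=operator.itemgetter(0))   # after: equivalent, a bit faster
```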

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116027
Approved by: https://github.com/malfet
2023-12-20 19:35:08 +00:00
2d2016fdf8 WIP Add compatibility with channels_last_3d for conv3d (#114790)
Part of a multi-PR effort to fix #59168
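A hedged sketch of the targeted use case (support is being added across several PRs, so coverage may still be partial):

```python
import torch
import torch.nn as nn

x = torch.randn(2, 4, 8, 16, 16).to(memory_format=torch.channels_last_3d)
conv = nn.Conv3d(4, 8, kernel_size=3, padding=1)
y = conv(x)
print(y.is_contiguous(memory_format=torch.channels_last_3d))
```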

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114790
Approved by: https://github.com/albanD
2023-12-20 19:28:25 +00:00
8bff59e41d [ROCm] add hipblaslt support (#114329)
Disabled by default. Enable with env var DISABLE_ADDMM_HIP_LT=0. Tested on both ROCm 5.7 and 6.0.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114329
Approved by: https://github.com/malfet
2023-12-20 19:09:25 +00:00
0b0b9b3275 [c10d][libuv] add partial read test for libuv backend and fix an error which only happens when partially reading a buffer (#116141)
**Test Plan**
1. build pytorch
2. execute `TORCH_CPP_LOG_LEVEL=INFO build/bin/TCPStoreTest --gtest_filter=TCPStoreTest.testLibUVPartialRead` from the pytorch root directory.

without the change:
<img width="761" alt="image" src="https://github.com/pytorch/pytorch/assets/12968408/1942e3c2-a9c1-4fe4-87e8-7e21f4d8f9aa">

with the change:
<img width="747" alt="image" src="https://github.com/pytorch/pytorch/assets/12968408/f3e96a5b-0ed1-49bd-9184-bb8a5ebebc33">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116141
Approved by: https://github.com/wconstab
2023-12-20 18:37:55 +00:00
ee5d981249 [BE]: Enable RUFF PERF402 and apply fixes (#115505)
* Enable PERF402. Makes code more efficient and succinct by removing manual list-copy loops that can be replaced with a list constructor or an extend call (see the example below). All test cases have a noqa added since performance is not as sensitive in that folder.
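An illustrative example of the pattern PERF402 removes (generic, not a diff from this PR):

```python
items = ["a", "b", "c"]

# before: manual copy loop
copy = []
for item in items:
    copy.append(item)

# after: list constructor (or copy.extend(items))
copy = list(items)
```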

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115505
Approved by: https://github.com/malfet
2023-12-20 18:01:24 +00:00
8837df1d71 [c10d] Expose check method to Python for store via pybind (#116144)
Differential Revision: [D52310987](https://our.internmc.facebook.com/intern/diff/D52310987)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116144
Approved by: https://github.com/wconstab
2023-12-20 17:57:13 +00:00
71cb13869b [Easy][BE]: Enable clang-tidy check for duplicate includes (#116193)
Adds a clang-tidy check to flag duplicate include files
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116193
Approved by: https://github.com/albanD, https://github.com/malfet
2023-12-20 17:56:21 +00:00
fe15645619 Revert "Serve multistream graph captures from correct pool (#114647)"
This reverts commit 8a445f7bd5bef43b30b61b20483d606c6e42e606.

Reverted https://github.com/pytorch/pytorch/pull/114647 on behalf of https://github.com/jeanschmidt due to breaking multiple internal build jobs, please check internal diff in order to obtain more details ([comment](https://github.com/pytorch/pytorch/pull/114647#issuecomment-1864840724))
2023-12-20 17:11:42 +00:00
ea7f2de6f3 [docker] Fix typo in docker-release workflow (#116191)
Fix copy-paste typo in docker-release workflow.  After https://github.com/pytorch/pytorch/pull/116097

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116191
Approved by: https://github.com/malfet
2023-12-20 16:44:36 +00:00
16e539e0e6 Fix index range check (#116062)
Fixes an incorrect range check when index is `std::numeric_limits<int64_t>::min()`: the result of unary minus on such a value is undefined, but in practice equals the value itself, see https://godbolt.org/z/Wxhh44ocr

The lower-bound check was `size >= -index`, which is incorrect if `index` is `INT64_MIN`; it is replaced with a check based on `-1 - index`, which for all int64_t values yields a result that also fits into the int64_t range. `- (index + 1)` is more readable and results in identical optimized assembly, see https://godbolt.org/z/3vcnMYf9a , but its intermediate result for `INT64_MAX` is outside the `int64_t` range, which leads to problems similar to those with `INT64_MIN` in the original example.

Added regression test.

Fixes https://github.com/pytorch/pytorch/issues/115415

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116062
Approved by: https://github.com/Skylion007, https://github.com/albanD
2023-12-20 15:40:57 +00:00
fabf9433e7 [AOTI][refactor] Organize model runner files (#116022)
Summary: Move runner util files into a subdirectory and put AOTIModelContainerRunnerCpu into a separate file

Differential Revision: [D52300693](https://our.internmc.facebook.com/intern/diff/D52300693)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116022
Approved by: https://github.com/khabinov
2023-12-20 15:35:34 +00:00
4d6a1ad400 Activation checkpoint and checkpoint_sequential errors if use_reentrant not passed explicitly (#115868)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115868
Approved by: https://github.com/albanD
ghstack dependencies: #115438
2023-12-20 15:23:44 +00:00
cfb3cd11c1 Add basic autograd TORCH_LOGS support (#115438)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115438
Approved by: https://github.com/albanD
2023-12-20 15:23:44 +00:00
cfbf647adb Add aten/src/ATen/native/quantized/cpu/ path to CPU quantization merge rule (#116145)
Observing the following PR: https://github.com/pytorch/pytorch/pull/115329
Comment from author: https://github.com/pytorch/pytorch/pull/115329#issuecomment-1851339555

pytorchbot merge failed.
The reason is this logic: we expect all files in a PR to match one merge rule:
110339a310/.github/scripts/trymerge.py (L1310-L1324)

This should mitigate the issue; a follow-up PR will refactor this code to allow cross-rule matching of approvers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116145
Approved by: https://github.com/huydhn, https://github.com/kit1980, https://github.com/malfet
2023-12-20 14:43:15 +00:00
8eb7f6276b Ensure wrapping subclasses with as_subclass is supported (#116091)
As title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116091
Approved by: https://github.com/pmeier, https://github.com/zou3519
2023-12-20 14:37:08 +00:00
c215e59bf2 Revert "[inductor] Avoid bool being upcast to int (#109913)"
This reverts commit 92998693a9455af6259cae468265f01cfff8810e.

Reverted https://github.com/pytorch/pytorch/pull/109913 on behalf of https://github.com/jeanschmidt due to causing performance regression in relevant metrics, @malfet I believe you are the correct person to help identify and fix the issues. More details: check internal OPS count for ads metrics in the internal related diff ([comment](https://github.com/pytorch/pytorch/pull/109913#issuecomment-1864397407))
2023-12-20 12:33:50 +00:00
cyy
968b94bef2 [8/N] Fixes clang-tidy warnings in c10/{core,util}/*.h (#116082)
This patch enables clang-tidy coverage on c10/**/*.h and contains other fixes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116082
Approved by: https://github.com/Skylion007
2023-12-20 12:22:21 +00:00
d72d99e591 Fix sparse compressed tensor invariants checks when nnz==0 (#115826)
Fixes https://github.com/pytorch/pytorch/issues/115755

This PR is a step toward deprecating `torch.empty(..., layout=<sparse compressed tensor layout>)`, whose usage should be minimized as it produces invalid tensors; see also https://github.com/pytorch/pytorch/issues/90695 .

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115826
Approved by: https://github.com/cpuhrsch, https://github.com/amjames
2023-12-20 12:16:07 +00:00
bdfabe5e7d Revert "[Dynamo][9/N] Make SkipFilesVariable wrap functions only (#115963)"
This reverts commit bb5a27052fa989f2365793c7ffe2d5a453aca31a.

Reverted https://github.com/pytorch/pytorch/pull/115963 on behalf of https://github.com/jeanschmidt due to causing significant performance regression, identified by number of ops in ads, please check internal diff ([comment](https://github.com/pytorch/pytorch/pull/115963#issuecomment-1864361697))
2023-12-20 12:06:55 +00:00
af8a50e656 Revert "Fix allowed dtypes for mem_eff attention (#116026)"
This reverts commit fc58909babcd07ea9652a1c1b3c2c7803f407a37.

Reverted https://github.com/pytorch/pytorch/pull/116026 on behalf of https://github.com/jeanschmidt due to breaking internal windows buck builds, check internal diff for more details ([comment](https://github.com/pytorch/pytorch/pull/116026#issuecomment-1864354665))
2023-12-20 12:01:34 +00:00
6e1ba79b7f [re-land] Introduce 3 low-latency, intra-node allreduce algorithms for small messages to PyTorch (#114001) (#116125)
This is an attempt to re-land https://github.com/pytorch/pytorch/pull/114001. The previous attempt used `std::array` in cuda kernels which wasn't compatible with Meta's internal build.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116125
Approved by: https://github.com/yf225
2023-12-20 07:13:50 +00:00
9df4ee8d38 Fix ColwiseParallel typo (#116151)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116151
Approved by: https://github.com/wanchaol
2023-12-20 06:40:32 +00:00
545d2126f6 [pt-vulkan] Enable Python code blocks in shader templates and upgrade shader template generation (#115948)
Summary:
This change makes two major improvements to PyTorch Vulkan's shader authoring workflow.

## Review Guide

There are a lot of changed files because every GLSL shader had to be touched. The majority of the changes consist of changing

```
#define PRECISION $precision
#define FORMAT $format
```

to

```
#define PRECISION ${PRECISION}
#define FORMAT ${FORMAT}
```

due to changes in how shader templates are processed.

For reviewers, the primary functional changes to review are:

* `gen_vulkan_spv.py`
  * Majority of functional changes are in this file, which controls how shader templates are processed.
* `shader_params.yaml`
  * controls how shader variants are generated

## Python Codeblocks in Shader Templates

From now on, every compute shader (i.e. `.glsl`) is treated as a shader template. To this effect, the `templates/` folder has been removed and there is now a global `shader_params.yaml` file to describe the shader variants that should be generated for all shader templates.

**Taking inspiration from XNNPACK's [`xngen` tool](https://github.com/google/XNNPACK/blob/master/tools/xngen.py), shader templates can now use Python codeblocks**.  One example is:

```
$if not INPLACE:
  layout(set = 0, binding = 0, FORMAT) uniform PRECISION restrict writeonly image3D uOutput;
  layout(set = 0, binding = 1) uniform PRECISION sampler3D uInput;
  layout(set = 0, binding = 2) uniform PRECISION sampler3D uOther;
  layout(set = 0, binding = 3) uniform PRECISION restrict Block {
    ivec4 output_sizes;
    ivec4 input_sizes;
    ivec4 other_sizes;
    float alpha;
  }
  uArgs;
$else:
  layout(set = 0, binding = 0, FORMAT) uniform PRECISION restrict image3D uOutput;
  layout(set = 0, binding = 1) uniform PRECISION sampler3D uOther;
  layout(set = 0, binding = 2) uniform PRECISION restrict Block {
    ivec4 output_sizes;
    ivec4 other_sizes;
    float alpha;
  }
  uArgs;
```

Another is:

```
  // PYTHON CODEBLOCK
  $if not IS_DIV:
    const int c_index = (pos.z % ((uArgs.output_sizes.z + 3) / 4)) * 4;
    if (uArgs.other_sizes.z != 1 && c_index + 3 >= uArgs.output_sizes.z) {
      ivec4 c_ind = ivec4(c_index) + ivec4(0, 1, 2, 3);
      vec4 mask = vec4(lessThan(c_ind, ivec4(uArgs.output_sizes.z)));
      other_texel = other_texel * mask + vec4(1, 1, 1, 1) - mask;
    }

  // PYTHON CODEBLOCK
  $if not INPLACE:
    ivec3 input_pos =
        map_output_pos_to_input_pos(pos, uArgs.output_sizes, uArgs.input_sizes);
    const vec4 in_texel =
        load_texel(input_pos, uArgs.output_sizes, uArgs.input_sizes, uInput);

    imageStore(uOutput, pos, OP(in_texel, other_texel, uArgs.alpha));
  $else:
    const vec4 in_texel = imageLoad(uOutput, pos);
    imageStore(uOutput, pos, OP(in_texel, other_texel, uArgs.alpha));
```

In addition to making shader templates easier and clearer to write, this lets shaders that previously could not be consolidated, such as the non-inplace and inplace variants of the same op, be represented by a single template.

## `generate_variant_forall` in shader variant YAML configuration

YAML files that describe how shader variants should be generated can now use a `generate_variant_forall` field to iterate over various settings for a specific parameter for each variant defined. Example:

```
unary_op:
  parameter_names_with_default_values:
    OPERATOR: exp(X)
    INPLACE: 0
  generate_variant_forall:
    INPLACE:
      - VALUE: 0
        SUFFIX: ""
      - VALUE: 1
        SUFFIX: "inplace"
  shader_variants:
    - NAME: exp
      OPERATOR: exp(X)
    - NAME: sqrt
      OPERATOR: sqrt(X)
    - NAME: log
      OPERATOR: log(X)
```

Previously, the `inplace` variants would need to have separate `shader_variants` entries. If there are multiple variables that need to be iterated across, then all possible combinations will be generated. Would be good to take a look to see how the new YAML configuration works.

Test Plan:
There is no functional change to this diff; we only need to make sure that the generated shaders are still correct. Therefore, we only need to run `vulkan_api_test`.

```
# On Mac Laptop
buck run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 -- --gtest_filter="*"
```

Reviewed By: digantdesai

Differential Revision: D52087084

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115948
Approved by: https://github.com/manuelcandales
2023-12-20 05:47:33 +00:00
9766781512 Skip some flaky Dynamo tests (#116165)
The goal right now is to get the Dynamo CI back to green.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116165
Approved by: https://github.com/drisspg, https://github.com/aakhundov, https://github.com/huydhn, https://github.com/khabinov
2023-12-20 05:05:02 +00:00
3747aca49a [C10D] Make all PGNCCL LOG usages use logPrefix() (#116060)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116060
Approved by: https://github.com/fduwjj
ghstack dependencies: #116059
2023-12-20 04:19:45 +00:00
6ffe1da375 Add support for multi device foreach ops (#116064)
Fix for https://github.com/pytorch/pytorch/issues/102023

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116064
Approved by: https://github.com/mlazos
2023-12-20 04:19:40 +00:00
c72bc61bcd [ROCm] Fix caffe2 build with hipblasv2 api (#116073)
Summary: we need this change along with D52244365 to make caffe2 build happy

Test Plan: OSS CI

Differential Revision: D52275058

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116073
Approved by: https://github.com/jeffdaily, https://github.com/malfet
2023-12-20 04:02:29 +00:00
a597a00c87 [AOTI][refactor][3/n] Declare python_kernel_name and cpp_kernel_name in ExternKernel (#115972)
Summary: Both ExternKernelAlloc and ExternKernelOut need the two fields, so they are declared in the base class. Also add cpp codegen for IndexPutFallback and InplaceBernoulliFallback in this PR.

This is a reland of https://github.com/pytorch/pytorch/pull/115831

Differential Revision: [D52290900](https://our.internmc.facebook.com/intern/diff/D52290900)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115972
Approved by: https://github.com/chenyang78
2023-12-20 03:22:03 +00:00
4f02cc0670 [C10D] Add logPrefix to abortCommsFromMap (#116059)
Prints additional info such as PG ID/Rank.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116059
Approved by: https://github.com/fduwjj
2023-12-20 02:17:04 +00:00
c3bc65d9d8 [dynamo] Restore constant tensor original FQNs (#116086)
Differential Revision: D52192693

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116086
Approved by: https://github.com/angelayi, https://github.com/muchulee8
2023-12-20 02:10:02 +00:00
6730b5bcb4 Support nn_module_stack in torch.export(strict=False) (#115454)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115454
Approved by: https://github.com/suo, https://github.com/bdhirsh
2023-12-20 01:43:39 +00:00
c173a9d9b3 add Half support for layer_norm on CPU (#99590)
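A hedged usage sketch of the newly supported path (fp16 layer_norm on CPU, plus the mixed fp32-parameter variant benchmarked below); shapes are illustrative:

```python
import torch
import torch.nn.functional as F

x = torch.randn(8, 128, dtype=torch.half)   # CPU half input
w = torch.ones(128, dtype=torch.half)
b = torch.zeros(128, dtype=torch.half)

y = F.layer_norm(x, (128,), weight=w, bias=b)                         # all-fp16 path
y_mixed = F.layer_norm(x, (128,), weight=w.float(), bias=b.float())   # mixed fp32/fp16 path
```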
### Testing
Single socket (icx, 32cores):
| shape | fp32 forward (ms) | fp16 forward (ms) | mixed fp32 fp16 forward (ms) | fp32 backward (ms) | fp16 backward (ms) | mixed fp32 fp16 backward (ms) |
| -- | -- | -- | -- | -- | -- | -- |
| (1, 8, 16) | 0.012 | 0.011 | 0.011 | 0.051 | 0.051 | 0.050 |
| (8 ,8, 16) | 0.013 | 0.013 | 0.013 | 0.054 | 0.053 | 0.051 |
| (32, 8, 16) | 0.015 | 0.014 | 0.014 | 0.059 | 0.054 | 0.052 |
| (64, 128, 56, 56) | 1.875 | 0.790 | 1.016 | 12.845 | 7.151 | 6.985 |
| (64, 128, 256, 256) | 50.226 | 25.462 | 35.736 | 328.957 | 179.615 | 175.618 |

Single core (icx):

| shape | fp32 forward (ms) | fp16 forward (ms) | mixed fp32 fp16 forward (ms) | fp32 backward (ms) | fp16 backward (ms) | mixed fp32 fp16 backward (ms) |
| -- | -- | -- | -- | -- | -- | -- |
| (1, 8, 16) | 0.012 | 0.011 | 0.011 | 0.040 | 0.041 | 0.041 |
| (8 ,8, 16) | 0.012 | 0.012 | 0.012 | 0.042 | 0.042 | 0.042 |
| (32, 8, 16) | 0.027 | 0.014 | 0.014 | 0.048 | 0.048 | 0.046 |
| (64, 128, 56, 56) | 58.054 | 11.034 | 17.928 | 108.603 | 48.816 | 50.244 |
| (64, 128, 256, 256) | 1327.758 | 352.394 | 496.994 | 2846.182 | 1224.247 | 1218.422 |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99590
Approved by: https://github.com/mingfeima, https://github.com/jgong5, https://github.com/cpuhrsch
2023-12-20 01:11:15 +00:00
45cfe9cdf7 [export] Fix test to run internally (#116118)
Test Plan: `buck2 run @//mode/dev-nosan //caffe2/test:test_export`

Reviewed By: suo

Differential Revision: D52297701

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116118
Approved by: https://github.com/suo
2023-12-20 01:02:16 +00:00
c55210b4f0 [Inductor] Deduplicate grid wrapper statements for user defined triton kernels (#115849)
Noticed that on many MRS kernels the grid wrapper for autotuning is huge, with a bunch of duplicates, because num_warps and num_stages are not needed for the grid calculation. Let's deduplicate these entries.

Previously, we would see a wrapper like
```
    def grid_wrapper_for_add_kernel_2d_autotuned_0(meta):
        if meta['BLOCK_SIZE_X'] == 128 and meta['BLOCK_SIZE_Y'] == 128: return (4, 2, 1)
        if meta['BLOCK_SIZE_X'] == 128 and meta['BLOCK_SIZE_Y'] == 128: return (4, 2, 1)
        if meta['BLOCK_SIZE_X'] == 64 and meta['BLOCK_SIZE_Y'] == 64: return (8, 4, 1)
        if meta['BLOCK_SIZE_X'] == 64 and meta['BLOCK_SIZE_Y'] == 64: return (8, 4, 1)
```
now it looks like
```
    def grid_wrapper_for_add_kernel_2d_autotuned_0(meta):
        if meta['BLOCK_SIZE_X'] == 128 and meta['BLOCK_SIZE_Y'] == 128: return (4, 2, 1)
        if meta['BLOCK_SIZE_X'] == 64 and meta['BLOCK_SIZE_Y'] == 64: return (8, 4, 1)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115849
Approved by: https://github.com/jansel
2023-12-20 00:25:32 +00:00
9a2a44457a SDPA extend backward realized tensor alignment checking to forward realized tensors (#116069)
The logic to check alignment for realized tensors in the backward can be extended for realized tensors in the forward. This fixes an interaction with freezing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116069
Approved by: https://github.com/drisspg
2023-12-20 00:14:20 +00:00
110339a310 Fix c10::div_floor_floating compile error (#115647)
Introduced by #113276. I've added a test to catch future regressions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115647
Approved by: https://github.com/desertfire, https://github.com/vfdev-5
2023-12-20 00:09:01 +00:00
68c7aac809 [export][reland] non-strict export with dynamic shapes (#116048)
Reland of https://github.com/pytorch/pytorch/pull/115862

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116048
Approved by: https://github.com/ydwu4
2023-12-19 23:57:22 +00:00
cd449e260c Mark set_ as an inplace view op (#115769)
Summary: To be used in https://github.com/pytorch/pytorch/pull/113873. Since set_ is effectively an inplace view op, we'll need to skip caching it.

Test Plan: Built pytorch; specifically this step: `/home/slarsen/local/miniconda3/envs/pytorch-3.10/bin/python -m torchgen.gen --source-path /home/slarsen/local/pytorch/cmake/../aten/src/ATen --install_dir /home/slarsen/local/pytorch/build/aten/src/ATen --per-operator-headers --generate sources --output-dependencies /home/slarsen/local/pytorch/build/aten/src/ATen/generated_sources.cmake`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115769
Approved by: https://github.com/bdhirsh
2023-12-19 23:08:05 +00:00
0759240001 [sparse] update cslt to 0.5.2.1 (#115988)
Summary:

- update install_cusparselt to download 0.5.2.1 for 12.1
- add ifdef for new compute_type changes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115988
Approved by: https://github.com/malfet
ghstack dependencies: #115369
2023-12-19 23:02:54 +00:00
eqy
d55365dc05 [CUDA] Workaround shmem limit for certain input sizes in AdaptiveAvgPool1D (#115231)
Reference issue #68248

CC @ptrblck @malfet @xwang233

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115231
Approved by: https://github.com/mikaylagawarecki
2023-12-19 22:40:10 +00:00
7d92449171 Add call to run_tests for more tests? (#115781)
To make sure they get run in CI
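The pattern being added is the usual PyTorch test-file footer; a generic sketch:

```python
from torch.testing._internal.common_utils import TestCase, run_tests

class MyTests(TestCase):
    def test_trivial(self):
        self.assertTrue(True)

if __name__ == "__main__":
    run_tests()  # without this call the file defines tests but never runs them directly
```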

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115781
Approved by: https://github.com/kshitij12345, https://github.com/mlazos, https://github.com/voznesenskym
2023-12-19 22:20:10 +00:00
7f7a7b0b48 Reset stepcurrent cache if file succeeds (#115775)
Attempt to surface the segfault that happens on exit by resetting the "pytest last run" cache if pytest succeeds.  CI does not rerun on success so we won't hit an infinite loop anywhere, and I don't expect people to rerun on success (unless they're looking for flakes? Either way I highly doubt anyone is using the --sc/--scs flag locally).

This ensures that if pytest succeeds but the process gets a non zero exit code, the rerun will start at beginning instead of skipping all the "succeeding" tests.

This only applies if the --sc/--scs flags are used; these are custom to PyTorch and probably not used anywhere other than CI. They are not to be confused with --stepwise, which pytest provides by default.

Here's a list of segfaulting inductor/test_aot_inductor tests, which I added skips for:
```
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocation::test_duplicated_params_abi_compatible_cpu_with_stack_allocation
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocation::test_fqn_abi_compatible_cpu_with_stack_allocation
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocation::test_no_args_abi_compatible_cpu_with_stack_allocation
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocation::test_output_misaligned_abi_compatible_cpu_with_stack_allocation
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocation::test_pytree_inputs_abi_compatible_cpu_with_stack_allocation
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocation::test_seq_abi_compatible_cpu_with_stack_allocation
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocation::test_simple_split_abi_compatible_cpu_with_stack_allocation
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_addmm_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_aliased_buffer_reuse_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_buffer_reuse_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_convolution_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_duplicated_params_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_empty_graph_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_fqn_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_large_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_missing_output_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_no_args_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_output_misaligned_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_output_path_1_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_pytree_inputs_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_repeat_interleave_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_return_constant_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_reuse_kernel_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_seq_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_simple_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_simple_split_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_small_constant_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_with_no_triton_profiler_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_with_offset_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_with_profiler_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_zero_size_weight_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115775
Approved by: https://github.com/desertfire
2023-12-19 22:19:57 +00:00
f88c9af98e [TEST] Skip scaled_dot_product_attention test on sm < 80 (#115760)
According to the [functionality](https://github.com/NVIDIA/cutlass/blob/main/media/docs/functionality.md) page, CUTLASS supports `bfloat16` (aka `bf16`) only on compute capability 80+ devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115760
Approved by: https://github.com/drisspg
2023-12-19 22:00:33 +00:00
ae6f1f4a47 [BE]: enable readability-delete-null-pointer clang-tidy check (#116107)
* Enables an additional clang-tidy check that removes unnecessary nullptr checks around delete statements.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116107
Approved by: https://github.com/albanD, https://github.com/malfet
2023-12-19 21:08:37 +00:00
d85314c95c Support Predispatch functionalization (#113728)
In this PR, we are implementing Functionalization on the pre-dispatch graph. Today, every dispatch key except for DispatchKey.Python has a dedicated mode stack in Python. PreDispatch tracing relies on this behaviour by pushing ProxyTorchDispatchMode onto the DispatchKey.PreDispatch mode stack and handling the dispatching logic in Python. To make pre-dispatch functionalization work, we now need to push FunctionalTensorMode onto the DispatchKey.PreDispatch mode stack and make sure it runs before ProxyTorchDispatchMode (this is very similar to how post-dispatch tracing works). Here are some design decisions we made for this flow to work:

1. FunctionalTensorMode internally calls C++ functionalize key. Since C++ functionalization goes after PreDispatch, if we are not careful, we will keep re-entering into PreDispatch key. We solve this by directly dispatching to C++ Functionalize key.

2. We delete the mode_stack_per_key logic because the only realistic time it is exercised is for PreDispatch, and it is in general not safe to have a plain list: the ordering of FunctionalTensorMode and ProxyTorchDispatchMode matters and is hard to enforce on a plain list. Instead, we now have a private class that tracks the PreDispatch mode stack.

3.  We will still run CompositeImplicitAutograd decomps in this PR, and disable this logic later as a followup.

Some missing bits after this PR:
1. Preserving autograd ops in a functional form. Right now they still show up in the graph but in a "non-functional" way.
2. Turn off CompositeImplicitAutograd decomps
3. Functionalizing HOO

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113728
Approved by: https://github.com/bdhirsh
2023-12-19 20:28:35 +00:00
1474eb5f29 Fix jagged composite impl of flatten() (#115192)
Need to handle this in `NestedTensor.__torch_function__()` since it's CompositeImplicit
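A hedged sketch of the fixed call (the exact supported start/end dims are an assumption):

```python
import torch

nt = torch.nested.nested_tensor(
    [torch.randn(2, 3, 4), torch.randn(5, 3, 4)], layout=torch.jagged
)
flat = nt.flatten(-2, -1)  # flattens the trailing regular dims: (*, 3, 4) -> (*, 12)
```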
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115192
Approved by: https://github.com/cpuhrsch, https://github.com/soulitzer
2023-12-19 19:15:21 +00:00
cbc70e9b9c [caffe2] Add option for build_cpukernel_avx2 (#116008)
Summary: We would like a more flexible way to customize the build option for AVX2 instructions, to address other issues.

Test Plan: CI

Differential Revision: D52247916

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116008
Approved by: https://github.com/mattjgalloway
2023-12-19 18:49:52 +00:00
77d5f60740 [fsdp][torch.compile] FSDP changes (#115497)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115497
Approved by: https://github.com/albanD
2023-12-19 18:44:36 +00:00
e52983939c fix(conv_v8): optimize lru cache in conv v8 (#114110)
Fixes #108474

The main issue is due to GCC's dual ABI.

https://gcc.gnu.org/onlinedocs/libstdc++/manual/using_dual_abi.html
> requires lists to keep track of their size.

It seems that in GCC's old ABI, std::list::size is linear.

The other optimization is:
* use `splice` instead of erase-then-push, which saves some memory and time.

More perf benchmarks are coming...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114110
Approved by: https://github.com/eqy, https://github.com/atalman, https://github.com/malfet
2023-12-19 18:43:37 +00:00
d749b4a152 Implements permute_tensor in functional collectives (#115078)
Implementation of `permute_tensor`, as per @yifuwang's suggestion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115078
Approved by: https://github.com/wanchaol, https://github.com/yifuwang
2023-12-19 18:33:28 +00:00
71bedc3a69 [Inductor UT] fix unreachable code (#116094)
The test case test_uint4x2_mixed_mm has an indentation error. This PR makes the test code reachable.

test result:
```
pytest test_torchinductor.py -k test_uint4x2_mixed_mm -v
=========================================================================================== test session starts ===========================================================================================
platform linux -- Python 3.10.12, pytest-7.4.2, pluggy-1.3.0 -- /usr/bin/python
cachedir: .pytest_cache
hypothesis profile 'default' -> database=DirectoryBasedExampleDatabase('/workspace/pytorch/test/inductor/.hypothesis/examples')
rootdir: /workspace/pytorch
configfile: pytest.ini
plugins: shard-0.1.2, xdoctest-1.0.2, flakefinder-1.1.0, xdist-3.3.1, rerunfailures-12.0, hypothesis-5.35.1
collected 964 items / 962 deselected / 2 selected
Running 2 items in this shard: test/inductor/test_torchinductor.py::CpuTests::test_uint4x2_mixed_mm_cpu, test/inductor/test_torchinductor.py::CudaTests::test_uint4x2_mixed_mm_cuda

test_torchinductor.py::CpuTests::test_uint4x2_mixed_mm_cpu PASSED [2.2136s]                                                                                                                         [ 50%]
test_torchinductor.py::CudaTests::test_uint4x2_mixed_mm_cuda PASSED [1.9466s]                                                                                                                       [100%]

=================================================================================== 2 passed, 962 deselected in 15.70s ====================================================================================

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116094
Approved by: https://github.com/peterbell10
2023-12-19 17:14:25 +00:00
5ba87a31bc Unflake test_reference_numerics_large__refs_special_multigammaln_mvlgamma_p_1_cpu_bfloat16 (#116058)
Run the test under markDynamoStrict mode and record an expected failure
under the Dynamo CI shard.

Test Plan:
- wait for CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116058
Approved by: https://github.com/atalman
2023-12-19 16:42:29 +00:00
7b7f11f230 [dynamo] test number of guards when inputs are views (#115793)
After # 113734 landed (adding dynamic storage offsets), we found that compilation times increased significantly. The reason: tensors_definitely_do_not_overlap was doing comparisons on storage offsets which were adding guards

626b7dc847/torch/_functorch/_aot_autograd/input_output_analysis.py (L268-L276)

This guard is added for all pairs of tensors which are views of the same source tensor - i.e., the number of guards can be quadratic in the number of input tensors. This PR adds a test to prevent similar regressions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115793
Approved by: https://github.com/yanboliang
2023-12-19 16:09:29 +00:00
91e184fd74 Revert "Introduce 3 low-latency, intra-node allreduce algorithms for small messages to PyTorch (#114001)"
This reverts commit 4edc921857f39ba9510b6ab1c454149cfb2de157.

Reverted https://github.com/pytorch/pytorch/pull/114001 on behalf of https://github.com/jeanschmidt due to Breaking multiple internal tests, might be flakiness but multiple retries did not elicit an improvement, please check internal diff ([comment](https://github.com/pytorch/pytorch/pull/114001#issuecomment-1863036417))
2023-12-19 16:01:19 +00:00
b6d0d0819a Revert "[PT2] [Quant] Change the QConv2d Binary post op name from add to sum (#115329)"
This reverts commit 9ae0e6292944139ea598e7347c95ebd7df09e819.

Reverted https://github.com/pytorch/pytorch/pull/115329 on behalf of https://github.com/jeanschmidt due to Breaking internal builds, please check internal diff to get the list and logs, @jerryzh168 please support the author in order to get these changes merged and landed ([comment](https://github.com/pytorch/pytorch/pull/115329#issuecomment-1863021726))
2023-12-19 15:52:57 +00:00
c539f7df10 Revert "[Inductor] Deduplicate grid wrapper statements for user defined triton kernels (#115849)"
This reverts commit 21b8127f1c9f31c02145d906aae2db1ada703067.

Reverted https://github.com/pytorch/pytorch/pull/115849 on behalf of https://github.com/jeanschmidt due to Breaking internal tests, please check internal diff for more details ([comment](https://github.com/pytorch/pytorch/pull/115849#issuecomment-1863012933))
2023-12-19 15:47:55 +00:00
505a9e4854 add support for dynamic shapes in round (#115259)
Fixes #114310 and supersedes #114748.

There are two reasons why we have quite a few special cases for `round` (illustrated in the sketch after this list):

1. `round` is actually two ops. With `ndigits=None` (default), `round` always returns an integer. When `ndigits` is an integer, the returned type is a float.
2. Although `round` takes two arguments, it is a unary function with a parameter rather than a binary one.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115259
Approved by: https://github.com/peterbell10, https://github.com/lezcano
2023-12-19 15:45:50 +00:00
a7bfa04da6 Revert "More markDynamoStrictTest (#115870)"
This reverts commit 7f686c8fe127cc7db07134297fa09be20ab87918.

Reverted https://github.com/pytorch/pytorch/pull/115870 on behalf of https://github.com/jeanschmidt due to Breaking internal tests and builds, please check diff ([comment](https://github.com/pytorch/pytorch/pull/115870#issuecomment-1862997125))
2023-12-19 15:40:57 +00:00
24af118e55 Revert "markDynamoStrictTest more tests (#115871)"
This reverts commit 478f0e96dc2593db401903ac2ae053f8cd1e29ea.

Reverted https://github.com/pytorch/pytorch/pull/115871 on behalf of https://github.com/jeanschmidt due to Breaking internal tests and builds, please check diff, this is required to revert #115870 ([comment](https://github.com/pytorch/pytorch/pull/115871#issuecomment-1862992931))
2023-12-19 15:36:27 +00:00
5b6b680517 Revert "Adamw refactor (#115983)"
This reverts commit eafeba71c1ed35f8cf2d39016bf66c0b088e4a9f.

Reverted https://github.com/pytorch/pytorch/pull/115983 on behalf of https://github.com/jeanschmidt due to Breaking internal tests, @janeyx99 please help @tfsingh to have this PR landed ([comment](https://github.com/pytorch/pytorch/pull/115983#issuecomment-1862976954))
2023-12-19 15:26:44 +00:00
92998693a9 [inductor] Avoid bool being upcast to int (#109913)
Currently the inductor code for `x.any(-1)` does this strange dance:
```python
tmp0 = tl.load(in_ptr0 + (r1 + (128*x0)), rmask & xmask)
tmp1 = tmp0.to(tl.int64)
tmp2 = (tmp1 != 0)
```

This happens because `register_lowering` is doing type promotion with the
dimension argument, and so promotes to `int64` which we then cast back to bool.
A better fix would be to fix `register_lowering` but for now I just remove
the unnecessary type promotion from `aten.any`.

In the current code we also see:
```python
     tmp5 = tl.where(rmask & xmask, tmp3, 0)
```
which promotes the boolean value to int since `0` is an int32 in triton.
This fixes it to generate a boolean constant instead.

Finally there is also a triton bug where the `tl.load` itself upcasts to
`tl.int8`. I fix this by adding an explicit cast to `tl.int1`. The final
kernel code looks like:

```python
tmp0 = tl.load(in_ptr0 + (r1 + (128*x0)), rmask & xmask).to(tl.int1)
tmp1 = tl.broadcast_to(tmp0, [XBLOCK, RBLOCK])
tmp3 = tl.full([1, 1], 0, tl.int1)
tmp4 = tl.where(rmask & xmask, tmp1, tmp3)
tmp5 = triton_helpers.any(tmp4, 1)[:, None]

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109913
Approved by: https://github.com/lezcano
2023-12-19 14:16:10 +00:00
992c4e7b24 Actually run Dynamo tests in all Dynamo shards (#115962)
We weren't doing this before. Also adds some more skips so that CI
passes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115962
Approved by: https://github.com/voznesenskym
ghstack dependencies: #115925
2023-12-19 14:12:53 +00:00
0bd5a3fed7 [releng] Docker release Refactor Push nightly tags step. Move cuda and cudnn version to docker tag rather then name (#116097)
Follow up after : https://github.com/pytorch/pytorch/pull/116070

This PR does 2 things.

1. Refactor the "Push nightly tags" step; we don't need to extract CUDA_VERSION anymore. The new tag should be in this format: ``${PYTORCH_VERSION}-cuda$(CUDA_VERSION_SHORT)-cudnn$(CUDNN_VERSION)-runtime``
2. Move cuda$(CUDA_VERSION_SHORT)-cudnn$(CUDNN_VERSION) from the docker name to the tag

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116097
Approved by: https://github.com/jeanschmidt
2023-12-19 13:53:08 +00:00
a31effa15f Update device_mesh.py docs imports (#116074)
These are not importable from `torch.distributed`, at least today.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116074
Approved by: https://github.com/wz337, https://github.com/fegin
2023-12-19 09:44:55 +00:00
eqy
2a44034895 [CUDA] Include <thrust/swap.h> in LinearAlgebra.cu (#116072)
Fixes build against the latest `NVIDIA/cccl`.

CC @malfet @xwang233 @ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116072
Approved by: https://github.com/malfet, https://github.com/xwang233
2023-12-19 05:56:52 +00:00
327bdcdb14 Some tiny modification about torch.set/get_default_device (#116014)
1. Fix a bug in torch.set_default_device under multi-threading.
2. Add a new interface named torch.get_default_device (usage sketched below).
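A short usage sketch of the new interface alongside the existing setter:

```python
import torch

torch.set_default_device("cpu")
print(torch.get_default_device())  # device(type='cpu')

t = torch.empty(2)                 # allocated on the default device
print(t.device)
```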

Fixes #115333
Fixes #115917

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116014
Approved by: https://github.com/malfet, https://github.com/jansel
2023-12-19 05:08:06 +00:00
b48abbc020 [DeviceMesh] Fix DeviceMesh docstring (#116053)
1. remove outdated comments
2. fix examples in docstring

Doc after fix:
<img width="706" alt="image" src="https://github.com/pytorch/pytorch/assets/31293777/19f4f03c-0fd7-4e88-bca1-1a6ce693fbb7">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116053
Approved by: https://github.com/wanchaol
2023-12-19 04:05:49 +00:00
8b0122ad33 Add lowerings for reflection_pad{1, 3}d_backward (#115645)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115645
Approved by: https://github.com/lezcano, https://github.com/peterbell10
2023-12-19 04:05:10 +00:00
9dda4b20a0 [MPS] Enable select/[broad]cast ops for complex dtypes (#115727)
By representing `torch.cfloat`/`torch.chalf` as `float2`/`half2` metal types and modifying `SCATTER_OPS_TEMPLATE`/`GATHER_OPS_TEMPLATE` to accept a third argument, a fully specialized `cast` function, which is a no-op for regular types but special-cased for float->complex and complex->float.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115727
Approved by: https://github.com/kulinseth
2023-12-19 02:25:28 +00:00
cyy
1544c37520 [7/N] Fixes clang-tidy warnings in c10/{core,util}/*.h (#115495)
This PR continues to fix clang-tidy warnings for headers in c10/core and c10/util.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115495
Approved by: https://github.com/malfet
2023-12-19 02:14:30 +00:00
9b8f934068 Remove memory_format check for native_group_norm_backward (#115721)
To fix https://github.com/pytorch/pytorch/issues/115940.
Remove memory_format check for native_group_norm_backward.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115721
Approved by: https://github.com/mikaylagawarecki
2023-12-19 02:12:26 +00:00
01b979fc9a [Inductor] Fix constant folding and extern kernel mutation tracking bugs (#115908)
This PR fixes two bugs:
1) Constant folding a triton kernel results in the kernel's inputs being returned back without any modification. Disable constant folding for triton kernels; this needs more investigation.
2) NoneLayout buffers should not be deleted as they do not exist

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115908
Approved by: https://github.com/aakhundov, https://github.com/jansel
2023-12-19 02:06:50 +00:00
bb5a27052f [Dynamo][9/N] Make SkipFilesVariable wrap functions only (#115963)
Make ```SkipFilesVariable``` only handle function type, and route skipped classes to ```UserDefinedClassVariable```. The reasons behind this are:
* We'd like to remove ```is_allowed```, so the allowed/disallowed torch classes need a proper place to be handled. We can put them in either ```SkipFilesVariable``` or ```UserDefinedClassVariable``` under the current architecture, but it's confusing to have two places doing one thing.
   - Going forward, let's make ```SkipFilesVariable``` only handle functions; I'll probably rename it to ```SkippedFunctionVariable``` in the following PRs.
   - Let's dispatch by the value's type; all torch-class handling would go to ```UserDefinedClassVariable``` in the next PR.
* We'd like to merge the in_graph/skip/inline trace decision into the same API ```trace_rule.lookup```, so we probably have to limit the input to functions only, to better organize the ```VariableBuilder._wrap``` logic.
   - As a next step, I'll merge ```skipfiles.check``` into ```trace_rules.lookup``` and do the skipfile check before wrapping values into the correct variable tracker.
   - Though the ```TorchCtxManagerClassVariable``` is decided by ```trace_rules.lookup```, I'll refactor it out in the following PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115963
Approved by: https://github.com/jansel
2023-12-19 02:01:47 +00:00
47908a608f Revert "[ROCm] add hipblaslt support (#114329)"
This reverts commit b062ea38039234c80404a8f5f4d5a93c4cb9832d.

Reverted https://github.com/pytorch/pytorch/pull/114329 on behalf of https://github.com/jeanschmidt due to Reverting due to inconsistencies on internal diff ([comment](https://github.com/pytorch/pytorch/pull/114329#issuecomment-1861933267))
2023-12-19 01:04:58 +00:00
ed0c0c49ef Revert "[ROCm] fix nightly 5.6 build (#116029)"
This reverts commit 63e242b1e41759f9b24a0fbb997f157a06a9dd13.

Reverted https://github.com/pytorch/pytorch/pull/116029 on behalf of https://github.com/jeanschmidt due to Need to revert, in order to be able to revert #114329 ([comment](https://github.com/pytorch/pytorch/pull/116029#issuecomment-1861931736))
2023-12-19 01:01:42 +00:00
368a0c06d4 [releng] Docker Official release make sure cuda version is part of image name (#116070)
Follow up on https://github.com/pytorch/pytorch/pull/115949

Change docker build image name:
``pytorch:2.1.2-devel``-> ``2.1.2-cuda12.1-cudnn8-devel and 2.1.2-cuda11.8-cudnn8-devel``

Ref: https://github.com/orgs/pytorch/packages/container/package/pytorch-nightly

Naming will be same as in https://hub.docker.com/r/pytorch/pytorch/tags
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116070
Approved by: https://github.com/huydhn, https://github.com/seemethere
2023-12-19 00:58:15 +00:00
5894af83be Use dequantized weight and bias in conv2d quantized ops (#115615)
Summary:
Dequantize weight and bias for conv2d ops to improve performance. The weight and bias are usually small in size, hence they do not increase the memory footprint by much when dequantized.

With optimization cunet-enc ops:
vulkan.quantized_conv2d  {96, 72, 2}                      3753204
vulkan.quantized_conv2d  {96, 72, 2}                      6977048
vulkan.quantized_conv2d_dw{96, 72, 2}                      2499640
vulkan.quantized_conv2d_pw_2x2{96, 72, 2}                       842088
vulkan.quantized_conv2d  {48, 36, 4}                      2388152
vulkan.quantized_conv2d  {48, 36, 4}                      4775940
vulkan.quantized_conv2d_dw{48, 36, 4}                       709800
vulkan.quantized_conv2d_pw_2x2{48, 36, 4}                       483236
vulkan.quantized_conv2d  {24, 18, 8}                      2562144
vulkan.quantized_conv2d  {24, 18, 8}                      5447624
vulkan.quantized_conv2d_dw{24, 18, 8}                       392756
vulkan.quantized_conv2d_pw_2x2{24, 18, 8}                       509080

Without optimization:
vulkan.quantized_conv2d  {96, 72, 2}                      4291768
vulkan.quantized_conv2d  {96, 72, 2}                      7871344
vulkan.quantized_conv2d_dw{96, 72, 2}                      2658500
vulkan.quantized_conv2d_pw_2x2{96, 72, 2}                       891020
vulkan.quantized_conv2d  {48, 36, 4}                      2966860
vulkan.quantized_conv2d  {48, 36, 4}                      5661812
vulkan.quantized_conv2d_dw{48, 36, 4}                       816556
vulkan.quantized_conv2d_pw_2x2{48, 36, 4}                       528632
vulkan.quantized_conv2d  {24, 18, 8}                      3139604
vulkan.quantized_conv2d  {24, 18, 8}                      6202820
vulkan.quantized_conv2d_dw{24, 18, 8}                       452660
vulkan.quantized_conv2d_pw_2x2{24, 18, 8}                       557388

Test Plan:
Ensure all vulkan quantize tests pass:
buck2 run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 --show-output"
Running main() from third-party/googletest/1.11.0/googletest/googletest/src/gtest_main.cc
[==========] Running 78 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 78 tests from VulkanAPITest

...
[==========] 78 tests from 1 test suite ran. (1519 ms total)
[  PASSED  ] 78 tests.

buck2 run --target-platforms ovr_config//platform/macos:arm64-fbsource  //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 --show-output"

Running main() from third-party/googletest/1.11.0/googletest/googletest/src/gtest_main.cc
[==========] Running 395 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 395 tests from VulkanAPITest

...
[----------] 395 tests from VulkanAPITest (6515 ms total)

[----------] Global test environment tear-down
[==========] 395 tests from 1 test suite ran. (6515 ms total)
[  PASSED  ] 394 tests.
[  SKIPPED ] 1 test, listed below:
[  SKIPPED ] VulkanAPITest.querypool_flushed_shader_log

  YOU HAVE 5 DISABLED TESTS

Reviewed By: yipjustin

Differential Revision: D50997532

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115615
Approved by: https://github.com/manuelcandales, https://github.com/yipjustin
2023-12-19 00:23:52 +00:00
270ed13e87 [DTensor] Make DTensor from_local backward partial() to replicate() pass through (#115967)
Summary:
This change makes the backward pass of `DTensor.from_local()` treat the `Partial()`-to-`Replicate()` conversion as a pass-through, for the following reasons:
1. When we run the backward pass of DTensor.from_local, if the target placement is Partial() (i.e. from user manual overwrite code instead of torch_dispatch), we keep the grad as Replicate(). This is because converting the gradients back to `Partial()` is meaningless.
2. The current div logic leads to incorrect numerical values in the above case.

Test Plan:
**CI**:
CI Tests

**Unit test**:
`buck2 test mode/dev-nosan //caffe2/test/distributed/_tensor:redistribute`
- Passed

**With model training**:
```
# We tested the case where the input tensor is manually overwritten as Partial() and
# the output tensor is manually overwritten to Shard(), then converted to local.

# Before the change: numerical value not correct
Forward pass:
    collective: ReduceScatter
backward pass:
    collective: AllGather + div by process group size

# After the change: div is removed as expected.
Forward pass:
    collective: ReduceScatter
Backward pass:
    collective: AllGather
```

Differential Revision: D52175709

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115967
Approved by: https://github.com/wanchaol
2023-12-19 00:16:10 +00:00
3472a9200d expand subclass type tests in dynamo (#116024)
Following up on my own comments in https://github.com/pytorch/pytorch/pull/115323#pullrequestreview-1769491483.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116024
Approved by: https://github.com/mlazos
2023-12-19 00:08:55 +00:00
054f9548b4 [dynamo] Store CompilationEvents in a buffer in torch._dynamo.utils (#115788)
Motivation: it would be nice to be able to test using the metrics in log_compilation_event; currently it dumps logs (or logs to a database in fbcode), and these are hard to use in unit tests.

This change:
* always record the information in torch._dynamo.utils.record_compilation_metrics; there, log into a limited-size deque to prevent the list of metrics from growing unboundedly (sketched below)
* if config.log_compilation_metrics is set, then call back into the original log_compilation_event function
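A rough sketch of the buffering scheme, with stand-ins for the real torch._dynamo internals (the deque size, the config stub, and the print-based logger are illustrative assumptions):

```python
import collections

_metrics_buffer = collections.deque(maxlen=64)   # bounded buffer: oldest entries drop off

class config:                                     # stand-in for torch._dynamo.config
    log_compilation_metrics = True

def log_compilation_event(metrics):               # stand-in for the original logging path
    print("logged:", metrics)

def record_compilation_metrics(metrics):
    _metrics_buffer.append(metrics)               # always buffered, easy to assert on in tests
    if config.log_compilation_metrics:
        log_compilation_event(metrics)

record_compilation_metrics({"frame_key": "1", "co_name": "fn"})
```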

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115788
Approved by: https://github.com/yanboliang
2023-12-18 23:26:13 +00:00
fc58909bab Fix allowed dtypes for mem_eff attention (#116026)
# Summary

Fix a bug in detecting mem-efficient attention capability for CUDA devices older than sm80:
https://github.com/pytorch-labs/gpt-fast/issues/49

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116026
Approved by: https://github.com/janeyx99
2023-12-18 23:20:52 +00:00
6b120c6cf9 Update the sdpa benchmark to measure forward backward time in isolation (#115986)
# Summary

The benchmarks were getting a little stale and I think it makes more sense to measure in isolation now rather than E2E in a mha component.

This is a pre-req for getting the data for https://github.com/pytorch/pytorch/pull/115357

Output from run:
``` Shell
+------------+-----------+-----------+------------+-----------+-----------+----------------+--------------------+--------------------+
| batch_size | num_heads | q_seq_len | kv_seq_len | embed_dim | is_causal |     dtype      |    forward_time    |   backward_time    |
+------------+-----------+-----------+------------+-----------+-----------+----------------+--------------------+--------------------+
|     1      |    16     |    128    |    128     |   2048    |   True    | torch.bfloat16 | 23.86634959839284  | 66.21150835417211  |
|     1      |    16     |    128    |    128     |   2048    |   False   | torch.bfloat16 | 23.452017060481012 | 66.90612225793302  |
|     1      |    16     |    256    |    256     |   2048    |   True    | torch.bfloat16 | 24.478124547749758 |  76.4232068322599  |
|     1      |    16     |    256    |    256     |   2048    |   False   | torch.bfloat16 |  24.6928428998217  | 75.76151192188263  |
|     1      |    16     |    512    |    512     |   2048    |   True    | torch.bfloat16 | 28.69622849393636  | 114.73898496478796 |
|     1      |    16     |    512    |    512     |   2048    |   False   | torch.bfloat16 | 34.399422979913645 | 112.96746158041059 |
|     1      |    16     |   1024    |    1024    |   2048    |   True    | torch.bfloat16 |  65.4690912924707  | 216.26344555988908 |
|     1      |    16     |   1024    |    1024    |   2048    |   False   | torch.bfloat16 | 88.57532404363155  | 212.07790216431025 |
|     8      |    16     |    128    |    128     |   2048    |   True    | torch.bfloat16 | 11.582905380055308 | 70.09557797573505  |
|     8      |    16     |    128    |    128     |   2048    |   False   | torch.bfloat16 | 12.068384909071026 | 70.01491216942668  |
|     8      |    16     |    256    |    256     |   2048    |   True    | torch.bfloat16 | 31.671419646590945 | 203.54910241439939 |
|     8      |    16     |    256    |    256     |   2048    |   False   | torch.bfloat16 |  33.0585768679157  | 209.45609430782497 |
|     8      |    16     |    512    |    512     |   2048    |   True    | torch.bfloat16 | 87.43969700299202  | 469.8729298543185  |
|     8      |    16     |    512    |    512     |   2048    |   False   | torch.bfloat16 | 123.9265550393611  | 580.1084265112877  |
|     8      |    16     |   1024    |    1024    |   2048    |   True    | torch.bfloat16 | 561.1918237991632  | 1181.655174586922  |
|     8      |    16     |   1024    |    1024    |   2048    |   False   | torch.bfloat16 | 884.2707145959139  | 1662.4679416418073 |
+------------+-----------+-----------+------------+-----------+-----------+----------------+--------------------+--------------------+
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115986
Approved by: https://github.com/mikaylagawarecki
2023-12-18 22:40:47 +00:00
bf62511e07 Reshape decomposition for jagged layout NT (#115191)
No more segfault from using `reshape()` on jagged NT :)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115191
Approved by: https://github.com/cpuhrsch, https://github.com/soulitzer
2023-12-18 22:34:41 +00:00
63e242b1e4 [ROCm] fix nightly 5.6 build (#116029)
ROCm 5.6 nightly wheel build broken by #114329.  This fixes it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116029
Approved by: https://github.com/huydhn, https://github.com/jithunnair-amd, https://github.com/atalman
2023-12-18 22:12:30 +00:00
8452f41305 Adds allreduce to inductor remap (#115950)
Fixes #115728

Implements a rewrite path for allreduce

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115950
Approved by: https://github.com/wconstab
2023-12-18 22:00:22 +00:00
2a5659a797 add length assertion to PrepareModuleInput and PrepareModuleOutput (#115957)
## summary

`zip(inputs, self.input_layouts, self.desired_input_layouts)` is used in `_prepare_input_fn`; similar for `_prepare_output_fn`. Without an assertion, unmatched dimensions in inputs/outputs will be silently dropped, potentially causing unexpected behaviors.
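A generic illustration of the failure mode (the names are illustrative, not the TP API): zip() silently drops unmatched trailing items, which is exactly what the new length assertion guards against.

```python
inputs = ("a", "b", "c")
input_layouts = ("Shard(0)", "Replicate()")   # accidentally one layout short
# 'c' is silently dropped instead of raising an error:
print(list(zip(inputs, input_layouts)))       # [('a', 'Shard(0)'), ('b', 'Replicate()')]
```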

## test plan
`python test/distributed/tensor/parallel/test_tp_style.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115957
Approved by: https://github.com/wanchaol
2023-12-18 21:50:18 +00:00
a699b10339 [buck2][win] fix caffe2 protobuf_rule (#115954)
Summary:
c2_protobuf_rule ([here](https://fburl.com/code/iyiulpmv)) is broken on buck2, ultimately due to the following error:

> .\./caffe2.proto: File does not reside within any path specified using --proto_path (or -I).  You must specify a --proto_path which encompasses this file.  Note that the proto_path must be an exact prefix of the .proto file names -- protoc is too dumb to figure out when two paths (e.g. absolute and relative) are equivalent (it's harder than you think).

The root cause is differences in how buck1 and buck2 handle `%SRCDIR%` (absolute versus relative paths). This diff fixes the build.

Test Plan:
# Before

```
buck2 build arvr/mode/win/opt //xplat/caffe2:caffe2.pb.h
```

```
More details at https://www.internalfb.com/intern/buck/build/c6550454-ae6d-479e-9d08-016e544ef050
BUILD SUCCEEDED
```

```
Action failed: fbsource//xplat/caffe2:caffe2.pb.h (genrule)
Remote command returned non-zero exit code <no exit code>
Reproduce locally: frecli cas download-action 5df17cf64b7e2fc5ab090c91e1129f2f3cad36dc72c7c182ab052af23d3f32aa:145
stdout:
stderr:
OUTMISS: Missing outputs: buck-out/v2/gen/fbsource/dd87aacb8683145b/xplat/caffe2/caffe2.pb.h/out/caffe2.pb.h
```

# After

Buck1 still works

```
buck1 build arvr/mode/win/opt //xplat/caffe2:caffe2.pb.h
```

Buck2 works

```
buck2 build arvr/mode/win/opt //xplat/caffe2:caffe2.pb.h
```

```
Buck UI: https://www.internalfb.com/buck2/e5dae607-325a-4eab-b0c9-66fe4e9a6254
BUILD SUCCEEDED
```

Differential Revision: D52218365

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115954
Approved by: https://github.com/mcr229
2023-12-18 21:41:10 +00:00
2f7bb18def [Doc] Add padding size constraint in nn.ReflectionPad2d (#115995)
Fixes #115532

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115995
Approved by: https://github.com/mikaylagawarecki
2023-12-18 21:29:14 +00:00
1e272fb6d6 [export] Undo "module: export" labeling (#116042)
Delete the auto-labeling of "module: export" as this is not really used, and we want to delete the "module: export" label.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116042
Approved by: https://github.com/clee2000
2023-12-18 21:23:17 +00:00
c4748b425e Add main in dynamo/test_compile.py (#115941)
Need to verify that it uses dynamo's custom TestCase and run_tests instead of the general common_utils TestCase and run_tests; a minimal sketch follows.
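A minimal sketch of the guard being added, assuming dynamo's own harness in torch._dynamo.test_case rather than torch.testing._internal.common_utils:

```python
from torch._dynamo.test_case import run_tests

if __name__ == "__main__":
    run_tests()
```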

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115941
Approved by: https://github.com/msaroufim
2023-12-18 20:53:28 +00:00
a1a0b290d2 [tp] further fix the docs (#115974)
Some typos resulted in the note section not rendering properly; this wasn't visible
from the last PR directly, as the last PR only showed the first commit's
documentation :(

Also make the parallelize_module doc example more concrete

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115974
Approved by: https://github.com/wz337
2023-12-18 20:41:53 +00:00
8868c1cfae [sparse][ci] Add cuSPASRELt to CI (#115369)
Summary:

This PR adds in cuSPARSELt v0.4.07 into CI (12.1 and 11.8 CUDA) to run our cuSPARSELt specific tests.

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115369
Approved by: https://github.com/malfet
2023-12-18 20:33:30 +00:00
2b2ed52799 [xla hash update] update the pinned xla hash (#116003)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116003
Approved by: https://github.com/clee2000
2023-12-18 20:31:49 +00:00
7b6210e8a4 Use matrix generate script for docker release workflows (#115949)
Enable builds for both supported CUDA versions for the docker release, rather than building only one version.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115949
Approved by: https://github.com/huydhn
2023-12-18 20:20:59 +00:00
e30d436b01 [fx][split][testing] Add testing for #107981 (#108731)
- Follow-up to #107981, adding testing for metadata copying in placeholder nodes within the `split_by_tags` utility
- Validation included in the test from #107248, since both tests are relevant to the same aspect of the utility
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108731
Approved by: https://github.com/angelayi
2023-12-18 20:19:18 +00:00
bf20b56e9d Fix PyTorch build error on ppc64le (#115729)
The PyTorch build breaks when building from tip on ppc64le with the following error: pytorch/aten/src/ATen/native/quantized/cpu/kernels/QuantizedOpKernels.cpp:863:46: error: no matching function for call to 'at::vec::DEFAULT::Vectorized<c10::qint8>::dequantize(at::vec::DEFAULT::Vectorized&, at::vec::DEFAULT::Vectorized&)'

Issue reported #115165

This patch fixes the build issue.

Fixes #115165

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115729
Approved by: https://github.com/albanD
2023-12-18 19:00:56 +00:00
77366ba637 Increased hardcoded limit for number of GPUs. (#115368)
Fixes #115331.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115368
Approved by: https://github.com/albanD
2023-12-18 18:39:19 +00:00
80b1ecc308 Run eager adam optimizer in benchmarks where possible (#115445)
Runs eager Adam (instead of SGD) on all models that don't fail accuracy.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115445
Approved by: https://github.com/desertfire
2023-12-18 18:28:23 +00:00
8a445f7bd5 Serve multistream graph captures from correct pool (#114647)
This fixes #114320 by placing the logic for deciding whether an allocation goes to
a pool inside a callback that is controlled either by CUDAGraph.cpp or by the
Python-bound API that assigns a stream directly to a pool.
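
A hedged sketch (not from the PR) of the multi-stream capture scenario this addresses: an allocation made on a side stream during capture should now be served from the graph's private pool. The fork/join pattern below is the usual side-stream capture idiom; shapes are illustrative.

```python
import torch

x = torch.randn(1024, device="cuda")
side = torch.cuda.Stream()
g = torch.cuda.CUDAGraph()
torch.cuda.synchronize()

with torch.cuda.graph(g):
    side.wait_stream(torch.cuda.current_stream())   # fork the side stream into the capture
    with torch.cuda.stream(side):
        y = x * 2                                    # allocation happens on the side stream
    torch.cuda.current_stream().wait_stream(side)    # join back before capture ends
    z = y + 1

g.replay()
print(z[:3])
```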

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114647
Approved by: https://github.com/ngimel, https://github.com/eellison
2023-12-18 18:24:15 +00:00
3b70bd3970 Take 2 of "Add an option to log the source of the Triton kernels generated by torch._inductor (#115979)
Summary: This is useful for comparing the Triton kernels generated by two different invocations of torch.compile on the same model (e.g., checking whether serial compile and parallel compile generate identical Triton kernels).

Test Plan:
Unit test:
buck2 test mode/opt //caffe2/torch/fb/module_factory/sync_sgd/tests:test_torchdynamo_wrapper -- --print-passing-details >& ~/tmp/log.test
PyPer Mast job:
https://www.internalfb.com/mast/job/sw-951074659-OfflineTraining_87587a4e
See the *.py files generated in:
pyper_traces/tree/torchinductor_traces/sw-951074659-OfflineTraining_87587a4e/4623

Differential Revision: D52221500

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115979
Approved by: https://github.com/yanboliang
2023-12-18 18:16:44 +00:00
386776c49a [torch] Reduce memory usage by adding flags to clear intermediate graphs used for optimization during inference. (#115657)
Summary: During inference the intermediate graphs used for optimization are not needed, so with these two flags the Executor's graph is the only graph we keep around.

Test Plan:
The flags are all off by default.

baseline
```
buck run mode/opt-clang  sigrid/predictor/client/localnet:run_model -- --model_id_to_load=951679039 --model_snapshot_to_load=244 --torch_jit_do_not_store_optimized_graph=true
I1212 10:24:20.407408 401092 SigridPredictorLocalModelFactory.cpp:32] Memory usage for 951679039_244 is 182863 Kb
```
```
buck run mode/opt-clang  sigrid/predictor/client/localnet:run_model -- --model_id_to_load=951679039 --model_snapshot_to_load=244 --torch_jit_do_not_store_optimized_graph=true --torch_jit_release_profiling_graph_after_optimization=true
I1212 10:31:37.663487 464000 SigridPredictorLocalModelFactory.cpp:32] Memory usage for 951679039_244 is 186127 Kb
```
```
buck run mode/opt-clang  sigrid/predictor/client/localnet:run_model -- --model_id_to_load=951679039 --model_snapshot_to_load=244 --torch_jit_do_not_store_optimized_graph=true --torch_jit_release_profiling_graph_after_optimization=true --torch_jit_execution_plan_avoid_extra_graph_copy=true
I1212 10:29:42.848093 447218 SigridPredictorLocalModelFactory.cpp:32] Memory usage for 951679039_244 is 129451 Kb
```

Differential Revision: D52081631

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115657
Approved by: https://github.com/houseroad
2023-12-18 17:56:39 +00:00
dd367b7c8f check tensor subclass when using torch.compile + SAC (#115960)
As titled: when using SAC + torch.compile, we currently only check for
functional tensors, not for tensor subclasses, so SAC under torch.compile
would ignore tensor subclass types. Fixed in this PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115960
Approved by: https://github.com/bdhirsh
2023-12-18 17:49:06 +00:00
e43d33f4f7 [export] Support torch.sym* ops (#115854)
Fixes https://github.com/pytorch/pytorch/issues/108830 and https://github.com/pytorch/executorch/issues/1379#issuecomment-1853322866

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115854
Approved by: https://github.com/zhxchen17
2023-12-18 17:48:47 +00:00
647f14e70b [BE]: Enable clang-tidy check for readability-string-compare (#115994)
Adds a clang-tidy check to ensure string compare is not used unnecessarily in a way that is less efficient and less readable if an equality overload exists.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115994
Approved by: https://github.com/albanD
2023-12-18 16:13:00 +00:00
d7caef7996 [CI] Update clang-format (#116002)
To 17.0.6 build using https://github.com/pytorch/test-infra/blob/main/.github/workflows/clang-tidy-linux.yml

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116002
Approved by: https://github.com/suo
2023-12-18 14:58:46 +00:00
c285ca7916 [AOTInductor] Add updaing constant buffer to active buffer. (#116001)
Summary:
Refactor update inactive constant buffer to allow updating with active
buffer.

Test Plan:
Existing test to test inactive buffer updates.
UpdateConstantsCuda in cpp test for active buffer updates.

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116001
Approved by: https://github.com/chenyang78
2023-12-18 11:49:03 +00:00
34fe850d00 SymInt'ify sparse_compressed_tensor (#107903)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107903
Approved by: https://github.com/cpuhrsch
ghstack dependencies: #115586
2023-12-17 17:36:20 +00:00
419f2ca3e3 Fix a crash in sparse compressed tensor invariants check when nnz == 0 (#115825)
Fixes python crash example from https://github.com/pytorch/pytorch/issues/115755

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115825
Approved by: https://github.com/cpuhrsch
2023-12-17 17:36:15 +00:00
eafeba71c1 Adamw refactor (#115983)
Fixes #104899, refactors adamw by abstracting out common code in adam.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115983
Approved by: https://github.com/janeyx99
2023-12-17 06:58:39 +00:00
87ea6fb844 Make input contiguous for DTensor reduce scatter to fix the incorrect numerical values (#115847)
Summary:
This change makes the input tensor contiguous for DTensor reduce scatter in the case where no padding is needed.

There's no exception thrown during training, but we ran into a numerical correctness issue without the change.

Test Plan:
**CI**
CI test

**WHEN model test**:
- Verified the loss for each iteration is within the expected range.
- Verified NE is on par with this change on 4B training data.

Differential Revision: D52170822

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115847
Approved by: https://github.com/wanchaol
2023-12-17 01:35:09 +00:00
bc4115ffcf [Inductor][Observability] Change to log.debug to avoid excessively long logs (#115474)
Summary: Titled

Test Plan: CI

Differential Revision: D52003825

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115474
Approved by: https://github.com/jackiexu1992, https://github.com/yanboliang
2023-12-17 00:25:54 +00:00
4123cca859 [AARCH64] Fall back to GEMM if mkldnn_matmul fails (#115936)
- Add call to `at::globalContext().userEnabledMkldnn()` to `apply_mkldnn_matmul_heur`
- Surround calls to `mkldnn_matmul` with `try {} catch {}`
- Print warning and fall back to BLAS (by calling  `at::globalContext().setUserEnabledMkldnn()`) if `mkldnn_matmul()` fails

Test plan: On Linux arm run:
```shell
$ sudo chmod 400 /sys; python -c "import torch;m=torch.nn.Linear(1, 32);print(torch.__version__);print(m(torch.rand(32, 1)))"
Error in cpuinfo: failed to parse the list of possible processors in /sys/devices/system/cpu/possible
Error in cpuinfo: failed to parse the list of present processors in /sys/devices/system/cpu/present
Error in cpuinfo: failed to parse both lists of possible and present processors
2.3.0.dev20231215
bad err=11 in Xbyak::Error
bad err=11 in Xbyak::Error
/home/ubuntu/miniconda3/envs/py311/lib/python3.11/site-packages/torch/nn/modules/linear.py:116: UserWarning: mkldnn_matmul failed, switching to BLAS gemm:internal error (Triggered internally at /pytorch/aten/src/ATen/native/LinearAlgebra.cpp:1509.)
  return F.linear(input, self.weight, self.bias)
tensor([[-0.5183,  0.2279, -0.4035,  ..., -0.3446,  0.0938, -0.2113],
        [-0.5111,  0.2362, -0.3821,  ..., -0.3536,  0.1011, -0.2159],
        [-0.6387,  0.0894, -0.7619,  ..., -0.1939, -0.0282, -0.1344],
        ...,
        [-0.6352,  0.0934, -0.7516,  ..., -0.1983, -0.0247, -0.1366],
        [-0.4790,  0.2733, -0.2862,  ..., -0.3939,  0.1338, -0.2365],
        [-0.5702,  0.1682, -0.5580,  ..., -0.2796,  0.0412, -0.1782]],
       grad_fn=<AddmmBackward0>)
```
Fixes https://github.com/pytorch/pytorch/issues/114750

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115936
Approved by: https://github.com/lezcano
2023-12-16 21:37:56 +00:00
b06b02559e Support non grapharg and intermediary grad access (#115898)
Support for something we need for both FSDP and optimizers. For sourced args that are not inputs (params, etc.) we use the dynamic_getattr flow on tensors. This soundly handles the storage, registration, and guarding downstream of tensor_wrap for the grad values. For non-sourced args (true intermediates), we only support None (the idea being that if we have a true intermediate in the graph with a grad, we are already doing something weird).
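
A hedged sketch of the kind of pattern this enables: reading .grad of a sourced tensor (here a parameter captured from the enclosing scope) inside a torch.compile'd function.

```python
import torch

p = torch.nn.Parameter(torch.randn(4))
p.grad = torch.ones_like(p)

@torch.compile
def apply_grad(x):
    return x - 0.1 * p.grad   # .grad access on a non-input, sourced tensor

print(apply_grad(torch.zeros(4)))
```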

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115898
Approved by: https://github.com/bdhirsh
ghstack dependencies: #115315, #112184
2023-12-16 18:43:37 +00:00
c5dcb50c00 [easy] aten ops: support passing all args as kwargs, including self (#114920)
Summary:
This is important for writing aten IR based graph transformations.

```
In [4]: [x.name for x in torch.ops.aten.reshape.default._schema.arguments]
Out[4]: ['self', 'shape']

In [8]: torch.ops.aten.reshape.default(torch.rand(1,2), shape=[2])
Out[8]: tensor([0.7584, 0.4834])

# === CANNOT CALL `self` BY KWARGS ===

In [7]: torch.ops.aten.reshape.default(self=torch.rand(1,2), shape=[2])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[7], line 1
----> 1 torch.ops.aten.reshape.default(self=torch.rand(1,2), shape=[2])

TypeError: OpOverload.__call__() got multiple values for argument 'self'

```

# Where's the problem?

1. The aten ops' first arg is usually named `self` (aten/src/ATen/native/native_functions.yaml)
2. Unfortunately, in `torch._ops.{OpOverload, OpOverloadPacket}.__call__()`, the first arg is (by Python convention) named `self` too.

So when `self` is passed as a kwarg, `OpOverloadPacket.__call__` receives:

```
OpOverloadPacket.__call__(self, {"self": ...})
```

It is Python itself that does not allow the same argument to be supplied twice, and hence

> TypeError: OpOverload.__call__() got multiple values for argument 'self'

# How to fix?

**Note that**, in the above, `self` is an instance of `OpOverloadPacket`, and the "self" kwarg is the input tensor to the aten op. To fix, we only need to differentiate the two `self`s.

In Python, the first arg of a method does not need to be named `self`. So we change the `__call__` definition to:

```
def __call__(_self, ...):
```

Now the call becomes:

```
OpOverloadPacket.__call__(_self, {"self": ...})
```

where:
* `_self` is the instance to the `OpOverloadPacket`
* `"self"` is the input tensor to the aten op.

Test Plan:
```
In [4]: [x.name for x in torch.ops.aten.reshape.default._schema.arguments]
Out[4]: ['self', 'shape']

In [3]: torch.ops.aten.reshape.default(self=torch.rand(1,2), shape=[2])
Out[3]: tensor([0.5127, 0.3051])
```

Differential Revision: D51731996

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114920
Approved by: https://github.com/houseroad
2023-12-16 18:32:58 +00:00
88207b10ca Enable thp(transparent huge pages) for buffer sizes >=2MB (#107697)
The 2MB THP pages provide better allocation latencies compared to the standard 4KB pages. This change has shown substantial improvement for batch-mode use cases where tensor sizes are larger than 100MB.

Only enabled if THP_MEM_ALLOC_ENABLE environment variable is set.
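
A hedged usage sketch: the THP path is opt-in via the environment variable named above and only applies to CPU allocations of at least 2MB; the tensor size here is illustrative.

```python
import os
os.environ["THP_MEM_ALLOC_ENABLE"] = "1"   # set before torch allocates

import torch
x = torch.empty(256 * 1024 * 1024, dtype=torch.uint8)  # 256MB CPU buffer, eligible for 2MB pages
```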

Relanding https://github.com/pytorch/pytorch/pull/93888 with functionality disabled for Android

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107697
Approved by: https://github.com/malfet
2023-12-16 18:16:19 +00:00
622947afa8 [BE] Use nested namespace in ATen/native (#115938)
It's a C++17 feature that usually makes code a bit more compact, and should have no side-effects otherwise.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115938
Approved by: https://github.com/Skylion007
2023-12-16 06:07:40 +00:00
e3aefe2970 Revert "Initial Flash Attention support on ROCM (#114309)" (#115975)
This reverts commit 5bddbed399a89bf2875a38bb84cb869f382f1809.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115975
Approved by: https://github.com/atalman, https://github.com/malfet
2023-12-16 03:40:14 +00:00
8283491eff [TEST] Increase numerical tolerances in test_torchinductor_opinfo:test_comprehensive (#115768)
There are numerical mismatches that cause some `test_comprehensive` tests to fail. I propose to just increase tolerances a bit to make them pass.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115768
Approved by: https://github.com/jansel
2023-12-16 03:00:22 +00:00
49af19cd8e Skip some flaky Dynamo tests in test_linalg.py (#115925)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115925
Approved by: https://github.com/lezcano
2023-12-16 02:38:56 +00:00
2a2f2e454a [inductor] Fixed issue with true div on integer input with dyn shapes (#115920)
Related to https://github.com/pytorch/pytorch/issues/115742, `Cpu/CudaTests.test_div8`

Description:
- Fixed an issue with true division on integer inputs with dynamic shapes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115920
Approved by: https://github.com/peterbell10
2023-12-16 02:06:39 +00:00
d08905db7e Trigger a mergability check on ghstack prs (#115944)
Works to solve https://github.com/pytorch/test-infra/issues/4816

In conjunction with https://github.com/pytorch/test-infra/pull/4823, this PR should make it so that all ghstack PRs kick off a job which is a mergeability check.

Test plan: once https://github.com/pytorch/test-infra/pull/4823 is merged, I'll resubmit this diff to make sure the workflow job triggers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115944
Approved by: https://github.com/izaitsevfb, https://github.com/huydhn
2023-12-16 01:53:10 +00:00
14a6b24c8b [Dynamo][8/N] Wrap itertools.* as ItertoolsVariable (#115802)
This is part of a series changes before removing ```is_allowed```.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115802
Approved by: https://github.com/voznesenskym
2023-12-16 01:42:02 +00:00
056a882cb9 add markDynamoStrictTest to TestOptimRenewed, removing flakiness (#115947)
fixes #115406 fixes #115394 fixes #115393 fixes #115392 fixes #115391

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115947
Approved by: https://github.com/albanD, https://github.com/zou3519
2023-12-16 01:33:32 +00:00
0597eb56c2 Generate exhaustive compiled optimizer tests (#115906)
Generates tests for all permutations of arguments using the existing optimizer infos.
Covers capturable, cpu/gpu, single/multitensor and optimizer specific constants like rho/etas, etc.

[new test list](https://gist.github.com/mlazos/d3404383e7c3d490cbb51b7d6c750629)
[old test list](https://gist.github.com/mlazos/e0043aee1b6a0962d2f3ac8193aa62f8)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115906
Approved by: https://github.com/janeyx99
2023-12-16 00:42:43 +00:00
034e871710 [Dynamo] Look up variables from old frame, rather than copy variables to new frame; skip some copy to save time. (#115062)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115062
Approved by: https://github.com/williamwen42
2023-12-16 00:02:59 +00:00
94d28161fa Fix broken PyYAML 6.0 on MacOS x86 (#115956)
Maybe we should just get rid of x86 jobs, but that's for another day. This one should fix the broken build in trunk, i.e. https://github.com/pytorch/pytorch/actions/runs/7227220153/job/19694420117.

I guess that the failure looks flaky depending on the version of default python3 on the GitHub x86 runner.

The issue from PyYAML https://github.com/yaml/pyyaml/issues/601
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115956
Approved by: https://github.com/malfet
2023-12-15 23:17:05 +00:00
74dfdc567b [MPS] aten::erfinv bug fix: add storage offset buffers to handle slicing (#105801)
A bug fix of a recently merged PR per comment: https://github.com/pytorch/pytorch/pull/101507#discussion_r1271393706

The following test would fail without this bug fix:

```
import torch
def test_erfinv():
    for device in ['cpu', 'mps']:
        x = torch.tensor([0.1, 0.2, 0.3, 0.4, 0.5], device=device)
        y = x[2:].erfinv()

        x2 = torch.tensor([0.3, 0.4, 0.5], device=device)
        y2 = x2.erfinv()

        print(y)
        print(y2)

        torch.testing.assert_close(y, y2)
        print(f"{device} passes.")

test_erfinv()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105801
Approved by: https://github.com/malfet
2023-12-15 23:14:03 +00:00
d92d4133e7 [8/n] Update XNNPACK Submodule Version Part 8 Everything Remaining to get it to work (#115714)
> **__Note:__** The XNNPACK upgrade is very large, on the order of **40k** files and **10m** lines of code, thus we break the update of the library into multiple parts. All parts [1 - n] must be landed together for it to work. ***This also means that if there is a revert, please revert the entire stack.***

This change contains everything remaining that is required for the XNNPACK version update to work.

@allow-large-files

Differential Revision: [D52099769](https://our.internmc.facebook.com/intern/diff/D52099769/)

---
submodule
(unblock merge to make ShipIt happy)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115714
Approved by: https://github.com/digantdesai
2023-12-15 23:08:08 +00:00
2e517b20d9 [MPS] Add Conv3D support for MPS (#114183)
Fixes #77818

I saw that PR #99246 was approved, but no one fixed the rebase conflicts, so I am bringing this up again to be merged.
I am leveraging @mattiaspaul work. Quoting the description here:

> * this pull request enables 3D convolutions (forward/backward) for MPS (Apple Silicon) within the same Convolution.mm file as conv2d.
> * does not support channel_last (since pytorch doesn't implement channel_last for 3D tensors)
> * does not support conv3d_transpose and treats depth-separable convolutions not as normal case (there are no MPS kernels available for either of those so far)
> * requires MacOS >=13.2 (Ventura)

Please, let me know if there are any other changes needed and I'll be happy to implement them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114183
Approved by: https://github.com/malfet
2023-12-15 23:05:01 +00:00
9fcf6fb6fe [C10D] Add waitForDumpOrTimeout to log on dump abandonment (#115876)
Helps call attention to any cases where the dump actually times out.

The timeout is likely to hit if we run into slow stacktrace processing.

Log any exceptions encountered in the background thread, but don't raise
them- we're already willing to abandon the debug dump, and want to
proceed with our normal execution (in the case of dumppipe) or shutdown
process (when dumping happens on timeout and shutdown is already
initiated).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115876
Approved by: https://github.com/zdevito
ghstack dependencies: #115807
2023-12-15 22:13:06 +00:00
82e0d00da9 [c10d] Polish NCCL PG monitor thread log message (#115888)
We turned on monitor thread by default in https://github.com/pytorch/pytorch/pull/112518, and we want the error message that is displayed when the monitor kills the process to be more informative.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115888
Approved by: https://github.com/wconstab
2023-12-15 22:00:29 +00:00
1f3bdf40ad [export] Update schema version (#115712)
Since pytorch 2.1 release we've made some BC breaking changes to the serialized schema. We should update it in time for the 2.2 release. Some of the changes include:

* https://github.com/pytorch/pytorch/pull/114371 - custom class objects / pybinded objects are no longer saved directly to the `ExportedProgram` structure. Instead, the name is serialized inside of the program, and the actual bytes are stored in a separate location from the exported program, allowing them to be saved to a different location.
* https://github.com/pytorch/pytorch/pull/111204 - `GraphSignature` structure changed and `call_spec` is removed from the `GraphModule` schema
* https://github.com/pytorch/pytorch/pull/111407 - `loss_outout` -> `loss_output`
* https://github.com/pytorch/pytorch/pull/113075 - `example_inputs` removed from the `ExportedProgram` structure (this originally did not store anything), `dialect` added to the `ExportedProgram` structure.
* https://github.com/pytorch/pytorch/pull/113689 - tensor constants are now lifted as inputs to the graph, and their locations are stored in the `GraphSignature`
* https://github.com/pytorch/pytorch/pull/114172 - removed `equality_constraints` and added a `SymExprHint` for all symbolic expressions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115712
Approved by: https://github.com/gmagogsfm
2023-12-15 21:43:03 +00:00
715d663794 [inductor] split test_cpp_wrapper.py into cpu and cuda test files (#115479)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115479
Approved by: https://github.com/atalman
ghstack dependencies: #115167
2023-12-15 21:21:10 +00:00
50c9665f92 Revert "[export] Support torch.sym* ops (#115854)"
This reverts commit 347cb91946318eaedc350c2c3cda659d1cbde931.

Reverted https://github.com/pytorch/pytorch/pull/115854 on behalf of https://github.com/atalman due to OSSCI oncall, broke multple jobs ([comment](https://github.com/pytorch/pytorch/pull/115854#issuecomment-1858486796))
2023-12-15 21:07:52 +00:00
80a9625d9f Revert "non-strict export with dynamic shapes (#115862)"
This reverts commit 1bb0d0fc1f1da750206fad45f32e9564f0edd1f4.

Reverted https://github.com/pytorch/pytorch/pull/115862 on behalf of https://github.com/atalman due to OSSCI oncall, failing trunk / macos-12-py3-arm64 / test ([comment](https://github.com/pytorch/pytorch/pull/115862#issuecomment-1858482486))
2023-12-15 21:04:12 +00:00
1bb0d0fc1f non-strict export with dynamic shapes (#115862)
Differential Revision: [D52175048](https://our.internmc.facebook.com/intern/diff/D52175048/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115862
Approved by: https://github.com/zhxchen17
2023-12-15 20:11:30 +00:00
347cb91946 [export] Support torch.sym* ops (#115854)
Fixes https://github.com/pytorch/pytorch/issues/108830 and https://github.com/pytorch/executorch/issues/1379#issuecomment-1853322866

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115854
Approved by: https://github.com/zhxchen17
2023-12-15 20:08:04 +00:00
6c2103bdf7 Fixed some failing inductor tests with exact_dtype=True (#115828)
Addresses point 1 from #115742: fixing  CPUReproTest.test_embedding_vec_bf16

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115828
Approved by: https://github.com/peterbell10
2023-12-15 20:02:19 +00:00
91b848bf81 Revert "markDynamoStrictTest on more tests (#115879)"
This reverts commit 8b650cdd3cdd1174b399f312ec2f7955551a2f5d.

Reverted https://github.com/pytorch/pytorch/pull/115879 on behalf of https://github.com/atalman due to OSSCI oncall, broke inductor ([comment](https://github.com/pytorch/pytorch/pull/115879#issuecomment-1858418921))
2023-12-15 20:00:09 +00:00
c006c8b50e Revert "markDynamoStrictTest some more (#115885)"
This reverts commit 55ce4693ff2c0b6e50b8af323f36ecc7ff929638.

Reverted https://github.com/pytorch/pytorch/pull/115885 on behalf of https://github.com/atalman due to OSSCI oncall, broke inductor ([comment](https://github.com/pytorch/pytorch/pull/115885#issuecomment-1858409669))
2023-12-15 19:51:24 +00:00
61abacf829 [tp] improve documentation (#115880)
Improve the TP documentation in terms of format and descriptions

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115880
Approved by: https://github.com/XilunWu
2023-12-15 18:44:22 +00:00
d5115bfb06 Revert "[AOTI][refactor][3/n] Declare python_kernel_name and cpp_kernel_name in ExternKernel (#115831)"
This reverts commit 287a86567731ff4d87f71dcd285d0ab4253cfceb.

Reverted https://github.com/pytorch/pytorch/pull/115831 on behalf of https://github.com/desertfire due to rocm CI failure ([comment](https://github.com/pytorch/pytorch/pull/115831#issuecomment-1858322270))
2023-12-15 18:34:55 +00:00
72eab5aa43 Configures distributed_checkpoint label (#115833)
Configures the existing `module: distributed_checkpoint` label
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115833
Approved by: https://github.com/wconstab, https://github.com/wz337
2023-12-15 18:17:25 +00:00
1b506e7469 Revert "non-strict export with dynamic shapes (#115862)"
This reverts commit f54bb1ed566f27affff9fdbd5c1ceee854ef2de5.

Reverted https://github.com/pytorch/pytorch/pull/115862 on behalf of https://github.com/atalman due to OSSCI oncall, failing trunk / macos-12-py3-arm64 / test ([comment](https://github.com/pytorch/pytorch/pull/115862#issuecomment-1858197497))
2023-12-15 17:03:42 +00:00
7ed2bc7c67 [GHF] Do not block reverts with internal changes (#115903)
As the check is more often than not unreliable, it's better to just post a
warning and let the revert proceed.

Fixes https://github.com/pytorch/test-infra/issues/4797

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115903
Approved by: https://github.com/clee2000, https://github.com/atalman
2023-12-15 17:00:07 +00:00
f54bb1ed56 non-strict export with dynamic shapes (#115862)
Differential Revision: [D52175048](https://our.internmc.facebook.com/intern/diff/D52175048/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115862
Approved by: https://github.com/zhxchen17
2023-12-15 16:38:45 +00:00
b062ea3803 [ROCm] add hipblaslt support (#114329)
Disabled by default. Enable with env var DISABLE_ADDMM_HIP_LT=0. Tested on both ROCm 5.7 and 6.0.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114329
Approved by: https://github.com/malfet
2023-12-15 15:36:46 +00:00
287a865677 [AOTI][refactor][3/n] Declare python_kernel_name and cpp_kernel_name in ExternKernel (#115831)
Summary: Both ExternKernelAlloc and ExternKernelOut need the two fields, so declaring them in the base class. Also add cpp codegen for IndexPutFallback and InplaceBernoulliFallback in this PR.

Differential Revision: [D52189999](https://our.internmc.facebook.com/intern/diff/D52189999)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115831
Approved by: https://github.com/chenyang78
2023-12-15 14:40:44 +00:00
66994bca5f Revert "[inductor] split test_cpp_wrapper.py into cpu and cuda test files (#115479)"
This reverts commit 653acd8fe1d0a7b4a084a47ee022f163015fee64.

Reverted https://github.com/pytorch/pytorch/pull/115479 on behalf of https://github.com/desertfire due to will cause land race in fbcode because https://github.com/pytorch/pytorch/pull/115831 is already landed internally ([comment](https://github.com/pytorch/pytorch/pull/115479#issuecomment-1857979948))
2023-12-15 14:35:40 +00:00
55ce4693ff markDynamoStrictTest some more (#115885)
Featuring
test_native_mha.py
test_nn.py
test_prims.py
test_schema_check.py
test_serialization.py
test_show_pickle.py
test_sort_and_select.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115885
Approved by: https://github.com/voznesenskym
ghstack dependencies: #115845, #115855, #115856, #115857, #115858, #115870, #115871, #115879
2023-12-15 13:19:52 +00:00
8b650cdd3c markDynamoStrictTest on more tests (#115879)
Featuring:
test_mobile_optimizer.py
test_module_init.py
test_modules.py
test_multiprocessing.py
test_multiprocessing_spawn.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115879
Approved by: https://github.com/voznesenskym
ghstack dependencies: #115845, #115855, #115856, #115857, #115858, #115870, #115871
2023-12-15 13:19:52 +00:00
2d43e31aa9 Fix wrong behavior of is_alias_of and c10d::reducer on MTIA (#115553)
Reviewed By: kirteshpatil

Differential Revision: D51860023

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115553
Approved by: https://github.com/fduwjj
2023-12-15 11:14:41 +00:00
4ea7430ffb [BE] Don't copy CuDNN libs twice (#115872)
- It was installed twice: once in the `/usr/local/cuda/lib64` folder and a second time in `/usr/lib64`
- And don't install CuDNN headers thrice, only in `/usr/local/cuda/include`
- Error on unknown CUDA version
- Modify bazel builds to look for cudnn in `/usr/local/cuda` folder
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115872
Approved by: https://github.com/huydhn
2023-12-15 09:47:14 +00:00
b4d6443bcf [Dynamo] Log innermost user frame filename & lineno for better error aggregation (#115899)
CompilationMetrics example:
```
frame_key='1',
co_name='fn',
co_filename='/data/users/ybliang/debug/debug1.py',
co_firstlineno=58,
cache_size=0,
accumulated_cache_size=0,
guard_count=None,
graph_op_count=None,
graph_node_count=None,
graph_input_count=None,
entire_frame_compile_time_s=None,
backend_compile_time_s=None,
fail_type="<class 'torch._dynamo.exc.Unsupported'>",
fail_reason='custome dict init with args/kwargs unimplemented',
fail_user_frame_filename='/data/users/ybliang/debug/debug1.py',
fail_user_frame_lineno=61
```
where:
* ```fail_type``` and ```fail_reason``` are exceptions inside of Dynamo.
* ```fail_user_frame_filename``` and ```fail_user_frame_lineno``` are where the original user code triggered the exception.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115899
Approved by: https://github.com/davidberard98, https://github.com/ydwu4
2023-12-15 08:24:55 +00:00
4edc921857 Introduce 3 low-latency, intra-node allreduce algorithms for small messages to PyTorch (#114001)
## Summary
This PR added 3 intra-node GPU allreduce algorithms to PyTorch:
- One-shot allreduce (inspired by FasterTransformer): all ranks simultaneously read and accumulate data from other ranks.
- Two-shot allreduce (inspired by FasterTransformer): all ranks simultaneously read and accumulate `1 / world_size` of the data from other ranks. Then all ranks read the accumulated data from other ranks (effectively one-shot reduce-scatter + one-shot all-gather).
- Hybrid cube mesh allreduce (original): a one-shot allreduce variant that avoids transmission over PCIe on HCM topology.

## Micro Benchmarks
![image](https://github.com/pytorch/pytorch/assets/4156752/7bd25ffc-cd5b-4acb-bd65-b01bc136726e)

![image](https://github.com/pytorch/pytorch/assets/4156752/3ced31b4-6c31-4f34-a2d8-c072df29ae0e)

![image](https://github.com/pytorch/pytorch/assets/4156752/5b942c05-4fcc-4ec9-ae29-12c64080bb1c)

## Details
The intra-node algos are organized behind `c10d::IntraNodeComm`, which is responsible for:
- Managing handshaking and cuda IPC handle exchange among ranks.
- Querying NVLink connection and detecting topology.
- Performing algo selection based on available info.
- Launching the selected allreduce kernel.

`c10d::IntraNodeComm` is integrated into `c10d::ProcessGroupNCCL` as follows:
- When the `ENABLE_INTRA_NODE_COMM` environment variable is set, `c10d::ProcessGroupNCCL` initializes a `c10d::IntraNodeComm` for its ranks (see the usage sketch at the end of this description).
  - If the setup is not suitable for intra-node comm (e.g. not all ranks are from the same node), the rendezvous logic guarantees all participants fall back consistently.
- `c10d::ProcessGroupNCCL::allreduce` consults `c10d::IntraNodeComm` on whether to use intra-node allreduce and carries out the communication accordingly.

We currently detect two types of topologies from the NVLink connection mesh:
- Fully connected: all GPU pairs have a direct NVLink connection (e.g. NVSwitch or a fully connected subset of a hybrid cube mesh)
  - `msg <= 256KB`: one-shot allreduce.
  - `256KB < msg <= 10MB`: two-shot allreduce.
  -  `msg > 10MB`: instructs the caller to fallback to NCCL.
- Hybrid cube mesh
  - `msg <= 256KB`: one-shot allreduce.
  - `msg > 256KB`: instructs the caller to fallback to NCCL.

## Next Steps
- Fine tune algo selection based on GPU model, topology, link speed.
- Potentially optimize the two-shot allreduce impl. According to FasterTransformer, two-shot allreduce is preferred up to 50MB. There might be room for improvement, but PyTorch does impose more constraints:
  - FasterTransformer uses a single process to drive multiple devices. It can use `cudaDeviceEnablePeerAccess` to enable device-level peer access.
  - PyTorch uses multiple processes to drive multiple devices. With cuda IPC, a device can only share a specific region with other devices. This means extra copies may be unavoidable.
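
A hedged usage sketch (not part of the PR): opt into the intra-node path via the `ENABLE_INTRA_NODE_COMM` environment variable described above; a small message on a fully NVLink-connected node should take the one-shot path, with transparent fallback to NCCL otherwise. Launch with torchrun so RANK/WORLD_SIZE are set.

```python
import os
os.environ["ENABLE_INTRA_NODE_COMM"] = "1"   # must be set before process group init

import torch
import torch.distributed as dist

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

x = torch.ones(1024, device="cuda")   # 4KB message, well under the 256KB one-shot cutoff
dist.all_reduce(x)
print(x[0].item())                    # == world size
dist.destroy_process_group()
```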

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114001
Approved by: https://github.com/yf225
2023-12-15 08:17:35 +00:00
cd47e335d1 [TEST] Skip test_schema_correctness for float8 dtype (#115757)
According to https://github.com/pytorch/pytorch/issues/107256#issuecomment-1705341870, the ops tested in `test_schema_correctness` are not supported with `torch.float8_e4m3fn` yet. Until they are supported, it is best to skip the test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115757
Approved by: https://github.com/drisspg
2023-12-15 06:26:46 +00:00
c1c9b739e2 Back out "[aotinductor] replace lld with the default ld linker (#115478)" (#115875)
Summary:
Back out the diff

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115875
Approved by: https://github.com/chenyang78
2023-12-15 05:56:06 +00:00
478f0e96dc markDynamoStrictTest more tests (#115871)
For:
test_dispatch.py
test_fake_tensor.py
test_indexing.py
test_linalg.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115871
Approved by: https://github.com/voznesenskym
ghstack dependencies: #115845, #115855, #115856, #115857, #115858, #115870
2023-12-15 05:26:54 +00:00
7f686c8fe1 More markDynamoStrictTest (#115870)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115870
Approved by: https://github.com/voznesenskym
ghstack dependencies: #115845, #115855, #115856, #115857, #115858
2023-12-15 05:26:54 +00:00
9ae0e62929 [PT2] [Quant] Change the QConv2d Binary post op name from add to sum (#115329)
**Summary**
Change the QConv2d Binary fusion post op name from `add` to `sum`, since we are actually using OneDNN `post op sum` instead of `Binary_Add` for now.

**TestPlan**
```
python -m pytest test_quantized_op.py -k test_qconv2d_sum_pt2e
python -m pytest test_quantized_op.py -k test_qconv2d_sum_relu_pt2e
python -m pytest test_quantized_op.py -k test_qconv2d_sum_relu_float_output_pt2e
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115329
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
2023-12-15 05:10:47 +00:00
653acd8fe1 [inductor] split test_cpp_wrapper.py into cpu and cuda test files (#115479)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115479
Approved by: https://github.com/atalman
ghstack dependencies: #115167
2023-12-15 04:04:08 +00:00
9056903b09 [CUDA] 64-bit indexing for avg_pool_backward (#114193)
Fixes #113833

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114193
Approved by: https://github.com/malfet
2023-12-15 03:58:46 +00:00
8e2d63cbc3 [export][reland] Remove runtime assertion pass (#115597)
Summary:
Reland of https://github.com/pytorch/pytorch/pull/115196
D52054112 to fix internal failures.

Test Plan: CI

Differential Revision: D52054110

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115597
Approved by: https://github.com/ydwu4, https://github.com/zhxchen17
2023-12-15 03:22:03 +00:00
7d4ccd7b9e [AOTI][refactor][2/n] Rename kernel to python_kernel_name (#115766)
Differential Revision: [D52164940](https://our.internmc.facebook.com/intern/diff/D52164940)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115766
Approved by: https://github.com/chenyang78
ghstack dependencies: #115783
2023-12-15 03:08:13 +00:00
8e1cff96e3 [C10D] Log PG size in init log (#115807)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115807
Approved by: https://github.com/XilunWu
2023-12-15 02:38:54 +00:00
5989e1222d [BE] Set torch.cuda.has_half to True (#115884)
This check was introduced by https://github.com/pytorch/pytorch/pull/5417 and then turned into a tautology by https://github.com/pytorch/pytorch/pull/10147

So I guess it's time to let go of all that dynamic initialization (and maybe just delete it in 2.3?)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115884
Approved by: https://github.com/kit1980
2023-12-15 02:30:55 +00:00
a8e354a9a0 [sparse][semi-structured] enable fp32 support, separate sparse and dense constraints (#115550)
Summary:

Both cuSPARSELt and CUTLASS support 1:2 semi-structured sparsity for
fp32, which this PR enables (thanks @alexsamardzic).

Furthermore, this PR also updates the sparse_config to take into account
the different shape constraints for sparse and dense matrices.

Technically, cuSPARSELt supports smaller sparse matrix constraints, as it
seems to pad to the CUTLASS constraints under the hood. However, in
practice small sparse matrices are not commonly used and we care more
about the dense constraints for LLM inference.

For now, we keep the CUTLASS constraints in place for both cuSPARSELt
and CUTLASS tensors

This PR also reconnects the _FUSE_TRANSPOSE flag for cuSPARSELt tensors.
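
A hedged sketch of the fp32 path enabled here (1:2 semi-structured sparsity); the mask pattern and shapes are illustrative and must satisfy the shape constraints discussed above.

```python
import torch
from torch.sparse import to_sparse_semi_structured

A = torch.zeros(128, 128, dtype=torch.float32, device="cuda")
A[:, ::2] = 1.0                              # one nonzero per pair of elements -> 1:2 pattern
A_sparse = to_sparse_semi_structured(A)

B = torch.randn(128, 256, dtype=torch.float32, device="cuda")
out = torch.mm(A_sparse, B)                  # sparse x dense matmul via the structured-sparse kernels
print(out.shape)
```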

Test Plan:
```
python test/test_sparse_semi_structured.py
```

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115550
Approved by: https://github.com/cpuhrsch
2023-12-15 02:28:17 +00:00
6d5fe07659 Fix numpy warning when importing torch without numpy installed (#115867)
Fixes #115638

I verified locally that with no numpy installed the warning no longer occurs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115867
Approved by: https://github.com/soulitzer
2023-12-15 02:22:12 +00:00
9e84d0fa60 [MPS] Fix opposite error message in empty_mps (#115746)
Fixes #115625
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115746
Approved by: https://github.com/mikaylagawarecki
2023-12-15 01:31:40 +00:00
85262b0a9e markDynamoStrictTest some test_cpp_extensions.* (#115858)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115858
Approved by: https://github.com/voznesenskym
ghstack dependencies: #115845, #115855, #115856, #115857
2023-12-15 01:22:38 +00:00
8ddca5aeae markDynamoStrictTest some more tests (#115857)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115857
Approved by: https://github.com/voznesenskym
ghstack dependencies: #115845, #115855, #115856
2023-12-15 01:22:38 +00:00
3477a2ee03 unMarkDynamoStrictTest on OpInfo-based tests (#115856)
These take too long to run under strict mode. We'll worry about them
later. Note that these decorators don't do anything yet (unless we flip
the default from non-strict to strict).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115856
Approved by: https://github.com/voznesenskym
ghstack dependencies: #115845, #115855
2023-12-15 01:22:31 +00:00
0722ce35f5 Increase number of Dynamo shards from 2->7 (#115855)
In preparation for ~3x increased test time coming in the upcoming PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115855
Approved by: https://github.com/voznesenskym
ghstack dependencies: #115845
2023-12-15 01:22:24 +00:00
4ccd8eb613 Add Dynamo test expected failure mechanism (#115845)
Tests that are added to a list in dynamo_test_failures.py will
automatically be marked as expectedFailure when run with
PYTORCH_TEST_WITH_DYNAMO=1. I'm splitting this PR off on its own so that
I can test various things on top of it.

Also added an unMarkDynamoStrictTest that is not useful until we turn
on strict mode by default.

Test Plan:
- code reading
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115845
Approved by: https://github.com/voznesenskym
2023-12-15 01:22:17 +00:00
5477120ebf [executorch] Update iOS toolchain with a modern cmake syntax. (#115799)
Summary: Replace exec_program with execute_process

Test Plan: CI

Differential Revision: D52147108

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115799
Approved by: https://github.com/huydhn
2023-12-15 00:51:30 +00:00
f90a5f891b [AOTI][refactor][1/n] Rename cpp_kernel to cpp_kernel_name (#115783)
Differential Revision: [D52142184](https://our.internmc.facebook.com/intern/diff/D52142184)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115783
Approved by: https://github.com/chenyang78, https://github.com/jansel
2023-12-15 00:50:17 +00:00
1b8599283f Optimize quantized max pool 2d (#115690)
Summary:
We do not need to dequantize and quantize again for this op.

With this optimization, cunet-enc `vulkan.quantized_max_pool2d_quint8` ops improve as follows:

| Op shape | With optimization | Without optimization |
| --- | --- | --- |
| {48, 36, 2} | 207532 | 234416 |
| {24, 18, 4} | 78832 | 94380 |
| {12, 9, 8} | 49296 | 58760 |

Test Plan:
Ensure all vulkan quantize tests pass:
buck2 run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 --show-output
Running main() from third-party/googletest/1.11.0/googletest/googletest/src/gtest_main.cc
[==========] Running 78 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 78 tests from VulkanAPITest

...
[==========] 78 tests from 1 test suite ran. (1519 ms total)
[  PASSED  ] 78 tests.

buck2 run --target-platforms ovr_config//platform/macos:arm64-fbsource  //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 --show-output"

Running main() from third-party/googletest/1.11.0/googletest/googletest/src/gtest_main.cc
[==========] Running 395 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 395 tests from VulkanAPITest
...
[  SKIPPED ] VulkanAPITest.querypool_flushed_shader_log (0 ms)
[----------] 395 tests from VulkanAPITest (6515 ms total)

[----------] Global test environment tear-down
[==========] 395 tests from 1 test suite ran. (6515 ms total)
[  PASSED  ] 394 tests.
[  SKIPPED ] 1 test, listed below:
[  SKIPPED ] VulkanAPITest.querypool_flushed_shader_log

  YOU HAVE 5 DISABLED TESTS

Reviewed By: yipjustin, copyrightly

Differential Revision: D50998619

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115690
Approved by: https://github.com/SS-JIA
2023-12-15 00:45:37 +00:00
6fee208064 Handle -1 in jagged layout NT view ops (#115843)
Allows for inheriting the ragged and batch dims via -1:
```python
nt.view(-1, -1, D)
nt.expand(B, -1, D)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115843
Approved by: https://github.com/soulitzer
ghstack dependencies: #115636
2023-12-15 00:42:47 +00:00
c947ed1135 [BE][ROCm] Use modern C++ (#115844)
This removes global (but ROCM_ONLY) `over_arch` and `gcn_arch_override_flag` variables in favor of block level static initialization introduced in C++11

To quote from [ISO/IEC 14882-2014](https://www.open-std.org/jtc1/sc22/wg21/docs/standards)
>The zero-initialization (8.5) of all block-scope variables with static storage duration (3.7.1) or thread storage
> duration (3.7.2) is performed before any other initialization takes place. Constant initialization (3.6.2) of a
> block-scope entity with static storage duration, if applicable, is performed before its block is first entered.
> An implementation is permitted to perform early initialization of other block-scope variables with static or
> thread storage duration under the same conditions that an implementation is permitted to statically initialize
> a variable with static or thread storage duration in namespace scope (3.6.2). Otherwise such a variable is
> initialized the first time control passes through its declaration; such a variable is considered initialized upon
> the completion of its initialization. If the initialization exits by throwing an exception, the initialization
> is not complete, so it will be tried again the next time control enters the declaration. If control enters
> the declaration concurrently while the variable is being initialized, the concurrent execution shall wait for
> completion of the initialization.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115844
Approved by: https://github.com/huydhn
2023-12-15 00:38:43 +00:00
7e6ec8d3db [ONNX] Add proper iobinding synchronize for ONNX cuda bench (#115773)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115773
Approved by: https://github.com/thiagocrepaldi
ghstack dependencies: #115670, #115673
2023-12-15 00:37:32 +00:00
823523acc0 [ONNX] Dump sarif diagnostics for failed onnx exports in benchmark (#115673)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115673
Approved by: https://github.com/thiagocrepaldi
ghstack dependencies: #115670
2023-12-15 00:37:32 +00:00
0959e67de3 [ONNX] Set correct cuda.current_device for multi-device onnx performance bench (#115670)
Otherwise `torch.cuda.synchronize()` works on a different device from the one that
runs the PyTorch model, which leads to incorrect performance numbers.
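
A rough usage sketch of the intended behavior (the device index is illustrative): make the model's device current and synchronize that specific device.
```python
import torch

model_device = torch.device("cuda:1")  # assumed device running the benchmarked model
torch.cuda.set_device(model_device)    # subsequent CUDA calls target this device
# ... run the model ...
torch.cuda.synchronize(model_device)   # wait for work on the model's device, not the default one
```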

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115670
Approved by: https://github.com/thiagocrepaldi
2023-12-15 00:37:32 +00:00
59f7355f86 Revert "[ROCm] add hipblaslt support (#114329)"
This reverts commit bb2bb8cca1c00e3f6e7025a62688d0cfcbfee144.

Reverted https://github.com/pytorch/pytorch/pull/114329 on behalf of https://github.com/atalman due to OSSCI oncall, trunk  tests are failing ([comment](https://github.com/pytorch/pytorch/pull/114329#issuecomment-1857003155))
2023-12-14 23:53:30 +00:00
66b04e3cb7 [nccl flight recorder] nullptr profiling name (#115851)
Sometimes the profiling name can be a nullptr, which
throws on conversion to std::string. This adds a check.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115851
Approved by: https://github.com/wconstab
2023-12-14 23:40:54 +00:00
21b8127f1c [Inductor] Deduplicate grid wrapper statements for user defined triton kernels (#115849)
Noticed that on many MRS kernels the grid wrapper for autotuning is huge, with a bunch of duplicates, because num_warps and num_stages are not needed for grid calculation. Let's deduplicate these entries.

Previously, we would see a wrapper like
```
    def grid_wrapper_for_add_kernel_2d_autotuned_0(meta):
        if meta['BLOCK_SIZE_X'] == 128 and meta['BLOCK_SIZE_Y'] == 128: return (4, 2, 1)
        if meta['BLOCK_SIZE_X'] == 128 and meta['BLOCK_SIZE_Y'] == 128: return (4, 2, 1)
        if meta['BLOCK_SIZE_X'] == 64 and meta['BLOCK_SIZE_Y'] == 64: return (8, 4, 1)
        if meta['BLOCK_SIZE_X'] == 64 and meta['BLOCK_SIZE_Y'] == 64: return (8, 4, 1)
```
now it looks like
```
    def grid_wrapper_for_add_kernel_2d_autotuned_0(meta):
        if meta['BLOCK_SIZE_X'] == 128 and meta['BLOCK_SIZE_Y'] == 128: return (4, 2, 1)
        if meta['BLOCK_SIZE_X'] == 64 and meta['BLOCK_SIZE_Y'] == 64: return (8, 4, 1)
```
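
The deduplication itself amounts to dropping repeated branch lines while preserving order; a minimal sketch (not the Inductor code) follows.
```python
def dedupe_grid_branches(branches):
    # Keep the first occurrence of each branch line, preserving order.
    seen = set()
    unique = []
    for line in branches:
        if line not in seen:
            seen.add(line)
            unique.append(line)
    return unique

branches = [
    "if meta['BLOCK_SIZE_X'] == 128 and meta['BLOCK_SIZE_Y'] == 128: return (4, 2, 1)",
    "if meta['BLOCK_SIZE_X'] == 128 and meta['BLOCK_SIZE_Y'] == 128: return (4, 2, 1)",
    "if meta['BLOCK_SIZE_X'] == 64 and meta['BLOCK_SIZE_Y'] == 64: return (8, 4, 1)",
]
print(dedupe_grid_branches(branches))  # one branch per unique condition
```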

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115849
Approved by: https://github.com/jansel
2023-12-14 23:26:04 +00:00
194d57dae7 Add values backward support for sparse CSR, CSC, BSR, and BSC tensors (#115586)
Fixes https://github.com/pytorch/pytorch/issues/107286

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115586
Approved by: https://github.com/cpuhrsch, https://github.com/albanD
2023-12-14 23:09:13 +00:00
49d826bcd3 [dtensor] update op db tests (#115722)
This PR updates the op db tests xfails, we should see whether we can
enable this again in CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115722
Approved by: https://github.com/XilunWu
2023-12-14 22:49:13 +00:00
ef6a0faf89 [export] Fix canonicalization. (#115830)
Summary: Add the missed layout argument branch.

Test Plan: buck2 test 'fbcode//mode/dev-nosan' fbcode//sigmoid/inference/test_gpu:export_package_sparse_toy_test

Differential Revision: D52166501

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115830
Approved by: https://github.com/angelayi
2023-12-14 22:48:26 +00:00
bb2bb8cca1 [ROCm] add hipblaslt support (#114329)
Disabled by default. Enable with env var DISABLE_ADDMM_HIP_LT=0. Tested on both ROCm 5.7 and 6.0.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114329
Approved by: https://github.com/malfet
2023-12-14 21:41:22 +00:00
04ef21f5dd [C10D] Make dumpDebuggingInfo share a mutex across PGs (#115803)
The mutex was originally added to avoid racing to dump debuginfo,
where a race in this case would result in a corrupted dump file.

The reason a mutex helps is that it forces all dump requests to be
serialized, so that an observer would either see an in-progress file, a
complete file, or no file.  Without a mutex, a fourth state is possible
(a file that has been written to by multiple threads and is invalid).

Because the mutex was a ProcessGroupNCCL class member, and each PG
instance has its own watchdog thread that can launch a dump, it was not
doing its job.  Making the mutex static shares it between instances of
the class and ensures serialization of dumps triggered by any PG.

(Note: dumps triggered by different PGs have the same, global contents
anyway; there is only one global flight recorder, so it doesn't matter
who triggers it.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115803
Approved by: https://github.com/kwen2501
ghstack dependencies: #115771, #115798, #115800, #115801
2023-12-14 21:17:44 +00:00
7ecddaef23 Revert "Introduce 3 low-latency, intra-node allreduce algorithms for small messages to PyTorch (#114001)"
This reverts commit adfbd2b219f4995d3f13870927022b67550f8b0e.

Reverted https://github.com/pytorch/pytorch/pull/114001 on behalf of https://github.com/atalman due to OSSCI oncall, breaks periodic jobs ([comment](https://github.com/pytorch/pytorch/pull/114001#issuecomment-1856539040))
2023-12-14 20:33:10 +00:00
67232199b1 [dynamo] Log shape_env_guard_count separately from guard_count (#115776)
guard_count counts all the shape_env guards as a single guard; log the shape_env_guard_count separately so those metrics can be used.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115776
Approved by: https://github.com/yanboliang
2023-12-14 20:12:49 +00:00
eqy
353f2dbd9c [CUDA] Fix V100 expected failures in test_mm_decomp and test_linalg (#115666)
BFloat16 isn't supported on sm70 and we get an unexpected cuBLAS success in 12.3+

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115666
Approved by: https://github.com/malfet
2023-12-14 19:17:53 +00:00
28e37d4f3b Update Triton pin (#115743)
To include a cherry-pick of https://github.com/openai/triton/pull/2771 that should fix  cuda-11.8 runtime issues

Also, tweak the build wheel script to update both the ROCm and vanilla Triton build versions to 2.2 (even though on trunk it should probably be 3.3 already)

TODO: Remove `ROCM_TRITION_VERSION` once both trunk and ROCM version are in sync again

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115743
Approved by: https://github.com/davidberard98
2023-12-14 18:54:24 +00:00
87547a26b8 [aotinductor] add no weight change version of fuse_parallel_linear (#115791)
Summary: We need a new version of fuse_parallel_linear w/o creating new weights for real-time update.

Reviewed By: khabinov

Differential Revision: D52128296

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115791
Approved by: https://github.com/khabinov
2023-12-14 18:36:17 +00:00
ca4caf4eac Revert "[inductor] Do variance calculation in opmath type (#115181)"
This reverts commit 42390a097b987cd3384511c3df3747699f2281f4.

Reverted https://github.com/pytorch/pytorch/pull/115181 on behalf of https://github.com/atalman due to OSSCI oncall, broke periodic tests ([comment](https://github.com/pytorch/pytorch/pull/115181#issuecomment-1856360644))
2023-12-14 18:21:49 +00:00
0fe014bd8a [C10D] Change PGNCCL logs to prefix [PG {} Rank {}] (#115801)
Adds a PG {process group uid} prefix component to logs.

This is helpful in situations where there are multiple process groups,
and rank information by itself is confusing. (For example, rank0 on PG1
may correspond to rank3 on PG0. People may assume 'rank0' references
the global (PG0) world, but it may reference a sub-pg. Prefixing the PG
helps clarify this.)

Does NOT change logs from inside WorkNCCL functions, since WorkNCCL
doesn't know what PG ID it corresponds to. Will address these logs
separately.

Example:

```
[I ProcessGroupNCCL.cpp:787] [PG 0 Rank 0] ProcessGroupNCCL initialization ...
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115801
Approved by: https://github.com/fduwjj
ghstack dependencies: #115771, #115798, #115800
2023-12-14 18:17:16 +00:00
e94267587b [C10D] Refactor NCCL logs to use common prefix helper (#115800)
Put the repeated code that string formats [Rank {rank}] in one place.

Sets up for the next PR that also adds more info to this prefix.

(Does not change exception messages, which could be done as well.
Exception messages are not formatted quite the same way. This PR tries
instead to avoid changing log behavior and only refactors code.)

Did limited testing (some logs were observed OK).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115800
Approved by: https://github.com/fduwjj
ghstack dependencies: #115771, #115798
2023-12-14 18:13:24 +00:00
eb6e70cf66 [C10D] Only open NCCL dump pipe file once per process (#115798)
The NCCL flight recorder is per-process (it is shared by all
process groups), but individual process groups used to construct their
own pipe for being signaled to dump the flight recorder.

This ensures that only one pipe per process is created, by only creating
the pipe on the first ProcessGroup (uid_ == 0) which should be the world
group.

Filenames are still keyed off of rank, but this should now be global
rank instead of sub-pg rank, making the filenames unique across the
whole trainer process.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115798
Approved by: https://github.com/zdevito
ghstack dependencies: #115771
2023-12-14 17:48:26 +00:00
74d2b9dd15 [C10D] Make DumpPipe disabled when FlightRecorder disabled (#115771)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115771
Approved by: https://github.com/fduwjj
2023-12-14 17:42:46 +00:00
b618869208 [inductor] label cpp test files with oncall: cpu inductor (#115167)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115167
Approved by: https://github.com/atalman
2023-12-14 17:39:27 +00:00
c80e2d5bb2 [fbcode] consolidate usage of fp8 linears for inference models (#115808)
Summary:
ATT, this will use the implementation of D51812709 for fp8 linears.

Meanwhile, it also adds a use case of delay quantization.

Test Plan:
```
CUDA_VISIBLE_DEVICES=7 buck run mode/opt  -c fbcode.platform010_cuda_version=12 -c fbcode.nvcc_arch=h100 -c fbcode.use_link_groups=false caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark -- --local-model /home/xiaoruichao/test_models/463113248.input.predictor.disagg.gpu.merge --lower-backend AOT_INDUCTOR --fp8-linear-quantization-type delay_quantization --disable-acc-tracer-aot-inductor
```

```
CUDA_VISIBLE_DEVICES=7 buck run mode/opt  -c fbcode.platform010_cuda_version=12 -c fbcode.nvcc_arch=h100 -c fbcode.use_link_groups=false caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark -- --local-model /home/xiaoruichao/test_models/463113248.input.predictor.disagg.gpu.merge --lower-backend AOT_INDUCTOR --fp8-linear-quantization-type delay_quantization --disable-acc-tracer-aot-inductor
```

Reviewed By: tter1

Differential Revision: D51840344

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115808
Approved by: https://github.com/ipiszy
2023-12-14 16:59:48 +00:00
5bddbed399 Initial Flash Attention support on ROCM (#114309)
This pull request adds initial Flash Attention support for the AMD/ROCm platform. It adds a specialized Triton repository/branch as a compile-time dependency for the Flash Attention math library on AMD/ROCm. This Triton submodule is not used at runtime and will not be shipped in the final PyTorch package. We plan to release this specialized Triton as a separate project.

Known limitations:

- [ ] Only supports MI200 series GPUs (i.e., `gcnArchName == gfx90a:sramecc+:xnack-`).
- [ ] Only supports power of two sequence lengths.
- [ ] No support for varlen APIs.
- [ ] Only supports head dimensions 16, 32, 64, 128.
- [ ] Performance is still being optimized.

Fixes https://github.com/pytorch/pytorch/issues/112997

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114309

Approved by: https://github.com/jeffdaily, https://github.com/malfet

---------

Co-authored-by: Joseph Groenenboom <joseph.groenenboom@amd.com>
2023-12-14 08:52:57 -08:00
ac60a70e06 Migrated loss functions to ModuleInfos (#115584)
Migrates most tests in `common_nn.py:criterion_tests` to ModuleInfos.

**I can split this up if it is too large to review**

What this PR does not include:
- [`no_batch_dim` tests](https://github.com/pytorch/pytorch/blob/main/torch/testing/_internal/common_nn.py#L3995-L4112)
- [tests that use the functional variant of the loss function and `wrap_functional`](https://github.com/pytorch/pytorch/blob/main/torch/testing/_internal/common_nn.py#L1079-L1128)

#### On test times
This PR increases test time by ~58s locally
Before this PR:
```
>>> python test/test_nn.py -k Loss
Ran 1003 tests in 28.977s
```
After this PR
```
>>> python test/test_nn.py -k Loss
Ran 368 tests in 23.073s
```

```
>>> python test/test_modules.py -k Loss
Ran 836 tests in 63.900s
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115584
Approved by: https://github.com/janeyx99
ghstack dependencies: #115617
2023-12-14 16:21:05 +00:00
f727bed2e6 [inductor] Updated upsample_bilinear2d decomposition (#104182)
Description:
- Updated upsample_bilinear2d decomposition
  - added uint8 dtype support (a usage sketch follows the perf considerations below)
  - code improvements
- Added uint8 dtype tests

Perf considerations:
- There is a minor perf regression (speed-up ~0.7) for uint8, align_corners=True cases when the output is smaller than or equal to (256, 256)
- For cases when the output is larger than (256, 256) and the input dtype is uint8, the nightly output is wrong, so IMO the large perf regression (speed-up around ~0.2) should not be taken into account.
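
A rough usage sketch (shapes and sizes are illustrative) exercising the uint8 bilinear path under torch.compile:
```python
import torch
import torch.nn.functional as F

def resize(x):
    return F.interpolate(x, size=(256, 256), mode="bilinear", align_corners=False)

compiled_resize = torch.compile(resize)
img = torch.randint(0, 256, (1, 3, 500, 400), dtype=torch.uint8)
out = compiled_resize(img)
print(out.shape, out.dtype)  # expected: torch.Size([1, 3, 256, 256]) torch.uint8
```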

## Perfs benchmarks

```
[--------------------------------------------------------------------------------------------------------------------------------------------------------- Interpolate, cpu --------------------------------------------------------------------------------------------------------------------------------------------------------]
                                                                                                                                                    |  Eager (2.3.0a0+gitafcfdb1) PR  |  Compiled (2.3.0a0+gitafcfdb1) PR  |  Compiled (2.3.0a0+gitde89a53) Nightly  |  speed-up PR vs Nightly  |  Eager (2.3.0a0+gitde89a53) Nightly
1 threads: --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      Input (1, 3, 500, 400), torch.uint8, torch.contiguous_format | mode: bilinear, align_corners: True, antialias: False, osize: (256, 256)       |        565.212 (+-3.548)        |        1384.210 (+-10.798)         |           1230.996 (+-32.930)           |     0.889 (+-0.000)      |          566.253 (+-1.526)
      Input (1, 3, 500, 400), torch.uint8, torch.contiguous_format | mode: bilinear, align_corners: False, antialias: False, osize: (256, 256)      |        565.404 (+-1.614)        |         1491.649 (+-7.763)         |            2974.959 (+-6.006)           |     1.994 (+-0.000)      |          566.476 (+-1.742)
      Input (1, 3, 500, 400), torch.uint8, torch.channels_last | mode: bilinear, align_corners: True, antialias: False, osize: (256, 256)           |        270.761 (+-0.861)        |         1557.777 (+-4.699)         |            1080.919 (+-4.243)           |     0.694 (+-0.000)      |          269.829 (+-0.986)
      Input (1, 3, 500, 400), torch.uint8, torch.channels_last | mode: bilinear, align_corners: False, antialias: False, osize: (256, 256)          |        270.960 (+-0.995)        |        1723.913 (+-12.433)         |            3191.938 (+-6.194)           |     1.852 (+-0.000)      |          269.962 (+-1.657)
      Input (1, 3, 500, 400), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: True, antialias: False, osize: (256, 256)     |        1555.884 (+-5.169)       |         1178.753 (+-4.957)         |            1910.445 (+-5.988)           |     1.621 (+-0.000)      |          1560.804 (+-6.793)
      Input (1, 3, 500, 400), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: False, antialias: False, osize: (256, 256)    |        1651.193 (+-6.952)       |         1323.466 (+-6.059)         |            3374.842 (+-8.168)           |     2.550 (+-0.000)      |          1653.497 (+-8.018)
      Input (1, 3, 500, 400), torch.float32, torch.channels_last | mode: bilinear, align_corners: True, antialias: False, osize: (256, 256)         |        978.482 (+-10.183)       |         1383.768 (+-4.341)         |            2147.841 (+-6.581)           |     1.552 (+-0.000)      |          979.983 (+-1.499)
      Input (1, 3, 500, 400), torch.float32, torch.channels_last | mode: bilinear, align_corners: False, antialias: False, osize: (256, 256)        |        1074.472 (+-5.031)       |         1414.912 (+-5.754)         |           3590.968 (+-10.042)           |     2.538 (+-0.000)      |          1074.589 (+-3.948)
      Input (4, 3, 500, 400), torch.uint8, torch.contiguous_format | mode: bilinear, align_corners: True, antialias: False, osize: (256, 256)       |        2168.703 (+-8.964)       |        5400.528 (+-26.628)         |           4777.299 (+-11.891)           |     0.885 (+-0.000)      |          2168.133 (+-7.667)
      Input (4, 3, 500, 400), torch.uint8, torch.contiguous_format | mode: bilinear, align_corners: False, antialias: False, osize: (256, 256)      |       2169.132 (+-12.618)       |        6583.866 (+-28.959)         |           11986.894 (+-45.838)          |     1.821 (+-0.000)      |         2174.488 (+-10.317)
      Input (4, 3, 500, 400), torch.uint8, torch.channels_last | mode: bilinear, align_corners: True, antialias: False, osize: (256, 256)           |        992.808 (+-6.086)        |         5985.028 (+-9.532)         |            4334.158 (+-9.423)           |     0.724 (+-0.000)      |          989.604 (+-5.499)
      Input (4, 3, 500, 400), torch.uint8, torch.channels_last | mode: bilinear, align_corners: False, antialias: False, osize: (256, 256)          |        987.618 (+-6.350)        |        6963.044 (+-28.885)         |           15441.096 (+-55.324)          |     2.218 (+-0.000)      |          985.573 (+-5.159)
      Input (4, 3, 500, 400), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: True, antialias: False, osize: (256, 256)     |       6695.557 (+-35.067)       |        4657.603 (+-14.220)         |           8058.708 (+-41.684)           |     1.730 (+-0.000)      |         6714.996 (+-38.626)
      Input (4, 3, 500, 400), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: False, antialias: False, osize: (256, 256)    |       7040.481 (+-39.486)       |        5445.704 (+-16.659)         |           13906.618 (+-53.298)          |     2.554 (+-0.000)      |         7034.453 (+-44.626)
      Input (4, 3, 500, 400), torch.float32, torch.channels_last | mode: bilinear, align_corners: True, antialias: False, osize: (256, 256)         |       3926.186 (+-10.660)       |        5741.433 (+-12.748)         |           9356.036 (+-40.848)           |     1.630 (+-0.000)      |         3930.598 (+-17.086)
      Input (4, 3, 500, 400), torch.float32, torch.channels_last | mode: bilinear, align_corners: False, antialias: False, osize: (256, 256)        |        4308.536 (+-9.607)       |        6122.755 (+-47.278)         |           15637.567 (+-54.392)          |     2.554 (+-0.000)      |         4307.463 (+-11.268)
      Input (1, 3, 1200, 1300), torch.uint8, torch.contiguous_format | mode: bilinear, align_corners: True, antialias: False, osize: (200, 300)     |       2512.740 (+-10.860)       |         1573.590 (+-5.061)         |            451.355 (+-1.210)            |     0.287 (+-0.000)      |         2511.727 (+-10.930)
      Input (1, 3, 1200, 1300), torch.uint8, torch.contiguous_format | mode: bilinear, align_corners: False, antialias: False, osize: (200, 300)    |       2489.926 (+-11.915)       |         1537.233 (+-4.212)         |            2501.470 (+-7.446)           |     1.627 (+-0.000)      |         2500.000 (+-12.155)
      Input (1, 3, 1200, 1300), torch.uint8, torch.channels_last | mode: bilinear, align_corners: True, antialias: False, osize: (200, 300)         |        632.032 (+-2.108)        |         1496.994 (+-4.194)         |            404.759 (+-1.064)            |     0.270 (+-0.000)      |          630.122 (+-4.086)
      Input (1, 3, 1200, 1300), torch.uint8, torch.channels_last | mode: bilinear, align_corners: False, antialias: False, osize: (200, 300)        |        629.174 (+-4.386)        |         1708.935 (+-8.817)         |            2643.296 (+-9.723)           |     1.547 (+-0.000)      |          628.388 (+-1.326)
      Input (1, 3, 1200, 1300), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: True, antialias: False, osize: (200, 300)   |        4409.941 (+-8.016)       |         1160.133 (+-4.698)         |            1897.089 (+-9.392)           |     1.635 (+-0.000)      |         4450.959 (+-10.438)
      Input (1, 3, 1200, 1300), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: False, antialias: False, osize: (200, 300)  |       4493.427 (+-11.703)       |         1329.226 (+-4.740)         |           2835.872 (+-12.241)           |     2.133 (+-0.000)      |          4506.973 (+-9.914)
      Input (1, 3, 1200, 1300), torch.float32, torch.channels_last | mode: bilinear, align_corners: True, antialias: False, osize: (200, 300)       |        901.712 (+-4.071)        |         1320.739 (+-5.197)         |            2207.605 (+-8.219)           |     1.671 (+-0.000)      |          904.757 (+-4.558)
      Input (1, 3, 1200, 1300), torch.float32, torch.channels_last | mode: bilinear, align_corners: False, antialias: False, osize: (200, 300)      |        990.080 (+-3.922)        |         1702.563 (+-7.909)         |           3074.196 (+-10.478)           |     1.806 (+-0.000)      |          990.482 (+-4.444)
      Input (4, 3, 1200, 1300), torch.uint8, torch.contiguous_format | mode: bilinear, align_corners: True, antialias: False, osize: (200, 300)     |       9785.550 (+-58.445)       |        6135.680 (+-33.569)         |           1628.572 (+-19.770)           |     0.265 (+-0.000)      |         9893.606 (+-62.377)
      Input (4, 3, 1200, 1300), torch.uint8, torch.contiguous_format | mode: bilinear, align_corners: False, antialias: False, osize: (200, 300)    |       9710.191 (+-57.597)       |        6066.824 (+-36.364)         |           10469.110 (+-42.775)          |     1.726 (+-0.000)      |         9919.022 (+-72.190)
      Input (4, 3, 1200, 1300), torch.uint8, torch.channels_last | mode: bilinear, align_corners: True, antialias: False, osize: (200, 300)         |       2790.356 (+-12.188)       |        6134.101 (+-28.694)         |            1576.832 (+-6.030)           |     0.257 (+-0.000)      |         2761.122 (+-11.503)
      Input (4, 3, 1200, 1300), torch.uint8, torch.channels_last | mode: bilinear, align_corners: False, antialias: False, osize: (200, 300)        |       2778.711 (+-13.603)       |        6608.528 (+-37.776)         |           10841.549 (+-49.429)          |     1.641 (+-0.000)      |         2753.037 (+-10.995)
      Input (4, 3, 1200, 1300), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: True, antialias: False, osize: (200, 300)   |      45533.868 (+-102.618)      |         4962.994 (+-8.215)         |           9003.968 (+-38.179)           |     1.814 (+-0.000)      |        43531.261 (+-102.951)
      Input (4, 3, 1200, 1300), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: False, antialias: False, osize: (200, 300)  |       45932.699 (+-81.207)      |        5595.682 (+-11.482)         |           12302.907 (+-50.254)          |     2.199 (+-0.000)      |         43916.455 (+-80.468)
      Input (4, 3, 1200, 1300), torch.float32, torch.channels_last | mode: bilinear, align_corners: True, antialias: False, osize: (200, 300)       |        3827.804 (+-8.057)       |        6311.580 (+-25.021)         |           11760.614 (+-51.531)          |     1.863 (+-0.000)      |         3849.959 (+-10.848)
      Input (4, 3, 1200, 1300), torch.float32, torch.channels_last | mode: bilinear, align_corners: False, antialias: False, osize: (200, 300)      |        4169.007 (+-8.452)       |        6820.716 (+-35.310)         |           15264.633 (+-49.982)          |     2.238 (+-0.000)      |         4183.875 (+-19.104)
      Input (1, 3, 300, 400), torch.uint8, torch.contiguous_format | mode: bilinear, align_corners: True, antialias: False, osize: (600, 700)       |        1306.914 (+-7.470)       |        10598.101 (+-38.410)        |           2678.031 (+-11.051)           |     0.253 (+-0.000)      |          1307.470 (+-8.519)
      Input (1, 3, 300, 400), torch.uint8, torch.contiguous_format | mode: bilinear, align_corners: False, antialias: False, osize: (600, 700)      |        1307.268 (+-8.197)       |        10161.123 (+-45.643)        |           17148.842 (+-55.402)          |     1.688 (+-0.000)      |          1308.077 (+-8.553)
      Input (1, 3, 300, 400), torch.uint8, torch.channels_last | mode: bilinear, align_corners: True, antialias: False, osize: (600, 700)           |        548.574 (+-2.157)        |        10072.806 (+-41.368)        |            2408.971 (+-6.997)           |     0.239 (+-0.000)      |          547.726 (+-1.721)
      Input (1, 3, 300, 400), torch.uint8, torch.channels_last | mode: bilinear, align_corners: False, antialias: False, osize: (600, 700)          |        546.664 (+-1.484)        |        11123.694 (+-43.636)        |           18058.070 (+-48.552)          |     1.623 (+-0.000)      |          547.151 (+-1.627)
      Input (1, 3, 300, 400), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: True, antialias: False, osize: (600, 700)     |       7935.051 (+-71.022)       |        7654.533 (+-29.512)         |           12414.194 (+-87.450)          |     1.622 (+-0.000)      |         7900.056 (+-53.997)
      Input (1, 3, 300, 400), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: False, antialias: False, osize: (600, 700)    |       8546.732 (+-53.118)       |        8583.572 (+-35.656)         |          19111.824 (+-166.978)          |     2.227 (+-0.000)      |         8515.433 (+-63.300)
      Input (1, 3, 300, 400), torch.float32, torch.channels_last | mode: bilinear, align_corners: True, antialias: False, osize: (600, 700)         |       6202.642 (+-34.355)       |        8915.622 (+-62.293)         |           14327.295 (+-52.188)          |     1.607 (+-0.000)      |         6213.329 (+-39.740)
      Input (1, 3, 300, 400), torch.float32, torch.channels_last | mode: bilinear, align_corners: False, antialias: False, osize: (600, 700)        |       6811.128 (+-33.747)       |        9647.316 (+-50.837)         |           20830.594 (+-62.979)          |     2.159 (+-0.000)      |         6822.512 (+-37.092)
      Input (4, 3, 300, 400), torch.uint8, torch.contiguous_format | mode: bilinear, align_corners: True, antialias: False, osize: (600, 700)       |       5079.586 (+-19.067)       |        42238.442 (+-87.643)        |           11282.141 (+-42.477)          |     0.267 (+-0.000)      |         5104.234 (+-17.706)
      Input (4, 3, 300, 400), torch.uint8, torch.contiguous_format | mode: bilinear, align_corners: False, antialias: False, osize: (600, 700)      |       5079.575 (+-16.306)       |        41512.995 (+-83.710)        |          68789.816 (+-440.001)          |     1.657 (+-0.000)      |         5097.446 (+-21.724)
      Input (4, 3, 300, 400), torch.uint8, torch.channels_last | mode: bilinear, align_corners: True, antialias: False, osize: (600, 700)           |        2039.974 (+-8.614)       |       42322.773 (+-111.866)        |           10399.237 (+-43.140)          |     0.246 (+-0.000)      |         2043.808 (+-10.707)
      Input (4, 3, 300, 400), torch.uint8, torch.channels_last | mode: bilinear, align_corners: False, antialias: False, osize: (600, 700)          |       2036.214 (+-10.083)       |        44353.281 (+-71.548)        |          73340.412 (+-324.780)          |     1.654 (+-0.000)      |          2039.000 (+-9.554)
      Input (4, 3, 300, 400), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: True, antialias: False, osize: (600, 700)     |       33821.523 (+-96.639)      |        30552.094 (+-65.023)        |          49494.486 (+-872.916)          |     1.620 (+-0.000)      |         33844.404 (+-92.466)
      Input (4, 3, 300, 400), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: False, antialias: False, osize: (600, 700)    |      36196.104 (+-128.169)      |        34038.432 (+-79.697)        |          75761.226 (+-905.194)          |     2.226 (+-0.000)      |         36260.473 (+-94.642)
      Input (4, 3, 300, 400), torch.float32, torch.channels_last | mode: bilinear, align_corners: True, antialias: False, osize: (600, 700)         |       24827.821 (+-77.335)      |        37006.218 (+-86.318)        |          61297.625 (+-898.192)          |     1.656 (+-0.000)      |         24823.275 (+-80.945)
      Input (4, 3, 300, 400), torch.float32, torch.channels_last | mode: bilinear, align_corners: False, antialias: False, osize: (600, 700)        |       27266.138 (+-70.262)      |        40109.475 (+-94.248)        |          92086.075 (+-404.922)          |     2.296 (+-0.000)      |         27287.992 (+-89.507)

Times are in microseconds (us).

[--------------------------------------------------------------------------------------------------------------------------------------------------------- Interpolate, cuda ---------------------------------------------------------------------------------------------------------------------------------------------------------]
                                                                                                                                                      |  Eager (2.3.0a0+gitafcfdb1) PR  |  Compiled (2.3.0a0+gitafcfdb1) PR  |  Compiled (2.3.0a0+gitde89a53) Nightly  |  speed-up PR vs Nightly  |  Eager (2.3.0a0+gitde89a53) Nightly
1 threads: ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      Input (1, 3, 2345, 2456), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: True, antialias: False, osize: (1234, 1345)   |         98.259 (+-0.014)        |          97.156 (+-0.008)          |             97.443 (+-0.031)            |     1.003 (+-0.000)      |           98.248 (+-0.021)
      Input (1, 3, 2345, 2456), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: False, antialias: False, osize: (1234, 1345)  |         97.048 (+-0.016)        |          97.480 (+-0.018)          |             96.819 (+-0.126)            |     0.993 (+-0.000)      |           97.045 (+-0.015)
      Input (1, 3, 2345, 2456), torch.float32, torch.channels_last | mode: bilinear, align_corners: True, antialias: False, osize: (1234, 1345)       |         97.944 (+-0.028)        |          91.686 (+-0.411)          |             93.894 (+-1.011)            |     1.024 (+-0.000)      |           97.933 (+-0.008)
      Input (1, 3, 2345, 2456), torch.float32, torch.channels_last | mode: bilinear, align_corners: False, antialias: False, osize: (1234, 1345)      |         98.008 (+-0.011)        |          91.205 (+-0.346)          |             96.854 (+-0.058)            |     1.062 (+-0.000)      |           97.203 (+-0.010)
      Input (4, 3, 2345, 2456), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: True, antialias: False, osize: (1234, 1345)   |        384.318 (+-0.011)        |         382.793 (+-0.007)          |            382.472 (+-0.011)            |     0.999 (+-0.000)      |          384.701 (+-0.012)
      Input (4, 3, 2345, 2456), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: False, antialias: False, osize: (1234, 1345)  |        384.266 (+-0.009)        |         385.333 (+-0.024)          |            382.554 (+-0.022)            |     0.993 (+-0.000)      |          384.386 (+-0.016)
      Input (4, 3, 2345, 2456), torch.float32, torch.channels_last | mode: bilinear, align_corners: True, antialias: False, osize: (1234, 1345)       |        383.924 (+-0.011)        |         570.071 (+-0.030)          |            545.615 (+-0.051)            |     0.957 (+-0.000)      |          384.044 (+-0.012)
      Input (4, 3, 2345, 2456), torch.float32, torch.channels_last | mode: bilinear, align_corners: False, antialias: False, osize: (1234, 1345)      |        384.184 (+-0.016)        |         560.857 (+-0.026)          |            552.447 (+-0.040)            |     0.985 (+-0.000)      |          384.063 (+-0.016)
      Input (1, 3, 1234, 1345), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: True, antialias: False, osize: (2345, 2456)   |        122.188 (+-0.053)        |         116.744 (+-1.006)          |            163.762 (+-0.015)            |     1.403 (+-0.000)      |          121.874 (+-0.015)
      Input (1, 3, 1234, 1345), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: False, antialias: False, osize: (2345, 2456)  |        122.156 (+-0.012)        |         182.692 (+-0.013)          |            161.653 (+-0.018)            |     0.885 (+-0.000)      |          121.926 (+-0.014)
      Input (1, 3, 1234, 1345), torch.float32, torch.channels_last | mode: bilinear, align_corners: True, antialias: False, osize: (2345, 2456)       |        105.852 (+-0.324)        |         119.545 (+-0.294)          |            190.527 (+-0.023)            |     1.594 (+-0.000)      |          105.999 (+-0.446)
      Input (1, 3, 1234, 1345), torch.float32, torch.channels_last | mode: bilinear, align_corners: False, antialias: False, osize: (2345, 2456)      |        106.507 (+-0.282)        |         120.060 (+-0.257)          |            162.330 (+-0.012)            |     1.352 (+-0.000)      |          106.567 (+-0.385)
      Input (4, 3, 1234, 1345), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: True, antialias: False, osize: (2345, 2456)   |        447.907 (+-0.015)        |         463.863 (+-1.779)          |            650.492 (+-0.331)            |     1.402 (+-0.000)      |          446.596 (+-0.017)
      Input (4, 3, 1234, 1345), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: False, antialias: False, osize: (2345, 2456)  |        447.750 (+-0.017)        |         723.832 (+-0.170)          |            641.539 (+-0.075)            |     0.886 (+-0.000)      |          446.467 (+-0.019)
      Input (4, 3, 1234, 1345), torch.float32, torch.channels_last | mode: bilinear, align_corners: True, antialias: False, osize: (2345, 2456)       |        439.549 (+-0.031)        |         507.772 (+-2.879)          |            758.795 (+-0.482)            |     1.494 (+-0.000)      |          440.372 (+-0.025)
      Input (4, 3, 1234, 1345), torch.float32, torch.channels_last | mode: bilinear, align_corners: False, antialias: False, osize: (2345, 2456)      |        439.538 (+-0.029)        |         509.260 (+-2.704)          |            654.195 (+-2.621)            |     1.285 (+-0.000)      |          440.362 (+-0.026)

Times are in microseconds (us).
```

[Source](f4751a3196/perf_interp_mode.py), [Output](899f34c024/output/20231213-214209-upsample-bilinear-pr_vs_nightly-speedup.md)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104182
Approved by: https://github.com/lezcano
2023-12-14 14:50:06 +00:00
28e4004286 Add doc for torch.distributed.breakpoint (#115656)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115656
Approved by: https://github.com/wanchaol, https://github.com/fegin
ghstack dependencies: #115705
2023-12-14 14:45:36 +00:00
cyy
fcb95bf31b [2/N] Use std::in_place (#115480)
Remove c10/util/in_place.h
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115480
Approved by: https://github.com/soulitzer
2023-12-14 12:54:22 +00:00
6500ccebd7 enable fp16 autocast for dynamo benchmark (#114088)
`--amp` to enable the amp path for `CUDA` (default amp_dtype will be float16) and `CPU` (default amp_dtype will be bfloat16).

If users set `--amp_dtype`, the user-provided amp_dtype will take the highest priority.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114088
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-12-14 12:38:44 +00:00
afe6d272c6 Fix buck OSS build after #115570 (#115804)
From #115570, `supports_shlib_interfaces` is only available in buck2 (https://buck2.build/docs/api/rules/), not buck (https://buck.build/rule/cxx_library.html). The best way to fix this is probably to migrate OSS CI to buck2, so this is a temporary workaround; the fix from #115570 is only needed internally anyway.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115804
Approved by: https://github.com/kit1980, https://github.com/malfet
2023-12-14 08:33:07 +00:00
adfbd2b219 Introduce 3 low-latency, intra-node allreduce algorithms for small messages to PyTorch (#114001)
## Summary
This PR added 3 intra-node GPU allreduce algorithms to PyTorch:
- One-shot allreduce (inspired by FasterTransformer): all ranks simultaneously read and accumulate data from other ranks.
- Two-shot allreduce (inspired by FasterTransformer): all ranks simultaneously read and accumulate `1 / world_size` of the data from other ranks, then all ranks read the accumulated data from other ranks (effectively a one-shot reduce-scatter + a one-shot all-gather).
- Hybrid cube mesh allreduce (original): a one-shot allreduce variant that avoids transmission over PCIe on HCM topology.

## Micro Benchmarks
![image](https://github.com/pytorch/pytorch/assets/4156752/7bd25ffc-cd5b-4acb-bd65-b01bc136726e)

![image](https://github.com/pytorch/pytorch/assets/4156752/3ced31b4-6c31-4f34-a2d8-c072df29ae0e)

![image](https://github.com/pytorch/pytorch/assets/4156752/5b942c05-4fcc-4ec9-ae29-12c64080bb1c)

## Details
The intra-node algos are organized behind `c10d::IntraNodeComm`, which is responsible for:
- Managing handshaking and cuda IPC handle exchange among ranks.
- Querying NVLink connection and detecting topology.
- Performing algo selection based on available info.
- Launching the selected allreduce kernel.

`c10d::IntraNodeComm` is integrated into `c10d::ProcessGroupNCCL` as follows:
- When the `ENABLE_INTRA_NODE_COMM` environment variable is set, `c10d::ProcessGroupNCCL` initializes a `c10d::IntraNodeComm` for its ranks.
  - If the setup is not suitable for intra-node comm (e.g. not all ranks are from the same node), the rendezvous logic guarantees all participants fall back consistently.
- `c10d::ProcessGroupNCCL::allreduce` consults `c10d::IntraNodeComm` whether to use intra-node allreduce and carries out the communication accordingly.

We currently detect two types of topologies from the NVLink connection mesh (a selection sketch follows this list):
- Fully connected: all GPU pairs have a direct NVLink connection (e.g. NVSwitch or a fully connected subset of a hybrid cube mesh)
  - `msg <= 256KB`: one-shot allreduce.
  - `256KB < msg <= 10MB`: two-shot allreduce.
  - `msg > 10MB`: instructs the caller to fall back to NCCL.
- Hybrid cube mesh
  - `msg <= 256KB`: one-shot allreduce.
  - `msg > 256KB`: instructs the caller to fall back to NCCL.
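
A minimal sketch of the selection logic described above (thresholds are taken from this description; names are illustrative, not the actual `IntraNodeComm` API):
```python
KB, MB = 1024, 1024 * 1024

def select_allreduce_algo(msg_bytes, topology):
    if topology == "fully_connected":
        if msg_bytes <= 256 * KB:
            return "one_shot"
        if msg_bytes <= 10 * MB:
            return "two_shot"
        return "fallback_to_nccl"
    if topology == "hybrid_cube_mesh":
        if msg_bytes <= 256 * KB:
            return "hcm_one_shot"
        return "fallback_to_nccl"
    return "fallback_to_nccl"

print(select_allreduce_algo(128 * KB, "fully_connected"))  # one_shot
print(select_allreduce_algo(1 * MB, "fully_connected"))    # two_shot
```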

## Next Steps
- Fine tune algo selection based on GPU model, topology, link speed.
- Potentially optimize the two-shot allreduce impl. According to FasterTransformer, two-shot allreduce is preferred up until 50MB. There might be room for improvement, but PyTorch does impose more constraints:
  - FasterTransformer uses a single process to drive multiple devices. It can use `cudaDeviceEnablePeerAccess` to enable device-level peer access.
  - PyTorch uses multiple process to drive multiple devices. With cuda IPC, a device can only share a specific region to other devices. This means extra copies may be unavoidable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114001
Approved by: https://github.com/yf225
2023-12-14 08:13:08 +00:00
36c6c0c7dc [pytree] expand tree_map to accept multi-inputs (#115642)
Fixes #115419
Fixes #91323
Closes #115549

- #115419
- #91323
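
A rough usage sketch of the expanded API, assuming the optree-style `tree_map(fn, tree, *rests)` signature where all trees share the same structure:
```python
import torch.utils._pytree as pytree

lhs = {"a": 1, "b": [2, 3]}
rhs = {"a": 10, "b": [20, 30]}
summed = pytree.tree_map(lambda x, y: x + y, lhs, rhs)  # applied leaf-wise across both trees
print(summed)  # {'a': 11, 'b': [22, 33]}
```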

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115642
Approved by: https://github.com/vmoens, https://github.com/zou3519
2023-12-14 06:16:42 +00:00
eqy
7e1542b938 [CUDA][FP8] Skip test_dtypes on FP8 _scaled_mm (#115661)
This test isn't actually parametrized by `dtype` so it seems to surface bogus failures where "unsupported" types "work" but in reality fp8 is used every time.

CC @drisspg I'm guessing this doesn't surface in upstream CI because there are no SM9.0 runners yet?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115661
Approved by: https://github.com/drisspg
2023-12-14 05:12:33 +00:00
f5458f8f00 [C10D] Make DumpPipe pipe file configurable (#115770)
Add TORCH_NCCL_DEBUG_INFO_PIPE_FILE env, allowing separate pipe file
location from dump file location.

Defaults PIPE_FILE to empty, meaning disabled.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115770
Approved by: https://github.com/zdevito
2023-12-14 03:54:43 +00:00
ef01e78fd9 disable test_ddp_profiling_autograd_profiler in distributed_test.py (#115704)
The test was previously disabled upstream (https://github.com/pytorch/pytorch/issues/77342) and is currently failing in NVIDIA internal CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115704
Approved by: https://github.com/soulitzer
2023-12-14 01:41:37 +00:00
722752fc28 Revert "Increased hardcoded limit for number of GPUs. (#115368)"
This reverts commit c039f01bd932d4f67d5b1d63ade8f1db11bfb72e.

Reverted https://github.com/pytorch/pytorch/pull/115368 on behalf of https://github.com/osalpekar due to This was reverted internally due to a release breakage ([comment](https://github.com/pytorch/pytorch/pull/115368#issuecomment-1854956224))
2023-12-14 01:28:01 +00:00
5e615f5f3a [BE] Use version.txt to determine version of nightly builds (#115794)
Fixes TODO from https://github.com/pytorch/pytorch/pull/33326
Test plan: check version generated by CI:
 - https://github.com/pytorch/pytorch/actions/runs/7202798334/job/19621620744?pr=115794#step:9:64
 - https://github.com/pytorch/pytorch/actions/runs/7202798329/job/19621639791?pr=115794#step:11:104

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115794
Approved by: https://github.com/atalman
2023-12-14 01:09:51 +00:00
661c1cf2aa numerical mismatch fix for test_mem_efficient_attention_attn_mask_vs_math_ref_grads in test_transformers.py (#115707)
Adjust dropout_fudge_factor since the previous fudge factor was too small and led to a numerical mismatch in NVIDIA internal CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115707
Approved by: https://github.com/drisspg
2023-12-14 01:04:39 +00:00
ffc826bf10 [nccl-pg] Store PG global rank information in tracing logs (#115730)
Storing the list of global ranks associated with each PG allows us to correlate traces across different ranks.

Test Plan:

OSS CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115730
Approved by: https://github.com/fduwjj
2023-12-14 00:59:17 +00:00
b38e14c12a [Reland][HigherOrderOp] remove unused get_item in MapHigherOrder (#115758)
Summary: This is a reland of https://github.com/pytorch/pytorch/pull/115207

Test Plan: Modified existing tests.

Reviewed By: yanboliang

Differential Revision: D52045157

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115758
Approved by: https://github.com/angelayi
2023-12-14 00:41:46 +00:00
626b7dc847 Revert "Migrated loss functions to ModuleInfos (#115584)"
This reverts commit f138b08d2e9c8d676f2a404e97d773f42132b0c7.

Reverted https://github.com/pytorch/pytorch/pull/115584 on behalf of https://github.com/atalman due to OSS CI oncall, breaks slow test ([comment](https://github.com/pytorch/pytorch/pull/115584#issuecomment-1854855080))
2023-12-13 23:34:30 +00:00
3fa3ed4923 Workaround to avoid MSVC std ambiguous symbol error (#115748)
Don't know what the correct fix is, but it appears that this is the known workaround https://github.com/pytorch/pytorch/issues/18607

Failing windows build: https://hud.pytorch.org/pytorch/pytorch/pull/114897?sha=574a6f7cfe979f1bac62c6b0b51380ff67a31a09

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115748
Approved by: https://github.com/jbschlosser
ghstack dependencies: #114895, #115739
2023-12-13 23:22:52 +00:00
67ce57ff66 Add pragma once to headers (#115739)
This reverts commit 9b93c23b5e2d695c2fbd9c886cc0c8010edab717.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115739
Approved by: https://github.com/Skylion007, https://github.com/jbschlosser
ghstack dependencies: #114895
2023-12-13 23:22:52 +00:00
c7ae2c170f [inductor] Added non-integer expr support for floordiv in triton codegen (#115751)
Description:
- Added non-integer expr support for floordiv in triton codegen
- Added a test
  - the cpp test is skipped because it is failing; https://github.com/pytorch/pytorch/pull/115647 may fix it

This PR is fixing compilation error with the following code:
```python
import torch

def func(x, a):
    n = (a * 1.234) // 8.234
    y = x + n
    return y

cfunc = torch.compile(func, dynamic=True, fullgraph=True)

device = "cuda"
x = torch.tensor(0, dtype=torch.float32, device=device)
a = 33

out = cfunc(x, a)
expected = func(x, a)
torch.testing.assert_close(out, expected)
```
Error message on Nightly:
```
  File "/usr/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
torch._dynamo.exc.BackendCompilerFailed: backend='compile_fx_wrapper' raised:
CompilationError: at 7:38:def triton_(in_ptr0, out_ptr0, ks0, xnumel, XBLOCK : tl.constexpr):
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (x0), xmask)
    tmp1 = ((1.23400000000000*ks0) // 8.23400000000000)
                                      ^
AssertionError()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115751
Approved by: https://github.com/peterbell10
2023-12-13 23:17:42 +00:00
3643548447 [Export] Support ser/des test on existing cases (#115413)
Summary:
Similar to #115399

Test Plan:
```
$ python test/export/test_serdes.py
...
Ran 72 tests in 29.097s

OK (expected failures=13)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115413
Approved by: https://github.com/tugsbayasgalan
ghstack dependencies: #115402
2023-12-13 23:17:12 +00:00
a34d56a64a [Export] Support retraceability test on existing cases (#115402)
Summary:
Similar to #115399

Test Plan:
python test/export/test_retraceability.py

    Ran 71 tests in 31.929s

    OK (expected failures=14)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115402
Approved by: https://github.com/tugsbayasgalan
2023-12-13 23:17:12 +00:00
43efe39cb1 [codemod][lowrisk] Remove extra semi colon from caffe2/caffe2/opt/optimizer.cc (#115018)
Summary:
`-Wextra-semi` or `-Wextra-semi-stmt`

If the code compiles, this is safe to land.

Test Plan: Sandcastle

Reviewed By: dmm-fb

Differential Revision: D51777924

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115018
Approved by: https://github.com/Skylion007
2023-12-13 23:11:33 +00:00
ad76a4e1e7 [inductor] Allow sympy expressions to participate in type promotion (#115676)
In the test example we have `add(i64[10], sympy.Expr)` where
`sympy.Expr` is not considered a promoting arg so isn't factored into
the type promotion. However, in eager it would promote to float32.
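
A rough repro sketch of the behavior being matched (shapes are illustrative, not the actual test):
```python
import torch

def f(x):
    n = x.size(0) * 0.5  # becomes a float-valued sympy expression under dynamic shapes
    return x + n         # int64 tensor + float expr should promote to float32, as in eager

cf = torch.compile(f, dynamic=True, fullgraph=True)
x = torch.arange(10)  # int64
print(f(x).dtype, cf(x).dtype)  # both expected to be torch.float32
```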

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115676
Approved by: https://github.com/lezcano
ghstack dependencies: #115677, #115699, #115700
2023-12-13 22:22:37 +00:00
869e52e3dd Support torch function user objects (#111765)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111765
Approved by: https://github.com/jansel
2023-12-13 22:11:52 +00:00
81321baf5c [PyTorch] Remove ArrayRefTensor::dtype (#113578)
Knocks off a few nanoseconds from CPU inference due to not having to set this field; paths that would've needed it are expensive anyway.

Differential Revision: [D51182794](https://our.internmc.facebook.com/intern/diff/D51182794/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113578
Approved by: https://github.com/khabinov, https://github.com/Neilblaze
ghstack dependencies: #112800, #113577
2023-12-13 21:32:14 +00:00
8c57fde21f Let all_reduce_coalesced accept one tensor as well (#115650)
This diff introduces a change to the `all_reduce_coalesced` function in `distributed_c10d.py`. The function now accepts a single tensor as well as a list of tensors. This allows for more flexibility in the use of the function.

This is just syntax sugar so the compiler can use `all_reduce_coalesced` without worrying about converting the input to a list.
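
A rough usage sketch (the single-process gloo group is only there to make the example self-contained; the single-tensor call is the form assumed to be accepted after this change):
```python
import torch
import torch.distributed as dist

# Single-process gloo group just to make the example runnable on its own.
dist.init_process_group("gloo", init_method="tcp://127.0.0.1:29500", rank=0, world_size=1)

t = torch.ones(4)
dist.all_reduce_coalesced([t])  # existing list-of-tensors form
dist.all_reduce_coalesced(t)    # single-tensor form accepted as sugar
print(t)

dist.destroy_process_group()
```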

Differential Revision: [D51433236](https://our.internmc.facebook.com/intern/diff/D51433236/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115650
Approved by: https://github.com/wconstab
ghstack dependencies: #115523, #115302, #115648, #115649
2023-12-13 21:32:01 +00:00
b9af126908 [PyTorch] Add input numel assert for minimal arrayref interface (#113577)
We currently have no shape checking on CPU IIUC. Now we at least do numel checking for the minimal arrayref interface.

Differential Revision: [D51165703](https://our.internmc.facebook.com/intern/diff/D51165703/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113577
Approved by: https://github.com/chenyang78, https://github.com/jansel
ghstack dependencies: #112800
2023-12-13 21:31:55 +00:00
db851b1bc9 [Dynamo][7/N] Wrap python modules under torch as regular PythonModuleVariable (#115724)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115724
Approved by: https://github.com/jansel
2023-12-13 21:23:14 +00:00
54d552e991 [funcol] Directly import DeviceMesh to avoid circular dependency (#115649)
This diff aims to directly import DeviceMesh from torch.distributed.device_mesh instead of importing it from dist._tensor. This is done to avoid a circular dependency issue. The code changes in each file of the diff are as follows:

- torch/distributed/_functional_collectives.py: import DeviceMesh from torch.distributed instead of dist._tensor.

Overall, this diff aims to improve the code by avoiding circular dependencies and improving the import statements.

==
The above summary is generated by LLM with minor manual fixes. The following summary is by me.

The original import causes some issues when compiling DDP with compiled_autograd. The root cause of compilation failure is not identified but it is good to fix the lazy initialization, which indirectly fixes the compilation issues for DDP.

Differential Revision: [D51857246](https://our.internmc.facebook.com/intern/diff/D51857246/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115649
Approved by: https://github.com/wconstab, https://github.com/wz337
ghstack dependencies: #115523, #115302, #115648
2023-12-13 20:44:58 +00:00
7388d40165 Make pytorch_qnnpack a shared library (#115570)
Summary:
This library contains global state, e.g. pytorch_qnnp_params. If we make
it a static library, different shared libraries linking that static
library can end up with their own copies of the global state, leading to
bugs. Make it a shared library instead, to avoid this issue.

Test Plan: buck2 test fbsource//fbandroid/javatests/com/facebook/playground/apps/fb4aplayground/scenarios/pytorchscenario:pytorchscenario -- --run-disabled --regex runBundledInputWithLocalAsset

Differential Revision: D51926024

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115570
Approved by: https://github.com/malfet
2023-12-13 20:44:37 +00:00
c90fdb9ac0 Fix torch.distributed.breakpoint (#115705)
Switches from calling breakpoint() internally to using a subclass of
Pdb.
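
A minimal sketch of the idea, not the actual implementation; the helper name, the `/dev/stdin` reopening, and the barrier behavior are assumptions:

```python
import pdb
import sys

import torch.distributed as dist


class _DistributedPdb(pdb.Pdb):
    """Pdb subclass that reattaches stdin so the prompt works under a launcher."""

    def interaction(self, *args, **kwargs):
        _stdin = sys.stdin
        try:
            sys.stdin = open("/dev/stdin")
            super().interaction(*args, **kwargs)
        finally:
            sys.stdin = _stdin


def breakpoint_on_rank(rank: int = 0) -> None:
    # assumes an initialized process group
    if dist.get_rank() == rank:
        _DistributedPdb().set_trace(sys._getframe(1))
    dist.barrier()  # other ranks wait here until the debugged rank continues
```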

Fixes #115685

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115705
Approved by: https://github.com/wanchaol, https://github.com/fegin
2023-12-13 20:33:56 +00:00
8a8d0adc0b Fix torch.gradient check for spacing arg list length (#115686)
Fixes #114207

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115686
Approved by: https://github.com/albanD
2023-12-13 20:17:20 +00:00
23bff71de4 [llvm][oncall] Fix build for llvm-18+ (#115652)
Summary:
https://reviews.llvm.org/D137838 moved Host.h and some other files under TargetParser.
https://github.com/llvm/llvm-project/pull/74261 removed it from the Support folder.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115652
Approved by: https://github.com/davidberard98
2023-12-13 20:11:31 +00:00
4d8ad4fb82 Move SingletonSymNodeImpl from c10 to aten (#114895)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114895
Approved by: https://github.com/jbschlosser
2023-12-13 20:01:18 +00:00
2a514f48d7 Add huggingface gpt2 fake tensor unit test for torch.onnx.dynamo_export (#115380)
open llama, dolly v2 and falcon are still broken regardless of `ExportedProgram`, so they were not moved from `test_fx_to_onnx.py` to `fx_to_onnx_onnxruntime.py`.

Dolly and falcon already have tracking issues, but a tracking issue was created for open llama: https://github.com/pytorch/pytorch/issues/115552

A tracking issue was created for `xfail_if_model_type_is_exportedprogram` and `xfail_if_model_type_is_not_exportedprogram` issues with unexpected success runs: https://github.com/pytorch/pytorch/issues/115747
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115380
Approved by: https://github.com/titaiwangms
2023-12-13 19:49:06 +00:00
suo
926236305f [sigmoid] fix for FX tracing unflattened modules (#115708)
Differential Revision: [D52095387](https://our.internmc.facebook.com/intern/diff/D52095387/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115708
Approved by: https://github.com/zhxchen17
2023-12-13 19:43:46 +00:00
75d3bbaaa2 Fix cudagraph check message (#115664)
This error message is printed when CUDAGraph trees are used with multiple device indices.

However, the message seems to say the opposite.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115664
Approved by: https://github.com/soulitzer
2023-12-13 18:44:43 +00:00
42390a097b [inductor] Do variance calculation in opmath type (#115181)
Fixes #114903

Previously large split variance reductions stored the intermediates as float16
precision, which may lead to overflow as the intermediate result is
unnormalized.

In #114903 we see two different `num_split` decisions made based on the
hardware capabilities, one of which has large enough intermediates to cause
overflows.
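
A tiny eager-mode illustration of the overflow mode described above (shape and values are assumptions, not the Inductor kernel itself):

```python
import torch

x = torch.full((100_000,), 4.0, dtype=torch.float16)
# the unnormalized sum of squares (1.6e6) exceeds the fp16 max (~65504) -> inf
print((x * x).sum(dtype=torch.float16))  # tensor(inf, dtype=torch.float16)
# accumulating in a wider "opmath" type keeps the intermediate finite
print((x * x).sum(dtype=torch.float32))  # tensor(1600000.)
```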

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115181
Approved by: https://github.com/shunting314
2023-12-13 18:40:44 +00:00
95de4f5764 add sm80orlater check to test_sdpa (#115702)
test_sdpa and test_sdpa2 in test_aot_inductor.py use bfloat16, which is not supported on sm < 80, so skip the tests if sm < 80.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115702
Approved by: https://github.com/soulitzer
2023-12-13 18:21:32 +00:00
caddcf9de5 Fix lint error in aten/src/ATen/native/cuda/CUDALoops.cuh (#115616)
Fix lint error in `aten/src/ATen/native/cuda/CUDALoops.cuh`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115616
Approved by: https://github.com/soulitzer
2023-12-13 18:13:00 +00:00
afa62d6237 [nccl-pg] Pass group global rank information to NCCL PG (#114736)
We were only passing a subset of the group creation information to the NCCL PG. We were specifically missing the information on which global ranks belong to a particular PG.

This allows the NCCL PG to use this additional information for things like better trace logging.

Test Plan:

OSS CI

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114736
Approved by: https://github.com/kwen2501
2023-12-13 18:02:51 +00:00
193f87857e [BC breaking] Remove check_sparse_nnz argument of gradcheck (#115658)
As in the title, per the deprecation plan.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115658
Approved by: https://github.com/cpuhrsch, https://github.com/soulitzer
2023-12-13 17:34:30 +00:00
310f6ab11a [fsdp] Replace acc_grad hooking with register_post_accumulate_grad_hook on flat_param (#112184)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112184
Approved by: https://github.com/albanD
ghstack dependencies: #115315
2023-12-13 16:24:44 +00:00
97888725c5 [Export] Test non-strict mode on existing test cases (#115399)
Summary:
Dynamo test methodology provides a good example of patching various treatments onto the same set of test cases. A pitfall is the global config that could be easily modified somewhere. Here we change the behavior of the export API through hijacking it with self-defined code.

To support the non-strict test suite, `strict=False` is explicitly passed into the export API when it's called with or without the strict arg.

Test Plan:
python test/export/test_export_nonstrict.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115399
Approved by: https://github.com/zhxchen17, https://github.com/tugsbayasgalan
2023-12-13 16:01:06 +00:00
66a76516bf [ROCm] Disabling Kernel Asserts for ROCm by default - fix and clean up and refactoring (#114660)
Related to #103973  #110532 #108404 #94891

**Context:**
As commented in 6ae0554d11/cmake/Dependencies.cmake (L1198)
Kernel asserts are enabled by default for CUDA and disabled for ROCm.
However, it was somewhat broken, and kernel asserts were still enabled for ROCm.

Disabling kernel asserts is also needed for users who do not have PCIe atomics support. These community users have verified that disabling kernel asserts on the PyTorch/ROCm platform fixed their PyTorch workflows, such as a torch.sum script and stable-diffusion (see the related issues).

**Changes:**

This pull request serves the following purposes:
* Refactor and clean up the logic, making it simpler for ROCm to enable and disable kernel asserts
* Fix the bug where kernel asserts for ROCm were not disabled by default.

Specifically,
- Renamed `TORCH_DISABLE_GPU_ASSERTS` to `C10_USE_ROCM_KERNEL_ASSERT` for the following reasons:
(1) This variable only applies to ROCm.
(2) The new name is more aligned with the `#define CUDA_KERNEL_ASSERT` function.
(3) With USE_ in front of the name, we can easily control it with environment variable to turn on and off this feature during build (e.g. `USE_ROCM_KERNEL_ASSERT=1 python setup.py develop` will enable kernel assert for ROCm build).
- Get rid of `ROCM_FORCE_ENABLE_GPU_ASSERTS` to simplify the logic and make it easier to understand and maintain
- Added `#cmakedefine` to carry over the CMake variable to C++

**Tests:**
(1) build with default mode and verify that USE_ROCM_KERNEL_ASSERT  is OFF(0), and kernel assert is disabled:

```
python setup.py develop
```
Verify CMakeCache.txt has correct value.
```
/xxxx/pytorch/build$ grep USE_ROCM_KERNEL_ASSERT CMakeCache.txt
USE_ROCM_KERNEL_ASSERT:BOOL=0
```
Tested the following code in ROCm and CUDA builds, expecting different return codes.

```
subprocess.call([sys.executable, '-c', "import torch;torch._assert_async(torch.tensor(0,device='cuda'));torch.cuda.synchronize()"])
```
This piece of code is adapted from the unit test below to get around the fact that this unit test is now skipped for ROCm. (We will look into enabling this unit test in the future.)

```
python test/test_cuda_expandable_segments.py -k test_fixed_cuda_assert_async
```

Ran the following script, expecting r == 0 since CUDA_KERNEL_ASSERT is defined as nothing:
```
>> import sys
>>> import subprocess
>>> r=subprocess.call([sys.executable, '-c', "import torch;torch._assert_async(torch.tensor(0,device='cuda'));torch.cuda.synchronize()"])
>>> r
0
```

(2) Enable the kernel assert by building with USE_ROCM_KERNEL_ASSERT=1, or USE_ROCM_KERNEL_ASSERT=ON
```
USE_ROCM_KERNEL_ASSERT=1 python setup.py develop
```

Verify `USE_ROCM_KERNEL_ASSERT` is `1`
```
/xxxx/pytorch/build$ grep USE_ROCM_KERNEL_ASSERT CMakeCache.txt
USE_ROCM_KERNEL_ASSERT:BOOL=1
```

Run the assert test, and expected return code not equal to 0.

```
>> import sys
>>> import subprocess
>>> r=subprocess.call([sys.executable, '-c', "import torch;torch._assert_async(torch.tensor(0,device='cuda'));torch.cuda.synchronize()"])
>>>/xxxx/pytorch/aten/src/ATen/native/hip/TensorCompare.hip:108: _assert_async_cuda_kernel: Device-side assertion `input[0] != 0' failed.
:0:rocdevice.cpp            :2690: 2435301199202 us: [pid:206019 tid:0x7f6cf0a77700] Callback: Queue 0x7f64e8400000 aborting with error : HSA_STATUS_ERROR_EXCEPTION: An HSAIL operation resulted in a hardware exception. code: 0x1016

>>> r
-6
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114660
Approved by: https://github.com/jeffdaily, https://github.com/malfet, https://github.com/jithunnair-amd
2023-12-13 15:44:53 +00:00
fb80f05ee2 [inductor] Fix angle decomposition return type (#115700)
The current decomposition always returns float32 when the input isn't complex.
Instead, we should do proper type promotion.
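
A small eager-mode reference for the promotion behavior the decomposition should match (illustrative only):

```python
import torch

print(torch.angle(torch.tensor([1.0], dtype=torch.float64)).dtype)  # torch.float64, not float32
print(torch.angle(torch.tensor([1 + 1j])).dtype)                     # torch.float32 (complex64 -> float32)
```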

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115700
Approved by: https://github.com/lezcano
ghstack dependencies: #115677, #115699
2023-12-13 14:16:31 +00:00
9cdc80d581 [inductor] Fix torch.bernoulli decomposition return type (#115699)
Strangely enough, `torch.bernoulli` doesn't return a boolean and instead
it matches the output type of the inplace bernoulli.
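
A quick eager check of the behavior described above (illustrative only):

```python
import torch

probs = torch.rand(3, dtype=torch.float64)
print(torch.bernoulli(probs).dtype)   # torch.float64, not torch.bool
out = torch.empty(3, dtype=torch.float16)
print(out.bernoulli_(0.5).dtype)      # torch.float16 (in-place keeps its dtype)
```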

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115699
Approved by: https://github.com/lezcano
ghstack dependencies: #115677
2023-12-13 14:16:31 +00:00
0e0dd8f985 [dynamo][BE] Move torchvision import inside of test_multi_import (#115677)
Currently this skip imports torchvision, so if your torchvision install is broken then the entire file fails at collection time. With this change, only the test itself will fail.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115677
Approved by: https://github.com/lezcano
2023-12-13 14:16:31 +00:00
3807fc690f [OSSCI oncall] fix lint (#115737)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115737
Approved by: https://github.com/DanilBaibak
2023-12-13 14:15:26 +00:00
0870afb85c Revert "[Export] Test non-strict mode on existing test cases (#115399)"
This reverts commit 2411a92e9d9f90e2db3cde9190e1301bd02cb221.

Reverted https://github.com/pytorch/pytorch/pull/115399 on behalf of https://github.com/atalman due to OSSCI oncall, broke CI tests ([comment](https://github.com/pytorch/pytorch/pull/115399#issuecomment-1853869965))
2023-12-13 12:59:09 +00:00
bda6f02343 Revert "[Export] Support retraceability test on existing cases (#115402)"
This reverts commit b0c7dd47cdb8d17bbfd0ab2963b1afb908dab716.

Reverted https://github.com/pytorch/pytorch/pull/115402 on behalf of https://github.com/atalman due to OSSCI oncall, broke CI tests ([comment](https://github.com/pytorch/pytorch/pull/115402#issuecomment-1853864075))
2023-12-13 12:55:07 +00:00
3b87681ddc Revert "[Export] Support ser/des test on existing cases (#115413)"
This reverts commit 47443591631ebb80a84487bbdab3233e0077941d.

Reverted https://github.com/pytorch/pytorch/pull/115413 on behalf of https://github.com/atalman due to OSSCI oncall, broke CI tests ([comment](https://github.com/pytorch/pytorch/pull/115413#issuecomment-1853859443))
2023-12-13 12:51:34 +00:00
f9cf6ae889 [PyTorch] AOTI: add minimal arrayref interface (#112800)
This implements an optional alternate interface to the AOTI
generated DSO, intended to increase efficiency for models running on
CPU and requiring minimal overhead. See comment in config.py for more
explanation.

This took a while to get right (e.g., I initially required 1-D
MiniArrayRef<T> for the inputs, but found that multi-dimensional
ArrayRefTensor<T> ended up simplifying the implementation and allowed
test_aot_inductor.py to run) and is somewhat intricate, so I am
anticipating that review will require some back-and-forth.

Differential Revision: [D50699890](https://our.internmc.facebook.com/intern/diff/D50699890/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D50699890/)!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112800
Approved by: https://github.com/chenyang78
2023-12-13 12:06:35 +00:00
331128b444 [c10] signal_handler: atomically exchange the signal count to fix data race in ExecuteStepRecursive() (#115510)
Summary:
`CheckForSignals()` can be called from multiple threads concurrently, e.g. from within `ExecuteStepRecursive()`. This means that `my_sigint_count_` and `my_sighup_count_` can be written concurrently, causing data races.

To fix, use atomic exchange which writes the new value and returns the old value in one atomic operation.

Test Plan: Running TSAN tests that failed before and now pass

Differential Revision: D52018963

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115510
Approved by: https://github.com/malfet
2023-12-13 12:06:06 +00:00
50db2aa70a [funcol][BE] Apply ufmt to _functional_collectives.py and turn on lintrunner for functional_collective (#115648)
No logic change, just formatting.

Differential Revision: [D51857236](https://our.internmc.facebook.com/intern/diff/D51857236/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115648
Approved by: https://github.com/wconstab, https://github.com/wz337
ghstack dependencies: #115523, #115302
2023-12-13 11:19:29 +00:00
db8d409d08 [DCP][BE] Apply ufmt to DCP and turn on lintrunner for DCP (#115302)
No logic change. Just typing and ufmt.

Differential Revision: [D51914982](https://our.internmc.facebook.com/intern/diff/D51914982/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115302
Approved by: https://github.com/XilunWu, https://github.com/wz337, https://github.com/LucasLLC
ghstack dependencies: #115523
2023-12-13 10:32:36 +00:00
cc28f61fa3 [DCP][BE] Move DCP._state_dict_utils out from DCP (#115523)
DCP._state_dict_utils is also used by FSDP. This can sometimes cause a circular import. Move it out of DCP to avoid circular imports.

Differential Revision: [D52022440](https://our.internmc.facebook.com/intern/diff/D52022440/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115523
Approved by: https://github.com/wz337
2023-12-13 08:59:48 +00:00
1500379b6d [MPS] Enable torch.rand[n] for complex types (#115514)
Test plan:
```
% python -c "import torch;print(torch.rand(3, 3, dtype=torch.chalf, device='mps'))"
tensor([[0.4639+0.8350j, 0.0479+0.1650j, 0.2510+0.9551j],
        [0.4746+0.3984j, 0.1484+0.8242j, 0.0098+0.7129j],
        [0.7979+0.6162j, 0.7188+0.9580j, 0.5186+0.2559j]], device='mps:0',
       dtype=torch.complex32)
% python3 -c "import torch; x=torch.randn(1000000, dtype=torch.cfloat, device='mps'); print((x-x.mean()).abs().pow(2).div(x.numel()-1).sum().sqrt())"
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115514
Approved by: https://github.com/lezcano
ghstack dependencies: #115512, #115513, #115554
2023-12-13 07:30:56 +00:00
4744359163 [Export] Support ser/des test on existing cases (#115413)
Summary:
Similar to #115399

Test Plan:
```
$ python test/export/test_serdes.py
...
Ran 72 tests in 29.097s

OK (expected failures=13)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115413
Approved by: https://github.com/tugsbayasgalan
ghstack dependencies: #115399, #115402
2023-12-13 06:01:17 +00:00
b0c7dd47cd [Export] Support retraceability test on existing cases (#115402)
Summary:
Similar to #115399

Test Plan:
python test/export/test_retraceability.py

FAILED (failures=6, errors=8, expected failures=7)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115402
Approved by: https://github.com/tugsbayasgalan
ghstack dependencies: #115399
2023-12-13 06:01:17 +00:00
2411a92e9d [Export] Test non-strict mode on existing test cases (#115399)
Summary:
Dynamo test methodology provides a good example of patching various treatments onto the same set of test cases. A pitfall is the global config that could be easily modified somewhere. Here we change the behavior of the export API through hijacking it with self-defined code.

To support the non-strict test suite, `strict=False` is explicitly passed into the export API when it's called with or without the strict arg.

Test Plan:
python test/export/test_export_nonstrict.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115399
Approved by: https://github.com/zhxchen17, https://github.com/tugsbayasgalan
2023-12-13 06:01:17 +00:00
dd42201cb8 [export] Preserve FQN in export_to_torch_ir (#115462)
AOTInductor currently relies on export_to_torch_ir to generate a graph, and passes it to Inductor to generate the .so. They would like the FQNs to be consistent so that they can easily find/update the weights in the .so.

Note that since export flattens all modules into a single computational graph, we will change the FQNs in the original module by replacing all periods with underscores. For example, `foo.child1param`, which points to the parameter `child1param` of a submodule named `foo`, will be renamed to `foo_child1param` since we no longer have the submodule `foo`. This is done just by doing `name.replace(".", "_")`.
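
A minimal sketch of the renaming rule described above (hypothetical helper name):

```python
def flatten_fqn(fqn: str) -> str:
    # export flattens submodules, so dots in the original FQN become underscores
    return fqn.replace(".", "_")

assert flatten_fqn("foo.child1param") == "foo_child1param"
```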

Outputted AOTInductor c++ code: https://www.internalfb.com/phabricator/paste/view/P900120950?lines=377-355%2C354

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115462
Approved by: https://github.com/tugsbayasgalan
2023-12-13 04:58:47 +00:00
0dad85b402 [Dynamo] Fix torch.tensor call with tuple (#115713)
Land #114383 on behalf of @ezyang since he is on recharge and this is an high priority issue.
Fix #114231

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115713
Approved by: https://github.com/angelayi, https://github.com/voznesenskym
2023-12-13 04:08:12 +00:00
38101e349e [usdt][torch] Sample dispatch operator integration (#115593)
Summary:
By default the instruction at the USDT is a nop; when the tracepoint is attached (e.g. through bpftrace), the code inside the semaphore check is executed. Thus there should be no performance impact as long as the USDT is not attached from the tracepoint execution code itself; however, the semaphore check itself, `TORCH_SDT_IS_ENABLED`, will incur the cost of a `read_global_volatile` operation.

See https://github.com/dtrace4linux/linux/blob/master/doc/usdt.html for more info.

Test Plan:
```
buck2  build  mode/opt caffe2/torch/fb/observers:strobelight_observer_runner --show-full-output
```

```
/data/users/rihams/fbsource/buck-out/v2/gen/fbcode/0bc8cf217a8cf352/caffe2/torch/fb/observers/__strobelight_observer_runner__/strobelight_observer_runner
```

```
sudo bpftrace -e 'usdt:/data/users/rihams/fbsource/buck-out/v2/gen/fbcode/6081734815403318/caffe2/torch/fb/observers/__strobelight_observer_runner__/strobelight_observer_runner:pytorch:operator_* { printf("%s --> %s\n", probe, str(arg0)); }' -v

usdt:<path>strobelight_observer_runner:pytorch:operator_start --> aten::empty_strided
usdt:<path>strobelight_observer_runner:pytorch:operator_end --> aten::empty_strided
usdt:<path>strobelight_observer_runner:pytorch:operator_end --> aten::empty_like
usdt:<path>strobelight_observer_runner:pytorch:operator_start --> aten::fill_
usdt:<path>strobelight_observer_runner:pytorch:operator_end --> aten::fill_
usdt:<path>strobelight_observer_runner:pytorch:operator_end --> aten::ones_like
usdt:<path>strobelight_observer_runner:pytorch:operator_start --> aten::mul
usdt:<path>strobelight_observer_runner:pytorch:operator_end --> aten::mul
usdt:<path>strobelight_observer_runner:pytorch:operator_start --> aten::add
usdt:<path>strobelight_observer_runner:pytorch:operator_end --> aten::add
usdt:<path>strobelight_observer_runner:pytorch:operator_start --> aten::detach
usdt:<path>strobelight_observer_runner:pytorch:operator_end --> aten::detach
usdt:<path>strobelight_observer_runner:pytorch:operator_start --> aten::randn
usdt:<path>strobelight_observer_runner:pytorch:operator_start --> aten::empty
usdt:<path>strobelight_observer_runner:pytorch:operator_end --> aten::empty
usdt:<path>strobelight_observer_runner:pytorch:operator_start --> aten::normal_
usdt:<path>strobelight_observer_runner:pytorch:operator_end --> aten::normal_
usdt:<path>strobelight_observer_runner:pytorch:operator_end --> aten::randn
usdt:<path>strobelight_observer_runner:pytorch:operator_start --> aten::to
usdt:<path>strobelight_observer_runner:pytorch:operator_start --> aten::_to_copy
usdt:<path>strobelight_observer_runner:pytorch:operator_start --> aten::empty_strided
usdt:<path>strobelight_observer_runner:pytorch:operator_end --> aten::empty_strided

```

Differential Revision: D44636587

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115593
Approved by: https://github.com/malfet
2023-12-13 02:41:48 +00:00
17c104ac18 [export] Do not copy state_dict in run_decomp (#115269)
Fixes https://github.com/pytorch/pytorch/issues/114628

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115269
Approved by: https://github.com/thiagocrepaldi, https://github.com/ydwu4
2023-12-13 01:21:21 +00:00
99554112d3 [pytorch] add namespace for optTypeMetaToScalarType in codegen to avoid not declared when compile (#115623)
Fixes compilation failure in some environments.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115623
Approved by: https://github.com/albanD
2023-12-13 00:59:01 +00:00
1392843e7b [inductor] make sure bitcast input and target type have the same bitwidth (#115619)
This PR fixes #104791.

bitcast requires the source and target to have the same bitwidth. Because the input tensor's dtype could be promoted, e.g. from float16 to float, we have to cast the tensor to its original source dtype before invoking bitcast in such cases. After that, we also need to convert the bit-casted tensor back to float to make sure we keep using higher-precision values for the rest of the computation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115619
Approved by: https://github.com/jansel, https://github.com/eellison
2023-12-13 00:53:04 +00:00
469d6d45fe [BE] Bye bye, CircleCI (#115701)
In PyTorch, a change we now see,
CircleCI's gone, set it free.
With commits and a push,
No more waiting in hush,
For a simpler CI spree!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115701
Approved by: https://github.com/PaliC, https://github.com/suo, https://github.com/seemethere
2023-12-13 00:26:49 +00:00
76ced0df03 Consider storage_changed for assigning alias_of_input in aot_autograd when computing differentiable outputs that alias each other (#115315)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115315
Approved by: https://github.com/bdhirsh
2023-12-12 23:21:58 +00:00
946de1cf4c [export][fix] Add back export strict argument (#115668)
Summary:
\#115556 omitted strict argument, which is necessary for non-strict mode
dev.

Test Plan:
python test/export/test_export.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115668
Approved by: https://github.com/tugsbayasgalan, https://github.com/angelayi
2023-12-12 22:59:10 +00:00
48ed165380 [FSDP][state_dict] Create a FSDP/EP unittest (#115567)
As title

Differential Revision: [D52043394](https://our.internmc.facebook.com/intern/diff/D52043394/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115567
Approved by: https://github.com/wz337, https://github.com/LucasLLC
2023-12-12 22:48:11 +00:00
639060cb0b Use get_mkldnn_enabled for decompositions (#115448)
`torch._C.has_mkldnn` does not respect cases where users try to disable mkldnn using `torch._C._set_mkldnn_enabled()`. This is relevant to edge use cases, where they do not want decompositions to go to the ATen opset, and do not want the mkldnn operator to appear in the graph.
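
An illustrative eager check of the two flags being distinguished here (assumes an mkldnn-enabled build; uses the private `torch._C` hooks mentioned above):

```python
import torch

torch._C._set_mkldnn_enabled(False)
print(torch._C.has_mkldnn)             # True: build-time capability, never changes
print(torch._C._get_mkldnn_enabled())  # False: runtime toggle the decompositions should respect
torch._C._set_mkldnn_enabled(True)     # restore the default
```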
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115448
Approved by: https://github.com/jgong5, https://github.com/ydwu4
2023-12-12 22:42:51 +00:00
f78f23d753 [export] Turn off output value from sources for export. (#115442)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115442
Approved by: https://github.com/tugsbayasgalan
2023-12-12 22:41:23 +00:00
af09fe256a [Inductor] Implement a deduplist data structure for name to user tracking (#115609)
Summary:
An internal MRS model was taking over a day's worth of time to compile due to many duplicates in dependency tracking. This PR replaces the list with a custom dedup list.
Normally one could use a set/dict for this purpose; however, the list in question gets elements appended as it is being iterated over, which means that we need to keep list semantics.
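
A minimal sketch of such a structure (assumed, not Inductor's actual implementation): it keeps list semantics, including visiting elements appended during iteration, while skipping duplicate inserts.

```python
class DedupList:
    def __init__(self):
        self._items = []
        self._seen = set()

    def append(self, item) -> None:
        # O(1) membership test instead of scanning the list
        if item not in self._seen:
            self._seen.add(item)
            self._items.append(item)

    def __iter__(self):
        # index-based iteration so items appended mid-iteration are still visited
        i = 0
        while i < len(self._items):
            yield self._items[i]
            i += 1

    def __len__(self) -> int:
        return len(self._items)
```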

Test Plan: ad hoc testing

Differential Revision: D52060659

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115609
Approved by: https://github.com/jansel
2023-12-12 22:28:30 +00:00
ffb2a28a67 Fixes expected behavior when no_dist=True in state_dict_loader.load (#115660)
Fixes expected behavior when `no_dist=True` in `state_dict_loader.load`

Fixes #115591

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115660
Approved by: https://github.com/wz337, https://github.com/fegin
2023-12-12 22:21:16 +00:00
f138b08d2e Migrated loss functions to ModuleInfos (#115584)
Migrates most tests in `common_nn.py:criterion_tests` to ModuleInfos.

**I can split this up if it is too large to review**

What this PR does not include:
- [`no_batch_dim` tests](https://github.com/pytorch/pytorch/blob/main/torch/testing/_internal/common_nn.py#L3995-L4112)
- [tests that use the functional variant of the loss function and `wrap_functional`](https://github.com/pytorch/pytorch/blob/main/torch/testing/_internal/common_nn.py#L1079-L1128)

#### On test times
This PR increases test time by ~58s locally
Before this PR:
```
>>> python test/test_nn.py -k Loss
Ran 1003 tests in 28.977s
```
After this PR
```
>>> python test/test_nn.py -k Loss
Ran 368 tests in 23.073s
```

```
>>> python test/test_modules.py -k Loss
Ran 836 tests in 63.900s
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115584
Approved by: https://github.com/janeyx99
ghstack dependencies: #115617
2023-12-12 22:20:20 +00:00
1becd2c314 Align checks in _use_cudnn_ctc_loss with those in _cudnn_ctc_loss (#115617)
This PR is intended to fix the following problem:

When using `CTCLoss`, there is a cudnn path gated by a call to [`_use_cudnn_ctc_loss`](
e918461377/aten/src/ATen/native/cudnn/LossCTC.cpp (L73-L101)) which checks some conditions

e918461377/aten/src/ATen/native/LossCTC.cpp (L486-L496)

However, there are more checks in `_cudnn_ctc_loss`
e918461377/aten/src/ATen/native/cudnn/LossCTC.cpp (L122-L130)

some of which are not present in `_use_cudnn_ctc_loss` (e.g. the check that `targets` is on CPU, which will cause a RuntimeError after dispatching to `_cudnn_ctc_loss`). Instead, these checks should be in `_use_cudnn_ctc_loss` so that the normal `_ctc_loss` path will be used if the checks are not met.

e.g. Before this PR

```python
>>> import torch
>>> ctcloss = torch.nn.CTCLoss()
>>> log_probs = torch.randn((50, 3, 15), device='cuda').log_softmax(2)
>>> target = torch.randint(1, 15, (30 + 25 + 20,), dtype = torch.int)
>>> input_lengths = torch.tensor((50, 50, 50), device='cuda')
>>> target_lengths = torch.tensor((30, 25, 20), device='cuda')
>>> ctcloss(log_probs, target, input_lengths, target_lengths)
tensor(4.1172, device='cuda:0')
>>> target = target.to('cuda')
>>> ctcloss(log_probs, target, input_lengths, target_lengths)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/data/users/mg1998/pytorch/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/users/mg1998/pytorch/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/users/mg1998/pytorch/torch/nn/modules/loss.py", line 1779, in forward
    return F.ctc_loss(log_probs, targets, input_lengths, target_lengths, self.blank, self.reduction,
  File "/data/users/mg1998/pytorch/torch/nn/functional.py", line 2660, in ctc_loss
    return torch.ctc_loss(
RuntimeError: Expected tensor to have CPU Backend, but got tensor with CUDA Backend (while checking arguments for cudnn_ctc_loss)
```

After this PR the above snippet runs without error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115617
Approved by: https://github.com/janeyx99
2023-12-12 22:20:20 +00:00
c3ed9f65a0 Revert "[8/n] Update XNNPACK Version Part 8 Everything Remaining to get it to work (#115587)"
This reverts commit a8dc9d8e353ddcf7db0247349a3acd0dd37fcc6f.

Reverted https://github.com/pytorch/pytorch/pull/115587 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/115587#issuecomment-1852835898))
2023-12-12 21:28:09 +00:00
ac4f6beb00 [Dynamo] Make resume function name more explicit by adding lineno (#115608)
Adding lineno to resume function name for easy aggregation in Scuba table.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115608
Approved by: https://github.com/jansel, https://github.com/williamwen42
2023-12-12 21:08:41 +00:00
40ce9a4cfb [c10d] Create a python c10d API _set_pg_timeout to set timeout (#115453)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115453
Approved by: https://github.com/wconstab, https://github.com/H-Huang
2023-12-12 20:52:43 +00:00
8a58af2a9f [Reland][HigherOrderOp] make MapHigherOrder create map_impl (#115561)
This is a reland of #115205, which gets reverted due to internal test failure.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115561
Approved by: https://github.com/angelayi
2023-12-12 20:45:01 +00:00
8739d1e3f9 Fix a fast mode gradcheck bug where specified eps argument is ignored when switching to slow mode (#115634)
As in the title.

The reproducer for the bug is as follows:
```python
>>> import torch
>>> dtype = torch.bfloat16
>>> D1 = torch.tensor([[1, 2], [3, 4]], dtype=dtype, requires_grad=True)
>>> D2 = torch.tensor([[1, 2], [3, 4]], dtype=dtype, requires_grad=True)
>>> torch.autograd.gradcheck(torch.mm, (D1, D2), fast_mode=True)
```

<details>

```
torch.autograd.gradcheck.GradcheckError: Jacobian mismatch for output 0 with respect to input 0,
numerical:tensor(0., dtype=torch.bfloat16)
analytical:tensor(4.9062, dtype=torch.bfloat16)

The above quantities relating the numerical and analytical jacobians are computed
in fast mode. See: https://github.com/pytorch/pytorch/issues/53876 for more background
about fast mode. Below, we recompute numerical and analytical jacobians in slow mode:

Numerical:
 tensor([[0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.]], dtype=torch.bfloat16)
Analytical:
tensor([[1., 2., 0., 0.],
        [3., 4., 0., 0.],
        [0., 0., 1., 2.],
        [0., 0., 3., 4.]], dtype=torch.bfloat16)

```
</details>

```
The max per-element difference (slow mode) is: 4.0.
```

```python
>>> torch.autograd.gradcheck(torch.mm, (D1, D2), fast_mode=True, eps=1e-1)
```

<details>

```
<snip>
torch.autograd.gradcheck.GradcheckError: Jacobian mismatch for output 0 with respect to input 0,
numerical:tensor(5., dtype=torch.bfloat16)
analytical:tensor(4.9062, dtype=torch.bfloat16)

The above quantities relating the numerical and analytical jacobians are computed
in fast mode. See: https://github.com/pytorch/pytorch/issues/53876 for more background
about fast mode. Below, we recompute numerical and analytical jacobians in slow mode:

Numerical:
 tensor([[0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.]], dtype=torch.bfloat16)
Analytical:
tensor([[1., 2., 0., 0.],
        [3., 4., 0., 0.],
        [0., 0., 1., 2.],
        [0., 0., 3., 4.]], dtype=torch.bfloat16)
```

</details>

```
The max per-element difference (slow mode) is: 4.0.
```

Notice that changing `eps` value has no effect to max per-element difference.

With this PR, increasing `eps` value will lead to sensible results in numerical jacobian:
```python
>>> torch.autograd.gradcheck(torch.mm, (D1, D2), fast_mode=True, eps=1e-1)
```

<details>

```
<snip>
torch.autograd.gradcheck.GradcheckError: Jacobian mismatch for output 0 with respect to input 0,
numerical:tensor(5., dtype=torch.bfloat16)
analytical:tensor(4.9062, dtype=torch.bfloat16)

The above quantities relating the numerical and analytical jacobians are computed
in fast mode. See: https://github.com/pytorch/pytorch/issues/53876 for more background
about fast mode. Below, we recompute numerical and analytical jacobians in slow mode:

Numerical:
 tensor([[0.9375, 1.8750, 0.0000, 0.0000],
        [2.9688, 3.7500, 0.0000, 0.0000],
        [0.0000, 0.0000, 1.2500, 2.5000],
        [0.0000, 0.0000, 2.5000, 3.7500]], dtype=torch.bfloat16)
Analytical:
tensor([[1., 2., 0., 0.],
        [3., 4., 0., 0.],
        [0., 0., 1., 2.],
        [0., 0., 3., 4.]], dtype=torch.bfloat16)
```

</details>

```
The max per-element difference (slow mode) is: 0.5.
```

Finally:
```python
>>> torch.autograd.gradcheck(torch.mm, (D1, D2), fast_mode=True, eps=1e-1, atol=1)
True
```
that would fail with the current main branch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115634
Approved by: https://github.com/lezcano, https://github.com/soulitzer, https://github.com/albanD
ghstack dependencies: #115536
2023-12-12 20:00:56 +00:00
75ab294eb5 Enable builtin tests for ONNX Export with ExportedProgram models (#114762)
Fixed by https://github.com/pytorch/pytorch/pull/113982
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114762
Approved by: https://github.com/BowenBao
2023-12-12 19:50:06 +00:00
d954ef208f [DCP][state_dict] DCP state_dict cannot correctly find FQN when the leaf module is wrapped by FSDP (#115592)
Summary: The original logic has an incorrect assumption that there is at least one object name left when traversing the module tree. This is not correct when the leaf module is wrapped by FSDP.

Test Plan: CI

Differential Revision: D52049293

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115592
Approved by: https://github.com/wz337
2023-12-12 19:22:23 +00:00
0ff155fb65 Fix SDPA for SAM (#115636)
Addresses the regression for Segment Anything Fast in https://github.com/pytorch-labs/segment-anything-fast/issues/99
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115636
Approved by: https://github.com/soulitzer, https://github.com/ani300
2023-12-12 18:52:38 +00:00
8885128dcc Fix backward for SDPA NT jagged layout (#115576)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115576
Approved by: https://github.com/jbschlosser, https://github.com/ani300
2023-12-12 18:35:40 +00:00
7553c49514 [S382174] Fix distributed debug w/ non-equal split (#115483)
Summary:
In collectives, it's possible to have a non-equal split, which has a different implementation and produces output tensors of different sizes, e.g. https://www.internalfb.com/code/fbsource/[460afb1172b5]/fbcode/caffe2/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp?lines=3104. However, TORCH_DISTRIBUTED_DEBUG=DETAIL assumes the output tensor sizes are the same, performs the check, and fails the job if they don't match: https://fburl.com/code/mhte9ty8. The c10d code should handle this.

Ideally we should check the input size across ranks and make sure they're the same. Maybe for next diff.

Test Plan: Test torchrec's TWRW w/ non-even split and it's working now.

Reviewed By: zhangruiskyline

Differential Revision: D52010942

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115483
Approved by: https://github.com/kwen2501, https://github.com/fegin, https://github.com/XilunWu
2023-12-12 18:02:05 +00:00
d521857411 Terminate handler (#101332)
Fixes #50051.
This PR is based on #50320 and addresses the remaining feedback.
On Windows it is enabled by default. Can be enabled or disabled via USE_CUSTOM_TERMINATE env variable.

This PR adds support for overriding the terminate handler in order to log uncaught exceptions in threads.
If an exception is thrown and not caught, it will print `<Unhandled exception caught in c10/util/AbortHandler.h>`.
The point of doing this is that in issue #50051, exceptions were thrown but not logged. With this logging system, such issues will be easier to debug in the future.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101332
Approved by: https://github.com/albanD, https://github.com/malfet
2023-12-12 17:55:27 +00:00
36b5136270 [inductor] Don't print disable_cudagraphs_reason when cudagraphs is disabled (#115489)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115489
Approved by: https://github.com/yanboliang
2023-12-12 17:50:18 +00:00
670eb83573 Enable test_sparse_addmm for crossref tests (#115536)
Fixes https://github.com/pytorch/pytorch/issues/97284

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115536
Approved by: https://github.com/cpuhrsch
2023-12-12 17:26:40 +00:00
a8dc9d8e35 [8/n] Update XNNPACK Version Part 8 Everything Remaining to get it to work (#115587)
> **__Note:__** The XNNPACK upgrade is very large, on the order of **40k** files and **10M** lines of code, thus we break the update of the library into multiple parts. All parts [1 - 6/n] must be landed together for it to work. ***This also means that if there is a revert, please revert the entire stack.***

This change is everything remaining requiring XNNPACK version to work.

Differential Revision: [D52044420](https://our.internmc.facebook.com/intern/diff/D52044420/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115587
Approved by: https://github.com/digantdesai
2023-12-12 17:17:19 +00:00
e918461377 Add instructions for generating optimal Triton kernel parameters of bsr_dense_addmm (#115504)
As in the title.

In addition, enable verbose output when executing the torch/sparse/_triton_ops_meta.py script.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115504
Approved by: https://github.com/cpuhrsch
ghstack dependencies: #115499
2023-12-12 16:44:51 +00:00
32286512cc Add tune_bsr_dense_addmm as an API to find optimal triton kernel parameters for bsr_dense_addmm (#115499)
As in the title.

In addition:
- improve the algorithm for finding a minimum of the operation timings: break the inner loop early when the next minimum candidate is found
- add tests and fix bugs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115499
Approved by: https://github.com/cpuhrsch
2023-12-12 16:44:51 +00:00
40dc0580a6 [inductor] De-duplicate triton helper functions (#115546)
Previously if two calls to cumsum were generated in the same triton kernel
we would generate identical helper functions with different names. Now this
recognizes identical functions and only defines it once. To do this I defer
choosing the name until after codegen.
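
A sketch of the dedup idea (assumed, not the actual Inductor code): helper definitions are keyed by their body, and the name is only assigned once codegen knows which bodies are distinct.

```python
class HelperRegistry:
    def __init__(self):
        self._name_by_body = {}

    def register(self, body: str) -> str:
        # identical bodies share one definition; the name is chosen lazily
        if body not in self._name_by_body:
            self._name_by_body[body] = f"_triton_helper_fn{len(self._name_by_body)}"
        return self._name_by_body[body]

registry = HelperRegistry()
name_a = registry.register("return arg0 + arg1")
name_b = registry.register("return arg0 + arg1")
assert name_a == name_b  # a second identical cumsum reuses the first helper
```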

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115546
Approved by: https://github.com/lezcano
ghstack dependencies: #109132
2023-12-12 16:30:50 +00:00
02196c21ac [inductor] Parameterize ir.Scan on combine_fn (#109132)
This replaces `tl.cumsum` and `tl.cumprod` with calls to `tl.associative_scan`
where the combine function is generated from inductor IR.

So before we had:
```python
@triton.jit
def triton_(in_ptr0, out_ptr0, xnumel, rnumel, XBLOCK : tl.constexpr):
    xnumel = 20
    rnumel = 30
    RBLOCK: tl.constexpr = 32
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
    xmask = xindex < xnumel
    rindex = tl.arange(0, RBLOCK)[None, :]
    rmask = rindex < rnumel
    r1 = rindex
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (r1 + (30*x0)), rmask & xmask, other=0).to(tl.float32)
    tmp1 = tl.broadcast_to(tmp0, [XBLOCK, RBLOCK])
    tmp2 = tl.where(rmask & xmask, tmp1, 0)
    tmp3 = tl.cumsum(tmp2, 1)
    tl.store(out_ptr0 + (r1 + (30*x0)), tmp3, rmask & xmask)
```

Now we have:
```python
@triton.jit
def _triton_helper_fn0(arg0, arg1):
    tmp0 = tmp0 + tmp1
    return tmp0

@triton.jit
def triton_(in_ptr0, out_ptr0, xnumel, rnumel, XBLOCK : tl.constexpr):
    xnumel = 20
    rnumel = 30
    RBLOCK: tl.constexpr = 32
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
    xmask = xindex < xnumel
    rindex = tl.arange(0, RBLOCK)[None, :]
    rmask = rindex < rnumel
    r1 = rindex
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (r1 + (30*x0)), rmask & xmask, other=0).to(tl.float32)
    tmp1 = tl.broadcast_to(tmp0, [XBLOCK, RBLOCK])
    tmp2 = tl.where(rmask & xmask, tmp1, 0)
    tmp3 = tl.associative_scan(tmp2, 1, _triton_helper_fn0)
    tl.store(out_ptr0 + (r1 + (30*x0)), tmp3, rmask & xmask)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109132
Approved by: https://github.com/lezcano
2023-12-12 16:30:50 +00:00
d5286d7ea8 [export] Add canonical form for differentiating IR (#115589)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115589
Approved by: https://github.com/suo
2023-12-12 16:21:57 +00:00
de4b2e59a7 [PyTorch] AOTI: add more basic aoti_torch getters (#112799)
There was a lot of simple information about tensors we couldn't get. In particular, we didn't know the lengths of the arrays returned by sizes and strides.

Differential Revision: [D50949929](https://our.internmc.facebook.com/intern/diff/D50949929/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112799
Approved by: https://github.com/desertfire, https://github.com/aakhundov
ghstack dependencies: #112116, #112174, #112405, #112798
2023-12-12 15:56:33 +00:00
c5c4d81b1b Switched stale workflow to linux.large.arc (#115635)
Switched stale workflow to linux.large.arc
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115635
Approved by: https://github.com/jeanschmidt
2023-12-12 15:33:59 +00:00
4fafc36c33 [MPS] Fix sum and prod for complex types (#115554)
By not force-casting dtype to float

Test plan: `python -c "import torch;print(torch.linspace(-3.0, 3.0, 50, dtype=torch.cfloat, device='mps').sqrt().sin().sum())"`

Before:
```
tensor(21.1778+0.j, device='mps:0')
```
After
```
tensor(21.1778+39.1377j, device='mps:0')
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115554
Approved by: https://github.com/lezcano
ghstack dependencies: #115512, #115513
2023-12-12 15:04:45 +00:00
07f03b4a62 [MPS] Add support for MPSDataTypeComplexFloat[16|32] (#115513)
But limit it to macOS Sonoma+.

Before this change, calling `torch.cat` with complex types failed, but now it works.
Before:
```
% python -c "import torch;print(torch.cat([torch.rand(3, 3, dtype=torch.cfloat).to('mps'), torch.rand(3, 3, dtype=torch.cfloat).to('mps')]))"
TypeError: Trying to convert ComplexFloat to the MPS backend but it does not have support for that dtype.
```
After:
```
% python -c "import torch;print(torch.cat([torch.rand(3, 3, dtype=torch.cfloat).to('mps'), torch.rand(3, 3, dtype=torch.cfloat).to('mps')]))"
tensor([[0.4857+0.0030j, 0.9375+0.8630j, 0.3544+0.9911j],
        [0.5293+0.8652j, 0.8440+0.1991j, 0.5152+0.8276j],
        [0.0136+0.7469j, 0.1403+0.4761j, 0.2943+0.0896j],
        [0.6458+0.0035j, 0.3579+0.4577j, 0.1723+0.1508j],
        [0.4420+0.3554j, 0.4396+0.7272j, 0.2479+0.1191j],
        [0.3895+0.2292j, 0.7886+0.1613j, 0.9243+0.4180j]], device='mps:0')
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115513
Approved by: https://github.com/kulinseth
ghstack dependencies: #115512
2023-12-12 15:04:45 +00:00
21cf6e76c2 Revert "Use linux.large.arc for stale workflow (#115440)"
This reverts commit dadb3694ffaa2a0bfe78516c294a46566430c1ad.

Reverted https://github.com/pytorch/pytorch/pull/115440 on behalf of https://github.com/DanilBaibak due to Did not merge properly ([comment](https://github.com/pytorch/pytorch/pull/115440#issuecomment-1852126050))
2023-12-12 14:20:29 +00:00
dadb3694ff Use linux.large.arc for stale workflow (#115440)
* Try linux.large.arc for stale workflow

* Run stale workflow on PR changes

* Added arc runner label to the list of self-hosted runners

* Added concurrency linux-job

* Cleanup

* Added workflow_dispatch for testing purpose
2023-12-12 15:11:09 +01:00
7350dcb307 [CI] Fix lint errors on master (#115627)
Differential Revision: [D52073432](https://our.internmc.facebook.com/intern/diff/D52073432)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115627
Approved by: https://github.com/atalman
2023-12-12 13:53:14 +00:00
bc51a0c22f Revert "[PyTorch] AOTI: add more basic aoti_torch getters (#112799)"
This reverts commit 3de2596abed9717a166635b48126302fcf46527a.

Reverted https://github.com/pytorch/pytorch/pull/112799 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/112799#issuecomment-1852076887))
2023-12-12 13:52:34 +00:00
f98b0f3ebc Add bfloat16 support to torch.sparse.addmm for CPU (#115535)
Fixes https://github.com/pytorch/pytorch/issues/73145.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115535
Approved by: https://github.com/cpuhrsch
2023-12-12 13:26:33 +00:00
d6f8850653 Revert "[Export] Test non-strict mode on existing test cases (#115399)"
This reverts commit 36527df344c0c33dae8bc6c94eded8646013b736.

Reverted https://github.com/pytorch/pytorch/pull/115399 on behalf of https://github.com/atalman due to OSSCI oncall, broke CI tests ([comment](https://github.com/pytorch/pytorch/pull/115399#issuecomment-1851988651))
2023-12-12 13:02:18 +00:00
a8acd6c410 Add Half support for AvgPool2d on CPU (#109578)
Add Half support for AvgPool2d (both channels last and channels first) on CPU

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109578
Approved by: https://github.com/mingfeima, https://github.com/albanD
2023-12-12 12:59:47 +00:00
92fd3927b0 [export][reland] Add math.* ops to pass base (#115559)
Reland of https://github.com/pytorch/pytorch/pull/115271/
Fixes https://github.com/pytorch/pytorch/issues/115209
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115559
Approved by: https://github.com/zhxchen17, https://github.com/atalman
ghstack dependencies: #115556, #115557, #115558
2023-12-12 10:46:41 +00:00
36527df344 [Export] Test non-strict mode on existing test cases (#115399)
Summary:
Dynamo test methodology provides a good example of patching various treatments onto the same set of test cases. A pitfall is the global config that could be easily modified somewhere. Here we change the behavior of the export API through hijacking it with self-defined code.

To support the non-strict test suite, `strict=False` is explicitly passed into the export API when it's called with or without the strict arg.

* For existing failed strict test cases, non-strict also fails.
* For passed strict but failed non-strict cases, we mark them as
`@testing.expectedFailureNonStrict`.
* Moreover, I manually checked the failure reasons, and some of them are not related to nn.Module asserting an exception. I marked them as `# Need to fix for non-strict mode`.

Test Plan:
python test/export/test_export_nonstrict.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115399
Approved by: https://github.com/zhxchen17, https://github.com/tugsbayasgalan
2023-12-12 07:11:53 +00:00
fdf814c6ca Revert "[MPS] Add support for MPSDataTypeComplexFloat[16|32] (#115513)"
This reverts commit a4bb4a237348ff8d688e43ba542ee59a9d7ed4a6.

Reverted https://github.com/pytorch/pytorch/pull/115513 on behalf of https://github.com/malfet due to Broke Mac x86 periodic builds ([comment](https://github.com/pytorch/pytorch/pull/115513#issuecomment-1851398773))
2023-12-12 06:50:47 +00:00
46694e92b7 Revert "[MPS] Fix sum and prod for complex types (#115554)"
This reverts commit 8b28380c8ed5b5bfe479392bcffeccf8b89be328.

Reverted https://github.com/pytorch/pytorch/pull/115554 on behalf of https://github.com/malfet due to Broke MacOS x86 builds ([comment](https://github.com/pytorch/pytorch/pull/115554#issuecomment-1851395982))
2023-12-12 06:47:39 +00:00
f28687dfb2 Do not use pytorchbot-env from upload-test-stats (#115606)
As it was only needed to check our token rate limits

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115606
Approved by: https://github.com/huydhn
2023-12-12 06:42:33 +00:00
1eca63c6ac [DeviceMesh] Move helper function 'get_mesh_dim_by_name' to MeshEnv class (#115572)
Move helper function `get_mesh_dim_by_name ` outside of the DeviceMesh class to keep the public class cleaner.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115572
Approved by: https://github.com/XilunWu, https://github.com/wanchaol
2023-12-12 06:29:46 +00:00
3de2596abe [PyTorch] AOTI: add more basic aoti_torch getters (#112799)
There was a lot of simple information about tensors we couldn't get. In particular, we didn't know the lengths of the arrays returned by sizes and strides.

Differential Revision: [D50949929](https://our.internmc.facebook.com/intern/diff/D50949929/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112799
Approved by: https://github.com/desertfire, https://github.com/aakhundov
ghstack dependencies: #112116, #112174, #112405, #112798
2023-12-12 06:19:45 +00:00
2b323e61ad [PyTorch] AOTI: Use static_cast, not dynamic_cast (#112798)
dynamic_cast is for when we aren't certain about the type. We are certain (and will crash anyway if we're wrong).

Differential Revision: [D50812978](https://our.internmc.facebook.com/intern/diff/D50812978/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112798
Approved by: https://github.com/chenyang78, https://github.com/desertfire, https://github.com/jansel, https://github.com/khabinov
ghstack dependencies: #112116, #112174, #112405
2023-12-12 06:19:45 +00:00
ca52195112 [PyTorch] AOTI: Avoid aoti_torch_data_ptr calls for constants at inference time (#112405)
Cache aoti_torch_get_data_ptr at constants update time.

Differential Revision: [D50708982](https://our.internmc.facebook.com/intern/diff/D50708982/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112405
Approved by: https://github.com/chenyang78, https://github.com/desertfire, https://github.com/khabinov
ghstack dependencies: #112116, #112174
2023-12-12 06:19:45 +00:00
24c67fe8cf [PyTorch] AOTI: Emit static constexpr int array vars when possible (#112174)
No need to populate a stack-based array for a shape/stride array when it's statically known.

Differential Revision: [D50699889](https://our.internmc.facebook.com/intern/diff/D50699889/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112174
Approved by: https://github.com/chenyang78, https://github.com/desertfire, https://github.com/jansel
ghstack dependencies: #112116
2023-12-12 06:19:45 +00:00
ff6f987adc [PyTorch] Replace cached thread_locals with stack allocation in AOTI (#112116)
This changes cached thread_local tensors to stack-allocated buffers. Since we were incidentally caching output in a thread_local, I had to add manual thread_local caching of outputs, which I implemented by caching a buffer and a Tensor whose storage is that buffer and then just memcpying the result into the cached buffer every time. Ideally, memory planning would be able to identify allocations that are the backing storage for outputs, but this should be good enough in the absence of planning.

Differential Revision: [D50416438](https://our.internmc.facebook.com/intern/diff/D50416438/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112116
Approved by: https://github.com/jansel, https://github.com/desertfire
2023-12-12 06:19:45 +00:00
405a0040cf Adds tool to visualize sharding (#114307)
This pull request adds a tool to visualize sharding. It uses the device_mesh and placement details to construct a visualization of the split of a torch dtensor.

Things to fix:

- [x] This implementation only uses the first element of the placement tuple; when can there be more than one element?
- [x] The calculation of the split is happening here, but maybe it is already done somewhere internally in the Shard class; can we directly call that here?

Fixes #108746

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114307
Approved by: https://github.com/wanchaol
2023-12-12 06:18:03 +00:00
65651d970b Optimize the copy of Half to Float and Float to Half on CPU (#103148)
### Description
Optimize the copy of Half to Float and Float to Half on CPU.

### Testing

Single core:
shape | fp16 -> fp32 / ms | fp32 -> fp16 / ms | bf16 -> fp32 / ms | fp32 -> bf16 / ms
-- | -- | -- | -- | --
size: (1, 777) | 0.00345 | 0.00344 | 0.00411 | 0.00410
size: (2, 512) | 0.00355 | 0.00344 | 0.00431 | 0.00400
size: (10, 555) | 0.00473 | 0.00391 | 0.00562 | 0.00477
size: (1, 2048, 1024) | 0.488 | 0.480 | 0.498 | 0.499
size: (32, 100, 777) | 0.584 | 0.568 | 0.571 | 0.587

28 cores:
shape | fp16 -> fp32 / ms | fp32 -> fp16 / ms | bf16 -> fp32 / ms | fp32 -> bf16 / ms
-- | -- | -- | -- | --
size: (10, 555) |  0.00472 | 0.00369 | 0.00576 |  0.00481
size: (1, 2048, 1024) |  0.0189 | 0.0188 | 0.0173 | 0.0251
size: (64, 512, 1024) | 3.159 | 2.375 |  3.152 | 2.358
size: (32, 100, 777) | 0.0225 | 0.0195 | 0.0193 | 0.0261
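
A minimal timing sketch for the copy paths measured above (one shape taken from the table; otherwise illustrative):

```python
import torch
from torch.utils.benchmark import Timer

x_fp32 = torch.randn(1, 2048, 1024)
x_fp16 = x_fp32.to(torch.half)

print(Timer("x_fp32.to(torch.half)", globals={"x_fp32": x_fp32, "torch": torch}).blocked_autorange())
print(Timer("x_fp16.to(torch.float)", globals={"x_fp16": x_fp16, "torch": torch}).blocked_autorange())
```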

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103148
Approved by: https://github.com/jgong5, https://github.com/cpuhrsch
2023-12-12 05:57:52 +00:00
b6a4866330 [export][reland][refactor][3/n] Move unlift to separate file (#115558)
Reland of https://github.com/pytorch/pytorch/pull/114787

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115558
Approved by: https://github.com/zhxchen17, https://github.com/atalman
ghstack dependencies: #115556, #115557
2023-12-12 05:37:07 +00:00
36199747f3 [export][reland][refactor][2/n] Move tracing logic (#115557)
Reland of https://github.com/pytorch/pytorch/pull/114768
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115557
Approved by: https://github.com/zhxchen17
ghstack dependencies: #115556
2023-12-12 05:37:07 +00:00
dd9a989b83 [export][reland][refactor][1/n] Split dynamic shapes (#115556)
Reland of https://github.com/pytorch/pytorch/pull/114764
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115556
Approved by: https://github.com/zhxchen17
2023-12-12 05:36:41 +00:00
744d74c456 [inductor][optimus] enable smart fusion (#115471)
Summary: Enable gmm smart fusion in D51698686

Test Plan: buck test

Differential Revision: D52002137

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115471
Approved by: https://github.com/mengluy0125
2023-12-12 05:04:36 +00:00
fbb744fd49 [dtensor] enable radam foreach optimizer (#115566)
As titled, test both non-foreach and foreach optim

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115566
Approved by: https://github.com/XilunWu
ghstack dependencies: #115297, #115564, #115565
2023-12-12 03:57:00 +00:00
c322e5b5e9 [dtensor] add test for nadam optimizer (#115565)
as titled, foreach ops already supported, just add test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115565
Approved by: https://github.com/XilunWu
ghstack dependencies: #115297, #115564
2023-12-12 03:57:00 +00:00
4bd661c472 [dtensor] enable adadelta foreach optimizer (#115564)
as titled

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115564
Approved by: https://github.com/XilunWu
ghstack dependencies: #115297
2023-12-12 03:56:55 +00:00
8a27352d6b [dtensor] add a implicit replication flag (#115297)
This PR adds experimental implicit replication support for DTensor to interoperate with torch.Tensor: under this context manager, DTensor can work together with torch.Tensor by assuming the torch.Tensor's sharding layout is replicated.

Note that this is risky for DTensor, so we don't turn it on by default, but for certain cases where the tensor is definitely replicated, users can use this to allow DTensor and Tensor computation to work together.
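
A hypothetical usage sketch; the context-manager name and import path are assumptions based on the description above, not a confirmed API:

```python
import torch
import torch.distributed as dist
from torch.distributed._tensor import DeviceMesh, Replicate, distribute_tensor
from torch.distributed._tensor.experimental import implicit_replication  # assumed path

# assumes torchrun has initialized the default process group
mesh = DeviceMesh("cpu", list(range(dist.get_world_size())))
dt = distribute_tensor(torch.ones(4), mesh, [Replicate()])

with implicit_replication():
    # the plain torch.Tensor on the right is assumed to be replicated
    out = dt + torch.arange(4, dtype=torch.float32)
```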

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115297
Approved by: https://github.com/awgu
2023-12-12 03:56:48 +00:00
c70f995b5c [DeviceMesh] Add mesh_dim_names to DeviceMesh __repr__ if it exists (#115579)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115579
Approved by: https://github.com/wanchaol
2023-12-12 02:18:34 +00:00
0fc04e274d [inductor] Fix an aliased output bug (#115373)
Summary: for https://github.com/pytorch/pytorch/issues/97083, when

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115373
Approved by: https://github.com/jansel
2023-12-12 01:18:59 +00:00
89ee3af076 [Reland][Dynamo] Don't log compilation metrics for PyTorch unit tests (#115571)
Reland #115452, which was reverted to simplify a merge conflict with #115386

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115571
Approved by: https://github.com/yanboliang
2023-12-12 01:15:54 +00:00
064846dbc2 [cpu] flash attention optimization (#115151)
### Modifications
- **EXP**: Add a fast version with a reduced accuracy (ULP20) to vec exp `exp_u20` and use it in flash attention.
- **FUSION**: Do fusion for `softmax` ops.
- **SCALE**: Move the calculation of `scaling_factor` after `gemm`.

### Performance
_Model: Stable Diffusion V2.1_

| Version | BF16 Kernel latency (s) | BF16 speedup | FP32 Kernel latency (s) | FP32 speedup |
| ----- | ----- | ----- | ----- | ----- |
| PT | 15.865 |  | 35.362 |  |
| PT + EXP | 12.518 | 21.10% | 19.327 | 45.35% |
| PT + EXP + FUSION | 11.774 | 25.79% | 18.306 | 48.23% |
| PT + EXP + FUSION + SCALE | 11.053 | 30.33% | 18.360 | 48.08% |
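
For illustration, a minimal sketch (not part of this PR) of how the CPU flash-attention path is typically exercised through the public SDPA API; shapes and dtype below are placeholders:

```
import torch
import torch.nn.functional as F

# Placeholder shapes: (batch, heads, seq_len, head_dim) in bf16 on CPU.
q = torch.randn(1, 16, 1024, 64, dtype=torch.bfloat16)
k = torch.randn(1, 16, 1024, 64, dtype=torch.bfloat16)
v = torch.randn(1, 16, 1024, 64, dtype=torch.bfloat16)

# scaled_dot_product_attention can dispatch to the CPU flash-attention kernel
# touched by this PR (exp_u20, softmax fusion, moved scaling).
out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([1, 16, 1024, 64])
```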

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115151
Approved by: https://github.com/jgong5, https://github.com/drisspg
2023-12-12 01:09:55 +00:00
0379c11248 [c10d] Enable PG NCCL monitor thread by default (#115577)
We added a monitor thread in NCCL PG in https://github.com/pytorch/pytorch/pull/112518. To summarize what we are doing in the monitor thread: it listens to the heartbeat from the watchdog thread and detects unhealthy NCCL watchdog hangs (due to several reasons such as nccl/cuda API bugs or unexpected blocking behaviors). This is the last resort to ensure that we don't silently keep the training job running for hours.

We didn't enable this feature by default at first, since we wanted to perform more due diligence and have some customers try it out. So far, we haven't seen any obstacle that blocks turning on this feature, and we have received positive feedback from users. We now decided to turn it on by default in this PR.

If this feature turns out not to work as expected and disturbs your training process, you can set `TORCH_NCCL_ENABLE_MONITORING=0` to disable it. Please kindly file an issue with us so that we can see if we missed any corner cases during the design.
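
A minimal sketch of opting out via the environment variable mentioned above (illustrative; set it before the process group is created):

```
import os

# Disable the NCCL watchdog monitor thread (it is on by default after this PR).
os.environ["TORCH_NCCL_ENABLE_MONITORING"] = "0"

import torch.distributed as dist

# Assumes the usual env:// rendezvous variables (MASTER_ADDR, RANK, WORLD_SIZE, ...)
# are already set by the launcher.
# dist.init_process_group(backend="nccl")
```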

Differential Revision: [D52045911](https://our.internmc.facebook.com/intern/diff/D52045911)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115577
Approved by: https://github.com/wconstab, https://github.com/kwen2501
2023-12-12 00:45:54 +00:00
6988e40b48 [quant][fx] Lower operator.matmul in convert_fx (#113954)
Summary: We support lowering `torch.matmul` but not
`operator.matmul`. This commit adds support for the latter,
which enables lowering the shorthand `@`. This addresses
https://github.com/pytorch/pytorch/issues/111450.
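
A hedged sketch of the kind of model this enables (not from the test plan; the qconfig setup below is illustrative):

```
import torch
from torch.ao.quantization import get_default_qconfig_mapping
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx

class MatmulModule(torch.nn.Module):
    def forward(self, x, y):
        # `x @ y` desugars to operator.matmul, which convert_fx can now lower
        # the same way it already lowers torch.matmul.
        return x @ y

example_inputs = (torch.randn(4, 8), torch.randn(8, 4))
m = prepare_fx(MatmulModule().eval(), get_default_qconfig_mapping("fbgemm"), example_inputs)
m(*example_inputs)   # calibrate
m = convert_fx(m)    # `@` is lowered like torch.matmul
```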

Test Plan:
python test/test_quantization.py TestQuantizeFx

Reviewers: jerryzh168

Subscribers: jerryzh168, supriyar
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113954
Approved by: https://github.com/jerryzh168
2023-12-12 00:34:58 +00:00
0a464ad1a7 [dtensor] turn back on symbolic shape in tests (#115568)
as titled, as @jbschlosser enabled dynamic shape support for traceable
subclass, turn back on the tests with default setting

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115568
Approved by: https://github.com/XilunWu
2023-12-12 00:26:23 +00:00
078773b32b [ROCm] Add owners for more HIP-specific paths (#113989)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113989
Approved by: https://github.com/jeffdaily, https://github.com/malfet
2023-12-12 00:24:38 +00:00
17de38c9af [Dynamo] Check duplication when loading dynamo tracing rules (#115059)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115059
Approved by: https://github.com/jansel
2023-12-12 00:22:20 +00:00
0692240b90 [dtensor] account for empty list when turning to OpStrategy (#115298)
Trying to fix https://github.com/pytorch/pytorch/issues/115065

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115298
Approved by: https://github.com/XilunWu
2023-12-12 00:11:16 +00:00
19c67a9db5 [dynamo] Fix a closure cell empty error (#115541)
Summary: Fixes https://github.com/pytorch/pytorch/issues/97115. The solution given by @jansel in that issue works. Checking in the code so it won't get lost.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115541
Approved by: https://github.com/jansel
2023-12-12 00:01:51 +00:00
617c228fba [CI] Lower the smoketest speedup threshold for nangpt (#115562)
Summary:
https://github.com/pytorch/pytorch/actions/runs/7158691360/job/19491437314
shows the variance can be larger than previously expected. Lowering it
for now and if it continues to be a problem, we should switch to some
other more stable model.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115562
Approved by: https://github.com/chenyang78
2023-12-11 23:46:30 +00:00
4471fe6c39 [sparse][semi-structured] add alg_id to _cslt_sparse_mm and _cslt_sparse_mm_search (#115178)
Summary:

cuSPARSELt has support for different alg_id, which are set via

`cusparseLTMatmulAlgSetAttribute`, in total there are 4 different
alg_ids, 0 - 3.

Previously we were just using the default alg_id, as from our initial
experiments we found that for most shapes the default alg_id is the
fastest and that they made no difference on numerical correctness, just
performance. From our previous experiments the fastest alg_id seemed to
differ only on small matmul shapes.

danthe3rd found a performance regression when running with
cuSPARSELt v0.4.0 vs v0.5.0, on LLM shapes, which match these
characteristics (activations are small, weights are large).

However it's likely that this is due to the alg_id ordering changing, as
mentioned in the release notes for v0.5.0.
```
cusparseLtMatmulAlgSelectionInit() does not ensure the same ordering of
algorithm id alg as in v0.4.0.
```

This PR adds in the following:
- support for passing in alg_id to _cslt_sparse_mm
- a new op, _cslt_sparse_mm_search, which returns the optimal alg_id for
  a given matmul

_cslt_sparse_mm_search has the same function signature as
_cslt_sparse_mm, minus the alg_id parameter.
We are able to achieve v0.4.0 performance with alg_id=1 on the shapes
that daniel provided.

We will address autoselecting the best alg_id in a future PR, possibly
with torch.compile.
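
A hedged sketch of the intended flow (these are private ops; the exact signatures and the compression helper below are assumptions based on the description above):

```
import torch

# Assumed 2:4-sparse fp16 weight: keep 2 out of every 4 elements along each row.
mask = torch.tensor([1, 1, 0, 0], dtype=torch.float16, device="cuda").tile(128, 32)
A = torch.randn(128, 128, dtype=torch.float16, device="cuda") * mask
B = torch.randn(128, 128, dtype=torch.float16, device="cuda")

A_compressed = torch._cslt_compress(A)                   # cuSPARSELt compression (assumed helper)
alg_id = torch._cslt_sparse_mm_search(A_compressed, B)   # same signature as _cslt_sparse_mm, minus alg_id
out = torch._cslt_sparse_mm(A_compressed, B, alg_id=alg_id)
```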

Test Plan:
```
python test/test_sparse_semi_structured -k cslt
```

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115178
Approved by: https://github.com/cpuhrsch
2023-12-11 23:08:51 +00:00
8b28380c8e [MPS] Fix sum and prod for complex types (#115554)
By not force-casting dtype to float

Test plan: `python -c "import torch;print(torch.linspace(-3.0, 3.0, 50, dtype=torch.cfloat, device='mps').sqrt().sin().sum())"`

Before:
```
tensor(21.1778+0.j, device='mps:0')
```
After
```
tensor(21.1778+39.1377j, device='mps:0')
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115554
Approved by: https://github.com/lezcano
ghstack dependencies: #115512, #115513
2023-12-11 23:03:44 +00:00
a4bb4a2373 [MPS] Add support for MPSDataTypeComplexFloat[16|32] (#115513)
But limit it to MacOS Sonoma +

Before the calling `torch.cat` with complex types failed, but now it works.
Before:
```
% python -c "import torch;print(torch.cat([torch.rand(3, 3, dtype=torch.cfloat).to('mps'), torch.rand(3, 3, dtype=torch.cfloat).to('mps')]))"
TypeError: Trying to convert ComplexFloat to the MPS backend but it does not have support for that dtype.
```
After:
```
% python -c "import torch;print(torch.cat([torch.rand(3, 3, dtype=torch.cfloat).to('mps'), torch.rand(3, 3, dtype=torch.cfloat).to('mps')]))"
tensor([[0.4857+0.0030j, 0.9375+0.8630j, 0.3544+0.9911j],
        [0.5293+0.8652j, 0.8440+0.1991j, 0.5152+0.8276j],
        [0.0136+0.7469j, 0.1403+0.4761j, 0.2943+0.0896j],
        [0.6458+0.0035j, 0.3579+0.4577j, 0.1723+0.1508j],
        [0.4420+0.3554j, 0.4396+0.7272j, 0.2479+0.1191j],
        [0.3895+0.2292j, 0.7886+0.1613j, 0.9243+0.4180j]], device='mps:0')
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115513
Approved by: https://github.com/kulinseth
ghstack dependencies: #115512
2023-12-11 23:03:44 +00:00
288822c968 Increase ROCm test shards to 6 (#110997)
To reduce signal time

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110997
Approved by: https://github.com/huydhn, https://github.com/malfet
2023-12-11 22:30:16 +00:00
4307ccde99 Move ONNX's TorchModelType to pytorch_test_common to fix circ. dep. (#115353)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115353
Approved by: https://github.com/BowenBao
2023-12-11 22:23:03 +00:00
suo
ccd5bde6a3 [export] Reintroduce InterpreterModule to unflatten (#115436)
InterpreterModule is better than GraphModule codegen; it's more debuggable and
has better stack traces. The only reason we don't use it today is because
torch.compile doesn't work with it.

I work around this by constructing a GraphModule separately for usage during
dynamo tracing, but otherwise using torch.fx.Interpreter.

Differential Revision: [D51971661](https://our.internmc.facebook.com/intern/diff/D51971661/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115436
Approved by: https://github.com/zhxchen17
ghstack dependencies: #115408
2023-12-11 22:15:32 +00:00
suo
c137335b5c [export] make UnflattenedModule not inherit from GraphModule (#115408)
UnflattenedModule doesn't really behave like a graph module; we customize `__call__` to do something completely different than what GraphModule does. So, things that test `isinstance(unflattened_module, GraphModule)` and do something with the GraphModule are often broken.

This change makes UnflattenedModule its own thing.

Differential Revision: [D51959097](https://our.internmc.facebook.com/intern/diff/D51959097/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115408
Approved by: https://github.com/zhxchen17
2023-12-11 22:15:21 +00:00
8c1567d021 [c10d] Change watchdog inner loop function name to make it more accurate (#115404)
Function `workCleanupLoop` does not cover all the things we do in the watchdog thread, so proposing a new name here to reflect what we are actually doing in the watchdog thread.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115404
Approved by: https://github.com/kwen2501, https://github.com/wconstab
2023-12-11 22:00:06 +00:00
99f06c0cc2 [BE] update errors to be more descriptive (#115443)
we call `_check_single_tensor` and `_check_tensor_list` as validation but don't print out the param types that were invalid

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115443
Approved by: https://github.com/XilunWu
2023-12-11 21:21:10 +00:00
b706c4116d [MPS] Add MacOS 14 runtime check (#115512)
Prerequisite for adding more complex type support and FFT operation

Check using `conjugateWithTensor:name:` selector defined as follows
```objc
/// Returns the complex conjugate of the input tensor elements.
///
/// - Parameters:
///   - tensor: The input tensor.
///   - name: An optional string which serves as an identifier for the operation..
/// - Returns: A valid `MPSGraphTensor` object containing the elementwise result of the applied operation.
-(MPSGraphTensor *) conjugateWithTensor:(MPSGraphTensor *) tensor
                                   name:(NSString * _Nullable) name
MPS_AVAILABLE_STARTING(macos(14.0), ios(17.0), tvos(17.0))
MPS_SWIFT_NAME( conjugate(tensor:name:) );
```

- Rename `isOnMacOS13orNewer(unsigned minor)` hook to `isOnMacOSorNewer(major, minor)`
- Replace `torch._C.__mps_is_on_macos_13_or_newer` with `torch._C._mps_is_on_macos_or_newer`
- Add `torch.backends.mps.is_macos_or_newer` public API
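
For illustration, a quick check of the new runtime-version API (assuming the `(major, minor)` signature implied by the renamed hook):

```
import torch

if torch.backends.mps.is_available():
    # True only on macOS Sonoma (14.0) or newer.
    print(torch.backends.mps.is_macos_or_newer(14, 0))
```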
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115512
Approved by: https://github.com/albanD
2023-12-11 21:11:42 +00:00
03ff44c958 [c10d] Fix Store check condition in NCCL PG watchdog (#115475)
In https://github.com/pytorch/pytorch/pull/115449/, after turning on `DUMP_ON_TIMEOUT=1`, some existing tests somehow failed. Upon checking, the failure is caused by the TCPStore check call within the watchdog thread.

1. It's not because TCPStore creation has not completed: even if I make it sleep for a long time, the test still fails. Rather, it's because we query the TCPStore after we shut down the PG.

2. The reason for that is: `std::chrono::steady_clock::now()` in C++ returns a `time_point` object representing the current point in time according to the steady clock. The unit of this time_point is not directly specified in terms of seconds or nanoseconds; rather, it depends on the internal representation of the steady clock, which can vary between implementations. In reality it is nanoseconds, which makes the delta so big that we check the store every time the watchdog thread wakes up. To make things even worse, `terminateProcessGroup_` might be set to `True` after the outermost while-loop check but before the TCPStore check, so the watchdog gets stuck querying a TCPStore that has already been deleted, while the main thread is still waiting for the watchdog to join.

The solution here is:
1. Add back `std::chrono::duration_cast` to ensure the delta is indeed milliseconds, so that the timeout-check logic works as expected.
2. Also check `terminateProcessGroup_`, so that we don't do any dump when the main thread has already marked the process as exited.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115475
Approved by: https://github.com/wconstab
2023-12-11 21:06:05 +00:00
ccc9e5f5bc Optimize conv2d pw quantized (#115221)
Summary:
In order to get better performance for conv2d pw, it's better to read the input together in a batch.

With this optimization on CUNET-enc ops:

Kernel Name | Workgroup Size | Duration P50 (ns)
-- | -- | --
vulkan.quantized_conv2d_pw_2x2 | {96, 72, 2} | 891332
vulkan.quantized_conv2d_pw_2x2 | {48, 36, 4} | 528528
vulkan.quantized_conv2d_pw_2x2 | {24, 18, 8} | 557336

Without this optimization:
Kernel Name | Workgroup Size | Duration P50 (ns)
-- | -- | --
vulkan.quantized_conv2d_pw_2x2 | {96, 72, 2} | 1633268
vulkan.quantized_conv2d_pw_2x2 | {48, 36, 4} | 1177228
vulkan.quantized_conv2d_pw_2x2 | {24, 18, 8} | 1343264

Test Plan:
Ensure all vulkan quantize tests pass:
buck2 run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 --show-output
Running main() from third-party/googletest/1.11.0/googletest/googletest/src/gtest_main.cc
[==========] Running 78 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 78 tests from VulkanAPITest
[ RUN      ] VulkanAPITest.uniform_buffer_copy
...
[----------] Global test environment tear-down
[==========] 78 tests from 1 test suite ran. (1519 ms total)
[  PASSED  ] 78 tests.

buck2 run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 --show-output

Running main() from third-party/googletest/1.11.0/googletest/googletest/src/gtest_main.cc
[==========] Running 395 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 395 tests from VulkanAPITest
[ RUN      ] VulkanAPITest.zero_size_tensor
[       OK ] VulkanAPITest.zero_size_tensor (83 ms)
...
xplat/caffe2/aten/src/ATen/test/vulkan_api_test.cpp:7593: Skipped
QueryPool is not available
[  SKIPPED ] VulkanAPITest.querypool_flushed_shader_log (0 ms)
[----------] 395 tests from VulkanAPITest (6515 ms total)

[----------] Global test environment tear-down
[==========] 395 tests from 1 test suite ran. (6515 ms total)
[  PASSED  ] 394 tests.
[  SKIPPED ] 1 test, listed below:
[  SKIPPED ] VulkanAPITest.querypool_flushed_shader_log

  YOU HAVE 5 DISABLED TESTS

Reviewed By: yipjustin

Differential Revision: D50997530

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115221
Approved by: https://github.com/yipjustin
2023-12-11 20:59:15 +00:00
585aea6e77 [xla hash update] update the pinned xla hash (#115528)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115528
Approved by: https://github.com/clee2000
2023-12-11 20:22:46 +00:00
505574c46a Add decomposition for torch.block_diag (#115096)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115096
Approved by: https://github.com/peterbell10
2023-12-11 20:04:22 +00:00
5fe2b138e3 Revert "[inductor] Fix an aliased output bug (#115373)"
This reverts commit 1310f0bf38293b68a781287d1de8cf699a76974d.

Reverted https://github.com/pytorch/pytorch/pull/115373 on behalf of https://github.com/atalman due to Sorry for reverting your change it broke inductor tests ([comment](https://github.com/pytorch/pytorch/pull/115373#issuecomment-1850792869))
2023-12-11 20:02:15 +00:00
c52b78ebc2 [ez] Remove some args from run_test.py (#115459)
Don't think anyone uses these
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115459
Approved by: https://github.com/malfet, https://github.com/huydhn
2023-12-11 19:56:37 +00:00
b5578cb08b [ez] Remove unittest retries (#115460)
Pytest is used in CI now for reruns, and I doubt people are using the env vars when running locally. IMO removing this code makes the run function easier to read.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115460
Approved by: https://github.com/malfet, https://github.com/huydhn
2023-12-11 19:46:09 +00:00
5c0976fa04 Revert "[dynamo] guarded config (#111299)" (#115386)
This reverts commit 5927e9cbf2ac18aaaaecaab02258b7a35ac10969.

Differential Revision: [D51959266](https://our.internmc.facebook.com/intern/diff/D51959266)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115386
Approved by: https://github.com/yanboliang, https://github.com/malfet
ghstack dependencies: #115384, #115401, #115385
2023-12-11 19:35:42 +00:00
6db7b30db4 Revert "[dynamo] Cache size calc for differing config (#111300)" (#115385)
This reverts commit 78318d024989cf86e1ede424997cd42d2d291694.

Differential Revision: [D51959268](https://our.internmc.facebook.com/intern/diff/D51959268)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115385
Approved by: https://github.com/malfet
ghstack dependencies: #115384, #115401
2023-12-11 19:35:42 +00:00
f06f51b152 Revert "[Dynamo] Don't log compilation metrics for PyTorch unit tests (#115452)"
This reverts commit cd444aa075dd1e9c5d85cf3fbca9e078c74a7580.

Reverted https://github.com/pytorch/pytorch/pull/115452 on behalf of https://github.com/davidberard98 due to Merge conflict with #115385, which already landed in fbcode ([comment](https://github.com/pytorch/pytorch/pull/115452#issuecomment-1850729965))
2023-12-11 19:21:40 +00:00
f5f6618813 [executorch hash update] update the pinned executorch hash (#115311)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115311
Approved by: https://github.com/pytorchbot
2023-12-11 18:31:44 +00:00
40a14e07ef Revert "[sparse][semi-structured] add alg_id to _cslt_sparse_mm and _cslt_spasre_mm_search (#115178)"
This reverts commit 1e5636f7915035b09dce22ad1d2170a65f344214.

Reverted https://github.com/pytorch/pytorch/pull/115178 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but the Window build failure looks legit 1e5636f791 ([comment](https://github.com/pytorch/pytorch/pull/115178#issuecomment-1850605711))
2023-12-11 18:07:17 +00:00
5f41fc7619 [c10d] Change NCCL PG watchdog error msg and test comments (#115403)
Address the nit comments in https://github.com/pytorch/pytorch/pull/115226/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115403
Approved by: https://github.com/wconstab
ghstack dependencies: #115226
2023-12-11 17:55:28 +00:00
794545c11f [BE]: Enable RUF015 codebase wide (#115507)
Constant-time access of the first value in a collection. This is a constant-time operation, instead of converting the collection to a list to get the first item, which is linear. The rule is turned on, which autofixes and enforces this.
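
An illustrative example of the pattern RUF015 rewrites (not from the PR):

```
some_iterable = range(1_000_000)

# Before: materializes the whole collection just to read one element (O(n)).
first = list(some_iterable)[0]

# After (RUF015 autofix): constant-time access to the first element.
first = next(iter(some_iterable))
```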

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115507
Approved by: https://github.com/malfet
2023-12-11 15:51:01 +00:00
1e5636f791 [sparse][semi-structured] add alg_id to _cslt_sparse_mm and _cslt_spasre_mm_search (#115178)
Summary:

cuSPARSELt has support for different alg_id, which are set via

`cusparseLTMatmulAlgSetAttribute`, in total there are 4 different
alg_ids, 0 - 3.

Previously we were just using the default alg_id, as from our initial
experiments we found that for most shapes the default alg_id is the
fastest and that they made no difference on numerical correctness, just
performance. From our previous experiments the fastest alg_id seemed to
differ only on small matmul shapes.

danthe3rd found a performance regression when running with
cuSPARSELt v0.4.0 vs v0.5.0, on LLM shapes, which match these
characteristics (activations are small, weights are large).

However it's likely that this is due to the alg_id ordering changing, as
mentioned in the release notes for v0.5.0.
```
cusparseLtMatmulAlgSelectionInit() does not ensure the same ordering of
algorithm id alg as in v0.4.0.
```

This PR adds in the following:
- support for passing in alg_id to _cslt_sparse_mm
- a new op, _cslt_sparse_mm_search, which returns the optimal alg_id for
  a given matmul

_cslt_sparse_mm_search has the same function signature as
_cslt_sparse_mm, minus the alg_id parameter.
We are able to achieve v0.4.0 performance with alg_id=1 on the shapes
that daniel provided.

We will address autoselecting the best alg_id in a future PR, possibly
with torch.compile.

Test Plan:
```
python test/test_sparse_semi_structured -k cslt
```

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115178
Approved by: https://github.com/cpuhrsch
2023-12-11 15:47:28 +00:00
b88be1686d Revert "[export][refactor][1/n] Move dynamic shapes logic (#114764)" (#115508)
GitHub first oncall.
This reverts commit 53bf8cfcf9c966096e829247380462d0a3a61e8d.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115508
Approved by: https://github.com/malfet, https://github.com/angelayi
2023-12-11 14:54:51 +00:00
f017a1af3f [MPS] add complex_out to MPS backend (#110851)
Adds support for at::complex_out to the MPS backend

Implemented in a binary kernel using the view_as_real pattern for handling complex dtypes in the mps backend.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110851
Approved by: https://github.com/kulinseth
2023-12-11 13:37:55 +00:00
de89a53df8 [benchmarking] Reduce box_detections_per_img for vision_maskrcnn (#115487)
This fixes a failure on the [perf dashboard](https://hud.pytorch.org/benchmark/compilers) with `--amp` mode.  I believe boxes 5 and 6 were getting swapped.  The existing comment explains the issue.

Before
```
$ ./benchmarks/dynamo/torchbench.py --training  --accuracy --no-translation-validatio --amp --backend=inductor --disable-cudagraphs --only vision_maskrcnn
...
[2023-12-09 13:21:27,292] torch._dynamo.utils: [ERROR] RMSE (res-fp64): 0.00171, (ref-fp64): 0.00054 and shape=torch.Size([256, 256, 3, 3])
[2023-12-09 13:21:27,292] torch._dynamo.utils: [ERROR] Accuracy failed for key name backbone.fpn.layer_blocks.2.0.weight.grad
fail_accuracy
```

After
```
$ ./benchmarks/dynamo/torchbench.py --training  --accuracy --no-translation-validatio --amp --backend=inductor --disable-cudagraphs --only vision_maskrcnn
...
pass
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115487
Approved by: https://github.com/yanboliang
2023-12-11 08:42:25 +00:00
274fdc81f8 [Dynamo][6.3/N] Further cleanup torch.py (#114669)
A follow-up PR to clean up what I found during the refactor of torch.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114669
Approved by: https://github.com/jansel
2023-12-11 07:16:03 +00:00
fe01605830 [aotinductor] replace lld with the default ld linker (#115478)
Currently, we place constants in the .so. To avoid cases
where constants are too large (i.e. >2G), we put the
constants into .lrodata, which doesn't have the 2G limit.
Not sure why, but lld still issues errors like below even if
those large constant data are stored in the .lrodata section:

"relocation R_X86_64_PC32 out of range: 5459191920 is not in
[-2147483648, 2147483647]"

In contrast, the default gnu ld linker works fine. Let's
switch back to using ld to unblock some internal models.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115478
Approved by: https://github.com/desertfire, https://github.com/htyu
2023-12-11 02:35:26 +00:00
1310f0bf38 [inductor] Fix an aliased output bug (#115373)
Summary: for https://github.com/pytorch/pytorch/issues/97083, when

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115373
Approved by: https://github.com/jansel
2023-12-10 23:52:39 +00:00
2e6b809d6b [AOTI] Fix a missing declaration for the result of item() (#115175)
Differential Revision: [D51968539](https://our.internmc.facebook.com/intern/diff/D51968539)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115175
Approved by: https://github.com/chenyang78
2023-12-10 22:49:45 +00:00
9b3cb1c66c Fix environment condition for docker-release.yml
As those are run on nightlies and release tags environment should be set accordingly.

Also simplify `WITH_PUSH` condition.

Should fix https://github.com/pytorch/pytorch/actions/runs/7156407285/job/19494049140
2023-12-10 14:09:39 -08:00
38f890341d Implement pass-through state_dict and load_state_dict for dynamo OptimizedModule (#113423)
Fixes #113422
Fixes #94575

This is now possible:
```py
model = Model()
compiled_model = torch.compile(model)

model.load_state_dict(compiled_model.state_dict())  # previously key mismatch!
```

This also makes it much easier to checkpoint and load models that were wrapped like so:
```py
FSDP(torch.compile(model))
# or
DDP(torch.compile(model))
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113423
Approved by: https://github.com/msaroufim
2023-12-10 22:09:19 +00:00
26266c9718 [CI] Call torch.cuda.empty_cache to release device memory (#114663)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114663
Approved by: https://github.com/eellison
2023-12-10 21:27:42 +00:00
694cc6af56 [benchmarks] Fix NameError: name 'args' is not defined (#115494)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115494
Approved by: https://github.com/Skylion007, https://github.com/desertfire
2023-12-10 21:22:21 +00:00
21a1d31ed8 [caffe2] update Meta-internal googletest references (#115407)
Summary: Update test dependencies to point to the new internal googletest location.

Test Plan: CI

Differential Revision: D51951643

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115407
Approved by: https://github.com/cccclai
2023-12-10 20:37:13 +00:00
24a463c46c Revert "[export][refactor][2/n] Move tracing logic (#114768)" (#115503)
Github first oncall.
This reverts commit 0ab57ee7eab5391289d30e8c49fceee3f503f539.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115503
Approved by: https://github.com/angelayi, https://github.com/kit1980
2023-12-10 19:30:15 +00:00
b4ef59f740 Revert "[dynamo] remove unused OptimizeCtx field - export (#113901)" (#115401)
This reverts commit b62230a685666e8c2b8a5cb31b16352d286bcf9f.

Differential Revision: [D52001024](https://our.internmc.facebook.com/intern/diff/D52001024)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115401
Approved by: https://github.com/malfet
ghstack dependencies: #115384
2023-12-10 18:17:24 +00:00
b36fc6790e Revert "[dynamo] Guard on HAS_GRAPH_BREAKS if graph breaks are present (i.e. cache miss if compiled object requires nopython) (#114073)" (#115384)
This reverts commit 0bb29f945079ac4c83d674f7b3ff755cfb5396cf.

Differential Revision: [D51959267](https://our.internmc.facebook.com/intern/diff/D51959267)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115384
Approved by: https://github.com/malfet
2023-12-10 18:16:02 +00:00
6c1e75e646 Revert "[HigherOrderOp] make MapHigherOrder create map_impl call_function node instead of map (#115205)"
This reverts commit 8b747358783d2411afe1136dcc9da95c01bfbdaa.

Reverted https://github.com/pytorch/pytorch/pull/115205 on behalf of https://github.com/atalman due to ghfirst broke internal tests ([comment](https://github.com/pytorch/pytorch/pull/115205#issuecomment-1848995376))
2023-12-10 15:25:55 +00:00
100c466bff [CI][Inductor] Skip CPU tests when running on GPU (#115430)
This just follows the standard practice for CI: when one specifies `PYTORCH_TESTING_DEVICE_ONLY_FOR=cuda`, only tests targeting that device should be run.

Do it by refactoring part of `instantiate_device_type_tests` into `get_desired_device_type_test_bases` and using it from test_torchinductor.py to skip CPU tests

Fixes https://github.com/pytorch/pytorch/issues/115423

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115430
Approved by: https://github.com/seemethere
2023-12-10 15:21:24 +00:00
08d63a75a4 Revert "[HigherOrderOp] Remove additional get item calls in MapHigherOrder. (#115207)"
This reverts commit dd6ae6d3b473906d32fcb8a319895e31b039f224.

Reverted https://github.com/pytorch/pytorch/pull/115207 on behalf of https://github.com/atalman due to ghfirst broke internal tests ([comment](https://github.com/pytorch/pytorch/pull/115207#issuecomment-1848991919))
2023-12-10 15:12:12 +00:00
fbeca60b1f Remove replace_all and make VTs mutable (#113725)
1.  Removes calls to `replace_all` and `clone` and makes VTs mutable.
2. Properly handles Tuple Iterator mutation. Previously TupleIterator variables would only be properly reconstructed if they were advanced at least once in a frame. On calls to `next`, the source information would be lost (due to constructing a new iterator without using builder), which would ensure that during codegen the variable would be reconstructed from scratch. Now that VTs are mutated, the source is never lost, so we need to properly track mutation and handle it by replaying calls to `next` at the end of the modified bytecode.
3. Added test for checking iadd side effects, this was missing in our unit test coverage.
4. Fixed two incorrect sources: DelayGraphBreakVariable and UserMethodVariable both relied on setting the source to AttrSource(parent, name) at the callsite of `var_getattr`.
5. Fixed a bug in in-place adding for lists: it would set the resulting VariableTracker's source to `None`, which would utilize a different reconstruct path in codegen. Now this is handled explicitly by reconstructing vars when allow_cache=`False`, so that during side effect replay, the mutated var is correctly updated.

In subsequent PRs:
* Refactoring side effect tracking to be significantly simpler (I think we only need an `is_modified` flag)
* Refactor `next_variables` iterator to match the signature of `next`
* Remove all references to `options` in the code
* Refactor VTs representing mutable collections to implement their own mutation update handling
* Remove clone and/or make it specific to lists for creating slices
* Add mutation tracking/replay for sets
* Add mutation tracking/replay for iter.py
* Removing setting source in builder (it's set at the top level after a var is returned)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113725
Approved by: https://github.com/jansel
2023-12-10 09:31:21 +00:00
f71d931b32 [Dynamo][6.2/N] Dump the in graph function list(~2600 ops) and add unit tests. (#114196)
This is the second PR according https://github.com/pytorch/pytorch/pull/113009#issuecomment-1804417925

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114196
Approved by: https://github.com/jansel
2023-12-10 06:41:51 +00:00
4eb5838e18 Revert "Enable builtin tests for ONNX Export with ExportedProgram models (#114762)"
This reverts commit 13d2e3eba79000028291f4739a6e9c937dbe4264.

Reverted https://github.com/pytorch/pytorch/pull/114762 on behalf of https://github.com/huydhn due to Sorry for reverting your change but ONNX test is failing from this commit 13d2e3eba7 ([comment](https://github.com/pytorch/pytorch/pull/114762#issuecomment-1848831147))
2023-12-10 01:55:47 +00:00
2ee240d14a Revert "Move ONNX's TorchModelType to pytorch_test_common to fix circ. dep. (#115353)"
This reverts commit 960ad9d94e365c758b19298b45bcba5225b79e0c.

Reverted https://github.com/pytorch/pytorch/pull/115353 on behalf of https://github.com/huydhn due to Sorry for reverting your change but ONNX test is failing from the commit below in the stack 13d2e3eba7 ([comment](https://github.com/pytorch/pytorch/pull/115353#issuecomment-1848830883))
2023-12-10 01:53:50 +00:00
4490d4692b [doc] Rewrite benchmarks/dynamo/README.md (#115485)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115485
Approved by: https://github.com/yanboliang
2023-12-10 00:37:53 +00:00
8ddc549c0f [BE][JIT] Do not wrap shared_ptr with optional (#115473)
While reviewing https://github.com/pytorch/pytorch/pull/115381 I noticed that `torch::jit::GraphFunction::optimized_graph_` is an `std::array<c10::optional<std::shared_ptr<Graph>>, N>`, which feels excessive as `shared_ptr` is already nullable and has `operator bool()`. Looking at https://github.com/pytorch/pytorch/pull/26488, which introduced the change, also does not hint that this indirection is necessary.

Test plan: CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115473
Approved by: https://github.com/davidberard98, https://github.com/Skylion007
2023-12-09 20:43:40 +00:00
641ec2115f [AOTI] move model runner into a library (#115220)
Summary: So that we can import it in fbcode and do some AOTI run in py env

Test Plan: existed AOTI tests

Reviewed By: chenyang78

Differential Revision: D51780021

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115220
Approved by: https://github.com/desertfire
2023-12-09 19:03:32 +00:00
c039f01bd9 Increased hardcoded limit for number of GPUs. (#115368)
Fixes #115331.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115368
Approved by: https://github.com/albanD
2023-12-09 18:10:51 +00:00
cyy
99f222372b [5/N] Fixes clang-tidy warnings in c10/{core,util}/*.h (#115354)
This PR continues to fix clang-tidy warnings for headers in c10/core and c10/util.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115354
Approved by: https://github.com/Skylion007
2023-12-09 17:16:04 +00:00
937d616e82 Re-enable type checking for distributed_c10d.py (#115223)
Re-enable type checking for distributed_c10d.py

Type checking for distributed_c10d.py was inadvertently turned off at some point; this re-enables it and fixes the issues that have accumulated since.

Note: the backwards compatibility linter does not like some of these changes.  But they were incorrect before.  This needs human verification, however.

#suppress-api-compatibility-check

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115223
Approved by: https://github.com/wconstab
2023-12-09 11:07:54 +00:00
485ea9a70a [DTensor] Add DTensor experimental op for LayerNorm backward sharding rule propogation (#115398)
Summary: This diff is only for prototype to unblock the TP work. PyTorch distributed team is working on a more generic backward op for `aten.layer_norm`. Will remove this op from the experimental file once it is ready.

Test Plan:
**Local Test**:
Accuracy:
- Dtensor + Checkpoint: first run loss: P884569822 (on-par with baseline: P884213363)
- 2nd by loading saved checkpoint: P884583429 (on-par with baseline: P884271869)

Trace:
- Collective functions are inserted automatically.
- Example: https://fburl.com/perfdoctor/l567ww1x

**MAST Test**:
With: trainer = 128, batch_size=512
- NE on-par:
(see: 4441_ep_bs512_2fsdp_tp_sp_dtensor)
 {F1155318138}

Differential Revision: D51490868

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115398
Approved by: https://github.com/wanchaol
2023-12-09 09:38:56 +00:00
eb3aa424ce [Reland][Dynamo] Added support for math.radians on ints with dynamic shapes (#115477)
Reland #114507

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115477
Approved by: https://github.com/larryliu0820
2023-12-09 08:58:18 +00:00
960ad9d94e Move ONNX's TorchModelType to pytorch_test_common to fix circ. dep. (#115353)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115353
Approved by: https://github.com/BowenBao
ghstack dependencies: #114407, #115281, #114762
2023-12-09 07:47:03 +00:00
13d2e3eba7 Enable builtin tests for ONNX Export with ExportedProgram models (#114762)
Fixed by https://github.com/pytorch/pytorch/pull/113982
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114762
Approved by: https://github.com/BowenBao
ghstack dependencies: #114407, #115281
2023-12-09 07:46:43 +00:00
7e941a932b Store user model to simplify ONNXProgram.{adapt_torch_*,__call__} APIs (#115281)
Currently (after https://github.com/pytorch/pytorch/pull/114407), the user must pass the original ``model`` to APIs such as ``ONNXProgram.__call__``, ``ONNXProgram.adapt_torch_inputs_to_onnx`` and ``ONNXProgram.adapt_torch_outputs_to_onnx``.

This was needed because when the model is fakefied, a version of the non-fakefied model is needed so that the Initializers, buffers and constants can be extracted from a real model (and used as input to the ONNX model).
That approach brings an unnecessary usability burden to the user when the model is not fakefied, because the model that was already passed to ``torch.onnx.dynamo_export`` could be used to extract ``state_dict``.

This PR adds an ``ONNXProgram._model_torch`` attribute to store the user model and demotes the ``model`` argument of the aforementioned APIs from required to optional.

As a result, for the fakefied model scenario, the user still needs to pass the model, but for non-fakefied models the persisted model is implicitly used to extract the model state_dict, making the APIs easier to use.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115281
Approved by: https://github.com/BowenBao
ghstack dependencies: #114407
2023-12-09 07:46:12 +00:00
da341d0d48 [Dynamo][6.1/N] Refactor out TorchInGraphFunctionVariable and improve heuristic (#113432)
This is splitted from #113009, please check https://github.com/pytorch/pytorch/pull/113009#issuecomment-1804417925 for more details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113432
Approved by: https://github.com/ezyang, https://github.com/jansel
2023-12-09 05:11:44 +00:00
1c1f2bbe8a Add a space in the error message (#115465)
Summary:
As title says

Created from CodeHub with https://fburl.com/edit-in-codehub

Test Plan:
waitforsandcastle

Sandcastle run

Reviewed By: eeggl

Differential Revision: D52000286

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115465
Approved by: https://github.com/kwen2501
2023-12-09 04:35:51 +00:00
3ebf9acea1 [Triton] Replace triton.runtime.jit.get_cuda_stream with torch.cuda.c… (#115397)
triton.runtime.jit.get_cuda_stream was removed in https://github.com/openai/triton/pull/2756

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115397
Approved by: https://github.com/jansel
2023-12-09 04:30:42 +00:00
cyy
516bd4a72c [1/N] Use std::in_place (#115170)
It is time to gradually replace c10::in_place with std::in_place.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115170
Approved by: https://github.com/colesbury
2023-12-09 03:52:39 +00:00
2ed47fecc5 Robustify torch.multiprocessing.spawn error reporting to be less deadlock prone (#114688)
multiprocessing.Queue relies on, among other things, background threads to send messages between processes.  This works in the happy path but can cause issues if a process exits by bypassing atexit handlers or crashes, because the writer to the Queue can terminate while the reader is blocked reading the queue.  The reader then sees the queue as non-empty, yet even with a timeout it will actually block forever.

An example of a Queue deadlock is here: https://gist.github.com/chipturner/342f72341f087737befe9df84d0e41ce

Since the error reporting case here is a simple one-shot message from the dying child to the parent, we can just use a file-based rendezvous.  This eliminates the deadlock when a large traceback is still being flushed to the network when a child exits.
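
A conceptual sketch of a file-based error rendezvous (illustrative only, not the actual torch.multiprocessing implementation):

```
import multiprocessing as mp
import os
import tempfile
import traceback

def _worker(error_file: str) -> None:
    try:
        raise RuntimeError("boom")
    except Exception:
        # Writing the traceback to a plain file needs no background feeder
        # thread, so a dying child cannot deadlock the parent like a Queue can.
        with open(error_file, "w") as f:
            f.write(traceback.format_exc())
        os._exit(1)

if __name__ == "__main__":
    error_file = os.path.join(tempfile.mkdtemp(), "error.txt")
    p = mp.Process(target=_worker, args=(error_file,))
    p.start()
    p.join()
    if p.exitcode != 0 and os.path.exists(error_file):
        print("child failed with:\n" + open(error_file).read())
```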

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114688
Approved by: https://github.com/suo, https://github.com/yifuwang
2023-12-09 03:36:43 +00:00
2962271f58 [ONNX][dynamo_export] Extend expected fx output types for int, float, bool (#115431)
Fixes exporting ops, such as `aten::_scaled_dot_product_flash_attention` that returns int, float, bool typed outputs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115431
Approved by: https://github.com/titaiwangms, https://github.com/thiagocrepaldi
2023-12-09 03:24:48 +00:00
41b1919208 [nested_tensor]Python subclass NT overhead improvement (2/n): avoid getting from WeakTensorKeyDictionary twice during __init__ (#115450)
Summary:
Most NT operations end with creating a new NestedTensor, which is time-consuming. Trying to reduce overhead during the NestedTensor creation.

The ops return a new NestedTensor with the same offsets, so "tensor not in _tensor_symint_registry" would be false in most cases. The "in" (`__contains__`) check takes ~8 us. If we use "get" directly instead, we save a few us for most NT operations.
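
An illustrative stand-in for the change (the registry and helper names below are hypothetical; the real code keys a WeakTensorKeyDictionary by the offsets tensor):

```
# Hypothetical stand-in registry; the real one is a WeakTensorKeyDictionary.
_tensor_symint_registry = {}

def get_tensor_symint(offsets, make_symint):
    # Before: `offsets in registry` followed by `registry[offsets]` -> two lookups.
    # After: a single .get() covers the common hit path.
    symint = _tensor_symint_registry.get(offsets)
    if symint is None:
        symint = _tensor_symint_registry[offsets] = make_symint()
    return symint

print(get_tensor_symint("offsets_0", lambda: 42))   # miss: registers 42
print(get_tensor_symint("offsets_0", lambda: 99))   # hit: still 42
```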

Test Plan:
Before:
get_tensor_symint takes 15us
https://pxl.cl/3XF83
After:
get_tensor_symint takes 10us
https://pxl.cl/3XFc9

Differential Revision: D51992836

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115450
Approved by: https://github.com/soulitzer
2023-12-09 03:12:31 +00:00
d40a7c6026 Add decompositions for replication_pad (#115113)
Fixes #115395

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115113
Approved by: https://github.com/peterbell10
2023-12-09 02:44:07 +00:00
d7705f325d Patch --save-xml when TEST_IN_SUBPROCESS (#115463)
Patch `--save-xml` when `TEST_IN_SUBPROCESS`

When `--save-xml` is given as a unit test argument and the test is handled by a `TEST_IN_SUBPROCESS` handler (e.g., `run_test_with_subprocess` for `distributed/test_c10d_nccl`), the `--save-xml` args are first "consumed" by the argparser in `common_utils.py`. When a following subprocess in this `if TEST_IN_SUBPROCESS:` section starts, there are no `--save-xml` args left, thus leaving `args.save_xml` as `None`.

Since the argparser for the `--save-xml` option defaults to `_get_test_report_path()` when the arg is `None`, this is not a problem for GitHub CI runs. It could be an issue when people run those tests without `CI=1`: test reports won't be saved in this case even if they passed `--save-xml=xxx`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115463
Approved by: https://github.com/clee2000
2023-12-09 02:38:31 +00:00
c9c4cdf9a9 [AOTAutograd] Do not call ctx.mark_dirty on mutations hidden from autograd (#115324)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115324
Approved by: https://github.com/bdhirsh
2023-12-09 02:23:13 +00:00
3361496f96 Fix the corner case of index_add (#114929)
Fixes #114864

As the title stated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114929
Approved by: https://github.com/mikaylagawarecki
2023-12-09 01:57:25 +00:00
3c54ff6bcd Update ONNX's IO Adapter to support FakeTensor with ExportedProgram (#114407)
Currently, the ONNX exporter using torch.nn.Module as input can support
FakeTensor because the ONNX model stores all initializers

When using torch.export.ExportedProgram as input, the initializers are
lifted as inputs. In order to execute the ONNX model, we need to pass a
reference to the non-fake model to the
ONNXProgram.adapt_torch_inputs_to_onnx API, so that initializers can be
fetched from the model and fed to the ONNX model as input

ps: https://github.com/pytorch/pytorch/issues/115461 will track the API revision for the cases where additional `model_with_state_dict` are required to produce complete ONNX files exported with fake support. This is also tracked by the umbrella fake tensor issue https://github.com/pytorch/pytorch/issues/105464 FYI @BowenBao
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114407
Approved by: https://github.com/BowenBao
2023-12-09 01:48:27 +00:00
495054545c Allow preserve_rng_state=True when torch.compile + selective checkpointing + CUDA (#113718)
Fixes https://github.com/pytorch/pytorch/issues/113717.

When `preserve_rng_state=True`, we let AOTAutograd trace through `torch.random.fork_rng` op, and the tracing doesn't work under CUDA, hence the original error reported in the issue.

But since we are already doing RNG functionalization at Inductor level, we don't actually need to trace this `fork_rng` op. So we should just rewrite `preserve_rng_state` to False when we are using torch.compile (and let Inductor do its RNG functionalization which it's already been doing).
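
A minimal sketch of the scenario from the issue (illustrative; requires a CUDA device, and the selective-checkpointing context_fn is omitted):

```
import torch
from torch.utils.checkpoint import checkpoint

def block(x):
    return torch.nn.functional.gelu(x) * torch.rand_like(x)  # uses RNG

@torch.compile
def fn(x):
    # Under torch.compile, RNG state handling is effectively taken over by
    # Inductor's RNG functionalization rather than by tracing fork_rng.
    return checkpoint(block, x, use_reentrant=False, preserve_rng_state=True)

x = torch.randn(8, 8, device="cuda", requires_grad=True)
fn(x).sum().backward()
```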

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113718
Approved by: https://github.com/wanchaol
2023-12-09 01:47:25 +00:00
cd444aa075 [Dynamo] Don't log compilation metrics for PyTorch unit tests (#115452)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115452
Approved by: https://github.com/zou3519
2023-12-09 01:39:36 +00:00
e1370ff80f Vectorize CPU ATen mean kernel for BF16 & FP16 dtypes (#114582)
## Summary
Since #97351, CPU ATen kernel for `mean` for BF16 & FP16 dtypes has been unvectorized (it's not even implicitly vectorized).

This PR vectorizes `mean` for BF16 & FP16 on CPU in a `cast_fp32 -> sum -> div -> cast_bf16_or_fp16` fashion.

The perf benefit would be especially pronounced on machines with `AVX512_BF16` and/or `AVX512_FP16` ISA support.

## Benchmarking data for BF16 (collected before & after the change in this PR)

**Machine:** Intel&reg; Xeon&reg; (4th generation series, formerly codenamed Sapphire Rapids) Platinum 8468H
One socket (48 physical cores) - used `numactl --membind=0 --cpunodebind=0`
libtcmalloc & Intel OpenMP were preloaded

Environment variable used -
`KMP_AFFINITY=granularity=fine,compact,1,0 KMP_BLOCKTIME=1 KMP_SETTINGS=1 OMP_NUM_THREADS=48 MKL_NUM_THREADS=48`

**Workload:** E2E performance on BS 32 resnet50 (using BF16 via AMP) inference using oneDNN Graph JIT fuser (`mean` kernel is dispatched to eager mode ATen kernel, and is the bottleneck right now)

| **BEFORE:** Latency with unvectorized mean (lower is better)| **AFTER:** Latency with vectorized mean (lower is better)| Speedup due to vectorizing mean|
|----------------------------|-------------------------|------------|
|                19.1 ms           |                10.8  ms       | latency reduced by ~43.45%      |

**Benchmarking script for BF16 -**

 ```
import time
import torch
import torchvision

# enable oneDNN Graph JIT fuser
torch.jit.enable_onednn_fusion(True)
# AMP for JIT mode is enabled by default, and is divergent with its eager mode counterpart
torch._C._jit_set_autocast_mode(False)

# sample input should be of the same shape as expected inputs
example_input = torch.rand(32, 3, 224, 224)
# Using resnet50 from torchvision in this example for illustrative purposes,
# but the line below can indeed be modified to use custom models as well.
model = getattr(torchvision.models, "resnet50")().eval()

with torch.no_grad(), torch.cpu.amp.autocast(cache_enabled=False, dtype=torch.bfloat16):
    # Conv-BatchNorm folding for CNN-based Vision Models should be done with ``torch.fx.experimental.optimization.fuse`` when AMP is used
    import torch.fx.experimental.optimization as optimization
    # Please note that optimization.fuse need not be called when AMP is not used
    model = optimization.fuse(model)
    model = torch.jit.trace(model, (example_input))
    model = torch.jit.freeze(model)
    # a couple of warm-up runs
    model(example_input)
    model(example_input)
    # speedup would be observed in subsequent runs
    start = time.time()
    model(example_input)
    end = time.time()
    inference_time = (end - start) * 1000
    print("Inference time is ", inference_time)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114582
Approved by: https://github.com/jgong5, https://github.com/malfet
2023-12-09 01:02:13 +00:00
f614ed78b8 [docs, dynamo] fix typos in dynamo custom backend docs (#115444)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115444
Approved by: https://github.com/eellison
2023-12-08 23:58:26 +00:00
fb19947962 Add decompositions for reflection_pad{1, 2, 3}d (#115100)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115100
Approved by: https://github.com/peterbell10
2023-12-08 23:05:57 +00:00
9f7b3a4e18 Move autolabeler to "oncall: distributed" not "module:.." (#115447)
Reasoning for the change is spelled out in this issue

https://github.com/pytorch/pytorch/issues/115168

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115447
Approved by: https://github.com/huydhn, https://github.com/malfet
2023-12-08 22:53:20 +00:00
749f0c90e1 Revert "[export][refactor][3/n] Move unlift to separate file (#114787)" (#115457)
Github First Oncall: This reverts commit 967863d91dbe0a56fa7bcc4e075a25cc4ad67c81.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115457
Approved by: https://github.com/osalpekar
2023-12-08 22:33:28 +00:00
28de29fdda [releng] version 2.2 -> 2.3 (#115446)
Release 2.2 branch cut is completed. Hence bump the nightly version to 2.3.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115446
Approved by: https://github.com/huydhn, https://github.com/seemethere, https://github.com/malfet
2023-12-08 22:25:52 +00:00
3e47e3f441 Revert "[export] Fix graph output mismatch issue with constant outputs. (#115280)"
This reverts commit 622688fab9fc6d20ff3475a8a0a1fdb6af9d837e.

Reverted https://github.com/pytorch/pytorch/pull/115280 on behalf of https://github.com/atalman due to ghfirst issue when importing, will reland this PR ([comment](https://github.com/pytorch/pytorch/pull/115280#issuecomment-1847903624))
2023-12-08 22:10:03 +00:00
3dab46fe19 Revert "[export] Dont skip output caching for now. (#115374)"
This reverts commit fd79995fd6d9f599ff60b721ae56bb7b0aa4eb93.

Reverted https://github.com/pytorch/pytorch/pull/115374 on behalf of https://github.com/atalman due to ghfirst issue when importing, will reland this PR ([comment](https://github.com/pytorch/pytorch/pull/115374#issuecomment-1847899901))
2023-12-08 22:06:21 +00:00
aaaf5c08fb [ez] Don't run workflows on forks (#115429)
Adds the `if: github.repository_owner == 'pytorch'` to some jobs to make sure they don't run on forks, since they usually either fail or remain pending due to not having the correct machines to run.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115429
Approved by: https://github.com/huydhn, https://github.com/botmethere, https://github.com/malfet, https://github.com/atalman
2023-12-08 21:41:58 +00:00
b5d3d3ebf0 [ao] making hist_obs handle torch.inf and closeby values (#103467)
Summary: This PR does 2 things:

1) Previously this would simply error; now it will ignore any
torch.inf values that it receives. Note: the code checks for torch.inf after
aminmax, so that if there are no torch.inf values found, the perf is
relatively unchanged.

2) As mentioned in https://github.com/pytorch/pytorch/issues/100051,
values close to (but not quite at) the maximum/minimum float value could
overflow to infinity in the course of _adjust_min_max() (when such a large
value is multiplied by something in the middle of a calculation
that would otherwise result in a non-inf value). This was fixed by
rearranging the order of operations for the lines in question without
altering the actual equations. Specifically, where operations in lines
1095, 1098 and 1100 mix multiplication and division of large values,
it's better to divide the two large values before multiplying, rather
than multiplying the two large values together (creating overflow) before dividing as it had been.
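
A small numeric illustration of why dividing first avoids the overflow (generic example, not the observer code itself):

```
import torch

big = torch.tensor(3.0e38)   # close to, but below, float32 max (~3.4e38)
denom = big * 1.1            # still finite (~3.3e38)

# Multiplying the two large values first overflows the intermediate to inf.
print(big * big / denom)     # tensor(inf)

# Dividing first keeps every intermediate finite (exact result is big / 1.1).
print(big / denom * big)     # ~tensor(2.7273e+38)
```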

Test Plan: python test/test_quantization.py
TestObserver.test_histogram_observer_ignore_infinity

python test/test_quantization.py TestObserver.test_histogram_observer_handle_close_to_infinity
Reviewers:

Subscribers:

Tasks:

Tags:

Differential Revision: [D51489345](https://our.internmc.facebook.com/intern/diff/D51489345)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103467
Approved by: https://github.com/andrewor14
2023-12-08 21:41:31 +00:00
1215f2ffe2 [dtensor] readme typo (#115383)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115383
Approved by: https://github.com/awgu
ghstack dependencies: #115365
2023-12-08 21:40:40 +00:00
af925a56a1 Revert "[export] Add math.* ops to pass base (#115271)"
This reverts commit 6c0a4ced530dab78db455c37508931de2eb56239.

Reverted https://github.com/pytorch/pytorch/pull/115271 on behalf of https://github.com/atalman due to ghfirst issue when importing, will reland this PR ([comment](https://github.com/pytorch/pytorch/pull/115271#issuecomment-1847852211))
2023-12-08 21:17:56 +00:00
12d7ea19af [Indcutor][fx pass] Add sub and div pointwise ops to the post grad fusion (#115389)
Summary: As titled.

Test Plan:
# unit test
```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:group_batch_fusion
```
Buck UI: https://www.internalfb.com/buck2/792c58db-c369-487d-9a42-b5da471657c0
Test UI: https://www.internalfb.com/intern/testinfra/testrun/2814749981661407
Network: Up: 74KiB  Down: 29KiB  (reSessionID-b47c266b-12d6-4e88-8dc3-4af1dd7ecbb4)
Jobs completed: 20. Time elapsed: 2:09.6s.
Cache hits: 0%. Commands: 2 (cached: 0, remote: 0, local: 2)
Tests finished: Pass 7. Fail 0. Fatal 0. Skip 0. Build failure 0

# local reproduce
OC: P899142918
MAI: P899175452
# e2e (oc)

Differential Revision: D51957242

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115389
Approved by: https://github.com/dshi7, https://github.com/jackiexu1992, https://github.com/xuzhao9
2023-12-08 21:07:03 +00:00
e8e4141773 Revert "[Dynamo][6.1/N] Refactor out TorchInGraphFunctionVariable and improve heuristic (#113432)"
This reverts commit e61d6b42f0f4e4fa5bb816e03fb81e5bbcc9fa06.

Reverted https://github.com/pytorch/pytorch/pull/113432 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing dynamo tests in trunk e61d6b42f0, landrace? ([comment](https://github.com/pytorch/pytorch/pull/113432#issuecomment-1847787981))
2023-12-08 20:15:39 +00:00
d7180161b5 Revert "[SparseCsr] Remove triton sdpa skip after triton pin update (#109601)"
This reverts commit f64b10803f5fdd34e43fba7f421401bcfe247c19.

Reverted https://github.com/pytorch/pytorch/pull/109601 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing in trunk with this error ZeroDivisionError: integer division or modulo by zero ([comment](https://github.com/pytorch/pytorch/pull/109601#issuecomment-1847784383))
2023-12-08 20:12:53 +00:00
4186932bac Revert "[export] Remove runtime assertion pass (#115196)"
This reverts commit c163b3c03563c11640d4dbee504ef63101b019fe.

Reverted https://github.com/pytorch/pytorch/pull/115196 on behalf of https://github.com/atalman due to Broke internal test ([comment](https://github.com/pytorch/pytorch/pull/115196#issuecomment-1847778344))
2023-12-08 20:07:04 +00:00
317486edb0 [C10D] Decouple flight recorder from enableTiming (#115358)
RE #115301

Decoupling gives us a path to disable timing without disabling the
flight recorder.

Flight recorder is still useful for stuckness analysis without 'timing'.

Disabling timing makes it miss the 'started'
state that comes from using an extra nccl event at the start of each
collective.  It will also be missing 'duration_ms' of collectives, which
hasn't been landed yet, but is useful for timing/perf work more than
stuckness analysis.

Hopefully we can enable timing by default and leave both on, but it's
nice to have the flexibility for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115358
Approved by: https://github.com/fduwjj
2023-12-08 19:44:45 +00:00
suo
3d999d2f2c [export] optimize unflattener (#115364)
Unflattening was slow on the APS FM model (which has thousands of nn.EmbeddingBag modules).

A quick glance at the profile shows that 75% of unflattening time was spent copying this node list, which is immutable and globally shared, so simply passing it around as a tuple yields a 4x speedup.

Differential Revision: [D51929775](https://our.internmc.facebook.com/intern/diff/D51929775/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115364
Approved by: https://github.com/zhxchen17
2023-12-08 19:32:01 +00:00
494cb28231 [PyTorch] AOTI: add ArrayRefTensor (#112115)
This adds a shim for AOTI generated code to pretend a raw array works like an AtenTensorHandle. This allows parts of AOTI that generate uses of tensors to continue to be unaware of how those tensors are allocated. See the following diff/PR for usage.

Differential Revision: [D50570252](https://our.internmc.facebook.com/intern/diff/D50570252/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112115
Approved by: https://github.com/chenyang78, https://github.com/desertfire
2023-12-08 19:31:50 +00:00
a2b89154bf New swap function (#111747)
This PR proposes a new approach to solve the problem that nn and optim are linked only by Python object identity.
The idea is to have a function that can swap the content of two Tensors t1 and t2 while preserving all the old references.
This would allow us to swap the `model.weight` with a new Tensor (can be any subclass of Tensor and any TensorImpl (xla, sparse, nested tensorimpl would work)). The use within nn will be done in a follow up.

This is done by swapping the whole content of the PyObject and then putting back the fields associated with external references (refcount, gc tracking and weakrefs).
Note that we have to properly handle all the cases where there is memory used before the public pointer PyObject* and where the PyObject is bigger due to dict/weakref being inlined (older CPython version) or due to slots.

The main limitation of this approach is that the number of slots needs to match for the objects being swapped, which limits the use of slots in subclasses.

Draft right now to see what @colesbury thinks about doing this?
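A minimal sketch of the intended usage, assuming the helper ends up exposed as `torch.utils.swap_tensors` (that name is an assumption here; the PR itself does not name the public entry point):

```python
import torch

linear = torch.nn.Linear(4, 2)
old_ref = linear.weight                       # pre-existing external reference to the parameter
new_weight = torch.nn.Parameter(torch.zeros(2, 4))

# Swap the contents of the two tensors in place, preserving all old references.
torch.utils.swap_tensors(linear.weight, new_weight)

assert old_ref is linear.weight               # same Python object identity as before
assert torch.equal(linear.weight, torch.zeros(2, 4))  # ...but it now holds the new content
```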

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111747
Approved by: https://github.com/colesbury
2023-12-08 18:49:35 +00:00
5f2ff29569 Fix typo in https://pytorch.org/docs/stable/sparse.html (#115282)
Fixes #111473

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115282
Approved by: https://github.com/svekars
2023-12-08 18:31:33 +00:00
68f74dd162 Add python and C++ support for LPPool3d (#114199)
Add Python and C++ support for LPPool3d. Fixes #114114.
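A small usage sketch, assuming the new 3d module mirrors the existing LPPool1d/LPPool2d signature:

```python
import torch
import torch.nn as nn

pool = nn.LPPool3d(norm_type=2, kernel_size=2, stride=2)  # power-average (L2) pooling
x = torch.randn(1, 3, 8, 8, 8)                            # (N, C, D, H, W)
out = pool(x)
print(out.shape)                                          # expected: torch.Size([1, 3, 4, 4, 4])
```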

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114199
Approved by: https://github.com/mikaylagawarecki
2023-12-08 18:18:44 +00:00
1c3a4a864c Remove always restore (#115317)
Removes always restore, assuming that a HOP will cleanup any leftover state from tracing fwd + bwd

This required a minor change to the autograd fn variable higher order op. If we are tracing forward DON'T add the call_function node into the main graph, since we are only tracing it for the purposes of speculation. Instead return the result directly to be passed to the backward for speculation. This was the only observable side effect on the output graph that I found.

Test plan:
test_smoke_from_test_autograd in test_autograd_function.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115317
Approved by: https://github.com/voznesenskym, https://github.com/jansel
2023-12-08 18:17:37 +00:00
a3f93dc44d [EZ] [CD] Enable Triton 3.12 conda builds (#115424)
Currently there is a chicken-and-egg problem with enabling triton builds for the platform: the package depends on `torch`, so I can only submit this change a few days after https://github.com/pytorch/pytorch/pull/114819

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115424
Approved by: https://github.com/clee2000, https://github.com/seemethere
2023-12-08 18:10:45 +00:00
81b565b142 [CI] Fix a missing write_csv_when_exception problem (#115370)
Summary: Fix a problem shown in https://github.com/pytorch/pytorch/actions/runs/7124839624/job/19400589129 when a model times out.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115370
Approved by: https://github.com/eellison
2023-12-08 18:09:53 +00:00
c370450f02 [inductor] Remove hashing of tensor data for constants (#115356)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115356
Approved by: https://github.com/eellison
2023-12-08 18:05:34 +00:00
e61d6b42f0 [Dynamo][6.1/N] Refactor out TorchInGraphFunctionVariable and improve heuristic (#113432)
This is split from #113009; please check https://github.com/pytorch/pytorch/pull/113009#issuecomment-1804417925 for more details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113432
Approved by: https://github.com/ezyang, https://github.com/jansel
2023-12-08 17:15:14 +00:00
898554a3a3 [torchgen] Add logic in custom ops to return empty tensor (#114143)
Summary: Add two pieces of logic:

1. If the custom op returns a `Tensor` but doesn't have an out tensor as input, return an empty tensor.
2. If the custom op returns more than one Tensor and the number of out tensors does not match the number of returned Tensors, return a tuple of empty tensors.

Test Plan: Rely on new unit tests

Differential Revision: D51471651

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114143
Approved by: https://github.com/cccclai
2023-12-08 17:03:44 +00:00
b3b5bd51ea [raas][torch][jit] Allow not storing the optimized graph (#115381)
Summary:
GraphFunction internally stores the optimized graph after generating it, and then it is passed into the executor, which makes a copy of it. So we effectively store the optimized graph twice.

This diff allows setting a flag to not store the optimized graph inside the GraphFunction.

The code is a no-op until the flag is enabled.

Test Plan:
I ran SL with this on raas with good memory savings on the raas server. From the command line:

example model run
```
buck run mode/opt-clang  sigrid/predictor/client/localnet:run_model -- --model_id_to_load=953556500 --model_snapshot_to_load=362

I1207 11:04:58.657143 3556226 SigridPredictorLocalModelFactory.cpp:32] Memory usage for 953556500_362 is 255646 Kb
```

then with flag enabled:
```
buck run mode/opt-clang  sigrid/predictor/client/localnet:run_model -- --model_id_to_load=953556500 --model_snapshot_to_load=362 --torch_jit_do_not_store_optimized_graph=true
I1207 11:06:25.245779 3577383 SigridPredictorLocalModelFactory.cpp:32] Memory usage for 953556500_362 is 165167 Kb
```
So collectively, with this flag and the flag from D51950418:
```
buck run mode/opt-clang  sigrid/predictor/client/localnet:run_model -- --model_id_to_load=953556500 --model_snapshot_to_load=362 --torch_jit_do_not_store_optimized_graph=true --torch_jit_enable_profiling_graph_executor=false

I1207 11:09:17.502743 3592345 SigridPredictorLocalModelFactory.cpp:32] Memory usage for 953556500_362 is 114848 Kb
```

Differential Revision: D51931895

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115381
Approved by: https://github.com/malfet
2023-12-08 16:29:13 +00:00
f64b10803f [SparseCsr] Remove triton sdpa skip after triton pin update (#109601)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109601
Approved by: https://github.com/desertfire, https://github.com/amjames
2023-12-08 15:49:16 +00:00
72e58a756c Set markDynamoStrictTest in functorch/test_vmap.py (#115274)
We set markDynamoStrictTest in most of functorch/test_vmap.py. This
revealed many existing failing tests, so we mark those all as expected
failures or skip them.

Test Plan:
- CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115274
Approved by: https://github.com/guilhermeleobas, https://github.com/kshitij12345
ghstack dependencies: #115267, #115276, #115268
2023-12-08 14:51:19 +00:00
cc8f6f56dc [quant][pt2e] Add convert callback to Observer module (#115001)
Summary:
This is to allow easier extension of the quant workflow in the future, as we are seeing more
diverse ways of doing quantization.

Putting this up for feedback first.

Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_observer_callback

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115001
Approved by: https://github.com/kimishpatel
2023-12-08 13:47:37 +00:00
ca15671c30 Fix failing test_invalid_input_csr_large (#114940)
The test introduced in #102530 has a bug:
Construction of `crow_indices` raises an exception: "value cannot be converted to type int32 without overflow" which is obviously correct.
This makes the test fail which is supposed to check for an overflow in nnz.
Fix by making the construction of `crow_indices` pass although with an invalid value which would error later but triggers the correct check.

Given that I'm not sure it is even worth checking for an overflow in nnz:
- `crow_indices[..., -1] == nnz` is already enforced
- this can only hold if `crow_indices` is able to hold `nnz` without overflow
- `col_indices` has to be of the same type as `crow_indices`
- Hence the type of `col_indices` has to be able to hold the value of `nnz`

So in conclusion: The situation being checked for cannot reasonably occur

CC @pearu as the test author for additional insight

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114940
Approved by: https://github.com/pearu, https://github.com/cpuhrsch
2023-12-08 11:55:21 +00:00
23fa9621e4 [DeviceMesh] Rename _device_mesh.py to device_mesh.py to prepare for beta (#115099) (#115193)
Summary:

Rename _device_mesh.py to device_mesh.py, update all callsites, add documentation.
We created stubs for the public class and methods in torch.distributed.device_mesh so that torch.distributed.device_mesh can be imported whether or not distributed is available.

Original diff reverted: D51629761
Original PR reverted: https://github.com/pytorch/pytorch/pull/115099
Prior to landing, all CI signals passed. Shipit added the "ci/trunk" label to the PR, did not wait for it, and went ahead with committing. More context can be found in the reverted PR above.

Test Plan: CI.

Differential Revision: D51861018

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115193
Approved by: https://github.com/fegin
2023-12-08 08:44:32 +00:00
6c585de076 [CUDA] baddmm should fall back to addmm for batch=1 (#114992)
I.e. it feels reasonable to always call `at::cuda::gemm` rather than `at::cuda::bgemm` when num_batches == 1
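A small Python-level sketch of the equivalence the fallback relies on (the actual change is in the C++ CUDA dispatch, not in Python):

```python
import torch

a = torch.randn(1, 64, 32)                   # batch size 1
b = torch.randn(1, 32, 16)

batched = torch.bmm(a, b)                    # the batched (bgemm-style) path
single = torch.mm(a[0], b[0]).unsqueeze(0)   # the plain gemm on the 2D slices

torch.testing.assert_close(batched, single)  # identical result, so the fallback is safe
```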
After the change, the results of benchmarking torch built with CUDA-12 using the [following perf script](https://gist.github.com/malfet/6a17156d7f5663b8b12054a1beff3fe1) on A100 are as follows:
|      Shape     |  bmm_time |  mm_time  | slow down (%) |
| -------------- | --------- | --------- | ------------- |
|    1x1x4096    |   14.18   |   14.31   |     -0.89     |
|    1x1x8192    |   14.37   |   14.37   |     -0.05     |
|   1x1x16384    |   14.03   |   14.12   |     -0.68     |
|   1x1x32768    |   14.19   |   14.24   |     -0.35     |
|   1x1x65536    |   14.85   |   14.52   |     2.30      |
|   1x1x131072   |   14.03   |   14.07   |     -0.33     |
|  128x128x128   |   11.34   |   11.06   |     2.56      |
|  256x256x256   |   14.85   |   14.40   |     3.15      |
|  512x512x512   |   27.22   |   27.22   |     -0.01     |
| 1024x1024x1024 |  129.66   |  129.50   |     0.12      |
| 2048x2048x2048 |  972.18   |  973.24   |     -0.11     |
|  129x127x129   |   11.21   |   11.25   |     -0.39     |
|  257x255x257   |   14.50   |   14.43   |     0.44      |
|  513x511x513   |   29.01   |   29.01   |     0.01      |
| 1025x1023x1025 |  137.65   |  137.64   |     0.01      |
| 2049x2047x2049 |  982.58   |  982.65   |     -0.01     |
|  4097x3x4097   |   86.65   |   86.64   |     0.01      |
|  8193x3x8193   |  384.02   |  383.96   |     0.02      |
| 16385x3x16385  |  1106.73  |  1107.32  |     -0.05     |
| 32769x3x32769  |  4739.49  |  4739.48  |     0.00      |
| 65537x3x65537  | 17377.78  | 17378.74  |     -0.01     |
|  4097x5x4097   |   87.09   |   87.12   |     -0.03     |
|  8193x5x8193   |  301.38   |  301.36   |     0.01      |
| 16385x5x16385  |  1107.38  |  1108.04  |     -0.06     |
| 32769x5x32769  |  4743.73  |  4744.07  |     -0.01     |
| 65537x5x65537  | 17392.32  | 17395.42  |     -0.02     |
|  4097x7x4097   |   87.17   |   87.19   |     -0.02     |
|  8193x7x8193   |  301.94   |  302.00   |     -0.02     |
| 16385x7x16385  |  1107.17  |  1106.79  |     0.03      |
| 32769x7x32769  |  4747.15  |  4747.13  |     0.00      |
| 65537x7x65537  | 17403.85  | 17405.02  |     -0.01     |

Fixes perf problem reported in https://github.com/pytorch/pytorch/issues/114911
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114992
Approved by: https://github.com/Skylion007, https://github.com/eqy
2023-12-08 07:53:17 +00:00
4d70802133 [c10d] Use TCPStore to record NCCL timeout and dump debug info (#115226)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115226
Approved by: https://github.com/wconstab
2023-12-08 06:19:40 +00:00
2c84616a94 Move the shape env symint cache to a symbol cache, better routing for subclass fakification [re-pr 115227] (#115396)
Context:

Joel sees that unless he manually writes to the fake tensor memo, fakification seems to produce spurious symbols! Voz (me) objects, saying that not only is directly writing to memo a bad pattern, recursively invoking fakification on tensor subclass elements in dynamo should suffice! Joel says that while he morally agrees, he has a test proving otherwise, a most perplexing situation.

Digging in, I figured out that while *we were* making fake tensors correctly, with properly cached symbols and the like, we were *also* incorrectly creating spurious symbols, leading the test to fail.

Before this PR, we would only cache source->symint. This was generally fine, but meant that you would create a symbol, then potentially throw it out due to symint cache. For example, the cache hit flow was:

make a symbol (ex: s2) -> use it to make a symint -> hit the cache (my_source-s1)

Now, in this example,  you have a symbol in your val_to_var/var_to_val (s2) that is unused. This is sound, but wasteful, and furthermore, misleading.

This was causing a test added in a PR in this stack to fail, specifically, because the test was using

```
curr_var_to_val = {
    str(k): v for k, v in context.fake_mode.shape_env.var_to_val.items()
}
```

To validate that no new symbols were being created (that is, that recursively creating fake tensors for subclasses was working).

The test is correct, but the implementation of caching would make (by this method of observation) cache hits look like cache misses.

So, the fix here is to move the cache up to be a general symbol cache, rather than only a cache for symints.

The initial implementation did that! But then, it ran into some interesting errors when it came to replay. When replaying symbol creation, behaviors would diverge in the new shape env! How could that be? The answer is because creating a new shape_env resulted in us replaying symbol creation... but with a cache from a different shape env! This was short circuiting symbol creation - and so, adding an extra layer to the cache for id(shape_env) fixes the problem.
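A schematic of the two-level cache described above (illustrative only; not the actual dynamo/fake-tensor code, and the names are made up):

```python
# Outer key: the shape env identity; inner key: the symbol's source.
_symbol_cache = {}

def get_or_create_symbol(shape_env, source, create_fn):
    per_env = _symbol_cache.setdefault(id(shape_env), {})
    if source not in per_env:
        per_env[source] = create_fn()   # only allocate a new symbol on a true cache miss
    return per_env[source]
```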

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115396
Approved by: https://github.com/mlazos
2023-12-08 05:02:21 +00:00
d0f161eae4 [vision hash update] update the pinned vision hash (#111264)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111264
Approved by: https://github.com/pytorchbot
2023-12-08 03:33:33 +00:00
9521331ba5 [pytorch] Multiprocessing api to use sigkill if sigterm doesn't kill the process (#115219)
Summary:
[pytorch] Multiprocessing api to use sigkill if sigterm doesn't kill the process
We have seen a handful of training jobs stuck where one of the trainers goes down
while the others are stuck in C++ land and hence not handling the SIGTERM.
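A rough sketch of the escalation pattern (not the actual torch.multiprocessing code; timeouts and logging are simplified):

```python
import os
import signal
import time

def terminate(pid, grace_period=30.0):
    os.kill(pid, signal.SIGTERM)                 # ask the process to exit
    deadline = time.monotonic() + grace_period
    while time.monotonic() < deadline:
        try:
            os.kill(pid, 0)                      # raises if the process no longer exists
        except ProcessLookupError:
            return                               # it honored SIGTERM
        time.sleep(0.1)
    os.kill(pid, signal.SIGKILL)                 # stuck in C++ land: force-kill
```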

Test Plan: Manually validated by attaching gdb to one of the processes and sending a kill -9 to another. Saw the log ```WARNING] Unable to shutdown process 4422 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL```

Differential Revision: D51862545

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115219
Approved by: https://github.com/wconstab, https://github.com/fduwjj
2023-12-08 02:26:19 +00:00
459845b82d [cuDNN][cuDNN frontend] Bump cudnn_frontend submodule to 1.0 (#115218)
A prerequisite for cuDNN flash attention #113713 .

CC @malfet @atalman @drisspg @ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115218
Approved by: https://github.com/drisspg, https://github.com/malfet
2023-12-08 02:24:26 +00:00
e071d6a9eb [Nested tensor]avoid using shape in python subclass NT, use _size instead (#115371)
Summary:
Calling tensor.shape will go through torch_dispatch, which adds more overhead.

Testing overhead difference in "NT + NT" operation:
**Before:**
the add operation takes ~300us
{F1167963824}
**After:**
the add operation takes ~200us
 {F1167964056}

Test Plan: unit tests in test_nestedtensor

Differential Revision: D51949135

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115371
Approved by: https://github.com/soulitzer, https://github.com/jbschlosser
2023-12-08 02:08:36 +00:00
5432088098 Adds Checkpointer Wrapper for DCP [3/N] (#114603)
Adds a useful high level wrapper for calling `dist.save/load` with the correct storage readers and writers.

Instead of doing:

```
DCP.save(
    state_dict={...},
    storage_writer=StorageWriter(...)
)

DCP.load(
    state_dict={...},
    storage_reader=StorageReader(...)
)
```

We can now do:

```
checkpointer = Checkpointer(...)

checkpointer.save(state_dict={...})
checkpointer.load(state_dict={...})
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114603
Approved by: https://github.com/fegin, https://github.com/wz337
2023-12-08 01:03:21 +00:00
3b01f30b20 Prevent invalid pointwise ops on jagged with transposed ragged dim (#115190)
TODO: tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115190
Approved by: https://github.com/soulitzer, https://github.com/ani300
2023-12-08 00:54:03 +00:00
784e20e3d7 [C10D] Make dumpPipe use async launcher (#115375)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115375
Approved by: https://github.com/fduwjj
ghstack dependencies: #115332
2023-12-08 00:16:22 +00:00
bb7746275c Add is_integer to SymFloat (#114703)
Fixes #114676

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114703
Approved by: https://github.com/peterbell10
2023-12-07 23:23:53 +00:00
f5919335db Fix _load_from_state_dict for num_batches_tracked in batchnorm (#115285)
I approved https://github.com/pytorch/pytorch/pull/110850 which did the following

Previously:
`num_batches_tracked` not in state_dict when doing `m.load_state_dict(state_dict)` --> always overwrite module's `num_batches_tracked` in `load_from_state_dict` with a 0 cpu tensor

Now:
`num_batches_tracked` not in state_dict loaded when doing `m.load_state_dict(state_dict)` --> only overwrite module's `num_batches_tracked`  in `load_from_state_dict` with a 0 cpu tensor if module does not have `num_batches_tracked`

This causes the following issue:

```
with torch.device('meta'):
     m = BatchNorm(...)
m.load_state_dict(state_dict, assign=True)
```

If `num_batches_tracked` is not in `state_dict`, since the module's `num_batches_tracked` is present on the meta device, it is not overwritten with a 0 cpu tensor. When compiling, this error is raised

```
AssertionError: Does not support mixing cuda+meta
```

I am not sure whether the explicit check for meta device makes sense as a fix, will add testing if this fix is ok

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115285
Approved by: https://github.com/albanD
2023-12-07 22:48:26 +00:00
18d57dde2d Remove remaining uses of copy_graphstate (#115321)
After auditing higher_order_ops.py, I found that the graph checkpoints were only used in the event of an exception, so they are safe to remove because we now restart analysis in that case.

To make this clearer the current state is the following:
Checkpoint side effects
Capture subgraph
if graph break:
  restore as usual
else:
  throw away inlining translator and subgraph tracer
Restore side effects

This will change to the following after this change:
Checkpoint side effects
Capture subgraph:
if graph break:
  restart analysis
else:
  throw away inlining translator and subgraph tracer
Restore side effects

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115321
Approved by: https://github.com/jansel, https://github.com/zou3519
2023-12-07 22:35:02 +00:00
ecba053cff [quant][pt2e] XNNPACKQuantizer skip inserting observers for non-float Tensors (#114999)
Summary:
att

Test Plan:
python test/test_quantization.py -k test_add_mul_long

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114999
Approved by: https://github.com/kimishpatel, https://github.com/guangy10
2023-12-07 22:13:36 +00:00
dacf5d6e92 [DTensor] Remove assert to allow tensor sharding dimension < Shard(x).ndim (#115114)
Consolidates changes made by @yoyoyocmu in https://www.internalfb.com/diff/D51821717.
Remove assert to allow tensor dimension < Shard(x).ndim. With the current padding, we do support this already.

Follow up: we will still need to fix the size mismatch and `full_tensor()` hang when tensor is uneven-sharded.
Created issue here: https://github.com/pytorch/pytorch/issues/115310

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115114
Approved by: https://github.com/yoyoyocmu, https://github.com/wanchaol
2023-12-07 21:57:30 +00:00
7562b45454 Reland "[C10D] Use future for flight recorder dump (#115176)" (#115332)
Replaces the "always sleep 30 sec before abort" with "wait up to 30 sec
for the future to complete then abort". The difference in this case is
the abort happens as soon as the dump finishes up to a maximum, instead
of always waiting the maximum.

Allows multiple calls to dump, which will be serialized.

Renames tryWriteDebugInfo to launchAsyncDebugDump in spirit of the
change to support more than one launch and to always launch rather than
only launching on the first call.

Adds a test for dumping on timeout.

This reverts commit ac7d14baad53fa7d63119418f760190f289d8a01.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115332
Approved by: https://github.com/fduwjj
2023-12-07 21:20:58 +00:00
fd79995fd6 [export] Don't skip output caching for now. (#115374)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115374
Approved by: https://github.com/tugsbayasgalan
2023-12-07 20:31:30 +00:00
6a6a1e3ef7 [dtensor] update README to make all example runnable (#115365)
as titled, also add torchrun commands

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115365
Approved by: https://github.com/fegin
2023-12-07 20:23:37 +00:00
c06ab369e8 [OAT] toggle for forcing matmul precision matching (#115326)
Summary: Add a toggle to inductor config that will force matmul precision dtypes to match between cublas and triton backends for addmm, bmm, and mm operations.

Test Plan: CI + model launches

Reviewed By: jansel

Differential Revision: D51442001

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115326
Approved by: https://github.com/jansel
2023-12-07 20:22:12 +00:00
7faa67f6ef [inductor] enable mkldnn op weight pre-packing on aarch64 (#115037)
This PR enables the fx passes and mkldnn optimizations for aarch64. It improves BERT inference performance by up to 5.8x on an AWS c7g instance when comparing torch.compile() against the no-compile path. This is enabled when PyTorch is built with the USE_MKLDNN_ACL option for aarch64.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115037
Approved by: https://github.com/jgong5, https://github.com/malfet
2023-12-07 19:58:38 +00:00
7201edc0a5 Fix RNN class constructor signature (#115341)
Fixes #114617

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115341
Approved by: https://github.com/mikaylagawarecki
2023-12-07 19:46:33 +00:00
21cca2494d Move test_multi_tensor_optimizers to use OptimizerInfos (#114797)
This PR aims for parity+ compared to the old testing for the simplest foreach test case.

Test coverage increase: we now test foreach optimizers with CPU as well as on GPU.

Before:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (19136605)]$ python test/test_optim.py -v -k test_multi_tensor_optimizers
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
test_multi_tensor_optimizers (optim.test_optim.TestOptim) ... ok

----------------------------------------------------------------------
Ran 1 test in 7.253s

OK
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (19136605)]$
```

Now, we get granular test cases at the cost of overhead!
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (19136605)]$ python test/test_optim.py -v -k test_foreach
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
test_foreach_ASGD_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_Adadelta_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_Adagrad_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_AdamW_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_Adam_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_Adamax_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_NAdam_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_RAdam_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_RMSprop_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_Rprop_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_SGD_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_ASGD_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_Adadelta_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_Adagrad_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_AdamW_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_Adam_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_Adamax_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_NAdam_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_RAdam_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_RMSprop_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_Rprop_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_SGD_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok

----------------------------------------------------------------------
Ran 22 tests in 30.954s

OK
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (19136605)]$
```

Why the increase in time?
Two reasons:
1. overhead. Any _CUDA_ *Info test (OpInfo, ModuleInfo, OptimizerInfo) will wrap itself with the `CudaNonDefaultStream` policy, and `CudaNonDefaultStream.__enter__` when called for the first time will go through all visible CUDA devices and synchronize each of them, thus forcing the CUDAContext to be init'd. Doing this for all 8 devices takes ~10-15s. Also, test parametrization costs a little overhead too, but not to the level init'ing CUDA context does.
2. We test more! Now, we have 72 configs (in the foreach optimizer world) whereas we only had 59 before.

Next steps for the future:
- consider adding more Tensor LR configs (like a Tensor LR without capturable in the single tensor case)
- this is likely the next PR or 2: migrate all uses of _test_derived_optimizers in test_optim to TestOptimRenewed

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114797
Approved by: https://github.com/albanD
2023-12-07 19:37:56 +00:00
16373bbc1f fix error message in pytorch (#115349)
Fixes https://dev-discuss.pytorch.org/t/typo-in-error-message/1709 .

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115349
Approved by: https://github.com/Skylion007
2023-12-07 19:27:29 +00:00
suo
eb4ba35b07 fix test_weak.py on mac (#115367)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115367
Approved by: https://github.com/albanD
2023-12-07 19:19:56 +00:00
b0a9641815 [Inductor][fx pass] Fuse pointwise operators in the post grad (#114778)
Summary: We construct a unified API that makes it easy to add pointwise ops to be batched in the post-grad pass.

Test Plan:
# unit test
```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:group_batch_fusion
```
Buck UI: https://www.internalfb.com/buck2/19b3f641-782f-4f94-a953-3ff9ce2cfa7b
Test UI: https://www.internalfb.com/intern/testinfra/testrun/1125900251953016
Network: Up: 67KiB  Down: 32KiB  (reSessionID-c2a80f26-8227-4f78-89fc-bcbda0ae8353)
Jobs completed: 18. Time elapsed: 1:19.8s.
Cache hits: 0%. Commands: 2 (cached: 0, remote: 0, local: 2)
Tests finished: Pass 6. Fail 0. Fatal 0. Skip 0. Build failure 0
# local reproduce
### cmf
P881792289
### igctr
### dsnn
### icvr

Reviewed By: xuzhao9

Differential Revision: D51332067

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114778
Approved by: https://github.com/xuzhao9
2023-12-07 19:04:03 +00:00
3a5fb0d456 markDynamoStrictTest in functorch/test_eager_transforms.py (#115268)
We're doing some more work around the functorch-torch.compile
interaction. The current state is that these tests might not get run in
the Dynamo CI shard. Using this decorator makes them actually run (by
resetting the Dynamo state before/after each test).
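A sketch of how the decorator is applied to a test class (the exact import path from the internal test utilities is an assumption here):

```python
from torch.testing._internal.common_utils import TestCase, run_tests, markDynamoStrictTest

@markDynamoStrictTest
class TestEagerTransforms(TestCase):
    def test_grad(self):
        ...  # Dynamo state is reset before/after each test in this class

if __name__ == "__main__":
    run_tests()
```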

Test Plan:
Wait for CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115268
Approved by: https://github.com/voznesenskym, https://github.com/guilhermeleobas
ghstack dependencies: #115267, #115276
2023-12-07 18:42:21 +00:00
a1bfaf75dc markDynamoStrictTest: add nopython flag, set default to False (#115276)
Default should be False because in general, we're interested
in reliability and composability: we want to check that
running PyTorch with and without Dynamo has the same semantics (with
graph breaks allowed).

Test Plan:
Existing tests?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115276
Approved by: https://github.com/voznesenskym
ghstack dependencies: #115267
2023-12-07 18:42:21 +00:00
2847045ed9 Set _dynamo.config.capture_func_transforms=False (#115267)
Because not all tests in the Dynamo shard actually run in CI, this implementation
has started to bitrot. Since our plan is to trace into the functorch
implementations instead of constructing a HOP
(which is what capture_func_transforms=True does), let's turn off this
config by default.

Test Plan:
- Tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115267
Approved by: https://github.com/voznesenskym, https://github.com/guilhermeleobas
2023-12-07 18:42:15 +00:00
3e66385ddd Add Work to distributed docs (#115172)
Summary:
Documenting the `Work` object

For a collective (broadcast, all_reduce, etc.) with async_op=True, we return a `Work` object on which users can call `.wait()`, `.is_success()`, among other things, but this class was not documented.
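For reference, a minimal example of how the `Work` handle is obtained and used (this assumes a process group has already been initialized, e.g. via torchrun):

```python
import torch
import torch.distributed as dist

t = torch.ones(4)
work = dist.all_reduce(t, async_op=True)  # returns a Work handle instead of blocking
# ... overlap communication with other computation here ...
work.wait()                               # block until the collective has completed
```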

Test Plan: Preview the docs build in OSS

Differential Revision: D51854974

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115172
Approved by: https://github.com/wconstab
2023-12-07 18:12:10 +00:00
ee8b33f7d5 Fixed crash when calling pad_packed_tensor when packed with cuda tensors and ensure_sorted=false due to indexing with tensors on different devices (#115028)
Fixes #115027

Fix in csrc as done in the python code [here](https://github.com/pytorch/pytorch/blob/main/torch/nn/utils/rnn.py#L338).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115028
Approved by: https://github.com/drisspg
2023-12-07 18:09:18 +00:00
suo
686a3e0bf0 [pytorch][PR] introduce WeakHashRef (#115216)
We would like weak dictionaries that have `torch.ScriptObject` keys. Similar to tensors, we need to override the behavior of the ref to do the right thing under comparison.

This change also makes it so that WeakIdKeyDictionary works with a pluggable ref_type.

Differential Revision: [D51828205](https://our.internmc.facebook.com/intern/diff/D51828205/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115216
Approved by: https://github.com/albanD
2023-12-07 17:48:11 +00:00
684ce1b21d Revert "Assert that output could only be the last node of the FX graph (#115179)"
This reverts commit 4a9fb9832abc00dff9729b7d7a9647b376882f38.

Reverted https://github.com/pytorch/pytorch/pull/115179 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/115179#issuecomment-1845776365))
2023-12-07 17:26:27 +00:00
dd6ae6d3b4 [HigherOrderOp] Remove additional get item calls in MapHigherOrder. (#115207)
As titled, this PR removes the unnecessary getitem call from the graph that's manipulated in MapHigherOrder. We want to get the first-dim slice of the original tensor for speculation, but using call_method would accidentally create a get_item call in the graph, so we avoid it by calling unpack_var_sequence on the input tensor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115207
Approved by: https://github.com/yanboliang
ghstack dependencies: #115115, #115204, #115205
2023-12-07 17:06:44 +00:00
8b74735878 [HigherOrderOp] make MapHigherOrder create map_impl call_function node instead of map (#115205)
We want to remove the map_wrapper and replace it with dynamo always on. This is the first step of this plan.

In this PR, we make dynamo directly generate map_impl nodes. This doesn't touch the eager logic yet. So the execution paths after this PR are: 1. `dynamo -> map_impl` when torch.compile is on (before this PR, it was `dynamo -> map_wrapper -> map_impl`), and 2. `map_wrapper -> map_impl` (this PR didn't touch the logic here).

The added TODO(yidi) is addressed in the following pr.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115205
Approved by: https://github.com/yanboliang
ghstack dependencies: #115115, #115204
2023-12-07 17:06:44 +00:00
be3efbebb6 [HigherOrderOp] make MapHigherOrder use should_flatten_output=True (#115204)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115204
Approved by: https://github.com/yanboliang
ghstack dependencies: #115115
2023-12-07 17:06:35 +00:00
998c87f93c [BE][HigherOrderOp] extract redundant code that unflattens the output (#115115)
We need this function to unflatten the variable tracker for HOPs that want pytree output support, e.g. map.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115115
Approved by: https://github.com/yanboliang
2023-12-07 17:06:28 +00:00
43f42bf3cb Updated docs for deprecated torch.set_default_tensor_type (#115041)
Added deprecation note for torch.set_default_tensor_type. Updated docs that referenced this method.
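As a hedged sketch of the migration the deprecation note points users toward (assuming the recommended replacements are the finer-grained setters):

```python
import torch

# Deprecated:
#   torch.set_default_tensor_type(torch.cuda.FloatTensor)
# Preferred, finer-grained equivalents:
torch.set_default_dtype(torch.float32)
torch.set_default_device("cuda" if torch.cuda.is_available() else "cpu")

x = torch.zeros(3)  # allocated with the default dtype/device configured above
```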

Fixes #113646.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115041
Approved by: https://github.com/janeyx99
2023-12-07 16:17:36 +00:00
441ecf03e2 Update gloo submodule (#115158)
Updates the submodule to pull in ROCm 6.0 related changes and a few minor updates in gloo.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115158
Approved by: https://github.com/jeffdaily, https://github.com/malfet
2023-12-07 15:55:08 +00:00
cyy
7b8084d1c6 [5/N] Fixes clang-tidy warnings in c10/core/*.h (#115232)
This PR continues to fix clang-tidy warnings for headers in c10/core.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115232
Approved by: https://github.com/Skylion007
2023-12-07 15:48:03 +00:00
d08b20d534 Update FlashAttention to v2.3.6 (#115313)
# Summary
This PR updates the FlashAttention code from 02ac572f3f (tag 2.3.2) to 92dd5703ec (tag 2.3.6).

I also think this should be cherry-picked into the 2.2.0 release, since there was a temporary ~15% perf regression for causal masking. It is not technically a regression since Flash wasn't released yet, but it would be nice to have in the release.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115313
Approved by: https://github.com/Skylion007
2023-12-07 15:47:16 +00:00
78b945484b [c10d] Extend NCCL communicator splitting to more use cases (#114916)
Previously we could only use `ncclCommSplit` when we knew all backends were connected on all shards (due to the need to perform a NOCOLOR split), which in practice meant we could only use it for subgroups that were copies of the entire world.

This change allows for specifying a bound device id to `init_process_group` which tells the pg and its backends that the specified device, and the specified device only, will be associated with this rank.

This guarantee lets us do an early connect (which we could not previously do due to how ProcessGroupNCCL infers devices based on tensors and not the rank number).  And by doing the early connect, we have the guarantee ranks are connected and can perform nocolor splits when needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114916
Approved by: https://github.com/kwen2501
2023-12-07 15:13:01 +00:00
a6736ac851 Add call to run_tests for a few tests (#115097)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115097
Approved by: https://github.com/wconstab, https://github.com/fduwjj
2023-12-07 08:27:40 +00:00
3c882925da Make subclass type instances constants (like UserDefinedClasses) (#115323)
As title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115323
Approved by: https://github.com/oulgen
2023-12-07 08:10:59 +00:00
5e3631db31 [DTensor] force re-compute sharding when normalized_shape differs in fwd layer norm (#115250)
**Summary**:
#114174 did not test the case where `elementwise_affine=False` (i.e. `weight` and `bias` are `None`) and this test would fail due to cached sharding propagation. The difference on sharding prop between these cases is, when `weight` and `bias` are None, the forward layer norm op will be recognized as a "static shape op" and `propagate_op_sharding` will be applied rather than `propagate_op_sharding_non_cached`. A fix is to force re-compute sharding when `normalized_shape` changes by setting op schema's `RuntimeSchemaInfo`'s `static_argnum` to include `normalized_shape` (i.e. 1)

**Test**:
pytest test/distributed/_tensor/test_math_ops.py -s -k layer_norm

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115250
Approved by: https://github.com/wanchaol
2023-12-07 07:44:06 +00:00
622688fab9 [export] Fix graph output mismatch issue with constant outputs. (#115280)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115280
Approved by: https://github.com/tugsbayasgalan
2023-12-07 06:11:08 +00:00
e1f159e6b2 Remove redundant API named is_int_list (#115136)
Fixes #114933

As the title stated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115136
Approved by: https://github.com/zou3519
2023-12-07 04:55:13 +00:00
5309ac1b98 Add test case to prove non-strict export supports external call (#115245)
The current non-strict test cases (added in #114697) are already supported by strict mode, so they can't demonstrate the incremental value of non-strict mode. How about adding test cases that fail in strict mode but pass in non-strict mode?

Test Plan:
python test/export/test_export.py -k test_external_call_non_strict_real_tensor
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115245
Approved by: https://github.com/tugsbayasgalan, https://github.com/zhxchen17
2023-12-07 04:51:15 +00:00
a93b9ee9d8 [quant][be] Add a test for per channel quant for groupwise conv (#115224)
Summary:
just making sure this works

Test Plan:
python test/test_quantization.py -k test_groupwise_per_channel_quant

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115224
Approved by: https://github.com/andrewor14
2023-12-07 04:46:20 +00:00
b7eb9b1e7e [Autotune] Enable register pressure handling logic for H100. (#115295)
I have seen the register pressure handling logic help performance on H100 for a couple of kernels. My local runs of Huggingface and timm_models both show neutral results.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115295
Approved by: https://github.com/jansel
2023-12-07 04:37:44 +00:00
f55ab176fc [OAT] move matmul precision out of system info (#115242)
Summary: move matmul precision out of the system info (system hash) and into the cache in preparation for switching precisions during compile

Test Plan: CI

Reviewed By: jansel

Differential Revision: D51442000

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115242
Approved by: https://github.com/jansel
2023-12-07 04:30:06 +00:00
7ec145bfed [Quant] [PT2] Fix XNNPACKQuantizer set_module_type issue (#115252)
**Summary**
Fix issue https://github.com/pytorch/pytorch/issues/115251. The root cause is that we passed the `filter_fn` parameter of `find_sequential_partitions` in the wrong position; use a keyword argument to fix this.

**Test Plan**
```
python -u -m pytest -s -v test_quantization.py -k test_set_module_type_case_2
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115252
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
2023-12-07 03:08:20 +00:00
6c0a4ced53 [export] Add math.* ops to pass base (#115271)
Fixes https://github.com/pytorch/pytorch/issues/115209

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115271
Approved by: https://github.com/ydwu4
2023-12-07 02:47:04 +00:00
d7160c9223 Handle potential ValueError exception when stringifying signals (#114696)
On some systems it is possible to receive a signal that does not have a name.  Rare, but possible.  This prevents our error handler from crashing and instead properly reports the signal.
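A small sketch of the defensive pattern (not the exact elastic/multiprocessing code):

```python
import signal

def describe_signal(signum):
    try:
        return signal.Signals(signum).name  # raises ValueError for unnamed signals
    except ValueError:
        return f"<unknown signal {signum}>"
```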

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114696
Approved by: https://github.com/xmfan
2023-12-07 02:10:30 +00:00
ac7d14baad Revert "[C10D] Use future for flight recorder dump (#115176)"
This reverts commit 0e07e3dbe434ce31a5aea634628c7d39747f265f.

Reverted https://github.com/pytorch/pytorch/pull/115176 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but the test_timeout_dumps is failing in trunk 0e07e3dbe4 ([comment](https://github.com/pytorch/pytorch/pull/115176#issuecomment-1844076455))
2023-12-07 02:09:58 +00:00
3a18211622 Guard on subclass inner tensors (#114965)
This PR introduces guarding on subclass inner tensors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114965
Approved by: https://github.com/voznesenskym
ghstack dependencies: #114311, #115212
2023-12-07 01:47:48 +00:00
c163b3c035 [export] Remove runtime assertion pass (#115196)
Reland of https://github.com/pytorch/pytorch/pull/111949/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115196
Approved by: https://github.com/avikchaudhuri
2023-12-07 01:44:11 +00:00
73c0035160 Add reset_storage method to FunctionalTensorWrapper (#115235)
In certain edge cases when using lazy tensors, the base tensor stored in the `FunctionalStorageImpl` and the `value_` tensor stored in the `FunctionalTensorWrapper` diverge. For instance, take this simple example
```python
class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = torch.nn.Linear(4, 2, bias=False)

    def forward(self, x):
        return x @ self.fc1.weight.transpose(0, 1)

with torch.device("lazy"):
    model = Model()

    x = torch.ones(4)
    out = model(x)
```
The call to `transpose` on the lazily initialized weight `fc1.weight` applies a view op on the functional tensor which only gets propagated to the functional tensor wrapper and not the base tensor in the storage. Thus, causing them to diverge.

To fix this behaviour, we need to reset the functional tensor's storage. To facilitate this, we add a `reset_storage` method to `FunctionalTensorWrapper` which clears away the old storage and view metas.

CC: @behzad-a @GlebKazantaev @wconstab @bdhirsh
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115235
Approved by: https://github.com/bdhirsh
2023-12-07 01:32:01 +00:00
cyy
4e9fe496cd Remove c10::either (#112733)
Time to remove it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112733
Approved by: https://github.com/albanD
2023-12-07 01:31:53 +00:00
240f4b2d25 make __lookup_backend return None when cache misses (#114766)
Fixes #114674. The error occurs because cached_backends is a thread-local object; when it's accessed from another thread, we get a cache miss. The naive fix is to just return None and recompile on a cache miss. This could also be related to making dynamo more thread-safe, but I'm not sure whether there is an ongoing effort or not.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114766
Approved by: https://github.com/IvanYashchuk, https://github.com/Neilblaze, https://github.com/jansel
2023-12-07 00:25:01 +00:00
7457a5f4be [inductor] adapt to the get_max_simd_tflops Triton API change (#115288)
Differential Revision: D51907617

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115288
Approved by: https://github.com/hl475, https://github.com/chenyang78
2023-12-07 00:22:06 +00:00
ae5365819d [ONNX] Extend test_fx_op_consistency.py to cover ExportedProgram model type (#114886)
This PR adds `ExportedProgram` coverage to `test_fx_op_consistency.py`, which helps us identify necessary but missing io_steps.
Next, we should refactor the tests to actually cover all ops supported by the registry.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114886
Approved by: https://github.com/thiagocrepaldi
2023-12-07 00:03:23 +00:00
3642f29a64 DistributedDataParallel._post_forward, fix return (#114678)
Fix `return` in case of `_delay_all_reduce_all_params`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114678
Approved by: https://github.com/Skylion007, https://github.com/fegin
2023-12-06 23:44:52 +00:00
0e07e3dbe4 [C10D] Use future for flight recorder dump (#115176)
Replaces the "always sleep 30 sec before abort" with "wait up to 30 sec
for the future to complete then abort".  The difference in this case is
the abort happens as soon as the dump finishes up to a maximum, instead
of always waiting the maximum.

Allows multiple calls to dump, which will be serialized.

Renames `tryWriteDebugInfo` to `launchAsyncDebugDump` in spirit of the
change to support more than one launch and to always launch rather than
only launching on the first call.

Adds a test for dumping on timeout.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115176
Approved by: https://github.com/zdevito
2023-12-06 23:42:19 +00:00
0757e2ba84 [aotautograd] Fix an output shape error when inputs are aliased (#115279)
Summary: For https://github.com/pytorch/pytorch/issues/97083: when an output
is marked as OutputType.is_input but a synthetic base is constructed
because of aliased inputs, we may need to update the output type to
OutputType.alias_of_input.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115279
Approved by: https://github.com/bdhirsh
2023-12-06 23:10:21 +00:00
7e0e124a5d Automated submodule update: FBGEMM (#115103)
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: dbc3157bf2

Test Plan: Ensure that CI jobs succeed on GitHub before landing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115103
Approved by: https://github.com/malfet
2023-12-06 22:47:40 +00:00
83cb6a75ad [dynamo] add list iterator contains (#115237)
Fixes https://github.com/pytorch/pytorch/issues/115236

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115237
Approved by: https://github.com/jansel
2023-12-06 22:26:16 +00:00
71bf4f3b87 [CI] Add torch/_functorch/_aot_autograd to auto-label rule (#115283)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115283
Approved by: https://github.com/bdhirsh
2023-12-06 20:07:53 +00:00
1489e4bcf3 [Quant] [PT2] Enable batchnorm in _move_exported_model_to_eval (#114547)
**Summary**
Add standalone batchnorm into `_move_exported_model_to_eval` to move it from training mode into eval mode

**Test Plan**
```
python -m pytest test_mkldnn_pattern_matcher.py -k test_qat_bn_conv2d
python -u -m pytest -s -v test_quantize_pt2e.py -k test_bn_move_exported_model_to_eval
```

Differential Revision: [D51853407](https://our.internmc.facebook.com/intern/diff/D51853407)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114547
Approved by: https://github.com/jgong5, https://github.com/andrewor14
2023-12-06 19:51:22 +00:00
c99db5617a Introduce general metadata cache to jagged layout NestedTensor (#115212)
Slight refactor to:
* lazily compute min / max seq_len used for flash. this avoids unnecessary graph breaks / specialization when we're not accessing these
* store min / max seq_len in a general `metadata_cache`. condensing these should make it easier to avoid specializing on these and others we may add in the future
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115212
Approved by: https://github.com/soulitzer, https://github.com/ani300
ghstack dependencies: #114311
2023-12-06 19:40:35 +00:00
b6de337d16 [funcol] a few optimizations to funcol (#113324)
Apply a few optimizations to funcol:

- for allgather on a non-0 dim, the resulting tensor already needs to access
data in order to do torch.cat, so we sync-wait here so that we don't
need to go through ACT dispatch for chunk + cat altogether
- add a fast-return path for aten.view, as it's a commonly hit op for
view-related ops

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113324
Approved by: https://github.com/XilunWu
2023-12-06 19:25:35 +00:00
2cf0cf8137 [dynamo / DDP] - lazily compile submodules - to propagate real tensor strides to backend compiler (#114154)
Fixes https://github.com/pytorch/pytorch/issues/113812, https://github.com/pytorch/pytorch/issues/102591, Probably fixes: https://github.com/pytorch/pytorch/issues/113740, https://github.com/pytorch/pytorch/issues/113786, https://github.com/pytorch/pytorch/issues/113788

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114154
Approved by: https://github.com/wconstab, https://github.com/yf225
2023-12-06 18:50:14 +00:00
967863d91d [export][refactor][3/n] Move unlift to separate file (#114787)
Differential Revision: [D51823960](https://our.internmc.facebook.com/intern/diff/D51823960)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114787
Approved by: https://github.com/ydwu4
ghstack dependencies: #114764, #114768
2023-12-06 16:46:47 +00:00
0ab57ee7ea [export][refactor][2/n] Move tracing logic (#114768)
2/n of refactoring export code:

* Moved tracing logic from torch/_export/__init__.py to torch/export/_tracer.py

Differential Revision: [D51823961](https://our.internmc.facebook.com/intern/diff/D51823961)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114768
Approved by: https://github.com/ydwu4
ghstack dependencies: #114764
2023-12-06 16:46:47 +00:00
53bf8cfcf9 [export][refactor][1/n] Move dynamic shapes logic (#114764)
1/n of refactoring export code:
* Moved dynamic shapes/constraints/dynamic_dims logic in torch/_export/__init__.py and torch/export/__init__.py to torch/export/dynamic_shapes.py

Differential Revision: [D51823962](https://our.internmc.facebook.com/intern/diff/D51823962)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114764
Approved by: https://github.com/ydwu4
2023-12-06 16:46:38 +00:00
5f939e32e3 [CI] Log load_model failures in csv (#114784)
Summary: Right now when load_model fails (either because of loading error or validation eager run failure), the result won't be logged in generated csv files. Let's log them in csv so that they are monitored by the expected results checking.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114784
Approved by: https://github.com/malfet
2023-12-06 15:19:16 +00:00
67c8ad7285 Fix autograd.Function x enum input x torch.compile (#115206)
Fixes https://github.com/pytorch/pytorch/issues/114777. We treat Enums
like we do ConstantVariable.

Test Plan:
New test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115206
Approved by: https://github.com/yanboliang
ghstack dependencies: #115185, #115186, #115187
2023-12-06 15:18:25 +00:00
233ce0d24b Support GPU annotations for auto-trace jobs, similar to on-demand support (#114638)
Summary: When using auto_trace, gpu_user_annotation is not shown in the results. Fixing this by including `GPU_USER_ANNOTATION` in `kCudaTypes`.

Differential Revision: D51597995

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114638
Approved by: https://github.com/aaronenyeshi
2023-12-06 09:38:13 +00:00
d4c79a3078 Add an attention bias subclass for a lower right causal masking (#114823)
# Summary
This PR introduces a new Tensor subclass that is designed to be used with torch.nn.functional.scaled_dot_product_attention. Currently we have a boolean `is_causal` flag that allows users to do causal masking without the need to actually create the "realized" attention bias and pass it into sdpa. We originally added this flag since there is native support in both fused kernels we support. This provides a big performance gain (the kernels only need to iterate over ~0.5x the sequence), and for very large sequence lengths this can provide very large memory improvements.

The flag was introduced early on in the kernel development, and at the time it implicitly meant "upper_left" causal attention. This distinction only matters when the attention_bias is not square. For a more detailed breakdown see: https://github.com/pytorch/pytorch/issues/108108. The kernels' default behavior has since changed, largely due to the rise of autoregressive text generation, and unfortunately changing the flag's meaning now would lead to a BC break. In the long term it may actually be beneficial to change the default meaning of `is_causal` to represent lower_right causal masking.

The larger theme is laid out here: https://github.com/pytorch/pytorch/issues/110681. The thesis is that there is a lot of innovation in SDPA revolving around the attention_bias being used. This is the first of hopefully a few more attention_biases that we would like to add. The next interesting one would be `sliding_window`, which is used by the popular Mistral model family.
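A usage sketch for the non-square case described above; the exact public location (`torch.nn.attention.bias.causal_lower_right`) is an assumption here:

```python
import torch
import torch.nn.functional as F
from torch.nn.attention.bias import causal_lower_right

q = torch.randn(2, 8, 4, 64)   # (batch, heads, q_seq_len, head_dim)
k = torch.randn(2, 8, 6, 64)   # k/v sequence longer than q -> non-square bias
v = torch.randn(2, 8, 6, 64)

bias = causal_lower_right(q.size(-2), k.size(-2))              # lower-right aligned causal mask
out = F.scaled_dot_product_attention(q, k, v, attn_mask=bias)
```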

Results from benchmarking: I improved the meff_attention perf, hence the slightly decreased max perf.
```Shell
+---------+--------------------+------------+-----------+-----------+-----------+-----------+----------------+----------+
|  Type   |      Speedup       | batch_size | num_heads | q_seq_len | k_seq_len | embed_dim |     dtype      | head_dim |
+---------+--------------------+------------+-----------+-----------+-----------+-----------+----------------+----------+
| Average | 1.2388050062214226 |            |           |           |           |           |                |          |
|   Max   | 1.831672915579016  |    128     |    32     |   1024    |   2048    |   2048    | torch.bfloat16 |    64    |
|   Min   | 0.9430534166730135 |     1      |    16     |    256    |    416    |   2048    | torch.bfloat16 |   128    |
+---------+--------------------+------------+-----------+-----------+-----------+-----------+----------------+----------+
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114823
Approved by: https://github.com/cpuhrsch
2023-12-06 08:29:26 +00:00
4a9fb9832a Assert that output could only be the last node of the FX graph (#115179)
Test Plan: unit tests

Differential Revision: D51856848

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115179
Approved by: https://github.com/Chillee
2023-12-06 08:17:16 +00:00
fcf6a76108 [aot_inductor][pass] fuse parallel linear based on pre grad aten IR (#114776)
Summary:
This work is for PT2 inference. Since the IR from Export will change to pre-grad aten IR in a few months, we need to start this work now. Here is what I do in this diff:
1) Copy the fuse parallel linear pass to the fb folder and adapt it to aten IR. We still want to keep the original `group_batch_fusion.py` because it is still used in training. In the future, when PT2 training decides to retire the torch-IR-based group_batch_fusion, we can remove it. But right now, it's better to have the torch IR and aten IR versions separately.

Our plan is to gradually transform the existing and important pre-grad passes to aten IR based passes.

Differential Revision: D51017854

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114776
Approved by: https://github.com/zhxchen17
2023-12-06 05:48:20 +00:00
cyy
d250b2158e [4/N] Fixes clang-tidy warnings in header files (#115163)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115163
Approved by: https://github.com/Skylion007
2023-12-06 05:00:01 +00:00
f4c67ffff4 [dynamo] Improve support for dynamic shapes str.format and _assert (#115203)
This removes a graph break in vision_maskrcnn.
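
A hedged example of the kind of code this helps with (illustrative, not taken from vision_maskrcnn): an assert whose message formats a dynamic shape no longer needs to fall back.

```python
import torch

@torch.compile(dynamic=True)
def f(x):
    # str.format on a dynamic dimension inside _assert used to cause a graph break
    torch._assert(x.shape[0] > 0, "expected a non-empty batch, got {}".format(x.shape[0]))
    return x * 2

f(torch.ones(4, 3))
```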

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115203
Approved by: https://github.com/yanboliang
2023-12-06 04:54:45 +00:00
4ff4e06b5b Update xla pin (#115211)
This is to update the pin past 062aa91a9c so the flaky test can be skipped
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115211
Approved by: https://github.com/malfet
2023-12-06 04:52:37 +00:00
534f25887b [inductor] avoid inplace for ComplexView (#115166)
Fix https://github.com/pytorch/pytorch/issues/115071
A regression introduced by https://github.com/pytorch/pytorch/pull/112875/files#diff-d2539c9c8dc6a3d7e457767a880612e96d3c85752a77ead49a9e4e00a3e4c3c7R335

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115166
Approved by: https://github.com/Skylion007
2023-12-06 04:52:28 +00:00
490f2d7570 Skip privateuse1's checkZeroPoints (#114117)
We want to use ``quantize_per_channel`` to create a quantized tensor, but we found that ``checkZeroPoints`` for ``privateuse1`` backend failed.

``quantize_tensor_per_channel_affine`` will ``checkZeroPoints`` for all backends except ``CUDA``:
140c54e6cc/aten/src/ATen/native/quantized/AffineQuantizer.cpp (L162-L164)

However, our ``privateuse1`` backend will get a segmentation error if we try to cast our data to int64_t in ``checkZeroPoints``:
140c54e6cc/aten/src/ATen/native/quantized/AffineQuantizer.cpp (L82-L88)

So can we skip ``privateuse1``'s ``checkZeroPoints`` and check this in the actual device function instead? What do you think?
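
For reference, a minimal CPU example of the call involved (the failure is specific to an out-of-tree ``privateuse1`` backend, so this only illustrates the call shape):

```python
import torch

x = torch.randn(3, 4)
scales = torch.full((4,), 0.1)
zero_points = torch.zeros(4, dtype=torch.int64)

# checkZeroPoints runs inside this call for most backends
q = torch.quantize_per_channel(x, scales, zero_points, axis=1, dtype=torch.qint8)
print(q.int_repr().shape)  # torch.Size([3, 4])
```
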
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114117
Approved by: https://github.com/jerryzh168
2023-12-06 04:44:49 +00:00
acdd06e00f [executorch hash update] update the pinned executorch hash (#115215)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115215
Approved by: https://github.com/pytorchbot
2023-12-06 04:33:25 +00:00
a548e80536 Use test_vulkan to validate run_test without boto3 (#115233)
As `test_weak` can undergo some changes, but `test_vulkan` is a no-op for CPU builds
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115233
Approved by: https://github.com/suo
2023-12-06 03:45:52 +00:00
2bff36bb0e [c10d] Change set timeout API name to _set_default_timeout (#115197)
Somehow the feedback does not show up, this PR is to address the comment in https://github.com/pytorch/pytorch/pull/115141.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115197
Approved by: https://github.com/XilunWu, https://github.com/wconstab
2023-12-06 03:38:39 +00:00
b56b002842 Fix NULL dereference in binary CPU ops (#115183)
Targeted fix for https://github.com/pytorch/pytorch/issues/113037

A more fundamental fix, where those functions are not even called for empty tensors, is coming later.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115183
Approved by: https://github.com/drisspg, https://github.com/atalman, https://github.com/huydhn
2023-12-06 03:37:47 +00:00
892a14a450 [vision hash update] update the pinned vision hash (#111408)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111408
Approved by: https://github.com/pytorchbot
2023-12-06 03:25:52 +00:00
ef6cbf4e1f remove myself from CODEOWNERS (#115230)
Trying to rein in my notifications ;-)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115230
Approved by: https://github.com/malfet
2023-12-06 02:50:50 +00:00
b0b190f7c0 More descriptive error message for unsupported inputs to HOP (#115187)
Test Plan:
See updated tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115187
Approved by: https://github.com/ydwu4, https://github.com/yanboliang
ghstack dependencies: #115185, #115186
2023-12-06 01:29:03 +00:00
b5b011a5cd Expand input types for HOPs that use manually_set_subgraph_inputs=False (#115186)
Previously we only supported Tensor, Constants, and SymNode. We lift
that restriction (there's not really a good reason for it). HOPs like
torch.cond, torch.map already do input validation (those are the ones
that can only support Tensor, Constant, and SymNode inputs).

Test Plan:
New test for `wrap`, which is a HOP that has
manually_set_subgraph_inputs=False

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115186
Approved by: https://github.com/ydwu4, https://github.com/yanboliang
ghstack dependencies: #115185
2023-12-06 01:29:03 +00:00
bc46347152 Refactor how HOPs create new args to subgraphs (#115185)
This PR combines the logic for Tensor and SymNode.

Test Plan:
- Existing tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115185
Approved by: https://github.com/ydwu4, https://github.com/yanboliang
2023-12-06 01:29:03 +00:00
f6291a5e93 [Quant] [Inductor] Enable QLinear weight prepack when input dimension size exceeds 2 (#113928)
**Summary**
Enable the qlinear weight prepack when the input dimension size exceeds 2. There are extra reshape nodes before and after the `addmm` or `mm` node if the input dimension size exceeds 2.

**Test Plan**
```
python -m pytest test_mkldnn_pattern_matcher.py -k input_dim_exceeds_2
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113928
Approved by: https://github.com/jgong5, https://github.com/eellison
ghstack dependencies: #113733, #113912
2023-12-06 01:24:15 +00:00
6d0cf26c3a [Quant] [Inductor] Enable Dequant Promotion when Linear input dimension size exceeds 2 (#113912)
**Summary**
When decomposing `Linear` to `addmm` or `mm` within Inductor, if the input dimension size exceeds 2, `reshape` nodes are introduced to convert the input into a 2-dimensional form before and after the `addmm` or `mm` node. It is essential to identify and match this pattern during quantization for dequantization promotion. For instance,
```
        #            quant
        #      + - - - | - - - +
        #      |    dequant    |
        #      |       |       |
        #      |    reshape    |
        #      |    /     \    |
        #      |  node1  node2 |
        #      + - | - - - | - +
        #        reshape reshape
        #      + - | - - - | - +
        #        quant    quant
```
In this PR, we mainly do 2 things:

- Extend support for the dequantization pattern in QLinear when the input dimension size exceeds 2.
- Revise the implementation of the dequant promotion pass, as it now needs to accommodate the matching of four different patterns.

**Test Plan**
```
python -m pytest test_mkldnn_pattern_matcher.py -k input_dim_exceeds_2
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113912
Approved by: https://github.com/jgong5, https://github.com/eellison
ghstack dependencies: #113733
2023-12-06 01:20:36 +00:00
4a624d1f8a [Quant] [PT2] Enable QLinear input with multi dims (#113733)
**Summary**
In the previous QLinear implementation, it was assumed that inputs have a dimension of 2. In this update, we have modified QLinear to accept inputs with a dimension greater than 2, incorporating input and output reshaping accordingly.

**Test Plan**
```
python -u -m pytest -s -v test_quantized_op.py -k test_qlinear_pt2e
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113733
Approved by: https://github.com/jgong5, https://github.com/eellison
2023-12-06 01:16:51 +00:00
b8ce05456c enable cat for cuda bits types (#115044)
It was already working for CPU, so this brings parity.
Also, slightly reduces the number of compiled kernels by using OpaqueType.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115044
Approved by: https://github.com/malfet
2023-12-06 00:05:18 +00:00
b9c4fb68c5 [ONNX][Bench] Fix model name retrieval and remove unused argument (#115108)
There might have been some upstream updates; the previous hack no longer picks up model names, so this updates it to use the other, more appropriate variable.
Also fixed a bug with an unused argument that was supposed to be removed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115108
Approved by: https://github.com/thiagocrepaldi
2023-12-05 23:55:12 +00:00
ae457a2c4a [PyTorch] Change test_aot_inductor CPU test failures syntax (#115180)
This portion of D50416438 is extremely subject to merge conflicts. It can also be safely landed without full CI round trip because it changes just one test file that we can simply run to make sure it works.

Differential Revision: [D51856943](https://our.internmc.facebook.com/intern/diff/D51856943/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115180
Approved by: https://github.com/mikekgfb, https://github.com/desertfire
2023-12-05 23:55:08 +00:00
01ec71e466 [NFC][Autotune] Use device_prop.regsPerMultiprocessor instead of hardcoded reg number. (#115094)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115094
Approved by: https://github.com/jansel
2023-12-05 23:49:46 +00:00
1102d37958 remove aot_config.keep_inference_input_mutations from assert_functional_graph (#115195)
We technically allow backends to aot_autograd to pass a config saying "yes I am ok with seeing input mutations in my graph".

With https://github.com/pytorch/pytorch/pull/112906 though, there can be input mutations that show up in the backward (which we need to handle for correctness) and that are a large pain to keep out of the graph. The meta-point is that it's been ~a year since we added the config, and it almost always makes sense for backends to support input mutations for performance reasons (inductor does). So I just allow these input mutations in the graph in this rare backward situation, even if the backend didn't explicitly use the config.
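
A toy illustration of a graph-input mutation in general (the PR is specifically about such mutations appearing in the backward; hypothetical example, not from the PR):

```python
import torch

def f(x, buf):
    buf.add_(x.sum())   # mutates a graph input
    return x * 2

compiled = torch.compile(f)
buf = torch.zeros(())
out = compiled(torch.randn(3), buf)   # backends like inductor can handle the mutation in-graph
```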

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115195
Approved by: https://github.com/drisspg
2023-12-05 23:36:37 +00:00
7aac689b19 [inductor] Add ir.Scan and lower aten.cumsum on CUDA (#106581)
This adds the `ir.Scan` node (currently only supported on CUDA) which re-uses the existing reduction kernel machinery to support different kinds of non-pointwise ops. Just like reductions it supports prologue and epilogue fusions and has both persistent and non-persistent kernel generation.

Currently this doesn't support the equivalent of `Reduction.create_multilayer` and will instead fall back to eager in those cases. This is because splitting into multiple kernel invocations ends up being far slower than cub's single kernel strategy which matches the performance of a copy kernel.
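
For intuition, a pure-Python sketch of the inclusive scan that `ir.Scan` models (cumsum is a scan with `+` as the combine function):

```python
import torch

def inclusive_scan(xs, combine):
    out, acc = [], None
    for x in xs:
        acc = x if acc is None else combine(acc, x)
        out.append(acc)
    return out

x = torch.arange(1.0, 6.0)
assert inclusive_scan(x.tolist(), lambda a, b: a + b) == x.cumsum(0).tolist()
```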

Fixes https://github.com/pytorch/pytorch/issues/93631

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106581
Approved by: https://github.com/lezcano, https://github.com/atalman
2023-12-05 23:31:49 +00:00
d78fe039eb Introduce OptimizerInfos + add a test_errors (#114178)
Introduce OptimizerInfos + use them to refactor out the error testing.

Why OptimizerInfos?
- cleaner, easier way to test all configs of optimizers
- would plug in well with devicetype to auto-enable tests for devices like MPS, meta
- would allow for more granular testing. Currently, lots of functionality is tested in `_test_basic_cases`, and some of that should be broken down more.

What did I do for error testing?
- I moved out some error cases from `_test_basic_cases` into a new test_errors parametrized test.
- The new test has to live in TestOptimRenewed (bikeshedding welcome) because the parametrized tests need to take in device and dtype and hook correctly, and not all tests in TestOptim do that.
- TestOptimRenewed also is migrating to the toplevel test/test_optim.py now because importing TestOptimRenewed does not work (because of test instantiation, TestOptimRenewed gets replaced with TestOptimRenewedDevice for CPU, CUDA, and whatever other device).

Is there any change in test coverage?
- INCREASE: The error case where a single Parameter (vs a container of them) is passed in has now expanded to all optims instead of only LBFGS
- DECREASE: Not much. The only thing is we no longer test two error cases for foreach=True AND foreach=False, which I think is redundant. (Highlighted in comments)

Possible but not urgent next step: test ALL possible error cases by going through all the constructors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114178
Approved by: https://github.com/albanD
2023-12-05 22:58:36 +00:00
99257002fa Extend auto_functionalized to support ops that return Tensors (#115135)
We can auto-functionalize operators that mutate their inputs as long as
the outputs of the operator do not alias their inputs. The user needs
to provide an abstract impl for the operator if it has non-trivial
returns.
- We update can_auto_functionalize(op) to include ops that return (but
  do not alias) Tensors
- We update auto_functionalized(op, mutated_args_names, kwargs) to
  return (out, mutated_args), where `out = op(**kwargs)` and
  `mutated_args` are the new values of the inputs that would have been
  mutated.
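
A minimal Python sketch of that contract (illustrative only; the real transformation operates on the FX graph):

```python
import torch

def my_op(x, out):           # mutates `out`, returns a fresh (non-aliasing) tensor
    out.add_(x)
    return x * 2

def auto_functionalized_my_op(x, out):
    out_new = out.clone()    # operate on a copy instead of mutating the input
    ret = my_op(x, out_new)
    return ret, (out_new,)   # (out, mutated_args)

x, buf = torch.ones(3), torch.zeros(3)
ret, (buf_new,) = auto_functionalized_my_op(x, buf)
assert torch.equal(buf, torch.zeros(3))   # the original input is left untouched
```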

Test Plan:
- new test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115135
Approved by: https://github.com/bdhirsh
ghstack dependencies: #114955, #114956, #115134
2023-12-05 22:43:06 +00:00
d0aad93249 Refactor can_auto_functionalize (#115134)
This is in preparation for the next PR up in the stack, which is going to update
"can_auto_functionalize" to support more operators than just ones that
return nothing. We are unable to auto-generate FakeTensor kernels for
operators that return something, but we are able to generate
functionalization kernels for them.

Test Plan:
Existing tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115134
Approved by: https://github.com/bdhirsh
ghstack dependencies: #114955, #114956
2023-12-05 22:43:06 +00:00
4620170008 [Dynamo] Revert multiple PRs since they triggered compilation stuck internally (#115126)
Revert the following PRs to mitigate an internal compilation hang:
#113432
#114016
#114507
#114196
#114739
#114669

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115126
Approved by: https://github.com/xush6528
2023-12-05 22:35:37 +00:00
80527c0cf2 [AOTInductor] Double buffering for Weights (#114446)
Summary:
This adds a function to the model container that does weight swapping with double buffering.

There are 2 parts to double buffering:
a) Write constants into inactive buffer
b) Swap active buffer

For (a), we write the constants into the buffer that's currently not in use, and store the information in both the constants map and the corresponding constant array to read from.
For (b), we obtain the lock, activate the constant map/constant array that is inactive, and flag the one that's currently in use as inactive.
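
A hedged, minimal sketch of that scheme (illustrative Python; the actual implementation lives in the C++ model container):

```python
import threading

class DoubleBufferedConstants:
    def __init__(self):
        self.maps = [{}, {}]     # two constant maps / arrays
        self.active = 0
        self.lock = threading.Lock()

    def write_inactive(self, name, value):
        # (a) write new weights into the buffer that is not in use
        self.maps[1 - self.active][name] = value

    def swap(self):
        # (b) take the lock and flip which buffer readers see
        with self.lock:
            self.active = 1 - self.active

    def read(self, name):
        return self.maps[self.active][name]
```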

Test Plan:
test/cpp/aot_inductor/test.cpp

Reviewers:

Subscribers:

Tasks:

Tags:

Differential Revision: [D51543732](https://our.internmc.facebook.com/intern/diff/D51543732)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114446
Approved by: https://github.com/chenyang78, https://github.com/eellison
2023-12-05 22:31:56 +00:00
12085914b8 Replace bsr_dense_mm triton kernel with bsr_dense_addm triton kernel (#115030)
The `bsr_dense_addmm` triton kernel introduced in https://github.com/pytorch/pytorch/pull/114595 is a generalization of the `bsr_dense_mm` triton kernel and a more efficient version of it, because it uses an extra kernel parameter `SPLIT_N` that has a notable effect on performance for r.h.s. operands with a larger number of columns.

This PR eliminates the `bsr_dense_mm` triton kernel in favor of using `bsr_dense_addmm` triton kernel.

The performance increase of `bsr_dense_mm` is as follows (float16, `NVIDIA A100-SXM4-80GB`):
- with 16x16 blocks, the average/maximal speed up is 50/71 %
- with 32x32 blocks, the average/maximal speed up is 30/63 %
- with 64x64 blocks, the average/maximal speed up is 12/26 %
- with 128x128 blocks, the average/maximal speed up is 7/17 %

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115030
Approved by: https://github.com/cpuhrsch
2023-12-05 22:29:24 +00:00
f35f52e4a6 Update auto_request_review.yml (#115182)
remove myself to avoid notification noise

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115182
Approved by: https://github.com/huydhn, https://github.com/albanD
2023-12-05 21:36:18 +00:00
f09e8381b7 [Inductor][fx pass] Fix a bug in batch linear fusion in the post grad (#115061) (#115131)
Summary:

As titled.

Test Plan:
```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:group_batch_fusion
```
Buck UI: https://www.internalfb.com/buck2/ab4b918c-9ffa-4d00-a747-880521a27851
Test UI: https://www.internalfb.com/intern/testinfra/testrun/16607023638890043
Network: Up: 11MiB  Down: 117MiB  (reSessionID-079402d0-8fd7-4797-9ed5-dd0f778dce1a)
Jobs completed: 189430. Time elapsed: 2:02.5s.
Cache hits: 99%. Commands: 77000 (cached: 76995, remote: 5, local: 0)
Tests finished: Pass 7. Fail 0. Fatal 0. Skip 0. Build failure 0

Reviewed By: mengluy0125

Differential Revision: D51796899

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115131
Approved by: https://github.com/mengluy0125
2023-12-05 21:20:17 +00:00
ab120e65fb Fix FSDP + TP state dict in param unflattening (#115105)
Summary:
This diff fixes the param unflattening when using FSDP together with TP. Currently we hardcode the `reshape_size` to be multiplied by 2, when it should instead be multiplied by the size of the process group.

Before the fix, example exception: `shape '[257, 514]' is invalid for input of size 264196`, where the process group size is 4 instead of 2.

Test Plan:
**CI**:
CI test

**Unit test**:
`buck2 test mode/dev-nosan //caffe2/test/distributed/tensor/parallel:fsdp_2d_parallel`
- Passed

**Test model with WHEN**:
- Verified that checkpoint can be saved and resumed successfully;
- Verified the accuracy with window_ne, which is on-par with baseline.
https://pxl.cl/3Wp8w

Differential Revision: D51826120

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115105
Approved by: https://github.com/fegin
2023-12-05 21:19:56 +00:00
22704426c3 Expand dynamic dims support for traceable subclasses (#114311)
Continuation of #112185, following the design in this [doc](https://docs.google.com/document/d/1ipSxcTzEMMOAPvxP-YJlD5JBZZmIGgh8Q34ixtOUCRo).

Summary:
* Introduce `SubclassSymbolicPolicy` containing separate dynamic dim / constraint policies for the outer and inner tensors
    * Expand the automatic dynamic algorithm to recurse into inner tensors and produce one of these for a subclass instance
    * Maintain legacy behavior for subclasses by recursively calling `mark_dynamic()` on inner tensors *of the same dim as outer* when `mark_dynamic(outer, ...)` is called
    * Addresses this: 6a86cf00ad/torch/_dynamo/variables/builder.py (L1750)
* Add `outer_size` and `outer_stride` arguments to `__tensor_unflatten__()` so that you can find out what symbols were allocated for the outer size / stride (you are expected to return a tensor that compares equal to the outer symbols)
    * Signatures now:
    ```python
    # attrs is a list of inner tensor attributes on x; inner_tensor = getattr(x, attr)
    # ctx is anything useful for rebuilding the class we want to guard on
    attrs, ctx = x.__tensor_flatten__()
    ...
    # inner_tensors is a dict of {attr -> tensor}
    # ctx is taken unmodified from flattening and (eventually) guarded on
    # outer_size is the expected size of the output; possibly symbolic
    # outer_stride is the expected strides of the output; possibly symbolic
    y = MySubclass.__tensor_unflatten__(inner_tensors, ctx, outer_size, outer_stride)

    # at the __tensor_unflatten__() call-site in PT2, we assert y.shape == outer_size and y.stride() == outer_stride
    # the assert simplifies symbols when there are relationships between outer and inner symbols
    ```
    * Size info needed for `NestedTensor` at least, stride info needed for `DTensor` at least
    * Punting on `outer_storage_offset` because storage_offset handling is horribly broken in PT2 right now
* ~~Add new `__tensor_mark_dynamic__()` to allow overriding the behavior of mark_dynamic on a per-subclass basis~~ (booted to future work)
* ~~Add guards for tensor subclasses by calling `__tensor_flatten__()` in the guard to test equality on `ctx`~~
    * Now handled in #114469
* Next PR: add TENSOR_MATCH guards on inner tensors

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114311
Approved by: https://github.com/ezyang, https://github.com/drisspg, https://github.com/voznesenskym, https://github.com/bdhirsh
2023-12-05 21:09:25 +00:00
259a99669d [NCCL flight recorder] Dump when writing to pipe (#115139)
If TORCH_NCCL_DUMP_ON_TIMEOUT is set, then along with producing a dump
file when a timeout happens, you can trigger a dump by writing to local pipe
`<TORCH_NCCL_DEBUG_INFO_TEMP_FILE>_<rank>.pipe` (by default
/tmp/nccl_trace_{rank}_<rank>.pipe).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115139
Approved by: https://github.com/wconstab
2023-12-05 20:44:23 +00:00
5fdae89c03 [docs][aoti] Link to export docs in AOTI docs (#115088)
Context: https://fb.workplace.com/groups/1075192433118967/posts/1341833143121560/?comment_id=1341841786454029

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115088
Approved by: https://github.com/desertfire
2023-12-05 20:22:42 +00:00
cd0e9c4c05 fix 2023-11-16 17:48:57 +00:00
8fc1309f9f edits 2023-11-16 17:08:29 +00:00
11a52e8d6d Merge remote-tracking branch 'origin/main' into tensordict_integration 2023-11-10 10:57:10 -05:00
52061a05a4 functional decorator 2023-11-08 21:04:47 -05:00
afe4a40805 functional decorator 2023-11-08 20:56:32 -05:00
f5c809e33d functional efficiency 2023-11-06 17:27:46 -05:00
1d9582c627 fixes 2023-11-06 16:49:46 -05:00
2331b048af no more pickling 2023-11-06 16:09:47 -05:00
fb2103bee1 faster from_module 2023-11-06 11:52:54 -05:00
2a69d65d52 lint 2023-11-05 18:22:58 -05:00
f36bd08109 native shared tensors 2023-11-05 18:17:27 -05:00
d6160943b1 partial fix 2023-11-02 21:18:50 +00:00
d147161883 partial fix 2023-11-02 18:10:19 +00:00
6d3b90b64b more tests 2023-11-01 17:23:38 +00:00
d040d35294 fix tensorclass tests 2023-11-01 13:33:55 +00:00
cf6704548a fixes 2023-11-01 12:18:49 +00:00
b8ab16bad6 tensorclass 2023-11-01 11:46:15 +00:00
570605e37d amend 2023-11-01 10:25:06 +00:00
d307e5e0be Merge remote-tracking branch 'origin/main' into tensordict_integration 2023-10-30 19:40:19 +00:00
19f3d13102 init 2023-10-30 19:38:04 +00:00
1789 changed files with 81599 additions and 52182 deletions

View File

@ -19,6 +19,7 @@ See `build.sh` for valid build environments (it's the giant switch).
* `ubuntu` -- Dockerfile for Ubuntu image for CPU build and test jobs
* `ubuntu-cuda` -- Dockerfile for Ubuntu image with CUDA support for nvidia-docker
* `ubuntu-rocm` -- Dockerfile for Ubuntu image with ROCm support
* `ubuntu-xpu` -- Dockerfile for Ubuntu image with XPU support
## Usage

View File

@ -71,6 +71,8 @@ if [[ "$image" == *cuda* && "$UBUNTU_VERSION" != "22.04" ]]; then
DOCKERFILE="${OS}-cuda/Dockerfile"
elif [[ "$image" == *rocm* ]]; then
DOCKERFILE="${OS}-rocm/Dockerfile"
elif [[ "$image" == *xpu* ]]; then
DOCKERFILE="${OS}-xpu/Dockerfile"
elif [[ "$image" == *cuda*linter* ]]; then
# Use a separate Dockerfile for linter to keep a small image size
DOCKERFILE="linter-cuda/Dockerfile"
@ -218,6 +220,16 @@ case "$image" in
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-jammy-xpu-2024.0-py3)
ANACONDA_PYTHON_VERSION=3.8
GCC_VERSION=11
PROTOBUF=yes
DB=yes
VISION=yes
BASEKIT_VERSION=2024.0.0-49522
NINJA_VERSION=1.9.0
CONDA_CMAKE=yes
;;
pytorch-linux-jammy-py3.8-gcc11-inductor-benchmarks)
ANACONDA_PYTHON_VERSION=3.8
GCC_VERSION=11
@ -374,6 +386,7 @@ docker build \
--build-arg "DOCS=${DOCS}" \
--build-arg "INDUCTOR_BENCHMARKS=${INDUCTOR_BENCHMARKS}" \
--build-arg "EXECUTORCH=${EXECUTORCH}" \
--build-arg "BASEKIT_VERSION=${BASEKIT_VERSION}" \
-f $(dirname ${DOCKERFILE})/Dockerfile \
-t "$tmp_tag" \
"$@" \

View File

@ -1 +1 @@
b2f5dfe80704404298467347b8ee3ac229efed47
663882fe7dc518c04adf3d2ee5ccb7d99f41ade4

View File

@ -1 +1 @@
bcad9dabe15021c53b6a88296e9d7a210044f108
e28a256d71f3cf2bcc7b69d6bda73a9b855e385e

View File

@ -61,6 +61,7 @@ install_ubuntu() {
${maybe_libiomp_dev} \
libyaml-dev \
libz-dev \
libjemalloc2 \
libjpeg-dev \
libasound2-dev \
libsndfile-dev \
@ -74,6 +75,7 @@ install_ubuntu() {
libtool \
vim \
unzip \
gpg-agent \
gdb
# Should resolve issues related to various apt package repository cert issues

View File

@ -2,8 +2,8 @@
if [[ ${CUDNN_VERSION} == 8 ]]; then
# cuDNN license: https://developer.nvidia.com/cudnn/license_agreement
mkdir tmp_cudnn && cd tmp_cudnn
CUDNN_NAME="cudnn-linux-x86_64-8.3.2.44_cuda11.5-archive"
mkdir tmp_cudnn
pushd tmp_cudnn
if [[ ${CUDA_VERSION:0:4} == "12.1" ]]; then
CUDNN_NAME="cudnn-linux-x86_64-8.9.2.26_cuda12-archive"
curl --retry 3 -OLs https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/${CUDNN_NAME}.tar.xz
@ -11,17 +11,14 @@ if [[ ${CUDNN_VERSION} == 8 ]]; then
CUDNN_NAME="cudnn-linux-x86_64-8.7.0.84_cuda11-archive"
curl --retry 3 -OLs https://developer.download.nvidia.com/compute/redist/cudnn/v8.7.0/local_installers/11.8/${CUDNN_NAME}.tar.xz
else
curl --retry 3 -OLs https://developer.download.nvidia.com/compute/redist/cudnn/v8.3.2/local_installers/11.5/${CUDNN_NAME}.tar.xz
print "Unsupported CUDA version ${CUDA_VERSION}"
exit 1
fi
tar xf ${CUDNN_NAME}.tar.xz
cp -a ${CUDNN_NAME}/include/* /usr/include/
cp -a ${CUDNN_NAME}/include/* /usr/local/cuda/include/
cp -a ${CUDNN_NAME}/include/* /usr/include/x86_64-linux-gnu/
cp -a ${CUDNN_NAME}/lib/* /usr/local/cuda/lib64/
cp -a ${CUDNN_NAME}/lib/* /usr/lib/x86_64-linux-gnu/
cd ..
popd
rm -rf tmp_cudnn
ldconfig
fi

View File

@ -0,0 +1,21 @@
#!/bin/bash
set -ex
# cuSPARSELt license: https://docs.nvidia.com/cuda/cusparselt/license.html
mkdir tmp_cusparselt && cd tmp_cusparselt
if [[ ${CUDA_VERSION:0:4} == "12.1" ]]; then
CUSPARSELT_NAME="libcusparse_lt-linux-x86_64-0.5.2.1-archive"
curl --retry 3 -OLs https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-x86_64/${CUSPARSELT_NAME}.tar.xz
elif [[ ${CUDA_VERSION:0:4} == "11.8" ]]; then
CUSPARSELT_NAME="libcusparse_lt-linux-x86_64-0.4.0.7-archive"
curl --retry 3 -OLs https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-x86_64/${CUSPARSELT_NAME}.tar.xz
fi
tar xf ${CUSPARSELT_NAME}.tar.xz
cp -a ${CUSPARSELT_NAME}/include/* /usr/local/cuda/include/
cp -a ${CUSPARSELT_NAME}/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf tmp_cusparselt
ldconfig

View File

@ -0,0 +1,115 @@
#!/bin/bash
set -xe
# Intel® software for general purpose GPU capabilities.
# Refer to https://dgpu-docs.intel.com/releases/stable_647_21_20230714.html
# Intel® oneAPI Base Toolkit (version 2024.0.0) has been updated to include functional and security updates.
# Refer to https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit-download.html
# Users should update to the latest version as it becomes available
function install_ubuntu() {
apt-get update -y
apt-get install -y gpg-agent wget
# Set up the repository. To do this, download the key to the system keyring
wget -qO - https://repositories.intel.com/gpu/intel-graphics.key \
| gpg --dearmor --output /usr/share/keyrings/intel-graphics.gpg
wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB \
| gpg --dearmor | tee /usr/share/keyrings/oneapi-archive-keyring.gpg > /dev/null
# Add the signed entry to APT sources and configure the APT client to use the Intel repository
echo "deb [arch=amd64 signed-by=/usr/share/keyrings/intel-graphics.gpg] https://repositories.intel.com/gpu/ubuntu jammy/production/2328 unified" \
| tee /etc/apt/sources.list.d/intel-gpu-jammy.list
echo "deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main" \
| tee /etc/apt/sources.list.d/oneAPI.list
# Update the packages list and repository index
apt-get update
# The xpu-smi packages
apt-get install -y flex bison xpu-smi
# Compute and Media Runtimes
apt-get install -y \
intel-opencl-icd intel-level-zero-gpu level-zero \
intel-media-va-driver-non-free libmfx1 libmfxgen1 libvpl2 \
libegl-mesa0 libegl1-mesa libegl1-mesa-dev libgbm1 libgl1-mesa-dev libgl1-mesa-dri \
libglapi-mesa libgles2-mesa-dev libglx-mesa0 libigdgmm12 libxatracker2 mesa-va-drivers \
mesa-vdpau-drivers mesa-vulkan-drivers va-driver-all vainfo hwinfo clinfo
# Development Packages
apt-get install -y libigc-dev intel-igc-cm libigdfcl-dev libigfxcmrt-dev level-zero-dev
# Install Intel® oneAPI Base Toolkit
if [ -n "$BASEKIT_VERSION" ]; then
apt-get install intel-basekit=$BASEKIT_VERSION -y
else
apt-get install intel-basekit -y
fi
# Cleanup
apt-get autoclean && apt-get clean
rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
}
function install_centos() {
dnf install -y 'dnf-command(config-manager)'
dnf config-manager --add-repo \
https://repositories.intel.com/gpu/rhel/8.6/production/2328/unified/intel-gpu-8.6.repo
# To add the EPEL repository needed for DKMS
dnf -y install https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm
# https://dl.fedoraproject.org/pub/epel/epel-release-latest-9.noarch.rpm
# Create the YUM repository file in the /temp directory as a normal user
tee > /tmp/oneAPI.repo << EOF
[oneAPI]
name=Intel® oneAPI repository
baseurl=https://yum.repos.intel.com/oneapi
enabled=1
gpgcheck=1
repo_gpgcheck=1
gpgkey=https://yum.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB
EOF
# Move the newly created oneAPI.repo file to the YUM configuration directory /etc/yum.repos.d
mv /tmp/oneAPI.repo /etc/yum.repos.d
# The xpu-smi packages
dnf install -y flex bison xpu-smi
# Compute and Media Runtimes
dnf install -y \
intel-opencl intel-media intel-mediasdk libmfxgen1 libvpl2\
level-zero intel-level-zero-gpu mesa-dri-drivers mesa-vulkan-drivers \
mesa-vdpau-drivers libdrm mesa-libEGL mesa-libgbm mesa-libGL \
mesa-libxatracker libvpl-tools intel-metrics-discovery \
intel-metrics-library intel-igc-core intel-igc-cm \
libva libva-utils intel-gmmlib libmetee intel-gsc intel-ocloc hwinfo clinfo
# Development packages
dnf install -y --refresh \
intel-igc-opencl-devel level-zero-devel intel-gsc-devel libmetee-devel \
level-zero-devel
# Install Intel® oneAPI Base Toolkit
dnf install intel-basekit -y
# Cleanup
dnf clean all
rm -rf /var/cache/yum
rm -rf /var/lib/yum/yumdb
rm -rf /var/lib/yum/history
}
# The installation depends on the base OS
ID=$(grep -oP '(?<=^ID=).+' /etc/os-release | tr -d '"')
case "$ID" in
ubuntu)
install_ubuntu
;;
centos)
install_centos
;;
*)
echo "Unable to determine OS..."
exit 1
;;
esac

View File

@ -298,3 +298,8 @@ pywavelets==1.4.1
# it here because 1.5.0 conflicts with numpy 1.21.2 used in CI
#Pinned versions: 1.4.1
#test that import:
lxml==5.0.0.
#Description: This is a requirement of unittest-xml-reporting
# Python-3.9 binaries

View File

@ -1 +1 @@
2.1.0
2.2.0

View File

@ -142,6 +142,12 @@ COPY ./common/install_cudnn.sh install_cudnn.sh
RUN if [ "${CUDNN_VERSION}" -eq 8 ]; then bash install_cudnn.sh; fi
RUN rm install_cudnn.sh
# Install CUSPARSELT
ARG CUDA_VERSION
COPY ./common/install_cusparselt.sh install_cusparselt.sh
RUN bash install_cusparselt.sh
RUN rm install_cusparselt.sh
# Delete /usr/local/cuda-11.X/cuda-11.X symlinks
RUN if [ -h /usr/local/cuda-11.6/cuda-11.6 ]; then rm /usr/local/cuda-11.6/cuda-11.6; fi
RUN if [ -h /usr/local/cuda-11.7/cuda-11.7 ]; then rm /usr/local/cuda-11.7/cuda-11.7; fi

View File

@ -0,0 +1,118 @@
ARG UBUNTU_VERSION
FROM ubuntu:${UBUNTU_VERSION}
ARG UBUNTU_VERSION
ENV DEBIAN_FRONTEND noninteractive
ARG CLANG_VERSION
# Install common dependencies (so that this step can be cached separately)
COPY ./common/install_base.sh install_base.sh
RUN bash ./install_base.sh && rm install_base.sh
# Install clang
ARG LLVMDEV
COPY ./common/install_clang.sh install_clang.sh
RUN bash ./install_clang.sh && rm install_clang.sh
# Install user
COPY ./common/install_user.sh install_user.sh
RUN bash ./install_user.sh && rm install_user.sh
# Install katex
ARG KATEX
COPY ./common/install_docs_reqs.sh install_docs_reqs.sh
RUN bash ./install_docs_reqs.sh && rm install_docs_reqs.sh
# Install conda and other packages (e.g., numpy, pytest)
ARG ANACONDA_PYTHON_VERSION
ARG CONDA_CMAKE
ARG DOCS
ENV ANACONDA_PYTHON_VERSION=$ANACONDA_PYTHON_VERSION
ENV PATH /opt/conda/envs/py_$ANACONDA_PYTHON_VERSION/bin:/opt/conda/bin:$PATH
ENV DOCS=$DOCS
COPY requirements-ci.txt requirements-docs.txt /opt/conda/
COPY ./common/install_conda.sh install_conda.sh
COPY ./common/common_utils.sh common_utils.sh
RUN bash ./install_conda.sh && rm install_conda.sh common_utils.sh /opt/conda/requirements-ci.txt /opt/conda/requirements-docs.txt
# Install gcc
ARG GCC_VERSION
COPY ./common/install_gcc.sh install_gcc.sh
RUN bash ./install_gcc.sh && rm install_gcc.sh
# Install lcov for C++ code coverage
COPY ./common/install_lcov.sh install_lcov.sh
RUN bash ./install_lcov.sh && rm install_lcov.sh
COPY ./common/install_openssl.sh install_openssl.sh
RUN bash ./install_openssl.sh
ENV OPENSSL_ROOT_DIR /opt/openssl
ENV OPENSSL_DIR /opt/openssl
RUN rm install_openssl.sh
ARG INDUCTOR_BENCHMARKS
COPY ./common/install_inductor_benchmark_deps.sh install_inductor_benchmark_deps.sh
COPY ./common/common_utils.sh common_utils.sh
COPY ci_commit_pins/huggingface.txt huggingface.txt
COPY ci_commit_pins/timm.txt timm.txt
RUN if [ -n "${INDUCTOR_BENCHMARKS}" ]; then bash ./install_inductor_benchmark_deps.sh; fi
RUN rm install_inductor_benchmark_deps.sh common_utils.sh timm.txt huggingface.txt
ARG TRITON
# Install triton, this needs to be done before sccache because the latter will
# try to reach out to S3, which docker build runners don't have access
COPY ./common/install_triton.sh install_triton.sh
COPY ./common/common_utils.sh common_utils.sh
# TODO: will add triton xpu commit
COPY ci_commit_pins/triton.txt triton.txt
RUN if [ -n "${TRITON}" ]; then bash ./install_triton.sh; fi
RUN rm install_triton.sh common_utils.sh triton.txt
# (optional) Install database packages like LMDB and LevelDB
ARG DB
COPY ./common/install_db.sh install_db.sh
RUN if [ -n "${DB}" ]; then bash ./install_db.sh; fi
RUN rm install_db.sh
ENV INSTALLED_DB ${DB}
# (optional) Install vision packages like OpenCV and ffmpeg
ARG VISION
COPY ./common/install_vision.sh ./common/cache_vision_models.sh ./common/common_utils.sh ./
RUN if [ -n "${VISION}" ]; then bash ./install_vision.sh; fi
RUN rm install_vision.sh cache_vision_models.sh common_utils.sh
ENV INSTALLED_VISION ${VISION}
# Install XPU Dependencies
ARG BASEKIT_VERSION
COPY ./common/install_xpu.sh install_xpu.sh
RUN bash ./install_xpu.sh && rm install_xpu.sh
# (optional) Install non-default CMake version
ARG CMAKE_VERSION
COPY ./common/install_cmake.sh install_cmake.sh
RUN if [ -n "${CMAKE_VERSION}" ]; then bash ./install_cmake.sh; fi
RUN rm install_cmake.sh
# (optional) Install non-default Ninja version
ARG NINJA_VERSION
COPY ./common/install_ninja.sh install_ninja.sh
RUN if [ -n "${NINJA_VERSION}" ]; then bash ./install_ninja.sh; fi
RUN rm install_ninja.sh
# Install ccache/sccache (do this last, so we get priority in PATH)
COPY ./common/install_cache.sh install_cache.sh
ENV PATH /opt/cache/bin:$PATH
RUN bash ./install_cache.sh && rm install_cache.sh
# Include BUILD_ENVIRONMENT environment variable in image
ARG BUILD_ENVIRONMENT
ENV BUILD_ENVIRONMENT ${BUILD_ENVIRONMENT}
# Install LLVM dev version (Defined in the pytorch/builder github repository)
COPY --from=pytorch/llvm:9.0.1 /opt/llvm /opt/llvm
USER jenkins
CMD ["bash"]

View File

@ -28,6 +28,8 @@ echo "Environment variables:"
env
if [[ "$BUILD_ENVIRONMENT" == *cuda* ]]; then
# Use jemalloc during compilation to mitigate https://github.com/pytorch/pytorch/issues/116289
export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2
echo "NVCC version:"
nvcc --version
fi
@ -151,6 +153,12 @@ if [[ "$BUILD_ENVIRONMENT" == *rocm* ]]; then
python tools/amd_build/build_amd.py
fi
if [[ "$BUILD_ENVIRONMENT" == *xpu* ]]; then
# shellcheck disable=SC1091
source /opt/intel/oneapi/compiler/latest/env/vars.sh
export USE_XPU=1
fi
# sccache will fail for CUDA builds if all cores are used for compiling
# gcc 7 with sccache seems to have intermittent OOM issue if all cores are used
if [ -z "$MAX_JOBS" ]; then

View File

@ -18,6 +18,10 @@ BUILD_DIR="build"
BUILD_RENAMED_DIR="build_renamed"
BUILD_BIN_DIR="$BUILD_DIR"/bin
#Set Default values for these variables in case they are not set
SHARD_NUMBER="${SHARD_NUMBER:=1}"
NUM_TEST_SHARDS="${NUM_TEST_SHARDS:=1}"
export VALGRIND=ON
# export TORCH_INDUCTOR_INSTALL_GXX=ON
if [[ "$BUILD_ENVIRONMENT" == *clang9* ]]; then
@ -124,6 +128,8 @@ if [[ "$BUILD_ENVIRONMENT" == *cuda* || "$BUILD_ENVIRONMENT" == *rocm* ]]; then
# mainly used so that we're not spending extra cycles testing cpu
# devices on expensive gpu machines
export PYTORCH_TESTING_DEVICE_ONLY_FOR="cuda"
elif [[ "$BUILD_ENVIRONMENT" == *xpu* ]]; then
export PYTORCH_TESTING_DEVICE_ONLY_FOR="xpu"
fi
if [[ "$TEST_CONFIG" == *crossref* ]]; then
@ -136,6 +142,15 @@ if [[ "$BUILD_ENVIRONMENT" == *rocm* ]]; then
rocminfo | grep -E 'Name:.*\sgfx|Marketing'
fi
if [[ "$BUILD_ENVIRONMENT" == *xpu* ]]; then
# Source Intel oneAPI envrioment script to enable xpu runtime related libraries
# refer to https://www.intel.com/content/www/us/en/docs/oneapi/programming-guide/2024-0/use-the-setvars-and-oneapi-vars-scripts-with-linux.html
# shellcheck disable=SC1091
source /opt/intel/oneapi/compiler/latest/env/vars.sh
# Check XPU status before testing
xpu-smi discovery
fi
if [[ "$BUILD_ENVIRONMENT" != *-bazel-* ]] ; then
# JIT C++ extensions require ninja.
pip_install --user "ninja==1.10.2"
@ -259,6 +274,7 @@ test_dynamo_shard() {
--exclude-jit-executor \
--exclude-distributed-tests \
--exclude \
test_ao_sparsity \
test_autograd \
test_jit \
test_proxy_tensor \
@ -308,8 +324,10 @@ test_inductor() {
# docker build uses bdist_wheel which does not work with test_aot_inductor
# TODO: need a faster way to build
BUILD_AOT_INDUCTOR_TEST=1 python setup.py develop
CPP_TESTS_DIR="${BUILD_BIN_DIR}" LD_LIBRARY_PATH="${TORCH_LIB_DIR}" python test/run_test.py --cpp --verbose -i cpp/test_aot_inductor
if [[ "$BUILD_ENVIRONMENT" != *rocm* ]]; then
BUILD_AOT_INDUCTOR_TEST=1 python setup.py develop
CPP_TESTS_DIR="${BUILD_BIN_DIR}" LD_LIBRARY_PATH="${TORCH_LIB_DIR}" python test/run_test.py --cpp --verbose -i cpp/test_aot_inductor
fi
}
# "Global" flags for inductor benchmarking controlled by TEST_CONFIG
@ -389,8 +407,8 @@ test_perf_for_dashboard() {
--output "$TEST_REPORTS_DIR/${backend}_dynamic_${suite}_${dtype}_${mode}_cuda_${target}.csv"
fi
if [[ "$DASHBOARD_TAG" == *cppwrapper-true* ]] && [[ "$mode" == "inference" ]]; then
python "benchmarks/dynamo/$suite.py" \
"${target_flag[@]}" --"$mode" --"$dtype" --backend "$backend" --disable-cudagraphs --cpp-wrapper "$@" \
TORCHINDUCTOR_CPP_WRAPPER=1 python "benchmarks/dynamo/$suite.py" \
"${target_flag[@]}" --"$mode" --"$dtype" --backend "$backend" --disable-cudagraphs "$@" \
--output "$TEST_REPORTS_DIR/${backend}_cpp_wrapper_${suite}_${dtype}_${mode}_cuda_${target}.csv"
fi
if [[ "$DASHBOARD_TAG" == *freezing_cudagraphs-true* ]] && [[ "$mode" == "inference" ]]; then
@ -491,6 +509,13 @@ test_inductor_torchbench_smoketest_perf() {
TEST_REPORTS_DIR=$(pwd)/test/test-reports
mkdir -p "$TEST_REPORTS_DIR"
# smoke test the cpp_wrapper mode
TORCHINDUCTOR_CPP_WRAPPER=1 python benchmarks/dynamo/torchbench.py --device cuda --accuracy --bfloat16 \
--inference --inductor --only hf_T5 --output "$TEST_REPORTS_DIR/inductor_cpp_wrapper_smoketest.csv"
python benchmarks/dynamo/check_accuracy.py \
--actual "$TEST_REPORTS_DIR/inductor_cpp_wrapper_smoketest.csv" \
--expected "benchmarks/dynamo/ci_expected_accuracy/inductor_torchbench_inference.csv"
python benchmarks/dynamo/torchbench.py --device cuda --performance --backend inductor --float16 --training \
--batch-size-file "$(realpath benchmarks/dynamo/torchbench_models_list.txt)" --only hf_Bert \
--output "$TEST_REPORTS_DIR/inductor_training_smoketest.csv"
@ -500,7 +525,11 @@ test_inductor_torchbench_smoketest_perf() {
python benchmarks/dynamo/torchbench.py --device cuda --performance --bfloat16 --inference \
--export-aot-inductor --only nanogpt --output "$TEST_REPORTS_DIR/inductor_inference_smoketest.csv"
# The threshold value needs to be actively maintained to make this check useful
python benchmarks/dynamo/check_perf_csv.py -f "$TEST_REPORTS_DIR/inductor_inference_smoketest.csv" -t 5.2
# The perf number of nanogpt seems not very stable, e.g.
# https://github.com/pytorch/pytorch/actions/runs/7158691360/job/19491437314,
# and thus we lower its threshold to reduce flakiness. If this continues to be a problem,
# we switch to use some other model.
python benchmarks/dynamo/check_perf_csv.py -f "$TEST_REPORTS_DIR/inductor_inference_smoketest.csv" -t 4.9
# Check memory compression ratio for a few models
for test in hf_Albert timm_vision_transformer; do
@ -660,6 +689,20 @@ test_libtorch_api() {
fi
}
test_xpu_bin(){
TEST_REPORTS_DIR=$(pwd)/test/test-reports
mkdir -p "$TEST_REPORTS_DIR"
for xpu_case in "${BUILD_BIN_DIR}"/*{xpu,sycl}*
do
if [[ "$xpu_case" != *"*"* ]]; then
case_name=$(basename "$xpu_case")
echo "Testing ${case_name} ..."
"$xpu_case" --gtest_output=xml:"$TEST_REPORTS_DIR"/"$case_name".xml
fi
done
}
test_aot_compilation() {
echo "Testing Ahead of Time compilation"
ln -sf "$TORCH_LIB_DIR"/libc10* "$TORCH_BIN_DIR"
@ -1069,7 +1112,7 @@ elif [[ "${TEST_CONFIG}" == *torchbench* ]]; then
# https://github.com/opencv/opencv-python/issues/885
pip_install opencv-python==4.8.0.74
if [[ "${TEST_CONFIG}" == *inductor_torchbench_smoketest_perf* ]]; then
checkout_install_torchbench hf_Bert hf_Albert timm_vision_transformer
checkout_install_torchbench hf_Bert hf_Albert nanogpt timm_vision_transformer
PYTHONPATH=$(pwd)/torchbench test_inductor_torchbench_smoketest_perf
else
checkout_install_torchbench
@ -1085,19 +1128,21 @@ elif [[ "${TEST_CONFIG}" == *inductor* && "${SHARD_NUMBER}" == 1 ]]; then
test_inductor
test_inductor_distributed
elif [[ "${TEST_CONFIG}" == *dynamo* && "${SHARD_NUMBER}" == 1 && $NUM_TEST_SHARDS -gt 1 ]]; then
test_without_numpy
install_torchvision
test_dynamo_shard 1
test_aten
elif [[ "${TEST_CONFIG}" == *dynamo* && "${SHARD_NUMBER}" == 2 && $NUM_TEST_SHARDS -gt 1 ]]; then
elif [[ "${TEST_CONFIG}" == *dynamo* && $SHARD_NUMBER -gt 1 && $NUM_TEST_SHARDS -gt 1 ]]; then
install_torchvision
test_dynamo_shard 2
test_dynamo_shard "${SHARD_NUMBER}"
elif [[ "${SHARD_NUMBER}" == 1 && $NUM_TEST_SHARDS -gt 1 ]]; then
test_without_numpy
install_torchvision
test_python_shard 1
test_aten
test_libtorch 1
if [[ "${BUILD_ENVIRONMENT}" == *xpu* ]]; then
test_xpu_bin
fi
elif [[ "${SHARD_NUMBER}" == 2 && $NUM_TEST_SHARDS -gt 1 ]]; then
install_torchvision
test_python_shard 2
@ -1122,6 +1167,11 @@ elif [[ "${BUILD_ENVIRONMENT}" == *rocm* && -n "$TESTS_TO_INCLUDE" ]]; then
install_torchvision
test_python
test_aten
elif [[ "${BUILD_ENVIRONMENT}" == *xpu* ]]; then
install_torchvision
test_python
test_aten
test_xpu_bin
else
install_torchvision
install_monkeytype

View File

@ -1,198 +0,0 @@
"""
This module models the tree of configuration variants
for "smoketest" builds.
Each subclass of ConfigNode represents a layer of the configuration hierarchy.
These tree nodes encapsulate the logic for whether a branch of the hierarchy
should be "pruned".
"""
from collections import OrderedDict
import cimodel.data.dimensions as dimensions
from cimodel.lib.conf_tree import ConfigNode
LINKING_DIMENSIONS = [
"shared",
"static",
]
DEPS_INCLUSION_DIMENSIONS = [
"with-deps",
"without-deps",
]
def get_processor_arch_name(gpu_version):
return (
"cpu"
if not gpu_version
else (
"cu" + gpu_version.strip("cuda")
if gpu_version.startswith("cuda")
else gpu_version
)
)
CONFIG_TREE_DATA = OrderedDict()
# GCC config variants:
#
# All the nightlies (except libtorch with new gcc ABI) are built with devtoolset7,
# which can only build with old gcc ABI. It is better than devtoolset3
# because it understands avx512, which is needed for good fbgemm performance.
#
# Libtorch with new gcc ABI is built with gcc 5.4 on Ubuntu 16.04.
LINUX_GCC_CONFIG_VARIANTS = OrderedDict(
manywheel=["devtoolset7"],
conda=["devtoolset7"],
libtorch=[
"devtoolset7",
"gcc5.4_cxx11-abi",
],
)
WINDOWS_LIBTORCH_CONFIG_VARIANTS = [
"debug",
"release",
]
class TopLevelNode(ConfigNode):
def __init__(self, node_name, config_tree_data, smoke):
super().__init__(None, node_name)
self.config_tree_data = config_tree_data
self.props["smoke"] = smoke
def get_children(self):
return [
OSConfigNode(self, x, c, p) for (x, (c, p)) in self.config_tree_data.items()
]
class OSConfigNode(ConfigNode):
def __init__(self, parent, os_name, gpu_versions, py_tree):
super().__init__(parent, os_name)
self.py_tree = py_tree
self.props["os_name"] = os_name
self.props["gpu_versions"] = gpu_versions
def get_children(self):
return [PackageFormatConfigNode(self, k, v) for k, v in self.py_tree.items()]
class PackageFormatConfigNode(ConfigNode):
def __init__(self, parent, package_format, python_versions):
super().__init__(parent, package_format)
self.props["python_versions"] = python_versions
self.props["package_format"] = package_format
def get_children(self):
if self.find_prop("os_name") == "linux":
return [
LinuxGccConfigNode(self, v)
for v in LINUX_GCC_CONFIG_VARIANTS[self.find_prop("package_format")]
]
elif (
self.find_prop("os_name") == "windows"
and self.find_prop("package_format") == "libtorch"
):
return [
WindowsLibtorchConfigNode(self, v)
for v in WINDOWS_LIBTORCH_CONFIG_VARIANTS
]
else:
return [ArchConfigNode(self, v) for v in self.find_prop("gpu_versions")]
class LinuxGccConfigNode(ConfigNode):
def __init__(self, parent, gcc_config_variant):
super().__init__(parent, "GCC_CONFIG_VARIANT=" + str(gcc_config_variant))
self.props["gcc_config_variant"] = gcc_config_variant
def get_children(self):
gpu_versions = self.find_prop("gpu_versions")
# XXX devtoolset7 on CUDA 9.0 is temporarily disabled
# see https://github.com/pytorch/pytorch/issues/20066
if self.find_prop("gcc_config_variant") == "devtoolset7":
gpu_versions = filter(lambda x: x != "cuda_90", gpu_versions)
# XXX disabling conda rocm build since docker images are not there
if self.find_prop("package_format") == "conda":
gpu_versions = filter(
lambda x: x not in dimensions.ROCM_VERSION_LABELS, gpu_versions
)
# XXX libtorch rocm build is temporarily disabled
if self.find_prop("package_format") == "libtorch":
gpu_versions = filter(
lambda x: x not in dimensions.ROCM_VERSION_LABELS, gpu_versions
)
return [ArchConfigNode(self, v) for v in gpu_versions]
class WindowsLibtorchConfigNode(ConfigNode):
def __init__(self, parent, libtorch_config_variant):
super().__init__(
parent, "LIBTORCH_CONFIG_VARIANT=" + str(libtorch_config_variant)
)
self.props["libtorch_config_variant"] = libtorch_config_variant
def get_children(self):
return [ArchConfigNode(self, v) for v in self.find_prop("gpu_versions")]
class ArchConfigNode(ConfigNode):
def __init__(self, parent, gpu):
super().__init__(parent, get_processor_arch_name(gpu))
self.props["gpu"] = gpu
def get_children(self):
return [PyVersionConfigNode(self, v) for v in self.find_prop("python_versions")]
class PyVersionConfigNode(ConfigNode):
def __init__(self, parent, pyver):
super().__init__(parent, pyver)
self.props["pyver"] = pyver
def get_children(self):
package_format = self.find_prop("package_format")
os_name = self.find_prop("os_name")
has_libtorch_variants = package_format == "libtorch" and os_name == "linux"
linking_variants = LINKING_DIMENSIONS if has_libtorch_variants else []
return [LinkingVariantConfigNode(self, v) for v in linking_variants]
class LinkingVariantConfigNode(ConfigNode):
def __init__(self, parent, linking_variant):
super().__init__(parent, linking_variant)
def get_children(self):
return [
DependencyInclusionConfigNode(self, v) for v in DEPS_INCLUSION_DIMENSIONS
]
class DependencyInclusionConfigNode(ConfigNode):
def __init__(self, parent, deps_variant):
super().__init__(parent, deps_variant)
self.props["libtorch_variant"] = "-".join(
[self.parent.get_label(), self.get_label()]
)

View File

@ -1,275 +0,0 @@
from collections import OrderedDict
import cimodel.data.binary_build_data as binary_build_data
import cimodel.data.simple.util.branch_filters as branch_filters
import cimodel.lib.conf_tree as conf_tree
import cimodel.lib.miniutils as miniutils
class Conf:
def __init__(
self,
os,
gpu_version,
pydistro,
parms,
smoke,
libtorch_variant,
gcc_config_variant,
libtorch_config_variant,
):
self.os = os
self.gpu_version = gpu_version
self.pydistro = pydistro
self.parms = parms
self.smoke = smoke
self.libtorch_variant = libtorch_variant
self.gcc_config_variant = gcc_config_variant
self.libtorch_config_variant = libtorch_config_variant
def gen_build_env_parms(self):
elems = (
[self.pydistro]
+ self.parms
+ [binary_build_data.get_processor_arch_name(self.gpu_version)]
)
if self.gcc_config_variant is not None:
elems.append(str(self.gcc_config_variant))
if self.libtorch_config_variant is not None:
elems.append(str(self.libtorch_config_variant))
return elems
def gen_docker_image(self):
if self.gcc_config_variant == "gcc5.4_cxx11-abi":
if self.gpu_version is None:
return miniutils.quote("pytorch/libtorch-cxx11-builder:cpu")
else:
return miniutils.quote(
f"pytorch/libtorch-cxx11-builder:{self.gpu_version}"
)
if self.pydistro == "conda":
if self.gpu_version is None:
return miniutils.quote("pytorch/conda-builder:cpu")
else:
return miniutils.quote(f"pytorch/conda-builder:{self.gpu_version}")
docker_word_substitution = {
"manywheel": "manylinux",
"libtorch": "manylinux",
}
docker_distro_prefix = miniutils.override(
self.pydistro, docker_word_substitution
)
# The cpu nightlies are built on the pytorch/manylinux-cuda102 docker image
# TODO cuda images should consolidate into tag-base images similar to rocm
alt_docker_suffix = (
"cuda102"
if not self.gpu_version
else (
"rocm:" + self.gpu_version.strip("rocm")
if self.gpu_version.startswith("rocm")
else self.gpu_version
)
)
docker_distro_suffix = (
alt_docker_suffix
if self.pydistro != "conda"
else ("cuda" if alt_docker_suffix.startswith("cuda") else "rocm")
)
return miniutils.quote(
"pytorch/" + docker_distro_prefix + "-" + docker_distro_suffix
)
def get_name_prefix(self):
return "smoke" if self.smoke else "binary"
def gen_build_name(self, build_or_test, nightly):
parts = [self.get_name_prefix(), self.os] + self.gen_build_env_parms()
if nightly:
parts.append("nightly")
if self.libtorch_variant:
parts.append(self.libtorch_variant)
if not self.smoke:
parts.append(build_or_test)
joined = "_".join(parts)
return joined.replace(".", "_")
def gen_workflow_job(self, phase, upload_phase_dependency=None, nightly=False):
job_def = OrderedDict()
job_def["name"] = self.gen_build_name(phase, nightly)
job_def["build_environment"] = miniutils.quote(
" ".join(self.gen_build_env_parms())
)
if self.smoke:
job_def["requires"] = [
"update_s3_htmls",
]
job_def["filters"] = branch_filters.gen_filter_dict(
branches_list=["postnightly"],
)
else:
filter_branch = r"/.*/"
job_def["filters"] = branch_filters.gen_filter_dict(
branches_list=[filter_branch],
tags_list=[branch_filters.RC_PATTERN],
)
if self.libtorch_variant:
job_def["libtorch_variant"] = miniutils.quote(self.libtorch_variant)
if phase == "test":
if not self.smoke:
job_def["requires"] = [self.gen_build_name("build", nightly)]
if not (self.smoke and self.os == "macos") and self.os != "windows":
job_def["docker_image"] = self.gen_docker_image()
# fix this. only works on cuda not rocm
if self.os != "windows" and self.gpu_version:
job_def["use_cuda_docker_runtime"] = miniutils.quote("1")
else:
if self.os == "linux" and phase != "upload":
job_def["docker_image"] = self.gen_docker_image()
if phase == "test":
if self.gpu_version:
if self.os == "windows":
job_def["executor"] = "windows-with-nvidia-gpu"
else:
job_def["resource_class"] = "gpu.medium"
os_name = miniutils.override(self.os, {"macos": "mac"})
job_name = "_".join([self.get_name_prefix(), os_name, phase])
return {job_name: job_def}
def gen_upload_job(self, phase, requires_dependency):
"""Generate binary_upload job for configuration
Output looks similar to:
- binary_upload:
name: binary_linux_manywheel_3_7m_cu113_devtoolset7_nightly_upload
context: org-member
requires: binary_linux_manywheel_3_7m_cu113_devtoolset7_nightly_test
filters:
branches:
only:
- nightly
tags:
only: /v[0-9]+(\\.[0-9]+)*-rc[0-9]+/
package_type: manywheel
upload_subfolder: cu113
"""
return {
"binary_upload": OrderedDict(
{
"name": self.gen_build_name(phase, nightly=True),
"context": "org-member",
"requires": [
self.gen_build_name(requires_dependency, nightly=True)
],
"filters": branch_filters.gen_filter_dict(
branches_list=["nightly"],
tags_list=[branch_filters.RC_PATTERN],
),
"package_type": self.pydistro,
"upload_subfolder": binary_build_data.get_processor_arch_name(
self.gpu_version,
),
}
)
}
def get_root(smoke, name):
return binary_build_data.TopLevelNode(
name,
binary_build_data.CONFIG_TREE_DATA,
smoke,
)
def gen_build_env_list(smoke):
root = get_root(smoke, "N/A")
config_list = conf_tree.dfs(root)
newlist = []
for c in config_list:
conf = Conf(
c.find_prop("os_name"),
c.find_prop("gpu"),
c.find_prop("package_format"),
[c.find_prop("pyver")],
c.find_prop("smoke")
and not (c.find_prop("os_name") == "macos_arm64"), # don't test arm64
c.find_prop("libtorch_variant"),
c.find_prop("gcc_config_variant"),
c.find_prop("libtorch_config_variant"),
)
newlist.append(conf)
return newlist
def predicate_exclude_macos(config):
return config.os == "linux" or config.os == "windows"
def get_nightly_uploads():
configs = gen_build_env_list(False)
mylist = []
for conf in configs:
phase_dependency = "test" if predicate_exclude_macos(conf) else "build"
mylist.append(conf.gen_upload_job("upload", phase_dependency))
return mylist
def get_post_upload_jobs():
return [
{
"update_s3_htmls": {
"name": "update_s3_htmls",
"context": "org-member",
"filters": branch_filters.gen_filter_dict(
branches_list=["postnightly"],
),
},
},
]
def get_nightly_tests():
configs = gen_build_env_list(False)
filtered_configs = filter(predicate_exclude_macos, configs)
tests = []
for conf_options in filtered_configs:
yaml_item = conf_options.gen_workflow_job("test", nightly=True)
tests.append(yaml_item)
return tests
def get_jobs(toplevel_key, smoke):
jobs_list = []
configs = gen_build_env_list(smoke)
phase = "build" if toplevel_key == "binarybuilds" else "test"
for build_config in configs:
# don't test for macos_arm64 as it's cross compiled
if phase != "test" or build_config.os != "macos_arm64":
jobs_list.append(build_config.gen_workflow_job(phase, nightly=True))
return jobs_list
def get_binary_build_jobs():
return get_jobs("binarybuilds", False)
def get_binary_smoke_test_jobs():
return get_jobs("binarysmoketests", True)


@ -1,19 +0,0 @@
PHASES = ["build", "test"]
CUDA_VERSIONS = [
"102",
"113",
"116",
"117",
]
ROCM_VERSIONS = [
"4.3.1",
"4.5.2",
]
ROCM_VERSION_LABELS = ["rocm" + v for v in ROCM_VERSIONS]
GPU_VERSIONS = [None] + ["cuda" + v for v in CUDA_VERSIONS] + ROCM_VERSION_LABELS
STANDARD_PYTHON_VERSIONS = ["3.7", "3.8", "3.9", "3.10"]


@ -1,296 +0,0 @@
from cimodel.lib.conf_tree import ConfigNode
CONFIG_TREE_DATA = []
def get_major_pyver(dotted_version):
parts = dotted_version.split(".")
return "py" + parts[0]
class TreeConfigNode(ConfigNode):
def __init__(self, parent, node_name, subtree):
super().__init__(parent, self.modify_label(node_name))
self.subtree = subtree
self.init2(node_name)
def modify_label(self, label):
return label
def init2(self, node_name):
pass
def get_children(self):
return [self.child_constructor()(self, k, v) for (k, v) in self.subtree]
class TopLevelNode(TreeConfigNode):
def __init__(self, node_name, subtree):
super().__init__(None, node_name, subtree)
# noinspection PyMethodMayBeStatic
def child_constructor(self):
return DistroConfigNode
class DistroConfigNode(TreeConfigNode):
def init2(self, node_name):
self.props["distro_name"] = node_name
def child_constructor(self):
distro = self.find_prop("distro_name")
next_nodes = {
"xenial": XenialCompilerConfigNode,
"bionic": BionicCompilerConfigNode,
}
return next_nodes[distro]
class PyVerConfigNode(TreeConfigNode):
def init2(self, node_name):
self.props["pyver"] = node_name
self.props["abbreviated_pyver"] = get_major_pyver(node_name)
if node_name == "3.9":
self.props["abbreviated_pyver"] = "py3.9"
# noinspection PyMethodMayBeStatic
def child_constructor(self):
return ExperimentalFeatureConfigNode
class ExperimentalFeatureConfigNode(TreeConfigNode):
def init2(self, node_name):
self.props["experimental_feature"] = node_name
def child_constructor(self):
experimental_feature = self.find_prop("experimental_feature")
next_nodes = {
"asan": AsanConfigNode,
"xla": XlaConfigNode,
"mps": MPSConfigNode,
"vulkan": VulkanConfigNode,
"parallel_tbb": ParallelTBBConfigNode,
"crossref": CrossRefConfigNode,
"dynamo": DynamoConfigNode,
"parallel_native": ParallelNativeConfigNode,
"onnx": ONNXConfigNode,
"libtorch": LibTorchConfigNode,
"important": ImportantConfigNode,
"build_only": BuildOnlyConfigNode,
"shard_test": ShardTestConfigNode,
"cuda_gcc_override": CudaGccOverrideConfigNode,
"pure_torch": PureTorchConfigNode,
"slow_gradcheck": SlowGradcheckConfigNode,
}
return next_nodes[experimental_feature]
class SlowGradcheckConfigNode(TreeConfigNode):
def init2(self, node_name):
self.props["is_slow_gradcheck"] = True
def child_constructor(self):
return ExperimentalFeatureConfigNode
class PureTorchConfigNode(TreeConfigNode):
def modify_label(self, label):
return "PURE_TORCH=" + str(label)
def init2(self, node_name):
self.props["is_pure_torch"] = node_name
def child_constructor(self):
return ImportantConfigNode
class XlaConfigNode(TreeConfigNode):
def modify_label(self, label):
return "XLA=" + str(label)
def init2(self, node_name):
self.props["is_xla"] = node_name
def child_constructor(self):
return ImportantConfigNode
class MPSConfigNode(TreeConfigNode):
def modify_label(self, label):
return "MPS=" + str(label)
def init2(self, node_name):
self.props["is_mps"] = node_name
def child_constructor(self):
return ImportantConfigNode
class AsanConfigNode(TreeConfigNode):
def modify_label(self, label):
return "Asan=" + str(label)
def init2(self, node_name):
self.props["is_asan"] = node_name
def child_constructor(self):
return ExperimentalFeatureConfigNode
class ONNXConfigNode(TreeConfigNode):
def modify_label(self, label):
return "Onnx=" + str(label)
def init2(self, node_name):
self.props["is_onnx"] = node_name
def child_constructor(self):
return ImportantConfigNode
class VulkanConfigNode(TreeConfigNode):
def modify_label(self, label):
return "Vulkan=" + str(label)
def init2(self, node_name):
self.props["is_vulkan"] = node_name
def child_constructor(self):
return ImportantConfigNode
class ParallelTBBConfigNode(TreeConfigNode):
def modify_label(self, label):
return "PARALLELTBB=" + str(label)
def init2(self, node_name):
self.props["parallel_backend"] = "paralleltbb"
def child_constructor(self):
return ImportantConfigNode
class CrossRefConfigNode(TreeConfigNode):
def init2(self, node_name):
self.props["is_crossref"] = node_name
def child_constructor(self):
return ImportantConfigNode
class DynamoConfigNode(TreeConfigNode):
def init2(self, node_name):
self.props["is_dynamo"] = node_name
def child_constructor(self):
return ImportantConfigNode
class ParallelNativeConfigNode(TreeConfigNode):
def modify_label(self, label):
return "PARALLELNATIVE=" + str(label)
def init2(self, node_name):
self.props["parallel_backend"] = "parallelnative"
def child_constructor(self):
return ImportantConfigNode
class LibTorchConfigNode(TreeConfigNode):
def modify_label(self, label):
return "BUILD_TEST_LIBTORCH=" + str(label)
def init2(self, node_name):
self.props["is_libtorch"] = node_name
def child_constructor(self):
return ExperimentalFeatureConfigNode
class CudaGccOverrideConfigNode(TreeConfigNode):
def init2(self, node_name):
self.props["cuda_gcc_override"] = node_name
def child_constructor(self):
return ExperimentalFeatureConfigNode
class BuildOnlyConfigNode(TreeConfigNode):
def init2(self, node_name):
self.props["build_only"] = node_name
def child_constructor(self):
return ExperimentalFeatureConfigNode
class ShardTestConfigNode(TreeConfigNode):
def init2(self, node_name):
self.props["shard_test"] = node_name
def child_constructor(self):
return ImportantConfigNode
class ImportantConfigNode(TreeConfigNode):
def modify_label(self, label):
return "IMPORTANT=" + str(label)
def init2(self, node_name):
self.props["is_important"] = node_name
def get_children(self):
return []
class XenialCompilerConfigNode(TreeConfigNode):
def modify_label(self, label):
return label or "<unspecified>"
def init2(self, node_name):
self.props["compiler_name"] = node_name
# noinspection PyMethodMayBeStatic
def child_constructor(self):
return (
XenialCompilerVersionConfigNode
if self.props["compiler_name"]
else PyVerConfigNode
)
class BionicCompilerConfigNode(TreeConfigNode):
def modify_label(self, label):
return label or "<unspecified>"
def init2(self, node_name):
self.props["compiler_name"] = node_name
# noinspection PyMethodMayBeStatic
def child_constructor(self):
return (
BionicCompilerVersionConfigNode
if self.props["compiler_name"]
else PyVerConfigNode
)
class XenialCompilerVersionConfigNode(TreeConfigNode):
def init2(self, node_name):
self.props["compiler_version"] = node_name
# noinspection PyMethodMayBeStatic
def child_constructor(self):
return PyVerConfigNode
class BionicCompilerVersionConfigNode(TreeConfigNode):
def init2(self, node_name):
self.props["compiler_version"] = node_name
# noinspection PyMethodMayBeStatic
def child_constructor(self):
return PyVerConfigNode


@ -1,382 +0,0 @@
from collections import OrderedDict
from dataclasses import dataclass, field
from typing import List, Optional
import cimodel.data.dimensions as dimensions
import cimodel.lib.conf_tree as conf_tree
import cimodel.lib.miniutils as miniutils
from cimodel.data.pytorch_build_data import CONFIG_TREE_DATA, TopLevelNode
from cimodel.data.simple.util.branch_filters import gen_filter_dict, RC_PATTERN
from cimodel.data.simple.util.docker_constants import gen_docker_image
@dataclass
class Conf:
distro: str
parms: List[str]
parms_list_ignored_for_docker_image: Optional[List[str]] = None
pyver: Optional[str] = None
cuda_version: Optional[str] = None
rocm_version: Optional[str] = None
# TODO expand this to cover all the USE_* that we want to test for
# tensorrt, leveldb, lmdb, redis, opencv, mkldnn, ideep, etc.
# (from https://github.com/pytorch/pytorch/pull/17323#discussion_r259453608)
is_xla: bool = False
is_vulkan: bool = False
is_pure_torch: bool = False
restrict_phases: Optional[List[str]] = None
gpu_resource: Optional[str] = None
dependent_tests: List = field(default_factory=list)
parent_build: Optional["Conf"] = None
is_libtorch: bool = False
is_important: bool = False
parallel_backend: Optional[str] = None
build_only: bool = False
@staticmethod
def is_test_phase(phase):
return "test" in phase
# TODO: Eliminate the special casing for docker paths
# In the short term, we *will* need to support special casing as docker images are merged for caffe2 and pytorch
def get_parms(self, for_docker):
leading = []
# We just don't run non-important jobs on pull requests;
# previously we also named them in a way to make it obvious
# if self.is_important and not for_docker:
# leading.append("AAA")
leading.append("pytorch")
if self.is_xla and not for_docker:
leading.append("xla")
if self.is_vulkan and not for_docker:
leading.append("vulkan")
if self.is_libtorch and not for_docker:
leading.append("libtorch")
if self.is_pure_torch and not for_docker:
leading.append("pure_torch")
if self.parallel_backend is not None and not for_docker:
leading.append(self.parallel_backend)
cuda_parms = []
if self.cuda_version:
cudnn = "cudnn8" if self.cuda_version.startswith("11.") else "cudnn7"
cuda_parms.extend(["cuda" + self.cuda_version, cudnn])
if self.rocm_version:
cuda_parms.extend([f"rocm{self.rocm_version}"])
result = leading + ["linux", self.distro] + cuda_parms + self.parms
if not for_docker and self.parms_list_ignored_for_docker_image is not None:
result = result + self.parms_list_ignored_for_docker_image
return result
def gen_docker_image_path(self):
parms_source = self.parent_build or self
base_build_env_name = "-".join(parms_source.get_parms(True))
image_name, _ = gen_docker_image(base_build_env_name)
return miniutils.quote(image_name)
def gen_docker_image_requires(self):
parms_source = self.parent_build or self
base_build_env_name = "-".join(parms_source.get_parms(True))
_, requires = gen_docker_image(base_build_env_name)
return miniutils.quote(requires)
def get_build_job_name_pieces(self, build_or_test):
return self.get_parms(False) + [build_or_test]
def gen_build_name(self, build_or_test):
return (
("_".join(map(str, self.get_build_job_name_pieces(build_or_test))))
.replace(".", "_")
.replace("-", "_")
)
def get_dependents(self):
return self.dependent_tests or []
def gen_workflow_params(self, phase):
parameters = OrderedDict()
build_job_name_pieces = self.get_build_job_name_pieces(phase)
build_env_name = "-".join(map(str, build_job_name_pieces))
parameters["build_environment"] = miniutils.quote(build_env_name)
parameters["docker_image"] = self.gen_docker_image_path()
if Conf.is_test_phase(phase) and self.gpu_resource:
parameters["use_cuda_docker_runtime"] = miniutils.quote("1")
if Conf.is_test_phase(phase):
resource_class = "large"
if self.gpu_resource:
resource_class = "gpu." + self.gpu_resource
if self.rocm_version is not None:
resource_class = "pytorch/amd-gpu"
parameters["resource_class"] = resource_class
if phase == "build" and self.rocm_version is not None:
parameters["resource_class"] = "xlarge"
if hasattr(self, "filters"):
parameters["filters"] = self.filters
if self.build_only:
parameters["build_only"] = miniutils.quote(str(int(True)))
return parameters
def gen_workflow_job(self, phase):
job_def = OrderedDict()
job_def["name"] = self.gen_build_name(phase)
if Conf.is_test_phase(phase):
# TODO When merging the caffe2 and pytorch jobs, it might be convenient for a while to make a
# caffe2 test job dependent on a pytorch build job. This way we could quickly dedup the repeated
# build of pytorch in the caffe2 build job, and just run the caffe2 tests off of a completed
# pytorch build job (from https://github.com/pytorch/pytorch/pull/17323#discussion_r259452641)
dependency_build = self.parent_build or self
job_def["requires"] = [dependency_build.gen_build_name("build")]
job_name = "pytorch_linux_test"
else:
job_name = "pytorch_linux_build"
job_def["requires"] = [self.gen_docker_image_requires()]
if not self.is_important:
job_def["filters"] = gen_filter_dict()
job_def.update(self.gen_workflow_params(phase))
return {job_name: job_def}
# TODO This is a hack to special case some configs just for the workflow list
class HiddenConf:
def __init__(self, name, parent_build=None, filters=None):
self.name = name
self.parent_build = parent_build
self.filters = filters
def gen_workflow_job(self, phase):
return {
self.gen_build_name(phase): {
"requires": [self.parent_build.gen_build_name("build")],
"filters": self.filters,
}
}
def gen_build_name(self, _):
return self.name
class DocPushConf:
def __init__(self, name, parent_build=None, branch="master"):
self.name = name
self.parent_build = parent_build
self.branch = branch
def gen_workflow_job(self, phase):
return {
"pytorch_doc_push": {
"name": self.name,
"branch": self.branch,
"requires": [self.parent_build],
"context": "org-member",
"filters": gen_filter_dict(
branches_list=["nightly"], tags_list=RC_PATTERN
),
}
}
def gen_docs_configs(xenial_parent_config):
configs = []
configs.append(
HiddenConf(
"pytorch_python_doc_build",
parent_build=xenial_parent_config,
filters=gen_filter_dict(
branches_list=["master", "main", "nightly"], tags_list=RC_PATTERN
),
)
)
configs.append(
DocPushConf(
"pytorch_python_doc_push",
parent_build="pytorch_python_doc_build",
branch="site",
)
)
configs.append(
HiddenConf(
"pytorch_cpp_doc_build",
parent_build=xenial_parent_config,
filters=gen_filter_dict(
branches_list=["master", "main", "nightly"], tags_list=RC_PATTERN
),
)
)
configs.append(
DocPushConf(
"pytorch_cpp_doc_push",
parent_build="pytorch_cpp_doc_build",
branch="master",
)
)
return configs
def get_root():
return TopLevelNode("PyTorch Builds", CONFIG_TREE_DATA)
def gen_tree():
root = get_root()
configs_list = conf_tree.dfs(root)
return configs_list
def instantiate_configs(only_slow_gradcheck):
config_list = []
root = get_root()
found_configs = conf_tree.dfs(root)
for fc in found_configs:
restrict_phases = None
distro_name = fc.find_prop("distro_name")
compiler_name = fc.find_prop("compiler_name")
compiler_version = fc.find_prop("compiler_version")
is_xla = fc.find_prop("is_xla") or False
is_asan = fc.find_prop("is_asan") or False
is_crossref = fc.find_prop("is_crossref") or False
is_dynamo = fc.find_prop("is_dynamo") or False
is_onnx = fc.find_prop("is_onnx") or False
is_pure_torch = fc.find_prop("is_pure_torch") or False
is_vulkan = fc.find_prop("is_vulkan") or False
is_slow_gradcheck = fc.find_prop("is_slow_gradcheck") or False
parms_list_ignored_for_docker_image = []
if only_slow_gradcheck ^ is_slow_gradcheck:
continue
python_version = None
if compiler_name == "cuda" or compiler_name == "android":
python_version = fc.find_prop("pyver")
parms_list = [fc.find_prop("abbreviated_pyver")]
else:
parms_list = ["py" + fc.find_prop("pyver")]
cuda_version = None
rocm_version = None
if compiler_name == "cuda":
cuda_version = fc.find_prop("compiler_version")
elif compiler_name == "rocm":
rocm_version = fc.find_prop("compiler_version")
restrict_phases = ["build", "test1", "test2", "caffe2_test"]
elif compiler_name == "android":
android_ndk_version = fc.find_prop("compiler_version")
# TODO: do we need clang to compile host binaries like protoc?
parms_list.append("clang5")
parms_list.append("android-ndk-" + android_ndk_version)
android_abi = fc.find_prop("android_abi")
parms_list_ignored_for_docker_image.append(android_abi)
restrict_phases = ["build"]
elif compiler_name:
gcc_version = compiler_name + (fc.find_prop("compiler_version") or "")
parms_list.append(gcc_version)
if is_asan:
parms_list.append("asan")
python_version = fc.find_prop("pyver")
parms_list[0] = fc.find_prop("abbreviated_pyver")
if is_crossref:
parms_list_ignored_for_docker_image.append("crossref")
if is_dynamo:
parms_list_ignored_for_docker_image.append("dynamo")
if is_onnx:
parms_list.append("onnx")
python_version = fc.find_prop("pyver")
parms_list[0] = fc.find_prop("abbreviated_pyver")
restrict_phases = ["build", "ort_test1", "ort_test2"]
if cuda_version:
cuda_gcc_version = fc.find_prop("cuda_gcc_override") or "gcc7"
parms_list.append(cuda_gcc_version)
is_libtorch = fc.find_prop("is_libtorch") or False
is_important = fc.find_prop("is_important") or False
parallel_backend = fc.find_prop("parallel_backend") or None
build_only = fc.find_prop("build_only") or False
shard_test = fc.find_prop("shard_test") or False
# TODO: fix pure_torch python test packaging issue.
if shard_test:
restrict_phases = ["build"] if restrict_phases is None else restrict_phases
restrict_phases.extend(["test1", "test2"])
if build_only or is_pure_torch:
restrict_phases = ["build"]
if is_slow_gradcheck:
parms_list_ignored_for_docker_image.append("old")
parms_list_ignored_for_docker_image.append("gradcheck")
gpu_resource = None
if cuda_version and cuda_version != "10":
gpu_resource = "medium"
c = Conf(
distro_name,
parms_list,
parms_list_ignored_for_docker_image,
python_version,
cuda_version,
rocm_version,
is_xla,
is_vulkan,
is_pure_torch,
restrict_phases,
gpu_resource,
is_libtorch=is_libtorch,
is_important=is_important,
parallel_backend=parallel_backend,
build_only=build_only,
)
# run docs builds on "pytorch-linux-xenial-py3.7-gcc5.4". Docs builds
# should run on a CPU-only build that runs on all PRs.
# XXX should this be updated to a more modern build?
if (
distro_name == "xenial"
and fc.find_prop("pyver") == "3.7"
and cuda_version is None
and parallel_backend is None
and not is_vulkan
and not is_pure_torch
and compiler_name == "gcc"
and fc.find_prop("compiler_version") == "5.4"
):
c.filters = gen_filter_dict(branches_list=r"/.*/", tags_list=RC_PATTERN)
c.dependent_tests = gen_docs_configs(c)
config_list.append(c)
return config_list
def get_workflow_jobs(only_slow_gradcheck=False):
config_list = instantiate_configs(only_slow_gradcheck)
x = []
for conf_options in config_list:
phases = conf_options.restrict_phases or dimensions.PHASES
for phase in phases:
# TODO why does this not have a test?
if Conf.is_test_phase(phase) and conf_options.cuda_version == "10":
continue
x.append(conf_options.gen_workflow_job(phase))
# TODO convert to recursion
for conf in conf_options.get_dependents():
x.append(conf.gen_workflow_job("test"))
return x


@ -1,39 +0,0 @@
from collections import OrderedDict
from cimodel.data.simple.util.branch_filters import gen_filter_dict, RC_PATTERN
from cimodel.lib.miniutils import quote
# NOTE: All hardcoded docker image builds have been migrated to GHA
IMAGE_NAMES = []
# This entry should be an element from the list above
# This should contain the image matching the "slow_gradcheck" entry in
# pytorch_build_data.py
SLOW_GRADCHECK_IMAGE_NAME = "pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7"
def get_workflow_jobs(images=IMAGE_NAMES, only_slow_gradcheck=False):
"""Generates a list of docker image build definitions"""
ret = []
for image_name in images:
if image_name.startswith("docker-"):
image_name = image_name.lstrip("docker-")
if only_slow_gradcheck and image_name is not SLOW_GRADCHECK_IMAGE_NAME:
continue
parameters = OrderedDict(
{
"name": quote(f"docker-{image_name}"),
"image_name": quote(image_name),
}
)
if image_name == "pytorch-linux-xenial-py3.7-gcc5.4":
# pushing documentation on tags requires CircleCI to also
# build all the dependencies on tags, including this docker image
parameters["filters"] = gen_filter_dict(
branches_list=r"/.*/", tags_list=RC_PATTERN
)
ret.append(OrderedDict({"docker_build_job": parameters}))
return ret


@ -1,100 +0,0 @@
import cimodel.lib.miniutils as miniutils
from cimodel.data.simple.util.branch_filters import gen_filter_dict_exclude
from cimodel.data.simple.util.versions import MultiPartVersion
XCODE_VERSION = MultiPartVersion([12, 5, 1])
class ArchVariant:
def __init__(self, name, custom_build_name=""):
self.name = name
self.custom_build_name = custom_build_name
def render(self):
extra_parts = (
[self.custom_build_name] if len(self.custom_build_name) > 0 else []
)
return "-".join([self.name] + extra_parts).replace("_", "-")
def get_platform(arch_variant_name):
return "SIMULATOR" if arch_variant_name == "x86_64" else "OS"
class IOSJob:
def __init__(
self, xcode_version, arch_variant, is_org_member_context=True, extra_props=None
):
self.xcode_version = xcode_version
self.arch_variant = arch_variant
self.is_org_member_context = is_org_member_context
self.extra_props = extra_props
def gen_name_parts(self):
version_parts = self.xcode_version.render_dots_or_parts("-")
build_variant_suffix = self.arch_variant.render()
return (
[
"ios",
]
+ version_parts
+ [
build_variant_suffix,
]
)
def gen_job_name(self):
return "-".join(self.gen_name_parts())
def gen_tree(self):
platform_name = get_platform(self.arch_variant.name)
props_dict = {
"name": self.gen_job_name(),
"build_environment": self.gen_job_name(),
"ios_arch": self.arch_variant.name,
"ios_platform": platform_name,
}
if self.is_org_member_context:
props_dict["context"] = "org-member"
if self.extra_props:
props_dict.update(self.extra_props)
props_dict["filters"] = gen_filter_dict_exclude()
return [{"pytorch_ios_build": props_dict}]
WORKFLOW_DATA = [
IOSJob(
XCODE_VERSION,
ArchVariant("x86_64"),
is_org_member_context=False,
extra_props={"lite_interpreter": miniutils.quote(str(int(True)))},
),
# IOSJob(XCODE_VERSION, ArchVariant("arm64"), extra_props={
# "lite_interpreter": miniutils.quote(str(int(True)))}),
# IOSJob(XCODE_VERSION, ArchVariant("arm64", "metal"), extra_props={
# "use_metal": miniutils.quote(str(int(True))),
# "lite_interpreter": miniutils.quote(str(int(True)))}),
# IOSJob(XCODE_VERSION, ArchVariant("arm64", "custom-ops"), extra_props={
# "op_list": "mobilenetv2.yaml",
# "lite_interpreter": miniutils.quote(str(int(True)))}),
IOSJob(
XCODE_VERSION,
ArchVariant("x86_64", "coreml"),
is_org_member_context=False,
extra_props={
"use_coreml": miniutils.quote(str(int(True))),
"lite_interpreter": miniutils.quote(str(int(True))),
},
),
# IOSJob(XCODE_VERSION, ArchVariant("arm64", "coreml"), extra_props={
# "use_coreml": miniutils.quote(str(int(True))),
# "lite_interpreter": miniutils.quote(str(int(True)))}),
]
def get_workflow_jobs():
return [item.gen_tree() for item in WORKFLOW_DATA]


@ -1,54 +0,0 @@
class MacOsJob:
def __init__(self, os_version, is_build=False, is_test=False, extra_props=tuple()):
# extra_props is a tuple because mutable data structures are not recommended
# as argument defaults.
self.os_version = os_version
self.is_build = is_build
self.is_test = is_test
self.extra_props = dict(extra_props)
def gen_tree(self):
non_phase_parts = ["pytorch", "macos", self.os_version, "py3"]
extra_name_list = [name for name, exist in self.extra_props.items() if exist]
full_job_name_list = (
non_phase_parts
+ extra_name_list
+ [
"build" if self.is_build else None,
"test" if self.is_test else None,
]
)
full_job_name = "_".join(list(filter(None, full_job_name_list)))
test_build_dependency = "_".join(non_phase_parts + ["build"])
extra_dependencies = [test_build_dependency] if self.is_test else []
job_dependencies = extra_dependencies
# Yes, we name the job after itself; it needs a non-empty value here
# for the YAML output to work.
props_dict = {"requires": job_dependencies, "name": full_job_name}
return [{full_job_name: props_dict}]
WORKFLOW_DATA = [
MacOsJob("10_15", is_build=True),
MacOsJob("10_13", is_build=True),
MacOsJob(
"10_13",
is_build=False,
is_test=True,
),
MacOsJob(
"10_13",
is_build=True,
is_test=True,
extra_props=tuple({"lite_interpreter": True}.items()),
),
]
def get_workflow_jobs():
return [item.gen_tree() for item in WORKFLOW_DATA]


@ -1,51 +0,0 @@
"""
PyTorch Mobile PR builds (use linux host toolchain + mobile build options)
"""
import cimodel.data.simple.util.branch_filters
import cimodel.lib.miniutils as miniutils
class MobileJob:
def __init__(
self, docker_image, docker_requires, variant_parts, is_master_only=False
):
self.docker_image = docker_image
self.docker_requires = docker_requires
self.variant_parts = variant_parts
self.is_master_only = is_master_only
def gen_tree(self):
non_phase_parts = [
"pytorch",
"linux",
"xenial",
"py3",
"clang5",
"mobile",
] + self.variant_parts
full_job_name = "_".join(non_phase_parts)
build_env_name = "-".join(non_phase_parts)
props_dict = {
"build_environment": build_env_name,
"build_only": miniutils.quote(str(int(True))),
"docker_image": self.docker_image,
"requires": self.docker_requires,
"name": full_job_name,
}
if self.is_master_only:
props_dict[
"filters"
] = cimodel.data.simple.util.branch_filters.gen_filter_dict()
return [{"pytorch_linux_build": props_dict}]
WORKFLOW_DATA = []
def get_workflow_jobs():
return [item.gen_tree() for item in WORKFLOW_DATA]


@ -1,96 +0,0 @@
import cimodel.data.simple.ios_definitions as ios_definitions
import cimodel.lib.miniutils as miniutils
class IOSNightlyJob:
def __init__(self, variant, is_full_jit=False, is_upload=False):
self.variant = variant
self.is_full_jit = is_full_jit
self.is_upload = is_upload
def get_phase_name(self):
return "upload" if self.is_upload else "build"
def get_common_name_pieces(self, sep):
extra_name_suffix = [self.get_phase_name()] if self.is_upload else []
extra_name = ["full_jit"] if self.is_full_jit else []
common_name_pieces = (
[
"ios",
]
+ extra_name
+ []
+ ios_definitions.XCODE_VERSION.render_dots_or_parts(sep)
+ [
"nightly",
self.variant,
"build",
]
+ extra_name_suffix
)
return common_name_pieces
def gen_job_name(self):
return "_".join(["pytorch"] + self.get_common_name_pieces(None))
def gen_tree(self):
build_configs = BUILD_CONFIGS_FULL_JIT if self.is_full_jit else BUILD_CONFIGS
extra_requires = (
[x.gen_job_name() for x in build_configs] if self.is_upload else []
)
props_dict = {
"build_environment": "-".join(
["libtorch"] + self.get_common_name_pieces(".")
),
"requires": extra_requires,
"context": "org-member",
"filters": {"branches": {"only": "nightly"}},
}
if not self.is_upload:
props_dict["ios_arch"] = self.variant
props_dict["ios_platform"] = ios_definitions.get_platform(self.variant)
props_dict["name"] = self.gen_job_name()
props_dict["use_metal"] = miniutils.quote(str(int(True)))
props_dict["use_coreml"] = miniutils.quote(str(int(True)))
if self.is_full_jit:
props_dict["lite_interpreter"] = miniutils.quote(str(int(False)))
template_name = "_".join(
[
"binary",
"ios",
self.get_phase_name(),
]
)
return [{template_name: props_dict}]
BUILD_CONFIGS = [
IOSNightlyJob("x86_64"),
IOSNightlyJob("arm64"),
]
BUILD_CONFIGS_FULL_JIT = [
IOSNightlyJob("x86_64", is_full_jit=True),
IOSNightlyJob("arm64", is_full_jit=True),
]
WORKFLOW_DATA = (
BUILD_CONFIGS
+ BUILD_CONFIGS_FULL_JIT
+ [
IOSNightlyJob("binary", is_full_jit=False, is_upload=True),
IOSNightlyJob("binary", is_full_jit=True, is_upload=True),
]
)
def get_workflow_jobs():
return [item.gen_tree() for item in WORKFLOW_DATA]


@ -1,36 +0,0 @@
NON_PR_BRANCH_LIST = [
"main",
"master",
r"/ci-all\/.*/",
r"/release\/.*/",
]
PR_BRANCH_LIST = [
r"/gh\/.*\/head/",
r"/pull\/.*/",
]
RC_PATTERN = r"/v[0-9]+(\.[0-9]+)*-rc[0-9]+/"
MAC_IOS_EXCLUSION_LIST = ["nightly", "postnightly"]
def gen_filter_dict(branches_list=NON_PR_BRANCH_LIST, tags_list=None):
"""Generates a filter dictionary for use with CircleCI's job filter"""
filter_dict = {
"branches": {
"only": branches_list,
},
}
if tags_list is not None:
filter_dict["tags"] = {"only": tags_list}
return filter_dict
def gen_filter_dict_exclude(branches_list=MAC_IOS_EXCLUSION_LIST):
return {
"branches": {
"ignore": branches_list,
},
}
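For reference, a minimal sketch of what these filter helpers produce (it assumes the cimodel package removed in this diff is still importable, e.g. from a checkout that predates the deletion):

from cimodel.data.simple.util.branch_filters import (
    RC_PATTERN,
    gen_filter_dict,
    gen_filter_dict_exclude,
)

# A filter that runs a job on every branch and on release-candidate tags,
# as used by the docs and docker jobs elsewhere in this diff:
print(gen_filter_dict(branches_list=r"/.*/", tags_list=RC_PATTERN))
# {'branches': {'only': '/.*/'}, 'tags': {'only': '/v[0-9]+(\\.[0-9]+)*-rc[0-9]+/'}}

# A filter that skips the nightly/postnightly branches (used by the iOS PR jobs):
print(gen_filter_dict_exclude())
# {'branches': {'ignore': ['nightly', 'postnightly']}}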


@ -1,35 +0,0 @@
AWS_DOCKER_HOST = "308535385114.dkr.ecr.us-east-1.amazonaws.com"
def gen_docker_image(container_type):
return (
"/".join([AWS_DOCKER_HOST, "pytorch", container_type]),
f"docker-{container_type}",
)
def gen_docker_image_requires(image_name):
return [f"docker-{image_name}"]
DOCKER_IMAGE_BASIC, DOCKER_REQUIREMENT_BASE = gen_docker_image(
"pytorch-linux-xenial-py3.7-gcc5.4"
)
DOCKER_IMAGE_CUDA_10_2, DOCKER_REQUIREMENT_CUDA_10_2 = gen_docker_image(
"pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7"
)
DOCKER_IMAGE_GCC7, DOCKER_REQUIREMENT_GCC7 = gen_docker_image(
"pytorch-linux-xenial-py3.7-gcc7"
)
def gen_mobile_docker(specifier):
container_type = "pytorch-linux-xenial-py3-clang5-" + specifier
return gen_docker_image(container_type)
DOCKER_IMAGE_ASAN, DOCKER_REQUIREMENT_ASAN = gen_mobile_docker("asan")
DOCKER_IMAGE_NDK, DOCKER_REQUIREMENT_NDK = gen_mobile_docker("android-ndk-r21e")
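As a quick illustration of the (image path, CircleCI requirement name) pairs these helpers return (again assuming the removed cimodel package is still importable):

from cimodel.data.simple.util.docker_constants import gen_docker_image, gen_mobile_docker

print(gen_docker_image("pytorch-linux-xenial-py3.7-gcc5.4"))
# ('308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3.7-gcc5.4',
#  'docker-pytorch-linux-xenial-py3.7-gcc5.4')

print(gen_mobile_docker("asan"))
# ('308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-asan',
#  'docker-pytorch-linux-xenial-py3-clang5-asan')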


@ -1,36 +0,0 @@
from typing import Optional
class MultiPartVersion:
def __init__(self, parts, prefix=""):
self.parts = parts
self.prefix = prefix
def prefixed_parts(self):
"""
Prepends the first element of the version list
with the prefix string.
"""
if self.parts:
return [self.prefix + str(self.parts[0])] + [
str(part) for part in self.parts[1:]
]
else:
return [self.prefix]
def render_dots_or_parts(self, sep: Optional[str] = None):
if sep is None:
return self.prefixed_parts()
else:
return [sep.join(self.prefixed_parts())]
class CudaVersion(MultiPartVersion):
def __init__(self, major, minor):
self.major = major
self.minor = minor
super().__init__([self.major, self.minor], "cuda")
def __str__(self):
return f"{self.major}.{self.minor}"


@ -1,111 +0,0 @@
from dataclasses import dataclass, field
from typing import Dict, Optional
def X(val):
"""
Compact way to write a leaf node
"""
return val, []
def XImportant(name):
"""Compact way to write an important (run on PRs) leaf node"""
return (name, [("important", [X(True)])])
@dataclass
class Ver:
"""
Represents a product with a version number
"""
name: str
version: str = ""
def __str__(self):
return self.name + self.version
@dataclass
class ConfigNode:
parent: Optional["ConfigNode"]
node_name: str
props: Dict[str, str] = field(default_factory=dict)
def get_label(self):
return self.node_name
# noinspection PyMethodMayBeStatic
def get_children(self):
return []
def get_parents(self):
return (
(self.parent.get_parents() + [self.parent.get_label()])
if self.parent
else []
)
def get_depth(self):
return len(self.get_parents())
def get_node_key(self):
return "%".join(self.get_parents() + [self.get_label()])
def find_prop(self, propname, searched=None):
"""
Checks whether this node's own dictionary has the property;
otherwise asks the parent node.
"""
if searched is None:
searched = []
searched.append(self.node_name)
if propname in self.props:
return self.props[propname]
elif self.parent:
return self.parent.find_prop(propname, searched)
else:
# raise Exception('Property "%s" does not exist anywhere in the tree! Searched: %s' % (propname, searched))
return None
def dfs_recurse(
node,
leaf_callback=lambda x: None,
discovery_callback=lambda x, y, z: None,
child_callback=lambda x, y: None,
sibling_index=0,
sibling_count=1,
):
discovery_callback(node, sibling_index, sibling_count)
node_children = node.get_children()
if node_children:
for i, child in enumerate(node_children):
child_callback(node, child)
dfs_recurse(
child,
leaf_callback,
discovery_callback,
child_callback,
i,
len(node_children),
)
else:
leaf_callback(node)
def dfs(toplevel_config_node):
config_list = []
def leaf_callback(node):
config_list.append(node)
dfs_recurse(toplevel_config_node, leaf_callback)
return config_list
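A minimal sketch of the property-lookup and traversal behavior defined above, using throwaway node names (assumes the removed cimodel package is still importable):

from cimodel.lib.conf_tree import ConfigNode, X, XImportant, dfs

print(X(True))             # (True, [])
print(XImportant("asan"))  # ('asan', [('important', [(True, [])])])

root = ConfigNode(None, "root")
root.props["distro_name"] = "xenial"
child = ConfigNode(root, "py3.7")
child.props["pyver"] = "3.7"

print(child.find_prop("pyver"))        # 3.7      (found on the node itself)
print(child.find_prop("distro_name"))  # xenial   (inherited from the parent)
print(child.find_prop("missing"))      # None     (not found anywhere in the chain)
print(child.get_node_key())            # root%py3.7

# dfs() collects the leaves of a tree; the base ConfigNode has no children,
# so here the root itself is the only leaf:
print([n.get_label() for n in dfs(root)])  # ['root']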


@ -1,10 +0,0 @@
def quote(s):
return sandwich('"', s)
def sandwich(bread, jam):
return bread + jam + bread
def override(word, substitutions):
return substitutions.get(word, word)
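These three helpers are used throughout the cimodel code above; a quick sketch of their behavior:

from cimodel.lib.miniutils import quote, sandwich, override

print(quote("1"))                           # "1"
print(sandwich("*", "bold"))                # *bold*
print(override("macos", {"macos": "mac"}))  # mac    (substitution found)
print(override("linux", {"macos": "mac"}))  # linux  (unknown words pass through)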


@ -1,51 +0,0 @@
from collections import OrderedDict
import cimodel.lib.miniutils as miniutils
LIST_MARKER = "- "
INDENTATION_WIDTH = 2
def is_dict(data):
return type(data) in [dict, OrderedDict]
def is_collection(data):
return is_dict(data) or type(data) is list
def render(fh, data, depth, is_list_member=False):
"""
PyYaml does not allow precise control over the quoting
behavior, especially for merge references.
Therefore, we use this custom YAML renderer.
"""
indentation = " " * INDENTATION_WIDTH * depth
if is_dict(data):
tuples = list(data.items())
if type(data) is not OrderedDict:
tuples.sort()
for i, (k, v) in enumerate(tuples):
if not v:
continue
# If this dict is itself a list member, the first key gets prefixed with a list marker
list_marker_prefix = LIST_MARKER if is_list_member and not i else ""
trailing_whitespace = "\n" if is_collection(v) else " "
fh.write(indentation + list_marker_prefix + k + ":" + trailing_whitespace)
render(fh, v, depth + 1 + int(is_list_member))
elif type(data) is list:
for v in data:
render(fh, v, depth, True)
else:
# use empty quotes to denote an empty string value instead of blank space
modified_data = miniutils.quote(data) if data == "" else data
list_member_prefix = indentation + LIST_MARKER if is_list_member else ""
fh.write(list_member_prefix + str(modified_data) + "\n")
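A minimal sketch of the renderer's output for the single-key list-member shape that the workflow jobs above use (assumes the removed cimodel package is still importable; the job data here is made up for illustration):

import sys
from collections import OrderedDict

from cimodel.lib import miniyaml

data = OrderedDict(
    [
        (
            "jobs",
            [
                {
                    "docker_build_job": OrderedDict(
                        [
                            ("name", "docker-pytorch-linux-xenial-py3.7-gcc5.4"),
                            ("image_name", "pytorch-linux-xenial-py3.7-gcc5.4"),
                        ]
                    )
                }
            ],
        )
    ]
)
miniyaml.render(sys.stdout, data, depth=0)
# jobs:
#   - docker_build_job:
#       name: docker-pytorch-linux-xenial-py3.7-gcc5.4
#       image_name: pytorch-linux-xenial-py3.7-gcc5.4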

.circleci/config.yml (generated, 1388 changed lines)

File diff suppressed because it is too large.


@ -1,41 +0,0 @@
#!/usr/bin/env python3
import os
import subprocess
import sys
import tempfile
import generate_config_yml
CHECKED_IN_FILE = "config.yml"
REGENERATION_SCRIPT = "regenerate.sh"
PARENT_DIR = os.path.basename(os.path.dirname(os.path.abspath(__file__)))
README_PATH = os.path.join(PARENT_DIR, "README.md")
ERROR_MESSAGE_TEMPLATE = """
The checked-in CircleCI "%s" file does not match what was generated by the scripts.
Please re-run the "%s" script in the "%s" directory and commit the result. See "%s" for more information.
"""
def check_consistency():
_, temp_filename = tempfile.mkstemp("-generated-config.yml")
with open(temp_filename, "w") as fh:
generate_config_yml.stitch_sources(fh)
try:
subprocess.check_call(["cmp", temp_filename, CHECKED_IN_FILE])
except subprocess.CalledProcessError:
sys.exit(
ERROR_MESSAGE_TEMPLATE
% (CHECKED_IN_FILE, REGENERATION_SCRIPT, PARENT_DIR, README_PATH)
)
finally:
os.remove(temp_filename)
if __name__ == "__main__":
check_consistency()


@ -1,196 +0,0 @@
#!/usr/bin/env python3
"""
This script is the source of truth for config.yml.
Please see README.md in this directory for details.
"""
import os
import shutil
import sys
from collections import namedtuple
import cimodel.data.simple.docker_definitions
import cimodel.data.simple.mobile_definitions
import cimodel.data.simple.nightly_ios
import cimodel.lib.miniutils as miniutils
import cimodel.lib.miniyaml as miniyaml
class File:
"""
Verbatim copy the contents of a file into config.yml
"""
def __init__(self, filename):
self.filename = filename
def write(self, output_filehandle):
with open(os.path.join("verbatim-sources", self.filename)) as fh:
shutil.copyfileobj(fh, output_filehandle)
class FunctionGen(namedtuple("FunctionGen", "function depth")):
__slots__ = ()
class Treegen(FunctionGen):
"""
Insert the content of a YAML tree into config.yml
"""
def write(self, output_filehandle):
miniyaml.render(output_filehandle, self.function(), self.depth)
class Listgen(FunctionGen):
"""
Insert the content of a YAML list into config.yml
"""
def write(self, output_filehandle):
miniyaml.render(output_filehandle, self.function(), self.depth)
def horizontal_rule():
return "".join("#" * 78)
class Header:
def __init__(self, title, summary=None):
self.title = title
self.summary_lines = summary or []
def write(self, output_filehandle):
text_lines = [self.title] + self.summary_lines
comment_lines = ["# " + x for x in text_lines]
lines = miniutils.sandwich([horizontal_rule()], comment_lines)
for line in filter(None, lines):
output_filehandle.write(line + "\n")
def _for_all_items(items, functor) -> None:
if isinstance(items, list):
for item in items:
_for_all_items(item, functor)
if isinstance(items, dict) and len(items) == 1:
item_type, item = next(iter(items.items()))
functor(item_type, item)
def filter_master_only_jobs(items):
def _is_main_or_master_item(item):
filters = item.get("filters", None)
branches = filters.get("branches", None) if filters is not None else None
branches_only = branches.get("only", None) if branches is not None else None
return (
("main" in branches_only or "master" in branches_only)
if branches_only is not None
else False
)
master_deps = set()
def _save_requires_if_master(item_type, item):
requires = item.get("requires", None)
item_name = item.get("name", None)
if not isinstance(requires, list):
return
if _is_main_or_master_item(item) or item_name in master_deps:
master_deps.update([n.strip('"') for n in requires])
def _do_filtering(items):
if isinstance(items, list):
rc = [_do_filtering(item) for item in items]
return [item for item in rc if len(item if item is not None else []) > 0]
assert isinstance(items, dict) and len(items) == 1
item_type, item = next(iter(items.items()))
item_name = item.get("name", None)
item_name = item_name.strip('"') if item_name is not None else None
if not _is_main_or_master_item(item) and item_name not in master_deps:
return None
if "filters" in item:
item = item.copy()
item.pop("filters")
return {item_type: item}
# Scan the dependencies twice to pick up nested required jobs,
# i.e. jobs that main-only jobs depend on transitively
_for_all_items(items, _save_requires_if_master)
_for_all_items(items, _save_requires_if_master)
return _do_filtering(items)
def generate_required_docker_images(items):
required_docker_images = set()
def _requires_docker_image(item_type, item):
requires = item.get("requires", None)
if not isinstance(requires, list):
return
for requirement in requires:
requirement = requirement.replace('"', "")
if requirement.startswith("docker-"):
required_docker_images.add(requirement)
_for_all_items(items, _requires_docker_image)
return required_docker_images
def gen_build_workflows_tree():
build_workflows_functions = [
cimodel.data.simple.mobile_definitions.get_workflow_jobs,
cimodel.data.simple.nightly_ios.get_workflow_jobs,
]
build_jobs = [f() for f in build_workflows_functions]
build_jobs.extend(
cimodel.data.simple.docker_definitions.get_workflow_jobs(
# sort for consistency
sorted(generate_required_docker_images(build_jobs))
)
)
master_build_jobs = filter_master_only_jobs(build_jobs)
rc = {
"workflows": {
"build": {
"when": r"<< pipeline.parameters.run_build >>",
"jobs": build_jobs,
},
}
}
if len(master_build_jobs) > 0:
rc["workflows"]["master_build"] = {
"when": r"<< pipeline.parameters.run_master_build >>",
"jobs": master_build_jobs,
}
return rc
# Order of this list matters to the generated config.yml.
YAML_SOURCES = [
File("header-section.yml"),
File("commands.yml"),
File("nightly-binary-build-defaults.yml"),
Header("Build parameters"),
File("build-parameters/pytorch-build-params.yml"),
File("build-parameters/binary-build-params.yml"),
Header("Job specs"),
File("job-specs/binary-job-specs.yml"),
File("job-specs/job-specs-custom.yml"),
File("job-specs/binary_update_htmls.yml"),
File("job-specs/binary-build-tests.yml"),
File("job-specs/docker_jobs.yml"),
Header("Workflows"),
Treegen(gen_build_workflows_tree, 0),
]
def stitch_sources(output_filehandle):
for f in YAML_SOURCES:
f.write(output_filehandle)
if __name__ == "__main__":
stitch_sources(sys.stdout)
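To make the master-only filtering above concrete, here is a small sketch with made-up job names (it assumes this script and the cimodel package it imports are still available, e.g. in a checkout that predates this deletion):

from generate_config_yml import filter_master_only_jobs

jobs = [
    {"pytorch_linux_build": {"name": "build_a"}},
    {
        "pytorch_linux_test": {
            "name": "test_a",
            "requires": ["build_a"],
            "filters": {"branches": {"only": ["main", "master"]}},
        }
    },
    {"pytorch_linux_build": {"name": "build_b"}},  # not master-only, not required by one
]

print(filter_master_only_jobs(jobs))
# [{'pytorch_linux_build': {'name': 'build_a'}},
#  {'pytorch_linux_test': {'name': 'test_a', 'requires': ['build_a']}}]
# build_a is kept because a master-only job requires it, test_a is kept with its
# "filters" key stripped, and build_b is dropped.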


@ -1,5 +0,0 @@
cd $PSScriptRoot;
$NewFile = New-TemporaryFile;
python generate_config_yml.py > $NewFile.name
(Get-Content $NewFile.name -Raw).TrimEnd().Replace("`r`n","`n") | Set-Content config.yml -Force
Remove-Item $NewFile.name


@ -1,17 +0,0 @@
#!/bin/bash -e
# Allows this script to be invoked from any directory:
cd "$(dirname "$0")"
UNCOMMIT_CHANGE=$(git status -s | grep " config.yml" | wc -l | xargs)
if [[ $UNCOMMIT_CHANGE != 0 ]]; then
OLD_FILE=$(mktemp)
cp config.yml "$OLD_FILE"
echo "Uncommitted change detected in .circleci/config.yml"
echo "It has been backed up to $OLD_FILE"
fi
NEW_FILE=$(mktemp)
./generate_config_yml.py > "$NEW_FILE"
cp "$NEW_FILE" config.yml
echo "New config generated in .circleci/config.yml"


@ -58,8 +58,7 @@ fi
PIP_UPLOAD_FOLDER='nightly/'
# We put this here so that OVERRIDE_PACKAGE_VERSION below can read from it
export DATE="$(date -u +%Y%m%d)"
#TODO: We should be pulling semver version from the base version.txt
BASE_BUILD_VERSION="2.2.0.dev$DATE"
BASE_BUILD_VERSION="$(cat ${PYTORCH_ROOT}/version.txt|cut -da -f1).dev${DATE}"
# Change BASE_BUILD_VERSION to git tag when on a git tag
# Use 'git -C' to make doubly sure we're in the correct directory for checking
# the git tag
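The new line derives the base version from version.txt instead of hard-coding it; `cut -da -f1` keeps everything before the first "a". A small Python illustration, using a hypothetical version.txt content of "2.2.0a0":

version_txt = "2.2.0a0"   # hypothetical contents of ${PYTORCH_ROOT}/version.txt
date = "20240111"         # what $(date -u +%Y%m%d) might produce
base_build_version = version_txt.split("a")[0] + ".dev" + date  # mirrors `cut -da -f1`
print(base_build_version)  # 2.2.0.dev20240111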


@ -1,65 +0,0 @@
binary_linux_build_params: &binary_linux_build_params
parameters:
build_environment:
type: string
default: ""
docker_image:
type: string
default: ""
libtorch_variant:
type: string
default: ""
resource_class:
type: string
default: "2xlarge+"
environment:
BUILD_ENVIRONMENT: << parameters.build_environment >>
LIBTORCH_VARIANT: << parameters.libtorch_variant >>
ANACONDA_USER: pytorch
resource_class: << parameters.resource_class >>
docker:
- image: << parameters.docker_image >>
binary_linux_test_upload_params: &binary_linux_test_upload_params
parameters:
build_environment:
type: string
default: ""
docker_image:
type: string
default: ""
libtorch_variant:
type: string
default: ""
resource_class:
type: string
default: "medium"
use_cuda_docker_runtime:
type: string
default: ""
environment:
BUILD_ENVIRONMENT: << parameters.build_environment >>
DOCKER_IMAGE: << parameters.docker_image >>
USE_CUDA_DOCKER_RUNTIME: << parameters.use_cuda_docker_runtime >>
LIBTORCH_VARIANT: << parameters.libtorch_variant >>
resource_class: << parameters.resource_class >>
binary_mac_params: &binary_mac_params
parameters:
build_environment:
type: string
default: ""
environment:
BUILD_ENVIRONMENT: << parameters.build_environment >>
binary_windows_params: &binary_windows_params
parameters:
build_environment:
type: string
default: ""
executor:
type: string
default: "windows-xlarge-cpu-with-nvidia-cuda"
environment:
BUILD_ENVIRONMENT: << parameters.build_environment >>
JOB_EXECUTOR: <<parameters.executor>>


@ -1,105 +0,0 @@
pytorch_params: &pytorch_params
parameters:
build_environment:
type: string
default: ""
docker_image:
type: string
default: ""
resource_class:
type: string
default: "large"
use_cuda_docker_runtime:
type: string
default: ""
build_only:
type: string
default: ""
ci_master:
type: string
default: ""
environment:
BUILD_ENVIRONMENT: << parameters.build_environment >>
DOCKER_IMAGE: << parameters.docker_image >>
USE_CUDA_DOCKER_RUNTIME: << parameters.use_cuda_docker_runtime >>
BUILD_ONLY: << parameters.build_only >>
CI_MASTER: << pipeline.parameters.run_master_build >>
resource_class: << parameters.resource_class >>
pytorch_ios_params: &pytorch_ios_params
parameters:
build_environment:
type: string
default: ""
ios_arch:
type: string
default: ""
ios_platform:
type: string
default: ""
op_list:
type: string
default: ""
use_metal:
type: string
default: "0"
lite_interpreter:
type: string
default: "1"
use_coreml:
type: string
default: "0"
environment:
BUILD_ENVIRONMENT: << parameters.build_environment >>
IOS_ARCH: << parameters.ios_arch >>
IOS_PLATFORM: << parameters.ios_platform >>
SELECTED_OP_LIST: << parameters.op_list >>
USE_PYTORCH_METAL: << parameters.use_metal >>
BUILD_LITE_INTERPRETER: << parameters.lite_interpreter >>
USE_COREML_DELEGATE: << parameters.use_coreml >>
pytorch_windows_params: &pytorch_windows_params
parameters:
executor:
type: string
default: "windows-xlarge-cpu-with-nvidia-cuda"
build_environment:
type: string
default: ""
test_name:
type: string
default: ""
cuda_version:
type: string
default: "10.1"
python_version:
type: string
default: "3.8"
vs_version:
type: string
default: "16.8.6"
vc_version:
type: string
default: "14.16"
vc_year:
type: string
default: "2019"
vc_product:
type: string
default: "BuildTools"
use_cuda:
type: string
default: ""
environment:
BUILD_ENVIRONMENT: <<parameters.build_environment>>
SCCACHE_BUCKET: "ossci-compiler-cache"
CUDA_VERSION: <<parameters.cuda_version>>
PYTHON_VERSION: <<parameters.python_version>>
VS_VERSION: <<parameters.vs_version>>
VC_VERSION: <<parameters.vc_version>>
VC_YEAR: <<parameters.vc_year>>
VC_PRODUCT: <<parameters.vc_product>>
USE_CUDA: <<parameters.use_cuda>>
TORCH_CUDA_ARCH_LIST: "5.2 7.5"
JOB_BASE_NAME: <<parameters.test_name>>
JOB_EXECUTOR: <<parameters.executor>>


@ -1,134 +0,0 @@
commands:
calculate_docker_image_tag:
description: "Calculates the docker image tag"
steps:
- run:
name: "Calculate docker image hash"
command: |
DOCKER_TAG=$(git rev-parse HEAD:.ci/docker)
echo "DOCKER_TAG=${DOCKER_TAG}" >> "${BASH_ENV}"
designate_upload_channel:
description: "inserts the correct upload channel into ${BASH_ENV}"
steps:
- run:
name: adding UPLOAD_CHANNEL to BASH_ENV
command: |
our_upload_channel=nightly
# On tags upload to test instead
if [[ -n "${CIRCLE_TAG}" ]]; then
our_upload_channel=test
fi
echo "export UPLOAD_CHANNEL=${our_upload_channel}" >> ${BASH_ENV}
# This system setup script is meant to run before the CI-related scripts, e.g.,
# installing Git client, checking out code, setting up CI env, and
# building/testing.
setup_linux_system_environment:
steps:
- run:
name: Set Up System Environment
no_output_timeout: "1h"
command: .circleci/scripts/setup_linux_system_environment.sh
setup_ci_environment:
steps:
- run:
name: Set Up CI Environment After attach_workspace
no_output_timeout: "1h"
command: .circleci/scripts/setup_ci_environment.sh
brew_update:
description: "Update Homebrew and install base formulae"
steps:
- run:
name: Update Homebrew
no_output_timeout: "10m"
command: |
set -ex
# Update repositories manually.
# Running `brew update` produces a comparison between the
# current checkout and the updated checkout, which takes a
# very long time because the existing checkout is 2y old.
for path in $(find /usr/local/Homebrew -type d -name .git)
do
cd $path/..
git fetch --depth=1 origin
git reset --hard origin/master
done
export HOMEBREW_NO_AUTO_UPDATE=1
# Install expect and moreutils so that we can call `unbuffer` and `ts`.
# moreutils installs a `parallel` executable by default, which conflicts
# with the executable from the GNU `parallel`, so we must unlink GNU
# `parallel` first, and relink it afterwards.
brew unlink parallel
brew install moreutils
brew link parallel --overwrite
brew install expect
brew_install:
description: "Install Homebrew formulae"
parameters:
formulae:
type: string
default: ""
steps:
- run:
name: Install << parameters.formulae >>
no_output_timeout: "10m"
command: |
set -ex
export HOMEBREW_NO_AUTO_UPDATE=1
brew install << parameters.formulae >>
run_brew_for_macos_build:
steps:
- brew_update
- brew_install:
formulae: libomp
run_brew_for_ios_build:
steps:
- brew_update
- brew_install:
formulae: libtool
optional_merge_target_branch:
steps:
- run:
name: (Optional) Merge target branch
no_output_timeout: "10m"
command: |
if [[ -n "$CIRCLE_PULL_REQUEST" && "$CIRCLE_BRANCH" != "nightly" ]]; then
PR_NUM=$(basename $CIRCLE_PULL_REQUEST)
CIRCLE_PR_BASE_BRANCH=$(curl -s https://api.github.com/repos/$CIRCLE_PROJECT_USERNAME/$CIRCLE_PROJECT_REPONAME/pulls/$PR_NUM | jq -r '.base.ref')
if [[ "${BUILD_ENVIRONMENT}" == *"xla"* || "${BUILD_ENVIRONMENT}" == *"gcc5"* ]] ; then
set -x
git config --global user.email "circleci.ossci@gmail.com"
git config --global user.name "CircleCI"
git config remote.origin.url https://github.com/pytorch/pytorch.git
git config --add remote.origin.fetch +refs/heads/master:refs/remotes/origin/master
git fetch --tags --progress https://github.com/pytorch/pytorch.git +refs/heads/master:refs/remotes/origin/master --depth=100 --quiet
# PRs generated from ghstack have the format CIRCLE_PR_BASE_BRANCH=gh/xxx/1234/base
if [[ "${CIRCLE_PR_BASE_BRANCH}" == "gh/"* ]]; then
CIRCLE_PR_BASE_BRANCH=master
fi
export GIT_MERGE_TARGET=`git log -n 1 --pretty=format:"%H" origin/$CIRCLE_PR_BASE_BRANCH`
echo "GIT_MERGE_TARGET: " ${GIT_MERGE_TARGET}
export GIT_COMMIT=${CIRCLE_SHA1}
echo "GIT_COMMIT: " ${GIT_COMMIT}
git checkout -f ${GIT_COMMIT}
git reset --hard ${GIT_COMMIT}
git merge --allow-unrelated-histories --no-edit --no-ff ${GIT_MERGE_TARGET}
echo "Merged $CIRCLE_PR_BASE_BRANCH branch before building in environment $BUILD_ENVIRONMENT"
set +x
else
echo "No need to merge with $CIRCLE_PR_BASE_BRANCH, skipping..."
fi
else
echo "This is not a pull request, skipping..."
fi


@ -1,41 +0,0 @@
# WARNING: DO NOT EDIT THIS FILE DIRECTLY!!!
# See the README.md in this directory.
# IMPORTANT: To update Docker image version, please follow
# the instructions at
# https://github.com/pytorch/pytorch/wiki/Docker-image-build-on-CircleCI
version: 2.1
parameters:
run_binary_tests:
type: boolean
default: false
run_build:
type: boolean
default: true
run_master_build:
type: boolean
default: false
run_slow_gradcheck_build:
type: boolean
default: false
executors:
windows-with-nvidia-gpu:
machine:
resource_class: windows.gpu.nvidia.medium
image: windows-server-2019-nvidia:previous
shell: bash.exe
windows-xlarge-cpu-with-nvidia-cuda:
machine:
resource_class: windows.xlarge
image: windows-server-2019-vs2019:stable
shell: bash.exe
windows-medium-cpu-with-nvidia-cuda:
machine:
resource_class: windows.medium
image: windows-server-2019-vs2019:stable
shell: bash.exe


@ -1,14 +0,0 @@
# There is currently no testing for libtorch TODO
# binary_linux_libtorch_3.6m_cpu_test:
# environment:
# BUILD_ENVIRONMENT: "libtorch 3.6m cpu"
# resource_class: gpu.nvidia.small
# <<: *binary_linux_test
#
# binary_linux_libtorch_3.6m_cu90_test:
# environment:
# BUILD_ENVIRONMENT: "libtorch 3.6m cu90"
# resource_class: gpu.nvidia.small
# <<: *binary_linux_test
#


@ -1,44 +0,0 @@
jobs:
binary_ios_build:
<<: *pytorch_ios_params
macos:
xcode: "12.5.1"
steps:
- attach_workspace:
at: ~/workspace
- checkout
- run_brew_for_ios_build
- run:
name: Build
no_output_timeout: "1h"
command: |
script="/Users/distiller/project/.circleci/scripts/binary_ios_build.sh"
cat "$script"
source "$script"
- run:
name: Test
no_output_timeout: "30m"
command: |
script="/Users/distiller/project/.circleci/scripts/binary_ios_test.sh"
cat "$script"
source "$script"
- persist_to_workspace:
root: /Users/distiller/workspace/
paths: ios
binary_ios_upload:
<<: *pytorch_ios_params
macos:
xcode: "12.5.1"
steps:
- attach_workspace:
at: ~/workspace
- checkout
- run_brew_for_ios_build
- run:
name: Upload
no_output_timeout: "1h"
command: |
script="/Users/distiller/project/.circleci/scripts/binary_ios_upload.sh"
cat "$script"
source "$script"


@ -1,53 +0,0 @@
# update_s3_htmls job
# These jobs create an html file for every cpu/cu## folder in s3. Each html
# file just lists the names of all the files (.whl binaries) in that folder.
# This allows pip to install the latest version in a folder without having to
# know the latest date: pip's -f flag accepts an html file listing a set of
# packages, and pip then installs the one with the most recent version.
update_s3_htmls: &update_s3_htmls
machine:
image: ubuntu-2004:202104-01
resource_class: medium
steps:
- checkout
- setup_linux_system_environment
- run:
<<: *binary_checkout
# N.B. we do not run binary_populate_env. The only variable we need is
# PIP_UPLOAD_FOLDER (which is 'nightly/' for the nightlies and '' for
# releases, and sometimes other things for special cases). Instead we
# expect PIP_UPLOAD_FOLDER to be passed directly in the env. This is
# because, unlike all the other binary jobs, these jobs only get run once,
# in a separate workflow. They are not a step in other binary jobs like
# build, test, upload.
#
# You could attach this to every job, or include it in the upload step if
# you wanted. You would need to add binary_populate_env in this case to
# make sure it has the same upload folder as the job it's attached to. This
# function is idempotent, so it won't hurt anything; it's just a little
# unnecessary.
- run:
name: define PIP_UPLOAD_FOLDER
command: |
our_upload_folder=nightly/
# On tags upload to test instead
if [[ -n "${CIRCLE_TAG}" ]]; then
our_upload_folder=test/
fi
echo "export PIP_UPLOAD_FOLDER=${our_upload_folder}" >> ${BASH_ENV}
- run:
name: Update s3 htmls
no_output_timeout: "1h"
command: |
set +x
echo "declare -x \"AWS_ACCESS_KEY_ID=${PYTORCH_BINARY_AWS_ACCESS_KEY_ID}\"" >> /home/circleci/project/env
echo "declare -x \"AWS_SECRET_ACCESS_KEY=${PYTORCH_BINARY_AWS_SECRET_ACCESS_KEY}\"" >> /home/circleci/project/env
source /home/circleci/project/env
set -eux -o pipefail
retry () {
$* || (sleep 1 && $*) || (sleep 2 && $*) || (sleep 4 && $*) || (sleep 8 && $*)
}
retry pip install awscli==1.6
"/home/circleci/project/builder/cron/update_s3_htmls.sh"


@ -1,56 +0,0 @@
docker_build_job:
parameters:
image_name:
type: string
default: ""
machine:
image: ubuntu-2004:202104-01
resource_class: large
environment:
IMAGE_NAME: << parameters.image_name >>
# Enable 'docker manifest'
DOCKER_CLI_EXPERIMENTAL: "enabled"
DOCKER_BUILDKIT: 1
steps:
- checkout
- calculate_docker_image_tag
- run:
name: Check if image should be built
command: |
set +x
export AWS_ACCESS_KEY_ID=${CIRCLECI_AWS_ACCESS_KEY_FOR_DOCKER_BUILDER_V1}
export AWS_SECRET_ACCESS_KEY=${CIRCLECI_AWS_SECRET_KEY_FOR_DOCKER_BUILDER_V1}
export AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\")
export AWS_REGION=us-east-1
aws ecr get-login-password --region $AWS_REGION|docker login --username AWS \
--password-stdin $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com
set -x
# Check if the image already exists; if it does, skip building it
if docker manifest inspect "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/${IMAGE_NAME}:${DOCKER_TAG}"; then
circleci-agent step halt
# circleci-agent step halt doesn't actually halt the step so we need to
# explicitly exit the step here ourselves before it causes too much trouble
exit 0
fi
# Covers the case where a previous tag doesn't exist for the tree;
# this is only really applicable to trees that don't have `.ci/docker` at their merge base, i.e. nightly
if ! git rev-parse "$(git merge-base HEAD << pipeline.git.base_revision >>):.ci/docker"; then
echo "Directory '.ci/docker' not found in tree << pipeline.git.base_revision >>, you should probably rebase onto a more recent commit"
exit 1
fi
PREVIOUS_DOCKER_TAG=$(git rev-parse "$(git merge-base HEAD << pipeline.git.base_revision >>):ci/docker")
# If no image exists but the hash is the same as the previous hash then we should error out here
if [[ "${PREVIOUS_DOCKER_TAG}" = "${DOCKER_TAG}" ]]; then
echo "ERROR: Something has gone wrong and the previous image isn't available for the merge-base of your branch"
echo " contact the PyTorch team to restore the original images"
exit 1
fi
- run:
name: build_docker_image_<< parameters.image_name >>
no_output_timeout: "1h"
command: |
set +x
export AWS_ACCESS_KEY_ID=${CIRCLECI_AWS_ACCESS_KEY_FOR_DOCKER_BUILDER_V1}
export AWS_SECRET_ACCESS_KEY=${CIRCLECI_AWS_SECRET_KEY_FOR_DOCKER_BUILDER_V1}
set -x
cd .ci/docker && ./build_docker.sh


@ -1,747 +0,0 @@
pytorch_doc_push:
resource_class: medium
machine:
image: ubuntu-2004:202104-01
parameters:
branch:
type: string
default: "main"
steps:
- attach_workspace:
at: /tmp/workspace
- run:
name: Generate netrc
command: |
# set credentials for https pushing
cat > ~/.netrc \<<DONE
machine github.com
login pytorchbot
password ${GITHUB_PYTORCHBOT_TOKEN}
DONE
- run:
name: Docs push
command: |
pushd /tmp/workspace
git push -u origin "<< parameters.branch >>"
pytorch_macos_10_15_py3_build:
environment:
BUILD_ENVIRONMENT: pytorch-macos-10.15-py3-arm64-build
macos:
xcode: "12.3.0"
steps:
- checkout
- run_brew_for_macos_build
- run:
name: Build
no_output_timeout: "1h"
command: |
set -e
export CROSS_COMPILE_ARM64=1
export JOB_BASE_NAME=$CIRCLE_JOB
# Install sccache
sudo curl --retry 3 https://s3.amazonaws.com/ossci-macos/sccache_v2.15 --output /usr/local/bin/sccache
sudo chmod +x /usr/local/bin/sccache
export SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2
# This IAM user allows write access to S3 bucket for sccache
set +x
export AWS_ACCESS_KEY_ID=${CIRCLECI_AWS_ACCESS_KEY_FOR_SCCACHE_S3_BUCKET_V4}
export AWS_SECRET_ACCESS_KEY=${CIRCLECI_AWS_SECRET_KEY_FOR_SCCACHE_S3_BUCKET_V4}
set -x
chmod a+x .ci/pytorch/macos-build.sh
unbuffer .ci/pytorch/macos-build.sh 2>&1 | ts
- persist_to_workspace:
root: /Users/distiller/workspace/
paths:
- miniconda3
- store_artifacts:
path: /Users/distiller/project/dist
pytorch_macos_10_13_py3_build:
environment:
BUILD_ENVIRONMENT: pytorch-macos-10.13-py3-build
macos:
xcode: "12.0"
steps:
- checkout
- run_brew_for_macos_build
- run:
name: Build
no_output_timeout: "1h"
command: |
set -e
export JOB_BASE_NAME=$CIRCLE_JOB
# Install sccache
sudo curl --retry 3 https://s3.amazonaws.com/ossci-macos/sccache_v2.15 --output /usr/local/bin/sccache
sudo chmod +x /usr/local/bin/sccache
export SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2
# This IAM user allows write access to S3 bucket for sccache
set +x
export AWS_ACCESS_KEY_ID=${CIRCLECI_AWS_ACCESS_KEY_FOR_SCCACHE_S3_BUCKET_V4}
export AWS_SECRET_ACCESS_KEY=${CIRCLECI_AWS_SECRET_KEY_FOR_SCCACHE_S3_BUCKET_V4}
set -x
chmod a+x .ci/pytorch/macos-build.sh
unbuffer .ci/pytorch/macos-build.sh 2>&1 | ts
- persist_to_workspace:
root: /Users/distiller/workspace/
paths:
- miniconda3
mac_build:
parameters:
build-environment:
type: string
description: Top-level label for what's being built/tested.
xcode-version:
type: string
default: "13.3.1"
description: What xcode version to build with.
build-generates-artifacts:
type: boolean
default: true
description: if the build generates build artifacts
python-version:
type: string
default: "3.8"
macos:
xcode: << parameters.xcode-version >>
resource_class: medium
environment:
BUILD_ENVIRONMENT: << parameters.build-environment >>
AWS_REGION: us-east-1
steps:
- checkout
- run_brew_for_macos_build
- run:
name: Install sccache
command: |
sudo curl --retry 3 https://s3.amazonaws.com/ossci-macos/sccache_v2.15 --output /usr/local/bin/sccache
sudo chmod +x /usr/local/bin/sccache
echo "export SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2" >> "${BASH_ENV}"
echo "export SCCACHE_S3_KEY_PREFIX=${GITHUB_WORKFLOW}" >> "${BASH_ENV}"
set +x
echo "export AWS_ACCESS_KEY_ID=${CIRCLECI_AWS_ACCESS_KEY_FOR_SCCACHE_S3_BUCKET_V4}" >> "${BASH_ENV}"
echo "export AWS_SECRET_ACCESS_KEY=${CIRCLECI_AWS_SECRET_KEY_FOR_SCCACHE_S3_BUCKET_V4}" >> "${BASH_ENV}"
set -x
- run:
name: Get workflow job id
command: |
echo "export OUR_GITHUB_JOB_ID=${CIRCLE_WORKFLOW_JOB_ID}" >> "${BASH_ENV}"
- run:
name: Build
command: |
set -x
git submodule sync
git submodule update --init --recursive --depth 1 --jobs 0
export PATH="/usr/local/bin:$PATH"
export WORKSPACE_DIR="${HOME}/workspace"
mkdir -p "${WORKSPACE_DIR}"
MINICONDA_URL="https://repo.anaconda.com/miniconda/Miniconda3-py38_4.12.0-MacOSX-x86_64.sh"
if [ << parameters.python-version >> == 3.9.12 ]; then
MINICONDA_URL="https://repo.anaconda.com/miniconda/Miniconda3-py39_4.12.0-MacOSX-x86_64.sh"
fi
# If a local installation of conda doesn't exist, we download and install conda
if [ ! -d "${WORKSPACE_DIR}/miniconda3" ]; then
mkdir -p "${WORKSPACE_DIR}"
curl --retry 3 ${MINICONDA_URL} -o "${WORKSPACE_DIR}"/miniconda3.sh
bash "${WORKSPACE_DIR}"/miniconda3.sh -b -p "${WORKSPACE_DIR}"/miniconda3
fi
export PATH="${WORKSPACE_DIR}/miniconda3/bin:$PATH"
# shellcheck disable=SC1091
source "${WORKSPACE_DIR}"/miniconda3/bin/activate
brew link --force libomp
echo "export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname "$(which conda)")/../"}" >> "${BASH_ENV}"
.ci/pytorch/macos-build.sh
- when:
condition: << parameters.build-generates-artifacts >>
steps:
- run:
name: Archive artifacts into zip
command: |
zip -1 -r artifacts.zip dist/ build/.ninja_log build/compile_commands.json .additional_ci_files
cp artifacts.zip /Users/distiller/workspace
- persist_to_workspace:
root: /Users/distiller/workspace/
paths:
- miniconda3
- artifacts.zip
- store_artifacts:
path: /Users/distiller/project/artifacts.zip
mac_test:
parameters:
build-environment:
type: string
shard-number:
type: string
num-test-shards:
type: string
xcode-version:
type: string
test-config:
type: string
default: 'default'
macos:
xcode: << parameters.xcode-version >>
environment:
GIT_DEFAULT_BRANCH: 'master'
BUILD_ENVIRONMENT: << parameters.build-environment >>
TEST_CONFIG: << parameters.test-config >>
SHARD_NUMBER: << parameters.shard-number >>
NUM_TEST_SHARDS: << parameters.num-test-shards >>
PYTORCH_RETRY_TEST_CASES: 1
PYTORCH_OVERRIDE_FLAKY_SIGNAL: 1
steps:
- checkout
- attach_workspace:
at: ~/workspace
- run_brew_for_macos_build
- run:
name: Test
no_output_timeout: "2h"
command: |
set -x
git submodule sync --recursive
git submodule update --init --recursive
mv ~/workspace/artifacts.zip .
unzip artifacts.zip
export IN_CI=1
COMMIT_MESSAGES=$(git cherry -v "origin/${GIT_DEFAULT_BRANCH:-master}")
export PATH="/usr/local/bin:$PATH"
export WORKSPACE_DIR="${HOME}/workspace"
mkdir -p "${WORKSPACE_DIR}"
export PATH="${WORKSPACE_DIR}/miniconda3/bin:$PATH"
source "${WORKSPACE_DIR}"/miniconda3/bin/activate
# sanitize the input commit message and PR body here:
# trim all new lines from commit messages to avoid issues with batch environment
# variable copying. see https://github.com/pytorch/pytorch/pull/80043#issuecomment-1167796028
COMMIT_MESSAGES="${COMMIT_MESSAGES//[$'\n\r']}"
# then trim all special characters like single and double quotes to avoid unescaped inputs to
# wreak havoc internally
export COMMIT_MESSAGES="${COMMIT_MESSAGES//[\'\"]}"
python3 -mpip install dist/*.whl
.ci/pytorch/macos-test.sh
- run:
name: Copy files for uploading test stats
command: |
# copy into a parent folder test-reports because we can't use CIRCLE_BUILD_NUM in the path when persisting to workspace
mkdir -p test-reports/test-reports_${CIRCLE_BUILD_NUM}/test/test-reports
cp -r test/test-reports test-reports/test-reports_${CIRCLE_BUILD_NUM}/test/test-reports
- store_test_results:
path: test/test-reports
- persist_to_workspace:
root: /Users/distiller/project/
paths:
- test-reports
upload_test_stats:
machine: # executor type
image: ubuntu-2004:202010-01 # # recommended linux image - includes Ubuntu 20.04, docker 19.03.13, docker-compose 1.27.4
steps:
- checkout
- attach_workspace:
at: ~/workspace
- run:
name: upload
command: |
set -ex
if [ -z ${AWS_ACCESS_KEY_FOR_OSSCI_ARTIFACT_UPLOAD} ]; then
echo "No credentials found, cannot upload test stats (are you on a fork?)"
exit 0
fi
cp -r ~/workspace/test-reports/* ~/project
pip3 install requests==2.26 rockset==1.0.3 boto3==1.19.12
export AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_FOR_OSSCI_ARTIFACT_UPLOAD}
export AWS_SECRET_ACCESS_KEY=${AWS_SECRET_KEY_FOR_OSSCI_ARTIFACT_UPLOAD}
# I don't know how to get the run attempt number for reruns, so default to 1
python3 -m tools.stats.upload_test_stats --workflow-run-id "${CIRCLE_WORKFLOW_JOB_ID}" --workflow-run-attempt 1 --head-branch << pipeline.git.branch >> --circleci
pytorch_macos_10_13_py3_test:
environment:
BUILD_ENVIRONMENT: pytorch-macos-10.13-py3-test
macos:
xcode: "12.0"
steps:
- checkout
- attach_workspace:
at: ~/workspace
- run_brew_for_macos_build
- run:
name: Test
no_output_timeout: "1h"
command: |
set -e
export JOB_BASE_NAME=$CIRCLE_JOB
chmod a+x .ci/pytorch/macos-test.sh
unbuffer .ci/pytorch/macos-test.sh 2>&1 | ts
- store_test_results:
path: test/test-reports
pytorch_macos_10_13_py3_lite_interpreter_build_test:
environment:
BUILD_ENVIRONMENT: pytorch-macos-10.13-py3-test
macos:
xcode: "12.0"
steps:
- checkout
- attach_workspace:
at: ~/workspace
- run_brew_for_macos_build
- run:
name: Test
no_output_timeout: "1h"
command: |
set -e
export BUILD_LITE_INTERPRETER=1
export JOB_BASE_NAME=$CIRCLE_JOB
chmod a+x ${HOME}/project/.ci/pytorch/macos-lite-interpreter-build-test.sh
unbuffer ${HOME}/project/.ci/pytorch/macos-lite-interpreter-build-test.sh 2>&1 | ts
- store_test_results:
path: test/test-reports
pytorch_android_gradle_build:
environment:
BUILD_ENVIRONMENT: pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-build
DOCKER_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c"
PYTHON_VERSION: "3.7"
resource_class: large
machine:
image: ubuntu-2004:202104-01
steps:
- checkout
- calculate_docker_image_tag
- setup_linux_system_environment
- setup_ci_environment
- run:
name: pytorch android gradle build
no_output_timeout: "1h"
command: |
set -eux
docker_image_commit=${DOCKER_IMAGE}:build-${DOCKER_TAG}-${CIRCLE_SHA1}
docker_image_libtorch_android_x86_32=${docker_image_commit}-android-x86_32
docker_image_libtorch_android_x86_64=${docker_image_commit}-android-x86_64
docker_image_libtorch_android_arm_v7a=${docker_image_commit}-android-arm-v7a
docker_image_libtorch_android_arm_v8a=${docker_image_commit}-android-arm-v8a
echo "docker_image_commit: "${docker_image_commit}
echo "docker_image_libtorch_android_x86_32: "${docker_image_libtorch_android_x86_32}
echo "docker_image_libtorch_android_x86_64: "${docker_image_libtorch_android_x86_64}
echo "docker_image_libtorch_android_arm_v7a: "${docker_image_libtorch_android_arm_v7a}
echo "docker_image_libtorch_android_arm_v8a: "${docker_image_libtorch_android_arm_v8a}
# x86_32
time docker pull ${docker_image_libtorch_android_x86_32} >/dev/null
export id_x86_32=$(docker run --env-file "${BASH_ENV}" -e GRADLE_OFFLINE=1 --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${docker_image_libtorch_android_x86_32})
export COMMAND='((echo "sudo chown -R jenkins workspace") | docker exec -u jenkins -i "$id_x86_32" bash) 2>&1'
echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts
# arm-v7a
time docker pull ${docker_image_libtorch_android_arm_v7a} >/dev/null
export id_arm_v7a=$(docker run --env-file "${BASH_ENV}" --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${docker_image_libtorch_android_arm_v7a})
export COMMAND='((echo "sudo chown -R jenkins workspace") | docker exec -u jenkins -i "$id_arm_v7a" bash) 2>&1'
echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts
mkdir -p ~/workspace/build_android_install_arm_v7a
docker cp $id_arm_v7a:/var/lib/jenkins/workspace/build_android/install ~/workspace/build_android_install_arm_v7a
# x86_64
time docker pull ${docker_image_libtorch_android_x86_64} >/dev/null
export id_x86_64=$(docker run --env-file "${BASH_ENV}" --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${docker_image_libtorch_android_x86_64})
export COMMAND='((echo "sudo chown -R jenkins workspace") | docker exec -u jenkins -i "$id_x86_64" bash) 2>&1'
echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts
mkdir -p ~/workspace/build_android_install_x86_64
docker cp $id_x86_64:/var/lib/jenkins/workspace/build_android/install ~/workspace/build_android_install_x86_64
# arm-v8a
time docker pull ${docker_image_libtorch_android_arm_v8a} >/dev/null
export id_arm_v8a=$(docker run --env-file "${BASH_ENV}" --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${docker_image_libtorch_android_arm_v8a})
export COMMAND='((echo "sudo chown -R jenkins workspace") | docker exec -u jenkins -i "$id_arm_v8a" bash) 2>&1'
echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts
mkdir -p ~/workspace/build_android_install_arm_v8a
docker cp $id_arm_v8a:/var/lib/jenkins/workspace/build_android/install ~/workspace/build_android_install_arm_v8a
docker cp ~/workspace/build_android_install_arm_v7a $id_x86_32:/var/lib/jenkins/workspace/build_android_install_arm_v7a
docker cp ~/workspace/build_android_install_x86_64 $id_x86_32:/var/lib/jenkins/workspace/build_android_install_x86_64
docker cp ~/workspace/build_android_install_arm_v8a $id_x86_32:/var/lib/jenkins/workspace/build_android_install_arm_v8a
# run gradle buildRelease
export COMMAND='((echo "sudo chown -R jenkins workspace && cd workspace && ./.circleci/scripts/build_android_gradle.sh") | docker exec -u jenkins -i "$id_x86_32" bash) 2>&1'
echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts
mkdir -p ~/workspace/build_android_artifacts
docker cp $id_x86_32:/var/lib/jenkins/workspace/android/artifacts.tgz ~/workspace/build_android_artifacts/
output_image=$docker_image_libtorch_android_x86_32-gradle
docker commit "$id_x86_32" ${output_image}
time docker push ${output_image}
- store_artifacts:
path: ~/workspace/build_android_artifacts/artifacts.tgz
destination: artifacts.tgz
pytorch_android_publish_snapshot:
environment:
BUILD_ENVIRONMENT: pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-publish-snapshot
DOCKER_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c"
PYTHON_VERSION: "3.7"
resource_class: large
machine:
image: ubuntu-2004:202104-01
steps:
- checkout
- calculate_docker_image_tag
- setup_linux_system_environment
- setup_ci_environment
- run:
name: pytorch android gradle build
no_output_timeout: "1h"
command: |
set -eux
docker_image_commit=${DOCKER_IMAGE}:build-${DOCKER_TAG}-${CIRCLE_SHA1}
docker_image_libtorch_android_x86_32_gradle=${docker_image_commit}-android-x86_32-gradle
echo "docker_image_commit: "${docker_image_commit}
echo "docker_image_libtorch_android_x86_32_gradle: "${docker_image_libtorch_android_x86_32_gradle}
# x86_32
time docker pull ${docker_image_libtorch_android_x86_32_gradle} >/dev/null
export id_x86_32=$(docker run --env-file "${BASH_ENV}" --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${docker_image_libtorch_android_x86_32_gradle})
export COMMAND='((echo "sudo chown -R jenkins workspace" && echo "export BUILD_ENVIRONMENT=${BUILD_ENVIRONMENT}" && echo "export SONATYPE_NEXUS_USERNAME=${SONATYPE_NEXUS_USERNAME}" && echo "export SONATYPE_NEXUS_PASSWORD=${SONATYPE_NEXUS_PASSWORD}" && echo "export ANDROID_SIGN_KEY=${ANDROID_SIGN_KEY}" && echo "export ANDROID_SIGN_PASS=${ANDROID_SIGN_PASS}" && echo "sudo chown -R jenkins workspace && cd workspace && ./.circleci/scripts/publish_android_snapshot.sh") | docker exec -u jenkins -i "$id_x86_32" bash) 2>&1'
echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts
output_image=${docker_image_libtorch_android_x86_32_gradle}-publish-snapshot
docker commit "$id_x86_32" ${output_image}
time docker push ${output_image}
pytorch_android_gradle_build-x86_32:
environment:
BUILD_ENVIRONMENT: pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-build-only-x86_32
DOCKER_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c"
PYTHON_VERSION: "3.7"
resource_class: large
machine:
image: ubuntu-2004:202104-01
steps:
- checkout
- calculate_docker_image_tag
- setup_linux_system_environment
- checkout
- setup_ci_environment
- run:
name: pytorch android gradle build only x86_32 (for PR)
no_output_timeout: "1h"
command: |
set -e
docker_image_libtorch_android_x86_32=${DOCKER_IMAGE}:build-${DOCKER_TAG}-${CIRCLE_SHA1}-android-x86_32
echo "docker_image_libtorch_android_x86_32: "${docker_image_libtorch_android_x86_32}
# x86
time docker pull ${docker_image_libtorch_android_x86_32} >/dev/null
export id=$(docker run --env-file "${BASH_ENV}" --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${docker_image_libtorch_android_x86_32})
export COMMAND='((echo "export BUILD_ENVIRONMENT=${BUILD_ENVIRONMENT}" && echo "export GRADLE_OFFLINE=1" && echo "sudo chown -R jenkins workspace && cd workspace && ./.circleci/scripts/build_android_gradle.sh") | docker exec -u jenkins -i "$id" bash) 2>&1'
echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts
mkdir -p ~/workspace/build_android_x86_32_artifacts
docker cp $id:/var/lib/jenkins/workspace/android/artifacts.tgz ~/workspace/build_android_x86_32_artifacts/
output_image=${docker_image_libtorch_android_x86_32}-gradle
docker commit "$id" ${output_image}
time docker push ${output_image}
- store_artifacts:
path: ~/workspace/build_android_x86_32_artifacts/artifacts.tgz
destination: artifacts.tgz
pytorch_ios_build:
<<: *pytorch_ios_params
macos:
xcode: "12.5.1"
steps:
- run:
name: checkout with retry
command: |
checkout() {
set -ex
# Workaround old docker images with incorrect $HOME
# check https://github.com/docker/docker/issues/2968 for details
if [ "${HOME}" = "/" ]
then
export HOME=$(getent passwd $(id -un) | cut -d: -f6)
fi
mkdir -p ~/.ssh
echo 'github.com ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAq2A7hRGmdnm9tUDbO9IDSwBK6TbQa+PXYPCPy6rbTrTtw7PHkccKrpp0yVhp5HdEIcKr6pLlVDBfOLX9QUsyCOV0wzfjIJNlGEYsdlLJizHhbn2mUjvSAHQqZETYP81eFzLQNnPHt4EVVUh7VfDESU84KezmD5QlWpXLmvU31/yMf+Se8xhHTvKSCZIFImWwoG6mbUoWf9nzpIoaSjB+weqqUUmpaaasXVal72J+UX2B+2RPW3RcT0eOzQgqlJL3RKrTJvdsjE3JEAvGq3lGHSZXy28G3skua2SmVi/w4yCE6gbODqnTWlg7+wC604ydGXA8VJiS5ap43JXiUFFAaQ==
' >> ~/.ssh/known_hosts
# use git+ssh instead of https
git config --global url."ssh://git@github.com".insteadOf "https://github.com" || true
git config --global gc.auto 0 || true
echo 'Cloning git repository'
mkdir -p '/Users/distiller/project'
cd '/Users/distiller/project'
git clone "$CIRCLE_REPOSITORY_URL" .
echo 'Checking out branch'
git checkout --force -B "$CIRCLE_BRANCH" "$CIRCLE_SHA1"
git --no-pager log --no-color -n 1 --format='HEAD is now at %h %s'
}
retry () {
$* || (sleep 1 && $*) || (sleep 2 && $*) || (sleep 4 && $*) || (sleep 8 && $*)
}
retry checkout
- run_brew_for_ios_build
- run:
name: Setup Fastlane
no_output_timeout: "1h"
command: |
set -e
PROJ_ROOT=/Users/distiller/project
cd ${PROJ_ROOT}/ios/TestApp
# install fastlane
sudo gem install bundler && bundle install
- run:
name: Build
no_output_timeout: "1h"
command: |
set -e
WORKSPACE=/Users/distiller/workspace
PROJ_ROOT=/Users/distiller/project
export TCLLIBPATH="/usr/local/lib"
# Install conda
curl --retry 3 -o ~/conda.sh https://repo.anaconda.com/miniconda/Miniconda3-py39_4.12.0-MacOSX-x86_64.sh
chmod +x ~/conda.sh
/bin/bash ~/conda.sh -b -p ~/anaconda
export PATH="~/anaconda/bin:${PATH}"
source ~/anaconda/bin/activate
# Install dependencies
retry () {
$* || (sleep 1 && $*) || (sleep 2 && $*) || (sleep 4 && $*) || (sleep 8 && $*)
}
retry conda install numpy ninja pyyaml mkl mkl-include setuptools cmake requests typing-extensions --yes
# sync submodules
cd ${PROJ_ROOT}
git submodule sync
git submodule update --init --recursive --depth 1 --jobs 0
# export
export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}
# run build script
chmod a+x ${PROJ_ROOT}/scripts/build_ios.sh
echo "IOS_ARCH: ${IOS_ARCH}"
echo "IOS_PLATFORM: ${IOS_PLATFORM}"
echo "USE_PYTORCH_METAL": "${USE_METAL}"
echo "BUILD_LITE_INTERPRETER": "${BUILD_LITE_INTERPRETER}"
echo "USE_COREML_DELEGATE": "${USE_COREML_DELEGATE}"
#check the custom build flag
echo "SELECTED_OP_LIST: ${SELECTED_OP_LIST}"
if [ -n "${SELECTED_OP_LIST}" ]; then
export SELECTED_OP_LIST="${PROJ_ROOT}/ios/TestApp/custom_build/${SELECTED_OP_LIST}"
fi
export IOS_ARCH=${IOS_ARCH}
export IOS_PLATFORM=${IOS_PLATFORM}
export USE_COREML_DELEGATE=${USE_COREML_DELEGATE}
if [ ${IOS_PLATFORM} != "SIMULATOR" ]; then
export USE_PYTORCH_METAL=${USE_METAL}
fi
unbuffer ${PROJ_ROOT}/scripts/build_ios.sh 2>&1 | ts
- run:
name: Run Build Test
no_output_timeout: "30m"
command: |
set -e
PROJ_ROOT=/Users/distiller/project
# run the ruby build script
if ! [ -x "$(command -v xcodebuild)" ]; then
echo 'Error: xcodebuild is not installed.'
exit 1
fi
ruby ${PROJ_ROOT}/scripts/xcode_build.rb -i ${PROJ_ROOT}/build_ios/install -x ${PROJ_ROOT}/ios/TestApp/TestApp.xcodeproj -p ${IOS_PLATFORM}
if ! [ "$?" -eq "0" ]; then
echo 'xcodebuild failed!'
exit 1
fi
- run:
name: Run Simulator Tests
no_output_timeout: "2h"
command: |
set -e
if [ ${IOS_PLATFORM} != "SIMULATOR" ]; then
echo "not SIMULATOR build, skip it."
exit 0
fi
WORKSPACE=/Users/distiller/workspace
PROJ_ROOT=/Users/distiller/project
source ~/anaconda/bin/activate
# use the pytorch nightly build to generate models
pip3 install --pre torch torchvision torchaudio -f https://download.pytorch.org/whl/nightly/cpu/torch_nightly.html
# generate models for different backends
cd ${PROJ_ROOT}/ios/TestApp/benchmark
mkdir -p ../models
if [ ${USE_COREML_DELEGATE} == 1 ]; then
pip install coremltools==5.0b5 protobuf==3.20.1
python coreml_backend.py
else
cd "${PROJ_ROOT}"
python test/mobile/model_test/gen_test_model.py ios-test
fi
cd "${PROJ_ROOT}/ios/TestApp/benchmark"
if [ ${BUILD_LITE_INTERPRETER} == 1 ]; then
echo "Setting up the TestApp for LiteInterpreter"
ruby setup.rb --lite 1
else
echo "Setting up the TestApp for Full JIT"
ruby setup.rb
fi
cd "${PROJ_ROOT}/ios/TestApp"
# instruments -s -devices
if [ "${BUILD_LITE_INTERPRETER}" == 1 ]; then
if [ "${USE_COREML_DELEGATE}" == 1 ]; then
fastlane scan --only_testing TestAppTests/TestAppTests/testCoreML
else
fastlane scan --only_testing TestAppTests/TestAppTests/testLiteInterpreter
fi
else
fastlane scan --only_testing TestAppTests/TestAppTests/testFullJIT
fi
pytorch_linux_bazel_build:
<<: *pytorch_params
machine:
image: ubuntu-2004:202104-01
steps:
- checkout
- calculate_docker_image_tag
- setup_linux_system_environment
- setup_ci_environment
- run:
name: Bazel Build
no_output_timeout: "1h"
command: |
set -e
# Pull Docker image and run build
echo "DOCKER_IMAGE: "${DOCKER_IMAGE}:${DOCKER_TAG}
time docker pull ${DOCKER_IMAGE}:${DOCKER_TAG} >/dev/null
export id=$(docker run --env-file "${BASH_ENV}" --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${DOCKER_IMAGE}:${DOCKER_TAG})
echo "Do NOT merge main branch into $CIRCLE_BRANCH in environment $BUILD_ENVIRONMENT"
git submodule sync && git submodule update -q --init --recursive --depth 1 --jobs 0
docker cp /home/circleci/project/. $id:/var/lib/jenkins/workspace
export COMMAND='((echo "sudo chown -R jenkins workspace && cd workspace && .ci/pytorch/build.sh") | docker exec -u jenkins -i "$id" bash) 2>&1'
echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts
# Push intermediate Docker image for next phase to use
if [ -z "${BUILD_ONLY}" ]; then
# Augment our output image name with bazel to avoid collisions
output_image=${DOCKER_IMAGE}:build-${DOCKER_TAG}-bazel-${CIRCLE_SHA1}
export COMMIT_DOCKER_IMAGE=$output_image
docker commit "$id" ${COMMIT_DOCKER_IMAGE}
time docker push ${COMMIT_DOCKER_IMAGE}
fi
pytorch_linux_bazel_test:
<<: *pytorch_params
machine:
image: ubuntu-2004:202104-01
steps:
- checkout
- calculate_docker_image_tag
- setup_linux_system_environment
- setup_ci_environment
- run:
name: Test
no_output_timeout: "90m"
command: |
set -e
output_image=${DOCKER_IMAGE}:build-${DOCKER_TAG}-bazel-${CIRCLE_SHA1}
export COMMIT_DOCKER_IMAGE=$output_image
echo "DOCKER_IMAGE: "${COMMIT_DOCKER_IMAGE}
time docker pull ${COMMIT_DOCKER_IMAGE} >/dev/null
if [ -n "${USE_CUDA_DOCKER_RUNTIME}" ]; then
export id=$(docker run --env-file "${BASH_ENV}" --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --gpus all -t -d -w /var/lib/jenkins ${COMMIT_DOCKER_IMAGE})
else
export id=$(docker run --env-file "${BASH_ENV}" --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${COMMIT_DOCKER_IMAGE})
fi
retrieve_test_reports() {
echo "retrieving test reports"
docker cp -L $id:/var/lib/jenkins/workspace/bazel-testlogs ./ || echo 'No test reports found!'
}
trap "retrieve_test_reports" ERR
if [[ ${BUILD_ENVIRONMENT} == *"multigpu"* ]]; then
export COMMAND='((echo "sudo chown -R jenkins workspace && cd workspace && .ci/pytorch/multigpu-test.sh") | docker exec -u jenkins -i "$id" bash) 2>&1'
else
export COMMAND='((echo "sudo chown -R jenkins workspace && cd workspace && .ci/pytorch/test.sh") | docker exec -u jenkins -i "$id" bash) 2>&1'
fi
echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts
retrieve_test_reports
docker stats --all --no-stream
- store_test_results:
path: bazel-testlogs
pytorch_windows_test_multigpu:
machine:
image: ubuntu-2004:202104-01
steps:
- checkout
- run:
name: Test
no_output_timeout: "90m"
command: |
set -e
python3 -m pip install requests
python3 ./.circleci/scripts/trigger_azure_pipeline.py

View File

@ -1,18 +0,0 @@
promote_s3:
<<: *promote_common
steps:
- checkout
- run:
name: Running promote script
command: |
scripts/release/promote/wheel_to_s3.sh
promote_conda:
<<: *promote_common
steps:
- checkout
- run:
name: Running promote script
command: |
scripts/release/promote/conda_to_conda.sh

View File

@ -1,29 +0,0 @@
setup:
docker:
- image: circleci/python:3.7.3
steps:
- checkout
- run:
name: Save commit message
command: git log --format='%B' -n 1 HEAD > .circleci/scripts/COMMIT_MSG
# Note [Workspace for CircleCI scripts]
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# In the beginning, you wrote your CI scripts in a
# .circleci/config.yml file, and life was good. Your CI
# configurations flourished and multiplied.
#
# Then one day, CircleCI cometh down high and say, "Your YAML file
# is too biggeth, it stresses our servers so." And thus they
# asketh us to smite the scripts in the yml file.
#
# But you can't just put the scripts in the .circleci folder,
# because in some jobs, you don't ever actually checkout the
# source repository. Where you gonna get the scripts from?
#
# Here's how you do it: you persist .circleci/scripts into a
# workspace, attach the workspace in your subjobs, and run all
# your scripts from there.
- persist_to_workspace:
root: .
paths: .circleci/scripts

View File

@ -1,51 +0,0 @@
##############################################################################
# Binary build (nightly build) defaults
# The binary builds use the docker executor b/c at time of writing the machine
# executor is limited to only two cores and is painfully slow (4.5+ hours per
# GPU build). But the docker executor cannot be run with --runtime=nvidia, and
# so the binary test/upload jobs must run on a machine executor. The package
# built in the build job is persisted to the workspace, which the test jobs
# expect. The test jobs just run a few quick smoke tests (very similar to the
# second-round-user-facing smoke tests above) and then upload the binaries to
# their final locations. The upload part requires credentials that should only
# be available to org-members.
#
# binary_checkout MUST be run before other commands here. This is because the
# other commands are written in .circleci/scripts/*.sh , so the pytorch source
# code must be downloaded on the machine before they can be run. We cannot
# inline all the code into this file, since that would cause the yaml size to
# explode past 4 MB (all the code in the command section is just copy-pasted to
# everywhere in the .circleci/config.yml file where it appears).
##############################################################################
# Checks out the Pytorch and Builder repos (always both of them), and places
# them in the right place depending on what executor we're running on. We curl
# our .sh file from the interweb to avoid yaml size bloat. Note that many jobs
# do not need both the pytorch and builder repos, so this is a little wasteful
# (smoke tests and upload jobs do not need the pytorch repo).
binary_checkout: &binary_checkout
name: Checkout pytorch/builder repo
no_output_timeout: "30m"
command: .circleci/scripts/binary_checkout.sh
# Parses circleci arguments in a consistent way, essentially routing to the
# correct pythonXgccXcudaXos build we want
binary_populate_env: &binary_populate_env
name: Set up binary env variables
command: .circleci/scripts/binary_populate_env.sh
binary_install_miniconda: &binary_install_miniconda
name: Install miniconda
no_output_timeout: "1h"
command: .circleci/scripts/binary_install_miniconda.sh
# This section is used in the binary_test and smoke_test jobs. It expects
# 'binary_populate_env' to have populated /home/circleci/project/env and it
# expects another section to populate /home/circleci/project/ci_test_script.sh
# with the code to run in the docker
binary_run_in_docker: &binary_run_in_docker
name: Run in docker
# This step only runs on circleci linux machine executors that themselves
# need to start docker images
command: .circleci/scripts/binary_run_in_docker.sh

View File

@ -1,8 +0,0 @@
#- binary_linux_libtorch_3.6m_cpu_test:
# requires:
# - binary_linux_libtorch_3.6m_cpu_build
#- binary_linux_libtorch_3.6m_cu90_test:
# requires:
# - binary_linux_libtorch_3.6m_cu90_build
# Nightly uploads

View File

@ -52,6 +52,13 @@ modernize-*,
-modernize-use-nodiscard,
performance-*,
readability-container-size-empty,
readability-delete-null-pointer,
readability-duplicate-include
readability-misplaced-array-index,
readability-redundant-function-ptr-dereference,
readability-redundant-smartptr-get,
readability-simplify-subscript-expr,
readability-string-compare,
'
HeaderFilterRegex: '^(aten/|c10/|torch/).*$'
AnalyzeTemporaryDtors: false

View File

@ -7,9 +7,7 @@ max-line-length = 120
# C408 ignored because we like the dict keyword argument syntax
# E501 is not flexible enough, we're using B950 instead
ignore =
E203,E305,E402,E501,E721,E741,F405,F821,F841,F999,W503,W504,C408,E302,W291,E303,
# fix these lints in the future
E275,
E203,E305,E402,E501,E721,E741,F405,F841,F999,W503,W504,C408,E302,W291,E303,
# shebang has extra meaning in fbcode lints, so I think it's not worth trying
# to line this up with executable bit
EXE001,
@ -31,6 +29,8 @@ ignore =
TOR102,
per-file-ignores =
__init__.py: F401
test/**: F821
test/**/__init__.py: F401,F821
torch/utils/cpp_extension.py: B950
torchgen/api/types/__init__.py: F401,F403
torchgen/executorch/api/types/__init__.py: F401,F403

View File

@ -38,3 +38,5 @@ f70844bec783bfce43c950ccf180dc494e86f2bf
e6ec0efaf87703c5f889cfc20b29be455885d58d
# 2023-07-31 [optim][BE] split test file into logical parts: SWA, LR, optim
a53cda1ddc15336dc1ff0ce1eff2a49cdc5f882e
# 2024-01-02 clangformat: fused adam #116583
9dc68d1aa9e554d09344a10fff69f7b50b2d23a0

View File

@ -3,6 +3,7 @@ self-hosted-runner:
- linux.20_04.4x
- linux.20_04.16x
- linux.large
- linux.large.arc
- linux.2xlarge
- linux.4xlarge
- linux.12xlarge

View File

@ -46,7 +46,8 @@ runs:
retry_wait_seconds: 30
command: |
set -eux
python3 -m pip install requests==2.26.0 pyyaml==6.0
# PyYAML 6.0 doesn't work with MacOS x86 anymore
python3 -m pip install requests==2.26.0 pyyaml==6.0.1
- name: Parse ref
id: parse-ref

67
.github/actions/setup-xpu/action.yml vendored Normal file
View File

@ -0,0 +1,67 @@
name: Setup XPU host
description: Set up XPU host for CI
runs:
using: composite
steps:
- name: Clean all stopped docker containers
if: always()
shell: bash
run: |
# Prune all stopped containers.
# If another runner is already pruning on this node, skip.
nprune=$(ps -ef | grep -c "docker container prune")
if [[ $nprune -eq 1 ]]; then
docker container prune -f
fi
- name: Runner health check system info
if: always()
shell: bash
run: |
cat /etc/os-release || true
cat /etc/apt/sources.list.d/oneAPI.list || true
cat /etc/apt/sources.list.d/intel-gpu-jammy.list || true
whoami
- name: Runner health check xpu-smi
if: always()
shell: bash
run: |
xpu-smi discovery
- name: Runner health check GPU count
if: always()
shell: bash
run: |
ngpu=$(xpu-smi discovery | grep -c -E 'Device Name')
msg="Please file an issue on pytorch/pytorch reporting the faulty runner. Include a link to the runner logs so the runner can be identified"
if [[ $ngpu -eq 0 ]]; then
echo "Error: Failed to detect any GPUs on the runner"
echo "$msg"
exit 1
fi
- name: Runner diskspace health check
uses: ./.github/actions/diskspace-cleanup
if: always()
- name: Runner health check disconnect on failure
if: ${{ failure() }}
shell: bash
run: |
killall runsvc.sh
- name: Preserve github env variables for use in docker
shell: bash
run: |
env | grep '^GITHUB' >> "/tmp/github_env_${GITHUB_RUN_ID}"
env | grep '^CI' >> "/tmp/github_env_${GITHUB_RUN_ID}"
- name: XPU set GPU_FLAG
shell: bash
run: |
# Add render group for container creation.
render_gid=`cat /etc/group | grep render | cut -d: -f3`
echo "GPU_FLAG=--device=/dev/mem --device=/dev/dri --group-add video --group-add $render_gid" >> "${GITHUB_ENV}"

20
.github/actions/teardown-xpu/action.yml vendored Normal file
View File

@ -0,0 +1,20 @@
name: Teardown XPU host
description: Tear down XPU host for CI
runs:
using: composite
steps:
- name: Teardown XPU
if: always()
shell: bash
run: |
# Prune all stopped containers.
# If another runner is already pruning on this node, skip.
nprune=$(ps -ef | grep -c "docker container prune")
if [[ $nprune -eq 1 ]]; then
docker container prune -f
fi
- name: Runner diskspace health check
uses: ./.github/actions/diskspace-cleanup
if: always()

View File

@ -12,7 +12,6 @@ reviewers:
symbolic-shapes:
- symbolic-shapes
- antoniojkim
- wconstab
- SherlockNoMad
Chillee:
- ezyang

View File

@ -1 +1 @@
6518fa9b2c74e84d7eb1fc6e3eb51e43213f0c05
e3efbc2d9094685dd2d4ae143853941f82f167af

View File

@ -1 +1 @@
c1e2095c3a16fbe7db25b9e2f206025488c2c203
d23430765b5df76cd1267f438f129f51b7d6e3e1

View File

@ -1 +1 @@
77b968a541b6d3062e81aafcc140dc20808703ae
e1c94dfa5a74331a376537c23bf74a2c367f24bd

13
.github/labeler.yml vendored
View File

@ -8,10 +8,6 @@
- torch/_inductor/**
- test/inductor/**
"module: export":
- torch/_export/**
- test/export/**
"ciflow/inductor":
- torch/_decomp/**
- torch/_dynamo/**
@ -23,8 +19,9 @@
- torch/_subclasses/meta_utils.py
- test/distributed/test_dynamo_distributed.py
- test/distributed/test_inductor_collectives.py
- torch/_functorch/partitioners.py
- torch/_functorch/_aot_autograd/**
- torch/_functorch/aot_autograd.py
- torch/_functorch/partitioners.py
- .ci/docker/ci_commit_pins/**
- .github/ci_commit_pins/**
- c10/core/Sym*
@ -72,9 +69,13 @@
"ciflow/trunk":
- .ci/docker/ci_commit_pins/triton.txt
"module: distributed":
"oncall: distributed":
- torch/csrc/distributed/**
- torch/distributed/**
- torch/nn/parallel/**
- test/distributed/**
- torch/testing/_internal/distributed/**
"module: distributed_checkpoint":
- torch/distributed/checkpoint/**
- test/distributed/checkpoint/**

View File

@ -285,6 +285,7 @@
- yhcharles
- kiukchung
- d4l3k
- shuqiangzhang
mandatory_checks_name:
- EasyCLA
- Lint
@ -351,16 +352,22 @@
- Lint
- pull
- name: x86 CPU quantization
- name: CPU inductor
patterns:
- torch/ao/quantization/quantizer/x86_inductor_quantizer.py
- torch/_inductor/fx_passes/mkldnn_fusion.py
- torch/_inductor/fx_passes/quantization.py
- test/quantization/core/test_quantized_op.py
- torch/_inductor/codegen/cpp.py
- test/inductor/test_mkldnn_pattern_matcher.py
- test/inductor/test_cpu_repo.py
- test/inductor/test_cpu_cpp_wrapper.py
- aten/src/ATen/native/quantized/cpu/**
- test/quantization/core/test_quantized_op.py
- torch/ao/quantization/quantizer/x86_inductor_quantizer.py
- test/quantization/pt2e/test_x86inductor_quantizer.py
approved_by:
- leslie-fang-intel
- jgong5
- EikanWang
mandatory_checks_name:
- EasyCLA
- Lint

View File

@ -14,6 +14,7 @@ ciflow_push_tags:
- ciflow/slow
- ciflow/trunk
- ciflow/unstable
- ciflow/xpu
retryable_workflows:
- lint
- pull

View File

@ -10,6 +10,9 @@ from typing import Optional
SCRIPT_DIR = Path(__file__).parent
REPO_DIR = SCRIPT_DIR.parent.parent
# TODO: Remove me once Triton version is again in sync for vanilla and ROCm
ROCM_TRITION_VERSION = "2.1.0"
def read_triton_pin(rocm_hash: bool = False) -> str:
triton_file = "triton.txt" if not rocm_hash else "triton-rocm.txt"
@ -29,25 +32,37 @@ def check_and_replace(inp: str, src: str, dst: str) -> str:
return inp.replace(src, dst)
def patch_setup_py(path: Path, *, version: str, name: str = "triton") -> None:
def patch_setup_py(
path: Path,
*,
version: str,
name: str = "triton",
expected_version: Optional[str] = None,
) -> None:
with open(path) as f:
orig = f.read()
# Replace name
orig = check_and_replace(orig, 'name="triton",', f'name="{name}",')
# Replace version
if not expected_version:
expected_version = read_triton_version()
orig = check_and_replace(
orig, f'version="{read_triton_version()}",', f'version="{version}",'
orig, f'version="{expected_version}",', f'version="{version}",'
)
with open(path, "w") as f:
f.write(orig)
def patch_init_py(path: Path, *, version: str) -> None:
def patch_init_py(
path: Path, *, version: str, expected_version: Optional[str] = None
) -> None:
if not expected_version:
expected_version = read_triton_version()
with open(path) as f:
orig = f.read()
# Replace version
orig = check_and_replace(
orig, f"__version__ = '{read_triton_version()}'", f'__version__ = "{version}"'
orig, f"__version__ = '{expected_version}'", f'__version__ = "{version}"'
)
with open(path, "w") as f:
f.write(orig)
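As a usage sketch (not part of the diff): with the ROCM_TRITION_VERSION constant added above, a ROCm build can patch the version string even though the file does not carry the vanilla pin returned by read_triton_version(). The path and version string below are placeholders, and the helpers are assumed to be in scope.

from pathlib import Path

# Hypothetical usage of patch_init_py() defined above; assumes this script's
# helpers and ROCM_TRITION_VERSION ("2.1.0") are importable.
patch_init_py(
    Path("/tmp/triton/python/triton/__init__.py"),  # placeholder checkout path
    version="2.1.0+git0123abcd",                    # placeholder wheel version
    expected_version=ROCM_TRITION_VERSION,          # ROCm pin defined above
)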
@ -130,7 +145,7 @@ def build_triton(
cwd=triton_basedir,
env=env,
)
conda_path = list(Path(tmpdir).glob("linux-64/torchtriton*.bz2"))[0]
conda_path = next(iter(Path(tmpdir).glob("linux-64/torchtriton*.bz2")))
shutil.copy(conda_path, Path.cwd())
return Path.cwd() / conda_path.name
@ -140,6 +155,7 @@ def build_triton(
patch_init_py(
triton_pythondir / "triton" / "__init__.py",
version=f"{version}",
expected_version=ROCM_TRITION_VERSION if build_rocm else None,
)
if build_rocm:
@ -148,6 +164,7 @@ def build_triton(
triton_pythondir / "setup.py",
name=triton_pkg_name,
version=f"{version}",
expected_version=ROCM_TRITION_VERSION,
)
check_call("scripts/amd/setup_rocm_libs.sh", cwd=triton_basedir, shell=True)
print("ROCm libraries setup for triton installation...")
@ -156,7 +173,7 @@ def build_triton(
[sys.executable, "setup.py", "bdist_wheel"], cwd=triton_pythondir, env=env
)
whl_path = list((triton_pythondir / "dist").glob("*.whl"))[0]
whl_path = next(iter((triton_pythondir / "dist").glob("*.whl")))
shutil.copy(whl_path, Path.cwd())
if build_rocm:

Binary file not shown.

View File

@ -16,6 +16,12 @@ from typing import Dict, List, Optional, Tuple
CUDA_ARCHES = ["11.8", "12.1"]
CUDA_ARCHES_FULL_VERSION = {"11.8": "11.8.0", "12.1": "12.1.1"}
CUDA_ARCHES_CUDNN_VERSION = {"11.8": "8", "12.1": "8"}
ROCM_ARCHES = ["5.6", "5.7"]
@ -24,6 +30,7 @@ CPU_CXX11_ABI_ARCH = ["cpu-cxx11-abi"]
CPU_AARCH64_ARCH = ["cpu-aarch64"]
PYTORCH_EXTRA_INSTALL_REQUIREMENTS = {
"11.8": (
"nvidia-cuda-nvrtc-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | " # noqa: B950
@ -86,7 +93,9 @@ def get_nccl_wheel_version(arch_version: str) -> str:
requirements = map(
str.strip, re.split("[;|]", PYTORCH_EXTRA_INSTALL_REQUIREMENTS[arch_version])
)
return [x for x in requirements if x.startswith("nvidia-nccl-cu")][0].split("==")[1]
return next(x for x in requirements if x.startswith("nvidia-nccl-cu")).split("==")[
1
]
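Illustrative only: the next(...) rewrite above scans the semicolon/pipe-separated requirement string for the nvidia-nccl-cu pin. A self-contained toy example with a made-up requirements string:

import re

# Made-up requirements string in the PYTORCH_EXTRA_INSTALL_REQUIREMENTS format.
reqs = (
    "nvidia-cuda-nvrtc-cu11==11.8.89; platform_system == 'Linux' | "
    "nvidia-nccl-cu11==2.19.3; platform_system == 'Linux'"
)
requirements = map(str.strip, re.split("[;|]", reqs))
nccl = next(x for x in requirements if x.startswith("nvidia-nccl-cu")).split("==")[1]
print(nccl)  # -> 2.19.3 (hypothetical version)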
def validate_nccl_dep_consistency(arch_version: str) -> None:

View File

@ -0,0 +1,42 @@
#!/usr/bin/env python3
"""Generates a matrix for docker releases through github actions
Will output a condensed version of the matrix. Will include the following:
* CUDA version short
* CUDA full version
* CUDNN version short
* Image type either runtime or devel
* Platform linux/arm64,linux/amd64
"""
import json
from typing import Dict, List
import generate_binary_build_matrix
DOCKER_IMAGE_TYPES = ["runtime", "devel"]
def generate_docker_matrix() -> Dict[str, List[Dict[str, str]]]:
ret: List[Dict[str, str]] = []
for cuda, version in generate_binary_build_matrix.CUDA_ARCHES_FULL_VERSION.items():
for image in DOCKER_IMAGE_TYPES:
ret.append(
{
"cuda": cuda,
"cuda_full_version": version,
"cudnn_version": generate_binary_build_matrix.CUDA_ARCHES_CUDNN_VERSION[
cuda
],
"image_type": image,
"platform": "linux/arm64,linux/amd64",
}
)
return {"include": ret}
if __name__ == "__main__":
build_matrix = generate_docker_matrix()
print(json.dumps(build_matrix))
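For reference, a sketch of the output this script prints, derived from the CUDA_ARCHES_FULL_VERSION and CUDA_ARCHES_CUDNN_VERSION tables in generate_binary_build_matrix shown earlier in this diff; the 11.8/devel and 12.1 entries follow the same shape.

import json

# One entry of the matrix the script emits; a GitHub Actions workflow would
# consume the full JSON via fromJSON().
example = {
    "include": [
        {
            "cuda": "11.8",
            "cuda_full_version": "11.8.0",
            "cudnn_version": "8",
            "image_type": "runtime",
            "platform": "linux/arm64,linux/amd64",
        },
        # ... plus 11.8/devel, 12.1/runtime, 12.1/devel
    ]
}
print(json.dumps(example))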

View File

@ -145,6 +145,16 @@ class GitRepo:
rc = self._run_git("rev-list", revision_range, "--", ".").strip()
return rc.split("\n") if len(rc) > 0 else []
def branches_containing_ref(
self, ref: str, *, include_remote: bool = True
) -> List[str]:
rc = (
self._run_git("branch", "--remote", "--contains", ref)
if include_remote
else self._run_git("branch", "--contains", ref)
)
return [x.strip() for x in rc.split("\n") if x.strip()] if len(rc) > 0 else []
def current_branch(self) -> str:
return self._run_git("symbolic-ref", "--short", "HEAD").strip()
@ -387,13 +397,28 @@ def _shasum(value: str) -> str:
return m.hexdigest()
def are_ghstack_branches_in_sync(repo: GitRepo, head_ref: str) -> bool:
def is_commit_hash(ref: str) -> bool:
"True if ref is hexadecimal number, else false"
try:
int(ref, 16)
except ValueError:
return False
return True
def are_ghstack_branches_in_sync(
repo: GitRepo, head_ref: str, base_ref: Optional[str] = None
) -> bool:
"""Checks that diff between base and head is the same as diff between orig and its parent"""
orig_ref = re.sub(r"/head$", "/orig", head_ref)
base_ref = re.sub(r"/head$", "/base", head_ref)
if base_ref is None:
base_ref = re.sub(r"/head$", "/base", head_ref)
orig_diff_sha = _shasum(repo.diff(f"{repo.remote}/{orig_ref}"))
head_diff_sha = _shasum(
repo.diff(f"{repo.remote}/{base_ref}", f"{repo.remote}/{head_ref}")
repo.diff(
base_ref if is_commit_hash(base_ref) else f"{repo.remote}/{base_ref}",
f"{repo.remote}/{head_ref}",
)
)
return orig_diff_sha == head_diff_sha
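A hedged usage sketch of the new optional base_ref parameter: when a caller already knows the merge-base commit (as get_ghstack_prs() below now does for PRs based on the default branch), it can pass the raw sha and is_commit_hash() keeps the remote prefix from being prepended. The refs and sha here are placeholders.

# Placeholder refs/sha; assumes repo is a GitRepo for a pytorch checkout.
in_sync = are_ghstack_branches_in_sync(
    repo,
    head_ref="gh/someuser/1/head",
    base_ref="0123456789abcdef0123456789abcdef01234567",  # hypothetical merge-base sha
)
print(in_sync)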

Binary file not shown.

View File

@ -44,6 +44,10 @@ def get_last_page_num_from_header(header: Any) -> int:
# Link info looks like: <https://api.github.com/repositories/65600975/labels?per_page=100&page=2>;
# rel="next", <https://api.github.com/repositories/65600975/labels?per_page=100&page=3>; rel="last"
link_info = header["link"]
# Docs do not specify that it should be present for projects with just a few labels
# and, as https://github.com/malfet/deleteme/actions/runs/7334565243/job/19971396887 shows, it is sometimes absent
if link_info is None:
return 1
prefix = "&page="
suffix = ">;"
return int(

View File

@ -32,7 +32,6 @@ from trymerge import (
main as trymerge_main,
MandatoryChecksMissingError,
MergeRule,
PostCommentError,
RE_GHSTACK_DESC,
read_merge_rules,
remove_job_name_suffix,
@ -222,6 +221,31 @@ def mocked_read_merge_rules(repo: Any, org: str, project: str) -> List[MergeRule
]
def mocked_read_merge_rules_approvers(
repo: Any, org: str, project: str
) -> List[MergeRule]:
return [
MergeRule(
name="Core Reviewers",
patterns=["*"],
approved_by=["1", "2", "3", "4", "5", "6"],
mandatory_checks_name=[
"Lint",
"pull",
],
),
MergeRule(
name="Core Maintainers",
patterns=["*"],
approved_by=["1", "2", "malfet"],
mandatory_checks_name=[
"Lint",
"pull",
],
),
]
def mocked_read_merge_rules_raise(repo: Any, org: str, project: str) -> List[MergeRule]:
raise RuntimeError("testing")
@ -287,6 +311,27 @@ class TestTryMerge(TestCase):
RuntimeError, "testing", lambda: find_matching_merge_rule(pr, repo)
)
@mock.patch(
"trymerge.read_merge_rules", side_effect=mocked_read_merge_rules_approvers
)
def test_match_rules_approvers(self, *args: Any) -> None:
"Tests that PR has the necessary approvers"
repo = DummyGitRepo()
pr = GitHubPR("pytorch", "pytorch", 115329)
# Test that all potential approvers across all rules are listed if the
# PR doesn't have one of them
for mock_rule in ["Core Reviewers", "Core Maintainers"]:
self.assertRaisesRegex(
RuntimeError,
mock_rule,
lambda: find_matching_merge_rule(pr, repo),
)
pr = GitHubPR("pytorch", "pytorch", 115495)
# Test that PR with the correct approvers doesn't raise any exception
self.assertTrue(find_matching_merge_rule(pr, repo) is not None)
@mock.patch("trymerge.read_merge_rules", side_effect=mocked_read_merge_rules)
def test_lint_fails(self, *args: Any) -> None:
"Tests that PR fails mandatory lint check"
@ -470,20 +515,6 @@ class TestTryMerge(TestCase):
self.assertEqual(len(changed_files), pr.get_changed_files_count())
def test_revert_codev_fails(self, *args: Any) -> None:
pr = GitHubPR("pytorch", "pytorch", 91340)
class GitRepoCoDev(DummyGitRepo):
def commit_message(self, ref: str) -> str:
return pr.get_body()
repo = GitRepoCoDev()
self.assertRaisesRegex(
PostCommentError,
"landed via phabricator",
lambda: validate_revert(repo, pr, comment_id=1372496233),
)
def test_revert_codev_abandoned_diff_succeeds(self, *args: Any) -> None:
pr = GitHubPR("pytorch", "pytorch", 100652)

View File

@ -20,7 +20,18 @@ from collections import defaultdict
from dataclasses import dataclass
from functools import lru_cache
from pathlib import Path
from typing import Any, Callable, cast, Dict, List, NamedTuple, Optional, Pattern, Tuple
from typing import (
Any,
Callable,
cast,
Dict,
Iterable,
List,
NamedTuple,
Optional,
Pattern,
Tuple,
)
from warnings import warn
import yaml
@ -612,19 +623,14 @@ def can_skip_internal_checks(pr: "GitHubPR", comment_id: Optional[int] = None) -
return comment.author_login == "facebook-github-bot"
def get_ghstack_prs(
repo: GitRepo, pr: "GitHubPR", open_only: bool = True
def _revlist_to_prs(
repo: GitRepo,
pr: "GitHubPR",
rev_list: Iterable[str],
should_skip: Optional[Callable[[int, "GitHubPR"], bool]] = None,
) -> List[Tuple["GitHubPR", str]]:
"""
Get the PRs in the stack that are below this PR (inclusive). Throws error if any of the open PRs are out of sync.
@:param open_only: Only return open PRs
"""
assert pr.is_ghstack_pr()
entire_stack: List[Tuple[GitHubPR, str]] = []
# For ghstack, cherry-pick commits based from origin
orig_ref = f"{repo.remote}/{re.sub(r'/head$', '/orig', pr.head_ref())}"
rev_list = repo.revlist(f"{pr.default_branch()}..{orig_ref}")
for idx, rev in enumerate(reversed(rev_list)):
rc: List[Tuple[GitHubPR, str]] = []
for idx, rev in enumerate(rev_list):
msg = repo.commit_message(rev)
m = RE_PULL_REQUEST_RESOLVED.search(msg)
if m is None:
@ -635,25 +641,48 @@ def get_ghstack_prs(
raise RuntimeError(
f"PR {m.group('number')} resolved to wrong owner/repo pair"
)
stacked_pr_num = int(m.group("number"))
if stacked_pr_num != pr.pr_num:
stacked_pr = GitHubPR(pr.org, pr.project, stacked_pr_num)
if open_only and stacked_pr.is_closed():
print(
f"Skipping {idx+1} of {len(rev_list)} PR (#{stacked_pr_num}) as its already been merged"
)
continue
entire_stack.append((stacked_pr, rev))
else:
entire_stack.append((pr, rev))
pr_num = int(m.group("number"))
candidate = GitHubPR(pr.org, pr.project, pr_num) if pr_num != pr.pr_num else pr
if should_skip is not None and should_skip(idx, candidate):
continue
rc.append((candidate, rev))
return rc
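A hedged sketch of how the new _revlist_to_prs() helper is meant to be driven; the should_skip callback below mirrors the skip_func that get_ghstack_prs() defines further down, and repo, pr, and rev_list are placeholders supplied by the caller.

# Placeholder usage; repo, pr and rev_list come from the caller.
def skip_closed(idx: int, candidate: "GitHubPR") -> bool:
    # Skip stacked PRs that were already merged/closed.
    return candidate.is_closed()

stack = _revlist_to_prs(repo, pr, reversed(rev_list), should_skip=skip_closed)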
def get_ghstack_prs(
repo: GitRepo, pr: "GitHubPR", open_only: bool = True
) -> List[Tuple["GitHubPR", str]]:
"""
Get the PRs in the stack that are below this PR (inclusive). Throws error if any of the open PRs are out of sync.
@:param open_only: Only return open PRs
"""
# For ghstack, cherry-pick commits based from origin
orig_ref = f"{repo.remote}/{pr.get_ghstack_orig_ref()}"
rev_list = repo.revlist(f"{pr.default_branch()}..{orig_ref}")
def skip_func(idx: int, candidate: "GitHubPR") -> bool:
if not open_only or not candidate.is_closed():
return False
print(
f"Skipping {idx+1} of {len(rev_list)} PR (#{candidate.pr_num}) as its already been merged"
)
return True
assert pr.is_ghstack_pr()
entire_stack = _revlist_to_prs(repo, pr, reversed(rev_list), skip_func)
for stacked_pr, rev in entire_stack:
if stacked_pr.is_closed():
continue
if not are_ghstack_branches_in_sync(repo, stacked_pr.head_ref()):
base_ref = stacked_pr.base_ref()
if base_ref == pr.default_branch():
base_ref = repo.get_merge_base(
f"{repo.remote}/{base_ref}", f"{repo.remote}/{stacked_pr.head_ref()}"
)
if not are_ghstack_branches_in_sync(repo, stacked_pr.head_ref(), base_ref):
raise RuntimeError(
f"PR {stacked_pr.pr_num} is out of sync with the corresponding revision {rev} on "
+ f"branch {orig_ref} that would be merged into main. "
+ f"branch {stacked_pr.get_ghstack_orig_ref()} that would be merged into {stacked_pr.default_branch()}. "
+ "This usually happens because there is a non ghstack change in the PR. "
+ f"Please sync them and try again (ex. make the changes on {orig_ref} and run ghstack)."
)
@ -694,6 +723,10 @@ class GitHubPR:
def is_ghstack_pr(self) -> bool:
return RE_GHSTACK_HEAD_REF.match(self.head_ref()) is not None
def get_ghstack_orig_ref(self) -> str:
assert self.is_ghstack_pr()
return re.sub(r"/head$", "/orig", self.head_ref())
def is_base_repo_private(self) -> bool:
return bool(self.info["baseRepository"]["isPrivate"])
@ -1288,6 +1321,9 @@ def find_matching_merge_rule(
ignore_current_checks=ignore_current_checks,
)
# This keeps the list of all approvers that could stamp the change
all_rule_approvers = {}
# PRs can fail multiple merge rules, but it only needs to pass one rule to be approved.
# If it fails all rules, we need to find the rule that it came closest to passing and report
# that to the dev.
@ -1331,24 +1367,31 @@ def find_matching_merge_rule(
continue
# Does the PR have the required approvals for this rule?
rule_approvers_set = set()
rule_approvers = set()
for approver in rule.approved_by:
if "/" in approver:
org, name = approver.split("/")
rule_approvers_set.update(gh_get_team_members(org, name))
rule_approvers.update(gh_get_team_members(org, name))
else:
rule_approvers_set.add(approver)
approvers_intersection = approved_by.intersection(rule_approvers_set)
rule_approvers.add(approver)
approvers_intersection = approved_by.intersection(rule_approvers)
# If rule requires approvers but they aren't the ones that reviewed PR
if len(approvers_intersection) == 0 and len(rule_approvers_set) > 0:
if reject_reason_score < 10000:
if len(approvers_intersection) == 0 and len(rule_approvers) > 0:
# Less than or equal is intentionally used here to gather all potential
# approvers
if reject_reason_score <= 10000:
reject_reason_score = 10000
reject_reason = "\n".join(
(
"Approval needed from one of the following:",
f"{', '.join(list(rule_approvers_set)[:5])}{', ...' if len(rule_approvers_set) > 5 else ''}",
)
)
all_rule_approvers[rule.name] = rule.approved_by
# Prepare the reject reason
all_rule_approvers_msg = [
f"- {name} ({', '.join(approved_by[:5])}{', ...' if len(approved_by) > 5 else ''})"
for name, approved_by in all_rule_approvers.items()
]
reject_reason = "Approvers from one of the following sets are needed:\n"
reject_reason += "\n".join(all_rule_approvers_msg)
continue
# Does the PR pass the checks required by this rule?
@ -1722,6 +1765,16 @@ def filter_checks_with_lambda(
return [check for check in checks.values() if status_filter(check.status)]
def get_pr_commit_sha(repo: GitRepo, pr: GitHubPR) -> str:
commit_sha = pr.get_merge_commit()
if commit_sha is not None:
return commit_sha
commits = repo.commits_resolving_gh_pr(pr.pr_num)
if len(commits) == 0:
raise PostCommentError("Can't find any commits resolving PR")
return commits[0]
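A hedged usage sketch of the extracted helper; the PR number is a placeholder. It prefers the recorded merge commit and falls back to the newest commit resolving the PR, raising PostCommentError if neither exists.

# Placeholder PR number; assumes repo is a GitRepo for the pytorch checkout.
sha = get_pr_commit_sha(repo, GitHubPR("pytorch", "pytorch", 12345))
print(sha)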
def validate_revert(
repo: GitRepo, pr: GitHubPR, *, comment_id: Optional[int] = None
) -> Tuple[str, str]:
@ -1743,32 +1796,98 @@ def validate_revert(
f"Will not revert as @{author_login} is not one of "
f"[{', '.join(allowed_reverters)}], but instead is {author_association}."
)
skip_internal_checks = can_skip_internal_checks(pr, comment_id)
# Ignore the associated diff if the PR does not have internal changes
if pr.has_no_connected_diff():
skip_internal_checks = True
# Raises exception if matching rule is not found, but ignores all status checks
find_matching_merge_rule(
pr, repo, skip_mandatory_checks=True, skip_internal_checks=skip_internal_checks
pr, repo, skip_mandatory_checks=True, skip_internal_checks=True
)
commit_sha = pr.get_merge_commit()
if commit_sha is None:
commits = repo.commits_resolving_gh_pr(pr.pr_num)
if len(commits) == 0:
raise PostCommentError("Can't find any commits resolving PR")
commit_sha = commits[0]
msg = repo.commit_message(commit_sha)
rc = RE_DIFF_REV.search(msg)
if rc is not None and not skip_internal_checks:
raise PostCommentError(
f"Can't revert PR that was landed via phabricator as {rc.group(1)}. "
+ "Please revert by going to the internal diff and clicking Unland."
)
commit_sha = get_pr_commit_sha(repo, pr)
return (author_login, commit_sha)
def get_ghstack_dependent_prs(
repo: GitRepo, pr: GitHubPR, only_closed: bool = True
) -> List[Tuple[str, GitHubPR]]:
"""
Get the PRs in the stack that are above this PR (inclusive).
Throws an error if the stack has branched or the original branches are gone
"""
assert pr.is_ghstack_pr()
orig_ref = f"{repo.remote}/{pr.get_ghstack_orig_ref()}"
rev_list = repo.revlist(f"{pr.default_branch()}..{orig_ref}")
if len(rev_list) == 0:
raise RuntimeError(
f"PR {pr.pr_num} does not have any revisions associated with it"
)
skip_len = len(rev_list) - 1
for branch in repo.branches_containing_ref(orig_ref):
candidate = repo.revlist(f"{pr.default_branch()}..{branch}")
# Pick longest candidate
if len(candidate) > len(rev_list):
candidate, rev_list = rev_list, candidate
# Validate that candidate always ends rev-list
if rev_list[-len(candidate) :] != candidate:
raise RuntimeError(
f"Branch {branch} revlist {', '.join(candidate)} is not a subset of {', '.join(rev_list)}"
)
# Remove commits original PR depends on
if skip_len > 0:
rev_list = rev_list[:-skip_len]
rc: List[Tuple[str, GitHubPR]] = []
for pr_, sha in _revlist_to_prs(repo, pr, rev_list):
if not pr_.is_closed():
if not only_closed:
rc.append(("", pr_))
continue
commit_sha = get_pr_commit_sha(repo, pr_)
rc.append((commit_sha, pr_))
return rc
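A toy illustration (shas are made up) of the suffix check used above: after swapping so that rev_list is the longest rev-list seen, every other containing branch's rev-list must be a tail of it, otherwise the stack has branched.

# Made-up shas, newest first, as returned by repo.revlist().
rev_list = ["sha4", "sha3", "sha2", "sha1"]      # longest branch seen so far
candidate = ["sha2", "sha1"]                     # shorter containing branch
assert rev_list[-len(candidate):] == candidate   # candidate must end rev_list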
def do_revert_prs(
repo: GitRepo,
shas_and_prs: List[Tuple[str, GitHubPR]],
*,
author_login: str,
extra_msg: str = "",
skip_internal_checks: bool = False,
dry_run: bool = False,
) -> None:
# Prepare and push revert commits
commit_shas: List[str] = []
for commit_sha, pr in shas_and_prs:
revert_msg = f"\nReverted {pr.get_pr_url()} on behalf of {prefix_with_github_url(author_login)}"
revert_msg += extra_msg
repo.checkout(pr.default_branch())
repo.revert(commit_sha)
msg = repo.commit_message("HEAD")
msg = re.sub(RE_PULL_REQUEST_RESOLVED, "", msg)
msg += revert_msg
repo.amend_commit_message(msg)
repo.push(shas_and_prs[0][1].default_branch(), dry_run)
# Comment/reopen PRs
for commit_sha, pr in shas_and_prs:
revert_message = (
f"@{pr.get_pr_creator_login()} your PR has been successfully reverted."
)
if (
pr.has_internal_changes()
and not pr.has_no_connected_diff()
and not skip_internal_checks
):
revert_message += "\n:warning: This PR might contain internal changes"
revert_message += "\ncc: @pytorch/pytorch-dev-infra"
gh_post_pr_comment(
pr.org, pr.project, pr.pr_num, revert_message, dry_run=dry_run
)
if not dry_run:
pr.add_numbered_label("reverted")
gh_post_commit_comment(pr.org, pr.project, commit_sha, revert_msg)
gh_update_pr_state(pr.org, pr.project, pr.pr_num)
def try_revert(
repo: GitRepo,
pr: GitHubPR,
@ -1777,34 +1896,37 @@ def try_revert(
comment_id: Optional[int] = None,
reason: Optional[str] = None,
) -> None:
def post_comment(msg: str) -> None:
gh_post_pr_comment(pr.org, pr.project, pr.pr_num, msg, dry_run=dry_run)
try:
author_login, commit_sha = validate_revert(repo, pr, comment_id=comment_id)
except PostCommentError as e:
return post_comment(str(e))
revert_msg = f"\nReverted {pr.get_pr_url()} on behalf of {prefix_with_github_url(author_login)}"
revert_msg += f" due to {reason}" if reason is not None else ""
revert_msg += (
gh_post_pr_comment(pr.org, pr.project, pr.pr_num, str(e), dry_run=dry_run)
return
extra_msg = f" due to {reason}" if reason is not None else ""
extra_msg += (
f" ([comment]({pr.get_comment_by_id(comment_id).url}))\n"
if comment_id is not None
else "\n"
)
repo.checkout(pr.default_branch())
repo.revert(commit_sha)
msg = repo.commit_message("HEAD")
msg = re.sub(RE_PULL_REQUEST_RESOLVED, "", msg)
msg += revert_msg
repo.amend_commit_message(msg)
repo.push(pr.default_branch(), dry_run)
post_comment(
f"@{pr.get_pr_creator_login()} your PR has been successfully reverted."
shas_and_prs = [(commit_sha, pr)]
if pr.is_ghstack_pr():
try:
shas_and_prs = get_ghstack_dependent_prs(repo, pr)
prs_to_revert = " ".join([t[1].get_pr_url() for t in shas_and_prs])
print(f"About to stack of PRs: {prs_to_revert}")
except Exception as e:
print(
f"Failed to fetch dependent PRs: {str(e)}, fall over to single revert"
)
do_revert_prs(
repo,
shas_and_prs,
author_login=author_login,
extra_msg=extra_msg,
dry_run=dry_run,
skip_internal_checks=can_skip_internal_checks(pr, comment_id),
)
if not dry_run:
pr.add_numbered_label("reverted")
gh_post_commit_comment(pr.org, pr.project, commit_sha, revert_msg)
gh_update_pr_state(pr.org, pr.project, pr.pr_num)
def prefix_with_github_url(suffix_str: str) -> str:

View File

@ -29,6 +29,7 @@ env:
jobs:
filter:
if: github.repository_owner == 'pytorch'
runs-on: [self-hosted, linux.large]
outputs:
test-matrix: ${{ steps.filter.outputs.test-matrix }}

View File

@ -29,6 +29,7 @@ env:
jobs:
filter:
if: github.repository_owner == 'pytorch'
runs-on: [self-hosted, linux.large]
outputs:
test-matrix: ${{ steps.filter.outputs.test-matrix }}

View File

@ -33,6 +33,7 @@ env:
jobs:
filter:
if: github.repository_owner == 'pytorch'
runs-on: [self-hosted, linux.large]
outputs:
test-matrix: ${{ steps.filter.outputs.test-matrix }}
@ -121,8 +122,6 @@ jobs:
GITHUB_RUN_NUMBER: ${{ github.run_number }}
GITHUB_RUN_ATTEMPT: ${{ github.run_attempt }}
JOB_ID: ${{ steps.get-job-id.outputs.job-id }}
PYTORCH_RETRY_TEST_CASES: 1
PYTORCH_OVERRIDE_FLAKY_SIGNAL: 1
REENABLED_ISSUES: ${{ needs.filter.outputs.reenabled-issues }}
# TODO duplicated
AWS_DEFAULT_REGION: us-east-1
@ -159,8 +158,6 @@ jobs:
-e TORCH_CUDA_ARCH_LIST \
-e OUR_GITHUB_JOB_ID \
-e CUDA_VERSION \
-e PYTORCH_RETRY_TEST_CASES \
-e PYTORCH_OVERRIDE_FLAKY_SIGNAL \
--env-file="/tmp/github_env_${GITHUB_RUN_ID}" \
--security-opt seccomp=unconfined \
--cap-add=SYS_PTRACE \

View File

@ -15,6 +15,7 @@ defaults:
jobs:
filter:
if: github.repository_owner == 'pytorch'
runs-on: [self-hosted, linux.large]
outputs:
test-matrix: ${{ steps.filter.outputs.test-matrix }}

View File

@ -38,6 +38,7 @@ env:
jobs:
filter:
if: github.repository_owner == 'pytorch'
runs-on: [self-hosted, linux.large]
outputs:
test-matrix: ${{ steps.filter.outputs.test-matrix }}

View File

@ -164,8 +164,6 @@ jobs:
BRANCH: ${{ steps.parse-ref.outputs.branch }}
SHA1: ${{ github.event.pull_request.head.sha || github.sha }}
BASE_SHA: ${{ github.event.pull_request.base.sha || github.sha }}
PYTORCH_RETRY_TEST_CASES: 1
PYTORCH_OVERRIDE_FLAKY_SIGNAL: 1
TEST_CONFIG: ${{ matrix.config }}
SHARD_NUMBER: ${{ matrix.shard }}
NUM_TEST_SHARDS: ${{ matrix.num_shards }}
@ -209,6 +207,7 @@ jobs:
-e GITHUB_RUN_NUMBER \
-e GITHUB_RUN_ATTEMPT \
-e JOB_ID \
-e JOB_NAME \
-e BASE_SHA \
-e BRANCH \
-e SHA1 \
@ -219,8 +218,6 @@ jobs:
-e NUM_TEST_SHARDS \
-e REENABLED_ISSUES \
-e CONTINUE_THROUGH_ERROR \
-e PYTORCH_RETRY_TEST_CASES \
-e PYTORCH_OVERRIDE_FLAKY_SIGNAL \
-e PR_LABELS \
-e MAX_JOBS="$(nproc --ignore=2)" \
-e SCCACHE_BUCKET \

View File

@ -28,6 +28,7 @@ on:
jobs:
filter:
if: github.repository_owner == 'pytorch'
runs-on: [self-hosted, linux.large]
outputs:
test-matrix: ${{ steps.filter.outputs.test-matrix }}
@ -58,7 +59,6 @@ jobs:
runs-on: ${{ matrix.runner }}
steps:
- name: Print runner OS/HW info
shell: arch -arch arm64 bash {0}
run: |
sysctl machdep.cpu.brand_string kern.osproductversion
@ -69,7 +69,6 @@ jobs:
quiet-checkout: true
- name: Clean checkout
shell: arch -arch arm64 bash {0}
run: |
git clean -fxd
@ -95,12 +94,9 @@ jobs:
ENV_NAME: conda-test-env-${{ github.run_id }}
PY_VERS: 3.9
PR_BODY: ${{ github.event.pull_request.body }}
PYTORCH_RETRY_TEST_CASES: 1
PYTORCH_OVERRIDE_FLAKY_SIGNAL: 1
CONTINUE_THROUGH_ERROR: ${{ needs.filter.outputs.keep-going }}
PIP_REQUIREMENTS_FILE: .github/requirements/pip-requirements-${{ runner.os }}.txt
REENABLED_ISSUES: ${{ needs.filter.outputs.reenabled-issues }}
shell: arch -arch arm64 bash {0}
run: |
# shellcheck disable=SC1090
set -ex

View File

@ -57,9 +57,11 @@ jobs:
SHARD_NUMBER: ${{ matrix.shard }}
NUM_TEST_SHARDS: ${{ matrix.num_shards }}
PR_BODY: ${{ github.event.pull_request.body }}
PYTORCH_RETRY_TEST_CASES: 1
PYTORCH_OVERRIDE_FLAKY_SIGNAL: 1
steps:
- name: Print runner OS/HW info
run: |
sysctl machdep.cpu.brand_string kern.osproductversion
- name: Clean up leftover processes on MacOS pet runner
continue-on-error: true
run: |
@ -76,8 +78,6 @@ jobs:
rm -rf "${dir}"
done
- name: Clean up disk space before running MacOS workflow
uses: pytorch/test-infra/.github/actions/check-disk-space@main

View File

@ -131,8 +131,6 @@ jobs:
JOB_NAME: ${{ steps.get-job-id.outputs.job-name }}
BRANCH: ${{ steps.parse-ref.outputs.branch }}
SHA1: ${{ github.event.pull_request.head.sha || github.sha }}
PYTORCH_RETRY_TEST_CASES: 1
PYTORCH_OVERRIDE_FLAKY_SIGNAL: 1
CONTINUE_THROUGH_ERROR: ${{ steps.keep-going.outputs.keep-going }}
TEST_CONFIG: ${{ matrix.config }}
SHARD_NUMBER: ${{ matrix.shard }}
@ -172,6 +170,7 @@ jobs:
-e GITHUB_RUN_NUMBER \
-e GITHUB_RUN_ATTEMPT \
-e JOB_ID \
-e JOB_NAME \
-e BRANCH \
-e SHA1 \
-e AWS_DEFAULT_REGION \
@ -180,8 +179,6 @@ jobs:
-e TEST_CONFIG \
-e NUM_TEST_SHARDS \
-e REENABLED_ISSUES \
-e PYTORCH_RETRY_TEST_CASES \
-e PYTORCH_OVERRIDE_FLAKY_SIGNAL \
-e CONTINUE_THROUGH_ERROR \
-e MAX_JOBS="$(nproc --ignore=2)" \
-e SCCACHE_BUCKET \

View File

@ -15,6 +15,7 @@ defaults:
jobs:
filter:
if: github.repository_owner == 'pytorch'
runs-on: [self-hosted, linux.large]
outputs:
test-matrix: ${{ steps.filter.outputs.test-matrix }}

View File

@ -139,8 +139,6 @@ jobs:
USE_CUDA: ${{ inputs.cuda-version != 'cpu' && '1' || '0' }}
INSTALL_WINDOWS_SDK: 1
PYTHON_VERSION: 3.8
PYTORCH_RETRY_TEST_CASES: 1
PYTORCH_OVERRIDE_FLAKY_SIGNAL: 1
CONTINUE_THROUGH_ERROR: ${{ steps.keep-going.outputs.keep-going }}
VC_PRODUCT: "BuildTools"
VC_VERSION: ""

.github/workflows/_xpu-test.yml (new file, 269 lines)
View File

@ -0,0 +1,269 @@
# TODO: this looks sort of similar to _linux-test, but there are like a dozen
# places where you would have to insert an if statement. Probably it's better to
# just use a different workflow altogether
name: xpu-test
on:
workflow_call:
inputs:
build-environment:
required: true
type: string
description: Top-level label for what's being built/tested.
test-matrix:
required: true
type: string
description: JSON description of what test configs to run.
docker-image:
required: true
type: string
description: Docker image to run in.
sync-tag:
required: false
type: string
default: ""
description: |
If this is set, our linter will use this to make sure that every other
job with the same `sync-tag` is identical.
timeout-minutes:
required: false
type: number
default: 300
description: |
Maximum time (in minutes) the workflow should take to finish
tests-to-include:
required: false
type: string
default: ""
description: |
List of tests to include (empty string implies default list)
env:
GIT_DEFAULT_BRANCH: ${{ github.event.repository.default_branch }}
permissions:
id-token: write
contents: read
jobs:
test:
# Don't run on forked repos or empty test matrix
if: github.repository_owner == 'pytorch' && toJSON(fromJSON(inputs.test-matrix).include) != '[]'
strategy:
matrix: ${{ fromJSON(inputs.test-matrix) }}
fail-fast: false
timeout-minutes: ${{ matrix.mem_leak_check == 'mem_leak_check' && 600 || inputs.timeout-minutes }}
runs-on: ${{ matrix.runner }}
steps:
# [see note: pytorch repo ref]
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
- name: Setup XPU
uses: ./.github/actions/setup-xpu
- name: configure aws credentials
id: aws_creds
uses: aws-actions/configure-aws-credentials@v1.7.0
with:
role-to-assume: arn:aws:iam::308535385114:role/gha_workflow_pytorch_artifacts
aws-region: us-east-1
- name: Login to Amazon ECR
id: login-ecr
uses: aws-actions/amazon-ecr-login@v2
- name: Calculate docker image
id: calculate-docker-image
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
with:
docker-image-name: ${{ inputs.docker-image }}
- name: Pull docker image
uses: pytorch/test-infra/.github/actions/pull-docker-image@main
with:
docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }}
- name: Start monitoring script
id: monitor-script
shell: bash
continue-on-error: true
run: |
python3 -m pip install psutil==5.9.1 nvidia-ml-py==11.525.84
python3 -m tools.stats.monitor > usage_log.txt 2>&1 &
echo "monitor-script-pid=${!}" >> "${GITHUB_OUTPUT}"
- name: Download build artifacts
uses: ./.github/actions/download-build-artifacts
with:
name: ${{ inputs.build-environment }}
- name: Parse ref
id: parse-ref
run: .github/scripts/parse_ref.py
- name: Get workflow job id
id: get-job-id
uses: ./.github/actions/get-workflow-job-id
if: always()
with:
github-token: ${{ secrets.GITHUB_TOKEN }}
- name: Check for keep-going label and re-enabled test issues
# This uses the filter-test-configs action because it conveniently
# checks for labels and re-enabled test issues. It does not actually do
# any filtering. All filtering is done in the build step.
id: keep-going
uses: ./.github/actions/filter-test-configs
with:
github-token: ${{ secrets.GITHUB_TOKEN }}
test-matrix: ${{ inputs.test-matrix }}
job-name: ${{ steps.get-job-id.outputs.job-name }}
- name: Set Test step time
id: test-timeout
shell: bash
env:
JOB_TIMEOUT: ${{ matrix.mem_leak_check == 'mem_leak_check' && 600 || inputs.timeout-minutes }}
run: |
echo "timeout=$((JOB_TIMEOUT-30))" >> "${GITHUB_OUTPUT}"
- name: Test
id: test
env:
BUILD_ENVIRONMENT: ${{ inputs.build-environment }}
PR_NUMBER: ${{ github.event.pull_request.number }}
GITHUB_REPOSITORY: ${{ github.repository }}
GITHUB_WORKFLOW: ${{ github.workflow }}
GITHUB_JOB: ${{ github.job }}
GITHUB_RUN_ID: ${{ github.run_id }}
GITHUB_RUN_NUMBER: ${{ github.run_number }}
GITHUB_RUN_ATTEMPT: ${{ github.run_attempt }}
JOB_ID: ${{ steps.get-job-id.outputs.job-id }}
JOB_NAME: ${{ steps.get-job-id.outputs.job-name }}
BRANCH: ${{ steps.parse-ref.outputs.branch }}
SHA1: ${{ github.event.pull_request.head.sha || github.sha }}
PYTORCH_RETRY_TEST_CASES: 1
PYTORCH_OVERRIDE_FLAKY_SIGNAL: 1
CONTINUE_THROUGH_ERROR: ${{ steps.keep-going.outputs.keep-going }}
TEST_CONFIG: ${{ matrix.config }}
SHARD_NUMBER: ${{ matrix.shard }}
NUM_TEST_SHARDS: ${{ matrix.num_shards }}
REENABLED_ISSUES: ${{ steps.keep-going.outputs.reenabled-issues }}
SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2
DOCKER_IMAGE: ${{ inputs.docker-image }}
XLA_CLANG_CACHE_S3_BUCKET_NAME: ossci-compiler-clang-cache-circleci-xla
PYTORCH_TEST_CUDA_MEM_LEAK_CHECK: ${{ matrix.mem_leak_check && '1' || '0' }}
PYTORCH_TEST_RERUN_DISABLED_TESTS: ${{ matrix.rerun_disabled_tests && '1' || '0' }}
TESTS_TO_INCLUDE: ${{ inputs.tests-to-include }}
timeout-minutes: ${{ fromJson(steps.test-timeout.outputs.timeout) }}
run: |
set -x
TEST_COMMAND=.ci/pytorch/test.sh
# detached container should get cleaned up by teardown_ec2_linux
# Used for GPU_FLAG since that doesn't play nice
# shellcheck disable=SC2086,SC2090
container_name=$(docker run \
${GPU_FLAG:-} \
-e BUILD_ENVIRONMENT \
-e PR_NUMBER \
-e GITHUB_ACTIONS \
-e GITHUB_REPOSITORY \
-e GITHUB_WORKFLOW \
-e GITHUB_JOB \
-e GITHUB_RUN_ID \
-e GITHUB_RUN_NUMBER \
-e GITHUB_RUN_ATTEMPT \
-e JOB_ID \
-e BRANCH \
-e SHA1 \
-e AWS_DEFAULT_REGION \
-e IN_WHEEL_TEST \
-e SHARD_NUMBER \
-e TEST_CONFIG \
-e NUM_TEST_SHARDS \
-e REENABLED_ISSUES \
-e PYTORCH_RETRY_TEST_CASES \
-e PYTORCH_OVERRIDE_FLAKY_SIGNAL \
-e CONTINUE_THROUGH_ERROR \
-e MAX_JOBS="$(nproc --ignore=2)" \
-e SCCACHE_BUCKET \
-e XLA_CLANG_CACHE_S3_BUCKET_NAME \
-e PYTORCH_TEST_CUDA_MEM_LEAK_CHECK \
-e PYTORCH_TEST_RERUN_DISABLED_TESTS \
-e TESTS_TO_INCLUDE \
-e ZE_AFFINITY_MASK \
--env-file="/tmp/github_env_${GITHUB_RUN_ID}" \
--ulimit stack=10485760:83886080 \
--ulimit core=0 \
--security-opt seccomp=unconfined \
--cap-add=SYS_PTRACE \
--shm-size="8g" \
--tty \
--detach \
--name="${container_name}" \
--user jenkins \
--privileged \
-v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \
-w /var/lib/jenkins/workspace \
"${DOCKER_IMAGE}"
)
# save container name for later step
echo "CONTAINER_NAME=${container_name}" >> "$GITHUB_ENV"
# jenkins user does not have write permission to mounted workspace; work-around by copying within container to jenkins home
docker exec -t "${container_name}" sh -c "cd .. && cp -R workspace pytorch && cd pytorch && pip install dist/*.whl && ${TEST_COMMAND}"
- name: Save test results
if: always()
run: |
# copy test results back to the mounted workspace; sudo is needed, and the resulting permissions are correct
docker exec -t "${{ env.CONTAINER_NAME }}" sh -c "cd ../pytorch && sudo cp -R test/test-reports ../workspace/test"
- name: Print remaining test logs
shell: bash
if: always() && steps.test.conclusion
run: |
cat test/**/*_toprint.log || true
- name: Stop monitoring script
if: always() && steps.monitor-script.outputs.monitor-script-pid
shell: bash
continue-on-error: true
env:
MONITOR_SCRIPT_PID: ${{ steps.monitor-script.outputs.monitor-script-pid }}
run: |
kill "$MONITOR_SCRIPT_PID"
- name: Upload test artifacts
uses: ./.github/actions/upload-test-artifacts
if: always() && steps.test.conclusion && steps.test.conclusion != 'skipped'
with:
use-gha: true
file-suffix: ${{ github.job }}-${{ matrix.config }}-${{ matrix.shard }}-${{ matrix.num_shards }}-${{ matrix.runner }}_${{ steps.get-job-id.outputs.job-id }}
- name: Collect backtraces from coredumps (if any)
if: always()
run: |
# shellcheck disable=SC2156
find . -iname "core.[1-9]*" -exec docker exec "${CONTAINER_NAME}" sh -c "gdb python {} -ex 'bt' -ex 'q'" \;
- name: Stop container before exit
if: always()
run: |
# Workaround for multiple runners on same IDC node
docker stop "${{ env.CONTAINER_NAME }}"
- name: Store Core dumps on GitHub
uses: actions/upload-artifact@v3
if: failure()
with:
name: coredumps-${{ matrix.config }}-${{ matrix.shard }}-${{ matrix.num_shards }}-${{ matrix.runner }}
retention-days: 14
if-no-files-found: ignore
path: ./**/core.[1-9]*
- name: Teardown XPU
uses: ./.github/actions/teardown-xpu
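The timeout handling in the `_xpu-test.yml` workflow above is worth noting: the job-level timeout is 600 minutes for mem-leak-check shards and `inputs.timeout-minutes` (default 300) otherwise, and the Test step runs with 30 minutes shaved off so the upload and teardown steps still fit. A small sketch of that arithmetic, with illustrative names only:

```python
# Mirrors the timeout expressions in _xpu-test.yml above; illustrative only.
def effective_timeouts(mem_leak_check: bool, timeout_minutes: int = 300) -> tuple[int, int]:
    # matrix.mem_leak_check == 'mem_leak_check' && 600 || inputs.timeout-minutes
    job_timeout = 600 if mem_leak_check else timeout_minutes
    # The Test step uses JOB_TIMEOUT - 30 so artifact upload and teardown still fit.
    test_timeout = job_timeout - 30
    return job_timeout, test_timeout


assert effective_timeouts(False) == (300, 270)
assert effective_timeouts(True) == (600, 570)
```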

View File

@ -182,7 +182,7 @@ jobs:
strategy:
fail-fast: false
matrix:
py_vers: [ "3.8", "3.9", "3.10", "3.11" ]
py_vers: [ "3.8", "3.9", "3.10", "3.11", "3.12" ]
timeout-minutes: 40
env:
DOCKER_IMAGE: pytorch/conda-builder:cpu

View File

@ -0,0 +1,30 @@
name: Check mergeability and dependencies for ghstack prs
on:
pull_request:
types: [opened, synchronize, reopened, edited]
jobs:
check-regex:
runs-on: ubuntu-latest
outputs:
regex-match: ${{ steps.regex-match.outputs.match }}
steps:
- uses: actions/checkout@v4
- id: regex-match
uses: actions-ecosystem/action-regex-match@d50fd2e7a37d0e617aea3d7ada663bd56862b9cc
with:
text: ${{ github.head_ref }}
regex: '^(gh/[^/]+/[0-9]+/)head$'
pr-dependencies-check:
needs: check-regex
if: ${{ needs.check-regex.outputs.regex-match != '' }}
uses: pytorch/test-infra/.github/workflows/pr-dependencies-check.yml@main
with:
pr_number: ${{ github.event.pull_request.number }}
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }}
cancel-in-progress: true
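The `check-regex` job above only triggers the dependency check for ghstack head branches. A quick sketch of which branch names the regex `^(gh/[^/]+/[0-9]+/)head$` accepts, using Python's `re`; the branch names are hypothetical examples:

```python
import re

GHSTACK_HEAD = re.compile(r"^(gh/[^/]+/[0-9]+/)head$")

# Matches ghstack-style head branches (gh/<username>/<stack number>/head)...
assert GHSTACK_HEAD.match("gh/someuser/123/head")
# ...but not the corresponding base/orig branches or ordinary feature branches.
assert GHSTACK_HEAD.match("gh/someuser/123/base") is None
assert GHSTACK_HEAD.match("my-feature-branch") is None
```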

View File

@ -47,6 +47,7 @@ jobs:
- docker-image-name: pytorch-linux-focal-py3-clang9-android-ndk-r21e
- docker-image-name: pytorch-linux-jammy-py3.8-gcc11
- docker-image-name: pytorch-linux-jammy-py3.8-gcc11-inductor-benchmarks
- docker-image-name: pytorch-linux-jammy-xpu-2024.0-py3
- docker-image-name: pytorch-linux-jammy-py3-clang15-asan
- docker-image-name: pytorch-linux-focal-py3-clang10-onnx
- docker-image-name: pytorch-linux-focal-linter

View File

@ -26,25 +26,42 @@ env:
DOCKER_REGISTRY: ghcr.io
NO_BUILD_SUFFIX: true
USE_BUILDX: 1
WITH_PUSH: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/'))) }}
WITH_PUSH: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || startsWith(github.event.ref, 'refs/tags/v')) }}
jobs:
generate-matrix:
if: github.repository_owner == 'pytorch'
runs-on: [self-hosted, linux.large]
outputs:
matrix: ${{ steps.generate-matrix.outputs.matrix }}
steps:
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
with:
fetch-depth: 1
submodules: true
- name: Get docker release matrix
id: generate-matrix
run: |
MATRIX_BLOB="$(python3 .github/scripts/generate_docker_release_matrix.py)"
echo "${MATRIX_BLOB}"
echo "matrix=${MATRIX_BLOB}" >> "${GITHUB_OUTPUT}"
build:
if: ${{ github.repository == 'pytorch/pytorch' }}
runs-on: [self-hosted, linux.2xlarge]
environment: ${{ (github.ref == 'refs/heads/main' || startsWith(github.event.ref, 'refs/tags/v')) && 'docker-build' || '' }}
environment: ${{ (github.ref == 'refs/heads/nightly' || startsWith(github.event.ref, 'refs/tags/v')) && 'docker-build' || '' }}
timeout-minutes: 240
needs: generate-matrix
strategy:
matrix:
include:
# nvidia specific images don't exist for arm64 so only build the runtime image
- image_type: runtime
platform: linux/arm64,linux/amd64
- image_type: devel
platform: linux/amd64
matrix: ${{ fromJson(needs.generate-matrix.outputs.matrix) }}
fail-fast: false
env:
BUILD_IMAGE_TYPE: ${{ matrix.image_type }}
BUILD_PLATFORMS: ${{ matrix.platform }}
CUDA_VERSION: ${{ matrix.cuda_full_version }}
CUDA_VERSION_SHORT: ${{ matrix.cuda }}
CUDNN_VERSION: ${{ matrix.cudnn_version }}
steps:
- name: Setup SSH (Click me for login details)
uses: pytorch/test-infra/.github/actions/setup-ssh@main
@ -97,10 +114,11 @@ jobs:
- name: Push nightly tags
if: ${{ github.event.ref == 'refs/heads/nightly' && matrix.image_type == 'runtime' }}
run: |
PYTORCH_DOCKER_TAG="${PYTORCH_VERSION}-runtime"
CUDA_VERSION=$(python3 -c "import re;print(re.search('CUDA_VERSION\s+=\s+([0-9\.]+)',open('docker.Makefile').read())[1],end='')")
PYTORCH_DOCKER_TAG="${PYTORCH_VERSION}-cuda${CUDA_VERSION_SHORT}-cudnn${CUDNN_VERSION}-runtime"
PYTORCH_NIGHTLY_COMMIT=$(docker run ghcr.io/pytorch/pytorch-nightly:"${PYTORCH_DOCKER_TAG}" \
python -c 'import torch; print(torch.version.git_version[:7],end="")')
docker tag ghcr.io/pytorch/pytorch-nightly:"${PYTORCH_DOCKER_TAG}" \
ghcr.io/pytorch/pytorch-nightly:"${PYTORCH_NIGHTLY_COMMIT}-cu${CUDA_VERSION}"
docker push ghcr.io/pytorch/pytorch-nightly:"${PYTORCH_NIGHTLY_COMMIT}-cu${CUDA_VERSION}"
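In the docker-release workflow above, the new `generate-matrix` job emits a JSON matrix via `.github/scripts/generate_docker_release_matrix.py`, which the `build` job consumes with `fromJson`, and the nightly runtime tag now embeds the CUDA and cuDNN versions from that matrix. A hypothetical matrix blob with the fields the workflow actually reads (`image_type`, `platform`, `cuda`, `cuda_full_version`, `cudnn_version`); the concrete version numbers below are placeholders, not what the real script produces:

```python
import json

# Hypothetical output shape for generate_docker_release_matrix.py; the real
# script decides the actual entries. Field names come from the workflow's env mapping.
matrix = {
    "include": [
        {
            "image_type": "runtime",
            "platform": "linux/arm64,linux/amd64",
            "cuda": "12.1",
            "cuda_full_version": "12.1.1",
            "cudnn_version": "8",
        },
        {
            "image_type": "devel",
            "platform": "linux/amd64",
            "cuda": "12.1",
            "cuda_full_version": "12.1.1",
            "cudnn_version": "8",
        },
    ]
}

# The workflow writes this single-line JSON blob to GITHUB_OUTPUT as `matrix=...`,
# and a runtime entry like the first one would yield a nightly tag of the form
# pytorch-nightly:<version>-cuda12.1-cudnn8-runtime.
print(json.dumps(matrix))
```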

View File

@ -4,6 +4,7 @@ on:
push:
branches:
- main
- release/*
tags:
- ciflow/inductor/*
workflow_dispatch:
@ -13,6 +14,26 @@ concurrency:
cancel-in-progress: true
jobs:
linux-focal-rocm5_7-py3_8-inductor-build:
name: rocm5.7-py3.8-inductor
uses: ./.github/workflows/_linux-build.yml
with:
build-environment: linux-focal-rocm5.7-py3.8
docker-image-name: pytorch-linux-focal-rocm-n-py3
test-matrix: |
{ include: [
{ config: "inductor", shard: 1, num_shards: 1, runner: "linux.rocm.gpu.2" },
]}
linux-focal-rocm5_7-py3_8-inductor-test:
name: rocm5.7-py3.8-inductor
uses: ./.github/workflows/_rocm-test.yml
needs: linux-focal-rocm5_7-py3_8-inductor-build
with:
build-environment: linux-focal-rocm5.7-py3.8
docker-image: ${{ needs.linux-focal-rocm5_7-py3_8-inductor-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-rocm5_7-py3_8-inductor-build.outputs.test-matrix }}
linux-focal-cuda12_1-py3_10-gcc9-inductor-build:
name: cuda12.1-py3.10-gcc9-sm86
uses: ./.github/workflows/_linux-build.yml

View File

@ -228,8 +228,8 @@ jobs:
pip install torch --pre --index-url https://download.pytorch.org/whl/nightly/cpu/
- name: Run run_test.py (nonretryable)
run: |
# Run test_weak, which is very fast
python3 test/run_test.py --include test_weak --verbose
# Run test_vulkan, which is a fast noop on Linux
python3 test/run_test.py --include test_vulkan --verbose
test_collect_env:
if: ${{ github.repository == 'pytorch/pytorch' }}

View File

@ -28,8 +28,7 @@ jobs:
test-matrix: |
{ include: [
{ config: "mps", shard: 1, num_shards: 1, runner: "macos-m1-12" },
# TODO: Revert me when those runners are back online
# { config: "mps", shard: 1, num_shards: 1, runner: "macos-m1-13" },
{ config: "mps", shard: 1, num_shards: 1, runner: "macos-m2-14" },
]}
macos-12-py3-arm64-mps-test:

View File

@ -12,6 +12,8 @@ on:
push:
tags:
- ciflow/periodic/*
branches:
- release/*
workflow_dispatch:
concurrency:
@ -156,34 +158,6 @@ jobs:
{ config: "default", shard: 1, num_shards: 1, runner: "ubuntu-latest" },
]}
macos-12-py3-x86-64-build:
name: macos-12-py3-x86-64
if: github.event_name != 'schedule' || github.event.schedule == '45 4,12,20 * * 1-5' || github.event.schedule == '45 12 * * 0,6' || github.event.schedule == '29 8 * * *'
uses: ./.github/workflows/_mac-build.yml
with:
build-environment: macos-12-py3-x86-64
xcode-version: "13.3.1"
runner-type: macos-12-xl
build-generates-artifacts: true
sccache-use-gha: true
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 4, runner: "macos-12" },
{ config: "default", shard: 2, num_shards: 4, runner: "macos-12" },
{ config: "default", shard: 3, num_shards: 4, runner: "macos-12" },
{ config: "default", shard: 4, num_shards: 4, runner: "macos-12" },
]}
macos-12-py3-x86-64-test:
name: macos-12-py3-x86-64
uses: ./.github/workflows/_mac-test.yml
needs: macos-12-py3-x86-64-build
with:
build-environment: macos-12-py3-x86-64
test-matrix: ${{ needs.macos-12-py3-x86-64-build.outputs.test-matrix }}
arch: x86_64
android-emulator-build-test:
name: android-emulator-build-test
uses: ./.github/workflows/_run_android_tests.yml

View File

@ -136,8 +136,13 @@ jobs:
{ config: "default", shard: 3, num_shards: 3, runner: "linux.2xlarge" },
{ config: "crossref", shard: 1, num_shards: 2, runner: "linux.2xlarge" },
{ config: "crossref", shard: 2, num_shards: 2, runner: "linux.2xlarge" },
{ config: "dynamo", shard: 1, num_shards: 2, runner: "linux.2xlarge" },
{ config: "dynamo", shard: 2, num_shards: 2, runner: "linux.2xlarge" },
{ config: "dynamo", shard: 1, num_shards: 7, runner: "linux.2xlarge" },
{ config: "dynamo", shard: 2, num_shards: 7, runner: "linux.2xlarge" },
{ config: "dynamo", shard: 3, num_shards: 7, runner: "linux.2xlarge" },
{ config: "dynamo", shard: 4, num_shards: 7, runner: "linux.2xlarge" },
{ config: "dynamo", shard: 5, num_shards: 7, runner: "linux.2xlarge" },
{ config: "dynamo", shard: 6, num_shards: 7, runner: "linux.2xlarge" },
{ config: "dynamo", shard: 7, num_shards: 7, runner: "linux.2xlarge" },
]}
linux-focal-py3_8-clang10-test:
@ -162,8 +167,13 @@ jobs:
{ config: "default", shard: 3, num_shards: 3, runner: "linux.2xlarge" },
{ config: "crossref", shard: 1, num_shards: 2, runner: "linux.2xlarge" },
{ config: "crossref", shard: 2, num_shards: 2, runner: "linux.2xlarge" },
{ config: "dynamo", shard: 1, num_shards: 2, runner: "linux.2xlarge" },
{ config: "dynamo", shard: 2, num_shards: 2, runner: "linux.2xlarge" },
{ config: "dynamo", shard: 1, num_shards: 7, runner: "linux.2xlarge" },
{ config: "dynamo", shard: 2, num_shards: 7, runner: "linux.2xlarge" },
{ config: "dynamo", shard: 3, num_shards: 7, runner: "linux.2xlarge" },
{ config: "dynamo", shard: 4, num_shards: 7, runner: "linux.2xlarge" },
{ config: "dynamo", shard: 5, num_shards: 7, runner: "linux.2xlarge" },
{ config: "dynamo", shard: 6, num_shards: 7, runner: "linux.2xlarge" },
{ config: "dynamo", shard: 7, num_shards: 7, runner: "linux.2xlarge" },
]}
linux-focal-py3_11-clang10-test:
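The two hunks above expand the `dynamo` test config from 2 to 7 shards. Each matrix entry only tells the runner which slice of the test list it owns via `shard`/`num_shards`; a toy round-robin illustration of that contract follows (PyTorch's `run_test.py` actually balances shards by recorded test times, so this is not the real partitioning logic):

```python
# Toy round-robin sharding to illustrate shard/num_shards semantics only.
def shard(tests: list[str], shard_number: int, num_shards: int) -> list[str]:
    assert 1 <= shard_number <= num_shards
    return tests[shard_number - 1 :: num_shards]


tests = [f"test_{i}" for i in range(10)]
assert shard(tests, 1, 7) == ["test_0", "test_7"]
assert shard(tests, 7, 7) == ["test_6"]
```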

View File

@ -25,9 +25,12 @@ jobs:
sync-tag: rocm-build
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 3, runner: "linux.rocm.gpu" },
{ config: "default", shard: 2, num_shards: 3, runner: "linux.rocm.gpu" },
{ config: "default", shard: 3, num_shards: 3, runner: "linux.rocm.gpu" },
{ config: "default", shard: 1, num_shards: 6, runner: "linux.rocm.gpu.2" },
{ config: "default", shard: 2, num_shards: 6, runner: "linux.rocm.gpu.2" },
{ config: "default", shard: 3, num_shards: 6, runner: "linux.rocm.gpu.2" },
{ config: "default", shard: 4, num_shards: 6, runner: "linux.rocm.gpu.2" },
{ config: "default", shard: 5, num_shards: 6, runner: "linux.rocm.gpu.2" },
{ config: "default", shard: 6, num_shards: 6, runner: "linux.rocm.gpu.2" },
]}
linux-focal-rocm5_7-py3_8-test:

View File

@ -10,6 +10,8 @@ on:
push:
tags:
- ciflow/slow/*
branches:
- release/*
workflow_dispatch:
concurrency:

View File

@ -16,11 +16,12 @@ on:
schedule:
# Run hourly.
- cron: 30 * * * *
workflow_dispatch:
jobs:
stale:
if: ${{ github.repository == 'pytorch/pytorch' }}
runs-on: ubuntu-latest
runs-on: linux.large.arc
steps:
- uses: actions/github-script@v6

View File

@ -195,4 +195,4 @@ jobs:
build-environment: linux-focal-rocm5.7-py3.8
docker-image: ${{ needs.linux-focal-rocm5_7-py3_8-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-rocm5_7-py3_8-build.outputs.test-matrix }}
tests-to-include: "test_nn test_torch test_cuda test_ops test_unary_ufuncs test_binary_ufuncs test_autograd"
tests-to-include: "test_nn test_torch test_cuda test_ops test_unary_ufuncs test_binary_ufuncs test_autograd inductor/test_torchinductor"

Some files were not shown because too many files have changed in this diff.